# Baseline for INCPR Educational Challenge on AIcrowd
#### Author : Ayush Shivani

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ayushshivani/aicrowd_educational_baselines/blob/master/INCPR_baseline.ipynb)


## Download Necessary Packages

In [1]:
import sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install scikit-learn



## Download data

In [None]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_incpr/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_incpr/data/public/train.csv


## Import packages

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score,precision_score,recall_score,accuracy_score

## Load the data

In [5]:
train_data_path = "train.csv" #path where data is stored

In [6]:
train_data = pd.read_csv(train_data_path) #load data in dataframe using pandas

In [16]:
train_data # Visualize the data

| Unnamed: 0 | age | workclass | fnlwgt | education | education num | marital status | occupation | relationship | race | sex | capital gain | capital liss | working hours per weel | native country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | 1 |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | 1 |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | 1 |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | 1 |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | 1 |
| 5 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | 1 |
| 6 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | 0 |
| 7 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | 0 |
| 8 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | 0 |
| 9 | 37 | Private | 280464 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 80 | United-States | 0 |


Select the columns you want to train on. Columns with a text data type can also be used, but their values must first be mapped to numbers. The column names `capital liss` and `working hours per weel` are spelled exactly as they appear in the dataset.
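As a rough illustration of mapping text columns to numbers, the sketch below shows two common pandas approaches on a toy frame built from the first four rows displayed above. The frame name `toy` and the new column `sex_code` are made up for this example; the baseline itself does not encode any text columns.

```python
import pandas as pd

# Toy frame with values copied from the first four rows shown above.
toy = pd.DataFrame({
    "age": [50, 38, 53, 28],
    "sex": ["Male", "Male", "Male", "Female"],
    "workclass": ["Self-emp-not-inc", "Private", "Private", "Private"],
})

# Option 1: integer codes, one number per distinct category.
# Categories are sorted alphabetically, so "Female" -> 0 and "Male" -> 1.
toy["sex_code"] = toy["sex"].astype("category").cat.codes

# Option 2: one-hot encoding, one 0/1 column per distinct category.
encoded = pd.get_dummies(toy, columns=["workclass"], dtype=int)

print(encoded)
```

Integer codes keep the frame narrow but impose an arbitrary ordering on the categories, which a linear model may misread. One-hot encoding avoids that at the cost of extra columns. Whichever is used, the same mapping has to be applied to `test.csv` before predicting.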

In [20]:
train_data = train_data[['age','education num','capital gain','capital liss','working hours per weel','income']]

## Split the data into train/validation sets

In [53]:
X_train, X_test= train_test_split(train_data, test_size=0.2, random_state=42) 

Check which column contains the variable to be predicted. Here it is the last column, `income`.

In [54]:
X_train,y_train = X_train.iloc[:,:-1],X_train.iloc[:,-1]
X_test,y_test = X_test.iloc[:,:-1],X_test.iloc[:,-1]
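An alternative to the position-based selection above is to pick the target by name and let `train_test_split` split features and labels in one call. This is only a sketch on a toy frame built from the ten rows displayed earlier; the names `toy`, `X_tr`, `X_val`, `y_tr` and `y_val` are chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_data, using two feature columns and the target
# from the ten rows displayed earlier.
toy = pd.DataFrame({
    "age": [50, 38, 53, 28, 37, 49, 52, 31, 42, 37],
    "education num": [13, 9, 7, 13, 14, 5, 9, 14, 13, 10],
    "income": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Select the target by name so the code does not depend on column order.
X = toy.drop(columns=["income"])
y = toy["income"]

# Passing X and y together returns four aligned pieces in one call.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_tr.shape, X_val.shape)
```

Selecting by name keeps working if columns are later reordered or new feature columns are appended after `income`.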

## Define the classifier

In [56]:
classifier = LogisticRegression(solver = 'lbfgs',multi_class='auto', max_iter=1000)

More parameters can be set. For the full list, see the [`LogisticRegression` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Other classifiers can be used as well. To read more about scikit-learn classifiers, visit the [supervised learning guide](https://scikit-learn.org/stable/supervised_learning.html).
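To give a feel for swapping in other classifiers, here is a minimal sketch that fits an `SVC` and a `RandomForestClassifier` and compares their validation accuracy. It runs on synthetic data from `make_classification`, used here as a stand-in for the five numeric features (an assumption, not the challenge data), and the hyperparameters are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data standing in for the five numeric features (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Every scikit-learn classifier exposes the same fit/predict/score interface,
# so candidates can be compared in a simple loop.
candidates = {
    "svc": SVC(kernel="rbf", gamma="scale"),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

val_accuracy = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_accuracy[name] = model.score(X_val, y_val)  # mean accuracy on validation data

print(val_accuracy)
```

Because the interface is shared, replacing `LogisticRegression` in this notebook only requires changing the line that defines `classifier`.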

## Train the classifier

In [57]:
classifier.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Predict on the validation set

In [58]:
y_pred = classifier.predict(X_test)

## Find the scores 

In [60]:
precision = precision_score(y_test,y_pred,average='micro')
recall = recall_score(y_test,y_pred,average='micro')
accuracy = accuracy_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred,average='macro')

In [61]:
print("Accuracy of the model is :" ,accuracy)
print("Recall of the model is :" ,recall)
print("Precision of the model is :" ,precision)
print("F1 score of the model is :" ,f1)

Accuracy of the model is : 0.8143427518427518
Recall of the model is : 0.8143427518427518
Precision of the model is : 0.8143427518427518
F1 score of the model is : 0.7010332966849817
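Accuracy, recall and precision are identical in the output above. This is expected: for single-label classification, micro-averaged precision and recall pool the counts over all classes and therefore both reduce to accuracy, while the macro-averaged F1 weights each class equally. The sketch below illustrates this on ten hand-made labels; `y_true` and `y_hat` are illustrative and are not the model's predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Ten hand-made labels: the "model" predicts class 1 almost everywhere.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)

# Micro averaging pools true/false positives over both classes.
micro_p = precision_score(y_true, y_hat, average="micro")
micro_r = recall_score(y_true, y_hat, average="micro")

# Macro averaging computes F1 per class, then takes the unweighted mean.
macro_f1 = f1_score(y_true, y_hat, average="macro")

# Binary averaging reports the scores for the positive class (label 1) only.
binary_p = precision_score(y_true, y_hat, average="binary")
binary_r = recall_score(y_true, y_hat, average="binary")

print("accuracy:", acc, "micro precision:", micro_p, "micro recall:", micro_r)
print("macro F1:", macro_f1)
print("binary precision:", binary_p, "binary recall:", binary_r)
```

Here accuracy and both micro scores are 0.7, while macro F1 is 0.6 because the rarely predicted class 0 drags the average down. To see precision and recall as separate quantities, `average='binary'` or `average='macro'` is more informative.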


# Prediction on Evaluation Set

## Load the evaluation data

In [62]:
final_test_path = "test.csv"
final_test = pd.read_csv(final_test_path)
final_test = final_test[['age','education num','capital gain','capital liss','working hours per weel']] 

## Predict on evaluation set

In [63]:
submission = classifier.predict(final_test)

## Save the prediction to csv

In [64]:
submission = pd.DataFrame(submission)
submission.to_csv('submission.csv',header=['income'],index=False)
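Before uploading, it can help to read the file back and confirm it has the shape written above: a single `income` column with one row per prediction. The sketch below does this with a few hypothetical predictions and a temporary directory. The checks reflect only what this notebook writes; the platform's exact format requirements are not stated here, so treat them as assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for classifier.predict(final_test).
predictions = np.array([1, 0, 1, 1, 0])

# Write to a temporary directory so the real submission.csv is not touched.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
pd.DataFrame({"income": predictions}).to_csv(out_path, index=False)

# Read the file back and run a few sanity checks.
check = pd.read_csv(out_path)
assert list(check.columns) == ["income"], "unexpected header"
assert len(check) == len(predictions), "row count does not match predictions"
assert check["income"].isin([0, 1]).all(), "labels other than 0/1 found"

print(check["income"].value_counts())
```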

### Go to the [platform](https://www.aicrowd.com/challenges/icmpr-income-prediction), participate in the challenge, and submit the generated `submission.csv`.