In [1]:
import pandas as pd

In [28]:
data = pd.read_csv("data.csv")

In [29]:
data.head()

Unnamed: 0,id,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,over_50k
0,12106,32,Private,HS-grad,9,Divorced,Adm-clerical,Other-relative,W hite,Female,0,0,40,United-States,<=50K
1,28951,43,State-gov,Some-college,10,Divorced,Adm-clerical,Unmarried,W hite,Female,0,0,40,United-States,<=50K
2,24570,35,Private,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
3,16358,31,Private,,14,Never-married,Prof-specialty,Not-in-family,Black,Male,0,0,40,United-States,<=50K
4,9375,64,Private,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,10566,0,35,United-States,<=50K


ID column: id

Target: over_50k

**Handle missing values: we'll look for the columns that have missing values.**

In [30]:
data.isna().sum()

id                   0
age                  0
workclass            0
education         3024
education_num        0
marital_status       0
occupation           0
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country       0
over_50k          3271
dtype: int64
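
With many columns it can be easier to list only the ones that actually contain gaps. A minimal sketch on a toy frame (the `toy` data and the name `cols_with_missing` are illustrative, not taken from data.csv):

```python
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "age": [32, 43, 35, 31],
    "education": ["HS-grad", None, "HS-grad", None],
    "over_50k": ["<=50K", "<=50K", ">50K", None],
})

# Count missing values per column, then keep only columns with at least one.
missing = toy.isna().sum()
cols_with_missing = missing[missing > 0]
print(cols_with_missing)
```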

**Columns to drop: "id" (not useful as a feature) and "education" (it has missing values, while the corresponding column "education_num" has none, and we see a clear mapping between the two).**
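
The claimed mapping between the two columns can be checked before dropping anything. The sketch below is one possible check, run on a toy frame rather than the real data; `codes_per_label`, `labels_per_code` and `is_one_to_one` are names introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "education": ["HS-grad", "Some-college", "HS-grad", None, "Some-college"],
    "education_num": [9, 10, 9, 14, 10],
})

# How many distinct education_num values does each education label map to?
# (groupby skips rows whose key is missing.)
codes_per_label = toy.groupby("education")["education_num"].nunique()

# And the reverse: how many distinct labels does each education_num map to?
# nunique ignores missing labels, so a code seen only with NaN counts as 0.
labels_per_code = toy.groupby("education_num")["education"].nunique()

is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code <= 1).all())
print(is_one_to_one)
```

If the check holds on the real data, "education" carries no information beyond "education_num" and can be dropped safely.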

In [31]:
data.drop(["id", "education"], axis = 1, inplace = True) 

In [32]:
data

Unnamed: 0,age,workclass,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,over_50k
0,32,Private,9,Divorced,Adm-clerical,Other-relative,W hite,Female,0,0,40,United-States,<=50K
1,43,State-gov,10,Divorced,Adm-clerical,Unmarried,W hite,Female,0,0,40,United-States,<=50K
2,35,Private,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
3,31,Private,14,Never-married,Prof-specialty,Not-in-family,Black,Male,0,0,40,United-States,<=50K
4,64,Private,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,10566,0,35,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
30200,56,Private,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,4508,0,40,United-States,<=50K
30201,58,Private,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
30202,39,Private,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K
30203,40,State-gov,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,38,United-States,


**Now we'll encode all string columns as numerical values, since the model cannot work with raw strings.**

In [34]:
from sklearn import preprocessing

# LabelEncoder maps each distinct string label to an integer code.
label_encoder = preprocessing.LabelEncoder()

In [39]:
# Encode each categorical column in turn. fit_transform refits the encoder
# every time, so label_encoder only remembers the mapping of the last column.
categorical_cols = ["workclass", "marital_status", "occupation", "relationship",
                    "sex", "race", "native_country"]
for col in categorical_cols:
    data[col] = label_encoder.fit_transform(data[col])
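
Because a single encoder is refitted for every column, the integer codes cannot be translated back to strings afterwards. One option, sketched here on toy data, is to keep one fitted encoder per column; the `encoders` dictionary is a name introduced for this example. Note also that scikit-learn documents `LabelEncoder` as intended for target labels, so this is a sketch rather than a recommended feature-encoding recipe.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "sex": ["Female", "Female", "Male"],
})

# Keep a separate fitted encoder for each column so codes can be decoded later.
encoders = {}
for col in ["workclass", "sex"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Decode the integer codes back to the original strings.
decoded = encoders["workclass"].inverse_transform(toy["workclass"])
print(list(decoded))
```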

In [40]:
data.head()

Unnamed: 0,age,workclass,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,over_50k
0,32,4,9,0,1,2,4,0,0,0,40,39,<=50K
1,43,7,10,0,1,4,4,0,0,0,40,39,<=50K
2,35,4,9,2,4,5,5,0,0,0,40,39,>50K
3,31,4,14,4,10,1,2,1,0,0,40,39,<=50K
4,64,4,10,2,4,5,5,0,10566,0,35,39,<=50K
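
In the encoded output, the first two rows (race shown as "W hite" in the raw data) get code 4, while the rows shown as "White" get code 5, so the encoder treats them as two separate categories. If that is a data-entry artefact rather than an intended distinction (an assumption, the data does not say), one option is to normalise the strings before encoding. A sketch on a toy Series:

```python
import pandas as pd

# Toy values mirroring the two spellings seen in the raw data.
race = pd.Series(["W hite", "W hite", "White", "Black", "White"])

# Inspect the distinct spellings first.
print(race.value_counts())

# Remove all whitespace inside the labels. This would also change legitimate
# multi-word values, so it is only appropriate when no such values exist.
race_clean = race.str.replace(r"\s+", "", regex=True)
print(race_clean.value_counts())
```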


**So we'll divide the data into a training set and a test set:**
1. Training set: rows that have a target label in "over_50k". We'll train the model on this data and check its accuracy.
2. Test set: rows that do not have a target label in "over_50k". We'll make predictions on this data.

In [48]:
train = data[data["over_50k"].notnull()]
test = data[data["over_50k"].isna()]

In [49]:
test.drop(["over_50k"], axis = 1, inplace = True) 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
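
This warning appears because `train` and `test` are created by boolean indexing on `data` and then modified in place. A common way to avoid it is to take explicit copies when splitting. The sketch below uses a toy frame; the `_demo` names are introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "age": [32, 43, 40],
    "over_50k": ["<=50K", ">50K", None],
})

# .copy() makes each subset an independent DataFrame rather than a slice of toy.
train_demo = toy[toy["over_50k"].notna()].copy()

# drop() without inplace returns a new DataFrame; copy it to be explicit.
test_demo = toy[toy["over_50k"].isna()].drop(columns=["over_50k"]).copy()

# Adding a column to the copy does not touch the original frame.
test_demo["Predictions"] = "<=50K"
```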


**Again, we need to encode the target, but this time we'll use the replace method:**

In [54]:
train["over_50k"].unique()

array(['<=50K', '>50K'], dtype=object)

In [55]:
cleanup_nums = {"over_50k": {"<=50K": 0, ">50K":1 }}

In [58]:
train.replace(cleanup_nums, inplace=True)
train.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  regex=regex,


Unnamed: 0,age,workclass,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,over_50k
0,32,4,9,0,1,2,4,0,0,0,40,39,0
1,43,7,10,0,1,4,4,0,0,0,40,39,0
2,35,4,9,2,4,5,5,0,0,0,40,39,1
3,31,4,14,4,10,1,2,1,0,0,40,39,0
4,64,4,10,2,4,5,5,0,10566,0,35,39,0


**This is our final training set on which we'll train the model. We'll also save a copy of this.**

In [60]:
train.to_csv("training_data.csv", index=False)

**We'll also save a copy of the test data.**

In [62]:
test.to_csv("test.csv", index=False)

In [68]:
y = train.over_50k
X = train.drop('over_50k',axis=1)

**Now for model fitting: we'll split the labelled training set into train and test subsets. The train subset is used for fitting the model and the test subset for checking accuracy.**

In [69]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
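
The convergence message suggests either raising `max_iter` or scaling the features. A minimal sketch of both, using synthetic data from `make_classification` rather than the census data; the `max_iter=1000` value and all `_demo` names are assumptions made for this example, and the resulting score says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features.
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=0)
# Inflate one column to mimic a large-valued feature such as capital_gain.
X_demo[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Standardise the features, then fit logistic regression with a higher
# iteration limit. The pipeline applies the same scaling at predict time.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

demo_accuracy = scaled_logreg.score(X_te, y_te)
print(f"Accuracy on the synthetic hold-out set: {demo_accuracy:.2f}")
```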

In [70]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.86
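
Accuracy alone does not show which class the errors fall on. The `y_pred` array computed above could be compared against `y_test` with a confusion matrix. A sketch with small hand-written label lists (illustrative values, not results from this notebook):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration: 0 is "<=50K", 1 is ">50K".
y_true_demo = [0, 0, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(f"Accuracy: {acc:.2f}")
```

Here one of the two true ">50K" rows is predicted as "<=50K", which the single accuracy figure hides.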


**To make predictions with this model, we'll use the test data that initially had no target value.**

In [73]:
pred = logreg.predict(test)

In [75]:
test['Predictions'] = pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


**We'll map the 0/1 classes back to their original labels.**

In [76]:
reverse_nums = {"Predictions": {0:"<=50K", 1:">50K" }}

In [77]:
test.replace(reverse_nums, inplace=True)
test.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  regex=regex,


Unnamed: 0,age,workclass,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,Predictions
33,34,1,13,2,11,0,5,1,0,1902,48,39,<=50K
60,41,4,13,2,4,0,5,1,0,0,50,39,<=50K
70,31,0,13,2,0,5,5,0,0,0,25,39,<=50K
71,56,4,13,2,10,0,5,1,0,0,40,39,<=50K
79,52,6,13,2,4,0,5,1,0,0,50,39,<=50K
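
An alternative to maintaining two hand-written dictionaries is to derive the reverse mapping from the forward one, so the two cannot drift apart. A small sketch; `label_to_code`, `code_to_label` and `pred_demo` are names introduced for this example.

```python
import pandas as pd

# Forward mapping used when encoding the target.
label_to_code = {"<=50K": 0, ">50K": 1}

# Reverse mapping built from the forward one.
code_to_label = {code: label for label, code in label_to_code.items()}

# Toy predictions standing in for the model output.
pred_demo = pd.Series([0, 1, 0])
labels = pred_demo.map(code_to_label)
print(labels.tolist())
```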


In [78]:
test.to_csv("Predictions.csv", index=False)