# Income Prediction



## Prepare the dataset

We will use the [Census Income dataset](https://archive.ics.uci.edu/ml/datasets/Adult), which was constructed from the 1994 US Census database. The task is to predict whether an individual's income exceeds $50,000 per year, based on a set of attributes such as age, occupation, and education.

In [None]:
import pandas as pd
import numpy as np
seed = 0
np.random.seed(seed)
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)
data.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
                'class']

data = data.drop(['fnlwgt', 'education-num'], axis=1)

data = data.replace(' ?', np.nan)

data.sample(10, random_state=seed)

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
22278,27,Private,Some-college,Divorced,Adm-clerical,Unmarried,White,Female,0,0,44,United-States,<=50K
8950,27,Private,Bachelors,Never-married,Prof-specialty,Not-in-family,White,Female,0,0,40,United-States,<=50K
7838,25,Private,Assoc-acdm,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,<=50K
16505,46,Private,5th-6th,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,0,1902,40,United-States,<=50K
19140,45,Private,11th,Divorced,Transport-moving,Not-in-family,White,Male,0,2824,76,United-States,>50K
12319,29,Private,Bachelors,Married-civ-spouse,Prof-specialty,Own-child,White,Male,0,0,75,United-States,<=50K
28589,42,Local-gov,HS-grad,Separated,Other-service,Not-in-family,Black,Female,0,0,60,,<=50K
10000,34,Private,Some-college,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,50,United-States,<=50K
28530,60,Local-gov,Assoc-voc,Divorced,Prof-specialty,Unmarried,White,Male,0,0,40,United-States,>50K
24237,19,Private,Some-college,Never-married,Adm-clerical,Own-child,White,Female,0,0,40,United-States,<=50K
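
The raw file separates fields with a comma followed by a space, which is why the missing-value marker replaced above is `' ?'` rather than `'?'`. The following is a minimal sketch, using two made-up rows instead of the real file, of how the `skipinitialspace` option of `pd.read_csv` would strip that leading space. The notebook does not use this option, and its later cells rely on the unstripped values.

```python
import io
import pandas as pd

# Two made-up rows in the same ", "-separated layout (not real records).
sample = "39, Private, Bachelors, <=50K\n50, ?, HS-grad, >50K\n"
cols = ["age", "workclass", "education", "class"]

# Default parsing keeps the space that follows each comma.
raw = pd.read_csv(io.StringIO(sample), header=None, names=cols)

# skipinitialspace=True drops whitespace directly after the delimiter.
stripped = pd.read_csv(io.StringIO(sample), header=None, names=cols,
                       skipinitialspace=True)

print(repr(raw.loc[1, "workclass"]))       # leading space kept
print(repr(stripped.loc[1, "workclass"]))  # leading space removed
```

If the stripped form were used, every later string comparison and one-hot column name would have to drop the leading space as well.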


In [None]:
data.shape

(32561, 13)

In [None]:
from sklearn_pandas import CategoricalImputer
from sklearn.preprocessing import LabelEncoder
print("Number of NaN values")
print(data.isna().sum())
print("\n")
# Using the most frequent strategy for replacing NaN values
categoricalImputer = CategoricalImputer(strategy = 'most_frequent')
data['workclass'] = categoricalImputer.fit_transform(data['workclass'])
data['native-country'] = categoricalImputer.fit_transform(data['native-country'])
data['occupation'] = categoricalImputer.fit_transform(data['occupation'])
print("Number of NaN values after categoricalImputer")
print(data.isna().sum())


Number of NaN values
age                  0
workclass         1836
education            0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
class                0
dtype: int64


Number of NaN values after categoricalImputer
age               0
workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
class             0
dtype: int64
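
`CategoricalImputer` comes from the `sklearn_pandas` package and may not be present in every installed version. As a hedged alternative, the same most-frequent strategy can be sketched with plain pandas. The toy frame below uses made-up values and is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing category.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "Local-gov"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
})

# Replace each column's missing entries with its most frequent value.
for col in ["workclass", "occupation"]:
    most_frequent = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy.isna().sum())
```

`mode()` can return several values when there is a tie, so `.iloc[0]` picks the first one. That tie-breaking rule is a choice made for this sketch.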


In [None]:
# Find features to exclude by checking their correlation with class (income).
# We work on a copy (dataPreprocessing) so that the original data stays unchanged.
# Label-encode the categorical columns.
encoder = LabelEncoder()
dataPreprocessing = data.copy()
categorical_cols = ['class', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
dataPreprocessing[categorical_cols] = dataPreprocessing[categorical_cols].apply(encoder.fit_transform)
dataPreprocessing.sample(10, random_state=seed)



Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
22278,27,3,15,0,0,4,4,0,0,0,44,38,0
8950,27,3,9,4,9,1,4,0,0,0,40,38,0
7838,25,3,7,2,11,0,4,1,0,0,40,38,0
16505,46,3,4,2,13,0,0,1,0,1902,40,38,0
19140,45,3,1,0,13,1,4,1,0,2824,76,38,1
12319,29,3,9,2,9,3,4,1,0,0,75,38,0
28589,42,1,11,5,7,1,2,0,0,0,60,38,0
10000,34,3,15,0,0,1,4,0,0,0,50,38,0
28530,60,1,8,0,9,4,4,1,0,0,40,38,1
24237,19,3,15,4,0,3,4,0,0,0,40,38,0


In [None]:
# Scaling the numerical data using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
dataPreprocessing[['age', 'capital-gain', 'capital-loss', 'hours-per-week']] = scaler.fit_transform(dataPreprocessing[['age', 'capital-gain', 'capital-loss', 'hours-per-week']])
dataPreprocessing.sample(10, random_state=seed)

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
22278,0.136986,3,15,0,0,4,4,0,0.0,0.0,0.438776,38,0
8950,0.136986,3,9,4,9,1,4,0,0.0,0.0,0.397959,38,0
7838,0.109589,3,7,2,11,0,4,1,0.0,0.0,0.397959,38,0
16505,0.39726,3,4,2,13,0,0,1,0.0,0.436639,0.397959,38,0
19140,0.383562,3,1,0,13,1,4,1,0.0,0.648301,0.765306,38,1
12319,0.164384,3,9,2,9,3,4,1,0.0,0.0,0.755102,38,0
28589,0.342466,1,11,5,7,1,2,0,0.0,0.0,0.602041,38,0
10000,0.232877,3,15,0,0,1,4,0,0.0,0.0,0.5,38,0
28530,0.589041,1,8,0,9,4,4,1,0.0,0.0,0.397959,38,1
24237,0.027397,3,15,4,0,3,4,0,0.0,0.0,0.397959,38,0


In [None]:
# Checking correlation with income
print("\n Checking correlation with income \n")
abs(dataPreprocessing.corrwith(dataPreprocessing['class'])).sort_values(ascending=False)[1:]


 Checking correlation with income 



relationship      0.250918
age               0.234037
hours-per-week    0.229689
capital-gain      0.223329
sex               0.215980
marital-status    0.199307
capital-loss      0.150526
education         0.079317
race              0.071846
occupation        0.034625
native-country    0.023058
workclass         0.002693
dtype: float64
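
One caveat when reading these numbers: a label encoder assigns integer codes in an arbitrary (alphabetical) order, so Pearson correlation on the codes can understate how informative a nominal column is. The sketch below uses a made-up three-category column, not the census data, to illustrate a case where the target is fully determined by the category yet the correlation with the codes is zero.

```python
import pandas as pd

# Made-up data: the target is 1 exactly when the category is "b".
toy = pd.DataFrame({"cat": ["a", "b", "c"] * 4})
toy["target"] = (toy["cat"] == "b").astype(int)

# Integer codes in alphabetical order: a=0, b=1, c=2.
codes = toy["cat"].astype("category").cat.codes
label_corr = abs(codes.corr(toy["target"]))

# One indicator column per category, correlated with the target separately.
dummies = pd.get_dummies(toy["cat"]).astype(float)
onehot_corr = dummies.corrwith(toy["target"]).abs().max()

print(label_corr, onehot_corr)
```

For columns such as `workclass` or `native-country`, checking the one-hot indicators (or a measure designed for nominal data) would be a more cautious basis for dropping them than the label-encoded correlation alone.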

In [None]:
# workclass has the lowest correlation with class (0.002693), so we drop it.
# native-country is similarly weak (0.023058), so we drop it as well.
data = data.drop(['workclass', 'native-country'], axis=1)
data.sample(10, random_state=seed)





Unnamed: 0,age,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,class
22278,27,Some-college,Divorced,Adm-clerical,Unmarried,White,Female,0,0,44,<=50K
8950,27,Bachelors,Never-married,Prof-specialty,Not-in-family,White,Female,0,0,40,<=50K
7838,25,Assoc-acdm,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,<=50K
16505,46,5th-6th,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,0,1902,40,<=50K
19140,45,11th,Divorced,Transport-moving,Not-in-family,White,Male,0,2824,76,>50K
12319,29,Bachelors,Married-civ-spouse,Prof-specialty,Own-child,White,Male,0,0,75,<=50K
28589,42,HS-grad,Separated,Other-service,Not-in-family,Black,Female,0,0,60,<=50K
10000,34,Some-college,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,50,<=50K
28530,60,Assoc-voc,Divorced,Prof-specialty,Unmarried,White,Male,0,0,40,>50K
24237,19,Some-college,Never-married,Adm-clerical,Own-child,White,Female,0,0,40,<=50K


In [None]:
# Encoding class and sex with LabelEncoder
encoder = LabelEncoder()
data[['class','sex']] = data[['class','sex']].apply(encoder.fit_transform)
data.sample(10, random_state=seed)

Unnamed: 0,age,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,class
22278,27,Some-college,Divorced,Adm-clerical,Unmarried,White,0,0,0,44,0
8950,27,Bachelors,Never-married,Prof-specialty,Not-in-family,White,0,0,0,40,0
7838,25,Assoc-acdm,Married-civ-spouse,Sales,Husband,White,1,0,0,40,0
16505,46,5th-6th,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,1,0,1902,40,0
19140,45,11th,Divorced,Transport-moving,Not-in-family,White,1,0,2824,76,1
12319,29,Bachelors,Married-civ-spouse,Prof-specialty,Own-child,White,1,0,0,75,0
28589,42,HS-grad,Separated,Other-service,Not-in-family,Black,0,0,0,60,0
10000,34,Some-college,Divorced,Adm-clerical,Not-in-family,White,0,0,0,50,0
28530,60,Assoc-voc,Divorced,Prof-specialty,Unmarried,White,1,0,0,40,1
24237,19,Some-college,Never-married,Adm-clerical,Own-child,White,0,0,0,40,0


In [None]:
# Scaling numerical data using MinMaxScaler
scaler = MinMaxScaler()
data[['age', 'capital-gain', 'capital-loss', 'hours-per-week']] = scaler.fit_transform(data[['age', 'capital-gain', 'capital-loss', 'hours-per-week']])
data.sample(10, random_state=seed)

Unnamed: 0,age,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,class
22278,0.136986,Some-college,Divorced,Adm-clerical,Unmarried,White,0,0.0,0.0,0.438776,0
8950,0.136986,Bachelors,Never-married,Prof-specialty,Not-in-family,White,0,0.0,0.0,0.397959,0
7838,0.109589,Assoc-acdm,Married-civ-spouse,Sales,Husband,White,1,0.0,0.0,0.397959,0
16505,0.39726,5th-6th,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,1,0.0,0.436639,0.397959,0
19140,0.383562,11th,Divorced,Transport-moving,Not-in-family,White,1,0.0,0.648301,0.765306,1
12319,0.164384,Bachelors,Married-civ-spouse,Prof-specialty,Own-child,White,1,0.0,0.0,0.755102,0
28589,0.342466,HS-grad,Separated,Other-service,Not-in-family,Black,0,0.0,0.0,0.602041,0
10000,0.232877,Some-college,Divorced,Adm-clerical,Not-in-family,White,0,0.0,0.0,0.5,0
28530,0.589041,Assoc-voc,Divorced,Prof-specialty,Unmarried,White,1,0.0,0.0,0.397959,1
24237,0.027397,Some-college,Never-married,Adm-clerical,Own-child,White,0,0.0,0.0,0.397959,0


In [None]:
# Removing the duplicate values from data
print("\n Before removing duplicates:", data.duplicated().sum())
data = data[~data.duplicated()]
print("\n After removing duplicates:", data.duplicated().sum())
print("\n Data shape : ", data.shape)


 Before removing duplicates: 5133

 After removing duplicates: 0

 Data shape :  (27428, 11)


In [None]:
# One-hot encode the remaining categorical features into numerical indicator columns
data = pd.get_dummies(data)


In [None]:
encoded = list(data.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

54 total features after one-hot encoding.
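
`pd.get_dummies` names each new column `<original column>_<value>`, so values that carry a leading space produce names such as `education_ 10th`. The sketch below, on a made-up three-row frame, shows the naming and one way to select every feature column without typing the names by hand. The variables `X` and `y` are illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [0.1, 0.5, 0.9],
    "education": [" 10th", " Bachelors", " 10th"],  # leading space, as in the raw values
    "class": [0, 1, 0],
})

# Only the string column is expanded; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# Select every column except the target instead of listing them explicitly.
X = encoded.drop(columns=["class"])
y = encoded["class"]
```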


## Model training



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


# Split the data into train (60%) and test (40%) sets
X_train, X_test, y_train, y_test = train_test_split(data[['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
       'education_ 10th', 'education_ 11th', 'education_ 12th',
       'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th',
       'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc',
       'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad',
       'education_ Masters', 'education_ Preschool', 'education_ Prof-school',
       'education_ Some-college', 'marital-status_ Divorced',
       'marital-status_ Married-AF-spouse',
       'marital-status_ Married-civ-spouse',
       'marital-status_ Married-spouse-absent',
       'marital-status_ Never-married', 'marital-status_ Separated',
       'marital-status_ Widowed', 'occupation_ Adm-clerical',
       'occupation_ Armed-Forces', 'occupation_ Craft-repair',
       'occupation_ Exec-managerial', 'occupation_ Farming-fishing',
       'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct',
       'occupation_ Other-service', 'occupation_ Priv-house-serv',
       'occupation_ Prof-specialty', 'occupation_ Protective-serv',
       'occupation_ Sales', 'occupation_ Tech-support',
       'occupation_ Transport-moving', 'relationship_ Husband',
       'relationship_ Not-in-family', 'relationship_ Other-relative',
       'relationship_ Own-child', 'relationship_ Unmarried',
       'relationship_ Wife', 'race_ Amer-Indian-Eskimo',
       'race_ Asian-Pac-Islander', 'race_ Black', 'race_ Other',
       'race_ White']], 
                                                    data['class'], 
                                                    test_size=0.4, 
                                                    random_state=seed)

# We can now find the optimal hyperparameters for a Random Forest Classifier using GridSearchCV
param_grid = {'bootstrap': [True], 
              'max_depth': list(range(5,10)),
              'max_features': list(range(1,5)),
              'min_samples_leaf': [3, 4, 5],
              'n_estimators' : list(range(10, 201, 19))
             }
rfc = RandomForestClassifier(random_state=seed)
clf = GridSearchCV(rfc, param_grid)
clf.fit(X_train, y_train)
print("The best params for the model are", clf.best_params_)


The best params for the model are {'bootstrap': True, 'max_depth': 9, 'max_features': 4, 'min_samples_leaf': 4, 'n_estimators': 181}
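
To get a feel for the cost of this search, the number of candidate settings can be counted directly from the grid. The sketch below repeats the grid from the cell above. The fold count of 5 is an assumption based on the `GridSearchCV` default in recent scikit-learn versions, since no `cv` argument is passed.

```python
from itertools import product

param_grid = {
    "bootstrap": [True],
    "max_depth": list(range(5, 10)),
    "max_features": list(range(1, 5)),
    "min_samples_leaf": [3, 4, 5],
    "n_estimators": list(range(10, 201, 19)),
}

# Every combination of one value per hyperparameter is one candidate.
n_candidates = len(list(product(*param_grid.values())))
cv_folds = 5  # assumption: default number of folds

print(n_candidates, n_candidates * cv_folds)
```

The selected `max_depth` (9) and `max_features` (4) are both the largest values in their ranges, which suggests that a wider grid might find a better setting. With the default `refit=True`, the fitted search object also exposes the winning model as `clf.best_estimator_`.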


In [None]:
# We initialise the Random Forest Classifier with the optimal hyperparameters that were returned by GridSearchCV
randomForestClassifier = RandomForestClassifier(bootstrap= True, max_depth= 9, max_features=4, min_samples_leaf=4, n_estimators= 181, random_state=seed)
randomForestClassifier.fit(X_train, y_train)

RandomForestClassifier(max_depth=9, max_features=4, min_samples_leaf=4,
                       n_estimators=181, random_state=0)

In [None]:
# We can also compute the feature importances and drop unimportant features to speed up training.
feat_imp = pd.DataFrame(zip(X_train.columns.tolist(), randomForestClassifier.feature_importances_ * 100), columns=['feature', 'importance'])
feat_imp

Unnamed: 0,feature,importance
0,age,6.180114
1,sex,2.427851
2,capital-gain,20.691384
3,capital-loss,4.859182
4,hours-per-week,4.534605
5,education_ 10th,0.258531
6,education_ 11th,0.334049
7,education_ 12th,0.025507
8,education_ 1st-4th,0.046661
9,education_ 5th-6th,0.10747
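
Because each categorical column is spread over several one-hot columns, comparing original features is easier after summing the importances per source column. The sketch below shows one way to do that. The importance values are hypothetical placeholders, not the notebook's results.

```python
import pandas as pd

# Hypothetical importances (percent) for a few columns; not the notebook's values.
feat_imp = pd.DataFrame({
    "feature": ["age", "race_ White", "race_ Black",
                "education_ Bachelors", "education_ HS-grad"],
    "importance": [6.0, 0.4, 0.3, 2.5, 1.5],
})

# "race_ White" -> "race"; names without "_" (such as "age") stay unchanged.
feat_imp["group"] = feat_imp["feature"].str.split("_").str[0]

# Total importance per original column, smallest first.
by_group = feat_imp.groupby("group")["importance"].sum().sort_values()
print(by_group)
```

This grouping relies on the original column names containing no underscore, which holds for the hyphenated names used in this dataset.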


In [None]:
# Since race had the least importance, we drop its one-hot columns to make training faster
race_cols = ['race_ White', 'race_ Black', 'race_ Other', 'race_ Asian-Pac-Islander', 'race_ Amer-Indian-Eskimo']
X_train = X_train.drop(race_cols, axis=1)
X_test = X_test.drop(race_cols, axis=1)
print(len(X_train.columns))

48


In [None]:
# We now perform k-fold cross-validation (k = 5) on the training data and print the accuracy of each fold.
# Note: min_samples_leaf=3 and the default n_estimators are used here, which differ from the grid-search result (4 and 181).
randomForestClassifier = RandomForestClassifier(bootstrap=True, max_depth=9, max_features=4, min_samples_leaf=3, random_state=seed)
k = 5
# Attach the labels so that features and target are shuffled and split together
X_train['class'] = y_train
folds = np.array_split(X_train.sample(frac=1, random_state=seed), indices_or_sections=k)
for i in range(len(folds)):
  print("Number of samples in fold " + str(i) + " is " + str(len(folds[i])))
  print("Number of samples in each class in fold "+ str(i) + " is \n" + str(folds[i]['class'].value_counts()))
  print("*"*20)
accuracies = []
for i in range(len(folds)):
  # Fold i is held out; columns 0-47 are the features and column 48 is the label
  x_test = folds[i].iloc[:, 0:48]
  y_test_fold = folds[i].iloc[:, 48]
  trainData = []
  for j in range(len(folds)):
    if j != i:
      trainData.append(folds[j])
  trainDataFrame = pd.concat(trainData)
  randomForestClassifier.fit(trainDataFrame.iloc[:, 0:48], trainDataFrame.iloc[:,48])
  y_pred = randomForestClassifier.predict(x_test)
  accuracies.append(round(accuracy_score(y_test_fold, y_pred),2))
print("The accuracies across the folds are", accuracies)

Number of samples in fold 0 is 3292
Number of samples in each class in fold 0 is 
0    2490
1     802
Name: class, dtype: int64
********************
Number of samples in fold 1 is 3291
Number of samples in each class in fold 1 is 
0    2491
1     800
Name: class, dtype: int64
********************
Number of samples in fold 2 is 3291
Number of samples in each class in fold 2 is 
0    2493
1     798
Name: class, dtype: int64
********************
Number of samples in fold 3 is 3291
Number of samples in each class in fold 3 is 
0    2482
1     809
Name: class, dtype: int64
********************
Number of samples in fold 4 is 3291
Number of samples in each class in fold 4 is 
0    2505
1     786
Name: class, dtype: int64
********************
The accuracies across the folds are [0.85, 0.85, 0.86, 0.85, 0.85]
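
The manual loop above can also be expressed with scikit-learn's built-in helpers. The following is a minimal sketch on synthetic stand-in data (generated with `make_classification`, not the census data) using `StratifiedKFold` and `cross_val_score`. The model settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; the notebook would pass its own training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(scores.round(2))
```

`cross_val_score` fits clones of the estimator, so `model` itself stays unfitted and would need a final `fit` on the full training set. In the manual loop, by contrast, the classifier that remains after the last iteration has been fitted on four of the five folds only.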


## Evaluation

We will evaluate the model on the following metrics using the test set: 
- Overall accuracy
- Precision
- Recall
- F1 score

In [None]:
from sklearn.metrics import classification_report

# we can now use the model to predict on the test set and print the accuracy, precision, recall, f1-score and support
y_pred = randomForestClassifier.predict(X_test)
print("Accuracy using the test data is ",round(accuracy_score(y_test, y_pred), 2))
print(classification_report(y_test, y_pred))

Accuracy using the test data is  0.85
              precision    recall  f1-score   support

           0       0.85      0.97      0.91      8261
           1       0.86      0.47      0.60      2711

    accuracy                           0.85     10972
   macro avg       0.85      0.72      0.76     10972
weighted avg       0.85      0.85      0.83     10972
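
To put these numbers in context, it helps to compare the accuracy with a majority-class baseline and to see how the F1 score follows from precision and recall. The sketch below uses only the support counts and the rounded precision and recall printed in the report above. The helper `f1` is defined here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Support values taken from the report above.
support_0, support_1 = 8261, 2711

# Accuracy of a model that always predicts class 0 (<=50K).
baseline = support_0 / (support_0 + support_1)

print(round(baseline, 3))
print(round(f1(0.85, 0.97), 2))
```

Always predicting class 0 would already reach about 75% accuracy on this test set, so the model's 0.85 is a gain of roughly ten points. The recall of 0.47 for class 1 means that more than half of the individuals earning over $50K are missed. F1 values recomputed from the rounded precision and recall can differ from the report in the last digit.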

