# The Income Prediction Problem

### Students
Dmitry Timerbaev, Daria Dobrego, Veronika Nikiforova, Vicente Tanco Aguas

## Setup of the problem

In this assignment you are asked to work with the results of a large survey of adults in order to build a binary classifier that predicts each person's income level. You will need to analyze and optimize the structure of the original dataset, apply a clustering technique, and use grid search and cross-validation to tune the chosen classifier.

The data is given in **adult.csv**. Variable **income** is the outcome variable.

A description of the data is available at this URL: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names


In [1]:
# load traditional libraries
import numpy as np
import pandas as pd

# for hierarchical cluster analysis
from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree

In [16]:
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

In [3]:
# if necessary, install xgboost and import the XGBClassifier
from xgboost import XGBClassifier

## Task 1 (1 point). Dataset optimization and descriptive analysis

- Analyze the structure and features of the dataset.
- Optimize the dataset (correct features, deal with NAs, etc.).
- Check whether the classes are balanced.
- Give brief comments.


In [4]:
# read the data
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

df = pd.read_csv('adult.csv', header=None, names=columns, na_values=' ?')
df.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [5]:
### optimize the dataset

# deal with variable "education"
# 'education' and 'education-num' carry the same information (a label and its ordinal code),
# so we drop 'education' to avoid keeping a redundant feature
df.drop('education', axis=1, inplace=True)

# recode income as binary - ' <=50K': 0, ' >50K': 1
df['income'] = pd.Categorical(df['income']).codes

# deal with NA
# columns 'workclass', 'occupation' and 'native-country' contain NA values;
# drop the affected rows and check that no NA values remain
df.dropna(axis=0, inplace=True)
df.isna().any()

age               False
workclass         False
fnlwgt            False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
income            False
dtype: bool
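
The comment in the cell above states that `education` and `education-num` contain the same information. A minimal sketch of how that could be verified before dropping a column is shown below. It uses a small hand-made frame (`toy`, with illustrative values) rather than **adult.csv**, and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the two columns of adult.csv (illustrative values only)
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "11th", "HS-grad"],
    "education-num": [13, 9, 13, 7, 9],
})

# Number of distinct numeric codes per label, and distinct labels per code
codes_per_label = toy.groupby("education")["education-num"].nunique()
labels_per_code = toy.groupby("education-num")["education"].nunique()

# The mapping is one-to-one when every group holds exactly one distinct value
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

If the same check holds on the real data, either column can be dropped without losing information.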

In [6]:
# check the balance of classes
df['income'].value_counts(normalize=True)

0    0.751078
1    0.248922
Name: income, dtype: float64

In [7]:
# preview of the data
df.head()

| | age | workclass | fnlwgt | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | 0 |
| 2 | 38 | Private | 215646 | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | 0 |
| 3 | 53 | Private | 234721 | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | 0 |
| 4 | 28 | Private | 338409 | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | 0 |


In [8]:
# one-hot encoding of nominal features
X = pd.get_dummies(df).drop('income', axis=1)

# separate the outcome variable
y = df['income']
print(y.value_counts())

0    22654
1     7508
Name: income, dtype: int64


**Give your comments here**

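The classes are imbalanced: the negative class (income <=50K) makes up about 75% of the observations (22,654 of 30,162) and the positive class about 25% (7,508). A classifier can therefore reach roughly 75% accuracy by always predicting the negative class, so accuracy alone is misleading, and recall on the positive class is likely to suffer.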
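
One common way to counter the imbalance noted above is to reweight the classes inversely to their frequency. The sketch below is not part of the original solution. It uses synthetic data with the same 75/25 split, and names such as `X_toy` and `y_toy` are hypothetical. It shows the weights that scikit-learn derives for `class_weight='balanced'` and how the option is passed to `SGDClassifier`. Whether this improves recall on the real data would need to be checked.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey: 750 negatives and 250 positives (75% / 25%)
n_neg, n_pos = 750, 250
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(n_neg, 2)),
    rng.normal(1.0, 1.0, size=(n_pos, 2)),
])
y_toy = np.array([0] * n_neg + [1] * n_pos)

# 'balanced' weights are n_samples / (n_classes * class_count)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))

# The same weighting applied inside the classifier
weighted_model = SGDClassifier(class_weight="balanced", random_state=0)
weighted_model.fit(X_toy, y_toy)
```

With this weighting, each error on a positive example costs three times as much as an error on a negative one in the toy data.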

## Task 2 (1 point). Classification on the whole dataset

- Run the chosen classifier on the whole dataset with the **default** values of parameters.
- Using the **classification_report** function, evaluate the results of classification.
- Give brief comments.


In [9]:
# run the classifier (with default parameters) on the whole sample
model = SGDClassifier()
model.fit(X, y)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [10]:
# make predictions on the full (training) sample
y_pred_fullsample = model.predict(X)

In [11]:
# make classification report
print(classification_report(y, y_pred_fullsample))

              precision    recall  f1-score   support

           0       0.80      0.95      0.87     22654
           1       0.66      0.28      0.39      7508

    accuracy                           0.79     30162
   macro avg       0.73      0.62      0.63     30162
weighted avg       0.77      0.79      0.75     30162



**Give your comments here**

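The classification report shows high recall for the negative class (0.95) but very low recall for the positive class (0.28), so the model misses most high-income respondents; the class imbalance is a likely cause. Overall accuracy (0.79) is only slightly above the 0.75 obtained by always predicting the majority class. Note that these figures are computed on the training data, so they are optimistic estimates.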
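
Because the report above is computed on the same rows the model was trained on, a held-out evaluation gives a more honest picture. The sketch below is an illustration on synthetic data, not the assignment's required procedure. It also standardizes the features first, since linear models trained with SGD are sensitive to feature scale and the real dataset mixes columns of very different magnitudes (for example `fnlwgt` and `education-num`). All variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales, loosely mimicking the survey columns
X_toy = rng.normal(size=(400, 3)) * np.array([1.0, 100.0, 10000.0])
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 100.0 > 0.5).astype(int)

# Hold out 25% of the rows, keeping the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Scale, then fit the default SGD classifier; the scaler is fitted on the training part only
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
clf.fit(X_train, y_train)

report = classification_report(y_test, clf.predict(X_test))
print(report)
```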

## Task 3 (1 point). Cluster analysis

- Run a hierarchical cluster analysis on the full dataset (exclude the outcome variable).
- Identify **3 clusters**. Choose the biggest cluster for further analysis.
- Study the proportion of classes in the chosen cluster. Give comments.


In [12]:
# exclude the outcome variable
X_hca = df.drop('income', axis=1)
X_hca.head()

| | age | workclass | fnlwgt | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |


In [13]:
# make codes for categorical features
for field in ['workclass', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']:
    X_hca[field] = pd.Categorical(X_hca[field]).codes

In [23]:
# run the hierarchical cluster analysis
# setting up hca
hca = linkage(X_hca, method='ward', metric='euclidean')
cut = cut_tree(hca, n_clusters=3, height=None)

In [36]:
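# identify which cluster is the biggest
# cut_tree returns an (n, 1) array, so flatten it before using it as a row mask
labels = cut.ravel()
cluster1 = df[labels == 0]
cluster2 = df[labels == 1]
cluster3 = df[labels == 2]
cluster2.shape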

(14614, 14)

**Give your comments here**

The second cluster is the biggest (14,614 of 30,162 rows); we determined this by comparing the shapes of the three clusters.
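
Comparing cluster shapes one by one can be replaced by a single count of the labels, and the same labels can be used to study the class proportions that the task asks about. The following is a minimal sketch on synthetic blobs of known sizes. The frame `toy`, its columns, and the randomly assigned `income` values are assumptions made for illustration, not results from the survey.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs with 20, 50 and 30 points
features = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(10.0, 0.5, size=(50, 2)),
    rng.normal(20.0, 0.5, size=(30, 2)),
])
toy = pd.DataFrame(features, columns=["f1", "f2"])
toy["income"] = rng.integers(0, 2, size=len(toy))  # random placeholder outcome

# Ward linkage on the features only, then cut the tree into 3 clusters
Z = linkage(toy[["f1", "f2"]].to_numpy(), method="ward")
labels = cut_tree(Z, n_clusters=3).ravel()  # flatten the (n, 1) result

# Cluster sizes in one call, and the label of the largest cluster
sizes = pd.Series(labels).value_counts()
biggest = sizes.idxmax()

# Class proportions inside the largest cluster
biggest_cluster = toy[labels == biggest]
print(sizes)
print(biggest_cluster["income"].value_counts(normalize=True))
```

On the real data, the same two `value_counts` calls applied to `cut` and to the chosen cluster's `income` column would show whether the cluster is more or less balanced than the full sample.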

## Task 4 (1 point). Classification on 1 cluster

- Run the chosen classifier with the **default** parameters on the chosen cluster.
- Using the **classification_report** function, evaluate the results of classification.
- Compare the results of classification on the full sample and on the chosen cluster. Give comments.

In [38]:
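# run your calculations here
# restrict the encoded features and the outcome to the rows of the chosen cluster
X1, y1 = X.loc[cluster2.index], y.loc[cluster2.index]
# fit the default classifier on the cluster only and evaluate it on the same rows
model_cluster = SGDClassifier().fit(X1, y1)
cluster_predict = model_cluster.predict(X1)
print(classification_report(y1, cluster_predict))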

**Give your comments here**

## Task 5 (1 point). Classification on 1 cluster

- To run grid-search, use a **3-fold cross-validation** and **3 important parameters** (including *learning_rate*) with 2 different values of each; use **cv_results_** to get the results (the older **grid_scores_** attribute was removed in scikit-learn 0.20).
- Optimize the parameters of the chosen classifier on the full sample; compare the results with the non-optimized classification.
- Optimize the parameters of the chosen classifier on the chosen cluster; compare the results with the non-optimized classification.
- Give comments on the results and the best combinations of parameters for both cases.
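
The grid search described in the list above could be set up roughly as follows. This is a hedged sketch on synthetic data, not the assignment's solution. The choice of `alpha` and `penalty` as the two parameters besides `learning_rate`, the candidate values, the `f1` scoring, and the names `X_toy` and `y_toy` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic binary problem standing in for the encoded survey data
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

# Three parameters with two values each: 2 * 2 * 2 = 8 combinations
param_grid = {
    "learning_rate": ["optimal", "adaptive"],
    "alpha": [1e-4, 1e-2],
    "penalty": ["l2", "l1"],
}

# eta0 must be positive for the 'adaptive' schedule; 'optimal' ignores it
base_model = SGDClassifier(eta0=0.01, random_state=0)
search = GridSearchCV(base_model, param_grid, cv=3, scoring="f1")
search.fit(X_toy, y_toy)

# cv_results_ holds the per-combination scores across the 3 folds
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print(search.best_params_)
```

Running the same search once on the full sample and once on the chosen cluster, then comparing `best_params_` and the classification reports of `best_estimator_`, would cover both optimization bullets.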


In [None]:
# run your calculations here


**Give your comments here**