# IGA-04. The Income Prediction Problem

### Students


## Setup of the problem

In this assignment you will work with the results of a large survey of adults to build a binary classifier that predicts a person's income level (at most 50K or above 50K). You will need to analyze and optimize the structure of the original dataset, apply a clustering technique, and use grid search with cross-validation to tune the chosen classifier.

The data is given in **adult.csv**. The variable **income** is the outcome variable.

The data description is available at: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names


In [7]:
# load traditional libraries
import numpy as np
import pandas as pd

# for hierarchical cluster analysis
from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree

In [8]:
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
from sklearn.model_selection import GridSearchCV

In [9]:
# if necessary, install xgboost and import the XGBClassifier
from xgboost import XGBClassifier

## Task 1 (1 point). Dataset optimization and descriptive analysis

- Analyze the structure and features of the dataset.
- Optimize the dataset (correct features, deal with NAs, etc.).
- Check whether the classes are balanced.
- Give brief comments.
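The steps listed above can be sketched on a handful of made-up rows that imitate the layout of **adult.csv**. This is only an illustration under assumptions: the toy values are invented, and dropping `education` in favour of `education-num` is one possible way to "deal with" that variable, not necessarily the expected one.

```python
import numpy as np
import pandas as pd

# Toy rows imitating the layout of adult.csv (values are made up for illustration).
toy = pd.DataFrame({
    'age': [39, 50, 38, 53, 28],
    'workclass': [' State-gov', ' Private', np.nan, ' Private', ' Private'],
    'education': [' Bachelors', ' Bachelors', ' HS-grad', ' 11th', ' Bachelors'],
    'education-num': [13, 13, 9, 7, 13],
    'income': [' <=50K', ' >50K', ' <=50K', ' <=50K', ' >50K'],
})

# 'education' repeats the information in the numeric 'education-num',
# so one of the two can be dropped (an assumption about the intended cleanup).
toy = toy.drop(columns='education')

# Recode the outcome as 0/1; stripping whitespace guards against the leading space.
toy['income'] = toy['income'].str.strip().map({'<=50K': 0, '>50K': 1})

# dropna returns a new frame, so the result has to be assigned back.
toy = toy.dropna().reset_index(drop=True)

# Share of each class among the remaining rows.
balance = toy['income'].value_counts(normalize=True)
print(balance)
```

On the real data the same calls would be applied to `df`; the class shares then indicate whether plain accuracy is a meaningful metric.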


In [10]:
# read the data
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

df = pd.read_csv('adult.csv', header=None, names=columns, na_values=' ?')

In [6]:
### optimize the dataset

# deal with variable "education"


# recode income as binary - ' <=50K': 0, ' >50K': 1


# deal with NA (dropna returns a new frame, so assign the result back)
df = df.dropna()

|       | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|-------|-----|--------|---------------|--------------|--------------|----------------|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean  | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std   | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min   | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25%   | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50%   | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75%   | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max   | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [None]:
# check the balance of classes
df['income'].value_counts(normalize=True)

In [None]:
# preview of the data
df.head()

In [None]:
# one-hot encoding of nominal features (the outcome is excluded before encoding)
X = pd.get_dummies(df.drop('income', axis=1))

# separate the outcome variable
y = df['income']
print(y.value_counts())
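To make the effect of one-hot encoding concrete, here is a minimal sketch on a three-row toy frame (the rows are invented; only the column names follow the dataset). Numeric columns pass through unchanged, while each level of a nominal column becomes its own indicator column.

```python
import pandas as pd

# Toy frame: one numeric feature, one nominal feature, and a 0/1 outcome.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'sex': [' Male', ' Female', ' Male'],
    'income': [0, 1, 0],
})

# Encode the features only; the outcome is kept apart as the target vector.
X_toy = pd.get_dummies(toy.drop(columns='income'))
y_toy = toy['income']

print(list(X_toy.columns))
```

The indicator columns are named `<feature>_<level>`, so the leading space of the raw values survives in names such as `sex_ Male`.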

**Give your comments here**

## Task 2 (1 point). Classification on the whole dataset

- Run the chosen classifier on the whole dataset with the **default** values of parameters.
- Using the **classification_report** function, evaluate the results of classification.
- Give brief comments.
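A minimal sketch of these steps, assuming `GradientBoostingClassifier` as the chosen classifier and using synthetic data from `make_classification` in place of the survey (both are stand-ins for illustration). Predictions are made on the same sample the model was fitted on, so the reported scores are in-sample and tend to be optimistic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# Synthetic, mildly unbalanced binary problem standing in for (X, y).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Default hyperparameters; random_state is fixed only for reproducibility.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_demo, y_demo)

# Predict on the full (training) sample and summarize per-class metrics.
y_pred_demo = clf.predict(X_demo)
report = classification_report(y_demo, y_pred_demo)
print(report)
```

With unbalanced classes, the per-class precision and recall in the report are usually more informative than overall accuracy.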


In [None]:
# run the classifier (with default parameters) on the whole sample



In [None]:
# make predictions on the full (training) sample
y_pred_fullsample = ...  # replace ... with the predictions of the fitted classifier on X

In [None]:
# make classification report
print(classification_report(y, y_pred_fullsample))

**Give your comments here**

## Task 3 (1 point). Cluster analysis

- Run a hierarchical cluster analysis on the full dataset (exclude the outcome variable).
- Identify **3 clusters**. Choose the biggest cluster for further analysis.
- Study the proportion of classes in the chosen cluster. Give comments.
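The clustering steps can be sketched with `linkage` and `cut_tree` on three synthetic blobs of different sizes. Ward linkage is an assumption here (the task does not name a method), and the blob data is invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)

# Three well-separated synthetic groups with 60, 30 and 10 points.
X_demo = np.vstack([
    rng.normal(0, 0.5, size=(60, 2)),
    rng.normal(5, 0.5, size=(30, 2)),
    rng.normal(10, 0.5, size=(10, 2)),
])

# Build the hierarchy (Ward linkage is an assumed choice).
Z = linkage(X_demo, method='ward')

# Cut the tree into 3 clusters; cut_tree returns a column, so flatten it.
labels = cut_tree(Z, n_clusters=3).ravel()

# Cluster sizes and a boolean mask selecting the biggest cluster.
sizes = np.bincount(labels)
biggest = int(sizes.argmax())
mask = labels == biggest
print(sizes, biggest, mask.sum())
```

On the real data, `y[mask].value_counts(normalize=True)` would give the class proportions inside the chosen cluster, provided `y` and the clustered rows share the same order. Two practical caveats: the memory needed by `linkage` grows roughly quadratically with the number of rows, and features on large scales (such as `fnlwgt`) can dominate Euclidean distances unless the data is standardized first.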


In [None]:
# exclude the outcome variable
X_hca = df.drop('income', axis=1)

In [None]:
# make codes for categorical features
for field in ['workclass', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']:
    X_hca[field] = pd.Categorical(X_hca[field]).codes
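A short illustration of what `pd.Categorical(...).codes` produces, using an invented series. Levels are numbered in sorted order, and missing values receive the code `-1`, which matters if NAs were not removed beforehand.

```python
import numpy as np
import pandas as pd

# Invented values in the style of the 'workclass' column, with one missing entry.
s = pd.Series([' Private', ' State-gov', np.nan, ' Private'])

codes = pd.Categorical(s).codes
print(codes.tolist())  # [0, 1, -1, 0]
```

The integer codes impose an arbitrary order on nominal levels, so distances computed from them are only a rough proxy for similarity.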

In [None]:
# run the hierarchical cluster analysis



In [None]:
# identify the biggest cluster



**Give your comments here**

## Task 4 (1 point). Classification on 1 cluster

- Run the chosen classifier with the **default** parameters on the chosen cluster.
- Using the **classification_report** function, evaluate the results of classification.
- Compare the results of classification on the full sample and on the chosen cluster. Give comments.
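One way to restrict the classifier to a single cluster is a boolean mask over the rows. The sketch below assumes `GradientBoostingClassifier` as the chosen classifier and uses synthetic data with randomly drawn, hypothetical cluster labels; in the assignment the labels would come from the hierarchical clustering step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the encoded features and the outcome.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)

# Hypothetical cluster labels (random here, for illustration only).
rng = np.random.default_rng(1)
cluster_labels = rng.choice([0, 1, 2], size=400, p=[0.6, 0.3, 0.1])

# Select the rows of the biggest cluster.
biggest = np.bincount(cluster_labels).argmax()
mask = cluster_labels == biggest
X_cl, y_cl = X_demo[mask], y_demo[mask]

# Fit with default parameters and evaluate on the same cluster.
clf_cl = GradientBoostingClassifier(random_state=0).fit(X_cl, y_cl)
report_cl = classification_report(y_cl, clf_cl.predict(X_cl), output_dict=True)
print(report_cl['accuracy'])
```

Using `output_dict=True` makes it easier to place the cluster metrics next to the full-sample metrics for a side-by-side comparison.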

In [None]:
# run your calculations here


**Give your comments here**

## Task 5 (1 point). Parameter optimization with grid search

- To run the grid search, use **3-fold cross-validation** and **3 important parameters** (including *learning_rate*) with 2 different values of each; use **cv_results_** to get the results (the older **grid_scores_** attribute was removed from scikit-learn).
- Optimize the parameters of the chosen classifier on the full sample; compare the results with the non-optimized classification.
- Optimize the parameters of the chosen classifier on the chosen cluster; compare the results with the non-optimized classification.
- Give comments on the results and the best combinations of parameters for both cases.
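The grid-search setup can be sketched as follows, assuming `GradientBoostingClassifier` as the chosen classifier and synthetic data in place of the survey. The two parameters besides `learning_rate` (`n_estimators`, `max_depth`) and all grid values are illustrative choices, not prescribed by the task. In recent scikit-learn versions the per-combination results are exposed through the `cv_results_` attribute.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the full sample (or for the chosen cluster).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# 3 parameters x 2 values each = 8 combinations (values are illustrative).
param_grid = {
    'learning_rate': [0.05, 0.1],
    'n_estimators': [50, 100],
    'max_depth': [2, 3],
}

# 3-fold cross-validation; the default scoring for classifiers is accuracy.
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results)
print(search.best_params_, search.best_score_)
```

Because `mean_test_score` is a cross-validated estimate, it is not directly comparable to the in-sample scores of the non-optimized runs; comparing it with a cross-validated score of the default model gives a fairer picture.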


In [None]:
# run your calculations here


**Give your comments here**