# Assignment 2

Submitted by \<Bashier Kaddoura\>.

In this assignment, we will work with the *Adult* data set. Please download the data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult). Extract the data files into the subdirectory `../data/adult/` (relative to `./src/`).

## Variable Description

The download archive contains several files. We will only use one of them: `adult.data`. The file is comma-separated and does not contain headers; the variable specification is below.


|Variable Name |Role |Type |Demographic |Description |Units |Missing Values|
|--------------|-----|-----|------------|------------|------|--------------|
|age |Feature |Integer |Age |N/A | |no|
|workclass |Feature |Categorical |Income |Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. | |yes|
|fnlwgt |Feature |Integer | | | |no|
|education |Feature |Categorical |Education Level |Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. | |no|
|education-num |Feature |Integer |Education Level | | |no|
|marital-status |Feature |Categorical |Other |Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. | |no|
|occupation |Feature |Categorical |Other |Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. | |yes|
|relationship |Feature |Categorical |Other |Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. | |no|
|race |Feature |Categorical |Race |White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. | |no|
|sex |Feature |Binary |Sex |Female, Male. | |no|
|capital-gain |Feature |Integer | | | |no|
|capital-loss |Feature |Integer | | | |no|
|hours-per-week |Feature |Integer | | | |no|
|native-country |Feature |Categorical |Other |United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. | |yes|
|income |Target |Binary |Income |>50K, <=50K. | |no|


## Objective

The objective of this assignment is to construct a preprocessing and model pipeline to predict the variable `income`. We will evaluate this pipeline using cross-validation.

# Load the data

Assuming that the file `adult.data` is in `../data/adult/`, you can use the code below to load it.

In [13]:
import pandas as pd
columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
    'native-country', 'income'
]
adult_dt = (pd.read_csv('../data/adult/adult.data', header = None, names = columns)
              .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))


# Get X and Y

Create the features data frame and target data:

+ Create a dataframe `X` that holds the features (all columns that are not `income`).
+ Create a dataframe `Y` that holds the target data (`income`).
+ From `X` and `Y`, obtain the training and testing data sets:

    - Use a train-test split of 70-30%. 
    - Set the random state of the splitting function to 42.

In [14]:
from sklearn.model_selection import train_test_split

# Separate the features from the target, then hold out 30% of the rows for testing.
X = adult_dt.drop(columns='income')
Y = adult_dt['income']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [22]:
X_train.shape, Y_train.shape

((22792, 14), (22792,))

In [44]:
Y

0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: income, Length: 32561, dtype: int32

## Random States

Please comment: 

+ What is the [random state](https://scikit-learn.org/stable/glossary.html#term-random_state) of the [splitting function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)? 
+ Why is it [useful](https://en.wikipedia.org/wiki/Reproducibility)?

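The random state is the seed for the pseudo-random number generator that shuffles the rows before they are split. Fixing it (here, to 42) makes the split reproducible: every run produces the same training and testing sets, so results can be compared across runs and checked by other people.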
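
The reproducibility point can be checked with a minimal sketch. It uses a small made-up array (`X_toy`, `y_toy`) rather than the Adult data, and the names are only illustrative: two calls with the same `random_state` should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration, not the Adult data set).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same seed twice: the shuffles, and therefore the splits, are identical.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Leaving `random_state` unset would instead draw a fresh shuffle on every call, so scores would vary from run to run.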

# Preprocessing

Create a [Column Transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) that treats the features as follows:

- Numerical variables

    * Apply [KNN-based imputation for completing missing values](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html):
        
        + Consider the 7 nearest neighbours.
        + Weight each neighbour by the inverse of its distance, so that closer neighbours have more influence than more distant ones.
    * [Scale features using statistics that are robust to outliers](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler).

- Categorical variables: 
    
    * Apply a [simple imputation strategy](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer):

        + Use the most frequent value to complete missing values, also called the *mode*.

    * Apply [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html):
        
        + Handle unknown labels if they exist.
        + Drop one column for binary variables.
    
    
The column transformer should look like this:

![](./img/assignment_2__column_transformer.png)
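
To see what inverse-distance weighting changes in the KNN imputation step, here is a minimal sketch on a made-up three-row matrix (not the Adult data). It uses two neighbours instead of seven so the arithmetic can be followed by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (made up for illustration): the middle row is missing its second feature.
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [4.0, 40.0]])

# Distance weighting: the closer donor (row 0) counts twice as much as row 2,
# so the imputed value is (2 * 10 + 1 * 40) / 3.
by_distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X_toy)
# Uniform weighting: a plain average of the two donors.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_toy)

print(by_distance[1, 1], uniform[1, 1])
```

With distance weighting the imputed value is pulled towards the nearer row; with uniform weighting both donors contribute equally.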

In [45]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

In [56]:
num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

cat_cols =['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

numeric = Pipeline(steps= [
    ("knn", KNNImputer(n_neighbors=7, weights='distance')),
    ("robust", RobustScaler())
])

categoric = Pipeline(steps= [
    ("simple", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", drop="if_binary"))
])

ctransformer = ColumnTransformer([
    ("num_transforms", numeric, num_cols),
    ("cat_transforms", categoric, cat_cols)
])

In [57]:
ctransformer

## Model Pipeline

Create a [model pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html): 

+ Add a step labelled `preprocessing` and assign the Column Transformer from the previous section.
+ Add a step labelled `classifier` and assign a [`RandomForestClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to it.

The pipeline looks like this:

![](./img/assignment_2__pipeline.png)

In [58]:
from sklearn.ensemble import RandomForestClassifier

pipe_ctransforms = Pipeline(steps= [
    ("preprocessing", ctransformer),
    ("classifier", RandomForestClassifier())
])

pipe_ctransforms

# Cross-Validation

Evaluate the model pipeline using [`cross_validate()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html):

+ Measure the following [performance metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values): negative log loss, ROC AUC, accuracy, and balanced accuracy.
+ Report the training and validation results. 
+ Use five folds.


In [59]:
from sklearn.model_selection import cross_validate

scoring = ['neg_log_loss', 'roc_auc', 'accuracy', 'balanced_accuracy']

res_simple_dict = cross_validate(pipe_ctransforms, X_train, Y_train, cv = 5, scoring = scoring, return_train_score=True)



In [60]:
res_simple_dict

{'fit_time': array([13.47650456, 13.56593323, 13.39303851, 13.56036711, 13.60377026]),
 'score_time': array([0.31289458, 0.27197027, 0.26628113, 0.27301049, 0.2735126 ]),
 'test_neg_log_loss': array([-0.35648188, -0.35156782, -0.39534324, -0.39194719, -0.38675178]),
 'train_neg_log_loss': array([-0.08148619, -0.08109291, -0.08136545, -0.08153114, -0.08111686]),
 'test_roc_auc': array([0.90467576, 0.90384583, 0.90206051, 0.90582312, 0.90245822]),
 'train_roc_auc': array([1.        , 1.        , 1.        , 1.        , 0.99999998]),
 'test_accuracy': array([0.8499671 , 0.84689625, 0.85388328, 0.85958754, 0.85673541]),
 'train_accuracy': array([1.        , 0.99983546, 1.        , 0.99994516, 0.99994516]),
 'test_balanced_accuracy': array([0.77558629, 0.76925718, 0.77741042, 0.78209833, 0.77744672]),
 'train_balanced_accuracy': array([1.        , 0.99966071, 1.        , 0.99988693, 0.99988693])}

Display the fold-level results as a pandas data frame, sorted by the negative log loss of the test (validation) set.

In [61]:
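# Sort the folds by validation negative log loss (ascending, so the worst fold comes first).
res_simple = (pd.DataFrame(res_simple_dict)
                .assign(experiment = 1)
                .sort_values('test_neg_log_loss'))

In [62]:
res_simple

| |fit_time |score_time |test_neg_log_loss |train_neg_log_loss |test_roc_auc |train_roc_auc |test_accuracy |train_accuracy |test_balanced_accuracy |train_balanced_accuracy |experiment |
|--|---------|-----------|------------------|-------------------|-------------|--------------|--------------|---------------|-----------------------|------------------------|-----------|
|2 |13.393039 |0.266281 |-0.395343 |-0.081365 |0.902061 |1.0 |0.853883 |1.0 |0.77741 |1.0 |1 |
|3 |13.560367 |0.27301 |-0.391947 |-0.081531 |0.905823 |1.0 |0.859588 |0.999945 |0.782098 |0.999887 |1 |
|4 |13.60377 |0.273513 |-0.386752 |-0.081117 |0.902458 |1.0 |0.856735 |0.999945 |0.777447 |0.999887 |1 |
|0 |13.476505 |0.312895 |-0.356482 |-0.081486 |0.904676 |1.0 |0.849967 |1.0 |0.775586 |1.0 |1 |
|1 |13.565933 |0.27197 |-0.351568 |-0.081093 |0.903846 |1.0 |0.846896 |0.999835 |0.769257 |0.999661 |1 |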


Calculate the mean of each metric. 

In [63]:
res_simple.mean()

fit_time                   13.519923
score_time                  0.279534
test_neg_log_loss          -0.376418
train_neg_log_loss         -0.081319
test_roc_auc                0.903773
train_roc_auc               1.000000
test_accuracy               0.853414
train_accuracy              0.999945
test_balanced_accuracy      0.776360
train_balanced_accuracy     0.999887
experiment                  1.000000
dtype: float64

Calculate the same performance metrics (negative log loss, ROC AUC, accuracy, and balanced accuracy) using the testing data `X_test` and `Y_test`. Display results as a dictionary.

*Tip*: both `roc_auc_score()` and `log_loss()` require prediction scores from `pipe.predict_proba()`. For `roc_auc_score()`, pass only the last column, `Y_pred_proba[:, 1]`. Pass the full `Y_pred_proba` to `log_loss()` and negate the result to obtain the negative log loss.

In [None]:
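from sklearn.metrics import get_scorer
scoring = ['neg_log_loss', 'roc_auc', 'accuracy', 'balanced_accuracy']
# Fit on the training split, then apply each scorer once to the held-out test split.
pipe_ctransforms.fit(X_train, Y_train)
res_test_dict = {name: get_scorer(name)(pipe_ctransforms, X_test, Y_test) for name in scoring}
res_test_dict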
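
The tip above can also be followed step by step with the metric functions themselves. The sketch below assumes a synthetic data set from `make_classification` and a small random forest as stand-ins for the Adult data and the full pipeline; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             log_loss, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Adult data (an assumption for illustration).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

# Fit on the training split only, then predict on the held-out split.
model = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_pred_proba = model.predict_proba(X_te)

test_metrics = {
    "neg_log_loss": -log_loss(y_te, y_pred_proba),        # uses all probability columns
    "roc_auc": roc_auc_score(y_te, y_pred_proba[:, 1]),   # positive-class column only
    "accuracy": accuracy_score(y_te, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_te, y_pred),
}
print(test_metrics)
```

The model is fitted once on the training split and scored once on the held-out split, so no cross-validation is involved at this stage.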

# Target Recoding

In the first code chunk of this document, we loaded the data and immediately recoded the target variable `income`. Why is this [convenient](https://scikit-learn.org/stable/modules/model_evaluation.html#binary-case)?

The specific line was:

```
adult_dt = (pd.read_csv('../data/adult/adult.data', header = None, names = columns)
              .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))
```

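Recoding `income` as 0/1 makes the positive class explicit: `>50K` becomes 1 and `<=50K` becomes 0. Many of scikit-learn's binary classification metrics assume by default that the positive class is labelled 1 (`pos_label=1`), so with this encoding the predicted values can be compared with the actual target values without any extra arguments. Stripping the whitespace first keeps the comparison with `'>50K'` from failing on stray spaces in the raw values.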
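
A small sketch of what the recoding does, and of why 0/1 labels are handy for scikit-learn's binary metrics. The labels and predictions are made up for illustration, and the leading space in the raw strings is an assumption of this toy example.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Toy labels with a leading space (made up for illustration).
raw = pd.Series([" >50K", " <=50K", " >50K", " <=50K"])
# Strip whitespace, compare, and turn the booleans into 0/1 integers.
y_true = (raw.str.strip() == ">50K") * 1
y_pred = pd.Series([1, 0, 0, 0])

# With 0/1 labels, the default pos_label=1 already points at the ">50K" class.
print(y_true.tolist(), recall_score(y_true, y_pred))
```

Had the target been left as strings, metrics such as `recall_score` would need an explicit `pos_label` to know which class counts as positive.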

# Reference

Becker, Barry and Kohavi, Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.