## Task 1: Working with a dataset with categorical features

In Assignment 1, we didn't have to do much preprocessing, because all the features in the two datasets were numerical. (Actually, in the second dataset, we removed all non-numerical features.) In this assignment, we'll instead consider how to deal with non-numerical features.

We'll use the famous Adult dataset. This is a binary classification task: the goal is to predict whether an American individual earns more than $50,000 a year, given a number of numerical and categorical features. (The dataset was extracted from a 1994 census database.)

### Step 1. Reading the data

Please download the two CSV files, the training set and the test set, and save them into your working directory. This is the official train/test split defined by the people who created the dataset. It's the same data as in the public distribution, except that we converted the format into a standard CSV format.

1. Write code to read the CSV files, for instance by using Pandas as in Assignment 1.
2. Then split the data into an input part X and an output part Y. The output variable, which the classifier will predict, is called target.





In [None]:
# import required libs
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon

In [None]:
# Import csv file for Training data
train = pd.read_csv('adult_train.csv')

# Import csv file for Test data
test = pd.read_csv('adult_test.csv')

# Shuffle the dataset.
train_shuffled = train.sample(frac=1.0, random_state=0)
test_shuffled = test.sample(frac=1.0, random_state=0)

# Split into input part X and output part Y.
Xtrain = train_shuffled.drop('target', axis=1)
Ytrain = train_shuffled['target']
Xtest = test_shuffled.drop('target', axis=1)
Ytest = test_shuffled['target']

In [None]:
Xtrain.head(3)

| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22278 | 49 | Local-gov | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States |
| 8950 | 49 | Private | HS-grad | 9 | Divorced | Other-service | Not-in-family | Black | Female | 0 | 0 | 40 | United-States |
| 7838 | 31 | Private | Prof-school | 15 | Never-married | Prof-specialty | Not-in-family | White | Male | 0 | 0 | 50 | United-States |


### Step 2: Encoding the features as numbers.

If you look at the data, you will note that it contains several features with categorical values, such as workclass, education etc. All scikit-learn models work with numerical data internally; this means that the categorical features need to be converted to numbers. The most straightforward way to carry out such a conversion is to use one-hot encoding of the features, also known as dummy variables in statistics. In this approach, we define one new column for each observed value of the feature.
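
As a small illustration of the idea, the sketch below one-hot encodes a single toy column with Pandas. The four rows are made up for the example (the category names are only illustrative values for a workclass-like feature), and `pd.get_dummies` is used here purely to show the resulting layout; it is not the tool used in the rest of this assignment.

```python
import pandas as pd

# Toy column, invented for illustration (not read from the Adult CSV files).
toy = pd.DataFrame({'workclass': ['Private', 'Local-gov', 'Private', 'State-gov']})

# One new 0/1 column per observed value of the feature.
one_hot = pd.get_dummies(toy['workclass'], dtype=int)

print(one_hot)
```

Each row has exactly one 1, in the column that corresponds to its original category.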

Scikit-learn includes a number of tools that can do one-hot encoding of categorical features and we'll see how to use one of them, the DictVectorizer. An alternative approach that is a bit more Pandas-friendly and gives more low-level control is to use the recently introduced ColumnTransformer; if you're interested, you can read an introduction to this approach here. We won't use a ColumnTransformer here because it will make Task 3 in this assignment a bit too annoying to solve.
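
For readers curious about the ColumnTransformer alternative, here is a minimal sketch on a made-up two-column DataFrame. The data, the step name `'onehot'`, and the choice of `handle_unknown='ignore'` are assumptions for illustration; the rest of this assignment does not use this approach.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration.
toy_df = pd.DataFrame({
    'age': [44, 31, 50],
    'workclass': ['Private', 'Local-gov', 'Private'],
})

# One-hot encode the categorical column and pass the numerical one through unchanged.
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['workclass'])],
    remainder='passthrough',
)
encoded = ct.fit_transform(toy_df)

print(encoded.shape)
```

The encoded columns come first (one per observed workclass value), followed by the passed-through `age` column.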

The DictVectorizer is used when we store our features as named attributes in dictionaries. For instance, we could represent one individual from the Adult dataset as follows:

    {'age': 44,
    'workclass': 'Private',
    'education': 'Some-college',
    'education-num': 10,
    'marital-status': 'Married-civ-spouse',
    'occupation': 'Machine-op-inspct',
    'relationship': 'Husband',
    'race': 'Black',
    'sex': 'Male',
    'capital-gain': 7688,
    'capital-loss': 0,
    'hours-per-week': 40,
    'native-country': 'United-States'}

Pandas includes a utility to convert a DataFrame into a list of dictionaries:

    dicts_for_my_training_data = my_training_data.to_dict('records')

Then make a DictVectorizer and apply it, writing something like the following:

    dv = DictVectorizer()
    X_train_encoded = dv.fit_transform(dicts_for_my_training_data)

The method fit_transform will first call fit, which as usual is the "training" method. For a DictVectorizer, "training" consists of building the mapping from categories to column positions. Then, the transform method will be called, which converts the data into a matrix.
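
To make the "training" step concrete, the following sketch fits a DictVectorizer on two made-up dictionaries and prints the mapping it learned. The toy records and the variable names `toy_dicts`, `dv_toy` and `X_toy` are assumptions for illustration only.

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy records, invented for illustration.
toy_dicts = [
    {'age': 44, 'workclass': 'Private'},
    {'age': 31, 'workclass': 'Local-gov'},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(toy_dicts)  # returns a sparse matrix

# fit() learned which column each feature (or feature=value pair) maps to.
print(dv_toy.vocabulary_)
print(dv_toy.feature_names_)

# transform() then filled the matrix according to that mapping.
print(X_toy.toarray())
```

Numerical features such as `age` keep their value in a single column, while each string value gets its own `feature=value` column.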

Now that you have a numerical representation of the data, you can compute a cross-validation accuracy for the training set using one of the classifiers you explored in Programming Assignment 1.

To handle the test data, you just call transform, because this time the vectorizer does not need to be "trained." Use this to compute the accuracy on the test set.

    X_test_encoded = dv.transform(dicts_for_my_test_data)
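
One consequence of only calling transform on the test data is worth seeing on a toy case: a category that never appeared during fit has no column, so it is silently dropped. The records below are made up for the sketch, and the names `train_dicts`, `test_dicts` and `dv_small` are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records, invented for illustration.
train_dicts = [
    {'age': 44, 'workclass': 'Private'},
    {'age': 31, 'workclass': 'Local-gov'},
]
# This workclass value does not occur in train_dicts.
test_dicts = [{'age': 50, 'workclass': 'Never-worked'}]

dv_small = DictVectorizer()
dv_small.fit(train_dicts)                       # builds the column mapping
X_test_small = dv_small.transform(test_dicts)   # reuses it, no re-training

# Same number of columns as at training time; the unseen category is all zeros.
print(dv_small.feature_names_)
print(X_test_small.toarray())
```

This keeps the training and test matrices compatible, which is what the classifier needs.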

In [None]:
# Convert the training and test data to lists of dicts,
# where each row's (person's) information is gathered in one dict.
dicts_Xtrain = Xtrain.to_dict('records')
dicts_Xtest = Xtest.to_dict('records')

# Show the first person's information.
dicts_Xtrain[:1]

[{'age': 49,
  'workclass': 'Local-gov',
  'education': 'HS-grad',
  'education-num': 9,
  'marital-status': 'Married-civ-spouse',
  'occupation': 'Transport-moving',
  'relationship': 'Husband',
  'race': 'White',
  'sex': 'Male',
  'capital-gain': 0,
  'capital-loss': 0,
  'hours-per-week': 40,
  'native-country': 'United-States'}]

In [None]:
# create DictVectorizer 
dv = DictVectorizer()

# one-hot encode the list with all dicts for both train and test
Xtrain_encoded = dv.fit_transform(dicts_Xtrain)
Xtest_encoded = dv.transform(dicts_Xtest)

In [None]:
# Baseline with dummy classifier
dummy = DummyClassifier(strategy='most_frequent')
# train
dummy.fit(Xtrain_encoded, Ytrain)
# cross validation on train data
dummy_score_train = cross_val_score(dummy, Xtrain_encoded, Ytrain).mean()
# accuracy on test data
dummy_score_test = accuracy_score(Ytest, dummy.predict(Xtest_encoded))

print("Dummy Classifier")
print("Average Train score:",dummy_score_train)
print("Average Test score:",dummy_score_test)

Dummy Classifier
Average Train score: 0.7591904454179904
Average Test score: 0.7637737239727289


In [None]:
# Decision Tree Classifier
dtc = DecisionTreeClassifier() 
# train
dtc.fit(Xtrain_encoded, Ytrain)
# cross validation on train data
dtc_score_train = cross_val_score(dtc, Xtrain_encoded, Ytrain).mean()
# accuracy on test data
dtc_score_test = accuracy_score(Ytest, dtc.predict(Xtest_encoded))

print("Decision Tree Classifier")
print("Average Train score:",dtc_score_train)
print("Average Test score:",dtc_score_test)

Decision Tree Classifier
Average Train score: 0.8189860592555205
Average Test score: 0.8179473005343653


### Step 3. Combining the steps.

In the example above, we first transformed the list of dictionaries into a numerical matrix, and then we used this matrix when training the classifier. A separate preprocessing step was carried out for the test set.

In machine learning setups, we often use long chains of preprocessing steps. The one-hot encoding is one example, and other such steps might be scaling, feature selection, imputation of missing values, etc. As you can imagine, keeping track of the preprocessing steps can be tedious and error-prone, so it makes sense to handle such preprocessing chains automatically.

A Pipeline consists of a sequence of scikit-learn modules. The most convenient way to build a Pipeline is to use the utility function make_pipeline. For instance, to build a pipeline consisting of a vectorization step and then a decision tree classifier, we could write

    from sklearn.pipeline import make_pipeline
    
    pipeline = make_pipeline(
        DictVectorizer(),
        DecisionTreeClassifier()
    )
    
The Pipeline can be treated as any classifier: we can call fit and predict as usual. Concretely, when we call fit on a Pipeline, it will in turn call fit_transform on all intermediate steps and then fit on the final step. When we call predict, transform will be called on the intermediate steps and then predict on the final step.
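
The sketch below checks this behaviour on a tiny made-up dataset by comparing the manual route (vectorize, then classify) with the Pipeline route. The records, the label strings and all variable names are assumptions for illustration, and `random_state=0` is set only so that both trees are built identically.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration.
toy_train = [
    {'age': 25, 'workclass': 'Private'},
    {'age': 52, 'workclass': 'Local-gov'},
    {'age': 38, 'workclass': 'Private'},
    {'age': 60, 'workclass': 'Local-gov'},
]
toy_labels = ['<=50K', '>50K', '<=50K', '>50K']
toy_test = [{'age': 55, 'workclass': 'Local-gov'}]

# Manual route: fit_transform on the training data, transform on the test data.
vec = DictVectorizer()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(vec.fit_transform(toy_train), toy_labels)
manual_pred = clf.predict(vec.transform(toy_test))

# Pipeline route: the same calls happen inside fit and predict.
toy_pipeline = make_pipeline(DictVectorizer(), DecisionTreeClassifier(random_state=0))
toy_pipeline.fit(toy_train, toy_labels)
pipeline_pred = toy_pipeline.predict(toy_test)

print(manual_pred, pipeline_pred)
```

Both routes give the same prediction; the Pipeline simply removes the bookkeeping.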

Build a pipeline that includes the classifier that you selected previously, and make sure that it works.

Importing the data and converting it to dicts

In [None]:
# import train data
train_data = pd.read_csv('adult_train.csv')

n_cols = len(train_data.columns)

# split X and Y from data
Xtrain = train_data.iloc[:, :n_cols-1].to_dict('records')
Ytrain = train_data.iloc[:, n_cols-1]

# import test data
test_data = pd.read_csv('adult_test.csv')
# split X and Y from data
Xtest = test_data.iloc[:, :n_cols-1].to_dict('records')
Ytest = test_data.iloc[:, n_cols-1]

Initializing the DecisionTreeClassifier and defining the hyperparameter values to be tested

In [None]:
# Creating the model
dtc = DecisionTreeClassifier()

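# specifying parameters for the models
# Note: max_features='auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
# on newer versions drop it ('sqrt' is equivalent for DecisionTreeClassifier).
param_grid = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_features': ['auto', 'sqrt', 'log2'],
              'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}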

Trying out different parameters with GridSearchCV and RandomizedSearchCV for the DecisionTreeClassifier.

In [None]:
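# GridSearchCV
# Xtrain_encoded from Step 2 follows the shuffled row order, while Ytrain was
# re-read above in file order, so re-encode the current Xtrain to keep rows aligned.
Xtrain_encoded = DictVectorizer().fit_transform(Xtrain)
gridsearch = GridSearchCV(dtc, param_grid)
gridsearch.fit(Xtrain_encoded, Ytrain)
gridsearch.best_params_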

{'criterion': 'entropy',
 'max_depth': 5,
 'max_features': 'auto',
 'splitter': 'random'}

In [None]:
# storing the best parameter for GridSearchCV
g_splitter = gridsearch.best_params_['splitter']
g_max_features = gridsearch.best_params_['max_features']
g_criterion = gridsearch.best_params_['criterion']
g_max_depth = gridsearch.best_params_['max_depth']
# printing the best parameters
print(g_splitter, g_max_features, g_criterion, g_max_depth)

random auto entropy 5


In [None]:
# RandomizedSearchCV
randomsearch = RandomizedSearchCV(dtc, param_grid, n_iter=10)
randomsearch.fit(Xtrain_encoded, Ytrain)
randomsearch.best_params_

{'splitter': 'best',
 'max_features': 'auto',
 'max_depth': 3,
 'criterion': 'entropy'}

In [None]:
# storing the best parameter for RandomizedSearchCV
r_splitter = randomsearch.best_params_['splitter']
r_max_features = randomsearch.best_params_['max_features']
r_criterion = randomsearch.best_params_['criterion']
r_max_depth = randomsearch.best_params_['max_depth']
# printing the best parameters
print(r_splitter, r_max_features, r_criterion, r_max_depth)

best auto entropy 3
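
The two searches above tune the classifier on a matrix that was encoded beforehand. Another option, sketched here on made-up data, is to hand a whole Pipeline to GridSearchCV and address the classifier's parameters with the `stepname__parameter` convention, where the step name is the lowercased class name that make_pipeline assigns. The toy records, label strings, grid values and variable names are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: the label depends only on workclass.
toy_X = [{'age': 20 + i, 'workclass': 'Private' if i % 2 == 0 else 'Local-gov'}
         for i in range(20)]
toy_y = ['>50K' if i % 2 else '<=50K' for i in range(20)]

toy_pipeline = make_pipeline(DictVectorizer(),
                             DecisionTreeClassifier(random_state=0))

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
toy_grid = {
    'decisiontreeclassifier__criterion': ['gini', 'entropy'],
    'decisiontreeclassifier__max_depth': [1, 2, 3],
}

search = GridSearchCV(toy_pipeline, toy_grid, cv=5)
search.fit(toy_X, toy_y)

print(search.best_params_)
print(search.best_score_)
```

With this setup the vectorizer is refitted inside every cross-validation fold, and the fitted `search` object can be used directly for `predict` on new lists of dictionaries.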


#### Pipeline Baseline

Baseline pipeline with the DecisionTreeClassifier left at its default hyperparameters

In [None]:
pipeline = make_pipeline(
    DictVectorizer(),
    StandardScaler(with_mean=False),
    SelectKBest(k=100),
    DecisionTreeClassifier()) 

In [None]:
pipeline.fit(Xtrain, Ytrain)

Pipeline(steps=[('dictvectorizer', DictVectorizer()),
                ('standardscaler', StandardScaler(with_mean=False)),
                ('selectkbest', SelectKBest(k=100)),
                ('decisiontreeclassifier', DecisionTreeClassifier())])

In [None]:
accuracy_score(Ytest, pipeline.predict(Xtest))

0.8189300411522634

#### Pipeline GridSearchCV

In [None]:
# GridSearchCV 
g_pipeline = make_pipeline(
    DictVectorizer(),
    StandardScaler(with_mean=False),
    SelectKBest(k=100),
    DecisionTreeClassifier(max_features=g_max_features, splitter=g_splitter,
                           criterion=g_criterion, max_depth=g_max_depth)) 

In [None]:
g_pipeline.fit(Xtrain, Ytrain)

Pipeline(steps=[('dictvectorizer', DictVectorizer()),
                ('standardscaler', StandardScaler(with_mean=False)),
                ('selectkbest', SelectKBest(k=100)),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                        max_features='auto',
                                        splitter='random'))])

In [None]:
accuracy_score(Ytest, g_pipeline.predict(Xtest))

0.7916589890055893

#### Pipeline RandomizedSearchCV

In [None]:
# RandomizedSearchCV 
r_pipeline = make_pipeline(
    DictVectorizer(),
    StandardScaler(with_mean=False),
    SelectKBest(k=100),
    DecisionTreeClassifier(max_features=r_max_features,splitter=r_splitter,
                           criterion=r_criterion, max_depth=r_max_depth))

In [None]:
r_pipeline.fit(Xtrain, Ytrain)

Pipeline(steps=[('dictvectorizer', DictVectorizer()),
                ('standardscaler', StandardScaler(with_mean=False)),
                ('selectkbest', SelectKBest(k=100)),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                        max_features='auto'))])

In [None]:
accuracy_score(Ytest, r_pipeline.predict(Xtest))

0.7637737239727289

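The test accuracy of the three pipelines differs. In the run shown above, the baseline pipeline without parameter tuning had the highest accuracy (0.819). The pipeline using the GridSearchCV parameters scored lower (0.792), and the pipeline using the RandomizedSearchCV parameters scored lowest (0.764), which is identical to the dummy classifier's test accuracy. RandomizedSearchCV samples parameter combinations at random, so its result changes between runs; in this run it selected the hyperparameters:

    {'splitter': 'best',
    'max_features': 'auto',
    'max_depth': 3,
    'criterion': 'entropy'}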