# Practice set 2: Preprocessing with `scikit-learn`

<br><br>

## Imports 

In [None]:
from hashlib import sha1
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

<br><br>

## Introduction <a name="in"></a>
<hr>

A crucial step when using machine learning algorithms on real-world datasets is preprocessing. This practice set gives you practice with data preprocessing and with building a supervised machine learning pipeline on a medium-sized dataset that has different types of features. 

<br><br>

## Exercise 1: Dataset and preliminary EDA
<hr>

In this practice set, you will be working on [the adult census dataset](https://www.kaggle.com/uciml/adult-census-income#). Download the CSV and save it as `adult.csv` in a `data` folder under the current folder. 

This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. Go through the information on [the dataset and features](http://archive.ics.uci.edu/ml/datasets/Adult).

The starter code below loads the data CSV (assuming that it is saved as `adult.csv` under the data folder). 

_Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary._

In [None]:
census_df = pd.read_csv("data/adult.csv")
census_df.shape

<br><br>

### 1.1 Data splitting 
rubric={autograde}

In order to avoid violation of the golden rule, the first step before we do anything is splitting the data. 

**Your tasks:**

1. Split the data into `train_df` (40%) and `test_df` (60%) with `random_state = 123`. Keep the target column (`income`) in the splits so that we can use it in the exploratory data analysis.  

_Usually, having more data for training is a good idea. But here I'm using 40%/60% split because running cross-validation with this dataset can take a long time on a modest laptop. A smaller training data means it will be a bit faster to train the model on your laptop. A side advantage of this is that with a bigger test split, we'll have a more reliable estimate of the model performance!_
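
As a rough illustration of the split proportions, here is a minimal sketch on a made-up 10-row dataframe. The names `toy_df`, `toy_train`, and `toy_test` are hypothetical and are not part of the assignment. With `test_size=0.6`, `train_test_split` should put 4 rows in the training split and 6 in the test split, and fixing `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the assignment itself splits census_df.
toy_df = pd.DataFrame({"x": range(10), "income": ["<=50K", ">50K"] * 5})

toy_train, toy_test = train_test_split(toy_df, test_size=0.6, random_state=123)
toy_train_again, _ = train_test_split(toy_df, test_size=0.6, random_state=123)

print(toy_train.shape, toy_test.shape)
# The same random_state selects the same rows on every run
print(toy_train.index.equals(toy_train_again.index))
```

Because the target column stays in both splits, it is still available for exploratory analysis on the training portion.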

<div class="alert alert-warning">

Solution_1.1
    
</div>

In [None]:
train_df = None
test_df = None

# BEGIN SOLUTION
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
train_df, test_df = train_test_split(census_df, test_size=0.6, random_state=123)

# END SOLUTION

In [None]:
assert not train_df is None and not test_df is None, "Please use the provided variables."
assert train_df.shape == (13024, 15), "The dimensions of the training set are incorrect"
assert test_df.shape == (19537, 15), "The dimensions of the test set are incorrect"
assert train_df.loc[12846][['age', 'education', 'occupation', 'capital.loss']].tolist() == [49, 'Some-college', 'Craft-repair', 0], "Are you using the provided random state?"
assert not 20713 in train_df.index, 'Are you using the provided random state?' 

<br><br>

Let's examine our `train_df`. 

In [None]:
train_df.sort_index()

We see some missing values represented with a "?". These are probably questions that some people did not answer during the census. Usually, the `.describe()` or `.info()` methods would give you information on missing values. But here they won't count "?" as missing, because it is stored as an ordinary string rather than as an actual `NaN`. So let's replace "?" with `np.nan` before we carry out EDA. If you do not do it, you'll encounter an error later on when you try to pass this data to a classifier. 

In [None]:
train_df = train_df.replace("?", np.nan)
test_df = test_df.replace("?", np.nan)
train_df.shape

In [None]:
train_df.sort_index()

The "?" symbols are now replaced with NaN values. 
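
To see why the replacement matters, here is a small sketch on a made-up dataframe (`toy_missing` and its values are hypothetical). Before the replacement, pandas treats "?" as an ordinary string and reports no missing values. After it, the same cell is counted as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame where "?" marks an unanswered question.
toy_missing = pd.DataFrame(
    {"workclass": ["Private", "?", "State-gov"], "age": [39, 50, 28]}
)

# "?" is just a string, so no column reports missing values
print(toy_missing.isna().sum())

# After the replacement, the "?" cell is a real NaN
toy_clean = toy_missing.replace("?", np.nan)
print(toy_clean.isna().sum())
```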

<br><br>

### 1.2 `describe()` method
rubric={autograde}

**Your tasks:**

1. Examine the output of `train_df.describe()` with `include='all'` argument and store it in a variable called `census_summary`.
2. What is the highest number of hours per week someone reported? Store it in a variable called `max_hours_per_week`.
3. What is the most frequently occurring occupation in this dataset? Store it in a variable called `most_freq_occupation`.
4. Store the column names of the columns with missing values as a list in a variable called `missing_vals_cols`. 
5. Store the column names of all numeric-looking columns, irrespective of whether you want to include them in your model or not, as a list in a variable called `numeric_cols`.  
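
The sketch below shows how the summary table can be indexed, using a made-up two-column dataframe (`toy_census` and its values are hypothetical). With `include="all"`, the summary has rows for both numeric statistics (such as `max`) and categorical statistics (such as `top`), and `.loc` can pick out a single cell by row and column label.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy_census = pd.DataFrame(
    {
        "hours": [40, 60, 20, 40],
        "job": ["clerk", "clerk", "sales", "clerk"],
    }
)
toy_summary = toy_census.describe(include="all")

# Row labels include count, unique, top, freq, mean, std, min, quartiles, max
print(toy_summary.index.tolist())

toy_max_hours = toy_summary.loc["max", "hours"]
toy_top_job = toy_summary.loc["top", "job"]
print(toy_max_hours, toy_top_job)
```

Cells that do not apply to a column, such as `top` for a numeric column, are filled with `NaN`.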

<div class="alert alert-warning">

Solution_1.2
    
</div>

In [None]:
census_summary = None

# BEGIN SOLUTION
census_summary = train_df.describe(include="all")
census_summary
# END SOLUTION

In [None]:
max_hours_per_week = None

# BEGIN SOLUTION
max_hours_per_week = census_summary.loc['max']['hours.per.week']
max_hours_per_week
# END SOLUTION

In [None]:
most_freq_occupation = None

# BEGIN SOLUTION
most_freq_occupation = census_summary.loc['top']['occupation']
most_freq_occupation
# END SOLUTION

In [None]:
# BEGIN SOLUTION
train_df.info()
# END SOLUTION

In [None]:
missing_vals_cols = None
numeric_cols = None

# BEGIN SOLUTION

missing_vals_cols = ['workclass', 'occupation', 'native.country']
numeric_cols = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']

# END SOLUTION

In [None]:
# Sorting the lists for the autograder
missing_vals_cols.sort()
numeric_cols.sort()

In [None]:
# Task 1
assert isinstance(census_summary, pd.DataFrame), "census_summary dataframe is not created"
assert census_summary.shape == (11, 15), "census_summary shape is incorrect. Probably you are not including all columns"
assert census_summary.loc['min']['age'] == 17.0, "census_summary dataframe is incorrect"
assert census_summary.loc['top']['occupation'] == "Prof-specialty", "census_summary dataframe is incorrect"

In [None]:
# Task 2
assert (sha1(str(max_hours_per_week).encode('utf8')).hexdigest() == "3359de52c8ae993fe0f8fe9c5168a0065bd3c7a4"), "max_hours_per_week are incorrect"

In [None]:
# Task 3
assert (sha1(str(most_freq_occupation).encode('utf8')).hexdigest() == "97165f50eddb0d28a382b0366274e2fe38505644"), "most_freq_occupation is incorrect"

In [None]:
# Task 4
assert (sha1(str(missing_vals_cols).encode('utf8')).hexdigest() == "6bc5e13d4d66b306e52701ee9a1e5e21bf19aeb0"), "Please use the exact column/feature name. Also, make sure the lists are sorted."

In [None]:
# Task 5
assert (sha1(str(numeric_cols).encode('utf8')).hexdigest() == "615afaf5011128d641ab8a73289d57bd01a3ec37"), "Please use the exact column/feature name. Also, make sure the lists are sorted."

<br><br>

### 1.3 Visualizing features
rubric={points}

**Your tasks:**

1. For each numeric feature in `numeric_cols` you identified above, visualize the histograms for <=50K and >50K classes. 
2. Write a sentence or two describing your observations. 

> You can use the library of your choice for visualization. 

<div class="alert alert-warning">

Solution_1.3
    
</div>

In [None]:
## IGNORE ## 
# BEGIN SOLUTION
for feat in numeric_cols:
    ax = train_df.groupby("income")[feat].plot.hist(bins=40, alpha=0.4, legend=True, density=True)
    plt.xlabel(feat)
    plt.title("Histogram of " + feat)
    plt.show()
# END SOLUTION

**Preliminary observations**: 
From the density histograms above it seems like middle age, higher capital.gain, higher capital.loss, higher education.num and more work hours.per.week are associated with higher income (>50K income). 

<br><br><br><br>

## Exercise 2: Identifying different feature types and transformations  
<hr>

Usually, data comes in a format that cannot be passed directly to machine learning models. A machine learning practitioner needs to examine each column carefully and come up with a way to effectively encode the information given in each column. Let's identify what kinds of features we have and consider some reasonable ways to encode them.  

### 2.1 Identify transformations to apply
rubric={points}

Below we are providing possible transformations which can be applied on each column in `census_df`.  

**Your tasks:**
1. Write your justification or explanation for each row in the _Explanation_ column. An example explanation is given for the `age` feature. 

> You can find the information about the columns [here](http://archive.ics.uci.edu/ml/datasets/Adult).

> This question is a bit open-ended. If you do not agree with the provided transformation, feel free to argue your case in the explanation. That said, for the autograder to work in the rest of the assignment, go with the transformations provided below. 

<div class="alert alert-warning">

Solution_2.1
    
</div>

| Feature | Transformation | Explanation |
| --- | ----------- | ----- |
| age | scaling |  A numeric feature with no missing values. It will be a good idea to apply scaling, as the range of values (17 to 90) is quite different compared to other numeric features.|
| workclass | imputation, one-hot encoding | |
| fnlwgt | drop |  |
| education | ordinal encoding | |
| education.num | drop | |
| marital.status | one-hot encoding  | |
| occupation | imputation, one-hot encoding  | |
| relationship | one-hot encoding  | |
| race | drop  |  |
| sex | one-hot encoding with `drop="if_binary"` | |
| capital.gain | scaling |  | 
| capital.loss | scaling |  |
| hours.per.week | scaling | |
| native.country | imputation, one-hot encoding | | 


| Feature | Transformation | Explanation |
| --- | ----------- | ----- |
| age | scaling |  numeric variable in the range 17 to 90. No missing values.|
| workclass | imputation, one-hot encoding | categorical variable with XX categories, some missing values|
| fnlwgt | drop | The column represents the weight assigned by the US census bureau to each row. This is not really a feature which is relevant to predict the income |
| education | ordinal encoding | The column has a number of education levels which have some ordering associated with them. For example, "Bachelors" < "Masters"|
| education.num | drop | duplicate column |
| marital.status | one-hot encoding  | Categorical column with no missing values |
| occupation | imputation, one-hot encoding  | Categorical column with missing values |
| relationship | one-hot encoding  | Categorical column with no missing values |
| race | drop  | Categorical column with no missing values. It might not be a good idea to include the race feature to predict income. Such systems get used in applications which can affect real people. For example, this prediction might be used in deciding whether to approve a loan application or not. Influencing this decision by race feature might harm people belonging to certain race.   |
| sex | one-hot encoding with `drop="if_binary"` | Although sex in general is not binary, in this dataset there are only two possible values for this feature. That's why we treat it as a binary feature and apply one-hot encoding with `drop="if_binary"`, so only one column will be created for the feature. |
| capital.gain | scaling | numeric feature with no missing values | 
| capital.loss | scaling | numeric feature with no missing values |
| hours.per.week | scaling | numeric feature with no missing values |
| native.country | imputation, one-hot encoding | categorical feature with missing values.| 

<br><br>

### 2.2 Identify feature types 
rubric={autograde}


**Your tasks:**
1. Based on the types of transformations we want to apply on the features above, identify different feature types and store them in the variables below as lists.  

<div class="alert alert-warning">
    
Solution_2.2
    
</div>

In [None]:
# Fill in the lists below.
numeric_features = []
categorical_features = []
ordinal_features = []
binary_features = []
drop_features = []
target = "income"

# BEGIN SOLUTION

numeric_features = ["capital.gain", "age", "capital.loss","hours.per.week"]
categorical_features = ["marital.status", "native.country", "relationship", "occupation", "workclass"]
ordinal_features = ["education"]
binary_features = ["sex"]
drop_features = ["fnlwgt", "race", "education.num"]

# END SOLUTION

In [None]:
# Sorting all the lists above for the autograder
numeric_features.sort()
categorical_features.sort()
ordinal_features.sort()
binary_features.sort()
drop_features.sort()

In [None]:
assert (sha1(str(numeric_features).encode('utf8')).hexdigest() == "71401cf60034fd69eee7398866359f612adf3e15"), "numeric_features list is not correct"
assert (sha1(str(categorical_features).encode('utf8')).hexdigest() == "af1a4022c0362405678be5c3a6735578a8c0069f"), "categorical_features list is not correct"
assert (sha1(str(ordinal_features).encode('utf8')).hexdigest() == "95b86602c44211f3ad662bb58b8e53d024106d05"), "ordinal_features list is not correct"
assert (sha1(str(binary_features).encode('utf8')).hexdigest() == "d4b7aa4c56ac2f98e6ac9cec7768484b415b7337"), "binary_features list is not correct"
assert (sha1(str(drop_features).encode('utf8')).hexdigest() == "62aab57d42c54be3dfd3c55020e5a167ca1a84c3"), "drop_features list is not correct"
assert (sha1(str(target).encode('utf8')).hexdigest() == "0f613350b66e64d92ef21bc4dcdbf8996cb4edf0"), "target variable is not set correctly"

<br><br><br><br>

## Exercise 3: Baseline models 

### 3.1 Separating feature vectors and targets  
rubric={autograde}

**Your tasks:**

1. Create `X_train`, `y_train`, `X_test`, `y_test` from `train_df` and `test_df`. 

<div class="alert alert-warning">
    
Solution_3.1
    
</div>

In [None]:
X_train = None
y_train = None
X_test = None
y_test = None

# BEGIN SOLUTION
X_train = train_df.drop(columns=[target])
y_train = train_df[target]

X_test = test_df.drop(columns=[target])
y_test = test_df[target]
# END SOLUTION

In [None]:
assert not X_train is None, "Your answer does not exist. Have you passed in the correct variable?"
assert not y_train is None, "Your answer does not exist. Have you passed in the correct variable?"
assert not X_test is None, "Your answer does not exist. Have you passed in the correct variable?"
assert not y_test is None, "Your answer does not exist. Have you passed in the correct variable?"
assert X_train.shape == (13024, 14), "The dimensions of X_train are incorrect"
assert y_train.shape == (13024, ), "The dimensions of y_train are incorrect. Are you splitting correctly"
assert X_test.shape == (19537,14), "The dimensions of X_test are incorrect. Are you splitting correctly? Are you using single brackets?"
assert y_test.shape == (19537,), "The dimensions of y_test are incorrect. Are you splitting correctly? Are you using single brackets?"
assert 'income' not in list(X_train.columns), "Make sure the target variable is not part of your X dataset."

<br><br>

### 3.2 Dummy classifier
rubric={autograde}

**Your tasks:**

1. Carry out 5-fold cross-validation using `scikit-learn`'s `cross_validate` function with `return_train_score=True` and store the results as a dataframe named `dummy_df` where each row corresponds to the results from a cross-validation fold. 
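
As a sketch of what a baseline score means, the example below runs `cross_validate` with a `DummyClassifier` on made-up data (`toy_X`, `toy_y`, and `toy_scores` are hypothetical names). With its default settings the dummy classifier predicts the most frequent class, so its accuracy should be close to the proportion of that class, here 75%.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Hypothetical toy data: 75 examples of class 0 and 25 of class 1.
toy_X = np.zeros((100, 1))
toy_y = np.array([0] * 75 + [1] * 25)

toy_scores = cross_validate(
    DummyClassifier(), toy_X, toy_y, cv=5, return_train_score=True
)

# One entry per fold for fit_time, score_time, test_score, and train_score
print(sorted(toy_scores.keys()))
print(toy_scores["test_score"].mean())
```

Any real model should beat this number to be worth considering.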

<div class="alert alert-warning">
    
Solution_3.2
    
</div>

In [None]:
dummy_df = None 

# BEGIN SOLUTION
dummy = DummyClassifier()
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
dummy_df = pd.DataFrame(scores)
dummy_df
# END SOLUTION

In [None]:
assert not dummy_df is None, "Have you used the correct variable to store the results?"
assert sorted(list(dummy_df.columns)) == ['fit_time','score_time','test_score','train_score'], "Your solution contains incorrect columns."
assert dummy_df.shape == (5,4), "Are you carrying out 5-fold cross-validation and are you passing return_train_score=True?"
assert np.isclose(round(dummy_df['test_score'].mean(),3), 0.758), "The test scores seem wrong. Are you calling the cross_validate correctly?"
assert np.isclose(round(dummy_df['train_score'].mean(),3), 0.758), "The train scores seem wrong. Are you calling the cross_validate correctly?"

<br><br>

### 3.3 Discussion
rubric={reasoning}

**Your tasks:**

1. Hopefully, you were able to run cross-validation with dummy classifier successfully in the question above. At this point, if you train [`sklearn`'s `SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on `X_train` and `y_train` would it work? Why or why not? 

<div class="alert alert-warning">
    
Solution_3.3
    
</div>

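It won't work at this point because our data is not preprocessed yet: we have categorical columns stored as strings, and some of them (`workclass`, `occupation`, `native.country`) contain NaN values. `SVC` expects numeric input without missing values, so we need to preprocess the data before feeding it into ML algorithms.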

<br><br><br><br>

## Exercise 4: Column transformer 
<hr>

In this dataset, we have different types of features: numeric features, an ordinal feature, categorical features, and a binary feature. We want to apply different transformations to different columns, and therefore we need a column transformer. First, we'll define the transformations for the different types of features, and then we'll create a `scikit-learn` `ColumnTransformer` using `make_column_transformer`. For example, the code below creates a `numeric_transformer` for numeric features. 

In [None]:
numeric_transformer = StandardScaler()

In the exercises below, you'll create transformers for other types of features. 

<br><br>

### 4.1 Preprocessing ordinal features
rubric={autograde}

**Your tasks:**

1. Create a transformer called `ordinal_transformer` for our ordinal features. 

> Ordering of some of the education levels is not obvious. Assume that "HS-grad" < "Prof-school" < "Assoc-voc" < "Assoc-acdm" < "Some-college" < "Bachelors"
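
The sketch below shows how an explicit category order maps to integer codes, using a made-up three-level ordering (`toy_levels` and `toy_edu` are hypothetical; the assignment uses the full list of education levels). The categories are passed as a list of lists, one inner list per column, ordered from lowest to highest.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical three-level ordering, lowest to highest.
toy_levels = ["HS-grad", "Bachelors", "Masters"]
toy_edu = pd.DataFrame({"education": ["Masters", "HS-grad", "Bachelors"]})

toy_ordinal = OrdinalEncoder(categories=[toy_levels], dtype=int)
toy_codes = toy_ordinal.fit_transform(toy_edu)

# Each value is replaced by its position in toy_levels
print(toy_codes.ravel())
```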

<div class="alert alert-warning">
    
Solution_4.1
    
</div>

In [None]:
ordinal_transformer = None

# BEGIN SOLUTION
train_df["education"].unique()
# END SOLUTION

In [None]:
# BEGIN SOLUTION
education_levels = [
    "Preschool",
    "1st-4th",
    "5th-6th",
    "7th-8th",
    "9th",
    "10th",
    "11th",
    "12th",
    "HS-grad",
    "Prof-school",
    "Assoc-voc",
    "Assoc-acdm",
    "Some-college",
    "Bachelors",
    "Masters",
    "Doctorate",
]
assert set(education_levels) == set(train_df["education"].unique())
# END SOLUTION

In [None]:
# BEGIN SOLUTION
ordinal_transformer = OrdinalEncoder(categories=[education_levels], dtype=int)
# END SOLUTION

In [None]:
assert not ordinal_transformer is None, "Are you using the correct variable name?"
assert type(ordinal_transformer.get_params()['categories'][0]) is list, "Are you passing education levels as a list of lists?"
assert ordinal_transformer.get_params()['dtype'] == int, "Please set the dtype to int"
assert (sha1(str(ordinal_transformer.get_params()['categories'][0]).encode('utf8')).hexdigest() == "893a03d114b2af09b53247866c6eea54ebfd090f") or (sha1(str(ordinal_transformer.get_params()['categories'][0]).encode('utf8')).hexdigest() == "81059b8bebc9ddb03d61bf07cfd9b9b6b0da288e"), "Make sure you are passing categories sorted on levels of education. (Ascending or descending shouldn't matter.)"

<br><br>

### 4.2 Preprocessing binary features
rubric={autograde}

**Your tasks:**

1. Create a transformer called `binary_transformer` for our binary features.

> _Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary._

<div class="alert alert-warning">
    
Solution_4.2
    
</div>

In [None]:
binary_transformer = None
# BEGIN SOLUTION
binary_transformer = OneHotEncoder(drop="if_binary", dtype=int)
# END SOLUTION

In [None]:
assert not binary_transformer is None, "Are you using the correct variable name?"
assert binary_transformer.get_params()['drop'] == 'if_binary', "Are you passing `drop=if_binary`?"
assert binary_transformer.get_params()['dtype'] == int, "Please set the dtype to int"

<br><br>

### 4.3 Preprocessing categorical features
rubric={autograde}

In Exercise 2.1, we saw that there are 3 categorical features with missing values. So first we need to impute the missing values and then encode these features with one-hot encoding. For the purpose of this assignment, let's just have imputation as the first step for all categorical features, even when they do not have missing values. This should be OK because if a feature doesn't have any missing values, the imputation step leaves it unchanged. 

If we want to apply more than one transformation on a set of features, we need to create a [`scikit-learn` `Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). For example, for categorical features we can create a `scikit-learn` `Pipeline` with first step as imputation and the second step as one-hot encoding. 
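
Here is a minimal sketch of such a two-step pipeline on a made-up column with one missing value (`toy_cat` and `toy_cat_pipe` are hypothetical names). The imputer first fills the gap with the string "missing", and the encoder then treats "missing" as one more category.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with one missing value.
toy_cat = pd.DataFrame({"workclass": ["Private", np.nan, "State-gov"]})

toy_cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
# The encoder returns a sparse matrix by default; convert it for display
toy_cat_encoded = toy_cat_pipe.fit_transform(toy_cat).toarray()

print(toy_cat_pipe.named_steps["onehotencoder"].categories_)
print(toy_cat_encoded)
```

`make_pipeline` names each step after its class in lowercase, which is why the encoder is available as `named_steps["onehotencoder"]`.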

**Your tasks:**

1. Create a `sklearn` `Pipeline` using [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) called `categorical_transformer` for our categorical features with two steps: `SimpleImputer` for imputation with `strategy="constant"` and `fill_value="missing"`, and `OneHotEncoder` with `handle_unknown="ignore"` and `sparse=False` for one-hot encoding. (In `scikit-learn` 1.2 and later, the `sparse` parameter is named `sparse_output`; the old name was removed in version 1.4.) 

<div class="alert alert-warning">
    
Solution_4.3
    
</div>

In [None]:
categorical_transformer = None

# BEGIN SOLUTION
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore", sparse=False),
)
# END SOLUTION

In [None]:
assert not categorical_transformer is None, "Are you using the correct variable name?"
assert type(categorical_transformer) is Pipeline, "Are you creating a scikit-learn Pipeline?"
assert len(categorical_transformer.get_params()['steps']) == 2, "Are you creating a pipeline with two steps?"
assert categorical_transformer.get_params()['simpleimputer__strategy'] == 'constant', "Are you passing strategy=constant in the SimpleImputer?"
assert categorical_transformer.get_params()['simpleimputer__fill_value'] == 'missing', "Are you passing fill_value='missing' in the SimpleImputer?"
assert categorical_transformer.get_params()['onehotencoder__handle_unknown'] == 'ignore', "Are you passing handle_unknown = 'ignore' argument to your OHE?"
assert categorical_transformer.get_params()['onehotencoder__sparse'] == False, "Are you creating a sparse matrix for OHE?"

<br><br>

### 4.4 Creating a column transformer. 
rubric={autograde}

**Your tasks:**
1. Create a `sklearn` `ColumnTransformer` named `preprocessor` using [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) with the transformers defined in the previous exercises. Use the sequence below in the column transformer and add a "drop" step for the `drop_features` at the end.  
    - `numeric_transformer`
    - `ordinal_transformer`
    - `binary_transformer`
    - `categorical_transformer`
2. Transform the data by calling `fit_transform` on the training set and save it as a dataframe in a variable called `transformed_df`. How many new columns have been created in the preprocessed data in comparison to the original `X_train`? Store the difference between the number of columns in `transformed_df` and `X_train` in a variable called `n_new_cols`. 

> You are not required to do this but optionally you can try to get column names of the transformed data and create the dataframe `transformed_df` with proper column names. 

<div class="alert alert-warning">
    
Solution_4.4
    
</div>

In [None]:
preprocessor = None

# BEGIN SOLUTION
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (ordinal_transformer, ordinal_features),    
    (binary_transformer, binary_features),    
    (categorical_transformer, categorical_features),
    ("drop", drop_features),
)
# END SOLUTION

In [None]:
transformed_df = None
n_new_cols = None

# BEGIN SOLUTION
data = preprocessor.fit_transform(X_train)
ohe_feats = preprocessor.named_transformers_['pipeline'].named_steps['onehotencoder'].get_feature_names_out(categorical_features).tolist()
feature_names = numeric_features + ordinal_features + binary_features + ohe_feats
transformed_df = pd.DataFrame(data, columns=feature_names)
n_new_cols = transformed_df.shape[1] - X_train.shape[1]
# END SOLUTION

In [None]:
# task 1
assert not preprocessor is None, "Are you using the correct variable name?"
assert len(preprocessor.get_params()['transformers']) in range(4,6,1), "Have you included all the transformers?"
assert 'onehotencoder' in preprocessor.get_params().keys(), 'Either the categorical_transformer or binary_transformer is not included.'
assert 'standardscaler' in preprocessor.get_params().keys(), 'numeric_transformer is not included.'
assert 'ordinalencoder' in preprocessor.get_params().keys(), 'ordinal_transformer is not included.'
assert 'drop' in preprocessor.get_params().keys(), 'drop features step is not included.'

In [None]:
# task 2
assert not transformed_df is None, "Are you using the correct variable name?"
assert sha1(str(transformed_df.shape).encode('utf8')).hexdigest() == 'a0521f0cdbcd77cd213e7d1a3cfc13c1c7c92a6e', "The shape of the transformed data is incorrect."

In [None]:
assert sha1(str(n_new_cols).encode('utf8')).hexdigest() == 'b7103ca278a75cad8f7d065acda0c2e80da0b7dc', "The number of new columns (n_new_cols) is incorrect."

<br><br>

### 4.5 Short answer questions
rubric={reasoning:8}

**Your tasks:**

Answer each of the following questions in 2 to 3 sentences. 

1. What is the problem with calling `fit_transform` on your test data with `StandardScaler`?
2. Why is it important to follow the Golden Rule? If you violate it, will that give you a worse classifier?
3. What are two advantages of using sklearn Pipelines? 
4. When is it appropriate to use sklearn `ColumnTransformer`? 

<div class="alert alert-warning">
    
Solution_4.5
    
</div>

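1. Calling `fit_transform` on the test data re-fits the scaler, so the test set is scaled with its own mean and standard deviation rather than the ones learned from the training set. You need to perform the same transformations on the train and test data, otherwise the results will not make sense.
2. Not necessarily a worse classifier, but information from the test data leaks into training, so the test accuracy becomes an overly optimistic estimate of how your model will perform on unseen data.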
3. (1) prevents violating the Golden Rule, (2) helps keep track of all your transformations in one place.
4. When we have different types of features and we want to apply different transformations on different features. 
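
The first answer can be illustrated with a small sketch on made-up numbers (`toy_train_X` and `toy_test_X` are hypothetical). The test values are far larger than the training values. Reusing the scaler fitted on the training data preserves that difference, while re-fitting on the test data hides it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the test values lie far above the training values.
toy_train_X = np.array([[0.0], [10.0], [20.0]])
toy_test_X = np.array([[100.0], [110.0], [120.0]])

toy_scaler = StandardScaler().fit(toy_train_X)

# Recommended: reuse the training mean and standard deviation
toy_test_ok = toy_scaler.transform(toy_test_X)

# Problematic: re-fitting centers the test set on its own mean
toy_test_bad = StandardScaler().fit_transform(toy_test_X)

print(toy_test_ok.ravel())
print(toy_test_bad.ravel())
```

In the second output the test set looks centered around zero, just like the training set, even though its raw values are very different.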

<br><br><br><br>

## Exercise 5: Building models 

Now that we have preprocessed features, we are ready to build models. Below, I'm providing the function we used in class which returns mean cross-validation score along with standard deviation for a given model. Use it to keep track of your results. 

In [None]:
results_dict = {}  # dictionary to store all the results

In [None]:
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data
    **kwargs :
        additional keyword arguments passed on to cross_validate

    Returns
    -------
        pandas Series with "mean (+/- std)" strings from cross_validation
    """

    scores = cross_validate(model, X_train, y_train, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    # Use positional access (.iloc): the Series index holds score names, not integers
    for i in range(len(mean_scores)):
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))

    return pd.Series(data=out_col, index=mean_scores.index)

Below, I'm showing an example where I call `mean_std_cross_val_scores` with `DummyClassifier`. The function calls `cross_validate` with the passed arguments and returns a series with mean cross-validation results and std of cross-validation. When you train new models, you can just add the results of these models in `results_dict`, which can be easily converted to a dataframe so that you can have a table with all your results. 

In [None]:
# Baseline model

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(random_state = 123)
pipe = make_pipeline(preprocessor, dummy)
results_dict["dummy"] = mean_std_cross_val_scores(
    pipe, X_train, y_train, cv=5, return_train_score=True
)
results_df = pd.DataFrame(results_dict).T
results_df

<br><br>

### 5.1 Trying different classifiers
rubric={points}

**Your tasks:**

1. For each of the models in the starter code below: 
    - Define a pipeline with two steps: `preprocessor` from 4.4 and the model as your classifier. 
    - Carry out 5-fold cross-validation with the pipeline and get the mean cross-validation scores with std by calling the `mean_std_cross_val_scores` function above. 
    - Store the results in a dataframe called `income_pred_results_df` with the model names in the `models` dictionary below as the index and each row representing results returned by `mean_std_cross_val_scores` function above. In other words, `income_pred_results_df` should look similar to the `results_df` dataframe above with more rows for the models below. 
    
> This might take a while to run. Be patient! 

In [None]:
models = {
    "decision tree": DecisionTreeClassifier(random_state=123),
    "kNN": KNeighborsClassifier(),
    "RBF SVM": SVC(random_state=123),
}

<div class="alert alert-warning">
    
Solution_5.1
    
</div>

In [None]:
income_pred_results_df = None 
# BEGIN SOLUTION
for model_name, model in models.items():
    # print(model_name, ":")
    pipe = make_pipeline(preprocessor, model)
    results_dict[model_name] = mean_std_cross_val_scores(
        pipe, X_train, y_train, cv=5, return_train_score=True
    )
# END SOLUTION

In [None]:
## IGNORE ##
# BEGIN SOLUTION
income_pred_results_df = pd.DataFrame(results_dict).T
income_pred_results_df
# END SOLUTION

<br><br>

### 5.2 Discussion 
rubric={points}

**Your tasks:**

1. Examine the train and validation accuracies and `fit` and `score` times for all the models in the results above. How do the validation accuracies compare to the `DummyClassifier` model? Which model has the best validation accuracy? Which model has the fastest `fit` time? What about fastest `score` time? Which model is overfitting the most and the least?  


<div class="alert alert-warning">
    
Solution_5.2
    
</div>

####  Observations

- All three models have better cross-validation scores than the dummy classifier. Of course, the dummy classifier is the fastest in fitting and scoring and the most underfit model. But let's focus on the other three, more interesting models in the discussion below.   
- SVC has the best cross-validation scores (mean validation score = 0.852), followed by KNN and decision tree.  
- As expected, KNN is the fastest model for fitting and decision tree is the fastest model for scoring. SVM seems to be much slower compared to KNN and decision tree but is more accurate. 
- Decision tree is clearly overfitting the most (mean train score = 0.987, mean validation score = 0.814) and SVM is overfitting the least (mean train score = 0.855, mean validation score = 0.852). The standard deviation of the cross-validation scores across folds is also smaller for RBF SVM than for KNN and decision tree. So RBF SVM gives us the best cross-validation scores, it is not overfitting much, it is not very sensitive to the particular training data it is trained on, and it is likely to generalize well to unseen data. That said, it is slower than the other two models. 

<br><br>

### 5.3 Hyperparameter optimization
rubric={points}

In this exercise, you'll carry out hyperparameter optimization for the hyperparameter `C` of SVC RBF classifier. In practice, you'll carry out hyperparameter optimization for all different hyperparameters of the most promising classifiers. For the purpose of this assignment, we'll only do it for the `SVC` classifier with one hyperparameter, namely `C`. 
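
The general pattern can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the data come from `make_classification`, the preprocessing is a plain `StandardScaler` rather than the `preprocessor` from 4.4, and names such as `toy_cv_means` are hypothetical. For each candidate `C`, a fresh pipeline is cross-validated and its mean validation score is recorded.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real training data.
toy_Xc, toy_yc = make_classification(n_samples=200, n_features=5, random_state=123)

toy_C_values = np.logspace(-1, 2, 4)  # 0.1, 1, 10, 100
toy_cv_means = {}
for toy_C in toy_C_values:
    toy_svc_pipe = make_pipeline(StandardScaler(), SVC(C=toy_C))
    toy_cv_means[float(toy_C)] = cross_val_score(
        toy_svc_pipe, toy_Xc, toy_yc, cv=5
    ).mean()

# Pick the C with the highest mean cross-validation score
toy_best_C = max(toy_cv_means, key=toy_cv_means.get)
print(toy_cv_means)
print(toy_best_C)
```

Putting the scaler inside the pipeline means it is re-fitted on the training portion of every fold, which keeps the validation folds out of the preprocessing.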

**Your tasks:**

1. For each `C` value in the `param_grid` below: 
    - Create a pipeline object with two steps: preprocessor from 4.4 and `SVC` classifier with the `C` value.
    - Carry out 5-fold cross validation with the pipeline.  
    - Store the results in `results_dict` and display results as a pandas DataFrame. 
2. Which hyperparameter value seems to be performing the best? In this assignment, consider the hyperparameter value that gives you the highest cross-validation score as the "best" one. Store it in a variable called `best_C`. (Since this question is not autograded, please store the value directly as a number, something like `best_C = 0.001`, if `C = 0.001` is giving you the highest CV score.) Is it different from the default value for the hyperparameter used by `scikit-learn`? 

> Note: Running this will take a while. Please be patient. 

In [None]:
param_grid = {"C": np.logspace(-1, 2, 4)}
param_grid

<div class="alert alert-warning">
    
Solution_5.3
    
</div>

In [None]:
## IGNORE ##
# BEGIN SOLUTION
for param in param_grid["C"]:
    model_name = "RBF SVC"
    pipe = make_pipeline(preprocessor, SVC(C=param))

    key = model_name + "(C= " + str(param) + ")"
    results_dict[key] = mean_std_cross_val_scores(
        pipe, X_train, y_train, cv=5, return_train_score=True
    )
# END SOLUTION    

In [None]:
## IGNORE ##
# BEGIN SOLUTION
results_df = pd.DataFrame(results_dict).T
results_df
# END SOLUTION    

In [None]:
best_C = None

# BEGIN SOLUTION
# best_C_index = results_df.index.values[np.argmax(results_df["test_score"])]
best_C = 100.0
best_C
# END SOLUTION    

The hyperparameter value `C = 100.0` gives the best cross-validation score. It is different from the default value of this hyperparameter in `scikit-learn` (`C = 1.0`). 
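
For reference, the same kind of search can also be delegated to `GridSearchCV` instead of a manual loop. This is only a sketch under stated assumptions: it runs on synthetic data with a plain `StandardScaler` in place of the preprocessor from 4.4, and the names `toy_pipe`, `toy_grid`, and `toy_best_C` are hypothetical. The `svc__C` key follows the step naming that `make_pipeline` generates.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the census dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=123)

toy_pipe = make_pipeline(StandardScaler(), SVC())
# Same C values as param_grid above, addressed through the pipeline step name
toy_grid = {"svc__C": np.logspace(-1, 2, 4)}

search = GridSearchCV(toy_pipe, toy_grid, cv=5, return_train_score=True)
search.fit(X_toy, y_toy)

toy_results = pd.DataFrame(search.cv_results_)[
    ["param_svc__C", "mean_train_score", "mean_test_score"]
]
toy_best_C = search.best_params_["svc__C"]
print(toy_results)
print("best C on the toy data:", toy_best_C)
```

The best `C` on the toy data need not match the value found on the census data; the point is only the mechanics of the search.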

<br><br><br><br>

## Exercise 6: Evaluating on the test set 
<hr>

Now that we have a best performing model, it's time to assess our model on the set aside test set. In this exercise, you'll examine whether the results you obtained using cross-validation on the train set are consistent with the results on the test set. 

### 6.1 Scoring on the unseen test set 
rubric={autograde}

**Your tasks:**

1. Create a pipeline named `final_pipeline` with the preprocessor from 4.4 as the first step and the best performing SVC model from 5.3 as the second step. 
2. Train the pipeline on the entire training set `X_train` and `y_train`. 
3. Score the pipeline on `X_test` and `y_test` and store the score in a variable called `test_score`.  

<div class="alert alert-warning">
    
Solution_6.1
    
</div>

In [None]:
final_pipeline = None
test_score = None

# BEGIN SOLUTION

final_pipeline = make_pipeline(preprocessor, SVC(C=best_C))
final_pipeline.fit(X_train, y_train)
test_score = final_pipeline.score(X_test, y_test)

# END SOLUTION

- The test results are more or less consistent with the validation results, which is great!! 
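
The consistency check between cross-validation and test scores can be sketched as follows. This example uses synthetic data and a `StandardScaler` in place of the real preprocessor (both assumptions), and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the census dataset)
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

toy_pipe = make_pipeline(StandardScaler(), SVC())

# Mean cross-validation score, computed on the training portion only
cv_mean = cross_val_score(toy_pipe, X_tr, y_tr, cv=5).mean()

# Refit on the whole training portion, then score once on the held-out test set
toy_test_score = toy_pipe.fit(X_tr, y_tr).score(X_te, y_te)

print(f"mean CV score = {cv_mean:.3f}, test score = {toy_test_score:.3f}")
```

If the two numbers differ substantially, that is a hint that the cross-validation estimate was optimistic or that the test set is not representative.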

<br><br><br><br>

## Exercise 7: Food for thought
<hr>

### (Challenging) 7.1 The `native.country` column
rubric={points}

In our column transformer above, we treated `native.country` as a categorical feature, where a new column will be created for each unique category in this column.

**Your tasks:**

1. Examine the `value_counts` for this column.
2. Point out the problems/limitations associated with the current encoding of this column.   
3. Propose and implement a better approach to encode the column. Justify why your approach is better.  
4. Examine whether you get better accuracy with your best model when you use this encoding. Discuss your results.   

<div class="alert alert-warning">
    
Solution_7.1
    
</div>

In [None]:
# BEGIN SOLUTION
X_train['native.country'].value_counts()
# END SOLUTION

It seems like most of the values in this column are United States, which makes sense given that this is United States Census data. There are multiple possible ways to encode this column. Below I'm considering the 15 most frequently occurring countries.   

In [None]:
X_train["native.country"].value_counts()[:15].index.tolist()

In [None]:
# BEGIN SOLUTION
num_most_freq = 15
most_frequent = X_train["native.country"].value_counts()[:num_most_freq].index.tolist()
most_frequent
# END SOLUTION

In [None]:
preprocessor_q_7_1 = None
# BEGIN SOLUTION
categorical_native_country = ['native.country']
categorical_features_no_country = ["marital.status", "relationship", "occupation", "workclass"]

categorical_native_country_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore", categories=[most_frequent]),
)

preprocessor_q_7_1 = make_column_transformer(
    (numeric_transformer, numeric_features),
    (ordinal_transformer, ordinal_features),
    (binary_transformer, binary_features),
    (categorical_native_country_transformer, categorical_native_country),
    (categorical_transformer, categorical_features_no_country),
    ("drop", drop_features),
)
preprocessor_q_7_1.fit(X_train)
# END SOLUTION

In [None]:
transformed_df_q_7_1 = None
n_new_cols_q_7_1 = None

# BEGIN SOLUTION
data_q_7_1 = preprocessor_q_7_1.fit_transform(X_train)
native_country_feats = preprocessor_q_7_1.named_transformers_['pipeline-1'].named_steps['onehotencoder'].get_feature_names_out(categorical_native_country).tolist()
ohe_feats_q_7_1 = preprocessor_q_7_1.named_transformers_['pipeline-2'].named_steps['onehotencoder'].get_feature_names_out(categorical_features_no_country).tolist()
feature_names_q_7_1 = numeric_features + ordinal_features + binary_features + native_country_feats + ohe_feats_q_7_1
transformed_df_q_7_1 = pd.DataFrame(data_q_7_1, columns=feature_names_q_7_1)
n_new_cols_q_7_1 = transformed_df_q_7_1.shape[1] - X_train.shape[1]
# END SOLUTION

In [None]:
# BEGIN SOLUTION
transformed_df_q_7_1
# END SOLUTION

Now that the preprocessor is working, let's try our best SVC model with this new encoding. Note that the same hyperparameter value won't necessarily give the best results with the new encoding. 

In [None]:
# BEGIN SOLUTION
svc_pipe_q_7_1 = make_pipeline(preprocessor_q_7_1, SVC(C=best_C))
results_dict['SVC (C=100.0), most freq 15 countries'] = mean_std_cross_val_scores(
    svc_pipe_q_7_1, X_train, y_train, cv=5, return_train_score=True
)
# END SOLUTION

In [None]:
# BEGIN SOLUTION

pd.DataFrame(results_dict).T

# END SOLUTION

It seems like we are getting more or less the same cross-validation scores with all countries vs. the 15 most frequent countries. The standard deviation is a bit lower and there is slightly less overfitting. So it might be a better idea to go with this simpler model with fewer features.  

<br><br>

### (Challenging) 7.2 Column transformer on Spotify Tracks DB
rubric={points}

Download [Spotify Tracks DB](https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db) dataset. The features in this dataset are similar to Kaggle's [Spotify Song Attributes](https://www.kaggle.com/geomack/spotifyclassification/home) dataset. But the prediction task for this dataset is a regression task of predicting song popularity. See the documentation of Spotify-specific features [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). 

**Your tasks:**

1. Identify different types of features based on what kind of transformations you want to apply on them. Clearly justify your choices. 
2. Define a column transformer for this data and show the transformed data as a dataframe with appropriate column names.  

<div class="alert alert-warning">
    
Solution_7.2
    
</div>

In [None]:
# BEGIN SOLUTION
spotify_df = pd.read_csv("data/SpotifyFeatures.csv")
spotify_df.head()
# END SOLUTION

I'm cleaning up the CSV a bit. In particular, 

1. I'm changing popularity of 0 to 1 to avoid divide-by-zero errors later. Note that popularity ranges from 0 to 100, with 0 being the least popular and 100 being the most popular, so changing the popularity from 0 to 1 should not make a huge difference.
2. It seems like the genre feature has two slightly different versions of the category Children's Music, written with two different apostrophes (’ and '). I'm mapping them both to "Children's Music".  
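
A slightly more general way to collapse such variants is to normalize the apostrophe character itself rather than matching one specific category. The sketch below uses a made-up toy Series (`toy_genre` is a hypothetical name, not the real column).

```python
import pandas as pd

# Made-up values: the first and last entries use a curly apostrophe
toy_genre = pd.Series(
    ["Children’s Music", "Children's Music", "Pop", "Children’s Music"]
)

# Replace the curly apostrophe with a straight one in every value
cleaned_genre = toy_genre.str.replace("’", "'", regex=False)

print(toy_genre.nunique(), "->", cleaned_genre.nunique())  # prints: 3 -> 2
```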

In [None]:
# BEGIN SOLUTION
spotify_df.loc[spotify_df["popularity"] == 0, "popularity"] = 1
spotify_df["genre"].value_counts()
# END SOLUTION

In [None]:
# BEGIN SOLUTION
spotify_df.loc[spotify_df["genre"] == "Children’s Music", "genre"] = "Children's Music"
# END SOLUTION

> Interesting observation: It seems right to collapse these categories into one, but for some reason doing so makes a big difference in R^2 scores. When I skip the step above, i.e., keep the two categories separate, I get a `Ridge` R^2 score of 0.73 with all features and the default alpha value. Weird!! 

This dataset is large, so working with all of it can be computationally intensive. When splitting the data, I am therefore putting most of it (97%) in the test split.

In [None]:
# BEGIN SOLUTION
spotify_df.shape
# END SOLUTION

In [None]:
# BEGIN SOLUTION
from sklearn.model_selection import train_test_split  # harmless if already imported earlier

spotify_train_df, spotify_test_df = train_test_split(spotify_df, test_size=0.97, random_state=123)
# END SOLUTION

In [None]:
# BEGIN SOLUTION
spotify_train_df.shape
# END SOLUTION

In [None]:
# BEGIN SOLUTION
spotify_train_df.info()
# END SOLUTION

I am defining different feature types and a couple of preprocessors below.  

In [None]:
# BEGIN SOLUTION
drop_features_spotify = ["track_id", "artist_name"] # artist name can be a useful feature but most of the values are unique, so dropping it. Creating groupings of artists might be a good idea.
binary_features_spotify = ["mode"] # only two possible values
categorical_features_spotify = ["genre", "time_signature", "key"] # no particular ordering, apply OHE  
text_feature_spotify = "track_name" # track name can be a useful feature if people have preferences for songs on specific topic.
target_spotify = "popularity"
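non_numeric_spotify = set(
    drop_features_spotify
    + [text_feature_spotify]
    + binary_features_spotify
    + categorical_features_spotify
    + [target_spotify]
)
# Keep the DataFrame's column order so the feature order is the same on every run
numeric_features_spotify = [
    col for col in spotify_train_df.columns if col not in non_numeric_spotify
]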
assert spotify_train_df.columns.shape[0] == len(
    drop_features_spotify
    + binary_features_spotify
    + categorical_features_spotify
    + numeric_features_spotify
    + [text_feature_spotify]
    + [target_spotify]
)
# END SOLUTION

In [None]:
# BEGIN SOLUTION
from sklearn.feature_extraction.text import CountVectorizer

# `sparse=False` is no longer accepted by OneHotEncoder in recent scikit-learn releases;
# `sparse_threshold=0` makes the column transformer return a dense array instead.
preprocessor_q_7_2 = make_column_transformer(
    (StandardScaler(), numeric_features_spotify),
    (OneHotEncoder(drop="if_binary", dtype="int"), binary_features_spotify),
    (OneHotEncoder(handle_unknown="ignore", dtype="int"), categorical_features_spotify),
    (CountVectorizer(stop_words="english", max_features=100), text_feature_spotify),
    ("drop", drop_features_spotify),
    sparse_threshold=0,
)  # preprocessor which includes all features
# END SOLUTION

In [None]:
# BEGIN SOLUTION
spotify_X_train, spotify_y_train = spotify_train_df.drop(columns=[target_spotify]), spotify_train_df[target_spotify]
spotify_X_test, spotify_y_test = spotify_test_df.drop(columns=[target_spotify]), spotify_test_df[target_spotify]
# END SOLUTION

In [None]:
# BEGIN SOLUTION
transformed_spotify = preprocessor_q_7_2.fit_transform(spotify_X_train)
# END SOLUTION

In [None]:
# BEGIN SOLUTION
ohe_feats_spotify = preprocessor_q_7_2.named_transformers_['onehotencoder-2'].get_feature_names_out(categorical_features_spotify).tolist()
text_feats_spotify = preprocessor_q_7_2.named_transformers_['countvectorizer'].get_feature_names_out().tolist()
col_names = numeric_features_spotify + binary_features_spotify + ohe_feats_spotify + text_feats_spotify
# END SOLUTION

In [None]:
# BEGIN SOLUTION
pd.DataFrame(transformed_spotify, columns=col_names)
# END SOLUTION

OK. Seems like the preprocessor is working! 

<br><br><br><br>

Congratulations on finishing the assignment! You are now ready to build a simple supervised machine learning pipeline on real-world datasets! Well done :clap:! 

![](img/eva-well-done.png)

