# DSCI 571 - Supervised Learning I

# Lab 2: Preprocessing and building a simple ML pipeline

## Table of Contents

- [Submission instructions](#si)
- [Introduction](#in)
- [Exercise 1: Introducing the dataset](#1)
- [Exercise 2: Exploratory data analysis (EDA)](#2)
- [Exercise 3: Preprocessing and pipelines](#3)
- [Exercise 4: Building models](#4)
- [Exercise 5: Evaluating on the test set](#5)

In [None]:
# Import libraries
from hashlib import sha1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML

pd.set_option("display.max_colwidth", 200)

from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    StandardScaler,
    normalize,
    scale,
)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

## Submission instructions <a name="si"></a>
<hr>
rubric={mechanics:5}

You will receive marks for correctly submitting this assignment. 

To correctly submit this assignment follow the instructions below:

- Push your assignment to your GitHub repository. 
- Add a link to your GitHub repository here: LINK TO YOUR GITHUB REPO 
- Upload an HTML render of your assignment to Canvas. The last cell of this notebook will help you do that.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

[Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.

**NOTE: The data you download for use in this lab SHOULD NOT BE PUSHED TO YOUR REPOSITORY. You might be penalised for pushing datasets to your repository. I have seeded the repository with a `.gitignore` file in the hope that it will stop you from pushing CSVs.**

## Introduction <a name="in"></a>
<hr>

Preprocessing is a crucial step when applying machine learning algorithms to real-world datasets. This lab gives you practice building a preliminary supervised machine learning pipeline on a real-world dataset. 

## Exercise 1: Introducing the dataset <a name="1"></a>
<hr>


In this lab you will be working on a sample of [the adult census dataset](https://www.kaggle.com/uciml/adult-census-income#). Download the CSV and save it as `adult.csv` locally in the lab folder. The repository is seeded with a `.gitignore` file containing `*.csv`, so you should not be able to push the CSV to your repository. 

This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. You can find more information on the dataset and features [here](http://archive.ics.uci.edu/ml/datasets/Adult).

The starter code below loads the data CSV (assuming that it is saved as `adult.csv` in this folder). 

*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*

In [None]:
### BEGIN STARTER CODE

# For the purpose of this lab, I am undersampling the dataset so that the labels are balanced.
# We'll learn about dealing with unbalanced data in DSCI 573.
adult_df_large = pd.read_csv("adult.csv")
g50k = adult_df_large[adult_df_large["income"] == ">50K"]
leq50k_sample = adult_df_large[adult_df_large["income"] == "<=50K"].sample(
    g50k.shape[0]
)
census_df = pd.concat([g50k, leq50k_sample])
census_df.shape

### END STARTER CODE

### 1.1 Data splitting 
rubric={accuracy:2}

To avoid violating the golden rule, we split the data before doing anything else. 

**Your tasks:**

1. Split the data into `train_df` (80%) and `test_df` (20%). Keep the target column (`income`) in the splits so that we can use it in EDA. 
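As a rough illustration of the mechanics only, here is a minimal sketch of an 80/20 split on a hypothetical toy dataframe (`toy_df` and its values are made up for this example and are not the census data). The `random_state` value is an arbitrary choice that makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataframe
toy_df = pd.DataFrame(
    {
        "age": [25, 38, 47, 52, 29, 61, 33, 44, 56, 41],
        "income": ["<=50K", ">50K"] * 5,
    }
)

# test_size=0.2 gives an 80/20 split; the target column stays in both splits
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=123)

print(toy_train.shape, toy_test.shape)  # (8, 2) (2, 2)
```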


In [None]:
### YOUR ANSWER HERE

## Exercise 2: Exploratory data analysis (EDA) <a name="2"></a> 
<hr>

Let's examine our `train_df`. 

In [None]:
### BEGIN STARTER CODE

train_df.sort_index().head()

### END STARTER CODE

We see some missing values represented with a "?". These were probably questions that some people did not answer during the census. Usually the `.describe()` or `.info()` methods would give you information on missing values. But here they won't pick up "?" as a missing value, because it is encoded as a string rather than an actual `NaN`. So let's replace these values with `np.nan` before we carry out EDA. If you do not do it, you'll encounter an error later on when you try to pass this data to a classifier. 
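To see why the "?" placeholder goes unnoticed, here is a small sketch on a hypothetical toy dataframe (`toy` is made up for illustration): pandas counts no missing values until the string is replaced with a real `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with "?" used as a missing-value placeholder
toy = pd.DataFrame(
    {
        "workclass": ["Private", "?", "State-gov", "?"],
        "age": [39, 50, 38, 53],
    }
)

# "?" is an ordinary string, so pandas does not count it as missing
print(toy.isna().sum().sum())  # 0

# After the replacement the two entries are real missing values
toy_nan = toy.replace("?", np.nan)
print(toy_nan["workclass"].isna().sum())  # 2
```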

In [None]:
### BEGIN STARTER CODE

# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
train_df_nan = train_df.replace("?", np.nan)
test_df_nan = test_df.replace("?", np.nan)

### END STARTER CODE

### 2.1 Numeric vs. categorical features
rubric={reasoning:5}

**Your tasks:**

1. Identify numeric and categorical features and create lists for each of them. 
2. Are there any features which are neither numeric nor categorical in this dataset? If yes, create a separate list for those features. 

In [None]:
### BEGIN STARTER CODE

# Fill in the lists below.
numeric_features = []
categorical_features = []
remainder_features = []

### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

### 2.2 Visualizing features
rubric={viz:4,reasoning:2}

**Your tasks:**
1. Use the `train_df_nan.info()` method to show information about each feature, and `train_df_nan.describe()` with the `include="all"` argument to show summary statistics for each feature. 
2. Visualize the histograms of numeric features using either `altair` or pandas plotting. 
3. Which features seem relevant for the given prediction task? 

You don't have to, but you are welcome to use `pandas_profiling` for more elaborate visualization and EDA. 

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 2.3 Separating feature vectors and targets  
rubric={accuracy:2,reasoning:2}

**Your tasks:**

1. Create `X_train`, `y_train`, `X_test`, `y_test` from `train_df_nan` and `test_df_nan`. 
2. If you trained [`sklearn`'s `SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on `X_train` and `y_train` at this point, would it work? Why or why not?

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

## Exercise 3: Preprocessing and pipelines <a name="3"></a>
<hr>

In this exercise you'll be wrangling the dataset so that it's suitable to be used with `scikit-learn` classifiers. 

### 3.1 Identifying transformations that need to be applied
rubric={reasoning:7}

**Your tasks:**

1. Identify the columns that need transformations and tell us which transformations you would apply, and in what order, by filling in the table below. Example transformations are shown for the feature `age` in the table.  
2. Are there any columns where no transformations need to be applied? 

### BEGIN STARTER CODE
| Feature | Transformation |
| --- | ----------- |
| age | imputation, scaling |
| workclass |  |
| fnlwgt |  |
| education |  |
| education.num |  |
| marital.status |  |
| occupation |  |
| relationship |  |
| race |  |
| sex |  |
| capital.gain |  |
| capital.loss |  |
| hours.per.week |  |
| native.country |  |

### END STARTER CODE

### YOUR ANSWER HERE

### 3.2 Imputing missing values **without** `sklearn.pipeline.Pipeline`
rubric={accuracy:5,reasoning:2}

In this exercise you'll be imputing missing values **without using `scikit-learn` pipelines**. The goal here is twofold: first, to understand what happens under the hood when you use `scikit-learn` pipelines, and second, to convince yourself that using pipelines is a good idea.  

**Your tasks:**
1. For numeric features, use [`scikit-learn`'s `SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to impute `NaN` values with `strategy="median"`. Remember to apply the transformations on both the train and test splits.  
2. For categorical features, use [`scikit-learn`'s `SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to impute `NaN` values with the constant string `"missing"`. Remember to apply the transformations on both the train and test splits.
3. Show the train split of the categorical features as a dataframe after imputing the missing values. Do you see imputed missing values in the dataframe?  
4. When using the `SimpleImputer` transformer on the numeric columns, is there any problem with calling `fit_transform` on the test split? Why or why not? 
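The mechanics of `SimpleImputer` can be sketched on a hypothetical toy example (the `toy_train` and `toy_test` frames below are made up and much smaller than the census data). This shows one common pattern, not the only way to structure your answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with missing values in both columns
toy_train = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0, 60.0],
        "workclass": ["Private", np.nan, "State-gov", "Private"],
    }
)
toy_test = pd.DataFrame({"age": [np.nan, 30.0], "workclass": [np.nan, "Private"]})

# Numeric column: the median is learned from the train split
num_imputer = SimpleImputer(strategy="median")
train_num = num_imputer.fit_transform(toy_train[["age"]])
test_num = num_imputer.transform(toy_test[["age"]])
print(num_imputer.statistics_)  # [40.]

# Categorical column: fill with a constant string
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
train_cat = cat_imputer.fit_transform(toy_train[["workclass"]])
test_cat = cat_imputer.transform(toy_test[["workclass"]])
print(train_cat.ravel())
```

Note that the imputers return NumPy arrays, so you need to wrap the output in `pd.DataFrame` yourself if you want to display it as a dataframe.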

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 3.3 One-hot encoding **without** `sklearn.pipeline.Pipeline`
rubric={accuracy:8,reasoning:4}     

**Your tasks:**

1. Apply one-hot encoding to the categorical features of imputed `X_train` and `X_test` from 3.2 using [`scikit-learn`'s `OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and show the new columns created using the `categories_` attribute of the `OneHotEncoder` object.
2. Create preprocessed train and test splits, `X_train_pp` and `X_test_pp`, by horizontally stacking transformed numeric columns from 3.2 and transformed categorical columns with imputation and OHE applied. 
3. What's the shape of `X_train_pp` and `X_test_pp`?
4. Carry out cross validation using `cross_validate` on the preprocessed `X_train_pp` and `y_train` using the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier) with default parameters. 
5. Are we violating the golden rule when we call `cross_validate` in 4.? Why or why not? 
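As a hedged sketch of the one-hot encoding and stacking steps, here is a toy example with hypothetical arrays (`train_cat` and `train_num` stand in for already-imputed categorical and numeric columns). The encoder returns a sparse matrix by default, so the sketch converts it to a dense array before stacking.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical, already-imputed toy columns
train_cat = np.array(
    [["Private"], ["missing"], ["State-gov"], ["Private"]], dtype=object
)
train_num = np.array([[20.0], [40.0], [40.0], [60.0]])

ohe = OneHotEncoder(handle_unknown="ignore")
train_ohe = ohe.fit_transform(train_cat).toarray()  # sparse -> dense

# One array of category labels per encoded input column
print(ohe.categories_)

# Stack numeric and one-hot columns side by side
X_toy_pp = np.hstack([train_num, train_ohe])
print(X_toy_pp.shape)  # (4, 4)
```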

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 3.4 Using `sklearn.pipeline.Pipeline`
rubric={accuracy:8,reasoning:2}

As noted in 3.2 and 3.3, when we want to apply a series of transformations, the code becomes unwieldy quite quickly. We can do this much more elegantly using [`scikit-learn` pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).   

Let's carry out preprocessing using pipelines now. Note that you can define pipelines in two ways: by using [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and providing named steps or by using [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) which allows for simplified pipeline construction. In the latter case, the names of the steps will be set to the lowercase of their types automatically. You may use the method of your choice.  
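The difference between the two constructors can be seen in a short sketch. The step names `"imputer"` and `"scaler"` below are arbitrary choices for this illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicitly named steps
named = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Step names generated automatically from the class names
auto = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

print(list(named.named_steps))  # ['imputer', 'scaler']
print(list(auto.named_steps))  # ['simpleimputer', 'standardscaler']
```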


**Your tasks:**

1. Define a `Pipeline` for numerical features with two steps: 
    - [`SimpleImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) with `strategy="median"`
    - [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
2. Define a `Pipeline` for categorical features with two steps: 
    - [`SimpleImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) with `strategy="constant"`
    - [`OneHotEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) with `handle_unknown="ignore"`

3. Define a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) called `preprocessor` for the numerical, categorical, and remainder features.
4. Fit the `preprocessor` on `X_train` and `y_train`.
5. Examine the new features created by the `OneHotEncoder`. How many new features are created for the categorical feature `marital.status`?
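Here is a minimal sketch of how these pieces can fit together, on a hypothetical toy dataframe (`toy_X` and its columns are made up for illustration). Dropping the third column is only an assumption made for this example; how you treat your own remainder features is for you to decide.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy features
toy_X = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0, 60.0],
        "sex": ["Male", "Female", np.nan, "Female"],
        "fnlwgt": [1, 2, 3, 4],  # stand-in for a column we choose to drop here
    }
)

numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categorical_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, ["age"]),
        ("cat", categorical_pipe, ["sex"]),
        ("drop", "drop", ["fnlwgt"]),
    ]
)
toy_preprocessor.fit(toy_X)

# Reach into the fitted categorical pipeline to inspect the encoder
fitted_ohe = toy_preprocessor.named_transformers_["cat"].named_steps["onehotencoder"]
print(fitted_ohe.categories_)
```

The step name `"onehotencoder"` comes from `make_pipeline`; if you build the pipeline with `Pipeline` and your own step names, use those instead.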

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

## Exercise 4: Building models <a name="4"></a>
<hr>

Now that we have preprocessed features, we are ready to build models. 

In [None]:
### BEGIN STARTER CODE

# Let's create an empty dictionary to store all the results
results_dict = {}

### END STARTER CODE

In [None]:
### BEGIN STARTER CODE

# You may use or adapt this function to keep your results organized
def store_cross_val_results(model_name, scores, results_dict):
    """
    Stores mean scores from cross_validate in results_dict for
    the given model model_name.

    Parameters
    ----------
    model_name : str
        name of the scikit-learn classification model
    scores : dict
        object returned by `cross_validate`; it must contain train scores,
        so call `cross_validate` with `return_train_score=True`
    results_dict : dict
        dictionary to store results

    Returns
    -------
    None

    """
    results_dict[model_name] = {
        "mean_train_accuracy": "{:0.4f}".format(np.mean(scores["train_score"])),
        "mean_validation_accuracy": "{:0.4f}".format(np.mean(scores["test_score"])),
        "mean_fit_time (s)": "{:0.4f}".format(np.mean(scores["fit_time"])),
        "mean_score_time (s)": "{:0.4f}".format(np.mean(scores["score_time"])),
        "std_train_score": "{:0.4f}".format(scores["train_score"].std()),
        "std_test_score": "{:0.4f}".format(scores["test_score"].std()),
    }


### END STARTER CODE

### 4.1 Building a baseline model 
rubric={accuracy:4}

**Your tasks:**
1. Define a pipeline with two steps: preprocessor from 3.4 and `scikit-learn`'s `DummyClassifier` with `strategy="prior"` as your classifier.  
2. Carry out 5-fold cross validation with the pipeline. Store the results in `results_dict` above. (You may use the function above `store_cross_val_results` to store the results.) 
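A minimal sketch of the cross-validation call, on hypothetical random toy data (`X_toy` and `y_toy` are made up, and `StandardScaler` is only a stand-in for the preprocessor from 3.4). One detail worth noting: `store_cross_val_results` reads `scores["train_score"]`, which `cross_validate` only returns when `return_train_score=True` is passed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with balanced labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 2))
y_toy = np.array(["<=50K", ">50K"] * 10)

# StandardScaler stands in for the real preprocessor here
toy_pipe = make_pipeline(StandardScaler(), DummyClassifier(strategy="prior"))

scores = cross_validate(toy_pipe, X_toy, y_toy, cv=5, return_train_score=True)
print(sorted(scores.keys()))
# ['fit_time', 'score_time', 'test_score', 'train_score']
print(scores["test_score"].mean())
```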

In [None]:
### YOUR ANSWER HERE

### 4.2 Trying different classifiers
rubric={accuracy:6,reasoning:4}

**Your tasks:**

1. For each of the models in the starter code below: 
    - Define a pipeline with two steps: preprocessor from 3.4 and the model as your classifier.  
    - Carry out 5-fold cross validation with the pipeline using `cross_validate`.
    - Store the results in `results_dict`. (You may use the function above `store_cross_val_results` to store the results.) 
2. Display all the results so far as a dataframe. 
3. Compare the train and validation accuracies and `fit` and `score` times in each case. How do the validation accuracies compare to the baseline model from 4.1? Which model has the best validation accuracy? Which model is the fastest one?  

In [None]:
### BEGIN STARTER CODE 

models = {
    "decision tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
    "RBF SVM": SVC(),
}

### END STARTER CODE 

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### (optional) 4.3 Exploring importance of scaling
rubric={reasoning:1}

In this exercise you'll examine whether scaling helps in the case of kNN and RBF SVM. 

**Your tasks:**

1. Create a column transformer without the `StandardScaler` step for `numeric_features`. 
2. Repeat the steps in 4.2 with this new column transformer. 
3. Compare the results with scaled numeric features to those with unscaled numeric features. Is scaling necessary for decision trees? Why or why not?
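To build some intuition for distance-based models, here is a small sketch on hypothetical toy rows (the values and the helper `nearest_to_first` are made up for this illustration). With raw values, the large-magnitude second column dominates Euclidean distance; after standardization, the nearest neighbour of the first row changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age, capital.gain]
X_toy = np.array(
    [
        [25.0, 0.0],
        [60.0, 200.0],
        [27.0, 1000.0],
        [40.0, 5000.0],
        [50.0, 10000.0],
    ]
)


def nearest_to_first(X):
    """Return the index of the row closest (Euclidean) to row 0."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1


X_scaled = StandardScaler().fit_transform(X_toy)

print(nearest_to_first(X_toy))  # 1
print(nearest_to_first(X_scaled))  # 2
```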

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 4.4 Hyperparameter optimization
rubric={accuracy:4,reasoning:2}

In this exercise, you'll carry out hyperparameter optimization for the hyperparameter `C` of the RBF `SVC` classifier. In practice you would optimize several hyperparameters of the most promising classifiers. For the purpose of this assignment, we'll only do it for our best performing `SVC` classifier with one hyperparameter, `C`. 

**Your tasks:**

1. For each `C` value in the `param_grid` in the starter code below: 
    - Create a pipeline object with two steps: preprocessor from 3.4 and SVC classifier with the value of `C`.
    - Carry out cross-validation using `cross_validate` and store results in the `results_dict` using the function `store_cross_val_results`. You may pass the `model_name` as `SVC` + the current `C` value. 
2. Which hyperparameter value seems to be performing the best? Is it different from the default value for the hyperparameter used by `scikit-learn`? 
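The loop described above can be sketched as follows, on hypothetical synthetic data from `make_classification`, with `StandardScaler` standing in for the preprocessor from 3.4 and a plain dictionary `toy_results` standing in for `results_dict`. Treat it as an outline under those assumptions rather than a reference solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)

toy_results = {}
for C in np.logspace(-3, 2, 6):  # 0.001, 0.01, ..., 100
    pipe = make_pipeline(StandardScaler(), SVC(C=C))
    scores = cross_validate(pipe, X_toy, y_toy, cv=5, return_train_score=True)
    toy_results[f"SVC C={C:g}"] = {
        "mean_train_accuracy": scores["train_score"].mean(),
        "mean_validation_accuracy": scores["test_score"].mean(),
    }

for name, result in toy_results.items():
    print(name, result)
```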

In [None]:
### BEGIN STARTER CODE

param_grid = {"C": np.logspace(-3, 2, 6)}

### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### (Optional) 4.5 Hyperparameter optimization of multiple parameters
rubric={reasoning:1}

In the previous exercise we carried out hyperparameter optimization of only one hyperparameter. But as we saw in class, `SVC` has two important hyperparameters, `C` and `gamma`, which may interact with each other, so we need to optimize them simultaneously.  

**Your tasks:**
1. Carry out hyperparameter optimization of `C` and `gamma` simultaneously for the param grid of your choice. 
2. Do you get a different value for `C` than in 4.4? 

**Note: The material required to answer this question is not covered this week. I am trying something new this block: in some of the labs I will include an optional question that leads into the upcoming week's material. It's a low-risk question worth only one point, and it will be marked leniently. The intention here is not to get the perfect answer from you but to get you thinking about the upcoming material.**
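One possible way to search over both hyperparameters at once is `scikit-learn`'s `GridSearchCV`. The sketch below uses hypothetical synthetic data and an arbitrary small grid, with `StandardScaler` standing in for the preprocessor from 3.4. Nested loops with `cross_validate` would work too.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())

# Parameters of a pipeline step are addressed as <step name>__<parameter>
toy_param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}

search = GridSearchCV(pipe, toy_param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```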

In [None]:
### YOUR ANSWER HERE

## Exercise 5: Evaluating on the test set <a name="5"></a>
<hr>

Now that we have a best performing model, it's time to assess it on the held-out test set. In this exercise you'll examine whether the results you obtained using cross-validation on the train set are consistent with the results on the test set. 

### 5.1 Scoring on the unseen test set 
rubric={accuracy:4,reasoning:4}

**Your tasks:**

1. Report the results on `X_test` for all classifiers. Pick the best hyperparameter from 4.4 for the SVC RBF classifier. 
2. Compare and discuss the train, validation, and test results of all classifiers. 
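The final scoring step can be sketched on hypothetical synthetic data (again with `StandardScaler` standing in for the preprocessor from 3.4, and `C=1.0` as an arbitrary placeholder value): fit the whole pipeline on the train split, then score it once on the test split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data and split
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

# Fitting the pipeline fits the scaler and the classifier on the train split only
toy_pipe = make_pipeline(StandardScaler(), SVC(C=1.0))
toy_pipe.fit(X_tr, y_tr)

train_accuracy = toy_pipe.score(X_tr, y_tr)
test_accuracy = toy_pipe.score(X_te, y_te)
print(train_accuracy, test_accuracy)
```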

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### Submission to Canvas

**PLEASE READ: When you are ready to submit your assignment do the following:**

- Run all cells in your notebook to make sure there are no errors by doing Kernel -> Restart Kernel and Run All Cells...
- If you are using the "571" `conda` environment, make sure to select it before running all cells. 
- Convert your notebook to .html format using the `convert_notebook()` function below or by File -> Export Notebook As... -> Export Notebook to HTML
- Run the code `submit()` below to go through an interactive submission process to Canvas.
- After submission, be sure to do a final push of all your work to GitHub (including the rendered HTML file).

In [None]:
# from canvasutils.submit import convert_notebook, submit

# convert_notebook("lab1.ipynb", "html")  # uncomment and run when you want to convert your notebook (or convert it manually from the File menu)
# submit(course_code=53670, token=False)  # uncomment and run when ready to submit to Canvas

### Congratulations on finishing the lab! Now you are ready to build a simple ML pipeline on real-world datasets! Well done :clap:! 