# BAIT 509 Assignment 2: Preprocessing, Pipelines and Hyperparameter Tuning

__Evaluates__: Lectures 4 - 6. 

__Rubrics__: Your solutions will be assessed primarily on the accuracy of your coding, as well as the clarity and correctness of your written responses. The MDS rubrics provide a good guide as to what is expected of you in your responses to the assignment questions and how the TAs will grade your answers. See the following links for more details:

- [mechanics rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_mech.md): submitting an assignment correctly.
- [accuracy rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_accuracy.md): evaluating your code.
- [reasoning rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_reasoning.md): evaluating your written responses.
- [autograde rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_autograde.md): evaluating questions that are either right or wrong (can be done either manually or automatically).

## Tidy Submission 
rubric={mechanics:2}

- Complete this assignment by filling out this jupyter notebook.
- Any place you see `...` or `____`, you must fill in the function, variable, or data to complete the code.
- Use proper English, spelling, and grammar.
- You will submit two files on Canvas:
    1. This jupyter notebook file containing your responses (an `.ipynb` file); and,
    2. An `.html` file of your completed notebook that will render directly on Canvas without having to be downloaded.
        - To generate this html file, you can click `File` -> `Export Notebook As` -> `HTML` in JupyterLab or type the following into a terminal: `jupyter nbconvert --to html_embed assignment.ipynb`.
    
Submit your assignment through UBC Canvas by the deadline listed there.

## Introduction and learning goals <a name="in"></a>
<hr>

Welcome to the assignment! In this assignment, you will practice how to:

- Identify when to implement feature transformations such as imputation and scaling.
- Apply `sklearn.pipeline.Pipeline` to build a machine learning pipeline.
- Use `sklearn` for applying numerical feature transformations on the data.
- Identify when it's appropriate to apply ordinal encoding vs one-hot encoding.
- Explain strategies to deal with categorical variables with too many categories.
- Use `ColumnTransformer` to combine all our transformations into one object and use it with `scikit-learn` pipelines.
- Carry out hyperparameter optimization using `sklearn`'s `GridSearchCV` and `RandomizedSearchCV`.

## Introduction <a name="in"></a>
<hr>

A crucial step when using machine learning algorithms on real-world datasets is preprocessing. This assignment will give you some practice building a preliminary supervised machine learning pipeline on a real-world dataset. 

## Exercise 1: Introducing and Exploring the dataset <a name="1"></a>
<hr>

In this assignment, you will be working on a sample of [the adult census dataset](https://www.kaggle.com/uciml/adult-census-income#) that we provide as `census.csv`. We have made some modifications to this data so that it's easier to work with. 

This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. You can find more information on the dataset and features [here](http://archive.ics.uci.edu/ml/datasets/Adult).


*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*

In [None]:
import pandas as pd

census_df = pd.read_csv("census.csv")
census_df

### 1.1 Data splitting 
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
To avoid violation of the golden rule, the first step before we do anything is splitting the data. 

Split the data into `train_df` (80%) and `test_df` (20%). Keep the target column (`income`) in the splits so that we can use it in EDA. 
Please use `random_state=893`, so that your results are consistent with what we expect.
    
</div>
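As a reminder of how an 80/20 split behaves, here is a minimal sketch on a made-up dataframe. The names `toy_df`, `toy_train`, and `toy_test`, the column names, and the `random_state=0` value are assumptions for illustration only and are not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names are invented for this illustration.
toy_df = pd.DataFrame({
    "hours": range(10),
    "label": ["a", "b"] * 5,
})

# test_size=0.2 keeps 80% of the rows for training;
# random_state fixes the shuffle so the split is reproducible.
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=0)

print(toy_train.shape, toy_test.shape)
```

Because the whole dataframe is passed in, the target column stays in both pieces, which is convenient for EDA.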

In [None]:
...

Let's examine our `train_df`.
You can just follow along for the next few cells.

In [None]:
train_df

In [None]:
train_df.info()

It looks like things are in order,
but there is a hidden gotcha with this dataframe.
Let's look at the unique values of each column.

In [None]:
from IPython.display import HTML  # This step is just to avoid the long columns being truncated as "..."

HTML(
    train_df
    .select_dtypes(object)
    .apply(lambda x: sorted(pd.unique(x)))
    .to_frame()
    .to_html()
)

You can see that there are question marks in the columns "workclass", "occupation", and "native_country".
Unfortunately, it seems that the people collecting this data used an unconventional way to indicate missing/unknown values
instead of using the standard blank/NaN.
Our first step is to do this conversion manually,
so that `?` is not interpreted as an actual value by our models.

In [None]:
import numpy as np
train_df_nan = train_df.replace("?", np.nan)
test_df_nan = test_df.replace("?", np.nan)
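To check that a replacement like the one above behaves as intended, one option is to count the missing values per column afterwards. The sketch below uses a small made-up dataframe (`toy`, `toy_nan`, and `missing_per_column` are hypothetical names), not the census data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that uses "?" as a missing-value marker.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "age": [39, 50, 28],
})

# Swap the "?" placeholder for a real missing value.
toy_nan = toy.replace("?", np.nan)

# Count missing entries in each column after the replacement.
missing_per_column = toy_nan.isna().sum()
print(missing_per_column)
```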

### 1.2 Describing your data
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
Use `.describe()` to show summary statistics of each feature in the `train_df_nan` dataframe.
Show numerical and categorical columns separately,
either by first selecting the columns of each type,
or by reading the docstring of `.describe()` to figure out how to limit which column types are shown.

</div>
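As a rough illustration of limiting `.describe()` by column type, the sketch below uses its `include` and `exclude` parameters on a made-up dataframe. The dataframe and variable names are assumptions for illustration; selecting columns first (for example with `select_dtypes`) is an equally valid route.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [25, 32, 47],
    "sex": ["Male", "Female", "Female"],
})

# Summarize only the numeric columns (count, mean, std, quartiles, ...).
numeric_summary = toy.describe(include="number")

# Summarize everything that is not numeric (count, unique, top, freq).
categorical_summary = toy.describe(exclude="number")

print(numeric_summary)
print(categorical_summary)
```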

In [None]:
# Numerical
...

In [None]:
# Categorical
...

### 1.3 Identifying potentially important features
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">
    
We have provided you with some code that will visualize the distributions of all the numeric and categorical features in the census data.
Study the visualizations below
to suggest which features you think seem relevant
for the given prediction task of building a model to identify who makes over and under 50k.
List these features and briefly explain your rationale for selecting them.

</div>

YOUR ANSWER HERE

In [None]:
import altair as alt

alt.data_transformers.disable_max_rows()  # Allows us to plot big datasets

alt.Chart(train_df.sort_values('income')).mark_bar(opacity=0.6).encode(
    alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=50)),
    alt.Y('count()', stack=None),
    alt.Color('income')
).properties(
    height=200
).repeat(
    train_df_nan.select_dtypes('number').columns.to_list(),
    columns=2
)

In [None]:
alt.Chart(train_df.sort_values('income')).mark_bar(opacity=0.6).encode(
    alt.X(alt.repeat(), type='nominal'),
    alt.Y('count()', stack=None),
    alt.Color('income')
).properties(
    height=200
).repeat(
    train_df_nan.select_dtypes('object').columns.to_list(),
    columns=1
)

### 1.4 Separating feature vectors and targets  
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Create `X_train`, `y_train`, `X_test`, `y_test` from `train_df_nan` and `test_df_nan`. 
    
</div>

In [None]:
...

### 1.5 Training?
rubric={reasoning:2}


<div class="alert alert-info" style="color:black">

If you train [`sklearn`'s `SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on `X_train` and `y_train` at this point, would it work? Why or why not?
    
</div>

YOUR ANSWER HERE

## Exercise 2: Preprocessing <a name="3"></a>
<hr>

In this exercise, you'll be wrangling the dataset so that it's suitable to be used with `scikit-learn` classifiers. 

### 2.1 Identifying transformations that need to be applied
rubric={reasoning:7}

<div class="alert alert-info" style="color:black">

Identify the columns on which transformations need to be applied and tell us what transformation you would apply in what order by filling in the table below. Example transformations are shown for the feature `age` in the table.  

Note that for this problem, no ordinal encoding will be executed on this dataset. 

Are there any columns that you think should be dropped from the features? If so, explain your answer.

</div>

| Feature | Transformation |
| --- | ----------- |
| age | imputation, scaling |
| workclass |  |
| fnlwgt |  |
| education |  |
| education_num |  |
| marital_status |  |
| occupation |  |
| relationship |  |
| race |  |
| sex |  |
| capital_gain |  |
| capital_loss |  |
| hours_per_week |  |
| native_country |  |

YOUR ANSWER HERE

### 2.2 Numeric vs. categorical features
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">

Since we will apply different preprocessing steps on the numerical and categorical columns,
we first need to identify the numeric and categorical features and create lists for each of them
(make sure not to include the target column).

*Save the column names as string elements in each of the corresponding list variables below*
    
</div>

In [None]:
numeric_features = [...]
categorical_features = [...]

### 2.3 Numeric feature pipeline
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Let's start making our pipelines. Use `make_pipeline()` or `Pipeline()` to make a pipeline for the numeric features called `numeric_transformer`. 
This pipeline will only have one step, 
the `StandardScaler()`,
so technically we didn't need to make a pipeline,
but it is good to be in the habit of working with pipelines
and it also gives us the option to name this step if we want.
    
</div>
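To illustrate what a one-step scaling pipeline does, here is a minimal sketch on a made-up array. The names `toy_X`, `toy_pipe`, and `scaled` are hypothetical, and the example uses `make_pipeline`, which names the step automatically after the lowercased class name.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric feature.
toy_X = np.array([[1.0], [2.0], [3.0]])

# make_pipeline auto-names the step "standardscaler".
toy_pipe = make_pipeline(StandardScaler())

# After scaling, the column has mean 0 and standard deviation 1.
scaled = toy_pipe.fit_transform(toy_X)
print(list(toy_pipe.named_steps), scaled.ravel())
```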

In [None]:
numeric_transformer = ...

### 2.4 Categorical feature pipeline
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Next, make a pipeline for the categorical features called `categorical_transformer`. 
To keep things simple,
we will impute on all columns,
including those where we did not find missing values in the training data.
Use `SimpleImputer()` with `strategy='most_frequent'`. 
Add a OneHotEncoder as the second step and configure it to ignore unknown values in the test data.
    
</div>
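The effect of imputing with the most frequent value and then one-hot encoding can be seen on a tiny made-up column. This is only a sketch under assumed toy data (`toy_train`, `toy_test`, and the `colour` column are hypothetical); it shows how an encoder configured with `handle_unknown="ignore"` treats a category that never appeared during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one missing value in training, one unseen category in test.
toy_train = pd.DataFrame({"colour": ["red", "blue", np.nan, "red"]}, dtype=object)
toy_test = pd.DataFrame({"colour": ["green"]}, dtype=object)

toy_cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),   # fills NaN with "red"
    OneHotEncoder(handle_unknown="ignore"),    # unseen categories become all zeros
)

# The encoder returns a sparse matrix by default; convert it for printing.
train_encoded = toy_cat_pipe.fit_transform(toy_train).toarray()
test_encoded = toy_cat_pipe.transform(toy_test).toarray()

print(train_encoded)  # columns are ordered alphabetically: blue, red
print(test_encoded)
```

Without `handle_unknown="ignore"`, transforming the unseen category would raise an error instead of producing a row of zeros.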

In [None]:
categorical_transformer = ...

### 2.5 ColumnTransformer
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Create a column transformer that applies our numeric pipeline transformations to the numeric feature columns
and our categorical pipeline transformations to the categorical feature columns.
Assign this column transformer to the variable `preprocessor`.
</div>
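For intuition about how a column transformer routes columns to different transformers, here is a small sketch on made-up data. It uses `make_column_transformer` with bare transformers rather than the pipelines from this assignment, and all names (`toy`, `toy_ct`, `transformed`) are hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({"age": [20, 30, 40], "city": ["A", "B", "A"]})

# Each tuple pairs a transformer with the list of columns it should receive.
toy_ct = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
)

# One scaled column for "age" plus one indicator column per city.
transformed = toy_ct.fit_transform(toy)
print(transformed.shape)
```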

In [None]:
preprocessor = ...

## Exercise 3: Building a Model <a name="4"></a>
<hr>

Now that we have preprocessed features, we are ready to build models. 

### 3.1 Dummy Classifier
rubric={accuracy:3}

<div class="alert alert-info" style="color:black">

Now that we have our preprocessing pipeline set up,
let's move on to the model building.
First,
it's important to build a dummy classifier to establish a baseline score to compare our model to.
Make a `DummyClassifier` that predicts the most common label, train it, and then score it on the training and test sets
(in two separate cells so that both scores are displayed).
    
</div>
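To see why a most-frequent baseline is useful, consider the sketch below on made-up labels: the dummy model ignores the features entirely, so its accuracy equals the share of the majority class. The data and the names `toy_X`, `toy_y`, and `toy_dummy` are assumptions for illustration.

```python
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: three of the four labels belong to the majority class.
toy_X = [[0], [1], [2], [3]]
toy_y = ["<=50K", "<=50K", "<=50K", ">50K"]

# The "most_frequent" strategy always predicts the majority label.
toy_dummy = DummyClassifier(strategy="most_frequent")
toy_dummy.fit(toy_X, toy_y)

# Accuracy equals the majority-class proportion (3 out of 4 here).
baseline_accuracy = toy_dummy.score(toy_X, toy_y)
print(baseline_accuracy)
```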

In [None]:
dummy = ...

In [None]:
...

### 3.2 Main pipeline
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Define a main pipeline that transforms all the different features and uses an `SVC` model with default hyperparameters. 
If you are using `Pipeline` instead of `make_pipeline`, name each of your steps `columntransformer` and `svc` respectively. 
    
</div>

In [None]:
main_pipe = ...

### 3.3 Hyperparameter tuning/optimization

rubric={accuracy:3}

<div class="alert alert-info" style="color:black">

Now that we have our pipelines and a model, let's tune the hyperparameters `gamma` and `C`.
For this tuning,
construct a grid where each hyperparameter can take the values `0.1, 1, 10, 100`
and randomly search for the best combination.

To save some running time on your laptops,
use 3-fold cross-validation to evaluate each result,
search for only 7 iterations,
and set `n_jobs=-1`.
Make sure the search results include the training scores as well as the cross-validation (test) scores,
set `random_state=289`,
and optionally set `verbose=2` if you want to see information as the search is occurring.
Save the search object as `random_search`, since later cells refer to it by that name.
Don't forget to fit the best model from the `RandomizedSearchCV` object
on all the training data as the final step.

*This search is quite demanding computationally so be prepared for this to take 2 or 3 minutes and your fan may start to run!*
    
</div>
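When the estimator is a pipeline, hyperparameters are addressed as `<step name>__<hyperparameter>`. The sketch below shows this naming on synthetic data from `make_classification`; the dataset, grid values, iteration count, and names such as `toy_search` are assumptions chosen to keep the example fast, not the settings requested above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for a real training set.
toy_X, toy_y = make_classification(n_samples=120, n_features=4, random_state=0)

# make_pipeline names the steps "standardscaler" and "svc".
toy_pipe = make_pipeline(StandardScaler(), SVC())

# Keys follow the "<step name>__<hyperparameter>" convention.
toy_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 1, 10]}

toy_search = RandomizedSearchCV(
    toy_pipe,
    toy_grid,
    n_iter=4,                 # sample 4 of the 9 possible combinations
    cv=3,                     # 3-fold cross-validation
    return_train_score=True,  # adds mean_train_score to cv_results_
    random_state=0,
)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)
```

With the default `refit=True`, the search refits the best combination on all the data passed to `fit`, and the result is available as `best_estimator_`.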

### 3.4 Choosing your hyperparameters
rubric={accuracy:2, reasoning:1}

<div class="alert alert-info" style="color:black">

We are displaying the results from the random hyperparameter search
as a dataframe below.
Looking at this table,
which values for `gamma` and `C` would you choose for your final model and why? 
You can answer this either manually by using the table
or by accessing the corresponding attributes from the random search object.

</div>

In [None]:
pd.DataFrame(random_search.cv_results_)[["params", "mean_test_score", "mean_train_score", "rank_test_score"]]

YOUR ANSWER HERE

## Exercise 4: Evaluating on the test set <a name="5"></a>
<hr>

Now that we have a best-performing model, it's time to assess our model on the test set. 

### 4.1 Scoring your final model
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

What is the training and test score of the best scoring model? 
Score the model in two separate cells so that both the training and test scores are displayed.
    
    
</div>

In [None]:
...

In [None]:
...

### 4.2 Assessing your model
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">

Compare your final model accuracy with your baseline model from question 3.1.
Do you consider that our model is performing better than the baseline to such an extent that you would prefer it on deployment data?

Briefly describe one aspect of our model development in this notebook that either supports your confidence in the model we have,
or one possible improvement to what we did here that you think could have increased our model score.
    
</div>

YOUR ANSWER HERE

### Submission to Canvas

**PLEASE READ: When you are ready to submit your assignment do the following:**

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Convert your notebook to .html format by going to File -> Export Notebook As... -> Export Notebook to HTML
- Upload your `.ipynb` file and the `.html` file to Canvas under Assignment 2. 
- **DO NOT** upload any `.csv` files. 

### Congratulations on finishing Assignment 2! Now you are ready to build a simple ML pipeline on real-world datasets!