# BAIT 509 Assignment 2: Preprocessing, Pipelines and Hyperparameter Tuning

__Evaluates__: Lectures 4 - 6. 

__Rubrics__: Your solutions will be assessed primarily on the accuracy of your coding, as well as the clarity and correctness of your written responses. The MDS rubrics provide a good guide as to what is expected of you in your responses to the assignment questions and how the TAs will grade your answers. See the following links for more details:

- [mechanics rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_mech.md): submit an assignment correctly.
- [accuracy rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_accuracy.md): evaluating your code.
- [reasoning rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_reasoning.md): evaluating your written responses.
- [autograde rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_autograde.md): evaluating questions that are either right or wrong (can be done either manually or automatically).

## Tidy Submission 
rubric={mechanics:2}

- Complete this assignment by filling out this jupyter notebook.
- Any place you see `...` or `____`, you must fill in the function, variable, or data to complete the code.
- Use proper English, spelling, and grammar.
- You will submit two files on Canvas:
    1. This jupyter notebook file containing your responses (an `.ipynb` file); and,
    2. An `.html` file of your completed notebook that will render directly on Canvas without having to be downloaded.
        - To generate this HTML file, you can click `File` -> `Export Notebook As` -> `HTML` in JupyterLab, or type the following into a terminal: `jupyter nbconvert --to html_embed assignment.ipynb`.
    
Submit your assignment through UBC Canvas by the deadline listed there.

## Introduction and learning goals <a name="in"></a>
<hr>

Welcome to the assignment! In this assignment, you will:

- Identify when to implement feature transformations such as imputation and scaling.
- Apply `sklearn.pipeline.Pipeline` to build a machine learning pipeline.
- Use `sklearn` for applying numerical feature transformations on the data.
- Identify when it's appropriate to apply ordinal encoding vs one-hot encoding.
- Explain strategies to deal with categorical variables with too many categories.
- Use `ColumnTransformer` to combine all our transformations into one object and use it with `scikit-learn` pipelines.
- Carry out hyperparameter optimization using `sklearn`'s `GridSearchCV` and `RandomizedSearchCV`.

## Introduction <a name="in"></a>
<hr>

A crucial step when using machine learning algorithms on real-world datasets is preprocessing. This assignment will give you some practice building a preliminary supervised machine learning pipeline on a real-world dataset. 

## Exercise 1: Introducing and Exploring the dataset <a name="1"></a>
<hr>

In this assignment, you will be working on a sample of [the adult census dataset](https://www.kaggle.com/uciml/adult-census-income#) that we provide as `census.csv`. We have made some modifications to this data so that it's easier to work with. 

This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. You can find more information on the dataset and features [here](http://archive.ics.uci.edu/ml/datasets/Adult).


*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*

In [1]:
import pandas as pd

census_df = pd.read_csv("census.csv")
census_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,74,State-gov,88638,Doctorate,16,Never-married,Prof-specialty,Other-relative,White,Female,0,3683,20,United-States,>50K
1,41,Private,70037,Some-college,10,Never-married,Craft-repair,Unmarried,White,Male,0,3004,60,?,>50K
2,45,Private,172274,Doctorate,16,Divorced,Prof-specialty,Unmarried,Black,Female,0,3004,35,United-States,>50K
3,38,Self-emp-not-inc,164526,Prof-school,15,Never-married,Prof-specialty,Not-in-family,White,Male,0,2824,45,United-States,>50K
4,52,Private,129177,Bachelors,13,Widowed,Other-service,Not-in-family,White,Female,0,2824,20,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15677,28,Self-emp-not-inc,70100,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,60,United-States,<=50K
15678,23,Private,109307,Assoc-voc,11,Never-married,Exec-managerial,Own-child,White,Male,0,0,40,United-States,<=50K
15679,45,Self-emp-not-inc,184285,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Male,0,0,45,United-States,<=50K
15680,21,Private,89991,Some-college,10,Never-married,Adm-clerical,Own-child,White,Female,0,0,40,United-States,<=50K


### 1.1 Data splitting 
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
To avoid violation of the golden rule, the first step before we do anything is splitting the data. 

Split the data into `train_df` (80%) and `test_df` (20%). Keep the target column (`income`) in the splits so that we can use it in EDA. 
Please use `random_state=893`, so that your results are consistent with what we expect.
    
</div>

In [2]:
...

Ellipsis

In [3]:
from sklearn.model_selection import train_test_split


train_df, test_df = train_test_split(census_df, test_size=0.2, random_state=893)

Let's examine our `train_df`.
You can just follow along for the next few cells.

In [4]:
train_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
11523,31,Self-emp-not-inc,30290,HS-grad,9,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
2234,39,State-gov,122011,Masters,14,Married-civ-spouse,Prof-specialty,Wife,White,Female,5178,0,38,United-States,>50K
3531,63,Self-emp-not-inc,125178,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,>50K
4597,35,State-gov,89040,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K
10095,27,Private,176972,Assoc-voc,11,Never-married,Craft-repair,Not-in-family,White,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7086,32,Private,226696,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,55,United-States,>50K
3620,44,Local-gov,254146,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,>50K
14968,30,Private,252752,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,White,Female,0,0,40,United-States,<=50K
1628,46,Private,243190,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,7688,0,40,United-States,>50K


In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12545 entries, 11523 to 11762
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             12545 non-null  int64 
 1   workclass       12545 non-null  object
 2   fnlwgt          12545 non-null  int64 
 3   education       12545 non-null  object
 4   education_num   12545 non-null  int64 
 5   marital_status  12545 non-null  object
 6   occupation      12545 non-null  object
 7   relationship    12545 non-null  object
 8   race            12545 non-null  object
 9   sex             12545 non-null  object
 10  capital_gain    12545 non-null  int64 
 11  capital_loss    12545 non-null  int64 
 12  hours_per_week  12545 non-null  int64 
 13  native_country  12545 non-null  object
 14  income          12545 non-null  object
dtypes: int64(6), object(9)
memory usage: 1.5+ MB


It looks like things are in order,
but there is a hidden gotcha with this dataframe.
Let's look at the unique values of each column.

In [6]:
from IPython.display import HTML  # This step is just to avoid the long columns being truncated as "..."

HTML(
    train_df
    .select_dtypes(object)
    .apply(lambda x: sorted(pd.unique(x)))
    .to_frame()
    .to_html()
)

Unnamed: 0,0
workclass,"[?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay]"
education,"[10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school, Some-college]"
marital_status,"[Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed]"
occupation,"[?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving]"
relationship,"[Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife]"
race,"[Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White]"
sex,"[Female, Male]"
native_country,"[?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia]"
income,"[<=50K, >50K]"


You can see that there are question marks in the columns "workclass", "occupation", and "native_country".
Unfortunately it seems like the people collecting this data used a non-conventional way to indicate missing/unknown values
instead of using the standard blank/NaN.
Our first step would be to do this conversion manually,
so that `?` is not interpreted as an actual value by our models.

In [7]:
import numpy as np
train_df_nan = train_df.replace("?", np.nan)
test_df_nan = test_df.replace("?", np.nan)
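
To check that the conversion worked, we can count the missing values per column. The sketch below uses a small made-up dataframe (`toy_df` is hypothetical, not the census data) to show the idea: once `?` has been replaced with `NaN`, `isna()` picks up the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the "?" placeholder in the census file
toy_df = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "age": [25, 38, 52, 41],
})

# Same conversion as above: treat "?" as a missing value
toy_df_nan = toy_df.replace("?", np.nan)

# Count missing values per column after the conversion
missing_counts = toy_df_nan.isna().sum()
print(missing_counts)
```

Running the same `isna().sum()` call on `train_df_nan` should show missing values only in the columns that contained `?`.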

### 1.2 Describing your data
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
Use `.describe()` to show summary statistics of each feature in the `train_df_nan` dataframe.
Show numerical and categorical columns separately,
either by selecting the columns of each type yourself,
or by reading the docstring of `.describe()` to figure out how to limit which column types are shown.

</div>

In [8]:
# Numerical
...

In [9]:
# Categorical
...

In [10]:
train_df_nan.describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,12545.0,12545.0,12545.0,12545.0,12545.0,12545.0
mean,40.579514,189935.6,10.59442,1991.478198,123.42774,42.224552
std,12.870556,106284.4,2.617251,10164.524089,478.764687,12.217008
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,31.0,117872.0,9.0,0.0,0.0,40.0
50%,40.0,177995.0,10.0,0.0,0.0,40.0
75%,49.0,236391.0,13.0,0.0,0.0,50.0
max,90.0,1484705.0,16.0,99999.0,3900.0,99.0


In [11]:
train_df_nan.describe(include=object)

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,sex,native_country,income
count,11994,12545,12545,11993,12545,12545,12545,12309,12545
unique,8,16,7,14,6,5,2,40,2
top,Private,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
freq,8501,3567,7427,2107,6577,10933,9161,11283,6309


### 1.3 Identifying potentially important features
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">
    
We have provided you with some code that will visualize the distributions of all the numeric and categorical features in the census data.
Study the visualizations below
to suggest which features you think seem relevant
for the given prediction task of building a model to identify who makes over and under 50k.
List these features and briefly explain your rationale for selecting them.

</div>

YOUR ANSWER HERE

### Solution
They should mention 4-5 of these and not more than 1-2 incorrect ones.
- `education_num`, `age`, `sex`, `occupation`, `marital_status`, `education`, `relationship`

The rationale is simply that there appears to be a difference between the two classes within these features.

In [12]:
import altair as alt

alt.data_transformers.disable_max_rows()  # Allows us to plot big datasets

alt.Chart(train_df.sort_values('income')).mark_bar(opacity=0.6).encode(
    alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=50)),
    alt.Y('count()', stack=None),
    alt.Color('income')
).properties(
    height=200
).repeat(
    train_df_nan.select_dtypes('number').columns.to_list(),
    columns=2
)

In [13]:
alt.Chart(train_df.sort_values('income')).mark_bar(opacity=0.6).encode(
    alt.X(alt.repeat(), type='nominal'),
    alt.Y('count()', stack=None),
    alt.Color('income')
).properties(
    height=200
).repeat(
    train_df_nan.select_dtypes('object').columns.to_list(),
    columns=1
)

### 1.4 Separating feature vectors and targets  
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Create `X_train`, `y_train`, `X_test`, `y_test` from `train_df_nan` and `test_df_nan`. 
    
</div>

In [14]:
...

Ellipsis

In [15]:
# Solution
X_train = train_df_nan.drop(columns='income')
y_train = train_df_nan['income']
X_test = test_df_nan.drop(columns='income')
y_test = test_df_nan['income']

### 1.5 Training?
rubric={reasoning:2}


<div class="alert alert-info" style="color:black">

If you train [`sklearn`'s `SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on `X_train` and `y_train` at this point, would it work? Why or why not?
    
</div>

YOUR ANSWER HERE

**Solution**
It would not work because we still have categorical features and missing values in our dataframe.
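
A minimal sketch of this failure, using a made-up dataframe (`X_toy` and `y_toy` are hypothetical, not the census data): `SVC` expects numeric input, so fitting on a frame that still contains a string column raises a `ValueError`.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Hypothetical toy data with one string column and one missing value
X_toy = pd.DataFrame({
    "age": [25.0, 38.0, np.nan, 41.0],
    "sex": ["Male", "Female", "Female", "Male"],
})
y_toy = ["<=50K", ">50K", "<=50K", ">50K"]

try:
    SVC().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # scikit-learn cannot convert the string column to floats
    fit_failed = True
    print(f"SVC could not be fit: {err}")
```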

## Exercise 2: Preprocessing <a name="3"></a>
<hr>

In this exercise, you'll be wrangling the dataset so that it's suitable to be used with `scikit-learn` classifiers. 

### 2.1 Identifying transformations that need to be applied
rubric={reasoning:7}

<div class="alert alert-info" style="color:black">

Identify the columns on which transformations need to be applied and tell us what transformation you would apply in what order by filling in the table below. Example transformations are shown for the feature `age` in the table.  

Note that for this problem, no ordinal encoding will be executed on this dataset. 

Are there any columns that you think should be dropped from the features? If so, explain your answer.

</div>

| Feature | Transformation |
| --- | ----------- |
| age | imputation, scaling |
| workclass |  |
| fnlwgt |  |
| education |  |
| education_num |  |
| marital_status |  |
| occupation |  |
| relationship |  |
| race |  |
| sex |  |
| capital_gain |  |
| capital_loss |  |
| hours_per_week |  |
| native_country |  |

YOUR ANSWER HERE

### Solution

| Feature | Transformation |
| --- | ----------- |
| age | scaling |
| workclass | imputation, OHE |
| fnlwgt | scaling |
| education | OHE |
| education_num | scaling |
| marital_status | OHE  |
| occupation | imputation, OHE  |
| relationship | OHE  |
| race | OHE  |
| sex | OHE  |
| capital_gain | scaling |
| capital_loss | scaling |
| hours_per_week | scaling |
| native_country | imputation, OHE |

It's fine to apply imputation to all features to make things simpler.
This would also deal with missing values in the test data,
which has both advantages and disadvantages depending on why the values are missing
(it is probably good not to impute blindly and instead let the model fail at first if the test data is very different from the training data).

### 2.2 Numeric vs. categorical features
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">

Since we will apply different preprocessing steps on the numerical and categorical columns,
we first need to identify the numeric and categorical features and create lists for each of them
(make sure not to include the target column).

*Save the column names as string elements in each of the corresponding list variables below*
    
</div>

In [16]:
numeric_features = [...]
categorical_features = [...]

In [17]:
# Solution
# Typing out the list by hand is also OK
numeric_features = train_df_nan.select_dtypes('number').columns.to_list()
categorical_features = train_df_nan.select_dtypes('object').drop(columns='income').columns.to_list()

### 2.3 Numeric feature pipeline
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Let's start making our pipelines. Use `make_pipeline()` or `Pipeline()` to make a pipeline for the numeric features called `numeric_transformer`. 
This pipeline will only have one step, 
the `StandardScaler()`,
so technically we didn't need to make a pipeline,
but it is good to be in the habit of working with pipelines
and it also gives us the option to name this step if we want.
    
</div>

In [18]:
numeric_transformer = ...

In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


numeric_transformer = make_pipeline(StandardScaler())
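
To see what this one-step pipeline does, here is a small sketch on made-up numbers (`hours_toy` is hypothetical). After fitting, the scaled column has mean 0 and standard deviation 1, and the step can be looked up by the name that `make_pipeline` generates from the class name.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column of weekly working hours
hours_toy = np.array([[20.0], [40.0], [60.0]])

toy_numeric = make_pipeline(StandardScaler())
scaled = toy_numeric.fit_transform(hours_toy)

# The scaled values are centred on 0 with unit standard deviation
print(scaled.mean(), scaled.std())

# make_pipeline names each step after its lowercased class name
learned_mean = toy_numeric.named_steps["standardscaler"].mean_
print(learned_mean)
```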

### 2.4 Categorical feature pipeline
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Next, make a pipeline for the categorical features called `categorical_transformer`. 
To keep things simple,
we will impute on all columns,
including those where we did not find missing values in the training data.
Use `SimpleImputer()` with `strategy='most_frequent'`. 
Add a `OneHotEncoder` as the second step and configure it to ignore unknown values in the test data.
    
</div>

In [20]:
categorical_transformer = ...

In [21]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


categorical_transformer = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore')
)
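
The sketch below illustrates both steps on a made-up column (`train_toy` and `test_toy` are hypothetical, not the census data). The missing training value is filled with the most frequent category, and a category that only appears in the test data is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column; dtype=object keeps the values as plain Python objects
train_toy = pd.DataFrame(
    {"workclass": ["Private", "Private", "State-gov", np.nan]}, dtype=object
)
test_toy = pd.DataFrame({"workclass": ["Private", "Never-worked"]}, dtype=object)

toy_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# OneHotEncoder returns a sparse matrix by default, so convert it for printing
train_encoded = toy_transformer.fit_transform(train_toy).toarray()
test_encoded = toy_transformer.transform(test_toy).toarray()

print(train_encoded)  # the NaN row is encoded like "Private"
print(test_encoded)   # the unseen "Never-worked" row contains only zeros
```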

### 2.5 ColumnTransformer
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Create a column transformer that applies our numeric pipeline transformations to the numeric feature columns
and our categorical pipeline transformations to the categorical feature columns.
Assign this column transformer to the variable `preprocessor`.
</div>

In [22]:
preprocessor = ...

In [23]:
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features)
)

## Exercise 3: Building a Model <a name="4"></a>
<hr>

Now that we have preprocessed features, we are ready to build models. 

### 3.1 Dummy Classifier
rubric={accuracy:3}

<div class="alert alert-info" style="color:black">

Now that we have our preprocessing pipeline set up,
let's move on to the model building.
First,
it's important to build a dummy classifier to establish a baseline score to compare our model to.
Make a `DummyClassifier` that predicts the most common label, train it, and then score it on the training and test sets
(in two separate cells so that both scores are displayed).
    
</div>

In [24]:
dummy = ...

In [32]:
...

Ellipsis

In [25]:
# Solution
from sklearn.dummy import DummyClassifier


dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
dummy.score(X_train, y_train)

0.5029095257074532

In [26]:
# Solution
dummy.score(X_test, y_test)

0.48836467963021996
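
The training score above is simply the share of the most common label: `<=50K` appears 6309 times among the 12545 training rows, which is about 0.503. A small sketch on made-up labels (`X_toy` and `y_toy` are hypothetical) shows the same relationship.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data; the dummy model ignores the feature values
X_toy = np.zeros((5, 1))
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K"])

toy_dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
dummy_accuracy = toy_dummy.score(X_toy, y_toy)

# Share of the most common label in the training targets
label_counts = np.unique(y_toy, return_counts=True)[1]
majority_share = label_counts.max() / len(y_toy)

print(dummy_accuracy, majority_share)
```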

### 3.2 Main pipeline
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

Define a main pipeline that transforms all the different features and uses an `SVC` model with default hyperparameters. 
If you are using `Pipeline` instead of `make_pipeline`, name each of your steps `columntransformer` and `svc` respectively. 
    
</div>

In [27]:
main_pipe = ...

In [28]:
# Solution
from sklearn.svm import SVC

main_pipe = make_pipeline(
    preprocessor,
    SVC()
)

### 3.3 Hyperparameter tuning/optimization

rubric={accuracy:3}

<div class="alert alert-info" style="color:black">

Now that we have our pipelines and a model, let's tune the hyperparameters `gamma` and `C`.
For this tuning,
construct a grid where each hyperparameter can take the values `0.1, 1, 10, 100`
and randomly search for the best combination.

To save some running time on your laptops,
use 3-fold cross-validation to evaluate each result,
only search for 7 iterations,
and set `n_jobs=-1`.
Return the training scores as well as the validation scores,
set `random_state=289`,
and optionally `verbose=2` if you want to see information as the search is occurring.
Don't forget to fit the best model from the `RandomizedSearchCV` object
on all the training data as the final step.

*This search is quite demanding computationally so be prepared for this to take 2 or 3 minutes and your fan may start to run!*
    
</div>

In [29]:
from sklearn.model_selection import RandomizedSearchCV


param_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100]
}

random_search = RandomizedSearchCV(
    main_pipe,
    param_grid,
    cv=3,
    n_iter=7,
    return_train_score=True,
    random_state=289,
    n_jobs=-1,
    verbose=2
)
random_search.fit(X_train, y_train)

Fitting 3 folds for each of 7 candidates, totalling 21 fits
[CV] END .........................svc__C=0.1, svc__gamma=0.1; total time=   9.0s
[CV] END .........................svc__C=0.1, svc__gamma=0.1; total time=   9.0s
[CV] END .........................svc__C=0.1, svc__gamma=0.1; total time=   8.5s
[CV] END ..........................svc__C=10, svc__gamma=1.0; total time=  19.7s
[CV] END ..........................svc__C=10, svc__gamma=1.0; total time=  20.3s
[CV] END ..........................svc__C=10, svc__gamma=1.0; total time=  20.0s
[CV] END ..........................svc__C=1.0, svc__gamma=10; total time=  21.2s
[CV] END ..........................svc__C=1.0, svc__gamma=10; total time=  22.2s
[CV] END ..........................svc__C=1.0, svc__gamma=10; total time=  22.3s
[CV] END ..........................svc__C=10, svc__gamma=0.1; total time=   9.0s
[CV] END ..........................svc__C=10, svc__gamma=0.1; total time=   9.0s
[CV] END ..........................svc__C=10, svc

RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(transformers=[('pipeline-1',
                                                                               Pipeline(steps=[('standardscaler',
                                                                                                StandardScaler())]),
                                                                               ['age',
                                                                                'fnlwgt',
                                                                                'education_num',
                                                                                'capital_gain',
                                                                                'capital_loss',
                                                                                'hours_per_week']),
                  

### 3.4 Choosing your hyperparameters
rubric={accuracy:2, reasoning:1}

<div class="alert alert-info" style="color:black">

We are displaying the results from the random hyperparameter search
as a dataframe below.
Looking at this table,
which values for `gamma` and `C` would you choose for your final model and why? 
You can answer this either manually using the table
or by accessing the corresponding attributes from the random search object.

</div>

In [30]:
pd.DataFrame(random_search.cv_results_)[["params", "mean_test_score", "mean_train_score", "rank_test_score"]]

Unnamed: 0,params,mean_test_score,mean_train_score,rank_test_score
0,"{'svc__gamma': 10, 'svc__C': 1.0}",0.592109,0.990514,7
1,"{'svc__gamma': 1.0, 'svc__C': 10}",0.72786,0.97776,4
2,"{'svc__gamma': 0.1, 'svc__C': 0.1}",0.814269,0.821642,1
3,"{'svc__gamma': 1.0, 'svc__C': 100}",0.718932,0.988442,5
4,"{'svc__gamma': 0.1, 'svc__C': 10}",0.811957,0.90275,2
5,"{'svc__gamma': 0.1, 'svc__C': 100}",0.782623,0.941809,3
6,"{'svc__gamma': 10, 'svc__C': 10}",0.595537,0.996732,6


YOUR ANSWER HERE

In [31]:
# Solution
# Why: These have the highest cv score
random_search.best_params_

{'svc__gamma': 0.1, 'svc__C': 0.1}
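
The attributes of a fitted search object can be explored with a sketch like the one below. It uses synthetic data from `make_classification` and a smaller search than ours (all `toy_` names are hypothetical), so the numbers it prints are not comparable to the census results. With the default `refit=True`, the search refits the best pipeline on all the data passed to `fit` and exposes it as `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data so that the example runs quickly
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)

toy_pipe = make_pipeline(StandardScaler(), SVC())
toy_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100],
}

toy_search = RandomizedSearchCV(toy_pipe, toy_grid, n_iter=4, cv=3, random_state=0)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)  # hyperparameters with the best mean cv score
print(toy_search.best_score_)   # the corresponding mean cv score

# With refit=True (the default), the best pipeline is refit on all of X_toy
best_model = toy_search.best_estimator_
print(best_model.score(X_toy, y_toy))
```

Scoring the search object directly, as in `toy_search.score(X_toy, y_toy)`, delegates to this refit `best_estimator_`.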

## Exercise 4: Evaluating on the test set <a name="5"></a>
<hr>

Now that we have a best-performing model, it's time to assess our model on the test set. 

### 4.1 Scoring your final model
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">

What is the training and test score of the best scoring model? 
Score the model in two separate cells so that both the training and test scores are displayed.
    
    
</div>

In [32]:
...

Ellipsis

In [32]:
...

Ellipsis

In [33]:
random_search.score(X_train, y_train)

0.8255878836189717

In [34]:
random_search.score(X_test, y_test)

0.8202103920943576

### 4.2 Assessing your model
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">

Compare your final model accuracy with your baseline model from question 3.1.
Do you consider that our model is performing better than the baseline to such an extent that you would prefer it on deployment data?

Briefly describe one aspect of our model development in this notebook that either supports your confidence in the model we have,
or one possible improvement to what we did here that you think could have increased our model score.
    
</div>

YOUR ANSWER HERE

**Solution**

Main: Our SVC model is ~80% accurate whereas the dummy model was no better than randomly guessing.

Possible supportive arguments:
- We got consistent cross-validation results and test results, which provides further reassurance.
- We have a large dataset, which reduces the probability of unlucky splits.
    
Possible improvements/caveats:
- We only tested one model, SVC.
- We didn't carry out extensive hyperparameter optimization in the interest of time,
  so it might be possible to find a better performing model with more elaborate hyperparameter optimization.  
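- We haven't accounted for missing values in all columns, so we might encounter problems if the deployment data contains missing values in the numeric columns, which our pipeline does not impute.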
- As always, our data might not be representative of the unseen deployment data, e.g. if something has changed in how it was collected.

### Submission to Canvas

**PLEASE READ: When you are ready to submit your assignment do the following:**

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Convert your notebook to .html format by going to File -> Export Notebook As... -> Export Notebook to HTML
- Upload your `.ipynb` file and the `.html` file to Canvas under Assignment 2. 
- **DO NOT** upload any `.csv` files. 

### Congratulations on finishing Assignment 2! Now you are ready to build a simple ML pipeline on real-world datasets!

In [40]:
# Generate student version without solutions
!jupyter nbconvert --to notebook --output=assignment2.ipynb assignment2_solutions.ipynb \
    --TagRemovePreprocessor.enabled=True \
    --TagRemovePreprocessor.remove_cell_tags='{"solution"}' \
    --ClearOutputPreprocessor.enabled=True 

[NbConvertApp] Converting notebook assignment1_solutions.ipynb to notebook
[NbConvertApp] Writing 25043 bytes to assignment1.ipynb
