## Introduction to Machine Learning  

## Assignment 6:  Preprocessing Categorical Variables

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to two videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA), and
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Explain the `handle_unknown="ignore"` hyperparameter of `scikit-learn`'s `OneHotEncoder`.
- Identify when it's appropriate to apply ordinal encoding vs one-hot encoding.
- Explain strategies to deal with categorical variables with too many categories.
- Explain why text data needs a different treatment than categorical variables.
- Use `scikit-learn`'s `CountVectorizer` to encode text data.
- Explain different hyperparameters of `CountVectorizer`.
- Use `ColumnTransformer` to combine all our transformations into one object and use it with `scikit-learn` pipelines.

This assignment covers [Module 6](https://ml-learn.mds.ubc.ca/en/module6) of the online course. You should complete this module before attempting this assignment.

Any place you see `...` or `None`, you must fill in the function, variable, or data to complete the code. Replace it with your completed code and answers, then run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [None]:
# Import libraries needed for this lab
from hashlib import sha1

import altair as alt
import graphviz
import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    StandardScaler,
    normalize,
    scale)
from sklearn.svm import SVC

import test_assignment6 as t
#alt.renderers.enable('mimetype')
alt.data_transformers.disable_max_rows()

## 1. Introducing and Exploring the dataset <a name="1"></a>
<hr>


In this lab you will be working with [the Olympics Games DataSet](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis).

Our problem is to predict the medal type of each example. You can find more information on the dataset and features [here](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis).


*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*


The following starter code preprocesses the data to get rid of rows with `NaN` values in the target column `Medal`.

In [None]:
medal_df = pd.read_csv("data/athlete_events.csv")
medal_df = medal_df.dropna(subset=['Medal'])

**Question 1.1** <br> {points: 1}  

In order to avoid violating the golden rule, before we do anything with the data, let's split it.

Split the data into `train_df` (80%) and `test_df` (20%). 

Keep the target column (`Medal`) in the splits so that we can use it in EDA. 

Make sure to set `random_state=123` for grading purposes. 
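
As a reminder of how an 80/20 split behaves, here is a minimal sketch on made-up data. The dataframe `toy_df` and its column names are hypothetical and are not part of the Olympics dataset; the `random_state` value here is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names are made up for illustration.
toy_df = pd.DataFrame({"feature": range(10), "label": list("ababababab")})

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
# so that the same rows end up in each split on every run.
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=0)

# The target column stays in both splits because we split the whole dataframe.
print(toy_train.shape, toy_test.shape)
```

Splitting the whole dataframe, rather than separate `X` and `y` objects, keeps the target column available for EDA.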


In [None]:
train_df, test_df = None, None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_1(train_df,test_df)

**Question 1.2** <br> {points: 1}  

How many examples are there in our training data? 

Save your answer in an object named `training_size`.

In [None]:
training_size = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
training_size

In [None]:
t.test_1_2(training_size)

**Question 1.3** <br> {points: 3}  

Let's examine our `train_df` a bit. 

What are the youngest and oldest ages of athletes who won a medal in the Olympics?

Save the results in objects `youngest_age` and `oldest_age`. 


In [None]:
youngest_age = None
oldest_age = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'oldest_age' in globals(
), "Please make sure that your solution is named 'oldest_age'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

In [None]:
t.test_1_3_2(youngest_age)

**Question 1.4** <br> {points: 1}  

Look at the column dtypes using `.info()`.

How many non-numeric **features** are there? 

Save the results in an object named `num_cat_feats`.

In [None]:
num_cat_feats = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_4(num_cat_feats)

**Question 1.5** <br> {points: 3}  

Let's take a look at some of the columns and the categories within them. 

Use `.describe()` to answer the following questions. Save the resulting dataframe in an object named `describe_df`.  

a) Which categorical feature has the most unique values? Save this in an object named `most_unique`. 

b) How many binary columns are there? Save this in an object named `binary_cols`. 

c) How many categorical features have missing values? Save this number in an object named `missing_cat`.



In [None]:

most_unique = None
binary_cols = None
missing_cat = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_1_5_1(most_unique)

In [None]:
t.test_1_5_2(binary_cols)

In [None]:
t.test_1_5_3(missing_cat)

**Question 1.6** <br> {points: 2}  

Filter or groupby the `train_df` dataframe to answer the next question. 

Which `NOC` won the most medals? Save this in an object named `most_medals`. 

Which `NOC` won the most `Gold` medals? Save this in an object named `most_gold`. 


In [None]:
most_medals = None
most_gold = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_1_6_1(most_medals)

In [None]:
t.test_1_6_2(most_gold)

We are going to separate feature vectors from the targets.

We are only going to use the following columns:

- `Sex`
- `Age`
- `Height`
- `Weight`
- `NOC`
- `Year`
- `Season`
- `City`
- `Sport`


and use `Medal` as the target column. 

We've created  `X_train`, `y_train`, `X_test`, `y_test` for you. 

In [None]:
X_train = train_df.drop(columns=['ID', 'Name', 'Team', 'Event','Medal', 'Games'])
y_train = train_df['Medal']

X_test = test_df.drop(columns=['ID', 'Name', 'Team', 'Event','Medal', 'Games'])
y_test = test_df['Medal']

X_train.head()

## 2. Preprocessing and building your pipelines

**Question 2.1** <br> {points: 4}  

Before you can start preprocessing our data, you need to identify the binary, categorical, ordinal and numeric columns in your `X_train` and build lists of each feature type. 


Save the column names in lists named  `numeric_feats`, `binary_feats`, `categorical_feats` and `ordinal_feat`.


In [None]:
X_train.head()

In [None]:
numeric_feats = None 
binary_feats = None 
categorical_feats = None 
ordinal_feat = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_1_1(numeric_feats)

In [None]:
t.test_2_1_2(binary_feats)

In [None]:
t.test_2_1_3(categorical_feats)

In [None]:
t.test_2_1_4(ordinal_feat)

**Question 2.2** <br> {points: 1}  

OK, let's start making our pipelines. Use `make_pipeline()` to make a pipeline for the numeric features called `numeric_transformer`. 

Use `SimpleImputer()` with `strategy="median"`. For the second step, make sure to use standardization with `StandardScaler()`.
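
The sketch below shows how imputation followed by scaling behaves on a tiny made-up column. The names `toy_X` and `toy_numeric_pipe` are hypothetical and the data are invented for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with one missing value.
toy_X = np.array([[1.0], [2.0], [np.nan], [9.0]])

# Step 1 fills missing values with the column median; step 2 standardizes.
toy_numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_scaled = toy_numeric_pipe.fit_transform(toy_X)

# make_pipeline names each step after its lowercased class name.
# The learned fill value is the median of the observed values 1, 2 and 9.
print(toy_numeric_pipe.named_steps["simpleimputer"].statistics_)
print(toy_scaled.ravel())
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for imputation.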

In [None]:
numeric_transformer = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_2(numeric_transformer)

**Question 2.3** <br> {points: 1}  

Next, use `make_pipeline()` to make a pipeline for the categorical features called `categorical_transformer`. 

Use `SimpleImputer()` with `strategy="most_frequent"`. 

Make sure to use the necessary one-hot encoding transformer with `dtype=int` and `handle_unknown="ignore"`.
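
To see what `handle_unknown="ignore"` does, here is a small sketch on made-up data. The `colour` column and the `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "blue" never appears in the training portion.
toy_train = pd.DataFrame({"colour": ["red", "green", "red"]})
toy_new = pd.DataFrame({"colour": ["blue"]})

toy_ohe = OneHotEncoder(handle_unknown="ignore", dtype=int)
toy_ohe.fit(toy_train)

# With handle_unknown="ignore", an unseen category is encoded as all zeros
# instead of raising an error at transform time.
encoded_new = toy_ohe.transform(toy_new).toarray()
print(toy_ohe.categories_)
print(encoded_new)
```

Without this setting the default behaviour is to raise an error for unseen categories, which can break cross-validation when a rare category lands only in a validation fold.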

In [None]:
categorical_transformer = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_3(categorical_transformer)

**Question 2.4** <br> {points: 1}  
  
Use `make_pipeline()` to make a pipeline for the binary features called `binary_transformer`. 

Use `SimpleImputer()` with `strategy="most_frequent"`. 

Make sure to use the necessary one-hot encoding transformer with `dtype=int`.

In [None]:
binary_transformer = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_4(binary_transformer)

**Question 2.5** <br> {points: 1}  


Define a column transformer using [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) called `preprocessor` for the numerical, categorical, and remaining feature types.
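
As a reminder of the call shape, the following sketch applies `make_column_transformer` to a made-up two-column dataframe. The columns `height` and `colour` and the `toy_` names are hypothetical, and the transformers are chosen only for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical column.
toy_X = pd.DataFrame(
    {
        "height": [150.0, 160.0, 170.0, 180.0],
        "colour": ["red", "green", "red", "blue"],
    }
)

# Each tuple is (transformer, list of columns that transformer applies to).
toy_preprocessor = make_column_transformer(
    (StandardScaler(), ["height"]),
    (OneHotEncoder(handle_unknown="ignore"), ["colour"]),
)
toy_transformed = toy_preprocessor.fit_transform(toy_X)

# One scaled column plus one column per colour category.
print(toy_transformed.shape)
```

Any column not listed in a tuple is dropped by default, so it is worth checking that every feature you want appears in exactly one list.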


In [None]:
preprocessor = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_5(preprocessor)

## 3. Model Building

**Question 3.1** <br> {points: 1}  

It's important to build a dummy classifier to compare our model to. Make a `DummyClassifier` using `strategy="prior"`. 

Carry out 5-fold cross-validation on `X_train` and `y_train` using `cross_validate()`. Don't forget to include the training score. 

Save the results in a dataframe named `dummy_scores`. 
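
The sketch below shows the general pattern of turning `cross_validate()` output into a dataframe, using made-up data with an imbalanced target. The `toy_` names and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Hypothetical toy data: 20 examples, 14 labelled "a" and 6 labelled "b".
toy_X = np.zeros((20, 1))
toy_y = np.array(["a"] * 14 + ["b"] * 6)

# A dummy classifier ignores the features entirely.
toy_dummy = DummyClassifier(strategy="prior")

# cross_validate returns a dict of arrays with one entry per fold,
# so it converts directly to a dataframe with one row per fold.
toy_scores = pd.DataFrame(
    cross_validate(toy_dummy, toy_X, toy_y, cv=5, return_train_score=True)
)
print(toy_scores[["train_score", "test_score"]].mean())
```

A baseline like this gives the score a model can reach without learning anything from the features, which is the number a real model has to beat.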

In [None]:
dummy_scores = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_3_1(dummy_scores)

**Question 3.2** <br> {points: 1}  


Define a main pipeline called `main_pipe` that transforms all the different features and uses a `RandomForestClassifier` model using `random_state=77` and setting the hyperparameter `n_estimators` to 10. 

In [None]:
main_pipe = None
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_2(main_pipe)

**Question 3.3** <br> {points: 1}  

Perform 5-fold cross-validation on `X_train` and `y_train` using the main pipeline `main_pipe`. Make sure to set `return_train_score=True` and save the result in a dataframe called `scores_df`. 

*Note: This could take 5 minutes. Remember how large our training data is.*

In [None]:
scores_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_3(scores_df)

**Question 3.4** <br> {points: 2}

What are the mean training and cross-validation scores? 

Save the mean training score in `mean_training_score` and the mean cross-validation score in the object named `cv_score`.

In [None]:
mean_training_score = None
cv_score = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
print(mean_training_score, cv_score)

In [None]:
# check that the variable exists
assert 'cv_score' in globals(
), "Please make sure that your solution is named 'cv_score'"

assert 'mean_training_score' in globals(
), "Please make sure that your solution is named 'mean_training_score'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 3.5** <br> {points: 1}

Is the model overfitting or underfitting? 

A) Overfitting

B) Underfitting

C) Neither

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_05`.*

In [None]:
answer3_05 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_05

In [None]:
t.test_3_5(answer3_05)

**Question 3.6** <br> {points: 1}

Which model performed better?

A) `RandomForestClassifier`

B) `DummyClassifier`

C) Neither

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_06`.*

In [None]:
answer3_06 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_06

In [None]:
t.test_3_6(answer3_06)

**Question 3.7** <br> {points: 1}  
Now that we have our pipelines and a model, let's tune the hyperparameter `max_depth`. 

Sweep over the hyperparameters in `param_grid` using `RandomizedSearchCV` with `cv=5`, `n_iter=5` and `return_train_score=True`. Don't forget to set `random_state=77`.

Save your search in an object named `depth_search`. 

You may also want to set `verbose=2` since it may take some time. 

Don't forget to fit `depth_search`.
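
The key in `param_grid` follows the `<step name>__<hyperparameter>` convention for pipelines. The following sketch illustrates it on synthetic data; the `toy_` names, the synthetic dataset, and the candidate values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real training set.
toy_X, toy_y = make_classification(n_samples=100, n_features=5, random_state=0)

toy_pipe = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=5, random_state=0)
)

# make_pipeline names each step after its lowercased class name, so the
# forest's max_depth is addressed as "randomforestclassifier__max_depth".
toy_grid = {"randomforestclassifier__max_depth": [1, 3, 5, 7]}

# n_iter=3 samples 3 of the 4 candidate values instead of trying them all.
toy_search = RandomizedSearchCV(
    toy_pipe, toy_grid, n_iter=3, cv=3, random_state=0, return_train_score=True
)
toy_search.fit(toy_X, toy_y)
print(toy_search.best_params_, toy_search.best_score_)
```

Because only `n_iter` candidates are evaluated, a randomized search trades exhaustive coverage for a shorter run time.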


In [None]:

param_grid = {
    "randomforestclassifier__max_depth": range(1,151,10)
}
depth_search = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_3_7(depth_search)

**Question 3.8** <br> {points: 1}  

Obtain the cross-validation results from the search using `depth_search.cv_results_`.

Select the columns:

- `mean_test_score`
- `param_randomforestclassifier__max_depth`
- `mean_fit_time`
- `rank_test_score`

Sort your values in ascending order of `rank_test_score`. 

Make sure to save it as a dataframe and display it. Save this as an object named `grid_results`.
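
For reference, `cv_results_` is a dictionary of arrays, so it converts directly to a dataframe. The sketch below shows the select-and-sort pattern on a small, unrelated search; the `toy_` names, the synthetic data, and the use of `GridSearchCV` with a decision tree are assumptions chosen to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a tiny search, used only to produce a cv_results_ dict.
toy_X, toy_y = make_classification(n_samples=100, n_features=5, random_state=0)
toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), {"max_depth": [1, 2, 3]}, cv=3
)
toy_search.fit(toy_X, toy_y)

# One row per candidate; keep a few columns and put the best rank first.
toy_results = pd.DataFrame(toy_search.cv_results_)[
    ["mean_test_score", "param_max_depth", "mean_fit_time", "rank_test_score"]
].sort_values("rank_test_score")
print(toy_results)
```

Note that the parameter column is named `param_` followed by the key used in the search grid.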

In [None]:
grid_results = None


# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_8(grid_results)

**Question 3.9** <br> {points: 1} 

What is the best hyperparameter value for `max_depth`? Save it in an object named `best_depth`. 

What was the corresponding validation score for it? Save this in an object named `best_depth_score`. 

*Hint: `.best_params_`  and `.best_score_` are helpful here.* 

In [None]:
best_depth = None 

best_depth_score = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_9(depth_search, best_depth, best_depth_score)

## 4. Evaluating on the test set <a name="5"></a>
<hr>

Now that we have a best-performing model, it's time to assess it on the test set we set aside. 

**Question 4.1** <br> {points: 2} 

What is the training score of the best scoring model? Save the result in an object named `train_score`. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
assert 'train_score' in globals(
), "Please make sure that your solution is named 'train_score'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 4.2** <br> {points: 1} 


What is the test score of the best model? 

Score the best model from `depth_search` on `X_test` and `y_test`. 

Save the result in an object named `test_score`. 


In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_2(test_score)

## 5. Text Data

Let's develop our own SMS spam filtering system using Kaggle's [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset) that was originally referenced from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). 

We will use `CountVectorizer` to encode text messages and `SVC` for classification. 

**Sorry for the offensive language in some text messages; it's the reality of such platforms 😔. If you are sensitive to such language try not to read the raw messages.** 

In [None]:
sms_df = pd.read_csv("data/spam.csv", encoding="latin-1")
sms_df = sms_df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})

In [None]:
sms_df.shape

**Question 5.1** <br> {points: 1}  

Split `sms_df` into train (80%) and test splits (20%) setting `random_state=123`. 
Name your objects `text_train_df` and `text_test_df`. 
Examine the first few rows of the train portion. 

In [None]:
text_train_df, text_test_df = None, None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_1(text_train_df, text_test_df)

**Question 5.2** <br> {points: 1}  

Split both `text_train_df` and `text_test_df` into the target and feature columns. Here,  `target` is the target column (`y`) and `sms` is the column in your `X`. 
    
Name your objects `X_text_train`, `y_text_train`, `X_text_test` and `y_text_test`.

*Hint: Make sure that you are using single brackets (a Pandas Series) for your target (y) objects. The tests will not pass unless your y variables are of type Pandas Series. This can be done by selecting the column target with single square brackets.*
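
The difference between single and double brackets can be checked on a small made-up dataframe. The name `toy_df` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical two-row dataframe with the same column names as sms_df.
toy_df = pd.DataFrame({"target": ["ham", "spam"], "sms": ["hello", "win cash"]})

as_series = toy_df["target"]   # single brackets give a pandas Series
as_frame = toy_df[["target"]]  # double brackets give a one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```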

In [None]:
X_text_train = None
y_text_train = None
X_text_test = None
y_text_test = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_2(X_text_train, X_text_test, y_text_train, y_text_test)

**Question 5.3** <br> {points: 2}  

Note that in the case of text data, the usual EDA is not applicable. In this exercise we will carry out some simple EDA to get a sense of the data.  

What's the label distribution in the target column (How many `ham` and how many `spam` values do you have in the column `target`) in the training set? 

Save the result in an object named `target_freq`.

The autograder is expecting an answer as a pandas series. 

*Hint: There is a function that we use quite often that will give us the frequency of each category in a column.*

In [None]:
target_freq = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
assert 'target_freq' in globals(
), "Please make sure that your solution is named 'target_freq'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 5.4** <br> {points: 1} 

What's the average length in characters of text messages? Save the value, rounded to the nearest whole number, in an object named `avg_text`. 

*Hint: `str.len()` may come in handy here.* 
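
As a quick illustration of `str.len()`, here is a sketch on three made-up strings. The series `toy_sms` is hypothetical.

```python
import pandas as pd

# Hypothetical messages, not taken from the SMS dataset.
toy_sms = pd.Series(["hi", "call me", "free prize now"])

# .str.len() returns the number of characters in each string,
# counting spaces as characters.
toy_lengths = toy_sms.str.len()

# The mean of 2, 7 and 14 is about 7.67, which rounds to 8.
toy_avg = round(toy_lengths.mean())
print(toy_lengths.tolist(), toy_avg)
```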

In [None]:
avg_text = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_4(avg_text)

**Question 5.5** <br> {points: 1} 

Would you classify the `sms` column as a categorical column? Does it make sense to carry out one-hot encoding on this column?

A) It is a categorical column and I would carry out one-hot encoding on this column.

B) It is a categorical column and I would **NOT** carry out one-hot encoding on this column.

C) It is a free text column and I would carry out one-hot encoding on this column.

D) It is a free text column and I would **NOT** carry out one-hot encoding on this column.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_05`.*

In [None]:
answer5_05 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer5_05

In [None]:
t.test_5_5(answer5_05)

**Question 5.6** <br> {points: 0}  
Import `CountVectorizer` from the appropriate library. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_6()

**Question 5.7** <br> {points: 1} 

Transform the training data using `CountVectorizer` with default parameters. Create an object named `vec`, fit it on `X_text_train` and `y_text_train` and transform `X_text_train`. 

Save the newly transformed `X_text_train` in an object named `transformed_X_train`. 
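
The fit-and-transform pattern looks like the sketch below on a made-up three-message corpus. The names `toy_corpus` and `toy_vec` are hypothetical, and the corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus of three short messages.
toy_corpus = ["free prize free", "call me now", "free call"]

toy_vec = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per message.
toy_counts = toy_vec.fit_transform(toy_corpus)

print(sorted(toy_vec.vocabulary_))
print(toy_counts.toarray())
```

The result is stored as a sparse matrix because most entries are zero; `.toarray()` is only practical for small examples like this one.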

In [None]:
vec = None
transformed_X_train = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_7(transformed_X_train)

**Question 5.8** <br> {points: 1} 

How many features have been created to represent each text message? 

Save the value in an object named `vocab_size`.

In [None]:
vocab_size = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_8(vocab_size)

**Question 5.9** <br> {points: 2} 

What do each feature and each feature value represent? 

A) A word in the corpus with the value representing the number of times the word occurs in the given text message.

B) A text message in the corpus with the value representing the distance from the closest text in the corpus.

C) An example in the corpus with the value representing the length of the text message.

D) A sentence in the corpus with the value representing the number of times the sentence occurs in the given text message.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_09`.*

In [None]:
answer5_09 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer5_09

In [None]:
assert 'answer5_09' in globals(
), "Please make sure that your solution is named 'answer5_09'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 5.10** <br> {points: 1} 

Build a pipeline named `dummy_pipe` for feature extraction using `CountVectorizer` with `binary=True` and `DummyClassifier` with strategy equal to `most_frequent`.

Use `cross_validate()` with `dummy_pipe` on `X_text_train` and `y_text_train`, setting `cv=5` and `return_train_score=True`, to obtain the train and test scores. 

Save this in a dataframe named `dummy_scores`. 

In [None]:
dummy_pipe = None
dummy_scores = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_10(dummy_pipe, dummy_scores)

**Question 5.11** <br> {points: 1} 

What are the mean values of the columns in `dummy_scores`? Save this in an object named `dummy_scores_mean`.

In [None]:
dummy_scores_mean = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_11(dummy_scores_mean)

**Question 5.12** <br> {points: 1} 

Very often, representing your free-text feature values in a binary format works better in practice than the default one, so we are going with that. 
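
To make the difference between the default and `binary=True` concrete, the sketch below encodes the same made-up corpus both ways. The `toy_` names and the corpus are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus in which "free" is repeated within one message.
toy_corpus = ["free free free prize", "call me"]

# Default behaviour versus binary=True on the same corpus.
toy_default = CountVectorizer().fit_transform(toy_corpus).toarray()
toy_binary = CountVectorizer(binary=True).fit_transform(toy_corpus).toarray()

# The two arrays differ only in the entry for the repeated word.
print(toy_default)
print(toy_binary)
```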

Now build a pipeline named `svc_pipe_binary` for feature extraction using `CountVectorizer` with `binary=True` and `SVC` with default hyperparameters. Make sure you are using `make_pipeline()` for this. 

Cross-validate `svc_pipe_binary` using `X_text_train` and `y_text_train`, setting `cv=5` and `return_train_score=True`.  

Save the results in a dataframe named `svc_scores`. 


In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_12(svc_scores)

**Question 5.13** <br> {points: 1} 

What are the mean values of the columns in `svc_scores`? Save this in an object named `svc_scores_mean`.

In [None]:
svc_scores_mean = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_13(svc_scores_mean)

**Question 5.14** <br> {points: 1} 

Are you getting better results with `SVC` compared to `DummyClassifier`?

A) I am getting better results with `SVC`.

B) I am getting better results with `DummyClassifier`.


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_14`.*

In [None]:
answer5_14 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer5_14

In [None]:
t.test_5_14(answer5_14)

## Attributions
- The Olympics Games DataSet - [Kaggle](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis)

- The SMS Spam Collection Dataset - [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset) and [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)

    *Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011*


- MDS DSCI 571 - Supervised Learning I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_571_sup-learn-1) 


## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  