# DSCI 571 - Supervised Learning I
# Lab 3: Text classification and hyperparameter optimization

## Table of Contents

- [Submission instructions](#si)
- [Introduction](#in)
- [Exercise 1: Introducing the dataset and EDA](#1)
- [Exercise 2: Preprocessing](#2)
- [Exercise 3: Model building](#3)
- [Exercise 4: Hyperparameter optimization](#4)
- [Exercise 5: Test results](#5)

In [None]:
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer

# train test split and cross validation
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC

## Submission instructions <a name="si"></a>
<hr>
rubric={mechanics:2}

You will receive marks for correctly submitting this assignment. 

To correctly submit this assignment, follow the instructions below:

- Push your assignment to your GitHub repository. 
- Add a link to your GitHub repository here: LINK TO YOUR GITHUB REPO 
- Upload an HTML render of your assignment to Canvas. The last cell of this notebook will help you do that.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

[Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.

**NOTE: The data you download for use in this lab SHOULD NOT BE PUSHED TO YOUR REPOSITORY. You might be penalised for pushing datasets to your repository. I have seeded the repository with a `.gitignore` file and am hoping that it won't let you push CSVs.**

## Introduction <a name="in"></a>
<hr>

In this lab, we'll focus on two things:
1. Working with text data
2. Hyperparameter optimization

As this is a quiz week, this lab is a bit lighter compared to other labs.  

## Exercise 1: Introducing the dataset and EDA <a name="1"></a>
<hr>

Let's develop our own SMS spam filtering system using Kaggle's [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset). We will use `CountVectorizer` to encode text messages and `SVC` for classification. Download the data CSV into the lab folder. **Sorry for the offensive language in some text messages; it's the reality of such platforms 😔. If you are sensitive to such language, try not to read the raw messages.** 

The starter code below reads the CSV, assuming that it's present in the current directory, and renames the columns. As usual, do not push the CSV to your repo. 

In [None]:
### BEGIN STARTER CODE

sms_df = pd.read_csv("spam.csv", encoding="latin-1")
sms_df = sms_df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})

### END STARTER CODE

### 1.1 Data splitting 
rubric={reasoning:1}

**Your tasks:**

1. Split `sms_df` into train (80%) and test splits (20%). 
2. Examine the first few rows of the train portion. 

In [None]:
### YOUR ANSWER HERE

### 1.2 Simple EDA 
rubric={accuracy:3,reasoning:2}

Note that in the case of text data, the usual EDA is not applicable. In this exercise, we will carry out some simple EDA to get a sense of the data.  

**Your tasks:**

1. What's the label distribution in the target column? 
2. What's the average length in characters of text messages? Show the shortest and longest text messages. 
3. Would you classify the `sms` column as a categorical column? Does it make sense to carry out one-hot encoding on this column? Why or why not? 

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### (optional) 1.3 Word clouds
rubric={reasoning:1}

**Your tasks:**
1. Create two word clouds: one for text messages with the `spam` target and the other for non-spam (`ham`) text messages. You may use [the `wordcloud` package](https://github.com/amueller/word_cloud) for this, which you will have to install in your environment.  

In [None]:
### YOUR ANSWER HERE

## Exercise 2: Preprocessing <a name="2"></a>
<hr>

We will be encoding the text data using [sklearn's `CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). In this exercise we will explore different options of `CountVectorizer`. 

### 2.1 `CountVectorizer` with default parameters
rubric={accuracy:2,reasoning:2}

1. Transform the training data using `CountVectorizer` with default parameters. 
2. How many features have been created to represent each text message? What does each feature represent? 

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE 

### 2.2 `CountVectorizer` transformer with `binary=True`
rubric={accuracy:2,reasoning:1}

1. Transform the training data using `CountVectorizer` with `binary=True`. 
2. What does each feature represent with this option? 

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE 

### 2.3 `max_features` hyperparameter of `CountVectorizer` 
rubric={accuracy:2,reasoning:5}

1. Now pass `max_features=10` to `CountVectorizer` and transform the training data again. 
2. How many features have been created to represent each text message now?
3. Are we likely to overfit or underfit with fewer features? 
4. What would happen if you encounter a word in the test data that is not among the `max_features` words kept from the training data? 

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE 

## Exercise 3: Model building <a name="3"></a>
<hr>

### 3.1 `DummyClassifier`
rubric={accuracy:2}

**Your tasks:**
1. Build a pipeline for feature extraction using `CountVectorizer` with `binary=True` and `DummyClassifier`.
2. Report mean cross-validation scores of the pipeline. 

In [None]:
### YOUR ANSWER HERE

### 3.2 `SVC` with default parameters
rubric={accuracy:2,reasoning:1}

1. Now build a pipeline for feature extraction using `CountVectorizer` with `binary=True` and `SVC` with default hyperparameters.
2. Are you getting better results with `SVC` compared to `DummyClassifier`?


In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

## Exercise 4: Hyperparameter optimization <a name="4"></a>
<hr>

So far we have been writing loops to try a bunch of different hyperparameter values and pick the one with the lowest validation (or cross-validation) error. This operation is so common, in fact, that `scikit-learn` has some [built-in methods](https://scikit-learn.org/stable/modules/grid_search.html) to do it for you. In this exercise, we will focus on two such methods:

1. [`sklearn.model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) 
2. [`sklearn.model_selection.RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
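For reference, the kind of manual loop that these methods replace might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the SMS dataset), and the names `X_demo`, `y_demo`, and `mean_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the SMS dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Try each candidate value and record its mean cross-validation accuracy.
mean_scores = {}
for gamma in [0.01, 0.1, 1.0]:
    scores = cross_val_score(SVC(gamma=gamma), X_demo, y_demo, cv=5)
    mean_scores[gamma] = scores.mean()

# Highest mean accuracy corresponds to lowest cross-validation error.
best_gamma = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best gamma:", best_gamma)
```

The loop grows one nesting level per hyperparameter, which is part of why the built-in search classes are convenient.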

### 4.1 Optimizing `gamma` for RBF SVM 
rubric={accuracy:4,reasoning:2}

1. Carry out hyperparameter search over `gamma` by sweeping the hyperparameter through the values $10^{-3}, \ldots, 10^{-1}, 1, \ldots, 10^{3}$ using [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and 10-fold cross-validation. (The `param_grid` is given in the starter code below.) 
2. Note the best hyperparameter value and the corresponding best cross-validation score. Compare the score with exercise 3.2 (i.e., `SVC` with default parameters). 

A few tips about `GridSearchCV`: 
- The starter code below defines the parameter grid for `gamma`, which you can pass to your `GridSearchCV`. Note the syntax `clf__gamma`. We have two steps in our pipeline, and we can access the parameters of a step by joining the step name and the parameter name with a double underscore (`__`). So `clf__gamma` means the `gamma` parameter of the `clf` step of the pipeline. 
- Setting `n_jobs=-1` should speed it up (if you have a multi-core processor).
- Similar to `cross_validate` you can pass `return_train_score=True` to your `GridSearchCV` object.
- After running `fit` on the `GridSearchCV` object, 
    - you can access the best hyperparameter values with `grid.best_params_` and the best score with `grid.best_score_` if `grid` is your `GridSearchCV` object.
    - you can access mean train and cross-validation scores for all hyperparameter values via the `grid.cv_results_` dictionary. 
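The tips above can be sketched on a tiny made-up dataset as follows. Everything here is illustrative: the toy texts and labels, the step name `vec`, the short list of `gamma` values, and the use of 2-fold cross-validation are all assumptions chosen to keep the example small, and are not the settings the exercise asks for.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data, repeated so that every CV fold contains both classes.
toy_texts = [
    "win a free prize now",
    "free cash call now",
    "are we still on for lunch",
    "see you at home tonight",
] * 5
toy_labels = ["spam", "spam", "ham", "ham"] * 5

# Naming the steps explicitly is what makes the "clf__gamma" syntax work.
toy_pipe = Pipeline(
    [
        ("vec", CountVectorizer(binary=True)),  # "vec" is an assumed step name
        ("clf", SVC()),
    ]
)

toy_grid = GridSearchCV(
    toy_pipe,
    param_grid={"clf__gamma": [0.01, 0.1, 1.0]},
    cv=2,
    return_train_score=True,
)
toy_grid.fit(toy_texts, toy_labels)

print(toy_grid.best_params_)                        # dict keyed by "clf__gamma"
print(toy_grid.best_score_)                         # mean CV score of the best setting
print(toy_grid.cv_results_["mean_train_score"])     # one entry per gamma value
print(toy_grid.cv_results_["mean_test_score"])      # "test" here means validation folds
```

Wrapping `toy_grid.cv_results_` in `pd.DataFrame(...)` is often a convenient way to inspect all of the recorded scores at once.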

In [None]:
### YOUR ANSWER HERE

In [None]:
### BEGIN STARTER CODE

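# np.arange excludes the stop value, so (-3, 4) gives exponents -3, ..., 3,
# i.e., gamma = 10^-3, ..., 10^3 as described in the exercise text.
param_grid = {"clf__gamma": 10.0 ** np.arange(-3, 4)}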

### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 4.2 Jointly optimizing `C` and `gamma`
rubric={accuracy:4,reasoning:2}

Let's optimize the `C` hyperparameter along with `gamma`. 

**Your tasks:**

1. Expand your search to cover the `C` hyperparameter in addition to `gamma`, sweeping the hyperparameter through values $10^{-3}, 10^{-2}, \ldots, 10^{3}$. Use the same `gamma` values from Exercise 4.1. 
2. Did you get the same best `gamma` value that you got when optimizing `gamma` only? Why or why not?  

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 4.3 Optimizing `C`, `gamma`, and `max_features` jointly with `RandomizedSearchCV`
rubric={accuracy:4,reasoning:3}

In addition to `GridSearchCV` there are other approaches like `RandomizedSearchCV`, which, as its name implies, tries random hyperparameter configurations instead of performing an exhaustive grid search. In this exercise we will explore `RandomizedSearchCV`. 
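To make the contrast with grid search concrete, here is a minimal sketch of the `RandomizedSearchCV` interface on made-up data. The toy texts, the step name `vec`, the choice of `scipy.stats.loguniform` and `randint` as sampling distributions, and the small `n_iter` and `cv` values are all assumptions for illustration; the exercise leaves these choices to you.

```python
from scipy.stats import loguniform, randint
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data, repeated so that every CV fold contains both classes.
toy_texts = [
    "win a free prize now",
    "free cash call now",
    "are we still on for lunch",
    "see you at home tonight",
] * 5
toy_labels = ["spam", "spam", "ham", "ham"] * 5

toy_pipe = Pipeline(
    [
        ("vec", CountVectorizer(binary=True)),  # "vec" is an assumed step name
        ("clf", SVC()),
    ]
)

# Distributions to sample from (lists of candidate values are also accepted).
param_distributions = {
    "vec__max_features": randint(2, 10),  # integers 2, ..., 9
    "clf__C": loguniform(1e-3, 1e3),
    "clf__gamma": loguniform(1e-3, 1e3),
}

toy_search = RandomizedSearchCV(
    toy_pipe,
    param_distributions,
    n_iter=5,        # number of random configurations to try
    cv=2,
    random_state=0,  # makes the sampled configurations reproducible
)
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)
print(toy_search.best_score_)
```

Unlike a grid, the cost here is controlled directly by `n_iter`, regardless of how many hyperparameters are being searched.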

**Your tasks:**
1. Jointly optimize the `C` and `gamma` hyperparameters of the RBF `SVC` and the `max_features` hyperparameter of `CountVectorizer` using [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html). For `C` and `gamma`, you may use the range of hyperparameter values from the previous questions. For `max_features`, use a range of values of your choice. 
2. Name a situation in which random search would be strongly preferable over a grid search, and briefly explain why.  

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### (optional) 4.4: More sophisticated hyperparameter tuning. 
rubric={reasoning:1} 

There are all sorts of software packages that make hyperparameter tuning with `scikit-learn` even more automated. For example:

- [hyperopt-sklearn](https://github.com/hyperopt/hyperopt-sklearn)
- [auto-sklearn](https://github.com/automl/auto-sklearn)
- [SigOptSearchCV](https://sigopt.com/docs/overview/scikit_learn)

Give one of these a try and report your thoughts. Or, if you're even more adventurous, you could try a package that isn't tied to `scikit-learn`. There are many options for you to play around with in your ample free time:

- [TPOT](https://github.com/rhiever/tpot)
- [hyperopt](https://github.com/hyperopt/hyperopt)
- [hyperband](https://github.com/zygmuntz/hyperband)
- [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/)
- [MOE](https://github.com/Yelp/MOE)
- [pybo](https://github.com/mwhoffman/pybo)
- [spearmint](https://github.com/HIPS/Spearmint)
- [BayesOpt](https://github.com/rmcantin/bayesopt)

Note: this list isn't meant to be exhaustive. 

In other news, the recently announced [Amazon SageMaker](https://aws.amazon.com/sagemaker/) is also supposed to do hyperparameter optimization for you (among many other things it does).

## Exercise 5: Test results <a name="5"></a>
<hr>

Now that we have done extensive hyperparameter search, it's time to try our best model on the test split. 

### 5.1 Report test scores
rubric={accuracy:2,reasoning:2}

**Your tasks:**

1. Fit the `best_estimator_` from the `RandomizedSearchCV` in 4.3 on `X_train` and `y_train`. 
2. Score the fitted model on `X_test` and `y_test`. 
3. Compare your test scores with your best estimator scores from the previous exercise. 

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE