## Introduction to Machine Learning  

## Assignment 8: Linear Models

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to 2 videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Explain the general intuition behind linear models.
- Explain the `fit` and `predict` paradigm of linear models.
- Use `scikit-learn`'s `LogisticRegression` classifier.
    - Use `fit`, `predict` and `predict_proba`.   
    - Use `coef_` to interpret the model weights.
- Explain the advantages and limitations of linear classifiers. 
- Apply a scikit-learn regression model (e.g., `Ridge`) to regression problems.
- Relate the Ridge hyperparameter `alpha` to the `LogisticRegression` hyperparameter `C`.


This assignment covers [Module 8](https://ml-learn.mds.ubc.ca/en/module8) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Replace each `None` with your completed code or answer, then run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [None]:
# Import libraries needed for this lab
from hashlib import sha1

import altair as alt
import graphviz
import numpy as np
import pandas as pd
import string
from sklearn import tree
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    StandardScaler,
    normalize,
    scale)
from sklearn.metrics import plot_confusion_matrix, classification_report
from sklearn.svm import SVC, SVR

from scipy.stats import lognorm, loguniform, randint

import test_assignment8 as t
#alt.renderers.enable('mimetype')
alt.data_transformers.disable_max_rows()

# 1. Sentiment analysis on the IMDB dataset 

<img src="https://ia.media-imdb.com/images/M/MV5BMTk3ODA4Mjc0NF5BMl5BcG5nXkFtZTgwNDc1MzQ2OTE@._V1_.png"  width = "40%" alt="404 image" />

In this exercise, you will carry out sentiment analysis on a real corpus, [the IMDB movie review dataset](https://www.kaggle.com/utathya/imdb-review-dataset).
The starter code below loads the data CSV file (assuming that it's in the data directory) as a pandas DataFrame called `imdb_df`.

We have done a bit of preprocessing on the dataset and we will use the train/test split that's already provided.

In [None]:
imdb_df = pd.read_csv("data/imdb_speed.csv")
train_df = imdb_df[imdb_df['type'] == "train"]
test_df = imdb_df[imdb_df['type'] == "test"]
train_df.head()

**Question 1.1** <br> {points: 1}  

Let's now separate our feature vectors from the target.

Use the column `review` as your `X` and the `label` column as your target `y`. 

You will need to do this for both `train_df` and `test_df`.

Save the results in objects named `X_train`, `y_train`, `X_test` and `y_test`. 

(Make sure that all 4 of these objects are of type Pandas Series. We will be using `CountVectorizer` in later questions, and this transformer expects a Pandas Series as input.)
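
To see why the Series type matters, here is a minimal sketch on a made-up three-row dataframe (`toy_df`, `X_toy` and `X_wrong` are hypothetical names, not part of the assignment). It assumes that `CountVectorizer` simply iterates over whatever it is given: iterating over a Series yields one review per element, while iterating over a DataFrame yields its column names.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "review": ["good movie", "bad movie", "good good plot"],
    "label": ["pos", "neg", "pos"],
})

X_toy = toy_df["review"]      # single brackets -> Series, one document per element
X_wrong = toy_df[["review"]]  # double brackets -> DataFrame with one column

# With a Series, each review becomes one row of the count matrix
counts = CountVectorizer().fit_transform(X_toy)
print(type(X_toy).__name__, counts.shape)

# With a DataFrame, the vectorizer only sees the column name "review"
wrong_counts = CountVectorizer().fit_transform(X_wrong)
print(type(X_wrong).__name__, wrong_counts.shape)
```

The second call does not raise an error, which is what makes this mistake easy to miss.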

In [None]:
X_train, y_train = None, None
X_test, y_test = None, None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_1(X_train,y_train,X_test,y_test)

**Question 1.2** <br> {points: 1}  

What is the distribution of target values (`label`) in the train split? Your answer should be of type Pandas Series and saved in an object named `class_dist`.

In [None]:
class_dist = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_2(class_dist)

**Question 1.3** <br> {points: 1}  

Do any of your columns have any null values? 

A) Yes

B) No

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""` and assign it to an object called `answer1_3`.*

In [None]:
answer1_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_3



In [None]:
t.test_1_3(answer1_3)

**Question 1.4** <br> {points: 2}  

***Challenge question!***

How many words are present in each review? 

Add a column `review_wordcount` to the `train_df` dataframe and save this new dataframe as an object named `review_length_df`.


In [None]:
review_length_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_4(review_length_df)

**Question 1.5** <br> {points: 3}  

What is the average word count for each review label (pos and neg)?

Save the average word count for negative reviews and the average word count for positive reviews, each rounded to the nearest whole number, in objects named `neg_wc_avg` and `pos_wc_avg` respectively.

In [None]:
neg_wc_avg = None
pos_wc_avg = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'neg_wc_avg' in globals(
), "Please make sure that your solution is named 'neg_wc_avg'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

In [None]:
t.test_1_5_2(pos_wc_avg)

**Question 1.6** <br> {points: 2}  

Plot the average review wordcount per label in a bar chart. 

Save the plot in an object named `plot_avg_wc`.

Remember to provide a title to your plot as well.

*Hint: remember you can plot `groupby` objects and when you do so, you'll need to reset your index.*

In [None]:
plot_avg_wc = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_1_6(plot_avg_wc)

**Question 1.7** <br> {points: 1}  

Let's make a baseline model using `DummyClassifier`.

Build a `DummyClassifier` named `dummy` using `strategy='most_frequent'`. Perform cross-validation on the training portion. Make sure that you return the training score using `return_train_score=True`. 

Save the results in a dataframe named `dummy_scores`.
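
As a rough sketch of the pattern (not the assignment solution), the snippet below runs `cross_validate` with a `DummyClassifier` on made-up data. The names `X_toy`, `y_toy`, `toy_dummy` and `toy_scores` are hypothetical, and `cv=2` is chosen only to keep the toy example small.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Hypothetical toy data: the features are ignored by the dummy model
X_toy = np.zeros((10, 1))
y_toy = np.array(["pos"] * 6 + ["neg"] * 4)

toy_dummy = DummyClassifier(strategy="most_frequent")

# cross_validate returns a dict of arrays; wrapping it in a DataFrame
# gives one row per fold
toy_scores = pd.DataFrame(
    cross_validate(toy_dummy, X_toy, y_toy, cv=2, return_train_score=True)
)
print(toy_scores[["test_score", "train_score"]])
```

Because the dummy model always predicts the majority class, its accuracy in each fold is just the share of that class in the fold.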

In [None]:
dummy = None
dummy_scores = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_7(dummy_scores)

**Question 1.8** <br> {points: 0}

Import `CountVectorizer` and `LogisticRegression`.

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_8()

**Question 1.9** <br> {points: 1}  

Build a pipeline named `lr_pipe` that uses the `CountVectorizer()` transformer followed by the logistic regression model (set `max_iter=2000`; this will help avoid convergence warnings).

Perform 5-fold cross-validation on the training set using `lr_pipe` and return the training score. Save the results in a dataframe named `lr_scores`.
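
The general shape of such a pipeline can be sketched on a tiny made-up corpus. Everything named `toy_*`, as well as `docs` and `labels`, is hypothetical and only for illustration; the real reviews are far longer and noisier.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, repeated so that every fold sees both classes
docs = pd.Series(
    ["great film", "loved it", "great acting",
     "awful film", "hated it", "awful acting"] * 5
)
labels = pd.Series(["pos", "pos", "pos", "neg", "neg", "neg"] * 5)

# make_pipeline names each step after its lowercased class name
toy_pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=2000))
print(list(toy_pipe.named_steps))

toy_cv = pd.DataFrame(
    cross_validate(toy_pipe, docs, labels, cv=5, return_train_score=True)
)
print(toy_cv[["test_score", "train_score"]])
```

Putting the vectorizer inside the pipeline means the vocabulary is rebuilt from the training portion of each fold, so the validation portion never leaks into it.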

In [None]:
lr_pipe = None
lr_scores = None
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_9(lr_pipe,lr_scores)

**Question 1.10** <br> {points: 1} 

What is the mean of each column in `lr_scores`?

Save your result in an object named `lr_mean`. 

In [None]:
lr_mean = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_10(lr_mean)

**Question 1.11** <br> {points: 1}  

Which model performs better? 

A) `DummyClassifier`

B) `LogisticRegression`

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""` and assign it to an object called `answer1_11`.*

In [None]:
answer1_11 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_11(answer1_11)

**Question 1.12** <br> {points: 2} 

Let's see if we can optimize our model by hyperparameter tuning both `max_features` and `C`. 

First, let's answer the following questions. 

i) Does `max_features` correspond to a hyperparameter for `CountVectorizer` or `LogisticRegression`? Save the name in an object named `max_f_hyper`.

ii) Does `C` correspond to a hyperparameter for `CountVectorizer` or `LogisticRegression`? Save the name in an object named `C_hyper`.

*Answer in the cell below by specifying either "CountVectorizer" or "LogisticRegression" for the objects named in the above question. Make sure your answer is between `""`.*

In [None]:
max_f_hyper = None
C_hyper = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_12_1(max_f_hyper)

In [None]:
t.test_1_12_2(C_hyper)

**Question 1.13** <br> {points: 1} 

If we increase the `C` hyperparameter values, is that more likely to result in a model that is overfitted or underfitted? 

*Answer in the cell below by specifying either "overfitted" or "underfitted" in an object named `answer_1_13`. Make sure your answer is between `""`.*

In [None]:
answer_1_13 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_13(answer_1_13)

**Question 1.14** <br> {points: 1}

The time has come to hyperparameter tune! Define a pipeline with `CountVectorizer` and `LogisticRegression` with `max_iter=1000`. Name the pipeline `main_pipe`. 

Use `RandomizedSearchCV` to jointly optimize the hyperparameters in the `param_grid` that we have provided for you. 
Name this object `random_search`. Specify `n_iter=10`, `cv=5`, `random_state=888`, `n_jobs=-1`, `verbose=3`, and `return_train_score=True`. 
Make sure to fit your model on the training portion of the IMDB dataset. 

This can take quite a while (10 minutes for me!) so please be patient.
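
For orientation, here is a small sketch of a randomized search over a pipeline, run on a made-up corpus with deliberately small settings (`n_iter=3`, `cv=3`) so that it finishes quickly. The names `docs`, `labels`, `toy_pipe`, `toy_dists` and `toy_search` are hypothetical. Note how each key is the lowercased step name, two underscores, then the hyperparameter name.

```python
import pandas as pd
from scipy.stats import loguniform, randint
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only for illustration
docs = pd.Series(
    ["great film", "loved it", "great acting",
     "awful film", "hated it", "awful acting"] * 5
)
labels = pd.Series(["pos", "pos", "pos", "neg", "neg", "neg"] * 5)

toy_pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Distributions to sample from; keys follow "<step name>__<hyperparameter>"
toy_dists = {
    "logisticregression__C": loguniform(0.01, 100),
    "countvectorizer__max_features": randint(2, 8),  # integers 2..7
}

toy_search = RandomizedSearchCV(
    toy_pipe, toy_dists, n_iter=3, cv=3, random_state=0, return_train_score=True
)
toy_search.fit(docs, labels)
print(toy_search.best_params_)
```

Each of the `n_iter` candidates is a random draw from the distributions, which is why fixing `random_state` matters for reproducible results.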

In [None]:
param_grid = {
    "logisticregression__C": loguniform(0.01, 100),
    "countvectorizer__max_features": randint(10, 1000),
}

In [None]:
main_pipe = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
pd.DataFrame(random_search.cv_results_)

In [None]:
t.test_1_14(main_pipe,random_search)

**Question 1.15** <br> {points: 3}

What are the best hyperparameter values found by `RandomizedSearchCV` for `C` and `max_features`? Save them in an object named `optimal_parameters`. (The grader is expecting a dictionary object.) 

What was the corresponding validation score? Save this in an object named `optimal_score`. 

*Hint: `.best_params_`  and `.best_score_` are helpful here.* 


In [None]:
optimal_parameters = None
optimal_score = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'optimal_parameters' in globals(
), "Please make sure that your solution is named 'optimal_parameters'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

In [None]:
t.test_1_15_2(random_search, optimal_score)

**Question 1.16** <br> {points: 1}

Are you getting a better mean validation score than the logistic regression pipeline with default hyperparameters from 1.9? 

A) Yes

B) No

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""` and assign it to an object called `answer1_16`.*

In [None]:
answer1_16 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_16(answer1_16)

# 2. Model Interpretation <a name="4"></a>
<hr>

One of the primary advantages of linear models is that they can be interpreted in terms of their most important features. In this exercise, we'll explore the coefficients learned by the logistic regression classifier. 

**Question 2.1** <br> {points: 1}

Use `best_estimator_` to find the best estimator of `random_search` from 1.14 and save it in an object named `best_model`. 


In [None]:
best_model = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
best_model

In [None]:
t.test_2_1(best_model)

**Question 2.2** <br> {points: 1}

Use `coef_` to find the coefficients of the features. This information is exposed by the `coef_` attribute of the [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) object. (*Hint: You'll have to reference `logisticregression` from the `best_model` object because `best_model` is a `Pipeline` object*.)

Name this object `lr_coeffs`. 
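
The sketch below shows one way to reach into a fitted pipeline, using a made-up corpus (`docs`, `labels` and every `toy_*` name are hypothetical). It uses the vectorizer's `vocabulary_` attribute, which maps each word to its column index, to line the weights up with words.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only for illustration
docs = pd.Series(
    ["great film", "loved it", "great acting",
     "awful film", "hated it", "awful acting"] * 5
)
labels = pd.Series(["pos", "pos", "pos", "neg", "neg", "neg"] * 5)

toy_pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
toy_pipe.fit(docs, labels)

# Steps of a Pipeline are reachable by name through named_steps
toy_vec = toy_pipe.named_steps["countvectorizer"]
toy_lr = toy_pipe.named_steps["logisticregression"]

# For a binary problem coef_ has shape (1, n_features)
toy_coefs = toy_lr.coef_
print(toy_coefs.shape)

# Pair every word with the weight stored in its column
word_weight = {word: toy_coefs[0, idx] for word, idx in toy_vec.vocabulary_.items()}
print(sorted(word_weight.items(), key=lambda kv: kv[1]))
```

A positive weight pushes the prediction toward the second entry of `classes_` (here `pos`), and a negative weight pushes it toward the first.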

In [None]:
lr_coeffs = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
round(max(lr_coeffs[0]),2)

In [None]:
t.test_2_2(lr_coeffs)

**Question 2.3** <br> {points: 1}

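Find the features that `CountVectorizer` produced by calling `get_feature_names()` on the `CountVectorizer` object within the `best_model` object (in scikit-learn 1.0 and later this method is named `get_feature_names_out()`, and `get_feature_names()` was removed in 1.2). 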
(*Hint: You'll have to reference `countvectorizer` from the `best_model` object because `best_model` is a `Pipeline` object*) 

Save this in an object named `vocab`. 

In [None]:
vocab = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_3(vocab)

We've provided the code below, which pairs each feature with its coefficient (our gift to you!). 

In [None]:
vocab_coef_df = pd.DataFrame(data = [vocab,lr_coeffs.flatten()]).T.rename(columns={0:'word', 1:'coefficient'})
vocab_coef_df.tail()

**Question 2.4** <br> {points: 1}

Find the 10 words whose presence is most indicative of a positive review. Save the words and their corresponding weights in a dataframe ordered from most indicative to least indicative. 

Save these in a dataframe object named `positive_words`.

In [None]:
positive_words = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
positive_words

In [None]:
t.test_2_4(positive_words)

**Question 2.5** <br> {points: 1}

Find the 10 words whose presence is most indicative of a negative review. Save the words and their corresponding weights in a dataframe ordered from most indicative to least indicative. 

Save these in a dataframe object named `negative_words`.

In [None]:
negative_words = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_5(negative_words)

**Question 2.6** <br> {points: 2}

Do the words associated with positive and negative reviews make sense? 


A) Yes

B) No

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""` and assign it to an object called `answer2_6`.*

In [None]:
answer2_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'answer2_6' in globals(
), "Please make sure that your solution is named 'answer2_6'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 2.7** <br> {points: 1}

Which of the following statements are true?

i) It is useful to access the coefficient values since it helps us interpret the model to some extent.

ii) The coefficients help humans to understand which features are the most relevant features for prediction and how they impact the prediction.

iii) We can get feature importances for KNN by looking at the corresponding coefficients for each feature.

iv) Decision trees also offer a way of seeing which features are important: by looking at the tree and where the splits occur. 



Select all that apply and add them into a list named `answer_2_7`. 
For example, if statements i and iv are both true, your solution will look like this: 

```
answer_2_7 = ["i", "iv"] 
```

In [None]:
answer_2_7 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_7(answer_2_7)

# 3. Test score, evaluation and `predict_proba`

**Question 3.1** <br> {points: 1}

Evaluate the best model from `random_search`  on the full training set.

Save the score in an object named `training_score`. 

In [None]:
training_score = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_1(training_score)

**Question 3.2** <br> {points: 2}

Evaluate this model on the test set. 

Save the score in an object named `test_score`. 

In [None]:
test_score = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'test_score' in globals(
), "Please make sure that your solution is named 'test_score'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 3.3** <br> {points: 1}

How does your test score compare to the cross validation score `optimal_score` from **Question 1.15**? 

A) Our model's test score (`test_score`) is much higher than the cross validation score (`optimal_score`).

B) Our model's test score (`test_score`) is much lower than the cross validation score (`optimal_score`).

C) Our model's test score (`test_score`) is a little higher than the cross validation score (`optimal_score`).

D) Our model's test score (`test_score`) is a little lower than the cross validation score (`optimal_score`).

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""` and assign it to an object called `answer3_3`.*

In [None]:
answer3_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_3_3(answer3_3)

**Question 3.4** <br> {points: 1}

Plot a confusion matrix on the test set using the object `random_search` as your estimator and `normalize="all"` (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html) for more help here).

Name the plot `reviews_cm`. 
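
To see what `normalize="all"` does to the numbers, here is a minimal sketch using `confusion_matrix` from `sklearn.metrics` on made-up labels (`y_true_toy`, `y_pred_toy` and `cm` are hypothetical names). It computes only the matrix rather than a plot, since the plotting helpers differ between scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, only for illustration
y_true_toy = ["neg", "neg", "pos", "pos", "pos"]
y_pred_toy = ["neg", "pos", "pos", "pos", "neg"]

# normalize="all" divides every cell by the total number of samples,
# so the four cells sum to 1
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["neg", "pos"], normalize="all")
print(cm)
print(cm.sum())
```

Rows correspond to true labels and columns to predicted labels, in the order given by `labels`.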

In [None]:
reviews_cm = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_4(reviews_cm)

**Question 3.5** <br> {points: 3}

Print a classification report on the `X_test` predictions of the best model from `random_search` with measurements to 4 decimal places. Use this information to answer the following questions.

A) What is the recall if we classify `pos` as our "positive" class? Save the result to 4 decimal places in an object named `answer3_5a`. 

B) What is the precision weighted average? Save the result to 4 decimal places in an object named `answer3_5b`. 

C) What is the `f1` score using `pos` as your positive class? Save the result to 4 decimal places in an object named `answer3_5c`.
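
As a sketch of how to read such a report, the snippet below builds one from made-up labels (`y_true_toy`, `y_pred_toy` and `report` are hypothetical names). The `digits` argument controls the number of decimal places in the printed text, and `output_dict=True` returns the same numbers as a nested dictionary.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels, only for illustration
y_true_toy = ["neg", "neg", "pos", "pos", "pos"]
y_pred_toy = ["neg", "pos", "pos", "pos", "neg"]

# Printed report with 4 decimal places
print(classification_report(y_true_toy, y_pred_toy, digits=4))

# Same metrics as a dictionary keyed by class label and by average type
report = classification_report(y_true_toy, y_pred_toy, output_dict=True)
print(round(report["pos"]["recall"], 4))
print(round(report["weighted avg"]["precision"], 4))
```

In the printed report, each class has its own row, and the `weighted avg` row weights each class's metric by its number of true samples (the `support` column).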

In [None]:
# Use this cell to print your classification report

In [None]:
answer3_5a = None
answer3_5b = None
answer3_5c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_5_1(answer3_5a)

In [None]:
t.test_3_5_2(answer3_5b)

In [None]:
t.test_3_5_3(answer3_5c)

**Question 3.6** <br> {points: 2}

Make a dataframe named `results_df` that contains these 5 columns: 

- `review` - this should contain the reviews from `X_test`.
- `true_label` - This should contain the true `y_test` values. 
- `predicted_y` - The predicted labels generated from `best_model` for the `X_test` reviews. 
- `neg_label_prob` - The negative probabilities generated from `best_model` for the `X_test` reviews. These can be found at index 0 of the `predict_proba` output (you can get that using `[:,0]`). 
- `pos_label_prob` - The positive probabilities generated from `best_model` for the `X_test` reviews. These can be found at index 1 of the `predict_proba` output (you can get that using `[:,1]`). 
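
The column order of `predict_proba` follows the classifier's `classes_` attribute, which is sorted, so `neg` comes before `pos`. A minimal sketch on a made-up corpus (`docs`, `labels`, `toy_pipe`, `new_docs` and `proba` are hypothetical names):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only for illustration
docs = pd.Series(
    ["great film", "loved it", "great acting",
     "awful film", "hated it", "awful acting"] * 5
)
labels = pd.Series(["pos", "pos", "pos", "neg", "neg", "neg"] * 5)

toy_pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
toy_pipe.fit(docs, labels)

new_docs = pd.Series(["great film", "awful film"])
proba = toy_pipe.predict_proba(new_docs)  # shape (n_documents, n_classes)

# Column i of proba corresponds to classes_[i]
print(toy_pipe.named_steps["logisticregression"].classes_)
print(proba[:, 0])  # probability of the first class ("neg")
print(proba[:, 1])  # probability of the second class ("pos")
```

Each row sums to 1, so in the binary case the two columns carry the same information.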

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_6(results_df)

**Question 3.7** <br> {points: 1}

Find the top 5 movie reviews in `results_df` with the highest predicted probability of being positive (i.e., where the model is most confident that the review is positive).

Save the reviews and the associated probability score in a dataframe named `most_pos_df`. 

In [None]:
most_pos_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_3_7(most_pos_df)

Feel free to explore these reviews and see how positive they read!

Here is the first one for you (if you got the above question right)! 

In [None]:
# most_pos_df.iloc[0,0]

**Question 3.8** <br> {points: 1}

Using `best_model`, find the 5 movie reviews in the test set with the highest predicted probability of being negative (i.e., where the model is most confident that the review is negative).

Save the reviews and the associated probability score in a dataframe named `most_neg_df`. 

In [None]:
most_neg_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_8(most_neg_df)

And what does a negative review read like?

In [None]:
# most_neg_df.iloc[0,0]

**Question 3.9 - Optional** <br> {points: 0}
This is an optional question!

(You'll get 0 marks for this one but you may have fun doing it?!) 

Using `best_model`, find the 5 movie reviews in the test set with the most divided probability of being negative or positive (i.e., where the model is least confident in either review sentiment).

Save the reviews and the associated probability score in a dataframe named `divided_revs_df`.

In [None]:
divided_revs_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_9(divided_revs_df)

If you attempted this question, uncomment the code below and read a review that the model was uncertain on classifying.

In [None]:
#print(divided_revs_df.iloc[0,0])

**Question 3.10 - Optional** <br> {points: 0}

Here is another optional question!

Examine a review from the test set where our `best_model` is making mistakes, i.e., where the true labels do not match the predicted labels. 

Save a (single) full row from `divided_revs_df` in an object named `wrong_review`. (The autograder is expecting a dataframe as the datatype.) 

In [None]:
wrong_review = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_10(wrong_review)

If you attempted this question, uncomment the code below and read the review. Does it make sense why the model got it wrong?

In [None]:
# wrong_review.iloc[0,0]

## Attributions
- The IMDB DataSet - [Kaggle](https://www.kaggle.com/utathya/imdb-review-dataset)

- MDS DSCI 571 - Supervised Learning I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_571_sup-learn-1) 


## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  