# DSCI 573 - Feature and Model Selection

# Lab 3: Ensembles and feature importances

## Table of contents
- [Submission instructions](#si)
- [Exercise 1: Data and preprocessing](#1)
- [Exercise 2: Ensembles](#2)
- [Exercise 3: Feature importances](#3)

## Submission instructions <a name="si"></a>
<hr>
rubric={mechanics:2}

You will receive marks for correctly submitting this assignment. 

To correctly submit this assignment follow the instructions below:

- Push your assignment to your GitHub repository. 
- Add a link to your GitHub repository here: LINK TO YOUR GITHUB REPO 
- Upload an HTML render of your assignment to Canvas. The last cell of this notebook will help you do that.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

[Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.

**NOTE: The data you download for use in this lab SHOULD NOT BE PUSHED TO YOUR REPOSITORY. You might be penalised for pushing datasets to your repository. I have seeded the repository with a `.gitignore` file in the hope that it will keep you from pushing CSVs.**

In [1]:
import os

%matplotlib inline
import string
from collections import deque

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# data
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer

# Classifiers
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer

# classifiers / models
from sklearn.linear_model import LogisticRegression

# other
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, export_graphviz

## Exercise 1: Data and preprocessing <a name="1"></a>
<hr>

Welcome to the third lab! Because it's a quiz week, I'm trying to make this lab lighter compared to other labs. We'll be using a dataset which is smaller in size to speed things up.  

Recall Kaggle's [Spotify Song Attributes](https://www.kaggle.com/geomack/spotifyclassification/home) dataset, which you used in 571 lab1. The dataset contains a number of features of songs from 2017 and a binary target variable representing whether the user liked the song or not. See the documentation of all the features [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). The supervised machine learning task for this dataset is predicting whether the user likes a song or not, given a number of song features.

Download the CSV. 

### Exercise 1.1 
rubric={accuracy:2,reasoning:2}

**Your tasks:**
1. Read the CSV.
2. Split the data (80%-20%) with `random_state=123` to create `train_df` and `test_df`. 
3. Do you have class imbalance? Is one class more important than the other? What would be an appropriate metric in this problem? 
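
As a rough sketch of tasks 2 and 3, the snippet below splits a small made-up dataframe and checks the class proportions. The dataframe `toy_df`, its column names, and its values are invented for illustration; they are not the Spotify data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real CSV (hypothetical values, not the Spotify data)
toy_df = pd.DataFrame(
    {
        "tempo": [120, 95, 130, 110, 100, 125, 90, 140, 105, 115],
        "target": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    }
)

# 80%-20% split with a fixed seed so that the split is reproducible
train_df, test_df = train_test_split(toy_df, test_size=0.2, random_state=123)

# Proportion of each class in the training portion
class_proportions = train_df["target"].value_counts(normalize=True)
print(class_proportions)
```

If the proportions are close to each other, accuracy is usually a reasonable starting metric; if they are far apart, or if one class matters more, other metrics may be more appropriate.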

In [None]:
# solution_1_1_1
### YOUR ANSWER HERE

In [None]:
# solution_1_1_2

### YOUR ANSWER HERE

In [None]:
# solution_1_1_3

### YOUR ANSWER HERE

**solution_1_1_3 (reasoning)**

### YOUR ANSWER HERE


### Exercise 1.2
rubric={accuracy:2,reasoning:3}

In 571 lab1 we excluded the `artist` and `song_title` features because we did not know how to handle categorical or text features at that time. Now that we do, let's include the `song_title` feature.

**Your tasks:**
1. How will you encode the `song_title` feature?
2. Identify different feature types (e.g., numeric features) and store them in appropriate variables (e.g., `numeric_features`). 

Note: We are excluding the `artist` feature in this lab.

In [None]:
# solution_1_2_1 (exploration)

### YOUR ANSWER HERE

**solution_1_2_1 (description)**

### YOUR ANSWER HERE

In [None]:
# solution_1_2_2

### YOUR ANSWER HERE

### (optional) Exercise 1.3 Encoding the `artist` feature
rubric={reasoning:1}

**Your tasks:**
1. Here we are not including the `artist` feature. How might you encode it if you decided to include it? You do not actually have to encode it; pointing out the difficulties in encoding it and providing some reasonable options will be enough.

In [2]:
# solution_1_3_1 (exploration)

### YOUR ANSWER HERE

**solution_1_3_1 (reasoning)**

### YOUR ANSWER HERE

### Exercise 1.4 Separate `X` and `y`
rubric={accuracy:2}

**Your tasks:**

1. Create `X_train`, `y_train`, `X_test`, `y_test`. 

In [None]:
# solution_1_4_1

### YOUR ANSWER HERE

### Exercise 1.5 Define a column transformer
rubric={accuracy:4}

**Your tasks:**

1. Define a column transformer called `preprocessor` using `make_column_transformer` to apply different transformations to the different feature types.

**Notes:**
- If you are using `CountVectorizer` for encoding any of the features, use the following arguments:
    - `stop_words="english"`
    - `max_features=200`
- If you are not applying any transformations on certain features, do not forget to include them in the column transformer. You can do it using "passthrough" in a column transformer.     

In [None]:
# solution_1_5_1

### YOUR ANSWER HERE

## Exercise 2: Ensembles <a name="2"></a>
<hr>

In this exercise, you may use code from lecture notes with appropriate attributions. 

### 2.1 Dummy classifier
rubric={reasoning:1}

**Your tasks:**
1. Report mean cross-validation results along with standard deviation with the `DummyClassifier`. You can use the `strategy` of your choosing. 
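
One possible way to report the mean and standard deviation of cross-validation results is sketched below on synthetic data from `make_classification`. The names `toy_X`, `toy_y`, and `summary` are placeholders, and `strategy="most_frequent"` is just one of the available choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (not the lab dataset)
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=2)

# cross_validate returns fit times, score times, and (optionally) train scores
scores = cross_validate(
    DummyClassifier(strategy="most_frequent"),
    toy_X,
    toy_y,
    cv=5,
    return_train_score=True,
)

# One row per reported quantity, with its mean and standard deviation over folds
scores_df = pd.DataFrame(scores)
summary = pd.DataFrame({"mean": scores_df.mean(), "std": scores_df.std()})
print(summary)
```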

In [None]:
results = {}

In [None]:
# for helper code if necessary

### YOUR ANSWER HERE

In [None]:
# solution_2_1_1

### YOUR ANSWER HERE

### 2.2 Decision tree
rubric={reasoning:2}

In 571 we used the decision tree classifier with the numeric features in the dataset. Let's use it as our second baseline. 

**Your tasks:**

1. Define a pipeline with the `preprocessor` you defined in the previous exercise and the `DecisionTreeClassifier` classifier and report mean cross-validation scores along with standard deviation. 

In [None]:
# solution_2_2_1

### YOUR ANSWER HERE

### 2.3 Different classifiers 
rubric={accuracy:5,quality:2,reasoning:4}

If you haven't already done so, you'll need to install the following packages in your conda environment for this exercise. If `conda install` doesn't work on your operating system for a particular package, you may have to use `pip`.

```
conda install -c conda-forge xgboost
conda install -c conda-forge lightgbm
conda install -c conda-forge catboost
```

**Your tasks:**

1. Define pipelines for each classifier listed below using the `preprocessor` you defined in the previous exercise. Use `random_state=2` for all your classifiers. Store all the classifiers in a dictionary called `classifiers`, where keys are classifier names and values are pipelines. 
    - `LogisticRegression`
    - `RandomForestClassifier`
    - `XGBClassifier`
    - `LGBMClassifier`
    - `CatBoostClassifier`     
2. Show mean cross-validation scores along with standard deviation for all classifiers as a dataframe. 
3. Discuss your results, focusing on the following points:
    - Best and worst performing models 
    - Overfitting/underfitting
    - Fit time
    - Score time
    - Stability of scores 
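
A minimal sketch of the dictionary-of-pipelines pattern is shown below. It uses synthetic data, a plain `StandardScaler` in place of your `preprocessor`, and only two `sklearn` classifiers as stand-ins; `toy_classifiers` and `toy_results` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the lab dataset)
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=2)

# Keys are classifier names, values are pipelines
toy_classifiers = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(random_state=2)
    ),
    "random forest": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=50, random_state=2)
    ),
}

rows = {}
for name, pipe in toy_classifiers.items():
    cv = pd.DataFrame(
        cross_validate(pipe, toy_X, toy_y, cv=5, return_train_score=True)
    )
    # Format each quantity as "mean (+/- std)" over the folds
    rows[name] = {
        col: f"{cv[col].mean():.3f} (+/- {cv[col].std():.3f})" for col in cv.columns
    }

# One row per classifier, one column per reported quantity
toy_results = pd.DataFrame(rows).T
print(toy_results)
```

A large gap between `train_score` and `test_score` in such a table is one sign of overfitting, and the standard deviation gives a rough idea of how stable the scores are across folds.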

In [None]:
# solution_2_3_1

### YOUR ANSWER HERE

In [None]:
# solution_2_3_2

### YOUR ANSWER HERE

**solution_2_3_3**

### YOUR ANSWER HERE

### 2.4 Voting classifier 
rubric={accuracy:3,reasoning:3}

**Your tasks:**

1. Create an averaging model using `sklearn`'s [`VotingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) with `soft` voting and the classifiers you used in Exercise 2.3. Show mean cross-validation scores along with standard deviation. 
2. How many models are being averaged here? Are you getting better cross-validation scores? 
3. Explain the difference between setting `voting='soft'` and `voting='hard'`.
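
To make the two voting schemes concrete, here is a small hand-computed sketch for a single example in a binary problem. The three probabilities are invented; they do not come from any fitted model.

```python
# Hypothetical predicted probabilities of the positive class from three
# models for one example (values invented for illustration)
positive_probs = [0.45, 0.48, 0.95]

# Hard voting: each model casts one vote for its predicted class,
# and the majority class wins
votes = [1 if p >= 0.5 else 0 for p in positive_probs]
hard_prediction = 1 if sum(votes) > len(votes) / 2 else 0

# Soft voting: average the predicted probabilities, then pick the class
# with the higher average probability
mean_prob = sum(positive_probs) / len(positive_probs)
soft_prediction = 1 if mean_prob >= 0.5 else 0

print(hard_prediction, soft_prediction)  # prints "0 1"
```

Here two of the three models narrowly vote for class 0, so hard voting predicts 0, while the one very confident model pulls the average probability above 0.5, so soft voting predicts 1.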

In [None]:
# solution_2_4_1

### YOUR ANSWER HERE

**solution_2_4_2**

### YOUR ANSWER HERE

**solution_2_4_3**

### YOUR ANSWER HERE

### 2.5 Stacking classifier 
rubric={accuracy:4,reasoning:1}

**Your tasks:**

1. Create a stacking model using `sklearn`'s [`StackingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html) with the estimators from Exercise 2.3 and logistic regression as the final estimator. You may remove `CatBoostClassifier` for speed.
2. Show mean cross-validation scores along with standard deviation. 
3. Discuss validation scores, fit times, and score times. 
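
The general shape of a stacking model can be sketched as follows. It uses synthetic data and two small `sklearn` tree-based models as stand-ins for the estimators from Exercise 2.3; `toy_estimators` and `toy_stack` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the lab dataset)
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=2)

# StackingClassifier takes a list of (name, estimator) tuples
toy_estimators = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=2)),
    ("forest", RandomForestClassifier(n_estimators=25, random_state=2)),
]
toy_stack = StackingClassifier(
    estimators=toy_estimators,
    final_estimator=LogisticRegression(random_state=2),
)

stack_scores = cross_validate(toy_stack, toy_X, toy_y, cv=3, return_train_score=True)
print(stack_scores["test_score"].mean(), stack_scores["test_score"].std())
print(stack_scores["fit_time"].mean())
```

Stacking fits each base estimator several times internally to produce out-of-fold predictions for the final estimator, so fit times are typically noticeably longer than for any single base model.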

In [None]:
# solution_2_5_1
### YOUR ANSWER HERE

In [None]:
# solution_2_5_2

### YOUR ANSWER HERE

**solution_2_5_3 (reasoning)**

### YOUR ANSWER HERE

### 2.6 Examine coefficients
rubric={accuracy:3,reasoning:2}

**Your tasks:**

1. Show the names of the features passed to the final estimator of your stacking model, along with their corresponding coefficients.
2. Which feature has the largest (in magnitude) coefficient? What does that mean? 
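
As a hedged sketch of where these coefficients live, the example below fits a small stacking model on synthetic data and reads the coefficients from its fitted final estimator. The base models and the names `toy_estimators`, `toy_stack`, and `coef_df` are placeholders. In a binary problem with the default settings, the final estimator receives one input per base estimator.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the lab dataset)
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=2)

toy_estimators = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=2)),
    ("forest", RandomForestClassifier(n_estimators=25, random_state=2)),
]
toy_stack = StackingClassifier(
    estimators=toy_estimators,
    final_estimator=LogisticRegression(random_state=2),
)
toy_stack.fit(toy_X, toy_y)

# The fitted final estimator is stored in `final_estimator_`; its inputs are
# the base estimators' predictions, so the "feature names" are estimator names
coef_df = pd.DataFrame(
    data=toy_stack.final_estimator_.coef_.flatten(),
    index=[name for name, _ in toy_estimators],
    columns=["coefficient"],
).sort_values("coefficient", key=lambda s: s.abs(), ascending=False)
print(coef_df)
```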

In [None]:
# solution_2_6_1
### YOUR ANSWER HERE

**solution_2_6_2**

### YOUR ANSWER HERE

### (optional) 2.7 Tree-based models without scaling
rubric={reasoning:1}

Scaling should not matter for tree-based classifiers. In this exercise you'll examine whether that's true or not. 

**Your tasks:**
1. Define a column transformer where you skip scaling numeric features. 
2. Show results for individual tree-based models and their averaged and stacked versions. 
3. Discuss your results.

In [None]:
# solution_2_7_1, solution_2_7_2

### YOUR ANSWER HERE

**solution_2_7_3 (reasoning)**

### YOUR ANSWER HERE

### (optional) 2.8 Visualize a stacking classifier
rubric={reasoning:1}

**Your tasks:**
1. Use `DecisionTreeClassifier` as the final estimator instead of logistic regression. 
2. Visualize the tree created by the model. 
3. Note your observations. 

In [None]:
# solution_2_8_1

### YOUR ANSWER HERE

In [None]:
# solution_2_8_2

### YOUR ANSWER HERE

**solution_2_8_3**

### YOUR ANSWER HERE

## Exercise 3: Feature importances <a name="3"></a>
<hr>


### Exercise 3.1 Logistic regression coefficients
rubric={accuracy:4}

**Your tasks:**
1. Fit the logistic regression pipeline you created in Exercise 2 on the train split. 
2. Get feature names and store them in a variable called `feature_names`. 
3. Create a dataframe with `feature_names` and the corresponding coefficients. Show the first 20 rows of the dataframe.

In [None]:
# solution_3_1_1

### YOUR ANSWER HERE

In [None]:
# solution_3_1_2

### YOUR ANSWER HERE

In [None]:
# solution_3_1_3

### YOUR ANSWER HERE

### Exercise 3.2 Random forest feature importances 
rubric={accuracy:4,reasoning:2}

`LogisticRegression` is quite interpretable in terms of feature importances, but it didn't give us the best performance in this task. Can we get feature importances for the random forest classifier, which gave us much better scores?

**Your tasks:**

1. Fit the `RandomForestClassifier` pipeline you created in Exercise 2 on the train split.
2. Examine feature importances for this random forest pipeline. You can access feature importances using the `feature_importances_` attribute of the fitted estimator.
3. What features seem to be driving your predictions the most? Only from this information, can you tell in what direction they are driving the predictions? Why or why not? 
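
A minimal sketch of reading `feature_importances_` is given below, using synthetic data and made-up feature names (`toy_names`); with a pipeline, the same attribute would be read from the fitted classifier step.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the lab dataset)
toy_X, toy_y = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=2
)
toy_names = [f"feature_{i}" for i in range(toy_X.shape[1])]

toy_rf = RandomForestClassifier(n_estimators=50, random_state=2)
toy_rf.fit(toy_X, toy_y)

# Impurity-based importances: one non-negative number per feature
toy_importances = pd.Series(
    toy_rf.feature_importances_, index=toy_names
).sort_values(ascending=False)
print(toy_importances)
```

Note that these importances are all non-negative and are normalized to sum to 1, unlike logistic regression coefficients, which carry a sign.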

In [None]:
# solution_3_2_1

### YOUR ANSWER HERE

In [None]:
# solution_3_2_2

### YOUR ANSWER HERE

**solution_3_2_3**
### YOUR ANSWER HERE

### Exercise 3.3 SHAP explanations
rubric={reasoning:5}

In this exercise, we'll use [SHAP (SHapley Additive exPlanations)](https://shap.readthedocs.io/en/latest/), which is a sophisticated measure that tells us about the contribution of each feature even in non-linear models. We will use it to explain predictions made by our random forest classifier. If you haven't already done it, you'll need to install `SHAP` first. You may use the following command.  

```
conda install -c conda-forge shap
```

If it doesn't work, you might have to use `pip`.

```
pip install shap
```

In this exercise, you are given most of the code and your job is to understand the code, get it working, and comment on the plots created using SHAP values.

The code below
- creates transformed `X_train` assuming that your column transformer is called `preprocessor` and the feature names of your transformed data are called `feature_names`.
- extracts SHAP values for examples from the training set (`X_train`) and displays them.
- shows a number of plots created using SHAP values.

**Your tasks:**
1. Run the code. If necessary, you may adapt the starter code. 
2. Explain the dependence plot. 
3. Explain the summary plot. 
4. Explain the force plot for a specific prediction. 
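
As background for reading the plots, it may help to recall that SHAP values are additive. In the notation assumed here (not used elsewhere in this lab), for a single example $x$ with $M$ features, the model output $f(x)$ being explained decomposes as

$$f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x)$$

where $\phi_0$ is the expected model output (the base value, available as `explainer.expected_value` in the starter code) and $\phi_j(x)$ is the SHAP value of feature $j$ for this example. A positive $\phi_j(x)$ pushes the output above the base value and a negative one pushes it below.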


The following code creates encoded `X_train` and shows it as a dataframe. 

In [None]:
### BEGIN STARTER CODE

import shap

X_train_enc = pd.DataFrame(
    data=preprocessor.transform(X_train).toarray(),
    columns=feature_names,
    index=X_train.index,
)
X_train_enc.head()

### END STARTER CODE

The following code fits a random forest on the encoded training data and extracts SHAP values for the training set. (This may take a while.)

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train_enc, y_train)
# Random sample of 1000 training rows (note: not used by the explainer call below)
X_train_sample = X_train_enc.sample(1000, random_state=2)
explainer = shap.TreeExplainer(rf)
# SHAP values are computed for every row of X_train_enc
shap_values = explainer.shap_values(X_train_enc)

The following code displays the SHAP values for your `feature_names`.

In [None]:
### BEGIN STARTER CODE

values = shap_values[0][0]
pd.DataFrame(data=values, index=feature_names, columns=["SHAP"]).sort_values(
    by="SHAP", ascending=False
)

### END STARTER CODE

### Dependence plot

In [None]:
shap.dependence_plot("danceability", shap_values[0], X_train_enc)

**solution_3_3_2** (Explain the dependence plot above.)

### YOUR ANSWER HERE

### Summary plot

In [None]:
shap.summary_plot(shap_values[0], X_train_enc)

**solution_3_3_3** (Explain the summary plot above.)

### YOUR ANSWER HERE

Let's encode the test set. 

In [None]:
X_test_enc = pd.DataFrame(
    data=preprocessor.transform(X_test).toarray(),
    columns=feature_names,
    index=X_test.index,
)
X_test_enc.head()

What's the prediction on the following test example? 

In [None]:
rf.predict(X_test_enc)[5]

Can we explain this using SHAP? 

### Force plot

In [None]:
# load JS visualization code to notebook
shap.initjs()

In [None]:
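# SHAP values must come from the same (test) rows whose features are displayed
test_shap_values = explainer.shap_values(X_test_enc)
shap.force_plot(
    explainer.expected_value[0], test_shap_values[0][5, :], X_test_enc.iloc[5, :]
)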

**solution_3_3_4** (Explain the force plot above.)

### YOUR ANSWER HERE

### Submission to Canvas

**PLEASE READ: When you are ready to submit your assignment do the following:**

- Run all cells in your notebook to make sure there are no errors by doing Kernel -->  Restart Kernel and Run All Cells...
- If you are using the "573" `conda` environment, make sure to select it before running all cells. 
- Convert your notebook to .html format using the `convert_notebook()` function below or by File -> Export Notebook As... -> Export Notebook to HTML
- Run the code `submit()` below to go through an interactive submission process to Canvas.

After submission, be sure to do a final push of all your work to GitHub (including the rendered HTML file).

In [None]:
# from canvasutils.submit import convert_notebook, submit

# convert_notebook("lab3.ipynb", "html")  # uncomment and run when you want to try converting your notebook (or you can convert manually from the File menu)
# submit(course_code=59091, token=False)  # uncomment and run when ready to submit to Canvas

Well done!! Congratulations on finishing the lab!! 