In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("Worksheet_3.ipynb")

# Worksheet 3: Preprocessing 

This worksheet is intended to help you revise and reinforce what you've learnt in the lecture.
<br>Please fill in the answers, or write the code, in the space provided.


## Imports

In [2]:
# Load necessary libraries
import numpy as np
import pandas as pd
import math
import sys
from hashlib import sha1
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.svm import SVC
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option("display.max_colwidth", 200)

## Exercise 1: Preprocessing the Spotify dataset 


Remember Kaggle's [Spotify Song Attributes](https://www.kaggle.com/geomack/spotifyclassification/home) dataset from homework 1? The dataset contains a number of features of songs from 2017 and a binary variable `target` that represents whether the user liked the song (encoded as 1) or not (encoded as 0). See the documentation of all the features [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). 

In homework 1, all features that appeared numeric were treated as such, while text features were dropped. This approach was taken primarily to keep things simple and familiarize you with basic `sklearn` syntax. However, with a wider range of tools available now, it's time to reconsider the different feature types and potential transformations within this dataset.

The code below reads the dataset and splits it, assuming that it's located in the `data` folder.

In [3]:
spotify_df = pd.read_csv("data/spotify.csv", index_col=0)
X_spotify = spotify_df.drop(columns=["target"])
y_spotify = spotify_df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X_spotify, y_spotify, test_size=0.2, random_state=123
)

<br><br>

### 1.1 Dummy model
rubric={points}

- Compute the mean cross-validation score of a dummy model (for example, `sklearn`'s `DummyClassifier`) and store it in the variable `dummy_mean_cv_score`.
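
As a minimal sketch of what a dummy baseline measures, the snippet below cross-validates a `DummyClassifier` on synthetic data rather than on the Spotify data. The names `X_toy`, `y_toy`, and `toy_dummy_mean` are hypothetical, and the `most_frequent` strategy is one assumption among several possible baselines.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: 100 examples, 60 in class 1 and 40 in class 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 60 + [0] * 40)

# The baseline ignores the features and always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent")

# For classifiers, cross_val_score uses stratified folds, so every
# validation fold keeps the same 60/40 class balance.
toy_dummy_scores = cross_val_score(dummy, X_toy, y_toy, cv=5)
toy_dummy_mean = toy_dummy_scores.mean()
print(toy_dummy_mean)
```

The mean score here equals the share of the majority class, which is the number any real model should beat.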

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 1

In [4]:
dummy_mean_cv_score = None 

...

Ellipsis

In [5]:
grader.check("q1.1")

<br><br>

### Feature categories and transformations 

- Examine the value counts of the following features and refer to [the documentation](https://developer.spotify.com/documentation/web-api/reference/get-audio-features) for these features:
    - `time_signature`
    - `mode`
    - `key`

In [6]:
X_train['time_signature'].value_counts()

time_signature
4.0    1514
3.0      76
5.0      22
1.0       1
Name: count, dtype: int64

In [7]:
X_train['mode'].value_counts()

mode
1    1002
0     611
Name: count, dtype: int64

In [8]:
X_train['key'].value_counts()

key
1     200
7     169
0     166
9     152
2     145
11    143
5     141
6     127
10    122
8     110
4      88
3      50
Name: count, dtype: int64

Do you think these features should be treated as numeric features? 

Consider the following categorization of features and discuss with your neighbour whether it makes sense or not.  

In [9]:
numeric_feats = ['acousticness', 'danceability', 'energy',
                 'instrumentalness', 'liveness', 'loudness',
                 'speechiness', 'tempo', 'valence']
categorical_feats = ['time_signature', 'key']
passthrough_feats = ['mode']

<br><br><br><br>

### 1.2 Only numeric features
rubric={points}

**Your tasks:**

Calculate the mean cross-validation score using only the `numeric_feats`. 
1. In particular, create a pipeline with two steps:
    - Step 1: Apply `StandardScaler()` with default hyperparameters to scale the numeric features.
    - Step 2: Use `SVC()` with default hyperparameters as the estimator.
2. Compute the mean cross-validation score with the pipeline described above and store it in the variable `num_mean_cv_score` as indicated below.
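
The two-step pattern described above can be sketched on synthetic data as follows. This is an illustration under assumptions, not the worksheet solution: `X_demo`, `y_demo`, and `demo_pipe` are hypothetical names, and `make_pipeline` is used here as one way to build the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 1 scales the features, step 2 fits the classifier.
demo_pipe = make_pipeline(StandardScaler(), SVC())

# Passing the whole pipeline to cross_val_score means the scaler is
# re-fit on the training portion of each fold only.
demo_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)
demo_mean = demo_scores.mean()
print(demo_mean)
```

Cross-validating the pipeline, rather than scaling the data once up front, keeps information from the validation folds out of the scaler.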


<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 2

In [10]:
num_mean_cv_score = None 

...

Ellipsis

In [11]:
num_mean_cv_score

In [12]:
grader.check("q1.2")

<br><br>

### 1.3 Numeric + categorical + passthrough features 
rubric={points}

Next, incorporate both categorical and passthrough features into the pipeline. Begin by defining a column transformer named `preprocessor` with distinct transformations for various feature categories:

- Apply `StandardScaler` to `numeric_feats`.
- Use `OneHotEncoder` with `handle_unknown="ignore"` for `categorical_feats`.
- Keep `passthrough_feats` unchanged by specifying `"passthrough"`.

Following this, create a pipeline named `svc_pipe` with `preprocessor` as the first step and `SVC()` as the estimator. Calculate the mean cross-validation score and store it in the variable `num_cat_mean_cv_score` below.
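
To see how a column transformer routes different columns to different transformations, here is a small sketch on a made-up data frame. The frame `toy_df` and its values are hypothetical, and `make_column_transformer` is only one way to build the transformer (the `ColumnTransformer` class with named steps would work as well).

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one column per feature category.
toy_df = pd.DataFrame(
    {
        "tempo": [90.0, 120.0, 150.0, 120.0],  # numeric
        "key": [0, 7, 0, 11],                  # categorical
        "mode": [1, 0, 1, 1],                  # already binary
    }
)

toy_preprocessor = make_column_transformer(
    (StandardScaler(), ["tempo"]),
    (OneHotEncoder(handle_unknown="ignore"), ["key"]),
    ("passthrough", ["mode"]),
)

# One scaled column, one column per distinct key, one passthrough column.
toy_transformed = toy_preprocessor.fit_transform(toy_df)
print(toy_transformed.shape)
```

With `handle_unknown="ignore"`, a category that appears only at prediction time is encoded as all zeros instead of raising an error.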

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 3

In [13]:
num_cat_mean_cv_score = None 
preprocessor = None
svc_pipe = None 

...

Ellipsis

In [14]:
num_cat_mean_cv_score

In [15]:
grader.check("q1.3")

<br><br>

### 1.4 Incorporating text features
rubric={points}

Remember that in homework 2, the `song_title` feature was excluded when working with this dataset. However, this feature can be valuable for our prediction task.

Let's incorporate it into our pipeline by applying a bag-of-words representation to the feature. Define a column transformer called `preprocessor` with the following transformations for different feature categories:

- `StandardScaler` on `numeric_feats`.
- `OneHotEncoder` on `categorical_feats`. (Pass `handle_unknown = "ignore"`.)
- `"passthrough"` for `passthrough_feats`.
- `CountVectorizer` with `max_features=50` and `stop_words="english"` on the `song_title` feature.
  
Then, create a pipeline named `svc_all_pipe` with `preprocessor` as the first step and `SVC()` as the estimator. Calculate the mean cross-validation score and store it in the variable `all_mean_cv_score` below.
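
A detail that is easy to miss when adding a text column: `CountVectorizer` expects a one-dimensional sequence of strings, so inside a column transformer the column is given as a single string rather than a list. The sketch below shows this on a hypothetical frame `toy_songs` with invented titles.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with one numeric and one text column.
toy_songs = pd.DataFrame(
    {
        "tempo": [90.0, 120.0, 150.0],
        "song_title": ["Love Me Again", "Dancing in the Rain", "Love in the Dark"],
    }
)

toy_text_preprocessor = make_column_transformer(
    # A list of column names selects a 2-D frame for the scaler.
    (StandardScaler(), ["tempo"]),
    # A single string selects a 1-D series, which CountVectorizer requires.
    (CountVectorizer(max_features=50, stop_words="english"), "song_title"),
)

toy_text_features = toy_text_preprocessor.fit_transform(toy_songs)
print(toy_text_features.shape)
```

The result has one scaled column plus one count column per word kept in the vocabulary.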








<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 3

In [16]:
all_mean_cv_score = None 
svc_all_pipe = None

...

Ellipsis

In [17]:
all_mean_cv_score

In [18]:
grader.check("q1.4")

<br><br>

### 1.5 Test scores 
rubric={points}

Now, let's fit `svc_all_pipe` on the complete training set (`X_train` and `y_train`) and evaluate it on the test set (`X_test` and `y_test`). Store the resulting test score in the `test_score` variable below.
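
The fit-then-score pattern can be sketched on synthetic data as follows. The data and the names `syn_pipe` and `syn_test_score` are hypothetical. The split reuses the `test_size` and `random_state` values from earlier in the worksheet purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the Spotify features.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=123
)

syn_pipe = make_pipeline(StandardScaler(), SVC())

# Fit on the training portion only; the test portion is used just once,
# for the final evaluation.
syn_pipe.fit(X_tr, y_tr)
syn_test_score = syn_pipe.score(X_te, y_te)  # accuracy for classifiers
print(syn_test_score)
```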

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 1

In [19]:
test_score = None 

...

Ellipsis

In [20]:
grader.check("q1.5")

<br><br>

🔎 **Challenge:** Explore the vocabulary created by `CountVectorizer`.

🔎 **Challenge:** Incorporate the `artist` feature in the pipeline. 