# Lecture 5: Class demo

## Imports, Announcements, LOs

### Imports

In [None]:
# import the libraries
import os
import sys
sys.path.append(os.path.join(os.path.abspath("../"), "code"))
from plotting_functions import *
from utils import *

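import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Explicit imports for the sklearn names used later in this demo
# (redundant but harmless if `utils` already provides them)
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC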

%matplotlib inline

pd.set_option("display.max_colwidth", 200)

Do you recall [the restaurants survey](https://ubc.ca1.qualtrics.com/jfe/form/SV_73VuZiuwM1eDVrw) you completed at the start of the course?

Let's use that data for this demo. You'll find a [wrangled version](https://github.com/UBC-CS/cpsc330-2023W1/blob/main/lectures/data/cleaned_restaurant_data.csv) in the course repository.

In [None]:
df = pd.read_csv('../data/cleaned_restaurant_data.csv')

In [None]:
df

In [None]:
df.describe()

Do you notice any unusual values in this data?
Let's remove these outliers: prices above 200 and group sizes below 1.

In [None]:
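upperbound_price = 200
lowerbound_people = 1
# Keep rows at or below the price cap and with at least one person.
# The `~(...)` form also keeps rows where the value is missing.
df = df[~(df['price'] > upperbound_price)]
restaurant_df = df[~(df['n_people'] < lowerbound_people)]
restaurant_df.shape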

In [None]:
restaurant_df.describe()

### Data splitting 

We aim to predict whether a restaurant is liked or disliked.

In [None]:
# Separate `X` and `y`. 

X = restaurant_df.drop(columns=['target'])
y = restaurant_df['target']

Below I'm perturbing this data just to demonstrate a few concepts. Don't do it in real life. 

In [None]:
X.at[459, 'food_type'] = 'Quebecois'
X['price'] = X['price'] * 100

In [None]:
# Split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

<br><br>

### EDA 

In [None]:
X_train.hist(bins=20, figsize=(12, 8));

Do you see anything interesting in these plots? 

In [None]:
X_train['food_type'].value_counts()

This looks like an inconsistency in data collection: the "Fusion" and "fusion" categories should probably be combined.

In [None]:
X_train['food_type'] = X_train['food_type'].replace("fusion", "Fusion")
X_test['food_type'] = X_test['food_type'].replace("fusion", "Fusion")

In [None]:
X_train['food_type'].value_counts()

Again, usually we should spend lots of time in EDA, but let's stop here so that we have time to learn about transformers and pipelines.   

<br><br>

### Dummy Classifier

In [None]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier()
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)

We have a relatively balanced distribution of both 'like' and 'dislike' classes.

<br><br>

### Let's try KNN on this data

Do you think KNN would work directly on `X_train` and `y_train`?

In [None]:
# Preprocessing and pipeline
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
# knn.fit(X_train, y_train)

We need to preprocess the data before passing it to ML models. What are the different types of features in the data? 

In [None]:
X_train.head()

- What transformations do we need to apply before training a machine learning model?
- Can we group features based on what type of transformations we would like to apply?

In [None]:
X_train.columns

In [None]:
X_train['noise_level'].value_counts()

In [None]:
numeric_feats = ['age', 'n_people', 'price'] # Continuous and quantitative features
categorical_feats = ['north_america', 'food_type'] # Discrete and qualitative features
binary_feats = ['good_server'] # Categorical features with only two possible values 
ordinal_feats = ['noise_level'] # Some natural ordering in the categories 
noise_cats = ['no music', 'low', 'medium', 'high', 'crazy loud']
drop_feats = ['comments', 'restaurant_name'] # Let's drop them for now. 

<br><br>
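
The `noise_cats` list above fixes an order for the noise levels. As a minimal sketch of how such an order could be used, the block below passes it to sklearn's `OrdinalEncoder`. The toy DataFrame is made up for illustration, and using `OrdinalEncoder` here is an assumption, since the demo itself does not encode this column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Category order copied from `noise_cats` in the cell above
noise_order = ['no music', 'low', 'medium', 'high', 'crazy loud']

# Made-up rows, not the survey data
toy_noise = pd.DataFrame({"noise_level": ["low", "crazy loud", "no music", "medium"]})

# `categories` takes one list per column; each value is mapped to its position in the list
ordinal_enc = OrdinalEncoder(categories=[noise_order])
encoded = ordinal_enc.fit_transform(toy_noise)
print(encoded.ravel())
```

With this encoding, "no music" maps to the smallest value and "crazy loud" to the largest, so the natural ordering is preserved.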

Let's begin with numeric features. What if we just use numeric features to train a KNN model? Would it work? 

In [None]:
# knn.fit(X_train[numeric_feats], y_train)
X_train_num = X_train[numeric_feats]
X_test_num = X_test[numeric_feats]

We need to deal with NaN values. 

### sklearn's `SimpleImputer` 

In [None]:
# Impute numeric features using SimpleImputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train_num)
X_train_num_imp = imputer.transform(X_train_num)
X_test_num_imp = imputer.transform(X_test_num)

In [None]:
knn.fit(X_train_num_imp, y_train)

No more errors. It worked! Let's try cross validation. 

In [None]:
cross_val_score(knn, X_train_num_imp, y_train).mean()

We have slightly improved results in comparison to the dummy model. 

### Discussion questions 

- What's the difference between sklearn estimators and transformers?  
- Can you think of a better way to impute missing values? 
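
One way to see the contrast raised in the first question: a transformer such as `SimpleImputer` exposes `fit` and `transform`, while a model such as `KNeighborsClassifier` exposes `fit` and `predict`. The sketch below uses a tiny made-up array (all `toy_` names are hypothetical) rather than the survey data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: one numeric feature with a missing value
X_toy_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
y_toy_train = np.array([0, 0, 1, 1])
X_toy_test = np.array([[np.nan], [4.0]])

# Transformer: `fit` learns statistics, `transform` returns a new feature matrix
toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(X_toy_train)  # learns the median of the training column
X_toy_train_imp = toy_imputer.transform(X_toy_train)
X_toy_test_imp = toy_imputer.transform(X_toy_test)  # reuses the training median

# Model: `fit` learns from X and y, `predict` returns labels
toy_knn = KNeighborsClassifier(n_neighbors=1)
toy_knn.fit(X_toy_train_imp, y_toy_train)
toy_preds = toy_knn.predict(X_toy_test_imp)

print(toy_imputer.statistics_)  # the value learned during fit
```

Note that the missing test value is filled with the median learned from the training data, not with a statistic computed on the test data.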

<br><br><br><br>

Do we need to scale the data? 

In [None]:
X_train[numeric_feats]

In [None]:
# Scale the imputed data 

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train_num_imp)
X_train_num_imp_scaled = scaler.transform(X_train_num_imp)
X_test_num_imp_scaled = scaler.transform(X_test_num_imp)

### What are some alternative methods for scaling?
- [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html): Transform each feature to a desired range
- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html): Scale features using median and quantiles. Robust to outliers. 
- [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html): Works on rows rather than columns. Normalize examples individually to unit norm.
- [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html): A scaler that scales each feature by its maximum absolute value.
    - What would happen when you apply `StandardScaler` to sparse data?    
- You can also apply custom scaling on columns using [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html). For example, when a column follows a power-law distribution (a handful of values account for many data points whereas most other values have few), log scaling is helpful.
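
As a rough illustration of the last bullet, here is a minimal sketch of log scaling with `FunctionTransformer`. The skewed column is made up, and the choice of `np.log1p` (rather than a plain log) is an assumption made so that zeros are handled.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Made-up, heavily right-skewed column
skewed = np.array([[1.0], [10.0], [100.0], [10000.0]])

# log1p computes log(1 + x), which is defined at x = 0
log_transformer = FunctionTransformer(np.log1p)
skewed_logged = log_transformer.fit_transform(skewed)

print(skewed_logged.ravel())
```

Because `FunctionTransformer` follows the usual `fit`/`transform` interface, it can be placed in a pipeline like any other preprocessing step.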

- For now, let's focus on `StandardScaler`. Let's carry out cross-validation.

In [None]:
cross_val_score(knn, X_train_num_imp_scaled, y_train)

In this case, we don't see a big difference with `StandardScaler`. But usually, scaling is a good idea. 
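
To see why scaling usually matters for a distance-based model such as KNN, consider the sketch below. The three rows are made up (columns stand for age and a price on a much larger scale) and `dist` is a hypothetical helper; the point is only that the large-scale column dominates Euclidean distance until the features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: columns are (age, price on a much larger scale)
toy = np.array([
    [20.0, 3000.0],
    [60.0, 3100.0],
    [21.0, 9000.0],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return np.linalg.norm(a - b)

# Unscaled: row 1 looks far closer to row 0 than row 2 does, because price dominates
d01_raw, d02_raw = dist(toy[0], toy[1]), dist(toy[0], toy[2])

# Scaled: the age difference now counts about as much as the price difference
toy_scaled = StandardScaler().fit_transform(toy)
d01_scaled = dist(toy_scaled[0], toy_scaled[1])
d02_scaled = dist(toy_scaled[0], toy_scaled[2])

print(d01_raw, d02_raw)
print(d01_scaled, d02_scaled)
```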

- This worked but are we doing anything wrong here? 
- What's the problem with calling `cross_val_score` with preprocessed data? 
- How would you do it properly?

<br><br><br><br>

In [None]:
# Create a pipeline 
pipe_knn = make_pipeline(
    SimpleImputer(strategy = "median"),
    StandardScaler(),
    KNeighborsClassifier()
)

In [None]:
cross_val_score(pipe_knn, X_train_num, y_train)

- What is happening under the hood?
- Why is this a better approach? 
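
The sketch below is one way to picture what happens in each fold when a pipeline is passed to `cross_val_score`: the imputer and scaler are fit on the training portion only, and the validation portion is just transformed. It uses synthetic data (an assumption, not the survey data), and the manual loop is an illustration, not sklearn's actual implementation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with some missing values
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 3))
y_syn = (X_syn[:, 0] > 0).astype(int)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

cv = KFold(n_splits=5)
manual_scores = []
for train_idx, valid_idx in cv.split(X_syn):
    # Fit the preprocessing steps on the training folds only
    imp = SimpleImputer(strategy="median").fit(X_syn[train_idx])
    X_tr = imp.transform(X_syn[train_idx])
    sc = StandardScaler().fit(X_tr)
    knn_fold = KNeighborsClassifier().fit(sc.transform(X_tr), y_syn[train_idx])
    # The validation fold is only transformed, never used for fitting
    X_va = sc.transform(imp.transform(X_syn[valid_idx]))
    manual_scores.append(knn_fold.score(X_va, y_syn[valid_idx]))

# The pipeline version does the same bookkeeping for us
pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), KNeighborsClassifier())
pipe_scores = cross_val_score(pipe, X_syn, y_syn, cv=cv)

print(manual_scores)
print(pipe_scores)
```

Since no statistic is ever computed on a validation fold, the validation scores are not contaminated by information from the data they are meant to evaluate.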

In [None]:
plot_improper_processing("kNN")

In [None]:
plot_proper_processing("kNN")

<br><br><br><br>

### Categorical features

Let's assess the scores using categorical features.

In [None]:
X_train[categorical_feats]

In [None]:
X_train['north_america'].value_counts()

In [None]:
X_train['food_type'].value_counts()

In [None]:
X_train_cat = X_train[categorical_feats]
X_test_cat = X_test[categorical_feats]

In [None]:
# One-hot encoding of categorical features 
from sklearn.preprocessing import OneHotEncoder
# Define and fit OneHotEncoder

# X_train_cat_ohe  = ohe.transform(X_train_cat) # transform the train set
# X_test_cat_ohe  = ohe.transform(X_test_cat) # transform the test set

In [None]:
# X_train_cat_ohe

- It's a sparse matrix. 
- Why? What would happen if we pass `sparse_output=False`? Why might we want to do that?

In [None]:
# Get the OHE feature names 
# ohe_feats = ohe.get_feature_names_out().tolist()
# pd.DataFrame(X_train_cat_ohe, columns = ohe_feats)

In [None]:
# cross_val_score(knn, X_train_cat_ohe, y_train)

Are we breaking the golden rule here? Let's do this properly with a pipeline. 

In [None]:
# Code to create a pipeline for OHE and KNN
# pipe_ohe_knn = None

In [None]:
# cross_val_score(pipe_ohe_knn, X_train_cat, y_train)

- What's wrong here? 
- How can we fix this? 

In [None]:
# Fix the OHE

# pipe_ohe_knn = 

In [None]:
# cross_val_score(pipe_ohe_knn, X_train_cat, y_train)

Right now we are working with numeric and categorical features separately. But ideally when we create a model, we need to use all these features together. 

**Enter column transformer!**

How can we horizontally stack (place side by side, as columns)  
- preprocessed numeric features
- preprocessed binary features, and  
- preprocessed categorical features?

Let's define a column transformer. 

In [None]:
from sklearn.compose import make_column_transformer

numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
binary_transformer = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(drop="if_binary"))
categorical_transformer = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(handle_unknown="ignore", sparse_output=False))

# preprocessor = None

What does the transformed data look like?

In [None]:
# transformed = preprocessor.fit_transform(X_train)

In [None]:
# preprocessor

In [None]:
# Getting feature names from a column transformer
# ohe_feat_names = preprocessor.named_transformers_['pipeline-3']['onehotencoder'].get_feature_names_out(categorical_feats).tolist()
# ohe_feat_names

In [None]:
# feat_names = numeric_feats + binary_feats + ohe_feat_names

In [None]:
# pd.DataFrame(preprocessor.fit_transform(X_train), columns = feat_names)

We have new columns for the categorical features. Let's create a pipeline with the preprocessor and SVC. 

In [None]:
# svc_num_cat_pipe = make_pipeline(preprocessor, SVC())
# cross_val_score(svc_num_cat_pipe, X_train, y_train).mean()

We are getting better results! 
<br><br><br>

### Incorporating text features 

We haven't incorporated the comments feature into our pipeline yet, even though it holds significant value in indicating whether the restaurant was liked or not.

In [None]:
X_train

Let's create a bag-of-words representation of the `comments` feature. But first we need to impute the rows where there are no comments. There is a small complication if we want to put `SimpleImputer` and `CountVectorizer` in a pipeline. 
- `SimpleImputer` takes a 2D array as input and produces a 2D array as output. 
- `CountVectorizer` takes a 1D array as input. 

To deal with this, we will use sklearn's `FunctionTransformer` to convert the 2D output of `SimpleImputer` into a 1D array which can be passed to `CountVectorizer` as input. 
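
The shape mismatch can be seen step by step in the sketch below, which uses three made-up comments (an assumption, not the survey data) and applies the steps by hand rather than through a pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer

# Made-up text column with one missing comment
toy_comments = pd.DataFrame({"comments": ["great food", np.nan, "slow service great view"]})

# SimpleImputer returns a 2D array with one column
imputed = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(toy_comments)
print(imputed.shape)

# squeeze() drops the length-1 axis, giving a 1D array of strings
flat = imputed.squeeze()
print(flat.shape)

# CountVectorizer expects a 1D sequence of documents
bow = CountVectorizer().fit_transform(flat)
print(bow.shape)
```

One caveat with `squeeze()`: on an input with a single row it would also drop the row axis, so `ravel()` is a possible alternative that always returns a 1D array.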

In [None]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

reshape_for_countvectorizer = FunctionTransformer(lambda X: X.squeeze(), validate=False)
text_transformer = make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"), 
                          reshape_for_countvectorizer, 
                          CountVectorizer(max_features=100, stop_words="english"))
text_pipe = make_pipeline(text_transformer, SVC())
cross_val_score(text_pipe, X_train[['comments']], y_train).mean()

Pretty good scores just with text features! Do we get better scores if we combine all features? Let's define a column transformer which carries out 
- imputation and scaling on numeric features
- imputation and one-hot encoding with `drop="if_binary"` on binary features
- imputation and one-hot encoding with `handle_unknown="ignore"` on categorical features
- imputation, reshaping, and bag-of-words transformation on the text feature

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
text_feat = ['comments']
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_feats),
    (binary_transformer, binary_feats),    
    (categorical_transformer, categorical_feats),
    (text_transformer, text_feat)
)

In [None]:
preprocessor.fit_transform(X_train)

In [None]:
svc_num_cat_text_pipe = make_pipeline(preprocessor, SVC())
cross_val_score(svc_num_cat_text_pipe, X_train, y_train).mean()

No big improvement when we combine all features. 