![](img/330-banner.png)

# Lecture 6: `sklearn` `ColumnTransformer` and Text Features

UBC 2023 Summer

Instructor: Mehrdad Oveisi

## Imports

In [1]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML

sys.path.append("code/.")
from plotting_functions import *
from utils import *

pd.set_option("display.max_colwidth", 200)

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

<br><br><br><br>

### Quick recap

- Types of data for our purposes
  - Categorical:
    - Nominal (sometimes just called *categorical*!), Ordinal
  - Numerical:
    - Discrete, Continuous

<br><br>

## Learning outcomes 

From this lecture, you will be able to 

- use `ColumnTransformer` to build all our transformations together into one object and use it with `sklearn` pipelines;  
- define `ColumnTransformer` where transformers contain more than one step;
- explain the `handle_unknown="ignore"` hyperparameter of `scikit-learn`'s `OneHotEncoder`;
- explain the `drop="if_binary"` argument of `OneHotEncoder`;
- identify when it's appropriate to apply ordinal encoding vs one-hot encoding;
- explain strategies to deal with categorical variables with too many categories; 
- explain why text data needs a different treatment than categorical variables;
- use `scikit-learn`'s `CountVectorizer` to encode text data;
- explain different hyperparameters of `CountVectorizer`;
- incorporate text features in a machine learning pipeline.

## sklearn's [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

- In most applications, some features are categorical, some are continuous, some are binary, and some are ordinal. 

- When we develop supervised machine learning pipelines on real-world datasets, we very often want to apply **different transformations to different columns**.

- Enter `sklearn`'s `ColumnTransformer`!! 

- Let's look at a toy example: 

In [2]:
df = pd.read_csv("data/quiz2-grade-toy-col-transformer.csv")
df

Unnamed: 0,enjoy_course,ml_experience,major,class_attendance,university_years,lab1,lab2,lab3,lab4,quiz1,quiz2
0,yes,1,Computer Science,Excellent,3,92,93.0,84,91,92,A+
1,yes,1,Mechanical Engineering,Average,2,94,90.0,80,83,91,not A+
2,yes,0,Mathematics,Poor,3,78,85.0,83,80,80,not A+
3,no,0,Mathematics,Excellent,3,91,,92,91,89,A+
4,yes,0,Psychology,Good,4,77,83.0,90,92,85,A+
5,no,1,Economics,Good,5,70,73.0,68,74,71,not A+
6,yes,1,Computer Science,Excellent,4,80,88.0,89,88,91,A+
7,no,0,Mechanical Engineering,Poor,3,95,93.0,69,79,75,not A+
8,no,0,Linguistics,Average,2,97,90.0,94,82,80,not A+
9,yes,1,Mathematics,Average,4,95,82.0,94,94,85,not A+


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   enjoy_course      21 non-null     object 
 1   ml_experience     21 non-null     int64  
 2   major             21 non-null     object 
 3   class_attendance  21 non-null     object 
 4   university_years  21 non-null     int64  
 5   lab1              21 non-null     int64  
 6   lab2              19 non-null     float64
 7   lab3              21 non-null     int64  
 8   lab4              21 non-null     int64  
 9   quiz1             21 non-null     int64  
 10  quiz2             21 non-null     object 
dtypes: float64(1), int64(6), object(4)
memory usage: 1.9+ KB


### Transformations on the toy data

In [4]:
df.head(3)

Unnamed: 0,enjoy_course,ml_experience,major,class_attendance,university_years,lab1,lab2,lab3,lab4,quiz1,quiz2
0,yes,1,Computer Science,Excellent,3,92,93.0,84,91,92,A+
1,yes,1,Mechanical Engineering,Average,2,94,90.0,80,83,91,not A+
2,yes,0,Mathematics,Poor,3,78,85.0,83,80,80,not A+


So what **transformations are needed**?
- Scaling on numeric features
- One-hot encoding on the categorical feature `major` and binary feature `enjoy_course`
- Ordinal encoding on the ordinal feature `class_attendance`
- Imputation on the `lab2` feature
- None on the `ml_experience` feature

### `ColumnTransformer` example

#### Data

In [5]:
X = df.drop(columns=["quiz2"])
y = df["quiz2"]
X.columns

Index(['enjoy_course', 'ml_experience', 'major', 'class_attendance',
       'university_years', 'lab1', 'lab2', 'lab3', 'lab4', 'quiz1'],
      dtype='object')

#### Identify the transformations we want to apply

In [6]:
X.head()

Unnamed: 0,enjoy_course,ml_experience,major,class_attendance,university_years,lab1,lab2,lab3,lab4,quiz1
0,yes,1,Computer Science,Excellent,3,92,93.0,84,91,92
1,yes,1,Mechanical Engineering,Average,2,94,90.0,80,83,91
2,yes,0,Mathematics,Poor,3,78,85.0,83,80,80
3,no,0,Mathematics,Excellent,3,91,,92,91,89
4,yes,0,Psychology,Good,4,77,83.0,90,92,85


In [7]:
numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = [
    "lab2",
    "class_attendance",
    "enjoy_course",
]  # for now, do not include these features in modeling

For simplicity, let's only focus on scaling and one-hot encoding first. 

#### Create a column transformer

- Each transformation is specified by a name, a transformer object, and the columns this transformer should be applied to. 

In [8]:
from sklearn.compose import ColumnTransformer

In [9]:
ct = ColumnTransformer(
    [
        ("scaling", StandardScaler(), numeric_feats),
        # `sparse_output` replaced the deprecated `sparse` argument in scikit-learn 1.2
        ("onehot", OneHotEncoder(sparse_output=False), categorical_feats),
    ]
)

#### Convenient `make_column_transformer` syntax

- Similar to the `make_pipeline` syntax, there is a convenient `make_column_transformer` syntax. 
- The syntax automatically names each step based on its class. 
- We'll be mostly using this syntax. 

In [10]:
from sklearn.compose import make_column_transformer

ct = make_column_transformer(
    (StandardScaler(), numeric_feats),  # scaling on numeric features
    (OneHotEncoder(), categorical_feats),  # OHE on categorical features
    ("passthrough", passthrough_feats),  # no transformations on the binary features
    ("drop", drop_feats),  # drop the drop features
)

In [11]:
print(ct)

ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                 ['university_years', 'lab1', 'lab3', 'lab4',
                                  'quiz1']),
                                ('onehotencoder', OneHotEncoder(), ['major']),
                                ('passthrough', 'passthrough',
                                 ['ml_experience']),
                                ('drop', 'drop',
                                 ['lab2', 'class_attendance', 'enjoy_course'])])


In [12]:
ct

In [13]:
transformed = ct.fit_transform(X)

- When we `fit_transform`, each transformer is applied to the specified columns and the results of the transformations are concatenated horizontally. 
- A big advantage here is that we build all our transformations together into one object, and that way we're sure we do the same operations to all splits of the data.
- Otherwise we might, for example, do the OHE on both train and test but forget to scale the test data.
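To make the point about splits concrete, here is a minimal sketch on made-up data (the `demo_*` names and values are hypothetical, not the course dataset). The column transformer is fit on the training split only, and the same fitted object is then reused on the test split.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy splits, for illustration only
demo_train = pd.DataFrame({"lab1": [90, 70, 80], "major": ["Math", "Physics", "Math"]})
demo_test = pd.DataFrame({"lab1": [80], "major": ["Physics"]})

demo_ct = make_column_transformer(
    (StandardScaler(), ["lab1"]),  # scaling on the numeric column
    (OneHotEncoder(), ["major"]),  # OHE on the categorical column
)

# Learn the mean/std and the category list from the training split only
train_enc = demo_ct.fit_transform(demo_train)

# Reuse the fitted statistics on the test split: scaling and OHE are both applied
test_enc = demo_ct.transform(demo_test)
print(test_enc)
```

Because both steps live in one object, a single `transform` call applies all of them, so there is no way to scale one split and forget the other.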

#### Let's examine the transformed data

In [14]:
transformed[:2]

array([[-0.09345386,  0.3589134 , -0.21733442,  0.36269995,  0.84002795,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ],
       [-1.07471942,  0.59082668, -0.61420598, -0.85597188,  0.71219761,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  1.        ]])

In [15]:
type(transformed)

numpy.ndarray

***Note*** 
> The returned object is not a dataframe. So there are no column names.

#### Viewing the transformed data as a dataframe

- How can we view our transformed data as a dataframe? 
- We are adding more columns. 
- So the original columns won't directly map to the transformed data. 
- Let's create column names for the transformed data. 

In [16]:
ct.named_transformers_

{'standardscaler': StandardScaler(),
 'onehotencoder': OneHotEncoder(),
 'passthrough': 'passthrough',
 'drop': 'drop'}

In [17]:
# E.g., columns preprocessed by StandardScaler
ct.named_transformers_["standardscaler"].get_feature_names_out()

array(['university_years', 'lab1', 'lab3', 'lab4', 'quiz1'], dtype=object)

In [18]:
# Here are the new columns created by OneHotEncoder
ct.named_transformers_["onehotencoder"].get_feature_names_out()

array(['major_Biology', 'major_Computer Science', 'major_Economics',
       'major_Linguistics', 'major_Mathematics',
       'major_Mechanical Engineering', 'major_Physics',
       'major_Psychology'], dtype=object)

In [19]:
column_names = (
    numeric_feats
    + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
    + passthrough_feats
)
column_names

['university_years',
 'lab1',
 'lab3',
 'lab4',
 'quiz1',
 'major_Biology',
 'major_Computer Science',
 'major_Economics',
 'major_Linguistics',
 'major_Mathematics',
 'major_Mechanical Engineering',
 'major_Physics',
 'major_Psychology',
 'ml_experience']

***Note*** 
> The order of the columns in the transformed data depends upon the order of the features we pass to the `ColumnTransformer` and can be different than the order of the features in the original dataframe.

In [20]:
pd.DataFrame(transformed, columns=column_names).head()

Unnamed: 0,university_years,lab1,lab3,lab4,quiz1,major_Biology,major_Computer Science,major_Economics,major_Linguistics,major_Mathematics,major_Mechanical Engineering,major_Physics,major_Psychology,ml_experience
0,-0.093454,0.358913,-0.217334,0.3627,0.840028,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,-1.074719,0.590827,-0.614206,-0.855972,0.712198,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,-0.093454,-1.26448,-0.316552,-1.312974,-0.693936,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,-0.093454,0.242957,0.576409,0.3627,0.456537,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.887812,-1.380436,0.377973,0.515034,-0.054784,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


#### `ColumnTransformer`: Transformed data

<br>

<img src='./img/column-transformer.png' width="1500">

[Adapted from here](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#37)

#### Training models with transformed data
- We can now pass the `ColumnTransformer` object as a step in a pipeline. 

In [21]:
# Make a pipeline that applies ct to columns and then SVC to the resulting table
pipe = make_pipeline(ct, SVC())
pipe.fit(X, y)
pipe.predict(X)

array(['A+', 'not A+', 'not A+', 'A+', 'A+', 'not A+', 'A+', 'not A+',
       'not A+', 'A+', 'A+', 'A+', 'A+', 'A+', 'not A+', 'not A+', 'A+',
       'not A+', 'not A+', 'not A+', 'A+'], dtype=object)

In [22]:
pipe

<br><br><br><br>

## ❓❓ Questions for you

iClicker join links

- CPSC 330 **911**
  - https://join.iclicker.com/LFDB
- CPSC 330 **912**
  - https://join.iclicker.com/GJMY

### iClicker Exercise 6.1 

**Select all of the following statements which are TRUE.**

1. You could carry out cross-validation by passing a `ColumnTransformer` object to `cross_validate`. 
2. After applying a column transformer, the order of the columns in the transformed data has to be the same as the order of the columns in the original data. 
3. After applying a column transformer, the transformed data is always going to be of different shape than the original data. 
4. When you call `fit_transform` on a `ColumnTransformer` object, you get a numpy ndarray. 

<br><br><br><br>

### Exercise 6.2
#### What transformations on what columns? 
Consider the feature columns below. 

- What transformations would you apply on each column? 

| colour  | location  |  shape |  water_content | weight |
|-----|-----|-----|-----|-----|
|   red   |   canada  |    NaN  |       84  |        100 |
| yellow  |   mexico  |   long  |       75  |        120 |
| orange  |   spain   |    NaN  |       90  |        NaN |
| magenta |    china  |    round|       NaN |        600 |
| purple  |  austria  |    NaN  |       80  |        115 |
| purple  |  turkey   |   oval  |       78  |        340 |
| green   |  mexico   |   oval  |       83  |        NaN |
| blue    | canada     | round  |      73   |       535  |
| brown   |  china     |   NaN  |       NaN |       1743 | 
| yellow  |  mexico    |  oval  |       83  |        265 |


<br><br><br><br>

## More on feature transformations

### `sklearn` `set_config`

- With multiple transformations in a column transformer, it can get tricky to keep track of everything happening inside it.  
- We can use `set_config` to display a diagram of this. 

In [23]:
from sklearn import set_config

set_config(display="diagram")

In [24]:
ct

In [25]:
print(ct)

ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                 ['university_years', 'lab1', 'lab3', 'lab4',
                                  'quiz1']),
                                ('onehotencoder', OneHotEncoder(), ['major']),
                                ('passthrough', 'passthrough',
                                 ['ml_experience']),
                                ('drop', 'drop',
                                 ['lab2', 'class_attendance', 'enjoy_course'])])


### Multiple transformations in a transformer

- Recall that `lab2` has missing values. 


In [26]:
X.head()

Unnamed: 0,enjoy_course,ml_experience,major,class_attendance,university_years,lab1,lab2,lab3,lab4,quiz1
0,yes,1,Computer Science,Excellent,3,92,93.0,84,91,92
1,yes,1,Mechanical Engineering,Average,2,94,90.0,80,83,91
2,yes,0,Mathematics,Poor,3,78,85.0,83,80,80
3,no,0,Mathematics,Excellent,3,91,,92,91,89
4,yes,0,Psychology,Good,4,77,83.0,90,92,85


- So we would like to apply more than one transformation to it: imputation and scaling.  
- We can treat `lab2` separately, but we can also include it in `numeric_feats` and apply both transformations to all numeric columns.

In [27]:
numeric_feats = [
    "university_years",
    "lab1",
    "lab2",
    "lab3",
    "lab4",
    "quiz1",
]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = ["class_attendance", "enjoy_course"]

- To apply more than one transformation, we can define a pipeline inside a column transformer to **chain different transformations**.

In [28]:
ct = make_column_transformer(
    (
        make_pipeline(SimpleImputer(), StandardScaler()),
        numeric_feats,
    ),  # scaling on numeric features
    (OneHotEncoder(), categorical_feats),  # OHE on categorical features
    ("passthrough", passthrough_feats),  # no transformations on the binary features
    ("drop", drop_feats),  # drop the drop features
)

In [29]:
print(ct)

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer()),
                                                 ('standardscaler',
                                                  StandardScaler())]),
                                 ['university_years', 'lab1', 'lab2', 'lab3',
                                  'lab4', 'quiz1']),
                                ('onehotencoder', OneHotEncoder(), ['major']),
                                ('passthrough', 'passthrough',
                                 ['ml_experience']),
                                ('drop', 'drop',
                                 ['class_attendance', 'enjoy_course'])])


In [30]:
ct

In [31]:
X_transformed = ct.fit_transform(X)

In [32]:
column_names = (
    numeric_feats
    + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
    + passthrough_feats
)
column_names

['university_years',
 'lab1',
 'lab2',
 'lab3',
 'lab4',
 'quiz1',
 'major_Biology',
 'major_Computer Science',
 'major_Economics',
 'major_Linguistics',
 'major_Mathematics',
 'major_Mechanical Engineering',
 'major_Physics',
 'major_Psychology',
 'ml_experience']

In [33]:
pd.DataFrame(X_transformed, columns=column_names).head()

Unnamed: 0,university_years,lab1,lab2,lab3,lab4,quiz1,major_Biology,major_Computer Science,major_Economics,major_Linguistics,major_Mathematics,major_Mechanical Engineering,major_Physics,major_Psychology,ml_experience
0,-0.093454,0.358913,0.89326,-0.217334,0.3627,0.840028,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,-1.074719,0.590827,0.294251,-0.614206,-0.855972,0.712198,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,-0.093454,-1.26448,-0.704099,-0.316552,-1.312974,-0.693936,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,-0.093454,0.242957,0.0,0.576409,0.3627,0.456537,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.887812,-1.380436,-1.103439,0.377973,0.515034,-0.054784,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
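Notice that row 3, whose `lab2` value was missing, now shows `0.0`. This is what we would expect: `SimpleImputer` uses `strategy="mean"` by default, and `StandardScaler` then maps the column mean to zero. A minimal sketch on a few made-up values (the `demo_*` names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column with one missing value, for illustration only
demo_lab = pd.DataFrame({"lab2": [93.0, 90.0, np.nan, 85.0]})

# Step 1 fills the NaN with the column mean; step 2 subtracts the mean and divides by the std
demo_num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
demo_scaled = demo_num_pipe.fit_transform(demo_lab)

# The imputed entry sits exactly at the mean, so it ends up (numerically) at zero
print(demo_scaled.ravel())
```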


<br><br>

<br><br>

### Incorporating ***ordinal*** feature `class_attendance` 

- The `class_attendance` column is different than the `major` column in that there is some ordering of the values. 
    - Excellent > Good > Average > Poor

In [34]:
X.head()

Unnamed: 0,enjoy_course,ml_experience,major,class_attendance,university_years,lab1,lab2,lab3,lab4,quiz1
0,yes,1,Computer Science,Excellent,3,92,93.0,84,91,92
1,yes,1,Mechanical Engineering,Average,2,94,90.0,80,83,91
2,yes,0,Mathematics,Poor,3,78,85.0,83,80,80
3,no,0,Mathematics,Excellent,3,91,,92,91,89
4,yes,0,Psychology,Good,4,77,83.0,90,92,85


Let's try applying `OrdinalEncoder` on `class_attendance` column.

In [35]:
X_toy = X[["class_attendance"]]
enc = OrdinalEncoder()
enc.fit(X_toy)
X_toy_ord = enc.transform(X_toy)
X_toy_ord_df = pd.DataFrame(
    data=X_toy_ord,
    columns=["class_attendance_ord"],
    index=X_toy.index,
)

In [36]:
X_toy.join(X_toy_ord_df).head(10)

Unnamed: 0,class_attendance,class_attendance_ord
0,Excellent,1.0
1,Average,0.0
2,Poor,3.0
3,Excellent,1.0
4,Good,2.0
5,Good,2.0
6,Excellent,1.0
7,Poor,3.0
8,Average,0.0
9,Average,0.0


- What's the problem here? 
    - The encoder doesn't know the order. 
- We can examine unique categories **manually, order them based on our intuitions**, and then provide this human knowledge to the transformer. 

What are the unique categories of `class_attendance`? 

In [37]:
X_toy["class_attendance"].unique()

array(['Excellent', 'Average', 'Poor', 'Good'], dtype=object)

Let's order them manually. 

In [38]:
class_attendance_levels = ["Poor", "Average", "Good", "Excellent"]

***Note*** 
> Listing the categories in the reverse order (from `Excellent` down to `Poor`) would work just as well; what matters is that the list reflects the true ordering.

Let's make sure that we have included all categories in our manual ordering.  

In [39]:
assert set(class_attendance_levels) == set(X_toy["class_attendance"].unique())

In [40]:
oe = OrdinalEncoder(categories=[class_attendance_levels], dtype=int)
oe.fit(X_toy[["class_attendance"]])
ca_ord = oe.transform(X_toy[["class_attendance"]])
ca_ord_df = pd.DataFrame(
    data=ca_ord, columns=["class_attendance_ord"], index=X_toy.index
)
print(oe.categories_)
X_toy.join(ca_ord_df).head(10)

[array(['Poor', 'Average', 'Good', 'Excellent'], dtype=object)]


Unnamed: 0,class_attendance,class_attendance_ord
0,Excellent,3
1,Average,1
2,Poor,0
3,Excellent,3
4,Good,2
5,Good,2
6,Excellent,3
7,Poor,0
8,Average,1
9,Average,1


The encoded categories are looking better now! 

#### More than one ordinal column?

- We can pass the manually ordered categories when we create an `OrdinalEncoder` object as a list of lists. 
- If you have more than one ordinal column
    - manually create a list of ordered categories for each column
    - pass a list of lists to `OrdinalEncoder`, where each inner list is the manually ordered list of categories for the corresponding ordinal column. 
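Here is a minimal sketch of the list-of-lists form. The second ordinal column, `participation`, and its levels are hypothetical and are not part of the toy dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data with two ordinal columns, for illustration only
demo_ord = pd.DataFrame(
    {
        "class_attendance": ["Excellent", "Poor", "Good", "Average"],
        "participation": ["high", "low", "medium", "low"],
    }
)

attendance_levels = ["Poor", "Average", "Good", "Excellent"]
participation_levels = ["low", "medium", "high"]

# One inner list per column, in the same order as the columns passed to fit
demo_oe = OrdinalEncoder(
    categories=[attendance_levels, participation_levels], dtype=int
)
demo_encoded = demo_oe.fit_transform(demo_ord)
print(demo_encoded)
```

The inner lists are matched to columns by position, so their order must follow the column order of the data.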
    

Now let's incorporate ordinal encoding of `class_attendance` in our column transformer. 

In [41]:
X.head()

Unnamed: 0,enjoy_course,ml_experience,major,class_attendance,university_years,lab1,lab2,lab3,lab4,quiz1
0,yes,1,Computer Science,Excellent,3,92,93.0,84,91,92
1,yes,1,Mechanical Engineering,Average,2,94,90.0,80,83,91
2,yes,0,Mathematics,Poor,3,78,85.0,83,80,80
3,no,0,Mathematics,Excellent,3,91,,92,91,89
4,yes,0,Psychology,Good,4,77,83.0,90,92,85


In [42]:
numeric_feats = [
    "university_years",
    "lab1",
    "lab2",
    "lab3",
    "lab4",
    "quiz1",
]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
ordinal_feats = ["class_attendance"]  # apply ordinal encoding     # <-- here
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = ["enjoy_course"]  # do not include these features

In [43]:
ct = make_column_transformer(
    (
        make_pipeline(SimpleImputer(), StandardScaler()),
        numeric_feats,
    ),  # scaling on numeric features
    (OneHotEncoder(), categorical_feats),  # OHE on categorical features
    (
        OrdinalEncoder(categories=[class_attendance_levels], dtype=int),   # <-- here
        ordinal_feats,
    ),  # Ordinal encoding on ordinal features
    ("passthrough", passthrough_feats),  # no transformations on the binary features
    ("drop", drop_feats),  # drop the drop features
)

In [44]:
ct

In [45]:
X_transformed = ct.fit_transform(X)

In [46]:
column_names = (
    numeric_feats
    + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
    + ordinal_feats
    + passthrough_feats
)
column_names

['university_years',
 'lab1',
 'lab2',
 'lab3',
 'lab4',
 'quiz1',
 'major_Biology',
 'major_Computer Science',
 'major_Economics',
 'major_Linguistics',
 'major_Mathematics',
 'major_Mechanical Engineering',
 'major_Physics',
 'major_Psychology',
 'class_attendance',
 'ml_experience']

In [47]:
pd.DataFrame(X_transformed, columns=column_names)

Unnamed: 0,university_years,lab1,lab2,lab3,lab4,quiz1,major_Biology,major_Computer Science,major_Economics,major_Linguistics,major_Mathematics,major_Mechanical Engineering,major_Physics,major_Psychology,class_attendance,ml_experience
0,-0.093454,0.358913,0.89326,-0.217334,0.3627,0.840028,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0
1,-1.074719,0.590827,0.294251,-0.614206,-0.855972,0.712198,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
2,-0.093454,-1.26448,-0.704099,-0.316552,-1.312974,-0.693936,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,-0.093454,0.242957,0.0,0.576409,0.3627,0.456537,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,3.0,0.0
4,0.887812,-1.380436,-1.103439,0.377973,0.515034,-0.054784,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0
5,1.869077,-2.192133,-3.100139,-1.804821,-2.226978,-1.844409,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
6,0.887812,-1.032566,-0.105089,0.278755,-0.094302,0.712198,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0
7,-0.093454,0.706783,0.89326,-1.705603,-1.465308,-1.333088,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
8,-1.074719,0.938697,0.294251,0.774844,-1.008306,-0.693936,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
9,0.887812,0.706783,-1.303109,0.774844,0.819702,-0.054784,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0


<br><br><br><br>

### Dealing with unknown categories

How does `OneHotEncoder` deal with unknown categories? Let's see an example:

In [48]:
X_toy = [['science', 10], ['arts', 30], ['arts', 20]]
columns=['subject', 'group']
pd.DataFrame(X_toy, columns=columns)

Unnamed: 0,subject,group
0,science,10
1,arts,30
2,arts,20


In [49]:
ohe = OneHotEncoder(handle_unknown='error')  # default value for handle_unknown is 'error'
ohe.fit(X_toy);

In [50]:
columns_ohe = ohe.get_feature_names_out(['subject', 'group']).tolist()
columns_ohe

['subject_arts', 'subject_science', 'group_10', 'group_20', 'group_30']

In [51]:
ohe.categories_

[array(['arts', 'science'], dtype=object), array([10, 20, 30], dtype=object)]

In [52]:
ex1 = ohe.transform([['arts', 10], ['science', 30]]).toarray()
ex1

array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1.]])

In [53]:
pd.DataFrame(ex1, columns=columns_ohe)

Unnamed: 0,subject_arts,subject_science,group_10,group_20,group_30
0,1.0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0,1.0


In [54]:
# ex2 = ohe.transform([['arts', 10], ['science', 4]]).toarray()

# This would give an error:
# ValueError: Found unknown categories [4] in column 1 during transform

In [55]:
ohe = OneHotEncoder(handle_unknown='ignore')  # now use 'ignore' instead of 'error'
ohe.fit(X_toy);

In [56]:
ex2 = ohe.transform([['arts', 10], ['science', 4]]).toarray()
ex2

array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [57]:
pd.DataFrame(ex2, columns=columns_ohe)  # all "group" columns are 0 for value 4

Unnamed: 0,subject_arts,subject_science,group_10,group_20,group_30
0,1.0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0


In [58]:
ex3 = ohe.inverse_transform([[0, 1, 1, 0, 0], [1, 0, 0, 0, 0]])
ex3

array([['science', 10],
       ['arts', None]], dtype=object)

In [59]:
pd.DataFrame(ex3, columns=columns)

Unnamed: 0,subject,group
0,science,10.0
1,arts,


In [60]:
ex4 = ohe.inverse_transform([[0, 1, 0, 0, 1], [0, 0, 0, 1, 0]])
ex4

array([['science', 30],
       [None, 20]], dtype=object)

In [61]:
pd.DataFrame(ex4, columns=columns)

Unnamed: 0,subject,group
0,science,30
1,,20


What if we know the possible categories beforehand? We can **specify categories** ahead of time.

In [62]:
ohe = OneHotEncoder(handle_unknown='error', categories=[['arts', 'science'], [10, 20, 30, 4]])
ohe.fit(X_toy);

Even though `handle_unknown='error'`, `ex2` no longer raises an error because the value `4` is now among the specified `categories`.

In [63]:
ex2 = ohe.transform([['arts', 10], ['science', 4]]).toarray()
ex2

array([[1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1.]])

In [64]:
columns_ohe = ohe.get_feature_names_out(['subject', 'group']).tolist()
pd.DataFrame(ex2, columns=columns_ohe)

Unnamed: 0,subject_arts,subject_science,group_10,group_20,group_30,group_4
0,1.0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0


<br><br><br><br>

###  Dealing with unknown categories in `cross_validate`

Let's create a pipeline with the column transformer and pass it to `cross_validate`. 

In [65]:
ct

In [66]:
pipe = make_pipeline(ct, SVC())

In [67]:
scores = cross_validate(pipe, X, y, return_train_score=True)
pd.DataFrame(scores)

Traceback (most recent call last):
  File "/home/mehrdad/miniconda3/envs/cpsc330/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/mehrdad/miniconda3/envs/cpsc330/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 429, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/home/mehrdad/miniconda3/envs/cpsc330/lib/python3.10/site-packages/sklearn/pipeline.py", line 695, in score
    Xt = transform.transform(Xt)
  File "/home/mehrdad/miniconda3/envs/cpsc330/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py", line 763, in transform
    Xs = self._fit_transform(
  File "/home/mehrdad/miniconda3/envs/cpsc330/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py", line 621, in _fit_transform
    return Parallel(n_jobs=self.n_jobs)(
  File "/home/mehrdad/miniconda3/envs/cpsc330/lib/python3.10/site-packages/joblib/parallel.py", li

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.009845,0.005134,1.0,0.9375
1,0.008087,0.004562,1.0,0.941176
2,0.008173,0.004656,0.5,1.0
3,0.016997,0.006691,0.75,0.941176
4,0.014113,0.009075,,1.0


- What's going on here??
- Let's look at the error message:
  - `ValueError: Found unknown categories ['Biology'] in column 0 during transform`
  - Same error that we got in the `OneHotEncoder` example above (`ex2`)

In [68]:
X["major"].value_counts()

Computer Science          4
Mathematics               4
Mechanical Engineering    3
Psychology                3
Economics                 2
Linguistics               2
Physics                   2
Biology                   1
Name: major, dtype: int64

- There is only <u>one</u> instance of Biology.
- During cross-validation, this is getting <u>put into the validation split</u>.
- By default, `OneHotEncoder` throws an error because you might want to know about this.

Simplest fix:
- Pass `handle_unknown="ignore"` argument to `OneHotEncoder`
- An unknown category is then encoded as all zeros across that feature's one-hot columns (as we saw in the example above)

In [69]:
ct = make_column_transformer(
    (
        make_pipeline(SimpleImputer(), StandardScaler()),
        numeric_feats,
    ),  # scaling on numeric features
    (
        OneHotEncoder(handle_unknown="ignore"),  # <-- here
        categorical_feats,
    ),  # OHE on categorical features
    (
        OrdinalEncoder(categories=[class_attendance_levels], dtype=int),
        ordinal_feats,
    ),  # Ordinal encoding on ordinal features
    ("passthrough", passthrough_feats),  # no transformations on the binary features
    ("drop", drop_feats),  # drop the drop features
)

In [70]:
ct

In [71]:
pipe = make_pipeline(ct, SVC())

In [72]:
scores = cross_validate(pipe, X, y, cv=5, return_train_score=True)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.010405,0.005089,1.0,0.9375
1,0.010239,0.004375,1.0,0.941176
2,0.014121,0.006747,0.5,1.0
3,0.012362,0.006277,0.75,0.941176
4,0.010983,0.006147,0.75,1.0


- With this approach, **all unknown categories will be represented with all zeros** and cross-validation is running OK now. 

Ask yourself the following questions when you work with categorical variables   
- Do you want this behaviour? 
- Are you expecting to get many unknown categories? Do you want to be able to distinguish between them?

<br><br><br><br>
<hr>

#### Cases where it's OK to break the golden rule 

We saw above that if we **know the categories beforehand**, we can specify them to `OneHotEncoder` to avoid errors during cross-validation. However, wouldn't that be ***breaking the golden rule***?

Knowing the categories in advance is **one of the cases where it might be OK to violate the golden rule** and get a list of all possible values for the categorical variable. 

For example, this applies when there is a fixed number of categories, such as:
  -  provinces in Canada or 
  -  majors taught at UBC. 
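The sketch below shows what this could look like inside cross-validation. The data, the `all_majors` list, and the use of `DummyClassifier` are assumptions made for illustration; the point is only that the encoder no longer fails when a rare category ends up in a validation fold.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Biology" appears once, so it lands in exactly one validation fold
demo_majors = pd.DataFrame(
    {"major": ["Math", "Math", "Physics", "Physics", "Math", "Physics", "Biology", "Math"]}
)
demo_y = ["A+", "not A+", "A+", "not A+", "A+", "not A+", "A+", "not A+"]

# Assumed to be known ahead of time (e.g., an official list of majors)
all_majors = ["Biology", "Math", "Physics"]

demo_ct = make_column_transformer(
    # handle_unknown keeps its default "error"; the fixed category list avoids the failure
    (OneHotEncoder(categories=[all_majors]), ["major"]),
)
demo_pipe = make_pipeline(demo_ct, DummyClassifier())

# error_score="raise" surfaces any failure instead of silently recording NaN
demo_scores = cross_validate(demo_pipe, demo_majors, demo_y, cv=4, error_score="raise")
print(pd.DataFrame(demo_scores))
```

Unlike `handle_unknown="ignore"`, this keeps a dedicated column for every listed category, so a rare category is still distinguishable from the others.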


<hr>
<br><br><br><br>

### Categorical features with only two possible categories (binary)

- Sometimes you have features with only two possible categories. 
- If we apply `OneHotEncoder` on such columns, it'll **create two columns, which seems wasteful**, as we could represent all the information in just one column of 0's and 1's indicating the presence or absence of one of the categories.
- You can pass the `drop="if_binary"` argument to `OneHotEncoder` to create only one column in such a scenario. 
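A quick side-by-side sketch of the two behaviours on a tiny made-up column (the `demo_enjoy` data is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical binary column, for illustration only
demo_enjoy = pd.DataFrame({"enjoy_course": ["yes", "no", "yes"]})

# Default behaviour: one column per category ("no" and "yes")
two_cols = OneHotEncoder().fit_transform(demo_enjoy).toarray()

# With drop="if_binary": the first category is dropped, leaving a single 0/1 column
one_col = OneHotEncoder(drop="if_binary").fit_transform(demo_enjoy).toarray()

print(two_cols.shape)
print(one_col.shape)
```

Features with more than two categories are unaffected by `drop="if_binary"`, so it is safe to use on a mix of binary and non-binary columns.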

In [73]:
X["enjoy_course"].head()

0    yes
1    yes
2    yes
3     no
4    yes
Name: enjoy_course, dtype: object

In [74]:
ohe_enc = OneHotEncoder(drop="if_binary", dtype=int, sparse_output=False)  # `sparse_output` replaced `sparse` in scikit-learn 1.2
ohe_enc.fit(X[["enjoy_course"]])
transformed = ohe_enc.transform(X[["enjoy_course"]])
df = pd.DataFrame(data=transformed, columns=["enjoy_course_enc"], index=X.index)
X[["enjoy_course"]].join(df).head(10)

Unnamed: 0,enjoy_course,enjoy_course_enc
0,yes,1
1,yes,1
2,yes,1
3,no,0
4,yes,1
5,no,0
6,yes,1
7,no,0
8,no,0
9,yes,1


In [75]:
numeric_feats = [
    "university_years",
    "lab1",
    "lab2",
    "lab3",
    "lab4",
    "quiz1",
]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
ordinal_feats = ["class_attendance"]  # apply ordinal encoding
binary_feats = ["enjoy_course"]  # apply one-hot encoding with drop="if_binary"  # <-- here
passthrough_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = []

In [76]:
ct = make_column_transformer(
    (
        make_pipeline(SimpleImputer(), StandardScaler()),
        numeric_feats,
    ),  # scaling on numeric features
    (
        OneHotEncoder(handle_unknown="ignore"),
        categorical_feats,
    ),  # OHE on categorical features
    (
        OrdinalEncoder(categories=[class_attendance_levels], dtype=int),
        ordinal_feats,
    ),  # Ordinal encoding on ordinal features
    (
        OneHotEncoder(drop="if_binary", dtype=int),    # <-- here
        binary_feats,
    ),  # OHE with drop="if_binary" on binary features
    ("passthrough", passthrough_feats),  # no transformations on the passthrough features
)

In [77]:
ct

In [78]:
pipe = make_pipeline(ct, SVC())

In [79]:
scores = cross_validate(pipe, X, y, cv=5, return_train_score=True)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.017912,0.010477,1.0,1.0
1,0.018409,0.01148,1.0,0.941176
2,0.01148,0.006578,0.5,1.0
3,0.01452,0.008451,1.0,0.941176
4,0.013578,0.007761,0.75,1.0


***Note***
> Do not read too much into the scores, as we are running cross-validation on a very small dataset with 21 examples. The main point here is to show you how we can use `ColumnTransformer` to apply different transformations on different columns.

## Break (5 min)

![](img/eva-coffee.png)


<br><br><br><br>

## `ColumnTransformer` on the California housing dataset 

In [80]:
housing_df = pd.read_csv("data/housing.csv")
train_df, test_df = train_test_split(housing_df, test_size=0.1, random_state=123)

train_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
6051,-117.75,34.04,22.0,2948.0,636.0,2600.0,602.0,3.125,113600.0,INLAND
20113,-119.57,37.94,17.0,346.0,130.0,51.0,20.0,3.4861,137500.0,INLAND
14289,-117.13,32.74,46.0,3355.0,768.0,1457.0,708.0,2.6604,170100.0,NEAR OCEAN
13665,-117.31,34.02,18.0,1634.0,274.0,899.0,285.0,5.2139,129300.0,INLAND
14471,-117.23,32.88,18.0,5566.0,1465.0,6303.0,1458.0,1.858,205000.0,NEAR OCEAN


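Some columns are already summary statistics such as medians (e.g., `housing_median_age`, `median_income`), while others are raw totals (e.g., `total_rooms`, `population`).

Let's add some new features to the dataset which could help predict the target: `median_house_value`. 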

In [81]:
train_df = train_df.assign(
    rooms_per_household=train_df["total_rooms"] / train_df["households"]
)
test_df = test_df.assign(
    rooms_per_household=test_df["total_rooms"] / test_df["households"]
)

train_df = train_df.assign(
    bedrooms_per_household=train_df["total_bedrooms"] / train_df["households"]
)
test_df = test_df.assign(
    bedrooms_per_household=test_df["total_bedrooms"] / test_df["households"]
)

train_df = train_df.assign(
    population_per_household=train_df["population"] / train_df["households"]
)
test_df = test_df.assign(
    population_per_household=test_df["population"] / test_df["households"]
)

In [82]:
train_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,rooms_per_household,bedrooms_per_household,population_per_household
6051,-117.75,34.04,22.0,2948.0,636.0,2600.0,602.0,3.125,113600.0,INLAND,4.89701,1.056478,4.318937
20113,-119.57,37.94,17.0,346.0,130.0,51.0,20.0,3.4861,137500.0,INLAND,17.3,6.5,2.55
14289,-117.13,32.74,46.0,3355.0,768.0,1457.0,708.0,2.6604,170100.0,NEAR OCEAN,4.738701,1.084746,2.05791
13665,-117.31,34.02,18.0,1634.0,274.0,899.0,285.0,5.2139,129300.0,INLAND,5.733333,0.961404,3.154386
14471,-117.23,32.88,18.0,5566.0,1465.0,6303.0,1458.0,1.858,205000.0,NEAR OCEAN,3.817558,1.004801,4.323045


In [83]:
# Let's keep both numeric and categorical columns in the data.
X_train = train_df.drop(columns=["median_house_value"])
y_train = train_df["median_house_value"]

X_test = test_df.drop(columns=["median_house_value"])
y_test = test_df["median_house_value"]

In [84]:
from sklearn.compose import ColumnTransformer, make_column_transformer

In [85]:
X_train.head(10)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,rooms_per_household,bedrooms_per_household,population_per_household
6051,-117.75,34.04,22.0,2948.0,636.0,2600.0,602.0,3.125,INLAND,4.89701,1.056478,4.318937
20113,-119.57,37.94,17.0,346.0,130.0,51.0,20.0,3.4861,INLAND,17.3,6.5,2.55
14289,-117.13,32.74,46.0,3355.0,768.0,1457.0,708.0,2.6604,NEAR OCEAN,4.738701,1.084746,2.05791
13665,-117.31,34.02,18.0,1634.0,274.0,899.0,285.0,5.2139,INLAND,5.733333,0.961404,3.154386
14471,-117.23,32.88,18.0,5566.0,1465.0,6303.0,1458.0,1.858,NEAR OCEAN,3.817558,1.004801,4.323045
9730,-121.74,36.79,16.0,3841.0,620.0,1799.0,611.0,4.3814,<1H OCEAN,6.286416,1.01473,2.944354
14690,-117.09,32.8,36.0,2163.0,367.0,915.0,360.0,4.7188,NEAR OCEAN,6.008333,1.019444,2.541667
7938,-118.11,33.86,33.0,2389.0,410.0,1229.0,393.0,5.3889,<1H OCEAN,6.07888,1.043257,3.127226
18365,-122.12,37.28,21.0,349.0,64.0,149.0,56.0,5.8691,<1H OCEAN,6.232143,1.142857,2.660714
10931,-117.91,33.74,25.0,4273.0,965.0,2946.0,922.0,2.9926,<1H OCEAN,4.63449,1.046638,3.195228


In [86]:
X_train.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity', 'rooms_per_household', 'bedrooms_per_household',
       'population_per_household'],
      dtype='object')

In [87]:
# Identify the categorical and numeric columns
numeric_features = [
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
    "rooms_per_household",
    "bedrooms_per_household",
    "population_per_household",
]

categorical_features = ["ocean_proximity"]
target = "median_house_value"

- Let's create a `ColumnTransformer` for our dataset. 

In [88]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18576 entries, 6051 to 19966
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   longitude                 18576 non-null  float64
 1   latitude                  18576 non-null  float64
 2   housing_median_age        18576 non-null  float64
 3   total_rooms               18576 non-null  float64
 4   total_bedrooms            18391 non-null  float64
 5   population                18576 non-null  float64
 6   households                18576 non-null  float64
 7   median_income             18576 non-null  float64
 8   ocean_proximity           18576 non-null  object 
 9   rooms_per_household       18576 non-null  float64
 10  bedrooms_per_household    18391 non-null  float64
 11  population_per_household  18576 non-null  float64
dtypes: float64(11), object(1)
memory usage: 1.8+ MB


In [89]:
X_train["ocean_proximity"].value_counts()

<1H OCEAN     8221
INLAND        5915
NEAR OCEAN    2389
NEAR BAY      2046
ISLAND           5
Name: ocean_proximity, dtype: int64

In [90]:
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
)

In [91]:
preprocessor

In [92]:
X_train_pp = preprocessor.fit_transform(X_train)

- When we call `fit` on the preprocessor, it calls **`fit` on _all_** the transformers.
- When we call `transform` on the preprocessor, it calls **`transform` on _all_** the transformers. 
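
The sketch below illustrates this on a made-up three-row dataset (the column names and values are assumptions for illustration). Fitting learns the scaler's mean and the encoder's categories from the training rows only, and `transform` then reuses them on a new row.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not the housing dataset.
train = pd.DataFrame(
    {"income": [1.0, 2.0, 3.0], "proximity": ["INLAND", "NEAR BAY", "INLAND"]}
)
new = pd.DataFrame({"income": [2.0], "proximity": ["ISLAND"]})

ct = make_column_transformer(
    (StandardScaler(), ["income"]),
    (OneHotEncoder(handle_unknown="ignore"), ["proximity"]),
    sparse_threshold=0,  # always return a dense array, which is easier to print
)

train_pp = ct.fit_transform(train)  # calls fit and transform on every transformer
new_pp = ct.transform(new)          # only transforms, using what was learned above

print(ct.named_transformers_["standardscaler"].mean_)  # mean learned from the training rows
print(new_pp)
```

The new row's income equals the training mean, so it scales to 0, and its unseen category `ISLAND` becomes all zeros because of `handle_unknown="ignore"`.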

We can get the new names of the columns that were generated by the one-hot encoding:

In [93]:
preprocessor

In [94]:
preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
    categorical_features
)

array(['ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
       'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'], dtype=object)

Combining this with the numeric feature names gives us all the column names:

In [95]:
column_names = numeric_features + list(
    preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
        categorical_features
    )
)
column_names

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'rooms_per_household',
 'bedrooms_per_household',
 'population_per_household',
 'ocean_proximity_<1H OCEAN',
 'ocean_proximity_INLAND',
 'ocean_proximity_ISLAND',
 'ocean_proximity_NEAR BAY',
 'ocean_proximity_NEAR OCEAN']

Let's visualize the preprocessed training data as a dataframe. 

In [96]:
pd.DataFrame(X_train_pp, columns=column_names)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,rooms_per_household,bedrooms_per_household,population_per_household,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,0.908140,-0.743917,-0.526078,0.143120,0.235339,1.026092,0.266135,-0.389736,-0.210591,-0.083813,0.126398,0.0,1.0,0.0,0.0,0.0
1,-0.002057,1.083123,-0.923283,-1.049510,-0.969959,-1.206672,-1.253312,-0.198924,4.726412,11.166631,-0.050132,0.0,1.0,0.0,0.0,0.0
2,1.218207,-1.352930,1.380504,0.329670,0.549764,0.024896,0.542873,-0.635239,-0.273606,-0.025391,-0.099240,0.0,0.0,0.0,0.0,1.0
3,1.128188,-0.753286,-0.843842,-0.459154,-0.626949,-0.463877,-0.561467,0.714077,0.122307,-0.280310,0.010183,0.0,1.0,0.0,0.0,0.0
4,1.168196,-1.287344,-0.843842,1.343085,2.210026,4.269688,2.500924,-1.059242,-0.640266,-0.190617,0.126808,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18571,0.733102,-0.804818,0.586095,-0.875337,-0.243446,-0.822136,-0.966131,-0.118182,0.063110,-0.099558,0.071541,1.0,0.0,0.0,0.0,0.0
18572,1.163195,-1.057793,-1.161606,0.940194,0.609314,0.882438,0.728235,0.357500,0.235096,-0.163397,0.007458,1.0,0.0,0.0,0.0,0.0
18573,-1.097293,0.797355,-1.876574,0.695434,0.433046,0.881563,0.514155,0.934269,0.211892,-0.135305,0.044029,1.0,0.0,0.0,0.0,0.0
18574,-1.437367,1.008167,1.221622,-0.499947,-0.484029,-0.759944,-0.454427,0.006578,-0.273382,-0.149822,-0.132875,0.0,0.0,0.0,1.0,0.0


In [97]:
y_train.to_frame().head()

Unnamed: 0,median_house_value
6051,113600.0
20113,137500.0
14289,170100.0
13665,129300.0
14471,205000.0


In [98]:
results_dict = {}
dummy = DummyRegressor()
results_dict["dummy"] = mean_std_cross_val_scores(
    dummy, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
dummy,0.001 (+/- 0.001),0.000 (+/- 0.000),-0.001 (+/- 0.001),0.000 (+/- 0.000)


In [99]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

knn_pipe = make_pipeline(preprocessor, KNeighborsRegressor())

In [100]:
knn_pipe

In [101]:
results_dict["imp + scaling + ohe + KNN"] = mean_std_cross_val_scores(
    knn_pipe, X_train, y_train, return_train_score=True
)

In [102]:
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
dummy,0.001 (+/- 0.001),0.000 (+/- 0.000),-0.001 (+/- 0.001),0.000 (+/- 0.000)
imp + scaling + ohe + KNN,0.048 (+/- 0.006),0.128 (+/- 0.014),0.721 (+/- 0.012),0.816 (+/- 0.006)


In [103]:
svr_pipe = make_pipeline(preprocessor, SVR())
results_dict["imp + scaling + ohe + SVR (default)"] = mean_std_cross_val_scores(
    svr_pipe, X_train, y_train, return_train_score=True
)

In [104]:
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
dummy,0.001 (+/- 0.001),0.000 (+/- 0.000),-0.001 (+/- 0.001),0.000 (+/- 0.000)
imp + scaling + ohe + KNN,0.048 (+/- 0.006),0.128 (+/- 0.014),0.721 (+/- 0.012),0.816 (+/- 0.006)
imp + scaling + ohe + SVR (default),12.697 (+/- 1.100),3.342 (+/- 0.547),-0.049 (+/- 0.012),-0.049 (+/- 0.001)


The results with `scikit-learn`'s default SVR hyperparameters are pretty bad. 

In [105]:
svr_C_pipe = make_pipeline(preprocessor, SVR(C=10000))
results_dict["imp + scaling + ohe + SVR (C=10000)"] = mean_std_cross_val_scores(
    svr_C_pipe, X_train, y_train, return_train_score=True
)

In [106]:
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
dummy,0.001 (+/- 0.001),0.000 (+/- 0.000),-0.001 (+/- 0.001),0.000 (+/- 0.000)
imp + scaling + ohe + KNN,0.048 (+/- 0.006),0.128 (+/- 0.014),0.721 (+/- 0.012),0.816 (+/- 0.006)
imp + scaling + ohe + SVR (default),12.697 (+/- 1.100),3.342 (+/- 0.547),-0.049 (+/- 0.012),-0.049 (+/- 0.001)
imp + scaling + ohe + SVR (C=10000),14.231 (+/- 3.049),3.306 (+/- 0.200),0.721 (+/- 0.007),0.726 (+/- 0.007)


With a **bigger value for `C`** the results are much **better**. We need to carry out systematic **hyperparameter optimization** to get better results. (Coming up next week.)
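
A rough sketch of the same effect on synthetic data is shown below. The data, the target scale, and the particular `C` values are assumptions chosen for illustration, and this is not a substitute for proper hyperparameter optimization. The targets are on a house-price-like scale, which is where a small `C` tends to leave `SVR` predicting something close to a constant.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data with large-valued targets (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 200_000 + 50_000 * X[:, 0] + 20_000 * X[:, 1] + rng.normal(scale=5_000, size=200)

mean_r2 = {}
for C in [1, 100, 10_000]:
    pipe = make_pipeline(StandardScaler(), SVR(C=C))
    # cross_val_score uses R^2 by default for regressors.
    mean_r2[C] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"C={C:>6}: mean R^2 = {mean_r2[C]:.3f}")
```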

- Note that categorical features are different from free text features. Sometimes there are columns containing free text information, and we'll look at ways to deal with them in the later part of this lecture. 

### OHE with many categories

- Do we have enough data for rare categories to learn anything meaningful? 
- How about grouping them into bigger categories?
    - Example: country names into continents such as "South America" or "Asia"
- Or having "other" category for rare cases? 
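
One possible way to create such an "other" category is sketched below with plain `pandas`. The country values and the threshold of 2 are made up for illustration. In a real pipeline the counts should be computed on the training split only.

```python
import pandas as pd

# Hypothetical training column with two rare categories.
countries = pd.Series(
    ["Canada", "Canada", "India", "India", "India", "Fiji", "Malta"], name="country"
)

counts = countries.value_counts()
rare = counts[counts < 2].index  # categories seen fewer than 2 times

# Keep frequent categories as they are and replace rare ones with "other".
lumped = countries.where(~countries.isin(rare), "other")
print(lumped.value_counts())
```

After lumping, one-hot encoding produces three columns instead of four, and the "other" column has more than one example to learn from.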

### Do we actually want to use certain features for prediction?

- Do you ***want*** to use certain features such as **gender** or **race** in prediction?
- Remember that the systems you build are going to be used in some applications. 
- It's extremely important to be mindful of the consequences of including certain features in your predictive model. 

### Preprocessing the targets?

- Generally no need for this when doing classification. 
- In regression it makes sense in some cases. More on this later. 
- `sklearn` is fine with categorical labels ($y$-values) for classification problems. 
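
As a quick illustration of the last point, the sketch below fits a classifier directly on string labels, with no target encoding step. The four-row dataset is made up.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-feature dataset with string class labels.
X_toy = [[0], [1], [2], [3]]
y_toy = ["non spam", "non spam", "spam", "spam"]

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

print(clf.classes_)                 # the labels are stored as they were given
preds = clf.predict([[0], [3]])
print(preds)                        # predictions come back as the original strings
```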

<br><br><br><br>

## Encoding text data  

In [107]:
toy_spam = [
    [
        "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
        "spam",
    ],
    ["Lol you are always so convincing.", "non spam"],
    ["Nah I don't think he goes to usf, he lives around here though", "non spam"],
    [
        "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
        "spam",
    ],
    [
        "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030",
        "spam",
    ],
    ["Congrats! I can't wait to see you!!", "non spam"],
]
toy_df = pd.DataFrame(toy_spam, columns=["sms", "target"])

### Spam/non spam toy example 

- What if the feature is in the form of raw text?
- The feature **`sms`** below is **neither categorical nor ordinal**. 
- How can we encode it so that we can pass it to the machine learning algorithms we have seen so far? 

In [108]:
toy_df.style.set_properties(**{"text-align": "left"})

Unnamed: 0,sms,target
0,URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!,spam
1,Lol you are always so convincing.,non spam
2,"Nah I don't think he goes to usf, he lives around here though",non spam
3,URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!,spam
4,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,spam
5,Congrats! I can't wait to see you!!,non spam


### What if we apply OHE? 

In [109]:
### DO NOT DO THIS.
enc = OneHotEncoder(sparse=False)
transformed = enc.fit_transform(toy_df[["sms"]])
pd.DataFrame(transformed, columns=enc.categories_)

Unnamed: 0,Congrats! I can't wait to see you!!,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,Lol you are always so convincing.,"Nah I don't think he goes to usf, he lives around here though",URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!,URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!
0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0


- We do **not have a fixed number** of categories here. 
- Each "category" (feature value) is likely to **occur only once** in the training data and we won't learn anything meaningful if we apply one-hot encoding or ordinal encoding on this feature. 

- How can we encode or represent **raw text data as a fixed number of features** so that we can learn some useful patterns from it?  
- This is a well studied problem in the field of ***Natural Language Processing (NLP)***, which is concerned with giving computers the ability to understand written and spoken language. 
- Some popular representations of raw text include: 
    - **Bag of words** 
    - TF-IDF
    - Embedding representations 

### Bag of words (BOW) representation

- One of the most popular representations of raw text 
- Ignores the syntax and word order
- It has two components: 
    - The vocabulary (all unique words in all documents) 
    - A value indicating either the presence or absence or the count of each word in the document. 
        
<center>
<img src='./img/bag-of-words.png' width="600">
</center>

[Source](https://web.stanford.edu/~jurafsky/slp3/4.pdf)       
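
The two components can be sketched by hand in a few lines of plain Python. This is a simplified illustration with naive whitespace tokenization and two made-up documents, not how `scikit-learn` implements it.

```python
from collections import Counter

# Two hypothetical documents.
docs = ["free prize free", "see you soon"]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Component 1: the vocabulary, i.e., all unique words across all documents.
vocab = sorted({word for doc in tokenized for word in doc})

# Component 2: for each document, the count of every vocabulary word.
bow = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

print(vocab)
for row in bow:
    print(row)
```

Each document becomes a vector of the same length (the vocabulary size), no matter how long the document is.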

### Extracting BOW features using `scikit-learn`
- `CountVectorizer`
    - Converts a collection of text documents to a matrix of word counts.  
    - Each row represents a "document" (e.g., a text message in our example). 
    - Each column represents a word in the vocabulary (the set of unique words) in the training data. 
    - Each cell represents how often the word occurs in the document.       

***Note***
> In the Natural Language Processing (NLP) community text data  is referred to as a **corpus** (plural: corpora).

In [110]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X_counts = vec.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec.get_feature_names_out(), index=toy_df["sms"]
)
bow_df

sms,08002986030,100000,11,900,always,are,around,as,been,call,...,update,urgent,usf,valued,wait,week,with,won,you,your
URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!,0,0,0,1,0,0,0,1,1,0,...,0,1,0,1,0,0,0,0,1,0
Lol you are always so convincing.,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
"Nah I don't think he goes to usf, he lives around here though",0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,1,0
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,1,0,1,0,0,0,0,0,0,1,...,2,0,0,0,0,0,1,0,0,1
Congrats! I can't wait to see you!!,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0


### Input to `CountVectorizer.fit_transform`

In [111]:
type(toy_df["sms"])

pandas.core.series.Series

***Important Note***
> Unlike with other transformers, we are **passing a `Series`** object to `fit_transform`. With other transformers, you can define one transformer for more than one column. But with `CountVectorizer` you need to define a **separate `CountVectorizer` transformer for each text column** if you have more than one text column.
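
A minimal sketch of the "one `CountVectorizer` per text column" idea is shown below. The `subject` and `body` columns are hypothetical. Each column name is passed as a plain string so that the vectorizer receives a 1-D `Series`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical dataframe with two free-text columns.
df_text = pd.DataFrame(
    {
        "subject": ["free prize", "lunch today"],
        "body": ["claim your prize now", "see you at noon"],
    }
)

ct_text = make_column_transformer(
    (CountVectorizer(), "subject"),  # vocabulary learned from `subject` only
    (CountVectorizer(), "body"),     # separate vocabulary learned from `body` only
)
X_text = ct_text.fit_transform(df_text)

# 4 words from `subject` plus 8 words from `body` give 12 columns in total.
print(X_text.shape)
```

Note that `prize` appears in both columns and gets two separate features, one per vocabulary.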

<br><br><br><br>

### Output of `CountVectorizer.fit_transform`

`fit_transform` has returned a sparse matrix:

In [112]:
X_counts

<6x61 sparse matrix of type '<class 'numpy.int64'>'
	with 71 stored elements in Compressed Sparse Row format>

### Why sparse matrices?

- Most words do not appear in a given document.
- We get massive computational savings if we **only store the nonzero elements**.
- There is a bit of overhead, because we also need to store the locations:
    - e.g. "location (3,27): 1".
    
- However, if the fraction of nonzero elements is small, this is a huge win.
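
To get a feel for the savings, the sketch below compares the memory used by a mostly-zero dense array with its Compressed Sparse Row version. The matrix size and the two nonzero cells are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero matrix with only two nonzero cells.
dense = np.zeros((100, 1000), dtype=np.int64)
dense[3, 27] = 1
dense[42, 7] = 2

sparse = csr_matrix(dense)

# The dense array stores every cell; CSR stores the nonzero values plus their locations.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print("Nonzero elements:", sparse.nnz)
print("Dense bytes:", dense_bytes)
print("Sparse bytes:", sparse_bytes)
```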

In [113]:
print("The number of rows and columns: ", *X_counts.shape)
print("The total number of elements: ", np.prod(X_counts.shape))
print("The number of non-zero elements: ", X_counts.nnz)
print(
    "Proportion of non-zero elements: %0.4f" % (X_counts.nnz / np.prod(X_counts.shape))
)
print(
    "The value at cell (4,%d) is: %d"
    % (vec.vocabulary_["update"], X_counts[4, vec.vocabulary_["update"]])
)

The number of rows and columns:  6 61
The total number of elements:  366
The number of non-zero elements:  71
Proportion of non-zero elements: 0.1940
The value at cell (4,51) is: 2



Question for you
- What would happen if you apply `StandardScaler` on sparse data? 

### `OneHotEncoder` and sparse features 
- By default, `OneHotEncoder` also creates sparse features. 
- You could set `sparse=False` (renamed to `sparse_output=False` in `scikit-learn` 1.2) to get a regular `numpy` array. 
- If there are a huge number of categories, it may be beneficial to keep them sparse.
- For a smaller number of categories, it doesn't matter much.

### Important hyperparameters of `CountVectorizer` 

- `binary`
    - whether to use absence/presence feature values or counts
- `max_features`
    - only consider top `max_features` ordered by frequency in the corpus
- `max_df`
    - max document frequency, ignore features which occur in more than `max_df` documents 
- `min_df` 
    - min document frequency, ignore features which occur in less than `min_df` documents 
- `ngram_range`
    - consider word sequences in the given range 

Let's look at all features, i.e., words (along with their frequencies).

In [114]:
vec = CountVectorizer()
X_counts = vec.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec.get_feature_names_out(), index=toy_df["sms"]
)
print("Max value: ", bow_df.max().max())
bow_df

Max value:  2


sms,08002986030,100000,11,900,always,are,around,as,been,call,...,update,urgent,usf,valued,wait,week,with,won,you,your
URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!,0,0,0,1,0,0,0,1,1,0,...,0,1,0,1,0,0,0,0,1,0
Lol you are always so convincing.,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
"Nah I don't think he goes to usf, he lives around here though",0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,1,0
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,1,0,1,0,0,0,0,0,0,1,...,2,0,0,0,0,0,1,0,0,1
Congrats! I can't wait to see you!!,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0


When we use `binary=True`, the representation uses presence/absence of words instead of word counts.   

In [115]:
vec_binary = CountVectorizer(binary=True)
X_counts = vec_binary.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec_binary.get_feature_names_out(), index=toy_df["sms"]
)
print("Max value: ", bow_df.max().max())
bow_df

Max value:  1


sms,08002986030,100000,11,900,always,are,around,as,been,call,...,update,urgent,usf,valued,wait,week,with,won,you,your
URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!,0,0,0,1,0,0,0,1,1,0,...,0,1,0,1,0,0,0,0,1,0
Lol you are always so convincing.,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
"Nah I don't think he goes to usf, he lives around here though",0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,1,0
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,1,0,1,0,0,0,0,0,0,1,...,1,0,0,0,0,0,1,0,0,1
Congrats! I can't wait to see you!!,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0


We can control the size of X (the number of features) using `max_features`.

In [116]:
vec8 = CountVectorizer(max_features=8)
X_counts = vec8.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec8.get_feature_names_out(), index=toy_df["sms"]
)
print("Max value: ", bow_df.max().max())
print(bow_df.max())
bow_df

Max value:  2
free      2
have      1
mobile    2
the       2
to        2
update    2
urgent    1
you       1
dtype: int64


sms,free,have,mobile,the,to,update,urgent,you
URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!,0,1,0,0,1,0,1,1
Lol you are always so convincing.,0,0,0,0,0,0,0,1
"Nah I don't think he goes to usf, he lives around here though",0,0,0,0,1,0,0,0
URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!,1,1,0,0,0,0,1,1
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,2,0,2,2,2,2,0,0
Congrats! I can't wait to see you!!,0,0,0,0,1,0,0,1


***Note (Optional)*** 

> Notice that `vec8` and `vec8_binary` (both shown in the cells below) have different vocabularies, which is somewhat unexpected behaviour and doesn't match the `scikit-learn` documentation.
> 
> [Here](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L1206-L1225) is the code for the `binary=True` condition in `scikit-learn`. As we can see, the binarization is done before limiting the features to `max_features`, so we are actually looking at document counts (in how many documents a word occurs) rather than term counts. This is not explained anywhere in the documentation. 
> 
> The ties in counts between different words make it even more confusing. I don't think it'll have a big impact on the results, but this is good to know! Remember that `scikit-learn` developers are also humans who are prone to making mistakes. So it's always a good habit to question whatever tools we use every now and then.

In [117]:
vec8 = CountVectorizer(max_features=8)
X_counts = vec8.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec8.get_feature_names_out(), index=toy_df["sms"]
)
bow_df.sum().sort_values(ascending=False).rename("counts").to_frame()

Unnamed: 0,counts
to,5
you,4
free,3
have,2
mobile,2
the,2
update,2
urgent,2


In [118]:
vec8_binary = CountVectorizer(binary=True, max_features=8)
X_counts = vec8_binary.fit_transform(toy_df["sms"])
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=vec8_binary.get_feature_names_out(), index=toy_df["sms"]
)
bow_df.sum().sort_values(ascending=False).rename("counts").to_frame()

Unnamed: 0,counts
to,4
you,4
free,2
have,2
prize,2
urgent,2
mobiles,1
months,1


### Preprocessing

- Note that `CountVectorizer` carries out some preprocessing because of its default argument values:
    - Converting words to lowercase (`lowercase=True`)
    - Getting rid of punctuation and special characters, and dropping single-character tokens (`token_pattern ='(?u)\\b\\w\\w+\\b'`)
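
One way to inspect this preprocessing is `build_analyzer()`, which returns the function that `CountVectorizer` applies to each document. The sample message below is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer combines lowercasing and tokenization with the default token_pattern.
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("URGENT!! You have won a £900 prize")
print(tokens)
```

The punctuation and the `£` sign are gone, everything is lowercase, and the one-letter word `a` is dropped because the default pattern only matches tokens of two or more word characters.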


In [119]:
pipe = make_pipeline(CountVectorizer(), SVC())

In [120]:
pipe.fit(toy_df["sms"], toy_df["target"])

In [121]:
pipe.predict(toy_df["sms"]).tolist()

['spam', 'non spam', 'non spam', 'spam', 'spam', 'non spam']

In [122]:
toy_df["target"].tolist()

['spam', 'non spam', 'non spam', 'spam', 'spam', 'non spam']

### Is this a realistic representation of text data? 

- Of course this is not a great representation of language
    - We are throwing out everything we know about language and losing a lot of information. 
    - It assumes that there is **no syntax and compositional meaning** in language.  
- But it **works surprisingly well** for many tasks. 
- We will learn more expressive representations in the coming weeks. 

<br><br>

## Demo of incorporating text features

Recall that we had dropped the `song_title` feature when we worked with the Spotify dataset. 

Let's try to include it in our pipeline and examine whether we get better results. 

In [123]:
spotify_df = pd.read_csv("data/spotify.csv", index_col=0)
X_spotify = spotify_df.drop(columns=["target"])
y_spotify = spotify_df["target"]

In [124]:
X_train, X_test, y_train, y_test = train_test_split(
    X_spotify, y_spotify, test_size=0.2, random_state=123
)

In [125]:
X_train.shape

(1613, 15)

In [126]:
X_train

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,song_title,artist
1505,0.004770,0.585,214740,0.614,0.000155,10,0.0762,-5.594,0,0.0370,114.059,4.0,0.2730,Cool for the Summer,Demi Lovato
813,0.114000,0.665,216728,0.513,0.303000,0,0.1220,-7.314,1,0.3310,100.344,3.0,0.0373,Damn Son Where'd You Find This? (feat. Kelly Holiday) - Markus Maximus Remix,Markus Maximus
615,0.030200,0.798,216585,0.481,0.000000,7,0.1280,-10.488,1,0.3140,127.136,4.0,0.6400,Trill Hoe,Western Tink
319,0.106000,0.912,194040,0.317,0.000208,6,0.0723,-12.719,0,0.0378,99.346,4.0,0.9490,Who Is He (And What Is He to You?),Bill Withers
320,0.021100,0.697,236456,0.905,0.893000,6,0.1190,-7.787,0,0.0339,119.977,4.0,0.3110,Acamar,Frankey
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012,0.001060,0.584,274404,0.932,0.002690,1,0.1290,-3.501,1,0.3330,74.976,4.0,0.2110,Like A Bitch - Kill The Noise Remix,Kill The Noise
1346,0.000021,0.535,203500,0.974,0.000149,10,0.2630,-3.566,0,0.1720,116.956,4.0,0.4310,Flag of the Beast,Emmure
1406,0.503000,0.410,256333,0.648,0.000000,7,0.2190,-4.469,1,0.0362,60.391,4.0,0.3420,Don't You Cry For Me,Cobi
1389,0.705000,0.894,222307,0.161,0.003300,4,0.3120,-14.311,1,0.0880,104.968,4.0,0.8180,장가갈 수 있을까 Can I Get Married?,Coffeeboy


Let's look at the distribution of values in the `song_title` column. 

In [127]:
X_train["song_title"].value_counts()

Pyramids                                     2
Look At Wrist                                2
Baby                                         2
The One                                      2
Best Friend                                  2
                                            ..
City Of Dreams - Radio Edit                  1
Face It                                      1
The Winner Is - from Little Miss Sunshine    1
History                                      1
Blue Ballad                                  1
Name: song_title, Length: 1579, dtype: int64

- Most of the song titles are unique, which makes sense. 
- What would happen if we apply one-hot encoding to this feature? 
- Can we encode this as a text feature? 

In [128]:
X_train.columns

Index(['acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence', 'song_title',
       'artist'],
      dtype='object')

In [129]:
numeric_features = [
    "acousticness",
    "danceability",
    "duration_ms",
    "energy",
    "instrumentalness",
    "key",
    "liveness",
    "loudness",
    "mode",
    "speechiness",
    "tempo",
    "time_signature",
    "valence",
]
drop_features = ['artist']
text_feature = "song_title"  # note that we are not creating a list here.

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (CountVectorizer(max_features=2000, stop_words="english"), text_feature),
    ("drop", drop_features)
)

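***Important***
> Note that, unlike the other feature types, we define `text_feature` as a string and not as a list. `CountVectorizer` expects a 1-D collection of documents, and `ColumnTransformer` passes a 1-D column to the transformer only when the column is specified as a single string; a list would pass a 2-D dataframe.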
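
The following sketch illustrates why the string form matters, using a hypothetical three-row dataframe. It calls `CountVectorizer` directly on a 1-D series (what `ColumnTransformer` passes for a string column) and on a 2-D dataframe (what it passes for a list of columns). Iterating over a dataframe yields its column labels, so in the second case the vectorizer sees a single "document", the column name itself.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({"song_title": ["Earth Song", "Softest Place On Earth", "Baby"]})

# 1-D input: each title is treated as one document.
from_series = CountVectorizer().fit_transform(toy["song_title"])

# 2-D input: iteration yields the column label, not the rows.
from_frame = CountVectorizer().fit_transform(toy[["song_title"]])

print(from_series.shape)  # one row per title
print(from_frame.shape)   # a single row for the column name
```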

### Visualizing the transformed data 

In [130]:
transformed = preprocessor.fit_transform(X_train, y_train)
transformed.shape

(1613, 1897)

In [131]:
vocab = preprocessor.named_transformers_["countvectorizer"].get_feature_names_out()

In [132]:
vocab

array(['000', '10', '100', ..., '있을까', '장가갈', '지금'], dtype=object)

In [133]:
column_names = numeric_features + vocab.tolist()

In [134]:
df = pd.DataFrame(transformed.toarray(), columns=column_names, index=X_train.index)
df

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,...,너와의,루시아,변명,여기,이곳에서,이대로,있어줘요,있을까,장가갈,지금
1505,-0.697633,-0.194548,-0.398940,-0.318116,-0.492359,1.275623,-0.737898,0.395794,-1.280599,-0.617752,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
813,-0.276291,0.295726,-0.374443,-0.795552,0.598355,-1.487342,-0.438792,-0.052394,0.780884,2.728394,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
615,-0.599540,1.110806,-0.376205,-0.946819,-0.492917,0.446734,-0.399607,-0.879457,0.780884,2.534909,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
319,-0.307150,1.809445,-0.654016,-1.722063,-0.492168,0.170437,-0.763368,-1.460798,-1.280599,-0.608647,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
320,-0.634642,0.491835,-0.131344,1.057468,2.723273,0.170437,-0.458384,-0.175645,-1.280599,-0.653035,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012,-0.711944,-0.200676,0.336272,1.185100,-0.483229,-1.211046,-0.393077,0.941176,0.780884,2.751157,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1346,-0.715953,-0.500969,-0.537445,1.383637,-0.492380,1.275623,0.482038,0.924239,-1.280599,0.918743,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1406,1.224228,-1.267021,0.113591,-0.157395,-0.492917,0.446734,0.194687,0.688940,0.780884,-0.626857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1389,2.003419,1.699134,-0.305695,-2.459489,-0.481032,-0.382156,0.802042,-1.875632,0.780884,-0.037298,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


### Visualizing the vocabulary 

In [135]:
vocab[0:10]

array(['000', '10', '100', '10cm', '11', '112', '12', '1208', '144', '18'],
      dtype=object)

In [136]:
vocab[500:510]

array(['duele', 'duet', 'duke', 'dustland', 'dutchie', 'dynamite',
       'earth', 'easy', 'eazy', 'echelon'], dtype=object)

In [137]:
vocab[1800:1810]

array(['wide', 'wifey', 'wild', 'wildcard', 'wildfire', 'wiley',
       'willing', 'win', 'wind', 'window'], dtype=object)

In [138]:
vocab[0::100]

array(['000', 'ap', 'blind', 'cha', 'dallask', 'duele', 'flashlight',
       'grace', 'icarus', 'lafa', 'making', 'neck', 'pharaohs', 'redeem',
       'seeb', 'soundtrack', 'talons', 'unanswered', 'wide'], dtype=object)

Let's find the songs whose titles contain the word _earth_. 

In [139]:
earth_index_vocab = np.where(vocab == "earth")[0][0]
earth_index_vocab

506

In [140]:
earth_index_in_df = len(numeric_features) + earth_index_vocab
earth_index_in_df

519

In [141]:
earth_songs = df[df.iloc[:, earth_index_in_df] == 1]
earth_songs.iloc[:, earth_index_in_df - 2 : earth_index_in_df + 2]

Unnamed: 0,dutchie,dynamite,earth,easy
1851,0.0,0.0,1.0,0.0
1948,0.0,0.0,1.0,0.0


In [142]:
earth_songs.index

Int64Index([1851, 1948], dtype='int64')

In [143]:
X_train.loc[earth_songs.index, "song_title"]

1851             Softest Place On Earth
1948    Earth Song - Remastered Version
Name: song_title, dtype: object
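
The same lookup can be sketched without positional arithmetic by using the fitted vectorizer's `vocabulary_` attribute, which maps each word to its column index. The example below is a standalone sketch: the two "earth" titles are the ones shown above, while the third title and its index label `7` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

titles = pd.Series(
    ["Softest Place On Earth", "Baby", "Earth Song - Remastered Version"],
    index=[1851, 7, 1948],  # index label 7 is hypothetical
)

vec = CountVectorizer()
counts = vec.fit_transform(titles)

# vocabulary_ maps word -> column index in the count matrix.
earth_col = int(vec.vocabulary_["earth"])

# Keep rows where the word occurs at least once.
mask = counts[:, earth_col].toarray().ravel() > 0
earth_titles = titles[mask]
print(earth_titles)
```

Testing for `> 0` rather than `== 1` also catches titles in which the word appears more than once.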

### Model building 

Let's create a pipeline using SVC. 
- SVC works well with sparse features. 

In [144]:
pipe = make_pipeline(preprocessor, SVC())

In [145]:
results = pd.DataFrame(cross_validate(pipe, X_train, y_train, return_train_score=True))
results

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.282863,0.043009,0.749226,0.867442
1,0.181304,0.036233,0.758514,0.862016
2,0.160504,0.035495,0.712074,0.865116
3,0.161054,0.036037,0.73913,0.8567
4,0.182859,0.035012,0.729814,0.855151


In [146]:
results.mean()

fit_time       0.193717
score_time     0.037157
test_score     0.737752
train_score    0.861285
dtype: float64

Is our CV score improving after incorporating this feature?
Let's examine what numbers we get when we don't include it. 

In [147]:
pipe_num = make_pipeline(StandardScaler(), SVC())

X_train_num = X_train.drop(columns=["song_title", 'artist'])

In [148]:
results = pd.DataFrame(
    cross_validate(pipe_num, X_train_num, y_train, return_train_score=True)
)
results

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.113574,0.024709,0.758514,0.810078
1,0.091848,0.025549,0.749226,0.810078
2,0.096154,0.027562,0.705882,0.821705
3,0.092446,0.023412,0.748447,0.815647
4,0.087187,0.021676,0.732919,0.812548


In [149]:
results.mean()

fit_time       0.096242
score_time     0.024582
test_score     0.738998
train_score    0.814011
dtype: float64

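- Not a big difference in the results: the mean CV score is 0.738 with `song_title` and 0.739 without it. 
- There seems to be more overfitting when we include the `song_title` feature: the mean train score rises from 0.814 to 0.861 while the CV score stays about the same. 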
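
One simple way to quantify the overfitting is the gap between the mean train score and the mean CV score. The sketch below only redoes the arithmetic on the means printed by the two `results.mean()` cells above; the row labels are made up for readability.

```python
import pandas as pd

# Mean scores copied from the two results.mean() outputs above.
means = pd.DataFrame(
    {
        "test_score": [0.737752, 0.738998],
        "train_score": [0.861285, 0.814011],
    },
    index=["numeric + song_title", "numeric only"],
)

# A larger train-minus-CV gap suggests more overfitting.
means["gap"] = means["train_score"] - means["test_score"]
print(means.round(3))
```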

- What about the `artist` column?
- Does it make sense to apply BOW encoding to it? 
- Let's look at the distribution of values in the `artist` column. 

In [150]:
X_train['artist'].value_counts()

Drake              14
Disclosure         12
Rick Ross          11
WALK THE MOON      10
Crystal Castles     8
                   ..
Classixx            1
Jordan Feliz        1
Travis Hayes        1
The Silvertones     1
Phil Woods          1
Name: artist, Length: 1131, dtype: int64

In [151]:
most_frequent = X_train["artist"].value_counts().iloc[:15]
most_frequent

Drake              14
Disclosure         12
Rick Ross          11
WALK THE MOON      10
Crystal Castles     8
Big Time Rush       8
FIDLAR              8
Fall Out Boy        8
Demi Lovato         7
Kanye West          7
Kina Grannis        7
Backstreet Boys     7
Beach House         6
Young Thug          6
*NSYNC              6
Name: artist, dtype: int64

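- We have many unique artists (1131), and most of them appear only once. It's probably not worth creating an "other" category here. Instead, below we one-hot encode only the 15 most frequent artists and let `handle_unknown="ignore"` encode every remaining artist as all zeros. 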

In [152]:
numeric_features = [
    "acousticness",
    "danceability",
    "duration_ms",
    "energy",
    "instrumentalness",
    "key",
    "liveness",
    "loudness",
    "mode",
    "speechiness",
    "tempo",
    "time_signature",
    "valence",
]
categorical_features = ['artist']
text_feature = "song_title"  # note that we are not creating a list here.

preprocessor_artist = make_column_transformer(
    (StandardScaler(), numeric_features),
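    (
        # Encode only the 15 most frequent artists; every other artist becomes all zeros.
        # `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions.
        OneHotEncoder(
            sparse_output=False,
            dtype=int,
            handle_unknown="ignore",
            categories=[most_frequent.index.values],
        ),
        categorical_features,
    ),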
    (CountVectorizer(max_features=2000, stop_words="english"), text_feature),
)
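
To see what the `categories=[...]` plus `handle_unknown="ignore"` combination does, here is a minimal sketch on hypothetical toy data with only two "frequent" artists. Any artist outside the given list is encoded as a row of zeros. The sketch keeps the encoder's default sparse output and converts it with `.toarray()` for display.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only.
artists = pd.DataFrame({"artist": ["Drake", "Drake", "Disclosure", "Cobi"]})
top = np.array(["Drake", "Disclosure"])  # stand-in for most_frequent.index.values

ohe = OneHotEncoder(handle_unknown="ignore", categories=[top], dtype=int)
encoded = ohe.fit_transform(artists).toarray()

# Columns follow the order of `top`; "Cobi" is not listed, so its row is all zeros.
print(encoded)
```

This behaves like an implicit "other" bucket: all infrequent artists share the same all-zero encoding, and the feature matrix gains only as many columns as there are listed artists.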

In [153]:
pipe = make_pipeline(preprocessor_artist, SVC())

In [154]:
results = pd.DataFrame(cross_validate(pipe, X_train, y_train, return_train_score=True))
results

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.16858,0.032908,0.755418,0.870543
1,0.159231,0.038562,0.76161,0.864341
2,0.194133,0.041704,0.712074,0.868217
3,0.197998,0.036146,0.742236,0.865995
4,0.160742,0.036273,0.732919,0.858249


In [155]:
results.mean()

fit_time       0.176137
score_time     0.037119
test_score     0.740851
train_score    0.865469
dtype: float64

A tiny improvement in the mean CV score (0.741 vs. 0.738), but we are still overfitting: the mean train score is 0.865. 

<br><br><br><br>

### iClicker Exercise 6.3

**Select all of the following statements which are TRUE.**

- (A) `handle_unknown="ignore"` would treat all unknown categories equally. 
- (B) As you increase the value for `max_features` hyperparameter of `CountVectorizer` the training score is likely to go up. 
- (C) Suppose you are encoding text data using `CountVectorizer`. If you encounter a word in the validation or the test split that's not available in the training data, you'll get an error. 
- (D) In the code below, inside `cross_validate`, each fold might have a slightly different number of features (columns).

```
pipe = make_pipeline(CountVectorizer(), SVC())
cross_validate(pipe, X_train, y_train)
```

<br><br><br><br>

## What did we learn today?

- Motivation to use `ColumnTransformer`
- `ColumnTransformer` syntax
- Defining transformers with multiple transformations
- How to visualize transformed features in a dataframe 
- More on ordinal features 
- Different arguments of `OneHotEncoder`
    - `handle_unknown="ignore"`
    - `drop="if_binary"`
- Dealing with text features
    - Bag of words representation: `CountVectorizer`

![](img/eva-talksoon.png)