![](img/330-banner.png)

## Lecture 6: `sklearn` `ColumnTransformer` and Text Features

UBC 2020-21

Instructor: Varada Kolhatkar

In [1]:
# Import libraries
from hashlib import sha1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML

pd.set_option("display.max_colwidth", 200)

from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
    normalize,
    scale,
)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

## Learning objectives 

From this lecture, you will be able to 

1. use `ColumnTransformer` to build all our transformations together into one object and use it with `sklearn` pipelines;
2. explain `handle_unknown="ignore"` hyperparameter of `scikit-learn`'s `OneHotEncoder`;
3. identify when it's appropriate to apply ordinal encoding vs one-hot encoding;
4. explain strategies to deal with categorical variables with too many categories;
5. explain why text data needs a different treatment than categorical variables;
6. use `scikit-learn`'s `CountVectorizer` to encode text data;
7. explain different hyperparameters of `CountVectorizer`.

## sklearn's [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

- In most applications, some features are categorical, some are continuous, some are binary, and some are ordinal. 

- When we want to develop supervised machine learning pipelines on real-world datasets, very often we want to apply different transformations to different columns. 

- Enter `sklearn`'s `ColumnTransformer`!! 

- Let's look at a toy example: 

In [2]:
df = pd.read_csv("data/quiz2-grade-toy-col-transformer.csv")
df

Unnamed: 0,ml_experience,major,class_attendance,university_years,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,Computer Science,Excellent,3,92,93.0,84,91,92,A+
1,1,Mechanical Engineering,Average,2,94,90.0,80,83,91,not A+
2,0,Mathematics,Poor,3,78,85.0,83,80,80,not A+
3,0,Mathematics,Excellent,3,91,,92,91,89,A+
4,0,Psychology,Good,4,77,83.0,90,92,85,A+
5,1,Economics,Good,5,70,73.0,68,74,71,not A+
6,1,Computer Science,Excellent,4,80,88.0,89,88,91,A+
7,0,Mechanical Engineering,Poor,3,95,93.0,69,79,75,not A+
8,0,Linguistics,Average,2,97,90.0,94,82,80,not A+
9,1,Mathematics,Average,4,95,82.0,94,94,85,not A+


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ml_experience     21 non-null     int64  
 1   major             21 non-null     object 
 2   class_attendance  21 non-null     object 
 3   university_years  21 non-null     int64  
 4   lab1              21 non-null     int64  
 5   lab2              19 non-null     float64
 6   lab3              21 non-null     int64  
 7   lab4              21 non-null     int64  
 8   quiz1             21 non-null     int64  
 9   quiz2             21 non-null     object 
dtypes: float64(1), int64(6), object(3)
memory usage: 1.8+ KB


### Transformations on the toy data

In [4]:
df.head()

Unnamed: 0,ml_experience,major,class_attendance,university_years,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,Computer Science,Excellent,3,92,93.0,84,91,92,A+
1,1,Mechanical Engineering,Average,2,94,90.0,80,83,91,not A+
2,0,Mathematics,Poor,3,78,85.0,83,80,80,not A+
3,0,Mathematics,Excellent,3,91,,92,91,89,A+
4,0,Psychology,Good,4,77,83.0,90,92,85,A+


- Scaling on numeric features
- One-hot encoding on the categorical feature `major`
- Ordinal encoding on the ordinal feature `class_attendance`
- Imputation on the `lab2` feature
- None on the `ml_experience` feature

### `ColumnTransformer` example

#### Data

In [None]:
X = df.drop(columns=["quiz2"])
y = df["quiz2"]
X.columns

#### Identify the transformations we want to apply

In [None]:
X.head()

In [None]:
numeric_feats = ["university_years", "lab1", "lab3", "lab4", "quiz1"]  # apply scaling
categorical_feats = ["major"]  # apply one-hot encoding
binary_feats = ["ml_experience"]  # do not apply any transformation
drop_feats = ['lab2', 'class_attendance']

For simplicity, let's only focus on scaling and one-hot encoding first. 

#### Create a column transformer

- Each transformation is specified by a name, a transformer object, and the columns this transformer should be applied to. 

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
ct = ColumnTransformer(
    [("scaling", StandardScaler(), numeric_feats),
     ("onehot", OneHotEncoder(sparse=False), categorical_feats),     
    ])

- Each transformer is applied to the specified columns, and the results of the transformations are concatenated horizontally. 
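
As a minimal sketch of this horizontal concatenation (the tiny dataframe and the names `toy`, `toy_ct`, and `out` are made up for illustration), the output has one column per scaled numeric feature plus one column per category found by the one-hot encoder:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
toy = pd.DataFrame(
    {
        "lab1": [90.0, 80.0, 70.0, 60.0],
        "lab3": [85.0, 75.0, 95.0, 65.0],
        "major": ["Math", "Physics", "Math", "Biology"],
    }
)

toy_ct = ColumnTransformer(
    [
        ("scaling", StandardScaler(), ["lab1", "lab3"]),
        ("onehot", OneHotEncoder(), ["major"]),
    ]
)

out = toy_ct.fit_transform(toy)
# The result may be sparse depending on its density; make it dense to inspect it.
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# 2 scaled columns followed by 3 one-hot columns (one per distinct major).
print(out.shape)
```

The scaled columns come first and the one-hot columns second because that is the order in which the transformers were listed.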

#### Convenient `make_column_transformer` syntax

- Similar to the `make_pipeline` syntax, there is a convenient `make_column_transformer` syntax. 
- The syntax automatically names each step based on its class. 
- We'll be mostly using this syntax. 

In [None]:
from sklearn.compose import make_column_transformer

ct = make_column_transformer(
    (StandardScaler(), numeric_feats), # scaling on numeric features
    (OneHotEncoder(), categorical_feats), # OHE on categorical features 
    ("passthrough", binary_feats), # no transformations on the binary features
    ("drop", drop_feats) # drop the drop features 
)

In [None]:
ct

- A big advantage here is that we build all our transformations together into one object, and that way we're sure we do the same operations to all splits of the data.

- Otherwise we might, for example, do the OHE on both train and test but forget to scale the test data.
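
Here is a small sketch of that idea, using made-up `train` and `test` dataframes. The column transformer is fit on the training split only, and a single `transform` call then applies both the scaling and the one-hot encoding to the test split:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical train/test splits with one numeric and one categorical column.
train = pd.DataFrame({"lab1": [60.0, 80.0, 100.0], "major": ["Math", "Physics", "Math"]})
test = pd.DataFrame({"lab1": [80.0, 90.0], "major": ["Physics", "Math"]})

split_ct = make_column_transformer(
    (StandardScaler(), ["lab1"]),
    (OneHotEncoder(), ["major"]),
)

split_ct.fit(train)  # learn the mean/std and the categories from the training split only
test_out = split_ct.transform(test)  # scaling and OHE are both applied to the test split
test_out = test_out.toarray() if hasattr(test_out, "toarray") else np.asarray(test_out)

# The test value 80.0 equals the training mean, so it is scaled to 0.
print(test_out)
```

Because everything lives in one object, there is no separate scaling step to forget.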

#### Let's examine the transformed data

In [None]:
transformed = ct.fit_transform(X)
type(transformed[:2])

```{note}
Note that the returned object is not a dataframe. So there are no column names. 
```

#### Viewing the transformed data as a dataframe

- How can we view our transformed data as a dataframe? 
- The transformation adds more columns (one per category for the one-hot encoded feature). 
- So the original column names won't map directly onto the transformed data. 
- Let's create column names for the transformed data. 

In [None]:
column_names = (
    numeric_feats
    + ct.named_transformers_["onehotencoder"].get_feature_names().tolist()
    + binary_feats
)
column_names

```{note}
Note that the order of the columns in the transformed data depends upon the order of the features we pass to the `ColumnTransformer` and can be different than the order of the features in the original dataframe.  
```
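
A quick way to convince yourself of this ordering, sketched on a made-up dataframe (`frame` and `order_ct` are illustrative names): the binary column is first in the dataframe but second in the output, because its transformer is listed second.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the binary column comes first in the dataframe.
frame = pd.DataFrame({"ml_experience": [1, 0, 1, 0], "lab1": [90.0, 80.0, 70.0, 60.0]})

order_ct = make_column_transformer(
    (StandardScaler(), ["lab1"]),  # listed first, so it comes first in the output
    ("passthrough", ["ml_experience"]),  # listed second, so it comes second
)
ordered = np.asarray(order_ct.fit_transform(frame))

# Column 1 of the output holds the untouched binary feature,
# even though it is column 0 of the original dataframe.
print(ordered[:, 1])
```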

In [None]:
pd.DataFrame(transformed, columns=column_names)

#### `ColumnTransformer`: Transformed data

<br>
<img src='./img/column-transformer.png' width="1500">

[Adapted from here.](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#37)

#### Training models with transformed data
- We can now pass the `ColumnTransformer` object as a step in a pipeline. 

In [None]:
pipe = make_pipeline(ct, SVC(gamma=0.001))
pipe.fit(X, y)
pipe.predict(X)


In [None]:
pd.DataFrame(transformed, columns=column_names)

In [None]:
X = df.drop(columns=["quiz2", "lab2"])
y = df["quiz2"]
X.columns

In [None]:
ordinal_feats = ["class_attendance"]  # apply ordinal encoding
attendance_cats = ["Poor", "Average", "Good", "Excellent"]

In [None]:
from sklearn.compose import make_column_transformer

ct = make_column_transformer(
    (StandardScaler(), numeric_feats),
    (OneHotEncoder(), categorical_feats),
    # ordinal encoding with the manually specified category order
    (OrdinalEncoder(categories=[attendance_cats], dtype=int), ordinal_feats),
    ("passthrough", binary_feats),
)

In [None]:
transformed = ct.fit_transform(X)
cat_feat_names = ct.named_transformers_["onehotencoder"].get_feature_names()
# The column names follow the order of the transformers above.
col_names = numeric_feats + cat_feat_names.tolist() + ordinal_feats + binary_feats
col_names

In [None]:
pd.DataFrame(transformed, columns=col_names)

In [None]:
pipe = make_pipeline(ct, SVC(gamma=0.01))

In [None]:
pipe.fit(X, y)

In [None]:
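numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Median imputation and scaling do not apply to string categories,
# so the categorical and ordinal pipelines use an encoder instead.
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
ordinal_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=[attendance_cats], dtype=int),
)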


###  `ColumnTransformer` <a name="7"></a>
- sklearn's [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) makes this more manageable.
    - A big advantage here is that we build all our transformations together into one object, and that way we're sure we do the same operations to all splits of the data. 
    - Otherwise we might, for example, do the OHE on both train and test but forget to scale the test data.    

<img src='./img/column-transformer.png' width="1500">

[Adapted from here.](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#37)

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
X_train.head()

In [None]:
X_train.columns

In [None]:
# Identify the categorical and numeric columns
numeric_features = [
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
    "rooms_per_household",
    "bedrooms_per_household",
    "population_per_household",
]

categorical_features = ["ocean_proximity"]
# remainder_features = ["median_income"]

- Let's build a pipeline for our dataset.
- First, we create the preprocessing pipelines for both numeric and categorical data.


In [None]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)


categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    # remainder='passthrough'
)

- The `ColumnTransformer` syntax is somewhat similar to Pipeline in that you pass in a list of tuples.
- But here each tuple has 3 values instead of 2: (name, object, list of columns)


In [None]:
preprocessor.fit(X_train)

When we `fit` with the preprocessor, it calls `fit` on _all_ the transformers.

In [None]:
X_train_pp = preprocessor.transform(X_train)

When we transform with the preprocessor, it calls `transform` on _all_ the transformers.

We can get the new names of the columns that were generated by the one-hot encoding:

In [None]:
preprocessor.named_transformers_["cat"].named_steps["onehot"].get_feature_names(
    categorical_features
)

Combining this with the numeric feature names gives us all the column names:

In [None]:
columns = numeric_features + list(
    preprocessor.named_transformers_["cat"]
    .named_steps["onehot"]
    .get_feature_names(categorical_features)
)
columns

In [None]:
results_dict = {}
from sklearn.svm import SVR

pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        # ("reg", KNeighborsRegressor()),
        ("reg", SVR(gamma=0.01)),
    ]
)

In [None]:
scores = cross_validate(pipe, X_train, y_train, return_train_score=True)
store_cross_val_results("imp + scaling + ohe + SVR", scores, results_dict)
pd.DataFrame(results_dict).T

- Note that categorical features are different from free-text features. Sometimes there are columns containing free-text information, and we'll look at ways to deal with them later in the course. 

### `remainder="passthrough"`
- Side note: by default, `ColumnTransformer` drops columns that are not assigned to any transformer.
- Use `remainder="passthrough"` of `ColumnTransformer` to keep the other columns intact. 
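
A minimal sketch of the difference, on a made-up dataframe where `ml_experience` is not assigned to any transformer (`data`, `drop_ct`, and `keep_ct` are illustrative names):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one column that no transformer is assigned to.
data = pd.DataFrame({"lab1": [90.0, 80.0, 70.0], "ml_experience": [1, 0, 1]})

drop_ct = make_column_transformer((StandardScaler(), ["lab1"]))  # remainder="drop" is the default
keep_ct = make_column_transformer((StandardScaler(), ["lab1"]), remainder="passthrough")

dropped = drop_ct.fit_transform(data)
kept = keep_ct.fit_transform(data)

print(dropped.shape)  # only the scaled column survives
print(kept.shape)  # the untouched column is appended after the transformed one
```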

#### Preprocessing the targets?

- Generally no need for this when doing classification. 
- In regression it makes sense in some cases. More on this in 573. 
- `sklearn` is fine with categorical labels ($y$-values) for classification problems. 

## 1. More on categorical features

## Data 

We'll be using [the adult census dataset](https://www.kaggle.com/uciml/adult-census-income#) you used in lab 2. 

This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. You can find more information on the dataset and features [here](http://archive.ics.uci.edu/ml/datasets/Adult).

The code below loads the data CSV (assuming that it is saved as `data/adult.csv` in this folder). 

*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*

In [None]:
adult_df_large = pd.read_csv("data/adult.csv")

In [None]:
train_df, test_df = train_test_split(adult_df_large, test_size=0.2, random_state=42)

In [None]:
train_df_nan = train_df.replace("?", np.nan)
test_df_nan = test_df.replace("?", np.nan)

In [None]:
train_df_nan.head()

In [None]:
train_df_nan["income"].value_counts(normalize=True)

In the lab we took the simplest approach and divided the features into these two categories: numeric and categorical. 

In [None]:
numeric_features = [
    "age",
    "fnlwgt",
    "education.num",
    "capital.gain",
    "capital.loss",
    "hours.per.week",
]

categorical_features = [
    "workclass",
    "education",
    "marital.status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native.country",
]

target = "income"

In [None]:
X_train = train_df_nan.drop(columns=[target])
y_train = train_df_nan[target]

X_test = test_df_nan.drop(columns=[target])
y_test = test_df_nan[target]

- We defined transformations on numeric and categorical features,  
- a column transformer, 
- a pipeline.

In [None]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("clf", SVC()),
    ]
)

### `make_pipeline` syntax

Let's create a column transformer and a pipeline using an alternative syntax `make_pipeline`. 

- It is shorthand for the `Pipeline` constructor.
- It does not permit naming the steps.
- Instead, the step names are set automatically to the lowercased names of their types.
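
To see the automatic naming, here is a short sketch comparing the two syntaxes (the variable names `named` and `auto` are only for illustration):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit step names with the Pipeline constructor.
named = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

# Automatic step names with make_pipeline.
auto = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

print(list(named.named_steps))  # ['imputer', 'scaler']
print(list(auto.named_steps))  # ['simpleimputer', 'standardscaler']
```

These automatic names are what you use later to look up a step, e.g. `auto.named_steps["standardscaler"]`.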

In [None]:
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(),
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
pipe = make_pipeline(preprocessor, SVC())

### `handle_unknown="ignore"`

In [None]:
scores = cross_validate(pipe, X_train, y_train, cv=5, return_train_score=True)

- What's going on here??
- Let's look at the error message:
`Found unknown categories ['Holand-Netherlands'] in column 6 during transform`

In [None]:
X_train["native.country"].value_counts()

- There is only one instance of Holand-Netherlands.
- During cross-validation, this is getting put into the validation split.
- By default, `OneHotEncoder` throws an error because you might want to know about this.

Simplest fix:
- Pass the `handle_unknown="ignore"` argument to `OneHotEncoder`.
- An unknown category is then encoded as all zeros across the one-hot columns of that feature. 
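
The following sketch reproduces both behaviours on a made-up single-column example (the arrays `train_countries` and `valid_countries` are hypothetical stand-ins for a training and a validation split):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and validation values for one categorical column.
train_countries = pd.DataFrame({"native.country": ["Canada", "India", "Canada"]})
valid_countries = pd.DataFrame({"native.country": ["India", "Holand-Netherlands"]})

strict = OneHotEncoder().fit(train_countries)  # default: handle_unknown="error"
try:
    strict.transform(valid_countries)
    raised = False
except ValueError:
    raised = True  # the unseen category triggers an error

lenient = OneHotEncoder(handle_unknown="ignore").fit(train_countries)
encoded = lenient.transform(valid_countries).toarray()

print(raised)
print(encoded)  # the row for the unseen category is all zeros
```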

In [None]:
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
pipe = make_pipeline(preprocessor, SVC())

In [None]:
scores = cross_validate(pipe, X_train, y_train, cv=5, return_train_score=True)
pd.DataFrame(scores).mean()

- Do you want this behaviour? 
- Are you expecting to get many unknown categories? Do you want to be able to distinguish between them?
- With this approach, all unknown categories will be represented with all zeros. 

### Cases where it's OK to break the golden rule 

- If there is a fixed number of categories that we know in advance (for example, the courses in MDS), this is one of the cases where it might be OK to violate the golden rule and get a list of all possible values for the categorical variable. 
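
One way to do this, sketched with a made-up course list (`all_courses` and the course codes in it are assumptions for illustration), is to pass the full list through the `categories` argument of `OneHotEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical full list of categories that we know ahead of time.
all_courses = ["DSCI 511", "DSCI 521", "DSCI 571", "DSCI 573"]

# The training split happens to contain only two of them.
train_courses = pd.DataFrame({"course": ["DSCI 571", "DSCI 511", "DSCI 571"]})

known_ohe = OneHotEncoder(categories=[all_courses])
known_ohe.fit(train_courses)

# A course that never appeared in training still gets its own column.
unseen_row = known_ohe.transform(pd.DataFrame({"course": ["DSCI 573"]})).toarray()
print(unseen_row)
```

With this approach a category that is missing from the training split is not "unknown" at all, so it is not collapsed into an all-zeros row.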

A common question that came up in the lab: 
- What types of features are present in this dataset other than numeric and categorical features? 

### Ordinal encoding

In [None]:
train_df[categorical_features].head()

- Most of the columns are actually categorical columns, in the sense that there is no ordinality among values. 
- What about the _education_ column? 

- There is actually an order in the values and it might help to encode this column using `OrdinalEncoder`
    - Example: Masters > 10th    

In [None]:
train_df["education"].unique()

In [None]:
oe = OrdinalEncoder(dtype=int)
oe.fit(X_train[["education"]])
ed_transformed = oe.transform(X_train[["education"]])
ed_transformed = pd.DataFrame(
    data=ed_transformed, columns=["education_enc"], index=X_train.index
)
ed_transformed.head()
oe.categories_[-1]

In [None]:
pd.DataFrame(
    data=np.arange(len(oe.categories_[0])),
    columns=["transformed"],
    index=oe.categories_[0],
).head(10)

- `OrdinalEncoder` has encoded the categories by alphabetically sorting them and then assigning integers to them in that order.
- Is this what we want? 
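
The alphabetical default is easier to see on a smaller column. This is a sketch on a made-up attendance column (`attendance` and `levels` are illustrative names), contrasting the default with a manually supplied order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column.
attendance = pd.DataFrame({"class_attendance": ["Poor", "Excellent", "Average", "Good"]})

# Default: categories are sorted alphabetically before integers are assigned.
default_oe = OrdinalEncoder(dtype=int).fit(attendance)
default_codes = default_oe.transform(attendance).ravel().tolist()
print(default_oe.categories_[0].tolist())  # ['Average', 'Excellent', 'Good', 'Poor']
print(default_codes)  # [3, 1, 0, 2]

# Manual: integers follow the order we supply.
levels = ["Poor", "Average", "Good", "Excellent"]
manual_oe = OrdinalEncoder(categories=[levels], dtype=int).fit(attendance)
manual_codes = manual_oe.transform(attendance).ravel().tolist()
print(manual_codes)  # [0, 3, 1, 2]
```

With the default, "Poor" gets the largest integer, which is the opposite of the ordering we have in mind.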

In [None]:
train_df["education"].unique()

Let's order them manually. 

In [None]:
education_levels = [
    "Preschool",
    "1st-4th",
    "5th-6th",
    "7th-8th",
    "9th",
    "10th",
    "11th",
    "12th",
    "HS-grad",
    "Prof-school",
    "Assoc-voc",
    "Assoc-acdm",
    "Some-college",
    "Bachelors",
    "Masters",
    "Doctorate",
]

In [None]:
assert set(education_levels) == set(train_df["education"].unique())

In [None]:
oe = OrdinalEncoder(categories=[education_levels], dtype=int)
oe.fit(X_train[["education"]])
ed_transformed = oe.transform(X_train[["education"]])
ed_transformed = pd.DataFrame(
    data=ed_transformed, columns=["education_enc"], index=X_train.index
)
oe.categories_

In [None]:
numeric_features = ["age", "fnlwgt", "capital.gain", "capital.loss", "hours.per.week"]
categorical_features = [
    "workclass",
    "marital.status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native.country",
]
ordinal_features = ["education"]
target_column = "income"

In [None]:
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

ordinal_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder(
        categories=[education_levels],
        dtype=int,
    ),
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("ordinal", ordinal_transformer, ordinal_features),
    ]
)
pipe = make_pipeline(preprocessor, SVC())

In [None]:
scores = cross_validate(pipe, X_train, y_train, return_train_score=True)

In [None]:
pd.DataFrame(scores).mean()

### Binary features 

- In this dataset, the only feature coded with two possible values is `sex`.
- Let's try OHE on that feature. 
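
For a binary feature, a plain one-hot encoding produces two columns that carry the same information. A small sketch on a made-up column (`binary_col` is an illustrative name) shows how `drop="if_binary"` keeps just one of them:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical binary column.
binary_col = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})

two_cols = OneHotEncoder(dtype=int).fit_transform(binary_col).toarray()
one_col = OneHotEncoder(drop="if_binary", dtype=int).fit_transform(binary_col).toarray()

print(two_cols.shape)  # (4, 2): one column per category
print(one_col.shape)  # (4, 1): the first category's column is dropped
print(one_col.ravel().tolist())  # [1, 0, 0, 1], i.e. 1 means "Male"
```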

In [None]:
numeric_features = ["age", "fnlwgt", "capital.gain", "capital.loss", "hours.per.week"]
categorical_features = [
    "workclass",
    "marital.status",
    "occupation",
    "relationship",
    "race",
    "native.country",
]
ordinal_features = ["education"]
binary_features = ["sex"]
target_column = "income"

In [None]:
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

ordinal_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder(
        categories=[education_levels],
        dtype=int,
    ),
)

binary_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(drop="if_binary", dtype=int),
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("ordinal", ordinal_transformer, ordinal_features),
        ("binary", binary_transformer, binary_features),
    ]
)
pipe = make_pipeline(preprocessor, SVC())

In [None]:
scores = cross_validate(pipe, X_train, y_train, return_train_score=True)

In [None]:
pd.DataFrame(scores)

### OHE with many categories

In [None]:
X_train["native.country"].value_counts()

- Do we have enough data for rare categories to learn anything meaningful? 
- How about grouping them into bigger categories?
    - Example: "South America" or "Asia"
- Or having "other" category for rare cases? 
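
A rough sketch of the "other" idea with `pandas`, assuming a made-up column and a hypothetical threshold `min_count` (both are illustrative, not taken from the census data). Note that the set of frequent categories is learned from the training split only:

```python
import pandas as pd

# Hypothetical training column with a few frequent and a few rare categories.
train_country = pd.Series(
    ["United-States"] * 6 + ["Mexico"] * 3 + ["Holand-Netherlands", "Scotland"]
)

min_count = 2  # hypothetical threshold: categories seen fewer times become "other"
counts = train_country.value_counts()
frequent = set(counts[counts >= min_count].index)  # learned from the training split only


def group_rare(values):
    """Replace every category outside `frequent` with the label "other"."""
    return values.where(values.isin(frequent), "other")


grouped = group_rare(train_country)
print(grouped.value_counts().to_dict())
```

The same `group_rare` function can then be applied to validation or test data, where categories never seen in training also end up in "other".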

### Do we actually want to use certain features for prediction?

- Do you want to use `race` in prediction?
- Remember that the systems you build are going to be used in some applications. 
- It's extremely important to be mindful of the consequences of including certain features in your predictive model. 
- I would just drop the feature to avoid racial biases. 

### Categorical features (True or False)

- `handle_unknown="ignore"` would treat all unknown categories equally. 
- Creating groups of rarely occurring categories might overfit the model. 

## 2. Encoding text data  

- ML algorithms we have seen so far prefer numeric and fixed length input that looks like this: 

$$X = \begin{bmatrix}1.0 & 4.0 & \ldots & & 3.0\\ 0.0 & 2.0 & \ldots & & 6.0\\ 1.0 & 0.0 & \ldots & & 0.0\\ \end{bmatrix}$$ 

and 
$$y = \begin{bmatrix}spam \\ non spam \\ spam \end{bmatrix}$$

- But what if we are only given data in the form of raw text and associated labels?
- How can we represent such data as a fixed number of features? 

### Spam/non-spam toy example

Would you be able to apply the algorithms we have seen so far on the data that looks like this?

$X = \begin{bmatrix}\text{"URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",}\\ \text{"Lol your always so convincing."}\\ \text{"Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now!"}\\ \end{bmatrix}$ 

and 

$y = \begin{bmatrix}spam \\ non spam \\ spam \end{bmatrix}$

- In categorical features or ordinal features, we have a fixed number of categories.
- In text features such as the ones above, each feature value (i.e., each text message) is going to be different. 
- How do we encode these features? 

### Bag of words (BOW) representation

- One way is to use a simple bag of words (BOW) representation which involves two components. 
    - The vocabulary (all unique words in all documents) 
    - A value indicating either the presence or absence or the count of each word in the document. 
        
<center>
<img src='./img/bag-of-words.png' width="800">
</center>

[Source](https://web.stanford.edu/~jurafsky/slp3/4.pdf)       
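
To make the two components concrete, here is a bag-of-words representation built by hand with the standard library only. The two documents and the simple `tokenize` function are assumptions for illustration; this is not how `scikit-learn` implements it internally.

```python
import re
from collections import Counter

# Hypothetical mini-corpus of two documents.
docs = ["Free prize! Claim your free prize now", "Are you free now"]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


# Component 1: the vocabulary, i.e. all unique words across the documents.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

# Component 2: for every document, the count of each vocabulary word.
bow = []
for doc in docs:
    word_counts = Counter(tokenize(doc))
    bow.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

Every document becomes a row of the same length (the vocabulary size), no matter how long the original text is.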

### Extracting BOW features using `scikit-learn`
- `CountVectorizer`
    - Converts a collection of text documents to a matrix of word counts.  
    - Each row represents a "document" (e.g., a text message in our example). 
    - Each column represents a word in the vocabulary in the training data. 
    - Each cell represents how often the word occurs in the document. 
    
    
Note: In the NLP community a text data set is referred to as a **corpus** (plural: corpora).    

In [None]:
X = [
    "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
    "Lol you are always so convincing.",
    "Nah I don't think he goes to usf, he lives around here though",
    "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
    "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030",
    "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune",
]
y = ["spam", "non spam", "non spam", "spam", "spam", "non spam"]
vec = CountVectorizer()
X_counts = vec.fit_transform(X)
bow_df = pd.DataFrame(X_counts.toarray(), columns=sorted(vec.vocabulary_), index=X)
bow_df

In [None]:
X_counts

In [None]:
print("The total number of elements: ", np.prod(X_counts.shape))
print("The number of non-zero elements: ", X_counts.nnz)
print(
    "Proportion of non-zero elements: %0.4f" % (X_counts.nnz / np.prod(X_counts.shape))
)
print(
    "The value at cell 3,%d is: %d"
    % (vec.vocabulary_["jackpot"], X_counts[3, vec.vocabulary_["jackpot"]])
)

### Why sparse matrices? 

- Most words do not appear in a given document.
- We get massive computational savings if we only store the nonzero elements.
- There is a bit of overhead, because we also need to store the locations:
    - e.g. "location (3,31): 1".
    
- However, if the fraction of nonzero is small, this is a huge win.
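
A rough sketch of the savings, using a made-up mostly-zero matrix and `scipy`'s CSR format (the sizes and the random placement of nonzeros are arbitrary choices for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical count matrix: 200 documents, 2000 words, at most 1000 nonzero cells.
dense = np.zeros((200, 2000), dtype=np.int64)
rng = np.random.default_rng(0)
rows = rng.integers(0, 200, size=1000)
cols = rng.integers(0, 2000, size=1000)
dense[rows, cols] = 1

sparse = csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores the nonzero values plus two index arrays (the "locations").
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print("fraction nonzero:", sparse.nnz / np.prod(dense.shape))
print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```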


Question for you
- What would happen if you apply `StandardScaler` on sparse data? 

### `OneHotEncoder` and sparse features 
- By default, `OneHotEncoder` also creates sparse features. 
- You could set `sparse=False` to get a regular `numpy` array (in `scikit-learn` 1.2 and later this argument is named `sparse_output`). 
- If there are a huge number of categories, it may be beneficial to keep them sparse.
- For smaller number of categories, it doesn't matter much.

### Important hyperparameters of `CountVectorizer` 

- `binary`
    - whether to use absence/presence feature values or counts
- `max_features`
    - only consider top `max_features` ordered by frequency in the corpus
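- `max_df`
    - ignore features which occur in more than `max_df` documents (an integer is an absolute count, a float is a proportion of documents) 
- `min_df` 
    - ignore features which occur in fewer than `min_df` documents (same convention as `max_df`) 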
- `ngram_range`
    - consider word sequences in the given range 
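
The effect of each hyperparameter can be sketched on a tiny made-up corpus. The helper `vocab` below is a hypothetical convenience function written for this example, not part of `scikit-learn`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus.
corpus = ["free prize now", "free free call now", "call me later"]


def vocab(**kwargs):
    """Fit a CountVectorizer with the given settings and return its sorted vocabulary."""
    return sorted(CountVectorizer(**kwargs).fit(corpus).vocabulary_)


print(vocab())  # all words
print(vocab(min_df=2))  # words occurring in at least 2 documents
print(vocab(max_df=1))  # words occurring in at most 1 document
print(vocab(max_features=1))  # the single most frequent word in the corpus
print(vocab(ngram_range=(2, 2)))  # only two-word sequences

counts = CountVectorizer().fit_transform(corpus).toarray()
binary = CountVectorizer(binary=True).fit_transform(corpus).toarray()
print(counts[1], binary[1])  # "free" is counted twice vs. marked present once
```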

In [None]:
# Let's look at all features, i.e., words (along with their frequencies).
vec_all = CountVectorizer()
X_counts = vec_all.fit_transform(X)
pd.DataFrame(
    data=X_counts.sum(axis=0).tolist()[0],
    index=vec_all.get_feature_names(),
    columns=["counts"],
).sort_values("counts", ascending=False).head(20)

In [None]:
# We can control the size of X (the number of features) using `max_features`
vec8 = CountVectorizer(max_features=8)
X_counts = vec8.fit_transform(X)
pd.DataFrame(
    data=X_counts.sum(axis=0).tolist()[0],
    index=vec8.get_feature_names(),
    columns=["counts"],
).sort_values("counts", ascending=False)

In [None]:
bow_df = pd.DataFrame(X_counts.toarray(), columns=sorted(vec8.vocabulary_), index=X)
bow_df

In [None]:
vec8_binary = CountVectorizer(binary=True, max_features=8)
X_counts = vec8_binary.fit_transform(X)
pd.DataFrame(
    data=X_counts.sum(axis=0).tolist()[0],
    index=vec8_binary.get_feature_names(),
    columns=["counts"],
).sort_values("counts", ascending=False)

Notice that `vec8` and `vec8_binary` have different vocabularies, which is kind of unexpected behaviour and doesn't match the documentation of `scikit-learn`. 

[Here](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L1206-L1225) is the code for `binary=True` condition in `scikit-learn`. As we can see, the binarization is done before limiting the features to `max_features`, and so now we are actually looking at the document counts (in how many documents it occurs) rather than term count. This is not explained anywhere in the documentation. 

The ties in counts between different words make it even more confusing. I don't think it'll have a big impact on the results, but this is good to know! Remember that `scikit-learn` developers are also humans who are prone to making mistakes. So it's always a good habit to question the tools we use every now and then. 
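
The interaction between `binary=True` and `max_features` described above can be reproduced on a small made-up corpus. This is only a sketch: `toy_docs`, `vec_counts`, and `vec_binary` are hypothetical names, and the behaviour shown depends on the `scikit-learn` implementation linked above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "spam" has the highest term count (4) but occurs in
# only one document; "hello" has a lower term count (3) but occurs in three.
toy_docs = ["spam spam spam spam", "hello world", "hello there", "hello again"]

# Keep only the single top-ranked word in each case.
vec_counts = CountVectorizer(max_features=1).fit(toy_docs)
vec_binary = CountVectorizer(binary=True, max_features=1).fit(toy_docs)

print(sorted(vec_counts.vocabulary_))  # ranked by term count
print(sorted(vec_binary.vocabulary_))  # ranked by document count
```

Without `binary=True` the word that is kept is _spam_; with `binary=True` it is _hello_, because the counts are binarized before the top `max_features` words are selected.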

In [None]:
bow_df = pd.DataFrame(
    X_counts.toarray(), columns=sorted(vec8_binary.vocabulary_), index=X
)
bow_df

### Preprocessing

- Note that `CountVectorizer` carries out some preprocessing because of its default argument values, such as:
    - converting words to lowercase (`lowercase=True`)
    - getting rid of punctuation and special characters, and dropping single-character tokens (`token_pattern ='(?u)\\b\\w\\w+\\b'`)
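
One way to see this default preprocessing in action is to call the analyzer that `CountVectorizer` builds internally. The sentence below is a made-up example, and `analyzer`, `analyzer_cased`, `tokens`, and `tokens_cased` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer applies to each raw
# document: preprocessing (e.g., lowercasing) followed by tokenization.
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer("Hello, World! I LOVE text-data :)")
print(tokens)  # ['hello', 'world', 'love', 'text', 'data']

# With lowercase=False the original casing is kept, but punctuation and
# single-character tokens such as "I" are still dropped by token_pattern.
analyzer_cased = CountVectorizer(lowercase=False).build_analyzer()
tokens_cased = analyzer_cased("Hello, World! I LOVE text-data :)")
print(tokens_cased)  # ['Hello', 'World', 'LOVE', 'text', 'data']
```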


In [None]:
X, y

In [None]:
pipe = make_pipeline(CountVectorizer(), SVC())

In [None]:
pipe.fit(X, y)

In [None]:
pipe.predict(X)

In [None]:
pipe.score(X, y)

### Is this a realistic representation of text data? 

- Of course this is not a great representation of language
    - We are throwing out everything we know about language and losing a lot of information. 
    - It assumes that there is no syntax or compositional meaning in language.  
- But it works surprisingly well for many tasks. 
- We will learn more expressive representations later in the program in DSCI 575 (my favorite course :))! 
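
As a small illustration of the information lost, the sketch below encodes two made-up sentences that use the same words in a different order. `order_docs` and `X_bow` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with opposite meanings but exactly the same words.
order_docs = ["the dog bit the man", "the man bit the dog"]

X_bow = CountVectorizer().fit_transform(order_docs).toarray()
print(X_bow)

# Bag-of-words ignores word order, so both rows are identical.
same_representation = bool((X_bow[0] == X_bow[1]).all())
print(same_representation)
```

Any model trained on this representation cannot tell the two sentences apart, which is what is meant by throwing away syntax.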

In the lab you'll develop a system for spam identification on a dataset from Kaggle. 

### `CountVectorizer`: True or False

- As you increase the value of the `max_features` hyperparameter of `CountVectorizer`, the training score is likely to go up. 
    - Varada's answer: True because a larger `max_features` lets more words from the training data into the dictionary, which gives the model more features to fit the training data with, so the training score is likely to go up. 
- If we encounter a word in the validation or the test split that's not available in the training data, we'll get an error. 
    - Varada's answer: False because if the word isn't in the dictionary, we would just ignore the word. 
- The `max_df` hyperparameter of `CountVectorizer` can be used to get rid of the most frequently occurring words from the dictionary.    
    - True because words such as _a_, _the_, _in_, _of_ occur in most of the documents, and with the `max_df` hyperparameter we can drop features based on the number of documents they occur in. So if we set it to a high proportion below 1.0 (e.g., `max_df=0.9` ignores words that occur in more than 90% of the documents), we can get rid of such stop words.    
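
The second statement (unknown words at prediction time) can be checked with a short sketch. The documents are made up, and `train_docs`, `test_docs`, `vec_train`, and `X_test_counts` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize inside", "meeting at noon"]
test_docs = ["free lunch at noon"]  # "lunch" never occurs in train_docs

# The dictionary is built from the training documents only.
vec_train = CountVectorizer().fit(train_docs)

# transform() does not raise an error for "lunch"; the word is simply ignored.
X_test_counts = vec_train.transform(test_docs)

print(sorted(vec_train.vocabulary_))
print(X_test_counts.toarray())  # [[1 1 0 0 1 0]]
```

The transformed test document has one column per training word and no column for _lunch_.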

In [None]:
restaurant_df

In [None]:
restaurant_df.describe()

In [None]:
restaurant_df = pd.read_csv("data/cleaned_restaurant_data.csv")
# Keep two features and the target.
restaurant_subset = restaurant_df[["n_people", "price", "target"]]
# Keep only rows with price below 200.
clean_restaurant_subset = restaurant_subset[restaurant_subset["price"] < 200]
clean_restaurant_subset.head()

In [None]:
X = clean_restaurant_subset.drop(columns=["target"])
y = clean_restaurant_subset["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
model = DecisionTreeClassifier(max_depth=4)
model.fit(X_train, y_train)

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

In [None]:
import sys

sys.path.append("code/.")
import graphviz
import IPython
import mglearn
from IPython.display import HTML, display
from plotting_functions import *
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from utils import *

In [None]:
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
plot_tree_decision_boundary_and_tree(
    model,
    X_train,
    y_train,
    height=6,
    width=16,
    eps=10,
    x_label="n_people",
    y_label="price",
)