# Dealing with categorical features

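Some features are categorical: they take one of a fixed set of values rather than a continuous measurement.

- **Ordinal** features have a natural order. They can be scaled, but it's not necessary.
- **Nominal** features have no order and must be encoded, e.g. with one-hot encoding.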
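
As a rough illustration of the two cases, here is a minimal sketch using scikit-learn's `OrdinalEncoder` and `OneHotEncoder`. The `sizes` and `colours` arrays are made-up toy data, not part of this notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one ordinal column (size) and one nominal column (colour).
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordinal: pass the categories in their natural order so the codes respect it.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes)

# Nominal: one column per category, so no order is implied.
onehot = OneHotEncoder()
colour_dummies = onehot.fit_transform(colours).toarray()

print(size_codes.ravel())      # [0. 2. 1. 0.]
print(onehot.categories_[0])   # ['blue' 'green' 'red']
print(colour_dummies)
```

The ordinal codes keep small < medium < large, whereas each colour gets its own 0/1 column.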

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

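# Two of the four features are informative, but the default `shuffle=True` shuffles
# the column order, so they are not guaranteed to come first. `weights` controls class balance.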
X, y = make_classification(n_samples=10000,
                           n_features=4, n_classes=2,
                           n_clusters_per_class=1,
                           n_informative=2, n_redundant=0,
                           weights=None, random_state=42)

X[:, 0] = np.digitize(X[:, 0], bins=np.linspace(-4, 4, 9))

# Fix the seed so the split, and therefore the report, is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
X[:25, 0]
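
To see what `np.digitize` does with these bin edges, here is a small sketch on a handful of made-up values (the `values` array is hypothetical). With nine edges there are ten possible codes, 0 to 9.

```python
import numpy as np

bins = np.linspace(-4, 4, 9)  # edges at -4, -3, ..., 3, 4
values = np.array([-5.0, -3.5, 0.2, 3.9, 4.5])

# Each value is replaced by the index of the bin it falls into:
# 0 means below the first edge, len(bins) means at or above the last edge.
codes = np.digitize(values, bins=bins)
print(codes)  # [0 1 5 8 9]
```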

## Train models on these datasets

- What happens if I train on this scaled dataset? Presumably it's not too bad, since the variable is ordinal.
- What if I mix the categories (i.e. make them non-ordinal) and train on that? Presumably it's bad.

First, train on the unscaled data.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_hat))

Now on the scaled version.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)

In [None]:
model = LogisticRegression()
model.fit(X_train_sc, y_train)
y_hat = model.predict(X_test_sc)

In [None]:
print(classification_report(y_test, y_hat))

Scaling doesn't do any harm.

## Nominal categories

Make another version of the dataset in which the first feature, `X[:, 0]`, is purely categorical, with no order.

I can't think of a way of doing this in NumPy, so Pandas it is.

In [None]:
import pandas as pd

shuf = {
    0: 3,
    1: 9,
    2: 1,
    3: 8,
    4: 0,
    5: 5,
    6: 2,
    7: 4,
    8: 6,
    9: 7,
}

s = pd.Series(X[:, 0])

X_shuf = X.copy()

X_shuf[:, 0] = s.replace(shuf)
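
For what it's worth, one possible NumPy route is a lookup array indexed by the old code. This is only a sketch: `lookup` restates the `shuf` mapping above, and `codes` is a made-up stand-in for `X[:, 0]`.

```python
import numpy as np

# Same mapping as the `shuf` dictionary: position i holds the new code for old code i.
lookup = np.array([3, 9, 1, 8, 0, 5, 2, 4, 6, 7])

# Hypothetical column of integer codes stored as floats, like X[:, 0].
codes = np.array([0.0, 4.0, 9.0, 2.0, 4.0])

# Fancy indexing needs integer indices, so cast before the lookup.
shuffled = lookup[codes.astype(int)]
print(shuffled)  # [3 0 7 1 0]
```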

In [None]:
# Fix the seed so the split, and therefore the report, is reproducible.
X_shuf_train, X_shuf_test, y_shuf_train, y_shuf_test = train_test_split(X_shuf, y, random_state=42)

Now do the same thing, scaling then training, but with the shuffled categories.

In [None]:
scaler.fit(X_shuf_train)
X_shuf_train_sc = scaler.transform(X_shuf_train)
X_shuf_test_sc = scaler.transform(X_shuf_test)

model = LogisticRegression()
model.fit(X_shuf_train_sc, y_shuf_train)
y_hat = model.predict(X_shuf_test_sc)

print(classification_report(y_shuf_test, y_hat))

Oh dear. The shuffled codes carry no order, but the model still treats them as numbers on a scale.
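
A linear model can only fit a single slope to a numeric column, so it needs the codes to rise or fall with the underlying signal. The sketch below checks this on a synthetic standard-normal `signal` (an assumption for illustration, not the notebook's data), comparing the correlation of the signal with its ordered bin codes and with the shuffled codes.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=5000)  # hypothetical informative feature

# Ordered bin codes, as produced earlier with np.digitize.
codes = np.digitize(signal, bins=np.linspace(-4, 4, 9))

# Shuffle the codes with the same mapping as the `shuf` dictionary.
lookup = np.array([3, 9, 1, 8, 0, 5, 2, 4, 6, 7])
shuffled = lookup[codes]

# Pearson correlation between the signal and each version of the codes.
r_ordinal = np.corrcoef(signal, codes)[0, 1]
r_shuffled = np.corrcoef(signal, shuffled)[0, 1]

print(f"ordered codes:  r = {r_ordinal:.2f}")
print(f"shuffled codes: r = {r_shuffled:.2f}")
```

The ordered codes should track the signal closely, while the shuffled codes should show only a weak linear relationship with it.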

## And do it properly

We should one-hot (dummy) encode nominal features instead.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_features = [1, 2, 3]
numeric_transformer = make_pipeline(StandardScaler())

categorical_features = [0]
categorical_transformer = make_pipeline(OneHotEncoder(drop='first'))

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

pipe = make_pipeline(preprocessor, LogisticRegression())
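
To get a feel for what a preprocessor like this produces, here is a sketch on a made-up 4 x 4 matrix (`X_toy` and `toy_preprocessor` are hypothetical names). It assumes the same layout as above: a category code in column 0 and numeric features in columns 1 to 3. Setting `sparse_threshold=0` is only there to force a dense array for printing.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy matrix: column 0 is a category code, columns 1-3 are numeric.
X_toy = np.array([
    [0.0, 1.0, 10.0, -1.0],
    [1.0, 2.0, 20.0, 0.0],
    [2.0, 3.0, 30.0, 1.0],
    [1.0, 4.0, 40.0, 2.0],
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), [1, 2, 3]),
        ("cat", OneHotEncoder(drop="first"), [0]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_out = toy_preprocessor.fit_transform(X_toy)

# 3 scaled numeric columns come first, then (3 categories - 1) dummy columns.
print(X_out.shape)   # (4, 5)
print(X_out[:, 3:])  # dummy columns for categories 1 and 2; category 0 is dropped
```

With `drop='first'`, category 0 becomes the all-zeros baseline row, which avoids a redundant column alongside the model's intercept.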

Now try the two datasets.

First we'll check we don't harm the ordinal feature:

In [None]:
pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)

print(classification_report(y_test, y_hat))

And the nominal one:

In [None]:
pipe.fit(X_shuf_train, y_shuf_train)
y_hat = pipe.predict(X_shuf_test)

print(classification_report(y_shuf_test, y_hat))

It works!

---

&copy; 2023 Matt Hall, licensed CC BY