## OneHotEncoder and ColumnTransformer: Categorical Variables with scikit-learn

As mentioned before, scikit-learn can also perform one-hot-encoding. Using scikit-learn has the advantage of making it easy to treat the training and test sets in a consistent way. One-hot-encoding is implemented in the *OneHotEncoder* class. Notably, *OneHotEncoder* applies the encoding to all input columns:

In [3]:
# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import sklearn
from IPython.display import display
import mglearn

# Suppress all warnings, including deprecation warnings
import warnings
warnings.filterwarnings('ignore')

In [4]:
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)

In [5]:
from sklearn.preprocessing import OneHotEncoder

# Setting sparse=False means OneHotEncoder will return a NumPy array,
# not a sparse matrix (scikit-learn 1.2 and later name this parameter
# sparse_output)
ohe = OneHotEncoder(sparse=False)
print(ohe.fit_transform(demo_df))

[[1. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0.]]


You can see that both the string feature and the integer feature were transformed. As usual for scikit-learn, the output is not a DataFrame, so there are no column names. To see which transformed features correspond to which original categorical variables, we can use the *get_feature_names* method (replaced by *get_feature_names_out* in newer scikit-learn releases):

In [7]:
print(ohe.get_feature_names())

['x0_0' 'x0_1' 'x0_2' 'x1_box' 'x1_fox' 'x1_socks']


Here, the first three columns correspond to the values 0, 1, and 2 of the first original feature (called x0 here), while the last three columns correspond to the values box, fox, and socks of the second original feature (called x1 here).
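
The consistent treatment of training and test sets mentioned at the start of this section can be illustrated with a minimal sketch. The two tiny DataFrames below are hypothetical. The encoder is fitted on the training data only, so the test data is mapped onto the same columns. The `handle_unknown='ignore'` setting, which is not used elsewhere in this section, is an assumption made here so that a category unseen during fitting is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set lacks 'fox' and 'socks'
# and contains the unseen category 'rocks'
train_df = pd.DataFrame({'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
test_df = pd.DataFrame({'Categorical Feature': ['box', 'rocks']})

# Learn the categories from the training data only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_df)

# .toarray() converts the default sparse output to a dense NumPy array
train_encoded = encoder.transform(train_df).toarray()
test_encoded = encoder.transform(test_df).toarray()

print(encoder.categories_)
print(train_encoded.shape, test_encoded.shape)
print(test_encoded)
```

Both arrays have three columns (box, fox, socks), even though the test data contains only one of those values. The row for 'rocks' contains only zeros.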

In most applications, some features are categorical and some are continuous, so *OneHotEncoder* is not directly applicable, as it assumes all features are categorical. This is where the *ColumnTransformer* class comes in handy: it allows you to apply different transformations to different columns in the input data. This matters because continuous and categorical features need very different kinds of preprocessing.
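
To see why *OneHotEncoder* is a poor fit for continuous features, consider this small sketch with made-up age values: every distinct number is treated as its own category and gets its own column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up continuous values: one feature with five distinct numbers
ages = np.array([[39.0], [50.5], [38.2], [53.1], [28.9]])

# .toarray() converts the default sparse output to a dense NumPy array
encoded = OneHotEncoder().fit_transform(ages).toarray()

# One output column per distinct value, so the numeric ordering is lost
print(encoded.shape)
```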

Let’s go back to the example of the adult census data we considered earlier:

In [8]:
import os

adult_path = os.path.join(mglearn.datasets.DATA_PATH, "adult.data")
data = pd.read_csv(
    adult_path, header=None, index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education',  'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])

data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]

display(data.head())

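|   | age | workclass | education | gender | hours-per-week | occupation | income |
|---|-----|-----------|-----------|--------|----------------|------------|--------|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |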


To apply, say, a linear model to this dataset to predict income, in addition to applying one-hot-encoding to the categorical variables, we might also want to scale the continuous variables age and hours-per-week. This is exactly what *ColumnTransformer* can do for us. 

Each transformation in the column transformer is specified by a name (we will see later why this is useful), a transformer object, and the columns this transformer should be applied to. The columns can be specified using column names, integer indices, or boolean masks. Each transformer is applied to the corresponding columns, and the results of the transformations are concatenated horizontally. For the example above, using column names, the specification looks like this:

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer(
    [("scaling", StandardScaler(), ['age', 'hours-per-week']),
     ("onehot", OneHotEncoder(sparse=False),
      ['workclass', 'education', 'gender', 'occupation'])])
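
The three ways of specifying columns (names, integer indices, boolean masks) can be compared in a short sketch. The DataFrame `toy` and the helper `make_ct` are hypothetical and exist only for this illustration. `sparse_threshold=0` is used here so that the result is always a dense array, which avoids the `sparse` argument of *OneHotEncoder*.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for a few rows of the adult data
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private'],
    'hours-per-week': [40, 13, 40, 40],
})

def make_ct(numeric_cols, categorical_cols):
    # Hypothetical helper: same transformers, different column specifications
    return ColumnTransformer(
        [("scaling", StandardScaler(), numeric_cols),
         ("onehot", OneHotEncoder(), categorical_cols)],
        sparse_threshold=0)  # always return a dense array

# Column names
by_name = make_ct(['age', 'hours-per-week'], ['workclass']).fit_transform(toy)
# Integer positions of the same columns
by_index = make_ct([0, 2], [1]).fit_transform(toy)
# Boolean masks over the three columns
by_mask = make_ct(np.array([True, False, True]),
                  np.array([False, True, False])).fit_transform(toy)

print(by_name.shape)
print(np.allclose(by_name, by_index), np.allclose(by_name, by_mask))
```

All three specifications select the same columns, so the transformed arrays should be identical: two scaled columns followed by one column per workclass value.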

Now we can use the *ColumnTransformer* object as we would any other scikit-learn transformation, using *fit* and *transform*. So let’s build a linear model as before, but this time include scaling of the continuous variables. Note that we are calling *train_test_split* on the DataFrame containing the features, not on a NumPy array. We need to preserve the column names so that they can be used in the *ColumnTransformer*.
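
The workflow just described might look roughly like the following sketch. It does not use the adult data: `toy` is a synthetic stand-in with made-up values and a made-up target, and the names `toy_ct` and `features` are assumptions of this example. *LogisticRegression* is used as one possible linear model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the adult data (made-up values, not the real dataset)
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    'age': rng.randint(18, 70, size=n),
    'workclass': rng.choice(['Private', 'State-gov', 'Self-emp-not-inc'], size=n),
    'hours-per-week': rng.randint(10, 60, size=n),
})
# Made-up target that depends on the two continuous features
toy['income'] = np.where(toy['age'] + toy['hours-per-week'] > 85, '>50K', '<=50K')

# Split the DataFrame itself so the column names are preserved
features = toy.drop('income', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    features, toy['income'], random_state=0)

toy_ct = ColumnTransformer(
    [("scaling", StandardScaler(), ['age', 'hours-per-week']),
     ("onehot", OneHotEncoder(handle_unknown='ignore'), ['workclass'])],
    sparse_threshold=0)  # always return a dense array

# Fit on the training data only, then transform both sets the same way
toy_ct.fit(X_train)
X_train_trans = toy_ct.transform(X_train)
X_test_trans = toy_ct.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_trans, y_train)
score = logreg.score(X_test_trans, y_test)

print(X_train_trans.shape, X_test_trans.shape)
print("Test score: {:.2f}".format(score))
```

Because the transformer is fitted on `X_train` alone, the scaling statistics and the one-hot categories come only from the training data, and the test data is transformed with exactly the same mapping.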