# One Hot Encoding

May 27, 2019

Made by: Cristian E. Nuno

> `OneHotEncoder` transforms each categorical feature with `n_categories` possible values into `n_categories` binary features, with one of them 1, and all others 0. - [`sklearn.preprocessing` user guide](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features)


> The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix. - [`sklearn.preprocessing` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

![transformer](../visuals/transformer.gif)

![chris albon notecard](../visuals/ohe_chris_albon.png)
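
To make the quoted definition concrete, here is a minimal sketch on a made-up feature with three categories. The `colors` array and the `toy_` names are assumptions for illustration only and are not part of the CPS data used below.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy feature with three categories, shaped (n_samples, 1).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

toy_encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default; .toarray() densifies it.
toy_encoded = toy_encoder.fit_transform(colors).toarray()

print(toy_encoder.categories_)  # one array of sorted categories per feature
print(toy_encoded)              # one binary column per category
```

Each row contains exactly one 1, in the column that matches that row's category.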

## Data

Today we'll be using [Chicago Public Schools School Year 2018-2019 school profile data](https://cenuno.github.io/pointdexter/reference/cps_sy1819.html).

In [1]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np

In [2]:
relevant_columns = ["school_id", "short_name", "primary_category", "classification_type"]

cps_sy1819 = pd.read_csv("../raw_data/cps_sy1819_profiles.csv")[relevant_columns]
cps_sy1819.head()

| | school_id | short_name | primary_category | classification_type |
|---|---|---|---|---|
| 0 | 609760 | CARVER MILITARY HS | HS | Military academy |
| 1 | 609780 | MARINE LEADERSHIP AT AMES HS | HS | Military academy |
| 2 | 610304 | PHOENIX MILITARY HS | HS | Military academy |
| 3 | 610513 | AIR FORCE HS | HS | Military academy |
| 4 | 610390 | RICKOVER MILITARY HS | HS | Military academy |


In [3]:
cps_sy1819["primary_category"].value_counts()

ES    471
HS    180
MS      9
Name: primary_category, dtype: int64

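`OneHotEncoder` expects 2-dimensional input of shape `(n_samples, n_features)`, so store the `primary_category` feature as either a 2-dimensional array or a pandas DataFrame rather than a 1-dimensional Series.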

In [4]:
primary_category = cps_sy1819["primary_category"].values.reshape(-1, 1)
primary_category[:5]

array([['HS'],
       ['HS'],
       ['HS'],
       ['HS'],
       ['HS']], dtype=object)
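
The cell above takes the array route. As a sketch of the DataFrame route, selecting with double brackets also yields a 2-dimensional object. The `toy_schools` frame is a hypothetical stand-in for `cps_sy1819`.

```python
import pandas as pd

# Hypothetical stand-in for cps_sy1819 with only the column of interest.
toy_schools = pd.DataFrame({"primary_category": ["HS", "ES", "MS", "ES"]})

as_series = toy_schools["primary_category"]    # single brackets: 1-D Series
as_array = as_series.values.reshape(-1, 1)     # reshaped to 2-D NumPy array
as_frame = toy_schools[["primary_category"]]   # double brackets: 2-D DataFrame

print(as_series.shape, as_array.shape, as_frame.shape)
```

Either 2-dimensional form can be passed to `fit`; the Series cannot.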

Create our `OneHotEncoder()` object with `drop="first"`, which drops the first category of each feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression [(source)](https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/preprocessing/_encoders.py#L179).

In [5]:
encoder = OneHotEncoder(drop="first").fit(primary_category)
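
To see why collinearity can be a problem, the sketch below builds a design matrix with an intercept column, once with all one-hot columns and once with the first category dropped. The `levels` array is made-up toy data, and the intercept column is an assumption about how a regression would typically be set up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy feature with the same three levels as primary_category.
levels = np.array([["ES"], ["HS"], ["MS"], ["ES"], ["HS"]])

full = OneHotEncoder().fit_transform(levels).toarray()
reduced = OneHotEncoder(drop="first").fit_transform(levels).toarray()

# Prepend an intercept column of ones, as a linear regression would.
intercept = np.ones((levels.shape[0], 1))
full_design = np.hstack([intercept, full])        # intercept + ES + HS + MS
reduced_design = np.hstack([intercept, reduced])  # intercept + HS + MS

full_rank = np.linalg.matrix_rank(full_design)
reduced_rank = np.linalg.matrix_rank(reduced_design)
print(full_design.shape[1], full_rank)        # more columns than rank
print(reduced_design.shape[1], reduced_rank)  # columns equal rank
```

With every category kept, the one-hot columns sum to the intercept column, so the matrix is rank deficient. Dropping one category removes that redundancy.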

The categories in our `OneHotEncoder()` object are sorted in alphabetical order. Here that happens to match the `value_counts()` output above, although `value_counts()` sorts by frequency rather than by name.

In [6]:
encoder.categories_

[array(['ES', 'HS', 'MS'], dtype=object)]
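
The two orderings only coincide in the CPS data because the most frequent category also comes first alphabetically. A small sketch on made-up data, where the `toy` frame is hypothetical, shows them diverging:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data where the most frequent level sorts last alphabetically.
toy = pd.DataFrame({"primary_category": ["MS", "MS", "MS", "HS", "HS", "ES"]})

# value_counts() orders by descending frequency.
by_frequency = toy["primary_category"].value_counts().index.tolist()

# categories_ holds the sorted unique values of each feature.
by_name = OneHotEncoder().fit(toy[["primary_category"]]).categories_[0].tolist()

print(by_frequency)
print(by_name)
```

Since `drop="first"` refers to the first entry of `categories_`, it is the alphabetical order that decides which category is dropped.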

This matters because we can expect our output to have two features, one for `HS` and one for `MS`: `ES` is the first category, so `drop="first"` removes it. A row with 0 in both columns is therefore an `ES` school.

In [7]:
# scikit-learn >= 1.0; older releases named this method get_feature_names
encoder.get_feature_names_out(["primary_category"])

array(['primary_category_HS', 'primary_category_MS'], dtype=object)

Now let's place our output in a DataFrame.

In [8]:
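# transform returns a sparse matrix, so densify it with .toarray()
# get_feature_names_out needs scikit-learn >= 1.0 (formerly get_feature_names)
ohe = pd.DataFrame(encoder.transform(primary_category).toarray(),
                   columns=encoder.get_feature_names_out(["primary_category"]))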

ohe.head()

| | primary_category_HS | primary_category_MS |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 |
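
The values above are floats because `OneHotEncoder` defaults to a float `dtype`, and `.toarray()` is needed because the encoder returns a sparse matrix by default. The sketch below checks both on made-up data and requests integers through the `dtype` parameter. The `levels` array is hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy feature, one school level per row.
levels = np.array([["HS"], ["ES"], ["MS"]])

encoded = OneHotEncoder(drop="first").fit_transform(levels)
is_sparse = sparse.issparse(encoded)  # default output is a SciPy sparse matrix
dense = encoded.toarray()             # dense NumPy array of floats

# Ask for integer indicators instead of the default floats.
dense_int = OneHotEncoder(drop="first", dtype=int).fit_transform(levels).toarray()

print(is_sparse, dense.dtype, dense_int.dtype)
print(dense_int)
```

Whether to keep the sparse form depends on the downstream use. For a feature with only three categories, a dense DataFrame is easy to inspect.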


Great! Now let's column-bind `ohe` onto `cps_sy1819` after dropping the `primary_category` column from `cps_sy1819`.

In [9]:
pd.concat([cps_sy1819.drop("primary_category", axis=1), ohe], axis=1).head()

| | school_id | short_name | classification_type | primary_category_HS | primary_category_MS |
|---|---|---|---|---|---|
| 0 | 609760 | CARVER MILITARY HS | Military academy | 1.0 | 0.0 |
| 1 | 609780 | MARINE LEADERSHIP AT AMES HS | Military academy | 1.0 | 0.0 |
| 2 | 610304 | PHOENIX MILITARY HS | Military academy | 1.0 | 0.0 |
| 3 | 610513 | AIR FORCE HS | Military academy | 1.0 | 0.0 |
| 4 | 610390 | RICKOVER MILITARY HS | Military academy | 1.0 | 0.0 |
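
One caveat: `pd.concat(..., axis=1)` aligns rows on index labels, not on position. The bind above works because both frames carry the default 0, 1, 2, ... index. If the original frame had been filtered or re-indexed, the encoded frame would need the same index. The sketch below uses made-up frames; `left`, `right_default`, and `right_aligned` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame whose index is not the default 0, 1, 2.
left = pd.DataFrame({"short_name": ["A", "B", "C"]}, index=[10, 11, 12])

# Encoded columns built with a default index do not line up with `left`.
right_default = pd.DataFrame({"primary_category_HS": [1.0, 0.0, 1.0]})
misaligned = pd.concat([left, right_default], axis=1)

# Reusing the original index keeps each encoded row next to its source row.
right_aligned = pd.DataFrame({"primary_category_HS": [1.0, 0.0, 1.0]},
                             index=left.index)
aligned = pd.concat([left, right_aligned], axis=1)

print(misaligned.shape, aligned.shape)
```

In the misaligned case pandas takes the union of both indexes and fills the gaps with `NaN`, which is easy to miss when only looking at `.head()`.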
