# Video: Using One-Hot Encodings

This video shows how to create one-hot encodings with pandas and scikit-learn.

Script:
* One-hot encodings are a common way to map categorical columns full of strings into numeric columns that are more accessible to most kinds of models.
* The basic idea is that each known categorical value will be assigned a new column of its own.
* And if that categorical value comes up, then that new column will be set to one.
* Otherwise, that column will be set to zero.
* Across all these new columns, at most one of them will be set to one.
* That is the "one hot".
* The rest will be zero.
* And if the categorical value was not recognized, or was deemed too infrequent to warrant its own column, then all of those new columns will be zero.
* Both pandas and scikit-learn have support for creating one-hot encodings automatically.
* Let's start with the pandas version.
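Before turning to the libraries, the basic idea described above can be sketched by hand in plain Python. This is only an illustration: the `one_hot` helper and the hard-coded category list are hypothetical and are not part of pandas or scikit-learn.

```python
def one_hot(value, categories):
    """Return one 0/1 entry per known category; all zeros if value is unknown."""
    return [1 if value == category else 0 for category in categories]


# Toy category list for illustration (not read from the abalone file).
known_categories = ["F", "I", "M"]

encoded_m = one_hot("M", known_categories)        # only the "M" slot is hot
encoded_unknown = one_hot("X", known_categories)  # unrecognized value: no slot is hot

print(encoded_m)
print(encoded_unknown)
```

The library versions below do the same bookkeeping, but they also discover the category list from the data.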

In [None]:
import pandas as pd

In [None]:
abalone = pd.read_csv("https://raw.githubusercontent.com/bu-omds/bu-omds-data/main/data/abalone.tsv", sep="\t")

Script:
* Pandas provides a convenient function called get_dummies to add these one-hot encoding columns.
* It will automatically detect which columns to encode by checking their types.

In [None]:
pd.get_dummies(abalone)

Script:
* Here you can see that it automatically identified the sex column, and only the sex column as needing one-hot encoding.
* The new columns hold True and False values, which modeling code will treat as one and zero respectively.
* We can customize the behavior of get_dummies by calling it on just one series at a time.
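As a small aside, the type detection and the True/False values can be checked on a tiny made-up data frame. The rows below are invented stand-ins for the abalone data, and the `dtype=int` argument is an optional extra that the video does not use; it asks pandas to store the new columns as integers instead.

```python
import pandas as pd

# Toy stand-in for the abalone data; the values are made up.
toy = pd.DataFrame(
    {"Sex": ["M", "F", "I", "M"], "Length": [0.455, 0.530, 0.330, 0.440]}
)

dummies = pd.get_dummies(toy)             # only the string column is expanded
as_ints = pd.get_dummies(toy, dtype=int)  # same columns, stored as 0/1 integers

print(dummies.columns.tolist())
print(as_ints)
```

The numeric `Length` column passes through unchanged, and each row has exactly one of the `Sex_` columns set.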

In [None]:
pd.get_dummies(abalone["Sex"])

Script:
* Here, I called get_dummies on just the sex column of the abalone data.
* Note the column names are just F, I and M.
* Previously, when get_dummies was working with the whole data frame, they were called Sex_F, Sex_I and Sex_M.
* We can reproduce the data frame behavior by adding a prefix parameter to get_dummies.
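The naming difference is easy to confirm on a toy series. The values below are made up, and `prefix_sep` is an additional get_dummies parameter not covered in the video; it is included only to show that the separator can be changed too.

```python
import pandas as pd

sex = pd.Series(["M", "F", "I", "M"], name="Sex")  # toy values

plain = pd.get_dummies(sex)                                 # bare category names
prefixed = pd.get_dummies(sex, prefix="Sex")                # data frame style names
dotted = pd.get_dummies(sex, prefix="Sex", prefix_sep=".")  # custom separator

print(plain.columns.tolist())
print(prefixed.columns.tolist())
print(dotted.columns.tolist())
```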

In [None]:
pd.get_dummies(abalone["Sex"], prefix="Sex")

Script:
* Presumably, the pandas developers believed that the concise names were nicer if you were only looking at the one-hot encodings of a single column.
* But they realized that the column name prefixes would be more helpful when used with the rest of a data frame.
* Let's look at the scikit-learn support next.

In [None]:
from sklearn.preprocessing import OneHotEncoder

Script:
* I will create this OneHotEncoder object using two options that will be important.
* The first option, handle_unknown, is set to ignore unrecognized categories.
* I set that option because the default behavior is to raise an error when transforming a category value that was not present when the one-hot encoder was fitted.
* The second one is to turn off sparse output.
* The default sparse output can reduce memory usage, but the pandas output option we use later in this video does not support it, so we'll turn it off.

In [None]:
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

Script:
* To fit this one-hot encoder, we will pass in the abalone data frame limited to just the sex column.


In [None]:
one_hot_encoder.fit(abalone[["Sex"]])

Script:
* If you do not limit the columns, it will try to encode all the other columns too, even if they are numbers, treating every distinct number as its own category.
* Also note that you should pass in a list of columns here, so that you are passing in a two dimensional data frame and not a one-dimensional series.
* Let's look at the categories this encoder found.

In [None]:
one_hot_encoder.categories_

Script:
* No surprises here.
* It found F, I and M just like the pandas version.
* Let's look at the encoded output now.

In [None]:
one_hot_encoder.transform(abalone[["Sex"]])

Script:
* This output will look different if you do not disable the sparse_output option.
* The one-hot encoder also has a handy method to return the names of the new columns.

In [None]:
one_hot_encoder.get_feature_names_out()

Script:
* I found that function while looking for ways to turn that array output into columns for a pandas data frame.
* But it turns out that function is not needed.
* The one-hot encoder object has a set_output method which you can use to transform the output into a pandas data frame.

In [None]:
one_hot_encoder.set_output(transform="pandas")

Script:
* Now when you call the transform method, the result returned will be a pandas data frame instead of a NumPy array.

In [None]:
one_hot_encoder.transform(abalone[["Sex"]])

Script:
* If we had not turned off sparse output, that last step would have raised an error, since scikit-learn's pandas output does not support the default sparse format.
* To wrap up, let's make a data frame combining the original data frame with the new one-hot encoded columns.

In [None]:
abalone2 = abalone.join(one_hot_encoder.transform(abalone[["Sex"]]))
abalone2

Script:
* That's it for how to add one-hot encodings in either pandas or scikit-learn.
* Both libraries make it pretty easy; we spent more time looking around than on the actual encoding.