<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-Libraries-and-Creating-data" data-toc-modified-id="Importing-Libraries-and-Creating-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing Libraries and Creating data</a></span></li><li><span><a href="#Preparation" data-toc-modified-id="Preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Encoding" data-toc-modified-id="Encoding-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Encoding</a></span></li></ul></div>

This notebook was prepared using the datasets from <a href='https://blog.cambridgespark.com/robust-one-hot-encoding-in-python-3e29bfcec77e'>this blog post</a>. However, I believe the approach presented here is simpler and more up to date, and thus better. It is more up to date because, for example, we no longer need label encoding before one-hot encoding for string features. It is simpler because it takes less code.

I do not even consider pandas' get_dummies method here, as it takes too much code to implement.
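
To give a feel for the extra work get_dummies tends to need, here is a minimal sketch that is not part of the original notebook. The toy frames `train` and `test` are made up for illustration. Encoding them separately yields different columns, so the test frame has to be realigned by hand, and categories seen only in test are silently dropped.

```python
import pandas as pd

# Illustrative toy frames (assumed for this sketch)
train = pd.DataFrame({"city": ["London", "Cambridge", "Liverpool"]})
test = pd.DataFrame({"city": ["Manchester", "Cambridge", "Liverpool"]})

# Each call only knows the categories present in the frame it is given
train_d = pd.get_dummies(train, columns=["city"])
test_d = pd.get_dummies(test, columns=["city"])

# The two frames end up with different column sets
mismatch = list(train_d.columns) != list(test_d.columns)

# Realign test to the train columns: missing columns are filled with 0,
# columns unknown to train (city_Manchester) are dropped
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(mismatch)
print(test_aligned)
```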

# Importing Libraries and Creating data

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import mlextension as ml #for the function returning onehot columns

In [2]:
%load_ext autoreload
%autoreload 2

Let's first create the train set

In [3]:
df_train = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10], 
                   ["Liverpool", "bus", 30]], 
                  columns=["city", "transport", "duration"])
df_train

Unnamed: 0,city,transport,duration
0,London,car,20
1,Cambridge,car,10
2,Liverpool,bus,30


Now the test set

In [4]:
df_test = pd.DataFrame([["Manchester", "bike", 30], 
                        ["Cambridge", "car", 40], 
                        ["Liverpool", "bike", 10]], 
                       columns=["city", "transport", "duration"])
df_test

Unnamed: 0,city,transport,duration
0,Manchester,bike,30
1,Cambridge,car,40
2,Liverpool,bike,10


# Preparation

In [5]:
#which columns to encode
cat_columns = ["city", "transport"]

The main difference in my solution is that **we collect the unique items of the whole set**, train and test combined, because either set may lack items that the other one has, as in this example. The train set has no Manchester in the city column and no bike in the transport column, whereas the test set lacks London in city and bus in transport.

In [6]:
whole=pd.concat([df_train,df_test],axis=0)
whole

Unnamed: 0,city,transport,duration
0,London,car,20
1,Cambridge,car,10
2,Liverpool,bus,30
0,Manchester,bike,30
1,Cambridge,car,40
2,Liverpool,bike,10


In [7]:
cats=ml.getfullitemsforOHE(whole,cat_columns) #the last parameter is True by default, which sorts the items.
#for these string columns it doesn't matter whether they are sorted or not, it is just a matter of preference
cats

[['Cambridge', 'Liverpool', 'London', 'Manchester'], ['bike', 'bus', 'car']]
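
The source of `ml.getfullitemsforOHE` is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch under assumptions: the name `get_full_items`, its `sort` parameter, and the dropping of missing values are all hypothetical and not taken from the `mlextension` module.

```python
import pandas as pd

def get_full_items(df, columns, sort=True):
    """Hypothetical stand-in for ml.getfullitemsforOHE.

    Returns one list of unique values per column, in the order of `columns`.
    """
    items = []
    for col in columns:
        # dropna() is an assumption: missing values are not treated as a category
        values = df[col].dropna().unique().tolist()
        items.append(sorted(values) if sort else values)
    return items

# Same categorical values as the combined train and test set above
whole_demo = pd.DataFrame({
    "city": ["London", "Cambridge", "Liverpool", "Manchester", "Cambridge", "Liverpool"],
    "transport": ["car", "car", "bus", "bike", "car", "bike"],
})
cats_demo = get_full_items(whole_demo, ["city", "transport"])
print(cats_demo)
```

With `sort=False` the items come back in order of first appearance, which changes only the column order of the encoded output.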

# Encoding

In [8]:
#note: in scikit-learn >= 1.2 the sparse argument is named sparse_output
ohe=OneHotEncoder(categories=cats, sparse=False,handle_unknown="ignore")
X_trans=ohe.fit_transform(df_train[cat_columns])

In [9]:
#train set will be transformed as below and go into a model like this
X_trans

array([[0., 0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1., 0.]])

In [10]:
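#let's present them in a dataframe
#get_feature_names_out is available from scikit-learn 1.0; older versions use get_feature_names
pd.DataFrame(X_trans,columns=ohe.get_feature_names_out(cat_columns))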

Unnamed: 0,city_Cambridge,city_Liverpool,city_London,city_Manchester,transport_bike,transport_bus,transport_car
0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [11]:
#let's transform the test set as well; only transform, do not fit again
x_test_trans=ohe.transform(df_test[cat_columns])
pd.DataFrame(x_test_trans,columns=ohe.get_feature_names_out(cat_columns))

Unnamed: 0,city_Cambridge,city_Liverpool,city_London,city_Manchester,transport_bike,transport_bus,transport_car
0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [12]:
#now we have a new dataset with values that neither the train set nor the test set has; let's see if any problems crop up
#Istanbul in place of Manchester, plane in place of the second bike
df_new = pd.DataFrame([["Istanbul", "bike", 30], 
                        ["Cambridge", "car", 40], 
                        ["Liverpool", "plane", 10]], 
                       columns=["city", "transport", "duration"])
df_new

Unnamed: 0,city,transport,duration
0,Istanbul,bike,30
1,Cambridge,car,40
2,Liverpool,plane,10


In [13]:
#thanks to the handle_unknown="ignore" parameter, no problems come up. Remove that parameter, run again and see the error
x_new_trans=ohe.transform(df_new[cat_columns])
pd.DataFrame(x_new_trans,columns=ohe.get_feature_names_out(cat_columns))

Unnamed: 0,city_Cambridge,city_Liverpool,city_London,city_Manchester,transport_bike,transport_bus,transport_car
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0


Notice that in the first row all city columns are 0, and in the last row all transport columns are 0. This is how the encoder represents the unknown values Istanbul and plane.
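
Since unknown values show up as all-zero blocks, it may be useful to flag such rows before they reach a model. The following is a minimal, self-contained sketch that is not part of the original notebook. The variable names are illustrative, and it uses the encoder's default sparse output with `.toarray()` so that it does not depend on the `sparse` / `sparse_output` argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat_columns = ["city", "transport"]
cats = [["Cambridge", "Liverpool", "London", "Manchester"], ["bike", "bus", "car"]]

train = pd.DataFrame([["London", "car"], ["Cambridge", "car"], ["Liverpool", "bus"]],
                     columns=cat_columns)
new = pd.DataFrame([["Istanbul", "bike"], ["Cambridge", "car"], ["Liverpool", "plane"]],
                   columns=cat_columns)

enc = OneHotEncoder(categories=cats, handle_unknown="ignore")
enc.fit(train)
encoded = enc.transform(new).toarray()  # dense array of shape (3, 7)

# The first len(cats[0]) columns belong to city, the rest to transport
n_city = len(cats[0])
city_unknown = encoded[:, :n_city].sum(axis=1) == 0
transport_unknown = encoded[:, n_city:].sum(axis=1) == 0

# Rows of the new data with at least one value the encoder has never been told about
print(new[city_unknown | transport_unknown])
```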