
## CONTENTS

1. Label encoding - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

2. One-hot encoding - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

3. Label encoding followed by one-hot encoding

4. Label binarizer - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html#sklearn.preprocessing.LabelBinarizer


My answer: https://stackoverflow.com/a/63822728/5114585

In [7]:
import pandas as pd

housing = pd.read_csv(r"https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv")
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [8]:
print(housing.ocean_proximity.values)
print(housing.ocean_proximity.value_counts())

# 5 categories; in alphabetical order they are [<1H OCEAN, INLAND, ISLAND, NEAR BAY, NEAR OCEAN]
# (value_counts() below lists them by frequency, not alphabetically)

['NEAR BAY' 'NEAR BAY' 'NEAR BAY' ... 'INLAND' 'INLAND' 'INLAND']
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64


In [9]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer

### 1-LabelEncoder

In [10]:
le = LabelEncoder()

ocean_le = le.fit_transform(housing['ocean_proximity'])

# fit_transform returns a new array; the housing DataFrame itself is not modified

# Inspect the integer codes

ocean_le  # a NumPy array

array([3, 3, 3, ..., 1, 1, 1])
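The codes above (3 for NEAR BAY, 1 for INLAND) follow from `LabelEncoder` sorting the unique labels and using each label's position as its code. A minimal sketch of how to read that mapping from the fitted encoder; the toy `sample` list and the name `le_demo` are made up for illustration and only reuse the dataset's five category names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy sample (not the housing data) with the same five category names
sample = ["NEAR BAY", "INLAND", "<1H OCEAN", "ISLAND", "NEAR OCEAN", "INLAND"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(sample)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes)
```

Reading the mapping from `classes_` avoids having to remember the sort order by hand.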

In [11]:
# To add the codes to the original DataFrame, first wrap them in a DataFrame

ocean_le_df = pd.DataFrame(ocean_le, columns=["LabelEncoder"])
# print(ocean_le_df)

housing = pd.concat([housing, ocean_le_df], axis=1)
print(housing)

# LabelEncoder sorts the 5 categories alphabetically: [<1H OCEAN, INLAND, ISLAND, NEAR BAY, NEAR OCEAN]
# Labels = [0, 1, 2, 3, 4] respectively

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0        -122.23     37.88                41.0        880.0           129.0   
1        -122.22     37.86                21.0       7099.0          1106.0   
2        -122.24     37.85                52.0       1467.0           190.0   
3        -122.25     37.85                52.0       1274.0           235.0   
4        -122.25     37.85                52.0       1627.0           280.0   
...          ...       ...                 ...          ...             ...   
20635    -121.09     39.48                25.0       1665.0           374.0   
20636    -121.21     39.49                18.0        697.0           150.0   
20637    -121.22     39.43                17.0       2254.0           485.0   
20638    -121.32     39.43                18.0       1860.0           409.0   
20639    -121.24     39.37                16.0       2785.0           616.0   

       population  households  median_income  media

In [12]:
inverse_LabelEncoder = le.inverse_transform(housing['LabelEncoder'])
inverse_LabelEncoder_df = pd.DataFrame(inverse_LabelEncoder, columns=['inverse_LabelEncoder'])
housing = pd.concat([housing, inverse_LabelEncoder_df],axis=1)
print(housing.iloc[ :, -3: ])

# drop this helper column again; drop() returns a new DataFrame, so assign the result
housing = housing.drop('inverse_LabelEncoder', axis=1)
housing


      ocean_proximity  LabelEncoder inverse_LabelEncoder
0            NEAR BAY             3             NEAR BAY
1            NEAR BAY             3             NEAR BAY
2            NEAR BAY             3             NEAR BAY
3            NEAR BAY             3             NEAR BAY
4            NEAR BAY             3             NEAR BAY
...               ...           ...                  ...
20635          INLAND             1               INLAND
20636          INLAND             1               INLAND
20637          INLAND             1               INLAND
20638          INLAND             1               INLAND
20639          INLAND             1               INLAND

[20640 rows x 3 columns]


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,LabelEncoder
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,3
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,3
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,3
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,3
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,3
...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND,1
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND,1
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND,1
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND,1


### 2-OneHotEncoder

In [13]:
ohe = OneHotEncoder(sparse_output=False)

# sparse_output (default=True): return a SciPy sparse matrix if True, otherwise a dense NumPy array.
# With sparse_output=False there is no need to call toarray() on the result.
# Note: this parameter was named `sparse` before scikit-learn 1.2.

# ocean_ohe = ohe.fit_transform(housing['ocean_proximity'])  # ValueError: Expected 2D array, got 1D array instead

ocean_ohe = ohe.fit_transform(housing['ocean_proximity'].values.reshape(-1, 1))

# fit_transform returns a new array; the housing DataFrame itself is not modified

# Inspect the encoding

print(ocean_ohe)  # a NumPy array


[[0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]
 ...
 [0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0.]]


In [14]:
# Column order follows the encoder's sorted categories:
# ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
ocean_cat = list(ohe.categories_[0])

ocean_OneHot = pd.DataFrame(ocean_ohe, columns=ocean_cat)
ocean_OneHot

Unnamed: 0,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...
20635,0.0,1.0,0.0,0.0,0.0
20636,0.0,1.0,0.0,0.0,0.0
20637,0.0,1.0,0.0,0.0,0.0
20638,0.0,1.0,0.0,0.0,0.0
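
Column labels for a one-hot matrix can be read from the fitted encoder rather than typed by hand, which avoids mislabelling columns when the encoder's sorted order differs from the order one expects. A small sketch under that idea; the `toy` frame and the names `ohe_demo` and `one_hot_df` are illustrative, and the default sparse output is converted with `toarray()` so the snippet does not depend on the name of the dense-output parameter:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame (not the housing data) with the same five category names
toy = pd.DataFrame(
    {"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "<1H OCEAN", "NEAR OCEAN"]}
)

ohe_demo = OneHotEncoder()  # default output is a SciPy sparse matrix
encoded = ohe_demo.fit_transform(toy[["ocean_proximity"]])  # double brackets keep the input 2D

# categories_ is a list with one array of sorted categories per input column
labels = list(ohe_demo.categories_[0])

one_hot_df = pd.DataFrame(encoded.toarray(), columns=labels, index=toy.index)
print(one_hot_df)
```

Selecting the column with `toy[["ocean_proximity"]]` is an alternative to `.values.reshape(-1, 1)` for getting the 2D input the encoder expects.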


### 3-LabelEncoding followed by OneHotEncoding

Same result as using OneHotEncoder directly.


In [15]:
le = LabelEncoder()
ocean_le = le.fit_transform(housing['ocean_proximity'])

ohe = OneHotEncoder(sparse_output=False)  # named `sparse` before scikit-learn 1.2

le_n_ohe = ohe.fit_transform(ocean_le.reshape(-1, 1))

# le.classes_ lists the categories in the same (sorted) order as the one-hot columns
df_ohe_over_le = pd.DataFrame(le_n_ohe, columns=le.classes_)
df_ohe_over_le

Unnamed: 0,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...
20635,0.0,1.0,0.0,0.0,0.0
20636,0.0,1.0,0.0,0.0,0.0
20637,0.0,1.0,0.0,0.0,0.0
20638,0.0,1.0,0.0,0.0,0.0
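
The claim that both routes give the same matrix can be checked directly. The sketch below does so on a small made-up array (`values` is illustrative, not the housing column); it works because both encoders sort the categories, so the integer codes preserve the alphabetical column order:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy values (not the housing data)
values = np.array(["NEAR BAY", "INLAND", "ISLAND", "<1H OCEAN", "INLAND"])

# Route 1: one-hot encode the strings directly
direct = OneHotEncoder().fit_transform(values.reshape(-1, 1)).toarray()

# Route 2: integer-encode first, then one-hot encode the integer codes
codes = LabelEncoder().fit_transform(values)
two_step = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

same = np.array_equal(direct, two_step)
print(same)  # True
```

Since `OneHotEncoder` accepts string input, the intermediate `LabelEncoder` step adds nothing here.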


### 4-LabelBinarizer

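Same result again, except that the output values are integers rather than floats.

Ideally, it should be used for labels (the response variable), not for features.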

In [16]:
lb  = LabelBinarizer()
ocean_lb = lb.fit_transform(housing['ocean_proximity'])
print(ocean_lb)

[[0 0 0 1 0]
 [0 0 0 1 0]
 [0 0 0 1 0]
 ...
 [0 1 0 0 0]
 [0 1 0 0 0]
 [0 1 0 0 0]]


In [None]:
# lb.classes_ lists the categories in the same (sorted) order as the output columns
df_LabelBinarizer = pd.DataFrame(ocean_lb, columns=lb.classes_)
df_LabelBinarizer

Unnamed: 0,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,0,0,0,1,0
1,0,0,0,1,0
2,0,0,0,1,0
3,0,0,0,1,0
4,0,0,0,1,0
...,...,...,...,...,...
20635,0,1,0,0,0
20636,0,1,0,0,0
20637,0,1,0,0,0
20638,0,1,0,0,0
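
One behaviour worth knowing when `LabelBinarizer` is used on a response variable: with exactly two classes it returns a single 0/1 column rather than two. A brief sketch with made-up labels (the lists and the names `lb_multi` and `lb_binary` are illustrative):

```python
from sklearn.preprocessing import LabelBinarizer

# Three classes: one indicator column per class
lb_multi = LabelBinarizer()
multi = lb_multi.fit_transform(["INLAND", "NEAR BAY", "ISLAND", "INLAND"])
print(multi.shape)

# Two classes: a single column, 1 for the second class in sorted order
lb_binary = LabelBinarizer()
binary = lb_binary.fit_transform(["yes", "no", "yes"])
print(binary.shape)

# inverse_transform maps the indicator matrix back to the original labels
recovered = lb_multi.inverse_transform(multi)
print(recovered)
```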


## Summary:
### When to use which encoder?

***LabelEncoder*** – for labels (response variable), coding 0, 1, 2, … [implies order] [output: 1D numpy array]

***OrdinalEncoder*** – for features, coding 0, 1, 2, … [implies order] [output: 2D numpy array]

***LabelBinarizer*** – for the response variable, coding 0 & 1 [creating multiple dummy columns] [output: numpy array]

***OneHotEncoder*** – for feature variables, coding 0 & 1 [creating multiple dummy columns] [output: sparse matrix by default, numpy array when dense output is requested]


There are many other encoders suitable for different cases.
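
`OrdinalEncoder` is listed in the summary but not demonstrated in this notebook. The sketch below shows it on a made-up `size` feature (the column, its values, and the chosen order are assumptions for illustration); passing `categories` explicitly sets the order, since the default would be the sorted one:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature (not part of the housing data)
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# One list of categories per input column, given in the intended order
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = oe.fit_transform(sizes[["size"]])  # 2D input, 2D float output

print(size_codes)
```

Unlike `LabelEncoder`, it takes 2D input and can encode several feature columns in one call.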


### How MultiLabelBinarizer works

Source: https://www.kaggle.com/questions-and-answers/66693

In [17]:
df_MLB = pd.DataFrame({"genre": [["action", "drama","fantasy"], ["fantasy","action"], ["drama"], ["sci-fi", "drama"]]})
df_MLB

Unnamed: 0,genre
0,"[action, drama, fantasy]"
1,"[fantasy, action]"
2,[drama]
3,"[sci-fi, drama]"


In [18]:
# Import MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
# Instantiate MultiLabelBinarizer
mlb = MultiLabelBinarizer()

In [6]:
# Encode the multi-label data as a binary indicator matrix (one column per genre)
genre_mlb = mlb.fit_transform(df_MLB['genre'])
genre_mlb

array([[1, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 1]])

In [5]:
# Retrieve the labels
genre = mlb.inverse_transform(genre_mlb)
genre

[('action', 'drama', 'fantasy'),
 ('action', 'fantasy'),
 ('drama',),
 ('drama', 'sci-fi')]
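
The indicator matrix is easier to read with column names. A small sketch that rebuilds the same genre data and labels the columns from `classes_`; the names `df_demo`, `mlb_demo` and `genre_df` are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same genre lists as in the example above
df_demo = pd.DataFrame(
    {"genre": [["action", "drama", "fantasy"], ["fantasy", "action"], ["drama"], ["sci-fi", "drama"]]}
)

mlb_demo = MultiLabelBinarizer()
genre_matrix = mlb_demo.fit_transform(df_demo["genre"])

# classes_ gives the sorted label names, one per output column
genre_df = pd.DataFrame(genre_matrix, columns=mlb_demo.classes_, index=df_demo.index)
print(genre_df)
```

Note that `inverse_transform` returns the labels in sorted order, which is why `['fantasy', 'action']` comes back as `('action', 'fantasy')`.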