## Guide to Encoding Categorical Values in Python
https://pbpython.com/categorical-encoding.html

In [1]:
import pandas as pd
import numpy as np


In [2]:
# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

In [3]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [4]:
df.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

Since this article focuses only on encoding the categorical variables, we keep only the object columns in our dataframe. Pandas has a helpful select_dtypes function that we can use to build a new dataframe containing only those columns.

In [5]:
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


Before going any further, there are a couple of null values in the data that we need to clean up.

In [6]:
obj_df[obj_df.isnull().any(axis=1)]

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
27,dodge,gas,turbo,,sedan,fwd,front,ohc,four,mpfi
63,mazda,diesel,std,,sedan,fwd,front,ohc,four,idi


For the sake of simplicity, just fill in the missing num_doors values with the string "four", since that is the most common value in the column:

In [8]:
obj_df["num_doors"].value_counts()

four    114
two      89
Name: num_doors, dtype: int64

In [9]:
obj_df = obj_df.fillna({"num_doors": "four"})
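
The fill value above is hard-coded after reading the counts by eye. As a minimal sketch, the most common value can also be looked up with mode(). The `sample` frame below is a hypothetical stand-in for obj_df, not data from the article:

```python
import pandas as pd

# Hypothetical stand-in for obj_df, small enough to check by hand.
sample = pd.DataFrame({
    "make": ["dodge", "mazda", "audi", "audi", "volvo"],
    "num_doors": ["four", None, "four", "two", None],
})

# mode() ignores missing values and returns a Series (ties are possible),
# so take the first entry as the fill value.
most_common = sample["num_doors"].mode()[0]

# Fill only the num_doors column; other columns are left untouched.
filled = sample.fillna({"num_doors": most_common})

print(most_common)
print(filled["num_doors"].isnull().sum())
```

Filling with the mode is a convenience for this walkthrough. Whether it is appropriate for a real analysis depends on why the values are missing.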

Now that the data does not have any null values, we can look at options for encoding the categorical values.

### Approach Label Encoding

One approach to encoding categorical values is a technique called label encoding. Label encoding simply converts each distinct value in a column to a number. For example, the body_style column contains 5 different values. We could choose to encode it like this:

convertible -> 0
hardtop -> 1
hatchback -> 2
sedan -> 3
wagon -> 4
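
The mapping listed above can be applied by hand with a plain dictionary and Series.map. This is only a sketch of the idea: the `body_styles` series and the `style_codes` name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of body_style values (not the full dataset).
body_styles = pd.Series(
    ["convertible", "hatchback", "sedan", "wagon", "hardtop", "sedan"],
    name="body_style",
)

# The mapping from the text, written out explicitly.
style_codes = {
    "convertible": 0,
    "hardtop": 1,
    "hatchback": 2,
    "sedan": 3,
    "wagon": 4,
}

# map() looks each value up in the dictionary; any value that is
# missing from the dictionary would become NaN.
encoded = body_styles.map(style_codes)
print(encoded.tolist())
```

Writing the dictionary yourself gives full control over which number each label gets, at the cost of having to maintain it when new labels appear.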

One trick you can use in pandas is to convert a column to a category, then use those category values for your label encoding:

In [11]:
obj_df["body_style"] = obj_df["body_style"].astype('category')
obj_df.dtypes

make                 object
fuel_type            object
aspiration           object
num_doors            object
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders        object
fuel_system          object
dtype: object

Then you can assign the encoded variable to a new column using the cat.codes accessor:

In [12]:
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi,0
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi,0
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi,2
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi,3
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi,3


The nice aspect of this approach is that you get the benefits of pandas categories (compact data size, ability to order, plotting support), and the column can still be easily converted to numeric values for further analysis.
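
To make the link between codes and labels concrete, here is a small sketch on a made-up series (`styles` is hypothetical, not the article's column). It shows that codes follow the sorted category order, how to map codes back to labels, and that missing values get the code -1:

```python
import pandas as pd

# Hypothetical sample values converted to the category dtype.
styles = pd.Series(["sedan", "convertible", "wagon", "sedan"]).astype("category")

# Categories are sorted by default: convertible, sedan, wagon.
codes = styles.cat.codes

# Build a code -> label lookup from the category index and decode.
code_to_label = dict(enumerate(styles.cat.categories))
decoded = codes.map(code_to_label)

# Missing values are not a category; they are encoded as -1.
with_missing = pd.Series(["sedan", None]).astype("category").cat.codes

print(codes.tolist())
print(decoded.tolist())
print(with_missing.tolist())
```

Because the codes depend on the sorted set of labels present in the data, the same label can receive a different code in another dataset unless the categories are fixed explicitly.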

### Approach Custom Binary Encoding

#### Scikit-Learn

Scikit-learn provides similar encoders. For instance, to do the equivalent of label encoding on the make of the car, we instantiate an OrdinalEncoder object and call fit_transform on the data:

In [13]:
from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder()
obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])
obj_df[["make", "make_code"]].head(10)

Unnamed: 0,make,make_code
0,alfa-romero,0.0
1,alfa-romero,0.0
2,alfa-romero,0.0
3,audi,1.0
4,audi,1.0
5,audi,1.0
6,audi,1.0
7,audi,1.0
8,audi,1.0
9,audi,1.0
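
A fitted OrdinalEncoder also remembers which label belongs to which code. The sketch below uses a small hypothetical `cars` frame to show the categories_ attribute and inverse_transform, which turns codes back into labels:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample, not the full dataset.
cars = pd.DataFrame({"make": ["audi", "volvo", "alfa-romero", "audi"]})

ord_enc = OrdinalEncoder()

# fit_transform expects 2D input and returns a float array of shape (n_rows, 1).
codes = ord_enc.fit_transform(cars[["make"]])

# Flatten to 1D before storing the codes as a column.
cars["make_code"] = codes.ravel()

# One array of sorted labels per encoded column; position == code.
labels = ord_enc.categories_[0]

# inverse_transform maps the codes back to the original labels.
restored = ord_enc.inverse_transform(codes)

print(labels.tolist())
print(cars["make_code"].tolist())
print(restored.ravel().tolist())
```

Keeping the fitted encoder around means the same label-to-code assignment can be reused on new data with transform instead of being refit.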


Scikit-learn also supports one-hot (binary) encoding through the OneHotEncoder. The process is similar to the one above, but building a pandas DataFrame from the result takes a couple of extra steps.

In [14]:
from sklearn.preprocessing import OneHotEncoder

oe_style = OneHotEncoder()
oe_results = oe_style.fit_transform(obj_df[["body_style"]])
pd.DataFrame(oe_results.toarray(), columns=oe_style.categories_).head()

Unnamed: 0,convertible,hardtop,hatchback,sedan,wagon
0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0


The next step would be to join this data back to the original dataframe. Here is an example:

In [15]:
obj_df = obj_df.join(pd.DataFrame(oe_results.toarray(), columns=oe_style.categories_))

The key point is that OneHotEncoder returns a sparse matrix by default, so you need to call toarray() to get a dense array that can be converted into a DataFrame.

In [17]:
obj_df

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,make_code,"(convertible,)","(hardtop,)","(hatchback,)","(sedan,)","(wagon,)"
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi,0,0.0,1.0,0.0,0.0,0.0,0.0
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi,0,0.0,1.0,0.0,0.0,0.0,0.0
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi,2,0.0,0.0,0.0,1.0,0.0,0.0
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi,3,1.0,0.0,0.0,0.0,1.0,0.0
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi,3,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,volvo,gas,std,four,sedan,rwd,front,ohc,four,mpfi,3,21.0,0.0,0.0,0.0,1.0,0.0
201,volvo,gas,turbo,four,sedan,rwd,front,ohc,four,mpfi,3,21.0,0.0,0.0,0.0,1.0,0.0
202,volvo,gas,std,four,sedan,rwd,front,ohcv,six,mpfi,3,21.0,0.0,0.0,0.0,1.0,0.0
203,volvo,diesel,turbo,four,sedan,rwd,front,ohc,six,idi,3,21.0,0.0,0.0,0.0,1.0,0.0
