<img src="https://www.th-koeln.de/img/logo.svg" style="float:right;" width="200">

# <font color="#C70039">Guide To Encoding Categorical Values</font>
* Purpose: Miscellaneous
* Author of notebook: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Date:   08.07.2022

---------------------------------

### <font color="#ce33ff">DESCRIPTION</font>:
This notebook is derived from this [article](http://pbpython.com/categorical-encoding.html).

-------------------------------------------------------------------------------------------------------------

Import pandas, NumPy, scikit-learn and the [category_encoders](https://github.com/scikit-learn-contrib/category_encoders) library. Install category_encoders first if it is not already available:

* pip install category_encoders

or

* conda install -c conda-forge category_encoders

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

import category_encoders as ce

The data file has no header row, so define the column names manually.

In [None]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", "body_style",
           "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", 
           "compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

Read the data from the URL, assign the column names, and convert `?` entries to NaN.

In [None]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?" )
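
To see what `header=None`, `names` and `na_values` do without downloading anything, here is a minimal sketch on a made-up three-row sample. The rows and the shortened `sample_headers` list are assumptions for illustration, not part of the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same headerless style as the autos file
raw = io.StringIO("3,?,alfa-romero,two\n1,164,audi,four\n2,?,bmw,?\n")
sample_headers = ["symboling", "normalized_losses", "make", "num_doors"]

# header=None: the first row is data; names: supply the column labels;
# na_values="?": treat question marks as missing values
sample = pd.read_csv(raw, header=None, names=sample_headers, na_values="?")

# With "?" parsed as NaN, normalized_losses becomes a float column
missing_per_column = sample.isnull().sum()
print(sample.dtypes)
print(missing_per_column)
```

Without `na_values="?"`, the `normalized_losses` column would be read as text because of the question marks.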

In [None]:
df.head()

Look at the data types contained in the dataframe

In [None]:
df.dtypes

Create a copy of the data with only the object columns.

In [None]:
obj_df = df.select_dtypes(include=['object']).copy()

In [None]:
obj_df.head()

Check for null values in the data

In [None]:
obj_df[obj_df.isnull().any(axis=1)]

The null values are in the num_doors column, so look at which values it takes and how often.

In [None]:
obj_df["num_doors"].value_counts()

Fill the missing num_doors values with the most common value, four.

In [None]:
obj_df = obj_df.fillna({"num_doors": "four"})
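
Instead of hard-coding the fill value, the most common value could also be looked up with `Series.mode()`. The following is a small sketch on a made-up series; the data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical door counts with one missing entry
doors = pd.Series(["four", "two", "four", None, "four", "two"], name="num_doors")

# mode() ignores missing values by default and returns the most frequent value(s)
most_common = doors.mode().iloc[0]

# Fill the gap with that value
filled = doors.fillna(most_common)
print(most_common)
print(filled.tolist())
```

If several values tie for most frequent, `mode()` returns all of them, so `.iloc[0]` simply picks the first one.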

In [None]:
obj_df[obj_df.isnull().any(axis=1)]

### Encoding values using pandas

Convert the num_cylinders and num_doors values from number words to integers, using a nested dictionary that maps each column name to its replacements.

In [None]:
obj_df["num_cylinders"].value_counts()

In [None]:
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three":3 }}

In [None]:
obj_df = obj_df.replace(cleanup_nums)
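
As an alternative to `replace`, the same conversion can be sketched with `Series.map` and an explicit cast. This is a different technique from the one used in the notebook, and the toy data and the shared `word_to_number` dictionary are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame with number words in both columns
toy = pd.DataFrame({"num_doors": ["four", "two", "four"],
                    "num_cylinders": ["four", "six", "twelve"]})

# One shared lookup table for both columns
word_to_number = {"two": 2, "three": 3, "four": 4, "five": 5,
                  "six": 6, "eight": 8, "twelve": 12}

for col in ["num_doors", "num_cylinders"]:
    # map() turns unmapped words into NaN, so the int cast fails loudly
    # if the lookup table is incomplete
    toy[col] = toy[col].map(word_to_number).astype("int64")

print(toy.dtypes)
print(toy)
```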

In [None]:
obj_df.head()

In [None]:
obj_df.dtypes

One approach to label encoding is to convert the column to the pandas category dtype.

In [None]:
obj_df["body_style"].value_counts()

In [None]:
obj_df["body_style"] = obj_df["body_style"].astype('category')

In [None]:
obj_df.dtypes

We can assign the category codes to a new column so we have a clean numeric representation

In [None]:
obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
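
The integer codes follow the order of the categories, which pandas sorts alphabetically for strings by default. A minimal sketch on made-up values shows the codes and how to recover the mapping.

```python
import pandas as pd

# Hypothetical body styles
styles = pd.Series(["sedan", "hatchback", "sedan", "wagon"]).astype("category")

# Codes index into styles.cat.categories (sorted: hatchback, sedan, wagon)
codes = styles.cat.codes

# Lookup table from code back to label
code_to_label = dict(enumerate(styles.cat.categories))
print(codes.tolist())
print(code_to_label)
```

Missing values receive the code -1. The codes imply an order that may not be meaningful for the model.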

In [None]:
obj_df.head()

In [None]:
obj_df.dtypes

For one-hot encoding, use pandas get_dummies.

In [None]:
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

get_dummies has options for selecting the columns and adding prefixes to make the resulting data easier to understand.

In [None]:
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()
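
To make the effect of `columns` and `prefix` concrete, here is a small sketch on a made-up frame. The three rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"make": ["audi", "bmw", "subaru"],
                    "drive_wheels": ["fwd", "rwd", "4wd"]})

# Only drive_wheels is expanded; the original column is dropped and replaced
# by one indicator column per value, named <prefix>_<value>
dummies = pd.get_dummies(toy, columns=["drive_wheels"], prefix=["drive"])

indicator_cols = [c for c in dummies.columns if c.startswith("drive_")]
print(indicator_cols)
print(dummies)
```

Each row has exactly one indicator set. Depending on the pandas version, the indicator columns are boolean or integer.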

In [None]:
obj_df["engine_type"].value_counts()

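Sometimes a custom binary feature is enough, for example a flag for whether the engine type is an overhead cam (OHC) variant. Use np.where and the str accessor to create it in one efficient line.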

In [None]:
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
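
Because `str.contains` does a substring match, values such as `dohc` and `ohcv` are flagged as well. The sketch below uses made-up engine types, including one missing entry, and passes `na=False` so that missing values count as non-matches; the `na=False` argument is an addition that the notebook cell does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical engine types, including a missing one
engines = pd.Series(["dohc", "ohcv", "l", "rotor", "ohc", None])

# na=False makes str.contains return False instead of NaN for missing entries
ohc_flag = np.where(engines.str.contains("ohc", na=False), 1, 0)
print(ohc_flag.tolist())
```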

In [None]:
obj_df[["make", "engine_type", "OHC_Code"]].head(20)

### Encoding Values Using Scikit-learn

Instantiate the OrdinalEncoder

In [None]:
ord_enc = OrdinalEncoder()

In [None]:
obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])
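
OrdinalEncoder expects a two-dimensional input, which is why the column is selected with double brackets. A minimal sketch on made-up makes shows the shape of the result, the learned categories, and the way back to the labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame
toy = pd.DataFrame({"make": ["audi", "bmw", "audi", "volvo"]})

enc = OrdinalEncoder()
# Double brackets keep the input 2-D; the result is a float array of shape (4, 1)
codes = enc.fit_transform(toy[["make"]])
toy["make_code"] = codes.ravel()

# categories_ holds one array per encoded column, in code order
learned = enc.categories_[0].tolist()

# inverse_transform maps the codes back to the original labels
restored = enc.inverse_transform(codes).ravel().tolist()
print(learned)
print(toy)
```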

In [None]:
obj_df[["make", "make_code"]].head(11)

To accomplish something similar to pandas get_dummies, use OneHotEncoder

In [None]:
oe_style = OneHotEncoder()
oe_results = oe_style.fit_transform(obj_df[["body_style"]])

The result is a sparse matrix that needs to be converted to a dense array and then to a DataFrame

In [None]:
oe_results.toarray()

In [None]:
pd.DataFrame(oe_results.toarray(), columns=oe_style.categories_).head()

### Advanced Encoding
Use the [category_encoders](https://github.com/scikit-learn-contrib/category_encoders) library for additional encoding schemes.

In [None]:
# Get a new clean dataframe
obj_df = df.select_dtypes(include=['object']).copy()

In [None]:
obj_df.head()

Try out the Backward Difference Encoder on the engine_type column

In [None]:
'''
# Specify the columns to encode then fit and transform
encoder = ce.BackwardDifferenceEncoder(cols=["engine_type"])
encoder.fit(obj_df, verbose=1)

encoder.fit_transform(obj_df).iloc[:,8:14].head()
'''

Another approach is to use a polynomial encoding.

In [None]:
'''
encoder = ce.polynomial.PolynomialEncoder(cols=["engine_type"])
encoder.fit_transform(obj_df, verbose=1).iloc[:,8:14].head()
'''

### Scikit-learn pipeline
Show an example of how to incorporate the encoding strategies into a scikit-learn pipeline

In [None]:
# for the purposes of this analysis, only use a small subset of features
feature_cols = [
    'fuel_type', 'make', 'aspiration', 'highway_mpg', 'city_mpg',
    'curb_weight', 'drive_wheels'
]

# Remove the empty price rows
df_ml = df.dropna(subset=['price'])

X = df_ml[feature_cols]
y = df_ml['price']

In [None]:
column_trans = make_column_transformer((OneHotEncoder(handle_unknown='ignore'),
                                        ['fuel_type', 'make', 'drive_wheels']),
                                      (OrdinalEncoder(), ['aspiration']),
                                      remainder='passthrough')

In [None]:
linreg = LinearRegression()
pipe = make_pipeline(column_trans, linreg)

In [None]:
cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_absolute_error')

In [None]:
# Average the negative mean absolute error over the 10 cross-validation folds
cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_absolute_error').mean().round(2)
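
With `scoring='neg_mean_absolute_error'`, scikit-learn reports the error with a flipped sign so that larger scores are always better. To read the result as a plain error, negate it. The sketch below illustrates this on synthetic data; the toy features, the price formula and the name `pipe_toy` are assumptions for illustration and unrelated to the autos results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: one categorical and one numeric feature
rng = np.random.default_rng(0)
n = 40
toy = pd.DataFrame({
    "fuel_type": rng.choice(["gas", "diesel"], size=n),
    "curb_weight": rng.uniform(1500, 4000, size=n),
})
# Made-up price formula with noise
price = (5 * toy["curb_weight"]
         + np.where(toy["fuel_type"] == "diesel", 2000, 0)
         + rng.normal(0, 500, size=n))

# One-hot encode the categorical column, pass the numeric one through
trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
    remainder="passthrough")
pipe_toy = make_pipeline(trans, LinearRegression())

# Scores are negative MAE values, one per fold
neg_mae = cross_val_score(pipe_toy, toy, price, cv=5,
                          scoring="neg_mean_absolute_error")

# Flip the sign to get the mean absolute error per fold
mae_per_fold = -neg_mae
mean_mae = round(mae_per_fold.mean(), 2)
print(mae_per_fold)
print(mean_mae)
```

Because the encoders are part of the pipeline, they are refit on the training portion of each fold, which keeps information from the held-out fold out of the encoding step.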