In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
import seaborn as sns
sns.set_style("whitegrid")
# Bigger font
sns.set_context("talk")
# Figure size
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 10, 4
np.random.seed(123)

## Ordinal vs Categorical

Ordinal features are **ordered categorical features**: the categories have a meaningful order, but the distance between consecutive values is not necessarily equal.

They may already be presented as encoded values.  
For example, in the Titanic dataset, "Pclass" is an ordinal feature.

## PREPROCESSING

## 1. Tree-based models

**Label Encoding/Integer Encoding** = Mapping of unique values to numbers.  
It works well with tree-based models.

- For example, "red" is 1, "yellow" is 2, and "green" is 3.
- This encoding seems sufficient for **ordinal data**; for example, a "place" variable with values like "first", "second" and "third".
- Unlike one-hot encoding, it can be used whether a feature has few or very many categories, because it always produces a single column.
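
For ordinal data, the integer codes should follow the intended rank rather than whatever order an encoder happens to pick. A minimal sketch, assuming a made-up "place" column (the values and variable names are hypothetical), that declares the order explicitly with an ordered `pd.Categorical`:

```python
import pandas as pd

# Hypothetical "place" column; the values are made up for illustration.
place = pd.Series(["second", "first", "third", "first"])

# Declare the order explicitly so the codes do not depend on alphabetical sorting.
order = ["first", "second", "third"]
place_cat = pd.Categorical(place, categories=order, ordered=True)

# .codes are 0-based integers that respect the declared order; shift to start at 1.
place_encoded = pd.Series(place_cat.codes, index=place.index) + 1
print(place_encoded.tolist())
```

Values that are not listed in `categories` would get the code -1 (0 after the shift), so it may be worth checking for them before training.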

Types of label encoding
#### a. Alphabetical (sorted)

In [106]:
from sklearn import preprocessing
titanic_train = pd.read_csv("A. Datasets/data/titanic_train.csv")
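# Drop missing ports into a separate Series; calling dropna(inplace=True)
# on a selected column does not reliably change the DataFrame.
embarked = titanic_train['Embarked'].dropna()
print(embarked.head(6))
encoder = preprocessing.LabelEncoder()
encoder.fit(embarked.values)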
encoder.transform(['S', 'C', 'Q'])

0    S
1    C
2    S
3    S
4    S
5    Q
Name: Embarked, dtype: object


array([2, 0, 1])

#### b. Order of appearance

In [100]:
# print(titanic_train['Embarked'].factorize())
pd.factorize(['S', 'C', 'Q'])[0]

array([0, 1, 2])
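
The two variants can be compared side by side. The sketch below uses a small made-up Series of port codes (not the Titanic file) and shows that `pd.factorize` with `sort=True` gives the same codes as `LabelEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small made-up sample of port codes.
ports = pd.Series(["S", "C", "Q", "S"])

# Order of appearance: the first value seen gets code 0.
codes_appearance, uniques_appearance = pd.factorize(ports)

# sort=True sorts the unique values first, which matches LabelEncoder.
codes_sorted, uniques_sorted = pd.factorize(ports, sort=True)
codes_sklearn = LabelEncoder().fit_transform(ports)

print(codes_appearance)
print(codes_sorted)
print(codes_sklearn)
```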

## 2. Non-tree-based models

#### a. Frequency encoding

- It maps values to their frequencies.
- This one may actually help both **tree-based** and **non-tree-based models**.

For example, if the frequency of a category is correlated with the target value, a linear model can exploit this dependency.

In [111]:
encoding = titanic_train.groupby('Embarked').size()
encoding = encoding/len(titanic_train)
freq_enc = titanic_train['Embarked'].map(encoding).head(6)
print(freq_enc)

0    0.722783
1    0.188552
2    0.722783
3    0.722783
4    0.722783
5    0.086420
Name: Embarked, dtype: float64


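If multiple categories have the same frequency, they become indistinguishable after frequency encoding. In that case, ranking the frequencies with `rankdata` can help; note that its default `method='average'` still assigns tied values the same rank, as the output of the next cell shows.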

In [118]:
from scipy.stats import rankdata
rankdata([0.3, 0.3, 0.4])

array([1.5, 1.5, 3. ])
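
To actually separate tied categories, one option is `method="ordinal"`, which breaks ties by position. This is a sketch on a made-up column (the data and variable names are hypothetical), and the order in which ties are broken is arbitrary:

```python
import pandas as pd
from scipy.stats import rankdata

# Hypothetical column in which "a" and "b" occur equally often.
s = pd.Series(["a", "a", "b", "b", "c"])

# Relative frequency per category: a = 0.4, b = 0.4, c = 0.2.
freq = s.value_counts(normalize=True).sort_index()

# Default method="average": tied categories still share a rank.
rank_average = pd.Series(rankdata(freq.values), index=freq.index)

# method="ordinal": ties are broken by position, so every category gets its own rank.
rank_ordinal = pd.Series(rankdata(freq.values, method="ordinal"), index=freq.index)

# Map each row's category to its distinct rank.
encoded = s.map(rank_ordinal)
print(encoded.tolist())
```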

#### b. One Hot Encoding

It is "The Standard Approach for Categorical Data".

- Useful when the categories have no ordinal relationship, though some also use it for ordinal features (I need to look into this further).
- Some sources suggest **not to do it for variables taking more than 15 categories**.
- The resulting features are already scaled, since they only take the values 0 and 1.
- If there are only a few numeric features and many one-hot-encoded features, tree-based methods struggle: training slows down and the quality of the results can suffer.
- If the feature has too many categories, there will be too many columns with **few non-zero values, so store those values efficiently with sparse matrices** (often useful with categorical or text data). XGBoost, LightGBM and sklearn work with sparse matrices directly.

- It does dummy coding.
- **Dummy coding** = process of coding a categorical variable into dichotomous variables (variables coded as 0 or 1).
- **Dummy variables** = binary variables often called dummy in other fields such as statistics.
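
As a sketch of the sparse storage mentioned above: scikit-learn's `OneHotEncoder` returns a SciPy sparse matrix by default, so only the non-zero entries are kept. The `color` column and the variable names are made up for illustration:

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column.
df_cat = pd.DataFrame({"color": ["red", "yellow", "green", "red"]})

# With default settings the result is a SciPy sparse matrix, not a dense array.
onehot = OneHotEncoder()
X_sparse = onehot.fit_transform(df_cat[["color"]])

print(sparse.issparse(X_sparse))  # True
print(X_sparse.shape)             # one column per category
print(onehot.categories_[0])      # categories in sorted order
print(X_sparse.toarray())         # dense view, for inspection only
```

Calling `toarray()` on a large matrix defeats the purpose; it is used here only to look at the result.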

An example.

In [8]:
dict_df = {'F': [1, 2, 3],
           'target': [1, 0, 1]
          }
df = pd.DataFrame(dict_df)

Without this procedure, only a tree-based model would do well.

![](images/whyonehotencoding.png)

For a linear model to work well, there should be a linear relationship between the feature(s) and the target. Then the model can find a **LINEAR DEPENDENCY**.

**Linear relationship** = when plotted on a graph, the points can be traced by a straight line.

With one-hot encoding, the feature "F" is turned into 3 different binary features.

In [9]:
df.corr()

|        | F   | target |
|--------|-----|--------|
| F      | 1.0 | 0.0    |
| target | 0.0 | 1.0    |


In that case, there is no linear correlation, so we **use one-hot-encoding**.

In [10]:
df_onehot = pd.concat([pd.get_dummies(df.F), df.target], axis=1)
df_onehot

|   | 1 | 2 | 3 | target |
|---|---|---|---|--------|
| 0 | 1 | 0 | 0 | 1      |
| 1 | 0 | 1 | 0 | 0      |
| 2 | 0 | 0 | 1 | 1      |


In [11]:
df_onehot.corr()

|        | 1    | 2    | 3    | target |
|--------|------|------|------|--------|
| 1      | 1.0  | -0.5 | -0.5 | 0.5    |
| 2      | -0.5 | 1.0  | -0.5 | -1.0   |
| 3      | -0.5 | -0.5 | 1.0  | 0.5    |
| target | 0.5  | -1.0 | 0.5  | 1.0    |


Now there's more linear correlation between the features and the target than before.
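
The effect on a linear model can be checked directly. The sketch below rebuilds the same toy data and fits scikit-learn's `LinearRegression` on the raw feature and on its one-hot version; the choice of model and the variable names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same toy data as in the example above.
toy = pd.DataFrame({"F": [1, 2, 3], "target": [1, 0, 1]})

# Raw feature: no linear relationship, so the model can only predict the mean.
raw_model = LinearRegression().fit(toy[["F"]], toy["target"])
pred_raw = raw_model.predict(toy[["F"]])

# One-hot features: each category gets its own coefficient.
X_onehot = pd.get_dummies(toy["F"]).to_numpy(dtype=float)
onehot_model = LinearRegression().fit(X_onehot, toy["target"])
pred_onehot = onehot_model.predict(X_onehot)

print(np.round(pred_raw, 3))
print(np.round(pred_onehot, 3))
```

With the raw feature every prediction is the target mean, while the one-hot version reproduces the targets on this tiny training set.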

## FEATURE GENERATION

These are the first, basic types of feature generation for categorical features.

#### a. Feature interaction between categorical features

- This is useful for **non-tree-based models**.
- The model will be able to consider interactions between features.

If the target depends on both sex and pclass, we can make a model consider every possible combination of these features by **concatenating their values as strings** and then applying **one-hot encoding**.

![](images/feature_interaction.png)
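
A minimal sketch of this idea, using a few made-up rows with Titanic-style column names (the rows and the `Sex_Pclass` column name are hypothetical):

```python
import pandas as pd

# Hypothetical rows using Titanic-style column names.
people = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
})

# Concatenate the two categorical values into a single combined category.
people["Sex_Pclass"] = people["Sex"] + "_" + people["Pclass"].astype(str)

# One-hot encode the combined category: one column per (sex, pclass) pair seen in the data.
interaction = pd.get_dummies(people["Sex_Pclass"])
print(interaction.columns.tolist())
```

Only combinations that occur in the data get a column, so the number of new features depends on the data rather than on the full product of categories.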

#### b. Number of occurrences

A new column holding the number of occurrences of each row's category can be added.
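
One possible way to build such a column is a grouped count. This is a sketch on made-up port codes (the `tickets` frame and the `Embarked_count` column are hypothetical names):

```python
import pandas as pd

# Small made-up sample of port codes.
tickets = pd.DataFrame({"Embarked": ["S", "C", "S", "S", "Q", "C"]})

# For every row, count how many rows share the same category.
tickets["Embarked_count"] = (
    tickets.groupby("Embarked")["Embarked"].transform("count")
)
print(tickets["Embarked_count"].tolist())
```

Unlike frequency encoding, this keeps raw counts instead of dividing by the number of rows.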