---
layout: post             
comments: true           
title: 'Methods to encode categorical variables'           
excerpt: 'Feature Engineering: Part 1'            
date: 2019-05-05 04:00:00      
mathjax: false       

---

# References
- https://pbpython.com/categorical-encoding.html
    

# Introduction

Encoding categorical variables is an important step in the data science process. Because there are multiple approaches to encoding variables, it is important to understand the various options and how to implement them on your own data sets. The Python data science ecosystem offers many helpful tools for handling these problems.

## Methods to encode categorical variables

## 1. Convert to number:

Some ML libraries do not take categorical variables as input, so we convert them into numerical variables. The methods below convert a categorical string variable into a numeric one:

### 1.1 Numerical Encoding:
- Numerical encoding is very simple: assign an arbitrary number to each category.
- It transforms non-numerical labels, such as the values of a nominal categorical variable, into numerical labels. The numerical labels are always between 0 and n_classes-1.

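In [None]:
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the training data only, then reuse the same mapping on the test data.
# transform() raises an error if test contains a value that was not seen in train.
number = LabelEncoder()
train['sex'] = number.fit_transform(train['sex'].astype('str'))
test['sex'] = number.transform(test['sex'].astype('str'))
train.head()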

- A common challenge with label-encoding a nominal categorical variable is that it may decrease the performance of a model.
    - For example: we have two features, “age” (range: 0-80) and “city” (81 different levels). When we apply a label encoder to the ‘city’ variable, it represents ‘city’ with numeric values ranging from 0 to 80. The ‘city’ variable now looks just like the ‘age’ variable, so a model may treat the arbitrary city codes as ordered quantities, which is certainly not the right approach.

### 1.2 Convert numeric bins to number: 
- As with numerical encoding, the non-numerical bin labels are transformed into numerical labels. The numerical labels are always between 0 and n_classes-1.
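
A minimal sketch of this idea, assuming the bins arrive as strings such as "0-17" in a hypothetical `age_bin` column (the column name and bin labels are illustrative only). Because bins have a natural order, the sketch uses an ordered pandas categorical with an explicit bin order, rather than `LabelEncoder`, which assigns codes in sorted label order.

```python
import pandas as pd

# Hypothetical data: ages already grouped into string bins
df = pd.DataFrame({"age_bin": ["18-34", "0-17", "60+", "35-59", "0-17"]})

# State the bin order explicitly so the codes follow the numeric order of the bins
bin_order = ["0-17", "18-34", "35-59", "60+"]
df["age_bin_code"] = pd.Categorical(
    df["age_bin"], categories=bin_order, ordered=True
).codes

print(df)
```

Any bin label missing from `bin_order` would receive the code -1, so the list should cover every bin in the data.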
    
### 1.3 Combine Levels
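- To avoid redundant levels in a categorical variable and to deal with rare levels, we can simply combine the different levels.
    - 1.3.1 Using Business Logic: group levels that domain knowledge says belong together.
    - 1.3.2 Using Frequency or Response Rate: merge levels that are rare or that have a similar response rate.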
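
As a rough sketch of combining levels by frequency, the snippet below merges every level that covers less than an assumed 10% of rows into a single "Other" level. The data, the threshold, and the "Other" label are illustrative choices, not a recommendation.

```python
import pandas as pd

# Hypothetical data: two frequent cities and several rare ones
city = pd.Series(["Delhi"] * 5 + ["Mumbai"] * 4 + ["Pune", "Agra", "Kochi"], name="city")

# Share of rows covered by each level
freq = city.value_counts(normalize=True)

# Levels seen in fewer than 10% of rows are merged into one "Other" level
min_share = 0.10  # assumed threshold, tune per data set
rare_levels = freq[freq < min_share].index
city_combined = city.where(~city.isin(rare_levels), "Other")

print(city_combined.value_counts())
```

The same pattern works with a response rate instead of a frequency: compute the mean of the target per level and group levels whose rates are close.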
        
### 1.4 Dummy Coding
- Dummy coding is a commonly used method for converting a categorical input variable into numeric indicator variables. Presence of a level is represented by 1 and absence is represented by 0. For every level present, one dummy variable is created.
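
A small sketch of dummy coding with pandas `get_dummies`, using a made-up `color` column. The `dtype=int` argument is only there so the output shows 0/1 rather than True/False.

```python
import pandas as pd

# Hypothetical data with a single categorical column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One 0/1 column per level; the prefix labels the new columns
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(dummies)
```

Each row has exactly one 1, placed in the column of the level that row belongs to.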

### 1.5 One Hot Encoding:
- Label encoding has the advantage that it is straightforward, but it has the disadvantage that the numeric values can be misinterpreted by the algorithms.
- A common alternative approach is called one hot encoding (it also goes by several other names). Despite the different names, the basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly but does have the downside of adding more columns to the data set.
- The pandas `get_dummies` function is powerful because you can pass as many category columns as you would like and choose how to label the columns using `prefix`. Proper naming will make the rest of the analysis just a little bit easier.
- One hot encoding is very useful, but it can cause the number of columns to expand greatly if a column has many unique values.
- In addition to thinking about what One-Hot Encoding does, you will notice something very quickly:
    - You have as many columns as you have cardinalities (distinct values) in the categorical variable.
    - You have a bunch of zeroes and only a few 1s (exactly one 1 per row).
        
Therefore, you have to choose between two representations of One-Hot Encoding:

- Dense Representation: 0s are stored in memory, which balloons RAM usage a LOT if you have many cardinalities. But at least support for this representation is close to universal.
- Sparse Representation: 0s are not stored in memory, which makes RAM usage a LOT more efficient even if you have millions of cardinalities. However, support for sparse matrices in machine learning libraries is not widespread (libraries such as xgboost and LightGBM are among those that accept them).
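
The trade-off between the two representations can be seen in a small sketch. It assumes scikit-learn's `OneHotEncoder` with default settings, which returns a SciPy sparse matrix; the `city` column and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with 4 distinct levels
df = pd.DataFrame({"city": ["A", "B", "C", "D", "A", "B"]})

# With default settings, OneHotEncoder returns a SciPy sparse matrix
encoder = OneHotEncoder()
sparse_matrix = encoder.fit_transform(df[["city"]])

# The dense form stores every cell, zeros included
dense_matrix = sparse_matrix.toarray()

print("shape:", dense_matrix.shape)  # rows x number of levels
print("cells stored (dense):", dense_matrix.size)
print("non-zeros stored (sparse):", sparse_matrix.nnz)
```

With one 1 per row, the sparse form stores one value per row no matter how many levels exist, while the dense form grows with rows times levels.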

### 1.6 Custom Binary Encoding
- Depending on the data set, you may be able to use some combination of label encoding and one hot encoding to create a binary column that meets your needs for further analysis.
- This approach can be really useful if there is an option to consolidate to a simple Y/N value in a column. This also highlights how important domain knowledge is to solving the problem in the most efficient manner possible.
- Binary encoding relies on powers of two: **N** cardinalities can be stored using ceil(log(N+1)/log(2)) features.
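
The formula in the last bullet can be sketched as follows. This is one possible reading, not a reference implementation: it assumes integer codes start at 1 so that the all-zero bit pattern stays unused, which accounts for the N+1 in the formula. The column values and the `body_style_bit*` column names are hypothetical.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical column with N = 5 distinct levels
s = pd.Series(
    ["sedan", "wagon", "hatchback", "convertible", "hardtop", "sedan"],
    name="body_style",
)

# Step 1: label-encode each level as an integer 1..N
# (assumption: codes start at 1, leaving the all-zero bit pattern unused)
codes = pd.factorize(s, sort=True)[0] + 1
n_levels = int(codes.max())

# Step 2: number of binary features from ceil(log(N+1)/log(2))
n_bits = math.ceil(math.log(n_levels + 1) / math.log(2))

# Step 3: write each code in base 2, one column per bit (most significant bit first)
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1
binary_df = pd.DataFrame(bits, columns=[f"body_style_bit{i}" for i in range(n_bits)])

print(n_levels, "levels ->", n_bits, "binary features")
print(binary_df)
```

Here 5 levels fit into 3 columns, where one hot encoding would need 5. The price is that the individual bit columns no longer correspond to a single level.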
    

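In [None]:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# One 0/1 column per distinct value of body_style, named after the fitted classes
lb_style = LabelBinarizer()
lb_results = lb_style.fit_transform(obj_df["body_style"])
pd.DataFrame(lb_results, columns=lb_style.classes_).head()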

### 1.7 Backward Difference Encoding
    - 

### 1.8 Polynomial Encoding
    - 

## How to deal with high cardinality
- Run chi-squared tests or odds-ratios for the categorical variable and dependent variable to reduce the number of categories.
- Use algorithms like Random Forests or Lasso to get feature importances on all the one-hot encoded columns and only keep those that have a certain level of feature importance.
- Perform Mean Encoding based on the training set: replace each category with the mean value of the dependent variable for that category. For example, if 30% of people with brown hair are 1's, that column becomes .3 for anyone with brown hair. The drawback is that you generally lose interpretability, if that is important: a feature importance of .30 for a column that directly indicates, say, a poor neighborhood does not mean the same thing as a feature importance of .30 for a column that holds each category's expected value.
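
A minimal sketch of mean encoding, built around the brown-hair example above. The data is made up, and falling back to the global mean for categories unseen in training is an assumption of this sketch rather than part of the method as described.

```python
import pandas as pd

# Hypothetical training and test sets with a binary target
train = pd.DataFrame({
    "hair": ["brown"] * 10 + ["black"] * 5,
    "target": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0],
})
test = pd.DataFrame({"hair": ["brown", "black", "red"]})

# Mean of the target per category, learned from the training set only
category_means = train.groupby("hair")["target"].mean()
global_mean = train["target"].mean()

# Replace each category by its training mean
train["hair_mean_enc"] = train["hair"].map(category_means)

# Assumption: categories unseen in training fall back to the global mean
test["hair_mean_enc"] = test["hair"].map(category_means).fillna(global_mean)

print(test)
```

Computing the means on the training set only keeps information about the test targets out of the feature.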