Article: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
https://medium.com/analytics-vidhya/how-to-handle-categorical-features-ab65c3cf498e


Categorical Data and Its Types

A categorical or discrete variable is one that has two or more categories (values). There are two different types of categorical variables:

Nominal: A nominal variable has no intrinsic ordering to its categories. For example, gender is a categorical variable having two categories (Male and Female) with no inherent ordering between them. Another example is Country (India, Australia, America, and so forth).

Ordinal: An ordinal variable has a clear ordering within its categories. For example, consider temperature as a variable with three distinct (but related) categories (low, medium, high). Another example is an education degree (Ph.D., Master’s, or Bachelor’s).

Different Approaches to Handle Categorical Data

· One Hot Encoding

· One Hot Encoding with multiple categories

· Ordinal Number Encoding

· Count or Frequency Encoding

· Target guided Ordinal Encoding

· Mean Ordinal Encoding

· Probability Ratio Encoding

1. One Hot Encoding

This technique is applied to nominal categorical features.

In one-hot encoding, each category value becomes a new column, which holds 1 for rows that belong to that category and 0 for all other rows.

This can be done with the pandas get_dummies() function. We then drop the first column to avoid the dummy variable trap, since that column is fully determined by the remaining ones.
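A minimal sketch of this step, assuming a toy DataFrame with a single "country" column (the data and column name are made up for illustration):

```python
import pandas as pd

# Toy data; the column name "country" is only an illustrative assumption.
df = pd.DataFrame({"country": ["India", "Australia", "America", "India"]})

# drop_first=True removes the column of the first category
# (alphabetically "America"), so a row of all zeros stands for it.
encoded = pd.get_dummies(df, columns=["country"], drop_first=True, dtype=int)
print(encoded)
```

With three categories, only two columns remain (country_Australia and country_India); "America" is the implicit baseline.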



Advantages :

· Simple to use and fits well for data with few categories.

Disadvantages:

· High-cardinality features (features with a large number of unique values, e.g. pincode) greatly increase the feature space, resulting in the curse of dimensionality.

2. One Hot Encoding with Multiple Categories

This technique is picked up from an ensemble selection solution to the KDD Cup Orange competition. The authors made a slight modification to one-hot encoding: instead of creating a new column for every category, they create new columns only for the 10 most frequent categories. Sounds like jargon? It did to me too 😊 In short, only the most common categories get their own column, and every other category is left as all zeros.
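One way this could be sketched in pandas is shown below. The helper name one_hot_top_n and the toy "city" column are hypothetical, and n is set to 2 only to keep the example small (the text above uses 10):

```python
import pandas as pd

def one_hot_top_n(df, column, n=10):
    """One-hot encode only the n most frequent categories of `column`."""
    # value_counts() sorts categories from most to least frequent.
    top = df[column].value_counts().head(n).index
    out = df.copy()
    for category in top:
        out[f"{column}_{category}"] = (df[column] == category).astype(int)
    # The original column is no longer needed once it is encoded.
    return out.drop(columns=[column])

# Toy data: "A" appears 3 times, "B" twice, "C" once.
df = pd.DataFrame({"city": ["A", "A", "A", "B", "B", "C"]})
encoded = one_hot_top_n(df, "city", n=2)
print(encoded)
```

Here "C" is outside the top 2, so its row is 0 in every new column.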


Advantages:

· Easy to implement

· Does not expand massively the feature space

Disadvantages :

· Does not keep any information about the categories outside the most frequent ones; they all end up with the same all-zero encoding.

3. Ordinal Number Encoding

As the name implies, this technique is used for ordinal categorical features.

In this technique, each unique category value is given an integer value that follows the order of the categories. For instance, “low” equals 1, “medium” equals 2 and “high” equals 3.

Domain knowledge can be used to determine the order of the integer values. For example, most people love Saturdays and Sundays and hate Mondays. In this scenario the mapping for the days of the week is ‘Monday’ is 1, ‘Tuesday’ is 2, ‘Wednesday’ is 3, ‘Thursday’ is 4, ‘Friday’ is 5, ‘Saturday’ is 6, ‘Sunday’ is 7.
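The day-of-week mapping above can be sketched with a plain dictionary and pandas map(). The DataFrame and the column names "day" and "day_encoded" are assumptions for illustration:

```python
import pandas as pd

# Explicit order taken from the mapping described above.
day_order = {
    "Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
    "Friday": 5, "Saturday": 6, "Sunday": 7,
}

# Toy data; the column name "day" is only illustrative.
df = pd.DataFrame({"day": ["Sunday", "Monday", "Friday"]})

# map() looks each value up in the dictionary; unknown values become NaN.
df["day_encoded"] = df["day"].map(day_order)
print(df)
```

Writing the mapping by hand keeps the order under your control instead of relying on alphabetical or first-seen order.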


Advantages :

· Easy and straightforward to implement

· Widely used in survey and research data encoding.

Disadvantages:

· The integers do not form a standardized interval scale: the gap between 1 and 2 need not mean the same as the gap between 2 and 3.

4. Count or Frequency Encoding

As the name implies, in this technique we replace each category with the number of observations in the dataset that belong to that category.

For example, if India appears 56 times in the country column and America appears 49 times, we replace India with 56 and America with 49 in the country column.
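A small sketch of count encoding, assuming a toy "country" column with much smaller counts than in the example above:

```python
import pandas as pd

# Toy data: India x3, America x2, Australia x1 (made-up counts).
df = pd.DataFrame(
    {"country": ["India"] * 3 + ["America"] * 2 + ["Australia"]}
)

# Number of rows per category, indexed by category name.
counts = df["country"].value_counts()

# Replace each category with its count.
df["country_count"] = df["country"].map(counts)
print(df)
```

For frequency encoding, value_counts(normalize=True) gives proportions instead of raw counts.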


Advantages:

· Easy to implement

· There will be no increase in feature space.

· Works well with tree-based algorithms.

Disadvantages:

· Categories with the same frequency receive the same value, so the model cannot tell them apart.

5. Target guided Ordinal Encoding

In this technique, we will transform our categorical variable by comparing it to the target or output variable.

Steps:

1) Choose a categorical variable.

2) Group the data by the categorical variable and calculate the mean of the target variable for each category.

3) Rank the categories by that mean and assign integers in rank order, so the category with the highest mean gets the highest integer.
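The steps above can be sketched as follows, assuming a toy DataFrame with a categorical "city" column and a binary "target" column (both names are hypothetical):

```python
import pandas as pd

# Toy data; column names are assumptions for illustration.
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "C", "C"],
    "target": [1,   1,   0,   1,   0,   0],
})

# Step 2: mean of the target per category (A=1.0, B=0.5, C=0.0).
means = df.groupby("city")["target"].mean()

# Step 3: sort categories by mean and number them from 1 upwards,
# so the category with the highest mean gets the highest integer.
ordered = means.sort_values().index
mapping = {category: rank for rank, category in enumerate(ordered, start=1)}

df["city_encoded"] = df["city"].map(mapping)
print(mapping)
```

In practice the mapping would normally be learned on the training data only and then applied to the test data, which limits leakage from the target.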


Advantages:

· Establishes a monotonic relationship between the variable and the target.

· Can help the model learn faster

Disadvantages:

· Because of the close relationship to the target variable, it often leads to overfitting.

6. Mean Ordinal Encoding

It’s a slight variant of target-guided ordinal encoding and is very popular among data scientists. Instead of assigning integer values, we replace each category with the mean of the target variable for that category.
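A minimal sketch of mean encoding, assuming a toy DataFrame with hypothetical "city" and "target" columns:

```python
import pandas as pd

# Toy data; column names are assumptions for illustration.
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "C", "C"],
    "target": [1,   1,   0,   1,   0,   0],
})

# Mean of the target for each category.
means = df.groupby("city")["target"].mean()

# Replace each category with its target mean rather than a rank.
df["city_mean"] = df["city"].map(means)
print(df)
```

As with the ordinal variant, computing the means on the training data only is a common way to reduce overfitting.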


Advantages:

· Improves classification model efficiency.

· Fast acquisition of information

Disadvantages:

· Leads to overfitting

· Two categories with the same mean receive the same value, so the distinction between them is lost


7. Probability Ratio Encoding

This technique is suitable only for classification problems where the target variable is binary (1 or 0, True or False).

In this technique, we will substitute the category value with the probability ratio i.e. P(1)/P(0).

Steps :

1) For each category, calculate the probability that the target variable is True (1).

2) For each category, calculate the probability that the target variable is False (0).

3) Calculate the probability ratio i.e. P(True or 1) / P(False or 0).

4) Replace each category with its probability ratio.
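The steps above can be sketched as follows, assuming a toy DataFrame with a hypothetical "city" column and a 0/1 "target" column. The data is chosen so that every category has at least one 0 outcome; a category without any would give a zero denominator:

```python
import pandas as pd

# Toy data; column names are assumptions for illustration.
df = pd.DataFrame({
    "city":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "target": [1,   1,   1,   0,   1,   0,   0,   0],
})

# Step 1: P(target = 1) per category is the mean of a 0/1 target.
p1 = df.groupby("city")["target"].mean()

# Step 2: P(target = 0) per category.
p0 = 1 - p1

# Step 3: probability ratio P(1) / P(0).
ratio = p1 / p0

# Step 4: replace each category with its ratio.
df["city_ratio"] = df["city"].map(ratio)
print(ratio)
```

Here "A" has P(1) = 0.75 and P(0) = 0.25, so its ratio is 3.0, while "B" has the reverse and gets about 0.33.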



Advantages:

· Does not expand the feature space.

· Captures information from within the category, resulting in more predictive features.

Disadvantages:

· Not defined when the denominator is 0, i.e. when a category has no observations with a False (0) target.

· It sometimes results in overfitting.