# **Introduction**<br>
A **categorical variable** takes only a limited number of values.

Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories.
If people responded to a survey about what brand of car they owned, the responses would fall into categories like "Honda", "Toyota", and "Ford". In this case, the data is also categorical.
You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first. In this tutorial, we'll compare three approaches that you can use to prepare your categorical data.


# **Three Approaches**<br>
1) **Drop Categorical Variables**<br>
   The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach only works well if the columns do not contain useful information.
2) **Ordinal Encoding**<br>
   **Ordinal encoding** assigns each unique value to a different integer.
   ![Ordinal Technique](https://storage.googleapis.com/kaggle-media/learn/images/tEogUAr.png)
   This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).
   
   This assumption makes sense in this example, because there is an indisputable ranking to the categories. Not all categorical variables have a clear ordering in the values, but we refer to those that do as ordinal variables. For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal variables.
3) **One-Hot Encoding**<br>
   **One-hot encoding** creates new columns indicating the presence (or absence) of each possible value in the original data. To understand this, we'll work through an example.
   ![One-Hot Encoding Technique](https://storage.googleapis.com/kaggle-media/learn/images/TW5m0aJ.png)
   In the original dataset, "Color" is a categorical variable with three categories: "Red", "Yellow", and "Green". The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the original dataset. Wherever the original value was "Red", we put a 1 in the "Red" column; if the original value was "Yellow", we put a 1 in the "Yellow" column, and so on.
   
   In contrast to ordinal encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is neither more nor less than "Yellow"). We refer to categorical variables without an intrinsic ranking as nominal variables.
   
   One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values).
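
As a rough illustration of approaches 2 and 3, the sketch below encodes a small made-up table with plain pandas. The `survey` data, the `breakfast_order` mapping, and the use of `pd.get_dummies` are assumptions chosen for this illustration; the worked example later in this tutorial uses scikit-learn encoders instead.

```python
import pandas as pd

# Toy data (hypothetical), mirroring the survey and color examples above
survey = pd.DataFrame({
    "Breakfast": ["Never", "Rarely", "Most days", "Every day", "Rarely"],
    "Color": ["Red", "Yellow", "Green", "Red", "Green"],
})

# Ordinal encoding: map each category to an integer that respects its rank
breakfast_order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
survey["Breakfast_ordinal"] = survey["Breakfast"].map(breakfast_order)

# One-hot encoding: one indicator column per distinct color
color_one_hot = pd.get_dummies(survey["Color"], dtype=int)

print(survey[["Breakfast", "Breakfast_ordinal"]])
print(color_one_hot)
```

Each row of `color_one_hot` contains exactly one 1, in the column matching that row's original color.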

**Example**<br>

In [1]:
import pandas as pd

df = pd.read_csv("../resources/datasets/melb_data.csv")

from sklearn.model_selection import train_test_split

y = df.Price
X = df.drop(columns="Price")

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0, train_size=0.8, test_size=0.2)

X_train.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
12167,St Kilda,11/22 Charnwood Cr,1,u,S,hockingstuart,29/07/2017,5.0,3182.0,1.0,1.0,1.0,0.0,,1940.0,Port Phillip,-37.85984,144.9867,Southern Metropolitan,13240.0
6524,Williamstown,18 James St,2,h,SA,Hunter,17/09/2016,8.0,3016.0,2.0,2.0,1.0,193.0,,,Hobsons Bay,-37.858,144.9005,Western Metropolitan,6380.0
8413,Sunshine,10 Dundalk St,3,h,S,Barry,8/04/2017,12.6,3020.0,3.0,1.0,1.0,555.0,,,Brimbank,-37.7988,144.822,Western Metropolitan,3755.0
2919,Glenroy,1/2 Prospect St,3,u,SP,Brad,18/06/2016,13.0,3046.0,3.0,1.0,1.0,265.0,,1995.0,Moreland,-37.7083,144.9158,Northern Metropolitan,8870.0
6043,Sunshine North,35 Furlong Rd,3,h,S,First,22/05/2016,13.3,3020.0,3.0,1.0,2.0,673.0,673.0,1970.0,Brimbank,-37.7623,144.8272,Western Metropolitan,4217.0


In [2]:
# Drop columns with missing values (simplest approach)
missing_val = [col for col in X_train.columns if X_train[col].isnull().any()]
X_train = X_train.drop(columns=missing_val)
X_val = X_val.drop(columns=missing_val)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train.columns if X_train[cname].nunique() < 10 and X_train[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train_filter = X_train[my_cols].copy()
X_valid_filter = X_val[my_cols].copy()

X_train_filter.head()

Unnamed: 0,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
12167,u,S,Southern Metropolitan,1,5.0,3182.0,1.0,1.0,0.0,-37.85984,144.9867,13240.0
6524,h,SA,Western Metropolitan,2,8.0,3016.0,2.0,2.0,193.0,-37.858,144.9005,6380.0
8413,h,S,Western Metropolitan,3,12.6,3020.0,3.0,1.0,555.0,-37.7988,144.822,3755.0
2919,u,SP,Northern Metropolitan,3,13.0,3046.0,3.0,1.0,265.0,-37.7083,144.9158,8870.0
6043,h,S,Western Metropolitan,3,13.3,3020.0,3.0,1.0,673.0,-37.7623,144.8272,4217.0


Next, we obtain a list of all of the categorical variables in the training data.

We do this by checking the data type (or dtype) of each column. The object dtype indicates a column has text (there are other things it could theoretically be, but that's unimportant for our purposes). For this dataset, the columns with text indicate categorical variables.

In [3]:
# Get list of categorical columns
ccname = (X_train_filter.dtypes == "object")
object_cols = list(ccname[ccname].index)

print("Categorical columns list: ")
print(object_cols)

Categorical columns list: 
['Type', 'Method', 'Regionname']


**Define Function to Measure Quality of Each Approach**<br>
We define a function score_dataset() to compare the three different approaches to dealing with categorical variables. This function reports the mean absolute error (MAE) from a random forest model. In general, we want the MAE to be as low as possible!

In [4]:
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100,random_state=42)
    model.fit(X=X_train, y=y_train)
    pred_val = model.predict(X_valid)
    return mean_absolute_error(y_valid, pred_val)
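
To make the metric concrete, here is a minimal sketch of what MAE computes, using made-up numbers rather than the housing data: the average of the absolute differences between true and predicted values. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true prices and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Both values agree, which is why a lower MAE means predictions that are closer, on average, to the true prices.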

**Score from Approach 1 (Drop Categorical Variables)**<br>
We drop the object columns with the select_dtypes() method.

In [5]:
X_train_drop = X_train_filter.select_dtypes(exclude="object")
X_valid_drop = X_valid_filter.select_dtypes(exclude="object")

print("MAE from Approach 1 (Drop Categorical Variable): ")
print(score_dataset(X_train_drop, X_valid_drop, y_train, y_val))

MAE from Approach 1 (Drop Categorical Variable): 
175730.74184705777


**Score from Approach 2 (Ordinal Encoding)**<br>
Scikit-learn has an [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) class that can be used to get ordinal encodings. We pass all of the categorical columns to the encoder at once, and it encodes each column separately.

In [6]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_X_train = X_train_filter.copy()
ordinal_X_valid = X_valid_filter.copy()

ordinal_encoder = OrdinalEncoder()
ordinal_X_train[object_cols] = ordinal_encoder.fit_transform(X_train_filter[object_cols])
ordinal_X_valid[object_cols] = ordinal_encoder.transform(X_valid_filter[object_cols])

print("MAE from Approach 2 (Ordinal Encoding): ")
print(score_dataset(ordinal_X_train, ordinal_X_valid, y_train, y_val))

MAE from Approach 2 (Ordinal Encoding): 
167160.15454134232


In the code cell above, for each column, we assign each unique value to a different integer in an arbitrary order (by default, OrdinalEncoder numbers the categories in sorted order, which need not reflect any real ranking). This is a common approach that is simpler than providing custom labels; however, we can expect an additional boost in performance if we provide better-informed labels for all ordinal variables.
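
One way to supply better-informed labels is the `categories` parameter of `OrdinalEncoder`. The sketch below is only an illustration on a hypothetical `responses` table that reuses the breakfast survey from the introduction; it is not applied to the housing data in this tutorial.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey column, reusing the breakfast example
responses = pd.DataFrame({"Breakfast": ["Rarely", "Every day", "Never", "Most days"]})

# Default behavior: integers follow the sorted (alphabetical) category order
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(responses)

# Informed behavior: pass the ranking explicitly, one list per encoded column
ranked_encoder = OrdinalEncoder(
    categories=[["Never", "Rarely", "Most days", "Every day"]]
)
ranked_codes = ranked_encoder.fit_transform(responses)

print(default_encoder.categories_[0])
print(default_codes.ravel())  # alphabetical codes
print(ranked_codes.ravel())   # codes that respect the real ranking
```

With the explicit ranking, "Never" maps to 0 and "Every day" to 3, whereas the default encoding gives "Every day" the lowest code simply because it sorts first.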



**Score from Approach 3 (One-Hot Encoding)**<br>
We use the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) class from scikit-learn to get one-hot encodings. There are a number of parameters that can be used to customize its behavior.

- We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data, and
- setting sparse_output=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix). In scikit-learn versions before 1.2, this parameter is named sparse.

To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance, to encode the training data, we supply X_train_filter[object_cols]. (object_cols in the code cell below is a list of the column names with categorical data, and so X_train_filter[object_cols] contains all of the categorical data in the training set.)
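
The effect of handle_unknown='ignore' can be seen on a tiny example. This is a sketch with made-up `train` and `valid` frames, not the housing data; it leaves the output sparse and calls `.toarray()` so that it does not depend on which name the dense-output parameter has in your scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Purple" appears only in the validation set
train = pd.DataFrame({"Color": ["Red", "Yellow", "Green"]})
valid = pd.DataFrame({"Color": ["Red", "Purple"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; convert to a dense array for display
encoded_valid = encoder.transform(valid).toarray()

print(encoder.categories_[0])
print(encoded_valid)
```

The unseen value "Purple" is encoded as a row of all zeros instead of raising an error. Without handle_unknown='ignore', the default setting would raise an error at transform time.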

In [7]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_X_train = pd.DataFrame(OH_encoder.fit_transform(X_train_filter[object_cols]))
OH_X_valid = pd.DataFrame(OH_encoder.transform(X_valid_filter[object_cols]))

# One-hot encoding removed the index; put it back
OH_X_train.index = X_train_filter.index
OH_X_valid.index = X_valid_filter.index

# remove categorical columns (will replace with OHE)
num_X_train = X_train_filter.drop(object_cols, axis=1)
num_X_valid = X_valid_filter.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_X_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_X_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype('str')
OH_X_valid.columns = OH_X_valid.columns.astype('str')

print("MAE from One-Hot Encoding: ")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_val))

MAE from One-Hot Encoding: 
165583.65984886736


**Which approach is best?**<br>
In this case, dropping the categorical columns (**Approach 1**) performed worst, since it had the highest MAE score. As for the other two approaches, their MAE scores are so close that neither shows a meaningful benefit over the other.

In general, one-hot encoding (**Approach 3**) typically performs best, and dropping the categorical columns (**Approach 1**) typically performs worst, but results vary on a case-by-case basis.



**Conclusion**<br>
The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!