<a href="https://colab.research.google.com/github/cirilwakounig/MachineLearning/blob/main/3_Dealing_with_Categorical_Variables.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dealing with Categorical Variables

This notebook shows how to deal with categorical variables effectively. More information about methods and strategies can be found here: https://www.kaggle.com/alexisbcook/categorical-variables. The notebook is based on Kaggle's 'Intermediate Machine Learning' course.

Categorical variables can be handled using three different approaches:

1. Dropping the columns, in case they do not contain useful information
2. Label encoding, which maps each category to an integer - useful for data that can be ranked (ordinal data)
3. One-hot encoding, which is used for data that cannot be ranked (nominal data)
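
As a minimal sketch of the three approaches, the cell below applies each of them to a small, hypothetical data frame using pandas only. The column names, values, and the `size_rank` mapping are assumptions made for illustration; they are not part of the Melbourne data set used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: "size" can be ranked (ordinal), "color" cannot (nominal)
df = pd.DataFrame({
    "size": ["small", "large", "medium"],
    "color": ["red", "blue", "red"],
    "price": [10, 30, 20],
})

# 1. Drop the categorical columns entirely
dropped = df.drop(columns=["size", "color"])

# 2. Label encoding: map each category to an integer that reflects its rank
size_rank = {"small": 0, "medium": 1, "large": 2}
encoded_size = df["size"].map(size_rank)

# 3. One-hot encoding: one indicator column per category
one_hot_color = pd.get_dummies(df["color"], prefix="color")

print(list(dropped.columns))        # ['price']
print(encoded_size.tolist())        # [0, 2, 1]
print(list(one_hot_color.columns))  # ['color_blue', 'color_red']
```

The sections below do the same with sklearn's encoders, which can be fitted on the training data and then reused on the validation data.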

In [None]:
# Import the required Libraries
import pandas as pd
import numpy as np

# Data Processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Model Development and Validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

#### 1. Import and Process the Required Data

---



##### 1.1 Import of Data

In [None]:
# Import the data set
file_path_train = '/content/drive/MyDrive/Colab Notebooks/Kaggle Course/Intermediate Machine Learning/melb_data.csv'
# file_path_test = '/content/drive/MyDrive/Colab Notebooks/Kaggle Course/Intermediate Machine Learning/test.csv'

# Read the data
X_full = pd.read_csv(file_path_train)
# X_test_full = pd.read_csv(file_path_test, index_col = 'Id')

# Assign the dependent variable - Remove rows with missing target values
X_full.dropna(axis = 0, subset = ['Price'], inplace = True)   # inplace = True modifies the existing data frame
y = X_full.Price

# Separate the predictors from the target
X = X_full.drop(['Price'], axis = 1)

# Drop Columns with missing values (easier approach)
drop_cols = [col for col in X.columns if X[col].isnull().any()]
X.drop(drop_cols, axis = 1, inplace = True)

# Split the data in train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, 
                                                  train_size = 0.8, test_size = 0.2, random_state = 0)

##### 1.2 Processing Data

Now the data needs to be processed so that it is suitable for the analysis.

In [None]:
# Cardinality refers to the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_card_cols = [cname for cname in X_train.columns if X_train[cname].nunique() < 10 
                                                    and X_train[cname].dtype == 'object']

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_card_cols + numerical_cols
X_train = X_train[my_cols].copy()
X_val = X_val[my_cols].copy()

For the analysis, it is important to know which columns contain categorical data.

In [None]:
# Get list of categorical variables
cat_cols = (X_train.dtypes == 'object')   # Boolean Series that is True for categorical (object) columns
object_cols = list(cat_cols[cat_cols].index)

print('Categorical Variables:')
print(object_cols)

Categorical Variables:
['Type', 'Method', 'Regionname']


#### 2. Set up Regressor used to score Data processing Approaches

---



In [None]:
# We are using a Random Forest Regressor to score the performance

def score_mae(X_train, X_val, y_train, y_val):
  # Develop Model and make Predictions
  model = RandomForestRegressor(n_estimators = 100, random_state = 0)
  model.fit(X_train,y_train)
  preds = model.predict(X_val)

  error = mean_absolute_error(y_val, preds)
  return (error)
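
The mean absolute error used for scoring is the average of the absolute differences between the true and the predicted values, so a lower score is better. The following sketch reproduces the calculation by hand on three hypothetical values (the numbers are made up for illustration):

```python
# Hypothetical true and predicted values
y_true = [100, 200, 300]
y_pred = [110, 180, 330]

# Absolute difference for each prediction, then the average
abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)  # [10, 20, 30]
print(mae)         # 20.0
```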

#### 3. Data Processing Approaches

---



In [None]:
# Variable Containing the Score of each Approach
method_score = []

##### 3.1 Drop Columns with Categorical Values

This approach drops every column that contains categorical values.


In [None]:
# Dropping columns can be done using two methods

## 1. Method: .drop() function of data frames

# Assign object columns to drop_cols variable
drop_cols = object_cols

# Remove the categorical columns
drop_X_train = X_train.drop(drop_cols, axis = 1)
drop_X_val = X_val.drop(drop_cols, axis = 1)

## 2. Method: dropping columns using the select_dtypes() method

# Select only specified datatypes from a dataframe
drop_X_train = X_train.select_dtypes(exclude = ['object'])
drop_X_val = X_val.select_dtypes(exclude = ['object'])

In [None]:
# Predict using the reduced feature set
method_score.append(score_mae(drop_X_train, drop_X_val, y_train, y_val))
print(method_score)

[175703.48185157913]


##### 3.2 Use Label Encoding

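This approach uses sklearn's LabelEncoder class to map each category to an integer, which suits ordinal (rankable) values. Note that LabelEncoder assigns the integers in sorted order of the category values, not according to any meaningful rank.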
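
The sketch below illustrates how the integer codes are assigned, using a hypothetical `levels` column. LabelEncoder sorts the categories, so the codes follow alphabetical order. If a real ranking matters, one option (an alternative, not what this notebook uses) is sklearn's OrdinalEncoder, which accepts an explicit category order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column
levels = np.array(["low", "medium", "high", "low"])

# LabelEncoder sorts the categories: high -> 0, low -> 1, medium -> 2
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(levels)
print(label_encoder.classes_.tolist())  # ['high', 'low', 'medium']
print(label_codes.tolist())             # [1, 2, 0, 1]

# OrdinalEncoder takes one list of categories per column, in the desired order
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal_encoder.fit_transform(levels.reshape(-1, 1))
print(ordinal_codes.ravel().tolist())   # [0.0, 1.0, 2.0, 0.0]
```

For tree-based models such as the random forest used here, the arbitrary order often still works reasonably well, but an explicit order keeps the codes interpretable.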

In [None]:
# Make a copy to avoid changing the original data
label_X_train = X_train.copy()
label_X_val = X_val.copy()

# Define the Label Encoder
label_encoder = LabelEncoder()

# Apply the label encoder to each column with categorical data 
for col in object_cols: 
  label_X_train[col] = label_encoder.fit_transform(X_train[col])
  label_X_val[col] = label_encoder.transform(X_val[col])

In [None]:
# Predict using the label-encoded dataset
method_score.append(score_mae(label_X_train, label_X_val, y_train, y_val))
print(method_score)

[175703.48185157913, 165936.40548390493]


##### 3.3 One-Hot Encoding

Here, the one-hot encoding approach is used: a new column is added for each unique value of a categorical column. For example, a column color with the entries red, blue and yellow results in three new columns red, blue and yellow, each containing a one in the rows where the respective colour was present in the original column and a zero elsewhere. To achieve this, the OneHotEncoder class from sklearn is used.
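
The colour example can be sketched as follows. The two data frames are hypothetical, and the validation frame deliberately contains a value ("green") that is absent from the training frame, to show what `handle_unknown="ignore"` does. The encoder's default sparse output is converted with `.toarray()` here, which is an assumption made to keep the sketch independent of the sklearn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data; "green" never appears in the training frame
train = pd.DataFrame({"color": ["red", "blue", "yellow", "red"]})
val = pd.DataFrame({"color": ["blue", "green"]})

# Fit on the training data only, then reuse the fitted encoder on the validation data
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["color"]]).toarray()
val_encoded = encoder.transform(val[["color"]]).toarray()

print(encoder.categories_[0].tolist())  # ['blue', 'red', 'yellow']
print(train_encoded.tolist())
# [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
print(val_encoded.tolist())
# [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

The unseen value is encoded as a row of zeros instead of raising an error.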

Before applying OH-encoding, one needs to check the number of columns and unique entries within these columns.

In [None]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('Type', 3), ('Method', 5), ('Regionname', 8)]

> As can be seen, the numbers are small. OH-encoding should only be used for columns with a relatively small number of unique values; otherwise it will greatly expand the data set. Columns with a large number of unique values should instead be either dropped or label encoded. In this case, the number of unique values is small enough.
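
To make the size argument concrete, the sketch below counts the columns that one-hot encoding would create from the unique-value counts printed above, and splits the columns by a cardinality cutoff. The cutoff of 10 mirrors the one used in section 1.2; the variable names are assumptions for illustration.

```python
# Unique-value counts printed above for the three categorical columns
cardinality = {"Type": 3, "Method": 5, "Regionname": 8}

# One-hot encoding creates one new column per unique value
n_new_cols = sum(cardinality.values())
print(n_new_cols)  # 16

# Split the columns by a cardinality cutoff (same cutoff as in section 1.2)
max_cardinality = 10
low_cardinality_cols = [col for col, n in cardinality.items() if n < max_cardinality]
high_cardinality_cols = [col for col in cardinality if col not in low_cardinality_cols]

print(low_cardinality_cols)   # ['Type', 'Method', 'Regionname']
print(high_cardinality_cols)  # []
```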





In [None]:
# Make a copy to avoid changing the original data
X_train_plus = X_train.copy()
X_val_plus = X_val.copy()

# Flag missing entries. df[col].isnull() returns True/False for each row of column col in dataframe df
# Note: drop_cols was reassigned to object_cols in section 3.1, so only the categorical columns are flagged here
# These copies are not used by the encoding steps below
for col in drop_cols:
  X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
  X_val_plus[col + '_was_missing'] = X_val_plus[col].isnull()

In [None]:
# Define the Encoder
# Note: scikit-learn >= 1.2 uses sparse_output; older versions use sparse = False instead
OH_encoder = OneHotEncoder(handle_unknown = "ignore", sparse_output = False)

# Encode columns containing categorical values
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_val = pd.DataFrame(OH_encoder.transform(X_val[object_cols]))

# Encoding returns a NumPy array, so the index is lost and needs to be restored
OH_cols_train.index = X_train.index
OH_cols_val.index = X_val.index

# Remove categorical columns (will be replaced with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis = 1)
num_X_val = X_val.drop(object_cols, axis = 1)

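# Combine Numerical and Encoded Categorical Values
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis = 1)
OH_X_val = pd.concat([num_X_val, OH_cols_val], axis = 1)

# The encoded columns have integer names; recent scikit-learn versions require all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_val.columns = OH_X_val.columns.astype(str)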

In [None]:
method_score.append(score_mae(OH_X_train, OH_X_val, y_train, y_val))
print(method_score)

[175703.48185157913, 165936.40548390493, 166089.4893009678]


##### 3.4 Conclusion

As can be seen in this case, dropping the categorical columns performs worst (MAE of about 175,700, compared with about 165,900 for label encoding and about 166,100 for one-hot encoding). In general, one-hot encoding tends to perform best and dropping categorical columns worst; however, this varies from case to case, depending on the quality of the categorical data.

When encoding data, it is important to check whether the categorical values present in the validation/test data also appear in the training data; otherwise an encoder such as LabelEncoder will raise an error on the unseen values (OneHotEncoder with handle_unknown = "ignore", as used above, encodes them as all zeros instead). To prevent this, one should compare the categorical values of the training and validation/test sets, identify the columns containing unseen categories, and remove them. Comparing the values of each column with the set() function is one way to identify such columns.
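
A minimal sketch of such a check with set() is shown below. The two data frames are hypothetical stand-ins for the training and validation sets, and the names `good_cols` and `bad_cols` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical training and validation data; "VB" only appears in the validation data
train = pd.DataFrame({"Type": ["h", "u", "t"], "Method": ["S", "SP", "S"]})
val = pd.DataFrame({"Type": ["h", "t"], "Method": ["S", "VB"]})
categorical_cols = ["Type", "Method"]

# A column is safe to encode if every validation value also occurs in the training data
good_cols = [col for col in categorical_cols
             if set(val[col]).issubset(set(train[col]))]

# The remaining columns contain unseen categories and would be dropped
bad_cols = [col for col in categorical_cols if col not in good_cols]

print(good_cols)  # ['Type']
print(bad_cols)   # ['Method']
```

The columns in `bad_cols` can then be dropped from both sets before the encoder is fitted.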