Categorical variables take only a limited number of values.

Categorical variables fall into two types:
- Ordinal: The values have a natural order, so they can be ranked.
- Nominal: The values have no intrinsic order and cannot be ranked.
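
The difference between the two types can be made explicit with pandas' `Categorical` type. The following is a minimal sketch on made-up values (the `sizes` and `colours` data are illustrative assumptions, not taken from the dataset used later):

```python
import pandas as pd

# Hypothetical toy values, not from melb_data.csv
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],  # declared ranking, lowest to highest
    ordered=True,
)
colours = pd.Categorical(["red", "green", "red", "blue"])  # unordered by default

# Ordinal: the declared order gives meaning to max() and to the integer codes
size_is_ordered = sizes.ordered
largest = sizes.max()
codes = sizes.codes.tolist()  # position of each value in the declared order

# Nominal: no order is defined
colour_is_ordered = colours.ordered

print(size_is_ordered, largest, codes, colour_is_ordered)
```

Because `colours` is unordered, calling `colours.max()` would raise a `TypeError`, which mirrors the statement that nominal values cannot be ranked.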

Three ways to deal with categorical variables in datasets:
- Drop them, especially if they do not contain useful information.
- Ordinal encoding: Assigns a unique integer to each category. Used for ordinal values.
- One-hot encoding: Creates new columns that indicate the presence or absence of each possible value. Used for nominal values.
    - Does not work well if the categorical variable takes on a large number of values; as a rule of thumb, it is generally not used for variables with more than 15 distinct values.
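
To make the last two approaches concrete, here is a minimal sketch using only pandas. The `quality` and `colour` columns and the `quality_order` mapping are assumptions invented for illustration; the notebook cells further down use scikit-learn's encoders instead.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
df = pd.DataFrame({
    "quality": ["low", "high", "medium", "low"],  # ordinal
    "colour": ["red", "green", "red", "blue"],    # nominal
})

# Ordinal encoding: one integer per category, following the ranking
quality_order = {"low": 0, "medium": 1, "high": 2}
df["quality_encoded"] = df["quality"].map(quality_order)

# One-hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(df["colour"], prefix="colour", dtype=int)

print(df[["quality", "quality_encoded"]])
print(one_hot)
```

One-hot encoding adds as many columns as the variable has distinct values (three here), which is why it becomes unwieldy for high-cardinality variables.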



In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("melb_data.csv")

y = data.Price
X = data.drop(['Price'], axis=1)

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=0)

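# For simplicity, drop every column that has missing values in the training data
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Reassign instead of using inplace=True so the frames returned by train_test_split are not modified in place
X_train = X_train.drop(cols_with_missing, axis=1)
X_val = X_val.drop(cols_with_missing, axis=1)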

# Finding suitable categorical variables that have a small number of unique values and are text-based.
low_uniqueness_cols = [col for col in X_train.columns if X_train[col].nunique() < 10 and X_train[col].dtype == "object"]

numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]

final_cols = low_uniqueness_cols + numerical_cols
X_train_final = X_train[final_cols].copy()
X_val_final = X_val[final_cols].copy()

In [3]:
X_train_final.head()

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.7623 | 144.8272 | 4217.0 |


In [4]:
s = (X_train_final.dtypes == 'object')  # Boolean mask approach
object_Cols = list(s[s].index)

# This is the same as:
# object_Cols = list(X_train_final.select_dtypes(include=['object']).columns)

print("Categorical Variables:")
print(object_Cols)

Categorical Variables:
['Type', 'Method', 'Regionname']


In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score(X_train, X_val, y_train, y_val):
    model = RandomForestRegressor(n_estimators=150, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    # scikit-learn's signature is (y_true, y_pred); MAE is symmetric, so the value is unchanged
    return mean_absolute_error(y_val, predictions)

In [6]:
# Using Approach 1:

drop_X_train = X_train_final.select_dtypes(exclude=['object'])
drop_X_val = X_val_final.select_dtypes(exclude=['object'])

print(score(drop_X_train, drop_X_val, y_train, y_val))

175733.2014838971


In [7]:
# Using Approach 2:

from sklearn.preprocessing import OrdinalEncoder

label_X_train = X_train_final.copy()
label_X_val = X_val_final.copy()

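# Fit on the training data only, then reuse the same mapping for the validation data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_Cols] = ordinal_encoder.fit_transform(X_train_final[object_Cols])
label_X_val[object_Cols] = ordinal_encoder.transform(X_val_final[object_Cols])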

print(score(label_X_train, label_X_val, y_train, y_val))

165852.0073975501
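
One detail worth keeping in mind: with its default `categories='auto'`, `OrdinalEncoder` numbers the categories in sorted order of the unique values, which for strings is alphabetical rather than any meaningful ranking. If a column has a real ranking, it can be passed through the `categories` parameter. The sketch below uses a made-up `quality` column (an assumption for illustration, not a column of the Melbourne data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with an obvious ranking: low < medium < high
toy = pd.DataFrame({"quality": ["low", "high", "medium", "low"]})

# Default behaviour: categories are sorted, so high=0, low=1, medium=2
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(toy[["quality"]]).ravel().tolist()

# Explicit ranking: low=0, medium=1, high=2
ranked_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ranked_codes = ranked_encoder.fit_transform(toy[["quality"]]).ravel().tolist()

print(default_codes)
print(ranked_codes)
```

Tree-based models such as the random forest used here can often cope with an arbitrary integer assignment, but an explicit order is likely to matter more for models that treat the integers as magnitudes.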


In [8]:
# Using Approach 3:

from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train_final[object_Cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_val_final[object_Cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train_final.index
OH_cols_valid.index = X_val_final.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train_final.drop(object_Cols, axis=1)
num_X_valid = X_val_final.drop(object_Cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print(score(OH_X_train, OH_X_valid, y_train, y_val))

165584.7179708722


- Some values of a categorical variable may appear only in the validation data.
- In that case, OrdinalEncoder raises an error by default when it transforms the validation set, because it has assigned integers only to the values it saw in the training data. (The one-hot encoder above avoids this because it was created with handle_unknown='ignore'.)
- A simple remedy is to check which columns contain validation values that are missing from the training data and then drop those columns.
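
The check described in the last bullet can be sketched as follows. The `train` and `val` frames are small made-up stand-ins for `X_train_final` and `X_val_final`, and the names `cat_cols`, `safe_cols` and `unsafe_cols` are assumptions introduced for this example. Option B is an alternative that is not used in this notebook: `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'` together with `unknown_value` (available in scikit-learn 0.24 and later), which maps unseen categories to a sentinel instead of raising.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames standing in for X_train_final / X_val_final
train = pd.DataFrame({"Type": ["h", "u", "h"], "Method": ["S", "SP", "S"]})
val = pd.DataFrame({"Type": ["u", "h"], "Method": ["S", "VB"]})  # "VB" never appears in train

cat_cols = ["Type", "Method"]

# Columns whose validation values all appear in the training data
safe_cols = [c for c in cat_cols if set(val[c]).issubset(set(train[c]))]
unsafe_cols = [c for c in cat_cols if c not in safe_cols]

# Option A: drop the problematic columns before ordinal encoding
train_safe = train.drop(columns=unsafe_cols)
val_safe = val.drop(columns=unsafe_cols)

# Option B (alternative): keep every column and map unseen values to -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[cat_cols])
val_encoded = encoder.transform(val[cat_cols])

print("Safe to ordinal-encode:", safe_cols)
print("Dropped under option A:", unsafe_cols)
print(val_encoded)
```

Option A loses whatever information the dropped columns carried, while option B keeps them at the cost of lumping all unseen values into a single code, so which one is preferable depends on the data.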