1. Introduction [here](#1-d)<br>
2. Missing values [here](#2)<br>
3. Categorical Variables [here](#3)<br>
4. Pipelines [here](#4)<br>
5. Cross-validation [here](#5)<br>
6. XGBoost [here](#6)<br>
7. Data Leakage [here](#7)<br>

<a id ="5"></a>
# Categorical Variables 

There are three ways to handle them:
1. <i>[Ordinal Encoding](#5.1)</i>: assigns each unique value a different integer, chosen to reflect a clear ranking of the categories. With tree-based models (like decision trees and random forests), ordinal encoding works well for ordinal variables. `from sklearn.preprocessing import OrdinalEncoder`
2. *[One-Hot Encoding](#5.2)*: creates new columns indicating the presence (or absence) of each possible value in the original data; used for nominal variables (no ranking). `from sklearn.preprocessing import OneHotEncoder`
3. [Drop the columns](#5.3)
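
As a rough illustration of how the first two approaches differ, here is a minimal sketch on a made-up two-column DataFrame. The column names and values are hypothetical and are not taken from the housing data; the explicit `categories` ordering is an assumption about how the ranking would be supplied.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Hypothetical toy data, not taken from the housing dataset
toy = pd.DataFrame({
    "quality": ["Fair", "Good", "Excellent", "Good"],  # ordinal: has a natural ranking
    "color": ["red", "green", "blue", "green"],        # nominal: no ranking
})

# Ordinal encoding: list the categories in ranked order so the integers respect it
ordinal = OrdinalEncoder(categories=[["Fair", "Good", "Excellent"]])
quality_codes = ordinal.fit_transform(toy[["quality"]])

# One-hot encoding: one indicator column per distinct value
one_hot = OneHotEncoder(handle_unknown="ignore")
color_indicators = one_hot.fit_transform(toy[["color"]]).toarray()

print(quality_codes.ravel())   # one integer code per row
print(one_hot.categories_[0])  # the values that became indicator columns
print(color_indicators)        # one row of 0/1 flags per original row
```

Without the `categories` argument, `OrdinalEncoder` assigns integers in sorted order of the values, which may not match the intended ranking.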

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv(r'C:\Users\core i5\Documents\GitHub\DataScience\datascience\Kaggle\data\train.csv', index_col='Id') 
X_test = pd.read_csv(r'C:\Users\core i5\Documents\GitHub\DataScience\datascience\Kaggle\data\test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [2]:
X_train.head()

| Id | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 619 | 20 | RL | 11694 | Pave | Reg | Lvl | AllPub | Inside | Gtl | NridgHt | ... | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 | New | Partial |
| 871 | 20 | RL | 6600 | Pave | Reg | Lvl | AllPub | Inside | Gtl | NAmes | ... | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 | WD | Normal |
| 93 | 30 | RL | 13360 | Pave | IR1 | HLS | AllPub | Inside | Gtl | Crawfor | ... | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 | WD | Normal |
| 818 | 20 | RL | 13265 | Pave | IR1 | Lvl | AllPub | CulDSac | Gtl | Mitchel | ... | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 | WD | Normal |
| 303 | 20 | RL | 13704 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | CollgCr | ... | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 | WD | Normal |


In [3]:
# For regressors, we use the mean absolute error (MAE) as the metric. The function score_dataset reports the MAE of a RandomForestRegressor.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
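
For reference, the value returned by `score_dataset` is the mean absolute error: the average of the absolute differences between predictions and true values. The sketch below checks this by hand on three made-up prices; the numbers are hypothetical and are not results from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted sale prices
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# MAE by hand: mean of |true - predicted|
manual_mae = np.mean(np.abs(y_true - y_pred))

# Same quantity from scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because MAE is in the same units as the target, a lower value here means predictions are closer to the sale price in dollars on average.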

### 1. Drop columns with categorical variables (object data type)
<a id="5.3"></a>

In [7]:
cat_cols = [col for col in X_train.columns if X_train[col].dtype == object]
# Alternatively: cat_cols = X_train.select_dtypes("object").columns.tolist()

drop_X_train = X_train.drop(cat_cols, axis=1)
drop_X_valid = X_valid.drop(cat_cols, axis=1)

In [8]:
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
17837.82570776256


### 2. Ordinal Encoding
<a id = "5.1"></a>
By default, an ordinal encoder (or one-hot encoder) fitted on the training data cannot transform the validation data if a categorical column in the validation set contains values that never appear in the training set. If you run into this, store the columns that *can be safely encoded* in one list and the problematic columns in another, then drop the problematic ones.
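
The failure mode, and one alternative to dropping the affected columns, can be sketched on made-up data. This is not the approach used in this notebook: it assumes scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown="use_encoded_value"`, and the sentinel `-1` is an arbitrary choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "Metal" appears only in the validation split
train = pd.DataFrame({"RoofMatl": ["CompShg", "Tar&Grv", "CompShg"]})
valid = pd.DataFrame({"RoofMatl": ["CompShg", "Metal"]})

# Default behaviour: transform raises ValueError on the unseen category
strict = OrdinalEncoder().fit(train)
try:
    strict.transform(valid)
    raised = False
except ValueError:
    raised = True

# Alternative (assumes scikit-learn >= 0.24): map unseen categories to a sentinel
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tolerant.fit(train)
valid_codes = tolerant.transform(valid)

print(raised)               # whether the strict encoder failed
print(valid_codes.ravel())  # unseen category encoded as the sentinel
```

Whether a sentinel value is appropriate depends on the model; dropping the columns, as done in the next cell, is the simpler option.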

In [9]:
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if 
                   set(X_valid[col]).issubset(set(X_train[col]))]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

Categorical columns that will be ordinal encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'BldgType', 'HouseStyle', 'RoofStyle', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Condition2', 'Functional', 'RoofMatl']


In [10]:
from sklearn.preprocessing import OrdinalEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply the ordinal encoder: fit on the training data, then reuse the same mapping for validation
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(label_X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(label_X_valid[good_label_cols])

In [12]:
print("MAE from Approach 2 (Ordinal Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2 (Ordinal Encoding):
17098.01649543379


### 3. One-Hot Encoding
<a id = "5.2"></a>

In [13]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('Street', 2),
 ('Utilities', 2),
 ('CentralAir', 2),
 ('LandSlope', 3),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('ExterQual', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('ExterCond', 5),
 ('HeatingQC', 5),
 ('Condition2', 6),
 ('RoofStyle', 6),
 ('Foundation', 6),
 ('Heating', 6),
 ('Functional', 6),
 ('SaleCondition', 6),
 ('RoofMatl', 7),
 ('HouseStyle', 8),
 ('Condition1', 9),
 ('SaleType', 9),
 ('Exterior1st', 15),
 ('Exterior2nd', 16),
 ('Neighborhood', 25)]

We typically one-hot encode only columns with relatively low cardinality (few unique values), because encoding a high-cardinality column greatly expands the size of the dataset. High-cardinality columns can either be dropped or ordinal encoded instead. For this exercise, we will one-hot encode only columns with a cardinality of less than 10.
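
To see why cardinality matters, note that one-hot encoding replaces each column with one indicator column per unique value. A minimal sketch on made-up data (the values and the threshold of 3 are hypothetical, chosen only to suit the toy example):

```python
import pandas as pd

# Hypothetical toy data, not the housing dataset
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Neighborhood": ["NAmes", "CollgCr", "Mitchel", "Crawfor"],
})

# Number of unique values per column
cardinality = toy.nunique()

# Keep only low-cardinality columns (threshold chosen for the toy data)
low_card = [col for col in toy.columns if cardinality[col] < 3]

# Indicator columns created if every column is encoded vs. only the low-cardinality ones
cols_if_all_encoded = int(cardinality.sum())
cols_if_low_card_only = int(cardinality[low_card].sum())

print(cols_if_all_encoded, cols_if_low_card_only)
```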

In [57]:
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Uncomment the line below to also exclude the columns whose validation set contains values not seen in the training set.
# low_cardinality_cols = list(set(low_cardinality_cols) - set(bad_label_cols))

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Exterior2nd', 'Exterior1st', 'Neighborhood']


In [58]:
from sklearn.preprocessing import OneHotEncoder

# Create data frames without the high-cardinality columns
X_train_adjusted = X_train.drop(high_cardinality_cols, axis=1)
X_valid_adjusted = X_valid.drop(high_cardinality_cols, axis=1)

# Create the one-hot encoder; handle_unknown='ignore' encodes unseen validation categories as all zeros
# (scikit-learn >= 1.2; older versions use sparse=False and get_feature_names instead)
OH_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit on the low-cardinality columns of the training data; fitting is necessary to generate encoded_cols
OH_encoder.fit(X_train_adjusted[low_cardinality_cols])

# List all of the new columns that need to be created
encoded_cols = list(OH_encoder.get_feature_names_out(low_cardinality_cols))

# Append the one-hot encoded columns to the training and validation datasets
X_train_adjusted[encoded_cols] = OH_encoder.transform(X_train_adjusted[low_cardinality_cols])
X_valid_adjusted[encoded_cols] = OH_encoder.transform(X_valid_adjusted[low_cardinality_cols])

# Drop the original categorical columns, which are redundant now that they are one-hot encoded
X_train_adjusted.drop(low_cardinality_cols, axis=1, inplace=True)
X_valid_adjusted.drop(low_cardinality_cols, axis=1, inplace=True)

OH_X_train = X_train_adjusted
OH_X_valid = X_valid_adjusted

In [59]:
print(X_train_adjusted.shape, X_valid_adjusted.shape)

(1168, 155) (292, 155)


In [60]:
print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

MAE from Approach 3 (One-Hot Encoding):
17525.345719178084
