## 2-4-1. Feature Selection Using Hybrid Method (Random Shuffling)

A popular feature selection method consists of randomly shuffling the values of one variable and measuring how that permutation affects the performance metric of the machine learning algorithm. In other words, we shuffle the values of each feature, one feature at a time, and measure how much the permutation degrades the accuracy, the roc_auc, the mse, or any other performance metric of the model. If a variable is important, randomly permuting its values will degrade the metric markedly. Conversely, permuting the values of an unimportant variable should have little to no effect on the metric we are assessing.

The procedure goes more or less like this:

- Build a machine learning model and store its performance metrics.

- Shuffle 1 feature, and make a new prediction using the previous model.

- Determine the performance of this prediction.

- Determine the change in the performance of the prediction with the shuffled feature compared to the original one.

- Repeat for each feature.

To select features, we choose those whose shuffling degrades model performance beyond an arbitrarily set threshold.
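The steps above can be sketched as a small generic function. This is only an illustration under stated assumptions: the helper names (`shuffle_importance`, `rmse`), the synthetic data, the least-squares stand-in model, and the threshold of 0.5 are all hypothetical and are not part of the analysis in this lecture. Here the change is reported as shuffled metric minus original metric, so for an error metric such as RMSE a large positive value marks an important feature.

```python
import numpy as np


def shuffle_importance(predict, X, y, metric, seed=0):
    """Return, for each column of X, metric(shuffled) - metric(original)."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    changes = []
    for j in range(X.shape[1]):
        X_shuffled = X.copy()
        # Permute one column only; all other columns stay intact.
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        changes.append(metric(y, predict(X_shuffled)) - baseline)
    return np.array(changes)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic data: column 0 drives the target, column 1 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Stand-in model: ordinary least squares fitted with NumPy.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def predict(data):
    return data @ coef


rmse_increase = shuffle_importance(predict, X, y, rmse)

threshold = 0.5  # arbitrary cut-off, chosen for this toy example only
selected = np.where(rmse_increase > threshold)[0]
print(rmse_increase, selected)
```

Because the function only needs a `predict` callable and a metric, the same loop works for any fitted model and any performance measure.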

I will demonstrate how to select features based on random shuffling using a regression problem (predicting `water_supply`).

**Note** For the demonstration, I will continue to use Random Forests, but this selection procedure can be used with any machine learning algorithm. In fact, the importance of the features is determined specifically for the algorithm used. Therefore, different algorithms may return different subsets of important features.
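To illustrate that the importances depend on the algorithm, here is a small sketch on synthetic data. It uses scikit-learn's built-in `sklearn.inspection.permutation_importance` rather than a manual loop; the data and the two models are assumptions made up for this example. The second column affects the target only through its square, which a linear model cannot capture but a random forest can.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
# Column 0 acts linearly; column 1 acts only through its square.
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

importances = {}
for name, model in models.items():
    model.fit(X, y)
    result = permutation_importance(
        model, X, y,
        scoring="neg_root_mean_squared_error",
        n_repeats=10,
        random_state=0,
    )
    # importances_mean: average drop in the score over the repeats
    importances[name] = result.importances_mean
    print(name, result.importances_mean)
```

With this setup the linear model should rate the second column as nearly irrelevant, while the random forest should rate it as important, so the two algorithms would select different feature subsets.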

### A. Import Python libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

### B. Set City Name and Import City Data

In [2]:
# set city name
city_data = "1_goyang_city.xlsx"
CITY_NAME_Eng = "GoYang-City"

In [3]:
# Read excel file using pandas
df = pd.read_excel(f"../../data/{city_data}", sheet_name="training", header=4, index_col=0)
# Remove the first two rows, which are not needed for this analysis
df = df.iloc[2:]
# Change Date Format and Set Date as index
df.index = pd.to_datetime(df.index.str.strip(), format='%Y-%m')
df.index.name = "date"
# Change data format from "Object" to "Float"
df["water_supply"] = df.water_supply.astype(float)
df["Total_Population"] = df.Total_Population.astype(float)
# Delete unnecessary columns 
df.drop(columns=df.columns[19:21], inplace=True)
df.drop(columns=df.columns[22:23], inplace=True)
# Select clean data
df = df.loc["2010-01-01":]
df

date,water_supply,Total_Population,Households,Population_per_Households,Male_Population,Female_Population,Male_Female_Ratio,Population_aging_Ratio,Power_usage,Num_of_Business,...,personal_expense,benefits_vs_personal_expense,employment_ratio,employment_insurance_ratio,Average_Temp,Monthly_Rainfall,Average_Relative_Humadity,Ground_Temp,Average_Wind,Average_Pressure
2010-01-01,282265.709677,939497.0,353741.0,2.655889,463878.0,475619.0,0.975314,0.088,434436000.0,20326.0,...,104371555.0,0.191,99001.0,0.105377,-4.5,29.3,0.65,-1.2,2.3,1013.6
2010-02-01,273685.892857,940639.0,354266.0,2.655177,464518.0,476121.0,0.975630,0.088,418156000.0,20685.0,...,104371555.0,0.191,99893.0,0.106197,1.4,55.3,0.59,1.4,2.4,1010.7
2010-03-01,269918.193548,940982.0,354003.0,2.658119,464740.0,476242.0,0.975848,0.088,346653000.0,20809.0,...,104371555.0,0.191,99678.0,0.105930,4.3,82.5,0.59,5.0,2.9,1009.6
2010-04-01,274462.700000,941737.0,354192.0,2.658832,465148.0,476589.0,0.975994,0.088,356701000.0,21857.0,...,104371555.0,0.191,101594.0,0.107879,9.5,62.8,0.54,10.8,2.9,1007.4
2010-05-01,288537.806452,941724.0,354157.0,2.659058,465110.0,476614.0,0.975863,0.089,313793000.0,21739.0,...,104371555.0,0.191,102073.0,0.108390,17.2,124.0,0.62,18.7,2.6,1000.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-08-01,354382.451613,1080896.0,454793.0,2.376677,528972.0,551924.0,0.958415,0.140,487867544.0,46416.0,...,218316665.0,0.293,171845.0,0.158984,25.9,211.2,0.74,28.2,2.1,998.6
2021-09-01,343537.800000,1080787.0,455501.0,2.372743,528911.0,551876.0,0.958387,0.141,401433572.0,45514.0,...,218316665.0,0.293,172771.0,0.159857,22.6,131.0,0.71,24.6,2.3,1003.4
2021-10-01,340126.806452,1080240.0,455845.0,2.369753,528683.0,551557.0,0.958528,0.142,358286760.0,45839.0,...,218316665.0,0.293,173486.0,0.160599,15.6,57.0,0.70,16.7,2.1,1011.0
2021-11-01,335109.300000,1079722.0,456376.0,2.365861,528390.0,551332.0,0.958388,0.142,372991744.0,46076.0,...,218316665.0,0.293,173831.0,0.160996,8.2,62.4,0.68,7.9,2.1,1009.1


### C. Hybrid Methods: Selection using Random Shuffling

* Split Data

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels=['water_supply'], axis=1),
    df['water_supply'],
    test_size=0.2,
    random_state=0)

X_train.shape, X_test.shape

X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)

X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

#### Train ML with all features

In [5]:
# The first step to determine feature importance by feature shuffling
# is to build the machine learning model for which we want to
# select features

# In this case, I will build Random Forests, but remember that
# you can use this procedure for any other machine learning algorithm

# I build few and shallow trees to avoid overfitting
rf = RandomForestRegressor(n_estimators=200,
                           max_depth=3,
                           random_state=2909,
                           n_jobs=4)

rf.fit(X_train, y_train)

# print performance metrics
# (RMSE is computed as the square root of the MSE, because the
# `squared=False` argument is not available in recent scikit-learn releases)
print('train RMSE: ',
      np.sqrt(mean_squared_error(y_train, rf.predict(X_train))))
print('train R2: ', r2_score(y_train, rf.predict(X_train)))
print()
print('test RMSE: ',
      np.sqrt(mean_squared_error(y_test, rf.predict(X_test))))
print('test R2: ', r2_score(y_test, rf.predict(X_test)))

train RMSE:  4858.90843031782
train R2:  0.9491408243110964

test RMSE:  8673.978999143294
test R2:  0.861875341464565


#### Shuffle features and assess performance drift

In [6]:
# In this cell, I shuffle each feature of the dataset, one at a time,
# and then use the dataset with the shuffled variable to make predictions
# with the random forest trained in the previous cell

# overall train rmse: using all the features
train_rmse = np.sqrt(mean_squared_error(y_train, rf.predict(X_train)))

# list to capture the performance shift
performance_shift = []

# for each feature:
for feature in X_train.columns:

    X_train_c = X_train.copy()

    # shuffle individual feature
    X_train_c[feature] = X_train_c[feature].sample(
        frac=1, random_state=11).reset_index(drop=True)

    # make prediction with shuffled feature and calculate the rmse
    shuff_rmse = np.sqrt(mean_squared_error(y_train, rf.predict(X_train_c)))

    # original rmse minus shuffled rmse: negative if shuffling hurt the model
    drift = train_rmse - shuff_rmse

    # store the change in rmse
    performance_shift.append(drift)
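A single shuffle with one fixed seed can give a noisy estimate, especially for weak features. One possible refinement, sketched here on synthetic data, is to repeat the shuffle several times and average the drift. The helper name `mean_rmse_drift`, the toy DataFrame, and the choice of five repeats are assumptions for illustration only. The sign convention matches the cell above (original RMSE minus shuffled RMSE), so important features come out negative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def mean_rmse_drift(model, X, y, n_repeats=5, seed=0):
    """Average (original RMSE - shuffled RMSE) over several shuffles."""
    rng = np.random.default_rng(seed)
    base_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    drifts = {}
    for col in X.columns:
        per_repeat = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Permute one column; the rest of the frame is untouched.
            X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
            shuffled_rmse = np.sqrt(
                mean_squared_error(y, model.predict(X_shuffled)))
            per_repeat.append(base_rmse - shuffled_rmse)
        drifts[col] = np.mean(per_repeat)
    return pd.Series(drifts)


# Toy data: "signal" drives the target, "noise" does not.
rng = np.random.default_rng(1)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_toy = 5.0 * X_toy["signal"].to_numpy() + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
model.fit(X_toy, y_toy)

drift_toy = mean_rmse_drift(model, X_toy, y_toy)
print(drift_toy)
```

The same function could also be called with a held-out set instead of the training set, which measures importance for generalization rather than for the fit.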

In [7]:
# Now I will transform the list into a pandas Series
# for easy manipulation

feature_importance = pd.Series(performance_shift)

# add variable names in the index
feature_importance.index = X_train.columns

feature_importance

Total_Population               -6.520779e+01
Households                     -2.579171e+02
Population_per_Households      -1.922194e+02
Male_Population                -2.075164e+02
Female_Population              -1.086853e+02
Male_Female_Ratio              -1.705207e+03
Population_aging_Ratio         -9.789274e+01
Power_usage                    -2.076161e+01
Num_of_Business                -1.331634e+04
Business_above_100             -3.172746e+01
complex_area                   -7.275958e-12
annual_household_income         7.587667e-01
High_School_Graduate_num        3.048825e+00
High_School_Graduate_ratio     -1.783354e+01
personal_expense               -2.113214e+01
benefits_vs_personal_expense   -5.813004e+01
employment_ratio               -7.011901e+01
employment_insurance_ratio     -2.763931e+02
Average_Temp                   -1.363414e+03
Monthly_Rainfall               -4.340650e+01
Average_Relative_Humadity      -3.689502e+01
Ground_Temp                    -7.141721e+03
Average_Wi

In [8]:
# Note that for the rmse, the smaller the better.

# We compute original_rmse - shuffled_data_rmse.
# If the feature is important, shuffling it increases the rmse,
# so we are looking for negative values here.

# number of features that cause a drop in performance
# when shuffled

feature_importance[feature_importance<0].shape[0]

22

In [9]:
# and the variable names

feature_importance[feature_importance<0].index

Index(['Total_Population', 'Households', 'Population_per_Households',
       'Male_Population', 'Female_Population', 'Male_Female_Ratio',
       'Population_aging_Ratio', 'Power_usage', 'Num_of_Business',
       'Business_above_100', 'complex_area', 'High_School_Graduate_ratio',
       'personal_expense', 'benefits_vs_personal_expense', 'employment_ratio',
       'employment_insurance_ratio', 'Average_Temp', 'Monthly_Rainfall',
       'Average_Relative_Humadity', 'Ground_Temp', 'Average_Wind',
       'Average_Pressure'],
      dtype='object')
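The `< 0` rule also keeps `complex_area`, whose drift of about -7e-12 is numerically indistinguishable from zero. A minimal sketch of a threshold-based alternative follows, using a handful of the drift values and the baseline train RMSE printed earlier. The cut-off of 1 % of the baseline RMSE is a hypothetical choice for illustration, not a value used in this analysis.

```python
import pandas as pd

# A few drift values (original rmse - shuffled rmse) copied from the output above
drift = pd.Series({
    "Num_of_Business": -1.331634e+04,
    "Ground_Temp": -7.141721e+03,
    "Power_usage": -2.076161e+01,
    "complex_area": -7.275958e-12,
    "annual_household_income": 7.587667e-01,
})

baseline_rmse = 4858.90843031782      # train RMSE of the full model
threshold = 0.01 * baseline_rmse      # hypothetical cut-off: 1 % of baseline

# Rule used in this lecture: any negative drift counts
strict = drift[drift < 0].index.tolist()

# Threshold rule: shuffling must raise the rmse by more than the cut-off
thresholded = drift[drift < -threshold].index.tolist()

print(strict)
print(thresholded)
```

Under this cut-off only `Num_of_Business` and `Ground_Temp` remain from the five values shown; how strict the threshold should be is a judgment call that depends on the problem.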

In [10]:
results_f = pd.DataFrame()
results_f["hybrid shuffling"] = [list(feature_importance[feature_importance<0].index.values)]
results_f.to_csv(f'./results/{CITY_NAME_Eng}_hybrid_shuffling_results.csv')
results_f

Unnamed: 0,hybrid shuffling
0,"[Total_Population, Households, Population_per_..."


### Select features

In [11]:
# Now let's compare the performance of a random forest
# built only using the selected features

# slice the data
feat = feature_importance[feature_importance<0].index

X_train = X_train[feat]
X_test = X_test[feat]

In [12]:
X_train.shape, X_train.shape

((115, 22), (115, 22))

In [13]:
# build and evaluate the model

rf = RandomForestRegressor(n_estimators=100,
                           max_depth=3,
                           random_state=2909,
                           n_jobs=4)

rf.fit(X_train, y_train)

# print performance metrics
# (RMSE is computed as the square root of the MSE, because the
# `squared=False` argument is not available in recent scikit-learn releases)
print('train rmse: ', np.sqrt(mean_squared_error(
    y_train, rf.predict(X_train))))
print('train r2: ', r2_score(y_train, rf.predict(X_train)))
print()
print('test rmse: ', np.sqrt(mean_squared_error(
    y_test, rf.predict(X_test))))
print('test r2: ', r2_score(y_test, rf.predict(X_test)))

train rmse:  4868.168741317753
train r2:  0.9489467804884121

test rmse:  8830.00418214133
test r2:  0.8568615522390285


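The model with fewer features (22 of the original 24) performs similarly to the model with all features: the test RMSE rises slightly from about 8674 to about 8830, and the test R2 drops from 0.862 to 0.857.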

That is all for this lecture. I hope you enjoyed it, and see you in the next one!