# Feature Selection using Random Forest
* We'll show a different way to do feature selection (we already saw how to
do it using L1 regularization).
* The RandomForestClassifier class in scikit-learn exposes an attribute
called feature_importances_ after fitting, which gives the relative importance of each feature.
* We will examine feature selection with random forest on the dataset with
300,000 ad click samples (the first n_rows rows of train.csv).
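
Before moving to the real data, here is a minimal sketch of what feature_importances_ looks like. The synthetic dataset and every name ending in `_demo` are assumptions made for illustration only; they are not part of the ad click data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustration only): 5 features, of which 2 are informative.
# shuffle=False keeps the informative features in columns 0 and 1.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

forest_demo = RandomForestClassifier(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

# One non-negative score per input column; the scores are normalized to sum to 1.
importances_demo = forest_demo.feature_importances_
print(importances_demo)
print(importances_demo.sum())
```

The attribute only exists after fit has been called, and it has one entry per column of the matrix the forest was trained on.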

## Step 1: Loading the data

In [2]:
import pandas as pd
import numpy as np

n_rows = 300_000
df = pd.read_csv("./dataset/train.csv", nrows = n_rows)

# Splitting the features from the target
X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'], axis=1).values
Y = df['click']




## Step 2: Performing One-Hot Encoding
* We'll split the data into training and testing sets, then one-hot encode the categorical features

In [3]:
# Split the data into training and testing sets (90% - 10%)
n_train = int(n_rows * 0.9)
X_train = X[:n_train]
Y_train = Y[:n_train].astype('float32')
X_test = X[n_train:]
Y_test = Y[n_train:].astype('float32')


# One-hot encode the categorical features
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X_train_enc = enc.fit_transform(X_train).toarray().astype('float32')
X_test_enc = enc.transform(X_test).toarray().astype('float32')


## Step 3: Training the model and finding the important features

In [5]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(
    n_estimators=100, criterion='gini', min_samples_split=30, n_jobs=-1)
random_forest.fit(X_train_enc, Y_train)

In [7]:
# Finding the 10 least important features
feature_imp = random_forest.feature_importances_
print(feature_imp)

# Getting the feature names:
feature_names = enc.get_feature_names_out()
bottom_10 = np.argsort(feature_imp)[:10]
print(f"The 10 least important features are:\n{feature_names[bottom_10]}")

[8.77560300e-06 1.37007052e-03 1.02735879e-03 ... 5.43089865e-04
 9.89793070e-03 9.87211403e-06]
The 10 least important features are:
['x8_cf0c5821' 'x5_e8949ef7' 'x8_02551300' 'x8_20069b56' 'x8_c00022b2'
 'x8_92e1a858' 'x5_ac15a7c4' 'x8_7eb254eb' 'x8_a4b93048' 'x8_feba401a']


In [8]:
# Finding the top 10 most important features
top_10 = np.argsort(feature_imp)[-10:]
print(f"The most important features are: \n{feature_names[top_10]}")

The most important features are: 
['x17_-1' 'x16_1063' 'x18_157' 'x8_8a4875bd' 'x3_7687a86e' 'x2_d9750ee7'
 'x3_98572c79' 'x14_1993' 'x15_2' 'x18_33']
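
Note that np.argsort sorts in ascending order, so the slice [-10:] lists the top features from least to most important. A minimal sketch of ranking in descending order and pairing each name with its score, using made-up numbers (the `imp` and `names` arrays are hypothetical):

```python
import numpy as np

# Made-up importances and names (illustration only).
imp = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
names = np.array(["f0", "f1", "f2", "f3", "f4"])

# argsort is ascending; reversing it puts the most important feature first.
order = np.argsort(imp)[::-1]
top_3 = order[:3]

for name, score in zip(names[top_3], imp[top_3]):
    print(f"{name}: {score:.2f}")
```

The same indexing should work with feature_names and feature_imp from the cells above.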


In [9]:
# Checking the number of features (they're one-hot encoded)
print(len(feature_names))

8204
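
Once the importances are known, one possible way to actually select features is to keep only the columns with the highest scores. The sketch below does this on synthetic data; the value of `k`, the dataset, and the `_sel` names are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustration only): 20 features, 4 of them informative.
X_sel, y_sel = make_classification(
    n_samples=400, n_features=20, n_informative=4, random_state=0)

forest_sel = RandomForestClassifier(n_estimators=50, random_state=0)
forest_sel.fit(X_sel, y_sel)

k = 5  # hypothetical number of features to keep
top_k = np.argsort(forest_sel.feature_importances_)[-k:]

# Keep only the k most important columns.
X_sel_top = X_sel[:, top_k]
print(X_sel_top.shape)
```

With the variables from this notebook, the equivalent slicing would be X_train_enc[:, top_k] and X_test_enc[:, top_k], with top_k computed from feature_imp. How many features to keep is a judgment call that is usually checked against validation performance.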
