<center>
    <h1>Recursive Feature Elimination (RFE)</h1>
</center>

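- Basic feature selection methods look mostly at the individual properties of features and at how features relate to one another.
- RFE is a more pragmatic approach: it selects features based on how much they matter to a particular model. It repeatedly fits the model, drops the least important feature(s), and refits until the requested number of features remains.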
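
The elimination loop described above can be sketched by hand. This is a minimal illustration on synthetic data, not the ANSUR II data; `X_demo`, `y_demo`, `remaining`, and `n_keep` are hypothetical names introduced only for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # column indices still in play
n_keep = 3                                # hypothetical number of features to keep

while len(remaining) > n_keep:
    # Refit the model on the features that have survived so far.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_demo[:, remaining], y_demo)
    # Drop the single feature this fit considers least important.
    weakest = int(np.argmin(model.feature_importances_))
    remaining.pop(weakest)

print(f"Surviving feature indices: {remaining}")
```

Removing one feature per round and refitting each time is the part that makes the procedure recursive: importances are re-estimated after every removal instead of being read once from a single fit.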

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

<b>Dataset Link: </b>https://www.kaggle.com/datasets/seshadrikolluri/ansur-ii

In [2]:
data = pd.read_csv("male_data.csv", nrows=1000)
data.head()

,subjectid,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,...,Branch,PrimaryMOS,SubjectsBirthLocation,SubjectNumericRace,Ethnicity,DODRace,Age,Heightin,Weightlbs,WritingPreference
0,10027,266,1467,337,222,1347,253,202,401,369,...,Combat Arms,19D,North Dakota,1,,1,41,71,180,Right hand
1,10032,233,1395,326,220,1293,245,193,394,338,...,Combat Support,68W,New York,1,,1,35,68,160,Left hand
2,10033,287,1430,341,230,1327,256,196,427,408,...,Combat Support,68W,New York,2,,2,42,68,205,Left hand
3,10092,234,1347,310,230,1239,262,199,401,359,...,Combat Service Support,88M,Wisconsin,1,,1,31,66,175,Right hand
4,10093,250,1585,372,247,1478,267,224,435,356,...,Combat Service Support,92G,North Carolina,2,,2,21,77,213,Right hand


In [3]:
data.shape

(1000, 108)

In [4]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
data = data.select_dtypes(include=numerics)

In [5]:
data.shape

(1000, 99)
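
The drop from 108 to 99 columns comes from `select_dtypes` discarding the non-numeric columns. A small sketch of how one might list what was discarded, using a toy frame whose column names are borrowed from the preview above (the frame `df_demo` itself is hypothetical):

```python
import pandas as pd

# Toy frame with two numeric columns and one text column (illustrative only).
df_demo = pd.DataFrame(
    {
        "Age": [41, 35],
        "Heightin": [71, 68],
        "Branch": ["Combat Arms", "Combat Support"],
    }
)

# include="number" matches every numeric dtype, so the explicit list of
# int/float dtype names is not needed.
numeric_demo = df_demo.select_dtypes(include="number")

# Columns present before the filter but missing afterwards.
dropped = df_demo.columns.difference(numeric_demo.columns).tolist()
print(f"Kept: {list(numeric_demo.columns)}")
print(f"Dropped: {dropped}")
```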

In [6]:
# features: every numeric column except the target (columns= already implies axis=1)
X = data.drop(columns="Weightlbs")
y = data.loc[:, "Weightlbs"]

In [7]:
X.shape, y.shape

((1000, 98), (1000,))

In [8]:
# splitting the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 24)
print(f"Train Data: {X_train.shape}, {y_train.shape}")
print(f"Test Data: {X_test.shape}, {y_test.shape}")

Train Data: (800, 98), (800,)
Test Data: (200, 98), (200,)


In [9]:
# scaling the data
scaler = StandardScaler()
scaled_train = scaler.fit_transform(X_train)
scaled_test = scaler.transform(X_test)

In [10]:
rf_model = RandomForestRegressor(random_state = 24)
rf_model.fit(scaled_train, y_train)

In [11]:
print(f"Score: {rf_model.score(scaled_test, y_test)}")

Score: 0.9525694773382329
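
For scikit-learn regressors, `score` returns the coefficient of determination R², so the value above is the share of the variance in `Weightlbs` on the test split that the model accounts for. A small sketch of the formula, using the five `Weightlbs` values from the preview and made-up predictions (the predictions in `y_hat` are an assumption for illustration only):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([180.0, 160.0, 205.0, 175.0, 213.0])  # Weightlbs from the preview rows
y_hat = np.array([176.0, 165.0, 199.0, 181.0, 208.0])   # made-up predictions

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(f"manual R^2: {r2_manual:.4f}")
print(f"r2_score:   {r2_score(y_true, y_hat):.4f}")
```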


In [12]:
pd.DataFrame(
    # feature_importances_ is already non-negative, so abs() is not needed
    zip(X_train.columns, rf_model.feature_importances_),
    columns=["feature", "importance"],
).sort_values("importance").reset_index(drop=True)

,feature,importance
0,DODRace,0.000036
1,Heightin,0.000120
2,crotchheight,0.000127
3,tibialheight,0.000130
4,acromionradialelength,0.000130
...,...,...
93,abdominalextensiondepthsitting,0.002206
94,bicepscircumferenceflexed,0.003395
95,buttockdepth,0.003969
96,bideltoidbreadth,0.014106


## Performing Recursive Feature Elimination

In [13]:
rfe = RFE(estimator=RandomForestRegressor(random_state = 24), n_features_to_select=10)
rfe.fit(scaled_train, y_train)
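
Two details of `RFE` are easy to miss. The `step` parameter (default 1) controls how many features are removed per round, and the fitted `ranking_` attribute records the elimination order, with rank 1 for every selected feature. A minimal sketch on synthetic data; the names `X_demo`, `y_demo`, and `rfe_demo` are hypothetical, and `step=2` is chosen here only to shorten the run:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data (an assumption for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=12, n_informative=4, random_state=0
)

rfe_demo = RFE(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=4,
    step=2,  # remove two features per round instead of one
)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 marks a selected feature; larger ranks were eliminated earlier.
print(f"ranking_: {rfe_demo.ranking_}")
print(f"n_features_: {rfe_demo.n_features_}")
```

A larger `step` cuts the number of model fits, which matters with 98 features and a random forest, at the cost of a coarser elimination order.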

In [14]:
rfe.support_

array([False, False, False, False, False, False, False, False, False,
        True, False,  True, False, False, False, False, False,  True,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False,  True, False, False, False, False, False, False])

In [15]:
sum(rfe.support_)

10
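
The boolean mask is hard to read on its own. Because `StandardScaler` returns a NumPy array, the fitted selector has no column names, but the mask can be applied to the columns of the original frame. A minimal sketch on synthetic data with hypothetical column names (`feat_0` to `feat_5`); in the notebook the equivalent expression would presumably be `X_train.columns[rfe.support_]`:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

# Synthetic data wrapped in a DataFrame (illustrative only).
X_arr, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=2, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

# Scaling returns a plain ndarray, so the column names are lost here.
scaled_demo = StandardScaler().fit_transform(X_demo)

rfe_demo = RFE(
    RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=2,
)
rfe_demo.fit(scaled_demo, y_demo)

# Apply the boolean mask to the original column index to recover the names.
selected_names = X_demo.columns[rfe_demo.support_].tolist()
print(f"Selected: {selected_names}")

# Full ranking table, best (rank 1) first.
ranking_table = pd.DataFrame(
    {"feature": X_demo.columns, "rank": rfe_demo.ranking_}
).sort_values("rank").reset_index(drop=True)
print(ranking_table)
```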

## Using selected features to train the model

In [16]:
selected_train = rfe.transform(scaled_train)
selected_test = rfe.transform(scaled_test)

In [17]:
selected_train.shape, selected_test.shape

((800, 10), (200, 10))

In [18]:
rf_model = RandomForestRegressor(random_state = 24)
rf_model.fit(selected_train, y_train)

In [19]:
print(f"Score: {rf_model.score(selected_test, y_test)}")

Score: 0.9531169389266686
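
The 10 features used above were fixed by hand. One possible way to let the data suggest the number is `RFECV`, which runs the same elimination inside cross-validation and keeps the feature count with the best mean score. The sketch below uses synthetic data and small settings so that it runs quickly; `X_demo`, `y_demo`, and the parameter values are assumptions, not tuned choices for the ANSUR II data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic data (an assumption for illustration only).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=8, n_informative=3, noise=5.0, random_state=0
)

rfecv = RFECV(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    step=1,                    # drop one feature per round
    cv=3,                      # 3-fold cross-validation
    scoring="r2",              # same metric as regressor.score
    min_features_to_select=1,
)
rfecv.fit(X_demo, y_demo)

print(f"Features kept: {rfecv.n_features_}")
print(f"Mask: {rfecv.support_}")
```

On the real data this would be considerably slower than a single `RFE` run, since the elimination is repeated once per fold.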
