# Feature Selection

After generating many features, it’s important to identify the ones that actually matter. This reduces computation and can improve performance by removing noise from irrelevant or redundant features.

The goal is to eliminate:

- **Irrelevant features**. They contribute no useful signal

- **Redundant features**. They overlap with others and add unnecessary complexity

Common approaches include:

1) Filter methods: rank features using statistical criteria

2) Wrapper methods: evaluate subsets using a predictive model

3) Embedded methods: select features during model training

In [1]:
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

from sklearn.feature_selection import RFE, SequentialFeatureSelector

In [2]:
X, y = load_iris(return_X_y=True)

print("original features:", X.shape[1])


original features: 4


### 1) Filter Methods

Rank features using statistical criteria, for example how strongly each individual feature is associated with the label.

Pros

- Fast and computationally cheap

- Model-agnostic

- Good first step for high-dimensional data (e.g., text, genomics)

Cons

- Score each feature independently, so interactions between features are ignored

- May keep features that look good statistically but don’t help the final model

- Can drop features that only work well in combination with others
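The last point can be made concrete with a small synthetic case. The sketch below builds a hypothetical XOR dataset (not part of the original notebook) in which neither feature alone says anything about the label, yet the two together determine it. The names `X_xor`, `y_xor`, and the choice of a decision tree are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: every combination of two binary features, repeated 50 times
X_xor = np.tile([[0, 0], [0, 1], [1, 0], [1, 1]], (50, 1))
y_xor = X_xor[:, 0] ^ X_xor[:, 1]  # label is the XOR of the two features

# Univariate score: each feature on its own is independent of the label
mi_xor = mutual_info_classif(X_xor, y_xor, discrete_features=True)
print("MI per feature:", mi_xor)

# A model that sees both features at once can still fit the label
tree = DecisionTreeClassifier(random_state=0).fit(X_xor, y_xor)
xor_accuracy = tree.score(X_xor, y_xor)
print("Tree training accuracy:", xor_accuracy)
```

A filter that keeps only the top-scoring features would have no reason to prefer either column here, even though a model needs both.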

**Code example:**

Chi-squared and Mutual Information

In [3]:
selector = SelectKBest(score_func=chi2, k=3)
X_chi2 = selector.fit_transform(X, y)

print("Chi2 scores:", selector.scores_)
print("Selected features using chi2:", selector.get_support(indices=True))

Chi2 scores: [ 10.81782088   3.7107283  116.31261309  67.0483602 ]
Selected features using chi2: [0 2 3]


In [4]:
mi_scores = mutual_info_classif(X, y)

k = 3
top_indices = mi_scores.argsort()[-k:]

print("MI scores:", mi_scores)
print("Selected features using MI:", top_indices)

MI scores: [0.48505954 0.2723968  0.9921889  0.99264549]
Selected features using MI: [0 2 3]
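`mutual_info_classif` adds a small amount of random noise to continuous features, so the scores above can shift slightly between runs. One possible way to make the ranking reproducible, and to reuse the `SelectKBest` interface from the chi-squared example, is to fix `random_state` through `functools.partial`. This is a minimal sketch; the seed value 0 and the names `mi_score_func` and `mi_selector` are arbitrary choices.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Fix the seed so repeated runs give the same MI scores (0 is an arbitrary choice)
mi_score_func = partial(mutual_info_classif, random_state=0)

# SelectKBest accepts a score function that returns only an array of scores
mi_selector = SelectKBest(score_func=mi_score_func, k=3)
X_mi = mi_selector.fit_transform(X, y)

print("MI scores:", mi_selector.scores_)
print("Selected features using MI:", mi_selector.get_support(indices=True))
```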


### 2) Wrapper Methods

Rank or select features by repeatedly training a model on different feature subsets. The model’s performance guides which features are kept.

Pros

- Captures interactions between features

- Tailored to a specific model and objective

Cons

- Computationally expensive (many model evaluations)

- Can overfit when data is limited

**Examples**:

Recursive Feature Elimination (RFE)

- Starts with all features, trains a model, removes the weakest feature (judged by the model’s coefficients or feature importances), and repeats until the desired number remains.

Sequential Feature Selection (SFS)

- Adds (forward) or removes (backward) one feature at a time, choosing whichever improves model performance the most.
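To make the RFE loop concrete, here is a hand-rolled sketch of the elimination steps described above. It assumes that a feature's weight is the sum of its squared logistic regression coefficients across classes, and the names `n_keep` and `remaining` are hypothetical. scikit-learn's `RFE` class handles this bookkeeping for you, so treat this as an illustration rather than a replacement.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

n_keep = 3  # illustrative target number of features
remaining = list(range(X.shape[1]))  # start with all feature indices

while len(remaining) > n_keep:
    # Refit the model on the features that are still in play
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # Assumed importance measure: squared coefficients summed over classes
    weights = (model.coef_ ** 2).sum(axis=0)
    weakest = remaining[int(np.argmin(weights))]
    print("Dropping feature", weakest)
    remaining.remove(weakest)

print("Remaining feature indices:", remaining)
```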

**Code example:**

Logistic Regression with RFE and SFS


In [5]:
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=3)
rfe.fit(X, y)

print("Selected feature indices (RFE):", rfe.get_support(indices=True))

Selected feature indices (RFE): [1 2 3]


In [6]:
model = LogisticRegression(max_iter=1000)

sfs = SequentialFeatureSelector(
    estimator=model,
    n_features_to_select=3,
    direction="forward" # or "backward"
)
sfs.fit(X, y)

print("Selected feature indices (SFS):", sfs.get_support(indices=True))

Selected feature indices (SFS): [0 2 3]


When you want to keep only a few of many features, forward selection is more practical: it needs only a few addition steps, so the search path is short.

When you want to keep most of the features, backward elimination makes more sense: only a few features have to be removed.
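A rough count of model fits shows why. The helper below is a hypothetical back-of-the-envelope calculation (the function name and the default of 5 cross-validation folds are assumptions). It counts one cross-validated evaluation per candidate feature at each step of a greedy sequential search, using 561 features as the example size.

```python
def sequential_fit_count(n_features, n_select, direction="forward", cv=5):
    """Approximate number of model fits for greedy sequential selection."""
    # Forward adds n_select features; backward removes all the others
    n_steps = n_select if direction == "forward" else n_features - n_select
    # At step i there are (n_features - i) candidates left to try
    n_candidates = sum(n_features - i for i in range(n_steps))
    return n_candidates * cv


for n_select in (20, 500):
    fwd = sequential_fit_count(561, n_select, "forward")
    bwd = sequential_fit_count(561, n_select, "backward")
    print(f"keep {n_select} of 561: forward={fwd:,} fits, backward={bwd:,} fits")
```

Under these assumptions, keeping 20 of 561 features takes 55,150 fits going forward and 787,155 going backward; the balance flips when most features are kept.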

### 3) Embedded Methods

Select features as part of the model training process. The model itself decides which features matter by shrinking some coefficients to zero (L1 regularization) or by assigning importance scores (tree-based models).

Pros

- Feature selection is integrated into training

- Can capture interactions and complex patterns

Cons

- Model-dependent (not transferable across architectures)

- Regularization strength must be tuned in the case of Lasso
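For the L1 route, a minimal sketch with `SelectFromModel` could look like the following. The choice of `LinearSVC` as the L1-penalized model and the grid of `C` values are assumptions for illustration; a smaller `C` means stronger regularization.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

l1_selected = {}
for C in (0.01, 0.1, 1.0):  # illustrative values; tune for your own data
    # The L1 penalty can push some coefficients to exactly zero
    l1_model = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000, random_state=0)
    # For L1-penalized models, SelectFromModel keeps features with non-negligible weights
    selector = SelectFromModel(l1_model).fit(X, y)
    l1_selected[C] = selector.get_support(indices=True)
    print(f"C={C}: kept features {l1_selected[C]}")
```

Comparing the kept features across values of `C` is a quick way to see how sensitive the selection is to the regularization strength.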

**Code example:**

Random Forest Feature Importances

In [13]:
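# Note: this fits a regressor on the integer class labels (0, 1, 2);
# a RandomForestClassifier is the more conventional choice for a
# classification target and may rank the features differently.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

importances = rf.feature_importances_
# Indices of the k most important features, most important first (k = 3 from above)
indices = np.argsort(importances)[::-1][:k]

print(indices)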

[2 3 1]


### A Concrete Example: Activity Recognition with Smartphone Accelerometer Data


In [1]:
import pandas as pd

df = pd.read_csv("./data/train.csv")

print(df.head())
print(df.info())

   tBodyAcc-mean()-X  tBodyAcc-mean()-Y  tBodyAcc-mean()-Z  tBodyAcc-std()-X  \
0           0.288585          -0.020294          -0.132905         -0.995279   
1           0.278419          -0.016411          -0.123520         -0.998245   
2           0.279653          -0.019467          -0.113462         -0.995380   
3           0.279174          -0.026201          -0.123283         -0.996091   
4           0.276629          -0.016570          -0.115362         -0.998139   

   tBodyAcc-std()-Y  tBodyAcc-std()-Z  tBodyAcc-mad()-X  tBodyAcc-mad()-Y  \
0         -0.983111         -0.913526         -0.995112         -0.983185   
1         -0.975300         -0.960322         -0.998807         -0.974914   
2         -0.967187         -0.978944         -0.996520         -0.963668   
3         -0.983403         -0.990675         -0.997099         -0.982750   
4         -0.980817         -0.990482         -0.998321         -0.979672   

   tBodyAcc-mad()-Z  tBodyAcc-max()-X  ...  fBodyBodyGyr

The dataset contains 561 features, which makes model training unnecessarily heavy. To speed up computation and improve interpretability, it makes sense to reduce the feature set. For this example, I limit the number of features to 20.

Because of the large feature count, backward wrapper methods (like RFE or sequential backward selection, SBS) would be too computationally expensive. Instead, I use Random Forest-based feature importances, an embedded method that identifies useful features directly from the trained model.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

n_features = 20

df = pd.read_csv("./data/train.csv").drop(columns="subject")
df["Activity"] = LabelEncoder().fit_transform(df["Activity"])
X, y = df.drop("Activity", axis=1), df["Activity"]

# Train a random forest on all rows to rank features by importance
# (selecting before the split leaks test information; see the note below)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)
top20 = rf.feature_importances_.argsort()[-n_features:][::-1]
selected_features = X.columns[top20]

# Evaluate model with all features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf.fit(X_train, y_train)
print("All features accuracy:", accuracy_score(y_test, rf.predict(X_test)))

# Evaluate model with top 20 features
X_sel = X[selected_features]
X_train, X_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.2, random_state=42)
rf.fit(X_train, y_train)
print("Top 20 features accuracy:", accuracy_score(y_test, rf.predict(X_test)))

All features accuracy: 0.9809653297076818
Top 20 features accuracy: 0.973487423521414


Accuracy drops only slightly (0.981 vs. 0.973) after reducing the feature set from 561 to 20, which suggests that a small subset of features carries most of the predictive power.
Random forest feature importances provided a fast and effective way to identify these, avoiding the heavy computational cost of wrapper methods like RFE or SBS while still accounting for feature interactions and nonlinearity.

Note that feature selection was done before splitting to simplify the example.
In a full analysis, it should be performed using only the training data to avoid data leakage.
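One way to keep selection inside the training data is to put it in a `Pipeline`, so the importances are computed only from the rows passed to `fit`. The sketch below uses a synthetic stand-in from `make_classification` rather than the accelerometer data; the dataset sizes, hyperparameters, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data (sizes are arbitrary)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=100, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    # threshold=-np.inf makes max_features the only selection criterion
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        max_features=20,
        threshold=-np.inf,
    )),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Both the selector and the classifier are fit on the training split only
pipe.fit(X_train, y_train)
n_selected = pipe.named_steps["select"].get_support().sum()
test_accuracy = pipe.score(X_test, y_test)
print("Selected features:", n_selected)
print("Held-out accuracy:", test_accuracy)
```

Because the selector lives inside the pipeline, the same object can also be passed to cross-validation utilities, which then repeat the selection within each training fold.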