# Part 6.1: Advanced Topics - Feature Selection

Feature selection is the process of choosing a subset of relevant features for use in model construction. The goals are to improve model performance, reduce overfitting, and decrease training time.

In [3]:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Generate a dataset with 20 features: 5 informative, 5 redundant, and 10 pure noise
X, y = make_classification(n_samples=100, n_features=20, n_informative=5, n_redundant=5, random_state=42)

### Filter Methods
These methods select features based on their statistical properties (e.g., correlation with the target variable, variance). They are fast because no model is trained, but they ignore the model that will be used downstream.
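
As a rough illustration of a filter method, the sketch below scores each feature independently with scikit-learn's `SelectKBest` and the ANOVA F-test (`f_classif`). The choice of `k=5` and of the F-test as the scoring function are assumptions made for this example; other scores such as mutual information would work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Same synthetic dataset as in the cell above
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)

# Score every feature on its own with the ANOVA F-test and keep the 5 best
filter_selector = SelectKBest(score_func=f_classif, k=5)
X_filtered = filter_selector.fit_transform(X, y)

# Column indices of the features that were kept
selected_idx = filter_selector.get_support(indices=True)
print("Filter-selected feature indices:", selected_idx)
print("F-scores of selected features:", filter_selector.scores_[selected_idx].round(2))
```

Because each feature is scored in isolation, this approach can keep several redundant features that carry the same signal.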

### Wrapper Methods
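These methods use a predictive model to score feature subsets. Because they evaluate features in the context of the model, they often find better subsets than filter methods, but they are more computationally expensive since the model is refit many times. **Recursive Feature Elimination (RFE)** is a popular example.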

In [4]:
estimator = LogisticRegression(max_iter=1000)
# Select the top 5 features
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)

print("RFE Selected Features (top 5):")
print(selector.support_)

RFE Selected Features (top 5):
[False False False False False False False False False False  True  True
 False False  True False False False  True  True]
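
The boolean mask printed above is hard to read at a glance. A minimal sketch of how it might be turned into column indices, and how the elimination order can be inspected through `ranking_`, is shown below. It refits the same RFE configuration so that it runs on its own; the variable names `selected_idx` and `X_reduced` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
selector.fit(X, y)

# Convert the boolean mask into the indices of the selected columns
selected_idx = np.flatnonzero(selector.support_)
print("Selected feature indices:", selected_idx)

# ranking_ is 1 for selected features; larger values were eliminated earlier
print("Feature ranking:", selector.ranking_)

# Keep only the selected columns for downstream modelling
X_reduced = selector.transform(X)
print("Reduced shape:", X_reduced.shape)
```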


### Embedded Methods
These methods perform feature selection as part of the model training process. Common examples are Lasso regression, whose L1 penalty can shrink coefficients to exactly zero, and tree-based models, which expose per-feature importances after fitting.

In [5]:
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)
importances = forest.feature_importances_

feature_importance_df = pd.DataFrame({'feature': range(X.shape[1]), 'importance': importances})
print("Feature Importances from Random Forest:")
print(feature_importance_df.sort_values('importance', ascending=False).head(10))

Feature Importances from Random Forest:
    feature  importance
18       18    0.137236
11       11    0.128201
10       10    0.111889
3         3    0.089496
19       19    0.064493
14       14    0.058723
2         2    0.048289
7         7    0.045446
4         4    0.040668
16       16    0.038796
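
The L1-penalty route mentioned above can be sketched in a similar way. Since this is a classification problem, the example swaps Lasso regression for an L1-penalized linear classifier (`LinearSVC`) wrapped in `SelectFromModel`. The value `C=0.1`, the standardization step, and the variable names are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)

# L1 penalties are scale-sensitive, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# Smaller C means a stronger penalty and more coefficients driven to zero
l1_model = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000, random_state=42)

# SelectFromModel fits the estimator and keeps features with non-negligible coefficients
embedded_selector = SelectFromModel(l1_model)
embedded_selector.fit(X_scaled, y)

kept = embedded_selector.get_support(indices=True)
X_embedded = embedded_selector.transform(X_scaled)
print("Features kept by the L1 penalty:", kept)
print("Reduced shape:", X_embedded.shape)
```

Unlike RFE, the number of retained features is not fixed in advance here; it depends on the penalty strength `C`.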
