# CR_Score Playbook 02: Feature Selection

**Level:** Intermediate  
**Time:** 15-20 minutes  
**Goal:** Master model-agnostic feature selection methods

## What You'll Learn

- Forward selection (greedy addition)
- Backward elimination (greedy removal)
- Stepwise selection (bidirectional)
- Exhaustive search (all combinations)
- MLflow experiment tracking (the `use_mlflow` flag, disabled in this playbook for simplicity)
- Comparing results across methods

## Prerequisites

- Completed Playbook 01
- MLflow installed: `pip install mlflow`

## Step 1: Setup

In [None]:
import pandas as pd
import numpy as np
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

from sklearn.linear_model import LogisticRegression
from cr_score.features import ForwardSelector, BackwardSelector, StepwiseSelector

print("[OK] Libraries imported!")

[OK] Libraries imported!


In [2]:
# Load data
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

# Separate features and target
feature_cols = [col for col in train_df.columns 
                if col not in ['application_id', 'default']]
X_train = train_df[feature_cols]
y_train = train_df['default']
X_test = test_df[feature_cols]
y_test = test_df['default']

print(f"Training data: {len(X_train)} samples, {len(feature_cols)} features")
print(f"Test data: {len(X_test)} samples")

Training data: 3500 samples, 14 features
Test data: 1500 samples


## Step 2: Forward Selection

Start with no features and, at each step, add the one feature that improves the score most, up to `max_features`.

In [None]:
# Create estimator (model to use for selection)
estimator = LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs')

# Create forward selector
forward = ForwardSelector(
    estimator=estimator,
    max_features=8,
    use_mlflow=False  # Disable MLflow for simplicity
)

# Fit (this will take a minute as it evaluates many feature combinations)
print("Running forward selection... (this may take 1-2 minutes)")
forward.fit(X_train, y_train)

# Get selected features
selected_features = forward.get_selected_features()

print(f"\nForward Selection Results:")
print(f"  Selected {len(selected_features)} features")
print(f"  Features: {selected_features}")
print(f"  Best AUC: {forward.best_score_:.3f}")

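To make the greedy-addition idea concrete, here is a minimal sketch of forward selection written directly against scikit-learn. It is not the `ForwardSelector` implementation: the function name `greedy_forward_selection`, the 3-fold cross-validated ROC AUC scoring, and the "stop when nothing improves" rule are assumptions chosen for illustration, and the data is synthetic rather than the playbook's `train.csv`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def greedy_forward_selection(X, y, estimator, max_features, cv=3):
    """Illustrative greedy forward selection scored by cross-validated ROC AUC."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Score each remaining feature when added to the current subset.
        trial_scores = {}
        for j in remaining:
            cols = selected + [j]
            trial_scores[j] = cross_val_score(
                estimator, X[:, cols], y, cv=cv, scoring="roc_auc"
            ).mean()
        best_j = max(trial_scores, key=trial_scores.get)
        # Assumed stopping rule: quit once no candidate improves the score.
        if trial_scores[best_j] <= best_score:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = trial_scores[best_j]
    return selected, best_score


# Synthetic stand-in data (hypothetical, not the playbook's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
demo_selected, demo_auc = greedy_forward_selection(
    X_demo, y_demo, LogisticRegression(max_iter=1000), max_features=4
)
print(f"Selected column indices: {demo_selected}, CV AUC: {demo_auc:.3f}")
```

Each round fits one model per remaining feature, so the cost grows roughly with the number of features times `max_features`, which is why the selection cells take a minute or two.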

## Step 3: Backward Elimination

Start with all features and, at each step, remove the one feature whose removal hurts the score least, keeping at least `min_features`.

In [None]:
# Create backward selector
estimator = LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs')

backward = BackwardSelector(
    estimator=estimator,
    min_features=5,
    use_mlflow=False
)

# Fit
print("Running backward elimination... (this may take 1-2 minutes)")
backward.fit(X_train, y_train)

# Get selected features
selected_features = backward.get_selected_features()

print(f"\nBackward Elimination Results:")
print(f"  Selected {len(selected_features)} features")
print(f"  Features: {selected_features}")
print(f"  Best AUC: {backward.best_score_:.3f}")
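As a cross-check, backward elimination can also be run with scikit-learn's built-in `SequentialFeatureSelector`. The sketch below uses synthetic data and assumed settings (3-fold cross-validation, ROC AUC scoring). One difference to keep in mind: here `n_features_to_select=5` is a fixed target, which may not match how `BackwardSelector` treats `min_features`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical, not the playbook's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Start from all 8 columns and drop one per round until 5 remain.
sfs_backward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="backward",
    scoring="roc_auc",
    cv=3,
)
sfs_backward.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
kept_mask = sfs_backward.get_support()
print("Kept column indices:", [i for i, keep in enumerate(kept_mask) if keep])
```

If both approaches keep a similar set of features on the same data, that is some reassurance the selection is stable rather than an artifact of one implementation.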

## Step 4: Stepwise Selection

Bidirectional: at each step the selector can either add a new feature or remove one that was selected earlier.

In [None]:
# Create stepwise selector
estimator = LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs')

stepwise = StepwiseSelector(
    estimator=estimator,
    max_features=8,
    use_mlflow=False
)

# Fit
print("Running stepwise selection... (this may take 1-2 minutes)")
stepwise.fit(X_train, y_train)

# Get selected features
selected_features = stepwise.get_selected_features()

print(f"\nStepwise Selection Results:")
print(f"  Selected {len(selected_features)} features")
print(f"  Features: {selected_features}")
print(f"  Best AUC: {stepwise.best_score_:.3f}")
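The bidirectional idea can be sketched as a loop that alternates one forward step with one backward step and stops when neither helps. This is an illustrative version only, not the `StepwiseSelector` implementation: the helper names, the `min_gain` threshold, and the cross-validated ROC AUC scoring are all assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def cv_auc(estimator, X, y, cols, cv=3):
    """Mean cross-validated ROC AUC using only the given column indices."""
    return cross_val_score(estimator, X[:, cols], y, cv=cv, scoring="roc_auc").mean()


def stepwise_selection(X, y, estimator, max_features, cv=3, min_gain=1e-4):
    """Illustrative stepwise selection: alternate add and remove steps."""
    selected, best_score = [], -np.inf
    while True:
        improved = False
        # Forward step: try adding each unused feature.
        if len(selected) < max_features:
            unused = [j for j in range(X.shape[1]) if j not in selected]
            add_scores = {j: cv_auc(estimator, X, y, selected + [j], cv) for j in unused}
            if add_scores:
                j_add = max(add_scores, key=add_scores.get)
                if add_scores[j_add] > best_score + min_gain:
                    selected.append(j_add)
                    best_score = add_scores[j_add]
                    improved = True
        # Backward step: try dropping each selected feature.
        if len(selected) > 1:
            drop_scores = {
                j: cv_auc(estimator, X, y, [k for k in selected if k != j], cv)
                for j in selected
            }
            j_drop = max(drop_scores, key=drop_scores.get)
            if drop_scores[j_drop] > best_score + min_gain:
                selected.remove(j_drop)
                best_score = drop_scores[j_drop]
                improved = True
        # Every accepted move raises the score, so the loop must terminate.
        if not improved:
            return selected, best_score


# Synthetic stand-in data (hypothetical, not the playbook's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
step_selected, step_auc = stepwise_selection(
    X_demo, y_demo, LogisticRegression(max_iter=1000), max_features=4
)
print(f"Selected column indices: {step_selected}, CV AUC: {step_auc:.3f}")
```

Requiring a strict improvement of at least `min_gain` for every move is a simple way to prevent the loop from cycling between adding and removing the same feature.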

## Step 5: Compare Methods

Compare the three methods on their best AUC and on how many features each one selected.

In [None]:
# Compare results
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

methods = ['Forward', 'Backward', 'Stepwise']
aucs = [forward.best_score_, backward.best_score_, stepwise.best_score_]
n_features = [
    len(forward.get_selected_features()),
    len(backward.get_selected_features()),
    len(stepwise.get_selected_features())
]

x = np.arange(len(methods))
width = 0.35

ax.bar(x - width/2, aucs, width, label='AUC', color='skyblue')
ax.bar(x + width/2, [n/10 for n in n_features], width, 
       label='# Features / 10', color='lightcoral')

ax.set_xlabel('Method')
ax.set_ylabel('AUC / (number of features / 10)')
ax.set_title('Feature Selection Method Comparison')
ax.set_xticks(x)
ax.set_xticklabels(methods)
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nSummary:")
for method, auc, n_feat in zip(methods, aucs, n_features):
    print(f"  {method:12s}: AUC={auc:.3f}, Features={n_feat}")
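The comparison above relies on each selector's `best_score_`, and the playbook does not state how that score is computed. A held-out check is a useful complement: refit the estimator on each selected feature list and score it on data not used during selection. The sketch below shows the pattern with a hypothetical helper `holdout_auc` and synthetic data. In the notebook the same helper could be called with `X_train`, `y_train`, `X_test`, `y_test` and each selector's `get_selected_features()` result, assuming those are lists of column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def holdout_auc(features, X_tr, y_tr, X_te, y_te):
    """Fit on the training split using only `features`, return test-split ROC AUC."""
    model = LogisticRegression(max_iter=1000, random_state=42, solver="lbfgs")
    model.fit(X_tr[features], y_tr)
    proba = model.predict_proba(X_te[features])[:, 1]
    return roc_auc_score(y_te, proba)


# Synthetic stand-in data with named columns (hypothetical, not train.csv).
X_arr, y_arr = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_df = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_df, y_arr, test_size=0.3, random_state=0, stratify=y_arr
)

# Two example feature subsets standing in for the selectors' outputs.
demo_subsets = {"all": list(X_df.columns), "first_three": ["f0", "f1", "f2"]}
demo_results = {
    name: holdout_auc(cols, X_tr, y_tr, X_te, y_te)
    for name, cols in demo_subsets.items()
}
for name, auc in demo_results.items():
    print(f"  {name:12s}: test AUC={auc:.3f}")
```

A subset whose test AUC is close to its selection-time score is less likely to have been overfit by the greedy search.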

## Summary

You learned how to:
- Apply forward selection (greedy addition)
- Apply backward elimination (greedy removal)
- Apply stepwise selection (bidirectional)
- Compare different methods

**Next:** Playbook 03 for visualization and reporting!