# 🧩 Feature Selection for Absolute Beginners

**Learning Goal:** Learn how to choose the best features for your machine learning models to make them work better and faster.

## Prerequisites

In [19]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

---

## 🔍 STEP 1: What are features?

### Q1: When building a machine learning model, what information do we give it?

**Answer:** We give it features - these are the input variables that describe our data.

**Simple Example:**
Think of predicting house prices. Our features might be:
- Number of bedrooms
- House size
- Location
- Age of house

In [20]:
# Create a simple house price dataset
house_data = {
    'bedrooms': [2, 3, 4, 2, 5],
    'size_sqft': [1000, 1500, 2000, 900, 2500],
    'age_years': [10, 5, 15, 20, 2],
    'location': ['City', 'Suburb', 'City', 'Rural', 'City'],
    'price': [200000, 300000, 400000, 150000, 500000]  # This is our target
}

df = pd.DataFrame(house_data)
print("Our house dataset:")
print(df)

Our house dataset:
   bedrooms  size_sqft  age_years location   price
0         2       1000         10     City  200000
1         3       1500          5   Suburb  300000
2         4       2000         15     City  400000
3         2        900         20    Rural  150000
4         5       2500          2     City  500000


**Key Point:** Features are the characteristics we use to make predictions. The target (price) is what we want to predict.
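The split between features and target can be made explicit in code. The sketch below rebuilds the same toy data so it runs on its own; the names `houses`, `X_all`, and `y_all` are illustrative and not used elsewhere in this notebook.

```python
import pandas as pd

# Same toy data as above, rebuilt so this cell runs on its own
houses = pd.DataFrame({
    "bedrooms": [2, 3, 4, 2, 5],
    "size_sqft": [1000, 1500, 2000, 900, 2500],
    "age_years": [10, 5, 15, 20, 2],
    "location": ["City", "Suburb", "City", "Rural", "City"],
    "price": [200000, 300000, 400000, 150000, 500000],
})

target_name = "price"
X_all = houses.drop(columns=[target_name])  # every column except the target
y_all = houses[target_name]                 # the column we want to predict

print("Feature columns:", list(X_all.columns))
print("Target column:", y_all.name)
```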

---

## 🔍 STEP 2: Why feature selection matters

### Q2: Why can't we just use all available features?

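**Answer:** Every extra feature gives the model another chance to latch onto patterns that exist only by coincidence in the training data, which can make its predictions on new data worse.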

**Problems with too many features:**
1. **Overfitting** - The model memorizes quirks of the training data instead of learning patterns that generalize
2. **Slow training** - More features = more computation
3. **Noise** - Irrelevant features hide the signal from the useful ones


In [21]:
# Example: Adding useless features
df['house_id'] = [1, 2, 3, 4, 5]  # Just an ID number
df['random_number'] = [42, 17, 88, 33, 91]  # Random noise

print("Dataset with useless features:")
print(df)

# These features won't help predict price!
print("\nUseful features: bedrooms, size_sqft, age_years, location")
print("Useless features: house_id, random_number")

Dataset with useless features:
   bedrooms  size_sqft  age_years location   price  house_id  random_number
0         2       1000         10     City  200000         1             42
1         3       1500          5   Suburb  300000         2             17
2         4       2000         15     City  400000         3             88
3         2        900         20    Rural  150000         4             33
4         5       2500          2     City  500000         5             91

Useful features: bedrooms, size_sqft, age_years, location
Useless features: house_id, random_number


**Simple Rule:** More features ≠ Better model. We want the RIGHT features, not ALL features.
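To see the rule in action, here is a small sketch on made-up data (not the house dataset, which is too small to measure anything). It assumes a synthetic target driven by one useful feature and compares cross-validated R² with and without 25 pure-noise columns. The exact numbers depend on the random seed, but the noisy version typically scores noticeably lower.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
n_samples = 40

# One genuinely useful feature plus a little measurement noise in the target
useful = rng.normal(size=(n_samples, 1))
target = 3.0 * useful[:, 0] + rng.normal(scale=0.5, size=n_samples)

# 25 columns of pure noise that have nothing to do with the target
noise = rng.normal(size=(n_samples, 25))
with_noise = np.hstack([useful, noise])

# 5-fold cross-validated R² for each version of the feature matrix
score_clean = cross_val_score(LinearRegression(), useful, target, cv=5).mean()
score_noisy = cross_val_score(LinearRegression(), with_noise, target, cv=5).mean()

print(f"R² with 1 useful feature:        {score_clean:.3f}")
print(f"R² with 25 extra noise features: {score_noisy:.3f}")
```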

---

## 🔍 STEP 3: Manual Selection

### Q3: How do we manually remove bad features?

**Answer:** Use common sense and domain knowledge to remove obviously useless features.

**What to remove:**
1. **ID columns** (like customer_id, order_number)
2. **Random data** (meaningless numbers)
3. **Duplicate information** (age in years AND age in days)

In [22]:
# Manual feature removal
print("Before manual selection:")
print(f"Features: {list(df.columns)}")

# Remove useless features
features_to_remove = ['house_id', 'random_number']
df_clean = df.drop(columns=features_to_remove)

print("\nAfter manual selection:")
print(f"Features: {list(df_clean.columns)}")
print(f"Removed {len(features_to_remove)} useless features")

Before manual selection:
Features: ['bedrooms', 'size_sqft', 'age_years', 'location', 'price', 'house_id', 'random_number']

After manual selection:
Features: ['bedrooms', 'size_sqft', 'age_years', 'location', 'price']
Removed 2 useless features


**Simple Guidelines:**
- Remove ID numbers
- Remove random/meaningless data  
- Remove duplicate information
- Keep features that make business sense
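Some of these checks can be partly automated. The sketch below uses two hypothetical helpers, `find_counter_columns` and `find_redundant_columns`, which are simple heuristics written for this example and not part of pandas. They only flag candidates; the final decision still needs common sense.

```python
import pandas as pd

def find_counter_columns(frame):
    """Flag integer columns that count up by exactly 1 per row (typical of row IDs)."""
    flagged = []
    for name in frame.columns:
        if not pd.api.types.is_integer_dtype(frame[name]):
            continue
        steps = frame[name].diff().dropna()
        if len(steps) > 0 and (steps == 1).all():
            flagged.append(name)
    return flagged

def find_redundant_columns(frame, threshold=0.999):
    """Flag numeric columns almost perfectly correlated with an earlier column."""
    corr = frame.select_dtypes(include="number").corr().abs()
    flagged = []
    for i, name in enumerate(corr.columns):
        # Only compare against columns to the left so one column of each pair is kept
        if (corr.iloc[:i, i] > threshold).any():
            flagged.append(name)
    return flagged

toy = pd.DataFrame({
    "house_id": [1, 2, 3, 4, 5],
    "age_years": [10, 5, 15, 20, 2],
    "age_days": [3650, 1825, 5475, 7300, 730],   # same information as age_years
    "bedrooms": [2, 3, 4, 2, 5],
})

print("Counter-like columns:", find_counter_columns(toy))   # ['house_id']
print("Redundant columns:", find_redundant_columns(toy))    # ['age_days']
```

A shuffled dataset would hide a counter-style ID from the first helper, so treat an empty result as "nothing obvious found", not as proof that no ID column exists.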

---

## 🔍 STEP 4: Quantitative Techniques

Now we use statistical methods to find the best features automatically.

### A. Filter Methods (Statistical Tests)

**What they do:** Score each feature on its own by its statistical relationship with the target (here, an F-test), without training a predictive model.

In [23]:
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer


X = df_clean[['bedrooms', 'size_sqft', 'age_years']]
y = df_clean['price']

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),      # handle NaNs if any
    ("kbest",   SelectKBest(score_func=f_regression, k=2))
])

X_selected = pipe.fit_transform(X, y)

# map support back to original columns
support = pipe.named_steps["kbest"].get_support()
scores = pipe.named_steps["kbest"].scores_

print("Feature importance scores:")
for feat, s in zip(X.columns, scores):
    print(f"{feat}: {s:.2f}")

selected_features = X.columns[support]
print(f"\nTop 2 features selected: {list(selected_features)}")


Feature importance scores:
bedrooms: 164.28
size_sqft: 496.65
age_years: 1.94

Top 2 features selected: ['bedrooms', 'size_sqft']
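Where do these scores come from? For a single feature, the F-score reported by `f_regression` can be rebuilt from the Pearson correlation `r` between that feature and the target as `r² / (1 - r²) * (n - 2)`, where `n` is the number of rows. The sketch below checks this by hand for `size_sqft`; the result should agree with the 496.65 printed above.

```python
import numpy as np
from sklearn.feature_selection import f_regression

size_sqft = np.array([1000, 1500, 2000, 900, 2500], dtype=float)
price = np.array([200000, 300000, 400000, 150000, 500000], dtype=float)

# Pearson correlation between one feature and the target
r = np.corrcoef(size_sqft, price)[0, 1]

# F-statistic for a single-feature linear fit: r² / (1 - r²) * (n - 2)
n = len(price)
f_by_hand = r**2 / (1 - r**2) * (n - 2)

# scikit-learn's version (expects a 2-D feature matrix)
f_sklearn, p_value = f_regression(size_sqft.reshape(-1, 1), price)

print(f"correlation r:      {r:.4f}")
print(f"F-score by hand:    {f_by_hand:.2f}")
print(f"F-score by sklearn: {f_sklearn[0]:.2f}")
```

So a high score simply means a strong straight-line relationship with the target. A feature that matters only in combination with others, or in a non-linear way, can score low here.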


In [30]:
# f_classif needs class labels, so bin price into 3 groups (0 = low, 1 = mid, 2 = high)
y_cls = pd.qcut(df_clean['price'], q=3, labels=False)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y_cls)

# Get selected feature names
selected_features = X.columns[selector.get_support()]

print("\nTop 2 features selected:", list(selected_features))


Top 2 features selected: ['bedrooms', 'size_sqft']


### B. Wrapper Methods (Model-Based)

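**What they do:** Train an actual model on different subsets of features and keep the subset that works best. The example here uses recursive feature elimination (RFE), which repeatedly fits the model and drops the weakest feature.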

In [24]:
# Simple wrapper method example
from sklearn.feature_selection import RFE

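# Use a simple model to rank features.
# Caveat: price is continuous, so LogisticRegression (a classifier) treats each
# distinct price as its own class; a regressor is the more natural choice here.
model = LogisticRegression(max_iter=9000)
selector = RFE(model, n_features_to_select=2)

# RFE fits the model, drops the feature with the smallest coefficients, and repeats
X_selected = selector.fit_transform(X, y)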
selected_features = X.columns[selector.support_]

print(f"Wrapper method selected: {list(selected_features)}")

Wrapper method selected: ['size_sqft', 'age_years']
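Because `price` is a continuous number, a regression model is arguably a better fit for RFE than a classifier. The following is a minimal sketch of that variant, assuming `LinearRegression` as the estimator and scaled inputs. It is an alternative to try, not a replacement for the result above, and with only five rows the ranking should not be taken seriously.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame({
    "bedrooms": [2, 3, 4, 2, 5],
    "size_sqft": [1000, 1500, 2000, 900, 2500],
    "age_years": [10, 5, 15, 20, 2],
})
price = pd.Series([200000, 300000, 400000, 150000, 500000], name="price")

# Scale first: RFE ranks features by coefficient size, which depends on units
scaled = StandardScaler().fit_transform(features)

# A regressor, because price is a continuous number rather than a class label
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(scaled, price)

# ranking_ is 1 for kept features; larger numbers were eliminated earlier
for name, kept, rank in zip(features.columns, rfe.support_, rfe.ranking_):
    print(f"{name}: kept={kept}, rank={rank}")
```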


### C. Embedded Methods (Built-in Selection)

**What they do:** Some models automatically tell us which features are important.

In [25]:
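# Random Forest automatically calculates feature importance.
# Caveat: price is continuous, so RandomForestRegressor would be the natural choice;
# the classifier used here treats each distinct price as a separate class.
rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X, y)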

# Get feature importance scores
importance_scores = rf.feature_importances_

print("Random Forest feature importance:")
for feature, importance in zip(X.columns, importance_scores):
    print(f"{feature}: {importance:.3f}")

# Select features above threshold
threshold = 0.1  # Keep features with >10% importance
important_features = [feature for feature, importance in 
                     zip(X.columns, importance_scores) 
                     if importance > threshold]

print(f"\nImportant features (>{threshold}): {important_features}")

Random Forest feature importance:
bedrooms: 0.291
size_sqft: 0.363
age_years: 0.346

Important features (>0.1): ['bedrooms', 'size_sqft', 'age_years']
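For a continuous target like price, the regression version of the same model is `RandomForestRegressor`. Here is a minimal sketch of that variant on the same three numeric features. With only five rows the importances are unstable, so no particular ranking is claimed here; the point is only the pattern of fitting and then reading `feature_importances_`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

features = pd.DataFrame({
    "bedrooms": [2, 3, 4, 2, 5],
    "size_sqft": [1000, 1500, 2000, 900, 2500],
    "age_years": [10, 5, 15, 20, 2],
})
price = pd.Series([200000, 300000, 400000, 150000, 500000], name="price")

# Regressor instead of classifier, because price is a number, not a category
forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(features, price)

# Importances are non-negative and add up to 1 across all features
importances = pd.Series(forest.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).round(3))
```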


**Summary of Methods:**
- **Filter:** Fast statistical tests
- **Wrapper:** Test with actual models (slower, but takes into account how features work together inside the model)
- **Embedded:** Built into some models (like Random Forest)

---

## 🔍 STEP 5: Handling categorical features

### Q5: What about non-numeric features like location?

**Answer:** Convert them to numbers using dummy variables.

In [26]:
# Our location feature is text, not numbers
print("Original location column:")
print(df_clean['location'].unique())

# Convert to dummy variables (one-hot encoding)
df_encoded = pd.get_dummies(df_clean, columns=['location'])
print("\nAfter creating dummy variables:")
print(df_encoded.columns.tolist())

# Now each location becomes a separate 0/1 column
print("\nSample of encoded data:")
print(df_encoded[['location_City', 'location_Rural', 'location_Suburb']].head())

Original location column:
['City' 'Suburb' 'Rural']

After creating dummy variables:
['bedrooms', 'size_sqft', 'age_years', 'price', 'location_City', 'location_Rural', 'location_Suburb']

Sample of encoded data:
   location_City  location_Rural  location_Suburb
0           True           False            False
1          False           False             True
2           True           False            False
3          False            True            False
4           True           False            False


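**Key Point:** Machine learning needs numbers, so we convert categories into 0/1 columns. Recent pandas versions display these columns as `True`/`False`, which behave like 1/0 in calculations.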
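If you prefer to see literal 0s and 1s, `pd.get_dummies` accepts a `dtype` argument. A small sketch on the location column alone (the `locations` frame is rebuilt here so the cell runs on its own):

```python
import pandas as pd

locations = pd.DataFrame({"location": ["City", "Suburb", "City", "Rural", "City"]})

# dtype=int asks for 0/1 integers instead of the default True/False
encoded = pd.get_dummies(locations, columns=["location"], dtype=int)
print(encoded)
```

The three columns always add up to 1 in every row, so one of them carries no extra information. Passing `drop_first=True` removes that redundant column, which some linear models benefit from.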

---

## 🔍 STEP 6: Using Pipelines

### Q6: How do we combine preprocessing and feature selection?

**Answer:** Use pipelines to chain steps together.

In [39]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression

# Create a regression pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),                    # Step 1: Scale the data
    ('selector', SelectKBest(score_func=f_regression, k=3)),  # Step 2: Keep the k best features (k=3 keeps all 3 here; use a smaller k to drop some)
    ('model', LinearRegression())                    # Step 3: Train regression model
])

# Prepare features
X = df_encoded[['bedrooms', 'size_sqft', 'age_years']]
y = df_encoded['price']  # Keep price as continuous

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the pipeline
pipeline.fit(X_train, y_train)

# Test the pipeline (returns R² for regression; with only 2 test rows this is a very rough estimate)
score = pipeline.score(X_test, y_test)
print(f"Pipeline R² score: {score:.3f}")

# See which features were selected
selected_features = X.columns[pipeline.named_steps['selector'].get_support()]
print(f"Pipeline selected features: {list(selected_features)}")

Pipeline R² score: 0.772
Pipeline selected features: ['bedrooms', 'size_sqft', 'age_years']


**Why use pipelines?**
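- Keeps all steps in one object, in a fixed order
- Prevents mistakes such as data leakage, because every step is fit on the training data only
- Easy to reuse on new data
- Applies the same preprocessing to training and test data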

---


### Key Takeaways:

| What to Do | Why | How |
|------------|-----|-----|
| Remove IDs | They don't predict anything | `df.drop(columns=['customer_id'])` |
| Convert categories | ML needs numbers | `pd.get_dummies()` |
| Test feature importance | Find what matters | `SelectKBest()` or a random forest's `feature_importances_` |
| Use pipelines | Stay organized | `Pipeline([steps])` |
| Start simple | Better to understand | Begin with few features |


**Remember:** The goal is to find the features that help your model make better predictions. Start simple, remove the obvious bad features, then use the tools to find the best ones!