
# 🛠️ Feature Engineering & Selection

This notebook provides **code templates and checklists** for **creating, transforming, and selecting features** to improve machine learning models.

### 🔹 What’s Covered:
- Creating new features (interaction terms, binning, encoding)
- Handling categorical features
- Feature scaling & normalization
- Feature selection techniques (filter, wrapper, embedded methods)


In [None]:

# Ensure required libraries are installed (Uncomment if necessary)
# !pip install pandas numpy scikit-learn



## 🔨 Creating New Features

✅ Combine existing features to create interaction terms.  
✅ Use **binning** to convert numerical features into categories.  
✅ Extract useful information from timestamps (year, month, weekday).  


In [None]:

import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    'age': [22, 35, 46, 57, 28],
    'salary': [50000, 70000, 90000, 120000, 65000],
    'signup_date': pd.to_datetime(["2021-01-15", "2020-06-23", "2019-11-12", "2022-03-01", "2021-09-30"])
})

# Create interaction term (e.g., age * salary)
df['age_salary_interaction'] = df['age'] * df['salary']

# Create bins for age groups (right=False makes each bin [lower, upper), so age 30 falls in "30s")
df['age_group'] = pd.cut(df['age'], bins=[20, 30, 40, 50, 60], labels=["20s", "30s", "40s", "50s"], right=False)

# Extract year, month, and weekday from signup date
df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
df['signup_weekday'] = df['signup_date'].dt.weekday  # Monday=0, Sunday=6

print(df.head())

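Bin edges deserve a quick check before relying on the resulting groups. The sketch below, using made-up ages chosen to sit on the edges, compares `pd.cut`'s default right-closed intervals with `right=False`. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical ages picked to land exactly on the bin edges
ages = pd.Series([20, 29, 30, 39, 40])
edges = [20, 30, 40, 50]
labels = ["20s", "30s", "40s"]

# Default (right=True): intervals are (20, 30], (30, 40], (40, 50]
right_closed = pd.cut(ages, bins=edges, labels=labels)

# right=False: intervals are [20, 30), [30, 40), [40, 50)
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

comparison = pd.DataFrame({
    "age": ages,
    "right_closed": right_closed,
    "left_closed": left_closed,
})
print(comparison)
```

With the default setting, an age of exactly 30 is labelled "20s" and an age of exactly 20 falls outside every bin (NaN). With `right=False`, each label matches its decade.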


## 🔤 Encoding Categorical Variables

✅ Convert categorical features into numerical representations.  
✅ Use **One-Hot Encoding** for non-ordinal categories.  
✅ Use **Ordinal (Label) Encoding** for ordinal categories, with the category order specified explicitly.  


In [None]:

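from sklearn.preprocessing import OrdinalEncoder

# One-Hot Encoding (drop_first=True drops one dummy column to avoid redundancy)
df_encoded = pd.get_dummies(df, columns=['age_group'], drop_first=True)

# Ordinal encoding for ordered categories. The order is passed explicitly:
# LabelEncoder is intended for target labels and sorts classes alphabetically,
# which matches the true order only by coincidence.
age_order = ["20s", "30s", "40s", "50s"]
oe = OrdinalEncoder(categories=[age_order])
df['encoded_age_group'] = oe.fit_transform(df[['age_group']].astype(str)).ravel().astype(int)

print(df_encoded.head())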



## 📏 Feature Scaling & Normalization

✅ Rescale numerical features so that features measured in different units are comparable.  
✅ Use **Min-Max Scaling** (0-1 range) or **Standardization** (Z-score).  


In [None]:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max Scaling (rescales to the 0-1 range)
minmax_scaler = MinMaxScaler()
df['salary_scaled'] = minmax_scaler.fit_transform(df[['salary']]).ravel()

# Standardization (Z-score: zero mean, unit variance)
standard_scaler = StandardScaler()
df['salary_standardized'] = standard_scaler.fit_transform(df[['salary']]).ravel()

print(df.head())

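To make the two transformations concrete, the sketch below recomputes them by hand with NumPy on the same salary values and compares the results with scikit-learn's scalers. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Same salary values as the sample dataset, as a single-column 2D array
salary = np.array([[50000.0], [70000.0], [90000.0], [120000.0], [65000.0]])

# Min-max scaling: (x - min) / (max - min)
manual_minmax = (salary - salary.min()) / (salary.max() - salary.min())

# Z-score: (x - mean) / std, using the population std (ddof=0)
manual_zscore = (salary - salary.mean()) / salary.std()

sk_minmax = MinMaxScaler().fit_transform(salary)
sk_zscore = StandardScaler().fit_transform(salary)

print("Min-max matches:", np.allclose(manual_minmax, sk_minmax))
print("Z-score matches:", np.allclose(manual_zscore, sk_zscore))
```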


## 🏆 Feature Selection

✅ Use **Filter Methods** (e.g., correlation, mutual information).  
✅ Use **Wrapper Methods** (e.g., recursive feature elimination).  
✅ Use **Embedded Methods** (e.g., Lasso Regression, Decision Trees).  


In [None]:

from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

# Predict salary from the other columns. The target must not appear among the
# features, otherwise it is trivially selected as the "best" feature.
feature_cols = ['age', 'signup_year', 'signup_month']
X = df[feature_cols]
y = df['salary']

# Filter Method (select top k features by univariate F-test, i.e. linear correlation with y)
selector = SelectKBest(score_func=f_regression, k=2)
X_new = selector.fit_transform(X, y)
print("Selected Features:", list(X.columns[selector.get_support()]))

# Wrapper Method (Recursive Feature Elimination - RFE)
model = LinearRegression()
rfe = RFE(model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)
print("RFE Selected Features:", list(X.columns[rfe.support_]))

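The embedded methods listed above can be sketched with Lasso, whose L1 penalty drives the coefficients of uninformative features to exactly zero. The example uses synthetic data (an assumption, since the five-row sample dataset is too small to be meaningful here) in which only the first two of four features influence the target. The value `alpha=0.1` is an arbitrary illustrative choice; in practice it would typically be tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only features 0 and 1 drive the target, 2 and 3 are noise
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Lasso penalizes coefficient size, so features should share a common scale
X_scaled = StandardScaler().fit_transform(X)

# Keep features whose fitted Lasso coefficient is non-negligible
selector = SelectFromModel(Lasso(alpha=0.1), threshold=1e-5)
selector.fit(X_scaled, y)

mask = selector.get_support()
print("Lasso coefficients:", selector.estimator_.coef_)
print("Selected feature mask:", mask)
```

Unlike the filter and wrapper approaches, selection here happens as a by-product of fitting the model itself.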


## ✅ Best Practices & Common Pitfalls

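- **Avoid data leakage**: Fit feature transformations (scalers, encoders, selectors) **on the training data only**, then apply the fitted transformation to validation and test data.  
- **Don't over-engineer**: More features aren’t always better; test each new feature's impact on validation performance.  
- **Check feature importance**: Some transformations may not help and could add noise.  
- **Normalize before training**: Distance-based and gradient-based models (e.g., k-NN, SVMs, regularized linear models, neural networks) are sensitive to feature scales; tree-based models generally are not.  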
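
One way to guard against the data leakage described above is to wrap preprocessing and the model in a scikit-learn `Pipeline`, so that the scaler's statistics are learned from the training split only. The sketch below uses synthetic data, and the split ratio and model are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first feature plus noise
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = 2.0 * X[:, 0] + rng.normal(size=100)

# Split first, so the test rows never influence any fitted statistic
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() learns the scaler's mean/std from X_train only, then fits the model
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# score() reuses the training statistics to transform X_test before predicting
scaler = pipeline.named_steps["standardscaler"]
print("Scaler mean (training rows only):", scaler.mean_)
print("Test R^2:", pipeline.score(X_test, y_test))
```

The same pipeline object can be passed to cross-validation utilities, which refit the scaler inside each fold.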
