<a href="https://colab.research.google.com/github/asifahsaan/data-preprocessing-beginners/blob/main/notebooks/09_feature_selection_filter_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 09 — Feature Selection: Filter Methods
In this notebook, we’ll learn how to screen out weakly informative features using statistical scores **before model training**.

We'll cover:
- What filter methods are and when to use them
- Using `SelectKBest` with `chi2`, `f_classif`, and `mutual_info_classif`
- Integrating filter selection in a `Pipeline`

## 1. What Are Filter Methods?

**Filter methods** score each feature independently of any model, using a statistical measure of its relationship with the target:

| Method                | Use case                                                               | Works with                                         |
|-----------------------|------------------------------------------------------------------------|----------------------------------------------------|
| `chi2`                | Chi-squared statistic between each feature and a categorical target    | Non-negative features (counts, frequencies, flags) |
| `f_classif`           | ANOVA F-value for classification tasks (linear dependence)             | Numerical features                                 |
| `mutual_info_classif` | Mutual information with the target; captures nonlinear relationships   | Categorical + continuous                           |

These are simple, fast, and useful for an **initial feature screening**.
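
To make "scored independently of any model" concrete, here is a minimal sketch on synthetic data. The names `X_demo`, `y_demo`, `informative`, and `noise` are made up for this illustration and are not part of the notebook's dataset. One column is built to depend on the class label and the other is pure noise, so the F-score should separate them without any model being trained.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Hypothetical toy data: 50 samples per class.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)

# Column 0 shifts with the class label; column 1 is unrelated noise.
informative = y_demo + rng.normal(scale=0.5, size=100)
noise = rng.normal(size=100)
X_demo = np.column_stack([informative, noise])

# f_classif returns one F-statistic and one p-value per feature.
f_scores, p_values = f_classif(X_demo, y_demo)

for name, f, p in zip(["informative", "noise"], f_scores, p_values):
    print(f"{name:12s} F = {f:8.2f}   p = {p:.3g}")
```

Each feature gets its own score, so this kind of test says nothing about how features interact or overlap with one another.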

## 2. Load Sample Dataset

In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load toy classification dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X.head()

## 3. SelectKBest with `chi2` (Categorical Target + Non-Negative Features)

In [None]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

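# chi2 requires non-negative features, so scale each one to [0, 1].
# Note: chi2 is designed for counts/frequencies; on scaled continuous
# values, treat the resulting scores as a rough heuristic.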
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply SelectKBest with chi2
selector = SelectKBest(score_func=chi2, k=5)
X_kbest = selector.fit_transform(X_scaled, y)

# Show selected feature names
selected_cols = X.columns[selector.get_support()]
print("Top features selected by chi2:\n", selected_cols.tolist())
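
The selected names alone hide how close the cutoff was. As a small sketch (the variable name `chi2_ranking` is introduced here for illustration), the fitted selector's `scores_` and `pvalues_` attributes can be put into a table to see the full ranking. The block reloads the data so it can run on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Reload the same dataset so this cell is self-contained.
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Same preprocessing and selector as in the cell above.
X_scaled = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=chi2, k=5).fit(X_scaled, y)

# One row per feature, highest chi2 score first.
chi2_ranking = (
    pd.DataFrame({
        "feature": X.columns,
        "chi2_score": selector.scores_,
        "p_value": selector.pvalues_,
    })
    .sort_values("chi2_score", ascending=False)
    .reset_index(drop=True)
)

print(chi2_ranking.head(10))
```

Looking at the rows just below the top 5 is a quick way to judge whether `k=5` is a natural cutoff or an arbitrary one.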

## 4. SelectKBest with `f_classif` (ANOVA F-test)

In [None]:
from sklearn.feature_selection import f_classif

selector = SelectKBest(score_func=f_classif, k=5)
X_kbest_f = selector.fit_transform(X, y)

selected_cols_f = X.columns[selector.get_support()]
print("Top features selected by f_classif:\n", selected_cols_f.tolist())

## 5. SelectKBest with `mutual_info_classif`

In [None]:
from sklearn.feature_selection import mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_kbest_mi = selector.fit_transform(X, y)

selected_cols_mi = X.columns[selector.get_support()]
print("Top features selected by mutual_info_classif:\n", selected_cols_mi.tolist())
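
One practical detail: `mutual_info_classif` adds a small amount of random noise to continuous features, so its scores can vary slightly between runs. A possible way to make the selection repeatable is to fix its `random_state` with `functools.partial`. The names `mi_score`, `X_mi`, `y_mi`, `selector_a`, and `selector_b` below are introduced only for this sketch.

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Reload the data as a DataFrame so this cell is self-contained.
X_mi, y_mi = load_breast_cancer(return_X_y=True, as_frame=True)

# Bind random_state so every call to the score function is deterministic.
mi_score = partial(mutual_info_classif, random_state=0)

# Fit two independent selectors with the same seeded score function.
selector_a = SelectKBest(score_func=mi_score, k=5).fit(X_mi, y_mi)
selector_b = SelectKBest(score_func=mi_score, k=5).fit(X_mi, y_mi)

features_a = X_mi.columns[selector_a.get_support()].tolist()
features_b = X_mi.columns[selector_b.get_support()].tolist()

print("Run A:", features_a)
print("Run B:", features_b)
print("Identical scores:", np.array_equal(selector_a.scores_, selector_b.scores_))
```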

## 6. Use in Pipeline (Optional but Valuable)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("select", SelectKBest(score_func=chi2, k=5)),
    ("model", LogisticRegression())
])

pipeline.fit(X, y)
print("Pipeline trained successfully with selected features.")
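
A main benefit of wrapping selection in a `Pipeline` is that the scaler and selector are refit on the training portion of each cross-validation fold, so the held-out fold never influences which features are chosen. The sketch below illustrates this with `cross_val_score`. The names `cv_pipeline`, `X_cv`, and `y_cv` are introduced here, and `max_iter=1000` is an assumption added only to reduce the chance of convergence warnings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Reload the data as a DataFrame so this cell is self-contained.
X_cv, y_cv = load_breast_cancer(return_X_y=True, as_frame=True)

cv_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("select", SelectKBest(score_func=chi2, k=5)),
    ("model", LogisticRegression(max_iter=1000)),  # max_iter is an assumption
])

# Each fold refits the scaler, selector, and model on its training split only.
scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5)
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))

# Fit once on all data to inspect which features the selector kept.
cv_pipeline.fit(X_cv, y_cv)
mask = cv_pipeline.named_steps["select"].get_support()
print("Selected features:", X_cv.columns[mask].tolist())
```

Selecting features on the full dataset first and cross-validating afterwards would let test-fold information leak into the selection step, which tends to make the scores look better than they are.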

## Summary
In this notebook, we:

* Explored filter-based feature selection: `chi2`, `f_classif`, `mutual_info_classif`
* Selected top-k features statistically before training
* Integrated selection into a `Pipeline` so that scaling, selection, and the model are fit together as one estimator

## ⏭ What’s Next?
In the next notebook, `10_feature_selection_wrapper_methods.ipynb`, we’ll dive into **wrapper-based feature selection** like **Recursive Feature Elimination (RFE)**, where a model guides which features to keep.