# Automated ML Pipeline - Complete Tutorial

**Author:** Anik Tahabilder  
**Project:** 13 of 22 - Kaggle ML Portfolio  
**Difficulty:** 7/10 | **Learning Value:** 9/10

---

## What Will You Learn?

This tutorial teaches **how to automate the entire ML workflow**.

| Topic | What You'll Understand |
|-------|------------------------|
| **ML Pipeline Concept** | Why pipelines matter for production |
| **Automated Preprocessing** | Handle missing values, scaling, encoding automatically |
| **Feature Selection** | Filter, Wrapper, Embedded methods |
| **Model Selection** | Compare multiple models automatically |
| **Hyperparameter Tuning** | Grid Search, Random Search, Bayesian Optimization |
| **Cross-Validation** | K-Fold, Stratified, Time Series splits |
| **Complete AutoML** | End-to-end automated pipeline |

---

## The AutoML Pipeline

```
┌──────────────────────────────────────────────────────────────────────────┐
│                        AUTOMATED ML PIPELINE                              │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   ┌─────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│   │  Data   │───>│Preprocessing│───>│  Feature    │───>│   Model     │  │
│   │         │    │             │    │  Selection  │    │  Selection  │  │
│   └─────────┘    └─────────────┘    └─────────────┘    └──────┬──────┘  │
│                                                               │         │
│   ┌─────────────────────────────────────────────────────────┘         │
│   │                                                                     │
│   v                                                                     │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                │
│   │Hyperparameter│───>│ Evaluation  │───>│ Deployment  │                │
│   │   Tuning    │    │             │    │             │                │
│   └─────────────┘    └─────────────┘    └─────────────┘                │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
```

---

## Table of Contents

1. [Part 1: Why AutoML Pipelines?](#part1)
2. [Part 2: Automated Preprocessing](#part2)
3. [Part 3: Feature Selection Methods](#part3)
4. [Part 4: Automated Model Selection](#part4)
5. [Part 5: Hyperparameter Tuning](#part5)
6. [Part 6: Cross-Validation Strategies](#part6)
7. [Part 7: Complete AutoML Pipeline](#part7)
8. [Part 8: Evaluation & Results](#part8)
9. [Part 9: Summary & Best Practices](#part9)

---

<a id='part1'></a>
# Part 1: Why AutoML Pipelines?

---

## 1.1 The Problem with Manual ML

| Manual Approach | Problems |
|-----------------|----------|
| Separate preprocessing | Data leakage risk |
| Manual feature selection | Time-consuming, biased |
| Try models one by one | Inefficient, miss best model |
| Hand-tune hyperparameters | Suboptimal, not reproducible |

## 1.2 Benefits of Pipelines

| Benefit | Description |
|---------|-------------|
| **No Data Leakage** | Preprocessing fitted only on training data |
| **Reproducibility** | Same pipeline, data, and random seed = same results |
| **Easy Deployment** | Single object to save/load |
| **Automation** | Less manual work, fewer errors |
| **Scalability** | Easy to add new steps |
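
As a minimal sketch of the "No Data Leakage" row: when the scaler lives inside the pipeline, `cross_val_score` refits it on the training portion of every fold. The synthetic data from `make_classification` and the step names `scaler` and `model` are assumptions for illustration, not part of the Titanic workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any tabular classification set works here)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaler and model travel together, so each CV fold fits the scaler
# on its own training split only.
leak_free = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(leak_free, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print(f"Leak-free CV accuracy: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

By contrast, calling `StandardScaler().fit_transform(X)` once before cross-validation lets every fold's validation rows influence the scaling statistics.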

## 1.3 Pipeline Components

```
sklearn.pipeline.Pipeline
│
├── Step 1: Preprocessor (ColumnTransformer)
│   ├── Numeric: Imputer → Scaler
│   └── Categorical: Imputer → Encoder
│
├── Step 2: Feature Selection (optional)
│   └── SelectKBest, RFE, etc.
│
└── Step 3: Model
    └── Any sklearn estimator
```
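
The tree above can be sketched as a single three-step `Pipeline`. The toy frame, its column names (`age`, `fare`, `sex`), and the labels are hypothetical and exist only so the example runs on its own; `k=3` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame with one missing value per column type
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "sex": ["male", "female", "female", np.nan, "male", "male", "female", "female"],
})
label = np.array([0, 1, 1, 0, 0, 1, 1, 0])

toy_preprocessor = ColumnTransformer([
    ("numeric", Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())]), ["age", "fare"]),
    ("categorical", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                              ("encoder", OneHotEncoder(handle_unknown="ignore"))]), ["sex"]),
])

toy_pipeline = Pipeline([
    ("preprocessor", toy_preprocessor),            # Step 1: impute, scale, encode
    ("selector", SelectKBest(f_classif, k=3)),     # Step 2: optional feature selection
    ("model", LogisticRegression(max_iter=1000)),  # Step 3: any sklearn estimator
])

toy_pipeline.fit(toy, label)
toy_pred = toy_pipeline.predict(toy)
print(toy_pred)
```

Because every step is fitted inside `toy_pipeline.fit`, the same object can later be handed to `cross_val_score` or a search class without extra preprocessing code.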

## 1.4 Key Sklearn Classes

| Class | Purpose |
|-------|--------|
| `Pipeline` | Chain transformers + estimator |
| `ColumnTransformer` | Apply different transforms to columns |
| `GridSearchCV` | Exhaustive hyperparameter search |
| `RandomizedSearchCV` | Random hyperparameter search |
| `cross_val_score` | Cross-validation scoring |

In [None]:
# ============================================================
# SETUP AND IMPORTS
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time
import warnings
warnings.filterwarnings('ignore')

# Sklearn - Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer

# Sklearn - Feature Selection
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.feature_selection import RFE, RFECV, SelectFromModel
from sklearn.feature_selection import VarianceThreshold

# Sklearn - Models
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Sklearn - Pipeline & Model Selection
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold, KFold, TimeSeriesSplit

# Sklearn - Metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

# Display settings
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("="*70)
print("AUTOMATED ML PIPELINE - TUTORIAL")
print("="*70)
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print("\nAll libraries loaded!")

In [ ]:
# ============================================================
# LOAD TITANIC DATASET
# ============================================================
print("="*70)
print("LOADING TITANIC DATASET")
print("="*70)

# ============================================================
# KAGGLE PATH CONFIGURATION
# ============================================================
# Dataset: https://www.kaggle.com/competitions/titanic
# Path: /kaggle/input/titanic

USE_KAGGLE = os.path.exists('/kaggle/input')

if USE_KAGGLE:
    TRAIN_PATH = '/kaggle/input/titanic/train.csv'
    TEST_PATH = '/kaggle/input/titanic/test.csv'
    
    if os.path.exists(TRAIN_PATH):
        df = pd.read_csv(TRAIN_PATH)
        print(f"✓ Loaded from: {TRAIN_PATH}")
    else:
        print("Titanic dataset not found!")
        print("Add the 'titanic' competition dataset in Kaggle")
        df = None
else:
    df = None

# Fallback: Create Titanic-like synthetic data
if df is None:
    print("\nCreating Titanic-like synthetic dataset...")
    print("(Add 'titanic' dataset in Kaggle for real data)")
    
    np.random.seed(42)
    n_samples = 891  # Same as real Titanic
    
    df = pd.DataFrame({
        'PassengerId': range(1, n_samples + 1),
        'Pclass': np.random.choice([1, 2, 3], n_samples, p=[0.24, 0.21, 0.55]),
        'Name': [f'Passenger_{i}' for i in range(n_samples)],
        'Sex': np.random.choice(['male', 'female'], n_samples, p=[0.65, 0.35]),
        'Age': np.random.normal(30, 14, n_samples).clip(1, 80),
        'SibSp': np.random.choice([0, 1, 2, 3, 4], n_samples, p=[0.68, 0.23, 0.05, 0.02, 0.02]),
        'Parch': np.random.choice([0, 1, 2, 3], n_samples, p=[0.76, 0.13, 0.09, 0.02]),
        'Ticket': [f'T{np.random.randint(10000, 99999)}' for _ in range(n_samples)],
        'Fare': np.random.exponential(32, n_samples).clip(0, 512),
        'Cabin': np.random.choice(['C85', 'B42', 'E101', np.nan], n_samples, p=[0.05, 0.05, 0.05, 0.85]),
        'Embarked': np.random.choice(['S', 'C', 'Q', np.nan], n_samples, p=[0.70, 0.19, 0.09, 0.02]),
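    })

    # np.random.choice casts np.nan to the string 'nan' when mixed with strings,
    # so convert those placeholders back to real missing values
    df[['Cabin', 'Embarked']] = df[['Cabin', 'Embarked']].replace('nan', np.nan)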
    
    # Create realistic Survived target
    survival_prob = (
        0.2 +
        0.3 * (df['Sex'] == 'female').astype(int) +
        0.2 * (df['Pclass'] == 1).astype(int) +
        0.1 * (df['Pclass'] == 2).astype(int) +
        -0.1 * (df['Age'] > 50).astype(int) +
        np.random.randn(n_samples) * 0.1
    ).clip(0, 1)
    df['Survived'] = (survival_prob > 0.5).astype(int)
    
    # Add missing values like real Titanic
    age_missing = np.random.random(n_samples) < 0.2
    df.loc[age_missing, 'Age'] = np.nan

target_col = 'Survived'

print(f"\n" + "="*50)
print("TITANIC DATASET SUMMARY")
print("="*50)
print(f"Shape: {df.shape}")
print(f"Target: {target_col}")
print(f"\nSurvival Distribution:")
print(df[target_col].value_counts())
print(f"  Survival Rate: {df[target_col].mean()*100:.1f}%")

print(f"\nMissing Values:")
missing = df.isnull().sum()
print(missing[missing > 0])

print(f"\nColumn Types:")
print(df.dtypes)

print(f"\nFirst few rows:")
df.head()

---

<a id='part2'></a>
# Part 2: Automated Preprocessing

---

In [None]:
# ============================================================
# CREATE PREPROCESSING PIPELINE
# ============================================================
print("="*70)
print("AUTOMATED PREPROCESSING PIPELINE")
print("="*70)

# ============================================================
# TITANIC FEATURE ENGINEERING
# ============================================================
# Drop columns that are not useful for prediction:
# - PassengerId: Just an identifier
# - Name: Text field (could extract titles but keeping it simple)
# - Ticket: Mostly unique values
# - Cabin: about 77% missing values in the real dataset

drop_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin']
df_clean = df.drop(columns=[col for col in drop_cols if col in df.columns])

print("Dropped columns (not useful for prediction):")
for col in drop_cols:
    if col in df.columns:
        print(f"  - {col}")

# Separate features and target
X = df_clean.drop(columns=[target_col])
y = df_clean[target_col]

# Encode target if needed
if y.dtype == 'object':
    le = LabelEncoder()
    y = le.fit_transform(y)

# Identify column types
numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"\nFeatures for modeling:")
print(f"  Numeric ({len(numeric_cols)}): {numeric_cols}")
print(f"  Categorical ({len(categorical_cols)}): {categorical_cols}")

# Create preprocessing pipelines
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_cols),
    ('categorical', categorical_pipeline, categorical_cols)
], remainder='drop')

print("\nPreprocessor Pipeline:")
print("""
ColumnTransformer
├── Numeric Pipeline (Pclass, Age, SibSp, Parch, Fare):
│   ├── SimpleImputer(strategy='median')  # Handle missing Age
│   └── StandardScaler()                  # Normalize features
│
└── Categorical Pipeline (Sex, Embarked):
    ├── SimpleImputer(strategy='most_frequent')  # Handle missing Embarked
    └── OneHotEncoder(handle_unknown='ignore')   # Create dummy variables
""")

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nData Split:")
print(f"  Train size: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"  Test size: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

# Fit and transform
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f"\nProcessed feature shape: {X_train_processed.shape}")
print(f"  Original features: {X.shape[1]}")
print(f"  After encoding: {X_train_processed.shape[1]} (due to one-hot encoding)")

---

<a id='part3'></a>
# Part 3: Feature Selection Methods

---

## 3.1 Three Categories of Feature Selection

| Category | Method | How It Works | Speed |
|----------|--------|--------------|-------|
| **Filter** | SelectKBest, VarianceThreshold | Statistical tests, independent of model | Fast |
| **Wrapper** | RFE, RFECV | Uses model performance | Slow |
| **Embedded** | SelectFromModel (L1, RF) | Built into model training | Medium |

## 3.2 Filter Methods

| Method | For | Description |
|--------|-----|-------------|
| **VarianceThreshold** | All | Remove low-variance features |
| **SelectKBest + f_classif** | Classification | ANOVA F-test |
| **SelectKBest + mutual_info** | Classification | Information gain |
| **SelectKBest + f_regression** | Regression | Correlation-based |
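
A minimal sketch of two filter methods from the table, chained on synthetic regression data. The data, the appended constant column, and `k=2` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression

X_reg, y_reg = make_regression(n_samples=200, n_features=6, n_informative=2,
                               noise=5.0, random_state=42)
# Append a constant column that carries no information
X_reg = np.hstack([X_reg, np.ones((X_reg.shape[0], 1))])

# Filter 1: drop zero-variance features (threshold=0.0 is the default)
X_var = VarianceThreshold(threshold=0.0).fit_transform(X_reg)

# Filter 2: keep the 2 features with the highest univariate F-statistic
kbest = SelectKBest(score_func=f_regression, k=2).fit(X_var, y_reg)
X_top = kbest.transform(X_var)

print(X_reg.shape, "->", X_var.shape, "->", X_top.shape)
# (200, 7) -> (200, 6) -> (200, 2)
```

Neither step trains a predictive model, which is why filter methods are the fastest category.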

## 3.3 Wrapper Methods

| Method | Description | Use When |
|--------|-------------|----------|
| **RFE** | Recursive feature elimination | Know target # features |
| **RFECV** | RFE with cross-validation | Want optimal # features |
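
When the target number of features is unknown, `RFECV` can choose it by cross-validation. The sketch below uses synthetic data and a logistic regression as the ranking estimator; both are assumptions, and any estimator exposing `coef_` or `feature_importances_` could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X_w, y_w = make_classification(n_samples=300, n_features=10, n_informative=3,
                               n_redundant=0, random_state=42)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                       # remove one feature per iteration
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
    min_features_to_select=1,
)
rfecv.fit(X_w, y_w)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected mask: {rfecv.support_}")
```

The cost is one model fit per removed feature per fold, which is what makes wrapper methods slow on wide datasets.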

## 3.4 Embedded Methods

| Method | Description | Use When |
|--------|-------------|----------|
| **L1 (Lasso)** | Coefficients shrink to zero | Linear models |
| **Tree Feature Importance** | Based on split quality | Tree-based models |
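
A minimal sketch of the L1 row, assuming synthetic data and an arbitrary regularization strength `C=0.1`: `SelectFromModel` keeps the features whose L1-penalized coefficients are not shrunk to (near) zero.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_e, y_e = make_classification(n_samples=300, n_features=20, n_informative=3,
                               n_redundant=0, random_state=42)
X_e = StandardScaler().fit_transform(X_e)  # L1 penalties assume comparable scales

# Smaller C means stronger regularization and fewer surviving features
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
embedded = SelectFromModel(l1_model).fit(X_e, y_e)

kept = int(embedded.get_support().sum())
print(f"Kept {kept} of {X_e.shape[1]} features")
```

Selection happens as a by-product of a single model fit, which is why embedded methods sit between filters and wrappers in cost.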

In [None]:
# ============================================================
# FEATURE SELECTION METHODS
# ============================================================
print("="*70)
print("FEATURE SELECTION METHODS")
print("="*70)

# Store results for comparison
feature_selection_results = []

# 1. FILTER METHOD: SelectKBest
print("\n1. FILTER METHOD: SelectKBest (ANOVA F-test)")
print("-" * 50)

selector_kbest = SelectKBest(score_func=f_classif, k='all')
selector_kbest.fit(X_train_processed, y_train)

# Get scores
feature_scores = pd.DataFrame({
    'Feature': range(X_train_processed.shape[1]),
    'F_Score': selector_kbest.scores_,
    'P_Value': selector_kbest.pvalues_
}).sort_values('F_Score', ascending=False)

print(f"Top 10 features by F-score:")
print(feature_scores.head(10).to_string(index=False))

# 2. FILTER METHOD: Mutual Information
print("\n2. FILTER METHOD: Mutual Information")
print("-" * 50)

mi_scores = mutual_info_classif(X_train_processed, y_train, random_state=42)
mi_df = pd.DataFrame({
    'Feature': range(len(mi_scores)),
    'MI_Score': mi_scores
}).sort_values('MI_Score', ascending=False)

print(f"Top 10 features by Mutual Information:")
print(mi_df.head(10).to_string(index=False))

# 3. WRAPPER METHOD: RFE
print("\n3. WRAPPER METHOD: Recursive Feature Elimination (RFE)")
print("-" * 50)

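# The preprocessed Titanic matrix has only 10 columns, so keep 5 to make RFE actually eliminate features
base_model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
rfe = RFE(estimator=base_model, n_features_to_select=5, step=1)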
rfe.fit(X_train_processed, y_train)

rfe_selected = np.where(rfe.support_)[0]
print(f"Selected {len(rfe_selected)} features: {rfe_selected.tolist()}")

# 4. EMBEDDED METHOD: Feature Importance
print("\n4. EMBEDDED METHOD: Random Forest Feature Importance")
print("-" * 50)

rf_for_importance = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_for_importance.fit(X_train_processed, y_train)

importance_df = pd.DataFrame({
    'Feature': range(len(rf_for_importance.feature_importances_)),
    'Importance': rf_for_importance.feature_importances_
}).sort_values('Importance', ascending=False)

print(f"Top 10 features by Random Forest Importance:")
print(importance_df.head(10).to_string(index=False))

In [None]:
# Visualize feature importance
print("="*70)
print("FEATURE IMPORTANCE VISUALIZATION")
print("="*70)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# F-scores
ax1 = axes[0]
top_f = feature_scores.head(15)
ax1.barh(range(len(top_f)), top_f['F_Score'], color='steelblue')
ax1.set_yticks(range(len(top_f)))
ax1.set_yticklabels([f"Feature {i}" for i in top_f['Feature']])
ax1.set_xlabel('F-Score')
ax1.set_title('ANOVA F-Test Scores', fontweight='bold')
ax1.invert_yaxis()

# Mutual Information
ax2 = axes[1]
top_mi = mi_df.head(15)
ax2.barh(range(len(top_mi)), top_mi['MI_Score'], color='coral')
ax2.set_yticks(range(len(top_mi)))
ax2.set_yticklabels([f"Feature {i}" for i in top_mi['Feature']])
ax2.set_xlabel('Mutual Information')
ax2.set_title('Mutual Information Scores', fontweight='bold')
ax2.invert_yaxis()

# Random Forest Importance
ax3 = axes[2]
top_rf = importance_df.head(15)
ax3.barh(range(len(top_rf)), top_rf['Importance'], color='green')
ax3.set_yticks(range(len(top_rf)))
ax3.set_yticklabels([f"Feature {i}" for i in top_rf['Feature']])
ax3.set_xlabel('Importance')
ax3.set_title('Random Forest Importance', fontweight='bold')
ax3.invert_yaxis()

plt.suptitle('Feature Selection Method Comparison', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

---

<a id='part4'></a>
# Part 4: Automated Model Selection

---

## 4.1 Comparing Multiple Models

| Model Family | Models | Best For |
|--------------|--------|----------|
| **Linear** | LogisticRegression, SVC(linear) | Linearly separable, interpretable |
| **Tree-based** | DecisionTree, RandomForest, GradientBoosting | Non-linear, feature importance |
| **Distance-based** | KNN, SVC(rbf) | Non-linear, small datasets |
| **Probabilistic** | GaussianNB | Fast, baseline |
| **Neural** | MLPClassifier | Complex patterns, lots of data |

## 4.2 Model Selection Strategy

```
1. Start with diverse set of models
2. Use cross-validation for fair comparison
3. Select top N models for hyperparameter tuning
4. Final evaluation on holdout test set
```

In [None]:
# ============================================================
# AUTOMATED MODEL SELECTION
# ============================================================
print("="*70)
print("AUTOMATED MODEL SELECTION")
print("="*70)

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Extra Trees': ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42, probability=True),
    'Naive Bayes': GaussianNB(),
    'MLP': MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
}

print(f"\nComparing {len(models)} models with 5-fold cross-validation...")
print("\n" + "-"*60)

# Compare models
results = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    start_time = time.time()
    
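    # Cross-validation. Note: the features were preprocessed once on the full training
    # set, so fold scores are slightly optimistic; Part 7 moves preprocessing into the pipeline.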
    cv_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='accuracy')
    
    elapsed = time.time() - start_time
    
    results.append({
        'Model': name,
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std(),
        'Time (s)': elapsed
    })
    
    print(f"{name:25s} | Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f}) | {elapsed:.2f}s")

# Create results DataFrame
results_df = pd.DataFrame(results).sort_values('CV Mean', ascending=False)

print("\n" + "="*60)
print("MODEL RANKING")
print("="*60)
print(results_df.to_string(index=False))

In [None]:
# Visualize model comparison
print("="*70)
print("MODEL COMPARISON VISUALIZATION")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Accuracy comparison
ax1 = axes[0]
colors = plt.cm.viridis(np.linspace(0, 1, len(results_df)))
bars = ax1.barh(results_df['Model'], results_df['CV Mean'], 
                xerr=results_df['CV Std'], color=colors, edgecolor='black')
ax1.set_xlabel('Cross-Validation Accuracy')
ax1.set_title('Model Accuracy Comparison', fontweight='bold')
ax1.axvline(x=results_df['CV Mean'].max(), color='red', linestyle='--', alpha=0.5)

# Add values
for bar, val in zip(bars, results_df['CV Mean']):
    ax1.text(val + 0.01, bar.get_y() + bar.get_height()/2, 
             f'{val:.3f}', va='center', fontsize=9)

# Speed vs Accuracy
ax2 = axes[1]
scatter = ax2.scatter(results_df['Time (s)'], results_df['CV Mean'], 
                      s=100, c=range(len(results_df)), cmap='viridis', edgecolor='black')
for i, row in results_df.iterrows():
    ax2.annotate(row['Model'], (row['Time (s)'], row['CV Mean']), 
                 fontsize=8, ha='left', va='bottom')
ax2.set_xlabel('Training Time (seconds)')
ax2.set_ylabel('CV Accuracy')
ax2.set_title('Speed vs Accuracy Trade-off', fontweight='bold')

plt.tight_layout()
plt.show()

# Select top 3 models for hyperparameter tuning
top_models = results_df.head(3)['Model'].tolist()
print(f"\nTop 3 models for hyperparameter tuning: {top_models}")

---

<a id='part5'></a>
# Part 5: Hyperparameter Tuning

---

## 5.1 Tuning Methods Comparison

| Method | How It Works | Pros | Cons |
|--------|--------------|------|------|
| **Grid Search** | Try all combinations | Thorough | Slow, curse of dimensionality |
| **Random Search** | Random combinations | Faster, often as good | May miss optimal |
| **Bayesian Optimization** | Learn from previous trials | Efficient | More complex |
| **Halving Search** | Progressive filtering | Very fast | May discard good candidates |
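
Halving search is the one method in this table that needs an explicit opt-in import in scikit-learn. A minimal sketch, assuming synthetic data and arbitrary settings (`n_candidates=8`, `factor=2`, `min_resources=50`):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (enables the next import)
from sklearn.model_selection import HalvingRandomSearchCV

X_h, y_h = make_classification(n_samples=400, n_features=10, random_state=42)

halving = HalvingRandomSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=42),
    param_distributions={
        "max_depth": [3, 5, 10, None],
        "min_samples_split": randint(2, 11),
    },
    n_candidates=8,      # candidates evaluated in the first round
    factor=2,            # keep roughly the best half each round
    min_resources=50,    # samples given to each candidate in the first round
    cv=3,
    random_state=42,
)
halving.fit(X_h, y_h)

print("Best params:", halving.best_params_)
print("Candidates per round:", halving.n_candidates_)
print("Samples per round:", halving.n_resources_)
```

Early rounds score many candidates on few samples, so a configuration that only shines with more data can be discarded, which is the trade-off noted in the table.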

## 5.2 When to Use Which?

| Scenario | Recommended Method |
|----------|-------------------|
| Few hyperparameters (2-3) | Grid Search |
| Many hyperparameters (4+) | Random Search |
| Limited compute budget | Random Search or Halving |
| Expensive evaluations | Bayesian Optimization |
| Quick baseline | Random Search (n_iter=20) |
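
The cost difference behind these recommendations is simple arithmetic. The sketch below counts model fits for a four-parameter Random Forest grid under 5-fold cross-validation and compares it with a random search of 20 draws; the fold count is an assumption.

```python
from sklearn.model_selection import ParameterGrid

rf_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", None],
}

n_combinations = len(ParameterGrid(rf_grid))  # 3 * 4 * 3 * 3
n_folds = 5
grid_fits = n_combinations * n_folds
random_fits = 20 * n_folds                    # RandomizedSearchCV with n_iter=20

print(f"Grid search:   {n_combinations} combinations -> {grid_fits} fits")
print(f"Random search: 20 combinations -> {random_fits} fits")
# Grid search:   108 combinations -> 540 fits
# Random search: 20 combinations -> 100 fits
```

Each additional hyperparameter multiplies the grid size, while the random-search budget stays fixed at `n_iter`.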

## 5.3 Common Hyperparameters

| Model | Key Hyperparameters |
|-------|--------------------|
| **Random Forest** | n_estimators, max_depth, min_samples_split, max_features |
| **Gradient Boosting** | n_estimators, learning_rate, max_depth, subsample |
| **SVM** | C, kernel, gamma |
| **KNN** | n_neighbors, weights, metric |
| **MLP** | hidden_layer_sizes, learning_rate_init, alpha |

In [None]:
# ============================================================
# HYPERPARAMETER TUNING
# ============================================================
print("="*70)
print("HYPERPARAMETER TUNING")
print("="*70)

# Define parameter grids for top models
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [5, 10, 20, None],
        'min_samples_split': [2, 5, 10],
        'max_features': ['sqrt', 'log2', None]
    },
    'Gradient Boosting': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7],
        'subsample': [0.8, 1.0]
    },
    'Extra Trees': {
        'n_estimators': [50, 100, 200],
        'max_depth': [5, 10, 20, None],
        'min_samples_split': [2, 5, 10]
    }
}

# 1. Grid Search Example
print("\n1. GRID SEARCH")
print("-" * 50)

# Use smaller grid for demonstration
small_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid=small_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

start_time = time.time()
grid_search.fit(X_train_processed, y_train)
grid_time = time.time() - start_time

print(f"\nGrid Search completed in {grid_time:.2f}s")
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Total combinations tried: {len(grid_search.cv_results_['params'])}")

In [None]:
# 2. Random Search
print("\n2. RANDOM SEARCH")
print("-" * 50)

from scipy.stats import randint, uniform

random_grid = {
    'n_estimators': randint(50, 300),
    'max_depth': [5, 10, 20, 30, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=random_grid,
    n_iter=20,  # Number of random combinations
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

start_time = time.time()
random_search.fit(X_train_processed, y_train)
random_time = time.time() - start_time

print(f"\nRandom Search completed in {random_time:.2f}s")
print(f"Best params: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")

# Compare Grid vs Random
print("\n" + "="*50)
print("GRID SEARCH vs RANDOM SEARCH")
print("="*50)
comparison = pd.DataFrame({
    'Method': ['Grid Search', 'Random Search'],
    'Best Score': [grid_search.best_score_, random_search.best_score_],
    'Time (s)': [grid_time, random_time],
    'Combinations': [len(grid_search.cv_results_['params']), random_search.n_iter]
})
print(comparison.to_string(index=False))

In [None]:
# Visualize hyperparameter search results
print("="*70)
print("HYPERPARAMETER SEARCH VISUALIZATION")
print("="*70)

# Get random search results
cv_results = pd.DataFrame(random_search.cv_results_)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Score vs n_estimators
ax1 = axes[0]
ax1.scatter(cv_results['param_n_estimators'].astype(int), cv_results['mean_test_score'], 
           c=cv_results['mean_test_score'], cmap='viridis', s=100, edgecolor='black')
ax1.set_xlabel('n_estimators')
ax1.set_ylabel('Mean CV Score')
ax1.set_title('Score vs n_estimators', fontweight='bold')

# Score distribution
ax2 = axes[1]
ax2.hist(cv_results['mean_test_score'], bins=10, color='steelblue', edgecolor='black', alpha=0.7)
ax2.axvline(x=random_search.best_score_, color='red', linestyle='--', 
           label=f'Best: {random_search.best_score_:.4f}')
ax2.set_xlabel('Mean CV Score')
ax2.set_ylabel('Frequency')
ax2.set_title('Score Distribution Across Trials', fontweight='bold')
ax2.legend()

plt.tight_layout()
plt.show()

---

<a id='part6'></a>
# Part 6: Cross-Validation Strategies

---

## 6.1 Types of Cross-Validation

| Method | Use Case | Maintains Class Balance |
|--------|----------|------------------------|
| **KFold** | Regression, balanced classification | No |
| **StratifiedKFold** | Imbalanced classification | Yes |
| **TimeSeriesSplit** | Time series data | N/A (respects time) |
| **LeaveOneOut** | Very small datasets | No |
| **GroupKFold** | Grouped data (avoid leakage) | No |
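
The "Maintains Class Balance" column can be checked directly. This sketch builds a hypothetical target with 10% positives and counts how many positives land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy target: 90 negatives, 10 positives (hypothetical)
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # feature values are irrelevant for splitting

def positives_per_fold(splitter):
    """Count positive samples in each validation fold."""
    return [int(y_imb[test_idx].sum()) for _, test_idx in splitter.split(X_imb, y_imb)]

kfold_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per fold:          ", kfold_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```

`StratifiedKFold` places exactly 2 of the 10 positives in every fold, whereas plain `KFold` leaves the count to chance.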

## 6.2 Choosing the Right CV Strategy

| Data Type | Recommended CV |
|-----------|---------------|
| **Classification (balanced)** | KFold or StratifiedKFold |
| **Classification (imbalanced)** | StratifiedKFold |
| **Regression** | KFold |
| **Time series** | TimeSeriesSplit |
| **Grouped data** | GroupKFold |
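
For the last two rows, a minimal sketch of what the splitters guarantee. The eight-row array and the group labels are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X_seq = np.arange(8).reshape(-1, 1)  # 8 time-ordered observations (hypothetical)

# TimeSeriesSplit: training indices always precede validation indices
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X_seq))
for train_idx, test_idx in ts_splits:
    print("train:", train_idx, "test:", test_idx)

# GroupKFold: all rows of a group (e.g. one customer) stay on the same side
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
group_splits = list(GroupKFold(n_splits=4).split(X_seq, groups=groups))
for train_idx, test_idx in group_splits:
    print("held-out group(s):", np.unique(groups[test_idx]))
```

Shuffled `KFold` would break both guarantees: future rows could leak into training, and rows from one group could appear on both sides of a split.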

In [None]:
# ============================================================
# CROSS-VALIDATION STRATEGIES
# ============================================================
print("="*70)
print("CROSS-VALIDATION STRATEGIES")
print("="*70)

from sklearn.model_selection import cross_validate

# Define CV strategies
cv_strategies = {
    'KFold (5)': KFold(n_splits=5, shuffle=True, random_state=42),
    'StratifiedKFold (5)': StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    'KFold (10)': KFold(n_splits=10, shuffle=True, random_state=42),
    'StratifiedKFold (10)': StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
}

# Use best model from previous search
best_model = random_search.best_estimator_

cv_results = []
for name, cv_strategy in cv_strategies.items():
    scores = cross_val_score(best_model, X_train_processed, y_train, 
                             cv=cv_strategy, scoring='accuracy')
    cv_results.append({
        'CV Strategy': name,
        'Mean': scores.mean(),
        'Std': scores.std(),
        'Min': scores.min(),
        'Max': scores.max()
    })
    print(f"{name:25s} | Mean: {scores.mean():.4f} | Std: {scores.std():.4f}")

cv_results_df = pd.DataFrame(cv_results)

print("\n" + "="*50)
print("CV STRATEGY COMPARISON")
print("="*50)
print(cv_results_df.to_string(index=False))

print("\nKey Insights:")
print("  - More folds = more training data per fold (less pessimistic estimate), higher compute")
print("  - StratifiedKFold recommended for classification")
print("  - 5-fold is good default, 10-fold for more reliable estimate")

---

<a id='part7'></a>
# Part 7: Complete AutoML Pipeline

---

## 7.1 Putting It All Together

Now we build a complete automated pipeline that:
1. Preprocesses data automatically
2. Selects best features
3. Tunes hyperparameters
4. Returns best model

In [None]:
# ============================================================
# COMPLETE AUTOML PIPELINE CLASS
# ============================================================
print("="*70)
print("COMPLETE AUTOML PIPELINE")
print("="*70)

class AutoMLPipeline:
    """
    Automated Machine Learning Pipeline.
    
    Automates:
    1. Preprocessing (imputation, scaling, encoding)
    2. Feature selection
    3. Model selection
    4. Hyperparameter tuning
    """
    
    def __init__(self, task='classification', n_jobs=-1, random_state=42):
        self.task = task
        self.n_jobs = n_jobs
        self.random_state = random_state
        self.preprocessor = None
        self.best_pipeline = None
        self.best_model_name = None
        self.best_params = None
        self.best_score = None
        self.model_results = None
        
    def _get_models(self):
        """Return dictionary of models to try."""
        if self.task == 'classification':
            return {
                'LogisticRegression': LogisticRegression(max_iter=1000, random_state=self.random_state),
                'RandomForest': RandomForestClassifier(random_state=self.random_state, n_jobs=self.n_jobs),
                'GradientBoosting': GradientBoostingClassifier(random_state=self.random_state),
                'ExtraTrees': ExtraTreesClassifier(random_state=self.random_state, n_jobs=self.n_jobs),
                'SVM': SVC(random_state=self.random_state, probability=True),
            }
        else:
            from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
            from sklearn.linear_model import Ridge, Lasso
            return {
                'Ridge': Ridge(random_state=self.random_state),
                'Lasso': Lasso(random_state=self.random_state),
                'RandomForest': RandomForestRegressor(random_state=self.random_state, n_jobs=self.n_jobs),
                'GradientBoosting': GradientBoostingRegressor(random_state=self.random_state),
            }
    
    def _get_param_grids(self):
        """Return hyperparameter grids for each model."""
        return {
            'LogisticRegression': {
                'model__C': [0.01, 0.1, 1, 10],
                'model__penalty': ['l1', 'l2'],
                'model__solver': ['liblinear', 'saga']
            },
            'RandomForest': {
                'model__n_estimators': [50, 100, 200],
                'model__max_depth': [5, 10, None],
                'model__min_samples_split': [2, 5]
            },
            'GradientBoosting': {
                'model__n_estimators': [50, 100],
                'model__learning_rate': [0.05, 0.1, 0.2],
                'model__max_depth': [3, 5]
            },
            'ExtraTrees': {
                'model__n_estimators': [50, 100, 200],
                'model__max_depth': [5, 10, None]
            },
            'SVM': {
                'model__C': [0.1, 1, 10],
                'model__kernel': ['rbf', 'linear']
            },
            'Ridge': {
                'model__alpha': [0.1, 1, 10, 100]
            },
            'Lasso': {
                'model__alpha': [0.001, 0.01, 0.1, 1]
            }
        }
    
    def _create_preprocessor(self, X):
        """Create preprocessing pipeline based on data types."""
        numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
        categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
        
        numeric_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ])
        
        categorical_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ])
        
        preprocessor = ColumnTransformer([
            ('numeric', numeric_pipeline, numeric_cols),
            ('categorical', categorical_pipeline, categorical_cols)
        ], remainder='drop')
        
        return preprocessor
    
    def fit(self, X, y, cv=5, scoring=None, verbose=True):
        """
        Fit the AutoML pipeline.
        
        Parameters:
        - X: Features (DataFrame)
        - y: Target
        - cv: Cross-validation folds
        - scoring: Scoring metric
        - verbose: Print progress
        """
        if scoring is None:
            scoring = 'accuracy' if self.task == 'classification' else 'neg_mean_squared_error'
        
        # Create preprocessor
        self.preprocessor = self._create_preprocessor(X)
        
        # Get models and param grids
        models = self._get_models()
        param_grids = self._get_param_grids()
        
        if verbose:
            print(f"\nAutoML: Testing {len(models)} models...")
            print("-" * 60)
        
        # Compare all models
        results = []
        best_score = -np.inf
        
        for name, model in models.items():
            # Create pipeline
            pipeline = Pipeline([
                ('preprocessor', self.preprocessor),
                ('model', model)
            ])
            
            # Get param grid
            param_grid = param_grids.get(name, {})
            
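            # Hyperparameter tuning
            if param_grid:
                # Cap n_iter at the grid size so small grids (e.g. Ridge, Lasso)
                # do not trigger a "parameter space smaller than n_iter" warning
                n_combinations = int(np.prod([len(v) for v in param_grid.values()]))
                search = RandomizedSearchCV(
                    pipeline, param_grid, 
                    n_iter=min(10, n_combinations), cv=cv, scoring=scoring,
                    n_jobs=self.n_jobs, random_state=self.random_state
                )
            else:
                # No grid defined: cross-validate the default configuration once
                search = GridSearchCV(
                    pipeline, {}, cv=cv, scoring=scoring, n_jobs=self.n_jobs
                )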
            
            search.fit(X, y)
            
            results.append({
                'Model': name,
                'Best Score': search.best_score_,
                'Best Params': search.best_params_
            })
            
            if verbose:
                print(f"{name:20s} | Score: {search.best_score_:.4f}")
            
            # Track best
            if search.best_score_ > best_score:
                best_score = search.best_score_
                self.best_pipeline = search.best_estimator_
                self.best_model_name = name
                self.best_params = search.best_params_
                self.best_score = search.best_score_
        
        self.model_results = pd.DataFrame(results).sort_values('Best Score', ascending=False)
        
        if verbose:
            print("\n" + "="*60)
            print(f"BEST MODEL: {self.best_model_name}")
            print(f"BEST SCORE: {self.best_score:.4f}")
            print(f"BEST PARAMS: {self.best_params}")
        
        return self
    
    def predict(self, X):
        """Make predictions using best pipeline."""
        return self.best_pipeline.predict(X)
    
    def predict_proba(self, X):
        """Get probability predictions."""
        return self.best_pipeline.predict_proba(X)
    
    def get_results(self):
        """Return model comparison results."""
        return self.model_results

print("\nAutoMLPipeline class created!")
print("\nFeatures:")
print("  - Automatic preprocessing")
print("  - Multiple model comparison")
print("  - Hyperparameter tuning")
print("  - Best model selection")

In [None]:
# ============================================================
# RUN AUTOML PIPELINE
# ============================================================
print("="*70)
print("RUNNING AUTOML PIPELINE")
print("="*70)

# Initialize AutoML
automl = AutoMLPipeline(task='classification', n_jobs=-1, random_state=42)

# Fit on training data
start_time = time.time()
automl.fit(X_train, y_train, cv=5, scoring='accuracy', verbose=True)
total_time = time.time() - start_time

print(f"\nTotal AutoML time: {total_time:.2f} seconds")

# Show all results
print("\n" + "="*50)
print("ALL MODEL RESULTS")
print("="*50)
print(automl.get_results().to_string(index=False))
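
Because the winning pipeline bundles preprocessing and model in one object, it can be saved and reloaded as a single file. The sketch below uses a small stand-in pipeline rather than `automl.best_pipeline` so that it runs on its own; the file name `best_pipeline.joblib` is a hypothetical choice.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for automl.best_pipeline: any fitted sklearn Pipeline is handled the same way
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
fitted_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_pipeline.joblib"  # hypothetical file name
    joblib.dump(fitted_pipeline, model_path)      # save preprocessing + model together
    restored_pipeline = joblib.load(model_path)   # reload as one object

# The restored pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(
    fitted_pipeline.predict(X_demo), restored_pipeline.predict(X_demo)
)
print(f"Restored pipeline reproduces predictions: {same_predictions}")
```

A saved pipeline should be loaded with the same scikit-learn version that created it, since pickled estimators are not guaranteed to be compatible across versions.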

---

<a id='part8'></a>
# Part 8: Evaluation & Results

---

In [None]:
# ============================================================
# FINAL EVALUATION ON TEST SET
# ============================================================
print("="*70)
print("FINAL EVALUATION ON TEST SET")
print("="*70)

# Predict on test set
y_pred = automl.predict(X_test)
# Probability of the positive class (column 1); assumes a binary target
y_proba = automl.predict_proba(X_test)[:, 1]

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)

print(f"\nBest Model: {automl.best_model_name}")
print(f"\nTest Set Metrics:")
print(f"  Accuracy:  {accuracy:.4f}")
print(f"  F1 Score:  {f1:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  ROC-AUC:   {roc_auc:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
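
The metrics above assume a binary target. For a multiclass problem, a possible variant is sketched below on the Iris dataset, which is not part of this tutorial's data: F1, precision, and recall need an `average` argument, and ROC-AUC takes the full probability matrix with `multi_class='ovr'`. The plain `LogisticRegression` here is only a stand-in for the AutoML winner.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Three-class example data (an assumption for illustration only)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, stratify=y_iris, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred_mc = clf.predict(X_te)
y_proba_mc = clf.predict_proba(X_te)  # full matrix: one column per class

# Macro averaging weights every class equally, regardless of its size
f1_macro = f1_score(y_te, y_pred_mc, average="macro")
precision_macro = precision_score(y_te, y_pred_mc, average="macro")
recall_macro = recall_score(y_te, y_pred_mc, average="macro")
# One-vs-rest ROC-AUC needs probabilities for all classes, not just column 1
roc_auc_ovr = roc_auc_score(y_te, y_proba_mc, multi_class="ovr")

print(f"  F1 (macro):        {f1_macro:.4f}")
print(f"  Precision (macro): {precision_macro:.4f}")
print(f"  Recall (macro):    {recall_macro:.4f}")
print(f"  ROC-AUC (OvR):     {roc_auc_ovr:.4f}")
```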

In [None]:
# Visualization
print("="*70)
print("RESULTS VISUALIZATION")
print("="*70)

from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Confusion Matrix
ax1 = axes[0]
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax1, cmap='Blues')
ax1.set_title('Confusion Matrix', fontweight='bold')

# ROC Curve
ax2 = axes[1]
RocCurveDisplay.from_predictions(y_test, y_proba, ax=ax2)
ax2.plot([0, 1], [0, 1], 'k--', label='Random')
ax2.set_title(f'ROC Curve (AUC = {roc_auc:.3f})', fontweight='bold')
ax2.legend()

# Model Comparison
ax3 = axes[2]
model_results = automl.get_results()
colors = ['green' if m == automl.best_model_name else 'steelblue' for m in model_results['Model']]
ax3.barh(model_results['Model'], model_results['Best Score'], color=colors, edgecolor='black')
ax3.set_xlabel('CV Score')
ax3.set_title('Model Comparison', fontweight='bold')
for i, (model, score) in enumerate(zip(model_results['Model'], model_results['Best Score'])):
    ax3.text(score + 0.005, i, f'{score:.3f}', va='center')

plt.tight_layout()
plt.show()

---

<a id='part9'></a>
# Part 9: Summary & Best Practices

---

In [None]:
# Final summary
print("="*70)
print("AUTOMATED ML PIPELINE - SUMMARY")
print("="*70)

print("""
WHAT WE LEARNED:
================

1. WHY PIPELINES?
   - Prevent data leakage
   - Ensure reproducibility
   - Easy deployment (single object)

2. PREPROCESSING AUTOMATION:
   ┌─────────────────────────────────────────────┐
   │ ColumnTransformer                           │
   ├─────────────────────────────────────────────┤
   │ Numeric: Imputer → Scaler                   │
   │ Categorical: Imputer → Encoder              │
   └─────────────────────────────────────────────┘

3. FEATURE SELECTION METHODS:
   ┌──────────────┬─────────────────┬───────────┐
   │ Type         │ Methods         │ Speed     │
   ├──────────────┼─────────────────┼───────────┤
   │ Filter       │ SelectKBest     │ Fast      │
   │ Wrapper      │ RFE, RFECV      │ Slow      │
   │ Embedded     │ L1, Tree Imp.   │ Medium    │
   └──────────────┴─────────────────┴───────────┘

4. HYPERPARAMETER TUNING:
   ┌──────────────────┬──────────────────────────┐
   │ Method           │ When to Use              │
   ├──────────────────┼──────────────────────────┤
   │ Grid Search      │ Few parameters           │
   │ Random Search    │ Many parameters (faster) │
   │ Bayesian Opt.    │ Expensive evaluations    │
   └──────────────────┴──────────────────────────┘

5. CROSS-VALIDATION:
   - StratifiedKFold for classification (keeps class ratios in each fold)
   - TimeSeriesSplit for time-ordered data (never trains on the future)
   - 5-fold is a good default
""")

print(f"\nFINAL RESULTS:")
print(f"  Best Model: {automl.best_model_name}")
print(f"  CV Score: {automl.best_score:.4f}")
print(f"  Test Accuracy: {accuracy:.4f}")
print(f"  Test ROC-AUC: {roc_auc:.4f}")

print("\n" + "="*70)
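
The cross-validation advice in the summary can be checked directly by inspecting the indices each splitter yields. The sketch below uses a made-up imbalanced target and a made-up time-ordered index; neither comes from the tutorial's dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Hypothetical imbalanced target: 80 negatives, 20 positives
y_imbalanced = np.array([0] * 80 + [1] * 20)
X_dummy = np.zeros((len(y_imbalanced), 1))

# StratifiedKFold keeps the positive rate of each validation fold close to the overall rate
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
positive_rates = [
    y_imbalanced[val_idx].mean() for _, val_idx in skf.split(X_dummy, y_imbalanced)
]
print("Positive rate per validation fold:", positive_rates)

# TimeSeriesSplit always validates on rows that come after the training rows
time_ordered = np.arange(60).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=5)
ts_folds = list(tscv.split(time_ordered))
for fold, (train_idx, val_idx) in enumerate(ts_folds, start=1):
    print(f"Fold {fold}: train 0..{train_idx[-1]}, validate {val_idx[0]}..{val_idx[-1]}")
```

Either splitter object can be passed as the `cv` argument wherever an integer such as `cv=5` is accepted.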

## Algorithm & Method Taxonomy

### Preprocessing Methods

| Method | Type | When to Use |
|--------|------|-------------|
| **StandardScaler** | Scaling | Roughly Gaussian features; default for most algorithms |
| **MinMaxScaler** | Scaling | Neural networks, bounded range |
| **RobustScaler** | Scaling | Data with outliers |
| **SimpleImputer** | Imputation | Simple missing value handling |
| **KNNImputer** | Imputation | Complex patterns in missing data |
| **OneHotEncoder** | Encoding | Nominal categories |
| **OrdinalEncoder** | Encoding | Ordinal categories |
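
To make the scaler recommendations concrete, the following sketch applies the three scalers to a made-up feature containing one extreme outlier and measures how much spread the nine typical values retain afterwards. The data is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature: nine typical values and one extreme outlier
values = np.array(
    [10, 11, 12, 13, 14, 15, 16, 17, 18, 1000], dtype=float
).reshape(-1, 1)

spreads = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(values).ravel()
    # Range covered by the nine typical values after scaling
    spreads[type(scaler).__name__] = scaled[:9].max() - scaled[:9].min()

for name, spread in spreads.items():
    print(f"{name:15s} spread of typical values: {spread:.3f}")
```

StandardScaler and MinMaxScaler squeeze the typical values into a narrow band because the outlier inflates the standard deviation and the range. RobustScaler uses the median and interquartile range, so the typical values keep a usable spread.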

### Feature Selection Methods

| Method | Category | Speed | Best For |
|--------|----------|-------|----------|
| **VarianceThreshold** | Filter | Very Fast | Remove constant features |
| **SelectKBest** | Filter | Fast | Quick baseline |
| **RFE** | Wrapper | Slow | Greedy elimination to a fixed feature count |
| **RFECV** | Wrapper | Very Slow | Auto-select # features |
| **SelectFromModel** | Embedded | Medium | L1/Tree-based |
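
As a rough illustration of the three families, the sketch below runs one selector from each on synthetic data. The dataset, the choice of `k=5`, and the `threshold="median"` setting are assumptions made for the example, not values taken from the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which 5 are informative and 2 are redundant
X_fs, y_fs = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2, random_state=42
)

# Filter: score each feature independently with an ANOVA F-test
filter_selector = SelectKBest(score_func=f_classif, k=5).fit(X_fs, y_fs)

# Wrapper: repeatedly refit a model and drop the weakest feature
wrapper_selector = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=5
).fit(X_fs, y_fs)

# Embedded: keep features whose tree importance is at least the median importance
embedded_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42), threshold="median"
).fit(X_fs, y_fs)

selectors = {
    "Filter (SelectKBest)": filter_selector,
    "Wrapper (RFE)": wrapper_selector,
    "Embedded (SelectFromModel)": embedded_selector,
}
for label, selector in selectors.items():
    print(f"{label:28s} -> {selector.get_support(indices=True).tolist()}")
```

The three methods often agree on the strongest features but differ on borderline ones, which is one reason to treat the selected set as a starting point rather than a final answer.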

### Hyperparameter Tuning Methods

| Method | Combinations | Speed | Quality |
|--------|--------------|-------|--------|
| **GridSearchCV** | All | Slow | Best within the grid |
| **RandomizedSearchCV** | Random N | Fast | Usually good |
| **HalvingGridSearchCV** | Progressive | Very Fast | Good |
| **Optuna/Bayesian** | Smart | Medium | Excellent |
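
The difference in the "Combinations" column can be observed by counting how many candidates each scikit-learn search evaluates on the same grid. This is a small sketch on synthetic data with an arbitrary grid; note that `HalvingGridSearchCV` is still experimental and must be enabled with an explicit import.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (enables the import below)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, HalvingGridSearchCV, RandomizedSearchCV

X_hp, y_hp = make_classification(n_samples=400, n_features=10, random_state=42)

# Arbitrary example grid: 2 x 3 x 2 = 12 combinations
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5],
}
rf = RandomForestClassifier(random_state=42)

searches = {
    "GridSearchCV": GridSearchCV(rf, param_grid, cv=3),
    "RandomizedSearchCV": RandomizedSearchCV(
        rf, param_grid, n_iter=5, cv=3, random_state=42
    ),
    "HalvingGridSearchCV": HalvingGridSearchCV(rf, param_grid, cv=3, random_state=42),
}

candidate_counts = {}
for name, search in searches.items():
    search.fit(X_hp, y_hp)
    # cv_results_["params"] holds one entry per evaluated candidate
    candidate_counts[name] = len(search.cv_results_["params"])
    print(f"{name:20s} | candidates: {candidate_counts[name]:3d} | "
          f"best CV score: {search.best_score_:.4f}")
```

The halving search may report more candidate evaluations than the grid search, because survivors are re-evaluated in later rounds. Its speed advantage comes from running the early rounds on a fraction of the samples.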

---

## Production Libraries

| Library | Purpose | Complexity |
|---------|---------|------------|
| **sklearn Pipeline** | Manual pipelines | Low |
| **TPOT** | Genetic algorithm AutoML | Medium |
| **Auto-sklearn** | Meta-learning AutoML | Medium |
| **H2O AutoML** | Enterprise AutoML | Low |
| **Optuna** | Hyperparameter optimization | Medium |
| **MLflow** | Experiment tracking | Medium |

---

## Checklist

- [x] Understand why pipelines prevent data leakage
- [x] Can build preprocessing pipelines with ColumnTransformer
- [x] Know Filter, Wrapper, Embedded feature selection methods
- [x] Can compare multiple models automatically
- [x] Understand Grid Search vs Random Search trade-offs
- [x] Know when to use which cross-validation strategy
- [x] Can build end-to-end AutoML pipeline

---

**End of Automated ML Pipeline Tutorial**