# Project 20: Build Your Own AutoML System

**Combine everything you've learned into a single automated ML system**

This is the ultimate capstone project! We'll build a complete AutoML system from scratch that automates every stage of this pipeline:

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Data     │───▶│   Feature   │───▶│   Feature   │───▶│    Model    │
│ Preparation │    │ Engineering │    │  Selection  │    │  Selection  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                                │
                                                                ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Best     │◀───│  Ensemble   │◀───│   Model     │◀───│ Hyperparam. │
│   Model     │    │   Methods   │    │  Training   │    │   Tuning    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
```

**Our AutoML System Features:**
- Auto problem type detection (classification/regression)
- Auto data preprocessing (missing values, encoding, scaling)
- Auto feature engineering
- Auto feature selection
- Multi-model training & comparison
- Hyperparameter optimization
- Ensemble methods
- Auto report generation

## Table of Contents

1. [Setup and Installation](#1-setup-and-installation)
2. [AutoML Architecture Overview](#2-automl-architecture-overview)
3. [Auto Data Type Detection](#3-auto-data-type-detection)
4. [Auto Exploratory Data Analysis](#4-auto-exploratory-data-analysis)
5. [Auto Data Preprocessing](#5-auto-data-preprocessing)
6. [Auto Feature Engineering](#6-auto-feature-engineering)
7. [Auto Feature Selection](#7-auto-feature-selection)
8. [Multi-Model Training](#8-multi-model-training)
9. [Hyperparameter Tuning](#9-hyperparameter-tuning)
10. [Ensemble Methods](#10-ensemble-methods)
11. [Complete AutoML Class](#11-complete-automl-class)
12. [Demo: AutoML in Action](#12-demo-automl-in-action)
13. [Summary](#13-summary)

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install -q scikit-learn xgboost lightgbm optuna category_encoders

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional, Union, Any
from dataclasses import dataclass, field
from enum import Enum
import warnings
import time
import json
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.model_selection import (
    train_test_split, cross_val_score, StratifiedKFold, KFold
)
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, 
    LabelEncoder, OneHotEncoder
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_selection import (
    SelectKBest, f_classif, f_regression, mutual_info_classif,
    mutual_info_regression, RFE, SelectFromModel, VarianceThreshold
)
from sklearn.decomposition import PCA

# Models
from sklearn.linear_model import (
    LogisticRegression, Ridge, Lasso, ElasticNet,
    SGDClassifier, PassiveAggressiveClassifier
)
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    AdaBoostClassifier, AdaBoostRegressor,
    ExtraTreesClassifier, ExtraTreesRegressor,
    VotingClassifier, VotingRegressor,
    StackingClassifier, StackingRegressor
)
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.naive_bayes import GaussianNB

# XGBoost & LightGBM
import xgboost as xgb
import lightgbm as lgb

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score,
    make_scorer
)

# Hyperparameter tuning
try:
    import optuna
    OPTUNA_AVAILABLE = True
except ImportError:
    OPTUNA_AVAILABLE = False
    print("Optuna not available, using GridSearch instead")

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Set seeds
SEED = 42
np.random.seed(SEED)

print("All libraries imported successfully!")
print(f"Optuna available: {OPTUNA_AVAILABLE}")

## 2. AutoML Architecture Overview

Our AutoML system consists of modular components that can be used independently or together:

| Module | Responsibility |
|--------|---------------|
| `DataTypeDetector` | Detect column types (numeric, categorical, text, datetime) |
| `AutoEDA` | Automatic exploratory data analysis |
| `AutoPreprocessor` | Handle missing values, encoding, scaling |
| `AutoFeatureEngineer` | Create new features automatically |
| `AutoFeatureSelector` | Select best features |
| `ModelTrainer` | Train multiple models |
| `HyperparameterTuner` | Optimize hyperparameters |
| `Ensembler` | Create ensemble models |
| `AutoML` | Main class combining all modules |

In [None]:
# Define problem types
class ProblemType(Enum):
    BINARY_CLASSIFICATION = "binary_classification"
    MULTICLASS_CLASSIFICATION = "multiclass_classification"
    REGRESSION = "regression"

# Define column types
class ColumnType(Enum):
    NUMERIC = "numeric"
    CATEGORICAL = "categorical"
    TEXT = "text"
    DATETIME = "datetime"
    BINARY = "binary"
    ID = "id"

@dataclass
class AutoMLConfig:
    """Configuration for AutoML system."""
    # General
    random_state: int = 42
    test_size: float = 0.2
    cv_folds: int = 5
    
    # Preprocessing
    handle_missing: bool = True
    encode_categorical: bool = True
    scale_numeric: bool = True
    
    # Feature Engineering
    create_interactions: bool = True
    create_polynomial: bool = False
    
    # Feature Selection
    feature_selection: bool = True
    max_features: Optional[int] = None
    
    # Model Training
    quick_mode: bool = False  # Use fewer models for speed
    tune_hyperparameters: bool = True
    tuning_trials: int = 50
    
    # Ensemble
    create_ensemble: bool = True
    
print("AutoML configuration defined!")
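
Because `AutoMLConfig` is a dataclass, individual settings can be overridden without restating every default. The sketch below uses a trimmed stand-in, `DemoConfig` (a hypothetical name carrying only three of the fields), so that it runs on its own. `dataclasses.replace` and `dataclasses.asdict` are standard-library helpers and should behave the same way on the full class.

```python
from dataclasses import dataclass, asdict, replace
import json

@dataclass
class DemoConfig:
    """Trimmed stand-in for AutoMLConfig (illustration only)."""
    random_state: int = 42
    quick_mode: bool = False
    tuning_trials: int = 50

default_cfg = DemoConfig()

# Derive a faster configuration without mutating the default one
fast_cfg = replace(default_cfg, quick_mode=True, tuning_trials=10)

# A dataclass converts cleanly to a dict, which is handy for logging a run
cfg_json = json.dumps(asdict(fast_cfg), sort_keys=True)
print(cfg_json)
```

Storing the serialized configuration next to the results is one simple way to keep experiments reproducible.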

## 3. Auto Data Type Detection

First, we need to automatically detect:
- Problem type (classification vs regression)
- Column types (numeric, categorical, text, datetime)

In [None]:
class DataTypeDetector:
    """
    Automatically detect data types for columns and problem type.
    """
    
    def __init__(self, categorical_threshold: int = 20, id_threshold: float = 0.9):
        """
        Args:
            categorical_threshold: Max unique values to consider as categorical
            id_threshold: If unique ratio > threshold, likely an ID column
        """
        self.categorical_threshold = categorical_threshold
        self.id_threshold = id_threshold
    
    def detect_problem_type(self, y: pd.Series) -> ProblemType:
        """
        Detect if problem is classification or regression.
        """
        n_unique = y.nunique()
        dtype = y.dtype
        
        # Check if target is numeric with many unique values
        if np.issubdtype(dtype, np.floating) and n_unique > 20:
            return ProblemType.REGRESSION
        
        # Check if target is integer with many unique values
        if np.issubdtype(dtype, np.integer) and n_unique > 20:
            return ProblemType.REGRESSION
        
        # Classification
        if n_unique == 2:
            return ProblemType.BINARY_CLASSIFICATION
        else:
            return ProblemType.MULTICLASS_CLASSIFICATION
    
    def detect_column_types(self, df: pd.DataFrame, target_col: str = None) -> Dict[str, ColumnType]:
        """
        Detect type for each column.
        """
        column_types = {}
        
        for col in df.columns:
            if col == target_col:
                continue
            
            column_types[col] = self._detect_single_column(df[col], col)
        
        return column_types
    
    def _detect_single_column(self, series: pd.Series, col_name: str) -> ColumnType:
        """
        Detect type for a single column.
        """
        # Check for datetime
        if pd.api.types.is_datetime64_any_dtype(series):
            return ColumnType.DATETIME
        
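        # Check if column name suggests ID. The substring match is loose
        # ('id' also matches names such as 'width'), so skip float columns,
        # which are rarely identifiers.
        name_hints_id = any(id_word in col_name.lower() for id_word in ['id', 'index', 'key'])
        if name_hints_id and not pd.api.types.is_float_dtype(series):
            if series.nunique() / len(series) > self.id_threshold:
                return ColumnType.ID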
        
        # Check for numeric
        if pd.api.types.is_numeric_dtype(series):
            n_unique = series.nunique()
            
            # Binary
            if n_unique == 2:
                return ColumnType.BINARY
            
            # Categorical (low cardinality numeric)
            if n_unique <= self.categorical_threshold:
                return ColumnType.CATEGORICAL
            
            return ColumnType.NUMERIC
        
        # Check for text/categorical
        if pd.api.types.is_string_dtype(series) or series.dtype == 'object':
            n_unique = series.nunique()
            avg_len = series.astype(str).str.len().mean()
            
            # Long text (likely text data)
            if avg_len > 50:
                return ColumnType.TEXT
            
            # High cardinality might be ID
            if series.nunique() / len(series) > self.id_threshold:
                return ColumnType.ID
            
            return ColumnType.CATEGORICAL
        
        return ColumnType.CATEGORICAL
    
    def get_summary(self, column_types: Dict[str, ColumnType]) -> Dict[str, List[str]]:
        """
        Group columns by type.
        """
        summary = {ct.value: [] for ct in ColumnType}
        for col, col_type in column_types.items():
            summary[col_type.value].append(col)
        return {k: v for k, v in summary.items() if v}  # Remove empty

# Test
print("DataTypeDetector created!")
print("\nCapabilities:")
print("  - Detect problem type (binary/multiclass classification, regression)")
print("  - Detect column types (numeric, categorical, text, datetime, ID)")
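
The problem-type heuristic is easiest to understand on a few toy targets. The following is a minimal sketch, not the class itself: `infer_problem_type` is a hypothetical standalone helper that mirrors the "numeric with more than 20 unique values means regression" rule, and it returns plain strings instead of `ProblemType` members so that it runs without the rest of the notebook.

```python
import numpy as np
import pandas as pd

def infer_problem_type(y: pd.Series, max_classes: int = 20) -> str:
    """Hypothetical helper mirroring the numeric-target heuristic."""
    n_unique = y.nunique()
    # Numeric targets with many distinct values are treated as continuous
    if pd.api.types.is_numeric_dtype(y) and n_unique > max_classes:
        return "regression"
    return "binary_classification" if n_unique == 2 else "multiclass_classification"

targets = {
    "churn": pd.Series([0, 1, 0, 1, 1]),
    "price": pd.Series(np.linspace(100.0, 500.0, 50)),
    "zip_like": pd.Series(np.arange(30)),  # 30 distinct integer labels
    "species": pd.Series(["setosa", "versicolor", "virginica"] * 3),
}

results = {name: infer_problem_type(y) for name, y in targets.items()}
for name, kind in results.items():
    print(f"{name:10} -> {kind}")
```

Note the `zip_like` case: an integer-coded label with more than 20 classes is read as a regression target. When the target is really a class label, casting it to string (or passing the problem type explicitly) avoids the misclassification.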

## 4. Auto Exploratory Data Analysis

In [None]:
class AutoEDA:
    """
    Automatic Exploratory Data Analysis.
    """
    
    def __init__(self):
        self.report = {}
    
    def analyze(self, df: pd.DataFrame, target_col: str = None) -> Dict:
        """
        Perform comprehensive EDA.
        """
        self.report = {
            'basic_info': self._basic_info(df),
            'missing_values': self._missing_analysis(df),
            'duplicates': self._duplicate_analysis(df),
            'numeric_stats': self._numeric_analysis(df),
            'categorical_stats': self._categorical_analysis(df),
        }
        
        if target_col and target_col in df.columns:
            self.report['target_analysis'] = self._target_analysis(df, target_col)
        
        return self.report
    
    def _basic_info(self, df: pd.DataFrame) -> Dict:
        return {
            'n_rows': len(df),
            'n_columns': len(df.columns),
            'memory_mb': df.memory_usage(deep=True).sum() / 1024**2,
            'dtypes': df.dtypes.value_counts().to_dict()
        }
    
    def _missing_analysis(self, df: pd.DataFrame) -> Dict:
        missing = df.isnull().sum()
        missing_pct = (missing / len(df) * 100).round(2)
        return {
            'total_missing': missing.sum(),
            'columns_with_missing': (missing > 0).sum(),
            'missing_by_column': missing[missing > 0].to_dict(),
            'missing_pct_by_column': missing_pct[missing_pct > 0].to_dict()
        }
    
    def _duplicate_analysis(self, df: pd.DataFrame) -> Dict:
        return {
            'n_duplicates': df.duplicated().sum(),
            'duplicate_pct': (df.duplicated().sum() / len(df) * 100).round(2)
        }
    
    def _numeric_analysis(self, df: pd.DataFrame) -> Dict:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
        if not numeric_cols:
            return {}
        
        stats = df[numeric_cols].describe().to_dict()
        
        # Check for outliers (IQR method)
        outliers = {}
        for col in numeric_cols:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            outlier_count = ((df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)).sum()
            if outlier_count > 0:
                outliers[col] = outlier_count
        
        return {
            'columns': numeric_cols,
            'statistics': stats,
            'outliers': outliers
        }
    
    def _categorical_analysis(self, df: pd.DataFrame) -> Dict:
        cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
        if not cat_cols:
            return {}
        
        analysis = {}
        for col in cat_cols:
            analysis[col] = {
                'n_unique': df[col].nunique(),
                'top_values': df[col].value_counts().head(5).to_dict()
            }
        
        return {'columns': cat_cols, 'analysis': analysis}
    
    def _target_analysis(self, df: pd.DataFrame, target_col: str) -> Dict:
        target = df[target_col]
        return {
            'dtype': str(target.dtype),
            'n_unique': target.nunique(),
            'value_counts': target.value_counts().to_dict(),
            'distribution': target.describe().to_dict() if pd.api.types.is_numeric_dtype(target) else None
        }
    
    def print_report(self):
        """Print formatted EDA report."""
        print("=" * 60)
        print("AUTO EDA REPORT")
        print("=" * 60)
        
        # Basic Info
        info = self.report['basic_info']
        print(f"\nBasic Information:")
        print(f"  Rows: {info['n_rows']:,}")
        print(f"  Columns: {info['n_columns']}")
        print(f"  Memory: {info['memory_mb']:.2f} MB")
        
        # Missing Values
        missing = self.report['missing_values']
        print(f"\nMissing Values:")
        print(f"  Total: {missing['total_missing']:,}")
        print(f"  Columns with missing: {missing['columns_with_missing']}")
        
        # Duplicates
        dups = self.report['duplicates']
        print(f"\nDuplicates:")
        print(f"  Count: {dups['n_duplicates']:,} ({dups['duplicate_pct']}%)")
        
        # Numeric
        if self.report['numeric_stats']:
            print(f"\nNumeric Columns: {len(self.report['numeric_stats']['columns'])}")
            if self.report['numeric_stats']['outliers']:
                print(f"  Columns with outliers: {len(self.report['numeric_stats']['outliers'])}")
        
        # Categorical
        if self.report['categorical_stats']:
            print(f"\nCategorical Columns: {len(self.report['categorical_stats']['columns'])}")

print("AutoEDA created!")
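
The outlier count in `_numeric_analysis` uses the 1.5 × IQR rule. A small worked example makes the fences concrete; the five values are made up for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] counts as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (values < lower) | (values > upper)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower}, {upper})")
print(f"outliers={values[outlier_mask].tolist()}")
```

Here Q1 is 2 and Q3 is 4, so the fences sit at -1 and 7 and only the value 100 is flagged. The rule assumes a roughly symmetric distribution, so heavily skewed columns tend to report many "outliers" that are simply the long tail.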

## 5. Auto Data Preprocessing

In [None]:
class AutoPreprocessor:
    """
    Automatic data preprocessing pipeline.
    
    Handles:
    - Missing values (imputation)
    - Categorical encoding
    - Numeric scaling (RobustScaler, which is resistant to outliers)
    - ID column removal
    """
    
    def __init__(self, config: AutoMLConfig = None):
        self.config = config or AutoMLConfig()
        self.column_types = {}
        self.imputers = {}
        self.encoders = {}
        self.scalers = {}
        self.is_fitted = False
        
    def fit(self, df: pd.DataFrame, column_types: Dict[str, ColumnType]) -> 'AutoPreprocessor':
        """
        Fit preprocessing transformers.
        """
        self.column_types = column_types
        df = df.copy()
        
        # Group columns by type
        numeric_cols = [c for c, t in column_types.items() if t == ColumnType.NUMERIC]
        categorical_cols = [c for c, t in column_types.items() if t == ColumnType.CATEGORICAL]
        binary_cols = [c for c, t in column_types.items() if t == ColumnType.BINARY]
        
        # Fit imputers
        if self.config.handle_missing:
            # Numeric: median imputation
            if numeric_cols:
                self.imputers['numeric'] = SimpleImputer(strategy='median')
                self.imputers['numeric'].fit(df[numeric_cols])
            
            # Categorical: mode imputation
            if categorical_cols:
                self.imputers['categorical'] = SimpleImputer(strategy='most_frequent')
                self.imputers['categorical'].fit(df[categorical_cols])
        
        # Fit encoders
        if self.config.encode_categorical:
            for col in categorical_cols:
                le = LabelEncoder()
                # Handle NaN for fitting
                non_null = df[col].dropna().astype(str)
                le.fit(non_null)
                self.encoders[col] = le
        
        # Fit scalers
        if self.config.scale_numeric and numeric_cols:
            # Use RobustScaler for outlier resistance
            self.scalers['numeric'] = RobustScaler()
            # Impute first, then fit scaler
            if 'numeric' in self.imputers:
                imputed = self.imputers['numeric'].transform(df[numeric_cols])
            else:
                imputed = df[numeric_cols].values
            self.scalers['numeric'].fit(imputed)
        
        self.is_fitted = True
        return self
    
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Transform data using fitted transformers.
        """
        if not self.is_fitted:
            raise ValueError("Preprocessor not fitted. Call fit() first.")
        
        df = df.copy()
        
        # Group columns
        numeric_cols = [c for c, t in self.column_types.items() if t == ColumnType.NUMERIC and c in df.columns]
        categorical_cols = [c for c, t in self.column_types.items() if t == ColumnType.CATEGORICAL and c in df.columns]
        
        # Impute missing values
        if self.config.handle_missing:
            if 'numeric' in self.imputers and numeric_cols:
                df[numeric_cols] = self.imputers['numeric'].transform(df[numeric_cols])
            if 'categorical' in self.imputers and categorical_cols:
                df[categorical_cols] = self.imputers['categorical'].transform(df[categorical_cols])
        
        # Encode categorical
        if self.config.encode_categorical:
            for col in categorical_cols:
                if col in self.encoders:
                    le = self.encoders[col]
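                    # Handle unseen categories: look codes up in a dict
                    # (same codes as le.transform, but vectorized) and
                    # send anything not seen during fit to -1
                    mapping = {cls: idx for idx, cls in enumerate(le.classes_)}
                    df[col] = df[col].astype(str).map(mapping).fillna(-1).astype(int)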
        
        # Scale numeric
        if self.config.scale_numeric and 'numeric' in self.scalers and numeric_cols:
            df[numeric_cols] = self.scalers['numeric'].transform(df[numeric_cols])
        
        # Remove ID columns
        id_cols = [c for c, t in self.column_types.items() if t == ColumnType.ID and c in df.columns]
        df = df.drop(columns=id_cols, errors='ignore')
        
        return df
    
    def fit_transform(self, df: pd.DataFrame, column_types: Dict[str, ColumnType]) -> pd.DataFrame:
        """Fit and transform in one step."""
        self.fit(df, column_types)
        return self.transform(df)
    
    def get_feature_names(self) -> List[str]:
        """Get names of output features."""
        return [c for c, t in self.column_types.items() if t != ColumnType.ID]

print("AutoPreprocessor created!")
print("\nCapabilities:")
print("  - Missing value imputation (median/mode)")
print("  - Categorical encoding (Label Encoding)")
print("  - Numeric scaling (Robust Scaler)")
print("  - ID column removal")
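
The unseen-category fallback in `transform` can be checked on a toy column. This sketch builds a label-to-code lookup from `LabelEncoder.classes_`, which yields the same integer codes as calling `le.transform` on each value. The colour names are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(["red", "green", "blue", "green"])
test = pd.Series(["green", "purple", "red"])  # "purple" never appeared in training

# Fit on training data only, so test-set categories cannot leak in
le = LabelEncoder().fit(train)

# classes_ is sorted; each label's position is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# Labels missing from the mapping become NaN, then the -1 sentinel
encoded = test.map(mapping).fillna(-1).astype(int)

print(mapping)
print(encoded.tolist())
```

The `-1` sentinel is a convention rather than a requirement. Tree-based models can usually split on it without trouble, whereas linear models will read label codes (including `-1`) as an ordering, so one-hot encoding may be the safer choice there.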

## 6. Auto Feature Engineering

In [None]:
class AutoFeatureEngineer:
    """
    Automatic feature engineering.
    
    Creates:
    - Interaction features
    - Polynomial features (not implemented in this class; config.create_polynomial is unused here)
    - Aggregation features
    - Date features
    """
    
    def __init__(self, config: AutoMLConfig = None):
        self.config = config or AutoMLConfig()
        self.created_features = []
        
    def create_features(self, df: pd.DataFrame, column_types: Dict[str, ColumnType]) -> pd.DataFrame:
        """
        Create new features automatically.
        """
        df = df.copy()
        original_cols = df.columns.tolist()
        
        # Get numeric columns
        numeric_cols = [c for c, t in column_types.items() 
                       if t == ColumnType.NUMERIC and c in df.columns]
        
        # Create interaction features (top numeric columns)
        if self.config.create_interactions and len(numeric_cols) >= 2:
            # Limit to prevent explosion
            top_cols = numeric_cols[:5]
            for i, col1 in enumerate(top_cols):
                for col2 in top_cols[i+1:]:
                    # Multiplication
                    df[f'{col1}_x_{col2}'] = df[col1] * df[col2]
                    # Ratio (with small epsilon)
                    df[f'{col1}_div_{col2}'] = df[col1] / (df[col2] + 1e-8)
        
        # Create aggregation features
        if len(numeric_cols) >= 3:
            df['numeric_mean'] = df[numeric_cols].mean(axis=1)
            df['numeric_std'] = df[numeric_cols].std(axis=1)
            df['numeric_max'] = df[numeric_cols].max(axis=1)
            df['numeric_min'] = df[numeric_cols].min(axis=1)
            df['numeric_range'] = df['numeric_max'] - df['numeric_min']
        
        # Create datetime features
        datetime_cols = [c for c, t in column_types.items() 
                        if t == ColumnType.DATETIME and c in df.columns]
        for col in datetime_cols:
            df = self._create_datetime_features(df, col)
        
        # Track created features
        self.created_features = [c for c in df.columns if c not in original_cols]
        
        return df
    
    def _create_datetime_features(self, df: pd.DataFrame, col: str) -> pd.DataFrame:
        """
        Extract features from datetime column.
        """
        try:
            dt = pd.to_datetime(df[col])
            df[f'{col}_year'] = dt.dt.year
            df[f'{col}_month'] = dt.dt.month
            df[f'{col}_day'] = dt.dt.day
            df[f'{col}_dayofweek'] = dt.dt.dayofweek
            df[f'{col}_hour'] = dt.dt.hour
            df[f'{col}_is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
        except (ValueError, TypeError):
            # Column could not be parsed as datetime; leave df unchanged
            pass
        return df

print("AutoFeatureEngineer created!")
print("\nCapabilities:")
print("  - Interaction features (multiplication, ratio)")
print("  - Aggregation features (mean, std, max, min, range)")
print("  - Datetime features (year, month, day, dayofweek, etc.)")
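
To see how quickly pairwise interactions grow, and what the ratio feature does near zero, here is a small sketch on a made-up three-column frame. It uses `itertools.combinations`, which visits the same pairs as the nested loop in `create_features`.

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    "height": [1.0, 2.0, 4.0],
    "width":  [2.0, 0.0, 8.0],   # note the zero in the second row
    "depth":  [3.0, 5.0, 1.0],
})

eps = 1e-8  # same small constant as in create_features
new_cols = {}
for a, b in combinations(df.columns, 2):
    new_cols[f"{a}_x_{b}"] = df[a] * df[b]
    new_cols[f"{a}_div_{b}"] = df[a] / (df[b] + eps)

features = pd.concat([df, pd.DataFrame(new_cols)], axis=1)

print(features.shape)
print(features["height_div_width"].tolist())
```

Three columns give 3 pairs and 6 new features; with the cap of 5 columns used in the class, that becomes 10 pairs and 20 new features. The epsilon prevents division by zero but not huge values: the second row of `height_div_width` is on the order of 1e8, which is worth keeping in mind when the inputs have already been centered around zero by a scaler.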

## 7. Auto Feature Selection

In [None]:
class AutoFeatureSelector:
    """
    Automatic feature selection using multiple methods.
    
    Methods:
    - Variance threshold
    - Correlation-based
    - Statistical tests (f_classif, f_regression)
    - Mutual information
    - Model-based (Random Forest importance)
    """
    
    def __init__(self, problem_type: ProblemType, max_features: int = None):
        self.problem_type = problem_type
        self.max_features = max_features
        self.selected_features = []
        self.feature_importance = {}
        self.is_fitted = False
    
    def fit(self, X: pd.DataFrame, y: pd.Series) -> 'AutoFeatureSelector':
        """
        Fit feature selector and determine best features.
        """
        X = X.copy()
        
        # Ensure numeric
        X = X.select_dtypes(include=[np.number])
        
        # Handle any remaining NaN
        X = X.fillna(X.median())
        
        # Replace inf with 0 (same handling as ModelTrainer.train_all)
        X = X.replace([np.inf, -np.inf], 0)
        
        # Step 1: Remove zero variance features
        variances = X.var()
        zero_var_cols = variances[variances == 0].index.tolist()
        X = X.drop(columns=zero_var_cols)
        
        if X.empty:
            self.selected_features = []
            self.is_fitted = True
            return self
        
        # Step 2: Remove highly correlated features
        corr_matrix = X.corr().abs()
        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
        # The upper triangle flags only the later column of each correlated
        # pair, so dropping every flagged column still keeps one of the pair
        high_corr_cols = [col for col in upper.columns if any(upper[col] > 0.95)]
        X = X.drop(columns=high_corr_cols)
        
        # Step 3: Statistical feature selection
        if self.problem_type == ProblemType.REGRESSION:
            score_func = f_regression
        else:
            score_func = f_classif
        
        k = self.max_features or min(20, len(X.columns))
        k = min(k, len(X.columns))
        
        selector = SelectKBest(score_func=score_func, k=k)
        selector.fit(X, y)
        
        # Get feature scores (NaN scores count as 0)
        stat_scores = {
            col: (score if not np.isnan(score) else 0)
            for col, score in zip(X.columns, selector.scores_)
        }
        
        # Step 4: Model-based selection (Random Forest)
        if self.problem_type == ProblemType.REGRESSION:
            rf = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42, n_jobs=-1)
        else:
            rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42, n_jobs=-1)
        
        rf.fit(X, y)
        rf_importance = dict(zip(X.columns, rf.feature_importances_))
        
        # Combine scores (normalize and average). The maxima are computed
        # once, before combining, so every feature is normalized on the
        # same scale; rebuilding the dict also clears scores from earlier fits.
        max_stat = max(stat_scores.values()) or 1
        max_rf = max(rf_importance.values()) or 1
        self.feature_importance = {
            col: (stat_scores[col] / max_stat + rf_importance[col] / max_rf) / 2
            for col in X.columns
        }
        
        # Select top features
        sorted_features = sorted(self.feature_importance.items(), key=lambda x: x[1], reverse=True)
        self.selected_features = [f[0] for f in sorted_features[:k]]
        
        self.is_fitted = True
        return self
    
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Select only the chosen features.
        """
        if not self.is_fitted:
            raise ValueError("Selector not fitted. Call fit() first.")
        
        available = [f for f in self.selected_features if f in X.columns]
        return X[available]
    
    def fit_transform(self, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
        """Fit and transform in one step."""
        self.fit(X, y)
        return self.transform(X)
    
    def get_importance_df(self) -> pd.DataFrame:
        """Get feature importance as DataFrame."""
        return pd.DataFrame([
            {'feature': k, 'importance': v}
            for k, v in sorted(self.feature_importance.items(), key=lambda x: x[1], reverse=True)
        ])

print("AutoFeatureSelector created!")
print("\nMethods used:")
print("  - Variance threshold")
print("  - Correlation filtering")
print("  - Statistical tests (F-test)")
print("  - Random Forest importance")
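
The correlation filter relies on masking the correlation matrix down to its upper triangle. A tiny example with invented data shows which column ends up flagged.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # exact multiple of "a"
    "c": [5, 1, 4, 2, 3],
})

corr = X.corr().abs()

# Keep only entries above the diagonal so each pair is inspected once;
# everything on or below the diagonal becomes NaN
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is flagged if it correlates strongly with any earlier column
flagged = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(upper.round(2))
print(flagged)
```

Only the later column of a correlated pair is flagged (`b` here, not `a`), so removing a flagged column still leaves one member of the pair in the data. The 0.95 threshold is the value used in the class; a lower threshold removes more features at the risk of discarding useful signal.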

## 8. Multi-Model Training

In [None]:
class ModelTrainer:
    """
    Train and evaluate multiple models automatically.
    """
    
    def __init__(self, problem_type: ProblemType, config: AutoMLConfig = None):
        self.problem_type = problem_type
        self.config = config or AutoMLConfig()
        self.models = {}
        self.results = {}
        self.best_model_name = None
        self.best_model = None
    
    def get_models(self) -> Dict:
        """
        Get dictionary of models to train.
        """
        if self.problem_type == ProblemType.REGRESSION:
            models = {
                'Ridge': Ridge(random_state=self.config.random_state),
                'Lasso': Lasso(random_state=self.config.random_state),
                'ElasticNet': ElasticNet(random_state=self.config.random_state),
                'RandomForest': RandomForestRegressor(
                    n_estimators=100, random_state=self.config.random_state, n_jobs=-1
                ),
                'GradientBoosting': GradientBoostingRegressor(
                    n_estimators=100, random_state=self.config.random_state
                ),
                'XGBoost': xgb.XGBRegressor(
                    n_estimators=100, random_state=self.config.random_state, n_jobs=-1
                ),
                'LightGBM': lgb.LGBMRegressor(
                    n_estimators=100, random_state=self.config.random_state, n_jobs=-1, verbose=-1
                ),
            }
            if not self.config.quick_mode:
                models.update({
                    'ExtraTrees': ExtraTreesRegressor(
                        n_estimators=100, random_state=self.config.random_state, n_jobs=-1
                    ),
                    'KNN': KNeighborsRegressor(n_jobs=-1),
                    'SVR': SVR(),
                })
        else:
            models = {
                'LogisticRegression': LogisticRegression(
                    random_state=self.config.random_state, max_iter=1000, n_jobs=-1
                ),
                'RandomForest': RandomForestClassifier(
                    n_estimators=100, random_state=self.config.random_state, n_jobs=-1
                ),
                'GradientBoosting': GradientBoostingClassifier(
                    n_estimators=100, random_state=self.config.random_state
                ),
                'XGBoost': xgb.XGBClassifier(
                    n_estimators=100, random_state=self.config.random_state, 
                    n_jobs=-1, eval_metric='logloss'
                ),
                'LightGBM': lgb.LGBMClassifier(
                    n_estimators=100, random_state=self.config.random_state, 
                    n_jobs=-1, verbose=-1
                ),
            }
            if not self.config.quick_mode:
                models.update({
                    'ExtraTrees': ExtraTreesClassifier(
                        n_estimators=100, random_state=self.config.random_state, n_jobs=-1
                    ),
                    'KNN': KNeighborsClassifier(n_jobs=-1),
                    'NaiveBayes': GaussianNB(),
                    'SVC': SVC(probability=True, random_state=self.config.random_state),
                })
        
        return models
    
    def train_all(self, X_train: pd.DataFrame, y_train: pd.Series,
                  X_val: pd.DataFrame = None, y_val: pd.Series = None) -> Dict:
        """
        Train all models and evaluate.
        """
        self.models = self.get_models()
        
        # Ensure numeric
        X_train = X_train.select_dtypes(include=[np.number])
        if X_val is not None:
            X_val = X_val.select_dtypes(include=[np.number])
        
        # Handle any remaining issues
        X_train = X_train.fillna(0).replace([np.inf, -np.inf], 0)
        if X_val is not None:
            X_val = X_val.fillna(0).replace([np.inf, -np.inf], 0)
        
        print(f"Training {len(self.models)} models...")
        print("=" * 50)
        
        for name, model in self.models.items():
            start_time = time.time()
            
            try:
                # Train
                model.fit(X_train, y_train)
                train_time = time.time() - start_time
                
                # Evaluate
                if X_val is not None:
                    scores = self._evaluate_model(model, X_val, y_val)
                else:
                    # Use cross-validation
                    scores = self._cross_validate(model, X_train, y_train)
                
                scores['train_time'] = train_time
                self.results[name] = scores
                
                # Print progress
                metric = 'r2' if self.problem_type == ProblemType.REGRESSION else 'accuracy'
                print(f"  {name:20} | {metric}: {scores[metric]:.4f} | Time: {train_time:.2f}s")
                
            except Exception as e:
                print(f"  {name:20} | ERROR: {str(e)[:30]}")
                self.results[name] = {'error': str(e)}
        
        # Find best model
        self._select_best_model()
        
        return self.results
    
    def _evaluate_model(self, model, X, y) -> Dict:
        """
        Evaluate model on validation set.
        """
        y_pred = model.predict(X)
        
        if self.problem_type == ProblemType.REGRESSION:
            return {
                'r2': r2_score(y, y_pred),
                'rmse': np.sqrt(mean_squared_error(y, y_pred)),
                'mae': mean_absolute_error(y, y_pred)
            }
        else:
            scores = {
                'accuracy': accuracy_score(y, y_pred),
                'f1': f1_score(y, y_pred, average='weighted'),
                'precision': precision_score(y, y_pred, average='weighted'),
                'recall': recall_score(y, y_pred, average='weighted')
            }
            
            # ROC-AUC for binary
            if self.problem_type == ProblemType.BINARY_CLASSIFICATION:
                if hasattr(model, 'predict_proba'):
                    y_proba = model.predict_proba(X)[:, 1]
                    scores['roc_auc'] = roc_auc_score(y, y_proba)
            
            return scores
    
    def _cross_validate(self, model, X, y) -> Dict:
        """
        Perform cross-validation.
        """
        if self.problem_type == ProblemType.REGRESSION:
            scoring = 'r2'
            cv = KFold(n_splits=self.config.cv_folds, shuffle=True, random_state=self.config.random_state)
        else:
            scoring = 'accuracy'
            cv = StratifiedKFold(n_splits=self.config.cv_folds, shuffle=True, random_state=self.config.random_state)
        
        scores = cross_val_score(model, X, y, cv=cv, scoring=scoring, n_jobs=-1)
        
        if self.problem_type == ProblemType.REGRESSION:
            return {'r2': scores.mean(), 'r2_std': scores.std()}
        else:
            return {'accuracy': scores.mean(), 'accuracy_std': scores.std()}
    
    def _select_best_model(self):
        """
        Select the best performing model.
        """
        metric = 'r2' if self.problem_type == ProblemType.REGRESSION else 'accuracy'
        
        best_score = -np.inf
        for name, scores in self.results.items():
            if 'error' not in scores and scores.get(metric, -np.inf) > best_score:
                best_score = scores[metric]
                self.best_model_name = name
        
        if self.best_model_name:
            self.best_model = self.models[self.best_model_name]
            print(f"\nBest Model: {self.best_model_name} ({metric}: {best_score:.4f})")
    
    def get_leaderboard(self) -> pd.DataFrame:
        """
        Get model comparison leaderboard.
        """
        rows = []
        for name, scores in self.results.items():
            if 'error' not in scores:
                row = {'Model': name, **scores}
                rows.append(row)
        
        df = pd.DataFrame(rows)
        
        # Sort by primary metric
        metric = 'r2' if self.problem_type == ProblemType.REGRESSION else 'accuracy'
        if metric in df.columns:
            df = df.sort_values(metric, ascending=False)
        
        return df

print("ModelTrainer created!")
print("\nClassification models: LogisticRegression, RandomForest, GradientBoosting, XGBoost, LightGBM, ...")
print("Regression models: Ridge, Lasso, ElasticNet, RandomForest, XGBoost, LightGBM, ...")
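
The regression scores that `_evaluate_model` reports can be reproduced by hand from their definitions. The following is a minimal sketch using only the standard library; the function name `regression_metrics` and the sample values are illustrative assumptions, not part of the notebook's code.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE from their textbook definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(r ** 2 for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    return {
        'r2': 1.0 - ss_res / ss_tot,       # undefined if y_true is constant
        'rmse': math.sqrt(ss_res / n),
        'mae': sum(abs(r) for r in residuals) / n,
    }

# Illustrative values only
metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print({name: round(value, 4) for name, value in metrics.items()})
```

Because R2 compares the model against always predicting the mean, it can go negative for a model that does worse than that baseline, which is why `_select_best_model` starts its search from `-np.inf`.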

## 9. Hyperparameter Tuning

In [None]:
class HyperparameterTuner:
    """
    Automatic hyperparameter tuning using Optuna, with RandomizedSearchCV as a fallback.
    """
    
    def __init__(self, problem_type: ProblemType, n_trials: int = 50):
        self.problem_type = problem_type
        self.n_trials = n_trials
        self.best_params = {}
        self.best_model = None
    
    def tune(self, model_name: str, X_train: pd.DataFrame, y_train: pd.Series,
             X_val: pd.DataFrame = None, y_val: pd.Series = None):
        """
        Tune hyperparameters for specified model.
        """
        # Ensure numeric
        X_train = X_train.select_dtypes(include=[np.number]).fillna(0).replace([np.inf, -np.inf], 0)
        if X_val is not None:
            X_val = X_val.select_dtypes(include=[np.number]).fillna(0).replace([np.inf, -np.inf], 0)
        
        if OPTUNA_AVAILABLE:
            return self._tune_optuna(model_name, X_train, y_train, X_val, y_val)
        else:
            return self._tune_gridsearch(model_name, X_train, y_train)
    
    def _tune_optuna(self, model_name: str, X_train, y_train, X_val, y_val):
        """
        Tune using Optuna (Bayesian optimization).
        """
        def objective(trial):
            params = self._get_param_space_optuna(model_name, trial)
            model = self._create_model(model_name, params)
            
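            scoring = 'r2' if self.problem_type == ProblemType.REGRESSION else 'accuracy'
            
            if X_val is None:
                # No validation set: score with cross-validation instead of on the
                # training data, which would reward overfitting
                return cross_val_score(model, X_train, y_train, cv=3, scoring=scoring).mean()
            
            model.fit(X_train, y_train)
            y_pred = model.predict(X_val)
            
            if self.problem_type == ProblemType.REGRESSION:
                return r2_score(y_val, y_pred)
            else:
                return accuracy_score(y_val, y_pred)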
        
        # Run optimization
        optuna.logging.set_verbosity(optuna.logging.WARNING)
        study = optuna.create_study(direction='maximize')
        study.optimize(objective, n_trials=self.n_trials, show_progress_bar=True)
        
        self.best_params = study.best_params
        self.best_model = self._create_model(model_name, self.best_params)
        self.best_model.fit(X_train, y_train)
        
        return self.best_model, self.best_params, study.best_value
    
    def _tune_gridsearch(self, model_name: str, X_train, y_train):
        """
        Tune using RandomizedSearchCV (fallback when Optuna is unavailable).
        """
        param_grid = self._get_param_space_grid(model_name)
        model = self._create_model(model_name, {})
        
        scoring = 'r2' if self.problem_type == ProblemType.REGRESSION else 'accuracy'
        
        search = RandomizedSearchCV(
            model, param_grid, n_iter=min(20, self.n_trials),
            scoring=scoring, cv=3, n_jobs=-1, random_state=42
        )
        search.fit(X_train, y_train)
        
        self.best_params = search.best_params_
        self.best_model = search.best_estimator_
        
        return self.best_model, self.best_params, search.best_score_
    
    def _get_param_space_optuna(self, model_name: str, trial) -> Dict:
        """
        Define parameter search space for Optuna.
        """
        spaces = {
            'RandomForest': {
                'n_estimators': trial.suggest_int('n_estimators', 50, 300),
                'max_depth': trial.suggest_int('max_depth', 3, 20),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
            },
            'XGBoost': {
                'n_estimators': trial.suggest_int('n_estimators', 50, 300),
                'max_depth': trial.suggest_int('max_depth', 3, 15),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'subsample': trial.suggest_float('subsample', 0.6, 1.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            },
            'LightGBM': {
                'n_estimators': trial.suggest_int('n_estimators', 50, 300),
                'max_depth': trial.suggest_int('max_depth', 3, 15),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'num_leaves': trial.suggest_int('num_leaves', 20, 100),
                'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            },
            'GradientBoosting': {
                'n_estimators': trial.suggest_int('n_estimators', 50, 200),
                'max_depth': trial.suggest_int('max_depth', 3, 10),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
            },
        }
        return spaces.get(model_name, {})
    
    def _get_param_space_grid(self, model_name: str) -> Dict:
        """
        Define parameter grid for the randomized-search fallback.
        """
        spaces = {
            'RandomForest': {
                'n_estimators': [50, 100, 200],
                'max_depth': [5, 10, 15, None],
                'min_samples_split': [2, 5, 10],
            },
            'XGBoost': {
                'n_estimators': [50, 100, 200],
                'max_depth': [3, 6, 10],
                'learning_rate': [0.01, 0.1, 0.2],
            },
            'LightGBM': {
                'n_estimators': [50, 100, 200],
                'max_depth': [3, 6, 10],
                'learning_rate': [0.01, 0.1, 0.2],
            },
        }
        return spaces.get(model_name, {})
    
    def _create_model(self, model_name: str, params: Dict):
        """
        Create model instance with given parameters.
        """
        if self.problem_type == ProblemType.REGRESSION:
            models = {
                'RandomForest': RandomForestRegressor(random_state=42, n_jobs=-1, **params),
                'XGBoost': xgb.XGBRegressor(random_state=42, n_jobs=-1, **params),
                'LightGBM': lgb.LGBMRegressor(random_state=42, n_jobs=-1, verbose=-1, **params),
                'GradientBoosting': GradientBoostingRegressor(random_state=42, **params),
            }
        else:
            models = {
                'RandomForest': RandomForestClassifier(random_state=42, n_jobs=-1, **params),
                'XGBoost': xgb.XGBClassifier(random_state=42, n_jobs=-1, eval_metric='logloss', **params),
                'LightGBM': lgb.LGBMClassifier(random_state=42, n_jobs=-1, verbose=-1, **params),
                'GradientBoosting': GradientBoostingClassifier(random_state=42, **params),
            }
        if model_name not in models:
            raise ValueError(f"Tuning is not supported for model '{model_name}'")
        return models[model_name]

print("HyperparameterTuner created!")
print(f"\nOptuna available: {OPTUNA_AVAILABLE}")
print("Supports: RandomForest, XGBoost, LightGBM, GradientBoosting")
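
To make the fallback path concrete, here is a minimal standard-library sketch of randomized search over a discrete grid: enumerate the grid, score a random subset, keep the best. The helper `random_search` and the scoring function `toy_score` are hypothetical stand-ins (in the tuner above the score would come from cross-validation), and scikit-learn's own sampling details may differ.

```python
import itertools
import random

def random_search(param_grid, score_fn, n_iter=10, seed=42):
    """Score a random subset of grid points and return the best one."""
    keys = sorted(param_grid)
    candidates = [
        dict(zip(keys, values))
        for values in itertools.product(*(param_grid[k] for k in keys))
    ]
    rng = random.Random(seed)
    # Sample without replacement; never ask for more points than the grid holds
    sampled = rng.sample(candidates, min(n_iter, len(candidates)))
    trials = [(score_fn(params), params) for params in sampled]
    best_score, best_params = max(trials, key=lambda trial: trial[0])
    return best_params, best_score, trials

def toy_score(params):
    # Hypothetical stand-in for a cross-validated score (higher is better)
    return -(abs(params['max_depth'] - 6) / 10 + abs(params['learning_rate'] - 0.1))

grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.2],
}

best_params, best_score, trials = random_search(grid, toy_score, n_iter=10)
print(best_params, round(best_score, 3))
```

With 27 grid points and 10 samples, the search may miss the best combination; that is the trade-off against an exhaustive grid search, which would score all 27.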

## 10. Ensemble Methods

In [None]:
class Ensembler:
    """
    Create ensemble models from trained base models.
    
    Methods:
    - Voting (average predictions)
    - Stacking (meta-learner)
    """
    
    def __init__(self, problem_type: ProblemType):
        self.problem_type = problem_type
        self.ensemble_model = None
    
    def create_voting_ensemble(self, models: Dict, weights: List[float] = None):
        """
        Create voting ensemble from models.
        """
        estimators = [(name, model) for name, model in models.items()]
        
        if self.problem_type == ProblemType.REGRESSION:
            self.ensemble_model = VotingRegressor(estimators=estimators, weights=weights, n_jobs=-1)
        else:
            self.ensemble_model = VotingClassifier(
                estimators=estimators, voting='soft', weights=weights, n_jobs=-1
            )
        
        return self.ensemble_model
    
    def create_stacking_ensemble(self, models: Dict, meta_model=None):
        """
        Create stacking ensemble with meta-learner.
        """
        estimators = [(name, model) for name, model in models.items()]
        
        if self.problem_type == ProblemType.REGRESSION:
            final_estimator = meta_model or Ridge()
            self.ensemble_model = StackingRegressor(
                estimators=estimators, final_estimator=final_estimator, n_jobs=-1
            )
        else:
            final_estimator = meta_model or LogisticRegression(max_iter=1000)
            self.ensemble_model = StackingClassifier(
                estimators=estimators, final_estimator=final_estimator, n_jobs=-1
            )
        
        return self.ensemble_model
    
    def fit(self, X, y):
        """Fit ensemble model."""
        if self.ensemble_model is None:
            raise ValueError("Create ensemble first using create_voting_ensemble or create_stacking_ensemble")
        self.ensemble_model.fit(X, y)
        return self
    
    def predict(self, X):
        """Make predictions."""
        return self.ensemble_model.predict(X)

print("Ensembler created!")
print("\nMethods: Voting Ensemble, Stacking Ensemble")
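
The difference between soft voting (used by `create_voting_ensemble` for classification) and plain majority voting is easiest to see with numbers. This is a minimal sketch in plain Python; `soft_vote`, `hard_vote`, and the probabilities are illustrative assumptions rather than scikit-learn internals.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model class probabilities; return (winning class, averages)."""
    n_models = len(probabilities)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: averaged[c])
    return winner, averaged

def hard_vote(probabilities):
    """Each model casts one vote for its most likely class; majority wins."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probabilities]
    return max(sorted(set(votes)), key=votes.count)

# Three hypothetical models scoring one sample: [P(class 0), P(class 1)]
probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]

soft_winner, averaged = soft_vote(probs)
hard_winner = hard_vote(probs)
weighted_winner, _ = soft_vote(probs, weights=[1, 2, 2])
print(soft_winner, hard_winner, weighted_winner)
```

Here two of three models lean toward class 1, but the first model is confident enough in class 0 to win the soft vote. Soft voting therefore requires every base model to expose `predict_proba`.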

## 11. Complete AutoML Class

Now let's combine everything into a single, easy-to-use AutoML class.

In [None]:
class AutoML:
    """
    Complete AutoML System.
    
    Automatically handles:
    - Data type detection
    - EDA
    - Preprocessing
    - Feature engineering
    - Feature selection
    - Model training & comparison
    - Hyperparameter tuning
    - Ensemble creation
    """
    
    def __init__(self, config: AutoMLConfig = None):
        self.config = config or AutoMLConfig()
        
        # Components
        self.type_detector = DataTypeDetector()
        self.eda = AutoEDA()
        self.preprocessor = AutoPreprocessor(self.config)
        self.feature_engineer = AutoFeatureEngineer(self.config)
        self.feature_selector = None
        self.model_trainer = None
        self.tuner = None
        self.ensembler = None
        
        # State
        self.problem_type = None
        self.column_types = {}
        self.target_col = None
        self.best_model = None
        self.is_fitted = False
        
        # Results
        self.eda_report = {}
        self.model_results = {}
        self.leaderboard = None
    
    def fit(self, df: pd.DataFrame, target_col: str, 
            tune_best: bool = True, create_ensemble: bool = True) -> 'AutoML':
        """
        Fit AutoML pipeline on data.
        
        Args:
            df: Input DataFrame
            target_col: Name of target column
            tune_best: Whether to tune best model's hyperparameters
            create_ensemble: Whether to create ensemble of top models
        """
        print("="*60)
        print("AUTOML PIPELINE STARTING")
        print("="*60)
        
        self.target_col = target_col
        start_time = time.time()
        
        # Step 1: Detect problem type
        print("\n[1/8] Detecting problem type...")
        self.problem_type = self.type_detector.detect_problem_type(df[target_col])
        print(f"  Problem type: {self.problem_type.value}")
        
        # Step 2: Detect column types
        print("\n[2/8] Detecting column types...")
        self.column_types = self.type_detector.detect_column_types(df, target_col)
        summary = self.type_detector.get_summary(self.column_types)
        for col_type, cols in summary.items():
            print(f"  {col_type}: {len(cols)} columns")
        
        # Step 3: EDA
        print("\n[3/8] Performing EDA...")
        self.eda_report = self.eda.analyze(df, target_col)
        print(f"  Missing values: {self.eda_report['missing_values']['total_missing']}")
        print(f"  Duplicates: {self.eda_report['duplicates']['n_duplicates']}")
        
        # Step 4: Split data
        print("\n[4/8] Splitting data...")
        X = df.drop(columns=[target_col])
        y = df[target_col]
        
        # Encode target if needed
        if self.problem_type != ProblemType.REGRESSION:
            if y.dtype == 'object':
                self.target_encoder = LabelEncoder()
                y = pd.Series(self.target_encoder.fit_transform(y), index=y.index)
            else:
                self.target_encoder = None
        else:
            self.target_encoder = None
        
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=self.config.test_size, random_state=self.config.random_state,
            stratify=y if self.problem_type != ProblemType.REGRESSION else None
        )
        print(f"  Train: {len(X_train):,} samples")
        print(f"  Test: {len(X_test):,} samples")
        
        # Step 5: Preprocess
        print("\n[5/8] Preprocessing data...")
        X_train = self.preprocessor.fit_transform(X_train, self.column_types)
        X_test = self.preprocessor.transform(X_test)
        print(f"  Features after preprocessing: {X_train.shape[1]}")
        
        # Step 6: Feature engineering
        print("\n[6/8] Engineering features...")
        X_train = self.feature_engineer.create_features(X_train, self.column_types)
        X_test = self.feature_engineer.create_features(X_test, self.column_types)
        print(f"  New features created: {len(self.feature_engineer.created_features)}")
        print(f"  Total features: {X_train.shape[1]}")
        
        # Step 7: Feature selection
        if self.config.feature_selection:
            print("\n[7/8] Selecting features...")
            self.feature_selector = AutoFeatureSelector(
                self.problem_type, self.config.max_features
            )
            X_train = self.feature_selector.fit_transform(X_train, y_train)
            X_test = self.feature_selector.transform(X_test)
            print(f"  Selected features: {len(self.feature_selector.selected_features)}")
        else:
            print("\n[7/8] Skipping feature selection...")
        
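        # Step 8: Train models
        # Note: the held-out test split doubles as the validation set here, so models
        # are ranked (and later tuned) on the same data used to report their scores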
        print("\n[8/8] Training models...")
        self.model_trainer = ModelTrainer(self.problem_type, self.config)
        self.model_results = self.model_trainer.train_all(X_train, y_train, X_test, y_test)
        self.leaderboard = self.model_trainer.get_leaderboard()
        self.best_model = self.model_trainer.best_model
        
        # Optional: Tune best model
        if tune_best and self.config.tune_hyperparameters:
            print("\n[BONUS] Tuning best model hyperparameters...")
            self.tuner = HyperparameterTuner(self.problem_type, self.config.tuning_trials)
            try:
                tuned_model, best_params, best_score = self.tuner.tune(
                    self.model_trainer.best_model_name, X_train, y_train, X_test, y_test
                )
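                print(f"  Best params: {best_params}")
                print(f"  Tuned score: {best_score:.4f}")
                # Keep the tuned model only if it beats the untuned baseline
                metric = 'r2' if self.problem_type == ProblemType.REGRESSION else 'accuracy'
                baseline = self.model_results[self.model_trainer.best_model_name][metric]
                if best_score > baseline:
                    self.best_model = tuned_model
                else:
                    print(f"  Tuned model did not beat baseline ({baseline:.4f}); keeping original")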
            except Exception as e:
                print(f"  Tuning failed: {e}")
        
        # Optional: Create ensemble
        if create_ensemble and self.config.create_ensemble:
            print("\n[BONUS] Creating ensemble...")
            try:
                # Get top 3 models
                top_models = {name: self.model_trainer.models[name] 
                             for name in self.leaderboard['Model'].head(3).tolist()
                             if name in self.model_trainer.models}
                
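                if len(top_models) >= 2:
                    # Apply the same numeric cleanup that ModelTrainer.train_all uses
                    X_train_ens = X_train.select_dtypes(include=[np.number]).fillna(0).replace([np.inf, -np.inf], 0)
                    X_test_ens = X_test.select_dtypes(include=[np.number]).fillna(0).replace([np.inf, -np.inf], 0)
                    
                    self.ensembler = Ensembler(self.problem_type)
                    self.ensembler.create_voting_ensemble(top_models)
                    self.ensembler.fit(X_train_ens, y_train)
                    
                    # Evaluate ensemble
                    y_pred = self.ensembler.predict(X_test_ens)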
                    if self.problem_type == ProblemType.REGRESSION:
                        score = r2_score(y_test, y_pred)
                        print(f"  Ensemble R2: {score:.4f}")
                    else:
                        score = accuracy_score(y_test, y_pred)
                        print(f"  Ensemble Accuracy: {score:.4f}")
            except Exception as e:
                print(f"  Ensemble creation failed: {e}")
        
        # Store test data for later use
        self._X_test = X_test
        self._y_test = y_test
        
        self.is_fitted = True
        
        total_time = time.time() - start_time
        print(f"\n{'='*60}")
        print(f"AUTOML COMPLETE! Total time: {total_time:.1f}s")
        print(f"Best Model: {self.model_trainer.best_model_name}")
        print(f"{'='*60}")
        
        return self
    
    def predict(self, df: pd.DataFrame) -> np.ndarray:
        """
        Make predictions on new data.
        """
        if not self.is_fitted:
            raise ValueError("AutoML not fitted. Call fit() first.")
        
        # Preprocess
        X = self.preprocessor.transform(df)
        X = self.feature_engineer.create_features(X, self.column_types)
        
        if self.feature_selector:
            X = self.feature_selector.transform(X)
        
        # Ensure numeric
        X = X.select_dtypes(include=[np.number]).fillna(0).replace([np.inf, -np.inf], 0)
        
        # Predict
        predictions = self.best_model.predict(X)
        
        # Decode target if needed
        if self.target_encoder:
            predictions = self.target_encoder.inverse_transform(predictions.astype(int))
        
        return predictions
    
    def get_leaderboard(self) -> pd.DataFrame:
        """Get model comparison leaderboard."""
        return self.leaderboard
    
    def get_feature_importance(self) -> pd.DataFrame:
        """Get feature importance from best model."""
        if hasattr(self.best_model, 'feature_importances_'):
            if self.feature_selector:
                features = self.feature_selector.selected_features
            else:
                features = list(range(len(self.best_model.feature_importances_)))
            
            return pd.DataFrame({
                'feature': features[:len(self.best_model.feature_importances_)],
                'importance': self.best_model.feature_importances_
            }).sort_values('importance', ascending=False)
        return None
    
    def generate_report(self) -> str:
        """
        Generate comprehensive AutoML report.
        """
        report = []
        report.append("="*60)
        report.append("AUTOML REPORT")
        report.append("="*60)
        
        report.append(f"\nProblem Type: {self.problem_type.value}")
        report.append(f"Target Column: {self.target_col}")
        
        report.append(f"\n--- Data Summary ---")
        report.append(f"Total samples: {self.eda_report['basic_info']['n_rows']:,}")
        report.append(f"Total features: {self.eda_report['basic_info']['n_columns']}")
        report.append(f"Missing values: {self.eda_report['missing_values']['total_missing']}")
        
        report.append(f"\n--- Model Leaderboard ---")
        report.append(self.leaderboard.to_string())
        
        report.append(f"\n--- Best Model ---")
        report.append(f"Model: {self.model_trainer.best_model_name}")
        
        if self.tuner and self.tuner.best_params:
            report.append(f"\n--- Tuned Hyperparameters ---")
            for k, v in self.tuner.best_params.items():
                report.append(f"  {k}: {v}")
        
        return "\n".join(report)

print("="*60)
print("AUTOML SYSTEM READY!")
print("="*60)
print("""
Usage:
    automl = AutoML()
    automl.fit(df, target_col='target')
    predictions = automl.predict(new_df)
    print(automl.get_leaderboard())
""")

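One detail of `fit` and `predict` is worth spelling out: a string-valued classification target is label-encoded before training and decoded again after prediction. The sketch below mimics that round trip in plain Python, assuming classes are numbered in sorted order (as scikit-learn's `LabelEncoder` does); the helper name `fit_label_mapping` and the sample labels are hypothetical.

```python
def fit_label_mapping(labels):
    """Assign each distinct label an integer index, in sorted order."""
    classes = sorted(set(labels))
    to_index = {label: index for index, label in enumerate(classes)}
    return classes, to_index

# Illustrative string target
y_raw = ['malignant', 'benign', 'benign', 'malignant']

classes, to_index = fit_label_mapping(y_raw)
y_encoded = [to_index[label] for label in y_raw]   # what the models train on

# Pretend a model returned these integer predictions
predicted = [0, 1, 1]
decoded = [classes[index] for index in predicted]  # what predict() hands back

print(y_encoded, decoded)
```

The mapping must be learned once during fitting and reused at prediction time; refitting it on new data could assign different integers to the same labels.
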
## 12. Demo: AutoML in Action

Let's test our AutoML system on real datasets!

In [None]:
# Demo 1: Binary classification (Breast Cancer dataset)
from sklearn.datasets import load_breast_cancer

print("DEMO 1: Binary Classification (Breast Cancer Dataset)")
print("="*60)

# Load data
data = load_breast_cancer()
df_demo = pd.DataFrame(data.data, columns=data.feature_names)
df_demo['target'] = data.target

print(f"Dataset shape: {df_demo.shape}")
print(f"Target distribution: {df_demo['target'].value_counts().to_dict()}")

# Run AutoML
config = AutoMLConfig(
    quick_mode=True,  # Use fewer models for demo
    tune_hyperparameters=True,
    tuning_trials=20  # Fewer trials for demo
)

automl_clf = AutoML(config)
automl_clf.fit(df_demo, target_col='target', tune_best=True, create_ensemble=True)

In [None]:
# Show leaderboard
print("\nModel Leaderboard:")
print(automl_clf.get_leaderboard().to_string())

In [None]:
# Feature importance
importance_df = automl_clf.get_feature_importance()
if importance_df is not None:
    plt.figure(figsize=(10, 8))
    top_features = importance_df.head(15)
    plt.barh(range(len(top_features)), top_features['importance'].values)
    plt.yticks(range(len(top_features)), top_features['feature'].values)
    plt.xlabel('Importance')
    plt.title('Top 15 Feature Importance')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

In [None]:
# Demo 2: Regression
from sklearn.datasets import fetch_california_housing

print("\n" + "="*60)
print("DEMO 2: Regression (California Housing Dataset)")
print("="*60)

# Load data (use subset for speed)
data = fetch_california_housing()
df_reg = pd.DataFrame(data.data, columns=data.feature_names)
df_reg['target'] = data.target

# Use subset
df_reg = df_reg.sample(n=5000, random_state=42)

print(f"Dataset shape: {df_reg.shape}")
print(f"Target range: {df_reg['target'].min():.2f} - {df_reg['target'].max():.2f}")

# Run AutoML
config_reg = AutoMLConfig(
    quick_mode=True,
    tune_hyperparameters=True,
    tuning_trials=20
)

automl_reg = AutoML(config_reg)
automl_reg.fit(df_reg, target_col='target', tune_best=True, create_ensemble=True)

In [None]:
# Show regression leaderboard
print("\nRegression Model Leaderboard:")
print(automl_reg.get_leaderboard().to_string())

In [None]:
# Generate full report
print(automl_clf.generate_report())

## 13. Summary

### What We Built

A complete **AutoML System** that automates the entire machine learning pipeline:

| Component | Functionality |
|-----------|---------------|
| `DataTypeDetector` | Auto-detect column types and problem type |
| `AutoEDA` | Automatic exploratory data analysis |
| `AutoPreprocessor` | Missing values, encoding, scaling |
| `AutoFeatureEngineer` | Create interaction and aggregation features |
| `AutoFeatureSelector` | Select best features using multiple methods |
| `ModelTrainer` | Train and compare about 10 models per problem type |
| `HyperparameterTuner` | Bayesian optimization with Optuna, randomized search as fallback |
| `Ensembler` | Create voting and stacking ensembles |
| `AutoML` | Main class combining everything |

### Models Supported

**Classification:**
- Logistic Regression
- Random Forest
- Gradient Boosting
- XGBoost
- LightGBM
- Extra Trees
- KNN
- Naive Bayes
- SVM

**Regression:**
- Ridge, Lasso, ElasticNet
- Random Forest
- Gradient Boosting
- XGBoost
- LightGBM
- Extra Trees
- KNN
- SVR

### Usage

```python
# Simple usage
automl = AutoML()
automl.fit(df, target_col='target')
predictions = automl.predict(new_df)

# Get results
print(automl.get_leaderboard())
print(automl.generate_report())
```

In [None]:
# Final summary
print("="*60)
print("AUTOML SYSTEM - COMPLETE!")
print("="*60)

print("""
Components Built:
─────────────────
1. DataTypeDetector   - Auto-detect data types
2. AutoEDA            - Automatic EDA
3. AutoPreprocessor   - Handle missing, encode, scale
4. AutoFeatureEngineer - Create new features
5. AutoFeatureSelector - Select best features
6. ModelTrainer       - Train multiple models
7. HyperparameterTuner - Bayesian optimization
8. Ensembler          - Voting & Stacking
9. AutoML             - Main orchestrator

Capabilities:
─────────────
• Classification (binary & multiclass)
• Regression
• Auto preprocessing
• Auto feature engineering
• Auto feature selection
• 10+ model comparison
• Hyperparameter tuning (Optuna)
• Ensemble methods
• Report generation

This AutoML system combines EVERYTHING learned
in the previous 19 projects into one powerful tool!
""")