# Bonus: AutoML Tools

## Learning Objectives

In this bonus session, we'll explore **AutoML (Automated Machine Learning)** tools that can build models with minimal code - sometimes just a few clicks or one function call.

**Why AutoML?**
- **Speed**: Build and compare multiple models quickly
- **Baseline**: Get a quick benchmark for your manual models
- **Feature engineering**: Some tools automatically engineer features
- **Hyperparameter tuning**: Automatic optimization

**Warning:** AutoML is a tool, not a replacement for understanding. Use it to:
- Get quick baselines
- Discover promising model types
- Save time on routine tasks

But always validate results and understand what the model is doing!
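
To make the idea concrete, here is a minimal sketch of what an AutoML tool does under the hood: fit several candidate models, score each with cross-validation, and rank them. It assumes scikit-learn is installed and uses synthetic data from `make_classification` rather than the course dataset. The candidate list and the names `X_demo`, `y_demo`, and `leaderboard` are illustrative choices; real tools add tuning, ensembling, and time budgets on top.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the course dataset): 300 rows, 4 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

# Candidate models an AutoML tool might try (illustrative selection)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Score every candidate with 5-fold cross-validated ROC AUC
leaderboard = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# Rank the candidates, best first
best_name = max(leaderboard, key=leaderboard.get)
for name, score in sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: {score:.4f}")
print(f"Best candidate: {best_name}")
```

The plain 5-fold split used here suits this toy data; for time-ordered data such as stock returns, a chronological split is the safer choice.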


## Setup


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# Import our modules
sys.path.insert(0, str(Path("..").resolve()))

from features.engineering import prepare_features, prepare_target
from backtesting.engine import backtest_strategy, print_backtest_metrics


In [None]:
# Load data
data_path = Path("../data/saved/stock_session2.csv")
df = pd.read_csv(data_path, parse_dates=["timestamp"])

# Prepare features and target
feature_cols = ["X1", "X2", "X3", "X4"]
X = prepare_features(df, feature_cols=feature_cols)
y_returns = prepare_target(df, target_col="returns")
y_direction = (y_returns > 0).astype(int)  # Binary classification

# Train-test split (chronological, no shuffling); sized from X in case
# prepare_features returns fewer rows than df
split_idx = int(len(X) * 0.8)
X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y_direction.iloc[:split_idx]
y_test = y_direction.iloc[split_idx:]
y_train_returns = y_returns.iloc[:split_idx]
y_test_returns = y_returns.iloc[split_idx:]

print(f"Train: {len(X_train)} samples, Test: {len(X_test)} samples")


## Tool 1: H2O AutoML

**H2O AutoML** is one of the simplest AutoML tools: once the data is in an `H2OFrame`, a single `train()` call fits and ranks many models.

**Pros:**
- ✅ Extremely simple API
- ✅ Free and open-source
- ✅ Handles basic preprocessing (missing values, categorical encoding) automatically
- ✅ Compares many models automatically

**Cons:**
- ❌ Requires a Java runtime, which `pip install h2o` does not install
- ❌ Can be memory-intensive
- ❌ Less control over the process
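
Since H2O runs on the JVM, it can be worth checking for a Java runtime before calling `h2o.init()`. The sketch below uses only the standard library. The helper names `find_java` and `java_version_banner` are hypothetical, and the check only looks on `PATH`, so H2O may still locate Java by other means.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version`, or 'unknown' on failure."""
    try:
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True, text=True, timeout=10, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    # `java -version` conventionally writes its banner to stderr
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else "unknown"


java_path = find_java()
if java_path is None:
    print("No Java runtime found on PATH; install one before calling h2o.init().")
else:
    print(f"Found Java at {java_path}: {java_version_banner(java_path)}")
```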


In [None]:
# Install H2O if needed: pip install h2o (a Java runtime must also be installed)
try:
    import h2o
    from h2o.automl import H2OAutoML
    
    # Initialize H2O
    h2o.init()
    
    # Prepare data for H2O (needs to be H2OFrame)
    train_df = pd.concat([X_train, y_train.rename('target')], axis=1)
    test_df = pd.concat([X_test, y_test.rename('target')], axis=1)
    
    train_h2o = h2o.H2OFrame(train_df)
    test_h2o = h2o.H2OFrame(test_df)
    
    # Mark the target as categorical so H2O treats this as classification
    train_h2o['target'] = train_h2o['target'].asfactor()
    
    # Run AutoML: stops after 10 models or 60 seconds, whichever comes first
    aml = H2OAutoML(max_models=10, seed=42, max_runtime_secs=60)
    aml.train(x=X_train.columns.tolist(), y='target', training_frame=train_h2o)
    
    # View leaderboard (best models)
    print("=" * 80)
    print("H2O AutoML Leaderboard (Best Models)")
    print("=" * 80)
    print(aml.leaderboard.as_data_frame())
    
    # Get best model predictions
    predictions = aml.leader.predict(test_h2o)
    y_pred_proba = predictions.as_data_frame()['p1'].values  # Probability of class 1
    
    print(f"\nBest model: {aml.leader.model_id}")
    print(f"Test predictions shape: {y_pred_proba.shape}")
    
except ImportError:
    print("H2O not installed. Install with: pip install h2o")
    print("\nH2O AutoML usage:")
    print("```python")
    print("import h2o")
    print("from h2o.automl import H2OAutoML")
    print("h2o.init()")
    print("aml = H2OAutoML(max_models=10, seed=42)")
    print("aml.train(x=features, y='target', training_frame=train_data)")
    print("predictions = aml.leader.predict(test_data)")
    print("```")
except Exception as e:
    print(f"H2O error: {e}")
    print("\nNote: H2O requires a Java runtime. Install one and re-run this cell.")


## Tool 2: PyCaret

**PyCaret** is a low-code library that wraps scikit-learn and other libraries: `setup()` prepares the data, and `compare_models()` trains and ranks a set of candidate models with cross-validation.


In [None]:
# Install PyCaret if needed: pip install pycaret
try:
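    from pycaret.classification import setup, compare_models, predict_model
    
    # Prepare data
    train_data = pd.concat([X_train, y_train.rename('target')], axis=1)
    
    # Setup PyCaret: infers data types and builds the preprocessing pipeline.
    # PyCaret 2.x also needs silent=True to skip the interactive dtype prompt;
    # that argument was removed in 3.x, where passing it raises a TypeError.
    # By default the hold-out split and CV folds are shuffled, which can leak
    # future information on time-ordered data; see the `data_split_shuffle`
    # and `fold_strategy` options of setup().
    clf = setup(
        data=train_data,
        target='target',
        train_size=0.8,
        session_id=42,
        verbose=False
    )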
    
    # Compare models - ONE LINE!
    print("=" * 80)
    print("PyCaret: Comparing Multiple Models")
    print("=" * 80)
    best_model = compare_models(sort='Accuracy', n_select=1, verbose=False)
    
    print(f"\nBest model: {best_model}")
    
    # Make predictions
    test_data = pd.concat([X_test, y_test.rename('target')], axis=1)
    predictions = predict_model(best_model, data=test_data, verbose=False)
    
    print(f"\nPredictions shape: {predictions.shape}")
    print(f"Prediction columns: {predictions.columns.tolist()}")
    
except ImportError:
    print("PyCaret not installed. Install with: pip install pycaret")
    print("\nPyCaret usage:")
    print("```python")
    print("from pycaret.classification import setup, compare_models")
    print("clf = setup(data=train_df, target='target')")
    print("best_model = compare_models()")
    print("predictions = predict_model(best_model, data=test_df)")
    print("```")
except Exception as e:
    print(f"PyCaret error: {e}")
    print("\nNote: PyCaret's API differs between 2.x and 3.x; check the installed version if setup() rejects an argument.")


## Comparison: Manual vs AutoML

Let's compare our manual logistic regression with AutoML results (if available):


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Manual model (from Session 2)
manual_model = LogisticRegression(max_iter=1000, random_state=42)
manual_model.fit(X_train, y_train)
y_pred_manual = manual_model.predict(X_test)
y_proba_manual = manual_model.predict_proba(X_test)[:, 1]

manual_acc = accuracy_score(y_test, y_pred_manual)
manual_auc = roc_auc_score(y_test, y_proba_manual)

print("=" * 80)
print("MANUAL vs AUTOML COMPARISON")
print("=" * 80)
print(f"\nManual Logistic Regression:")
print(f"  Accuracy: {manual_acc:.4f}")
print(f"  ROC AUC: {manual_auc:.4f}")

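# If AutoML ran, compare (notebook cells run at module level, so check globals())
if 'y_pred_proba' in globals():
    # H2O results
    try:
        y_pred_h2o = (y_pred_proba >= 0.5).astype(int)
        h2o_acc = accuracy_score(y_test, y_pred_h2o)
        h2o_auc = roc_auc_score(y_test, y_pred_proba)
        print("\nH2O AutoML:")
        print(f"  Accuracy: {h2o_acc:.4f}")
        print(f"  ROC AUC: {h2o_auc:.4f}")
    except Exception as e:
        # Report scoring problems instead of silently hiding them
        print(f"\nCould not score H2O predictions: {e}")

# `predictions` is a pandas DataFrame only if the PyCaret cell ran successfully
pycaret_preds = globals().get('predictions')
if isinstance(pycaret_preds, pd.DataFrame):
    # Label column: 'prediction_label' in PyCaret 3.x, 'Label' in 2.x
    label_col = next((c for c in ('prediction_label', 'Label') if c in pycaret_preds.columns), None)
    if label_col is not None:
        pycaret_acc = accuracy_score(y_test, pycaret_preds[label_col].astype(int))
        print("\nPyCaret best model:")
        print(f"  Accuracy: {pycaret_acc:.4f}")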

print("\n" + "=" * 80)
print("Key Insight: AutoML can find better models, but manual models give you")
print("more control and understanding. Use AutoML for baselines and exploration!")
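

Before reading much into any of these accuracy figures, it helps to know what a no-skill classifier would score on the same labels. The sketch below computes the accuracy of always predicting the most frequent class. It assumes NumPy is installed; `majority_class_accuracy` is a hypothetical helper and `y_demo` is made-up data, not the course's test set.

```python
import numpy as np


def majority_class_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class in y_true."""
    y_true = np.asarray(y_true)
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / y_true.size


# Made-up labels: 6 up days (1) and 4 down days (0)
y_demo = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
baseline = majority_class_accuracy(y_demo)
print(f"Majority-class baseline accuracy: {baseline:.2f}")  # 0.60
```

Applied to the notebook's `y_test`, the same helper gives a floor against which `manual_acc` and `h2o_acc` can be judged; a model that does not beat it has learned little about direction.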


## Summary

**AutoML Tools Demonstrated:**

1. **H2O AutoML**: One function call (`H2OAutoML().train()`)
2. **PyCaret**: Simple setup and comparison (`compare_models()`)
3. **Dataiku**: GUI-based, point-and-click interface (mentioned only; not demonstrated in this notebook)

**When to Use AutoML:**
- ✅ Quick baselines and benchmarks
- ✅ Feature engineering ideas
- ✅ Model selection guidance
- ✅ Time-constrained projects

**When NOT to Use AutoML:**
- ❌ When you need to understand the model
- ❌ When domain knowledge is critical (like finance!)
- ❌ When you need fine-grained control
- ❌ When interpretability is required

**Key Takeaway:** AutoML is a powerful tool, but understanding the fundamentals (Sessions 1-4) is essential. AutoML can't replace domain expertise and careful validation!
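
One form of careful validation for time-ordered data is walk-forward evaluation, where each test block lies strictly after the data used for training. The following is a minimal standard-library sketch; `walk_forward_splits` and its parameters are hypothetical names chosen for illustration, and scikit-learn's `TimeSeriesSplit` offers comparable behaviour out of the box.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs with an expanding training window.

    Every test block starts where its training block ends, so no future
    observation is used to fit the model that predicts it.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits.")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(walk_forward_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx}")
```

Averaging a model's score over these folds gives a more honest estimate than a single shuffled split, because the model never sees data from after the period it is asked to predict.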
