# AutoML and Neural Architecture Search

**Automated Machine Learning and Neural Architecture Search**

Learn how to automate the ML pipeline, optimize hyperparameters, and discover optimal neural architectures.

---

## Table of Contents

1. [Introduction to AutoML](#1-introduction-to-automl)
2. [Hyperparameter Optimization](#2-hyperparameter-optimization)
3. [AutoML Frameworks](#3-automl-frameworks)
4. [Neural Architecture Search (NAS)](#4-neural-architecture-search-nas)
5. [Meta-Learning](#5-meta-learning)
6. [Automated Feature Engineering](#6-automated-feature-engineering)
7. [Best Practices](#7-best-practices)
8. [Interview Questions](#8-interview-questions)

---

In [None]:
# Install required packages
# !pip install optuna plotly scikit-learn scipy torch xgboost matplotlib seaborn numpy pandas

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_iris, load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances

import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)

# Matplotlib settings
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

---

## 1. Introduction to AutoML

### What is AutoML?

**Automated Machine Learning (AutoML)** automates the end-to-end process of applying machine learning to real-world problems.

**AutoML automates:**
1. **Data preprocessing** - handling missing values, encoding, scaling
2. **Feature engineering** - creating and selecting features
3. **Model selection** - choosing the best algorithm
4. **Hyperparameter tuning** - finding optimal hyperparameters
5. **Model ensembling** - combining multiple models

### Why AutoML?

**Benefits:**
- ✅ **Democratizes ML** - Non-experts can build models
- ✅ **Saves time** - Automates tedious tasks
- ✅ **Better performance** - Explores more options than manual tuning
- ✅ **Reduces human bias** - Systematic exploration

**Challenges:**
- ❌ **Computational cost** - Can be expensive
- ❌ **Black box** - Harder to understand
- ❌ **Overfitting risk** - Too many choices
- ❌ **Domain knowledge** - Still needed for feature engineering

### The ML Pipeline

```
Raw Data
   ↓
Data Preprocessing (AutoML)
   ↓
Feature Engineering (AutoML)
   ↓
Model Selection (AutoML)
   ↓
Hyperparameter Tuning (AutoML)
   ↓
Model Ensembling (AutoML)
   ↓
Trained Model
```

---

## 2. Hyperparameter Optimization

### 2.1 Grid Search

**Exhaustive search** over specified parameter grid.

**Pros:**
- Guaranteed to find best combination in grid
- Simple and deterministic

**Cons:**
- Exponential complexity: $O(n^d)$ where $n$ = values per param, $d$ = number of params
- Wastes time on unimportant parameters
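
To make the exponential growth concrete, the sketch below counts the combinations of a four-parameter grid with the same shape as the Random Forest grid used in this section. The helper name `grid_size` and the extra `max_features` values are illustrative assumptions.

```python
from itertools import product

# Same shape as the Random Forest grid in this section: 3 x 4 x 3 x 3 values
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

def grid_size(grid):
    """Number of combinations an exhaustive search must evaluate."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size

# Enumerate every combination explicitly, as grid search does
combinations = list(product(*param_grid.values()))
n_combinations = grid_size(param_grid)
n_fits_cv5 = n_combinations * 5  # each combination is fitted once per CV fold

# Adding one more hyperparameter with 4 candidate values multiplies the cost by 4
extended = dict(param_grid, max_features=[0.25, 0.5, 0.75, 1.0])
n_extended = grid_size(extended)

print(n_combinations, n_fits_cv5, n_extended)
```

Each additional hyperparameter multiplies the number of fits, which is why grid search is usually reserved for small grids.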

In [None]:
# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Dataset: {X_train.shape[0]} training samples, {X_test.shape[0]} test samples")
print(f"Features: {X_train.shape[1]}")
print(f"Classes: {len(np.unique(y))}")

In [None]:
%%time

# Grid Search for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

print(f"Grid Search: {np.prod([len(v) for v in param_grid.values()])} combinations to try\n")

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")

### 2.2 Random Search

**Randomly sample** from parameter distributions.

**Key Insight**: Not all hyperparameters are equally important. Random search explores more of the important dimensions.

**Pros:**
- More efficient than grid search
- Can specify budget (number of iterations)
- Often finds better results faster

**Cons:**
- Not guaranteed to find optimal
- Random, so results vary
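
The key insight can be illustrated without any model: for the same budget, a grid revisits the same few values of each hyperparameter, while random sampling tries a new value on every axis in every trial. The following is a toy sketch with two hypothetical hyperparameters scaled to [0, 1].

```python
import random

random.seed(0)
budget = 9  # same number of evaluations for both strategies

# Grid: 3 values per axis gives 9 points, but only 3 distinct values per hyperparameter
axis = [0.0, 0.5, 1.0]
grid_points = [(a, b) for a in axis for b in axis]

# Random: 9 points, each with its own value on every axis
random_points = [(random.random(), random.random()) for _ in range(budget)]

# Count how many different values of the first hyperparameter were tried
distinct_grid = len({p[0] for p in grid_points})
distinct_random = len({p[0] for p in random_points})

print(f"Grid:   {distinct_grid} distinct values of the first hyperparameter")
print(f"Random: {distinct_random} distinct values of the first hyperparameter")
```

If only the first hyperparameter matters, the grid has effectively spent nine evaluations to test three settings, whereas random search has tested nine.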

In [None]:
%%time

from scipy.stats import randint, uniform

# Random Search for Random Forest
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(5, 30),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}

print(f"Random Search: 100 random combinations to try\n")

rf = RandomForestClassifier(random_state=42)
random_search = RandomizedSearchCV(
    rf,
    param_distributions,
    n_iter=100,  # Budget: 100 iterations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
print(f"Test accuracy: {random_search.score(X_test, y_test):.4f}")

### 2.3 Bayesian Optimization

**Smart search** using probabilistic model of objective function.

**How it works:**
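1. Build a probabilistic surrogate model of $f(\theta)$ (classically a Gaussian Process; Optuna's default sampler uses a Tree-structured Parzen Estimator instead)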
2. Use acquisition function to decide next $\theta$ to try
3. Evaluate $f(\theta)$, update model
4. Repeat

**Acquisition Functions:**
- **Expected Improvement (EI)**: Balance exploration vs exploitation
- **Upper Confidence Bound (UCB)**: Optimistic estimate
- **Probability of Improvement (PI)**: Greedy

**Pros:**
- Typically more sample-efficient than random search
- Automatic exploration-exploitation balance
- Works well with expensive evaluations

**Cons:**
- More complex
- Slower per iteration (building surrogate model)
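
The acquisition functions listed above can be written down in a few lines once the surrogate provides a predictive mean `mu` and standard deviation `sigma` for a candidate. This is a minimal sketch for a maximization problem under a Gaussian predictive distribution; the parameter names `xi` (exploration margin) and `kappa` (confidence multiplier) are assumptions chosen for illustration, not taken from a specific library.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization: expected amount by which the candidate beats `best`."""
    improvement = mu - best - xi
    if sigma <= 0.0:
        return max(improvement, 0.0)  # no uncertainty left at this point
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)

def probability_of_improvement(mu, sigma, best, xi=0.0):
    """PI: probability that the candidate beats `best`, ignoring by how much."""
    if sigma <= 0.0:
        return 1.0 if mu - best - xi > 0.0 else 0.0
    return normal_cdf((mu - best - xi) / sigma)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB: optimistic estimate, mean plus a multiple of the uncertainty."""
    return mu + kappa * sigma

best_so_far = 0.96
# Candidate A: slightly below the best, low uncertainty
ei_confident = expected_improvement(mu=0.95, sigma=0.01, best=best_so_far)
# Candidate B: lower predicted mean, but much more uncertain
ei_uncertain = expected_improvement(mu=0.93, sigma=0.05, best=best_so_far)

print(f"EI (confident candidate): {ei_confident:.5f}")
print(f"EI (uncertain candidate): {ei_uncertain:.5f}")
```

Here the uncertain candidate receives the higher EI even though its predicted mean is lower, which is the exploration side of the trade-off.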

In [None]:
def objective(trial):
    """
    Optuna objective function for Random Forest
    
    trial: Optuna trial object for suggesting hyperparameters
    """
    # Suggest hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 5, 30),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'max_features': trial.suggest_float('max_features', 0.1, 1.0),
        'random_state': 42
    }
    
    # Train and evaluate
    rf = RandomForestClassifier(**params)
    score = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean()
    
    return score

# Create study
study = optuna.create_study(
    direction='maximize',  # Maximize accuracy
    sampler=optuna.samplers.TPESampler(seed=42)  # Tree-structured Parzen Estimator
)

print("Running Bayesian Optimization with Optuna...\n")

# Optimize
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"\nBest parameters: {study.best_params}")
print(f"Best CV score: {study.best_value:.4f}")

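# Train final model (random_state is not part of best_params, so pass it explicitly
# to match the configuration that was evaluated during the search)
best_rf = RandomForestClassifier(**study.best_params, random_state=42)
best_rf.fit(X_train, y_train)
test_acc = best_rf.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")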

In [None]:
# Visualize optimization history
fig = plot_optimization_history(study)
fig.update_layout(title='Bayesian Optimization History', width=800, height=400)
fig.show()

# Hyperparameter importances
fig = plot_param_importances(study)
fig.update_layout(title='Hyperparameter Importances', width=800, height=400)
fig.show()

### 2.4 Comparison: Grid vs Random vs Bayesian

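| Method | Iterations | Relative cost | Typical result | Notes |
|--------|-----------|---------------|----------------|-------|
| Grid Search | 108 | Highest (fixed by grid size) | Good | Exhaustive but wasteful |
| Random Search | 100 | Set by the trial budget | Good | Simple, often effective |
| Bayesian (Optuna) | 50 | Fewest trials here | Often best | Typically the most sample-efficient |

Exact timings and scores vary between runs; compare the outputs of the three cells above.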

**Rule of thumb:**
- **Grid Search**: < 4 hyperparameters, small grids
- **Random Search**: Quick baseline, many hyperparameters
- **Bayesian**: Expensive evaluations, need best performance

### 2.5 Advanced: Multi-Objective Optimization

Optimize **multiple objectives** simultaneously (e.g., accuracy AND inference time).
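
With more than one objective there is usually no single best trial; the result is a set of Pareto-optimal trade-offs, where no point can be improved on one objective without getting worse on another. A minimal sketch of that dominance check, using made-up (accuracy, inference time) pairs:

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on both objectives and strictly better on one.

    Points are (accuracy, inference_time): accuracy is maximized, time is minimized.
    """
    acc_a, time_a = a
    acc_b, time_b = b
    no_worse = acc_a >= acc_b and time_a <= time_b
    strictly_better = acc_a > acc_b or time_a < time_b
    return no_worse and strictly_better

def pareto_front(points):
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (accuracy, inference time in seconds) results
candidates = [
    (0.95, 0.020),
    (0.96, 0.050),  # dominated by (0.96, 0.040): same accuracy, slower
    (0.94, 0.030),  # dominated by (0.95, 0.020): less accurate and slower
    (0.96, 0.040),
    (0.90, 0.010),
]

front = pareto_front(candidates)
for acc, seconds in sorted(front):
    print(f"accuracy={acc:.2f}, time={seconds:.3f}s")
```

Choosing among the remaining points is a judgment call about how much accuracy a given latency budget is worth.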

In [None]:
import time

def multi_objective(trial):
    """
    Optimize both accuracy and inference time
    """
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 10, 100),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'random_state': 42
    }
    
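    # Objective 1: Accuracy (maximize), estimated by cross-validation on the
    # training data so the test set is not used for model selection
    rf = RandomForestClassifier(**params)
    accuracy = cross_val_score(rf, X_train, y_train, cv=3, scoring='accuracy').mean()

    # Objective 2: Inference time (minimize), measured on a fitted model
    rf.fit(X_train, y_train)
    start = time.perf_counter()
    _ = rf.predict(X_train)
    inference_time = time.perf_counter() - start

    return accuracy, inference_time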

# Multi-objective study
study_multi = optuna.create_study(
    directions=['maximize', 'minimize'],  # Maximize acc, minimize time
    sampler=optuna.samplers.NSGAIISampler(seed=42)
)

print("Running Multi-Objective Optimization...\n")
study_multi.optimize(multi_objective, n_trials=50, show_progress_bar=True)

print(f"\nFound {len(study_multi.best_trials)} Pareto-optimal solutions")
print("\nTop 3 trade-offs:")
for i, trial in enumerate(study_multi.best_trials[:3]):
    print(f"{i+1}. Accuracy: {trial.values[0]:.4f}, Time: {trial.values[1]:.4f}s, Params: {trial.params}")

---

## 3. AutoML Frameworks

### 3.1 Combined Algorithm Selection and Hyperparameter Optimization (CASH)

**CASH Problem**: Jointly optimize:
1. Which algorithm to use?
2. What hyperparameters?

**Search Space**: $\{(A_j, \theta) : j \in \{1, \dots, k\},\ \theta \in \Theta_j\}$, where $\Theta_j$ is the hyperparameter space of algorithm $A_j$

In [None]:
def cash_objective(trial):
    """
    Combined Algorithm Selection and Hyperparameter Optimization
    
    Try different algorithms with their hyperparameters
    """
    # Select algorithm
    algorithm = trial.suggest_categorical('algorithm', ['rf', 'gb', 'svm', 'mlp'])
    
    if algorithm == 'rf':
        # Random Forest hyperparameters
        params = {
            'n_estimators': trial.suggest_int('rf_n_estimators', 50, 200),
            'max_depth': trial.suggest_int('rf_max_depth', 5, 20),
            'min_samples_split': trial.suggest_int('rf_min_samples_split', 2, 10),
            'random_state': 42
        }
        model = RandomForestClassifier(**params)
    
    elif algorithm == 'gb':
        # Gradient Boosting hyperparameters
        params = {
            'n_estimators': trial.suggest_int('gb_n_estimators', 50, 200),
            'max_depth': trial.suggest_int('gb_max_depth', 3, 10),
            'learning_rate': trial.suggest_float('gb_learning_rate', 0.01, 0.3, log=True),
            'random_state': 42
        }
        model = GradientBoostingClassifier(**params)
    
    elif algorithm == 'svm':
        # SVM hyperparameters
        params = {
            'C': trial.suggest_float('svm_C', 0.1, 100, log=True),
            'gamma': trial.suggest_float('svm_gamma', 1e-4, 1, log=True),
            'kernel': trial.suggest_categorical('svm_kernel', ['rbf', 'poly']),
            'random_state': 42
        }
        model = SVC(**params)
    
    else:  # mlp
        # Neural Network hyperparameters
        params = {
            'hidden_layer_sizes': (trial.suggest_int('mlp_hidden_size', 50, 200),),
            'alpha': trial.suggest_float('mlp_alpha', 1e-5, 1e-2, log=True),
            'learning_rate_init': trial.suggest_float('mlp_lr', 1e-4, 1e-2, log=True),
            'random_state': 42,
            'max_iter': 500
        }
        model = MLPClassifier(**params)
    
    # Evaluate
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean()
    
    return score

# CASH optimization
study_cash = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)
)

print("Running CASH (Combined Algorithm Selection + Hyperparameter Optimization)...\n")
study_cash.optimize(cash_objective, n_trials=100, show_progress_bar=True)

print(f"\nBest algorithm: {study_cash.best_params['algorithm']}")
print(f"Best parameters: {study_cash.best_params}")
print(f"Best CV score: {study_cash.best_value:.4f}")

In [None]:
# Analyze which algorithms were tried
algorithms_tried = [trial.params['algorithm'] for trial in study_cash.trials]
algorithm_scores = {}

for algo in ['rf', 'gb', 'svm', 'mlp']:
    scores = [trial.value for trial in study_cash.trials if trial.params['algorithm'] == algo]
    if scores:
        algorithm_scores[algo] = {
            'mean': np.mean(scores),
            'max': np.max(scores),
            'count': len(scores)
        }

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Algorithm distribution
algos, counts = np.unique(algorithms_tried, return_counts=True)
ax1.bar(algos, counts, color='steelblue')
ax1.set_xlabel('Algorithm')
ax1.set_ylabel('Number of Trials')
ax1.set_title('Algorithm Selection Distribution')
ax1.grid(True, alpha=0.3)

# Algorithm performance
algos = list(algorithm_scores.keys())
means = [algorithm_scores[a]['mean'] for a in algos]
maxs = [algorithm_scores[a]['max'] for a in algos]

x = np.arange(len(algos))
width = 0.35
ax2.bar(x - width/2, means, width, label='Mean Score', color='lightcoral')
ax2.bar(x + width/2, maxs, width, label='Max Score', color='lightgreen')
ax2.set_xlabel('Algorithm')
ax2.set_ylabel('Accuracy')
ax2.set_title('Algorithm Performance Comparison')
ax2.set_xticks(x)
ax2.set_xticklabels(algos)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nAlgorithm Performance Summary:")
for algo, stats in algorithm_scores.items():
    print(f"{algo.upper()}: Mean={stats['mean']:.4f}, Max={stats['max']:.4f}, Trials={stats['count']}")

### 3.2 Ensemble Selection

**Idea**: Combine multiple models found during optimization for better performance.
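
One simple way to combine models is soft voting: average the class probabilities predicted by each model and pick the class with the highest average. The sketch below uses made-up probabilities for three hypothetical models to show how this can differ from a majority vote over hard predictions.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model class probabilities; return (averaged, predicted class)."""
    n_models = len(probabilities)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities))
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted

# Hypothetical predictions for one sample from three models
model_probs = [
    [0.90, 0.10],  # confident in class 0
    [0.40, 0.60],  # weakly prefers class 1
    [0.45, 0.55],  # weakly prefers class 1
]

averaged, predicted = soft_vote(model_probs)

# Hard voting: each model casts one vote for its most likely class
hard_votes = [max(range(2), key=lambda c: p[c]) for p in model_probs]
hard_prediction = max(set(hard_votes), key=hard_votes.count)

print(f"Soft vote: class {predicted} with averaged probabilities {averaged}")
print(f"Hard vote: class {hard_prediction}")
```

Soft voting lets one confident model outweigh two uncertain ones, which only helps when the predicted probabilities are reasonably well calibrated.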

In [None]:
from sklearn.ensemble import VotingClassifier

# Get top 5 completed trials from CASH optimization
completed = [t for t in study_cash.trials if t.value is not None]
top_trials = sorted(completed, key=lambda t: t.value, reverse=True)[:5]

print("Top 5 models:")
models = []
for i, trial in enumerate(top_trials):
    print(f"{i+1}. {trial.params['algorithm']}: {trial.value:.4f}")

    # Recreate model: strip the "<algo>_" prefix from the Optuna parameter names
    algo = trial.params['algorithm']
    params = {k.replace(f"{algo}_", "", 1): v for k, v in trial.params.items() if k.startswith(f"{algo}_")}
    params['random_state'] = 42

    if algo == 'rf':
        model = RandomForestClassifier(**params)
    elif algo == 'gb':
        model = GradientBoostingClassifier(**params)
    elif algo == 'svm':
        model = SVC(**params, probability=True)  # Need probabilities for voting
    else:
        # Map the Optuna names back to MLPClassifier argument names
        params['hidden_layer_sizes'] = (params.pop('hidden_size'),)
        params['learning_rate_init'] = params.pop('lr')
        params['max_iter'] = 500
        model = MLPClassifier(**params)

    models.append((f"model_{i}", model))

# Create ensemble
ensemble = VotingClassifier(estimators=models, voting='soft', n_jobs=-1)
ensemble.fit(X_train, y_train)

# Evaluate the ensemble and the best single model on the same held-out test set
# (VotingClassifier fits clones, so the best single model is fitted separately)
ensemble_score = ensemble.score(X_test, y_test)
best_single_score = models[0][1].fit(X_train, y_train).score(X_test, y_test)

print(f"\nBest single model (CV): {top_trials[0].value:.4f}")
print(f"Best single model (test): {best_single_score:.4f}")
print(f"Ensemble (test): {ensemble_score:.4f}")
print(f"Improvement on test: {ensemble_score - best_single_score:.4f}")

---

## 4. Neural Architecture Search (NAS)

### What is NAS?

**Neural Architecture Search** automates the design of neural network architectures.

**NAS Components:**
1. **Search Space**: Set of possible architectures
2. **Search Strategy**: How to explore search space
3. **Performance Estimation**: How to evaluate architectures

### 4.1 Search Space Design

**Common choices:**
- Number of layers
- Number of units per layer
- Activation functions
- Skip connections
- Layer types (Conv, Dense, Pooling)
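
Even a small search space grows quickly. As a rough illustration, the sketch below counts the architectures in a space with one to three hidden layers, four candidate widths per layer, three activations, and four dropout rates; these values are assumed to match the small MLP space used in the Random NAS example of this section.

```python
layer_sizes = [32, 64, 128, 256]
activations = ["relu", "tanh", "elu"]
dropouts = [0.0, 0.1, 0.2, 0.3]
max_layers = 3

# A network with k hidden layers has len(layer_sizes) ** k width configurations
width_configs = sum(len(layer_sizes) ** k for k in range(1, max_layers + 1))

# Activation and dropout are chosen once per network, independently of the widths
total_architectures = width_configs * len(activations) * len(dropouts)

print(f"Width configurations: {width_configs}")
print(f"Total architectures:  {total_architectures}")
```

Twenty random trials cover only a small fraction of such a space, which is one reason the search strategy matters.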

### 4.2 Random NAS

In [None]:
# Prepare data for neural networks
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train_scaled)
y_train_tensor = torch.LongTensor(y_train)
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.LongTensor(y_test)

print(f"Data prepared for neural networks")
print(f"Input dimension: {X_train_scaled.shape[1]}")
print(f"Output dimension: {len(np.unique(y_train))}")

In [None]:
class SearchableNetwork(nn.Module):
    """
    Neural network with flexible architecture
    
    Can specify:
    - Number of layers
    - Hidden units per layer
    - Activation function
    - Dropout rate
    """
    
    def __init__(self, input_dim, output_dim, hidden_sizes, activation='relu', dropout=0.0):
        super(SearchableNetwork, self).__init__()
        
        self.activation_name = activation
        self.dropout = dropout
        
        # Build layers
        layers = []
        prev_size = input_dim
        
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.BatchNorm1d(hidden_size))
            
            # Activation
            if activation == 'relu':
                layers.append(nn.ReLU())
            elif activation == 'tanh':
                layers.append(nn.Tanh())
            elif activation == 'elu':
                layers.append(nn.ELU())
            
            # Dropout
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            
            prev_size = hidden_size
        
        # Output layer
        layers.append(nn.Linear(prev_size, output_dim))
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)

def train_and_evaluate_architecture(hidden_sizes, activation, dropout, num_epochs=20):
    """
    Train and evaluate a specific architecture
    """
    # Create model
    model = SearchableNetwork(
        input_dim=X_train_scaled.shape[1],
        output_dim=len(np.unique(y_train)),
        hidden_sizes=hidden_sizes,
        activation=activation,
        dropout=dropout
    )
    
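    # Hold out part of the training data for architecture selection so the
    # test set stays untouched (fixed seed: every architecture sees the same split)
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train_scaled, y_train, test_size=0.2, random_state=42, stratify=y_train
    )
    X_fit_t, y_fit_t = torch.FloatTensor(X_fit), torch.LongTensor(y_fit)
    X_val_t, y_val_t = torch.FloatTensor(X_val), torch.LongTensor(y_val)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training (full-batch)
    model.train()
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        outputs = model(X_fit_t)
        loss = criterion(outputs, y_fit_t)
        loss.backward()
        optimizer.step()

    # Evaluation on the validation split
    model.eval()
    with torch.no_grad():
        outputs = model(X_val_t)
        _, predicted = torch.max(outputs, 1)
        accuracy = (predicted == y_val_t).float().mean().item()

    return accuracy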

# Random NAS: Try random architectures
print("Random Neural Architecture Search...\n")

architectures = []
for i in range(20):
    # Random architecture
    num_layers = np.random.randint(1, 4)
    hidden_sizes = [np.random.choice([32, 64, 128, 256]) for _ in range(num_layers)]
    activation = np.random.choice(['relu', 'tanh', 'elu'])
    dropout = np.random.choice([0.0, 0.1, 0.2, 0.3])
    
    # Train and evaluate
    accuracy = train_and_evaluate_architecture(hidden_sizes, activation, dropout)
    
    architectures.append({
        'hidden_sizes': hidden_sizes,
        'activation': activation,
        'dropout': dropout,
        'accuracy': accuracy
    })
    
    print(f"Trial {i+1}/20: {hidden_sizes}, {activation}, dropout={dropout:.1f} → Acc: {accuracy:.4f}")

# Best architecture
best_arch = max(architectures, key=lambda x: x['accuracy'])
print(f"\nBest Architecture:")
print(f"  Hidden sizes: {best_arch['hidden_sizes']}")
print(f"  Activation: {best_arch['activation']}")
print(f"  Dropout: {best_arch['dropout']}")
print(f"  Accuracy: {best_arch['accuracy']:.4f}")

### 4.3 Bayesian NAS with Optuna

In [None]:
def nas_objective(trial):
    """
    NAS objective function for Optuna
    """
    # Search space
    num_layers = trial.suggest_int('num_layers', 1, 4)
    hidden_sizes = [trial.suggest_categorical(f'hidden_size_{i}', [32, 64, 128, 256]) 
                   for i in range(num_layers)]
    activation = trial.suggest_categorical('activation', ['relu', 'tanh', 'elu'])
    dropout = trial.suggest_float('dropout', 0.0, 0.5)
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
    
    # Create and train model
    model = SearchableNetwork(
        input_dim=X_train_scaled.shape[1],
        output_dim=len(np.unique(y_train)),
        hidden_sizes=hidden_sizes,
        activation=activation,
        dropout=dropout
    )
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
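    # Hold out part of the training data for architecture selection so the
    # test set stays untouched (fixed seed: every trial sees the same split)
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train_scaled, y_train, test_size=0.2, random_state=42, stratify=y_train
    )
    X_fit_t, y_fit_t = torch.FloatTensor(X_fit), torch.LongTensor(y_fit)
    X_val_t, y_val_t = torch.FloatTensor(X_val), torch.LongTensor(y_val)

    # Training (full-batch)
    model.train()
    for epoch in range(20):
        optimizer.zero_grad()
        outputs = model(X_fit_t)
        loss = criterion(outputs, y_fit_t)
        loss.backward()
        optimizer.step()

    # Evaluation on the validation split
    model.eval()
    with torch.no_grad():
        outputs = model(X_val_t)
        _, predicted = torch.max(outputs, 1)
        accuracy = (predicted == y_val_t).float().mean().item()

    return accuracy

# Bayesian NAS
study_nas = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)
)

print("Running Bayesian NAS with Optuna...\n")
study_nas.optimize(nas_objective, n_trials=50, show_progress_bar=True)

print(f"\nBest Architecture:")
print(f"Parameters: {study_nas.best_params}")
print(f"Validation accuracy: {study_nas.best_value:.4f}")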

In [None]:
# Visualize NAS results
fig = plot_optimization_history(study_nas)
fig.update_layout(title='NAS Optimization History', width=800, height=400)
fig.show()

fig = plot_param_importances(study_nas)
fig.update_layout(title='Architecture Component Importances', width=800, height=500)
fig.show()

### 4.4 Advanced NAS Methods

**Evolutionary NAS**:
- Population of architectures
- Mutation and crossover
- Selection based on performance

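**Reinforcement Learning NAS**:
- Controller RNN generates architectures
- Controller trained with policy gradients (REINFORCE in the original NAS work; PPO in NASNet)
- Examples: NASNet; the EfficientNet-B0 baseline was also found with an RL-based search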

**Differentiable NAS (DARTS)**:
- Continuous relaxation of search space
- Gradient-based optimization
- Much faster than RL-based methods

**One-Shot NAS**:
- Train supernet containing all possible architectures
- Sample sub-networks for evaluation
- Example: Once-for-All Networks
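
The evolutionary approach listed above can be sketched in a few lines. This toy version uses mutation and truncation selection only (crossover is omitted for brevity), and `toy_fitness` is a hypothetical stand-in for validation accuracy; a real search would train each candidate network instead.

```python
import random

random.seed(0)
LAYER_SIZES = [32, 64, 128, 256]
ACTIVATIONS = ["relu", "tanh", "elu"]
MAX_LAYERS = 3

def random_architecture():
    n_layers = random.randint(1, MAX_LAYERS)
    return {
        "hidden_sizes": [random.choice(LAYER_SIZES) for _ in range(n_layers)],
        "activation": random.choice(ACTIVATIONS),
    }

def mutate(arch):
    """Return a copy of `arch` with one random change."""
    child = {"hidden_sizes": list(arch["hidden_sizes"]), "activation": arch["activation"]}
    op = random.choice(["resize", "add", "remove", "activation"])
    if op == "resize":
        i = random.randrange(len(child["hidden_sizes"]))
        child["hidden_sizes"][i] = random.choice(LAYER_SIZES)
    elif op == "add" and len(child["hidden_sizes"]) < MAX_LAYERS:
        child["hidden_sizes"].append(random.choice(LAYER_SIZES))
    elif op == "remove" and len(child["hidden_sizes"]) > 1:
        child["hidden_sizes"].pop()
    else:
        # Also the fallback when a layer cannot be added or removed
        child["activation"] = random.choice(ACTIVATIONS)
    return child

def toy_fitness(arch):
    """Hypothetical score: prefers two layers of about 128 units with ReLU."""
    score = -abs(len(arch["hidden_sizes"]) - 2)
    score -= sum(abs(s - 128) for s in arch["hidden_sizes"]) / 256
    score += 0.5 if arch["activation"] == "relu" else 0.0
    return score

def evolve(population_size=8, generations=15):
    population = [random_architecture() for _ in range(population_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        history.append(toy_fitness(ranked[0]))
        parents = ranked[: population_size // 2]  # selection: keep the top half
        children = [mutate(random.choice(parents))
                    for _ in range(population_size - len(parents))]
        population = parents + children  # parents survive unchanged (elitism)
    return max(population, key=toy_fitness), history

best, history = evolve()
print(f"Best architecture: {best}, fitness={toy_fitness(best):.3f}")
```

Because the parents survive each generation, the best fitness never decreases; the expensive part in practice is the fitness evaluation, not the evolutionary loop.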

---

## 5. Meta-Learning

### Learning to Learn

**Meta-Learning**: Learn from multiple tasks to quickly adapt to new tasks.

**Key Ideas:**
- **Task distribution** $p(T)$: Sample tasks from distribution
- **Meta-train**: Learn across tasks
- **Meta-test**: Quickly adapt to new task

**Applications:**
- Few-shot learning
- Hyperparameter initialization
- Transfer learning

### 5.1 Simple Meta-Learning Example

In [None]:
class SimpleMetaLearner:
    """
    Simple meta-learner that learns good hyperparameter initialization
    across multiple tasks
    """
    
    def __init__(self):
        self.meta_params = {
            'learning_rate': 0.001,
            'hidden_size': 64
        }
    
    def create_task_data(self, n_samples=100):
        """Create a random binary classification task"""
        X, y = make_classification(
            n_samples=n_samples,
            n_features=20,
            n_informative=15,
            n_redundant=5,
            n_classes=2,
            random_state=np.random.randint(1000)
        )
        return train_test_split(X, y, test_size=0.3)
    
    def train_on_task(self, X_train, y_train, X_test, y_test, lr, hidden_size, num_epochs=50):
        """Train model on a single task"""
        # Scale data
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Convert to tensors
        X_train_t = torch.FloatTensor(X_train_scaled)
        y_train_t = torch.LongTensor(y_train)
        X_test_t = torch.FloatTensor(X_test_scaled)
        y_test_t = torch.LongTensor(y_test)
        
        # Create model
        model = nn.Sequential(
            nn.Linear(X_train.shape[1], hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2)
        )
        
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=lr)
        
        # Train
        model.train()
        for epoch in range(num_epochs):
            optimizer.zero_grad()
            outputs = model(X_train_t)
            loss = criterion(outputs, y_train_t)
            loss.backward()
            optimizer.step()
        
        # Test
        model.eval()
        with torch.no_grad():
            outputs = model(X_test_t)
            _, predicted = torch.max(outputs, 1)
            accuracy = (predicted == y_test_t).float().mean().item()
        
        return accuracy
    
    def meta_train(self, num_tasks=10):
        """
        Meta-train: Learn good hyperparameter initialization
        across multiple tasks
        """
        print("Meta-Training: Learning good hyperparameters across tasks...\n")
        
        # Try different hyperparameter combinations
        lrs = [0.0001, 0.001, 0.01]
        hidden_sizes = [32, 64, 128]
        
        results = {}
        
        for lr in lrs:
            for hidden_size in hidden_sizes:
                task_accuracies = []
                
                # Test on multiple tasks
                for task_id in range(num_tasks):
                    X_train, X_test, y_train, y_test = self.create_task_data()
                    acc = self.train_on_task(X_train, y_train, X_test, y_test, lr, hidden_size)
                    task_accuracies.append(acc)
                
                avg_acc = np.mean(task_accuracies)
                results[(lr, hidden_size)] = avg_acc
                
                print(f"LR={lr:.4f}, Hidden={hidden_size}: Avg Accuracy={avg_acc:.4f}")
        
        # Find best meta-parameters
        best_params = max(results, key=results.get)
        self.meta_params = {
            'learning_rate': best_params[0],
            'hidden_size': best_params[1]
        }
        
        print(f"\nBest meta-parameters: LR={self.meta_params['learning_rate']}, "
              f"Hidden={self.meta_params['hidden_size']}")
        
        return self.meta_params
    
    def fast_adapt(self, X_train, y_train, X_test, y_test, num_epochs=50):
        """
        Fast adaptation to new task using learned meta-parameters
        """
        return self.train_on_task(
            X_train, y_train, X_test, y_test,
            lr=self.meta_params['learning_rate'],
            hidden_size=self.meta_params['hidden_size'],
            num_epochs=num_epochs
        )

# Meta-learning demo
meta_learner = SimpleMetaLearner()

# Meta-train on 10 tasks
best_meta_params = meta_learner.meta_train(num_tasks=10)

# Test on new tasks
print("\nTesting on new tasks:")
test_accuracies = []
for i in range(5):
    X_train, X_test, y_train, y_test = meta_learner.create_task_data()
    acc = meta_learner.fast_adapt(X_train, y_train, X_test, y_test)
    test_accuracies.append(acc)
    print(f"New Task {i+1}: Accuracy = {acc:.4f}")

print(f"\nAverage accuracy on new tasks: {np.mean(test_accuracies):.4f} ± {np.std(test_accuracies):.4f}")

---

## 6. Automated Feature Engineering

### 6.1 Feature Selection

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# Load data with many features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Original number of features: {X_train.shape[1]}")

# Method 1: Univariate Feature Selection
selector_univariate = SelectKBest(f_classif, k=10)
X_train_selected = selector_univariate.fit_transform(X_train, y_train)
X_test_selected = selector_univariate.transform(X_test)

print(f"\nUnivariate Selection: {X_train_selected.shape[1]} features")
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_selected, y_train)
print(f"Accuracy: {rf.score(X_test_selected, y_test):.4f}")

# Method 2: Recursive Feature Elimination
estimator = RandomForestClassifier(n_estimators=50, random_state=42)
selector_rfe = RFE(estimator, n_features_to_select=10, step=1)
X_train_rfe = selector_rfe.fit_transform(X_train, y_train)
X_test_rfe = selector_rfe.transform(X_test)

print(f"\nRFE: {X_train_rfe.shape[1]} features")
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_rfe, y_train)
print(f"Accuracy: {rf.score(X_test_rfe, y_test):.4f}")

# Method 3: Feature Importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
top_k = 10
top_features = np.argsort(importances)[-top_k:]

X_train_importance = X_train[:, top_features]
X_test_importance = X_test[:, top_features]

print(f"\nFeature Importance: {top_k} features")
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_importance, y_train)
print(f"Accuracy: {rf.score(X_test_importance, y_test):.4f}")

# Visualize feature importances
plt.figure(figsize=(12, 5))
feature_names = load_breast_cancer().feature_names
indices = np.argsort(importances)[-15:][::-1]

plt.bar(range(15), importances[indices], color='steelblue')
plt.xticks(range(15), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Top 15 Feature Importances')
plt.tight_layout()
plt.show()

### 6.2 Automated Feature Selection with Optuna

In [None]:
def feature_selection_objective(trial):
    """
    Optimize feature selection
    """
    # Select which features to use (binary mask)
    selected_features = []
    for i in range(X_train.shape[1]):
        if trial.suggest_categorical(f'feature_{i}', [True, False]):
            selected_features.append(i)
    
    # Need at least 1 feature
    if len(selected_features) == 0:
        return 0.0
    
    # Evaluate the selected features by cross-validation on the training data only
    X_train_selected = X_train[:, selected_features]
    
    rf = RandomForestClassifier(n_estimators=50, random_state=42)
    score = cross_val_score(rf, X_train_selected, y_train, cv=3, scoring='accuracy').mean()
    
    # Penalize using too many features
    feature_penalty = len(selected_features) / X_train.shape[1] * 0.05
    
    return score - feature_penalty

# Optimize feature selection
study_features = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))

print("Optimizing feature selection...\n")
study_features.optimize(feature_selection_objective, n_trials=50, show_progress_bar=True)

# Get selected features
selected_features = [i for i in range(X_train.shape[1]) 
                    if study_features.best_params.get(f'feature_{i}', False)]

print(f"\nSelected {len(selected_features)} / {X_train.shape[1]} features")
print(f"Best CV score: {study_features.best_value:.4f}")

# Evaluate on test set
X_train_opt = X_train[:, selected_features]
X_test_opt = X_test[:, selected_features]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_opt, y_train)
print(f"Test accuracy: {rf.score(X_test_opt, y_test):.4f}")

---

## 7. Best Practices

### 7.1 When to Use AutoML

**Use AutoML when:**
- ✅ Have well-defined problem and dataset
- ✅ Need quick baseline
- ✅ Want to explore many options
- ✅ Have computational budget
- ✅ Non-expert users need ML

**Don't use AutoML when:**
- ❌ Need to understand model deeply
- ❌ Very limited computational budget
- ❌ Data needs heavy domain knowledge
- ❌ Problem is novel/unique

### 7.2 Computational Budget

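**Key Trade-offs:**
- More trials usually find better configurations, but with diminishing returns and higher cost
- Early stopping (pruning) cuts the cost of unpromising trials
- Cheaper proxies (smaller datasets, fewer epochs) speed up evaluation, but may rank configurations differently from full training

**Recommended Budgets** (rough rules of thumb; scale them to the cost of a single evaluation):
- Grid Search: < 100 combinations
- Random Search: 50-200 trials
- Bayesian Optimization: 50-100 trials
- NAS: 100-1000 trials (depending on method)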

### 7.3 Validation Strategy

**Critical**: Use proper validation to avoid overfitting!

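```python
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

model = RandomForestClassifier(random_state=42)
search_space = ParameterGrid({'n_estimators': [50, 100], 'max_depth': [3, None]})

# ❌ BAD: Optimizing on test set
for params in search_space:
    model.set_params(**params)  # set_params takes keyword arguments
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)  # DON'T DO THIS!

# ✅ GOOD: Use cross-validation on the training data only
best_score, best_params = float('-inf'), None
for params in search_space:
    model.set_params(**params)
    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

# Final evaluation on held-out test set (used exactly once)
best_model = clone(model).set_params(**best_params)
best_model.fit(X_train, y_train)
test_score = best_model.score(X_test, y_test)
```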

### 7.4 Interpreting Results

**Always check:**
1. **Hyperparameter importance**: Which parameters matter most?
2. **Optimization history**: Is the search converging?
3. **Cross-validation variance**: Is the model stable across folds?
4. **Train-test gap**: Is the model overfitting?
5. **Feature selection**: Which features were selected?
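
Checks 3 and 4 can be read off a single cross-validation run. The following is a minimal sketch, assuming scikit-learn's `cross_validate` with `return_train_score=True`; the breast-cancer dataset and the variable names (`val_std`, `gap`) are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Collect both training-fold and validation-fold scores
cv = cross_validate(model, X, y, cv=5, return_train_score=True)

val_mean = cv["test_score"].mean()        # average validation accuracy
val_std = cv["test_score"].std()          # spread across folds (stability)
gap = cv["train_score"].mean() - val_mean # train-validation gap (overfitting signal)

print(f"validation accuracy: {val_mean:.4f} +/- {val_std:.4f}")
print(f"train - validation gap: {gap:.4f}")
```

A large standard deviation relative to the differences between candidate configurations suggests that the ranking of those configurations may not be reliable.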

### 7.5 Common Pitfalls

**1. Overfitting to validation set**
- Problem: Many tuning iterations scored on the same validation data make the best score optimistically biased
- Solution: Use nested cross-validation or a hold-out test set that is evaluated only once

**2. Data leakage**
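- Problem: Fitting preprocessing (scaling, imputation, feature selection) on the full dataset before the train-test split
- Solution: Use pipelines so that preprocessing is fit on training data only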
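
To see how much leakage can distort a score, here is a small sketch on synthetic data, assuming scikit-learn's `SelectKBest` and `make_pipeline`. The features are pure noise and the labels are random, so an honest estimate should sit near chance. The dataset sizes and names (`X_noise`, `leaky_score`, `honest_score`) are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # pure noise features
y_noise = rng.integers(0, 2, size=100)  # labels unrelated to the features

# Leaky: feature selection sees every label before cross-validation starts
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: the pipeline refits feature selection inside each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_score = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_score:.3f}")
print(f"honest CV accuracy: {honest_score:.3f}")
```

The leaky estimate is typically well above chance even though there is nothing to learn, while the pipeline version stays close to 0.5.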

**3. Computational waste**
- Problem: Not using early stopping
- Solution: Prune unpromising trials

**4. Ignoring domain knowledge**
- Problem: Letting AutoML choose unreasonable options
- Solution: Constrain search space with domain knowledge

---

## 8. Interview Questions

### Fundamentals

**Q1: What is AutoML and what does it automate?**

**A**: AutoML (Automated Machine Learning) automates the end-to-end process of applying ML:
1. **Data preprocessing**: Handling missing values, encoding, scaling
2. **Feature engineering**: Creating and selecting features
3. **Model selection**: Choosing algorithm
4. **Hyperparameter tuning**: Finding optimal hyperparameters
5. **Model ensembling**: Combining multiple models

**Benefits**: Democratizes ML, saves time, reduces human bias
**Challenges**: Computational cost, black box, still needs domain knowledge

---

**Q2: Compare Grid Search, Random Search, and Bayesian Optimization.**

**A**:

**Grid Search**:
- Exhaustive search over grid
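- Pros: Guaranteed to find the best point in the grid, deterministic
- Cons: Cost grows exponentially, $O(n^d)$ evaluations for $n$ values per hyperparameter and $d$ hyperparameters; wastes evaluations on unimportant dimensions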
- Use: < 4 hyperparameters, small grids

**Random Search**:
- Randomly sample from distributions
- Pros: More efficient, explores important dimensions better
- Cons: No optimality guarantee, results vary between runs
- Use: Quick baseline, many hyperparameters

**Bayesian Optimization**:
- Build probabilistic model of objective function
- Pros: Typically the most sample-efficient of the three, automatic exploration-exploitation trade-off
- Cons: More complex, slower per iteration
- Use: Expensive evaluations, need best performance

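**Key insight**: Random search often beats grid search because not all hyperparameters are equally important (Bergstra & Bengio, 2012). With the same budget, random search tries many more distinct values of each important hyperparameter.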
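
The coverage argument can be checked with a few lines of NumPy. This is an illustrative sketch only: the objective `f` is hypothetical and is built so that just one of its two hyperparameters matters.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, y):
    # Hypothetical objective: x matters, y has almost no effect
    return -(x - 0.37) ** 2 + 0.001 * y

# Grid search: 3 x 3 = 9 evaluations, but only 3 distinct values of x
axis = np.linspace(0.0, 1.0, 3)
grid = [(x, y) for x in axis for y in axis]

# Random search: 9 evaluations, each with its own value of x
rand = rng.uniform(0.0, 1.0, size=(9, 2))

distinct_x_grid = len({x for x, _ in grid})
distinct_x_rand = len(set(rand[:, 0].tolist()))

best_grid = max(f(x, y) for x, y in grid)
best_rand = max(f(x, y) for x, y in rand)

print(f"distinct x values  grid: {distinct_x_grid}, random: {distinct_x_rand}")
print(f"best objective     grid: {best_grid:.4f}, random: {best_rand:.4f}")
```

Both strategies spend nine evaluations, but the grid probes the important dimension at only three locations. Which one wins on a given run still depends on the random draw.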

---

**Q3: What is Neural Architecture Search (NAS)?**

**A**: NAS automates the design of neural network architectures.

**Components**:
1. **Search Space**: Set of possible architectures (layers, connections, operations)
2. **Search Strategy**: How to explore (random, evolutionary, RL, gradient-based)
3. **Performance Estimation**: How to evaluate (train from scratch, weight sharing, proxies)

**Methods**:
- **RL-based**: NAS-RL, NASNet (controller RNN generates architectures); the EfficientNet-B0 baseline was also found with an RL-based search before compound scaling
- **Evolutionary**: AmoebaNet (mutation and selection)
- **Gradient-based**: DARTS (continuous relaxation, much faster)
- **One-Shot**: Once-for-All (train supernet, sample sub-networks)

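**Challenges**: Early RL-based and evolutionary methods were extremely expensive (on the order of thousands of GPU-days); weight sharing and gradient-based methods reduce this to a few GPU-days or less, at the cost of noisier performance estimates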

---

**Q4: Explain the CASH problem.**

**A**: CASH (Combined Algorithm Selection and Hyperparameter optimization)

**Problem**: Jointly optimize:
1. **Which algorithm** to use? (RF, GB, SVM, etc.)
2. **What hyperparameters** for that algorithm?

**Search Space**:
```
{
  (RandomForest, {n_estimators, max_depth, ...}),
  (GradientBoosting, {learning_rate, n_estimators, ...}),
  (SVM, {C, gamma, kernel, ...}),
  ...
}
```

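**Approach**: Treat the algorithm choice as a top-level categorical hyperparameter and search the combined, conditional space, for example with Bayesian optimization

**Frameworks**: Auto-WEKA and Auto-sklearn (Bayesian optimization), TPOT (genetic programming), H2O AutoML (random search plus stacked ensembles)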
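
The conditional structure of the CASH search space can be sketched as follows. To keep the example short it uses plain random sampling as the search strategy (a stand-in, not Bayesian optimization), and the helper names `sample_config` and `build`, the hyperparameter ranges, and the Iris dataset are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

def sample_config(rng):
    """Draw (algorithm, hyperparameters); the hyperparameters depend on the algorithm."""
    algo = str(rng.choice(["rf", "svm"]))
    if algo == "rf":
        return algo, {"n_estimators": int(rng.integers(10, 60)),
                      "max_depth": int(rng.integers(2, 8))}
    return algo, {"C": float(10 ** rng.uniform(-2, 2)),
                  "gamma": float(10 ** rng.uniform(-3, 0))}

def build(algo, params):
    """Instantiate the estimator for a sampled configuration."""
    if algo == "rf":
        return RandomForestClassifier(random_state=0, **params)
    return SVC(**params)

results = []
for _ in range(10):
    algo, params = sample_config(rng)
    score = cross_val_score(build(algo, params), X, y, cv=3).mean()
    results.append((score, algo, params))

best_score, best_algo, best_params = max(results, key=lambda r: r[0])
print(best_algo, best_params, round(best_score, 4))
```

A Bayesian optimizer would replace `sample_config` with a model-guided proposal, but the search space keeps the same shape: one categorical choice that activates its own set of hyperparameters.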

---

**Q5: What is meta-learning?**

**A**: Meta-learning = "learning to learn"

**Goal**: Learn from multiple tasks to quickly adapt to new tasks

**Key Concepts**:
- **Task distribution** $p(T)$: Sample tasks
- **Meta-training**: Learn across tasks
- **Meta-testing**: Fast adaptation to new task

**Approaches**:
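1. **Metric-based**: Learn an embedding space in which similar examples are close (Siamese networks, Matching Networks, Prototypical Networks)
2. **Model-based**: Use architectures that adapt quickly through internal or external memory (memory-augmented neural networks, Meta Networks)
3. **Optimization-based**: Learn a good initialization or update rule (MAML, LSTM meta-learner)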

**Applications**:
- Few-shot learning (learn from few examples)
- Hyperparameter initialization
- Transfer learning

**Example**: MAML learns an initialization from which a few gradient steps give good performance on a new task drawn from the same task distribution.
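
The MAML update can be worked through on a toy problem where every gradient is available in closed form. This sketch assumes scalar tasks with loss $L_t(\theta) = \tfrac{1}{2}(\theta - c_t)^2$, a single inner gradient step, and hand-picked learning rates; the task optima `task_optima` are invented for illustration.

```python
import numpy as np

task_optima = np.array([-2.0, 0.5, 1.0, 3.0])  # hypothetical c_t, one per task
alpha, beta = 0.1, 0.5                          # inner and outer learning rates (assumed)

def inner_adapt(theta, c, steps=1):
    """Task-specific adaptation: gradient steps on L_t(theta) = 0.5 * (theta - c)^2."""
    for _ in range(steps):
        theta = theta - alpha * (theta - c)
    return theta

theta = 10.0  # deliberately poor starting initialization
for _ in range(100):
    adapted = inner_adapt(theta, task_optima)  # theta'_t for every task
    # Chain rule through the inner step: dL_t(theta'_t)/dtheta = (theta'_t - c_t) * (1 - alpha)
    meta_grad = np.mean((adapted - task_optima) * (1.0 - alpha))
    theta = theta - beta * meta_grad

# Compare adaptation to one task from the meta-learned and the poor initialization
new_task = 3.0
loss_meta = 0.5 * (inner_adapt(theta, new_task, steps=5) - new_task) ** 2
loss_poor = 0.5 * (inner_adapt(10.0, new_task, steps=5) - new_task) ** 2
print(f"meta-learned init: {theta:.4f}")
print(f"loss after 5 steps  meta init: {loss_meta:.4f}, poor init: {loss_poor:.4f}")
```

For this particular quadratic family the meta-learned initialization converges to the mean of the task optima. That is a property of the toy problem, not of MAML in general.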

---

### Advanced

**Q6: How would you prevent overfitting during hyperparameter optimization?**

**A**: Several strategies:

**1. Proper validation**:
```python
# Use cross-validation on training data
score = cross_val_score(model, X_train, y_train, cv=5).mean()
# Hold out test set for final evaluation only
```

**2. Nested cross-validation**:
- Outer loop: Performance estimation
- Inner loop: Hyperparameter tuning

**3. Regularization**:
- Penalize model complexity in objective
- Prefer simpler models with similar performance

**4. Early stopping**:
- Stop if validation performance plateaus
- Prune unpromising trials

**5. Limited budget**:
- Don't tune forever
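- Returns typically diminish as trials accumulate (often after roughly 100 trials, depending on the search space), while the risk of fitting noise in the validation score keeps growing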
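
Nested cross-validation (strategy 2) is compact in scikit-learn: wrap the tuner in an outer `cross_val_score`. This is a minimal sketch, assuming an SVM pipeline on the breast-cancer dataset; the parameter grid and fold counts are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimation

search = GridSearchCV(pipe, param_grid, cv=inner_cv)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

print(f"nested CV accuracy: {nested_scores.mean():.4f} +/- {nested_scores.std():.4f}")
```

Each outer fold runs its own inner search, so the reported score never comes from data that influenced the choice of hyperparameters.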

---

**Q7: What are the challenges of AutoML in production?**

**A**: 

**1. Computational cost**:
- Training many models is expensive
- Solution: Use cheaper proxies, early stopping

**2. Interpretability**:
- Hard to understand why particular choices were made
- Solution: Analyze hyperparameter importances, feature selection

**3. Reproducibility**:
- Search procedures (random search, Bayesian optimization, model training) are stochastic
- Solution: Set random seeds, log all trials

**4. Domain knowledge**:
- Can't replace domain expertise
- Solution: Constrain search space with knowledge

**5. Deployment**:
- Ensemble models are harder to deploy (latency, memory, more dependencies)
- Solution: Model distillation, select single best model

**6. Data drift**:
- Optimal hyperparameters change over time
- Solution: Periodic retraining, online learning

---

**Q8: How does Bayesian Optimization work?**

**A**: 

**Algorithm**:
1. Build a probabilistic surrogate model (commonly a Gaussian Process) of $f(\theta)$ from the evaluations so far
2. Use acquisition function to select next $\theta$
3. Evaluate $f(\theta)$
4. Update model
5. Repeat

**Acquisition Functions** (written for maximizing $f$):
- **Expected Improvement (EI)**: $E[\max(0, f(\theta) - f(\theta_{best}))]$
  - Balances exploration vs exploitation
  
- **Upper Confidence Bound (UCB)**: $\mu(\theta) + \kappa \sigma(\theta)$
  - Optimistic estimate
  - $\kappa$ controls exploration

- **Probability of Improvement (PI)**: $P(f(\theta) > f(\theta_{best}))$
  - Greedy: ignores how large the improvement is, so it tends to over-exploit

**Why it works**:
- GP gives uncertainty estimates
- Acquisition balances "try promising areas" vs "explore unknown areas"
- Typically more sample-efficient than random search, especially when each evaluation is expensive

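**Frameworks**: Optuna (default sampler is TPE, which models densities of good and bad configurations instead of using a GP), Spearmint and GPyOpt (GP-based)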
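
The loop above can be sketched in plain NumPy for a one-dimensional $\theta$. This is an illustrative implementation under several assumptions: a zero-mean GP with unit-variance RBF kernel and fixed length scale, Expected Improvement maximized over a fixed grid of candidates, and a hypothetical objective `f`. The helper names are not from any library.

```python
import math
import numpy as np

def rbf(a, b, length=0.2):
    """RBF kernel matrix between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_new)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """EI for maximization: E[max(0, f - best)] under the GP posterior."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def f(theta):
    # Hypothetical expensive objective to maximize (optimum at theta = 0.7)
    return -(theta - 0.7) ** 2

candidates = np.linspace(0.0, 1.0, 201)
x_obs = np.array([0.1, 0.5, 0.9])  # initial design
y_obs = f(x_obs)

for _ in range(8):
    mu, sigma = gp_posterior(x_obs, y_obs, candidates)           # steps 1 and 4: (re)fit model
    ei = expected_improvement(mu, sigma, y_obs.max())
    theta_next = candidates[np.argmax(ei)]                        # step 2: maximize acquisition
    x_obs = np.append(x_obs, theta_next)
    y_obs = np.append(y_obs, f(theta_next))                       # step 3: evaluate objective

best_theta = x_obs[np.argmax(y_obs)]
print(f"best theta found: {best_theta:.3f}")
```

Production libraries additionally fit the kernel hyperparameters and optimize the acquisition function continuously; the fixed grid here only keeps the sketch short.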

---

## Summary

**We covered**:

1. **AutoML Fundamentals**: What, why, and when to use
2. **Hyperparameter Optimization**: Grid, Random, Bayesian methods
3. **AutoML Frameworks**: CASH, ensemble selection
4. **Neural Architecture Search**: Random, Bayesian, advanced methods
5. **Meta-Learning**: Learning to learn across tasks
6. **Automated Feature Engineering**: Selection and generation
7. **Best Practices**: Validation, budgets, pitfalls

**Key Takeaways**:
- AutoML democratizes ML but doesn't replace expertise
- Bayesian optimization is usually more sample-efficient than grid or random search
- NAS can discover novel architectures
- Proper validation is critical to avoid overfitting
- Trade-off: computational budget vs performance

**Further Learning**:
- Optuna documentation: https://optuna.org/
- Auto-sklearn: https://automl.github.io/auto-sklearn/
- NAS survey: https://arxiv.org/abs/1808.05377

---

**References**:
1. Hutter et al., "Automated Machine Learning" (2019)
2. Bergstra & Bengio, "Random Search for Hyper-Parameter Optimization" (2012)
3. Snoek et al., "Practical Bayesian Optimization of Machine Learning Algorithms" (2012)
4. Zoph & Le, "Neural Architecture Search with Reinforcement Learning" (2017)
5. Liu et al., "DARTS: Differentiable Architecture Search" (2019)
6. Finn et al., "Model-Agnostic Meta-Learning" (2017)

---

*Created: October 2025*  
*Last Updated: October 2025*