# Optuna, Ray Tune, and Scikit-Optimize: Advanced Hyperparameter Optimization

These are **smart alternatives to GridSearchCV/RandomizedSearchCV** that use **intelligent search strategies** instead of brute force or random sampling.

## Quick Comparison Table

| Library | Philosophy | Best For | Learning Curve | Speed |
|---------|-----------|----------|----------------|-------|
| **GridSearchCV** | Try everything | Small search spaces | ⭐⭐⭐⭐⭐ Easiest | ❌ Slowest |
| **RandomizedSearchCV** | Random sampling | Medium spaces | ⭐⭐⭐⭐⭐ Easiest | ⭐⭐ Faster |
| **scikit-optimize** | Bayesian optimization | Single machine, scikit-learn focused | ⭐⭐⭐⭐ Easy | ⭐⭐⭐ Smart |
| **Optuna** | Bayesian + pruning | Production pipelines, flexibility | ⭐⭐⭐ Moderate | ⭐⭐⭐⭐ Very smart |
| **Ray Tune** | Distributed + schedulers | Large-scale, clusters, deep learning | ⭐⭐ Complex | ⭐⭐⭐⭐⭐ Fastest (distributed) |

---

## The Core Problem They Solve

**GridSearchCV** tries every combination:
```python
param_grid = {
    'n_estimators': [50, 100, 200, 300],      # 4 values
    'max_depth': [3, 5, 10, 15, 20],          # 5 values
    'min_samples_split': [2, 5, 10, 20],      # 4 values
    'learning_rate': [0.01, 0.05, 0.1, 0.2]   # 4 values
}
# Total: 4 × 5 × 4 × 4 = 320 combinations × 5 folds = 1,600 model fits!
```

**Smart optimizers** learn from previous trials:
```
Trial 1:  n_estimators=100, max_depth=5  → AUC=0.75
Trial 2:  n_estimators=200, max_depth=10 → AUC=0.82 ← Better!
Trial 3:  Focus search around here...     → AUC=0.85 ← Even better!
...
Trial 50: Found optimal                   → AUC=0.87
# 50 trials (250 fits with 5-fold CV) instead of 320 combinations (1,600 fits)
```
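
The arithmetic behind that comparison can be checked in a few lines. This is a minimal sketch that reuses the grid above and assumes 5-fold cross-validation and a budget of 50 trials for the sequential optimizer (both numbers come from the example, not from any library default):

```python
from math import prod

# Same grid as in the GridSearchCV example above
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10, 20],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
}
n_folds = 5    # assumed CV folds
n_trials = 50  # assumed trial budget for a sequential optimizer

# Grid search fits every combination once per fold
n_combinations = prod(len(values) for values in param_grid.values())
grid_fits = n_combinations * n_folds

# A trial-based optimizer fits only the sampled configurations
trial_fits = n_trials * n_folds

print(f"Grid search: {n_combinations} combinations, {grid_fits} fits")
print(f"{n_trials} trials: {trial_fits} fits")
```

Adding one more hyperparameter with four values multiplies the grid cost by four, while the trial budget stays whatever you set it to.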

---

## 1. Scikit-Optimize (skopt)

**Most similar to GridSearchCV** - drop-in replacement.

### Installation
```bash
pip install scikit-optimize
```

### Basic Usage (Drop-in Replacement)

```python
from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Define search space
search_spaces = {
    'n_estimators': (50, 300),              # Integer range
    'max_depth': (3, 20),                    # Integer range
    'min_samples_split': (2, 20),            # Integer range
    'min_samples_leaf': (1, 10),             # Integer range
    'max_features': ['sqrt', 'log2', None]   # Categorical
}

# Use like GridSearchCV!
bayes_search = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    search_spaces,
    n_iter=50,              # Number of parameter combinations to try
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

bayes_search.fit(X_train, y_train)

print(f"Best score: {bayes_search.best_score_}")
print(f"Best params: {bayes_search.best_params_}")

# Access best model (refit on the full training set with the best params)
best_model = bayes_search.best_estimator_
```

**Key differences from GridSearchCV**:
- ✅ Uses a Gaussian process surrogate (the default) to model the score as a function of the parameters
- ✅ Intelligently picks next parameters to try
- ✅ Same API as GridSearchCV (easy migration!)
- ❌ Less flexible than Optuna/Ray Tune
- ❌ No pruning/early stopping
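
To make "intelligently picks next parameters" concrete, here is a toy sketch of one common recipe: fit a Gaussian process to the scores seen so far, then evaluate the candidate with the highest expected improvement. It uses scikit-learn's `GaussianProcessRegressor` rather than skopt's internals, and `toy_score` is a made-up stand-in for a cross-validated AUC, so treat it as an illustration of the idea only:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def toy_score(depth):
    """Hypothetical stand-in for a CV score; peaks at depth 12."""
    return 0.87 - 0.0008 * (depth - 12.0) ** 2


def expected_improvement(mu, sigma, best):
    """Expected improvement over the best score so far (maximization)."""
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)


candidates = np.arange(3, 21, dtype=float).reshape(-1, 1)  # max_depth 3..20
tried = [3.0, 20.0]                                         # two initial probes
scores = [toy_score(d) for d in tried]

for _ in range(6):
    # Surrogate model of score as a function of max_depth
    gp = GaussianProcessRegressor(
        kernel=Matern(nu=2.5), normalize_y=True, alpha=1e-6, random_state=0
    )
    gp.fit(np.array(tried).reshape(-1, 1), scores)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Score every candidate, skipping depths that were already evaluated
    ei = expected_improvement(mu, sigma, max(scores))
    ei[np.isin(candidates.ravel(), tried)] = -np.inf

    next_depth = float(candidates[np.argmax(ei), 0])
    tried.append(next_depth)
    scores.append(toy_score(next_depth))

best_depth = tried[int(np.argmax(scores))]
print(f"Evaluated depths: {tried}")
print(f"Best depth found: {best_depth}")
```

Expected improvement is large where the surrogate predicts a high score or is very uncertain, which is how the search balances exploiting good regions against exploring untested ones.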

---

## 2. Optuna

**A popular choice** for production ML - flexible, powerful, with built-in visualizations.

### Installation
```bash
pip install optuna
```

### Basic Usage

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

def objective(trial):
    """
    Optuna calls this function for each trial.
    You define what to optimize.
    """
    # Suggest hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
        'random_state': 42
    }
    
    # Create and evaluate model
    model = RandomForestClassifier(**params)
    cv = StratifiedKFold(5, shuffle=True, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc').mean()
    
    return score  # Optuna maximizes or minimizes this

# Create study
study = optuna.create_study(
    direction='maximize',  # or 'minimize'
    study_name='rf_optimization',
    sampler=optuna.samplers.TPESampler(seed=42)  # Bayesian optimization
)

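# Run optimization (n_jobs=-1 runs trials in parallel threads; with parallel
# trials the seeded sampler no longer gives reproducible results)
study.optimize(objective, n_trials=100, n_jobs=-1)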

print(f"Best score: {study.best_value}")
print(f"Best params: {study.best_params}")
```

**Key features**:
- ✅ **Pruning**: Stop bad trials early (saves computation)
- ✅ **Flexible**: Works with any framework (sklearn, XGBoost, PyTorch)
- ✅ **Visualization**: Built-in plots for analysis
- ✅ **Persistence**: Save/resume studies
- ✅ **Parallel**: Multiple workers on same study

---

## 3. Ray Tune

**Most scalable** - designed for distributed computing and deep learning.

### Installation
```bash
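pip install "ray[tune]" tune-sklearn scikit-optimize
```

### Basic Usage

```python
from ray import tune
# TuneSearchCV is provided by the separate tune-sklearn package (check that it
# supports your Ray version); search_optimization="bayesian" needs scikit-optimize
from tune_sklearn import TuneSearchCV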
from sklearn.ensemble import RandomForestClassifier

# Define search space
param_distributions = {
    'n_estimators': tune.randint(50, 300),
    'max_depth': tune.randint(3, 20),
    'min_samples_split': tune.randint(2, 20),
    'min_samples_leaf': tune.randint(1, 10),
    'max_features': tune.choice(['sqrt', 'log2', None])
}

# Use like GridSearchCV
tune_search = TuneSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    search_optimization="bayesian",  # or "random", "grid"
    n_trials=100,
    cv=5,
    scoring='roc_auc',
    verbose=1
)

tune_search.fit(X_train, y_train)

print(f"Best score: {tune_search.best_score_}")
print(f"Best params: {tune_search.best_params_}")
```

**Key features**:
- ✅ **Distributed**: Multi-node clusters
- ✅ **Schedulers**: ASHA, PBT for advanced strategies
- ✅ **Deep learning**: Great with PyTorch, TensorFlow
- ❌ More complex setup
- ❌ Overkill for simple scikit-learn projects
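
The scheduler idea is easier to see in miniature. The sketch below implements plain synchronous successive halving, the building block that ASHA runs asynchronously. It is not Ray Tune's implementation, and `toy_validation_score` is a hypothetical learning curve invented for the demo:

```python
import math
import random


def toy_validation_score(config, budget):
    """Hypothetical learning curve: approaches config['quality'] as budget grows."""
    return config["quality"] * (1 - math.exp(-budget / 3))


def successive_halving(configs, min_budget=1, eta=3, max_budget=27):
    """Keep the top 1/eta of configs each round and give them eta times the budget."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configs evaluated) per round
    while True:
        ranked = sorted(
            survivors,
            key=lambda c: toy_validation_score(c, budget),
            reverse=True,
        )
        history.append((budget, len(ranked)))
        if len(ranked) == 1 or budget >= max_budget:
            return ranked[0], history
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget = min(budget * eta, max_budget)


rng = random.Random(0)
configs = [{"id": i, "quality": rng.uniform(0.6, 0.9)} for i in range(27)]

best, history = successive_halving(configs)
for budget, n_configs in history:
    print(f"budget={budget:>2} per config, {n_configs:>2} configs evaluated")
print(f"Winner: config {best['id']}")
```

Most of the total budget goes to the few configurations that survive the early rounds, so poor configurations cost very little. Real learning curves can cross, so an early leader is not guaranteed to win with more budget. Schedulers trade that risk for speed.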

---

## Integration with Your Complete Workflow

Here's how these tools fit into your 7-step workflow. They **replace step 3** (Hyperparameter Tuning) but integrate with everything else.

### Complete Workflow with Optuna + MLflow + Imblearn

```python
import numpy as np
import pandas as pd
import optuna
import mlflow
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import (classification_report, confusion_matrix, 
                            roc_auc_score, roc_curve, precision_recall_curve,
                            brier_score_loss)
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt

# ============================================================================
# STEP 1: DATA & SPLIT
# ============================================================================
# Assuming you have X, y loaded
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Train size: {X_train.shape[0]}")
print(f"Test size: {X_test.shape[0]}")
print(f"Class distribution (train): {np.bincount(y_train)}")

# ============================================================================
# STEP 2: PREPROCESSING (defined in pipeline)
# ============================================================================
# We'll build this inside the objective function

# ============================================================================
# STEP 3: HYPERPARAMETER TUNING with Optuna + MLflow
# ============================================================================

# Setup MLflow
mlflow.set_experiment("complete-ml-workflow")

# Setup CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def objective(trial):
    """
    Optimize entire pipeline: preprocessing + sampling + model
    """
    with mlflow.start_run(nested=True):
        # ---- PREPROCESSING ----
        scaler_type = trial.suggest_categorical('scaler', ['standard', 'robust'])
        
        if scaler_type == 'standard':
            from sklearn.preprocessing import StandardScaler
            scaler = StandardScaler()
        else:
            from sklearn.preprocessing import RobustScaler
            scaler = RobustScaler()
        
        mlflow.log_param("scaler", scaler_type)
        
        # ---- SAMPLING (for imbalanced data) ----
        use_smote = trial.suggest_categorical('use_smote', [True, False])
        
        steps = [('scaler', scaler)]
        
        if use_smote:
            k_neighbors = trial.suggest_int('smote_k', 3, 10)
            steps.append(('smote', SMOTE(k_neighbors=k_neighbors, random_state=42)))
            mlflow.log_param("smote_k", k_neighbors)
        
        mlflow.log_param("use_smote", use_smote)
        
        # ---- MODEL HYPERPARAMETERS ----
        model_params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 300),
            'max_depth': trial.suggest_int('max_depth', 3, 20),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
            'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
            'class_weight': trial.suggest_categorical('class_weight', ['balanced', None]),
            'random_state': 42
        }
        
        steps.append(('classifier', RandomForestClassifier(**model_params)))
        
        # Build pipeline
        pipeline = ImbPipeline(steps)
        
        mlflow.log_params(model_params)
        
        # ---- CROSS-VALIDATION ----
        scores = cross_val_score(
            pipeline, X_train, y_train, 
            cv=cv, scoring='roc_auc', n_jobs=-1
        )
        
        mean_score = scores.mean()
        std_score = scores.std()
        
        mlflow.log_metric("cv_auc_mean", mean_score)
        mlflow.log_metric("cv_auc_std", std_score)
        
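        # ---- PRUNING (optional) ----
        # Note: the score is reported once, after all folds have run, so pruning
        # here only marks below-median trials as PRUNED; it saves no compute.
        # To stop trials early, report an intermediate score after each fold.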
        trial.report(mean_score, step=0)
        
        if trial.should_prune():
            raise optuna.TrialPruned()
        
        return mean_score

# Run Optuna optimization
study = optuna.create_study(
    direction='maximize',
    study_name='ml-workflow-optimization',
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(n_startup_trials=10)
)

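# Parent run so the nested per-trial runs are grouped together in the MLflow UI
with mlflow.start_run(run_name="optuna-search"):
    study.optimize(objective, n_trials=100, n_jobs=1)  # n_jobs=1 for MLflow safety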

# Print best results
print("\n" + "="*70)
print("BEST HYPERPARAMETERS FOUND")
print("="*70)
print(f"Best CV AUC: {study.best_value:.4f}")
print(f"Best params:")
for param, value in study.best_params.items():
    print(f"  {param}: {value}")

# ============================================================================
# STEP 4: REBUILD BEST PIPELINE & CALIBRATION
# ============================================================================

# Rebuild pipeline with best params
best_params = study.best_params

# Scaler
if best_params['scaler'] == 'standard':
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
else:
    from sklearn.preprocessing import RobustScaler
    scaler = RobustScaler()

# Pipeline steps
steps = [('scaler', scaler)]

if best_params['use_smote']:
    steps.append(('smote', SMOTE(k_neighbors=best_params['smote_k'], random_state=42)))

# Model
model_params = {
    'n_estimators': best_params['n_estimators'],
    'max_depth': best_params['max_depth'],
    'min_samples_split': best_params['min_samples_split'],
    'min_samples_leaf': best_params['min_samples_leaf'],
    'max_features': best_params['max_features'],
    'class_weight': best_params['class_weight'],
    'random_state': 42
}

steps.append(('classifier', RandomForestClassifier(**model_params)))

best_pipeline = ImbPipeline(steps)

# Apply calibration
calibrated_model = CalibratedClassifierCV(
    best_pipeline,
    method='sigmoid',  # Platt scaling
    cv=cv,
    n_jobs=-1
)

print("\nTraining final calibrated model on all training data...")
calibrated_model.fit(X_train, y_train)

# ============================================================================
# STEP 5: TEST EVALUATION
# ============================================================================

print("\n" + "="*70)
print("TEST SET EVALUATION")
print("="*70)

# Predictions
y_pred = calibrated_model.predict(X_test)
y_proba = calibrated_model.predict_proba(X_test)[:, 1]

# Metrics
test_auc = roc_auc_score(y_test, y_proba)
test_brier = brier_score_loss(y_test, y_proba)

print(f"\nTest AUC: {test_auc:.4f}")
print(f"Test Brier Score: {test_brier:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

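# Log to MLflow. The run is started without a `with` block so it stays active
# and the plots and reports logged below attach to this same run.
mlflow.start_run(run_name="final-model")
mlflow.log_params(best_params)
mlflow.log_metric("test_auc", test_auc)
mlflow.log_metric("test_brier", test_brier)
mlflow.sklearn.log_model(calibrated_model, "final_model")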

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC (AUC = {test_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Test Set')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.png')
mlflow.log_artifact('roc_curve.png')
plt.show()

# Plot Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve - Test Set')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('pr_curve.png')
mlflow.log_artifact('pr_curve.png')
plt.show()

# ============================================================================
# STEP 6: LEARNING CURVES
# ============================================================================

from sklearn.model_selection import learning_curve

print("\nGenerating learning curves...")

train_sizes, train_scores, val_scores = learning_curve(
    best_pipeline,  # Uncalibrated pipeline, for speed
    X_train, y_train,
    cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='roc_auc',
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, label='Training score', marker='o')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2)
plt.plot(train_sizes, val_mean, label='Cross-validation score', marker='s')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2)
plt.xlabel('Training Set Size')
plt.ylabel('AUC Score')
plt.title('Learning Curves')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('learning_curves.png')
mlflow.log_artifact('learning_curves.png')
plt.show()

# Diagnose (rule-of-thumb thresholds; adjust them for your problem)
gap = train_mean[-1] - val_mean[-1]
if gap > 0.1:
    print("\n⚠️  HIGH VARIANCE (Overfitting)")
    print("   → Try: More data, regularization, simpler model")
elif val_mean[-1] < 0.75:
    print("\n⚠️  HIGH BIAS (Underfitting)")
    print("   → Try: More complex model, more features")
else:
    print("\n✅ Model is well-balanced")

# ============================================================================
# STEP 7: VALIDATION CURVES
# ============================================================================

from sklearn.model_selection import validation_curve

print("\nGenerating validation curve for max_depth...")

param_range = np.arange(3, 21, 2)
train_scores, val_scores = validation_curve(
    best_pipeline,  # Tuned pipeline, so every other setting stays at its best value
    X_train, y_train,
    param_name='classifier__max_depth',  # max_depth of the 'classifier' step
    param_range=param_range,
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

plt.figure(figsize=(10, 6))
plt.plot(param_range, train_mean, label='Training score', marker='o')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.2)
plt.plot(param_range, val_mean, label='Cross-validation score', marker='s')
plt.fill_between(param_range, val_mean - val_std, val_mean + val_std, alpha=0.2)
plt.xlabel('max_depth')
plt.ylabel('AUC Score')
plt.title('Validation Curve - max_depth')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('validation_curve.png')
mlflow.log_artifact('validation_curve.png')
plt.show()

# ============================================================================
# OPTUNA VISUALIZATIONS (Bonus!)
# ============================================================================

import optuna.visualization as vis

# Optimization history
fig = vis.plot_optimization_history(study)
fig.write_html('optuna_history.html')
mlflow.log_artifact('optuna_history.html')

# Parameter importances
fig = vis.plot_param_importances(study)
fig.write_html('optuna_importances.html')
mlflow.log_artifact('optuna_importances.html')

# Parallel coordinate plot
fig = vis.plot_parallel_coordinate(study)
fig.write_html('optuna_parallel.html')
mlflow.log_artifact('optuna_parallel.html')

# Close whichever MLflow run is still active so it is marked as finished
mlflow.end_run()

print("\n" + "="*70)
print("WORKFLOW COMPLETE!")
print("="*70)
print(f"✅ Best CV AUC: {study.best_value:.4f}")
print(f"✅ Test AUC: {test_auc:.4f}")
print(f"✅ All artifacts logged to MLflow")
print(f"✅ View experiments: mlflow ui")
```

---

## MLflow Integration Patterns

### Pattern 1: Auto-logging with Optuna

```python
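# Recent Optuna releases ship integrations separately: pip install optuna-integration
from optuna.integration.mlflow import MLflowCallback

mlflow.set_experiment("auto-tracking")

mlflow_callback = MLflowCallback(
    tracking_uri="file:./mlruns",
    metric_name="auc",
    create_experiment=False,
    mlflow_kwargs={"nested": True}
)

# Pass an objective that does not open its own MLflow runs,
# otherwise every trial is logged twice
study.optimize(
    objective,
    n_trials=100,
    callbacks=[mlflow_callback]  # Logs each trial's params and objective value
)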
```

### Pattern 2: Manual Fine-grained Control

```python
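def objective_with_mlflow(trial):
    with mlflow.start_run(nested=True):
        # Illustrative search space; swap in your own suggest_* calls
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 300),
            'max_depth': trial.suggest_int('max_depth', 3, 20),
        }

        # Log everything you want
        mlflow.log_params(params)
        mlflow.log_param("trial_number", trial.number)

        # Train and evaluate (X_train, y_train, and cv as defined in the workflow)
        model = RandomForestClassifier(random_state=42, **params)
        scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')

        mlflow.log_metric("cv_score", scores.mean())
        mlflow.log_metric("cv_std", scores.std())

        # Log artifacts
        mlflow.log_artifact("plot.png")  # the file must already exist on disk

        return scores.mean()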
```

---

## When to Use Each Tool

```
┌─────────────────────────────────────────────────────────────┐
│ YOUR SITUATION                    → RECOMMENDED TOOL        │
├─────────────────────────────────────────────────────────────┤
│ Learning / small projects         → GridSearchCV            │
│ Want drop-in upgrade              → scikit-optimize         │
│ Production ML pipelines           → Optuna + MLflow         │
│ Need best visualizations          → Optuna                  │
│ Distributed computing / clusters  → Ray Tune                │
│ Deep learning projects            → Ray Tune or Optuna      │
│ Budget: <100 trials               → RandomizedSearchCV      │
│ Budget: 100-1000 trials           → Optuna                  │
│ Budget: >1000 trials, distributed → Ray Tune                │
└─────────────────────────────────────────────────────────────┘
```

---

## Practical Recommendation for Your Workflow

**Start with Optuna** because:

1. ✅ **Easy migration**: Replace GridSearchCV with just a few lines
2. ✅ **MLflow integration**: Built-in callbacks
3. ✅ **Visualizations**: Understand what's happening
4. ✅ **Pruning**: Save computation on bad trials
5. ✅ **Flexibility**: Works with any model/framework
6. ✅ **Production-ready**: Used by major companies

**Upgrade path**:
```
GridSearchCV → Optuna → Ray Tune (if you need distributed)
             → scikit-optimize (if you want minimal changes)
```

---

Want me to show:
1. **How to resume interrupted Optuna studies** (save/load progress)?
2. **Multi-objective optimization** (optimize multiple metrics simultaneously)?
3. **Optuna with XGBoost/LightGBM** (tree-based models)?
4. **Complete production deployment** with MLflow model registry?