# SageMaker ML Workshop: Building and Evaluating ML Models

## Workshop Overview

Welcome to this hands-on machine learning workshop! In this notebook, you'll learn how to:

1. **Set up your ML environment** - Configure H2O and dependencies
2. **Load and prepare data** - Import datasets for training
3. **Build ML models** - Train a Gradient Boosting Machine (GBM)
4. **Evaluate performance** - Assess model quality and metrics
5. **Generate predictions** - Use your trained model for inference
6. **Extract insights** - Analyze feature importance and interactions
7. **Save artifacts** - Export models for deployment

### About This Example

We'll use the **credit card fraud detection dataset** to predict fraudulent transactions. This is a binary classification problem that demonstrates real-world ML workflows with imbalanced data.

---

## Section 1: Environment Setup

### What is H2O?

H2O is an open-source machine learning platform that provides:
- Fast, scalable algorithms (GBM, Random Forest, Deep Learning, etc.)
- Automatic feature engineering
- Easy-to-use Python API
- Built-in model validation and metrics

### Step 1.1: Install Required Dependencies

H2O requires Java 8+ to run. We'll install the dependencies using conda/pip.

In [None]:
# Install Java (required for H2O) and H2O package
!conda install -y -c conda-forge openjdk=11 -q
!pip install h2o -q

print("Dependencies installed successfully!")

### Step 1.2: Import Libraries and Initialize H2O

Now we'll import the necessary Python libraries and start the H2O cluster.

In [None]:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# Initialize H2O cluster
# This starts a local H2O instance that runs in the background
h2o.init()

print("H2O initialized successfully!")
print(f"H2O cluster is running at: {h2o.connection().base_url}")

---

## Section 2: Data Loading and Preparation

### About the Credit Card Fraud Dataset

This dataset contains anonymized credit card transactions:
- **Target variable (Class)**: Whether the transaction is fraudulent (binary: 0 or 1)
- **Features**: V1-V28 (PCA-transformed features), Amount, Time

### Step 2.1: Import Dataset

In [None]:
# Import the credit card fraud dataset into H2O
# H2O loads data into its distributed in-memory format for fast processing
# Note: For this workshop, we'll use a sample of the data for faster training
df = h2o.import_file("creditcard.csv")

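# Sample roughly 10,000 rows for faster training (optional - remove for full dataset)
# H2OFrame has no sample() method, so filter on a uniform random column instead
sample_fraction = min(1.0, 10000 / df.nrow)
df = df[df.runif(seed=42) < sample_fraction, :]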

# Display the first few rows
print("Dataset loaded successfully!")
print(f"\nDataset shape: {df.shape}")
df.head()

### Step 2.2: Define Features and Target

In machine learning, we need to specify:
- **Predictors (X)**: Input features used to make predictions
- **Response (y)**: Target variable we want to predict

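The `Class` column is stored as the integers 0 and 1, so we convert it to a factor (H2O's categorical type, similar to R factors or pandas categorical types). Without this conversion, H2O would treat the task as regression rather than classification.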

In [None]:
# Convert the target variable to categorical (factor)
df["Class"] = df["Class"].asfactor()

# Define predictor columns (all columns except Time and Class)
predictors = [col for col in df.columns if col not in ["Time", "Class"]]

# Define response column
response = "Class"

print(f"Predictors: {predictors}")
print(f"Response: {response}")
print(f"\nTarget variable distribution:")
df[response].table()

---

## Section 3: Model Training

### Understanding Gradient Boosting Machines (GBM)

GBM is an ensemble learning method that:
1. Builds multiple decision trees sequentially
2. Each tree learns from the errors of previous trees
3. Combines predictions from all trees for final output
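
The three steps above can be sketched in plain Python. This is a toy illustration for squared error on a one-dimensional dataset, using single-split "stumps" as the trees. The function names and the data are hypothetical, and H2O's GBM is considerably more elaborate (deeper trees, binomial loss for classification, row and column sampling).

```python
# Toy gradient boosting for squared error with one-split "stumps" as the trees.
# All names here are hypothetical; this is not H2O's implementation.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict(x, base, stumps, learn_rate):
    """Step 3: combine the baseline with the scaled output of every tree."""
    total = base
    for threshold, left_value, right_value in stumps:
        total += learn_rate * (left_value if x <= threshold else right_value)
    return total


def boost(xs, ys, ntrees, learn_rate):
    """Steps 1 and 2: add trees one at a time, each fitted to the current errors."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(ntrees):
        residuals = [y - predict(x, base, stumps, learn_rate) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


def training_mse(xs, ys, ntrees, learn_rate=0.5):
    """Train a boosted model and report its mean squared error on the same data."""
    base, stumps = boost(xs, ys, ntrees, learn_rate)
    errors = [(y - predict(x, base, stumps, learn_rate)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 5, 5, 9, 9]
for ntrees in (0, 1, 20):
    print(f"{ntrees:>2} trees: training MSE = {training_mse(xs, ys, ntrees):.4f}")
```

Each added tree corrects part of what the earlier trees got wrong, so the training error shrinks as trees are added. The learning rate controls how much of each correction is applied.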

### Key Hyperparameters:

- **nfolds**: Number of cross-validation folds (with 5 folds, each fold's model trains on 80% of the rows and is validated on the remaining 20%)
- **seed**: Random seed for reproducibility
- **keep_cross_validation_predictions**: Keeps the holdout predictions from each fold for later analysis
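
To make the fold arithmetic concrete, here is a minimal sketch of how rows could be divided into folds. The helper `kfold_indices` is hypothetical and uses a simple modulo assignment for readability. H2O chooses its own assignment scheme through the `fold_assignment` parameter, so the actual row-to-fold mapping will differ.

```python
def kfold_indices(n_rows, nfolds):
    """Assign each row index to one fold using row % nfolds (illustrative only)."""
    folds = [[] for _ in range(nfolds)]
    for row in range(n_rows):
        folds[row % nfolds].append(row)
    return folds


n_rows = 10
folds = kfold_indices(n_rows, nfolds=5)
for i, holdout in enumerate(folds):
    # Every row outside the holdout fold is used for training in this round
    train = [row for row in range(n_rows) if row not in holdout]
    print(f"fold {i}: train on {len(train)} rows, validate on rows {holdout}")
```

Every row is held out exactly once, which is why the cross-validated metrics give a fairer estimate of generalization than the training metrics.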

### Step 3.1: Configure and Train the Model

In [None]:
# Initialize the Gradient Boosting Model
fraud_gbm = H2OGradientBoostingEstimator(
    nfolds=5,                                    # 5-fold cross-validation
    seed=1111,                                   # For reproducibility
    keep_cross_validation_predictions=True      # Keep CV predictions for analysis
)

# Train the model
print("üöÄ Starting model training...\n")
fraud_gbm.train(
    x=predictors,           # Feature columns
    y=response,             # Target column
    training_frame=df       # Training dataset
)

print("✅ Model training complete!")

### Step 3.2: View Model Details

Let's examine the trained model's architecture and hyperparameters.

In [None]:
# Display model summary
print("üìä Model Details")
print("=" * 80)
print(f"Model Type: {fraud_gbm}")
print(f"Model Key: {fraud_gbm.model_id}")
print("\n")

# Show model summary with tree statistics
print("Model Summary:")
print(fraud_gbm.summary())

---

## Section 4: Model Evaluation

### Understanding Model Metrics

For binary classification, we evaluate models using:
- **MSE (Mean Squared Error)**: Average squared difference between the predicted probability and the actual class (0 or 1)
- **RMSE (Root Mean Squared Error)**: Square root of MSE, on the same scale as the predicted probabilities
- **LogLoss**: Penalizes confident but wrong probability estimates (lower is better)
- **AUC (Area Under the ROC Curve)**: Probability that a randomly chosen fraudulent transaction is scored higher than a randomly chosen legitimate one (higher is better)
- **Confusion Matrix**: Shows true positives, false positives, etc.
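
The metrics above can be computed by hand on a tiny example, which helps when reading H2O's output. This is a sketch with made-up labels and probabilities. The variable names are hypothetical, and the 0.5 threshold for the confusion matrix is an assumption chosen for illustration.

```python
import math

# Made-up data: actual classes and predicted probabilities of class 1
actual = [0, 0, 1, 1]
p1 = [0.1, 0.4, 0.35, 0.8]
n = len(actual)

# MSE and RMSE compare the predicted probability with the 0/1 label
mse = sum((p - y) ** 2 for p, y in zip(p1, actual)) / n
rmse = math.sqrt(mse)

# LogLoss grows quickly when a confident probability is wrong
logloss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(p1, actual)) / n

# AUC as the share of (positive, negative) pairs ranked correctly; ties count half
positives = [p for p, y in zip(p1, actual) if y == 1]
negatives = [p for p, y in zip(p1, actual) if y == 0]
wins = sum(1.0 if pos > neg else 0.5 if pos == neg else 0.0
           for pos in positives for neg in negatives)
auc = wins / (len(positives) * len(negatives))

# Confusion matrix counts at an assumed threshold of 0.5
predicted = [1 if p >= 0.5 else 0 for p in p1]
tp = sum(1 for y, yhat in zip(actual, predicted) if y == 1 and yhat == 1)
tn = sum(1 for y, yhat in zip(actual, predicted) if y == 0 and yhat == 0)
fp = sum(1 for y, yhat in zip(actual, predicted) if y == 0 and yhat == 1)
fn = sum(1 for y, yhat in zip(actual, predicted) if y == 1 and yhat == 0)

print(f"MSE={mse:.6f} RMSE={rmse:.6f} LogLoss={logloss:.6f} AUC={auc:.2f}")
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

On heavily imbalanced data such as fraud detection, MSE and accuracy can look excellent even for a model that never predicts fraud, so AUC and the confusion matrix deserve the most attention.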

### Step 4.1: Get Model Performance Metrics

In [None]:
# Get performance metrics on the training data
# (these are optimistic; compare with the cross-validation metrics in Step 4.2)
perf = fraud_gbm.model_performance()

print("📈 Model Performance Metrics")
print("=" * 80)
print(perf)
print("\n")

# Extract key metrics
print("Key Metrics Summary:")
print(f"  MSE:  {perf.mse():.6f}")
print(f"  RMSE: {perf.rmse():.6f}")
print(f"  LogLoss: {perf.logloss():.6f}")
print(f"  AUC: {perf.auc():.6f}")

### Step 4.2: Analyze Cross-Validation Results

Cross-validation helps us understand how well our model generalizes to unseen data.

In [None]:
# Get cross-validation metrics
print("üîÑ Cross-Validation Performance")
print("=" * 80)
print(f"CV MSE:  {fraud_gbm.mse(xval=True):.6f}")
print(f"CV RMSE: {fraud_gbm.rmse(xval=True):.6f}")
print(f"CV AUC:  {fraud_gbm.auc(xval=True):.6f}")

# Display the cross-validated confusion matrix
print("\nConfusion Matrix:")
print(fraud_gbm.confusion_matrix(xval=True))

---

## Section 5: Making Predictions

### Step 5.1: Generate Predictions on Training Data

Now let's use our trained model to make predictions. In a real scenario, you would do this on a separate test set.

In [None]:
# Generate predictions on the dataset
pred = fraud_gbm.predict(df)

print("üéØ Predictions Generated")
print("=" * 80)
print("\nFirst 10 predictions:")
print(pred.head(10))

# The prediction frame contains:
# - predict: The predicted class (0 or 1)
# - p0: Probability of class 0
# - p1: Probability of class 1
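
The `predict` column is derived from `p1` by comparing it with a classification threshold. For binary models, H2O by default uses the threshold that maximizes F1 rather than a fixed 0.5. The sketch below shows that idea in plain Python on made-up values. The helper `f1_at` is hypothetical, and the `>=` comparison is an assumption made for illustration.

```python
def f1_at(threshold, actual, p1):
    """F1 score when every row with p1 >= threshold is labelled as class 1."""
    predicted = [1 if p >= threshold else 0 for p in p1]
    tp = sum(1 for y, yhat in zip(actual, predicted) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(actual, predicted) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(actual, predicted) if y == 1 and yhat == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and class-1 probabilities
actual = [0, 0, 0, 1, 0, 1, 1]
p1 = [0.05, 0.10, 0.30, 0.35, 0.40, 0.70, 0.90]

# Try every observed probability as a candidate threshold and keep the best one
best_threshold = max(sorted(set(p1)), key=lambda t: f1_at(t, actual, p1))
labels = [1 if p >= best_threshold else 0 for p in p1]

print(f"Best threshold by F1: {best_threshold}")
print(f"Predicted labels: {labels}")
```

With rare positives, the F1-optimal threshold is often far from 0.5, which is why the predicted label can be 1 even when `p1` looks small.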

---

## Section 6: Model Interpretation

### Understanding Feature Importance and Interactions

Feature importance tells us which variables have the most impact on predictions.
Feature interactions reveal how features work together.

### Step 6.1: Extract Feature Interactions

In [None]:
# Extract feature interactions
feature_interactions = fraud_gbm.feature_interaction()

print("üîç Feature Interactions")
print("=" * 80)
print(feature_interactions)
print("\nHigher values indicate stronger feature interactions.")

### Step 6.2: Calculate Friedman and Popescu's H Statistics

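H-statistics measure how much of a feature pair's joint effect on the prediction comes from their interaction rather than from each feature separately, nominally on a scale from 0 to 1:
- **H = 0**: No interaction
- **H > 0**: Features interact (higher values = stronger interaction)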
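
A small worked example can make the statistic less abstract. The sketch below computes H for two hand-written functions of two features, using centered partial dependence values evaluated at the data points. The function `h_statistic` is a hypothetical helper written for this illustration. H2O's `h()` works on the trained model's partial dependence and may differ in detail.

```python
import math
from itertools import product


def h_statistic(f, rows):
    """Friedman-Popescu H for a two-feature function f over the given rows."""
    n = len(rows)
    x1s = [row[0] for row in rows]
    x2s = [row[1] for row in rows]

    def centered(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    # Partial dependence of one feature: average f over the observed values of the other
    pd_1 = centered([sum(f(a, b) for b in x2s) / n for a in x1s])
    pd_2 = centered([sum(f(a, b) for a in x1s) / n for b in x2s])
    # With only two features, the joint partial dependence is f itself
    pd_12 = centered([f(a, b) for a, b in rows])

    # Share of the joint effect that the two single-feature effects fail to explain
    numerator = sum((joint - a - b) ** 2 for joint, a, b in zip(pd_12, pd_1, pd_2))
    denominator = sum(joint ** 2 for joint in pd_12)
    return math.sqrt(numerator / denominator)


# A 3 x 3 grid of feature values
rows = list(product([-1.0, 0.0, 1.0], repeat=2))

h_additive = h_statistic(lambda a, b: a + b, rows)  # purely additive effect
h_product = h_statistic(lambda a, b: a * b, rows)   # pure interaction effect

print(f"H for a + b: {h_additive:.3f}")
print(f"H for a * b: {h_product:.3f}")
```

The additive function has no interaction, while the product cannot be written as a sum of single-feature effects at all, so the two cases sit at opposite ends of the scale.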

In [None]:
# Get Friedman and Popescu's H statistics for specific features
h = fraud_gbm.h(df, ['V1', 'V2'])

print("üìä Friedman and Popescu's H Statistics")
print("=" * 80)
print(f"H-statistic for V1 and V2 interaction: {h}")
print("\nInterpretation:")
print("  H ≈ 0: Features are independent")
print("  H > 0: Features interact in predictions")

### Step 6.3: Variable Importance Plot

In [None]:
# Get variable importance
var_importance = fraud_gbm.varimp(use_pandas=True)

print("üìä Variable Importance")
print("=" * 80)
print(var_importance)

# Plot variable importance
fraud_gbm.varimp_plot()

---

## Section 7: Model Persistence

### Saving Your Model

After training a model, you need to save it for later use in production or for sharing with others.

H2O models are saved in MOJO (Model Object, Optimized) format:
- **Fast**: Optimized for low-latency predictions
- **Portable**: Can be used in Java, Python, R, and other languages
- **Small**: Compact file size

### Step 7.1: Download Model Artifact

In [None]:
# Download the model as MOJO format
model_path = fraud_gbm.download_mojo('/home/sagemaker-user/fraud_gbm.mojo')

print("üíæ Model Saved Successfully!")
print("=" * 80)
print(f"Model saved to: {model_path}")
print("\nYou can now:")
print("  1. Deploy this model to production")
print("  2. Share it with your team")
print("  3. Load it in other environments")
print("  4. Use it with H2O's MOJO scoring pipeline")

### Step 7.2: Loading a Saved Model (Optional)

Here's how you would load the model back for future use:

In [None]:
# Example: Load a saved MOJO model
# loaded_model = h2o.import_mojo('/home/sagemaker-user/fraud_gbm.mojo')
# predictions = loaded_model.predict(new_data)

print("ℹ️  Use the commented code above to load your saved model in future sessions.")

---

## Section 8: Advanced Tips and Next Steps

### Model Inspection Commands

H2O provides several useful methods for model inspection:

In [None]:
# Helpful inspection commands
print("üîß Additional Model Inspection Tools")
print("=" * 80)
print("\n1. Detailed model explanation:")
print("   Use: model.explain()")
print("\n2. Toggle display tips:")
print("   Use: h2o.display.toggle_user_tips()")
print("\n3. Access H2O Flow UI:")
print(f"   Open: {h2o.connection().base_url}")
print("\n4. Get model as plain text:")
print("   Use: model.show()")

### Hyperparameter Tuning

To improve model performance, try tuning these hyperparameters:

In [None]:
# Example: Advanced model with tuned hyperparameters
# Uncomment and experiment with these settings

# advanced_gbm = H2OGradientBoostingEstimator(
#     ntrees=100,              # Number of trees (default: 50)
#     max_depth=6,             # Maximum tree depth (default: 5)
#     learn_rate=0.1,          # Learning rate (default: 0.1)
#     sample_rate=0.8,         # Row sampling rate (default: 1.0)
#     col_sample_rate=0.8,     # Column sampling rate (default: 1.0)
#     min_rows=10,             # Minimum observations per leaf (default: 10)
#     nfolds=5,
#     seed=1111
# )
# 
# advanced_gbm.train(x=predictors, y=response, training_frame=df)

print("üí° Experiment with hyperparameters to improve model performance!")

---

## Section 9: Workshop Summary

### What You've Learned

Congratulations! In this workshop, you've learned how to:

✅ **Set up an ML environment** with H2O and dependencies  
✅ **Load and prepare datasets** for training  
✅ **Train a Gradient Boosting Machine** classifier  
✅ **Evaluate model performance** using multiple metrics  
✅ **Generate predictions** on new data  
✅ **Interpret models** using feature importance and interactions  
✅ **Save and deploy models** as MOJO artifacts  

### Best Practices for Production ML

1. **Always split your data**: Use separate train/test sets
2. **Cross-validate**: Use k-fold CV to assess generalization
3. **Track experiments**: Log hyperparameters and metrics
4. **Monitor models**: Track performance drift over time
5. **Version models**: Keep track of model versions and lineage
6. **Document everything**: Explain your modeling decisions

### Next Steps

To continue your ML journey:

- **Try different algorithms**: Random Forest, XGBoost, Deep Learning
- **Feature engineering**: Create new features from existing ones
- **AutoML**: Let H2O automatically find the best model
- **Deploy to production**: Use SageMaker endpoints or batch transform
- **Integrate with MLOps**: Add monitoring, logging, and CI/CD

### Resources

- H2O Documentation: https://docs.h2o.ai/
- SageMaker Developer Guide: https://docs.aws.amazon.com/sagemaker/
- H2O AutoML: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

---

## Clean Up

Don't forget to shut down the H2O cluster when you're done!

In [None]:
# Shutdown H2O cluster
# h2o.cluster().shutdown()
print("⚠️  Uncomment the line above to shutdown H2O when finished.")

---

## Appendix: Exercise Challenges

### Challenge 1: Data Splitting
Modify the code to split the data into train (70%), validation (15%), and test (15%) sets.

```python
# Hint: Use h2o.H2OFrame.split_frame()
train, valid, test = df.split_frame(ratios=[0.7, 0.15], seed=1111)
```

### Challenge 2: Hyperparameter Grid Search
Implement a grid search to find the best hyperparameters.

```python
# Hint: Use h2o.grid.H2OGridSearch
from h2o.grid.grid_search import H2OGridSearch
```

### Challenge 3: Compare Multiple Algorithms
Train and compare GBM, Random Forest, and Deep Learning models.

```python
# Hint: Import from h2o.estimators
from h2o.estimators import H2ORandomForestEstimator, H2ODeepLearningEstimator
```

### Challenge 4: Create a Confusion Matrix Heatmap
Visualize the confusion matrix using matplotlib or seaborn.

### Challenge 5: ROC Curve
Plot the ROC curve and calculate the optimal threshold.

```python
# Hint: Use model.roc() and model.find_threshold_by_max_metric()
```
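
Before reaching for the H2O methods, it may help to see what a ROC curve is made of. The sketch below builds the curve's points by hand from made-up labels and scores and integrates them with the trapezoid rule. The helpers `roc_points` and `trapezoid_auc` are hypothetical and are not part of H2O.

```python
def roc_points(actual, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, strictest first."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # a threshold above every score flags nothing
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(actual, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(actual, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up labels and scores
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(actual, scores)
roc_auc = trapezoid_auc(points)

for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"AUC={roc_auc:.2f}")
```

Each point corresponds to one threshold, so choosing an "optimal" threshold means choosing a point on this curve according to some metric such as F1.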