# XGBoost Training with mlp_sdk

This notebook demonstrates how to use the mlp_sdk training wrapper to train an XGBoost model on SageMaker.

## What You'll Learn

1. Generate synthetic training data
2. Configure an mlp_sdk session
3. Upload data to S3
4. Train an XGBoost model with configuration-driven defaults
5. Monitor training progress
6. Deploy and test the model

## Prerequisites

- mlp_sdk installed
- AWS credentials configured
- admin-config.yaml configured (see examples/generate_admin_config.py)
- Appropriate IAM permissions for SageMaker

## Step 1: Install Dependencies and Import Libraries

In [1]:
%pip install sagemaker-mlp-sdk

Looking in indexes: https://pypi.org/simple, https://plugin.us-east-1.prod.workshops.aws
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import boto3
import os
from datetime import datetime

from mlp_sdk import MLP_Session
from mlp_sdk.exceptions import MLPSDKError

print("✅ All imports successful!")

✅ All imports successful!


## Step 2: Generate Synthetic Training Data

We'll create a binary classification dataset with 10,000 samples and 20 features.

In [3]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic classification data
print("Generating synthetic data...")
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],  # Imbalanced classes
    flip_y=0.05,  # Add some noise
    random_state=42
)

# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("✅ Data generated:")
print(f"   Training samples: {len(X_train)}")
print(f"   Validation samples: {len(X_val)}")
print(f"   Features: {X_train.shape[1]}")
print(f"   Class distribution (train): {np.bincount(y_train)}")

Generating synthetic data...
✅ Data generated:
   Training samples: 8000
   Validation samples: 2000
   Features: 20
   Class distribution (train): [5523 2477]


## Step 3: Prepare Data for XGBoost

The SageMaker built-in XGBoost algorithm expects CSV training data with no header row and the target variable in the first column.

In [4]:
# Create DataFrames
train_df = pd.DataFrame(X_train)
train_df.insert(0, 'target', y_train)

val_df = pd.DataFrame(X_val)
val_df.insert(0, 'target', y_val)

# Save to CSV (no header, no index)
os.makedirs('data', exist_ok=True)
train_df.to_csv('data/train.csv', header=False, index=False)
val_df.to_csv('data/validation.csv', header=False, index=False)

print("✅ Data saved to CSV files:")
print(f"   data/train.csv ({os.path.getsize('data/train.csv')} bytes)")
print(f"   data/validation.csv ({os.path.getsize('data/validation.csv')} bytes)")

# Preview the data
print("\n📊 Training data preview:")
print(train_df.head())

✅ Data saved to CSV files:
   data/train.csv (3090111 bytes)
   data/validation.csv (772747 bytes)

📊 Training data preview:
   target         0         1         2         3         4         5  \
0       0  0.345305  0.634140  1.616602 -1.836800  3.316415 -1.993505   
1       1 -0.528892  0.890251 -3.072907 -5.771960 -0.164547 -1.876536   
2       0 -3.194108 -4.078910  2.692774 -9.106173  3.732561 -2.156379   
3       0 -0.973791 -3.649703  1.738861 -3.109399 -0.604512 -2.676314   
4       0  2.722265  5.836598 -1.795938  5.814710  0.669367 -1.666890   

          6          7         8  ...        10        11        12        13  \
0  3.056641   0.154949  1.112160  ... -1.939895  2.633934 -1.343398  0.210973   
1  0.066301  -4.420647  2.175401  ... -2.404712  0.534482 -1.998253  2.406805   
2  7.939058 -10.678452 -0.690522  ... -4.053388 -0.455795 -2.232891  0.454620   
3  2.545406  -1.503649  0.698588  ... -0.584730  2.409607  5.360897  2.550879   
4 -6.391533  10.349074 -1.
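
Before uploading, it can be worth confirming that a file really has the layout described in this step: no header row, and the label in the first column. The following is a minimal sketch using only the standard library. The function name `check_label_first_csv` and the two-row sample file are illustrative assumptions and are not part of mlp_sdk.

```python
import csv
import os
import tempfile


def check_label_first_csv(path, allowed_labels=("0", "1")):
    """Return (row_count, feature_count) for a header-free CSV with the label in column 0."""
    feature_count = None
    row_count = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # ignore blank lines
            label, features = row[0], row[1:]
            if label not in allowed_labels:
                raise ValueError(
                    f"row {row_count + 1}: unexpected label {label!r} (is there a header row?)"
                )
            for value in features:
                float(value)  # raises ValueError if a feature is not numeric
            if feature_count is None:
                feature_count = len(features)
            elif len(features) != feature_count:
                raise ValueError(f"row {row_count + 1}: inconsistent number of features")
            row_count += 1
    return row_count, feature_count


# Hypothetical two-row sample in the same layout as data/train.csv
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="") as handle:
    handle.write("0,0.345305,0.634140\n1,-0.528892,0.890251\n")

print(check_label_first_csv(sample_path))  # (2, 2)
```

In the notebook, the same check could be run as `check_label_first_csv('data/train.csv')`, which should report 8000 rows and 20 features for the data generated above.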

## Step 4: Initialize mlp_sdk Session

The session will automatically load configuration from your admin-config.yaml file.

In [14]:
try:
    # Initialize the session from admin-config.yaml in the working directory.
    # To load a config from elsewhere, pass its path: MLP_Session(config_path="/path/to/config.yaml")
    session = MLP_Session(config_path="admin-config.yaml", log_level="DEBUG")

    print("✅ MLP_Session initialized successfully!")
    print(f"   Region: {session.region_name}")
    print(f"   Default bucket: {session.default_bucket}")
    print(f"   Execution role: {session.get_execution_role()}")

    # View configuration
    config = session.config_manager.MLP_config
    print("\n📋 Configuration loaded:")
    print(f"   Training instance: {config.compute_config.training_instance_type}")
    print(f"   Instance count: {config.compute_config.training_instance_count}")
    print(f"   VPC: {config.networking_config.vpc_id}")

except Exception as e:
    print(f"❌ Error initializing session: {e}")
    print("\n💡 Tip: Generate config with: python examples/generate_admin_config.py --interactive")
    raise

2026-02-03 17:21:16 - mlp_sdk.session - INFO - Initializing MLP_Session | config_path=admin-config.yaml


2026-02-03 17:21:16 - mlp_sdk.session - DEBUG - SageMaker SessionSettings initialized | region=us-west-2 | default_bucket=sagemaker-us-west-2-716664005094


2026-02-03 17:21:16 - mlp_sdk.session - INFO - MLP_Session initialized successfully | has_config=True


✅ MLP_Session initialized successfully!
   Region: us-west-2
   Default bucket: sagemaker-us-west-2-716664005094
   Execution role: arn:aws:iam::716664005094:role/service-role/AmazonSageMaker-ExecutionRole-20251106T123495

📋 Configuration loaded:
   Training instance: ml.m5.xlarge
   Instance count: 1
   VPC: vpc-0e6bc35a47325c93e


## Step 5: Upload Data to S3

Upload the training and validation data to S3 using the default bucket from configuration.

In [15]:
# Create S3 paths
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
s3_prefix = f"xgboost-example/{timestamp}"

train_s3_path = f"s3://{session.default_bucket}/{s3_prefix}/train/"
val_s3_path = f"s3://{session.default_bucket}/{s3_prefix}/validation/"
output_s3_path = f"s3://{session.default_bucket}/{s3_prefix}/output"

print("📤 Uploading data to S3...")
print(f"   Bucket: {session.default_bucket}")
print(f"   Prefix: {s3_prefix}")

# Upload files
s3_client = session.boto_session.client('s3')

try:
    s3_client.upload_file(
        'data/train.csv',
        session.default_bucket,
        f"{s3_prefix}/train/train.csv"
    )
    print(f"   ✅ Uploaded: {train_s3_path}")

    s3_client.upload_file(
        'data/validation.csv',
        session.default_bucket,
        f"{s3_prefix}/validation/validation.csv"
    )
    print(f"   ✅ Uploaded: {val_s3_path}")

except Exception as e:
    print(f"❌ Error uploading to S3: {e}")
    raise

📤 Uploading data to S3...
   Bucket: sagemaker-us-west-2-716664005094
   Prefix: xgboost-example/20260203-172118


   ✅ Uploaded: s3://sagemaker-us-west-2-716664005094/xgboost-example/20260203-172118/train/
   ✅ Uploaded: s3://sagemaker-us-west-2-716664005094/xgboost-example/20260203-172118/validation/
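
This step works with two forms of the same location: `upload_file` takes a bucket and a key separately, while the training channels use a full `s3://` URI. A small helper for converting a URI back into its parts might look like the sketch below. The name `split_s3_uri` and the bucket `example-bucket` are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://example-bucket/xgboost-example/20260203-172118/train/train.csv"
)
print(bucket)  # example-bucket
print(key)     # xgboost-example/20260203-172118/train/train.csv
```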


## Step 6: Configure XGBoost Training Job

Define the training job parameters. The mlp_sdk will automatically apply defaults from your configuration.

In [22]:
# XGBoost hyperparameters
hyperparameters = {
    'objective': 'binary:logistic',
    'num_round': '100',
    'max_depth': '5',
    'eta': '0.2',
    'gamma': '4',
    'min_child_weight': '6',
    'subsample': '0.8',
    'verbosity': '1',
    'eval_metric': 'auc',
    'scale_pos_weight': '2'  # Handle class imbalance
}

# Get XGBoost container image
# This is the AWS-managed XGBoost container. The registry account ID is
# region-specific; 246618743249 is the account that served us-west-2 in this run.
region = session.region_name
xgboost_container = f"246618743249.dkr.ecr.{region}.amazonaws.com/sagemaker-xgboost:1.5-1"

print("📋 Training configuration:")
print(f"   Container: {xgboost_container}")
print(f"   Hyperparameters: {hyperparameters}")
print(f"   Training data: {train_s3_path}")
print(f"   Validation data: {val_s3_path}")
print(f"   Output path: {output_s3_path}")

📋 Training configuration:
   Container: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.5-1
   Hyperparameters: {'objective': 'binary:logistic', 'num_round': '100', 'max_depth': '5', 'eta': '0.2', 'gamma': '4', 'min_child_weight': '6', 'subsample': '0.8', 'verbosity': '1', 'eval_metric': 'auc', 'scale_pos_weight': '2'}
   Training data: s3://sagemaker-us-west-2-716664005094/xgboost-example/20260203-172118/train/
   Validation data: s3://sagemaker-us-west-2-716664005094/xgboost-example/20260203-172118/validation/
   Output path: s3://sagemaker-us-west-2-716664005094/xgboost-example/20260203-172118/output
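
The value `scale_pos_weight='2'` used above is close to the ratio of negative to positive training examples, which is a common starting point for this parameter on imbalanced data. As a worked example using the class counts printed in Step 2 (5523 negatives, 2477 positives), the ratio can be computed as follows. The helper name `suggested_scale_pos_weight` is an illustrative assumption.

```python
def suggested_scale_pos_weight(n_negative, n_positive):
    """Ratio of negative to positive examples, a common starting value for scale_pos_weight."""
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return n_negative / n_positive


# Class counts from the Step 2 output: np.bincount(y_train) -> [5523 2477]
ratio = suggested_scale_pos_weight(5523, 2477)
print(f"{ratio:.2f}")  # 2.23

# The hyperparameters in this notebook are passed as strings
hyperparameter_value = str(round(ratio, 2))
print(hyperparameter_value)  # 2.23
```

This is only a heuristic; the rounded value of 2 used in this notebook is a reasonable approximation, and the best setting is usually found by validating a few candidates.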


## Step 7: Start Training Job with mlp_sdk

Use the mlp_sdk training wrapper to start the training job. Notice how we don't need to specify:
- Instance type (from config)
- Instance count (from config)
- VPC configuration (from config)
- Security groups (from config)
- Subnets (from config)
- IAM role (from config)
- KMS key (from config)

All these are automatically applied from your admin-config.yaml!
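
The precedence implied by this list, where configuration supplies a value unless the caller passes one at runtime, can be illustrated with a small sketch. This is not mlp_sdk's implementation; `resolve_settings`, the setting names, and the sample values are hypothetical and only show the general idea.

```python
def resolve_settings(config_defaults, runtime_overrides):
    """Pick each setting from runtime if given, else from config, and record the source."""
    resolved = {}
    for name in sorted(set(config_defaults) | set(runtime_overrides)):
        runtime_value = runtime_overrides.get(name)
        if runtime_value is not None:
            resolved[name] = (runtime_value, "runtime")
        else:
            resolved[name] = (config_defaults.get(name), "config")
    return resolved


# Hypothetical values for illustration only
config_defaults = {
    "instance_type": "ml.m5.xlarge",
    "instance_count": 1,
    "output_path": "s3://example-bucket/default-output",
}
runtime_overrides = {
    "output_path": "s3://example-bucket/run-output",
    "instance_count": None,  # None means "not specified by the caller"
}

for name, (value, source) in resolve_settings(config_defaults, runtime_overrides).items():
    print(f"{name}: {value} ({source})")
```

Recording the source next to each value mirrors the DEBUG log lines in this step's output, which report whether a setting came from config or from a runtime parameter.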

In [25]:
# Generate unique job name
job_name = f"xgboost-training-{timestamp}"

print(f"🚀 Starting training job: {job_name}")
print("\n⏳ This may take 5-10 minutes...\n")

try:
    # SDK v3 ModelTrainer expects inputs as a dict of channel_name: S3 URI
    inputs = {
        'train': train_s3_path,
        'validation': val_s3_path
    }

    training_job = session.run_training_job(
        job_name=job_name,
        training_image=xgboost_container,
        inputs=inputs,
        hyperparameters=hyperparameters,
        output_path=output_s3_path,
        max_run_in_seconds=3600
    )

    print("\n✅ Training job started!")
    print("   ModelTrainer object created")
    print("\n💡 Monitor in SageMaker console or use --wait flag")

except MLPSDKError as e:
    print(f"❌ SDK Error: {e}")
    raise
except Exception as e:
    print(f"❌ Error: {e}")
    raise

2026-02-03 19:02:39 - mlp_sdk.session - INFO - run_training_job called | name=xgboost-training-20260203-172118


🚀 Starting training job: xgboost-training-20260203-172118

⏳ This may take 5-10 minutes...



2026-02-03 19:02:39 - mlp_sdk.session - INFO - Running training job with ModelTrainer | name=xgboost-training-20260203-172118


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using config instance_type | value=ml.m5.xlarge


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using config instance_count | value=1


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using config role_arn | role=arn:aws:iam::716664005094:role/service-role/AmazonSageMaker-ExecutionRole-20251106T123495


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using config subnets | subnets=['subnet-0db6fd10c0e431706']


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using config security_group_ids | security_groups=['sg-05039bb7ddd07bc3b']


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using runtime output_path


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using config volume_kms_key | key_id=0bf8a5e6-a713-4d33-8e3e-58fe4aa64780


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using config output_kms_key | key_id=0bf8a5e6-a713-4d33-8e3e-58fe4aa64780


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Default input S3 URI available | uri=s3://sagemaker-us-west-2-716664005094/input/


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using runtime parameter: hyperparameters


2026-02-03 19:02:39 - mlp_sdk.session - DEBUG - Using runtime parameter: max_run_in_seconds


2026-02-03 19:02:40 - mlp_sdk.session - DEBUG - Starting training job with ModelTrainer | name=xgboost-training-20260203-172118 | instance_type=ml.m5.xlarge | instance_count=1 | has_source_code=False


2026-02-03 19:05:00 - mlp_sdk.session - INFO - Training job started successfully | name=xgboost-training-20260203-172118



✅ Training job started!
   ModelTrainer object created

💡 Monitor in SageMaker console or use --wait flag


## Step 8: Monitor Training Progress

Monitor the training job status and wait for completion.
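
The cell in this step polls `describe_training_job` in an unbounded `while True` loop. A reusable variant with an overall timeout could look like the sketch below. The function name `wait_for_training_job`, its parameters, and the fake `describe` function used for the demonstration are assumptions made for illustration; the only thing taken from the notebook is that the describe call returns a dict with a `TrainingJobStatus` key.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, timeout_seconds=3600,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll describe(TrainingJobName=...) until a terminal status or the timeout is reached."""
    deadline = clock() + timeout_seconds
    while True:
        response = describe(TrainingJobName=job_name)
        if response["TrainingJobStatus"] in TERMINAL_STATUSES:
            return response
        if clock() >= deadline:
            raise TimeoutError(f"{job_name} did not finish within {timeout_seconds} seconds")
        sleep(poll_seconds)


# Demonstration with a fake describe function (no AWS calls are made)
_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobName": TrainingJobName, "TrainingJobStatus": next(_statuses)}


result = wait_for_training_job(fake_describe, "example-job", sleep=lambda seconds: None)
print(result["TrainingJobStatus"])  # Completed
```

In the notebook this could be called as `wait_for_training_job(sagemaker_client.describe_training_job, job_name)`, with the returned response inspected in the same way as in the cell below.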

In [26]:
import time

print(f"📊 Monitoring training job: {job_name}\n")

# Get the actual training job name from ModelTrainer
if hasattr(training_job, '_latest_training_job') and training_job._latest_training_job:
    actual_job_name = training_job._latest_training_job.get_name()
    job_name = actual_job_name  # Update the variable
    print(f"✅ Actual training job name: {job_name}")


sagemaker_client = session.sagemaker_client

while True:
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    status = response['TrainingJobStatus']

    if status == 'Completed':
        print("\n✅ Training completed successfully!")
        print(f"   Training time: {response.get('TrainingTimeInSeconds', 0)} seconds")
        print(f"   Billable time: {response.get('BillableTimeInSeconds', 0)} seconds")

        # Get final metrics
        if 'FinalMetricDataList' in response:
            print("\n📈 Final metrics:")
            for metric in response['FinalMetricDataList']:
                print(f"   {metric['MetricName']}: {metric['Value']:.4f}")

        # Get model artifact location
        model_artifacts = response['ModelArtifacts']['S3ModelArtifacts']
        print(f"\n📦 Model artifacts: {model_artifacts}")
        break

    elif status == 'Failed':
        print("\n❌ Training failed!")
        print(f"   Failure reason: {response.get('FailureReason', 'Unknown')}")
        break

    elif status == 'Stopped':
        print("\n⚠️  Training stopped!")
        break

    else:
        # Print progress
        print(f"   Status: {status} | Time: {datetime.now().strftime('%H:%M:%S')}", end='\r')
        time.sleep(30)  # Check every 30 seconds

📊 Monitoring training job: xgboost-training-20260203-172118

✅ Actual training job name: xgboost-training-20260203-172118-20260203190240

✅ Training completed successfully!
   Training time: 99 seconds
   Billable time: 99 seconds

📈 Final metrics:
   validation:auc: 0.9676
   train:auc: 0.9959

📦 Model artifacts: s3://sagemaker-us-west-2-716664005094/xgboost-example/20260203-172118/output/xgboost-training-20260203-172118-20260203190240/output/model.tar.gz


## Step 9: View Training Logs (Optional)

You can view the training logs in CloudWatch or using the SageMaker console.

In [27]:
# Get CloudWatch log stream
response = sagemaker_client.describe_training_job(TrainingJobName=job_name)

print("📝 Training logs:")
print("   Log group: /aws/sagemaker/TrainingJobs")
print(f"   Log stream: {job_name}/algo-1-*")
print("\n💡 View logs in CloudWatch console or use AWS CLI:")
print(f"   aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix {job_name}")

📝 Training logs:
   Log group: /aws/sagemaker/TrainingJobs
   Log stream: xgboost-training-20260203-172118-20260203190240/algo-1-*

💡 View logs in CloudWatch console or use AWS CLI:
   aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix xgboost-training-20260203-172118-20260203190240


## Step 10: View Audit Trail

The mlp_sdk automatically tracks all operations in an audit trail.

In [28]:
# Get audit trail
audit_entries = session.get_audit_trail(operation="run_training_job")

print(f"📊 Audit Trail ({len(audit_entries)} training job operations):\n")

for entry in audit_entries[-5:]:  # Show last 5
    print(f"   {entry.get('timestamp')}: {entry.get('operation')}")
    print(f"      Status: {entry.get('status')}")
    if 'parameters' in entry:
        print(f"      Job: {entry['parameters'].get('job_name', 'N/A')}")
    print()

📊 Audit Trail (8 training job operations):

   2026-02-04T01:28:24.345113: run_training_job
      Status: failed

   2026-02-04T02:29:14.394767: run_training_job
      Status: started

   2026-02-04T02:31:40.795699: run_training_job
      Status: failed

   2026-02-04T03:02:39.531799: run_training_job
      Status: started

   2026-02-04T03:05:00.891212: run_training_job
      Status: completed
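
Because each audit entry is read like a dictionary with `timestamp`, `operation`, and `status` keys, the trail is easy to summarize. The sketch below counts entries per status using the five entries shown in the output above, copied into plain dicts. The helper name `summarize_audit` is an illustrative assumption, and real entries may carry additional fields.

```python
from collections import Counter


def summarize_audit(entries):
    """Count audit entries per status; entries without a status are counted as 'unknown'."""
    return Counter(entry.get("status", "unknown") for entry in entries)


# The five entries from the output above, as plain dicts
sample_entries = [
    {"timestamp": "2026-02-04T01:28:24.345113", "operation": "run_training_job", "status": "failed"},
    {"timestamp": "2026-02-04T02:29:14.394767", "operation": "run_training_job", "status": "started"},
    {"timestamp": "2026-02-04T02:31:40.795699", "operation": "run_training_job", "status": "failed"},
    {"timestamp": "2026-02-04T03:02:39.531799", "operation": "run_training_job", "status": "started"},
    {"timestamp": "2026-02-04T03:05:00.891212", "operation": "run_training_job", "status": "completed"},
]

counts = summarize_audit(sample_entries)
print(dict(counts))  # {'failed': 2, 'started': 2, 'completed': 1}
```

In the notebook, the same function could be applied directly as `summarize_audit(audit_entries)`.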



## Step 11: Deploy Model (Optional)

Deploy the trained model to a SageMaker endpoint for real-time predictions.

The mlp_sdk deployment wrapper fills in the following settings from configuration where they are defined, and falls back to SDK defaults otherwise:
- Instance type and count
- IAM execution role
- VPC configuration (security groups, subnets), when enabled for endpoints
- KMS encryption

In [29]:

# Create endpoint name with timestamp
endpoint_name = f'xgboost-endpoint-{timestamp}'

print(f"🚀 Deploying model to endpoint: {endpoint_name}")
print("⏳ This may take 5-10 minutes...\n")

# Deploy using mlp_sdk - automatically applies config defaults
predictor = session.deploy_model(
    model_data=model_artifacts,
    image_uri=xgboost_container,
    endpoint_name=endpoint_name
)

print("✅ Model deployed successfully!")
print(f"   Endpoint name: {predictor.endpoint_name}")

print("💡 To deploy the model, uncomment the code above and run this cell.")

2026-02-03 19:05:54 - mlp_sdk.session - INFO - deploy_model called | endpoint_name=xgboost-endpoint-20260203-172118


🚀 Deploying model to endpoint: xgboost-endpoint-20260203-172118
⏳ This may take 5-10 minutes...



2026-02-03 19:05:54 - mlp_sdk.session - INFO - Deploying model to endpoint | endpoint_name=xgboost-endpoint-20260203-172118


2026-02-03 19:05:57 - mlp_sdk.session - DEBUG - Using default instance_type | value=ml.m5.large


2026-02-03 19:05:57 - mlp_sdk.session - DEBUG - Using default instance_count | value=1


2026-02-03 19:05:57 - mlp_sdk.session - DEBUG - Using config role_arn | role=arn:aws:iam::716664005094:role/service-role/AmazonSageMaker-ExecutionRole-20251106T123495


2026-02-03 19:05:57 - mlp_sdk.session - DEBUG - VPC configuration disabled, skipping network config


2026-02-03 19:05:57 - mlp_sdk.session - DEBUG - Using config kms_key | key_id=0bf8a5e6-a713-4d33-8e3e-58fe4aa64780


2026-02-03 19:05:57 - mlp_sdk.session - DEBUG - Creating ModelBuilder | model_data=s3://sagemaker-us-west-2-716664005094/xgboost-example/20260203-172118/output/xgboost-training-20260203-172118-20260203190240/output/model.tar.gz | image_uri=246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.5-1


2026-02-03 19:06:00 - mlp_sdk.session - INFO - VPC configuration disabled for endpoint deployment


2026-02-03 19:06:00 - mlp_sdk.session - DEBUG - Deploying model to endpoint | endpoint_name=xgboost-endpoint-20260203-172118 | instance_type=ml.m5.large | instance_count=1


2026-02-03 19:09:35 - mlp_sdk.session - INFO - Model deployed successfully | endpoint_name=xgboost-endpoint-20260203-172118


✅ Model deployed successfully!
   Endpoint name: xgboost-endpoint-20260203-172118
💡 To deploy the model, uncomment the code above and run this cell.


## Step 12: Make Predictions (Optional)

Test the deployed model with sample predictions.
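
With the `binary:logistic` objective used in Step 6, the endpoint returns one probability of the positive class per input row, not a class label. Turning those probabilities into labels requires a decision threshold. The sketch below applies a threshold to the five probabilities recorded in this step's output and compares the result with the actual labels. The helper name `to_labels` and the default threshold of 0.5 are illustrative assumptions.

```python
def to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using a decision threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Probabilities and actual labels as recorded in this step's output
probabilities = [0.4146, 0.2509, 0.0714, 0.7144, 0.8693]
actual = [0, 0, 0, 1, 1]

predicted = to_labels(probabilities)
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

print(predicted)  # [0, 0, 0, 1, 1]
print(accuracy)   # 1.0
```

The threshold is a design choice: lowering it to 0.4 would flip the first sample to the positive class, which trades precision for recall. Five samples are far too few to judge the model, so the validation AUC from Step 8 remains the better quality signal.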

In [30]:
# Requires the endpoint deployed in Step 11
from io import StringIO

import numpy as np

# Prepare test data (first 5 validation samples)
# The endpoint expects header-free CSV rows containing features only (no target column)
test_data = X_val[:5]

# Convert to CSV string format
csv_buffer = StringIO()
np.savetxt(csv_buffer, test_data, delimiter=',', fmt='%.6f')
test_payload = csv_buffer.getvalue()

# Make predictions
predictions = predictor.predict(test_payload)

print("🔮 Predictions:")
for i, (pred, actual) in enumerate(zip(predictions.split('\n')[:5], y_val[:5])):
    if pred.strip():  # Skip empty lines
        print(f"   Sample {i+1}: Predicted={float(pred):.4f}, Actual={actual}")

print("💡 To make predictions, deploy the model first and uncomment the code above.")

🔮 Predictions:
   Sample 1: Predicted=0.4146, Actual=0
   Sample 2: Predicted=0.2509, Actual=0
   Sample 3: Predicted=0.0714, Actual=0
   Sample 4: Predicted=0.7144, Actual=1
   Sample 5: Predicted=0.8693, Actual=1
💡 To make predictions, deploy the model first and uncomment the code above.


## Step 13: Cleanup (Optional)

Clean up resources to avoid unnecessary charges.

In [None]:
# Uncomment to clean up resources

# # Delete endpoint (if deployed)
# print("🗑️  Deleting endpoint...")
# session.delete_endpoint(endpoint_name)
# print("✅ Endpoint deleted")

# # Delete the uploaded training and validation data from S3 (optional)
# print("\n🗑️  Deleting S3 data...")
# s3_client = session.boto_session.client('s3')
# s3_client.delete_object(Bucket=session.default_bucket, Key=f"{s3_prefix}/train/train.csv")
# s3_client.delete_object(Bucket=session.default_bucket, Key=f"{s3_prefix}/validation/validation.csv")
# print("✅ S3 data deleted")

print("💡 Uncomment the code above to clean up resources.")
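
The commented cleanup above removes only the two uploaded CSV files; the model artifacts written under the same prefix stay in the bucket. One way to remove everything under a prefix is sketched below, using the boto3 S3 client's `list_objects_v2` paginator and `delete_objects` call. The function name `delete_prefix` and the `StubS3Client` class are assumptions: the stub exists only so the example runs without AWS access, and it imitates just the three client methods the function needs.

```python
def delete_prefix(s3_client, bucket, prefix):
    """Delete every object under prefix and return how many keys were removed."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects have no 'Contents' entry
        keys = [{"Key": item["Key"]} for item in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


class StubS3Client:
    """In-memory stand-in for the parts of the S3 client used above (illustration only)."""

    def __init__(self, keys):
        self.keys = set(keys)

    def get_paginator(self, name):
        assert name == "list_objects_v2"
        return self

    def paginate(self, Bucket, Prefix):
        matching = sorted(k for k in self.keys if k.startswith(Prefix))
        yield {"Contents": [{"Key": k} for k in matching]}

    def delete_objects(self, Bucket, Delete):
        for obj in Delete["Objects"]:
            self.keys.discard(obj["Key"])


stub = StubS3Client([
    "xgboost-example/run/train/train.csv",
    "xgboost-example/run/output/model.tar.gz",
    "other/keep.txt",
])
print(delete_prefix(stub, "example-bucket", "xgboost-example/run/"))  # 2
print(sorted(stub.keys))  # ['other/keep.txt']
```

With a real client the call would be `delete_prefix(s3_client, session.default_bucket, s3_prefix + "/")`. The deletion covers the trained model as well, so it is worth double-checking the prefix before running it.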

## Summary

In this notebook, you learned how to:

✅ Generate synthetic training data for binary classification

✅ Initialize mlp_sdk session with configuration-driven defaults

✅ Upload data to S3

✅ Train an XGBoost model using the mlp_sdk training wrapper

✅ Monitor training progress

✅ View audit trail of operations

### Key Benefits of mlp_sdk

1. **Configuration-driven defaults**: No need to specify instance types, VPC, security groups, IAM roles, etc.
2. **Simplified API**: Focus on ML logic, not infrastructure
3. **Audit trail**: Automatic tracking of all operations
4. **Error handling**: Clear error messages with actionable guidance
5. **Flexibility**: Override any default at runtime when needed

### Next Steps

- Try different hyperparameters
- Use your own dataset
- Deploy the model to an endpoint
- Create a pipeline with processing and training steps
- Explore other mlp_sdk features (processing jobs, feature store, pipelines)

### Resources

- [mlp_sdk Documentation](../README.md)
- [Configuration Guide](../docs/CONFIGURATION_GUIDE.md)
- [Usage Examples](../docs/USAGE_EXAMPLES.md)
- [Quick Start Guide](QUICKSTART.md)