# XGBoost Training with Amazon SageMaker Python SDK v3

This notebook demonstrates **model training and deployment** using the new **SageMaker Python SDK v3** unified APIs.

## SDK v3 vs SDK v2

| Task | Legacy SDK v2 | SDK v3 |
|------|--------------|--------|
| Training | `SKLearn()`, `XGBoost()`, `PyTorch()` | `ModelTrainer` |
| Input Data | `TrainingInput` | `InputData` |
| Compute | Inline parameters | `Compute` config class |
| Deployment | `model.deploy()` | `Model.create` + `EndpointConfig.create` + `Endpoint.create` |

## Use Case: Gas Lift Optimization

**Gas lift** is an artificial lift method where compressed gas is injected into oil wells to reduce fluid density and increase production.

**ML task**: Predict oil production given well sensor readings and gas injection rates (regression).

**Data**: [Petrobras 3W dataset](https://github.com/petrobras/3W) - real sensor data from offshore wells.

---

## 1. Setup and Configuration

In [None]:
# Install required packages
!pip install sagemaker boto3 pandas numpy xgboost scipy pyarrow joblib --quiet

In [None]:
import boto3
import pandas as pd
import numpy as np
from pathlib import Path
import json
import os
import tarfile
import shutil
from datetime import datetime

# Amazon SageMaker AI SDK v3 imports
from sagemaker.train.model_trainer import ModelTrainer
from sagemaker.train.configs import InputData, Compute, SourceCode, OutputDataConfig
from sagemaker.core.helper.session_helper import Session, get_execution_role
from sagemaker.core.image_uris import retrieve

# Initialize Amazon SageMaker AI session
sagemaker_session = Session()
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()
prefix = 'gas-lift-optimization'

# Get execution role - prefer explicit SageMaker role, fall back to get_execution_role()
def get_sagemaker_role():
    """Get a valid SageMaker execution role."""
    # First, search for an explicit SageMaker execution role
    iam = boto3.client('iam')
    try:
        roles = iam.list_roles()['Roles']
        sm_roles = [r for r in roles 
                    if 'SageMaker' in r['RoleName'] 
                    and 'Execution' in r['RoleName']
                    and 'service-role' in r['Arn']]
        if sm_roles:
            return sm_roles[0]['Arn']
    except Exception:
        pass
    
    # Fall back to get_execution_role() (works in SageMaker notebooks)
    try:
        return get_execution_role()
    except ValueError:
        pass
    
    raise ValueError(
        "No SageMaker execution role found. "
        "Create one in IAM with AmazonSageMakerFullAccess policy."
    )

role = get_sagemaker_role()

print(f"SageMaker Python SDK v3")
print(f"Region: {region}")
print(f"Role: {role.split('/')[-1]}")
print(f"Amazon S3 Bucket: {bucket}")

## 2. Training Scripts

The training scripts are in the `../training/` directory of this repository.

In [None]:
# Point to training code directory
source_dir = Path('../training').resolve()

print(f"Training code directory: {source_dir}")
for f in source_dir.iterdir():
    print(f"  {f.name} ({f.stat().st_size} bytes)")
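
The contents of `train.py` are not shown in this notebook. As a rough sketch only, the argument handling in such a script might look like the following. It assumes the hyperparameters configured in section 5 arrive as command-line flags and that the container exposes the data and model paths through the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables. The function name `parse_args` and the default values are hypothetical, not taken from the repository.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker paths (hypothetical layout)."""
    parser = argparse.ArgumentParser()
    # Flag names mirror the keys of the notebook's hyperparameters dict.
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--max-depth", type=int, default=6)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--target-column", type=str, default="production")
    # Assumption: the container sets these environment variables.
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    return parser.parse_args(argv)


# Local dry run with two overridden hyperparameters.
args = parse_args(["--n-estimators", "50", "--learning-rate", "0.05"])
print(args.n_estimators, args.max_depth, args.learning_rate, args.target_column)
```

Keeping the parsing in one function makes it possible to exercise the script locally with an explicit argument list before submitting a training job.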

## 3. Load and Prepare Data

This notebook uses the open-source **Petrobras 3W dataset**, which contains real sensor data from gas-lifted offshore wells.

In [None]:
# Clone 3W dataset if not present
data_path = Path('./3W')
if not data_path.exists():
    print("Cloning 3W dataset (Petrobras open-source well data)...")
    !git clone https://github.com/petrobras/3W.git
else:
    print(f"Dataset exists at {data_path.absolute()}")

In [None]:
# Load well files (class 3 = Severe Slugging, has gas lift rate variation)
dataset_path = data_path / 'dataset' / '3'
real_files = sorted(dataset_path.glob('WELL-*.parquet'))
print(f"Found {len(real_files)} well files")

dfs = []
for i, f in enumerate(real_files):
    df = pd.read_parquet(f, engine='pyarrow')
    df['file_id'] = i
    if isinstance(df.index, pd.DatetimeIndex):
        df = df.reset_index()
    dfs.append(df)
    if i < 3:
        print(f"  {f.name}: {len(df):,} rows")

all_data = pd.concat(dfs, ignore_index=True)
print(f"\nTotal: {len(all_data):,} rows from {len(dfs)} files")

In [None]:
# Select numeric features with good data availability
exclude_cols = ['timestamp', 'class', 'state', 'file_id', 'well_name']
numeric_cols = all_data.select_dtypes(include=[np.number]).columns.tolist()

feature_cols = [col for col in numeric_cols 
                if col not in exclude_cols 
                and all_data[col].notna().sum() / len(all_data) > 0.5
                and all_data[col].std() > 0]

print(f"Selected {len(feature_cols)} features: {feature_cols}")

In [None]:
# Prepare clean data with synthetic production target
clean_data = all_data[feature_cols].copy()
for col in feature_cols:
    clean_data[col] = clean_data[col].fillna(clean_data[col].median())

# Create synthetic production target (in production, use actual well test data)
np.random.seed(42)
well_ids = all_data['file_id'].values[:len(clean_data)]
n_wells = int(well_ids.max()) + 1
well_efficiency = np.random.uniform(0.5, 1.5, n_wells)

# Gas lift effect (non-linear response)
if 'QGL' in feature_cols:
    qgl = clean_data['QGL'].values
    qgl_norm = (qgl - qgl.mean()) / (qgl.std() + 1e-6)
    gl_effect = 30 * np.tanh(qgl_norm) * well_efficiency[well_ids.astype(int)]
else:
    gl_effect = 0

# Pressure effect
p_cols = [c for c in feature_cols if c.startswith('P-')]
if p_cols:
    # Standardize the first pressure column and scale its contribution
    p = clean_data[p_cols[0]]
    p_effect = 20 * (p - p.mean()) / p.std()
else:
    p_effect = 0

production = 100 + gl_effect + p_effect + np.random.normal(0, 5, len(clean_data))
clean_data['production'] = np.clip(production, 10, 200)

print(f"Data shape: {clean_data.shape}")
print(f"Production range: {clean_data['production'].min():.1f} - {clean_data['production'].max():.1f}")
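
Written out, the synthetic target built in the cell above has the following form. The symbols are notation introduced here for illustration: $z^{\mathrm{QGL}}_i$ is the standardized gas lift rate for row $i$, $z^{P}_i$ is the standardized value of the first pressure column, $e_{w(i)}$ is the random efficiency of the well (file) that row $i$ belongs to, and $\varepsilon_i$ is the noise term.

$$
y_i = \operatorname{clip}\!\Big(100 + 30\,\tanh\!\big(z^{\mathrm{QGL}}_i\big)\,e_{w(i)} + 20\,z^{P}_i + \varepsilon_i,\; 10,\; 200\Big),
\qquad e_{w} \sim \mathcal{U}(0.5,\,1.5),\quad \varepsilon_i \sim \mathcal{N}(0,\,5^2)
$$

The gas lift term drops out when `QGL` is not among the selected features, and the pressure term drops out when no column name starts with `P-`. Because the target is generated from the features themselves, the model scores later in the notebook reflect how well this formula can be recovered, not real production behavior.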

## 4. Upload Data to Amazon S3

In [None]:
# Save and upload training data
local_data_dir = Path('./data')
local_data_dir.mkdir(exist_ok=True)

train_file = local_data_dir / 'train.csv'
clean_data.to_csv(train_file, index=False)
print(f"Saved: {train_file} ({train_file.stat().st_size / 1024 / 1024:.2f} MB)")

# Upload to S3
s3_client = boto3.client('s3')
train_s3_key = f"{prefix}/data/train/train.csv"
train_s3_uri = f"s3://{bucket}/{train_s3_key}"

s3_client.upload_file(str(train_file), bucket, train_s3_key)
print(f"Uploaded to: {train_s3_uri}")

## 5. Configure Amazon SageMaker AI Training Job

Using **SageMaker Python SDK v3** with:
- `ModelTrainer` - unified training API
- SKLearn 1.4-2 container with custom requirements.txt for XGBoost

In [None]:
# Get SKLearn container
sklearn_image = retrieve(
    framework='sklearn',
    region=region,
    version='1.4-2',
    py_version='py3',
    image_scope='training'
)
print(f"Training container: {sklearn_image.split('/')[-1]}")

In [None]:
# Configure training job
hyperparameters = {
    'n-estimators': '100',
    'max-depth': '6',
    'learning-rate': '0.1',
    'target-column': 'production'
}

source_code_config = SourceCode(
    source_dir=str(source_dir),
    entry_script='train.py',
    requirements='requirements.txt'
)

compute_config = Compute(
    instance_type='ml.m5.large',
    instance_count=1,
    volume_size_in_gb=30
)

output_s3_path = f's3://{bucket}/{prefix}/output'

model_trainer = ModelTrainer(
    training_image=sklearn_image,
    source_code=source_code_config,
    compute=compute_config,
    hyperparameters=hyperparameters,
    role=role,
    base_job_name='gaslift-xgboost',
    output_data_config=OutputDataConfig(s3_output_path=output_s3_path),
    sagemaker_session=sagemaker_session
)

print(f"ModelTrainer configured")
print(f"  Instance: {compute_config.instance_type}")
print(f"  Output: {output_s3_path}")

## 6. Run Training Job

In [None]:
# Define training input and start training
training_input = InputData(
    channel_name='train',
    data_source=train_s3_uri,
    content_type='text/csv'
)

print("="*60)
print("STARTING AMAZON SAGEMAKER AI TRAINING JOB")
print("="*60)
print(f"Timestamp: {datetime.now().isoformat()}")
print("\nThis will provision an ml.m5.large instance and run training.")
print("Please wait (typically 3-5 minutes)...\n")

model_trainer.train(
    input_data_config=[training_input],
    wait=True,
    logs=True
)

In [None]:
# Get training job results
sm_client = boto3.client('sagemaker')
latest_job = sm_client.list_training_jobs(
    SortBy='CreationTime',
    SortOrder='Descending',
    MaxResults=1,
    NameContains='gaslift'
)['TrainingJobSummaries'][0]

training_job_name = latest_job['TrainingJobName']
job_desc = sm_client.describe_training_job(TrainingJobName=training_job_name)
model_artifacts = job_desc['ModelArtifacts']['S3ModelArtifacts']

print("="*60)
print("TRAINING COMPLETE")
print("="*60)
print(f"Job: {training_job_name}")
print(f"Status: {latest_job['TrainingJobStatus']}")
print(f"Model: {model_artifacts}")

## 7. Download and Verify Model

In [None]:
# Download model artifacts
model_dir = Path('./models')
model_dir.mkdir(exist_ok=True)

model_s3_path = model_artifacts.replace(f's3://{bucket}/', '')
local_model_tar = model_dir / 'model.tar.gz'

print(f"Downloading model...")
s3_client.download_file(bucket, model_s3_path, str(local_model_tar))

with tarfile.open(local_model_tar, 'r:gz') as tar:
    tar.extractall(model_dir)

print(f"Model extracted to {model_dir}")
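
The download cell derives the object key by stripping the default bucket prefix from the artifact URI, which works here because the output path was configured in that same bucket. A minimal sketch of a more general approach is to parse the URI instead. The helper name `split_s3_uri` and the example bucket and job names are made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Example URI with made-up bucket and job names.
artifact_bucket, artifact_key = split_s3_uri(
    "s3://example-bucket/gas-lift-optimization/output/job-123/output/model.tar.gz"
)
print(artifact_bucket)
print(artifact_key)
```

The returned pair could then be passed to `s3_client.download_file(artifact_bucket, artifact_key, ...)` regardless of which bucket holds the artifact.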

In [None]:
# Load and verify model
import joblib
from sklearn.metrics import r2_score, mean_squared_error

model = joblib.load(model_dir / 'model.joblib')
with open(model_dir / 'feature_names.json', 'r') as f:
    trained_features = json.load(f)['feature_names']

print(f"Model: {type(model).__name__}")
print(f"Features: {len(trained_features)}")

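# Sanity check on the first 10,000 training rows.
# The model saw these rows during training, so the scores are optimistic
# and do not measure generalization.
X_val = clean_data[trained_features].head(10000)
y_val = clean_data['production'].head(10000)
y_pred = model.predict(X_val)

print("\nIn-sample results (first 10,000 training rows):")
print(f"  R2 Score: {r2_score(y_val, y_pred):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_val, y_pred)):.4f}")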
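
The check above scores the model on rows that were also uploaded for training. One possible way to obtain a held-out estimate is to reserve whole well files for evaluation, so that rows from the same file never appear on both sides. The sketch below shows the idea on toy data. The function name `split_by_group` and the 20% fraction are assumptions, and the held-out files would also need to be excluded from the CSV that is uploaded for training.

```python
import numpy as np


def split_by_group(group_ids, val_fraction=0.2, seed=42):
    """Return boolean (train_mask, val_mask) with whole groups held out."""
    group_ids = np.asarray(group_ids)
    unique_groups = np.unique(group_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_groups)
    # Hold out at least one group, otherwise roughly val_fraction of them.
    n_val = max(1, int(round(len(shuffled) * val_fraction)))
    val_mask = np.isin(group_ids, shuffled[:n_val])
    return ~val_mask, val_mask


# Toy example: 5 files with 4 rows each.
file_ids = np.repeat(np.arange(5), 4)
train_mask, val_mask = split_by_group(file_ids)
print(int(train_mask.sum()), int(val_mask.sum()))
```

In this notebook the group labels could come from `all_data['file_id']`, which is aligned row by row with `clean_data`.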

## 8. Deploy Model to Real-time Endpoint

Deploy the trained model to a SageMaker real-time endpoint for inference.

**Important**: The SKLearn container requires proper inference code bundled in the model artifact:
- Model files at root level (`model.joblib`, `feature_names.json`)
- Inference code in `code/` subdirectory (`inference.py`, `requirements.txt`)
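
The repository's `inference.py` is not reproduced in this notebook. As an illustration only, a script for the SKLearn serving container typically defines some of the handler functions `model_fn`, `input_fn`, `predict_fn`, and `output_fn`. The sketch below assumes JSON requests containing a list of feature rows, matching the endpoint test later in this section. The returned dictionary layout and the stand-in `_ConstantModel` class are hypothetical.

```python
import json
import os


def model_fn(model_dir):
    """Load the model and feature names from the extracted artifact."""
    import joblib  # imported lazily so the other handlers work without it

    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    with open(os.path.join(model_dir, "feature_names.json")) as f:
        feature_names = json.load(f)["feature_names"]
    return {"model": model, "feature_names": feature_names}


def input_fn(request_body, content_type):
    """Decode a JSON list of rows, e.g. [[f1, f2, ...], ...]."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)


def predict_fn(rows, loaded):
    """Run the loaded model on the decoded rows."""
    return [float(p) for p in loaded["model"].predict(rows)]


def output_fn(predictions, accept):
    """Encode predictions as a JSON list."""
    return json.dumps(predictions)


class _ConstantModel:
    """Stand-in for the trained model, used only for this local dry run."""

    def predict(self, rows):
        return [1.5 for _ in rows]


rows = input_fn("[[1, 2], [3, 4]]", "application/json")
loaded = {"model": _ConstantModel(), "feature_names": ["a", "b"]}
body = output_fn(predict_fn(rows, loaded), "application/json")
print(body)
```

Exercising the handlers locally with a stand-in model is a cheap way to catch serialization mistakes before creating an endpoint.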

In [None]:
# Point to inference code directory
inference_dir = Path('../inference').resolve()

print(f"Inference code directory: {inference_dir}")
for f in inference_dir.iterdir():
    print(f"  {f.name} ({f.stat().st_size} bytes)")

In [None]:
# Repack model with inference code for deployment
repack_dir = model_dir / 'repacked'
code_dir = repack_dir / 'code'
code_dir.mkdir(parents=True, exist_ok=True)

# Copy model files to root
shutil.copy(model_dir / 'model.joblib', repack_dir / 'model.joblib')
shutil.copy(model_dir / 'feature_names.json', repack_dir / 'feature_names.json')

# Copy inference code to code/ subdirectory
shutil.copy(inference_dir / 'inference.py', code_dir / 'inference.py')
shutil.copy(inference_dir / 'requirements.txt', code_dir / 'requirements.txt')

# Create new model tarball
repacked_tar = model_dir / 'model_deploy.tar.gz'
with tarfile.open(repacked_tar, 'w:gz') as tar:
    for item in repack_dir.iterdir():
        tar.add(item, arcname=item.name)

print(f"Created deployment package: {repacked_tar}")
print(f"Contents:")
with tarfile.open(repacked_tar, 'r:gz') as tar:
    for member in tar.getmembers():
        print(f"  ./{member.name}")
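
Printing the archive contents relies on a visual check. A small sketch of an automated check is shown below: it reports which expected paths are missing from a tarball. The helper name `missing_members` and the `REQUIRED_MEMBERS` set are assumptions based on the layout described in this section, and the demo builds a throwaway archive with placeholder files instead of using the real model.

```python
import io
import tarfile
import tempfile
from pathlib import Path

REQUIRED_MEMBERS = {"model.joblib", "feature_names.json", "code/inference.py"}


def missing_members(tar_path, required=REQUIRED_MEMBERS):
    """Return the required paths that are absent from the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        # Normalize names that were stored with a leading "./".
        names = {n[2:] if n.startswith("./") else n for n in tar.getnames()}
    return sorted(set(required) - names)


# Self-contained demo: an archive that lacks the inference code.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = Path(tmp) / "demo.tar.gz"
    with tarfile.open(demo_tar, "w:gz") as tar:
        for name in ("model.joblib", "feature_names.json"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    demo_missing = missing_members(demo_tar)

print(demo_missing)
```

Calling `missing_members(repacked_tar)` before the upload step would be expected to return an empty list for a correctly packed archive.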

In [None]:
# Upload repacked model to S3
model_s3_key = f"{prefix}/models/model_deploy.tar.gz"
deploy_model_s3_uri = f"s3://{bucket}/{model_s3_key}"

s3_client.upload_file(str(repacked_tar), bucket, model_s3_key)
print(f"Uploaded to: {deploy_model_s3_uri}")

In [None]:
# Deploy to real-time endpoint using SDK v3
from sagemaker.core.resources import Model as SageMakerModel
from sagemaker.core.shapes import ContainerDefinition

# Get inference container (note: 1.2-1 is latest for inference, 1.4-2 is training-only)
sklearn_inference_image = retrieve(
    framework='sklearn',
    region=region,
    version='1.2-1',  # Latest available for inference
    py_version='py3',
    image_scope='inference'
)

# Create unique names
timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
model_name = f'gaslift-mdl-{timestamp}'
endpoint_name = f'gaslift-ep-{timestamp}'

# Create container definition with required environment variables
container = ContainerDefinition(
    image=sklearn_inference_image,
    model_data_url=deploy_model_s3_uri,
    environment={
        'SAGEMAKER_PROGRAM': 'inference.py',
        'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/code',
    }
)

# Create model
sm_model = SageMakerModel.create(
    model_name=model_name,
    primary_container=container,
    execution_role_arn=role,
)
print(f"Created model: {model_name}")

In [None]:
# Create endpoint configuration and deploy
from sagemaker.core.resources import EndpointConfig, Endpoint
from sagemaker.core.shapes import ProductionVariant
import time

# Create endpoint config
endpoint_config = EndpointConfig.create(
    endpoint_config_name=endpoint_name,
    production_variants=[
        ProductionVariant(
            variant_name='AllTraffic',
            model_name=model_name,
            initial_instance_count=1,
            instance_type='ml.t2.medium',
            initial_variant_weight=1.0,
        )
    ]
)
print(f"Created endpoint config: {endpoint_name}")

# Create endpoint
endpoint = Endpoint.create(
    endpoint_name=endpoint_name,
    endpoint_config_name=endpoint_name,
)
print(f"Creating endpoint: {endpoint_name}")
print("Waiting for endpoint to be InService (typically 3-5 minutes)...")

# Wait for endpoint to be ready
while True:
    desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = desc['EndpointStatus']
    print(f"  {datetime.now().strftime('%H:%M:%S')} - Status: {status}")
    if status == 'InService':
        print("\nEndpoint is ready!")
        break
    elif status == 'Failed':
        # Surface the service-reported reason to make debugging easier
        reason = desc.get('FailureReason', 'no reason reported')
        raise RuntimeError(f"Endpoint deployment failed: {reason}")
    time.sleep(30)
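
The polling loop above runs until the endpoint reaches a terminal status and has no upper time limit. A minimal sketch of a bounded variant follows. The helper name `wait_for_status` and its defaults are hypothetical, and the demo uses a fake status source so it runs without AWS access. The boto3 waiter obtained with `sm_client.get_waiter('endpoint_in_service')` may be a simpler alternative.

```python
import time


def wait_for_status(get_status, target="InService", failed=("Failed",),
                    timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll get_status() until it returns target, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"Reached terminal status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_s} s")
        sleep(poll_s)


# Demo with a fake status source instead of a real endpoint.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_s=0)
print(final_status)
```

With a real endpoint, the status callable could be `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']`.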

In [None]:
# Test the deployed endpoint
runtime = boto3.client('sagemaker-runtime', region_name=region)

print(f"Testing endpoint: {endpoint_name}")
print(f"Features ({len(trained_features)}): {trained_features}")

# Create test samples using actual data
test_samples = clean_data[trained_features].sample(3, random_state=42).values.tolist()

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Accept='application/json',
    Body=json.dumps(test_samples)
)

predictions = json.loads(response['Body'].read().decode())
print(f"\nPredictions:")
for i, pred in enumerate(predictions):
    print(f"  Sample {i+1}: {pred:.4f} bbl/d")

print("\nEndpoint working correctly!")

In [None]:
# Clean up endpoint (uncomment to run)
# print("Cleaning up endpoint resources...")
# sm_client.delete_endpoint(EndpointName=endpoint_name)
# sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
# sm_client.delete_model(ModelName=model_name)
# print("Endpoint deleted")

## 9. Gas Lift Optimization

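This section shows how a production model can drive gas allocation across multiple wells. For simplicity, each well is represented by an analytic response curve rather than by calls to the trained model or the endpoint.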

In [None]:
from scipy.optimize import minimize

class Well:
    """Represents a gas lift well with its response curve."""
    def __init__(self, name, base_production, efficiency, max_response):
        self.name = name
        self.base_production = base_production
        self.efficiency = efficiency
        self.max_response = max_response
    
    def production(self, gas_rate):
        """Non-linear production response to gas injection."""
        return self.base_production + self.max_response * np.tanh(gas_rate / 1.5 * self.efficiency)


class GasLiftOptimizer:
    """Optimizes gas allocation across multiple wells."""
    def __init__(self, wells):
        self.wells = wells
    
    def total_production(self, gas_alloc):
        return sum(w.production(g) for w, g in zip(self.wells, gas_alloc))
    
    def optimize(self, total_gas, min_gas=0.3, max_gas=2.0):
        n = len(self.wells)
        bounds = [(min_gas, max_gas)] * n
        constraints = {'type': 'ineq', 'fun': lambda x: total_gas - sum(x)}
        x0 = np.array([total_gas / n] * n)
        
        result = minimize(lambda x: -self.total_production(x), x0, 
                         method='SLSQP', bounds=bounds, constraints=constraints)
        
        return {
            'baseline_alloc': x0,
            'optimal_alloc': result.x,
            'baseline_prod': self.total_production(x0),
            'optimal_prod': self.total_production(result.x),
            'improvement_pct': (self.total_production(result.x) - self.total_production(x0)) / 
                               self.total_production(x0) * 100
        }
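
The problem solved by `GasLiftOptimizer.optimize` above can be stated compactly. The symbols are notation introduced here: $g_i$ is the gas rate allocated to well $i$, $b_i$ its `base_production`, $r_i$ its `max_response`, $e_i$ its `efficiency`, and $G$ the `total_gas` budget.

$$
\max_{g_1,\dots,g_n}\; \sum_{i=1}^{n} \Big[\, b_i + r_i \tanh\!\Big(\frac{e_i\, g_i}{1.5}\Big) \Big]
\quad \text{subject to} \quad
\sum_{i=1}^{n} g_i \le G, \qquad g_{\min} \le g_i \le g_{\max}
$$

The code passes the negated objective to `minimize`, starts from the equal split $g_i = G/n$, and reports that equal split as the baseline allocation.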

In [None]:
# Create wells with different gas lift efficiencies
wells = [
    Well("WELL-A", 80, 0.3, 80),   # Poor responder
    Well("WELL-B", 70, 0.5, 100),  # Below average
    Well("WELL-C", 90, 1.0, 120),  # Average
    Well("WELL-D", 100, 1.5, 150), # Good responder
    Well("WELL-E", 85, 2.0, 180),  # Excellent responder
]

# Optimize gas allocation
optimizer = GasLiftOptimizer(wells)
result = optimizer.optimize(total_gas=6.0, min_gas=0.3, max_gas=2.0)

# Display results
print("="*70)
print("GAS LIFT OPTIMIZATION RESULTS")
print("="*70)
print(f"\n{'Well':<10} {'Efficiency':<12} {'Gas (Base)':<12} {'Gas (Opt)':<12} {'Change':<10}")
print("-"*60)

for i, well in enumerate(wells):
    gb, go = result['baseline_alloc'][i], result['optimal_alloc'][i]
    print(f"{well.name:<10} {well.efficiency:<12.1f} {gb:<12.2f} {go:<12.2f} {go-gb:+10.2f}")

print("-"*60)
print(f"\nBaseline Production: {result['baseline_prod']:.1f} bbl/d")
print(f"Optimized Production: {result['optimal_prod']:.1f} bbl/d")
print(f"\n>>> IMPROVEMENT: {result['improvement_pct']:+.1f}% <<<")

# Business value
oil_price = 70  # $/bbl
daily_gain = result['optimal_prod'] - result['baseline_prod']
annual_value = daily_gain * 365 * oil_price
print(f"\nAnnual Value: ${annual_value:,.0f}/year at ${oil_price}/bbl")
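
One way to sanity-check an allocation is to compare marginal gains. For a concave response curve, wells that are not pinned at a bound should end up with roughly equal marginal production per unit of gas when the budget is binding. The sketch below computes that derivative for the `tanh` curve used by the `Well` class. The function names are hypothetical, and the finite-difference version is only a cross-check.

```python
import math


def marginal_gain(gas_rate, efficiency, max_response):
    """Derivative of max_response * tanh(gas_rate * efficiency / 1.5)."""
    k = efficiency / 1.5
    return max_response * k / math.cosh(k * gas_rate) ** 2


def numeric_gain(gas_rate, efficiency, max_response, h=1e-6):
    """Central finite difference of the same curve, used as a cross-check."""
    def curve(g):
        return max_response * math.tanh(g / 1.5 * efficiency)

    return (curve(gas_rate + h) - curve(gas_rate - h)) / (2 * h)


# (efficiency, max_response) pairs copied from WELL-A and WELL-E above.
for name, eff, resp in [("WELL-A", 0.3, 80), ("WELL-E", 2.0, 180)]:
    print(name, round(marginal_gain(1.2, eff, resp), 2))
```

Evaluating `marginal_gain` at each entry of `result['optimal_alloc']` gives a quick view of which wells would still benefit from additional gas.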

## 10. Cleanup

In [None]:
print("="*60)
print("ARTIFACTS CREATED")
print("="*60)
print(f"\nAmazon S3:")
print(f"  Training data: {train_s3_uri}")
print(f"  Model artifacts: {model_artifacts}")
print(f"  Deploy model: {deploy_model_s3_uri}")
print(f"\nLocal:")
print(f"  Training code: {source_dir}")
print(f"  Model: {model_dir.absolute()}")
print(f"\nEndpoint:")
print(f"  Name: {endpoint_name}")
print(f"\nTo delete all S3 data:")
print(f"  aws s3 rm s3://{bucket}/{prefix} --recursive")

## Summary

This notebook demonstrated **SageMaker Python SDK v3** for training and deployment:

### SDK v3 Classes Used

| Class | Purpose | Import |
|-------|---------|--------|
| `ModelTrainer` | Training jobs | `sagemaker.train.model_trainer` |
| `SourceCode` | Script + dependencies | `sagemaker.train.configs` |
| `Compute` | Instance configuration | `sagemaker.train.configs` |
| `InputData` | Training data channels | `sagemaker.train.configs` |
| `OutputDataConfig` | Model output location | `sagemaker.train.configs` |
| `Model` | Model resource | `sagemaker.core.resources` |
| `EndpointConfig` | Endpoint configuration | `sagemaker.core.resources` |
| `Endpoint` | Real-time endpoint | `sagemaker.core.resources` |
| `retrieve` | Container image URIs | `sagemaker.core.image_uris` |

### Key Takeaways

1. **Unified Training API** - `ModelTrainer` replaces framework-specific estimators
2. **Custom Dependencies** - Use `requirements.txt` with SKLearn container
3. **Model Packaging for Deployment** - SKLearn containers require:
   - Model files at root (`model.joblib`)
   - Inference code in `code/` subdirectory (`inference.py`, `requirements.txt`)
   - Environment variables: `SAGEMAKER_PROGRAM`, `SAGEMAKER_SUBMIT_DIRECTORY`

### Repository Structure

```
├── training/
│   ├── train.py           # Training script
│   └── requirements.txt   # Training dependencies
└── inference/
    ├── inference.py       # Inference handlers
    └── requirements.txt   # Inference dependencies
```

### Next Steps

- **Batch Inference**: Use SageMaker batch transform for large-scale offline predictions
- **MLOps**: Use [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html) for automation
- **Model Registry**: Track model versions with SageMaker Model Registry