# MMDetection Training on SageMaker with Public Datasets

This notebook demonstrates how to train MMDetection models on Amazon SageMaker using public datasets. We'll cover:

1. Building a custom training container
2. Training with different public datasets (COCO sample, Balloon, Pascal VOC)
3. Distributed training setup
4. Model deployment

## Prerequisites

- AWS CLI configured with appropriate permissions
- Docker installed and running
- SageMaker execution role with ECR and S3 permissions
- A `Dockerfile.training` in the working directory (used by the build step in section 1)

In [None]:
import sagemaker
import boto3
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
import os
import time

# Initialize SageMaker session
session = sagemaker.Session()
region = session.boto_region_name
account = boto3.client('sts').get_caller_identity().get('Account')
bucket = session.default_bucket()
role = sagemaker.get_execution_role()

print(f"Region: {region}")
print(f"Account: {account}")
print(f"Bucket: {bucket}")
print(f"Role: {role}")

## 1. Build and Push Training Container

In [None]:
# Container configuration
container_name = "mmdetection-training"
tag = "latest"
image_uri = f"{account}.dkr.ecr.{region}.amazonaws.com/{container_name}:{tag}"

print(f"Container image URI: {image_uri}")
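
The URI above follows the ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The following is a minimal sketch of a helper that builds it, assuming the only special case you need is the China partition, whose domain ends in `.amazonaws.com.cn`. The name `build_ecr_image_uri` is hypothetical and not part of the SageMaker SDK.

```python
def build_ecr_image_uri(account: str, region: str, repository: str, tag: str = "latest") -> str:
    """Return the ECR image URI for a repository (hypothetical helper)."""
    # China regions (cn-north-1, cn-northwest-1) use a different domain suffix.
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{suffix}/{repository}:{tag}"


# Placeholder account ID for illustration only.
print(build_ecr_image_uri("123456789012", "us-east-1", "mmdetection-training"))
```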

In [None]:
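# Log in to ECR twice: first to the AWS Deep Learning Containers registry
# (presumably needed to pull the base image), then to your own registry (to push)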
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

In [None]:
# Create ECR repository if it doesn't exist
!aws ecr describe-repositories --repository-names {container_name} --region {region} || aws ecr create-repository --repository-name {container_name} --region {region}

In [None]:
# Build and push the container
!docker build -t {container_name}:{tag} -f Dockerfile.training .
!docker tag {container_name}:{tag} {image_uri}
!docker push {image_uri}

## 2. Training Configuration

### Available Public Datasets:
- `coco_sample`: Small COCO dataset sample for quick testing
- `balloon`: Balloon dataset from the Matterport Mask R-CNN tutorial
- `voc2007`: Pascal VOC 2007 dataset

### Available Models:
- Faster R-CNN
- Mask R-CNN
- RetinaNet
- FCOS
- YOLO series

In [None]:
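# Training hyperparameters (SageMaker passes every value to the container as a string)
hyperparameters = {
    # Path relative to MMDetection's configs/ directory. MMDetection 3.x renamed
    # many configs (e.g. faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py), so use the
    # name that matches the version installed in your container.
    'config-file': 'faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py',
    'dataset': 'coco_sample',  # Dataset type
    'download-dataset': True,  # Automatically download public dataset
    'auto-scale': True,  # Auto-scale learning rate
    'validate': True,  # Run validation during training
    'options': 'train_cfg.max_epochs=5; default_hooks.checkpoint.interval=2'  # Config overrides
}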

print("Training hyperparameters:")
for key, value in hyperparameters.items():
    print(f"  {key}: {value}")
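
SageMaker hands hyperparameters to a custom container as strings, so the training entry point has to convert them back into booleans, numbers, and nested options. The notebook does not show how this container does that. The following is a sketch of one possible approach, assuming the `key=value; key=value` format seen in the `options` value above. `parse_bool` and `parse_options` are hypothetical names.

```python
import ast


def parse_bool(value) -> bool:
    """Interpret a stringified flag such as 'True', 'true' or '1'."""
    return str(value).strip().lower() in {"true", "1", "yes"}


def parse_options(options: str) -> dict:
    """Split 'a.b=1; c.d=x' into {'a.b': 1, 'c.d': 'x'} (assumed format)."""
    parsed = {}
    for item in options.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled separators
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got: {item!r}")
        try:
            # Turn numbers and other Python literals into real values.
            value = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError):
            value = raw.strip()  # keep plain strings unchanged
        parsed[key.strip()] = value
    return parsed


print(parse_options("train_cfg.max_epochs=5; default_hooks.checkpoint.interval=2"))
print(parse_bool("True"))
```

Parsing values with `ast.literal_eval` rather than `eval` keeps the entry point from executing arbitrary expressions supplied through hyperparameters.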

## 3. Single Instance Training

In [None]:
# Create estimator for single instance training
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.g4dn.xlarge',  # GPU instance
    hyperparameters=hyperparameters,
    output_path=f's3://{bucket}/mmdetection-output',
    base_job_name='mmdetection-single',
    use_spot_instances=True,  # Use managed spot training to reduce cost
    max_wait=7200,  # Seconds: total limit for spot wait plus training (must be >= max_run)
    max_run=3600,   # Seconds: maximum training time
)

print("Single instance estimator created")
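
With managed spot training, `max_wait` covers both the time spent waiting for spot capacity and the training time itself, so it must be at least `max_run`. The following is a small sketch of a pre-flight check for that constraint; `check_spot_limits` is a hypothetical helper, not an SDK function.

```python
def check_spot_limits(max_run: int, max_wait: int) -> int:
    """Validate spot limits and return the headroom (seconds) beyond max_run."""
    if max_run <= 0:
        raise ValueError("max_run must be a positive number of seconds")
    if max_wait < max_run:
        raise ValueError(f"max_wait ({max_wait}) must be >= max_run ({max_run})")
    return max_wait - max_run


headroom = check_spot_limits(max_run=3600, max_wait=7200)
print(f"Headroom for spot waiting and restarts: {headroom} s")
```

A spot interruption restarts the job. Unless the estimator sets `checkpoint_s3_uri` and the container resumes from the saved checkpoints, training starts again from scratch.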

In [None]:
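# Start training (no input channels needed because the container downloads the public dataset)
job_name = f"mmdetection-single-{int(time.time())}"
print(f"Starting training job '{job_name}'")
estimator.fit(job_name=job_name)  # Blocks and streams logs until the job ends (wait=True by default)

print(f"Training job '{job_name}' finished")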

## 4. Distributed Training

In [None]:
# Update hyperparameters for distributed training
distributed_hyperparameters = hyperparameters.copy()
distributed_hyperparameters.update({
    'config-file': 'mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py',  # Use Mask R-CNN for distributed training
    'options': 'train_cfg.max_epochs=10; default_hooks.checkpoint.interval=5'  # Longer training
})

# Create distributed estimator
distributed_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=2,  # Use 2 instances
    instance_type='ml.g4dn.xlarge',
    hyperparameters=distributed_hyperparameters,
    output_path=f's3://{bucket}/mmdetection-distributed-output',
    base_job_name='mmdetection-distributed',
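    # Request PyTorch DDP. This option is documented for the PyTorch framework
    # estimator; confirm that your SDK version and custom container honor it.
    distribution={'pytorchddp': {'enabled': True}},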
    use_spot_instances=True,
    max_wait=7200,
    max_run=5400,
)

print("Distributed estimator created")
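
The hyperparameters keep `auto-scale` enabled, which matters once the number of GPUs changes. MMDetection's automatic learning-rate scaling follows the linear scaling rule relative to a base batch size (16 in the stock COCO configs). How this container maps `auto-scale` onto that mechanism is not shown in the notebook, so treat the following as a sketch of the arithmetic only. `scaled_learning_rate` is a hypothetical helper, and the base learning rate of 0.02 is the stock Faster R-CNN 1x value.

```python
def scaled_learning_rate(base_lr: float, num_gpus: int, samples_per_gpu: int,
                         base_batch_size: int = 16) -> float:
    """Linear scaling rule: the learning rate grows in proportion to the total batch size."""
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size


# Two ml.g4dn.xlarge instances provide one GPU each, so num_gpus is 2.
lr = scaled_learning_rate(base_lr=0.02, num_gpus=2, samples_per_gpu=2)
print(f"Scaled learning rate: {lr}")
```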

In [None]:
# Start distributed training
distributed_job_name = f"mmdetection-distributed-{int(time.time())}"
print(f"Starting distributed training job '{distributed_job_name}'")
distributed_estimator.fit(job_name=distributed_job_name)  # Blocks until the job ends

print(f"Distributed training job '{distributed_job_name}' finished")

## 5. Training with Different Datasets and Models

In [None]:
# Example configurations for different models and datasets
training_configs = [
    {
        'name': 'retinanet-balloon',
        'config': {
            'config-file': 'retinanet/retinanet_r50_fpn_1x_coco.py',
            'dataset': 'balloon',
            'download-dataset': True,
            'options': 'train_cfg.max_epochs=20; default_hooks.checkpoint.interval=5'
        }
    },
    {
        'name': 'fcos-voc2007',
        'config': {
            'config-file': 'fcos/fcos_r50_caffe_fpn_gn-head_1x_coco.py',
            'dataset': 'voc2007',
            'download-dataset': True,
            'options': 'train_cfg.max_epochs=15; default_hooks.checkpoint.interval=3'
        }
    },
    {
        'name': 'yolox-coco-sample',
        'config': {
            'config-file': 'yolox/yolox_s_8x8_300e_coco.py',
            'dataset': 'coco_sample',
            'download-dataset': True,
            'options': 'train_cfg.max_epochs=50; default_hooks.checkpoint.interval=10'
        }
    }
]

print("Available training configurations:")
for i, config in enumerate(training_configs):
    print(f"{i+1}. {config['name']}")
    for key, value in config['config'].items():
        print(f"   {key}: {value}")
    print()

In [None]:
# Function to run training with specific configuration
def run_training_experiment(config_name, config_params, instance_type='ml.g4dn.xlarge'):
    """
    Run a single-instance spot training job with the given hyperparameters.
    Blocks until the job ends, then returns (estimator, job_name).
    """
    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type=instance_type,
        hyperparameters=config_params,
        output_path=f's3://{bucket}/mmdetection-experiments/{config_name}',
        base_job_name=f'mmdet-{config_name}',
        use_spot_instances=True,
        max_wait=7200,
        max_run=3600,
    )
    
    job_name = f"mmdet-{config_name}-{int(time.time())}"
    estimator.fit(job_name=job_name)
    
    return estimator, job_name

# Example: Run RetinaNet training on balloon dataset
# estimator, job_name = run_training_experiment(
#     'retinanet-balloon', 
#     training_configs[0]['config']
# )
# print(f"Started experiment: {job_name}")
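
Because `run_training_experiment` builds the job name from `config_name`, a name containing underscores or spaces would be rejected: SageMaker training job names allow only letters, digits, and hyphens, up to 63 characters. The following is a sketch of a sanitizing helper under those constraints; `make_job_name` is a hypothetical function.

```python
import re
import time


def make_job_name(prefix: str, config_name: str, max_length: int = 63) -> str:
    """Build a job name from a config name, replacing unsupported characters."""
    suffix = str(int(time.time()))
    # Replace every run of characters other than letters and digits with a hyphen.
    stem = re.sub(r"[^a-zA-Z0-9]+", "-", f"{prefix}-{config_name}").strip("-")
    if not stem:
        raise ValueError("prefix and config_name contain no usable characters")
    # Leave room for the hyphen and timestamp, and avoid a doubled hyphen.
    stem = stem[: max_length - len(suffix) - 1].rstrip("-")
    return f"{stem}-{suffix}"


print(make_job_name("mmdet", "yolox_coco sample"))
```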

## 6. Monitor Training Progress

In [None]:
# Function to check a training job
def monitor_training_job(job_name):
    """
    Print the current status and key details of a training job.
    This is a one-time status check; it does not stream logs.
    """
    sm_client = boto3.client('sagemaker')
    
    try:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response['TrainingJobStatus']
        
        print(f"Job Name: {job_name}")
        print(f"Status: {status}")
        print(f"Instance Type: {response['ResourceConfig']['InstanceType']}")
        print(f"Instance Count: {response['ResourceConfig']['InstanceCount']}")
        
        if 'TrainingStartTime' in response:
            print(f"Start Time: {response['TrainingStartTime']}")
        
        if status == 'Completed':
            print(f"End Time: {response['TrainingEndTime']}")
            print(f"Training Time: {response['TrainingTimeInSeconds']} seconds")
            print(f"Model Artifacts: {response['ModelArtifacts']['S3ModelArtifacts']}")
        elif status == 'Failed':
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            
    except Exception as e:
        print(f"Error monitoring job: {e}")

# Example usage:
# monitor_training_job('your-job-name-here')
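
`monitor_training_job` reports the status once. To follow a job until it ends, the same `describe_training_job` call can be polled in a loop. The following is a minimal sketch; `wait_for_training_job` is a hypothetical helper that takes the describe function as a parameter so it can be exercised without AWS access, and the default intervals are arbitrary choices.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=60, timeout_seconds=7200,
                          sleep=time.sleep):
    """Poll describe(TrainingJobName=...) until the job reaches a terminal status."""
    waited = 0
    while True:
        response = describe(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        secondary = response.get("SecondaryStatus", "n/a")
        print(f"{job_name}: {status} ({secondary})")
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{job_name} still {status} after {waited} s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage:
# sm_client = boto3.client('sagemaker')
# wait_for_training_job(sm_client.describe_training_job, 'your-job-name-here')
```

boto3 also ships a built-in waiter for this, `sm_client.get_waiter('training_job_completed_or_stopped')`, which may be preferable when no progress output is needed.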

## 7. Model Deployment (Optional)

In [None]:
# Create inference container (you would need to create this separately)
# This is a placeholder for model deployment

def deploy_model(estimator, instance_type='ml.m5.large'):
    """
    Deploy the trained model to a SageMaker endpoint.
    Note: by default deploy() reuses the estimator's training image, so this
    only works once a separate inference container is available.
    """
    try:
        predictor = estimator.deploy(
            initial_instance_count=1,
            instance_type=instance_type,
            endpoint_name=f'mmdetection-endpoint-{int(time.time())}'
        )
        return predictor
    except Exception as e:
        print(f"Deployment failed: {e}")
        print("Note: You need to create a separate inference container for deployment")
        return None

# Example:
# predictor = deploy_model(estimator)
# if predictor:
#     print(f"Model deployed to endpoint: {predictor.endpoint_name}")
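
An inference container would first need to locate a checkpoint inside the `model.tar.gz` artifact produced by training. The following is a sketch of that step, assuming the training container writes MMEngine-style epoch checkpoints named `epoch_<N>.pth` into the model directory. Neither that layout nor the helper name `latest_checkpoint` is confirmed by this notebook.

```python
import re
import tarfile


def latest_checkpoint(model_tar_path: str):
    """Return the archive member name of the highest-epoch checkpoint, or None."""
    pattern = re.compile(r"epoch_(\d+)\.pth$")
    best_epoch, best_name = -1, None
    with tarfile.open(model_tar_path, "r:gz") as archive:
        for name in archive.getnames():
            match = pattern.search(name)
            # Compare epochs numerically so epoch_10 ranks above epoch_2.
            if match and int(match.group(1)) > best_epoch:
                best_epoch, best_name = int(match.group(1)), name
    return best_name


# Example usage (after downloading the artifact locally):
# print(latest_checkpoint('model.tar.gz'))
```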

## 8. Cleanup Resources

In [None]:
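# Function to delete an endpoint and list remaining MMDetection endpoints
def cleanup_resources(endpoint_name=None):
    """
    Delete the given endpoint (if any) and list endpoints whose name contains
    'mmdetection'. Endpoint configurations and models are not deleted.
    """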
    sm_client = boto3.client('sagemaker')
    
    if endpoint_name:
        try:
            sm_client.delete_endpoint(EndpointName=endpoint_name)
            print(f"Deleted endpoint: {endpoint_name}")
        except Exception as e:
            print(f"Error deleting endpoint: {e}")
    
    # List remaining MMDetection endpoints (nothing else is deleted here)
    print("\nActive endpoints:")
    try:
        endpoints = sm_client.list_endpoints()['Endpoints']
        for endpoint in endpoints:
            if 'mmdetection' in endpoint['EndpointName'].lower():
                print(f"  - {endpoint['EndpointName']} ({endpoint['EndpointStatus']})")
    except Exception as e:
        print(f"Error listing endpoints: {e}")

# Example:
# cleanup_resources()
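
`cleanup_resources` deletes only the endpoint; the endpoint configuration and the model behind it remain in the account. The following is a sketch of a fuller teardown using the corresponding boto3 SageMaker calls. `delete_endpoint_stack` is a hypothetical helper, and it assumes each production variant references a model by name.

```python
def delete_endpoint_stack(sm_client, endpoint_name: str) -> dict:
    """Delete an endpoint, its endpoint config, and its models; return what was deleted."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    # Collect model names before deleting anything that references them.
    model_names = [v["ModelName"] for v in config["ProductionVariants"] if v.get("ModelName")]

    # Delete in dependency order: endpoint, then its config, then the models.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}


# Example usage:
# delete_endpoint_stack(boto3.client('sagemaker'), 'your-endpoint-name-here')
```

When a `Predictor` object is still at hand, its `delete_endpoint()` and `delete_model()` methods cover the same ground with less code.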

## Summary

This notebook demonstrated how to:

1. **Build a custom MMDetection training container** with support for multiple datasets
2. **Train models using public datasets** without manual data preparation
3. **Run both single-instance and distributed training**
4. **Configure different model architectures** (Faster R-CNN, Mask R-CNN, RetinaNet, etc.)
5. **Monitor training progress** and manage resources

### Key Features:
- **Automatic dataset download**: No need to manually prepare datasets
- **Multiple dataset support**: COCO sample, Balloon, Pascal VOC 2007
- **Flexible model configuration**: Easy to switch between different architectures
- **Cost optimization**: Uses spot instances and configurable training duration
- **Distributed training**: Supports multi-instance training to reduce wall-clock training time

### Next Steps:
1. Create an inference container for model deployment
2. Add support for custom datasets
3. Implement hyperparameter tuning
4. Add model evaluation and metrics tracking
5. Set up MLOps pipeline for continuous training