# SageMaker HyperPod Cluster Creation - SDK Experience

This notebook demonstrates the end-to-end workflow for creating a SageMaker HyperPod cluster with the HyperPod SDK's `HpClusterStack` class. The SDK provides programmatic control over the cluster lifecycle: creation, monitoring, updates, and cleanup.

## Prerequisites

- AWS CLI configured with appropriate permissions
- SageMaker HyperPod SDK installed (`pip install sagemaker-hyperpod`)
- SageMaker Core SDK installed (`pip install sagemaker-core`)
- Python 3.8+ environment

## Workflow Overview

1. **Initialize** - Create HpClusterStack instance with configuration
2. **Configure** - Set cluster settings and tags programmatically
3. **Create** - Deploy the cluster infrastructure
4. **Monitor** - Check cluster status and manage lifecycle

## Step 1: Import Required Libraries and Initialize Configuration

First, we'll import the necessary SDK components and create an HpClusterStack instance with default settings. This is equivalent to `hyp init cluster-stack` in the CLI.

**What this does:**
- Imports HpClusterStack and related classes
- Creates cluster configuration with default settings
- Sets up basic infrastructure components (VPC, EKS, S3, etc.)
- Generates unique resource names to avoid conflicts

In [None]:
import uuid
import time
from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack
from sagemaker_core.main.resources import Cluster

# Generate unique resource prefix to avoid conflicts
resource_prefix = f"hyperpod-sdk-{str(uuid.uuid4())[:8]}"

# Initialize cluster stack configuration (equivalent to hyp init cluster-stack)
cluster_stack = HpClusterStack(
    stage="prod",
    resource_name_prefix=resource_prefix,
    hyperpod_cluster_name=f"{resource_prefix}-cluster",
    eks_cluster_name=f"{resource_prefix}-eks",
    s3_bucket_name=f"{resource_prefix}-s3-bucket",
    sagemaker_iam_role_name=f"{resource_prefix}-iam-role",
    
    # Infrastructure components to create
    create_vpc_stack=True,
    create_security_group_stack=True,
    create_eks_cluster_stack=True,
    create_s3_bucket_stack=True,
    create_s3_endpoint_stack=True,
    create_life_cycle_script_stack=True,
    create_sagemaker_iam_role_stack=True,
    create_helm_chart_stack=True,
    create_hyperpod_cluster_stack=True,
    create_fsx_stack=True,
    
    # Network configuration
    vpc_cidr="10.192.0.0/16",
    availability_zone_ids=["use2-az1", "use2-az2", "use2-az3"],
    
    # Kubernetes configuration
    kubernetes_version="1.31",
    node_provisioning_mode="Continuous",
    
    # Instance group configuration
    instance_group_settings=[
        {
            "InstanceCount": 1,
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.t3.medium",
            "TargetAvailabilityZoneId": "use2-az2",
            "ThreadsPerCore": 1,
            "InstanceStorageConfigs": [
                {"EbsVolumeConfig": {"VolumeSizeInGB": 500}}
            ]
        }
    ]
)

print(f"Initialized cluster stack with prefix: {resource_prefix}")
print(f"Cluster name: {cluster_stack.hyperpod_cluster_name}")
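
The generated prefix feeds directly into the S3 bucket name, so it can help to check the result before deploying. The following is a minimal sketch, not part of the HyperPod SDK: `make_resource_prefix` and `is_valid_bucket_name` are hypothetical helpers, and the regular expression covers only the basic S3 naming rules (3-63 characters, lowercase letters, digits, dots, and hyphens, starting and ending with a letter or digit).

```python
import re
import uuid

# Simplified S3 bucket naming rule (assumption: only the basic constraints are checked;
# rules such as "no adjacent periods" are not covered here).
_S3_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_resource_prefix(base: str = "hyperpod-sdk") -> str:
    """Hypothetical helper: append 8 random hex characters to a base name."""
    return f"{base}-{uuid.uuid4().hex[:8]}"


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified S3 naming check."""
    return bool(_S3_BUCKET_RE.match(name))


prefix = make_resource_prefix()
bucket_name = f"{prefix}-s3-bucket"
print(bucket_name, is_valid_bucket_name(bucket_name))
```

A failed check here is cheaper to fix than a CloudFormation rollback halfway through deployment.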

## Step 2: Configure Cluster Settings and Tags

Configure the cluster with custom tags and additional settings. This is equivalent to `hyp configure --tags []` in the CLI.

**Key configuration options:**
- **Tags**: For resource organization and cost tracking
- **Instance Groups**: Define compute resources and their specifications
- **Networking**: VPC, subnets, and security group settings
- **Storage**: FSx and EBS volume configurations

In [None]:
# Configure cluster with custom tags (equivalent to hyp configure --tags)
cluster_tags = [
    {"Key": "Environment", "Value": "Development"},
    {"Key": "Project", "Value": "MLTraining"},
    {"Key": "Owner", "Value": "DataScienceTeam"},
    {"Key": "CostCenter", "Value": "ML-Research"},
    {"Key": "CreatedBy", "Value": "SDK-Example"}
]

# Update cluster stack with tags
cluster_stack.tags = cluster_tags

# Additional configuration options
cluster_stack.node_recovery = "Automatic"
cluster_stack.fsx_availability_zone_id = "use2-az2"
cluster_stack.storage_capacity = 1200
cluster_stack.per_unit_storage_throughput = 250

print("Configured cluster with custom tags:")
for tag in cluster_tags:
    print(f"  {tag['Key']}: {tag['Value']}")

print(f"\nNode recovery: {cluster_stack.node_recovery}")
print(f"FSx storage capacity: {cluster_stack.storage_capacity} GiB")
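
Tags are often easier to maintain as a plain dictionary. The sketch below converts a dictionary into the `Key`/`Value` list form used above and checks the general AWS tag limits (at most 50 tags per resource, keys up to 128 characters, values up to 256 characters, no `aws:` prefix). `build_tags` is a hypothetical helper, not an SDK function.

```python
def build_tags(tag_map):
    """Convert {"Key": "Value"} pairs into [{"Key": ..., "Value": ...}] with basic checks."""
    if len(tag_map) > 50:
        raise ValueError("AWS allows at most 50 tags per resource")
    tags = []
    for key, value in tag_map.items():
        if not 1 <= len(key) <= 128:
            raise ValueError(f"Tag key must be 1-128 characters: {key!r}")
        if len(value) > 256:
            raise ValueError(f"Tag value exceeds 256 characters for key {key!r}")
        if key.lower().startswith("aws:"):
            raise ValueError(f"The 'aws:' prefix is reserved: {key!r}")
        tags.append({"Key": key, "Value": value})
    return tags


example_tags = build_tags({"Environment": "Development", "Project": "MLTraining"})
print(example_tags)
```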

### View Current Configuration

Let's examine the current configuration to understand what will be deployed:

In [None]:
# Display current configuration details
print("=== Cluster Configuration ===")
print(f"Resource Prefix: {cluster_stack.resource_name_prefix}")
print(f"HyperPod Cluster: {cluster_stack.hyperpod_cluster_name}")
print(f"EKS Cluster: {cluster_stack.eks_cluster_name}")
print(f"S3 Bucket: {cluster_stack.s3_bucket_name}")
print(f"VPC CIDR: {cluster_stack.vpc_cidr}")
print(f"Kubernetes Version: {cluster_stack.kubernetes_version}")
print(f"\nInstance Groups:")
for ig in cluster_stack.instance_group_settings:
    print(f"  - {ig['InstanceGroupName']}: {ig['InstanceCount']}x {ig['InstanceType']}")
print(f"\nInfrastructure Components:")
print(f"  VPC Stack: {cluster_stack.create_vpc_stack}")
print(f"  EKS Stack: {cluster_stack.create_eks_cluster_stack}")
print(f"  HyperPod Stack: {cluster_stack.create_hyperpod_cluster_stack}")
print(f"  FSx Stack: {cluster_stack.create_fsx_stack}")

## Step 3: Create the Cluster

Deploy the HyperPod cluster infrastructure using the SDK. This is equivalent to `hyp create` in the CLI.

**Deployment includes:**
- VPC and networking infrastructure
- EKS cluster with managed node groups
- SageMaker HyperPod cluster
- IAM roles and policies
- S3 buckets for artifacts
- FSx file system (if configured)

**Note:** This process typically takes 15-30 minutes to complete.

In [None]:
# Create the HyperPod cluster (equivalent to hyp create)
try:
    print("Starting cluster creation...")
    print(f"This will create cluster: {cluster_stack.hyperpod_cluster_name}")
    
    # Deploy the cluster infrastructure
    response = cluster_stack.create(region="us-east-2")
    
    print("\n✅ Cluster creation initiated successfully!")
    print(f"Stack Name: {cluster_stack.stack_name}")
    print(f"Stack ID: {cluster_stack.stack_id}")
    
    # Store cluster information for later use
    cluster_name = cluster_stack.hyperpod_cluster_name
    stack_name = cluster_stack.stack_name
    
    print("\nCluster creation is in progress. This may take 15-30 minutes.")
    print("Monitor progress in the next steps.")
    
except Exception as e:
    print(f"\n❌ Cluster creation failed: {str(e)}")
    raise

## Step 4: Monitor Cluster Creation

Monitor the cluster creation progress using SDK methods. The function below polls the stack status at a fixed interval and stops when the stack reaches a terminal state or the maximum number of checks is exhausted.

In [None]:
# Monitor cluster creation progress
def monitor_cluster_creation(stack_name, max_checks=30, interval=120, region="us-east-2"):
    """Poll the stack status until it reaches a terminal state or max_checks is exhausted."""
    print(f"Monitoring cluster creation progress for stack: {stack_name}")
    
    status = None  # Stays None if the first status check raises
    for i in range(max_checks):
        try:
            print(f"\n--- Status Check {i+1}/{max_checks} ---")
            
            # Check stack status
            status = HpClusterStack.check_status(stack_name, region=region)
            print(f"Stack Status: {status}")
            
            # Check if creation is complete
            if status == "CREATE_COMPLETE":
                print("\n🎉 Cluster creation completed successfully!")
                break
            elif status in ["CREATE_FAILED", "ROLLBACK_COMPLETE", "DELETE_COMPLETE"]:
                print(f"\n❌ Cluster creation failed with status: {status}")
                break
            elif status == "CREATE_IN_PROGRESS":
                print("⏳ Cluster creation still in progress...")
            
            if i < max_checks - 1:  # Don't sleep on the last iteration
                print(f"Waiting {interval} seconds before next check...")
                time.sleep(interval)
                
        except Exception as e:
            print(f"Error checking status: {str(e)}")
            break
    
    return status

# Start monitoring (uncomment when cluster creation is initiated)
# final_status = monitor_cluster_creation(stack_name, max_checks=5, interval=30)
print("Monitoring function ready. Uncomment to start monitoring after cluster creation.")
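
The polling logic can also be separated from the SDK call so that it is testable without a live stack. This is a sketch under that assumption: `wait_for_stack` is a hypothetical helper that takes any zero-argument function returning a status string, and the terminal-state sets extend the ones used above with `ROLLBACK_FAILED`.

```python
import time

TERMINAL_SUCCESS = {"CREATE_COMPLETE"}
TERMINAL_FAILURE = {"CREATE_FAILED", "ROLLBACK_COMPLETE", "ROLLBACK_FAILED", "DELETE_COMPLETE"}


def wait_for_stack(get_status, max_checks=30, interval=120, sleep=time.sleep):
    """Poll get_status() until a terminal state; return (succeeded, last_status)."""
    status = None
    for attempt in range(max_checks):
        status = get_status()
        if status in TERMINAL_SUCCESS:
            return True, status
        if status in TERMINAL_FAILURE:
            return False, status
        if attempt < max_checks - 1:  # No need to sleep after the final check
            sleep(interval)
    return False, status  # Ran out of checks before a terminal state


# Simulated status sequence standing in for real status checks
fake_statuses = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
ok, final = wait_for_stack(lambda: next(fake_statuses), max_checks=5, interval=0)
print(ok, final)
```

In the notebook, `get_status` could be `lambda: HpClusterStack.check_status(stack_name, region="us-east-2")`.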

## Step 5: Describe Cluster Stack

Get detailed information about the deployed cluster using SDK methods. This is equivalent to `hyp describe cluster-stack` in the CLI.

**Information provided:**
- Cluster status and health
- Resource ARNs and IDs
- Network configuration details
- Instance group information
- Storage configuration

In [None]:
# Get detailed information about the cluster stack (equivalent to hyp describe cluster-stack)
def describe_cluster_stack(stack_name, region="us-east-2"):
    """Describe cluster stack details"""
    try:
        print(f"Describing cluster stack: {stack_name}")
        
        # Get stack description
        response = HpClusterStack.describe(stack_name, region=region)
        
        if response and 'Stacks' in response and len(response['Stacks']) > 0:
            stack = response['Stacks'][0]
            
            print("\n=== Stack Information ===")
            print(f"Stack Name: {stack.get('StackName', 'N/A')}")
            print(f"Stack Status: {stack.get('StackStatus', 'N/A')}")
            print(f"Creation Time: {stack.get('CreationTime', 'N/A')}")
            print(f"Stack ID: {stack.get('StackId', 'N/A')}")
            
            # Display parameters
            if 'Parameters' in stack:
                print("\n=== Parameters ===")
                for param in stack['Parameters'][:10]:  # Show first 10 parameters
                    print(f"  {param['ParameterKey']}: {param['ParameterValue']}")
            
            # Display outputs
            if 'Outputs' in stack:
                print("\n=== Outputs ===")
                for output in stack['Outputs'][:10]:  # Show first 10 outputs
                    print(f"  {output['OutputKey']}: {output['OutputValue']}")
            
            # Display tags
            if 'Tags' in stack:
                print("\n=== Tags ===")
                for tag in stack['Tags']:
                    print(f"  {tag['Key']}: {tag['Value']}")
        
        return response
        
    except Exception as e:
        print(f"Error describing stack: {str(e)}")
        return None

# Describe the cluster stack (uncomment when stack exists)
# describe_cluster_stack(stack_name)
print("Describe function ready. Use after cluster creation is complete.")
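
Stack outputs are usually the part of the description that later steps need. The sketch below flattens them into a dictionary, assuming the response has the `Stacks`/`Outputs` shape that `describe_cluster_stack` reads above. `stack_outputs_to_dict` is a hypothetical helper, and the output key and value in the sample are made up for illustration.

```python
def stack_outputs_to_dict(response):
    """Return {OutputKey: OutputValue} for the first stack in a describe response."""
    stacks = (response or {}).get("Stacks", [])
    if not stacks:
        return {}
    return {o["OutputKey"]: o["OutputValue"] for o in stacks[0].get("Outputs", [])}


# Sample response with placeholder values (not real stack output)
sample_response = {
    "Stacks": [
        {
            "StackName": "example-stack",
            "Outputs": [{"OutputKey": "ExampleVpcId", "OutputValue": "vpc-0123"}],
        }
    ]
}
outputs = stack_outputs_to_dict(sample_response)
print(outputs)
```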

## Step 6: List All Cluster Stacks

List all HyperPod cluster stacks in your account using SDK methods. This is equivalent to `hyp list cluster-stack` in the CLI.

**Displays:**
- All cluster stacks in the current region
- Stack names and creation timestamps
- Current status of each stack
- HyperPod-related stacks, filtered by stack name

In [None]:
# List all cluster stacks (equivalent to hyp list cluster-stack)
def list_cluster_stacks(region="us-east-2"):
    """List all cluster stacks in the account"""
    try:
        print(f"Listing cluster stacks in region: {region}")
        
        # Get list of stacks
        response = HpClusterStack.list(region=region)
        
        if response and 'StackSummaries' in response:
            stacks = response['StackSummaries']
            
            print(f"\n=== Found {len(stacks)} Stack(s) ===")
            
            if stacks:
                print(f"{'Stack Name':<40} {'Status':<25} {'Creation Time':<20}")
                print("-" * 85)
                
                for stack in stacks:
                    name = stack.get('StackName', 'N/A')[:39]
                    status = stack.get('StackStatus', 'N/A')[:24]
                    created = str(stack.get('CreationTime', 'N/A'))[:19]
                    print(f"{name:<40} {status:<25} {created:<20}")
            else:
                print("No cluster stacks found.")
        
        return response
        
    except Exception as e:
        print(f"Error listing stacks: {str(e)}")
        return None

# List all cluster stacks
list_response = list_cluster_stacks()

# Filter for HyperPod-related stacks
if list_response and 'StackSummaries' in list_response:
    hyperpod_stacks = [
        stack for stack in list_response['StackSummaries']
        if 'hyperpod' in stack.get('StackName', '').lower()
    ]
    
    if hyperpod_stacks:
        print(f"\n=== HyperPod Stacks ({len(hyperpod_stacks)}) ===")
        for stack in hyperpod_stacks:
            print(f"  - {stack['StackName']} ({stack['StackStatus']})")
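
When many stacks exist, a per-status count can be quicker to read than the full table. This is a minimal sketch that assumes the `StackSummaries` entries carry `StackName` and `StackStatus`, as the listing code above does. `count_by_status` is a hypothetical helper and the sample names are placeholders.

```python
from collections import Counter


def count_by_status(stack_summaries, name_contains="hyperpod"):
    """Count stacks per status, keeping only names that contain the given substring."""
    needle = name_contains.lower()
    return Counter(
        s.get("StackStatus", "UNKNOWN")
        for s in stack_summaries
        if needle in s.get("StackName", "").lower()
    )


sample_summaries = [
    {"StackName": "hyperpod-sdk-aaaa1111", "StackStatus": "CREATE_COMPLETE"},
    {"StackName": "HyperPod-sdk-bbbb2222", "StackStatus": "CREATE_IN_PROGRESS"},
    {"StackName": "unrelated-stack", "StackStatus": "CREATE_COMPLETE"},
]
status_counts = count_by_status(sample_summaries)
print(dict(status_counts))
```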

## Step 7: Update Cluster Configuration

Update the existing cluster configuration using sagemaker-core's Cluster class. This is equivalent to `hyp update cluster` in the CLI.

**Common update scenarios:**
- Scaling instance groups up or down
- Adding new instance types
- Updating cluster tags
- Modifying storage configurations

**Note:** Some changes may require cluster restart or recreation.

In [None]:
# Update cluster configuration using sagemaker-core Cluster class
def update_cluster(cluster_name, region="us-east-2"):
    """Update cluster configuration (equivalent to hyp update cluster)"""
    try:
        print(f"Updating cluster: {cluster_name}")
        
        # Get existing cluster using sagemaker-core, in the region passed to this function
        cluster = Cluster.get(cluster_name=cluster_name, region=region)
        
        print(f"\nCurrent cluster status: {cluster.cluster_status}")
        print(f"Current instance groups: {len(cluster.instance_groups)}")
        
        # Display current instance groups
        print("\n=== Current Instance Groups ===")
        for ig in cluster.instance_groups:
            print(f"  - {ig.instance_group_name}: {ig.current_count}x {ig.instance_type}")
        
        # Example: Update cluster tags
        updated_tags = [
            {"Key": "Environment", "Value": "Development"},
            {"Key": "Project", "Value": "MLTraining"},
            {"Key": "Owner", "Value": "DataScienceTeam"},
            {"Key": "CostCenter", "Value": "ML-Research"},
            {"Key": "UpdatedBy", "Value": "SDK-Example"},
            {"Key": "LastUpdated", "Value": str(time.time())}
        ]
        
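        # The SageMaker UpdateCluster API does not take tags, so apply them
        # with the AddTags API against the cluster ARN instead
        import boto3
        sm_client = boto3.client("sagemaker", region_name=region)
        sm_client.add_tags(ResourceArn=cluster.cluster_arn, Tags=updated_tags)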
        
        print("\n✅ Cluster updated successfully!")
        print("Updated tags:")
        for tag in updated_tags:
            print(f"  {tag['Key']}: {tag['Value']}")
        
        return cluster
        
    except Exception as e:
        print(f"Error updating cluster: {str(e)}")
        return None

# Example: Scale instance group
def scale_instance_group(cluster_name, instance_group_name, target_count, region="us-east-2"):
    """Scale an instance group to target count"""
    try:
        print(f"Scaling instance group '{instance_group_name}' to {target_count} instances")
        
        # Get cluster in the region passed to this function
        cluster = Cluster.get(cluster_name=cluster_name, region=region)
        
        # Find the instance group
        target_ig = None
        for ig in cluster.instance_groups:
            if ig.instance_group_name == instance_group_name:
                target_ig = ig
                break
        
        if not target_ig:
            print(f"Instance group '{instance_group_name}' not found")
            return None
        
        print(f"Current count: {target_ig.current_count}")
        print(f"Target count: {target_count}")
        
        # Update instance group count
        target_ig.target_count = target_count
        
        # Apply the update
        cluster.update(instance_groups=[target_ig])
        
        print(f"\n✅ Instance group scaling initiated!")
        
        return cluster
        
    except Exception as e:
        print(f"Error scaling instance group: {str(e)}")
        return None

# Update functions ready (uncomment when cluster exists)
# updated_cluster = update_cluster(cluster_name)
# scaled_cluster = scale_instance_group(cluster_name, "controller-group", 2)

print("Update functions ready. Use after cluster creation is complete.")
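
When updating tags, it is common to keep the existing tags and override only the ones that change. The sketch below shows one way to merge two tag lists by key, with later values winning. `merge_tags` is a hypothetical helper that operates on plain lists and does not call any AWS API.

```python
def merge_tags(existing, updates):
    """Merge two Key/Value tag lists; values in `updates` override `existing`."""
    merged = {t["Key"]: t["Value"] for t in existing}
    merged.update({t["Key"]: t["Value"] for t in updates})
    return [{"Key": k, "Value": v} for k, v in merged.items()]


current_tags = [
    {"Key": "Environment", "Value": "Development"},
    {"Key": "CreatedBy", "Value": "SDK-Example"},
]
new_tags = [
    {"Key": "Environment", "Value": "Staging"},
    {"Key": "UpdatedBy", "Value": "SDK-Example"},
]
merged_tags = merge_tags(current_tags, new_tags)
print(merged_tags)
```

Because Python dictionaries preserve insertion order, overridden keys keep their original position and new keys are appended at the end.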

## Step 8: Verify Cluster Status and Health

Verify that the cluster is healthy and ready for workloads using comprehensive status checks.

In [None]:
# Comprehensive cluster health check
def check_cluster_health(cluster_name, region="us-east-2"):
    """Perform comprehensive cluster health check"""
    try:
        print(f"Checking health for cluster: {cluster_name}")
        
        # Get cluster details in the region passed to this function
        cluster = Cluster.get(cluster_name=cluster_name, region=region)
        
        print("\n=== Cluster Health Summary ===")
        print(f"Cluster Name: {cluster.cluster_name}")
        print(f"Cluster Status: {cluster.cluster_status}")
        print(f"Creation Time: {cluster.creation_time}")
        print(f"Cluster ARN: {cluster.cluster_arn}")
        
        # Check instance groups health
        print("\n=== Instance Groups Health ===")
        total_instances = 0
        healthy_instances = 0
        
        for ig in cluster.instance_groups:
            print(f"\nInstance Group: {ig.instance_group_name}")
            print(f"  Instance Type: {ig.instance_type}")
            print(f"  Current Count: {ig.current_count}")
            print(f"  Target Count: {getattr(ig, 'target_count', 'N/A')}")
            print(f"  Status: {getattr(ig, 'instance_group_status', 'N/A')}")
            
            total_instances += ig.current_count
            if getattr(ig, 'instance_group_status', '') == 'InService':
                healthy_instances += ig.current_count
        
        print(f"\n=== Overall Health ===")
        print(f"Total Instances: {total_instances}")
        print(f"Healthy Instances: {healthy_instances}")
        health_percentage = (healthy_instances / total_instances * 100) if total_instances > 0 else 0
        print(f"Health Percentage: {health_percentage:.1f}%")
        
        # Determine overall health status
        if cluster.cluster_status == 'InService' and health_percentage >= 80:
            print("\n🟢 Cluster is HEALTHY and ready for workloads")
        elif cluster.cluster_status == 'Creating':
            print("\n🟡 Cluster is still CREATING")
        else:
            print("\n🔴 Cluster may have ISSUES - check individual components")
        
        return cluster
        
    except Exception as e:
        print(f"Error checking cluster health: {str(e)}")
        return None

# Health check function ready (uncomment when cluster exists)
# cluster_health = check_cluster_health(cluster_name)

print("Health check function ready. Use after cluster creation is complete.")
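
The health verdict above depends on two inputs: the cluster status and the share of instances in `InService` groups. The sketch below isolates that calculation as a pure function so the thresholds can be checked without a live cluster. `summarize_health` is a hypothetical helper, and the `count`/`status` dictionary keys are assumptions chosen for this example.

```python
def summarize_health(cluster_status, groups, threshold=80.0):
    """Return (health_percentage, verdict) using the same rules as the health check."""
    total = sum(g["count"] for g in groups)
    healthy = sum(g["count"] for g in groups if g["status"] == "InService")
    percentage = healthy * 100 / total if total > 0 else 0.0

    if cluster_status == "InService" and percentage >= threshold:
        verdict = "HEALTHY"
    elif cluster_status == "Creating":
        verdict = "CREATING"
    else:
        verdict = "ISSUES"
    return percentage, verdict


sample_groups = [
    {"count": 4, "status": "InService"},
    {"count": 1, "status": "Creating"},
]
pct, verdict = summarize_health("InService", sample_groups)
print(f"{pct:.1f}% {verdict}")
```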

## Next Steps

After successfully creating your HyperPod cluster using the SDK, you can:

1. **Submit Training Jobs**: Use HyperPod SDK training classes for distributed training
2. **Deploy Inference Endpoints**: Use HyperPod SDK inference classes for model serving
3. **Monitor Resources**: Use SDK methods to check pod and job status
4. **Access Logs**: Retrieve training and system logs programmatically
5. **Scale Cluster**: Modify instance groups using the Cluster class

## Troubleshooting

If you encounter issues during cluster creation:

- Check AWS CloudFormation console for detailed error messages
- Verify AWS credentials with `boto3.client("sts").get_caller_identity()` and confirm that the returned identity has the required permissions
- Ensure resource quotas are sufficient
- Review the cluster configuration parameters
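
CloudFormation error messages live in the stack events, so filtering for failed resources is often the fastest way to find the root cause. The sketch below works on a list of event dictionaries with the field names CloudFormation uses (`LogicalResourceId`, `ResourceStatus`, `ResourceStatusReason`). `failed_events` is a hypothetical helper, and the sample events are placeholders rather than real output.

```python
def failed_events(stack_events):
    """Return (resource, status, reason) tuples for events whose status ends in _FAILED."""
    return [
        (
            e.get("LogicalResourceId", "N/A"),
            e.get("ResourceStatus"),
            e.get("ResourceStatusReason", ""),
        )
        for e in stack_events
        if str(e.get("ResourceStatus", "")).endswith("_FAILED")
    ]


# With a real stack, the events could come from:
#   boto3.client("cloudformation", region_name="us-east-2")
#       .describe_stack_events(StackName=stack_name)["StackEvents"]
sample_events = [
    {"LogicalResourceId": "ExampleVpc", "ResourceStatus": "CREATE_COMPLETE"},
    {
        "LogicalResourceId": "ExampleBucket",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "Placeholder failure reason",
    },
]
failures = failed_events(sample_events)
for resource, status, reason in failures:
    print(f"{resource}: {status} - {reason}")
```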

## Cleanup

To avoid ongoing charges, remember to delete your cluster when no longer needed:

```python
# Delete cluster using sagemaker-core
cluster = Cluster.get(cluster_name=cluster_name)
cluster.delete()

# Or delete the entire stack
import boto3
cf_client = boto3.client('cloudformation', region_name='us-east-2')
cf_client.delete_stack(StackName=stack_name)
```

## Summary

This notebook demonstrated the complete HyperPod cluster creation workflow using the SDK:

✅ **Initialized** cluster configuration with `HpClusterStack` class  
✅ **Configured** cluster settings and tags programmatically  
✅ **Created** cluster infrastructure with `cluster_stack.create()`  
✅ **Monitored** deployment with `HpClusterStack.check_status()`  
✅ **Described** and **listed** cluster stacks with `HpClusterStack.describe()` and `HpClusterStack.list()`  
✅ **Updated** cluster configuration with `Cluster.update()`  
✅ **Verified** cluster health with comprehensive checks  

Once the stack reaches `CREATE_COMPLETE` and the health check passes, your HyperPod cluster is ready for distributed machine learning workloads.