# SageMaker HyperPod Cluster Creation - Init Experience

This notebook walks through the end-to-end workflow for creating a SageMaker HyperPod cluster with the HyperPod CLI (`hyp`). The init experience guides cluster creation by scaffolding a configuration file, validating it, and then submitting it for deployment.

## Prerequisites

- AWS CLI configured with appropriate permissions
- SageMaker HyperPod CLI installed (`pip install sagemaker-hyperpod`)
- Helm installed (required for cluster operations)
- Python 3.8+ environment
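
The prerequisites above can be checked from a notebook cell before starting. This is a minimal sketch: the executable names `aws`, `hyp`, and `helm` are assumptions inferred from the list above, and the check only confirms that each tool is on `PATH`, not that it is configured correctly.

```python
import shutil
import sys


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_supported(minimum=(3, 8)):
    """Compare the running interpreter with the minimum version listed above."""
    return sys.version_info >= minimum


# Executable names are assumptions based on the prerequisites list.
required = ["aws", "hyp", "helm"]
print("Missing tools:", missing_tools(required) or "none")
print("Python 3.8+:", python_is_supported())
```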

## Workflow Overview

1. **Initialize** - Create initial cluster configuration
2. **Configure** - Customize cluster settings and tags
3. **Validate** - Verify configuration before deployment
4. **Create** - Deploy the cluster infrastructure
5. **Monitor** - Check cluster status and manage lifecycle


## Step 1: Initialize Cluster Configuration

The `hyp init cluster-stack` command scaffolds a new cluster configuration with default settings. It writes a `config.yaml` file that serves as the foundation for your cluster deployment, along with a CloudFormation parameters template (`cfn_params.jinja`) and a `README.md`.

**What this does:**
- Creates a new `config.yaml` with default cluster settings
- Pre-populates settings for the basic infrastructure components (VPC, EKS, S3, etc.)
- Generates unique resource names to avoid conflicts


In [1]:
# Initialize a new cluster stack configuration
!hyp init cluster-stack

Initializing new scaffold for 'cluster-stack'…
Configuration saved to: /Users/nargokul/workspace/private-sagemaker-hyperpod-cli-staging/examples/config.yaml
Cloudformation Parameters Jinja template saved to: /Users/nargokul/workspace/private-sagemaker-hyperpod-cli-staging/examples/cfn_params.jinja
✔️  cluster-stack for schema version='1.0' is initialized in /Users/nargokul/workspace/private-sagemaker-hyperpod-cli-staging/examples
🚀 Welcome!
📘 See /Users/nargokul/workspace/private-sagemaker-hyperpod-cli-staging/examples/README.md for usage.


## Step 2: Configure Cluster Settings

The `hyp configure` command allows you to customize your cluster configuration. You can add tags for resource management, modify instance types, adjust networking settings, and more.

**Key configuration options:**
- **Tags**: For resource organization and cost tracking
- **Instance Groups**: Define compute resources and their specifications
- **Networking**: VPC, subnets, and security group settings
- **Storage**: FSx and EBS volume configurations
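
Tags are passed to `hyp configure --tags` as a JSON list of `Key`/`Value` objects, which is easy to misquote by hand. The sketch below builds that argument from a plain dictionary. The helper name `build_tags_argument` is hypothetical, and the snippet only prints the command rather than running it.

```python
import json
import shlex


def build_tags_argument(tags):
    """Serialize a {key: value} mapping into a shell-quoted Key/Value JSON list."""
    payload = [{"Key": key, "Value": value} for key, value in tags.items()]
    return shlex.quote(json.dumps(payload))


tags = {"Environment": "Development", "Project": "MLTraining"}
print(f"hyp configure --tags {build_tags_argument(tags)}")
```

Keeping tags in a dictionary also makes it harder to introduce duplicate keys by accident.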


In [2]:
# Configure cluster with custom tags for resource management
# Tags help with cost tracking, resource organization, and compliance
!hyp configure --tags '[{"Key": "Environment", "Value": "Development"}, {"Key": "Project", "Value": "MLTraining"}, {"Key": "Owner", "Value": "DataScienceTeam"}, {"Key": "CostCenter", "Value": "ML-Research"}]'

✔️  Configuration updated successfully!
Configuration saved to: /Users/nargokul/workspace/private-sagemaker-hyperpod-cli-staging/examples/config.yaml
✔️  config.yaml updated successfully.


### View Current Configuration

Let's examine the generated configuration to understand what will be deployed:

In [3]:
# Display the current configuration
!cat config.yaml | head -50

# Template type
template: cluster-stack

# Schema version (latest available version used by default)
version: 1.0

# Deployment stage (gamma, prod)
stage: prod

# Feature flag for enabling HP inference
enable_hp_inference_feature: False

# Custom S3 bucket name for templates
custom_bucket_name: 

# Unique prefix for all resources (must be different for each deployment)
resource_name_prefix: hyperpod-cli-integ-test

# The IP range (CIDR notation) for the VPC
vpc_cidr: 10.192.0.0/16

# List of AZs to deploy subnets in
availability_zone_ids:
  - use2-az1
  - use2-az2
  - use2-az3

# The ID of the VPC
vpc_id: 

# List of NAT Gateway IDs
nat_gateway_ids: 

# The ID of the security group
security_group_id: 

# The Kubernetes version
kubernetes_version: 1.31

# The node provisioning mode
node_provisioning_mode: Continuous

# The name of the EKS cluster
eks_cluster_name: eks

# List of private subnet IDs
eks_private_subnet_ids: 

# List of cluste
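
The `vpc_cidr` value shown in the configuration can be sanity-checked locally before deployment. The following is a small sketch using Python's standard `ipaddress` module. The helper name `describe_vpc_cidr` is hypothetical, and it does not replace the checks performed by `hyp validate`.

```python
import ipaddress


def describe_vpc_cidr(cidr):
    """Parse a VPC CIDR block and report basic facts about it."""
    # strict=True raises ValueError if the string is malformed or has host bits set
    network = ipaddress.ip_network(cidr, strict=True)
    return {
        "network": str(network),
        "prefix_length": network.prefixlen,
        "total_addresses": network.num_addresses,
        "is_private": network.is_private,
    }


print(describe_vpc_cidr("10.192.0.0/16"))
```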

## Step 3: Validate Configuration

The `hyp validate` command performs comprehensive validation of your cluster configuration before deployment. This helps catch configuration errors early and ensures all prerequisites are met.

**Validation checks include:**
- AWS credentials and permissions
- Resource quotas and limits
- Configuration syntax and values
- Network and security settings
- Instance type availability in target regions
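
When the workflow is scripted rather than run cell by cell, validation can gate the later steps. The sketch below assumes, without confirmation from the CLI documentation, that `hyp validate` exits with a non-zero status when the configuration is invalid. The helper `run_step` is hypothetical, and a stand-in command is used so that the example runs without the CLI installed.

```python
import subprocess
import sys


def run_step(command):
    """Run a command and return (succeeded, combined stdout and stderr text)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


# Stand-in command; in this workflow it would be ["hyp", "validate"] (assumed exit-code behaviour).
ok, output = run_step([sys.executable, "-c", "print('config looks fine')"])
print("valid" if ok else "invalid", "|", output.strip())
```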


In [4]:
# Validate the cluster configuration
# This checks for potential issues before deployment
!hyp validate

✔️  config.yaml is valid!


## Step 4: Reset Configuration (Optional)

The `hyp reset` command allows you to reset your configuration to defaults or clean up any partial deployments. This is useful when you want to start fresh or if validation reveals issues that require a clean slate.

**Use cases for reset:**
- Starting over with a clean configuration
- Cleaning up after failed deployments
- Switching between different cluster configurations


In [5]:
# Reset configuration if needed (uncomment to use)
# !hyp reset

print("Reset command available if configuration changes are needed")

Reset command available if configuration changes are needed


## Step 5: Create the Cluster

The `hyp create` command deploys your HyperPod cluster infrastructure. This process creates all the necessary AWS resources including VPC, EKS cluster, IAM roles, S3 buckets, and the HyperPod cluster itself.

**Deployment includes:**
- VPC and networking infrastructure
- EKS cluster with managed node groups
- SageMaker HyperPod cluster
- IAM roles and policies
- S3 buckets for artifacts
- FSx file system (if configured)

**Note:** This process typically takes 15-30 minutes to complete.


In [6]:
# Create the HyperPod cluster
# This will deploy all infrastructure components
!hyp create

✔️  config.yaml is valid!
✔️  Configuration is valid!
✔️  Submitted! Files written to /Users/nargokul/workspace/private-sagemaker-hyperpod-cli-staging/examples/run/20250822T165502
Submitting to default region: us-east-2.
Stack creation initiated. Stack ID: arn:aws:cloudformation:us-east-2:211125564141:stack/HyperpodClusterStack-ba60b/6be3a540-7fb3-11f0-8bc0-0aa797f4fc05
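
The later `describe` commands take a stack name, while `hyp create` prints a full Stack ID. A CloudFormation stack ARN has the form `arn:aws:cloudformation:<region>:<account>:stack/<name>/<uuid>`, so the name can be extracted with a few string splits. The helper name `stack_name_from_arn` is hypothetical.

```python
def stack_name_from_arn(stack_id):
    """Extract the stack name from a CloudFormation stack ARN."""
    resource = stack_id.split(":", 5)[5]  # "stack/<name>/<uuid>"
    kind, name, _uuid = resource.split("/")
    if kind != "stack":
        raise ValueError(f"not a CloudFormation stack ARN: {stack_id}")
    return name


stack_id = (
    "arn:aws:cloudformation:us-east-2:211125564141:"
    "stack/HyperpodClusterStack-ba60b/6be3a540-7fb3-11f0-8bc0-0aa797f4fc05"
)
print(stack_name_from_arn(stack_id))  # → HyperpodClusterStack-ba60b
```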


## Step 6: Monitor Cluster Creation

While the cluster is being created, you can monitor its progress using the describe and list commands. These provide real-time status updates on the deployment process.


In [7]:
# Check cluster creation status
import time

# Stack name taken from the Stack ID printed by `hyp create` in Step 5; replace with your own
stack_name = "HyperpodClusterStack-ba60b"

print("Monitoring cluster creation progress...")
for i in range(5):
    print(f"\n--- Status Check {i+1} ---")
    !hyp describe cluster-stack {stack_name}
    time.sleep(30)  # Wait 30 seconds between checks
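
A fixed number of checks either stops too early or keeps polling after the stack has finished. One alternative is to poll until CloudFormation reports a terminal status. The sketch below takes the status source as a callable so that it runs without AWS access. The function name `wait_for_stack` is hypothetical, and the status values are the standard CloudFormation ones for stack creation.

```python
import time

# CloudFormation statuses that end a stack creation attempt.
TERMINAL_STATUSES = {"CREATE_COMPLETE", "CREATE_FAILED", "ROLLBACK_COMPLETE", "ROLLBACK_FAILED"}


def wait_for_stack(get_status, interval_seconds=30, max_checks=60, sleep=time.sleep):
    """Poll get_status() until a terminal status appears or max_checks is reached."""
    status = None
    for attempt in range(1, max_checks + 1):
        status = get_status()
        print(f"check {attempt}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_checks:
            sleep(interval_seconds)
    raise TimeoutError(f"stack still {status} after {max_checks} checks")


# Simulated status source so the sketch runs without AWS access.
fake_statuses = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
print(wait_for_stack(lambda: next(fake_statuses), interval_seconds=0))
```

Assuming `boto3` is installed, `get_status` could instead read `["Stacks"][0]["StackStatus"]` from the CloudFormation client's `describe_stacks(StackName=...)` response.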

## Step 7: Describe Cluster Stack

The `hyp describe cluster-stack` command provides detailed information about your deployed cluster, including resource IDs, endpoints, and current status.

**Information provided:**
- Cluster status and health
- Resource ARNs and IDs
- Network configuration details
- Instance group information
- Storage configuration


In [None]:
# Get detailed information about the cluster stack
# Stack name taken from the Stack ID printed by `hyp create`; replace with your own
!hyp describe cluster-stack HyperpodClusterStack-ba60b

## Step 8: List All Cluster Stacks

The `hyp list cluster-stack` command shows the HyperPod cluster stacks in your account for the current region. This is useful for managing multiple clusters and getting an overview of your infrastructure.

**Displays:**
- All cluster stacks in the current region
- Stack names and creation timestamps
- Current status of each stack
- Resource counts and types


In [8]:
# List all cluster stacks in your account
!hyp list cluster-stack

📋 HyperPod Cluster Stacks (20 found)

[1] Stack Details:
 Field               | Value
---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------
 StackId             | arn:aws:cloudformation:us-east-2:211125564141:stack/HyperpodClusterStack-ba60b-EKSClusterStack-1F8N74LGLG0RG/ef977740-7fb3-11f0-adbe-0a785f9dc1d7
 StackName           | HyperpodClusterStack-ba60b-EKSClusterStack-1F8N74LGLG0RG
 TemplateDescription | EKS Cluster Stack
 CreationTime        | 2025-08-22 23:58:44
 StackStatus         | CREATE_IN_PROGRESS
 ParentId            | arn:aws:cloudformation:us-east-2:211125564141:stack/HyperpodClusterStack-ba60b/6be3a540-7fb3-11f0-8bc0-0aa797f4fc05
 RootId              | arn:aws:cloudformation:us-east-2:211125564141:stack/HyperpodClusterStack-ba60b/6be3a540-7fb3-11f0-8bc0-0aa797f4fc05
 DriftInformation    | {'StackDriftStatus': 'NOT_CHECKED'}

[2] Stack Details
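
The listing mixes root stacks with the nested stacks that CloudFormation creates beneath them. Nested stacks carry a `RootId`, as in the entry above. The sketch below groups stack records by root. It assumes the records are available as dictionaries using the field names from the table, which is an assumption about the data shape rather than a documented output format of the CLI.

```python
from collections import defaultdict


def group_by_root(stacks):
    """Map each root stack ID to the names of the nested stacks beneath it."""
    groups = defaultdict(list)
    for stack in stacks:
        root_id = stack.get("RootId")
        if root_id:  # only nested stacks carry a RootId
            groups[root_id].append(stack["StackName"])
        else:
            groups.setdefault(stack["StackId"], [])  # register a root stack
    return dict(groups)


ROOT_ID = (
    "arn:aws:cloudformation:us-east-2:211125564141:"
    "stack/HyperpodClusterStack-ba60b/6be3a540-7fb3-11f0-8bc0-0aa797f4fc05"
)
stacks = [
    {"StackId": ROOT_ID, "StackName": "HyperpodClusterStack-ba60b"},
    {
        "StackId": "arn:aws:cloudformation:us-east-2:211125564141:stack/"
        "HyperpodClusterStack-ba60b-EKSClusterStack-1F8N74LGLG0RG/ef977740-7fb3-11f0-adbe-0a785f9dc1d7",
        "StackName": "HyperpodClusterStack-ba60b-EKSClusterStack-1F8N74LGLG0RG",
        "RootId": ROOT_ID,
    },
]
for root, children in group_by_root(stacks).items():
    print(root.split("/")[1], "->", children)
```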

## Step 9: Update Cluster Configuration

The `hyp update cluster` command allows you to modify your existing cluster configuration. You can add or remove instance groups, update tags, or modify other cluster settings.

**Common update scenarios:**
- Scaling instance groups up or down
- Adding new instance types
- Updating cluster tags
- Modifying storage configurations

**Note:** Some changes may require cluster restart or recreation.


In [None]:
# Update cluster configuration (example: adding more tags)
# Uncomment and modify as needed
# !hyp update cluster --add-tags '[{"Key": "UpdatedBy", "Value": "NotebookExample"}]'

print("Update command available for cluster modifications")

## Step 10: Verify Cluster Connectivity

Once your cluster is created, verify that you can connect to it and that all components are functioning properly.


In [None]:
# Set cluster context for kubectl operations
# Replace 'your-cluster-name' with your actual cluster name
# !hyp set-cluster-context --cluster-name your-cluster-name

# Get cluster context information
# !hyp get-cluster-context

print("Cluster connectivity commands available after deployment")

## Next Steps

After successfully creating your HyperPod cluster, you can:

1. **Submit Training Jobs**: Use `hyp create hyp-pytorch-job` to run distributed training
2. **Deploy Inference Endpoints**: Use `hyp create hyp-jumpstart-endpoint` for model serving
3. **Monitor Resources**: Check pod status with `hyp list-pods`
4. **Access Logs**: View training logs with `hyp get-logs`
5. **Scale Cluster**: Add or remove instance groups as needed

## Troubleshooting

If you encounter issues during cluster creation:

- Check AWS CloudFormation console for detailed error messages
- Verify AWS credentials and permissions
- Ensure resource quotas are sufficient
- Review the configuration file for syntax errors
- Use `hyp validate` to identify configuration issues
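
The same error details shown in the CloudFormation console can be pulled out of a stack's event list programmatically. The sketch below filters event records for failed resources. The field names follow the CloudFormation `describe_stack_events` response, while the helper name `failed_events` and the sample events are hypothetical and for illustration only.

```python
def failed_events(events):
    """Return (resource, status, reason) for events whose status ends in _FAILED."""
    return [
        (event["LogicalResourceId"], event["ResourceStatus"], event.get("ResourceStatusReason", ""))
        for event in events
        if event["ResourceStatus"].endswith("_FAILED")
    ]


# Hypothetical sample events, not output from a real deployment.
sample_events = [
    {"LogicalResourceId": "ExampleVpc", "ResourceStatus": "CREATE_COMPLETE"},
    {
        "LogicalResourceId": "ExampleCluster",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "Example failure reason",
    },
]
for resource, status, reason in failed_events(sample_events):
    print(f"{resource}: {status} ({reason})")
```

Assuming `boto3` is installed, the `events` argument could come from the `StackEvents` list returned by the CloudFormation client's `describe_stack_events(StackName=...)` call.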

## Cleanup

To avoid ongoing charges, remember to delete your cluster when no longer needed:

```bash
hyp delete cluster-stack --stack-name your-stack-name
```


## Summary

This notebook demonstrated the complete HyperPod cluster creation workflow:

✅ **Initialized** cluster configuration with `hyp init cluster-stack`  
✅ **Configured** cluster settings and tags with `hyp configure`  
✅ **Validated** configuration with `hyp validate`  
✅ **Created** cluster infrastructure with `hyp create`  
✅ **Monitored** deployment with `hyp describe cluster-stack`  
✅ **Listed** all clusters with `hyp list cluster-stack`  
✅ **Updated** cluster configuration with `hyp update cluster`  

Once the stack reaches `CREATE_COMPLETE`, your HyperPod cluster is ready for distributed machine learning workloads!
