# SageMaker HyperPod PyTorch Job - Init Experience

This notebook demonstrates the end-to-end workflow for creating a SageMaker HyperPod PyTorch job with the HyperPod CLI. The init experience provides a guided way to create a HyperPod PyTorch job, with validation and configuration management.

## Prerequisites

- SageMaker HyperPod CLI installed (`pip install sagemaker-hyperpod`)
- HyperPod PyTorch job template installed (`pip install hyperpod-pytorch-job-template`)
- HyperPod training operator installed in your HyperPod cluster
- Python 3.8+ environment
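
The local prerequisites above can be checked from a notebook cell before continuing. The following is a minimal sketch using only the standard library. The helper name `check_prerequisites` is hypothetical, and the check only confirms that a `hyp` executable is on the `PATH`; it cannot tell whether the training operator is installed in the cluster.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 8), cli_name="hyp"):
    """Return a list of problems found; an empty list means the local checks passed."""
    problems = []
    # Compare only (major, minor) against the required minimum version.
    if sys.version_info[:2] < min_version:
        problems.append(
            f"Python {min_version[0]}.{min_version[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(cli_name) is None:
        problems.append(f"`{cli_name}` was not found on PATH")
    return problems


for problem in check_prerequisites():
    print("Missing prerequisite:", problem)
```

If nothing is printed, both local checks passed.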

## Workflow Overview

1. **Initialize** - Create the initial PyTorch job configuration
2. **Configure** - Customize PyTorch job parameters
3. **Validate** - Verify the configuration before deployment
4. **Create** - Deploy the PyTorch job
5. **Monitor** - Check PyTorch job status and manage its lifecycle


## Step 0: Connect to your HyperPod cluster

Make sure the HyperPod training operator is installed in your HyperPod cluster.


In [None]:
# List all available SageMaker HyperPod clusters in your account
!hyp list-cluster

In [None]:
# Configure your local kubectl environment to interact with a specific SageMaker HyperPod cluster (and namespace)
!hyp set-cluster-context --cluster-name ml-cluster-integ-test

## Step 1: Initialize PyTorch Job Configuration

The `hyp init hyp-pytorch-job` command creates a new configuration template with default settings. This generates a `config.yaml` file that serves as the foundation for your deployment.

**What this does:**
- Creates a `config.yaml` with default PyTorch job settings.
- Creates a `k8s.jinja` template for the Kubernetes payload that will be submitted. Refer to it to see how each parameter is used.
- Creates a `README.md` with a detailed explanation of the init experience.


In [None]:
# Initialize a new PyTorch job configuration in the current directory
!hyp init hyp-pytorch-job

## Step 2: Configure PyTorch Job Settings

The `hyp configure` command lets you customize your PyTorch job configuration.

**Key configuration options:**
- **job_name**: Job name
- **image**: Docker image for training

In [None]:
!hyp configure --job-name my-pytorch-job

### View Current Configuration

Let's examine the generated configuration to understand what will be deployed:

In [None]:
# Display the current configuration
!cat config.yaml | head -50
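
To inspect the settings from Python instead of the shell, a small reader can help. This is a sketch under a strong assumption: it treats `config.yaml` as flat, top-level `key: value` lines and skips anything nested, which may not match the generated file exactly. The name `read_flat_config` is hypothetical, and a full YAML parser is the better choice for anything beyond a quick look.

```python
from pathlib import Path


def read_flat_config(path="config.yaml"):
    """Read top-level `key: value` lines; nested and list entries are skipped."""
    settings = {}
    for line in Path(path).read_text().splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, list items, and indented (nested) lines.
        if not stripped or stripped.startswith(("#", "-")) or line[0].isspace():
            continue
        # Split on the first colon only, so values such as `repo:tag` stay intact.
        key, sep, value = stripped.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


if Path("config.yaml").exists():
    for key, value in read_flat_config().items():
        print(f"{key} = {value!r}")
```

Values are returned as raw strings, so inline comments and quoting are not interpreted.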

## Step 3: Validate Configuration

The `hyp validate` command performs syntax validation of your PyTorch job configuration before deployment. This helps catch configuration errors early and ensures all prerequisites are met.


In [None]:
# Validate the PyTorch job configuration
# This checks for potential issues before deployment
!hyp validate
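
When the workflow is scripted, it helps to stop before deployment if validation fails. The sketch below wraps a command with `subprocess` and reports success based on its exit code. It assumes that `hyp validate` exits with a non-zero code on failure, which this notebook does not confirm; `run_step` is a hypothetical helper name.

```python
import subprocess
import sys


def run_step(command):
    """Run a command, echo its output, and return True if it exited with code 0."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        print(f"Command not found: {command[0]}", file=sys.stderr)
        return False
    print(result.stdout, end="")
    print(result.stderr, end="", file=sys.stderr)
    # Assumption: a non-zero exit code signals a failed step.
    return result.returncode == 0
```

For example, `run_step(["hyp", "validate"])` could gate the create step, so that deployment only proceeds when it returns `True`.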

## Step 4: Reset Configuration (Optional)

The `hyp reset` command allows you to reset your configuration to defaults or clean up any partial deployments. This is useful when you want to start fresh or if validation reveals issues that require a clean slate.

**Use cases for reset:**
- Starting over with a clean configuration
- Cleaning up after failed deployments
- Switching between different PyTorch job configurations


In [None]:
# Reset configuration if needed (uncomment to use)
# !hyp reset

print("Reset command available if configuration changes are needed")

## Step 5: Create the PyTorch Job

The `hyp create` command deploys your HyperPod PyTorch job using the settings in `config.yaml`. It creates a timestamped folder under `runs`, where it saves the `config.yaml` and the rendered `k8s.yaml` Kubernetes payload with the values injected.

In [None]:
# Create the PyTorch job
!hyp create
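
After the create step, the saved files can be located from Python. This is a minimal sketch, with `latest_run_dir` as a hypothetical helper name. Because the exact format of the timestamped folder names is not described here, it picks the most recently modified subfolder of `runs` rather than parsing the names.

```python
from pathlib import Path


def latest_run_dir(runs_root="runs"):
    """Return the most recently modified subfolder of `runs_root`, or None if there is none."""
    root = Path(runs_root)
    if not root.is_dir():
        return None
    run_dirs = [p for p in root.iterdir() if p.is_dir()]
    # Modification time stands in for the timestamp encoded in the folder name.
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


run_dir = latest_run_dir()
if run_dir is not None:
    for path in sorted(run_dir.iterdir()):
        print(path)
```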

## Step 6: Monitor PyTorch Job Creation

While the PyTorch job is being created, you can monitor its progress with the describe and list commands, which report the current status of the deployment.

In [None]:
# Check PyTorch job creation status
import time

print("Monitoring PyTorch job progress...")
for i in range(5):
    print(f"\n--- Status Check {i+1} ---")
    !hyp describe hyp-pytorch-job --name my-pytorch-job
    time.sleep(30)  # Wait 30 seconds between checks
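
The loop above always runs all five checks. A variant that stops as soon as a condition is met could look like the following sketch. The `check` callable is hypothetical: this notebook does not specify the output format of `hyp describe`, so deciding when the job counts as ready is left to the caller.

```python
import time


def poll_until(check, interval_seconds=30, max_checks=5, sleep=time.sleep):
    """Call `check()` up to `max_checks` times; return True as soon as it returns True."""
    for attempt in range(1, max_checks + 1):
        print(f"--- Status Check {attempt} ---")
        if check():
            return True
        # Do not wait after the final attempt.
        if attempt < max_checks:
            sleep(interval_seconds)
    return False
```

The `sleep` parameter exists so the wait can be replaced in tests; in a notebook the default `time.sleep` is fine.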

## Step 7: Describe PyTorch Job

The `hyp describe hyp-pytorch-job` command provides detailed information about your PyTorch job, including its deployment status and its SageMaker PyTorch job status.

In [None]:
# Get detailed information about the PyTorch job
!hyp describe hyp-pytorch-job --name my-pytorch-job

## Step 8: List All PyTorch Jobs

The `hyp list hyp-pytorch-job` command shows all HyperPod PyTorch jobs in your account. This is useful for getting an overview when you manage multiple jobs.


In [None]:
# List all PyTorch jobs in your account
!hyp list hyp-pytorch-job

## Next Steps

After successfully creating your HyperPod PyTorch job, you can:

1. **Monitor Resources**: Check pod status with `hyp list-pods hyp-pytorch-job`
2. **Access Logs**: View pod logs with `hyp get-logs hyp-pytorch-job`


## Troubleshooting

If you encounter issues during PyTorch job creation:

- Use `hyp get-operator-logs hyp-pytorch-job` to check the operator logs for errors
- Verify AWS credentials and permissions
- Ensure resource quotas are sufficient
- Review the configuration file for syntax errors
- Use `hyp validate` to identify configuration issues

## Cleanup

To avoid ongoing charges, delete your PyTorch job when it is no longer needed:

```bash
hyp delete hyp-pytorch-job --name my-pytorch-job
```
