diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/.gitignore b/Container-Root/hyperpod/deployment/eks/demo/hero/.gitignore new file mode 100644 index 0000000..bd1aaf2 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/.gitignore @@ -0,0 +1,64 @@ +# IDE and Editor files +.vscode/ +.idea/ +*.swp +*.swo +*~ + +# OS generated files +.DS_Store +.DS_Store? +._* +.Spotlight-V100 +.Trashes +ehthumbs.db +Thumbs.db + +# Python +__pycache__/ +*.py[cod] +*$py.class +*.so +.Python +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +wheels/ +*.egg-info/ +.installed.cfg +*.egg +MANIFEST + +# Virtual environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Jupyter Notebook +.ipynb_checkpoints + +# Node.js +node_modules/ +npm-debug.log* +yarn-debug.log* +yarn-error.log* + +# Logs +*.log +logs/ + +# Temporary files +*.tmp +*.temp \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/README.md b/Container-Root/hyperpod/deployment/eks/demo/hero/README.md new file mode 100644 index 0000000..eddab96 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/README.md @@ -0,0 +1,1005 @@ +# HyperPod Hero Demo πŸš€ + +This demo showcases how to efficiently orchestrate multi-user AI/ML training and inference jobs on a SageMaker HyperPod cluster orchestrated by EKS. + +## Key Features + +- **GPU resource allocation** via [Task Governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html) +- **Dynamic scaling** with Karpenter and [KEDA](https://keda.sh/) +- **One-click observability** for cluster metrics using Amazon Managed Prometheus & Grafana +- **Seamless integration** with SageMaker Studio and FSx +- **Model observability** using [Amazon Managed MLFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) +- **Resiliency** with [HyperPod resiliency](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html) and the [HyperPod Training Operator](https://aws.amazon.com/blogs/machine-learning/accelerate-large-scale-ai-training-with-amazon-sagemaker-hyperpod-training-operator/) +- **Streamlined inference** with the [HyperPod Inference Operator](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-hyperpod-launches-model-deployments-to-accelerate-the-generative-ai-model-development-lifecycle/) + +This comprehensive showcase emphasizes HyperPod EKS's differentiators in handling diverse ML workloads at scale. + + +--- + +## Getting Started + +### Prerequisites + +Before running this demo, ensure you have: + +**HyperPod EKS Cluster** with: +- [Continuous Provisioning enabled](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-scaling-eks.html) +- [Autoscaling enabled](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html) +- [HyperPod Training Operator](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html) toggled on at cluster creation +- [HyperPod Inference Operator](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-setup.html) toggled on at cluster creation +- **10 g5.8xlarge instances** (persistent instance group) +- **2 m5.12xlarge instances** (head instance group) +- **0 instances of g5.8xlarge** (autoscaling instance group) + +**Additional Setup:** +- SageMaker Studio configured (optional, for enhanced cluster management) + +--- + +## 1. 
Set up the HyperPod Container Environment + +### 1.1 Clone and Setup + +```bash +# Clone this repository +git clone https://github.com/mvinci12/hero-demo-hyperpod.git +cd hero-demo-hyperpod +``` + +### 1.2 Launch Container Environment + +Run the [aws-do-hyperpod](https://github.com/aws-samples/aws-do-hyperpod/tree/main) container with this repo mounted: + +```bash +docker run --name=do-hyperpod-use1 \ + --hostname=deb706ea3971 \ + --mac-address=8a:15:14:f3:90:39 \ + --volume /var/run/docker.sock:/var/run/docker.sock \ + --volume $(pwd):/hero-demo-hyperpod \ + --volume ~/.kube:/root/.kube \ + --volume ~/.aws:/root/.aws \ + --network=bridge \ + --workdir=/hyperpod \ + --detach=true \ + public.ecr.aws/hpc-cloud/aws-do-hyperpod:latest \ + tail -f /dev/null +``` + +```bash +# Exec into the container +docker exec -it do-hyperpod-use1 bash +``` + +> **Why use the container?** The aws-do-hyperpod container provides a pre-configured environment with all necessary tools (kubectl, aws CLI, Kubernetes utilities like kubectx/kubens, and HyperPod management scripts) already installed. This containerized approach follows the do-framework principles to simplify DevOps tasks. If you prefer your local environment, ensure you have kubectl, aws CLI, and proper cluster access configured. + +### 1.3 Navigate to Project Directory + +```bash +cd /hero-demo-hyperpod +``` + +
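+
+Before moving on, you can optionally confirm that the tooling the container is expected to ship with is on the `PATH`. This is only a sanity check of the environment, not a required step:
+
+```bash
+# Each of these should print a version or a path if the container image is set up as expected
+kubectl version --client
+aws --version
+which kubectx kubens
+```
+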
+ +### 1.4 Configure Environment Variables + +The HyperPod environment variables will be used throughout this repo. + +Configure and source `env_vars`: +``` bash +source env_vars +``` + +
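+
+The values checked into `env_vars` (cluster name, ARNs, S3 bucket, FSx ID) point at the demo account, so edit them to match your own environment before sourcing. Once sourced, a quick optional check that the key variables were picked up:
+
+```bash
+env | grep -E 'AWS_REGION|EKS_CLUSTER_NAME|HYPERPOD_CLUSTER_ARN|S3_BUCKET_NAME|ACCEL_INSTANCE_TYPE'
+```
+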
+ +### 1.5 Verify kubectl Access + +Update your kubeconfig with cluster credentials: + +```bash +aws eks update-kubeconfig --name $EKS_CLUSTER_NAME +``` + +**Verify connection:** +```bash +kubectl config current-context +``` +Expected output: +``` +arn:aws:eks:us-west-2:xxxxxxxxxxxx:cluster/hyperpod-eks-cluster +``` + +```bash +kubectl get svc +``` +Expected output: +``` +NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE +svc/kubernetes ClusterIP 10.100.0.1 443/TCP 1m +``` + +
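+
+If `kubectl` cannot reach the cluster, first confirm that the credentials mounted into the container (`~/.aws`) resolve to the account that owns the HyperPod cluster. This optional check prints the account and role in use:
+
+```bash
+aws sts get-caller-identity
+```
+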
+ +### 1.6 Verify Cluster Nodes + +View your cluster nodes: + +```bash +# Using container shortcut (if in aws-do-hyperpod container) +kgn + +# Or standard kubectl command +kubectl get nodes +``` + +**Expected output:** +``` +NAME STATUS ROLES AGE VERSION INSTANCE-TYPE NODE-HEALTH-STATUS DEEP-HEALTH-CHECK-STATUS +NAME STATUS ROLES AGE VERSION INSTANCE-TYPE NODE-HEALTH-STATUS DEEP-HEALTH-CHECK-STATUS +hyperpod-i-025636fbd2fbba767 Ready 5h1m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-02ebe144311be48ce Ready 5h1m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-0387cfd46e8fc67fe Ready 5h1m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-0400a27e2a27d2c42 Ready 39m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-07ebef01aedec792b Ready 5h1m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-0a100ab8dd6e81c5e Ready 5h1m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-0b51bf8378de840f9 Ready 5h1m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-0cfc430f774dd4b1a Ready 4h39m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-0d1acb954b9caff70 Ready 5h1m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-0fb1c36a2202b07b7 Ready 5h1m v1.32.9-eks-113cf36 ml.g5.8xlarge Schedulable +hyperpod-i-0fc9b9f110fa4a55c Ready 24d v1.32.3-eks-473151a ml.m5.12xlarge Schedulable +hyperpod-i-0fc9b9f110fa4a55c Ready 24d v1.32.3-eks-473151a ml.m5.12xlarge Schedulable +``` + +--- + +## 2. Configure Karpenter Autoscaling + +### 2.1 Create NodeClass and NodePool Resources + +The autoscaling setup uses Karpenter with HyperPod-specific NodeClass and NodePool resources. These files define how Karpenter should manage your autoscaling instance groups. + +**Apply the NodeClass:** +```bash +kubectl apply -f autoscaling/nodeclass.yaml +``` + +**Apply the NodePool:** +```bash +kubectl apply -f autoscaling/nodepool.yaml +``` + +
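+
+If either `kubectl apply` fails with a "no matches for kind" error, the autoscaling CRDs are not present yet, which usually means autoscaling was not enabled on the HyperPod cluster (see the prerequisites). A rough way to check for the CRDs (exact names may vary by release):
+
+```bash
+kubectl get crd | grep -Ei 'karpenter|hyperpodnodeclass|nodepool'
+```
+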
+ +### 2.2 Verify Autoscaling Configuration + +**Check NodeClass status:** +```bash +kubectl get hyperpodnodeclass sample-nc -o yaml +``` + +**Check NodePool status:** +```bash +kubectl get nodepool sample-np -o yaml +``` + +Both resources should show `Ready: True` in their status conditions. + +> **πŸ’‘ Important:** Your autoscaling instance group (`worker-group-1`) should start with 0 nodes. Karpenter will automatically scale up nodes when pods are scheduled and scale them down when they're no longer needed. + +--- + +## 3. Set up Task Governance + +### 3.1 Installation + +To install the SageMaker HyperPod task governance EKS add-on, run the following command: +``` bash +aws eks create-addon --region $AWS_REGION --cluster-name $EKS_CLUSTER_NAME --addon-name amazon-sagemaker-hyperpod-taskgovernance +``` + +
+ +Verify successful installation with: +``` bash +aws eks describe-addon --region $AWS_REGION --cluster-name $EKS_CLUSTER_NAME --addon-name amazon-sagemaker-hyperpod-taskgovernance +``` +If the installation was successful, you should see details about the installed add-on in the output. + +
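+
+If you only want the add-on state rather than the full description, you can query it directly; it should report `ACTIVE` once installation completes (optional check):
+
+```bash
+aws eks describe-addon \
+  --region $AWS_REGION \
+  --cluster-name $EKS_CLUSTER_NAME \
+  --addon-name amazon-sagemaker-hyperpod-taskgovernance \
+  --query 'addon.status' --output text
+```
+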
+ +### 3.2 Setup Cluster Policy + +Cluster policy will set up how tasks are prioritized and how idle compute is allocated. Apply a Cluster Policy using the following configuration: + +``` bash +aws sagemaker \ + create-cluster-scheduler-config \ + --name "example-cluster-scheduler-config" \ + --cluster-arn $HYPERPOD_CLUSTER_ARN \ + --scheduler-config "PriorityClasses=[{Name=inference,Weight=100},{Name=training,Weight=80}],FairShare=Enabled" +``` + +> **πŸ“‹ What this cluster policy does:** +> - **Priority Classes**: Creates two priority levels - `inference` jobs get higher priority (weight 100) than `training` jobs (weight 80) +> - **Fair Share**: Enables fair resource distribution among users and teams when multiple jobs compete for resources +> - **Resource Allocation**: Ensures inference workloads get preferential access to compute while still allowing training jobs to utilize available capacity +> - **Multi-tenancy**: Supports multiple teams submitting jobs simultaneously with predictable resource allocation behavior + +To verify creation, run: +``` bash +aws sagemaker list-cluster-scheduler-configs +``` + +### 3.3 Create Compute Quotas for Teams + +Compute quotas allow you to allocate specific amounts of compute resources to different teams, ensuring fair resource distribution and preventing any single team from consuming all available capacity. Each team gets their own quota with borrowing capabilities for efficient resource utilization. + +
+ +### 3.4 Create Training Team Quota + +Create a compute quota for the training team with 6 ml.g5.8xlarge instances: + +```bash +aws sagemaker \ + create-compute-quota \ + --name "Training-Team-Quota" \ + --cluster-arn $HYPERPOD_CLUSTER_ARN \ + --compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g5.8xlarge,Count=6}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=50},PreemptTeamTasks=LowerPriority" \ + --activation-state "Enabled" \ + --compute-quota-target "TeamName=training-team,FairShareWeight=0" +``` + +
+ +### 3.5 Create Inference Team Quota + +Create a compute quota for the inference team with 4 ml.g5.8xlarge instances: + +```bash +aws sagemaker \ + create-compute-quota \ + --name "Inference-Team-Quota" \ + --cluster-arn $HYPERPOD_CLUSTER_ARN \ + --compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g5.8xlarge,Count=4}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=50},PreemptTeamTasks=LowerPriority" \ + --activation-state "Enabled" \ + --compute-quota-target "TeamName=inference-team,FairShareWeight=0" +``` + +
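+
+Creating the quotas also provisions the team namespaces and Kueue queues that the rest of this demo targets (`hyperpod-ns-training-team` and `hyperpod-ns-inference-team`). An optional quick check that they exist before moving on:
+
+```bash
+kubectl get namespaces | grep hyperpod-ns
+kubectl get clusterqueues,localqueues -A
+```
+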
+ +### 3.6 Verify Compute Quotas + +**List all compute quotas:** +```bash +aws sagemaker list-compute-quotas --cluster-arn $HYPERPOD_CLUSTER_ARN +``` + +> **πŸ’‘ Quota Configuration Explained:** +> - **Training Team**: Gets 6 ml.g5.8xlarge instances as their base allocation +> - **Inference Team**: Gets 4 ml.g5.8xlarge instances as their base allocation +> - **Resource Sharing**: Both teams can lend unused resources and borrow up to 50% additional capacity when available +> - **Preemption**: Lower priority tasks can be preempted when higher priority teams need resources +> - **Fair Share Weight**: Set to 0 for both teams, meaning they rely on their quota allocations rather than fair share scheduling + +View in HyperPod console: + +
+ +Task Governance Console + +
+ +Your cluster is now configured with team-based compute quotas that ensure fair resource allocation while allowing efficient resource sharing between teams! + +## 4. Set up SageMaker Managed MLFlow + +To see more instructions on setting up MLFlow, see the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) and [AI on SageMaker instructions](https://awslabs.github.io/ai-on-sagemaker-hyperpod/docs/add-ons/Observability/MLFlow). + +We have already decorated our Dockerfile, training YAML, and Python code for MLFlow integration. The last part of this is to run: + +1. Set up Service Account: +``` bash +./mlflow/setup-mlflow.sh +``` + +2. Create Tracking Server: +``` bash +./mlflow/create-trackingserver.sh +``` + +3. Create MLFlow UI: +``` bash +./mlflow/create-ui.sh +``` + +## 5. Set up Training Resources + +### 5.1 Install HyperPod Training Operator + +If you haven't installed the HyperPod Training Operator via the cluster creation toggle, install it at this point: + +- Install [cert-manager](https://cert-manager.io/docs/installation/) +- Set up the [EKS Pod Identity Agent using the console](https://docs.aws.amazon.com/eks/latest/userguide/pod-id-agent-setup.html). If you want to use the AWS CLI, use the following command: + +```bash +aws eks create-addon --cluster-name $EKS_CLUSTER_NAME --addon-name eks-pod-identity-agent --region $AWS_REGION +``` + +**Now, install HPTO via the SageMaker AI console:** + +1. Open the Amazon SageMaker AI console at https://console.aws.amazon.com/sagemaker/ +2. Go to your cluster's details page +3. On the Dashboard tab, locate the add-on named Amazon SageMaker HyperPod training operator, and choose install +4. During the installation process, SageMaker AI creates an IAM execution role with permissions similar to the [AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and creates a pod identity association between your Amazon EKS cluster and your new execution role + +### 5.2 Download Training Dataset to FSx + +Before training, we need to download the C4 dataset to FSx for faster data loading during training. This script downloads the English subset of the C4 dataset (~305GB) directly to your FSx volume. + +You can also stream from HuggingFace bot this may bring throttling errors. To do this, please reference the example [here](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/FSDP). The only difference between this demo and that example is the dataloader in `src/model_utils/train_utils.py` and how we reference the data in our `hpto_1b.yaml`. + +**Download the dataset:** +Create a pod that mounts our FSx to it: +``` bash +kubectl apply -f fsx.yaml +``` +Download data: +``` bash +./training/download-c4-direct.sh +``` + +> **πŸ“‹ What this does:** +> - Creates a temporary pod with FSx mounted +> - Installs git and git-lfs +> - Downloads the C4 English dataset to `/fsx/datasets/c4/en/` +> - The dataset will be available to all training jobs via the FSx PVC +> +> **Note:** This download takes time depending on your network speed. The dataset is ~305GB and will be stored on FSx for reuse across multiple training runs. + +### 5.3 Build Docker Image + +Before submitting our job, we need to build and push our Docker image to ECR. This image will contain all of our job dependencies, libraries, and code to submit the training job. 
+ +To build and push your image to ECR, please run: +``` bash +./training/build-push.sh +``` + +### 5.4 Create PVC in `hyperpod-ns-training-team` Namespace + +Since PVCs are namespace isolated and the cluster creation creates our FSx PVC in the `default` namespace, we can create a PVC using static provisioning in the `hyperpod-ns-training-team` namespace, where we will submit our training job. + +To do this, run this script: +``` bash +./training/create-pvc.sh +``` + + +## 6. Set up Inference Operator + +### 6.1 Install Inference Operator + +If you haven't installed the HyperPod Inference Operator via the cluster creation toggle, install it at this point. Follow the [installation instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-setup.html). + +--- + +## Running the Demo + +## 7. Submit FSDP Training Job + +### 7.1 Submit Training Job + +Now we are ready to submit our FSDP job with the HyperPod Training Operator. + +As the training team, we will be submitting a HyperPodPyTorchJob that requires **8 instances** but only has **6 allocated**. Task Governance allows the Training Team to **borrow** 2 instances from the Inference Team's capacity. + +To submit, run: +``` bash +./training/submit-job.sh +``` + +Output: +``` +root@deb706ea3971:/hero-demo-hyperpod# ./training/submit-job.sh +Running: envsubst < hpto_5b.yaml | kubectl apply -f - +Creating HyperPod PytorchJob for training Llama3 5B parameter model... + +hyperpodpytorchjob.sagemaker.amazonaws.com/llama3-1-8b-fsdp-hpto created +``` + +View your job pods: +``` bash +kgp -n hyperpod-ns-training-team +``` + +While the Kubernetes pods are pulling the image from ECR, here's a little bit about our job: + +> **πŸ“‹ Job YAML Configuration & Task Governance Integration** +> +> Our training job (`hpto_5b.yaml`) demonstrates several key integrations with HyperPod's task governance system: +> +> **Team Identification:** +> - **Namespace**: `hyperpod-ns-training-team` - This namespace identifies the job as belonging to the training team +> - **Queue Label**: `kueue.x-k8s.io/queue-name: hyperpod-ns-training-team-localqueue` - Routes the job through the training team's resource queue +> - **Priority Class**: `kueue.x-k8s.io/priority-class: training-priority` - Assigns training priority level (weight 80) as defined in our cluster policy +> +> **Resource Management:** +> - The job requests GPU and EFA resources per node using environment variables (`$GPU_PER_NODE`, `$EFA_PER_NODE`) +> - Task governance automatically applies the training team's compute quota (6 ml.g5.8xlarge instances, borrowing 2 ml.g5.8xlarge instances) +> - Resource sharing allows borrowing up to 50% additional capacity when available from other teams +> +> **HyperPod Training Operator Features:** +> - **Fault Tolerance**: `jobMaxRetryCount: 10` with intelligent restart policies +> - **Health Monitoring**: Log pattern monitoring detects job start and hanging scenarios +> - **Topology Awareness**: Pod anti-affinity and node affinity ensure optimal GPU placement +> - **Managed Tiered Checkpointing**: Automatic checkpoint management with configurable frequency +> - **EFA Networking**: High-performance networking configuration for distributed training +> +> The operator handles the complexity of distributed PyTorch training while task governance ensures fair resource allocation across teams. 
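+
+While the image is being pulled, you can also watch how Kueue admits the job under the training team's queue described above (optional; the queue and workload names follow the labels in the job YAML):
+
+```bash
+kubectl get localqueue -n hyperpod-ns-training-team
+kubectl get workloads -n hyperpod-ns-training-team
+```
+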
+
+When your pods go into the `Running` state, you can view your pod logs via [kubetail](https://github.com/johanhaleby/kubetail):
+``` bash
+kubetail llama3 -n hyperpod-ns-training-team
+```
+
+Once training starts, you will see output similar to this:
+```
+[llama3-1-8b-fsdp-hpto-pods-6] INFO: 10.1.144.242:35242 - "GET /status HTTP/1.1" 200 OK
+[llama3-1-8b-fsdp-hpto-pods-0] [default0]:2025-10-03 20:25:09,196 [INFO] __main__: Batch 1 Loss: 11.61360, Speed: 0.56 samples/sec, lr: 0.000063
+[llama3-1-8b-fsdp-hpto-pods-2] INFO: 10.1.144.242:37102 - "GET /status HTTP/1.1" 200 OK
+```
+
+We can also see our job submitted by the `hyperpod-ns-training-team` team with the `training-priority` class on the HyperPod console, requesting 8 GPUs:
+
+ +Task Governance Console + +
+
+Another way to check the status of your job is through SageMaker Managed MLFlow. As configured in [Section 4](#4-set-up-sagemaker-managed-mlflow), you can access the MLFlow UI to monitor training metrics and job progress.
+
+Here is an example of what it may look like in the MLFlow UI:
+
+ +MLFlow Metrics + +
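+
+If you need a link to the MLFlow UI, you can generate a presigned URL for the tracking server created in Section 4. The tracking server name below matches the one referenced in the Cleanup section; adjust it if your MLFlow scripts used a different name:
+
+```bash
+aws sagemaker create-presigned-mlflow-tracking-server-url \
+  --tracking-server-name hyperpod-ts-demo \
+  --region $AWS_REGION
+```
+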
+ +### 7.2 Resiliency Demonstration + +At this point while the training is happening, let's emulate a GPU error where the node will have to get rebooted. To do this, we can run: +``` bash +./training/node-error.sh +``` + +When we run this, then check the health of our nodes, we will see one of our GPU nodes in an unhealthy state. Check node status with this command: +``` bash +kgn +``` + +Output (look under `STATUS` and `NODE-HEALTH-STATUS`): +``` +root@deb706ea3971:/hero-demo-hyperpod# kgn + +kubectl get nodes -L node.kubernetes.io/instance-type -L sagemaker.amazonaws.com/node-health-status -L sagemaker.amazonaws.com/deep-health-check-status + +NAME STATUS ROLES AGE VERSION INSTANCE-TYPE NODE-HEALTH-STATUS DEEP-HEALTH-CHECK-STATUS +hyperpod-i-00032a01f65f15741 Ready 21d v1.32.3-eks-473151a ml.m5.12xlarge Schedulable +hyperpod-i-076a8fd1fb47273d2 NotReady 25h v1.32.3-eks-473151a ml.g5.8xlarge **UnschedulablePendingReboot** +hyperpod-i-08335947b01f32169 Ready 25h v1.32.3-eks-473151a ml.g5.8xlarge Schedulable +hyperpod-i-0881939d060d22431 Ready 8d v1.32.3-eks-473151a ml.g5.8xlarge Schedulable +hyperpod-i-0ab2d0831345d7b78 Ready 8d v1.32.3-eks-473151a ml.g5.8xlarge Schedulable +hyperpod-i-0ab6b32bad77da72a Ready 7d23h v1.32.3-eks-473151a ml.g5.8xlarge Schedulable +hyperpod-i-0c5cdf28e608bddd7 Ready 8d v1.32.3-eks-473151a ml.g5.8xlarge Schedulable +hyperpod-i-0dfe5b8b3c59343e4 Ready 8d v1.32.3-eks-473151a ml.g5.8xlarge Schedulable +hyperpod-i-0fc9b9f110fa4a55c Ready 24d v1.32.3-eks-473151a ml.m5.12xlarge Schedulable +``` + +If you look at your pods, you will also see the status being `Pending` by running this command: +``` bash +kgp -o wide -n hyperpod-ns-training-team +``` + +Output: +``` +root@deb706ea3971:/hero-demo-hyperpod# kgp -n hyperpod-ns-training-team +NAME READY STATUS RESTARTS AGE +llama3-1-8b-fsdp-hpto-pods-0 1/1 Running 0 10m +llama3-1-8b-fsdp-hpto-pods-1 1/1 Running 0 10m +llama3-1-8b-fsdp-hpto-pods-2 1/1 Running 0 10m +llama3-1-8b-fsdp-hpto-pods-3 1/1 Running 0 10m +llama3-1-8b-fsdp-hpto-pods-4 1/1 Running 0 10m +llama3-1-8b-fsdp-hpto-pods-5 1/1 Running 0 10m +llama3-1-8b-fsdp-hpto-pods-6 1/1 Running 0 10m +llama3-1-8b-fsdp-hpto-pods-7 0/1 Pending 0 11s +``` + + +Once the node is rebooted, the node and pod status will become healthy again, and the training will resume from the last saved checkpoint. + +When you describe the `HyperPodPytorchJob` you will see more details about the resiliency events: +``` bash +k describe hyperpodpytorchjob -n hyperpod-ns-training-team +``` + +Output: +``` +Events: + Type Reason Age From Message + ---- ------ ---- ---- ------- + Normal JobInitiallySuspended 19m hyperpod-pytorch-job-controller The job was suspended during it's creation. The job will resume when admitted by kueue. See the queue hyperpod-ns-training-team-localqueue for details + Normal CreatedWorkload 19m kueue Created Workload: hyperpod-ns-training-team/hyperpodpytorchjob-llama3-1-8b-fsdp-hpto-077cd + Normal Started 19m kueue Admitted by clusterQueue hyperpod-ns-training-team-clusterqueue + Normal Created 19m hyperpod-pytorch-job-controller Created by HyperPod Training Operator controller + Normal Running 16m hyperpod-pytorch-job-controller + Warning NodeFault 8m27s hyperpod-pytorch-job-controller Found unhealthy node hyperpod-i-076a8fd1fb47273d2 + Normal Running 6m45s hyperpod-pytorch-job-controller The fault of reason NodeFault was remediated in 102590 milliseconds. 
+```
+
+Watch the job resume from its last saved checkpoint by running:
+``` bash
+kubetail llama3 -n hyperpod-ns-training-team
+```
+
+
+Another useful feature of the `aws-do-hyperpod` container is that we can check EFA utilization for the training job with a single command:
+
+```
+export NODE=$(kubectl get nodes -L node.kubernetes.io/instance-type | awk '$NF ~ /ml\.g5\./ && NR>1 {print $1}' | shuf -n 1)
+eu $NODE
+```
+
+ +EFA Utilization + +
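+
+If you are not running inside the `aws-do-hyperpod` container, you can at least confirm that the selected node exposes EFA interfaces to pods (this uses the standard EFA device plugin resource name and the `$NODE` variable exported above):
+
+```bash
+kubectl get node $NODE -o jsonpath='{.status.allocatable.vpc\.amazonaws\.com/efa}{"\n"}'
+```
+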
+ +Now that we submitted our training job, let's change hats to be the Inference Team... + + +--- + +## 8. Deploy Llama Model for Inference + +### 8.1 Deploy Model + +Once the model is uploaded, please have a look at our `deploy_s3_inference.yaml`: +``` bash +cat inference/jumpstart.yaml +``` + +> **πŸ“‹ Inference Deployment Configuration:** +> +> This YAML defines a HyperPod model deployment that: +> - Deploys the `huggingface-llm-mistral-7b-instruct-v3` model from JumpStart storage to the inference team's namespace +> - Integrates with task governance through the inference team's resource queue and priority class +> - Configures autoscaling, health checks, and load balancing for production inference workloads + +To deploy our model, run: +``` bash +./inference/submit-jumpstart.sh +``` + +### 8.2 Verify Deployment Status + +Check if the model successfully deployed: +```bash +kubectl describe jumpstartmodel -n hyperpod-ns-inference-team +``` + +Wait for the model to be fully loaded and ready: +```bash +# kubectl logs -f deployment/deepseek15b-autoscale -n default +kubectl logs -f deployment/mistral-jumpstart-autoscale -n hyperpod-ns-inference-team +``` + +Look for the message: + +``` +2025-10-13T22:11:52.564139Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32] +2025-10-13T22:11:52.564228Z INFO download: text_generation_launcher: Starting download process. +2025-10-13T22:11:55.578709Z INFO text_generation_launcher: Files are already present on the host. Skipping download. + +2025-10-13T22:11:56.268206Z INFO download: text_generation_launcher: Successfully downloaded weights. +2025-10-13T22:11:56.268439Z INFO shard-manager: text_generation_launcher: Starting shard rank=0 +2025-10-13T22:12:05.952613Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0 + +2025-10-13T22:12:05.978713Z INFO shard-manager: text_generation_launcher: Shard ready in 9.709293463s rank=0 +2025-10-13T22:12:06.076751Z INFO text_generation_launcher: Starting Webserver +2025-10-13T22:12:06.155020Z INFO text_generation_router: router/src/main.rs:289: Using config Some(Mistral) +2025-10-13T22:12:06.155065Z WARN text_generation_router: router/src/main.rs:298: no pipeline tag found for model /opt/ml/model +2025-10-13T22:12:06.158565Z INFO text_generation_router: router/src/main.rs:317: Warming up model +2025-10-13T22:12:08.759489Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32] + +2025-10-13T22:12:09.754597Z INFO text_generation_router: router/src/main.rs:354: Setting max batch total tokens to 43808 +2025-10-13T22:12:09.754618Z INFO text_generation_router: router/src/main.rs:355: Connected +2025-10-13T22:12:09.754621Z WARN text_generation_router: router/src/main.rs:369: Invalid hostname, defaulting to 0.0.0.0 +``` + +Check that the endpoint is successfully created. This will be created after your EC2 load balancer is successfully created. You can check this in the EC2 console. + +```bash +kubectl describe SageMakerEndpointRegistration -n hyperpod-ns-inference-team +``` +If it's created, you will see this output: +``` + Message: Endpoint has been successfully created. + Observed Generation: 1 + Reason: Success + Status: True + Type: CreationCompleted +``` + +Once this is created, our model is ready for deployment. + +At this point, we can invoke our model with the `invoke_jumpstart.py` Python file that will ask our model the question: "Hi, is water wet?" + +```bash +python3 inference/invoke_jumpstart.py +``` + +--- + +## 9. 
Inference Pod-Level Autoscaling Demo + +Now for the pod-level autoscaling demonstration, our `monitor_autoscaling.sh` script provides an intuitive view to see the pods scale for the invocations. + +Our `load_test_jumpstart.py` generates realistic inference load - 10 requests per second for 10 minutes with 15 concurrent workers. Each request asks the model to explain machine learning concepts, generating substantial CloudWatch metrics. This script will generate a total of 4 inference pods. + +BUT WAIT, here's the critical part - the inference team has 4 ml.g5.8xlarge instances allocated, but the training team is currently using 8 instances total, borrowing 2 from the inference team's quota. + +When inference needs to scale up, task governance kicks in with preemption. Since inference was guaranteed this compute, the system will preempt the training pods to free up resources for the inference scaling. You'll see some training pods get evicted and go into Pending state as the inference pods claim those nodes. + + +This deployment includes: +- **Min replicas**: 1 (always have at least one pod running) +- **Max replicas**: 4 (can scale up to 4 pods during high traffic) +- **CloudWatch trigger**: Scales based on SageMaker endpoint invocations +- **Target value**: 5 invocations per pod (when exceeded, triggers scale-up) + + +### 9.1 Monitor Autoscaling Setup + +In a separate terminal, start monitoring the autoscaling activity: +```bash +./inference/monitor_autoscaling.sh +``` + +This will show real-time updates of: +- Deployment replica count +- Pod status and distribution +- HPA (Horizontal Pod Autoscaler) metrics +- KEDA ScaledObject status +- Recent Kubernetes events + + +### 9.2 Load Testing and Autoscaling + +Now let's generate load to trigger autoscaling. The load test will send concurrent requests to create CloudWatch metrics that trigger the autoscaler. + +**Start the load test:** +```bash +python3 inference/load_test.py \ + --endpoint mistral-autoscale-endpoint \ + --requests 200 \ + --rps 10 \ + --duration 10 \ + --workers 15 +``` + +At this point, we will see our training job is suspended in the HyperPod Task Governance console: + +
+ +Task Governance Console + +
+ +If we were to describe the training workload using this command: +``` bash +kubectl describe workload -n hyperpod-ns-training-team +``` + +You will see these events: +``` + Normal Preempted 4m47s kueue-admission Preempted to accommodate a workload (UID: c3758835-94d0-42ee-a2c5-789eba7ce8a2, JobUID: d7e5a7c5-0d47-42ef-a0a2-862bd2d8ee00) due to Fair Sharing within the cohort + Warning Pending 4m46s kueue-admission couldn't assign flavors to pod set llama3-2-1b-fsdp-hpto-pods-podset: topology "hyperpod-default" doesn't allow to fit any of 8 pod(s) + Warning Pending 4m36s kueue-admission Workload no longer fits after processing another workload + Warning Pending 4m36s (x3 over 4m36s) kueue-admission couldn't assign flavors to pod set llama3-2-1b-fsdp-hpto-pods-podset: insufficient quota for nvidia.com/gpu in flavor ml.m5.12xlarge, request > maximum capacity (8 > 0), insufficient unused quota for nvidia.com/gpu in flavor ml.g5.8xlarge, 1 more needed + Warning Pending 16s (x11 over 4m16s) kueue-admission couldn't assign flavors to pod set llama3-2-1b-fsdp-hpto-pods-podset: insufficient quota for nvidia.com/gpu in flavor ml.m5.12xlarge, request > maximum capacity (8 > 0), insufficient unused quota for nvidia.com/gpu in flavor ml.g5.8xlarge, 2 more needed +``` + +**Load test parameters:** +- `--endpoint`: The SageMaker endpoint name (note the `-s3` suffix) +- `--requests`: Total number of requests to send +- `--rps`: Requests per second rate +- `--duration`: How long to run the test (minutes) +- `--workers`: Maximum concurrent threads + +The observability terminal should eventually show something similar to: + +``` +🎯 Monitoring mistral-jumpstart-autoscale autoscaling (Press Ctrl+C to stop) +Time | Pods | Status +---------|------|-------------------------------------------------- +Time | Pods | Status +---------|------|-------------------------------------------------- +23:59:54 | 1 | πŸ”΅ Stable: 1/1 ready +00:00:11 | 1 | πŸ”΅ Stable: 1/1 ready +00:00:19 | 1 | πŸ”΅ Stable: 1/1 ready +00:00:28 | 1 | πŸ”΅ Stable: 1/1 ready +00:00:36 | 1 | πŸ”΅ Stable: 1/1 ready +00:02:40 | 1 | πŸ”΅ Stable: 1/1 ready +00:02:49 | 1 | πŸ”΅ Stable: 1/1 ready +00:02:57 | 1 | πŸ”΅ Stable: 1/1 ready +00:03:05 | 1 | πŸ”΅ Stable: 1/1 ready +00:03:14 | 1 | πŸ”΅ Stable: 1/1 ready +00:03:22 | 1 | 🟑 Scaling: 1/4 ready +00:03:31 | 1 | 🟑 Scaling: 1/4 ready +00:05:54 | 1 | 🟑 Scaling: 1/4 ready +00:06:02 | 1 | 🟑 Scaling: 1/4 ready +00:06:19 | 1 | 🟑 Scaling: 1/4 ready +00:06:27 | 4 | 🟒 Scaled up: 4/4 ready +00:06:35 | 4 | 🟒 Scaled up: 4/4 ready +00:06:43 | 4 | 🟒 Scaled up: 4/4 ready +00:06:52 | 4 | 🟒 Scaled up: 4/4 ready +``` + +After the load test completes and the 5-minute cooldown period passes, inference scales back down to 1 pod, freeing up those resources. Task governance then allows the preempted training pods to reschedule and resume from their last checkpoint. 
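+
+You can watch both sides of this hand-off from a terminal (optional):
+
+```bash
+# Inference pods scaling back down to the minimum of 1
+kubectl get pods -n hyperpod-ns-inference-team -w
+
+# Preempted training workload being re-admitted and its pods rescheduling
+kubectl get workloads -n hyperpod-ns-training-team
+kubectl get pods -n hyperpod-ns-training-team -w
+```
+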
+ + +#### Autoscaling Configuration Details + +The autoscaling behavior is controlled by these key settings in `deploy_s3_inference.yaml`: + +```yaml +autoScalingSpec: + minReplicaCount: 1 # Always keep 1 pod running + maxReplicaCount: 4 # Scale up to 4 pods maximum + pollingInterval: 30 # Check metrics every 30 seconds + scaleUpStabilizationTime: 0 # Scale up immediately when needed + scaleDownStabilizationTime: 300 # Wait 5 minutes before scaling down + cloudWatchTrigger: + targetValue: 5.0 # Target 5 invocations per pod + metricName: "Invocations" # SageMaker endpoint invocations + metricStat: "Sum" # Sum of invocations over period + metricCollectionPeriod: 60 # 1-minute collection window +``` + +**Minutes 0-2**: Load test starts, single pod handles initial requests + +**Minutes 2-4**: CloudWatch metrics accumulate, showing >5 invocations per pod + +**Minutes 4-6**: HPA triggers scale-up, new pods are created + +**Minutes 6-8**: Multiple pods serve traffic, load is distributed + +**Minutes 8-12**: Load test completes, invocation rate drops + +**Minutes 12-17**: Cooldown period prevents immediate scale-down + +**Minutes 17+**: Pods scale back down to minimum (1 pod) + +### 7.3 Karpenter Autoscaling +#### Setup + +Now, let’s say another team came in and wants to deploy their inference job using a deepseek model that the training team customized and saved it in S3... but there’s no compute! + +With HyperPod, if set by your IT admin, we can use its infrastructure-level autoscaling capabilities to scale out and in based on demand. Let’s demonstrate this. + +To do this, let's first place our model in our S3 bucket ([we can also deploy from FSx or Jumpstart](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)) + +``` bash +aws s3 sync s3://jumpstart-cache-prod-us-east-2/deepseek-llm/deepseek-llm-r1-distill-qwen-1-5b/artifacts/inference-prepack/v2.0.0 s3://${S3_BUCKET_NAME}/deepseek15b +``` + +#### Deploy DeepSeek Model + +To deploy our DeepSeek model, let's run: +``` +./inference/submit-deepseek.sh +``` + +Let's check our pod: +``` +kubectl get pods +``` + +#### Verify Deployment Status + +Check if the model successfully deployed: +```bash +kubectl describe InferenceEndpointConfig deepseek15b +``` + +Check that the endpoint is successfully created: + +```bash +kubectl describe SageMakerEndpointRegistration +``` +If it's created, you will see this output: +``` + Message: Endpoint has been successfully created. + Observed Generation: 1 + Reason: Success + Status: True + Type: CreationCompleted +``` + +Wait for the model to be fully loaded and ready: +```bash +kubectl logs -f deployment/deepseek15b-autoscale +``` + +Look for the message: + +``` +INFO PyProcess Model [model] initialized. +INFO WorkerThread Starting worker thread WT-0001 for model model (M-0001, READY) on device gpu(0) +INFO ModelServer Initialize BOTH server with: EpollServerSocketChannel. +INFO ModelServer BOTH API bind to: http://0.0.0.0:8080 +``` + +To test and invoke our model, let's run the following: + +```bash +python3 inference/invoke_deepseek.py +``` + +To start Karpenter autoscaling (cluster-level), please run the following command: + +```bash +python3 inference/load_test.py \ + --endpoint deepseek15b-s3 \ + --requests 200 \ + --rps 10 \ + --duration 10 \ + --workers 15 +``` + + + +--- + +## Observability + +**Cluster Dashboard** +
+ +Cluster Observability + +
+ + +**Tasks Dashboard** +
+ +Task Observability + +
+ + +**Training Dashboard** +
+ +Training Observability + +
+ +--- + +## Cleanup + +When you're done with the demo: + +### Delete Inference Deployments + +**Delete JumpStart inference deployment:** +```bash +kubectl delete jumpstartmodel -n hyperpod-ns-inference-team --all +``` + +**Delete DeepSeek inference deployment (if created):** +```bash +kubectl delete inferenceendpointconfig deepseek15b-autoscale -n default +``` + +**Delete SageMaker endpoint registrations:** +```bash +kubectl delete SageMakerEndpointRegistration -n hyperpod-ns-inference-team --all +kubectl delete SageMakerEndpointRegistration -n default --all +``` + +### Delete Training Resources + +**Delete training job:** +```bash +kubectl delete hyperpodpytorchjob llama3-1-8b-fsdp-hpto -n hyperpod-ns-training-team +``` + +**Delete all workloads:** +```bash +kubectl delete workloads -A --all +``` + +**Delete PVC and related resources:** +```bash +kubectl delete pvc fsx-claim -n hyperpod-ns-training-team +kubectl delete pv fsx-pv-hyperpod-ns-training-team +kubectl delete sc fsx-sc-hyperpod-ns-training-team +``` + +### Delete Autoscaling Resources + +```bash +kubectl delete nodepool sample-np +kubectl delete hyperpodnodeclass sample-nc +``` + +### Delete Task Governance Resources + +**Delete compute quotas:** +```bash +aws sagemaker delete-compute-quota \ + --compute-quota-id $(aws sagemaker list-compute-quotas --cluster-arn $HYPERPOD_CLUSTER_ARN --query 'ComputeQuotaSummaries[?ComputeQuotaName==`Training-Team-Quota`].ComputeQuotaId' --output text) + +aws sagemaker delete-compute-quota \ + --compute-quota-id $(aws sagemaker list-compute-quotas --cluster-arn $HYPERPOD_CLUSTER_ARN --query 'ComputeQuotaSummaries[?ComputeQuotaName==`Inference-Team-Quota`].ComputeQuotaId' --output text) +``` + +**Delete cluster scheduler config:** +```bash +aws sagemaker delete-cluster-scheduler-config \ + --cluster-scheduler-config-id $(aws sagemaker list-cluster-scheduler-configs --query 'ClusterSchedulerConfigSummaries[?ClusterSchedulerConfigName==`example-cluster-scheduler-config`].ClusterSchedulerConfigId' --output text) +``` + +**Delete task governance add-on:** +```bash +aws eks delete-addon --region $AWS_REGION --cluster-name $EKS_CLUSTER_NAME --addon-name amazon-sagemaker-hyperpod-taskgovernance +``` + +### Delete MLFlow Resources + +**Delete MLFlow tracking server:** +```bash +aws sagemaker delete-mlflow-tracking-server \ + --tracking-server-name hyperpod-ts-demo \ + --region $AWS_REGION +``` + +**Delete MLFlow IAM resources:** +```bash +# Delete service account +eksctl delete iamserviceaccount \ + --name sagemaker-mlflow-sa \ + --namespace hyperpod-ns-training-team \ + --cluster $EKS_CLUSTER_NAME \ + --region $AWS_REGION + +# Delete IAM policy +aws iam delete-policy \ + --policy-arn $(aws iam list-policies --query 'Policies[?PolicyName==`SageMakerMlFlowAccessPolicy`].Arn' --output text) +``` + +### Delete Docker Image (Optional) + +**Delete ECR image:** +```bash +aws ecr batch-delete-image \ + --repository-name hyperpod-training \ + --image-ids imageTag=latest \ + --region $AWS_REGION +``` + +### Stop Container (Optional) + +**Stop and remove the aws-do-hyperpod container:** +```bash +docker stop do-hyperpod-use1 +docker rm do-hyperpod-use1 +``` diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/inflate.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/inflate.yaml new file mode 100644 index 0000000..9693ca7 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/inflate.yaml @@ -0,0 +1,31 @@ +apiVersion: apps/v1 +kind: Deployment 
+metadata: + name: inflate + namespace: hyperpod-ns-inference-team + labels: + kueue.x-k8s.io/queue-name: hyperpod-ns-inference-team-localqueue + kueue.x-k8s.io/priority-class: inference-priority +spec: + selector: + matchLabels: + app: inflate + template: + metadata: + labels: + app: inflate + spec: + securityContext: + runAsUser: 1000 + runAsGroup: 3000 + fsGroup: 2000 + containers: + - image: public.ecr.aws/eks-distro/kubernetes/pause:3.2 + name: inflate + resources: + requests: + nvidia.com/gpu: 1 + limits: + nvidia.com/gpu: 1 + securityContext: + allowPrivilegeEscalation: false diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/nodeclass.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/nodeclass.yaml new file mode 100644 index 0000000..8958ea3 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/nodeclass.yaml @@ -0,0 +1,11 @@ +apiVersion: karpenter.sagemaker.amazonaws.com/v1 +kind: HyperpodNodeClass +metadata: + name: sample-nc +spec: + instanceGroups: + # name of InstanceGroup in HyperPod cluster. InstanceGroup needs to pre-created + # MaxItems: 10 + - scaling-group + - autoscaling-group + - inf-group \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/nodepool.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/nodepool.yaml new file mode 100644 index 0000000..4791fc5 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/autoscaling/nodepool.yaml @@ -0,0 +1,32 @@ +apiVersion: karpenter.sh/v1 +kind: NodePool +metadata: + name: sample-np +spec: + # Resource limits constrain the total size of the pool + limits: + nvidia.com/gpu: 2 + + # Disruption settings for node lifecycle management + disruption: + budgets: + - nodes: 30% + consolidateAfter: 0s + consolidationPolicy: WhenEmpty + + template: + spec: + # Node expiration - Never means nodes won't expire automatically + expireAfter: Never + + # Reference to the HyperpodNodeClass + nodeClassRef: + group: karpenter.sagemaker.amazonaws.com + kind: HyperpodNodeClass + name: sample-nc + + # Requirements for node selection + requirements: + # Let Karpenter choose from available instance types in the instance groups + - key: node.kubernetes.io/instance-type + operator: Exists \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/create_config.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/create_config.sh new file mode 100755 index 0000000..7c7593b --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/create_config.sh @@ -0,0 +1,281 @@ +#!/bin/bash + +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# Permission is hereby granted, free of charge, to any person obtaining a copy of this +# software and associated documentation files (the "Software"), to deal in the Software +# without restriction, including without limitation the rights to use, copy, modify, +# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to +# permit persons to whom the Software is furnished to do so. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A +# PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT +# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE +# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +# : "${STACK_ID:-hyperpod-eks-full-stack}" + +# Clear previously set env_vars +> env_vars + +# Define AWS Region +if [ -z ${AWS_REGION} ]; then + echo "[WARNING] AWS_REGION environment variable is not set, automatically set depending on aws cli default region." + export AWS_REGION=$(aws configure get region) +fi +echo "export AWS_REGION=${AWS_REGION}" >> env_vars +echo "[INFO] AWS_REGION = ${AWS_REGION}" + +# Retrieve EKS CLUSTER Name if not already defined +if [[ -z "${EKS_CLUSTER_NAME}" ]]; then + # Only retrieve from CloudFormation if not already set + export EKS_CLUSTER_NAME=`aws cloudformation describe-stacks \ + --stack-name $STACK_ID \ + --query 'Stacks[0].Outputs[?OutputKey==\`OutputEKSClusterName\`].OutputValue' \ + --region ${AWS_REGION} \ + --output text` + + if [[ ! -z $EKS_CLUSTER_NAME ]]; then + echo "export EKS_CLUSTER_NAME=${EKS_CLUSTER_NAME}" >> env_vars + echo "[INFO] EKS_CLUSTER_NAME = ${EKS_CLUSTER_NAME}" + else + echo "[ERROR] failed to retrieve EKS_CLUSTER_NAME" + return 1 + fi +else + echo "[INFO] Using existing EKS_CLUSTER_NAME = ${EKS_CLUSTER_NAME}" + echo "export EKS_CLUSTER_NAME=${EKS_CLUSTER_NAME}" >> env_vars +fi + +# Retrieve EKS CLUSTER ARN +# Check if EKS_CLUSTER_ARN is already set and not empty +if [[ -z "${EKS_CLUSTER_ARN}" ]]; then + # First attempt: retrieve from CloudFormation + export EKS_CLUSTER_ARN=`aws cloudformation describe-stacks \ + --stack-name $STACK_ID \ + --query 'Stacks[0].Outputs[?OutputKey==\`OutputEKSClusterArn\`].OutputValue' \ + --region ${AWS_REGION} \ + --output text` + + # Second attempt: verify cluster exists and get ARN + if [[ -z "${EKS_CLUSTER_ARN}" && ! -z "${EKS_CLUSTER_NAME}" ]]; then + # Verify cluster exists + if aws eks describe-cluster --name ${EKS_CLUSTER_NAME} --region ${AWS_REGION} &>/dev/null; then + export EKS_CLUSTER_ARN=`aws eks describe-cluster \ + --name ${EKS_CLUSTER_NAME} \ + --query 'cluster.arn' \ + --region ${AWS_REGION} \ + --output text` + echo "[INFO] Retrieved EKS_CLUSTER_ARN from existing cluster" + else + echo "[ERROR] EKS cluster ${EKS_CLUSTER_NAME} does not exist in region ${AWS_REGION}" + return 1 + fi + fi + + if [[ ! -z "${EKS_CLUSTER_ARN}" ]]; then + echo "export EKS_CLUSTER_ARN=${EKS_CLUSTER_ARN}" >> env_vars + echo "[INFO] EKS_CLUSTER_ARN = ${EKS_CLUSTER_ARN}" + else + echo "[ERROR] failed to retrieve EKS_CLUSTER_ARN" + return 1 + fi +else + echo "[INFO] Using existing EKS_CLUSTER_ARN = ${EKS_CLUSTER_ARN}" +fi + +# Check if S3_BUCKET_NAME is already set and not empty +if [[ -z "${S3_BUCKET_NAME}" ]]; then + # Retrieve S3 Bucket Name + export S3_BUCKET_NAME=`aws cloudformation describe-stacks \ + --stack-name $STACK_ID \ + --query 'Stacks[0].Outputs[?OutputKey==\`OutputS3BucketName\`].OutputValue' \ + --region ${AWS_REGION} \ + --output text` + + if [[ ! 
-z $S3_BUCKET_NAME ]]; then + echo "export S3_BUCKET_NAME=${S3_BUCKET_NAME}" >> env_vars + echo "[INFO] S3_BUCKET_NAME = ${S3_BUCKET_NAME}" + else + echo "[ERROR] failed to retrieve S3_BUCKET_NAME" + return 1 + fi +else + echo "[INFO] Using existing S3_BUCKET_NAME = ${S3_BUCKET_NAME}" + echo "export S3_BUCKET_NAME=${S3_BUCKET_NAME}" >> env_vars +fi + +# Check if EXECUTION_ROLE is already set and not empty +if [[ -z "${EXECUTION_ROLE}" ]]; then + # Retrieve SageMaker Execution Role + export EXECUTION_ROLE=`aws cloudformation describe-stacks \ + --stack-name $STACK_ID \ + --query 'Stacks[0].Outputs[?OutputKey==\`OutputSageMakerIAMRoleArn\`].OutputValue' \ + --region ${AWS_REGION} \ + --output text` + + if [[ ! -z $EXECUTION_ROLE ]]; then + echo "export EXECUTION_ROLE=${EXECUTION_ROLE}" >> env_vars + echo "[INFO] EXECUTION_ROLE = ${EXECUTION_ROLE}" + else + echo "[ERROR] failed to retrieve EXECUTION_ROLE" + return 1 + fi +else + echo "[INFO] Using existing EXECUTION_ROLE = ${EXECUTION_ROLE}" + echo "export EXECUTION_ROLE=${EXECUTION_ROLE}" >> env_vars +fi + +# Check if VPC_ID is already set and not empty +if [[ -z "${VPC_ID}" ]]; then + # Only retrieve from CloudFormation if not already set + export VPC_ID=`aws cloudformation describe-stacks \ + --stack-name $STACK_ID \ + --query 'Stacks[0].Outputs[?OutputKey==\`OutputVpcId\`].OutputValue' \ + --region ${AWS_REGION} \ + --output text` + + if [[ ! -z $VPC_ID ]]; then + echo "export VPC_ID=${VPC_ID}" >> env_vars + echo "[INFO] VPC_ID = ${VPC_ID}" + else + echo "[ERROR] failed to retrieve VPC_ID" + return 1 + fi +else + echo "[INFO] Using existing VPC_ID = ${VPC_ID}" + echo "export VPC_ID=${VPC_ID}" >> env_vars +fi + +# Check if PRIVATE_SUBNET_ID is already set and not empty +if [[ -z "${PRIVATE_SUBNET_ID}" ]]; then + # Only retrieve from CloudFormation if not already set + export PRIVATE_SUBNET_ID=`aws cloudformation describe-stacks \ + --stack-name $STACK_ID \ + --query 'Stacks[0].Outputs[?OutputKey==\`OutputPrivateSubnetIds\`].OutputValue' \ + --region ${AWS_REGION} \ + --output text` + + if [[ ! -z $PRIVATE_SUBNET_ID ]]; then + echo "export PRIVATE_SUBNET_ID=${PRIVATE_SUBNET_ID}" >> env_vars + echo "[INFO] PRIVATE_SUBNET_ID = ${PRIVATE_SUBNET_ID}" + else + echo "[ERROR] failed to retrieve PRIVATE_SUBNET_ID" + return 1 + fi +else + echo "[INFO] Using existing PRIVATE_SUBNET_ID = ${PRIVATE_SUBNET_ID}" + echo "export PRIVATE_SUBNET_ID=${PRIVATE_SUBNET_ID}" >> env_vars +fi + +# Check if SECURITY_GROUP_ID is already set and not empty +if [[ -z "${SECURITY_GROUP_ID}" ]]; then + # Only retrieve from CloudFormation if not already set + export SECURITY_GROUP_ID=`aws cloudformation describe-stacks \ + --stack-name $STACK_ID \ + --query 'Stacks[0].Outputs[?OutputKey==\`OutputSecurityGroupId\`].OutputValue' \ + --region ${AWS_REGION} \ + --output text` + + if [[ ! -z $SECURITY_GROUP_ID ]]; then + echo "export SECURITY_GROUP_ID=${SECURITY_GROUP_ID}" >> env_vars + echo "[INFO] SECURITY_GROUP_ID = ${SECURITY_GROUP_ID}" + else + echo "[ERROR] failed to retrieve SECURITY_GROUP_ID" + return 1 + fi +else + echo "[INFO] Using existing SECURITY_GROUP_ID = ${SECURITY_GROUP_ID}" + echo "export SECURITY_GROUP_ID=${SECURITY_GROUP_ID}" >> env_vars +fi + + +# Define accelerated compute instance type. +if [ -z ${ACCEL_INSTANCE_TYPE} ]; then + echo "[WARNING] ACCEL_INSTANCE_TYPE environment variable is not set, automatically set to ml.g5.12xlarge." 
+ export ACCEL_INSTANCE_TYPE=ml.g5.12xlarge +fi +echo "export ACCEL_INSTANCE_TYPE=${ACCEL_INSTANCE_TYPE}" >> env_vars +echo "[INFO] ACCEL_INSTANCE_TYPE = ${ACCEL_INSTANCE_TYPE}" + +# Set number of accelerated compute nodes to deploy +if [ -z ${ACCEL_INSTANCE_COUNT} ]; then + echo "[WARNING] ACCEL_INSTANCE_COUNT environment variable is not set, automatically set to 1." + export ACCEL_INSTANCE_COUNT=1 +fi +echo "export ACCEL_INSTANCE_COUNT=${ACCEL_INSTANCE_COUNT}" >> env_vars +echo "[INFO] ACCEL_INSTANCE_COUNT = ${ACCEL_INSTANCE_COUNT}" + +# Set the EBS Volume size for the accelerated compute nodes +if [ -z ${ACCEL_VOLUME_SIZE} ]; then + echo "[WARNING] ACCEL_VOLUME_SIZE environment variable is not set, automatically set to 500." + export ACCEL_VOLUME_SIZE=500 +fi +echo "export ACCEL_VOLUME_SIZE=${ACCEL_VOLUME_SIZE}" >> env_vars +echo "[INFO] ACCEL_VOLUME_SIZE = ${ACCEL_VOLUME_SIZE}" + +# Define general purpose compute instance type. +if [ -z ${GEN_INSTANCE_TYPE} ]; then + echo "[WARNING] GEN_INSTANCE_TYPE environment variable is not set, automatically set to ml.m5.2xlarge." + export GEN_INSTANCE_TYPE=ml.m5.2xlarge +fi +echo "export GEN_INSTANCE_TYPE=${GEN_INSTANCE_TYPE}" >> env_vars +echo "[INFO] GEN_INSTANCE_TYPE = ${GEN_INSTANCE_TYPE}" + +# Set the number of general purpose nodes to deploy +if [ -z ${GEN_INSTANCE_COUNT} ]; then + echo "[WARNING] GEN_INSTANCE_COUNT environment variable is not set, automatically set to 1." + export GEN_INSTANCE_COUNT=1 +fi +echo "export GEN_INSTANCE_COUNT=${GEN_INSTANCE_COUNT}" >> env_vars +echo "[INFO] GEN_INSTANCE_COUNT = ${GEN_INSTANCE_COUNT}" + +# Set the EBS Volume size for the general purpose compute nodes +if [ -z ${GEN_VOLUME_SIZE} ]; then + echo "[WARNING] GEN_VOLUME_SIZE environment variable is not set, automatically set to 500." + export GEN_VOLUME_SIZE=500 +fi +echo "export GEN_VOLUME_SIZE=${GEN_VOLUME_SIZE}" >> env_vars +echo "[INFO] GEN_VOLUME_SIZE = ${GEN_VOLUME_SIZE}" + +# Set auto-recovery +if [ -z ${NODE_RECOVERY} ]; then + echo "[WARNING] NODE_RECOVERY environment variable is not set, set to Automatic." + export NODE_RECOVERY="Automatic" +fi +echo "export NODE_RECOVERY=${NODE_RECOVERY}" >> env_vars +echo "[INFO] NODE_RECOVERY = ${NODE_RECOVERY}" + +# Set network flag for Docker if in SageMaker Code Editor +if [ "${SAGEMAKER_APP_TYPE:-}" = "CodeEditor" ]; then + echo "export DOCKER_NETWORK=\"--network sagemaker\"" >> env_vars +fi + +# Get absolute path of env_vars file +ENV_VARS_PATH="$(realpath "$(dirname "$0")/env_vars")" + +# Persist the environment variables +add_source_command() { + local config_file="$1" + local source_line="[ -f \"${ENV_VARS_PATH}\" ] && source \"${ENV_VARS_PATH}\"" + + # Only add if the line doesn't exist already + if ! 
grep -q "source.*${ENV_VARS_PATH}" "$config_file"; then + echo "$source_line" >> "$config_file" + echo "[INFO] Added environment variables to $config_file" + else + echo "[INFO] Environment variables already configured in $config_file" + fi +} + +# Check shell config files +if [ -f ~/.bashrc ]; then + add_source_command ~/.bashrc +fi + +if [ -f ~/.zshrc ]; then + add_source_command ~/.zshrc +fi \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/env_vars b/Container-Root/hyperpod/deployment/eks/demo/hero/env_vars new file mode 100644 index 0000000..4ecfa04 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/env_vars @@ -0,0 +1,28 @@ +export AWS_REGION=us-east-1 +export EKS_CLUSTER_NAME=sagemaker-hero-cluster-9e7915eb-eks +export EKS_CLUSTER_ARN=arn:aws:eks:us-east-1:011528295005:cluster/sagemaker-hero-cluster-9e7915eb-eks +export S3_BUCKET_NAME=sagemaker-hero-cluster-9e7915eb-bucket +export EXECUTION_ROLE=arn:aws:iam::011528295005:role/sagemaker-hero-cluster-9e7915ebExecRole +export HYPERPOD_CLUSTER_ARN=arn:aws:sagemaker:us-east-1:011528295005:cluster/91vjgpc0or9z +export GEN_INSTANCE_COUNT=2 +export GEN_INSTANCE_TYPE="ml.m5.12xlarge" +export FSX_ID=fs-06df629436872672f # If you leave this blank, it will pick the first fsx in your account + +# Container env vars +export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') +export ACCOUNT=$(aws sts get-caller-identity --query Account --output text) +export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/ +export IMAGE=fsdp +export TAG=pytorch2.7.1 + + +# Training Envs +export GPU_PER_NODE=1 +export ACCEL_INSTANCE_TYPE="ml.g5.8xlarge" +export ACCEL_INSTANCE_COUNT=8 +export EFA_PER_NODE=1 +export HF_TOKEN= + +# Inference Envs +export NS_INF="hyperpod-ns-inference-team" +export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3" diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/exec.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/exec.sh new file mode 100755 index 0000000..783d1f1 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/exec.sh @@ -0,0 +1,3 @@ +#!/bin/bash +docker exec -it do-hyperpod-use1 bash + diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/fsx.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/fsx.yaml new file mode 100644 index 0000000..20f879b --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/fsx.yaml @@ -0,0 +1,19 @@ +apiVersion: v1 +kind: Pod +metadata: + name: fsx-pod +spec: + volumes: + - name: fsx-storage + persistentVolumeClaim: + claimName: fsx-claim + containers: + - name: cleanup-container + image: ubuntu:latest + command: ["/bin/bash"] + args: ["-c", "sleep infinity"] # Keeps the container running so you can exec into it + volumeMounts: + - name: fsx-storage + mountPath: /fsx + restartPolicy: Never + diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/hpcli-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/hpcli-fsdp.yaml new file mode 100644 index 0000000..f6ae96c --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/hpcli-fsdp.yaml @@ -0,0 +1,77 @@ +defaults: + - override hydra/job_logging: stdout + +hydra: + run: + dir: . 
+ output_subdir: null + +training_cfg: + entry_script: /fsdp/train.py + script_args: + - --max_context_width: 4096 + - --num_key_value_heads: 32 + - --intermediate_size: 11008 + - --hidden_width: 4096 + - --num_layers: 32 + - --num_heads: 32 + - --model_type: llama_v2 + - --tokenizer: hf-internal-testing/llama-tokenizer + - --checkpoint_freq: 5000 + - --validation_freq: 500 + - --max_steps: 5000 + - --checkpoint_dir: /checkpoints + - --dataset: allenai/c4 + - --dataset_config_name: en + - --resume_from_checkpoint: /checkpoints + - --train_batch_size: 1 + - --val_batch_size: 1 + - --sharding_strategy: full + - --offload_activation: 1 + + run: + name: fsdp + nodes: + ntasks_per_node: 1 +cluster: + cluster_type: k8s + instance_type: + cluster_config: + service_account_name: null + + volumes: + - volumeName: local + hostPath: "/mnt/k8s-disks/0" + mountPath: "/local" + + namespace: kubeflow + label_selector: + required: + sagemaker.amazonaws.com/node-health-status: + - Schedulable + preferred: + sagemaker.amazonaws.com/deep-health-check-status: + - Passed + weights: + - 100 + pullPolicy: Always + restartPolicy: OnFailure + + annotations: + sagemaker.amazonaws.com/enable-job-auto-resume: True + sagemaker.amazonaws.com/job-max-retry-count: 10 + +base_results_dir: ./result +container: + +env_vars: + LOGLEVEL: DEBUG + TORCH_DISTRIBUTED_DEBUG: DETAIL + TORCH_NCCL_ENABLE_MONITORING: 1 + TORCH_NCCL_TRACE_BUFFER_SIZE: 20000 + TORCH_NCCL_DUMP_ON_TIMEOUT: 1 + TORCH_NCCL_DEBUG_INFO_TEMP_FILE: /local/nccl_trace_rank_ + PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True" + NCCL_DEBUG: INFO + NCCL_SOCKET_IFNAME: ^lo + TORCH_NCCL_ASYNC_ERROR_HANDLING: 1 diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/img/eu.png b/Container-Root/hyperpod/deployment/eks/demo/hero/img/eu.png new file mode 100644 index 0000000..3e1cf88 Binary files /dev/null and b/Container-Root/hyperpod/deployment/eks/demo/hero/img/eu.png differ diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/img/gpu-metrics.png b/Container-Root/hyperpod/deployment/eks/demo/hero/img/gpu-metrics.png new file mode 100644 index 0000000..420809f Binary files /dev/null and b/Container-Root/hyperpod/deployment/eks/demo/hero/img/gpu-metrics.png differ diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/img/mlflow-metrics.png b/Container-Root/hyperpod/deployment/eks/demo/hero/img/mlflow-metrics.png new file mode 100644 index 0000000..9b35cac Binary files /dev/null and b/Container-Root/hyperpod/deployment/eks/demo/hero/img/mlflow-metrics.png differ diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/img/observability-tasks.png b/Container-Root/hyperpod/deployment/eks/demo/hero/img/observability-tasks.png new file mode 100644 index 0000000..fbff652 Binary files /dev/null and b/Container-Root/hyperpod/deployment/eks/demo/hero/img/observability-tasks.png differ diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/img/observability-training.png b/Container-Root/hyperpod/deployment/eks/demo/hero/img/observability-training.png new file mode 100644 index 0000000..935247e Binary files /dev/null and b/Container-Root/hyperpod/deployment/eks/demo/hero/img/observability-training.png differ diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/img/suspended.png b/Container-Root/hyperpod/deployment/eks/demo/hero/img/suspended.png new file mode 100644 index 0000000..99032c9 Binary files /dev/null and b/Container-Root/hyperpod/deployment/eks/demo/hero/img/suspended.png differ diff --git 
a/Container-Root/hyperpod/deployment/eks/demo/hero/img/task-gov.png b/Container-Root/hyperpod/deployment/eks/demo/hero/img/task-gov.png new file mode 100644 index 0000000..16938e9 Binary files /dev/null and b/Container-Root/hyperpod/deployment/eks/demo/hero/img/task-gov.png differ diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/img/training-task.png b/Container-Root/hyperpod/deployment/eks/demo/hero/img/training-task.png new file mode 100644 index 0000000..8b86272 Binary files /dev/null and b/Container-Root/hyperpod/deployment/eks/demo/hero/img/training-task.png differ diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/deploy_s3_inference.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/deploy_s3_inference.yaml new file mode 100644 index 0000000..80f2fab --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/deploy_s3_inference.yaml @@ -0,0 +1,87 @@ +--- +apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1 +kind: InferenceEndpointConfig +metadata: + name: deepseek15b-autoscale +spec: + modelName: deepseek15b + endpointName: $SAGEMAKER_ENDPOINT_NAME + instanceType: $ACCEL_INSTANCE_TYPE + invocationEndpoint: invocations + modelSourceConfig: + modelSourceType: s3 + s3Storage: + bucketName: $S3_BUCKET_NAME + region: $AWS_REGION + modelLocation: deepseek15b + prefetchEnabled: true + replicas: 1 + autoScalingSpec: + minReplicaCount: 1 + maxReplicaCount: 4 + pollingInterval: 30 + cooldownPeriod: 120 + initialCooldownPeriod: 60 + scaleDownStabilizationTime: 300 + scaleUpStabilizationTime: 0 + cloudWatchTrigger: + name: "DeepSeek-Invocations" + namespace: "AWS/SageMaker" + useCachedMetrics: true + metricName: "Invocations" + targetValue: 5.0 + activationTargetValue: 1.0 + minValue: 0.0 + metricCollectionStartTime: 300 + metricCollectionPeriod: 60 + metricStat: "Sum" + metricType: "Average" + dimensions: + - name: "EndpointName" + value: "$SAGEMAKER_ENDPOINT_NAME" + - name: "VariantName" + value: "AllTraffic" + worker: + resources: + limits: + nvidia.com/gpu: 1 + requests: + nvidia.com/gpu: 1 + cpu: 25600m + memory: 102Gi + image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124 + modelInvocationPort: + containerPort: 8080 + name: http + modelVolumeMount: + name: model-weights + mountPath: /opt/ml/model + environmentVariables: + - name: OPTION_ROLLING_BATCH + value: "vllm" + - name: SERVING_CHUNKED_READ_TIMEOUT + value: "480" + - name: DJL_OFFLINE + value: "true" + - name: NUM_SHARD + value: "1" + - name: SAGEMAKER_PROGRAM + value: "inference.py" + - name: SAGEMAKER_SUBMIT_DIRECTORY + value: "/opt/ml/model/code" + - name: MODEL_CACHE_ROOT + value: "/opt/ml/model" + - name: SAGEMAKER_MODEL_SERVER_WORKERS + value: "1" + - name: SAGEMAKER_MODEL_SERVER_TIMEOUT + value: "3600" + - name: OPTION_TRUST_REMOTE_CODE + value: "true" + - name: OPTION_ENABLE_REASONING + value: "true" + - name: OPTION_REASONING_PARSER + value: "deepseek_r1" + - name: SAGEMAKER_CONTAINER_LOG_LEVEL + value: "20" + - name: SAGEMAKER_ENV + value: "1" \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/invoke_deepseek.py b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/invoke_deepseek.py new file mode 100644 index 0000000..2c4fb4f --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/invoke_deepseek.py @@ -0,0 +1,15 @@ +import boto3 +import json + +client = boto3.client('sagemaker-runtime') + +response = client.invoke_endpoint( + 
EndpointName='deepseek15b-s3', + ContentType='application/json', + Accept='application/json', + Body=json.dumps({ + "inputs": "Hi, is water wet?" + }) +) + +print(response['Body'].read().decode('utf-8')) \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/invoke_jumpstart.py b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/invoke_jumpstart.py new file mode 100755 index 0000000..4475028 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/invoke_jumpstart.py @@ -0,0 +1,19 @@ +import boto3 +import json + +client = boto3.client('sagemaker-runtime') + +response = client.invoke_endpoint( + EndpointName='mistral-autoscale-endpoint', + ContentType='application/json', + Accept='application/json', + Body=json.dumps({ + "inputs": "Hi, is water wet?", + "parameters": { + "max_new_tokens": 100, + "temperature": 0.7 + } + }) +) + +print(response['Body'].read().decode('utf-8')) \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/jumpstart.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/jumpstart.yaml new file mode 100644 index 0000000..b6800d3 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/jumpstart.yaml @@ -0,0 +1,47 @@ +apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1 +kind: JumpStartModel +metadata: + name: mistral-jumpstart-autoscale + namespace: hyperpod-ns-inference-team + labels: + kueue.x-k8s.io/queue-name: hyperpod-ns-inference-team-localqueue + kueue.x-k8s.io/priority-class: inference-priority +spec: + sageMakerEndpoint: + name: "mistral-autoscale-endpoint" + model: + modelHubName: SageMakerPublicHub + modelId: huggingface-llm-mistral-7b-instruct-v3 + modelVersion: "1.0.0" + server: + instanceType: ml.g5.8xlarge + metrics: + enabled: true + maxDeployTimeInSeconds: 1800 + tlsConfig: + tlsCertificateOutputS3Uri: "s3://${S3_BUCKET_NAME}" + autoScalingSpec: + minReplicaCount: 1 + maxReplicaCount: 4 + pollingInterval: 30 + cooldownPeriod: 120 + initialCooldownPeriod: 60 + scaleDownStabilizationTime: 300 + scaleUpStabilizationTime: 0 + cloudWatchTrigger: + name: "Mistral-Invocations" + namespace: "AWS/SageMaker" + useCachedMetrics: true + metricName: "Invocations" + targetValue: 5.0 + activationTargetValue: 1.0 + minValue: 0.0 + metricCollectionStartTime: 300 + metricCollectionPeriod: 60 + metricStat: "Sum" + metricType: "Average" + dimensions: + - name: "EndpointName" + value: "mistral-autoscale-endpoint" + - name: "VariantName" + value: "AllTraffic" \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test.py b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test.py new file mode 100755 index 0000000..128d346 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test.py @@ -0,0 +1,138 @@ +#!/usr/bin/env python3 +""" +Load testing script for HyperPod inference endpoint autoscaling. +This script sends concurrent requests to trigger CloudWatch metrics and autoscaling. 
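+
+Example usage (mirrors the load_test_deepseek.sh wrapper in this repo, run from the repo root):
+
+    python3 inference/load_test.py --endpoint deepseek15b-s3 --requests 200 --rps 10 --duration 10 --workers 15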
+""" + +import boto3 +import json +import time +import threading +import argparse +from concurrent.futures import ThreadPoolExecutor, as_completed +from datetime import datetime + +class LoadTester: + def __init__(self, endpoint_name, region='us-east-1', max_workers=10): + self.endpoint_name = endpoint_name + self.client = boto3.client('sagemaker-runtime', region_name=region) + self.max_workers = max_workers + self.request_count = 0 + self.success_count = 0 + self.error_count = 0 + self.lock = threading.Lock() + + def send_request(self, request_id): + """Send a single inference request""" + try: + payload = { + "inputs": f"Request {request_id}: What is the capital of France?", + "parameters": { + "max_new_tokens": 50, + "temperature": 0.7 + } + } + + start_time = time.time() + response = self.client.invoke_endpoint( + EndpointName=self.endpoint_name, + ContentType='application/json', + Accept='application/json', + Body=json.dumps(payload) + ) + + response_body = response['Body'].read().decode('utf-8') + end_time = time.time() + + with self.lock: + self.success_count += 1 + + print(f"βœ… Request {request_id} completed in {end_time - start_time:.2f}s") + return True + + except Exception as e: + with self.lock: + self.error_count += 1 + print(f"❌ Request {request_id} failed: {str(e)}") + return False + + def run_load_test(self, total_requests, requests_per_second=2, duration_minutes=5): + """Run load test with specified parameters""" + print(f"πŸš€ Starting load test:") + print(f" Endpoint: {self.endpoint_name}") + print(f" Total requests: {total_requests}") + print(f" Requests per second: {requests_per_second}") + print(f" Duration: {duration_minutes} minutes") + print(f" Max workers: {self.max_workers}") + print("-" * 50) + + start_time = time.time() + end_time = start_time + (duration_minutes * 60) + + with ThreadPoolExecutor(max_workers=self.max_workers) as executor: + request_id = 0 + + while time.time() < end_time and request_id < total_requests: + # Submit batch of requests + batch_size = min(requests_per_second, total_requests - request_id) + futures = [] + + for _ in range(batch_size): + request_id += 1 + future = executor.submit(self.send_request, request_id) + futures.append(future) + + # Wait for batch to complete or timeout + batch_start = time.time() + for future in as_completed(futures, timeout=30): + try: + future.result() + except Exception as e: + print(f"❌ Future failed: {e}") + + # Control request rate + elapsed = time.time() - batch_start + sleep_time = max(0, 1.0 - elapsed) # Aim for 1 request per second + if sleep_time > 0: + time.sleep(sleep_time) + + # Print progress + if request_id % 10 == 0: + elapsed_minutes = (time.time() - start_time) / 60 + print(f"πŸ“Š Progress: {request_id} requests sent, {elapsed_minutes:.1f} minutes elapsed") + + total_time = time.time() - start_time + print("-" * 50) + print(f"🏁 Load test completed!") + print(f" Total time: {total_time:.2f} seconds") + print(f" Requests sent: {request_id}") + print(f" Successful: {self.success_count}") + print(f" Failed: {self.error_count}") + print(f" Success rate: {(self.success_count/request_id)*100:.1f}%") + print(f" Average RPS: {request_id/total_time:.2f}") + +def main(): + parser = argparse.ArgumentParser(description='Load test HyperPod inference endpoint') + parser.add_argument('--endpoint', required=True, help='SageMaker endpoint name') + parser.add_argument('--requests', type=int, default=100, help='Total number of requests') + parser.add_argument('--rps', type=int, default=2, help='Requests per 
second') + parser.add_argument('--duration', type=int, default=5, help='Duration in minutes') + parser.add_argument('--workers', type=int, default=10, help='Max concurrent workers') + parser.add_argument('--region', default='us-east-1', help='AWS region') + + args = parser.parse_args() + + tester = LoadTester( + endpoint_name=args.endpoint, + region=args.region, + max_workers=args.workers + ) + + tester.run_load_test( + total_requests=args.requests, + requests_per_second=args.rps, + duration_minutes=args.duration + ) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test_deepseek.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test_deepseek.sh new file mode 100755 index 0000000..a2691fa --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test_deepseek.sh @@ -0,0 +1,7 @@ +#!/bin/bash +python3 inference/load_test.py \ + --endpoint deepseek15b-s3 \ + --requests 200 \ + --rps 10 \ + --duration 10 \ + --workers 15 \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test_jumpstart.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test_jumpstart.sh new file mode 100755 index 0000000..56c7876 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/load_test_jumpstart.sh @@ -0,0 +1,7 @@ +#!/bin/bash +python3 inference/load_test.py \ + --endpoint mistral-autoscale-endpoint \ + --requests 200 \ + --rps 10 \ + --duration 10 \ + --workers 15 \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/monitor_autoscaling.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/monitor_autoscaling.sh new file mode 100755 index 0000000..10fc3c4 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/monitor_autoscaling.sh @@ -0,0 +1,52 @@ +#!/bin/bash + +# Simple, clean autoscaling monitor +DEPLOYMENT_NAME="mistral-jumpstart-autoscale" +NAMESPACE="hyperpod-ns-inference-team" + +echo "🎯 Monitoring $DEPLOYMENT_NAME autoscaling (Press Ctrl+C to stop)" +echo "Time | Pods | Status" +echo "---------|------|--------------------------------------------------" + +while true; do + # Get current time + TIME=$(date '+%H:%M:%S') + + # Get pod count and status + POD_INFO=$(kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE -o custom-columns=READY:.status.readyReplicas,DESIRED:.spec.replicas --no-headers 2>/dev/null) + + if [ $? -eq 0 ]; then + READY=$(echo $POD_INFO | cut -d' ' -f1) + DESIRED=$(echo $POD_INFO | cut -d' ' -f2) + + # Handle null values + [ "$READY" = "" ] && READY="0" + [ "$DESIRED" = "" ] && DESIRED="0" + + # Get scaling events + RECENT_EVENT=$(kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' --no-headers 2>/dev/null | \ + grep -E "(Scaled|ScalingReplicaSet)" | tail -1 | \ + awk '{for(i=6;i<=NF;i++) printf "%s ", $i; print ""}' | \ + cut -c1-40) + + # Color coding based on scaling activity + if [ "$READY" -gt 1 ]; then + STATUS="🟒 Scaled up: $READY/$DESIRED ready" + elif [ "$READY" != "$DESIRED" ]; then + STATUS="🟑 Scaling: $READY/$DESIRED ready" + else + STATUS="πŸ”΅ Stable: $READY/$DESIRED ready" + fi + + printf "%-8s | %-4s | %s\n" "$TIME" "$READY" "$STATUS" + + # Show recent event if significant + if [[ "$RECENT_EVENT" == *"Scaled"* ]]; then + printf " | | πŸ“ˆ %s\n" "$RECENT_EVENT" + fi + else + printf "%-8s | %-4s | πŸ”΄ Deployment not found\n" "$TIME" "?" 
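+        # kubectl could not read the deployment: it may not be created yet, or the current
+        # kube context may not point at the HyperPod EKS cluster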
+    fi
+
+    sleep 5
+done
\ No newline at end of file
diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/submit-deepseek.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/submit-deepseek.sh
new file mode 100755
index 0000000..707ef35
--- /dev/null
+++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/submit-deepseek.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+echo "Running: envsubst < inference/deploy_s3_inference.yaml | kubectl apply -f -"
+
+echo "Creating HyperPod Inference Operator deployment for the DeepSeek 15B model..."
+
+echo ""
+
+envsubst < inference/deploy_s3_inference.yaml | kubectl apply -f -
\ No newline at end of file
diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/inference/submit-jumpstart.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/submit-jumpstart.sh
new file mode 100755
index 0000000..528163b
--- /dev/null
+++ b/Container-Root/hyperpod/deployment/eks/demo/hero/inference/submit-jumpstart.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+# Load environment variables
+source env_vars
+
+echo "Running: envsubst < inference/jumpstart.yaml | kubectl apply -f -"
+echo "Creating HyperPod JumpStart Model with Autoscaling and Task Governance..."
+
+envsubst < inference/jumpstart.yaml | kubectl apply -f -
\ No newline at end of file
diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/create-trackingserver.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/create-trackingserver.sh
new file mode 100755
index 0000000..f2d1ddb
--- /dev/null
+++ b/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/create-trackingserver.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+
+export MLFLOW_ROLE_ARN=$(aws iam get-role --role-name $MLFLOW_ROLE_NAME --query 'Role.Arn' --output text)
+export TS_NAME=hyperpod-ts-demo
+export MLFLOW_VERSION=3.0
+
+# 2.
Use AWS CLI to create Tracking Server +aws sagemaker create-mlflow-tracking-server \ + --tracking-server-name $TS_NAME \ + --artifact-store-uri s3://$S3_BUCKET_NAME \ + --role-arn $MLFLOW_ROLE_ARN \ + --automatic-model-registration \ + --region $AWS_REGION \ + --mlflow-version $MLFLOW_VERSION + + + \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/create-ui.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/create-ui.sh new file mode 100755 index 0000000..e2cf930 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/create-ui.sh @@ -0,0 +1,8 @@ +#!/bin/bash +export TS_NAME=hyperpod-ts-demo + +aws sagemaker create-presigned-mlflow-tracking-server-url \ + --tracking-server-name $TS_NAME \ + --session-expiration-duration-in-seconds 1800 \ + --expires-in-seconds 300 \ + --region $AWS_REGION \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/mlflowpolicy.json b/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/mlflowpolicy.json new file mode 100644 index 0000000..006953d --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/mlflowpolicy.json @@ -0,0 +1,34 @@ +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "sagemaker-mlflow:*", + "sagemaker:CreateMlflowTrackingServer", + "sagemaker:UpdateMlflowTrackingServer", + "sagemaker:DeleteMlflowTrackingServer", + "sagemaker:StartMlflowTrackingServer", + "sagemaker:StopMlflowTrackingServer", + "sagemaker:CreatePresignedMlflowTrackingServerUrl", + "sagemaker:AddTags", + "sagemaker:CreateModelPackageGroup", + "sagemaker:CreateModelPackage", + "sagemaker:DescribeModelPackageGroup", + "sagemaker:UpdateModelPackage", + "sagemaker:ListModelPackageGroups", + "sagemaker:ListModelPackages", + "sagemaker:DeleteModelPackageGroup", + "sagemaker:DeleteModelPackage", + "cloudwatch:PutMetricData", + "logs:CreateLogStream", + "logs:PutLogEvents", + "logs:CreateLogGroup", + "s3:Get*", + "s3:Put*", + "s3:List*" + ], + "Resource": "*" + } + ] +} diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/setup-mlflow.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/setup-mlflow.sh new file mode 100755 index 0000000..5d019b3 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/mlflow/setup-mlflow.sh @@ -0,0 +1,69 @@ +#!/bin/bash + +# 1. Create an IAM OIDC identity provider for your cluster with the following command: +eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --approve + +# 2. 
Create an IAM policy
+cat <<EOF > mlflowpolicy.json
+{
+    "Version": "2012-10-17",
+    "Statement": [
+        {
+            "Effect": "Allow",
+            "Action": [
+                "sagemaker-mlflow:*",
+                "sagemaker:CreateMlflowTrackingServer",
+                "sagemaker:UpdateMlflowTrackingServer",
+                "sagemaker:DeleteMlflowTrackingServer",
+                "sagemaker:StartMlflowTrackingServer",
+                "sagemaker:StopMlflowTrackingServer",
+                "sagemaker:CreatePresignedMlflowTrackingServerUrl",
+                "sagemaker:AddTags",
+                "sagemaker:CreateModelPackageGroup",
+                "sagemaker:CreateModelPackage",
+                "sagemaker:DescribeModelPackageGroup",
+                "sagemaker:UpdateModelPackage",
+                "sagemaker:ListModelPackageGroups",
+                "sagemaker:ListModelPackages",
+                "sagemaker:DeleteModelPackageGroup",
+                "sagemaker:DeleteModelPackage",
+                "cloudwatch:PutMetricData",
+                "logs:CreateLogStream",
+                "logs:PutLogEvents",
+                "logs:CreateLogGroup",
+                "s3:Get*",
+                "s3:Put*",
+                "s3:List*"
+            ],
+            "Resource": "*"
+        }
+    ]
+}
+EOF
+
+aws iam create-policy \
+    --policy-name SageMakerMlFlowAccessPolicy \
+    --policy-document file://mlflowpolicy.json
+
+
+
+# 3. Create an IAM role
+
+# To access the MLflow tracking server, we create an IAM service account with a role that uses the policy created above. eksctl is used to create the role and service account.
+
+MLFLOW_ROLE_NAME=SM_MLFLOW_ACCESS_ROLE_DEMO
+MLFLOW_POLICY_ARN=$(aws iam list-policies --query 'Policies[?PolicyName==`SageMakerMlFlowAccessPolicy`]' | jq '.[0].Arn' | tr -d '"')
+
+eksctl create iamserviceaccount \
+    --name sagemaker-mlflow-sa \
+    --namespace hyperpod-ns-training-team \
+    --cluster $EKS_CLUSTER_NAME \
+    --attach-policy-arn $MLFLOW_POLICY_ARN \
+    --approve \
+    --role-name $MLFLOW_ROLE_NAME \
+    --region $AWS_REGION \
+    --override-existing-serviceaccounts
+
+# Now add the SageMaker service to the trust relationship of that role:
+aws iam update-assume-role-policy --role-name $MLFLOW_ROLE_NAME --policy-document "$(aws iam get-role --role-name $MLFLOW_ROLE_NAME --query 'Role.AssumeRolePolicyDocument' --output json | jq '.Statement += [{"Effect": "Allow", "Principal": {"Service": "sagemaker.amazonaws.com"}, "Action": "sts:AssumeRole"}]')"
+
diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/run.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/run.sh
new file mode 100644
index 0000000..04bfce5
--- /dev/null
+++ b/Container-Root/hyperpod/deployment/eks/demo/hero/run.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+
+docker run --name=do-hyperpod-use1 \
+    --hostname=deb706ea3971 \
+    --mac-address=8a:15:14:f3:90:39 \
+    --volume /var/run/docker.sock:/var/run/docker.sock \
+    --volume $(pwd):/hero-demo-hyperpod \
+    --volume ~/.kube:/root/.kube \
+    --volume ~/.aws:/root/.aws \
+    --network=bridge \
+    --workdir=/hyperpod \
+    --detach=true \
+    public.ecr.aws/hpc-cloud/aws-do-hyperpod:latest \
+    tail -f /dev/null
diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/build-push.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/training/build-push.sh
new file mode 100755
index 0000000..a9c5556
--- /dev/null
+++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/build-push.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+echo "Building HPTO FSDP image..."
+pushd ./training/fsdp/
+docker build --platform linux/amd64 -f Dockerfile -t ${REGISTRY}${IMAGE}:${TAG} .
+popd
+
+echo "Done building image!"
+echo ""
+echo "Pushing image to ECR..."
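+# REGISTRY, IMAGE, and TAG are expected to come from the env_vars file at the repo root
+# (source env_vars before running this script)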
+# Create registry if needed +REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l) +if [ "$REGISTRY_COUNT" == "0" ]; then + aws ecr create-repository --repository-name ${IMAGE} +fi + +# Login to registry +echo "Logging in to $REGISTRY ..." +aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY + +# Push image to registry +docker image push ${REGISTRY}${IMAGE}:${TAG} +echo "Done pushing image!" +echo "" \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/cleanup-fsx.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/training/cleanup-fsx.sh new file mode 100755 index 0000000..286f579 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/cleanup-fsx.sh @@ -0,0 +1,24 @@ +#!/bin/bash + +TARGET_NAMESPACE="hyperpod-ns-training-team" + +# Delete PVC first +echo "Deleting PVC..." +kubectl delete pvc fsx-claim -n ${TARGET_NAMESPACE} + +# Delete PV +echo "Deleting PV..." +kubectl delete pv fsx-pv-${TARGET_NAMESPACE} + +# Delete StorageClass +echo "Deleting StorageClass..." +kubectl delete sc fsx-sc-${TARGET_NAMESPACE} + +# Verify resources are deleted +echo -e "\nVerifying resources are deleted:" +echo "PVC status:" +kubectl get pvc -n ${TARGET_NAMESPACE} fsx-claim 2>/dev/null || echo "PVC deleted" +echo -e "\nPV status:" +kubectl get pv fsx-pv-${TARGET_NAMESPACE} 2>/dev/null || echo "PV deleted" +echo -e "\nStorageClass status:" +kubectl get sc fsx-sc-${TARGET_NAMESPACE} 2>/dev/null || echo "StorageClass deleted" diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/create-pvc.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/training/create-pvc.sh new file mode 100755 index 0000000..6233232 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/create-pvc.sh @@ -0,0 +1,99 @@ +#!/bin/bash + +# Set target namespace +TARGET_NAMESPACE="hyperpod-ns-training-team" + + + +# Get FSx filesystem information +if [ -z "$FSX_ID" ]; then + echo "FSX_ID not set. Attempting to find an available FSx filesystem..." + FSX_ID=$(aws fsx describe-file-systems --region ${AWS_REGION} | jq -r '.FileSystems[0].FileSystemId') + if [ -z "$FSX_ID" ]; then + echo "Error: Could not find any FSx filesystem" + exit 1 + else + echo "Using FSx filesystem: $FSX_ID" + fi +else + echo "Using provided FSX_ID: $FSX_ID" + # Verify if the provided FSX_ID exists + if ! aws fsx describe-file-systems --file-system-id $FSX_ID --region ${AWS_REGION} &> /dev/null; then + echo "Error: Provided FSX_ID $FSX_ID does not exist or is not accessible" + exit 1 + fi +fi + +# Get Private Subnet ID +PRIVATE_SUBNET_ID=$(aws fsx describe-file-systems --file-system-id ${FSX_ID} --region ${AWS_REGION} --query 'FileSystems[0].SubnetIds[0]' --output text) + +# Get the security group +CLUSTER_INFO=$(aws sagemaker describe-cluster --cluster-name hero-cluster) +SECURITY_GROUP=$(echo "$CLUSTER_INFO" | jq -r '.VpcConfig.SecurityGroupIds[0]') + +# Get FSx DNS name +FSX_DNS=$(aws fsx describe-file-systems --region ${AWS_REGION} --file-system-id ${FSX_ID} --query 'FileSystems[0].DNSName' --output text) + +# Get FSx Mount name +FSX_MOUNT=$(aws fsx describe-file-systems --region ${AWS_REGION} --file-system-id ${FSX_ID} --query 'FileSystems[0].LustreConfiguration.MountName' --output text) + +# Create StorageClass +cat </dev/null || echo "Repository already exists" + +# Build the Docker image for AMD64 platform (to match your Kubernetes nodes) +echo "Building Docker image for linux/amd64 platform..." 
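+# --load imports the buildx result into the local Docker daemon so the docker push below can find it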
+docker buildx build --platform linux/amd64 -t ${REGISTRY}/${REPOSITORY_NAME}:${IMAGE_TAG} . --load + +# Push the image to ECR +echo "Pushing image to ECR..." +docker push ${REGISTRY}/${REPOSITORY_NAME}:${IMAGE_TAG} + +echo "Build and push completed successfully!" +echo "Image URI: ${REGISTRY}/${REPOSITORY_NAME}:${IMAGE_TAG}" \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/generate-sbatch-training-files.py b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/generate-sbatch-training-files.py new file mode 100644 index 0000000..db41708 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/generate-sbatch-training-files.py @@ -0,0 +1,53 @@ +#!/usr/bin/env python3 + +import argparse +import jinja2 +import os +import pathlib + + +def get_model_parameters(model_name): + f = open('models/' + model_name + '.txt') + return f.read() + + +def list_models(path='models'): + models = [str(pathlib.Path(i).with_suffix('')) for i in os.listdir(path)] + + return models + + +def create_sbatch_file(model_name, model_parameters): + env = jinja2.Environment(loader=jinja2.FileSystemLoader('slurm')) + template = env.get_template('training-sub.template') + content = template.render(MODEL_NAME=model_name, + MODEL_PARAMETERS=model_parameters) + + f = open('slurm/' + model_name + '-training.sbatch', mode='w') + f.write(content) + f.close() + + +def create_kubernetes_file(model_name, model_parameters): + env = jinja2.Environment(loader=jinja2.FileSystemLoader('kubernetes')) + template = env.get_template('training_kubernetes.template') + model_name_dashed = model_name.replace('_', '-') + content = template.render(MODEL_NAME=model_name_dashed, + MODEL_PARAMETERS=model_parameters.splitlines()) + + f = open('kubernetes/' + model_name + '-fsdp.yaml', mode='w') + f.write(content) + f.close() + + +def main(): + models = list_models() + + for i in models: + model_parameters = get_model_parameters(i) + create_sbatch_file(i, model_parameters) + create_kubernetes_file(i, model_parameters) + + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/README.md b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/README.md new file mode 100644 index 0000000..5440ff2 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/README.md @@ -0,0 +1,202 @@ +# Run Distributed Training with PyTorch FSDP on Amazon EKS + +These scripts provide an easy way to get started with multinode [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) training on EKS. It is designed to be as simple as possible, requires no data preparation, and uses a container image. If you would like to run FSDP with SLURM, please refer to [README.md](../slurm/README.md). + +This document will run you through how to run Llama 3.1 8B model training with FSDP. You will also find in this folder manifests to run Llama 2(7B, 13B, 70B), Llama 3.1(8B, 70B), Llama 3.2(1B, 3B), Mistral 8x7b and Mistral Mathstral. + +## 0. Prerequisites + +### 0.1. EKS Cluster +Before running this training, you'll need to create an Amazon EKS or a SageMaker HyperPod EKS cluster. Instructions can be found in [1.architectures](../../1.architectures), the [aws-do-eks](https://bit.ly/do-eks) project, or the [eks-blueprints](https://github.com/aws-ia/terraform-aws-eks-blueprints) project. + +### 0.2. 
Connect to your EKS Cluster + +Run the [aws eks update-kubeconfig](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/update-kubeconfig.html) command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command. + +```bash +aws eks update-kubeconfig --name +``` +You can verify that you are connected to the EKS cluster by running this commands: +```bash +kubectl config current-context +``` +``` +arn:aws:eks:us-west-1:xxxxxxxxxxxx:cluster/xxx-eks-cluster +``` +### 0.3. Clone the awsome-distributed-training reposource code +Clone this repo. + +``` +git clone https://github.com/aws-samples/awsome-distributed-training/ +cd awsome-distributed-training/3.test_cases/pytorch/FSDP/kubernetes +``` + +### 0.4. Envsubst +If the [envsubst](https://github.com/a8m/envsubst) utility is not available in your environment, please install it, following the instructions appropriate for your operating system. + +### 0.5. Kubeflow training operator +Deploy the Kubeflow training operator + +```bash +kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.9.1" +``` + +## 1. Build container image + +Build a container image for this example using the code below: + +```bash +export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') +export ACCOUNT=$(aws sts get-caller-identity --query Account --output text) +export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/ +pushd ../ +docker build -f Dockerfile -t ${REGISTRY}fsdp:pytorch2.7.1 . +popd +``` + +The PyTorch FSDP container uses the [nccl-tests](https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile) container as base. + +## 2. Push container image to Amazon ECR + +In this step we create a container registry if one does not exist, and push the container image to it. + +```bash +# Create registry if needed +REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"fsdp\" | wc -l) +if [ "$REGISTRY_COUNT" == "0" ]; then + aws ecr create-repository --repository-name fsdp +fi + +# Login to registry +echo "Logging in to $REGISTRY ..." +aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY + +# Push image to registry +docker image push ${REGISTRY}fsdp:pytorch2.7.1 +``` + +## 3. Data + +For this example, we'll be using the [allenai/c4](https://huggingface.co/datasets/allenai/c4) dataset. Instead of downloading the entire dataset, the `create_streaming_dataloaders` function will stream the dataset from [HuggingFace](https://huggingface.co/datasets), so there's no data prep required for running this training. + +**For this dataset, we will need a Hugging Face access token**. First, create a [Hugging Face account](https://huggingface.co/welcome). Then [generate your access token with read permissions](https://huggingface.co/docs/hub/en/security-tokens). We will use this token and set it in our environment variables in the next step. + +If you'd like to instead use your own dataset, you can do so by [formatting it as a HuggingFace dataset](https://huggingface.co/docs/datasets/create_dataset), and passing its location to the `--dataset_path` argument. + +## 4. Launch Llama 3.1 8B training job + +Generate the Kubernetes manifest and apply it to the cluster. 
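+
+This demo also ships a HyperPod Training Operator variant of this job, `llama3_1_8b-fsdp-hpto.yaml` (kind `HyperPodPyTorchJob`). Assuming the HyperPod Training Operator is installed and the same environment variables below are set, it can be applied with the same pattern:
+
+```bash
+envsubst < llama3_1_8b-fsdp-hpto.yaml | kubectl apply -f -
+```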
+ +Create environment variables: + +``` bash +cat << EOF > env_vars +export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1 +export INSTANCE_TYPE= +export NUM_NODES= +export GPU_PER_NODE= +export EFA_PER_NODE= +export FI_PROVIDER=efa +export HF_TOKEN= +EOF +``` + +For reference, we are running the Llama 3.1 8B model on 4 x p5.48xlarge instances and below is the configuration of our environment variables: +``` bash +cat << EOF > env_vars +export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1 +export INSTANCE_TYPE=p5.48xlarge +export NUM_NODES=4 +export GPU_PER_NODE=8 +export EFA_PER_NODE=32 +export FI_PROVIDER=efa +export HF_TOKEN= +EOF +``` + +Fill in `env_vars` and then source variables: + +``` bash +source env_vars +``` + +Apply yaml: +``` bash +envsubst < llama3_1_8b-fsdp.yaml | kubectl apply -f - +``` + +EFA level variables are available for adjustment in fsdp.yaml-template +Keep FI_* values commented out for non-efa instances (G5, G4d, P3) or P5 +Uncomment FI_* values for P4d instances + +You can also adjust the training parameters in `TRAINING_ARGS` (for example, to train Llama 3.1 70B). Additional parameters can be found in `model/arguments.py`. Note that we use the same directory for both `--checkpoint_dir` and `--resume_from_checkpoint`. If there are multiple checkpoints, `--resume_from_checkpoint` will automatically select the most recent one. This way if our training is interupted for any reason, it will automatically pick up the most recent checkpoint. + +## 5. Monitor training job + +To see the status of your job, use the commands below + +```bash +kubectl get pytorchjob +kubectl get pods +``` + +```log +NAME STATE AGE +llama3-1-8b-fsdp Running 5m38s +NAME READY STATUS RESTARTS AGE +llama3-1-8b-fsdp-worker-0 1/1 Running 0 5m39s +llama3-1-8b-fsdp-worker-1 1/1 Running 0 5m39s +llama3-1-8b-fsdp-worker-2 1/1 Running 0 5m39s +llama3-1-8b-fsdp-worker-3 1/1 Running 0 5m39s +``` + +Each of the pods produces job logs. One of the pods is elected master during job initialization. Only this pod will show the progress of the training job in its log. To find out which pod is currently the master, run the command below. + +```bash +kubectl logs llama3-1-8b-fsdp-worker-0 | grep master_addr= +``` + +```log +I0620 14:27:39.789000 1 torch/distributed/elastic/agent/server/api.py:525] master_addr=llama3-1-8b-fsdp-worker-0 +``` + +This shows that the pod `llama3-1-8b-fsdp-worker-0` is currently the master. To look at the current job logs, use the command below: + +```bash +kubectl logs -f llama3-1-8b-fsdp-worker-0 +``` + +```log +... +2025-06-20 14:17:10 I [train.py:103] Batch 90 Loss: 7.24291, Speed: 9.41 samples/sec, lr: 0.000010 +2025-06-20 14:17:14 I [train.py:103] Batch 91 Loss: 7.27470, Speed: 8.94 samples/sec, lr: 0.000010 +2025-06-20 14:17:17 I [train.py:103] Batch 92 Loss: 7.06632, Speed: 9.42 samples/sec, lr: 0.000010 +2025-06-20 14:17:21 I [train.py:103] Batch 93 Loss: 7.17624, Speed: 8.96 samples/sec, lr: 0.000010 +2025-06-20 14:17:24 I [train.py:103] Batch 94 Loss: 7.24291, Speed: 9.06 samples/sec, lr: 0.000010 +2025-06-20 14:17:28 I [train.py:103] Batch 95 Loss: 7.13051, Speed: 9.05 samples/sec, lr: 0.000010 +2025-06-20 14:17:32 I [train.py:103] Batch 96 Loss: 7.16901, Speed: 8.30 samples/sec, lr: 0.000010 +2025-06-20 14:17:36 I [train.py:103] Batch 97 Loss: 7.50217, Speed: 8.51 samples/sec, lr: 0.000010 +``` + +## 6. Stop training job + +To stop the current training job, use the following command. 
+ +```bash +kubectl delete -f ./llama3_1_8b-fsdp.yaml +``` + +If you wish to launch a new job, you must first stop the previous one, even if it is in `Completed` state. + +## References +Llama 2 and Llama 3.x models parameters are based on the values in the [Llama 2 paper](https://arxiv.org/abs/2307.09288) and [Llama 3 paper](https://arxiv.org/abs/2407.21783) + + +| Parameter | Llama 2 7B | Llama 2 13B | Llama 2 70B | Llama 3.1 8B | Llama 3.1 70B | Llama 3.2 1B | Llama 3.2 3B | +|----------------------|------------|-------------|-------------|--------------|---------------|--------------|--------------| +| intermediate_size | 11008 | 13824 | 28672 | 14336 | 28672 | 8192 | 11008 | +| num_key_value_heads | 32 | 40 | 8 | 8 | 8 | 8 | 8 | +| hidden_width | 4096 | 5120 | 8192 | 4096 | 8192 | 2048 | 3072 | +| num_layers | 32 | 40 | 80 | 32 | 80 | 16 | 28 | +| num_heads | 32 | 40 | 64 | 32 | 64 | 32 | 24 | +| max_context_length | 4096 | 4096 | 4096 | 8192 | 8192 | 8192 | 8192 | \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/fsdp.yaml-template b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/fsdp.yaml-template new file mode 100644 index 0000000..af1b3bd --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/fsdp.yaml-template @@ -0,0 +1,115 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --max_context_width=4096 + - --num_key_value_heads=32 + - --intermediate_size=11008 + - --hidden_width=4096 + - --num_layers=32 + - --num_heads=32 + - --model_type=llama_v2 + - 
--tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=5000 + - --validation_freq=500 + - --max_steps=5000 + - --checkpoint_dir=/checkpoints + - --dataset=allenai/c4 + - --dataset_config_name=en + - --resume_from_checkpoint=/checkpoints + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full + - --offload_activation=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_13b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_13b-fsdp.yaml new file mode 100644 index 0000000..c0361b5 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_13b-fsdp.yaml @@ -0,0 +1,115 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: llama2-13b-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: llama2-13b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --max_context_width=4096 + - --num_key_value_heads=40 + - --intermediate_size=13824 + - --hidden_width=5120 + - --num_layers=40 + - --num_heads=40 + - --model_type=llama_v2 + - --tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=50 + - --validation_freq=25 + - --max_steps=50 + - --checkpoint_dir=./checkpoints + - --dataset=allenai/c4 + - --dataset_config_name=en + - --resume_from_checkpoint=./checkpoints + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git 
a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_70b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_70b-fsdp.yaml new file mode 100644 index 0000000..4e91559 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_70b-fsdp.yaml @@ -0,0 +1,115 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: llama2-70b-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: llama2-70b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --max_context_width=4096 + - --num_key_value_heads=8 + - --intermediate_size=28672 + - --hidden_width=8192 + - --num_layers=80 + - --num_heads=64 + - --model_type=llama_v2 + - --tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=1000 + - --validation_freq=10 + - --max_steps=30 + - --checkpoint_dir=./checkpoints + - --dataset=allenai/c4 + - --dataset_config_name=en + - --resume_from_checkpoint=./checkpoints + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_7b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_7b-fsdp.yaml new file mode 100644 index 0000000..d611f60 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama2_7b-fsdp.yaml @@ -0,0 +1,115 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: llama2-7b-fsdp +spec: + elasticPolicy: + 
rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: llama2-7b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --max_context_width=4096 + - --num_key_value_heads=32 + - --intermediate_size=11008 + - --hidden_width=4096 + - --num_layers=32 + - --num_heads=32 + - --model_type=llama_v2 + - --tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=50 + - --validation_freq=100 + - --max_steps=100 + - --checkpoint_dir=./checkpoints + - --dataset=allenai/c4 + - --dataset_config_name=en + - --resume_from_checkpoint=./checkpoints + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_70b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_70b-fsdp.yaml new file mode 100644 index 0000000..c682643 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_70b-fsdp.yaml @@ -0,0 +1,115 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: llama3-1-70b-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: llama3-1-70b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # 
node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --max_context_width=8192 + - --num_key_value_heads=8 + - --intermediate_size=28672 + - --hidden_width=8192 + - --num_layers=80 + - --num_heads=64 + - --model_type=llama_v3 + - --tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=1000 + - --validation_freq=10 + - --max_steps=30 + - --checkpoint_dir=./checkpoints + - --dataset=allenai/c4 + - --dataset_config_name=en + - --resume_from_checkpoint=./checkpoints + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_8b-fsdp-hpto.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_8b-fsdp-hpto.yaml new file mode 100644 index 0000000..17a7cd5 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_8b-fsdp-hpto.yaml @@ -0,0 +1,135 @@ +apiVersion: sagemaker.amazonaws.com/v1 +kind: HyperPodPyTorchJob +metadata: + name: llama3-1-8b-fsdp-hpto +spec: + nprocPerNode: "$GPU_PER_NODE" + runPolicy: + jobMaxRetryCount: 10 + restartPolicy: + numRestartBeforeFullJobRestart: 3 + evalPeriodSeconds: 21600 + maxFullJobRestarts: 1 + cleanPodPolicy: All + logMonitoringConfiguration: + - name: JobStart + logPattern: '.*Loss:.*' + expectedStartCutOffInSeconds: 12000 + - name: JobHangingDetection + logPattern: '.*Loss:.*' + expectedRecurringFrequencyInSeconds: 12000 + replicaSpecs: + - name: pods + replicas: $NUM_NODES + template: + metadata: + labels: + job-name: llama3-1-8b-fsdp-hpto + replica-type: pods + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchLabels: + job-name: llama3-1-8b-fsdp-hpto + topologyKey: kubernetes.io/hostname + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + 
nodeSelectorTerms: + - matchExpressions: + - key: sagemaker.amazonaws.com/node-health-status + operator: In + values: + - Schedulable + - key: sagemaker.amazonaws.com/compute-type + operator: In + values: + - '${INSTANCE_TYPE}' + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: DoNotSchedule + labelSelector: + matchLabels: + job-name: llama3-1-8b-fsdp-hpto + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + - name: LOGLEVEL + value: INFO + - name: FI_PROVIDER + value: efa + - name: FI_EFA_USE_DEVICE_RDMA + value: '0' + - name: FI_EFA_FORK_SAFE + value: '1' + - name: FI_EFA_ENABLE_SHM_TRANSFER + value: '1' + - name: TORCH_DISTRIBUTED_DEBUG + value: DETAIL + - name: TORCH_NCCL_ENABLE_MONITORING + value: '1' + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: '20000' + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: '1' + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: /local/nccl_trace_rank_ + - name: PYTORCH_CUDA_ALLOC_CONF + value: 'expandable_segments:True' + - name: NCCL_DEBUG + value: INFO + - name: NCCL_SOCKET_IFNAME + value: ^lo + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: '1' + - name: HF_TOKEN + value: '${HF_TOKEN}' + command: + - hyperpodrun + - '--tee=3' + - '--log_dir=/tmp/hyperpod' + - '--nproc_per_node=$GPU_PER_NODE' + - '--nnodes=$NUM_NODES' + - /fsdp/train.py + - '--max_context_width=4096' + - '--num_key_value_heads=40' + - '--intermediate_size=13824' + - '--hidden_width=5120' + - '--num_layers=40' + - '--num_heads=40' + - '--model_type=llama_v2' + - '--tokenizer=hf-internal-testing/llama-tokenizer' + - '--checkpoint_freq=50' + - '--validation_freq=25' + - '--max_steps=1000' + - '--checkpoint_dir=./checkpoints' + - '--dataset=allenai/c4' + - '--dataset_config_name=en' + - '--resume_from_checkpoint=./checkpoints' + - '--train_batch_size=1' + - '--val_batch_size=1' + - '--sharding_strategy=full' + - '--offload_activations=1' + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_8b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_8b-fsdp.yaml new file mode 100644 index 0000000..5e4cffe --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_1_8b-fsdp.yaml @@ -0,0 +1,115 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: llama3-1-8b-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: llama3-1-8b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: 
LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --max_context_width=8192 + - --num_key_value_heads=8 + - --intermediate_size=14336 + - --hidden_width=4096 + - --num_layers=32 + - --num_heads=32 + - --model_type=llama_v3 + - --tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=50 + - --validation_freq=25 + - --max_steps=50 + - --checkpoint_dir=./checkpoints + - --dataset=allenai/c4 + - --dataset_config_name=en + - --resume_from_checkpoint=./checkpoints + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_2_1b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_2_1b-fsdp.yaml new file mode 100644 index 0000000..4eec4e5 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_2_1b-fsdp.yaml @@ -0,0 +1,115 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: llama3-2-1b-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: llama3-2-1b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: 
TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --max_context_width=8192 + - --num_key_value_heads=2 + - --intermediate_size=8192 + - --hidden_width=2048 + - --num_layers=16 + - --num_heads=32 + - --model_type=llama_v3 + - --tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=50 + - --validation_freq=100 + - --max_steps=100 + - --checkpoint_dir=./checkpoints + - --dataset=allenai/c4 + - --dataset_config_name=en + - --resume_from_checkpoint=./checkpoints + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_2_3b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_2_3b-fsdp.yaml new file mode 100644 index 0000000..a05d346 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/llama3_2_3b-fsdp.yaml @@ -0,0 +1,115 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: llama3-2-3b-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: llama3-2-3b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - 
/usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --max_context_width=8192 + - --num_key_value_heads=2 + - --intermediate_size=8192 + - --hidden_width=3072 + - --num_layers=28 + - --num_heads=24 + - --model_type=llama_v3 + - --tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=50 + - --validation_freq=100 + - --max_steps=100 + - --checkpoint_dir=./checkpoints + - --dataset=allenai/c4 + - --dataset_config_name=en + - --resume_from_checkpoint=./checkpoints + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/mathstral_7b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/mathstral_7b-fsdp.yaml new file mode 100644 index 0000000..cfa10a5 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/mathstral_7b-fsdp.yaml @@ -0,0 +1,137 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: mathstral-7b-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: mathstral-7b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --train_batch_size=1 + - --val_batch_size=1 + - --seed=42 + - --grad_clip=1.0 + - --weight_decay=0.2 + - --beta1=0.9 + - --beta2=0.95 + - --activation_checkpointing=1 + - --intermediate_size=14336 + - --num_key_value_heads=8 + - --logging_freq=1 + - --max_context_width=32768 + - --vocab_size=32768 + - --hidden_width=4096 + - --num_layers=32 + - --num_heads=32 + - 
--resid_pdrop=0.1 + - --embd_pdrop=0.1 + - --attn_pdrop=0.1 + - --summary_first_pdrop=0.1 + - --initializer_range=0.02 + - --model_type=mistral + - --rotary_pct=0.25 + - --rotary_emb_base=10000 + - --lr=0.0001 + - --lr_decay_style=cosine + - --min_lr=1e-5 + - --warmup=0.0032 + - --plateau=0.0 + - --dataset=allenai/c4 + - --tokenizer=mistralai/mathstral-7B-v0.1 + - --epochs=3 + - --checkpoint_dir=./checkpoints/mathstral-7B + - --resume_from_checkpoint=./checkpoints/mathstral-7B + - --max_steps=200 + - --checkpoint_freq=50 + - --validation_freq=100 + - --dataset_config_name=en + - --limit_all_gathers=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/mistral_8x7b-fsdp.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/mistral_8x7b-fsdp.yaml new file mode 100644 index 0000000..185bc20 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/mistral_8x7b-fsdp.yaml @@ -0,0 +1,134 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: mistral-8x7b-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: mistral-8x7b-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + - --train_batch_size=4 + - --val_batch_size=4 + - --max_steps=200 + - --seed=42 + - --bf16=1 + - --grad_clip=1.0 + - --weight_decay=0.2 + - --beta1=0.9 + - --beta2=0.95 + - --activation_checkpointing=1 + - --intermediate_size=14336 + - --num_key_value_heads=8 + - --logging_freq=1 + - --max_context_width=32768 + - --vocab_size=32000 + - 
--hidden_width=4096 + - --num_layers=32 + - --num_heads=32 + - --resid_pdrop=0.1 + - --embd_pdrop=0.1 + - --attn_pdrop=0.1 + - --summary_first_pdrop=0.1 + - --initializer_range=0.02 + - --model_type=mixtral + - --rotary_pct=0.25 + - --rotary_emb_base=10000 + - --lr=0.0001 + - --lr_decay_style=cosine + - --min_lr=1e-5 + - --warmup=0.0032 + - --plateau=0.0 + - --dataset=allenai/c4 + - --tokenizer=mistralai/Mixtral-8x7B-v0.1 + - --epochs=3 + - --dataset_config_name=en + - --limit_all_gathers=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/training_kubernetes.template b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/training_kubernetes.template new file mode 100644 index 0000000..78a2a81 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/kubernetes/training_kubernetes.template @@ -0,0 +1,99 @@ +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: {{ MODEL_NAME}}-fsdp +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 64 + maxRestarts: 100 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 90 + pytorchReplicaSpecs: + Worker: + replicas: $NUM_NODES + restartPolicy: OnFailure + template: + metadata: + labels: + app: {{ MODEL_NAME}}-fsdp + spec: + volumes: + - name: shmem + hostPath: + path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + #nodeSelector: + # node.kubernetes.io/instance-type: "${INSTANCE_TYPE}" + containers: + - name: pytorch + image: ${IMAGE_URI} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + - name: LOGLEVEL + value: "DEBUG" + #- name: FI_PROVIDER + # value: $FI_PROVIDER + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + #- name: FI_EFA_FORK_SAFE + # value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - /usr/local/bin/torchrun + - --nproc_per_node=$GPU_PER_NODE + - --nnodes=$NUM_NODES + - /fsdp/train.py + {%- for i in MODEL_PARAMETERS %} + - {{ i|indent(18) -}} + {% endfor %} + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/__init__.py b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/__init__.py new file mode 100644 index 0000000..e69de29 diff --git 
a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/arguments.py b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/arguments.py new file mode 100644 index 0000000..ffb8389 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/arguments.py @@ -0,0 +1,194 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 + +import argparse +import os + + +def parse_args(): # pylint: disable=too-many-statements + """Parse args.""" + parser = argparse.ArgumentParser() + + # hyperparameters sent by the client are passed as command-line arguments to the script. + + opt_grp = parser.add_argument_group( + title="optimization", description="arguments for optimization") + opt_grp.add_argument( + "--train_batch_size", + type=int, + default=2, + help="batch size per dp rank", # pylint: disable=line-too-long + ) + opt_grp.add_argument("--val_batch_size", type=int, default=4) + opt_grp.add_argument("--max_steps", + "--max_training_steps", + type=int, + default=5000) + opt_grp.add_argument("--seed", type=int, default=12345) + opt_grp.add_argument("--same_seed", type=int, default=0) + opt_grp.add_argument("--bf16", + default=1, + type=int, + help="automatic mixed precision training") + opt_grp.add_argument("--grad_clip", + default=1.0, + type=float, + help="gradient clipping") + opt_grp.add_argument("--weight_decay", + default=0.2, + type=float, + help="weight decay") + opt_grp.add_argument("--beta1", + default=0.9, + type=float, + help="beta1 parameter for Adam optimizer") + opt_grp.add_argument("--beta2", + default=0.95, + type=float, + help="beta2 parameter for Adam optimizer") + opt_grp.add_argument( + "--activation_checkpointing", + type=int, + default=1, + help="enable gradient checkpointing to reduce memory consumption", + ) + opt_grp.add_argument( + "--intermediate_size", + type=int, + default=11008, + help="intermediate_size, a dimension associated with MLP", + ) + opt_grp.add_argument( + "--num_key_value_heads", + type=int, + default=None, + help="num_key_value_heads for GQA", + ) + parser.add_argument("--logging_freq", + type=int, + default=1, + help="number of iterations between logging") + parser.add_argument("--tensorboard_dir", type=str, nargs="+", default=None) + + model_grp = parser.add_argument_group( + title="model", description="arguments to describe model configuration") + model_grp.add_argument("--max_context_width", type=int, default=2048) + model_grp.add_argument("--vocab_size", type=int, default=50432) + model_grp.add_argument("--hidden_width", type=int, default=768) + model_grp.add_argument("--num_layers", type=int, default=12) + model_grp.add_argument("--num_heads", type=int, default=12) + model_grp.add_argument("--resid_pdrop", type=float, default=0.1) + model_grp.add_argument("--embd_pdrop", type=float, default=0.1) + model_grp.add_argument("--attn_pdrop", type=float, default=0.1) + model_grp.add_argument("--summary_first_pdrop", type=float, default=0.1) + model_grp.add_argument("--initializer_range", type=float, default=0.02) + model_grp.add_argument("--model_type", type=str, default="gpt_neox") + model_grp.add_argument("--rotary_pct", type=float, default=0.25) + model_grp.add_argument("--rotary_emb_base", type=int, default=10000) + + fsdp_grp = parser.add_argument_group( + title="fsdp", description="arguments for fully sharded data parallel") + fsdp_grp.add_argument("--offload_activations", type=int, default=0) + 
fsdp_grp.add_argument("--activation_loading_horizon", type=int, default=2) + fsdp_grp.add_argument("--limit_all_gathers", default=1, type=int) + fsdp_grp.add_argument( + "--sharding_strategy", + type=str, + default="full", + choices=["full", "hybrid"], + help="FSDP sharding strategy https://pytorch.org/docs/stable/fsdp.html", + ) + fsdp_grp.add_argument( + "--cpu_offload", + type=int, + default=0, + help= + "CPU offloading https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload" + ) + + # learning rate + lr_grp = parser.add_argument_group( + title="lr", description="arguments for learning rate schedule") + lr_grp.add_argument("--lr", + type=float, + default=0.0001, + help="Initial learning rate.") + lr_grp.add_argument( + "--lr_decay_style", + type=str, + default="cosine", + choices=["constant", "linear", "cosine", "exponential", "plateau"], + help="Learning rate decay function.", + ) + lr_grp.add_argument( + "--lr_decay_iters", + type=int, + default=None, + help="number of iterations to decay learning rate over," + " If None defaults to train iters", + ) + lr_grp.add_argument( + "--min_lr", + type=float, + default=1e-05, + help="Minumum value for learning rate. The scheduler" + "clip values below this threshold.", + ) + lr_grp.add_argument( + "--warmup", + type=float, + default=0.0032, + help="Percentage of total iterations to warmup on " + "(.01 = 1 percent of all training iters).", + ) + lr_grp.add_argument( + "--plateau", + type=float, + default=0.0, + help= + "Percentage of total iterations to keep at max if using plateau lr", + ) + io_grp = parser.add_argument_group( + title="io", description="location for input and output") + io_grp.add_argument("--dataset", type=str, default="allenai/c4") + io_grp.add_argument("--dataset_config_name", type=str, default="en") + io_grp.add_argument("--tokenizer", + type=str, + default="EleutherAI/gpt-neox-20b") + io_grp.add_argument( + "--resume_from_checkpoint", + type=str, + default=None, + help="Checkpoint folder name to load from", + ) + io_grp.add_argument( + "--checkpoint_dir", + type=str, + default=None, + help="Saves partial checkpoints (model, optimizer) to this dir.", # pylint: disable=line-too-long + ) + io_grp.add_argument("--epochs", + type=int, + default=3, + help="times of iterating over the training dataset") + + parser.add_argument( + "--checkpoint_freq", + type=int, + default=1000, + help="number of iterations between checkpointing", + ) + parser.add_argument( + "--validation_freq", + type=int, + default=None, + help="number of iterations to print validation loss", + ) + parser.add_argument( + "--validation_batches", + type=int, + default=10, + help="number of batches to estimate validation loss", + ) + + return parser.parse_known_args() diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/checkpoint.py b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/checkpoint.py new file mode 100644 index 0000000..55d8111 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/checkpoint.py @@ -0,0 +1,126 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
+# SPDX-License-Identifier: MIT-0 + +import os +import re +import pickle +import statistics +import time +import warnings +from pathlib import Path + +import torch +import torch.distributed as dist + +# pylint: disable=import-error,no-name-in-module +import torch.distributed.checkpoint as dist_cp +from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict +from torch.distributed.fsdp import FullyShardedDataParallel as FSDP +from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType +from model_utils.train_utils import get_logger + + +logger = get_logger() + +def save_checkpoint(model, optimizer, scheduler, user_content, root_dir, sub_dir): + torch.cuda.empty_cache() + + save_dir = os.path.join(root_dir, sub_dir) + if dist.get_rank() == 0: + logger.info("Writing checkpoint to {0}.".format(save_dir)) + + with FSDP.state_dict_type( + model, + StateDictType.SHARDED_STATE_DICT): + state_dict = { + "model": model.state_dict(), + "optim": FSDP.optim_state_dict(model, optimizer), + "scheduler": scheduler.state_dict(), + "total_steps": user_content["total_steps"], + "start_batch_index": user_content["start_batch_index"], + } + dist_cp.save_state_dict( + state_dict=state_dict, + storage_writer=dist_cp.FileSystemWriter(save_dir) + ) + dist.barrier() + if dist.get_rank() == 0: + logger.info("Completed checkpoint.") + +def get_last_checkpoint(checkpoint_paths, model_type): + steps = [int(re.findall(r'\d+steps', checkpoint.stem)[0].replace('steps','')) \ + for checkpoint in checkpoint_paths] + checkpoints = sorted([(step, path) for step,path in zip(steps, checkpoint_paths)]) + + # find last checkpoint, skipping incomplete ones + for step, path in reversed(checkpoints): + metadata_path = path.joinpath(".metadata") + if not metadata_path.exists(): + logger.warn(f"{metadata_path} not found. 
Skipping this incomplete checkpoint") + continue + return path.as_posix() + else: + return None + +def load_checkpoint(model, optimizer, scheduler, checkpoint_dir, model_type, device): + checkpoint_paths = list(Path(checkpoint_dir).glob(f"{model_type}-*steps")) + last_checkpoint = get_last_checkpoint(checkpoint_paths, model_type) + if last_checkpoint is None: + if dist.get_rank() == 0: + logger.info("No Checkpoints Found") + return( + model, + optimizer, + scheduler, + 0, + 0, + ) + if dist.get_rank() == 0: + logger.info("Loading checkpoint from %s ...", last_checkpoint) + with FSDP.state_dict_type( + model, + StateDictType.SHARDED_STATE_DICT, + ): + state_dict = { + "model": model.state_dict(), + "scheduler": scheduler.state_dict(), + "total_steps": 0, + "start_batch_index": 0, + # cannot load the optimizer state_dict together with the model state_dict + } + dist_cp.load_state_dict( + state_dict=state_dict, + storage_reader=dist_cp.FileSystemReader(last_checkpoint), + ) + model.load_state_dict(state_dict["model"]) + scheduler.load_state_dict(state_dict["scheduler"]) + if dist.get_rank() == 0: + logger.info("Loaded model state from disk") + logger.info("Loading optimizer state from disk") + optim_state = load_sharded_optimizer_state_dict( + model_state_dict=state_dict["model"], + optimizer_key="optim", + storage_reader=dist_cp.FileSystemReader(last_checkpoint), + ) + if dist.get_rank() == 0: + logger.info("Loaded and sharded optimizer state from disk") + with warnings.catch_warnings(): + warnings.simplefilter("ignore", UserWarning) + # UserWarning to replace all_gather_base with all_gather_into_tensor floods the logs + flattened_osd = FSDP.optim_state_dict_to_load( + model, optimizer, optim_state["optim"] + ) + + if dist.get_rank() == 0: + logger.info("Converted optimizer state dict for FSDP") + optimizer.load_state_dict(flattened_osd) + dist.barrier() + if dist.get_rank() == 0: + logger.info("Checkpoint loaded from %s.", last_checkpoint) + return ( + model, + optimizer, + scheduler, + state_dict["total_steps"], + state_dict["start_batch_index"], + ) diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/concat_dataset.py b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/concat_dataset.py new file mode 100644 index 0000000..759489d --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/concat_dataset.py @@ -0,0 +1,42 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
+# SPDX-License-Identifier: MIT-0 + +import os +import numpy as np +import datasets as hf_datasets +from torch.utils.data import IterableDataset +from typing import Dict, Iterable, Union +from transformers import PreTrainedTokenizerBase + +class ConcatTokensDataset(IterableDataset): + def __init__( + self, + hf_dataset: Union[hf_datasets.IterableDataset, hf_datasets.Dataset], + tokenizer: PreTrainedTokenizerBase, + max_length: int, + wrap: bool, + ): + os.environ['TOKENIZERS_PARALLELISM'] = 'false' + self.hf_dataset = hf_dataset + self.tokenizer = tokenizer + self.max_length = max_length + self.should_wrap = wrap + + def __iter__(self) -> Iterable[Dict[str, bytes]]: + + buffer = [] + mask_buffer = [] + for sample in self.hf_dataset: + encoded = self.tokenizer(sample['text'], + truncation=True, + padding=False) + iids = encoded['input_ids'] + mask = encoded['attention_mask'] + buffer = buffer + iids + [self.tokenizer.eos_token_id] + mask_buffer = mask_buffer + mask + [1] + while len(buffer) >= self.max_length: + concat_sample = buffer[:self.max_length] + buffer = buffer[self.max_length:] if self.should_wrap else [] + concat_sample_mask = mask_buffer[:self.max_length] + mask_buffer = mask_buffer[self.max_length:] if self.should_wrap else [] + yield np.array(concat_sample) diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/train_utils.py b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/train_utils.py new file mode 100644 index 0000000..7254fe0 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/model_utils/train_utils.py @@ -0,0 +1,557 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 + +import os +import math +import functools +import numpy as np +import torch +import torch.distributed as dist +from torch.utils.data import DataLoader +from datetime import datetime +import tqdm +import logging +from torch.distributed.fsdp import BackwardPrefetch, ShardingStrategy +from transformers import AutoTokenizer +from datasets import load_dataset + +from model_utils.concat_dataset import ConcatTokensDataset + +from transformers import LlamaForCausalLM, LlamaTokenizer, LlamaConfig +from transformers.models.llama.modeling_llama import LlamaDecoderLayer + +g_gigabyte = 1024**3 + +def setup(): + # initialize the process group + dist.init_process_group("nccl") + + +def cleanup(): + dist.destroy_process_group() + +def get_date_of_run(): + """create date and time for file save uniqueness + example: 2022-05-07-08:31:12_PM' + """ + date_of_run = datetime.now().strftime("%Y-%m-%d-%I:%M:%S_%p") + print(f"--> current date and time of run = {date_of_run}") + return date_of_run + + + +def format_metrics_to_gb(item): + """quick function to format numbers to gigabyte and round to 4 digit precision""" + metric_num = item / g_gigabyte + metric_num = round(metric_num, ndigits=4) + return metric_num + +def train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=None): + model.train() + local_rank = int(os.environ['LOCAL_RANK']) + fsdp_loss = torch.zeros(2).to(local_rank) + + if sampler: + sampler.set_epoch(epoch) + if rank==0: + inner_pbar = tqdm.tqdm( + range(len(train_loader)), colour="blue", desc="r0 Training Epoch" + ) + for batch in train_loader: + for key in batch.keys(): + batch[key] = batch[key].to(local_rank) + optimizer.zero_grad() + output = 
model(input_ids=batch["source_ids"],attention_mask=batch["source_mask"],labels=batch["target_ids"] ) + loss = output["loss"] + loss.backward() + optimizer.step() + fsdp_loss[0] += loss.item() + fsdp_loss[1] += len(batch) + if rank==0: + inner_pbar.update(1) + + dist.all_reduce(fsdp_loss, op=dist.ReduceOp.SUM) + train_accuracy = fsdp_loss[0] / fsdp_loss[1] + + + if rank == 0: + inner_pbar.close() + print( + f"Train Epoch: \t{epoch}, Loss: \t{train_accuracy:.4f}" + ) + return train_accuracy + + +def validation(model, rank, world_size, val_loader): + model.eval() + correct = 0 + local_rank = int(os.environ['LOCAL_RANK']) + fsdp_loss = torch.zeros(2).to(local_rank) + if rank == 0: + inner_pbar = tqdm.tqdm( + range(len(val_loader)), colour="green", desc="Validation Epoch" + ) + with torch.no_grad(): + for batch in val_loader: + for key in batch.keys(): + batch[key] = batch[key].to(local_rank) + output = model(input_ids=batch["source_ids"],attention_mask=batch["source_mask"],labels=batch["target_ids"]) + fsdp_loss[0] += output["loss"].item() # sum up batch loss + fsdp_loss[1] += len(batch) + + if rank==0: + inner_pbar.update(1) + + dist.all_reduce(fsdp_loss, op=dist.ReduceOp.SUM) + val_loss = fsdp_loss[0] / fsdp_loss[1] + if rank == 0: + inner_pbar.close() + print(f"Validation Loss: {val_loss:.4f}") + return val_loss + +def get_model_config(args): + if "gpt_neox" in args.model_type: + from transformers import GPTNeoXConfig + + model_config = GPTNeoXConfig( + vocab_size=args.vocab_size, + hidden_size=args.hidden_width, + num_hidden_layers=args.num_layers, + num_attention_heads=args.num_heads, + hidden_act="gelu", + intermediate_size=4 * args.hidden_width, + rotary_pct=args.rotary_pct, + rotary_emb_base=args.rotary_emb_base, + max_position_embeddings=args.max_context_width, + layer_norm_epsilon=1e-05, + initializer_range=args.initializer_range, + use_cache=False, + parallel_attn_output=True, + ) + elif "llama_v2" in args.model_type: + from transformers import LlamaConfig + + model_config = LlamaConfig( + vocab_size=args.vocab_size, + hidden_size=args.hidden_width, + intermediate_size=args.intermediate_size, + num_hidden_layers=args.num_layers, + num_attention_heads=args.num_heads, + num_key_value_heads=args.num_key_value_heads, + hidden_act="silu", + max_position_embeddings=args.max_context_width, + initializer_range=args.initializer_range, + rms_norm_eps=1e-5, + use_cache=False, + pretraining_tp=1, + tie_word_embeddings=False, + rope_scaling=None, + ) + elif "llama_v3" in args.model_type: + from transformers import LlamaConfig + + model_config = LlamaConfig( + vocab_size=args.vocab_size, + hidden_size=args.hidden_width, + intermediate_size=args.intermediate_size, + num_hidden_layers=args.num_layers, + num_attention_heads=args.num_heads, + num_key_value_heads=args.num_key_value_heads, + hidden_act="silu", + max_position_embeddings=args.max_context_width, + initializer_range=args.initializer_range, + rms_norm_eps=1e-5, + use_cache=False, + pretraining_tp=1, + tie_word_embeddings=False, + rope_scaling= {"type": "dynamic", "factor": 2.0}, + ) + + + elif "mixtral" in args.model_type: + from transformers import MixtralConfig + model_config = MixtralConfig( + vocab_size=args.vocab_size, + hidden_size=args.hidden_width, + intermediate_size=args.intermediate_size, + num_hidden_layers=args.num_layers, + num_attention_heads=args.num_heads, + num_key_value_heads=args.num_key_value_heads, + hidden_act="silu", + max_position_embeddings=args.max_context_width, + initializer_range=args.initializer_range, + 
rms_norm_eps=1e-5, + use_cache=False, + tie_word_embeddings=False, + num_experts_per_tok=2, + num_local_experts=8, + ) + elif "mistral" in args.model_type: + from transformers import MistralConfig + model_config = MistralConfig( + vocab_size=args.vocab_size, + hidden_size=args.hidden_width, + intermediate_size=args.intermediate_size, + num_hidden_layers=args.num_layers, + num_attention_heads=args.num_heads, + num_key_value_heads=args.num_key_value_heads, + hidden_act="silu", + max_position_embeddings=args.max_context_width, + initializer_range=args.initializer_range, + rms_norm_eps=1e-5, + use_cache=False, + tie_word_embeddings=False + ) + else: + raise NotImplementedError(f"Model {args.model_type} not implemented") + return model_config + +def compute_num_params(model): + """Get num params.""" + num_params = 0 + seen = set() + for p in model.parameters(): # pylint: disable=invalid-name + if p not in seen: + seen.add(p) + if hasattr(p, "ds_shape"): + num_params += np.prod(p.ds_shape) + else: + num_params += np.prod(p.size()) + + return num_params + +_logger = None +def get_logger(): + global _logger + if _logger is None: + logging.getLogger("torch.distributed.checkpoint._dedup_tensors").setLevel(logging.ERROR) + logging.getLogger("torch.distributed.distributed_c10d").setLevel(logging.ERROR) + _logger = logging.getLogger(__name__) + _logger.setLevel(logging.INFO) + _logger.handlers = [] + ch = logging.StreamHandler() + formatter = logging.Formatter( + "%(asctime)s %(levelname).1s " "[%(filename)s:%(lineno)d] %(message)s", + "%Y-%m-%d %H:%M:%S", + ) + ch.setFormatter(formatter) + _logger.addHandler(ch) + _logger.propagate = False + return _logger + +def get_transformer_layer(model_type="gpt2"): + """Get transformer layer.""" + if model_type == "gpt2": + from transformers.models.gpt2.modeling_gpt2 import GPT2Block + + transformer_layer = GPT2Block + + elif model_type == "gpt_neox": + from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXLayer + + transformer_layer = GPTNeoXLayer + + elif model_type == "bloom": + from transformers.models.bloom.modeling_bloom import BloomBlock + + transformer_layer = BloomBlock + + elif model_type == "flash_gptneox": + from flash_attn.modules.block import ParallelBlock + + # TODO: Add support for Block + transformer_layer = ParallelBlock + elif model_type == "llama_v2": + from transformers.models.llama.modeling_llama import LlamaDecoderLayer + + transformer_layer = LlamaDecoderLayer + + elif model_type == "llama_v3": + from transformers.models.llama.modeling_llama import LlamaDecoderLayer + + transformer_layer = LlamaDecoderLayer + + elif model_type == "mixtral": + from transformers.models.mixtral.modeling_mixtral import MixtralDecoderLayer + + transformer_layer = MixtralDecoderLayer + + elif model_type == "mistral": + from transformers.models.mistral.modeling_mistral import MistralDecoderLayer + + transformer_layer = MistralDecoderLayer + + else: + raise NotImplementedError(f"Model type {model_type} not implemented") + + return transformer_layer + +def get_sharding_strategy(strategy: str): + """Get sharding strategy.""" + sharding_strategy = getattr(ShardingStrategy, strategy.upper()) + _logger.debug("Translating %s to %s.", strategy, sharding_strategy) + return sharding_strategy + + +def get_backward_fetch_policy(policy: str): + """Get backward fetch policy.""" + backward_fetch_policy = getattr(BackwardPrefetch, policy.upper()) + _logger.debug("Translating %s to %s.", policy, backward_fetch_policy) + return backward_fetch_policy + +def 
apply_activation_checkpoint(args, model=None): + from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import ( + CheckpointImpl, + apply_activation_checkpointing, + checkpoint_wrapper, + ) + + transformer_layer = get_transformer_layer(args.model_type) + check_fn_gpt = lambda submodule: isinstance( + submodule, transformer_layer + ) + entrant_wrapper = functools.partial( + checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT + ) + apply_activation_checkpointing( + model, checkpoint_wrapper_fn=entrant_wrapper, check_fn=check_fn_gpt + ) + +def get_param_groups_by_weight_decay(module): + """Get param groups.""" + weight_decay_params = {"params": []} + no_weight_decay_params = {"params": [], "weight_decay": 0.0} + param_ids = set() + + from torch.nn import LayerNorm + + for module_ in module.modules(): + # if isinstance(module_, FusedLayerNorm) or + if isinstance(module_, LayerNorm): + for p in list( + module_._parameters.values() + ): # pylint: disable=invalid-name,protected-access + if p is not None and id(p) not in param_ids: + no_weight_decay_params["params"].append(p) + param_ids.add(id(p)) + else: + for n, p in list( + module_._parameters.items() + ): # pylint: disable=invalid-name,protected-access + if p is not None and n != "bias" and id(p) not in param_ids: + weight_decay_params["params"].append(p) + param_ids.add(id(p)) + for n, p in list( + module_._parameters.items() + ): # pylint: disable=invalid-name,protected-access + if p is not None and n == "bias" and id(p) not in param_ids: + no_weight_decay_params["params"].append(p) + param_ids.add(id(p)) + return weight_decay_params, no_weight_decay_params + +class AnnealingLR: # pylint: disable=too-many-instance-attributes + """Anneals the learning rate.""" + + def __init__( # pylint: disable=too-many-arguments + self, + optimizer, + start_lr, + warmup_iter, + plateau_iter, + total_iters, + decay_style, + last_iter, + min_lr=0.0, + use_checkpoint_lr_scheduler=True, + override_lr_scheduler=False, + ): + + # Class values. + self.optimizer = optimizer + self.start_lr = start_lr + self.min_lr = min_lr + self.warmup_iter = warmup_iter + self.plateau_iter = plateau_iter + self.num_iters = last_iter + self.end_iter = total_iters + assert self.end_iter > 0 + self.decay_style = decay_style + self.override_lr_scheduler = override_lr_scheduler + self.use_checkpoint_lr_scheduler = use_checkpoint_lr_scheduler + if self.override_lr_scheduler: + assert not self.use_checkpoint_lr_scheduler, ( + "both override and " "use-checkpoint are set." + ) + # Set the learning rate + self.step(self.num_iters) + self.rank = dist.get_rank() + + def get_lr(self): + """Learning rate decay functions from: + https://openreview.net/pdf?id=BJYwwY9ll pg. 4""" + + num_iters_ = min(self.num_iters, self.end_iter - self.warmup_iter) + # Warmup. 
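+        # During warmup the LR ramps linearly from 0 to start_lr over warmup_iter steps; after warmup the
+        # remaining step count drives the selected decay style (linear, plateau, cosine, exponential, or
+        # constant), and the result is clamped below by min_lr.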
+ if self.warmup_iter > 0 and self.num_iters <= self.warmup_iter: + return float(self.start_lr) * num_iters_ / self.warmup_iter + + num_iters_ = num_iters_ - self.warmup_iter + if self.decay_style == "linear": + lr = self.start_lr * (self.end_iter - num_iters_) / self.end_iter + elif self.decay_style == "plateau": + if self.num_iters <= self.plateau_iter: + lr = self.start_lr + else: + lr = ( + self.start_lr + * (self.end_iter - self.num_iters) + / (self.end_iter - self.plateau_iter) + ) + elif self.decay_style == "cosine": + lr = self.start_lr / 2.0 * (math.cos(math.pi * num_iters_ / self.end_iter) + 1) + elif self.decay_style == "exponential": + # exp(-0.693) = 1/2 + lr = self.start_lr * math.exp(-0.693 * num_iters_ / self.end_iter) + else: + lr = self.start_lr + return max(lr, self.min_lr) + + def step(self, step_num=None): + """Set lr for all parameters groups.""" + if step_num is None: + step_num = self.num_iters + 1 + self.num_iters = step_num + new_lr = self.get_lr() + for group in self.optimizer.param_groups: + group["lr"] = new_lr + + def state_dict(self): + """State dict.""" + state_dict = { + "start_lr": self.start_lr, + "warmup_iter": self.warmup_iter, + "num_iters": self.num_iters, + "decay_style": self.decay_style, + "end_iter": self.end_iter, + "min_lr": self.min_lr, + } + return state_dict + + def _check_and_set(self, cls_value, sd_value, name): + """Auxiliary function for checking the values in the checkpoint and + setting them.""" + if self.override_lr_scheduler: + if self.rank == 0: + _logger.info(f"Overriding {name} value to {cls_value}") + return cls_value + + if not self.use_checkpoint_lr_scheduler: + assert ( + cls_value == sd_value + ), f"AnnealingLR: class input value and checkpoint values for {name} do not match" + if self.rank == 0: + _logger.info(f" > using checkpoint value {sd_value} for {name}") + return sd_value + + def load_state_dict(self, sd): + """Load state dict.""" + self.start_lr = self._check_and_set(self.start_lr, sd["start_lr"], "learning rate") + self.min_lr = self._check_and_set(self.min_lr, sd["min_lr"], "minimum learning rate") + self.warmup_iter = self._check_and_set( + self.warmup_iter, sd["warmup_iter"], "warmup iterations" + ) + self.end_iter = self._check_and_set( + self.end_iter, sd["end_iter"], "total number of iterations" + ) + self.decay_style = self._check_and_set(self.decay_style, sd["decay_style"], "decay style") + + self.num_iters = sd["num_iters"] + self.step(self.num_iters) + +def get_learning_rate_scheduler(optimizer, args): + """Get learning rate scheduler.""" + use_checkpoint_lr_scheduler = args.resume_from_checkpoint is not None + + # Add linear learning rate scheduler. 
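+    # The decay style comes from args.lr_decay_style (not necessarily linear); warmup and plateau are
+    # fractions of the decay horizon (lr_decay_iters if set, otherwise max_steps) and are converted to
+    # absolute iteration counts below before constructing AnnealingLR.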
+ if args.lr_decay_iters is not None: + num_iters = args.lr_decay_iters + else: + num_iters = args.max_steps + num_iters = max(1, num_iters) + init_step = 0 + warmup_iter = args.warmup * num_iters + plateau_iter = warmup_iter + args.plateau * num_iters + lr_scheduler = AnnealingLR( + optimizer, + start_lr=args.lr, + warmup_iter=warmup_iter, + plateau_iter=plateau_iter, + total_iters=num_iters, + decay_style=args.lr_decay_style, + last_iter=init_step, + min_lr=args.min_lr, + use_checkpoint_lr_scheduler=use_checkpoint_lr_scheduler, + override_lr_scheduler=False, + ) + + return lr_scheduler + +# def create_streaming_dataloader(dataset, +# tokenizer, +# name=None, +# global_rank=0, +# batch_size=1, +# max_context_width=4096, +# workers=4, +# split=None): +# print(f"dataset={dataset}, name={name}") +# tokenizer = AutoTokenizer.from_pretrained(tokenizer,legacy=False) +# data = load_dataset(dataset, name=name, streaming=True, split=split).shuffle(42+global_rank) +# train_concat_dataset = ConcatTokensDataset(data, tokenizer, max_context_width, True) +# train_dataloader = DataLoader(train_concat_dataset, +# batch_size=batch_size, +# num_workers=workers, +# pin_memory=True, +# prefetch_factor=4, +# timeout=600) +# return train_dataloader + +def create_streaming_dataloader(dataset, + tokenizer, + name=None, + global_rank=0, + batch_size=1, + max_context_width=4096, + workers=4, + split=None): + print(f"dataset={dataset}, name={name}") + tokenizer = AutoTokenizer.from_pretrained(tokenizer,legacy=False) + + # Check if dataset is a local path + if dataset.startswith('/'): + # Local file path - load from JSON.gz files + data_files = f"{dataset}/*.json.gz" + # For local files, always use 'train' split and handle validation differently + data = load_dataset("json", data_files=data_files, streaming=True, split='train').shuffle(42+global_rank) + + # If validation split is requested, skip some data for validation + if split == 'validation': + # Skip first 90% of data for validation (use last 10%) + data = data.skip(int(0.9 * 1000000)) # Approximate skip for validation + elif split == 'train': + # Use first 90% for training + data = data.take(int(0.9 * 1000000)) # Approximate take for training + else: + # HuggingFace dataset - use original logic + data = load_dataset(dataset, name=name, streaming=True, split=split).shuffle(42+global_rank) + + train_concat_dataset = ConcatTokensDataset(data, tokenizer, max_context_width, True) + train_dataloader = DataLoader(train_concat_dataset, + batch_size=batch_size, + num_workers=workers, + pin_memory=True, + prefetch_factor=4, + timeout=600) + return train_dataloader + + diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/requirements.txt b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/requirements.txt new file mode 100644 index 0000000..f896d6c --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/requirements.txt @@ -0,0 +1,8 @@ +--extra-index-url https://download.pytorch.org/whl/cu128 +datasets +torch==2.7.1 +torchaudio==2.7.1 +torchvision==0.22.1 +transformers==4.52.4 +mlflow==3.0.0 +sagemaker-mlflow \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/train.py b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/train.py new file mode 100644 index 0000000..1fb6c8e --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/fsdp/src/train.py @@ -0,0 +1,324 @@ +# Copyright Amazon.com, Inc. 
or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 + +import datetime +import functools +import math +import re +import time + +import numpy as np +import torch +from torch import optim +import torch.distributed as dist +import torch.utils.data + +import transformers +from transformers import AutoModelForCausalLM, AutoTokenizer +from datasets import load_dataset + +from torch.distributed.fsdp import FullyShardedDataParallel as FSDP +from torch.distributed.fsdp import MixedPrecision +from torch.distributed.fsdp import ShardingStrategy +from torch.distributed.fsdp import CPUOffload +from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy +from torch.utils.data import DataLoader + +from model_utils.concat_dataset import ConcatTokensDataset +from model_utils.train_utils import (get_model_config, + compute_num_params, + get_transformer_layer, + get_sharding_strategy, + get_backward_fetch_policy, + apply_activation_checkpoint, + get_param_groups_by_weight_decay, + get_logger, + get_learning_rate_scheduler, + create_streaming_dataloader) +from model_utils.checkpoint import save_checkpoint, load_checkpoint +from model_utils.arguments import parse_args + + +import logging +import sys +import mlflow + +logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout) + +logger = logging.getLogger(__name__) +logger.setLevel(logging.INFO) + +arn = "arn:aws:sagemaker:us-east-1:011528295005:mlflow-tracking-server/hyperpod-ts-demo" +mlflow.set_tracking_uri(arn) + + +def eval_model(model, dataloader, num_batches): + """Eval step.""" + model = model.eval() + n_batches = 0 + loss = 0.0 + + with torch.no_grad(): + for batch_idx, input_data in enumerate(dataloader): + if batch_idx >= num_batches: + break + + loss += model(input_ids=input_data, attention_mask=None, labels=input_data)["loss"] + n_batches += 1 + + if n_batches > 0: + detached_loss = loss.detach() + torch.distributed.all_reduce(detached_loss) + loss = detached_loss.item() / dist.get_world_size() + loss /= n_batches + ppl = math.exp(loss) + else: + loss = -1.0 + ppl = -1.0 + + return loss, ppl + +def train( + model, + optimizer, + train_dataloader, + val_dataloader, + lr_scheduler, + model_config, + num_params, + args, + global_rank, + world_size, + total_steps=0, + start_batch_index=0 + ): + model.train() + for index in range(args.epochs): + for batch_idx, input_data in enumerate(train_dataloader): + if batch_idx < start_batch_index: + continue + optimizer.zero_grad(set_to_none=True) + step_start = time.time() + loss = model(input_ids=input_data, attention_mask=None, labels=input_data)["loss"] + loss.backward() + model.clip_grad_norm_(args.grad_clip) + optimizer.step() + lr_scheduler.step() + total_steps += 1 + loss_metric = loss.item() + step_time = time.time() - step_start + sample_processed = input_data.shape[0] * world_size + throughput = sample_processed / step_time + loss_scalar = loss.item() + current_lr = lr_scheduler.get_lr() + if global_rank==0 and batch_idx%args.logging_freq==0: + logger.info( + "Batch %d Loss: %.5f, Speed: %.2f samples/sec, lr: %.6f", # pylint: disable=line-too-long + batch_idx, + loss_scalar, + throughput, + current_lr, + ) + # Log training metrics to MLflow + mlflow.log_metric("train_loss", loss_scalar, step=total_steps) + mlflow.log_metric("learning_rate", current_lr, step=total_steps) + mlflow.log_metric("throughput_samples_per_sec", throughput, step=total_steps) + if args.validation_freq and not 
total_steps % args.validation_freq: + val_loss, val_ppl = eval_model( + model, val_dataloader, args.validation_batches + ) + model = model.train() + if global_rank == 0: + logger.info( + "Batch %d Validation loss: %s", + batch_idx, + val_loss, + ) + # Log validation metrics to MLflow + mlflow.log_metric("val_loss", val_loss, step=total_steps) + mlflow.log_metric("val_perplexity", val_ppl, step=total_steps) + if args.checkpoint_dir and not total_steps % args.checkpoint_freq: + user_content = { + "cli_args": args.__dict__, + "num_params": num_params, + "total_steps": total_steps, + "model_config": model_config, + "start_batch_index": batch_idx + 1, + } + sub_dir = f"{args.model_type}-{total_steps}steps" + + save_checkpoint( + model, + optimizer, + lr_scheduler, + user_content, + args.checkpoint_dir, + sub_dir, + ) + if total_steps >= args.max_steps: + break + + +def main(args): + dist.init_process_group() + global_rank = dist.get_rank() + device = global_rank % torch.cuda.device_count() + world_size = dist.get_world_size() + + # Initialize MLflow experiment (only on rank 0) + if global_rank == 0: + mlflow.set_experiment("fsdp-training") + mlflow.start_run() + # Log hyperparameters + mlflow.log_params({ + "model_type": args.model_type, + "epochs": args.epochs, + "train_batch_size": args.train_batch_size, + "lr": args.lr, + "weight_decay": args.weight_decay, + "sharding_strategy": args.sharding_strategy, + "world_size": world_size, + "bf16": args.bf16, + "activation_checkpointing": args.activation_checkpointing, + "cpu_offload": args.cpu_offload + }) + + if args.bf16: + dtype = torch.bfloat16 + else: + dtype = torch.get_default_dtype() + + model_config = get_model_config(args) + if global_rank == 0: + logger.info( + "Creating Model" + ) + # Instantiate model on CPU on rank=0 only to prevent CPU OOM + # (e.g. 70B * 4 bytes * 8 processes > 2T RAM available on P5) + if global_rank == 0: + model = AutoModelForCausalLM.from_config(model_config) + else: + with torch.device("meta"): + # Instantiating model on `meta` device doesn't consume CPU memory, + # but requires specifing `param_init_fn=...` + # and `sync_module_states=True` in FSDP c-tor. 
+ model = AutoModelForCausalLM.from_config(model_config) + + num_params = compute_num_params(model) + if global_rank == 0: + logger.info( + "Created model with total parameters: %d (%.2f B)", num_params, num_params * 1e-9 + ) + transformer_layer = get_transformer_layer(args.model_type) + + gpt_auto_wrap_policy = functools.partial( + transformer_auto_wrap_policy, + transformer_layer_cls={ + transformer_layer, + }, + ) + + torch.cuda.set_device(device) + mixed_precision_policy = MixedPrecision( + param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype + ) + + if args.sharding_strategy=="full": + sharding_strategy = ShardingStrategy.FULL_SHARD + elif args.sharding_strategy=="hybrid": + sharding_strategy = ShardingStrategy.HYBRID_SHARD + else: + raise NotImplementedError("Available sharding strategies are full and hybrid") + + if args.cpu_offload == 1: + cpu_offload = CPUOffload(offload_params=True) + else: + cpu_offload = None + + model = FSDP( + model, + auto_wrap_policy=gpt_auto_wrap_policy, + mixed_precision=mixed_precision_policy, + limit_all_gathers=args.limit_all_gathers, + device_id=torch.cuda.current_device(), + use_orig_params=False, + sharding_strategy=sharding_strategy, + cpu_offload=cpu_offload, + sync_module_states=True, + param_init_fn=(lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)) + if global_rank != 0 else None, + ) + + if global_rank == 0: + logger.info("Wrapped model with FSDP") + + if args.activation_checkpointing > 0: + apply_activation_checkpoint(args, model=model) + + if args.offload_activations > 0: + from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import offload_wrapper + + model = offload_wrapper(model) + + param_groups = get_param_groups_by_weight_decay(model) + + optimizer = optim.AdamW( + param_groups, betas=(args.beta1, args.beta2), lr=args.lr, weight_decay=args.weight_decay + ) + + if global_rank == 0: + logger.info("Created optimizer") + + lr_scheduler = get_learning_rate_scheduler(optimizer, args) + + if args.resume_from_checkpoint: + ( + model, + optimizer, + lr_scheduler, + total_steps, + start_batch_index, + ) = load_checkpoint(model, + optimizer, + lr_scheduler, + args.resume_from_checkpoint, + args.model_type, + device) + else: + total_steps = 0 + start_batch_index = 0 + + train_dataloader = create_streaming_dataloader(args.dataset, + args.tokenizer, + name=args.dataset_config_name, + batch_size=args.train_batch_size, + split='train') + + val_dataloader = create_streaming_dataloader(args.dataset, + args.tokenizer, + name=args.dataset_config_name, + batch_size=args.train_batch_size, + split='validation') + + train(model, + optimizer, + train_dataloader, + val_dataloader, + lr_scheduler, + model_config, + num_params, + args, + global_rank, + world_size, + total_steps, + start_batch_index) + + # End MLflow run (only on rank 0) + if global_rank == 0: + mlflow.end_run() + + dist.destroy_process_group() + +if __name__ == "__main__": + args, _ = parse_args() + main(args) diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/hpto_1b.yaml b/Container-Root/hyperpod/deployment/eks/demo/hero/training/hpto_1b.yaml new file mode 100644 index 0000000..f8812d5 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/hpto_1b.yaml @@ -0,0 +1,160 @@ +apiVersion: sagemaker.amazonaws.com/v1 +kind: HyperPodPyTorchJob +metadata: + name: llama3-2-1b-fsdp-hpto + namespace: hyperpod-ns-training-team + labels: + kueue.x-k8s.io/queue-name: hyperpod-ns-training-team-localqueue + 
kueue.x-k8s.io/priority-class: training-priority +spec: + nprocPerNode: "$GPU_PER_NODE" + runPolicy: + jobMaxRetryCount: 100 # Maximum number of restarts at the process level + restartPolicy: + numRestartBeforeFullJobRestart: 10 # Maximum number of restarts at the process level before the operator restarts at the job level + evalPeriodSeconds: 43200 # The period of evaluating the restart limit in seconds + maxFullJobRestarts: 10 # Maximum number of full job restarts before the job fails + cleanPodPolicy: All # Specifies the pods that the operator should clean. Accepted values are All, OnlyComplete, and None + logMonitoringConfiguration: + - name: JobStart + logPattern: '.*Loss:.*' + expectedStartCutOffInSeconds: 720 + - name: JobHangingDetection + logPattern: '.*Loss:.*' + expectedRecurringFrequencyInSeconds: 1800 + replicaSpecs: + - name: pods + replicas: $ACCEL_INSTANCE_COUNT + template: + metadata: + labels: + job-name: llama3-2-1b-fsdp-hpto + replica-type: pods + spec: + serviceAccountName: sagemaker-mlflow-sa + volumes: + - name: shmem + emptyDir: + medium: Memory + sizeLimit: "200Gi" + # hostPath: + # path: /dev/shm + - name: local + hostPath: + path: /mnt/k8s-disks/0 + - name: fsx-storage + persistentVolumeClaim: + claimName: fsx-claim + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchLabels: + job-name: llama3-2-1b-fsdp-hpto + topologyKey: kubernetes.io/hostname + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: sagemaker.amazonaws.com/node-health-status + operator: In + values: + - Schedulable + - key: sagemaker.amazonaws.com/compute-type + operator: In + values: + - '${ACCEL_INSTANCE_TYPE}' + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: DoNotSchedule + labelSelector: + matchLabels: + job-name: llama3-2-1b-fsdp-hpto + containers: + - name: pytorch + image: ${REGISTRY}${IMAGE}:${TAG} + imagePullPolicy: Always + resources: + requests: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + limits: + nvidia.com/gpu: $GPU_PER_NODE + vpc.amazonaws.com/efa: $EFA_PER_NODE + env: + # for P5 FI_* should be commented out + # - name: LOGLEVEL + # value: "DEBUG" + - name: FI_PROVIDER + value: efa + #- name: FI_EFA_USE_DEVICE_RDMA + # value: "1" + - name: FI_EFA_FORK_SAFE + value: "1" + #- name: FI_LOG_LEVEL + # value: "1" + #- name: FI_EFA_ENABLE_SHM_TRANSFER + # value: "1" + - name: TORCH_DISTRIBUTED_DEBUG + value: "DETAIL" + - name: TORCH_NCCL_ENABLE_MONITORING + value: "1" + - name: TORCH_NCCL_TRACE_BUFFER_SIZE + value: "20000" + - name: TORCH_NCCL_DUMP_ON_TIMEOUT + value: "1" + - name: TORCH_NCCL_DEBUG_INFO_TEMP_FILE + value: "/local/nccl_trace_rank_" + - name: PYTORCH_CUDA_ALLOC_CONF + value: "expandable_segments:True" + - name: NCCL_DEBUG + value: "INFO" + - name: NCCL_SOCKET_IFNAME + value: "^lo" + - name: TORCH_NCCL_ASYNC_ERROR_HANDLING + value: "1" + - name: CUDA_LAUNCH_BLOCKING + value: "1" + - name: HF_TOKEN + value: "${HF_TOKEN}" + #- name: TORCH_DIST_INIT_BARRIER + # value: "1" + #- name: NCCL_IGNORE_DISABLED_P2P + # value: "1" + #- name: NCCL_NVLS_ENABLE + # value: "0" + command: + - hyperpodrun + - '--tee=3' + - '--log_dir=/tmp/hyperpod' + - '--nproc_per_node=$GPU_PER_NODE' + - '--nnodes=$ACCEL_INSTANCE_COUNT' + - /fsdp/train.py + - --max_context_width=8192 + - --num_key_value_heads=2 + - --intermediate_size=8192 + - --hidden_width=2048 + - --num_layers=16 + - --num_heads=32 + - 
--model_type=llama_v3 + - --tokenizer=hf-internal-testing/llama-tokenizer + - --checkpoint_freq=10 + - --validation_freq=100 + - --max_steps=5000 + - --checkpoint_dir=/fsx/checkpoint1 + # - --dataset=allenai/c4 + - --dataset=/fsx/datasets/c4/en + # - --dataset_config_name=en + - --resume_from_checkpoint=/fsx/checkpoint1 + - --train_batch_size=1 + - --val_batch_size=1 + - --sharding_strategy=full # https://pytorch.org/docs/stable/fsdp.html + - --offload_activations=1 + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: local + mountPath: /local + - name: fsx-storage + mountPath: /fsx \ No newline at end of file diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/node-error.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/training/node-error.sh new file mode 100755 index 0000000..d1d3461 --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/node-error.sh @@ -0,0 +1,12 @@ +#!/bin/bash +export NODE=$(kubectl get pods -n hyperpod-ns-training-team -l HPJob=llama3-2-1b-fsdp-hpto -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | shuf -n 1) + +if [ -z "$NODE" ]; then + echo "No GPU nodes found!" + exit 1 +fi + +echo "Selected GPU node: $NODE" +kubectl label node $NODE \ + sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot \ + --overwrite=true diff --git a/Container-Root/hyperpod/deployment/eks/demo/hero/training/submit-job.sh b/Container-Root/hyperpod/deployment/eks/demo/hero/training/submit-job.sh new file mode 100755 index 0000000..c2a774c --- /dev/null +++ b/Container-Root/hyperpod/deployment/eks/demo/hero/training/submit-job.sh @@ -0,0 +1,9 @@ +#!/bin/bash + +echo "Running: envsubst < training/hpto_1b.yaml | kubectl apply -f -" + +echo "Creating HyperPod PyTorchJob for training the Llama 3.2 1B parameter model..." + +echo "" + +envsubst < training/hpto_1b.yaml | kubectl apply -f -
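
The manifests above rely on `envsubst` to fill in environment variables such as `$ACCEL_INSTANCE_COUNT`, `$ACCEL_INSTANCE_TYPE`, `$GPU_PER_NODE`, `$EFA_PER_NODE`, `$REGISTRY`, `$IMAGE`, `$TAG`, and `$HF_TOKEN` before the job is applied. The snippet below is a minimal sketch of exporting them first; the specific values (instance type, counts, image coordinates, token) are illustrative placeholders and should be adjusted to your cluster and registry.

```bash
# Illustrative values only -- adjust to match your HyperPod cluster and container image
export ACCEL_INSTANCE_COUNT=2                    # number of GPU nodes for the job
export ACCEL_INSTANCE_TYPE=ml.g5.8xlarge         # matched against sagemaker.amazonaws.com/compute-type
export GPU_PER_NODE=1                            # GPUs per node for the chosen instance type
export EFA_PER_NODE=1                            # EFA interfaces per node, if available
export REGISTRY=<account-id>.dkr.ecr.<region>.amazonaws.com/   # placeholder registry
export IMAGE=fsdp                                # placeholder image name
export TAG=latest                                # placeholder tag
export HF_TOKEN=<your-huggingface-token>         # placeholder token

# Render the manifest and submit the HyperPodPyTorchJob
envsubst < training/hpto_1b.yaml | kubectl apply -f -
# or simply:
./training/submit-job.sh
```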