# Submitting a PyTorch Training Job - HyperPod SDK End-to-End Walkthrough

This example shows how to fine-tune a **Qwen3 4B Thinking** model with PyTorch FSDP and QLoRA on your HyperPod cluster.

It assumes you have completed the **Setup instructions** in [00-getting-started/00-setup.md](../00-getting-started/00-setup.md), as well as the **Training Dataset** and **Training Docker Image** steps in [00-pytorch-training-job.md](00-pytorch-training-job.md).

### Import necessary classes

In [None]:
from sagemaker.hyperpod.training import (
    HyperPodPytorchJob,
    Containers,
    ReplicaSpec,
    Resources,
    RunPolicy,
    Spec,
    Template,
    Volumes,
    VolumeMounts,
)
from sagemaker.hyperpod.common.config import Metadata

### Define the environment variables

Use the same values you used when following the steps in [00-pytorch-training-job.md](00-pytorch-training-job.md).

In [None]:
# Configuration variables
AWS_REGION = "PLEASE_FILL_IN"
AWS_ACCOUNT_ID = "PLEASE_FILL_IN"

S3_PREFIX = "qwen-cli-example"
ECR_NAME = "qwen3-finetuning"
DOCKER_IMAGE_TAG = "pytorch2.8-cu129"
JOB_NAME = "qwen3-4b-thinking-2507-fsdp"

IMAGE_URI = f"{AWS_ACCOUNT_ID}.dkr.ecr.{AWS_REGION}.amazonaws.com/{ECR_NAME}:{DOCKER_IMAGE_TAG}"
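
Because the placeholders above are plain strings, a leftover `PLEASE_FILL_IN` only surfaces later, when the image pull fails. A minimal sketch of an up-front check, assuming the private-ECR URI layout already used for `IMAGE_URI`; `build_image_uri` is a hypothetical helper and not part of the HyperPod SDK, and the region pattern is a loose shape check rather than a list of valid regions:

```python
import re


def build_image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Build a private ECR image URI, rejecting obviously unfilled values."""
    # AWS account IDs are exactly 12 digits.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"AWS_ACCOUNT_ID must be 12 digits, got {account_id!r}")
    # Loose shape check for names such as "us-west-2" (assumption: not exhaustive).
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
        raise ValueError(f"AWS_REGION does not look like a region name: {region!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"
```

Calling `build_image_uri(AWS_ACCOUNT_ID, AWS_REGION, ECR_NAME, DOCKER_IMAGE_TAG)` with the placeholders still in place raises a `ValueError` instead of producing an unusable URI.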

### Define the training job

In [None]:
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name=JOB_NAME, namespace="default"),
    nproc_per_node="4",
    replica_specs=[
        ReplicaSpec(
            name="pod",
            replicas=2,
            template=Template(
                spec=Spec(
                    containers=[
                        Containers(
                            name="training-container",
                            image=IMAGE_URI,
                            image_pull_policy="IfNotPresent",
                            command=["hyperpodrun", "--nnodes=2:2", "--nproc_per_node=4", f"/data/{S3_PREFIX}/scripts/train.py"],
                            args=["--config", f"/data/{S3_PREFIX}/scripts/args.yaml"],
                            env=[
                                {"name": "LOGLEVEL", "value": "INFO"},
                                {"name": "PYTORCH_CUDA_ALLOC_CONF", "value": "expandable_segments:True"},
                                {"name": "NCCL_DEBUG", "value": "INFO"},
                                {"name": "NCCL_SOCKET_IFNAME", "value": "^lo"},
                                {"name": "TORCH_NCCL_ASYNC_ERROR_HANDLING", "value": "1"},
                                {"name": "FI_PROVIDER", "value": "efa"},
                                {"name": "FI_EFA_FORK_SAFE", "value": "1"},
                                {"name": "NCCL_PROTO", "value": "simple"},
                            ],
                            resources=Resources(
                                requests={"nvidia.com/gpu": "4"},
                                limits={"nvidia.com/gpu": "4"},
                            ),
                            volume_mounts=[
                                VolumeMounts(name="shmem", mount_path="/dev/shm"),
                                VolumeMounts(name="local", mount_path="/local"),
                                VolumeMounts(name="fsx-volume", mount_path="/data"),
                            ],
                        )
                    ],
                    volumes=[
                        Volumes(name="shmem", host_path={"path": "/dev/shm"}),
                        Volumes(name="local", host_path={"path": "/mnt/k8s-disks/0"}),
                        Volumes(name="fsx-volume", persistent_volume_claim={"claim_name": "fsx-claim"}),
                    ]
                )
            ),
        )
    ],
    run_policy=RunPolicy(
        # The string "None" is a policy name, not Python's None: pods are kept after the job ends.
        clean_pod_policy="None",
        job_max_retry_count=100,
    ),
)
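
The job definition states the same topology in several places: `replicas=2` and `--nnodes=2:2` for the node count, and `nproc_per_node="4"`, `--nproc_per_node=4`, and the `nvidia.com/gpu` request for the processes per node. A minimal sketch that derives all of them from two numbers so they cannot drift apart; `topology_settings` is a hypothetical helper that only reproduces the values already shown above:

```python
def topology_settings(nodes: int, gpus_per_node: int, script: str) -> dict:
    """Derive the repeated topology values from a node count and GPUs per node."""
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")
    return {
        "replicas": nodes,
        "nproc_per_node": str(gpus_per_node),
        "command": [
            "hyperpodrun",
            f"--nnodes={nodes}:{nodes}",  # same value on both sides, as in the job above
            f"--nproc_per_node={gpus_per_node}",
            script,
        ],
        "gpu_resources": {"nvidia.com/gpu": str(gpus_per_node)},
        # Total number of training processes across the job.
        "world_size": nodes * gpus_per_node,
    }
```

For this walkthrough, `topology_settings(2, 4, f"/data/{S3_PREFIX}/scripts/train.py")` yields the same command list as the job definition and a world size of 8 processes.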

### Submit the training job to the HyperPod cluster

In [None]:
pytorch_job.create()

### Monitor the job status

In [None]:
print("List all jobs:")
print(HyperPodPytorchJob.list())

print("\nRefresh job and check status:")
pytorch_job.refresh()
print(pytorch_job.status)
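
A single status check is rarely enough for a long fine-tuning run. The following polling sketch is deliberately independent of the SDK, because the shape of `pytorch_job.status` is not shown here: `poll_until` is a hypothetical helper, and the caller supplies both the function that fetches the status and the test that decides when to stop.

```python
import time
from typing import Any, Callable


def poll_until(
    fetch: Callable[[], Any],
    is_done: Callable[[Any], bool],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
) -> Any:
    """Call fetch() every interval_s seconds until is_done(result) or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)
```

With the job above, `fetch` could call `pytorch_job.refresh()` and return `pytorch_job.status`. What counts as "done" depends on the fields your SDK version returns, so inspect the printed status before writing `is_done`.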

### List pods and show the training logs

In [None]:
print("List all pods:")
print(pytorch_job.list_pods())

print("\nLogs from pod-0:")
print(pytorch_job.get_logs_from_pod(f"{JOB_NAME}-pod-0"))
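
The log call above targets `{JOB_NAME}-pod-0`: the job name, the replica spec name (`pod`), and a zero-based index. Assuming the remaining pods follow the same pattern (compare against the output of `list_pods()` to confirm), a small hypothetical helper, `expected_pod_names`, can enumerate them:

```python
def expected_pod_names(job_name: str, replica_name: str, replicas: int) -> list[str]:
    """Return pod names of the assumed form <job>-<replica spec name>-<index>."""
    return [f"{job_name}-{replica_name}-{i}" for i in range(replicas)]


# Example with the objects from this walkthrough (needs a running job):
# for name in expected_pod_names(JOB_NAME, "pod", 2):
#     print(f"\nLogs from {name}:")
#     print(pytorch_job.get_logs_from_pod(name))
```

Reading the logs of every pod, not only the first, helps when a failure occurs on a single node.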

### Delete the job

In [None]:
pytorch_job.delete()