# Fine-tuning Qwen3-14B using PyTorch FSDP with Accelerate

This notebook shows how to run supervised fine-tuning (SFT) of Qwen3-14B with PyTorch FSDP, using the [Hugging Face Accelerate](https://github.com/huggingface/accelerate) library.

## Setup and Imports

In [None]:
! pip install kubernetes
! pip install boto3

In [None]:
import os
import subprocess
import sys

# Set working directory
os.chdir(os.path.expanduser('~/amazon-eks-machine-learning-with-terraform-and-kubeflow'))
print(f"Working directory: {os.getcwd()}")

# Get the src directory
src_dir = os.path.join(os.getcwd(), "src")
sys.path.insert(0, src_dir)

from k8s.utils import wait_for_helm_release_pods

# Get notebook directory
notebook_dir = os.path.join(os.getcwd(), 'examples', 'training', 'accelerate', 'qwen3-14b-sft')
print(f"Notebook directory: {notebook_dir}")

# Initialize key variables
release_name = 'accel-qwen3-14b-sft'
namespace = 'kubeflow-user-example-com'

## Step 1: Download Qwen3-14B Model Weights

Replace `YourHuggingFaceToken` with your actual Hugging Face token.

In [None]:
hf_token = 'YourHuggingFaceToken'

cmd = [
    'helm', 'install', '--debug', release_name,
    'charts/machine-learning/model-prep/hf-snapshot',
    '--set-json', f'env=[{{"name":"HF_MODEL_ID","value":"Qwen/Qwen3-14B"}},{{"name":"HF_TOKEN","value":"{hf_token}"}}]',
    '-n', namespace
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)
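
The `--set-json` value above is assembled by hand with escaped braces, which is easy to get wrong and breaks if the token contains a quote character. A minimal sketch of an alternative, assuming the chart accepts the same `env` list shown above, is to serialize the list with `json.dumps`. The helper name `build_env_json` and the `HF_TOKEN` environment variable in the usage comment are illustrative assumptions, not part of the repository.

```python
import json
import os


def build_env_json(model_id: str, token: str) -> str:
    """Return the `env=[...]` argument for `helm install --set-json`.

    Hypothetical helper: it reproduces the env list used in the cell above.
    """
    env = [
        {"name": "HF_MODEL_ID", "value": model_id},
        {"name": "HF_TOKEN", "value": token},
    ]
    # Compact separators produce the same space-free JSON as the hand-written string.
    return "env=" + json.dumps(env, separators=(",", ":"))


# Assumed usage: read the token from an environment variable instead of
# hardcoding it in the notebook (HF_TOKEN is an assumed variable name).
# hf_token = os.environ["HF_TOKEN"]
# set_json_arg = build_env_json("Qwen/Qwen3-14B", hf_token)
```

The returned string can be passed as the argument that follows `'--set-json'` in the `cmd` list. Reading the token from the environment also keeps it out of the saved notebook.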

In [None]:
# Wait for model download to complete
wait_for_helm_release_pods(release_name, namespace)

In [None]:
# Uninstall the model download job
cmd = ['helm', 'uninstall', release_name, '-n', namespace]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)
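
The cells in this notebook repeat the same `subprocess.run` pattern and never inspect the exit code, so a failed `helm` command can go unnoticed. The sketch below wraps the pattern in a small helper that prints both streams and raises on a non-zero exit code. The name `run_command` is a hypothetical helper introduced here, not something provided by the repository.

```python
import subprocess
from typing import Sequence


def run_command(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command, echo its output, and raise if it exits with an error."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    print(result.stdout)
    if result.stderr:
        # Text on stderr is not by itself a failure; the exit code decides.
        print("STDERR:", result.stderr)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command failed with exit code {result.returncode}: {' '.join(cmd)}"
        )
    return result


# Assumed usage with the variables defined earlier in this notebook:
# run_command(['helm', 'uninstall', release_name, '-n', namespace])
```

Raising an exception stops a "Run All" execution at the failing step, so later cells do not wait on a release that was never installed.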

## Step 2: Launch Fine-tuning

In [None]:
cmd = [
    'helm', 'install', '--debug', release_name,
    'charts/machine-learning/training/pytorchjob-distributed',
    '-f', f'{notebook_dir}/fine-tune.yaml',
    '-n', namespace
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

In [None]:
# Wait for fine-tuning to complete
wait_for_helm_release_pods(release_name, namespace)

In [None]:
# Uninstall the training job
cmd = ['helm', 'uninstall', release_name, '-n', namespace]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

## Output

To access the output stored on the EFS and FSx for Lustre file systems:

```bash
kubectl apply -f eks-cluster/utils/attach-pvc.yaml -n kubeflow
kubectl exec -it -n kubeflow attach-pvc -- /bin/bash
```

### Logs
Training logs are available in the `/efs/home/accel-qwen3-14b-sft/logs` folder.
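
To check training progress quickly, it can help to print the last few lines of the most recently written log file. The following is a minimal sketch, assuming Python is available wherever the EFS file system is mounted. The layout and naming of files inside the logs folder are not documented here, so the function simply searches the folder recursively; `tail_latest_log` is a hypothetical helper name.

```python
from collections import deque
from pathlib import Path


def tail_latest_log(log_dir: str, num_lines: int = 20) -> list:
    """Return the last `num_lines` lines of the newest file under `log_dir`."""
    files = [p for p in Path(log_dir).rglob("*") if p.is_file()]
    if not files:
        return []
    # Pick the file with the most recent modification time.
    latest = max(files, key=lambda p: p.stat().st_mtime)
    with latest.open("r", errors="replace") as fh:
        # A bounded deque keeps only the final lines without loading the whole file.
        return [line.rstrip("\n") for line in deque(fh, maxlen=num_lines)]


# Assumed usage, with the logs folder named above:
# for line in tail_latest_log("/efs/home/accel-qwen3-14b-sft/logs"):
#     print(line)
```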

### Output
Training output is available in the `/efs/home/accel-qwen3-14b-sft/output` folder.

### S3 Backup
Any content stored under `/fsx` is automatically backed up to your configured S3 bucket.