# Spectrum Fine-Tuning with Amazon SageMaker AI 
Fine-tune an LLM with PyTorch FSDP and the Spectrum technique using a SageMaker training job.

In this notebook, we fine-tune an LLM on Amazon SageMaker AI, using Python scripts and the SageMaker ModelTrainer to run a training job.

## Prerequisites

In [6]:
%pip install -r ./scripts/requirements.txt --upgrade

Collecting transformers==4.52.2 (from -r ./scripts/requirements.txt (line 1))
  Using cached transformers-4.52.2-py3-none-any.whl.metadata (40 kB)
Collecting datasets==4.0.0 (from -r ./scripts/requirements.txt (line 5))
  Using cached datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Using cached transformers-4.52.2-py3-none-any.whl (10.5 MB)
Using cached datasets-4.0.0-py3-none-any.whl (494 kB)
Installing collected packages: transformers, datasets
  Attempting uninstall: transformers
    Found existing installation: transformers 4.50.2
    Uninstalling transformers-4.50.2:
      Successfully uninstalled transformers-4.50.2
  Attempting uninstall: datasets
    Found existing installation: datasets 3.2.0
    Uninstalling datasets-3.2.0:
      Successfully uninstalled da

In [2]:
from IPython import get_ipython
get_ipython().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

***

In [1]:
import boto3
import shutil
import os
from random import randint

from datasets import load_dataset
from datasets import Dataset, DatasetDict

import sagemaker
from sagemaker.s3 import S3Uploader
from sagemaker import get_execution_role
from sagemaker import Model
from sagemaker import image_uris
from sagemaker.config import load_sagemaker_config

from sagemaker.modules.configs import Compute, InputData, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [2]:
sagemaker_session = sagemaker.Session()
s3_client = boto3.client('s3')

bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix

configs = load_sagemaker_config()

## Prepare the Spectrum layer configuration for the training job

In [2]:
spectrum_clone_folder = "~/spectrum"

In [3]:
!git clone https://github.com/cognitivecomputations/spectrum.git {spectrum_clone_folder}

fatal: destination path '/home/sagemaker-user/spectrum' already exists and is not an empty directory.


In [4]:
model_id = "Qwen/Qwen3-1.7B"
filesafe_model_id = model_id.replace("/","-")
spectrum_layer_percent = 50

spectrum_output_filename = f"snr_results_{filesafe_model_id}_unfrozenparameters_{spectrum_layer_percent}percent.yaml"
spectrum_output_filepath = f"{spectrum_clone_folder}/{spectrum_output_filename}"
spectrum_output_filepath

'~/spectrum/snr_results_Qwen-Qwen3-1.7B_unfrozenparameters_50percent.yaml'

In [5]:
print(f"python3 {spectrum_clone_folder}/spectrum.py --model-name {model_id} --top-percent {spectrum_layer_percent}")

python3 ~/spectrum/spectrum.py --model-name Qwen/Qwen3-1.7B --top-percent 50


In [7]:
!ls -al {spectrum_output_filepath}
!mkdir -p ./scripts/spectrum-layer/
!cp {spectrum_output_filepath} ./scripts/spectrum-layer/

-rw-r--r-- 1 sagemaker-user users 14067 Jul 31 18:27 /home/sagemaker-user/spectrum/snr_results_Qwen-Qwen3-8B_unfrozenparameters_100percent.yaml
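
Spectrum ranks layers by signal-to-noise ratio and writes the selected ones to the YAML file copied above. As a minimal sketch of how a training script could consume it, assume the file holds a list of regular expressions under an `unfrozen_parameters` key (check your generated file): parameters whose names match a pattern stay trainable and all others are frozen. The helper `select_trainable`, the parameter names, and the patterns below are illustrative and not taken from the real output.

```python
import re

def select_trainable(param_names, patterns):
    """Return the set of parameter names matched by at least one regex pattern."""
    compiled = [re.compile(p) for p in patterns]
    return {name for name in param_names if any(rx.search(name) for rx in compiled)}

# Hypothetical parameter names, in the style of a Hugging Face decoder model.
param_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.1.mlp.down_proj.weight",
    "lm_head.weight",
]

# Hypothetical patterns, standing in for the `unfrozen_parameters` list.
patterns = [
    r"^lm_head.weight$",
    r"model.layers.1.mlp.down_proj",
]

trainable = select_trainable(param_names, patterns)
for name in param_names:
    status = "train " if name in trainable else "frozen"
    print(f"{status}  {name}")
```

With a PyTorch model, the same decision would set `param.requires_grad` for each entry of `model.named_parameters()`.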


## Setup Configuration file path

If you have created a Managed MLflow server, copy its ARN here and assign a name to the experiment.

In [8]:
os.environ["mlflow_uri"] = "arn:aws:sagemaker:us-east-1:783764584149:mlflow-tracking-server/test"
os.environ["mlflow_experiment_name"] = "spectrum-fine-tuning"
os.environ["hf_token"] = ""
os.environ["model_id"] = model_id
os.environ["spectrum_layer_config"] = spectrum_output_filename

***

## Visualize and upload the dataset

We are going to read the data from Hugging Face and prepare it for the training job.

In [9]:
train = load_dataset("rajpurkar/squad", split="train")
test = load_dataset("rajpurkar/squad", split="validation")

In [10]:
train

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

Create a prompt template that combines the context, question, and answer of each sample.

In [11]:
# custom instruct prompt start
prompt_template = f"""
<s> [INST]
Context:
{{context}}

{{question}} [/INST] 
{{answer}}</s>"""

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(context=sample["context"],
                                            question=sample["question"],
                                            answer=sample["answers"])
    return sample

Apply the template to the train and test splits, then print a random training sample to check the formatting.

In [12]:
train_dataset = train
test_dataset = test

dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))

# randint is inclusive on both ends, so the upper bound is the last valid index
print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features))


<s> [INST]
Context:
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

What is in front of the Notre Dame Main Building? [/INST] 
{'text': ['a copper statue of Christ'], 'answer_start': [188]}</s>
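
In the sample above, the whole `answers` dictionary is rendered into the prompt, so the model would learn to emit the dictionary's string form. If only the answer string is wanted, one option is to pass the first entry of `answers["text"]` instead. The sketch below uses a hypothetical `template_dataset_text_only` function and a hand-written sample in the SQuAD layout; it is a possible variant, not the notebook's method.

```python
# Prompt layout mirroring the notebook's template.
prompt_template = """
<s> [INST]
Context:
{context}

{question} [/INST]
{answer}</s>"""

def template_dataset_text_only(sample):
    """Format one SQuAD-style sample, using only the first answer string."""
    answers = sample["answers"]["text"]
    answer = answers[0] if answers else ""  # fall back to empty if no answer is listed
    sample["text"] = prompt_template.format(
        context=sample["context"],
        question=sample["question"],
        answer=answer,
    )
    return sample

# Hand-written sample that mimics the SQuAD record structure.
sample = {
    "context": "Atop the Main Building's gold dome is a golden statue of the Virgin Mary.",
    "question": "What sits on top of the Main Building?",
    "answers": {"text": ["a golden statue of the Virgin Mary"], "answer_start": [38]},
}

print(template_dataset_text_only(sample)["text"])
```

The function has the same signature as `template_dataset`, so it could be passed to `Dataset.map` in the same way.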


### Upload to Amazon S3

In [13]:
# save train_dataset to s3 using our SageMaker session
if default_prefix:
    input_path = f'{default_prefix}/datasets/spectrum-fine-tuning-modeltrainer-sft'
else:
    input_path = f'datasets/spectrum-fine-tuning-modeltrainer-sft'

# Write both splits to local JSON files, then upload them to S3
train_dataset.to_json("./data/train/dataset.json", orient="records")
test_dataset.to_json("./data/test/dataset.json", orient="records")

s3_client.upload_file("./data/train/dataset.json", bucket_name, f"{input_path}/train/dataset.json")
train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train/dataset.json"
s3_client.upload_file("./data/test/dataset.json", bucket_name, f"{input_path}/test/dataset.json")
test_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test/dataset.json"

shutil.rmtree("./data")

print(f"Training data uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)

Creating json from Arrow format:   0%|          | 0/88 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/11 [00:00<?, ?ba/s]

Training data uploaded to:
s3://sagemaker-us-east-1-783764584149/datasets/spectrum-fine-tuning-modeltrainer-sft/train/dataset.json
s3://sagemaker-us-east-1-783764584149/datasets/spectrum-fine-tuning-modeltrainer-sft/test/dataset.json
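
The optional `default_prefix` is handled with the same `if`/`else` in several cells of this notebook. As a small sketch, a hypothetical helper `s3_key` (not part of the SageMaker SDK) can join the non-empty parts once; the bucket and prefix values below are placeholders.

```python
def s3_key(*parts):
    """Join non-empty path parts with '/', skipping None or empty strings."""
    return "/".join(p.strip("/") for p in parts if p)

# Placeholder values; in the notebook these come from the SageMaker session.
bucket_name = "example-bucket"
dataset_dir = "datasets/spectrum-fine-tuning-modeltrainer-sft"

for default_prefix in (None, "team-a"):
    key = s3_key(default_prefix, dataset_dir, "train", "dataset.json")
    print(f"s3://{bucket_name}/{key}")
```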


***

## Model fine-tuning

We are now ready to fine-tune our model. We will use the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) from transformers to fine-tune our model. We prepared a script, [train_spectrum.py](./scripts/train_spectrum.py), which loads the dataset from disk, prepares the model and tokenizer, and starts the training.

For configuration we use `TrlParser`, which allows us to provide hyperparameters in a `yaml` file. This file is uploaded to S3 and provided to the SageMaker training job in the same way as our datasets. Below is the config file for fine-tuning the model on `ml.p4d.24xlarge`. We save the config file as `args.yaml` and upload it to S3.

In [14]:
os.environ["mlflow_uri"] = "arn:aws:sagemaker:us-east-1:783764584149:mlflow-tracking-server/test"
os.environ["mlflow_experiment_name"] = "spectrum-fine-tuning-Qwen3-8B-100percent"

In [15]:
%%bash

cat > ./args.yaml <<EOF
hf_token: "${hf_token}" # Hugging Face token, needed only to access gated or private models
model_id: "${model_id}"       # Hugging Face model id
mlflow_uri: "${mlflow_uri}"
mlflow_experiment_name: "${mlflow_experiment_name}"
# sagemaker specific parameters
output_dir: "/opt/ml/model"                       # directory whose contents SageMaker uploads to S3 as the model artifact
train_dataset_path: "/opt/ml/input/data/train/"   # path where SageMaker places the train channel
test_dataset_path: "/opt/ml/input/data/test/"     # path where SageMaker places the test channel
spectrum_config_path: "/opt/ml/input/data/code/spectrum-layer/${spectrum_layer_config}"
# training parameters           
learning_rate: 2e-4                    # learning rate
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 8         # batch size per device during training
per_device_eval_batch_size: 2          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps to accumulate gradients before an optimizer update
gradient_checkpointing: true           # use gradient checkpointing
bf16: true                             # use bfloat16 precision
tf32: false                            # use tf32 precision
fsdp: "full_shard auto_wrap offload"
fsdp_config: 
    backward_prefetch: "backward_pre"
    cpu_ram_efficient_loading: true
    offload_params: true
    forward_prefetch: false
    use_orig_params: true
merge_weights: false                    # merge weights in the base model
EOF
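
With these settings, the number of samples contributing to each optimizer update is the per-device batch size times the gradient accumulation steps times the number of GPUs. The sketch below works through the arithmetic, assuming a single `ml.p4d.24xlarge` instance with 8 GPUs and one training process per GPU. The steps-per-epoch figure is only a rough estimate, since the exact count depends on how the training script tokenizes, packs, or drops samples.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 8  # assumption: one ml.p4d.24xlarge instance (8 GPUs), one process per GPU

# Samples consumed per optimizer update across all devices.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128

train_rows = 87599  # size of the SQuAD train split shown earlier
steps_per_epoch = -(-train_rows // effective_batch_size)  # ceiling division, rough estimate
print(steps_per_epoch)
```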

Let's upload the config file to S3.

In [16]:
if default_prefix:
    input_path = f"s3://{bucket_name}/{default_prefix}/datasets/spectrum-fine-tuning-modeltrainer-sft"
else:
    input_path = f"s3://{bucket_name}/datasets/spectrum-fine-tuning-modeltrainer-sft"

# upload the model yaml file to s3
model_yaml = "args.yaml"
train_config_s3_path = S3Uploader.upload(local_path=model_yaml, desired_s3_uri=f"{input_path}/config")

print(f"Training config uploaded to:")
print(train_config_s3_path)

Training config uploaded to:
s3://sagemaker-us-east-1-783764584149/datasets/spectrum-fine-tuning-modeltrainer-sft/config/args.yaml


## Fine-tune model

The ModelTrainer defined below trains the model with Spectrum.

#### Get PyTorch image_uri

We are going to use the native PyTorch container image, pre-built for Amazon SageMaker.

In [17]:
instance_type = "ml.p4d.24xlarge" #"ml.g6.24xlarge" # Override the instance type if you want to get a different container version

instance_type

'ml.p4d.24xlarge'

In [18]:
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.6.0",
    instance_type=instance_type,
    image_scope="training"
)

image_uri

'763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6.0-gpu-py312'

In [19]:
# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="train_spectrum.py",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=1,
    keep_alive_period_in_seconds=3600,
)

# define Training Job Name 
job_name = f"train-{model_id.split('/')[-1].replace('.', '-')}-sft-spectrum-script"

# define OutputDataConfig path
if default_prefix:
    output_path = f"s3://{bucket_name}/{default_prefix}/{job_name}"
else:
    output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=14400
    ),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml" # path to TRL config which was uploaded to s3
    },
    output_data_config=OutputDataConfig(
        s3_output_path=output_path
    ),
)

In [20]:
# Pass the input data
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path, # S3 path where training data is stored
)

test_input = InputData(
    channel_name="test",
    data_source=test_dataset_s3_path, # S3 path where test data is stored
)

config_input = InputData(
    channel_name="config",
    data_source=train_config_s3_path, # S3 path where the training config (args.yaml) is stored
)

# Check input channels configured
data = [train_input, test_input, config_input]
data

[InputData(channel_name='train', data_source='s3://sagemaker-us-east-1-783764584149/datasets/spectrum-fine-tuning-modeltrainer-sft/train/dataset.json'),
 InputData(channel_name='test', data_source='s3://sagemaker-us-east-1-783764584149/datasets/spectrum-fine-tuning-modeltrainer-sft/test/dataset.json'),
 InputData(channel_name='config', data_source='s3://sagemaker-us-east-1-783764584149/datasets/spectrum-fine-tuning-modeltrainer-sft/config/args.yaml')]
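
SageMaker makes each input channel available inside the training container under `/opt/ml/input/data/<channel_name>`, which is why `args.yaml` points at `/opt/ml/input/data/train/` and the `config` hyperparameter at `/opt/ml/input/data/config/args.yaml`. The sketch below derives those paths from channel names and S3 URIs, assuming each channel points at a single S3 object that is downloaded in File mode; `container_path` is a hypothetical helper and the bucket name is a placeholder.

```python
import posixpath

INPUT_ROOT = "/opt/ml/input/data"  # SageMaker's input data directory in the container

def container_path(channel_name, s3_uri):
    """Return where a single S3 object from a channel is expected to appear in the container."""
    filename = posixpath.basename(s3_uri)
    return posixpath.join(INPUT_ROOT, channel_name, filename)

# Placeholder S3 URIs mirroring the three channels defined above.
channels = {
    "train": "s3://example-bucket/datasets/train/dataset.json",
    "test": "s3://example-bucket/datasets/test/dataset.json",
    "config": "s3://example-bucket/datasets/config/args.yaml",
}

for name, uri in channels.items():
    print(f"{name:>6} -> {container_path(name, uri)}")
```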

In [None]:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

Output()

***


# Model Deployment

In the following sections, we are going to deploy the fine-tuned model on an Amazon SageMaker Real-time endpoint.

## Load Fine-Tuned model

In [67]:
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}-sft-spectrum-script"
print(job_prefix)

train-Qwen3-8B-sft-spectrum-script


In [68]:
def get_last_job_name(job_name_prefix):
    sagemaker_client = boto3.client('sagemaker')

    matching_jobs = []
    next_token = None

    while True:
        # Prepare the search parameters
        search_params = {
            'Resource': 'TrainingJob',
            'SearchExpression': {
                'Filters': [
                    {
                        'Name': 'TrainingJobName',
                        'Operator': 'Contains',
                        'Value': job_name_prefix
                    },
                    {
                        'Name': 'TrainingJobStatus',
                        'Operator': 'Equals',
                        'Value': "Completed"
                    }
                ]
            },
            'SortBy': 'CreationTime',
            'SortOrder': 'Descending',
            'MaxResults': 100
        }

        # Add NextToken if we have one
        if next_token:
            search_params['NextToken'] = next_token

        # Make the search request
        search_response = sagemaker_client.search(**search_params)

        # Filter and add matching jobs
        matching_jobs.extend([
            job['TrainingJob']['TrainingJobName'] 
            for job in search_response['Results']
            if job['TrainingJob']['TrainingJobName'].startswith(job_name_prefix)
        ])

        # Check if we have more results to fetch
        next_token = search_response.get('NextToken')
        if not next_token or matching_jobs:  # Stop if we found at least one match or no more results
            break

    if not matching_jobs:
        raise ValueError(f"No completed training jobs found starting with prefix '{job_name_prefix}'")

    return matching_jobs[0]

In [69]:
job_name = get_last_job_name(job_prefix)

job_name

'train-Qwen3-8B-sft-spectrum-script-20250730180513'

#### Inference configurations

In [70]:
instance_count = 1
instance_type = "ml.g5.12xlarge"
number_of_gpu = 1
health_check_timeout = 900

In [71]:
# image_uri = sagemaker.image_uris.retrieve(
#     framework="djl-lmi",
#     region=sagemaker_session.boto_session.region_name,
#     version="latest"
# )

# image_uri

In [72]:
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128-v1.1"

In [None]:
if default_prefix:
    model_data=f"s3://{bucket_name}/{default_prefix}/{job_prefix}/{job_name}/output/model.tar.gz"
else:
    model_data=f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz"

model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'OPTION_TRUST_REMOTE_CODE': 'true',
        'OPTION_ROLLING_BATCH': "vllm",
        'OPTION_DTYPE': 'bf16',
        'OPTION_QUANTIZE': 'fp8',
        'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
        'OPTION_MAX_ROLLING_BATCH_SIZE': '32',
        'OPTION_MODEL_LOADING_TIMEOUT': '3600',
        'OPTION_MAX_MODEL_LEN': '4096'
    }
)

In [None]:
endpoint_name = f"{model_id.split('/')[-1].replace('.', '-')}-sft-djl"

In [None]:
predictor = model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    model_data_download_timeout=3600
)

#### Predict

In [None]:
endpoint_name = f"{model_id.split('/')[-1].replace('.', '-')}-sft-djl"

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In [None]:
base_prompt = f"""
<s>
[INST]
{{question}}
[/INST]
"""

In [None]:
prompt = base_prompt.format(question="What statue is in front of the Notre Dame building?")

response = predictor.predict({
    "inputs": prompt,
    "parameters": {
        "temperature": 0.2,
        "top_p": 0.9,
        "return_full_text": False,
        "stop": ['<|eot_id|>', '<|end_of_text|>']
    }
})

response = response["generated_text"].split("<|end_of_text|>")[0]

response
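
The last step cuts the generated text at an end-of-text marker. A slightly more general sketch truncates at whichever stop sequence appears first; `truncate_at_stop` is a hypothetical helper, and the response dictionary below is hand-written in the shape the cell above assumes.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()

# Hand-written response in the {"generated_text": ...} shape used above.
response = {"generated_text": " a copper statue of Christ<|eot_id|>extra<|end_of_text|>"}
stops = ["<|eot_id|>", "<|end_of_text|>"]

print(truncate_at_stop(response["generated_text"], stops))
```

The stop sequences should match the end-of-sequence markers the fine-tuned model actually emits, which depend on the tokenizer and on the prompt template used for training.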

#### Delete Endpoint

In [None]:
endpoint_name = f"{model_id.split('/')[-1].replace('.', '-')}-sft-djl"

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In [None]:
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)