# Spectrum Fine-Tuning with Amazon SageMaker AI 

In this example, you will learn how to use [Spectrum](https://github.com/QuixiAI/spectrum) along with Amazon SageMaker fully managed training jobs to fine-tune a Qwen3-8B model. 

Spectrum fine-tuning involves a few key steps:

- First, Spectrum will download the desired model to be analyzed (automatically via `AutoModelForCausalLM`).
- Then, it will run an analysis to determine the Signal-to-Noise Ratio (SNR) for each layer in the model. Based on this analysis, Spectrum writes a YAML file that lists the layers selected for the requested percentage, which is used as input to the training job.
- Next, you can use the training script provided in this example, or adapt the sample code into your own script, to process the Spectrum output and selectively freeze or unfreeze layers accordingly.
- Finally, you will create an Amazon SageMaker training job, providing the Spectrum analysis as an input.
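
The selection idea behind the second step can be sketched in a few lines. This is a toy illustration only: the SNR values are made up, the function name `select_top_percent` is hypothetical, and Spectrum's own analysis (which computes SNR from the weight matrices) may group and rank layers differently.

```python
import math


def select_top_percent(snr_by_layer, top_percent):
    """Return the names of the highest-SNR layers, keeping `top_percent` of them."""
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    # Always keep at least one layer, and round up so small models are not left empty.
    keep = max(1, math.ceil(len(snr_by_layer) * top_percent / 100))
    ranked = sorted(snr_by_layer, key=snr_by_layer.get, reverse=True)
    return ranked[:keep]


# Made-up SNR values, for illustration only.
example_snr = {
    "model.layers.0.mlp.down_proj": 4.2,
    "model.layers.1.mlp.down_proj": 1.3,
    "model.layers.2.mlp.down_proj": 9.8,
    "model.layers.3.mlp.down_proj": 0.7,
}

selected = select_top_percent(example_snr, 50)
print(selected)
```

Only the selected layers would then be left trainable, while the rest of the model stays frozen.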

The goal of this process is to leverage Spectrum's insights about the model layers to fine-tune the model more effectively, by focusing the training on the most relevant layers. A complete sample can be found in the SageMaker Distributed Training GitHub repository. You'll see how Spectrum compares with QLoRA-based fine-tuning and reduces resource requirements as well as training time, without a significant impact on model quality.

![](./images/Spectrum-ValidationLoss-Comparison.png)

## Prerequisites

Install the prerequisite packages to run this notebook.

In [None]:
%pip install -r ./scripts/requirements.txt --upgrade

## This cell will restart the kernel. Click "OK".

In [None]:
from IPython import get_ipython
get_ipython().kernel.do_shutdown(True)

# Wait for the kernel to restart before continuing.

In [None]:
import boto3
import shutil
import os
from random import randint

from datasets import load_dataset
from datasets import Dataset, DatasetDict

import sagemaker
from sagemaker.s3 import S3Uploader
from sagemaker import get_execution_role
from sagemaker import Model
from sagemaker import image_uris
from sagemaker.config import load_sagemaker_config

from sagemaker.modules.configs import Compute, InputData, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

from helper_functions import utils

In [None]:
sagemaker_session = sagemaker.Session()
s3_client = boto3.client('s3')

bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
region = sagemaker_session.boto_session.region_name
configs = load_sagemaker_config()

## Prepare the Spectrum layer configuration for the training job

Here you will clone the Spectrum repository into your local home directory so you can run analysis. 

>**Note:** If you have already cloned the repository and re-run the `git clone` command, it will return an error that the path already exists. You can ignore this error.

In [None]:
spectrum_clone_folder = "~/spectrum"

In [None]:
!git clone https://github.com/QuixiAI/spectrum.git {spectrum_clone_folder}

This example uses [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). If you'd like to use a different model, change the value of `model_id`.

In [None]:
model_id = "Qwen/Qwen3-8B"
filesafe_model_id = model_id.replace("/","-")
spectrum_layer_percent = "10"

## The following cell generates a shell command that you need to run in your terminal
Spectrum's UI is interactive and cannot be run inside this notebook. Run the following cell, execute the command it prints in your terminal, then resume notebook execution. There is nothing from the terminal that you need to copy over.

In [None]:
print(f"cd {spectrum_clone_folder} && python3 {spectrum_clone_folder}/spectrum.py --model-name {model_id} --top-percent {spectrum_layer_percent}")

## Ensure you executed the previous output in your terminal before proceeding.

In [None]:
spectrum_output_filename = f"snr_results_{filesafe_model_id}_unfrozenparameters_{spectrum_layer_percent}percent.yaml"
spectrum_output_filepath = f"{spectrum_clone_folder}/{spectrum_output_filename}"
spectrum_output_filepath
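
Before copying the file, it can help to confirm that the terminal step actually produced it. The helper below is a minimal sketch (the name `resolve_spectrum_output` is hypothetical). It also expands `~`, because Python's file functions, unlike the shell, do not expand it automatically.

```python
import os


def resolve_spectrum_output(folder, filename):
    """Expand `~` in `folder` and return the absolute file path, raising if it is missing."""
    path = os.path.abspath(os.path.join(os.path.expanduser(folder), filename))
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Spectrum output not found at {path}; "
            "run the printed command in a terminal first."
        )
    return path


# Usage in this notebook (assumes the variables defined in the cells above):
# resolve_spectrum_output(spectrum_clone_folder, spectrum_output_filename)
```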

This will copy the Spectrum output into your local scripts folder. It will be packaged with the code assets for your training job.

In [None]:
!mkdir -p ./scripts/spectrum-layer/
!cp {spectrum_output_filepath} ./scripts/spectrum-layer/
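
Inside the training job, a script can use this file to decide which parameters remain trainable. The sketch below assumes the YAML contains a list of regular expressions (for example under an `unfrozen_parameters` key) that are matched against parameter names. That structure, the helper name, and the sample names are assumptions for illustration, and the provided `train_spectrum.py` may implement this differently.

```python
import re


def trainable_parameter_names(parameter_names, unfrozen_patterns):
    """Return the parameter names that match at least one unfrozen pattern."""
    compiled = [re.compile(pattern) for pattern in unfrozen_patterns]
    return [
        name
        for name in parameter_names
        if any(rx.search(name) for rx in compiled)
    ]


# Hypothetical parameter names and patterns, for illustration only.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.1.mlp.down_proj.weight",
    "lm_head.weight",
]
patterns = [r"^lm_head\.weight$", r"model\.layers\.1\.mlp\.down_proj"]

trainable = trainable_parameter_names(names, patterns)
print(trainable)

# With a PyTorch model, the same test would typically drive `requires_grad`:
# for name, param in model.named_parameters():
#     param.requires_grad = name in set(trainable)
```

Escaping the dots and anchoring patterns matters here: an unescaped `model.layers.1` would also match `model.layers.10`.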

## Set up configuration values

If you have created a Managed MLflow tracking server, copy its `ARN` into `mlflow_uri` below and assign a name to the experiment. This will track your experiment in MLflow automatically.

In [None]:
os.environ["mlflow_uri"] = ""
os.environ["mlflow_experiment_name"] = f"spectrum-fine-tuning-{filesafe_model_id}-{spectrum_layer_percent}pct"
os.environ["hf_token"] = ""
os.environ["model_id"] = model_id
os.environ["spectrum_layer_config"] = spectrum_output_filename

***

## Visualize and upload the dataset

We are going to read the data from Hugging Face and prepare it for the training job. This example uses the SQuAD dataset (`rajpurkar/squad`).

In [None]:
train_dataset = load_dataset("rajpurkar/squad", split="train")
test_dataset = load_dataset("rajpurkar/squad", split="validation")

#grab a sample from the training and test sets
print(f"Train Sample:\n{train_dataset[randint(0, len(train_dataset)-1)]}\n\n")
print(f"Test Sample:\n{test_dataset[randint(0, len(test_dataset)-1)]}\n\n")
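
Each SQuAD record carries `context`, `question`, and `answers` fields, so it has to be turned into a prompt/response pair before supervised fine-tuning. The following is one plausible conversion into chat messages. The function name, the system prompt, and the hand-written sample record are assumptions for illustration, and the provided training script may format records differently.

```python
SYSTEM = "Answer the question using the provided context."


def squad_to_messages(record):
    """Convert one SQuAD-style record into a list of chat messages."""
    answers = record["answers"]["text"]
    # SQuAD validation records can list several reference answers; use the first.
    answer = answers[0] if answers else ""
    user = f"Context:\n{record['context']}\n\nQuestion: {record['question']}"
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
        {"role": "assistant", "content": answer},
    ]


# Hand-written record that mirrors the SQuAD field names (not taken from the dataset).
sample_record = {
    "context": "The Eiffel Tower is located in Paris.",
    "question": "Where is the Eiffel Tower located?",
    "answers": {"text": ["Paris"], "answer_start": [31]},
}

messages = squad_to_messages(sample_record)
print(messages[-1]["content"])
```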

### Upload training data to Amazon S3

In [None]:
# save train_dataset to s3 using our SageMaker session
if default_prefix:
    input_path = f'{default_prefix}/datasets/spectrum-fine-tuning-modeltrainer-sft'
else:
    input_path = f'datasets/spectrum-fine-tuning-modeltrainer-sft'

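# Save datasets to S3
# Note: the full train and validation splits are written here; subsample them first if compute is limited
os.makedirs("./data/train", exist_ok=True)  # make sure the local output folders exist
os.makedirs("./data/test", exist_ok=True)
train_dataset.to_json("./data/train/dataset.json", orient="records")
test_dataset.to_json("./data/test/dataset.json", orient="records")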

s3_client.upload_file("./data/train/dataset.json", bucket_name, f"{input_path}/train/dataset.json")
train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train/dataset.json"
s3_client.upload_file("./data/test/dataset.json", bucket_name, f"{input_path}/test/dataset.json")
test_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test/dataset.json"

shutil.rmtree("./data")

print(f"Training data uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)

***

## Model fine-tuning

We are now ready to fine-tune our model. We will use the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) from transformers to fine-tune our model. We prepared a script, [train_spectrum.py](./scripts/train_spectrum.py), which loads the dataset from disk, prepares the model and tokenizer, and starts the training.

For configuration we use `TrlParser`, which allows us to provide hyperparameters in a YAML file. This YAML will be uploaded and provided to Amazon SageMaker in the same way as our datasets. Below is the config file for fine-tuning the model on `ml.p4de.24xlarge`. We save the config file as `args.yaml` and upload it to S3.

In [None]:
%%bash

cat > ./args.yaml <<EOF

hf_token: "${hf_token}" # Hugging Face token, needed only to access gated or private models
model_id: "${model_id}"       # Hugging Face model id
mlflow_uri: "${mlflow_uri}"
mlflow_experiment_name: "${mlflow_experiment_name}"

# sagemaker specific parameters
output_dir: "/opt/ml/model"                       # path where the script writes the model; SageMaker uploads it to S3
train_dataset_path: "/opt/ml/input/data/train/"   # path where SageMaker makes the train channel available
test_dataset_path: "/opt/ml/input/data/test/"     # path where SageMaker makes the test channel available


enable_spectrum: true
spectrum_config_path: "/opt/ml/input/data/code/spectrum-layer/${spectrum_layer_config}"

#LoRA config
enable_lora: false # enable LoRA training
enable_quantization: false # set to true to also quantize the base model

lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
merge_weights: true

# training parameters           
learning_rate: 2e-4                    # learning rate
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 8         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
gradient_checkpointing: true           # use gradient checkpointing
bf16: true                             # use bfloat16 precision
tf32: false                            # use tf32 precision

fsdp: "full_shard auto_wrap offload"
fsdp_config: 
    backward_prefetch: "backward_pre"
    cpu_ram_efficient_loading: true
    offload_params: true
    forward_prefetch: false
    use_orig_params: true

EOF
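
The batch-related settings in this file combine into an effective global batch size. As a worked example, assuming a single `ml.p4de.24xlarge` instance with its 8 GPUs (the helper name is hypothetical):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# per_device_train_batch_size=8, gradient_accumulation_steps=2, 8 GPUs on one instance
global_batch = effective_batch_size(8, 2, 8)
print(global_batch)
```

If you change the instance type or the instance count, the device count changes with it, so the learning rate may need to be revisited.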

Upload the training configuration file to S3.

In [None]:
if default_prefix:
    input_path = f"s3://{bucket_name}/{default_prefix}/datasets/spectrum-fine-tuning-modeltrainer-sft"
else:
    input_path = f"s3://{bucket_name}/datasets/spectrum-fine-tuning-modeltrainer-sft"

# upload the model yaml file to s3
model_yaml = "args.yaml"
train_config_s3_path = S3Uploader.upload(local_path=model_yaml, desired_s3_uri=f"{input_path}/config")

print(f"Training config uploaded to:")
print(train_config_s3_path)

## Fine-tune model

In the following steps you will select an instance type, pull a managed training container, and configure a SageMaker AI fully managed training job.

In [None]:
instance_type = "ml.p4de.24xlarge"

instance_type

#### Get PyTorch image_uri

We are going to use the native PyTorch container image, pre-built for Amazon SageMaker.

In [None]:
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="2.6.0",
    instance_type=instance_type,
    image_scope="training"
)

image_uri

This training job setup uses the SageMaker Python SDK's `ModelTrainer` API to quickly configure the code and compute requirements for your training job.

The `SourceCode` configuration references the code in this example and provides a `requirements.txt` file to handle training dependencies. The `OutputDataConfig` passed to `ModelTrainer` sets the desired S3 location for the trained model artifacts.

In [None]:
# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="train_spectrum.py",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=1,
    volume_size_in_gb=500,
    #keep_alive_period_in_seconds=3600 #uncomment this value to enable warm pools
)

# define Training Job Name 
job_name = f"train-{model_id.split('/')[-1].replace('.', '-')}-sft-spectrum-{spectrum_layer_percent}-script"

# define OutputDataConfig path
if default_prefix:
    output_path = f"s3://{bucket_name}/{default_prefix}/{job_name}"
else:
    output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=36000
    ),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml" # path to TRL config which was uploaded to s3
    },
    output_data_config=OutputDataConfig(
        s3_output_path=output_path
    )
)
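
Training job names may only contain alphanumeric characters and hyphens and are limited to 63 characters, which is why the cell above replaces `.` in the model name. The sketch below generalizes that replacement and adds a check. The helper names are hypothetical, and because the SDK is expected to append a unique suffix to `base_job_name`, it is worth leaving some headroom below the limit.

```python
import re

# Pattern documented for SageMaker training job names (max 63 characters).
_JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_base_job_name(model_id, layer_percent):
    """Build a job name from a Hugging Face model id, replacing unsupported characters."""
    short_name = model_id.split("/")[-1]
    short_name = re.sub(r"[^a-zA-Z0-9-]", "-", short_name)
    return f"train-{short_name}-sft-spectrum-{layer_percent}-script"


def is_valid_job_name(name):
    """Return True if `name` satisfies the training job name pattern."""
    return bool(_JOB_NAME_RE.match(name))


example_name = make_base_job_name("Qwen/Qwen3-8B", "10")
print(example_name, is_valid_job_name(example_name))
```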

Here you will define different channels for training data, test data, and training configuration information.

In [None]:
# Pass the input data
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path, # S3 path where training data is stored
)

test_input = InputData(
    channel_name="test",
    data_source=test_dataset_s3_path, # S3 path where test data is stored
)

config_input = InputData(
    channel_name="config",
    data_source=train_config_s3_path, # S3 path where the training config (args.yaml) is stored
)

# Check input channels configured
data = [train_input, test_input, config_input]
data
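
Inside the training container, SageMaker exposes each input channel under `/opt/ml/input/data/<channel_name>`. That is why `args.yaml` points at `/opt/ml/input/data/train/` and the `config` hyperparameter points at `/opt/ml/input/data/config/args.yaml`. A small sketch of that mapping (the helper name is hypothetical):

```python
import posixpath

SAGEMAKER_INPUT_ROOT = "/opt/ml/input/data"


def channel_path(channel_name, filename=None):
    """Return the in-container path for a channel, optionally for a file inside it."""
    parts = [SAGEMAKER_INPUT_ROOT, channel_name]
    if filename:
        parts.append(filename)
    return posixpath.join(*parts)


print(channel_path("train"))
print(channel_path("config", "args.yaml"))
```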

With your `ModelTrainer` instance configured and data channels assigned, calling `train()` will start your training job.

This will spin up the desired instance(s), download the training container, download the base model artifacts, and invoke the specified training script. When the job completes, the infrastructure will be terminated and the results will be stored in the S3 output location defined earlier.

In [None]:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

***

# Model Deployment

In the following sections, we are going to deploy the fine-tuned model on an Amazon SageMaker Real-time endpoint.

## Load Fine-Tuned model

In [None]:
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}-sft-spectrum-{spectrum_layer_percent}-script"
print(job_prefix)

In [None]:
job_name = utils.get_last_job_name(job_prefix)
job_name

#### Inference configurations

In [None]:
instance_count = 1
instance_type = "ml.g6e.2xlarge"
number_of_gpu = 1
health_check_timeout = 900

image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"

In [None]:
if default_prefix:
    model_data=f"s3://{bucket_name}/{default_prefix}/{job_prefix}/{job_name}/output/model.tar.gz"
else:
    model_data=f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz"

model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'OPTION_TRUST_REMOTE_CODE': 'true',
        'OPTION_ROLLING_BATCH': "vllm",
        'OPTION_DTYPE': 'bf16',
        'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
        'OPTION_MAX_ROLLING_BATCH_SIZE': '32',
        'OPTION_MODEL_LOADING_TIMEOUT': '3600',
        'OPTION_MAX_MODEL_LEN': '4096'
    }
)

In [None]:
from sagemaker.utils import name_from_base
endpoint_name = name_from_base(f"{model_id.split('/')[-1].replace('.', '-')}-sft-djl")

In [None]:
predictor = model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    model_data_download_timeout=3600
)

#### Predict

This will create a `Predictor` using the SageMaker Python SDK to run inference against the endpoint.

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In [None]:
SYSTEM_PROMPT = "Answer the question based on the knowledge you have or shared by user. Just give the answer and do not explain your thinking process"
USER_PROMPT = "What statue is in front of the Notre Dame building?"


messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]


payload = {
    "messages": messages,
    "parameters": {"max_new_tokens": 256, "temperature": 0.05}
}

output = predictor.predict(payload)

print(f"Output:\n\n{output['choices'][0]['message']['content']}")
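
Qwen3 models can emit their reasoning inside `<think>...</think>` tags before the final answer. If you only want the answer text, a small post-processing step can strip that block. This is a sketch that assumes the chat-completion response shape used above. The helper name is hypothetical and the sample response is hand-written, not real model output.

```python
import re

_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def extract_answer(response):
    """Return the assistant message content with any <think> block removed."""
    content = response["choices"][0]["message"]["content"]
    return _THINK_RE.sub("", content).strip()


# Hand-written response that mirrors the structure read in the cell above.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "<think>\nreasoning goes here\n</think>\n\nA statue.",
            }
        }
    ]
}

answer = extract_answer(sample_response)
print(answer)
```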

#### Delete Endpoint

Set the `CLEAN_UP_ENDPOINT` flag to `True` and run the following cells to delete your endpoint.

In [None]:
#Set this value to True to delete resources.
CLEAN_UP_ENDPOINT = False

In [None]:
if CLEAN_UP_ENDPOINT:
    # Reuse `endpoint_name` from the deployment cell. name_from_base() appended a
    # timestamp there, so rebuilding the base name here would not match the endpoint.
    
    predictor = sagemaker.Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker_session,
        serializer=sagemaker.serializers.JSONSerializer(),
        deserializer=sagemaker.deserializers.JSONDeserializer(),
    )
    
    predictor.delete_model()
    predictor.delete_endpoint(delete_endpoint_config=True)