# Spectrum-Aware Fine-Tuning with Amazon SageMaker

## Overview

This notebook demonstrates how to perform spectrum-aware fine-tuning of the Qwen3 0.6B model for SQL generation tasks using Amazon SageMaker Training Jobs. Spectrum-aware fine-tuning is a cost-effective approach that updates only the most informative layers of a model (ranked by signal-to-noise ratio) and freezes the rest, which makes training faster and more affordable than traditional full fine-tuning.

---


## 1. Introduction

### What You'll Learn

- Prepare text-to-SQL datasets for fine-tuning
- Configure spectrum-aware fine-tuning parameters
- Use the SageMaker ModelTrainer API for efficient training
- Deploy and evaluate fine-tuned models

### Key Technologies

- **Model**: Qwen3 0.6B (Alibaba's efficient reasoning model)
- **Dataset**: Synthetic Text-to-SQL from Gretel AI
- **Training**: SageMaker Training Jobs with spectrum-aware fine-tuning
- **Framework**: PyTorch with Hugging Face Transformers



In [None]:
# Load variables from setup notebook
%store -r account_id bucket_name region MLFLOW_TRACKING_URI

import sagemaker

model_id = "Qwen/Qwen3-0.6B"
filesafe_model_id = model_id.replace("/", "-")

sagemaker_session = sagemaker.Session()
default_prefix = sagemaker_session.default_bucket_prefix

## 2. Spectrum Layer Configuration

The next step is to configure spectrum-aware fine-tuning, that is, to choose which layers stay trainable, using the `spectrum.py` script provided in the Spectrum repository. You have two options:

1. Run the next cell but do not execute the command it prints. The pre-populated layer files in the `spectrum_layers` folder are used instead. The cell must still be run because it defines variables that later cells depend on.
2. (Suggested for advanced users) Run the next cell, then execute the printed command in a SageMaker Studio terminal. This generates the Spectrum layer analysis for the model. You must run the command in a terminal because Spectrum's prompts are interactive and cannot be answered from inside this notebook.
	- If you do this, set the batch size to 1 and select all layers in the prompts.
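
To make the top-percent setting more concrete, here is a minimal sketch of selecting the highest-scoring layers from a set of per-layer scores. The scores, the layer names, and the helper `select_top_percent` are hypothetical; this is a simplified illustration and not Spectrum's actual code, which derives signal-to-noise ratios from the model weights.

```python
import math


def select_top_percent(scores: dict[str, float], top_percent: int) -> list[str]:
    """Return the names of the highest-scoring layers (hypothetical helper)."""
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    # Keep at least one layer, rounding the count up.
    keep = max(1, math.ceil(len(scores) * top_percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:keep])


# Made-up signal-to-noise scores for four layers of one module type.
example_scores = {
    "model.layers.0.mlp.down_proj": 0.8,
    "model.layers.1.mlp.down_proj": 2.4,
    "model.layers.2.mlp.down_proj": 1.1,
    "model.layers.3.mlp.down_proj": 3.0,
}
unfrozen = select_top_percent(example_scores, 50)
print(unfrozen)
```

With a value of 50, half of the candidate layers remain trainable and the rest are frozen during fine-tuning.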


In [None]:
# Generate spectrum analysis command
spectrum_clone_folder = "./spectrum"
spectrum_layer_percent = "50"

command_to_execute = f"cd {spectrum_clone_folder} && python3 spectrum.py --model-name {model_id} --top-percent {spectrum_layer_percent}"
print(f"Run this command in terminal: {command_to_execute}")

(Optional) If you have run the command in the terminal, change the value below to `True`. Otherwise, just execute the cell as is.


In [None]:
have_you_run_the_command_in_terminal = False

spectrum_output_filename = f"snr_results_{filesafe_model_id}_unfrozenparameters_{spectrum_layer_percent}percent.yaml"
spectrum_output_filepath = (
    f"{spectrum_clone_folder}/{spectrum_output_filename}"
    if have_you_run_the_command_in_terminal
    else f"./spectrum_layers/{spectrum_output_filename}"
)
spectrum_output_filepath

The next cell copies the Spectrum output into your local `scripts` folder so that it is packaged with the code assets for your training job.


In [None]:
!mkdir -p ./scripts/spectrum-layer/
!cp {spectrum_output_filepath} ./scripts/spectrum-layer/
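
For context, a training script would typically use the patterns in this file to decide which parameters stay trainable. The sketch below assumes the YAML file holds a list of regular expressions (for example under an `unfrozen_parameters` key, which is an assumption you should check against your generated file). It works on plain parameter names so that it runs without PyTorch; the patterns and names are hypothetical.

```python
import re


def trainable_flags(param_names, unfrozen_patterns):
    """Map each parameter name to True (train) or False (freeze)."""
    compiled = [re.compile(p) for p in unfrozen_patterns]
    return {
        name: any(rx.match(name) for rx in compiled)
        for name in param_names
    }


# Hypothetical patterns, standing in for the list loaded from the YAML file.
patterns = [
    r"^lm_head.weight$",
    r"model.layers.1.mlp.down_proj",
]
# Hypothetical parameter names, standing in for model.named_parameters().
names = [
    "lm_head.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.1.mlp.down_proj.weight",
]
flags = trainable_flags(names, patterns)
print(flags)
```

In a PyTorch script, the same flag would be assigned to `param.requires_grad` for each entry of `model.named_parameters()`.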

## 3. Dataset Preparation

Load and prepare training datasets for spectrum-aware fine-tuning.


In [None]:
# Disable Xet in the Hugging Face Hub client because of a bug with ipykernel
# https://github.com/huggingface/xet-core/issues/526
import os
os.environ["HF_HUB_DISABLE_XET"] = "1"

In [None]:
from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

nl2sql_text = load_dataset("gretelai/synthetic_text_to_sql").shuffle(seed=42)
ds = load_dataset("interstellarninja/hermes_reasoning_tool_use").shuffle(
    seed=42
)

In [None]:
# For the demo, keep only a small sample of each dataset
ds['train'] = ds['train'].select(range(600))
nl2sql_text['train'] = nl2sql_text['train'].select(range(500))
nl2sql_text['test'] = nl2sql_text['test'].select(range(100))

## 4. Data Preprocessing


In [None]:
def convert_and_tokenize_hermes_reasoning_tool_use(sample, tokenizer):
    # Replace "from" key with "role", and "value" with "content"
    for message in sample['conversations']:
        message["role"] = message.pop("from")
        message["content"] = message.pop("value")

    # Replace "human" value with "user", and "gpt" with "assistant"
    for message in sample['conversations']:
        if message["role"] == "human":
            message["role"] = "user"
        elif message["role"] == "gpt":
            message["role"] = "assistant"

    # Apply the chat template
    sample["text"] = tokenizer.apply_chat_template(
        sample['conversations'], tokenize=False, enable_thinking=False
    )
    return sample


ds_v2 = ds["train"].map(
    convert_and_tokenize_hermes_reasoning_tool_use,
    remove_columns=list(ds["train"].features),
    fn_kwargs={"tokenizer": tok},
)
# Fix the seed so the train/test split is reproducible across runs
ds_v2 = ds_v2.train_test_split(test_size=0.2, seed=42)
tool_call_train_dataset = ds_v2['train']
tool_call_test_dataset = ds_v2['test']
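
The key and role renaming in the cell above can be checked in isolation, without a tokenizer. The following sketch is an equivalent variant that builds new dictionaries instead of mutating the input; the helper `to_chat_messages` and the sample conversation are hypothetical. Roles other than `human` and `gpt` (such as `system`) pass through unchanged, as in the cell above.

```python
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_chat_messages(conversations):
    """Convert ShareGPT-style turns into role/content messages."""
    return [
        {
            "role": ROLE_MAP.get(turn["from"], turn["from"]),
            "content": turn["value"],
        }
        for turn in conversations
    ]


# Hypothetical two-turn conversation in the source dataset's layout.
raw = [
    {"from": "human", "value": "List all tables."},
    {"from": "gpt", "value": "SELECT name FROM sqlite_master;"},
]
messages = to_chat_messages(raw)
print(messages)
```

The resulting list has the shape that `tokenizer.apply_chat_template` expects.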

In [None]:
def convert_and_tokenize_synthetic_text_to_sql(sample, tokenizer):
    system = """
        You are an expert SQL developer. Given the provided database schema and the following user question, generate a syntactically correct SQL query.
        Only reply with the SQL query, nothing else. Do NOT use the backticks to identify the code, just reply with the pure SQL query.
    """
    query = f"""
        -- Schema --
        {sample["sql_context"]}
        -- Query --
        {sample["sql_prompt"]}
        -- SQL --
    """
    # reasoning = sample["sql_explanation"]
    answer = sample["sql"]
    chat = [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
        # {"role": "assistant", "reasoning_content": reasoning, "content": answer},
        {"role": "assistant", "content": answer},
    ]
    sample["text"] = tokenizer.apply_chat_template(
        chat, tokenize=False, enable_thinking=False
    )
    return sample


nl2sql_train_dataset = nl2sql_text["train"].map(
    convert_and_tokenize_synthetic_text_to_sql,
    remove_columns=list(nl2sql_text["train"].features),
    fn_kwargs={"tokenizer": tok},
)
nl2sql_test_dataset = nl2sql_text["test"].map(
    convert_and_tokenize_synthetic_text_to_sql,
    remove_columns=list(nl2sql_text["test"].features),
    fn_kwargs={"tokenizer": tok},
)

In [None]:
train_dataset = concatenate_datasets(
    [tool_call_train_dataset, nl2sql_train_dataset]
)
test_dataset = concatenate_datasets(
    [tool_call_test_dataset, nl2sql_test_dataset]
)
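
As a quick sanity check, the expected sizes of the combined datasets follow from the sample sizes chosen earlier. The sketch below is plain arithmetic under the assumption that the 20% split of the 600 tool-use rows yields exactly 120 test rows; the helper `split_sizes` is hypothetical.

```python
def split_sizes(n_rows: int, test_percent: int) -> tuple[int, int]:
    """Return (train, test) row counts for a percentage split."""
    test = n_rows * test_percent // 100
    return n_rows - test, test


tool_train, tool_test = split_sizes(600, 20)  # sampled tool-use rows
sql_train, sql_test = 500, 100                # sampled text-to-SQL rows

expected_train = tool_train + sql_train
expected_test = tool_test + sql_test
print(expected_train, expected_test)
```

Comparing these numbers with `len(train_dataset)` and `len(test_dataset)` is a cheap way to confirm that no rows were dropped along the way.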

## 5. Upload Training Data


In [None]:
import boto3
import shutil
import sagemaker

sagemaker_session = sagemaker.Session()
s3_client = boto3.client('s3')

bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix

# save train_dataset to s3 using our SageMaker session
if default_prefix:
    input_path = f"{default_prefix}/datasets/finetuning-modeltrainer-accelerate"
else:
    input_path = "datasets/finetuning-modeltrainer-accelerate"

train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train/dataset.json"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/val/dataset.json"

# Save datasets to s3
# Only the sampled subsets selected earlier are written, which keeps the training job small for the workshop
train_dataset.to_json("./data/train/dataset.json", orient="records")
test_dataset.to_json("./data/val/dataset.json", orient="records")

s3_client.upload_file(
    "./data/train/dataset.json", bucket_name, f"{input_path}/train/dataset.json"
)
s3_client.upload_file(
    "./data/val/dataset.json", bucket_name, f"{input_path}/val/dataset.json"
)

shutil.rmtree("./data")

print("Training data uploaded to:")
print(train_dataset_s3_path)
print(val_dataset_s3_path)
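
The prefix handling above follows a common pattern: the S3 key is built with or without a leading prefix, depending on whether the session defines one. A minimal sketch of that pattern as a reusable function is shown below; `build_s3_uri` and the bucket and prefix names are hypothetical and not part of the SageMaker SDK.

```python
def build_s3_uri(bucket, key, prefix=None):
    """Join bucket, optional prefix, and key into an s3:// URI."""
    parts = [p.strip("/") for p in (prefix, key) if p]
    return f"s3://{bucket}/" + "/".join(parts)


# Hypothetical bucket and prefix names, for illustration only.
with_prefix = build_s3_uri("my-bucket", "datasets/train/dataset.json", "team-a")
without_prefix = build_s3_uri("my-bucket", "datasets/train/dataset.json")
print(with_prefix)
print(without_prefix)
```

Stripping slashes from each part avoids accidental double slashes when a prefix is configured with a trailing `/`.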

## 6. Training Configuration


In [None]:
import sagemaker
from sagemaker.config import load_sagemaker_config

sagemaker_session = sagemaker.Session()

bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
configs = load_sagemaker_config()
# Override the instance type (for example, "ml.p4d.24xlarge") if you want to get a different container version
instance_type = "ml.g5.2xlarge"
instance_count = 1
config_filename = "Qwen3-0.6B-spectrum.yaml"
accelerate_filename = "deepspeed_zero3.yaml"
print(instance_type)
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.6.0",
    instance_type=instance_type,
    image_scope="training",
)
print(config_filename)
print(image_uri)

Set up the Model Trainer API to launch the training job.


In [None]:
from sagemaker.modules.configs import (
    CheckpointConfig,
    Compute,
    OutputDataConfig,
    SourceCode,
    StoppingCondition,
)
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

# environment variables
env = {}
env["FI_PROVIDER"] = "efa"
env["NCCL_PROTO"] = "simple"
env["NCCL_SOCKET_IFNAME"] = "eth0"
env["NCCL_IB_DISABLE"] = "1"
env["NCCL_DEBUG"] = "WARN"
env["HF_token"] = ""  # os.environ['hf_token']
env["CONFIG_PATH"] = f"recipes/{config_filename}"
env["ACCELERATE_CONFIG_PATH"] = f"accelerate_configs/{accelerate_filename}"

# hyper parameters
params = {
    'mlflow_tracking_uri': MLFLOW_TRACKING_URI,
    'experiment_name': 'qwen3-spectrum-experiment',
}
# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="run_finetuning.sh",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=3600,
)

# define Training Job Name
job_name = f"train-{model_id.split('/')[-1].replace('.', '-')}-accelerate"

# define OutputDataConfig path
if default_prefix:
    output_path = f"s3://{bucket_name}/{default_prefix}/{job_name}"
else:
    output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    environment=env,
    hyperparameters=params,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    checkpoint_config=CheckpointConfig(
        s3_uri=output_path + "/checkpoint", local_path="/opt/ml/checkpoints"
    ),
)
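
The job name above replaces the dot in the model name because SageMaker training job names may only contain letters, digits, and hyphens, and are limited to 63 characters. The sketch below generalizes that derivation and checks the result against the documented name pattern; `make_base_job_name` is a hypothetical helper. Keep in mind that the SDK is expected to append a unique suffix to `base_job_name`, so the base name should leave room for it.

```python
import re

# Pattern and 63-character limit as documented for SageMaker training job names.
JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_base_job_name(model_id: str, suffix: str = "accelerate") -> str:
    """Derive a job-name-safe string from a Hugging Face model id."""
    short_name = model_id.split("/")[-1]
    # Replace every character that is not a letter, digit, or hyphen.
    safe = re.sub(r"[^a-zA-Z0-9-]", "-", short_name)
    return f"train-{safe}-{suffix}"


name = make_base_job_name("Qwen/Qwen3-0.6B")
print(name, bool(JOB_NAME_RE.match(name)))
```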

In [None]:
from sagemaker.modules.configs import InputData

# Pass the input data
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,  # S3 path where training data is stored
)

val_input = InputData(
    channel_name="val",
    data_source=val_dataset_s3_path,  # S3 path where validation data is stored
)

# Check input channels configured
data = [train_input, val_input]
data
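
Inside the training container, each channel is made available as a local directory named after the channel. The sketch below shows how a training script might locate it, assuming the `SM_CHANNEL_<NAME>` environment variables and the `/opt/ml/input/data/<channel>` mount point that SageMaker training containers conventionally provide; `channel_dir` is a hypothetical helper and not part of this workshop's scripts.

```python
import os


def channel_dir(channel_name: str, environ=os.environ) -> str:
    """Return the local directory for an input channel inside the container."""
    env_key = f"SM_CHANNEL_{channel_name.upper()}"
    # Fall back to the conventional mount point when the variable is unset.
    return environ.get(env_key, f"/opt/ml/input/data/{channel_name}")


# Simulated environments; a real job would use os.environ.
train_dir = channel_dir("train", environ={})
val_dir = channel_dir("val", environ={"SM_CHANNEL_VAL": "/opt/ml/input/data/val"})
print(train_dir)
print(val_dir)
```

With the channels configured above, the uploaded `dataset.json` files would be expected inside these two directories.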

## 7. Start Training Job


In [None]:
model_trainer.train(input_data_config=data, wait=False)

## 8. Monitor Training Progress


In [None]:
# Block until the training job reaches a terminal state, then report its status
model_trainer._latest_training_job.wait()
training_job_name = model_trainer._latest_training_job.training_job_name
print(f"Training job: {training_job_name}")
print(f"Status: {model_trainer._latest_training_job.training_job_status}")
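
If you prefer not to rely on the private `_latest_training_job` attribute, the status can also be polled through the SageMaker API. The sketch below is one possible approach: `wait_for_job` is a hypothetical helper that accepts any callable with the same shape as boto3's `describe_training_job`, and a fake callable stands in for the real client so the example runs without AWS access.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll `describe` until the job reaches a terminal state."""
    while True:
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)


# Stand-in for boto3.client("sagemaker").describe_training_job.
_responses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_responses)}


final_status = wait_for_job(fake_describe, "demo-job", sleep=lambda _: None)
print(final_status)
```

Against a real account, you would pass `boto3.client("sagemaker").describe_training_job` and the actual training job name instead of the fake.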

In [None]:
%store training_job_name

## Training Complete

When the monitoring cell above returns, the spectrum-aware fine-tuning job has reached a terminal state. If the reported status is `Completed`, the model is ready for evaluation.

<div class="alert alert-block alert-info">
  <b>Important:</b> If you're running this at an AWS event, we have pre-deployed a SageMaker endpoint for you in a central account with the fine-tuned model. If you want to deploy your own endpoint based on this training job in the current account AFTER the training is complete, you can use the notebook in `optional/2-custom-model-deployment.ipynb`. <b>You can skip this in an AWS-led event!</b>
</div>

-----

**Next**: Proceed to notebook 3 to evaluate the fine-tuned model performance.
