# Expert-Specialized Fine-Tuning (ESFT) on SageMaker


---

This notebook's CI test result for us-east-1 is as follows. CI test results in other regions can be found at the end of the notebook.

![](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/build_and_train_models)

---

## Introduction

This notebook demonstrates **Expert Specialized Fine-Tuning (ESFT)** on Amazon SageMaker, a novel approach for efficiently fine-tuning sparse architectural Large Language Models (LLMs) such as Mixture of Experts (MoE) models.

### Background

ESFT is based on the research paper ["Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models"](https://arxiv.org/abs/2407.01906) by DeepSeek. This method addresses the challenge of fine-tuning MoE models by selectively training only the most relevant experts for specific tasks, rather than updating all parameters.

### Key Benefits of ESFT:

- **Efficiency**: Reduces computational overhead by training only relevant experts
- **Performance**: Maintains or improves model performance on target tasks
- **Scalability**: Enables fine-tuning of large MoE models with limited resources
- **Task Specialization**: Allows experts to specialize in specific domains or tasks

### Example Use Case

In this notebook, we demonstrate ESFT using:
- **Model**: Qwen3-30B-A3B-Instruct-2507 (an MoE model with 30B total parameters, of which about 3B are active per token)
- **Task**: Low-resource translation (Cherokee → English)
- **Dataset**: Cherokee-English translation pairs for specialized language translation

This example showcases how ESFT can be particularly effective for low-resource language tasks where traditional fine-tuning might be challenging due to limited data availability.


## Contents

This notebook covers the following steps for running ESFT on SageMaker:

1. **Setup and Configuration** - Configure training parameters and environment
2. **Import Required Libraries** - Load necessary SageMaker and AWS libraries
3. **AWS Setup** - Configure AWS credentials and execution roles
4. **Upload Dataset to S3** - Prepare Cherokee-English translation dataset
5. **Configure Hyperparameters** - Set ESFT-specific training parameters
6. **Configure SageMaker Training Components** - Set up compute resources and storage
7. **Create and Configure ModelTrainer** - Initialize the SageMaker training job
8. **Start Training Job** - Launch the ESFT training process
9. **Training Results** - Monitor and retrieve training outputs
10. **Download Model** - Download the trained model and output results from S3

### Prerequisites

- AWS account with SageMaker access
- Pre-built ESFT Docker container in ECR
- Cherokee-English translation dataset
- Appropriate IAM roles and permissions


## 1. Setup and Configuration

First, let's set up the configuration parameters for ESFT training. We'll use the Qwen3-30B model for Cherokee to English translation as our example use case.

In [None]:
# Configuration parameters for ESFT training
INSTANCE = "ml.p4d.24xlarge"
NUM_GPU = 8
SCORE_TOKENS = 16384
TRAIN_DATASET = "datasets/train/translation.jsonl"
EVAL_DATASET = "datasets/eval/translation.jsonl"
MAX_RUN_HOURS = 5
STORAGE_VOLUME = 100

# Pre-built SageMaker container with ESFT implementation
# Keep REGION fixed to us-east-1 when using the pre-built image below,
# because the image cannot be used across regions.
REGION = "us-east-1"
SM_IMAGE_URI = "798050803670.dkr.ecr.us-east-1.amazonaws.com/esft-sagemaker-nvcr:0.0.1"

# Model and ESFT-specific parameters
MODEL = "Qwen/Qwen3-30B-A3B-Instruct-2507"  # Using Qwen3-30B for Cherokee-English translation
SCORE_FUNCTION = "token"
SCORE_THRESHOLD = 0.2
WORLD_SIZE = 1

print(f"Instance Type: {INSTANCE}")
print(f"Number of GPUs: {NUM_GPU}")
print(f"Model: {MODEL}")
print(f"Training Dataset: {TRAIN_DATASET}")
print(f"Validation Dataset: {EVAL_DATASET}")
print(f"SageMaker Image: {SM_IMAGE_URI}")
print(f"\nESFT Configuration:")
print(f"Score Function: {SCORE_FUNCTION}")
print(f"Score Threshold: {SCORE_THRESHOLD}")
print(f"Score Tokens: {SCORE_TOKENS}")

Instance Type: ml.p4d.24xlarge
Number of GPUs: 8
Model: Qwen/Qwen3-30B-A3B-Instruct-2507
Training Dataset: datasets/train/translation.jsonl
Validation Dataset: datasets/eval/translation.jsonl
SageMaker Image: 798050803670.dkr.ecr.us-east-1.amazonaws.com/esft-sagemaker-nvcr:0.0.1

ESFT Configuration:
Score Function: token
Score Threshold: 0.2
Score Tokens: 16384
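
Because the pre-built image is tied to one region, it can help to confirm that the image URI and `REGION` agree before submitting a job. The following is a minimal sketch; the helper name `ecr_image_region` is hypothetical, and it assumes the standard private ECR URI layout `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>`.

```python
import re


def ecr_image_region(image_uri: str) -> str:
    """Return the AWS region embedded in a private ECR image URI."""
    match = re.match(r"^\d{12}\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com/", image_uri)
    if match is None:
        raise ValueError(f"Not a recognizable ECR image URI: {image_uri}")
    return match.group(1)


# Values copied from the configuration cell above
image_uri = "798050803670.dkr.ecr.us-east-1.amazonaws.com/esft-sagemaker-nvcr:0.0.1"
job_region = "us-east-1"

if ecr_image_region(image_uri) != job_region:
    raise RuntimeError("The training image and the job region differ")
print(f"Image region matches job region: {job_region}")
```

If you push the container to your own ECR repository in another region, update both `SM_IMAGE_URI` and `REGION` so this check still passes.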


## 2. Import Required Libraries

Import the necessary libraries for SageMaker training.

In [None]:
!pip install sagemaker==3.4.0 boto3

import os
import json
import boto3
boto3.setup_default_session(region_name=REGION)

from sagemaker.train import ModelTrainer
from sagemaker.train.configs import (
    SourceCode, 
    InputData, 
    Compute,
    StoppingCondition,
    TensorBoardOutputConfig,
)
from sagemaker.core.helper.session_helper import get_execution_role
import time
from utils import s3_upload

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ubuntu/.config/sagemaker/config.yaml


## 3. AWS Setup

Configure AWS settings and create (or reuse) the IAM execution role for SageMaker AI.

In [None]:
# Create IAM Client
iam = boto3.client('iam')
sts = boto3.client('sts')

# Get Account id
account_id = sts.get_caller_identity()['Account']
print(f"Account ID: {account_id}")

# Role name
role_name = 'ESFTTrainingRole'

# Trust Policy
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com",
                    "ec2.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

# Create the role (skipped if it already exists)
try:
    response = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
        Description='Role for ECR, S3, SageMaker Full Access',
        MaxSessionDuration=3600  # 1 hour
    )
    print(f"Role Created: {response['Role']['Arn']}")
except iam.exceptions.EntityAlreadyExistsException:
    print(f"Role '{role_name}' already exists.")

# Attach AWS managed policies
managed_policies = [
    'arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess',  # ECR
    'arn:aws:iam::aws:policy/AmazonS3FullAccess',                     # S3
    'arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'               # SageMaker
]

for policy_arn in managed_policies:
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn=policy_arn
    )
    print(f"  - Policy attached: {policy_arn.split('/')[-1]}")

role = f"arn:aws:iam::{account_id}:role/{role_name}"

# Finalize AWS settings
region = REGION or boto3.Session().region_name or 'us-east-1'
s3_bucket = f"sagemaker-{region}-{account_id}"  # It will be created automatically

print(f"\n" + "=" * 20)
print(f"Account ID: {account_id}")
print(f"Region: {region}")
print(f"S3 Bucket: {s3_bucket}")
print(f"Role: {role}")

Account ID: 798050803670
Role 'ESFTTrainingRole' already exists.
  - Policy attached: AmazonEC2ContainerRegistryFullAccess
  - Policy attached: AmazonS3FullAccess
  - Policy attached: AmazonSageMakerFullAccess

Account ID: 798050803670
Region: us-east-1
S3 Bucket: sagemaker-us-east-1-798050803670
Role: arn:aws:iam::798050803670:role/ESFTTrainingRole


## 4. Upload Dataset to S3

Upload the Cherokee-English translation dataset to the S3 bucket. This dataset contains parallel text pairs for low-resource language translation.
Note that a SageMaker Training Job reads its input datasets from S3. Each input channel is downloaded automatically into the container under `/opt/ml/input/data/<channel_name>`.
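
Before uploading, a quick sanity check of the JSONL files can catch malformed lines early. The sketch below only verifies that every non-empty line is a JSON object; the helper name `check_jsonl` is hypothetical, and the record schema expected by the ESFT container is not assumed here.

```python
import json
import os


def check_jsonl(path: str) -> int:
    """Return the number of records; raise ValueError on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            count += 1
    return count


# Paths from the configuration cell; the check is skipped if a file is missing
for dataset_path in ("datasets/train/translation.jsonl", "datasets/eval/translation.jsonl"):
    if os.path.exists(dataset_path):
        print(f"{dataset_path}: {check_jsonl(dataset_path)} records")
```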

In [4]:
# Upload Cherokee-English translation dataset to S3
print(f"Uploading Cherokee-English translation dataset: \n Train: {TRAIN_DATASET}\n Eval: {EVAL_DATASET}")
s3_upload(TRAIN_DATASET, f"s3://{s3_bucket}/input/data/train/")
s3_upload(EVAL_DATASET, f"s3://{s3_bucket}/input/data/eval/")

# Set the dataset path for SageMaker container
train_dataset = f"/opt/ml/input/data/train/{os.path.basename(TRAIN_DATASET)}"
eval_dataset = f"/opt/ml/input/data/eval/{os.path.basename(EVAL_DATASET)}"

Uploading Cherokee-English translation dataset: 
 Train: datasets/train/translation.jsonl
 Eval: datasets/eval/translation.jsonl


datasets/train/translation.jsonl is uploaded at s3://sagemaker-us-east-1-798050803670/input/data/train/translation.jsonl
datasets/eval/translation.jsonl is uploaded at s3://sagemaker-us-east-1-798050803670/input/data/eval/translation.jsonl


## 5. Configure Hyperparameters

Set up the ESFT-specific hyperparameters. These parameters control expert selection and specialized fine-tuning behavior.

Expert scoring
- The scoring method and threshold determine which experts are trained, so choose them carefully
- `n_sample_tokens`: more than $2^{17}$ (131,072) tokens is recommended; this notebook sets it to 16,384 ($2^{14}$)
- `score_function`: choose either `"token"` or `"gate"`
- `score_threshold`: a value under 0.5 is recommended. A higher threshold can make training inefficient

ESFT Training
- `train_shared_experts` and `train_non_expert_modules` are set to `false` for the most efficient training. If downstream task performance matters more than efficiency, set them to `true`
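
To make the role of `score_threshold` concrete, here is a minimal sketch of threshold-based expert selection as we understand it from the ESFT paper: experts in a layer are ranked by a relevance score, and the top-ranked experts are kept until their cumulative normalized score reaches the threshold. The function name `select_experts` and the per-expert counts are hypothetical illustrations, and the container's actual implementation may differ.

```python
def select_experts(scores: dict, threshold: float) -> list:
    """Pick the top-scoring experts until their normalized scores sum to >= threshold."""
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for expert_id, score in ranked:
        if cumulative >= threshold:
            break
        selected.append(expert_id)
        cumulative += score / total
    return selected


# Hypothetical routing counts for one MoE layer with 8 experts
# (how often each expert was selected on the sampled task tokens)
token_counts = {0: 50, 1: 5, 2: 120, 3: 10, 4: 300, 5: 15, 6: 400, 7: 100}

print(select_experts(token_counts, threshold=0.2))
print(select_experts(token_counts, threshold=0.5))
```

In this toy example a larger threshold keeps more experts per layer, which is why lower thresholds mean fewer trainable parameters.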


In [5]:
# ESFT-specific hyperparameters for Cherokee-English translation
hyperparameters = {
    "model": MODEL,
    "score_function": SCORE_FUNCTION,  # Expert scoring method
    "score_threshold": str(SCORE_THRESHOLD),  # Threshold for expert selection
    "world_size": str(WORLD_SIZE),
    "gpus_per_process": str(NUM_GPU),
    "train_dataset": train_dataset,
    "eval_dataset": eval_dataset,
    # Expert scoring configuration
    # (score_function and score_threshold are already set above from the config variables)
    "n_sample_tokens": str(SCORE_TOKENS),  # Tokens used for expert scoring
    # ESFT-specific training configuration
    "train_shared_experts": "false",  # Only train selected experts
    "train_non_expert_modules": "false",  # Focus on expert modules only
    "expert_config_dir": "",
    # Standard training parameters optimized for low-resource translation
    "train_epochs": "1",
    "micro_batch_size": "1",
    "global_batch_size": "256",
    "learning_rate": "7e-6",
    "warmup_ratio": "0.1",
    "weight_decay": "0.1",
    "min_learning_rate": "0.0",
    "max_length": "16384",
    "lr_decay_style": "cosine",
    "save_interval": "100",
    "use_wandb": "false",
    "expert_parallel": str(NUM_GPU),
    "pipeline_parallel": "1",
}

print("ESFT Hyperparameters for Cherokee-English Translation:")
print("\nCore ESFT Parameters:")
print(f"  model: {hyperparameters['model']}")
print(f"  score_function: {hyperparameters['score_function']}")
print(f"  score_threshold: {hyperparameters['score_threshold']}")
print(f"  n_sample_tokens: {hyperparameters['n_sample_tokens']}")
print("\nDataset Configuration:")
print(f"  train_dataset: {hyperparameters['train_dataset']}")
print(f"  eval_dataset: {hyperparameters['eval_dataset']}")
print("\nTraining Configuration:")
print(f"  train_epochs: {hyperparameters['train_epochs']}")
print(f"  learning_rate: {hyperparameters['learning_rate']}")
print(f"  global_batch_size: {hyperparameters['global_batch_size']}")

ESFT Hyperparameters for Cherokee-English Translation:

Core ESFT Parameters:
  model: Qwen/Qwen3-30B-A3B-Instruct-2507
  score_function: token
  score_threshold: 0.2
  n_sample_tokens: 16384

Dataset Configuration:
  train_dataset: /opt/ml/input/data/train/translation.jsonl
  eval_dataset: /opt/ml/input/data/eval/translation.jsonl

Training Configuration:
  train_epochs: 1
  learning_rate: 7e-6
  global_batch_size: 256
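
The `learning_rate`, `warmup_ratio`, `min_learning_rate`, and `lr_decay_style` settings together describe a warmup-then-cosine schedule. The sketch below shows one common way such a schedule is computed, assuming linear warmup; the exact formula used inside the container is not confirmed by this notebook, and `total_steps=100` is a hypothetical value chosen only for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr=7e-6, min_lr=0.0, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Print the learning rate at a few points of a hypothetical 100-step run
for step in (0, 9, 10, 55, 100):
    print(step, lr_at_step(step, total_steps=100))
```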


## 6. Configure SageMaker Training Components

Set up the SageMaker training configuration.

In [6]:
# SageMaker configuration
source_code = SourceCode(
    command="python /opt/ml/code/sagemaker_entrypoint.py",  # already uploaded in the docker img
)

compute = Compute(
    instance_count=1,
    instance_type=INSTANCE,
    volume_size_in_gb=STORAGE_VOLUME,
)

tb_config = TensorBoardOutputConfig(
    s3_output_path=f"s3://{s3_bucket}/output/tensorboard",
    local_path="/opt/ml/output/tensorboard"
)

input_data_config = [
    InputData(
        data_source=f"s3://{s3_bucket}/input/data/train/",
        channel_name="train",
    ),
    InputData(
        data_source=f"s3://{s3_bucket}/input/data/eval/",
        channel_name="eval",
    ),
]

stopping_condition = StoppingCondition(
    max_runtime_in_seconds=MAX_RUN_HOURS * 3600
)

print(f"Instance Type: {INSTANCE}")
print(f"Volume Size: {STORAGE_VOLUME} GB")
print(f"TensorBoard Output: s3://{s3_bucket}/output/tensorboard")

Instance Type: ml.p4d.24xlarge
Volume Size: 100 GB
TensorBoard Output: s3://sagemaker-us-east-1-798050803670/output/tensorboard


## 7. Create and Configure ModelTrainer

Initialize the SageMaker ModelTrainer with all configurations.

In [7]:
# Create ModelTrainer
model_trainer = ModelTrainer(
    training_image=SM_IMAGE_URI,
    source_code=source_code,
    compute=compute,
    hyperparameters=hyperparameters,
    role=role,
    base_job_name="esft",
    stopping_condition=stopping_condition,
    environment={
        "TOKENIZERS_PARALLELISM": "false",
    }
)

# Add TensorBoard configuration
model_trainer.with_tensorboard_output_config(tb_config)

print("ModelTrainer configured successfully!")
print(f"Base job name: esft")
print(f"Training image: {SM_IMAGE_URI}")

ModelTrainer configured successfully!
Base job name: esft
Training image: 798050803670.dkr.ecr.us-east-1.amazonaws.com/esft-sagemaker-nvcr:0.0.1


## 8. Start Training Job

Launch the SageMaker training job. If you see an error like the one below (the instance type in the message will be the one you requested), visit [AWS Service Quotas](https://console.aws.amazon.com/servicequotas/) to request a higher instance limit.

> ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p5.48xlarge for training job usage' is 0 Instances, with current utilization of 0. Instances and a request delta of 1 Instances. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.
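
You can also look up the current quota programmatically before submitting the job. The sketch below uses the Service Quotas API through boto3; the helper names are hypothetical, and the quota name format (`<instance type> for training job usage`) is an assumption taken from the error message above.

```python
def find_training_quota(quotas, instance_type):
    """Return the quota value for training-job usage of instance_type, or None."""
    name = f"{instance_type} for training job usage"
    for quota in quotas:
        if quota["QuotaName"] == name:
            return quota["Value"]
    return None


def fetch_sagemaker_quotas(region):
    """List all SageMaker quotas in a region (requires AWS credentials)."""
    import boto3  # imported lazily so the lookup helper works offline

    client = boto3.client("service-quotas", region_name=region)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas


# Offline illustration with a hypothetical quota entry (not real account data)
sample_quotas = [{"QuotaName": "ml.p4d.24xlarge for training job usage", "Value": 0.0}]
print(find_training_quota(sample_quotas, "ml.p4d.24xlarge"))
```

With credentials configured, `find_training_quota(fetch_sagemaker_quotas(REGION), INSTANCE)` would return the applied limit for the instance type used in this notebook.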

In [8]:
# Job Info
print("="*50)
print("SageMaker Training Start")
print(f"Instance Type: {INSTANCE}")
print(f"Instance Count: 1")
print(f"Model: {MODEL}")
print(f"Train Dataset: {TRAIN_DATASET}")
print(f"Valid Dataset: {EVAL_DATASET}")
print(f"Output Location: {model_trainer.output_data_config}")
print("="*50)

# Start training
model_trainer.train(
    input_data_config=input_data_config,
    wait=False,   # Return immediately instead of blocking until the job completes
    logs=False    # Do not stream the training container logs to the notebook
)

print("="*50)
print("Training Job Submitted!")
print("="*50)

SageMaker Training Start
Instance Type: ml.p4d.24xlarge
Instance Count: 1
Model: Qwen/Qwen3-30B-A3B-Instruct-2507
Train Dataset: datasets/train/translation.jsonl
Valid Dataset: datasets/eval/translation.jsonl
Output Location: s3_output_path='s3://sagemaker-us-east-1-798050803670/esft' kms_key_id=None compression_type='GZIP' remove_job_name_from_s3_output_path=<sagemaker.core.utils.utils.Unassigned object at 0x75e90cf7ca40> disable_model_upload=<sagemaker.core.utils.utils.Unassigned object at 0x75e90cf7ca40> channels=<sagemaker.core.utils.utils.Unassigned object at 0x75e90cf7ca40>


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ubuntu/.config/sagemaker/config.yaml


Training Job Submitted!


## 9. Training Results

Check the training job status and results. You can also find the job in the [SageMaker Training Job Console](https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs). The training logs are available in [Amazon CloudWatch](https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FTrainingJobs).

In [9]:
# Get training job information (most recent jobs first)
client = boto3.client('sagemaker')
training_job = client.list_training_jobs(
    MaxResults=10,
    NameContains='esft',     # matches the base_job_name used for this job
    SortBy='CreationTime',
    SortOrder='Descending',  # the API default is ascending, which returns the oldest jobs
)['TrainingJobSummaries']

valid_status = ['InProgress','Completed','Failed']
training_job = [job for job in training_job if job['TrainingJobStatus'] in valid_status]

print("Training job details:")
print(f"Job Name: {training_job[0]['TrainingJobName']}")
print(f"Job Status: {training_job[0]['TrainingJobStatus']}")
print(f"SageMaker Training Job Console: https://console.aws.amazon.com/sagemaker/home?region={REGION}#/jobs/{training_job[0]['TrainingJobName']}")

Training job details:
Job Name: esft-20260209094702
Job Status: InProgress
SageMaker Training Job Console: https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs/esft-20260209094702
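
Because the job was submitted with `wait=False`, the status above is only a snapshot. A minimal polling sketch is shown below; the helper name `wait_for_training_job` is hypothetical, and it relies only on the `describe_training_job` call of the boto3 SageMaker client.

```python
import time

# Statuses after which the job will not change state any more
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll the job status until it reaches a terminal state, then return it."""
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

For example, `wait_for_training_job(client, training_job[0]['TrainingJobName'])` blocks the notebook until the job finishes, which is convenient before running the download step.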


You can check GPU utilization and other compute metrics for the job in the AWS console, as shown below.
![image](assets/GPU-utilization.png)

The SageMaker TensorBoard application makes it easy to inspect the training curves. The code below prints a URL that takes you to the TensorBoard page for this training job.  
Alternatively, you can download the TensorBoard files from the S3 bucket and run TensorBoard locally.

In [10]:
from sagemaker.core.interactive_apps import TensorBoardApp

app = TensorBoardApp(REGION)
print(
    app.get_app_url(
        training_job_name = training_job[0]['TrainingJobName'],
    )
)




![image](assets/tensorboard.png)
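
For the local alternative, the TensorBoard event files can be copied from S3 with boto3. This is a sketch under assumptions: the helper names are hypothetical, and the S3 prefix `output/tensorboard/` is taken from the `TensorBoardOutputConfig` defined earlier.

```python
import os


def local_path_for_key(key: str, prefix: str, dest_dir: str) -> str:
    """Map an S3 key under prefix to a file path under dest_dir."""
    relative = key[len(prefix):].lstrip("/")
    return os.path.join(dest_dir, *relative.split("/"))


def download_prefix(bucket: str, prefix: str, dest_dir: str) -> list:
    """Download every object under an S3 prefix (requires AWS credentials)."""
    import boto3  # imported lazily so the path helper works offline

    s3 = boto3.client("s3")
    downloaded = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):
                continue  # skip folder placeholder objects
            target = local_path_for_key(obj["Key"], prefix, dest_dir)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, obj["Key"], target)
            downloaded.append(target)
    return downloaded
```

After calling `download_prefix(s3_bucket, "output/tensorboard/", "tb_logs")`, running `tensorboard --logdir tb_logs` locally should show the same curves.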

## 10. Download Model

Finally, we can download the trained model and output results from the S3 bucket.

In [None]:
# training_job is a list of job summaries; use the most recent one
training_job_details = client.describe_training_job(TrainingJobName=training_job[0]['TrainingJobName'])
model_s3_path = training_job_details['ModelArtifacts']['S3ModelArtifacts']
output_s3_path = model_s3_path.replace("model.tar.gz", "output.tar.gz")
print(f"\nModel artifacts: {model_s3_path}")
print(f"Output S3: {training_job_details['OutputDataConfig']['S3OutputPath']}")
print(f"Tensorboard S3: {training_job_details['TensorBoardOutputConfig']['S3OutputPath']}")

In [None]:
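# Download the job's output archive (output.tar.gz) from S3.
# Note: the file is saved locally under the name model.tar.gz.
s3 = boto3.client('s3')
s3.download_file(s3_bucket, output_s3_path.split(f"{s3_bucket}/")[-1], 'model.tar.gz')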
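
Once the archive is downloaded, it still needs to be unpacked. The sketch below shows a more explicit way to split an S3 URI into bucket and key, plus a small extraction helper; both function names are hypothetical, and the layout of files inside the archive depends on the ESFT container and is not assumed here.

```python
import tarfile
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def extract_archive(archive_path: str, dest_dir: str) -> list:
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)  # only extract archives you trust, such as your own job output
    return names
```

For example, `extract_archive("model.tar.gz", "esft_output")` unpacks the file downloaded above and returns the list of extracted paths.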