# Fine-tune Llama 3.2 3B Instruct with PyTorch FSDP and QLoRA on Amazon SageMaker AI using the interactive @remote decorator


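In this demo notebook, we demonstrate how to fine-tune the Meta-Llama-3.2-3B model on SageMaker AI, using the @remote decorator to execute training jobs interactively, directly from the notebook. We also use QLoRA, Hugging Face PEFT, and bitsandbytes. Note that the code cells below set `model_id` to `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`, so the sample outputs come from that model.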

**FSDP + QLoRA Background**

Hugging Face supports combining QLoRA with PyTorch FSDP (Fully Sharded Data Parallel). Together, FSDP and QLoRA let you fine-tune Llama-like architectures or Mixtral 8x7B. The core logic resides in Hugging Face PEFT; read more about it in the [PEFT documentation](https://huggingface.co/docs/peft/v0.10.0/en/accelerate/fsdp).

* [PyTorch FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a data/model parallelism technique that shards the model across GPUs, reducing memory requirements and enabling larger models to be trained more efficiently.
* QLoRA is a fine-tuning method that combines quantization with low-rank adapters to reduce computational requirements and memory footprint.
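To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The 8-billion parameter count, the 4-GPU split, and the helper name `weight_memory_gib` are illustrative assumptions, and the estimate covers only the weights, not activations, optimizer state, or quantization constants.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to hold the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Hypothetical 8-billion-parameter model, used only for illustration
num_params = 8_000_000_000

bf16_gib = weight_memory_gib(num_params, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(num_params, 4)    # 4-bit quantized weights

print(f"bf16 weights: {bf16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
# With FSDP full sharding, each of N GPUs holds roughly 1/N of the weights
print(f"4-bit weights per GPU on 4 GPUs: {nf4_gib / 4:.1f} GiB")
```

Quantization shrinks the frozen base weights, while sharding divides what remains across devices; the two savings multiply.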



Install the required libraries, including the Hugging Face libraries, and **restart** the kernel.

In [1]:
# %pip install -r requirements.txt --upgrade
# %pip install -q -U python-dotenv



## Setup Configuration file path

We set the directory in which the config.yaml file resides so that the @remote decorator can pick up its settings through [SageMaker Defaults](https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk).

This notebook uses the Hugging Face container for the `us-east-1` region. Make sure you are using the right image for your AWS region; otherwise, edit [config.yaml](./config.yaml). Container images are listed [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).


In [1]:
from dotenv import load_dotenv
import os

# Use .env in case of hidden variables
load_dotenv()

# Set path to config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()
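As a quick sanity check before launching a job, the sketch below resolves which file the override points to. It assumes that a directory override is expected to contain a file named `config.yaml`; the helper `resolve_config_path` is hypothetical and not part of the SageMaker SDK.

```python
import os
from pathlib import Path


def resolve_config_path(override: str) -> Path:
    """Return the config file that a directory or file override points to."""
    path = Path(override)
    # Assumption: a directory override is expected to contain `config.yaml`
    return path / "config.yaml" if path.is_dir() else path


override = os.environ.get("SAGEMAKER_USER_CONFIG_OVERRIDE", os.getcwd())
config_path = resolve_config_path(override)
print(f"Config file: {config_path} (exists: {config_path.exists()})")
```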

## Prepare the dataset

We are going to load the [Samsung/samsum](https://huggingface.co/datasets/Samsung/samsum) dataset.

In [2]:
from datasets import load_dataset
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the samsum dataset
dataset = load_dataset("samsum", trust_remote_code=True)

# Convert the train split to a pandas DataFrame
df = pd.DataFrame(dataset['train'])

# Optionally limit to the first 500 examples
df = df.iloc[0:500]

# Preview the data
print("Original dataframe shape:", df.shape)
df.head()

# Split the dataframe into train and test sets
train, test = train_test_split(df, test_size=0.1, random_state=42)

print("Number of train elements:", len(train))
print("Number of test elements:", len(test))

# If you need to retain the original column structure
#print("Train dataframe columns:", train.columns.tolist())
#print("Test dataframe columns:", test.columns.tolist())

Original dataframe shape: (500, 3)
Number of train elements: 450
Number of test elements: 50


Create a prompt template and load the dataset with a random sample to try summarization.

In [3]:
from random import randint

# custom instruct prompt start
prompt_template = """<|system|>
{system}
<|user|>
{instruction}
<|assistant|>
{completion}<|endoftext|>"""

# template dataset to add prompt to each sample
def template_dataset(sample):
    # Define system message
    system_message = "You are an AI assistant trained to summarize conversations accurately and concisely."
    
    # Format the instruction using the dialogue
    instruction = f"Summarize the following conversation:\n\n{sample['dialogue']}"
    
    # Use the summary as the completion
    completion = sample['summary']
    
    # Create the formatted text
    sample["text"] = prompt_template.format(
        system=system_message,
        instruction=instruction,
        completion=completion
    )
    return sample
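To see what the template produces without loading the dataset, here is a minimal sketch that fills it with a made-up sample. The dialogue and summary are hypothetical; only the template and system message come from the cell above.

```python
prompt_template = """<|system|>
{system}
<|user|>
{instruction}
<|assistant|>
{completion}<|endoftext|>"""

# Hypothetical sample with the same two fields used from the dataset
sample = {
    "dialogue": "Ann: Lunch at noon?\nBob: Sure, see you then.",
    "summary": "Ann and Bob will have lunch at noon.",
}

text = prompt_template.format(
    system="You are an AI assistant trained to summarize conversations accurately and concisely.",
    instruction=f"Summarize the following conversation:\n\n{sample['dialogue']}",
    completion=sample["summary"],
)
print(text)
```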

In [4]:
from datasets import Dataset, DatasetDict

train_dataset = Dataset.from_pandas(train)
test_dataset = Dataset.from_pandas(test)

dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))

# Pick a random training sample (randint is inclusive on both ends)
print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features))

<|system|>
You are an AI assistant trained to summarize conversations accurately and concisely.
<|user|>
Summarize the following conversation:

Ben: Rafal, how are you?
Rafal: Awesome, getting ready for the evening:D
Ben: In 2h and 30min, we can meet up:)
Ben: Cool
Rafal: Yee
Ben: Which subway exit is comfortable for you?
Rafal: All are fine, I haven't been there yet. Do you have any preferences?
Ben: I heaard that from exit 9 there are lots of restaurants, look at the map
Ben: <file_picture>
Ben: Which line are you supposed to take?
Rafal: I take blue line, so exit 9 will be perfect
Ben: good then I will be there
Rafal: Perfect, see you soon! 
Ben: Ah and if I arrive there I will contact your wife
Ben: If you have something trouble
Ben: can you send me text message 0123456789
Ben: I don't have any data left, hahhah
<|assistant|>
Ben and Rafal are meeting in 2.5 hours at the subway exit 9.<|endoftext|>


Utility function for initializing the distributed process group across multiple GPUs

In [5]:
import datetime
import os

import torch

def init_distributed():
    # Initialize the process group
    torch.distributed.init_process_group(
        backend="nccl", # Use "gloo" backend for CPU
        timeout=datetime.timedelta(seconds=5400)
    )
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    return local_rank
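The function above relies on environment variables that `torchrun` exports for each worker process. The sketch below shows how those variables can be read; the helper `read_torchrun_env` and its fallback defaults are assumptions added so the snippet also runs outside a `torchrun` launch.

```python
import os


def read_torchrun_env(env=None):
    """Read the rank variables that torchrun exports for each worker process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global rank across all nodes
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


# Hypothetical values for the third worker of a single 4-GPU node
print(read_torchrun_env({"RANK": "2", "LOCAL_RANK": "2", "WORLD_SIZE": "4"}))
```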

Utility function for model download

In [6]:
from huggingface_hub import snapshot_download
import os

def download_model(model_name):
    print("Downloading model ", model_name)

    os.makedirs("/tmp/tmp_folder", exist_ok=True)

    snapshot_download(repo_id=model_name, local_dir="/tmp/tmp_folder")

    print(f"Model {model_name} downloaded under /tmp/tmp_folder")

Use the Hugging Face Trainer class to fine-tune the model. Define the hyperparameters we want to use. We also create a DataCollator that takes care of padding our inputs and labels. To train our model, we need to convert our inputs (text) to token IDs, which is done by a Hugging Face Transformers tokenizer. In addition to LoRA, we use bitsandbytes 4-bit precision to quantize our frozen LLM to 4-bit and attach LoRA adapters to it.
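To illustrate what padding inputs and labels means for causal language modeling, here is a simplified pure-Python sketch. It is not the library's collator: `collate_causal_lm` is a hypothetical helper, and the token IDs are made up. The label value -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
def collate_causal_lm(batch, pad_token_id, label_ignore_index=-100):
    """Right-pad token ID lists to a common length and build labels for causal LM."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        pad = [pad_token_id] * (max_len - len(ids))
        input_ids.append(ids + pad)
        attention_mask.append([1] * len(ids) + [0] * len(pad))
        # Padded positions are excluded from the loss
        labels.append(ids + [label_ignore_index] * len(pad))
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Hypothetical token IDs; 0 stands in for the pad token
batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch)
```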

Define the train function

In [7]:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

In [9]:
from accelerate import Accelerator
import datetime
import os
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from sagemaker.remote_function import remote
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
import transformers
import traceback
import mlflow
from mlflow.models import infer_signature

# Start training
@remote(
    keep_alive_period_in_seconds=0,  # Warm pool keep-alive; set to 0 to avoid additional costs
    volume_size=100,
    job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}",
    use_torchrun=True,
)
def train_fn(
    model_name,             # Name or path of the base model to fine-tune
    train_ds,               # Training dataset
    test_ds=None,           # Optional test/validation dataset
    torch_dtype=torch.bfloat16,  # Precision type for training
    lora_r=8,               # LoRA rank - controls capacity of adaptations
    lora_alpha=16,          # LoRA alpha - scales the adaptations
    lora_dropout=0.1,       # Dropout probability for LoRA layers
    per_device_train_batch_size=8,  # Batch size for training
    per_device_eval_batch_size=8,   # Batch size for evaluation
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients
    learning_rate=2e-4,     # Learning rate for training
    num_train_epochs=1,     # Number of training epochs
    fsdp="",                # Fully Sharded Data Parallel configuration
    fsdp_config=None,       # Additional FSDP configurations
    gradient_checkpointing=False,  # Whether to use gradient checkpointing
    merge_weights=False,    # Whether to merge LoRA weights with base model
    seed=42,                # Random seed for reproducibility
    mlflow_uri=None,
    mlflow_experiment_name=None,
    token='<>'              # HuggingFace token for model access
):
    # Initialize distributed training if multiple GPUs are available
    if torch.cuda.is_available() and (torch.cuda.device_count() > 1 or int(os.environ.get("SM_HOST_COUNT", 1)) > 1):
        # Call this function at the beginning of your script
        local_rank = init_distributed()

        # Now you can use distributed functionalities
        torch.distributed.barrier(device_ids=[local_rank])

    # Enable HuggingFace transfer for model downloading
    os.environ.update({"HF_HUB_ENABLE_HF_TRANSFER": "1"})

    set_seed(seed)

    accelerator = Accelerator()

    # Set up HuggingFace token if provided
    if token is not None:
        os.environ.update({"HF_TOKEN": token})
        accelerator.wait_for_everyone()

    # Download model based on training setup (single or multi-node)
    if int(os.environ.get("SM_HOST_COUNT", 1)) == 1:
        if accelerator.is_main_process:
            download_model(model_name)
    else:
        download_model(model_name)

    accelerator.wait_for_everyone()

    model_name = "/tmp/tmp_folder"

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set Tokenizer pad Token
    tokenizer.pad_token = tokenizer.eos_token

    with accelerator.main_process_first():
        # tokenize and chunk dataset
        lm_train_dataset = train_ds.map(
            lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features)
        )

        print(f"Total number of train samples: {len(lm_train_dataset)}")

        if test_ds is not None:
            lm_test_dataset = test_ds.map(
                lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features)
            )

            print(f"Total number of test samples: {len(lm_test_dataset)}")
        else:
            lm_test_dataset = None

    # Configure model settings for bfloat16 precision
    # Setup flash_attention_2 for memory-efficient attention computation
    if torch_dtype == torch.bfloat16:
        print("flash_attention_2 init")

        model_configs = {
            "attn_implementation": "flash_attention_2",
            "torch_dtype": torch_dtype,
        }
    else:
        model_configs = dict()

    # Configure training settings based on FSDP usage
    # Set up trainer configurations for FSDP or standard training
    if fsdp != "" and fsdp_config is not None:
        print("Configurations for FSDP")

        bnb_config_params = {
            "bnb_4bit_quant_storage": torch_dtype
        }

        trainer_configs = {
            "fsdp": fsdp,
            "fsdp_config": fsdp_config,
            "gradient_checkpointing_kwargs": {
                "use_reentrant": False
            }
        }
    else:
        bnb_config_params = dict()
        trainer_configs = {
            "gradient_checkpointing": gradient_checkpointing, # Enable in case of DDP
        }

    # Enable Quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        **bnb_config_params
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=bnb_config,
        use_cache=not gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs
    )

    # Configure gradient checkpointing based on FSDP usage
    if fsdp == "" and fsdp_config is None:
        print("Prepare model for quantization")
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

        if gradient_checkpointing:
            print("gradient_checkpointing enabled")
            model.gradient_checkpointing_enable()
    else:
        if gradient_checkpointing:
            print("gradient_checkpointing enabled")
            model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules="all-linear",
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)

    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_train_dataset,
        eval_dataset=lm_test_dataset if lm_test_dataset is not None else None,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            logging_strategy="steps",
            logging_steps=1,
            log_on_each_node=False,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=(
                True if torch_dtype == torch.bfloat16 else False
            ),  # Enable mixed-precision training
            tf32=False,
            ddp_find_unused_parameters=False,
            save_strategy="no",
            output_dir="outputs",
            **trainer_configs
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    if trainer.accelerator.is_main_process:
        trainer.model.print_trainable_parameters()
        
    if mlflow_uri is not None and mlflow_experiment_name is not None:
        print("MLflow tracking under ", mlflow_experiment_name)
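        # Logs for experiments
        # LoRA target modules as stored in the PEFT config (target_modules="all-linear" above)
        modules = str(model.peft_config["default"].target_modules)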

        mlflow.set_tracking_uri(mlflow_uri)
        mlflow.set_experiment(mlflow_experiment_name)

        with mlflow.start_run(run_name=f"Training") as run:
            lora_params = {
                "lora_alpha": lora_alpha,
                "lora_dropout": lora_dropout,
                "r": lora_r,
                "modules": modules
            }

            mlflow.log_params(lora_params)

            trainer.train()
    else:
        trainer.train()

    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

    if merge_weights:
        output_dir = "/tmp/model"

        # merge adapter weights with base model and save
        # save int 4 model
        trainer.model.save_pretrained(output_dir, safe_serialization=False)

        if accelerator.is_main_process:
            # clear memory
            del model
            del trainer

            torch.cuda.empty_cache()

            # load PEFT model
            model = AutoPeftModelForCausalLM.from_pretrained(
                output_dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                use_cache=True,
                cache_dir="/tmp/.cache",
            )

            # Merge LoRA and base model and save
            model = model.merge_and_unload()
            model.save_pretrained(
                os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
                safe_serialization=True,
                max_shard_size="2GB"
            )
    else:
        trainer.model.save_pretrained(
            os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
            safe_serialization=True
        )

    if accelerator.is_main_process:
        tokenizer.save_pretrained(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
        
        # Model registration in MLflow
        if mlflow_uri is not None and mlflow_experiment_name is not None:
            print("MLflow model registration under ", mlflow_experiment_name)

            params = {
                "top_p": 0.9,
                "temperature": 0.2,
                "max_new_tokens": 2048,
            }
            signature = infer_signature("inputs", "generated_text", params=params)

            mlflow.transformers.log_model(
                transformers_model={"model": model, "tokenizer": tokenizer},
                signature=signature,
                artifact_path="model",  # This is a relative path to save model files within MLflow run
                model_config=params,
                task="text-generation"
            )

    accelerator.wait_for_everyone()

sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.ImageUri
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.PreExecutionCommands
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.InstanceType
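The `lora_r` and `lora_alpha` defaults of `train_fn` can be put in perspective with a little arithmetic. The sketch below assumes standard LoRA scaling (alpha divided by rank) and a hypothetical 4096 x 4096 projection layer; `lora_extra_params` is an illustrative helper, not a PEFT function.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


lora_r, lora_alpha = 8, 16  # defaults of train_fn

# Hypothetical 4096 x 4096 projection layer
d_in = d_out = 4096
full = d_in * d_out
extra = lora_extra_params(d_in, d_out, lora_r)

print(f"Frozen weights in the layer: {full:,}")
print(f"LoRA trainable weights:      {extra:,} ({extra / full:.2%} of the layer)")
# The low-rank update is scaled by alpha / r before being added to the frozen output
print(f"LoRA scaling factor: {lora_alpha / lora_r}")
```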


## Train the model

Run `train_fn` with `merge_weights=True` to merge the trained adapter into the base model. **Provide your Hugging Face access token through the `token` argument; inside the job it is exported as `HF_TOKEN`.**
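The batch-size arguments used in the next cell interact with the number of GPUs. The following sketch works out the effective batch size under the assumption of a single instance with 4 GPUs (for example an ml.g5.12xlarge); the number of updates per epoch is an approximation, since the exact value depends on how the trainer handles the last partial batch.

```python
import math

# Values passed to train_fn in this notebook
per_device_train_batch_size = 12
gradient_accumulation_steps = 2
num_train_samples = 450

# Assumption: 4 GPUs on a single instance (for example ml.g5.12xlarge)
num_gpus = 4

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
approx_updates_per_epoch = math.ceil(num_train_samples / effective_batch_size)

print(f"Effective batch size: {effective_batch_size}")
print(f"Approximate optimizer updates per epoch: {approx_updates_per_epoch}")
```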

In [10]:
train_fn(
    model_id,
    train_ds=train_dataset,
    test_ds=test_dataset,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=24,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    num_train_epochs=2,
    fsdp="full_shard auto_wrap offload",
    fsdp_config={
        'backward_prefetch': 'backward_pre',
        'cpu_ram_efficient_loading': True,
        'offload_params': True,
        'forward_prefetch': False,
        'use_orig_params': False
    },
    merge_weights=True,
    mlflow_uri=os.environ.get("MLFLOW_URI", None),
    mlflow_experiment_name=os.environ.get("MLFLOW_EXPERIMENT_NAME", None)
)

2025-04-06 20:55:17,856 sagemaker.remote_function INFO     Serializing function code to s3://sagemaker-us-east-1-058264176820/train-DeepSeek-R1-Distill-Llama-8B-2025-04-06-20-55-17-856/function
2025-04-06 20:55:17,970 sagemaker.remote_function INFO     Serializing function arguments to s3://sagemaker-us-east-1-058264176820/train-DeepSeek-R1-Distill-Llama-8B-2025-04-06-20-55-17-856/arguments
2025-04-06 20:55:18,305 sagemaker.remote_function INFO     Copied user workspace to '/tmp/tmpji8d_ps3/temp_workspace/sagemaker_remote_function_workspace'
2025-04-06 20:55:18,307 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmpji8d_ps3/temp_workspace/sagemaker_remote_function_workspace/requirements.txt'
2025-04-06 20:55:18,308 sagemaker.remote_function INFO     Generated pre-execution script from commands to '/tmp/tmpji8d_ps3/temp_workspace/sagemaker_remote_function_workspace/pre_exec.sh'
2025-04-06 20:55:18,309 sagemaker.remote_function INFO     Succes

2025-04-06 20:55:22 Starting - Starting the training job
...........20:55:22 Pending - Training job waiting for capacity.
.....04-06 20:57:23 Pending - Preparing the instances for training.
....................Downloading - Downloading the training image.
..INFO: CONDA_PKGS_DIRS is set to '/opt/ml/sagemaker/warmpoolcache/sm_remotefunction_user_dependencies_cache/conda/pkgs'
INFO: PIP_CACHE_DIR is set to '/opt/ml/sagemaker/warmpoolcache/sm_remotefunction_user_dependencies_cache/pip'
INFO: /opt/ml/input/config/resourceconfig.json:
{"current_host":"algo-1","current_instance_type":"ml.g5.12xlarge","current_group_name":"homogeneousCluster","hosts":["algo-1"],"instance_groups":[{"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.12xlarge","hosts":["algo-1"]}],"network_interface_name":"eth0"}
INFO: Bootstraping runtime environment.
2025-04-06 21:02:17,500 sagemaker.remote_function INFO     Arguments:
2025-04-06 21:02:17,500 sagem

***

## Load Fine-Tuned model

In [11]:
import boto3
import sagemaker

In [12]:
sagemaker_session = sagemaker.Session()

In [13]:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}"

In [14]:
def get_last_job_name(job_name_prefix):
    sagemaker_client = boto3.client('sagemaker')

    matching_jobs = []
    next_token = None

    while True:
        # Prepare the search parameters
        search_params = {
            'Resource': 'TrainingJob',
            'SearchExpression': {
                'Filters': [
                    {
                        'Name': 'TrainingJobName',
                        'Operator': 'Contains',
                        'Value': job_name_prefix
                    },
                    {
                        'Name': 'TrainingJobStatus',
                        'Operator': 'Equals',
                        'Value': "Completed"
                    }
                ]
            },
            'SortBy': 'CreationTime',
            'SortOrder': 'Descending',
            'MaxResults': 100
        }

        # Add NextToken if we have one
        if next_token:
            search_params['NextToken'] = next_token

        # Make the search request
        search_response = sagemaker_client.search(**search_params)

        # Filter and add matching jobs
        matching_jobs.extend([
            job['TrainingJob']['TrainingJobName'] 
            for job in search_response['Results']
            if job['TrainingJob']['TrainingJobName'].startswith(job_name_prefix)
        ])

        # Check if we have more results to fetch
        next_token = search_response.get('NextToken')
        if not next_token or matching_jobs:  # Stop if we found at least one match or no more results
            break

    if not matching_jobs:
        raise ValueError(f"No completed training jobs found starting with prefix '{job_name_prefix}'")

    return matching_jobs[0]
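The extra `startswith` check matters because a substring match can also return jobs whose names merely include the prefix somewhere in the middle. The sketch below reproduces that filtering step on a mocked result list, without calling AWS; the job names and the helper `pick_latest_with_prefix` are hypothetical.

```python
def pick_latest_with_prefix(results, job_name_prefix):
    """Return the first job name that starts with the prefix, or None.

    `results` mimics the `Results` list of a search response that is already
    sorted by creation time, newest first.
    """
    for item in results:
        name = item["TrainingJob"]["TrainingJobName"]
        # A substring match can also hit the prefix in the middle of a name,
        # so keep only names that really start with it
        if name.startswith(job_name_prefix):
            return name
    return None


# Hypothetical search results, newest first
fake_results = [
    {"TrainingJob": {"TrainingJobName": "eval-train-my-model-2025-01-03"}},
    {"TrainingJob": {"TrainingJobName": "train-my-model-2025-01-02"}},
    {"TrainingJob": {"TrainingJobName": "train-my-model-2025-01-01"}},
]
print(pick_latest_with_prefix(fake_results, "train-my-model"))
```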

In [15]:
job_name = get_last_job_name(job_prefix)

job_name

'train-DeepSeek-R1-Distill-Llama-8B-2025-04-06-20-55-17-856'

## Deploy the fine-tuned model using Amazon SageMaker AI Endpoints and the Amazon SageMaker Large Model Inference (LMI) Container with the SageMaker Python SDK

In this example you will deploy your model using [SageMaker's Large Model Inference (LMI) Containers](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html).

LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers, you can use high-performance open-source inference libraries such as vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on Amazon SageMaker endpoints. These containers bundle a model server with open-source inference libraries to deliver an all-in-one LLM serving solution.

The LMI container supports a variety of backends, outlined in the table below.

The model for this example can be deployed using the vLLM backend, which corresponds to the `djl-lmi` container image.

| Backend | SageMaker DLC | Example URI |
| --- | --- | --- |
| vLLM | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| lmi-dist | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| hf-accelerate | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| tensorrt-llm | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124 |
| transformers-neuronx | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1 |


In [16]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker import Model

In [17]:
instance_count = 1
instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 700

In [18]:
image_uri = sagemaker.image_uris.retrieve(
    framework="djl-lmi",
    region=sagemaker_session.boto_session.region_name,
    version="latest"
)

image_uri

'763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124'

In [19]:
if default_prefix:
    model_data = f"s3://{bucket_name}/{default_prefix}/{job_name}/{job_name}/output/model.tar.gz"
else:
    model_data = f"s3://{bucket_name}/{job_name}/{job_name}/output/model.tar.gz"

model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'OPTION_TRUST_REMOTE_CODE': 'true',
        'OPTION_ROLLING_BATCH': "vllm",
        'OPTION_DTYPE': 'bf16',
        'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
        'OPTION_MAX_ROLLING_BATCH_SIZE': '1',
        'OPTION_MODEL_LOADING_TIMEOUT': '3600',
        'OPTION_MAX_MODEL_LEN': '4096'
    }
)

sagemaker.config INFO - Applied value from config key = SageMaker.Model.EnableNetworkIsolation
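The environment variables passed to `Model` are one way to configure the container; the same settings can also be written as `option.*` entries in a `serving.properties` file. The sketch below illustrates that naming correspondence under the assumption that `OPTION_<NAME>` maps to `option.<name>` and that `HF_MODEL_ID` maps to `option.model_id`; the helper `env_to_serving_properties` is hypothetical, so check the LMI documentation for the authoritative list.

```python
def env_to_serving_properties(env: dict) -> list:
    """Translate LMI-style environment variables into serving.properties lines."""
    lines = []
    for key, value in env.items():
        if key == "HF_MODEL_ID":
            # Assumption: HF_MODEL_ID is the environment form of option.model_id
            lines.append(f"option.model_id={value}")
        elif key.startswith("OPTION_"):
            lines.append(f"option.{key[len('OPTION_'):].lower()}={value}")
    return lines


env = {
    "HF_MODEL_ID": "/opt/ml/model",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_MAX_MODEL_LEN": "4096",
}
print("\n".join(env_to_serving_properties(env)))
```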


In [20]:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

endpoint_name = f"{model_id.split('/')[-1].replace('.', '-')}-finetuned"

Creating an endpoint

In [21]:
predictor = model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    model_data_download_timeout=3600
)

------------!

## Run Inference

In [22]:
import sagemaker

In [23]:
sagemaker_session = sagemaker.Session()

In [30]:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

endpoint_name = f"{model_id.split('/')[-1].replace('.', '-')}-finetuned"

In [32]:
print(endpoint_name)

# DeepSeek-R1-Distill-Llama-8B-finetuned

DeepSeek-R1-Distill-Llama-8B-finetuned


In [33]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In [26]:
def create_summarization_prompts(data_point):
    full_prompt =f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
                    You are an AI assistant trained to summarize conversations. Provide a concise summary of the dialogue, capturing the key points and overall context.
                    <|eot_id|><|start_header_id|>user<|end_header_id|>
                    Summarize the following conversation:

                    {data_point["dialogue"]}
                    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
                    Here's a concise summary of the conversation in a single sentence:

                    <|eot_id|>"""
    return {"prompt": full_prompt}
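Because the prompt above is written as an indented triple-quoted f-string, the indentation of the source code ends up inside the prompt as runs of spaces. One possible alternative, sketched here with a hypothetical helper `build_summarization_prompt`, assembles the same pieces line by line so that no stray whitespace is sent to the model.

```python
def build_summarization_prompt(dialogue: str) -> str:
    """Assemble the same prompt pieces line by line, without source-code indentation."""
    parts = [
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>",
        "You are an AI assistant trained to summarize conversations. "
        "Provide a concise summary of the dialogue, capturing the key points and overall context.",
        "<|eot_id|><|start_header_id|>user<|end_header_id|>",
        "Summarize the following conversation:",
        "",
        dialogue,
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
        "Here's a concise summary of the conversation in a single sentence:",
        "",
        "<|eot_id|>",
    ]
    return "\n".join(parts)


# Hypothetical two-line dialogue
prompt = build_summarization_prompt("Ann: Lunch at noon?\nBob: Sure.")
print(prompt)
```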

Pick a random prompt

In [27]:
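from datasets import load_dataset
from pprint import pprint

# HF dataset that we will be working with
dataset_name = "Samsung/samsum"

# Load dataset from the hub; trust_remote_code=True avoids the interactive confirmation prompt
dataset = load_dataset(dataset_name, split="test", trust_remote_code=True)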

random_row = dataset.shuffle().select(range(1))[0]

random_prompt=create_summarization_prompts(random_row)
pprint(random_prompt)


{'prompt': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
           '                    You are an AI assistant trained to summarize '
           'conversations. Provide a concise summary of the dialogue, '
           'capturing the key points and overall context.\n'
           '                    '
           '<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
           '                    Summarize the following conversation:\n'
           '\n'
           '                    Gina: Hey love, do you have a free usb by any '
           'chance?\r\n'
           'Monica: Yes, I do :)\r\n'
           'Gina: Can I come up to your office?\r\n'
           "Monica: Of course, usb's ready\r\n"
           'Monica: 2nd floor, room 112\r\n'
           'Gina: Thanks!\n'
           '                    '
           '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
           "                    Here's a concise summary of the conversation "
           'in a singl

In [34]:
response = predictor.predict(
    {
        "inputs": random_prompt['prompt'],
        "parameters": {
            "do_sample":True,
            "max_new_tokens":200,
            "top_p":0.95,
            "top_k":50,
            "temperature":0.7,
            "stop": ['<|eot_id|>', '<|end_of_text|>']
        },
    }
)

response['generated_text']

'<think>\nOkay, I\'m trying to summarize a conversation between Gina and Monica. Gina asks Monica if she has a free USB. Monica says yes and mentions she\'s on the second floor, room 112. I need to capture the key points: the request for a USB and the location. I should keep it concise and clear. Let me see, "Gina asks Monica for a USB and is directed to Monica\'s office on the second floor, room 112." That covers both requests and the location. It\'s a single sentence and straightforward. I think that works.\n</think>\n\nGina asks Monica for a USB and is directed to Monica\'s office on the second floor, room 112.'
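The response contains the model's reasoning inside a `<think>...</think>` block followed by the final summary. If only the summary is needed, a small post-processing step can separate the two. The helper `split_reasoning` below is a hypothetical sketch, and the sample string is a shortened stand-in for a real response.

```python
import re


def split_reasoning(generated_text: str):
    """Split a response into (reasoning, answer) around a <think>...</think> block."""
    match = re.search(r"<think>(.*?)</think>", generated_text, flags=re.DOTALL)
    if match is None:
        return "", generated_text.strip()
    reasoning = match.group(1).strip()
    answer = generated_text[match.end():].strip()
    return reasoning, answer


# Shortened, hypothetical response in the same shape as the output above
sample = "<think>\nGina needs a USB; Monica has one.\n</think>\n\nGina asks Monica for a USB."
reasoning, answer = split_reasoning(sample)
print(answer)
```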

#### Delete Endpoint

In [None]:
# predictor.delete_model()
# predictor.delete_endpoint(delete_endpoint_config=True)