# 🚀 Deploy Your Fine-tuned RAFT Model with Amazon SageMaker

## Install Dependencies

In [17]:
%pip install -Uq sagemaker datasets evaluate rouge-score


In [2]:
import json
import sagemaker
import boto3
from typing import List, Dict
from datetime import datetime
from sagemaker.huggingface import (
    HuggingFaceModel, 
    get_huggingface_llm_image_uri
)
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [3]:
boto_region = boto3.Session().region_name
session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
role = sagemaker.get_execution_role()

## Deploy using DJL-Inference Container

The [Deep Java Library (DJL) Large Model Inference (LMI)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) containers are specialized Docker containers designed to facilitate the deployment of large language models (LLMs) on Amazon SageMaker. These containers integrate a model server with optimized inference libraries, providing a comprehensive solution for serving LLMs. 

**Key Features of DJL LMI Containers:**

* __Optimized Inference Performance__: Support for popular model architectures such as DeepSeek, Mistral, Llama, Falcon, and many more.
* __Integration with Inference Libraries__: Seamless integration with libraries such as vLLM, TensorRT-LLM, and Transformers NeuronX.
* __Advanced Capabilities__: Features like continuous batching, token streaming, quantization (e.g., AWQ, GPTQ, FP8), multi-GPU inference using tensor parallelism, and support for LoRA fine-tuned models.

**Benefits for Deploying LLMs with DJL-LMI on Amazon SageMaker:**

* __Simplified Deployment__: DJL LMI containers offer a low-code interface, allowing users to specify configurations like model parallelization and optimization settings through a configuration file. 
* __Performance Optimization__: By leveraging optimized inference libraries and techniques, these containers enhance inference performance, reducing latency and improving throughput.
* __Scalability__: Designed to handle large models that may not fit on a single accelerator, enabling efficient scaling across multiple GPUs or specialized hardware like AWS Inferentia.
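
To make the scalability point concrete, the sketch below estimates how much accelerator memory the model weights alone require. The parameter counts and the 24 GiB per-GPU figure are illustrative assumptions, and the estimate ignores the KV cache and activation memory, so treat the result as a lower bound rather than a sizing guarantee.

```python
import math


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights only, in GiB (fp16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30


def min_gpus_for_weights(num_params: float, gpu_memory_gib: float, bytes_per_param: int = 2) -> int:
    """Smallest number of equally sized GPUs whose combined memory holds the weights."""
    needed = weight_memory_gib(num_params, bytes_per_param)
    return max(1, math.ceil(needed / gpu_memory_gib))


# Assumed figures: ~8e9 parameters for an 8B model, 24 GiB per GPU.
print(f"8B model in fp16: ~{weight_memory_gib(8e9):.1f} GiB of weights")
print(f"GPUs needed (weights only): {min_gpus_for_weights(8e9, 24)}")
```

When the weights-only estimate already approaches the memory of one accelerator, tensor parallelism across several GPUs becomes the practical option.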

In [4]:
lmi_container_uri = f"763104351884.dkr.ecr.{boto_region}.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

Choose an appropriate model name and endpoint name when hosting your model.

In [72]:
model_timestamp = datetime.now().strftime('%y%m%d-%H%M%S')

base_model_name = f"llama-3-1-8B-base-lmi-{model_timestamp}"
tuned_model_name = f"llama-3-1-8B-tuned-lmi-{model_timestamp}"

base_endpoint_name = f"{base_model_name}-ep"
tuned_endpoint_name = f"{tuned_model_name}-ep"

print(f"base: \n{base_endpoint_name}")
print(f"tuned: \n{tuned_endpoint_name}")

base: 
llama-3-1-8B-base-lmi-250330-213014-ep
tuned: 
llama-3-1-8B-tuned-lmi-250330-213014-ep
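
Generated names are worth checking before deployment. The helper below is a minimal sketch, assuming the documented SageMaker endpoint naming constraint (at most 63 characters, alphanumeric characters and hyphens, starting and ending with an alphanumeric character). `is_valid_endpoint_name` is a hypothetical helper and is not part of the SageMaker SDK.

```python
import re

# Assumed constraint: alphanumerics separated by hyphens, no leading or trailing hyphen.
_ENDPOINT_NAME_RE = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
_MAX_ENDPOINT_NAME_LEN = 63


def is_valid_endpoint_name(name: str) -> bool:
    """Return True if the name satisfies the assumed length and character rules."""
    return (
        len(name) <= _MAX_ENDPOINT_NAME_LEN
        and _ENDPOINT_NAME_RE.fullmatch(name) is not None
    )


print(is_valid_endpoint_name("llama-3-1-8B-base-lmi-250330-213014-ep"))
print(is_valid_endpoint_name("llama_3_1_8B_base"))  # underscores are not allowed
```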


Create a new [SageMaker Model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) for each set of model artifacts.

> ⚠ Point `HF_MODEL_ID` at a different model tag if you want to compare against a different base model.
>
> Gated models require a Hugging Face API token, supplied through the `HF_TOKEN` environment variable (`"HF_TOKEN": "hf_..."`) in the model's `env` dictionary.

In [73]:
# Set these to either S3 paths or Hugging Face model tags

BASE_MODEL_ARTIFACTS =  "<<PATH_TO_YOUR_BASE_MODEL>>"
TUNED_MODEL_ARTIFACTS = "<<PATH_TO_YOUR_TUNED_MERGED_MODEL>>"

In [63]:
base_model = sagemaker.Model(
    image_uri=lmi_container_uri,
    env={
        "HF_MODEL_ID": BASE_MODEL_ARTIFACTS,
        "OPTION_MAX_MODEL_LEN": "5000",
        "OPTION_GPU_MEMORY_UTILIZATION": "0.95",
        "OPTION_ENABLE_STREAMING": "false",
        "OPTION_ROLLING_BATCH": "auto",
        "OPTION_MODEL_LOADING_TIMEOUT": "3600",
        "OPTION_PAGED_ATTENTION": "false",
        "OPTION_DTYPE": "fp16",
    },
    role=role,
    name=base_model_name,
    sagemaker_session=session
)

In [74]:
tuned_model = sagemaker.Model(
    image_uri=lmi_container_uri,
    env={
        "HF_MODEL_ID": TUNED_MODEL_ARTIFACTS,
        "OPTION_MAX_MODEL_LEN": "5000",
        "OPTION_GPU_MEMORY_UTILIZATION": "0.95",
        "OPTION_ENABLE_STREAMING": "false",
        "OPTION_ROLLING_BATCH": "auto",
        "OPTION_MODEL_LOADING_TIMEOUT": "3600",
        "OPTION_PAGED_ATTENTION": "false",
        "OPTION_DTYPE": "fp16",
    },
    role=role,
    name=tuned_model_name,
    sagemaker_session=session
)
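
Both model definitions use the same container settings, so the environment can be built in one place. The following is a sketch only: `build_lmi_env` is a hypothetical helper, the option values mirror the cells above, and passing `HF_TOKEN` follows the note about gated models. All values are strings because they are passed to the container as environment variables.

```python
from typing import Dict, Optional


def build_lmi_env(model_id: str, hf_token: Optional[str] = None, max_model_len: int = 5000) -> Dict[str, str]:
    """Build the container environment shared by the base and tuned models."""
    env = {
        "HF_MODEL_ID": model_id,
        "OPTION_MAX_MODEL_LEN": str(max_model_len),
        "OPTION_GPU_MEMORY_UTILIZATION": "0.95",
        "OPTION_ENABLE_STREAMING": "false",
        "OPTION_ROLLING_BATCH": "auto",
        "OPTION_MODEL_LOADING_TIMEOUT": "3600",
        "OPTION_PAGED_ATTENTION": "false",
        "OPTION_DTYPE": "fp16",
    }
    # Only add the token when one is supplied (needed for gated models).
    if hf_token:
        env["HF_TOKEN"] = hf_token
    return env


# The model identifier here is a placeholder, not a real path.
print(build_lmi_env("<<PATH_TO_YOUR_BASE_MODEL>>"))
```

The resulting dictionary could then be passed as `env=build_lmi_env(BASE_MODEL_ARTIFACTS)` when constructing each model, which keeps the two configurations from drifting apart.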

🚀 Deploy. Please wait for the endpoint to be `InService` before running inference against it!

In [64]:
base_model_predictor = base_model.deploy(
    endpoint_name=base_endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    wait=True
)
print(f"\nYour BASE Endpoint: {base_endpoint_name} is now deployed! 🚀")

-------------!
Your BASE Endpoint: llama-3-1-8B-sft-lmi-250329-032857-ep is now deployed! 🚀


In [75]:
tuned_model_predictor = tuned_model.deploy(
    endpoint_name=tuned_endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    wait=True
)
print(f"\nYour TUNED Endpoint: {tuned_endpoint_name} is now deployed! 🚀")

-------------!
Your TUNED Endpoint: llama-3-1-8B-tuned-lmi-250330-213014-ep is now deployed! 🚀
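
With `wait=True`, `deploy` blocks until the endpoint is ready. When reconnecting to an endpoint later, or after deploying with `wait=False`, its status can be polled instead. The sketch below assumes a boto3 SageMaker client (`boto3.client("sagemaker")`) whose `describe_endpoint` response contains an `EndpointStatus` field; `wait_until_in_service` is a hypothetical helper, and boto3's built-in `endpoint_in_service` waiter is an alternative.

```python
import time
from typing import Callable


def wait_until_in_service(
    sm_client,
    endpoint_name: str,
    poll_seconds: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the endpoint status until it is InService, fails, or the polls run out."""
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# wait_until_in_service(boto3.client("sagemaker"), tuned_endpoint_name)
```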


### Inference with SageMaker SDK

The SageMaker Python SDK simplifies inference through the `sagemaker.Predictor` class.

Llama 3 utilizes the following prompt format:


```text
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are an assistant for question-answering tasks. Answer the following question in 5 sentences using the provided context. If you don't know the answer, just say "I don't know.".

<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Context: {CONTEXT}

Question: {QUESTION} 
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
```

In [76]:
def format_messages(messages: list[dict[str, str]]) -> str:
    """
    Format messages for Llama 3+ chat models and return the prompt as a single string.
    
    The model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and 
    alternating (u/a/u/a/u...). The last message must be from 'user'.
    """
    output = "<|begin_of_text|>"
    # Adding an inferred prefix (knowledge cutoff and today's date) to the system message
    system_prefix = f"\n\nCutting Knowledge Date: December 2024\nToday Date: {datetime.now().strftime('%d %b %Y')}\n\n"
    for entry in messages:
        output += f"<|start_header_id|>{entry['role']}<|end_header_id|>"
        if entry['role'] == 'system':
            output += f"{system_prefix}{entry['content']}<|eot_id|>"
        elif 'content' in entry:
            output += f"\n\n{entry['content']}<|eot_id|>"
    # Open the assistant turn so the model generates the reply
    output += "<|start_header_id|>assistant<|end_header_id|>\n"
    return output


def send_prompt(predictor, messages, parameters):
    # convert u/a format 
    frmt_input = format_messages(messages)
    payload = {
        "inputs": frmt_input,
        "parameters": parameters
    }
    response = predictor.predict(payload)
    return response
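
The docstring of `format_messages` states the expected role order but the function does not enforce it. A minimal sketch of such a check is shown below; `validate_messages` is a hypothetical helper that could be called before formatting, and it encodes only the rules stated in the docstring.

```python
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError unless roles follow system, user, assistant, user, ... ending on user."""
    if not messages:
        raise ValueError("messages must not be empty")
    roles = [message["role"] for message in messages]
    unknown = set(roles) - ALLOWED_ROLES
    if unknown:
        raise ValueError(f"unsupported roles: {sorted(unknown)}")
    if roles[0] != "system":
        raise ValueError("the first message must have the 'system' role")
    expected = "user"
    for role in roles[1:]:
        if role != expected:
            raise ValueError(f"expected role '{expected}' but found '{role}'")
        expected = "assistant" if expected == "user" else "user"
    if roles[-1] != "user":
        raise ValueError("the last message must have the 'user' role")


validate_messages([
    {"role": "system", "content": "You are an assistant."},
    {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
])
```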

In [77]:
def build_messages(data):
    system_content = """You are an assistant for question-answering tasks. Answer the following question in 5 sentences using the provided context. If you don't know the answer, just say "I don't know."."""
    user_content = f"""
        Context: {data["context"]} 
        
        Question: {data["question"]}
        """

    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ]
    
    return messages

Chat transcripts are kept in a simple `List[Dict[str, str]]` format, where each entry carries a `role` (`system`, `user` or `assistant`) and its `content`.

In [78]:
from datasets import Dataset, load_dataset

test_dataset = load_dataset("json", data_files="datasets/raft/test/test.json", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [79]:
test_item = test_dataset[0]
test_item

{'synthetic_answer': " Yes, UCP2 deficiency helps to restrict the pathogenesis of experimental cutaneous and visceral leishmaniosis in mice. This is evident from the significantly lower parasite loads found in infected UCP2KO mice compared to infected wild-type mice. Additionally, UCP2KO mice produced higher levels of certain cytokines, such as IFN-γ, IL-17, and IL-13, which suggests an enhanced immune response against Leishmania infection. The results indicate that UCP2 plays a role in facilitating the infection by suppressing the host's immune response. Overall, the study suggests that UCP2 deficiency could be beneficial in the context of Leishmania infection.",
 'distracted': True,
 'original_answer': 'In this way, UCP2KO mice were better able than their WT counterparts to overcome L. major and L. infantum infections. These findings suggest that upregulating host ROS levels, perhaps by inhibiting UPC2, may be an effective approach to preventing leishmaniosis.',
 'question': 'Does uC

Reload the predictors from their endpoint names, in case you are working with existing endpoints.

In [80]:
base_predictor = sagemaker.Predictor(
    endpoint_name="<<YOUR_BASE_ENDPOINT_NAME>>", #base_endpoint_name,
    sagemaker_session=session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

tuned_predictor = sagemaker.Predictor(
    endpoint_name="<<YOUR_TUNED_ENDPOINT_NAME>>", #tuned_endpoint_name,
    sagemaker_session=session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

In [81]:
%%time

messages = build_messages(test_item)

base_response = send_prompt(
    base_predictor,
    messages,
    parameters={
        "temperature": 0.9, 
        "max_new_tokens": 512,
        "top_p": 0.9
    }
)

tuned_response = send_prompt(
    tuned_predictor,
    messages,
    parameters={
        "temperature": 0.9, 
        "max_new_tokens": 512,
        "top_p": 0.9
    }
)

print(f"""
    ============== Question ============
    {test_item["question"]}

    ========= Baseline Answer ==========
    {base_response['generated_text']}
    
    ========= Generated Answer =========
    {tuned_response['generated_text']}

    ======== Ground Truth Answer =======
    {test_item["synthetic_answer"]}
    """
)


    ============== Question ============
    Does uCP2 deficiency help to restrict the pathogenesis of experimental cutaneous and visceral leishmaniosis in mice?

    ========= Baseline Answer ==========
    I don't know.
    
    ========= Generated Answer =========
            Answer: Yes, the study suggests that UCP2 deficiency helps to restrict the pathogenesis of experimental cutaneous and visceral leishmaniasis in mice. This is evident from the reduced parasite load in the spleen and liver of UCP2-deficient mice compared to wild-type mice. The absence of UCP2 also resulted in a lower parasite load in the footpads of mice infected with L. major. Additionally, UCP2-deficient mice showed a lower parasite load in the liver and spleen after infection with L. donovani. Overall, the results indicate that UCP2 deficiency may have a protective effect against leishmaniasis.

    ======== Ground Truth Answer =======
     Yes, UCP2 deficiency helps to restrict the pathogenesis of experimental cutaneous and visceral leishmaniosis in mice. This is evident from the significantly lower parasite loads found in infected UCP2KO mice compared to infected wild

## Evaluate your results

In this section, you will build a dataset of evaluation data. The `MAX_EVALUATIONS` value will limit the scope of the evaluation and the time it takes to complete.

Because the test split was carved out of the training dataset, it contains pure distractor examples (flagged by the `distracted` field). These are skipped during the evaluation.

In [82]:
from datasets import Dataset

ground_truth = []
base_predictions = []
sft_predictions = []

# Set to -1 to run the entire dataset. WARNING: THIS WILL TAKE A LONG TIME
MAX_EVALUATIONS = -1

if MAX_EVALUATIONS > -1:
    # Never request more rows than the dataset holds
    MAX_EVALUATIONS = min(MAX_EVALUATIONS, len(test_dataset))
    print(f"MAX_EVALUATIONS set, reducing input to {MAX_EVALUATIONS} items.")
else:
    MAX_EVALUATIONS = len(test_dataset)

# Both endpoints receive the same generation parameters so the comparison is like for like
generation_parameters = {
    "temperature": 0.9,
    "max_new_tokens": 512,
    "top_p": 0.9
}

for idx, test_item in enumerate(test_dataset.select(range(MAX_EVALUATIONS))):

    if test_item["distracted"]:
        continue  # skip distractor docs

    messages = build_messages(test_item)

    base_response = send_prompt(base_predictor, messages, parameters=generation_parameters)
    sft_response = send_prompt(tuned_predictor, messages, parameters=generation_parameters)

    # Collect the reference answer and both candidate predictions
    ground_truth.append(test_item["synthetic_answer"])
    base_predictions.append(base_response['generated_text'])
    sft_predictions.append(sft_response['generated_text'])

    print(f"{idx} of {MAX_EVALUATIONS}", end="\r")


# Persist the results so the metrics can be recomputed without calling the endpoints again
evaluation_dataset = Dataset.from_dict(
    {
        "ground_truth": ground_truth,
        "base": base_predictions,
        "tuned": sft_predictions
    }
)
evaluation_dataset.to_json("./eval.json", orient="records")



3997 of 4000

Creating json from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

5988578

In [83]:
import evaluate

# Load the BLEU and ROUGE metrics from the Hugging Face evaluate library
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

ground_truth = []
base_predictions = []
sft_predictions = []

for eval_item in evaluation_dataset:

    ground_truth.append(eval_item["ground_truth"])
    base_predictions.append(eval_item['base'])
    sft_predictions.append(eval_item['tuned'])


# Compute the BLEU and ROUGE scores for the base model
base_bleu_results = bleu.compute(predictions=base_predictions, references=ground_truth)
base_rouge_results = rouge.compute(predictions=base_predictions, references=ground_truth)

# Compute the BLEU and ROUGE scores for the tuned model
sft_bleu_results = bleu.compute(predictions=sft_predictions, references=ground_truth)
sft_rouge_results = rouge.compute(predictions=sft_predictions, references=ground_truth)

# Merge both metric dictionaries per model (dict union requires Python 3.9+)
base_scores = (base_bleu_results | base_rouge_results)
sft_scores = (sft_bleu_results | sft_rouge_results)

print("=======BASE MODEL=======")
print(base_scores)
print("=======TUNED MODEL=======")
print(sft_scores)

=======BASE MODEL=======
{'bleu': 0.4114756154832229, 'precisions': [0.6616061544758556, 0.45710818427201044, 0.3469485627025168, 0.27320760722527254], 'brevity_penalty': 1.0, 'length_ratio': 1.0314523944310647, 'translation_length': 322497, 'reference_length': 312663, 'rouge1': 0.6611250630158876, 'rouge2': 0.4589607151287476, 'rougeL': 0.47945904131584977, 'rougeLsum': 0.47942825228714764}
=======TUNED MODEL=======
{'bleu': 0.47653356485280896, 'precisions': [0.6957336146068233, 0.516426514730066, 0.41553424995375793, 0.34539467547247255], 'brevity_penalty': 1.0, 'length_ratio': 1.055219197666497, 'translation_length': 329928, 'reference_length': 312663, 'rouge1': 0.711687864095034, 'rouge2': 0.5278652867147173, 'rougeL': 0.5702438779020467, 'rougeLsum': 0.5706031751897982}
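
To build intuition for what these scores measure, the sketch below computes a simplified unigram-overlap F1, which is the idea behind ROUGE-1. It is an illustration only: it assumes lowercased whitespace tokenization and will not reproduce the library's numbers exactly, because the actual metric implementations apply their own tokenization and aggregation.

```python
from collections import Counter


def unigram_overlap_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 using lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(unigram_overlap_f1("the cat", "the cat sat on"))
```

A prediction that copies half of the reference words and adds nothing else gets perfect precision but only 50% recall, which is why the F1 lands between the two.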


In [84]:
import pandas as pd
data = {'dimension':[], 'base': [], 'tuned': [], 'delta': [], 'delta_percent': []}

for key in base_scores.keys():
    if key in ["length_ratio","precisions","brevity_penalty","translation_length","reference_length"]:
        continue
        
    delta = sft_scores[key]-base_scores[key]
    delta_percent = (delta/base_scores[key])*100
    
    data['dimension'].append(key)
    data['base'].append(base_scores[key])
    data['tuned'].append(sft_scores[key])
    data['delta'].append(delta)
    data['delta_percent'].append(delta_percent)
    
df = pd.DataFrame(data)

df

|   | dimension | base | tuned | delta | delta_percent |
| --- | --- | --- | --- | --- | --- |
| 0 | bleu | 0.411476 | 0.476534 | 0.065058 | 15.810888 |
| 1 | rouge1 | 0.661125 | 0.711688 | 0.050563 | 7.647993 |
| 2 | rouge2 | 0.458961 | 0.527865 | 0.068905 | 15.013174 |
| 3 | rougeL | 0.479459 | 0.570244 | 0.090785 | 18.934847 |
| 4 | rougeLsum | 0.479428 | 0.570603 | 0.091175 | 19.017428 |
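
The `delta_percent` column is the relative change of the tuned score over the base score. As a worked example, the sketch below recomputes the BLEU row from the scores printed earlier; `relative_change_percent` is a hypothetical helper that mirrors the arithmetic in the cell above.

```python
def relative_change_percent(base: float, tuned: float) -> float:
    """Relative improvement of tuned over base, expressed in percent."""
    if base == 0:
        raise ValueError("base score must be non-zero")
    return (tuned - base) / base * 100


# BLEU scores reported for the base and tuned models.
base_bleu = 0.4114756154832229
tuned_bleu = 0.47653356485280896
print(f"BLEU relative change: {relative_change_percent(base_bleu, tuned_bleu):.2f}%")
```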
