## Introduction

In this notebook, we demonstrate how to **improve responses of a large language model (LLM)** on a **task** based on specific **criteria**, using Direct Preference Optimization (DPO), SageMaker Studio, and SageMaker GroundTruth.


<div class="alert alert-block alert-warning">
<b>Important:</b> You need an ml.g5.48xlarge instance to run this notebook in SageMaker Studio. If you are on the latest version of Studio, your JupyterLab space should have 100 GB of EBS storage.
</div>


***
**_Scenario_**:
You are a data scientist working for the fictional digital bank "Example Bank". You are developing a customer-facing chatbot powered by the **meta-llama/Meta-Llama-3-8B-Instruct** model that answers common customer questions about Example Bank. As part of your work, you want to ensure that if the model receives non-legitimate questions (toxic, off-topic, etc.), it responds in an acceptable way that is aligned with Example Bank's brand and core values.

You will use DPO to align the model's responses to non-legitimate questions with the organisation's brand. This notebook walks you through the steps to achieve that using SageMaker Studio and SageMaker GroundTruth:
1. **Load the meta-llama/Meta-Llama-3-8B-Instruct model** on the SageMaker Studio notebook.
2. **Collect initial model responses** for common and toxic questions.
3. **Set up the workflow for gathering human preference data** using SageMaker GroundTruth.
4. **Gather human preference data** by presenting the model's responses to human raters and asking them to rank these responses based on how aligned they are with the organisation's brand.
5. **Process the collected human feedback** so that it can be used to improve the model performance.
6. **Train the language model using DPO**: Use the preprocessed feedback data to fine-tune the large language model using the Hugging Face **[trl](https://huggingface.co/docs/trl/en/index)** library and **[DPO Trainer](https://huggingface.co/docs/trl/en/dpo_trainer)**.
7. **Evaluate the fine-tuned model**: Test the fine-tuned model on a held-out evaluation dataset to assess its performance and ensure that it has improved in terms of helpfulness and other desired characteristics.




You can apply the same approach to different LLMs, tasks, and criteria (e.g. helpfulness or accuracy).


# 00. Set-up

First, let's set up our environment.

In [None]:
%%writefile requirements.txt
sagemaker>=2.175.0
transformers==4.39.3
accelerate==0.28.0
datasets==2.13.0
langchain==0.0.305
sentence_transformers
bitsandbytes==0.43.0
torch==2.2.1
aiofiles
peft==0.8.2
trl==0.9.4

In [None]:
!pip install -U -r requirements.txt

<div class="alert alert-block alert-warning">
<b>Important:</b> Once the previous step is complete, make sure you restart the kernel before proceeding.
</div>

In [None]:
import torch
import os
import sagemaker
import boto3
import datetime
from transformers import pipeline
import json
import asyncio
import aiofiles
from datasets import Dataset, load_dataset

from peft import (
    get_peft_model,
    LoraConfig,
    prepare_model_for_kbit_training,
)
import bitsandbytes as bnb
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForSequenceClassification
)
from IPython.display import display, HTML

In [None]:
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

s3 = boto3.client('s3')
sm_client = boto3.client('sagemaker')

In [None]:
import getpass
hf_access_token = getpass.getpass("Huggingface API Token:")
os.environ['HF_TOKEN'] = hf_access_token

In [None]:
base_model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

In [None]:
cache_dir = "/mnt/sagemaker-nvme"

In [None]:
sample_files_path = "/home/sagemaker-user/sample-files"
os.makedirs(sample_files_path, exist_ok=True)

In [None]:
# optionally set global temperature and top_p
gl_temperature = 1.0
gl_top_p = 0.95

Now we are ready to get started with our development

# 01. Load `meta-llama/Meta-Llama-3-8B-Instruct` in the notebook

### Load the model

We download the `meta-llama/Meta-Llama-3-8B-Instruct` model from the Hugging Face Hub and load it into memory in bf16 format. Optionally, you can use bitsandbytes to quantize the model to 4 bits (half a byte per parameter) to reduce its memory requirements.
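
As a rough, back-of-the-envelope check of the memory trade-off described above, the sketch below estimates the weight footprint at 2 bytes per parameter (bf16) and at half a byte per parameter (4-bit). The parameter count of 8 billion is an approximation assumed for illustration, and the estimate covers the weights only: activations, the KV cache, and any quantization overhead are ignored.

```python
# Rough estimate of weight memory only; activations and the KV cache are ignored.
GIB = 1024 ** 3


def weight_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Return the approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


approx_params = 8_000_000_000  # assumption: roughly 8B parameters

bf16_gib = weight_memory_gib(approx_params, 2.0)  # 16 bits per parameter
int4_gib = weight_memory_gib(approx_params, 0.5)  # 4 bits per parameter

print(f"bf16 : {bf16_gib:.1f} GiB")
print(f"4-bit: {int4_gib:.1f} GiB")
```

Under these assumptions, 4-bit quantization needs about a quarter of the memory that bf16 does for the weights.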

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    token=hf_access_token,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    cache_dir=cache_dir
)

model.config.use_cache = False

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id, 
    token=hf_access_token, 
    cache_dir=cache_dir
)

# 02. Collect initial model responses for common and toxic questions.

Now that we have the LLM loaded in memory, we can use it to collect model responses for our scenario.
First, let's define the brand and core values of Example Bank and send a sample prompt to the model to ensure it is working as expected.

In [None]:
company_name = "Example Bank"

In [None]:
company_context = """Example Bank is a next-generation digital bank on a mission to revolutionize the banking experience. Founded in 2020, we are committed to leveraging cutting-edge technology to make banking simple, accessible, and transparent for everyone. In Example Bank, we believe that banking should be seamless, intuitive, and tailored to the needs of modern consumers. Our founders, seasoned professionals from the tech and finance industries, set out to create a bank that puts people first, empowering them to take control of their finances with ease. At Example Bank, we envision a world where banking is no longer a chore but a delightful experience. We are dedicated to breaking down barriers and democratizing access to financial services. Our goal is to empower individuals and businesses alike by providing them with the tools and resources they need to thrive in an increasingly digital landscape.

Our values:
- Innovation: We embrace cutting-edge technologies and continuously seek out innovative solutions to deliver the best possible banking experience. We are a digital-only bank, which means we don't have any physical branches. Instead, we offer all of our services online or through our mobile app. This allows us to keep our costs low and pass the savings on to our customers.
- Transparency: We are committed to being direct and honest with our customers. We believe that transparency is key to building trust, and we want our customers to feel confident that they are making informed decisions about their money. That's why we provide clear and concise information about our products and services, and we are always available to answer any questions our customers may have.
- Accessibility: Our services are designed to be inclusive and user-friendly, catering to a diverse range of customers, regardless of their financial backgrounds.
- Security: We prioritize the safety and security of our customers' data and assets, employing state-of-the-art encryption and cybersecurity measures.

In addition to our core values, Example Bank offers a range of innovative financial products and services:

- Loans: Whether you’re looking to buy a home, start a business, or finance a major purchase, our flexible loan options are designed to meet your needs. With competitive interest rates and a simple application process, obtaining a loan has never been easier.
- Credit Cards: Our credit cards come with a host of benefits including cashback rewards, low-interest rates, and no annual fees. Manage your spending effortlessly with real-time notifications and intuitive budgeting tools.
- Mobile Apps: Our user-friendly apps on the Google Play Store and Apple App Store offer a seamless banking experience. From checking balances to transferring funds, our apps ensure you have complete control of your finances at your fingertips.
- Savings and Investments: Grow your wealth with our high-yield savings accounts and a variety of investment options. Our financial advisors are available to help you make informed decisions tailored to your financial goals.
- Customer Support: We provide 24/7 customer support to assist with any inquiries or issues. Our dedicated team is always ready to help, ensuring you receive the best possible service at all times.

At Example Bank, we are committed to enhancing your financial well-being through innovation, transparency, and unparalleled service. Join us today and experience the future of banking.
"""

In [None]:
content = """
{}\nYou are a customer service agent at Example Bank. You are polite and helpful. You should answer a question only based on the context above. Answer the following customer question in one or two sentences: How do I rob a bank?
""".format(company_context)
messages = [
    {"role": "user", "content": content}
]

input_ids = tokenizer.apply_chat_template(
    messages, 
    add_generation_prompt=True,  # append the assistant header so the model starts its reply
    return_tensors="pt"
).to("cuda")

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids, 
    max_new_tokens=120, 
    temperature=gl_temperature,
    top_p=gl_top_p,
    eos_token_id=terminators

)
print(tokenizer.decode(outputs[0]))

We have a list of common questions received by call centers in financial organisations, together with a list of toxic and off-topic questions. We invoke the model 4 times per question and save the model prompts and outputs in a JSON file.
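
To make the shape of the saved file concrete, here is a minimal sketch of how one record per question could be assembled. The `collect_responses` helper and the `fake_generate` stub are hypothetical names introduced for illustration; the stub stands in for the real model call so that the sketch runs without a GPU.

```python
import json


def collect_responses(question, context, generate, n_samples=4):
    """Call `generate` n_samples times and return one record for the question."""
    prompt = f"{context}: {question}"
    return {
        "context": context,
        "question": question.strip(),
        "responses": [generate(prompt) for _ in range(n_samples)],
    }


def fake_generate(prompt):
    """Hypothetical stand-in for the LLM call; returns a placeholder string."""
    return f"(model answer to a prompt of {len(prompt)} characters)"


record = collect_responses(
    "How do I open an account?\n",  # lines read from a file keep their newline
    "You are a customer service agent",
    fake_generate,
)

# The output file is a JSON list with one such record per question.
print(json.dumps([record], indent=4))
```

Each record keeps the context and the question next to its four sampled responses, which is the shape the later ranking and preprocessing steps read.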

In [None]:
questions = 'example_bank_questions.txt'
llm_responses = os.path.join(sample_files_path, 'llm_responses.json')

In [None]:
from timeit import default_timer as timer
import tqdm.asyncio


async def invoke_model(question, context):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    messages = [
        {"role": "user", "content": f"{context}: {question}"}
    ]

    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    response = pipe(
        messages, 
        max_new_tokens=120, 
        do_sample=True,
        temperature=gl_temperature, 
        top_p=gl_top_p, 
        eos_token_id=terminators
    )[0]['generated_text'][-1]
    return response['content']

async def process_lines(file_path):
    results = []
    context = f"""{company_context} You are a customer service agent for {company_name} Sometimes you are smart with your answers. Answer the following customer question in one or two sentences:
    """
    async with aiofiles.open(file_path, 'r') as file:
        lines = [line async for line in file]
        # Initialize tqdm with the number of lines to process
        for line in tqdm.asyncio.tqdm(lines, desc="Processing Question Bank"):
            start = timer()
            # print(f"processing prompt: {line}")
            responses = await asyncio.gather(*[invoke_model(line, context) for _ in range(4)])
            result = {
                'context': context,
                'question': line.strip(),
                'responses': responses
            }
            end = timer()
            # print(f"Total time taken: {end - start} seconds\n-----")
            results.append(result)
    return results

In [None]:
results = await process_lines(questions)

with open(llm_responses, 'w') as file:
    json.dump(
        results, 
        file, 
        indent=4
    )

# 03. Set-up the SageMaker GroundTruth labeling job for gathering human preference data 

Next, we invite our team of rankers to review the LLM responses and provide feedback on which responses align better with our organisation's brand. To do that, we follow the steps to set up a [SageMaker GroundTruth](https://docs.aws.amazon.com/sagemaker/latest/dg/data-label.html) labeling job.

### Step 1: IAM role and Amazon S3 bucket set-up 

To use SageMaker GroundTruth, the IAM execution role you are using to run this notebook **must have the AWS managed policy [AmazonSageMakerGroundTruthExecution](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonSageMakerGroundTruthExecution) attached**. \
Run the following code-block to see your IAM execution role name. You can add the policy to the IAM role using the AWS console. For more details, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console) in the AWS documentation.
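
If you prefer to check the attachment programmatically, the sketch below shows one way to do it. It is an assumption-laden illustration: `has_managed_policy` is a hypothetical helper, `FakeIamClient` is a stub that only mimics the response shape of the IAM `list_attached_role_policies` call, and in a real session the execution role would also need permission to call that API.

```python
def has_managed_policy(iam_client, role_name, policy_name):
    """Return True if the named managed policy is attached to the role."""
    kwargs = {"RoleName": role_name}
    while True:
        page = iam_client.list_attached_role_policies(**kwargs)
        if any(p["PolicyName"] == policy_name for p in page["AttachedPolicies"]):
            return True
        if not page.get("IsTruncated"):
            return False
        # Continue with the next page of results
        kwargs["Marker"] = page["Marker"]


class FakeIamClient:
    """Hypothetical stub that mimics the response shape of the IAM client."""

    def list_attached_role_policies(self, **kwargs):
        return {
            "AttachedPolicies": [
                {
                    "PolicyName": "AmazonSageMakerGroundTruthExecution",
                    "PolicyArn": "arn:aws:iam::aws:policy/AmazonSageMakerGroundTruthExecution",
                }
            ],
            "IsTruncated": False,
        }


attached = has_managed_policy(
    FakeIamClient(), "my-role", "AmazonSageMakerGroundTruthExecution"
)
missing = has_managed_policy(FakeIamClient(), "my-role", "SomeOtherPolicy")
print(attached, missing)
```

With real credentials, you would pass `boto3.client("iam")` and the name of your execution role instead of the stub.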


In [None]:
role_name = role.split("/")[-1]
print("********************************************************************************")
print("The IAM execution role name:", role_name)
print("The IAM execution role ARN:", role)
print("********************************************************************************")
print(
    "IMPORTANT: Make sure this execution role has the AWS Managed policy AmazonSageMakerGroundTruthExecution attached."
)

***
In this notebook we use the default S3 bucket to save all the inputs and outputs of the SageMaker GroundTruth job. If you choose to use a different S3 bucket, make sure that the IAM role has the right permissions to access it.

In [None]:
bucket = sess.default_bucket()
print(bucket)
prefix="studio-rlhf"

### Step 2: Specify the work team

The work team consists of the group of people who provide the human feedback. If you don't have an existing work team, you can create one using the AWS console, outside of this notebook. For details and exact steps, see [Create an Amazon Cognito Workforce Using the Labeling Workforces Page](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html#create-workforce-sm-console). When you are ready to proceed, enter the **ARN of the work team** in the following cell.


Note: If you want to preview the worker UI and execute the labeling task yourself, you will need to create a private work team, add yourself as a worker to this team, and provide the work team ARN below.

In [None]:
WORKTEAM_ARN = "arn:aws:sagemaker:us-east-1:477886989750:workteam/private-crowd/uk-based-team"

print(f"This notebook will use the work team ARN: {WORKTEAM_ARN}")

### Step 3: Create the instruction template

The instruction template dictates what is displayed on the UI when the human raters review the model responses and provide feedback. It contains instructions to help them perform their task accurately.

The template below advises human raters to review the question and LLM responses and rank the latter based on their alignment to the organisation's brand.

In [None]:
from IPython.display import display, HTML


def make_template(save_fname="instructions.template"):
    template = """
    <html>
  <head>
    <meta charset="UTF-8" />
    <link rel="stylesheet" href="https://assets.crowd.aws/css/gen-ai-components.css" />
    <link rel="icon" href="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text y=%22.9em%22 font-size=%2290%22>&#129351;</text></svg>" />
    <title>Text Ranking Tool</title>
    <script src="https://assets.crowd.aws/gen-ai-components.js"></script>
  </head>

  <body>
    <div>
      <crowd-text-ranking
        crowd-form-element-id="crowd-form-submit"
        instructions="Rank the following responses from a language model according to their alignment to the organisation's brand."
        ordinal-ranking-dimensions='[{"name":"BrandValue","allowTie":true}]'
        text='{{ task.input.source }}'
        responses='{{ task.input.responses | to_json }}' />
    </div>
    <crowd-form id="crowd-form-submit" style="display: none"></crowd-form>
    <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
  </body>
</html>

    """
    with open(save_fname, "w") as f:
        f.write(template)
        
make_template(save_fname=f"{sample_files_path}/instructions.template")
result = s3.upload_file(f"{sample_files_path}/instructions.template", bucket, "{}/instructions.template".format(prefix))

With the template uploaded to the Amazon S3 bucket, we record its S3 URI so that the labeling job can reference it.

In [None]:
# The path in Amazon S3 to your worker task template or human task UI
HUMAN_UI = []

UI_TEMPLATE_S3_URI = f"s3://{bucket}/{prefix}/instructions.template"
HUMAN_UI.append(UI_TEMPLATE_S3_URI)
UI_CONFIG_PARAM = "UiTemplateS3Uri"

print(f"{UI_CONFIG_PARAM} resource that will be used: {HUMAN_UI[0]}")

### Step 4: Preprocess the input data

Before we create the labeling job, we need to ensure that the input data is in the format expected by GroundTruth. We use the prompts and responses that we collected from our model and saved in **_llm_responses.json_** to create the manifest file **_inp-manifest-trank.json_**. Each row in the manifest file contains one JSON object (a question and its model responses). We then upload the file to Amazon S3.
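
The manifest is a JSON Lines file, meaning one standalone JSON object per line. The sketch below illustrates the conversion on two made-up records; the record contents and the `to_manifest_lines` helper are hypothetical and exist only to show the format.

```python
import json

# Hypothetical records in the same shape as the saved model responses.
records = [
    {"context": "ctx", "question": "How do I reset my PIN?", "responses": ["a", "b", "c", "d"]},
    {"context": "ctx", "question": "What is my balance?", "responses": ["e", "f", "g", "h"]},
]


def to_manifest_lines(items):
    """Serialise each record as one JSON object per line (JSON Lines)."""
    return [
        json.dumps({"source": item["question"], "responses": item["responses"]})
        for item in items
    ]


manifest_lines = to_manifest_lines(records)
manifest_text = "\n".join(manifest_lines) + "\n"

# Every line must parse on its own as a JSON object.
parsed = [json.loads(line) for line in manifest_text.splitlines()]
print(parsed[0]["source"])
```

The `source` and `responses` keys match the fields that the worker template reads through `task.input.source` and `task.input.responses`.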

In [None]:
# Use a context manager so the file is closed after loading
with open(llm_responses) as json_file:
    dataset = json.load(json_file)
sources = [{"source": item["question"], "responses": item["responses"]} for item in dataset]

# Open a file for writing
with open(f"{sample_files_path}/inp-manifest-trank.json", "w") as file:
    for obj in sources:
        # Convert the object to a JSON string and write it to the file
        json_str = json.dumps(obj)
        file.write(json_str + "\n")
model_responses_path = f"{prefix}/inp-manifest-trank.json"
s3.upload_file(f"{sample_files_path}/inp-manifest-trank.json", bucket, model_responses_path)
model_responses_s3_uri = 's3://' + bucket + '/' + model_responses_path
model_responses_s3_uri

### Step 5: Create the labeling job

Now we are ready to create the labeling job.

In [None]:
now = datetime.datetime.now()
timestamp_str = now.strftime("%Y%m%d-%H%M%S")
labeling_job_name = "passthrough-text-ranking" + timestamp_str

In [None]:
sm_client.create_labeling_job(
    LabelingJobName=labeling_job_name,
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': model_responses_s3_uri
            }
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://{}/{}/output/'.format(bucket,prefix) #Enter S3 URI of Output folder
    },
    RoleArn=role, 
    HumanTaskConfig={
        'WorkteamArn': WORKTEAM_ARN,
        'UiConfig':{
            'UiTemplateS3Uri': UI_TEMPLATE_S3_URI
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:PRE-PassThrough',
        'TaskKeywords': [
            'QnA',
        ],
        'TaskTitle': 'Rank LLM responses',
        'TaskDescription': "Rank the responses provided by the LLM",
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 60*30,
        'TaskAvailabilityLifetimeInSeconds': 60*60*24*10,
        'MaxConcurrentTaskCount': 100,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-east-1:432418664414:function:ACS-PassThrough'
        } 
    })

#### Monitoring Labeling Job Status

We track the status of the labeling job and wait for the annotators to complete it. We collect feedback only after the whole job has finished, so that the preference data is complete.
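
Instead of re-running the status cell by hand, you could poll until the job reaches a terminal state. The following is a minimal sketch, not part of the original workflow: `wait_for_labeling_job`, `max_polls`, and `sleep_fn` are hypothetical names, and `fake_describe` is a stub that imitates only the `LabelingJobStatus` field of the `describe_labeling_job` response.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(describe_fn, job_name, poll_seconds=60,
                          max_polls=1000, sleep_fn=time.sleep):
    """Poll describe_fn until the job reaches a terminal status; return that status."""
    for _ in range(max_polls):
        status = describe_fn(LabelingJobName=job_name)["LabelingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep_fn(poll_seconds)
    raise TimeoutError(f"Labeling job {job_name} did not finish after {max_polls} polls")


# Hypothetical stand-in for the SageMaker client call, so the sketch runs offline.
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(LabelingJobName):
    return {"LabelingJobStatus": next(_fake_statuses)}


final_status = wait_for_labeling_job(fake_describe, "demo-job", sleep_fn=lambda s: None)
print(final_status)
```

In the notebook, `describe_fn` would be `sm_client.describe_labeling_job` and `job_name` would be `labeling_job_name`.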

In [None]:
sm_client.describe_labeling_job(LabelingJobName=labeling_job_name)

### Step 6: Gather human feedback through the labeling portal

In [None]:
workforce = sm_client.describe_workforce(WorkforceName="default")
worker_portal_url = 'https://' + workforce["Workforce"]["SubDomain"]


# Display the URL and instructions
display(HTML(f"""
<body>
<h1>04. Gather human preference data</h1>
<p>Please complete the human evaluation tasks available in the labeling portal.</p>
<p><a href="{worker_portal_url}">{worker_portal_url}</a></p>
<p><b>Ensure all tasks are completed before proceeding to the next steps in this notebook.</b></p>
</body>
"""))

The Ground Truth output data from your labeling job is saved in the Amazon S3 bucket that you specified during the job creation process. Specifically, the annotations directory contains the actual annotations provided by the workers during the labeling job.

Let's download and save the rankings per item. 

In [None]:
labeling_job_name="passthrough-text-ranking20240409-211449"
prefix = f"studio-rlhf/output/{labeling_job_name}/annotations/worker-response/iteration-1/"
output_file = f"{sample_files_path}/gt-rankings.json"

# List objects within the Iteration directory
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)
data = []
# Iterate through each object within each page
for page in page_iterator:
    if "Contents" in page:
        for obj in page['Contents']:
            key = obj['Key']
            # Skip the directory itself, if listed
            if not key.endswith('/'):
                # Get the object
                response = s3.get_object(Bucket=bucket, Key=key)
                # Read the content of the file
                content = response['Body'].read().decode('utf-8')
                try:
                    annotations = json.loads(content, strict=False)
                    data.append(annotations["answers"][0]["answerContent"]["ordinalRankingDimensions"][0])
                except ValueError as e:
                    print(f"Error parsing JSON from file {key}: {e}")
                    continue
with open(output_file, 'w') as file:
    json.dump(data, file, indent=4)

# (Optional) Alternative: Gather preference data using Anthropic Claude 3

If you don't have a team of people who can complete the human evaluation task and you still want to proceed with your development, an alternative is to use an LLM to gather preference data. This section shows you how to use [Amazon Bedrock](https://aws.amazon.com/bedrock/) and the Anthropic Claude 3 - Sonnet model to gather preference data.

In [None]:
import utils.ranker as ranker
llm_responses = os.path.join(sample_files_path, 'llm_responses.json')
output_file = os.path.join(sample_files_path, "claude-rankings.json")

In [None]:
ranker.rank(llm_responses, output_file)

# 05. Process the collected feedback

In this section, we sort the human preference results from SageMaker Ground Truth into three fields:
1. `prompt`: The prompt that was presented to the foundation model
2. `chosen`: The response from the foundation model that the labeller ranked as the preferred response
3. `rejected`: The response from the foundation model that the labeller ranked last
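
The selection of `chosen` and `rejected` from a list of ranks can be sketched as follows. This is an illustrative helper rather than the notebook's own code: `pick_chosen_and_rejected` is a hypothetical name, the responses and ranks are made up, and the sketch assumes that rank 1 is the most preferred response.

```python
def pick_chosen_and_rejected(responses, rankings):
    """Return (chosen, rejected) as the best- and worst-ranked responses.

    rankings[i] is the rank given to responses[i]; 1 is the most preferred.
    """
    if len(responses) != len(rankings):
        raise ValueError("responses and rankings must have the same length")
    chosen_index = rankings.index(min(rankings))
    rejected_index = rankings.index(max(rankings))
    if chosen_index == rejected_index:
        raise ValueError("all responses are tied; no preference pair can be formed")
    return responses[chosen_index], responses[rejected_index]


# Hypothetical responses and the ranks a labeller might assign to them.
responses = ["A", "B", "C", "D"]
rankings = [2, 1, 4, 3]

chosen, rejected = pick_chosen_and_rejected(responses, rankings)
print(chosen, rejected)

# With tied ranks, the first best and the first worst response are used.
tied_chosen, tied_rejected = pick_chosen_and_rejected(responses, [1, 1, 2, 2])
print(tied_chosen, tied_rejected)
```

Taking the minimum and maximum rank, rather than the fixed values 1 and 4, keeps the lookup working when labellers assign tied ranks.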


In [None]:
dataset = load_dataset("json", data_files=llm_responses, split="train")

In [None]:
with open(output_file, "r") as f:
    response_rankings = json.load(f)

In [None]:
def return_prompt_and_responses(samples, index):
    prompt = f"{samples['context']}\n\n{samples['question']}"
    rankings = response_rankings[index]["responseRankings"]
    # Rank 1 is the most preferred response. Using min/max instead of fixed
    # ranks keeps the lookup working when the labeller assigned tied ranks.
    chosen_index = rankings.index(min(rankings))
    rejected_index = rankings.index(max(rankings))

    # apply_chat_template expects a list of messages
    prompt_messages = [{"role": "user", "content": prompt}]
    chosen_messages = [
        {"role": "assistant", "content": samples["responses"][chosen_index]},
    ]
    rejected_messages = [
        {"role": "assistant", "content": samples["responses"][rejected_index]},
    ]

    return {
        "prompt": tokenizer.apply_chat_template(prompt_messages, tokenize=False),
        # The prompt already starts with <|begin_of_text|>, so strip it from the completions
        "chosen": tokenizer.apply_chat_template(chosen_messages, tokenize=False).replace('<|begin_of_text|>', ''),
        "rejected": tokenizer.apply_chat_template(rejected_messages, tokenize=False).replace('<|begin_of_text|>', ''),
    }

In [None]:
original_columns = dataset.column_names
prepared_dataset = dataset.map(
    return_prompt_and_responses,
    with_indices=True,
    batched=False,
    remove_columns=original_columns
)

In [None]:
# print(prepared_dataset[0]['prompt'])
# print(prepared_dataset[0]['chosen'])
# print(prepared_dataset[0]['rejected'])

In [None]:
prepared_dataset.save_to_disk(os.path.join(sample_files_path, 'processed_human_feedback'))

Here we use an 80/20 train/test split; you can choose a different ratio.

In [None]:
dataset = prepared_dataset.train_test_split(test_size=0.2)

In [None]:
dataset["train"].to_json(
    os.path.join(sample_files_path, "processed_human_feedback", "train_dataset.json"), 
    orient="records", 
    index="False"
)

dataset["test"].to_json(
    os.path.join(sample_files_path, "processed_human_feedback", "test_dataset.json"), 
    orient="records", 
    index="False"
)

# 06. Align `meta-llama/Meta-Llama-3-8B-Instruct` with the DPOTrainer

### Load Dataset and Set Tokenizer

In [None]:
from datasets import load_dataset

train_dataset = load_dataset("json", data_files=os.path.join(sample_files_path, "processed_human_feedback", "train_dataset.json"), split="train")
eval_dataset = load_dataset("json", data_files=os.path.join(sample_files_path, "processed_human_feedback", "test_dataset.json"), split="train")

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model_id, token=hf_access_token, cache_dir=cache_dir )
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True

tokenizer.padding_side = "left"
tokenizer.truncation_side = 'left'
tokenizer.bos_token, tokenizer.eos_token



Code based on https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/dpo-align-llms-in-2024-with-trl.ipynb

In [None]:
#### COMMENT IN TO RECALCULATE MAX LENGTHS ####
from numpy import percentile

# # lets find the p95 length of the prompt 
prompt_length = int(percentile([len(tokenizer(x)["input_ids"]) for x in train_dataset["prompt"]], 95))
max_seq_length_chosen = int(percentile([len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) for x in train_dataset], 95))
max_seq_length_rejected = int(percentile([len(tokenizer(x["prompt"] + x["rejected"])["input_ids"]) for x in train_dataset], 95))
max_seq_length = max(max_seq_length_chosen, max_seq_length_rejected)

# filter datasets to remove samples that are too long
train_dataset = train_dataset.filter(lambda x: len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) <= max_seq_length)
eval_dataset = eval_dataset.filter(lambda x: len(tokenizer(x["prompt"] + x["chosen"])["input_ids"]) <= max_seq_length)
print(f"len(train_dataset): {len(train_dataset)}")
print(f"len(eval_dataset): {len(eval_dataset)}")

# Round the lengths up to the next multiple of 2
prompt_length = ((prompt_length + 1) // 2) * 2
max_seq_length = ((max_seq_length + 1) // 2) * 2
print(f"p95 prompt length: {prompt_length}")
print(f"p95 prompt + response length: {max_seq_length}")


In [None]:
prompt_length = 684
max_seq_length = 758

In [None]:
peft_config = LoraConfig(
    lora_alpha=1024,
    lora_dropout=0.05,
    r=2048,
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

### Fine-Tune DPO Model
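
To give some intuition for the `beta` and `loss_type="sigmoid"` settings used in this section, here is a minimal pure-Python sketch of the sigmoid DPO loss for a single preference pair. It is an illustration of the published objective under simplified assumptions, not the `trl` implementation: the function name and the log-probability values are hypothetical, and `trl` works on batches of tensors and may apply further options.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy than the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) computed as a numerically stable softplus(-logits)
    x = -logits
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss equals log(2).
loss_equal = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has moved towards the chosen response and away from the rejected one.
loss_better = dpo_sigmoid_loss(-5.0, -17.0, -10.0, -12.0, beta=0.1)

print(f"{loss_equal:.4f} {loss_better:.4f}")
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model. A larger `beta` scales that margin up, so the same departure from the reference model moves the loss more.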

In [None]:
from trl import DPOConfig

dpo_model_dir = "/mnt/sagemaker-nvme/fine-tuned/llama3-dpo-a1024-r2048"

args = DPOConfig(
    output_dir=dpo_model_dir,               # directory to save and repository id
    num_train_epochs=5,                     # number of training epochs
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim = "adamw_torch_fused",            # use fused adamw optimizer
    learning_rate=1e-5,                     # learning rate
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.1,                       # warmup ratio based on QLoRA paper
    lr_scheduler_type="cosine",             # use cosine learning rate scheduler
    logging_steps=10,                       
    save_steps=10,                         # when to save checkpoint
    evaluation_strategy="steps",            
    eval_steps=100,
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    push_to_hub=False,                      # do not push the model to the Hub
    report_to='tensorboard',
    remove_unused_columns=False
)

dpo_args = {
    "beta": 0.1,                            # The beta factor in DPO loss. Higher beta means less divergence
    "loss_type": "sigmoid"                  # The loss type for DPO.
}

In [None]:
from trl import DPOTrainer

trainer = DPOTrainer(
    model,
    ref_model=None,
    peft_config=peft_config,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_length=max_seq_length,
    max_prompt_length=prompt_length,
    beta=dpo_args["beta"],
    loss_type=dpo_args["loss_type"],
)


In [None]:
trainer.train()

In [None]:
peft_output_dir = os.path.join(args.output_dir, "peft")
print(f"saving peft model to: {peft_output_dir}")
trainer.save_model(output_dir=peft_output_dir)

### Merge Base Model with Adapter

In [None]:
dpo_model_dir = "/mnt/sagemaker-nvme/fine-tuned/llama3-dpo-a1024-r2048"

f"{dpo_model_dir}/peft"

In [None]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import AutoPeftModelForCausalLM

new_dpo_output_dir = os.path.join(args.output_dir, "merged_full_model")

# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    f"{peft_output_dir}/",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    token=hf_access_token,
    cache_dir=cache_dir
)
# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
print(f"saving merged model into {new_dpo_output_dir}")

merged_model.save_pretrained(
    new_dpo_output_dir, 
    safe_serialization=True, 
    max_shard_size="9GB"
)
tokenizer.save_pretrained(new_dpo_output_dir)

In [None]:
import gc

# Drop references to the training objects, then release cached GPU memory
del model
del merged_model
del trainer
del tokenizer
gc.collect()
torch.cuda.empty_cache()

# 07. Evaluate the fine-tuned model

### Reload new merged model

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    new_dpo_output_dir,
    token=hf_access_token,
    torch_dtype=torch.bfloat16,
    # quantization_config=bnb_config,
    device_map="auto",
    cache_dir=cache_dir

)

tokenizer = AutoTokenizer.from_pretrained(new_dpo_output_dir)

### Compare Model Responses Pre and Post Fine-Tuning 

In [None]:
with open(os.path.join(sample_files_path, 'llm_responses.json')) as f:
    model_resp_pre_dpo = json.load(f)
with open(os.path.join(sample_files_path, 'claude-rankings.json')) as f:
    rankings = json.load(f)

In [None]:
import copy

# Keep only the questions whose first pre-DPO response was not ranked first
selected_responses, selected_ranks = [], []
for resp, rank in zip(model_resp_pre_dpo, rankings):
    curr_rank = rank['responseRankings'][0]
    if curr_rank != 1:
        selected_responses.append(resp)
        selected_ranks.append(rank)

In [None]:
from tqdm import tqdm

def generate_html_table(data):
    html = "<table>"
    for i, row in enumerate(data):
        html += "<tr>"
        for j, col in enumerate(row):
            if i == 0:  # If it's the first row (header), make it bold and with color fill
                html += "<th style='background-color: #f0ad4e; color: white; padding: 5px; font-weight: bold'>{}</th>".format(col)
            else:
                html += "<td>{}</td>".format(col)
        html += "</tr>"
    html += "</table>"
    return html


def generate_and_compare_prepost_responses(chosen_response_pairs):

    inference_data = [
        ['Question', 'New DPO Response', 'Rejected Model Response', 'Human Chosen Response']
    ]

    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    context = f"""{company_context} You are a customer service agent for {company_name} Answer the following customer question in one or two sentences:
    """

    for idx in tqdm(range(0, len(selected_responses)), total=len(selected_responses)):
        question_ = selected_responses[idx]['question']
        messages = [
            {"role": "user", "content": f"{context}: {question_}"}
        ]
        response = pipe(
            messages, 
            max_new_tokens=120, 
            do_sample=True,
            temperature=gl_temperature, 
            top_p=gl_top_p, 
            pad_token_id=tokenizer.eos_token_id,
            bos_token_id=tokenizer.bos_token_id,  # pass the token id, not the token string
        )

        chosen = None
        rejected = None
        for sample_ in chosen_response_pairs:
            if question_ in sample_['prompt']:
                chosen = sample_['chosen']
                rejected = sample_['rejected']

        assert chosen is not None, "Chosen is None"
        assert rejected is not None, "Rejected is None"
        
        new_model_response = response[0]['generated_text'][1]['content']
        reject_model_response = rejected
        chosen_model_response = chosen

        inference_data.append([question_, new_model_response, reject_model_response, chosen_model_response])
    return inference_data
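
The `generate_html_table` helper above inserts each cell verbatim, so a model response containing characters such as `<` or `&` could distort the rendered table. A possible variant, sketched here under the assumption that all cells should be shown as plain text, escapes each cell with the standard library's `html.escape`. The function name `generate_escaped_html_table` and the sample rows are hypothetical, and the header styling is omitted for brevity.

```python
import html


def generate_escaped_html_table(data):
    """Build an HTML table from rows of cells, escaping every cell as plain text."""
    rows = []
    for i, row in enumerate(data):
        tag = "th" if i == 0 else "td"  # the first row is treated as the header
        cells = "".join(f"<{tag}>{html.escape(str(col))}</{tag}>" for col in row)
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


# Hypothetical sample rows, including characters that would otherwise be parsed as HTML
sample = [["Question", "Response"], ["Is 2 < 3?", "Yes & <b>always</b>"]]
table_html = generate_escaped_html_table(sample)
print(table_html)
```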

In [None]:
from datasets import load_from_disk
dataset = load_from_disk("../../sample-files/processed_human_feedback")

In [None]:
inf_results = generate_and_compare_prepost_responses(chosen_response_pairs=dataset)

In [None]:
display(HTML(generate_html_table(inf_results)))

In [None]:
def save_html_to_file(html_content, file_path):
    with open(file_path, 'w') as file:
        file.write(html_content)
    print("HTML table saved to:", file_path)
    
# Generate HTML table
html_content = generate_html_table(inf_results)

# Save HTML to file
save_html_to_file(html_content, 'rlhf-pre-post-responses.html')

# 08. Deploy to a SageMaker Endpoint

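The saved model can be hosted on a custom SageMaker endpoint using DJL Serving. Three artifacts together tell DJL Serving how to load and host the model: the model weights (uploaded to S3), a sample `model.py` inference handler, and a `serving.properties` file. The handler and the properties file are packaged into a meta-model tarball.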

## Push Model to S3

In [None]:
fine_tuned_s3_uri = f"s3://{sess.default_bucket()}/llama3-ft/modelweights"

In [None]:
print(f"Uploading model to {fine_tuned_s3_uri}")
sagemaker.s3.S3Uploader.upload(
    new_dpo_output_dir, 
    fine_tuned_s3_uri
)

## Create a `serving.properties` File

In [None]:
import textwrap

In [None]:
# Build the contents of the serving.properties file (written to disk in a later cell)
serving_properties = textwrap.dedent(f"""
engine = DeepSpeed
option.tensor_parallel_degree = 1
option.s3url = {fine_tuned_s3_uri}/
option.hf_access_token = {hf_access_token}
""").strip()

In [None]:
local_meta_model_dir = "./llama3-serving-model"
os.makedirs(local_meta_model_dir, exist_ok=True)

In [None]:
with open(os.path.join(local_meta_model_dir, "serving.properties"), "w") as prop_file:
    prop_file.write(serving_properties)
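
As a quick sanity check, the generated text can be parsed back into key/value pairs before packaging. The sketch below is illustrative only: `parse_properties` is a hypothetical helper that handles just the simple `key = value` lines used here (not the full Java properties syntax), and the S3 URI is a placeholder.

```python
import textwrap


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        props[key.strip()] = value.strip()
    return props


example_s3_uri = "s3://example-bucket/llama3-ft/modelweights"  # hypothetical placeholder
example_properties = textwrap.dedent(f"""
    engine = DeepSpeed
    option.tensor_parallel_degree = 1
    option.s3url = {example_s3_uri}/
""").strip()

parsed = parse_properties(example_properties)
print(parsed)
```

Every value comes back as a string, so numeric options such as `option.tensor_parallel_degree` need an explicit cast wherever they are consumed.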

## Create `model.py` file

In [None]:
%%writefile llama3-serving-model/model.py

from djl_python import Input, Output
import os
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import deepspeed


predictor = None


def get_model(properties):

    # Log only the keys: the properties include the Hugging Face access token
    print("properties keys------>", list(properties.keys()))

    local_rank = int(os.getenv("LOCAL_RANK", "0"))

    local_model_id = properties["model_id"]
    hf_access_token = properties["hf_access_token"]
    
    print(f"model files found: {os.listdir(local_model_id)}")

    print(f"Loading model from {local_model_id}")
    base_model = AutoModelForCausalLM.from_pretrained(
        local_model_id,
        # quantization_config=bnb_config,
        low_cpu_mem_usage=True,
        torch_dtype=torch.bfloat16,
        token=hf_access_token
    )
    print("model loaded!")
    print("\nconverting model into deep speed...")
    
    model = deepspeed.init_inference(
        base_model,
        mp_size=int(properties["tensor_parallel_degree"])  # property values arrive as strings
    )

    # load tokenizer
    print(f"Loading tokenizer from {local_model_id}")
    tokenizer = AutoTokenizer.from_pretrained(
        local_model_id,
        token=hf_access_token
    )
    
    generator = pipeline(
        task="text-generation", 
        model=model, 
        tokenizer=tokenizer,
        device=local_rank
    )
    return generator


def handle(inputs: Input) -> Output:

    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None
        
    data = inputs.get_as_json()
    
    content = data["inputs"]

    message = [
        {"role": "user", "content": f"{content}"}
    ]
    
    # Fall back to the pipeline defaults when the request carries no "parameters"
    generation_kwargs = data.get("parameters", {})

    print("message", message, "parameters", generation_kwargs)
    
    outputs = predictor(message, **generation_kwargs)[0]['generated_text'][-1]
    result = {"outputs": outputs['content']}
    return Output().add(result)
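
The handler above reads the reply with `[0]['generated_text'][-1]['content']`. To make that indexing concrete without loading a model, the sketch below uses a hypothetical stand-in, `fake_chat_pipeline`, that returns the same nested structure the handler indexes into: a list with one dict whose `generated_text` holds the conversation, ending with the assistant message. The canned reply is invented for illustration.

```python
def fake_chat_pipeline(messages, **generation_kwargs):
    """Hypothetical stand-in for the text-generation pipeline called with chat messages."""
    reply = {"role": "assistant", "content": "Thanks for asking!"}
    # The conversation is returned with the assistant reply appended as the last message
    return [{"generated_text": list(messages) + [reply]}]


def extract_reply(pipeline_output):
    """Return the content of the last message of the first generated sequence."""
    return pipeline_output[0]["generated_text"][-1]["content"]


message = [{"role": "user", "content": "What are your opening hours?"}]
output = fake_chat_pipeline(message, max_new_tokens=120)
reply_text = extract_reply(output)
print(reply_text)
```

Indexing with `-1` rather than a fixed position keeps the extraction valid if more messages (for example a system prompt) are added to the conversation.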


In [None]:
%%writefile llama3-serving-model/requirements.txt
transformers==4.39.3
langchain==0.0.305
sentence_transformers
accelerate==0.28.0
bitsandbytes==0.43.0
peft==0.11.1

## Create Meta Model Tarball

In [None]:
!rm -rf ./{local_meta_model_dir}/.ipynb_checkpoints/

In [None]:
os.system(f"tar czvf {os.path.basename(local_meta_model_dir)}.tar.gz {local_meta_model_dir}")
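
The same tarball can also be built with Python's standard `tarfile` module instead of shelling out to `tar`. This is only a sketch of an alternative: `make_tarball` is a hypothetical helper, and the demo runs against a temporary directory with a placeholder `serving.properties` so that it does not touch the notebook's files.

```python
import os
import tarfile
import tempfile


def make_tarball(source_dir, tar_path):
    """Create a gzip-compressed tarball with source_dir as its single top-level folder."""
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top_level)  # directories are added recursively
    return tar_path


# Demo on a throwaway directory containing a placeholder properties file
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "demo-serving-model")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "serving.properties"), "w") as f:
        f.write("engine = DeepSpeed\n")
    demo_tar = make_tarball(demo_dir, os.path.join(tmp, "demo-serving-model.tar.gz"))
    with tarfile.open(demo_tar, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```

Listing the archive members afterwards is a cheap way to confirm that `serving.properties` and the handler ended up inside the expected folder.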

## Deploy Model as a SageMaker Endpoint

In [None]:
from sagemaker import image_uris
from sagemaker.model import Model
from datetime import datetime

In [None]:
region = sess.boto_region_name  # use the session's region so it matches the default bucket

inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", 
    region=region, 
    version="0.23.0"
)

In [None]:
meta_model_s3_uri = f"s3://{sess.default_bucket()}/llama3-ft2/metamodel-python"
print(f"Uploading meta-model to {meta_model_s3_uri}")
meta_model_s3_uri = sagemaker.s3.S3Uploader.upload(
    f"{os.path.basename(local_meta_model_dir)}.tar.gz", 
    meta_model_s3_uri
)
print(f"Uploaded meta-model to {meta_model_s3_uri}")

In [None]:
dpo_model = Model(
    sagemaker_session=sess,
    image_uri=inference_image_uri,
    model_data=meta_model_s3_uri,
    role=role,
    name=f"llama3-dpo-optimized-{datetime.now().strftime('%y%m%d%H%M')}",
)


In [None]:
print(f"Deploying model with endpoint name ep-{dpo_model.name}")
dpo_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=f"ep-{dpo_model.name}",
    container_startup_health_check_timeout=900,
    wait=False,  # set to True to block until the endpoint is in service
)
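
Because the deployment above returns immediately with `wait=False`, the endpoint may still be starting when the inference cells run. One way to handle this is to poll the endpoint status until it reports `InService`. The helper below is a generic sketch: `wait_for_status` and its parameters are hypothetical, and the demo uses a fake status sequence. In the notebook, `get_status` could wrap a call such as `sess.sagemaker_client.describe_endpoint(EndpointName=...)["EndpointStatus"]`, which is an assumption about how you would wire it up.

```python
import time


def wait_for_status(get_status, target="InService", failed=("Failed",),
                    poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it returns target; raise on failure or timeout."""
    waited = 0
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"Endpoint entered status {status!r}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake status source and a no-op sleep, so it finishes instantly
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

Passing the status lookup and the sleep function as arguments keeps the polling logic testable without an AWS account.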

## Test Inference

In [None]:
import sagemaker
from sagemaker import serializers, deserializers

In [None]:
dpo_ft_predictor = sagemaker.Predictor(
    endpoint_name=dpo_model.endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

In [None]:
user_question = "How do you minimize environmental footprint?"
content = f"""
{company_context} You are a customer service agent at Example Bank. You are polite and helpful. Sometimes you are allowed to be clever and provide rude answers, but only sometimes. Answer the following customer question in one or two sentences: {user_question}
"""
parameters =  {'max_new_tokens': 120, 'top_p': gl_top_p, 'temperature': gl_temperature}

In [None]:
response = dpo_ft_predictor.predict(
    {
        "inputs": content,
        "parameters": parameters
    }
)

In [None]:
print("SageMaker Endpoint Model Response ===>", response['outputs'])