# Frugal Reinforcement Learning from Human Feedback (RLHF) with Meta Llama models: Preference aligning LLMs with Multi-Adapter PPO

<img src="./images/Llama.png" width="30%" alt='Llama.png'/> 

This workshop will guide you through the process of fine-tuning Large Language Models (LLMs) using Reinforcement Learning from Human Feedback (RLHF) with a multi-adapter Proximal Policy Optimization (PPO) approach.

## Workshop Overview

In this tutorial, you will:
1. Set up the necessary environment and dependencies
2. Prepare datasets for both reward model training and PPO fine-tuning
3. Train a reward model using human preference data
4. Perform PPO-based RLHF training using your reward model
5. Deploy the fine-tuned model to Amazon Bedrock for inference
6. Clean up resources

More information on LLM fine-tuning, the benefits of reinforcement learning approaches like RLHF, and the benefits of multi-adapter PPO can be found in [this blog post](https://medium.com/data-science/preference-alignment-for-everyone-2563cec4d10e).

## Scenario

Most models published today have already gone through several fine-tuning steps, such as supervised fine-tuning (SFT) or even preference alignment (PA). Because these are general-purpose models, those steps were not tailored to your target users or your target domain. Even when starting from a pre-aligned model (for example, an instruction fine-tuned model), further alignment steps are therefore required to optimise model performance in your domain.

For this workshop, we assume the model should be optimised for helpfulness in user-facing single- and multi-turn Q&A-style conversations in the scientific domain. We therefore start from a general-purpose, instruction-tuned foundation model (FM).

## Prerequisites

Before starting, you need to:

1. Accept the license terms for the [Meta Llama 3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and [Meta Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) Instruct models on HuggingFace. To do so, navigate to the respective model pages through the links.

<img src="./images/LlamaLicenseAgreement.png" width="30%" alt='LlamaLicenseAgreement.png'/> 

2. Create an access token for HuggingFace authentication. To do so, follow the instructions in the [HuggingFace documentation](https://huggingface.co/docs/hub/en/security-tokens).

<img src="./images/CreateHFToken.png" width="30%" alt='CreateHFToken.png'/> 

## Step 1: Model Selection

In this workshop, we'll be using a Meta Llama model for our fine-tuning process. You can choose between:
- Meta Llama-3.1-8B-Instruct: More powerful but requires more compute resources (training with ml.g5.12xlarge, configure in config.yaml)
- Meta Llama-3.2-1B-Instruct: Smaller model, faster training, lower resource requirements (training with ml.g5.2xlarge or ml.p3.2xlarge, configure in config.yaml)

Choose the model that best fits your available compute resources. For workshop settings with limited time or resources, the 1B model is recommended.

In [None]:
# Define model to be fine-tuned
#model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model_id = "meta-llama/Llama-3.2-1B-Instruct"

## Step 2: Environment Setup

Before we begin our RLHF training process, we need to set up our environment with all necessary packages and dependencies. We'll be using PyTorch with CUDA support for GPU acceleration and several specialized libraries for training and optimizing our language models.

In [None]:
# Install PyTorch with CUDA support
%pip install -U torch==2.2.0+cu118 --index-url https://download.pytorch.org/whl/cu118

The remaining dependencies are managed in a `requirements.txt` file, which is shared by the notebook environment and the remote training jobs to keep their dependencies compatible.

In [None]:
# Install all dependencies from requirements.txt file
%pip install -r requirements.txt
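
Because the notebook and the remote training jobs share `requirements.txt`, it can help to confirm which versions actually ended up in the kernel. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative assumptions, not part of the workshop code:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment
            versions[name] = None
    return versions


# Illustrative package list; adjust it to match your requirements.txt
for name, version in installed_versions(["torch", "transformers", "trl", "peft"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Running the same check inside a training job would show whether both environments resolved to the same versions.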

## Step 3: Import Libraries

Now we'll import all the libraries we need for the RLHF workflow. This includes:
- AWS SDK (boto3) for storage and compute management
- PyTorch for deep learning operations
- Transformers from Hugging Face for model loading and tokenization
- datasets for working with Hugging Face datasets
- TRL (Transformer Reinforcement Learning) for RLHF implementation
- PEFT (Parameter-Efficient Fine-Tuning) for adapter-based training
- accelerate for distributed training
- Other utilities for data handling and processing

In [None]:
# Import required libraries
import boto3
import botocore
import bitsandbytes as bnb
import multiprocessing
import sys
import functools
import json
import torch
import transformers
import warnings
from dataclasses import dataclass, field
from typing import Optional
from datasets import load_dataset, load_from_disk, Dataset, DatasetDict
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, BitsAndBytesConfig, set_seed
from trl import ModelConfig, RewardConfig, PPOConfig, PPOTrainer, RewardTrainer, AutoModelForCausalLMWithValueHead, get_kbit_device_map, get_peft_config, get_quantization_config
from trl.core import LengthSampler
from accelerate import Accelerator
from peft import AutoPeftModelForCausalLM, AutoPeftModelForSequenceClassification, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from sagemaker.remote_function import remote
from tqdm import tqdm
import s3fs

## Step 4: Setting Up Storage Infrastructure

For our training workflow, we need a place to store our datasets and model artifacts. We'll create an Amazon S3 bucket that will serve as our central storage repository throughout the workshop. 

The function below will:
1. Create a uniquely named S3 bucket in the current region
2. Define S3 prefixes for model artifacts and datasets
3. Return the bucket name together with both paths

This setup ensures we have a persistent storage location for our data and models, which is especially important for distributed training workloads.

In [None]:
import boto3
import sagemaker
from sagemaker.session import Session
import uuid
import time

def create_s3_bucket_for_models(bucket_name=None, region=None):
    """
    Create an S3 bucket for storing SageMaker models.
    
    Args:
        bucket_name: Optional name for the S3 bucket. If not provided, a name will be generated.
        region: AWS region to create the bucket in. If not provided, uses the SageMaker session's region.
        
    Returns:
        tuple: (bucket_name, s3_output_path, s3_data_path) - The name of the created bucket and the S3 paths for model output and data
    """
    # Initialize boto3 clients
    s3_client = boto3.client('s3')
    session = Session()
    
    # Get the AWS region if not provided
    if not region:
        region = session.boto_region_name
    
    # Generate a unique bucket name if not provided
    if not bucket_name:
        timestamp = int(time.time())
        random_suffix = str(uuid.uuid4())[:8]
        bucket_name = f"sagemaker-model-artifacts-{timestamp}-{random_suffix}"
    
    # Create the S3 bucket
    try:
        if region == 'us-east-1':
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            location = {'LocationConstraint': region}
            s3_client.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration=location
            )
        print(f"Created S3 bucket: {bucket_name}")
        
        # Build the S3 prefixes for model output and datasets (S3 has no real folders)
        s3_output_path = f"s3://{bucket_name}/models"
        s3_data_path = f"s3://{bucket_name}/data"
        
        return bucket_name, s3_output_path, s3_data_path
    
    except Exception as e:
        print(f"Error creating S3 bucket: {e}")
        # Return three values so that the tuple unpacking at the call site still works
        return None, None, None

# Create a bucket for your data and models
bucket_name, s3_output_path, s3_data_path = create_s3_bucket_for_models()
print(f"Model output path: {s3_output_path}, Data output path: {s3_data_path}")
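
Bucket creation fails if the name violates the S3 naming rules, so a quick local check before calling the API can save a round trip. The sketch below is an assumption-laden illustration: it checks only a simplified subset of the rules (3 to 63 characters, lowercase letters, digits and hyphens, starting and ending with a letter or digit; dots are deliberately excluded), and the helper names are hypothetical:

```python
import re
import time
import uuid

# Simplified subset of the S3 bucket naming rules (dots are not accepted here)
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name):
    """Return True if the name passes the simplified naming check."""
    return bool(_BUCKET_NAME_RE.match(name))


def generate_bucket_name(prefix="sagemaker-model-artifacts"):
    # Same scheme as create_s3_bucket_for_models: prefix + timestamp + 8 hex characters
    return f"{prefix}-{int(time.time())}-{str(uuid.uuid4())[:8]}"


candidate = generate_bucket_name()
print(candidate, is_valid_bucket_name(candidate))
```

Names generated by the scheme above pass the check; a custom `bucket_name` argument containing uppercase letters or underscores would not.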

## Step 5: Data Preparation for Reward Model Training

### Understanding Reward Model Data Format

A key component of RLHF is the reward model, which learns from human preference data to score model outputs. For this workshop, we'll use the open-source Anthropic [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf), specifically focusing on the helpful subset.

The reward model dataset needs to contain pairs of responses where one response is preferred over the other (chosen vs rejected). The model will learn to assign higher scores to the preferred responses.
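
To make "assign higher scores to the preferred responses" concrete, the sketch below computes a pairwise logistic (Bradley-Terry style) loss for a single preference pair. It illustrates the general idea under the assumption that the reward model emits one scalar per response; it is not the trainer's actual code, and the function name is hypothetical:

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for a single preference pair."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) is algebraically equal to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))


print(pairwise_preference_loss(2.0, -1.0))  # small loss: chosen is scored higher
print(pairwise_preference_loss(-1.0, 2.0))  # large loss: the ranking is wrong
```

The loss shrinks as the margin between the chosen and rejected scores grows, which is what pushes the model to rank preferred responses higher.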

#### Target Format:
The dataset should have the following structure for reward model training:


```text
DatasetDict({
    train: Dataset({
        features: ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'],
        num_rows: _
    })
    test: Dataset({
        features: ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'],
        num_rows: _
    })
})
```

Let's start by downloading and preprocessing this data.

### Step 5.1: HuggingFace Authentication

First, we need to authenticate with HuggingFace Hub to access their datasets and models. Replace the placeholder with your own HuggingFace token.

> **Important**: Your HuggingFace token should have write access if you intend to push models to the Hub later in this workshop.

In [None]:
# Login to huggingface
hf_token = "***HF_TOKEN***"
login(hf_token)

In [None]:
# Load dataset
ds = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")
ds

In [None]:
ds['train'][67]

### Step 5.2: Exploring the HH-RLHF Dataset

Let's examine a sample from the dataset to understand its structure. The Anthropic HH-RLHF dataset contains pairs of AI assistant responses to user queries, with one response (chosen) being preferred by human evaluators over the other (rejected).

Each example contains:
- A human query or instruction
- A chosen response that was preferred by human evaluators
- A rejected response that was less preferred

Understanding this structure is crucial as we'll need to transform it into our target format for the reward model.

### Step 5.3: Preprocessing the Dataset

Now we need to process the dataset into a structured format. The `extract_dialogue` function will parse the text into a list of message dictionaries with "role" and "content" fields, following a chat template format.

This preprocessing function:
1. Parses each dialogue into user/assistant message pairs
2. Extracts the initial prompt from the conversation
3. Structures both chosen and rejected conversations in a consistent format

This step is crucial because it transforms the raw text into a structured format that our models can work with.

In [None]:
def extract_dialogue(input_text):
    # Split the transcript into blocks separated by blank lines
    blocks = input_text.strip().split("\n\n")
    dialogue_list = []

    for block in blocks:
        # A block starting with "Human:" or "Assistant:" opens a new message.
        # Only the leading speaker tag is removed; later mentions stay in the content.
        if block.startswith("Human:"):
            role = "user"
            content = block[len("Human:"):].strip()
        elif block.startswith("Assistant:"):
            role = "assistant"
            content = block[len("Assistant:"):].strip()
        elif dialogue_list:
            # Otherwise the block continues the previous message's content
            dialogue_list[-1]["content"] += "\n\n" + block.strip()
            continue
        else:
            # Skip leading text that precedes the first speaker tag
            continue

        # Append the extracted dialogue piece to the list
        dialogue_list.append({"role": role, "content": content})

    return dialogue_list

def process(row):
    row["chosen"] = extract_dialogue(row["chosen"])
    row["rejected"] = extract_dialogue(row["rejected"])
    # The first user message serves as the prompt
    row["prompt"] = row["chosen"][0]["content"]
    return row

In [None]:
ds_processed = ds.map(
        process,
        load_from_cache_file=False,
    )
ds_processed

In [None]:
ds_processed['train'][67]
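
To see what the parsing step produces without downloading the dataset, the sketch below runs a standalone, simplified parser on a hand-written transcript in the same `Human:` / `Assistant:` layout. Both the toy text and the helper name `parse_transcript` are made up for illustration:

```python
def parse_transcript(text):
    """Split an HH-RLHF-style transcript into role/content messages."""
    messages = []
    for block in text.strip().split("\n\n"):
        if block.startswith("Human:"):
            messages.append({"role": "user", "content": block[len("Human:"):].strip()})
        elif block.startswith("Assistant:"):
            messages.append({"role": "assistant", "content": block[len("Assistant:"):].strip()})
        elif messages:
            # Continuation paragraph: attach it to the previous message
            messages[-1]["content"] += "\n\n" + block.strip()
    return messages


toy = (
    "\n\nHuman: What is a photon?"
    "\n\nAssistant: A quantum of light."
    "\n\nIt has no rest mass."
    "\n\nHuman: Thanks!"
)
for message in parse_transcript(toy):
    print(message)
```

The second paragraph of the assistant's answer carries no speaker tag, so it is merged into the preceding assistant message rather than becoming a message of its own.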

### Step 5.4: Applying Llama Chat Template

Different LLM families use different formatting for chat interactions. Since we're working with Llama models, we need to convert our structured data into Llama's specific chat template format.

The functions below:
1. Define the system prompt that will guide the model's behavior
2. Create helper functions to properly encode dialogue turns following Llama's format
3. Apply the chat template to both chosen and rejected responses

The formatted dialogues will include special tokens like `<|start_header_id|>`, `<|end_header_id|>`, and `<|eot_id|>` that Llama models recognize for chat interactions.

In [None]:
# Adjusting to llama prompt template format: https://github.com/meta-llama/llama-recipes
system_prompt = "Please answer the user's question to the best of your knowledge. If you don't know the answer respond that you don't know."

def encode_dialogue_turn(message):
    return f'<|start_header_id|>{message.get("role")}<|end_header_id|>{message.get("content")}<|eot_id|>'

def encode_dialogue(dialogue):
    if system_prompt:
        return f'<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>{functools.reduce(lambda a, b: a + encode_dialogue_turn(b), dialogue, "")}'
    else:
        return f'<|begin_of_text|>{functools.reduce(lambda a, b: a + encode_dialogue_turn(b), dialogue, "")}'


def encode_row(item):
    return {"chosen": encode_dialogue(item["chosen"]), "rejected": encode_dialogue(item["rejected"]), "prompt": item["prompt"]}
                                      
def encode_dataset(dataset):
    return list(map(encode_row, dataset))

In [None]:
encoded_dataset = ds_processed.map(encode_row)
encoded_dataset

In [None]:
encoded_dataset['train'][67]
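
The effect of the template is easiest to see on a two-turn toy dialogue. The sketch below mirrors the string layout used by the encoding functions in this step; the helper names `render_turn` and `render_dialogue` and the sample messages are illustrative assumptions:

```python
def render_turn(message):
    # Same layout as encode_dialogue_turn in this notebook
    return f'<|start_header_id|>{message["role"]}<|end_header_id|>{message["content"]}<|eot_id|>'


def render_dialogue(dialogue, system_prompt=None):
    """Prefix an optional system turn and concatenate all rendered turns."""
    turns = list(dialogue)
    if system_prompt:
        turns = [{"role": "system", "content": system_prompt}] + turns
    return "<|begin_of_text|>" + "".join(render_turn(m) for m in turns)


toy_dialogue = [
    {"role": "user", "content": "What is a photon?"},
    {"role": "assistant", "content": "A quantum of light."},
]
print(render_dialogue(toy_dialogue, system_prompt="Be concise."))
```

Meta's published Llama 3 prompt format additionally places two newlines after `<|end_header_id|>`; to compare against the reference rendering, the tokenizer's `apply_chat_template` method can be used.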

### Step 5.5: Loading the Tokenizer

Now we need to load the tokenizer for our chosen model. The tokenizer converts text into token IDs that the model can process. It's important to use the tokenizer corresponding to the exact model we're fine-tuning to ensure compatible vocabulary and special tokens.

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

### Step 5.6: Tokenizing the Dataset

With our tokenizer loaded, we can now convert the text data into token IDs and attention masks that our model can process. For reward model training, we need to tokenize both the chosen and rejected responses.

The `preprocess_function` will:
1. Convert the chosen and rejected text into token IDs
2. Generate attention masks for both sequences
3. Return a dictionary with all four components required for our reward model

This step transforms human-readable text into the numerical representation that neural networks process.

In [None]:
# Tokenize and stack into target format
def preprocess_function(examples):
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        tokenized_chosen = tokenizer(chosen)
        tokenized_rejected = tokenizer(rejected)

        new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
        new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])

    return new_examples

In [None]:
tokenized_dataset_hhrlhf = encoded_dataset.map(
        preprocess_function,
        batched=True,
    ).remove_columns(["chosen", "rejected", "prompt"])
tokenized_dataset_hhrlhf
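
The reward training function later in this notebook sets `max_length=512` in its `RewardConfig`, so it can be worth checking how many pairs exceed that length. How over-length pairs are treated depends on the installed TRL version, which this sketch does not assume. The helper name and the toy rows are hypothetical; the same function should also work when iterating over `tokenized_dataset_hhrlhf["train"]`:

```python
def count_overlong_pairs(rows, max_length=512):
    """Count rows where the chosen or rejected sequence exceeds max_length tokens."""
    return sum(
        1
        for row in rows
        if len(row["input_ids_chosen"]) > max_length
        or len(row["input_ids_rejected"]) > max_length
    )


# Toy rows standing in for the tokenized dataset
toy_rows = [
    {"input_ids_chosen": [1] * 10, "input_ids_rejected": [1] * 12},
    {"input_ids_chosen": [1] * 600, "input_ids_rejected": [1] * 20},
]
print(count_overlong_pairs(toy_rows))
```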

### Step 5.7: Saving the Processed Dataset

Now that we've prepared our dataset for reward model training, we need to save it to Amazon S3 for later use. This step ensures our processed data is:

1. Persistently stored for future training runs
2. Accessible to distributed training jobs
3. Organized in a structured way for the entire RLHF workflow

Because the `s3fs` package is installed, the dataset can be written straight to an `s3://` path in our bucket.

In [None]:
import boto3
import s3fs
#from datasets.filesystems import S3FileSystem

# Define the S3 path
s3_bucket = bucket_name
dataset_path_hhrlhf = f's3://{s3_bucket}/experiments-hhrlhf/helpful-base-train-test-tokenized-llama318binstruct'

# Verify S3 bucket permissions first
s3_client = boto3.client('s3')
try:
    s3_client.head_bucket(Bucket=s3_bucket)
    print(f"Successfully connected to bucket: {s3_bucket}")
except Exception as e:
    print(f"Error connecting to bucket: {e}")
    raise


# Save the dataset to S3. `save_to_disk` resolves the s3:// URL through fsspec,
# which requires the s3fs package to be installed.
try:
    tokenized_dataset_hhrlhf.save_to_disk(dataset_path_hhrlhf)
    print(f"Successfully uploaded dataset to: {dataset_path_hhrlhf}")
except Exception as e:
    print(f"Error uploading to S3: {e}")
    raise

In [None]:
dataset_path_hhrlhf

## Step 6: Data Preparation for PPO Training

### Understanding PPO Training Data Requirements

For the Proximal Policy Optimization (PPO) phase of RLHF, we need a dataset of prompts that will be used to generate responses which will then be scored by our reward model. For this workshop, we'll use the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/) to provide diverse and engaging prompts for our model.

Unlike reward model training, PPO training doesn't need paired responses. Instead, it needs:
1. High-quality prompts for the model to respond to
2. A reward model to score the generated responses
3. A reference model to prevent policy drift
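
One common way to combine the last two ingredients is to penalise the policy for diverging from the reference model on every generated token and to add the reward model's score on the final token. The sketch below illustrates that formulation on toy numbers; the function name and the coefficient `beta=0.2` are illustrative assumptions and do not necessarily match what the trainer computes internally:

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Per-token rewards: KL penalty on every token, reward model score on the last one."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


# Toy log-probabilities of three generated tokens under both models
policy = [-0.5, -1.0, -0.2]
reference = [-0.7, -1.0, -0.9]
print(kl_penalized_rewards(policy, reference, score=1.5))
```

Tokens that the policy makes much more likely than the reference does receive a negative contribution, which discourages the policy from drifting far away while chasing a high reward score.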

Let's prepare the SQuAD dataset for our PPO training.

### Step 6.1: Downloading the SQuAD Dataset

First, we need to download the SQuAD dataset, which contains diverse question-answer pairs across a range of topics. This dataset is perfect for PPO training as it provides natural, well-formed questions that users might ask an AI assistant.

The dataset should have the following structure for our PPO model training:

```text
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'query'],
        num_rows: ...
    })
    test: Dataset({
        features: ['input_ids', 'query'],
        num_rows: ...
    })
})
```

We'll download both the training and development sets to create our train and test splits.

In [None]:
# Download SQuAD dataset
!wget --no-check-certificate https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget --no-check-certificate https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

### Step 6.2: Loading the SQuAD Data Files

Now we'll load the downloaded JSON files into Python objects. SQuAD has a specific structure with topics, paragraphs, questions, and answers. We'll need to extract just the questions for our PPO prompts.
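
The nesting is easier to follow on a miniature example. The dictionary below is hand-written to imitate the topic, paragraph and question layout of the SQuAD files (the content is made up and most fields are omitted), and `list_questions` is a hypothetical helper that flattens it:

```python
mini_squad = {
    "data": [
        {
            "title": "Photon",
            "paragraphs": [
                {
                    "context": "A photon is an elementary particle.",
                    "qas": [
                        {"question": "What is a photon?", "id": "q1"},
                        {"question": "Does a photon have mass?", "id": "q2"},
                    ],
                }
            ],
        }
    ]
}


def list_questions(squad):
    """Flatten topic -> paragraph -> qas into a list of question strings."""
    return [
        qa["question"]
        for topic in squad["data"]
        for paragraph in topic["paragraphs"]
        for qa in paragraph["qas"]
    ]


print(list_questions(mini_squad))
```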

In [None]:
# Load files
with open('./train-v2.0.json') as f:
    d_train = json.load(f)
with open('./dev-v2.0.json') as f:
    d_test = json.load(f)

### Step 6.3: Processing the SQuAD Dataset

We need to extract questions from the SQuAD dataset and format them for our PPO training. The functions below will:

1. Extract questions from the nested SQuAD structure
2. Format each question with a system instruction and user message
3. Convert the structured messages into Llama's chat format
4. Package everything in a format suitable for the PPO training phase

This preprocessing transforms a QA dataset into prompts for generative responses.

In [None]:
def extract_questions(dataset):
    # Instruction placed in the system message of every prompt
    system_content = "Instruction: Please answer the user's question to the best of your knowledge. If you don't know the answer respond that you don't know."
    ret_questions = []
    for topic in dataset:
        for paragraph in topic['paragraphs']:
            for qa in paragraph['qas']:
                ret_questions.append([
                    {"role": "system", "content": system_content},
                    {"role": "user", "content": qa['question']},
                ])
    return ret_questions

# Adjusting to llama prompt template format: https://github.com/meta-llama/llama-recipes
# Note: these definitions replace the same-named helpers from Step 5.4
def encode_dialogue_turn(message):
    return f'<|start_header_id|>{message.get("role")}<|end_header_id|>{message.get("content")}<|eot_id|>'

def encode_dialogue(dialogue):
    return {'input': f'<|begin_of_text|>{functools.reduce(lambda a, b: a + encode_dialogue_turn(b), dialogue, "")}'}

def encode_dataset(dataset):
    return list(map(encode_dialogue, dataset))

In [None]:
encoded_train = encode_dataset(extract_questions(d_train['data']))
encoded_test = encode_dataset(extract_questions(d_test['data']))

### Step 6.4: Examining the Processed Data

Let's look at the first example in our processed dataset to verify it has the correct format. Each example should contain an `input` field with the formatted prompt in Llama's chat template format. This is what the model will use to generate responses during PPO training.

In [None]:
encoded_train[0]

### Step 6.5: Creating the PPO Dataset Structure

To make our dataset compatible with the PPO training workflow, we'll organize it into a HuggingFace `DatasetDict` with train and test splits. This structure allows for easy batch processing and evaluation during training.

In [None]:
# Create DatasetDict
dataset_dict = DatasetDict({
    "train": Dataset.from_list(encoded_train),
    "test": Dataset.from_list(encoded_test)
})
dataset_dict

### Step 6.6: Tokenizing the PPO Training Dataset

Now we need to tokenize our prompts for the PPO training process. For this training phase, we'll:

1. Truncate each prompt to a randomly sampled length between `input_min_text_length` and `input_max_text_length` (1 and 2048 tokens here) to manage memory usage
2. Encode each example as token IDs
3. Create a query field that contains the decoded input for debugging
4. Format everything as PyTorch tensors for efficient GPU training

This step ensures our data is ready for efficient processing during the PPO training loop.

In [None]:
# Restrict training context size (due to memory limitations, can be adjusted)
input_min_text_length = 1
input_max_text_length = 2048

def create_and_prepare_dataset(tokenizer, dataset):
    
    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(example):
        text_size = input_size()
        example["input_ids"] = tokenizer.encode(example["input"])[:text_size]
        example["query"] = tokenizer.decode(example["input_ids"])
        return example

    dataset = dataset.map(tokenize, batched=False)
        
    dataset.set_format("torch")
    return dataset


tokenized_dataset_squad = create_and_prepare_dataset(tokenizer, dataset_dict).remove_columns(["input"])
tokenized_dataset_squad

In [None]:
tokenized_dataset_squad['train'][0]
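
Because the truncation length is drawn at random for every example, a small share of prompts will be cut short even though the upper bound is generous. The sketch below imitates this with the standard library, assuming the sampler draws uniformly from the half-open range between the two bounds; the helper name and the 40-token stand-in prompt are illustrative:

```python
import random


def truncate_to_sampled_length(token_ids, min_length=1, max_length=2048, rng=random):
    """Keep a prefix whose length is drawn uniformly from [min_length, max_length)."""
    sampled = rng.randrange(min_length, max_length)
    return token_ids[:sampled]


rng = random.Random(0)
prompt_ids = list(range(40))  # stand-in for a 40-token prompt
lengths = [len(truncate_to_sampled_length(prompt_ids, rng=rng)) for _ in range(1000)]
print(f"prompts cut short: {sum(length < 40 for length in lengths)} of 1000")
```

If truncated prompts are undesirable, raising `input_min_text_length` above the longest prompt length would avoid them.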

### Step 6.7: Saving the PPO Dataset

Finally, we'll save our processed PPO dataset to Amazon S3 for use in the training phase. This makes the dataset:

1. Persistently available for the PPO training job
2. Accessible for distributed training
3. Properly organized in our project structure

As in Step 5.7, we first verify access to the bucket and then write the dataset to its `s3://` path.

In [None]:
import boto3
import s3fs

# Define the S3 path
s3_bucket = bucket_name
dataset_path_squad = f's3://{s3_bucket}/experiments-squad/train-test-contextwindow-padding-2048'

# Verify S3 bucket permissions first
s3_client = boto3.client('s3')
try:
    s3_client.head_bucket(Bucket=s3_bucket)
    print(f"Successfully connected to bucket: {s3_bucket}")
except Exception as e:
    print(f"Error connecting to bucket: {e}")
    raise


# Save the dataset to S3. `save_to_disk` resolves the s3:// URL through fsspec,
# which requires the s3fs package to be installed.
try:
    tokenized_dataset_squad.save_to_disk(dataset_path_squad)
    print(f"Successfully uploaded dataset to: {dataset_path_squad}")
except Exception as e:
    print(f"Error uploading to S3: {e}")
    raise

## Step 7: RLHF Training

Now that we have prepared our datasets, we'll implement the two-phase RLHF training process (steps 2 and 4 in the illustration below):

<img src="./images/MultiAdapterPPOprocess.png" width="50%" alt='MultiAdapterPPOprocess.png'/> 

1. **Step 2: Reward Model Training**: Train a model to learn from human preferences
2. **Step 4: PPO Training**: Use the reward model to guide policy optimization of our LLM

Each phase requires careful configuration of hyperparameters and training infrastructure. We'll use parameter-efficient fine-tuning (PEFT) with LoRA adapters to make the process computationally efficient. 

In [None]:
import os
# Set path to config file for remote decorator
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

In [None]:
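# Confirm that a token is set without echoing the secret into the notebook output
print(f"HF token set: {bool(hf_token)}")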

### Step 7.1: Understanding the Reward Model Training Function

Now we'll implement the reward model training process using parameter-efficient fine-tuning (PEFT) with LoRA adapters. Let's break down the `train_fn` function:

#### Key Components:
1. **Model Quantization**: We use 4-bit quantization to reduce memory requirements
2. **LoRA Configuration**: We attach adapters to the quantized linear layers, so only a small fraction of the parameters is trained
3. **Gradient Checkpointing**: Trades computation for memory to handle larger batch sizes
4. **Remote Training**: The `@remote` decorator sends the job to a SageMaker training instance

#### Training Parameters:
- `lora_r`: Rank of the low-rank update matrices (higher = more capacity but more parameters)
- `lora_alpha`: Scaling factor for the LoRA updates 
- `gradient_accumulation_steps`: Number of forward/backward passes before updating weights
- `learning_rate`: Controls how quickly the model adapts to the reward task
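
The arithmetic behind `lora_r` and `lora_alpha` can be sketched in a few lines. For a linear layer, LoRA adds two low-rank matrices, and their product is scaled by `lora_alpha / lora_r` (the default scaling, assuming no rank-stabilised variant is enabled). The 2048 x 2048 layer size below is an illustrative assumption, and the helper names are hypothetical:

```python
def lora_param_count(in_features, out_features, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x in, B is out x r."""
    return r * (in_features + out_features)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Illustrative 2048 x 2048 projection with the defaults used by train_fn (r=8, alpha=32)
full = 2048 * 2048
added = lora_param_count(2048, 2048, r=8)
print(f"full: {full}, LoRA: {added}, ratio: {added / full:.4%}, scaling: {lora_scaling(32, 8)}")
```

Doubling `lora_r` doubles the number of trainable parameters per layer and, with `lora_alpha` unchanged, halves the scaling factor.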

The reward model will learn to predict which responses humans would prefer, providing the "feedback" signal for our PPO training.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
    
def find_all_linear_names(hf_model):
    lora_module_names = set()
    for name, module in hf_model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)   

# Start training with remote decorator (https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html). Additional job config is being pulled in from config.yaml. 
@remote(volume_size=100, job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}-reward", use_torchrun=True, nproc_per_node=1, include_local_workdir=True,
       #keep_alive_period_in_seconds=3600
        )
def train_fn(
        model_name,
        train_ds,
        lora_r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=1,
        learning_rate=2e-4,
        num_train_epochs=1,
        fsdp="",
        fsdp_config=None,
        chunk_size=10000,
        gradient_checkpointing=False,
        merge_weights=False,
        seed=42,
        token=None,
        model_hub_repo_id=None,
        range_train=None,
        range_eval=None
):

    set_seed(seed)

    # Initialize Accelerator object handling distributed training
    accelerator = Accelerator()

    # Login to HuggingFace
    if token is not None:
        login(token)

    # Load tokenizer. Padding side is "left" because focus needs to be on completion
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side = "left")

    # Set tokenizer's pad Token
    tokenizer.pad_token = tokenizer.eos_token 
    tokenizer.pad_token_id = tokenizer.eos_token_id 

    # Load data from S3
    s3 = s3fs.S3FileSystem()
    dataset = load_from_disk(train_ds)  
    
    
    # Allow for partial dataset training
    if range_train:
        train_dataset = dataset["train"].select(range(range_train))
    else: 
        train_dataset = dataset["train"]
  
    if range_eval:
        eval_dataset = dataset["test"].select(range(range_eval))
    else:
        eval_dataset = dataset["test"]

    # Specify quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
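        # Storage dtype of the quantized weights; the BitsAndBytesConfig keyword is bnb_4bit_quant_storage
        bnb_4bit_quant_storage=torch.bfloat16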
    )
    
    # Load model with classification head for reward
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        #num_labels=1,
        trust_remote_code=True,
        quantization_config=bnb_config,
        #attn_implementation="flash_attention_2",
        use_cache=False if gradient_checkpointing else True,
        cache_dir="/tmp/.cache"
    )
    
    # Pre-LoRA trainable parameters
    print_trainable_parameters(model)     
    
    # Set model pad token id
    model.config.pad_token_id = tokenizer.pad_token_id
    
    # Prepare model for quantized training
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # Get lora target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} LoRA target modules: {modules}")
    
    # Specify LoRA config
    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="SEQ_CLS"
    )
    
    # Make sure to not train for CLM
    if config.task_type != "SEQ_CLS":
        warnings.warn(
            "You are using a `task_type` that is different than `SEQ_CLS` for PEFT. This will lead to silent bugs"
            " Make sure to pass --lora_task_type SEQ_CLS when using this script."
        )
    
    # Create PeftModel
    model = get_peft_model(model, config)
    
    # Post-LoRA trainable parameters
    print_trainable_parameters(model)     
    
    # Specify training config
    reward_config = RewardConfig(
                        per_device_train_batch_size=per_device_train_batch_size,
                        per_device_eval_batch_size=per_device_eval_batch_size,
                        gradient_accumulation_steps=gradient_accumulation_steps,
                        gradient_checkpointing=gradient_checkpointing,
                        logging_strategy="steps",
                        logging_steps=100,
                        log_on_each_node=False,
                        num_train_epochs=num_train_epochs,
                        learning_rate=learning_rate,
                        bf16=True,
                        ddp_find_unused_parameters=False,
                        fsdp=fsdp,
                        fsdp_config=fsdp_config,
                        save_strategy="no",
                        output_dir="outputs",
                        max_length=512, 
                        remove_unused_columns=False,
                        gradient_checkpointing_kwargs = {"use_reentrant": False}
                        )
    
    # Initialize RewardTrainer object handling training
    trainer = RewardTrainer(
        model=model,
        tokenizer=tokenizer,
        args=reward_config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

    trainer.train()

    
    trainer.model.save_pretrained("/opt/ml/model", safe_serialization=True)
    
    if model_hub_repo_id is not None:
        trainer.model.push_to_hub(repo_id=model_hub_repo_id)

    with accelerator.main_process_first():
        tokenizer.save_pretrained("/opt/ml/model")
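
To make the training objective more concrete: TRL's `RewardTrainer` fits the reward model with a pairwise preference loss, `-log(sigmoid(r_chosen - r_rejected))`, so the adapter is pushed to score the preferred response higher than the rejected one. The sketch below reproduces that formula in plain Python for a single pair. The function name `pairwise_preference_loss` and the example scores are illustrative assumptions, not part of the workshop code.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one preference pair."""
    diff = reward_chosen - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-diff))
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical reward scores for illustration
print(pairwise_preference_loss(2.0, 0.0))  # small loss: chosen response scored higher
print(pairwise_preference_loss(0.0, 0.0))  # log(2): the model cannot tell the pair apart
print(pairwise_preference_loss(0.0, 2.0))  # large loss: rejected response scored higher
```

The loss only depends on the difference between the two scores, which is why the absolute scale of the reward is not fixed by this objective.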

## Step 7.2: Setting Up the Reward Model Repository

Before launching the training job, we need to set up a repository on Hugging Face Hub to store our trained reward model adapter. This repository will:

1. Serve as a permanent storage location for our trained adapter weights
2. Make the reward model easily accessible for the subsequent PPO training phase
3. Allow for version control and sharing of the model

You can name the model repository `***HF_ALIAS***/llama-32-hhrlhf-reward-adapter`, where `HF_ALIAS` is your HuggingFace username. In case you haven't set up a model repository yet, please check the [HuggingFace documentation](https://huggingface.co/docs/hub/en/repositories-getting-started). 

<img src="./images/CreateHFRepo.png" width="30%" alt='CreateHFRepo.png'/> 

In a production environment, you might want to create a dedicated organization for your models rather than using a personal account. Make sure your Hugging Face token has write access to create and update repositories.

Note: Please make sure to replace the placeholder below with your `HF_ALIAS`.

In [None]:
# Replace this with your HF alias
rm_adapter_hub_repo_id = "***HF_ALIAS***/llama-32-hhrlhf-reward-adapter"


## Step 7.3: Launching Reward Model Training

Now we're ready to start the reward model training job. This cell launches a SageMaker training job through the `remote` decorator. 

### Understanding Key Parameters:
- `model_id`: The base model we're fine-tuning (e.g. `meta-llama/Llama-3.2-1B-Instruct`)
- `dataset_path_hhrlhf`: S3 path to our processed human preference dataset
- `per_device_*_batch_size`: How many examples to process at once on each GPU
- `gradient_accumulation_steps`: Accumulates gradients before weight updates to simulate larger batch sizes
- `gradient_checkpointing`: Memory optimization technique (trades compute for memory)
- `range_train/range_eval`: For workshop purposes, we're limiting the dataset size to complete training faster
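
As a quick worked example of how `per_device_train_batch_size` and `gradient_accumulation_steps` interact, the sketch below computes the effective batch size and a rough number of optimizer updates for the values used in the training call of this step. It assumes a single GPU, and the helper names are hypothetical. The step count is only an estimate, since the trainer's exact count can differ slightly depending on how it handles the last partial batch.

```python
import math


def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_optimizer_steps(num_examples: int,
                           per_device_batch_size: int,
                           gradient_accumulation_steps: int,
                           num_devices: int = 1) -> int:
    """Rough number of optimizer updates for one pass over the data."""
    batch = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices)
    return math.ceil(num_examples / batch)


# Values from the training call in this step, assuming one GPU
print(effective_batch_size(8, 2))         # 16
print(approx_optimizer_steps(100, 8, 2))  # 7
```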

For a production model, you would want to:
1. Use the full dataset instead of a small range
2. Train for multiple epochs 
3. Optimize hyperparameters like learning rate and batch size

When you run this cell, the training job is dispatched to a remote instance. Expect it to take roughly 10 to 30 minutes, depending on instance type and dataset size.

In [None]:
# Start training job
train_fn(
    model_id,
    train_ds=dataset_path_hhrlhf,  # Use S3 path instead of in-memory dataset
    #train_ds=tokenized_dataset_hhrlhf,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    num_train_epochs=1,
    token=hf_token,
    model_hub_repo_id=rm_adapter_hub_repo_id,
    range_train=100,
    range_eval=10
)

## Step 8: Preparing for PPO Training

Now that we've trained our reward model, we're ready to move on to the Proximal Policy Optimization (PPO) phase of RLHF. This is where the model learns to generate responses that maximize the reward predicted by our reward model.

### Understanding Multi-Adapter PPO

In our implementation, we're using a multi-adapter approach for PPO. This means:

1. **One Base Model**: The foundation model (Meta Llama 3.1 8B Instruct / Meta Llama 3.2 1B Instruct)
2. **Two Adapter Sets**:
   - **Policy Adapter**: The model being optimized through RL
   - **Reward Adapter**: Our trained model that evaluates response quality

This approach is efficient because:
- We only need to load one copy of the large base model in memory
- Multiple adapters can be attached to the same base model
- Each adapter only adds a small number of trainable parameters
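
To put the last point into numbers, the following sketch counts the parameters a LoRA adapter adds to a single linear layer. LoRA replaces the update of a `d_out x d_in` weight matrix with two low-rank factors of shapes `rank x d_in` and `d_out x rank`. The layer size of 4096 is an illustrative assumption, while `rank=16` matches the `r=16` used in the LoRA config of the PPO training function.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters of the frozen dense weight matrix (bias ignored)."""
    return d_in * d_out


# Illustrative layer size (assumption), rank as in the workshop's LoRA config
adapter = lora_param_count(4096, 4096, rank=16)
full = full_param_count(4096, 4096)

print(adapter)                         # 131072
print(full)                            # 16777216
print(f"{100 * adapter / full:.2f}%")  # 0.78%
```

Because each adapter is this small relative to the frozen weights, keeping a policy adapter and a reward adapter next to one shared base model costs far less memory than loading two full models.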

<img src="./images/MultiAdapterPPO.png" width="30%" alt='MultiAdapterPPO.png'/> 

The following function implements the PPO training loop and computes the rewards within the same model instance.

In [None]:
import os
# Set path to config file for remote decorator
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

In [None]:
import boto3

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

# Start training with remote decorator (https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html). Additional job config is being pulled in from config.yaml. 
@remote(volume_size=100, job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}-multi-adapter-ppo", use_torchrun=True, nproc_per_node=1
       #keep_alive_period_in_seconds=3600
       )
def train_fn(
        model_name,
        train_ds,
        rm_adapter,
        s3_output_path,
        log_with=None,
        use_safetensors=None,
        use_score_scaling=False,
        use_score_norm=False,
        score_clip=None,
        seed=42,
        token=None,
        model_hub_repo_id=None,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=2,
        gradient_checkpointing=True,
        num_train_epochs=1,
        merge_weights=True,
        range_train=None,
        ):

    set_seed(seed)

    # Initialize Accelerator object handling distributed training
    accelerator = Accelerator()
    
    # Login to HuggingFace 
    if token is not None:
        login(token)
        
    # Load tokenizer. Padding side is "left" because focus needs to be on completion
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')

    # Set tokenizer's pad Token
    tokenizer.pad_token = tokenizer.eos_token 
    tokenizer.pad_token_id = tokenizer.eos_token_id  
    
    
    # Load data from S3
    dataset = load_from_disk(train_ds)
    
    
    # Allow for partial dataset training
    if range_train:
        train_dataset = dataset["train"].select(range(range_train))
    else: 
        train_dataset = dataset["train"]
    
    # Specify LoRA config
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    
    # Specify quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16
    )
    
    # Load model
    model = AutoModelForCausalLMWithValueHead.from_pretrained(
        model_name,
        #device_map='auto',
        peft_config=lora_config,
        quantization_config=bnb_config,
        reward_adapter=rm_adapter,
        use_safetensors=use_safetensors,
        #attn_implementation="flash_attention_2",
    )
    
    # Set model pad token id
    model.config.pad_token_id = tokenizer.pad_token_id

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()
        
    # Trainable parameters
    print_trainable_parameters(model)    

    def collator(data):
        return {key: [d[key] for d in data] for key in data[0]}

    # Specify PPO training config
    # Pass the function arguments through instead of hard-coding their defaults
    config = PPOConfig(
        model_name,
        log_with=log_with,
        learning_rate=1e-5,
        batch_size=per_device_train_batch_size,
        mini_batch_size=1,
        gradient_accumulation_steps=gradient_accumulation_steps,
        optimize_cuda_cache=True,
        seed=seed,
        use_score_scaling=use_score_scaling,
        use_score_norm=use_score_norm,
        score_clip=score_clip,
    )

    # Initialize PPOTrainer object handling training
    ppo_trainer = PPOTrainer(
        config,
        model,
        ref_model=None,
        tokenizer=tokenizer,
        dataset=train_dataset,
        data_collator=collator,
    )

    # Specifying inference params
    generation_kwargs = {
        #"top_k": 0.0,
        "top_p": 0.9,
        "do_sample": True,
        "pad_token_id": tokenizer.pad_token_id,
        "max_new_tokens": 32,
    }
    
    step = 0

    for _epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
        
        question_tensors = batch["input_ids"]
        
        # Inference through model being fine-tuned
        response_tensors = ppo_trainer.generate(
            question_tensors,
            return_prompt=False,
            **generation_kwargs,
        )
        
        # Decode response
        batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
        
        # Concat query and response
        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
        
        # Tokenize query - response pair
        inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(ppo_trainer.accelerator.device)
        
        # Compute reward score
        raw_rewards = ppo_trainer.accelerator.unwrap_model(ppo_trainer.model).compute_reward_score(**inputs)
        rewards = [raw_rewards[i, -1, 1] for i in range(len(raw_rewards))]  # take last token

        # Run PPO step
        stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
        
        step = step + 1      


    if merge_weights:

        if accelerator.is_main_process:
            
            output_dir = "/tmp/model"
    

            ppo_trainer.save_pretrained(output_dir, safe_serialization=True)

       
            # clear memory
            del model
            del ppo_trainer

            torch.cuda.empty_cache()

            # load PEFT model
            model = AutoPeftModelForCausalLM.from_pretrained(
                output_dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                use_cache=True,
                cache_dir="/tmp/.cache",
            )

            # Merge LoRA and base model and save
            model = model.merge_and_unload()
            model.save_pretrained(
                os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
                safe_serialization=True,
                max_shard_size="2GB"
            )
            if model_hub_repo_id is not None:
                model.push_to_hub(repo_id=model_hub_repo_id)

            tokenizer.save_pretrained(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))

            if model_hub_repo_id is not None:
                tokenizer.push_to_hub(repo_id=model_hub_repo_id)

        accelerator.wait_for_everyone()

    else:
        if accelerator.is_main_process:
            
            ppo_trainer.model.module.save_pretrained(
                os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
                safe_serialization=True
            )
    
            if model_hub_repo_id is not None:
                ppo_trainer.push_to_hub(repo_id=model_hub_repo_id)
    
    
            tokenizer.save_pretrained(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    
            if model_hub_repo_id is not None:
                tokenizer.push_to_hub(repo_id=model_hub_repo_id)


        accelerator.wait_for_everyone()

    if accelerator.is_main_process:
        # Upload the model files to S3
       
        
        # SM_MODEL_DIR is set automatically by SageMaker inside the training container;
        # s3_output_path is the destination passed in by the caller
        if os.environ.get("SM_MODEL_DIR") and s3_output_path:
            model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
            
            print(f"Uploading model from {model_dir} to {s3_output_path}")
            
            # Initialize S3 client
            s3_client = boto3.client('s3')
            
            # Extract bucket name and prefix from S3 URI
            s3_uri_parts = s3_output_path.replace("s3://", "").split("/")
            bucket_name = s3_uri_parts[0]
            prefix = "/".join(s3_uri_parts[1:]) if len(s3_uri_parts) > 1 else ""
            
            # Walk through all files in the model directory and upload them
            for root, dirs, files in os.walk(model_dir):
                for file in files:
                    local_path = os.path.join(root, file)
                    # Create relative path to maintain directory structure
                    relative_path = os.path.relpath(local_path, model_dir)
                    s3_key = os.path.join(prefix, relative_path)
                    
                    print(f"Uploading {local_path} to s3://{bucket_name}/{s3_key}")
                    try:
                        s3_client.upload_file(local_path, bucket_name, s3_key)
                    except Exception as e:
                        print(f"Failed to upload {local_path} to S3: {e}")
            
            print("Model upload to S3 completed")

    # Wait for all processes to complete
    accelerator.wait_for_everyone()
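
The upload step in the function above splits the S3 URI by hand and builds object keys with `os.path.join`. That works on the Linux training container, but a slightly more defensive variant is sketched below. The helper names `split_s3_uri` and `build_s3_key` are hypothetical and not part of the workshop code; only the standard library is used.

```python
import posixpath
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> tuple:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def build_s3_key(prefix: str, relative_path: str) -> str:
    """Join prefix and relative path with forward slashes, as S3 keys expect."""
    parts = [p for p in (prefix, relative_path.replace("\\", "/")) if p]
    return posixpath.join(*parts)


# Hypothetical bucket and prefix for illustration
bucket, prefix = split_s3_uri("s3://my-bucket/models/ppo/")
print(bucket, prefix)
print(build_s3_key(prefix, "config.json"))
```

Rejecting malformed URIs early avoids uploading files to an unintended bucket or key prefix.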

## Step 8.1: Creating the HuggingFace Repository for the RLHF Policy Model

Similar to what we did for the reward model, we need to set up a repository to store our final RLHF-trained policy model. This time we use `***HF_ALIAS***/llama-31-hhrlhf-squad-rlhf-policy-model`, a name that indicates this is the full policy model rather than just an adapter.

For a production workflow, consider:
1. Using descriptive naming conventions that indicate model version and training approach
2. Setting up model cards on HuggingFace that document the model's capabilities and limitations
3. Implementing proper access control for sensitive or proprietary models

Note: Please make sure to replace the placeholder below with your `HF_ALIAS`.

In [None]:
model_hub_repo_id = "***HF_ALIAS***/llama-31-hhrlhf-squad-rlhf-policy-model"

## Step 8.2: Setting Up the Hugging Face Authentication

Similar to the reward model training step, we need to provide our Hugging Face token to authenticate with the Hub. This token will be used to push our trained model to the repository we specified above.

> Note: Outside of a workshop setting, avoid displaying your token in notebook output. Consider using environment variables or a secrets manager for sensitive credentials.

In [None]:
hf_token
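
One way to follow the note above is to read the token from an environment variable and only ever print a masked version. This is a minimal sketch: the helper names are hypothetical, and it assumes the token has been exported as `HF_TOKEN` before the notebook kernel was started.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face token from the environment, failing loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token


def mask_secret(secret: str, visible: int = 4) -> str:
    """Return the secret with everything after the first `visible` characters hidden."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)


# Made-up token value, used only to demonstrate the masking
print(mask_secret("hf_exampletoken"))

# In the notebook you could then write (assumes HF_TOKEN is set):
# hf_token = load_hf_token()
# print(mask_secret(hf_token))
```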

## Step 8.3: Launching the PPO Training Job

Now we'll launch the PPO training job to fine-tune our model using the reward model we trained earlier. This step is where the actual reinforcement learning happens.

### PPO Training Process:
1. The model generates responses to prompts from the SQuAD dataset
2. Each response is scored by the reward model
3. The PPO algorithm uses these scores to update the policy model
4. This process iterates to maximize the expected reward
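
The policy update in step 3 is built around PPO's clipped surrogate objective, which limits how far a single update can move the policy away from the one that generated the responses. The sketch below evaluates that objective for one sample in plain Python. It is a simplified illustration under stated assumptions, not the trainer's implementation: real trainers add further terms such as a value loss and a KL penalty, and `clip_range=0.2` is just a commonly used value.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """PPO clipped objective for one sample.

    ratio:     new_policy_prob / old_policy_prob for the sampled action
    advantage: how much better the action was than expected
    """
    unclipped = ratio * advantage
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    clipped = clipped_ratio * advantage
    # Taking the minimum makes the objective a pessimistic bound
    return min(unclipped, clipped)


# Good action whose probability already rose by 50%: the gain is capped
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Bad action whose probability was halved: the objective stays at the clipped value
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
# Unchanged policy: objective equals the advantage
print(clipped_surrogate(ratio=1.0, advantage=2.0))
```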

### Key Parameters:
- `model_id`: The base model we're fine-tuning
- `train_ds`: The dataset of prompts for PPO training
- `rm_adapter`: The path to our trained reward model
- `merge_weights`: If True, merges the adapters with the base model for deployment
- `range_train`: Number of examples to use for training (limited for workshop)

This training job will take 15-45 minutes depending on the compute instance and dataset size. For a production model, you would want to use the full dataset and tune hyperparameters carefully.

While training, the model gradually learns to generate responses that the reward model will score highly, effectively aligning with human preferences encoded in the reward model.

In [None]:
train_fn(
    model_id,
    train_ds=dataset_path_squad,  # Use S3 path instead of in-memory dataset
    #train_ds=tokenized_dataset_squad,
    s3_output_path=s3_output_path,
    rm_adapter=rm_adapter_hub_repo_id,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    num_train_epochs=1,
    token=hf_token,
    model_hub_repo_id=model_hub_repo_id,
    range_train=50,
    merge_weights=True
)

## Step 9: Deploying the Model to Amazon Bedrock

After training our RLHF policy model, we want to make it available for inference. Amazon Bedrock provides a serverless endpoint for model deployment, allowing us to serve the model without managing infrastructure.

The Custom Model Import (CMI) feature of Amazon Bedrock lets us import our fine-tuned model and access it through the same API as other foundation models. This provides:

1. Scalable, serverless infrastructure
2. Built-in security and compliance features
3. Easy integration with other AWS services
4. Cost-effective pay-per-use pricing

## Step 9.1: Creating the Bedrock Model Import Job

First, we need to set up an IAM role that gives Bedrock permission to access our model files in S3. Once we've set up the necessary permissions, we can create a model import job in Amazon Bedrock. This process:

1. Takes our fine-tuned model from S3
2. Optimizes it for the Bedrock infrastructure
3. Makes it available through the Bedrock API

We'll specify:
- A unique name for our imported model
- The IAM role we created for Bedrock
- The S3 path where our trained model is stored

The import process will take some time (typically ~5 minutes) as Bedrock prepares your model for deployment.
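
Before creating the import job, it can be worth checking that the S3 location actually contains a complete model. The sketch below checks a list of object keys for the files the training function saves in Hugging Face format (`save_pretrained` with `safe_serialization=True` plus the tokenizer). The list of expected files is an assumption made for this illustration, and the helper name is hypothetical; consult the Amazon Bedrock documentation for the authoritative requirements.

```python
# Assumption: files we expect next to the safetensors weight shards
EXPECTED_FILES = ("config.json", "tokenizer.json", "tokenizer_config.json")


def missing_import_files(object_keys: list) -> list:
    """Return the expected files that are absent from a list of S3 object keys."""
    names = {key.rsplit("/", 1)[-1] for key in object_keys}
    missing = [name for name in EXPECTED_FILES if name not in names]
    if not any(name.endswith(".safetensors") for name in names):
        missing.append("*.safetensors")
    return missing


# Hypothetical listing for illustration
keys = [
    "models/ppo/config.json",
    "models/ppo/tokenizer.json",
    "models/ppo/tokenizer_config.json",
    "models/ppo/model-00001-of-00002.safetensors",
]
print(missing_import_files(keys))             # []
print(missing_import_files(["config.json"]))  # ['tokenizer.json', 'tokenizer_config.json', '*.safetensors']
```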

In [None]:
import json
import time

from botocore.exceptions import ClientError

# Initialize IAM client
iam = boto3.client('iam')

# Role name - consider making this unique if you create multiple roles
role_name = "BedrockCustomModelImportRole"

# Define trust policy to allow Bedrock to assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

# Define permissions policy for Bedrock model import
bedrock_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::*/*",
                "arn:aws:s3:::*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelCustomizationJob",
                "bedrock:CreateModelImportJob",
                "bedrock:GetModelCustomizationJob",
                "bedrock:GetModelImportJob",
                "bedrock:StopModelCustomizationJob",
                "bedrock:StopModelImportJob"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:namespace": "AWS/Bedrock"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:/aws/bedrock/*"
        }
    ]
}

# Create or update the role
try:
    # Check if role exists
    try:
        iam.get_role(RoleName=role_name)
        print(f"Role {role_name} already exists. Updating policies...")
        # Detach any attached managed policies to ensure a clean slate
        for policy in iam.list_attached_role_policies(RoleName=role_name)['AttachedPolicies']:
            iam.detach_role_policy(RoleName=role_name, PolicyArn=policy['PolicyArn'])
    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchEntity':
            # Create role with trust policy if it doesn't exist
            print(f"Creating new role: {role_name}")
            iam.create_role(
                RoleName=role_name,
                AssumeRolePolicyDocument=json.dumps(trust_policy),
                Description="Role for Amazon Bedrock Model Import operations"
            )
        else:
            raise e
    
    # Create policy for Bedrock operations
    policy_name = f"{role_name}Policy"
    
    # Check if policy exists and delete if it does
    try:
        existing_policy = iam.get_policy(PolicyArn=f"arn:aws:iam::{boto3.client('sts').get_caller_identity()['Account']}:policy/{policy_name}")
        
        # Detach policy if attached to our role
        try:
            iam.detach_role_policy(
                RoleName=role_name,
                PolicyArn=existing_policy['Policy']['Arn']
            )
        except ClientError:
            pass  # Policy may not be attached
        
        # Delete existing versions (except default)
        policy_versions = iam.list_policy_versions(PolicyArn=existing_policy['Policy']['Arn'])['Versions']
        for version in policy_versions:
            if not version['IsDefaultVersion']:
                iam.delete_policy_version(
                    PolicyArn=existing_policy['Policy']['Arn'],
                    VersionId=version['VersionId']
                )
        
        # Create new version and set as default
        iam.create_policy_version(
            PolicyArn=existing_policy['Policy']['Arn'],
            PolicyDocument=json.dumps(bedrock_policy_document),
            SetAsDefault=True
        )
        policy_arn = existing_policy['Policy']['Arn']
        
    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchEntity':
            # Create new policy
            response = iam.create_policy(
                PolicyName=policy_name,
                PolicyDocument=json.dumps(bedrock_policy_document),
                Description="Permissions for Amazon Bedrock Model Import operations"
            )
            policy_arn = response['Policy']['Arn']
        else:
            raise e
    
    # Attach policy to role
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn=policy_arn
    )
    
    # Wait for policy to propagate
    print("Waiting for IAM role and policies to propagate...")
    time.sleep(10)  # IAM changes can take a few seconds to propagate
    
    # Get the role ARN
    role = iam.get_role(RoleName=role_name)
    role_arn = role['Role']['Arn']
    
    print(f"Successfully configured role: {role_arn}")

except Exception as e:
    print(f"Error setting up IAM role: {str(e)}")
    role_arn = None  # Set to None if there was an error

In [None]:
from sagemaker.session import Session

# Initialize the Bedrock client
bedrock = boto3.client('bedrock', region_name=Session().boto_region_name)

# Define name for imported model
imported_model_name = f'llama-31-hhrlhf-squad-rlhf-policy-model-{int(time.time())}'

# Create the model import job
response = bedrock.create_model_import_job(
    jobName=imported_model_name,
    importedModelName=imported_model_name,
    roleArn=role_arn,
    modelDataSource={
        's3DataSource': {
            's3Uri': s3_output_path
        }
    }
)

job_Arn = response['jobArn']

# Output the job ARN
print(f"Model import job created with ARN: {response['jobArn']}")

## Step 9.2: Monitoring the Import Job

After initiating the model import, we need to monitor its progress. The Bedrock API reports one of the following job states:
- InProgress: Model is being imported
- Completed: Import successful
- Failed: Import encountered errors

This cell polls the Bedrock API every 60 seconds and upper-cases the returned status before comparing it, continuing until the job reaches a terminal state (COMPLETED or FAILED). Once the job completes successfully, we'll have the model ARN which we can use for inference.

In [None]:
# Check CMI job status
while True:
    response = bedrock.get_model_import_job(jobIdentifier=job_Arn)
    status = response['status'].upper()
    print(f"Status: {status}")
    
    if status in ['COMPLETED', 'FAILED']:
        break
        
    time.sleep(60)  # Check every 60 seconds

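# Stop here if the import failed; a failed job has no usable imported model
if status == 'FAILED':
    raise RuntimeError(f"Model import failed: {response.get('failureMessage', 'no failure message returned')}")

# Get the ARN of the imported model
model_arn = response['importedModelArn']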
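
The polling cell above loops until a terminal state is reached, so it would run indefinitely if the job never gets there. A variant with an upper time limit could look like the sketch below. The helper name and its parameters are hypothetical; `fetch_status` is any callable that returns the current status string, which keeps the function independent of the AWS SDK.

```python
import time


def wait_for_terminal_status(fetch_status,
                             terminal_states=("COMPLETED", "FAILED"),
                             poll_seconds=60,
                             timeout_seconds=3600,
                             sleep=time.sleep,
                             clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal state or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status().upper()
        print(f"Status: {status}")
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still in state {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# With the clients from this notebook, a call could look like this:
# status = wait_for_terminal_status(
#     lambda: bedrock.get_model_import_job(jobIdentifier=job_Arn)["status"]
# )
```

Passing `sleep` and `clock` as parameters is a design choice that makes the helper easy to exercise without waiting in real time.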


## Step 10: Testing the Deployed Model

Now that our RLHF-fine-tuned model is deployed to Amazon Bedrock, we can invoke it for inference. We'll set up the necessary clients and functions to interact with our model through the Bedrock Runtime API.

### Inference Setup Components:
1. **Tokenizer**: To properly format inputs for the model
2. **Bedrock Runtime Client**: AWS SDK client for making inference calls
3. **Helper Function**: To handle retry logic and properly format requests

The `generate` function we're defining:
- Applies the proper chat template to user messages
- Handles retry logic for robustness
- Sets appropriate generation parameters like temperature and top-p

This setup allows us to easily test how well our RLHF training worked by sending queries to the model and evaluating its responses.

In [None]:
from transformers import AutoTokenizer
import json
import time
import boto3
from botocore.config import Config
from IPython.display import Markdown, display

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialize Bedrock Runtime client
session = boto3.Session()
client = session.client(
    service_name='bedrock-runtime',
    region_name=session.region_name,  # use the session's region, where the model was imported
    config=Config(
        connect_timeout=300,  # 5 minutes
        read_timeout=300,     # 5 minutes
        retries={'max_attempts': 3}
    )
)

In [None]:
def generate(messages, temperature=0.3, max_tokens=4096, top_p=0.9, continuation=False, max_retries=10):
    """
    Generate response using the model with proper tokenization and retry mechanism
    
    Parameters:
        messages (list): List of message dictionaries with 'role' and 'content'
        temperature (float): Controls randomness in generation (0.0-1.0)
        max_tokens (int): Maximum number of tokens to generate
        top_p (float): Nucleus sampling parameter (0.0-1.0)
        continuation (bool): Whether this is a continuation of previous generation
        max_retries (int): Maximum number of retry attempts
    
    Returns:
        dict: Model response containing generated text and metadata
    """
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, 
                                         add_generation_prompt=not continuation)
    
    attempt = 0
    while attempt < max_retries:
        try:
            response = client.invoke_model(
                modelId=model_arn,
                body=json.dumps({
                    'prompt': prompt,
                    'temperature': temperature,
                    'max_gen_len': max_tokens,
                    'top_p': top_p
                }),
                accept='application/json',
                contentType='application/json'
            )
            
            result = json.loads(response['body'].read().decode('utf-8'))
            return result
            
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {str(e)}")
            attempt += 1
            if attempt < max_retries:
                time.sleep(30)
    
    raise Exception("Failed to get response after maximum retries")
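
The `generate` function above waits a fixed 30 seconds between attempts. An alternative worth considering is exponential backoff, which retries quickly after a first transient error and backs off when failures persist. The helper below is a generic sketch of that idea with hypothetical names and default values. It wraps any zero-argument callable and is not tied to the Bedrock client.

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponentially growing delays."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {exc}. Retrying in {delay} seconds")
            sleep(delay)


# With the function from this notebook, a call could look like this:
# response = call_with_backoff(lambda: generate(messages, max_retries=1))
```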

## Step 10.1: Testing Our RLHF-Trained Model

Now let's put our model to the test with a simple prompt. This will help us evaluate whether the RLHF fine-tuning has improved the model's responses in alignment with human preferences.

When testing an RLHF model, consider the following evaluation criteria:

1. **Helpfulness**: Does the model provide useful information that addresses the user's query?
2. **Harmlessness**: Does the model avoid generating harmful, misleading, or inappropriate content?
3. **Alignment**: Does the response generally match what humans would prefer?
4. **Factuality**: Is the information accurate and well-supported?
5. **Style**: Is the response well-structured and appropriately formatted?

For a comprehensive evaluation, you would want to:
- Compare responses with the original base model
- Test a diverse set of prompts, including edge cases
- Gather human feedback on the responses
- Use quantitative metrics like ROUGE or BERTScore
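
To give a feel for what a quantitative metric such as ROUGE measures, here is a deliberately simplified ROUGE-1 F1 score based on whitespace tokens. It is a sketch for intuition only: it applies no stemming or punctuation handling, and for real evaluations an established implementation is the better choice. The function name and example sentences are illustrative.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of three words match in each direction, so precision = recall = 2/3
print(rouge1_f1("the cat sat", "the cat ran"))
```

Overlap metrics like this reward surface similarity to a reference answer, so they complement rather than replace human judgment of helpfulness.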

In this workshop, we'll do a simple test with a conversational greeting to verify the model is working correctly.

In [None]:
test_prompt = """Hi, how are you?
"""

messages = [{"role": "user", "content": test_prompt}]
response = generate(messages)
print("Model Response:")
print(response["generation"])

## Step 11: Conclusion and Next Steps

Congratulations! You've successfully:
1. Prepared datasets for both reward model and PPO training
2. Trained a reward model that can evaluate response quality based on human preferences
3. Fine-tuned an LLM using RLHF with multi-adapter PPO
4. Deployed the model to Amazon Bedrock for serverless inference

### Key Takeaways:
- RLHF is a powerful technique for aligning LLMs with human preferences
- The multi-adapter approach allows for efficient training with limited resources
- Parameter-efficient techniques like LoRA adapters make fine-tuning accessible
- Serverless deployment enables practical use without complex infrastructure

Remember that RLHF is an iterative process, and improving alignment often requires multiple rounds of training with increasingly refined preference data.

## Step 12: Resource Cleanup

To avoid incurring unnecessary costs after completing this workshop, it's important to clean up all the resources we've created. This includes removing:

1. The imported Bedrock model
2. The S3 bucket and its contents
3. IAM roles and policies created for Bedrock
4. Local downloaded files

Following proper cleanup practices ensures you don't have unexpected charges on your AWS account and maintains good security hygiene by removing permissions that are no longer needed.
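
A minimal sketch of the Bedrock and IAM part of this cleanup is shown below. It assumes the clients and variables created earlier in the notebook (`bedrock`, `iam`, `model_arn`, `role_name`, `policy_arn`), and the function name is hypothetical. S3 objects and local files are not covered here. Review what the function deletes before running it in your own account.

```python
def cleanup_bedrock_and_iam(bedrock, iam, model_arn, role_name, policy_arn):
    """Delete the imported Bedrock model, then the IAM policy and role created for it."""
    # Remove the imported model so it no longer incurs charges
    bedrock.delete_imported_model(modelIdentifier=model_arn)

    # A managed policy must be detached before it can be deleted
    iam.detach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

    # Non-default policy versions have to be removed before the policy itself
    for version in iam.list_policy_versions(PolicyArn=policy_arn)["Versions"]:
        if not version["IsDefaultVersion"]:
            iam.delete_policy_version(PolicyArn=policy_arn, VersionId=version["VersionId"])

    iam.delete_policy(PolicyArn=policy_arn)
    iam.delete_role(RoleName=role_name)


# With the objects from this notebook, a call could look like this:
# cleanup_bedrock_and_iam(bedrock, iam, model_arn, role_name, policy_arn)
```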


## Workshop Summary

You've completed the RLHF Workshop focused on fine-tuning LLMs using multi-adapter PPO. Throughout this tutorial, you've gained practical experience with:

1. **Data Preparation**: Processing human preference data for reward modeling and creating prompt datasets for PPO training

2. **Reward Modeling**: Training a model to distinguish between preferred and non-preferred responses based on human feedback

3. **RLHF Training**: Using PPO to optimize a language model policy to generate responses that align with human preferences

4. **Efficient Fine-tuning**: Leveraging parameter-efficient techniques like LoRA adapters to make training more accessible and cost-effective

5. **Model Deployment**: Creating serverless inference endpoints on Amazon Bedrock for your custom fine-tuned model

These skills form the foundation of modern LLM alignment techniques used by leading AI labs and companies to create helpful, harmless, and honest AI assistants. By continuing to experiment with different datasets, model sizes, and training configurations, you'll develop deeper expertise in RLHF and contribute to advancing responsible AI development.

Thank you for participating in this workshop!