# Reinforcement Fine-Tuning Amazon Nova 2.0 Lite with PandaLM

This notebook walks through training an Amazon Nova model using Reinforcement Fine-Tuning (RFT) on the [PandaLM](https://github.com/WeOpenML/PandaLM) evaluation dataset.

## What's RFT?

Traditional fine-tuning shows a model examples and says "produce outputs like this." RFT takes a different approach: it lets the model generate its own responses, then uses a reward signal to reinforce good outputs and discourage bad ones.
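
To make the idea concrete, here is a toy sketch of the sample, score, and reinforce loop using a plain REINFORCE update on three candidate "responses". This is only an illustration of the concept; it is not how Bedrock implements RFT, and the function names, reward values, and learning rate are all made up for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_toy_policy(rewards, steps=200, lr=0.5, seed=0):
    """Sample a response, score it, and reinforce it in proportion to its reward."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # The "model" generates its own response by sampling from its policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # REINFORCE: gradient of log pi(action) w.r.t. each logit, scaled by reward.
        for k in range(len(logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad
    return softmax(logits)

# Hypothetical rewards: only the third response is "good".
final_probs = train_toy_policy(rewards=[0.0, 0.0, 1.0])
print([round(p, 3) for p in final_probs])
```

After training, most of the probability mass sits on the rewarded response, which is the behavior RFT aims for at much larger scale.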

## What's PandaLM?

PandaLM is a dataset designed to train and evaluate LLM-as-a-judge models. Each example contains:

- **Instruction**: A task description
- **Input**: Optional context for the task
- **Response 1 & 2**: Two model responses to compare
- **Human annotations**: Which response is better (1, 2, or tie)

The goal is to train a model that can accurately evaluate and compare LLM outputs, in effect creating an AI judge for RLAIF (Reinforcement Learning from AI Feedback).

> *Example:*
> 
> **Instruction:** Rewrite the sentence to make it clearer and more concise.
> 
> **Input:** "If you have any questions about my rate or if you find it necessary to increase or decrease the scope..."
> 
> **Response 1:** "If you have any questions about my rate, please let me know."
> 
> **Response 2:** "If you have any questions, please let me know."
> 
> **Label:** Response 2 is better (more concise)

## What we'll build

1. Prepare PandaLM data in the format Bedrock RFT expects
2. Deploy a Lambda function that uses LLM-as-judge to score model evaluations
3. Test the reward function on sample data
4. Kick off an RFT training job on Amazon Bedrock
5. Monitor the job until completion

By the end, you'll have a Nova model that's better at evaluating LLM outputs.


## Prerequisites: SageMaker Role Permissions

**NOTE:** If you run this notebook with an AWS profile that has administrator access, you can skip this section. Otherwise, your SageMaker execution role needs the following IAM permissions:

| Service | Actions | Resources | Why |
|---------|---------|-----------|-----|
| **S3** | `PutObject`, `GetObject`, `ListBucket`, `DeleteObject` | `arn:aws:s3:::YOUR-BUCKET/*` and `arn:aws:s3:::YOUR-BUCKET` | Upload/download training data |
| **IAM** | `CreateRole`, `GetRole`, `AttachRolePolicy`, `PutRolePolicy`, **`PassRole`** | `arn:aws:iam::ACCOUNT:role/PANDALM-Lambda-Role`, `arn:aws:iam::ACCOUNT:role/BedrockRFT-pandalm-Role` | Create Lambda & Bedrock roles |
| **Lambda** | `CreateFunction`, `GetFunction`, `UpdateFunctionCode`, `InvokeFunction` | `arn:aws:lambda:REGION:ACCOUNT:function:pandalm-reward-function` | Deploy reward function |
| **Bedrock** | `CreateModelCustomizationJob`, `GetModelCustomizationJob`, `InvokeModel` | `*` | Start/monitor training, LLM-as-judge |
| **STS** | `GetCallerIdentity` | `*` | Get account info |

**Critical**: The Lambda function's execution role (created in section 2) needs `bedrock:InvokeModel` permission so the function can call the LLM judge.
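
If you prefer to attach these permissions as a single policy, the table above can be written out as a policy document. The sketch below only builds and prints the JSON; the account ID, region, and bucket name are placeholders you would replace, and you may want to tighten the `*` resources for your environment.

```python
import json

ACCOUNT = "123456789012"   # placeholder account ID
REGION = "us-east-1"       # placeholder region
BUCKET = "YOUR-BUCKET"     # placeholder bucket name

# One statement per row of the permissions table.
notebook_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
         "Resource": [f"arn:aws:s3:::{BUCKET}/*", f"arn:aws:s3:::{BUCKET}"]},
        {"Effect": "Allow",
         "Action": ["iam:CreateRole", "iam:GetRole", "iam:AttachRolePolicy",
                    "iam:PutRolePolicy", "iam:PassRole"],
         "Resource": [f"arn:aws:iam::{ACCOUNT}:role/PANDALM-Lambda-Role",
                      f"arn:aws:iam::{ACCOUNT}:role/BedrockRFT-pandalm-Role"]},
        {"Effect": "Allow",
         "Action": ["lambda:CreateFunction", "lambda:GetFunction",
                    "lambda:UpdateFunctionCode", "lambda:InvokeFunction"],
         "Resource": f"arn:aws:lambda:{REGION}:{ACCOUNT}:function:pandalm-reward-function"},
        {"Effect": "Allow",
         "Action": ["bedrock:CreateModelCustomizationJob",
                    "bedrock:GetModelCustomizationJob", "bedrock:InvokeModel"],
         "Resource": "*"},
        {"Effect": "Allow", "Action": "sts:GetCallerIdentity", "Resource": "*"},
    ],
}

policy_json = json.dumps(notebook_policy, indent=2)
print(policy_json)
```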


---
## 0. Install Dependencies

In [None]:
%pip install -qU boto3 botocore


---
## 1. Configuration & Data Prep

First, set your AWS region, S3 bucket, and profile. Then we'll pull PandaLM from GitHub, format it for Bedrock RFT, and upload to S3.

For this RLAIF use-case, we format each example as an evaluation task where the model must judge which response is better and provide reasoning.
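
Before uploading, it can help to sanity-check each JSONL line. The sketch below checks only the fields this notebook writes (`messages` and `reference_answer` with a `label` of 0, 1, or 2); it is not an official Bedrock schema validator, and `validate_record` is a hypothetical helper name.

```python
import json

REQUIRED_KEYS = {"messages", "reference_answer"}

def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list means it looks fine)."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # The prompt should end with the user turn; the model supplies the assistant turn.
    messages = record.get("messages", [])
    if not messages or messages[-1].get("role") != "user":
        problems.append("last message should be the user prompt")
    label = record.get("reference_answer", {}).get("label")
    if label not in (0, 1, 2):
        problems.append(f"unexpected label: {label!r}")
    return problems

good = json.dumps({
    "messages": [
        {"role": "system", "content": "You are an expert AI evaluator."},
        {"role": "user", "content": "Instruction: ...\n\nResponse 1:\n...\n\nResponse 2:\n..."},
    ],
    "reference_answer": {"label": 2, "question": "...", "ground_truth": "JUDGMENT: Response 2 is better"},
    "task_id": "pandalm_train_0",
})
print(validate_record(good))
print(validate_record('{"messages": []}'))
```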


In [None]:
import sys
sys.path.insert(0, "../..")

import boto3
import json
import time
import os
import random
import urllib.request

from helpers import (
    create_lambda_deployment_package,
    cleanup_lambda_deployment_package
)

# ============== UPDATE THESE VALUES ==============
AWS_REGION = "us-east-1"
S3_BUCKET = "your-bucket-name"
AWS_PROFILE = None  # Set to your profile name, or None for default credentials
# =================================================

# Create session
session = boto3.Session(profile_name=AWS_PROFILE, region_name=AWS_REGION) if AWS_PROFILE else boto3.Session(region_name=AWS_REGION)
AWS_ACCOUNT_ID = session.client('sts').get_caller_identity()['Account']

# Dataset configuration
DATASET_NAME = "pandalm"
PANDALM_URL = "https://raw.githubusercontent.com/WeOpenML/PandaLM/main/data/testset-v1.json"
TOTAL_SAMPLES = None  # Set to None to use all available data, or an integer to limit
LOCAL_DATA_DIR = "../../tmp-data"

assert S3_BUCKET != "your-bucket-name", "Please update S3_BUCKET with your own bucket name"
S3_OUTPUT_PATH = f"s3://{S3_BUCKET}/rft-output/"

# Resource names
LAMBDA_FUNCTION_NAME = f"{DATASET_NAME}-reward-function"
LAMBDA_ROLE_NAME = f"{DATASET_NAME.upper()}-Lambda-Role"
BEDROCK_ROLE_NAME = f"BedrockRFT-{DATASET_NAME}-Role"
REWARD_FUNCTION_FILE = f"../../reward-functions/{DATASET_NAME}_rew_func.py"
REWARD_FUNCTION_MODULE = f"{DATASET_NAME}_rew_func"

# Model configuration
BASE_MODEL_ID = f"arn:aws:bedrock:{AWS_REGION}::foundation-model/amazon.nova-2-lite-v1:0:256k"

# Initialize AWS clients
s3_client = session.client('s3')
bedrock_client = session.client('bedrock')
lambda_client = session.client('lambda')
iam_client = session.client('iam')


In [None]:
def format_size(n):
    """Format sample count as human-readable string (e.g., 7k, 1.2k)."""
    if n >= 1000:
        return f"{n/1000:.0f}k" if n % 1000 == 0 else f"{n/1000:.1f}k"
    return str(n)

def preprocess_pandalm(url, total_samples, output_dir, train_ratio=0.8, val_ratio=0.1):
    os.makedirs(output_dir, exist_ok=True)

    print(f"Downloading {url}...")
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode())

    random.seed(42)
    random.shuffle(data)

    available = len(data)
    total = min(total_samples, available) if total_samples else available

    train_size = int(total * train_ratio)
    val_size = int(total * val_ratio)
    test_size = total - train_size - val_size

    def get_majority_label(item):
        """Get majority vote from annotators (1=resp1 better, 2=resp2 better, 0=tie)."""
        votes = [item.get("annotator1", 0), item.get("annotator2", 0), item.get("annotator3", 0)]
        # Sort candidates so a three-way split resolves deterministically to the lowest label (0 = tie)
        return max(sorted(set(votes)), key=votes.count)

    def label_to_text(label):
        if label == 1:
            return "Response 1 is better"
        elif label == 2:
            return "Response 2 is better"
        return "Both responses are equally good (tie)"

    def format_row(item, idx, split):
        label = get_majority_label(item)
        instruction = item.get("instruction", "")
        input_text = item.get("input", "")
        response1 = item.get("response1", "")
        response2 = item.get("response2", "")

        # Build the evaluation task prompt
        task = f"Instruction: {instruction}"
        if input_text:
            task += f"\n\nInput: {input_text}"
        task += f"\n\nResponse 1:\n{response1}\n\nResponse 2:\n{response2}"

        user_content = f"""You are an expert evaluator comparing two AI responses.

{task}

Evaluate both responses and determine which is better. Provide:
1. Your judgment: Which response is better (Response 1, Response 2, or Tie)
2. Your reasoning: Explain why, considering accuracy, helpfulness, and quality

Format your answer as:
JUDGMENT: [Response 1 / Response 2 / Tie]
REASONING: [Your detailed explanation]"""

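        # Build a reference explanation that reads correctly for wins and for ties
        if label in (1, 2):
            reasoning = f"Based on the task requirements, Response {label} better addresses the instruction."
        else:
            reasoning = "Based on the task requirements, both responses address the instruction equally well."
        ground_truth = f"JUDGMENT: {label_to_text(label)}\nREASONING: {reasoning}"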

        return {
            "messages": [
                {"role": "system", "content": "You are an expert AI evaluator who compares model responses and provides detailed judgments with reasoning."},
                {"role": "user", "content": user_content}
            ],
            "reference_answer": {
                "label": label,
                "question": task,
                "ground_truth": ground_truth
            },
            "task_id": f"pandalm_{split}_{idx}",
            "domain": "evaluation",
            "data_source": "pandalm"
        }

    def write_split(data, start_idx, size, filename, split_name):
        with open(f"{output_dir}/{filename}", "w") as f:
            for i, item in enumerate(data[start_idx:start_idx + size]):
                f.write(json.dumps(format_row(item, i, split_name)) + "\n")
        print(f"✓ Created {output_dir}/{filename} ({size} samples)")

    write_split(data, 0, train_size, "train.jsonl", "train")
    write_split(data, train_size, val_size, "val.jsonl", "val")
    write_split(data, train_size + val_size, test_size, "test.jsonl", "test")

    return train_size, val_size, test_size

print("Preprocessing PandaLM dataset...")
train_size, val_size, test_size = preprocess_pandalm(PANDALM_URL, TOTAL_SAMPLES, LOCAL_DATA_DIR)

# S3 paths with sample counts in filenames
S3_TRAINING_DATA = f"s3://{S3_BUCKET}/rft-data/datasets/{DATASET_NAME}/train-{format_size(train_size)}.jsonl"
S3_VALIDATION_DATA = f"s3://{S3_BUCKET}/rft-data/datasets/{DATASET_NAME}/val-{format_size(val_size)}.jsonl"
S3_TEST_DATA = f"s3://{S3_BUCKET}/rft-data/datasets/{DATASET_NAME}/test-{format_size(test_size)}.jsonl"

print("\nUploading to S3...")
for local_file, s3_uri in [
    ("train.jsonl", S3_TRAINING_DATA),
    ("val.jsonl", S3_VALIDATION_DATA),
    ("test.jsonl", S3_TEST_DATA)
]:
    s3_key = '/'.join(s3_uri.split('/')[3:])
    s3_client.upload_file(f"{LOCAL_DATA_DIR}/{local_file}", S3_BUCKET, s3_key)
    print(f"✓ Uploaded {s3_uri.split('/')[-1]}")

print(f"\n✓ Ready | {train_size} train / {val_size} val / {test_size} test")


In [None]:
import shutil

print("\nCleaning up temporary files...")
if os.path.exists(LOCAL_DATA_DIR):
    shutil.rmtree(LOCAL_DATA_DIR)
    print(f"✓ Removed {LOCAL_DATA_DIR}")
else:
    print("✓ No temporary files to clean")


---
## 2. Deploy the Reward Function

For PandaLM, we use an LLM-as-judge approach (RLAIF). The Lambda function calls a Bedrock model to evaluate how well the training model's judgment matches the ground truth human annotations.

This differs from RLVR (Reinforcement Learning with Verifiable Rewards), used for datasets such as GSM8K or FinQA, where correctness can be verified programmatically.
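
For contrast, here is what a purely programmatic check could look like for this task: parse the `JUDGMENT:` line and compare it with the human label. This is a hypothetical sketch, not the deployed reward function (which lives in the repository's reward function file and calls an LLM judge); a check like this ignores the quality of the reasoning entirely.

```python
import re

def parse_judgment(text):
    """Map a 'JUDGMENT: ...' line to 1, 2, or 0 (tie); return None if no judgment is found."""
    match = re.search(r"JUDGMENT:\s*(.*)", text, flags=re.IGNORECASE)
    if not match:
        return None
    # Only the judgment line is inspected, so mentions in REASONING are ignored.
    verdict = re.search(r"response\s*([12])|tie|equal", match.group(1), flags=re.IGNORECASE)
    if not verdict:
        return None
    return int(verdict.group(1)) if verdict.group(1) else 0

def label_match_reward(model_output, human_label):
    """Return 1.0 if the parsed judgment equals the human label, else 0.0."""
    return 1.0 if parse_judgment(model_output) == human_label else 0.0

sample_output = "JUDGMENT: Response 2 is better\nREASONING: Response 1 repeats itself."
print(parse_judgment(sample_output), label_match_reward(sample_output, 2))
```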


In [None]:
# Create Lambda execution role with Bedrock permissions
print("Creating Lambda execution role...")

lambda_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Principal": {"Service": "lambda.amazonaws.com"}, "Action": "sts:AssumeRole"}]
}

lambda_permissions = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"], "Resource": "arn:aws:logs:*:*:*"},
        {"Effect": "Allow", "Action": ["bedrock:InvokeModel", "bedrock:Converse"], "Resource": "*"}  # For LLM-as-judge
    ]
}

try:
    response = iam_client.create_role(
        RoleName=LAMBDA_ROLE_NAME,
        AssumeRolePolicyDocument=json.dumps(lambda_trust_policy),
        Description=f"Execution role for {DATASET_NAME} reward function"
    )
    lambda_role_arn = response['Role']['Arn']
    iam_client.put_role_policy(RoleName=LAMBDA_ROLE_NAME, PolicyName='LambdaBedrockPolicy', PolicyDocument=json.dumps(lambda_permissions))
    print(f"✓ Created role: {LAMBDA_ROLE_NAME}")
    print("Waiting 10s for role propagation...")
    time.sleep(10)
except iam_client.exceptions.EntityAlreadyExistsException:
    lambda_role_arn = iam_client.get_role(RoleName=LAMBDA_ROLE_NAME)['Role']['Arn']
    iam_client.put_role_policy(RoleName=LAMBDA_ROLE_NAME, PolicyName='LambdaBedrockPolicy', PolicyDocument=json.dumps(lambda_permissions))
    print(f"✓ Using existing role: {LAMBDA_ROLE_NAME}")

# Package and deploy Lambda
lambda_zip_content = create_lambda_deployment_package(
    source_file=REWARD_FUNCTION_FILE,
    zip_filename="lambda_deployment.zip",
    archive_name=f"{REWARD_FUNCTION_MODULE}.py"
)

print(f"\nDeploying Lambda: {LAMBDA_FUNCTION_NAME}...")
try:
    lambda_client.get_function(FunctionName=LAMBDA_FUNCTION_NAME)
    lambda_client.update_function_code(FunctionName=LAMBDA_FUNCTION_NAME, ZipFile=lambda_zip_content)
    waiter = lambda_client.get_waiter('function_updated_v2')
    waiter.wait(FunctionName=LAMBDA_FUNCTION_NAME)
    print("✓ Updated existing function")
except lambda_client.exceptions.ResourceNotFoundException:
    lambda_client.create_function(
        FunctionName=LAMBDA_FUNCTION_NAME,
        Runtime='python3.11',
        Role=lambda_role_arn,
        Handler=f"{REWARD_FUNCTION_MODULE}.lambda_handler",
        Code={'ZipFile': lambda_zip_content},
        Timeout=300,  # Longer timeout for LLM calls
        MemorySize=512
    )
    print("✓ Created new function")

waiter = lambda_client.get_waiter('function_active_v2')
waiter.wait(FunctionName=LAMBDA_FUNCTION_NAME)
lambda_arn = lambda_client.get_function(FunctionName=LAMBDA_FUNCTION_NAME)['Configuration']['FunctionArn']
print(f"✓ Lambda ready: {lambda_arn}")

# Create Bedrock role
print(f"\nCreating Bedrock role: {BEDROCK_ROLE_NAME}...")

bedrock_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Principal": {"Service": "bedrock.amazonaws.com"}, "Action": "sts:AssumeRole"}]
}

bedrock_permissions = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"], "Resource": [f"arn:aws:s3:::{S3_BUCKET}/*", f"arn:aws:s3:::{S3_BUCKET}"]},
        {"Effect": "Allow", "Action": "s3:PutObject", "Resource": f"arn:aws:s3:::{S3_BUCKET}/rft-output/*"},
        {"Effect": "Allow", "Action": "lambda:InvokeFunction", "Resource": lambda_arn}
    ]
}

try:
    response = iam_client.create_role(
        RoleName=BEDROCK_ROLE_NAME,
        AssumeRolePolicyDocument=json.dumps(bedrock_trust_policy),
        Description="Execution role for Bedrock RFT"
    )
    bedrock_role_arn = response['Role']['Arn']
    print(f"✓ Created role: {BEDROCK_ROLE_NAME}")
except iam_client.exceptions.EntityAlreadyExistsException:
    bedrock_role_arn = iam_client.get_role(RoleName=BEDROCK_ROLE_NAME)['Role']['Arn']
    print(f"✓ Using existing role: {BEDROCK_ROLE_NAME}")

iam_client.put_role_policy(RoleName=BEDROCK_ROLE_NAME, PolicyName='BedrockRFTPermissions', PolicyDocument=json.dumps(bedrock_permissions))
print(f"✓ Bedrock role ready: {bedrock_role_arn}")

cleanup_lambda_deployment_package()


---
## 3. Test the Reward Function

Before kicking off training, let's verify the LLM-as-judge reward function works correctly.


In [None]:
print("Testing reward function...")

test_payload = [{
    "id": "test_001",
    "messages": [
        {"role": "user", "content": "Evaluate these two responses..."},
        {"role": "assistant", "content": "JUDGMENT: Response 2 is better\nREASONING: Response 2 is more concise and directly addresses the task without unnecessary repetition."}
    ],
    "reference_answer": {
        "label": 2,
        "question": "Which response better rewrites the sentence?",
        "ground_truth": "JUDGMENT: Response 2 is better\nREASONING: Response 2 is more concise."
    }
}]

response = lambda_client.invoke(
    FunctionName=LAMBDA_FUNCTION_NAME,
    InvocationType='RequestResponse',
    Payload=json.dumps(test_payload)
)

result = json.loads(response['Payload'].read())
print(json.dumps(result, indent=2))

if 'errorMessage' in result:
    print(f"\n✗ Error: {result['errorMessage']}")
elif isinstance(result, list) and result[0].get('aggregate_reward_score', 0) > 0:
    print("\n✓ Reward function working correctly!")
else:
    print("\n⚠ Unexpected result - check the output above")


### Test with Real Training & Validation Data

Let's verify the reward function works correctly with actual samples from our dataset.
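
The cells in this notebook split S3 URIs by hand with `split('/')`. If you reuse that logic elsewhere, a small helper built on the standard library's `urlparse` is one option; `split_s3_uri` is a hypothetical name, and the URI below is only an example.

```python
from urllib.parse import urlparse

def split_s3_uri(s3_uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri("s3://my-bucket/rft-data/datasets/pandalm/train-1k.jsonl")
print(bucket, key)
```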


In [None]:
def load_samples_from_s3(s3_uri, n=5):
    bucket = s3_uri.split('/')[2]
    key = '/'.join(s3_uri.split('/')[3:])
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    lines = obj['Body'].read().decode('utf-8').strip().split('\n')
    return [json.loads(line) for line in random.sample(lines, min(n, len(lines)))]

def simulate_correct_response(sample):
    """Add an assistant message with a correct-ish judgment."""
    label = sample['reference_answer']['label']
    judgment = "Response 1 is better" if label == 1 else "Response 2 is better" if label == 2 else "Tie"
    sample_copy = sample.copy()
    sample_copy['messages'] = sample['messages'] + [
        {'role': 'assistant', 'content': f'JUDGMENT: {judgment}\nREASONING: Based on the evaluation criteria, this response better addresses the task requirements.'}
    ]
    return sample_copy

print('Loading samples from S3...')
train_samples = load_samples_from_s3(S3_TRAINING_DATA, n=5)
val_samples = load_samples_from_s3(S3_VALIDATION_DATA, n=5)

test_payloads = [simulate_correct_response(s) for s in train_samples + val_samples]

print(f'Testing {len(test_payloads)} samples ({len(train_samples)} train + {len(val_samples)} val)...')
response = lambda_client.invoke(
    FunctionName=LAMBDA_FUNCTION_NAME,
    InvocationType='RequestResponse',
    Payload=json.dumps(test_payloads)
)
results = json.loads(response['Payload'].read())

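# A Lambda failure comes back as a dict with 'errorMessage' rather than a list of scores
if isinstance(results, dict) and 'errorMessage' in results:
    raise RuntimeError(f"Reward function failed: {results['errorMessage']}")

print('\nResults:')
for r in results:
    score = r.get('aggregate_reward_score', 0)
    status = '✓' if score >= 0.5 else '⚠'
    print(f"  {status} {r.get('id', 'unknown')}: {score:.2f}")

avg_score = sum(r.get('aggregate_reward_score', 0) for r in results) / len(results)
print(f'\nAverage score: {avg_score:.2f}')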


### Analyze Dataset for Hyperparameter Selection

**Key hyperparameters to consider:**

| Parameter | What it controls | Trade-off |
|-----------|-----------------|-----------|
| `maxPromptLength` | Max tokens for input prompts | Higher = more context, but more memory & slower training |
| `inferenceMaxTokens` | Max tokens the model can generate per response | Higher = longer reasoning, but slower & more expensive |
| `trainingSamplePerPrompt` | Number of response samples per prompt | More samples = better reward estimation, but slower |
| `batchSize` | Samples per training batch | Larger = more stable gradients, but more memory |

**For PandaLM specifically:**
- Prompts include instruction + input + two full responses (can be long)
- Model needs to output judgment + detailed reasoning
- LLM-as-judge scoring is slower than programmatic verification
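
To get a feel for why judge cost matters, here is a back-of-the-envelope estimate. It assumes one judge call per sampled response, and the prompt count, latency, and parallelism are illustrative numbers rather than measurements.

```python
def estimate_judge_calls(num_prompts, samples_per_prompt, epochs):
    """Rough count of reward evaluations, assuming one judge call per sampled response."""
    return num_prompts * samples_per_prompt * epochs

def estimate_judge_hours(calls, seconds_per_call, parallelism=1):
    """Wall-clock hours spent in the judge if `parallelism` calls run at a time."""
    return calls * seconds_per_call / parallelism / 3600

# Hypothetical numbers: 800 training prompts, 2 samples each, 1 epoch.
calls = estimate_judge_calls(num_prompts=800, samples_per_prompt=2, epochs=1)
hours = estimate_judge_hours(calls, seconds_per_call=3.0, parallelism=8)
print(calls, round(hours, 2))
```

Doubling `trainingSamplePerPrompt` doubles the number of judge calls under this assumption, which is why the notebook keeps it low.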


In [None]:
%pip install -q tiktoken
import tiktoken
import statistics

# cl100k_base is an OpenAI tokenizer, not Nova's own, so treat these counts as rough estimates
enc = tiktoken.get_encoding('cl100k_base')

def count_tokens(text):
    return len(enc.encode(text))

print('Analyzing training data...')
obj = s3_client.get_object(Bucket=S3_BUCKET, Key='/'.join(S3_TRAINING_DATA.split('/')[3:]))
samples = [json.loads(line) for line in obj['Body'].read().decode('utf-8').strip().split('\n')]

prompt_tokens = []
for s in samples:
    prompt_text = ' '.join(m['content'] for m in s['messages'])
    prompt_tokens.append(count_tokens(prompt_text))

print(f'\nDataset Statistics ({len(samples)} samples)')
print(f'\nPrompt tokens (input):')
print(f'  Min: {min(prompt_tokens)}, Max: {max(prompt_tokens)}, Mean: {statistics.mean(prompt_tokens):.0f}')
print(f'  P95: {sorted(prompt_tokens)[int(len(prompt_tokens)*0.95)]}, P99: {sorted(prompt_tokens)[int(len(prompt_tokens)*0.99)]}')

recommended_prompt_len = sorted(prompt_tokens)[int(len(prompt_tokens)*0.99)] * 2
print(f'\nRecommended hyperparameters:')
print(f'  maxPromptLength: {recommended_prompt_len} (2x P99 prompt length)')
print(f'  inferenceMaxTokens: 500 (judgment + reasoning)')


---
## 4. Start the RFT Training Job

Now we'll create a model customization job using RLAIF (LLM-as-judge) to train the model to be a better evaluator.
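
The model name in the next cell embeds the main hyperparameters as a suffix. The sketch below shows how such a suffix could be derived from one dictionary so the name and the job settings stay in sync; `build_hp_suffix` and `hyperparams` are hypothetical names. Note that stripping `-` turns `5e-05` into `5e05`, so read the suffix as a tag rather than a literal value.

```python
def build_hp_suffix(epochs, batch_size, learning_rate):
    """Encode key hyperparameters into a name-safe suffix, dropping '.' and '-'."""
    raw = f"e{epochs}_bs{batch_size}_lr{learning_rate}"
    return raw.replace(".", "").replace("-", "")

# Single source of truth for the values used in both the name and the job config.
hyperparams = {"epochCount": 1, "batchSize": 16, "learningRate": 5e-5}
suffix = build_hp_suffix(
    hyperparams["epochCount"], hyperparams["batchSize"], hyperparams["learningRate"]
)
print(suffix)
```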


In [None]:
print("Creating RFT training job...")

from datetime import datetime
date_str = datetime.now().strftime('%Y%m%d')
hp_suffix = f"e{1}_bs{16}_lr{5e-5}".replace('.', '').replace('-', '')
CUSTOM_MODEL_NAME = f"{DATASET_NAME}-nova-rft-{date_str}-{hp_suffix}"
JOB_NAME = f"{DATASET_NAME}-rft-{date_str}-{int(time.time())}"

print(f"  Job: {JOB_NAME}")
print(f"  Model: {CUSTOM_MODEL_NAME}")
print(f"  Base: {BASE_MODEL_ID}")

response = bedrock_client.create_model_customization_job(
    jobName=JOB_NAME,
    customModelName=CUSTOM_MODEL_NAME,
    roleArn=bedrock_role_arn,
    baseModelIdentifier=BASE_MODEL_ID,
    customizationType='REINFORCEMENT_FINE_TUNING',
    trainingDataConfig={'s3Uri': S3_TRAINING_DATA},
    validationDataConfig={'validators': [{'s3Uri': S3_VALIDATION_DATA}]},
    outputDataConfig={'s3Uri': S3_OUTPUT_PATH},
    customizationConfig={
        'rftConfig': {
            'graderConfig': {'lambdaGrader': {'lambdaArn': lambda_arn}},
            'hyperParameters': {
                'batchSize': 16,  # Smaller batch for LLM-as-judge (slower reward computation)
                'epochCount': 1,  # Start with 1; increase if validation rewards still rising
                'evalInterval': 10,  # Eval every 10 steps
                'inferenceMaxTokens': 500,  # Room for judgment + detailed reasoning
                'learningRate': 0.00005,
                'maxPromptLength': 2500,  # PandaLM prompts include two full responses - adjust based on analysis
                'reasoningEffort': 'high',
                'trainingSamplePerPrompt': 2  # Fewer samples since LLM-as-judge is expensive
            }
        }
    }
)

print(f"\n✓ Job created: {response['jobArn']}")


---
## 5. Monitor Training Progress

Run this cell periodically to check on your training job. Status moves from `InProgress` to a terminal state: `Completed`, `Failed`, or `Stopped` (a cancelled job passes through `Stopping` first).
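
If you would rather block until the job finishes than re-run a cell, a simple polling loop is one option. The sketch below takes the status lookup as a callable so it can run offline with a stand-in; `wait_for_job` is a hypothetical helper, and the poll interval is an arbitrary choice.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_job(get_status, poll_seconds=60, max_polls=1000, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job still {status!r} after {max_polls} polls")

# Stand-in for the Bedrock call so the sketch runs without AWS access.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_job(lambda: next(fake_statuses), poll_seconds=0, sleep=lambda s: None)
print(final)
```

With the real client, the callable would be something like `lambda: bedrock_client.get_model_customization_job(jobIdentifier=JOB_NAME)['status']`.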


In [None]:
response = bedrock_client.get_model_customization_job(jobIdentifier=JOB_NAME)
print(f"Job: {JOB_NAME}")
print(f"Status: {response['status']}")

if response['status'] == 'Completed' and 'outputModelArn' in response:
    print(f"\n✓ Training complete!")
    print(f"  Model ARN: {response['outputModelArn']}")
elif response['status'] == 'Failed':
    print(f"\n✗ Training failed: {response.get('failureMessage', 'Unknown error')}")
elif response['status'] == 'InProgress':
    print("\nStill training... run this cell again to check progress")
elif response['status'] in ('Stopping', 'Stopped'):
    print("\n⚠ Job was stopped before completing")


## Conclusion

Congratulations, you've successfully launched a Reinforcement Fine-Tuning job for Amazon Nova on the PandaLM evaluation dataset.

### What You've Built

- **Preprocessed PandaLM dataset** into Bedrock RFT format for evaluation tasks
- **Deployed an LLM-as-judge Lambda** that scores model evaluations using RLAIF
- **Created IAM roles** for Lambda and Bedrock execution
- **Started an RFT training job** with customized hyperparameters

### Key Differences from RLVR (GSM8K/FinQA)

| Aspect | RLVR (Math) | RLAIF (PandaLM) |
|--------|-------------|-----------------|
| Reward Signal | Programmatic verification | LLM-as-judge |
| Speed | Fast | Slower (LLM calls) |
| Cost | Lower | Higher |
| Use Case | Verifiable answers | Subjective quality |

### Next Steps

Once your training job completes:

1. **Test your fine-tuned model** as an evaluator on held-out examples
2. **Compare judgments** against human annotations
3. **Use the model** for RLAIF in other training pipelines
4. **Experiment with hyperparameters** for better performance
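
For step 2, one simple metric is the agreement rate between the model's judgments and the human majority labels. The sketch below assumes you have already turned both into lists of labels (0 = tie, 1, 2); `agreement_report` is a hypothetical helper and the sample labels are made up.

```python
from collections import Counter

def agreement_report(predicted, human):
    """Return overall agreement and per-label agreement between model and human labels."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not human:
        raise ValueError("need at least one example")
    hits = Counter()
    totals = Counter(human)
    for p, h in zip(predicted, human):
        if p == h:
            hits[h] += 1
    overall = sum(hits.values()) / len(human)
    # Per-label agreement shows whether ties are harder to judge than clear wins.
    per_label = {label: hits[label] / count for label, count in totals.items()}
    return overall, per_label

overall, per_label = agreement_report(predicted=[1, 2, 0, 2, 1], human=[1, 2, 2, 2, 0])
print(overall, per_label)
```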

### Learn More

- [Amazon Bedrock RFT Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/reinforcement-fine-tuning.html)
- [PandaLM Paper](https://arxiv.org/abs/2306.05087)
- [RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback](https://arxiv.org/abs/2309.00267)
