# RLHF
RLHF (Reinforcement Learning from Human Feedback) is a key part of how advanced language models are currently trained. It incorporates human feedback into fine-tuning, which makes the model more helpful and safer. This notebook covers the reward-modeling step: training a model that assigns a scalar score to a response given its prompt.

In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.5.0
!pip install -q sentencepiece


In [2]:
import random

import torch
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)
from trl import RewardTrainer

In [3]:
random.seed(42)

In [4]:
hf_dataset = load_dataset(
    "argilla/dolly-curated-comparison-falcon-7b-instruct", split="train"
)

In [5]:
df = hf_dataset.to_pandas()
df

Unnamed: 0,prompt,response-1,response-2,category,original_response,external_id
0,What is Depreciation,What is Depreciation – 10 Important Facts to K...,What is Depreciation on a Car?\nDepreciation i...,open_qa,Depreciation is the drop in value of an asset ...,518
1,What do you know about the city of Aberdeen in...,"Aberdeen, the city in which I've lived for the...","As an AI language model, I don't have personal...",open_qa,Aberdeen is a city located in the North East o...,351
2,Describe thunderstorm season in the United Sta...,Describe thunderstorm season in the United Sta...,Describe thunderstorm season in the United Sta...,information_extraction,Thunderstorm season in the United States and C...,10567
3,"When did Peloton IPO?\nOn September 26, 2019, ...",When did Peloton IPO?\nPeloton launched its in...,When did Peloton IPO?\nPeloton IPO'd on May 26...,closed_qa,Peloton became a public company via an initial...,12412
4,What is the best way to answer an interview qu...,The best way to answer an interview question m...,Some of the best ways to answer an interview q...,creative_writing,The first recommended step is to ask clarifyin...,2521
...,...,...,...,...,...,...
7396,How do i accept the change,How do i accept the change in my life\nAccepti...,I's a great opportunity to improve. The only t...,brainstorming,Embrace the change and see the difference,15010
7397,Extract the teams that the footballer Sócrates...,Extract the teams that the footballer Sócrates...,Extract the teams that the footballer Sócrates...,information_extraction,"Brazil, Botafogo-SP, Corinthians, Fiorentina",9970
7398,Without quoting directly from the text give me...,Without quoting directly from the text give me...,Without quoting directly from the text give me...,summarization,"Brendon Small is a stand-up comedian, Creator...",14205
7399,Is Killing is Sin ? Is it ture,Is Killing is Sin ? Is it ture?\nKilling can b...,Is Killing is Sin ? Is it ture?\nKilling is no...,brainstorming,Killing a human being should not be sin becaus...,11253


In [6]:
# List of response options
responses = ["response-1", "response-2"]


def get_chosen_and_not_chosen(responses):
    """
    Given a list of responses, randomly selects one and returns it along with the non-selected response.

    Args:
        responses (list): List of response options.

    Returns:
        tuple: A tuple containing the chosen response, non-chosen response, and the index of the chosen response.
    """
    chosen_id = random.randint(0, len(responses) - 1)
    # Assumes exactly two responses: the other index is 1 - chosen_id
    not_chosen_id = 1 - chosen_id

    return responses[chosen_id], responses[not_chosen_id], chosen_id


# List to store rows of data
rows = []

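# Iterate through the hf_dataset
for record in hf_dataset:
    # NOTE: "chosen" and "rejected" are picked at random here,
    # as a stand-in for real human preference labels.
    chosen, not_chosen, chosen_id = get_chosen_and_not_chosen(responses)
    # Unused second draw; it still advances the seeded random state.
    chosen_from_falcon, _, _ = get_chosen_and_not_chosen(responses)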

    # Append a new row to the 'rows' list
    rows.append(
        {
            "instruction": record["prompt"],
            "chosen_response": record[chosen],
            "rejected_response": record[not_chosen],
        }
    )
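
The helper above relies on `1 - chosen_id`, which only works when there are exactly two candidate responses. A minimal sketch of a variant that handles any number of candidates is shown below. The name `sample_pair`, the local `rng` generator, and the `"response-3"` key are assumptions made for illustration; the dataset itself only has `response-1` and `response-2`.

```python
import random


def sample_pair(response_keys, rng):
    """Pick two distinct keys at random: the first is 'chosen', the second 'rejected'."""
    if len(response_keys) < 2:
        raise ValueError("need at least two responses to form a pair")
    chosen_key, rejected_key = rng.sample(response_keys, 2)
    return chosen_key, rejected_key


rng = random.Random(42)  # local generator, so the global random state is untouched
keys = ["response-1", "response-2", "response-3"]  # "response-3" is hypothetical
chosen_key, rejected_key = sample_pair(keys, rng)
print(chosen_key, rejected_key)
```

Using `rng.sample` guarantees the two keys differ, so no index arithmetic is needed.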

In [7]:
prepared_dataset = Dataset.from_list(rows)
prepared_dataset.to_pandas()

Unnamed: 0,instruction,chosen_response,rejected_response
0,What is Depreciation,What is Depreciation – 10 Important Facts to K...,What is Depreciation on a Car?\nDepreciation i...
1,What do you know about the city of Aberdeen in...,"As an AI language model, I don't have personal...","Aberdeen, the city in which I've lived for the..."
2,Describe thunderstorm season in the United Sta...,Describe thunderstorm season in the United Sta...,Describe thunderstorm season in the United Sta...
3,"When did Peloton IPO?\nOn September 26, 2019, ...",When did Peloton IPO?\nPeloton launched its in...,When did Peloton IPO?\nPeloton IPO'd on May 26...
4,What is the best way to answer an interview qu...,Some of the best ways to answer an interview q...,The best way to answer an interview question m...
...,...,...,...
7396,How do i accept the change,I's a great opportunity to improve. The only t...,How do i accept the change in my life\nAccepti...
7397,Extract the teams that the footballer Sócrates...,Extract the teams that the footballer Sócrates...,Extract the teams that the footballer Sócrates...
7398,Without quoting directly from the text give me...,Without quoting directly from the text give me...,Without quoting directly from the text give me...
7399,Is Killing is Sin ? Is it ture,Is Killing is Sin ? Is it ture?\nKilling can b...,Is Killing is Sin ? Is it ture?\nKilling is no...


In [8]:
prepared_dataset

Dataset({
    features: ['instruction', 'chosen_response', 'rejected_response'],
    num_rows: 7401
})

In [9]:
prepared_dataset_mini = prepared_dataset.select(range(1000))

In [10]:
prepared_dataset_mini

Dataset({
    features: ['instruction', 'chosen_response', 'rejected_response'],
    num_rows: 1000
})

In [11]:
from transformers import (
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    AutoTokenizer,
)

# Load tokenizer for the "facebook/opt-350m" model
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Prepare quantization parameters
quantization_config = BitsAndBytesConfig(load_in_8bit=False, load_in_4bit=True)

# Initialize the sequence classification model
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config,  # Apply the quantization configuration
    device_map={"": 0},  # Assign the model to device 0
    trust_remote_code=True,  # Trust remote code
    num_labels=1,  # One output logit, used as the scalar reward score
)

# Disable cache in model configuration
model.config.use_cache = False

Some weights of OPTForSequenceClassification were not initialized from the model checkpoint at facebook/opt-350m and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# If the tokenizer's pad_token is not set, use eos_token as pad_token and update model's pad_token_id
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id


# Define a formatting function for processing examples
def formatting_func(examples):
    kwargs = {
        "padding": "max_length",
        "truncation": True,
        "max_length": 512,
        "return_tensors": "pt",
    }

    # Prepend the instruction and a line break to the chosen_response and rejected_response fields.
    prompt_plus_chosen_response = (
        examples["instruction"] + "\n" + examples["chosen_response"]
    )
    prompt_plus_rejected_response = (
        examples["instruction"] + "\n" + examples["rejected_response"]
    )

    # Tokenize the modified fields.
    tokens_chosen = tokenizer.encode_plus(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer.encode_plus(prompt_plus_rejected_response, **kwargs)

    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0],
        "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0],
        "attention_mask_rejected": tokens_rejected["attention_mask"][0],
    }


# Apply the formatting function to the prepared dataset
formatted_dataset = prepared_dataset_mini.map(formatting_func)

# Split the formatted dataset into training and testing sets
formatted_dataset = formatted_dataset.train_test_split()
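
The `padding="max_length"` and `truncation=True` settings above force every example to exactly `max_length` tokens. The following is a toy sketch of that behaviour on plain lists of token ids, assuming right-side padding and right-side truncation. The function `pad_or_truncate` and the ids used are made up for illustration and are not the tokenizer's actual implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation drops the tail
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=1)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5, pad_id=1)
print(short_ids, short_mask)  # [5, 6, 7, 1, 1] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [5, 6, 7, 8, 9] [1, 1, 1, 1, 1]
```

Because the instruction comes first in the concatenated text, truncation of this kind removes the end of the response. For a very long instruction, the chosen and rejected inputs could end up nearly identical.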

In [None]:
from transformers import TrainingArguments
from peft import LoraConfig
from trl import RewardTrainer

# Prepare training parameters
training_args = TrainingArguments(
    output_dir="./train_logs",  # Output folder
    max_steps=100,  # Maximum number of training steps (overrides num_train_epochs when positive)
    per_device_train_batch_size=4,  # Batch size per GPU for training
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients
    learning_rate=1.0e-4,  # Learning rate
    optim="adamw_torch",  # Optimizer
    save_steps=50,  # How often to save checkpoints
    logging_steps=10,  # How often to log training information
    report_to="tensorboard",  # Reporting method (in this case, TensorBoard)
    remove_unused_columns=False,  # Whether to remove unused columns
    evaluation_strategy="steps",  # Evaluate every eval_steps steps (defaults to logging_steps)
    num_train_epochs=5,  # Number of training epochs (ignored here because max_steps is set)
)

# Prepare PEFT parameters
peft_config = LoraConfig(
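    r=16,  # Rank of the low-rank update matrices
    lora_alpha=16,  # Scaling numerator; the update is scaled by lora_alpha / r
    bias="none",  # Do not train bias terms
    task_type="SEQ_CLS",  # Task type (sequence classification)
    modules_to_save=["score"],  # Also train and save the newly initialized classification head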
)

# Prepare RewardTrainer
trainer = RewardTrainer(
    model=model,  # The reward model to train
    tokenizer=tokenizer,  # The tokenizer for processing input data
    args=training_args,  # Training arguments
    train_dataset=formatted_dataset["train"],  # Training dataset
    eval_dataset=formatted_dataset["test"],  # Evaluation dataset
    peft_config=peft_config,  # PEFT configuration
    max_length=512,  # Maximum length of input
)

# Execute training
trainer.train()

# Save the trained reward model (with PEFT, this saves the adapter weights)
trainer.model.save_pretrained("./reward_model")
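
To give a feel for what `r=16` and `lora_alpha=16` mean in the LoRA configuration above, here is a rough arithmetic sketch. For a weight matrix of shape `d_out x d_in`, LoRA trains two small matrices of shapes `r x d_in` and `d_out x r`, and scales their product by `lora_alpha / r`. The 1024 x 1024 layer size is a hypothetical value chosen for illustration, and `lora_param_count` is not a library function.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight (A: r x d_in, B: d_out x r)."""
    return r * d_in + d_out * r


d_in = d_out = 1024        # hypothetical layer size, for illustration only
r, lora_alpha = 16, 16     # values from the LoraConfig above

full = d_in * d_out                        # parameters in the frozen weight
added = lora_param_count(d_in, d_out, r)   # parameters LoRA trains instead
scaling = lora_alpha / r                   # factor applied to the low-rank update

print(f"full weight: {full}, LoRA adds: {added} ({added / full:.1%})")
print(f"scaling: {scaling}")
```

With these settings the scaling factor is 1.0, so the low-rank update is added to the frozen weight unscaled.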

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,Validation Loss,Accuracy
10,0.7396,0.792111,0.528
20,0.8415,0.793017,0.528
30,0.9162,0.788118,0.528
40,0.7412,0.782025,0.544
50,0.7865,0.781935,0.556
60,0.69,0.780752,0.556
70,0.8578,0.779521,0.54
80,0.7566,0.778371,0.536
90,0.7744,0.777437,0.544
100,0.774,0.777306,0.54
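
To help read the loss and accuracy columns above, here is a minimal sketch of the pairwise objective that reward-model training of this kind typically minimises, `-log(sigmoid(r_chosen - r_rejected))`, together with pairwise accuracy. It is assumed here to correspond to what `RewardTrainer` computes. The functions `pairwise_loss` and `pairwise_accuracy` and the toy reward values are illustrative only.

```python
import math


def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-margin)) without overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(pairs):
    """Fraction of (chosen, rejected) reward pairs where chosen scores higher."""
    return sum(c > r for c, r in pairs) / len(pairs)


print(pairwise_loss(0.0, 0.0))   # equal rewards: log(2)
print(pairwise_loss(2.0, 0.0))   # chosen ranked higher: smaller loss
print(pairwise_loss(0.0, 2.0))   # chosen ranked lower: larger loss
print(pairwise_accuracy([(1.0, 0.5), (0.2, 0.9), (0.3, 0.1), (0.0, 0.4)]))
```

A model that cannot tell the two responses apart sits at a loss of ln 2 (about 0.693) and an accuracy of about 0.5. The table above stays close to those values, which is consistent with the chosen and rejected labels having been assigned at random.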




In [None]:
import torch


def get_score(model, tokenizer, prompt, response):
    """
    Computes a score for a given prompt and response using a provided model and tokenizer.

    Args:
        model (nn.Module): The model for scoring.
        tokenizer: The tokenizer for processing input data.
        prompt (str): The prompt text.
        response (str): The response text.

    Returns:
        float: The computed score.
    """
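    # Tokenize the input sequences
    # NOTE: this encodes prompt and response as a sequence pair, whereas
    # training joined them with "\n", so the two formats tokenize differently.
    inputs = tokenizer.encode_plus(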
        prompt,
        response,
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="pt",
    ).to("cuda:0")

    # Perform forward pass
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract the logits
    logits = outputs.logits

    return logits.item()

In [None]:
x = 4242
prepared_dataset[x]

{'instruction': 'Given this paragraph about WWII, how many fatalities happened?\n"World War II or the Second World War, often abbreviated as WWII or WW2, was a global conflict that lasted from 1939 to 1945. The vast majority of the world\'s countries, including all of the great powers, fought as part of two opposing military alliances: the Allies and the Axis. Many participants threw their economic, industrial, and scientific capabilities behind this total war, blurring the distinction between civilian and military resources. Aircraft played a major role, enabling the strategic bombing of population centres and the delivery of the only two nuclear weapons ever used in war.\n\nWorld War II was by far the deadliest conflict in history; it resulted in an estimated 70 to 85 million fatalities, mostly among civilians. Tens of millions died due to genocides (including the Holocaust), starvation, massacres, and disease. In the wake of the Axis defeat, Germany and Japan were occupied, and war 

In [None]:
# Get the prompt and responses for the example
prompt = prepared_dataset[x]["instruction"]
rejected_response = prepared_dataset[x]["rejected_response"]
chosen_response = prepared_dataset[x]["chosen_response"]

# Get the score for the example with the less preferred response
score_less_pref = get_score(model, tokenizer, prompt, rejected_response)
print(f"Score for less preferred response: {score_less_pref}")

# Get the score for the example with the preferred response
score_pref = get_score(model, tokenizer, prompt, chosen_response)
print(f"Score for preferred response: {score_pref}")

Score for less preferred response: 0.539810061454773
Score for preferred response: 0.5915793776512146
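
One way to interpret the two scores is through a Bradley-Terry style model, in which the probability that one response is preferred over the other is the sigmoid of the score difference. The sketch below applies this to the two values printed above. The function `preference_probability` is an illustrative helper, not part of the notebook or of any library.

```python
import math


def preference_probability(score_a, score_b):
    """Probability that response A is preferred over B under a Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


score_pref = 0.5915793776512146      # printed above for the preferred response
score_less_pref = 0.539810061454773  # printed above for the less preferred response

p = preference_probability(score_pref, score_less_pref)
print(f"P(preferred beats less preferred) = {p:.3f}")
```

The score gap here is small, so the implied probability is only slightly above 0.5: the model ranks this pair in the expected order, but with little confidence.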
