# Reward Model testing pipeline

In [Chai Prize Reward Model — Part I: Data](https://wild-chatter-b52.notion.site/Chai-Prize-Reward-Model-Part-I-Data-026a10a8998a404ca6a52251c0c8d052) and [Chai Prize Reward Model — Part II: Training Model](https://wild-chatter-b52.notion.site/Chai-Prize-Reward-Model-Part-II-Training-model-753b574c843f4d0780bf8d85b084da57), we introduced the basics of reward models, how to build a dataset, and how to train a model.  
Now all the ingredients are ready! We prepared this notebook so you can try fitting your own reward model with the Chaiverse reward model training pipeline.

In [None]:
import logging
import os
import huggingface_hub

import chai_guanaco as chai

from chaiverse.dataset import DatasetLoader, RewardDatasetBuilder
from chaiverse.tokenizer import GPT2Tokenizer
from chaiverse.model.reward_model import RewardClassificationTrainer, RewardRegressionTrainer

## Login setup

In [None]:
chai.developer_login()

In [None]:
huggingface_hub.login()

In [None]:
os.environ["WANDB_DISABLED"] = "true"
logging.basicConfig(level=logging.INFO)

## Load and process a small dataset

Download the [Chai Prize reward model dataset](https://huggingface.co/datasets/ChaiML/20231012_chai_prize_reward_model_data) from Hugging Face. We shuffle the dataset, then hold out 10% of it as the validation fold (`validation_split_size=0.1`). The printed `DatasetDict` should show the resulting splits.
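As a rough mental model of what the loader arguments in the next cell ask for, here is a minimal pure-Python sketch. It assumes the rows are shuffled first, then cut down to `data_samples` rows, then split by `validation_split_size`; the actual internals of `DatasetLoader` may differ, and the helper `sample_and_split` and the toy rows are hypothetical.

```python
import random


def sample_and_split(rows, data_samples, validation_split_size, seed=0):
    """Shuffle rows, keep `data_samples` of them, then hold out a validation fraction.

    Hypothetical helper for illustration only; not part of chaiverse.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    subset = shuffled[:data_samples]
    n_val = round(len(subset) * validation_split_size)
    return {"train": subset[n_val:], "validation": subset[:n_val]}


# Toy rows that mimic the dataset's two fields.
rows = [{"input_text": f"conversation {i}", "labels": i % 2} for i in range(5000)]
splits = sample_and_split(rows, data_samples=1000, validation_split_size=0.1)
print({name: len(part) for name, part in splits.items()})
```

With these settings, 900 rows end up in the training fold and 100 in the validation fold.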

In [None]:
# load data
data_path = 'ChaiML/20231012_chai_prize_reward_model_data'
data_loader = DatasetLoader(
        hf_path=data_path,
        data_samples=1000,
        validation_split_size=0.1,
        shuffle=True,
        )
df = data_loader.load()
print(df)

In [None]:
# A sample of the data. It includes `input_text` (the conversation) and `labels` (whether or not the user gave it a thumbs up).
# The labels here are 1 or 0, which suits a classification task.
df['train'][0]

### Base model / tokenizer

In this feedback dataset, we use thumbs up as a single-label target, which means we can train directly with a sequence classification or regression method. Here, we use `gpt2` as the base model, which makes it easier to compare performance with Chai's in-house gpt2 reward model.

- `padding_side` of the tokenizer should match the base model's config.
- `truncation_side = 'left'` ensures that the most recent responses are kept in the input.
- `block_size = 512` can be increased up to 1024, the maximum context length of gpt2.
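To make the effect of these settings concrete, here is a minimal sketch on plain lists of token ids. The helper `truncate_and_pad` and the pad id `0` are hypothetical and only illustrate the idea; the real tokenizer handles this internally.

```python
def truncate_and_pad(token_ids, block_size, pad_id,
                     truncation_side="left", padding_side="right"):
    """Fit a token id list to exactly `block_size` entries.

    Hypothetical helper for illustration only; not part of chaiverse.
    """
    if len(token_ids) > block_size:
        if truncation_side == "left":
            # Drop the oldest tokens, keep the most recent ones.
            token_ids = token_ids[-block_size:]
        else:
            token_ids = token_ids[:block_size]
    padding = [pad_id] * (block_size - len(token_ids))
    if padding_side == "right":
        return token_ids + padding
    return padding + token_ids


long_conversation = list(range(1, 11))  # token ids 1..10, oldest first
short_conversation = [1, 2]

print(truncate_and_pad(long_conversation, block_size=4, pad_id=0))   # [7, 8, 9, 10]
print(truncate_and_pad(short_conversation, block_size=4, pad_id=0))  # [1, 2, 0, 0]
```

Left truncation keeps the end of the conversation, which is where the response being rated sits.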

In [None]:
# process data
tokenizer_loader = GPT2Tokenizer(
        padding_side='right',
        truncation_side='left',
        )
data_builder = RewardDatasetBuilder(
        tokenizer_loader=tokenizer_loader,
        block_size=512,
        )
data = data_builder.generate(df)
print(data)

## Model setup and fitting

- With `RewardRegressionTrainer`, training is treated as a regression task, and `num_labels` defaults to 1.
- With `RewardClassificationTrainer`, training is treated as a single-label classification task, and `num_labels` defaults to 2.
- It's important to pass the same `tokenizer_loader` so that the training step and the inference step use the same processing configuration.
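The difference in `num_labels` changes the shape of the model output. The sketch below shows one common way to turn each kind of output into a single reward score. It is an assumption about how scores might be read out, not a description of what the Chaiverse pipeline does at inference; both helper functions are hypothetical.

```python
import math


def classification_reward(logits):
    """Reward from a 2-logit head: softmax probability of class 1 (thumbs up)."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    return exps[1] / sum(exps)


def regression_reward(outputs):
    """Reward from a 1-output head: the raw predicted value."""
    return outputs[0]


print(classification_reward([0.0, 0.0]))   # 0.5
print(classification_reward([-1.0, 2.0]))
print(regression_reward([0.3]))            # 0.3
```

Either way, each conversation ends up with one scalar that can be used to rank candidate responses.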

In [None]:
# train setup
# set bf16=False if you are not using a GPU that supports bfloat16
model = RewardClassificationTrainer(
        model_name='gpt2',
        tokenizer_loader=tokenizer_loader,
        output_dir='test_reward_model',
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        logging_strategy='steps',
        logging_steps=2,
        eval_strategy='steps',
        eval_steps=2
        )


We can also build a `RewardRegressionTrainer` with similar code.

```
model = RewardRegressionTrainer(
        model_name='gpt2',
        tokenizer_loader=tokenizer_loader,
        output_dir='test_reward_model',
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        logging_strategy='steps',
        logging_steps=2,
        eval_strategy='steps',
        eval_steps=2
        )
```

In [None]:
# model initialisation and fitting
model.fit(data)

## Upload the reward model

In [None]:
# The destination model path on your Hugging Face account
reward_url = ''

# Push the model to Hugging Face
model.push_to_hub(reward_url, private=False)

## Submit to competition

In [None]:
# The base model URL
model_url = ""

# To upload a reward model, add "reward_repo" to submission_parameters
submission_parameters = {
    "model_repo": model_url,
    "reward_repo": reward_url,
    "generation_params": {
        "temperature": 0.99,
        "top_p": 0.2,
        "top_k": 40,
        "stopping_words": ['\n'],
        "presence_penalty": 0.,
        "frequency_penalty": 0.,
        "max_input_tokens": 2048
    },
    'model_name': 'My-First-RewardModel',
}

In [None]:
submitter = chai.ModelSubmitter()
submission_id = submitter.submit(submission_parameters)

# Yay! You just trained and submitted a reward model! 

Make sure you check out the writeup [Chai Prize Reward Model — Part II: Training model](https://wild-chatter-b52.notion.site/Chai-Prize-Reward-Model-Part-II-Training-model-753b574c843f4d0780bf8d85b084da57) to see how the reward model massively improved the performance of the base model and took it to the top of the competition!

Reward model + best_of_N sampling is an effective way to improve model performance. It looks simple and straightforward, but it takes a lot of attention to detail to make it work. And obviously, there is a lot of room to explore for improving the reward model.
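For readers new to the idea, best-of-N sampling can be sketched in a few lines: generate N candidate responses, score each one with the reward model, and return the highest-scoring one. In the sketch below, `toy_score` is a hypothetical stand-in for a trained reward model and the candidate strings are made up; it is not how the competition backend is implemented.

```python
def best_of_n(candidates, score_fn):
    """Return the candidate response with the highest reward score."""
    return max(candidates, key=score_fn)


def toy_score(response):
    # Hypothetical stand-in for a reward model: prefers longer responses.
    return len(response)


candidates = ["ok", "Tell me more about that!", "hm"]
print(best_of_n(candidates, toy_score))  # Tell me more about that!
```

In practice, `score_fn` would wrap the reward model's forward pass on the conversation plus each candidate response.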

Feel free to play with this notebook and test out different methods to make a better model! 
Maybe something like...
- Better data filtering/cleaning/formatting 🧠
- Different models: gpt2, deberta, roberta, phi-1.5 🏋️
- Parameter optimization 🛠️
- New targets from feedback dataset or external data 🎯
- Multi-label objectives, weighted loss, joint reward tasks 🤹
- Pair-wise reward model ⚖️
