# Processing of the 'synthetic-instruct-gptj-pairwise' dataset

In this notebook, we process the 'synthetic-instruct-gptj-pairwise' dataset, available on [HuggingFace](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise), to convert it into the format required by our reward model.
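
To make the target format concrete, here is a minimal sketch of the two record shapes this notebook produces, applied to a made-up row. The helper names `to_graded_records` and `to_pairwise_record` and the toy row are illustrative assumptions, not part of the dataset or of any library; the field names `prompt`, `chosen`, and `rejected` are the ones the cells in this notebook read.

```python
# Illustrative sketch: the helper names and the toy row are hypothetical.

def to_graded_records(row):
    """Turn one pairwise row into two graded chats (training format)."""
    prefix = f"Human: {row['prompt']} \n\nAssistant: "
    return [
        {"chat": f"{prefix}{row['chosen']} ", "grade": 5},
        {"chat": f"{prefix}{row['rejected']} ", "grade": 0},
    ]


def to_pairwise_record(row):
    """Turn one pairwise row into a chosen/rejected pair (evaluation format)."""
    prefix = f"Human: {row['prompt']} \n\nAssistant: "
    return {
        "chosen": f"{prefix}{row['chosen']}",
        "rejected": f"{prefix}{row['rejected']}",
    }


toy_row = {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}
graded = to_graded_records(toy_row)
pair = to_pairwise_record(toy_row)
print(graded)
print(pair)
```

Each source row therefore contributes two records to the training file but only one record to the evaluation file.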

We start with the training portion, which is the first 80% of the rows in the dataset's `train` split:

In [None]:
# Step 1: Install required libraries
!pip install datasets tqdm matplotlib

# Step 2: Import libraries
import json
import matplotlib.pyplot as plt
from tqdm import tqdm
from datasets import load_dataset

# Step 3: Load the dataset
dataset = load_dataset('Dahoas/synthetic-instruct-gptj-pairwise')

# Step 4: Initialize a counter and a list to hold the processed data
processed_data = []
counter = 0

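# Step 5: Process the dataset and save data in the required format
# Each row yields two datapoints (chosen and rejected), so the cut-off is
# twice the number of training rows. This keeps the cut-off even, so it
# lines up with the counter, which grows in steps of 2.
num_datapoints = 2 * int(0.8 * len(dataset['train']))
for data in tqdm(dataset['train']):
    # The first 80% of the rows are used for training; the remaining 20% are held out for evaluation
    if counter >= num_datapoints:
        break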

    # We arbitrarily assign a grade of 5 to the chosen chat and a grade of 0 to the rejected chat
    # (see the report for more details on this)
    processed_data.append({
        "chat": f"Human: {data['prompt']} \n\nAssistant: {data['chosen']} ",
        "grade": 5
    })
    processed_data.append({
        "chat": f"Human: {data['prompt']} \n\nAssistant: {data['rejected']} ",
        "grade": 0
    })
    counter += 2  # Increase counter

# Step 6: Save the processed data to a json file
with open('synthetic-instruct-gptj-pairwise_train.json', 'w') as f:
    json.dump(processed_data, f)

print(f'Total datapoints processed: {counter}')
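
The 80/20 row split used in this notebook can be checked in isolation. The following is a minimal sketch, assuming a hypothetical helper `split_row_indices` (it is not part of `datasets`); it illustrates that cutting at `int(0.8 * n)` rows gives two disjoint, contiguous ranges and that the training portion contributes two datapoints per row.

```python
# Illustrative sketch: split_row_indices is a hypothetical helper.

def split_row_indices(num_rows, train_fraction=0.8):
    """Return (train_rows, heldout_rows) as two contiguous index ranges."""
    cutoff = int(train_fraction * num_rows)
    return range(0, cutoff), range(cutoff, num_rows)


train_rows, heldout_rows = split_row_indices(10)

# Two graded chats (chosen and rejected) are written per training row.
num_train_datapoints = 2 * len(train_rows)

# The two ranges share no row index and together cover every row.
assert set(train_rows).isdisjoint(heldout_rows)
assert len(train_rows) + len(heldout_rows) == 10

print(len(train_rows), len(heldout_rows), num_train_datapoints)
```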

We then process the held-out evaluation portion, which is the remaining 20% of the rows in the same `train` split:

In [None]:
# Step 1: Load the dataset
dataset = load_dataset('Dahoas/synthetic-instruct-gptj-pairwise')

# Step 2: Initialize a counter and a list to hold the processed data
processed_data = []
counter = 0

# Step 3: Process the dataset and save data in the required format
num_datapoints = int(0.8 * len(dataset['train']))
for data in tqdm(dataset['train']):
    # Skip the first 80% of the rows (used for training); keep the remaining 20% for evaluation
    if counter >= num_datapoints:
        processed_data.append({
            "chosen": f"Human: {data['prompt']} \n\nAssistant: {data['chosen']}",
            "rejected": f"Human: {data['prompt']} \n\nAssistant: {data['rejected']}"
        })
    counter += 1  # Increase counter

# Step 4: Save the processed data to a json file
with open('synthetic-instruct-gptj-pairwise_eval.json', 'w') as f:
    json.dump(processed_data, f)

# counter counts every row visited, so report the number of records actually kept
print(f'Total datapoints processed: {len(processed_data)}')
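
As a final sanity check, one might verify that no chosen conversation in the evaluation data also appears in the training data. The sketch below assumes the record layouts written by the two cells above; `chosen_chats_overlap` is a hypothetical helper, and the in-memory toy records stand in for the contents of the two JSON files.

```python
# Illustrative sketch: chosen_chats_overlap and the toy records are hypothetical.

def chosen_chats_overlap(train_records, eval_records):
    """Return the evaluation 'chosen' chats that also occur as grade-5 training chats."""
    # Training chats end with a trailing space that evaluation chats lack,
    # so surrounding whitespace is stripped before comparing.
    train_chats = {r["chat"].strip() for r in train_records if r["grade"] == 5}
    return [r["chosen"] for r in eval_records if r["chosen"].strip() in train_chats]


train_records = [
    {"chat": "Human: Hi \n\nAssistant: Hello! ", "grade": 5},
    {"chat": "Human: Hi \n\nAssistant: Go away. ", "grade": 0},
]
eval_records = [
    {"chosen": "Human: Bye \n\nAssistant: Goodbye!",
     "rejected": "Human: Bye \n\nAssistant: Whatever."},
    {"chosen": "Human: Hi \n\nAssistant: Hello!",
     "rejected": "Human: Hi \n\nAssistant: Go away."},
]

leaks = chosen_chats_overlap(train_records, eval_records)
print(f"{len(leaks)} overlapping chat(s)")
```

To run the same check on the real output, the two lists could be read back with `json.load` from `synthetic-instruct-gptj-pairwise_train.json` and `synthetic-instruct-gptj-pairwise_eval.json`. A non-empty result would not necessarily indicate a bug in the split, since the source dataset itself may contain repeated prompts.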