# **Establishing RL Pipeline Between Evaluator and Generator**

---



The goal is to achieve the following pipeline:
1. Take a list of query tensors as input.
2. Generate a list of response tensors.
3. Write the queries and responses to a CSV file as input for the evaluator.
4. Run the evaluator to produce an output CSV with evaluator scores.
5. Parse the output CSV to extract the scores.
6. Use the scores to compute the reward.
7. Pass the query, response, and reward tensors to trl to perform a train step.

Edit: This pipeline now works for individual queries. The next goal is to run it in batches so that it is at least reasonably fast; it currently takes about 1 minute per iteration.


In [None]:
import csv
import pandas as pd

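def construct_output_csv(fileName, queries, responses):
    # Write one (query, response) pair per row in the format the evaluator expects.
    # The queries are passed in explicitly instead of being read from a global.
    with open(fileName, 'w', newline='') as csvfile:
        fieldnames = ['id', 'seeker_post', 'response_post']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for i, (query, response) in enumerate(zip(queries, responses), start=1):
            writer.writerow({'id': str(i), 'seeker_post': str(query), 'response_post': str(response)})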


In [None]:
#Imports and Definitions

import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch

# get models
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = create_reference_model(model)

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# initialize trainer
ppo_config = {"mini_batch_size": 1, "batch_size": 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)

Example of steps 1, 2, and 7 in the pipeline. The reward is a constant value here instead of an evaluator score.

In [None]:
#example of one single train iteration
query_txt = input_queries[0]
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").cuda()

# get model response
response_tensor  = respond_to_batch(model, query_tensor)

# reward -- just a constant for this example
reward = [torch.tensor(1.0)]

# train model for one step with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)

In [None]:
import os
import shutil

try:

    from google.colab import drive
    drive.mount('/content/gdrive')

    # Raw string: the backslash escapes the space for the shell commands below.
    DRIVE_PATH = r'/content/gdrive/My\ Drive/CS247-Empathy-Mental-Health'
    DRIVE_PYTHON_PATH = DRIVE_PATH.replace('\\', '')
    if not os.path.exists(DRIVE_PYTHON_PATH):
      %mkdir $DRIVE_PATH

    ## the space in `My Drive` causes some issues,
    ## make a symlink to avoid this
    # Solved -> symlink for convenience
    SYM_PATH = '/content/CS247-Empathy-Mental-Health'
    if not os.path.exists(SYM_PATH):
      !ln -s $DRIVE_PATH $SYM_PATH

    running_in_colab = True

    # We already mounted in our google drive.
    # Enter the folder where you put the files:
    %cd '/content/CS247-Empathy-Mental-Health'

    # What files are there:
    !ls


except ModuleNotFoundError:
    running_in_colab = False
    print(
        "I guess you are running locally. If you get this message in Colab, check the files."
    )

Mounted at /content/gdrive
/content/gdrive/.shortcut-targets-by-id/1qwurxfG3wTYT_VY1AQ0AaMBt4LPZf23w/CS247-Empathy-Mental-Health
 best_emotion.pt	   NaiveBaselineModel.ipynb	   rlhf_q_2
 checkpoint_other_131.pt   output			   rlhf_q_3
 checkpoint_other_79.pt    PretrainedModelQuerying.ipynb   rlhf_question_0_100
 Empathy-Mental-Health	   rlhf_default_0_100		   rlhf_question_0_200
 EmpDialogue_RecEC	   rlhf_default_0_200		   rlhf_question_0_300
 ER-reddit-test.csv	   rlhf_default_logs.json	   rlhf_question_logs.json
'Generative Model.ipynb'   rlhf_length_0		   rlhf_therapist_length_0_100
 glove.6B.100d.txt	   rlhf_length_0_100		   rlhf_therapist_length_0_200
 glove.6B.200d.txt	   rlhf_length_0_200		   rlhf_therapist_length_0_300
 glove.6B.300d.txt	   rlhf_length_0_300		   rlhf_therapist_length_logs.json
 glove.6B.50d.txt	   rlhf_length_logs.json	  'RL Training.ipynb'
 hard-gate-test.gdoc	  'RLHF on SFT'			   roberta-large.tsv
 hard-gate-test.txt	   rlhf_q_1			   SFT_GPT2


In [None]:
%cd Empathy-Mental-Health/
!ls

[Errno 2] No such file or directory: 'Empathy-Mental-Health/'
/content/gdrive/.shortcut-targets-by-id/1eUJZcBYmEsh0qtMhIO6uDeGrZPPBmQio/Empathy-Mental-Health
'Althoff academic license.docx'		  dataset   README.md	       src
'Althoff attribution only license.docx'   output    requirements.txt   train.sh


This sets up steps 3-6 in the pipeline. All functions for writing and parsing the CSV files and for extracting the reward value are defined here.

In [None]:
def convert_csv_for_evaluator(fileName, query, response):
    with open(fileName, 'w', newline='') as csvfile:
        fieldnames = ['id', 'seeker_post', 'response_post']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerow({'id':'1','seeker_post':str(query), 'response_post':str(response)})

def calculate_reward_function_using_evaluator():
    !python3 src/test.py \
	--input_path /content/evaluator_input.csv \
	--output_path /content/evaluator_output.csv \
	--ER_model_path output/reddit_ER.pth \
	--IP_model_path output/reddit_IP.pth \
	--EX_model_path output/reddit_EX.pth
    return

# Compute the reward. Currently this sums the three labels (total 0-6) and subtracts 3 to center the result on zero:
# 6->3, 5->2, 4->1, 3->0, 2->-1, 1->-2, 0->-3
def compute_reward_single_value(ER_label, IP_label, EX_label):
    return [torch.tensor((ER_label + IP_label + EX_label) - 3, dtype=torch.float32)]


def extract_reward_from_output_csv_from_evaluator_single_value():
    input_df = pd.read_csv('/content/evaluator_output.csv', header=0)
    # Only the first row is read because the input CSV holds a single query/response pair.
    ER_label = int(input_df.ER_label.iloc[0])
    IP_label = int(input_df.IP_label.iloc[0])
    EX_label = int(input_df.EX_label.iloc[0])
    return compute_reward_single_value(ER_label, IP_label, EX_label)
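
For a batch, steps 4-6 could be handled along the following lines. This is a minimal sketch, not the notebook's actual implementation: `build_evaluator_command`, `run_evaluator`, and `rewards_from_output_csv` are hypothetical names, the command-line flags are copied from the shell call above, and the code assumes the evaluator writes one output row per input row, in the same order, with `ER_label`, `IP_label`, and `EX_label` columns. It uses only the standard library and returns plain floats, which would still need to be wrapped in `torch.tensor` before being passed to `ppo_trainer.step`.

```python
import csv
import subprocess

def build_evaluator_command(input_path, output_path):
    """Return the evaluator invocation as an argument list (same flags as the shell call)."""
    return [
        "python3", "src/test.py",
        "--input_path", input_path,
        "--output_path", output_path,
        "--ER_model_path", "output/reddit_ER.pth",
        "--IP_model_path", "output/reddit_IP.pth",
        "--EX_model_path", "output/reddit_EX.pth",
    ]

def run_evaluator(input_path, output_path):
    """Run the evaluator once for the whole CSV; raises CalledProcessError on failure."""
    subprocess.run(build_evaluator_command(input_path, output_path), check=True)

def rewards_from_output_csv(output_path):
    """Return one reward per CSV row: (ER + IP + EX) - 3, so sums 0..6 map to -3..3."""
    rewards = []
    with open(output_path, newline='') as csvfile:
        for row in csv.DictReader(csvfile):
            # int(float(...)) tolerates labels written as "2" or "2.0".
            total = sum(int(float(row[name]))
                        for name in ("ER_label", "IP_label", "EX_label"))
            rewards.append(float(total - 3))
    return rewards
```

Running the evaluator once per CSV rather than once per query means the evaluator process is launched only once for the whole batch.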


Steps 1-7 for a single input query:

In [None]:
chat_entry = input_queries[0]
query_tensor = tokenizer.encode(chat_entry, return_tensors="pt").cuda()

# get model response
response_tensor  = respond_to_batch(model, query_tensor)
response = tokenizer.decode(response_tensor[0], skip_special_tokens=True)

# use query, response to obtain evaluator reward score
convert_csv_for_evaluator('/content/evaluator_input.csv', chat_entry, response)
calculate_reward_function_using_evaluator()
reward = extract_reward_from_output_csv_from_evaluator_single_value()
# train step based on query, response, reward
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)

print("query: ", chat_entry)
print("response: ", response)
print("reward", reward)

The following code trains with PPO on every query in the list "input_queries". However, it processes only one query at a time, so the full pipeline, including an evaluator run, is executed for each query. This makes it very slow: about 1 minute per query.
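
One way the loop could be restructured for batches is sketched below. The names `iter_batches` and `train_in_batches` are hypothetical, and the `generate`, `score`, and `train_step` arguments are stand-ins for the model call, the evaluator round trip, and `ppo_trainer.step`; the demo uses dummy callables so it runs without a model. As far as I know, the trl `PPOTrainer.step` call expects exactly `batch_size` samples, so `PPOConfig` would need a matching `batch_size`, and a partial final batch may have to be dropped or handled separately; this should be checked against the installed trl version.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def train_in_batches(queries, batch_size, generate, score, train_step):
    """Run generate -> score -> train_step once per batch; return the number of batches."""
    num_batches = 0
    for batch in iter_batches(queries, batch_size):
        responses = generate(batch)            # steps 1-2: one response per query
        rewards = score(batch, responses)      # steps 3-6: one evaluator run per batch
        if not (len(batch) == len(responses) == len(rewards)):
            raise ValueError("generate and score must return one item per query")
        train_step(batch, responses, rewards)  # step 7
        num_batches += 1
    return num_batches

# Usage with stand-in callables (no model or evaluator involved).
demo_queries = [f"query {i}" for i in range(10)]
evaluator_runs = []  # records the size of each batch sent to the dummy evaluator

def dummy_generate(batch):
    return [q.upper() for q in batch]

def dummy_score(batch, responses):
    evaluator_runs.append(len(batch))
    return [0.0] * len(batch)

def dummy_train_step(batch, responses, rewards):
    pass

n = train_in_batches(demo_queries, 4, dummy_generate, dummy_score, dummy_train_step)
print(n, evaluator_runs)  # 3 batches of sizes 4, 4, 2
```

With 10 queries and a batch size of 4, the evaluator is invoked 3 times instead of 10.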

In [None]:
for chat_entry in input_queries:
    print("query: ", chat_entry)
    query_tensor = tokenizer.encode(chat_entry, return_tensors="pt").cuda()

    # get model response
    response_tensor  = respond_to_batch(model, query_tensor)
    response = tokenizer.decode(response_tensor[0], skip_special_tokens=True)
    print("response: ", response)

    # use query, response to obtain evaluator reward score
    convert_csv_for_evaluator('/content/evaluator_input.csv', chat_entry, response)
    calculate_reward_function_using_evaluator()
    print("done forwarding through evaluator")
    reward = extract_reward_from_output_csv_from_evaluator_single_value()
    print("reward", reward)
    # train step based on query, response, reward
    train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)

query:  Help. Help me. I dunno what I'm doing anymore
response:  . Just get this boy kicked out of my god-shitclaw. Photo Archives Prev Next Field Administrator
2024-03-08 23:02:08.896714: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-08 23:02:08.896778: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-08 23:02:08.903163: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation 



query:  All the people who will be kissed on New year's eve. I'll be alone like usual I'll never get someone to kiss me It's fine No one understands how hopelessly, alone, and angrily some people live. It would scare some attractive people that haven't ever lived like that.
response:   It's beautiful. And it's means that you can, me you dare, I can multiply desires
2024-03-08 23:18:41.517798: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-08 23:18:41.517856: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-08 23:18:41.523580: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been register