## Notebook used to fine-tune a T5 model on a bias-neutralizing task

Adapted from https://shivanandroy.com/fine-tune-t5-transformer-with-pytorch/

In [None]:
!pip install sentencepiece
!pip install transformers
!pip install rich[jupyter]

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_

In [None]:
#mount google drive to access files
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
cd gdrive

/content/gdrive


In [None]:
cd MyDrive/

/content/gdrive/MyDrive


In [None]:
import pandas as pd


In [None]:
#import training dataset
df = pd.read_csv("biased.word.train.csv")
df.columns = ["id", "source_tokenized", "target_tokenized", "source_raw", "target_raw","POS","POS_2"]
df["source_raw"] = "Neutralize bias: " + df["source_raw"]
df

Unnamed: 0,id,source_tokenized,target_tokenized,source_raw,target_raw,POS,POS_2
0,123204846,the free software gnu class ##path project is ...,the free software gnu class ##path project is ...,Neutralize bias: the free software gnu classpa...,the free software gnu classpath project is par...,DET ADJ NOUN NOUN NOUN NOUN NOUN VERB ADV ADV ...,det amod nmod compound compound compound nsubj...
1,706783956,"other campaign ##ers , especially the controve...","other campaign ##ers , especially the british ...","Neutralize bias: other campaigners, especially...","other campaigners, especially the british acti...",ADJ NOUN NOUN PUNCT ADV DET ADJ ADJ NOUN ADJ N...,amod nsubj nsubj punct advmod det amod amod am...
2,612378448,vocalist rob half ##ord ' s performance is con...,vocalist rob half ##ord ' s performance is con...,Neutralize bias: vocalist rob halford's perfor...,vocalist rob halford's performance is consider...,ADJ X NOUN NOUN PUNCT PART NOUN VERB VERB NUM ...,amod amod poss poss punct case nsubjpass auxpa...
3,876796337,the proud general is a chinese animated featur...,the proud general is a chinese animated featur...,Neutralize bias: the proud general is a chines...,the proud general is a chinese animated featur...,DET ADJ NOUN VERB DET ADJ VERB NOUN NOUN VERB ...,det amod nsubj ROOT det amod amod attr attr ac...
4,91653449,"gaming system , an dice pool system where matc...","gaming system , a unique dice pool system wher...","Neutralize bias: gaming system, an dice pool s...","gaming system, a unique dice pool system where...",NOUN NOUN PUNCT DET NOUN NOUN NOUN ADV VERB VE...,compound ROOT punct det compound compound appo...
...,...,...,...,...,...,...,...
53797,341593940,the national lawyers guild is a progressive / ...,the national lawyers guild is a progressive ba...,Neutralize bias: the national lawyers guild is...,the national lawyers guild is a progressive ba...,DET ADJ NOUN ADJ VERB DET ADJ SYM ADJ PUNCT NO...,det amod compound nsubj ROOT det amod punct am...
53798,640510650,a plan to red ##eve ##lo ##p the old tiger sta...,a plan to red ##eve ##lo ##p the old tiger sta...,Neutralize bias: a plan to redevelop the old t...,a plan to redevelop the old tiger stadium site...,DET NOUN PART VERB VERB VERB VERB DET ADJ NOUN...,det nsubj aux acl acl acl acl det amod compoun...
53799,162719260,"instrumental ##ly , life ##son is regarded as ...","instrumental ##ly , life ##son is regarded as ...","Neutralize bias: instrumentally, lifeson is re...","instrumentally, lifeson is regarded as a guita...",ADV ADV PUNCT NOUN NOUN VERB VERB ADP DET ADJ ...,advmod advmod punct nsubjpass nsubjpass auxpas...
53800,62331672,fly ##nt joined the us army in 1958 at only fi...,fly ##nt joined the us army in 1958 at only fi...,Neutralize bias: flynt joined the us army in 1...,flynt joined the us army in 1958 at only fifte...,NOUN NOUN VERB DET PRON NOUN ADP NUM ADP ADV N...,nsubj nsubj ROOT det compound dobj prep pobj p...
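
Re-running the cell above prepends `Neutralize bias: ` a second time, because the assignment always adds the prefix to whatever `source_raw` currently holds. A minimal sketch of a guard, using a hypothetical helper `add_prefix` that is not part of the original notebook:

```python
TASK_PREFIX = "Neutralize bias: "  # prefix string used in the notebook

def add_prefix(text, prefix=TASK_PREFIX):
    """Prepend the task prefix unless the text already starts with it."""
    text = str(text)
    return text if text.startswith(prefix) else prefix + text

once = add_prefix("the proud general is a chinese animated feature")
twice = add_prefix(once)  # a second call leaves the text unchanged
print(once)
print(once == twice)
```

With pandas this could be applied as `df["source_raw"] = df["source_raw"].map(add_prefix)`, which makes the cell safe to run more than once.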


In [None]:
#test to look for certain sentences
#df[df['source_raw'].str.contains("once for kidney stones")][['source_raw','target_raw']]

In [None]:
#for double fine-tuning, use a dataset that has a target classification of 'biased' or 'unbiased'
class_df = pd.read_csv('class_df.csv')
class_df["sentence"] = "Classify bias: " + class_df["sentence"]
class_df

Unnamed: 0.1,Unnamed: 0,sentence,class
0,0,Classify bias: in the early 80s a standardisat...,biased
1,1,Classify bias: in the early 80s a standardisat...,unbiased
2,2,"Classify bias: up to the late 19th century, th...",biased
3,3,"Classify bias: up to the late 19th century, th...",unbiased
4,4,"Classify bias: since then, mauboy has scored t...",biased
...,...,...,...
9995,9995,"Classify bias: moreover, the amount of compens...",unbiased
9996,9996,Classify bias: the return of adam chandler and...,biased
9997,9997,Classify bias: the return of adam chandler and...,unbiased
9998,9998,Classify bias: the act has been through a numb...,biased


In [None]:
# Importing libraries
import os
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

from rich.table import Column, Table
from rich import box
from rich.console import Console

# define a rich console logger
console=Console(record=True)

def display_df(df):
  """Display the first two columns of a dataframe as an ASCII table."""

  console = Console()
  table = Table(Column("source_text", justify="center"), Column("target_text", justify="center"), title="Sample Data", pad_edge=False, box=box.ASCII)

  # Each row is [source_text, target_text]
  for row in df.values.tolist():
    table.add_row(row[0], row[1])

  console.print(table)

training_logger = Table(Column("Epoch", justify="center" ), 
                        Column("Steps", justify="center"),
                        Column("Loss", justify="center"), 
                        title="Training Status",pad_edge=False, box=box.ASCII)


In [None]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [None]:
class YourDataSetClass(Dataset):
  """
  Custom dataset that tokenizes the source and target columns of a dataframe
  so that a DataLoader can feed them to the model during fine-tuning.

  """

  def __init__(self, dataframe, tokenizer, source_len, target_len, source_text, target_text):
    self.tokenizer = tokenizer
    self.data = dataframe
    self.source_len = source_len
    self.summ_len = target_len
    self.target_text = self.data[target_text]
    self.source_text = self.data[source_text]

  def __len__(self):
    return len(self.target_text)

  def __getitem__(self, index):
    source_text = str(self.source_text[index])
    target_text = str(self.target_text[index])

    # Normalize whitespace (the str() casts above guard against non-string values such as NaN)
    source_text = ' '.join(source_text.split())
    target_text = ' '.join(target_text.split())

    # padding="max_length" already pads to max_length; the deprecated pad_to_max_length=True is dropped
    source = self.tokenizer.batch_encode_plus([source_text], max_length=self.source_len, truncation=True, padding="max_length", return_tensors='pt')
    target = self.tokenizer.batch_encode_plus([target_text], max_length=self.summ_len, truncation=True, padding="max_length", return_tensors='pt')

    source_ids = source['input_ids'].squeeze()
    source_mask = source['attention_mask'].squeeze()
    target_ids = target['input_ids'].squeeze()
    target_mask = target['attention_mask'].squeeze()

    return {
        'source_ids': source_ids.to(dtype=torch.long), 
        'source_mask': source_mask.to(dtype=torch.long), 
        'target_ids': target_ids.to(dtype=torch.long),
        'target_ids_y': target_ids.to(dtype=torch.long)
    }
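
With `padding="max_length"` and `truncation=True`, every encoded example has exactly `max_length` ids, and the attention mask marks which positions hold real tokens. The pure-Python sketch below imitates that behaviour on a list of ids. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation, and the example ids are made up. It assumes a pad id of 0 (T5's pad token id) and simply cuts long inputs, whereas the real tokenizer also keeps the end-of-sequence token when truncating.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = list(ids[:max_length])                   # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # right padding
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([37, 2041, 1], max_length=5)
long_ids, long_mask = pad_or_truncate([37, 2041, 879, 19, 3, 9, 1], max_length=5)
print(short_ids, short_mask)  # [37, 2041, 1, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [37, 2041, 879, 19, 3] [1, 1, 1, 1, 1]
```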

In [None]:
def train(epoch, tokenizer, model, device, loader, optimizer):

  """
  Run one training epoch over `loader`; called once per epoch from the trainer function

  """

  model.train()
  for step, data in enumerate(loader, 0):
    y = data['target_ids'].to(device, dtype = torch.long)
    # Teacher forcing: the decoder sees tokens 0..n-2 and is trained to predict tokens 1..n-1
    y_ids = y[:, :-1].contiguous()
    lm_labels = y[:, 1:].clone().detach()
    # Labels of -100 are ignored by the loss, so padding positions do not contribute
    lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
    ids = data['source_ids'].to(device, dtype = torch.long)
    mask = data['source_mask'].to(device, dtype = torch.long)

    outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
    loss = outputs[0]

    # Log every 10 steps; loss.item() records a plain float rather than the tensor repr
    if step % 10 == 0:
      training_logger.add_row(str(epoch), str(step), str(loss.item()))
      console.print(training_logger)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
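
The slicing in `train` can be traced on a single padded target sequence. This sketch uses plain lists and made-up token ids; `shift_for_teacher_forcing` is a hypothetical name, and a pad id of 0 is assumed (T5's pad token id). The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions do not affect the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def shift_for_teacher_forcing(target_ids, pad_id=0):
    """Mirror y[:, :-1] and y[:, 1:] from train() on one sequence."""
    decoder_input = target_ids[:-1]  # what the decoder is fed
    labels = [IGNORE_INDEX if t == pad_id else t for t in target_ids[1:]]
    return decoder_input, labels

target = [71, 1712, 1, 0, 0]  # made-up ids: two tokens, end-of-sequence, two pads
dec_in, labels = shift_for_teacher_forcing(target)
print(dec_in)  # [71, 1712, 1, 0]
print(labels)  # [1712, 1, -100, -100]
```

Note that with this slicing the first target token (71 here) only ever appears as decoder input and never as a label.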

In [None]:
def validate(epoch, tokenizer, model, device, loader):

  """
  Generate predictions for every batch in `loader`.
  Returns two lists of decoded strings: (predictions, actuals)

  """
  model.eval()
  predictions = []
  actuals = []
  with torch.no_grad():
      for step, data in enumerate(loader, 0):
          y = data['target_ids'].to(device, dtype = torch.long)
          ids = data['source_ids'].to(device, dtype = torch.long)
          mask = data['source_mask'].to(device, dtype = torch.long)

          # Beam search decoding
          generated_ids = model.generate(
              input_ids = ids,
              attention_mask = mask, 
              max_length=150, 
              num_beams=3,
              repetition_penalty=2.5, 
              length_penalty=1.0, 
              #num_return_sequences=3,
              early_stopping=True
              )
          
          # Decode generated and reference ids back to text, dropping special tokens
          preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
          target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
          if step % 10 == 0:
              console.print(f'Completed {step}')

          predictions.extend(preds)
          actuals.extend(target)
  return predictions, actuals

In [None]:
def T5Trainer(dataframe, source_text, target_text, model_params, output_dir="./outputs/" ):
  
  """
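  T5 trainer: fine-tunes the model named in model_params["MODEL"] on `dataframe`,
  saves the model and tokenizer, and returns a dataframe of validation predictions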

  """

  # Set random seeds and deterministic pytorch for reproducibility
  torch.manual_seed(model_params["SEED"]) # pytorch random seed
  np.random.seed(model_params["SEED"]) # numpy random seed
  torch.backends.cudnn.deterministic = True

  # logging
  console.log(f"""[Model]: Loading {model_params["MODEL"]}...\n""")

  # tokenizer for encoding the text
  tokenizer = T5Tokenizer.from_pretrained(model_params["MODEL"])

  # Define the model (t5-base in this notebook, as set in model_params["MODEL"])
  # and move it to the selected device (GPU if available, otherwise CPU).
  model = T5ForConditionalGeneration.from_pretrained(model_params["MODEL"])
  model = model.to(device)
  
  # logging
  console.log(f"[Data]: Reading data...\n")

  # Importing the raw dataset
  dataframe = dataframe[[source_text,target_text]]
  display_df(dataframe.head(2))

  
  # Creation of Dataset and Dataloader
  # Defining the train size. So 80% of the data will be used for training and the rest for validation. 
  train_size = 0.8
  train_dataset=dataframe.sample(frac=train_size,random_state = model_params["SEED"])
  val_dataset=dataframe.drop(train_dataset.index).reset_index(drop=True)
  train_dataset = train_dataset.reset_index(drop=True)

  console.print(f"FULL Dataset: {dataframe.shape}")
  console.print(f"TRAIN Dataset: {train_dataset.shape}")
  console.print(f"TEST Dataset: {val_dataset.shape}\n")


  # Creating the Training and Validation dataset for further creation of Dataloader
  training_set = YourDataSetClass(train_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)
  val_set = YourDataSetClass(val_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)


  # Defining the parameters for creation of dataloaders
  train_params = {
      'batch_size': model_params["TRAIN_BATCH_SIZE"],
      'shuffle': True,
      'num_workers': 0
      }


  val_params = {
      'batch_size': model_params["VALID_BATCH_SIZE"],
      'shuffle': False,
      'num_workers': 0
      }


  # Creation of Dataloaders for training and validation. These are used below in the training and validation stages.
  training_loader = DataLoader(training_set, **train_params)
  val_loader = DataLoader(val_set, **val_params)


  # Defining the optimizer that will be used to tune the weights of the network in the training session. 
  optimizer = torch.optim.Adam(params =  model.parameters(), lr=model_params["LEARNING_RATE"])


  # Training loop
  console.log(f'[Initiating Fine Tuning]...\n')

  for epoch in range(model_params["TRAIN_EPOCHS"]):
      train(epoch, tokenizer, model, device, training_loader, optimizer)
      
  console.log(f"[Saving Model]...\n")
  #Saving the model after training
  path = os.path.join(output_dir, "model_files")
  model.save_pretrained(path)
  tokenizer.save_pretrained(path)


  # Generate predictions on the held-out validation split
  console.log(f"[Initiating Validation]...\n")
  final_df = None
  for epoch in range(model_params["VAL_EPOCHS"]):
    predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
    final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
    final_df.to_csv(os.path.join(output_dir,'predictions.csv'))
  
  console.save_text(os.path.join(output_dir,'logs.txt'))
  
  console.log(f"[Validation Completed.]\n")
  console.print(f"""[Model] Model saved @ {os.path.join(output_dir, "model_files")}\n""")
  console.print(f"""[Validation] Generation on Validation data saved @ {os.path.join(output_dir,'predictions.csv')}\n""")
  console.print(f"""[Logs] Logs saved @ {os.path.join(output_dir,'logs.txt')}\n""")

  # Return only after logging; returning inside the loop left the lines above unreachable
  return final_df
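
The 80/20 split in `T5Trainer` draws a seeded random sample for training and uses the remaining rows for validation. A standard-library analogue of that idea is sketched below, with the hypothetical name `split_indices` and row indices standing in for a dataframe. It illustrates the logic only and does not reproduce the exact rows pandas would pick.

```python
import random

def split_indices(n_rows, train_frac=0.8, seed=42):
    """Seeded random split of row indices into disjoint train/validation sets."""
    rng = random.Random(seed)
    n_train = round(n_rows * train_frac)
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    # Everything not sampled for training goes to validation
    val_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, val_idx

train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))        # 8 2
print(set(train_idx).isdisjoint(val_idx))  # True
```

Fixing the seed keeps the validation rows identical across runs, which matters when comparing runs with different hyperparameters.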

In [None]:
#use if double fine-tuning. takes pre-fine-tuned model and fine-tunes it again
def T5Trainer2(dataframe, source_text, target_text, model_params, output_dir="./outputs/" ):
  
  """
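  Second-stage T5 trainer: loads the model saved by T5Trainer from output_dir/model_files,
  fine-tunes it again on `dataframe`, and saves the result to output_dir/model_files2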

  """

  # Set random seeds and deterministic pytorch for reproducibility
  torch.manual_seed(model_params["SEED"]) # pytorch random seed
  np.random.seed(model_params["SEED"]) # numpy random seed
  torch.backends.cudnn.deterministic = True

  # Stage-one checkpoint written by T5Trainer (fine-tuned on classification)
  path = os.path.join(output_dir, "model_files")

  # logging
  console.log(f"""[Model]: Loading {path}...\n""")

  # tokenizer saved with the classification fine-tuned model
  tokenizer = T5Tokenizer.from_pretrained(path)

  # Load the model fine-tuned on classification
  # and move it to the selected device (GPU if available, otherwise CPU).
  model = T5ForConditionalGeneration.from_pretrained(path)
  model = model.to(device)
  
  # logging
  console.log(f"[Data]: Reading data...\n")

  # Importing the raw dataset
  dataframe = dataframe[[source_text,target_text]]
  display_df(dataframe.head(2))

  
  # Creation of Dataset and Dataloader
  # Defining the train size. So 80% of the data will be used for training and the rest for validation. 
  train_size = 0.8
  train_dataset=dataframe.sample(frac=train_size,random_state = model_params["SEED"])
  val_dataset=dataframe.drop(train_dataset.index).reset_index(drop=True)
  train_dataset = train_dataset.reset_index(drop=True)

  console.print(f"FULL Dataset: {dataframe.shape}")
  console.print(f"TRAIN Dataset: {train_dataset.shape}")
  console.print(f"TEST Dataset: {val_dataset.shape}\n")


  # Creating the Training and Validation dataset for further creation of Dataloader
  training_set = YourDataSetClass(train_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)
  val_set = YourDataSetClass(val_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)


  # Defining the parameters for creation of dataloaders
  train_params = {
      'batch_size': model_params["TRAIN_BATCH_SIZE"],
      'shuffle': True,
      'num_workers': 0
      }


  val_params = {
      'batch_size': model_params["VALID_BATCH_SIZE"],
      'shuffle': False,
      'num_workers': 0
      }


  # Creation of Dataloaders for training and validation. These are used below in the training and validation stages.
  training_loader = DataLoader(training_set, **train_params)
  val_loader = DataLoader(val_set, **val_params)


  # Defining the optimizer that will be used to tune the weights of the network in the training session. 
  optimizer = torch.optim.Adam(params =  model.parameters(), lr=model_params["LEARNING_RATE"])


  # Training loop
  console.log(f'[Initiating Fine Tuning]...\n')

  for epoch in range(model_params["TRAIN_EPOCHS"]):
      train(epoch, tokenizer, model, device, training_loader, optimizer)
      
  console.log(f"[Saving Model]...\n")
  #Saving the model after training
  path = os.path.join(output_dir, "model_files2")
  model.save_pretrained(path)
  tokenizer.save_pretrained(path)


  # Generate predictions on the held-out validation split
  console.log(f"[Initiating Validation]...\n")
  final_df = None
  for epoch in range(model_params["VAL_EPOCHS"]):
    predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
    final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
    final_df.to_csv(os.path.join(output_dir,'predictions.csv'))
  
  console.save_text(os.path.join(output_dir,'logs.txt'))
  
  console.log(f"[Validation Completed.]\n")
  console.print(f"""[Model] Model saved @ {os.path.join(output_dir, "model_files2")}\n""")
  console.print(f"""[Validation] Generation on Validation data saved @ {os.path.join(output_dir,'predictions.csv')}\n""")
  console.print(f"""[Logs] Logs saved @ {os.path.join(output_dir,'logs.txt')}\n""")

  # Return only after logging; returning inside the loop left the lines above unreachable
  return final_df

In [None]:
model_params={
    "MODEL":"t5-base",             # model_type: t5-base/t5-large
    "TRAIN_BATCH_SIZE":8,          # training batch size
    "VALID_BATCH_SIZE":8,          # validation batch size
    "TRAIN_EPOCHS":4,              # number of training epochs
    "VAL_EPOCHS":1,                # number of validation epochs
    "LEARNING_RATE":1e-4,          # learning rate
    "MAX_SOURCE_TEXT_LENGTH":512,  # max length of source text
    "MAX_TARGET_TEXT_LENGTH":50,   # max length of target text
    "SEED": 42                     # set seed for reproducibility 

}

In [None]:
#runs the fine-tuning
final = T5Trainer(dataframe=df[:50000], source_text="source_raw", target_text="target_raw", model_params=model_params, output_dir="best_outputs")
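
For a sense of the run length: the call above passes 50,000 rows, `T5Trainer` keeps 80% of them for training, and `model_params` sets a batch size of 8 and 4 epochs. A small sketch of the resulting optimizer step count, assuming one step per batch as in `train` and a final partial batch that is kept; `steps_per_epoch` is a hypothetical helper:

```python
import math

def steps_per_epoch(n_rows, train_frac, batch_size):
    """Number of batches in one pass over the training split."""
    n_train = round(n_rows * train_frac)
    return math.ceil(n_train / batch_size)

per_epoch = steps_per_epoch(n_rows=50000, train_frac=0.8, batch_size=8)
total = per_epoch * 4  # TRAIN_EPOCHS
print(per_epoch, total)  # 5000 20000
```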

In [None]:
#if using double fine tuning, this is first fine-tune
#final = T5Trainer(dataframe=class_df[:1000], source_text="sentence", target_text="class", model_params=model_params, output_dir="outputs4")

In [None]:
#if using double fine-tuning, this is our second fine tune
#final2 = T5Trainer2(dataframe=df[:500], source_text="source_raw", target_text="target_raw", model_params=model_params, output_dir="outputs4")
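
In the double fine-tuning flow both stages share one `output_dir`: `T5Trainer` saves to `model_files`, and `T5Trainer2` loads from there and saves to `model_files2`. A sketch of that layout, where `stage_paths` is a hypothetical helper used only for illustration:

```python
import os

def stage_paths(output_dir):
    """Directories read and written by the two fine-tuning stages."""
    first = os.path.join(output_dir, "model_files")
    second = os.path.join(output_dir, "model_files2")
    return {"stage1_save": first, "stage2_load": first, "stage2_save": second}

paths = stage_paths("outputs4")
for name, path in paths.items():
    print(name, path)
```

Because both functions also write `predictions.csv` into the same `output_dir`, the second stage overwrites the first stage's predictions file.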

In [None]:
final

In [None]:
#if using double fine tuning, this is the final sentence prediction df
#final2

In [None]:
final['accuracy'] = np.where(final['Generated Text'] == final['Actual Text'],1,0)
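
The `accuracy` column is an exact-match score: a prediction counts only if the generated string equals the reference character for character. A dependency-free sketch with made-up sentence pairs; `exact_match_accuracy` is a hypothetical helper:

```python
def exact_match_accuracy(generated, actual):
    """Fraction of positions where the generated string equals the reference."""
    if len(generated) != len(actual):
        raise ValueError("generated and actual must have the same length")
    if not generated:
        return 0.0
    hits = sum(g == a for g, a in zip(generated, actual))
    return hits / len(generated)

generated = ["the film is set in 1979", "he returned home", "a plan to rebuild"]
actual = ["the film is set in 1979", "when he returned home", "a plan to rebuild"]
print(exact_match_accuracy(generated, actual))
```

Exact match is strict: an output that differs from the reference by a single comma or space scores 0, so it tends to understate how close the generations are.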

In [None]:
#for double fine-tuning
#final2['accuracy'] = np.where(final2['Generated Text'] == final2['Actual Text'],1,0)

In [None]:
final.to_csv("50000_4epochs.csv")

In [None]:
print('Final Accuracy: ', final['accuracy'].mean())

In [None]:
#for double fine tuning
#print('Final Accuracy for classification: ', final['accuracy'].mean())

In [None]:
final

Unnamed: 0,Generated Text,Actual Text,accuracy,BLEURT,BLEU
0,vocalist rob halford's performance is consider...,vocalist rob halford's performance is consider...,1,0.876450,0.647408
1,"the church teaches that god the father, jesus,...","the church teaches that god the father, jesus ...",0,0.541272,0.592926
2,"in 1970, curiel directed mil in two of his mos...","in 1970, curiel directed mil in two of his tea...",0,0.711075,0.769902
3,"he returned, his mother told him his stepfathe...","when he returned, his mother told him his step...",0,0.415915,0.721795
4,"under fire is a film set in 1979, during the l...","under fire is a political film set in 1979, du...",0,0.561104,0.712742
...,...,...,...,...,...
9995,"monbiot asserts that climate change is the ""mo...",monbiot has written that climate change is the...,0,0.585672,0.606731
9996,", commonly referred to as the gospel of john o...","the gospel according to john (greek ), commonl...",0,0.070446,0.680904
9997,the south african farm attacks refer to the fa...,the south african farm attacks refer to the cl...,0,0.657578,0.590260
9998,"christian forces conquered buda, and in the ne...","the christian forces seized buda, and in the n...",0,0.483363,0.649202


In [None]:
final.to_csv("50000_lr_3e-4.csv")