## Notebook for fine-tuning a T5 model on a bias-neutralization task

Adapted from https://shivanandroy.com/fine-tune-t5-transformer-with-pytorch/

In [None]:
%%capture
%%bash
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git-lfs install

wget https://storage.googleapis.com/bleurt-oss/bleurt-base-512.zip 
unzip bleurt-base-512.zip

In [None]:
%%capture
!pip install sentencepiece
!pip install transformers
!pip install rich[jupyter]
!pip install huggingface_hub
!pip install wandb
!pip install git+https://github.com/google-research/bleurt.git

In [None]:
# login to huggingface
from huggingface_hub import notebook_login
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [None]:
import wandb
wandb.login()


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
!git clone https://github.com/erickfm/bias-neutralization.git

Cloning into 'bias-neutralization'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 20 (delta 5), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (20/20), done.


In [None]:
# Importing libraries
import os
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
import uuid
from bleurt import score
from nltk.translate.bleu_score import sentence_bleu

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

from rich.table import Column, Table
from rich import box
from rich.console import Console

# Set BLEURT model for bleurt scoring
bleurt_scorer = score.BleurtScorer('bleurt-base-512')

# define a rich console logger
console=Console(record=True)

def display_df(df):
  """display dataframe in ASCII format"""

  console=Console()
  table = Table(Column("source_text", justify="center" ), Column("target_text", justify="center"), title="Sample Data",pad_edge=False, box=box.ASCII)

  for i, row in enumerate(df.values.tolist()):
    table.add_row(row[0], row[1])

  console.print(table)

training_logger = Table(Column("Epoch", justify="center" ), 
                        Column("Steps", justify="center"),
                        Column("Loss", justify="center"), 
                        title="Training Status",pad_edge=False, box=box.ASCII)

# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

INFO:tensorflow:Reading checkpoint bleurt-base-512.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:512
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.


In [None]:
# Set model checkpoint
model_checkpoint = 't5-base'

# Sweep Setup
sweep_config = {'method': 'random',
 'metric': {'goal': 'maximize', 'name': 'accuracy'},
 'parameters': {'batch_size': {'values': [4]},
                'epochs': {'values': [4]},
                'learning_rate': {'values': [1e-4,6e-4]}}}
sweep_id = wandb.sweep(sweep_config, project=model_checkpoint)

Create sweep with ID: umojid6j
Sweep URL: https://wandb.ai/unbias/t5-base/sweeps/umojid6j
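
Because every parameter in `sweep_config` is a list of discrete `values`, the search space is simply their Cartesian product. A minimal sketch, assuming only the dictionary defined above (the helper name `enumerate_configs` is hypothetical and not part of W&B), counts the distinct configurations a `random` sweep can draw from:

```python
from itertools import product

# Same parameter block as sweep_config['parameters'] above
parameters = {
    'batch_size': {'values': [4]},
    'epochs': {'values': [4]},
    'learning_rate': {'values': [1e-4, 6e-4]},
}

def enumerate_configs(parameters):
    """Hypothetical helper: list every distinct combination of discrete sweep values."""
    names = sorted(parameters)
    value_lists = [parameters[name]['values'] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

configs = enumerate_configs(parameters)
for cfg in configs:
    print(cfg)
print(len(configs), "distinct configurations")
```

With only two distinct configurations, an agent asked for more than two runs can only revisit settings it has already tried.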


In [None]:
class YourDataSetClass(Dataset):
  """
  Custom dataset that tokenizes source/target text pairs so a DataLoader
  can feed them to the model for fine-tuning.
  """

  def __init__(self, dataframe, tokenizer, source_len, target_len, source_text, target_text):
    self.tokenizer = tokenizer
    self.data = dataframe
    self.source_len = source_len
    self.summ_len = target_len
    self.target_text = self.data[target_text]
    self.source_text = self.data[source_text]

  def __len__(self):
    return len(self.target_text)

  def __getitem__(self, index):
    source_text = str(self.source_text[index])
    target_text = str(self.target_text[index])

    # collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    source_text = ' '.join(source_text.split())
    target_text = ' '.join(target_text.split())

    # padding="max_length" already pads to max_length; the deprecated pad_to_max_length=True is not needed
    source = self.tokenizer.batch_encode_plus([source_text], max_length=self.source_len, truncation=True, padding="max_length", return_tensors='pt')
    target = self.tokenizer.batch_encode_plus([target_text], max_length=self.summ_len, truncation=True, padding="max_length", return_tensors='pt')

    source_ids = source['input_ids'].squeeze()
    source_mask = source['attention_mask'].squeeze()
    target_ids = target['input_ids'].squeeze()
    target_mask = target['attention_mask'].squeeze()

    return {
        'source_ids': source_ids.to(dtype=torch.long), 
        'source_mask': source_mask.to(dtype=torch.long), 
        'target_ids': target_ids.to(dtype=torch.long),
        'target_ids_y': target_ids.to(dtype=torch.long)
    }
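
The two preprocessing steps in `__getitem__`, whitespace normalization and padding or truncating to a fixed length, can be sketched without loading a tokenizer. The token ids and the `pad_or_truncate` helper below are hypothetical stand-ins for what `batch_encode_plus(..., truncation=True, padding="max_length")` returns. A pad id of 0 matches T5's pad token, but a real tokenizer also handles special tokens such as end-of-sequence, which this sketch ignores.

```python
def normalize_whitespace(text):
    """Collapse tabs, newlines and repeated spaces into single spaces."""
    return ' '.join(str(text).split())

def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical stand-in for fixed-length encoding: returns (ids, attention_mask)."""
    ids = list(token_ids)[:max_length]       # truncate if too long
    mask = [1] * len(ids)                    # 1 marks real tokens
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

print(normalize_whitespace("neutralize bias:  some\ttext\n here "))
ids, mask = pad_or_truncate([37, 1712, 19, 1], max_length=6)  # made-up token ids
print(ids)
print(mask)
```

Every item therefore has the same length, which is what lets the `DataLoader` stack items into a batch tensor.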

In [None]:
def train(epoch, tokenizer, model, device, loader, optimizer):

  """
  Run one training epoch over `loader`; called from T5Trainer.
  The most recently logged loss is stored on `train.loss`.
  """

  model.train()
  for step, data in enumerate(loader, 0):
    y = data['target_ids'].to(device, dtype = torch.long)
    # decoder inputs drop the last token, labels drop the first
    y_ids = y[:, :-1].contiguous()
    lm_labels = y[:, 1:].clone().detach()
    # labels set to -100 are ignored by the loss, so padding does not contribute
    lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
    ids = data['source_ids'].to(device, dtype = torch.long)
    mask = data['source_mask'].to(device, dtype = torch.long)

    outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
    loss = outputs[0]

    # log every 1000 steps; .item() keeps a plain float instead of a tensor
    if step%1000==0:
      training_logger.add_row(str(epoch), str(step), str(loss.item()))
      console.print(training_logger)
      train.loss = loss.item()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In [None]:
def validate(epoch, tokenizer, model, device, loader):

  """
  Generate a prediction for every example in `loader` and return
  (predictions, actuals) as two lists of decoded strings.
  """
  model.eval()
  predictions = []
  actuals = []
  with torch.no_grad():
      for _, data in enumerate(loader, 0):
          y = data['target_ids'].to(device, dtype = torch.long)
          ids = data['source_ids'].to(device, dtype = torch.long)
          mask = data['source_mask'].to(device, dtype = torch.long)

          generated_ids = model.generate(
              input_ids = ids,
              attention_mask = mask, 
              max_length=70, 
              num_beams=3,
              repetition_penalty=2.5, 
              length_penalty=1.0, 
              #num_return_sequences=3,
              early_stopping=True
              )
          
          preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
          target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
          if _%1000==0:
              console.print(f'Completed {_}')

          predictions.extend(preds)
          actuals.extend(target)
  return predictions, actuals

In [None]:
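def T5Trainer(config=None):

  """
  Fine-tune and evaluate T5 for one W&B sweep run, then save the model
  locally and try to push it to the Hugging Face Hub.
  """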

  # load df
  df = pd.read_csv('/content/bias-neutralization/data/biased.word.train.csv')
  df=df[["id", "source_tokenized", "target_tokenized", "source_raw", "target_raw","POS","POS_2"]]
  df["source_raw"] = "neutralize bias: " + df["source_raw"]
  dataframe = df

  # set static model params
  source_text= "source_raw"
  target_text= "target_raw"
  output_dir= "./outputs/"
  model_params={
    "MODEL":model_checkpoint,      # model_type
    "MAX_SOURCE_TEXT_LENGTH":512,  # max length of source text
    "MAX_TARGET_TEXT_LENGTH":70,   # max length of target text
    "SEED": 42,                    # set seed for reproducibility 
  }

  # Track training with weights and biases
  # Initialize a new wandb run
  with wandb.init(config=config):

    # Allow Agent to set config
    config = wandb.config  
    model_params["MODEL_NAME"] = wandb.run.name

    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(model_params["SEED"]) # pytorch random seed
    np.random.seed(model_params["SEED"]) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # logging
    console.log(f"""[Model]: Loading {model_params["MODEL"]}...\n""")

    # tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained(model_params["MODEL"])

    # Define the model (t5-base here) and move it to the selected device
    # (GPU if available, otherwise CPU).
    model = T5ForConditionalGeneration.from_pretrained(model_params["MODEL"])
    model = model.to(device)
    
    # logging
    console.log(f"[Data]: Reading data...\n")

    # Keep only the source and target text columns
    dataframe = dataframe[[source_text,target_text]]
    
    # Creation of Dataset and Dataloader
    # Defining the train size. So 80% of the data will be used for training and the rest for validation. 
    train_size = 0.8
    train_dataset=dataframe.sample(frac=train_size,random_state = model_params["SEED"])
    val_dataset=dataframe.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

    console.print(f"FULL Dataset: {dataframe.shape}")
    console.print(f"TRAIN Dataset: {train_dataset.shape}")
    console.print(f"VALIDATION Dataset: {val_dataset.shape}\n")


    # Creating the Training and Validation dataset for further creation of Dataloader
    training_set = YourDataSetClass(train_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)
    val_set = YourDataSetClass(val_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)


    # Defining the parameters for creation of dataloaders
    train_params = {
        'batch_size': config['batch_size'], 
        'shuffle': True,
        'num_workers': 0
        }


    val_params = {
        'batch_size': config['batch_size'],
        'shuffle': False,
        'num_workers': 0
        }


    # Creation of Dataloaders for training and validation. These are used in the training and validation stages below.
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)


    # Defining the optimizer that will be used to tune the weights of the network in the training session. 
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=config['learning_rate'])


    # Training loop
    console.log(f'[Initiating Fine Tuning]...\n')

    for epoch in range(config['epochs']):
        train(epoch, tokenizer, model, device, training_loader, optimizer)
        loss = train.loss

        predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
        final = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
        final['accuracy'] = np.where(final['Generated Text'] == final['Actual Text'],1,0)
        accuracy = final['accuracy'].mean()

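        # NOTE: sentence_bleu expects token lists; raw strings are iterated character by character,
        # so this is a character-level BLEU. Pass row[...].split() for a word-level score.
        final['BLEU'] = final.apply(lambda row: sentence_bleu([row['Actual Text']],row['Generated Text']), axis=1)
        bleu = final["BLEU"].mean()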

        final['BLEURT'] = bleurt_scorer.score(references=final['Actual Text'],candidates=final['Generated Text'])
        bleurt = final["BLEURT"].mean()

        # Try to log metrics to W&B
        try:
          wandb.log({"loss": loss, "accuracy": accuracy, "BLEU": bleu, "BLEURT": bleurt})
          # Optional
          wandb.watch(model)
        except Exception as e:
          console.print(e)
        
    console.log(f"[Saving Model]...\n")
    #Saving the model after training
    path = os.path.join(output_dir, "model_files")
    model.save_pretrained(path)
    tokenizer.save_pretrained(path)
    console.print('saved to:',path)
    
    # Try to push to hub; on failure, retry once with a temporary directory
    try:
      console.print('[Pushing to Hub]...\n')
      model.push_to_hub(model_params["MODEL_NAME"])
      tokenizer.push_to_hub(model_params["MODEL_NAME"])
    except Exception:
      try:
        console.print('Pushing FAILED\n')
        console.print('[Retrying with Temp Dir]...\n')
        model.push_to_hub(model_params["MODEL_NAME"],use_temp_dir=True)
        tokenizer.push_to_hub(model_params["MODEL_NAME"],use_temp_dir=True)
      except Exception as e:
        console.print('Pushing FAILED.')
        console.print(e)

    # Use Garbage Collector to collect and then empty gpu memory cache to help CUDA out of memory issue
    import gc
    gc.collect()
    torch.cuda.empty_cache()
    
    console.save_text(os.path.join(output_dir,'logs.txt'))
    
    console.log(f"[Validation Completed.]\n")
    console.print(f"""[Model] Model saved @ {os.path.join(output_dir, "model_files")}\n""")
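    # write the last epoch's validation generations so the message below is accurate
    final.to_csv(os.path.join(output_dir,'predictions.csv'), index=False)
    console.print(f"""[Validation] Generation on Validation data saved @ {os.path.join(output_dir,'predictions.csv')}\n""")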
    console.print(f"""[Logs] Logs saved @ {os.path.join(output_dir,'logs.txt')}\n""")
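
One detail of the BLEU computation in `T5Trainer` is worth illustrating: `sentence_bleu` receives raw strings, and Python iterates a string character by character, so the n-grams being compared are character n-grams. The sketch below is not NLTK's implementation. `ngram_precision` is a hypothetical, simplified clipped n-gram precision (no brevity penalty, no smoothing), and the two sentences are invented; it only shows how much the choice of tokens changes the score.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a sequence (works for lists and strings)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(reference, hypothesis, n):
    """Hypothetical helper: clipped n-gram precision of hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # each hypothesis n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

reference = "the senator proposed a new bill"
hypothesis = "the senator pushed a new bill"

# strings are iterated as characters, lists of words as words
char_level = ngram_precision(reference, hypothesis, 2)
word_level = ngram_precision(reference.split(), hypothesis.split(), 2)
print(f"character-level bigram precision: {char_level:.3f}")
print(f"word-level bigram precision:      {word_level:.3f}")
```

Passing `row['Actual Text'].split()` and `row['Generated Text'].split()` to `sentence_bleu` would score word n-grams instead; the BLEU values logged in this notebook were produced with string inputs.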

In [None]:
# Runs Sweep
wandb.agent(sweep_id, T5Trainer, count=36)

wandb: Agent Starting Run: 9ch91d0f with config:
wandb: 	batch_size: 4
wandb: 	epochs: 4
wandb: 	learning_rate: 0.0006
wandb: Currently logged in as: erickfm (unbias). Use `wandb login --relogin` to force relogin


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.



The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Cloning https://huggingface.co/erickfm/proud-sweep-1 into local empty directory.

To https://huggingface.co/erickfm/proud-sweep-1
   df38e16..0672e85  main -> main

To https://huggingface.co/erickfm/proud-sweep-1
   0672e85..0f94280  main -> main


Run history:
BLEU      █▄▃▁
BLEURT    █▄▂▁
accuracy  █▅▁▂
loss      █▂▅▁

Run summary:
BLEU      0.65748
BLEURT    0.20252
accuracy  0.08913
loss      0.02091


wandb: Agent Starting Run: uzfya7xe with config:
wandb: 	batch_size: 4
wandb: 	epochs: 4
wandb: 	learning_rate: 0.0001




Cloning https://huggingface.co/erickfm/zesty-sweep-2 into local empty directory.

To https://huggingface.co/erickfm/zesty-sweep-2
   dbe8a3a..639c23f  main -> main

To https://huggingface.co/erickfm/zesty-sweep-2
   639c23f..b3e841c  main -> main


Run history:
BLEU      ▁▂█▃
BLEURT    ▆▆█▁
accuracy  ▇██▁
loss      █▃▄▁

Run summary:
BLEU      0.93788
BLEURT    0.81214
accuracy  0.31673
loss      0.01187


wandb: Agent Starting Run: lsye64oo with config:
wandb: 	batch_size: 4
wandb: 	epochs: 4
wandb: 	learning_rate: 0.0001


