# Group Project - Advanced Model

Divam Arora, Connor Moore, Hemanth Velan

DSBA 6165

### Sources:
* https://huggingface.co/datasets/gigaword
* https://huggingface.co/docs/datasets/v1.11.0/splits.html
* https://huggingface.co/docs/datasets/process#export
* https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
* https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
* https://www.geeksforgeeks.org/string-punctuation-in-python/
* https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
* https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
* https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/

First we need to re-run the code from our EDA/pre-processing notebook that loads and prepares our dataset for implementation.

In [1]:
# import needed packages
import re
import nltk
import time
import torch
import math
import random
import string
import evaluate
import argparse
import numpy as np
import pandas as pd
import datasets as ds
from evaluate import load
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset, RandomSampler
from transformers import BartTokenizer, BartForConditionalGeneration

# download stop word package from nltk library
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Dataset
Because our dataset is pulled directly from Huggingface's datasets library, there is no need for a local copy of the data. Running the cell below creates an instance of the specified dataset in your workspace environment.

In [2]:
# https://huggingface.co/datasets/gigaword
# https://huggingface.co/docs/datasets/v1.11.0/splits.html

# download gigaword dataset from Hugging Face dataset library
train, test, validation = ds.load_dataset("gigaword", split=["train", "test", "validation"])

In [3]:
# display the dataset splits
print(train)
print(test)
print(validation)

Dataset({
    features: ['document', 'summary'],
    num_rows: 3803957
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 1951
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 189651
})


In [4]:
# https://huggingface.co/docs/datasets/process#export

# export the training dataset to a pandas dataframe and display
df_train = train.to_pandas()
print("Train df exported.")

# export the test dataset to a pandas dataframe
df_test = test.to_pandas()
print("Test df exported.")

# export the validation dataset to a pandas dataframe
df_val = validation.to_pandas()
print("Validation df exported.")

Train df exported.
Test df exported.
Validation df exported.


### Balancing the train-test split
The standard provided division between train, test, and validation is extremely unbalanced towards train (95%), and the dataset overall is far too large to run through our model in a reasonable timespan. We decided to shrink the train set to 70,000 entries, and concat the provided test and validation sets. From that combined test-val set we will extract a 25,000-entry test set and a 5,000 entry validation set.

In [5]:
# select 70,000 rows randomly from the train dataframe

df_train_short = df_train.sample(n = 70000, random_state=2, ignore_index=True)

df_train_short

Unnamed: 0,document,summary
0,a british soldier was killed saturday by an ex...,british soldier killed in afghanistan blast
1,ukraine insists on building two new nuclear re...,ukraine insists on linking chernobyl closure t...
2,portuguese president mario soares will pay an ...,portugal 's president to visit angola next month
3,aol stepped up its transformation from interne...,aol introduces new advertising network plans t...
4,marine experts from wwf flew to the northern k...,suspected toxic algae bloom leaves thousands o...
...,...,...
69995,hong kong 's benchmark hang seng index ended h...,hong kong stocks edged up after four straight ...
69996,former brazil coach carlos alberto parreira sa...,parreira says he 's close to an agreement to c...
69997,around ## youths on thursday protested outside...,latvian youths protest ban of UNK symbols
69998,ohio 's method of putting prisoners to death i...,ohio judge says state s lethal injection proce...


In [6]:
# combine provided test and val sets and reseparate randomly into smaller subsets

# concat test and validation sets
test_val = [df_test, df_val]
df_testval_bulk = pd.concat(test_val)

# take a random sample of 30000 rows from the test and validation bulk set
df_testval_short = df_testval_bulk.sample(n = 30000, random_state=3, ignore_index=True)

# take a random 5000 row sample from the test-val subset
df_val_short = df_testval_short.sample(n = 5000, random_state=4, ignore_index=True)

# drop all rows taken for the validation sample from the test-val subset to create the test set
df_test_short = df_testval_short.drop(df_val_short.index, axis=0)

### Data Pre-Processing
We decided to truncate our pre-processing pipeline slightly from our original model and submission because BART models are designed to accept full, grammatically correct sentences, so we thought passing more "normal" text may give the model better context and improve training.

In [7]:
# the methods required to perform this function were found in this article -
# https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
# the function and comments are our original work

# set all words in all rows to lower case

def lower(df):
    # vectorize strings in each row in summary column and set to lower case
    df["summary"] = df["summary"].str.lower()
    print("summary column lowercased")
    # vectorize strings in each row in document column and set to lower case
    df["document"] = df["document"].str.lower()
    print("document column lowercased")

In [8]:
# geeks for geeks and pandas doc pages were used as template source code and informed about parameter options
# stackoverflow posts helped with debugging issues
# https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# https://www.geeksforgeeks.org/string-punctuation-in-python/
# https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
# comments and function are our original work, source code was modifed to fit our workspace

# remove all symbols and punctuation

# create instance of all punctuation symbols
punctuation = string.punctuation

# since we learned there are lots of apostrophe s in the dataset during EDA, we will add this to our remove list
punct_list = ["'s"]

# add all punctuation from the premade variable to our new list
for symbol in punctuation:
    punct_list.append(symbol)

# display the symbols included in our list
print(punct_list)

def remove_punctuation(df):
    # for each symbol in our punctuation list
    for symbol in punct_list:
        # iterate through the dataframe and replace every instance of the symbol with an empty string
        df["document"] = df["document"].str.replace(symbol, "", regex=False)
        df["summary"] = df["summary"].str.replace(symbol, "", regex=False)
    print("symbols removed")

["'s", '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [9]:
# create data pre-processing pipeline

def pre_proc(df):
    # lowercase
    lower(df)
    # remove punctuation and symbols
    remove_punctuation(df)
    print("pre-processed successfully")

In [10]:
# call the data pre-processing pipeline for each of the dataset splits

pre_proc(df_train_short)
print("train df completed")
pre_proc(df_val_short)
print("test df completed")
pre_proc(df_test_short)
print("validation df completed")

# display new format of data using training set
df_train_short.head()

summary column lowercased
document column lowercased
symbols removed
pre-processed successfully
train df completed
summary column lowercased
document column lowercased
symbols removed
pre-processed successfully
test df completed
summary column lowercased
document column lowercased
symbols removed
pre-processed successfully
validation df completed


Unnamed: 0,document,summary
0,a british soldier was killed saturday by an ex...,british soldier killed in afghanistan blast
1,ukraine insists on building two new nuclear re...,ukraine insists on linking chernobyl closure t...
2,portuguese president mario soares will pay an ...,portugal president to visit angola next month
3,aol stepped up its transformation from interne...,aol introduces new advertising network plans t...
4,marine experts from wwf flew to the northern k...,suspected toxic algae bloom leaves thousands o...


Our dataset splits are now pre-processed and ready for use with models.

# adv model writeup goes here
(from canvas) "Write about how your advanced model is different from your baseline model. Why did you choose the model architecture ? What evidence from the previous model milestone did you use to drive your decision making? Write at least 100 words."

Attempting to optimize model using this article - https://towardsdatascience.com/teaching-bart-to-rap-fine-tuning-hugging-faces-bart-model-41749d38f3ef

and this notebook - https://colab.research.google.com/drive/1Cy27V-7qqYatqMA7fEqG2kgMySZXw9I4?usp=sharing&pli=1#scrollTo=t77cjYY_fZlb

everything in cells below is from that source notebook, i modified it to fit our workspace and changed some of the parameters/functions to make sense for our application

In [45]:
# Load the model

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base', add_prefix_space=True)

bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

In [46]:
def shift_tokens_right(input_ids, pad_token_id):
  """ Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).
      This is taken directly from modeling_bart.py
  """
  prev_output_tokens = input_ids.clone()
  index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
  prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
  prev_output_tokens[:, 1:] = input_ids[:, :-1]
  return prev_output_tokens

def encode_sentences(tokenizer, source_sentences, target_sentences, max_length=32, pad_to_max_length=True, return_tensors="pt"):
  ''' Function that tokenizes a sentence 
      Args: tokenizer - the BART tokenizer; source and target sentences are the source and target sentences
      Returns: Dictionary with keys: input_ids, attention_mask, target_ids
  '''

  input_ids = []
  attention_masks = []
  target_ids = []
  tokenized_sentences = {}

  for sentence in source_sentences:
    encoded_dict = tokenizer(
          sentence,
          max_length=max_length,
          padding="max_length" if pad_to_max_length else None,
          truncation=True,
          return_tensors=return_tensors,
          add_prefix_space = True
      )

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

  input_ids = torch.cat(input_ids, dim = 0)
  attention_masks = torch.cat(attention_masks, dim = 0)

  for sentence in target_sentences:
    encoded_dict = tokenizer(
          sentence,
          max_length=max_length,
          padding="max_length" if pad_to_max_length else None,
          truncation=True,
          return_tensors=return_tensors,
          add_prefix_space = True
      )
    # Shift the target ids to the right
    # shifted_target_ids = shift_tokens_right(encoded_dict['input_ids'], tokenizer.pad_token_id)
    target_ids.append(encoded_dict['input_ids'])

  target_ids = torch.cat(target_ids, dim = 0)
  

  batch = {
      "input_ids": input_ids,
      "attention_mask": attention_masks,
      "labels": target_ids,
  }

  return batch


def noise_sentence(sentence_, percent_words, replacement_token = "<mask>"):
  '''
  Function that noises a sentence by adding <mask> tokens
  Args: sentence - the sentence to noise
        percent_words - the percent of words to replace with <mask> tokens; the number is rounded up using math.ceil
  Returns a noised sentence
  '''
  # Create a list item and copy
  sentence_ = sentence_.split(' ')
  sentence = sentence_.copy()
  
  num_words = math.ceil(len(sentence) * percent_words)
  
  # Create an array of tokens to sample from; don't include the last word as an option because in the case of lyrics
  # that word is often a rhyming word and plays an important role in song construction
  sample_tokens = set(np.arange(0, np.maximum(1, len(sentence)-1)))
  
  words_to_noise = random.sample(sample_tokens, num_words)
  
  # Swap out words, but not full stops
  for pos in words_to_noise:
      if sentence[pos] != '.':
          sentence[pos] = replacement_token
  
  # Remove redundant spaces
  sentence = re.sub(r' {2,5}', ' ', ' '.join(sentence))
  
  # Combine concurrent <mask> tokens into a single token; this just does two rounds of this; more could be done
  sentence = re.sub(r'<mask> <mask>', "<mask>", sentence)
  sentence = re.sub(r'<mask> <mask>', "<mask>", sentence)
  return sentence
  

In [47]:
# Create the hparams dictionary to pass in the model
# I realise that this isn't really how this is meant to be used, but having this here reminds me that I can edit it when I need
params = argparse.Namespace()

params.freeze_encoder = True
params.freeze_embeds = True
params.eval_beams = 4

In [48]:
class LitModel(pl.LightningModule):
  # Instantiate the model
  def __init__(self, learning_rate, tokenizer, model, params):
    super().__init__()
    self.tokenizer = tokenizer
    self.model = model
    self.learning_rate = learning_rate
    # self.freeze_encoder = freeze_encoder
    # self.freeze_embeds_ = freeze_embeds
    self.params = params

    if self.params.freeze_encoder:
      freeze_params(self.model.get_encoder())

    if self.params.freeze_embeds:
      self.freeze_embeds()
  
  def freeze_embeds(self):
    ''' freeze the positional embedding parameters of the model; adapted from finetune.py '''
    freeze_params(self.model.model.shared)
    for d in [self.model.model.encoder, self.model.model.decoder]:
      freeze_params(d.embed_positions)
      freeze_params(d.embed_tokens)

  # Do a forward pass through the model
  def forward(self, input_ids, **kwargs):
    return self.model(input_ids, **kwargs)
  
  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr = self.learning_rate)
    return optimizer

  def training_step(self, batch, batch_idx):
    # Load the data into variables
    src_ids, src_mask = batch[0], batch[1]
    tgt_ids = batch[2]
    # Shift the decoder tokens right (but NOT the tgt_ids)
    decoder_input_ids = shift_tokens_right(tgt_ids, tokenizer.pad_token_id)

    # Run the model and get the logits
    outputs = self(src_ids, attention_mask=src_mask, decoder_input_ids=decoder_input_ids, use_cache=False)
    lm_logits = outputs[0]
    # Create the loss function
    ce_loss_fct = torch.nn.CrossEntropyLoss(ignore_index=self.tokenizer.pad_token_id)
    # Calculate the loss on the un-shifted tokens
    loss = ce_loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), tgt_ids.view(-1))

    return {'loss':loss}

  def validation_step(self, batch, batch_idx):

    src_ids, src_mask = batch[0], batch[1]
    tgt_ids = batch[2]

    decoder_input_ids = shift_tokens_right(tgt_ids, tokenizer.pad_token_id)
    
    # Run the model and get the logits
    outputs = self(src_ids, attention_mask=src_mask, decoder_input_ids=decoder_input_ids, use_cache=False)
    lm_logits = outputs[0]

    ce_loss_fct = torch.nn.CrossEntropyLoss(ignore_index=self.tokenizer.pad_token_id)
    val_loss = ce_loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), tgt_ids.view(-1))

    return {'loss': val_loss}
  
  # Method that generates text using the BartForConditionalGeneration's generate() method
  def generate_text(self, inputs, max_length, min_length, length_penalty, num_beams, early_stopping=True):
    ''' Function to generate text '''
    generated_id = self.model.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=length_penalty, num_beams=num_beams, early_stopping=early_stopping)
    return generated_id

def freeze_params(model):
  ''' Function that takes a model as input (or part of a model) and freezes the layers for faster training
      adapted from finetune.py '''
  for layer in model.parameters():
    layer.requires_grade = False

In [49]:
# Create a dataloading module as per the PyTorch Lightning Docs
class SummaryDataModule(pl.LightningDataModule):
  def __init__(self, tokenizer, train, test, validate, batch_size):
    super().__init__()
    self.tokenizer = tokenizer
    self.batch_size = batch_size
    self.train = train
    self.test = test
    self.validate = validate

  # encode the sentences using the tokenizer  
  def setup(self, stage):
    self.train = encode_sentences(self.tokenizer, self.train['document'], self.train['summary'])
    self.validate = encode_sentences(self.tokenizer, self.validate['document'], self.validate['summary'])
    self.test = encode_sentences(self.tokenizer, self.test['document'], self.test['summary'])

  # Load the training, validation and test sets in Pytorch Dataset objects
  def train_dataloader(self):
    dataset = TensorDataset(self.train['input_ids'], self.train['attention_mask'], self.train['labels'])                          
    train_data = DataLoader(dataset, num_workers=7, persistent_workers=True, sampler = RandomSampler(dataset), batch_size = self.batch_size)
    return train_data

  def val_dataloader(self):
    dataset = TensorDataset(self.validate['input_ids'], self.validate['attention_mask'], self.validate['labels']) 
    val_data = DataLoader(dataset, num_workers=7, persistent_workers=True, batch_size = self.batch_size)                       
    return val_data

  def test_dataloader(self):
    dataset = TensorDataset(self.test['input_ids'], self.test['attention_mask'], self.test['labels']) 
    test_data = DataLoader(dataset, num_workers=7, persistent_workers=True, batch_size = self.batch_size)                   
    return test_data

In [50]:
# Load the data into the model for training
summary_data = SummaryDataModule(tokenizer, df_train_prac, df_test_prac, df_val_prac, batch_size = 16)

model = LitModel(learning_rate=2e-5, tokenizer=tokenizer, model=bart_model, params=params)

In [51]:
trainer = pl.Trainer(logger=False,
                     max_epochs = 1,
                     min_epochs = 1,
                     enable_model_summary=True,
                     enable_progress_bar=True
                     )

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [52]:
# Fit the instantiated model to the data
trainer.fit(model, summary_data)

C:\Users\cmoor197\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:639: Checkpoint directory c:\Users\cmoor197\Documents\ai_work\text_summarization\checkpoints exists and is not empty.

  | Name  | Type                         | Params
-------------------------------------------------------
0 | model | BartForConditionalGeneration | 139 M 
-------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
557.682   Total estimated model params size (MB)


Epoch 0:   0%|          | 2/4375 [08:05<294:58:09,  0.00it/s]              
Epoch 0: 100%|██████████| 5/5 [01:34<00:00,  0.05it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.


Epoch 0: 100%|██████████| 5/5 [01:44<00:00,  0.05it/s]


end of using that source notebook - trying out the trained model 

In [53]:
# NEW VERSION OF RUNBART with no NA outputs


def runBart(df):

    # Empty lists for predictions and performance timestamps
    predictions = []
    times = []

    # For the number of rows in the given dataframe
    for i in range(len(df)):
        # Create a start timestamp
        start = time.perf_counter()

        # Create a document instance using the row's entry for the stringified document
        doc = df.iloc[i]["document"]

        # Encoding inputs using BART tokenizer 
        inputs = tokenizer.encode(doc, return_tensors='pt', max_length=1024, truncation=True)

        # Generate vectorized summary using encoded inputs
        summary_ids = model.generate_text(inputs, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Decode the summary into a human-readable format
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        # Append the predicted summary to a list of predictions
        predictions.append(summary)

        # Create an end timestamp
        end = time.perf_counter()

        # Calculate computation speed
        speed = end - start

        # Append computation speed to list
        times.append(speed)

        # If the iteration is a multiple of 1000
        if i % 5000 == 0:
            # Calculate the average computation time per row so far and print
            avg_time = sum(times) / len(times)
            print("Average time per row at", i, "row:", avg_time)

    # Create a new column for the dataframe using the predictions generated and return the modified dataframe
    df["BART_Pred"] = predictions
    return df

In [54]:
runBart(df_train_prac)

df_train_prac

Average time per row at 0 row: 2.119162500006496


KeyboardInterrupt: 

In [None]:
runBart(df_test_prac)

df_test_prac

# BERTScore Metrics

In [None]:
# initialize BERTScore metric

bertscore = load("bertscore")

In [None]:
# generating BERTScore metrics for model predictions

# create list of train prediction outputs
train_predictions = list(df_train_short["BART_Pred"].astype(str))
# create list of true outputs
train_references = list(df_train_short["summary"].astype(str))
# calculate BERTScore values comparing model predictions with true summaries
train_results_bert = bertscore.compute(predictions=train_predictions, references=train_references, lang="en")

# create list of prediction outputs
test_predictions = list(df_train_short["BART_Pred"].astype(str))
# create list of true outputs
test_references = list(df_train_short["summary"].astype(str))
# calculate BERTScore values comparing model predictions with true summaries
test_results_bert = bertscore.compute(predictions=test_predictions, references=test_references, lang="en")

In [None]:
# calculate average precision, recall and F1 scores from BERTScore model based on model predictions

# create list of train result keys
train_keys = list(train_results_bert.keys())

# for number of values in keylist-1
for k in range(len(train_keys)-1):
    # sum total of all result values
    s_train = sum(train_results_bert[train_keys[k]])
    # calculate the total number of result values
    le_train = len(train_results_bert[train_keys[k]])
    # compute average result value
    avg_train = s_train/le_train

    print(("train results:"))
    print("Average {} is {}".format(train_keys[k], avg_train))
    print()


# create list of result keys
test_keys = list(test_results_bert.keys())

# for number of values in keylist-1
for k in range(len(test_keys)-1):
    # sum total of all result values
    s_test = sum(test_results_bert[test_keys[k]])
    # calculate the total number of result values
    le_test = len(test_results_bert[test_keys[k]])
    # compute average result value
    avg_test = s_test/le_test

    print("test result:")
    print("Average {} is {}".format(test_keys[k], avg_test))

# ROUGE Metrics

In [None]:
# initialize ROGUE metrics model

rouge = evaluate.load('rouge')

In [None]:
# generate ROGUE metrics scores for train and test model outputs

# compute ROGUE metrics scores comparing model predictions with true outputs
train_results_bert = rouge.compute(predictions=train_predictions, references=train_references)

# compute ROGUE metrics scores comparing model predictions with true outputs
test_results_rogue = rouge.compute(predictions=test_predictions, references=test_references)

print("train results:")
print(train_results_bert)
print()
print("test results:")
print(test_results_rogue)

# model performance results writeup goes here
(from canvas) "You have been able to create a training and testing set from your data (or it has already been given to you). We want to see evidence that you were able to train your advanced model and have performance metrics. How does your model perform on the metrics you have chosen from your previous submission?"