# Group Project - Advanced Model

Divam Arora, Connor Moore, Hemanth Velan

DSBA 6165

### Sources:
* https://huggingface.co/datasets/gigaword
* https://huggingface.co/docs/datasets/process#export
* https://huggingface.co/docs/datasets/v1.11.0/splits.html
* https://www.geeksforgeeks.org/string-punctuation-in-python/
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
* https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
* https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
* https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
* https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01

First we need to re-run the code from our EDA/pre-processing notebook that loads and prepares our dataset for implementation.

In [1]:
# import needed packages
import nltk
import time
import random
import string
import pandas as pd
import datasets as ds
from evaluate import load
from transformers import BartTokenizer, BartForConditionalGeneration

# download stop word package from nltk library
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Dataset
Because our dataset is pulled directly from Huggingface's datasets library, there is no need for a local copy of the data. Running the cell below creates an instance of the specified dataset in your workspace environment.

In [2]:
# https://huggingface.co/datasets/gigaword
# https://huggingface.co/docs/datasets/v1.11.0/splits.html

# download gigaword dataset from Hugging Face dataset library
train, test, validation = ds.load_dataset("gigaword", split=["train", "test", "validation"])

In [3]:
# display the dataset splits
print(train)
print(test)
print(validation)

Dataset({
    features: ['document', 'summary'],
    num_rows: 3803957
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 1951
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 189651
})


In [4]:
# https://huggingface.co/docs/datasets/process#export

# export the training dataset to a pandas dataframe and display
df_train = train.to_pandas()
print("Train df exported.")

# export the test dataset to a pandas dataframe
df_test = test.to_pandas()
print("Test df exported.")

# export the validation dataset to a pandas dataframe
df_val = validation.to_pandas()
print("Validation df exported.")

Train df exported.
Test df exported.
Validation df exported.


### Balancing the train-test split
The standard provided division between train, test, and validation is extremely unbalanced towards train (95%), and the dataset overall is far too large to run through our model in a reasonable timespan. We decided to shrink the train set to 70,000 entries, and concat the provided test and validation sets. From that combined test-val set we will extract a 25,000-entry test set and a 5,000 entry validation set.

In [5]:
# select 70,000 rows randomly from the train dataframe

df_train_short = df_train.sample(n = 70000, random_state=2, ignore_index=True)

df_train_short

Unnamed: 0,document,summary
0,a british soldier was killed saturday by an ex...,british soldier killed in afghanistan blast
1,ukraine insists on building two new nuclear re...,ukraine insists on linking chernobyl closure t...
2,portuguese president mario soares will pay an ...,portugal 's president to visit angola next month
3,aol stepped up its transformation from interne...,aol introduces new advertising network plans t...
4,marine experts from wwf flew to the northern k...,suspected toxic algae bloom leaves thousands o...
...,...,...
69995,hong kong 's benchmark hang seng index ended h...,hong kong stocks edged up after four straight ...
69996,former brazil coach carlos alberto parreira sa...,parreira says he 's close to an agreement to c...
69997,around ## youths on thursday protested outside...,latvian youths protest ban of UNK symbols
69998,ohio 's method of putting prisoners to death i...,ohio judge says state s lethal injection proce...


In [6]:
# combine provided test and val sets and reseparate randomly into smaller subsets

# concat test and validation sets
test_val = [df_test, df_val]
df_testval_bulk = pd.concat(test_val)

# take a random sample of 30000 rows from the test and validation bulk set
df_testval_short = df_testval_bulk.sample(n = 30000, random_state=3, ignore_index=True)

# take a random 5000 row sample from the test-val subset
df_val_short = df_testval_short.sample(n = 5000, random_state=4, ignore_index=True)

# drop all rows taken for the validation sample from the test-val subset to create the test set
df_test_short = df_testval_short.drop(df_val_short.index, axis=0)

### Data Pre-Processing
We decided to truncate our pre-processing pipeline slightly from our baseline model notebook (removing lemmatization and not vectorizing text prior to passing it to the BART pretrained tokenizer) because BART models are designed to accept full, grammatically correct sentences. We thought passing more "normal" text during training may give the model better context and improve learning.

In [7]:
# the methods required to perform this function were found in this article -
# https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
# the function and comments are our original work

# set all words in all rows to lower case

def lower(df):
    # vectorize strings in each row in summary column and set to lower case
    df["summary"] = df["summary"].str.lower()
    print("summary column lowercased")
    # vectorize strings in each row in document column and set to lower case
    df["document"] = df["document"].str.lower()
    print("document column lowercased")

In [8]:
# geeks for geeks and pandas doc pages were used as template source code and informed about parameter options
# stackoverflow posts helped with debugging issues
# https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# https://www.geeksforgeeks.org/string-punctuation-in-python/
# https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
# comments and function are our original work, source code was modifed to fit our workspace

# remove all symbols and punctuation

# create instance of all punctuation symbols
punctuation = string.punctuation

# since we learned there are lots of apostrophe s in the dataset during EDA, we will add this to our remove list
punct_list = ["'s"]

# add all punctuation from the premade variable to our new list
for symbol in punctuation:
    punct_list.append(symbol)

# display the symbols included in our list
print(punct_list)

def remove_punctuation(df):
    # for each symbol in our punctuation list
    for symbol in punct_list:
        # iterate through the dataframe and replace every instance of the symbol with an empty string
        df["document"] = df["document"].str.replace(symbol, "", regex=False)
        df["summary"] = df["summary"].str.replace(symbol, "", regex=False)
    print("symbols removed")

["'s", '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [9]:
# create data pre-processing pipeline

def pre_proc(df):
    # lowercase
    lower(df)
    # remove punctuation and symbols
    remove_punctuation(df)
    print("pre-processed successfully")

In [10]:
# call the data pre-processing pipeline for each of the dataset splits
# minus the lemmatization and tokenizer steps from the baseline notebook

pre_proc(df_train_short)
print("train df completed")
pre_proc(df_val_short)
print("test df completed")
pre_proc(df_test_short)
print("validation df completed")

# display new format of data using training set
df_train_short.head()

summary column lowercased
document column lowercased
symbols removed
pre-processed successfully
train df completed
summary column lowercased
document column lowercased
symbols removed
pre-processed successfully
test df completed
summary column lowercased
document column lowercased
symbols removed
pre-processed successfully
validation df completed


Unnamed: 0,document,summary
0,a british soldier was killed saturday by an ex...,british soldier killed in afghanistan blast
1,ukraine insists on building two new nuclear re...,ukraine insists on linking chernobyl closure t...
2,portuguese president mario soares will pay an ...,portugal president to visit angola next month
3,aol stepped up its transformation from interne...,aol introduces new advertising network plans t...
4,marine experts from wwf flew to the northern k...,suspected toxic algae bloom leaves thousands o...


Our dataset splits are now pre-processed and ready for use with models.

# New attempt at fine-tuning / advanced model
Using same approach utilized in baseline model notebook, with modifications and adjustments made to runBART() function to eliminate null value model output issues and attempt to improve hyperparameters.

In [11]:
# tokenizer and model loading for bart-base

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

In [12]:
# new version OF runBart function, updated from our baseline model for improved speed and performance
# testing and trial and error showed that removing the component of the function that calculated a unique maxlength for every summary sped up prediction generation significantly
# in the baseline model we had issues with large numbers of NaN empty summary predictions,
# through some research, testing, and trial and error we were able to find a configuration for the encode and generate() steps that corrected this issue

def runBart(df):

    # Empty lists for predictions and performance timestamps
    predictions = []
    times = []

    # For the number of rows in the given dataframe
    for i in range(len(df)):
        # Create a start timestamp
        start = time.perf_counter()

        # Create a document instance using the row's entry for the stringified document
        doc = df.iloc[i]["document"]

        # Encoding inputs using BART tokenizer 
        inputs = tokenizer.encode(doc, return_tensors='pt', max_length=1024, truncation=True)

        # Generate vectorized summary using encoded inputs
        summary_ids = model.generate(inputs, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Decode the summary into a human-readable format
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        # Append the predicted summary to a list of predictions
        predictions.append(summary)

        # Create an end timestamp
        end = time.perf_counter()

        # Calculate computation speed
        speed = end - start

        # Append computation speed to list
        times.append(speed)

        # If the iteration is a multiple of 1000
        if i % 1000 == 0:
            # Calculate the average computation time per row so far and print
            avg_time = sum(times) / len(times)
            print("Average time per row at", i, "row:", avg_time)

    # Create a new column for the dataframe using the predictions generated and return the modified dataframe
    df["BART_Pred"] = predictions
    return df

In [13]:
# run the model to generate predictions using the test set
runBart(df_test_short)

df_test_short

Average time per row at 0 row: 2.2704760999999998
Average time per row at 1000 row: 2.1695680690309698
Average time per row at 2000 row: 2.1347162535732087
Average time per row at 3000 row: 2.126662612062643
Average time per row at 4000 row: 2.127252925843536
Average time per row at 5000 row: 2.1181546087382452
Average time per row at 6000 row: 2.1135295082652883
Average time per row at 7000 row: 2.11838996016283
Average time per row at 8000 row: 2.1159988310586164
Average time per row at 9000 row: 2.1139535430063363
Average time per row at 10000 row: 2.1138210887711244
Average time per row at 11000 row: 2.113529034642307
Average time per row at 12000 row: 2.117017731197398
Average time per row at 13000 row: 2.1165382607491736
Average time per row at 14000 row: 2.1160646754089028
Average time per row at 15000 row: 2.1147231192320524
Average time per row at 16000 row: 2.1154324689956807
Average time per row at 17000 row: 2.114863878195382
Average time per row at 18000 row: 2.11479855086

Unnamed: 0,document,summary,BART_Pred
5000,japanese electronics giant toshiba said tuesda...,toshiba profits times up in third quarter,japanese electronics giant toshiba said tuesda...
5001,michael campbell opened a unk lead on defendin...,campbell puts defending champion woosnam in tr...,michael campbell opened a unk lead on defendin...
5002,iran on tuesday dismissed the allegation by th...,iran denies allegation on its military deployment,iran on tuesday dismissed the allegation by th...
5003,turkish foreign minister and deputy prime mini...,turkish fm hails eu plan to end economic sanct...,turkish foreign minister and deputy prime mini...
5004,patricia mcgovern the former senate ways and ...,former senator joins governor race,patricia mcgovern the former senate ways and ...
...,...,...,...
29995,three activists are charged with staging an un...,charges against activists could set precedent ...,three activists are charged with staging an un...
29996,a prominent fatah leader and his yearold son...,father son killed in fatahhamas fight in gaza,a prominent fatah leader and his yearold son...
29997,trustees at virginia union university one of ...,wilder in running to head virginia union,trustees at virginia union university one of ...
29998,patrouille de france lrb paf rrb the famous...,french famous unk to present aerobatics shows ...,patrouille de france lrb paf rrb the famous...


In [14]:
# check generated predictions for NaN values

# count null values in BART pred columnm
null_predictions = df_test_short['BART_Pred'].isna().sum()

print("there were", null_predictions, "empty predictions generated.")

there were 0 empty predictions generated.


# BERTScore Metrics

In [15]:
# initialize BERTScore metric

bertscore = load("bertscore")

In [16]:
# generating BERTScore metrics for model predictions

# create list of prediction outputs
test_predictions = list(df_test_short["BART_Pred"].astype(str))
# create list of true outputs
test_references = list(df_test_short["summary"].astype(str))
# calculate BERTScore values comparing model predictions with true summaries
test_results_bert = bertscore.compute(predictions=test_predictions, references=test_references, lang="en")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# calculate average precision, recall and F1 scores from BERTScore model based on model predictions

# create list of result keys
test_keys = list(test_results_bert.keys())

# for number of values in keylist-1
for k in range(len(test_keys)-1):
    # sum total of all result values
    s_test = sum(test_results_bert[test_keys[k]])
    # calculate the total number of result values
    le_test = len(test_results_bert[test_keys[k]])
    # compute average result value
    avg_test = s_test/le_test

    print("test set result:")
    print("Average {} is {}".format(test_keys[k], avg_test))

test set result:
Average precision is 0.8153642509675026
test set result:
Average recall is 0.8942867579483986
test set result:
Average f1 is 0.8527202885365486


# ROUGE Metrics

In [18]:
# initialize ROUGE metrics model

rouge = load('rouge')

In [19]:
# generate ROUGE metrics scores for train and test model outputs

# compute ROUGE metrics scores comparing model predictions with true outputs
test_results_rouge = rouge.compute(predictions=test_predictions, references=test_references)

print("test set results:")
print(test_results_rouge)

test set results:
{'rouge1': 0.2140434424280383, 'rouge2': 0.06928421625683026, 'rougeL': 0.18753636941510993, 'rougeLsum': 0.18756050305883293}


# Observe a few specific model predictions for human evaluation

In [20]:
# produce five random input/target/prediction pairs from the model

# empty list for index sample
indexList = []

# generate five random index values and add to list
for i in range(5):
    index = random.randint(0,25000)
    indexList.append(index)

print(indexList)
print()

# for index in list
for ind in indexList:
    # display the original input, the target summary, and what the model predicted for a summary
    print("index:", ind)
    print("original input:", df_test_short.iloc[ind]["document"])
    print("target summary:", df_test_short.iloc[ind]["summary"])
    print("model prediction:", df_test_short.iloc[ind]["BART_Pred"])
    print()

[21906, 15907, 13640, 12386, 16459]

index: 21906
target summary: malaysians believe corruption problem acute

index: 15907
original input: kohlberg kravis roberts amp co said it will invest   million in randalls food markets  getting a majority stake in the closely held grocerystore chain 
target summary: kkr to invest   mln in texas grocer unk food markets
model prediction: kohlberg kravis roberts amp co said it will invest   million in randalls food markets  getting a majority stake in the closely held grocerystore chain  and is expected to raise about $1 billion in the coming year.

index: 13640
original input: the french consumer products group bic said thursday that group sales jumped by  percent to  billion euros lrb  billion dollars rrb last year  owing to stronger demand for its throwaway ballpoint pens  razors and cigarette lighters 
target summary: bic  sales rise  pct to  billion euros
model prediction: the french consumer products group bic said thursday that group sales j

# Takeaways from new attempt


This version of the model performed slightly better than our original advanced model, but not better than our original baseline model. This version generated no null summaries, which is good, but looking at the example outputs it appears part of our problem may be related to our data. Many of BART's generated summaries are longer than our original input sentences, and we think this may be because BART is designed to produce coherent, human-readable sentences. Becuase many of our input strings are shortened or abbreviated sentences (and some are not sentences at all), we wondered if it is possible BART is having a difficult time writing logical sentences that are any shorter (the entire purpose of summarization). Going into this project, none of us had experience with text summarization, but over the course of the semester it has become apparent text summarization models appear to perform better with larger bodies of input text. A takeaway or modification to our project would be to utilize a dataset with longer input text.