# Group Project - Model Baseline
### DSBA 6165
### Divam Arora, Connor Moore, Hemanth Velan

### Sources:
* https://huggingface.co/datasets/gigaword
* https://pytorch.org/get-started/locally/
* https://www.width.ai/post/bart-text-summarization
* https://huggingface.co/docs/datasets/process#export
* https://huggingface.co/docs/datasets/v1.11.0/splits.html
* https://www.geeksforgeeks.org/string-punctuation-in-python/
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://blog.paperspace.com/bart-model-for-text-summarization-part1/
* https://www.projectpro.io/article/transformers-bart-model-explained/553
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
* https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
* https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
* https://towardsdatascience.com/bertscore-evaluating-text-generation-with-bert-beb7b3431300
* https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
* https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
* https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
* https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
* https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01

First we need to re-run the code from our EDA/pre-processing notebook that loads and prepares our dataset for implementation.

In [1]:
# import needed packages
import nltk
import time
import torch
import string
import evaluate
import pandas as pd
import datasets as ds
from evaluate import load
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from transformers import BartForConditionalGeneration, BartTokenizer

# download the NLTK resources used below (stop words, punkt tokenizer, WordNet)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\heman\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\heman\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\heman\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Dataset

Because our dataset is pulled directly from the Hugging Face `datasets` library, no local copy of the data is needed. Running the cell below loads the three provided splits into your workspace environment.

In [2]:
# https://huggingface.co/datasets/gigaword
# https://huggingface.co/docs/datasets/v1.11.0/splits.html

# download gigaword dataset from Hugging Face dataset library
train, test, validation = ds.load_dataset("gigaword", split=["train", "test", "validation"])

In [3]:
# display the dataset splits
print(train)
print(test)
print(validation)

Dataset({
    features: ['document', 'summary'],
    num_rows: 3803957
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 1951
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 189651
})


In [4]:
# https://huggingface.co/docs/datasets/process#export

# export the training dataset to a pandas dataframe
df_train = train.to_pandas()
print("Train df exported.")

# export the test dataset to a pandas dataframe
df_test = test.to_pandas()
print("Test df exported.")

# export the validation dataset to a pandas dataframe
df_val = validation.to_pandas()
print("Validation df exported.")

Train df exported.
Test df exported.
Validation df exported.


### Balancing the train-test split
The provided division between train, test, and validation is heavily skewed toward train (about 95% of all rows), and the dataset overall is far too large to run through our model in a reasonable timespan. We decided to shrink the train set to 70,000 randomly sampled entries and to concatenate the provided test and validation sets. From that combined test-val set we sample 30,000 entries, which we divide into a 25,000-entry test set and a 5,000-entry validation set.

In [5]:
# select 70,000 rows randomly from the train dataframe

df_train_short = df_train.sample(n = 70000, random_state=2)

df_train_short

index,document,summary
644708,a british soldier was killed saturday by an ex...,british soldier killed in afghanistan blast
1506983,ukraine insists on building two new nuclear re...,ukraine insists on linking chernobyl closure t...
3429980,portuguese president mario soares will pay an ...,portugal 's president to visit angola next month
2028209,aol stepped up its transformation from interne...,aol introduces new advertising network plans t...
1392922,marine experts from wwf flew to the northern k...,suspected toxic algae bloom leaves thousands o...
...,...,...
3124694,hong kong 's benchmark hang seng index ended h...,hong kong stocks edged up after four straight ...
1237703,former brazil coach carlos alberto parreira sa...,parreira says he 's close to an agreement to c...
671101,around ## youths on thursday protested outside...,latvian youths protest ban of UNK symbols
1601285,ohio 's method of putting prisoners to death i...,ohio judge says state s lethal injection proce...


In [6]:
# combine provided test and val sets and reseparate randomly into smaller subsets

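# concat test and validation sets
# ignore_index=True gives every row a unique label; both source frames start their index at 0,
# so without it the drop() below would also remove test rows that share a label with a validation row
test_val = [df_test, df_val]
df_testval_bulk = pd.concat(test_val, ignore_index=True)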

# take a random sample of 30000 rows from the test and validation bulk set
df_testval_short = df_testval_bulk.sample(n = 30000, random_state=3)

# take a random 5000 row sample from the test-val subset
df_val_short = df_testval_short.sample(n = 5000, random_state=4)

# drop all rows taken for the validation sample from the test-val subset to create the test set
df_test_short = df_testval_short.drop(df_val_short.index, axis=0)
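
Because `drop` removes rows by index label rather than by position, a split built this way relies on the labels in the concatenated frame being unique. The pure-Python sketch below illustrates the effect with hypothetical toy labels (names such as `bulk_labels` and `val_positions` are ours and do not appear in the notebook). In pandas, passing `ignore_index=True` to `pd.concat` is one way to guarantee unique labels.

```python
import random

# Hypothetical toy labels: two frames whose indexes both start at 0,
# concatenated without renumbering, so labels 0..4 occur twice.
test_labels = list(range(5))
val_labels = list(range(8))
bulk_labels = test_labels + val_labels

rng = random.Random(3)
# sample 10 row positions, then carve 3 of them out as the validation set
sampled_positions = rng.sample(range(len(bulk_labels)), 10)
val_positions = set(rng.sample(sampled_positions, 3))

# label-based removal: every sampled row sharing a label with a validation row is dropped
val_label_set = {bulk_labels[p] for p in val_positions}
test_by_label = [p for p in sampled_positions if bulk_labels[p] not in val_label_set]

# position-based removal: only the validation rows themselves are dropped
test_by_position = [p for p in sampled_positions if p not in val_positions]

print(len(test_by_position), len(test_by_label))
```

With unique labels the two counts always agree; with duplicated labels the label-based result can come out smaller than intended.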

In [7]:
# the methods required to perform this function were found in this article -
# https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
# the function and comments are our original work

# set all words in all rows to lower case

def lower(df):
    # vectorize strings in each row in summary column and set to lower case
    df["summary"] = df["summary"].str.lower()
    print("summary column lowercased")
    # vectorize strings in each row in document column and set to lower case
    df["document"] = df["document"].str.lower()
    print("document column lowercased")

In [8]:
# geeks for geeks and pandas doc pages were used as template source code and informed about parameter options
# stackoverflow posts helped with debugging issues
# https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# https://www.geeksforgeeks.org/string-punctuation-in-python/
# https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
# comments and function are our original work, source code was modified to fit our workspace

# remove all symbols and punctuation

# create instance of all punctuation symbols
punctuation = string.punctuation

# during EDA we found many "'s" tokens in the dataset, so we put that sequence first in our remove list
punct_list = ["'s"]

# add all punctuation from the premade variable to our new list
for symbol in punctuation:
    punct_list.append(symbol)

# display the symbols included in our list
print(punct_list)

def remove_punctuation(df):
    # for each symbol in our punctuation list
    for symbol in punct_list:
        # iterate through the dataframe and replace every instance of the symbol with an empty string
        df["document"] = df["document"].str.replace(symbol, "", regex=False)
        df["summary"] = df["summary"].str.replace(symbol, "", regex=False)
    print("symbols removed")

["'s", '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
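
The order of the list matters: `"'s"` is removed before the lone apostrophe, so the possessive disappears entirely instead of leaving a stray `s`. The sketch below shows this on a single string using only the standard library. `strip_punct` is a hypothetical helper of ours, and `str.translate` is swapped in for the per-symbol `str.replace` loop, so treat it as an illustration rather than the notebook's exact method.

```python
import string

# translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punct(text):
    # remove the possessive "'s" first, then every remaining punctuation character
    return text.replace("'s", "").translate(_PUNCT_TABLE)

sample = "portugal 's president to visit angola next month"

cleaned = strip_punct(sample)
# removing punctuation without handling "'s" first leaves a dangling "s" token
naive = sample.translate(_PUNCT_TABLE)

print(cleaned.split())
print(naive.split())
```

Note that removing a token can leave a double space behind; the later tokenization step splits on whitespace, so this does not affect the token lists.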


In [9]:
# source code and ideas for this process were gathered from the following geeks for geeks page and article -
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
# comments and functions are original work, source code was modified to fit our workspace

# tokenization and removal of stopwords

# create an instance of all stopwords
stop_words = set(stopwords.words('english'))

# function for removing stopwords from a given input
def remove_stopwords(text):
    # tokenize the input string
    tokens = word_tokenize(text)
    # create an empty list for the new output
    filtered_tokens = []
    # for each word in the tokenized text
    for word in tokens:
        # if the word is not a stop word
        if word not in stop_words:
            # add the token to the new output list
            filtered_tokens.append(word)

    return filtered_tokens

# function to apply the stopword removal/tokenization function to input dataframes
def tokenize_nostop(df):
    # iterate through the dataframe and tokenize/remove stop words for each row
    df["document"] = df["document"].apply(remove_stopwords)
    print("stopwords removed from document column")
    
    df["summary"] = df["summary"].apply(remove_stopwords)
    print("stopwords removed from summary column")

In [10]:
# inspiration and source code for NLTK's word net lemmatizer came from the following article -
# https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
# functions and comments are our original work, source code was modified to fit our workspace

# lemmatization

# create an instance of NLTK's word net lemmatizer class
wml = WordNetLemmatizer()

# function to lemmatize a given tokenized text input
def lemmatization(text):
    # create an empty list for new output
    lemma_words = []

    # for each word in the given input
    for word in text:
        # lemmatize the word
        token = wml.lemmatize(word)
        # and add it into our new output list
        lemma_words.append(token)
    
    return lemma_words

# function to call lemmatization function on the rows of an input dataframe
def lemmatize(df):
    # iterate through the rows of the input dataframe and apply the lemmatization function to each row
    df["document"] = df["document"].apply(lemmatization)
    print("document column lemmatized")

    df["summary"] = df["summary"].apply(lemmatization)
    print("summary column lemmatized")

In [11]:
# create data pre-processing pipeline

def pre_proc(df):
    # lowercase
    lower(df)
    # remove punctuation and symbols
    remove_punctuation(df)
    # tokenize and remove stopwords
    tokenize_nostop(df)
    # lemmatize
    lemmatize(df)
    print("pre-processed successfully")

In [12]:
# call the data pre-processing pipeline for each of the dataset splits

pre_proc(df_train_short)
print("train df completed")
pre_proc(df_test_short)
print("test df completed")
pre_proc(df_val_short)
print("validation df completed")

# display new format of data using training set
df_train_short.head(10)

summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
train df completed
summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
test df completed
summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
validation df completed


index,document,summary
644708,"[british, soldier, killed, saturday, explosion...","[british, soldier, killed, afghanistan, blast]"
1506983,"[ukraine, insists, building, two, new, nuclear...","[ukraine, insists, linking, chernobyl, closure..."
3429980,"[portuguese, president, mario, soares, pay, of...","[portugal, president, visit, angola, next, month]"
2028209,"[aol, stepped, transformation, internet, acces...","[aol, introduces, new, advertising, network, p..."
1392922,"[marine, expert, wwf, flew, northern, kenyan, ...","[suspected, toxic, algae, bloom, leaf, thousan..."
887130,"[indian, share, price, closed, percent, higher...","[indian, share, close, pct]"
1230066,"[gustav, slammed, cuba, tobaccogrowing, wester...","[gustav, slam, cuba, massive, category, hurric..."
2933565,"[week, ago, researcher, wisconsin, japan, said...","[unk, stem, cell, venture, land, million]"
1592999,"[two, japan, biggest, soccer, star, returned, ...","[soccer, star, return, home, dropped, national..."
960389,"[united, state, britain, unleashed, massive, a...","[u, unleashes, aerial, assault, take, port, ai..."


Our dataset splits are now pre-processed and ready for use with models.

### Convert lists to string format for improved model compatibility

In [13]:
# convert list entries into single strings

def list2string(tokens):
    # join a list of tokens into one space-separated string
    return " ".join(tokens)

In [14]:
# apply function to create new stringified columns for

# train
df_train_short["docString"] = df_train_short["document"].map(list2string)
df_train_short["sumString"] = df_train_short["summary"].map(list2string)

# test
df_test_short["docString"] = df_test_short["document"].map(list2string)
df_test_short["sumString"] = df_test_short["summary"].map(list2string)

# and val
df_val_short["docString"] = df_val_short["document"].map(list2string)
df_val_short["sumString"] = df_val_short["summary"].map(list2string)

df_train_short.head(5)

index,document,summary,docString,sumString
644708,"[british, soldier, killed, saturday, explosion...","[british, soldier, killed, afghanistan, blast]",british soldier killed saturday explosion sout...,british soldier killed afghanistan blast
1506983,"[ukraine, insists, building, two, new, nuclear...","[ukraine, insists, linking, chernobyl, closure...",ukraine insists building two new nuclear react...,ukraine insists linking chernobyl closure buil...
3429980,"[portuguese, president, mario, soares, pay, of...","[portugal, president, visit, angola, next, month]",portuguese president mario soares pay official...,portugal president visit angola next month
2028209,"[aol, stepped, transformation, internet, acces...","[aol, introduces, new, advertising, network, p...",aol stepped transformation internet access pro...,aol introduces new advertising network plan mo...
1392922,"[marine, expert, wwf, flew, northern, kenyan, ...","[suspected, toxic, algae, bloom, leaf, thousan...",marine expert wwf flew northern kenyan coast t...,suspected toxic algae bloom leaf thousand fish...


# BART model for text summarization

We have chosen the BART (Bidirectional and Auto-Regressive Transformers) model for our text summarization task. BART is a sequence-to-sequence transformer that pairs a bidirectional encoder with a left-to-right (autoregressive) decoder, and it has been shown to be very effective at processing one body of text and outputting different text in response. BART is pre-trained by "noising" a given body of text (masking, deleting, or shuffling parts of it) and training the seq2seq model to recreate the original. This lets the model learn the semantics, flow, grammar, and construction of the language, so that it can produce (roughly) coherent, novel outputs in that language in response to a task. This makes BART a strong choice for text summarization, and there are several pre-trained variants of BART that are specialized for this task.

In addition to manually lemmatizing and tokenizing the string entries in our dataset during preprocessing, we decided to also use BART's pre-trained tokenizer, which converts the text into token IDs the model can interpret, with the hope of optimizing the model's performance. For the summarization task itself we use the base pre-trained checkpoint (`facebook/bart-base`), which has not been fine-tuned for summarization. We wanted the least-trained form of the model for our baseline performance test, so that there is room for optimization and training during the next "advanced model" stage of the project.

### Sources - 
* https://www.width.ai/post/bart-text-summarization
* https://www.projectpro.io/article/transformers-bart-model-explained/553
* https://blog.paperspace.com/bart-model-for-text-summarization-part1/
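
To make the "noising" idea concrete, here is a minimal sketch of one corruption that BART's pre-training uses, text infilling, where a span of tokens is replaced by a single mask token and the model must reconstruct the original. The function `add_noise`, its parameters, and the fixed span length are our own simplifications for illustration; the real pre-training procedure samples span lengths and combines several kinds of noise.

```python
import random

def add_noise(tokens, mask_token="<mask>", span_len=2, seed=0):
    """Toy text infilling: replace one contiguous span with a single mask token."""
    rng = random.Random(seed)
    if len(tokens) <= span_len:
        return [mask_token]
    # pick where the masked span starts, then splice the mask in its place
    start = rng.randrange(0, len(tokens) - span_len + 1)
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

original = "british soldier killed saturday explosion".split()
noised = add_noise(original)

# the training pair would be (noised input, original target)
print(noised, "->", original)
```

Because the span collapses into one mask token, the model has to infer both which words are missing and how many, which is part of what makes the reconstruction task informative.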

In [16]:
# Tokenizer and model loading for bart-base

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

### Testing for GPU

In [17]:
# Download Cuda Toolkit 12.1.0
# Download Pytorch Cuda from this website:
# https://pytorch.org/get-started/locally/

In [18]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [19]:
model.to(device)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=

In [20]:
# create function to generate model predictions and append to new dataframe column

def runBart(df):

    # empty lists for predictions and performance timestamps
    predictions = []
    times = []

    # for the number of rows in the given dataframe
    for i in range(len(df)):
        # create a start timestamp
        start = time.perf_counter()
        # create a document instance using that row's entry for the stringified document
        doc = df.iloc[i]["docString"]
        # cap the summary length at one tenth of the document's character count, plus one
        # (len(doc) counts characters, whereas max_length is measured in tokens)
        maxLen = int(len(doc) / 10)+1

        # encoding inputs using BART tokenizer 
        inputs = tokenizer.batch_encode_plus([doc],return_tensors='pt').to(device)
        # generate vectorized summary using encoded inputs
        summary_ids =  model.generate(inputs['input_ids'], max_length=maxLen, min_length=0)

        # decode the summary into a human-readable format
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        
        # append the predicted summary to a list of predictions
        predictions.append(summary)
        
        # create an end timestamp
        end = time.perf_counter()
        # calculate computation speed
        speed = end - start
        # append computation speed to list
        times.append(speed)

        # if the iteration is a multiple of 1000
        if i % 1000 == 0:
            # calculate the average computation time per row so far and print
            avg_time = sum(times) / len(times)
            print("average time per row at", i, "row: ", avg_time)

    # create a new column for the dataframe using the predictions generated and return the modified dataframe
    df["BART_Pred"] = predictions
    return df
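
The length cap in `runBart` is derived from characters, while `max_length` in `generate` is a token budget. The short sketch below reproduces only that arithmetic on hypothetical input strings (`summary_budget` is our own helper name). As we understand the `generate` documentation, the budget for encoder-decoder models also includes the decoder's special tokens, so very short documents may leave little or no room for actual words; treat that as an assumption worth verifying.

```python
def summary_budget(doc: str) -> int:
    # same arithmetic as maxLen in runBart: one tenth of the character count, plus one
    return int(len(doc) / 10) + 1

doc = "british soldier killed saturday explosion"  # hypothetical short document
budget = summary_budget(doc)
n_chars = len(doc)
n_words = len(doc.split())

print(n_chars, n_words, budget)  # prints: 41 5 5
print(summary_budget("unk"))     # prints: 1
```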

In [21]:
# run the modified version of the training dataset through the prediction function and display
runBart(df_train_short)

df_train_short

average time per row at 0 row:  0.9781096000224352
average time per row at 1000 row:  0.11846682427511025
average time per row at 2000 row:  0.11818595332304609
average time per row at 3000 row:  0.117193621226007
average time per row at 4000 row:  0.11742063978969107
average time per row at 5000 row:  0.12212711257703028
average time per row at 6000 row:  0.1346871187464001
average time per row at 7000 row:  0.15649409161513889
average time per row at 8000 row:  0.17164029151326465
average time per row at 9000 row:  0.18086405897093802
average time per row at 10000 row:  0.18923156223352033
average time per row at 11000 row:  0.19510109616378032
average time per row at 12000 row:  0.19617675206213428
average time per row at 13000 row:  0.1972584353663348
average time per row at 14000 row:  0.19804303527591882
average time per row at 15000 row:  0.1992362401972552
average time per row at 16000 row:  0.19934204219730664
average time per row at 17000 row:  0.2034355373976873
average time

index,document,summary,docString,sumString,BART_Pred
644708,"[british, soldier, killed, saturday, explosion...","[british, soldier, killed, afghanistan, blast]",british soldier killed saturday explosion sout...,british soldier killed afghanistan blast,british soldier killed saturday
1506983,"[ukraine, insists, building, two, new, nuclear...","[ukraine, insists, linking, chernobyl, closure...",ukraine insists building two new nuclear react...,ukraine insists linking chernobyl closure buil...,ukraine insists building two new nuclear react...
3429980,"[portuguese, president, mario, soares, pay, of...","[portugal, president, visit, angola, next, month]",portuguese president mario soares pay official...,portugal president visit angola next month,portuguese president mario soares pay official...
2028209,"[aol, stepped, transformation, internet, acces...","[aol, introduces, new, advertising, network, p...",aol stepped transformation internet access pro...,aol introduces new advertising network plan mo...,aol stepped transformation internet access pro...
1392922,"[marine, expert, wwf, flew, northern, kenyan, ...","[suspected, toxic, algae, bloom, leaf, thousan...",marine expert wwf flew northern kenyan coast t...,suspected toxic algae bloom leaf thousand fish...,marine expert wwf flew northern kenyan coast t...
...,...,...,...,...,...
3124694,"[hong, kong, benchmark, hang, seng, index, end...","[hong, kong, stock, edged, four, straight, ses...",hong kong benchmark hang seng index ended high...,hong kong stock edged four straight session dec,hong kong benchmark hang seng index
1237703,"[former, brazil, coach, carlos, alberto, parre...","[parreira, say, close, agreement, coach, south...",former brazil coach carlos alberto parreira sa...,parreira say close agreement coach south africa,former brazil coach carlos alberto parreira sa...
671101,"[around, youth, thursday, protested, outside, ...","[latvian, youth, protest, ban, unk, symbol]",around youth thursday protested outside latvia...,latvian youth protest ban unk symbol,around youth thursday protested outside latvia...
1601285,"[ohio, method, putting, prisoner, death, uncon...","[ohio, judge, say, state, lethal, injection, p...",ohio method putting prisoner death unconstitut...,ohio judge say state lethal injection process ...,ohio method putting prisoner death unconstitut...


# Performance Metrics - BERTScore and ROUGE

We have chosen BERTScore and ROUGE as our two metrics for evaluating the performance of our model. We chose ROUGE because it is purpose-built for evaluating text summarization, which is the task our model performs. The variants most relevant to us are ROUGE-N (as ROUGE-1 and ROUGE-2), ROUGE-L, and ROUGE-S. ROUGE-1 and ROUGE-2 count the unigrams and bigrams (respectively) shared between the model output and the reference ("correct") summary. ROUGE-L measures the longest common subsequence of words shared between the model's output and the reference. ROUGE-S counts shared skip-bigrams, which are pairs of words that appear in the same order in both texts but may be separated by other words, so it can credit word sequences that are correct in the model's output but interrupted by an extra word or phrase. Note that the `rouge` metric we load from the `evaluate` library reports ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum, so ROUGE-S does not appear in our results. Each of these variants can be expressed as precision, recall, and F1 when comparing the model's summaries with the original human-written ones.
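
As a worked example of the definitions above, the sketch below computes ROUGE-1, ROUGE-2, and the longest common subsequence by hand for one prediction and reference pair taken from our data. It is a simplified illustration: it splits on whitespace and applies no stemming, so its numbers need not match the `evaluate` implementation exactly, and the helper names (`ngrams`, `rouge_n`, `lcs_length`) are our own.

```python
from collections import Counter

def ngrams(tokens, n):
    # all contiguous n-token windows
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(pred, ref, n):
    pred_counts = Counter(ngrams(pred.split(), n))
    ref_counts = Counter(ngrams(ref.split(), n))
    # clipped overlap: each n-gram counts at most as often as it occurs in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def lcs_length(a, b):
    # dynamic-programming longest common subsequence over token lists
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

pred = "british soldier killed saturday"
ref = "british soldier killed afghanistan blast"

r1 = rouge_n(pred, ref, 1)   # 3 shared unigrams: precision 3/4, recall 3/5
r2 = rouge_n(pred, ref, 2)   # 2 shared bigrams: precision 2/3, recall 2/4
lcs = lcs_length(pred.split(), ref.split())  # "british soldier killed" has length 3

print(r1, r2, lcs)
```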

We chose BERTScore as our second metric because it is also designed to evaluate how a model's text output compares with a reference output. We thought it would be interesting to pair a BERTScore evaluation with our ROUGE evaluations because BERTScore, unlike ROUGE or BLEU, makes a semantic comparison between the model's output and the reference rather than a purely lexical one. Instead of counting how many exact words match between the two texts, BERTScore compares contextual embeddings of the words in each text, so it takes the meaning of individual words into account. This can make for an analysis that is more in line with human intuition.
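
The following sketch illustrates the greedy-matching idea behind BERTScore with hand-made two-dimensional vectors standing in for contextual embeddings. The vectors, the `toy` dictionary, and the function names are hypothetical; the real metric obtains embeddings from a pre-trained language model. Precision averages, over the candidate's tokens, the best cosine similarity to any reference token; recall does the same from the reference side.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(cand_vecs, ref_vecs):
    # precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical toy "embeddings": near-synonyms point in nearly the same direction
toy = {
    "soldier": (1.0, 0.0), "trooper": (0.9, 0.1),
    "killed": (0.0, 1.0), "died": (0.1, 0.9),
}
candidate = [toy["trooper"], toy["died"]]
reference = [toy["soldier"], toy["killed"]]

p, r, f1 = greedy_match_scores(candidate, reference)
print(p, r, f1)
```

The candidate and reference share no exact words, so their ROUGE-1 would be zero, yet the embedding-based scores are high because the words are close in meaning.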

### Sources
* https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
* https://towardsdatascience.com/bertscore-evaluating-text-generation-with-bert-beb7b3431300

# BERTScore Metrics

In [22]:
# initialize BERTScore metric

bertscore = load("bertscore")

In [27]:
# generating BERTScore metrics for model predictions

# create list of prediction outputs
predictions = list(df_train_short["BART_Pred"].astype(str))
# create list of true outputs
references = list(df_train_short["sumString"].astype(str))

# calculate BERTScore values comparing model predictions with true summaries
results = bertscore.compute(predictions=predictions, references=references, lang="en")
# display
results

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [0.9311743378639221,
  0.8636964559555054,
  0.8486814498901367,
  0.841934323310852,
  0.8072885870933533,
  0.9413087368011475,
  0.8767479658126831,
  0.8141822218894958,
  0.8492639064788818,
  0.8462842702865601,
  0.9057227373123169,
  0.8114935755729675,
  0.8519151210784912,
  0.8692603707313538,
  0.8386660218238831,
  0.935811460018158,
  0.7969496250152588,
  0.9232096672058105,
  0.7809390425682068,
  0.8833740949630737,
  0.8463585376739502,
  0.8570193648338318,
  0.8863598108291626,
  0.8623895645141602,
  0.8350139856338501,
  0.8269885182380676,
  0.7919474840164185,
  0.8319289684295654,
  0.9097435474395752,
  0.890620768070221,
  0.8762685060501099,
  0.8205379247665405,
  0.8069441318511963,
  0.8763716220855713,
  0.8753389716148376,
  0.8572087287902832,
  0.9271894693374634,
  0.9310461282730103,
  0.8713871836662292,
  0.8492502570152283,
  0.8259228467941284,
  0.8721873760223389,
  0.8351958394050598,
  0.8471522331237793,
  0.8851442337036133,


In [28]:
# calculate average precision, recall and F1 scores from BERTScore model based on model predictions

# create list of result keys
keys = list(results.keys())

# loop over every key except the last one ('hashcode'), which is not a list of scores
for k in range(len(keys)-1):
    # sum total of all result values
    s = sum(results[keys[k]])
    # calculate the total number of result values
    le = len(results[keys[k]])
    # compute average result value
    avg = s/le

    print("Average {} is {}".format(keys[k], avg))

Average precision is 0.8511762052280563
Average recall is 0.8626955210038594
Average f1 is 0.8566267217857497
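
An alternative way to write the same averaging, shown here on a hypothetical toy dictionary shaped like the BERTScore output, is to filter on the value type instead of relying on the key order. This is only a sketch; `toy_results` and its values are made up for illustration.

```python
from statistics import fmean

# hypothetical toy output in the same shape as the BERTScore results dict
toy_results = {
    "precision": [0.9, 0.8],
    "recall": [0.85, 0.75],
    "f1": [0.87, 0.77],
    "hashcode": "toy-hash",
}

averages = {
    name: fmean(values)
    for name, values in toy_results.items()
    if isinstance(values, list)  # skips the non-numeric 'hashcode' entry
}

for name, avg in averages.items():
    print("Average {} is {}".format(name, avg))
```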


# ROUGE Metrics

In [29]:
# initialize ROUGE metrics model

rouge = evaluate.load('rouge')

In [30]:
# generate ROUGE metrics scores for model outputs

# generate prediction list
predictions = list(df_train_short["BART_Pred"].astype(str))
# create true value list
references = list(df_train_short["sumString"].astype(str))

# compute ROUGE metrics scores comparing model predictions with true outputs
results = rouge.compute(predictions=predictions, references=references)

print(results)

{'rouge1': 0.2832839132813352, 'rouge2': 0.09128860940641217, 'rougeL': 0.2632363689414325, 'rougeLsum': 0.2632207764661384}


Comparing the ROUGE and BERTScore results (both computed on our 70,000-row training sample), the two metrics give very different pictures. By BERTScore's measures the model performed fairly well (roughly 85 percent for precision and around 86 percent for recall and F1), while the ROUGE scores were poor (ranging from 9 percent for ROUGE-2 to around 26 percent for ROUGE-L and ROUGE-Lsum and 28 percent for ROUGE-1). To us, this means that while the model did a decent job of matching the semantic meaning of the true summaries with its predictions (what BERTScore measures), it did not do a good job of matching the true summaries word-for-word (what ROUGE measures). We will attempt to improve these scores with further training and fine-tuning of the model in the next project milestone, and will work to address the large number of empty summaries the model outputs as predictions.