# Group Project - Model Baseline
### DSBA 6165
### Divam Arora, Connor Moore, Hemanth Velan

### Sources:
* https://huggingface.co/datasets/gigaword
* https://huggingface.co/docs/datasets/process#export
* https://huggingface.co/docs/datasets/v1.11.0/splits.html
* https://www.geeksforgeeks.org/string-punctuation-in-python/
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
* https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
* https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
* https://towardsdatascience.com/bertscore-evaluating-text-generation-with-bert-beb7b3431300
* https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
* https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
* https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
* https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
* https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01

First we need to re-run the code from our EDA/pre-processing notebook that loads and prepares our dataset for implementation.

In [None]:
# !pip install evaluate
# !conda install -c huggingface transformers
# !pip install transformers==2.5.0

In [1]:
import transformers
transformers.__version__

'4.35.0'

In [2]:
# import needed packages
import nltk
import time
import string
import evaluate
import pandas as pd
import datasets as ds
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from transformers import pipeline, BartForConditionalGeneration, BartTokenizer

# download the stopword list, punkt tokenizer data, and WordNet corpus from the nltk library
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\heman\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\heman\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\heman\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# https://huggingface.co/datasets/gigaword
# https://huggingface.co/docs/datasets/v1.11.0/splits.html

# download gigaword dataset from Hugging Face dataset library
train, test, validation = ds.load_dataset("gigaword", split=["train", "test", "validation"])

In [4]:
# display the dataset splits
print(train)
print(test)
print(validation)

Dataset({
    features: ['document', 'summary'],
    num_rows: 3803957
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 1951
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 189651
})


In [5]:
# https://huggingface.co/docs/datasets/process#export

# export the training dataset to a pandas dataframe
df_train = train.to_pandas()
print("Train df exported.")

# export the test dataset to a pandas dataframe
df_test = test.to_pandas()
print("Test df exported.")

# export the validation dataset to a pandas dataframe
df_val = validation.to_pandas()
print("Validation df exported.")

Train df exported.
Test df exported.
Validation df exported.


### Balancing the train-test split
The provided split is heavily skewed towards the training set (about 95% of all rows), and the full dataset is far too large to run through our model in a reasonable time. We decided to shrink the training set to 70,000 randomly sampled entries and to concatenate the provided test and validation sets. From a 30,000-row random sample of that combined set we extract a 5,000-entry validation set and use the remaining 25,000 entries as the test set.
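As a quick arithmetic check of the proportions involved, the sketch below uses the row counts printed for the three provided splits and the target sizes described in this section. The variable names are ours and only serve the illustration.

```python
# row counts as printed for the provided gigaword splits
provided = {"train": 3_803_957, "test": 1_951, "validation": 189_651}
total_rows = sum(provided.values())

# share of all rows that sits in the provided training split
train_share = provided["train"] / total_rows

# target sizes chosen for the reduced splits
targets = {"train": 70_000, "test": 25_000, "validation": 5_000}
target_total = sum(targets.values())
target_shares = {name: size / target_total for name, size in targets.items()}

print(f"provided train share: {train_share:.3f}")
print(target_shares)
```

The reduced splits correspond to a 70/25/5 division of 100,000 rows.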

In [6]:
# select 70,000 rows randomly from the train dataframe

df_train_short = df_train.sample(n = 70000, random_state=2)

df_train_short

Unnamed: 0,document,summary
644708,a british soldier was killed saturday by an ex...,british soldier killed in afghanistan blast
1506983,ukraine insists on building two new nuclear re...,ukraine insists on linking chernobyl closure t...
3429980,portuguese president mario soares will pay an ...,portugal 's president to visit angola next month
2028209,aol stepped up its transformation from interne...,aol introduces new advertising network plans t...
1392922,marine experts from wwf flew to the northern k...,suspected toxic algae bloom leaves thousands o...
...,...,...
3124694,hong kong 's benchmark hang seng index ended h...,hong kong stocks edged up after four straight ...
1237703,former brazil coach carlos alberto parreira sa...,parreira says he 's close to an agreement to c...
671101,around ## youths on thursday protested outside...,latvian youths protest ban of UNK symbols
1601285,ohio 's method of putting prisoners to death i...,ohio judge says state s lethal injection proce...


In [7]:
# combine provided test and val sets and reseparate randomly into smaller subsets

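# concat test and validation sets
# ignore_index=True gives every row a unique label; otherwise both frames keep labels starting
# at 0, and the label-based drop() below would also remove test rows that share a label with a validation row
test_val = [df_test, df_val]
df_testval_bulk = pd.concat(test_val, ignore_index=True)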

# take a random sample of 30000 rows from the test and validation bulk set
df_testval_short = df_testval_bulk.sample(n = 30000, random_state=3)

# take a random 5000 row sample from the test-val subset
df_val_short = df_testval_short.sample(n = 5000, random_state=4)

# drop all rows taken for the validation sample from the test-val subset to create the test set
df_test_short = df_testval_short.drop(df_val_short.index, axis=0)
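The sample-then-drop logic can be checked on toy data. The sketch below is a minimal illustration, not the notebook's actual data: the two small frames are hypothetical stand-ins for `df_test` and `df_val`, and it assumes the combined frame has unique row labels (obtained here with `ignore_index=True`), since `drop` removes rows by label.

```python
import pandas as pd

# hypothetical stand-ins for the provided test and validation frames
toy_test = pd.DataFrame({"document": [f"t{i}" for i in range(4)]})
toy_val = pd.DataFrame({"document": [f"v{i}" for i in range(6)]})

# unique labels after concatenation, so a label identifies exactly one row
bulk = pd.concat([toy_test, toy_val], ignore_index=True)

# same pattern as the cell above, with smaller sample sizes
subset = bulk.sample(n=8, random_state=3)
val_part = subset.sample(n=3, random_state=4)
test_part = subset.drop(val_part.index, axis=0)

print(len(subset), len(val_part), len(test_part))
```

With unique labels the two parts are disjoint and their sizes add up to the size of the sampled subset.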

In [8]:
# the methods required to perform this function were found in this article -
# https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
# the function and comments are our original work

# set all words in all rows to lower case

def lower(df):
    # vectorize strings in each row in summary column and set to lower case
    df["summary"] = df["summary"].str.lower()
    print("summary column lowercased")
    # vectorize strings in each row in document column and set to lower case
    df["document"] = df["document"].str.lower()
    print("document column lowercased")

In [9]:
# geeks for geeks and pandas doc pages were used as template source code and informed about parameter options
# stackoverflow posts helped with debugging issues
# https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# https://www.geeksforgeeks.org/string-punctuation-in-python/
# https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
# comments and function are our original work, source code was modified to fit our workspace

# remove all symbols and punctuation

# create instance of all punctuation symbols
punctuation = string.punctuation

# since we learned there are lots of apostrophe s in the dataset during EDA, we will add this to our remove list
punct_list = ["'s"]

# add all punctuation from the premade variable to our new list
for symbol in punctuation:
    punct_list.append(symbol)

# display the symbols included in our list
print(punct_list)

def remove_punctuation(df):
    # for each symbol in our punctuation list
    for symbol in punct_list:
        # iterate through the dataframe and replace every instance of the symbol with an empty string
        df["document"] = df["document"].str.replace(symbol, "", regex=False)
        df["summary"] = df["summary"].str.replace(symbol, "", regex=False)
    print("symbols removed")

["'s", '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
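The effect of the removal list can be checked on a single sentence. The sketch below rebuilds the same list and applies the same `str.replace(..., regex=False)` call; the sentence itself is made up for illustration, written in the style of the dataset (space-separated punctuation, `##` in place of numbers).

```python
import string
import pandas as pd

# same construction as above: "'s" first, then every punctuation character
punct_list = ["'s"] + list(string.punctuation)

# made-up example sentence in the style of the gigaword documents
toy = pd.DataFrame(
    {"document": ["portugal 's president visits a tobacco-growing region , ## people attend ."]}
)

for symbol in punct_list:
    toy["document"] = toy["document"].str.replace(symbol, "", regex=False)

cleaned = toy["document"].iloc[0]
tokens = cleaned.split()
print(tokens)
```

Removing `-` fuses hyphenated words, which is why tokens such as `tobaccogrowing` appear in the pre-processed data, and the `##` placeholders disappear entirely. Listing `'s` before `'` matters: if the bare apostrophe were removed first, a stray `s` would be left behind.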


In [10]:
# source code and ideas for this process were gathered from the following geeks for geeks page and article -
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
# comments and functions are original work, source code was modified to fit our workspace

# tokenization and removal of stopwords

# create an instance of all stopwords
stop_words = set(stopwords.words('english'))

# function for removing stopwords from a given input
def remove_stopwords(text):
    # tokenize the input string
    tokens = word_tokenize(text)
    # create an empty list for the new output
    filtered_tokens = []
    # for each word in the tokenized text
    for word in tokens:
        # if the word is not a stop word
        if word not in stop_words:
            # add the token to the new output list
            filtered_tokens.append(word)

    return filtered_tokens

# function to apply the stopword removal/tokenization function to input dataframes
def tokenize_nostop(df):
    # iterate through the dataframe and tokenize/remove stop words for each row
    df["document"] = df["document"].apply(remove_stopwords)
    print("stopwords removed from document column")
    
    df["summary"] = df["summary"].apply(remove_stopwords)
    print("stopwords removed from summary column")
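To show what the filter does to one row without requiring the NLTK downloads, here is a minimal sketch under two loudly stated assumptions: `toy_stop_words` is a tiny hypothetical stand-in for `stopwords.words('english')`, and `str.split()` stands in for `word_tokenize`. The input is one of the summaries shown earlier after lowercasing and punctuation removal.

```python
# hypothetical stand-in for NLTK's English stopword list (the real list is much longer)
toy_stop_words = {"to", "the", "a", "an", "in", "on", "by", "of"}

def remove_stopwords_sketch(text):
    # str.split() stands in for word_tokenize in this sketch
    return [word for word in text.split() if word not in toy_stop_words]

# summary text after lowercasing and punctuation removal
cleaned_summary = "portugal  president to visit angola next month"
filtered = remove_stopwords_sketch(cleaned_summary)
print(filtered)
```

The result is a list of tokens rather than a string, which is why the columns hold lists after this step.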

In [11]:
# inspiration and source code for NLTK's word net lemmatizer came from the following article -
# https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
# functions and comments are our original work, source code was modified to fit our workspace

# lemmatization

# create an instance of NLTK's word net lemmatizer class
wml = WordNetLemmatizer()

# function to lemmatize a given tokenized text input
def lemmatization(text):
    # create an empty list for new output
    lemma_words = []

    # for each word in the given input
    for word in text:
        # lemmatize the word
        token = wml.lemmatize(word)
        # and add it into our new output list
        lemma_words.append(token)
    
    return lemma_words

# function to call lemmatization function on the rows of an input dataframe
def lemmatize(df):
    # iterate through the rows of the input dataframe and apply the lemmatization function to each row
    df["document"] = df["document"].apply(lemmatization)
    print("document column lemmatized")

    df["summary"] = df["summary"].apply(lemmatization)
    print("summary column lemmatized")

In [12]:
# create data pre-processing pipeline

def pre_proc(df):
    # lowercase
    lower(df)
    # remove punctuation and symbols
    remove_punctuation(df)
    # tokenize and remove stopwords
    tokenize_nostop(df)
    # lemmatize
    lemmatize(df)
    print("pre-processed successfully")

In [13]:
# call the data pre-processing pipeline for each of the dataset splits

pre_proc(df_train_short)
print("train df completed")
pre_proc(df_test_short)
print("test df completed")
pre_proc(df_val_short)
print("validation df completed")

# display new format of data using training set
df_train_short.head(10)

summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
train df completed
summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
test df completed
summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
validation df completed


Unnamed: 0,document,summary
644708,"[british, soldier, killed, saturday, explosion...","[british, soldier, killed, afghanistan, blast]"
1506983,"[ukraine, insists, building, two, new, nuclear...","[ukraine, insists, linking, chernobyl, closure..."
3429980,"[portuguese, president, mario, soares, pay, of...","[portugal, president, visit, angola, next, month]"
2028209,"[aol, stepped, transformation, internet, acces...","[aol, introduces, new, advertising, network, p..."
1392922,"[marine, expert, wwf, flew, northern, kenyan, ...","[suspected, toxic, algae, bloom, leaf, thousan..."
887130,"[indian, share, price, closed, percent, higher...","[indian, share, close, pct]"
1230066,"[gustav, slammed, cuba, tobaccogrowing, wester...","[gustav, slam, cuba, massive, category, hurric..."
2933565,"[week, ago, researcher, wisconsin, japan, said...","[unk, stem, cell, venture, land, million]"
1592999,"[two, japan, biggest, soccer, star, returned, ...","[soccer, star, return, home, dropped, national..."
960389,"[united, state, britain, unleashed, massive, a...","[u, unleashes, aerial, assault, take, port, ai..."


Our dataset splits are now pre-processed and ready for use with models.

### Convert lists to string format for improved model compatibility

In [14]:
# convert list entries into single strings

def list2string(tokens):
    # join a list of tokens into one space-separated string
    output = " ".join(tokens)
    return output

In [15]:
# apply function to create new stringified columns for

# train
df_train_short["docString"] = df_train_short["document"].map(list2string)
df_train_short["sumString"] = df_train_short["summary"].map(list2string)

# test
df_test_short["docString"] = df_test_short["document"].map(list2string)
df_test_short["sumString"] = df_test_short["summary"].map(list2string)

# and val
df_val_short["docString"] = df_val_short["document"].map(list2string)
df_val_short["sumString"] = df_val_short["summary"].map(list2string)

df_train_short.head(5)

Unnamed: 0,document,summary,docString,sumString
644708,"[british, soldier, killed, saturday, explosion...","[british, soldier, killed, afghanistan, blast]",british soldier killed saturday explosion sout...,british soldier killed afghanistan blast
1506983,"[ukraine, insists, building, two, new, nuclear...","[ukraine, insists, linking, chernobyl, closure...",ukraine insists building two new nuclear react...,ukraine insists linking chernobyl closure buil...
3429980,"[portuguese, president, mario, soares, pay, of...","[portugal, president, visit, angola, next, month]",portuguese president mario soares pay official...,portugal president visit angola next month
2028209,"[aol, stepped, transformation, internet, acces...","[aol, introduces, new, advertising, network, p...",aol stepped transformation internet access pro...,aol introduces new advertising network plan mo...
1392922,"[marine, expert, wwf, flew, northern, kenyan, ...","[suspected, toxic, algae, bloom, leaf, thousan...",marine expert wwf flew northern kenyan coast t...,suspected toxic algae bloom leaf thousand fish...


### BART Model

In [16]:
# summarizer = pipeline("summarization", model="facebook/bart-base")

### Alternative approach: loading the tokenizer and model directly

https://blog.paperspace.com/bart-model-for-text-summarization-part1/

In [17]:
# Tokenizer and model loading for bart-base

tokenizer=BartTokenizer.from_pretrained('facebook/bart-base')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-base')

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


### GPU setup

In [18]:
# Download Cuda Toolkit 12.1.0
# Download Pytorch Cuda from this website:
# https://pytorch.org/get-started/locally/

In [19]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [20]:
model.to(device)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=

In [22]:
# tokenizer.to(device)

In [25]:
def runBart(df):
    predictions = []
    times = []

    for i in range(len(df)):
        start = time.perf_counter()
        doc = df.iloc[i]["docString"]
        # cap the summary length at roughly one tenth of the document's character count
        maxLen = int(len(doc) / 10)+1

        # encode the document and move the resulting tensors to the model's device
        inputs = tokenizer.batch_encode_plus([doc],return_tensors='pt').to(device)
        # generate the summary token ids; max_length is counted in tokens
        summary_ids = model.generate(inputs['input_ids'], max_length=maxLen, min_length=0)

        # decode the generated token ids back into a summary string
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        
        predictions.append(summary)

        end = time.perf_counter()
        speed = end - start
        times.append(speed)
        if i % 1000 == 0:
            avg_time = sum(times) / len(times)
            print("average time per row at", i, "row: ", avg_time)


    df["BART_Pred"] = predictions
    return df
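The loop above summarizes one row at a time; at roughly 0.13 seconds per row, 70,000 rows take about two and a half hours. One possible way to reduce this is to send several documents to the model per call. The sketch below only shows the batching logic: `run_in_batches` and `first_three_words` are hypothetical names, and the stand-in summarizer replaces the real model so the example runs without downloading anything.

```python
from typing import Callable, List, Sequence

def run_in_batches(docs: Sequence[str],
                   summarize_batch: Callable[[List[str]], List[str]],
                   batch_size: int = 32) -> List[str]:
    """Call summarize_batch on consecutive slices of docs and collect the outputs in order."""
    predictions: List[str] = []
    for start in range(0, len(docs), batch_size):
        batch = list(docs[start:start + batch_size])
        predictions.extend(summarize_batch(batch))
    return predictions

# stand-in summarizer (assumption): keeps the first three words of each document.
# With the real model, this function would tokenize the batch with padding and
# truncation enabled and pass both input_ids and attention_mask to model.generate().
def first_three_words(batch: List[str]) -> List[str]:
    return [" ".join(doc.split()[:3]) for doc in batch]

docs = ["british soldier killed saturday explosion", "hong kong benchmark hang seng index"]
preds = run_in_batches(docs, first_three_words, batch_size=1)
print(preds)
```

Here `docs` is a plain list, for example `df["docString"].tolist()`. A batched call shares one `max_length` across the batch, so the per-document length cap used in `runBart` would need to be replaced by a single value.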

In [26]:
runBart(df_train_short)

df_train_short

average time per row at 0 row:  1.3844849000161048
average time per row at 1000 row:  0.13752335014969477
average time per row at 2000 row:  0.13661073253370531
average time per row at 3000 row:  0.14155803655433416
average time per row at 4000 row:  0.14157741497109072
average time per row at 5000 row:  0.13809193395304983
average time per row at 6000 row:  0.13655051448093058
average time per row at 7000 row:  0.1351348340237436
average time per row at 8000 row:  0.13427754680675216
average time per row at 9000 row:  0.1335037070548648
average time per row at 10000 row:  0.13289672331774152
average time per row at 11000 row:  0.13216971444427067
average time per row at 12000 row:  0.1316068232898574
average time per row at 13000 row:  0.1314000820630301
average time per row at 14000 row:  0.13102607890876516
average time per row at 15000 row:  0.131005642337324
average time per row at 16000 row:  0.1307042999439079
average time per row at 17000 row:  0.13076432159887147
average time 

Unnamed: 0,document,summary,docString,sumString,BART_Pred
644708,"[british, soldier, killed, saturday, explosion...","[british, soldier, killed, afghanistan, blast]",british soldier killed saturday explosion sout...,british soldier killed afghanistan blast,british soldier killed saturday
1506983,"[ukraine, insists, building, two, new, nuclear...","[ukraine, insists, linking, chernobyl, closure...",ukraine insists building two new nuclear react...,ukraine insists linking chernobyl closure buil...,ukraine insists building two new nuclear react...
3429980,"[portuguese, president, mario, soares, pay, of...","[portugal, president, visit, angola, next, month]",portuguese president mario soares pay official...,portugal president visit angola next month,portuguese president mario soares pay official...
2028209,"[aol, stepped, transformation, internet, acces...","[aol, introduces, new, advertising, network, p...",aol stepped transformation internet access pro...,aol introduces new advertising network plan mo...,aol stepped transformation internet access pro...
1392922,"[marine, expert, wwf, flew, northern, kenyan, ...","[suspected, toxic, algae, bloom, leaf, thousan...",marine expert wwf flew northern kenyan coast t...,suspected toxic algae bloom leaf thousand fish...,marine expert wwf flew northern kenyan coast t...
...,...,...,...,...,...
3124694,"[hong, kong, benchmark, hang, seng, index, end...","[hong, kong, stock, edged, four, straight, ses...",hong kong benchmark hang seng index ended high...,hong kong stock edged four straight session dec,hong kong benchmark hang seng index
1237703,"[former, brazil, coach, carlos, alberto, parre...","[parreira, say, close, agreement, coach, south...",former brazil coach carlos alberto parreira sa...,parreira say close agreement coach south africa,former brazil coach carlos alberto parreira sa...
671101,"[around, youth, thursday, protested, outside, ...","[latvian, youth, protest, ban, unk, symbol]",around youth thursday protested outside latvia...,latvian youth protest ban unk symbol,around youth thursday protested outside latvia...
1601285,"[ohio, method, putting, prisoner, death, uncon...","[ohio, judge, say, state, lethal, injection, p...",ohio method putting prisoner death unconstitut...,ohio judge say state lethal injection process ...,ohio method putting prisoner death unconstitut...


## Next step: use the BART model above as the starting point for a fine-tuning (training/epoch) loop

#### Hide

In [27]:
# df_train["index"] = df_train.index

In [28]:
# samp_li = df_train.iloc[0]["document"]
# samp_li

In [29]:
# samp_li = samp_li.replace("\'", "").strip('][').replace(',', '')
# print(type(samp_li))
# # samp_li = samp_li.split(', ')
# max_len = int(len(samp_li)/2)
# print(max_len)
# samp_li

In [30]:
# samp_art = " ".join(samp_li)
# samp_art

In [31]:
# samp_sum = df_train.iloc[0]["summary"].replace("\'", "").strip('][').split(', ')
# samp_sum = " ".join(samp_sum)
# samp_sum

#### Don't Hide

In [None]:
# print(df_train.iloc[0]["document"])

In [None]:
# print(df_train.iloc[0]["docString"])

In [None]:
# dft = df_train.head(1000)
# dft

In [32]:
# def runBart(df):
#     predictions = []
#     times = []

#     for i in range(len(df)):
#         start = time.perf_counter()
#         doc = df.iloc[i]["docString"]
#         maxLen = int(len(doc) / 10)
#         summary = summarizer(doc, max_length=maxLen,  min_length=1, do_sample=False)[0]["summary_text"]
#         predictions.append(summary)
#         end = time.perf_counter()
#         speed = end - start
#         times.append(speed)
#         if i % 25 == 0:
#             avg_time = sum(times) / len(times)
#             print("average time per row at", i, "row: ", avg_time)


#     df["BART_Pred"] = predictions
#     return df

In [None]:
# df_train_short = runBart(df_train_short)
# df_train_short

In [None]:
# predictions = []
# for i in range(len(df_train)):
#     doc = df_train.iloc[i]["docString"]
#     # print(doc)
#     # print(len(doc.split(" ")))
#     maxlen = int(len(doc.split(" ")))
#     # print(maxlen)
#     # pred = summarizer(doc, max_length=maxlen, min_length=4, do_sample=False)[0]["summary_text"]
#     # print(pred)
#     predictions.append(summarizer(doc, max_length=maxlen, min_length=1, do_sample=False)[0]["summary_text"])
    
# df_train["Prediction"] = predictions
# df_train
# predictions

In [None]:
# res = summarizer(samp_art, max_length=max_len, min_length=4, do_sample=False)
# res = res[0]["summary_text"]
# res

### BERTScore Metrics

In [None]:
# ! pip install evaluate
# ! pip install bert_score

In [33]:
from evaluate import load
bertscore = load("bertscore")

In [35]:
df_train_short

Unnamed: 0,document,summary,docString,sumString,BART_Pred
644708,"[british, soldier, killed, saturday, explosion...","[british, soldier, killed, afghanistan, blast]",british soldier killed saturday explosion sout...,british soldier killed afghanistan blast,british soldier killed saturday
1506983,"[ukraine, insists, building, two, new, nuclear...","[ukraine, insists, linking, chernobyl, closure...",ukraine insists building two new nuclear react...,ukraine insists linking chernobyl closure buil...,ukraine insists building two new nuclear react...
3429980,"[portuguese, president, mario, soares, pay, of...","[portugal, president, visit, angola, next, month]",portuguese president mario soares pay official...,portugal president visit angola next month,portuguese president mario soares pay official...
2028209,"[aol, stepped, transformation, internet, acces...","[aol, introduces, new, advertising, network, p...",aol stepped transformation internet access pro...,aol introduces new advertising network plan mo...,aol stepped transformation internet access pro...
1392922,"[marine, expert, wwf, flew, northern, kenyan, ...","[suspected, toxic, algae, bloom, leaf, thousan...",marine expert wwf flew northern kenyan coast t...,suspected toxic algae bloom leaf thousand fish...,marine expert wwf flew northern kenyan coast t...
...,...,...,...,...,...
3124694,"[hong, kong, benchmark, hang, seng, index, end...","[hong, kong, stock, edged, four, straight, ses...",hong kong benchmark hang seng index ended high...,hong kong stock edged four straight session dec,hong kong benchmark hang seng index
1237703,"[former, brazil, coach, carlos, alberto, parre...","[parreira, say, close, agreement, coach, south...",former brazil coach carlos alberto parreira sa...,parreira say close agreement coach south africa,former brazil coach carlos alberto parreira sa...
671101,"[around, youth, thursday, protested, outside, ...","[latvian, youth, protest, ban, unk, symbol]",around youth thursday protested outside latvia...,latvian youth protest ban unk symbol,around youth thursday protested outside latvia...
1601285,"[ohio, method, putting, prisoner, death, uncon...","[ohio, judge, say, state, lethal, injection, p...",ohio method putting prisoner death unconstitut...,ohio judge say state lethal injection process ...,ohio method putting prisoner death unconstitut...


In [52]:
len(df_train_short["BART_Pred"])

70000

In [51]:
len(df_train_short["sumString"])

70000

In [47]:
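# convert the columns to plain lists of strings before scoring; passing the pandas Series
# directly raised `KeyError: 0`, most likely because the sampled frame keeps its original
# row labels, so there is no row labelled 0
predictions = df_train_short["BART_Pred"].tolist()
references = df_train_short["sumString"].tolist()
results = bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")
# print(results)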
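A `KeyError: 0` is what pandas raises when code indexes a Series with `[0]` and no row carries the label 0, which is the case for a frame sampled from a larger one. The toy Series below (hypothetical strings, with two of the row labels shown in the tables above) reproduces that behaviour and shows that converting to a list restores positional access.

```python
import pandas as pd

# toy predictions that keep the original, non-zero-based row labels
preds = pd.Series(["first prediction", "second prediction"], index=[644708, 1506983])

# with an integer index, preds[0] is a label lookup, and no row is labelled 0
try:
    preds[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# a plain list is indexed by position, which is what the metric code expects
as_list = preds.tolist()
first_item = as_list[0]
print(lookup_failed, first_item)
```

Calling `reset_index(drop=True)` on the frame would be another way to get zero-based labels, at the cost of losing the original row identifiers.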

In [None]:
# keys = list(results.keys())
# keys

In [None]:
# keys = list(results.keys())
# for k in range(len(keys)-1):
#     s = sum(results[keys[k]])
#     le = len(results[keys[k]])
#     avg = s/le
#     print("Average {} is {}".format(keys[k], avg))

### ROUGE Metrics

In [None]:
# ! pip install rouge-score

In [1]:
import evaluate
rouge = evaluate.load('rouge')

In [3]:
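# df_train_short must already be defined in the running kernel (re-run the cells above after a restart)
# pass plain lists of strings, one prediction per reference, rather than each Series wrapped in a list
predictions = df_train_short["BART_Pred"].tolist()
references = df_train_short["sumString"].tolist()
results = rouge.compute(predictions=predictions, references=references)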
print(results)

NameError: name 'df_train_short' is not defined

### METEOR Metrics

In [None]:
# from nltk.translate import meteor

In [None]:
# ss = samp_sum.split(" ")
# r = res.split(" ")
# print(ss)
# print(r)

In [None]:
# result = round(meteor([r, ss], r), 4)
# result

### BLEU Metrics

In [None]:
# from datasets import load_metric

In [None]:
# bleu = load_metric("bleu")

In [None]:
# predictions = res.split(" ")
# references = samp_sum.split(" ")
# results = bleu.compute(predictions=[[predictions]], references=[[references]])
# print(results)

# Performance Metrics - BERTScore and ROUGE

https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
https://towardsdatascience.com/bertscore-evaluating-text-generation-with-bert-beb7b3431300


We have chosen BERTScore and ROUGE as our two metrics for evaluating the performance of our model. We chose ROUGE because it is purpose-built for evaluating text summarization, which is the task our model performs. We will generate scores for ROUGE-N (as ROUGE-1 and ROUGE-2), ROUGE-L, and ROUGE-S for the most comprehensive evaluation possible. ROUGE-1 and ROUGE-2 count the unigrams and bigrams, respectively, shared between the model output and the reference ("correct") output. ROUGE-L measures the longest common subsequence of words shared between the model's output and the reference. ROUGE-S counts shared skip-bigrams: pairs of words that appear in the same order in both texts but may be separated by one or more other words. Each of these overlap counts can be reported as precision, recall, and F1 when comparing the model's summaries with the original human-written ones.
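To make the overlap counts concrete, here is a hand-rolled sketch of ROUGE-1, ROUGE-2, and ROUGE-L on one prediction and reference pair taken from the table above. It is an illustration under simplifying assumptions (whitespace tokenization, no stemming), not the implementation used by the `evaluate` library, and the function names are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    # all contiguous n-token windows
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prf(overlap, pred_total, ref_total):
    # precision, recall, and F1 from an overlap count
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def rouge_n(prediction, reference, n):
    pred = Counter(ngrams(prediction.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    return prf(overlap, sum(pred.values()), sum(ref.values()))

def lcs_length(a, b):
    # dynamic-programming table for the longest common subsequence
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(prediction, reference):
    p, r = prediction.split(), reference.split()
    return prf(lcs_length(p, r), len(p), len(r))

prediction = "british soldier killed saturday"
reference = "british soldier killed afghanistan blast"
rouge1 = rouge_n(prediction, reference, 1)
rouge2 = rouge_n(prediction, reference, 2)
rougeL = rouge_l(prediction, reference)
print(rouge1, rouge2, rougeL)
```

For this pair, three of the four predicted unigrams appear in the five-word reference, giving a ROUGE-1 precision of 0.75 and a recall of 0.6.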

We chose BERTScore as our second metric because it is also designed to evaluate how a model's text output compares with a reference. Pairing it with ROUGE is interesting because BERTScore, unlike ROUGE or BLEU, makes a semantic comparison between the model's output and the reference rather than a purely lexical one. Instead of counting how many exact words match, BERTScore compares contextual embeddings of the tokens in each text, so words with similar meanings can still score well. This can produce an analysis that is more in line with human intuition.
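The core of BERTScore is a greedy matching of token embeddings by cosine similarity. The sketch below shows that matching step on made-up two-dimensional vectors; real scores use contextual embeddings from a model such as `distilbert-base-uncased`, and this simplified version leaves out optional extras such as importance weighting. The function name and the toy vectors are assumptions for illustration.

```python
import numpy as np

def greedy_match_scores(cand_emb, ref_emb):
    # normalise rows so that dot products are cosine similarities
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # shape: (candidate tokens, reference tokens)

    # precision: each candidate token is matched to its most similar reference token
    precision = sim.max(axis=1).mean()
    # recall: each reference token is matched to its most similar candidate token
    recall = sim.max(axis=0).mean()
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# made-up embeddings: two candidate tokens, three reference tokens
cand_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_emb = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

precision, recall, f1 = greedy_match_scores(cand_emb, ref_emb)
print(precision, recall, f1)
```

Both candidate tokens have an exact counterpart in the reference, so precision is 1, while the middle reference token is only partially covered, which lowers recall.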

https://towardsdatascience.com/teaching-bart-to-rap-fine-tuning-hugging-faces-bart-model-41749d38f3ef

https://github.com/facebookresearch/fairseq/tree/main/examples/bart

Building our own Model if needed:
https://github.com/aravindpai/How-to-build-own-text-summarizer-using-deep-learning/blob/master/How_to_build_own_text_summarizer_using_deep_learning.ipynb

Fine-Tuning Pre-Trained Models from Huggingface: https://huggingface.co/docs/transformers/training

# Why BART