# Group Project - Model Baseline
### DSBA 6165
### Divam Arora, Connor Moore, Hemanth Velan

### Sources:
* https://huggingface.co/datasets/gigaword
* https://huggingface.co/docs/datasets/process#export
* https://huggingface.co/docs/datasets/v1.11.0/splits.html
* https://www.geeksforgeeks.org/string-punctuation-in-python/
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
* https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
* https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
* https://towardsdatascience.com/bertscore-evaluating-text-generation-with-bert-beb7b3431300
* https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
* https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
* https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
* https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
* https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01

First we need to re-run the code from our EDA/pre-processing notebook that loads and prepares our dataset for implementation.

In [29]:
# import needed packages
import nltk
import time
import string
import pandas as pd
import datasets as ds
from nltk.corpus import stopwords
from transformers import pipeline
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download stop word package from nltk library
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# https://huggingface.co/datasets/gigaword
# https://huggingface.co/docs/datasets/v1.11.0/splits.html

# download gigaword dataset from Hugging Face dataset library
train, test, validation = ds.load_dataset("gigaword", split=["train", "test", "validation"])

In [3]:
# display the dataset splits
print(train)
print(test)
print(validation)

In [4]:
# https://huggingface.co/docs/datasets/process#export

# export the training dataset to a pandas dataframe
df_train = train.to_pandas()
print("Train df exported.")

# export the test dataset to a pandas dataframe
df_test = test.to_pandas()
print("Test df exported.")

# export the validation dataset to a pandas dataframe
df_val = validation.to_pandas()
print("Validation df exported.")

In [5]:
# the methods required to perform this function were found in this article -
# https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
# the function and comments are our original work

# set all words in all rows to lower case

def lower(df):
    # vectorize strings in each row in summary column and set to lower case
    df["summary"] = df["summary"].str.lower()
    print("summary column lowercased")
    # vectorize strings in each row in document column and set to lower case
    df["document"] = df["document"].str.lower()
    print("document column lowercased")

In [6]:
# geeks for geeks and pandas doc pages were used as template source code and informed about parameter options
# stackoverflow posts helped with debugging issues
# https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# https://www.geeksforgeeks.org/string-punctuation-in-python/
# https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
# comments and function are our original work, source code was modified to fit our workspace

# remove all symbols and punctuation

# create instance of all punctuation symbols
punctuation = string.punctuation

# since we learned there are lots of apostrophe s in the dataset during EDA, we will add this to our remove list
punct_list = ["'s"]

# add all punctuation from the premade variable to our new list
for symbol in punctuation:
    punct_list.append(symbol)

# display the symbols included in our list
print(punct_list)

def remove_punctuation(df):
    # for each symbol in our punctuation list
    for symbol in punct_list:
        # iterate through the dataframe and replace every instance of the symbol with an empty string
        # regex=False treats each symbol literally, so characters such as "." or "(" are not read as regex syntax
        df["document"] = df["document"].str.replace(symbol, "", regex=False)
        df["summary"] = df["summary"].str.replace(symbol, "", regex=False)
    print("symbols removed")

In [7]:
# source code and ideas for this process were gathered from the following geeks for geeks page and article -
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
# comments and functions are original work, source code was modified to fit our workspace

# tokenization and removal of stopwords

# create an instance of all stopwords
stop_words = set(stopwords.words('english'))

# function for removing stopwords from a given input
def remove_stopwords(text):
    # tokenize the input string
    tokens = word_tokenize(text)
    # create an empty list for the new output
    filtered_tokens = []
    # for each word in the tokenized text
    for word in tokens:
        # if the word is not a stop word
        if word not in stop_words:
            # add the token to the new output list
            filtered_tokens.append(word)

    return filtered_tokens

# function to apply the stopword removal/tokenization function to input dataframes
def tokenize_nostop(df):
    # iterate through the dataframe and tokenize/remove stop words for each row
    df["document"] = df["document"].apply(remove_stopwords)
    print("stopwords removed from document column")
    
    df["summary"] = df["summary"].apply(remove_stopwords)
    print("stopwords removed from summary column")

In [8]:
# inspiration and source code for NLTK's word net lemmatizer came from the following article -
# https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
# functions and comments are our original work, source code was modified to fit our workspace

# lemmatization

# create an instance of NLTK's word net lemmatizer class
wml = WordNetLemmatizer()

# function to lemmatize a given tokenized text input
def lemmatization(text):
    # create an empty list for new output
    lemma_words = []

    # for each word in the given input
    for word in text:
        # lemmatize the word
        token = wml.lemmatize(word)
        # and add it into our new output list
        lemma_words.append(token)
    
    return lemma_words

# function to call lemmatization function on the rows of an input dataframe
def lemmatize(df):
    # iterate through the rows of the input dataframe and apply the lemmatization function to each row
    df["document"] = df["document"].apply(lemmatization)
    print("document column lemmatized")

    df["summary"] = df["summary"].apply(lemmatization)
    print("summary column lemmatized")

In [9]:
# create data pre-processing pipeline

def pre_proc(df):
    # lowercase
    lower(df)
    # remove punctuation and symbols
    remove_punctuation(df)
    # tokenize and remove stopwords
    tokenize_nostop(df)
    # lemmatize
    lemmatize(df)
    print("pre-processed successfully")

In [10]:
# call the data pre-processing pipeline for each of the dataset splits

pre_proc(df_train)
print("train df completed")
pre_proc(df_test)
print("test df completed")
pre_proc(df_val)
print("validation df completed")

# display new format of data using training set
df_train.head(10)

Our dataset splits are now pre-processed and ready for use with models.

In [11]:
# export processed datasets to csv files so that pre-processing does not have to be re-run

df_test.to_csv("test.csv", index=False)
print("test set exported")

df_val.to_csv("val.csv", index=False)
print("validation set exported")

df_train.to_csv("train.csv", index=False)
print("train set exported")

In [5]:
# import dataframes from saved csv files

df_train = pd.read_csv("train.csv")
print("train set imported")

df_test = pd.read_csv("test.csv")
print("test set imported")

df_val = pd.read_csv("val.csv")  
print("validation set imported")

train set imported
test set imported
validation set imported


In [6]:
df_train

Unnamed: 0,document,summary
0,"['australia', 'current', 'account', 'deficit',...","['australian', 'current', 'account', 'deficit'..."
1,"['least', 'two', 'people', 'killed', 'suspecte...","['least', 'two', 'dead', 'southern', 'philippi..."
2,"['australian', 'share', 'closed', 'percent', '...","['australian', 'stock', 'close', 'percent']"
3,"['south', 'korea', 'nuclear', 'envoy', 'kim', ...","['envoy', 'urge', 'north', 'korea', 'restart',..."
4,"['south', 'korea', 'monday', 'announced', 'swe...","['skorea', 'announces', 'tax', 'cut', 'stimula..."
...,...,...
3803952,"['state', 'duma', 'lower', 'house', 'russian',...","['duma', 'urge', 'yeltsin', 'reconsider', 'tro..."
3803953,"['u', 'justice', 'department', 'today', 'rejec...","['u', 'justice', 'department', 'reject', 'spec..."
3803954,"['united', 'nation', 'calling', 'million', 'do...","['un', 'seek', 'fund', 'program', 'former', 'y..."
3803955,"['president', 'jacques', 'chirac', 'today', 'p...","['chirac', 'get', 'birthday', 'gift', 'th', 'c..."


### Convert everything to string for better compatibility with text summarization models

In [7]:
def all2String(samp_li):
    return samp_li.replace("\'", "").strip('][').replace(',', '')
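
After the CSV round trip, each token list comes back as a string such as `"['a', 'b']"`, which is why `all2String` strips brackets, quotes, and commas. A possible alternative, sketched here under the assumption that every cell is a valid Python list literal, is to parse the string and join the tokens. `list_string_to_text` is a hypothetical name, not part of the notebook.

```python
import ast


def list_string_to_text(cell):
    """Parse a stringified token list such as "['a', 'b']" and join the tokens with spaces."""
    # ast.literal_eval only evaluates Python literals, so it is safe on untrusted text.
    tokens = ast.literal_eval(cell)
    return " ".join(tokens)


print(list_string_to_text("['australian', 'stock', 'close', 'percent']"))
```

Parsing keeps any commas or apostrophes that happen to sit inside a token, whereas character stripping removes them everywhere. For this dataset the two approaches should agree, because punctuation was already removed during pre-processing.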

In [8]:
df_train["docString"] = df_train["document"].map(all2String)
df_train["sumString"] = df_train["summary"].map(all2String)
df_train.head(5)

Unnamed: 0,document,summary,docString,sumString
0,"['australia', 'current', 'account', 'deficit',...","['australian', 'current', 'account', 'deficit'...",australia current account deficit shrunk recor...,australian current account deficit narrow sharply
1,"['least', 'two', 'people', 'killed', 'suspecte...","['least', 'two', 'dead', 'southern', 'philippi...",least two people killed suspected bomb attack ...,least two dead southern philippine blast
2,"['australian', 'share', 'closed', 'percent', '...","['australian', 'stock', 'close', 'percent']",australian share closed percent monday followi...,australian stock close percent
3,"['south', 'korea', 'nuclear', 'envoy', 'kim', ...","['envoy', 'urge', 'north', 'korea', 'restart',...",south korea nuclear envoy kim sook urged north...,envoy urge north korea restart nuclear disable...
4,"['south', 'korea', 'monday', 'announced', 'swe...","['skorea', 'announces', 'tax', 'cut', 'stimula...",south korea monday announced sweeping tax refo...,skorea announces tax cut stimulate economy


In [9]:
df_test["docString"] = df_test["document"].map(all2String)
df_test["sumString"] = df_test["summary"].map(all2String)
df_test.head(5)

Unnamed: 0,document,summary,docString,sumString
0,"['japan', 'nec', 'corp', 'unk', 'computer', 'c...","['nec', 'unk', 'computer', 'sale', 'tieup']",japan nec corp unk computer corp united state ...,nec unk computer sale tieup
1,"['sri', 'lankan', 'government', 'wednesday', '...","['sri', 'lanka', 'close', 'school', 'war', 'es...",sri lankan government wednesday announced clos...,sri lanka close school war escalates
2,"['police', 'arrested', 'five', 'antinuclear', ...","['protester', 'target', 'french', 'research', ...",police arrested five antinuclear protester thu...,protester target french research ship
3,"['factory', 'order', 'manufactured', 'good', '...","['u', 'september', 'factory', 'order', 'percent']",factory order manufactured good rose percent s...,u september factory order percent
4,"['bank', 'japan', 'appealed', 'financial', 'ma...","['bank', 'unk', 'unk', 'calm', 'financial', 'm...",bank japan appealed financial market remain ca...,bank unk unk calm financial market


In [10]:
df_val["docString"] = df_val["document"].map(all2String)
df_val["sumString"] = df_val["summary"].map(all2String)
df_val.head(5)

Unnamed: 0,document,summary,docString,sumString
0,"['fivetime', 'world', 'champion', 'michelle', ...","['injury', 'leaf', 'kwan', 'olympic', 'hope', ...",fivetime world champion michelle kwan withdrew...,injury leaf kwan olympic hope limbo
1,"['u', 'business', 'leader', 'lashed', 'wednesd...","['u', 'business', 'attack', 'tough', 'immigrat...",u business leader lashed wednesday legislation...,u business attack tough immigration law
2,"['general', 'motor', 'corp', 'said', 'wednesda...","['gm', 'december', 'sale', 'fall', 'percent']",general motor corp said wednesday u sale fell ...,gm december sale fall percent
3,"['several', 'thousand', 'people', 'gathered', ...","['thousand', 'croatian', 'celebrate', 'world',...",several thousand people gathered wednesday eve...,thousand croatian celebrate world cup slalom
4,"['u', 'first', 'lady', 'laura', 'bush', 'u', '...","['laura', 'bush', 'unk', 'rice', 'attend', 'si...",u first lady laura bush u secretary state cond...,laura bush unk rice attend sirleaf inauguratio...


### BART Model

In [12]:
summarizer = pipeline("summarization", model="facebook/bart-base")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
All PyTorch model weights were used when initializing TFBartForConditionalGeneration.

All the weights of TFBartForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.



#### Hide

In [19]:
# df_train["index"] = df_train.index

In [20]:
# samp_li = df_train.iloc[0]["document"]
# samp_li

In [21]:
# samp_li = samp_li.replace("\'", "").strip('][').replace(',', '')
# print(type(samp_li))
# # samp_li = samp_li.split(', ')
# max_len = int(len(samp_li)/2)
# print(max_len)
# samp_li

In [22]:
# samp_art = " ".join(samp_li)
# samp_art

In [23]:
# samp_sum = df_train.iloc[0]["summary"].replace("\'", "").strip('][').split(', ')
# samp_sum = " ".join(samp_sum)
# samp_sum

#### Don't Hide

In [24]:
print(df_train.iloc[0]["document"])

['australia', 'current', 'account', 'deficit', 'shrunk', 'record', 'billion', 'dollar', 'lrb', 'billion', 'u', 'rrb', 'june', 'quarter', 'due', 'soaring', 'commodity', 'price', 'figure', 'released', 'monday', 'showed']


In [25]:
print(df_train.iloc[0]["docString"])

australia current account deficit shrunk record billion dollar lrb billion u rrb june quarter due soaring commodity price figure released monday showed


In [14]:
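# work on an independent copy of the first 1000 rows so that adding columns later does not write to a view of df_train
dft = df_train.head(1000).copy()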
dft

Unnamed: 0,document,summary,docString,sumString
0,"['australia', 'current', 'account', 'deficit',...","['australian', 'current', 'account', 'deficit'...",australia current account deficit shrunk recor...,australian current account deficit narrow sharply
1,"['least', 'two', 'people', 'killed', 'suspecte...","['least', 'two', 'dead', 'southern', 'philippi...",least two people killed suspected bomb attack ...,least two dead southern philippine blast
2,"['australian', 'share', 'closed', 'percent', '...","['australian', 'stock', 'close', 'percent']",australian share closed percent monday followi...,australian stock close percent
3,"['south', 'korea', 'nuclear', 'envoy', 'kim', ...","['envoy', 'urge', 'north', 'korea', 'restart',...",south korea nuclear envoy kim sook urged north...,envoy urge north korea restart nuclear disable...
4,"['south', 'korea', 'monday', 'announced', 'swe...","['skorea', 'announces', 'tax', 'cut', 'stimula...",south korea monday announced sweeping tax refo...,skorea announces tax cut stimulate economy
...,...,...,...,...
995,"['majestic', 'citadel', 'atop', 'syria', 'anci...","['aga', 'khan', 'pours', 'wealth', 'islamic', ...",majestic citadel atop syria ancient city alepp...,aga khan pours wealth islamic site syria
996,"['eu', 'institutional', 'crisis', 'sparked', '...","['eu', 'losing', 'hope', 'swift', 'solution', ...",eu institutional crisis sparked irish voter re...,eu losing hope swift solution treaty crisis
997,"['business', 'brisk', 'rosary', 'palace', 'unk...","['lourdes', 'supermarket', 'soul', 'want', 'ke...",business brisk rosary palace unk pilgrim fill ...,lourdes supermarket soul want keep pilgrim coming
998,"['people', 'expected', 'attend', 'openair', 'm...","['pope', 'make', 'pilgrimage', 'lourdes', 'shr...",people expected attend openair mass given pope...,pope make pilgrimage lourdes shrine


In [42]:
def runBart(df):
    # summarize each row's docString with BART and record how long each row takes
    predictions = []
    times = []

    for i in range(len(df)):
        start = time.perf_counter()
        doc = df.iloc[i]["docString"]
        # cap the summary length at the number of words in the source document
        maxLen = len(doc.split(" "))
        summary = summarizer(doc, max_length=maxLen, min_length=1, do_sample=False)[0]["summary_text"]
        predictions.append(summary)
        times.append(time.perf_counter() - start)
        # report the running average every 25 rows
        if i % 25 == 0:
            avg_time = sum(times) / len(times)
            print("average time per row at", i, "row: ", avg_time)

    # store the generated summaries alongside the inputs
    df["BART_Pred"] = predictions
    return df

In [43]:
dft = runBart(dft)
dft

average time per row at 0 row:  6.299133299999994
average time per row at 25 row:  5.195839138461556
average time per row at 50 row:  5.222647949019583
average time per row at 75 row:  5.3129766157894665
average time per row at 100 row:  5.313197743564345
average time per row at 125 row:  5.360058168253967
average time per row at 150 row:  5.371951892715232
average time per row at 175 row:  5.403914425000005
average time per row at 200 row:  5.445149328855723
average time per row at 225 row:  5.509214126106189


KeyboardInterrupt: 
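
A rough back-of-the-envelope estimate of the total runtime, sketched under stated assumptions: the per-row average is rounded to 5.5 s from the timings printed above, and the training set is taken to have at least 3,803,956 rows (inferred from the last index shown for `df_train`). `hours_needed` is a hypothetical helper.

```python
SECONDS_PER_ROW = 5.5     # assumption: rounded from the running averages printed above
SAMPLE_ROWS = 1000        # size of the dft sample
TRAIN_ROWS = 3_803_956    # assumption: lower bound taken from the last df_train index shown


def hours_needed(rows, seconds_per_row=SECONDS_PER_ROW):
    """Estimated wall-clock hours to summarize `rows` documents one at a time."""
    return rows * seconds_per_row / 3600


print("sample (hours):", round(hours_needed(SAMPLE_ROWS), 2))
print("full training set (days):", round(hours_needed(TRAIN_ROWS) / 24))
```

At this rate the 1000-row sample takes roughly an hour and a half, while the full training split would take months, which is why a small sample is used for the baseline.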

In [29]:
# predictions = []
# for i in range(len(df_train)):
#     doc = df_train.iloc[i]["docString"]
#     # print(doc)
#     # print(len(doc.split(" ")))
#     maxlen = int(len(doc.split(" ")))
#     # print(maxlen)
#     # pred = summarizer(doc, max_length=maxlen, min_length=4, do_sample=False)[0]["summary_text"]
#     # print(pred)
#     predictions.append(summarizer(doc, max_length=maxlen, min_length=1, do_sample=False)[0]["summary_text"])
    
# df_train["Prediction"] = predictions
# df_train
# predictions

In [30]:
# res = summarizer(samp_art, max_length=max_len, min_length=4, do_sample=False)
# res = res[0]["summary_text"]
# res

### BERTScore Metrics

In [31]:
# ! pip install evaluate
# ! pip install bert_score

In [32]:
from evaluate import load
bertscore = load("bertscore")

In [33]:
# pass plain Python lists of strings to the metric
predictions = dft["BART_Pred"].tolist()
references = dft["sumString"].tolist()
results = bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")
# print(results)

In [34]:
# keys = list(results.keys())
# keys

In [35]:
# average each per-sample score list (the "hashcode" entry is not a score, so it is skipped)
for key in ("precision", "recall", "f1"):
    scores = results[key]
    avg = sum(scores) / len(scores)
    print("Average {} is {}".format(key, avg))

Average precision is 0.7917156093120575
Average recall is 0.8347355498075485
Average f1 is 0.8123201101422309


### ROUGE Metrics

In [None]:
# ! pip install rouge-score

In [None]:
import evaluate
rouge = evaluate.load('rouge')

In [None]:
# use the first row of the sample as a test case
predictions = dft.iloc[0]["BART_Pred"]
references = dft.iloc[0]["sumString"]
results = rouge.compute(predictions=[predictions], references=[references])
print(results)

### METEOR Metrics

In [None]:
from nltk.translate import meteor

In [None]:
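# use the first row of the sample as a test case; METEOR expects pre-tokenized input
ss = dft.iloc[0]["sumString"].split(" ")
r = dft.iloc[0]["BART_Pred"].split(" ")
print(ss)
print(r)

In [None]:
# the reference list holds only the human summary; the model output is the hypothesis
result = round(meteor([ss], r), 4)
result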

### BLEU Metrics

In [None]:
from datasets import load_metric

In [None]:
bleu = load_metric("bleu")

In [None]:
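# use the first row of the sample as a test case
predictions = dft.iloc[0]["BART_Pred"].split(" ")
references = dft.iloc[0]["sumString"].split(" ")
# predictions: one token list per sample; references: a list of reference token lists per sample
results = bleu.compute(predictions=[predictions], references=[[references]])
print(results)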

# Performance Metrics - BERTScore and ROUGE

https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
https://towardsdatascience.com/bertscore-evaluating-text-generation-with-bert-beb7b3431300


We chose BERTScore and ROUGE as our two metrics for evaluating model performance. ROUGE is purpose-built for evaluating text summarization, which is the task our model performs. We will generate scores for ROUGE-N (as ROUGE-1 and ROUGE-2), ROUGE-L, and ROUGE-S for the most comprehensive evaluation possible. ROUGE-1 and ROUGE-2 count the unigrams and bigrams, respectively, shared between the model output and the reference ("correct") summary. ROUGE-L measures the longest common subsequence of words shared between the two. ROUGE-S counts shared skip-bigrams: pairs of words that appear in the same order in both summaries but may be separated by one or more other words, so it can credit word pairs that are correct in the model's output even when they are not adjacent. Each of these overlap counts can be reported as precision, recall, and F1 when comparing the model's summaries with the original human-written ones.
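
To make the n-gram overlap idea concrete, here is a minimal hand-rolled sketch of ROUGE-N. It assumes plain whitespace tokenization and no stemming, so its numbers may differ from those of a full ROUGE implementation. The function name `rouge_n` and the example prediction are illustrative; the reference is the first training summary shown earlier.

```python
from collections import Counter


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two strings."""

    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    # Counter intersection clips each n-gram count to the smaller of the two sides.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "australian current account deficit narrow sharply"
prediction = "australian current account deficit shrunk"  # hypothetical model output

print("ROUGE-1:", rouge_n(prediction, reference, n=1))
print("ROUGE-2:", rouge_n(prediction, reference, n=2))
```

Precision divides the overlap by the number of n-grams in the prediction, and recall divides it by the number in the reference, so a short prediction can score high precision with low recall.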

We chose BERTScore as our second metric because it is also designed to evaluate how a model's text output compares with a reference output. We pair it with ROUGE because BERTScore, unlike ROUGE or BLEU, compares the two outputs semantically rather than purely lexically. Instead of counting how many exact words match between the reference and model outputs, BERTScore compares contextual embeddings of the words in each output, so it takes their meaning into account. This can produce an evaluation that is more in line with human intuition.
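
The greedy-matching idea behind BERTScore can be sketched with toy vectors. This is a simplified illustration only: the 2-d "embeddings" below are made up, whereas the real metric uses contextual embeddings from a BERT-style model and optionally applies IDF weighting and baseline rescaling. `greedy_bertscore` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_bertscore(cand_vecs, ref_vecs):
    """Match each token to its most similar counterpart and average the similarities."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up vectors: "shrunk" and "narrow" point in nearly the same direction.
toy = {"deficit": (0.0, 1.0), "shrunk": (1.0, 0.1), "narrow": (0.9, 0.2)}
candidate = [toy["deficit"], toy["shrunk"]]
reference = [toy["deficit"], toy["narrow"]]

print(greedy_bertscore(candidate, reference))
```

An exact-match unigram metric would give this pair a precision of 0.5, since only "deficit" is shared. The embedding-based score is close to 1 because "shrunk" and "narrow" are treated as near-synonyms in the toy space.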

https://towardsdatascience.com/teaching-bart-to-rap-fine-tuning-hugging-faces-bart-model-41749d38f3ef

https://github.com/facebookresearch/fairseq/tree/main/examples/bart

Building our own Model if needed:
https://github.com/aravindpai/How-to-build-own-text-summarizer-using-deep-learning/blob/master/How_to_build_own_text_summarizer_using_deep_learning.ipynb

Fine-Tuning Pre-Trained Models from Huggingface: https://huggingface.co/docs/transformers/training

# Why BART