# Group Project - Model Baseline
### DSBA 6165
### Divam Arora, Connor Moore, Hemanth Velan

### Sources:
* https://huggingface.co/datasets/gigaword
* https://huggingface.co/docs/datasets/process#export
* https://huggingface.co/docs/datasets/v1.11.0/splits.html
* https://www.geeksforgeeks.org/string-punctuation-in-python/
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
* https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
* https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
* https://towardsdatascience.com/bertscore-evaluating-text-generation-with-bert-beb7b3431300
* https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
* https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
* https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
* https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
* https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01

First, we re-run the code from our EDA/pre-processing notebook that loads and prepares our dataset for modeling.

In [1]:
# import needed packages
import nltk
import string
import pandas as pd
import datasets as ds
from nltk.corpus import stopwords
from transformers import pipeline
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download the NLTK resources used below: punkt tokenizer models, WordNet, and stop words
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Because the dataset is sourced directly from the Hugging Face `datasets` library, no local copy of the data is needed. Running this notebook downloads the dataset automatically and makes each split available as a variable.

In [2]:
# https://huggingface.co/datasets/gigaword
# https://huggingface.co/docs/datasets/v1.11.0/splits.html

# download gigaword dataset from Hugging Face dataset library
train, test, validation = ds.load_dataset("gigaword", split=["train", "test", "validation"])

In [3]:
# display the dataset splits
print(train)
print(test)
print(validation)

In [4]:
# https://huggingface.co/docs/datasets/process#export

# export the training dataset to a pandas dataframe
df_train = train.to_pandas()
print("Train df exported.")

# export the test dataset to a pandas dataframe
df_test = test.to_pandas()
print("Test df exported.")

# export the validation dataset to a pandas dataframe
df_val = validation.to_pandas()
print("Validation df exported.")

In [5]:
# the methods required to perform this function were found in this article -
# https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
# the function and comments are our original work

# set all words in all rows to lower case

def lower(df):
    # vectorize strings in each row in summary column and set to lower case
    df["summary"] = df["summary"].str.lower()
    print("summary column lowercased")
    # vectorize strings in each row in document column and set to lower case
    df["document"] = df["document"].str.lower()
    print("document column lowercased")

In [6]:
# geeks for geeks and pandas doc pages were used as template source code and informed about parameter options
# stackoverflow posts helped with debugging issues
# https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# https://www.geeksforgeeks.org/string-punctuation-in-python/
# https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
# comments and function are our original work, source code was modified to fit our workspace

# remove all symbols and punctuation

# create instance of all punctuation symbols
punctuation = string.punctuation

# since we learned there are lots of apostrophe s in the dataset during EDA, we will add this to our remove list
punct_list = ["'s"]

# add all punctuation from the premade variable to our new list
for symbol in punctuation:
    punct_list.append(symbol)

# display the symbols included in our list
print(punct_list)

def remove_punctuation(df):
    # for each symbol in our punctuation list
    for symbol in punct_list:
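        # iterate through the dataframe and replace every instance of the symbol with an empty string
        # regex=False treats the symbol as literal text; read as a regex, "." would match every
        # character and symbols such as "(" or "[" would raise a pattern error
        df["document"] = df["document"].str.replace(symbol, "", regex=False)
        df["summary"] = df["summary"].str.replace(symbol, "", regex=False)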
    print("symbols removed")
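
The loop above makes one pass over each column for every symbol in the list. A minimal sketch of a single-pass alternative on a plain Python string, assuming the same goal (drop `'s`, then drop every character in `string.punctuation`). The helper name `strip_punctuation` and the sample sentence are hypothetical and not part of our pipeline:

```python
import string

# translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    # remove the possessive 's first, while the apostrophe is still present
    text = text.replace("'s", "")
    # then delete all remaining punctuation characters in a single pass
    return text.translate(_PUNCT_TABLE)

# hypothetical sample sentence, not taken from the dataset
sample = "australia's deficit (a record) shrank by 2.5%."
print(strip_punctuation(sample))
```

With pandas, a function like this could be applied to a column with `df["document"].apply(strip_punctuation)`.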

In [7]:
# source code and ideas for this process were gathered from the following geeks for geeks page and article -
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
# comments and functions are original work, source code was modified to fit our workspace

# tokenization and removal of stopwords

# create an instance of all stopwords
stop_words = set(stopwords.words('english'))

# function for removing stopwords from a given input
def remove_stopwords(text):
    # tokenize the input string
    tokens = word_tokenize(text)
    # create an empty list for the new output
    filtered_tokens = []
    # for each word in the tokenized text
    for word in tokens:
        # if the word is not a stop word
        if word not in stop_words:
            # add the token to the new output list
            filtered_tokens.append(word)

    return filtered_tokens

# function to apply the stopword removal/tokenization function to input dataframes
def tokenize_nostop(df):
    # iterate through the dataframe and tokenize/remove stop words for each row
    df["document"] = df["document"].apply(remove_stopwords)
    print("stopwords removed from document column")
    
    df["summary"] = df["summary"].apply(remove_stopwords)
    print("stopwords removed from summary column")

In [8]:
# inspiration and source code for NLTK's word net lemmatizer came from the following article -
# https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
# functions and comments are our original work, source code was modified to fit our workspace

# lemmatization

# create an instance of NLTK's word net lemmatizer class
wml = WordNetLemmatizer()

# function to lemmatize a given tokenized text input
def lemmatization(text):
    # create an empty list for new output
    lemma_words = []

    # for each word in the given input
    for word in text:
        # lemmatize the word (WordNetLemmatizer treats it as a noun by default)
        token = wml.lemmatize(word)
        # and add it into our new output list
        lemma_words.append(token)
    
    return lemma_words

# function to call lemmatization function on the rows of an input dataframe
def lemmatize(df):
    # iterate through the rows of the input dataframe and apply the lemmatization function to each row
    df["document"] = df["document"].apply(lemmatization)
    print("document column lemmatized")

    df["summary"] = df["summary"].apply(lemmatization)
    print("summary column lemmatized")

In [9]:
# create data pre-processing pipeline

def pre_proc(df):
    # lowercase
    lower(df)
    # remove punctuation and symbols
    remove_punctuation(df)
    # tokenize and remove stopwords
    tokenize_nostop(df)
    # lemmatize
    lemmatize(df)
    print("pre-processed successfully")

In [10]:
# call the data pre-processing pipeline for each of the dataset splits

pre_proc(df_train)
print("train df completed")
pre_proc(df_test)
print("test df completed")
pre_proc(df_val)
print("validation df completed")

# display new format of data using training set
df_train.head(10)

Our dataset splits are now pre-processed and ready for use with models. The next cell reloads the splits from saved CSV files.

In [2]:
# import dataframes from saved csv files

df_train = pd.read_csv("train.csv")
print("train set imported")

df_test = pd.read_csv("test.csv")
print("test set imported")

df_val = pd.read_csv("val.csv")
print("validation set imported")

train set imported
test set imported
validation set imported


In [7]:
df_train

Unnamed: 0,document,summary
0,"['australia', 'current', 'account', 'deficit',...","['australian', 'current', 'account', 'deficit'..."
1,"['least', 'two', 'people', 'killed', 'suspecte...","['least', 'two', 'dead', 'southern', 'philippi..."
2,"['australian', 'share', 'closed', 'percent', '...","['australian', 'stock', 'close', 'percent']"
3,"['south', 'korea', 'nuclear', 'envoy', 'kim', ...","['envoy', 'urge', 'north', 'korea', 'restart',..."
4,"['south', 'korea', 'monday', 'announced', 'swe...","['skorea', 'announces', 'tax', 'cut', 'stimula..."
...,...,...
3803952,"['state', 'duma', 'lower', 'house', 'russian',...","['duma', 'urge', 'yeltsin', 'reconsider', 'tro..."
3803953,"['u', 'justice', 'department', 'today', 'rejec...","['u', 'justice', 'department', 'reject', 'spec..."
3803954,"['united', 'nation', 'calling', 'million', 'do...","['un', 'seek', 'fund', 'program', 'former', 'y..."
3803955,"['president', 'jacques', 'chirac', 'today', 'p...","['chirac', 'get', 'birthday', 'gift', 'th', 'c..."


# Performance Metrics - BERTScore and ROUGE

* https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
* https://towardsdatascience.com/bertscore-evaluating-text-generation-with-bert-beb7b3431300


We chose BERTScore and ROUGE as our two metrics for evaluating the performance of our model. We chose ROUGE because it is purpose-built for evaluating text summarization, which is the task our model will perform. We will generate scores for ROUGE-N (as ROUGE-1 and ROUGE-2), ROUGE-L, and ROUGE-S for the most comprehensive model evaluation possible. ROUGE-1 and ROUGE-2 count the unigrams and bigrams, respectively, shared between the model's output and the reference ("correct") output. ROUGE-L measures the longest common subsequence of words shared between the model's output and the reference; the words must appear in the same order but need not be adjacent. ROUGE-S counts shared skip-bigrams, which are pairs of words that appear in the same order in both texts but may be separated by other words. This lets it credit word pairs the model got right even when extra words fall between them. Each of these metrics yields precision, recall, and F1 scores when comparing the model's summaries with the original human-written ones.
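
To make the definitions above concrete, here is a minimal from-scratch sketch of the four scores on pre-tokenized text. It is an illustration under simplifying assumptions (one reference, sentence-level ROUGE-L, skip-bigrams with unlimited gap), not the implementation we will use for the final evaluation. All function names are our own, and the candidate summary is hypothetical.

```python
from collections import Counter
from itertools import combinations

def ngrams(tokens, n):
    # all contiguous n-grams, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens):
    # every in-order pair of tokens, with any gap allowed between them
    return list(combinations(tokens, 2))

def prf(overlap, cand_total, ref_total):
    # precision, recall, and F1 from an overlap count
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def overlap_score(cand_units, ref_units):
    # clipped overlap: a unit counts at most as often as it occurs in both texts
    cand, ref = Counter(cand_units), Counter(ref_units)
    overlap = sum((cand & ref).values())
    return prf(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    # dynamic-programming length of the longest common subsequence
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge(candidate, reference):
    # each entry is a (precision, recall, F1) tuple
    return {
        "rouge1": overlap_score(ngrams(candidate, 1), ngrams(reference, 1)),
        "rouge2": overlap_score(ngrams(candidate, 2), ngrams(reference, 2)),
        "rougeL": prf(lcs_length(candidate, reference), len(candidate), len(reference)),
        "rougeS": overlap_score(skip_bigrams(candidate), skip_bigrams(reference)),
    }

# reference taken from the pre-processed summaries; candidate is a made-up model output
reference = ["australian", "stock", "close", "percent"]
candidate = ["australian", "share", "close", "percent"]
for name, (p, r, f) in rouge(candidate, reference).items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this example the single substituted word costs one of four unigrams but two of three bigrams, which shows why ROUGE-2 is stricter than ROUGE-1, while ROUGE-L and ROUGE-S still credit the words that remain in the right order.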

We chose BERTScore as our second metric because it is also designed to evaluate how a model's text output compares with a reference output. We thought it would be interesting to pair BERTScore with our ROUGE evaluations because BERTScore, unlike ROUGE or BLEU, focuses on a semantic comparison of the model's output and the reference, rather than a purely lexical one. Rather than scoring only how many exact words match between the two texts, BERTScore compares contextual BERT embeddings of the tokens, so words with similar meanings can still count as matches. This can make for an analysis that is more in line with human intuition.
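
The matching idea behind BERTScore can be sketched without a model: each token is paired with its most similar token in the other text by cosine similarity, and the similarities are averaged into precision and recall. The sketch below uses hand-made two-dimensional vectors as stand-ins for embeddings, which is purely an assumption for illustration; real BERTScore uses contextual BERT vectors. The names `cosine`, `greedy_match_score`, and `toy_embeddings` are hypothetical.

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_score(cand_vecs, ref_vecs):
    # assumes both lists are non-empty
    # precision: match each candidate token to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # recall: match each reference token to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# hypothetical toy "embeddings", chosen by hand so that "share" sits close to "stock"
toy_embeddings = {
    "stock": (1.0, 0.0),
    "share": (0.8, 0.6),
    "close": (0.0, 1.0),
}

reference = ["stock", "close"]
candidate = ["share", "close"]

p, r, f = greedy_match_score(
    [toy_embeddings[t] for t in candidate],
    [toy_embeddings[t] for t in reference],
)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

An exact-match unigram comparison would credit only one of the two candidate words here, whereas the similarity-based match gives partial credit to "share" for being close to "stock".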

# Baseline Model

Explanation of the model we chose and why it's appropriate for the task.

In [39]:
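# the csv files store each token list as a string, e.g. "['australia', 'current', ...]"
# use this to strip the "string" lists down to bare strings

# take the first document in the training set as a test case
practice = df_train.iloc[0]["document"]

# drop the surrounding square brackets, then remove the quote characters
no_quote = practice[1:-1].replace("'", "")

# remove the commas that separated the list items
no_comm = no_quote.replace(",", "")

print(no_comm)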

australia current account deficit shrunk record billion dollar lrb billion u rrb june quarter due soaring commodity price figure released monday showed
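
An alternative sketch, assuming each cell holds the string form of a Python list of strings (as in the printed dataframe above): `ast.literal_eval` from the standard library parses the string back into a real list, which can then be joined into a bare string. The helper name `list_string_to_text` is hypothetical.

```python
import ast

def list_string_to_text(cell):
    # safely parse the string "['a', 'b']" back into the list ['a', 'b']
    tokens = ast.literal_eval(cell)
    # join the tokens into one space-separated string
    return " ".join(tokens)

# example cell in the same format as the saved csv columns
example = "['australian', 'stock', 'close', 'percent']"
print(list_string_to_text(example))
```

Parsing first also keeps the token list available for steps that need individual tokens, and the function could be applied to a whole column with `df_train["document"].apply(list_string_to_text)`.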
