# Group Project - Advanced Model

Divam Arora, Connor Moore, Hemanth Velan

DSBA 6165

### Sources:
* https://huggingface.co/datasets/gigaword
* https://huggingface.co/docs/datasets/v1.11.0/splits.html
* https://huggingface.co/docs/datasets/process#export
* https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
* https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
* https://www.geeksforgeeks.org/string-punctuation-in-python/
* https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
* https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
* https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/

First, we re-run the code from our EDA/pre-processing notebook that loads the dataset and prepares it for modeling.

In [78]:
# import needed packages
import nltk
import random
import keras
import time
import torch
import string
import evaluate
import pandas as pd
import datasets as ds
from evaluate import load
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from transformers import BartForConditionalGeneration, BartTokenizer

# download the NLTK resources used below: stop words, the punkt tokenizer, and WordNet
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cmoor197\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Dataset
Because our dataset is pulled directly from Hugging Face's datasets library, no local copy of the data is needed. Running the cell below downloads the specified dataset and loads it into your workspace environment.

In [8]:
# https://huggingface.co/datasets/gigaword
# https://huggingface.co/docs/datasets/v1.11.0/splits.html

# download gigaword dataset from Hugging Face dataset library
train, test, validation = ds.load_dataset("gigaword", split=["train", "test", "validation"])

In [9]:
# display the dataset splits
print(train)
print(test)
print(validation)

Dataset({
    features: ['document', 'summary'],
    num_rows: 3803957
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 1951
})
Dataset({
    features: ['document', 'summary'],
    num_rows: 189651
})


In [10]:
# https://huggingface.co/docs/datasets/process#export

# export the training dataset to a pandas dataframe and display
df_train = train.to_pandas()
print("Train df exported.")

# export the test dataset to a pandas dataframe
df_test = test.to_pandas()
print("Test df exported.")

# export the validation dataset to a pandas dataframe
df_val = validation.to_pandas()
print("Validation df exported.")

Train df exported.
Test df exported.
Validation df exported.


### Balancing the train-test split
The provided division between train, test, and validation is heavily skewed towards train (about 95% of all rows), and the dataset overall is far too large to run through our model in a reasonable timespan. We decided to shrink the train set to 70,000 entries and to concatenate the provided test and validation sets. From that combined test-val set we extract a 25,000-entry test set and a 5,000-entry validation set.
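
As a quick check of the proportions mentioned above, the split shares can be recomputed from the row counts shown in the dataset printout. This is only an illustrative sketch; the variable names are made up for this example and are not part of our pipeline.

```python
# row counts as reported when the three provided splits are printed
provided = {"train": 3803957, "test": 1951, "validation": 189651}
total = sum(provided.values())
provided_share = {name: rows / total for name, rows in provided.items()}

# reduced split sizes chosen for this project
reduced = {"train": 70000, "test": 25000, "validation": 5000}
reduced_total = sum(reduced.values())
reduced_share = {name: rows / reduced_total for name, rows in reduced.items()}

# print both sets of shares side by side
for name in provided:
    print(f"{name}: provided {provided_share[name]:.1%}, reduced {reduced_share[name]:.1%}")
```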

In [11]:
# select 70,000 rows randomly from the train dataframe

df_train_short = df_train.sample(n = 70000, random_state=2, ignore_index=True)

df_train_short

| index | document | summary |
| --- | --- | --- |
| 644708 | a british soldier was killed saturday by an ex... | british soldier killed in afghanistan blast |
| 1506983 | ukraine insists on building two new nuclear re... | ukraine insists on linking chernobyl closure t... |
| 3429980 | portuguese president mario soares will pay an ... | portugal 's president to visit angola next month |
| 2028209 | aol stepped up its transformation from interne... | aol introduces new advertising network plans t... |
| 1392922 | marine experts from wwf flew to the northern k... | suspected toxic algae bloom leaves thousands o... |
| ... | ... | ... |
| 3124694 | hong kong 's benchmark hang seng index ended h... | hong kong stocks edged up after four straight ... |
| 1237703 | former brazil coach carlos alberto parreira sa... | parreira says he 's close to an agreement to c... |
| 671101 | around ## youths on thursday protested outside... | latvian youths protest ban of UNK symbols |
| 1601285 | ohio 's method of putting prisoners to death i... | ohio judge says state s lethal injection proce... |


In [12]:
# combine provided test and val sets and reseparate randomly into smaller subsets

# concat test and validation sets
test_val = [df_test, df_val]
df_testval_bulk = pd.concat(test_val)

# take a random sample of 30000 rows from the test and validation bulk set
df_testval_short = df_testval_bulk.sample(n = 30000, random_state=3, ignore_index=True)

# take a random 5000 row sample from the test-val subset
# (keep the original index labels so the sampled rows can be dropped below)
df_val_short = df_testval_short.sample(n = 5000, random_state=4)

# drop all rows taken for the validation sample from the test-val subset to create the test set
df_test_short = df_testval_short.drop(df_val_short.index, axis=0).reset_index(drop=True)
df_val_short = df_val_short.reset_index(drop=True)
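
The sample-then-drop pattern in this cell works through index labels: `drop` removes the rows whose labels match those of the validation sample, so the two subsets are disjoint only if the sample still carries the labels of the frame it was drawn from. The toy example below is a minimal sketch of that behaviour on a hypothetical ten-row frame; `toy`, `val_part`, and `test_part` are illustrative names, not part of our pipeline.

```python
import pandas as pd

# hypothetical ten-row frame standing in for the combined test-val set
toy = pd.DataFrame({"document": [f"doc {i}" for i in range(10)]})

# sample without ignore_index so the sampled rows keep their original labels
val_part = toy.sample(n=3, random_state=0)

# dropping by those labels leaves exactly the rows that were not sampled
test_part = toy.drop(val_part.index)

# the two subsets share no rows and together cover the original frame
overlap = set(val_part["document"]) & set(test_part["document"])
print(len(val_part), len(test_part), len(overlap))
```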

In [13]:
# the methods required to perform this function were found in this article -
# https://aparnamishra144.medium.com/how-to-change-string-data-or-text-data-of-a-column-to-lowercase-in-pandas-248a8ce4ae01
# the function and comments are our original work

# set all words in all rows to lower case

def lower(df):
    # lowercase every string in the summary column (vectorized through the .str accessor)
    df["summary"] = df["summary"].str.lower()
    print("summary column lowercased")
    # lowercase every string in the document column
    df["document"] = df["document"].str.lower()
    print("document column lowercased")

In [14]:
# geeks for geeks and pandas doc pages were used as template source code and informed about parameter options
# stackoverflow posts helped with debugging issues
# https://stackoverflow.com/questions/42135409/removing-a-character-from-entire-data-frame
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
# https://www.geeksforgeeks.org/string-punctuation-in-python/
# https://stackoverflow.com/questions/41425945/python-pandas-error-missing-unterminated-subpattern-at-position-2
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe
# comments and function are our original work, source code was modified to fit our workspace

# remove all symbols and punctuation

# create instance of all punctuation symbols
punctuation = string.punctuation

# during EDA we found many possessive 's tokens in the dataset, so 's goes into the removal list
# first, which means it is stripped before the bare apostrophe is
punct_list = ["'s"]

# add all punctuation from the premade variable to our new list
for symbol in punctuation:
    punct_list.append(symbol)

# display the symbols included in our list
print(punct_list)

def remove_punctuation(df):
    # for each symbol in our punctuation list
    for symbol in punct_list:
        # iterate through the dataframe and replace every instance of the symbol with an empty string
        df["document"] = df["document"].str.replace(symbol, "", regex=False)
        df["summary"] = df["summary"].str.replace(symbol, "", regex=False)
    print("symbols removed")

["'s", '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
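
For comparison, the same cleanup can be applied to a single string in one pass with `str.translate`. This is an alternative sketch, not the method used in `remove_punctuation` above; the helper name `strip_punctuation` is hypothetical, and the final whitespace collapse is an extra step that our function does not perform.

```python
import string

# translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text: str) -> str:
    # remove the possessive marker first, while its apostrophe is still present
    text = text.replace("'s", "")
    # delete all remaining punctuation characters in one pass
    text = text.translate(PUNCT_TABLE)
    # collapse the double spaces left behind by removed tokens (extra step)
    return " ".join(text.split())

sample = "portugal 's president to visit angola next month"
cleaned = strip_punctuation(sample)
print(cleaned)
```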


In [15]:
# source code and ideas for this process were gathered from the following geeks for geeks page and article -
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
# comments and functions are original work, source code was modified to fit our workspace

# tokenization and removal of stopwords

# create an instance of all stopwords
stop_words = set(stopwords.words('english'))

# function for removing stopwords from a given input
def remove_stopwords(text):
    # tokenize the input string
    tokens = word_tokenize(text)
    # create an empty list for the new output
    filtered_tokens = []
    # for each word in the tokenized text
    for word in tokens:
        # if the word is not a stop word
        if word not in stop_words:
            # add the token to the new output list
            filtered_tokens.append(word)

    return filtered_tokens

# function to apply the stopword removal/tokenization function to input dataframes
def tokenize_nostop(df):
    # iterate through the dataframe and tokenize/remove stop words for each row
    df["document"] = df["document"].apply(remove_stopwords)
    print("stopwords removed from document column")
    
    df["summary"] = df["summary"].apply(remove_stopwords)
    print("stopwords removed from summary column")
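
The filtering step in `remove_stopwords` comes down to a membership test against a set. The sketch below shows the same idea without NLTK, using a tiny hand-picked stop word set and whitespace splitting as stand-ins for `stopwords.words('english')` and `word_tokenize`. Both stand-ins are assumptions made to keep the example self-contained, and `filter_stop_words` is a hypothetical name.

```python
# tiny illustrative stop word set (a stand-in for NLTK's English list)
toy_stop_words = {"a", "was", "by", "an", "in", "to", "the"}

def filter_stop_words(text: str) -> list:
    # split on whitespace (a stand-in for word_tokenize) and keep non-stop words
    return [word for word in text.split() if word not in toy_stop_words]

tokens = filter_stop_words("a british soldier was killed saturday by an explosion")
print(tokens)
```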

In [18]:
# inspiration and source code for NLTK's word net lemmatizer came from the following article -
# https://www.analyticsvidhya.com/blog/2021/06/pre-processing-of-text-data-in-nlp/
# functions and comments are our original work, source code was modified to fit our workspace

# lemmatization

# create an instance of NLTK's word net lemmatizer class
wml = WordNetLemmatizer()

# function to lemmatize a given tokenized text input
def lemmatization(text):
    # create an empty list for new output
    lemma_words = []

    # for each word in the given input
    for word in text:
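        # lemmatize the word; with no pos argument WordNetLemmatizer treats it as a noun,
        # so plural nouns are reduced but verb forms such as "killed" are left unchanged
        token = wml.lemmatize(word)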
        # and add it into our new output list
        lemma_words.append(token)
    
    return lemma_words

# function to call lemmatization function on the rows of an input dataframe
def lemmatize(df):
    # iterate through the rows of the input dataframe and apply the lemmatization function to each row
    df["document"] = df["document"].apply(lemmatization)
    print("document column lemmatized")

    df["summary"] = df["summary"].apply(lemmatization)
    print("summary column lemmatized")

In [19]:
# create data pre-processing pipeline

def pre_proc(df):
    # lowercase
    lower(df)
    # remove punctuation and symbols
    remove_punctuation(df)
    # tokenize and remove stopwords
    tokenize_nostop(df)
    # lemmatize
    lemmatize(df)
    print("pre-processed successfully")

In [20]:
# call the data pre-processing pipeline for each of the dataset splits

pre_proc(df_train_short)
print("train df completed")
pre_proc(df_test_short)
print("test df completed")
pre_proc(df_val_short)
print("validation df completed")

# display new format of data using training set
df_train_short.head(10)

summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
train df completed
summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
test df completed
summary column lowercased
document column lowercased
symbols removed
stopwords removed from document column
stopwords removed from summary column
document column lemmatized
summary column lemmatized
pre-processed successfully
validation df completed


| index | document | summary |
| --- | --- | --- |
| 644708 | [british, soldier, killed, saturday, explosion... | [british, soldier, killed, afghanistan, blast] |
| 1506983 | [ukraine, insists, building, two, new, nuclear... | [ukraine, insists, linking, chernobyl, closure... |
| 3429980 | [portuguese, president, mario, soares, pay, of... | [portugal, president, visit, angola, next, month] |
| 2028209 | [aol, stepped, transformation, internet, acces... | [aol, introduces, new, advertising, network, p... |
| 1392922 | [marine, expert, wwf, flew, northern, kenyan, ... | [suspected, toxic, algae, bloom, leaf, thousan... |
| 887130 | [indian, share, price, closed, percent, higher... | [indian, share, close, pct] |
| 1230066 | [gustav, slammed, cuba, tobaccogrowing, wester... | [gustav, slam, cuba, massive, category, hurric... |
| 2933565 | [week, ago, researcher, wisconsin, japan, said... | [unk, stem, cell, venture, land, million] |
| 1592999 | [two, japan, biggest, soccer, star, returned, ... | [soccer, star, return, home, dropped, national... |
| 960389 | [united, state, britain, unleashed, massive, a... | [u, unleashes, aerial, assault, take, port, ai... |


Our dataset splits are now pre-processed and ready for use with models.

### Convert lists to string format for improved model compatibility

In [21]:
# convert list entries into single strings

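def list2string(tokens):
    # join a list of tokens into one space-separated string
    # (parameter renamed from `input` to avoid shadowing the built-in)
    return " ".join(tokens)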

In [22]:
# apply function to create new stringified columns for

# train
df_train_short["docString"] = df_train_short["document"].map(list2string)
df_train_short["sumString"] = df_train_short["summary"].map(list2string)

# test
df_test_short["docString"] = df_test_short["document"].map(list2string)
df_test_short["sumString"] = df_test_short["summary"].map(list2string)

# and val
df_val_short["docString"] = df_val_short["document"].map(list2string)
df_val_short["sumString"] = df_val_short["summary"].map(list2string)

df_train_short.head(5)

| index | document | summary | docString | sumString |
| --- | --- | --- | --- | --- |
| 644708 | [british, soldier, killed, saturday, explosion... | [british, soldier, killed, afghanistan, blast] | british soldier killed saturday explosion sout... | british soldier killed afghanistan blast |
| 1506983 | [ukraine, insists, building, two, new, nuclear... | [ukraine, insists, linking, chernobyl, closure... | ukraine insists building two new nuclear react... | ukraine insists linking chernobyl closure buil... |
| 3429980 | [portuguese, president, mario, soares, pay, of... | [portugal, president, visit, angola, next, month] | portuguese president mario soares pay official... | portugal president visit angola next month |
| 2028209 | [aol, stepped, transformation, internet, acces... | [aol, introduces, new, advertising, network, p... | aol stepped transformation internet access pro... | aol introduces new advertising network plan mo... |
| 1392922 | [marine, expert, wwf, flew, northern, kenyan, ... | [suspected, toxic, algae, bloom, leaf, thousan... | marine expert wwf flew northern kenyan coast t... | suspected toxic algae bloom leaf thousand fish... |


# adv model writeup goes here
(from canvas) "Write about how your advanced model is different from your baseline model. Why did you choose the model architecture ? What evidence from the previous model milestone did you use to drive your decision making? Write at least 100 words."

In [23]:
# code for adv model goes here and in cells below

In [24]:
# adv model construction

In [70]:
# load BART-large tokenizer and model

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

In [None]:
# create mini train dataset for testing
df_train_prac = df_train_short.sample(n=5, random_state=5, ignore_index=True)

df_train_prac

In [75]:
def BART_pred(doc):
    # set max length (in tokens) for the output summary from the input doc length (in characters);
    # a small budget can cut the generated summary off mid-word
    maxLen = int(len(doc) / 10) + 2

    # tokenize input (truncated to the model's maximum input length) and pass to model
    inputs = tokenizer([doc], return_tensors='pt', truncation=True)
    summary_id = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=maxLen,
        min_length=1,
    )

    # decode summary
    summary = tokenizer.decode(summary_id[0], skip_special_tokens=True)

    return summary
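
Note that `len(doc)` counts characters, while `max_length` in `generate` is measured in tokens. The sketch below reproduces only the budget arithmetic for one of our documents; `summary_budget` is a hypothetical helper name used for illustration.

```python
def summary_budget(doc: str) -> int:
    # same heuristic as BART_pred: one output token per ten input characters, plus two
    return int(len(doc) / 10) + 2

doc = "usc coach john robinson problem think quarterback rob johnson unable win big game"

# compare the character count, the word count, and the resulting token budget
print("characters:", len(doc))
print("words:", len(doc.split()))
print("max_length budget:", summary_budget(doc))
```

Since the budget also has to cover the special tokens that BART adds to its output, a document of this size leaves room for only a handful of summary words, which is consistent with summaries that stop partway through a word.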

In [76]:
for i in range(len(df_train_prac)):
    print("original doc:", df_train_prac.iloc[i]["docString"])
    input_doc = df_train_prac.iloc[i]["docString"]
    pred = BART_pred(input_doc)
    print("predicted sum:", pred)
    print("target sum:", df_train_prac.iloc[i]["sumString"])
    print()

original doc: myanmar saw increased annual copper production ton country unk copper mine according local myanmar time monday
predicted sum: myanmar saw increased annual copper production ton country un
target sum: myanmar see higher copper production

original doc: usc coach john robinson problem think quarterback rob johnson unable win big game
predicted sum: usc coach john robinson problem think
target sum: usc robinson say qb johnson come big occasion

original doc: majority leader house fine job majority exists behind bill
predicted sum: majority leader says he
target sum: house majority leader quest consensus

original doc: heavy downpors failed dampen spirit first day london notting hill carnival sign serious street violence marred last year event
predicted sum: heavy downpors failed dampen spirit first day l
target sum: heavy rain little violence london carnival

original doc: henry kaufman president u financial consultant company henry kaufman amp co said dollar could rise high
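
One way to put a number on how close a prediction is to its target is unigram overlap, the idea behind ROUGE-1. The sketch below is a simplified plain-Python version written for illustration. It is not the `evaluate` library's implementation, `unigram_f1` is a hypothetical helper name, and it is not meant to replace the metrics chosen for the project.

```python
from collections import Counter

def unigram_f1(prediction: str, target: str) -> float:
    pred_counts = Counter(prediction.split())
    target_counts = Counter(target.split())
    # clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & target_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(target_counts.values())
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# prediction and target taken from the second example printed above
score = unigram_f1(
    "usc coach john robinson problem think",
    "usc robinson say qb johnson come big occasion",
)
print(round(score, 3))
```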

In [None]:
# training adv model using train data

In [None]:
# using adv model to generate predictions on test set

In [None]:
# generating metrics on adv model test performance

# model performance results writeup goes here
(from canvas) "You have been able to create a training and testing set from your data (or it has already been given to you). We want to see evidence that you were able to train your advanced model and have performance metrics. How does your model perform on the metrics you have chosen from your previous submission?"