## Directions
To run the weak baseline, first create a shortcut from the 'CIS 5300 Project Folder' to 'My Drive'. Then open the 'Runtime' menu and select 'Run all'. When prompted, allow the notebook to mount your Google Drive. Each cell's output appears as it runs, ending with the final ROUGE and BERTScore results.
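Before selecting 'Run all', it can help to confirm that the shortcut resolves to the files the notebook reads. The following is a minimal sketch, assuming the `<split>_source.txt` / `<split>_target.txt` naming used in the loading cells; `missing_files` is a hypothetical helper and not part of the original notebook.

```python
import os

def missing_files(data_dir, splits=("train", "dev", "test")):
    """Return the expected source/target file names that are absent from data_dir."""
    # Assumption: each split is stored as <split>_source.txt and <split>_target.txt.
    expected = [f"{split}_{side}.txt" for split in splits for side in ("source", "target")]
    return [name for name in expected if not os.path.isfile(os.path.join(data_dir, name))]
```

Once Drive is mounted, calling `missing_files(path)` with the notebook's `path` variable should return an empty list; any names it returns point to a shortcut or folder problem.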

In [None]:
%%capture
# Install necessary huggingface packages (the %%capture cell magic must be the first line of the cell)
!pip install transformers datasets
!pip install --upgrade accelerate
!pip install evaluate rouge_score bert_score
!pip install sentencepiece

In [None]:
# Import packages
from transformers import (
    pipeline, Trainer, TrainingArguments, DataCollatorForLanguageModeling,
    AutoTokenizer, AutoModelForSeq2SeqLM, TFAutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, T5ForConditionalGeneration,
    T5Tokenizer, DataCollatorForSeq2Seq,
)
import sentencepiece
from evaluate import load
from datasets import Dataset
import numpy as np
import pandas as pd
import os
import torch
import copy
from collections import defaultdict

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device_id = 0 if str(device) == 'cuda' else -1

## Data Pre-Processing

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# folder = pd.read_csv("/content/drive/Shared With Me/folder_0/folder_1/file.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Load data
path = "/content/drive/My Drive/CIS 5300 Final Project/5300 Project Data/exp_data/data_email_short/"
# train_source = pd.read_csv(path + 'train_source.txt', sep = '\n')
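# Read one thread per line; the context manager closes the file when done
source = []
with open(path + 'train_source.txt', 'r') as s:
  for line in s:
    source.append(line.strip())

# Read the matching reference summaries, one per line
target = []
with open(path + 'train_target.txt', 'r') as t:
  for line in t:
    target.append(line.strip())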

train = pd.DataFrame({"Source" : source, "Target" : target})
train

Unnamed: 0,Source,Target
0,"Subject: ep strategy|||Andy: I'm fried, frazzl...",Andy expresses stress about his current worklo...
1,"Subject: tibco layoffs, wireless cuts|||Steve:...",There is going to be lay offs and they are try...
2,Subject: are the knicks playing tonight?|||Dou...,Doug makes a joke about the Knicks. Fernand sa...
3,"Subject: func test failures|||Ravi: Sharon, I ...",Ravi tells Sharon there are some functional te...
4,"Subject: feedback from market|||Peter: Ron, I ...",The AvocadoIT team discuss problems with mobil...
...,...,...
1795,Subject: agenda for program mgmt meeting tbh 1...,Hideki cannot make the meeting and instead has...
1796,"Subject: cage storage|||Mark: Debbie, I need y...",Mark needs Debbie's help with items inside the...
1797,Subject: hp & exodus emobility initiative|||Ro...,Ron writes to the team regarding the HP & Exod...
1798,Subject: devices that we support today|||Rober...,Robert makes a list for Dan of the devices the...
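The loading pattern above recurs for the test and dev splits, so it could be factored into one function. This is a minimal sketch, not the notebook's own code: `load_split` is a hypothetical helper, UTF-8 encoding is an assumption, and it adds a length check that the original cells do not perform.

```python
import os

def load_split(data_dir, split):
    """Read <split>_source.txt and <split>_target.txt into parallel lists."""
    columns = {}
    for column, side in (("Source", "source"), ("Target", "target")):
        file_path = os.path.join(data_dir, f"{split}_{side}.txt")
        # Assumption: the files are UTF-8 text with one example per line.
        with open(file_path, "r", encoding="utf-8") as f:
            columns[column] = [line.strip() for line in f]
    # Sources and targets are aligned by line number, so the counts must match.
    if len(columns["Source"]) != len(columns["Target"]):
        raise ValueError(f"{split}: source and target line counts differ")
    return columns
```

The returned dictionary has the same keys as the notebook's frames, so `pd.DataFrame(load_split(path, "train"))` would produce the same two-column layout.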


In [None]:
# Load test data
source = []
with open(path + 'test_source.txt', 'r') as s:
  for line in s:
    source.append(line.strip())

target = []
with open(path + 'test_target.txt', 'r') as t:
  for line in t:
    target.append(line.strip())

test = pd.DataFrame({"Source" : source, "Target" : target})
test

Unnamed: 0,Source,Target
0,"Subject: 1:30pm conference call|||Ty: Dan, Can...",Ty asks Dan if he can take a conference call w...
1,Subject: 12/20/00 need flip chart for sausalit...,Jackie needs flip chart for sausalito conf rm ...
2,Subject: environment for respond|||Sujan: Guys...,Sujan reminds everyone they need a respond env...
3,"Subject: ert user guide|||Ruth: Piyush, I know...",Ruth requests Piyush check the status of a gui...
4,Subject: 4.5 patch for japan|||Divakar: Rajeev...,Diva tells Rajeev Japan needs a patch. Rajeev ...
...,...,...
495,Subject: usps demo....tuesday demo!|||Ray: I ...,Ray communicates the demo date for USPS and as...
496,Subject: training for accenture developers in ...,Carlos asks for contact information to arrange...
497,Subject: ui postmortem minutes|||Mark: Here is...,Mark sends a summary of a recent event to the ...
498,Subject: system admin training in ep japan|||R...,There is some confusion about who will teach a...


In [None]:
# Load dev data
source = []
with open(path + 'dev_source.txt', 'r') as s:
  for line in s:
    source.append(line.strip())

target = []
with open(path + 'dev_target.txt', 'r') as t:
  for line in t:
    target.append(line.strip())

dev = pd.DataFrame({"Source" : source, "Target" : target})
dev

Unnamed: 0,Source,Target
0,Subject: bug 2664|||Sharon: Hi Ravi - this is ...,Sharon tells Ravi about an invalid link. Ravi ...
1,Subject: thanks for the hospitality and courte...,Dave thanks the group for their hospitality on...
2,Subject: testservlet update for sp4|||Howard: ...,Howard provides an action plan. Rajdeep descri...
3,Subject: andy at javaone on monday|||Andy: I'l...,Andy is at JavaOne and asks Jamie to review th...
4,Subject: avocadoit/forrester follow-up|||Ryan:...,Agenda for meeting is presented by Amanda and ...
...,...,...
244,Subject: avocadoit booth staff needed at wirel...,Debbie asks for volunteers. Laura volunteers. ...
245,Subject: marketing manager job description|||M...,Marcia writes to the team about the pending ma...
246,Subject: alaska bug|||Kavitha: This bug has be...,Kavitha notes that a bug has been fixed and as...
247,Subject: offline admin guide 4.0 - review|||De...,AvocadoIT completes an offline administration ...
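Each `Source` entry previewed above packs a whole email thread into one string: a `Subject:` field followed by `Sender: message` turns, separated by `|||`. The sketch below shows one way to unpack that layout for inspection. `parse_thread` is a hypothetical helper, and it assumes every turn begins with a `Sender: ` prefix, which the previews suggest but do not guarantee.

```python
def parse_thread(source):
    """Split a '|||'-delimited thread into its subject and (sender, body) turns."""
    parts = source.split("|||")
    subject = parts[0]
    if subject.startswith("Subject: "):
        subject = subject[len("Subject: "):]
    turns = []
    for part in parts[1:]:
        # Assumption: the text before the first ': ' is the sender's name.
        sender, separator, body = part.partition(": ")
        if not separator:
            # No sender prefix found; keep the whole turn as the body.
            sender, body = "", part
        turns.append((sender.strip(), body.strip()))
    return subject.strip(), turns
```

For example, `parse_thread(test['Source'][0])` would return the subject line together with a list of turns that alternate between Ty and Dan.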


In [None]:
test['Target'][0]

'Ty asks Dan if he can take a conference call with Accenture today at 1:30pm. Dan acknowledges he can and is working from home.'

In [None]:
test['Source'][0]

"Subject: 1:30pm conference call|||Ty: Dan, Can you sit in on a call with Accenture today at 1:30pm to talk about Symbol devices, barcoding into an application, and a few other questions regarding an opportunity at Visa? Thanks, Ty|||Dan: Yes but need more tech info on our symbol peripheral capabilities so may need Amitabh.|||Ty: I think you may be able to cover this initial discussion without Amitabh. We'ss dial you in or come over to Amit's office. Ty|||Dan: I'm working from home today, so you can call my mobile of give me a dial-in ph#. Dan\t\t\t\t\t\tUSERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w|||Dan: are you going to call my mobile phone or do you have a dial-in ph# ? Dan\t\t\t\t\t\tUSERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w"

In [None]:
from transformers import AutoTokenizer

checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
prefix = 'summarize: '
def add_summarize_prompt(batch):
  # Prepend the T5 task prefix; the prefixed text is also stored back in 'Source'
  inputs = [prefix + doc for doc in batch["Source"]]
  batch['Source'] = inputs
  # Truncate inputs to 512 tokens and reference summaries to 32 tokens
  model_inputs = tokenizer(inputs, max_length = 512, truncation = True)
  labels = tokenizer(text_target=batch["Target"], max_length = 32, truncation = True)
  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

ds_train = Dataset.from_pandas(train)
tokenized_train = ds_train.map(add_summarize_prompt, batched = True)

ds_dev = Dataset.from_pandas(dev)
tokenized_dev = ds_dev.map(add_summarize_prompt, batched = True)

ds_test = Dataset.from_pandas(test)
tokenized_test = ds_test.map(add_summarize_prompt, batched = True)

Map:   0%|          | 0/1800 [00:00<?, ? examples/s]

Map:   0%|          | 0/249 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
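The tokenization step above truncates inputs at 512 tokens and reference summaries at 32 tokens, so it may be worth checking how many examples those limits actually cut. The following is a minimal sketch; `truncation_rate` and its `count_tokens` argument are hypothetical names introduced for illustration.

```python
def truncation_rate(texts, count_tokens, max_length):
    """Return the fraction of texts whose token count exceeds max_length."""
    if not texts:
        return 0.0
    too_long = sum(1 for text in texts if count_tokens(text) > max_length)
    return too_long / len(texts)
```

With the notebook's tokenizer, a call such as `truncation_rate(list(train["Target"]), lambda t: len(tokenizer(t)["input_ids"]), 32)` would estimate the share of training summaries longer than the label limit.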

In [None]:
tokenized_test['Source'][0]

"summarize: Subject: 1:30pm conference call|||Ty: Dan, Can you sit in on a call with Accenture today at 1:30pm to talk about Symbol devices, barcoding into an application, and a few other questions regarding an opportunity at Visa? Thanks, Ty|||Dan: Yes but need more tech info on our symbol peripheral capabilities so may need Amitabh.|||Ty: I think you may be able to cover this initial discussion without Amitabh. We'ss dial you in or come over to Amit's office. Ty|||Dan: I'm working from home today, so you can call my mobile of give me a dial-in ph#. Dan\t\t\t\t\t\tUSERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w|||Dan: are you going to call my mobile phone or do you have a dial-in ph# ? Dan\t\t\t\t\t\tUSERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w"

## Weak Baseline: Non-Fine-Tuned T5

In [None]:
tokenizer = AutoTokenizer.from_pretrained('t5-base')
nonft_model = TFAutoModelForSeq2SeqLM.from_pretrained('t5-base')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model='t5-base')
nonft_summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [None]:
# Load the model and tokenizer onto the device selected earlier
model = T5ForConditionalGeneration.from_pretrained('t5-base').to(device)
tokenizer = T5Tokenizer.from_pretrained('t5-base')

def batch_summarize(texts, batch_size=8):
    summaries = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]

        # Tokenize the texts and prepare the input tensor
        inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=512, return_tensors='pt').to(device)

        # Generate summaries, passing the attention mask explicitly so padded positions are ignored
        with torch.no_grad():
            summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], num_beams=4, max_length=50, early_stopping=True)

        # Decode and add to the list
        batch_summaries = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
        summaries.extend(batch_summaries)

    return summaries

# Apply batch summarization
test['Predicted'] = batch_summarize(tokenized_test['Source'])

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
print(test['Predicted'][0])
print(test['Target'][0])
print(test['Source'][0])

Dan USERNAME@DOMAIN.COM: i'm working from home today, so you can call my mobile or dial-in ph# .
Ty asks Dan if he can take a conference call with Accenture today at 1:30pm. Dan acknowledges he can and is working from home.
Subject: 1:30pm conference call|||Ty: Dan, Can you sit in on a call with Accenture today at 1:30pm to talk about Symbol devices, barcoding into an application, and a few other questions regarding an opportunity at Visa? Thanks, Ty|||Dan: Yes but need more tech info on our symbol peripheral capabilities so may need Amitabh.|||Ty: I think you may be able to cover this initial discussion without Amitabh. We'ss dial you in or come over to Amit's office. Ty|||Dan: I'm working from home today, so you can call my mobile of give me a dial-in ph#. Dan						USERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w|||Dan: are you going to call my mobile phone or do you have a dial-in ph# ? Dan						USERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMB

In [None]:
print(test['Predicted'][1])
print(test['Source'][1])
print(test['Target'][1])

i asked you to let me know when the meeting was over so i could clean the refrigerator . you have sent me this message 3 times already and iam currently looking for one .
Subject: 12/20/00 need flip chart for sausalito conf rm.|||Jackie: Also, Please clean out the fridge. Thank you. Jackie|||Yolanda: are you going to need flip chart today? and yes, i will clean the refrigerator this morning, remember, i asked you to let me know when the meeting was over the other day, so i could clean the refrigerator.|||Jackie: Thanks Yoli!! Sorry, I couldn't remember if I had sent this one or not. I appreciate your efforts and hard work to help me out!!! :)|||Yolanda: you have sent me this message 3 times already and iam currently looking for one|||Jackie: Sorry!!! Thanks again.
Jackie needs flip chart for sausalito conf rm and fridge cleaned. Yolanda verifies she'll clean the fridge this morning.


In [None]:
print(test['Predicted'][2])
print(test['Source'][2])
print(test['Target'][2])

i need a "standard destop" hardware configuration which we can use as the server . as it will only be used as a server, we don't need fancy monitor (ie 15" will be ok)
Subject: environment for respond|||Sujan: Guys, We need a Respond environment. For that I need to order a "standard destop" hardware configuration which we can use as the server. As it will only be used as a server, we don't need the fancy monitor (ie 15" will be ok). Please order one for my group. Let me know the availability and if you need more specific details. Thanks, Sujan|||Sujan: Guys, We need a Respond environment. For that I need to order a "standard destop" hardware configuration which we can use as the server. As it will only be used as a server, we don't need the fancy monitor (ie 15" will be ok). Please order one for my group. Let me know the availability and if you need more specific details. Thanks, Sujan|||Marek: Sujan, I'm not sure if you already took care of that, but the external IP is very important 
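The predictions printed above carry a space before sentence punctuation (for example `refrigerator .`), which follows from decoding with `clean_up_tokenization_spaces=False`. If tidier text is wanted for display, a small post-processing step is one option. This is a sketch only; `tidy_spacing` is a hypothetical helper, and the notebook's reported scores were computed on the raw predictions.

```python
import re

def tidy_spacing(text):
    """Remove whitespace that appears directly before sentence punctuation."""
    return re.sub(r"\s+([.,!?;:])", r"\1", text).strip()
```

For instance, `tidy_spacing(test['Predicted'][1])` would close the gap before each full stop while leaving the wording unchanged.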

## Model Evaluation & Example Summaries

In [None]:
# Evaluate metrics
rouge = load("rouge")
bertscore = load("bertscore")

nonft_rouge_results = rouge.compute(predictions = test['Predicted'], references = test['Target'])
nonft_berts_results = bertscore.compute(predictions = test['Predicted'], references = test['Target'], lang = 'en')

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
nonft_rouge_results

{'rouge1': 0.17723067486351762,
 'rouge2': 0.03474975387463957,
 'rougeL': 0.13833154790080854,
 'rougeLsum': 0.13801505020545007}
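To make the `rouge1` figure above easier to interpret, here is a simplified sketch of unigram-overlap F1 for a single prediction and reference. It is an illustration under stated assumptions, not the `rouge_score` implementation: the tokenizer is a plain lowercase alphanumeric split, no stemming is applied, and `rouge1_f1` is a hypothetical name. The reported value is also aggregated over all test examples, which this function does not do.

```python
import re
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between one prediction and one reference."""
    # Assumption: tokens are lowercase runs of letters and digits.
    pred_counts = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A low score therefore means the prediction shares few words with the reference summary, which is consistent with the baseline copying sentences from the thread rather than describing it.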

In [None]:
nonft_berts_results

{'precision': [0.8242924809455872,
  0.8351583480834961,
  0.8190982937812805,
  0.8239952325820923,
  0.8377957344055176,
  0.8298234939575195,
  0.8151780366897583,
  0.8303300142288208,
  0.8110110759735107,
  0.8492404222488403,
  0.8299615383148193,
  0.8233664035797119,
  0.840356171131134,
  0.8529987931251526,
  0.824569582939148,
  0.8588295578956604,
  0.8244633674621582,
  0.819983959197998,
  0.8456223011016846,
  0.832257866859436,
  0.839884877204895,
  0.8165749311447144,
  0.817705512046814,
  0.8121595978736877,
  0.8516138792037964,
  0.8219054341316223,
  0.8371283411979675,
  0.827399730682373,
  0.8153020143508911,
  0.792497992515564,
  0.8474760055541992,
  0.840796947479248,
  0.8394103646278381,
  0.820908784866333,
  0.8585995435714722,
  0.7975109815597534,
  0.8625326156616211,
  0.808857798576355,
  0.8262081146240234,
  0.8245513439178467,
  0.9007415175437927,
  0.8209487199783325,
  0.843186616897583,
  0.8541069030761719,
  0.8692108392715454,
  0.81272

In [None]:
temp = nonft_berts_results

In [None]:
avg_prec = np.mean(temp['precision'])
avg_recall = np.mean(temp['recall'])
avg_f1 = np.mean(temp['f1'])
print(avg_prec)
print(avg_recall)
print(avg_f1)

0.8375327200889587
0.8516641308069229
0.8444484784603119
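The averages above summarize the per-example BERTScore lists with a single number each. To also see how much the scores vary across examples, the mean can be reported alongside a standard deviation. This is a minimal sketch; `summarize_scores` is a hypothetical helper, and it assumes the result dictionary holds `precision`, `recall`, and `f1` lists as in the cell above.

```python
from statistics import mean, pstdev

def summarize_scores(results):
    """Return (mean, population standard deviation) for each BERTScore list."""
    return {
        key: (mean(results[key]), pstdev(results[key]))
        for key in ("precision", "recall", "f1")
    }
```

Calling `summarize_scores(nonft_berts_results)` would give the same means as the cell above, paired with the spread of each metric.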
