## Directions
To run the weak baseline, first create a shortcut from the 'CIS 5300 Project Folder' to 'My Drive'. Then open the 'Runtime' tab and click 'Run all'. When prompted to mount your Drive in the notebook, click accept. Each cell prints its output as it runs, and the final cells report the resulting ROUGE and BERTScore values.

In [22]:
%%capture
# Install necessary huggingface packages (the %%capture cell magic must be the first line of the cell)
!pip install transformers datasets
!pip install --upgrade accelerate
!pip install evaluate rouge_score bert_score
!pip install sentencepiece

In [23]:
# Import packages
from transformers import (
    pipeline, Trainer, TrainingArguments, DataCollatorForLanguageModeling,
    AutoTokenizer, AutoModelForSeq2SeqLM, TFAutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, T5ForConditionalGeneration,
    T5Tokenizer, DataCollatorForSeq2Seq,
)
import sentencepiece
from evaluate import load
from datasets import Dataset
import numpy as np
import pandas as pd
import os
import torch
import copy
from collections import defaultdict

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device_id = 0 if str(device) == 'cuda' else -1

## Data Pre-Processing

In [24]:
from google.colab import drive
drive.mount('/content/drive')
# folder = pd.read_csv("/content/drive/Shared With Me/folder_0/folder_1/file.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
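Once the Drive is mounted, it can help to confirm that the shortcut resolved and that the split files are where the notebook expects them. The sketch below is a hypothetical helper (`missing_split_files` is not part of the original notebook). It assumes the six `{split}_{side}.txt` files read by the loading cells that follow, and returns the names of any that are absent.

```python
import os

# Hypothetical helper for illustration; not part of the original notebook.
SPLITS = ("train", "dev", "test")
SIDES = ("source", "target")

def missing_split_files(data_dir):
    """Return the expected split files that do not exist under data_dir."""
    expected = [f"{split}_{side}.txt" for split in SPLITS for side in SIDES]
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]
```

Calling `missing_split_files(path)` with the data directory used in the loading cells should return an empty list when the shortcut is set up correctly.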


In [25]:
# Load the training split: each line of the source file is one email thread,
# and the matching line of the target file is its summary
path = "/content/drive/My Drive/CIS 5300 Final Project/5300 Project Data/exp_data/data_email_long/"
# train_source = pd.read_csv(path + 'train_source.txt', sep = '\n')
with open(path + 'train_source.txt', 'r') as s:
  source = [line.strip() for line in s]

with open(path + 'train_target.txt', 'r') as t:
  target = [line.strip() for line in t]

train = pd.DataFrame({"Source" : source, "Target" : target})
train

Unnamed: 0,Source,Target
0,"Subject: ep strategy|||Andy: I'm fried, frazzl...",Andy expresses that he is stressed out about a...
1,"Subject: tibco layoffs, wireless cuts|||Steve:...",Steve says Tibco is planning layoffs and will ...
2,Subject: are the knicks playing tonight?|||Dou...,Doug plays a joke on Stephen and Fernand askin...
3,"Subject: func test failures|||Ravi: Sharon, I ...",Ravi tells Sharon there are some functional te...
4,"Subject: feedback from market|||Peter: Ron, I ...",The AvocadoIT team discuss problems with mobil...
...,...,...
1795,Subject: agenda for program mgmt meeting tbh 1...,Hideki is trying to update the team on the key...
1796,"Subject: cage storage|||Mark: Debbie, I need y...",Mark asks Debbie if she could help him tomorro...
1797,Subject: hp & exodus emobility initiative|||Ro...,"Ron writes to Mike, Brett, and David about the..."
1798,Subject: devices that we support today|||Rober...,Robert mentions to Dan the devices that V2.5do...


In [26]:
# Load the test split
with open(path + 'test_source.txt', 'r') as s:
  source = [line.strip() for line in s]

with open(path + 'test_target.txt', 'r') as t:
  target = [line.strip() for line in t]

test = pd.DataFrame({"Source" : source, "Target" : target})
test

Unnamed: 0,Source,Target
0,"Subject: 1:30pm conference call|||Ty: Dan, Can...",Ty asks Dan if he is able to take a conference...
1,Subject: 12/20/00 need flip chart for sausalit...,Jackie alerts Yolanda and Betty she needs the ...
2,Subject: environment for respond|||Sujan: Guys...,Sujan reminds everyone they need a respond env...
3,"Subject: ert user guide|||Ruth: Piyush, I know...",Ruth requests Piyush check the status of a gui...
4,Subject: 4.5 patch for japan|||Divakar: Rajeev...,Diva tells Rajeev Japan needs a patch to fix s...
...,...,...
495,Subject: usps demo....tuesday demo!|||Ray: I ...,Ray corrects previously communicated date for ...
496,Subject: training for accenture developers in ...,Carlos will be travelling to Australia at the ...
497,Subject: ui postmortem minutes|||Mark: Here is...,Mark sends a post-mortem report to the team fo...
498,Subject: system admin training in ep japan|||R...,Richard informs people that Doug will be teach...


In [27]:
# Load the dev split
with open(path + 'dev_source.txt', 'r') as s:
  source = [line.strip() for line in s]

with open(path + 'dev_target.txt', 'r') as t:
  target = [line.strip() for line in t]

dev = pd.DataFrame({"Source" : source, "Target" : target})
dev

Unnamed: 0,Source,Target
0,Subject: bug 2664|||Sharon: Hi Ravi - this is ...,Sharon sends Ravi a link that isn't working. S...
1,Subject: thanks for the hospitality and courte...,Dave expresses thanks for the hospitality he r...
2,Subject: testservlet update for sp4|||Howard: ...,Howard asks about a plan for changes to TestSe...
3,Subject: andy at javaone on monday|||Andy: I'l...,Andy is at JavaOne and asks Jamie to review th...
4,Subject: avocadoit/forrester follow-up|||Ryan:...,Ryan thought event was a conference not a work...
...,...,...
244,Subject: avocadoit booth staff needed at wirel...,Debbie sends out a request for volunteers that...
245,Subject: marketing manager job description|||M...,"Marcia writes to Giselle, Barry, and Pam about..."
246,Subject: alaska bug|||Kavitha: This bug has be...,Kavitha notes that a bug has been fixed and th...
247,Subject: offline admin guide 4.0 - review|||De...,AvocadoIT completes an offline administration ...
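The three loading cells above repeat the same read-and-strip pattern for `train`, `test`, and `dev`. One way to factor it out is sketched below. `load_split` is a hypothetical helper name, and the UTF-8 default is an assumption about the data files rather than something the notebook states.

```python
import os

def load_split(data_dir, split, encoding="utf-8"):
    """Read the parallel `{split}_source.txt` and `{split}_target.txt` files.

    Returns (sources, targets): two lists of stripped lines aligned by index.
    The helper name and default encoding are assumptions for illustration.
    """
    columns = []
    for side in ("source", "target"):
        file_path = os.path.join(data_dir, f"{split}_{side}.txt")
        with open(file_path, "r", encoding=encoding) as handle:
            columns.append([line.strip() for line in handle])
    sources, targets = columns
    # Fail early if the two files are misaligned.
    if len(sources) != len(targets):
        raise ValueError(
            f"{split}: {len(sources)} sources but {len(targets)} targets"
        )
    return sources, targets
```

With this in place, each split reduces to `source, target = load_split(path, 'train')` followed by the same `pd.DataFrame({"Source" : source, "Target" : target})` call used in the cells above.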


In [28]:
test['Target'][0]

"Ty asks Dan if he is able to take a conference call at 1:30pm with Accenture to discuss Symbol devices, barcoding into an application, and other questions regarding an opportunity at Visa. Dan acknowledges he can but he may need Amitabh because he needs more tech info on their Symbol peripheral capabilities. Ty thinks he'll be fine in the initial discussion without Amitabh. Dan is working from home, Ty will dial him in."

In [29]:
test['Source'][0]

"Subject: 1:30pm conference call|||Ty: Dan, Can you sit in on a call with Accenture today at 1:30pm to talk about Symbol devices, barcoding into an application, and a few other questions regarding an opportunity at Visa? Thanks, Ty|||Dan: Yes but need more tech info on our symbol peripheral capabilities so may need Amitabh.|||Ty: I think you may be able to cover this initial discussion without Amitabh. We'ss dial you in or come over to Amit's office. Ty|||Dan: I'm working from home today, so you can call my mobile of give me a dial-in ph#. Dan\t\t\t\t\t\tUSERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w|||Dan: are you going to call my mobile phone or do you have a dial-in ph# ? Dan\t\t\t\t\t\tUSERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w"

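The source string above packs a whole thread into one line: a `Subject:` field followed by `Speaker: message` turns, all separated by `|||`. The notebook passes this raw string to T5 unchanged. For inspecting the data, a small parser along the following lines may be useful. `parse_thread` is a hypothetical helper, and it assumes every record follows the layout of the example printed above.

```python
def parse_thread(source):
    """Split a `|||`-delimited thread into (subject, [(speaker, message), ...])."""
    parts = source.split("|||")
    subject = parts[0]
    if subject.startswith("Subject:"):
        # Strip only the leading field name; the subject itself may contain colons.
        subject = subject[len("Subject:"):].strip()
    turns = []
    for part in parts[1:]:
        speaker, sep, message = part.partition(":")
        if not sep:
            # No "Speaker:" prefix found; keep the text with an empty speaker.
            speaker, message = "", part
        turns.append((speaker.strip(), message.strip()))
    return subject, turns

# Abbreviated version of the thread shown above, for illustration.
example = ("Subject: 1:30pm conference call"
           "|||Ty: Dan, Can you sit in on a call with Accenture today at 1:30pm?"
           "|||Dan: Yes but need more tech info.")
subject, turns = parse_thread(example)
print(subject)
for speaker, message in turns:
    print(f"{speaker} -> {message}")
```

Splitting on the first colon of each turn keeps times such as `1:30pm` inside the message body intact.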
In [30]:
from transformers import AutoTokenizer

checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [31]:
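prefix = 'summarize: '
def add_summarize_prompt(df):
  # `df` is a batch of examples (batched = True): prepend the T5 task prefix to each source
  inputs = [prefix + doc for doc in df["Source"]]
  df['Source'] = inputs
  # Tokenize sources (truncated to 512 tokens) and targets (truncated to 32 tokens)
  model_inputs = tokenizer(inputs, max_length = 512, truncation = True)
  labels = tokenizer(text_target=df["Target"], max_length = 32, truncation = True)
  model_inputs["labels"] = labels["input_ids"]
  return model_inputs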

ds_train = Dataset.from_pandas(train)
tokenized_train = ds_train.map(add_summarize_prompt, batched = True)

ds_dev = Dataset.from_pandas(dev)
tokenized_dev = ds_dev.map(add_summarize_prompt, batched = True)

ds_test = Dataset.from_pandas(test)
tokenized_test = ds_test.map(add_summarize_prompt, batched = True)

Map:   0%|          | 0/1800 [00:00<?, ? examples/s]

Map:   0%|          | 0/249 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
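The tokenization cell caps sources at 512 tokens and labels at 32 tokens, so it is worth knowing how many examples those limits actually cut. The following is a minimal sketch, not part of the original notebook: `fraction_truncated` is a hypothetical helper that accepts any tokenizing callable, and the demo uses whitespace splitting purely as a stand-in for the T5 tokenizer.

```python
def fraction_truncated(texts, tokenize, max_length):
    """Fraction of texts whose token count exceeds max_length.

    `tokenize` is any callable mapping a string to a sequence of tokens.
    """
    if not texts:
        return 0.0
    too_long = sum(1 for text in texts if len(tokenize(text)) > max_length)
    return too_long / len(texts)

# Stand-in tokenizer for illustration only: whitespace splitting, not SentencePiece.
examples = [
    "a short summary",
    "a much longer summary with many more words in it",
]
print(fraction_truncated(examples, str.split, max_length=5))
```

In the notebook, `tokenize` could be `lambda t: tokenizer(t)["input_ids"]`, applied to `train["Target"]` with `max_length=32` or to `train["Source"]` with `max_length=512`.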

In [32]:
tokenized_test['Source'][0]

"summarize: Subject: 1:30pm conference call|||Ty: Dan, Can you sit in on a call with Accenture today at 1:30pm to talk about Symbol devices, barcoding into an application, and a few other questions regarding an opportunity at Visa? Thanks, Ty|||Dan: Yes but need more tech info on our symbol peripheral capabilities so may need Amitabh.|||Ty: I think you may be able to cover this initial discussion without Amitabh. We'ss dial you in or come over to Amit's office. Ty|||Dan: I'm working from home today, so you can call my mobile of give me a dial-in ph#. Dan\t\t\t\t\t\tUSERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w|||Dan: are you going to call my mobile phone or do you have a dial-in ph# ? Dan\t\t\t\t\t\tUSERNAME@DOMAIN.COM Director of Sales Engineering PHONENUMBER m HTTP://LINK w"

## Weak Baseline: Non-Fine-Tuned T5

In [33]:
tokenizer = AutoTokenizer.from_pretrained('t5-base')
nonft_model = TFAutoModelForSeq2SeqLM.from_pretrained('t5-base')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model='t5-base')
nonft_summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [34]:
# Load the model and tokenizer (uses the `device` selected in the import cell, so the cell also runs without a GPU)
model = T5ForConditionalGeneration.from_pretrained('t5-base').to(device)
tokenizer = T5Tokenizer.from_pretrained('t5-base')

def batch_summarize(texts, batch_size=8):
    summaries = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]

        # Tokenize the batch, padding to its longest sequence and truncating at 512 tokens
        inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=512, return_tensors='pt').to(device)

        # Generate summaries with beam search; pass the attention mask explicitly so padding is ignored
        with torch.no_grad():
            summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], num_beams=4, max_length=50, early_stopping=True)

        # Decode and add to the list
        batch_summaries = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
        summaries.extend(batch_summaries)

    return summaries

# Apply batch summarization (the sources already carry the 'summarize: ' prefix)
test['Predicted'] = batch_summarize(tokenized_test['Source'])

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [35]:
print(test['Predicted'][0])
print(test['Target'][0])
print(test['Source'][0])

Dan USERNAME@DOMAIN.COM: i'm working from home today, so you can call my mobile or dial-in ph# .
Ty asks Dan if he is able to take a conference call at 1:30pm with Accenture to discuss Symbol devices, barcoding into an application, and other questions regarding an opportunity at Visa. Dan acknowledges he can but he may need Amitabh because he needs more tech info on their Symbol peripheral capabilities. Ty thinks he'll be fine in the initial discussion without Amitabh. Dan is working from home, Ty will dial him in.
Subject: 1:30pm conference call|||Ty: Dan, Can you sit in on a call with Accenture today at 1:30pm to talk about Symbol devices, barcoding into an application, and a few other questions regarding an opportunity at Visa? Thanks, Ty|||Dan: Yes but need more tech info on our symbol peripheral capabilities so may need Amitabh.|||Ty: I think you may be able to cover this initial discussion without Amitabh. We'ss dial you in or come over to Amit's office. Ty|||Dan: I'm working fro

In [36]:
print(test['Predicted'][1])
print(test['Source'][1])
print(test['Target'][1])

i asked you to let me know when the meeting was over so i could clean the refrigerator . you have sent me this message 3 times already and iam currently looking for one .
Subject: 12/20/00 need flip chart for sausalito conf rm.|||Jackie: Also, Please clean out the fridge. Thank you. Jackie|||Yolanda: are you going to need flip chart today? and yes, i will clean the refrigerator this morning, remember, i asked you to let me know when the meeting was over the other day, so i could clean the refrigerator.|||Jackie: Thanks Yoli!! Sorry, I couldn't remember if I had sent this one or not. I appreciate your efforts and hard work to help me out!!! :)|||Yolanda: you have sent me this message 3 times already and iam currently looking for one|||Jackie: Sorry!!! Thanks again.
Jackie alerts Yolanda and Betty she needs the flip chart for the Sausalito conf rm and requests the fridge to be cleaned. Yolanda asks Jackie if she will need the flip chart today, and assures her that she will clean the frid

In [37]:
print(test['Predicted'][2])
print(test['Source'][2])
print(test['Target'][2])

i need a "standard destop" hardware configuration which we can use as the server . as it will only be used as a server, we don't need fancy monitor (ie 15" will be ok)
Subject: environment for respond|||Sujan: Guys, We need a Respond environment. For that I need to order a "standard destop" hardware configuration which we can use as the server. As it will only be used as a server, we don't need the fancy monitor (ie 15" will be ok). Please order one for my group. Let me know the availability and if you need more specific details. Thanks, Sujan|||Sujan: Guys, We need a Respond environment. For that I need to order a "standard destop" hardware configuration which we can use as the server. As it will only be used as a server, we don't need the fancy monitor (ie 15" will be ok). Please order one for my group. Let me know the availability and if you need more specific details. Thanks, Sujan|||Marek: Sujan, I'm not sure if you already took care of that, but the external IP is very important 

## Model Evaluation & Example Summaries

In [38]:
# Evaluate metrics
rouge = load("rouge")
bertscore = load("bertscore")

# Token indices sequence length is longer than the specified maximum sequence length for this model (645 > 512). Running this sequence through the model will result in indexing errors
nonft_rouge_results = rouge.compute(predictions = test['Predicted'], references = test['Target'])
nonft_berts_results = bertscore.compute(predictions = test['Predicted'], references = test['Target'], lang = 'en')

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [39]:
nonft_rouge_results

{'rouge1': 0.19955151433038978,
 'rouge2': 0.05004325677210582,
 'rougeL': 0.14509618400406926,
 'rougeLsum': 0.14499080538004872}
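To make the ROUGE-1 figure above more concrete, here is a simplified, hand-rolled version of the idea: an F1 score over clipped unigram overlap between a prediction and its reference. This is a sketch for intuition only. `rouge1_f1` is a hypothetical function that uses lowercasing and whitespace tokenization, whereas the `rouge` metric loaded through `evaluate` applies its own tokenization and aggregation, so its numbers will not match exactly.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy strings for illustration; not taken from the notebook's predictions.
score = rouge1_f1(
    "dan is working from home",
    "dan is working from home and ty will dial him in",
)
print(round(score, 3))
```

In the toy pair, all 5 predicted words appear in the 11-word reference, giving precision 1.0, recall 5/11, and an F1 of 0.625. A short extractive prediction can therefore be fully "correct" word for word and still score low on recall against a long reference, which is consistent with the modest ROUGE values reported above.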

In [40]:
nonft_berts_results

{'precision': [0.8205626010894775,
  0.8483543395996094,
  0.8281437158584595,
  0.8309992551803589,
  0.840266227722168,
  0.8259634971618652,
  0.8249658346176147,
  0.8393022418022156,
  0.809921383857727,
  0.8659712672233582,
  0.8375159502029419,
  0.837530255317688,
  0.8801302909851074,
  0.8680313229560852,
  0.8378739356994629,
  0.8635696172714233,
  0.8275223970413208,
  0.8235275149345398,
  0.8408095836639404,
  0.8361548185348511,
  0.840110182762146,
  0.8201842308044434,
  0.8162648677825928,
  0.8147243857383728,
  0.8472945094108582,
  0.8394254446029663,
  0.8617720603942871,
  0.8395898342132568,
  0.8119636178016663,
  0.7967687845230103,
  0.8505049347877502,
  0.8503284454345703,
  0.8435200452804565,
  0.8296914100646973,
  0.8932071924209595,
  0.800453782081604,
  0.8815784454345703,
  0.8107247948646545,
  0.8357192873954773,
  0.8258331418037415,
  0.8904428482055664,
  0.8276995420455933,
  0.8315075635910034,
  0.8713354468345642,
  0.8605276942253113,
  

In [41]:
temp = nonft_berts_results

In [42]:
avg_prec = np.mean(temp['precision'])
avg_recall = np.mean(temp['recall'])
avg_f1 = np.mean(temp['f1'])
print(avg_prec)
print(avg_recall)
print(avg_f1)

0.8475873854160308
0.8350747630596161
0.8411723804473877
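The BERTScore results hold one value per test example, and the cell above reduces each list to its mean. A slightly richer summary, sketched below, also reports the spread. `summarize_bertscore` is a hypothetical helper, and the numbers in the demo are toy values rather than results from this notebook.

```python
from statistics import fmean, pstdev

def summarize_bertscore(results, keys=("precision", "recall", "f1")):
    """Mean and population standard deviation of each per-example score list."""
    return {
        key: {"mean": fmean(results[key]), "std": pstdev(results[key])}
        for key in keys
    }

# Toy values for illustration only; these are not results from the notebook.
toy = {"precision": [0.8, 0.9], "recall": [0.7, 0.9], "f1": [0.75, 0.9]}
for key, stats in summarize_bertscore(toy).items():
    print(f"{key}: mean={stats['mean']:.4f} std={stats['std']:.4f}")
```

In the notebook this would be called as `summarize_bertscore(nonft_berts_results)`.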
