This notebook runs the pretrained [transformersbook/pegasus-samsum](https://huggingface.co/transformersbook/pegasus-samsum) summarization model on the SAMSum test set and shows the results.

# Importing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from tqdm import tqdm

# Pipe test dataset into model and output results

In [None]:
import json
import pandas as pd
import random
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

file_path = "./drive/MyDrive/CS5246_Final_Project/data/"

# Read JSON file
with open(file_path + "train.json", "r", encoding="utf-8") as file:
    train_data = json.load(file)
with open(file_path + "val.json", "r", encoding="utf-8") as file:
    val_data = json.load(file)
with open(file_path + "test.json", "r", encoding="utf-8") as file:
    test_data = json.load(file)

# Convert to Pandas DataFrame
df_train = pd.DataFrame(train_data)
df_val = pd.DataFrame(val_data)
df_test = pd.DataFrame(test_data)

large_model = "transformersbook/pegasus-samsum"
model = AutoModelForSeq2SeqLM.from_pretrained(large_model)
tokenizer = AutoTokenizer.from_pretrained(large_model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Summarize a list of dialogues in batches, using beam search for generation
def batch_generate_summaries(dialogues, batch_size=8):
    summaries = []
    for i in tqdm(range(0, len(dialogues), batch_size)):
        batch = dialogues[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
        with torch.no_grad():
            summary_ids = model.generate(**inputs, max_length=128, num_beams=5)
        batch_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
        summaries.extend(batch_summaries)
    return summaries

# Generate summaries for df_test (note: this overwrites the reference summaries in the "summary" column)
df_test["summary"] = batch_generate_summaries(df_test["dialogue"].tolist())

100%|██████████| 103/103 [01:52<00:00,  1.09s/it]
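
The progress bar above reports 103 iterations. As a quick sanity check on the batching loop, the sketch below reproduces the `range(0, len(dialogues), batch_size)` slicing in isolation. It assumes the test set has 818 rows (inferred from the 0 to 817 index in the `df_test` display) and the default `batch_size=8`; the helper name `batch_slices` is hypothetical and not part of the notebook.

```python
import math


def batch_slices(n_items, batch_size):
    """Return the (start, end) index pairs visited by the batching loop."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


# Assumed sizes: 818 dialogues, batches of 8
slices = batch_slices(818, 8)

# The loop runs ceil(n_items / batch_size) times
assert len(slices) == math.ceil(818 / 8)

print(len(slices))  # number of batches
print(slices[-1])   # the last batch is shorter than the others
```

Under these assumptions the last batch holds only two dialogues, which is why `padding=True` is applied per batch rather than to a fixed batch shape.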


In [None]:
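sample = df_test.iloc[1]
sample_dialogue = sample["dialogue"]
# df_test["summary"] now holds the generated summaries, so read the reference from the raw test data
reference_summary = pd.DataFrame(test_data)["summary"].iloc[1]
generated_summary = sample["summary"]
print("Dialogue:\n", sample_dialogue)
print("\nGround truth Summary:\n", reference_summary)
print("\nGenerated Summary:\n", generated_summary)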

Dialogue:
 Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Ground truth Summary:
 Eric and Rob are going to watch a stand-up on youtube.

Generated Summary:
 Eric, Rob and Rob are going to watch the Russian comedian's stand-up on YouTube. Eric likes the train part, Rob likes the machine part.
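
To put a number on a comparison like the one above, a simple unigram-overlap F1 score can be computed between the reference and the generated summary. The sketch below is a rough, ROUGE-1-style approximation written in plain Python; it is not the official ROUGE implementation (no stemming, naive tokenization), and the function name `unigram_f1` is hypothetical.

```python
import re
from collections import Counter


def unigram_f1(reference, candidate):
    """Unigram-overlap F1 between two strings (lowercased, word tokens only)."""
    ref_tokens = re.findall(r"\w+", reference.lower())
    cand_tokens = re.findall(r"\w+", candidate.lower())
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Eric and Rob are going to watch a stand-up on youtube."
generated = (
    "Eric, Rob and Rob are going to watch the Russian comedian's stand-up "
    "on YouTube. Eric likes the train part, Rob likes the machine part."
)
score = unigram_f1(reference, generated)
print(f"Unigram F1: {score:.3f}")
```

For this sample the recall is high because the generated summary covers almost every word of the reference, while the precision is lower because of the extra second sentence.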


In [None]:
df_test

Unnamed: 0,id,summary,dialogue
0,13862856,Amanda can't find Betty's number. Larry called...,"Hannah: Hey, do you have Betty's number?\nAman..."
1,13729565,"Eric, Rob and Rob are going to watch the Russi...",Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric:...
2,13680171,Lenny wants to buy purple trousers. Bob likes ...,"Lenny: Babe, can you help me with something?\r..."
3,13729438,Emma doesn't want to cook for dinner tonight. ...,"Will: hey babe, what do you want for dinner to..."
4,13828600,Jane is back in Warsaw. She lost her calendar....,"Ollie: Hi , are you in Warsaw\r\nJane: yes, ju..."
...,...,...,...
814,13611902-1,Benjamin was unable to attend Friday night's b...,Alex: Were you able to attend Friday night's b...
815,13820989,The audition starts at 7.30 P.M. at Antena 3. ...,Jamilla: remember that the audition starts at ...
816,13717193,Marta clicked something by accident. Agnieszka...,"Marta: <file_gif>\r\nMarta: Sorry girls, I cli..."
817,13829115,There was a meet and greet with James Charles ...,Cora: Have you heard how much fuss British med...
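
The `dialogue` column above mixes `\n` and `\r\n` line endings (row 0 versus row 1). If consistent input is wanted, the line endings could be normalized before tokenization. The sketch below is one way to do that; `normalize_newlines` is a hypothetical helper, the second example string is made up for illustration, and whether normalization changes the generated summaries is not something this notebook measures.

```python
def normalize_newlines(text):
    """Convert Windows (\\r\\n) and old Mac (\\r) line endings to \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


examples = [
    "Eric: MACHINE!\r\nRob: That's so gr8!",  # \r\n style, as in row 1
    "A: hello\nB: hi",                        # \n style (made-up example)
]
cleaned = [normalize_newlines(text) for text in examples]

# After normalization no carriage returns remain
assert all("\r" not in text for text in cleaned)
print(cleaned)
```

With pandas this could be applied as `df_test["dialogue"].map(normalize_newlines)` before calling `batch_generate_summaries`.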
