# Text Summarization Project

This notebook is dedicated to the task of text summarization using a fine-tuned T5 model. The goal is to effectively summarize large bodies of text into concise, meaningful summaries.

## Description

In this project, we are exploring different approaches to text summarization:
- Summarizing each paragraph individually and then stitching these summaries together.
- Summarizing the entire text in one go.

We will compare these methods using ROUGE scores to evaluate their effectiveness.

## Steps to Run

1. **Add Data to the Runtime (IMPORTANT!)**: First, upload the data files to the runtime using the file panel in the side column. This notebook needs `kindle_reviews.csv` and `test.jsonl`. Place both directly in the runtime's working directory; no folder nesting is needed.

2. **Install Required Libraries**: The first cell installs the required libraries. The project uses PyTorch.

3. **Model Fine-Tuning**: The notebook defines a dataset class that tokenizes each review and its summary into the format the model expects. Raw text alone is not enough: T5 has to be given text/summary pairs so that it learns to map one to the other. The model is then fine-tuned on these pairs.

4. **Text Summarization**: In this section, we apply the fine-tuned model to summarize the test data. The two approaches (paragraph-by-paragraph and full-text summarization) are implemented and compared. The results are stored in lists, and the last summary from each approach is printed as a sanity check.

5. **Evaluate Summaries**: The summaries generated by each method are evaluated with the ROUGE metric, loaded from the `datasets` library. The results are displayed and compared to determine the more effective approach. A brief guide to reading the scores:

Recall answers the following question - "Of all the relevant information present in the reference summary, how much did the generated summary manage to capture?"

Precision answers the following question - "Of all the information presented in the generated summary, how much of it is relevant or actually appears in the reference summary?"
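
To make the two definitions concrete, here is a minimal sketch of unigram-overlap precision and recall. It is a simplified stand-in for ROUGE-1, not the library's implementation: it assumes lowercase whitespace tokenization, and the function name `rouge1_precision_recall` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_precision_recall(generated, reference):
    """Unigram-overlap precision and recall (simplified ROUGE-1, whitespace tokens)."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((gen_counts & ref_counts).values())
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall

reference = "the cat sat on the mat"
short_summary = "the cat sat"
long_summary = "the cat sat on the mat while the dog slept by the door all day"

print(rouge1_precision_recall(short_summary, reference))  # (1.0, 0.5)
print(rouge1_precision_recall(long_summary, reference))   # (0.4, 1.0)
```

A long generated summary can cover every reference word (high recall) while most of its own words are absent from the reference (low precision). The stitched paragraph summaries tend to be long, which is consistent with the high-recall, low-precision scores reported for them in the evaluation output.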

## Notes

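One concern is the large number of paragraphs per article, since the paragraph-by-paragraph approach runs the model once per paragraph. That is why I added a cell that computes the average number of paragraphs per article, to check that the split produces something reasonable.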

Fine-tuning and prediction take a long time, so be wary of that when running. I think a single prediction generally takes around 10 seconds. The training output is shown in the notebook, but it took me about 4.5 minutes for roughly 1000 summaries.

In [None]:
!pip install transformers datasets rouge-score torch sentencepiece accelerate
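
Before loading the data, it can help to confirm that both input files from step 1 are actually present in the runtime. The check below is a small sketch; the helper name `missing_files` is hypothetical, and it assumes the files sit in the current working directory.

```python
from pathlib import Path

# File names taken from step 1; adjust if yours differ.
REQUIRED_FILES = ["kindle_reviews.csv", "test.jsonl"]

def missing_files(names, base_dir="."):
    """Return the names that are not present as regular files under base_dir."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]

missing = missing_files(REQUIRED_FILES)
if missing:
    print("Missing from the runtime:", ", ".join(missing))
else:
    print("All input files found.")
```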



In [None]:
import pandas as pd
import csv
from transformers import T5Tokenizer, T5ForConditionalGeneration
from datasets import load_dataset, load_metric
from torch.utils.data import Dataset, DataLoader

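# Load the first 100 reviews, skipping malformed rows.
# on_bad_lines="skip" replaces error_bad_lines=False, which newer pandas releases no longer accept.
df = pd.read_csv("kindle_reviews.csv", on_bad_lines="skip", nrows=100)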
df.head()





Unnamed: 0.1,Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,B000F83SZQ,"[0, 0]",5,I enjoy vintage books and movies so I enjoyed ...,"05 5, 2014",A1F6404F1VG29J,Avidreader,Nice vintage story,1399248000
1,1,B000F83SZQ,"[2, 2]",4,This book is a reissue of an old one; the auth...,"01 6, 2014",AN0N05A9LIJEQ,critters,Different...,1388966400
2,2,B000F83SZQ,"[2, 2]",4,This was a fairly interesting read. It had ol...,"04 4, 2014",A795DMNCJILA6,dot,Oldie,1396569600
3,3,B000F83SZQ,"[1, 1]",5,I'd never read any of the Amy Brewster mysteri...,"02 19, 2014",A1FV0SX13TWVXQ,"Elaine H. Turley ""Montana Songbird""",I really liked it.,1392768000
4,4,B000F83SZQ,"[0, 1]",4,"If you like period pieces - clothing, lingo, y...","03 19, 2014",A3SPTOKDG7WBLN,Father Dowling Fan,Period Mystery,1395187200


In [None]:
class KindleReviewDataset(Dataset):
    def __init__(self, tokenizer, data, max_length=512):
        self.tokenizer = tokenizer
        self.data = data
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

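    def __getitem__(self, idx):
        item = self.data.iloc[idx]
        text = item['reviewText']
        summary = item['summary']
        # Tokenize the review (model input) and its summary (target), padding both to max_length.
        # Note: no "summarize: " prefix is added here, although summarize_text uses one at inference time.
        inputs = self.tokenizer.encode_plus(
            text, max_length=self.max_length, truncation=True, padding='max_length', return_tensors='pt')
        targets = self.tokenizer.encode_plus(
            summary, max_length=self.max_length, truncation=True, padding='max_length', return_tensors='pt')
        # flatten() drops the batch dimension added by return_tensors='pt'.
        # Padded positions in 'labels' keep the pad token id, so they count toward the loss.
        return {
            'input_ids': inputs['input_ids'].flatten(),
            'attention_mask': inputs['attention_mask'].flatten(),
            'labels': targets['input_ids'].flatten()
        }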

tokenizer = T5Tokenizer.from_pretrained('t5-small')
dataset = KindleReviewDataset(tokenizer, df)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
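
The dataset class above pads each target to `max_length` and passes the padded ids straight through as `labels`. A common practice with Hugging Face seq2seq models is to replace padding positions in the labels with `-100`, the value the cross-entropy loss ignores, so that padding does not contribute to the training loss. The sketch below illustrates the idea on a plain list. The helper name `mask_padding` and the token ids are made up, and the pad id of 0 is an assumption that should be checked against `tokenizer.pad_token_id`.

```python
PAD_TOKEN_ID = 0      # assumption: T5's pad token id; check tokenizer.pad_token_id
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding positions in a list of label ids with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

# Made-up token ids for a short summary padded to length 8.
padded_labels = [1525, 9, 834, 1, 0, 0, 0, 0]
print(mask_padding(padded_labels))
# [1525, 9, 834, 1, -100, -100, -100, -100]
```

On the tensor returned by the tokenizer, the equivalent would be an in-place assignment such as `labels[labels == tokenizer.pad_token_id] = -100` before returning the item. This was not part of the recorded run, so the training losses shown in this notebook come from unmasked labels.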


In [None]:
from transformers import Trainer, TrainingArguments
import torch

print(df.columns)
print(len(df))

# Prefer a GPU runtime on Google Colab when one is available (a paid tier is an option if more GPU time is needed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = T5ForConditionalGeneration.from_pretrained('t5-small').to(device)

# Fine-tuning configuration; these hyperparameters have not been tuned
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()


Index(['Unnamed: 0', 'asin', 'helpful', 'overall', 'reviewText', 'reviewTime',
       'reviewerID', 'reviewerName', 'summary', 'unixReviewTime'],
      dtype='object')
100
Using device: cuda


Step,Training Loss
10,24.449
20,24.1695
30,24.2845
40,22.1785
50,22.9565
60,21.0093
70,20.6222


TrainOutput(global_step=75, training_loss=22.60388264973958, metrics={'train_runtime': 30.2981, 'train_samples_per_second': 9.902, 'train_steps_per_second': 2.475, 'total_flos': 40602540441600.0, 'train_loss': 22.60388264973958, 'epoch': 3.0})
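
The `global_step=75` in the output above follows from the dataset size, batch size, and epoch count, and it is worth comparing against `warmup_steps=500`. The arithmetic can be sketched as follows, assuming a single device and no gradient accumulation; the helper name `total_optimizer_steps` is hypothetical.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs

steps = total_optimizer_steps(num_examples=100, batch_size=4, epochs=3)
warmup_steps = 500

print(steps)                 # 75
print(steps < warmup_steps)  # True
```

Because the whole run is shorter than the warmup period, training ends while the learning rate is still ramping up, so it never reaches its configured peak. With only 100 rows, a much smaller `warmup_steps` (or more training data) may be worth trying.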

In [None]:
import json

# summarizing text function
def summarize_text(text, tokenizer, model):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
    inputs = inputs.to(model.device)
    outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# load the first 300 records from the test set (kept small so the run stays manageable; can be raised later)
test_data = []
with open('test.jsonl', 'r') as file:
    for i, line in enumerate(file):
        if i < 300:
            test_data.append(json.loads(line))
        else:
            break

# summarize each paragraph vs entire text
paragraph_summaries = []
full_text_summaries = []
for article in test_data:
    # articles contain many paragraph breaks simply because of how they are written, so this yields many short pieces
    paragraphs = article['text'].split('\n\n')
    stitched_summary = ' '.join([summarize_text(para, tokenizer, model) for para in paragraphs if para.strip()])
    full_summary = summarize_text(article['text'], tokenizer, model)
    paragraph_summaries.append(stitched_summary)
    full_text_summaries.append(full_summary)

# just printing an example summary here
print(paragraph_summaries[-1])
print(full_text_summaries[-1])
print(test_data[-1])

is living the PRISON HIGH LIFE... at least compared to the... at least compared to the... at least compared to the PRISON HIGH LIFE. -- and TMZ has the menu to prove it. TMZ has the menu to prove it. click here for a complete list of TMZ products. the TMZ. first meal............................................................ Lauryn was serving 3-month sentence for tax evasion. she was served some pulled pork with a side of carrots, peas and sweet potatoes. she was able to choose from an array of juices or milk. the first meal included scrambled eggs and grits, chop suey with green beans and bread. the first meal at Bristol County Jail included scrambled eggs and grits. hill is housed in barracks so she can have fun with other inmates. at night, she's housed in barracks so she can laugh and talk. Hernandez is being kept in a 3x5 cell for 21 hours a day, only allowed out for three hours a day. he's all alone in solitary confinement. if you're gonna break the law... try not to murder a g

In [None]:
# small code block to calculate the average number of paragraphs per article

total_paragraphs = 0

for article in test_data:
    paragraphs = article['text'].split('\n\n')
    non_empty_paragraphs = sum(1 for para in paragraphs if para.strip())
    total_paragraphs += non_empty_paragraphs

average_paragraphs_per_article = total_paragraphs / len(test_data)
print("Average paragraphs per article:", average_paragraphs_per_article)

Average paragraphs per article: 13.72
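
The paragraph count depends on how paragraphs are delimited. Splitting on the exact string `'\n\n'` misses separators that contain stray spaces or Windows line endings. The sketch below shows a more tolerant splitter as one possible alternative; `split_paragraphs` is a hypothetical helper, and it is not known whether `test.jsonl` contains such separators.

```python
import re

def split_paragraphs(text):
    """Split on blank lines, tolerating whitespace-only lines and Windows line endings."""
    parts = re.split(r"\r?\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

sample = "First paragraph.\n\nSecond paragraph.\n \nThird paragraph.\r\n\r\nFourth."

print(len(sample.split("\n\n")))      # 2
print(len(split_paragraphs(sample)))  # 4
```

If the two counts agree on the real data, the simple `split('\n\n')` used in this notebook is sufficient.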


In [None]:
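# load ROUGE through the `datasets` metric loader
# (newer `datasets` releases drop load_metric in favor of the separate `evaluate` package)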
from datasets import load_metric

rouge = load_metric("rouge")

def calculate_rouge(predictions, references):
    return rouge.compute(predictions=predictions, references=references)

rouge_scores_paragraph = calculate_rouge(paragraph_summaries, [article['summary'] for article in test_data])
rouge_scores_full = calculate_rouge(full_text_summaries, [article['summary'] for article in test_data])

print("ROUGE Scores for Paragraph Summaries:", rouge_scores_paragraph)
print(rouge_scores_paragraph['rouge1'])
print("ROUGE Scores for Full Text Summaries:", rouge_scores_full)
print(rouge_scores_full['rouge1'])


ROUGE Scores for Paragraph Summaries: {'rouge1': AggregateScore(low=Score(precision=0.05573424229844668, recall=0.6605562492825988, fmeasure=0.09699070931016533), mid=Score(precision=0.06294917928635888, recall=0.6851262994184149, fmeasure=0.10734230453600091), high=Score(precision=0.07085429815911162, recall=0.7072498522176931, fmeasure=0.11832494519705482)), 'rouge2': AggregateScore(low=Score(precision=0.030229113788854903, recall=0.33440294443037705, fmeasure=0.052103955842789124), mid=Score(precision=0.035526304147783024, recall=0.364822343932637, fmeasure=0.05990535850025359), high=Score(precision=0.041008066921472866, recall=0.39540302075230044, fmeasure=0.06834468016580406)), 'rougeL': AggregateScore(low=Score(precision=0.04316881976893632, recall=0.5206239387348603, fmeasure=0.07528822303972534), mid=Score(precision=0.04911473744331071, recall=0.5416808568480087, fmeasure=0.08375355275423835), high=Score(precision=0.05514836779080362, recall=0.5640639819497478, fmeasure=0.09215

In [None]:
# determining which articles have the highest ROUGE-1 precision on a paragraph / full text level
# note: best_paragraph / best_full hold the reference summary of the winning article

best_paragraph, para_precision, para_index = '', -float('inf'), -1
best_full, full_precision, full_index = '', -float('inf'), -1

for i, article in enumerate(test_data):
    # score one article at a time; local names avoid overwriting the corpus-level scores above
    para_scores = calculate_rouge([paragraph_summaries[i]], [article['summary']])
    full_scores = calculate_rouge([full_text_summaries[i]], [article['summary']])
    para_p = para_scores['rouge1'].mid.precision
    full_p = full_scores['rouge1'].mid.precision
    if para_p > para_precision:
        best_paragraph, para_precision, para_index = article['summary'], para_p, i
    if full_p > full_precision:
        best_full, full_precision, full_index = article['summary'], full_p, i


In [None]:
# determining which articles have the highest ROUGE-1 recall on a paragraph / full text level
# note: this reuses best_paragraph, best_full, para_index and full_index from the precision cell

best_paragraph, para_recall, para_index = '', -float('inf'), -1
best_full, full_recall, full_index = '', -float('inf'), -1
perfect_recall = 0  # counting how many articles have a recall of 1.0 under either approach

for i, article in enumerate(test_data):
    # score one article at a time; local names avoid overwriting the corpus-level scores above
    para_scores = calculate_rouge([paragraph_summaries[i]], [article['summary']])
    full_scores = calculate_rouge([full_text_summaries[i]], [article['summary']])
    para_r = para_scores['rouge1'].mid.recall
    full_r = full_scores['rouge1'].mid.recall
    if para_r > para_recall:
        best_paragraph, para_recall, para_index = article['summary'], para_r, i
    if full_r > full_recall:
        best_full, full_recall, full_index = article['summary'], full_r, i

    if para_r == 1.0 or full_r == 1.0:
        perfect_recall += 1

