<a href="https://colab.research.google.com/github/anmol-singh7/GenAI-Exploration/blob/main/Text_Summarizer_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [57]:
!pip -q install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr evaluate

In [58]:
!pip -q install --upgrade accelerate
!pip -q uninstall -y transformers accelerate
!pip -q install transformers accelerate

## **Import required packages**

In [61]:
from transformers import pipeline,set_seed
from datasets import load_dataset,load_from_disk
import matplotlib.pyplot as plt
from transformers import AutoModelForSeq2SeqLM,AutoTokenizer

from evaluate import load
import pandas as pd
import os
import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# **Check availability of GPU**

In [60]:
device ="cuda" if torch.cuda.is_available() else "cpu"

# Load pre-trained transformer model

In [5]:
model_ckpt ="google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

In [6]:
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

# Load Dataset

In [54]:
DATA_PATH = '/content/drive/MyDrive/Text--Summarizer/data/'
# Create the directory if it doesn't exist
os.makedirs(DATA_PATH, exist_ok=True)

In [7]:
# The SAMSum dataset contains about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English.
# Linguists were asked to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger convesations.

# Use load_from_disk if you already have a dataset that has been saved to disk and you want to reload it.
# Use load_dataset if you want to download and load a dataset from the Hugging Face Datasets Hub.

dataset_samsum = load_dataset("samsum")

# dataset_samsum = load_from_disk(DATA_PATH+'samsum_dataset')

README.md:   0%|          | 0.00/7.04k [00:00<?, ?B/s]

samsum.py:   0%|          | 0.00/3.36k [00:00<?, ?B/s]

The repository for samsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/samsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


corpus.7z:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

In [8]:
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [9]:
dataset_samsum['train']['dialogue'][1]

'Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great'

In [10]:
dataset_samsum['train'][1]["summary"]

'Olivia and Olivier are voting for liberals in this election. '

# Split data into train and test set

In [11]:
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print(f"Dialogue:\n{dataset_samsum['test'][1]['dialogue']}")
print(f"Summary:\n{dataset_samsum['test'][1]['summary']}")

Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']
Dialogue:
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)
Summary:
Eric and Rob are going to watch a stand-up on youtube.


# Feature Engineering

In [12]:
def convert_example_to_features(example_batch):
  input_encoding = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)
  with tokenizer.as_target_tokenizer():
    target_encoding = tokenizer(example_batch['summary'], max_length=128, truncation=True)

  return {
      'input_ids': input_encoding['input_ids'],
      'attention_mask': input_encoding['attention_mask'],
      'labels': target_encoding['input_ids']
  }

In [None]:
# In dataset_samsum_pt, the suffix pt likely stands for PyTorch. 
# This naming convention suggests that the dataset has been preprocessed and tokenized in a format compatible with PyTorch. 
# Since transformers uses PyTorch tensors when passing inputs to the model, appending pt indicates that the dataset is now 
# prepared and ready for use with PyTorch-based models, such as AutoModelForSeq2SeqLM
dataset_samsum_pt = dataset_samsum.map(convert_example_to_features, batched=True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]



Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

In [14]:
dataset_samsum_pt['train']

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [None]:
dataset_samsum_pt['train']['input_ids'][1]

In [16]:
dataset_samsum_pt['train']['attention_mask'][1]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Model Training

In [62]:
os.environ["WANDB_MODE"] = "disabled"
# this is done to stop wandb from logging as it require a api key from wano which i wana avoide

In [33]:
# training
from transformers import DataCollatorForSeq2Seq,Trainer

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

In [None]:
from transformers import TrainingArguments, Trainer
# the model is trained on only one epochs to skip training time
# for better accuracy train on more epochs
trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    evaluation_strategy='steps', eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16
)



In [35]:
trainer = Trainer(model=model_pegasus,args=trainer_args,
                  tokenizer=tokenizer,data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt['train'],
                  eval_dataset=dataset_samsum_pt['validation'])

  trainer = Trainer(model=model_pegasus,args=trainer_args,


In [36]:
trainer.train()

Step,Training Loss,Validation Loss
500,1.6597,1.4842




TrainOutput(global_step=920, training_loss=1.825385618209839, metrics={'train_runtime': 3031.7589, 'train_samples_per_second': 4.859, 'train_steps_per_second': 0.303, 'total_flos': 5528248038285312.0, 'train_loss': 1.825385618209839, 'epoch': 0.9991854466467553})

# Model Evaluation

In [None]:
# Evaluation

def generate_batch_sized_chunks(list_of_elements, batch_size):
  """split the dataset into smaller batches that we can process simultaneously
  Yield successive batch-sized chunks from list_of_elements."""
  for i in range(0, len(list_of_elements), batch_size):
    yield list_of_elements[i : i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
  article_batches = generate_batch_sized_chunks(dataset[column_text], batch_size)
  target_batches = generate_batch_sized_chunks(dataset[column_summary], batch_size)

  for article_batch, target_batch in tqdm(zip(article_batches, target_batches), total=len(article_batches)):

      inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                        padding="max_length", return_tensors="pt")

      summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)
      """" parameter for length penalty ensures that the model does not generate sequences that are too long. """

      # Finally, we decode the generated texts,
      # replace the token, and the decoded text with the reference to the metric
      decoded_summaries =[tokenizer.decode(s, skip_special_tokens=True,
                                           clean_up_tokenization_spaces=True)
                         for s in summaries]

      decoded_summaries = [d.replace("", " ") for d in decoded_summaries]
      metric.add_batch(predictions=decoded_summaries, references=target_batch)

      #  Finally compute and return the ROUGE scores.
      # as in classification problem we use f2-score/precision/recall metric/evaluation technique
      # in text summarization we use ROUGE score
      score = metric.compute()
      return score



In [46]:
rouge_names =["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load('rouge')

In [63]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'dialogue', column_summary= 'summary'
)

rouge_dict = dict((rn, score[rn] ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = [f'pegasus'] )

  0%|          | 0/410 [00:02<?, ?it/s]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.026629,0.0,0.026629,0.026629


# Save model

In [50]:
model_pegasus.save_pretrained(DATA_PATH+"pegasus-samsum-model")

# Save tokenizer

In [51]:
tokenizer.save_pretrained(DATA_PATH+"tokenizer")

('/content/drive/MyDrive/Text--Summarizer/data/tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/Text--Summarizer/data/tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/Text--Summarizer/data/tokenizer/spiece.model',
 '/content/drive/MyDrive/Text--Summarizer/data/tokenizer/added_tokens.json',
 '/content/drive/MyDrive/Text--Summarizer/data/tokenizer/tokenizer.json')

# Test

In [52]:
#Load

tokenizer = AutoTokenizer.from_pretrained(DATA_PATH+"tokenizer")

# Prediction

In [56]:
gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}

sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model=DATA_PATH+"pegasus-samsum-model",tokenizer=tokenizer)

##
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)

print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda can't find Betty's number. Larry called Betty last time they were at the park together. Hannah wants Amanda to text Larry. Amanda will text Larry.
