
In [19]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from torch.utils.data import DataLoader

In [8]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [9]:
from rouge import Rouge
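
The `Rouge` class is imported here, but the cells shown in this notebook do not call it. As a rough illustration of what a ROUGE-1 score measures (unigram overlap between a generated summary and a reference), here is a minimal pure-Python sketch. The helper `rouge1` is hypothetical, uses plain whitespace tokenization, and is not the `rouge` package's implementation, so its numbers may differ from the library's.

```python
from collections import Counter

def rouge1(hypothesis: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 (whitespace tokens, lowercased)."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

# Toy sentences chosen for illustration only.
scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Five of the six words overlap in this toy pair, so precision, recall and F1 all come out at 5/6.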

In [10]:
# Load the pre-trained Pegasus model and tokenizer

model_name = 'google/pegasus-xsum'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)




Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [12]:
# Prepare the dataset

class Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __getitem__(self, idx):
    return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  def __len__(self):
    return len(self.encodings.input_ids)

# Example data
train_texts = ["This is the first training text.", "This is the second training text."]
train_labels = ["First summary.", "Second summary."]

# Tokenize the data
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
train_label_encodings = tokenizer(train_labels, truncation=True, padding=True)

# Create an inputs-only dataset (labels are attached later via CustomDataset)
train_dataset = Dataset(train_encodings)
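
Passing `padding=True` to the tokenizer pads every sequence in the call to the length of the longest one and returns an `attention_mask` marking real tokens with 1 and padding with 0. The sketch below mimics that behaviour in plain Python. `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption for illustration rather than a value read from the tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids standing in for two tokenized sentences of different length.
batch = pad_batch([[101, 7, 9, 1], [101, 4, 1]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence gains one trailing pad id, and its mask ends in 0 so the model can ignore that position.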


In [13]:
# Create the data loader

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)


In [14]:
# Define a custom dataset class that pairs the inputs with label token ids

class CustomDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

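  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    labels = torch.tensor(self.labels['input_ids'][idx])
    # Replace padding ids with -100 so the loss ignores padded label positions
    labels[labels == tokenizer.pad_token_id] = -100
    item['labels'] = labels
    return item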

  def __len__(self):
    return len(self.labels['input_ids'])


In [15]:
# Initialize the dataset and data loader

# Create the custom dataset
train_dataset = CustomDataset(train_encodings, train_label_encodings)

# Create the dataloader
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)


In [16]:
# Fine-tune the model

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = optim.AdamW(model.parameters(), lr=5e-5)

num_epochs = 3  # Adjust as needed

for epoch in range(num_epochs):
  model.train()
  for batch in train_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    optimizer.zero_grad()
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()

  print(f"Epoch {epoch+1}/{num_epochs} - Loss: {loss.item()}")


Epoch 1/3 - Loss: 5.38703727722168
Epoch 2/3 - Loss: 5.353452682495117
Epoch 3/3 - Loss: 4.441337585449219
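
The training loop prints the loss of the last batch in each epoch. With two examples and `batch_size=2` there is only one batch per epoch, so that is also the epoch loss here. On a larger dataset, averaging over all batches is usually more informative. A minimal sketch, assuming the per-batch values were collected in a list (for example by appending `loss.item()` inside the loop); `mean_epoch_loss` is a hypothetical helper and the numbers are made up, not taken from the run above.

```python
def mean_epoch_loss(batch_losses):
    """Return the average of the per-batch losses collected during one epoch."""
    if not batch_losses:
        raise ValueError("no batches were processed in this epoch")
    return sum(batch_losses) / len(batch_losses)

# Made-up per-batch losses for a single epoch, for illustration only.
example_losses = [5.0, 4.0, 3.0]
epoch_loss = mean_epoch_loss(example_losses)
print(f"Mean loss: {epoch_loss:.4f}")
```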


In [26]:
# Summarize user-supplied text with the fine-tuned model

# Switch to evaluation mode (disables dropout)
model.eval()

# Get user input
user_input = input("Enter text to summarize: ")

# Tokenize the input
inputs = tokenizer(user_input, return_tensors="pt", truncation=True, padding=True).to(device)

# Generate summary
summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=100)  # Adjust max_length as needed
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Print the summary
print("Generated Summary:", summary)


Enter text to summarize: Creating an NLP model for summarization involves several intricate steps and considerations. First, you need to decide on your approach - extractive summarization selects important sentences from the original text, while abstractive summarization generates new text that captures the essence of the original. Hybrid approaches combine both methods. Next, data collection and preprocessing are crucial; you'll need a large dataset of document-summary pairs, which you'll clean and tokenize. The choice of model architecture is pivotal - for extractive summarization, you might use BERT or RoBERTa, while abstractive summarization often employs transformer-based models like BART, T5, or PEGASUS. Implementation typically involves using deep learning libraries such as PyTorch or TensorFlow, or higher-level libraries like Hugging Face Transformers. Training the model requires splitting your data into training and validation sets, and often involves fine-tuning a pre-trained