### DEEP LEARNING

# **TEXT SUMMARIZATION MODEL**

**IMPORTS**

In [1]:
!pip install transformers rouge_score
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from transformers import PegasusTokenizer, PegasusForConditionalGeneration, get_linear_schedule_with_warmup
import ast
import re
import torch.optim as optim
from tqdm import tqdm
import copy
from rouge_score import rouge_scorer

torch.manual_seed(42)
np.random.seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cpu


**DATA LOADING AND PREPROCESSING**

In [2]:
train_df = pd.read_csv('../data/train.csv', header=None)
test_df = pd.read_csv('../data/test.csv', header=None)
val_df = pd.read_csv('../data/validation.csv', header=None)

columns = ['id', 'article', 'summary']
train_df.columns = columns
test_df.columns = columns
val_df.columns = columns

def preprocess_text(text):
    text = text.replace('\n', ' ')
    text = ' '.join(text.split())
    return text

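def clean_article_heading(article):
    # Strip the byline header of the form
    # "By . <author> . PUBLISHED: . <time> EST, <date> . | . UPDATED: . <time> EST, <date> ."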
    pattern = r'By\s*\.\s*.*?\s*\.\s*PUBLISHED:\s*\.\s*\d+:\d+\s*EST,\s*\d+\s*[A-Za-z]+\s*\d+\s*\.\s*\|\s*\.\s*UPDATED:\s*\.\s*\d+:\d+\s*EST,\s*\d+\s*[A-Za-z]+\s*\d+\s*\.'
    cleaned_text = re.sub(pattern, '', article)
    return cleaned_text.strip()

# Apply the same cleaning steps to every split
for df in (train_df, test_df, val_df):
    df['article'] = df['article'].apply(preprocess_text).apply(clean_article_heading)
    df['summary'] = df['summary'].apply(preprocess_text)



print(f"Training dataframe shape: {train_df.shape}")
print(f"Test dataframe shape: {test_df.shape}")
print(f"Validation dataframe shape: {val_df.shape}")

Training dataframe shape: (287114, 3)
Test dataframe shape: (11491, 3)
Validation dataframe shape: (13369, 3)
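
To make the effect of `clean_article_heading` concrete, here is a minimal sketch that applies the same regular expression to a made-up article. The author name, dates and article text are hypothetical and only illustrate the byline format the pattern expects.

```python
import re

# Same pattern as clean_article_heading in the cell above
BYLINE_PATTERN = r'By\s*\.\s*.*?\s*\.\s*PUBLISHED:\s*\.\s*\d+:\d+\s*EST,\s*\d+\s*[A-Za-z]+\s*\d+\s*\.\s*\|\s*\.\s*UPDATED:\s*\.\s*\d+:\d+\s*EST,\s*\d+\s*[A-Za-z]+\s*\d+\s*\.'

def clean_article_heading(article):
    # Remove the byline header if present, then trim surrounding whitespace
    return re.sub(BYLINE_PATTERN, '', article).strip()

# Hypothetical example text (not taken from the dataset)
sample = ("By . Jane Doe . PUBLISHED: . 08:12 EST, 3 January 2014 . | . "
          "UPDATED: . 09:30 EST, 3 January 2014 . A man was jailed on Friday .")

print(clean_article_heading(sample))              # A man was jailed on Friday .
print(clean_article_heading("No byline here ."))  # No byline here .
```

Articles that do not start with this byline format are returned unchanged apart from the whitespace trimming.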


In [3]:
# Initialize tokenizer and model
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-cnn_dailymail')
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-cnn_dailymail').to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Dataset subsampling to reduce training time**

In [4]:
train_df = train_df.sample(n=20000, random_state=42)
test_df = test_df.sample(n=2000, random_state=42)
val_df = val_df.sample(n=2000, random_state=42)

**CUSTOM DATASET**

In [5]:
class NewsDataset(Dataset):
    def __init__(self, articles, summaries, tokenizer, max_length=512):
        self.articles = articles
        self.summaries = summaries
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.articles)
    
    def __getitem__(self, idx):
        article = str(self.articles[idx])
        summary = str(self.summaries[idx])
        
        article_encoding = self.tokenizer(
            article,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        summary_encoding = self.tokenizer(
            summary,
            max_length=128,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'article_input_ids': article_encoding['input_ids'].flatten(),
            'article_attention_mask': article_encoding['attention_mask'].flatten(),
            'summary_input_ids': summary_encoding['input_ids'].flatten(),
            'summary_attention_mask': summary_encoding['attention_mask'].flatten()
        }


In [6]:
# Create datasets
train_dataset = NewsDataset(
    train_df['article'].values,
    train_df['summary'].values,
    tokenizer
)

val_dataset = NewsDataset(
    val_df['article'].values,
    val_df['summary'].values,
    tokenizer
)

test_dataset = NewsDataset(
    test_df['article'].values,
    test_df['summary'].values,
    tokenizer
)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Testing samples: {len(test_dataset)}")

Training samples: 20000
Validation samples: 2000
Testing samples: 2000
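
The dataset above pads and truncates every sequence to a fixed length. The following is a rough pure-Python sketch of what `padding='max_length'` and `truncation=True` do to a single sequence. The token ids are made up, and `PAD_ID = 0` / `EOS_ID = 1` are assumptions that should be checked against `tokenizer.pad_token_id` and `tokenizer.eos_token_id`. The second helper shows an optional step that this notebook does not perform: replacing padded label positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the training loss.

```python
PAD_ID = 0            # assumed pad token id, verify with tokenizer.pad_token_id
EOS_ID = 1            # assumed end-of-sequence id, verify with tokenizer.eos_token_id
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def pad_or_truncate(token_ids, max_length):
    """Approximate padding='max_length' + truncation=True for one sequence."""
    # Keep room for the end-of-sequence token, then pad up to max_length
    ids = token_ids[:max_length - 1] + [EOS_ID]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

def mask_padding_in_labels(label_ids, attention_mask):
    """Replace padded positions with IGNORE_INDEX so the loss skips them."""
    return [tok if m == 1 else IGNORE_INDEX
            for tok, m in zip(label_ids, attention_mask)]

# Hypothetical token ids for a three-token summary
ids, mask = pad_or_truncate([57, 912, 33], max_length=6)
labels = mask_padding_in_labels(ids, mask)
print(ids)     # [57, 912, 33, 1, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0]
print(labels)  # [57, 912, 33, 1, -100, -100]
```

If this masking were adopted, the unmasked ids would still be needed for decoding the reference summaries during evaluation.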


In [7]:
class PegasusForSummarization:
    def __init__(self, model_name="google/pegasus-cnn_dailymail", device='cuda'):
        self.tokenizer = PegasusTokenizer.from_pretrained(model_name)
        self.model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
        self.device = device
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        
    def train(self, train_loader, val_loader, epochs=3, learning_rate=5e-5, warmup_steps=500, weight_decay=0.01):
        # Set up optimizer
        optimizer = optim.AdamW(self.model.parameters(), lr=learning_rate, weight_decay=weight_decay)
        
        # Set up scheduler
        total_steps = len(train_loader) * epochs
        scheduler = get_linear_schedule_with_warmup(
            optimizer, 
            num_warmup_steps=warmup_steps, 
            num_training_steps=total_steps
        )
        
        best_val_loss = float('inf')
        best_model = None
        
        for epoch in range(epochs):
            # Training
            self.model.train()
            train_loss = 0
            train_progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs} [Training]')
            
            for batch in train_progress_bar:
                # Move batch to device
                input_ids = batch['article_input_ids'].to(self.device)
                attention_mask = batch['article_attention_mask'].to(self.device)
                labels = batch['summary_input_ids'].to(self.device)
                decoder_attention_mask = batch['summary_attention_mask'].to(self.device)
                
                # Clear gradients
                optimizer.zero_grad()
                
                # Forward pass
                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels,
                    decoder_attention_mask=decoder_attention_mask
                )
                
                loss = outputs.loss
                train_loss += loss.item()
                
                # Backward pass
                loss.backward()
                
                # Clip gradients
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
                
                # Update parameters
                optimizer.step()
                scheduler.step()
                
                # Update progress bar
                train_progress_bar.set_postfix({'loss': loss.item()})
            
            avg_train_loss = train_loss / len(train_loader)
            
            # Validation
            val_loss, rouge_scores = self.evaluate(val_loader)
            
            print(f"Epoch {epoch+1}/{epochs}")
            print(f"  Train Loss: {avg_train_loss:.4f}")
            print(f"  Val Loss: {val_loss:.4f}")
            print(f"  Rouge1: {rouge_scores['rouge1']:.4f}")
            print(f"  Rouge2: {rouge_scores['rouge2']:.4f}")
            print(f"  RougeL: {rouge_scores['rougeL']:.4f}")
            
            # Save best model
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_model = copy.deepcopy(self.model.state_dict())
                print(f"  New best model saved with validation loss: {val_loss:.4f}")
        
        # Load best model
        if best_model is not None:
            self.model.load_state_dict(best_model)
            print(f"Loaded best model with validation loss: {best_val_loss:.4f}")
    
    def evaluate(self, data_loader, max_length=128, num_beams=4):
        self.model.eval()
        val_loss = 0
        all_preds = []
        all_targets = []
        
        with torch.no_grad():
            for batch in tqdm(data_loader, desc="Evaluating"):
                # Move batch to device
                input_ids = batch['article_input_ids'].to(self.device)
                attention_mask = batch['article_attention_mask'].to(self.device)
                labels = batch['summary_input_ids'].to(self.device)
                decoder_attention_mask = batch['summary_attention_mask'].to(self.device)
                
                # Forward pass
                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels,
                    decoder_attention_mask=decoder_attention_mask
                )
                
                val_loss += outputs.loss.item()
                
                # Generate summaries
                generated_ids = self.model.generate(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    max_length=max_length,
                    num_beams=num_beams,
                    repetition_penalty=2.5,
                    length_penalty=1.0,
                    early_stopping=True
                )
                
                # Decode generated summaries and reference summaries
                preds = [self.tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
                targets = [self.tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in labels]
                
                all_preds.extend(preds)
                all_targets.extend(targets)
        
        # Calculate average validation loss
        avg_val_loss = val_loss / len(data_loader)
        
        # Calculate ROUGE scores
        rouge_scores = {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}
        for pred, target in zip(all_preds, all_targets):
            scores = self.rouge_scorer.score(target, pred)
            rouge_scores['rouge1'] += scores['rouge1'].fmeasure
            rouge_scores['rouge2'] += scores['rouge2'].fmeasure
            rouge_scores['rougeL'] += scores['rougeL'].fmeasure
        
        # Calculate average ROUGE scores
        for key in rouge_scores:
            rouge_scores[key] /= len(all_preds)
        
        return avg_val_loss, rouge_scores
    
    def predict(self, article, max_length=128, num_beams=4):
        self.model.eval()
        
        # Preprocess article
        article = preprocess_text(article)
        article = clean_article_heading(article)
        
        # Tokenize article
        inputs = self.tokenizer(
            article,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        ).to(self.device)
        
        # Generate summary
        with torch.no_grad():
            generated_ids = self.model.generate(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=max_length,
                num_beams=num_beams,
                repetition_penalty=2.5,
                length_penalty=1.0,
                early_stopping=True
            )
        
        # Decode summary
        summary = self.tokenizer.decode(generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
        
        return summary
    
    def save_model(self, path):
        torch.save({
            'model_state_dict': self.model.state_dict(),
        }, path)
        self.tokenizer.save_pretrained(path + "_tokenizer")
        print(f"Model saved to {path}")
    
    def load_model(self, path):
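        # map_location lets a checkpoint saved on GPU be loaded on a CPU-only machine
        checkpoint = torch.load(path, map_location=self.device)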
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.tokenizer = PegasusTokenizer.from_pretrained(path + "_tokenizer")
        print(f"Model loaded from {path}")
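
The `train` method above combines AdamW with a linear warmup schedule. As a sketch of what that schedule does to the learning rate, the function below reproduces the usual linear warmup and linear decay multiplier in plain Python. The numbers assume the defaults used in this notebook (20,000 training samples, batch size 16, 3 epochs, 500 warmup steps); the function name is ours and is only illustrative.

```python
import math

def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

steps_per_epoch = math.ceil(20000 / 16)   # batches per epoch with the settings above
total_steps = steps_per_epoch * 3         # 3 epochs
print(steps_per_epoch, total_steps)       # 1250 3750

for step in (0, 250, 500, 2125, 3750):
    print(step, linear_warmup_multiplier(step, 500, total_steps))
```

The effective learning rate at a given step is the base rate (`5e-5` by default) times this multiplier, so it peaks at step 500 and reaches zero at the last step.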

In [8]:
def train_epoch(model, dataloader, optimizer, device, clip=1.0):
    model.train()
    total_loss = 0
    
    for batch in tqdm(dataloader, desc="Training"):
        optimizer.zero_grad()
        
        input_ids = batch['article_input_ids'].to(device)
        attention_mask = batch['article_attention_mask'].to(device)
        labels = batch['summary_input_ids'].to(device)
        decoder_attention_mask = batch['summary_attention_mask'].to(device)
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
            decoder_attention_mask=decoder_attention_mask
        )
        
        loss = outputs.loss
        total_loss += loss.item()
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
    
    return total_loss / len(dataloader)

def evaluate(model, dataloader, device):
    model.eval()
    total_loss = 0
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch['article_input_ids'].to(device)
            attention_mask = batch['article_attention_mask'].to(device)
            labels = batch['summary_input_ids'].to(device)
            decoder_attention_mask = batch['summary_attention_mask'].to(device)
            
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels,
                decoder_attention_mask=decoder_attention_mask
            )
            
            loss = outputs.loss
            total_loss += loss.item()
            
            # Generate summaries
            generated_ids = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=128,
                num_beams=4,
                repetition_penalty=2.5,
                length_penalty=1.0,
                early_stopping=True
            )
            
            decoded_preds = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
            decoded_labels = [tokenizer.decode(t, skip_special_tokens=True) for t in labels]
            
            # Calculate ROUGE scores
            for pred, label in zip(decoded_preds, decoded_labels):
                scores = scorer.score(label, pred)
                for metric in rouge_scores:
                    rouge_scores[metric].append(scores[metric].fmeasure)
    
    avg_loss = total_loss / len(dataloader)
    avg_rouge_scores = {k: sum(v)/len(v) for k, v in rouge_scores.items()}
    
    return avg_loss, avg_rouge_scores
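
The evaluation functions average the ROUGE F-measure over all examples. To show what that number measures, here is a simplified ROUGE-1 F-measure written by hand. It is only a sketch: it splits on whitespace and does no stemming, so its values will not match `rouge_score` exactly. The two sentences are made up.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    overlap = sum((ref & pred).values())   # shared unigrams, clipped by count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference and generated summaries
reference = "the son was jailed for theft"
prediction = "the son was jailed"
print(round(rouge1_f(reference, prediction), 4))  # 0.8
```

Here all 4 predicted words appear in the reference (precision 1.0) but only 4 of the 6 reference words are covered (recall 0.667), which gives an F-measure of 0.8.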

In [9]:
# Training parameters
'''NUM_EPOCHS = 3
LEARNING_RATE = 3e-5
EARLY_STOPPING_PATIENCE = 2

# Initialize optimizer
optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE)

# Training loop
best_val_loss = float('inf')
early_stop_counter = 0
training_stats = []

print("Starting training...")

for epoch in range(NUM_EPOCHS):
    print(f"\nEpoch {epoch+1}/{NUM_EPOCHS}")
    
    train_loss = train_epoch(model, train_loader, optimizer, device)
    val_loss, rouge_scores = evaluate(model, val_loader, device)
    
    print(f"Train Loss: {train_loss:.4f}")
    print(f"Val Loss: {val_loss:.4f}")
    print(f"Rouge1: {rouge_scores['rouge1']:.4f}")
    print(f"Rouge2: {rouge_scores['rouge2']:.4f}")
    print(f"RougeL: {rouge_scores['rougeL']:.4f}")
    
    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        early_stop_counter = 0
        torch.save(model.state_dict(), 'best_pegasus_model.pth')
        print(f"Saved new best model with validation loss: {val_loss:.4f}")
    else:
        early_stop_counter += 1
        print(f"Early stopping counter: {early_stop_counter}/{EARLY_STOPPING_PATIENCE}")
        
    if early_stop_counter >= EARLY_STOPPING_PATIENCE:
        print("Early stopping triggered!")
        break

print("Training completed!")'''
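
The early-stopping rule in the cell above can be checked in isolation. The sketch below replays the same counter logic on a list of validation losses. The loss values are hypothetical and are not results from our training run.

```python
def epochs_until_stop(val_losses, patience=2):
    """Return (last epoch run, best validation loss) under the patience rule."""
    best = float("inf")
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, counter = loss, 0   # improvement: reset the counter
        else:
            counter += 1              # no improvement this epoch
        if counter >= patience:
            return epoch, best        # stop early
    return len(val_losses), best

# Hypothetical validation losses
print(epochs_until_stop([2.10, 1.95, 1.97, 1.99, 1.90]))  # (4, 1.95)
```

Note that with `NUM_EPOCHS = 3` and a patience of 2, the counter can only reach 2 on the third epoch, so in this configuration the rule never shortens training.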


In [10]:
def generate_summary(model, article, tokenizer, device, max_length=128):
    model.eval()
    
    # Preprocess
    article = preprocess_text(article)
    article = clean_article_heading(article)
    
    # Tokenize
    inputs = tokenizer(
        article,
        max_length=512,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    ).to(device)
    
    # Generate summary
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=max_length,
            num_beams=4,
            repetition_penalty=2.5,
            length_penalty=1.0,
            early_stopping=True
        )
    
    summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return summary

We ran into complications with the executed version of the Pegasus notebook. Because of a faulty commit, the notebook containing the training run was not pushed to git; only an older version was, so the per-epoch results are not available.

All the models presented here were run on the university's open labs, where we had access to a GPU. As a result, we had no way to recover those results, nor the time to rerun the training.

The committed Pegasus notebook shows an error instead of the results. We hit that error on an early attempt and fixed it by reducing the batch size from 16 to 8, since a batch of 16 did not fit in memory.

We did, however, save the best model once training finished (it is in the Google folder), so we could load it and generate an example summary. Running a full evaluation was not feasible given the time it requires.
The summary generated with the Pegasus model is shown below:


In [None]:
# Generate example summary
sample_article = test_df['article'].iloc[0]
generated_summary = generate_summary(model, sample_article, tokenizer, device)

print("\nSample Summary Generation:")
print("Original article excerpt:", sample_article[:200], "...")
print("\nGenerated Summary:", generated_summary)
print("\nActual Summary:", test_df['summary'].iloc[0])


Sample Summary Generation:
Original article excerpt: David Rylance has been jailed for stealing more than £50,000 from his dying mother, who was suffering from Alzheimer's . A trusted son has been jailed for stealing more than £50,000 from his dying mot ...

Generated Summary: David Rylance, 47, stole more than £50,000 from his dying mother Margaret . She was suffering from Alzheimer's and feared her money was going missing . But in reality the pensioner's son had been slowly siphoning it away . He spent it on gambling, a holiday and cinema tickets as well as living costs . Rylance admitted theft and fraud and was jailed for two years and three months .

Actual Summary: David Rylance, 47, stole thousands of pounds from his own dying mother . Dying pensioner Margaret Rylance was suffering from Alzheimer's disease . She noticed money was missing but concerns were put down to condition . Her son was jailed for two years and three months for stealing £52,000 .
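
The sample generation above relies on the fine-tuned weights having been loaded into `model` first, a step this version of the notebook does not show. A minimal sketch of it, assuming the checkpoint was written with `torch.save(model.state_dict(), 'best_pegasus_model.pth')` as in the training cell, could look like this. The helper name `load_best_weights` is ours.

```python
import torch

def load_best_weights(model, checkpoint_path, device):
    """Load a saved state_dict into model and switch it to evaluation mode."""
    # map_location allows a checkpoint saved on GPU to be loaded on CPU
    state_dict = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()
    return model

# Usage, assuming the checkpoint file is available locally:
# model = load_best_weights(model, 'best_pegasus_model.pth', device)
```

The model passed in must have the same architecture as the one that was saved, here the `google/pegasus-cnn_dailymail` checkpoint.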
