# PAPER SUMMARIZATION
Welcome to this notebook on summarizing academic papers using the arXiv Summarization Dataset and the Hugging Face Transformers library.

Summarizing lengthy academic papers is a time-consuming task that requires a considerable amount of effort and expertise. However, with the help of natural language processing (NLP) techniques and machine learning algorithms, it is possible to automate this process and generate informative and concise summaries.

In this notebook, we will use the Bidirectional and Auto-Regressive Transformers (BART) model provided by the Hugging Face Transformers library to generate summaries for academic papers in the arXiv Summarization Dataset. BART is a transformer-based sequence-to-sequence model that performs well on a range of NLP tasks, including text summarization.

By the end of this notebook, you will have a better understanding of how to use BART to generate summaries for academic papers, and how to evaluate the quality of the generated summaries. Let's get started!

## Installing required packages

In [None]:
!pip install transformers datasets evaluate rouge_score

## Collecting Data

In [None]:
from datasets import load_dataset

# Loading the arXiv Summarization Dataset from Hugging Face Datasets
arxiv_dataset = load_dataset('ccdv/arxiv-summarization')

# Loading the "ca_test" split of the BillSum dataset from Hugging Face Datasets
billsum = load_dataset("billsum", split="ca_test")

## Importing required packages

In [8]:
# Importing required libraries
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers' own AdamW is deprecated; use the PyTorch optimizer
from transformers import (
    BartTokenizer,
    DataCollatorForSeq2Seq,
    BartForConditionalGeneration,
    get_linear_schedule_with_warmup,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    pipeline
)
import evaluate

## Initializing model & parameters

In [9]:
# Initializing the BART tokenizer with the checkpoint name
checkpoint = 'facebook/bart-large'
tokenizer = BartTokenizer.from_pretrained(checkpoint)

# Initializing the BART model with the checkpoint name
model = BartForConditionalGeneration.from_pretrained(checkpoint)


## Data pre-processing

In [15]:
# Define the prefix to add to the input text
# (task prefixes are a T5 convention; BART does not require one)
prefix = "summarize: "

# facebook/bart-large accepts at most 1024 positions, so longer inputs must be truncated
max_input_length = 1024

# Define a preprocessing function to tokenize and prepare the data for the model.
# `text` and `summary` are global column names that must be set before calling `map`.
def preprocess_function(examples):
    # Add the prefix to the input text and tokenize it
    inputs = [prefix + doc for doc in examples[text]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the target summary text and set it as the labels for the model
    labels = tokenizer(text_target=examples[summary], max_length=1024, truncation=True)

    # Add the labels to the model inputs and return the processed examples
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
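
With `batched=True`, `map` passes the preprocessing function a dictionary of lists (one list per column) and expects a dictionary of lists back. The sketch below illustrates that contract and the effect of truncation. It uses a hypothetical whitespace tokenizer (`toy_tokenize`) as a stand-in, not the BART tokenizer, and the small length limits are chosen only for illustration.

```python
# Toy stand-in for a tokenizer: whitespace splitting, NOT the BART tokenizer.
def toy_tokenize(texts, max_length):
    vocab = {}
    batch_ids = []
    for t in texts:
        # Assign each new word the next free integer id
        ids = [vocab.setdefault(w, len(vocab)) for w in t.split()]
        # truncation=True keeps only the first max_length tokens
        batch_ids.append(ids[:max_length])
    return {"input_ids": batch_ids}


def toy_preprocess(examples, text_col, summary_col, max_input_len=6, max_target_len=3):
    # `examples` is a dict of lists, one entry per row in the batch
    model_inputs = toy_tokenize(examples[text_col], max_input_len)
    labels = toy_tokenize(examples[summary_col], max_target_len)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


batch = {
    "article": ["a b c d e f g h", "x y"],
    "abstract": ["a b c d", "x"],
}
out = toy_preprocess(batch, "article", "abstract")
input_lengths = [len(ids) for ids in out["input_ids"]]
label_lengths = [len(ids) for ids in out["labels"]]
print(input_lengths, label_lengths)
```

The real tokenizer additionally adds special tokens and returns an `attention_mask`, which this sketch leaves out.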


In [16]:
# Define a data collator that pads each batch dynamically for model training
# (pass the model object, not the checkpoint name)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
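
To make the collator's role concrete, here is a minimal NumPy sketch of dynamic padding. It is an illustration under stated assumptions, not the library's implementation: `pad_batch` is a hypothetical helper, the pad id of 1 is assumed to match BART's pad token, and the token ids are made up. Inputs are padded with the pad id, while labels are padded with -100 so that the loss ignores those positions. The last lines show the reverse step that an evaluation function needs before decoding labels.

```python
import numpy as np

PAD_ID = 1           # assumed pad token id (BART uses 1)
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def pad_batch(features):
    """Pad every sequence to the longest one in this batch (hypothetical helper)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    input_ids, attention_mask, labels = [], [], []
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        input_ids.append(f["input_ids"] + [PAD_ID] * (max_in - n_in))
        attention_mask.append([1] * n_in + [0] * (max_in - n_in))
        labels.append(f["labels"] + [IGNORE_INDEX] * (max_lab - n_lab))
    return {
        "input_ids": np.array(input_ids),
        "attention_mask": np.array(attention_mask),
        "labels": np.array(labels),
    }


# Made-up token ids for two examples of different lengths
features = [
    {"input_ids": [0, 10, 11, 2], "labels": [0, 20, 2]},
    {"input_ids": [0, 12, 2], "labels": [0, 21, 22, 23, 2]},
]
padded = pad_batch(features)

# Before decoding, -100 must be swapped back to a real token id
restored = np.where(padded["labels"] != IGNORE_INDEX, padded["labels"], PAD_ID)
print(padded["labels"])
print(restored)
```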


In [None]:
# Define the names of the input and output columns for the arxiv dataset
text = "article"
summary = "abstract"

# Use the preprocess function to tokenize and prepare the arxiv dataset
tokenized_arxiv = arxiv_dataset.map(preprocess_function, batched=True)

In [19]:
# Define the names of the input and output columns for the billsum dataset
text = "text"
summary = "summary"

# Split the billsum dataset into training and testing sets
billsum = billsum.train_test_split(test_size=0.2)

# Use the preprocess function to tokenize and prepare the billsum dataset
tokenized_billsum = billsum.map(preprocess_function, batched=True)


## Evaluation metrics

In [20]:
# Load the ROUGE metric for evaluation
rouge = evaluate.load("rouge")

# Define a function to compute the evaluation metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    
    # Decode the predicted and label sequences
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute the ROUGE scores between the predictions and the labels
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Compute the length of the predicted sequences
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Add the mean length of the predicted sequences to the evaluation result
    result["gen_len"] = np.mean(prediction_lens)

    # Round the evaluation results to 4 decimal places
    return {k: round(v, 4) for k, v in result.items()}
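
For intuition about what the ROUGE scores measure, the following sketch computes a simplified ROUGE-1 F1 score by hand. It is a toy version under stated assumptions: `rouge1_f1` is a hypothetical helper that uses lowercase whitespace tokens and no stemming, so its numbers will not exactly match those from `evaluate.load("rouge")`, which also reports ROUGE-2 and ROUGE-L variants.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted words that are in the reference
    recall = overlap / len(ref_tokens)      # share of reference words that were predicted
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 5 of 6 unigrams overlap, so precision = recall = F1 = 5/6
```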



## Model Training

In [23]:
# Setting up the training arguments for the Seq2SeqTrainer
training_args = Seq2SeqTrainingArguments(
    output_dir="paper_Summarization_model",  # Directory to save the model checkpoints
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch (named `eval_strategy` in newer transformers releases)
    learning_rate=2e-5,  # Learning rate for the optimizer
    per_device_train_batch_size=2,  # Batch size for training
    per_device_eval_batch_size=2,  # Batch size for evaluation
    weight_decay=0.01,  # Weight decay parameter for the optimizer
    save_total_limit=3,  # Maximum number of checkpoints to keep
    num_train_epochs=4,  # Total number of training epochs
    predict_with_generate=True,  # Generate summary at the time of prediction
    fp16=True,  # Use mixed-precision training to save memory and speed up training
    )
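
The batch size and epoch count above determine how many optimizer steps a run takes, which in turn determines which checkpoints are written. The sketch below works through that arithmetic for a 989-example training split (80% of the 1,237-example BillSum `ca_test` split). It assumes a single device, no gradient accumulation, and the Trainer's default of saving every 500 steps; `total_optimizer_steps` is a hypothetical helper, not a library function.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Estimate optimizer steps, assuming no gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    # The last batch of each epoch may be smaller, hence the ceiling
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


total_steps = total_optimizer_steps(num_examples=989, per_device_batch_size=2, num_epochs=4)

# Assumed default: one checkpoint every 500 steps
save_steps = 500
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
print(total_steps, checkpoint_steps)
```

Under these assumptions the run takes 1,980 steps and writes checkpoints at steps 500, 1000, and 1500, all of which fit within `save_total_limit=3`.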


In [None]:
# Initializing the Seq2Seq Trainer with the specified parameters
trainer = Seq2SeqTrainer(
    model=model, 
    args=training_args, 
    train_dataset=tokenized_arxiv["train"], # Training dataset
    eval_dataset=tokenized_arxiv["test"],  # Evaluation dataset
    tokenizer=tokenizer, 
    data_collator=data_collator, # Data collator for dynamic batch padding
    compute_metrics=compute_metrics # Metrics to evaluate the model
)

# Training the Seq2Seq model
trainer.train()

## Model Prediction

In [None]:
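# Loading the trained model checkpoint, initializing the tokenizer and model
# Note: this path does not match the `output_dir` used in the training arguments above;
# point it at a checkpoint directory produced by your own training run
checkpoint = '/content/my_awesome_billsum_model/checkpoint-1500'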
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)


In [None]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."


In [None]:
# Creating a pipeline for text summarization using the BART model
summarizer = pipeline("summarization", model=checkpoint)

# Generating a summary for the given input text
summary = summarizer(text)
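
The summarization pipeline returns a list with one dictionary per input, and the generated text sits under the `"summary_text"` key. The sketch below shows one way to pull the strings out. `extract_summaries` is a hypothetical helper, and `sample_output` is a hand-written placeholder with the same shape, not real model output.

```python
def extract_summaries(pipeline_output):
    """Return the generated text from each result dictionary (hypothetical helper)."""
    return [item["summary_text"] for item in pipeline_output]


# Hand-written placeholder with the same shape as the pipeline's return value
sample_output = [{"summary_text": "A placeholder summary."}]
summaries = extract_summaries(sample_output)
print(summaries[0])
```

In the notebook itself, `extract_summaries(summary)` would be applied to the `summary` variable from the cell above.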


In [None]:
# Tokenizing the input text and generating the summary using the BART model
inputs = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
decoded_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)


## Conclusion
In this notebook, we have shown how to fine-tune the BART model for text summarization using the Hugging Face Transformers library, with the arXiv and BillSum datasets as examples. We covered the key components of the training process: data preprocessing, model initialization, and training with the Seq2SeqTrainer. We also showed how to evaluate the trained model with the ROUGE metric and how to generate summaries from a trained checkpoint. Together, these steps form a complete workflow for fine-tuning BART on text summarization tasks.

**Disclaimer:** The arXiv dataset cannot be trained on with the free version of Colab, so I have not trained the model in this notebook. With sufficient compute resources, the model can be trained on the full dataset.