# **Problem Statement 2:**  Research Article Summarization Using Advanced NLP Techniques


# Installing Required Libraries  
We start by installing the necessary libraries for evaluating summarization models.  
- **`evaluate`**: Provides metrics like ROUGE, BLEU, and BERTScore.  
- **`rouge_score`, `sacrebleu`, `bert_score`**: Dependencies required for specific evaluation metrics.  

In [None]:
# Install the required evaluation library
!pip install evaluate
# Install required dependencies for evaluation metrics
!pip install rouge_score sacrebleu bert_score

# Importing Libraries and Initial Setup  
This cell imports key libraries:  
- **Transformers**: For model, tokenizer, and training functionality.  
- **Datasets**: For loading and preprocessing the dataset.  
- **Torch**: For utilizing GPU and managing tensors.  
- **Evaluate**: For evaluation metrics.

In [2]:
from transformers import PegasusConfig, PegasusTokenizer, PegasusForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset
import evaluate
import torch
import os

# Model and Dataset Preparation  
1. **Disabling W&B logging**: Streamlining training logs by disabling unnecessary outputs.  
2. **Loading Pretrained Pegasus Model**: Using `google/pegasus-large` for summarization tasks.  
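3. **Dataset Loading**: Loading the paper dataset from `llm_data.csv` and limiting training to the first 300 rows for faster prototyping.  
4. **Preprocessing the Dataset**: Concatenating the paper fields (title, keywords, abstract, conclusion, document, paper type, topic, OCR) into one input string and tokenizing it together with the target (`Summary`).  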
5. **Trainer Setup**: Configuring training parameters including batch size, learning rate, and epoch count.  
6. **Training and Saving**: Training the model and saving it for future use.  

***Note***: The batch size and epoch settings are intentionally kept small for testing and stability. Adjust as needed for larger datasets.  

# 🔧 Model Setup and Dataset Preprocessing  
This cell covers:  
1. Loading the Pegasus model and tokenizer.  
2. Preparing the dataset by tokenizing the inputs (the concatenated paper fields) and the targets (`Summary`).

In [3]:
import os
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration
from datasets import load_dataset

# Disable WANDB logs
os.environ["WANDB_DISABLED"] = "true"

# Load model and tokenizer
model_name = "google/pegasus-large"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [12]:
# Load CSV dataset (only the first 300 rows are used for training)
dataset = load_dataset("csv", data_files="llm_data.csv", split="train").select(range(300))

# Ensure column names are correctly referenced
def preprocess_function(examples):
    input_texts = []
    summaries = []

    for title, keywords, abstract, conclusion, document, paper_type, topic, ocr, summary in zip(
        examples["Paper Title"], examples["Key Words"], examples["Abstract"],
        examples["Conclusion"], examples["Document"], examples["Paper Type"],
        examples["Topic"], examples["OCR"], examples["Summary"]
    ):
        input_text = f"Title: {title}\nKeywords: {keywords}\nAbstract: {abstract}\nConclusion: {conclusion}\nDocument: {document}\nPaper Type: {paper_type}\nTopic: {topic}\nOCR: {ocr}"
        input_texts.append(input_text)
        summaries.append(summary)

    # Tokenize and ensure consistent length
    inputs = tokenizer(input_texts, max_length=512, truncation=True, padding="max_length")
    targets = tokenizer(summaries, max_length=128, truncation=True, padding="max_length")

    return {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": targets["input_ids"]
    }

# Apply preprocessing with batch processing enabled
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)

# Check output to verify lengths
print(tokenized_dataset[:3])



{'labels': [[182, 974, 3702, 114, 11624, 4859, 113, 909, 1355, 121, 13049, 121, 936, 1581, 118, 1546, 121, 61930, 5906, 6520, 3884, 143, 11618, 283, 250, 168, 19390, 114, 177, 47917, 112, 33076, 952, 354, 2175, 111, 592, 142, 4859, 113, 4129, 2489, 108, 4051, 8591, 108, 111, 24089, 107, 139, 974, 163, 8846, 428, 743, 111, 533, 473, 4578, 115, 109, 764, 107, 139, 2629, 3921, 112, 319, 114, 2250, 1301, 113, 64619, 2722, 111, 4135, 7913, 8628, 277, 22529, 523, 124, 866, 633, 118, 701, 692, 107, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [139, 974, 8846, 109, 2227, 113, 1352, 5906, 6520, 3884, 640, 112, 109, 7999, 113, 335, 293, 380, 107, 11127, 4166, 108, 330, 7093, 5551, 111, 73840, 1625, 108, 130, 210, 130, 7500, 121, 936, 1739, 108, 127, 1848, 107, 139, 800, 3972, 124, 7314, 111, 27055, 121, 936, 4166, 111, 8846, 109, 24089, 263, 118, 1776, 219, 1625, 107, 1041, 109, 4426, 113, 109, 27533, 3943, 303, 219, 162
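The labels printed above are padded with token id 0, which is the Pegasus padding id, so the loss is also computed on padding positions unless they are masked. A common convention with Hugging Face seq2seq models is to replace padding in the labels with -100, which the cross-entropy loss ignores. The sketch below shows that replacement on plain Python lists. `mask_label_padding` is a hypothetical helper that is not part of the original notebook, and the pad id of 0 is an assumption read off the output above.

```python
def mask_label_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids in each label sequence with ignore_index.

    pad_token_id=0 is an assumption taken from the padded labels printed above.
    """
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Two short, made-up label sequences padded with 0 (1 is the end-of-sequence id)
labels = [[182, 974, 107, 1, 0, 0], [139, 974, 1, 0, 0, 0]]
masked = mask_label_padding(labels)
print(masked)
# [[182, 974, 107, 1, -100, -100], [139, 974, 1, -100, -100, -100]]
```

If labels are masked this way, they have to be mapped back to the pad id before they are passed to `tokenizer.batch_decode`.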

# 🏋️‍♂️ Training and Saving the Model  
1. **Training Arguments**: Configuring `Seq2SeqTrainingArguments` with the evaluation strategy, batch size, learning rate, and epoch count.  
2. **Model Training**: Initializing the trainer and training on the tokenized dataset.  
3. **Model Saving**: Storing the trained model for evaluation and further fine-tuning.  

In [13]:
# Setting training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,  # Small batch size
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,  # Disable mixed precision for stability
    dataloader_num_workers=0,
    report_to="none"
)

# Initializing Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,  # NOTE: same data as training, so validation loss is not a held-out estimate
    tokenizer=tokenizer
)

# Training the model
trainer.train()

# Saving the model after training
model.save_pretrained("/content/drive/MyDrive/saved_model")

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 1.806602 |
| 2 | 2.129500 | 1.715372 |
| 3 | 2.129500 | 1.688462 |
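The "No log" entry for epoch 1 follows from the step arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` default of `logging_steps=500`, which the notebook does not override. `training_schedule` is a hypothetical helper used only for illustration.

```python
import math


def training_schedule(num_examples, batch_size, epochs, logging_steps=500):
    """Return (steps per epoch, total steps, epoch containing the first loss log).

    Assumes one device and gradient_accumulation_steps=1.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    if logging_steps > total_steps:
        # Training ends before the first logging step is reached
        return steps_per_epoch, total_steps, None
    first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
    return steps_per_epoch, total_steps, first_logged_epoch


# 300 training rows, batch size 1, 3 epochs, as in the training cell
schedule = training_schedule(num_examples=300, batch_size=1, epochs=3)
print(schedule)  # (300, 900, 2)
```

Under these assumptions the first training loss is logged at step 500, during epoch 2, and the next log would fall at step 1000, beyond the 900 total steps. That is consistent with the same training loss being shown for epochs 2 and 3.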


# Evaluation Metrics Setup  
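This cell initializes three evaluation metrics:  
- **ROUGE**: Measures n-gram overlap (and, for ROUGE-L, the longest common subsequence) between generated and reference summaries.  
- **BLEU**: Measures n-gram precision of the generated text against the reference, with a brevity penalty. It was originally designed for machine translation.  
- **BERTScore**: Compares generated and reference summaries by matching their contextual token embeddings.  

The evaluation function computes these metrics on a held-out test split (rows 300 to 359 of `llm_data.csv`), which does not overlap the 300 training rows.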
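To make the n-gram overlap idea concrete, here is a simplified sketch that computes precision, recall, and F1 over clipped n-gram counts. It is an illustration only and is not the `rouge_score` or `sacrebleu` implementation: it tokenizes on whitespace, applies no stemming, and omits BLEU's brevity penalty. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams of a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap_scores(prediction, reference, n=1):
    """Return (precision, recall, f1) over clipped n-gram matches."""
    pred = ngram_counts(prediction, n)
    ref = ngram_counts(reference, n)
    overlap = sum((pred & ref).values())  # each n-gram counts at most min(pred, ref) times
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


prediction = "the model summarizes the paper"
reference = "the model summarizes research papers"
unigram = ngram_overlap_scores(prediction, reference, n=1)
bigram = ngram_overlap_scores(prediction, reference, n=2)
print(unigram, bigram)
```

Here three of the five predicted unigrams and two of the four predicted bigrams appear in the reference. The precision side of this calculation is what BLEU builds on, while ROUGE reports an F-measure that also accounts for recall.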

In [14]:
# Load evaluation metrics
rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")
bert_score = evaluate.load("bertscore")
test_dataset = load_dataset("csv", data_files="llm_data.csv", split="train").select(range(300,360))
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True, remove_columns=test_dataset.column_names)

# Define evaluation function to display scores directly
def evaluate_model(trainer, dataset):
    predictions, labels, _ = trainer.predict(dataset)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Calculate ROUGE scores directly
    rouge_results = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    print("ROUGE:", rouge_results)

    # Calculate BLEU score
    bleu_results = bleu.compute(predictions=decoded_preds, references=[[label] for label in decoded_labels])
    print("BLEU:", bleu_results["score"])

    # Calculate BERTScore
    bert_results = bert_score.compute(predictions=decoded_preds, references=decoded_labels, lang="en")
    bert_avg = {k: sum(v) / len(v) for k, v in bert_results.items() if isinstance(v, list)}
    print("Mean BERTScore:", bert_avg)

# Using the Evaluation Function
evaluate_model(trainer, tokenized_test_dataset)


ROUGE: {'rouge1': np.float64(0.5232413286957243), 'rouge2': np.float64(0.2873038251652391), 'rougeL': np.float64(0.3706057375284625), 'rougeLsum': np.float64(0.37022591967076846)}
BLEU: 20.39393316586624



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mean BERTScore: {'precision': 0.8920402139425277, 'recall': 0.9063488880793253, 'f1': 0.8990071296691895}
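One caveat when decoding: the arrays returned by `trainer.predict` may contain -100, for example when labels were masked for the loss or when the trainer pads sequences while concatenating batches. Such ids cannot be decoded, so a common step is to map them back to the pad id before calling `tokenizer.batch_decode`. The sketch below does this on plain lists. `replace_ignore_index` is a hypothetical helper, and the pad id of 0 is an assumption based on the padded labels shown earlier.

```python
def replace_ignore_index(sequences, pad_token_id=0, ignore_index=-100):
    """Map ignore_index entries back to the pad id so the ids can be decoded."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in seq]
        for seq in sequences
    ]


# Made-up id sequences containing -100 padding
raw = [[182, 974, 1, -100, -100], [139, 1, -100, -100, -100]]
cleaned = replace_ignore_index(raw)
print(cleaned)
# [[182, 974, 1, 0, 0], [139, 1, 0, 0, 0]]
```

With `skip_special_tokens=True`, the restored pad tokens are then dropped from the decoded text.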


- **ROUGE**: ROUGE-1 = 0.5232, ROUGE-2 = 0.2873, ROUGE-L = 0.3706, ROUGE-Lsum = 0.3702  
- **BLEU**: 20.39  
- **Mean BERTScore**: precision = 0.8920, recall = 0.9063, F1 = 0.8990  

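# Testing on a Specific Example  
The cells below format a single paper the same way as the training inputs and then generate a summary with beam search.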


In [15]:
def preprocess_input(title, keywords, abstract, conclusion, document, paper_type, topic, ocr):
    """Formats the input text similar to training preprocessing"""
    input_text = (
        f"Title: {title}\nKeywords: {keywords}\nAbstract: {abstract}\n"
        f"Conclusion: {conclusion}\nDocument: {document}\nPaper Type: {paper_type}\n"
        f"Topic: {topic}\nOCR: {ocr}"
    )

    # Tokenize input
    inputs = tokenizer(input_text, max_length=512, truncation=True, padding="max_length", return_tensors="pt").to(device)
    return inputs
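
Because the input template is written out twice (once in `preprocess_function` and once in `preprocess_input`), the two copies can drift apart. One option, sketched below under the assumption that the field order from the training cell is the intended one, is to build the string from a single shared definition. `FIELD_ORDER` and `format_paper` are hypothetical names that do not appear in the notebook.

```python
# (label shown in the prompt, key in the input dict), in the order used for training
FIELD_ORDER = [
    ("Title", "title"),
    ("Keywords", "keywords"),
    ("Abstract", "abstract"),
    ("Conclusion", "conclusion"),
    ("Document", "document"),
    ("Paper Type", "paper_type"),
    ("Topic", "topic"),
    ("OCR", "ocr"),
]


def format_paper(fields):
    """Join the paper fields into the 'Label: value' template, one per line."""
    return "\n".join(f"{label}: {fields[key]}" for label, key in FIELD_ORDER)


example = {
    "title": "T", "keywords": "K", "abstract": "A", "conclusion": "C",
    "document": "D", "paper_type": "P", "topic": "NLP", "ocr": "O",
}
text = format_paper(example)
print(text)
```

Since inputs are truncated to 512 tokens and truncation normally cuts from the right, the fields at the end of this order (`Document`, `OCR`) are the ones most likely to be dropped for long papers.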


In [16]:
def generate_summary(inputs):
    """Generates a summary using the fine-tuned model"""
    with torch.no_grad():
        output = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=128,
            num_beams=5,  # Beam search for better quality
            early_stopping=True
        )

    # Decode output tokens to text
    summary = tokenizer.decode(output[0], skip_special_tokens=True)
    return summary
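
To illustrate what `num_beams` changes, here is a toy beam search over a hand-written next-token table. It is a minimal sketch, not the `generate` implementation, which also handles end-of-sequence tokens, length penalties, and early stopping. `TOY_MODEL` and its probabilities are invented for the example.

```python
import math

# Hypothetical toy model: next-token probabilities conditioned on the prefix
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, num_beams, length):
    """Keep the num_beams highest-scoring prefixes at every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]


greedy = beam_search(TOY_MODEL, num_beams=1, length=2)
wide = beam_search(TOY_MODEL, num_beams=2, length=2)
print(greedy, wide)  # ('A', 'x') ('B', 'x')
```

With one beam the search commits to `A`, the most likely first token, and ends with a sequence of probability 0.33. With two beams it keeps `B` alive and finds the sequence `B x` with probability 0.36. This is the sense in which beam search can improve on greedy decoding.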


In [17]:
# Sample input data
sample_data = {
    "title": "Abstractive text summarization using LSTM-CNN based deep learning",
    "keywords": "Text mining, Abstractive text summarization, Relation extraction, Natural Language Processing",
    "abstract": (
        "Abstractive Text Summarization (ATS), which is the task of constructing summary "
        "sentences by merging facts from different source sentences and condensing them into a shorter "
        "representation while preserving information content and overall meaning. It is very difficult "
        "and time-consuming for human beings to manually summarize large documents of text. In this "
        "paper, we propose an LSTM-CNN based ATS framework (ATSDL) that can construct new "
        "sentences by exploring more fine-grained fragments than sentences, namely, semantic phrases. "
        "Different from existing abstraction-based approaches, ATSDL is composed of two main stages, "
        "the first of which extracts phrases from source sentences and the second generates text "
        "summaries using deep learning. Experimental results on the datasets CNN and DailyMail "
        "show that our ATSDL framework outperforms the state-of-the-art models in terms of both "
        "semantics and syntactic structure, and achieves competitive results on manual linguistic quality evaluation."
    ),
    "conclusion": (
        "In this paper, we develop a novel LSTM-CNN based ATSDL model that overcomes several "
        "key problems in the field of text summarization. The present extractive text summarization (ETS) models "
        "are concerned with syntactic structure, while present ATS models are concerned with semantics. Our model "
        "draws on the strengths of both summarization models. The new ATSDL model first uses a phrase "
        "extraction method called MOSP to extract key phrases from the original text and then learns "
        "the collocation of phrases. After training, the model will generate a phrase sequence that meets "
        "the requirement of syntactic structure. In addition, we use phrase location information to solve "
        "the rare words problem that almost all ATS models would encounter. Finally, we conduct "
        "extensive experiments on two different datasets, and the result shows that our model outperforms "
        "the state-of-the-art approaches in terms of both semantics and syntactic structure."
    ),
    "document": (
        "Abstractive text summarization using LSTM-CNN based deep learning. "
        "Text mining, Abstractive text summarization, Relation extraction, Natural Language Processing. "
        "This paper proposes an LSTM-CNN based ATS framework (ATSDL) that constructs new sentences "
        "by exploring fine-grained fragments, namely, semantic phrases. Different from existing abstraction-based "
        "approaches, ATSDL has two main stages: phrase extraction and deep learning-based text generation. "
        "Experimental results on CNN and DailyMail datasets show superior performance compared to state-of-the-art models."
    ),
    "paper_type": "Text summarization",
    "topic": "Natural Language Processing",
    "ocr": (
        "Encoder decoder, Word Morphological Co-reference segmentation, Reduction, Session, "
        "Phrase process, Phrase extraction, Text input."
    )
}


# Preprocess input
inputs = preprocess_input(**sample_data)

# Generate summary
summary = generate_summary(inputs)

# Print the output
print("Generated Summary:", summary)

Generated Summary: This paper proposes an LSTM-CNN based ATS framework (ATSDL) that constructs new sentences by exploring fine-grained fragments, namely, semantic phrases. The new ATSDL model uses a phrase extraction method called MOSP to extract key phrases from the original text and then learns the collocation of phrases. After training, the model generates a phrase sequence that meets the requirement of syntactic structure. The model also uses phrase location information to solve the rare words problem. Experimental results on CNN and DailyMail datasets show superior performance compared to state-of-the-art models.


## Model's Generated Summary:
This paper proposes an LSTM-CNN based ATS framework (ATSDL) that constructs new sentences by exploring fine-grained fragments, namely, semantic phrases. The new ATSDL model uses a phrase extraction method called MOSP to extract key phrases from the original text and then learns the collocation of phrases. After training, the model generates a phrase sequence that meets the requirement of syntactic structure. The model also uses phrase location information to solve the rare words problem. Experimental results on CNN and DailyMail datasets show superior performance compared to state-of-the-art models.

---