In [None]:
!pip install datasets transformers evaluate sacrebleu bert_score

In [None]:
!pip install rouge_score

Loading packages and setting seed

In [None]:
import os
import random
import numpy as np
import torch
import matplotlib.pyplot as plt
from datasets import load_dataset, DatasetDict
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq
)
import evaluate
import sacrebleu

In [None]:
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

Loading the Billsum dataset from Hugging Face

In [None]:
dataset = load_dataset("FiscalNote/billsum")
print("Original splits:", dataset)

Print the number of samples for each split

In [None]:
for split in dataset.keys():
    print(f"Split: {split}, Number of samples: {len(dataset[split])}")

Create a validation split if not available by using a 90/10 split from the training set

In [None]:
if "validation" not in dataset.keys():
    train_valid = dataset["train"].train_test_split(test_size=0.1, seed=seed)
    dataset = DatasetDict({
        "train": train_valid["train"],
        "validation": train_valid["test"],
        "test": dataset["test"]
    })
    print("After splitting, splits:", dataset)

Load the tokenizer for facebook/bart-base

In [None]:
model_checkpoint = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_checkpoint)

Define maximum lengths

In [None]:
max_input_length = 1024
max_target_length = 256

Defining a tokenization function for the dataset then applying tokenization and removing columns no longer needed

In [None]:
def tokenize_function(examples):
    # Tokenize the bill text; anything beyond max_input_length is truncated.
    model_inputs = tokenizer(examples["text"], max_length=max_input_length, truncation=True)
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text", "summary", "title"])

Saving the tokenized dataset locally to avoid reprocessing later

In [None]:
tokenized_dataset_path = "./tokenized_billsum"
if not os.path.exists(tokenized_dataset_path):
    tokenized_datasets.save_to_disk(tokenized_dataset_path)
    print(f"Tokenized dataset saved to {tokenized_dataset_path}")

Limit the training set to 1000 samples and validation set to 100 samples due to limited resources

In [None]:
min_train_samples = 1000
min_valid_samples = 100
if len(tokenized_datasets["train"]) > min_train_samples:
    tokenized_datasets["train"] = tokenized_datasets["train"].select(range(min_train_samples))
if len(tokenized_datasets["validation"]) > min_valid_samples:
    tokenized_datasets["validation"] = tokenized_datasets["validation"].select(range(min_valid_samples))

**Preprocessing Methodology**

We started by loading the Billsum dataset with the Hugging Face Datasets package and printed the number of samples in each split to get a feel for the dataset's size and distribution.

Because the dataset does not include a separate validation set, we created one by splitting the training data into 90% for training and 10% for validation.

We then prepared the data for the model by tokenizing both the documents and their summaries with the BartTokenizer, which converts the raw text into token IDs. During tokenization we set a maximum input length of 1024 tokens and a maximum target (summary) length of 256 tokens; longer texts are truncated to these limits. To reduce the computational load on Colab's limited resources, we subsampled the dataset to 1000 training examples and 100 validation examples.
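
To get a feel for how much text the 1024-token limit discards, a small helper like the one below can summarize truncation from a list of token counts. This is a minimal sketch: `truncation_report` is a hypothetical helper, and the token counts are made-up illustrative values, not measurements on Billsum.

```python
def truncation_report(token_lengths, max_length):
    """Return (share of examples truncated, share of all tokens dropped)."""
    if not token_lengths:
        return 0.0, 0.0
    truncated = sum(1 for n in token_lengths if n > max_length)
    dropped = sum(max(0, n - max_length) for n in token_lengths)
    return truncated / len(token_lengths), dropped / sum(token_lengths)


# Hypothetical token counts for four documents (not measured on Billsum).
example_lengths = [300, 900, 1500, 2048]
share_truncated, share_dropped = truncation_report(example_lengths, 1024)
print(f"{share_truncated:.0%} of examples truncated, {share_dropped:.1%} of tokens dropped")
```

In the notebook, the real token counts could be obtained by tokenizing the documents without truncation before calling the helper.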


Custom trainer that overrides prediction_step to use generate() with tuned decoding parameters.

In [None]:
class CustomSeq2SeqTrainer(Seq2SeqTrainer):
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None, **gen_kwargs):
        inputs = self._prepare_inputs(inputs)
        labels = inputs.get("labels")
        with torch.no_grad():
            # Teacher-forced forward pass: the loss is needed whether or not we also generate.
            loss = model(**inputs).loss.detach() if labels is not None else None
            if prediction_loss_only:
                return (loss, None, None)
            # Beam-search decoding with the tuned parameters.
            generated_tokens = model.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_length=max_target_length,
                num_beams=8,
                length_penalty=1.5,
                no_repeat_ngram_size=3,
                min_length=50,
            )
        return (loss, generated_tokens, labels)

Defining evaluation metrics using the evaluate library.

In [None]:
rouge_metric = evaluate.load("rouge")
sacrebleu_metric = evaluate.load("sacrebleu")
bertscore_metric = evaluate.load("bertscore")

Safe decoding function to convert token IDs to strings.

In [None]:
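def safe_decode(batch_ids):
    # Positions padded with -100 are not valid token IDs, so map them to the pad token first.
    batch_ids = np.asarray(batch_ids)
    batch_ids = np.where(batch_ids >= 0, batch_ids, tokenizer.pad_token_id)
    # batch_decode merges the byte-level BPE pieces back into readable text;
    # joining the raw tokens would leave subword markers in the strings.
    decoded_batch = tokenizer.batch_decode(batch_ids, skip_special_tokens=True)
    return [text.strip() for text in decoded_batch]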

The compute_metrics function calculates ROUGE, BLEU (using sacreBLEU with exponential smoothing), and BERTScore.

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = safe_decode(predictions)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = safe_decode(labels)

    rouge_result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels)
    # sacreBLEU expects a list of reference streams, each as long as the list of predictions.
    bleu = sacrebleu.corpus_bleu(decoded_preds, [decoded_labels], smooth_method="exp")
    bleu_score = bleu.score
    bertscore_result = bertscore_metric.compute(predictions=decoded_preds, references=decoded_labels, lang="en")
    avg_bertscore = np.mean(bertscore_result["f1"])

    return {
        "rouge1": rouge_result["rouge1"],
        "rouge2": rouge_result["rouge2"],
        "rougeL": rouge_result["rougeL"],
        "bleu": bleu_score,
        "bertscore": avg_bertscore
    }
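
The reference layout that `sacrebleu.corpus_bleu` expects is easy to get wrong: references are grouped by stream (one list per reference set, each as long as the list of predictions), not by example. The sketch below shows the transposition in plain Python; `to_reference_streams` is a hypothetical helper and the strings are placeholders.

```python
def to_reference_streams(references_per_example):
    """Transpose [[ref_a, ref_b], ...] (one list per example) into reference streams."""
    counts = {len(refs) for refs in references_per_example}
    if len(counts) > 1:
        raise ValueError("every example needs the same number of references")
    return [list(stream) for stream in zip(*references_per_example)]


# One reference per example, as in this notebook (placeholder strings).
per_example = [["summary one"], ["summary two"], ["summary three"]]
streams = to_reference_streams(per_example)
print(streams)  # [['summary one', 'summary two', 'summary three']]
```

With a single reference per example the result is one stream, which is equivalent to passing `[decoded_labels]`.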

Define three experiment configurations

In [None]:
experiment_configs = [
    {"name": "config1", "learning_rate": 5e-5, "train_batch_size": 2, "num_train_epochs": 3},
    {"name": "config2", "learning_rate": 3e-5, "train_batch_size": 2, "num_train_epochs": 3},
    {"name": "config3", "learning_rate": 5e-5, "train_batch_size": 4, "num_train_epochs": 3},
]

experiment_results = {}
trained_trainers = {}
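
The three configurations differ in more than their nominal hyperparameters: with the same 1000 training examples, a larger batch size means fewer optimizer updates. The sketch below makes this explicit. It assumes a single device and no gradient accumulation, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer updates for a single-device run without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


configs = [
    {"name": "config1", "learning_rate": 5e-5, "train_batch_size": 2, "num_train_epochs": 3},
    {"name": "config2", "learning_rate": 3e-5, "train_batch_size": 2, "num_train_epochs": 3},
    {"name": "config3", "learning_rate": 5e-5, "train_batch_size": 4, "num_train_epochs": 3},
]

# 1000 is the size of the subsampled training set used in this notebook.
steps_by_config = {
    c["name"]: optimizer_steps(1000, c["train_batch_size"], c["num_train_epochs"])
    for c in configs
}
print(steps_by_config)  # {'config1': 1500, 'config2': 1500, 'config3': 750}
```

So config3 sees the same data as config1 but takes half as many update steps, which is worth keeping in mind when comparing their loss curves.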

Create a data collator to pad sequences dynamically.

In [None]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=BartForConditionalGeneration.from_pretrained(model_checkpoint),
    padding="longest"
)

Configuration 1 (Config1):
- Learning Rate: 5e-5
- Batch Size: 2
- Number of Epochs: 3

Here we load the pre-trained BART model, set up the training arguments for config1, and create the custom trainer. Finally, we train the model and save its weights together with the full configuration.

In [None]:
print("\nStarting Experiment: config1")
config1 = experiment_configs[0]
model_config1 = BartForConditionalGeneration.from_pretrained(model_checkpoint)
training_args_config1 = Seq2SeqTrainingArguments(
    output_dir=f"billsum_bart_base_{config1['name']}",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=config1["train_batch_size"],
    per_device_eval_batch_size=config1["train_batch_size"],
    learning_rate=config1["learning_rate"],
    num_train_epochs=config1["num_train_epochs"],
    bf16=True,
    logging_dir=f'./logs_{config1["name"]}',
    logging_steps=50,
    predict_with_generate=True,
    report_to=[],
)
trainer_config1 = CustomSeq2SeqTrainer(
    model=model_config1,
    args=training_args_config1,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer_config1.train()
trainer_config1.save_model()

Plot the training and validation loss.

In [None]:
# Collect losses together with the global step at which each was logged,
# so both curves share the same x-axis.
train_steps, train_losses = [], []
eval_steps, eval_losses = [], []
for log in trainer_config1.state.log_history:
    if "loss" in log:
        train_steps.append(log["step"])
        train_losses.append(log["loss"])
    if "eval_loss" in log:
        eval_steps.append(log["step"])
        eval_losses.append(log["eval_loss"])
plt.figure(figsize=(8, 5))
plt.plot(train_steps, train_losses, label="Train Loss")
plt.plot(eval_steps, eval_losses, marker="o", label="Validation Loss")
plt.xlabel("Training Steps")
plt.ylabel("Loss")
plt.title("Training and Validation Loss for config1")
plt.legend()
plt.show()

Evaluate the model on the validation set and store the results.

In [None]:
eval_results_config1 = trainer_config1.evaluate(eval_dataset=tokenized_datasets["validation"])
print(f"Validation results for config1:")
print(eval_results_config1)
experiment_results["config1"] = eval_results_config1
trained_trainers["config1"] = trainer_config1

Configuration 2 (Config2):
- Learning Rate: 3e-5
- Batch Size: 2
- Number of Epochs: 3

Here we load the pre-trained BART model, set up the training arguments for config2, and create the custom trainer. Finally, we train the model and save its weights together with the full configuration.

In [None]:
print("\nStarting Experiment: config2")
config2 = experiment_configs[1]
model_config2 = BartForConditionalGeneration.from_pretrained(model_checkpoint)
training_args_config2 = Seq2SeqTrainingArguments(
    output_dir=f"billsum_bart_base_{config2['name']}",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=config2["train_batch_size"],
    per_device_eval_batch_size=config2["train_batch_size"],
    learning_rate=config2["learning_rate"],
    num_train_epochs=config2["num_train_epochs"],
    bf16=True,
    logging_dir=f'./logs_{config2["name"]}',
    logging_steps=50,
    predict_with_generate=True,
    report_to=[],
)
trainer_config2 = CustomSeq2SeqTrainer(
    model=model_config2,
    args=training_args_config2,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer_config2.train()
trainer_config2.save_model()

Plot the training and validation loss.

In [None]:
# Collect losses together with the global step at which each was logged,
# so both curves share the same x-axis.
train_steps, train_losses = [], []
eval_steps, eval_losses = [], []
for log in trainer_config2.state.log_history:
    if "loss" in log:
        train_steps.append(log["step"])
        train_losses.append(log["loss"])
    if "eval_loss" in log:
        eval_steps.append(log["step"])
        eval_losses.append(log["eval_loss"])
plt.figure(figsize=(8, 5))
plt.plot(train_steps, train_losses, label="Train Loss")
plt.plot(eval_steps, eval_losses, marker="o", label="Validation Loss")
plt.xlabel("Training Steps")
plt.ylabel("Loss")
plt.title("Training and Validation Loss for config2")
plt.legend()
plt.show()

Evaluate the model on the validation set and store the results.

In [None]:
eval_results_config2 = trainer_config2.evaluate(eval_dataset=tokenized_datasets["validation"])
print(f"Validation results for config2:")
print(eval_results_config2)
experiment_results["config2"] = eval_results_config2
trained_trainers["config2"] = trainer_config2

In [None]:
!nvidia-smi

Configuration 3 (Config3):
- Learning Rate: 5e-5
- Batch Size: 4
- Number of Epochs: 3

Here we load the pre-trained BART model, set up the training arguments for config3, and create the custom trainer. Finally, we train the model and save its weights together with the full configuration.

In [None]:
print("\nStarting Experiment: config3")
config3 = experiment_configs[2]
model_config3 = BartForConditionalGeneration.from_pretrained(model_checkpoint)
training_args_config3 = Seq2SeqTrainingArguments(
    output_dir=f"billsum_bart_base_{config3['name']}",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=config3["train_batch_size"],
    per_device_eval_batch_size=config3["train_batch_size"],
    learning_rate=config3["learning_rate"],
    num_train_epochs=config3["num_train_epochs"],
    bf16=True,
    logging_dir=f'./logs_{config3["name"]}',
    logging_steps=50,
    predict_with_generate=True,
    report_to=[],
)
trainer_config3 = CustomSeq2SeqTrainer(
    model=model_config3,
    args=training_args_config3,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer_config3.train()
trainer_config3.save_model()

Plot the training and validation loss.

In [None]:
# Collect losses together with the global step at which each was logged,
# so both curves share the same x-axis.
train_steps, train_losses = [], []
eval_steps, eval_losses = [], []
for log in trainer_config3.state.log_history:
    if "loss" in log:
        train_steps.append(log["step"])
        train_losses.append(log["loss"])
    if "eval_loss" in log:
        eval_steps.append(log["step"])
        eval_losses.append(log["eval_loss"])
plt.figure(figsize=(8, 5))
plt.plot(train_steps, train_losses, label="Train Loss")
plt.plot(eval_steps, eval_losses, marker="o", label="Validation Loss")
plt.xlabel("Training Steps")
plt.ylabel("Loss")
plt.title("Training and Validation Loss for config3")
plt.legend()
plt.show()

Evaluate the model on the validation set and store the results.

In [None]:
eval_results_config3 = trainer_config3.evaluate(eval_dataset=tokenized_datasets["validation"])
print(f"Validation results for config3:")
print(eval_results_config3)
experiment_results["config3"] = eval_results_config3
trained_trainers["config3"] = trainer_config3

**Training Methodology**

After preprocessing the dataset, we fine-tuned a pre-trained model on our data. We started from the facebook/bart-base checkpoint from Hugging Face, which is known for strong performance on text generation tasks.

We fine-tuned the model on the subsampled, tokenized Billsum training data. For evaluation, we implemented a custom trainer class, CustomSeq2SeqTrainer, which overrides the prediction_step method so that the model's generate() method is used during evaluation. This produces full summaries using tuned decoding parameters.

For decoding, we set the following parameters:
- num_beams=8: searches over more candidate summaries
- length_penalty=1.5: encourages longer, more complete outputs
- no_repeat_ngram_size=3: prevents any three-word phrase from being repeated
- min_length=50: ensures that the summaries are not too short.
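
The effect of two of the decoding parameters listed above can be sketched in plain Python. This is an illustration only: it assumes the usual length-normalized beam score (sum of token log-probabilities divided by `length ** length_penalty`), and both helper functions and all numbers are hypothetical.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score used to rank finished beam hypotheses (assumed form)."""
    return sum_logprobs / (length ** length_penalty)


def has_repeated_ngram(tokens, n):
    """True if any n-gram occurs more than once, which no_repeat_ngram_size=n forbids."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


# Hypothetical hypotheses: (sum of log-probabilities, length in tokens).
short_hyp, long_hyp = (-8.0, 20), (-18.0, 40)
for penalty in (1.0, 1.5):
    winner = "long" if beam_score(*long_hyp, penalty) > beam_score(*short_hyp, penalty) else "short"
    print(f"length_penalty={penalty}: {winner} hypothesis ranks higher")

print(has_repeated_ngram("the bill amends the bill amends the code".split(), 3))
```

Because log-probabilities are negative, dividing by a larger power of the length moves long hypotheses closer to zero, which is why a penalty above 1 favors longer summaries.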

We tried several configurations with different learning rates (5e-5 and 3e-5) and batch sizes (2 and 4), and selected the best-performing one based on the evaluation metrics, giving the most weight to ROUGE because it is the most relevant metric for summarization. For the final evaluation, we used the best configuration's trainer to evaluate on a subset of the test set.

Clear GPU memory by removing every trainer except the best one

In [None]:
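import gc

# `del t` in a loop only unbinds the loop variable, so drop every real reference instead.
for name in [k for k in trained_trainers if k != "config3"]:
    del trained_trainers[name]
del trainer_config1, trainer_config2, model_config1, model_config2
gc.collect()
torch.cuda.empty_cache()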
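
Freeing memory in Python depends on removing every reference to an object, which is easy to misjudge when objects are stored in a dictionary. The sketch below demonstrates this with a stand-in class instead of a real trainer; `FakeTrainer` and `registry` are hypothetical names, and the behavior shown relies on CPython's reference counting.

```python
import gc
import weakref


class FakeTrainer:
    """Stand-in for a trainer that holds GPU memory (hypothetical)."""


registry = {"config1": FakeTrainer(), "config3": FakeTrainer()}
probe = weakref.ref(registry["config1"])  # lets us check whether the object still exists

for name, trainer in registry.items():
    if name != "config3":
        del trainer  # only unbinds the loop variable; the dict still holds the object
gc.collect()
still_alive_after_loop_del = probe() is not None

del registry["config1"]  # removes the last real reference
gc.collect()
freed_after_dict_del = probe() is None

print(still_alive_after_loop_del, freed_after_dict_del)  # True True
```

Only after the last reference is gone can `torch.cuda.empty_cache()` return the corresponding cached GPU memory.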

Select config3 as the best configuration and evaluate it on a 100-sample subset of the test set

In [None]:
best_config_name = "config3"
print("Best configuration based on ROUGE scores:", best_config_name)
best_trainer = trained_trainers[best_config_name]

test_metrics = best_trainer.evaluate(eval_dataset=tokenized_datasets["test"].select(range(100)))
print(test_metrics)

Plot a bar chart of the test metrics from the best trainer's evaluation

In [None]:
metrics_to_plot = {
    "ROUGE-1": test_metrics["eval_rouge1"] * 100,
    "ROUGE-2": test_metrics["eval_rouge2"] * 100,
    "ROUGE-L": test_metrics["eval_rougeL"] * 100,
    "BLEU": test_metrics["eval_bleu"],
    "BERTScore": test_metrics["eval_bertscore"] * 100,
}
plt.figure(figsize=(8, 5))
plt.bar(list(metrics_to_plot.keys()), list(metrics_to_plot.values()))
plt.title("Test Metrics (%) for Best Model")
plt.xlabel("Metric")
plt.ylabel("Score")
plt.show()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!cp -r billsum_bart_base_config3 /content/drive/MyDrive/

## Final Analysis and Discussion

### Evaluation Scores Analysis

**Loss:**  
- The final evaluation loss is approximately **2.11**.  
- This cross-entropy value, reached after 3 epochs on a subset of the dataset, is consistent with the model having learned from the training data.  
- A loss in this range is typical for sequence-to-sequence models on a complex task such as summarization.

**ROUGE Scores:**  
1. ROUGE-1: 0.428 (42.8%)  
  - This metric measures the unigram overlap between the generated summary and the reference. A ROUGE-1 score of 42.8% indicates substantial overlap in content words.  
2. ROUGE-2: 0.240 (24.0%)  
  - ROUGE-2 measures bigram overlap. A score of 24.0% means that the model reproduces a fair share of the two-word phrases in the reference.  
3. ROUGE-L: 0.313 (31.3%)  
  - ROUGE-L is based on the longest common subsequence between the generated and reference summaries, so it rewards matching words that appear in the same order.

These ROUGE scores show that the generated summaries capture much of the important information in the reference summaries.
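
The three ROUGE variants described above can be illustrated with a simplified pure-Python version. This is a sketch only: it uses lowercase whitespace tokenization without stemming, so its numbers will not exactly match the `rouge_score` package, and the two sentences are made-up examples.

```python
from collections import Counter


def f1(matches, pred_total, ref_total):
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = matches / pred_total, matches / ref_total
    return 2 * precision * recall / (precision + recall)


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n):
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    p, r = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(p, r), len(p), len(r))


pred = "the bill amends the tax code"
ref = "the bill amends the internal revenue code"
print(round(rouge_n(pred, ref, 1), 3), round(rouge_n(pred, ref, 2), 3), round(rouge_l(pred, ref), 3))
# 0.769 0.545 0.769
```

The bigram score drops faster than the unigram score because a single substituted word breaks the two bigrams around it.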

**BLEU Score:**  
- The BLEU score is around 34.41 for config3. config2 scored higher on BLEU, but BLEU is more common in machine translation and is less informative than ROUGE for summarization, so we gave ROUGE more weight.
- We use sacreBLEU with exponential smoothing. As a rough rule of thumb, we treat a BLEU score above 30 as good enough for this task.
- A high BLEU score means that the generated summaries have a high degree of n-gram overlap with the reference summaries. However, BLEU is sensitive to minor changes in phrasing.

**BERTScore:**  
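- The BERTScore is around 0.894 (89.4%).  
- BERTScore uses contextual embeddings to measure semantic similarity between generated and reference summaries.
- A BERTScore close to 0.90 suggests that the generated summaries are semantically close to the reference summaries even when the exact wording differs. Raw (non-rescaled) BERTScore values tend to fall in a narrow, high range, so this number is more useful for comparing configurations than as an absolute measure of quality.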
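
The matching step behind BERTScore can be sketched with toy vectors: each candidate token is matched to its most similar reference token by cosine similarity (precision), each reference token to its most similar candidate token (recall), and the two are combined into F1. The 2-dimensional vectors below are hypothetical stand-ins for contextual embeddings, and `greedy_match_f1` is an illustrative helper, not the `bert_score` API.

```python
import math


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def greedy_match_f1(candidate_vecs, reference_vecs):
    cand = [normalize(v) for v in candidate_vecs]
    ref = [normalize(v) for v in reference_vecs]
    # Cosine similarity between every candidate token and every reference token.
    sim = [[sum(a * b for a, b in zip(c, r)) for r in ref] for c in cand]
    precision = sum(max(row) for row in sim) / len(cand)
    recall = sum(max(sim[i][j] for i in range(len(cand))) for j in range(len(ref))) / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy "embeddings" for a two-token candidate and a two-token reference.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]
score = greedy_match_f1(candidate, reference)
print(round(score, 4))  # 0.8536
```

Even though only one token pair matches exactly, the partial similarity of the other pair keeps the score high, which is part of why BERTScore values cluster near the top of the range.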

### Loss Graph Analysis

- **Training Loss Curve:**  
  The training loss decreases steadily over the course of training, which indicates that the model is learning consistently from the training data.
  
- **Validation Loss Curve:**  
  The validation loss also decreases and stays relatively close to the training loss. This suggests that the model generalizes to unseen data without significant overfitting.

- **Interpretation:**  
  - Smooth convergence of both the training and validation losses is a good sign that the model is learning.  
  - The validation loss does not rise relative to the training loss, so there is no sign of overfitting within these 3 epochs.
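
The comparison between the two curves can also be made numerically. The sketch below pairs each validation loss with the mean training loss logged since the previous evaluation. It assumes entries shaped like the trainer's `log_history` (dicts with `loss` or `eval_loss` and a `step`); the values are made up for illustration and `loss_gap_per_eval` is a hypothetical helper.

```python
def loss_gap_per_eval(log_history):
    """Pair each eval_loss with the mean training loss logged since the previous eval."""
    rows, window = [], []
    for entry in log_history:
        if "loss" in entry:
            window.append(entry["loss"])
        if "eval_loss" in entry:
            train_mean = sum(window) / len(window) if window else None
            gap = entry["eval_loss"] - train_mean if train_mean is not None else None
            rows.append({"step": entry.get("step"), "train": train_mean,
                         "eval": entry["eval_loss"], "gap": gap})
            window = []
    return rows


# Hypothetical log entries, not results from this notebook.
fake_history = [
    {"loss": 3.0, "step": 50},
    {"loss": 2.6, "step": 100},
    {"eval_loss": 2.5, "step": 125},
    {"loss": 2.2, "step": 150},
    {"eval_loss": 2.3, "step": 250},
]
for row in loss_gap_per_eval(fake_history):
    print(row)
```

A gap that keeps growing from one evaluation to the next would be an early hint of overfitting.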

### Challenges Faced

1. **Handling Long Documents:**  
   - The input text is truncated to 1024 tokens to save GPU memory and speed up processing. This truncation may remove important context, which could lower the quality of the generated summaries.
   
2. **Variability in Summary Quality:**  
   - Summaries naturally vary in wording and depth. Metrics such as BLEU depend on exact phrase matches, so they can give lower scores for minor wording differences even when the meaning is correct.
   - Using ROUGE and BERTScore together gives a more balanced evaluation that captures both literal phrasing and deeper semantic similarity.

3. **GPU Memory Constraints:**  
   - Keeping multiple trainer instances alive consumed a lot of GPU memory, and all 15 GB of the GPU on Google Colab was in use.
   - Even after subsampling the data as the instructions suggested, we had to explicitly delete unused trainer objects (using `del` and `torch.cuda.empty_cache()`) to free memory for the best model's evaluation; otherwise the kernel crashed during evaluation.

4. **Evaluation Speed:**  
   - Evaluating the model on the entire test set is very slow.
   - To get faster feedback, we evaluated on a smaller subset of the test set (100 samples). This may introduce some variability in the results, but it significantly decreases evaluation time.
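
One way to quantify the variability that a 100-sample test subset introduces is a bootstrap confidence interval over per-example scores. The sketch below uses only the standard library; `bootstrap_mean_ci` is a hypothetical helper and the scores are made-up values, not outputs of this notebook.

```python
import random


def bootstrap_mean_ci(scores, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-example ROUGE-1 scores for a small evaluation subset.
example_scores = [0.31, 0.45, 0.52, 0.38, 0.47, 0.40, 0.36, 0.49, 0.44, 0.41]
low, high = bootstrap_mean_ci(example_scores)
print(f"mean={sum(example_scores) / len(example_scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

A wide interval would indicate that differences between configurations measured on the subset should be read with caution.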

### Potential Modifications

- **Input/Output Lengths:**  
  With more GPU resources, we could increase the maximum target length, and switch to a long-input model to capture more context from long documents (facebook/bart-base itself accepts at most 1024 input positions). This was not possible on Colab and not practical on CCR.

- **Gradient Accumulation:**  
  We could use gradient accumulation to simulate a larger batch size without exceeding GPU memory, which might improve training stability and performance.

- **Decoding Parameter Fine-Tuning:**  
  We could further tune the decoding parameters, such as `min_length`, `num_beams`, or `length_penalty`, to see whether BLEU and other metrics can be improved without decreasing ROUGE or BERTScore.

- **Ensemble Methods:**  
  We could explore ensembles of multiple models to generate summaries, which might improve overall performance and reduce the variability in BLEU scores.
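
The gradient accumulation idea from the list above can be checked on a toy problem: averaging the gradients of equal-sized micro-batches gives the same gradient as one large batch. The sketch uses a one-parameter linear model with a mean-squared-error loss; all names and numbers are illustrative. In the notebook itself, the corresponding setting would be the `gradient_accumulation_steps` argument of `Seq2SeqTrainingArguments`.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of consecutive micro-batches before a single update."""
    total, chunks = 0.0, 0
    for start in range(0, len(xs), micro_batch_size):
        end = start + micro_batch_size
        total += grad_mse(w, xs[start:end], ys[start:end])
        chunks += 1
    return total / chunks


xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5
full_batch = grad_mse(w, xs, ys)                             # one batch of 4
accumulated = accumulated_grad(w, xs, ys, micro_batch_size=2)  # two micro-batches of 2
print(full_batch, accumulated)  # -22.5 -22.5
```

For sequence-to-sequence training the equivalence is only approximate when micro-batches contain different numbers of target tokens, because the loss is averaged over tokens.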


## References

- FiscalNote/billsum on Hugging Face: https://huggingface.co/datasets/FiscalNote/billsum
- Facebook/bart-base on Hugging Face: https://huggingface.co/facebook/bart-base
- Transformers Documentation: https://huggingface.co/docs/transformers/ 
- Datasets Documentation: https://huggingface.co/docs/datasets/
- rouge_score: https://huggingface.co/spaces/evaluate-metric/rouge
- sacreBLEU: https://pypi.org/project/sacreBLEU/
- BLEU: https://huggingface.co/spaces/evaluate-metric/bleu  
- bert_score: https://huggingface.co/spaces/evaluate-metric/bertscore
- Python os module Documentation: https://docs.python.org/3/library/os.html
- Python random module Documentation: https://docs.python.org/3/library/random.html
- NumPy Documentation: https://numpy.org/doc/stable/user/index.html#user
- PyTorch Documentation: https://pytorch.org/docs/stable/index.html
- Matplotlib Documentation: https://matplotlib.org/stable/users/index.html
- Hugging Face Evaluate Documentation: https://huggingface.co/docs/evaluate/