<a href="https://colab.research.google.com/github/danielsaggau/deep_unsupervised_learning/blob/main/BillSUM_Bigbird_Pegasus_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Evaluate 🤗's BigBirdPegasus on BillSum**

In this notebook, we evaluate BigBird on the long-range summarization task of **[billsum](https://huggingface.co/datasets/billsum)**. BigBird was introduced in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by *Manzil Zaheer et al.* It has achieved outstanding performance on long document summarization using an efficient block sparse attention mechanism. Please refer to this [blog post](https://huggingface.co/blog/big-bird) for an in-depth explanation of BigBird's block sparse attention.

Let's see what GPU we got. We need roughly 12 GB of GPU memory or more to be able to run this notebook.

In [None]:
!nvidia-smi
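The same check can be done programmatically. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names and the `REQUIRED_MIB` threshold are hypothetical and only illustrate the "~12 GB" rule of thumb stated above.

```python
import shutil
import subprocess

# Hypothetical threshold: cards sold as "12 GB" usually report somewhat less,
# so this sits a little below 12 * 1024 MiB. Adjust as needed.
REQUIRED_MIB = 11_000


def parse_total_memory_mib(smi_output):
    """Parse one integer (MiB) per GPU from nvidia-smi's CSV output."""
    values = []
    for line in smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def gpu_memory_mib():
    """Return total memory per visible GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if completed.returncode != 0:
        return []
    return parse_total_memory_mib(completed.stdout)


for index, mib in enumerate(gpu_memory_mib()):
    status = "ok" if mib >= REQUIRED_MIB else "probably too small"
    print(f"GPU {index}: {mib} MiB ({status})")
```

On a machine without a GPU the loop simply prints nothing.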

Let's first install `transformers`, `datasets`, `rouge_score` and `sentencepiece`.

In [1]:
%%capture
!pip3 install datasets
!pip3 install rouge_score
!pip3 install git+https://github.com/huggingface/transformers
!pip3 install sentencepiece

As mentioned above, we will evaluate **BigBirdPegasus** on the **_billsum_** dataset using the **Rouge-2** metric. For this, let's import the two loading functions `load_dataset` and `load_metric`. Further, we import `BigBirdPegasusForConditionalGeneration` and the `AutoTokenizer` class.

In [2]:
from datasets import load_dataset, load_metric
import torch
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer

Let's define some variables which will be useful later on.

In [3]:
DATASET_NAME = "billsum"
DEVICE = "cuda"
CACHE_DIR = DATASET_NAME
MODEL_ID = "google/bigbird-pegasus-large-bigpatent"

To begin with, let's take a look at the BillSum dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/billsum)).
BillSum consists of US Congressional and California state bills. Each sample provides the bill `text`, a reference `summary`, and the bill `title`. Thus, the input to be summarized is the text and the gold label is the summary.

The following table summarizes the size of the *train*, *test*, and *ca_test* splits of the dataset.

|               |Training | Test | CA Test |
|---------------|---------|------|---------|
| Total samples | 18949   | 3269 | 1237    |

In this notebook, we are only interested in evaluating *BigBird*. To do so, let's download the *test* split of the `billsum` dataset. This can take a couple of minutes **☕**.

In [4]:
test_dataset = load_dataset(DATASET_NAME, split="test", cache_dir=CACHE_DIR)
test_dataset

Using custom data configuration default


Downloading and preparing dataset billsum/default (download: 64.14 MiB, generated: 259.80 MiB, post-processed: Unknown size, total: 323.94 MiB) to billsum/billsum/default/3.0.0/d1e95173aed3acb71327864be74ead49b578522e4c7206048b2f2e5351b57959...


Dataset billsum downloaded and prepared to billsum/billsum/default/3.0.0/d1e95173aed3acb71327864be74ead49b578522e4c7206048b2f2e5351b57959. Subsequent calls will reuse this data.


Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 3269
})

The checkpoint selected via `MODEL_ID`, `google/bigbird-pegasus-large-bigpatent` ([click to see on 🤗Model Hub](https://huggingface.co/google/bigbird-pegasus-large-bigpatent)), has already been fine-tuned for summarization (on patents rather than bills), so we can simply load the weights and run the model in inference mode.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(MODEL_ID).to(DEVICE)
rouge = load_metric("rouge")

`BigBirdPegasus` makes use of *block sparse attention*. Let's verify the `config`'s attention type and the `block_size`.

In [8]:
model.config.attention_type, model.config.block_size

('block_sparse', 64)
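To get a feel for what a block size of 64 means at a 4096-token input, here is a small back-of-the-envelope sketch. The counts of window, global, and random blocks are assumptions taken from the defaults described in the blog post linked above, and the helper name is hypothetical; blocks at the sequence boundaries behave differently and are ignored here.

```python
def attended_tokens_per_query_block(block_size, window_blocks=3, global_blocks=2, random_blocks=3):
    """Rough number of key tokens one (non-boundary) query block attends to.

    The default block counts are assumptions, not values read from the model.
    """
    return block_size * (window_blocks + global_blocks + random_blocks)


seq_len = 4096    # maximum encoder input length used in this notebook
block_size = 64   # value reported by model.config.block_size above

num_blocks = seq_len // block_size
sparse_keys = attended_tokens_per_query_block(block_size)

print(f"number of blocks: {num_blocks}")
print(f"keys per query block (sparse): {sparse_keys}")
print(f"keys per query block (full attention): {seq_len}")
print(f"reduction factor: {seq_len / sparse_keys:.1f}x")
```

Under these assumptions each query block looks at 512 of the 4096 positions, which is where the memory savings come from.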

Next, we will take a look at the length distribution of the dataset. The following table shows the *median* and the 90% quantile of the article and abstract (summary) lengths.

|                 | Median | 90%-ile |
|-----------------|--------|---------|
| Articles Length | 2715   | 6101    |
| Summary Length  | 212    | 318     |

`BigBirdPegasus` can handle sequences of up to **4096** tokens, which is significantly higher than the median input length of **2715**. However, many input samples are longer than **4096** tokens and consequently need to be truncated.
The summaries have a median length of **212** with 90% being shorter than **318**. Given this data, 256 seems to be a reasonable choice as the model's maximum generation length.
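Statistics like the ones in the table can be computed with a few lines of standard-library Python. The sketch below uses made-up toy lengths, not the real dataset; in practice the list would hold token counts such as `len(tokenizer(text).input_ids)` for each sample. The function name and the nearest-rank percentile definition are choices made for this example.

```python
import math
import statistics


def length_summary(lengths, max_input_length=4096):
    """Median, nearest-rank 90th percentile, and share of inputs that get truncated."""
    ordered = sorted(lengths)
    rank = math.ceil(9 * len(ordered) / 10)  # nearest-rank 90th percentile
    return {
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
        "truncated_share": sum(n > max_input_length for n in ordered) / len(ordered),
    }


# Toy token counts, invented purely for illustration.
toy_lengths = [1200, 1800, 2500, 2700, 2900, 3300, 3900, 4500, 6100, 8000]
summary = length_summary(toy_lengths)
print(summary)
```

The `truncated_share` entry makes the cost of the 4096-token limit explicit: it is the fraction of inputs whose tail the model never sees.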

Now we can write the evaluation function for BigBirdPegasus.
First, we tokenize each *article* up to a maximum length of 4096 tokens.
We will make use of beam search (with `num_beams=5` & `length_penalty=0.8`) to generate the predicted *abstract* of the *article*. Finally, the predicted *abstract* tokens are decoded and the resulting predicted *abstract* string is saved in the batch.

In [13]:
def generate_answer(batch):
  # Tokenize the input text, padding or truncating to the 4096-token limit.
  inputs_dict = tokenizer(batch["text"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  # Move every input tensor to the GPU.
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
  # Beam search with 5 beams, generating at most 256 tokens.
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=256, num_beams=5, length_penalty=0.8)
  # Decode the single generated sequence back into a string.
  batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch
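The effect of `length_penalty` in beam search can be illustrated with a toy calculation. As far as I understand the Hugging Face implementation, finished hypotheses are ranked by the sum of token log-probabilities divided by `length ** length_penalty`; the sketch below assumes exactly that formula, and the two hypotheses are invented numbers.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score (assumed formula) used to rank finished hypotheses."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished hypotheses: (sum of token log-probs, length in tokens).
short_hyp = (-6.0, 20)
long_hyp = (-12.0, 60)

for penalty in (0.5, 0.8, 1.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.3f} long={long_score:.3f} -> {winner}")
```

Because log-probabilities are negative, a smaller exponent normalizes less and shifts the preference toward shorter outputs, while larger values favor longer ones.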

Let's take 2 samples and verify the predictions to be sure everything works as expected 🙂.

In [10]:
dataset_small = test_dataset.select(range(2))
result_small = dataset_small.map(generate_answer)

rouge.compute(predictions=result_small["predicted_abstract"], references=result_small["summary"])

The present invention relates to a method of designing and constructing a design for a water supply system. The method includes the steps of selecting a design for the water supply system from a plurality of designs available for design, the design for the water supply system being selected from a plurality of designs based on cost and performance considerations, the design for the water supply system providing for water supply from a plurality of water sources, and the design for the water supply system providing for water supply from a plurality of water sources sharing a common boundary, the design for the water supply system providing for water supply from a plurality of water sources sharing a common boundary, the design for the water supply system providing for water supply from a plurality of water sources sharing a common boundary, the design for the water supply system providing for water supply from a plurality of water sources sharing a common boundary, and the design for th

{'rouge1': AggregateScore(low=Score(precision=0.319672131147541, recall=0.24920127795527156, fmeasure=0.2800718132854578), mid=Score(precision=0.33088869715271785, recall=0.2690450834220802, fmeasure=0.29666241266682525), high=Score(precision=0.34210526315789475, recall=0.28888888888888886, fmeasure=0.3132530120481927)),
 'rouge2': AggregateScore(low=Score(precision=0.02702702702702703, recall=0.022727272727272728, fmeasure=0.02469135802469136), mid=Score(precision=0.036147258369480594, recall=0.028991841491841492, fmeasure=0.032165498832165504), high=Score(precision=0.04526748971193416, recall=0.035256410256410256, fmeasure=0.039639639639639644)),
 'rougeL': AggregateScore(low=Score(precision=0.21052631578947367, recall=0.1757188498402556, fmeasure=0.19277108433734938), mid=Score(precision=0.2179680759275237, recall=0.17674831380901668, fmeasure=0.19512880967316304), high=Score(precision=0.22540983606557377, recall=0.17777777777777778, fmeasure=0.19748653500897667)),
 'rougeLsum': Agg
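To make the `rouge2` entries above easier to interpret, here is a simplified sketch of what Rouge-2 measures: bigram overlap between prediction and reference, reported as precision, recall, and F-measure. It assumes plain lowercase whitespace tokenization, so it will not reproduce the exact numbers of the `rouge` metric used in this notebook; the function names and example sentences are made up.

```python
from collections import Counter


def bigrams(text):
    """Count bigrams after lowercasing and whitespace tokenization (a simplification)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, fmeasure) of clipped bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # each bigram counted at most min(pred, ref) times
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


precision, recall, fmeasure = rouge2(
    "the bill amends the tax code",
    "the bill amends the internal revenue code",
)
print(f"precision={precision:.3f} recall={recall:.3f} fmeasure={fmeasure:.3f}")
```

Three of the five predicted bigrams also occur among the six reference bigrams, so precision is 3/5 and recall is 3/6.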

Because of the very large input size of ~4K tokens in this notebook, evaluating the whole test dataset would take a long time. For the sake of this notebook, we'll only evaluate the first 600 examples. Therefore, we cut the 3,269 test samples down to just 600 using 🤗Datasets' convenient `.select()` function.

In [7]:
test_dataset = test_dataset.select(range(600))

Alright, now let's map each sample to the predicted *abstract*. This will take *ca.* 2 hours if you have been given a fast GPU.

In [8]:
result = test_dataset.map(generate_answer)

The present invention relates to a method of designing and constructing a design for a water supply system. The method includes the steps of selecting a design for the water supply system from a plurality of designs available for design, the design for the water supply system being selected from a plurality of designs based on cost and performance considerations, the design for the water supply system providing for water supply from a plurality of water sources, and the design for the water supply system providing for water supply from a plurality of water sources sharing a common boundary, the design for the water supply system providing for water supply from a plurality of water sources sharing a common boundary, the design for the water supply system providing for water supply from a plurality of water sources sharing a common boundary, the design for the water supply system providing for water supply from a plurality of water sources sharing a common boundary, and the design for th

Finally, we can evaluate the predictions using the *rouge* metric.

In [13]:
rouge.compute(predictions=result["predicted_abstract"], references=result["summary"])

{'rouge1': AggregateScore(low=Score(precision=0.38441796937296435, recall=0.23481834046034858, fmeasure=0.24421159633556253), mid=Score(precision=0.43491877905470344, recall=0.26115545098887005, fmeasure=0.26651783991442946), high=Score(precision=0.4808193954988864, recall=0.28691421535200184, fmeasure=0.28936985965464884)),
 'rouge2': AggregateScore(low=Score(precision=0.13810657990642555, recall=0.07109150558224125, fmeasure=0.07795572507349292), mid=Score(precision=0.17419718525263328, recall=0.0870254861411826, fmeasure=0.0946720942783223), high=Score(precision=0.2147535264534155, recall=0.10396743478806365, fmeasure=0.1133076056690345)),
 'rougeL': AggregateScore(low=Score(precision=0.2758759831784442, recall=0.17351628461357538, fmeasure=0.17639677623559327), mid=Score(precision=0.3155106209469023, recall=0.19408998617901746, fmeasure=0.19189662152569054), high=Score(precision=0.3575766542235534, recall=0.21645725086052453, fmeasure=0.21009481914040087)),
 'rougeLsum': AggregateS
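The nested output above is easier to compare with published numbers once the `mid` F-measures are pulled out and scaled to 0-100. The sketch below uses stand-in namedtuples that mirror the `Score` and `AggregateScore` fields visible in the output; the helper name is hypothetical and the example values are rounded copies of the `rouge2` entry above.

```python
from collections import namedtuple

# Stand-ins mirroring the field names printed in the output above.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])


def mid_fmeasures(rouge_result):
    """Map each metric name to its mid F-measure, scaled to 0-100 and rounded."""
    return {name: round(100 * agg.mid.fmeasure, 2) for name, agg in rouge_result.items()}


# Rounded copy of the 'rouge2' entry from the 600-sample run.
example = {
    "rouge2": AggregateScore(
        low=Score(0.138, 0.071, 0.078),
        mid=Score(0.174, 0.087, 0.0947),
        high=Score(0.215, 0.104, 0.113),
    )
}
print(mid_fmeasures(example))
```

With the real result object, the same attribute path applies: `result_dict["rouge2"].mid.fmeasure`.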

In [None]:
!pip install git+https://github.com/google-research/bleurt.git
bleurt = load_metric("bleurt")

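For our 600 samples, we get a *Rouge-2* F-measure of about **9.5** (the `mid` estimate in the output above). This is well below the PubMed numbers quoted below, most likely because the checkpoint loaded via `MODEL_ID` was fine-tuned on patents rather than on bills.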

**Note**: As stated in the [official paper](https://arxiv.org/pdf/2007.14062.pdf), *BigBirdPegasus* achieves a new state-of-the-art Rouge-2 score of **20.65** on PubMed. Evaluation in this notebook might be slightly worse since a different `length_penalty` is used for generation and data pre-processing is kept as simple as possible (no "*newline*" removal and no space removal before special tokens).

In case you want to evaluate [`google/bigbird-pegasus-large-arxiv`](https://huggingface.co/google/bigbird-pegasus-large-arxiv) on the `arxiv` dataset from [`scientific_papers`](https://huggingface.co/datasets/scientific_papers), you can adapt `DATASET_NAME` and `MODEL_ID` in the cell above; the `load_dataset` call then also needs the `arxiv` configuration name, and the input and reference columns are `article` and `abstract`.
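Switching datasets touches a few settings at once, so it can help to keep them together. The following is one possible layout, not part of the original notebook: `EVAL_CONFIGS` and `get_eval_config` are hypothetical names, while the dataset names, column names, and checkpoint IDs are the ones used on the Hub.

```python
# Hypothetical lookup table bundling everything that changes per dataset.
EVAL_CONFIGS = {
    "billsum": {
        "load_args": ("billsum",),
        "input_column": "text",
        "target_column": "summary",
        "model_id": "google/bigbird-pegasus-large-bigpatent",  # as used in this notebook
    },
    "pubmed": {
        "load_args": ("scientific_papers", "pubmed"),
        "input_column": "article",
        "target_column": "abstract",
        "model_id": "google/bigbird-pegasus-large-pubmed",
    },
    "arxiv": {
        "load_args": ("scientific_papers", "arxiv"),
        "input_column": "article",
        "target_column": "abstract",
        "model_id": "google/bigbird-pegasus-large-arxiv",
    },
}


def get_eval_config(name):
    """Return the settings for one dataset, with a readable error for unknown names."""
    if name not in EVAL_CONFIGS:
        raise KeyError(f"unknown dataset {name!r}; choose from {sorted(EVAL_CONFIGS)}")
    return EVAL_CONFIGS[name]


config = get_eval_config("arxiv")
print(config["load_args"], config["model_id"])
# The dataset would then be loaded with:
#   load_dataset(*config["load_args"], split="test")
```

The evaluation function would read `batch[config["input_column"]]` instead of the hard-coded `batch["text"]`.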

In [10]:
billsum_rouge_result = rouge.compute(predictions=result["predicted_abstract"], references=result["summary"])
billsum_bleurt_score = bleurt.compute(predictions=result["predicted_abstract"], references=result["summary"])

In [11]:
import pandas as pd

In [12]:
# Save the beam-search ROUGE and BLEURT results as CSV files.
dataframe = pd.DataFrame(billsum_rouge_result)
dataframe.to_csv('/content/billsum_rouge_result_beam_bigpatent.csv', index=False)
dataframe = pd.DataFrame(billsum_bleurt_score)
dataframe.to_csv('/content/billsum_bleurt_score_beam_bigpatent.csv', index=False)

In [14]:
def generate_answer(batch):
  inputs_dict = tokenizer(batch["text"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
  # No beam search here. Note: top_p only takes effect when do_sample=True and
  # length_penalty only applies to beam search, so as written this call is
  # greedy decoding with a repetition penalty.
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=256, top_p=0.95, repetition_penalty=1.1, length_penalty=0.8)
  batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch
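The cell above passes `top_p=0.95`, the parameter for nucleus sampling. In `generate`, it only has an effect when sampling is enabled with `do_sample=True`. The toy sketch below shows the idea on an invented next-token distribution: keep the smallest set of most likely tokens whose cumulative probability reaches `top_p`, then renormalize. The function name and the probabilities are made up for illustration.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalize so the kept probabilities sum to 1 again.
    return {token: p / total for token, p in kept}


# Invented next-token distribution.
toy_probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
nucleus = top_p_filter(toy_probs, top_p=0.9)
print(nucleus)
```

Sampling then draws only from the kept tokens, which cuts off the unlikely tail ("zebra" here) while still allowing some variety.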

In [15]:
result = test_dataset.map(generate_answer)

The present invention relates to a method for designing and constructing a design system for the elimination or control of combined sewer overflows. The method includes the steps of: providing a design system for the elimination or control of combined sewer overflows; providing a design system for the design of an alternative water supply for the elimination or control of combined sewer overflows; providing a design system for the design of an alternative water supply for the elimination or control of combined sewer overflows; providing a design system for the design of an alternative water supply for the elimination or control of combined sewer overflows; providing a design system for the design of an alternative water supply for the elimination or control of combined sewer overflows; providing a design system for the design of an alternative water supply for the design of an alternative water supply for the elimination or control of combined sewer overflows; providing a design system

In [16]:
# Score the second run and save the ROUGE and BLEURT results as CSV files.
billsum_rouge_result = rouge.compute(predictions=result["predicted_abstract"], references=result["summary"])
billsum_bleurt_score = bleurt.compute(predictions=result["predicted_abstract"], references=result["summary"])
dataframe = pd.DataFrame(billsum_rouge_result)
dataframe.to_csv('/content/billsum_rouge_result_all_bigpatent.csv', index=False)
dataframe = pd.DataFrame(billsum_bleurt_score)
dataframe.to_csv('/content/billsum_bleurt_score_all_bigpatent.csv', index=False)