<a href="https://colab.research.google.com/github/danielsaggau/deep_unsupervised_learning/blob/main/BillSUM_Bigbird_Pegasus_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Evaluate 🤗's BigBirdPegasus on BillSum**

In this notebook, we evaluate BigBirdPegasus on the long-document summarization task of **[BillSum](https://huggingface.co/datasets/billsum)**. BigBird was introduced in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by *Manzil Zaheer et al.* It has achieved outstanding performance on long document summarization using an efficient block sparse attention mechanism. Please refer to this [blog post](https://huggingface.co/blog/big-bird) for a detailed explanation of BigBird's block sparse attention.

Let's check which GPU we were assigned. We need at least ~12 GB of GPU memory to run this notebook.

In [None]:
!nvidia-smi

Let's first install `transformers`, `datasets`, `rouge_score` and `sentencepiece`.

In [None]:
%%capture
!pip3 install datasets
!pip3 install rouge_score
!pip3 install git+https://github.com/huggingface/transformers
!pip3 install sentencepiece

As mentioned above, we will evaluate **BigBirdPegasus** on the **_BillSum_** dataset using the **Rouge-2** metric. For this, let's import the two loading functions `load_dataset` and `load_metric`. Further, we import the `BigBirdPegasusForConditionalGeneration` model class and the `AutoTokenizer` tokenizer class.

In [None]:
from datasets import load_dataset, load_metric
import torch
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer

Let's define some variables which will be useful later on.

In [None]:
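DATASET_NAME = "billsum"  # dataset identifier on the Hub
DEVICE = "cuda"  # run inference on the GPU
CACHE_DIR = DATASET_NAME  # local directory for the downloaded data
MODEL_ID = "google/bigbird-pegasus-large-bigpatent"  # checkpoint to evaluate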

To begin with, let's take a look at the BillSum dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/billsum)).
BillSum consists of US Congressional and California state bills. Each sample provides the *text* of a bill, its *summary*, and its *title*. Thus, the input to be summarized is the text and the gold label is the summary.

The following table summarizes the size of the *train*, *test*, and *ca_test* splits of the dataset.

|               | Train | Test | CA test |
|---------------|-------|------|---------|
| Total samples | 18949 | 3269 | 1237    |

In this notebook, we are only interested in evaluating *BigBirdPegasus*. To do so, let's download the *test* split of the `billsum` dataset. This can take a couple of minutes **☕**.

In [None]:
test_dataset = load_dataset("billsum", DATASET_NAME, split="test", cache_dir=CACHE_DIR)
test_dataset

Using custom data configuration billsum


Downloading and preparing dataset billsum/billsum to billsum/billsum/billsum/3.0.0/d1e95173aed3acb71327864be74ead49b578522e4c7206048b2f2e5351b57959...


Dataset billsum downloaded and prepared to billsum/billsum/billsum/3.0.0/d1e95173aed3acb71327864be74ead49b578522e4c7206048b2f2e5351b57959. Subsequent calls will reuse this data.


Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 3269
})

The checkpoint `google/bigbird-pegasus-large-bigpatent` ([click to see on 🤗Model Hub](https://huggingface.co/google/bigbird-pegasus-large-bigpatent)) has already been fine-tuned for summarization on the BigPatent dataset, so we can simply load the weights and run the model in inference mode. Note that it has not been fine-tuned on BillSum, so the scores below reflect transfer to a new domain.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(MODEL_ID).to(DEVICE)
rouge = load_metric("rouge")

`BigBirdPegasus` makes use of *block sparse attention*. Let's verify the `config`'s attention type and the `block_size`.

In [None]:
model.config.attention_type, model.config.block_size

('block_sparse', 64)
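
To get a feeling for why block sparse attention matters at this sequence length, the following minimal sketch compares the number of query-key scores computed by full attention with a simplified block sparse pattern. It assumes each query block attends to 3 sliding-window blocks, 2 global blocks, and 3 random blocks. The random block count is an assumption based on the library default rather than something printed above, and the special handling of the global blocks themselves is ignored. The function name is hypothetical.

```python
def attention_score_counts(seq_len, block_size=64, num_window_blocks=3,
                           num_global_blocks=2, num_random_blocks=3):
    """Return (full, sparse) counts of query-key scores for one attention head.

    Simplified model: every query block attends to a fixed number of key
    blocks. The real implementation treats the global blocks specially.
    """
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")
    num_blocks = seq_len // block_size

    # Full attention: every token attends to every other token.
    full = seq_len * seq_len

    # Block sparse attention: each query block sees only a few key blocks.
    blocks_per_query = min(
        num_blocks, num_window_blocks + num_global_blocks + num_random_blocks
    )
    sparse = num_blocks * blocks_per_query * block_size * block_size
    return full, sparse


full, sparse = attention_score_counts(4096)
print(f"full: {full:,}  block sparse: {sparse:,}  ratio: {full / sparse:.1f}x")
```

Under these assumptions the sparse pattern grows linearly with the sequence length, while full attention grows quadratically.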

Next, we will take a look at the length distribution of the inputs and summaries. The following table shows the *median* and the 90% quantile of the article and abstract (summary) lengths, in tokens. These figures were measured on PubMed; the BillSum lengths have not been measured in this notebook and may differ.

|                | Median | 90%-ile |
|----------------|--------|---------|
| Article length | 2715   | 6101    |
| Summary length | 212    | 318     |

`BigBirdPegasus` can handle sequences of up to **4096** tokens, which is significantly higher than the median input length of **2715**. However, many input samples are longer than **4096** tokens and consequently need to be truncated.
The summaries have a median length of **212**, with 90% being shorter than **318**. Given this data, 256 seems to be a reasonable choice for the model's maximum generation length.
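
Statistics like the ones in the table can be recomputed for any dataset. The sketch below is one possible way to do it with the standard library only: it takes a list of token counts and reports the median, the 90th percentile (nearest-rank method), and the fraction of inputs that would be truncated at 4096 tokens. The helper names and the example lengths are hypothetical; in the notebook the counts could come from something like `len(tokenizer(text)["input_ids"])`.

```python
def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile (integer pct) using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def length_report(lengths, max_input_length=4096):
    """Summarize token lengths and the share of inputs exceeding the limit."""
    truncated = sum(1 for n in lengths if n > max_input_length)
    return {
        "median": nearest_rank_percentile(lengths, 50),
        "p90": nearest_rank_percentile(lengths, 90),
        "truncated_fraction": truncated / len(lengths),
    }


# Hypothetical token counts, for illustration only (not measured on any dataset).
example_lengths = [900, 1500, 2100, 2700, 3300, 4000, 4500, 5200, 6100, 8000]
report = length_report(example_lengths)
print(report)
```

A high `truncated_fraction` would suggest that a noticeable part of the input never reaches the model.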

Now we can write the evaluation function for BigBirdPegasus.
First, we tokenize each bill *text*, padding or truncating it to a length of 4096 tokens.
We will make use of beam search (with `num_beams=5` & `length_penalty=0.8`) to generate the predicted *summary* of the *text*. Finally, the predicted *summary* tokens are decoded and the resulting string is saved in the batch under the `predicted_abstract` key.

In [None]:
def generate_answer(batch):
    # Tokenize the bill text, padding or truncating to 4096 tokens.
    inputs_dict = tokenizer(batch["text"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
    # Move every input tensor to the GPU.
    inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
    # Generate the summary with beam search, capped at 256 tokens.
    predicted_abstract_ids = model.generate(**inputs_dict, max_length=256, num_beams=5, length_penalty=0.8)
    # Decode the best beam back into a string.
    batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
    print(batch["predicted_abstract"])
    return batch
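
To illustrate what the `length_penalty` argument does, here is a minimal sketch of how finished beam hypotheses are commonly ranked: the summed token log-probabilities are divided by the length raised to `length_penalty`. The function name and the two hypotheses are made up for illustration, and the exact length used for normalization may differ between library versions.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score of a finished hypothesis (higher is better)."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical hypotheses: (summed log-probability, number of tokens).
short_hyp = (-5.0, 10)
long_hyp = (-9.0, 20)

for length_penalty in (0.8, 1.0):
    short_score = beam_score(*short_hyp, length_penalty)
    long_score = beam_score(*long_hyp, length_penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={length_penalty}: short={short_score:.4f} "
          f"long={long_score:.4f} -> {winner} hypothesis wins")
```

Because log-probabilities are negative, a smaller exponent normalizes less aggressively, so a value such as 0.8 shifts the preference toward shorter outputs compared with 1.0.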

Let's take 2 samples and verify the predictions to be sure everything works as expected 🙂.

In [None]:
dataset_small = test_dataset.select(range(2))
result_small = dataset_small.map(generate_answer)

rouge.compute(predictions=result_small["predicted_abstract"], references=result_small["summary"])

in this brief reply to a recent letter to [ s.- a. ], we point out that there is an error in [ s.- a. ], and in [ s.- a. ], in [ s.- a. ], and in [ s.- a. ], in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s
in this brief report, we attempt to answer some of the questions raised in the resolution of the following open question : 1. the answer to the question : 1. whether or not the 4th and 5th centuries are incompatible? 2. the answer to the question : 3. the answer to the question : 4. the answer to the question : 5. the answer to the question : 6. the answer to the question : 7. the answer to the question : 8. the answer to the question : 9. the answer to the question : 10. the answer to the question : 11. the answer to the question : 12. the answer to the question : 13. the an

{'rouge1': AggregateScore(low=Score(precision=0.050314465408805034, recall=0.07028753993610223, fmeasure=0.0784313725490196), mid=Score(precision=0.16439773903351643, recall=0.12403265885694001, fmeasure=0.09533813525410165), high=Score(precision=0.27848101265822783, recall=0.17777777777777778, fmeasure=0.11224489795918367)),
 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.006329113924050633, recall=0.022727272727272728, fmeasure=0.009900990099009901), high=Score(precision=0.012658227848101266, recall=0.045454545454545456, fmeasure=0.019801980198019802)),
 'rougeL': AggregateScore(low=Score(precision=0.0440251572327044, recall=0.05750798722044728, fmeasure=0.06862745098039215), mid=Score(precision=0.13593662924926359, recall=0.10653177138800142, fmeasure=0.08023209283713485), high=Score(precision=0.22784810126582278, recall=0.15555555555555556, fmeasure=0.09183673469387756)),
 'rougeLsum': AggregateScore(low=Score(precision=0.05031446

Because of the very large input size of ~4K tokens in this notebook, it would take a long time to evaluate the whole test dataset. For the sake of this notebook, we'll only evaluate the first 100 examples. Therefore, we cut the 3269 test samples to just 100 samples using 🤗Datasets' convenient `.select()` function.

In [None]:
test_dataset = test_dataset.select(range(100))

Alright, now let's map each sample to the predicted *abstract*. This will take *ca.* 2 hours if you have been given a fast GPU.

In [None]:
result = test_dataset.map(generate_answer)

in this brief reply to a recent letter to [ s.- a. ], we point out that there is an error in [ s.- a. ], and in [ s.- a. ], in [ s.- a. ], and in [ s.- a. ], in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s.- a. ], respectively, in [ s.- a ], and in [ s
in this brief report, we attempt to answer some of the questions raised in the resolution of the following open question : 1. the answer to the question : 1. whether or not the 4th and 5th centuries are incompatible? 2. the answer to the question : 3. the answer to the question : 4. the answer to the question : 5. the answer to the question : 6. the answer to the question : 7. the answer to the question : 8. the answer to the question : 9. the answer to the question : 10. the answer to the question : 11. the answer to the question : 12. the answer to the question : 13. the an

Finally, we can evaluate the predictions using the *rouge* metric.

In [None]:
rouge.compute(predictions=result["predicted_abstract"], references=result["summary"])

{'rouge1': AggregateScore(low=Score(precision=0.19358401223508673, recall=0.1637358247547922, fmeasure=0.14659759828823976), mid=Score(precision=0.2215944078386553, recall=0.18116945112520072, fmeasure=0.1606089260734368), high=Score(precision=0.25085208810960036, recall=0.19983779439288984, fmeasure=0.1741686051795053)),
 'rouge2': AggregateScore(low=Score(precision=0.01993041704405919, recall=0.015102793823906743, fmeasure=0.014237118326248733), mid=Score(precision=0.027535967007829537, recall=0.018797558800413845, fmeasure=0.017330975339198695), high=Score(precision=0.037881973105541346, recall=0.022666397173800142, fmeasure=0.020913443751381874)),
 'rougeL': AggregateScore(low=Score(precision=0.1524727168604346, recall=0.12910489560908717, fmeasure=0.11415284055111366), mid=Score(precision=0.17038074858919028, recall=0.14420899032016238, fmeasure=0.12454701494818621), high=Score(precision=0.19086367927775097, recall=0.1590382977557767, fmeasure=0.1336304316790825)),
 'rougeLsum': A
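
To make the `rouge2` numbers above more concrete, the following sketch computes a bigram-overlap precision, recall, and F-measure for a single prediction and reference. It is a simplified stand-in, not the `rouge_score` implementation: it assumes lowercasing and whitespace tokenization, and the function names and example sentences are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, fmeasure) based on clipped bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(1, sum(pred.values()))
    recall = overlap / max(1, sum(ref.values()))
    if precision + recall == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


# Hypothetical sentences, for illustration only.
precision, recall, fmeasure = rouge2(
    "the bill amends the tax code",
    "the bill amends the internal revenue code",
)
print(f"precision={precision:.3f} recall={recall:.3f} fmeasure={fmeasure:.3f}")
```

Repetitive generations like the ones printed earlier share few bigrams with the reference, which is one way very low Rouge-2 values can arise.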

In [None]:
!pip install git+https://github.com/google-research/bleurt.git
bleurt = load_metric("bleurt")

For our 100 samples, we get a *Rouge-2* F-measure of about **1.7** (the `mid` `fmeasure` of 0.0173 reported above, times 100).
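
As a small sketch of how a headline score can be read off the `rouge.compute` output, the snippet below takes the `mid` F-measure and scales it to a percentage. The two namedtuples are local stand-ins that mimic the shape of the `Score` and `AggregateScore` objects shown in the output, the `rouge2` values are copied from that output, and `as_percent` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-ins mirroring the structure of the objects printed by rouge.compute.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])

# rouge2 values copied from the evaluation output on the 100 samples.
rouge2 = AggregateScore(
    low=Score(0.01993041704405919, 0.015102793823906743, 0.014237118326248733),
    mid=Score(0.027535967007829537, 0.018797558800413845, 0.017330975339198695),
    high=Score(0.037881973105541346, 0.022666397173800142, 0.020913443751381874),
)


def as_percent(aggregate):
    """Return the mid F-measure as a percentage, rounded to two decimals."""
    return round(aggregate.mid.fmeasure * 100, 2)


print(as_percent(rouge2))
```

The `low` and `high` entries bound the `mid` estimate, so reporting them alongside the headline number gives a sense of its uncertainty on such a small sample.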

**Note**: As stated in the [official paper](https://arxiv.org/pdf/2007.14062.pdf), *BigBirdPegasus* achieves a state-of-the-art **20.65** Rouge-2 score on PubMed. The score in this notebook is far lower, most likely because the `google/bigbird-pegasus-large-bigpatent` checkpoint was not fine-tuned on BillSum and, as the samples printed above show, produces repetitive summaries on this data. In addition, a different `length_penalty` is used for generation and data pre-processing is kept as simple as possible (no "*newline*" removal and no space removal before special tokens).

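In case you want to evaluate [`google/bigbird-pegasus-large-arxiv`](https://huggingface.co/google/bigbird-pegasus-large-arxiv) on the `arxiv` configuration of [`scientific_papers`](https://huggingface.co/datasets/scientific_papers), pass `"scientific_papers"` as the first argument to `load_dataset`, set `DATASET_NAME` to `arxiv` and `MODEL_ID` to the arXiv checkpoint in the cells above, and use the `article` and `abstract` columns in place of `text` and `summary`.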

In [None]:
billsum_rouge_result = rouge.compute(predictions=result["predicted_abstract"], references=result["summary"])
billsum_bleurt_score = bleurt.compute(predictions=result["predicted_abstract"], references=result["summary"])

In [None]:
import pandas as pd

In [None]:
# Save the ROUGE aggregate scores (one column per ROUGE variant).
rouge_dataframe = pd.DataFrame(billsum_rouge_result)
rouge_dataframe.to_csv('/content/billsum_rouge_result_beam.csv', index=False)
# Save the BLEURT scores.
bleurt_dataframe = pd.DataFrame(billsum_bleurt_score)
bleurt_dataframe.to_csv('/content/billsum_bleurt_score_beam.csv', index=False)
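
Writing the ROUGE result with `index=False` drops the information about which row is the `low`, `mid`, or `high` bound. One possible alternative, sketched here with the standard library only, is to flatten the result into one labeled row per metric and bound before saving. The namedtuples are local stand-ins for the objects returned by the metric, the example values are hypothetical, and `rouge_rows` is an illustrative helper name.

```python
import csv
import io
from collections import namedtuple

# Stand-ins mirroring the structure of the objects returned by the metric.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])


def rouge_rows(result):
    """Flatten {metric: AggregateScore} into one dict per (metric, bound)."""
    rows = []
    for metric, aggregate in result.items():
        for bound in ("low", "mid", "high"):
            score = getattr(aggregate, bound)
            rows.append({"metric": metric, "bound": bound, **score._asdict()})
    return rows


# Hypothetical values, for illustration only.
example_result = {
    "rouge2": AggregateScore(
        low=Score(0.0, 0.0, 0.0),
        mid=Score(0.1, 0.2, 0.15),
        high=Score(0.2, 0.4, 0.3),
    )
}

# Write to an in-memory buffer; a real file handle works the same way.
buffer = io.StringIO()
fieldnames = ["metric", "bound", "precision", "recall", "fmeasure"]
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rouge_rows(example_result))
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting table has one row per metric and bound, which keeps the labels attached to the numbers when the file is loaded again.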