<a href="https://colab.research.google.com/github/danielsaggau/deep_unsupervised_learning/blob/main/big_patent_Bigbird_Pegasus_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Evaluate 🤗's BigBirdPegasus on BigPatent**

In this notebook, we evaluate BigBird on the long-range summarization task of **[big_patent](https://huggingface.co/datasets/big_patent)**. BigBird was introduced in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by *Manzil Zaheer et al.* It achieves strong performance on long-document summarization using an efficient block sparse attention mechanism. Please refer to this [blog post](https://huggingface.co/blog/big-bird) for a detailed explanation of BigBird's block sparse attention.

Let's see what GPU we got. We need at least ~12 GB GPU memory to be able to run this notebook.

In [2]:
!nvidia-smi

Tue Sep 14 11:35:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Let's first install `datasets`, `rouge_score`, `transformers`, `sentencepiece`, `bleurt` and `bert_score`.

In [1]:
%%capture
!pip3 install datasets
!pip3 install rouge_score
!pip3 install git+https://github.com/huggingface/transformers
!pip3 install sentencepiece
!pip install git+https://github.com/google-research/bleurt.git
!pip install bert_score

As mentioned above, we will evaluate **BigBirdPegasus** on the **_big_patent_** dataset using the **Rouge-2** metric, and additionally score the summaries with **BLEURT**. For this, let's import the two loading functions `load_dataset` and `load_metric`. Further, we import `BigBirdPegasusForConditionalGeneration` and `AutoTokenizer`.

In [3]:
from datasets import load_dataset, load_metric
import torch
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer

In [4]:
DATASET_NAME = "big_patent"
DEVICE = "cuda"
CACHE_DIR = DATASET_NAME
# The checkpoint name has no underscore, so it cannot be derived from DATASET_NAME.
MODEL_ID = "google/bigbird-pegasus-large-bigpatent"

In [18]:
test_dataset = load_dataset(DATASET_NAME, "all", split="test", cache_dir=CACHE_DIR)
test_dataset

Downloading and preparing dataset big_patent/all (download: 6.01 GiB, generated: 24.17 GiB, post-processed: Unknown size, total: 30.17 GiB) to big_patent/big_patent/all/1.0.0/efa16ff728ce0a1726ef8a0faeb0376331093f8fff41cf4cfaccc11d9cdb442d...


Downloading: 0.00B [00:00, ?B/s]

  0%|          | 0/3 [00:00<?, ?it/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset big_patent downloaded and prepared to big_patent/big_patent/all/1.0.0/efa16ff728ce0a1726ef8a0faeb0376331093f8fff41cf4cfaccc11d9cdb442d. Subsequent calls will reuse this data.


Dataset({
    features: ['description', 'abstract'],
    num_rows: 67072
})

In [24]:
tokenizer = AutoTokenizer.from_pretrained('google/bigbird-pegasus-large-bigpatent')
model = BigBirdPegasusForConditionalGeneration.from_pretrained('google/bigbird-pegasus-large-bigpatent').to(DEVICE)

Downloading:   0%|          | 0.00/2.15G [00:00<?, ?B/s]

In [25]:
rouge = load_metric('rouge')

Downloading:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [26]:
#!pip install git+https://github.com/google-research/bleurt.git
bleurt = load_metric('bleurt')

Downloading:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

Using default BLEURT-Base checkpoint for sequence maximum length 128. You can use a bigger model for better results with e.g.: datasets.load_metric('bleurt', 'bleurt-large-512').


Downloading:   0%|          | 0.00/405M [00:00<?, ?B/s]

INFO:tensorflow:Reading checkpoint /root/.cache/huggingface/metrics/bleurt/default/downloads/extracted/887f2dc36c17f53c287f696681b8f7c947278407c1cf9f226662e16c8c0dc417/bleurt-base-128.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.


`BigBirdPegasus` makes use of *block sparse attention*. Let's verify the `config`'s attention type and the `block_size`.

In [27]:
model.config.attention_type, model.config.block_size

('block_sparse', 64)
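
To make the `block_size` of 64 concrete: a 4096-token input splits into 4096 / 64 = 64 blocks, and each query block attends to only a few key blocks instead of all of them. The sketch below is a simplified illustration of that idea, not the library's implementation. The helper `attended_blocks`, the two global blocks, the three-block window and the three random blocks are assumptions chosen for illustration.

```python
import random

def attended_blocks(query_block, num_blocks, num_random=3, seed=0):
    """Return the sorted key-block indices one query block attends to (simplified)."""
    # Global blocks: here the first and last block are visible to everyone (assumption).
    attended = {0, num_blocks - 1}
    # Sliding window: the query block itself plus its immediate neighbours.
    for offset in (-1, 0, 1):
        neighbour = query_block + offset
        if 0 <= neighbour < num_blocks:
            attended.add(neighbour)
    # Random blocks: a few extra blocks drawn from the ones not yet covered.
    rng = random.Random(seed + query_block)
    remaining = [b for b in range(num_blocks) if b not in attended]
    attended.update(rng.sample(remaining, min(num_random, len(remaining))))
    return sorted(attended)

seq_len, block_size = 4096, 64
num_blocks = seq_len // block_size  # 64 blocks of 64 tokens each
blocks = attended_blocks(query_block=10, num_blocks=num_blocks)
print(num_blocks, len(blocks), blocks)
```

In this toy setup a query block in the middle of the sequence attends to 8 of the 64 key blocks, which is where the memory saving over full attention comes from.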

Next, we will take a look at the length distribution of the dataset. The following table shows the *median* and the 90% quantile of the article and abstract (summary) lengths.

|                 | Median | 90%-ile |
|-----------------|--------|---------|
| Article length  | 2715   | 6101    |
| Summary length  | 212    | 318     |

`BigBirdPegasus` can handle sequences of up to **4096** tokens, which is significantly more than the median input length of **2715**. However, many input samples are longer than **4096** tokens and consequently need to be truncated.
The summaries have a median length of **212**, with 90% being shorter than **318**. Given this data, 256 seems to be a reasonable choice for the model's maximum generation length.
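
Statistics like the ones in the table can be computed from a list of token counts. The following is a minimal sketch using only the standard library. The helper `length_summary` and the ten token counts are hypothetical and not taken from the dataset; in practice the counts would come from tokenizing each sample.

```python
import math
import statistics

def length_summary(lengths):
    """Return (median, 90th percentile) using the nearest-rank method."""
    ordered = sorted(lengths)
    # Nearest rank: the smallest value with at least 90% of the data at or below it.
    rank = math.ceil(0.9 * len(ordered))
    return statistics.median(ordered), ordered[rank - 1]

# Hypothetical token counts for ten documents (not taken from the dataset).
toy_lengths = [1200, 1800, 2100, 2500, 2700, 2900, 3300, 4100, 5200, 7000]
median, p90 = length_summary(toy_lengths)
# Count how many documents exceed the 4096-token input limit.
truncated = sum(length > 4096 for length in toy_lengths)
print(median, p90, truncated)
```

Counting the samples above 4096 tokens, as in the last step, gives a quick sense of how many inputs will lose content to truncation.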

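Now we can write the evaluation function for BigBirdPegasus.
First, we tokenize each *description* up to a maximum length of 4096 tokens.
We then generate the predicted *abstract*. This first configuration passes `top_p=0.95`, `repetition_penalty=1.1` and `length_penalty=0.8` to `generate()`; a beam search variant (with `num_beams=5` & `length_penalty=0.8`) follows further below. Note that in 🤗 Transformers `top_p` only takes effect when sampling is enabled (`do_sample=True`), which this call does not set explicitly, and `length_penalty` only affects beam-based generation. Finally, the predicted *abstract* tokens are decoded and the resulting string is saved in the batch.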

In [30]:
def generate_answer(batch):
  # Tokenize one description, truncating or padding it to exactly 4096 tokens.
  inputs_dict = tokenizer(batch["description"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
  # Generate at most 256 summary tokens.
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=256, top_p=0.95, repetition_penalty=1.1, length_penalty=0.8)
  # Decode the generated token ids back into a string.
  batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch
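
The combination of `padding="max_length"` and `truncation=True` means every input ends up with exactly `max_length` token ids. The sketch below imitates that behaviour for a plain list of ids. The helper `pad_or_truncate` and the `pad_id` of 0 are assumptions for illustration, and the sketch ignores special tokens, which the real tokenizer handles.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Imitate padding="max_length" with truncation=True for one sequence."""
    clipped = token_ids[:max_length]        # drop tokens beyond the limit
    attention_mask = [1] * len(clipped)     # 1 marks real tokens
    padding = max_length - len(clipped)     # number of pad tokens to append
    return clipped + [pad_id] * padding, attention_mask + [0] * padding

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions of short inputs.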

Let's take 2 samples and verify the predictions to be sure everything works as expected 🙂.

In [31]:
dataset_small = test_dataset.select(range(2))
result_small = dataset_small.map(generate_answer)

rouge.compute(predictions=result_small["predicted_abstract"], references=result_small["abstract"])

  0%|          | 0/2 [00:00<?, ?ex/s]

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


A method and system for cleaning pet appendages including feet, hooves, and limbs using a plurality of flow-through type brushes that are readily transported and stored between uses, readily adapts to specific uses, and environments proximate that treatment surface is not limited.
A method for preparing an oatmeal composition is disclosed. The method includes the steps of hydrating steel cut oats, adding oat bran to the hydrated steel cut oats, granulating the oat bran, adding rolled oats to the granulated oat bran mixture, cooking the mixture, transferring the cooked mixture to a holding reservoir, heating the mixture in the reservoir to cook the rolled oats, and transferring the cooked mixture to a container.


{'rouge1': AggregateScore(low=Score(precision=0.3953488372093023, recall=0.1559633027522936, fmeasure=0.2236842105263158), mid=Score(precision=0.5849983622666229, recall=0.20956059874456784, fmeasure=0.3082706766917293), high=Score(precision=0.7746478873239436, recall=0.2631578947368421, fmeasure=0.3928571428571428)),
 'rouge2': AggregateScore(low=Score(precision=0.19047619047619047, recall=0.07407407407407407, fmeasure=0.10666666666666667), mid=Score(precision=0.24523809523809523, recall=0.08751780626780627, fmeasure=0.1288729016786571), high=Score(precision=0.3, recall=0.10096153846153846, fmeasure=0.1510791366906475)),
 'rougeL': AggregateScore(low=Score(precision=0.32558139534883723, recall=0.12844036697247707, fmeasure=0.1842105263157895), mid=Score(precision=0.416311824434982, recall=0.15034458540011414, fmeasure=0.22067669172932333), high=Score(precision=0.5070422535211268, recall=0.1722488038277512, fmeasure=0.2571428571428572)),
 'rougeLsum': AggregateScore(low=Score(precision

In [32]:
bleurt.compute(predictions=result_small["predicted_abstract"], references=result_small["abstract"])

{'scores': [-0.6322981715202332, 0.19446244835853577]}
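
BLEURT returns one score per prediction/reference pair, so a single figure for a run requires aggregating them. A minimal sketch, assuming the mean is the aggregate of interest, using the two scores printed above:

```python
import statistics

# Per-example BLEURT scores copied from the output above.
bleurt_output = {"scores": [-0.6322981715202332, 0.19446244835853577]}

# Average the per-example scores into one summary value.
mean_score = statistics.mean(bleurt_output["scores"])
print(round(mean_score, 4))
```

With only two samples the mean says little; it becomes more informative on the 600-sample run.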

**Note**: A package error can occur here because of custom tokens; using an older version of the package might resolve it.

Because of the very large input size of ~4K tokens in this notebook, it would take a very long time to evaluate the whole test dataset. For the sake of this notebook, we'll only evaluate the first 600 examples. Therefore, we cut the 67,072 test samples down to just 600 using 🤗Datasets' convenient `.select()` function.

In [34]:
test_dataset = test_dataset.select(range(600))

Alright, now let's map each sample to the predicted *abstract*. This will take *ca.* 2 hours if you have been given a fast GPU.
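
As a rough back-of-the-envelope check of why only a subset is evaluated, the quoted runtime can be extrapolated to the full test split. This assumes the per-example cost stays constant, which is only an approximation.

```python
SUBSET_SIZE = 600        # examples evaluated in this notebook
SUBSET_HOURS = 2.0       # rough runtime quoted above
FULL_TEST_SIZE = 67072   # num_rows of the test split printed earlier

# Average cost per example, then scale it to the whole split.
seconds_per_example = SUBSET_HOURS * 3600 / SUBSET_SIZE
full_hours = seconds_per_example * FULL_TEST_SIZE / 3600
print(f"{seconds_per_example:.0f} s/example, ~{full_hours:.0f} h for the full split")
```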

# Generate summaries

In [37]:
result = test_dataset.map(generate_answer)

A soil opener assembly for an agricultural implement includes a hub and a plurality of disc sectors that interfit with the hub to form a substantially continuous disc about the hub. The hub presents a radially inner shoulder surface and a radially outer shoulder surface, wherein the radially inner shoulder surface and the radially outer shoulder surface cooperate to restrict axial movement of the disc sectors relative to the hub. The hub presents a tongue - and-groove connection between the radially inner shoulder surface and the radially outer shoulder surface, wherein the tongue - and-groove connection is configured to restrict lateral movement of the disc sectors relative to the hub.
The present invention is directed to methods of treating symptoms, pathologies or diseases characterized by reduced levels of dopamine in a patients brain, including neurological or movement disorders such as restless leg syndrome, parkinson &#34; s disease and secondary parkinsonism, huntingdon &#34; s

  0%|          | 0/600 [00:00<?, ?ex/s]

A method and system for cleaning pet appendages including feet, hooves, and limbs using a plurality of flow-through type brushes that are readily transported and stored between uses, readily adapts to specific uses, and environments proximate that treatment surface is not limited.
A method for preparing an oatmeal composition is disclosed. The method includes the steps of hydrating steel cut oats, adding oat bran to the hydrated steel cut oats, granulating the oat bran, adding rolled oats to the granulated oat bran mixture, cooking the mixture, transferring the cooked mixture to a holding reservoir, heating the mixture in the reservoir to cook the rolled oats, and transferring the cooked mixture to a container.
The trunk rotation conditioning device of this invention provides the following. the user is in a weight bearing position that simulates a stance in many sports. the angle of the inclination is adjustable about a pivot to accommodate individual variation in the standing position

Finally, we can evaluate the predictions using the *rouge* metric.

# Evaluation via ROUGE

In [None]:
result

Dataset({
    features: ['description', 'abstract', 'predicted_abstract'],
    num_rows: 600
})

In [38]:
rouge_result = rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])
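
For intuition about what the Rouge-2 number measures, here is a minimal sketch of bigram overlap between a prediction and a reference. It is a simplification: the helper names are hypothetical, tokenization is a plain whitespace split, and the bootstrap aggregation into low/mid/high values seen in the output of `rouge.compute()` is left out.

```python
from collections import Counter

def bigrams(text):
    """Count the adjacent word pairs of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2(prediction, reference):
    """Return (precision, recall, f1) over overlapping bigrams."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped bigram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = rouge2("the cat sat on the mat", "the cat lay on the mat")
print(p, r, f)
```

Precision penalizes bigrams in the prediction that the reference lacks, recall penalizes reference bigrams the prediction misses, and the F-measure balances the two.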

# Evaluation via BLEURT

In [39]:
score = bleurt.compute(predictions=result["predicted_abstract"], references=result["abstract"])

In [40]:
import pandas as pd
dataframe = pd.DataFrame(rouge_result)
dataframe.to_csv('/content/bigpatent_rouge_result.csv', index = False)

In [41]:
dataframe = pd.DataFrame(score)
dataframe.to_csv('/content/bigpatent_bleurt_result.csv', index = False)

In [19]:
dataframe = pd.DataFrame(result)
dataframe.to_csv('/content/bigpatent_result.csv', index = False)

For our 600 samples, we get a *Rouge-2* score of **19.6** 🔥🔥🔥.
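
A Rouge-2 figure like this is usually read from the `mid` F-measure of the aggregate and scaled to a percentage; that this is how the number above was obtained is an assumption. The sketch below uses stand-in namedtuples that mirror the structure printed by `rouge.compute()`, filled with the values from the two-sample check earlier in this notebook.

```python
from collections import namedtuple

# Stand-ins mirroring the structure printed by rouge.compute() above.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])

# ROUGE-2 values copied from the two-sample check earlier in this notebook.
rouge2 = AggregateScore(
    low=Score(0.19047619047619047, 0.07407407407407407, 0.10666666666666667),
    mid=Score(0.24523809523809523, 0.08751780626780627, 0.1288729016786571),
    high=Score(0.3, 0.10096153846153846, 0.1510791366906475),
)

# Report the mid F-measure as a percentage with two decimals.
rouge2_percent = round(rouge2.mid.fmeasure * 100, 2)
print(rouge2_percent)
```

With the real result object, the same value would be read as `rouge_result["rouge2"].mid.fmeasure`.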

**Note**: As stated in the [official paper](https://arxiv.org/pdf/2007.14062.pdf), *BigBirdPegasus* achieves a new state-of-the-art of **20.65** Rouge-2 score on PubMed, whereas this notebook evaluates on *big_patent*, so the two figures are not directly comparable. Evaluation in this notebook might also be slightly worse since a different `length_penalty` is used for generation and data pre-processing is kept as simple as possible (no "*newline*" removal and space removal before special tokens).

In [44]:
def generate_answer(batch):
  inputs_dict = tokenizer(batch["description"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
  # Beam search with 5 beams; length_penalty rescales the beam scores by length.
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=256, num_beams=5, length_penalty=0.8)
  batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch
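
To see what `length_penalty` does in beam search: finished hypotheses are ranked by their summed log-probability divided by their length raised to the penalty. The sketch below follows that formula as I understand it from 🤗 Transformers; the exact length that is counted may differ between library versions. The helper `beam_score` and both hypotheses are made up for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalised score: sum of log-probs divided by length ** penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished hypotheses: (sum of token log-probs, length in tokens).
short_hyp = (-4.0, 5)
long_hyp = (-6.0, 10)

for penalty in (0.0, 0.8):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.3f}, long={long_score:.3f}, winner={winner}")
```

Without normalisation the shorter hypothesis wins simply because it accumulates fewer negative log-probabilities; with a penalty of 0.8 the longer one is ranked higher in this example.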

In [45]:
result = test_dataset.map(generate_answer)

  0%|          | 0/600 [00:00<?, ?ex/s]

A method and system for cleaning pet appendages including feet, hooves, and limbs using a plurality of flow-through type brushes is disclosed.
A method for preparing an oatmeal composition is disclosed. The method includes the steps of hydrating steel cut oats, adding oat bran to the hydrated steel cut oats, adding rolled oats to the mixture of steel cut oats, oat bran, and rolled oats, and cooking the mixture.
The trunk rotation conditioning device of this invention provides the following. the user is in a weight bearing position that simulates a stance in many sports. the angle of the inclination is adjustable about a pivot to accommodate individual variation in the standing position. In the preferred embodiment of a golf exercise apparatus, the device provides resistance during an exercise emulating a golf swing of a golfer to strengthen muscles of the axial skeleton and lower extremities of the performing golfer.
The present invention provides an electrolyte gel based on a crosslin

In [48]:
bigpatent_beam_rouge_result = rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])
bigpatent_beam_bleurt_score = bleurt.compute(predictions=result["predicted_abstract"], references=result["abstract"])

In [49]:
dataframe = pd.DataFrame(bigpatent_beam_rouge_result)
dataframe.to_csv('/content/bigpatent_beam_rouge_result.csv', index = False)
dataframe = pd.DataFrame(bigpatent_beam_bleurt_score)
dataframe.to_csv('/content/bigpatent_beam_bleurt_score.csv', index = False)

In [50]:
dataframe = pd.DataFrame(result)
dataframe.to_csv('/content/bigpatent_beam_result.csv', index = False)

In [51]:
def generate_answer(batch):
  inputs_dict = tokenizer(batch["description"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
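  # Same as the first configuration, but without repetition_penalty.
  # Note: top_p only applies when sampling is enabled (do_sample=True), which is not set here.
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=256, top_p=0.95, length_penalty=0.8)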
  batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch
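
For reference, this is what `top_p` (nucleus) filtering does when sampling is enabled: it keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p` and renormalises over that set. A minimal sketch over a hypothetical four-token distribution; the helper `top_p_filter` is not part of any library.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    # Visit token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalise so the kept probabilities sum to 1 before sampling from them.
    return {i: probs[i] / total for i in kept}

toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
nucleus = top_p_filter(toy_probs, top_p=0.9)
print(sorted(nucleus))
```

The least likely token is dropped here, and the next token would be sampled from the remaining three.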

In [52]:
result = test_dataset.map(generate_answer)

  0%|          | 0/600 [00:00<?, ?ex/s]

A method and system for cleaning pet appendages including feet, hooves, and limbs using a plurality of flow-through type brushes is disclosed.
A method for preparing an oatmeal composition is disclosed. The method includes the steps of hydrating steel cut oats, adding oat bran to the hydrated steel cut oats, adding rolled oats to the mixture of steel cut oats, oat bran, and rolled oats, and cooking the mixture.
The trunk rotation conditioning device of this invention provides the following. the user is in a weight bearing position that simulates a stance in many sports. the angle of the inclination is adjustable about a pivot to accommodate individual variation in the standing position. In the preferred embodiment of a golf exercise apparatus, the device provides resistance during an exercise emulating a golf swing of a golfer to strengthen muscles of the axial skeleton and lower extremities of the performing golfer.
The present invention provides an electrolyte gel based on a crosslin

In [53]:
bigpatent_nopen_rouge_result = rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])
bigpatent_nopen_bleurt_score = bleurt.compute(predictions=result["predicted_abstract"], references=result["abstract"])

In [54]:
dataframe = pd.DataFrame(bigpatent_nopen_rouge_result)
dataframe.to_csv('/content/bigpatent_nopen_rouge_result.csv', index = False)
dataframe = pd.DataFrame(bigpatent_nopen_bleurt_score)
dataframe.to_csv('/content/bigpatent_nopen_bleurt_score.csv', index = False)