## Finetune **Longformer Encoder-Decoder (LED)** on 8K Tokens

In [None]:
# This snippet is needed in each notebook to mount Google Drive.
from google.colab import drive  
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


The *Longformer Encoder-Decoder (LED)* was recently added as an extension to [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.

In this notebook we will finetune *LED* for summarization on **[Pubmed](https://huggingface.co/datasets/viewer/?dataset=scientific_papers)**. *Pubmed* is a long-range summarization dataset, which makes it a good candidate for LED. LED will be finetuned up to an input length of 8K tokens on a single GPU.

We will leverage 🤗`Seq2SeqTrainer`, gradient checkpointing and as usual 🤗`datasets`.

First, let's try to get a GPU with at least 15GB RAM.

In [None]:
# crash colab to get more RAM
# !kill -9 -1

To check that we have enough GPU RAM, we can run the following command.
If the randomly allocated GPU is too small, the cell above can be run
to crash the notebook in the hope of getting a better GPU.

In [None]:
!nvidia-smi

Wed Jun 30 22:43:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
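
The same check can also be scripted. The following is a minimal sketch that is not part of the original notebook: the helper names `parse_total_memory_mib` and `has_enough_memory`, as well as the 15000 MiB threshold standing in for "at least 15GB", are assumptions made for illustration. It reads the total memory from the `Memory-Usage` column of an `nvidia-smi` table row like the one shown above.

```python
import re


def parse_total_memory_mib(nvidia_smi_text):
    """Return the total memory (in MiB) of the first GPU found in the text."""
    # The Memory-Usage column has the form "0MiB / 16160MiB" (used / total).
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_text)
    if match is None:
        raise ValueError("no 'used / total' memory entry found")
    return int(match.group(2))


def has_enough_memory(total_mib, required_mib=15000):
    # required_mib is an assumed threshold, not a value from the notebook.
    return total_mib >= required_mib


# One row copied from the nvidia-smi output above.
sample_row = "| N/A   35C    P0    26W / 300W |      0MiB / 16160MiB |      0%      Default |"
total_mib = parse_total_memory_mib(sample_row)
enough = has_enough_memory(total_mib)
print(total_mib, enough)
```

In a live session, the text could instead be obtained with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.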

Next, we install 🤗Transformers, 🤗Datasets, and `rouge_score`.



In [None]:
%%capture
!pip install datasets==1.2.1
!pip install transformers==4.2.0
!pip install rouge_score

Let's start by loading and preprocessing the dataset.



In [None]:
from datasets import load_dataset, load_metric

Next, we download the pubmed train and validation dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/scientific_papers)). This can take a couple of minutes **☕** .

In [None]:
train_dataset = load_dataset("scientific_papers", "pubmed", split="train")

Downloading and preparing dataset scientific_papers/pubmed (download: 4.20 GiB, generated: 2.33 GiB, post-processed: Unknown size, total: 6.53 GiB) to /root/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc...



Dataset scientific_papers downloaded and prepared to /root/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc. Subsequent calls will reuse this data.


In [None]:
val_dataset = load_dataset("scientific_papers", "pubmed", split="validation")

Reusing dataset scientific_papers (/root/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc)


It's always a good idea to take a look at some data samples. Let's do that here.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=4):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(train_dataset, num_examples=1)

Unnamed: 0,abstract,article,section_names
0,"objective(s):atherosclerosis is an important risk factor for coronary heart disease . \n neuropeptide y ( npy ) and its receptors , located in peripheral tissue such as white adipose tissue , have been linked to obesity and fat storage . \n the role of npy in atherosclerosis has not yet been fully studied , so this study was conducted to further investigate the effect of biie 0246 , an npy receptor antagonist , on aortic intima - media thickness and size and number of adipocyte cells in normal and obese mice.materials and methods : tests were performed on 24 male c57bl/6 mice . \n the animals were divided into four groups as follows : control ( normal ) , obese ( high - fat diet ) , normal+npy receptor antagonist ( 1 m , 100 l / kg biie0246 intraperitoneally ) and obese+npy receptor antagonist ( n=6 each ) . \n after 14 days , the animals were sacrificed and epididymal adipose tissue and thoracic aorta were removed . \n evaluations were made for adipocyte cell number and size and for aortic intima - media thickness.results:the group on a high - fat diet showed a significantly decreased number of adipocyte cells and increased cell size ( p<0.05 ) . 
\n biie0246 application changed the cell number of adipocyte in normal mice ( p=0.05 ) ; however , it did not change adipocyte cell size and aortic intima - media thickness in obese and normal mice ( p>0.05).conclusion : npy receptor antagonist had no effect on adipocyte cell size and aortic intima - media thickness ; however , it decreased cell number in the normal group indicating likely involvement in the progression of obesity .","atherosclerosis is a primary risk factor for coronary heart disease that affects peripheral arteries and cerebral circulation .\natherosclerosis begins with the transmigration of oxidized ldls to the intima of the subendothelial space , and thereby causes injury to endothelial cells ( 1 ) .\nintima - media ( i m ) thickness is a noninvasive alternative marker and intermediate phenotype of atherosclerotic disease that has been used extensively since 1986 following the initial description by pignoli et al ( 2 ) .\ni m thickness of the carotid artery is an established sonographic marker for early atherosclerosis , and thickening of the i m complex reflects generalized atherosclerosis .\ndetermination of i m thickness is noninvasive , reproducible and has no side effects , so it is considered a superior method for assessment of coronary anatomy ( 3 ) .\nthere are numerous risk factors for atherosclerosis including hypertension , hyperlipidemia , smoking and obesity .\nit has been reported that obesity is the fifth leading risk factor for death globally .\nobesity is associated with metabolic dysfunction and multiple tissue system participation from conditions such as heart diseases , diabetes , dyslipidemia and many tumors ( 4 ) .\nboth , obesity and carotid i m thickness are interrelated ; in addition , they are very important risk factors for ischemic stroke ( 5 ) .\nobesity is proliferation of white adipose tissue ( wat ) mass that happens via a surge in cell size ( hypertrophy ) and/or an increase in cell number ( hyperplasia ) 
.\nhypothalamic neuropeptide y ( npy ) has been related to fat cell hypertrophy by exciting lipoprotein lipase activity in wat ( 6 , 7 ) .\nnpy , a 36-amino - acid neuropeptide is known as the most potent physiological appetite transducer .\nnpy , peptide yy ( pyy ) and pancreatic polypeptide are part of the npy family of peptides and affect food intake by interacting with g - protein - coupled y receptors ( 8) . centrally infused npy induce obesity in the long term , and research has determined that increasing concentrations of hypothalamic npy seem to be a major factor contributing to the onset of obesity in several obese animal models ( 9 ) .\nnpy is extensively expressed in both the brain and the peripheral nervous system . within the brain ,\nnpy is highly active in the hypothalamus , particularly in the arcuate nucleus ( 10 ) . upon stimulation\n, npy activates its y receptors to motivate circuits that increase food intake and fat storage ( 11 ) .\nnpy and its receptors are located in peripheral tissue such as wat , liver , and pancreas , but the role of npy and its receptors in the regulation of energy homeostasis has not been fully identified . 
in the present study ,\nthe effect of biie 0246 , an npy receptor antagonist , was investigated on adipocyte size and cell number and i m thickness in normal and diet - induced obese mice .\na total of 24 male mice c57bl/6 , weighing 20 to 30 g , and 5 weeks old were purchased from the pasteur institute of iran .\nanimals were housed in cages , four in each cage in the animal facility under the following conditions : temperature 25 c 2 and lighting cycle of 12 hr ( 6:00 am to 6:00 pm ) with access to food and water ad libitum .\nafter an adaptation period of 1 week , animals were assigned to four groups : obese , normal , obese+npy y2 receptor antagonist and normal+npy receptor antagonist ( n=6 ) .\nthe ethical committee of the isfahan university of medical sciences ( isfahan , iran ) approved the protocol for the study .\nfor induction of diet - induced obesity , the obese groups were fed with a commercial high fat diet ( hfd ; bioserv co. , cat # f3282 , usa , protein 20.5% , fat 36% , carbohydrate 35.7% ) for 16 weeks ( 12 ) .\nthe normal groups were fed with standard mouse chow ( purchased from the pasteur institute ) .\nnpy antagonist ( biie0246 ) was obtained from tocris co. ( bristol , uk ) , and to block y2 receptors in melanoma tumor , animals were treated with biie0246 at concentration 10 m and received 100 l / kg for 14 days intraperitoneal injection ( 18 ) .\nafter 14 days , animals were sacrificed and epididymal wat and thoracic aorta were removed .\nthen , they were dehydrated and embedded in paraffin . 
after that they were removed from paraffin and cleaned with xylene and hydrated with decreasing concentrations of ethanol .\ntissue blocks were sectioned into 5 m thickness and stained with hematoxylin and eosin ( h&e ) .\nadipocyte cell number was counted in 5 different fields through the camera of a light microscope equipped with computerized image analysis software ( advanced motic image 3.2 ) .\nthe size of adipocytes was determined by analyzing the cross - sectional area of white adipose tissue with the axiovision 4.6 ( zeiss ) software .\nrecords were taken for diameter of adipocyte cells ( 10 cells from each specimen ) .\naortic imt was measured from the endothelial surface to the adventitia in 13 different fields of samples from each animal ( 13 ) .\nimages of five fields per section from each animal were captured with 40x magnification , and the adipocyte cell surface areas ( h / e ) were measured from at least 100 cells .\ndata were expressed as means se and evaluated using analysis of variance ( anova ) with a post hoc test , lsd .\na total of 24 male mice c57bl/6 , weighing 20 to 30 g , and 5 weeks old were purchased from the pasteur institute of iran .\nanimals were housed in cages , four in each cage in the animal facility under the following conditions : temperature 25 c 2 and lighting cycle of 12 hr ( 6:00 am to 6:00 pm ) with access to food and water ad libitum .\nafter an adaptation period of 1 week , animals were assigned to four groups : obese , normal , obese+npy y2 receptor antagonist and normal+npy receptor antagonist ( n=6 ) .\nthe ethical committee of the isfahan university of medical sciences ( isfahan , iran ) approved the protocol for the study .\nfor induction of diet - induced obesity , the obese groups were fed with a commercial high fat diet ( hfd ; bioserv co. , cat # f3282 , usa , protein 20.5% , fat 36% , carbohydrate 35.7% ) for 16 weeks ( 12 ) . 
the normal groups were fed with standard mouse chow ( purchased from the pasteur institute ) .\nnpy antagonist ( biie0246 ) was obtained from tocris co. ( bristol , uk ) , and to block y2 receptors in melanoma tumor , animals were treated with biie0246 at concentration 10 m and received 100 l / kg for 14 days intraperitoneal injection ( 18 ) .\nafter 14 days , animals were sacrificed and epididymal wat and thoracic aorta were removed .\nthen , they were dehydrated and embedded in paraffin . after that they were removed from paraffin and cleaned with xylene and hydrated with decreasing concentrations of ethanol .\ntissue blocks were sectioned into 5 m thickness and stained with hematoxylin and eosin ( h&e ) .\nadipocyte cell number was counted in 5 different fields through the camera of a light microscope equipped with computerized image analysis software ( advanced motic image 3.2 ) .\nthe size of adipocytes was determined by analyzing the cross - sectional area of white adipose tissue with the axiovision 4.6 ( zeiss ) software .\nrecords were taken for diameter of adipocyte cells ( 10 cells from each specimen ) .\naortic imt was measured from the endothelial surface to the adventitia in 13 different fields of samples from each animal ( 13 ) .\nimages of five fields per section from each animal were captured with 40x magnification , and the adipocyte cell surface areas ( h / e ) were measured from at least 100 cells .\ndata were expressed as means se and evaluated using analysis of variance ( anova ) with a post hoc test , lsd .\nresults showed that the body weight of mice fed with the high - fat diet was significantly higher than that of mice fed with normal chow ( 341.11 g vs 25.661.41 g , respectively ; p<0.05 ) ( figure 2 ) .\nimages of normal ( a ) and obese ( b ) male mice showing increase in body size and reposition of fat in the peritoneal cavity in obese animals body weight gain in normal and obese group results showed significant difference in adipocyte 
cell number of the obese group and the normal groups ( 205.77 vs. 1061 number / field , respectively ; p<0.05 ) .\nadministration of npy y receptor antagonist reduced adipocyte cell number in the obese group ( 205.77 vs. 16.751.65 number / field ) ; however , results were not statistically significant ( p>0.05 ) . but cell number was reduced in normal groups ( 1061 vs. 69.08.0 number / field ; p<0.05 ) ( figures 3a and b ) .\na : normal ; b : obese ; c : normal+ npy receptor antagonist ; d : obese+ npy receptor antagonist .\nhigh - fat diet loading significantly decreased wat cell number ( b ) and increased the size of epididymal adipocyte cells ( c ) .\nnpy receptor antagonist administration changed the cell number of epididymal adipocyte in normal group but did not change adipocyte cell size in obese group ( b&c ) .\nn : normal , o : obese adipocyte cell size in white adipose tissue in obese animals was significantly higher than in the normal group ( 900.581.54 vs. 404.454.8m , respectively ; p<0.05 ) .\nneuropeptide y antagonist did not alter adipocyte cell size in obese groups ( 900.581.54 vs. 865.2146.45m ; p>0.05 ) and normal groups ( 404.454.8 vs. 491.75.1 m ; p > 0.05 ) ( figure 3a & c ) .\neffect of neuropeptide y y2 receptor antagonist on aortic i m thickness figure 4 ( a - d ) shows aortic rings from normal and obese mice stained by h&e .\naortic i m thickness was significantly greater in animals in the obese group compared to those in the normal group ( 603.5519.44 m vs. 380.852.3 m , respectively ; p<0.05 ) .\nadministration of neuropeptide y antagonist did not alter aortic intima - media thickness in the obese group ( 528.675.36 vs. 450.5634.21 m ; p>0.05 ) and normal groups ( 380.852.3 vs. 459.4312.74 m ; p>0.05 ) ( figures\neffect of high - fat diet loading and npy receptor antagonist administration on aortic i m thickness .\na. 
the histological sections of aorta were stained with hematoxylin & eosin ( a - d ) ; a : normal ; b : obese ; c : normal+npy receptor antagonist ; d : obese+ npy receptor antagonist .\nnpy receptor antagonist administration did not alter aortic i m thickness in obese and normal group ( b ) .\nresults showed that the body weight of mice fed with the high - fat diet was significantly higher than that of mice fed with normal chow ( 341.11 g vs 25.661.41 g , respectively ; p<0.05 ) ( figure 2 ) .\nimages of normal ( a ) and obese ( b ) male mice showing increase in body size and reposition of fat in the peritoneal cavity in obese animals body weight gain in normal and obese group\nresults showed significant difference in adipocyte cell number of the obese group and the normal groups ( 205.77 vs. 1061 number / field , respectively ; p<0.05 ) . administration of npy y receptor antagonist reduced adipocyte cell number in the obese group ( 205.77 vs. 16.751.65 number / field ) ; however , results were not statistically significant ( p>0.05 ) . but cell number was reduced in normal groups ( 1061 vs. 69.08.0 number / field ; p<0.05 ) ( figures 3a and b ) .\na : the histological sections were stained with hematoxylin & eosin . a : normal ; b : obese ; c : normal+ npy receptor antagonist ; d : obese+ npy receptor antagonist .\nhigh - fat diet loading significantly decreased wat cell number ( b ) and increased the size of epididymal adipocyte cells ( c ) .\nnpy receptor antagonist administration changed the cell number of epididymal adipocyte in normal group but did not change adipocyte cell size in obese group ( b&c ) .\nadipocyte cell size in white adipose tissue in obese animals was significantly higher than in the normal group ( 900.581.54 vs. 404.454.8m , respectively ; p<0.05 ) .\nneuropeptide y antagonist did not alter adipocyte cell size in obese groups ( 900.581.54 vs. 865.2146.45m ; p>0.05 ) and normal groups ( 404.454.8 vs. 
491.75.1 m ; p > 0.05 ) ( figure 3a & c ) .\neffect of neuropeptide y y2 receptor antagonist on aortic i m thickness figure 4 ( a - d ) shows aortic rings from normal and obese mice stained by h&e .\naortic i m thickness was significantly greater in animals in the obese group compared to those in the normal group ( 603.5519.44 m vs. 380.852.3 m , respectively ; p<0.05 ) .\nadministration of neuropeptide y antagonist did not alter aortic intima - media thickness in the obese group ( 528.675.36 vs. 450.5634.21 m ; p>0.05 ) and normal groups ( 380.852.3 vs. 459.4312.74 m ; p>0.05 ) ( figures 4a & b ) .\neffect of high - fat diet loading and npy receptor antagonist administration on aortic i m thickness .\na. the histological sections of aorta were stained with hematoxylin & eosin ( a - d ) ; a : normal ; b : obese ; c : normal+npy receptor antagonist ; d : obese+ npy receptor antagonist .\nnpy receptor antagonist administration did not alter aortic i m thickness in obese and normal group ( b ) .\nthe aim of this study was to investigate the effect of npy receptor antagonist on wat characteristics including the number and size of adipocyte cells and aortic intima - media thickness in normal and diet - induced obese mice .\ntest results showed that biie0246 , an npy receptor antagonist , had no significant effect on aortic intima - media thickness and cell size in normal and obese animals .\nresearch has shown that obesity is a condition that indirectly leads to the process of atherosclerosis , and its relation with other markers of atherosclerosis is to be evaluated ( 14 ) .\ndysfunctional adipocytes contribute to the development of vascular risk factors and vascular disease ( 15 ) .\nresults of the present study showed that administration of biie0246 did not alter body weight and i m thickness of aorta in normal and obese groups .\nnpy is a polypeptide containing 36 amino acids that has proven to be one of the most important regulators of energy homeostasis , thus it 
seems to be a therapeutic target for the management of disorders such as obesity .\ny1 , y2 , y4 , y5 and y6 have distinctive tissue expression shape ( 16 ) .\nnpy neurons exhibited increased adipose mass , and greater muscle protein expression of phosphorylated acetyl - coa carboxylase , a key enzyme in fatty acid synthesis , demonstrating the obesogenic effect of selective blockade of y2 receptor signaling in npy neurons ( 17 ) . in this study , biie0246 was used as a selective npy2 receptor antagonist .\nit was also determined that biie0246 reduced adipocyte cell number only in normal mice without affecting adipocyte cell size in obese and normal mice .\nhyperplasia ( cell number increase ) and hypertrophy ( cell size increase ) are two possible growth mechanisms ( 18 ) .\nhypertrophy happens prior to hyperplasia to meet the need for additional fat storage capacity in the development of obesity ( 19 ) .\nthe importance of y2 receptor agonists in the reduction of food intake and obesity is controversial in that some studies reported that these peptides may not produce a continued reduction of feeding in rodents ( 20 ) or primates ( 21 ) , while other studies have supported the role of y2 receptor activation in decreased body weight and confirmed that it has anti - obesity potential .\nnaveilhan et al showed that the germline y2 receptor of the knockout mouse increased food intake , fat mass and body weight accompanied with leptin resistance that was indicated by an attenuated response to leptin in female mice ( 22 ) .\nanother study on a y2 deficient mouse model showed that female germline y2 receptor of knockout mice also had increased food intake , but with reduced body weight , whereas male y2 receptor of knockout mice had transiently reduced food intake and constantly decreased body weight associated with decreased adiposity at 16 weeks ( 23 , 24 ) .\nsubcutaneous injection of a y2 receptor agonist , a polyethylene glycol - conjugated peptide agonist and 
2-mercaptonicotinic acid , reduced food intake in lean 18 hr fasted rodents , and this effect was abolished by pretreatment with the y2 antagonist biie0246 ( 25 ) .\nthis supports the therapeutic potential of peripherally administered y2 receptor agonists to reduce energy intake and treat obesity .\n, kuo , et al showed that y2 receptors were involved in promoting proliferation and differentiation of adipocytes as well as stimulating angiogenesis of capillaries in adipose tissue ( 26 ) .\nrosmaninho - s , et al showed that npy induces adipocyte proliferation and differentiation , and lipid accumulation induced by npy y2 receptor activation occurs through pka , mapk and pi3k pathways ( 27 ) . in the present study , tests were performed on diet - induced obese mice , which was very close to clinical condition , and no effect of npy receptor antagonist was observed on i m thickness and adipocyte cell size ; however , adipocyte cell number was reduced in normal mice .\nbriefly , taking into consideration that central y2 receptor induces obesity , whereas activation of peripheral y2 receptor causes emaciation , presumably biie0246 acts mainly via inhibition of peripheral receptors , and decrease adipocyte cell number in normal group and peripheral receptor effects have been dominant .\nalthough some investigations have suggested that for the interpretations of change in body weight and body composition , it is needed to consider the possibility of differential central versus peripheral effects , and/or hypothalamic or non - hypothalamic effects of the y2 receptor ( 18 ) , future research needs to look more closely into the role of the y2 receptor on obesity and atherosclerosis .",Introduction\nMaterials and Methods\nAnimals\nAnimal diet and treatment\nHistological examination\nStatistical analysis\nResults\nBody weight and fat reposition\nEffect of neuropeptide Y Y2 receptor antagonist on adipocyte cell number\nEffect of neuropeptide Y Y2 receptor antagonist on 
adipocyte cell size\nDiscussion\nConclusion


We can see that the input data is the `article` - a scientific report and the target data is the `abstract` - a concise summary of the report.

Cool! Having downloaded the dataset, let's tokenize it.
We'll import the convenient `AutoTokenizer` class.

In [None]:
from transformers import AutoTokenizer

Then we load the tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")


Note that for the sake of this notebook, we finetune the "smaller" LED checkpoint ["allenai/led-base-16384"](https://huggingface.co/allenai/led-base-16384). Better performance can, however, be attained by finetuning ["allenai/led-large-16384"](https://huggingface.co/allenai/led-large-16384) at the cost of higher GPU memory requirements.

Pubmed's input data has a median token length of 2715, with the 90th-percentile token length being 6101. The output data has a median token length of 171, with the 90th-percentile token length being 352.${}^1$

Thus, we set the maximum input length to 8192 and the maximum output length to 512 to ensure that the model can attend to almost all input tokens and is able to generate a large enough number of output tokens.

In this notebook, we are only able to train with `batch_size=2` to prevent out-of-memory errors.

---
${}^1$ The data is taken from page 11 of [Big Bird: Transformers for Longer Sequences](https://arxiv.org/pdf/2007.14062.pdf).


In [None]:
max_input_length = 8192
max_output_length = 512
batch_size = 2
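
To make the reasoning behind these limits concrete, here is a small sketch of how a median, a 90th percentile and the share of fully covered documents could be computed from a list of token lengths. The lengths below are made-up values for illustration only (they are not the Pubmed statistics), and the helper names `nearest_rank_percentile` and `coverage` are hypothetical. The nearest-rank definition of a percentile is one of several common conventions.

```python
def nearest_rank_percentile(values, percent):
    """Nearest-rank percentile: smallest value with at least `percent`% of data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of percent / 100 * len(ordered), clamped to at least rank 1.
    rank = max((percent * len(ordered) + 99) // 100, 1)
    return ordered[rank - 1]


def coverage(values, max_length):
    """Fraction of samples that fit into max_length without truncation."""
    return sum(v <= max_length for v in values) / len(values)


# Hypothetical token lengths of ten documents (illustrative only).
token_lengths = [900, 1500, 2100, 2700, 3300, 4100, 5200, 6100, 7900, 12000]

median_length = nearest_rank_percentile(token_lengths, 50)
p90_length = nearest_rank_percentile(token_lengths, 90)
covered = coverage(token_lengths, 8192)
print(median_length, p90_length, covered)
```

If the 90th percentile lies below the chosen maximum length, at least 90% of the samples are processed without truncation, which is the situation described above for `max_input_length = 8192`.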

Now, let's write down the input data processing function that will be used to map each data sample to the correct model format.
As explained earlier, `article` represents our input data and `abstract` is the target data. The data samples are thus tokenized up to the respective maximum lengths of 8192 and 512.

In addition to the usual `attention_mask`, LED can make use of an additional `global_attention_mask` defining which input tokens are attended globally and which are attended only locally, just as is the case for [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). For more information on Longformer's self-attention, please take a look at the corresponding [docs](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention). For summarization, we follow the recommendations of the [paper](https://arxiv.org/abs/2004.05150) and use global attention only for the very first token. Finally, we make sure that no loss is computed on padded tokens by setting their index to `-100`.

In [None]:
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["article"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = tokenizer(
        batch["abstract"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # build one independent mask per sample: global attention (1) on the
    # very first token, local attention (0) on all other tokens
    batch["global_attention_mask"] = [
        [1] + [0] * (len(input_ids) - 1) for input_ids in batch["input_ids"]
    ]
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
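
The two masking steps can be checked in isolation on toy data. The sketch below uses no tokenizer: the token ids are invented, `PAD_TOKEN_ID = 1` is an assumed stand-in for `tokenizer.pad_token_id`, and the helper names are hypothetical. It only illustrates the shape of the resulting lists.

```python
PAD_TOKEN_ID = 1     # assumption: stand-in for tokenizer.pad_token_id
IGNORE_INDEX = -100  # label value on which no loss is computed


def build_global_attention_mask(input_ids):
    """Global attention (1) on the first token of each sample, local (0) elsewhere."""
    return [[1] + [0] * (len(ids) - 1) for ids in input_ids]


def mask_padded_labels(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding positions in the labels by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in labels]
        for labels in label_ids
    ]


# Two toy samples, already padded to a common length with PAD_TOKEN_ID.
toy_input_ids = [[0, 11, 12, 2, 1, 1], [0, 21, 2, 1, 1, 1]]
toy_labels = [[0, 31, 2, 1], [0, 41, 42, 2]]

global_mask = build_global_attention_mask(toy_input_ids)
masked_labels = mask_padded_labels(toy_labels)
print(global_mask)
print(masked_labels)
```

Each sample gets its own mask list here, so changing one sample's mask later would not silently affect the others.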

For the sake of this notebook, we will reduce the training and validation data 
to a dummy dataset of sizes 250 and 25 respectively. For a full training run, those lines should be commented out.

In [None]:
train_dataset = train_dataset.select(range(250))
val_dataset = val_dataset.select(range(25))

Great, having defined the mapping function, let's preprocess the training data.

In [None]:
train_dataset = train_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "section_names"],
)


We do the same for the validation data.

In [None]:
val_dataset = val_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "section_names"],
)


Finally, the datasets should be converted into the PyTorch format as follows.

In [None]:
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)
val_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

Alright, we're almost ready to start training. Let's load the model via the `AutoModelForSeq2SeqLM` class.

In [None]:
from transformers import AutoModelForSeq2SeqLM

We've decided to stick to the smaller model `"allenai/led-base-16384"` for the sake of this notebook. In addition, we directly enable gradient checkpointing and disable the caching mechanism to save memory.

In [None]:
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)


During training, we want to evaluate the model on Rouge, the most common metric used in summarization, to make sure the model is indeed improving. For this, we set suitable generation parameters. We'll use beam search with a small beam of just 2 to save memory. We also force the model to generate at least 100 tokens, but no more than 512. In addition, some other generation parameters are set that have been found helpful for generation. For more information on those parameters, please take a look at the [docs](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate).

In [None]:
# set generate hyperparameters
led.config.num_beams = 2
led.config.max_length = 512
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3
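
To give an intuition for one of these parameters, `no_repeat_ngram_size = 3` prevents any 3-gram from appearing twice in the generated sequence. The following is a simplified, pure-Python sketch of that idea and not the implementation used inside 🤗Transformers. The function name `banned_next_tokens` and the token ids are hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size <= 0 or len(generated) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = generated[start:start + ngram_size]
        # If an earlier n-gram starts with the same prefix, its last token
        # would create a repeated n-gram and is therefore banned.
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# Toy sequence of token ids: "5 6 7" was generated before, and the
# sequence currently ends in "5 6", so 7 must not be generated next.
history = [5, 6, 7, 8, 5, 6]
banned = banned_next_tokens(history, ngram_size=3)
print(banned)
```

During decoding, the scores of such banned tokens would be set so low that neither beam can select them at that step.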

Next, we also have to define the function that will compute the `"rouge"` score during evaluation.

Let's load the `"rouge"` metric from 🤗datasets and define the `compute_metrics(...)` function.

In [None]:
rouge = load_metric("rouge")


The `compute_metrics` function expects the generation output, called `pred.predictions`, as well as the gold labels, called `pred.label_ids`.

Those token ids are decoded back into text, after which the Rouge score can be computed.

In [None]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
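
To illustrate what the Rouge-2 numbers returned above measure, here is a simplified pure-Python sketch of bigram-overlap precision, recall and F-measure. The function `rouge2_sketch` is a hypothetical helper for illustration only: it assumes lowercased whitespace tokenization and a single prediction/reference pair, so its values will generally not match the `"rouge"` metric from 🤗datasets exactly, which uses its own tokenization and aggregates over the whole dataset.

```python
from collections import Counter
from typing import Dict


def bigrams(text: str) -> Counter:
    """Count the bigrams of a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_sketch(prediction: str, reference: str) -> Dict[str, float]:
    """Simplified Rouge-2 for one pair (hypothetical helper, not the official metric)."""
    pred, ref = bigrams(prediction), bigrams(reference)
    # Clipped overlap: each bigram counts at most as often as it appears in both texts.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        fmeasure = 0.0
    else:
        fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge2_sketch("the cat sat on the mat", "the cat lay on the mat")
print(scores)  # 3 of 5 bigrams overlap, so precision and recall are both 0.6
```

Precision is the share of predicted bigrams that also occur in the reference, and recall is the share of reference bigrams that the prediction recovers.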

Now, we're ready to start training. Let's import the `Seq2SeqTrainer` and `Seq2SeqTrainingArguments`.

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In contrast to the usual `Trainer`, the `Seq2SeqTrainer` makes it possible to use the `generate()` function during evaluation. This should be enabled with `predict_with_generate=True`. Because our GPU RAM is limited, we make use of gradient accumulation by setting `gradient_accumulation_steps=4` to have an effective `batch_size` of 2 * 4 = 8.

You can read up on the other training arguments in the [docs](https://huggingface.co/transformers/main_classes/trainer.html?highlight=trainingarguments#transformers.TrainingArguments).

In [None]:
# enable fp16 mixed-precision training
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir="./",
    logging_steps=5,
    eval_steps=10,
    save_steps=10,
    save_total_limit=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
)
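
The batch-size arithmetic behind these arguments can be sketched as follows. The helper names and the dataset size of 250 examples are assumptions chosen purely for illustration; the per-device batch size of 2 and the 4 accumulation steps are the values used in this notebook. The step count is an approximation of how many optimizer updates one epoch yields, not a guarantee of what the trainer reports.

```python
import math


def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to a single optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_optimizer_steps(num_examples: int,
                           per_device_batch_size: int,
                           gradient_accumulation_steps: int,
                           num_devices: int = 1) -> int:
    """Approximate optimizer updates per epoch (hypothetical helper)."""
    # Number of forward/backward passes needed to see every example once.
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Gradients are accumulated over several batches before each update.
    return max(batches // gradient_accumulation_steps, 1)


print(effective_batch_size(2, 4))         # 8
print(approx_optimizer_steps(250, 2, 4))  # 31 (250 examples is an assumed size)
```

Gradient accumulation trades time for memory: only 2 examples are held on the GPU at once, while each weight update still averages gradients over 8 examples.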

The training arguments, along with the model, tokenizer, datasets and the `compute_metrics` function can then be passed to the `Seq2SeqTrainer`

In [None]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

and we can start training. This will take about 35 minutes.

In [None]:
trainer.train()

This completes the fine-tuning tutorial for LED. A slightly modified version of this training script was used to train [this](https://huggingface.co/patrickvonplaten/led-large-16384-pubmed) checkpoint, called `"patrickvonplaten/led-large-16384-pubmed"`, on a single GPU for ca. 3 days. Evaluating `"patrickvonplaten/led-large-16384-pubmed"` on Pubmed's test data gives a Rouge-2 score of **19.33**, which is around 1 Rouge-2 point below SOTA performance on Pubmed.

The Appendix below contains the condensed training and evaluation scripts that were used locally to finetune `"patrickvonplaten/led-large-16384-pubmed"`.

https://huggingface.co/patrickvonplaten/longformer2roberta-cnn_dailymail-fp16

# **Appendix**

**ANDREAS - MODIFY TO POINT TO YOUR RESPECTIVE DIRECTORY ON GOOGLE DRIVE**

In [None]:
import os
os.chdir("/content/gdrive/My Drive/Intelligence Report Summarization")
!ls

apex			       checkpoints  reports
baseline_txt		       models	    setup.sh
bert-extractive-summarization  notebooks    transformers.png


## Evaluation

**ANDREAS - ADD ADDITIONAL SCORES, BERTSCORE, BLEU ETC...**

In [None]:
import torch

from datasets import load_dataset, load_metric
from transformers import LEDTokenizer, LEDForConditionalGeneration

# load pubmed
pubmed_test = load_dataset("scientific_papers", "pubmed", ignore_verifications=True, split="test[:1%]")

# load tokenizer and model (the model is moved to the GPU and cast to fp16)
tokenizer = LEDTokenizer.from_pretrained("patrickvonplaten/led-large-16384-pubmed")
model = LEDForConditionalGeneration.from_pretrained("patrickvonplaten/led-large-16384-pubmed").to("cuda").half()


def generate_answer(batch):
  inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=8192, return_tensors="pt", truncation=True)
  input_ids = inputs_dict.input_ids.to("cuda")
  attention_mask = inputs_dict.attention_mask.to("cuda")
  global_attention_mask = torch.zeros_like(attention_mask)
  # put global attention on <s> token
  global_attention_mask[:, 0] = 1

  predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask)
  batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
  return batch


result = pubmed_test.map(generate_answer, batched=True, batch_size=4)

# load rouge
rouge = load_metric("rouge")

print("Result:", rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"], rouge_types=["rouge2"])["rouge2"].mid)


Reusing dataset scientific_papers (/root/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc)


To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


Result: Score(precision=0.23653924633924814, recall=0.23085229928758627, fmeasure=0.22408227220415894)


In [None]:
import pandas as pd

# These are the predicted abstracts; each row corresponds to a PubMed article.
pd.DataFrame(result)

Unnamed: 0,abstract,article,predicted_abstract,section_names
0,research on the implications of anxiety in pa...,anxiety affects quality of life in those livin...,anxiety affects quality of life in those livi...,1. Introduction\n2. Methods\n3. Results\n4. Di...
1,"small non - coding rnas include sirna , mirna...",small non - coding rnas are transcribed into m...,micrornas ( mirnas ) are small non - coding r...,Introduction\nAberrant Expression of miRNA in ...
2,objective : to evaluate the efficacy and safe...,ohss is a serious complication of ovulation in...,\n background. \n the aim of the current stud...,Introduction\nMaterials and Methods\nResults\n...
3,congenital adrenal hyperplasia is a group of ...,congenital adrenal hyperplasia ( cah ) refers ...,background : congenital adrenal hyperplasia (...,I\nM\nR\nD
4,objective(s):pentoxifylline is an immunomodul...,type 1 diabetes ( t1d ) results from the destr...,type 1 diabetes ( t1d ) results from the dest...,Introduction\nMaterials and Methods\nDrug and ...
...,...,...,...,...
62,abstractpurposethis study aimed to determine ...,nine male competitive cyclists participated in...,objectiveto determine the effects of acute he...,METHODS\nNone\nParticipants\nGeneral procedure...
63,the paper analyses the selected optical param...,"worksites in glass and metal works , foundries...",abstractthe aim of the present study was to c...,Introduction\nTest samples filters protecting...
64,objectivethis study compared the clinical and...,this retrospective observational study was con...,objectiveto examine the impact of dual- and s...,RESEARCH DESIGN AND METHODS\nData source\nSamp...
65,backgroundthe aim of this study was to determ...,endotracheal intubation is one of the most imp...,objective : to define the role of point - of ...,Background\nMaterial and Methods\nStatistical ...
