## Project submission header

## Abstractive Summarization of Scientific Papers with BART

## Module submission group
- Group member 1
    - Name: Eric Benton
    - Email: emb393@drexel.edu
- Group member 2
    - Name: Michael Wesner
    - Email: mw3344@drexel.edu
- Group member 3
    - Name: Dustin Luchmee
    - Email: dbl47@drexel.edu

In [2]:
## Code in this notebook modified from https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb

import torch, json, wandb, nltk, random, datasets
import numpy as np
from tqdm import tqdm
from datasets import load_dataset, load_metric
import pandas as pd
from IPython.display import display, HTML
from transformers import BartForConditionalGeneration, BartTokenizerFast
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

### Load model and rouge metric

In [2]:
model_checkpoint = "facebook/bart-base"

In [3]:
metric = load_metric("rouge")

In [4]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each predictions
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_agregator: Return aggregates if this is set to True
Retu

#### Load dataset and show some examples

In [5]:
raw_dataset = load_dataset('scientific_papers', 'pubmed')

Reusing dataset scientific_papers (/home/mw/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc)


In [6]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 119924
    })
    validation: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 6633
    })
    test: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 6658
    })
})

In [16]:
def show_random_elements(dataset, num_examples=1):
    """Display `num_examples` distinct, randomly chosen rows of `dataset` as an HTML table."""
    # random.sample draws without replacement, so no index is picked twice
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Map integer class labels back to their string names for readability
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [1]:
show_random_elements(raw_dataset["train"])

### Load tokenizer and see an example

In [9]:
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")

In [10]:
# Tokenizer example
tokenizer("Hello, this one sentence!")

{'input_ids': [0, 31414, 6, 42, 65, 3645, 328, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

### Find average length of training articles and abstracts

In [11]:
# abstract_len = 0
# article_len = 0

# for text in raw_dataset['train']:
#     article_len += len(tokenizer.encode(text['article']))
#     abstract_len += len(tokenizer.encode(text['abstract']))

In [12]:
# articles = 119924
# print(f'Average number of tokens per article: {int(article_len/articles)} and average number of tokens per summary: {int(abstract_len/articles)}')

Average number of tokens per article: 3892 and average number of tokens per summary: 257
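
As a rough back-of-the-envelope check, the averages above can be compared with the truncation limits used in the tokenization step of this notebook (1024 input tokens, 128 target tokens). The sketch below is an illustration added here, not part of the original run; the helper name `coverage` is hypothetical, and it treats every document as having the average length, which real articles do not.

```python
# Averages reported above and the truncation limits used when tokenizing.
avg_article_tokens = 3892
avg_abstract_tokens = 257
max_input_length = 1024
max_target_length = 128

def coverage(limit, average_length):
    """Fraction of an average-length text that survives truncation (capped at 1)."""
    return min(1.0, limit / average_length)

article_coverage = coverage(max_input_length, avg_article_tokens)
abstract_coverage = coverage(max_target_length, avg_abstract_tokens)

print(f"Average article kept:  {article_coverage:.1%}")
print(f"Average abstract kept: {abstract_coverage:.1%}")
```

Under that simplification, the model sees roughly a quarter of an average article and is trained on roughly half of an average abstract.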

### Tokenize our dataset

In [11]:
max_input_length = 1024
max_target_length = 128

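def preprocess_function(examples):
    # Articles are the model inputs; anything beyond max_input_length tokens is truncated
    inputs = examples["article"]
    model_inputs = tokenizer(inputs, max_length=max_input_length, padding=True, truncation=True)

    # Abstracts are the targets; tokenize them in target mode, truncated to max_target_length
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["abstract"], max_length=max_target_length, padding=True, truncation=True)

    # The trainer reads the target token ids from the "labels" column
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs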
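
One detail worth checking: because the labels are padded by the tokenizer (`padding=True`), padded label positions carry the pad token id (1 for this tokenizer, as the `label_ids` in the prediction output further down suggest) rather than -100, the value that PyTorch's cross-entropy loss ignores by default. A common remedy in Hugging Face examples is to replace pad ids in the labels with -100. The sketch below shows that replacement on plain Python lists; `mask_label_padding` is a hypothetical helper, not something the original notebook ran, and the token ids are illustrative.

```python
PAD_TOKEN_ID = 1      # pad id of the BART tokenizer (assumed; check tokenizer.pad_token_id)
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss by default

def mask_label_padding(label_rows, pad_token_id=PAD_TOKEN_ID):
    """Return a copy of the label rows with pad ids replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_rows
    ]

# Illustrative ids only: the second row ends with two pad tokens.
labels = [
    [0, 557, 15, 333, 2],
    [0, 650, 786, 2, 1, 1],
]
masked = mask_label_padding(labels)
print(masked)
```

In `preprocess_function`, this would be applied to `labels["input_ids"]` before it is assigned to `model_inputs["labels"]`. Whether it changes the reported scores is not something this notebook measured.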

In [12]:
tokenized_datasets = raw_dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at /home/mw/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc/cache-1f98a046d8b2dcb8.arrow
Loading cached processed dataset at /home/mw/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc/cache-c8f55915ad5db4fe.arrow
Loading cached processed dataset at /home/mw/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc/cache-64b1e08e645d5439.arrow


In [13]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['abstract', 'article', 'attention_mask', 'input_ids', 'labels', 'section_names'],
        num_rows: 119924
    })
    validation: Dataset({
        features: ['abstract', 'article', 'attention_mask', 'input_ids', 'labels', 'section_names'],
        num_rows: 6633
    })
    test: Dataset({
        features: ['abstract', 'article', 'attention_mask', 'input_ids', 'labels', 'section_names'],
        num_rows: 6658
    })
})

In [82]:
len(tokenized_datasets['train']['labels'][0])

128

In [83]:
len(tokenized_datasets['train']['input_ids'][0])

1024

### Setup training arguments, load model, data collator, and metrics

In [20]:
batch_size = 3
args = Seq2SeqTrainingArguments(
    "academic-papers-abstractive-summarization",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=False
)
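
The arguments above leave the generation length used during evaluation at its default. That is a likely explanation for the `Gen Len` of 20.0 in the per-epoch training table, and for validation ROUGE scores far below the test scores obtained later with `max_length=128`. Recent versions of `transformers` expose `generation_max_length` on `Seq2SeqTrainingArguments`; assuming the installed version has it, a configuration along these lines should make the in-training evaluation more comparable to the test runs. This is a sketch, not the configuration used for the results in this notebook, and the variable name `args_long_generation` is hypothetical.

```python
from transformers import Seq2SeqTrainingArguments

# Assumption: the installed transformers version supports generation_max_length
# (older releases may not accept this argument).
args_long_generation = Seq2SeqTrainingArguments(
    "academic-papers-abstractive-summarization",
    per_device_eval_batch_size=3,
    predict_with_generate=True,
    generation_max_length=128,   # match max_target_length instead of the default
)
```

Longer generations make each evaluation pass slower, so this trades training-time cost for validation metrics that are easier to compare with the test results.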

In [21]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [22]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [23]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
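
To make the scores returned by `compute_metrics` easier to interpret, here is a simplified, self-contained sketch of a ROUGE-1 F-measure: the harmonic mean of unigram precision and recall between a prediction and a reference. It is an illustration only. It splits on whitespace, applies no stemming, and does no aggregation over a dataset, so its numbers will not match the `rouge` metric loaded above exactly. The function name `rouge1_f` and the two example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure on whitespace tokens (no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a unigram counts at most as often as it occurs in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

prediction = "anxiety impairs cognition in pd patients"
reference = "anxiety is associated with impaired cognition in pd"
score = rouge1_f(prediction, reference)
print(f"ROUGE-1 F-measure: {score * 100:.2f}")
```

Here four of the six predicted unigrams appear in the eight-word reference, giving a precision of 4/6, a recall of 4/8, and an F-measure of 4/7.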

### Train model, show metrics during training

In [24]:
trainer = Seq2SeqTrainer(model,
                         args,
                         train_dataset=tokenized_datasets["train"],
                         eval_dataset=tokenized_datasets["validation"],
                         data_collator=data_collator,
                         tokenizer=tokenizer,
                         compute_metrics=compute_metrics)

In [25]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mmw1000[0m (use `wandb login --relogin` to force relogin)


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 2.0737 | 1.922067 | 14.5698 | 6.7183 | 12.4973 | 13.5373 | 20.0 |
| 2 | 1.937 | 1.81855 | 14.8959 | 6.9592 | 12.7443 | 13.8665 | 20.0 |
| 3 | 1.8819 | 1.787529 | 15.0491 | 7.0062 | 12.8696 | 14.0171 | 20.0 |
| 4 | 1.8096 | 1.766275 | 14.9149 | 7.08 | 12.837 | 13.9142 | 20.0 |
| 5 | 1.7469 | 1.763582 | 15.0034 | 7.124 | 12.8991 | 13.9926 | 20.0 |


TrainOutput(global_step=199875, training_loss=1.9260239605930465, metrics={'train_runtime': 46843.1812, 'train_samples_per_second': 12.801, 'train_steps_per_second': 4.267, 'total_flos': 5.77838153147351e+17, 'epoch': 5.0})

TrainOutput

- `global_step`: 199875
- `training_loss`: 1.9260239605930465
- `train_runtime`: 46843.1812
- `train_samples_per_second`: 12.801
- `train_steps_per_second`: 4.267
- `total_flos`: 5.77838153147351e+17
- `epoch`: 5.0
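
The reported step count and throughput can be cross-checked from numbers already in this notebook: 119,924 training examples, a batch size of 3, and 5 epochs. The sketch below assumes a single device with no gradient accumulation (neither is set in the training arguments shown earlier) and a final partial batch that still counts as a step.

```python
import math

train_examples = 119924       # rows in the training split
batch_size = 3                # per_device_train_batch_size
epochs = 5                    # num_train_epochs
train_runtime_s = 46843.1812  # train_runtime reported by TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"runtime in hours:   {runtime_hours:.2f}")
```

The results, 199,875 steps and about 12.8 samples per second, agree with the `TrainOutput` above; the run took roughly 13 hours.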

In [2]:
model

### Generate predictions on test data

In [94]:
Preds = trainer.predict(test_dataset=tokenized_datasets["test"], 
                        metric_key_prefix='test', 
                        max_length=128)

In [95]:
Preds

PredictionOutput(predictions=array([[    2,     0,  1437, ...,  3625,  3059,     2],
       [    2,     0, 14926, ...,    61, 16570,     2],
       [    2,     0, 20372, ...,     1,     1,     1],
       ...,
       [    2,     0,     5, ..., 16117,   479,     2],
       [    2,     0,  3618, ...,  2156,     5,     2],
       [    2,     0,     5, ..., 50118,   601,     2]]), label_ids=array([[    0,   557,    15, ..., 50118,   333,     2],
       [    0,   650,   786, ...,     2,     1,     1],
       [    0,  4554,  4832, ..., 13280, 17624,     2],
       ...,
       [    0,    52,  6190, ...,     5, 23496,     2],
       [    0,  4554,  4832, ...,  1437, 50118,     2],
       [    0, 33484,  1283, ...,    58,    67,     2]]), metrics={'test_loss': 1.7654982805252075, 'test_rouge1': 42.2238, 'test_rouge2': 18.2209, 'test_rougeL': 27.7722, 'test_rougeLsum': 37.3796, 'test_gen_len': 121.4142, 'test_runtime': 2328.1488, 'test_samples_per_second': 2.86, 'test_steps_per_second': 0.954})

In [114]:
# Preds[1] holds label_ids, so this decodes the reference abstract of the first test example
tokenizer.decode(Preds[1][0])

"<s> research on the implications of anxiety in parkinson's disease ( pd ) has been neglected despite its prevalence in nearly 50% of patients and its negative impact on quality of life. \n previous reports have noted that neuropsychiatric symptoms impair cognitive performance in pd patients ; however, to date, no study has directly compared pd patients with and without anxiety to examine the impact of anxiety on cognitive impairments in pd. \n this study compared cognitive performance across 50 pd participants with and without anxiety ( 17 pda+ ; 33 pda ), who underwent neurological and neuropsychological assessment. \n group</s>"

[34m[1mwandb[0m: Network error resolved after 0:00:38.989127, resuming normal operation.


In [96]:
# Preds[0] holds the generated token ids, so this decodes the model's summary of the same example
tokenizer.decode(Preds[0][0])

"</s><s> \n objective. to examine the relationship between anxiety and cognition in parkinson's disease ( pd ) \n. methods. \n this cross - sectional study included 17 pd patients with anxiety ( n = 17 ) and thirty - three patients without anxiety, aged between 18 and 30 years, who completed the mini - mental state exam ( mmse ), the hospital anxiety and depression scale ( hads - d > 6 ), and completed a full neuropsychological assessment ( e.g., attention, memory, and executive functioning ). results. in both groups, \n anxiety was significantly associated</s>"

In [98]:
Preds[2]

{'test_loss': 1.7654982805252075,
 'test_rouge1': 42.2238,
 'test_rouge2': 18.2209,
 'test_rougeL': 27.7722,
 'test_rougeLsum': 37.3796,
 'test_gen_len': 121.4142,
 'test_runtime': 2328.1488,
 'test_samples_per_second': 2.86,
 'test_steps_per_second': 0.954}

### Generate predictions with beam search on test data

#### 3 Beams

In [99]:
Pred_3beams = trainer.predict(test_dataset=tokenized_datasets['test'],
                              metric_key_prefix='test',
                              max_length=128,
                              num_beams=3)

In [100]:
tokenizer.decode(Pred_3beams[0][0])

"</s><s> \n background. anxiety and depression are often related and coexist in parkinson's disease ( pd ). \n however, our current understanding of anxiety and its impact on cognition in pd, as well as its neural basis and best treatment practices, remains meager and lags far behind that of depression. objective. to examine the relationship between anxiety and cognition in patients with pd and to determine the independent effect of anxiety on cognition \n. methods. a cross - sectional study of 17 pd patients with anxiety and thirty - three pd without anxiety was conducted at the university of sydney.</s>"

In [101]:
Pred_3beams[2]

{'test_loss': 1.7654982805252075,
 'test_rouge1': 42.275,
 'test_rouge2': 18.2406,
 'test_rougeL': 27.8668,
 'test_rougeLsum': 37.4681,
 'test_gen_len': 120.0445,
 'test_runtime': 2154.9448,
 'test_samples_per_second': 3.09,
 'test_steps_per_second': 1.03}

#### 5 Beams

In [108]:
Pred_5beams = trainer.predict(test_dataset=tokenized_datasets['test'],
                              metric_key_prefix='test',
                              max_length=128,
                              num_beams=5)

In [109]:
tokenizer.decode(Pred_5beams[0][0])

"</s><s> \n background. anxiety and depression are often related and coexist in parkinson's disease ( pd ). \n however, our current understanding of anxiety and its impact on cognition in pd, as well as its neural basis and best treatment practices, remains meager and lags far behind that of depression. objective. to examine the relationship between anxiety and cognition in patients with pd \n. methods. a cross - sectional study of 17 pd patients with anxiety and thirty - three patients without anxiety was conducted at the brain and mind centre, university of sydney, in order to determine the independent</s>"

In [110]:
Pred_5beams[2]

{'test_loss': 1.7654982805252075,
 'test_rouge1': 42.12,
 'test_rouge2': 18.1016,
 'test_rougeL': 27.5752,
 'test_rougeLsum': 37.2314,
 'test_gen_len': 122.3542,
 'test_runtime': 2522.6649,
 'test_samples_per_second': 2.639,
 'test_steps_per_second': 0.88}

#### 7 Beams

In [102]:
Pred_7beams = trainer.predict(test_dataset=tokenized_datasets['test'],
                              metric_key_prefix='test',
                              max_length=128,
                              num_beams=7)

In [103]:
tokenizer.decode(Pred_7beams[0][0])

"</s><s> \n background. anxiety and depression are often related and coexist in parkinson's disease ( pd ). \n however, our current understanding of anxiety and its impact on cognition in pd, as well as its neural basis and best treatment practices, remains meager and lags far behind that of depression. objective. to examine the relationship between anxiety and cognition in patients with pd \n. methods. a cross - sectional study of 17 pd patients with anxiety and thirty - three patients without anxiety was conducted at the brain and mind centre, university of sydney, in order to determine the independent</s>"

In [104]:
Pred_7beams[2]

{'test_loss': 1.7654982805252075,
 'test_rouge1': 41.7882,
 'test_rouge2': 17.7992,
 'test_rougeL': 27.2814,
 'test_rougeLsum': 36.902,
 'test_gen_len': 123.0339,
 'test_runtime': 2900.4686,
 'test_samples_per_second': 2.295,
 'test_steps_per_second': 0.765}

#### 10 Beams

In [111]:
Pred_10beams = trainer.predict(test_dataset=tokenized_datasets['test'],
                               metric_key_prefix='test',
                               max_length=128,
                               num_beams=10)

In [112]:
tokenizer.decode(Pred_10beams[0][0])

"</s><s> \n background. anxiety and depression are often related and coexist in parkinson's disease ( pd ). \n however, our current understanding of anxiety and its impact on cognition in pd, as well as its neural basis and best treatment practices, remains meager and lags far behind that of depression. objective. to examine the independent effect of anxiety on cognition among pd patients with and without anxiety \n. methods. in this cross - sectional study, \n 17 patients with anxiety and thirty - three patients without anxiety were recruited from a patient database at the brain and mind centre, university of sy</s>"

In [113]:
Pred_10beams[2]

{'test_loss': 1.7654982805252075,
 'test_rouge1': 41.5606,
 'test_rouge2': 17.661,
 'test_rougeL': 27.085,
 'test_rougeLsum': 36.6851,
 'test_gen_len': 123.6877,
 'test_runtime': 3689.1254,
 'test_samples_per_second': 1.805,
 'test_steps_per_second': 0.602}
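
The five test runs above can be placed side by side. The sketch below copies the reported metrics into a small table and picks the setting with the highest ROUGE-1. The label `default` for the first run is an assumption: that call did not pass `num_beams`, so the effective beam count depends on the model's generation config.

```python
# (label, ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum, runtime in seconds), copied from the outputs above
runs = [
    ("default", 42.2238, 18.2209, 27.7722, 37.3796, 2328.1488),
    ("3 beams", 42.2750, 18.2406, 27.8668, 37.4681, 2154.9448),
    ("5 beams", 42.1200, 18.1016, 27.5752, 37.2314, 2522.6649),
    ("7 beams", 41.7882, 17.7992, 27.2814, 36.9020, 2900.4686),
    ("10 beams", 41.5606, 17.6610, 27.0850, 36.6851, 3689.1254),
]

print(f"{'run':<10}{'R1':>8}{'R2':>8}{'RL':>8}{'RLsum':>8}{'minutes':>9}")
for label, r1, r2, rl, rlsum, seconds in runs:
    print(f"{label:<10}{r1:>8.2f}{r2:>8.2f}{rl:>8.2f}{rlsum:>8.2f}{seconds / 60:>9.1f}")

best = max(runs, key=lambda run: run[1])
print(f"Highest ROUGE-1: {best[0]} ({best[1]:.2f})")
```

On these numbers, 3 beams scores highest on all four ROUGE variants and is also the fastest run, while 7 and 10 beams are both slower and lower-scoring. The differences among the first three runs are small (under 0.2 ROUGE-1 points).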