In [8]:
from transformers import (
    AutoTokenizer,
    LEDForConditionalGeneration,
)
from datasets import load_dataset, load_metric
import torch

First, we load the **Multi-News** dataset from the Hugging Face dataset hub.

In [3]:
dataset=load_dataset('multi_news')

Using custom data configuration default


Downloading and preparing dataset multi_news/default (download: 245.06 MiB, generated: 667.74 MiB, post-processed: Unknown size, total: 912.80 MiB) to /home/wenx/.cache/huggingface/datasets/multi_news/default/1.0.0/2e145a8e21361ba4ee46fef70640ab946a3e8d425002f104d2cda99a9efca376...


Dataset multi_news downloaded and prepared to /home/wenx/.cache/huggingface/datasets/multi_news/default/1.0.0/2e145a8e21361ba4ee46fef70640ab946a3e8d425002f104d2cda99a9efca376. Subsequent calls will reuse this data.



Then we load the fine-tuned PRIMERA model. Please download [it](https://storage.googleapis.com/primer_summ/PRIMER_multinews.tar.gz) to your local machine first and set `PRIMER_path` below to its location.

In [4]:
PRIMER_path='../github/PRIMER/PRIMER_multinews_hf'
TOKENIZER = AutoTokenizer.from_pretrained(PRIMER_path)
MODEL = LEDForConditionalGeneration.from_pretrained(PRIMER_path)
MODEL.gradient_checkpointing_enable()
PAD_TOKEN_ID = TOKENIZER.pad_token_id
DOCSEP_TOKEN_ID = TOKENIZER.convert_tokens_to_ids("<doc-sep>")


We then define the functions that pre-process the data and generate the summaries.

In [17]:
def process_document(documents):
    input_ids_all=[]
    for data in documents:
        all_docs = data.split("|||||")[:-1]
        for i, doc in enumerate(all_docs):
            doc = doc.replace("\n", " ")
            doc = " ".join(doc.split())
            all_docs[i] = doc

        # Concatenate the documents, appending <doc-sep> after each one
        # (these separator positions later receive global attention)
        input_ids = []
        for doc in all_docs:
            input_ids.extend(
                TOKENIZER.encode(
                    doc,
                    truncation=True,
                    max_length=4096 // len(all_docs),
                )[1:-1]
            )
            input_ids.append(DOCSEP_TOKEN_ID)
        input_ids = (
            [TOKENIZER.bos_token_id]
            + input_ids
            + [TOKENIZER.eos_token_id]
        )
        input_ids_all.append(torch.tensor(input_ids))
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
    )
    return input_ids


def batch_process(batch):
    input_ids=process_document(batch['document'])
    # Build the global attention mask (same shape and device as input_ids)
    global_attention_mask = torch.zeros_like(input_ids)
    # Put global attention on the <s> token and on every <doc-sep> token
    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1
    generated_ids = MODEL.generate(
        input_ids=input_ids,
        global_attention_mask=global_attention_mask,
        use_cache=True,
        max_length=1024,
        num_beams=5,
    )
    generated_str = TOKENIZER.batch_decode(
        generated_ids.tolist(), skip_special_tokens=True
    )
    result={}
    result['generated_summaries'] = generated_str
    result['gt_summaries']=batch['summary']
    return result
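
To make the pre-processing easier to follow, here is a minimal sketch of the same two steps on a toy example: giving each document an equal share of the 4096-token window, joining the documents with a separator, and marking the first token and every separator for global attention. The token ids, `toy_encode`, `build_inputs` and `global_mask` are hypothetical stand-ins, not the real PRIMERA tokenizer or vocabulary.

```python
# Toy stand-ins (assumptions, not the real PRIMERA vocabulary ids)
BOS, EOS, DOCSEP = 0, 2, 99
MAX_LEN = 4096


def toy_encode(text, max_length):
    """Hypothetical tokenizer: one id per word, wrapped in BOS/EOS, truncated."""
    body = [100 + len(word) for word in text.split()]
    return [BOS] + body[: max_length - 2] + [EOS]


def build_inputs(cluster):
    """Mirror process_document for one cluster of already-split documents."""
    budget = MAX_LEN // len(cluster)  # equal share of the input window
    ids = []
    for doc in cluster:
        ids.extend(toy_encode(doc, budget)[1:-1])  # drop per-document BOS/EOS
        ids.append(DOCSEP)  # separator after every document
    return [BOS] + ids + [EOS]


def global_mask(ids):
    """1 on the first token and on every separator, 0 elsewhere."""
    return [1 if i == 0 or tok == DOCSEP else 0 for i, tok in enumerate(ids)]


ids = build_inputs(["first article text", "second one"])
mask = global_mask(ids)
print(ids)   # [0, 105, 107, 104, 99, 106, 103, 99, 2]
print(mask)  # [1, 0, 0, 0, 1, 0, 0, 1, 0]
```

In the notebook itself the same mask is built with tensor indexing over a padded batch rather than a list comprehension.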

Next, we run the model on 10 randomly sampled test examples (or any number of examples you want).

In [31]:
import random
# sample without replacement so that no example is evaluated twice
data_idx = random.sample(range(len(dataset['test'])), k=10)
dataset_small = dataset['test'].select(data_idx)
result_small = dataset_small.map(batch_process, batched=True, batch_size=2)
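
The selection above changes on every run. If a reproducible subset with no repeated examples is wanted, one option is a seeded draw without replacement. This is only a sketch: `pick_indices` and its `seed` argument are hypothetical names, and `5000` is a placeholder for `len(dataset['test'])`.

```python
import random


def pick_indices(n_examples, k, seed=0):
    """Draw k distinct indices from range(n_examples), reproducibly."""
    rng = random.Random(seed)  # local generator, global random state untouched
    return rng.sample(range(n_examples), k)


idx = pick_indices(5000, 10)
# The result could then be passed to dataset['test'].select(idx)
```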


After getting all the results, we load the evaluation metric.

(Note: the original code did not use the default aggregator; it simply took the average over the scores of all examples. In this notebook we use the aggregator's `mid` value instead.)
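
The plain averaging mentioned in the note could look roughly like the sketch below. It assumes per-example scores are already available as tuples with `precision`, `recall` and `fmeasure` fields; with the `rouge` metric these can, as far as I know, be obtained by passing `use_aggregator=False` to `compute`, but treat that as an assumption to check against your installed version. `Score` here is a local stand-in and `average_scores` is a hypothetical helper.

```python
from collections import namedtuple
from statistics import mean

# Local stand-in with the same fields as the scores printed in this notebook
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def average_scores(per_example):
    """Average each field over a non-empty list of per-example scores."""
    return Score(
        precision=mean(s.precision for s in per_example),
        recall=mean(s.recall for s in per_example),
        fmeasure=mean(s.fmeasure for s in per_example),
    )


# Made-up numbers, for illustration only
avg = average_scores([Score(0.5, 0.25, 0.4), Score(1.0, 0.75, 0.6)])
print(avg)
```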

In [19]:
rouge = load_metric("rouge")

In [32]:
result_small['generated_summaries']

['– Amazon wants to be your go-to for last-minute deals this holiday season, so it\'s offering a little something extra for brick-and-mortar stores, TechCrunch reports. The online retailer is offering users of its Price Check app up to $5 off any product they buy at a brick-and-mortar store if they use the app to scan the barcode on the item, take a picture, or type in the product\'s name, the Verge reports. Amazon usually has lower prices than brick-and-mortar stores, "so this is just the cherry on top," one analyst says. "The ability to check prices on your mobile phone when you’re in a physical retail store is changing the way people shop," an Amazon mobile director says. "Price transparency means that you can save money on the products you want and that’s a great thing for customers. Price Check in-store deals are another incentive to shop smart this holiday season."',
 '– A day care in San Antonio has been shut down after an infant was allegedly bitten 27 times by another child, K

In [33]:
score=rouge.compute(predictions=result_small["generated_summaries"], references=result_small["gt_summaries"])
print(score['rouge1'].mid)
print(score['rouge2'].mid)
print(score['rougeL'].mid)

Score(precision=0.509437078378281, recall=0.43832461548851936, fmeasure=0.4644188580686355)
Score(precision=0.17689604682544763, recall=0.14564519595131636, fmeasure=0.1581222605371442)
Score(precision=0.2362355904256852, recall=0.19669444890277293, fmeasure=0.21194685290367665)
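
As a reading aid for the three fields printed above, here is a deliberately simplified ROUGE-1 sketch based on unigram overlap. It lowercases and splits on whitespace, with no stemming and no special tokenization, so it is not the implementation behind the `rouge` metric and will not reproduce its numbers. `simple_rouge1` is a hypothetical helper.

```python
from collections import Counter


def simple_rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = simple_rouge1("the cat sat", "the cat sat on the mat")
print(p, r, round(f, 4))  # 1.0 0.5 0.6667
```

Precision is the share of generated words that also occur in the reference, recall is the share of reference words that the summary covers, and `fmeasure` is their harmonic mean.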


In [27]:
import random

In [30]:
random.choices(range(5000),k=5)

[4496, 1390, 2088, 2130, 1604]

– Facebook removed a photo of two men kissing in protest of a London pub’s decision to eject a same-sex couple for kissing, reports the America Blog. “Shares that contain nudity, or any kind of graphic or sexually suggestive content, are not permitted on Facebook,” the administrators of the Dangerous Minds Facebook page said in an email. The decision to remove the photo has prompted scores of people to post their own pictures of same-sex couples kissing in protest— dozens in the last few hours alone.

– Facebook has removed a photo from a protest page for a gay pub that booted a same-sex couple for kissing, USA Today reports. The Dangerous Minds Facebook page was trying to promote a “gay kiss-in” demonstration in London to protest the pub. The page used a photo of two men kissing to promote the event. But Facebook quickly removed the photo, saying in an email, “Shares that contain nudity, or any kind of graphic or sexually suggestive content, are not permitted on Facebook.” The decision to remove the photo has prompted scores of people to post their own pictures of same-sex couples kissing in protest— dozens in the last few hours alone.