This notebook shows how to summarize ILC documents with 🤗 Datasets and 🤗 Transformers, and compares the summaries produced by `led-base-ilc` and `led-base-16384` on 10 samples from the test split.

The same code works for more samples: change the slice size where `CasesText` and `GoldSummary` are defined.

Author: [Pawan Trivedi](https://twitter.com/d0r1h) <br>
Date created: 2022/05/05 <br>
Last modified: 2022/05/05 <br>
Description: Using LED for summarization task

The following checkpoints are used in this notebook:

1. [led-base-16384](https://huggingface.co/allenai/led-base-16384)
2. [led-base-ilc](https://huggingface.co/d0r1h/led-base-ilc)

In [1]:
# Install the required libraries

!pip install transformers datasets sentencepiece rouge -qq

In [2]:
import torch
import pandas as pd
from rouge import Rouge
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [None]:
rouge = Rouge()

dataset = load_dataset("d0r1h/ILC", split='test')

In [4]:
dataset

Dataset({
    features: ['Title', 'Summary', 'Case'],
    num_rows: 1015
})

In [5]:
def summarize(checkpoint):
  """
  Load the tokenizer and model for `checkpoint` and summarize every
  document in the global `CasesText` list.

  Returns one entry per document; each entry is the list returned by
  `tokenizer.batch_decode` (here a single summary string).
  """
  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

  SystemSummaries = []
  for i, case in enumerate(CasesText):
    input_ids = tokenizer(case, return_tensors="pt").input_ids.to(device)

    # Put global attention on the first token only
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1

    # The model is loaded here without `return_dict_in_generate`, so
    # `generate` returns a tensor of token ids for every checkpoint.
    # (The original check against "led-base-16384" never matched the
    # full checkpoint name, so that branch was dead code.)
    sequences = model.generate(input_ids, global_attention_mask=global_attention_mask)
    Summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)

    SystemSummaries.append(Summary)
    print(i)  # progress indicator

  return SystemSummaries

In [6]:
CasesText = dataset['Case'][:10]
GoldSummary = dataset['Summary'][:10]

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"

### **led-base-ilc**

In [8]:
checkpoint1 = "d0r1h/led-base-ilc"

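# Note: `summarize` loads its own tokenizer and model from the checkpoint
# name, so these two objects are only needed for interactive inspection.
tokenizer1 = AutoTokenizer.from_pretrained(checkpoint1)
model1 = AutoModelForSeq2SeqLM.from_pretrained(checkpoint1).to(device)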

In [None]:
SystemSummary1 = summarize(checkpoint1)

In [10]:
# Each entry is a one-element list; keep only the summary string
SystemSummaryFinal1 = [summary[0] for summary in SystemSummary1]

In [11]:
Summaries1 = pd.DataFrame(list(zip(GoldSummary, SystemSummaryFinal1)), columns =['GoldSummary', 'SystemSummary'])

In [12]:
score1 = rouge.get_scores(Summaries1['SystemSummary'], Summaries1['GoldSummary'], avg=True)

In [13]:
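# Rename rows by key rather than by position, so the labels stay correct
# even if the order of the 'r', 'p', 'f' keys changes
pd.DataFrame(score1).rename(index={'r': 'recall', 'p': 'precision', 'f': 'f-measure'}) * 100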

| | rouge-1 | rouge-2 | rouge-l |
|---|---|---|---|
| recall | 39.303019 | 21.30861 | 36.348632 |
| precision | 47.91047 | 24.723998 | 44.378273 |
| f-measure | 42.625963 | 22.451412 | 39.452514 |
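
The three rows above are the ROUGE recall, precision and F-measure, reported in percent. As a rough illustration of how they relate, here is a minimal ROUGE-1 sketch in plain Python. It assumes lowercase whitespace tokenization and clipped unigram counts; the `rouge` package may tokenize and count differently, so this will not reproduce the numbers in the table exactly. The function name `rouge1` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1(system_summary: str, gold_summary: str) -> dict:
    """Unigram-overlap precision, recall and F-measure (illustrative only)."""
    sys_counts = Counter(system_summary.lower().split())
    gold_counts = Counter(gold_summary.lower().split())

    # Clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((sys_counts & gold_counts).values())
    sys_total = sum(sys_counts.values())
    gold_total = sum(gold_counts.values())

    precision = overlap / sys_total if sys_total else 0.0
    recall = overlap / gold_total if gold_total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f-measure": f_measure}


# Hypothetical sentences, not taken from the ILC dataset
gold = "the court dismissed the appeal and upheld the conviction"
short = "the court dismissed the appeal"

print(rouge1(short, gold))
```

In this toy case every word of the short summary appears in the gold summary, so precision is perfect, while recall is lower because the short summary covers only part of the gold text. The F-measure sits between the two.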


### **led-base-16384**

In [14]:
checkpoint2 = 'allenai/led-base-16384'

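# Note: `summarize` loads its own tokenizer and model from the checkpoint
# name, so these two objects are only needed for interactive inspection.
tokenizer2 = AutoTokenizer.from_pretrained(checkpoint2)
model2 = AutoModelForSeq2SeqLM.from_pretrained(checkpoint2, return_dict_in_generate=True).to(device)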

In [None]:
SystemSummary2 = summarize(checkpoint2)

In [16]:
# Each entry is a one-element list; keep only the summary string
SystemSummaryFinal2 = [summary[0] for summary in SystemSummary2]

In [17]:
Summaries2 = pd.DataFrame(list(zip(GoldSummary, SystemSummaryFinal2)), columns =['GoldSummary', 'SystemSummary'])

In [18]:
score2 = rouge.get_scores(Summaries2['SystemSummary'], Summaries2['GoldSummary'], avg=True)

In [19]:
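# Rename rows by key rather than by position, so the labels stay correct
# even if the order of the 'r', 'p', 'f' keys changes
pd.DataFrame(score2).rename(index={'r': 'recall', 'p': 'precision', 'f': 'f-measure'}) * 100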

| | rouge-1 | rouge-2 | rouge-l |
|---|---|---|---|
| recall | 1.957264 | 0.628321 | 1.877908 |
| precision | 39.643357 | 21.149684 | 37.97669 |
| f-measure | 3.722952 | 1.21835 | 3.571858 |
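
To put the two result tables side by side, the sketch below computes the F-measure gap between the checkpoints. The numbers are copied from the tables in this notebook; the helper name `f_measure_gap` and the dictionary layout are assumptions made for this illustration. Note that `led-base-16384` keeps a fairly high precision while its recall is very low, a pattern consistent with system summaries that are much shorter than the gold ones (an interpretation, not something measured in this notebook).

```python
# F-measure values (in percent) copied from the two result tables above
F_MEASURE = {
    "led-base-ilc": {"rouge-1": 42.625963, "rouge-2": 22.451412, "rouge-l": 39.452514},
    "led-base-16384": {"rouge-1": 3.722952, "rouge-2": 1.21835, "rouge-l": 3.571858},
}


def f_measure_gap(scores: dict, model: str, baseline: str) -> dict:
    """Per-metric difference (model minus baseline) in percentage points."""
    return {
        metric: scores[model][metric] - scores[baseline][metric]
        for metric in scores[model]
    }


gaps = f_measure_gap(F_MEASURE, "led-base-ilc", "led-base-16384")
for metric, gap in gaps.items():
    print(f"{metric}: {gap:+.2f} points for led-base-ilc")
```

With only 10 samples these gaps are indicative at best; rerunning on a larger slice of the test split would give a more reliable comparison.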
