In [None]:
from datasets import load_dataset
from transformers import pipeline

# Load the summarisation pipeline with the BART large CNN model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Load the English subset of the EUR-Lex-Sum dataset
dataset = load_dataset("dennlinger/eur-lex-sum", "english")

Device set to use mps:0


In [6]:
example_text = dataset['train'][0]['reference']
print(len(example_text))


419009


In [7]:
# At 419,009 characters the document is far longer than BART's 1,024-token input limit,
# so it cannot be summarised in one pass; count the words to plan a chunking strategy
words = example_text.split()
print(f"Total words: {len(words)}")

Total words: 66421


In [8]:
# Split into chunks of at most 800 words (a rough proxy: the model's limit is in tokens, not words)
def chunk_text(words, max_words=800):
    for i in range(0, len(words), max_words):
        yield ' '.join(words[i:i + max_words])

chunks = list(chunk_text(words))

print(f"Total chunks: {len(chunks)}")

# Summarise each chunk independently
summaries = []
for chunk in chunks:
    summary = summarizer(
        chunk,
        max_length=130,
        min_length=30,
        do_sample=False,
        truncation=True,  # guard: clip any chunk that tokenises past the model's input limit
    )
    summaries.append(summary[0]['summary_text'])

# Concatenate the chunk summaries in document order
final_summary = " ".join(summaries)
print("Combined summary:\n", final_summary[:1000], "...")  # print first 1000 chars


Total chunks: 84


Your max_length is set to 130, but your input_length is only 45. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)


Combined summary:
 Regulation (EU) 2017/1129 lays down requirements to be complied with when drawing up prospectuses. The content and the format of a prospectus depend on a variety of factors, such as the type of issuer, type of security and type of issuance. The prospectus should contain a working capital statement as well as a statement of capitalisation and indebtedness of the issuer of the underlying shares. Derivative securities entail particular risks for investors. A high level of investor protection should be ensured, the EU says. It adds that certain types of securities that are not covered by the Annexes to this Regulation will be offered to the public. ‘Third country market’ means a third country market which has been deemed equivalent to a regulated market in accordance with the requirements set out in third and fourth subparagraphs of Article 25(4) of Directive 2014/65/EU of the European Parliament and of the Council (3) ‘profit estimate’ is a profit forecast for a financi
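
The `max_length` warning above most likely comes from the final chunk: 66,421 words split into 800-word pieces leaves 83 full chunks and a 21-word remainder, which is too short to summarise meaningfully. A minimal sketch of one way to avoid the tiny tail is to spread the words evenly across the same number of chunks. The helper name `balanced_chunks` is hypothetical, and the word list below is synthetic stand-in data rather than the real document.

```python
import math


def balanced_chunks(words, max_words=800):
    """Split a word list into near-equal chunks, none longer than max_words."""
    if not words:
        return []
    # Use the smallest number of chunks that keeps every chunk within max_words
    n_chunks = math.ceil(len(words) / max_words)
    base, extra = divmod(len(words), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional word each
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks


# Synthetic stand-in with the same word count as the example document
demo_words = ["w"] * 66421
demo_chunks = balanced_chunks(demo_words)
sizes = [len(c.split()) for c in demo_chunks]
print(len(demo_chunks), min(sizes), max(sizes))  # 84 790 791
```

In the cell above, `chunks = balanced_chunks(words)` could stand in for `list(chunk_text(words))`. The chunk count stays at 84, but every chunk then holds 790 or 791 words.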

In [12]:
# Score the generated summary against the dataset's reference summary using ROUGE
from rouge_score import rouge_scorer

dataset_summary = dataset['train'][0]['summary']
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# score() takes the reference (target) first, then the prediction
scores = scorer.score(dataset_summary, final_summary)

print("ROUGE-1 F1:", scores['rouge1'].fmeasure)
print("ROUGE-2 F1:", scores['rouge2'].fmeasure)
print("ROUGE-L F1:", scores['rougeL'].fmeasure)


ROUGE-1 F1: 0.4100086281276963
ROUGE-2 F1: 0.13637148282409806
ROUGE-L F1: 0.16496980155306298
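
The F1 values above fold precision and recall into one number, which can hide why a score is low. A concatenation of 84 chunk summaries is probably much longer than the reference summary, and in that case precision rather than recall would be the weak side. The sketch below is a simplified unigram-overlap calculation (whitespace tokens, no stemming), not the `rouge_score` implementation. The function name `unigram_prf` and the toy sentences are illustrative assumptions.

```python
from collections import Counter


def unigram_prf(reference, candidate):
    """Simplified ROUGE-1-style precision, recall and F1 on whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it occurs in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: the candidate covers the reference fully but is twice as long
p, r, f = unigram_prf(
    "the cat sat on the mat",
    "the cat sat on the mat and then the dog barked loudly",
)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 1.0 0.667
```

The `scores` object from the cell above carries the same breakdown: each entry exposes `precision` and `recall` alongside `fmeasure`, for example `scores['rouge1'].precision`.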


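These scores are in line with what chunked summarisation with naive concatenation tends to produce: each chunk is summarised in isolation, so the combined text loses coherence. The much lower ROUGE-2 and ROUGE-L scores are consistent with that.
For a model fine-tuned on domain-specific data, rough targets would be:
- ROUGE-1 F1: 0.5 to 0.7 or higher
- ROUGE-2 F1: 0.3 to 0.5
- ROUGE-L F1: around 0.5 or higher, usually close to ROUGE-1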
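
One alternative to naive concatenation is to summarise hierarchically: summarise the chunks, join the results, and repeat until the text fits in a single chunk, then summarise once more. The sketch below shows only the control flow. `summarise_hierarchically` and `first_ten_words` are hypothetical names, and the stand-in summariser simply keeps the first ten words so the example runs without downloading a model.

```python
def summarise_hierarchically(text, summarise, max_words=800, max_rounds=5):
    """Chunk and summarise repeatedly until the text fits in one chunk."""
    for _ in range(max_rounds):
        words = text.split()
        if len(words) <= max_words:
            break
        chunks = [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]
        # Replace the text with the concatenation of its chunk summaries
        text = " ".join(summarise(chunk) for chunk in chunks)
    # Final pass over the (now short) text to produce one summary
    return summarise(text)


def first_ten_words(chunk):
    """Stand-in for a real summariser (assumption): keep the first ten words."""
    return " ".join(chunk.split()[:10])


demo_text = " ".join(f"word{i}" for i in range(2000))
result = summarise_hierarchically(demo_text, first_ten_words)
print(len(result.split()))  # 10
```

With the pipeline from this notebook, the `summarise` argument could be a small wrapper that calls `summarizer(chunk, max_length=130, min_length=30, do_sample=False, truncation=True)` and returns `[0]['summary_text']`. Whether this improves the ROUGE scores on this dataset has not been tested here.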