Multi-document Summarization for News Articles Highlights Extraction

This repository contains the code for our Multi-News adaptation of the paper "Transformer-based Highlights Extraction from scientific papers" (THExt), developed for the Deep Natural Language Processing class (A.Y. 2022-2023).

Highlight extraction consists of selecting the salient sentences in the body of a text that best summarize its meaning. This work focuses on extracting highlights from news articles using transformer-based techniques. Our task was to adapt THExt to another domain and extend it to multi-document summarization.

We also propose an ablation study that aims to improve the baseline by providing a generated context produced with the LED architecture. The results of the multi-document experiments and of the ablation study, obtained on two distinct datasets, indicate that the model is effective for multi-document summarization, that it adapts to a different domain, and that it can be improved by providing a broader context.

The pipeline we propose, which you can reproduce, is the following:

  • Data preprocessing
  • Extractor and abstractor model training
  • Model evaluation

Requirements


  • Python 3 (tested on Python 3.6)
  • PyTorch
    • GPU and CUDA-enabled installation (the code is not runnable on CPU)
  • TensorFlow
  • pyrouge (for evaluation)

Dataset Download

The datasets used are the following:

  1. Multi-News dataset
  2. AIPubSumm, CSPubSumm, and BIOPubSumm. These datasets are not publicly available, so some directories may need to be created manually: please refer to this repo for further information.

Execution Guide

Two examples showing how to run our work are available on Colab at the following links:

  1. Multi-News Adaptation: Multi-News Adaptation

  2. New Context Generation: New Contexts Generation

To run the code locally, first install the requirements listed in requirements.txt:

pip install -r requirements.txt

Pipeline Description

1. Multi-document Summarization Pipeline

First, we preprocess the input news articles from the Multi-News dataset [1] using techniques such as stopword removal, sentence segmentation, and separation of the articles belonging to the same cluster, which improve the performance of the summarization model. During this stage, a single tokenized text is created by merging the texts of all the articles related to the same news story. A new context is also defined by merging the first 20% of the sentences of each article in the same news cluster.
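
As a rough illustration of the context construction described above, the following sketch keeps the first 20% of the sentences of each article in a cluster and joins them into one string. It is not the repository's implementation: the function names, the naive regex sentence splitter, and the assumption that the articles of one cluster arrive as a single string joined by the `|||||` separator are all choices made here to keep the example self-contained.

```python
import math
import re

# Assumption: the articles of one cluster are joined by this separator token.
SEPARATOR = "|||||"


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_context(cluster_text, fraction=0.2):
    """Merge the leading `fraction` of sentences of every article in a cluster."""
    context = []
    for article in cluster_text.split(SEPARATOR):
        sentences = split_sentences(article)
        if not sentences:
            continue  # skip empty articles
        # Round up so that short articles still contribute one sentence.
        keep = max(1, math.ceil(len(sentences) * fraction))
        context.extend(sentences[:keep])
    return " ".join(context)


if __name__ == "__main__":
    cluster = "A one. A two. A three. A four. A five.|||||B one. B two."
    print(build_context(cluster))
```

Rounding up, so that every non-empty article contributes at least one sentence, is also an assumption; a real pipeline would likely use a proper sentence tokenizer instead of the regex.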

Next, we fine-tune the chosen models, BERT and Longformer, on the Multi-News dataset to obtain optimized weights. Finally, a regression head based on a fully connected layer produces the highlights for the cluster of articles.
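
The final selection step can be pictured as follows, under the assumption (not stated explicitly above) that the regression head assigns one relevance score per sentence and that the highest-scoring sentences are returned as highlights. The function name, the default `k=3`, and the choice to return the highlights in document order are hypothetical.

```python
def select_highlights(sentences, scores, k=3):
    """Return the k sentences with the highest predicted scores.

    `scores` stands in for the output of the regression head, one value
    per sentence (an assumption made for this sketch).
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Rank sentence indices by predicted score, best first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k and restore their original order in the text.
    top = sorted(ranked[:k])
    return [sentences[i] for i in top]


if __name__ == "__main__":
    sents = ["First.", "Second.", "Third.", "Fourth."]
    preds = [0.1, 0.9, 0.3, 0.5]
    print(select_highlights(sents, preds, k=2))
```

Restoring document order keeps the extracted highlights readable as a sequence; returning them in score order would be an equally valid choice.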

2. Context Generator Pipeline

The LED (Longformer Encoder-Decoder) model, an extension of the Longformer, was employed to create a new context from the datasets in [2]. Specifically, we performed an abstractive summarization of different sections of the papers, namely the abstract, introduction, and results.

For further information and to see our results, please refer to the accompanying paper.

Authors: Matteo Berta, Arcangelo Frigiola, Francesco Marigioli, Luca Varriale

