DaNewsroom - A Large Scale Summarization Dataset for Danish

A summarization dataset created entirely from archived content hosted at The goal of this dataset was to (1) be sizeable and (2) have high recall. As a result, precision may be lacking and noise will be present.

1. Prerequisites

Code Dependencies

# python dependencies
pip install -r requirements.txt

# ilp solver for the ICSI system

# macOS
brew install glpk

# linux
apt-get install -y glpk-utils

# windows isn't tested


We provide the dataset freely through "thin files": gzip-compressed JSONL files in which each line holds a sample's URL and measures (but not the summary or article body). The thin files are danewsroom.thin and danewsroom.test.thin.
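
Since a thin file is plain gzip-compressed JSONL, it can also be inspected with the Python standard library alone. The following is a minimal sketch, not part of the repository's tooling; `read_thin` is a hypothetical helper name, and no field names are assumed beyond each line being one JSON object.

```python
import gzip
import json


def read_thin(path, limit=None):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file.

    `limit` (hypothetical convenience parameter) caps the number of records.
    """
    count = 0
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            if limit is not None and count >= limit:
                break
            yield json.loads(line)
            count += 1


# Example usage (assumes the thin file has been downloaded to this path):
# for row in read_thin("distributed-data/danewsroom.thin", limit=3):
#     print(sorted(row.keys()))
```

Printing the keys of the first few rows is a quick way to see which measures are stored alongside each URL.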

Reach out to if you'd like the dataset directly. For legal reasons, we cannot distribute the dataset directly.

2. Build DaNewsroom (Scrape and Extract)

Use the newsroom tool to scrape news articles from This should take about 5 days due to throttling.

newsroom-scrape --thin distributed-data/danewsroom.thin --archive data/danewsroom.archive

Use the newsroom tool to extract article content from the HTML dumps. This should take a few hours with a decent number of CPUs.

newsroom-extract --archive data/danewsroom.archive --dataset data/danewsroom.dataset

3. Run and Evaluate DaNewsroom (split, run and compute ROUGE scores)

Note that while we do provide predefined splits, we highly encourage you to create your own splits, following recent work.
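
One simple way to build a custom, reproducible split is to hash each sample's URL and map the hash to a bucket. This is only an illustrative sketch and not necessarily the procedure the recent work has in mind; `assign_split` and its fraction parameters are hypothetical names, not part of the repository.

```python
import hashlib


def assign_split(url, dev_fraction=0.05, test_fraction=0.05):
    """Deterministically map a URL to "train", "dev" or "test".

    The first 32 bits of the SHA-256 digest are scaled to [0, 1), so the
    same URL always lands in the same split, independent of file order.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + dev_fraction:
        return "dev"
    return "train"
```

Because the assignment depends only on the URL, re-scraping or re-ordering the dataset does not move samples between splits.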

# Split dataset into train/dev/test
python build/ split-dataset --dataset data/danewsroom.dataset --dev data/dev.thin --test data/test.thin

# Run models

mkdir results

# run below with one of the models [gensim-textrank, ilp, oracle, lead-3].
python models/ --model MODELNAME --dataset data/test.dataset --summaries results/danewsroom.test.MODELNAME.summaries --budget 35 [--parallelize]

# Compute ROUGE scores (here we continue with the textrank example)
# This produces two files:
#   1. *.scores with ROUGE scores for each system summary
#   2. *.rouge with ROUGE scores across the entire dataset
newsroom-score --dataset data/test.dataset --summaries results/danewsroom.test.MODELNAME.summaries --scores results/danewsroom.test.MODELNAME.scores --rouge 1,2,L --unstemmed > results/danewsroom.test.MODELNAME.rouge

# Compute table with binned measures
# This takes the *.scores-file and produces a table with ROUGE scores across binned measures (density, compression and coverage)
newsroom-tables --scores results/danewsroom.test.MODELNAME.scores --rouge 1,2,L --variants fscore --bins density,compression,coverage > results/danewsroom.test.MODELNAME.table
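
For readers unfamiliar with the three measures used for binning, the sketch below computes them as defined in the original Newsroom paper (Grusky et al., 2018): coverage is the fraction of summary tokens that lie in extractive fragments, density is the average squared fragment length per summary token, and compression is the article-to-summary length ratio. It assumes pre-tokenized input (for example, whitespace-split lowercase text); the newsroom library's own tokenization and implementation may differ, and the function names here are hypothetical.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest token spans shared by summary and article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = []
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                # Extend the match as far as both sequences agree.
                i2, j2 = i, j
                while (i2 < len(summary_tokens) and j2 < len(article_tokens)
                       and summary_tokens[i2] == article_tokens[j2]):
                    i2 += 1
                    j2 += 1
                if i2 - i > len(best):
                    best = summary_tokens[i:i2]
                j = j2
            else:
                j += 1
        if best:
            fragments.append(best)
        # Skip past the matched span, or one token if nothing matched.
        i += max(len(best), 1)
    return fragments


def measures(article_tokens, summary_tokens):
    """Return coverage, density and compression for one article/summary pair."""
    n = len(summary_tokens)
    if n == 0:
        return {"coverage": 0.0, "density": 0.0, "compression": 0.0}
    fragments = extractive_fragments(article_tokens, summary_tokens)
    return {
        "coverage": sum(len(f) for f in fragments) / n,
        "density": sum(len(f) ** 2 for f in fragments) / n,
        "compression": len(article_tokens) / n,
    }
```

A fully copied summary has coverage 1.0 and a high density, while a fully abstractive one has coverage and density of 0.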

4. Building a similar dataset for another language

In addition to distributing this dataset, we want to enable others to create a similar dataset. Following the procedure described in this paper, constructing a dataset consists of 4 steps:

  1. Curate a list of sites and collect URLs from through their CDX server (link to documentation)
  2. Filter out assets and keep hURLs
  3. Download snapshots from URL collection and extract article content
  4. Domain specific quality control

We distribute tooling for all steps apart from domain specific quality control and describe them below. In the below example we'll consider a single domain

4.1. Collect URLs

# Collect URLs for domain. This outputs to a file called
scrapy runspider build/ -a

# Inspect URL count
# MacOS
zcat < | wc -l

# Linux
zcat | wc -l

4.2. Filter URLs

# basic filtering based on filtering out assets and restricting URLs to hURLs.
# this script assumes no duplicated URLs.
python build/ filter-urls --urls-file --out dr.dr.filtered.jsonl.gz
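
The exact rules of the filtering script are not reproduced here. As a rough illustration of the asset-filtering part only, the sketch below drops URLs whose path ends in a typical static-asset extension. The extension list and the function names are assumptions made for this example, and the restriction to hURLs is not covered.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of extensions treated as assets; adjust per domain.
ASSET_EXTENSIONS = {
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
    ".pdf", ".xml", ".json", ".woff", ".woff2", ".mp3", ".mp4",
}


def is_asset(url):
    """Return True if the URL path ends in a known asset extension."""
    path = urlparse(url).path.lower()
    extension = posixpath.splitext(path)[1]
    return extension in ASSET_EXTENSIONS


def filter_urls(urls):
    """Keep only URLs that do not look like static assets (order preserved)."""
    return [url for url in urls if not is_asset(url)]
```

Query strings are ignored because only the parsed path is inspected, so a URL such as `site.css?v=2` is still recognized as an asset.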

4.3. Download snapshots (same as section 2)

newsroom-scrape --thin dr.dr.filtered.jsonl.gz --archive

newsroom-extract --archive --dataset

4.4. Domain specific quality control on extracted articles

As described in the paper, this work filters out documents that have:

  • Empty summaries/bodies
  • Non-unique summaries/bodies
  • Compression measure less than 1.5
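
The three criteria above could be applied to extracted rows roughly as follows. This is a sketch under stated assumptions rather than the paper's implementation: the keys `summary`, `text` and `compression` are assumed field names, `quality_filter` is a hypothetical helper, and "non-unique" is interpreted as keeping the first occurrence and dropping later repeats.

```python
def quality_filter(rows, min_compression=1.5):
    """Drop rows with empty, repeated, or insufficiently compressed content.

    Assumes each row is a dict with "summary", "text" and "compression" keys.
    """
    seen_summaries = set()
    seen_bodies = set()
    kept = []
    for row in rows:
        summary = (row.get("summary") or "").strip()
        body = (row.get("text") or "").strip()
        if not summary or not body:
            continue  # empty summary or body
        if summary in seen_summaries or body in seen_bodies:
            continue  # summary or body already seen
        if row.get("compression", 0.0) < min_compression:
            continue  # compression measure below the threshold
        seen_summaries.add(summary)
        seen_bodies.add(body)
        kept.append(row)
    return kept
```

If every copy of a repeated summary or body should be discarded instead, a first pass counting occurrences would be needed before filtering.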


1. "Cannot open exception db file for reading: data/WordNet-2.0.exc.db"

Check out

Note that the ROUGE Perl script is embedded in the newsroom distribution, which is included in this repository (newsroom-lib/newsroom/analyze/rouge/ROUGE-1.5.5).

2. What are thin/archive/dataset files?

In this repo we refer to three types of files: thin, archive, and dataset files. These are all simply compressed JSONL files, which can be inspected using zless or programmatically:

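from newsroom import jsonl

# Example path; thin, archive, and dataset files are all opened the same way.
with jsonl.open("data/danewsroom.dataset", gzip=True) as filehandler:
    for row in filehandler:
        # row is a dictionary
        print(row.keys())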


    title = "{D}a{N}ewsroom: A Large-scale {D}anish Summarisation Dataset",
    author = "Varab, Daniel  and
      Schluter, Natalie",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "",
    pages = "6731--6739",
    abstract = "Dataset development for automatic summarisation systems is notoriously English-oriented. In this paper we present the first large-scale non-English language dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in the Danish language. There has previously been no work done to establish a Danish summarisation dataset, nor any published work on the automatic summarisation of Danish. We provide therefore the first automatic summarisation dataset for the Danish language (large-scale or otherwise). To support the comparison of future automatic summarisation systems for Danish, we include system performance on this dataset of strong well-established unsupervised baseline systems, together with an oracle extractive summariser, which is the first account of automatic summarisation system performance for Danish. Finally, we make all code for automatically acquiring the data freely available and make explicit how this technology can easily be adapted in order to acquire automatic summarisation datasets for further languages.",
    language = "English",
    ISBN = "979-10-95546-34-4",

