Summary data for the Potsdam Commentary Corpus.
-
The original corpus is the Potsdam Commentary Corpus. For more information, please view the original corpus page.
-
We used the (manual) syntax annotations available with the corpus to split the texts into sentences.
-
We then asked annotators to choose the three most important sentences. The exact wording of the task was as follows: "Read the texts and choose 3 sentences per text, that represent the core of the text. Then, order the sentences according to their importance: label the most important sentence with a '1', the second most important with a '2' and the third with a '3'. In this context, importance refers to the informative value of the sentence: is it a core statement, that could possibly be used to summarise the text? Then the sentence is important. Sentences which include anaphoric elements (particularly all types of pronouns) can also be chosen. When judging a sentence, you should "mentally" replace all anaphora by their antecedents."
-
For a small subset of 30 texts, we collected two annotations per text. We harmonised these using a scoring system: in each annotation, the highest-ranked sentence received 3 points, the second 2 and the third 1. The points from both annotations were added up per sentence. The sentence with the most points was then deemed the highest-ranked sentence in the harmonised annotation, and so on. Ties were resolved by randomly selecting one of the tied sentences.
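The scoring scheme can be illustrated with a minimal Python sketch. This is not the script used to build the corpus; the function name `harmonise`, the input layout (one dict per annotator, mapping rank to sentence index) and the way the random tie-break is implemented are assumptions made for illustration only.

```python
import random
from collections import defaultdict

# Points per rank, as described above: rank 1 -> 3, rank 2 -> 2, rank 3 -> 1.
POINTS = {1: 3, 2: 2, 3: 1}


def harmonise(annotations, rng=None):
    """Merge several rankings of one text into a single top-3 ranking.

    `annotations` is a list of dicts mapping rank (1, 2, 3) to a sentence
    index. This input layout is hypothetical, not the corpus format.
    """
    rng = rng or random.Random()
    scores = defaultdict(int)
    for annotation in annotations:
        for rank, sentence in annotation.items():
            scores[sentence] += POINTS[rank]
    # Sort by descending score; sentences with equal scores are ordered
    # by a random secondary key, i.e. ties are broken at random.
    ranked = sorted(scores, key=lambda s: (-scores[s], rng.random()))
    return ranked[:3]


# Two annotators: sentence 2 scores 3 + 3 = 6, sentence 8 scores 1 + 2 = 3,
# sentence 5 scores 2, sentence 6 scores 1.
print(harmonise([{1: 2, 2: 5, 3: 8}, {1: 2, 2: 8, 3: 6}]))  # [2, 8, 5]
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible.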
We provide the corpus in JSON format. For each text, the JSON file contains the original file name and the rankings of the three sentences. We also provide a text file for each individual text, containing the three sentences that form the summary, each on a new line.
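As a rough sketch of how the JSON data might be read in Python: the record layout below (a list of objects with the keys `file_name` and `rankings`) and the sample values are purely hypothetical and are only there to make the example runnable. Please check the actual JSON file for the real key names and structure.

```python
import json

# Hypothetical sample; the real file may use different keys and nesting.
sample = '[{"file_name": "example_text", "rankings": {"1": 4, "2": 0, "3": 7}}]'


def load_rankings(json_text):
    """Return a dict mapping file name to its sentences in rank order.

    Assumes the hypothetical layout shown in `sample`.
    """
    records = json.loads(json_text)
    return {
        record["file_name"]: [record["rankings"][str(rank)] for rank in (1, 2, 3)]
        for record in records
    }


print(load_rankings(sample))
```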
If you use these summaries, please cite the following paper:
Freya Hewett and Manfred Stede. Extractive summarisation for German-language data: a text-level approach with discourse features. In Proceedings of the 29th International Conference on Computational Linguistics (COLING). 2022. To appear.
If you use anything from the original corpus, please cite the following paper:
Peter Bourgonje and Manfred Stede. The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020). Marseille, France, May 2020. European Language Resources Association (ELRA).