PCC Summaries

Summary data for the Potsdam Commentary Corpus.

About the data

The original corpus is the Potsdam Commentary Corpus. For more information, please view the original corpus page.
We used the (manual) syntax annotations available with the corpus to split the texts into sentences.
We then asked annotators to choose the three most important sentences. The exact wording of the task was as follows: "Read the texts and choose 3 sentences per text, that represent the core of the text. Then, order the sentences according to their importance: label the most important sentence with a '1', the second most important with a '2' and the third with a '3'. In this context, importance refers to the informative value of the sentence: is it a core statement, that could possibly be used to summarise the text? Then the sentence is important. Sentences which include anaphoric elements (particularly all types of pronouns) can also be chosen. When judging a sentence, you should "mentally" replace all anaphora by their antecedents."
For a small subset of 30 texts, we collected two annotations per text and harmonised them using a scoring system. The highest-ranked sentence was worth 3 points, the second 2 and the third 1, and the points from both annotations were added up per sentence. The sentence with the most points was then deemed the highest-ranked sentence in the harmonised annotation, and so on. Any tied scores were resolved by randomly selecting one of the sentences in the tie.
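The scoring scheme described above can be sketched roughly as follows. This is an illustrative reconstruction, not the script used to build the corpus: the function name `harmonise`, the dict-based input format (sentence index mapped to rank) and the shuffle-based tie-breaking are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Points per rank, as described in the text: rank 1 -> 3, rank 2 -> 2, rank 3 -> 1.
POINTS = {1: 3, 2: 2, 3: 1}


def harmonise(annotation_a, annotation_b, rng=None):
    """Merge two annotations into one ranking of three sentences.

    Each annotation is a dict mapping a sentence index to its rank (1, 2 or 3).
    This input format is hypothetical and chosen only for illustration.
    """
    rng = rng or random.Random()

    # Sum the points each sentence received across both annotations.
    scores = defaultdict(int)
    for annotation in (annotation_a, annotation_b):
        for sentence, rank in annotation.items():
            scores[sentence] += POINTS[rank]

    # Shuffle first, then sort stably by score: tied sentences keep their
    # shuffled (random) relative order, which resolves ties at random.
    candidates = list(scores)
    rng.shuffle(candidates)
    candidates.sort(key=lambda s: scores[s], reverse=True)

    # Keep the top three and assign harmonised ranks 1 to 3.
    return {sentence: rank for rank, sentence in enumerate(candidates[:3], start=1)}


first = {4: 1, 7: 2, 9: 3}
second = {4: 1, 9: 2, 2: 3}
harmonised = harmonise(first, second)
```

In this example sentence 4 collects 6 points, sentence 9 collects 3, sentence 7 collects 2 and sentence 2 collects 1, so no tie needs to be broken and the harmonised ranking is 4, 9, 7.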

About the files

We have provided the corpus in JSON format. The JSON file contains the original file name and the rankings for the three sentences. We have also provided a text file for each individual text, containing the three sentences that form the summary, each on a new line.
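As a minimal sketch, one of the per-text summary files could be read like this. The helper name `read_summary`, the UTF-8 encoding and the example file name are assumptions; only the one-sentence-per-line layout is taken from the description above.

```python
from pathlib import Path


def read_summary(path):
    """Return the summary sentences from one text file, one per non-empty line.

    UTF-8 encoding is assumed here; adjust it if the files use another encoding.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Hypothetical usage; "example.txt" is a placeholder, not a real file name:
# sentences = read_summary("text_files/example.txt")
```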

How to cite

If you use these summaries, please cite the following paper:

Freya Hewett and Manfred Stede. Extractive summarisation for German-language data: a text-level approach with discourse features. In Proceedings of the 29th International Conference on Computational Linguistics (COLING). 2022. To appear.

If you use anything from the original corpus, please cite the following paper:

Peter Bourgonje and Manfred Stede. The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020). Marseille, France, May 2020. European Language Resources Association (ELRA).
