Data and code for summaries for the Potsdam Commentary Corpus

fhewett/pcc-summaries

PCC Summaries

Summary data for the Potsdam Commentary Corpus.

About the data

  • The original corpus is the Potsdam Commentary Corpus. For more information, please view the original corpus page.

  • We used the (manual) syntax annotations available with the corpus to split the texts into sentences.

  • We then asked annotators to choose the three most important sentences. The exact wording of the task was as follows: "Read the texts and choose 3 sentences per text, that represent the core of the text. Then, order the sentences according to their importance: label the most important sentence with a '1', the second most important with a '2' and the third with a '3'. In this context, importance refers to the informative value of the sentence: is it a core statement, that could possibly be used to summarise the text? Then the sentence is important. Sentences which include anaphoric elements (particularly all types of pronouns) can also be chosen. When judging a sentence, you should "mentally" replace all anaphora by their antecedents."

  • For a small subset of 30 texts, we collected two annotations per text and harmonised them using a scoring system. The highest-ranked sentence received 3 points, the second 2 points and the third 1 point, and the points from both annotations were summed. The sentence with the most points became the highest-ranked sentence in the harmonised annotation, and so on. Ties were resolved by randomly selecting one of the tied sentences.
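The harmonisation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the data: the function name `harmonise`, the sentence identifiers and the dict-based input format are hypothetical, and ties are broken here by shuffling before a stable sort, which is one possible reading of "randomly selecting one of the sentences in the tie".

```python
import random
from collections import defaultdict

# Points per rank, as described above: rank 1 -> 3 points, rank 2 -> 2, rank 3 -> 1.
POINTS = {1: 3, 2: 2, 3: 1}


def harmonise(annotation_a, annotation_b, rng=None):
    """Merge two annotations into one top-3 ranking.

    Each annotation is a dict mapping a sentence id to its rank (1, 2 or 3).
    This input format is an assumption made for the sketch.
    """
    rng = rng or random.Random()

    # Sum the points each sentence received across both annotations.
    scores = defaultdict(int)
    for annotation in (annotation_a, annotation_b):
        for sentence_id, rank in annotation.items():
            scores[sentence_id] += POINTS[rank]

    # Shuffle first, then sort by score. Python's sort is stable, so
    # sentences with equal scores stay in their shuffled (random) order.
    candidates = list(scores)
    rng.shuffle(candidates)
    candidates.sort(key=lambda sentence_id: scores[sentence_id], reverse=True)

    # Keep the three best sentences and assign ranks 1 to 3.
    return {sentence_id: rank for rank, sentence_id in enumerate(candidates[:3], start=1)}


# Made-up example without ties: s2 scores 6, s7 scores 3, s5 scores 2, s9 scores 1.
first = {"s2": 1, "s5": 2, "s7": 3}
second = {"s2": 1, "s7": 2, "s9": 3}
print(harmonise(first, second))  # {'s2': 1, 's7': 2, 's5': 3}
```

Passing a seeded `random.Random` as `rng` makes the tie-breaking reproducible, which is useful if the harmonised ranking has to be regenerated later.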

About the files

We provide the summaries in JSON format. The JSON file contains, for each text, the original file name and the rankings of the three selected sentences. We also provide one text file per text, containing the three summary sentences, each on its own line.
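As a rough sketch of how such a file could be read, the snippet below parses a record and lists the selected sentences in order of importance. The exact schema is not documented here, so the key names `file` and `rankings`, the use of sentence indices, and the sample record itself are all placeholders invented for illustration. Check the actual JSON file and adjust the names before using this.

```python
import json

# Hypothetical record layout: "file" and "rankings" are placeholder key names,
# and the values are invented. The real JSON file may be structured differently.
sample = """
[
  {"file": "example_text_1", "rankings": {"1": 4, "2": 0, "3": 7}}
]
"""


def ranked_sentence_indices(record):
    """Return the selected sentence indices, most important first."""
    rankings = record["rankings"]
    # JSON object keys are strings, so sort them numerically by rank.
    return [rankings[rank] for rank in sorted(rankings, key=int)]


records = json.loads(sample)
for record in records:
    print(record["file"], ranked_sentence_indices(record))
```

To read the real data, replace `json.loads(sample)` with `json.load` on the opened file.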

How to cite

If you use these summaries, please cite the following paper:

Freya Hewett and Manfred Stede. Extractive summarisation for German-language data: a text-level approach with discourse features. In Proceedings of the 29th International Conference on Computational Linguistics (COLING). 2022. To appear.

If you use anything from the original corpus, please cite the following paper:

Peter Bourgonje and Manfred Stede. The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020). Marseille, France, May 2020. European Language Resources Association (ELRA).
