A web and social media corpus based on the dataset of the EmpiriST 2015 shared task


EmpiriST corpus

Introduction

The EmpiriST corpus is a manually annotated corpus of German web pages and German computer-mediated communication (CMC), i.e. written discourse. Examples of CMC genres are monologic and dialogic tweets, social and professional chats, threads from Wikipedia talk pages, WhatsApp interactions and blog comments.

The dataset was originally created by Beißwenger et al. (2016) for the EmpiriST 2015 shared task and featured manual tokenization and part-of-speech tagging. Subsequently, Rehbein et al. (2018) incorporated the dataset into their harmonised testsuite for POS tagging of German social media data, manually added sentence boundaries and automatically mapped the part-of-speech tags to UD POS tags. In our own annotation efforts (Proisl et al., in preparation), we manually normalized and lemmatized the data and converted the corpus into a “vertical” format suitable for import into the Open Corpus Workbench, CQPweb, SketchEngine, or similar corpus tools.
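As a rough illustration of what working with a “vertical” file can look like, here is a minimal Python sketch that groups token lines into sentences. It relies on assumptions that this README does not confirm: that sentences are wrapped in `<s>` … `</s>` tags, that token lines are tab-separated, and that the columns are word form, POS tag and lemma. The sample data and the function name `read_vrt_sentences` are made up for the example; check `empirist.vrt` for the actual attributes.

```python
import re

# A structural line looks like an XML tag and contains no tab.
# Requiring a letter after "<" keeps tokens such as "<3" from being
# mistaken for tags (a heuristic, not a full VRT parser).
TAG = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)(\s[^>]*)?/?>$")


def read_vrt_sentences(lines, sentence_tag="s"):
    """Yield each sentence as a list of token rows (lists of column values).

    Assumes sentences are delimited by <s> ... </s>; tokens outside of a
    sentence region are ignored.
    """
    sentence = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        match = TAG.match(line) if "\t" not in line else None
        if match:
            closing, name = match.group(1), match.group(2)
            if name == sentence_tag:
                if closing:
                    if sentence is not None:
                        yield sentence
                    sentence = None
                else:
                    sentence = []
            continue
        if sentence is not None:
            sentence.append(line.split("\t"))


# Hypothetical sample; the column layout (word, POS, lemma) is an assumption.
sample = (
    '<text id="demo">\n'
    "<s>\n"
    "Das\tART\tder\n"
    "ist\tVAFIN\tsein\n"
    "toll\tADJD\ttoll\n"
    "<3\tEMOASC\t<3\n"
    "</s>\n"
    "</text>\n"
)

sentences = list(read_vrt_sentences(sample.splitlines()))
for row in sentences[0]:
    print(row)
```

To read the real corpus, pass an open file handle instead of the sample, for example `open("empirist.vrt", encoding="utf-8")`, and adjust the column interpretation to the attributes actually present.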

Annotation

TODO: Describe S and P attributes

The following subsections give additional information about the annotation process.

Tokenization and part-of-speech tagging

Beißwenger et al. (2016: 47) describe the annotation process as follows:

All data sets were manually tokenized and PoS tagged by multiple annotators, based on the official tokenization […] and tagging guidelines […]. Cases of disagreement were then adjudicated by the task organizers to produce the final gold standard.

Sentence splitting

Rehbein et al. (2018: 20) used the following rules to guide the segmentation:

  • Hashtags and URLs at the beginning or the end of the tweet that are not integrated in the sentence are separated and form their own unit […].
  • Emoticons are treated as non-verbal comments to the text and are thus integrated in the utterance.
  • Interjections (Aaahh), inflectives (*grins*), fillers (ähm) and acronyms typical for CMC (lol, OMG) are also not separated but considered as part of the message.
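The rules above were applied manually, but the first one can be approximated in code. The following Python sketch detaches hashtags and URLs at the edges of an already tokenized tweet and leaves emoticons, interjections and CMC acronyms inside the main unit. It makes simplifying assumptions: every hashtag or URL at an edge counts as not integrated in the sentence (a decision the annotators made case by case), each detached token forms its own unit, and the regular expressions are rough heuristics. The function names and the example tweet are hypothetical.

```python
import re

# Rough heuristics for the two token types named in the first rule.
URL = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
HASHTAG = re.compile(r"^#\w+$")


def is_detachable(token):
    """True for tokens that look like a hashtag or a URL."""
    return bool(URL.match(token) or HASHTAG.match(token))


def split_edge_units(tokens):
    """Split a tokenized tweet into units.

    Hashtags and URLs at the beginning or end become separate units;
    everything in between (including emoticons, interjections and
    acronyms such as "lol") stays together as one unit.
    """
    start = 0
    while start < len(tokens) and is_detachable(tokens[start]):
        start += 1
    end = len(tokens)
    while end > start and is_detachable(tokens[end - 1]):
        end -= 1

    units = [[tok] for tok in tokens[:start]]
    if start < end:
        units.append(tokens[start:end])
    units.extend([tok] for tok in tokens[end:])
    return units


# Made-up example tweet, already tokenized.
tweet = ["#Bahn", "Schon", "wieder", "Verspätung", ":-(", "lol", "http://example.com/x"]
units = split_edge_units(tweet)
for unit in units:
    print(unit)
```

A hashtag in the middle of the tweet is never detached by this sketch, which roughly mirrors the idea that integrated hashtags remain part of the sentence.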

Normalization and lemmatization

The data were individually normalized and lemmatized by four student annotators according to the lemmatization guidelines. Unclear cases were decided in group meetings with the team leaders.

Authors

The corpus data was collected, tokenized and part-of-speech tagged by the organizers of the EmpiriST 2015 shared task: Michael Beißwenger, Sabine Bartsch, Stefan Evert and Kay-Michael Würzner.

Ines Rehbein, Josef Ruppenhofer and Victor Zimmermann added sentence boundaries and automatically mapped the STTS POS tags to UD POS tags.

Thomas Proisl, Natalie Dykes, Philipp Heinrich, Besim Kabashi and Stefan Evert added normalization and lemmatization.

References

  • Beißwenger, Michael, Sabine Bartsch, Stefan Evert, and Kay-Michael Würzner. 2016. “EmpiriST 2015: A shared task on the automatic linguistic annotation of computer-mediated communication and web corpora.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, 44–56, Berlin. Association for Computational Linguistics.
  • Rehbein, Ines, Josef Ruppenhofer, and Victor Zimmermann. 2018. “A harmonised testsuite for POS tagging of German social media data.” In Proceedings of the 14th Conference on Natural Language Processing (KONVENS 2018), 18–28, Wien.