GeRedE: A Corpus of German Reddit Exchanges

GeRedE is a 270-million-token German CMC (computer-mediated communication) corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. The CWB-indexed version of our final corpus is available to registered academic users via CQPweb.

This repository contains the scripts we used to extract German submissions and comments from the vast amount of data Jason Baumgartner provides. It also contains the IDs of all submissions and comments included in our corpus, so that those who wish to recreate our corpus do not have to run every processing step themselves.

Steps for Recreating the Corpus

  1. download raw data from
    • it is recommended, though not necessary, to re-compress all files into gzip or bz2 format
    • you need both comments and submissions (from the respective subdirectories)
  2. run on the raw comments and on the thus created *-de.ldjson.gz
    • this will identify comments that are most likely German
  3. run prop_german.R on the directory containing the *-lang.tsv.gz files created in the second step
    • for each month, this will compute the proportion of German comments in each subreddit containing at least one German comment
  4. run subreddits.R on the directory containing the *-german_subreddits_prop.csv files created in the previous step
    • creates stats.csv: statistics for all subreddits and months
    • creates stats_filtered.csv: subreddit filter; retains only subreddits where the proportion of comments classified as German is above the dynamic threshold (see paper for details)
  5. run on *-de.ldjson.gz
    • this will extract all thread IDs with at least one German comment
  6. run on the thus created *-thread-ids.tsv.gz and the raw comments
    • this will extract all comments of threads that contain at least one German comment
  7. run on the thus created *-de-threads.ldjson.gz, saving the output in threads-all.ldjson.gz
    • this will sort the comments into threads
  8. run on stats_filtered.csv.gz, data/german-comment-ids.txt.gz and the above created threads-all.ldjson.gz, saving the results in threads-filtered.ldjson.gz and the scores in threads-all-lang-scores.tsv.gz
    • this will select the German threads using our combined approach (see paper for details)
  9. run on the raw submissions and the threads-all-lang-scores.tsv.gz
    • this will extract the submissions that belong to German threads
  10. run on the filtered threads and submissions ( -p tokenized/ *.ldjson.gz)
    • this will extract metadata and text, and convert the Reddit-flavored Markdown to XML
    • note that this step uses Reddit's own snudown Markdown parser and only works with Python 2
  11. tokenization and sentence splitting with SoMaJo (somajo-tokenizer -x --split_sentences)
  12. tag everything with SoMeWeTa (somewe-tagger --tag german_web_social_media_2018-12-21.model -x), then do some STTS_IBK-specific postprocessing (SoMeWeTa/utils/STTS_IBK_postprocessor -x)
  13. TODO annotate all German comments and submissions
  14. TODO run
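
As a rough illustration of steps 5 to 7, the following Python sketch collects the IDs of threads with at least one German comment, keeps all comments of those threads, and groups them by thread. It is not the code from this repository: all function names are hypothetical, and it assumes that each line of an ldjson file is a JSON object with the id and link_id fields found in Reddit comment dumps, where link_id names the submission a comment belongs to.

```python
import gzip
import json
from collections import defaultdict


def read_ldjson(path):
    """Yield one JSON object per line from a gzip-compressed ldjson file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def thread_ids(german_comments):
    """Step 5 (sketch): IDs of threads with at least one German comment."""
    return {c["link_id"] for c in german_comments}


def comments_of_threads(all_comments, ids):
    """Step 6 (sketch): keep every comment that belongs to one of the threads."""
    return (c for c in all_comments if c["link_id"] in ids)


def group_by_thread(comments):
    """Step 7 (sketch): sort comments into threads, keyed by link_id."""
    threads = defaultdict(list)
    for c in comments:
        threads[c["link_id"]].append(c)
    return dict(threads)


# Toy records standing in for the raw comments (hypothetical values).
raw = [
    {"id": "c1", "link_id": "t3_a", "body": "Guten Morgen"},
    {"id": "c2", "link_id": "t3_a", "body": "Good morning"},
    {"id": "c3", "link_id": "t3_b", "body": "Hello"},
]
german = [raw[0]]  # pretend language identification flagged only c1

threads = group_by_thread(comments_of_threads(raw, thread_ids(german)))
print(sorted(threads))                     # ['t3_a']
print([c["id"] for c in threads["t3_a"]])  # ['c1', 'c2']
```

The English reply c2 is kept because it belongs to a thread with a German comment, while thread t3_b is dropped entirely. On real data, read_ldjson could feed the other functions one monthly file at a time.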


NB: the output files of the following steps can be found in the data/ sub-folder:

  • step 2 (german-comment-ids.txt.gz)
  • step 4 (stats_filtered.csv.gz)
  • step 8 (threads-all-lang-scores.tsv.gz)
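
With these files, some steps can be skipped. For example, the German comments could be picked out of the raw data by ID instead of running language identification again. The following is a minimal Python sketch of that idea, not a script from this repository. It assumes that german-comment-ids.txt.gz holds one comment ID per line, in the same form as the id field of the raw JSON records; neither assumption has been checked against the actual file, and the function names are hypothetical.

```python
import gzip
import json


def load_ids(path):
    """Read a gzip-compressed text file into a set, assuming one ID per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def filter_by_id(comments_path, ids, out_path):
    """Copy the records whose id is in `ids`; return how many were kept."""
    kept = 0
    with gzip.open(comments_path, "rt", encoding="utf-8") as src, \
         gzip.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:
            # Skip blank lines, then compare the record's id with the ID set.
            if line.strip() and json.loads(line)["id"] in ids:
                dst.write(line)
                kept += 1
    return kept
```

A call might look like filter_by_id("raw-comments.gz", load_ids("data/german-comment-ids.txt.gz"), "comments-de.ldjson.gz"), where the first and last file names are placeholders. Loading the IDs into a set keeps each lookup cheap, at the cost of holding all IDs in memory.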

Additional Files

  • data/thread-lang-annotated.tsv.gz contains a manually annotated stratified sample of threads


  • Blombach, Andreas, Natalie Dykes, Philipp Heinrich, Besim Kabashi, and Thomas Proisl. 2020. “A Corpus of German Reddit Exchanges (GeRedE).” In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 6310–6316. Marseille: European Language Resources Association.

      author =    {Blombach, Andreas and Dykes, Natalie and Heinrich,
                   Philipp and Kabashi, Besim and Proisl, Thomas},
      title =     {A Corpus of {G}erman {R}eddit Exchanges ({GeRedE})},
      year =      {2020},
      booktitle = {Proceedings of the 12th Conference on Language
                   Resources and Evaluation ({LREC} 2020)},
      pages =     {6310--6316},
      publisher = {European Language Resources Association},
      address =   {Marseille},
      url =       {},

