Documentation for Correcting Plays from TextGrid Repository
When we started this project, we took over hundreds of plays from the TextGrid Repository and were happy to have texts already marked up in TEI. TextGridRep itself had taken over the data from zeno.org; the TEI markup was generated automatically by converting zeno.org's proprietary XML into TEI. Unfortunately, the result was very problematic, to say the least. It took us two years to clean up and correct the TEI, a process we finished in May 2019. We now have a cleaned-up and enhanced corpus of German drama (which we named GerDraCor). We will maintain and grow it in the future and hope it will prove fruitful for all kinds of research purposes. For documentation purposes, here are the steps we took to arrive at a more reliable corpus of German-language drama:
After fiddling around with the corpus for some time and primarily using a stripped-down intermediary format for our research, we started an all-new clean-up process from scratch on January 6, 2017 (commit).
The original TEI files had a bunch of superfluous <div> wrappers around acts and scenes, which we got rid of on August 6, 2017 (commit).
On September 17, 2017, we started the more detailed part of the clean-up and enhancement process comprising several sub-steps per play:
Enhancement: Add subtitle to <set> info from zeno.org, because this info was accidentally left out during the conversion into TEI (meaning that the TextGrid Repository does not have this info; see for example Gottsched's Cato at zeno.org and at TextGridRep).
Verse plays are mostly okay when it comes to correct line wraps (<l>), but plays in prose are very problematic: short speeches (those shorter than about 80 characters) were wrapped not in <p> but in <l>. Changing that cannot be entirely automated due to some anomalies (it was still easy enough, but took a lot of time).
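The retagging of short speeches in prose plays can be sketched roughly as follows. This is a minimal illustration, not the procedure we actually ran: the function name `retag_lines_as_paragraphs`, the `max_len` threshold parameter and the sample speeches are made up for this example, and the sketch assumes the play is already known to be in prose. The anomalies mentioned above would still need manual review.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def retag_lines_as_paragraphs(root, max_len=80):
    """Rename short <l> elements to <p> in a play assumed to be prose.

    Returns the number of elements that were renamed. Longer <l>
    elements are left alone so they can be checked by hand.
    """
    changed = 0
    # Collect first, then modify, so iteration is not affected by the edits.
    for line in list(root.iter(f"{{{TEI_NS}}}l")):
        text = "".join(line.itertext()).strip()
        if len(text) < max_len:
            line.tag = f"{{{TEI_NS}}}p"
            changed += 1
    return changed


# Made-up sample: one speech wrongly wrapped in <l>, one already in <p>.
SAMPLE = f"""<text xmlns="{TEI_NS}"><body>
<sp who="#a"><speaker>A.</speaker><l>Guten Morgen!</l></sp>
<sp who="#b"><speaker>B.</speaker><p>Guten Morgen.</p></sp>
</body></text>"""

root = ET.fromstring(SAMPLE)
changed_count = retag_lines_as_paragraphs(root)
print(changed_count)
```

Only the element name is changed; text content and attributes stay as they were.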
Only around half of the stage directions are marked up correctly; the rest are wrapped in <hi>, which makes them indistinguishable from emphasised text parts or text parts in foreign languages, which have the exact same wrapping. This was another big problem, because it cannot be automated and had to be decided passage by passage. We decided to get rid of all <hi> elements and convert them to <stage> if they are stage directions, or to <emph> if they are emphasised text parts (this made it easier to check whether all cases of <hi> had eventually been taken care of, which is now the case).
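As a rough sketch of this workflow, the snippet below applies a list of manual decisions to the <hi> elements of a fragment and then counts how many <hi> elements remain. The helper names `retag_hi` and `remaining_hi`, the decision list and the sample speech are all hypothetical; the decision itself (stage direction or emphasis) still has to be made by a human reader.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(name):
    """Return the namespace-qualified tag name used by ElementTree."""
    return f"{{{TEI_NS}}}{name}"


def retag_hi(root, decisions):
    """Apply manual decisions to <hi> elements in document order.

    decisions holds one entry per <hi>: "stage", "emph", or None
    (None leaves the element untouched). Attributes are not changed.
    """
    hi_elements = list(root.iter(tei("hi")))
    for element, decision in zip(hi_elements, decisions):
        if decision in ("stage", "emph"):
            element.tag = tei(decision)


def remaining_hi(root):
    """Count the <hi> elements that have not been dealt with yet."""
    return sum(1 for _ in root.iter(tei("hi")))


# Made-up sample speech with one stage direction and one emphasis.
SAMPLE = f"""<sp xmlns="{TEI_NS}"><speaker>CATO.</speaker>
<p><hi>steht auf.</hi> Das ist <hi>nicht</hi> wahr!</p></sp>"""

root = ET.fromstring(SAMPLE)
retag_hi(root, ["stage", "emph"])
print(remaining_hi(root))
```

A count of zero across the whole corpus is the kind of check that tells you every <hi> has been handled.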
Speaker IDs: This could be described as our unique selling point. We checked all speaker IDs and read and re-read plays or parts of plays to attach the right speaker ID to every speech. This was the most time-consuming part, but it also gave us high-quality network data (see, for example, Grabbe's Napoleon). After correcting all IDs, we also added gender info to the <person> elements listed in <particDesc> (FEMALE, MALE, UNKNOWN). If a speaking entity consisted of at least two characters and could not be broken down further, we marked it up as
In many cases, speaker names were not tagged as speakers. This has consequences for quantitative studies, since the words uttered by characters are not properly assigned. This commit shows some examples of this fix. The majority of plays were affected by this bug.
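A consistency check for speaker IDs might look like the sketch below. It is an illustration under assumptions, not our actual tooling: it assumes that speeches are <sp> elements carrying a `who` attribute of the form `#id`, and that every valid ID is declared via `xml:id` somewhere inside <particDesc>. The function name `check_speaker_ids` and the sample document are made up.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def check_speaker_ids(root):
    """Return (speeches without @who, sorted list of undeclared references)."""
    # Every xml:id declared anywhere inside <particDesc> counts as valid.
    declared = set()
    for partic in root.iter(f"{{{TEI_NS}}}particDesc"):
        for element in partic.iter():
            if element.get(XML_ID):
                declared.add(element.get(XML_ID))

    missing_who = 0
    unknown = set()
    for sp in root.iter(f"{{{TEI_NS}}}sp"):
        who = sp.get("who")
        if not who:
            missing_who += 1
            continue
        # @who may hold several space-separated references.
        for ref in who.split():
            if ref.lstrip("#") not in declared:
                unknown.add(ref)
    return missing_who, sorted(unknown)


# Made-up sample: one valid speech, one undeclared ID, one speech without @who.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
<teiHeader><profileDesc><particDesc><listPerson>
<person xml:id="cato"><persName>Cato</persName></person>
</listPerson></particDesc></profileDesc></teiHeader>
<text><body>
<sp who="#cato"><speaker>CATO.</speaker><p>...</p></sp>
<sp who="#porcia"><speaker>PORCIA.</speaker><p>...</p></sp>
<sp><speaker>PHOCAS.</speaker><p>...</p></sp>
</body></text></TEI>"""

root = ET.fromstring(SAMPLE)
result = check_speaker_ids(root)
print(result)
```

Such a check only finds missing or dangling references. Whether a speech is attributed to the right character still requires reading the play.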
We rejoined words that had been split across two lines.
We converted footnotes à la
Other minor things were corrected during the process. Every change is documented in GitHub's version history.
- Embedded figures are still referencing images on TextGridRep.
- For verse lines shared between two or more characters (i.e., if a verse is shared), <l> could be enhanced to
- Page beginnings (<pb/>) in zeno.org were usually shifted by one, and some page numbers are missing. We are not in a position to fix this consistently at the moment, because we don't have the original scans used by zeno.org.
This overview was last updated on May 21, 2019.