LLpro – A Literary Language Processing Pipeline for German Narrative Texts

An NLP Pipeline for German literary texts implemented in Python and Spacy (v3.5.2). Work in progress.

This pipeline implements several custom pipeline components using the Spacy API. Currently the components perform

Tokenization and Sentence Splitting via SoMaJo (Proisl, Uhrig 2016). Version 2.4.
POS tagging via SoMeWeTa (Proisl 2018). Version 1.8.1.
Lemmatization and Morphological Analysis via RNNTagger (Schmid 2019). Version 1.4.1.
Dependency Parsing via ParZu (Sennrich, Schneider, Volk, Warin 2009; Sennrich, Volk, Schneider 2013; Sennrich, Kunz 2014). Commit a15ae7f.
Named Entity Recognition via FLERT (Schweter, Akbik 2021). Version 0.12.2.
Recognition of References to literary Characters (proper nouns and common nouns, i.e. “Appelative”, cf. Krug et al., 2017) via a custom fine-tuned FLERT model aehrm/droc-character-recognizer.
Tagging of German speech, thought and writing representation (STWR) via custom fine-tuned BERT embeddings, inspired by Brunner, Tu, Weimer, Jannidis (2020); models aehrm/redewiedergabe-direct, ....
Segmentation into Scenes via BERT Embeddings via a custom fine-tuned re-implementation of a model by Kurfalı and Wirén (2021); model aehrm/stss-scene-segmenter.
Coreference Resolution via BERT Embeddings (Schröder, Hatzel, Biemann 2021). Commit f34a99e.
Annotating Event Types to verbal phrases via BERT Embeddings (Vauth, Hatzel, Gius, Biemann 2021) Version 0.2, Commit 25fdf7e.

See also the section about the Output Format for a description of the tabular output format.

Usage

usage: bin/llpro_cli.py [-h] [-v] [--no-normalize-tokens] [--tokenized]
                        [--sentencized] [--paragraph-pattern PAT]
                        [--section-pattern PAT] [--stdout | --writefiles DIR]
                        --infiles FILE [FILE ...]

NLP Pipeline for literary texts written in German.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose
  --no-normalize-tokens
                        Do not normalize tokens.
  --tokenized           Skip tokenization, and assume that tokens are
                        separated by whitespace.
  --sentencized         Skip sentence splitting, and assume that sentences are
                        separated by newline characters.
  --paragraph-pattern PAT
                        Optional paragraph separator pattern. Paragraph
                        separators are removed, and sentences always terminate
                        on paragraph boundaries. Performed before
                        tokenization/sentence splitting.
  --section-pattern PAT
                        Optional sectioning paragraph pattern. Paragraphs
                        fully matching the pattern are removed. Performed
                        before tokenization/sentence splitting.
  --stdout              Write all processed tokens to stdout.
  --writefiles DIR      For each input file, write processed tokens to a
                        separate file in DIR.
  --infiles FILE [FILE ...]
                        Input files, or directories.

Note: you can specify the resources directory (containing ParZu etc.) with the environment variable LLPRO_RESOURCES_ROOT, and the temporary workdir with the environment variable LLPRO_TEMPDIR.

Installation

The LLpro pipeline can be run either locally or as a Docker container. Running the pipeline using Docker is strongly recommended.

WINDOWS USERS: For building the Docker image, clone using

git clone https://github.com/aehrm/LLpro --config core.autocrlf=input

to preserve line endings.

Building and running the Docker image

We strongly recommend using Docker to run the pipeline. With the provided Dockerfile, all dependencies and prerequisites are downloaded automatically.

cd LLpro
docker build --tag cophiwue/llpro .
# or, if you want experimental features enabled
# docker build --build-arg LLPRO_EXPERIMENTAL=1 --tag cophiwue/llpro-experimental .

After building, the Docker image can be run like this:

mkdir -p files/in files/out
chmod a+w files/out  # make directory writeable from the Docker container
# copy files into ./files/in to be processed
docker run \
    --rm \
    -e OMP_NUM_THREADS=4 \
    --gpus all \    # alternatively, e.g., --gpus "device=0"
    --interactive \
    --tty \
    -a stdout \
    -a stderr \
    -v "$(pwd)/files:/files" \
    cophiwue/llpro -v --writefiles /files/out --infiles /files/in
# processed files are located in ./files/out

Installing locally

Verify that the following dependencies are installed:

Python (tested on version 3.7)
For RNNTagger
- CUDA (tested on version 11.4)
For Parzu:
- SWI-Prolog >= 5.6
- SFST >= 1.4

Execute poetry install and ./prepare.sh. The script downloads all remaining prerequisites. Example usage:

poetry install
./prepare.sh
# NOTICE: use the prepared poetry venv!
poetry run python ./bin/llpro_cli.py -v --writefiles files/out files/in

# if desired, run tests
poetry run pytest -vv

Developer Guide

See the separate Developer Guide about the implemented Spacy components and how to access the assigned attributes.

See also the separate document about the tabular Output Format for a description of the output format and a reference of the used tagsets.

See the folder ./contrib for scripts to reproduce the fine-tuning of the custom models.

Citing

If you use the LLpro software for academic research, please consider citing the accompanying publication:

Ehrmanntraut, Anton, Leonard Konle, and Fotis Jannidis. 2023. „LLpro: A Literary Language Processing Pipeline for German Narrative Text.“ In Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023). Ingolstadt, Germany: KONVENS 2023 Organizers. To be published.

License

In accordance with the license terms of ParZu+Zmorge (GPL v2), and of SoMeWeTa (GPL v3) the LLpro pipeline is licensed under the terms of GPL v3. See LICENSE.

NOTICE: The code of the ParZu parser located in resources/ParZu has been modified to be compatible with LLpro. See git log -p df1e91a.. -- resources/ParZu for a summary of these changes.

NOTICE: Some subsystems and resources used by the LLpro pipeline have additional license terms:

RNNTagger: see https://www.cis.uni-muenchen.de/~schmid/tools/RNNTagger/Tagger-Licence
SoMeWeTa model german_web_social_media_2020-05-28.model: derived from the TIGER corpus; see https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/license/htmlicense.html

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

References

Akbik, Alan, Duncan Blythe, and Roland Vollgraf. 2018. “Contextual String Embeddings for Sequence Labeling.” In COLING 2018, 27th International Conference on Computational Linguistics, 1638–49.

Brunner, Annelen, Ngoc Duyen Tanja Tu, Lukas Weimer, and Fotis Jannidis. 2021. “To BERT or Not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of Four Types of Speech, Thought and Writing Representation.” In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2624:11. CEUR Workshop Proceedings. Zurich, Switzerland. http://ceur-ws.org/Vol-2624/paper5.pdf.

Krug, Markus, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, and Fotis Jannidis. 2017. “Description of a Corpus of Character References in German Novels - DROC [Deutsches ROman Corpus].” https://resolver.sub.uni-goettingen.de/purl?gro-2/108301.

Kurfalı, Murathan, and Mats Wirén. 2021. “Breaking the Narrative: Scene Segmentation Through Sequential Sentence Classification.” In Proceedings of the Shared Task on Scene Segmentation, edited by Albin Zehe, Leonard Konle, Lea Dümpelmann, Evelyn Gius, Svenja Guhr, Andreas Hotho, Fotis Jannidis, et al., 3001:49–53. CEUR Workshop Proceedings. Düsseldorf, Germany. http://ceur-ws.org/Vol-3001/#paper6.

Proisl, Thomas. 2018. “SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 665–70. Miyazaki, Japan: European Language Resources Association ELRA. http://www.lrec-conf.org/proceedings/lrec2018/pdf/49.pdf.

Proisl, Thomas, and Peter Uhrig. 2016. “SoMaJo: State-of-the-Art Tokenization for German Web and Social Media Texts.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, 57–62. Berlin, Germany: Association for Computational Linguistics (ACL). http://aclweb.org/anthology/W16-2607.

———. 2019. “Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts.” In DATeCH, Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 133–37. Brussels, Belgium: Association for Computing Machinery. https://www.cis.uni-muenchen.de/~schmid/papers/Datech2019.pdf.

Schröder, Fynn, Hans Ole Hatzel, and Chris Biemann. 2021. “Neural End-to-End Coreference Resolution for German in Different Domains.” In Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), 170–81. Düsseldorf, Germany: KONVENS 2021 Organizers. https://aclanthology.org/2021.konvens-1.15.

Schweter, Stefan, and Alan Akbik. 2021. “FLERT: Document-Level Features for Named Entity Recognition.” arXiv:2011.06993 [Cs], May. http://arxiv.org/abs/2011.06993.

Sennrich, Rico, and Beat Kunz. 2014. “Zmorge: A German Morphological Lexicon Extracted from Wiktionary.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1063–7. Reykjavik, Iceland: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf.

Sennrich, Rico, G. Schneider, M. Volk, M. Warin, C. Chiarcos, Richard Eckart de Castilho, and Manfred Stede. 2009. “A New Hybrid Dependency Parser for German.” In Proceedings of the GSCL Conference. Potsdam, Germany. https://doi.org/10.5167/UZH-25506.

Sennrich, Rico, Martin Volk, and Gerold Schneider. 2013. “Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-Tagging, and Morphological Analysis.” In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, 601–9. Hissar, Bulgaria: INCOMA Ltd. Shoumen, BULGARIA. https://www.aclweb.org/anthology/R13-1079.

Vauth, Michael, Hans Ole Hatzel, Evelyn Gius, and Chris Biemann. 2021. “Automated Event Annotation in Literary Texts.” In Proceedings of the Conference on Computational Humanities Research 2021, edited by Maud Ehrmann, Folgert Karsdorp, Melvin Wevers, Tara Lee Andrews, Manuel Burghardt, Mike Kestemont, Enrique Manjavacas, Michael Piotrowski, and Joris van Zundert, 2989:333–45. CEUR Workshop Proceedings. Amsterdam, the Netherlands. https://ceur-ws.org/Vol-2989/#short_paper18.

Name		Name	Last commit message	Last commit date
Latest commit History 176 Commits
bin		bin
contrib		contrib
doc		doc
llpro		llpro
resources		resources
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
CITATION.cff		CITATION.cff
Dockerfile		Dockerfile
LICENSE.md		LICENSE.md
README.md		README.md
build.py		build.py
poetry.lock		poetry.lock
prepare.sh		prepare.sh
pyproject.toml		pyproject.toml

License

cophi-wue/LLpro

Folders and files

Latest commit

History

Repository files navigation

LLpro – A Literary Language Processing Pipeline for German Narrative Texts

Usage

Installation

Building and running the Docker image

Installing locally

Developer Guide

Citing

License

References

About

Resources

License

Stars

Watchers

Forks

Languages