lexica-corpus

Files and a script for the lexica corpus for German text simplification

Updates:

March 2022: total of 3270 files (1090 at each level)

August 2021: total of 2934 files

About

The corpus consists of texts from three wiki-based lexica in German: MiniKlexikon, Klexikon and Wikipedia. The articles in these wikis are created by volunteers and can be written, discussed and improved collaboratively. Klexikon is aimed specifically at children aged 6 to 12. MiniKlexikon is designed for children who are beginner readers and is therefore an even simpler version of Klexikon. We assume that the three sub-corpora represent three different levels of conceptual complexity because of the target groups they are written for: younger children, children and adults. As Wikipedia articles can be extremely long compared with those in the other two lexica, only the introduction or abstract of each Wikipedia article was taken for this corpus.

This repository contains the corpora from the original study (295 texts per sub-corpus, in the orig_files folder), extended versions with ca. 1000 texts per sub-corpus as of August 2021 (the miniklexi_corpus.txt, klexi_corpus.txt and wiki_corpus.txt files in this folder), and a script to update the extended versions as new articles are added to Klexikon and MiniKlexikon.

Note on the format

The sub-corpora mark the end of each paragraph with a symbol: <eop> in MiniKlexikon and Klexikon, and * in Wikipedia.
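
As a minimal sketch of how these markers could be used, the snippet below splits text into paragraphs on a given marker. The helper name split_paragraphs and the sample strings are invented for illustration and are not part of the repository. Note that splitting on * would also break at any literal asterisk inside a Wikipedia text.

```python
def split_paragraphs(text: str, marker: str) -> list[str]:
    """Split corpus text on an end-of-paragraph marker and drop empty pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]


# Markers as described in the note on the format.
KLEXIKON_MARKER = "<eop>"   # used by MiniKlexikon and Klexikon
WIKIPEDIA_MARKER = "*"      # used by Wikipedia

# Invented sample strings (assumption: markers follow each paragraph).
klexi_sample = "Ein Hund ist ein Tier. <eop> Hunde bellen. <eop>"
wiki_sample = "Der Hund ist ein Haustier. * Er stammt vom Wolf ab. *"

klexi_paragraphs = split_paragraphs(klexi_sample, KLEXIKON_MARKER)
wiki_paragraphs = split_paragraphs(wiki_sample, WIKIPEDIA_MARKER)

print(klexi_paragraphs)
print(wiki_paragraphs)
```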

Statistics for the original (smaller) corpora

Sub-corpus	Avg. article length	Avg. sentence length
MiniKlexikon	134.86	9.57
Klexikon	305.45	13.29
Wikipedia	169.89	18.41
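
The README does not state the unit or the tokenisation behind these figures. Purely as an illustration, and assuming lengths are counted in whitespace-separated tokens and sentences end at ., ! or ?, comparable averages could be approximated as follows. The function name average_lengths and the sample texts are hypothetical, and the numbers it yields will not necessarily match the table.

```python
import re


def average_lengths(articles):
    """Return (average tokens per article, average tokens per sentence)."""
    article_lengths = []
    sentence_lengths = []
    for article in articles:
        # Assumption: a token is any whitespace-separated chunk.
        article_lengths.append(len(article.split()))
        # Naive sentence split: break at whitespace that follows '.', '!' or '?'.
        for sentence in re.split(r"(?<=[.!?])\s+", article.strip()):
            if sentence:
                sentence_lengths.append(len(sentence.split()))
    avg_article = sum(article_lengths) / len(article_lengths)
    avg_sentence = sum(sentence_lengths) / len(sentence_lengths)
    return avg_article, avg_sentence


# Invented sample articles, not taken from the corpus.
sample = ["Ein Hund ist ein Tier. Hunde bellen.", "Katzen miauen."]
avg_article, avg_sentence = average_lengths(sample)
print(avg_article, avg_sentence)
```

A proper sentence splitter for German (abbreviations, ordinals such as "1. Januar") would give different sentence counts than this naive rule.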

How to use the script

Run build.sh to update the corpus with the default options (or build_conda.sh if you use Conda).

Alternatively, create your own environment from requirements.txt and call parse_lexica.py with the following options:

to build the corpus from scratch:

python parse_lexica.py --create_new_corpus

to check the Wikipedia disambiguations individually:

python parse_lexica.py --more_info

to change the file names of the sub-corpora:

--klexi_file --miniklexi_file --wiki_file
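
The README does not show the values these three options expect. Assuming each takes a file name as its value, a call with custom output names could be assembled as in the sketch below. The helper build_command and the file names are hypothetical.

```python
import sys


def build_command(klexi_file, miniklexi_file, wiki_file):
    """Assemble an argument list for parse_lexica.py with custom file names."""
    return [
        sys.executable, "parse_lexica.py",
        "--klexi_file", klexi_file,          # assumption: option takes a file name
        "--miniklexi_file", miniklexi_file,  # assumption: option takes a file name
        "--wiki_file", wiki_file,            # assumption: option takes a file name
    ]


# Hypothetical output file names.
command = build_command("my_klexi.txt", "my_miniklexi.txt", "my_wiki.txt")
print(" ".join(command))
```

To actually run the command from the repository folder, the list could be passed to subprocess.run(command, check=True).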

Contributors

Freya Hewett & Christopher Richter

License

The Klexikon and MiniKlexikon files are licensed under CC BY-SA 4.0

The Wikipedia files are licensed under CC BY-SA 3.0

Citation

Hewett, F., & Stede, M. (2021). Automatically evaluating the conceptual complexity of German texts. Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), 228–234.
