GSOC 2019 - Development of a Greek open source Morphological dictionary and application of it to Greek spelling tools
- An SQL database containing the following data
- A morphological dictionary containing about 518.000 distinct surface forms with information described according to Universal Dependencies.
- Definitions for most lemmas
- Etymologies for most lemmas
12500 of which are for Greek
4300 of which are for Greek
3310 normalizations of words
- A spelling dictionary with 1.047.200 words, up from the 828.807 of the previous dictionary used in open source programs. The dictionary also includes frequencies for all words. It will be integrated into spelling dictionaries of Firefox and Thunderbird.
Documentation can be found in the data directory.
Running the script
Information about running the script is found here
You can find the final report in the following gist.
During the summer, a morphological dictionary in sqlite3 format will be created. Information will be extracted automatically by a Python script using the pymediawiki library. In addition, words and morphological information will be added to the spelling tool dictionaries.
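As a rough illustration of this plan, the sketch below stores a few surface forms in an sqlite3 database and looks one up. The table name `forms`, its columns, and the helper names are hypothetical and do not reflect the project's actual schema. The `fetch_wikitext` helper assumes pymediawiki's `MediaWiki(url=...)` constructor and the `wikitext` page property; it needs network access and is not called here.

```python
import sqlite3


def fetch_wikitext(title):
    """Fetch the raw wikitext of one Greek Wiktionary entry (needs network)."""
    # Imported lazily so the rest of the sketch runs without pymediawiki installed.
    from mediawiki import MediaWiki

    wiki = MediaWiki(url="https://el.wiktionary.org/w/api.php")
    return wiki.page(title).wikitext


def create_schema(conn):
    # Hypothetical layout: one row per surface form, features in UD notation.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS forms (
            form  TEXT NOT NULL,
            lemma TEXT NOT NULL,
            upos  TEXT NOT NULL,
            feats TEXT NOT NULL
        )
        """
    )


def add_forms(conn, rows):
    # rows: iterable of (form, lemma, upos, feats) tuples.
    conn.executemany("INSERT INTO forms VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def lookup(conn, form):
    # Return every analysis stored for the given surface form.
    cur = conn.execute(
        "SELECT lemma, upos, feats FROM forms WHERE form = ? ORDER BY feats",
        (form,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
add_forms(conn, [
    ("λόγος", "λόγος", "NOUN", "Case=Nom|Gender=Masc|Number=Sing"),
    ("λόγου", "λόγος", "NOUN", "Case=Gen|Gender=Masc|Number=Sing"),
    ("λόγοι", "λόγος", "NOUN", "Case=Nom|Gender=Masc|Number=Plur"),
])
print(lookup(conn, "λόγου"))
```

Keeping one row per surface form makes lookups by inflected word a single indexed query, at the cost of repeating the lemma in every row.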
Phase 1 (May 27 - Jun 28)
Creation of a parsing tool for Greek Wiktionary that parses nouns, adjectives, and verbs using Universal Dependencies POS tags.
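A minimal sketch of what such a parsing step might look like is shown below. It assumes that part-of-speech sections in Greek Wiktionary wikitext are introduced by headings of the form `==={{ουσιαστικό|el}}===`; the template names, the mapping table, and the function `extract_upos` are illustrative assumptions, not the project's actual parser. The UD tags `NOUN`, `ADJ`, and `VERB` are the standard Universal Dependencies POS tags.

```python
import re

# Assumed mapping from Greek Wiktionary heading templates to UD POS tags.
TEMPLATE_TO_UPOS = {
    "ουσιαστικό": "NOUN",
    "επίθετο": "ADJ",
    "ρήμα": "VERB",
}

# Matches headings such as "==={{ουσιαστικό|el}}===" on a line of their own.
HEADING_RE = re.compile(r"^=+\s*\{\{([^|}]+)\|el\}\}\s*=+\s*$", re.MULTILINE)


def extract_upos(wikitext):
    """Return the UD POS tags found in an entry, in order of first appearance."""
    tags = []
    for match in HEADING_RE.finditer(wikitext):
        upos = TEMPLATE_TO_UPOS.get(match.group(1).strip())
        if upos and upos not in tags:
            tags.append(upos)
    return tags


sample = """=={{-el-}}==
==={{ουσιαστικό|el}}===
'''καλό''' ''ουδέτερο''
==={{επίθετο|el}}===
'''καλό'''
"""
print(extract_upos(sample))  # ['NOUN', 'ADJ']
```

Headings whose template is not in the mapping, such as the language heading on the first line of the sample, are skipped.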
Phase 2 (Jun 29 - Jul 26)
Addition of the remaining parts of speech to the morphological dictionary, and addition of further information tags, such as toponyms and terminology, extracted from page categories.
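One way to derive such tags from page categories is sketched below. The category names, the tag labels, and the function `tags_from_categories` are hypothetical and chosen only for illustration; the real category names on Greek Wiktionary may differ.

```python
# Hypothetical mapping from category-name prefixes to information tags.
CATEGORY_TAGS = {
    "Τοπωνύμια": "toponym",
    "Ορολογία": "terminology",
}


def tags_from_categories(categories):
    """Map a page's category names to a sorted list of information tags."""
    tags = set()
    for category in categories:
        # Drop a namespace prefix such as "Κατηγορία:" if one is present.
        name = category.split(":", 1)[-1].strip()
        for prefix, tag in CATEGORY_TAGS.items():
            if name.startswith(prefix):
                tags.add(tag)
    return sorted(tags)


page_categories = [
    "Κατηγορία:Τοπωνύμια (νέα ελληνικά)",
    "Ουσιαστικά (νέα ελληνικά)",
]
print(tags_from_categories(page_categories))  # ['toponym']
```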
Phase 3 (Jul 27 - Aug 26)
Addition of extracted surface forms to Greek spelling dictionaries, including words from reliable sources like European Parliament translations.
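The merging step could look roughly like the sketch below, which combines an existing word list with newly extracted surface forms and renders them in the plain Hunspell `.dic` layout (a word count on the first line, then one word per line). It is a simplification: affix flags, word frequencies, and the accompanying `.aff` file are left out, and the function names are hypothetical.

```python
def merge_wordlists(*wordlists):
    """Merge several word lists, dropping blanks and duplicates."""
    merged = set()
    for words in wordlists:
        for word in words:
            word = word.strip()
            if word:
                merged.add(word)
    # Sorted by code point, which is enough for a stable output order.
    return sorted(merged)


def to_hunspell_dic(words):
    """Render words as .dic text: entry count first, then one word per line."""
    return "\n".join([str(len(words)), *words]) + "\n"


existing = ["λόγος", "και"]
extracted = ["λόγου", "λόγος", " "]

words = merge_wordlists(existing, extracted)
dic_text = to_hunspell_dic(words)
print(dic_text)
```

Writing `dic_text` to a file with UTF-8 encoding would give a word list that a matching `.aff` file can then declare the encoding for.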
- Google summer of code participant: Konstantinos Agiannis
- Mentor: Kostas Papadimas
- Mentor: Theodoros Karounos
- Mentor: Alexios Zavras
The source code is under GPLv3.
The produced database with the morphological dictionary is under CC BY-SA 3.0.