This script parses the Finnish and Hungarian Wiktionaries, and extracts bilingual word pairs from them. It also collects definitions and example sentences.
wiktionary_parser.py (wordpairs|definitions|examples|all) --lang=<wikicode> [--input=<file>] [--output=<path>]
The script supports 4 actions:
wordpairs
: extracting Finnish-Hungarian word pairs from the given Wiktionary edition.

definitions
: extracting definitions for words in the language of the Wiktionary edition (e.g. if the Wiktionary edition is Finnish, the collected definitions are also in Finnish).

examples
: extracting example sentences for words in the language of the Wiktionary edition.

all
: getting the output of all three actions described above.
Options:
lang
: the language of the Wiktionary from which the data should be extracted. The script supports only hu and fi.

input
: the Wiktionary dump to use. If not given, the script first looks for the .xml dump file in the data/ directory; if none is found, it downloads the latest Wiktionary dump for the given language.

output
: if given, the output is saved to this directory. The default output path is output/.
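The lookup order described for the input option (explicit file, then the data/ directory, then a download) can be sketched as follows. This is only an illustration, not the script's actual code: the function name resolve_dump, the file-name pattern used to search data/, and the download URL (which follows the public Wikimedia dumps layout) are all assumptions.

```python
from pathlib import Path

SUPPORTED_LANGS = {"fi", "hu"}  # the only wikicodes the script supports


def resolve_dump(lang, input_path=None, data_dir="data"):
    """Return ("local", path) or ("download", url) for the dump to use.

    Hypothetical helper: it only decides where the dump comes from,
    it does not download anything itself.
    """
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang!r}")

    # 1. An explicitly given --input file always wins.
    if input_path is not None:
        return "local", Path(input_path)

    # 2. Otherwise look for an .xml dump in the data directory.
    #    The file-name pattern is an assumption for this sketch.
    data = Path(data_dir)
    if data.is_dir():
        candidates = sorted(data.glob(f"{lang}wiktionary*.xml"))
        if candidates:
            return "local", candidates[-1]

    # 3. Fall back to the latest dump (assumed URL layout).
    url = (
        f"https://dumps.wikimedia.org/{lang}wiktionary/latest/"
        f"{lang}wiktionary-latest-pages-articles.xml.bz2"
    )
    return "download", url
```

Returning the decision instead of downloading right away keeps the lookup easy to test; the caller can then fetch and decompress the file only in the "download" case.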
Output is saved to the path given with the --output option. If the option is omitted, output is saved to the output/ folder.
The bilingual dictionary is saved as wordpairs_<wikicode>.tsv and has three values separated by tabs:
FIN_WORD <tab> HUN_WORD <tab> UD_POS_TAG
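As a minimal sketch of how the word pair file could be consumed, the snippet below reads the three tab-separated columns into a dictionary keyed by the Finnish word. The function name load_wordpairs is hypothetical, and the sketch assumes the file is UTF-8 encoded and has no header row, neither of which is stated above.

```python
import csv
from collections import defaultdict


def load_wordpairs(path):
    """Map each Finnish word to a list of (Hungarian word, UD POS tag) pairs."""
    pairs = defaultdict(list)
    # Assumption: UTF-8 encoding and no header row.
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 3:
                continue  # skip blank or malformed lines
            fin_word, hun_word, pos_tag = row
            pairs[fin_word].append((hun_word, pos_tag))
    return dict(pairs)
```

A list is kept per Finnish word because one word may have several Hungarian equivalents or several parts of speech.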
The definitions and example sentences are saved similarly, as definitions_<wikicode>.tsv and examples_<wikicode>.tsv respectively, with three tab-separated values per line:
UD_POS_TAG <tab> WORD <tab> SENTENCE
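The definition and example files share this column order, so one reader can serve both. The sketch below groups the sentences by word and POS tag; the function name load_sentences is hypothetical, and UTF-8 encoding with no header row is assumed.

```python
from collections import defaultdict


def load_sentences(path):
    """Group definitions or example sentences by (word, UD POS tag)."""
    entries = defaultdict(list)
    # Assumption: UTF-8 encoding and no header row.
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on the first two tabs only, so a tab inside the
            # sentence itself does not break the row apart.
            parts = line.rstrip("\n").split("\t", 2)
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            pos_tag, word, sentence = parts
            entries[(word, pos_tag)].append(sentence)
    return dict(entries)
```

Keying on both the word and its POS tag keeps, for instance, the noun and verb senses of the same spelling apart.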
This work is licensed under the GNU AGPL v3.0 License.