Wiktionary Parser

Description

This script parses the Finnish and Hungarian Wiktionaries, and extracts bilingual word pairs from them. It also collects definitions and example sentences.

Usage

wiktionary_parser.py (wordpairs|definitions|examples|all) --lang=<wikicode> [--input=<file>] [--output=<path>]

The script supports 4 actions:

wordpairs: extracting Finnish-Hungarian word pairs from the given Wiktionary edition.
definitions: extracting definitions for words in the language of the Wiktionary edition (e.g. if the Wiktionary edition is Finnish, the collected definitions are also in Finnish).
examples: extracting example sentences for words in the language of the Wiktionary edition.
all: getting the output of all three actions described above.

Options:

lang: the language of the Wiktionary from which the data should be extracted. The script supports only hu and fi.
input: the Wiktionary dump which is to be used. If not given, the script first looks for the .xml dump file in the data/ directory, if not found, it downloads the latest Wiktionary dump for the given language.
output: if given, the output is saved to this directory. Default output path is output/.

Output

Output is saved to the path given with the --output option. If not given,the default output path is the output/ folder.

The bilingual dictionary is saved as wordpairs_<wikicode>.tsv and has three values separated by tabs:

FIN_WORD    <tab>   HUN_WORD    <tab>   UD_POS_TAG

The definitions and example sentences are saved similarly, as definitions_<wikicode>.tsv and examples_<wikicode>.tsv respectively.

UD_POS_TAG  <tab>  WORD <tab>   SENTENCE

License

This work is licensed under the GNU AGPL v3.0 License.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
lang_hun.tsv		lang_hun.tsv
wiktionary_parser.py		wiktionary_parser.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

lang_hun.tsv

lang_hun.tsv