For foreign editions of Wiktionary, extract derivations on each page (if they exist).
The input is a .zim file that contains all pages of a given edition of Wiktionary.
- Wiktionary dumps in .zim format can be obtained from Kiwix.
For now, we are dealing with individual .html pages from the editions we're working on, for example gène in the French edition.
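As a rough illustration of how a single page such as gène relates to an edition, the sketch below builds a page URL from an edition code and a headword, and recovers the edition code from a URL. The function names page_url and edition_from_url are hypothetical and are not taken from parser.py; the sketch assumes the usual https://EDITION.wiktionary.org/wiki/TITLE address pattern.

```python
from urllib.parse import quote, urlparse


def page_url(edition, headword):
    """Build the URL of a headword's page in a given Wiktionary edition.

    Hypothetical helper: assumes the https://EDITION.wiktionary.org/wiki/TITLE pattern.
    """
    # Non-ASCII characters in the title are percent-encoded as UTF-8.
    return "https://{}.wiktionary.org/wiki/{}".format(edition, quote(headword))


def edition_from_url(url):
    """Return the edition code, taken as the first label of the host name."""
    host = urlparse(url).hostname or ""
    return host.split(".")[0]


if __name__ == "__main__":
    url = page_url("fr", "gène")
    print(url)
    print(edition_from_url(url))
```

Deriving the edition from the host name is only one possible default; an explicit setting should still take precedence where the page is read from a local file and no URL is available.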
Each derivation tuple consists of these fields (UNFINISHED):
- edition: the edition of Wiktionary the derivation is from. It is a 2-3 letter code used in the Wiktionary URL.
- headword: the word that is being derived.
- lang: the language of the headword. It might be different from the language of the edition.
- derivation: the derivation of the headword.
- pos: the part of speech of the headword in lang.
The output is a .csv with these columns.
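A minimal sketch of how such tuples could be represented and written out, assuming the five fields listed above become the CSV header in that order. The names Derivation and write_csv are hypothetical, and the sample row holds made-up illustrative values, not real parser output.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type mirroring the fields described above.
Derivation = namedtuple(
    "Derivation", ["edition", "headword", "lang", "derivation", "pos"]
)


def write_csv(rows, out):
    """Write a header line followed by one line per derivation tuple."""
    writer = csv.writer(out)
    writer.writerow(Derivation._fields)
    writer.writerows(rows)


if __name__ == "__main__":
    # Illustrative values only; the real field contents are still UNFINISHED.
    rows = [Derivation("fr", "gène", "fr", "génétique", "nom")]
    buf = io.StringIO()
    write_csv(rows, buf)
    print(buf.getvalue())
```

When writing to a real file instead of an in-memory buffer, the csv module documentation recommends opening it with newline="" so that line endings are not translated twice.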
- beautifulsoup4: used for parsing HTML.
- requests: used to make HTTP calls and fetch .html pages from the Internet. Will eventually be removed, as we will be dealing with locally stored files.
- pycountry and iso-639: used for conversion between language codes.
Install in a virtualenv as appropriate.
To install all dependencies:
$ pip install -r requirements.txt
To install one by one, use pip install [PACKAGE NAME].
To run with a .zim file:
$ python parser.py -z [zimfile]
- Support for .zim files has only been tested with Python 3.5. It probably does not work with Python 2 at the moment.
Detailed usage:
usage: parser.py [-h] [--zim ZIM] [--edition EDITION]

optional arguments:
  -h, --help            show this help message and exit
  --zim ZIM, -z ZIM     use zim file instead of html
  --edition EDITION, -e EDITION
                        explicitly specify the language edition
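For reference, a command-line interface with this shape can be declared with argparse roughly as follows. This is a sketch reconstructed from the help text above, not the actual contents of parser.py; the function name build_arg_parser and the file name wiktionary_fr.zim are hypothetical.

```python
import argparse


def build_arg_parser():
    """Declare the two optional flags shown in the usage text."""
    parser = argparse.ArgumentParser(prog="parser.py")
    # Long form first, so the values are stored as args.zim and args.edition.
    parser.add_argument("--zim", "-z", help="use zim file instead of html")
    parser.add_argument(
        "--edition", "-e", help="explicitly specify the language edition"
    )
    return parser


if __name__ == "__main__":
    # Hypothetical invocation; the .zim file name is a placeholder.
    args = build_arg_parser().parse_args(["-z", "wiktionary_fr.zim", "-e", "fr"])
    print(args.zim, args.edition)
```

Both flags default to None when omitted, which lets the caller fall back to .html input or to an edition inferred by other means.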
Things that are common across the different editions are listed in common.md.
- Write parsers for two or three editions using the translation parsers.
- Run parsers on zim files (entire foreign editions of Wiktionary)
- Generalize them and create a skeleton for writing other parsers.
- Make it so that only minimal changes are needed to parse another edition.
- Generate parsers for editions of interest.