Scripts to import the dictionary Lexique étymologique du breton moderne (Q19216625) by Victor Henry (Q1386172) from Wikisource into Wikidata's lexicographical data. The dictionary is written in French and covers the Breton language.
- PHP 7
- Python 3
Install the dependencies. Example on a Debian-like system:
apt install php python3 python3-pip
Download the project:
git clone "https://github.com/envlh/henry.git"
Install the Python requirements. Example of the command to use at the root of the project:
pip3 install -r requirements.txt
The bot uses Pywikibot. One way to log in to Wikidata is to use a bot password.
Download Pywikibot:
git clone "https://gerrit.wikimedia.org/r/pywikibot/core"
After creating your bot password, generate configuration files:
python3 pwb.py generate_user_files.py
Copy the generated files user-config.py and user-password.py to the root of the henry project.
Retrieves content from Wikisource, aggregates all pages into one file, and does some cleaning.
php -f crawler.php
Several files are generated:
- wikitext.txt: raw wikitext crawled from Wikisource (useful for debugging)
- stripped.txt: wikitext after cleaning
Parses the previously created file and converts it into a machine-readable format.
python3 parser.py
Several files are generated:
- lexemes.json: lexemes that will be imported into Wikidata, serialized in Wikibase JSON format
- lexemes.txt: a more human-readable list of the lexemes that will be imported
- errors.json: rejected lexemes, with the reason for the error
- monograms.json and bigrams.json: frequencies of letters in lemmas
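To illustrate what letter frequencies over lemmas can look like, here is a minimal sketch that counts single letters (monograms) and adjacent letter pairs (bigrams). It is not taken from parser.py: the function name `letter_frequencies`, the lowercasing step, and the sample lemmas are assumptions made for the example.

```python
from collections import Counter


def letter_frequencies(lemmas):
    """Count single letters and adjacent letter pairs across a list of lemmas.

    Hypothetical helper for illustration; the real parser may normalize
    or filter lemmas differently.
    """
    monograms = Counter()
    bigrams = Counter()
    for lemma in lemmas:
        word = lemma.lower()  # assumption: counts are case-insensitive
        monograms.update(word)  # one count per character
        bigrams.update(a + b for a, b in zip(word, word[1:]))  # adjacent pairs
    return monograms, bigrams


# Sample input chosen only for the example.
monograms, bigrams = letter_frequencies(["bara", "brav"])
print(monograms.most_common(3))
print(bigrams.most_common(3))
```

Unusually rare letters or pairs in such counts can point at transcription or parsing problems worth checking by hand.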
Imports the data into Wikidata's lexicographical data.
python3 bot.py
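For orientation, the sketch below builds a minimal lexeme in the general shape of Wikibase JSON, the serialization used for the lexemes to import. It is an assumption-laden illustration, not the structure that parser.py or bot.py necessarily produces: the helper `build_lexeme` is hypothetical, the item IDs (Q12107 for Breton, Q1084 for noun) and the language code `br` are assumed values for the example, and real entries would likely also carry claims, forms, and senses.

```python
import json


def build_lexeme(lemma, language_item, category_item, language_code="br"):
    """Return a minimal lexeme dictionary in Wikibase JSON shape.

    Hypothetical helper; only the top-level keys of a lexeme are shown.
    """
    return {
        "type": "lexeme",
        "language": language_item,          # item for the lexeme's language
        "lexicalCategory": category_item,   # item for the part of speech
        "lemmas": {
            language_code: {"language": language_code, "value": lemma},
        },
    }


# Assumed IDs: Q12107 (Breton), Q1084 (noun). Verify them before any real use.
lexeme = build_lexeme("bara", "Q12107", "Q1084")
print(json.dumps(lexeme, ensure_ascii=False, indent=2))
```

Inspecting lexemes.json or lexemes.txt before running the bot is a cheap way to confirm that entries look as expected.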
This project, mainly by Envel Le Hir (@envlh) for the code and Nicolas Vigneron (@belett) for the Wikisource transcription, is released under the CC0 license (public domain dedication).