Skip to content

envlh/henry

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Description

Scripts to import the dictionary Lexique étymologique du breton moderne (Q19216625) by Victor Henry (Q1386172) from Wikisource to Wikidata's lexicographical data. This dictionary is in French about the Breton language.

Dependencies

  • PHP 7
  • Python 3

Installation

Install the dependencies. Example on a Debian-like system:

apt install php python3 python3-pip

Download the project:

git clone "https://github.com/envlh/henry.git"

Install the Python requirements. Example of the command to use at the root of the project:

pip3 install -r requirements.txt

Configuration

The bot uses Pywikibot. A way to login to Wikidata is to use a bot password.

Download Pywikibot:

git clone "https://gerrit.wikimedia.org/r/pywikibot/core"

After creating your bot password, generate configuration files:

python3 pwb.py generate_user_files.py

Copy generated files user-config.py and user-password.py at the root of the henry project.

Usage

Crawler

Retrieves content from Wikisource, aggregates all pages in one file, and does some cleaning.

php -f crawler.php

Several files are generated:

  • wikitext.txt: raw wikitext crawled from Wikisource (useful for debug)
  • stripped.txt: wikitext after cleaning

Parser

Parses previously created file and converts it into machine-readable format.

python3 parser.py

Several files are generated:

  • lexemes.json: lexemes that will be imported in Wikidata, serialized in Wikibase JSON format
  • lexemes.txt: more human-readable list of lexemes that will be imported
  • errors.json: rejected lexemes, with reason of error
  • monograms.json and bigrams.json: frequencies of letters in lemmas

Import

Imports the data in Wikidata's lexicographical data.

python3 bot.py

Copyright

This project, mainly by Envel Le Hir (@envlh) for the code and Nicolas Vigneron (@belett) for the Wikisource transcription, is under CC0 license (public domain dedication).

About

Scripts to import a Breton dictionary by Victor Henry from Wikisource to Wikidata's lexicographical data.

Topics

Resources

License

Stars

Watchers

Forks