WikiPron

WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronunciation dictionaries mined using this tool.

Command-line tool
Python API
Data
Models
Development

If you use WikiPron in your research, please cite the following:

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). Massively multilingual pronunciation mining with WikiPron. In In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228. [bibtex]

Command-line tool

Installation

WikiPron requires Python 3.6+. It is available through pip:

pip install wikipron

Usage

Quick Start

After installation, the terminal command wikipron will be available. As a basic example, the following command scrapes G2P data for French:

wikipron fra

Specifying the Language

The language is indicated by a three-letter ISO 639-2 or ISO 639-3 language code, e.g., fra for French. For which languages can be scraped, here is the complete list of languages on Wiktionary that have pronunciation entries.

Output

The scraped data is organized with each <word, pronunciation> pair on its own line, where the word and pronunciation are separated by a tab. Note that the pronunciation is in International Phonetic Alphabet (IPA), segmented by spaces that correctly handle the combining and modifier diacritics for modeling purposes, e.g., we have kʰ æ t with the aspirated k instead of k ʰ æ t.

For illustration, here is a snippet of French data scraped by WikiPron:

accrémentitielle    a k ʁ e m ɑ̃ t i t j ɛ l
accrescent  a k ʁ ɛ s ɑ̃
accrétion   a k ʁ e s j ɔ̃
accrétions  a k ʁ e s j ɔ̃

By default, the scraped data appears in the terminal. To save the data in a TSV file, please redirect the standard output to a filename of your choice:

wikipron fra > fra.tsv

Advanced Options

The wikipron terminal command has an array of options to configure your scraping run. For a full list of the options, please run wikipron -h.

Python API

The underlying module can also be used from Python. A standard workflow looks like:

import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...

Data

We also make available a database of 2.5 million word/pronunciation pairs mined using WikiPron.

Models

We host grapheme-to-phoneme models and modeling software in a separate repository.

Development

Repository

The source code of WikiPron is hosted on GitHub at https://github.com/kylebgorman/wikipron, where development also happens.

For the latest changes not yet released through pip or working on the codebase yourself, you may obtain the latest source code through GitHub and git:

Create a fork of the wikipron repo on your GitHub account.
Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc).

Download and install the library in the "editable" mode together with the core and dev dependencies within the virtual environment:

git clone https://github.com/<your-github-username>/wikipron.git
cd wikipron
pip install --upgrade pip setuptools
pip install -r requirements.txt
pip install --no-deps -e .

We keep track of notable changes in CHANGELOG.md.

Contribution

For questions, bug reports, and feature requests, please file an issue.

If you would like to contribute to the wikipron codebase, please see CONTRIBUTING.md.

License

WikiPron is released under an Apache 2.0 license. Please see LICENSE.txt for details.

Please note that Wiktionary data in the data/ directory has its own licensing terms.

Name		Name	Last commit message	Last commit date
Latest commit History 289 Commits
.circleci		.circleci
.github		.github
data		data
tests		tests
wikipron		wikipron
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE.txt		LICENSE.txt
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WikiPron

Command-line tool

Installation

Usage

Quick Start

Specifying the Language

Output

Advanced Options

Python API

Data

Models

Development

Repository

Contribution

License

About

Releases

Packages

Languages

License

ben-fernandes/wikipron

Folders and files

Latest commit

History

Repository files navigation

WikiPron

Command-line tool

Installation

Usage

Quick Start

Specifying the Language

Output

Advanced Options

Python API

Data

Models

Development

Repository

Contribution

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages