online cognacy identification
This repository accompanies the paper "Fast and unsupervised methods for multilingual cognate clustering" by Rama, Wahle, Sofroniev, and Jäger. It contains both the data and the source code used in the paper's experiments.
The code was developed with Python 3.5; later versions should also work as long as the dependencies are satisfied.
    # clone this repository
    git clone https://github.com/evolaemp/online_cognacy_ident
    cd online_cognacy_ident

    # you do not need to create a virtual environment if you know what you are
    # doing; remember that the code is written in python3
    virtualenv meta/venv
    source meta/venv/bin/activate

    # install the dependencies
    # it is important to use the versions specified in the requirements file
    pip install -r requirements.txt

    # check the unit tests
    python -m unittest discover online_cognacy_ident
If you run into difficulties, please make sure you have tried the setup in a fresh virtual environment before opening an issue.
Installing python-igraph on Windows via
pip install python-igraph may not work. Please use the appropriate Windows binaries from Christoph
    # activate the virtual env if it is not already
    source meta/venv/bin/activate

    # ensure the reproducibility of the results
    export PYTHONHASHSEED=42

    # use train.py to train models
    python train.py --help

    # use run.py to apply trained models on datasets
    python run.py --help

    # use eval.py to evaluate the algorithms' output
    python eval.py --help
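The PYTHONHASHSEED export above matters because Python randomizes string hashes for each interpreter process unless a seed is fixed, which can change, for example, the iteration order of sets of strings between runs. The following is a minimal sketch of that effect, not part of this repository; the helper name `string_hash` is hypothetical, and whether the scripts here depend on hash order in exactly this way is an assumption.

```python
import os
import subprocess
import sys


def string_hash(text, seed):
    """Return hash(text) as computed by a fresh interpreter
    started with PYTHONHASHSEED set to the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return int(result.stdout.decode().strip())


if __name__ == "__main__":
    # Two separate processes with the same seed agree on the hash.
    print(string_hash("cognate", 42) == string_hash("cognate", 42))
```

With the seed left unset, two separate processes will generally disagree on the hash of the same string, so any result that depends on hash order may not be repeatable.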
A dataset should be in CSV format. You can specify the CSV dialect using the
--dialect-input option, possible values are
If this is omitted, the script will try to guess the dialect by looking at the
A dataset should have a header with at least the following columns:
transcription. Column name
detection is case-insensitive. If there are two or more words tied to a single
gloss in a given doculect, all but the first are ignored.
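The two rules above (case-insensitive column detection, and keeping only the first word per gloss and doculect) can be sketched as follows. This is an illustration rather than the repository's own reader: the function name `read_words` and the default column names `doculect` and `gloss` are assumptions made for the example, and the `dialect` argument here refers to the dialect names of Python's standard `csv` module.

```python
import csv
import io


def read_words(lines, doculect_col="doculect", gloss_col="gloss",
               trans_col="transcription", dialect="excel"):
    """Read a dataset and return {(doculect, gloss): transcription}.

    The column names are hypothetical defaults; pass the ones
    your dataset actually uses.
    """
    reader = csv.reader(lines, dialect=dialect)

    # Lower-case the header so that column detection is case-insensitive.
    header = [name.strip().lower() for name in next(reader)]
    index = {name: header.index(name)
             for name in (doculect_col, gloss_col, trans_col)}

    words = {}
    for row in reader:
        key = (row[index[doculect_col]], row[index[gloss_col]])
        # setdefault keeps the first transcription seen for a key,
        # so later words for the same gloss and doculect are ignored.
        words.setdefault(key, row[index[trans_col]])
    return words


sample = (
    "Doculect,GLOSS,Transcription\n"
    "German,hand,hant\n"
    "German,hand,hent\n"
    "English,hand,hEnd\n"
)
words = read_words(io.StringIO(sample))
```

In this sample the second German entry for "hand" is dropped, and the mixed-case header is still recognized. A missing column raises a `ValueError` from `header.index`.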
The datasets used in the paper's experiments can be found in the
| Language family | Transcription | Source |
|---|---|---|
| Austronesian | ipa | Greenhill et al., 2008 |
| Sino-Tibetan | ipa | Běijīng Dàxué, 1964 |
| Trans-New Guinea | asjp | McElhanon, 1967 |
| Mixe-Zoque | asjp | Cysouw et al., 2006 |
Please note that you should use the
--ipa flag when running the algorithms on
any IPA-transcribed dataset, including the ones found in the
If you have fish shell installed, you could invoke
runs both algorithms on all the datasets, saves the results in the
and prints the evaluation scores to stdout.
The datasets are published under a Creative Commons Attribution-ShareAlike 4.0
International License. The source code is published under the MIT License.