This repository accompanies the paper *Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists* by Jäger, List and Sofroniev. The repository contains both the data and the source code used in the paper's experiment.
The following datasets are included:
| Language family | Words | Source |
|---|---|---|
| Austronesian | 12414 | Greenhill et al., 2008 |
| Turkic, Indo-European | 15903 | Manni et al., 2016 |
| Sino-Tibetan | 3632 | Běijīng Dàxué, 1964 |
| Trans-New Guinea | 1176 | McElhanon, 1967 |
| Mixe-Zoque | 961 | Cysouw et al., 2006 |
Each dataset is stored in a TSV file in which each row is a word; the columns are, in order:
1. The word's doculect.
2. The ISO 639-3 code of the word's doculect; can be empty.
3. The word's meaning as described in the dataset.
4. The Concepticon ID of the word's gloss.
5. The dataset's ID of the word's gloss.
6. The word's transcription in either IPA or ASJP.
7. The ID of the set of cognates the word belongs to.
8. The word's phonological segments, space-separated.
9. Field for additional information; can be empty.
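For illustration, such a TSV file can be loaded and grouped into cognate sets with pandas (one of the code's dependencies). The column names used below are assumptions for the sake of the example; the actual headers in the dataset files may differ.

```python
import io
import pandas as pd

# A two-row toy dataset standing in for a real TSV file.
# Column names here are illustrative, not the datasets' actual headers.
tsv = io.StringIO(
    "doculect\tiso_code\tgloss\tconcepticon_id\tlocal_id\t"
    "transcription\tcognate_class\ttokens\tnotes\n"
    "German\tdeu\thand\t1277\t21\thant\t1\th a n t\t\n"
    "English\teng\thand\t1277\t21\thænd\t1\th æ n d\t\n"
)

# keep_default_na=False keeps the empty notes field as "" rather than NaN.
df = pd.read_csv(tsv, sep="\t", keep_default_na=False)

# Words sharing a gloss and a cognate-class ID form one cognate set.
cognate_sets = df.groupby(["gloss", "cognate_class"])["transcription"].apply(list)
print(cognate_sets.loc[("hand", 1)])
```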
The datasets are published under a Creative Commons Attribution-ShareAlike 4.0 International License and can also be found on Zenodo.
The `data/vectors` directory contains the samples and targets (in the machine-learning sense) derived from the datasets, in CSV format. With the exception of `central_asian`, which is split into two files because its size exceeds 100 MB, there is a single vector file per dataset (note that the code will not split this file for you). Each row in these files comprises a pair of words from different languages but with the same meaning. The features are described in section 4.3 of the paper.
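A vector file can be consumed along the following lines. The feature columns shown are placeholders for the sake of illustration, not the paper's actual feature names:

```python
import io
import pandas as pd

# Toy stand-in for a vector file; real files carry the features from
# section 4.3 of the paper, and these column names are assumptions.
csv = io.StringIO(
    "gloss,lang1,lang2,word1,word2,feature_a,feature_b,target\n"
    "hand,German,English,hant,hænd,3.1,0.8,1\n"
    "hand,German,Turkish,hant,el,-1.2,0.1,0\n"
)

vectors = pd.read_csv(csv)

# Samples are the numeric features; the target says whether the
# pair of words belongs to the same cognate set.
X = vectors[["feature_a", "feature_b"]].to_numpy()
y = vectors["target"].to_numpy()
print(X.shape, y.tolist())
```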
The `data/inferred` directory contains the SVM-inferred cognate classes, one `.svmCC.csv` file per dataset. It also contains the cognate classes inferred using the LexStat algorithm, one `.lsCC.csv` file per dataset.
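Inferred cognate classes such as these are commonly evaluated against the gold-standard classes in the datasets using B-cubed precision and recall. The sketch below is only an illustration of the metric, not necessarily the paper's exact evaluation setup:

```python
def bcubed_scores(gold, inferred):
    """B-cubed precision and recall for a clustering, given two dicts
    mapping word IDs to class labels."""
    items = list(gold)
    precision = recall = 0.0
    for i in items:
        # All items sharing item i's inferred / gold class.
        pred_cluster = {j for j in items if inferred[j] == inferred[i]}
        gold_cluster = {j for j in items if gold[j] == gold[i]}
        overlap = len(pred_cluster & gold_cluster)
        precision += overlap / len(pred_cluster)
        recall += overlap / len(gold_cluster)
    n = len(items)
    return precision / n, recall / n

# Perfect agreement gives precision = recall = 1.0.
print(bcubed_scores({"w1": "A", "w2": "A"}, {"w1": "x", "w2": "x"}))
```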
The `data/params` directory contains the parameters used for inferring the PMI features of the aforementioned feature vectors. For more information, refer to the paper.
The `code` directory contains the source code used to run the study's experiment. It is written in Python 3 and has pandas among its direct dependencies. You should use `requirements.txt` to install the dependencies, as the code is only guaranteed to work with the versions specified there.
## setup and usage
```bash
# clone this repository
git clone https://github.com/evolaemp/svmcc

# you do not need to create a virtual environment if you know what you are
# doing; remember that the code is written in python3
virtualenv path/to/my/venv
source path/to/my/venv/bin/activate

# install the dependencies
# it is important to use the versions specified in the requirements file
pip install -r requirements.txt

# this ensures the reproducibility of the results
export PYTHONHASHSEED=0

# use manage.py to invoke the commands
python manage.py --help
```
`python manage.py prepare <dataset>` reads a dataset, generates its samples and targets, and writes a vector file ready for SVM consumption; `data/vectors` is the default output directory.
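The pairing step that `prepare` performs can be sketched as follows, assuming a toy wordlist of (doculect, gloss, transcription) triples; the real command additionally computes the feature vectors described in section 4.3 of the paper:

```python
from itertools import combinations

# Toy wordlist: (doculect, gloss, transcription) triples.
words = [
    ("German", "hand", "hant"),
    ("English", "hand", "hænd"),
    ("Turkish", "hand", "el"),
    ("German", "eye", "aʊɡə"),
]

def same_meaning_pairs(words):
    """All pairs of words that share a gloss but come from different
    doculects (a sketch of the sample-generation step)."""
    pairs = []
    for (d1, g1, t1), (d2, g2, t2) in combinations(words, 2):
        if g1 == g2 and d1 != d2:
            pairs.append((t1, t2))
    return pairs

print(same_meaning_pairs(words))
```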
`python manage.py infer --svmcc` reads a directory of vector files, runs SVM-based automatic cognate detection, and writes the inferred classes into an output directory; the default input and output directories are `data/vectors` and `data/inferred`, respectively.
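One way to turn the SVM's pairwise cognacy judgements into cognate classes is to take connected components over the pairs judged cognate. This is only a sketch of that final clustering step, not necessarily the algorithm the code actually uses:

```python
from collections import defaultdict

def pairs_to_classes(words, cognate_pairs):
    """Group words into cognate classes via union-find over the pairs
    judged cognate (a sketch; the code's clustering may differ)."""
    parent = {w: w for w in words}

    def find(w):
        # Walk to the root, halving the path as we go.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in cognate_pairs:
        parent[find(a)] = find(b)

    classes = defaultdict(list)
    for w in words:
        classes[find(w)].append(w)
    return sorted(classes.values())

print(pairs_to_classes(["hand_de", "hand_en", "el_tr"],
                       [("hand_de", "hand_en")]))
```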
`python manage.py test` runs the unit tests.
The source code (but not the data) is published under the MIT Licence (see the LICENSE file).