cognates-phonology

workspace elisp 2018

Phonological models for cognate terminology identification

The repository hosts resources for computing Levenshtein edit distance with phonological features for several alphabets (currently the alphabets based on Latin and Cyrillic; new sets will be added in future for Arabic, Georgian, Armenian and other scripts). The feature sets explicitly represent the linguistic phonological features of the compared characters, so the Levenshtein metric can use information about characters’ internal structure rather than treat them as atomic units of comparison. The proposed framework allows developers to test alternative configurations of phonological features and to select the feature arrangements that show greater improvements over the traditional character-based Levenshtein edit distance.
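To illustrate the idea, here is a minimal sketch of an edit distance whose substitution cost depends on shared phonological features. It is not the code from `src`: the `FEATURES` table, the function names, the Jaccard-style substitution cost and the choice to charge the same cost for insertions and deletions are all assumptions made for this example.

```python
# Hypothetical feature table (illustrative only, not the repository's tables):
# each character maps to a set of phonological features.
FEATURES = {
    "p": {"consonant", "labial", "plosive", "voiceless"},
    "b": {"consonant", "labial", "plosive", "voiced"},
    "t": {"consonant", "dental", "plosive", "voiceless"},
    "a": {"vowel", "open", "central"},
    "o": {"vowel", "mid", "back"},
}


def substitution_cost(c1, c2):
    """Return 0 for identical characters, else the share of unshared features."""
    if c1 == c2:
        return 0.0
    f1, f2 = FEATURES.get(c1), FEATURES.get(c2)
    if f1 is None or f2 is None:
        return 1.0  # characters without features fall back to the atomic cost
    return 1.0 - len(f1 & f2) / len(f1 | f2)


def feature_levenshtein(s1, s2, indel_cost=1.0):
    """Edit distance with feature-based substitutions.

    Assumption: insertions and deletions share one configurable cost.
    """
    prev = [j * indel_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * indel_cost]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + indel_cost,                     # delete c1
                curr[j - 1] + indel_cost,                 # insert c2
                prev[j - 1] + substitution_cost(c1, c2),  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

With these toy features, `feature_levenshtein("pat", "bat")` comes out at 0.4 instead of the 1 that a character-based distance would give, because `p` and `b` differ only in voicing. Characters missing from the table behave exactly as in the traditional metric.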

output format for the script:

sys.stdout.write('%(SW1)s, %(SW2)s, %(Lev0)d, %(Lev0Norm).4f, %(LevenshteinI2).4f, %(LevenshteinI2Norm).4f, %(LevenshteinI4).4f, %(LevenshteinI4Norm).4f, %(LevenshteinI6).4f, %(LevenshteinI6Norm).4f, %(LevenshteinI8).4f, %(LevenshteinI8Norm).4f, %(Lev1).4f, %(Lev1Norm).4f\n' % locals())

where:

SW1 = string word 1: source

SW2 = string word 2: target

Lev0 = baseline 'traditional' Levenshtein distance

Lev0Norm = baseline 'traditional' Levenshtein distance normalised for length of the compared words

LevenshteinI2 = phonological Levenshtein distance, with insertion cost 0.2 (instead of 1)

LevenshteinI2Norm = the same, normalised for length of the compared words

LevenshteinI4 = phonological Levenshtein distance, with insertion cost 0.4

LevenshteinI4Norm = the same, normalised for length

LevenshteinI6 = phonological Levenshtein distance, with insertion cost 0.6

LevenshteinI6Norm = the same, normalised for length

LevenshteinI8 = phonological Levenshtein distance, with insertion cost 0.8

LevenshteinI8Norm = the same, normalised for length (best performance on cognate identification from comparable corpora, using two large word lists)

Lev1 = phonological Levenshtein distance, with insertion cost 1

Lev1Norm = the same, normalised for length
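The fields above can be read back into named values with a few lines of Python. This is a sketch for downstream use, not part of the repository: the helper name `parse_output_line` is hypothetical, the sample line contains invented numbers, and the code assumes that neither word contains the separator `", "`.

```python
# Field names in the order used by the format string above.
FIELDS = [
    "SW1", "SW2", "Lev0", "Lev0Norm",
    "LevenshteinI2", "LevenshteinI2Norm",
    "LevenshteinI4", "LevenshteinI4Norm",
    "LevenshteinI6", "LevenshteinI6Norm",
    "LevenshteinI8", "LevenshteinI8Norm",
    "Lev1", "Lev1Norm",
]


def parse_output_line(line):
    """Split one output line into a dict keyed by field name (hypothetical helper)."""
    values = line.rstrip("\n").split(", ")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    record["Lev0"] = int(record["Lev0"])   # written with %d
    for name in FIELDS[3:]:
        record[name] = float(record[name])  # written with %.4f
    return record


# Invented sample line, for illustration only.
SAMPLE = ("night, nacht, 2, 0.4000, 1.2000, 0.2400, 1.4000, 0.2800, "
          "1.6000, 0.3200, 1.8000, 0.3600, 2.0000, 0.4000\n")
record = parse_output_line(SAMPLE)
```

Candidate pairs can then be ranked on a single column, for example by sorting records on `record["LevenshteinI8Norm"]`.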

Two lines are generated for each word pair from the input file, one for each phonological feature set specified in argv[6] (the list of phonological feature tables).

The first feature set is the hierarchical one (best performance); the second is a plain vector of features (over-generates on larger search spaces).
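The difference between the two kinds of feature set can be sketched as follows. This is only one possible reading of "hierarchical" versus "plain vector": the `HIER` table, its features and both cost functions are hypothetical and are not taken from the repository's feature tables. In this reading, the hierarchical comparison looks at a top-level class first and only compares lower-level features when the classes match, while the flat comparison pools all features together.

```python
# Hypothetical entries: (top-level class, class-specific features).
HIER = {
    "p": ("consonant", {"labial", "plosive", "voiceless"}),
    "b": ("consonant", {"labial", "plosive", "voiced"}),
    "o": ("vowel", {"labial", "mid", "back"}),
}


def overlap_cost(f1, f2):
    """Share of features that the two sets do not have in common."""
    return 1.0 - len(f1 & f2) / len(f1 | f2)


def flat_cost(c1, c2):
    """Plain vector: the class label is just one more feature in the pool."""
    class1, feats1 = HIER[c1]
    class2, feats2 = HIER[c2]
    return overlap_cost({class1} | feats1, {class2} | feats2)


def hierarchical_cost(c1, c2):
    """Hierarchical: different top-level classes are maximally distant."""
    class1, feats1 = HIER[c1]
    class2, feats2 = HIER[c2]
    if class1 != class2:
        return 1.0
    return overlap_cost(feats1, feats2)
```

Under these assumptions `flat_cost("b", "o")` is below 1 because both characters carry the feature `labial`, whereas `hierarchical_cost("b", "o")` is exactly 1. Partial credit of that kind between unrelated sounds is one way a flat vector could admit more false candidates as the search space grows.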

bogdanbabych/cognates-phonology
