# Lingpy - Polynesian example

This example follows the [lingpy](https://github.com/lingpy/lingpy-tutorial/blob/master/notebook.ipynb) tutorial. Several steps in that either did not work or caused problems due to version updates, so this has neen updated to work with the current version (**lingpy 2.6.13**).

The [lingpy documentation](https://lingpy.org/docu) is the only place I could find information about some of the features of the code, for example [Lexstat](https://lingpy.org/docu/compare/lexstat.html) provides some critical information about how the internal workings operate.

## Setup

Install any required dependencies and import packages

In [1]:
#!pip install lingpy
#!pip install re
#!pip install lingpy
#!pip install pandas

In [125]:
# !pip install lingpy
from lingpy import * # We're just importing everything from lingpy for simplicity
import pandas as pd
from lingpy.sequence.sound_classes import ipa2tokens
import re
import os

In [127]:
## We'll store data in a local 'data' directory
os.makedirs('data', exist_ok=True)

Directory 'data' created successfully


### Read data

You can download the [polynesian data](https://github.com/lingpy/lingpy-tutorial/raw/refs/heads/master/polynesian.tsv) directly and look at it. There is a separate [East polynesian dataset](https://github.com/lingpy/lingpy-tutorial/raw/refs/heads/master/east-polynesian.tsv) that the example was designed for.

Here we read it from the remote directly, but you can just change the link below to where you saved your data.

The data are sorted by 'concept' which is ok. Running `print(d.sort_values(by='ID').head())` shows that the data are complete, i.e. there are no missing 'ID's which would be a problem for pyling.

In [133]:
d = pd.read_table('https://github.com/lingpy/lingpy-tutorial/raw/refs/heads/master/polynesian.tsv',skiprows=3,comment='#') # Skip the top 3 rows which are comments or empty lines
print(d.head())
d.to_csv(os.path.join('data','polynesian.tsv'),index=False,sep='\t') # Save the data as received

     ID         DOCULECT DOCULECT_IN_SOURCE GLOTTOCODE  CONCEPTICON_ID  \
0     8  Vaeakau_Taumako    Vaeakau-Taumako   pile1238            1705   
1   251        Wallisian         Uvea, East   wall1257            1705   
2   725            Maori              Maori   maor1246            1705   
3   949   Kapingamarangi     Kapingamarangi   kapi1249            1705   
4  1169         Tahitian  Tahitian (Modern)   tahi1242            1705   

  CONCEPT VALUE  FORM               SOURCE  BVD_ID  COGID  BORROWING VARIANTS  \
0   Eight  valu  valu  Hovdhaugen-375-2009   89600    663        NaN      NaN   
1   Eight  valu  valu               POLLEX   90435    663        NaN      NaN   
2   Eight  waru  waru        Biggs-85-2005   87872    663        NaN      NaN   
3   Eight  walu  walu               POLLEX   90454    663        NaN     waru   
4   Eight  va'u  va'u       Clark-173-2005  102829    663        NaN     varu   

    TOKENS  
0  v a l u  
1  v a l u  
2  w a r u  
3  w a l u  
4  

Note that this data contains all of the tokens already, but to simulate the process of tokenization and conversion to IPA we will remove all of this information, leaving only the [cldf](https://cldf.clld.org/) required information: 'doculect' which is the language identifier, 'concept' which is the english word, and 'form' which is the word as spoken.

In [134]:
dsimple=d[['ID','CONCEPT','DOCULECT','FORM']]
dsimple=dsimple.dropna()
dsimple.loc[:,'ID']=dsimple.index
dsimple = dsimple.rename(columns={
    'DOCULECT': 'doculect',
    'CONCEPT': 'concept',
    'FORM': 'form'
})
dsimple.head()

Unnamed: 0,ID,concept,doculect,form
0,0,Eight,Vaeakau_Taumako,valu
1,1,Eight,Wallisian,valu
2,2,Eight,Maori,waru
3,3,Eight,Kapingamarangi,walu
4,4,Eight,Tahitian,va'u


## Converting data to 'IPA'-like tokens that can be processed by pyling

Mant characters are invalid for linguistic analysis and must be removed. There are some examples in these data, which we can use to show how to remove invalid content.

But first, we will make a function that converts an entire column of a dataframe to IPA, which is a little complex because it otherwise creates 'tokens', i.e. keeps the result as a list of single sounds.

In [135]:
def ipa2tokens_list(d):
    ## Takes a list of dataframe column
    ## Apply transformation to words that are present, and collapse that back to a string
    ret = [''.join(ipa2tokens(x)) for x in d]
    return ret

**PRE-PROCESSING** 

The issues below were identified by a series of errors or warnings thrown up by Lingpy. You can try omitting them and seeing how they affect the results.

In [136]:
dsimple['form']=[x.replace('-','') for x in dsimple['form']] # No - symbols
dsimple['form']=[x.replace('=','') for x in dsimple['form']] # No = symbols
dsimple['form']=[re.sub(r'\d', '', x) for x in dsimple['form']] # No numbers
dsimple['form']=[re.sub(r"\s+", '.', x) for x in dsimple['form']] # No spaces - replace these with a short pause ( though this is stripped from the IPA)
dsimple['form']=[x.lower() for x in dsimple['form']] # No capital letters

This section converts to IPA and saves the data to disk.

In [137]:

# Convert to IPA
dsimple['ipa']=ipa2tokens_list(dsimple['form'])
# Remove whitespace in the 'doculect' and 'concept' columns
dsimple['doculect']=[re.sub(r"\s+", '_', x) for x in dsimple['doculect']]
dsimple['concept']=[re.sub(r"\s+", '_', x) for x in dsimple['concept']]
# Save the results to a local file. 
dsimple.to_csv(os.path.join('data','polynesian_processed.tsv'),sep='\t',index=False)
## We'll print an extract for comparison and checking; this can be helpful for debugging as well
print(dsimple.loc[120:130,:])

      ID concept         doculect  form   ipa
120  120       I          Tikopia   kau   kau
121  121       I         Sikaiana    au    au
122  122       I  North_Marquesan    'u    'u
123  123       I  North_Marquesan    au    au
124  124       I     East_Futunan    ku    ku
125  125       I     East_Futunan   kau   kau
126  126       I         Pukapuka   aku   aku
127  127       I         Pukapuka    au    au
128  128       I        Mele_Fila  aβau  aβau
129  129       I        Ra’ivavae   vau   vau
130  130       I        Tuamotuan    ku    ku


# Main section: Working with Lingpy

Lingpy can operate on datafiles or darta in its own format.

We can access the data in Lingpy format using `Wordlist`. This is a 'Dictionary' object that has been extended with additional features. It is a little unweildy.

In [138]:
d = Wordlist('data/polynesian_processed.tsv')
# count number of languages, number of rows, number of concepts
print(
    "Wordlist has {0} languages and {1} concepts across {2} rows.".format(
        d.width, d.height, len(d)))


Wordlist has 30 languages and 210 concepts across 7315 rows.


This is how we interact with the wordlist:

In [139]:

# get all indices for concept "eight", `row` refers to the concepts here, while `col` refers to languages
eight = d.get_dict(row='Eight', entry='ipa')
for taxon in ['Emae', 'Rennell_Bellona', 
              'Tuvalu', 'Sikaiana', 'Penryhn',  'Kapingamarangi']:
    print(
        '{0:20}'.format(taxon), '  \t', ', '.join(eight[taxon]))

Emae                   	 βaru
Rennell_Bellona        	 baŋgu
Tuvalu                 	 valu
Sikaiana               	 valu
Penryhn                	 varu
Kapingamarangi         	 walu


## Cognate detection

For cognate detection, we can operate on our wordlist directly:

In [141]:
lex = LexStat(d,check=True)

2025-05-20 09:41:41,883 [INFO] No obvious errors found in the data.


Or equivelently we can run on the raw data format:

In [142]:
lex = LexStat('data/polynesian_processed.tsv',check=True)

2025-05-20 09:42:02,226 [INFO] No obvious errors found in the data.


**Cognate detection**

This is very simple to run, though complex to master.

We first (and only once) have to run `lex.get_scorer()` to tell Pyling about the score function it will use. This is complicated, but involves checking the data against itself so is run **once per dataset**. We can then run cognate detection with a number of **methods**.

Pyling is very "chatty" so we are going to reduce the amount of information it prints to the screen. This is not needed. If you get rid of it, make sure you also get rid on the indentation, because this is meaningful in python.

Here we use the clustering (i.e. cognate detection) algorithm 'infomap' which requires a 'similarity threshold' to operate.

Note also that we only try to cluster sounds for identical concepts. So even if the concepts 'sisters daughter' and 'sister in-laws daughter' have the same word, they will not be identified as cognate in this process.

In [143]:
from contextlib import redirect_stderr # We're going to redirect the chat to a log file
with open("lex_lexstat_log.log", 'w') as f:
    with redirect_stderr(f):
        lex.get_scorer() # Get the scorer (only run once)
        ## Now do the clustering. Note 
        lex.cluster(method='lexstat', threshold=0.55, cluster_method='infomap', ref='cogid')

2025-05-20 09:48:09,160 [INFO] Calculating alignments for pair Anuta / Anuta.
2025-05-20 09:48:09,175 [INFO] Calculating alignments for pair Anuta / East_Futunan.
2025-05-20 09:48:09,190 [INFO] Calculating alignments for pair Anuta / Emae.
2025-05-20 09:48:09,199 [INFO] Calculating alignments for pair Anuta / Futuna_Aniwa.
2025-05-20 09:48:09,208 [INFO] Calculating alignments for pair Anuta / Hawaiian.
2025-05-20 09:48:09,217 [INFO] Calculating alignments for pair Anuta / Kapingamarangi.
2025-05-20 09:48:09,225 [INFO] Calculating alignments for pair Anuta / Luangiua.
2025-05-20 09:48:09,236 [INFO] Calculating alignments for pair Anuta / Mangareva.
2025-05-20 09:48:09,249 [INFO] Calculating alignments for pair Anuta / Maori.
2025-05-20 09:48:09,258 [INFO] Calculating alignments for pair Anuta / Mele_Fila.
2025-05-20 09:48:09,265 [INFO] Calculating alignments for pair Anuta / Niuean.
2025-05-20 09:48:09,273 [INFO] Calculating alignments for pair Anuta / North_Marquesan.
2025-05-20 09:48:

We can examine our results directly by using the dictionary returned as 'lex'. This is a very manual process and you could instead save things to disk and look in Excel.

In [145]:
current_print=0 # We will keep track of how many lines we print to screen
# Print results
for k in lex:
    if lex[k, 'cogid'] is not None: 
        print(lex[k, 'language'], lex[k, 'concept'], lex[k, 'form'], lex[k, 'cogid'])
        current_print+=1
    if current_print >= 30: # Limit the lines to 30
        break

Wallisian Eight valu 1
Maori Eight waru 1
Kapingamarangi Eight walu 1
Tahitian Eight va'u 1
Emae Eight βaru 1
Rapanui Eight va'u 1
Mangareva Eight varu 1
Luangiua Eight valu 1
Tongan Eight valu 1
Tikopia Eight varu 1
Sikaiana Eight valu 1
North_Marquesan Eight va'u 1
East_Futunan Eight valu 1
Pukapuka Eight valu 1
Mele_Fila Eight eβaru 1
Ra’ivavae Eight vagu 1
Tuamotuan Eight varu 1
Niuean Eight valu 1
Rurutuan Eight vaʔu 1
Futuna_Aniwa Eight varu 1
Hawaiian Eight walu 1
Rarotongan Eight varu 1
Penryhn Eight varu 1
Nukuria Eight varu 1
Samoan Eight valu 1
Rennell_Bellona Eight baŋgu 1
Tuvalu Eight valu 1
Anuta Eight varu 1
Vaeakau_Taumako Fifty gatoaelima 2
Maori Fifty rima.tekau 3


The number at the end is the 'cluster ID'. So all words are seen as cognate. This looks correct. Two very different looking words for Fifty are not cognate, which also looks correct.

We can run a different 'model' easily:

In [146]:
# run the dolgopolsky (turchin) analysis, which is threshold-free
with open("lex_turchin_log.log", 'w') as f:
    with redirect_stderr(f):
        lex.cluster(method='turchin', ref='turchinid')

2025-05-20 09:53:04,802 [INFO] Analyzing words for concept <Eight>.
2025-05-20 09:53:04,828 [INFO] Analyzing words for concept <Fifty>.
2025-05-20 09:53:04,832 [INFO] Analyzing words for concept <Five>.
2025-05-20 09:53:04,845 [INFO] Analyzing words for concept <Four>.
2025-05-20 09:53:04,854 [INFO] Analyzing words for concept <I>.
2025-05-20 09:53:04,871 [INFO] Analyzing words for concept <Nine>.
2025-05-20 09:53:04,879 [INFO] Analyzing words for concept <One>.
2025-05-20 09:53:04,889 [INFO] Analyzing words for concept <One_Hundred>.
2025-05-20 09:53:04,897 [INFO] Analyzing words for concept <One_Thousand>.
2025-05-20 09:53:04,909 [INFO] Analyzing words for concept <Seven>.
2025-05-20 09:53:04,917 [INFO] Analyzing words for concept <Six>.
2025-05-20 09:53:04,924 [INFO] Analyzing words for concept <Ten>.
2025-05-20 09:53:04,940 [INFO] Analyzing words for concept <Three>.
2025-05-20 09:53:04,950 [INFO] Analyzing words for concept <Twenty>.
2025-05-20 09:53:04,954 [INFO] Analyzing words 

And a final different model, for comparison:

In [147]:
with open("lex_editdist_log.log", 'w') as f:
    with redirect_stderr(f):
        lex.cluster(method="edit-dist", threshold=0.75, ref='editid')

2025-05-20 09:53:23,902 [INFO] Analyzing words for concept <Eight>.
2025-05-20 09:53:23,926 [INFO] Analyzing words for concept <Fifty>.
2025-05-20 09:53:23,931 [INFO] Analyzing words for concept <Five>.
2025-05-20 09:53:23,941 [INFO] Analyzing words for concept <Four>.
2025-05-20 09:53:23,949 [INFO] Analyzing words for concept <I>.
2025-05-20 09:53:23,963 [INFO] Analyzing words for concept <Nine>.
2025-05-20 09:53:23,969 [INFO] Analyzing words for concept <One>.
2025-05-20 09:53:23,976 [INFO] Analyzing words for concept <One_Hundred>.
2025-05-20 09:53:23,981 [INFO] Analyzing words for concept <One_Thousand>.
2025-05-20 09:53:23,990 [INFO] Analyzing words for concept <Seven>.
2025-05-20 09:53:23,995 [INFO] Analyzing words for concept <Six>.
2025-05-20 09:53:24,000 [INFO] Analyzing words for concept <Ten>.
2025-05-20 09:53:24,012 [INFO] Analyzing words for concept <Three>.
2025-05-20 09:53:24,018 [INFO] Analyzing words for concept <Twenty>.
2025-05-20 09:53:24,021 [INFO] Analyzing words 

Finally lets take a look at the cognate calculations on these data by all models:

In [148]:
current_print=0
# Print results
for k in lex:
    if lex[k, 'cogid'] is not None: 
        print(lex[k, 'language'], lex[k, 'concept'], lex[k, 'form'], lex[k, 'cogid'],
                                                 lex[k, 'turchinid'],
                                                 lex[k, 'editid'])      
        current_print+=1
    if current_print >= 40:
        break

Wallisian Eight valu 1 1 1
Maori Eight waru 1 1 1
Kapingamarangi Eight walu 1 1 1
Tahitian Eight va'u 1 12 1
Emae Eight βaru 1 3 1
Rapanui Eight va'u 1 12 1
Mangareva Eight varu 1 1 1
Luangiua Eight valu 1 1 1
Tongan Eight valu 1 1 1
Tikopia Eight varu 1 1 1
Sikaiana Eight valu 1 1 1
North_Marquesan Eight va'u 1 12 1
East_Futunan Eight valu 1 1 1
Pukapuka Eight valu 1 1 1
Mele_Fila Eight eβaru 1 10 1
Ra’ivavae Eight vagu 1 18 1
Tuamotuan Eight varu 1 1 1
Niuean Eight valu 1 1 1
Rurutuan Eight vaʔu 1 20 1
Futuna_Aniwa Eight varu 1 1 1
Hawaiian Eight walu 1 1 1
Rarotongan Eight varu 1 1 1
Penryhn Eight varu 1 1 1
Nukuria Eight varu 1 1 1
Samoan Eight valu 1 1 1
Rennell_Bellona Eight baŋgu 1 19 1
Tuvalu Eight valu 1 1 1
Anuta Eight varu 1 1 1
Vaeakau_Taumako Fifty gatoaelima 2 33 2
Maori Fifty rima.tekau 3 23 4
Mangareva Fifty rima.rongo'uru 3 23 4
Tongan Fifty nimangofulu 3 32 4
East_Futunan Fifty kaulima 2 21 2
Pukapuka Fifty tinolima 2 27 2
Pukapuka Fifty laulima 2 34 2
Pukapuka Fifty 

We can see that the very fast method 'Turchin' is not doing a great job. The other methods are in pretty close agreement.

The following is how to extract information about a particular concept. (We can make this nicer by making a function!)

In [149]:
import math
# show the cognate sets, stored in "turchinid" for the words for "Eight"
eight = lex.get_dict(row='Eight') # get a dictionary with language as key for concept "eight"
print("{0:20} \t {1} \t{2}\t {3}\t {4}".format('doculect', 
                                               'form', 
                                               'cogid',
                                               'turchinid',
                                               'editid'))
for k, v in eight.items():
    if v: # Remove languages that are missing
        idx = v[0] # index of the word, it gives us access to all data
        print("{0:20} \t {1} \t{2}\t {3}\t {4}".format(lex[idx, 'doculect'], 
                                                 lex[idx, 'form'], 
                                                 lex[idx, 'cogid'],
                                                 lex[idx, 'turchinid'],
                                                 lex[idx, 'editid']))



doculect             	 form 	cogid	 turchinid	 editid
Wallisian            	 valu 	1	 1	 1
Maori                	 waru 	1	 1	 1
Kapingamarangi       	 walu 	1	 1	 1
Tahitian             	 va'u 	1	 12	 1
Emae                 	 βaru 	1	 3	 1
Rapanui              	 va'u 	1	 12	 1
Mangareva            	 varu 	1	 1	 1
Luangiua             	 valu 	1	 1	 1
Tongan               	 valu 	1	 1	 1
Tikopia              	 varu 	1	 1	 1
Sikaiana             	 valu 	1	 1	 1
North_Marquesan      	 va'u 	1	 12	 1
East_Futunan         	 valu 	1	 1	 1
Pukapuka             	 valu 	1	 1	 1
Mele_Fila            	 eβaru 	1	 10	 1
Ra’ivavae            	 vagu 	1	 18	 1
Tuamotuan            	 varu 	1	 1	 1
Niuean               	 valu 	1	 1	 1
Rurutuan             	 vaʔu 	1	 20	 1
Futuna_Aniwa         	 varu 	1	 1	 1
Hawaiian             	 walu 	1	 1	 1
Rarotongan           	 varu 	1	 1	 1
Penryhn              	 varu 	1	 1	 1
Nukuria              	 varu 	1	 1	 1
Samoan               	 valu 	1	 1	 1
Rennell_Bellon

In [150]:
lex.export(os.path.join('data','polynesian_cognate_detected.tsv'))

AssertionError: 

In [111]:
od = pd.read_table('OCSEAN_initial_joineddata.tsv')
od=od[['ID','concept','form','doculect']]
od=od.dropna()
od['form']=[x.replace('-','') for x in od['form']]
od['form']=[x.replace('=','') for x in od['form']]
od['form']=[re.sub(r'[\d]+', '', x) for x in od['form']]
od['form']=[re.sub(r'[\s]+', '.', x) for x in od['form']]
od = od[od['form'] != '']
od = od[od['form'] != '.']
##od.loc[:,'ID']=od.index+1
print(od.head(15))
print(od.shape)

    ID    concept              form       doculect
0    0        sun             wariy  Abui_Bunggeta
1    1       moon              'uya  Abui_Bunggeta
2    2       star             furiy  Abui_Bunggeta
3    3        sky            'adiiy  Abui_Bunggeta
4    4      Earth              buku  Abui_Bunggeta
5    5      Earth             bukuw  Abui_Bunggeta
6    6      cloud            taboqi  Abui_Bunggeta
7    7       wind            simooi  Abui_Bunggeta
8    8       wind             smooi  Abui_Bunggeta
9    9       rain             anuui  Abui_Bunggeta
10  10    drizzle  anuui.wobiyaanra  Abui_Bunggeta
11  11    drizzle   anuui.wobiyaana  Abui_Bunggeta
12  12    drizzle      anuui.paawal  Abui_Bunggeta
13  13        dew               moo  Abui_Bunggeta
14  14  mist; fog            taboqi  Abui_Bunggeta
(70457, 4)


In [112]:
## Hunting for bad sequences that cannot be converted to IPA
## Print out the "batch number" of a block of 100 that break the algorithm
for i in range(math.floor(len(od)/100)):
    try:
        tmp=convert2ipa(od['form'][(100*i):(100*i+100)])
    except:
        print(i)

In [113]:
## How we investigate a "bad" batch.
i=255
for j in range(2100*i,100*i+100):
    print(od['form'][j]) 
print(list(od['form'][(100*i):(100*i+100)]))
tmp=convert2ipa(od['form'][(100*i):(100*i+100)])

['bekipuur', 'hokku', 'hutan.rimba', 'ku', 'ije', 'ijenuwik', 'dih', 'langit', 'bakkaha', 'kanua', 'kokmim', 'kokmim', 'kila', 'daoh', 'kakie', 'kakder', 'pelangi', 'kanik', 'kanik', "ka'pepe", 'teduh', 'bayangan', 'embun', 'udara', 'ki', 'kikaknakbe', 'awan', 'awankakeeh', 'kabut', 'be', 'bekobari', 'salju', 'es', 'be.es', "e'iem", 'lidah.api', 'percikan', 'kano', 'abu', 'jelaga', 'bara.api', 'uap', 'duduk.pi', 'ku.eiem', 'arang', 'kak', 'man', 'hiur', 'pa', 'kaho', 'pahiur', 'pa', 'pakimebobor', 'pakimekape', 'pamo', 'dar', 'hun', 'pahun', 'pahun', 'am', 'na', 'kaktuo', 'keturunan', 'saudara.man', 'saudara.hiur', "a'man", 'eahman', "aa'hiur", 'eah.hiur', 'eah', "a'", 'kembar', 'datuk', 'nenek', 'anak.cucu', "ku'i", 'kaho', 'wak', 'wak', 'wak', 'wak', 'keponakan', 'sepupu', 'sepupu.dari.pasangan', 'leluhur', 'kamanipa', 'bapak.mertua', 'ibu.mertua', 'anak.mantu.lakilaki', 'anak.mantu.perempuan', 'ipar.lakilaki', 'ipar.perempuan', "da'rah", 'anak.angkat', 'janda', 'u', 'ki', 'ik', 'a',

In [114]:
# Convert to IPA
od['ipa']=convert2ipa(od['form'])
od['doculect']=[re.sub(r"\s+", '_', x) for x in od['doculect']]
od['concept']=[re.sub(r"\s+", '_', x) for x in od['concept']]
od.to_csv('OCSEAN_processed_joineddata.tsv',sep='\t')

In [115]:
ocseanwl = Wordlist('OCSEAN_processed_joineddata.tsv')

In [116]:
ocseanlex = LexStat('OCSEAN_processed_joineddata.tsv',check=True)

2025-05-19 21:39:27,205 [INFO] Data has been written to file <errors.log>.


There were errors in the input data - exclude them? [y/N]  


In [None]:

# Turn into LexStat and run cognate detection
lex = LexStat(wl)
lex.get_scorer()
lex.cluster(method='lexstat', threshold=0.55, cluster_method='infomap', ref='cogid')

# Print results
for k in lex:
    print(lex[k, 'language'], lex[k, 'concept'], lex[k, 'form'], lex[k, 'cogid'])