# Practice session with the `lingpy` module
By *Gede Primahadi W. Rajeg*

Created on: 6 August 2024

These are my personal notes for learning [`lingpy`](https://github.com/lingpy/lingpy). As the kernel/Python environment, use `myenv` (Python 3.9.6) inside my `cldf_project` directory, which already has the CLDF-related suite of modules installed.

## Practice 1

### Overview

The steps below are adapted from the tutorials on the [`lingpy`](https://github.com/lingpy/lingpy) documentation pages (cf. [here for the basics](https://lingpy.org/examples.html) and [here for handling wordlists](https://lingpy.org/tutorial/lingpy.basic.wordlist.html)). I combine them with a `pandas` workflow to turn the matrix output into a more general data frame format that I can later feed into an R workflow.

In [19]:
# load the lingpy module
from lingpy import *

# load pandas to handle data frames
import pandas as pd

### Read in the word list data

We use the `pandas` module for this purpose.

In [20]:
# read the test data, namely the Harry Potter data
df = pd.read_csv('data/harry_potter.csv', sep = '\t')

# print the first five rows of the data frame (the default of `head()`)
df.head()

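| | ID | CONCEPT | COUNTERPART | IPA | DOCULECT | COGID |
|---|---|---|---|---|---|---|
| 0 | 1 | hand | Hand | hant | German | 1 |
| 1 | 2 | hand | hand | hænd | English | 1 |
| 2 | 3 | hand | рука | ruka | Russian | 2 |
| 3 | 4 | hand | рука | ruka | Ukrainan | 2 |
| 4 | 5 | leg | Bein | bain | German | 3 |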


Note that the `harry_potter.csv` data shown in the tutorial [here](https://lingpy.org/tutorial/lingpy.basic.wordlist.html) is no longer available from the [source code repository](https://github.com/lingpy/lingpy/tree/master) (I am not sure why), so I recreated [this data](https://github.com/complexico/lingpy-practice/blob/main/data/harry_potter.csv) manually.

In [21]:
# Filter the data frame to illustrate the alignment analysis method

hand_df = df[df["CONCEPT"] == "hand"]
hand_df


| | ID | CONCEPT | COUNTERPART | IPA | DOCULECT | COGID |
|---|---|---|---|---|---|---|
| 0 | 1 | hand | Hand | hant | German | 1 |
| 1 | 2 | hand | hand | hænd | English | 1 |
| 2 | 3 | hand | рука | ruka | Russian | 2 |
| 3 | 4 | hand | рука | ruka | Ukrainan | 2 |


We now need to select the `IPA` column and turn it into a list of sequences, which is the input needed for the alignment analysis.

In [22]:
# Get the IPA column from the `hand_df` as input for alignment analysis
hand_seqs = hand_df["IPA"].tolist()
hand_seqs

['hant', 'hænd', 'ruka', 'ruka']

### Run alignment analysis

The reference is [https://lingpy.org/examples.html](https://lingpy.org/examples.html).

In [23]:

## First, create an instance of `Multiple` class
## The input data is the list/sequence of IPA forms for the concept HAND
hand_msa = Multiple(hand_seqs)
hand_msa

<lingpy.align.multiple.Multiple at 0x10ff0a520>

In [24]:
## Second, run the alignment analysis
### Using the progressive alignment (source: https://lingpy.org/examples.html)
hand_msa.prog_align()

### print the output
print(hand_msa)

h	a	n	t	-
h	æ	n	d	-
r	u	k	-	a
r	u	k	-	a


In [33]:
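### Using the library alignment
### (a new `Multiple` instance is created here, because `hand_msa_lib = hand_msa`
### would only give a second name to the same object and overwrite its alignment)
hand_msa_lib = Multiple(hand_seqs)
hand_msa_lib.lib_align()
print(hand_msa_lib)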

h	a	n	t	-
h	æ	n	d	-
r	u	k	-	a
r	u	k	-	a


Note that the output of `print(hand_msa)` (or `print(hand_msa_lib)`) above is derived from a nested list (a list of lists, i.e., a matrix) stored inside the `hand_msa` object.

The following code shows how to get the attributes inside a Python object like `hand_msa`.

In [27]:
## get the list of attributes in the `hand_msa` object
hand_msa_attr = dir(hand_msa)

## get the number of attributes in the `hand_msa` object
attr_len = len(hand_msa_attr)
attr_len

There are 87 attributes (including methods) in the `hand_msa` object. The alignment output is in the `alm_matrix` attribute, while the tokenised forms are in the `tokens` attribute.

The following code shows how to retrieve these attributes and their contents.

In [29]:
## retrieve the contents of `tokens`, which is a list of lists (a matrix);
## `getattr(hand_msa, "tokens")` is equivalent to `hand_msa.tokens`
getattr(hand_msa, "tokens")

[['h', 'a', 'n', 't'],
 ['h', 'æ', 'n', 'd'],
 ['r', 'u', 'k', 'a'],
 ['r', 'u', 'k', 'a']]

In [30]:
## retrieve the contents of `alm_matrix`, which is also a list of lists (a matrix)
getattr(hand_msa, "alm_matrix")

[['h', 'a', 'n', 't', '-'],
 ['h', 'æ', 'n', 'd', '-'],
 ['r', 'u', 'k', '-', 'a'],
 ['r', 'u', 'k', '-', 'a']]
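
Most of the names returned by `dir()` are methods or private names starting with an underscore, so the full list is long. The following is a minimal sketch for narrowing it down to public data attributes. It uses a hypothetical stand-in class `FakeMSA` (my own, not part of `lingpy`) so that the snippet runs without `lingpy` installed; the helper name `public_data_attributes` is also my own.

```python
class FakeMSA:
    """Hypothetical stand-in for an alignment object; NOT a lingpy class."""

    def __init__(self, seqs):
        self.tokens = [list(s) for s in seqs]  # naive character split
        self._cache = {}                       # private attribute

    def prog_align(self):                      # a method, to be filtered out
        pass


def public_data_attributes(obj):
    """Return the names of public, non-callable attributes of `obj`."""
    names = []
    for name in dir(obj):
        if name.startswith("_"):
            continue                           # skip private and dunder names
        if callable(getattr(obj, name)):
            continue                           # skip methods
        names.append(name)
    return names


msa = FakeMSA(["hant", "hænd"])
print(public_data_attributes(msa))
```

Applied to `hand_msa`, the same helper should list names such as `tokens` and `alm_matrix`, although I have not checked how it behaves with every attribute of a real `Multiple` object.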

### Save the alignment matrix into a data frame

We can use `pandas` to turn the alignment matrix into a data frame for ease of processing. See the following code.

In [31]:
## turn alignment matrix into pandas data frame
hand_alm_mtx = getattr(hand_msa, "alm_matrix")
hand_alm_df = pd.DataFrame(hand_alm_mtx)
hand_alm_df

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | h | a | n | t | - |
| 1 | h | æ | n | d | - |
| 2 | r | u | k | - | a |
| 3 | r | u | k | - | a |


In [32]:
## save the data frame into a tab-separated (.tsv) file
hand_alm_df.to_csv("data/hand_alm_df.tsv", index=False, encoding="utf-8", sep="\t")
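
For the R workflow, a long ("tidy") layout with one row per segment may be easier to handle than the wide matrix. Below is a minimal sketch in plain Python, using the alignment matrix shown above. The function name `to_long` and the column names `seq_id`, `position`, and `segment` are my own choices, not something `lingpy` provides.

```python
# alignment matrix copied from the `alm_matrix` output above
alm_matrix = [
    ["h", "a", "n", "t", "-"],
    ["h", "æ", "n", "d", "-"],
    ["r", "u", "k", "-", "a"],
    ["r", "u", "k", "-", "a"],
]


def to_long(matrix):
    """Flatten an alignment matrix into one record per (sequence, position)."""
    records = []
    for seq_id, row in enumerate(matrix):
        for position, segment in enumerate(row):
            records.append(
                {"seq_id": seq_id, "position": position, "segment": segment}
            )
    return records


long_records = to_long(alm_matrix)
print(len(long_records))  # 4 sequences x 5 alignment columns
print(long_records[0])
```

Passing `long_records` to `pd.DataFrame()` would then give a data frame that can be written out with `to_csv()` in the same way as above.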

#### Next:

Run the alignment analysis for each cognate ID (e.g., in a `for` loop).
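
One possible shape for that loop, as a sketch only: group the IPA forms by `COGID`, then hand each group to an alignment step. The rows are copied from the `hand` subset above. The `align` argument is a placeholder of mine; with `lingpy` it could wrap `Multiple(seqs)` followed by `prog_align()`, which I have not yet tested per cognate set. The stand-in `no_align` used here performs no alignment at all; it only splits each form into characters so that the snippet runs without `lingpy`.

```python
from collections import defaultdict

# rows copied from the `hand` subset above
rows = [
    {"ID": 1, "IPA": "hant", "DOCULECT": "German", "COGID": 1},
    {"ID": 2, "IPA": "hænd", "DOCULECT": "English", "COGID": 1},
    {"ID": 3, "IPA": "ruka", "DOCULECT": "Russian", "COGID": 2},
    {"ID": 4, "IPA": "ruka", "DOCULECT": "Ukrainan", "COGID": 2},
]


def group_by_cogid(rows):
    """Collect the IPA forms of each cognate set, keyed by COGID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["COGID"]].append(row["IPA"])
    return dict(groups)


def no_align(seqs):
    """Placeholder (assumption): splits forms into characters, does NOT align."""
    return [list(seq) for seq in seqs]


def align_each_cognate_set(rows, align=no_align):
    """Apply `align` to the forms of every cognate set."""
    return {cogid: align(seqs) for cogid, seqs in group_by_cogid(rows).items()}


result = align_each_cognate_set(rows)
for cogid, matrix in result.items():
    print(cogid, matrix)
```

Keeping the alignment step as an argument means the grouping can be checked on its own before a real aligner is plugged in.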

## Practice 2

### Overview

For this practice, I will try to follow the tutorial presented [here](https://github.com/shh-dlce/qmss-2017).

In [34]:
# Load the lingpy package
from lingpy import *

In [47]:
# Read the word list with the Wordlist function
wl = Wordlist("../cldf_project/qmss-2017/LingPy/polynesian.tsv")
wl

In [52]:
# Or read the same file as a pandas data frame (`pd` was imported in Practice 1)
wl_df = pd.read_csv("../cldf_project/qmss-2017/LingPy/polynesian.tsv", sep = "\t")
wl_df.head()

| | ID | DOCULECT | CONCEPT | GLOTTOCODE | CONCEPTICON_ID | VALUE | FORM | TOKENS | VARIANTS | SOURCE | COGID | LOAN | COGNACY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | NorthMarquesan_38 | Eight | nort2845 | 1705 | va'u | va'u | v a ʔ u | | POLLEX | 618 | False | 3 |
| 1 | 301 | Hawaiian_52 | Eight | hawa1245 | 1705 | walu | walu | w a l u | | 71458 | 618 | False | 3 |
| 2 | 537 | Rarotongan_58 | Eight | raro1241 | 1705 | varu | varu | v a r u | | POLLEX | 618 | False | 3 |
| 3 | 843 | Maori_85 | Eight | maor1246 | 1705 | waru | waru | w a r u | | Biggs-85-2005 | 618 | False | 3 |
| 4 | 1071 | Samoan_118 | Eight | samo1305 | 1705 | valu | valu | v a l u | | Blust-118-2005 | 618 | False | 3 |


In [65]:
# Select the first five columns
wl_df.iloc[:, :5]

| | ID | DOCULECT | CONCEPT | GLOTTOCODE | CONCEPTICON_ID |
|---|---|---|---|---|---|
| 0 | 6 | NorthMarquesan_38 | Eight | nort2845 | 1705 |
| 1 | 301 | Hawaiian_52 | Eight | hawa1245 | 1705 |
| 2 | 537 | Rarotongan_58 | Eight | raro1241 | 1705 |
| 3 | 843 | Maori_85 | Eight | maor1246 | 1705 |
| 4 | 1071 | Samoan_118 | Eight | samo1305 | 1705 |
| ... | ... | ... | ... | ... | ... |
| 7546 | 7029 | Mele-Fila_1163 | you | mele1250 | 1213 |
| 7547 | 7223 | Nukuria_1212 | you | nuku1259 | 1213 |
| 7548 | 7224 | Nukuria_1212 | you | nuku1259 | 1213 |
| 7549 | 7479 | Austral_1213 | you | aust1304 | 1213 |


In [43]:
# count the number of languages (`wl.width`), concepts (`wl.height`), and rows (`len(wl)`)
print("Wordlist has {0} languages and {1} concepts across {2} rows.".format(wl.width, wl.height, len(wl)))

Wordlist has 31 languages and 210 concepts across 7551 rows.


### Segmenting IPA/Phonetic entries

We use the function `ipa2tokens()`.

In [83]:
# Prepare the data

## creating IPA-transcribed word strings
seq1, seq2, seq3, seq4, seq5 = "th o x t a", "thoxta", "apfəl", "tʰoxtɐ", "dɔːtər"
seq1

'th o x t a'

Note that `seq1` (i.e., `"th o x t a"`) is already tokenised: its segments are separated by whitespace, and `th` is kept together as a single segment. Because of these spaces, passing `seq1` to `ipa2tokens()` raises a `ValueError` (see the `'\t'.join(ipa2tokens(seq1))` call below). The other strings, meanwhile, are (i) not yet tokenised, even though (ii) they are already transcribed in IPA.

Take-home message: it is important to have a word list that is both IPA-transcribed and tokenised, so that the segmentation is the intended one (e.g., which characters should be joined into one segment, like `th` in `"th o x t a"`).

In [84]:
## Tokenise `seq1` and `seq2` with `ipa2tokens()`; `seq1` contains spaces, so the first call raises the ValueError shown below
print(seq1, "\t->\t", '\t'.join(ipa2tokens(seq1)))
print(seq2, "  \t->\t", '\t'.join(ipa2tokens(seq2)))


ValueError: Input must not contain spaces
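
One way to avoid this error, as a sketch: split already segmented strings on whitespace and only send unsegmented strings to a tokeniser. The helper `to_tokens` and its `tokenizer` argument are hypothetical names of mine. In this notebook the tokeniser would be `ipa2tokens`; the demo below uses the built-in `list` instead (a naive character-by-character split, which is an assumption for illustration only) so that it runs without `lingpy`.

```python
def to_tokens(seq, tokenizer):
    """Return a list of segments for `seq`.

    Strings that already contain whitespace are treated as pre-tokenised
    and are only split; all other strings are passed to `tokenizer`.
    """
    if " " in seq.strip():
        return seq.split()
    return tokenizer(seq)


# `list` is a naive stand-in for a real tokeniser such as `ipa2tokens`
print(to_tokens("th o x t a", list))
print(to_tokens("thoxta", list))
```

With the stand-in, the pre-tokenised string keeps `th` as one segment, while the unsegmented string is split into single characters; a real tokeniser may of course segment it differently.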

In [85]:
## Example of tokenisation by `ipa2tokens()` for the string `"apfəl"`, with "f" set as a semi-diacritic so that "pf" stays together
print(seq3, "  \t->\t", '\t'.join(ipa2tokens(seq3, semi_diacritics="f")))

apfəl   	->	 a	pf	ə	l


In [86]:
word = "θiɣatɛra"
segs = ipa2tokens(word)

# iterate over sound class models and write them in converted version 
for m in ['dolgo', 'sca', 'asjp', 'art']:
    print(word, ' -> ', ''.join(tokens2class(segs, m)), '({0})'.format(m))

θiɣatɛra  ->  TVKVTVRV (dolgo)
θiɣatɛra  ->  DIGATERA (sca)
θiɣatɛra  ->  8ixatEra (asjp)
θiɣatɛra  ->  37371757 (art)


In [93]:
# get the values for the concept 'Eight' as a dictionary keyed by doculect
eight = wl.get_dict(row='Eight', entry='value')
eight

{'NorthMarquesan_38': ["va'u"],
 'Hawaiian_52': ['walu'],
 'Rarotongan_58': ['varu'],
 'Maori_85': ['waru'],
 'Samoan_118': ['valu'],
 'Austral_128': ['vaʔu'],
 'TongaTongaIslands_136': ['valu'],
 'Pukapuka_152': ['valu'],
 'Tikopia_155': ['varu'],
 'FutunaAniwa_156': ['varu'],
 'Tahitian_173': ["va'u"],
 'RennellBellona_206': ['baŋgu'],
 'EastFutuna_210': ['valu'],
 'Kapingamarangi_217': ['walu'],
 'Penrhyn_235': ['varu'],
 'Luangiua_238': ['valu'],
 'Mangareva_239': ['varu'],
 'Sikaiana_243': ['valu'],
 'Tuamotuan_246': ['varu'],
 'Niuean_247': ['valu'],
 'Anuta_253': ['varu'],
 'Wallisian_258': ['valu'],
 'Rapanui_264': ["va'u"],
 'VaeakauTaumako_375': ['valu'],
 'Tuvalu_753': ['valu'],
 'Emae_1030': ['βaru'],
 'Mele-Fila_1163': ['eβaru'],
 'Nukuria_1212': ['varu'],
 'Austral_1213': ['vaGu'],
 'Polynesian_658': [],
 'RakahangaManihiki_589': []}

In [92]:
# print the forms for 'Eight' in a selection of doculects
for taxon in ['Emae_1030', 'RennellBellona_206', 'Tuvalu_753', 'Sikaiana_243', 'Penrhyn_235',  'Kapingamarangi_217']:
    print('{0:20}'.format(taxon), '  \t', ', '.join(eight[taxon]))

Emae_1030              	 βaru
RennellBellona_206     	 baŋgu
Tuvalu_753             	 valu
Sikaiana_243           	 valu
Penrhyn_235            	 varu
Kapingamarangi_217     	 walu
