# Practice session with the `lingpy` module
By *Gede Primahadi W. Rajeg*

Created on: 6 August 2024

### Overview

These are my personal notes for learning [`lingpy`](https://github.com/lingpy/lingpy). I use `myenv` (Python 3.9.6) inside my `cldf_project` directory as the kernel/Python environment, since it already has the CLDF-related suite of modules installed.

The steps below are adapted from the tutorials on the [`lingpy`](https://github.com/lingpy/lingpy) documentation pages (cf. [here for the basics](https://lingpy.org/examples.html) and [here for handling wordlists](https://lingpy.org/tutorial/lingpy.basic.wordlist.html)). I combine them with a `pandas` workflow to turn the matrix output into a more general data frame format that I can later feed into an R workflow.

In [19]:
# load the lingpy module
from lingpy import *

# load pandas to handle data frames
import pandas as pd

In [20]:
# read the test data, namely the Harry Potter data
df = pd.read_csv('data/harry_potter.csv', sep = '\t')

# print the first five rows of the data frame (the default for `head()`)
df.head()

Unnamed: 0,ID,CONCEPT,COUNTERPART,IPA,DOCULECT,COGID
0,1,hand,Hand,hant,German,1
1,2,hand,hand,hænd,English,1
2,3,hand,рука,ruka,Russian,2
3,4,hand,рука,ruka,Ukrainan,2
4,5,leg,Bein,bain,German,3


Note that the `harry_potter.csv` data shown in the tutorial [here](https://lingpy.org/tutorial/lingpy.basic.wordlist.html) is no longer available from the [source code repository](https://github.com/lingpy/lingpy/tree/master) (I am not sure why). So, I recreated [this data](https://github.com/complexico/lingpy-practice/blob/main/data/harry_potter.csv) manually.
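For reference, here is a minimal sketch of the layout that the `pd.read_csv(..., sep = '\t')` call above expects: a header row followed by tab-separated fields, despite the `.csv` extension. It contains only the five rows displayed by `df.head()` above, not the full file, and it reads them from an in-memory string (`io.StringIO`) rather than from `data/harry_potter.csv`. The names `tsv_text` and `sample_df` are my own.

```python
import io

import pandas as pd

# The five rows shown by df.head() above, written out as tab-separated text.
# This is only an excerpt for illustration, not the complete data set.
tsv_text = (
    "ID\tCONCEPT\tCOUNTERPART\tIPA\tDOCULECT\tCOGID\n"
    "1\thand\tHand\thant\tGerman\t1\n"
    "2\thand\thand\thænd\tEnglish\t1\n"
    "3\thand\tрука\truka\tRussian\t2\n"
    "4\thand\tрука\truka\tUkrainan\t2\n"
    "5\tleg\tBein\tbain\tGerman\t3\n"
)

# Parse the text the same way the file is parsed above: tab as the separator
sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(sample_df)
```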

In [21]:
# Filter the data frame to illustrate the alignment analysis method

hand_df = df[df["CONCEPT"] == "hand"]
hand_df


Unnamed: 0,ID,CONCEPT,COUNTERPART,IPA,DOCULECT,COGID
0,1,hand,Hand,hant,German,1
1,2,hand,hand,hænd,English,1
2,3,hand,рука,ruka,Russian,2
3,4,hand,рука,ruka,Ukrainan,2


In [22]:
# Get the IPA column from `hand_df` as input for the alignment analysis
hand_seqs = hand_df["IPA"].tolist()
hand_seqs

['hant', 'hænd', 'ruka', 'ruka']

### Run alignment analysis

The reference is [https://lingpy.org/examples.html](https://lingpy.org/examples.html).

In [23]:

## First, create an instance of the `Multiple` class
## The input data is the list/sequence of IPA forms for the concept HAND
hand_msa = Multiple(hand_seqs)
hand_msa

<lingpy.align.multiple.Multiple at 0x10ff0a520>

In [24]:
## Second, run the alignment analysis
### Using the progressive alignment (source: https://lingpy.org/examples.html)
hand_msa.prog_align()

### print the output
print(hand_msa)

h	a	n	t	-
h	æ	n	d	-
r	u	k	-	a
r	u	k	-	a


In [33]:
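### Using the library alignment
### Build a separate `Multiple` instance: `hand_msa_lib = hand_msa` would only
### alias the same object, so `lib_align()` would overwrite the progressive alignment
hand_msa_lib = Multiple(hand_seqs)
hand_msa_lib.lib_align()
print(hand_msa_lib)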

h	a	n	t	-
h	æ	n	d	-
r	u	k	-	a
r	u	k	-	a


Note that the output of `print(hand_msa)` (or `print(hand_msa_lib)`) above is derived from a nested list (a list of lists, i.e., a matrix) stored inside the object.

The following code shows how to list the attributes of a Python object like `hand_msa`.

In [27]:
## get the list of attributes in the `hand_msa` object
hand_msa_attr = dir(hand_msa)

## get the number of attributes in the `hand_msa` object
attr_len = len(hand_msa_attr)
attr_len

There are 87 attributes in the `hand_msa` object. The alignment output is in the `alm_matrix` attribute while the tokenised results are in the `tokens` attribute.

The following code shows how to retrieve these attributes and their contents.

In [29]:
## retrieve the contents of `tokens`, a list of lists (equivalent to `hand_msa.tokens`)
getattr(hand_msa, "tokens")

[['h', 'a', 'n', 't'],
 ['h', 'æ', 'n', 'd'],
 ['r', 'u', 'k', 'a'],
 ['r', 'u', 'k', 'a']]

In [30]:
## retrieve the contents of `alm_matrix`, also a list of lists (equivalent to `hand_msa.alm_matrix`)
getattr(hand_msa, "alm_matrix")

[['h', 'a', 'n', 't', '-'],
 ['h', 'æ', 'n', 'd', '-'],
 ['r', 'u', 'k', '-', 'a'],
 ['r', 'u', 'k', '-', 'a']]
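`dir()` returns every name on the object, including methods and Python's internal underscore names, which is why the count above is so high. Below is a minimal sketch of narrowing the listing down to data attributes. It uses a hypothetical stand-in class (`FakeAlignment`) so that it runs without `lingpy`, and the helper name `data_attributes` is also my own. With the real object, pass `hand_msa` to the helper instead.

```python
class FakeAlignment:
    """Hypothetical stand-in for an object such as `hand_msa`."""

    def __init__(self):
        self.tokens = [["h", "a", "n", "t"], ["h", "æ", "n", "d"]]
        self.alm_matrix = [["h", "a", "n", "t"], ["h", "æ", "n", "d"]]
        self._cache = {}  # "private" name, hidden by the filter below

    def prog_align(self):
        """Placeholder method; it does not align anything."""


def data_attributes(obj):
    """Return the public, non-callable attribute names of `obj`."""
    names = []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and double-underscore names
        if callable(getattr(obj, name)):
            continue  # skip methods
        names.append(name)
    return names


demo = FakeAlignment()
print(data_attributes(demo))  # ['alm_matrix', 'tokens']
```

On a real object, `getattr` also evaluates properties, so I would treat this as an exploratory helper rather than a safe general-purpose tool.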

### Save the alignment matrix into a data frame

We can use `pandas` to turn the alignment matrix into a data frame for ease of processing. See the following code.

In [31]:
## turn alignment matrix into pandas data frame
hand_alm_mtx = getattr(hand_msa, "alm_matrix")
hand_alm_df = pd.DataFrame(hand_alm_mtx)
hand_alm_df

Unnamed: 0,0,1,2,3,4
0,h,a,n,t,-
1,h,æ,n,d,-
2,r,u,k,-,a
3,r,u,k,-,a


In [32]:
## save the data frame into a tab-separated (.tsv) file
hand_alm_df.to_csv("data/hand_alm_df.tsv", index=False, encoding="utf-8", sep="\t")
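Since the goal is to feed the result into R, a long ("tidy") layout with one row per doculect and alignment position may be easier to work with than the wide matrix. The following is a minimal sketch using `DataFrame.melt`. The matrix and the doculect labels are copied from the outputs above; the column names `POSITION` and `SEGMENT` and the `pos_` prefix are my own choices. I also assume that the rows of `alm_matrix` are in the same order as the input sequences, as the printed alignment suggests.

```python
import pandas as pd

# Alignment matrix and doculect labels copied from the outputs above
alm_matrix = [
    ["h", "a", "n", "t", "-"],
    ["h", "æ", "n", "d", "-"],
    ["r", "u", "k", "-", "a"],
    ["r", "u", "k", "-", "a"],
]
doculects = ["German", "English", "Russian", "Ukrainan"]

# Wide format: one row per doculect, one named column per alignment position
n_cols = len(alm_matrix[0])
wide_df = pd.DataFrame(alm_matrix, columns=[f"pos_{i + 1}" for i in range(n_cols)])
wide_df.insert(0, "DOCULECT", doculects)

# Long format: one row per (doculect, position) pair
long_df = wide_df.melt(id_vars="DOCULECT", var_name="POSITION", value_name="SEGMENT")
print(long_df.head())
```

The long data frame can be written out with the same `to_csv(..., sep="\t", index=False)` call used above.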

### Next

Run the alignment analysis for each cognate ID (e.g., in a `for` loop).
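A first, untested-at-scale sketch of what that loop could look like. It groups the IPA forms by `COGID` with `pandas` and runs `Multiple(...).prog_align()` on each group, as was done for HAND above. The data frame holds only the five rows shown by `df.head()`. The `try`/`except` around the import is my own addition so that the grouping part still runs where `lingpy` is not installed. The loop also skips cognate sets with fewer than two distinct forms, because I have not checked how `Multiple` behaves with a single form.

```python
import pandas as pd

try:
    from lingpy import Multiple  # same class as used above
except ImportError:  # allows the grouping step to run without lingpy
    Multiple = None

# The five rows displayed by df.head() above (relevant columns only)
df = pd.DataFrame({
    "IPA": ["hant", "hænd", "ruka", "ruka", "bain"],
    "DOCULECT": ["German", "English", "Russian", "Ukrainan", "German"],
    "COGID": [1, 1, 2, 2, 3],
})

# Collect the IPA forms of each cognate set: {COGID: [forms]}
cognate_seqs = {
    int(cogid): group["IPA"].tolist() for cogid, group in df.groupby("COGID")
}

# Align each cognate set and keep its alignment matrix
alignments = {}
for cogid, seqs in cognate_seqs.items():
    if Multiple is None or len(set(seqs)) < 2:
        continue  # nothing to align (or lingpy is unavailable)
    msa = Multiple(seqs)
    msa.prog_align()
    alignments[cogid] = msa.alm_matrix

print(cognate_seqs)
print(alignments)
```

With the rows above, only cognate set 1 has two distinct forms, so it is the only one that reaches the alignment step.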