# Cross-linguistic colexifications

## 1 Introduction





## 2 The Database of Cross-Linguistic Colexifications




## 3 How to Compute Colexifications in Python

In the following, we will briefly look at how colexifications in a given dataset can be handled with LingPy and with Python in general.

Let us start by loading the colexification module from LingPy along with a wordlist of Bai dialects, and then compute the most frequent colexifications in this sample.

In [31]:
from lingpy import *
from lingpy.meaning.colexification import *

wl1 = Wordlist('../data/S10-BAI.tsv')
G = colexification_network(wl1, output=None, concept='concept', entry='ipa')
for nA, nB, data in sorted(G.edges(data=True), key=lambda x: x[2]['wordWeight'], reverse=True):
    if data['wordWeight'] > 3:
        print('{0:10}\t{1:10}\t{2}'.format(nA, nB, data['wordWeight']))

year      	blood     	9
dry       	liver     	8
heart     	new       	7
sun       	warm      	5
lie       	sleep     	4
wind      	salt      	4
one       	not       	4
stand     	tree      	4


What we can see from this example is that none of these colexifications seems to reflect the kinds of semantic associations we find in the CLICS database or in other colexification studies. This is most likely because our sample is very small. What is also interesting, however, is that a couple of colexifications reflect what we could interpret as "genetic markers" for the group of Bai dialects. The colexification of "year" and "blood", for example, occurs in all dialects in our sample. It is a clear example of homophony due to phonological merger.

At least two of those mergers can also be found in a couple of Chinese dialects, especially "dry" and "liver", and "heart" and "new" ("year" and "blood" are similar but not identical in Mandarin; compare *suì* 岁 and *xiě* 血). But this does not prove that Bai dialects are closely related to Chinese, since other scholars attribute these cases of colexification to recent borrowings from Chinese into the Bai dialects (see [Lee and Sagart 2008](:bib:Lee2008)).

So far, the usefulness of colexifications that point to homophones rather than polysemies for inferring deep or shallow genetic relationships has not yet been sufficiently investigated. Let us make a small experiment and check what we find in Polynesian languages, given that we already have the data in our folder:


In [24]:
wl2 = Wordlist('../data/S08_east-polynesian.tsv')
G = colexification_network(wl2, output=None, concept='concept', entry='value')
for nA, nB, data in sorted(G.edges(data=True), key=lambda x: x[2]['wordWeight'], reverse=True):
    if data['wordWeight'] > 3:
        print('{0:10}\t{1:10}\t{2}'.format(nA, nB, data['wordWeight']))

husband   	man/male  	6
wife      	woman/female	6
Five      	hand      	6
dirty     	earth/soil	5
One Hundred	leaf      	4
to see    	to know, be knowledgeable	4
to kill   	to hit    	4


Here we find more instances of polysemy, although "One Hundred" vs. "leaf" is more likely a case of homophony. Note that we changed the value of the keyword `entry` in this example, since the column holding the unsegmented word form, which the algorithm compares language-internally to detect colexifications, is labelled differently in the two datasets ("ipa" vs. "value").

Let us check the Tujia languages in our data folder to see what we can find there (we have to lower the threshold, since otherwise we would not find any examples):

In [25]:
wl3 = Wordlist('../data/S09-data.tsv')
G = colexification_network(wl3, output=None, concept='concept', entry='ipa')
for nA, nB, data in sorted(G.edges(data=True), key=lambda x: x[2]['wordWeight'], reverse=True):
    if data['wordWeight'] > 1:
        print('{0:10}\t{1:10}\t{2}'.format(nA, nB, data['wordWeight']))

near      	short     	2
tooth     	louse     	2
eat       	bite      	2
eat       	long      	2
far       	long      	2


Let us make a small experiment by simply using the colexifications to calculate a distance matrix between the languages. We then compare the tree derived from this distance matrix with the tree derived from the comparison of the cognates. For the former, we use a method that computes a distance matrix from colexification data together with the `upgma` function of LingPy. For the latter, we just use the `Wordlist.calculate` function, which can be applied to all wordlists that contain cognate data.

In [28]:
from lingpy import upgma
matrix = compare_colexifications(wl2, entry='value')
wl2.calculate('tree', ref='cogid')

col_tree = upgma(matrix, taxa=wl2.cols)
print(wl2.tree.asciiArt())
print('---')
print(Tree(col_tree).asciiArt())



                    /-Sikaiana
          /edge.5--|
         |         |          /-Maori
         |          \edge.4--|
         |                   |          /-Rapanui
         |                    \edge.3--|
         |                             |          /-Hawaiian
         |                              \edge.2--|
-root----|                                       |          /-Mangareva
         |                                        \edge.1--|
         |                                                 |          /-North_Marquesan
         |                                                  \edge.0--|
         |                                                            \-Tuamotuan
         |
         |          /-Ra’ivavae
          \edge.7--|
                   |          /-Rurutuan
                    \edge.6--|
                              \-Tahitian
---
                    /-North_Marquesan
                   |
                   |                    /-Rurutuan
          /

From the differences in the resulting trees, we can easily see that the colexification data does not seem to be very useful for phylogenetic reconstruction.
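To make the idea of a colexification-based distance more concrete, the following is a minimal sketch in plain Python. It is not the implementation behind `compare_colexifications`; it simply assumes that two languages are close if they share many colexified concept pairs, and measures this with a Jaccard distance. The function names (`colexified_pairs`, `colexification_distances`) and the word forms in the toy data are invented for illustration.

```python
from itertools import combinations


def colexified_pairs(words):
    """Return the set of concept pairs that share a form in one language.

    `words` is a list of (concept, form) tuples for a single language.
    """
    by_form = {}
    for concept, form in words:
        by_form.setdefault(form, set()).add(concept)
    pairs = set()
    for concepts in by_form.values():
        # sorting makes (a, b) and (b, a) the same pair
        for a, b in combinations(sorted(concepts), 2):
            pairs.add((a, b))
    return pairs


def colexification_distances(data):
    """Pairwise Jaccard distances between languages (an assumed metric).

    `data` maps a language name to its list of (concept, form) tuples.
    Returns the sorted language names and a square distance matrix.
    """
    taxa = sorted(data)
    pairs = {t: colexified_pairs(data[t]) for t in taxa}
    matrix = [[0.0 for _ in taxa] for _ in taxa]
    for (i, tA), (j, tB) in combinations(enumerate(taxa), 2):
        union = pairs[tA] | pairs[tB]
        shared = pairs[tA] & pairs[tB]
        # arbitrary choice: languages without any colexification
        # are treated as maximally distant
        dist = 1 - len(shared) / len(union) if union else 1.0
        matrix[i][j] = matrix[j][i] = dist
    return taxa, matrix


# invented toy data, not taken from any of the wordlists above
toy = {
    'A': [('year', 'sua'), ('blood', 'sua'), ('dry', 'ka'), ('liver', 'ka')],
    'B': [('year', 'swa'), ('blood', 'swa'), ('dry', 'kan'), ('liver', 'ko')],
    'C': [('year', 'xo'), ('blood', 'si'), ('dry', 'ga'), ('liver', 'ga')],
}
taxa, matrix = colexification_distances(toy)
for taxon, row in zip(taxa, matrix):
    print(taxon, row)
```

In this toy example, languages A and B share one of two colexified pairs, while B and C share none. A matrix of this shape could in principle be handed to a clustering function in the same way as the matrix in the cell above.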

When applying the LingPy function for computing colexifications to large datasets, you should be aware that it may take quite a long time. The reason is that LingPy compares all words against all other words. The current implementation is thus very time-consuming and should be replaced in future versions. A much better way to infer colexifications is to use a Python dictionary as the data structure, into which the word forms are consecutively hashed. This yields a solution that is roughly linear in the number of words (in contrast to quadratic, when comparing all words against all words).

This means that we can retrieve the same information with a much faster function that iterates only once over each word form and adds it to a Python dictionary, using the form as key and the set of concepts it expresses as value. If two concepts in the same language share a word form, the set stored under this form grows beyond one element, and we can later retrieve all these cases of colexification:

In [45]:
from collections import defaultdict
from itertools import combinations
import networkx as nx
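G = nx.Graph()
for t in wl1.cols:
    # map each word form of the current variety to the concepts it expresses
    colexifications = defaultdict(set)
    for idx in wl1.get_list(col=t, flat=True):
        colexifications[wl1[idx, 'ipa']].add(wl1[idx, 'concept'])
    for key, vals in colexifications.items():
        # a form expressing more than one concept is a colexification
        if len(vals) > 1:
            for nA, nB in combinations(list(vals), r=2):
                # count in how many varieties the two concepts are colexified
                if G.has_edge(nA, nB):
                    G[nA][nB]['weight'] += 1
                else:
                    G.add_edge(nA, nB, weight=1)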
for nA, nB, data in sorted(G.edges(data=True), key=lambda x: x[2]['weight'], reverse=True):
    if data['weight'] > 3:
        print('{0:10}\t{1:10}\t{2}'.format(nA, nB, data['weight']))


year      	blood     	9
dry       	liver     	8
heart     	new       	7
sun       	warm      	5
lie       	sleep     	4
wind      	salt      	4
one       	not       	4
stand     	tree      	4
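The difference between the two strategies can also be illustrated without LingPy or networkx. The following sketch is only meant as an illustration under stated assumptions: the data are invented (language, concept, form) triples, and both function names are hypothetical. The first function groups the words by language and form in a single pass, while the second compares every word with every other word and serves only as a slow reference.

```python
from collections import defaultdict
from itertools import combinations


def colexifications_hashed(words):
    """Group (language, concept, form) triples by language and form in one pass."""
    groups = defaultdict(set)
    for language, concept, form in words:
        groups[language, form].add(concept)
    counts = defaultdict(int)
    for concepts in groups.values():
        # only forms with two or more concepts yield any pair
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return dict(counts)


def colexifications_pairwise(words):
    """Slow reference: compare every word with every other word."""
    found = set()
    for (langA, conA, formA), (langB, conB, formB) in combinations(words, 2):
        if langA == langB and formA == formB and conA != conB:
            found.add((langA,) + tuple(sorted((conA, conB))))
    counts = defaultdict(int)
    for _, conA, conB in found:
        counts[conA, conB] += 1
    return dict(counts)


# invented toy data, not taken from the Bai wordlist
words = [
    ('L1', 'year', 'sua'), ('L1', 'blood', 'sua'), ('L1', 'sun', 'ni'),
    ('L2', 'year', 'swa'), ('L2', 'blood', 'swa'),
    ('L2', 'sun', 'nji'), ('L2', 'warm', 'nji'),
    ('L3', 'year', 'so'), ('L3', 'blood', 'se'),
]

hashed = colexifications_hashed(words)
pairwise = colexifications_pairwise(words)
print(hashed == pairwise)
print(sorted(hashed.items(), key=lambda x: x[1], reverse=True))
```

Both functions return the same counts, but the pairwise version needs a number of comparisons that grows quadratically with the number of words, whereas the hashed version touches each word only once before enumerating the colexified pairs.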


The CLICS database uses this improved colexification code. Since colexifications are not LingPy's primary concern, it is not yet clear whether we will remove the current functions from the library in future versions or instead make explicit use of the `pyclics` API.