# Cross-linguistic colexifications

## 1 Introduction

Languages differ in how they label the universe and sometimes these labels clash in interesting and informative ways, such that one word form has multiple meanings. This may result from coincidence, termed *homophony*, whereby multiple meanings for one
word form arise accidentally as two word forms come to sound alike, as in French *paix* "peace" vs. *pet* "fart", which are both pronounced as
[pɛ]. In contrast to this are cases of *polysemy*, in which one word form comes to have multiple related senses, as in Russian дерево *dérevo*, which can denote both "tree" and "wood".

Cases of polysemy may be cross-linguistically frequent, in which case an explanation can likely be found in natural factors, be they linked to some aspect of human psychology or cognition, or the inherent structure of the natural environment (e.g. "rain" and "water", the above example of "tree" and "wood", or the common colexification of "moon" and "month"). On the other hand, where a polysemic pattern is relatively rare cross-linguistically, this is likely to point to a historical explanation in common inheritance or contact. For example, many Austronesian and
Papuan languages use the same term for "fire", "firewood", and "tree" in eastern New
Guinea and northern Australia. Since this pattern is rare worldwide, it hints at a deep connection between these groups across the Torres Strait (Schapper et al. 2016). Another case is given by Urban (2010), who notes that the word for "sun" can typically be translated as "eye of the day" in many Austroasiatic, Tai-Kadai, and Austronesian languages. In spite of the fact that a diachronic development based on a similar equation is attested in Indo-European (e.g. Old Irish *súil* "eye", from the PIE root \*seh₂l-, thus cognate with Latin *sōl* "sun"; Classical Armenian արեգակն *aregakn*, a compound of *arew* "sun" and *akn* "eye"), the relative cross-linguistic rarity of this pattern and its prevalence in Southeast Asia suggest an explanation in terms of historical factors.

Deciding on a natural or historical explanation (i.e. distinguishing between homophony and polysemy) may be relatively straightforward for small groupings of languages for which detailed etymological and historical
knowledge is available, but it becomes increasingly difficult on a larger scale, and impossible where detailed
historical information is unknown.  To circumvent this problem, scholars have increasingly begun to use the agnostic
cover term *colexification*, where two senses in a given language *colexify* if the
language uses the same lexical form for both (François 2008). Taking a colexification approach
enables scholars to approach questions of lexical semantics from the perspective of the data: 
if a pattern of colexification of certain meanings in one language is replicated across different languages 
or linguistic areas, that is indicative (if not diagnostic) of polysemy, rather than homophony
(List 2013). However, if frequency is to be used in this way as a proxy to infer polysemy,
reliable large-scale cross-linguistic colexification resources are required. 
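
The idea of using cross-linguistic frequency as a proxy can be illustrated with a minimal sketch. The word forms, language names, and family assignments below are invented for illustration, and the helper `count_colexifications` is hypothetical; it simply counts, for each colexified concept pair, in how many languages and in how many families the pair occurs.

```python
from collections import defaultdict
from itertools import combinations

# Invented toy data: language -> {concept: word form}
WORDS = {
    "lang_a": {"TREE": "dom", "WOOD": "dom", "MOON": "lun", "MONTH": "lun"},
    "lang_b": {"TREE": "kai", "WOOD": "kai", "MOON": "ma", "MONTH": "tuki"},
    "lang_c": {"TREE": "ki", "WOOD": "ki", "MOON": "sel", "MONTH": "sel"},
    "lang_d": {"TREE": "pe", "WOOD": "ro", "MOON": "pe", "MONTH": "ti"},
}
# Hypothetical family assignments for the toy languages
FAMILIES = {"lang_a": "fam_1", "lang_b": "fam_1", "lang_c": "fam_2", "lang_d": "fam_3"}


def count_colexifications(words, families):
    """Map each colexified concept pair to (number of languages, number of families)."""
    languages = defaultdict(set)
    fams = defaultdict(set)
    for language, lexicon in words.items():
        # group the concepts of one language by their word form
        by_form = defaultdict(set)
        for concept, form in lexicon.items():
            by_form[form].add(concept)
        # every form shared by several concepts yields colexified pairs
        for concepts in by_form.values():
            for pair in combinations(sorted(concepts), 2):
                languages[pair].add(language)
                fams[pair].add(families[language])
    return {pair: (len(languages[pair]), len(fams[pair])) for pair in languages}


counts = count_colexifications(WORDS, FAMILIES)
for pair, (n_langs, n_fams) in sorted(counts.items(), key=lambda x: -x[1][1]):
    print(pair, n_langs, n_fams)
```

In this toy sample, a pair that recurs in several families would be a candidate for polysemy, whereas a pair attested in a single language gives no evidence either way. Where to draw the line is a matter of judgment and sample size.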

*Networks* underpin all colexification studies, whether explicitly or implicitly, and they
play a crucial role in the investigation of cross-linguistic
colexification patterns. First, they offer a convenient way to visualize the complexity of recurring
semantic associations along with a number of high-quality tools for the interactive exploration of
network data (Bastian et al. 2009, Smoot 2011). Second, thanks to recent advances in the empirical study
of networks (Newman 2010), many aspects of network structures are well understood, and a
multitude of methods and statistics are available (Csárdi and Nepusz 2006, Hagberg 2009), making it easy for
scholars to apply them in their research.

![example colex graph](img/example_colex_graph.png)

Examples of hypothetical cross-linguistic polysemy networks. A shows an unweighted graph, B a hypergraph, C a network with weighted edges (edge width representing relative weight), and D a network with weighted edges and weighted nodes (node size representing relative node weight).
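
As a minimal sketch of how the graph types in panels A, C, and D could be represented in code, the following example builds a small network with `networkx`. All counts are invented for illustration: node weights stand for the number of language varieties that have a word for a concept, and edge weights for the number of varieties that colexify a concept pair.

```python
import networkx as nx

# Invented toy counts (not taken from CLICS)
node_counts = {"TREE": 10, "WOOD": 9, "FIRE": 10, "FIREWOOD": 6}
edge_counts = {("TREE", "WOOD"): 5, ("FIRE", "FIREWOOD"): 3, ("TREE", "FIREWOOD"): 1}

G = nx.Graph()
for concept, n in node_counts.items():
    G.add_node(concept, weight=n)      # node weight, as in panel D
for (a, b), n in edge_counts.items():
    G.add_edge(a, b, weight=n)         # edge weight, as in panels C and D

# Ignoring the weights corresponds to the unweighted graph in panel A
unweighted_degree = dict(G.degree())
# Summing the edge weights per node uses the information shown in panel C
weighted_degree = dict(G.degree(weight="weight"))

for concept in G.nodes():
    print(concept, G.nodes[concept]["weight"],
          unweighted_degree[concept], weighted_degree[concept])
```

The hypergraph in panel B is not covered by this sketch, since `networkx` graphs only link pairs of nodes.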

## 2 The Database of Cross-Linguistic Colexifications

The promise of a network-based approach to colexification (see List 2013 and Mayer 2014 for
colexification analyses, and Rosvall 2008 for general-purpose studies) led to the
publication of the *Database of Cross-Linguistic Colexifications* (CLICS, List 2014f),
which provided cross-linguistic colexification patterns for 1280 concepts across 220 language
varieties. While this version of the CLICS database was a valuable resource,
it also had a number of serious shortcomings. In particular, it contained little data, covering only 220 language varieties spoken primarily in South America and Eurasia, and the data that were available were hard to check, curate, and extend.

Here, as well as in a paper to be published soon, we describe an updated release of CLICS based on a new framework that attempts
to solve these problems while at the same time scaling up the available data, thus facilitating future research into
colexifications. The most important improvements, as we see them, are:

* separating data from display, 
* making exhaustive and principled use of existing *reference catalogs* like
  Concepticon (List 2016, for concepts) and Glottolog (Hammarström 2017, for languages) along with
  recently proposed standardization efforts for cross-linguistic data (Forkel 2017),
* curating data and code with the help of a transparent Application Programming Interface
  (API), and
* regularly releasing data, with at least one release per year (Haspelmath 2015).

In following these design guidelines, we have developed a new database of cross-linguistic
colexifications which supersedes the old CLICS database not only in size, both in terms of the number
of language varieties and the number of comparative concepts represented, but also with respect to the ease of data curation and the flexibility of the API. 

Our application allows users to explore the data from various additional
perspectives, including geographic maps, the inspection of the data of individual languages, or the
distribution of concepts for which we find translations in our data. In addition, each data point
can be traced back to its original source, allowing the users to rigorously check whether the
automatic findings we present can be confirmed through qualitative research. 

### 2.1 Colexifications of SAY, WORD, and LANGUAGE

Of the 327 communities containing at least two concepts in the data, the largest community consists
of 21 nodes, centering around the concept SAY, which has the largest number of links
in this subgraph (see the figure below for the position of this subgraph in the full network). This subgraph also contains the
link between LANGUAGE and WORD, ranking at position nine in the
collection of most frequently recurring links. Concepts shown in bold font in the
figure have external links that recur in at least five different
language families, which suggests an explanation in terms of natural factors. The concept LANGUAGE further links to the concept MOUTH,
which is placed in a community with TOOTH as the central concept. The concept SAY itself
further links to three different concepts from other communities, namely PROMISE (central
concept OATH), CALL BY NAME (central concept
SHOUT), and DO OR MAKE (central concept BUILD). When hovering over the concept in the application,
a pop-up provides this information, and users can directly open the respective community to which a
given concept with external edges links. 

![Full network](img/full_network.png)
 
Concepts which could equally well be assigned to different communities are quite common in
cross-linguistic colexification data. While this may result from using an inappropriate algorithm
for community detection that partitions the network into too small sets of nodes, it also reflects
the general indeterminacy of concepts which can often be assigned to different domains. According to our
automatic analysis, for example, SAY plays a role in four different semantic domains, which
could be labelled as *neutral speech* (community around SAY),
*concrete action* (community around DO OR MAKE), *promise* (community around
OATH), and *articulated speech* (community around CALL BY NAME).
Unfortunately, our data is not tagged for semantic fields or semantic domains. If it were, we could
automatically derive those concepts which are in transitional areas and not easy to assign to only one
domain. Much more work will have to be done in the future, both on existing resources such as the
Concepticon and on datasets in CLDF format, as well as on our colexification database, in order to exhaust their full potential.
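
How concepts with links into other communities could be identified automatically can be sketched as follows. Note the assumptions: the toy graph and its weights are invented, the concepts SPEAK, TELL, and WORK are added only for illustration, and `greedy_modularity_communities` from `networkx` is used as a readily available stand-in for the Infomap algorithm, so the partitions need not match those of the database. The helper `external_links` is hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Invented toy graph: two tightly knit groups joined by a single weak link
G = nx.Graph()
G.add_weighted_edges_from([
    ("SAY", "SPEAK", 5), ("SAY", "TELL", 5), ("SPEAK", "TELL", 5),
    ("DO OR MAKE", "BUILD", 5), ("DO OR MAKE", "WORK", 5), ("BUILD", "WORK", 5),
    ("SAY", "DO OR MAKE", 1),
])

# Partition the graph (modularity-based stand-in, not Infomap)
communities = [set(c) for c in greedy_modularity_communities(G, weight="weight")]
membership = {node: i for i, com in enumerate(communities) for node in com}


def external_links(graph, membership):
    """Return, for each concept, the neighbors assigned to another community."""
    out = {}
    for a, b in graph.edges():
        if membership[a] != membership[b]:
            out.setdefault(a, set()).add(b)
            out.setdefault(b, set()).add(a)
    return out


bridges = external_links(G, membership)
for concept, targets in sorted(bridges.items()):
    print(concept, "->", sorted(targets))
```

Concepts that show up in `bridges` are candidates for the transitional areas mentioned above. Whether such a link is meaningful still has to be checked against the individual word forms.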

![Network say](img/network_say.png)

### 2.2 Colexifications of WHEEL and FOOT

As a further example, let us consider a case of regional colexification that was already mentioned by
Mayer (2014) and can also be found in our new colexification database: the colexification of
FOOT and WHEEL in some South American languages. In contrast to the example of SAY, WORD, and
LANGUAGE in the preceding section, this colexification does not reflect a global pattern which could
be identified when looking into the partitions based on the Infomap community detection analysis,
which places WHEEL and FOOT into distinct communities. An
additional view of the colexification data, introduced by Mayer (2014), however, allows one to
find areal patterns, provided they are frequent enough and recur across different language families.
This view (called *subgraph* by Mayer 2014) presents the subgraph derived from the
closest neighbors of a given query concept. Neighbors of the starting concept are identified by
setting a frequency threshold. In consecutive steps, more nodes (the neighbors of the neighbors) can
be added to the subgraph, depending on the size of the network, which should not exceed a certain
number of nodes to allow for convenient inspection. 
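
The procedure just described can be sketched in a few lines. This is not the code behind the application, only a minimal illustration under stated assumptions: the function `neighbor_subgraph`, its parameters, and the toy edge weights are hypothetical.

```python
import networkx as nx


def neighbor_subgraph(graph, concept, threshold=2, max_nodes=10, depth=2):
    """Collect neighbors of `concept`, following edges with weight >= threshold."""
    nodes = [concept]
    frontier = [concept]
    for _ in range(depth):
        # candidate neighbors of the nodes added in the previous step
        candidates = []
        for node in frontier:
            for neighbor, data in graph[node].items():
                if data["weight"] >= threshold and neighbor not in nodes:
                    candidates.append((data["weight"], neighbor))
        frontier = []
        # add the strongest links first, so that the node cap drops the weakest
        for weight, neighbor in sorted(candidates, reverse=True):
            if neighbor not in nodes and len(nodes) < max_nodes:
                nodes.append(neighbor)
                frontier.append(neighbor)
    sub = graph.subgraph(nodes).copy()
    # drop links below the threshold between the selected nodes
    weak = [(a, b) for a, b, w in sub.edges(data="weight") if w < threshold]
    sub.remove_edges_from(weak)
    return sub


# Toy graph with invented weights
G = nx.Graph()
G.add_weighted_edges_from([
    ("WHEEL", "CIRCLE", 25),
    ("WHEEL", "FOOT", 4),
    ("WHEEL", "CART", 1),
    ("CIRCLE", "ROUND", 12),
    ("FOOT", "LEG", 30),
])
view = neighbor_subgraph(G, "WHEEL", threshold=3, depth=2)
print(sorted(view.nodes()))
```

With a threshold of 3, the weak link to CART is ignored, while FOOT is kept even though it would be placed in a different community by a global partition.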
 
Thus, while the colexification between WHEEL and FOOT does not
show up in our community analysis, we find it in the subgraph view, as shown in the figure below. As we can see from the different concrete word forms reflecting the colexification,
we are not dealing with a direct borrowing that spread among the languages. Instead, the
colexification either reflects an instance of *loan transfer* (in the terminology of
Weinreich 1953) or an indirect metaphorical extension. 
What may substantiate the latter hypothesis is the fact that the
WHEEL-FOOT colexification is not restricted to South
America, but also seems to be reflected in some African languages spoken on the eastern coast of
Africa (Gilman 1986; Heine 2017). The explanation for this particular colexification can thus be sought in historical factors, as a metaphorical extension linked to the introduction of the wheel as a widespread technology in a colonial context.

![Foot wheel](img/footwheel.png)

This again has immediate implications for ongoing debates on *linguistic paleontology*. First,
the WHEEL-FOOT metaphor shows that concrete historical events
may be reflected in languages. Second, however, it also shows that we need to be very careful when
evaluating this evidence. As we can see from the subgraph above, there are
plenty of colexifications for CIRCLE and WHEEL in our data as
well (our data counts 25 concrete colexifications across 10 different language families). Assuming
that societies usually have a way to express the concept "circle", while "wheel" may be
missing, our data suggests that the most straightforward strategy to express a new concept "wheel"
starts from the word for "circle". Since this can easily happen independently, as we can again see
from our data, these findings might be of importance for ongoing debates on the origin of terms for
"wheel", especially in Indo-European (Hock 2017; Anthony 2015). 
Further studies on lexical typology, including studies on independently recurring
patterns of semantic shift as well as the frequency of loan transfer, are required before this linguistic data can be reliably used to reconstruct ancestral cultures. Our extended colexification data may serve as a starting point for these investigations.

### 2.3 Website

The new CLICS database can be explored at [http://clics.clld.org/](http://clics.clld.org/).


## 3 How to Compute Colexifications in Python

In the following, we will briefly look at how colexifications in a given dataset can be handled with LingPy and with Python in general.

Let us start by loading the module from LingPy along with a wordlist of Bai dialects and computing the most frequent colexifications in this sample.

```python
from lingpy import *
from lingpy.meaning.colexification import *

# load the Bai wordlist and build the colexification network;
# `entry` names the column holding the word forms to be compared
wl1 = Wordlist('../data/S10-BAI.tsv')
G = colexification_network(wl1, output=None, concept='concept', entry='ipa')
# print all links recurring in more than three varieties, most frequent first
for nA, nB, data in sorted(G.edges(data=True), key=lambda x: x[2]['wordWeight'], reverse=True):
    if data['wordWeight'] > 3:
        print('{0:10}\t{1:10}\t{2}'.format(nA, nB, data['wordWeight']))
```

```text
2018-06-26 10:14:26,066 [INFO] Analyzing taxon Dashi...
2018-06-26 10:14:26,069 [INFO] Analyzing taxon Ega...
2018-06-26 10:14:26,074 [INFO] Analyzing taxon Enqi...
2018-06-26 10:14:26,078 [INFO] Analyzing taxon Gongxing...
2018-06-26 10:14:26,082 [INFO] Analyzing taxon Jinman...
2018-06-26 10:14:26,086 [INFO] Analyzing taxon Jinxing...
2018-06-26 10:14:26,089 [INFO] Analyzing taxon Mazhelong...
2018-06-26 10:14:26,092 [INFO] Analyzing taxon Tuolo...
2018-06-26 10:14:26,095 [INFO] Analyzing taxon Zhoucheng...


year      	blood     	9
liver     	dry       	8
new       	heart     	7
sun       	warm      	5
stand     	tree      	4
wind      	salt      	4
sleep     	lie       	4
not       	one       	4
```


What we can see from this example is that none of these colexifications seems to reflect the types of semantic associations we find in the CLICS database or in other accounts of colexification. This is most likely due to the fact that our sample is very small. What is also interesting, however, is that a couple of colexifications reflect what we could interpret as "genetic markers" for the group of Bai dialects. The colexification of "year" and "blood", for example, occurs in all dialects in our sample. It is a clear example of homophony due to phonological merger.

At least two of those mergers can also be found in a couple of Chinese dialects, especially "dry" and "liver", and "heart" and "new" ("year" and "blood" are similar, compare *suì* 岁 and *xiě* 血, but not identical in Mandarin). But this does not prove that Bai dialects are closely related to Chinese, as other scholars would claim that these cases of colexification are due to recent borrowings in the Bai dialects from Chinese (see Lee and Sagart 2008).

So far, the usefulness of colexifications that point to homophones rather than polysemies for inferring deep or shallow genetic relationships has not been sufficiently investigated. Let us run a small experiment and check what we find in Polynesian languages, given that we already have the data in our folder:


```python
# East Polynesian wordlist: the word forms are stored in the column "value"
wl2 = Wordlist('../data/S08_east-polynesian.tsv')
G = colexification_network(wl2, output=None, concept='concept', entry='value')
for nA, nB, data in sorted(G.edges(data=True), key=lambda x: x[2]['wordWeight'], reverse=True):
    if data['wordWeight'] > 3:
        print('{0:10}\t{1:10}\t{2}'.format(nA, nB, data['wordWeight']))
```

```text
2018-06-26 10:14:26,224 [INFO] Analyzing taxon Hawaiian...
2018-06-26 10:14:26,236 [INFO] Analyzing taxon Mangareva...
2018-06-26 10:14:26,251 [INFO] Analyzing taxon Maori...
2018-06-26 10:14:26,257 [INFO] Analyzing taxon North_Marquesan...
2018-06-26 10:14:26,267 [INFO] Analyzing taxon Rapanui...
2018-06-26 10:14:26,275 [INFO] Analyzing taxon Ra’ivavae...
2018-06-26 10:14:26,284 [INFO] Analyzing taxon Rurutuan...
2018-06-26 10:14:26,294 [INFO] Analyzing taxon Sikaiana...
2018-06-26 10:14:26,303 [INFO] Analyzing taxon Tahitian...
2018-06-26 10:14:26,309 [INFO] Analyzing taxon Tuamotuan...


Five      	hand      	6
man/male  	husband   	6
woman/female	wife      	6
earth/soil	dirty     	5
to kill   	to hit    	4
leaf      	One Hundred	4
to see    	to know, be knowledgeable	4
```


We find more instances of polysemy here, although "One Hundred" and "leaf" are likely candidates for homophony. Note that we set the keyword `entry` to "value" in this example, since the column holding the unsegmented word form, which the algorithm compares language-internally to find colexifications, is labelled differently in the two datasets ("ipa" vs. "value").

Let us check the Tujia languages in our data folder to see what we can find there (we have to lower the threshold, since otherwise we would not find any examples):

```python
# Tujia wordlist; the threshold is lowered to links recurring at least twice
wl3 = Wordlist('../data/S09-data.tsv')
G = colexification_network(wl3, output=None, concept='concept', entry='ipa')
for nA, nB, data in sorted(G.edges(data=True), key=lambda x: x[2]['wordWeight'], reverse=True):
    if data['wordWeight'] > 1:
        print('{0:10}\t{1:10}\t{2}'.format(nA, nB, data['wordWeight']))
```

```text
2018-06-26 10:14:26,411 [INFO] Analyzing taxon Boluo_Tujia...
2018-06-26 10:14:26,415 [INFO] Analyzing taxon Dianfang_Tujia...
2018-06-26 10:14:26,417 [INFO] Analyzing taxon Duogu_Tujia...
2018-06-26 10:14:26,420 [INFO] Analyzing taxon Tanxi_Tujia...
2018-06-26 10:14:26,423 [INFO] Analyzing taxon Tasha_Tujia...


louse     	tooth     	2
eat       	long      	2
eat       	bite      	2
long      	far       	2
short     	near      	2
```


Let us make a small experiment by simply using the colexifications to calculate a distance matrix between the languages. We then compare the tree retrieved from this distance matrix with the tree retrieved from the comparison of the cognates. For the former, we use a method that computes a distance matrix from colexification data together with the `upgma` function of LingPy; for the latter, we use the `Wordlist.calculate` function, which can be applied to all wordlists that contain cognate data.

```python
from lingpy import upgma

# distance matrix derived from the colexifications in each language
matrix = compare_colexifications(wl2, entry='value')
# reference tree computed from the cognate sets in the column "cogid"
wl2.calculate('tree', ref='cogid')

# cluster the colexification-based distances with UPGMA
col_tree = upgma(matrix, taxa=wl2.cols)
print(wl2.tree.asciiArt())
print('---')
print(Tree(col_tree).asciiArt())
```

```text
2018-06-26 10:14:26,456 [INFO] Analyzing taxon Hawaiian...
2018-06-26 10:14:26,465 [INFO] Analyzing taxon Mangareva...
2018-06-26 10:14:26,478 [INFO] Analyzing taxon Maori...
2018-06-26 10:14:26,483 [INFO] Analyzing taxon North_Marquesan...
2018-06-26 10:14:26,491 [INFO] Analyzing taxon Rapanui...
2018-06-26 10:14:26,507 [INFO] Analyzing taxon Ra’ivavae...
2018-06-26 10:14:26,512 [INFO] Analyzing taxon Rurutuan...
2018-06-26 10:14:26,518 [INFO] Analyzing taxon Sikaiana...
2018-06-26 10:14:26,523 [INFO] Analyzing taxon Tahitian...
2018-06-26 10:14:26,530 [INFO] Analyzing taxon Tuamotuan...
2018-06-26 10:14:26,688 [INFO] Successfully calculated tree.


                    /-Sikaiana
          /edge.5--|
         |         |          /-Maori
         |          \edge.4--|
         |                   |          /-Rapanui
         |                    \edge.3--|
         |                             |          /-Hawaiian
         |                              \edge.2--|
-root----|                                       |          /-Mangareva
         |                                        \edge.1--|
         |                                                 |          /-North_Marquesan
         |                                                  \edge.0--|
         |                                                            \-Tuamotuan
         |
         |          /-Ra’ivavae
          \edge.7--|
                   |          /-Rurutuan
                    \edge.6--|
                              \-Tahitian
---
                    /-North_Marquesan
                   |
                   |                    /-Rurutuan
          /
```

From the differences in the resulting trees, we can easily see that the colexification data does not seem to be very useful for phylogenetic reconstruction.
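
To make the idea of a colexification-based distance more concrete, here is a minimal sketch. It is an assumption for illustration and not necessarily the measure that LingPy's `compare_colexifications` computes: the distance between two languages is taken to be the Jaccard distance between their sets of colexified concept pairs. The function `jaccard_distance_matrix` and the toy data are hypothetical.

```python
from itertools import combinations

# Invented toy data: language -> set of colexified concept pairs
COLEXIFIED = {
    "lang_a": {("FIVE", "HAND"), ("HUSBAND", "MAN")},
    "lang_b": {("FIVE", "HAND")},
    "lang_c": {("HUSBAND", "MAN")},
}


def jaccard_distance_matrix(colexified):
    """Return the sorted taxa and a symmetric matrix of Jaccard distances."""
    taxa = sorted(colexified)
    matrix = [[0.0 for _ in taxa] for _ in taxa]
    for (i, a), (j, b) in combinations(enumerate(taxa), 2):
        union = colexified[a] | colexified[b]
        shared = colexified[a] & colexified[b]
        # design choice: languages without any colexifications count as maximally distant
        dist = 1.0 - len(shared) / len(union) if union else 1.0
        matrix[i][j] = matrix[j][i] = dist
    return taxa, matrix


taxa, matrix = jaccard_distance_matrix(COLEXIFIED)
for taxon, row in zip(taxa, matrix):
    print(taxon, row)
```

A matrix of this kind could then be passed to a clustering routine such as the `upgma` function used above. With only a handful of colexified pairs per language, many distances collapse to the same value, which may be one reason why the resulting trees carry little phylogenetic signal.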

When applying the LingPy function for computing colexifications to large datasets, you should be aware that it may take quite a long time. The reason is that LingPy compares all words against all other words. The current implementation is thus very time-consuming and should be replaced in future versions. A much better way to infer colexifications is to use a Python dictionary as the data structure, with the word forms serving as keys. This yields a solution that is roughly linear in the number of words (in contrast to quadratic, when comparing all words against all words).

That means we can retrieve the same information with a much faster function that iterates only once over the word forms of each language and adds them to a Python dictionary. If two concepts share the same word form, both end up in the set stored under that form, and we can later retrieve all these cases of colexification:

```python
from collections import defaultdict
from itertools import combinations
import networkx as nx

G = nx.Graph()
for t in wl1.cols:
    # group the concepts of language t by their word form
    colexifications = defaultdict(set)
    for idx in wl1.get_list(col=t, flat=True):
        colexifications[wl1[idx, 'ipa']].add(wl1[idx, 'concept'])
    # every form shared by more than one concept yields colexified pairs
    for key, vals in colexifications.items():
        if len(vals) > 1:
            for nA, nB in combinations(list(vals), r=2):
                if G.has_edge(nA, nB):
                    G[nA][nB]['weight'] += 1
                else:
                    G.add_edge(nA, nB, weight=1)
# print all links recurring in more than three varieties, most frequent first
for nA, nB, data in sorted(G.edges(data=True), key=lambda x: x[2]['weight'], reverse=True):
    if data['weight'] > 3:
        print('{0:10}\t{1:10}\t{2}'.format(nA, nB, data['weight']))
```

```text
year      	blood     	9
liver     	dry       	8
new       	heart     	7
sun       	warm      	5
stand     	tree      	4
sleep     	lie       	4
salt      	wind      	4
not       	one       	4
```


The CLICS database uses this improved colexification code. Since colexifications are not LingPy's primary concern, it is not yet clear whether we will remove the current functions from the library in future versions or instead make explicit use of the `pyclics` API.