# Demos for analyzing Colexification Data

This Python notebook was developed by Jai Aggarwal for the course COG260:
Data, Computation, and The Mind in Fall 2022 (instructor: Yang Xu).

Data source: https://osf.io/hjvm5/

Paper: https://psyarxiv.com/efs4p/

__Original Paper__:

Brochhagen, T., & Boleda, G. (2022). When do languages use the same word for
different meanings? The Goldilocks Principle in colexification. Cognition,
226, 105179.

__Additional Readings__:

1. On cross-linguistic universals in colexification:

   a. Xu, Y., Duong, K., Malt, B.C., Jiang, S., and Srinivasan, M. (2020)
   Conceptual relations predict colexification across languages. Cognition,
   201, 104280.

   b. Youn, H., Sutton, L., Smith, E., Moore, C., Wilkins, J. F., Maddieson,
   I., ... & Bhattacharya, T. (2016). On the universal structure of human 
   lexical semantics. Proceedings of the National Academy of Sciences,
   113(7), 1766-1771.


2. On cross-linguistic variation in colexification:

   a. Regier, T., Carstensen, A., & Kemp, C. (2016). Languages support 
   efficient communication about the environment: Words for snow revisited. 
   PloS one, 11(4), e0151138.

Import relevant Python libraries.

In [1]:
import numpy as np
import pandas as pd
# tqdm displays a progress bar for long-running for loops;
# it is used in the main loop at the end of Demo 2
from tqdm import tqdm
from itertools import product

pd.options.mode.chained_assignment = None

In [2]:
df = pd.read_csv("df_all_raw.csv")
df.columns = list(map(str.lower, df.columns))
df = df.drop(columns=['dataset_id', 'form_id', 'form', 'gloss_in_source', 'iso639p3code', 'mrc_word', 'kucera_francis_frequency'])



# Data Fields

**clics_form**: the form of the word in the language of interest

**concepticon_id**: unique numerical identifier of underlying concept

**concepticon_gloss**: the concept underlying the word form

**ontological_category**: broad category that the concept falls into

**semantic_field**: a set of words related in meaning

**variety**: the language

**glottocode**: unique alphanumeric identifier for the language variety

**macroarea**: the part of the world the language is common in

**family**: the language family of the current variety

**latitude**: rough latitude where the language variety can be found

**longitude**: rough longitude where the language variety can be found

**age_of_acquisition**: the age at which a concept is typically learned

**concreteness**: a numerical rating of how abstract or concrete a concept is, on a scale from 100 to 700

**familiarity**: a numerical rating of how familiar a concept is to the average person, on a scale from 100 to 700

**imagability**: a numerical rating of how well an average person can mentally visualize a concept, on a scale from 100 to 700

# Demo 1: Language Family Statistics

First, let's explore how many unique languages we have in our dataset.

In [3]:
num_languages = df['variety'].nunique()
print(f"Number of unique languages: {num_languages}")

Number of unique languages: 3050


Now, let's explore how many language families we have in our dataset.

In [4]:
num_language_families = df['family'].nunique()
print(f"Number of unique language families: {num_language_families}")

Number of unique language families: 201


Now, let's explore how many languages each family has. 

In [5]:
num_languages_per_family = df.groupby("family")['variety'].count()
num_languages_per_family

family
Abkhaz-Adyge     2699
Abun              158
Afro-Asiatic    39787
Aikanã            280
Ainu              911
                ...  
Yukaghir         1733
Yuracaré            7
Yámana            715
Zamucoan         1405
Zuni              981
Name: variety, Length: 201, dtype: int64

In the output above, the Afro-Asiatic language family appears to have 39787 "languages". Since the dataset contains only 3050 languages in total, this cannot be right. The ``count`` aggregator tallies every row in each family (one row per word form), whereas the ``nunique`` aggregator tallies only the distinct values. Here, we want the latter.
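The difference between the two aggregators is easiest to see on a small example. The sketch below uses a made-up toy DataFrame (the family and variety names are hypothetical, not taken from the dataset) in which one language contributes two rows.

```python
import pandas as pd

# Toy data (hypothetical, not from the dataset): one row per word form.
toy = pd.DataFrame({
    "family": ["F1", "F1", "F1", "F2"],
    "variety": ["lang_a", "lang_a", "lang_b", "lang_c"],
})

# count() tallies the non-null rows in each family.
row_counts = toy.groupby("family")["variety"].count()
# nunique() tallies the distinct language varieties in each family.
language_counts = toy.groupby("family")["variety"].nunique()

print(row_counts.to_dict())       # {'F1': 3, 'F2': 1}
print(language_counts.to_dict())  # {'F1': 2, 'F2': 1}
```

Family ``F1`` has three rows but only two distinct varieties, which mirrors the inflated Afro-Asiatic figure.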

In [6]:
num_languages_per_family = df.groupby("family")['variety'].nunique()
num_languages_per_family

family
Abkhaz-Adyge     5
Abun             4
Afro-Asiatic    72
Aikanã           1
Ainu             1
                ..
Yukaghir         2
Yuracaré         1
Yámana           1
Zamucoan         1
Zuni             1
Name: variety, Length: 201, dtype: int64

The number of languages per family is now far more sensible. Finally, suppose we want the language family that contains the largest number of language varieties. There are two ways we can do this.

1. Find the greatest value in the Series, and then select the row that holds that value.
2. Sort the Series and take the last row.

In [7]:
# Method 1
max_val = num_languages_per_family.max()
num_languages_per_family[num_languages_per_family == max_val]

family
Nuclear Trans New Guinea    560
Name: variety, dtype: int64

In [8]:
# Method 2
sorted_data = num_languages_per_family.sort_values()
sorted_data.tail(1)

family
Nuclear Trans New Guinea    560
Name: variety, dtype: int64
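A third option, not used in the demo itself, is ``Series.idxmax``, which returns the index label of the largest value directly. The sketch below runs on a small hypothetical Series rather than the real per-family counts.

```python
import pandas as pd

# Hypothetical per-family language counts (not the real values).
toy_counts = pd.Series({"F1": 5, "F2": 72, "F3": 1}, name="variety")

largest_family = toy_counts.idxmax()  # label of the largest value
largest_size = toy_counts.max()       # the largest value itself

print(largest_family, largest_size)  # F2 72
```

One difference worth noting: if several families tie for the maximum, ``idxmax`` returns only the first label, whereas the boolean mask in Method 1 returns every tied row.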

# Demo 2: Extracting Colexified Concept Counts

In this demo, we show step by step how to compute the number of languages in which a particular pair of concepts is referred to by the same word (i.e., the number of languages in which the two concepts are **colexified**). Each step is spelled out explicitly for ease of processing. Feel free to try optimizing this code using ``numpy`` or ``pandas`` operations to make it run faster.

We first go through the body of the main loop line by line to help you understand the operations.

**Step 1**: Let's arbitrarily work with the language *Miji Nafra*. First, we can find all rows in our dataset associated with the language variety.

In [9]:
variety = 'Miji Nafra'
curr_language = df[df['variety'] == variety]

**Step 2**: We can group by the word (``clics_form``) to see if certain words are attached to multiple concepts. For each word, calculate how many concepts it is attached to.

In the example below, we see that there are three word forms that each have three concepts attached to them. For instance, the form **lu** can be used to refer to the concept of the *month*, the *moon*, and *salt*. The colexification between the *month* and the *moon* is one we will later see to be quite common.

In [10]:
agg = curr_language.groupby("clics_form")[['concepticon_gloss', 'concepticon_id']].agg(list)
agg['num_concepts'] = agg['concepticon_gloss'].apply(lambda x: len(set(x)))
agg.sort_values(by='num_concepts')

clics_form,concepticon_gloss,concepticon_id,num_concepts
abo,[FATHER],[1217],1
ne,[YOUNGER SISTER],[1761],1
neipung,[ROOF],[769],1
nesi,[BROOM],[245],1
nibiung,[NOSE],[1221],1
...,...,...,...
dzuo,"[PIG, EAR]","[1337, 1247]",2
m@ni,"[NEAR, BAD]","[1942, 1292]",2
lu,"[MONTH, MOON, SALT]","[1370, 1313, 1274]",3
m@r@n,"[RIPE, ROTTEN, FAR]","[178, 1728, 1406]",3


**Step 3**: To investigate colexification, we are only interested in the words that refer to multiple concepts in our current language of interest.  For ease of processing later, we can alphabetize the concepts as well.

In [11]:
colex = agg[agg['num_concepts']>1]
colex['concepticon_gloss'] = colex['concepticon_gloss'].apply(lambda x: sorted(list(set(x))))
colex

clics_form,concepticon_gloss,concepticon_id,num_concepts
ani,"[MOTHER, WE]","[1216, 1212]",2
dai,"[GO, WALK]","[695, 1443]",2
dzim@sung,"[YOU, YOU (HONORIFIC)]","[1213, 3293]",2
dzo,"[SUN, TIE]","[1343, 1917]",2
dzu,"[SING, SIT]","[1261, 1416]",2
dzuo,"[EAR, PIG]","[1337, 1247]",2
lembang,"[PATH, ROAD]","[2252, 667]",2
lu,"[MONTH, MOON, SALT]","[1370, 1313, 1274]",3
m@dzo,"[FRIEND, WASH]","[1453, 1325]",2
m@ni,"[BAD, NEAR]","[1942, 1292]",2
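Steps 2 and 3 assign a new column into a filtered slice of ``agg``, which is the pattern that normally triggers pandas' chained-assignment warning (silenced at import time in this notebook). The sketch below repeats both steps on a hypothetical toy language and calls ``.copy()`` on the filtered frame instead, so that the assignment targets an independent DataFrame. The toy forms and glosses are illustrative only.

```python
import pandas as pd

# Toy stand-in for one language's rows (hypothetical forms and glosses).
toy_language = pd.DataFrame({
    "clics_form": ["lu", "lu", "lu", "abo", "dai", "dai"],
    "concepticon_gloss": ["MONTH", "MOON", "SALT", "FATHER", "WALK", "GO"],
})

# Step 2: list the concepts attached to each form and count the distinct ones.
toy_agg = toy_language.groupby("clics_form")[["concepticon_gloss"]].agg(list)
toy_agg["num_concepts"] = toy_agg["concepticon_gloss"].apply(lambda x: len(set(x)))

# Step 3: keep forms with more than one concept. The .copy() makes the
# filtered frame independent of toy_agg before we assign into it.
toy_colex = toy_agg[toy_agg["num_concepts"] > 1].copy()
toy_colex["concepticon_gloss"] = toy_colex["concepticon_gloss"].apply(
    lambda x: sorted(set(x))
)

print(toy_colex)
```

Only ``dai`` and ``lu`` survive the filter, each with its concepts in alphabetical order.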


**Step 4**: For our current language variety, we now need to record all pairs of concepts that are colexified. Referring back to the case of *month*, *moon*, and *salt*, we want to record that the pairs (month, moon), (moon, salt), and (month, salt) are all attested colexified pairs. We can do so by calling the function ``per_lang_colexification``. Details of the function can be found in its comments below.

In [12]:
def per_lang_colexification(curr_df):
    """
    Calculate the colexification frequency of pairs of concepts present in the current language.
    """
    all_combos_dict = {}
    # We iterate through each row, which has the concepts associated with a specific word
    for i, row in curr_df.iterrows():
        # Get the current set of concepts
        a = row['concepticon_gloss']
        # Create all possible unique combinations of concepts, where each pair is alphabetically sorted
        combos = list(set(map(lambda x: tuple(sorted(x)), product(a, a))))
        # Ensure the concepts in the pair are not identical
        combos = [combo for combo in combos if combo[0] != combo[1]]
        # Add counts for a pair of combinations being colexified
        for combo in combos:
            if combo in all_combos_dict:
                all_combos_dict[combo] += 1
            else:
                all_combos_dict[combo] = 1

    # Create a DataFrame out of our dictionary and return the colexification counts for two concepts
    tmp = pd.DataFrame.from_dict(all_combos_dict, "index").reset_index()
    per_lang = pd.DataFrame(tmp['index'].tolist(), columns=['concept_1', "concept_2"])
    per_lang['colexification_count'] = tmp[0]
    return per_lang
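The pair-generation step in the function builds every ordered pair with ``product``, sorts each pair, removes duplicates, and then drops pairs of identical concepts. As an alternative sketch (not the notebook's own implementation), ``itertools.combinations`` yields the same set of pairs in one call, provided the input list is already alphabetized and free of duplicates, as it is after Step 3.

```python
from itertools import combinations

# Concepts attached to one word form, alphabetized and distinct as in Step 3.
concepts = ["MONTH", "MOON", "SALT"]

# Every unordered pair of distinct concepts, each pair in alphabetical order.
pairs = list(combinations(concepts, 2))

print(pairs)
# [('MONTH', 'MOON'), ('MONTH', 'SALT'), ('MOON', 'SALT')]
```

A form with *n* concepts contributes n(n-1)/2 pairs, so the three concepts here give three pairs.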

In [13]:
recorded_pairs = per_lang_colexification(colex)
recorded_pairs

Unnamed: 0,concept_1,concept_2,colexification_count
0,MOTHER,WE,1
1,GO,WALK,1
2,YOU,YOU (HONORIFIC),1
3,SUN,TIE,1
4,SING,SIT,1
5,EAR,PIG,1
6,PATH,ROAD,1
7,MONTH,SALT,1
8,MOON,SALT,1
9,MONTH,MOON,1


**Step 5**: Repeat this analysis for each other language variety we have available, and add to the records of attested colexified concept pairs. Below, we show what this would look like for a second language, *Bugun Bichom*. Our output shows that in both languages, the concepts of *moon* and *month* are colexified.

In [14]:
variety = 'Bugun Bichom'
curr_language = df[df['variety'] == variety]
agg = curr_language.groupby("clics_form")[['concepticon_gloss', 'concepticon_id']].agg(list)
agg['num_concepts'] = agg['concepticon_gloss'].apply(lambda x: len(set(x)))
agg.sort_values(by='num_concepts')
colex = agg[agg['num_concepts']>1]
colex['concepticon_gloss'] = colex['concepticon_gloss'].apply(lambda x: sorted(list(set(x))))
recorded_pairs_2 = per_lang_colexification(colex)
pd.concat((recorded_pairs, recorded_pairs_2)).groupby(["concept_1", "concept_2"]).sum().reset_index().sort_values(by='colexification_count')

Unnamed: 0,concept_1,concept_2,colexification_count
0,BACK,SPIDER,1
27,SWEAT (SUBSTANCE),WHO,1
26,SUN,TIE,1
25,SING,SIT,1
24,RIPE,ROTTEN,1
23,PATH,ROAD,1
22,NARROW,SMALL,1
21,MOTHER,WE,1
20,MOON,SALT,1
19,MONTH,SALT,1
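The combination step can be illustrated on two small hand-made tables. The sketch below assumes two hypothetical per-language results in the same three-column layout returned by ``per_lang_colexification``, and sorts in descending order so that pairs attested in both languages appear first.

```python
import pandas as pd

# Hypothetical per-language results (same columns as per_lang_colexification).
lang_1 = pd.DataFrame({
    "concept_1": ["MONTH", "GO"],
    "concept_2": ["MOON", "WALK"],
    "colexification_count": [1, 1],
})
lang_2 = pd.DataFrame({
    "concept_1": ["MONTH", "NARROW"],
    "concept_2": ["MOON", "SMALL"],
    "colexification_count": [1, 1],
})

# Stack the tables, then sum the counts of identical concept pairs.
combined = (
    pd.concat([lang_1, lang_2])
    .groupby(["concept_1", "concept_2"], as_index=False)["colexification_count"]
    .sum()
    .sort_values(by="colexification_count", ascending=False)
)

print(combined)
```

The pair (MONTH, MOON) occurs in both toy tables, so it ends up with a count of 2 while the other pairs keep a count of 1.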


# Main Loop
The code below combines everything we have covered so far into one main loop. It should take roughly 5 to 10 minutes to run, but this may vary depending on your computer.

In [15]:
def main():
    all_dfs = []
    for variety in tqdm(df['variety'].unique()):
        sub = df[df['variety'] == variety]
        agg = sub.groupby("clics_form")[['concepticon_gloss', 'concepticon_id']].agg(list)
        agg['num_concepts'] = agg['concepticon_gloss'].apply(lambda x: len(set(x)))
        colex = agg[agg['num_concepts']>1]
        colex['concepticon_gloss'] = colex['concepticon_gloss'].apply(lambda x: sorted(list(set(x))))
        # We skip any language where no concepts are colexified
        if colex.shape[0] == 0:
            continue
        curr_df = per_lang_colexification(colex)
        all_dfs.append(curr_df)
    mega = pd.concat(all_dfs)
    colex_counts = mega.groupby(["concept_1", "concept_2"]).sum().reset_index()
    return colex_counts

In [16]:
colex_counts = main()

100%|██████████| 3050/3050 [04:42<00:00, 10.79it/s]


We see that concepts like (tree, wood), (foot, leg), and (month, moon) commonly colexify with each other.

In [17]:
colex_counts.sort_values(by='colexification_count')

Unnamed: 0,concept_1,concept_2,colexification_count
0,A LITTLE,FEW,1
46054,GROW,LAKE,1
46055,GROW,LAUGH,1
46056,GROW,LEAD (GUIDE),1
46057,GROW,LIE DOWN,1
...,...,...,...
74236,WIFE,WOMAN,302
58509,MONTH,MOON,328
44184,GO,WALK,336
41201,FOOT,LEG,354
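As a starting point for the optimization suggested at the beginning of Demo 2, the sketch below groups the data once by language and word form instead of filtering the full DataFrame for every language. The function name ``count_colexified_pairs`` and the toy data are hypothetical, and the approach has not been timed against the main loop above. Like the main loop, it adds one count per word form that covers both concepts.

```python
from collections import Counter
from itertools import combinations

import pandas as pd


def count_colexified_pairs(data):
    """Count concept pairs that share a word form, summed over all languages."""
    pair_counts = Counter()
    # One group per (language, word form); each group holds that form's concepts.
    grouped = data.groupby(["variety", "clics_form"])["concepticon_gloss"]
    for _, glosses in grouped:
        concepts = sorted(set(glosses))
        # combinations() of a sorted, distinct list yields each pair once.
        pair_counts.update(combinations(concepts, 2))
    rows = [(c1, c2, n) for (c1, c2), n in pair_counts.items()]
    return pd.DataFrame(
        rows, columns=["concept_1", "concept_2", "colexification_count"]
    )


# Toy data (hypothetical): two languages that both colexify MONTH and MOON.
toy = pd.DataFrame({
    "variety": ["A", "A", "A", "B", "B", "B"],
    "clics_form": ["lu", "lu", "lu", "x", "x", "y"],
    "concepticon_gloss": ["MONTH", "MOON", "SALT", "MOON", "MONTH", "SALT"],
})
toy_counts = count_colexified_pairs(toy)
print(toy_counts.sort_values(by="colexification_count"))
```

On the real data, the call would be ``count_colexified_pairs(df)``, and the result could be compared with ``colex_counts`` to check that both approaches agree.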
