# [COGSCI 1B] Typology
---
### Professor Terry Regier

This module explores a central question in cognitive science and linguistics: how do languages vary from one another? We will explore datasets of linguistic features (WALS and Phoible) to come to tentative answers to this question in a data-driven way. Example problems include visualizing the distribution of phonemes, the relationship between geography and the development of languages, and the genetic relationships of languages.

---

### Table of Contents

0 - [The Data](#section data)

1 - [Phoneme Distributions](#phoneme dist)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [Consonants](#consonants)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Vowels](#vowels)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 - [Phonemes](#phonemes)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.4 - [Consonants vs Vowels](#cons vs vows)<br>

2 - [Phoneme Metadata](#metadata)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Family](#phoneme fam)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Continent](#phoneme cont)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 - [Latitude and Longitude](#lat lons)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.4 - [Population Size vs Phoneme Inventory Size](#pop v foam)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.5 - [Distance from Africa](#africa distance)

3 - [Common Phonemes](#common)

4 - [Tone](#tone)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.1 - [Altitude](#altitude)

5 - [Morphological Complexity](#morph complex)

**Dependencies:**

In [1]:
!pip install -U -q folium

import folium
import numpy as np
import pandas as pd
import seaborn as sns
import geopy.distance
from collections import Counter
import matplotlib.pyplot as plt
from IPython.display import display
from scripts.cogsci_module import *
import warnings

warnings.filterwarnings('ignore')
sns.set_style('darkgrid')
%matplotlib inline

## The Data <a id='section data'></a>

We will start by familiarizing ourselves with the data, and in order to do that, we need to load them into our notebook. 

First, we'll load our data from Phoible. In the code cell below, we create a variable called `file_name` and assign it the name of our file in quotation marks, which tells Python that the value is text, or a *string*. The file name here has no directory prefix, so Python looks for `phoible_elevation.csv` in the same directory (folder) as this notebook; if the file lived in a subdirectory called `data`, we would write `data/phoible_elevation.csv` instead. We then read this file into what is called a **DataFrame**, which can be thought of as a slightly more rigid Excel sheet. It allows us to easily access, manipulate, and visualize our data.

A code cell will display what is written in the last line of the cell (if it is not a variable assignment statement). So in the cell below, the last line says `phoible_data.head()`, which means that it will display our dataframe, but adding `.head()` at the end of it allows us to show only the first 5 rows.

In [2]:
file_name = 'phoible_elevation.csv'
phoible_data = pd.read_csv(file_name)
phoible_data.head()

Unnamed: 0,InventoryID,Source,LanguageCode,LanguageName,Glottocode,GlottologName,Trump,LanguageFamilyRoot,LanguageFamilyGenus,Country,Area,Population,Latitude,Longitude,Phonemes,Consonants,Tones,Vowels,elevation
0,1,SPA,kor,Korean,kore1280,Korean,1,asis,Korean,"Korea, South",Asia,42000000,37.5,128.0,40,22,0,18,330.066681
1,2,SPA,ket,Ket,kett1243,Ket,1,yeos,Yeniseian,Russian Federation,Europe,190,63.7551,87.5466,32,18,0,14,25.690624
2,3,SPA,lbe,Lak,lakk1252,Lak,1,ncau,Lak-Dargwa,Russian Federation,Europe,157000,42.1328,47.0809,69,60,0,9,2169.640625
3,4,SPA,kbd,Kabardian,kaba1278,Kabardian,1,ncau,Northwest Caucasian,Russian Federation,Europe,520000,43.5082,43.3918,56,49,0,7,943.938354
4,5,SPA,kat,Georgian,nucl1302,Nuclear Georgian,1,kart,Kartvelian,Georgia,Asia,3900000,39.3705,45.8066,35,29,0,6,2204.326416


In our dataframe, the column `Population` was stored as *strings*, not numbers, because some values in the column are words. The possible text entries for those rows are shown below.

In [3]:
sorted(list(set(phoible_data.Population)))[-5:]

['Ancient',
 'Extinct',
 'Missing E16 page',
 'No_estimate_available',
 'No_known_speakers']

In order to use the numerical values of `Population` for further analysis, we convert the column to numbers and drop the rows where the value is a word, storing the result in a new dataframe called `phoib`. Because `pd.to_numeric` with `errors='coerce'` replaces unparseable entries with `NaN`, the column becomes floating-point (`floats`) rather than `ints`, which is why the populations below print with a trailing `.0`. Cases like this show why it is important to be aware of how your data is represented and stored.

In [4]:
# phoib contains rows where population is a number
phoib = phoible_data.copy()
phoib["Population"] = pd.to_numeric(phoib['Population'], errors='coerce')
phoib = phoib.dropna(subset=['Population'])

# phoib_mapping keeps only rows with no missing values in any column / use for mapping
phoib_mapping = phoib.dropna()

length_difference = len(phoible_data) - len(phoib)
print("When we remove those rows with text, we lose {} rows.".format(length_difference))

phoib.head()

When we remove those rows with text, we lose 67 rows.


Unnamed: 0,InventoryID,Source,LanguageCode,LanguageName,Glottocode,GlottologName,Trump,LanguageFamilyRoot,LanguageFamilyGenus,Country,Area,Population,Latitude,Longitude,Phonemes,Consonants,Tones,Vowels,elevation
0,1,SPA,kor,Korean,kore1280,Korean,1,asis,Korean,"Korea, South",Asia,42000000.0,37.5,128.0,40,22,0,18,330.066681
1,2,SPA,ket,Ket,kett1243,Ket,1,yeos,Yeniseian,Russian Federation,Europe,190.0,63.7551,87.5466,32,18,0,14,25.690624
2,3,SPA,lbe,Lak,lakk1252,Lak,1,ncau,Lak-Dargwa,Russian Federation,Europe,157000.0,42.1328,47.0809,69,60,0,9,2169.640625
3,4,SPA,kbd,Kabardian,kaba1278,Kabardian,1,ncau,Northwest Caucasian,Russian Federation,Europe,520000.0,43.5082,43.3918,56,49,0,7,943.938354
4,5,SPA,kat,Georgian,nucl1302,Nuclear Georgian,1,kart,Kartvelian,Georgia,Asia,3900000.0,39.3705,45.8066,35,29,0,6,2204.326416
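
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a made-up column (the values are illustrative, not a claim about any particular PHOIBLE row). Entries that cannot be parsed become `NaN`, and because `NaN` is a float, the resulting column is float rather than int.

```python
import pandas as pd

# Toy column mixing numeric strings and text labels (illustrative values only)
toy = pd.DataFrame({'Population': ['42000000', '190', 'Extinct', 'No_estimate_available']})

# errors='coerce' turns anything that cannot be parsed into NaN
toy['Population'] = pd.to_numeric(toy['Population'], errors='coerce')
print(toy['Population'].dtype)  # a float dtype, because NaN is a float

# Drop the rows that could not be parsed
clean = toy.dropna(subset=['Population'])
print(len(toy) - len(clean), 'rows dropped')
```

If integer populations are needed later, `clean['Population'].astype(int)` would be one option once the `NaN` rows are gone.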


The next thing that we'll notice is that there are multiple rows for some of the languages.

In [15]:
phoible_data['LanguageCode'].value_counts().head(10)

nyf    6
sgw    6
gwn    6
xtc    5
car    5
fub    4
bva    4
hau    4
khr    4
aka    4
Name: LanguageCode, dtype: int64

In [8]:
len(list(set(phoible_data['LanguageCode'])))

1672

In [9]:
# same language code, but different language name? What is `Trump`?
cond = np.logical_and(phoible_data['Source'] == 'GM', phoible_data['LanguageCode'] == 'sgw')
phoible_data[cond]

Unnamed: 0,InventoryID,Source,LanguageCode,LanguageName,Glottocode,GlottologName,Trump,LanguageFamilyRoot,LanguageFamilyGenus,Country,Area,Population,Latitude,Longitude,Phonemes,Consonants,Tones,Vowels,elevation
1458,1459,GM,sgw,Muher,seba1251,Sebat Bet Gurage,1,afas,Semitic,Ethiopia,Africa,2320000,8.11879,37.9891,42,34,0,8,2262.824463
1459,1460,GM,sgw,Ezha,seba1251,Sebat Bet Gurage,2,afas,Semitic,Ethiopia,Africa,2320000,8.11879,37.9891,39,33,0,6,2262.824463
1460,1461,GM,sgw,Chaha,seba1251,Sebat Bet Gurage,3,afas,Semitic,Ethiopia,Africa,2320000,8.11879,37.9891,44,36,0,8,2262.824463
1461,1462,GM,sgw,Gumer,seba1251,Sebat Bet Gurage,4,afas,Semitic,Ethiopia,Africa,2320000,8.11879,37.9891,42,35,0,7,2262.824463
1462,1463,GM,sgw,Gura,seba1251,Sebat Bet Gurage,5,afas,Semitic,Ethiopia,Africa,2320000,8.11879,37.9891,42,36,0,6,2262.824463
1463,1464,GM,sgw,Gyeto,seba1251,Sebat Bet Gurage,6,afas,Semitic,Ethiopia,Africa,2320000,8.11879,37.9891,45,39,0,6,2262.824463
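
Whether to keep every inventory or only one per language code depends on the question being asked; this notebook keeps all rows. As a hedged sketch of two alternatives, the toy frame below (made-up numbers) either keeps the first inventory listed for each code or averages the counts per code.

```python
import pandas as pd

# Toy data: two inventories for code 'aaa', one for 'bbb' (made-up values)
toy = pd.DataFrame({
    'LanguageCode': ['aaa', 'aaa', 'bbb'],
    'Phonemes': [40, 44, 30],
})

# Option 1: keep only the first inventory listed for each language code
first_only = toy.drop_duplicates(subset='LanguageCode', keep='first')

# Option 2: average the phoneme counts across inventories of the same code
averaged = toy.groupby('LanguageCode', as_index=False)['Phonemes'].mean()

print(first_only)
print(averaged)
```

Either choice stops languages with many documented varieties from being counted several times, at the cost of discarding or blurring dialect-level differences.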


In the next cell, we import our WALS data.

In [5]:
wals = pd.read_csv('wals_data/language.csv')
wals.head()

Unnamed: 0,wals_code,iso_code,glottocode,Name,latitude,longitude,genus,family,macroarea,countrycodes,...,137B M in Second Person Singular,136B M in First Person Singular,109B Other Roles of Applied Objects,10B Nasal Vowels in West Africa,25B Zero Marking of A and P Arguments,21B Exponence of Tense-Aspect-Mood Inflection,108B Productivity of the Antipassive Construction,130B Cultural Categories of Languages with Identity of 'Finger' and 'Hand',58B Number of Possessive Nouns,79B Suppletion in Imperatives and Hortatives
0,aab,,,Arapesh (Abu),-3.45,142.95,Kombio-Arapesh,Torricelli,,PG,...,,,,,,,,,,
1,aar,aiw,aari1239,Aari,6.0,36.583333,South Omotic,Afro-Asiatic,Africa,ET,...,,,,,,,,,,
2,aba,aau,abau1245,Abau,-4.0,141.25,Upper Sepik,Sepik,Papunesia,PG,...,,,,,,,,,,
3,abb,shu,chad1249,Arabic (Abbéché Chad),13.833333,20.833333,Semitic,Afro-Asiatic,Africa,TD,...,,,,,,,,,,
4,abd,abi,abid1235,Abidji,5.666667,-4.583333,Kwa,Niger-Congo,Africa,CI,...,,,,,,,,,,


As noted above, some languages appear in more than one row, so some points on the map will coincide. In the next cell, we plot the listed location of every row in `phoib` that has coordinates.

In [None]:
mp = folium.Map(zoom_start=12)
phoib_coords = phoib.dropna(subset=['Latitude', 'Longitude'])
for coords in list(zip(phoib_coords['Latitude'], phoib_coords['Longitude'])):
    folium.Circle(
        radius=100,
        location=coords,
        color='crimson',
        fill=False,).add_to(mp)
mp

In [None]:
combined = wals.dropna(subset=['iso_code']).merge(phoib.dropna(), left_on='iso_code', right_on='LanguageCode', how='inner')
combined.head()
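
An inner merge silently drops every language that appears in only one of the two tables. As a sketch of how one might check how much is lost, the toy example below (made-up rows standing in for WALS and PHOIBLE) uses an outer merge with `indicator=True`, which labels each row by the table it came from.

```python
import pandas as pd

# Toy stand-ins for the WALS and PHOIBLE tables (made-up rows)
wals_toy = pd.DataFrame({'iso_code': ['aaa', 'bbb', None], 'family': ['F1', 'F2', 'F3']})
phoib_toy = pd.DataFrame({'LanguageCode': ['aaa', 'ccc'], 'Phonemes': [40, 25]})

# An outer merge with indicator=True adds a '_merge' column:
# 'both', 'left_only', or 'right_only'
check = wals_toy.dropna(subset=['iso_code']).merge(
    phoib_toy, left_on='iso_code', right_on='LanguageCode',
    how='outer', indicator=True)

# Rows marked 'both' are the ones an inner merge would keep
print(check['_merge'].value_counts())
```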

## Background

One big question in cognitive science is the relationship between linguistic features (e.g. number of vowels, word order and number of tense categories) and non-linguistic features (e.g. population size, altitude and climate). In particular, a lot of attention has been paid to the relationship between population size and various linguistic features. People have looked at the relationship between population size and:
- size of the phoneme inventory
- morphological complexity

### Relationship between population size and phonemic inventory

Phonemes are the distinct sound units of a language, and they are written using the International Phonetic Alphabet (IPA). In this notebook we mostly look at consonants and vowels; PHOIBLE also records tones in a separate column. Each language has a fixed inventory of phonemes. The best data source for this is PHOIBLE. 
Visualize the following univariate distributions both as a histogram and on a map:

## Phoneme Distributions <a id='phoneme dist'></a>

For each count, we look at both the numeric distribution (a histogram) and the geographic distribution (a map).

### Consonants <a id='consonants'></a>

In [None]:
sns.distplot(phoib['Consonants'])

In [None]:
map_with_bins('Consonants', phoib_mapping)

In [None]:
# plotting with bins set on quantiles instead
map_with_bins('Consonants', phoib_mapping, quantiles=True)
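
`map_with_bins` comes from the course's `scripts.cogsci_module`, so its internals are not shown here. As an assumption about what the `quantiles` flag means, the sketch below contrasts equal-width bins (`pd.cut`) with quantile bins (`pd.qcut`) on made-up, right-skewed counts.

```python
import pandas as pd

# Made-up, right-skewed consonant counts
counts = pd.Series([10, 12, 14, 15, 18, 20, 22, 25, 40, 90])

# Equal-width bins: each bin spans the same range of values
equal_width = pd.cut(counts, bins=4)

# Quantile bins: each bin holds roughly the same number of languages
quantile = pd.qcut(counts, q=4)

print(equal_width.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```

With skewed data, equal-width bins put most languages into the lowest bin, while quantile bins spread them more evenly, which tends to make a color-coded map easier to read.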

### Vowels <a id='vowels'></a>

In [None]:
sns.distplot(phoib['Vowels'])

In [None]:
map_with_bins('Vowels', phoib_mapping, quantiles=True)

### Phonemes <a id='phonemes'></a>

In [None]:
sns.distplot(phoib['Phonemes'])

In [None]:
map_with_bins('Phonemes', phoib_mapping, quantiles=True)

### Consonants vs Vowels <a id='cons vs vows'></a>

Visualize the relationship between the number of consonants and the number of vowels, again across all languages, by continent, and by genetic affiliation.

In [None]:
# note to self: check out consonant to vowel ratio as number of cons / vowels increase

In [None]:
overlay_hex(phoib["Consonants"], phoib["Vowels"])

In [None]:
pho_cont = phoib[["Area","Consonants", "Vowels"]].copy()
pho_cont = pho_cont.groupby(by="Area").mean()
pho_cont['Ratio'] = pho_cont['Consonants'] / pho_cont['Vowels']
pho_cont[['Ratio']].plot.bar(figsize = (12,8))
plt.title('Average Consonants per Vowel')
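
Note that the cell above divides the average consonant count by the average vowel count for each area. That is not the same as averaging each language's own consonant-to-vowel ratio. A small sketch with made-up numbers shows the difference; which one is appropriate depends on the question being asked.

```python
import pandas as pd

# Made-up inventories for two areas
toy = pd.DataFrame({
    'Area': ['X', 'X', 'Y', 'Y'],
    'Consonants': [20, 60, 30, 30],
    'Vowels': [10, 5, 10, 6],
})

# Ratio of averages (what the cell above computes)
means = toy.groupby('Area')[['Consonants', 'Vowels']].mean()
ratio_of_means = means['Consonants'] / means['Vowels']

# Average of per-language ratios
toy['Ratio'] = toy['Consonants'] / toy['Vowels']
mean_of_ratios = toy.groupby('Area')['Ratio'].mean()

print(pd.DataFrame({'ratio_of_means': ratio_of_means,
                    'mean_of_ratios': mean_of_ratios}))
```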

## Phoneme Metadata <a id='metadata'></a>

### By Family  <a id='phoneme fam'></a>

Visualize those same distributions grouped by continent and grouped by genetic affiliation. The continent is called `Area` in PHOIBLE. The genetic affiliation is called `family` in WALS. WALS and PHOIBLE both identify languages with ISO 639-3 codes (`iso_code` in WALS, `LanguageCode` in PHOIBLE), so you should be able to match them up. 

In [None]:
# double click on the image to zoom in (then you can scroll left or right)
combined.groupby(by="family")[['Phonemes', 'Consonants', 'Vowels']].mean().plot.bar(figsize=(50,8))

### By Continent  <a id='phoneme cont'></a>

In [None]:
phoib.groupby(by="Area")[['Phonemes', 'Consonants', 'Vowels']].mean().plot.bar(figsize=(12,6))

### Latitude and Longitude <a id='lat lons'></a>

You could try looking at number of phonemes plotted against latitude/longitude, but I doubt anything will come of it.

In [None]:
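# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sns.jointplot(x='Phonemes', y='Latitude', data=phoib, kind='hex')

In [None]:
sns.jointplot(x='Phonemes', y='Longitude', data=phoib, kind='hex')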

### Population Size vs Phoneme Inventory Size <a id='pop v foam'></a>

Visualize the relationship between population size and phoneme inventory size, again across all languages, by continent, and by genetic affiliation.

In [None]:
# log population
overlay_hex(phoib["Phonemes"], np.log(phoib["Population"]))
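
Population sizes span several orders of magnitude, which is why the cell above plots their logarithm. Two things are worth keeping in mind: `np.log(0)` is `-inf`, so any zero populations should be filtered out first, and a correlation coefficient gives a one-number summary to go with the plot. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up populations and phoneme counts
toy = pd.DataFrame({
    'Population': [0, 200, 5_000, 150_000, 4_000_000, 40_000_000],
    'Phonemes': [25, 22, 30, 28, 35, 40],
})

# Keep only positive populations so the log is finite
positive = toy[toy['Population'] > 0]
log_pop = np.log(positive['Population'])

# Pearson correlation between log population and phoneme count
r = np.corrcoef(log_pop, positive['Phonemes'])[0, 1]
print('r = {:.2f}'.format(r))
```

A correlation computed this way treats every row as independent, which related languages are not, so it is best read as a rough description rather than a test.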

### Distance from Africa <a id='africa distance'></a>

Someone has claimed that phoneme inventory size and distance from Africa are inversely related. You could use the latitude/longitude in PHOIBLE for this. You may have to arbitrarily choose the mid-point of Africa for this.

We start by referring back to a graph we previously created.

In [None]:
pho_pop_cont = phoib.loc[:,["Area", "Phonemes"]]
pho_pop_cont = pho_pop_cont.groupby(by = "Area").mean().sort_values('Phonemes', ascending=False)
pho_pop_cont[['Phonemes']].plot.bar(figsize = (12,8))
plt.ylabel('Average Number of Phonemes')
pho_pop_cont

In [None]:
coordinates = list(zip(phoib.dropna()['Latitude'], phoib.dropna()['Longitude']))

# chose this point b/c it comes up when
# you google search 'africa coordinates'
africa_center = (8.7832, 34.5085)

# calculate the distance to each language's listed location
# (geopy 2.0 removed `vincenty`; `geodesic` is its replacement)
distances = np.array([geopy.distance.geodesic(point, africa_center).km for point in coordinates])

overlay_hex(distances, phoib.dropna()['Phonemes'])

In [None]:
# did the same, but logged the distances this time
overlay_hex(np.log(distances), phoib.dropna()['Phonemes'])
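
If `geopy` is not available, the great-circle distance can be approximated with the haversine formula. The sketch below is an alternative to the cell above, not what the notebook uses: it treats the Earth as a sphere (the `EARTH_RADIUS_KM` constant and the `haversine_km` helper are introduced here for illustration), so its results differ slightly from geopy's ellipsoidal distances.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in km from (lat, lon) to a reference point.

    lat and lon may be scalars or NumPy arrays, given in degrees.
    """
    lat, lon, ref_lat, ref_lon = map(np.radians, (lat, lon, ref_lat, ref_lon))
    dlat = lat - ref_lat
    dlon = lon - ref_lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(ref_lat) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

africa_center = (8.7832, 34.5085)  # same reference point as in the cell above
lats = np.array([37.5, 63.7551])   # Korean and Ket rows from the first table
lons = np.array([128.0, 87.5466])
print(haversine_km(lats, lons, *africa_center))
```

Because the function is vectorized, it can take whole `Latitude` and `Longitude` columns at once instead of looping over coordinate pairs.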

## Common Phonemes <a id='common'></a>

What are the most common phonemes in the world? What is the distribution of frequency? That is, there are about 2,000 phonemes in PHOIBLE, but only a handful are common and there's a long tail. One problem here is that PHOIBLE is not a random sample of languages. So, you could sample languages from PHOIBLE proportional to their population size and arrive at an estimate that way.

In [None]:
phonemes = pd.read_csv('phoible_data/phoible-by-phoneme.tsv', delimiter='\t')
phonemes.head()

In [None]:
print('Out of {} rows, there are {} unique phonemes.'.format(len(phonemes), len(list(set(phonemes['Phoneme'])))))

In [None]:
phoneme_counts = pd.DataFrame.from_dict(Counter(phonemes['Phoneme']), orient='index').reset_index().sort_values(0, ascending=False)
phoneme_counts.columns = ['Phoneme', 'Count']
phoneme_counts.iloc[:200].plot.bar(figsize=(15, 5))
plt.xticks([])
plt.ylabel('Count')
plt.xlabel('Phoneme')
plt.title('Counts of 200 Most Common Phonemes')

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
sns.distplot(phoneme_counts['Count'], ax=ax)
plt.title('Distribution of Phoneme Frequency')
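
The counts above treat every inventory equally. The population-weighted sampling suggested at the start of this section is not implemented in this notebook; one way it might be sketched is with `DataFrame.sample` and its `weights` argument. The toy table and its `Inventory` column (a list of phonemes per language) are made up for illustration and do not match the layout of the PHOIBLE files.

```python
import pandas as pd

# Made-up languages with populations and phoneme inventories
langs = pd.DataFrame({
    'LanguageCode': ['aaa', 'bbb', 'ccc'],
    'Population': [1_000_000, 10_000, 100],
    'Inventory': [['p', 't', 'k', 'a'], ['p', 'a', 'i'], ['k', 'u']],
})

# Draw languages with probability proportional to population (with replacement)
sample = langs.sample(n=1000, replace=True, weights='Population', random_state=0)

# Count how often each phoneme appears across the sampled languages
weighted_counts = sample['Inventory'].explode().value_counts()

# Share of sampled languages whose inventory contains each phoneme
print(weighted_counts / len(sample))
```

Sampling this way estimates how common a phoneme is among speakers rather than among documented languages, which is a different question from the raw counts above.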

Phonemes can be described by a set of (mostly) binary features. PHOIBLE has this data too. Is the distribution of feature values evenly split for each feature? If not, which features are more prone to being either 0 or 1?
Are some phonemes only present in some area or genetic affiliation? (There should be. For example, 'kp' and 'gb' are likely only in Africa.)

In [None]:
# join w/ other phoible (to get area column), then pivot('phoneme', 'area')

In [None]:
len(phonemes)

In [None]:
len(phoib[['LanguageCode', 'Area']])

In [None]:
lc_to_area = dict(zip(phoib['LanguageCode'], phoib['Area']))

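def convert_code(code):
    # look up the area for a language code; codes missing from phoib
    # (e.g. rows dropped for non-numeric population) map to 'undefined'
    return lc_to_area.get(code, 'undefined')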
    
phonemes['Area'] = [convert_code(code) for code in phonemes['LanguageCode']]

In [None]:
pd.crosstab(phonemes['Phoneme'], phonemes['Area'])

In [None]:
# normalizing by columns divides each count by its column total,
# which accounts for areas contributing different amounts of data
pd.crosstab(phonemes['Phoneme'], phonemes['Area'], margins=True, normalize='columns')
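
To make the effect of `normalize='columns'` concrete, and to show one way of finding phonemes that occur in only one area, here is a sketch on a made-up long-format table (one row per language-phoneme pair). Each column is divided by its own total, which is the number of phoneme rows for that area rather than the number of languages.

```python
import pandas as pd

# Made-up long-format table: one row per (language, phoneme) pair
toy = pd.DataFrame({
    'Phoneme': ['p', 'p', 'p', 'kp', 'kp', 'a', 'a', 'a'],
    'Area': ['Africa', 'Asia', 'Asia', 'Africa', 'Africa', 'Africa', 'Asia', 'Asia'],
})

# Raw counts of each phoneme in each area
counts = pd.crosstab(toy['Phoneme'], toy['Area'])

# normalize='columns' divides each column by its total, so each area sums to 1
shares = pd.crosstab(toy['Phoneme'], toy['Area'], normalize='columns')

# Phonemes with a nonzero count in exactly one area
exclusive = counts[(counts > 0).sum(axis=1) == 1]

print(shares)
print(exclusive)
```

On the real table, rows labelled `'undefined'` would count as their own area in this check, so it may be worth filtering them out first.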

## Tone <a id='tone'></a>

One of the features of phonemes is tone. If a language has a phoneme with tone, it counts as a "tone language". Are most languages tone languages? Where are the tone languages on the map?

In [None]:
tone_languages = phoib['Tones'] > 0
num_tone_languages = sum(tone_languages)
total_languages = len(phoib)

print('There are {} tone languages out of our dataset of {} languages.'.format(num_tone_languages, total_languages))
print("That's about {}%.".format(np.round(num_tone_languages/total_languages*100, 2)))

In [None]:
tone = phoib[tone_languages]
tone.head()

In [None]:
# two rows in tone don't have coordinates, need to filter them out
# (notna() is True wherever the latitude is not missing)
valid_coords = tone['Latitude'].notna()
mappable_tone = tone[valid_coords]

mp = folium.Map(zoom_start=12)
for coords in list(zip(mappable_tone['Latitude'], mappable_tone['Longitude'])):
    folium.Circle(
        radius=100,
        location=coords,
        color='crimson',
        fill=False,).add_to(mp)
mp

### Altitude <a id='altitude'></a>

Someone has claimed there is a relationship between being a tone language and altitude. You could use lat/long to call some API to get the altitude and see if there's a relationship. Here, the `elevation` column in our PHOIBLE file already holds this value.

In [None]:
phoib['Tone Language?'] = phoib['Tones'] > 0
have_elevation = phoib[['elevation', 'Tones', 'Tone Language?']].dropna()

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
sns.distplot(have_elevation[np.invert(have_elevation['Tone Language?'])]['elevation'], ax=ax)
sns.distplot(have_elevation[have_elevation['Tone Language?']]['elevation'], ax=ax)

In [None]:
# getting rid of the 3 SD outliers to get a better picture
no_out=have_elevation[((have_elevation['elevation'] - have_elevation['elevation'].mean()) / have_elevation['elevation'].std()).abs() < 3]

f, ax = plt.subplots(figsize=(10, 8))
sns.distplot(no_out[np.invert(no_out['Tone Language?'])]['elevation'], ax=ax)
sns.distplot(no_out[no_out['Tone Language?']]['elevation'], ax=ax)
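
The outlier filter above packs the z-score computation into one long line. The sketch below spells out the same idea step by step on made-up elevations, and adds a per-group median as a compact companion to the overlaid plots. The data and the alternating tone labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up elevations (metres) with one extreme value appended
rng = np.random.default_rng(0)
toy = pd.DataFrame({'elevation': np.append(rng.normal(500, 100, size=50), 5000.0)})
toy['Tone Language?'] = np.arange(len(toy)) % 2 == 0

# z-score: how many standard deviations each value lies from the mean
z = (toy['elevation'] - toy['elevation'].mean()) / toy['elevation'].std()

# Keep rows within 3 standard deviations of the mean
no_out = toy[z.abs() < 3]

# Median elevation for tone and non-tone languages
print(no_out.groupby('Tone Language?')['elevation'].median())
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so a 3 SD cutoff is a fairly lenient filter.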

In [None]:
# note to self: explore possibility that low elevation 
# places just tend to be further from africa, then point 
# that out as a possible confounding factor

## Relationship between population size and morphological complexity <a id='morph complex'></a>

All the data for this will be in WALS. Morphological complexity is a vague term, referring to how complicated the words in a language are. Here are some features that you should look at with respect to their relation to phoneme inventory size:

### Feature 30A: Number of Genders

In [None]:
desired_columns = ['LanguageCode', 'Area', 'Latitude', 'Longitude', 'Population', 'Phonemes']

gender_data = drop_and_subset('30A Number of Genders', combined, desired_columns)
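# map each WALS label to its leading code; these are ordinal categories,
# so '1 None' is coded as 1 and '5 Five or more' as 5
genders_dict = {'1 None':1, '2 Two':2, '3 Three':3, '4 Four':4, '5 Five or more':5}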
gender_data['Genders'] = [genders_dict[value] for value in gender_data['30A Number of Genders']]

print('Rows with Gender data: {}'.format(len(gender_data)))
gender_data.head()
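
The dictionary above maps each WALS label to its leading number. Assuming every label begins with an integer code, as the five labels listed above do, the same codes can be extracted with a regular expression instead of being typed by hand. This is a sketch of an alternative, not what the notebook does.

```python
import pandas as pd

# The five value labels used in the dictionary above
labels = pd.Series(['1 None', '2 Two', '3 Three', '4 Four', '5 Five or more'])

# Pull out the leading digits and convert them to integers
codes = labels.str.extract(r'^(\d+)', expand=False).astype(int)

print(dict(zip(labels, codes)))
```

The same pattern would apply to other WALS feature columns whose labels start with a number, though whether the resulting codes form a meaningful scale has to be judged feature by feature.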

In [None]:
overlay_hex(gender_data['Genders'], gender_data['Phonemes'])

In [None]:
overlay_hex(gender_data['Genders'], np.log(gender_data['Population']))

### Feature 27A: Reduplication

In [None]:
reduplication = drop_and_subset('27A Reduplication', combined, desired_columns)
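# select the numeric column before averaging; newer pandas raises an error
# when mean() is applied to text columns such as LanguageCode
reduplication.groupby('27A Reduplication')[['Phonemes']].mean().plot.bar()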
plt.xticks(rotation=40)

### Feature 20A: Fusion of Selected Inflectional Formatives

In [None]:
fusion = drop_and_subset('20A Fusion of Selected Inflectional Formatives', combined, desired_columns)
# select the numeric column before averaging (text columns cannot be averaged)
fusion.groupby('20A Fusion of Selected Inflectional Formatives')[['Phonemes']].mean().plot.bar()
plt.xticks(rotation=70)

### Feature 21A: Exponence of Selected Inflectional Formatives

In [None]:
exponence_a = drop_and_subset('21A Exponence of Selected Inflectional Formatives', combined, desired_columns)
# select the numeric column before averaging (text columns cannot be averaged)
exponence_a.groupby('21A Exponence of Selected Inflectional Formatives')[['Phonemes']].mean().plot.bar()
plt.xticks(rotation=70)

### Feature 21B: Exponence of Tense-Aspect-Mood Inflection

In [None]:
exponence_b = drop_and_subset('21B Exponence of Tense-Aspect-Mood Inflection', combined, desired_columns)
# select the numeric column before averaging (text columns cannot be averaged)
exponence_b.groupby('21B Exponence of Tense-Aspect-Mood Inflection')[['Phonemes']].mean().plot.bar()
plt.xticks(rotation=70)
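
Bar charts of group means hide how many languages fall into each category, and a category with only one or two languages can produce an extreme mean. As a sketch, reporting the group size alongside the mean makes this visible. The `Feature` column below is a made-up stand-in for any of the WALS feature columns used above.

```python
import pandas as pd

# Made-up stand-in for one of the feature tables above
toy = pd.DataFrame({
    'Feature': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Phonemes': [30, 34, 32, 60, 25, 27],
})

# Report the group size alongside the mean phoneme count
summary = toy.groupby('Feature')['Phonemes'].agg(['mean', 'count'])
print(summary)
```

Here category `B` has the highest mean but rests on a single language, which is the kind of case worth checking before reading much into a tall bar.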