# Exploration

In [1]:
import scipy
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import mode
from ipywidgets import interact
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib import colors
from sklearn.linear_model import LinearRegression
import warnings

warnings.filterwarnings('ignore')
sns.set_style('darkgrid')
%matplotlib inline

In [43]:
file_name = 'data/wals-chapter-1.csv'
data = pd.read_csv(file_name)
data.head()

Unnamed: 0,ID,Language_ID,Language_name,Parameter_ID,Parameter_name,Value,Source,Comment
0,1A-kgi,kgi,Konyagi,1A,Consonant Inventories,Large,Santos-1977,
1,1A-cve,cve,Chuave,1A,Consonant Inventories,Small,Thurman-1970,
2,1A-nbk,nbk,Natügu,1A,Consonant Inventories,Moderately large,Wurm-1972b,
3,1A-ach,ach,Aché,1A,Consonant Inventories,Small,Susnik-1974,
4,1A-aiz,aiz,Aizi,1A,Consonant Inventories,Average,Herault-1971,


In [59]:
phoible_data = pd.read_csv('phoible_data/phoible-by-phoneme.tsv', delimiter='\t')
print(phoible_data.columns)
phoible_data.head()

Index(['LanguageCode', 'LanguageName', 'SpecificDialect', 'Phoneme',
       'Allophones', 'Source', 'Trump', 'GlyphID', 'InventoryID', 'tone',
       'stress', 'syllabic', 'short', 'long', 'consonantal', 'sonorant',
       'continuant', 'delayedRelease', 'approximant', 'tap', 'trill', 'nasal',
       'lateral', 'labial', 'round', 'labiodental', 'coronal', 'anterior',
       'distributed', 'strident', 'dorsal', 'high', 'low', 'front', 'back',
       'tense', 'retractedTongueRoot', 'advancedTongueRoot',
       'periodicGlottalSource', 'epilaryngealSource', 'spreadGlottis',
       'constrictedGlottis', 'fortis', 'raisedLarynxEjective',
       'loweredLarynxImplosive', 'click'],
      dtype='object')


Unnamed: 0,LanguageCode,LanguageName,SpecificDialect,Phoneme,Allophones,Source,Trump,GlyphID,InventoryID,tone,...,retractedTongueRoot,advancedTongueRoot,periodicGlottalSource,epilaryngealSource,spreadGlottis,constrictedGlottis,fortis,raisedLarynxEjective,loweredLarynxImplosive,click
0,kor,Korean,,a,a,spa,False,0061,1,0,...,-,-,+,-,-,-,0,-,-,0
1,kor,Korean,,aː,aː,spa,False,0061+02D0,1,0,...,-,-,+,-,-,-,0,-,-,0
2,kor,Korean,,e,e,spa,False,0065,1,0,...,-,-,+,-,-,-,0,-,-,0
3,kor,Korean,,eː,eː,spa,False,0065+02D0,1,0,...,-,-,+,-,-,-,0,-,-,0
4,kor,Korean,,h,ç h ɦ,spa,False,0068,1,0,...,0,0,-,-,+,-,-,-,-,-


Background
==========
One big question in cognitive science is the relationship between linguistic features (e.g. number of vowels, word order and number of tense categories) and non-linguistic features (e.g. population size, altitude and climate). In particular, a lot of attention has been paid to the relationship between population size and various linguistic features. People have looked at the relationship between population size and:
- size of the phoneme inventory
- morphological complexity
Relationship between population size and phonemic inventory
===========================================================
Phonemes are the distinctive sound units of a language. They are usually written with the International Phonetic Alphabet (IPA). Segmental phonemes are either consonants or vowels. Each language has its own finite inventory of phonemes. PHOIBLE is probably the best data source for this.
Visualize the following univariate distributions both as a histogram and on a map:
- Number of consonants
- Number of vowels
- Number of phonemes
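
The three counts can be derived from the per-phoneme table by labelling each row and counting per inventory. The sketch below is only a heuristic, not PHOIBLE's own classification: it assumes tone rows are marked `tone == '+'`, that vowels are rows with `syllabic == '+'` and `consonantal == '-'`, and that everything else is a consonant. It also assumes tones are left out of the phoneme total. The `toy` frame is made up and stands in for `phoible_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for phoible_data; only the columns this sketch needs.
toy = pd.DataFrame({
    'InventoryID': [1, 1, 1, 1, 2, 2, 2],
    'Phoneme':     ['a', 'e', 'h', 'k', 'a', 'm', '˥'],
    'tone':        ['0', '0', '0', '0', '0', '0', '+'],
    'syllabic':    ['+', '+', '-', '-', '+', '-', '0'],
    'consonantal': ['-', '-', '-', '+', '-', '+', '0'],
})

def classify_segments(df):
    """Label each row 'tone', 'vowel' or 'consonant' (assumed heuristic)."""
    is_tone = df['tone'] == '+'
    is_vowel = (df['syllabic'] == '+') & (df['consonantal'] == '-') & ~is_tone
    labels = np.select([is_tone, is_vowel], ['tone', 'vowel'], default='consonant')
    return pd.Series(labels, index=df.index, name='segment_class')

def inventory_counts(df):
    """One row per inventory with consonant, vowel, tone and phoneme counts."""
    classes = classify_segments(df)
    counts = (
        df.groupby(['InventoryID', classes])
          .size()
          .unstack(fill_value=0)
          .reindex(columns=['consonant', 'vowel', 'tone'], fill_value=0)
    )
    # Assumption: the phoneme total excludes tones.
    counts['phonemes'] = counts['consonant'] + counts['vowel']
    return counts

counts = inventory_counts(toy)
```

On the real table, `inventory_counts(phoible_data)['phonemes'].hist()` would give the histogram. A language with several inventories appears once per `InventoryID`, so the choice of which inventory to keep still has to be made.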
Visualize those same distributions grouped by continent and grouped by genetic affiliation. The continent is called `area` in PHOIBLE. The genetic affiliation is called `Family` in WALS. WALS and PHOIBLE both identify languages with ISO 639-3, so you should be able to match them up. You could try looking at number of phonemes plotted against latitude/longitude, but I doubt anything will come of it.
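
Matching the two sources could look like the sketch below. It assumes a WALS language table with an ISO column and a `Family` column; the name `ISO639P3code` and both toy frames are assumptions for illustration. The `Language_ID` values in the WALS output above may be WALS-internal codes rather than ISO 639-3, so it is worth checking before joining on them directly.

```python
import pandas as pd

# Toy per-language counts keyed by PHOIBLE's LanguageCode (made-up numbers).
phoible_counts = pd.DataFrame({
    'LanguageCode': ['kor', 'aaa', 'bbb'],
    'phonemes': [40, 25, 31],
})

# Toy WALS language table; column names are assumptions.
wals_languages = pd.DataFrame({
    'ISO639P3code': ['kor', 'aaa', 'ccc'],
    'Family': ['FamilyA', 'FamilyB', 'FamilyC'],
})

# A left join keeps every PHOIBLE language; indicator=True records
# whether a WALS match was found.
merged = phoible_counts.merge(
    wals_languages,
    how='left',
    left_on='LanguageCode',
    right_on='ISO639P3code',
    indicator=True,
)

# Languages with no WALS match, worth inspecting by hand.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'LanguageCode'].tolist()

# Mean inventory size per family, using matched rows only.
by_family = merged.dropna(subset=['Family']).groupby('Family')['phonemes'].mean()
```

Checking `unmatched` first shows how much of PHOIBLE is lost in the join before any grouped plot is drawn.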
Visualize the relationship between number of consonants and number of vowels, again across all languages, by continent and by genetic affiliation.
Visualize the relationship between population size and phoneme inventory size, again across all languages, by continent and by genetic affiliation.
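
Population sizes span several orders of magnitude, so one reasonable starting point is to plot and fit against the logarithm of population. The sketch below assumes a per-language table with a `population` column (not among the PHOIBLE columns printed above, so it would have to come from elsewhere) and uses made-up numbers.

```python
import numpy as np
import pandas as pd

# Made-up illustration data, not real measurements.
langs = pd.DataFrame({
    'population': [1_000, 10_000, 100_000, 1_000_000, 10_000_000],
    'phonemes':   [20, 24, 27, 33, 36],
})

# Work on a log10 scale so small and large languages are comparable.
log_pop = np.log10(langs['population'])

# Least-squares line: phonemes ~ slope * log10(population) + intercept.
slope, intercept = np.polyfit(log_pop, langs['phonemes'], deg=1)

# Pearson correlation between log population and inventory size.
r = np.corrcoef(log_pop, langs['phonemes'])[0, 1]
```

The slope reads as the change in inventory size per tenfold increase in population. Languages within a family are not independent observations, so a fit like this is descriptive only.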
Someone has claimed that phoneme inventory size and distance from Africa are inversely related. You could use the latitude/longitude in PHOIBLE for this. You may have to arbitrarily choose the mid-point of Africa for this.
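
A simple way to get a distance from a latitude/longitude pair is the great-circle (haversine) distance. The sketch below assumes a spherical Earth and an arbitrarily chosen origin point; `ORIGIN` is a placeholder, not a value from any source.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, scalars or arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Arbitrary reference point (latitude, longitude); an assumption.
ORIGIN = (0.0, 20.0)

# Vectorised use: distances of two toy locations from the origin.
lats = np.array([0.0, 0.0])
lons = np.array([20.0, 110.0])
distances = haversine_km(lats, lons, *ORIGIN)
```

Great-circle distance ignores coastlines and plausible migration routes, so it is only a first approximation of "distance from Africa".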
What are the most common phonemes in the world? What is the distribution of frequency? That is, there are about 2,000 phonemes in PHOIBLE, but only a handful are common and there's a long tail. One problem here is that PHOIBLE is not a random sample of languages. So, you could sample languages from PHOIBLE with probability proportional to their population size and arrive at an estimate that way.
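
Both estimates can be sketched as follows: the plain share of languages that contain each phoneme, and a population-weighted version that draws languages with replacement. The population figures and toy inventories are made up, and the function names are hypothetical.

```python
import pandas as pd

# Toy per-phoneme rows (made up) and made-up population sizes.
phonemes = pd.DataFrame({
    'LanguageCode': ['aaa', 'aaa', 'bbb', 'bbb', 'ccc', 'ccc'],
    'Phoneme':      ['m',   'a',   'm',   'k',   'a',   'kp'],
})
population = pd.Series({'aaa': 1_000_000, 'bbb': 10_000, 'ccc': 500})

def phoneme_frequency(df):
    """Share of languages whose inventory contains each phoneme."""
    n_langs = df['LanguageCode'].nunique()
    per_phoneme = df.groupby('Phoneme')['LanguageCode'].nunique()
    return (per_phoneme / n_langs).sort_values(ascending=False)

def weighted_phoneme_frequency(df, population, n_draws=1000, seed=0):
    """Share of population-weighted language draws that contain each phoneme."""
    # Draw languages with replacement, probability proportional to population.
    draws = population.sample(
        n=n_draws, replace=True, weights=population, random_state=seed
    )
    draw_counts = draws.index.value_counts()  # times each language was drawn
    has = df.drop_duplicates(['LanguageCode', 'Phoneme'])
    # Languages that were never drawn contribute zero.
    hits = has['LanguageCode'].map(draw_counts).fillna(0)
    return (hits.groupby(has['Phoneme']).sum() / n_draws).sort_values(ascending=False)

freq = phoneme_frequency(phonemes)
weighted = weighted_phoneme_frequency(phonemes, population)
```

Plotting `freq.values` against rank on a log scale is one way to look at the long tail.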
Phonemes can be described by a set of (mostly) binary features. PHOIBLE has this data too. Is the distribution of feature values evenly split for each feature? If not, which features are more prone to taking one value over the other? (In the PHOIBLE table above, the values are coded as `+`, `-` and `0` rather than as 0 and 1.)
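
One way to check the split is to tabulate the share of each value per feature column. The sketch below uses a made-up frame with three of the feature columns; `feature_value_shares` is a hypothetical helper name.

```python
import pandas as pd

# Toy segment rows with three feature columns (values made up).
segments = pd.DataFrame({
    'Phoneme':  ['a', 'e', 'h', 'k', 'm'],
    'nasal':    ['-', '-', '-', '-', '+'],
    'sonorant': ['+', '+', '-', '-', '+'],
    'click':    ['0', '0', '-', '-', '-'],
})
feature_cols = ['nasal', 'sonorant', 'click']

def feature_value_shares(df, feature_cols):
    """Rows are features, columns are feature values, cells are shares."""
    shares = {col: df[col].value_counts(normalize=True) for col in feature_cols}
    return pd.DataFrame(shares).T.fillna(0.0)

shares = feature_value_shares(segments, feature_cols)

# Imbalance between '+' and '-': 0 means an even split, 1 means one-sided.
imbalance = (shares['+'] - shares['-']).abs().sort_values(ascending=False)
```

On the real table, the feature columns appear to be everything from `tone` onward in the printed column index. Each unique phoneme should probably be counted once rather than once per inventory, for example by dropping duplicates on `Phoneme` first.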
Are some phonemes only present in one area or genetic affiliation? (There should be. For example, 'kp' and 'gb' are likely only in Africa.)
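
Phonemes restricted to a single group can be found by counting distinct groups per phoneme. The sketch assumes the per-phoneme rows already carry an `area` column, which is not among the columns printed above and would need to be joined in. The data is made up and `exclusive_phonemes` is a hypothetical helper.

```python
import pandas as pd

# Toy rows: one row per (language, phoneme), with the language's area.
rows = pd.DataFrame({
    'LanguageCode': ['aaa', 'aaa', 'bbb', 'bbb', 'ccc', 'ccc'],
    'area':         ['Africa', 'Africa', 'Africa', 'Africa', 'Asia', 'Asia'],
    'Phoneme':      ['kp', 'm', 'kp', 'a', 'm', 'a'],
})

def exclusive_phonemes(df, group_col, min_languages=1):
    """Phonemes attested in exactly one value of group_col."""
    stats = df.groupby('Phoneme').agg(
        n_groups=(group_col, 'nunique'),
        group=(group_col, 'first'),
        n_languages=('LanguageCode', 'nunique'),
    )
    # min_languages filters out phonemes found in only one or two languages,
    # which are trivially exclusive to a single group.
    keep = (stats['n_groups'] == 1) & (stats['n_languages'] >= min_languages)
    return stats.loc[keep, ['group', 'n_languages']]

africa_only = exclusive_phonemes(rows, 'area', min_languages=2)
```

Passing a family column as `group_col` gives the same check for genetic affiliation.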
One of the features of phonemes is tone. If a language has a phoneme with tone, it counts as a "tone language". Are most languages tone languages? Where are the tone languages on the map?
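
Flagging tone languages amounts to asking, per language, whether any row has a positive tone value. The sketch assumes tone is marked as `tone == '+'` (the Korean rows above show `0`, so the exact coding should be checked against the real data) and uses made-up rows.

```python
import pandas as pd

# Toy per-phoneme rows (made up); language 'bbb' has two tone rows.
segments = pd.DataFrame({
    'LanguageCode': ['aaa', 'aaa', 'bbb', 'bbb', 'bbb'],
    'Phoneme':      ['a', 'k', 'a', '˥', '˩'],
    'tone':         ['0', '0', '0', '+', '+'],
})

# True for a language if at least one of its rows is marked as tone.
is_tone_language = (segments['tone'] == '+').groupby(segments['LanguageCode']).any()

# Share of languages that count as tone languages under this definition.
share_tonal = is_tone_language.mean()
```

`is_tone_language` can then be joined to a table of coordinates to colour points on a map.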
Someone has claimed there is a relationship between being a tone language and altitude. You could use lat/long to call some API to get the altitude and see if there's a relationship.
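
Once an altitude is available for each language, a first look could compare the two groups and compute a point-biserial correlation, which is just the Pearson correlation with a 0/1 variable. The sketch assumes a table with a 0/1 `is_tone` column and an `altitude_m` column, both hypothetical names. The numbers are made up and do not represent a finding.

```python
import numpy as np
import pandas as pd

# Made-up illustration data: 1 = tone language, 0 = not.
langs = pd.DataFrame({
    'is_tone':    [1, 1, 0, 0, 1, 0],
    'altitude_m': [50, 120, 1500, 2300, 300, 900],
})

# Mean altitude for each group.
mean_by_group = langs.groupby('is_tone')['altitude_m'].mean()

# Point-biserial correlation = Pearson r between the 0/1 flag and altitude.
r = np.corrcoef(langs['is_tone'], langs['altitude_m'])[0, 1]
```

Altitude is usually right-skewed, so a log transform or a rank-based comparison may be more appropriate on real data.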
Relationship between population size and morphological complexity
=================================================================
All the data for this will be in WALS. Morphological complexity is a vague term, referring to how complicated the words in a language are. Here are some features that you should look at with respect to their relation to phoneme inventory size:
- Feature 30A: Number of Genders
- Feature 27A: Reduplication
- Feature 20A: Fusion of Selected Inflectional Formatives
- Feature 21A: Exponence of Selected Inflectional Formatives
- Feature 21B: Exponence of Tense-Aspect-Mood Inflection
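
The WALS table loaded above is in long format, with one row per language and feature. To compare several features per language, it helps to pivot to one row per language. The sketch below uses the same `Language_ID`, `Parameter_ID` and `Value` columns; the rows and value labels are placeholders, not real WALS entries.

```python
import pandas as pd

# Toy long-format rows in the same shape as the WALS table above.
wals_long = pd.DataFrame({
    'Language_ID':  ['aaa', 'aaa', 'bbb', 'bbb', 'ccc'],
    'Parameter_ID': ['30A', '27A', '30A', '27A', '30A'],
    'Value':        ['Two', 'valueX', 'Three', 'valueY', 'Four'],
})

# One row per language, one column per feature; missing codings become NaN.
wide = wals_long.pivot(index='Language_ID', columns='Parameter_ID', values='Value')

# Number of languages coded for each feature.
coverage = wide.notna().sum()
```

Checking `coverage` first is useful because WALS features are coded for different subsets of languages, so the overlap between any two features can be small.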