# Lexicostatistics
In this lecture, you will apply what you've learned about iteration and string manipulation to cross-linguistic word comparison, or lexicostatistics.

In [None]:
# run this cell; don't worry about what it does yet.
from datascience import *
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
%matplotlib inline
import string

Lexicostatistics is a method used in linguistics to determine the similarities between different languages by comparing words with common meanings. Cognates (words that descend from a common ancestor) are a common topic of study. For example, the word for "door" is similar in many Indo-European languages: *thura* (Ancient Greek), *dvar* (Sanskrit), *dorus* (Celtic), *durn* (Armenian). Compare these with *porte* (French), *puerta* (Spanish), and *porta* (Italian).

There are many ways to compare words cross-linguistically (across languages). One measure is the **Levenshtein distance** (or **edit distance**). The edit distance is the minimum number of "edits" necessary to change one word into another word. An "edit" is the insertion, deletion, or replacement of a single letter.
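The definition can be written down almost word for word as a recursive function. The sketch below is only an illustration of the definition, not the function used in the rest of this notebook; the name `lev` is made up for this example.

```python
from functools import lru_cache

def lev(w1, w2):
    """Hypothetical helper: edit distance computed straight from the definition."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j): edits needed to turn the first i letters of w1
        # into the first j letters of w2
        if i == 0:
            return j  # nothing left of w1: insert the j remaining letters
        if j == 0:
            return i  # nothing left of w2: delete the i remaining letters
        cost = 0 if w1[i - 1] == w2[j - 1] else 1
        return min(
            d(i - 1, j) + 1,         # delete a letter from w1
            d(i, j - 1) + 1,         # insert a letter into w1
            d(i - 1, j - 1) + cost,  # replace a letter (free if they match)
        )
    return d(len(w1), len(w2))

print(lev('dog', 'doggy'))  # 2: insert 'g', insert 'y'
print(lev('dog', 'dag'))    # 1: replace 'o' with 'a'
print(lev('dog', 'do'))     # 1: delete 'g'
```

Taking the `min` over the three options is what makes this the *smallest* number of edits rather than just any sequence of edits that works.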

To compute edit distance, I've provided a custom function called `edit_distance()`. The function has two arguments, `w1` and `w2` (word 1 and word 2). The output is the edit distance between `w1` and `w2` as an integer (whole number).

Some examples:
```
>>> edit_distance('dog', 'doggy')
2
>>> edit_distance('dog', 'dag')
1
>>> edit_distance('dog', 'do')
1
```

In [None]:
def edit_distance(w1, w2):
    '''Computes the Levenshtein distance between two words.'''
    size_x = len(w1) + 1
    size_y = len(w2) + 1
    matrix = np.zeros ((size_x, size_y))
    for x in range(size_x):
        matrix [x, 0] = x
    for y in range(size_y):
        matrix [0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            if w1[x-1] == w2[y-1]:
                matrix [x,y] = min(
                    matrix[x-1, y] + 1,
                    matrix[x-1, y-1],
                    matrix[x, y-1] + 1
                )
            else:
                matrix [x,y] = min(
                    matrix[x-1,y] + 1,
                    matrix[x-1,y-1] + 1,
                    matrix[x,y-1] + 1
                )
    # print (matrix) (for debugging purposes)
    return int(matrix[size_x - 1, size_y - 1])

Run the cell below to see how the `edit_distance` function works. How many "edits" do you need to go from "dog" to "doge"? How about "cat" to "scallop"?

In [None]:
print(edit_distance('dog', 'doge'))
print(edit_distance('cat', 'scallop'))

## The Indo-European lexicostatistics database
The data we are using is from [Dyen, Kruskal & Black (1992)](https://www.jstor.org/stable/1006517?seq=6#metadata_info_tab_contents). They collected a list of 200 cross-linguistically common words from over 80 Indo-European languages and dialects. The list of terms was originally compiled by [Swadesh (1952)](https://en.wikipedia.org/wiki/Swadesh_list) and is often called a "Swadesh List".

In [None]:
l = Table.read_table('wk3-ie.csv')

Let's look at the table and get a feel for what it looks like.

In [None]:
l.show(10)

Let's see what other languages besides Afghan are in the table.

In [None]:
u = np.unique(l.column('Language'))
print(u)
print("There are",len(u),"languages in the Indo-European Lexicostatistics database.")

In [None]:
l.where('Language','English ST')

The English rows will come in handy as a way to remember the word associated with each of the 200 values in the column `Feature`. (In Wednesday's lecture, Geoff will show you how to create a `dictionary` object that allows you to quickly look up any of the features.)

Unfortunately, the researchers who created the dataset were unable to find words for every feature in every language. These missing entries are loaded into the table as the text "nan", which stands for "not a number". Many programming languages have a value like this to stand in for missing data; other common names are "null" and "none". Python's closest equivalent is the built-in `None` object. How many values are missing from the dataset? How many languages have missing values?
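To see where the text "nan" can come from, here is a small sketch in plain Python. It assumes the missing cells started out as floating-point NaN values and were converted to text along the way; that is a guess about how the file was loaded, not something the dataset documents.

```python
import math

missing = float('nan')      # the floating-point "not a number" value

print(str(missing))         # nan: the text a NaN float turns into
print(missing == missing)   # False: NaN is not equal to anything, even itself
print(math.isnan(missing))  # True: the reliable test for a NaN float

print(None is None)         # True: None is a single object, compared with `is`
print('nan' == 'nan')       # True: the string 'nan' is just ordinary text
```

Because the table holds the ordinary string `'nan'`, a plain equality match on that string is enough to find the missing rows.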

In [None]:
l.where('Term','nan').show(5)

In [None]:
missing_wd = l.where('Term','nan').num_rows
missing_lg = len(np.unique(l.where('Term','nan').column('Language')))
print("The Indo-European Lexicostatistics Database is missing",missing_wd,"values in 'Term'.")
print(missing_lg,"languages in the Indo-European Lexicostatistics Database are missing at least one value in 'Term'.")

Let's replace these "nan" values with an empty string (i.e., a string of length zero: ""). We could use a `for` loop or the method `.apply()` with a custom function. I will demonstrate both methods.
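A detail worth keeping in mind when blanking out the placeholder: `str.replace` rewrites every occurrence of a substring, so it can also change a real word that happens to contain the letters "nan". The sketch below uses made-up values (not taken from the dataset) to contrast substring replacement with a whole-value comparison.

```python
terms = ['nan', 'banana', 'water']  # made-up values for illustration

# Substring replacement also edits words that merely contain 'nan'
substring_result = [t.replace('nan', '') for t in terms]
print(substring_result)  # ['', 'baa', 'water']

# Whole-value comparison only blanks out the placeholder itself
exact_result = ['' if t == 'nan' else t for t in terms]
print(exact_result)      # ['', 'banana', 'water']
```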

In [None]:
t1 = make_array()
for item in l.column('Term'):
    r = '' if item == 'nan' else item  # blank out only whole-value 'nan' entries
    t1 = np.append(t1,r)
l0 = l.with_columns('Term2', t1)
l0.show(5)

`.with_columns` can be used to create a new column (`Term2`) or replace the original one (`Term`).

In [None]:
l_replace = l.with_columns('Term', t1)
l_replace.show(5)

In [None]:
missing_wd = l0.where('Term2','nan').num_rows
whitespace = l0.where('Term2','').num_rows
print("The Indo-European Lexicostatistics Database is missing",missing_wd,"values in 'Term2'.")
print("However, it has",whitespace,"values in 'Term2' that are empty strings.")

You probably noticed that the `for` loop is pretty slow when the Table has thousands of rows, in part because `np.append` copies the whole array every time it is called. In this case, `.apply()` is much faster. First, we create a custom function to pass as the first argument of `.apply()`.

In [None]:
def remove_nan(s):
    '''[Replace this with a description of what this function does.]'''
    return '' if s == 'nan' else s  # leave real words that contain 'nan' untouched

In [None]:
t2 = l.apply(remove_nan,'Term')
l0 = l.with_columns('Term2',t2) # This will overwrite our previous object l0.

In [None]:
missing_wd = l0.where('Term2','nan').num_rows
whitespace = l0.where('Term2','').num_rows
print("The Indo-European Lexicostatistics Database is missing",missing_wd,"values in 'Term2'.")
print("However, it has",whitespace,"values in 'Term2' that are empty strings.")

## Features with multiple terms
Some languages use multiple words to refer to the same feature (e.g., synonyms). In this dataset, these are separated by commas. We are going to create a table called `multiple_terms` with two columns:

- language: the name of the language
- num_multiple: the number of features which contain multiple values 

We compute "num_multiple" by counting, for each language, the number of features whose term contains a comma. `.apply()` and a `for` loop will come in handy here.

In [None]:
def count_comma(s):
    '''[Replace this with a description of what this function does.]'''
    return s.count(',')

Let's do it for one language first.

In [None]:
language = 'Afghan'
group = l0.where('Language',language)

In [None]:
comma_count = group.apply(count_comma,'Term2')
comma_count

We'll use a simple Boolean (True/False) comparison to find out which items in the array contain a comma: any value in `comma_count` greater than 0 indicates at least one comma.

In [None]:
comma_count > 0

Then we can count up the number of `True` items in the array. (What's wrong with just using `.sum()` on `comma_count`?)

In [None]:
sum_commas = (comma_count > 0).sum()
sum_commas

In [None]:
# This will give us an incorrect answer.
comma_count.sum()
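The difference between the two sums is easier to see on a tiny example. The counts below are made up, and plain Python lists stand in for the arrays used in the cells above.

```python
comma_counts = [0, 2, 1, 0]  # made-up counts: commas found in four terms

with_commas = [c > 0 for c in comma_counts]
print(with_commas)        # [False, True, True, False]
print(sum(with_commas))   # 2: terms that contain at least one comma
print(sum(comma_counts))  # 3: commas in total, which is a different question
```

Summing Booleans works because Python treats `True` as 1 and `False` as 0.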

Now, let's put these lines of code within a `for` loop so that we can iterate over every language. Each time we iterate, we add another item to the array `num_multiple`, which we've created outside of the loop.

In [None]:
num_multiple = make_array()
for language in np.unique(l0.column('Language')): # be careful not to iterate over every single row in the Table!
    group = l0.where('Language',language)
    comma_count = group.apply(count_comma,'Term2')
    sum_commas = (comma_count > 0).sum()
    num_multiple = np.append(num_multiple, sum_commas)
num_multiple

Finally, use `Table().with_columns()` to turn this array into a `Table`.

In [None]:
multiple_terms = Table().with_columns('language',np.unique(l0.column('Language')),'num_multiple',num_multiple)
multiple_terms

In [None]:
plt.figure(figsize=(3,15)) # This will make our plot better proportioned
plt.barh('language','num_multiple',data=multiple_terms)
plt.title('Number of features with multiple terms per language')
plt.xlabel('Number of multiple terms') # barh draws the counts along the x-axis
plt.ylabel('Language')

Looks like Provencal has a lot of synonyms! Does this end up making the lexicon of Provencal a lot larger than that of English? To find out, let's create a second table called `unique_terms` which contains two columns:

- language: the name of the language
- num_terms: the number of unique terms

And we'll create a different custom function that deals with a value that has multiple terms, separated by a comma.

In [None]:
def comma_split(s):
    '''[Replace this with a description of what this function does.]'''
    return s.split(',')

In [None]:
test = l0.apply(comma_split,'Term2')
print(test)

Notice that the `comma_split()` function has returned an array of lists instead of an array of arrays. An array that contains `[a, b, [c, d], e]` will be considered to have 4 items, not 5. In order to accurately count up the items in the array, we need to "flatten" these lists and end up with something like `[a, b, c, d, e]`. To do this, we can use more `for` loops.

In [None]:
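# dtype=object keeps the nested list as a single item; recent NumPy versions reject ragged input without it
notflat = np.array(['a', 'b', ['c', 'd'], 'e'], dtype=object)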
flat_array = make_array()
for sublist in notflat:
    for item in sublist:
        flat_array = np.append(flat_array,item)
flat_array
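As an aside, the same flattening can be sketched with the standard library's `itertools.chain.from_iterable`. The nested list below is made up. The second call shows a pitfall shared by any flattening loop: a bare string is itself iterable, so it gets split into letters unless every element is wrapped in a list (which is what `comma_split` returns).

```python
from itertools import chain

nested = [['ab'], ['c', 'd'], ['e']]  # made-up data: every element is a list

flat = list(chain.from_iterable(nested))
print(flat)  # ['ab', 'c', 'd', 'e']

# A bare string is iterable too, so it is split into single letters
split_by_accident = list(chain.from_iterable(['ab', ['c', 'd']]))
print(split_by_accident)  # ['a', 'b', 'c', 'd']
```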

So, we can now use `comma_split` and the flattening `for` loop on `l0`. But we don't want to apply the function to the entire table `l0` outright; instead, we are going to use a series of `for` loops to do one language at a time.

In [None]:
num_terms = make_array()
for language in np.unique(l0.column('Language')):
    group = l0.where('Language',language)
    terms = group.apply(comma_split,'Term2')
    flat_array = make_array()
    for sublist in terms:
        for item in sublist:
            flat_array = np.append(flat_array,item)
    count = len(np.unique(flat_array))
    num_terms = np.append(num_terms, count)

In [None]:
unique_terms = Table().with_columns('language',np.unique(l0.column('Language')),'num_terms', num_terms)
unique_terms

In [None]:
plt.figure(figsize=(25,20)) # This will make our plot better proportioned
ax = plt.subplot(1,2,1) # 1 row, 2 columns, plot #1
plt.barh('language','num_multiple',data=multiple_terms)
plt.title('Number of features with multiple terms per language')
plt.xlabel('Number of multiple terms') # barh draws the counts along the x-axis
plt.ylabel('Language')

ax = plt.subplot(1,2,2) # 1 row, 2 columns, plot #2
plt.barh('language','num_terms',data=unique_terms)
plt.title('Number of unique terms per language')
plt.xlabel('Number of unique terms') # barh draws the counts along the x-axis
plt.ylabel('Language')

For a different perspective, let's create the same plots, but sort the languages in decreasing order.

In [None]:
plt.figure(figsize=(25,20)) # This will make our plot better proportioned
ax = plt.subplot(1,2,1) # 1 row, 2 columns, plot #1
plt.barh('language','num_multiple',data=multiple_terms.sort('num_multiple'))
plt.title('Number of features with multiple terms per language')
plt.xlabel('Number of multiple terms') # barh draws the counts along the x-axis
plt.ylabel('Language')

ax = plt.subplot(1,2,2) # 1 row, 2 columns, plot #2
plt.barh('language','num_terms',data=unique_terms.sort('num_terms'))
plt.title('Number of unique terms per language')
plt.xlabel('Number of unique terms') # barh draws the counts along the x-axis
plt.ylabel('Language')

Which plot, left or right, do you think is a better visualization of the differences between lexicons? Why?

## Computing the language distance
Now, back to our main task. We want to compute language distance using the function `edit_distance`. To start out, we will compute the language distance between English and Afghan (labeled "English ST" and "Afghan" in this dataset). We define the language distance as the sum of the edit distances for all of the features (so `language_dist = dist_feature1 + dist_feature2 + ... + dist_feature200`). First, we will create an array with the distances for each feature. Then, we will compute the language distance using this array. Along the way, we will follow these rules:

- If a feature has multiple terms, take the first term.
- Treat terms with spaces as if they were a single word (e.g., treat "ta nezde" as "tanezde").
- Ignore missing values: these will create a small amount of error in our data, which is okay to ignore for now.
- Use `for` statements to loop through the features.

We will use two string methods we've just learned: `.split(',')` and `.replace(old, new)`. These are built-in methods which you can use with any string:
```
>>> 'cat,dog'.split(',')
['cat', 'dog']
>>> 'cat,dog'.replace('cat', 'fish')
'fish,dog'
```
And we will include them in our own custom function.

In [None]:
def shorten(s):
    '''[Replace this with a description of what this function does.]'''
    return s.replace(" ","").split(",")[0]
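To check the rules on a few inputs, here is a small sketch. `shorten_demo` is a hypothetical copy of the one-line rule above, and the input strings are made up for illustration.

```python
def shorten_demo(s):
    """Hypothetical copy of the rule: drop spaces, then keep the first term."""
    return s.replace(' ', '').split(',')[0]

print(shorten_demo('ta nezde'))       # tanezde
print(shorten_demo('porte, puerta'))  # porte
print(repr(shorten_demo('')))         # '': a missing value stays empty
```

Removing the spaces before splitting means a stray space after a comma never ends up inside a term.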

In [None]:
Term3 = make_array()
for term in l0.column('Term2'):
    t3 = shorten(term)
    Term3 = np.append(Term3,t3)
l1 = l0.with_columns('Term3',Term3)
l1

In [None]:
e = l1.where('Language','English ST').column('Term3')
a = l1.where('Language','Afghan').column('Term3')

In [None]:
edit_distance(e[0],a[0])

In [None]:
np.arange(200)

In [None]:
array = make_array()
for i in np.arange(200):
    distance = edit_distance(e[i],a[i])
    array = np.append(array,distance)
array

In [None]:
np.sum(array)
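The same "sum of per-feature distances" idea can be sketched in plain Python with `zip`, which pairs up the two word lists without hard-coding their length. Everything here is an assumption for illustration: `lev` is a hypothetical stand-in for `edit_distance`, and the word lists are made up.

```python
def lev(w1, w2):
    """Hypothetical stand-in for edit_distance, keeping only two rows at a time."""
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, start=1):
        curr = [i]
        for j, c2 in enumerate(w2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]

# Made-up word lists, aligned so that position i is the same feature in both
lang_a = ['dog', 'cat', 'door']
lang_b = ['dag', 'cat', 'dor']

per_feature = [lev(x, y) for x, y in zip(lang_a, lang_b)]
print(per_feature)       # [1, 0, 1]
print(sum(per_feature))  # 2
```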

Finally, let's put our lines of code inside of a `for` loop to compare "English ST" to every other language in `l1` (including itself).

In [None]:
matrix = make_array()
for language in np.unique(l1.column('Language')):
    lang = l1.where('Language',language).column('Term3')
    engl = l1.where('Language','English ST').column('Term3')
    array = make_array()
    for i in np.arange(200):
        distance = edit_distance(lang[i],engl[i])
        array = np.append(array,distance)
    ld = np.sum(array)
    matrix = np.append(matrix,ld)
    print("Finished appending",language)
matrix

The `matrix` object is an array holding one language distance per language. We will use it to create our final `Table`.

In [None]:
matrix_table = Table().with_columns('Language',np.unique(l.column('Language')),'Language Distance',matrix)
matrix_table

According to your computed edit distance, which language appears to be the most closely related to English? The most distantly related? 

In [None]:
matrix_table.sort('Language Distance')

Can you think of a way to make `matrix_table` into an actual matrix, with additional columns for all 87 languages in the dataset?
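One possible direction, sketched on made-up data rather than the real table: loop over every pair of languages and store the summed distance in a nested list. The language names, the word lists, and the helper `lev` are all hypothetical.

```python
def lev(w1, w2):
    """Hypothetical stand-in for edit_distance, keeping only two rows at a time."""
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, start=1):
        curr = [i]
        for j, c2 in enumerate(w2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Made-up word lists: one list per language, aligned by feature
words = {
    'lang1': ['dog', 'cat'],
    'lang2': ['dag', 'cat'],
    'lang3': ['hund', 'katze'],
}
names = sorted(words)

# One row per language, one column per language
matrix = [
    [sum(lev(x, y) for x, y in zip(words[row], words[col])) for col in names]
    for row in names
]

for name, row in zip(names, matrix):
    print(name, row)
```

The diagonal is all zeros (every language is at distance 0 from itself) and the matrix is symmetric, so only half of the pairs actually need to be computed.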

## Visualization
A core aspect of data science is visualization. We currently don't have the tools to make a good visualization with this matrix. Even if you cannot create the figure right now, what do you think would be a good way to plot the language distances? If you can think of a way to showcase the language distances using what we have learned so far, do so below.

In [None]:
# write your code here