In [None]:
import pandas as pd # dataframe manipulation
import numpy as np # numerical computation
import matplotlib.pyplot as plt # plotting
from mpl_toolkits.basemap import Basemap # map plotting

Below is the installation command; uncomment it to install `basemap` in Google Colab.

In [None]:
### Google Colab installations ###
# !pip install basemap

## 1. Plotting linguistic diversity

### 1.1. Exploring the data

In this part of the practical, we are going to plot the linguistic diversity of the world. We will use the data from [Glottolog](https://glottolog.org/), a database of the world's languages. If the link that we use to load the data doesn't work, copy it from [here](https://github.com/alexeykosh/intro-to-ling/blob/main/S1/glottolog.csv) by clicking on the `raw` button, and replace the link below.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/alexeykosh/intro-to-ling/main/S1/glottolog.csv')

Let's look at the data first.

In [None]:
df.head(10)

Let's count the number of unique ISO 639-3 codes in the data. ISO 639-3 codes are unique three-letter identifiers for languages.

In [None]:
df['isocodes'].nunique()

Count the number of unique ISO-codes for entries that are labeled as language and not as dialect (see the column `level`):
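As a hint, the general pattern (filter rows with a boolean mask, then count the unique values in a column) can be sketched on a small made-up table. The DataFrame `toy` and its values are hypothetical; they only mimic the `level` and `isocodes` columns of the real data.

```python
import pandas as pd

# Hypothetical toy table mimicking two columns of the Glottolog data
toy = pd.DataFrame({
    'level': ['language', 'dialect', 'language', 'language'],
    'isocodes': ['aaa', 'aaa', 'bbb', None],
})

# Boolean mask: True for the rows labelled as 'language'
is_language = toy['level'] == 'language'

# nunique() ignores missing values by default
n_codes = toy.loc[is_language, 'isocodes'].nunique()
print(n_codes)
```

Note that the row with a missing code is not counted, which is worth keeping in mind when interpreting the result on the real data.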

In [None]:
## YOUR CODE HERE ##

Let's first make a subset of the data where we only keep the entries that have both longitude and latitude information, as well as those that are labelled as *language*.

<!-- insert image below -->

<img src="https://bam.files.bbci.co.uk/bam/live/content/z74msbk/large" alt="drawing" width="500"/>


In [None]:
df_coord = df.dropna(subset=['latitude', 'longitude']) # drop rows with missing values
df_coord = df_coord[df_coord['level'] == 'language'] # only keep languages
df_coord = df_coord[df_coord['isocodes'].notna()] # remove NA isocodes
df_coord.head(10) # show the first 10 rows

Let's check the shape (number of rows and columns) of the resulting subset:

In [None]:
df_coord.shape

Great, now we can plot these points to see whether the locations make sense.

In [None]:
plt.figure(figsize=(12, 6)) # set the size of the plot (in inches)
plt.axhline(0, color='black', lw=0.5) # add the equator
plt.scatter(df_coord['longitude'], df_coord['latitude'], s=1) # plot the data (size of the points = 1)
plt.xlabel('Longitude') # set the label of the x-axis
plt.ylabel('Latitude') # set the label of the y-axis
plt.show() # show the plot

What can you tell from this plot already? Do you notice that some of the regions of the world are more densely populated with languages than others? We can also visualize this by plotting a density plot of the locations of the languages.

For this, we will be using the hexbin plot. The hexbin plot is created by dividing the space into hexagons and counting the number of points in each hexagon. This is a great way to visualize the density of points in a scatter plot.

Try to create one by reading the documentation for the hexbin plot [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hexbin.html).
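To get a feel for the function before applying it to the language data, here is a minimal sketch on synthetic points. The two random clusters, the `gridsize` value and the colormap are arbitrary choices made for illustration, not settings taken from this practical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic points (hypothetical): two Gaussian clusters standing in for real coordinates
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-50, 10, 500), rng.normal(60, 20, 1500)])
y = np.concatenate([rng.normal(10, 5, 500), rng.normal(-5, 15, 1500)])

fig, ax = plt.subplots(figsize=(8, 4))
# gridsize = number of hexagons along the x-axis; mincnt=1 hides empty hexagons
hb = ax.hexbin(x, y, gridsize=30, cmap='viridis', mincnt=1)
fig.colorbar(hb, ax=ax, label='Number of points')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
```

The denser cluster shows up as brighter hexagons, which is the effect we are after for the language coordinates.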

In [None]:
## YOUR CODE HERE ##

### 1.2. Map basics

Great, however, we are still missing a map of the world. Let's add a map of the world to our plot. We can use the `basemap` library to do this. This library allows us to have great basemaps for our plots, and is compatible with `matplotlib`.

In [None]:
plt.figure(figsize=(12, 6)) # set the size of the plot (in inches)
m = Basemap(projection='cyl', # set the projection (we are using a cylindrical projection)
            lon_0=0, # set the center of the map
            resolution='c') # set the resolution (we are using 'crude', as we don't need a high resolution)
m.drawcoastlines(color='black') # draw the coastlines
plt.show() # show the plot

It would be nice to color the continents, so let's update the plot to do that.

In [None]:
plt.figure(figsize=(12, 6)) # set the size of the plot (in inches)
m = Basemap(projection='cyl', # set the projection (we are using a cylindrical projection)
            lon_0=0, # set the center of the map
            resolution='c') # set the resolution (we are using 'crude', as we don't need a high resolution)
m.drawcoastlines(color='black') # draw the coastlines
m.fillcontinents(color='gray', # set the color of the continents
                 lake_color='white') # set the color of the lakes
plt.show()

Maps can use many different projections, such as the Mercator projection, the Robinson projection, or the Mollweide projection. You can see some examples of these projections below:

<img src='https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikV_a3eFRALOvMNsGkE5gcHgSdw91pZDdPu8EkR-sJP9NYzR6lcbv-RrH67xEwosiWruacYndDnnR6yRExckZaj9oo1yI-_pBD_Wekhigzw_2yoGGUTdYMUaR9srvyuoEAGgRkiUDlyQxqxtBJBs5TfnHtMBSILE4P3Y5XS14bsjO6uIr16dmVZxmw5D85/s1920/InShot_20240506_185350804.jpg' alt="drawing" width="500" >

We can also change the projection by using the `projection` parameter of the `Basemap` constructor. Let's change the projection to `robin` (Robinson) and see how the plot changes.

In [None]:
plt.figure(figsize=(12, 6))
m = Basemap(projection='robin', # you can change the projection here
            lat_0=0, lon_0=0,
            resolution='c')
m.drawcoastlines(color='black')
m.fillcontinents(color='gray',
                 lake_color='white')
plt.show()

Let's try another projection, the Mollweide projection. Let's change the projection to `moll` and see how the plot changes. You can also find the full list of projections in the [documentation](https://matplotlib.org/basemap/stable/users/mapsetup.html) and experiment with them below.

In [None]:
## YOUR CODE HERE ##

### 1.3. Putting it all together

Now that we have a map of the world, let's add the scatter plot of the languages on top of it. We can do this by either using `matplotlib` or `basemap` directly. Let's use `basemap` directly to plot the languages on top of the map.

First, let's define x and y coordinates of our languages:

In [None]:
x_all = df_coord['longitude'].values
y_all = df_coord['latitude'].values

Then we can plot them on top of the map. Let's also make the map more transparent by adjusting the `alpha` parameter of the `fillcontinents` method:

In [None]:
plt.figure(figsize=(12, 6))
m = Basemap(projection='cyl',
            lat_0=0,
            lon_0=0,
            resolution='c')
m.drawcoastlines(color='black')
m.fillcontinents(color='gray',
                 lake_color='white',
                 alpha=0.5) # set the transparency of the continents to make the points more visible
# Plotting the languages
m.plot(x_all, y_all, 'ro', markersize=1)
plt.show()

Let's try another projection:

In [None]:
plt.figure(figsize=(12, 6))
m = Basemap(projection='moll', # you can change the projection here
            lat_0=0,
            lon_0=0,
            resolution='c')
m.drawcoastlines(color='black')
m.fillcontinents(color='gray',
                 lake_color='white',
                 alpha=0.5) # set the transparency of the continents to make the points more visible
m.plot(x_all, y_all, 'ro', markersize=1) # Plotting the languages
plt.show()

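Oops, something seems wrong here! We cannot see the points because `x_all` and `y_all` are still in degrees, whereas the Mollweide map uses projected map coordinates (in meters). The earlier `cyl` map worked because its map coordinates are simply longitude and latitude in degrees. Let's convert the longitude and latitude values to the projection by calling the `Basemap` object.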
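To see why raw degrees end up in the wrong place, here is a hand-written spherical Mercator transform in plain NumPy. It only illustrates what a projection does to coordinates: it is not the Mollweide formula and not Basemap's implementation, and the function name `mercator_xy` and the Earth radius value are assumptions made for this sketch.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in metres

def mercator_xy(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Project longitude/latitude in degrees to spherical Mercator metres."""
    lon = np.radians(lon_deg)
    lat = np.radians(lat_deg)
    x = radius * lon
    y = radius * np.log(np.tan(np.pi / 4 + lat / 2))
    return x, y

# Paris is at roughly 2.35 degrees East, 48.86 degrees North
x, y = mercator_xy(2.35, 48.86)
print(x, y)  # metres: far outside the -180..180 / -90..90 degree ranges
```

Plotting the values 2.35 and 48.86 on a map whose axes run over millions of metres would squeeze every point into one corner, which matches what we observed.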

In [None]:
plt.figure(figsize=(12, 6))
m = Basemap(projection='moll',
            lat_0=0,
            lon_0=0,
            resolution='c')
m.drawcoastlines(color='black')
m.fillcontinents(color='gray',
                 lake_color='white',
                 alpha=0.5) # set the transparency of the continents to make the points more visible
new_x, new_y = m(x_all, y_all) # convert the coordinates to the map coordinates
# Plotting the languages
m.plot(new_x, new_y, 'ro', markersize=1)
plt.show()

We can also add the hexbin plot on top of the map. Let's do this by using the `hexbin` method of the `Basemap` object.

In [None]:
plt.figure(figsize=(12, 6))
m = Basemap(projection='robin',
            lat_0=0,
            lon_0=0,
            resolution='c')
m.drawcoastlines(color='black')
m.fillcontinents(color='gray',
                 lake_color='white',
                 alpha=0.5) # set the transparency of the continents to make the points more visible
new_x, new_y = m(x_all, y_all) # convert the coordinates to the map coordinates
# Plotting the languages as a hexbin plot
m.hexbin(x=new_x,
         y=new_y,
         gridsize=100, # set the number of hexagons in the x-direction
         bins='log', # log scale for the number of languages
         cmap='hot') # set the color map
m.colorbar(label='Number of languages', # set the label of the colorbar
           location='bottom') # set the location of the colorbar
# Plot parallels and meridians
m.drawparallels(np.arange(-90., 120., 30.), labels=[1, 0, 0, 0]) # the list in the labels parameter sets the visibility of the labels (left, right, top, bottom)
m.drawmeridians(np.arange(0., 420., 60.), labels=[0, 0, 1, 0])
plt.show()

### 1.4. Exploring the linguistic diversity relative to the equator

We have plotted the linguistic diversity on a map. If you look at the map, you can see that the linguistic diversity is not evenly distributed across the world. Some regions, like Sub-Saharan Africa or South-East Asia, have more languages than others, while regions like the Arctic or the Sahara desert have fewer.

One thing that we can look at is the density of languages relative to the equator. We can do this by plotting the latitude of the languages against the number of languages at that latitude. To do so, we need to group the languages into latitude bins and count the number of languages in each bin, which is exactly what a histogram does.
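The binning and counting step can be sketched explicitly with `np.histogram`, which `plt.hist` relies on for us. The latitude values and the 30-degree bin width below are made up for illustration.

```python
import numpy as np

# Hypothetical latitudes in degrees (negative = south of the equator)
latitudes = np.array([-35.0, -12.5, -3.0, 1.5, 4.0, 7.5, 9.0, 23.0, 48.0, 61.0])

# Bin edges every 30 degrees from -90 to 90
edges = np.arange(-90, 91, 30)
counts, edges = np.histogram(latitudes, bins=edges)

# Print the number of values falling into each latitude band
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(int(left), 'to', int(right), ':', int(n))
```

Each bar of a histogram is one such band, and its height is the corresponding count.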

**Before we start doing this, do you have any hypotheses about the distribution of languages relative to the equator? Will there be more languages above or below the equator? Why?**

In [None]:
plt.figure(figsize=(8, 4))
plt.hist(df_coord['latitude'], # latitude values
         )
plt.xlabel('Latitude') # set the label of the x-axis
plt.ylabel('Number of languages')
plt.show()

Nice, but it's not very pretty, and also not very informative. Let's make a better plot by adjusting the number of bins and the color of the plot.

In [None]:
plt.figure(figsize=(8, 4))
plt.hist(df_coord['latitude'], # latitude values
         bins=100,
         color='grey')
plt.xlabel('Latitude') # set the label of the x-axis
plt.show()

Let's also add a line at the equator, and center the x-axis on it.

In [None]:
plt.figure(figsize=(8, 4))
plt.hist(df_coord['latitude'], # latitude values
         bins=100,
         color='grey')
plt.xlabel('Latitude') # set the label of the x-axis
plt.axvline(x=0, color='red', lw=2) # add the equator
plt.xlim(-80, 80)
plt.show()

Let's also label the left side as below the equator and the right side as above the equator.

In [None]:
plt.figure(figsize=(8, 4))
plt.hist(df_coord['latitude'], # latitude values
         bins=100,
         color='grey')
plt.xlabel('Latitude') # set the label of the x-axis
plt.axvline(x=0, color='red', lw=2) # add the equator
plt.xlim(-80, 80)
plt.text(-70, 350, 'Below the equator', fontsize=12, color='red') # add a text
plt.text(35, 350, 'Above the equator', fontsize=12, color='red') # add a text
plt.show()

It seems that there are indeed more languages above the equator than below. Let's now count the languages below and above the equator. Note that latitude is positive above the equator and negative below it. You will need to use the pandas [query](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html) method.
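As a hint, here is how `query` behaves on a small made-up table. The DataFrame `toy` and its `value` column are hypothetical and unrelated to the language data.

```python
import pandas as pd

# Hypothetical toy table; 'value' stands in for any numeric column
toy = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                    'value': [-2.0, 0.5, 3.0, -7.5]})

# The condition is written as a string that refers to column names
positive = toy.query('value > 0')
negative = toy.query('value < 0')

print(len(positive), len(negative))
```

`query` returns a filtered DataFrame, so `len()` or `.shape[0]` gives the number of matching rows.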

In [None]:
## YOUR CODE HERE ###

There are indeed more languages above the equator than below. This is interesting, as it shows that the linguistic diversity is not evenly distributed across the world. However, it might be due to the Earth having more landmass above the equator than below. We can also show it using a bar plot.

To do this, you first need to add a new column `equator_relative` with values `Above` and `Below` using [np.where](https://numpy.org/doc/stable/reference/generated/numpy.where.html).
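As a hint, `np.where(condition, value_if_true, value_if_false)` picks one of two values element by element. The sketch below uses a hypothetical `temperature` column and made-up labels rather than the language data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({'temperature': [-5.0, 12.0, 0.0, 30.5]})

# Label each row depending on whether the condition holds
toy['state'] = np.where(toy['temperature'] > 0, 'Warm', 'Cold')
print(toy['state'].value_counts())
```

Values exactly equal to the threshold (here `0.0`) fall into the second category, so decide explicitly how you want to treat them.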

In [None]:
## YOUR CODE HERE ##

If you did it right, you should be able to produce the bar plot below:

In [None]:
plt.figure(figsize=(8, 4))
df_coord['equator_relative'].value_counts().plot(kind='bar', color='grey') # plot the counts of the values
plt.ylabel('Number of languages') # set the label of the y-axis
plt.xlabel('Position relative to the equator') # set the label of the x-axis
plt.show()

### 1.5. Plotting a specific region

Let's now focus on one specific region, and look at France, for example. We can do this by setting the limits of the map to the region we are interested in.

In [None]:
# the window is [41, 51, -5, 10], only choosing the languages in this window
df_window = df_coord[(df_coord['latitude'] >= 41) & (df_coord['latitude'] <= 51) \
                     & (df_coord['longitude'] >= -5) & (df_coord['longitude'] <= 10)]

In [None]:
# remove names where there are numbers or the word "sign"
df_window = df_window[~df_window['name'].str.contains(r'\d|sign', case=False, na=False)] # na=False treats missing names as non-matches

In [None]:
names = df_window['name'].values.tolist()
x_w = df_window['longitude'].values
y_w = df_window['latitude'].values

In [None]:
plt.figure(figsize=(7, 7))
# plot the map of France
m = Basemap(projection='merc',
            llcrnrlat=41,
            urcrnrlat=51,
            llcrnrlon=-5,
            urcrnrlon=10,
            lat_ts=20,
            resolution='i')
m.drawcoastlines()
m.drawcountries()
m.drawmapboundary()
m.fillcontinents(color='grey',
                 lake_color='white',
                 alpha=0.5)
x_new, y_new = m(x_w, y_w)
# m.plot(x_new, y_new, 'ro', markersize=5)
plt.scatter(x_new, y_new, s=20, c='red', alpha=1)
for i, name in enumerate(names):
    plt.text(x_new[i], y_new[i], name, fontsize=8, ha='left')
plt.show()

Now it's your turn. Choose one language on the map, and explore its Glottolog page. What can you learn from it?

## 2. Predicting linguistic diversity

In his book *Linguistic Diversity* (1999), Daniel Nettle hypothesizes that linguistic diversity is predicted by the fertility of the land. To test this hypothesis, he collected data from 74 countries and measured each country's linguistic diversity as the number of languages spoken there. He measured the fertility of the land as the number of months in which crops can be grown (MGS, mean growing season), and also included data on population.

Let's take a look at the data first.

In [None]:
data_nettle = pd.read_csv('https://raw.githubusercontent.com/'\
                          'bodowinter/applied_statistics_book_data/'\
                          'master/nettle_1999_climate.csv')
data_nettle.head(10)

Let's count the number of countries in the data.

In [None]:
data_nettle.Country.unique().shape

Min, max and median values of the mean growing season:

In [None]:
print(f'Minimum MGS: {data_nettle.MGS.min()}')
print(f'Maximum MGS: {data_nettle.MGS.max()}')
print(f'Median MGS: {data_nettle.MGS.median()}')

Same for population size:

In [None]:
### YOUR CODE HERE ###

Why do these values seem a bit off? Because the raw population size (in number of people) was transformed to a $\log_{10}$ scale. Let's transform it back to the original scale by raising 10 to the power of each value.
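The back-transformation can be checked on a few made-up numbers. The values in `log_values` are hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical log10-transformed values
log_values = np.array([2.0, 3.5, 4.0])

# Inverse of log10: raise 10 to the power of each value
raw_values = 10.0 ** log_values
print(raw_values)

# Round trip: applying log10 again recovers the transformed values
print(np.log10(raw_values))
```

A difference of 1 on the log scale therefore corresponds to a factor of 10 on the original scale.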

In [None]:
data_nettle['Population_exp'] = 10**data_nettle['Population']

Australia's population in the 1990s was around 17 million people. So, as the row below shows, the population size was originally recorded in thousands of people.

In [None]:
data_nettle[data_nettle.Country == 'Australia']

Let's recompute the min, max and median using the exponentiated values multiplied by 1000:

In [None]:
print(f"Min population: {(data_nettle.Population_exp.min() * 1000).round(0)}")
print(f"Max population: {(data_nettle.Population_exp.max() * 1000).round(0)}")
print(f"Median population: {(data_nettle.Population_exp.median() * 1000).round(0)}")

OK, let's now think scientifically. We have three variables: what are the possible relations between population size, mean growing season, and the number of languages?

One possible scenario is that population size is influenced by the mean growing season, as people might tend to migrate to more prosperous regions. Only then might multiple languages appear, thanks to the favorable conditions of the environment.

Let's draw a graph showing this hypothesis that we outlined:

In [None]:
# NB: don't worry about the networkx details for now; we will work on it in TD3

import networkx as nx

graph = nx.DiGraph()
graph.add_edges_from([("MGS", "population size"), ("population size", "number of \n languages")])

nx.draw_networkx(graph,
                 arrows=True,
                 node_size=2000,
                 node_color='white')
plt.axis('off') # hide the axes around the graph
plt.show()

Let's plot the mean growing season (stored in the column `MGS`) against the log-transformed population size (stored in the column `Population`). What are your predictions? Do you think there is a correlation between the mean growing season and the population size?

In [None]:
plt.figure(figsize=(8, 4))
plt.scatter(data_nettle['MGS'], data_nettle['Population'], s=10)
plt.ylabel('Population (log scale)')
plt.xlabel('Mean Growing Season')
plt.show()

In [None]:
np.corrcoef(data_nettle['Population'], data_nettle['MGS'])[0, 1]

What can you tell from this?
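As a reminder of what the number returned by `np.corrcoef` measures, here is a sketch that computes the Pearson correlation coefficient by hand on made-up data and compares it with NumPy's result. The arrays `x` and `y` are hypothetical.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r: sum of products of centered values,
# divided by the square root of the product of the sums of squares
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# Same quantity from NumPy's correlation matrix (off-diagonal entry)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The coefficient ranges from -1 to 1: values near 0 indicate no linear relationship, while values near the extremes indicate a strong negative or positive one.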

Now let's do the same for population and the number of languages:

In [None]:
### YOUR CODE HERE ###

Finally, we will look at MGS and number of languages:

In [None]:
### YOUR CODE HERE ###