## Problem 7.1: Hacker stats and Darwin's finches

In [1]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import numba

import bebi103

import altair as alt
import altair_catplot as altcat

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

**a)** We start with a little tidying of the data. Think about how you will deal with duplicate measurements of the same bird and make a decision on how those data are to be treated.

To start tidying up the data, we load in the data set and look at its structure.

In [2]:
# Read in the data
df = pd.read_csv('../data/finch_beaks.csv', comment='#')

# Take a look
df.head()

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
0,20123,fortis,9.25,8.05,1973
1,20126,fortis,11.35,10.45,1973
2,20128,fortis,10.15,9.55,1973
3,20129,fortis,9.95,8.75,1973
4,20133,fortis,11.55,10.15,1973


Let's look for duplicates based on band number and year.

In [3]:
pd.concat(g for _, g in df.groupby(['band', 'year']) if len(g) > 1)

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
102,316,fortis,10.9,9.7,1975
103,316,fortis,10.9,9.85,1975
304,818,fortis,10.2,9.0,1975
305,818,fortis,10.2,9.0,1975
363,944,fortis,10.3,8.3,1975
364,944,fortis,10.3,8.3,1975
365,945,fortis,11.6,10.8,1975
366,945,fortis,11.6,10.8,1975
2057,19028,fortis,12.5,8.9,2012
2178,19028,scandens,12.5,8.9,2012


It looks like for ID 19028, the species was identified as fortis and then as scandens, but the beak measurements are the same. For IDs 818, 944, and 945, the rows are exact duplicates. For band 316, the beak depth measurement is different.
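The same two kinds of duplicates can also be flagged with boolean masks instead of a `groupby`. The sketch below uses a small toy data frame (band 900 and the exact values are illustrative, not taken from the data set) and `DataFrame.duplicated`, which is a standard pandas method.

```python
import pandas as pd

# Toy rows mimicking the structure above; values are illustrative only
toy = pd.DataFrame({
    'band': [316, 316, 818, 818, 900],
    'species': ['fortis'] * 5,
    'beak depth (mm)': [9.7, 9.85, 9.0, 9.0, 8.5],
    'year': [1975, 1975, 1975, 1975, 1975],
})

# keep=False marks every row that shares its (band, year) with another row
repeated = toy[toy.duplicated(subset=['band', 'year'], keep=False)]

# With no subset, a row is flagged only if all columns match an earlier row
exact = toy[toy.duplicated(keep='first')]

print(repeated)
print(exact)
```

Here `repeated` contains both rows for bands 316 and 818, while `exact` contains only the second row for band 818, since the two rows for band 316 differ in beak depth.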

Let's first drop complete duplicates and check that these rows were deleted.

In [4]:
df_deduped = df.drop_duplicates()

In [5]:
len(df)

2304

In [6]:
len(df_deduped)

2301

Three rows were indeed deleted.

In [7]:
pd.concat(g for _, g in df_deduped.groupby(['band', 'year']) if len(g) > 1)

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
102,316,fortis,10.9,9.7,1975
103,316,fortis,10.9,9.85,1975
2057,19028,fortis,12.5,8.9,2012
2178,19028,scandens,12.5,8.9,2012


In [8]:
df.loc[df['band'] == 19028]

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
2057,19028,fortis,12.5,8.9,2012
2178,19028,scandens,12.5,8.9,2012


In [9]:
df.loc[df['band'] == 316]

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
102,316,fortis,10.9,9.7,1975
103,316,fortis,10.9,9.85,1975


It makes the most sense to delete both rows for band 19028, since we cannot tell which species the bird belongs to and there are no measurements from other years to check against. For band 316, we could take the mean of the two beak depths, delete one of the rows, or delete both.

For now, we delete all four rows.

In [10]:
# Drop rows with duplicates
df_deduped = df_deduped.drop([2057,2178,102,103])

# Reset dataframe index
df_deduped = df_deduped.reset_index(drop=True)

In [11]:
len(df_deduped)

2297
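As an alternative to dropping band 316, the averaging option mentioned above could be sketched as follows. The two band 316 rows use the values shown earlier; the band 400 row is a hypothetical stand-in for the rest of the data. This is one possible treatment, not the one used in the rest of this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'band': [316, 316, 400],
    'species': ['fortis', 'fortis', 'scandens'],
    'beak length (mm)': [10.9, 10.9, 14.0],
    'beak depth (mm)': [9.7, 9.85, 8.4],   # band 400 values are made up
    'year': [1975, 1975, 1975],
})

# Collapse to one row per (band, species, year); measurements are averaged
averaged = toy.groupby(['band', 'species', 'year'], as_index=False).mean()

print(averaged)
```

Grouping on species as well as band and year means a bird recorded under two species, like band 19028, would not be silently merged.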

We can also check for recordings of the same bird in different years.

In [12]:
pd.concat(g for _, g in df_deduped.groupby(['band', 'species']) if len(g) > 1)

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
418,364,scandens,14.00,8.40,1975
1285,364,scandens,14.00,8.40,1987
274,720,fortis,12.20,10.50,1975
1223,720,fortis,12.20,10.50,1987
732,2639,fortis,10.30,8.95,1987
1430,2639,fortis,10.30,8.95,1991
878,2666,fortis,12.81,9.30,1987
1431,2666,fortis,12.81,9.30,1991
1202,2753,fortis,10.89,10.35,1987
1432,2753,fortis,10.89,10.35,1991


Some birds were measured in more than one year. For example, both measurements of the bird with band 364 record the same beak length and beak depth. If this holds for all such repeated measurements, it could mean that the bird had not grown and that the initial measurement was accurate. If the values did change, perhaps the bird grew from a juvenile to an adult. For now, we won't do anything with this information.
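Whether the repeated measurements agree for every such bird can be checked by counting distinct values per band. The sketch below uses toy data: bands 364 and 720 carry the values shown above, and band 999 is a hypothetical bird whose beak length changed.

```python
import pandas as pd

toy = pd.DataFrame({
    'band': [364, 364, 720, 720, 999, 999],
    'year': [1975, 1987, 1975, 1987, 1987, 1991],
    'beak length (mm)': [14.0, 14.0, 12.2, 12.2, 10.0, 10.4],
    'beak depth (mm)': [8.4, 8.4, 10.5, 10.5, 9.0, 9.0],
})

cols = ['beak length (mm)', 'beak depth (mm)']

# Number of distinct values recorded per band for each measurement
n_distinct = toy.groupby('band')[cols].nunique()

# Bands for which at least one measurement differs between years
changed = n_distinct[(n_distinct > 1).any(axis=1)].index.tolist()

print(changed)
```

An empty `changed` list on the real data would mean every repeated bird has identical measurements across years.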

**b)** Plot ECDFs of the beak depths of Geospiza scandens in 1975 and in 2012. Then, estimate the mean beak depth for each of these years with confidence intervals.

First let's slice out the beak depth values in numpy arrays.

In [13]:
inds_1975 = ((df_deduped['species'] == 'scandens') &
        (df_deduped['year'] == 1975))
    
inds_2012 = ((df_deduped['species'] == 'scandens') &
        (df_deduped['year'] == 2012))

depths_1975 = df_deduped.loc[inds_1975, 'beak depth (mm)'].values
depths_2012 = df_deduped.loc[inds_2012, 'beak depth (mm)'].values

Now we plot ECDFs of the beak depths of Geospiza scandens in 1975 and 2012.

In [24]:
p = bebi103.viz.ecdf(depths_1975,
                     x_axis_label='Scandens beak depth (mm)',
                     color='#4e79a7',
                     legend='1975')

p = bebi103.viz.ecdf(depths_2012,
                     x_axis_label='beak depth (mm)',
                     color='#f28e2b',
                     legend='2012',
                     p=p)

p.legend.location = 'bottom_right'

bokeh.io.show(p)
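For reference, the values an ECDF plot displays can be computed by hand. This is a minimal sketch of the usual definition, not necessarily how `bebi103.viz.ecdf` is implemented; `ecdf_vals` is a helper name chosen here and the input values are illustrative.

```python
import numpy as np

def ecdf_vals(data):
    """Return sorted data x and ECDF heights y for a 1D array."""
    x = np.sort(data)
    # The i-th smallest point (1-indexed) has ECDF value i / n
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf_vals(np.array([9.0, 8.4, 9.3, 8.8]))
print(x, y)
```

Each y value is the fraction of measurements less than or equal to the corresponding x value.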

The ECDFs suggest that beak depths in 2012 are larger on the whole than they were in 1975. Now let's estimate the mean beak depth in each of these years.

In [15]:
mean_1975 = np.mean(depths_1975)
mean_2012 = np.mean(depths_2012)

print("""
Mean beak depth in 1975 (mm): {0:.2f}
Mean beak depth in 2012 (mm): {1:.2f}
""".format(mean_1975, mean_2012))


Mean beak depth in 1975 (mm): 8.96
Mean beak depth in 2012 (mm): 9.19



The mean beak depth in 2012 is greater than the mean beak depth in 1975.

We will also estimate confidence intervals. First we have some functions from Justin's tutorial, and then we draw bootstrap samples and calculate the confidence intervals.

In [16]:
@numba.jit(nopython=True)
def draw_bs_sample(data):
    """
    Draw a bootstrap sample from a 1D data set.
    """
    return np.random.choice(data, size=len(data))


def draw_bs_reps(data, stat_fun, size=1):
    """
    Draw bootstrap replicates computed with stat_fun from 1D data set.
    """
    return np.array([stat_fun(draw_bs_sample(data)) for _ in range(size)])


@numba.jit(nopython=True)
def draw_bs_reps_mean(data, size=1):
    """
    Draw bootstrap replicates of the mean from 1D data set.
    """
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out

In [17]:
# Draw bootstrap samples
bs_reps_mean_1975 = draw_bs_reps_mean(depths_1975, size=10000)
bs_reps_mean_2012 = draw_bs_reps_mean(depths_2012, size=10000)

In [18]:
# 95% confidence intervals
mean_1975_conf_int = np.percentile(bs_reps_mean_1975, [2.5, 97.5])
mean_2012_conf_int = np.percentile(bs_reps_mean_2012, [2.5, 97.5])

print("""
Mean beak depth 95% conf int in 1975 (mm): [{0:.2f}, {1:.2f}]
Mean beak depth 95% conf int in 2012 (mm): [{2:.2f}, {3:.2f}]
""".format(*(tuple(mean_1975_conf_int) + tuple(mean_2012_conf_int))))


Mean beak depth 95% conf int in 1975 (mm): [8.84, 9.08]
Mean beak depth 95% conf int in 2012 (mm): [9.07, 9.31]



There is slight overlap between these two confidence intervals. 
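Because the two intervals overlap slightly, it can be more informative to bootstrap the difference of means directly. The sketch below shows the idea on synthetic data: the means 8.96 and 9.19 come from the output above, but the spreads and sample sizes are assumptions, so the resulting interval says nothing about the real finches. `bs_diff_of_means` is a name chosen here.

```python
import numpy as np

rng = np.random.default_rng(3252)

def bs_diff_of_means(x, y, size=10000):
    """Bootstrap replicates of mean(y) - mean(x), resampling each set separately."""
    out = np.empty(size)
    for i in range(size):
        x_bs = rng.choice(x, size=len(x))
        y_bs = rng.choice(y, size=len(y))
        out[i] = y_bs.mean() - x_bs.mean()
    return out

# Synthetic stand-ins for depths_1975 and depths_2012 (assumed spread and size)
x = rng.normal(8.96, 0.6, size=90)
y = rng.normal(9.19, 0.7, size=120)

reps = bs_diff_of_means(x, y, size=2000)
ci = np.percentile(reps, [2.5, 97.5])
print(ci)
```

With the real arrays, one would pass `depths_1975` and `depths_2012` in place of `x` and `y`; an interval that excludes zero would point to a shift in the mean.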

We can use the bootstrap replicates to plot ECDFs of the bootstrap estimates of the mean beak depth in 1975 and 2012.

In [47]:
p = bebi103.viz.ecdf(bs_reps_mean_1975,
                     x_axis_label='beak depth (mm)',
                     color='#4e79a7',
                     legend='1975')
p = bebi103.viz.ecdf(bs_reps_mean_2012, color='#f28e2b', legend='2012', p=p)

p.legend.location = 'bottom_right'

bokeh.io.show(p)

The two sets of replicates overlap only in their tails: the 97.5th percentile for 1975 (9.08 mm) sits just above the 2.5th percentile for 2012 (9.07 mm).

**c)** Perform a hypothesis test comparing the G. scandens beak depths in 1975 and 2012. Carefully state your null hypothesis, your test statistic, and your definition of what it means to be at least as extreme as the observed test statistic. Comment on the results. It might be interesting to know that a severe drought in 1976 and 1977 resulted in the death of the plants that produce small seeds on the island.

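Our null hypothesis: the G. scandens beak depths in 1975 and in 2012 are identically distributed.

Our test statistic: the difference of means, that is, the mean beak depth in 2012 minus the mean beak depth in 1975.

"At least as extreme as the observed test statistic": the difference of means of a permutation sample is greater than or equal to the observed difference of means.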

In [22]:
@numba.jit(nopython=True)
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)
    return concat_data[:len(x)], concat_data[len(x):]

@numba.jit(nopython=True)
def draw_perm_reps_diff_mean(x, y, size=1):
    """
    Generate array of permutation replicates.
    """
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = np.mean(x_perm) - np.mean(y_perm)
    return out

In [23]:
# Compute test statistic for original data set
diff_mean = np.mean(depths_2012) - np.mean(depths_1975)

# Draw replicates
perm_reps = draw_perm_reps_diff_mean(depths_2012, depths_1975, size=100000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_mean) / len(perm_reps)

print('p-value =', p_val)

p-value = 0.00486


In [31]:
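# Compute test statistic for original data set. The observed difference
# is positive, so np.abs leaves it unchanged; because perm_reps is not
# passed through np.abs below, this is the same one-sided test as above.
diff_mean = np.abs(np.mean(depths_2012) - np.mean(depths_1975))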

# Draw replicates
perm_reps = draw_perm_reps_diff_mean(depths_2012, depths_1975, size=100000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_mean) / len(perm_reps)

print('p-value =', p_val)

p-value = 0.00512
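A two-sided version of the test would take the absolute value of the permutation replicates as well as of the observed statistic. The following is a minimal sketch in plain NumPy on synthetic data (the means, spreads, and sample sizes are assumptions), so the p-values it prints are not results for the finch data.

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_reps_diff_mean(x, y, size=1):
    """Permutation replicates of mean(x) - mean(y)."""
    data = np.concatenate((x, y))
    out = np.empty(size)
    for i in range(size):
        perm = rng.permutation(data)
        out[i] = perm[:len(x)].mean() - perm[len(x):].mean()
    return out

# Synthetic stand-ins for depths_2012 and depths_1975
x = rng.normal(9.2, 0.7, size=100)
y = rng.normal(9.0, 0.6, size=80)

obs = np.abs(x.mean() - y.mean())
reps = perm_reps_diff_mean(x, y, size=5000)

# Two-sided: extreme in either direction; one-sided: extreme in one direction
p_two_sided = np.mean(np.abs(reps) >= obs)
p_one_sided = np.mean(reps >= obs)
print(p_two_sided, p_one_sided)
```

The two-sided p-value can never be smaller than the one-sided one, and for a roughly symmetric permutation distribution it is about twice as large.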


**d)** Devise a measure for the shape of a beak. That is, invent some scalar measure that combines both the length and depth of the beak. Compare this measure between species and through time. (This is very open-ended. It is up to you to define the measure, make relevant plots, compute confidence intervals, and possibly do hypothesis tests to see how shape changes over time and between the two species.)

Our measure will be the ratio of beak length to beak depth, and we will look at how this measure changes through time.

In [50]:
# Calculate the ratio of beak length to beak depth
df_deduped['beak length to depth ratio'] = df_deduped['beak length (mm)'] / df_deduped['beak depth (mm)']

# Take a look
df_deduped.head()

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year,beak length to depth ratio
0,20123,fortis,9.25,8.05,1973,1.149068
1,20126,fortis,11.35,10.45,1973,1.086124
2,20128,fortis,10.15,9.55,1973,1.062827
3,20129,fortis,9.95,8.75,1973,1.137143
4,20133,fortis,11.55,10.15,1973,1.137931


We slice out the beak length to depth ratios for each species for convenience.

In [52]:
alpha_s = df_deduped.loc[df_deduped['species'] == 'scandens', ['year', 'beak length to depth ratio']]
alpha_f = df_deduped.loc[df_deduped['species'] == 'fortis', ['year', 'beak length to depth ratio']]

Now we plot the ECDF of the beak length to depth ratio per species over time. 

In this plot, one thick line per species represents the ECDF of all of that species' beak length to depth ratios combined. Thinner lines represent the ECDF for each individual year, with darker lines corresponding to later years.

In [28]:
p = bebi103.viz.ecdf(alpha_s['beak length to depth ratio'], formal=False, size=4, fill_alpha=0, color='#4e79a7', legend='scandens')
p = bebi103.viz.ecdf(alpha_f['beak length to depth ratio'], p=p, formal=False, size=4, fill_alpha=0, color='#f28e2b', legend='fortis')

years = df_deduped['year'].unique()
marker_year = np.linspace(0.2, 0.6, len(years))

for i, year in enumerate(df_deduped['year'].unique()):
    p = bebi103.viz.ecdf(alpha_f.loc[alpha_f['year'] == year, 'beak length to depth ratio'], p=p, formal=True, line_width=2, color='#f28e2b', alpha = marker_year[i])
    p = bebi103.viz.ecdf(alpha_s.loc[alpha_s['year'] == year, 'beak length to depth ratio'], p=p, formal=True, line_width=2, color='#4e79a7', alpha = marker_year[i])

p.legend.location = 'bottom_right'
bokeh.io.show(p)

We can calculate confidence intervals for the mean beak length to depth ratio of each species in 1975 and 2012, first for G. scandens and then for G. fortis.
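The cells that follow call `mean_conf_int`, which is not defined anywhere in this notebook as shown. A minimal sketch, assuming it wraps the bootstrap and percentile steps used for the beak depths in part (b), might look like this. The `ptiles` parameter and the demo values are assumptions for illustration.

```python
import numpy as np

def mean_conf_int(data, size=10000, ptiles=(2.5, 97.5)):
    """Bootstrap percentile confidence interval for the mean of 1D data."""
    reps = np.empty(size)
    for i in range(size):
        # Resample with replacement and record the mean of the resample
        reps[i] = np.mean(np.random.choice(data, size=len(data)))
    return np.percentile(reps, ptiles)

np.random.seed(42)
demo = np.array([1.55, 1.58, 1.60, 1.57, 1.62, 1.54])  # illustrative ratios
low, high = mean_conf_int(demo, size=2000)
print(low, high)
```

The function returns a length-2 array, which is why the cells below can wrap the result in `tuple(...)` before formatting.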

In [29]:
print("""
G. scandens mean beak ratio 95% conf int 1975: [{0:.2f}, {1:.2f}]
G. scandens mean beak ratio 95% conf int 2012: [{2:.2f}, {3:.2f}]
""".format(*(tuple(mean_conf_int(alpha_s.loc[alpha_s['year'] == 1975, 'beak length to depth ratio'].values, size=10000)) + 
             tuple(mean_conf_int(alpha_s.loc[alpha_s['year'] == 2012, 'beak length to depth ratio'].values, size=10000)))))


G. scandens mean beak ratio 95% conf int 1975: [1.56, 1.60]
G. scandens mean beak ratio 95% conf int 2012: [1.45, 1.48]



In [28]:
print("""
G. fortis mean beak ratio 95% conf int 1975: [{0:.2f}, {1:.2f}]
G. fortis mean beak ratio 95% conf int 2012: [{2:.2f}, {3:.2f}]
""".format(*(tuple(mean_conf_int(alpha_f.loc[alpha_f['year'] == 1975, 'beak length to depth ratio'].values, size=10000)) + 
             tuple(mean_conf_int(alpha_f.loc[alpha_f['year'] == 2012, 'beak length to depth ratio'].values, size=10000)))))


G. fortis mean beak ratio 95% conf int 1975: [1.15, 1.16]
G. fortis mean beak ratio 95% conf int 2012: [1.21, 1.24]



**e)** Introgressive hybridization occurs when a G. scandens bird mates with a G. fortis bird, and then the offspring mates again with pure G. scandens. This brings traits from G. fortis into the G. scandens genome. As this may be a mode by which beak geometries of G. scandens change over time, it is useful to know how heritable a trait is. Heritability is defined as the ratio of the covariance between parents and offspring to the variance of the parents alone. To be clear, the heritability is defined as follows.

1. Compute the average value of a trait in a pair of parents.
2. Compute the average value of that trait among the offspring of those parents.
3. Do this for each set of parents/offspring. Using this data set, compute the covariance among all average offspring and the variance among all average parents.

This is a more apt definition than, say, the Pearson correlation, because it is a direct comparison between parents and offspring.
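Written as a formula, the definition above reads as follows. The symbols are notation introduced here for illustration: $p_i$ is the average trait value of the parents in family $i$, $o_i$ is the average trait value of their offspring, and bars denote means over all families.

```latex
h^2 = \frac{\mathrm{cov}(p, o)}{\mathrm{var}(p)}
    = \frac{\sum_i (p_i - \bar{p})(o_i - \bar{o})}{\sum_i (p_i - \bar{p})^2}
```

The normalization factors of the covariance and the variance cancel in the ratio, which is why only the sums appear in the second expression.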

Heritability data for beak depth for G. fortis and G. scandens can be found here and here, respectively. (Be sure to look at the files before reading them in; they do have different formats.) From these data, compute the heritability of beak depth in the two species, with confidence intervals. How do they differ, and what consequences might this have for introgressive hybridization?

First we read in the data.

In [33]:
df_hs = pd.read_csv('../data/scandens_beak_depth_heredity.csv', comment='#')
df_hs.head()

Unnamed: 0,mid_parent,mid_offspring
0,8.3318,8.419
1,8.4035,9.2468
2,8.5317,8.1532
3,8.7202,8.0089
4,8.7089,8.2215


In [36]:
df_hf = pd.read_csv('../data/fortis_beak_depth_heredity.csv', comment='#')
df_hf.head()

Unnamed: 0,Mid-offspr,Male BD,Female BD
0,10.7,10.9,9.3
1,9.78,10.7,8.4
2,9.48,10.7,8.1
3,9.6,10.7,9.8
4,10.27,9.85,10.4


We see that the two dataframes have different formats. For the fortis data set, we need to average the beak depths of the male and female parents (from the header of the scandens data set, we know this is how mid_parent is calculated).

In [37]:
df_hf['mid_parent'] = (df_hf['Male BD'] + df_hf['Female BD']) / 2
df_hf['mid_offspring'] = df_hf['Mid-offspr']
df_hf = df_hf.drop(columns=['Male BD', 'Female BD', 'Mid-offspr'])
df_hf.head()

Unnamed: 0,mid_parent,mid_offspring
0,10.1,10.7
1,9.55,9.78
2,9.4,9.48
3,10.25,9.6
4,10.125,10.27


Now let's slice out the data as numpy arrays.

In [39]:
hs_par = df_hs['mid_parent'].values
hs_offsp = df_hs['mid_offspring'].values

In [38]:
hf_par = df_hf['mid_parent'].values
hf_offsp = df_hf['mid_offspring'].values

Draw bootstrap pairs and define a function to calculate the heritability, which is the ratio of the covariance between parents and offspring to the variance of the parents.

In [40]:
@numba.jit(nopython=True)
def draw_bs_pairs(x, y):
    """
    Draw a pairs bootstrap sample.
    """
    inds = np.arange(len(x))
    bs_inds = draw_bs_sample(inds)
    return x[bs_inds], y[bs_inds]

@numba.jit(nopython=True)
def heritability(par, offsp):
    """
    Heritability is calculated as the ratio of
    the covariance between parents and offspring
    to the variance of the parents.
    """
    assert len(par) == len(offsp), 'Length of the arrays must be the same.'

    return np.sum((par - np.mean(par)) * (offsp - np.mean(offsp))) \
            / np.sum((par - np.mean(par)) * (par - np.mean(par)))
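As a sanity check, the ratio of sums used in `heritability` should agree with the ratio computed from `np.cov`, and also with the slope of a least squares line of offspring against parents. The sketch below re-declares the function without numba so that it runs on its own, and uses only the five scandens rows displayed above, so the value it produces is not the heritability of the full data set.

```python
import numpy as np

def heritability(par, offsp):
    """Covariance of parents and offspring over variance of parents."""
    dp = par - np.mean(par)
    return np.sum(dp * (offsp - np.mean(offsp))) / np.sum(dp * dp)

# First five rows of the scandens heredity data shown above
par = np.array([8.3318, 8.4035, 8.5317, 8.7202, 8.7089])
offsp = np.array([8.419, 9.2468, 8.1532, 8.0089, 8.2215])

# 2x2 covariance matrix; the normalization cancels in the ratio
cov = np.cov(par, offsp)
h_cov = cov[0, 1] / cov[0, 0]

# Slope of the least squares line offsp = slope * par + intercept
slope = np.polyfit(par, offsp, 1)[0]

print(heritability(par, offsp), h_cov, slope)
```

All three numbers agree up to floating point error, which gives a useful interpretation: the heritability as defined here is the slope of the regression of mid-offspring on mid-parent values.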

In [41]:
@numba.jit(nopython=True)
def draw_bs_pairs_reps_heritability(x, y, size=1):
    """
    Draw bootstrap pairs replicates.
    """
    out = np.empty(size)
    for i in range(size):
        out[i] = heritability(*draw_bs_pairs(x, y))
    return out

Now we draw our bootstrap pairs, calculate the heritability, and plot the ECDFs of the beak depth heritability for each species.

In [42]:
hf_reps = draw_bs_pairs_reps_heritability(hf_par, hf_offsp, size=10000)
hs_reps = draw_bs_pairs_reps_heritability(hs_par, hs_offsp, size=10000)

In [43]:
p = bebi103.viz.ecdf(hs_reps, formal=True, line_width=2, color='#4e79a7', legend='scandens')
p = bebi103.viz.ecdf(hf_reps, p=p, formal=True, line_width=2, color='#f28e2b', legend='fortis')

p.legend.location = 'bottom_right'
bokeh.io.show(p)

We can also calculate the mean of the bootstrap replicates and the confidence intervals for the heritability.

In [45]:
mean_hf = np.mean(hf_reps)
mean_hs = np.mean(hs_reps)

print("""
Mean G. fortis heritability for beak depth: {0:.2f}
Mean G. scandens heritability for beak depth: {1:.2f}
""".format(mean_hf, mean_hs))


Mean G. fortis heritability for beak depth: 0.72
Mean G. scandens heritability for beak depth: 0.55



In [44]:
conf_int_hf = np.percentile(hf_reps, [2.5, 97.5])
conf_int_hs = np.percentile(hs_reps, [2.5, 97.5])

print("""
G. fortis heritability 95% conf int:   [{0:.2f}, {1:.2f}]
G. scandens heritability 95% conf int: [{2:.2f}, {3:.2f}]
""".format(*(tuple(conf_int_hf) + tuple(conf_int_hs))))


G. fortis heritability 95% conf int:   [0.65, 0.80]
G. scandens heritability 95% conf int: [0.35, 0.75]



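G. fortis has a higher estimated heritability for beak depth than G. scandens (0.72 versus 0.55), although the 95% confidence intervals overlap and the G. scandens interval is considerably wider.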