## Problem 7.1: Hacker stats and Darwin's finches

In [55]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import numba

import bebi103

import altair as alt
import altair_catplot as altcat

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

a) We start with a little tidying of the data. Think about how you will deal with duplicate measurements of the same bird and make a decision on how those data are to be treated.

In [3]:
# Read in the data
df = pd.read_csv('../data/finch_beaks.csv', comment='#')

# Take a look
df.head()

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
0,20123,fortis,9.25,8.05,1973
1,20126,fortis,11.35,10.45,1973
2,20128,fortis,10.15,9.55,1973
3,20129,fortis,9.95,8.75,1973
4,20133,fortis,11.55,10.15,1973


Let's look for duplicates based on band number and year.

In [10]:
pd.concat(g for _, g in df.groupby(['band', 'year']) if len(g) > 1)

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
102,316,fortis,10.9,9.7,1975
103,316,fortis,10.9,9.85,1975
304,818,fortis,10.2,9.0,1975
305,818,fortis,10.2,9.0,1975
363,944,fortis,10.3,8.3,1975
364,944,fortis,10.3,8.3,1975
365,945,fortis,11.6,10.8,1975
366,945,fortis,11.6,10.8,1975
2057,19028,fortis,12.5,8.9,2012
2178,19028,scandens,12.5,8.9,2012


For band 19028, the species was recorded once as fortis and once as scandens, but the beak measurements are identical. For bands 818, 944, and 945, the rows are exact duplicates. For band 316, the two rows differ only in the beak depth measurement.

Let's first drop complete duplicates.

In [20]:
df_deduped = df.drop_duplicates()

In [23]:
len(df)

2304

In [22]:
len(df_deduped)

2301

Three rows were indeed deleted.

In [24]:
pd.concat(g for _, g in df_deduped.groupby(['band', 'year']) if len(g) > 1)

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
102,316,fortis,10.9,9.7,1975
103,316,fortis,10.9,9.85,1975
2057,19028,fortis,12.5,8.9,2012
2178,19028,scandens,12.5,8.9,2012


In [17]:
df.loc[df['band'] == 19028]

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
2057,19028,fortis,12.5,8.9,2012
2178,19028,scandens,12.5,8.9,2012


In [18]:
df.loc[df['band'] == 316]

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
102,316,fortis,10.9,9.7,1975
103,316,fortis,10.9,9.85,1975


It makes the most sense to delete both rows for band 19028, since we cannot tell which species the bird is. For band 316, we could average the two beak depths, delete one of the rows, or delete both.

For now, let's delete all four rows.

In [27]:
df_deduped = df_deduped.drop([2057,2178,102,103])

In [28]:
len(df_deduped)

2297
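As an alternative to dropping band 316, the averaging option mentioned above could look like the sketch below. It runs on a small toy frame (the two band 316 rows plus the first row of `df.head()`), not on the full data set, and the names `toy` and `toy_avg` are introduced here only for illustration.

```python
import pandas as pd

# Toy stand-in for the finch data: the two conflicting rows for band 316
# plus one unduplicated bird taken from the df.head() output above.
toy = pd.DataFrame({
    'band': [316, 316, 20123],
    'species': ['fortis', 'fortis', 'fortis'],
    'beak length (mm)': [10.9, 10.9, 9.25],
    'beak depth (mm)': [9.7, 9.85, 8.05],
    'year': [1975, 1975, 1973],
})

# Collapse to one row per (band, species, year); repeated measurements
# of the same bird in the same year are replaced by their mean.
toy_avg = (toy.groupby(['band', 'species', 'year'], as_index=False)
              [['beak length (mm)', 'beak depth (mm)']]
              .mean())

print(toy_avg)
```

Grouping on species as well as band and year keeps a case like band 19028 as two separate rows, so a conflicting species label would still need to be handled by hand.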

In [77]:
pd.concat(g for _, g in df_deduped.groupby('band') if len(g) > 1)

Unnamed: 0,band,species,beak length (mm),beak depth (mm),year
423,364,scandens,14.00,8.40,1975
1290,364,scandens,14.00,8.40,1987
276,720,fortis,12.20,10.50,1975
1228,720,fortis,12.20,10.50,1987
737,2639,fortis,10.30,8.95,1987
1435,2639,fortis,10.30,8.95,1991
883,2666,fortis,12.81,9.30,1987
1436,2666,fortis,12.81,9.30,1991
1207,2753,fortis,10.89,10.35,1987
1437,2753,fortis,10.89,10.35,1991


Some birds, such as band 364, were measured in more than one year, with the same beak length and beak depth each time. If this holds for every bird measured in multiple years, it could mean that the bird had not grown and that the initial measurement was accurate. If the values had changed, the bird may have grown from a juvenile to an adult between measurements.
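One way to check whether every bird measured in more than one year has identical measurements is to count the distinct values per band. The sketch below uses a toy frame: bands 364 and 720 are copied from the output above, while band 99999 is a made-up bird whose measurements change, included only so the check has something to find.

```python
import pandas as pd

# Toy data: bands 364 and 720 are from the table above; band 99999 is
# hypothetical and exists only to demonstrate a changed measurement.
toy = pd.DataFrame({
    'band': [364, 364, 720, 720, 99999, 99999],
    'beak length (mm)': [14.0, 14.0, 12.2, 12.2, 10.0, 10.4],
    'beak depth (mm)': [8.4, 8.4, 10.5, 10.5, 9.0, 9.1],
    'year': [1975, 1987, 1975, 1987, 1975, 1987],
})

cols = ['beak length (mm)', 'beak depth (mm)']

# Number of distinct values of each measurement for each band
n_distinct = toy.groupby('band')[cols].nunique()

# Bands for which at least one measurement takes more than one value
changed = n_distinct[(n_distinct > 1).any(axis=1)]

print(changed)
```

An empty `changed` frame on the real data would mean that no repeated bird has differing measurements across years.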

**b)** Plot ECDFs of the beak depths of Geospiza scandens in 1975 and in 2012. Then, estimate the mean beak depth for each of these years with confidence intervals.

Let's plot ECDFs of the beak depths of Geospiza scandens in 1975 and 2012.

In [70]:
inds_1975 = ((df_deduped['species'] == 'scandens') &
        (df_deduped['year'] == 1975))
    
inds_2012 = ((df_deduped['species'] == 'scandens') &
        (df_deduped['year'] == 2012))

depths_1975 = df_deduped.loc[inds_1975, 'beak depth (mm)'].values
depths_2012 = df_deduped.loc[inds_2012, 'beak depth (mm)'].values

In [71]:
p = bebi103.viz.ecdf(depths_1975,
                     x_axis_label='Scandens beak depth (mm)',
                     color='#4e79a7',
                     legend='1975')

p = bebi103.viz.ecdf(depths_2012,
                     x_axis_label='beak depth (mm)',
                     color='#f28e2b',
                     legend='2012',
                     p=p)

p.legend.location = 'bottom_right'

bokeh.io.show(p)

Now let's estimate the mean beak depth in each of these years.

In [73]:
mean_1975 = np.mean(depths_1975)
mean_2012 = np.mean(depths_2012)

print("""
Mean beak depth in 1975 (mm): {0:.2f}
Mean beak depth in 2012 (mm): {1:.2f}
""".format(mean_1975, mean_2012))


Mean beak depth in 1975 (mm): 8.96
Mean beak depth in 2012 (mm): 9.19



Now let's estimate confidence intervals.

In [56]:
@numba.jit(nopython=True)
def draw_bs_sample(data):
    """
    Draw a bootstrap sample from a 1D data set.
    """
    return np.random.choice(data, size=len(data))


def draw_bs_reps(data, stat_fun, size=1):
    """
    Draw bootstrap replicates computed with stat_fun from 1D data set.
    """
    return np.array([stat_fun(draw_bs_sample(data)) for _ in range(size)])


@numba.jit(nopython=True)
def draw_bs_reps_mean(data, size=1):
    """
    Draw bootstrap replicates of the mean from 1D data set.
    """
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out

In [57]:
bs_reps_mean_1975 = draw_bs_reps_mean(depths_1975, size=10000)
bs_reps_mean_2012 = draw_bs_reps_mean(depths_2012, size=10000)

In [61]:
# 95% confidence intervals
mean_1975_conf_int = np.percentile(bs_reps_mean_1975, [2.5, 97.5])
mean_2012_conf_int = np.percentile(bs_reps_mean_2012, [2.5, 97.5])

print("""
Mean beak depth 95% conf int in 1975 (mm): [{0:.2f}, {1:.2f}]
Mean beak depth 95% conf int in 2012 (mm): [{2:.2f}, {3:.2f}]
""".format(*(tuple(mean_1975_conf_int) + tuple(mean_2012_conf_int))))


Mean beak depth 95% conf int in 1975 (mm): [8.84, 9.08]
Mean beak depth 95% conf int in 2012 (mm): [9.07, 9.31]



In [62]:
p = bebi103.viz.ecdf(bs_reps_mean_1975,
                     x_axis_label='beak depth (mm)',
                     color='#4e79a7',
                     legend='1975')
p = bebi103.viz.ecdf(bs_reps_mean_2012, color='#f28e2b', legend='2012', p=p)

p.legend.location = 'bottom_right'

bokeh.io.show(p)

c) Perform a hypothesis test comparing the G. scandens beak depths in 1975 and 2012. Carefully state your null hypothesis, your test statistic, and your definition of what it means to be at least as extreme as the observed test statistic. Comment on the results. It might be interesting to know that a severe drought in 1976 and 1977 resulted in the death of the plants that produce small seeds on the island.
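One possible approach to part (c), offered as a sketch rather than the definitive answer, is a permutation test. Under this choice, the null hypothesis is that the 1975 and 2012 beak depths are identically distributed, the test statistic is the difference of means (2012 minus 1975), and "at least as extreme" means a permutation replicate at least as large as the observed difference. The arrays below are synthetic stand-ins: their means are the values reported above, but the spreads and sample sizes are assumptions, and the function name `draw_perm_reps_diff_mean` is introduced here only for illustration.

```python
import numpy as np


def draw_perm_reps_diff_mean(x, y, size=1, rng=None):
    """Permutation replicates of mean(y) - mean(x) under the null
    hypothesis that x and y are identically distributed."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate((x, y))
    out = np.empty(size)
    for i in range(size):
        # Scramble the pooled data and split it into two relabeled groups
        perm = rng.permutation(pooled)
        out[i] = np.mean(perm[len(x):]) - np.mean(perm[:len(x)])
    return out


rng = np.random.default_rng(3252)

# Synthetic stand-ins for depths_1975 and depths_2012 (NOT the real data;
# standard deviations and sample sizes are assumed)
x = rng.normal(8.96, 0.55, size=80)
y = rng.normal(9.19, 0.65, size=120)

# Observed test statistic
diff_obs = np.mean(y) - np.mean(x)

# Permutation replicates and one-sided p-value
perm_reps = draw_perm_reps_diff_mean(x, y, size=10000, rng=rng)
p_value = np.sum(perm_reps >= diff_obs) / len(perm_reps)

print('p-value =', p_value)
```

With the real data, `x` and `y` would be replaced by `depths_1975` and `depths_2012`. A one-sided test is used here because the drought gives a prior reason to expect deeper beaks in later years; a two-sided version would compare absolute values instead.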

d) Devise a measure for the shape of a beak. That is, invent some scalar measure that combines both the length and depth of the beak. Compare this measure between species and through time. (This is very open-ended. It is up to you to define the measure, make relevant plots, compute confidence intervals, and possibly do hypothesis tests to see how shape changes over time and between the two species.)
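As one candidate for part (d), and only as an illustration, the ratio of beak length to beak depth gives a dimensionless shape measure: larger values mean a longer, shallower beak. The sketch below computes it on a toy frame built from rows shown earlier in this notebook; the column name `length/depth` is introduced here and is not part of the data set.

```python
import pandas as pd

# Toy frame built from rows displayed earlier (two fortis rows from
# df.head() and the two scandens rows for band 364); not the full data.
toy = pd.DataFrame({
    'species': ['fortis', 'fortis', 'scandens', 'scandens'],
    'beak length (mm)': [9.25, 11.35, 14.0, 14.0],
    'beak depth (mm)': [8.05, 10.45, 8.4, 8.4],
    'year': [1973, 1973, 1975, 1987],
})

# Candidate shape measure: length-to-depth ratio (dimensionless)
toy['length/depth'] = toy['beak length (mm)'] / toy['beak depth (mm)']

# Mean shape for each species and year
shape_means = toy.groupby(['species', 'year'])['length/depth'].mean()

print(shape_means)
```

The same bootstrap functions defined in part (b) could then be applied to the `length/depth` values for each species and year to obtain confidence intervals.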

e) Introgressive hybridization occurs when a G. scandens bird mates with a G. fortis bird, and then the offspring mates again with pure G. scandens. This brings traits from G. fortis into the G. scandens genome. As this may be a mode by which beak geometries of G. scandens change over time, it is useful to know how heritable a trait is. Heritability is defined as the ratio of the covariance between parents and offspring to the variance of the parents alone. To be clear, the heritability is defined as follows.

1. Compute the average value of a trait in a pair of parents.
2. Compute the average value of that trait among the offspring of those parents.
3. Do this for each set of parents/offspring. Using this data set, compute the covariance between the parental averages and the offspring averages, and the variance of the parental averages. The heritability is the ratio of this covariance to this variance.

This is a more apt definition than, say, the Pearson correlation, because it is a direct comparison between parents and offspring.

Heritability data for beak depth for G. fortis and G. scandens can be found here and here, respectively. (Be sure to look at the files before reading them in; they do have different formats.) From these data, compute the heritability of beak depth in the two species, with confidence intervals. How do they differ, and what consequences might this have for introgressive hybridization?
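The definition above can be sketched in code as follows, with a pairs bootstrap for the confidence interval. The parent and offspring arrays are synthetic (the mean, spread, slope of 0.7, and sample size are all assumptions), since the real files have to be inspected and reshaped first; the function names `heritability` and `draw_bs_pairs_reps` are introduced here for illustration.

```python
import numpy as np


def heritability(parents, offspring):
    """Covariance of parental and offspring averages divided by the
    variance of the parental averages."""
    cov = np.cov(parents, offspring)
    return cov[0, 1] / cov[0, 0]


def draw_bs_pairs_reps(parents, offspring, size=1, rng=None):
    """Pairs bootstrap replicates of the heritability."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(parents)
    out = np.empty(size)
    for i in range(size):
        # Resample (parent, offspring) pairs together to keep them linked
        idx = rng.integers(0, n, size=n)
        out[i] = heritability(parents[idx], offspring[idx])
    return out


rng = np.random.default_rng(3252)

# Synthetic mid-parent and mean-offspring beak depths (NOT the real data)
parents = rng.normal(9.0, 0.6, size=200)
offspring = 9.0 + 0.7 * (parents - 9.0) + rng.normal(0.0, 0.3, size=200)

h2 = heritability(parents, offspring)
bs_reps = draw_bs_pairs_reps(parents, offspring, size=2000, rng=rng)
conf_int = np.percentile(bs_reps, [2.5, 97.5])

print('heritability:', h2)
print('95% conf int:', conf_int)
```

Resampling indices rather than the two arrays separately is the important design choice: drawing parents and offspring independently would destroy the pairing that the covariance measures.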