# Bioindicators of Strawberry Creek
### Professors Mary Power, John Huelsenbeck & Bruce Baldwin
_Estimated Time: 50 minutes_

Welcome! In this lab you will use data science tools to determine whether the difference in ecological health between the two branches of Strawberry Creek is significant.

**Learning Outcomes**

By the end of the notebook, students should be able to:

1. Explain the use of biological organisms as indicators of ecosystem health
2. Define dichotomous keys and use them to ID organisms
3. Interpret biological metrics: taxa richness, %EPT, Biotic Index, % Filterers, % Predators
4. Analyze simulated resampling to determine if two assemblages of organisms are different fro

## Table of Contents 

1. [Jupyter Notebooks](#1)
    - [Types of Cells](#1.1)
    - [Running Cells](#1.2)
    - [Editing, Saving and Submitting](#1.3)
    - [Debugging Tips and Jupyter Help](#1.4)
<br/><br/>
2. [Introduction to Data Analytics](#2)
    - [Null and Alternate Hypothesis](#2.1)
    - [Permutation Test](#2.2)
    - [Bootstrapping](#2.3)
<br/><br/>
3. [Introduction to Data Analytics](#3)
    - [Experiment 1](#3.1)
    - [Experiment 2](#3.2)
    - [Experiment 3](#3.3)
<br/><br/>

## Jupyter Notebooks <a id='1'></a>

This lab is currently set up in a Jupyter Notebook. A Jupyter Notebook is an online, interactive computing environment, composed of different types of __cells__. Cells are chunks of code or text that are used to break up a larger notebook into smaller, more manageable parts and to let the viewer modify and interact with the elements of the notebook.
 
### Types of cells <a id= '1.1'> </a>

There are two types of cells in Jupyter, __code__ cells and __markdown__ cells. Code cells are indicated with “In [ ]:” to the left of the cell. In these cells you can write your own code and run it within that individual cell.
Markdown cells usually hold text and do not have the “In [ ]” to the left of the cell (just an empty space). 

### Running cells <a id= '1.2'> </a>

"Running" a cell is similar to pressing 'Enter' on a calculator once you've typed in an expression; it computes all of the expressions contained within the cell.

To run a code cell, you can do one of the following:
- press __Shift + Enter__
- click __Cell -> Run Cells__ in the toolbar at the top of the screen.

You can navigate the cells by either clicking on them or by using your up and down arrow keys. Try running the cell below to see what happens. 

In [3]:
print("Hello, World!")

Hello, World!


The input of the cell consists of the text/code that is contained within the cell's enclosing box. Here, the input is an expression in Python that "prints" or repeats whatever text or number is passed in. 

The output of running a cell is shown in the line immediately after it. Notice that markdown cells have no output. 

### Editing, Saving and Submitting <a id='1.3'> </a>

- To __edit__ a cell simply click on the desired cell and begin typing 
- To __save__ your notebook press Command + S (Mac) or Ctrl + S (Windows/Linux) on the keyboard 
- We will go into the specifics of how to __submit__ your work at the end of the lab, but you will essentially be converting your work into a PDF file and then submitting it to bCourses

### Debugging Tips and Jupyter Help <a id= '1.4'> </a>

...

## Introduction <a id='2'> </a>

Throughout the course of this lab you will be using Python to analyze the data that you collected from Strawberry Creek. Python allows us to use data analysis methods that simulate data sets that we may not have the resources to collect in real life. The main purpose of this lab is to determine whether or not the ecological health of the two branches of the creek differs significantly.

In [10]:
import numpy as np
import pandas as pd
try:
    import qgrid  # optional table widget; the analysis below runs without it
except ModuleNotFoundError:
    qgrid = None
import ipywidgets as widgets
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

## Data Recording

## Introduction to Data Analytics <a id= '2'> </a>

### Null Hypothesis vs. Alternate Hypothesis <a id='2.1'> </a>

One of the first problems to work through when looking at a data set is to determine whether the trends in the data are significant or purely due to random chance. In this lab we are trying to determine whether the difference in ecological health between the two branches of the creek is significant or due to chance. To do this we begin by forming a null hypothesis and an alternative hypothesis to test. 

__Null Hypothesis__: A null hypothesis claims that there is no statistical difference between two distributions and that any difference is due to experimental error or chance.

__Alternative Hypothesis__: An alternative hypothesis essentially counters the null hypothesis and claims that the difference in distribution is significant.

Example Null and Alternative Hypothesis

Say we have a data set on the number of boba shops on Southside and Northside. The data set shows that Southside has a higher average number of boba shops than Northside, but it is unclear whether the difference in the averages is due to chance or some other unknown reason. For this data set, potential hypotheses would be:

__Example Null Hypothesis__
- The distribution of the number of boba shops is the same for the samples taken from Southside as for the samples taken from Northside. The difference between the sample averages is due to chance. 

__Example Alternative Hypothesis__  
- The average number of boba shops in Northside is lower than the average number of boba shops in Southside.

**What would be a potential null hypothesis for this lab?**

_Type your answer here_

**What would be a potential alternative hypothesis?**

_Type your answer here_

After you have your null and alternative hypotheses, the next step is to simulate the distribution under the null hypothesis! Theoretically, if the difference in distribution were solely due to random chance, then the group that each data point originally came from would not matter. This is where permutation tests come into play.

### Permutation Test <a id='2.2'> </a>

A permutation test shuffles the given data set to create new distributions. In this case, we use a permutation test to shuffle the FBI Scores between the two forks of the creek. Permutation tests simulate the null hypothesis because shuffling assumes that the fork a score came from does not matter, that is, that there is no significant difference between the distributions. 

To demonstrate, we will run permutation testing on example data of FBI scores collected from the North and South Fork. Run the code below:

In [None]:
# north_fork_example = pd.DataFrame({
#     'FBI Score':[3.5, 4.0, 3.0, 3.5, 4.2],
#     'Mean FBI Score':[3.64, 3.64, 3.64, 3.64, 3.64],
#     'Score - Mean':[-0.14, 0.36, -0.64, -0.14, 0.56],
#     'Square of Difference':[0.0196, 0.1296, 0.4096, 0.0196, 0.3136]
# })

# north_fork_example

example = pd.DataFrame({
    'FBI Score':[3.5, 4.0, 3.0, 3.5, 4.2, 4.5, 5.0, 3.6, 4.9, 5.1, 3.4, 2.9],
    'Fork':np.append(np.repeat('North', 5), np.repeat('South', 7))
})
example

In our example data, we have 5 data points from the North Fork and 7 from the South Fork. Run the cell below to see the observed difference in FBI Score means between the two forks.

In [None]:
observed_difference = example[example['Fork']=='North'].mean(numeric_only=True) - example[example['Fork']=='South'].mean(numeric_only=True)
observed_difference

We call this our observed difference because this statistic is observed from data that is actually collected (although generated in our case).

In permutation testing, we will be shuffling the data points between the two forks. For one permutation, we will calculate the FBI Score means for each fork. In this case, the mean difference is no longer an observed difference but a simulated difference. Run the cell below to generate a permutation of the data and to calculate the new difference.

In [None]:
perm_example = pd.DataFrame({
    'FBI Score':example['FBI Score'].sample(len(example['FBI Score'])),
    'Fork':np.append(np.repeat('North', 5), np.repeat('South', 7))
})
perm_example

In [None]:
perm_difference = perm_example[perm_example['Fork']=='North'].mean(numeric_only=True) - perm_example[perm_example['Fork']=='South'].mean(numeric_only=True)
perm_difference

This is just one permutation of the data. Now we repeat the permutation many more times and plot the distribution of the resulting differences. By comparing this distribution of simulated differences with our actual observed difference, we can see how likely it would be to observe this difference if our null hypothesis were true.

In [None]:
def difference_in_means(fbi_scores):
    # First 5 values are treated as North Fork, remaining 7 as South Fork
    return np.mean(fbi_scores[:5]) - np.mean(fbi_scores[5:])

n_repeats = 1000
permutation_differences = []
for i in range(n_repeats):
    # Shuffle all 12 scores; to_numpy() drops the index so slicing is purely positional
    permutation = example['FBI Score'].sample(frac=1).to_numpy()
    new_difference = difference_in_means(permutation)
    permutation_differences.append(new_difference)

In [None]:
plt.hist(permutation_differences)
plt.axvline(observed_difference['FBI Score'], color='red', label='Observed Difference')
plt.xlabel('FBI Score Mean Difference')
plt.legend();

Using this plot, we can judge whether the data are consistent with the null hypothesis (the observed difference between the two branches is due to random chance) or whether they favor the alternative hypothesis (that it is not due to chance).

**How likely is it for the observed difference to occur, and can we reject the null hypothesis?**

_Type your answer here_

Answer: Based on the distribution, although it's not very likely to occur, we still can't rule out the possibility of getting the observed difference by random chance, because it is not far enough in the tail of the distribution to be considered as very unlikely.
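
One way to put a number on "how likely" is to compute the fraction of shuffled differences that are at least as far from zero as the observed one. The sketch below is a self-contained illustration that reuses the example scores from above. The helper name `permutation_p_value`, the fixed random seed, and the two-sided comparison are assumptions made for illustration, not part of the original lab.

```python
import numpy as np

# Example FBI scores from above: first 5 are North Fork, last 7 are South Fork.
scores = np.array([3.5, 4.0, 3.0, 3.5, 4.2, 4.5, 5.0, 3.6, 4.9, 5.1, 3.4, 2.9])
n_north = 5

def permutation_p_value(values, n_first, n_repeats=1000, seed=0):
    """Hypothetical helper: two-sided permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)  # fixed seed (assumption) so reruns match
    observed = values[:n_first].mean() - values[n_first:].mean()
    count = 0
    for _ in range(n_repeats):
        shuffled = rng.permutation(values)  # shuffled copy of all scores
        simulated = shuffled[:n_first].mean() - shuffled[n_first:].mean()
        # Count shuffles whose difference is at least as extreme as the observed one
        if abs(simulated) >= abs(observed):
            count += 1
    return observed, count / n_repeats

observed, p_value = permutation_p_value(scores, n_north)
print(f"Observed difference: {observed:.3f}")
print(f"Permutation p-value: {p_value:.3f}")
```

A small fraction would place the observed difference far in the tail of the histogram, while a larger fraction would match the answer above, that chance alone cannot be ruled out.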

### Bootstrapping <a id='2.3'> </a>

Another problem that often surfaces when analysing a data set is the accuracy of an estimated statistic. For example, if we wanted to provide an estimate of the average difference in ecological health between North and South Fork, the calculations from just the experiment would not be representative of the difference throughout the day, or even throughout the year. On the other hand, it would also not be feasible to go around collecting samples and calculating the ecological health every single hour. This is where bootstrapping comes into play!  

Bootstrapping generates new random samples by drawing samples from the original data set. We essentially treat our data set as the population. We randomly draw from the data set __with replacement__ to create new data sets that are the same size as the original.

In [None]:
boot_example = perm_example

In [None]:
obs_difference = boot_example[boot_example['Fork']=='North'].mean(numeric_only=True) - boot_example[boot_example['Fork']=='South'].mean(numeric_only=True)
obs_difference.at["FBI Score"]

Using only the data collected from your individual experiment, the estimated difference in Family Biotic Index (FBI) is the number displayed after running the cell above. As we previously described, we cannot be sure that this is a good estimate of the difference in FBI Score from this analysis alone. One solution is to use bootstrapping. 

In [None]:
new_sample = example.sample(n = 12, replace = True)
new_sample

After performing the bootstrapping method once, we have a new average difference in FBI Score displayed by running the cell below.

In [None]:
new_sample[new_sample['Fork']=='North'].mean(numeric_only=True).at["FBI Score"] - new_sample[new_sample['Fork']=='South'].mean(numeric_only=True).at["FBI Score"]

Now we repeat the bootstrapping method many times and compile the calculated average FBI differences into one distribution!

In [None]:
FBI_averages = []
for i in np.arange(500):
    one_new_sample = example.sample(n = 12, replace = True)
    average = one_new_sample[one_new_sample['Fork']=='North'].mean(numeric_only=True) - one_new_sample[one_new_sample['Fork']=='South'].mean(numeric_only=True)
    FBI_averages.append(average)
avgs_tbl = pd.DataFrame(FBI_averages)

After resampling and calculating the average difference in FBI Scores 500 times, we get this graph that displays the distribution of the average differences.

In [None]:
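avgs_tbl.hist("FBI Score")
# Mark the single observed difference with a vertical line rather than a one-value histogram
plt.axvline(obs_difference.at["FBI Score"], color="red", label="Observed Difference")
plt.legend();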

Theoretically the data set that is collected represents the population, so the distribution of the original sample will resemble the distribution of the population. Similarly, the resampled data sets will resemble the original data set and therefore the population, which is why resampling from the same sample works for bootstrapping!
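
Since bootstrapping is meant to show how accurate an estimate is, one common way to summarize the bootstrapped distribution is a percentile interval. The following is a minimal sketch under stated assumptions: it reuses the example scores, fixes a random seed, and, unlike the cells above (which resample all 12 rows together), resamples each fork separately so that neither group can end up empty. The variable names are illustrative only.

```python
import numpy as np

# Example FBI scores from above, split by fork.
north = np.array([3.5, 4.0, 3.0, 3.5, 4.2])
south = np.array([4.5, 5.0, 3.6, 4.9, 5.1, 3.4, 2.9])

rng = np.random.default_rng(0)  # fixed seed (assumption) so reruns match
n_repeats = 500

boot_differences = np.empty(n_repeats)
for i in range(n_repeats):
    # Draw with replacement, keeping each fork's original sample size
    north_resample = rng.choice(north, size=north.size, replace=True)
    south_resample = rng.choice(south, size=south.size, replace=True)
    boot_differences[i] = north_resample.mean() - south_resample.mean()

# The middle 95% of the bootstrapped differences
lower, upper = np.percentile(boot_differences, [2.5, 97.5])
print(f"Approximate 95% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```

A narrow interval suggests the estimated difference is fairly stable under resampling, while a wide one suggests that more data would be needed to pin it down.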

__P-Values & Statistical Significance__

Now that we have a distribution of what the differences in FBI Scores will generally look like, the next step is to determine how likely it is under the null hypothesis for the difference to be equal to or even higher than the one observed in our experiment sample. This likelihood is more classically known as the P-value of a test. The P-value of a test is the chance that, under the null hypothesis, the test statistic will be equal to the observed statistic or lie further in the direction that supports the alternative hypothesis.

__Calculating P-values__

For the purpose of this experiment, the p-value will represent the chance that the difference in FBI Scores is greater than or equal to the observed difference. 
To calculate the p-value, we count the number of times the difference in the bootstrapped distribution is greater than or equal to the observed difference and divide by the total number of bootstrap repetitions. 
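
The same calculation can be written as a formula. The notation here is introduced only for illustration: $d_{\text{obs}}$ is the observed difference, $d_1, \dots, d_B$ are the resampled differences, and $B$ is the number of repetitions (500 in this notebook).

$$
p = \frac{\#\{\, i : d_i \ge d_{\text{obs}} \,\}}{B}
$$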


In [None]:
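p_val_count = 0
n_boot = len(avgs_tbl)  # number of bootstrap repetitions (500 above)
for i in np.arange(n_boot):
    # Count resampled differences at least as large as the observed one
    if avgs_tbl.at[i, "FBI Score"] >= obs_difference.at["FBI Score"]:
        p_val_count += 1
p_val_count / n_boot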

If the P-value is small, then that implies that it is very unlikely for this statistic to occur under the null hypothesis and we say we “reject the null hypothesis”. Otherwise, if the P-value is large, then that implies that the observed test statistic has a high likelihood of occurring under the null and we say we “fail to reject the null hypothesis”. 

A conventional cut-off for P-values is 5%. If the P-value is less than or equal to 5%, then the result is deemed “statistically significant”.

__Using the calculated P-value above, do we reject the null hypothesis or fail to reject the null hypothesis? Why?__

_type answer here_

## Submitting the Lab

## Bibliography 

---

Notebook developed by: Joshua Asuncion, Karalyn Chong

Data Science Modules: http://data.berkeley.edu/education/modules