<a href="https://colab.research.google.com/github/shaevitz/MOL518-Intro-to-Data-Analysis/blob/main/Lecture_10/MOL518_Lecture10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# In Colab, run this cell first to set up the file structure!
%cd /content
!rm -rf MOL518-Intro-to-Data-Analysis

!git clone https://github.com/shaevitz/MOL518-Intro-to-Data-Analysis.git
%cd MOL518-Intro-to-Data-Analysis/Lecture_10

# Lecture 10: Simulating random numbers and distributions

In this class, we will learn how to generate random numbers from different probability distributions and plot them in Python using Jupyter Notebooks running in Google Colab.

The goals of this lecture are to:

1. Teach you how to generate random numbers
2. Teach you how to simulate probability distributions using random number generation
3. Teach you how to calculate a confidence interval using bootstrapping


-----


## TL;DR: Biology is complicated!



- The examples in the last lecture emphasized situations where the binomial, normal, Poisson, and exponential distributions explain a dataset
- Real biology is often more complicated than this!
- Examples of this include bimodal distributions, skewed distributions, power law distributions, or something more complex (mixtures of distributions, for example)
- We will go over a few more concrete examples later in this notebook
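
As a quick illustration of a mixture of distributions, here is a minimal sketch that combines two normal distributions into one bimodal dataset. The subpopulation sizes, means, and standard deviations are made-up values chosen only for illustration, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed so reruns give the same numbers

# Hypothetical mixture: two subpopulations with different means (made-up parameters)
n_small, n_large = 600, 400
small = rng.normal(loc=2.0, scale=0.5, size=n_small)
large = rng.normal(loc=5.0, scale=0.5, size=n_large)
mixture = np.concatenate([small, large])

# Bin the combined data; the counts peak near 2 and near 5, with a dip in between
counts, bin_edges = np.histogram(mixture, bins=30)

# For a bimodal dataset, the mean and median can both fall between the two peaks
print('Mean:  ', mixture.mean())
print('Median:', np.median(mixture))
```

A single normal distribution would describe neither peak well, which is why a histogram is worth plotting before choosing a model.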


## Why do we need to generate random numbers, anyways?

1. It's helpful as a learning exercise, to teach you what a probability distribution is.

2. Sometimes we want to simulate the potential results of an experiment before actually running it, or compare our results to an existing model. For example, if you are making a series of measurements, you might want to check whether they look normally distributed, as you would expect them to be.

3. Bootstrapping (resampling from existing data) is a powerful way to estimate error when we don't know the distribution our dataset comes from. It is also useful if you run a pilot experiment prior to running a larger one (often needed when doing animal work).

We will discuss bootstrapping further later today.


-----

## Random number generation in Python

There are two ways to generate random numbers in Python: the built-in ```random``` module, and the more sophisticated random number generators in the ```numpy.random``` package. We will start with the built-in module so that you can understand how the process works.
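
To make the comparison concrete, here is a small sketch that draws uniform random numbers both ways. It assumes NumPy's ```default_rng``` generator interface; the variable names are just for illustration.

```python
import random
import numpy as np

# Built-in module: one call returns one Python float
x = random.random()

# NumPy: create a Generator object, then ask it for numbers
rng = np.random.default_rng()
y = rng.random()       # a single float
ys = rng.random(5)     # an array of five floats from one call

print(x)
print(y)
print(ys)
```

The main practical difference is that NumPy can return a whole array of random numbers at once, which is convenient when simulating many draws.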



### What is the simplest way to generate a random number in Python?

The simplest approach uses the aptly-named ```random.random()``` function, which generates a random floating-point number in the half-open interval [0, 1). In other words, the output can be exactly zero, but it is always strictly less than one.
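
The short sketch below checks this range empirically and also shows how seeding makes the sequence reproducible. The seed value and the number of draws are arbitrary choices for illustration.

```python
import random

random.seed(10)  # arbitrary seed, only so that reruns give the same numbers
draws = [random.random() for _ in range(100_000)]

# Every draw should be at least 0 and strictly less than 1
print(min(draws), max(draws))

# The average of many uniform draws on [0, 1) should be close to 0.5
print(sum(draws) / len(draws))

# Re-seeding with the same value restarts the same sequence
random.seed(10)
first_again = random.random()
print(first_again == draws[0])
```

Setting a seed is useful when you want someone else to be able to reproduce your simulation exactly. Leave it out when you want different numbers on every run.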



### Example 1: Flipping an unbiased coin

In this initial example I will show you how to simulate a series of unbiased coin flips.

In [None]:
# We need to import the random package
import random

# number of times to flip the coin
nflips = 100
nheads = 0

# flip the coin nflips times; a random number below 0.5 counts as heads
for i in range(nflips):
  r = random.random()
  if r < 0.5:
    print('Heads')
    nheads = nheads+1
  else:
    print('Tails')

print('Number of heads: ', nheads)

### Exercise 1

Now, write some code below to simulate 10,000 coin flips.

Note: you may not want to display the results of each coin flip any more. You can copy the code from Example 1 above as a starting point. Please do not use generative AI for this exercise.

In [None]:
# Code for Exercise 1 goes here!


### Exercise 2

Make a version of the code above that runs 100 simulations of 100 coin flips each. Plot a histogram of the number of heads observed in each simulation. Also calculate the mean and standard deviation of the distribution of the number of heads observed in each simulation.

Note: You can copy the code from Example 1 above as a starting point. Please do not use generative AI for this exercise.

In [None]:
# Write your code for Exercise 2 here




#### Exercise 2 questions

Once you have run your code, please answer the following questions:

1. Based on the histogram, which distribution did you generate with the coin flips?

**[your answer goes here]**

2. What is the mean number of heads across all simulations? Is this what you would have expected?

**[your answer goes here]**

3. Is the standard deviation what you would have expected based on the distribution?

**[your answer goes here]**

The answers you provide here will not be graded, but they will be helpful feedback for developing the course.

## An introduction to bootstrapping: one 'weird trick' for converting "small data" into big data

If our data follow one of the four "Great Distributions", it is easy to estimate the variance and standard deviation using the formulas I gave you in the previous lecture. However, sometimes in science we find that our data do not fit one of those distributions. In those cases, bootstrapping lets us estimate the uncertainty of a statistic directly from the data.

We start with a sample dataset, consisting of a set of measurements. Then, we repeatedly resample the data with replacement, generating a series of resamples from our original sample dataset. We can then calculate our desired statistic (for example, the mean) on each of the resamples, generating a distribution of means.

We then can calculate (for example) a 95% confidence interval by looking at a histogram of the distribution of means of the resamples. In this specific example, the 95% confidence interval would go from the 2.5th percentile to the 97.5th percentile.

This approach is called "case resampling", and is the simplest method for bootstrapping. There are other, more complicated methods out there that may sometimes be helpful, but we will not be covering them in this course.
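
The steps above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the dataset is synthetic (drawn from a skewed log-normal distribution with made-up parameters), and the seed and the number of resamples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed for reproducibility

# Synthetic, made-up measurements (skewed on purpose); not real data
sample = rng.lognormal(mean=0.0, sigma=0.75, size=40)

n_resamples = 5000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Case resampling: draw as many points as the original sample, with replacement
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# The 95% confidence interval runs from the 2.5th to the 97.5th percentile
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print('Sample mean:', sample.mean())
print('95% CI for the mean:', ci_low, 'to', ci_high)
```

Plotting a histogram of ```boot_means``` shows the distribution of means that the confidence interval is read from.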

**[some graphic explaining bootstrapping]**

The example given here has focused on the mean, but the same approach can be applied to other estimators (e.g. median). It's not a good idea to use bootstrapping for extreme statistics (e.g., the max or min of a dataset).
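
Here is a hedged sketch of both points, using synthetic data with made-up parameters. It bootstraps the median and the maximum of the same sample. A resample can never contain a value larger than the largest original measurement, so the bootstrapped maximum piles up at the sample maximum instead of reflecting the true uncertainty.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed for reproducibility

# Synthetic, made-up measurements; not real data
sample = rng.normal(loc=10.0, scale=2.0, size=40)

# Build all resamples at once: each row holds the indices of one resample
n_resamples = 5000
idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))
resamples = sample[idx]

boot_medians = np.median(resamples, axis=1)
boot_maxes = resamples.max(axis=1)

# The median behaves well: we get a sensible confidence interval
print('95% CI for the median:', np.percentile(boot_medians, [2.5, 97.5]))

# The maximum does not: no resample can exceed the sample maximum...
print('Largest bootstrapped max:', boot_maxes.max(), '| sample max:', sample.max())
# ...and a large fraction of resamples land exactly on it
print('Fraction equal to the sample max:', np.mean(boot_maxes == sample.max()))
```

For a sample of size n, the chance that a resample includes the largest point is 1 - (1 - 1/n)^n, which is roughly 0.64 for n = 40, so the printed fraction should be near that value.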

### Three key questions may emerge:

1. How much data do I need to use bootstrapping? There is no strict definition of this, but a rule of thumb is that the sample size should be *at least 30*. That said, you can use bootstrapping even with smaller samples, even as low as 10-15 samples. Just know that the performance will not be as good as with a larger sample size (the error will be larger).

2. Are there any other assumptions baked into bootstrapping? Yes! It is also important that the samples are *independent and come from the same distribution*, as this is a key assumption of the bootstrapping approach. In statistics we often refer to this as being independent and identically distributed, or i.i.d. for short.

3. How many times do I need to do the resampling? *More is better, but 1,000-10,000 resamples is usually enough to give a good estimate of the standard error.* There is a bit of a tradeoff here between computing time and accuracy, so if it takes a very long time to compute 10,000 resamples, you can reduce the number of resamples.
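
To get a feel for the third point, the sketch below estimates the standard error of the mean with different numbers of resamples. The data are synthetic (made-up exponential measurements), and ```bootstrap_se``` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed for reproducibility

# Synthetic, made-up measurements; not real data
sample = rng.exponential(scale=2.0, size=50)

def bootstrap_se(data, n_resamples, rng):
    """Hypothetical helper: bootstrap estimate of the standard error of the mean."""
    # Each row of idx holds the indices of one resample (drawn with replacement)
    idx = rng.integers(0, data.size, size=(n_resamples, data.size))
    boot_means = data[idx].mean(axis=1)
    # The standard deviation of the bootstrapped means estimates the standard error
    return boot_means.std(ddof=1)

for n_resamples in (100, 1000, 10000):
    print(n_resamples, 'resamples -> SE estimate:', bootstrap_se(sample, n_resamples, rng))

# For the mean, we can compare against the textbook formula s / sqrt(n)
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)
print('Formula-based SE:', analytic_se)
```

If you rerun the loop several times, the estimate from 100 resamples should jump around more than the estimate from 10,000 resamples.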

### The pros and cons of bootstrapping:

**Pros**
1. It's simple to calculate!
2. Bootstrapping is helpful for calculating standard error & confidence intervals
3. Works even when we don't know what the distribution is

**Cons**
1. Simple bootstrapping (as described here) assumes the variance is finite, which may not always be true if your data have a lot of outliers or a long tail.
2. Bootstrapping does not usually work well when the size of the sample dataset is small (below 30).
3. If we do know the distribution, we are probably better off using other methods for estimating error

*Note:* You might be wondering: why do we do the sampling with replacement? Sampling without replacement would not be helpful here, as we would just be reproducing the original sample dataset in a different order (or a subset of it), so the statistic would come out the same every time. It would also break the i.i.d. assumption, because each draw would depend on the ones before it.
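
A small sketch makes the difference visible. The six values below are made up for illustration, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # arbitrary seed for reproducibility

# Made-up measurements, only for illustration
sample = np.array([1.2, 3.4, 2.2, 5.1, 4.0, 2.9])

# Without replacement: every value is used exactly once, so we just get a reshuffle
without = rng.choice(sample, size=sample.size, replace=False)

# With replacement: some values can repeat and others can be left out
with_repl = rng.choice(sample, size=sample.size, replace=True)

print('Original:           ', sample)
print('Without replacement:', without)
print('With replacement:   ', with_repl)

# The reshuffled data contain exactly the same values as the original,
# so their mean matches the original mean
print(np.array_equal(np.sort(without), np.sort(sample)))
print(np.isclose(without.mean(), sample.mean()))
```

Only the resample drawn with replacement can give a different mean from the original, which is what produces the spread in the bootstrap distribution.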

### Example 2

In this example, I will show you how to sample **XX YY ZZ**

In [None]:
# We need to import the numpy package to load the data and compute the mean
import numpy as np

eggarray = np.loadtxt('data/egg_measurements.csv', delimiter = ',', skiprows=1) # skip the first row since it is a header
eggsize = eggarray[:,1] # egg size is taken from column index 1 (the second column, since Python counts from 0)

# the variance is defined as the mean of the squared difference
# between each element in eggsize and the mean eggsize
eggsize_var = np.mean((eggsize - np.mean(eggsize))**2)
print(eggsize_var)

### Exercise 3

Now, calculate the variance of egg size, using ```numpy.var```.

In [None]:
# Your code goes here

### Exercise 4

Calculate the standard deviation of egg size, without using the built-in ```numpy.var``` or ```numpy.std``` function. You are, however, allowed to use the ```numpy.sqrt``` function.

In [None]:
# Your code goes here

# Bonus exercises

If you have time, or would like extra practice after class, please complete the following exercises. Please note that they are a bit harder than the regular exercises (that is on purpose!).

## Bonus Exercise 1

**[TEXT]**

In [None]:
# Code for Bonus Exercise 1 goes here


## Bonus Exercise 2

**[TEXT]**

In [None]:
# Code for Bonus Exercise 2 goes here
