Example 1
=========

Overview
--------

The material in this notebook covers four topics: binomial
distributions, the central limit theorem, outliers in data and invalid
model assumptions. You will have approximately 15 minutes to work
through each part, after which we will go through the answers together.
Some questions may be challenging; feel free to skip harder questions on
a first reading if you feel they will take too much time during the
tutorial.

This notebook is available on GitHub
[here](https://github.com/aezarebski/aas-extended-examples). If you find
errors or would like to suggest an improvement, feel free to create an
issue.

Introduction
------------

In this lab we will look at the binomial distribution and the central
limit theorem, and analyse two data sets of coin tosses. We will look
for a bias in the results of coin flips. Some of the questions are
open-ended by design. Partial solutions will be distributed at the end
of the session.

As usual we will start by importing some useful libraries.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

Parameter estimation of the binomial distribution
-------------------------------------------------

We want to *estimate* the probability that a coin comes up heads. We
also want to understand the level of confidence we have in this
estimate; we use a *confidence interval* (CI) to describe the range of
values we are confident the "true" probability of heads lies within.

Binomial random variables can be used to model the number of times a
coin comes up heads when flipped $n$ times. Let $X$ be a binomial random
variable (RV) representing the number of heads that are observed when a
coin is flipped $n$ times and the probability of coming up heads is $p$.
We assume that $n$ is known but $p$ is unknown.

The expected value of $X$, i.e. the average number of times that the coin
comes up heads, is $np$. So a simple way to estimate $p$ is to divide
the number of heads, $X$, by the number of flips, $n$. This gives the
estimate

$$
\hat{p} = X / n.
$$

This is the [method of
moments](https://en.wikipedia.org/wiki/Method_of_moments_(statistics))
estimator. It is also an example of a maximum likelihood estimate (MLE).

Given an estimator, such as $\hat{p}$, we usually want to quantify the
uncertainty. The *Wald method* is one way to get the $95\%$ CI. It is a
very simple method, but it is acceptable when we have lots of data. The
estimated standard error of $\hat{p}$ is $\sqrt{\hat{p}(1-\hat{p})/n}$,
so the Wald CI is given by

$$
\hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
$$

where $z$ is the appropriate quantile of the standard normal
distribution. In the case of a $95\%$ CI this value is approximately
$1.96$.
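
As a quick check of where the value $1.96$ comes from, the following
sketch computes the quantile with `scipy.stats.norm.ppf`. The variable
names `confidence_level` and `z` are illustrative only.

```python
import scipy.stats as stats

# A two-sided 95% CI leaves 2.5% in each tail, so we need the
# 97.5% quantile of the standard normal distribution.
confidence_level = 0.95
z = float(stats.norm.ppf(1 - (1 - confidence_level) / 2))
print(round(z, 2))  # 1.96
```

Changing `confidence_level` gives the quantile for other CI levels.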

The details are all given on
[Wikipedia](https://en.wikipedia.org/wiki/Binomial_distribution#Estimation_of_parameters),
but there is also a reasonably clear description in [All of
Statistics](https://link.springer.com/book/10.1007/978-0-387-21736-9)
which you can get via SOLO. You can also find reasonable treatments of
Wald CIs in both of those resources.

### Question

State the limitations of the estimator we are using for the CI.

### Question

Implement a function called `wald_estimate_and_ci` which takes two
arguments: `num_trials` which is $n$ in the description above, and
`num_success` which is $X$ above. The function should return
`(p_hat,(wald_lower,wald_upper))` where `p_hat` is $\hat{p}$ and
`wald_x` are the limits on the $95\%$ CI using the Wald method.
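
One possible sketch of such a function is given here, assuming the
$95\%$ level is fixed and that the inputs are plain numbers. It is not
the only reasonable implementation, and the example call uses made-up
counts.

```python
import math

import scipy.stats as stats


def wald_estimate_and_ci(num_trials, num_success):
    """Return (p_hat, (wald_lower, wald_upper)) for a 95% Wald CI."""
    p_hat = num_success / num_trials
    # 97.5% quantile of the standard normal, approximately 1.96.
    z = float(stats.norm.ppf(0.975))
    # Estimated standard error of p_hat.
    std_err = math.sqrt(p_hat * (1 - p_hat) / num_trials)
    return (p_hat, (p_hat - z * std_err, p_hat + z * std_err))


# Hypothetical example: 60 heads in 100 flips.
print(wald_estimate_and_ci(num_trials=100, num_success=60))
```

Note that this sketch does not clip the limits to $[0, 1]$, so the
interval can extend outside the valid range when $\hat{p}$ is close to
0 or 1.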


### Question

Simulate a binomial random variable with $n=100$ and $p=0.6$. Then use
the value and the `wald_estimate_and_ci` function to see how well you
can estimate $p$. Write a couple of sentences to explain this.

Recall that in a previous example we have looked at how to simulate
random variables using `scipy.stats`.
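
As a reminder, a single binomial draw can be simulated roughly as
follows. The seed passed to `random_state` is arbitrary and only there
to make the draw reproducible.

```python
import scipy.stats as stats

n, p = 100, 0.6

# Draw one observation from Binomial(n, p).
x = int(stats.binom.rvs(n=n, p=p, random_state=1))

# The method of moments estimate of p from this single draw.
p_hat = x / n
print(x, p_hat)
```

The value `x` can then be passed, together with `n`, to your
`wald_estimate_and_ci` function.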

### Question

Repeat the process from the previous question 100000 times and see what
proportion of the CIs capture the true value of $p$. Is it what you
expect? Write a couple of sentences to explain what you found.
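
A vectorised sketch of this coverage experiment is given here. It
recomputes the Wald limits with NumPy arrays rather than calling
`wald_estimate_and_ci` in a loop; the seed and variable names are
arbitrary choices.

```python
import numpy as np
import scipy.stats as stats

n, p, num_reps = 100, 0.6, 100_000
rng = np.random.default_rng(seed=1)  # arbitrary seed

# Simulate all replicates at once and compute the Wald CI for each.
xs = rng.binomial(n=n, p=p, size=num_reps)
p_hats = xs / n
z = stats.norm.ppf(0.975)
half_widths = z * np.sqrt(p_hats * (1 - p_hats) / n)

# A CI "covers" when the true p lies between its limits.
covered = (p_hats - half_widths <= p) & (p <= p_hats + half_widths)
coverage = float(covered.mean())
print(coverage)
```

The printed proportion can be compared with the nominal level of
$0.95$.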


### Question

Are credible intervals and confidence intervals the same thing?

Central limit theorem
---------------------

The central limit theorem (CLT) tells us about the limiting distribution
of the sample mean of an independent and identically distributed (IID)
sample from a distribution with finite variance. It underpins many
results in statistics and is important for reasoning about stochastic
processes.

### Question

Write down a statement of the law of large numbers (LLN). Write down a
statement of the central limit theorem. Make sure you understand what
each of them tells you.

Example: CLT
------------

To see that the distribution of the sample mean converges to a normal
distribution we will do a simulation study.

### Question

Write down the distribution of the sample mean given an IID sample of
exponential random variables with rate $1/5$.

### Question

1.  Generate 500 sample means each based on a sample of 100 exponential
    random variables
2.  Make a visualisation of the distribution of the data (e.g., a KDE or
    histogram) and overlay the CLT approximation.
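
The two steps above can be sketched as follows, assuming the rate
$1/5$ from the previous question (so the mean and standard deviation of
each draw are both 5). The seed, bin count and variable names are
arbitrary.

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rate = 1 / 5  # assumed rate, taken from the previous question
sample_size = 100
num_means = 500
rng = np.random.default_rng(seed=1)  # arbitrary seed

# NumPy parameterises the exponential by its scale (1 / rate).
samples = rng.exponential(scale=1 / rate, size=(num_means, sample_size))
sample_means = samples.mean(axis=1)

# CLT approximation: Normal(mean, sd / sqrt(n)). For an exponential
# distribution the mean and the standard deviation are both 1 / rate.
clt_mean = 1 / rate
clt_sd = (1 / rate) / np.sqrt(sample_size)

grid = np.linspace(sample_means.min(), sample_means.max(), 200)
fig, ax = plt.subplots()
ax.hist(sample_means, bins=30, density=True, alpha=0.5, label="sample means")
ax.plot(grid, stats.norm.pdf(grid, loc=clt_mean, scale=clt_sd),
        label="CLT approximation")
ax.set_xlabel("sample mean")
ax.set_ylabel("density")
ax.legend()
```

Using `density=True` puts the histogram on the same scale as the normal
density, which makes the overlay meaningful.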


### Question

Another way to assess if the samples appear to come from a normal
distribution is to use a Q-Q plot. Generate a Q-Q plot to check if the
samples appear to be normally distributed.
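
One way to draw a Q-Q plot is with `scipy.stats.probplot`, as in the
sketch here. The sample means are regenerated so the block stands on
its own; the seed and the scale of 5 (rate $1/5$) are assumptions
carried over from the earlier question.

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)  # arbitrary seed
# Regenerate 500 sample means, each from 100 exponential draws (mean 5).
sample_means = rng.exponential(scale=5, size=(500, 100)).mean(axis=1)

fig, ax = plt.subplots()
# probplot pairs the ordered data with theoretical normal quantiles,
# fits a straight line, and returns its slope, intercept and r.
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(
    sample_means, dist="norm", plot=ax
)
ax.set_title("Normal Q-Q plot of sample means")
print(r)
```

Points lying close to the fitted line suggest the sample is
approximately normal; systematic curvature in the tails suggests
otherwise.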


Experimental results: flipping coins in series
----------------------------------------------

Each of 15 students takes a turn flipping a coin 30 times and recording
how many heads they got. There is a suspicion that some of the students
did not actually do this properly. Some people think they just wrote
down some garbage and went to lunch early.

Read the data in `experiment1.csv` into a `DataFrame`.

In [6]:
exp1 = pd.read_csv("experiment1.csv")

Compute the point estimate and CI using the function you wrote above.

In [7]:
head_counts = exp1.drop(columns="flip_number").groupby("name").sum()
head_counts["name"] = head_counts.index.copy()

total_heads = int(head_counts["outcome"].sum())
num_people = int(head_counts["name"].unique().size)
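# .item() extracts the single distinct count and raises an error if the
# students did not all record the same number of flips.
num_flips = int(exp1["name"].value_counts().unique().item())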

est_and_ci = wald_estimate_and_ci(num_success=total_heads,
                                  num_trials=num_people * num_flips)

print(est_and_ci)

(0.49333333333333335, (0.44713979693549655, 0.5395268697311701))


We estimate the probability of heads as 0.49 with a $95\%$ CI of
(0.45, 0.54). Since this interval contains 0.5, we are not able to
reject the null hypothesis that the coin is fair.

### Question

Generate a histogram of the number of heads from each student. As an
extension, overlay the binomial distribution supported by your estimate
that is most amenable to large outcomes, i.e. the one with $p$ set to
the upper limit of your CI.


### Question

It looks like there might be a couple of strange points in this dataset
as suspected. Using the upper bound on $p$ calculate the probability of
someone getting all heads. Write a couple of sentences explaining
whether you think it is reasonable to remove those data points.
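
The calculation can be sketched as follows, taking the upper CI limit
from the output printed above and assuming 30 flips per student. The
variable names are illustrative.

```python
import scipy.stats as stats

p_upper = 0.5395268697311701  # upper CI limit from the output above
num_flips = 30

# P(30 heads in 30 flips) under Binomial(30, p_upper).
prob_all_heads = float(stats.binom.pmf(num_flips, num_flips, p_upper))

# Equivalent closed form: every one of the 30 flips must be a head.
prob_closed_form = p_upper ** num_flips

print(prob_all_heads, prob_closed_form)
```

Both expressions give a probability on the order of $10^{-8}$, even
under the value of $p$ most favourable to heads.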


### Question

Once the questionable data has been removed, plot the estimated
binomial distribution on top of the histogram. Write a couple of
sentences explaining what you think about the coin now.




Experimental results: flipping coins in parallel
------------------------------------------------

The Royal Mint has become interested and wants to study an additional 49
coins and repeat the experiment to gather more data about the
fascinating topic of coin bias. Now, each of 50 students is given their
own coin and asked to flip it 30 times and record the results.

### Question

Do we need to change anything about how we analyse this data? If so,
why? If not, why not? **Hint:** there are good arguments that can be
given for each answer. Once you have answered one way, try to answer the
other way.


### Question

Explore the data in `experiment2.csv` using the methodology devised
above and write a couple of sentences to explain what you found.


### Question

Visualise the number of heads each student got and compare the variance
in this to what is predicted by theory.
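
A rough sketch of the variance comparison is given here. The helper
`compare_variance` is hypothetical, and the example uses simulated
stand-in counts rather than the real `experiment2.csv` data. The theory
used is that a Binomial$(n, p)$ count has variance $np(1-p)$.

```python
import numpy as np


def compare_variance(head_counts, num_flips):
    """Return (sample variance, binomial variance) of per-student head counts."""
    head_counts = np.asarray(head_counts)
    # Pooled estimate of p across all students.
    p_hat = head_counts.sum() / (head_counts.size * num_flips)
    observed_var = head_counts.var(ddof=1)
    # Binomial(n, p) has variance n * p * (1 - p).
    theoretical_var = num_flips * p_hat * (1 - p_hat)
    return float(observed_var), float(theoretical_var)


# Simulated stand-in data (NOT experiment2.csv): 50 students, fair coins.
rng = np.random.default_rng(seed=1)  # arbitrary seed
fake_counts = rng.binomial(n=30, p=0.5, size=50)
print(compare_variance(fake_counts, num_flips=30))
```

If the coins had different biases, the observed variance would tend to
exceed the binomial value, which is one thing to look for in the real
data.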



### Question

Consider how you might analyse this data. Over the following weeks you
will learn a couple of approaches.
