# Run this cell first

In [None]:
# this code enables the automated feedback. If you remove this, you won't get any feedback
# so don't delete this cell!
try:
  import AutoFeedback
except (ModuleNotFoundError, ImportError):
  %pip install git+https://github.com/abrown41/AutoFeedback
  import AutoFeedback

try:
  from testsrc import test_main
except (ModuleNotFoundError, ImportError):
  %pip install "git+https://github.com/autofeedback-exercises/exercises.git#subdirectory=New-SOR3012/Expectation"
  from testsrc import test_main

def runtest(tlist):
  import unittest
  from contextlib import redirect_stderr
  from os import devnull
  with redirect_stderr(open(devnull, 'w')):
    suite = unittest.TestSuite()
    for tname in tlist:
      suite.addTest(eval(f"test_main.UnitTests.{tname}"))
    runner = unittest.TextTestRunner()
    try:
      runner.run(suite)
    except AssertionError:
      pass


# Introduction

The exercises in this notebook and the notebook for next week introduce you to some of the tools of statistics.  These are the tools that we use when we analyse data sets. To better understand these tools and the theory that underpins them you are going to generate data sets to analyse by writing the functions for generating random variables that were introduced in the last block.  We will then use these functions to generate large samples of random variables to analyse.

Don't forget to start by executing the following cell, which imports the libraries that we need.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

# Calculating the data range

For these next three exercises I have sampled some random variables from a distribution for you.  I loaded this data into a NumPy array called `x` by using the command:

```python
x = np.loadtxt(<some_url>)
```

The aim of these exercises is to determine something about the distribution that the data in `data.dat` was sampled from.  We do this by calculating quantities known as summary statistics.  Summary statistics are useful because they allow us to summarise the information contained in our data set using fewer numbers.  For example, if we were writing a report about our results we might want to summarise the experimental data that we obtained using a sentence like the one below:

_N measurements of this quantity were obtained.  The lowest value obtained was L, while the highest value was U._

This sentence provides a one-line summary of the range of results that we obtained in the experiment.

Notice that we can get the number of elements in a NumPy array called `myarray` by using the command:

```python
number = len( myarray )
```

Furthermore, we can use Python to determine the largest and smallest values in a NumPy array called `myarray` by using the commands:

```python
lowest = min( myarray )
highest = max( myarray )
```

__To complete this task write a program that determines the values that should be used to replace `N`, `L` and `U` in the sentence in italics above given the data in the `data.dat` file.__
To pass the test you will need to define three variables called `N`, `L` and `U` in your code.  The values of these
variables should be set so that the sentence in italics above is an accurate description of the information in `data.dat`.

In [None]:
# This loads the data that we are going to investigate
x = np.loadtxt('https://raw.githubusercontent.com/autofeedback-exercises/exercises/main/New-SOR3012/Expectation/data.dat')

# Your code will go here


In [None]:
runtest(['test_N', 'test_L', 'test_U'])

# Calculating the cumulative distribution for the data

Each of the random variables that you have learned to sample in the earlier parts of this exercise has a corresponding distribution that is sampled when we generate the data.  A question we might ask, therefore, is whether we can calculate the distribution if we are given a sample of random variables from a particular distribution.  The answer to this is no.  There are, however, a number of ways we can estimate the distribution function for a random variable.  The following 10-minute video explains two of these methods.  The method that is of particular interest for this exercise is the method for estimating the cumulative probability distribution function, $P(X \le x)$.

In [1]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/VaZTKmcxLvY?si=-AHGSjS7aJ73VrN7" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

I have started writing the code to calculate the cumulative distribution for you in the following cell.  As you can see, I have loaded the data set from the file `data.dat` and saved it in a NumPy array called `x`.  The two lines at the end of the script that read:

```python
plt.plot( x, y, 'k-' )

```

are then going to plot our cumulative probability distribution function.  You will notice that the list called `y` that we are plotting with the plot command is not defined anywhere in the code.  One of your tasks is, therefore, to write the code that calculates this variable.

Recall from the video that calculating the cumulative distribution for a dataset involves the following steps:

1. Sort the data.  You can sort the data in the array called `x` by issuing the command `x.sort()`.
2. Plot a graph in which the x-coordinates give the sorted data values and the y-coordinates are the index of the corresponding x-coordinate in the sorted list divided by the total number of points in the list.
3. Label the x-axis of your graph 'x' and the y-axis of your graph 'cumulative distribution'.

Once you have sorted `x` you just need to create the list called `y` that you are going to plot.  This list is going to contain the integers from 1 to the number of data points in your list, each divided by the total number of data points.  If you had four data points `y` would thus be:

```python
y = [1/4, 2/4, 3/4, 4/4]
```

Try to write the rest of the code to plot your cumulative probability distribution now.
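
As a minimal sketch of the sorting and indexing steps described above, the cell below builds the y-values for a small made-up data set.  The array `toy`, its four values and the name `toy_y` are assumptions chosen purely for illustration; they are not the data in `data.dat`.

```python
import numpy as np

# Hypothetical data set with four values (not the exercise data)
toy = np.array([0.7, 0.1, 0.9, 0.4])

# Step 1: sort the data in place
toy.sort()

# Step 2: the i-th sorted point (counting from 1) gets the value i/n
n = len(toy)
toy_y = np.arange(1, n + 1) / n

print(toy)    # the sorted values
print(toy_y)  # 1/4, 2/4, 3/4, 4/4
```

Plotting `toy_y` against `toy` would then give a four-point estimate of the cumulative distribution.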

In [None]:
# This loads the data that we are going to investigate
x = np.loadtxt('https://raw.githubusercontent.com/autofeedback-exercises/exercises/main/New-SOR3012/Expectation/data.dat')

# Your code will go here



# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()

In [None]:
runtest(['test_plot'])

# The median and percentiles

Plotting the cumulative probability distribution is useful because, once we know this function, we can use it to describe the data using a sentence such as:

_z % of the data points are less than or equal to x._

You can determine the value of x if you are given z by using the function `np.percentile` as shown below:

```python
x = np.percentile( data, z )
```

The quantity, x, that is output by this function is a value that z % of the data in the NumPy array `data` is less than or equal to.
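
To make the behaviour of `np.percentile` concrete, here is a small sketch on a made-up array.  The array `toy` and the percentiles chosen (10, 50 and 90) are assumptions for illustration only.  Note that, by default, NumPy interpolates linearly between neighbouring data points, so the value returned need not be one of the data points.

```python
import numpy as np

# Hypothetical data set (not the exercise data)
toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Value that 50 % of the points are less than or equal to
p50 = np.percentile(toy, 50)

# Several percentiles can be requested at once by passing a list
p10, p90 = np.percentile(toy, [10, 90])

print(p50, p10, p90)
```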

__To complete this exercise I would like you to use `np.percentile` to calculate:__

1. the minimum of the data set
2. the lower quartile
3. the median
4. the upper quartile
5. the maximum

for the data contained in the np array called `x`.  These quantities should be saved in variables called `dmin`, `lowq`, `median`, `highq` and `dmax`.

We can display these 5 points graphically by using a box plot.  I have included some code at the end of the program that will produce a box plot for you.


In [None]:
# This loads the data that we are going to investigate
x = np.loadtxt('https://raw.githubusercontent.com/autofeedback-exercises/exercises/main/New-SOR3012/Expectation/data.dat')

# Your code will go here




# This will produce a box plot for you automatically
plt.boxplot(x)
# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()


In [None]:
runtest(['test_dmin', 'test_lowq', 'test_median', 'test_highq', 'test_dmax'])

# How the sample mean changes with sample size

The statistics that we have calculated by ordering the data are straightforward to understand. The final method we have for analysing a data set involves taking the mean using:

$$
\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i
$$

where each of the $X_i$ are random variables. This operation is simple to perform.  However, the theory that underpins why this is a sensible operation to perform with a data set is rather complicated.  This theory is complicated because all the $X_i$ values that are being added together in the sum above are random variables.  __$\overline{X}$ is thus a random variable__. The exercises in this notebook are, therefore, about determining __the distribution for $\overline{X}$.__

Before we get to that, however, we are going to look at how the value of the sample mean changes as we change the number of random variables that it is calculated from.  The video in the cell below explains how you can write a program to determine how the mean changes as the number of random variables it is computed from changes.

In [2]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/UbVJU08Vb5o?si=gQ_YSThNPQ9q1gS8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Try to see if you can write the code from the video in the cell below and draw a graph showing how the sample mean computed from a number of uniform random variables changes as you increase the number of random variables from which the mean is computed.  To complete the code you will need to:

1. Set the first element of the list called `indices` equal to 1, the second element of the list called `indices` to 2 and so on.
2. Set the first element of the list called  `average` equal to a sample mean calculated by generating 1 uniform random variable that lies between 0 and 1, the second element of the list `average` equal to a sample mean calculated by generating 2 uniform random variables that lie between 0 and 1, set the third element of the list called `average` equal to a sample mean calculated by generating 3 uniform random variables that lie between 0 and 1 and so on until you have computed an average by generating 200 uniform random variables.


When the code is complete it should generate a graph of the sample mean versus the number of random variables it is calculated from.  The red points on this graph are your various estimates of the sample mean.  The black, dashed horizontal line, meanwhile, indicates the value of the true expectation for this distribution.  You should see that the sample mean gets progressively closer and closer to this line as the number of random variables it is computed from increases.
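
The same convergence can be seen for other distributions.  The sketch below is an illustration only, not the loop from the video: it uses rolls of a fair six-sided die (an assumption chosen so as not to duplicate the exercise) and computes every running mean in one go with `np.cumsum`.  The names `rolls` and `running_mean` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the run is repeatable

n_rolls = 10000
rolls = rng.integers(1, 7, size=n_rolls)  # fair die: integers from 1 to 6

# running_mean[k] is the mean of the first k+1 rolls
running_mean = np.cumsum(rolls) / np.arange(1, n_rolls + 1)

# The later entries should sit close to the expectation of a fair die, 3.5
print(running_mean[0], running_mean[9], running_mean[-1])
```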

In [None]:
ssum, indices, average = 0, np.zeros(200), np.zeros(200)
for i in range(200) :
  # Add code to setup the numpy arrays called indices and average to generate the desired
  # plot here.



# This will plot the graph for the data.  You should not need to adjust this.
plt.plot( indices, average, 'ro' )
plt.plot( [0,200], [0.5,0.5], 'k--' )
plt.xlabel('Number of random variables')
plt.ylabel('Sample mean')

# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()


In [None]:
runtest(['test_variables_1'])

# Calculating the expectation

The previous exercise demonstrated that as you increase $n$ the sample mean:

$$
\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i
$$

converges to a quantity called the expectation for the random variable (if it exists).  The expectation for a discrete random variable is defined as:

$$
\mathbb{E}(X) = \sum_{i=0}^\infty x_i P(X=x_i)
$$

while the expectation for a continuous random variable is defined as:

$$
\mathbb{E}(X) = \int_{-\infty}^\infty x f(x) \textrm{d}x
$$

where $f(x)$ is the probability density function. To revise how these formulas are used you can watch the following 15-minute video.

In [3]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/4l6N4mD1n6I?si=A4A4cYlXS5vXIvdX" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
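
As a quick numerical illustration of the two definitions of the expectation given above, the sketch below evaluates the sum for a fair six-sided die and the integral for a made-up density $f(x) = 2x$ on the interval from 0 to 1.  Both distributions are assumptions chosen for illustration, and the integral is evaluated numerically with `scipy.integrate.quad` rather than by hand.

```python
import numpy as np
from scipy.integrate import quad

# Discrete case: fair six-sided die, P(X = x_i) = 1/6 for x_i = 1..6
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
e_discrete = np.sum(values * probs)

# Continuous case: hypothetical density f(x) = 2x on [0, 1], zero elsewhere
def density(x):
    return 2 * x

# quad returns the value of the integral and an error estimate
e_continuous, _ = quad(lambda x: x * density(x), 0, 1)

print(e_discrete, e_continuous)  # analytically 7/2 and 2/3
```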

I also would like you to revise the expectations for uniform, binomial, Bernoulli, geometric, exponential, negative binomial and normal random variables.  You learned exact expressions for the expectations of all these types of random variable in SOR1020.  For example, you know that the expectation of a binomial random variable Y with parameters n and p is:

$$
\mathbb{E}(Y) = np
$$

For this exercise I thus want you to write functions that return the true expectations for the following kinds of random variables:

* Bernoulli random variables
* Binomial random variables
* Geometric random variables
* Uniform discrete random variables
* Uniform continuous random variables
* Negative binomial random variables
* Exponential random variables
* Normal random variables

As you can see in the stub code below, each of your functions takes the parameters of the random variable as input.  You then need to use the formula for the expectation within the function.  N.B. If you have forgotten the expression for the expectation of any one of these types of random variable you can easily look it up on Wikipedia.

In [None]:
import numpy as np

def bernoulli(p) :
  # Insert code for calculating and returning the expectation of a Bernoulli random variable here

def binomial(n, p) :
  # Insert code for calculating and returning the expectation of a binomial random variable here

def geometric(p) :
  # Insert code for calculating and returning the expectation of a geometric random variable here

def negative_binomial(r, p) :
  # Insert code for calculating and returning the expectation of a negative binomial random variable here

def uniform_continuous(a, b) :
  # Insert code for calculating and returning the expectation of a uniform continuous random variable here

def uniform_discrete(a,b) :
  # Insert code for calculating and returning the expectation of a uniform discrete random variable here

def exponential(lam) :
  # Insert code for calculating and returning the expectation of an exponential random variable here

def normal(mu, sigma) :
  # Insert code for calculating and returning the expectation of a Normal random variable here


print('The expectation for a Bernoulli random variable with p=0.5 is', bernoulli(0.5) )
print('The expectation for a binomial random variable with n=5, p=0.5 is', binomial(5,0.5) )
print('The expectation for a geometric random variable with p=0.5 is', geometric(0.5) )
print('The expectation for a negative binomial random variable with r=3 and p=0.5 is', negative_binomial(3,0.5) )
print('The expectation for a uniform continuous random variable with a=0 and b=1 is', uniform_continuous(0,1) )
print('The expectation for a uniform discrete random variable with a=1 and b=8 is', uniform_discrete(1,8) )
print('The expectation for an exponential random variable with lambda=2 is', exponential(2) )
print('The expectation for a normal random variable with mu=4 and sigma=2 is', normal(4,2) )


In [None]:
runtest(['test_bernoulli', 'test_binomial', 'test_geometric', 'test_negative_binomial', 'test_uniform_continuous', 'test_uniform_discrete', 'test_exponential', 'test_normal'])

# Calculating the sample mean

Having established that the sample mean converges to the expectation, let's return to the question of the distribution of the sample mean. The easiest way to investigate this distribution is to sample random variables from it multiple times, as is discussed in the following video.

In [4]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/JBbu4aW3yvI?si=yotlGeR2yfXHESJb" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

 __To complete this exercise I would thus like you to complete the function called `average` below__.  This function takes an integer called `n` as input.  Within the function, you should generate `n` uniform random variables between 0 and 1.  You should then calculate the sample mean from these `n` variables using the expression above and then `return` this quantity from your function.

When your code is complete a graph will be generated.  The red points are all uniform random variables that lie between 0 and 1.  The black points, meanwhile, are all sample means computed from sets of 100 uniform random variables.  __Before moving on take a look at this graph and consider which of the two sets of points is more narrowly distributed.__

In [None]:
def average(n) :
  # Your code to compute the average for a set of n uniform random variables goes here.


# You should not need to adjust the code from here onwards
xv, yv1, yv2 = np.linspace(1,100,100), np.zeros(100), np.zeros(100)
for i in range(100) :
    yv1[i] = np.random.uniform(0,1)
    yv2[i] = average(100)

plt.plot( xv, yv1, 'ro' )
plt.plot( xv, yv2, 'ko' )

# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()


In [None]:
runtest(['test_variables'])

# Sampling the mean

In the next few exercises we are going to be using the word sample **a lot**.  I thus want to make sure we are on the same page with the distinction between the following two terms:

1. **sample mean** = the mean (i.e. a single scalar) that is obtained when you calculate the mean from a sample of random variables.
2. **sample** = a set (i.e. multiple scalars) of random variables.

This is going to get confusing because we are going to be talking about samples of sample means.  In other words, we are going to be talking about a set of random variables all of which are sample means.

In this exercise I want to ensure that you understand what it means for us to generate a sample of sample means.  I thus want you to complete the following tasks:

1. Write a function called `sample_mean`.  This function should take in a single argument `n` and should return a sample mean that is computed by adding together `n` uniform continuous random variables that lie between 0 and 1.

2. Write a second function called `sample`.  This function should take in two arguments `m` and `n`.  It should return a NumPy array that contains `m` elements.  Each of these `m` elements should be equal to a sample mean that was computed by adding together `n` uniform continuous random variables that lie between 0 and 1.

N.B. Your function called `sample` should call your function called `sample_mean`.


In [None]:
def sample_mean(n) :
  # Code for generating the sample mean goes here


def sample(m,n) :
  # Code for generating the sample goes here


In [None]:
runtest(['test_mean', 'test_sample'])

# Confidence limit on mean

Now that we can generate multiple estimates of the sample mean we can provide information about the distribution for the sample mean.  In previous exercises we have seen how we can provide this information on the distribution by proposing a confidence limit.  The process that we used to do this in those previous exercises involved:

1. Generating multiple random variables.
2. Using the `np.percentile` function to find a range that 90% of these random variables fall within.

We could thus make our result reproducible by noting that if a colleague was sampling the same distribution as us they should obtain a result that falls between the 5th and 95th percentiles of our distribution of results with a probability of 90%.
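
The two-step recipe above can be sketched on a made-up sample.  The sample of standard normal random variables used here, and the names `data`, `lower`, `upper` and `inside`, are assumptions for illustration; the exercise below applies the same idea to sample means instead.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded so the run is repeatable

# Hypothetical sample: 10000 standard normal random variables
data = rng.normal(0, 1, size=10000)

# The 5th and 95th percentiles bracket the central 90 % of the sample
lower, upper = np.percentile(data, [5, 95])

# Fraction of the sample that falls inside the quoted range
inside = np.mean((data >= lower) & (data <= upper))

print(lower, upper, inside)
```

The fraction printed at the end should be very close to 0.9, which is what quoting a 90% range means.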

To complete this exercise I want you to apply this idea for quoting confidence limits on a sample mean.  To complete the exercise you will need to complete the `limit` function in the cell below.  This function should take in two numbers `m` and `n`.  I have called the function `sample` that you wrote in the previous exercise, which will generate `m` sample means and store them in a NumPy array.  Each of these sample means will have been calculated from `n` uniform random variables that lie between 0 and 1.

The function `limit` should then return three numbers.  The first of these numbers `lower` should be the 5th percentile of the distribution of sample means that were generated.   The second of these numbers `median` should be the median of the sample means that were generated.  The final number `upper` should be the 95th percentile of the distribution of sample means.


In [None]:
def limit(m, n) :
    # This generates m sample means using the function called
    # sample from the previous exercise.  Each of these sample
    # means is computed from n uniform random variables between
    # 0 and 1.
    mydata = sample(m, n)

    # When completed this function should return
    # lower = the 5th percentile of the distribution for the sample mean
    # median = your estimate for the median
    # upper = the 95th percentile of the distribution for the sample mean
    return lower, median, upper

print( limit(10,100) )
print( limit(10,100) )
print( limit(10,100) )
print( limit(10,100) )


In [None]:
runtest(['test_mean_1', 'test_limit'])

# The central limit theorem

We can now introduce what is arguably the most important idea in statistics - the central limit theorem.  This theorem and the reasons it is useful are explained in the following 14-minute video.

In [5]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/eRmYhuXrtdw?si=zo2I82JwQ5DXOGTu" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

The short summary of the video is that sample means (for most distributions) are approximately normally distributed when they are computed from a large number of random variables.  This is why the most common probability density function for random variables is:

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

When we use this density we assume that the random variable is a mean of many other random variables.  We can thus safely use this two-parameter distribution.

In the earlier exercises, we have discussed how to generate an estimate for $\mu$.  It is what we have called $\overline{X}$, which was defined as follows.

$$
\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i
$$

We have also looked at the properties of this estimator.  We now need to learn how to compute a quantity, which we will call $S$, that is an estimator for the second parameter $\sigma$.  We will then look at the properties of this random variable.

# Calculating the sample variance

If we do not know the exact value of the expectation for the distribution we should use the following estimator for the variance, which is known as the sample variance:

$$
S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \overline{X})^2
$$

which we can easily rearrange to:

$$
S^2 = \frac{n}{n-1} \left[ \left( \frac{1}{n} \sum_{i=1}^n X_i^2 \right)- \overline{X}^2 \right]
$$

The way in which this expression is used to calculate the variance is explained in this video.

In [8]:
%%HTML 
<iframe width="560" height="315" src="https://www.youtube.com/embed/eRmYhuXrtdw?si=C-41WAHRcpjyj-_K" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

__Your task in this exercise is to write a function called `variance` that calculates an estimate of this quantity.__  This function should take in a single number `n`.  Within the function you should then generate `n` uniform random variables that all lie between 0 and 1.  From these `n` random variables you should then calculate an estimate for the variance of the underlying distribution using the second of the two expressions above.
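
If you want to convince yourself that the two forms of the estimator agree, the sketch below evaluates both on a small made-up array and compares them with NumPy's built-in `np.var` with `ddof=1`.  The rearranged form is written here as $\frac{n}{n-1}\left[\frac{1}{n}\sum_{i=1}^n X_i^2 - \overline{X}^2\right]$.  The array `toy` and the variable names are assumptions for illustration; your function should generate its own uniform random variables.

```python
import numpy as np

# Hypothetical data set (not random, so the numbers can be checked by hand)
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(toy)
xbar = toy.sum() / n

# Definition: squared deviations from the sample mean, divided by n-1
s2_direct = np.sum((toy - xbar) ** 2) / (n - 1)

# Rearranged form: only needs the sum of X_i and the sum of X_i^2
s2_rearranged = n / (n - 1) * (np.sum(toy ** 2) / n - xbar ** 2)

# NumPy's estimator with the n-1 denominator
s2_numpy = np.var(toy, ddof=1)

print(s2_direct, s2_rearranged, s2_numpy)  # all three equal 32/7
```

The rearranged form is convenient when the sums are accumulated one random variable at a time, because it does not require the mean to be known before the loop starts.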

In [None]:
def variance(n) :
  # Your function to calculate the variance for a set of n uniform random variables goes here



In [None]:
runtest(['test_variables_3'])

# Converging the sample variance

Let's now look at how $S^2$ depends on the number of random variables that it is computed from.  __To complete the exercise you need to generate a graph that shows how the estimate of the sample variance for a uniform random variable that lies between 0 and 1 depends on the number of random variables it is computed from.__   I have written some code to get you started with the exercise.  To complete the task you need to:

1. Set the first element of the array called `indices` equal to 2, the second element of the array called `indices` to 3 and so on.  (Notice that the sample variance with n=1 is not defined because, if n=1, the n-1 in the denominator is 0 and the numerator is similarly zero.)

2. Set the first element of the array called `S2` equal to a sample variance computed using the above formula with n=2, the second element of this array to a sample variance computed using the above formula with n=3 and so on until you have computed the formula above with n=201.

3. Draw a graph that has the number of random variables that were used to calculate the variance on the x-axis and the estimate of the variance on the y-axis.  The x-axis label for this graph should be __Number of random variables__ and the y-axis label should be __Sample variance__.

When your code is complete a graph showing the value of the estimate of the sample variance as a function of n will be generated.


In [None]:
myvar = np.random.uniform(0,1)
ssum, ssum2 = myvar, myvar*myvar
indices, S2 = np.zeros(200), np.zeros(200)
for i in range(200) :
  # Add code to set up the NumPy arrays called indices and S2 to generate the desired plot here.


# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()

In [None]:
runtest(['test_variables_4'])

# Variance of mean

We are now going to investigate the variance for the sample mean.  This idea is discussed in the following video.


In [7]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/GDP4VeNfUhg?si=T7BmQ2T-_VG2aIUf" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

In an earlier exercise you wrote a function called `sample_mean` that took a single integer `n` as input and that returned a sample mean that was computed by generating `n` uniform random variables that lie between 0 and 1.

In the function called `variance` below you are going to write code to calculate the variance for a sample of `m` sample means. In other words, you are going to take the code for calculating variances that you wrote in the last two exercises and replace the call to `np.random.uniform` with a call to `sample_mean`.

The function `variance` will thus take two integers, `m` and `n`, as input.  This function should return an estimate of the variance for a sample mean computed from `n` uniform random variables that lie between 0 and 1.  This variance should be calculated by generating `m` estimates of the sample mean.

You then need to write the internals of the loop that I have started to set the variables `xvals` and `yvals`:

1. The first element of `xvals` should be set equal to 1, the second element should be set to 2 and so on.
2. The first element of `yvals` should be set equal to an estimate of the variance for a sample mean computed from one random variable, the second element should be set equal to an estimate of the sample variance for a sample mean computed from two random variables, the third element should be set equal to an estimate of the sample variance for a sample mean computed from three random variables.  This process should be continued until you have an estimate of the sample variance for a sample mean computed from 50 random variables.

All your estimates for the sample variance should be computed from 10 sample means, i.e. with `m` set to 10.

You should plot a graph showing your values for the sample variance on the y-axis and the number of random variables that were added together to compute the mean on the x-axis.  The label for the x-axis should be 'Number of variables used to calculate mean' and the y-axis label should be 'Variance'.

What we are looking at here is how the variance of the sample means changes with the number of random variables that was used to calculate the sample mean.
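
For reference, the graph can be compared against a standard result.  The following is a sketch that assumes the $X_i$ are independent and that each has the same finite variance $\sigma^2$:

$$
\textrm{var}(\overline{X}) = \textrm{var}\left( \frac{1}{n}\sum_{i=1}^n X_i \right) = \frac{1}{n^2} \sum_{i=1}^n \textrm{var}(X_i) = \frac{\sigma^2}{n}
$$

For a uniform random variable that lies between 0 and 1, $\sigma^2 = 1/12$, so under these assumptions the estimates you plot should scatter around $1/(12n)$.  Because each estimate is computed from only a small number of sample means, expect the points to be noisy.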

In [1]:
def variance(m,n) :
    # Your code to estimate the variance for a set of m
    # sample means, which are each computed from n
    # uniform random variables between 0 and 1 goes here

xvals, yvals = np.zeros(50), np.zeros(50)
for i in range(50) :
  # Your code to set xvals and yvals as described in the text
  # above goes here


# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()




In [None]:
runtest(['test_variables_2'])

# Using the normal distribution for error bars

We can now use the central limit theorem and what we have learned about estimating sample variances to calculate error bars. To complete the code below you will need to write a function called `mean_with_errors` that takes in a parameter called `n`.  This function should return a sample mean computed from `n` uniform random variables that lie between 0 and 1.  This quantity should be returned in the variable called `mean`.  In addition, you should also compute the 5th and 95th percentiles for the distribution that the mean was sampled from.  These two quantities should be returned as `lower` and `upper`.  When calculating `lower` and `upper` you should assume that the sample mean is a sample from a normal distribution with suitable parameters.

Within the function called `mean_with_errors` you will need to compute the sample mean and the sample variance for your sample of `n` uniform random variables.  You should then be able to calculate `lower` and `upper` using your computed values for the sample mean and sample variance and the following Python function:

```python
ppp = scipy.stats.norm.ppf(0.95)
```

The call above computes the 95th percentile for a standard normal random variable, i.e. a normal random variable with expectation 0 and variance 1.  This exercise brings together all the ideas in the video entitled "evaluating confidence limits on averages using the central limit theorem" that I provided in an earlier cell.
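
To see how a percentile of the standard normal is converted into a percentile of a normal random variable with a different expectation and standard deviation, consider the sketch below.  The values of `mu` and `sigma` are made-up numbers chosen for illustration; in the exercise you will need to work out which expectation and standard deviation are appropriate for the sample mean.

```python
import scipy.stats

# Hypothetical parameters chosen for illustration only
mu, sigma = 10.0, 2.0

z95 = scipy.stats.norm.ppf(0.95)  # 95th percentile of a standard normal

# Shift and scale the standard normal percentile
upper = mu + z95 * sigma
lower = mu - z95 * sigma  # by symmetry, ppf(0.05) = -ppf(0.95)

# Cross-check against SciPy's loc and scale arguments
check_upper = scipy.stats.norm.ppf(0.95, loc=mu, scale=sigma)
check_lower = scipy.stats.norm.ppf(0.05, loc=mu, scale=sigma)

print(lower, upper)
print(check_lower, check_upper)
```

Both routes should print the same pair of numbers, so either can be used.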


In [None]:
def mean_with_errors(n) :
  # Your code to calculate the sample mean and sample variance
  # for a set of n uniform random variables between 0 and 1 goes
  # here.


  # When complete this function should return
  # lower = the 5th percentile of the distribution that was sampled
  # mean = your estimate for the sample mean
  # upper = the 95th percentile of the distribution that was sampled
  # N.B. To compute lower and upper you should be using the central
  # limit theorem as discussed in the explanatory text.
  return lower, mean, upper

print( mean_with_errors(100) )
print( mean_with_errors(100) )
print( mean_with_errors(100) )
print( mean_with_errors(100) )


In [None]:
runtest(['test_function'])

# Taking it further

You can take the ideas in this notebook further by investigating how the estimators for the mean and variance behave when the random variables that are being added together are not simply uniform random variables.  You can look at how these estimators behave when these random variables are:

* Bernoulli random variables
* Binomial random variables
* Geometric random variables
* Negative binomial random variables
* Normal random variables
* Exponential random variables
* Discrete uniform random variables

Or any other types of random variable that you have encountered.

A random variable whose behaviour is particularly interesting is described in the following video:

In [9]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/lNfZhE7TyTI?si=aGaK3VyQTH74DWmo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Can you identify a function of this type of random variable that does converge?