# Run this cell first

In [None]:
# this code enables the automated feedback. If you remove this, you won't get any feedback
# so don't delete this cell!
try:
  import AutoFeedback
except (ModuleNotFoundError, ImportError):
  !pip install git+https://github.com/abrown41/AutoFeedback@notebook
  import AutoFeedback

try:
  from testsrc import test_main
except (ModuleNotFoundError, ImportError):
  !pip install "git+https://github.com/autofeedback-exercises/exercises.git@testpip#subdirectory=New-SOR3012/Expectation"
  from testsrc import test_main

def runtest(tlist):
  import unittest
  from contextlib import redirect_stderr
  from os import devnull
  with redirect_stderr(open(devnull, 'w')):
    suite = unittest.TestSuite()
    for tname in tlist:
      suite.addTest(eval(f"test_main.UnitTests.{tname}"))
    runner = unittest.TextTestRunner()
    try:
      runner.run(suite)
    except AssertionError:
      pass


# Calculating the data range

For these next three exercises I have sampled some random variables from a distribution for you.  The data I have generated is contained in the `data.dat` file.  I have also loaded the data from this file into a NumPy array called `x` by using the command:

```python
x = np.loadtxt(<some_url>)
```

The aim of these exercises is to determine something about the distribution that the data in `data.dat` was sampled from.  We do this by calculating quantities known as summary statistics.  Summary statistics are useful because they allow
us to summarise information contained in our data set using fewer numbers.  For example, if we were writing a report about our results we might want to summarise the experimental data that we obtained using a sentence like the one below:

_N measurements of this quantity were obtained.  The lowest value obtained was L, while the highest value was U._

This sentence provides a one-line summary of the range of results that we obtained in the experiment.

Notice that we can get the number of elements in a NumPy array called `myarray` by using the command:

```python
number = len( myarray )
```

Furthermore, we can use Python to determine the largest and smallest values in a NumPy array called `myarray` by using the commands:

```python
lowest = min( myarray )
highest = max( myarray )
```
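
As a quick illustration, the sketch below applies these three commands to a small made-up array.  The name `demo` and its values are hypothetical and are not taken from `data.dat`.

```python
import numpy as np

# Hypothetical measurements, used only to illustrate the commands above
demo = np.array([2.5, 0.3, 4.1, 1.7])

number = len(demo)    # how many values the array holds
lowest = min(demo)    # smallest value in the array
highest = max(demo)   # largest value in the array

print(number, lowest, highest)
```

NumPy arrays also provide the methods `demo.min()` and `demo.max()`, which give the same answers.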

__To complete this task write a program that determines the values that should be used to replace `N`, `L` and `U` in the sentence in italics above given the data in the `data.dat` file.__
I have loaded the data in this file into an array called `x` for you.  To pass the test you will need to define three variables called `N`, `L` and `U` in your code.  The values of these
variables should be set so that the sentence in italics above is an accurate description of the information in `data.dat`.


In [None]:
import numpy as np

# This loads the data that we are going to investigate
x = np.loadtxt('https://raw.githubusercontent.com/autofeedback-exercises/exercises/testpip/New-SOR3012/Expectation/data.dat')

# Your code will go here


In [None]:
runtest(['test_N', 'test_L', 'test_U'])

# Calculating the cumulative distribution for the data

Each of the random variables that you have learned to sample in the earlier parts of this exercise has a corresponding distribution that is sampled when we generate the data.
A question we might ask, therefore, is whether we can calculate the distribution if we are given a sample of random variables from a particular distribution.  The answer to this is no.  There are, however,
a number of ways we can estimate the distribution function for a random variable.  In this exercise I am going to show you how we can estimate the cumulative probability distribution function, P(X<=x).

Watching [this video](https://www.youtube.com/watch?v=VaZTKmcxLvY) will help you to better understand this exercise.

I have started writing the code to calculate the cumulative distribution for you in the file `main.py`.  As you can see I have loaded the data set from the file `data.dat` and saved it in a list called `x`.  The two lines at the end of the script that read:

```python
plt.plot( x, y, 'k-' )

```

are then going to plot our cumulative probability distribution function.  You will notice that the list called `y` that we are plotting with the plot command is not defined anywhere in the code.  One of your tasks is, therefore, to write the code that calculates this variable.

Recall from the video that calculating and plotting the cumulative distribution for a dataset involves the following steps:

1. Sorting the data.  You can sort the data in the array called `x` by issuing the command `x.sort()`.
2. Plotting a graph in which the x-coordinates give the sorted data values and the y-coordinates are the index of the corresponding x coordinate in the sorted list divided by the total number of points in the list.
3. Labelling the x-axis of your graph 'x' and the y-axis of your graph 'cumulative distribution'.

Once you have sorted `x` you just need to create the list called `y` that you are going to plot.  This list is going to contain the integers from 1 up to the number of data points in your list, each divided by the total number of data points.  If you had four data points `y` would thus be:

```python
y = [1/4, 2/4, 3/4, 4/4]
```
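
One way to build this list for any number of points is sketched below on a made-up data set.  The names `demo` and `n` are hypothetical, and `np.arange` is only one of several ways of generating the sequence.

```python
import numpy as np

# Hypothetical data set with four points (not the data from data.dat)
demo = np.array([0.8, 0.1, 0.5, 0.3])

demo.sort()                      # sorts the array in place
n = len(demo)
y = np.arange(1, n + 1) / n      # 1/n, 2/n, ..., n/n

print(demo)
print(y)
```

Each entry of `y` is the fraction of the data that is less than or equal to the entry of the sorted array at the same position.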

Try to write the rest of the code to plot your cumulative probability distribution now.


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# This loads the data that we are going to investigate
x = np.loadtxt('https://raw.githubusercontent.com/autofeedback-exercises/exercises/testpip/New-SOR3012/Expectation/data.dat')

# Your code will go here



# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()


In [None]:
runtest(['test_plot'])

# The median and percentiles

Plotting the cumulative probability distribution is useful because, once we know this function, we can use it to describe the data using a sentence such as:

_z % of the data points are less than or equal to x._

In this exercise I am going to show you how to determine the value of x if you are given z.
We can do this using the function `np.percentile` as shown below:

```python
x = np.percentile( data, z )
```

The quantity, x, that is output by this function is a value that z % of the data in the NumPy array `data` is less than or equal to.
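
The sketch below applies `np.percentile` to a small made-up array so that the results are easy to check by eye.  The values in `data` are hypothetical and have nothing to do with `data.dat`.

```python
import numpy as np

# Hypothetical data, used only to illustrate np.percentile
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Value that 50 % of the data is less than or equal to
print(np.percentile(data, 50))   # 3.0

# The 0th and 100th percentiles
print(np.percentile(data, 0), np.percentile(data, 100))   # 1.0 5.0
```

With its default settings `np.percentile` interpolates linearly between neighbouring data points when the requested percentile falls between two of them.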

__To complete this exercise I would like you to use `np.percentile` to calculate:__

1. the minimum of the data set
2. the lower quartile
3. the median
4. the upper quartile
5. the maximum

for the data contained in the NumPy array called `x`.  These quantities should be saved in variables called `dmin`, `lowq`, `median`, `highq` and `dmax`.

We can display these 5 points graphically by using a box plot.  I have included some code at the end of the program that will produce a box plot for you.


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# This loads the data that we are going to investigate
x = np.loadtxt('https://raw.githubusercontent.com/autofeedback-exercises/exercises/testpip/New-SOR3012/Expectation/data.dat')

# Your code will go here




# This will produce a box plot for you automatically
plt.boxplot(x)
# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()


In [None]:
runtest(['test_dmin', 'test_lowq', 'test_median', 'test_highq', 'test_dmax'])

# Calculating the expectation

As discussed in the previous couple of exercises the sample mean that we calculate is a random variable.  In order to make our result reproducible we must provide some information about the distribution whenever we quote this quantity.  When we calculate the mean, however, we are estimating the expectation of the random variable.  The expectation of a discrete random variable can be calculated exactly by using the following expression:

![](equation-1.png)

The sum in this expression runs over all the values that the random variable can take and P(X=x) is the probability mass function.  Critically the value of the expectation that is calculated using the sum above is __not__ random.  You can thus calculate it exactly.
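
As an illustration of this kind of sum, the sketch below evaluates the expectation of a fair six-sided die, assuming the expression is the sum of x P(X=x) over all values x.  The die is an example chosen here and is not one of the exercises; the names `values` and `pmf` are hypothetical.

```python
import numpy as np

values = np.arange(1, 7)      # outcomes 1, 2, ..., 6
pmf = np.full(6, 1 / 6)       # P(X=x) is 1/6 for every outcome

# Sum of x * P(X=x) over all the values that X can take
expectation = np.sum(values * pmf)
print(expectation)
```

Running the code twice gives the same number, approximately 3.5, because nothing in the calculation is random.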

In these exercises we are calculating sample means by adding together independent and identical uniform, binomial, Bernoulli, geometric, exponential, negative binomial and normal random variables.  You know exact expressions for the expectations of all these types of random variable.  For example, you know that the expectation of a binomial random variable Y with parameters n and p is:

![](equation-2.png)

Whenever you write code to estimate the expectation of one of these variables you should check that the true expectation lies within the confidence limit that you calculate as a sanity check on your code.
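
A check of this kind might start like the sketch below, which compares a sample mean for binomial random variables with the exact expectation np.  The parameter values, the seed and the use of `np.random.default_rng` are assumptions made for illustration.  For brevity the sketch prints the difference and does not compute a confidence limit.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable

n, p = 10, 0.3                   # hypothetical binomial parameters
samples = rng.binomial(n, p, size=100_000)

estimate = samples.mean()        # sample mean, a random variable
exact = n * p                    # true expectation, not random

print(estimate, exact, abs(estimate - exact))
```

A large difference between `estimate` and `exact` would suggest a mistake in either the sampling code or the formula.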

For this exercise I want you to write functions that return the true expectations for the following kinds of random variables:

* Bernoulli random variables
* Binomial random variables
* Geometric random variables
* Uniform discrete random variables
* Uniform continuous random variables
* Negative binomial random variables
* Exponential random variables
* Normal random variables

As you can see in the stub code below, each of your functions takes the parameters of the random variable as input.  You then need to use the formula for the expectation within the function.  N.B. If you have forgotten the expression for the expectation of any one of these types of random variable you can easily look it up on Wikipedia.


In [None]:
import numpy as np

def bernoulli(p) :
  # Insert code for calculating and returning the expectation of a Bernoulli random variable here

def binomial(n, p) :
  # Insert code for calculating and returning the expectation of a binomial random variable here

def geometric(p) :
  # Insert code for calculating and returning the expectation of a geometric random variable here

def negative_binomial(r, p) :
  # Insert code for calculating and returning the expectation of a negative binomial random variable here

def uniform_continuous(a, b) :
  # Insert code for calculating and returning the expectation of a uniform continuous random variable here

def uniform_discrete(a,b) :
  # Insert code for calculating and returning the expectation of a uniform discrete random variable here

def exponential(lam) :
  # Insert code for calculating and returning the expectation of an exponential random variable here

def normal(mu, sigma) :
  # Insert code for calculating and returning the expectation of a Normal random variable here


print('The expectation for a Bernoulli random variable with p=0.5 is', bernoulli(0.5) )
print('The expectation for a binomial random variable with n=5, p=0.5 is', binomial(5,0.5) )
print('The expectation for a geometric random variable with p=0.5 is', geometric(0.5) )
print('The expectation for a negative binomial random variable with r=3 and p=0.5 is', negative_binomial(3,0.5) )
print('The expectation for a uniform continuous random variable with a=0 and b=1 is', uniform_continuous(0,1) )
print('The expectation for a uniform discrete random variable with a=1 and b=8 is', uniform_discrete(1,8) )
print('The expectation for an exponential random variable with lambda=2 is', exponential(2) )
print('The expectation for a normal random variable with mu=4 and sigma=2 is', normal(4,2) )


In [None]:
runtest(['test_bernoulli', 'test_binomial', 'test_geometric', 'test_negative_binomial', 'test_uniform_continuous', 'test_uniform_discrete', 'test_exponential', 'test_normal'])

# Calculating the sample mean

We can calculate the sample mean from a set of random variables using the expression below:

![](equation.png)

It is worth considering what happens if each of the Xi in the sum on the right-hand side of the expression is a random variable from one of the distributions that we learned about last week.  Obviously, as all the Xi are random, it
stands to reason that the quantity on the left-hand side is also a random variable.  We thus would like to know something about how the statistic on the left-hand side is distributed.

The easiest way to investigate this distribution is to sample random variables from this distribution multiple times.  __To complete this exercise I would thus like you to complete the function called `average` in `main.py`__.  This function takes an integer called `n` as input.  Within the function, you should generate `n` uniform random variables between 0 and 1.  You should then calculate the sample mean from these `n` variables using the expression above and then `return` this quantity from your function.

When your code is complete a graph will be generated.  The red points are all uniform random variables that lie between 0 and 1.  The black points, meanwhile, are all sample means computed from sets of 100 uniform random variables.  __Before moving on take a look at this graph and consider which of the two sets of points is more narrowly distributed.__
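
One way to put numbers on this comparison is to compute the spread of each set of points.  The sketch below does this with the standard deviation.  The seed, the variable names and the vectorised call to `rng.uniform` are assumptions made for illustration and are not part of the exercise code.

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 single uniform random variables (like the red points)
singles = rng.uniform(0, 1, size=100)

# 100 sample means, each computed from 100 uniform random variables
# (like the black points): one row per sample mean
means = rng.uniform(0, 1, size=(100, 100)).mean(axis=1)

print("spread of single variables:", singles.std())
print("spread of sample means:    ", means.std())
```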


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def average(n) :
  # Your code to compute the average for a set of n uniform random variables goes here.


# You should not need to adjust the code from here onwards
xv, yv1, yv2 = np.linspace(1,100,100), np.zeros(100), np.zeros(100)
for i in range(100) :
  yv1[i] = np.random.uniform(0,1)
  yv2[i] = average(100)

plt.plot( xv, yv1, 'ro' )
plt.plot( xv, yv2, 'ko' )

# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()


In [None]:
runtest(['test_variables'])

# How the sample mean changes with sample size

In this exercise, we want to look at how the value of the sample mean changes as we change the number of random variables that it is calculated from.  Once you have finished it, the code in `main.py` will draw a graph showing how the sample mean computed from a number of uniform random variables changes as you increase the number of random variables from which the mean is computed.  To complete the code you will need to:

1. Set the first element of the list called `indices` equal to 1, the second element of the list called `indices` to 2 and so on.
2. Set the first element of the list called  `average` equal to a sample mean calculated by generating 1 uniform random variable that lies between 0 and 1, the second element of the list `average` equal to a sample mean calculated by generating 2 uniform random variables that lie between 0 and 1, set the third element of the list called `average` equal to a sample mean calculated by generating 3 uniform random variables that lie between 0 and 1 and so on until you have computed an average by generating 200 uniform random variables.

Remember that the sample mean is defined as:

![](equation.png)

When the code is complete it should generate a graph of the sample mean versus the number of samples it is calculated from.  The red points on this graph are your various estimates of the sample mean.  The black, dashed horizontal line, meanwhile, indicates the value of the true expectation for this distribution.  You should see that the sample mean tends to get closer to this line as the number of samples it is computed from increases.
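
The same behaviour can be seen without a plot.  The sketch below uses a cumulative sum so that each mean reuses the earlier draws, which is one possible approach and not necessarily the one the exercise intends.  All the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

draws = rng.uniform(0, 1, size=200)
counts = np.arange(1, 201)                 # 1, 2, ..., 200

# k-th entry is the mean of the first k draws
running_mean = np.cumsum(draws) / counts

# Distance from the true expectation of 0.5 at a few sample sizes
for k in (1, 10, 200):
    print(k, abs(running_mean[k - 1] - 0.5))
```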


In [None]:
import matplotlib.pyplot as plt
import numpy as np

ssum, indices, average = 0, np.zeros(200), np.zeros(200)
for i in range(200) :
  # Add code to setup the numpy arrays called indices and average to generate the desired
  # plot here.



# This will plot the graph for the data.  You should not need to adjust this.
plt.plot( indices, average, 'ro' )
plt.plot( [0,200], [0.5,0.5], 'k--' )
plt.xlabel('Number of random variables')
plt.ylabel('Sample mean')

# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()


In [None]:
runtest(['test_variables_1'])

# Sampling the mean

You have learned that a sample mean computed by adding together n identical random variables has a value that is close to the value of the expectation of the random variable.  Furthermore, the larger the value of n the closer the sample mean tends to be to the true expectation.  Importantly, however, the sample mean is not equal to the expectation as the sample mean is a random variable.  When we quote means we therefore need to do what we learned in previous exercises.  We need to provide some information about the distribution that the mean was sampled from.  We cannot simply quote a single number as that number is random.  The randomness makes it impossible for colleagues to reproduce our results.  What they should be able to show, however, is that the distribution they are sampling in their experiment is the same as the distribution we are sampling in ours.

In this exercise I want to ensure that you understand what it means for us to generate a sample of sample means.  I thus want you to complete the following tasks:

1. Write a function called `sample_mean`.  This function should take in a single argument `n` and should return a sample mean that is computed by adding together `n` uniform continuous random variables that lie between 0 and 1.

2. Write a second function called `sample`.  This function should take in two arguments `m` and `n`.  It should return a NumPy array that contains `m` elements.  Each of these `m` elements should be equal to a sample mean that was computed by adding together `n` uniform continuous random variables that lie between 0 and 1.

N.B. Your function called `sample` should call your function called `sample_mean`.
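
The shape of the expected result can be checked with a vectorised sketch like the one below.  It does not follow the structure requested above (it has no `sample_mean` function), and the names `sample_sketch` and `rng` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_sketch(m, n):
    # Draw an m-by-n table of uniform random variables and average
    # each row, giving m sample means that each use n variables
    return rng.uniform(0, 1, size=(m, n)).mean(axis=1)

means = sample_sketch(5, 20)
print(means.shape)   # (5,)
print(means)
```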


In [None]:
import numpy as np


def sample_mean(n) :
  # Code for generating the sample mean goes here


def sample(m,n) :
  # Code for generating the sample goes here


In [None]:
runtest(['test_mean', 'test_sample'])

# Confidence limit on mean

As discussed previously any sample mean that we calculate is a random variable.  In order to make our result reproducible we must provide some information about the distribution whenever we quote a sample mean.  In previous exercises we have seen how we can provide this information on the distribution by proposing a confidence limit.  The process that we used to do this in those previous exercises involved:

1. Generating multiple random variables.

2. Using the `np.percentile` function to find a range that 90% of these random samples falls within.

We could thus make our result reproducible by noting that if a colleague was sampling the same distribution as us they should obtain a result that falls between the 5th and 95th percentile of our distribution of results with a probability of 90%.

To complete this exercise I want you to apply this idea for quoting confidence limits on a sample mean.  To complete the exercise you will need to complete the following functions in `main.py`:

1. `sample_mean` should take in a single number `n`.  This function should return a sample mean that is calculated by adding `n` uniform random variables that lie between 0 and 1 together.

2. `limit` should take in two numbers `n` and `m`.  This function should generate `m` sample means and store them in a NumPy array.  Each of these sample means should be calculated by adding together `n` uniform random variables that lie between 0 and 1.  This function should then return three numbers.  The first of these numbers `lower` should be the 5th percentile of the distribution of sample means that were generated.   The second of these numbers `median` should be the median of the sample means that were generated.  The final number `upper` should be the 95th percentile of the distribution of sample means.
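
The percentile step can be sketched as follows for a set of sample means that has already been generated.  The array `means` is produced here in a vectorised way purely for illustration, and every name in the block is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# 100 sample means, each computed from 10 uniform random variables
means = rng.uniform(0, 1, size=(100, 10)).mean(axis=1)

# np.percentile accepts a list of percentiles and returns one value each
lower, median, upper = np.percentile(means, [5, 50, 95])

print(lower, median, upper)
```

About 90% of the values in `means` lie between `lower` and `upper`.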


In [None]:
import numpy as np
import scipy.stats

def sample_mean(n) :
  # Your code for calculating the sample mean for n
  # uniform random variables between 0 and 1 goes here.

def limit(n,m) :
  # Your code to calculate the m sample means goes here.
  # Each of these sample means should be computed from
  # n uniform random variables between 0 and 1.


  # When completed this function should return
  # lower = the 5th percentile of the distribution for the sample mean
  # median = your estimate for the median
  # upper = the 95th percentile of the distribution for the sample mean
  return lower, median, upper

print( limit(10,100) )
print( limit(10,100) )
print( limit(10,100) )
print( limit(10,100) )


In [None]:
runtest(['test_mean_1', 'test_limit'])

# Variance of mean

In this exercise I want you to calculate the variance of the sample mean:

![](equation.png)

In the expression above each of the X_ij values is a uniform continuous random variable that lies between 0 and 1.  The sums over j are thus simply calculations of the sample mean.  The sample variance is thus calculated from n estimates of the sample mean.

I would like you to investigate how the value of the variance changes as the number of variables (i.e. m) that were used in the calculation of __the sample mean__ changes.  As in the previous exercise, you will need to complete the two functions:

1. `sample_mean` - takes a single integer `m` as input.  It should return a sample mean that is computed by generating `m` uniform random variables that lie between 0 and 1.
2. `variance` - takes two integers as input: `n`, the number of estimates of the sample mean to generate, and `m`, the number of uniform random variables used for each estimate.  This function should return an estimate of the variance for a sample mean computed from `m` uniform random variables that lie between 0 and 1.  This variance should be calculated by generating `n` estimates of the sample mean.

You then need to write the internals of the loop that I have started to set the variables `xvals` and `yvals`:

1. The first element of `xvals` should be set equal to 1, the second element should be set to two and so on.
2. The first element of `yvals` should be set equal to an estimate of the variance for a sample mean computed from one random variable, the second element should be set equal to an estimate of the sample variance for a sample mean computed from two random variables, the third element should be set equal to an estimate of the sample variance for a sample mean computed from three random variables.  This process should be continued until you have an estimate of the sample variance for a sample mean computed from 50 random variables.

All your estimates for the sample variance should be computed from 10 estimates of the sample mean (i.e. n=10).

You should plot a graph showing your values for the sample variance on the y-axis and the number of random variables that were added together to compute the mean on the x-axis.  The label
for the x-axis should be 'Number of variables used to calculate mean' and the y-axis label should be 'Variance'
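
For reference, the variance of a single uniform random variable between 0 and 1 is 1/12, so the variance of a sample mean computed from m independent variables of this kind is 1/(12m).  The sketch below compares estimates with this value.  It uses 10000 sample means per estimate, far more than the exercise asks for, so that the trend is easy to see, and all the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000   # number of sample means per variance estimate

estimates = {}
for m in (1, 10, 50):
    # n sample means, each computed from m uniform random variables
    means = rng.uniform(0, 1, size=(n, m)).mean(axis=1)
    estimates[m] = means.var(ddof=1)   # sample variance of the n means
    print(m, estimates[m], 1 / (12 * m))
```

With only 10 sample means per estimate, as in the exercise, the points scatter much more widely around the 1/(12m) curve.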


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def sample_mean(m) :
  # Your code to calculate a  sample mean from m
  # uniform random variables between 0 and 1 goes here


def variance(n,m) :
  # Your code to estimate the variance for a set of n
  # sample means, which are each computed from m
  # uniform random variables between 0 and 1 goes here.

xvals, yvals = np.zeros(50), np.zeros(50)
for i in range(50) :
  # Your code to set xvals and yvals as described in the text
  # above goes here


# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()



In [None]:
runtest(['test_variables_2'])

# Calculating the sample variance

If we do not know the exact value of the expectation for the distribution we should use the following estimator for the sample variance:

![](equation1.png)

which we can easily rearrange to:

![](equation2.png)

__Your task in this exercise is to write a function called `variance` that calculates an estimate of this quantity.__  This function should take in a single number `n`.  Within the function you should then generate `n` uniform
random variables that all lie between 0 and 1.  From these `n` random variables you should then calculate an estimate for the variance of the underlying distribution using the second of the two expressions above.
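
The sketch below checks, on a small made-up array, that two common forms of this estimator agree.  It assumes that the first expression is the sum of squared deviations from the sample mean divided by n-1, and that the second is n/(n-1) times the difference between the mean of the squares and the square of the mean.  `np.var` with `ddof=1` is used as an independent check, and the names are hypothetical.

```python
import numpy as np

# Hypothetical values, used only to compare the two forms
vals = np.array([0.1, 0.4, 0.7, 0.9])
n = len(vals)
mean = vals.sum() / n

# First form: squared deviations from the sample mean
direct = ((vals - mean) ** 2).sum() / (n - 1)

# Second form: built from the sum of squares and the mean
rearranged = (n / (n - 1)) * ((vals ** 2).sum() / n - mean ** 2)

print(direct, rearranged, np.var(vals, ddof=1))
```

The second form only needs the running totals of the values and of their squares, so it can be updated as new values arrive.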


In [None]:
import numpy as np

def variance(n) :
  # Your function to calculate the variance for a set of n uniform random variables goes here



In [None]:
runtest(['test_variables_3'])

# Converging the sample variance

Let's look again at the sample variance, which is computed using:

![](equation.png)

Let's now look at how this quantity depends on the number of random variables that it is computed from.  __To complete the exercise you need to generate a graph that shows how the estimate of the sample variance for a uniform random variable that lies between 0 and 1 depends on the number of random variables it is computed from.__   I have written some code to get you started with the exercise.  To complete the task you need to:

1. Set the first element of the array called `indices` equal to 2, the second element of the array called `indices` to 3 and so on.  (Notice that the sample variance is not defined for n=1: the n-1 in the denominator is 0 and the numerator is also 0.)

2. Set the first element of the array called `S2` equal to a sample variance computed using the above formula with n=2, the second element of this array to a sample variance computed using the above formula with n=3 and so on until you have computed the formula above with n=201.

3. Draw a graph that has the number of random variables that were used to calculate the variance on the x-axis and the estimate of the variance on the y-axis.  The x-axis label for this graph should be __Number of random variables__ and the y-axis label should be __Sample variance__.

When your code is complete a graph showing the value of the estimate of the sample variance as a function of n will be generated.
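
A vectorised sketch of this calculation is given below.  It uses cumulative sums in place of the running totals `ssum` and `ssum2`, and it assumes that the formula is equivalent to n/(n-1) times the difference between the mean of the squares and the square of the mean.  The seed and the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

draws = rng.uniform(0, 1, size=201)
counts = np.arange(1, 202)                   # 1, 2, ..., 201

mean = np.cumsum(draws) / counts             # running mean
mean_sq = np.cumsum(draws ** 2) / counts     # running mean of squares

# Sample variance for n = 2, ..., 201 (n = 1 is skipped)
n = counts[1:]
S2_sketch = (n / (n - 1)) * (mean_sq[1:] - mean[1:] ** 2)

print(S2_sketch[0], S2_sketch[-1], 1 / 12)
```

The last number printed, 1/12, is the true variance of a uniform random variable between 0 and 1, which the estimates should approach as n grows.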


In [None]:
import matplotlib.pyplot as plt
import numpy as np

myvar = np.random.uniform(0,1)
ssum, ssum2 = myvar, myvar*myvar
indices, S2 = np.zeros(200), np.zeros(200)
for i in range(200) :
  # Add code to set up the NumPy arrays called indices and S2 to generate the desired plot here.


# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()

In [None]:
runtest(['test_variables_4'])

# Using the normal distribution for error bars

It is reasonable to assume that any sample mean we calculate is a sample from a normal distribution.  This assumption is useful because it makes it straightforward to calculate confidence limits.  If the mean is a sample from a normal distribution we no longer need to do any resampling as we know the distribution we are sampling from.  In this exercise, we are going to see how easy this realisation makes the process of calculating error bars.

The fact that sample means are samples from a normal distribution is a consequence of the central limit theorem.  This theorem tells us the cumulative probability distribution function for the sample mean.  __The cumulative probability distribution function for the sample mean is that of a normal distribution with an expectation equal to the sample mean and a variance equal to the sample variance divided by the number of random variables from which the sample mean was computed.__  To compute error bars around a sample mean we simply need to compute the sample variance and then use the inverse of the cumulative probability distribution for a normal random variable to get the percentiles.  There is no longer any need to resample.

The exercise below will hopefully help you to understand how you can compute an error bar by using the central limit theorem.  To complete this code you will need to write a function called `mean_with_errors` that takes in a parameter called `n`.  This function should return a sample mean computed from `n` uniform random variables that lie between 0 and 1.  This quantity should be returned in the variable called `mean`.  In addition, you should also compute the 5th and 95th percentiles for the distribution that mean was sampled from.  These two quantities should be returned as `lower` and `upper`.  When calculating `lower` and `upper` you should assume that the sample mean is a sample from a normal distribution with suitable parameters.

Within the function called `mean_with_errors` you will need to compute the sample mean and the sample variance for your sample of `n` uniform random variables.  You should then be able to calculate `lower` and `upper` using your computed values for the sample mean and sample variance and the following Python function:

```python
ppp = scipy.stats.norm.ppf(0.95)
```

The call above computes the 95th percentile for a standard normal random variable, i.e. a normal random variable with expectation 0 and variance 1.
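
The sketch below shows how this call can be turned into error bars for a made-up set of measurements.  The array `vals` is hypothetical, and the code assumes that the sample mean is normally distributed with a variance equal to the sample variance divided by `n`, as described above.

```python
import numpy as np
import scipy.stats

# Hypothetical measurements (not generated as in the exercise)
vals = np.array([0.12, 0.85, 0.43, 0.67, 0.31, 0.94, 0.58, 0.26])
n = len(vals)

mean = vals.mean()
var = vals.var(ddof=1)                 # sample variance
std_err = np.sqrt(var / n)             # standard deviation of the mean

z = scipy.stats.norm.ppf(0.95)         # 95th percentile of N(0, 1)
lower = mean - z * std_err             # 5th percentile, by symmetry
upper = mean + z * std_err             # 95th percentile

print(lower, mean, upper)
```

Equivalently, `scipy.stats.norm.ppf(0.95, loc=mean, scale=std_err)` returns `upper` directly.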


In [None]:
import numpy as np
import scipy.stats

def mean_with_errors(n) :
  # Your code to calculate the sample mean and sample variance
  # for a set of n uniform random variables between 0 and 1 goes
  # here.


  # When complete this function should return
  # lower = the 5th percentile of the distribution that was sampled
  # mean = your estimate for the sample mean
  # upper = the 95th percentile of the distribution that was sampled
  # N.B. To compute lower and upper you should be using the central
  # limit theorem as discussed in the explanatory text.
  return lower, mean, upper

print( mean_with_errors(100) )
print( mean_with_errors(100) )
print( mean_with_errors(100) )
print( mean_with_errors(100) )


In [None]:
runtest(['test_function'])