# Run this cell first

In [None]:
# this code enables the automated feedback. If you remove this, you won't get any feedback
# so don't delete this cell!
try:
  import AutoFeedback
except (ModuleNotFoundError, ImportError):
  !pip install git+https://github.com/abrown41/AutoFeedback@notebook
  import AutoFeedback

try:
  from testsrc import test_main
except (ModuleNotFoundError, ImportError):
  !pip install "git+https://github.com/autofeedback-exercises/exercises.git@testpip#subdirectory=New-SOR3012/Histograms"
  from testsrc import test_main

def runtest(tlist):
  import unittest
  from contextlib import redirect_stderr
  from os import devnull
  with redirect_stderr(open(devnull, 'w')):
    suite = unittest.TestSuite()
    for tname in tlist:
      suite.addTest(eval(f"test_main.UnitTests.{tname}"))
    runner = unittest.TextTestRunner()
    try:
      runner.run(suite)
    except AssertionError:
      pass


# Plotting the probability mass function

In the previous exercise you learned how to estimate the probability mass function for a random variable by repeatedly sampling that random variable and using those samples to construct a histogram.  In the instructions for that exercise I stated that the heights of the bars in the histogram were all random variables.  In fact, the whole point of the previous exercise was to plot a confidence limit around the estimate of the distribution that is obtained by sampling.

When we first encountered the idea of providing confidence limits around estimates of quantities that were found by taking a sample mean, we also discussed the expectation.  When we calculated a sample mean we noted that we were estimating the expectation and that, for many of the random variables we have been investigating in these exercises, this expectation can be calculated exactly.  The natural question to ask is therefore: is there some exact set of quantities that can be compared to the histograms we learned to estimate in the last exercise?  In other words, we know that the sample mean converges to the expectation.  Does the histogram similarly converge and, if it does, what does it converge to?

You should not be surprised to learn that the histogram does indeed converge.  After all, I told you in the very first sentence of these instructions that when we calculate a histogram we are __estimating the probability mass function__ for the random variable.  Consequently, just as we check whether the true expectation lies within our confidence limit, we can also check whether the true probability mass function lies within the confidence limit for our histogram.
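To make the convergence concrete, the sketch below draws increasingly large samples from a fair six-sided dice and records the largest gap between the estimated and the exact probability mass function.  It uses NumPy's built-in `Generator.integers` sampler and a fixed seed; both are choices made for this illustration only and are not part of the exercise.

```python
import numpy as np

# Assumption: a seeded generator is used only so that the illustration is repeatable
rng = np.random.default_rng(42)
exact_pmf = np.full(6, 1 / 6)   # exact probability mass function of a fair dice

max_gaps = {}
for nsamples in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=nsamples)        # integers from 1 to 6 inclusive
    counts = np.bincount(rolls, minlength=7)[1:]     # drop the unused bin for the value 0
    estimate = counts / nsamples                     # normalised histogram
    max_gaps[nsamples] = np.max(np.abs(estimate - exact_pmf))
    print(nsamples, max_gaps[nsamples])
```

With more samples the largest gap should typically shrink, roughly in proportion to one over the square root of the number of samples.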

__Your task in this exercise is thus to plot two bar charts.__  One of these bar charts should show the exact probability mass function for a discrete uniform random variable that can take any integer value that is greater than or equal to 1 and less than or equal to 6.  The second should show the probability mass function for a binomial random variable with n=6 and p=0.5.  Notice that you can calculate the exact probability mass function for a binomial random variable by using scipy as follows:

```python
import scipy.stats

# Calculate P(X=x) where X is a binomial random variable with parameters n and p.
# The values given to x, n and p here are only examples.
x, n, p = 3, 6, 0.5
prob = scipy.stats.binom.pmf( x, n, p )
```

See the [`scipy.stats.binom` documentation](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom.html) for more details.
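`scipy.stats.binom.pmf` also accepts an array of values for `x`, so a whole probability mass function can be evaluated in one call.  The sketch below uses illustrative parameters that differ from the ones in the task (the names `n_demo`, `p_demo`, `x_demo` and `pmf_demo` are made up for this example) and checks that the probabilities sum to one.

```python
import numpy as np
import scipy.stats

# Illustrative parameters only; these are not the ones asked for in the task
n_demo, p_demo = 4, 0.25

# Every value the random variable can take: 0, 1, ..., n_demo
x_demo = np.arange(n_demo + 1)

# One call returns P(X=x) for each entry of x_demo
pmf_demo = scipy.stats.binom.pmf(x_demo, n_demo, p_demo)
print(pmf_demo)
print(pmf_demo.sum())
```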

Notice that I have written stub code in the cell below to plot the two probability mass functions side by side.  In your final figure the probability mass function for the uniform discrete random variable will appear in blue while the probability mass function for the binomial random variable will appear in red.  Examine the code for generating side by side bar charts carefully as you will need to produce side by side plots like these for the exercises in future weeks.

Functions exist for computing the probability mass/density functions for other types of random variables.  You can find information here:

* [Geometric](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.geom.html)
* [Negative binomial](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.nbinom.html)
* [Exponential](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html)
* [Normal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html)
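As a rough guide to how these functions are called, the sketch below evaluates each of the four distributions listed above on a few points.  Discrete distributions expose `pmf` while continuous ones expose `pdf`.  All parameter values and variable names here are arbitrary choices for illustration.

```python
import numpy as np
import scipy.stats

# Discrete random variables: use pmf
# Geometric: trial number of the first success, success probability 0.5
geom_pmf = scipy.stats.geom.pmf(np.arange(1, 6), 0.5)
# Negative binomial: number of failures before the 3rd success
nbinom_pmf = scipy.stats.nbinom.pmf(np.arange(0, 5), 3, 0.5)

# Continuous random variables: use pdf
xgrid = np.linspace(0, 3, 4)
# Exponential with rate 1 (scipy uses scale = 1/rate)
expon_pdf = scipy.stats.expon.pdf(xgrid, scale=1.0)
# Standard normal: mean 0 and standard deviation 1
norm_pdf = scipy.stats.norm.pdf(xgrid, loc=0.0, scale=1.0)

print(geom_pmf)
print(nbinom_pmf)
print(expon_pdf)
print(norm_pdf)
```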


In [None]:
import matplotlib.pyplot as plt
import scipy.stats
import numpy as np

# This sets the x-coordinates for your bar charts
xvals = np.linspace(0,6,7)

# Your code for setting the values in uniform_pmf and binom_pmf goes here




# This is the part for plotting the probability mass functions
# side by side.  Notice that the x-coordinates
# define the position of the centers of the bars.  You
# thus get the center of the two side by side bars to appear at
# the coordinates in xvals by shifting one set of bars left
# by half the width of the bar and the other set of bars
# right by half the width of the bar.
plt.bar( xvals-0.05, uniform_pmf, width=0.1, color='blue' )
plt.bar( xvals+0.05, binom_pmf, width=0.1, color='red' )
plt.xlabel('x')
plt.ylabel('P(X=x)')


# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()

In [None]:
runtest(['test_plot'])

# Counting successes and failures

We will start this exercise by doing something that by now should be very familiar.  We are going to write a function that generates a series of Bernoulli random variables.  Now, however, I want you to calculate the number of failures in these n trials as well as the number of successes.

To complete the exercise you will need to do the following:

1. You will need to write a function called `bernoulli` that takes a parameter `p` (the probability of success) and that returns a Bernoulli random variable.
2. You will need to modify the function called `repeated_trials`.  This function takes two parameters `n` (the number of trials to perform) and `p` (the probability of success in each trial).  It should return two numbers `nsuccess` and `nfail`, which will give the number of successes and the number of failures respectively.  Within this function you will need to write the code to generate the `n` Bernoulli variables required and to compute `nsuccess` and `nfail`.
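Once your own functions run, one possible sanity check is to compare their output with NumPy's built-in sampler.  The sketch below is not the hand-written `bernoulli` function that the exercise asks for: it assumes that `Generator.binomial` with a single trial is an acceptable stand-in for a Bernoulli generator, and the `_check` names are introduced only for this illustration.

```python
import numpy as np

# Assumption: seeded generator so that the check is repeatable
rng = np.random.default_rng(1)
n_check, p_check = 10, 0.2

# A binomial random variable with one trial is a Bernoulli random variable,
# so this is an array of n_check zeros (failures) and ones (successes)
trials_check = rng.binomial(1, p_check, size=n_check)

nsuccess_check = int(trials_check.sum())
nfail_check = n_check - nsuccess_check
print(nsuccess_check, nfail_check)
```

Whatever the random outcome, the two counts must add up to the number of trials, which is a useful property to check for your own `repeated_trials` function as well.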


In [None]:
import numpy as np

def bernoulli(p) :
  # Your code to generate a bernoulli random variable goes here

def repeated_trials(n,p) :
  nsuccess, nfail = 0, 0
  # Your code to generate n bernoulli trials and to count the number of
  # successes and failures goes here.

  return nsuccess, nfail

print( repeated_trials(10,0.2) )
print( repeated_trials(10,0.2) )
print( repeated_trials(10,0.2) )
print( repeated_trials(10,0.2) )


In [None]:
runtest(['test_bernoulli', 'test_trials'])

# Plotting numbers of successes and failures

In this exercise, we are going to tackle the same problem as we did in the previous one but we are going to modify the way that you wrote the code.  The objective is to write the code to complete this task in a way that can be extended so that we can deal with more complex random variables that have more than two possible outcomes.  Furthermore, I would also like to show you how we can create a visual representation for the fraction of successes and the fraction of failures.

To complete the exercise you must do the following:

1. You will need to write a function called `bernoulli` that takes a parameter `p` (the probability of success) and that returns a Bernoulli random variable.
2. You will need to modify the loop that comes immediately after the function called `bernoulli` so that within this loop 100 Bernoulli random variables with parameter `prob` are generated.  In addition, you should, within this loop, count the number of times these trials were failures in element 0 of the array called `counts` and the number of times these trials were successful in element 1 of the array called `counts`.
3. You need to modify the final loop in the code below so that the first element of the array called `counts` is equal to the fraction of failures and the second element of the array called `counts` is equal to the fraction of successes.

Notice that when you are counting the number of successes and failures in the elements of the list called counts you do not need to use an if statement.  Instead, you can do the following:

```python
myvar = bernoulli(p)
counts[int(myvar)] = counts[int(myvar)] + 1
```

This works because if the trial was unsuccessful the function `bernoulli` returns a 0.  Consequently, the code above will modify element 0 of the list.  By contrast, if the trial is successful `bernoulli` returns a 1 and the above code will modify element 1 of the list.
If you complete the exercise correctly an estimate of the probability mass function will be generated in the file `bernoulli_histogram.png`.

N.B. The `int` command converts the real number that is output by `bernoulli` into an integer so that it can be used to refer to a particular element of the list.
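The indexing trick can be seen in isolation with a fixed, made-up list of outcomes in place of calls to `bernoulli`.  The names `outcomes`, `counts_demo` and `fractions_demo` are hypothetical and used only in this sketch.

```python
import numpy as np

# Made-up outcomes (not random draws): 0.0 is a failure, 1.0 is a success
outcomes = [0.0, 1.0, 1.0, 0.0, 1.0]

counts_demo = np.zeros(2)
for myvar in outcomes:
    # int(myvar) is 0 or 1, so it selects the element to increment
    counts_demo[int(myvar)] = counts_demo[int(myvar)] + 1
print(counts_demo)      # two failures and three successes

# Dividing by the number of trials turns counts into fractions that sum to one
fractions_demo = counts_demo / len(outcomes)
print(fractions_demo)   # fractions 0.4 and 0.6
```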


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def bernoulli(p) :
  # Your code to generate a bernoulli random variable goes here


prob, counts = 0.3, np.zeros(2)
# Your code to generate n bernoulli trials and to count the number of
# successes and failures goes here.
for i in range(100) :

# Your code to ensure that the sum of the two heights is equal to one
# and that the bar chart plotted is thus an estimate for the probability
# mass function goes here.
for i in range(2) :


# This will draw a bar chart showing the fraction of successes and
# the fraction of failures.
plt.bar( [0,1], counts, width=0.1 )
plt.xlabel('Outcome')
plt.ylabel('Probability')

# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()



In [None]:
runtest(['test_plot_1'])

# Estimating the histogram for a binomial random variable

We are now going to combine this business of computing a histogram with what you know about generating different types of random variables.  In this exercise, for instance, I would like you to generate a histogram for a binomial random variable.  In order to do this you will need to:

1. Write a function called `binomial` that takes parameters `n` (the number of trials to perform) and `p` (the probability of success in each of these trials) and that returns a binomial random variable.
2. Write a loop that generates multiple binomial random variables with parameters `nparam` and `prob` using the function called `binomial`.  You should use the list called `counts` to count how often each of the various outcomes in the sample space for this type of random variable appears just as you did for the random variable in the previous exercise.  In addition, you will also notice that you need to set the variable `noutcomes` equal to the number of possible values that the random variable can take.
3. You need to write a loop that converts each of the quantities in the list called `counts` from the number of times that the random variable was equal to a particular value into the fraction of times that the random variable took this particular value.  In addition, you need to set the elements of the list called `sample_space` equal to the values that you would like to be plotted on the x-axis of your histogram.
4. You need to plot the estimate for the probability mass function.  In your graph you should use __Random variable value__ as the x-axis label and __Fraction of occurances__ as the y-axis label.

The final result should resemble the probability mass function for a binomial random variable with parameters `nparam` and `prob`.
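If you would like something to compare your estimate against, the exact probability mass function can be evaluated with `scipy.stats.binom.pmf`.  The sketch below is only a reference calculation, not the sampling code that the exercise asks for, and it assumes the parameter values 8 and 0.3 that appear in the stub code.  The `_ref` names are made up for this example.

```python
import numpy as np
import scipy.stats

# Assumption: same parameter values as nparam and prob in the stub code
nparam_ref, prob_ref = 8, 0.3

# Values the random variable can take, and the exact probability of each one
values_ref = np.arange(nparam_ref + 1)
exact_ref = scipy.stats.binom.pmf(values_ref, nparam_ref, prob_ref)

for value, probability in zip(values_ref, exact_ref):
    print(value, round(probability, 4))
```

With only a finite number of samples your bars will not match these numbers exactly, but the overall shape should be similar.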


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# You may wish to write some code here

def binomial(n, p) :
  # Your code to generate a binomial random variable goes here

# This variable is the number of random variables we are going to generate
nsamples=200
nparam, prob = 8, 0.3
noutcomes =
counts = np.zeros(noutcomes)
for i in range(nsamples) :
  # Your code to generate multiple binomial variables using the function
  # called binomial above and to count how often each outcome comes
  # up goes here.


sample_space = np.zeros(noutcomes)
for i in range(noutcomes) :
  # Your function to convert the count of the number of times each
  # value for the random variable comes up to the fraction of times
  # each outcome comes up goes here.  You should also set the elements
  # of the list sample_space to the various values in the sample space
  # for this particular random variable so that the plot appears correctly.



In [None]:
runtest(['test_plot_2'])

# Histograms and Percentiles

Notice that the heights of the bars in a histogram that we generate by sampling random variables are all averages.  In other words, the heights of the bars in a histogram are all random variables.  We should thus quote error bars on these bars to make our results reproducible.  In this exercise I am going to show you how to do this by resampling.

I have done a lot of the work for you here, but you still have to do a few things, namely:

1. You have to write a function called `dice_roll` that returns the (random) outcome of a roll of a fair six-sided dice.  Remember that when we roll a fair, six-sided dice we are generating a uniform discrete random variable that can take values of 1, 2, 3, 4, 5 or 6.
2. You need to write a function called `histo_estimate` that takes a parameter called `n`.  Within this function, you should compute a histogram by taking `n` samples using your `dice_roll` function.  The fraction of times you get each of the six possible outcomes on rolling the dice should be stored in the array called `histo`, which will be returned from your function.
3. You need to work out how to set the elements of the array `upper`.  The elements of this array should be set equal to the difference between the 95th percentile of the distribution of histogram estimates and the median for the distribution of histogram estimates.  N.B. This number should be positive.

When you have written this code and run it, a graph showing the histogram with suitable error bars on each of the bars is produced.  Look at the code that I have written in the cell below and try to understand how it works.  We are only using ideas about percentiles that you have learned about in this course.  It is a little more complicated, however, as we have to use two dimensional NumPy arrays because we are estimating multiple random variables simultaneously.

Please note that the code checks the values of the error bars that are stored in the arrays called `lower` and `upper`.  You must therefore have these arrays defined in your code in order to pass the test.
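The two dimensional bookkeeping may be easier to follow on a small made-up array.  In the sketch below each row plays the role of one resample and each column the role of one histogram bar.  Passing `axis=0` to `np.percentile` computes the percentiles down the rows, separately for every column.  The data and the names `samples_demo`, `p5`, `p50` and `p95` are invented for this illustration.

```python
import numpy as np

# Made-up data: 101 rows ("resamples") and 2 columns ("bars").
# Column 0 holds 0, 1, ..., 100 and column 1 holds 0, 2, ..., 200.
samples_demo = np.column_stack([np.arange(101.0), 2 * np.arange(101.0)])

# Requesting three percentiles at once returns an array with three rows,
# one row per percentile and one entry per column of samples_demo
p5, p50, p95 = np.percentile(samples_demo, [5, 50, 95], axis=0)

print(p5)    # 5th percentile of each column: 5 and 10
print(p50)   # median of each column: 50 and 100
print(p95)   # 95th percentile of each column: 95 and 190
```

The stub code does the same thing one column at a time by slicing with `histo_samples[:,i]`.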


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def dice_roll() :
  # Insert code so that this function returns the outcome of a roll of a fair six-sided dice here.

def histo_estimate(n) :
  histo = np.zeros(6)
  # Insert code to compute a histogram if you roll the dice n times here.

  return histo

# This tells us that 50 (nsamples) random variables should be used in the generation of each histogram
# This procedure of generating 50 random variables and calculating the histogram should then
# be repeated 500 (nresamples) times.
nsamples, nresamples = 50, 500
# This loop resamples your histogram
histo_samples = np.zeros([nresamples,6])
for i in range(nresamples) : histo_samples[i] = histo_estimate(nsamples)

# This computes percentiles from your histogram
lower, upper, median = np.zeros(6), np.zeros(6), np.zeros(6)
for i in range(6) :
  # We find the median
  median[i] = np.median( histo_samples[:,i] )
  # Generally we quote the error by saying that the value is between
  # median - lower and median + upper.  When we compute percentiles we are
  # getting values for median - lower and median + upper so we have to
  # do some sums to get the values of lower and upper that we want.
  lower[i] = median[i] - np.percentile( histo_samples[:,i], 5 )
  upper[i] =

plt.bar( [1,2,3,4,5,6], median, width=0.1 )
# This plots the small bar around each of the values.
plt.errorbar( [1,2,3,4,5,6], median, yerr=[lower,upper], fmt='ko' )

# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()

In [None]:
runtest(['test_dice', 'test_histo', 'test_lower', 'test_upper', 'test_plot_3'])

# Error bars for histograms

We can also use the central limit theorem to calculate error bars for histograms.  We can thus avoid the resampling that I showed you previously.  In this exercise, we are going to write a program that uses the central limit theorem to calculate error bars that can be shown on a histogram.  Our histogram is going to be an estimate for the probability mass function for a binomial random variable, so to complete the exercise you are going to need to:

1. Write a function called `binomial` that takes in two parameters `n` (the number of trials) and `p` (the probability of success in each trial).  This function should return a binomial random variable from a distribution with parameters `n` and `p`.
2. Write a loop that calls the `binomial` function `nsamples` times with `n=5` and `p=0.5` and that accumulates a histogram.  You should use the list called `histo` to accumulate how often each of the six possible outcomes appears in your sample of `nsamples` binomial random variables.
3. Normalise the histogram in `histo` so that you obtain an estimate for the probability mass function.

Once you have completed steps 1-3 above you are in a position where you can calculate the error bars.  To compute the error bars notice that the height of each bar in your normalised histogram is a sample mean computed from `nsamples` Bernoulli random variables.  In other words, the height of each bar in the histogram is an estimate of the parameter of a Bernoulli random variable, `p`.  We can thus approximate the sample variance of this Bernoulli random variable using:

$$\sigma^2 \approx q(1-q)$$

where q is simply the height of the histogram bar that you have stored in the list called `histo`.  You should thus be able to calculate the confidence limits on each of your histogram bars by using the estimate of the variance that you obtain from the above formula together with what you know from the central limit theorem.  I would like you to plot an error bar that represents the 90% confidence limit on your estimate of the height of the bar.  To do this you will need to set the elements of the list called `error` equal to the width of an error bar that represents a 90% confidence limit.  Notice that the width of an error bar for a 90% confidence limit is equal to the 95th percentile of the distribution minus the mean of the distribution.
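As a worked example for a single bar, the sketch below assumes that the Bernoulli variance is estimated as q(1-q) and that, by the central limit theorem, the bar height is approximately normal with variance q(1-q) divided by the number of samples.  The bar height of 0.2 and the sample size of 100 are made-up numbers, and the `_demo` names are not part of the exercise.

```python
import numpy as np
import scipy.stats

# Made-up bar height and number of samples, for illustration only
q_demo, nsamples_demo = 0.2, 100

# Estimated variance of a single Bernoulli random variable with parameter q_demo
variance_demo = q_demo * (1 - q_demo)

# 95th percentile of a standard normal random variable (mean 0, variance 1)
z95 = scipy.stats.norm.ppf(0.95)

# The sample mean has variance variance_demo / nsamples_demo, so the 95th
# percentile of its distribution lies this far above the mean
error_demo = z95 * np.sqrt(variance_demo / nsamples_demo)
print(z95, error_demo)
```

In the exercise the same calculation has to be repeated for each of the six bars, using the height of that bar in place of `q_demo`.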


In [None]:
import matplotlib.pyplot as  plt
import numpy as np
import scipy.stats

# You may want to add some code here

def binomial(n,p) :
  # Your code to generate the binomial random variables goes here.

nsamples = 500
histo = np.zeros(6)
# Insert code to compute a histogram by generating nsamples binomial random variables with
# n=5 and p=0.5 here.

# Don't forget to normalise your histogram.


# Include the code to compute the error bars at the 90% confidence limit here.  The list
# called error should contain the difference between the 95th percentile for the distribution of the
# mean and the mean
error = np.zeros(6)


# This will plot the histogram and the error bars
plt.bar( [0,1,2,3,4,5], histo, width=0.1 )
# This plots the small bar around each of the values.
plt.errorbar( [0,1,2,3,4,5], histo, yerr=error, fmt='ko' )
plt.xlabel('Outcome')
plt.ylabel('Fraction of occurances')


# This code is required for the autofeedback- don't delete it!
fighand = plt.gca()


In [None]:
runtest(['test_binom', 'test_error', 'test_plot_4'])