In [1]:
## Import required Python modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy, scipy.stats
import io
import base64
#from IPython.core.display import display
from IPython.display import display, HTML, Image
from urllib.request import urlopen

try:
    import astropy as apy
    import astropy.table
    _apy = True
    #print('Loaded astropy')
except ImportError:
    _apy = False
    #print('Could not load astropy')

## Customising the font size of figures
plt.rcParams.update({'font.size': 14})

## Customising the look of the notebook
display(HTML("<style>.container { width:95% !important; }</style>"))
## This custom file is adapted from https://github.com/lmarti/jupyter_custom/blob/master/custom.include
HTML('custom.css')
#HTML(urlopen('https://raw.githubusercontent.com/bretonr/intro_data_science/master/custom.css').read().decode('utf-8'))

In [2]:
## Custom imports
from scipy.stats import binom, poisson, chi2, norm, uniform
from scipy.optimize import curve_fit
from math import ceil, pi
from numpy import exp
from matplotlib.collections import PatchCollection
from matplotlib.patches import Circle, Rectangle
from matplotlib.colors import makeMappingArray

In [3]:
## Adding a button to hide the Python source code
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the Python code."></form>''')

<div class="container-fluid">
    <div class="row">
        <div class="col-md-8" align="center">
            <h1>PHYS 10791: Introduction to Data Science</h1>
            <!--<h3>2019-2020 Academic Year</h3><br>-->
        </div>
        <div class="col-md-3">
            <img align='center' style="border-width:0" src="images/UoM_logo.png"/>
        </div>
    </div>
</div>

<div class="container-fluid">
    <div class="row">
        <div class="col-md-2" align="right">
            <b>Course instructors:&nbsp;&nbsp;</b>
        </div>
        <div class="col-md-9" align="left">
            <a href="http://www.renebreton.org">Prof. Rene Breton</a> - Twitter <a href="https://twitter.com/BretonRene">@BretonRene</a><br>
            <a href="http://www.hep.manchester.ac.uk/u/gersabec">Dr. Marco Gersabeck</a> - Twitter <a href="https://twitter.com/MarcoGersabeck">@MarcoGersabeck</a>
        </div>
    </div>
</div>

# Chapter 8 - Problem Sheet

### Problem 1: Type I and II errors

Identify which statements are correct.

- Type I error is the rate of acceptance of the hypothesis in a hypothesis test.
- Type I error is the rate of rejection of the hypothesis in a hypothesis test.
- Type I error is the rate of acceptance of the alternative hypothesis in a hypothesis test.
- Type I error is the rate of rejection of the alternative hypothesis in a hypothesis test.
- Type II error is the rate of acceptance of the hypothesis in a hypothesis test.
- Type II error is the rate of rejection of the hypothesis in a hypothesis test.
- Type II error is the rate of acceptance of the alternative hypothesis in a hypothesis test.
- Type II error is the rate of rejection of the alternative hypothesis in a hypothesis test.


### Solution to Problem 1

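The correct statements are
- Type I error is the rate of rejection of the hypothesis in a hypothesis test, i.e. the rate at which the hypothesis is rejected although it is true. This rate is the significance of the test.
- Type II error is the rate of acceptance of the alternative hypothesis in a hypothesis test, i.e. the rate at which events that follow the alternative hypothesis fall into the acceptance region, so that the hypothesis is accepted although it is false. One minus this rate is the power of the test.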

### Problem 2: The choice of significance and power

#### Problem 2.1

Describe in your own words what needs to be considered when choosing the acceptance point of a hypothesis test, which defines its significance and power.

#### Problem 2.2

In a medical diagnostic test that aims to identify a disease the quantities discussed are often: 
- the sensitivity, i.e. the rate at which true positives are not overlooked, and
- the specificity, i.e. the rate of candidates without a disease that are correctly identified as healthy.

Relate these to Type I and Type II errors and to significance and power.

#### Problem 2.3

A medical diagnostic test has a rate of Type I errors of $20\%$ and a rate of Type II errors of $0.1\%$. The test is carried out on 100,000 candidates. It is expected that 1 in 1,000 people carry the disease. Based on these numbers calculate
- the expected number of infected candidates,
- the expected number of candidates returning a positive test,
- the number of infected candidates not identified as such, and
- the fraction of positive tests that were returned by healthy candidates.

Based on the last number, discuss the usefulness of this test and what could be done to address this.

### Solution to Problem 2

#### Solution 2.1
The acceptance point of a hypothesis test should minimise the significance and maximise the power, i.e. minimise the integral of the probability distribution of the hypothesis over the rejection region and maximise the integral of the probability distribution of the alternative hypothesis over the same region. These two aims compete, since moving the acceptance point to reduce one type of error increases the other, so the choice is a compromise.

Which of the two is more important depends on the expected rate of the two cases. For example, if it is expected that the hypothesis is false in the vast majority of cases, then it is crucial to maximise the power, i.e. to minimise the rate of false positives (Type II errors), as otherwise the total number of false positives would be greater than the number of true positives.
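
The trade-off can be made concrete with a toy model. The sketch below is only an illustration under assumed inputs that are not part of the problem: the test statistic follows a unit Gaussian centred at 0 under the hypothesis and at 3 under the alternative hypothesis, and the hypothesis is accepted for values below a cut `x_cut`. The function name `significance_and_power` is made up for this example.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: the test statistic x follows N(0, 1) under the hypothesis
# and N(3, 1) under the alternative; the hypothesis is accepted if x < x_cut.
hyp = norm(loc=0.0, scale=1.0)
alt = norm(loc=3.0, scale=1.0)

def significance_and_power(x_cut):
    """Return (significance, power) for the acceptance region x < x_cut."""
    alpha = hyp.sf(x_cut)   # Type I rate: hypothesis true, x in the rejection region
    beta = alt.cdf(x_cut)   # Type II rate: alternative true, x in the acceptance region
    return alpha, 1.0 - beta

# Scan a few acceptance points to see how the two quantities move together
for x_cut in np.linspace(0.5, 2.5, 5):
    alpha, power = significance_and_power(x_cut)
    print('x_cut = {:.1f}  significance = {:.4f}  power = {:.4f}'.format(x_cut, alpha, power))
```

In this model, moving the cut to larger values lowers the significance but also lowers the power, so neither can be optimised without regard to the other.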

#### Solution 2.2

The sensitivity is the acceptance rate in the case that the hypothesis is true, which is one minus the significance (rate of Type I errors).

The specificity is the rejection rate in the case that the hypothesis is false, which is the power of the test or one minus the rate of Type II errors.
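
As a rough illustration of these relations, the sketch below computes all four quantities from the outcome counts of a test. It assumes the convention used in this solution, namely that the hypothesis is "the candidate has the disease" and that a positive test means accepting it. The function name `diagnostic_rates` and the counts are hypothetical and chosen only for the example.

```python
def diagnostic_rates(true_pos, false_neg, true_neg, false_pos):
    """Return sensitivity, specificity and the Type I / Type II error rates.

    Assumed convention: the hypothesis is 'the candidate has the disease'
    and accepting it corresponds to a positive test.
    """
    sensitivity = true_pos / (true_pos + false_neg)   # infected and test positive
    specificity = true_neg / (true_neg + false_pos)   # healthy and test negative
    type_1_rate = 1.0 - sensitivity   # significance: infected but test negative
    type_2_rate = 1.0 - specificity   # one minus power: healthy but test positive
    return sensitivity, specificity, type_1_rate, type_2_rate

# Hypothetical counts, chosen only for illustration
sens, spec, alpha, beta = diagnostic_rates(true_pos=45, false_neg=5,
                                           true_neg=900, false_pos=50)
print('sensitivity = {:.3f}, specificity = {:.3f}'.format(sens, spec))
print('Type I rate = {:.3f}, Type II rate = {:.3f}'.format(alpha, beta))
```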

#### Solution 2.3
With a Type I error rate of $20\%$ and a Type II error rate of $0.1\%$:
- Expected number of infected candidates: $1/1000 \times 100000 = 100$
- Expected number of positive tests: $100\times(1-0.2)+(100000-100)\times 0.1\% \approx 80 + 100 = 180$
- Infected candidates not identified as such: $100\times0.2 = 20$
- Fraction of positive tests returned by healthy candidates: $100/180=56\%$

How useful the test is depends very much on the consequences of a positive result. If a positive test led to an invasive treatment, then this treatment should clearly not end up being applied mostly to healthy individuals. If, on the other hand, the test is mainly used to assess the prevalence of the disease in a population, where the result can even be corrected for the known rate of false positives, the test would be fine.

If the test was carried out a second time on those who tested positive, the picture changes dramatically:
- of the 80 true positive cases, 64 would still be identified as such a second time, while
- essentially none of the false positive cases would be expected to return a second positive test ($100\times 0.1\% = 0.1$).

In this case one might even prefer a test with a greater rate of false positives if this allowed the number of false negatives to be reduced.
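
The arithmetic of this solution can be checked with a short sketch. It is evaluated here with a Type II error rate of $0.1\%$, the value for which the healthy candidates contribute about 100 positive tests. The function name `expected_counts` and the variable names are made up for this example, and the second round assumes that the two tests of one candidate are independent.

```python
def expected_counts(n_candidates, prevalence, type_1_rate, type_2_rate):
    """Expected outcomes of one round of testing.

    Assumed convention: the hypothesis is 'the candidate is infected'.
    A Type I error is a missed infection, a Type II error is a healthy
    candidate returning a positive test.
    """
    infected = n_candidates * prevalence
    healthy = n_candidates - infected
    true_pos = infected * (1.0 - type_1_rate)
    false_neg = infected * type_1_rate
    false_pos = healthy * type_2_rate
    return infected, true_pos, false_neg, false_pos

# Assumed inputs: 100,000 candidates, prevalence 1/1000,
# Type I rate 20 %, Type II rate 0.1 %
infected, true_pos, false_neg, false_pos = expected_counts(100000, 1.0e-3, 0.2, 1.0e-3)
positives = true_pos + false_pos
healthy_fraction = false_pos / positives
print('infected: {:.1f}, positive tests: {:.1f}, missed: {:.1f}'.format(
    infected, positives, false_neg))
print('fraction of positives from healthy candidates: {:.1%}'.format(healthy_fraction))

# Second round on the positives only, assuming independent repeat tests
true_pos_2 = true_pos * (1.0 - 0.2)
false_pos_2 = false_pos * 1.0e-3
print('after retest: true positives {:.1f}, false positives {:.2f}'.format(
    true_pos_2, false_pos_2))
```

The sketch keeps the unrounded 99.9 false positives, so its fraction differs slightly from the rounded $100/180$ quoted above.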

### Problem 3: Hypothesis tests with Poisson and Gauss

The last lecture video discussed an example in which a Poisson distribution was approximated by a Gaussian distribution. This problem aims to illustrate this further. In a counting experiment, assume that the hypothesis is that the expected count rate is 30. Make a table for counts 0 to 50 with the following columns (if you're not using a computer and calculate the numbers one-by-one, you may start at a count of 15; note also that one of the parts of the Poisson probability formula does not depend on the count):
- The count (a running number from 0 to 50)
- The Poisson probability for this count
- The cumulative Poisson probability for counts from 0 to this value
- The signed number of standard deviations corresponding to the deviation of this count from the mean when approximating the Poisson distribution by a Gaussian normal distribution.
- The fractional integral of the Gaussian normal distribution up to the number of standard deviations calculated in the previous column
- The ratio of the cumulative sum of Poisson probabilities to the fractional integral of the normal distribution, i.e. of the values in the third and fifth column.

### Solution to Problem 3

The table below contains the numbers as described above. The last column shows that for counts above the mean of 30 the relative difference between the cumulative sum of Poisson probabilities and the integral over Gaussian probabilities is <10% and for counts greater than 37 the relative difference is <1%.

n_sigma is calculated as $(N-\mu)/\sigma$ with $\mu=30$ and $\sigma=\sqrt{30}$.
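
For a calculation by hand, the hint in the problem can be used as follows: the factor $e^{-\mu}$ does not depend on the count, so it only needs to be evaluated once, and each further probability follows from the previous one through $P(n) = P(n-1)\,\mu/n$. The sketch below is one possible way to mimic this step-by-step procedure using only the standard library. The variable names are made up for this example.

```python
from math import exp, sqrt

mu = 30.0
prob = exp(-mu)      # P(0): the factor exp(-mu) does not depend on the count
cumulative = prob
rows = [(0, prob, cumulative)]
for n in range(1, 51):
    prob *= mu / n   # recurrence P(n) = P(n-1) * mu / n
    cumulative += prob
    rows.append((n, prob, cumulative))

# Print a few rows around the mean together with the number of standard deviations
for n, p, c in rows[28:33]:
    n_sigma = (n - mu) / sqrt(mu)
    print('{:2d}  {:f}  {:f}  {: f}'.format(n, p, c, n_sigma))
```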

In [18]:
import numpy as np
import scipy, scipy.stats
from math import ceil, pi
from scipy.stats import poisson, norm, binom
np.set_printoptions(precision=5,suppress=True)

mean = 30
ks = np.arange(0, 51)      # counts 0 to 50

rvp = poisson(mean)        # initialise the Poisson distribution
p_probs = rvp.pmf(ks)      # Poisson probability for each count (returns an array)
p_ints = np.cumsum(p_probs, dtype=float)  # cumulative sum of the Poisson probabilities

g_sigmas = (ks - mean) / mean**0.5   # signed number of standard deviations, sigma = sqrt(mean)
g_cumprob = norm.cdf(g_sigmas)       # integral of the normal distribution up to g_sigmas

ratios = p_ints / g_cumprob          # ratio of cumulative Poisson to cumulative Gauss
print(' N  P_Poiss   Cum_P      n_sigma   Cum_Gauss Poiss/Gauss')
for a,b,c,d,e,f in zip(ks,p_probs,p_ints,g_sigmas,g_cumprob,ratios):
    if a > 0:   # the row for a count of 0 is not printed
        print('{:2d}  {:f}  {:f}  {: f}  {:f}  {:f}'.format(a,b,c,d,e,f))

 N  P_Poiss   Cum_P      n_sigma   Cum_Gauss Poiss/Gauss
 1  0.000000  0.000000  -5.294651  0.000000  0.000049
 2  0.000000  0.000000  -5.112077  0.000000  0.000283
 3  0.000000  0.000000  -4.929503  0.000000  0.001131
 4  0.000000  0.000000  -4.746929  0.000001  0.003510
 5  0.000000  0.000000  -4.564355  0.000003  0.009011
 6  0.000000  0.000000  -4.381780  0.000006  0.019933
 7  0.000000  0.000001  -4.199206  0.000013  0.039079
 8  0.000002  0.000002  -4.016632  0.000030  0.069316
 9  0.000005  0.000007  -3.834058  0.000063  0.113002
10  0.000015  0.000022  -3.651484  0.000130  0.171433
11  0.000042  0.000064  -3.468910  0.000261  0.244470
12  0.000104  0.000168  -3.286335  0.000508  0.330438
13  0.000240  0.000407  -3.103761  0.000955  0.426302
14  0.000513  0.000921  -2.921187  0.001744  0.528065
15  0.001027  0.001947  -2.738613  0.003085  0.631284
16  0.001925  0.003873  -2.556039  0.005294  0.731591
17  0.003397  0.007270  -2.373464  0.008811  0.825125
18  0.005662  0.012933  -

<div class="well" align="center">
    <div class="container-fluid">
        <div class="row">
            <div class="col-md-3" align="center">
                <img align="center" alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" width="60%">
            </div>
            <div class="col-md-8">
            This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
            </div>
        </div>
    </div>
    <br>
    <br>
    <i>Note: The content of this Jupyter Notebook is provided for educational purposes only.</i>
</div>