In [1]:
## Import required Python modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy, scipy.stats
import io
import base64
#from IPython.core.display import display
from IPython.display import display, HTML, Image
from urllib.request import urlopen

try:
    import astropy as apy
    import astropy.table
    _apy = True
    #print('Loaded astropy')
except ImportError:
    _apy = False
    #print('Could not load astropy')

## Customising the font size of figures
plt.rcParams.update({'font.size': 14})

## Customising the look of the notebook
display(HTML("<style>.container { width:95% !important; }</style>"))
## This custom file is adapted from https://github.com/lmarti/jupyter_custom/blob/master/custom.include
HTML(filename='custom.css')
#HTML(urlopen('https://raw.githubusercontent.com/bretonr/intro_data_science/master/custom.css').read().decode('utf-8'))

In [2]:
## Adding a button to hide the Python source code
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the Python code."></form>''')

<div class="container-fluid">
    <div class="row">
        <div class="col-md-8" align="center">
            <h1>PHYS 10792: Introduction to Data Science</h1>
            <h3>2018-2019 Academic Year</h3><br>
        </div>
        <div class="col-md-3">
            <img align='center' style="border-width:0" src="images/UoM_logo.png"/>
        </div>
    </div>
</div>

<div class="container-fluid">
    <div class="row">
        <div class="col-md-2" align="right">
            <b>Course instructors:&nbsp;&nbsp;</b>
        </div>
        <div class="col-md-9" align="left">
            <a href="http://www.renebreton.org">Rene Breton</a> - Twitter <a href="https://twitter.com/BretonRene">@BretonRene</a><br>
            <a href="http://www.hep.manchester.ac.uk/u/gersabec">Marco Gersabeck</a> - Twitter <a href="https://twitter.com/MarcoGersabeck">@MarcoGersabeck</a>
        </div>
    </div>
</div>

# Chapter 7

## Syllabus

1. Probabilities and interpretations
2. Probability distributions
3. Parameter estimation
4. Maximum likelihood + extended maximum likelihood
5. Least square, chi2, correlations
6. Monte Carlo basics
7. **Probability** 
8. Hypothesis testing
9. Confidence level
10. Goodness of fit tests
11. Limit setting
12. Introduction to multivariate analysis techniques

## Topics

**[7 Probability](#7-Probability)**
- 7.1 Axioms of probability
- 7.2 Empirical probability
- 7.3 Bayesian statistics
- 7.4 Subjective probability
- 7.5 Limitations

## 7 Probability

### 7.1 Axioms of probability

#### Recall from Week 2:

When repeating a measurement, the result may change in an unforeseeable manner. This is the characteristic of a random system. The degree of randomness can be quantified with the concept of probability.

Let us define the probability following Kolmogorov (1933).

We have a set of possible results $S = \{E_1, E_2, \ldots\}$. To each subset $A$ of $S$, $A \subset S$, one assigns a real number $P(A)$, called the probability of $A$, which satisfies the *axioms of probability*:

1. For each subset $A$ in $S$, $P(A) \geq 0$
2. For all disjoint subsets $A$ and $B$ (i.e. $A \cap B = \emptyset$, null intersection), $P(A \cup B) = P(A) + P(B)$ (i.e. the probability of the union of the two is simply the sum of their probabilities)
3. $P(S) = 1$


The following properties can be derived from these axioms, where the complement to the set of results $A$, i.e. not $A$, is denoted by $\overline{A}$:

- $P(\overline{A}) = 1 - P(A)$
- $P(A \cup \overline{A}) = 1$
- $0 \leq P(A) \leq 1$
- $P(\emptyset) = 0$
- If $A \subset B$ then $P(A) \leq P(B)$
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
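
These properties can be checked on a small finite example. The sketch below assumes a fair six-sided die as the set of results $S$ and assigns equal probability to each face; the die, the helper `prob` and the subsets `A` and `B` are illustrative choices, not part of the definitions above.

```python
from fractions import Fraction

# Set of possible results: the faces of a fair six-sided die (assumed for illustration).
S = frozenset(range(1, 7))

def prob(event):
    """Probability of a subset of S, assuming all results are equally likely."""
    return Fraction(len(event & S), len(S))

A = frozenset({2, 4, 6})   # even result
B = frozenset({4, 5, 6})   # result greater than 3
A_bar = S - A              # complement of A

# Check three of the derived properties with exact rational arithmetic.
complement_ok = prob(A_bar) == 1 - prob(A)
union_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)
subset_ok = prob(frozenset({6})) <= prob(B)

print(prob(A), prob(B), prob(A | B))
print(complement_ok, union_ok, subset_ok)
```

Using `Fraction` avoids floating-point rounding, so the equalities are tested exactly.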

#### An alternative formulation
A single possible result is often called an event. 
All of the above can then be written with a single event $E_i$ taking the role of the subset $A$.
With $S$ being the set of all possible events, the third axiom becomes:

&nbsp; &nbsp; 3. $P(S)=\sum_i P(E_i)=1$

### 7.2 Empirical probability - The limit of a frequency

Consider an experiment that is executed $N$ times. The outcome $A$ (this could be a single event or a set of events as discussed above) occurs in $M$ of these cases. As $N\to \infty$, the ratio $M/N$ tends to a limit, which is defined as the _probability_ $P(A)$ of $A$.

The experiment may be repeated $N$ times sequentially, or $N$ identical experiments may be carried out in parallel. The set of all $N$ outcomes is called the _collective_ or _ensemble_.
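
The convergence of $M/N$ can be illustrated with a simulated collective. The following is a minimal sketch, assuming the experiment is the roll of a fair die and the outcome $A$ is "a six is rolled"; the seed, the values of $N$ and the function name `relative_frequency` are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # fixed seed, chosen arbitrarily
p_true = 1 / 6                        # assumed probability of rolling a six

def relative_frequency(n_trials):
    """Return M/N for N simulated rolls of a fair die, where M counts the sixes."""
    rolls = rng.integers(1, 7, size=n_trials)   # integers 1..6 inclusive
    return np.count_nonzero(rolls == 6) / n_trials

for n in (10, 1_000, 100_000, 1_000_000):
    print(f"N = {n:>9,d}   M/N = {relative_frequency(n):.4f}   (expected {p_true:.4f})")
```

For small $N$ the ratio fluctuates strongly, while for large $N$ it should settle close to $1/6$.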

<div class="example">Example: Repeating one experiment</div>

An example of repeating one experiment is the double-slit experiment, in which the same double slit is bombarded many times with particles and a distribution builds up on the screen.
This distribution corresponds to the probability of observing a particle at a given place on the screen.

<img src="images/DoubleSlit.png" width=60% >

Source <a href="https://commons.wikimedia.org/wiki/File:Two-Slit_Experiment_Light.svg">Wikimedia/inductiveload</a>

<div class="example">Example: Many independent experiments</div>

In particle physics, colliders produce millions of particle collisions per second.

Each of these can be considered as an independent experiment.

What is studied in the end is the outcome of the ensemble.

<img src="images/Higgs24mu.png" width=49% > <img src="images/Higgs24l.png" width=49% >

Sources <a href="https://cds.cern.ch/record/1459496">CERN/ATLAS</a>, <a href="http://inspirehep.net/record/1608162">CMS, JHEP 1711 (2017) 047</a>

#### Experiment: Random numbers
Over to menti.com: **82 36 74**

<div class="example">Example: Insurance statistics</div>

The following is a classic example by von Mises.

It is found by the German insurance companies that the fraction of their male clients dying when aged $40$ is $1.1\%$. However, we cannot say that a particular Herr Schmidt has a probability of $1.1\%$ of dying (or $98.9\%$ of surviving) between his $40^{\rm th}$ and $41^{\rm st}$ birthdays. The probability of $1.1\%$ refers to all German insured men. Different sample groups that Herr Schmidt may belong to (e.g. all German men, all men, all Germans, all German insured non-smoking men, all German hang-glider pilots) would give different probabilities of his passing away prematurely. Hence, the probability depends on the individual **and** on the collective to which it is considered to belong.

### 7.3 Bayesian statistics

Bayesian statistics builds on the conditional probability, which we defined in Week 2.

The conditional probability $P(A|B)$ is the probability of an event $A$ given that $B$ is true.
This is useful in defining the collective in the previous example.

For example, the probability that it is Tuesday is $P({\rm Tuesday})=1/7$, but the probability of it being Tuesday given that you are attending this lecture is $P({\rm Tuesday}\,|\,{\rm DataSci})=1/2$, as it only takes place on Tuesdays and Thursdays.

Bayes' theorem states that

$$P(A|B)P(B) = P(A~{\rm and}~B) = P(B|A)P(A),$$

and hence

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}.$$

It is often helpful to express $P(B)$ in terms of whether event $A$ is true or not (with $\overline{A}$ denoting 'not $A$' as before), which gives

$$P(B) = P(B|A)P(A) + P(B|\overline{A})P(\overline{A}) = P(B|A)P(A) + P(B|\overline{A})[1-P(A)],$$

and hence by inserting in the previous equation

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\overline{A})[1-P(A)]}.$$

Going back to our example, we can write

$$P({\rm Tuesday}\,|\,{\rm DataSci}) = \frac{P({\rm DataSci}\,|\,{\rm Tuesday})P({\rm Tuesday})}{P({\rm DataSci}\,|\,{\rm Tuesday})P({\rm Tuesday})+P({\rm DataSci}\,|\,{\rm not~Tuesday})\,[1\,-\,P({\rm Tuesday})]} = \frac{1\times1/7}{1\times1/7+1/6\times 6/7} = \frac{1}{2}.$$
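
The same calculation can be written as a small function. This is a sketch only; the function name `posterior` and its argument names are our own, and exact fractions are used so that the result can be compared directly with the equation above.

```python
from fractions import Fraction

def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) from Bayes' theorem, with P(B) expanded over A and not-A."""
    numerator = p_b_given_a * p_a
    evidence = numerator + p_b_given_not_a * (1 - p_a)   # this is P(B)
    return numerator / evidence

p_tuesday = posterior(
    p_b_given_a=Fraction(1),          # P(DataSci | Tuesday)
    p_a=Fraction(1, 7),               # P(Tuesday)
    p_b_given_not_a=Fraction(1, 6),   # P(DataSci | not Tuesday)
)
print(p_tuesday)  # 1/2
```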

### 7.4 Subjective probability

Bayes' theorem can be applied in a way to interpret how a given result strengthens (or weakens) the degree of belief in a given theory:

$$P({\rm theory}\,|\,{\rm result})=\frac{P({\rm result}\,|\,{\rm theory})}{P({\rm result})}P({\rm theory}).$$

The subjective part lies in the assignment of the probability of the theory being true $P({\rm theory})$.

The interpretation is as follows: if a given result is forbidden by a theory, i.e. $P({\rm result}\,|\,{\rm theory})=0$, then its observation disproves the theory, i.e. $P({\rm theory}\,|\,{\rm result})=0$.
Similarly, if the result is predicted to be unlikely by the theory, its observation reduces the degree of belief in the theory.

If, on the other hand, a result is predicted to be highly likely by the theory, it can strengthen the degree of belief in the theory.
However, there are two cases to consider, for which it is useful to consider the previously discussed replacement:

$$P({\rm theory}\,|\,{\rm result})=\frac{P({\rm result}\,|\,{\rm theory})}{P({\rm result}\,|\,{\rm theory})P({\rm theory})+P({\rm result}\,|\,{\rm not~theory})[1-P({\rm theory})]}P({\rm theory}).$$

If a result is equally likely regardless of whether or not the theory is true, i.e. $P({\rm result}\,|\,{\rm theory})=P({\rm result}\,|\,{\rm not~theory})$, there is no information gain as this results in $P({\rm theory}\,|\,{\rm result})=P({\rm theory}).$

The other extreme is that the result is much more likely to occur if the theory is true, i.e. $P({\rm result}\,|\,{\rm theory})\gg P({\rm result}\,|\,{\rm not~theory})$. Provided that $P({\rm theory})$ is not itself vanishingly small, the observation of the result is then highly predictive, as $P({\rm theory}\,|\,{\rm result})\approx 1$.
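
Dividing the numerator and denominator of the expression above by $P({\rm result}\,|\,{\rm not~theory})$ shows that $P({\rm theory}\,|\,{\rm result})$ depends only on $P({\rm theory})$ and on the ratio $R = P({\rm result}\,|\,{\rm theory})/P({\rm result}\,|\,{\rm not~theory})$. The sketch below tabulates this for a few values; the symbol $R$, the function name and the chosen numbers are illustrative assumptions.

```python
def posterior(prior, likelihood_ratio):
    """P(theory|result) from P(theory) and R = P(result|theory) / P(result|not theory)."""
    return likelihood_ratio * prior / (likelihood_ratio * prior + 1 - prior)

# Illustrative values only: three degrees of prior belief, four likelihood ratios.
for prior in (0.5, 0.05, 1e-6):
    for ratio in (1, 8, 1e3, 1e9):
        print(f"P(theory) = {prior:g}, R = {ratio:g} -> "
              f"P(theory|result) = {posterior(prior, ratio):.3g}")
```

For $R=1$ the function returns $P({\rm theory})$ unchanged, and for large $R$ it approaches $1$ only once $R\,P({\rm theory})$ is much larger than $1-P({\rm theory})$.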

<div class="example">Example: Honest Harry</div>

You toss a coin three times and find it showing heads each time.
You repeat this as part of a bet with Honest Harry, the used car salesman, and arrive at the same result, which means you lose the bet.
The first case will likely not raise any doubts whether or not you are using a biased, i.e. double-headed, coin.
The second case may make you significantly more suspicious. In both cases the probability of the result, given that the coin is unbiased, is 

$$P({\rm 3h\,|\,!bias})=(1/2)^3=0.125.$$

We can now calculate the probability of the theory that the coin is biased, given the result of three heads in both cases, $P({\rm bias\,|\,3h})$. All that is required is the subjective belief in the theory of the coin being biased. Let us assign $P_{\rm rndm}({\rm bias})=10^{-6}$ for the first case, i.e. that we randomly picked a biased coin, and $P_{\rm Harry}({\rm bias})=0.05$, i.e. that the probability of Honest Harry having made us play with a biased coin is $5\%$. With the last equation above we now get

\begin{align}
P_{\rm rndm}({\rm bias\,|\,3h}) & = \frac{P({\rm 3h\,|\,bias})}{P({\rm 3h\,|\,bias})P_{\rm rndm}({\rm bias})+P({\rm 3h\,|\,!bias})[1-P_{\rm rndm}({\rm bias})]}P_{\rm rndm}({\rm bias})\\
& = \frac{1}{1\times 10^{-6} + 0.125 \times (1-10^{-6})}\times 10^{-6}\\
& \approx 8\times 10^{-6},
\end{align}

and

\begin{align}
P_{\rm Harry}({\rm bias\,|\,3h}) & = \frac{P({\rm 3h\,|\,bias})}{P({\rm 3h\,|\,bias})P_{\rm Harry}({\rm bias})+P({\rm 3h\,|\,!bias})[1-P_{\rm Harry}({\rm bias})]}P_{\rm Harry}({\rm bias})\\
& = \frac{1}{1\times 0.05 + 0.125 \times 0.95}\times 0.05\\
& \approx 0.296.
\end{align}

The result was able to increase the belief in the theory that Honest Harry had introduced a biased coin because the observation is rather predictive with $P({\rm 3h\,|\,bias})/P({\rm 3h\,|\,!bias})=8$.
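
The two numbers can be reproduced with a few lines of Python. This is a minimal sketch, assuming as above that a biased coin is double-headed, so that $P({\rm 3h\,|\,bias})=1$; the function name and the `n_heads` parameter are our own additions for illustration.

```python
def p_bias_given_heads(prior_bias, n_heads=3):
    """P(bias | n heads in a row) for a double-headed coin versus a fair one."""
    p_heads_bias = 1.0               # a double-headed coin always shows heads
    p_heads_fair = 0.5 ** n_heads    # fair coin: (1/2)^n
    numerator = p_heads_bias * prior_bias
    return numerator / (numerator + p_heads_fair * (1 - prior_bias))

p_rndm = p_bias_given_heads(1e-6)    # coin picked at random
p_harry = p_bias_given_heads(0.05)   # coin supplied by Honest Harry
print(f"{p_rndm:.2e}  {p_harry:.3f}")
```

Changing `n_heads` shows how quickly a longer run of heads would raise the degree of belief in a biased coin for either choice of $P({\rm bias})$.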

### 7.5 Limitations

Suppose you measure the mass of the electron as $m=(520\pm10)~{\rm keV}/c^2$, i.e. you have obtained a result of $520~{\rm keV}/c^2$ with a resolution of $10~{\rm keV}/c^2$.
Assuming a Gaussian resolution function with resolution $\sigma$, the probability of measuring $m$ given the true electron mass $m_e$ follows

$$P(m|m_e)\propto e^{-(m-m_e)^2/2\sigma^2},$$

which corresponds to $P({\rm result}\,|\,{\rm theory})$. 

$P(m)$, corresponding to $P({\rm result})$, is the probability of measuring a given mass, which should be a constant if the measurement apparatus is unbiased. 

If we now assume that we know nothing about $m_e$, we could say $P({\rm theory})=P(m_e)={\rm const.}$, which leads to the proportionality

$$P(m_e|m) = \frac{P(m|m_e)}{P(m)}P(m_e)\propto P(m|m_e)\propto e^{-(m-m_e)^2/2\sigma^2}.$$

Note that this is only a proportionality and we cannot quantify this without assuming a concrete value for $P(m)$ and $P(m_e)$.

However, we could equally well have said that we know nothing about $m_e^2$, i.e. $P(m_e^2)={\rm const.}$, which is not the same as $P(m_e)={\rm const.}$ and would alter that interpretation. Both approaches are in principle valid; once more it is vital to be aware of all assumptions and to communicate them in their entirety. This will be dealt with in a more rigorous way in Chapter 9: Confidence levels.
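
The effect of this choice can be made concrete numerically. The sketch below compares the two assumptions on a grid of $m_e$ values; it relies on the change of variables $P(m_e)\,{\rm d}m_e = P(m_e^2)\,{\rm d}(m_e^2)$, so that a constant $P(m_e^2)$ corresponds to $P(m_e)\propto m_e$. The grid range and the variable names are arbitrary choices for this illustration.

```python
import numpy as np

m_meas, sigma = 520.0, 10.0   # measured mass and resolution in keV/c^2
# Uniform grid of candidate true masses, +-8 sigma around the measurement.
m_e = np.linspace(m_meas - 8 * sigma, m_meas + 8 * sigma, 16001)

likelihood = np.exp(-(m_meas - m_e) ** 2 / (2 * sigma ** 2))   # P(m | m_e), unnormalised

def normalise(density):
    """Normalise a density sampled on the uniform grid so that it sums to one."""
    return density / density.sum()

post_flat_m = normalise(likelihood)          # assuming P(m_e) = const.
post_flat_m2 = normalise(likelihood * m_e)   # assuming P(m_e^2) = const., i.e. P(m_e) ~ m_e

mean_flat_m = np.sum(m_e * post_flat_m)
mean_flat_m2 = np.sum(m_e * post_flat_m2)
print(f"mean of m_e, flat in m_e:   {mean_flat_m:.3f} keV/c^2")
print(f"mean of m_e, flat in m_e^2: {mean_flat_m2:.3f} keV/c^2")
```

With these numbers the two means should differ by about $\sigma^2/m \approx 0.19~{\rm keV}/c^2$, which is small compared with $\sigma$ here, but the difference grows as the resolution worsens relative to the measured value.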

<div class="well" align="center">
    <div class="container-fluid">
        <div class="row">
            <div class="col-md-3" align="center">
                <img align="center" alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" width="60%">
            </div>
            <div class="col-md-8">
            This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
            </div>
        </div>
    </div>
    <br>
    <br>
    <i>Note: The content of this Jupyter Notebook is provided for educational purposes only.</i>
</div>