In [1]:
## Import required Python modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy, scipy.stats
import io
import base64
#from IPython.core.display import display
from IPython.display import display, HTML, Image
from urllib.request import urlopen

try:
    import astropy as apy
    import astropy.table
    _apy = True
    #print('Loaded astropy')
except ImportError:
    _apy = False
    #print('Could not load astropy')

## Customising the font size of figures
plt.rcParams.update({'font.size': 14})

## Customising the look of the notebook
display(HTML("<style>.container { width:95% !important; }</style>"))
## This custom file is adapted from https://github.com/lmarti/jupyter_custom/blob/master/custom.include
HTML('custom.css')
#HTML(urlopen('https://raw.githubusercontent.com/bretonr/intro_data_science/master/custom.css').read().decode('utf-8'))

In [2]:
## Adding a button to hide the Python source code
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the Python code."></form>''')

<div class="container-fluid">
    <div class="row">
        <div class="col-md-8" align="center">
            <h1>PHYS 10791: Introduction to Data Science</h1>
            <!--<h3>2019-2020 Academic Year</h3><br>-->
        </div>
        <div class="col-md-3">
            <img align='center' style="border-width:0" src="images/UoM_logo.png"/>
        </div>
    </div>
</div>

<div class="container-fluid">
    <div class="row">
        <div class="col-md-2" align="right">
            <b>Course instructors:&nbsp;&nbsp;</b>
        </div>
        <div class="col-md-9" align="left">
            <a href="http://www.renebreton.org">Prof. Rene Breton</a> - Twitter <a href="https://twitter.com/BretonRene">@BretonRene</a><br>
            <a href="http://www.hep.manchester.ac.uk/u/gersabec">Dr. Marco Gersabeck</a> - Twitter <a href="https://twitter.com/MarcoGersabeck">@MarcoGersabeck</a>
        </div>
    </div>
</div>

*Note: You are not expected to understand all the computer coding presented with the solutions. You should understand the mathematical concepts and be able to recover the results. We present the computer code so you can learn coding tricks (e.g. read data, compute useful values, fit and plot data) should you be interested.*

# Chapter 2 - Problem Sheet

## Problem 1

### Case study: The maths behind screening tests

Let us investigate the maths behind medical screening tests...

*An over-the-counter test is available to diagnose some particular disease which is known to be carried by 0.5% of the population at any given time. A representation of the packaging is illustrated below. If someone returns a positive result, what is the probability that they suffer from this particular disease?*

<img src="images/supertest.png" width="35%">

Note: after reading the small print in the instruction manual, they find the following details:

```
Our test is designed to identify a specific protein expressed by individuals suffering from the disease. Controlled laboratory studies demonstrated that it can detect 99% of individuals suffering from the disease, while 2% of the tests come out positive for individuals who are not infected.
```

## Solution 1

After reading the packaging, the naive answer to this question from a non-data scientist is likely that the person has a 99% probability of suffering from the disease. However, the correct answer is more subtle. We should use Bayes' theorem to work through it.

We need to frame Bayes' theorem in the context of the current problem. The variables to consider are *disease* vs *no disease* and testing positive ($+$) vs testing negative ($-$). Hence, the question being asked can be written as follows:
\begin{equation}
  P({\rm disease} \mid {\rm +}) = \frac{ P({\rm disease}) P({\rm +} \mid {\rm disease}) }{ P({\rm +}) } \,,
\end{equation}
that is, the probability of having the disease given a positive test result.

Let us work through each component of the equation. First, we need to know the fraction of the population that carries the disease. In the problem, it is stated that 0.5% of the population is infected. That is, the **prior probability** of having the disease (or not) is:
\begin{eqnarray}
  P({\rm disease}) &=& 0.005 \\
  P({\rm no\,disease}) &=& 1 - P({\rm disease}) = 0.995 \,.
\end{eqnarray}

The next step is to write the likelihood of all possible conditional outcomes. A test yields a positive result with probability 99% given that the person carries the disease (i.e. **true positive**). That is:
\begin{eqnarray}
  P({\rm +} \mid {\rm disease}) &=& 0.99 \\
  P({\rm -} \mid {\rm disease}) &=& 1 - P({\rm +} \mid {\rm disease}) = 0.01 \,.
\end{eqnarray}
The second equation above is simply the opposite outcome (i.e. **false negative**).

We also need to consider the opposite situation: someone tests positive but does not carry the disease (i.e. **false positive**), as well as the complementary outcome (**true negative**):
\begin{eqnarray}
  P({\rm +} \mid {\rm no\,disease}) &=& 0.02 \\
  P({\rm -} \mid {\rm no\,disease}) &=& 0.98 \,.
\end{eqnarray}

Putting this all together, and using the expansion of the evidence (i.e. denominator):

\begin{eqnarray}
  P({\rm disease} \mid {\rm +}) &=& \frac{P({\rm disease}) P({\rm +} \mid {\rm disease})}{P({\rm +})} \\
  P({\rm disease} \mid {\rm +}) &=& \frac{P({\rm disease}) P({\rm +} \mid {\rm disease})}{P({\rm disease}) P({\rm +} \mid {\rm disease}) + P({\rm no\,disease}) P({\rm +} \mid {\rm no\,disease})} \\
          &=& \frac{0.005 \cdot 0.99}{0.005 \cdot 0.99 + 0.995 \cdot 0.02} \\
          &=& 0.199
\end{eqnarray}

The answer is therefore 19.9%, not 99% as naively expected. It is a common mistake to ignore the prevalence (i.e. the prior) of a disease when calculating such probabilities, and to fail to normalise properly. Both the prevalence and the false positive rate have a massive impact on the outcome.
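
The calculation above can be checked numerically with a few lines of Python. This is only a minimal sketch: the function name `posterior_positive` and its argument names are illustrative choices, not part of the course material. The loop at the end uses a few arbitrary prevalence values (assumptions, not data from the problem) to show how strongly the result depends on the prior.

```python
def posterior_positive(prevalence, sensitivity, false_positive_rate):
    """Return P(disease | +) using Bayes' theorem.

    prevalence          : P(disease), the prior
    sensitivity         : P(+ | disease), the true positive rate
    false_positive_rate : P(+ | no disease)
    """
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    # The evidence P(+) is the sum over both ways of testing positive
    return true_pos / (true_pos + false_pos)

# Values from the problem statement
p_disease_given_pos = posterior_positive(0.005, 0.99, 0.02)
print(f"P(disease | +) = {p_disease_given_pos:.3f}")

# Illustrative prevalence values (arbitrary) to show the effect of the prior
for prevalence in (0.001, 0.005, 0.05, 0.5):
    posterior = posterior_positive(prevalence, 0.99, 0.02)
    print(f"prevalence = {prevalence:.3f} -> P(disease | +) = {posterior:.3f}")
```

With the values from the problem, the first line printed should reproduce the 0.199 obtained by hand.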

## Problem 2

### Case study: Hurricane Harvey

Hurricane Harvey brought major flooding to Texas at the end of August 2017. It was described as a once-in-a-century type of event. It was the first Category 4 hurricane to make landfall in Texas since Carla in 1961. Does this claim about the rarity of such an event make any sense?

Answering this problem requires casting the right question. In science, this is often the most difficult part. Here, we could say:

```
What are the odds of having two or more such hurricanes in the observed time span?
```

## Solution 2

Phrased differently, this question asks us to determine the probability of waiting less than 56 years to see another such event after Carla took place, i.e. $p(k \geq 1)$. In this problem we are assessing the waiting time between events rather than the number of events in a given period.

Note that sometimes it is easier to calculate the complementary probability. That is $p(k \geq 1) = 1 - p(k < 1) = 1 - p(k = 0)$.

**Method 1: Using Poisson**

We can use Poisson statistics, with $\lambda = 0.56$ (i.e. 1 in 100 years implies 0.56 in 56 years): $P(k; 0.56)$. Since:

\begin{equation}
  P(0; 0.56) = e^{-0.56} \frac{0.56^0}{0!} = 0.5712
\end{equation}

then:

\begin{equation}
  p(k \geq 1) = 1 - 0.5712 = 0.4288 \,.
\end{equation}
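
As a quick numerical check of Method 1, the Poisson probability can be evaluated with the standard library alone. This is a sketch under the same assumptions as the text (a rate of 1 event per 100 years, and a span of 56 years between 1961 and 2017); the helper `poisson_pmf` and the variable names are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events for an expected count lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

rate_per_year = 1.0 / 100.0      # "once in a century"
span_years = 2017 - 1961         # time between Carla and Harvey
lam = rate_per_year * span_years # expected number of events in the span

p_zero = poisson_pmf(0, lam)     # no event in 56 years
p_at_least_one = 1.0 - p_zero    # complementary probability

print(f"lambda    = {lam:.2f}")
print(f"P(k = 0)  = {p_zero:.4f}")
print(f"P(k >= 1) = {p_at_least_one:.4f}")
```

The same number can be obtained with `scipy.stats.poisson.pmf(0, 0.56)`; the hand-written version is used here only to keep the formula visible.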

**Method 2: Using Binomial**

We can use Binomial statistics, with $p = 0.01$ (i.e. 0.01 chance in a year), and $n = 56$ trials. Since:

\begin{equation}
  P(0; 56, 0.01) = 0.01^0 (1-0.01)^{56-0} \frac{56!}{0! (56-0)!} = 0.5696
\end{equation}

then:

\begin{equation}
  p(k \geq 1) = 1 - 0.5696 = 0.4304 \,.
\end{equation}
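
Method 2 can be checked in the same way. The sketch below treats each of the 56 years as an independent trial with a 1% chance of an event, as assumed in the text; the helper `binomial_pmf` and the variable names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

n_years = 56     # number of yearly "trials" between Carla and Harvey
p_year = 0.01    # chance of a once-in-a-century event in any given year

p_zero = binomial_pmf(0, n_years, p_year)  # no event in 56 years
p_at_least_one = 1.0 - p_zero              # complementary probability

print(f"P(k = 0)  = {p_zero:.4f}")
print(f"P(k >= 1) = {p_at_least_one:.4f}")
```

The equivalent SciPy call is `scipy.stats.binom.pmf(0, 56, 0.01)`.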

*In both cases we conclude that there is about a 43% chance of two such events happening this close in time.* This is certainly possible. However, if another event were to take place within a relatively short time span (we could extend the above calculations to $P(k \geq 2)$), then it would start looking more improbable. This kind of assessment is important in order to separate claims that are 'bogus' and wrongly attributed to, say, global warming, from genuine changes in weather patterns.
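
The extension to $P(k \geq 2)$ mentioned above follows from $P(k \geq 2) = 1 - P(0) - P(1)$. The sketch below evaluates it for both distributions over the same 56-year window; keeping the window at 56 years is an assumption made here for illustration, and the helper names are not from the course material.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events for an expected count lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial_pmf(k, n, p):
    """Binomial probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

n_years = 56            # same window as in the calculations above
p_year = 0.01           # once-in-a-century event
lam = n_years * p_year  # expected number of events in the window

# P(k >= 2) = 1 - P(0) - P(1)
p2_poisson = 1.0 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
p2_binomial = (1.0 - binomial_pmf(0, n_years, p_year)
                   - binomial_pmf(1, n_years, p_year))

print(f"Poisson : P(k >= 2) = {p2_poisson:.4f}")
print(f"Binomial: P(k >= 2) = {p2_binomial:.4f}")
```

Both approaches give roughly 11%, noticeably less likely than the 43% found for $k \geq 1$.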

There is a nice paper on [Assessing the present and future probability of Hurricane Harvey’s rainfall](http://www.pnas.org/content/114/48/12681) by Kerry Emanuel looking at this topic.

<div class="well" align="center">
    <div class="container-fluid">
        <div class="row">
            <div class="col-md-3" align="center">
                <img align="center" alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" width="60%">
            </div>
            <div class="col-md-8">
            This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
            </div>
        </div>
    </div>
    <br>
    <br>
    <i>Note: The content of this Jupyter Notebook is provided for educational purposes only.</i>
</div>