In [1]:
## Import required Python modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy, scipy.stats
import io
import base64
#from IPython.core.display import display
from IPython.display import display, HTML, Image
from urllib.request import urlopen

try:
    import astropy as apy
    import astropy.table
    _apy = True
    #print('Loaded astropy')
except ImportError:
    _apy = False
    #print('Could not load astropy')

## Customising the font size of figures
plt.rcParams.update({'font.size': 14})

## Customising the look of the notebook
display(HTML("<style>.container { width:95% !important; }</style>"))
## This custom file is adapted from https://github.com/lmarti/jupyter_custom/blob/master/custom.include
HTML('custom.css')
#HTML(urlopen('https://raw.githubusercontent.com/bretonr/intro_data_science/master/custom.css').read().decode('utf-8'))

In [2]:
HTML('''
<script>
    function toggleCodeCells() {
      var codeCells = document.querySelectorAll('.jp-CodeCell');

      codeCells.forEach(function(cell) {
        var inputArea = cell.querySelector('.jp-InputArea');
        if (inputArea) {
          var currentDisplay = inputArea.style.display || getComputedStyle(inputArea).display;
          inputArea.style.display = currentDisplay === 'none' ? '' : 'none';
        }
      });
    }
</script>

<!-- Add a button to toggle visibility of input code cells -->
<button onclick="toggleCodeCells()">Toggle Code Cells</button>
''')

<div class="container-fluid">
    <div class="row">
        <div class="col-md-8" align="center">
            <h1>PHYS 10791: Introduction to Data Science</h1>
            <!--<h3>2019-2020 Academic Year</h3><br>-->
        </div>
        <div class="col-md-3">
            <img align='center' style="border-width:0" src="images/UoM_logo.png"/>
        </div>
    </div>
</div>

<div class="container-fluid">
    <div class="row">
        <div class="col-md-2" align="right">
            <b>Course instructors:&nbsp;&nbsp;</b>
        </div>
        <div class="col-md-9" align="left">
            <a href="http://www.renebreton.org">Prof. Rene Breton</a> - Twitter <a href="https://twitter.com/BretonRene">@BretonRene</a><br>
            <a href="http://www.hep.manchester.ac.uk/u/gersabec">Dr. Marco Gersabeck</a> - Twitter <a href="https://twitter.com/MarcoGersabeck">@MarcoGersabeck</a>
        </div>
    </div>
</div>

*Note: You are not expected to understand all the computer coding presented with the solutions. You should understand the mathematical concepts and be able to recover the results. We present the computer code so you can learn coding tricks (e.g. read data, compute useful values, fit and plot data) should you be interested.*

# Chapter 2 - Problem Sheet

## Solution 1

### Rene's household universe

First we calculate a useful quantity for all problems. There is a total of 11 living beings in Rene's household universe.

#### Task 1
$P({\rm human}) = 2/11$
<BR>

#### Task 2
$P({\rm cat}) = 4/11$
<BR>

#### Task 3
$P({\rm grey\,colour}) = (2+1+1)/11 = 4/11$
<BR>

#### Task 4
Counting the non-grey beings directly: $P({\rm NOT\,grey\,colour}) = (2+1+4)/11 = 7/11$

Alternatively: $P({\rm NOT\,grey\,colour}) = 1 - P({\rm grey\,colour}) = 1 - 4/11 = 7/11$
<BR>

#### Task 5
$P({\rm cat} \cap {\rm human}) = 0/11$
<BR>

#### Task 6
$P({\rm cat} \cup {\rm human}) = (4+2)/11 = 6/11$
<BR>

#### Task 7
From a direct count: $P({\rm grey\,colour} \mid {\rm mammal}) = (2+1)/(2+2+1+1) = 1/2$

Alternatively, using the conditional probability formula:

$P({\rm grey\,colour} \mid {\rm mammal}) = P({\rm grey\,colour} \cap {\rm mammal}) / P({\rm mammal}) = (3/11)/(6/11) = 1/2$
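
The complement rule from Task 4 and the conditional probability formula from Task 7 can be checked with exact fractions. This is only a minimal sketch: the counts are copied from the solutions above, and the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Counts copied from the solutions above
total = 11         # living beings in the household
grey = 4           # grey beings: 2 + 1 + 1
mammals = 6        # mammals: 2 + 2 + 1 + 1
grey_mammals = 3   # grey mammals: 2 + 1

# Task 4: complement rule
p_grey = Fraction(grey, total)
p_not_grey = 1 - p_grey

# Task 7: P(grey | mammal) = P(grey and mammal) / P(mammal)
p_mammal = Fraction(mammals, total)
p_grey_and_mammal = Fraction(grey_mammals, total)
p_grey_given_mammal = p_grey_and_mammal / p_mammal

print(p_not_grey)           # 7/11
print(p_grey_given_mammal)  # 1/2
```

Using `Fraction` rather than floats keeps the results in the same form as the hand calculation.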

## Solution 2

### The maths behind screening tests

The naive answer to this question is that they have a 99% probability of suffering from the disease. However, the correct answer is more subtle, and **Bayes' rule comes to the rescue** to help us work it out.

First, we need to know the fraction of the population that carries the disease. Let us assume for this example that 0.1% of the population is infected. That is, the prior probabilities of having the disease or not are:

\begin{eqnarray}
  P({\rm disease}) &=& 0.001 \\
  P({\rm no\,disease}) &=& 0.999
\end{eqnarray}

A test yields a positive result with probability 99% given that the person carries the disease:

\begin{eqnarray}
  P({\rm +} \mid {\rm disease}) &=& 0.99 \\
  P({\rm -} \mid {\rm disease}) &=& 0.01
\end{eqnarray}

and 3% of people test positive even though they do not have the disease:

\begin{eqnarray}
  P({\rm +} \mid {\rm no\,disease}) &=& 0.03 \\
  P({\rm -} \mid {\rm no\,disease}) &=& 0.97
\end{eqnarray}

The probability that someone who returns a positive result actually has the disease is therefore:

\begin{eqnarray}
  P({\rm disease} \mid {\rm +}) &=& \frac{P({\rm disease}) P({\rm +} \mid {\rm disease})}{P({\rm +})} \\
  P({\rm disease} \mid {\rm +}) &=& \frac{P({\rm disease}) P({\rm +} \mid {\rm disease})}{P({\rm disease}) P({\rm +} \mid {\rm disease}) + P({\rm no\,disease}) P({\rm +} \mid {\rm no\,disease})} \\
          &=& \frac{0.001 \cdot 0.99}{0.001 \cdot 0.99 + 0.999 \cdot 0.03} \\
          &=& 0.032
\end{eqnarray}

The answer is therefore 3.2%, not 99% as naively expected!
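
The calculation above can be reproduced numerically. The following is a minimal sketch; the function name `posterior_given_positive` and its argument names are hypothetical and chosen here for illustration, while the numbers are those assumed in this solution.

```python
def posterior_given_positive(prior, p_pos_given_disease, p_pos_given_healthy):
    """Return P(disease | +) using Bayes' rule.

    prior               : P(disease)
    p_pos_given_disease : P(+ | disease)
    p_pos_given_healthy : P(+ | no disease)
    """
    # Total probability of a positive result (the denominator of Bayes' rule)
    p_pos = prior * p_pos_given_disease + (1 - prior) * p_pos_given_healthy
    return prior * p_pos_given_disease / p_pos

# Values assumed in this solution
posterior = posterior_given_positive(0.001, 0.99, 0.03)
print(f"P(disease | +) = {posterior:.3f}")  # P(disease | +) = 0.032
```

Calling the function with a larger prior, for instance `posterior_given_positive(0.1, 0.99, 0.03)`, shows how strongly the result depends on how common the disease is.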

## Solution 3

### Tossing a fair coin

We need to use the binomial distribution:

\begin{equation}
  P(k; n, p) = p^k (1-p)^{(n-k)} \frac{n!}{k!(n-k)!}
\end{equation}

This problem can be written as follows:

\begin{equation}
  P(k \le 1; n=4, p=0.5) = P(k=0) + P(k=1) = 1/16 + 1/4 = 0.3125
\end{equation}
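
As a quick check, the binomial formula above can be evaluated directly. This is a minimal sketch using only the standard library; the helper name `binomial_pmf` is chosen here for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials of probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(k <= 1; n=4, p=0.5) = P(k=0) + P(k=1)
p_at_most_one = binomial_pmf(0, 4, 0.5) + binomial_pmf(1, 4, 0.5)
print(p_at_most_one)  # 0.3125
```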

## Solution 4

### Hurricane Harvey

Phrased differently, this question asks us to determine the probability of waiting less than 56 years to see another event after Carla took place (i.e. $p(k \geq 1)$). In this problem we are assessing the wait time between events rather than the number of events in a given period.

Note that sometimes it is easier to calculate the complementary probability. That is, $p(k \geq 1) = 1 - p(k < 1) = 1 - p(k = 0)$.

**Method 1: Using Poisson**

We can use Poisson statistics, with $\lambda = 0.56$ (i.e. 1 in 100 years implies 0.56 in 56 years): $P(k; 0.56)$. Since:

\begin{equation}
  P(0; 0.56) = e^{-0.56} \frac{0.56^0}{0!} = 0.5712
\end{equation}

then:

\begin{equation}
  p(k \geq 1) = 1 - 0.5712 = 0.4288 \,.
\end{equation}

**Method 2: Using Binomial**

We can use Binomial statistics, with $p = 0.01$ (i.e. 0.01 chance in a year), and $n = 56$ trials. Since:

\begin{equation}
  P(0; 56, 0.01) = 0.01^0 (1-0.01)^{56-0} \frac{56!}{0! (56-0)!} = 0.5696
\end{equation}

then:

\begin{equation}
  p(k \geq 1) = 1 - 0.5696 = 0.4304 \,.
\end{equation}

*In both cases we conclude that there is about a 43% chance for two such events to happen so close in time.* This is certainly possible. However, if another event were to take place within a relatively short time span (we could extend the above calculations to $P(k \geq 2)$), then it would start looking more improbable. This kind of assessment is important in order to separate claims that are 'bogus' and wrongly attributed to, say, global warming, from genuine changes in weather patterns.
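
Both methods, and the extension to $P(k \geq 2)$ mentioned above, can be checked numerically. This is a minimal sketch using only the standard library; the helper names `poisson_pmf` and `binomial_pmf` are chosen here for illustration.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """Poisson probability of k events for an expected number lam."""
    return exp(-lam) * lam**k / factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials of probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

lam = 0.56         # expected number of events in 56 years (1 per 100 years)
n, p = 56, 0.01    # 56 yearly trials, each with a 1% chance

# At least one event: complement of zero events
p1_poisson = 1 - poisson_pmf(0, lam)
p1_binomial = 1 - binomial_pmf(0, n, p)
print(f"P(k >= 1): Poisson {p1_poisson:.4f}, Binomial {p1_binomial:.4f}")
# P(k >= 1): Poisson 0.4288, Binomial 0.4304

# At least two events: complement of zero or one event
p2_poisson = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
p2_binomial = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)
print(f"P(k >= 2): Poisson {p2_poisson:.4f}, Binomial {p2_binomial:.4f}")
```

The two methods agree closely because the Poisson distribution approximates the binomial when $p$ is small and $n$ is reasonably large.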

There is a nice paper on [Assessing the present and future probability of Hurricane Harvey’s rainfall](http://www.pnas.org/content/114/48/12681) by Kerry Emanuel looking at this topic.

## Solution 5

### Oreo's egg production

The interval between 240 and 260 is centred on the mean, hence $\mu=250$. For a normal distribution, the 68.26\% interval corresponds to the range $-1\sigma$ to $+1\sigma$, hence $\sigma=10$. Recalling that the interval $-3\sigma$ to $+3\sigma$ contains 99.73\% of the probability, only about 0.15\% lies in each tail, so roughly 99.85\% of the probability lies above $\mu - 3\sigma$. The threshold is therefore $x = \mu - 3\sigma = 250 - 3\times10 = 220$ eggs.
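
The reasoning above can be verified with the cumulative distribution function of a normal distribution. This is a minimal sketch using the standard library's `statistics.NormalDist`, with $\mu=250$ and $\sigma=10$ as derived above; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

# Normal distribution with the mean and standard deviation derived above
eggs = NormalDist(mu=250, sigma=10)

# Probability of producing between 240 and 260 eggs (within 1 sigma)
p_within_1sigma = eggs.cdf(260) - eggs.cdf(240)

# Probability of producing more than 220 eggs (above mu - 3 sigma)
p_above_220 = 1 - eggs.cdf(220)

print(f"P(240 < x < 260) = {p_within_1sigma:.4f}")  # about 0.683
print(f"P(x > 220)       = {p_above_220:.4f}")      # about 0.9987
```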

<div class="well" align="center">
    <div class="container-fluid">
        <div class="row">
            <div class="col-md-3" align="center">
                <img align="center" alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" width="60%">
            </div>
            <div class="col-md-8">
            This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
            </div>
        </div>
    </div>
    <br>
    <br>
    <i>Note: The content of this Jupyter Notebook is provided for educational purposes only.</i>
</div>