In [2]:
## Import required Python modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy, scipy.stats
import io
import base64
#from IPython.core.display import display
from IPython.display import display, HTML, Image
from urllib.request import urlopen

try:
    import astropy as apy
    import astropy.table
    _apy = True
    #print('Loaded astropy')
except ImportError:
    _apy = False
    #print('Could not load astropy')

## Customising the font size of figures
plt.rcParams.update({'font.size': 14})

## Customising the look of the notebook
display(HTML("<style>.container { width:95% !important; }</style>"))
## This custom file is adapted from https://github.com/lmarti/jupyter_custom/blob/master/custom.include
HTML(filename='custom.css')
#HTML(urlopen('https://raw.githubusercontent.com/bretonr/intro_data_science/master/custom.css').read().decode('utf-8'))

In [3]:
## Adding a button to hide the Python source code
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the Python code."></form>''')

<div class="container-fluid">
    <div class="row">
        <div class="col-md-8" align="center">
            <h1>PHYS 10791: Introduction to Data Science</h1>
            <!--<h3>2019-2020 Academic Year</h3><br>-->
        </div>
        <div class="col-md-3">
            <img align='center' style="border-width:0" src="images/UoM_logo.png"/>
        </div>
    </div>
</div>

<div class="container-fluid">
    <div class="row">
        <div class="col-md-2" align="right">
            <b>Course instructors:&nbsp;&nbsp;</b>
        </div>
        <div class="col-md-9" align="left">
            <a href="http://www.renebreton.org">Prof. Rene Breton</a> - Twitter <a href="https://twitter.com/BretonRene">@BretonRene</a><br>
            <a href="http://www.hep.manchester.ac.uk/u/gersabec">Dr. Marco Gersabeck</a> - Twitter <a href="https://twitter.com/MarcoGersabeck">@MarcoGersabeck</a>
        </div>
    </div>
</div>

# Chapter 1

## Problem 1

Consider the data below:

| x | y  |
|---|----|
| 7 | 19 |
|14 | 30 |
| 8 | 22 |
| 7 | 18 |
|14 | 32 |
|12 | 30 |
|10 | 25 |
|13 | 26 |
| 9 | 26 |
|10 | 25 |

#### Tasks

1. Calculate the arithmetic mean for column $x$.
2. Calculate the geometric mean for column $x$.
3. Calculate the harmonic mean for column $x$.
4. Calculate the RMS for column $x$.
5. Calculate the median for column $x$.
6. Calculate the uncorrected standard deviation for column $x$.
7. Calculate the population standard deviation for column $x$, given that the true mean is 10.
8. Calculate the skewness for column $x$.
9. Calculate the kurtosis for column $x$.
10. Calculate the covariance matrix for this dataset.
11. Calculate the correlation coefficient between $x$ and $y$.
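
The quantities listed above can be cross-checked numerically. The snippet below is a minimal sketch using NumPy and SciPy. The variable names are illustrative, and the conventions chosen are assumptions that should be matched to the definitions used in the lectures: `ddof=0` for the covariance matrix, the biased moment estimator for the skewness, and the Fisher (excess) definition for the kurtosis.

```python
import numpy as np
from scipy import stats

# Data from the table above
x = np.array([7, 14, 8, 7, 14, 12, 10, 13, 9, 10], dtype=float)
y = np.array([19, 30, 22, 18, 32, 30, 25, 26, 26, 25], dtype=float)

arithmetic_mean = x.mean()
geometric_mean = stats.gmean(x)
harmonic_mean = stats.hmean(x)
rms = np.sqrt(np.mean(x**2))
median = np.median(x)

# Uncorrected standard deviation: divide by N, deviations from the sample mean
std_uncorrected = x.std(ddof=0)
# Standard deviation about the given true mean of 10 (divide by N)
std_known_mean = np.sqrt(np.mean((x - 10.0) ** 2))

skewness = stats.skew(x)             # biased moment estimator (SciPy default)
excess_kurtosis = stats.kurtosis(x)  # Fisher definition: 0 for a Gaussian

# Assumption: ddof=0 so the diagonal matches the uncorrected variance
cov_matrix = np.cov(x, y, ddof=0)
corr_xy = np.corrcoef(x, y)[0, 1]

print(arithmetic_mean, geometric_mean, harmonic_mean, rms, median)
print(std_uncorrected, std_known_mean, skewness, excess_kurtosis)
print(cov_matrix, corr_xy)
```

A useful sanity check is the ordering harmonic mean ≤ geometric mean ≤ arithmetic mean ≤ RMS, which holds for any set of positive numbers.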

# Chapter 2

## Problem 1

<img align='center' style="border-width:0" src="images/euler.png" width="400"/>

#### Tasks
Calculate the following probabilities:

1. $P(\Omega)$
2. $P(A1 \mid B)$
3. $P(A2 \cap B)$
4. $P(A2 \cup A3)$
5. $P(A2 \cup A3 \mid B)$
6. $P(\bar B)$

## Problem 2

After her daily breakfast, Poutine the cat always stares through the window to plan her day. She is desperate to spend time outside, but doesn't want to do so if it rains. Poutine is a clever data scientist and knows all about Bayes' theorem. From information gathered from reading her iPaw (a cat-friendly version of an iPhone), Poutine knows that:
- 50% of all rainy days start off cloudy
- Cloudy mornings are common; 40% of days start cloudy
- November is usually a rainy month; 17 of 30 days tend to be rainy

What is the probability that Poutine would go outside on a November day given that she observed cloud in the morning?
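
As a rough numerical sketch, Bayes' theorem can be applied directly to the three numbers above. The helper `bayes_posterior` is a hypothetical name introduced for illustration, and the last line assumes that Poutine goes outside if and only if it does not rain that day.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)."""
    return likelihood * prior / evidence

p_cloud_given_rain = 0.50   # 50% of rainy days start off cloudy
p_cloud = 0.40              # 40% of days start cloudy
p_rain = 17 / 30            # 17 of 30 November days are rainy

p_rain_given_cloud = bayes_posterior(p_cloud_given_rain, p_rain, p_cloud)

# Assumption: Poutine goes outside exactly when the day is not rainy
p_outside_given_cloud = 1.0 - p_rain_given_cloud

print(f"P(rain | cloud)    = {p_rain_given_cloud:.3f}")
print(f"P(outside | cloud) = {p_outside_given_cloud:.3f}")
```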

## Problem 3

New detectors for the LHCb upgrade are being built at Manchester. Some mass-produced electronic components need to undergo quality testing before being fully assembled. A particular component is quoted by the manufacturer to report a value exceeding the tolerance only once every 100 measurements.<br><br>

#### Tasks
1. How likely is it that the component does not meet the manufacturer's standards if it fails to report an accurate measurement twice in 100 tests?
2. How many tests with no failure are required in order to be 95% certain that the component meets the manufacturer standards?
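
One possible starting point is the binomial distribution for the number of failures in 100 tests, taking the manufacturer's quoted failure probability of 1/100 at face value. The sketch below only computes these building-block probabilities. How they are turned into a statement about the component meeting the standards is part of the exercise and is not settled here.

```python
from scipy import stats

n_tests = 100
p_fail = 0.01  # manufacturer's quoted failure probability per measurement

# Number of failures in n_tests if the manufacturer's claim holds
failures = stats.binom(n_tests, p_fail)

p_no_failure = failures.pmf(0)    # P(k = 0)
p_exactly_two = failures.pmf(2)   # P(k = 2)
p_two_or_more = failures.sf(1)    # P(k >= 2) = 1 - P(k <= 1)

print(f"P(0 failures)   = {p_no_failure:.4f}")
print(f"P(2 failures)   = {p_exactly_two:.4f}")
print(f"P(>=2 failures) = {p_two_or_more:.4f}")
```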

## Problem 4

Supernovae release more than 99% of their energy via neutrino emission. Models predict that these neutrinos should be emitted over a short time interval, say 10 seconds.

#### Tasks
1. If the background neutrino detection rate by a certain detector is 400 per hour for the full sky, what would be the average detection rate in 10 seconds for an area of the sky equal to 2 steradians? (This area is chosen to represent the accuracy to which a neutrino can be localised in the sky)
2. If a neutrino is detected from a particular region of the sky, how likely is it that one would wait 10 seconds before detecting another?
3. If 3 neutrinos are detected within a 10-second interval from a region of the sky coincident with a supernova shortly after the event, how likely is it that these neutrinos were produced by the supernova?
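
A minimal sketch of the Poisson quantities involved is given below. It assumes that the background is isotropic, so the rate scales with the fraction of the full sky ($4\pi$ steradians) covered by the patch, and that detections are independent. Variable names are illustrative.

```python
import numpy as np
from scipy import stats

rate_full_sky_per_hour = 400.0
window_s = 10.0
patch_sr = 2.0

# Expected background counts in the 10 s window within the 2 sr patch
# (assumption: isotropic background over the full 4*pi sr sky)
mean_counts = rate_full_sky_per_hour * (window_s / 3600.0) * (patch_sr / (4.0 * np.pi))

# Probability of no further background event within 10 s,
# i.e. of waiting at least 10 s for the next detection
p_wait_at_least_10s = np.exp(-mean_counts)

# Probability that background alone yields 3 or more events in the window
p_three_or_more_bkg = stats.poisson.sf(2, mean_counts)

print(f"mean counts         = {mean_counts:.4f}")
print(f"P(wait >= 10 s)     = {p_wait_at_least_10s:.4f}")
print(f"P(>=3 | background) = {p_three_or_more_bkg:.2e}")
```

The last number is the chance that the background alone mimics the observation. Turning it into a probability that the neutrinos came from the supernova requires a further interpretive step.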

## Problem 5

Your favourite online delivery company claims that 95.44% of their customers have their goods delivered within the allocated 30-minute delivery window. You need to plan an upcoming delivery on the day of an Introduction to Data Science lecture and absolutely don't want to miss the start to see an update on Rene's chickens and Poutine the cat. How long before the lecture time must the delivery window end in order to be 99.86% certain that you won't be distracted during the lecture?
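
The two percentages quoted above are recognisable Gaussian probabilities. The sketch below recovers the corresponding numbers of standard deviations, under the assumptions that delivery times are normally distributed, that 95.44% refers to a central (two-sided) interval, and that 99.86% refers to a one-sided limit. Converting these into minutes depends on how the 30-minute window is interpreted and is left to the reader.

```python
from scipy import stats

# Assumption: 95.44% is a central interval, so half of the remainder sits in each tail
z_window = stats.norm.ppf(0.5 + 0.9544 / 2.0)

# Assumption: 99.86% is a one-sided (upper) limit on the delivery time
z_target = stats.norm.ppf(0.9986)

print(f"window half-width = {z_window:.2f} sigma")
print(f"required margin   = {z_target:.2f} sigma")
```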

# Chapter 3

## Problem 1

You are a data analyst working for an automobile trading company. On a particular car model, 30% of cars aged 5 years and older have developed an issue with the engine which makes the resell value drop by £1000. Instead, the company has implemented a £1000 cash-back policy for any customer purchasing such a car if they encounter the faulty engine problem within 6 months of the purchase.<br><br>

##### Tasks
1. What needs to be the resell price in order to ensure that the company breaks even on average?
2. Given its actual condition, a car will resell for a price slightly above or below the expected price. If a standard deviation of £200 is observed in the actual resell price compared to the forecasted value, irrespective of the engine issue, how should the resell price from the previous question be adjusted to ensure that the company breaks even on average?
3. If 100 of these cars are sold in a year, what is the probability that you would make at least £4000 profit in total?
4. Bonus: What would you need to do in order to ensure that you have less than 2.28% probability of losing money?
5. Extra bonus: your predicted resell price is a statistical estimator of the true value of the vehicle. Which property of a 'good' estimator do you affect by implementing the modification from the previous question in order to make a profit? Why?

# Chapter 4

## Problem 1

The Pareto distribution was originally introduced by the Italian scientist of the same name to describe the allocation of wealth in society, where a large number of individuals have a low wealth and the largest proportion of the wealth belongs to a minority. The Pareto distribution can be written as $\mathcal{L}(x \mid \theta, x_0) = \theta x_0^\theta x^{-\theta-1}$.<br><br>

##### Tasks
1. Write an expression for the joint likelihood of multiple data points $x = \{x_1, x_2, \dots, x_i, \dots, x_N\}$.
2. Derive an expression for the Maximum Likelihood Estimator of $\theta$ for this problem, $\widehat{\theta}$.
3. Derive an expression for the error on $\widehat{\theta}$, $\sigma_{\widehat{\theta}}$, using the minimum variance bound principle.
4. Evaluate $\widehat{\theta}$ and $\sigma_{\widehat{\theta}}$ if the observed values are $x = \{3.52, 2.07, 2.23, 2.26, 2.90, 2.33, 2.04, 3.59, 3.29, 4.23\}$, given that $x_0 = 2$.
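
For a numerical cross-check, one route is to note that the log-likelihood is $\ln \mathcal{L} = N \ln\theta + N\theta \ln x_0 - (\theta + 1)\sum_i \ln x_i$. Setting its derivative with respect to $\theta$ to zero gives $\widehat{\theta} = N / \sum_i \ln(x_i / x_0)$, and the second derivative, $-N/\theta^2$, gives $\sigma_{\widehat{\theta}} = \widehat{\theta}/\sqrt{N}$ from the minimum variance bound. The sketch below simply evaluates these two expressions, so it is only as good as that derivation.

```python
import numpy as np

x = np.array([3.52, 2.07, 2.23, 2.26, 2.90, 2.33, 2.04, 3.59, 3.29, 4.23])
x0 = 2.0
n = x.size

# MLE from d(ln L)/d(theta) = N/theta + N ln(x0) - sum(ln x_i) = 0
theta_hat = n / np.sum(np.log(x / x0))

# Minimum variance bound: sigma^2 = -1 / (d^2 ln L / d theta^2) = theta^2 / N
sigma_theta = theta_hat / np.sqrt(n)

print(f"theta_hat = {theta_hat:.3f} +/- {sigma_theta:.3f}")
```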

# Chapter 5

## Problem 1

Two independent experiments attempt to measure a certain quantity, $a$. The table below provides the tabulated values of their posterior distributions, $P_1(a)$ and $P_2(a)$, as a function of the quantity, $a$.

| a | P1(a) | P2(a) |
|---|-------|-------|
| 0 | 1.000 | 0.043 |
| 1 | 0.716 | 0.135 |
| 2 | 0.513 | 0.324 |
| 3 | 0.367 | 0.606 |
| 4 | 0.263 | 0.882 |
| 5 | 0.188 | 1.000 |
| 6 | 0.135 | 0.882 |
| 7 | 0.096 | 0.606 |
| 8 | 0.069 | 0.324 |
| 9 | 0.049 | 0.135 |

<br><br>
##### Tasks
1. What is the joint posterior probability of the two experiments?
2. What is the maximum a posteriori value, $\widehat{a}$, from each experiment?
3. What is the maximum a posteriori value, $\widehat{a}$, from the joint results?
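
Since the experiments are independent, a simple sketch is to multiply the two tabulated posteriors point by point and locate the maximum on the grid. The normalisation applied to the product is optional and does not move the maximum. The array names are illustrative.

```python
import numpy as np

a = np.arange(10)
p1 = np.array([1.000, 0.716, 0.513, 0.367, 0.263, 0.188, 0.135, 0.096, 0.069, 0.049])
p2 = np.array([0.043, 0.135, 0.324, 0.606, 0.882, 1.000, 0.882, 0.606, 0.324, 0.135])

# Joint posterior for independent experiments: point-wise product
joint = p1 * p2
joint = joint / joint.sum()  # normalise over the tabulated grid

a_map_1 = a[np.argmax(p1)]
a_map_2 = a[np.argmax(p2)]
a_map_joint = a[np.argmax(joint)]

print(a_map_1, a_map_2, a_map_joint)
```

Because the table is coarse, the grid maximum is only accurate to the grid spacing of 1.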

## Problem 2

Two independent experiments attempt to measure a certain quantity, $a$. The equations below provide their posterior probability distributions:
\begin{eqnarray}
    P_1(a) &=& \frac{1}{3} e^{-a/3} \\
    P_2(a) &=& \frac{1}{2 \sqrt{2 \pi}} e^{-\frac{\left(a-5\right)^2}{8}} \,.
\end{eqnarray}
<br><br>

##### Tasks
1. What is the joint posterior probability of the two experiments?
2. What is the maximum a posteriori value, $\widehat{a}$, from each experiment?
3. What is the maximum a posteriori value, $\widehat{a}$, from the joint results?
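
The analytic result for the joint maximum can be checked numerically by maximising the logarithm of the product $P_1(a) P_2(a)$, with constant factors dropped. The sketch below uses `scipy.optimize.minimize_scalar` and assumes that $a$ is restricted to $a \geq 0$, with the search interval $[0, 10]$ chosen for illustration.

```python
from scipy import optimize

def log_joint(a):
    """Log of P1(a) * P2(a), up to an additive constant."""
    return -a / 3.0 - (a - 5.0) ** 2 / 8.0

# Maximise log_joint by minimising its negative on an assumed range [0, 10]
result = optimize.minimize_scalar(
    lambda a: -log_joint(a), bounds=(0.0, 10.0), method="bounded"
)
a_map_joint = result.x

print(f"joint MAP estimate: a = {a_map_joint:.4f}")
```

Setting the derivative $-1/3 - (a - 5)/4$ to zero by hand gives $a = 11/3$, which the numerical value should reproduce.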

## Problem 3

You are provided with the following measurements from a free-fall experiment. A ball, initially at rest, falls from a height $y_0$ to the ground and a high-speed camera records the height as a function of time. The uncertainty on each height measurement is assumed to be the same and equal to 0.05 m. Times are in seconds and heights in meters.

| t   | y    |
|-----|------|
| 0.1 | 2.91 |
| 0.2 | 2.77 |
| 0.3 | 2.61 |
| 0.4 | 2.24 |
| 0.5 | 1.85 |
| 0.6 | 1.27 |
| 0.7 | 0.64 |
<br><br>

##### Tasks
1. Write the equation which describes the predicted height as a function of time, $f(t)$.
2. What are the two parameters in the model?
3. Justify why you can use the chi-squared minimisation to find the most likely value of the two parameters.
4. Write the equation of the chi-square for this problem.
5. Derive an expression for the most likely value of each parameter.
6. Evaluate the most likely value of each parameter given the data provided above. (Note: this probably gets a bit tedious without using a computer so understanding the principle is more important.)
7. Derive an expression for the uncertainty on each parameter.
8. Evaluate the uncertainty on each parameter given the data provided above. (Note: this probably gets a bit tedious without using a computer so understanding the principle is more important.)
9. Does the measurement of the gravitational acceleration, $g$, agree with the accepted value?
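
The tedious evaluation can be delegated to a computer. The sketch below assumes the model $f(t) = y_0 - \frac{1}{2} g t^2$ (ball released from rest at $t = 0$, no air resistance), which is linear in the two parameters $y_0$ and $g$, so the chi-squared minimum follows from the normal equations. The matrix and variable names are illustrative.

```python
import numpy as np

t = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
y = np.array([2.91, 2.77, 2.61, 2.24, 1.85, 1.27, 0.64])
sigma_y = 0.05  # same uncertainty on every height measurement

# Assumed model: f(t) = y0 * 1 + g * (-t^2 / 2), linear in (y0, g)
design = np.column_stack([np.ones_like(t), -0.5 * t**2])

# Normal equations for equal, uncorrelated uncertainties
cov = np.linalg.inv(design.T @ design / sigma_y**2)  # parameter covariance
params = cov @ (design.T @ y) / sigma_y**2

y0_hat, g_hat = params
sigma_y0, sigma_g = np.sqrt(np.diag(cov))
chi2_min = np.sum(((y - design @ params) / sigma_y) ** 2)

print(f"y0 = {y0_hat:.3f} +/- {sigma_y0:.3f} m")
print(f"g  = {g_hat:.2f} +/- {sigma_g:.2f} m/s^2")
print(f"chi2_min = {chi2_min:.2f} for {t.size - 2} degrees of freedom")
```

Comparing `g_hat` with the accepted value in units of `sigma_g` gives a quick measure of the level of agreement.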

# Chapter 6

## Problem 1

In an experiment about circular motion, you measure the $x$ and $y$ coordinates of an object as a function of time with respect to the centre of rotation. Each of these quantities has an uncertainty given by $\sigma_x$ and $\sigma_y$, respectively. At any point in time, the object is subjected to a purely tangential force of magnitude $F$ which has an uncertainty $\sigma_F$. You are interested in the magnitude and direction of the torque, $\tau$ and $\theta_\tau$, experienced by the object at each point in time.

_Recall that $\vec{\tau} = \vec{r} \times \vec{F}$._

Derive an expression for the magnitude of the torque and its uncertainty, $\tau$ and $\sigma_\tau$.
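
A derived expression can be checked against a short numerical sketch. The function below assumes that the force is perpendicular to $\vec{r}$, so that $\tau = F \sqrt{x^2 + y^2}$, and that the uncertainties on $x$, $y$ and $F$ are uncorrelated, so the standard first-order propagation formula applies. The function name is hypothetical.

```python
import numpy as np

def torque_with_uncertainty(x, y, force, sigma_x, sigma_y, sigma_force):
    """Torque magnitude tau = r * F and its first-order propagated uncertainty.

    Assumes a purely tangential force and uncorrelated uncertainties.
    """
    r = np.hypot(x, y)
    tau = r * force
    # Partial derivatives: d(tau)/dx = F x / r, d(tau)/dy = F y / r, d(tau)/dF = r
    variance = (
        (force * x / r) ** 2 * sigma_x**2
        + (force * y / r) ** 2 * sigma_y**2
        + r**2 * sigma_force**2
    )
    return tau, np.sqrt(variance)

tau, sigma_tau = torque_with_uncertainty(3.0, 4.0, 2.0, 0.1, 0.1, 0.0)
print(tau, sigma_tau)
```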

## Problem 2

Using a seed value $n_0 = 42$, generate the first 5 random integers using the linear congruential generator which has the following parameters:
- Multiplier $a = 7$
- Increment $c = 3$
- Modulus $m = 101$
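
The recurrence $n_{i+1} = (a\, n_i + c) \bmod m$ is easily sketched in a few lines. The function name `lcg` is illustrative, and the convention that the seed itself is not counted among the generated numbers is an assumption.

```python
def lcg(seed, a, c, m, count):
    """Return `count` integers from the recurrence n_{i+1} = (a * n_i + c) mod m."""
    values = []
    n = seed
    for _ in range(count):
        n = (a * n + c) % m
        values.append(n)
    return values

sequence = lcg(seed=42, a=7, c=3, m=101, count=5)
print(sequence)
```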

## Problem 3

In this problem you will look into calculating the following integral: $\int_1^3 x^2 {\rm d}x$.<br><br>

##### Tasks
1. Calculate the integral analytically.
2. Calculate the integral using the trapezoid method with 5 intervals.
3. Calculate the integral using the crude Monte Carlo method using the following sequence of random integers, which were generated using a linear congruential generator having a modulus $m = 101$: $n_i = \{95, 62, 33, 32, 25\}$.
4. Calculate the integral using the rejection sampling integration. Use the previous sequence of random integers to obtain your random $x_i$. Use the following sequence of integers to generate your $u_i$; these are also drawn from an LCG with a modulus of 101: $m_i = \{77, 37, 60, 19, 35\}$.
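
The four estimates can be compared with the sketch below. Two conventions are assumed here and may differ from the lectures: the LCG integers are mapped to uniform numbers via $n_i / m$, and the rejection method uses a bounding box of height $f(3) = 9$ over the interval $[1, 3]$.

```python
import numpy as np

def f(x):
    return x**2

lower, upper = 1.0, 3.0
exact = (upper**3 - lower**3) / 3.0

# Trapezoid rule with 5 intervals (6 equally spaced nodes)
nodes = np.linspace(lower, upper, 6)
h = nodes[1] - nodes[0]
values = f(nodes)
trapezoid = h * (0.5 * values[0] + values[1:-1].sum() + 0.5 * values[-1])

m = 101
n_i = np.array([95, 62, 33, 32, 25])
m_i = np.array([77, 37, 60, 19, 35])

# Crude Monte Carlo: interval width times the mean of f at random x_i
x_i = lower + (upper - lower) * n_i / m
crude_mc = (upper - lower) * f(x_i).mean()

# Rejection (hit-or-miss) integration: box area times the accepted fraction
f_max = f(upper)
y_i = f_max * m_i / m
accepted = y_i < f(x_i)
rejection_mc = (upper - lower) * f_max * accepted.mean()

print(exact, trapezoid, crude_mc, rejection_mc)
```

With only five random numbers, the two Monte Carlo estimates are expected to scatter noticeably around the exact value.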

## Problem 4

Consider the probability distribution $p(x) = A x^2$ defined in the range $x = [1, 3]$. Use the inverse transform sampling method to draw random numbers from this distribution. To do so, use the following sequence of random integers, which were generated using a linear congruential generator having a modulus $m = 101$: $n_i = \{95, 62, 33, 32, 25\}$.
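
A minimal sketch of the procedure follows. Normalising $p(x)$ over $[1, 3]$ gives $A = 3/26$, so the cumulative distribution is $F(x) = (x^3 - 1)/26$ and its inverse is $x = (1 + 26u)^{1/3}$. The mapping $u_i = n_i / m$ from LCG integers to uniform numbers is an assumption.

```python
import numpy as np

m = 101
n_i = np.array([95, 62, 33, 32, 25])

# Assumption: uniform numbers in [0, 1) obtained as n_i / m
u_i = n_i / m

# Inverse of the CDF F(x) = (x^3 - 1) / 26 on the range [1, 3]
samples = np.cbrt(1.0 + 26.0 * u_i)

print(samples)
```

Applying $F$ to the samples should return the original $u_i$, which is a quick way to confirm that the inversion is correct.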

<div class="well" align="center">
    <div class="container-fluid">
        <div class="row">
            <div class="col-md-3" align="center">
                <img align="center" alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" width="60%">
            </div>
            <div class="col-md-8">
            This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
            </div>
        </div>
    </div>
    <br>
    <br>
    <i>Note: The content of this Jupyter Notebook is provided for educational purposes only.</i>
</div>