In [1]:
## Import required Python modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy, scipy.stats
import io
import base64
#from IPython.core.display import display
from IPython.display import display, HTML, Image
from urllib.request import urlopen

try:
    import astropy as apy
    import astropy.table
    _apy = True
    #print('Loaded astropy')
except ImportError:
    _apy = False
    #print('Could not load astropy')

## Customising the font size of figures
plt.rcParams.update({'font.size': 14})

## Customising the look of the notebook
## This custom file is adapted from https://github.com/lmarti/jupyter_custom/blob/master/custom.include
HTML('custom.css')
#HTML(urlopen('https://raw.githubusercontent.com/bretonr/intro_data_science/master/custom.css').read().decode('utf-8'))

In [2]:
## Adding a button to hide the Python source code
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the Python code."></form>''')

<div class="container-fluid">
    <div class="row">
        <div class="col-md-8" align="center">
            <h1>PHYS 10791: Introduction to Data Science</h1>
            <!--<h3>2019-2020 Academic Year</h3><br>-->
        </div>
        <div class="col-md-3">
            <img align='center' style="border-width:0" src="images/UoM_logo.png"/>
        </div>
    </div>
</div>

<div class="container-fluid">
    <div class="row">
        <div class="col-md-2" align="right">
            <b>Course instructors:&nbsp;&nbsp;</b>
        </div>
        <div class="col-md-9" align="left">
            <a href="http://www.renebreton.org">Prof. Rene Breton</a> - Twitter <a href="https://twitter.com/BretonRene">@BretonRene</a><br>
            <a href="http://www.hep.manchester.ac.uk/u/gersabec">Dr. Marco Gersabeck</a> - Twitter <a href="https://twitter.com/MarcoGersabeck">@MarcoGersabeck</a>
        </div>
    </div>
</div>

# Chapter 1 - Summary

## 1.2 Basics of presentation of data

### 1.2.1 Data presentation

There are two types of data:
- Qualitative / non-numeric
- Quantitative / numeric

Quantitative data can be divided into two subtypes:
- Discrete data
- Continuous data

### 1.2.2 Measures of central tendency

A measure of central tendency describes a dataset with a single number. There are several ways to define one:

- Arithmetic mean (also just called mean or average):
\begin{equation}
    \langle x \rangle = \frac{1}{N} \sum_{i=1}^{N} x_i
\end{equation}

- Geometric mean:
\begin{equation}
    {\rm GM} = \left( \prod_{i=1}^{N} x_i \right)^{\frac{1}{N}} = \exp \left[ \frac{1}{N} \sum_{i=1}^{N} \ln x_i \right]
\end{equation}

- Harmonic mean:
\begin{equation}
    H = \frac{N}{\sum_{i=1}^{N} \frac{1}{x_i}}
\end{equation}

- Root mean square:
\begin{equation}
    {\rm RMS} = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N}}
\end{equation}

- Median: Middle point with equal probability on both sides. _(Beware of odd/even number of elements for discrete data: with an even number, the median is conventionally the mean of the two middle values.)_
- Mode: Most likely value
    - Discrete distributions: most common element
    - Continuous distributions: maximum of the probability density function
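
The measures listed above can be sketched in a few lines of NumPy. This is a minimal illustration only: the sample `x` is made up for the example, and the geometric and harmonic means assume strictly positive values.

```python
import numpy as np
from collections import Counter

# Hypothetical sample of strictly positive values (illustrative only)
x = np.array([1.0, 2.0, 2.0, 4.0, 8.0])

arithmetic_mean = x.sum() / x.size
geometric_mean = np.exp(np.mean(np.log(x)))   # log form of the product; needs x > 0
harmonic_mean = x.size / np.sum(1.0 / x)      # needs x != 0
rms = np.sqrt(np.mean(x**2))
median = np.median(x)                         # averages the two middle values if N is even
mode = Counter(x.tolist()).most_common(1)[0][0]  # most common element (discrete data)

print(arithmetic_mean, geometric_mean, harmonic_mean, rms, median, mode)
```

For positive data the values are ordered as harmonic mean ≤ geometric mean ≤ arithmetic mean ≤ RMS, which is a handy sanity check.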

##### Recall

- For binned/weighted data, we must account for the number of points $n_j$ in each bin (i.e. its 'weight'):
\begin{equation}
  \langle x \rangle = \frac{ \sum_{j=1}^{J} n_j x_j }{ \sum_{j=1}^{J} n_j }
\end{equation}

- For discrete data this is a sum; for continuous data the sums become integrals:
\begin{equation}
  \langle x \rangle = \frac{ \int_{x_{\rm min}}^{x_{\rm max}} n(x) x {\rm d}x }{ \int_{x_{\rm min}}^{x_{\rm max}} n(x) {\rm d}x }
\end{equation}
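
As a hedged sketch of the two weighted-mean formulas above: the bin centres, counts and the density $n(x) = x$ on $[0, 1]$ are invented for illustration, and `scipy.integrate.quad` stands in for the integrals.

```python
import numpy as np
from scipy.integrate import quad

# Hypothetical binned data (illustrative only): bin centres x_j and counts n_j
bin_centres = np.array([0.5, 1.5, 2.5, 3.5])
counts = np.array([2, 5, 2, 1])

# Discrete case: weighted sum divided by the total weight
mean_binned = np.sum(counts * bin_centres) / np.sum(counts)
# np.average with weights computes the same quantity
mean_binned_np = np.average(bin_centres, weights=counts)

# Continuous case: hypothetical density n(x) = x on [0, 1]
def n(x):
    return x

numerator, _ = quad(lambda x: n(x) * x, 0.0, 1.0)   # integral of n(x) x dx
denominator, _ = quad(n, 0.0, 1.0)                  # integral of n(x) dx
mean_continuous = numerator / denominator

print(mean_binned, mean_binned_np, mean_continuous)
```

Dividing by the total weight means `counts` does not need to be normalised beforehand.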


### 1.2.3 Measures of dispersion

A single number can also describe the dispersion (or 'spread') of a dataset. There are several ways to define it:

- Variance (and standard deviation $\sigma$), where $\mu$ is the population mean:
\begin{equation}
    V(x) = \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2
\end{equation}

- Mean absolute deviation:
\begin{equation}
    {\rm MAD} = \frac{1}{N} \sum_{i=1}^N \left| x_i - \langle x \rangle \right|
\end{equation}
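
A minimal sketch of both dispersion measures, assuming the made-up array `x` is the whole population, so that its mean plays the role of $\mu$:

```python
import numpy as np

# Hypothetical dataset, treated here as the full population (illustrative only)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
variance = np.mean((x - mu) ** 2)      # V(x) = sigma^2
sigma = np.sqrt(variance)              # standard deviation
mad = np.mean(np.abs(x - mu))          # mean absolute deviation

# np.var and np.std use the same 1/N normalisation by default (ddof=0)
print(variance, np.var(x), sigma, np.std(x), mad)
```

The variance squares the deviations, so it weights outliers more heavily than the mean absolute deviation does.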

##### Recall

You need to understand the difference between:

- Population (computed about the true mean $\mu$) vs sample (computed about the sample mean $\langle x \rangle$) variance/standard deviation
    - Uncorrected/biased sample variance:
    \begin{equation}
        s^2_{\rm uncorr} = \frac{1}{N} \sum_{i=1}^N (x_i - \langle x \rangle)^2
    \end{equation}
    - Corrected/unbiased sample variance:
    \begin{equation}
        s^2_{\rm corr} = \frac{1}{N-1} \sum_{i=1}^N (x_i - \langle x \rangle)^2
    \end{equation}
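
In NumPy the two normalisations are selected with the `ddof` argument of `np.var`. The sketch below also runs a small simulation to illustrate the bias. The sample values, the seed, and the sample size of 5 are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical sample (illustrative only)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

s2_uncorr = np.var(x, ddof=0)   # divides by N
s2_corr = np.var(x, ddof=1)     # divides by N - 1

# Simulation: many samples of size 5 drawn from a unit-variance Gaussian
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=(20_000, 5))
mean_s2_uncorr = np.var(samples, axis=1, ddof=0).mean()  # expected near (N-1)/N = 0.8
mean_s2_corr = np.var(samples, axis=1, ddof=1).mean()    # expected near 1.0

print(s2_uncorr, s2_corr, mean_s2_uncorr, mean_s2_corr)
```

On average the uncorrected estimator underestimates the true variance by a factor $(N-1)/N$, which matters most for small samples.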


### 1.2.4 Other indicators

Other single-value indicators are also useful. They are often related to higher-order statistical 'moments', such as:

- Skewness (asymmetry):
  - Negative: tail to the left; positive: tail to the right
\begin{equation}
    \gamma = \frac{1}{\sigma^3} \langle \left(x_i - \langle x \rangle \right)^3 \rangle = \frac{1}{N \sigma^3} \sum_{i=1}^N (x_i - \langle x \rangle)^3
\end{equation}

- Kurtosis (tailedness):
  - Negative: 'boxy'; positive: 'peaky' (the $-3$ makes this the excess kurtosis, which is zero for a Gaussian distribution)
\begin{equation}
    \kappa = \frac{1}{\sigma^4} \langle \left(x_i - \langle x \rangle \right)^4 \rangle - 3 = \frac{1}{N \sigma^4} \sum_{i=1}^N (x_i - \langle x \rangle)^4 - 3
\end{equation}
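
The two formulas above can be checked against `scipy.stats.skew` and `scipy.stats.kurtosis`, whose defaults use the same $1/N$ moments and report the excess kurtosis. This is a sketch only: the exponential test sample and the seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample (illustrative only)
rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=10_000)

mean = x.mean()
sigma = x.std()                                    # 1/N normalisation (ddof=0)

gamma = np.mean((x - mean) ** 3) / sigma**3        # skewness
kappa = np.mean((x - mean) ** 4) / sigma**4 - 3    # excess kurtosis

# SciPy equivalents with default settings (bias=True, fisher=True)
print(gamma, stats.skew(x))
print(kappa, stats.kurtosis(x))
```

An exponential distribution has a long tail to the right, so both indicators should come out positive here.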


### 1.2.5 Multiple variables

Measures that involve multiple quantities help highlight the relationships between them.

- Covariance:
  - Subject to the same bias as the variance/standard deviation
\begin{equation}
    \operatorname{cov}(x,y) = \frac{1}{N} \sum_{i=1}^N (x_i - \langle x \rangle)(y_i - \langle y \rangle)
\end{equation}

- Correlation:
  - Normalised number: $-1 \leq \rho \leq 1$
\begin{equation}
    \rho = \frac{\operatorname{cov}(x,y)}{\sigma_x \sigma_y}
\end{equation}

The covariance matrix is a useful way to represent the covariance for a dataset having multiple dimensions:

\begin{equation}
  \Sigma = 
  \begin{bmatrix}
    \operatorname{cov}(X_1,X_1) & \operatorname{cov}(X_1,X_2) & \operatorname{cov}(X_1,X_3) & \dots  & \operatorname{cov}(X_1,X_n) \\
    \operatorname{cov}(X_2,X_1) & \operatorname{cov}(X_2,X_2) & \operatorname{cov}(X_2,X_3) & \dots  & \operatorname{cov}(X_2,X_n) \\
    \vdots       & \vdots & \vdots & \ddots & \vdots \\
    \operatorname{cov}(X_n,X_1) & \operatorname{cov}(X_n,X_2) & \operatorname{cov}(X_n,X_3) & \dots  & \operatorname{cov}(X_n,X_n)
  \end{bmatrix}
\end{equation}
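
A short sketch of the covariance, the correlation, and the covariance matrix for two variables. The linear relation between `x` and `y`, the noise level, and the seed are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical correlated data (illustrative only): y depends linearly on x plus noise
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

# Covariance and correlation with the 1/N normalisation used in the formulas
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

# Covariance matrix: each row of the stacked array is one variable
Sigma = np.cov(np.vstack([x, y]), ddof=0)

print(cov_xy, rho)
print(Sigma)
```

The diagonal of `Sigma` holds the variances and the matrix is symmetric. Note that `np.cov` divides by $N-1$ unless `ddof=0` (or `bias=True`) is passed.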

##### Recall
Correlation does not imply causation!

<div class="well" align="center">
    <div class="container-fluid">
        <div class="row">
            <div class="col-md-3" align="center">
                <img align="center" alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" width="60%">
            </div>
            <div class="col-md-8">
            This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
            </div>
        </div>
    </div>
    <br>
    <br>
    <i>Note: The content of this Jupyter Notebook is provided for educational purposes only.</i>
</div>