In [1]:
## Import required Python modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy, scipy.stats
import io
import base64
#from IPython.core.display import display
from IPython.display import display, HTML, Image
from urllib.request import urlopen

try:
    import astropy as apy
    import astropy.table
    _apy = True
    #print('Loaded astropy')
except ImportError:
    ## astropy is optional; fall back gracefully when it is not installed
    _apy = False
    #print('Could not load astropy')

## Customising the font size of figures
plt.rcParams.update({'font.size': 14})

## Customising the look of the notebook
## This custom file is adapted from https://github.com/lmarti/jupyter_custom/blob/master/custom.include
HTML(filename='custom.css')
#HTML(urlopen('https://raw.githubusercontent.com/bretonr/intro_data_science/master/custom.css').read().decode('utf-8'))

In [2]:
## Adding a button to hide the Python source code
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the Python code."></form>''')

<div class="container-fluid">
    <div class="row">
        <div class="col-md-8" align="center">
            <h1>PHYS 10791: Introduction to Data Science</h1>
            <!--<h3>2019-2020 Academic Year</h3><br>-->
        </div>
        <div class="col-md-3">
            <img align='center' style="border-width:0" src="images/UoM_logo.png"/>
        </div>
    </div>
</div>

<div class="container-fluid">
    <div class="row">
        <div class="col-md-2" align="right">
            <b>Course instructors:&nbsp;&nbsp;</b>
        </div>
        <div class="col-md-9" align="left">
            <a href="http://www.renebreton.org">Prof. Rene Breton</a> - Twitter <a href="https://twitter.com/BretonRene">@BretonRene</a><br>
            <a href="http://www.hep.manchester.ac.uk/u/gersabec">Dr. Marco Gersabeck</a> - Twitter <a href="https://twitter.com/MarcoGersabeck">@MarcoGersabeck</a>
        </div>
    </div>
</div>

*Note: You are not expected to understand all the computer coding presented with the solutions. You should understand the mathematical concepts and be able to recover the results. We present the computer code so you can learn coding tricks (e.g. read data, compute useful values, fit and plot data) should you be interested.*

# Chapter 1 - Problem Sheet

### Residential neighbourhood traffic analysis

The problems from this sheet are based on the following dataset.

A local council installed pressure strips to measure the number and speed of cars on a street in a residential neighbourhood, following concerns that the street was being used as a cut-through from the main road and that drivers were speeding excessively. The equipment was in place for a full week.

<!--Below is a summary of average number of cars in two-hour intervals:

| Time | 00-02 | 02-04 | 04-06 | 06-08 | 08-10 | 10-12 | 12-14 | 14-16 | 16-18 | 18-20 | 20-22 | 22-24 |
| ---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Flow | 28.9  | 12.5  | 11.1  | 67.6  | 262.7 | 116.5 | 152.0 | 268.1 | 269.8 | 186.9 | 125.7 | 73.5  |-->

Below is the hourly car flow at the morning rush hour (8-9am) in various speed bins:

| Speed | 00-05 | 05-10 | 10-15 | 15-20 | 20-25 | 25-30 | 30-35 | 35-40 | 40-45 | 45-50 |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Flow  | 0.88  | 20.4  | 14.8  | 42.5  | 60.0  | 38.0  | 8.3   | 1.1   | 0.38  | 0.12  |

Below is the hourly car flow at midday (noon-1pm) in various speed bins:

| Speed | 00-05 | 05-10 | 10-15 | 15-20 | 20-25 | 25-30 | 30-35 | 35-40 | 40-45 | 45-50 |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Flow  | 0.12  | 2.9   | 7.6   | 16.9  | 24.8  | 17.1  | 5.1   | 1.4   | 0.3   | 0.12  |

## Problem 1

### Measures of central tendency and dispersion

#### Tasks

1. Calculate the mean car flow at the morning rush hour.
2. Calculate the geometric mean of the car flow at the morning rush hour.
3. Calculate the harmonic mean of the car flow at the morning rush hour.
4. Calculate the RMS of the car flow at the morning rush hour.
5. Calculate the median of the car flow at the morning rush hour.
6. Calculate the uncorrected standard deviation of the car flow at the morning rush hour.
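The quantities listed above can be computed directly from their definitions. The following is a minimal sketch, not the official solution. It assumes the ten flow values from the morning rush hour table are treated as an unweighted sample, and the variable names are chosen here for illustration.

```python
import numpy as np

# Morning rush hour flows copied from the table above (one value per speed bin).
# Assumption: the ten values are treated as an unweighted sample.
flow_am = np.array([0.88, 20.4, 14.8, 42.5, 60.0, 38.0, 8.3, 1.1, 0.38, 0.12])

# Arithmetic mean: sum of the values divided by their number
mean = flow_am.mean()

# Geometric mean: exponential of the mean of the logarithms
geometric_mean = np.exp(np.log(flow_am).mean())

# Harmonic mean: number of values divided by the sum of reciprocals
harmonic_mean = flow_am.size / np.sum(1.0 / flow_am)

# Root mean square: square root of the mean of the squares
rms = np.sqrt(np.mean(flow_am**2))

# Median: middle value of the sorted sample (average of the two middle values here)
median = np.median(flow_am)

# Uncorrected standard deviation: divide by N rather than N - 1
std_uncorrected = flow_am.std(ddof=0)

for name, value in [("mean", mean), ("geometric mean", geometric_mean),
                    ("harmonic mean", harmonic_mean), ("RMS", rms),
                    ("median", median), ("std (uncorrected)", std_uncorrected)]:
    print(f"{name:>18s}: {value:.3f}")
```

A useful sanity check is that the harmonic, geometric and arithmetic means and the RMS should come out in increasing order for positive data, and that the RMS squared equals the mean squared plus the uncorrected variance.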

## Problem 2

### Binned data

#### Tasks

1. Calculate the average speed at the morning rush hour.
2. Calculate the mode of the speed at the morning rush hour.
3. Calculate the median speed at the morning rush hour.
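For binned data, one common approach is sketched below. It rests on assumptions that are not stated in the sheet: each bin is represented by its midpoint when computing the average, the mode is reported as the most populated bin, and the median is found by linear interpolation within the bin where the cumulative flow crosses half of the total. Speeds are in the units of the table.

```python
import numpy as np

# Bin edges and morning rush hour flows copied from the table above
edges = np.arange(0, 55, 5)                 # 0, 5, ..., 50
flow_am = np.array([0.88, 20.4, 14.8, 42.5, 60.0, 38.0, 8.3, 1.1, 0.38, 0.12])

# Assumption: every car in a bin travels at the bin midpoint
midpoints = 0.5 * (edges[:-1] + edges[1:])
mean_speed = np.sum(midpoints * flow_am) / np.sum(flow_am)

# Mode: the bin with the largest flow
k_mode = int(np.argmax(flow_am))
mode_bin = (int(edges[k_mode]), int(edges[k_mode + 1]))

# Median: locate the bin where the cumulative flow reaches half the total,
# then interpolate linearly inside that bin
cumulative = np.cumsum(flow_am)
half = 0.5 * cumulative[-1]
k_med = int(np.searchsorted(cumulative, half))
below = cumulative[k_med - 1] if k_med > 0 else 0.0
width = edges[k_med + 1] - edges[k_med]
median_speed = edges[k_med] + width * (half - below) / flow_am[k_med]

print(f"mean speed   : {mean_speed:.2f}")
print(f"modal bin    : {mode_bin[0]}-{mode_bin[1]}")
print(f"median speed : {median_speed:.2f}")
```

The midpoint and interpolation choices both assume cars are spread evenly within each bin, so the results are estimates limited by the bin width.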

## Problem 3

### Multiple variables

#### Tasks

1. Calculate the covariance matrix of the traffic flow for the morning rush hour and midday datasets.
2. What is the correlation coefficient between the morning rush hour and midday traffic flows?
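A possible way to obtain both quantities with NumPy is sketched here. The choice of `bias=True` (dividing by N, matching the uncorrected standard deviation of Problem 1) is an assumption of this sketch. The correlation coefficient does not depend on that choice.

```python
import numpy as np

# Flows per speed bin copied from the two tables above
flow_am = np.array([0.88, 20.4, 14.8, 42.5, 60.0, 38.0, 8.3, 1.1, 0.38, 0.12])
flow_noon = np.array([0.12, 2.9, 7.6, 16.9, 24.8, 17.1, 5.1, 1.4, 0.3, 0.12])

# 2x2 covariance matrix; bias=True normalises by N (uncorrected), an assumption here
cov = np.cov(flow_am, flow_noon, bias=True)

# Correlation coefficient: covariance divided by the product of standard deviations
r = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# Cross-check against NumPy's built-in correlation matrix
r_check = np.corrcoef(flow_am, flow_noon)[0, 1]

print("covariance matrix:")
print(cov)
print(f"correlation coefficient: {r:.4f}")
```

The diagonal of the covariance matrix holds the variances of the two datasets, and the off-diagonal element is their covariance.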

<div class="opt_start">
    ⬇︎ Optional Questions ⬇︎
</div>

## Problem 4

### Case study: The importance of data visualisation

Consider the data below formed of four datasets, each comprising 11 (x,y) pairs.

In [9]:
## Reading the Anscombe Quartet data and putting them in an array

from io import StringIO
c = StringIO("""
Ix  ,     Iy,   IIx,   IIy,  IIIx,   IIIy,   IVx,   IVy
10.0,  8.04 ,  10.0,  9.14,  10.0,  7.46 ,  8.0 ,  6.58 
8.0 ,  6.95 ,  8.0 ,  8.14,  8.0 ,  6.77 ,  8.0 ,  5.76 
13.0,  7.58 ,  13.0,  8.74,  13.0,  12.74,  8.0 ,  7.71 
9.0 ,  8.81 ,  9.0 ,  8.77,  9.0 ,  7.11 ,  8.0 ,  8.84 
11.0,  8.33 ,  11.0,  9.26,  11.0,  7.81 ,  8.0 ,  8.47 
14.0,  9.96 ,  14.0,  8.10,  14.0,  8.84 ,  8.0 ,  7.04 
6.0 ,  7.24 ,  6.0 ,  6.13,  6.0 ,  6.08 ,  8.0 ,  5.25 
4.0 ,  4.26 ,  4.0 ,  3.10,  4.0 ,  5.39 ,  19.0,  12.50
12.0,  10.84,  12.0,  9.13,  12.0,  8.15 ,  8.0 ,  5.56 
7.0 ,  4.82 ,  7.0 ,  7.26,  7.0 ,  6.42 ,  8.0 ,  7.91 
5.0 ,  5.68 ,  5.0 ,  4.74,  5.0 ,  5.73 ,  8.0 ,  6.89
""")
data = np.genfromtxt(c, delimiter=',', names=True)

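## Pretty-print with astropy when available, otherwise fall back to a plain print
if _apy:
    apy.table.Table(data).pprint()
else:
    print(data)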

 Ix    Iy  IIx  IIy  IIIx  IIIy IVx  IVy 
---- ----- ---- ---- ---- ----- ---- ----
10.0  8.04 10.0 9.14 10.0  7.46  8.0 6.58
 8.0  6.95  8.0 8.14  8.0  6.77  8.0 5.76
13.0  7.58 13.0 8.74 13.0 12.74  8.0 7.71
 9.0  8.81  9.0 8.77  9.0  7.11  8.0 8.84
11.0  8.33 11.0 9.26 11.0  7.81  8.0 8.47
14.0  9.96 14.0  8.1 14.0  8.84  8.0 7.04
 6.0  7.24  6.0 6.13  6.0  6.08  8.0 5.25
 4.0  4.26  4.0  3.1  4.0  5.39 19.0 12.5
12.0 10.84 12.0 9.13 12.0  8.15  8.0 5.56
 7.0  4.82  7.0 7.26  7.0  6.42  8.0 7.91
 5.0  5.68  5.0 4.74  5.0  5.73  8.0 6.89


#### Tasks

1. Calculate the arithmetic mean of each column.
2. Calculate the uncorrected sample standard deviation of each column.
3. Calculate the correlation coefficient between x and y for each dataset.

*Bonus beyond the course material*

4. Make a plot of each dataset. Bonus: try to fit a straight line (e.g. $y = mx + b$) through the data.
5. What is special about these datasets?
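The numerical tasks above could be approached along the following lines. This is a minimal sketch with the data re-entered as plain arrays so that it runs on its own. The dictionary `quartet` and the use of `np.polyfit` for the straight-line fit are choices made for illustration, and plotting is left out.

```python
import numpy as np

# Anscombe Quartet values copied from the table above.
# Datasets I to III share the same x values.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for label, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Least-squares straight line y = m x + b (degree-1 polynomial fit)
    m, b = np.polyfit(x, y, 1)
    summary[label] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "std_x": x.std(ddof=0),      # uncorrected standard deviation
        "std_y": y.std(ddof=0),
        "r": np.corrcoef(x, y)[0, 1],
        "slope": m,
        "intercept": b,
    }

for label, s in summary.items():
    print(f"{label:>3s}: mean=({s['mean_x']:.2f}, {s['mean_y']:.2f})  "
          f"std=({s['std_x']:.2f}, {s['std_y']:.2f})  r={s['r']:.3f}  "
          f"fit: y = {s['slope']:.3f} x + {s['intercept']:.3f}")
```

Comparing the printed rows with a scatter plot of each dataset (for example with `plt.plot(x, y, 'o')`) is a natural next step for the last two tasks.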

<div class="opt_end">
    ⬆︎ Optional Questions ⬆︎
</div>

<div class="well" align="center">
    <div class="container-fluid">
        <div class="row">
            <div class="col-md-3" align="center">
                <img align="center" alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" width="60%">
            </div>
            <div class="col-md-8">
            This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
            </div>
        </div>
    </div>
    <br>
    <br>
    <i>Note: The content of this Jupyter Notebook is provided for educational purposes only.</i>
</div>