# Data Science Fundamentals
Session 2020/2021


---


# Lab 1
## Introduction to Jupyter, Python and NumPy
<div class="alert alert-block alert-success">


# Unassessed

</div>



This lab is intended to introduce you to the basic use of the Jupyter+Python+NumPy environment, and to show how the autograding in exercises works. When you complete this exercise, you will submit it on Moodle. You will get a mark, but this **will not count towards your final grade**.

It is strongly recommended that you complete this exercise fully. This will take around two hours if you already know some NumPy.


## Purpose of this lab
This lab should help you:

* create simple arrays
* index and slice arrays
* stack arrays together
* compute simple statistics of arrays
* understand array arithmetic and broadcasting rules



# 1: Jupyter
If you have not used Jupyter before, the [Jupyter Quickstart](guides/JupyterGuide.ipynb) is a useful introduction.

## 2: Autograder tests

Lab exercises will (mainly) be autograded via automatic tests.

The following parts have some questions to answer, and some tests (which you cannot alter) which will be run against the code you have written. If the tests pass, you will see how many marks you got with a green tick. If they do not pass, you will see a red cross. Remember, this exercise doesn't count for anything, but do try to complete the exercises.

In [None]:
# Make sure you run this cell!

from utils.tick import reset_marks, summarise_marks, marks
from utils.checkarr import array_hash, check_hash
import numpy as np  # NumPy
from utils.matrices import print_matrix, show_boxed_tensor_latex

# Set up Matplotlib
import matplotlib as mpl   
import matplotlib.pyplot as plt
%matplotlib inline

import utils.image_audio as ia

reset_marks()
print("Everything imported OK")


Here's a free 4 marks:

In [None]:
with marks(4):
    print("Hello world")

And here's what happens when you have an error. Try setting `a` to 1, and make sure you can get the test to pass.

In [None]:
a = 2
# YOUR CODE HERE

In [None]:
with marks(3):
    assert(a==1)

----------------


# 3. Introduction to NumPy

We will be using [NumPy](https://numpy.org) as the basis for our numerical operations. It provides a datatype called `ndarray`, which can be used to store and manipulate arrays of numbers.

<div class="alert alert-block alert-warning">
    
## NumPy worked example

If you have not used NumPy before, or if you are rusty, **[this worked example could be useful](guides/numpy_example.ipynb)**. However, it is best to make a start and only come back to it if the cheatsheets below aren't enough of an introduction.

</div>

---

## References and cheat sheets

If you are stuck, the following resources are very helpful:

### Cheatsheets
* [NumPy cheatsheet](https://github.com/juliangaal/python-cheat-sheet/blob/master/NumPy/NumPy.md)
* [Python for Data Science cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf)
* [Another NumPy Cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)

### API reference and user guide
* [NumPy API reference](https://docs.scipy.org/doc/numpy-1.13.0/reference/)
* [NumPy user guide](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.html)



-----
# NumPy

The foundation package for numerical operations is **NumPy** which provides an array type and accelerated operations on it. 

A very important part of using numerical libraries like NumPy is **vectorising operations**: avoiding explicit loops over values in the arrays and instead using library functions to do manipulations. It is *massively* faster to have NumPy add two arrays together than to iterate over the elements adding them together in Python.

In [None]:
a = np.zeros((5, 5))
b = np.ones((5, 5))
y = np.zeros((5, 5))

# YES: do things like this
# a and b are NumPy arrays
x = a + b

# NO: don't do this
# this is very inefficient
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        y[i, j] = a[i, j] + b[i, j]

## No for loops (unless specified)
In this lab **do not** use explicit loops, like `while` or `for`, unless the question explicitly asks you to. **In future labs you may need to use the occasional `for` loop**, but try and avoid them where possible.
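
As a rough illustration of replacing a loop with a vectorised expression, the sketch below computes a sum of squares both ways. The array `values` and the variable names are made up for this example and are not part of the lab.

```python
import numpy as np

values = np.arange(1.0, 6.0)  # hypothetical data: [1. 2. 3. 4. 5.]

# Loop version (the style this lab asks you to avoid)
loop_total = 0.0
for v in values:
    loop_total += v ** 2

# Vectorised version: square every element at once, then sum
vector_total = np.sum(values ** 2)

print(loop_total, vector_total)  # both are 55.0
```

Both versions give the same answer; the vectorised one hands the element-wise work to NumPy instead of the Python interpreter.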

NumPy provides the **np.ndarray** class, which is an n-dimensional array of numbers **of the same type**. Arrays can be created in several ways: from a Python list (and nD arrays from nested lists) using `np.array`, as a "blank" matrix of zeros or ones or random data, by copying existing arrays, by loading from disk, or from certain special functions.

**You can always make a copy of an array using np.array() on an existing array** (e.g. `x = np.array(y)` makes a new **copy** of y). `np.array()` will also convert any iterable object (lists, tuples) into an array if it can. Note that a few operations will *change arrays in place*, and most will *return new copies*.
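
A small sketch of the difference between a second name for the same array, a copy made with `np.array()`, an operation that returns a new array, and one that works in place. All variable names here are illustrative only.

```python
import numpy as np

y = np.array([1, 2, 3])
alias = y            # same array, just a second name
copy = np.array(y)   # new, independent copy

copy[0] = 99         # changes only the copy
alias[1] = -5        # changes y as well

doubled = y * 2      # returns a new array; y is unchanged
y += 1               # in place: modifies y (and therefore alias)

print(y)        # [ 2 -4  4]
print(alias)    # [ 2 -4  4]
print(copy)     # [99  2  3]
print(doubled)  # [  2 -10   6]
```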

In [None]:
x = np.array([1.0, 2.0, 3.0, 4.0])  # create an array from a list
print(x)  # print the array

# print the matrix, but nicer (note: the first string is a LaTeX expression)
print_matrix("x", x)

print(x.dtype)  # datatype
print(x.shape)  # shape of array

# 1: Create some arrays
Create the following arrays, with the following specifications:
(**don't** use `np.array` to do this). If you don't know how to do this,
look at the worked example, or look at the cheatsheet or API reference.

Use `np.zeros`, `np.ones`, `np.full`, `np.arange` and `np.random.normal` to solve these questions.

* `x`: an 8 x 8 matrix of all zeros
* `y`: a 2 element vector, with all elements equal to np.pi
* `z`: a 1 x 2 x 5 element array of all ones.
* `q`: a 1D array with 10 elements, from 0-18 (inclusive), stepping by 2. 
* `r`: a 400 element array, random numbers normally distributed with mean 0, std. dev. 1.0 **(don't print this one out)**

Print out your arrays using `show_boxed_tensor_latex` (imported in the setup cell above) to see if they look right.

Check that the tests pass. 
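
If the creation functions are unfamiliar, the sketch below shows the general calling pattern for each one. The shapes and values are deliberately different from the exercise, and the variable names are only for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))        # 2 x 3 array of 0.0
ones = np.ones((3,))            # 3 element vector of 1.0
filled = np.full((2, 2), 7.5)   # 2 x 2 array, every element 7.5
steps = np.arange(1, 10, 3)     # 1, 4, 7 (the stop value 10 is excluded)
noise = np.random.normal(5.0, 2.0, size=(4,))  # mean 5, std. dev. 2

print(zeros.shape, ones.shape, filled.shape, steps, noise.shape)
```

Note that `np.arange` never includes its stop value, which matters when a range is described as "inclusive".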

In [None]:
# YOUR CODE HERE

In [None]:
with marks(1):
    assert(check_hash(x,((8, 8), 0.0)))

In [None]:
with marks(1):
    assert(check_hash(y, ((2,),  21.991148575128552)))

In [None]:
with marks(1):
    assert(check_hash(z, ((1, 2, 5), 59.0)))

In [None]:
with marks(1):
    assert(check_hash(q,((10,), 701.744562646538)))

In [None]:
with marks(1):
    assert(r.shape==(400,)  and np.std(r)>0.75 and np.std(r)<1.5 and np.all(np.diff(r.ravel())!=0.0))

## 2: How to keep snails alive

<img src="imgs/snails.jpg" width="50%">*([Image](https://flickr.com/photos/chodhound/6083328289 "Snail") by [ChodHound](https://flickr.com/people/chodhound) license [CC BY-SA](https://creativecommons.org/licenses/by-sa/2.0/))*

Scientists at the Zoology Department, The University of Adelaide, have studied the best conditions to keep snails alive. They have recorded a dataset of observations of snail mortality under controlled conditions. This dataset is in the file `data/snails.txt`.

#### An excerpt from the data set description

    
>Groups of 20 snails were held for periods of 1, 2, 3 or 4 weeks in carefully
controlled conditions of temperature and relative humidity. There were two
species of snail, 0 and 1. At the end of the exposure time the snails
were tested to see if they had survived. 

>The data are unusual in that in most cases fatalities during the experiment
were fairly small. [lucky snails!]



### The task
The data is a 2D array, and has six columns, with these definitions:
            
           0                 1               2                 3               4          5
    species(binary) | exposure(weeks) | humidity(%) | temperature(deg. C) | n_deaths | n_snails
    
Each row represents one set of observations (i.e. one group of snails). You are to compute some basic properties of this data. Use NumPy operations to do the computations.

A. **Loading arrays** 
* Load this data as a NumPy array called `snails`. Note: use NumPy functions to do this! **Do not parse the file yourself.** The file is space delimited.
* Print it out. Use this format to print out the results:
    
      print("snails\n", snails)
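
One NumPy function that reads whitespace-delimited text is `np.loadtxt`. The sketch below uses an in-memory string in place of a file so that it runs on its own; the contents and the name `table` are invented for this example.

```python
import io
import numpy as np

# Stand-in for a space-delimited text file (hypothetical contents)
text = io.StringIO("1 2.5 3\n4 5.5 6\n")

# np.loadtxt splits on whitespace by default and returns a float array
table = np.loadtxt(text)

print("table\n", table)
print(table.shape, table.dtype)  # (2, 3) float64
```

`np.loadtxt` also accepts a filename in place of the file-like object.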

In [None]:
# YOUR CODE HERE

In [None]:
with marks(2):
    assert(check_hash(snails, ((96, 6), 3082003.4024073719)))

B. **Basic indexing** 
Compute the following results, storing the results in the variable specified and printing them out. Use the same printing format as A.


* `temp_first` the temperature in the first entry in the table. 
* `hum_diff` the *absolute* difference in humidity from the first to the last entry in the table.
* `weeks` the whole column of "weeks exposure".
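
The indexing patterns involved can be sketched on a small made-up array (`grid` and the other names are hypothetical): a single element by `[row, col]`, the last row via a negative index, and a whole column via a slice.

```python
import numpy as np

grid = np.array([[10.0, 11.0, 12.0],
                 [20.0, 21.0, 22.0],
                 [30.0, 31.0, 32.0]])

first = grid[0, 1]          # row 0, column 1 -> 11.0
last = grid[-1, 1]          # last row, column 1 -> 31.0
gap = np.abs(first - last)  # absolute difference -> 20.0
column = grid[:, 2]         # every row of column 2 -> [12. 22. 32.]

print("gap\n", gap)
print("column\n", column)
```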

In [None]:
# YOUR CODE HERE

In [None]:

with marks(1):
    assert(check_hash(hum_diff, ((), 78.9)))

In [None]:
with marks(1):
    assert(check_hash(temp_first, ((), 50.0)))

In [None]:
with marks(1):
    assert(check_hash(weeks, ((96,), 13091.118033988751)))
    

C. **Aggregate functions** 
Compute the following results, storing the results in the variable specified and printing them out:

* `total_deaths` total number of snails that died
* `total_still_alive` total number of snails that survived the whole study
* `mean_temp` mean temperature in the whole study
* `max_humidity` highest humidity in the study

Each computation should be a single line of code.
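
As a sketch of how aggregate functions combine with column indexing, here is a toy array with invented values. The names are illustrative and do not correspond to the snail data.

```python
import numpy as np

counts = np.array([[3.0, 10.0],
                   [1.0, 10.0],
                   [4.0, 10.0]])

total_first = np.sum(counts[:, 0])   # sum of column 0 -> 8.0
largest = np.max(counts[:, 0])       # maximum of column 0 -> 4.0
mean_second = np.mean(counts[:, 1])  # mean of column 1 -> 10.0
col_sums = np.sum(counts, axis=0)    # one sum per column -> [ 8. 30.]

print(total_first, largest, mean_second, col_sums)
```

The `axis` argument selects which dimension is collapsed; leaving it out aggregates over the whole array.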

In [None]:
# YOUR CODE HERE

In [None]:
with marks(2):
    assert(check_hash(total_deaths, ((), 1375.0)))
    

In [None]:
with marks(2):
    assert(check_hash(total_still_alive, ((), 8225.0)))

In [None]:
with marks(2):
    assert(check_hash(mean_temp, ((), 75.0)))

In [None]:

with marks(2):
    assert(check_hash(max_humidity, ((), 379.0)))

D. **Boolean indexing**
Compute the following results, storing the results in the variable specified and printing them out:

* `species_0` and `species_1`: split the dataset into two arrays, one with the entries for species 0 and one with the entries for species 1.
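
Boolean indexing selects the rows where a condition holds. A minimal sketch on made-up data follows, where column 0 acts as a binary label; `records`, `mask`, `group_a` and `group_b` are hypothetical names.

```python
import numpy as np

records = np.array([[0.0, 5.0],
                    [1.0, 7.0],
                    [0.0, 9.0],
                    [1.0, 2.0]])

mask = records[:, 0] == 1   # [False  True False  True]
group_b = records[mask]     # rows where the label is 1
group_a = records[~mask]    # rows where the label is not 1

print("group_a\n", group_a)
print("group_b\n", group_b)
```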

In [None]:
# YOUR CODE HERE

In [None]:
with marks(5):
    assert(check_hash(species_0, ((48, 6), 762041.58357993606)))
    assert(check_hash(species_1, ((48, 6), 791902.31693958596)))  

E. **Arithmetic and ordering**
Compute the following results, storing the results in the variable specified and printing them out:

* `deg_f` each temperature in the study, but in degrees Fahrenheit. Use the knowledge that `0C = 32F, 100C = 212F`
* `mean_cols` the mean of all the columns, as a 1D vector
* `death_rate` the death rates, in sorted order, smallest first
* `best_temp`, `best_hum` the best temperature and humidity to keep a snail for four weeks without it dying. *Look only at the four week exposures, ignoring snails kept for less than this time.* 
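
The operations needed here are element-wise arithmetic, a per-column mean, sorting, and locating the row with the smallest value. The sketch below shows each on invented data: a linear rescaling built from two known reference points (0 maps to 50, 10 maps to 100), then `np.mean` with `axis=0`, `np.sort` and `np.argmin`. All names and numbers are assumptions for illustration.

```python
import numpy as np

# Linear map from two reference points: 0 -> 50 and 10 -> 100
scores = np.array([0.0, 4.0, 10.0])
slope = (100.0 - 50.0) / (10.0 - 0.0)
rescaled = 50.0 + slope * scores      # [ 50.  70. 100.]

# Toy table: column 0 is a setting, column 1 is a rate to minimise
table = np.array([[3.0, 0.20],
                  [1.0, 0.05],
                  [2.0, 0.40]])

col_means = np.mean(table, axis=0)          # one mean per column
ordered = np.sort(table[:, 1])              # [0.05 0.2  0.4 ], smallest first
best_row = table[np.argmin(table[:, 1])]    # row with the lowest rate

print(rescaled, col_means, ordered, best_row)
```

`np.argmin` returns the index of the smallest value, which can then be used to look up other columns of the same row.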

In [None]:
# YOUR CODE HERE

In [None]:
with marks(1):
    assert(check_hash(deg_f, ((96,), 275523.34846922837)))

In [None]:
with marks(1):
    assert(check_hash(mean_cols, ((6,), 522.92336963727359)))

In [None]:
with marks(2):
    assert(check_hash(death_rate, ((96,), 1100.9803982902313)))

In [None]:
with marks(3):
    assert(check_hash(best_temp, ((), 50.0)))

In [None]:
with marks(3):
    assert(check_hash(best_hum, ((), 379.0)))      

## 3: Image operations
Images can be represented as numerical arrays. We will use images as an example to explore NumPy functionality.

* `img = ia.load_image_colour('filename.png')` will load an image as an array.
* `ia.show_image(img)` will show it in the notebook.
    

A)

We will:
* Load `data/parrots.png` as `img_array` 
* Print out its shape and dtype
* Show the image.

In [None]:
img_array = ia.load_image_colour("data/parrots.png")
print(img_array.shape, img_array.dtype)
ia.show_image(img_array)

B) **Slicing arrays**
* Create an array `cropped` which has the pixels from [150,100] to [350,300]. Note that these positions are in `[row, col]` format, not `[x,y]`.
* Remember: the image array is `HxWx3` (rows, columns, colour channels). Think about how to slice the last dimension.
* Display the cropped array using `ia.show_image()` so you can see it.
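
A sketch of slicing a rectangular region out of a 3D array, using a small synthetic array instead of an image. The name `toy` and the region are made up; the same pattern applies to any `[row, col, channel]` array.

```python
import numpy as np

# Synthetic 6 x 8 "image" with 3 channels
toy = np.arange(6 * 8 * 3).reshape(6, 8, 3)

# Rows 1..3 and columns 2..5 (the end index is excluded), all channels
patch = toy[1:4, 2:6, :]

print(patch.shape)     # (3, 4, 3)
print(patch[0, 0, 0])  # 30, the same element as toy[1, 2, 0]
```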

In [None]:

# YOUR CODE HERE

In [None]:
ia.show_image(cropped)
with marks(4):
    assert(check_hash(cropped, ((200, 200, 3), 3409234926.1084023)))

C)  **Modifying arrays**

Create an array `censored` which is the same as `img_array`, but has black bars across the following regions to protect the parrot's privacy:
* [200, 100] -> [260, 310]
* [140, 400] -> [200, 650]

Setting array elements to zero will make them black.

**Do not modify the original `img_array`**
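
One way to leave the source array untouched is to copy it first and then assign to a slice of the copy. The sketch below uses a small array of ones as a stand-in for an image; the names and the region are hypothetical.

```python
import numpy as np

original = np.ones((4, 5, 3))   # stand-in for an image array
edited = np.array(original)     # independent copy

# Assigning a scalar to a slice sets every element in that region
edited[1:3, 0:2, :] = 0.0

print(original.sum())  # 60.0, the original is unchanged
print(edited.sum())    # 48.0, a 2 x 2 x 3 block has been zeroed
```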

In [None]:
# YOUR CODE HERE

# 4. Financial misconduct

You have been asked to verify the computation of some financial predictive models. These models produce a sequence of updates to the value of a product. The product updates are mainly of two types:
* **large deposits**, representing inflows of new cash, often up into the billions of pounds
* **small returns** from high-frequency trading activity

The simulator produces **two** model outputs from two distinct models `a` and `b` at each time step, which provide very similar estimates of the value of these updates.

You are asked to write code that will produce:

* an estimate of the total value of a product over some series
* the total difference between two different product models, both of which are very similar.

You are given the existing code below, which is supposed to compute and return:

* the sum of the `a` updates (i.e. total value of `a`)
* the accumulated difference between the `a` and `b` products.

However, the result is very inaccurate when tested. Modify this code to be more accurate. Do NOT use NumPy, or *any* other external module to improve your calculation. Use floating point, regardless of the fact that floating point is not appropriate for financial data.

The errors should be less than 0.5 for the `a` sum and less than 1e-10 for the difference in predictions.
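
To see why naive accumulation can be inaccurate, the sketch below adds small values to a very large float in plain Python. The magnitudes are chosen for illustration and are not taken from the simulator. It also shows that the rounding error of a single addition can be recovered exactly, which is the observation behind compensated summation techniques.

```python
big = 1e16     # hypothetical "large deposit"
small = 1.0    # hypothetical "small return"

# Naive accumulation: every small update is rounded away
total = big
for _ in range(10):
    total += small
print(total - big)   # 0.0, although the exact answer is 10.0

# The part lost in one addition can be recovered exactly
# (this identity assumes abs(big) >= abs(small))
s = big + small
lost = small - (s - big)
print(lost)          # 1.0: the part of `small` that the addition dropped
```

At this magnitude adjacent floating point numbers are 2.0 apart, so adding 1.0 has no effect on the running total.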

In [None]:
class Simulator: # we use a class just to hold variables between calls
    def __init__(self):
        # initialise accumulators
        self.a_sum = 0
        self.b_sum = 0
        
    def update(self, a, b):
        # increment
        self.a_sum += a
        self.b_sum += b
        
    def results(self):
        # return a  pair of results
        # (you do not need to change this)
        return self.a_sum, self.a_sum - self.b_sum
        

In [None]:
a_error, d_error = simulate(Simulator())
# bad result!
print(f"Error in a_sum is {a_error} and {d_error} in d_sum")

Copy and paste the `Simulator` into the cell below and modify it:

In [None]:
# YOUR CODE HERE

In [None]:
    
a_error, d_error = simulate(Simulator())
print(f"Error in a_sum is {a_error} and {d_error} in d_sum")

In [None]:
with marks(2):
    assert(a_error<2.0)

In [None]:
with marks(2):
    assert(a_error<0.5)

In [None]:
with marks(2):
    assert(d_error<1e-10)

In [None]:
with marks(2):
    assert(d_error<1e-12)