# Data Analysis with Jupyter Notebooks.

# ViCEPHEC

Benjamin J. Morgan & Fiona Dickinson, University of Bath.

# Contents
This introductory Python Notebook is designed only to 

- [Getting Started with Jupyter Notebooks](#getting_started)
- [Running Code](#running_code)
- [Simple Calculations](#simple_calculations)
- [Comments and Markdown cells](#comments)
- [Mathematical Functions and Modules](#functions_and_modules)
- [Variables](#variables)
- [Data Types](#data_types)
    - [Integers and Floats](#numbers)
        - [Scientific Notation](#scientific_notation)
    - [Lists](#lists)
- [Numpy and arrays](#numpy)
- [Plotting data with matplotlib](#matplotlib)
- [Creating Tables with 2D numpy arrays](#2d_arrays)

# Getting Started with Jupyter Notebooks<a id='getting_started'></a>

A Jupyter notebook consists of a series of **cells** that contain text. These cells are arranged vertically, top-to-bottom in the document. Any cell can be edited by clicking on it. A cell in **edit mode** is indicated by a green border. 
<img style="width:700px" src='figures/target_cell.png' />
A cell with a blue border is in **command mode**. 
<img style="width:700px" src='figures/command_mode_cell.png' />
In command mode you are not able to type into a cell, but you can still edit the notebook (reordering cells, executing code, etc.)

To edit a cell in command mode, press enter or double click on the cell.

## Running Code<a id="running_code"></a> 

This course will not go into detail about how to write your own Python code. Instead, as much as possible we are going to focus on learning how to use readily available packages for data analysis. There are plenty of resources for learning Python for more traditional programming tasks, such as the tutorials at [Codecademy](https://www.codecademy.com/learn/python).

The default cell type in a Jupyter notebook is a **code** cell. If you open a new notebook it will have one empty code cell. You can always create more cells from the menu, using Insert > Insert Cell Above (<span style="color:green">a</span>) or Insert > Insert Cell Below (<span style="color:green">b</span>).  
<img style="width:550px" src='figures/insert_cell.png'/>
Any code typed into a code cell can be run (or "**executed**") by pressing `Shift-Enter` or pressing the <img style='display:inline; height:1.5em; vertical-align: bottom;' src='figures/run_code_button.png'/> button in the notebook toolbar.

This practical consists of an interactive tutorial (this notebook), followed by a series of exercises. Some code cells in the tutorial will already have code in them, which you can **run** by selecting the cell and pressing `Shift-Enter` or clicking the toolbar button:

In [None]:
2+3 # run this cell…

There will also be small exercises that ask you to write your own piece of code from scratch, or to modify an example that is not finished yet: it might contain an error (often called a &ldquo;bug&rdquo;), or just not do exactly what we would like. These will be in green boxes.

## Simple calculations<a id="simple_calculations"></a>

One of the simplest forms of &ldquo;code&rdquo; that can be run in code cells is mathematical expressions:

>```python
1+2+3+4
```

>```python
4*5/2
```

>```python
2**4 - 2
```

`**` is the &ldquo;power&rdquo; operator. This code calculates $2^4 - 2$.
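
When several operators appear in one expression, the order in which they are applied matters. The sketch below assumes Python 3 (where `/` always returns a number with a decimal point) and uses `print()`, which is covered in the Variables section, so that every result from the cell is shown. `**` is applied before `*` and `/`, which are applied before `+` and `-`; brackets override this order.

```python
# ** is applied before -, so this is (2**4) - 2
print(2**4 - 2)      # 14
# brackets force the subtraction to happen first
print(2**(4 - 2))    # 4
# in Python 3, / gives a decimal result even for whole-number answers
print(4*5/2)         # 10.0
# // is whole-number ("floor") division
print(4*5//2)        # 10
```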

<div class="alert alert-success">
Perform the following calculations in the three cells below:<br/>
    $1+1+2+3+5+8$<br/>
    $1*2*3*4*5*6$<br/>
    $2^{10}$
</div>

## Explain Your Code: Comments and Markdown Cells<a id="comments"></a>
Often, reading only the code can get quite cryptic (it is called &ldquo;code&rdquo;, after all), which makes it difficult to understand what is happening. To explain what a particular piece of code does, or to explain *why* a piece of code is being used, you can include **comments**.

 
```python
# this is a comment
```

Any text that appears after a <span style="font-family:monospace; color:#438080">#</span> symbol is part of the comment, and is ignored when the code is run. 

Jupyter notebooks offer a second way to describe what you are doing: **Markdown cells**. A code cell can be converted to a Markdown cell by selecting Cell > Cell Type > Markdown from the menu.  
<img src='figures/markdown.png' width=350/>  
A Markdown cell can be used to type plain text, which is displayed when the cell is run. Markdown cells are useful for documenting a notebook, particularly when you want to write something more detailed than a short comment. Markdown cells can also contain basic text formatting, links, images, and equations (more information is [here](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html)).

<div class="alert alert-success">
The cell below should be a Markdown cell, but it is currently a code cell.<br/>
First run the cell to see what happens.
Then change it into a **Markdown** cell, before re-running it. 
</div>

In [None]:
Markdown cells allow you to type longer text to explain what your code is doing.  
Setting this cell to Markdown, and running it will format the text for clearer reading.  
Markdown also provides shorthand for including other features such as [links][cheatsheet].

[cheatsheet]:https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet

To edit a Markdown cell after it has been run, double-click on it to see the raw Markdown code.

## Mathematical functions and modules<a id="functions_and_modules"></a>

In mathematics, a **function** converts one number to another number; $y=f(x)$.

In programming, a **function** is more general than this, and converts an input into an output. For example, if we want to calculate a square root, we can use the `sqrt()` function.

First we need to import the math module:

>```python
import math
math.sqrt(4)
```

You can think of `math.sqrt()` as instructing the computer to &ldquo;use the `sqrt()` function provided inside the  `math` module&rdquo;.

The `math` module contains a [large set of common mathematical functions](https://docs.python.org/2/library/math.html), and the constants $\pi$ and $\mathrm{e}$ (the base of the natural logarithm).

>```python
from math import pi, sin, e, log
```

The `log` function calculates the natural logarithm, *i.e.* $\ln x$.

>```python
pi
sin( pi/2 )
log(e) # the natural logarithm of e
e**2
```
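
The two import styles used above give access to the same functions and constants; the only difference is how they are named afterwards. A minimal sketch comparing them:

```python
import math                 # names are then used as math.name
from math import sqrt, pi   # the imported names are used directly

print(math.sqrt(4))   # 2.0
print(sqrt(4))        # 2.0
print(math.pi == pi)  # True: both names refer to the same constant
```

Using `import math` keeps it obvious where each function comes from, while `from math import ...` gives shorter expressions.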

<div class="alert alert-success">
Calculate the following:  <br/><br/>

  
$\cos(2\pi)$  <br/><br/>
$\ln(2\mathrm{e})-\ln(2)$<br/><br/>
$\log_{10}(10)$  <br/><br/>
    
    
The <span style="font-family:monospace">math</span> function for $\log_{10}$ is <span style="font-family:monospace">log10()</span>.
</div>

#  Variables<a id="variables"></a>

Many of the code examples we have already seen produce some information. When each code cell is executed, if that code **returns** a result, this is printed directly underneath the corresponding code cell, next to <span style="color:#D64423; font-family:monospace">Out[ ]:</span>.

>```python
72/4
```

<span style="font-family:monospace"><span style="color:#046308">print</span>()</span> is another function, like `math.sqrt()` or `math.log()`. Instead of performing a mathematical calculation, <span style="font-family:monospace"><span style="color:#046308">print</span>()</span> just prints whatever is inside the brackets. The text is printed to the screen, but the <span style="font-family:monospace"><span style="color:#046308">print</span>()</span> function does not return a value.

<span style="font-family:monospace"><span style="color:#046308">print</span>()</span> can print more than one variable if these are separated by commas:

>```python
print("72/4 =",72/4)
```
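
One way to see the difference between printing and returning is to store whatever `print()` gives back. This is only an illustrative sketch, assuming Python 3: the name `returned` is made up for this example, and storing results under a name is explained later in this section. Python uses the special value `None` to mean that nothing was returned.

```python
returned = print("72/4 =", 72/4)   # prints: 72/4 = 18.0
print(returned)                    # prints: None
```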

Import statements are another example of code that does not return anything.

>```python
from math import sqrt
```

If you run a code cell and nothing appears underneath, the code ran okay (and hopefully did what you expected). Any output under a cell will either be the **returned** result, or an error.

One aspect that makes programmatic data analysis useful is the ability to write complex procedures with many steps that are then performed *identically* every time the code is run against new data sets. To build up more sophisticated data analysis workflows, we often want to keep the result of one step to use in a later step. Storing results in computer memory is called **assigning** **variables**. A variable is just a name (a sequence of letters and numbers) that labels the stored result. Then, to access the value stored in the variable, we can use the label (the variable name) to refer to the original result.

>```python
# calculate 2 + 3
2 + 3
```

>```python
# calculate 2 + 3 and store the result in the variable `my_result`
my_result = 2 + 3
```

Notice there is no return value printed to <span style="color:#D64423; font-family:monospace">Out[ ]:</span>.  
Instead a variable `my_result` is created, and the value returned by the calculation is stored here.  
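
A stored value can then be used in later steps, simply by writing the variable name where the number would otherwise go. A short sketch (the name `doubled` is only an example):

```python
my_result = 2 + 3          # store 5 in my_result
doubled = my_result * 2    # reuse the stored value in a new calculation
print(my_result, doubled)  # 5 10
```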

Variable names can be nearly anything, as long as the name is not one of Python's reserved keywords (e.g. `import`). Reusing the name of a built-in function such as `print` is allowed, but it overwrites that function, so it is best avoided. Two further limitations are that names cannot begin with a number (but can contain numbers), and they cannot contain spaces. Underscores are commonly used instead of spaces to keep the code readable.

>```python
1st_result = 3 + 4
```

>```python
this result = 5 + 6
```

<div class="alert alert-success">
Fix the two previous code cells to use <span style='font-family:monospace;'>first_result</span> and <span style='font-family:monospace;'>this_result</span> so that they will run without errors.
</div>
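
If you are unsure whether a name is allowed, Python can check for you. The sketch below is an optional aside: it uses the string method `isidentifier()` and the standard `keyword` module, neither of which is needed for the rest of this tutorial. The names are written inside quotes so that they are treated as text to be tested, rather than as variables.

```python
import keyword

print("first_result".isidentifier())  # True
print("1st_result".isidentifier())    # False: begins with a number
print("this result".isidentifier())   # False: contains a space
print(keyword.iskeyword("import"))    # True: reserved by Python
```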

To check the value stored in a variable we can just type the variable name, which returns the stored value.

>```python
my_result
```

Note that we only get the **last** value returned if we have multiple lines of code.


>```python
my_result
this_result
```

We can get round this by using the `print` function to print out the value stored in one or more variables:
>```python
print( my_result )
print( this_result )
```

<div class="alert alert-success">
Create three variables, $x$, $y$, and $z$, and use them to store the numbers $5,6,7$.  

Using these variables, calculate:<br/>
$5+6+7$,<br/> 
and<br/>
$(5+6)\times7$.
</div>

In [14]:
# create your variables and store the numbers 5, 6, 7

In [None]:
# use the variables to calculate 5+6+7

In [None]:
# use the variables to calculate (5+6)×7

## Data types<a id="data_types"></a>

So far we have talked about &ldquo;data&rdquo; and &ldquo;results&rdquo;, but what are the pieces of information that we want to manipulate? Typically numbers (or groups of numbers) or text (or lists of text). Different kinds of data can be used for different things: numbers can be combined in mathematical expressions, text can be printed, searched, or reorganised; numbers can be arranged by magnitude, names can be arranged by alphabetical order. In Python, these differences are represented by different **data types**.

### Numbers: *int* and *float*<a id="numbers"></a>

We will discuss two kinds of numeric types: integers and floating point numbers. Python has other built in numeric data types, including complex numbers, which are useful in specialised cases.

Whole numbers, without decimal points are integers, e.g. <span style="color:#108714; font-family:monospace">1</span>, <span style="color:#108714; font-family:monospace">6</span>, <span style="color:#108714; font-family:monospace">2331</span>.  
Numbers with decimal points are floating point numbers or &ldquo;floats&rdquo;, e.g. <span style="color:#108714; font-family:monospace">1.0</span>, <span style="color:#108714; font-family:monospace">232.141</span>, <span style="color:#108714; font-family:monospace">1.3e5</span>.  
That last example uses scientific notation and is shorthand for <span style="color:#108714; font-family:monospace">130000.0</span>.

Note that <span style="color:#108714; font-family:monospace">1</span> and <span style="color:#108714; font-family:monospace">1.0</span> are different.

>```python
type(1) # `type()` returns the data-type of something
```

>```python
1 is 1.0 # `is` tests whether two things are the same
```

Even though they both represent the number one, and have equal values (yes, this can be confusing), `1` and `1.0` are **not the same** because the first is an integer and the second is a float.  

To reassure ourselves slightly, we can test whether two things are equal using `==`

>```python
1 == 1.0
```
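
Putting these checks side by side can make the distinction clearer: the *types* differ, while the *values* compare as equal. A small sketch, assuming Python 3:

```python
print(type(1))               # <class 'int'>
print(type(1.0))             # <class 'float'>
print(type(1) == type(1.0))  # False: an integer is not a float
print(1 == 1.0)              # True: the values are equal
print(1 + 1.0)               # 2.0: mixing the two types gives a float
```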

### Scientific Notation<a id="scientific_notation"></a>
Very large and very small numbers can be written using **scientific notation**. For example, instead of 0.0000241, we would normally write 2.41&times;10<sup>-5</sup>. In Python this would be written `2.41e-5` or `2.41e-05`.

>```python
2.41e-5 == 0.0000241
```
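
Python can also display a number in scientific notation. The sketch below uses the built-in `format()` function with the specification `".2e"` (two decimal places, scientific notation). This formatting syntax is standard Python, but it is not otherwise covered in this notebook, and the name `small` is only an example.

```python
small = 0.0000241
print(small)                 # 2.41e-05
print(format(small, ".2e"))  # 2.41e-05
print(1.3e5)                 # 130000.0
print(1.3e5 == 130000.0)     # True
```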

# numpy and arrays<a id='numpy'></a>

Although lists can be very useful for handling ordered collections of things, for data manipulation we usually deal with ordered lists of only numbers. The flexibility of lists means using them is (relatively) computationally slow. This is not an issue for small data sets, but can be prohibitive for large data sets, with perhaps millions or more entries.

An alternative data type, specifically designed for manipulating (large) numerical data sets is the **numpy array**. `numpy` is a module for numerical scientific computing with Python, and is conventionally imported via

```python
import numpy as np
```

This is similar to the <span style="color:#108714; font-family:monospace; font-weight:bold">import</span> <span style="font-family:monospace">math</span> we saw [above](#functions_and_modules), but uses the <span style="color:#108714; font-family:monospace; font-weight:bold">as</span> keyword to make `numpy` more convenient to work with.

>```python
import math as m
m.sqrt(4)
```

Having imported `numpy` (as `np`) we can store lists of numbers as `numpy` arrays.

>```python
import numpy as np
a = np.array( [ 1, 2, 3, 4 ] )
a
```

You can think of a 1-dimensional `numpy` array as a vector, and we can use very compact code to perform *vector* mathematical operations on the entire array.

>```python
a + 1
```

>```python
a**2
```

Remember that `**` is the &ldquo;power&rdquo; operator. This code calculates $a^2$ for every number stored in `a`.

In both these cases, the mathematical operation (add one; square) is applied to every element in the array, and a new array with *all* the results is returned.
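
This behaviour is different from that of a plain Python list. As a sketch of the contrast (the name `numbers` is only an example), multiplying a list by 2 repeats the list, whereas multiplying an array by 2 doubles each element:

```python
import numpy as np

numbers = [1, 2, 3, 4]     # a plain Python list
a = np.array(numbers)      # the same numbers as a numpy array

print(numbers * 2)   # [1, 2, 3, 4, 1, 2, 3, 4]
print(a * 2)         # [2 4 6 8]
```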

If the mathematical expression contains two (or more) arrays, then an **element-by-element** operation is performed:

e.g. vector addition:

>```python
b = np.array( [ 5, 6, 7, 8 ] )
a + b
```
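
The same element-by-element rule applies to the other operators. A short sketch using the same two arrays: note that `a * b` multiplies matching elements, which is not the same as the vector dot product (available as `np.dot()`).

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

print(a + b)         # [ 6  8 10 12]
print(a * b)         # [ 5 12 21 32]
print(np.dot(a, b))  # 70, i.e. 5 + 12 + 21 + 32
```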

# Plotting data with matplotlib<a id='matplotlib'></a>

To plot data we use another module: [`matplotlib`](http://matplotlib.org). This is a very powerful (and complicated) plotting library that can be used for quick analysis of experimental data, or to generate publication-quality figures. It supports an enormous number of plot types. We are going to start with simple 2D $x,y$ plots.

>```python
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
```
The `import` statement loads up the part of the `matplotlib` library we will use for plotting, and lets us refer to this as `plt` for convenience later.

The `%matplotlib inline` command tells the Jupyter notebook that we want all our plots to appear &ldquo;inline&rdquo;, i.e. inside the notebook (alternatives include opening the plots in other windows, or saving them as graphics files). The `%` symbol at the start means this is a &ldquo;magic&rdquo; command for controlling the behaviour of this Jupyter notebook, and is not standard Python.

If you are using a high DPI or &ldquo;retina&rdquo; screen, you will also want to switch on high resolution figures.

>```python
%config InlineBackend.figure_format = 'retina'
```

We also import `numpy` as `np` so that we can store our data as arrays.

Creating a plot uses `plt.plot()`. Remember, we have assigned `plt` as shorthand for `matplotlib.pyplot`.

>```python
# plot the numpy arrays a and b against each other
import numpy as np
a = np.array( [ 1, 2, 3, 4 ] )
b = np.array( [ 5, 6, 7, 8 ] )
print( "a:", a )
print( "b:", b )
plt.plot( a, b )
plt.show()
```

This can be used for plotting $y$ as a function of $x$, e.g. $y=x^2$.

>```python
x = np.array( [0, 1, 2, 3, 4, 5] )
y = x**2
plt.plot( x, y )
plt.show()
```
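
Plots are usually easier to read with labelled axes and enough points to give a smooth curve. The sketch below is one possible way to do this: `np.linspace(0, 5, 51)` generates 51 evenly spaced values from 0 to 5, and the axis labels and legend text are only illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 51)   # 51 evenly spaced points from 0 to 5
y = x**2                    # square every element of x

plt.plot(x, y, label="$y=x^2$")   # draw the curve
plt.xlabel("x")                   # label the horizontal axis
plt.ylabel("y")                   # label the vertical axis
plt.legend()                      # show the curve label on the plot
plt.show()
```

With only six points, as in the previous cell, the curve is drawn as straight segments between the points; using more points makes the same function look smooth.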