# Contents

- [Data Analysis with Jupyter Notebooks](#intro)
    - [Why analyse data with computer code?](#why)
- [Getting Started with Jupyter Notebooks](#getting_started)
- [Running Code](#running_code)
- [Simple Calculations](#simple_calculations)
- [Comments and Markdown cells](#comments)
- [Mathematical Functions and Modules](#functions_and_modules)
- [Variables](#variables)
- [Data Types](#data_types)
    - [Integers and Floats](#numbers)
    - [Strings](#strings)
    - [Lists](#lists)
    - [Arrays and numpy](#numpy)
- [Plotting data with matplotlib](#matplotlib)
- [Creating Tables with 2D numpy arrays](#2d_arrays)
- [Introduction to pandas](#pandas)
- [Managing datasets with pandas](#pandas2)
- [Introductory data analysis with numpy](#data_analysis)
- [Further Reading](#further_reading)
- [Additional Info](#additional_info)

# Data Analysis with Jupyter Notebooks<a id="intro"></a>

This is a Jupyter notebook. In some ways each document is like a physical notebook. It can be used for describing an experiment, recording data, and commenting on it. Unlike a physical notebook, a Jupyter notebook also allows you to run and easily share computer code. This combination makes Jupyter notebooks a very useful tool for analysing data collected from experiments. Unlike spreadsheets or combinations of separate data analysis codes, a Jupyter notebook allows you to collect descriptions and notes for individual experiments, links to the raw data collected, the computer code that performs any necessary data analysis, and the final figures generated with these data, ready for use in a report or published paper.

## Why analyse data using code?<a id="why"></a>
Using computer code allows you to analyse experimental data *programmatically*. All the steps for working with your data are carried out according to a *program*: a predefined series of instructions, like a recipe for a particular meal.

Once a particular program has been written, it will always produce the same results with the same starting data. This makes it possible to &ldquo;show your working&rdquo;. Scientists presenting new results can share their original data, alongside the code that they used for all their analysis. This has a number of benefits. Other scientists can review the code, run it against the original data set, and check that any analysis has been done correctly. 

Finished code can also be used as a starting point for looking at a similar set of data. The original scientist might repeat their experiment to confirm their results, or another group might collect data under slightly different conditions, and want to compare the two cases. Often the steps described by the code are the same for small data sets and for large data sets. Once an analysis program exists, processing enormous data sets simply becomes a question of access to sufficiently powerful computers. 



TO GO IN PRINTED STARTING DOCUMENT (reference on [Notebook basics](http://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb): using the notebook dashboard and navigation)

# Getting Started with Jupyter Notebooks<a id='getting_started'></a>

A Jupyter notebook consists of a series of **cells** that contain text. These cells are arranged vertically, top-to-bottom in the document. Any cell can be edited by clicking on it. A cell in **edit mode** is indicated by a green border. 
<img style="width:700px" src='figures/target_cell.png' />
A cell with a blue border is in **command mode**. 
<img style="width:700px" src='figures/command_mode_cell.png' />
In command mode you are not able to type into a cell, but you can still edit the notebook (reordering cells, executing code, etc.). Commands for editing notebooks can be accessed from the menu at the top of the screen, and commonly used commands have keyboard shortcuts, which will be highlighted in examples using <span style="color:green">green text</span>. The full list of keyboard shortcuts can be found through Help > Keyboard Shortcuts in the menu.

To edit a cell in command mode, press enter or double click on the cell.

## Running Code<a id="running_code"></a> 

The Jupyter notebook is primarily useful for writing and running code. A large number of different computer languages can be used in Jupyter notebooks. In these examples, we will be using Python (specifically Python 3). Python is increasingly used across a large number of scientific disciplines for data management and analysis. The large scientific community means that very good resources already exist for data processing, such as the Jupyter project, and specific prewritten tools for manipulating and plotting data.

This course will not go into detail about how to write your own Python code. Instead, as much as possible we are going to focus on learning how to use readily available packages for data analysis. There are plenty of resources for learning Python for more traditional programming tasks, such as the tutorials at [Codecademy](https://www.codecademy.com/learn/python).

The default cell type in a Jupyter notebook is a **code** cell. If you open a new notebook it will have one, empty, code cell. And you can always create more cells by clicking in the menu on Insert > Insert Cell Above (<span style="color:green">a</span>) or Insert > Insert Cell Below (<span style="color:green">b</span>).  
<img style="width:550px" src='figures/insert_cell.png'/>
Any code typed into a code cell can be run (or "**executed**") by pressing `Shift-Enter` or pressing the <img style='display:inline; height:1.5em; vertical-align: bottom;' src='figures/run_code_button.png'/> button in the notebook toolbar.

This practical consists of an interactive tutorial (this notebook), followed by a series of exercises. Some code cells in the tutorial will already have code in them, which you can **run** by selecting and pressing `Shift-Enter` or clicking the toolbar button:

In [None]:
2+3 # run this cell…

You should now have <span style="color:#D64423; font-family:monospace">Out[ ]:</span> with the result of running this code printed next to it:
<img style="width:590px" src='figures/output.png'/>
and the focus has automatically moved to the next cell. You can always re-select a cell to run it again.

Most of the code examples will be presented like this:

>```python
print("hello")
```

with an empty code cell underneath. These examples are for you to type into the empty code cell and then run. Do not copy and paste these. You will learn the concepts faster and become comfortable with writing your own code if you type each piece of code out.

Start with this example:

>```python
print("hello")
```

Your output should be

```
hello
```

There will also be small exercises that ask you to write your own piece of code from scratch, or modify an example that is not finished yet: it might contain an error (often called a **bug**), or just not do exactly what we would like. These will be in green boxes.

<div class="alert alert-success"> 
<b>Edit</b> the <span style="font-family:monospace;">print</span> statement below, so that when you run the cell it prints your name.
</div>

In [None]:
# Type your code into this cell and run it.
print("")

<div class="alert alert-success">
Enter code into the cell below to print today's date.
</div>

## Simple calculations<a id="simple_calculations"></a>

One of the simplest forms of &ldquo;code&rdquo; that can be run in code cells is mathematical expressions:

>```python
1+2+3+4
```

>```python
4*5/2
```

>```python
2**4 - 2
```

`**` is the &ldquo;power&rdquo; operator. This code calculates $2^4 - 2$.

## Explain Your Code: Comments and Markdown Cells<a id="comments"></a>
One of the advantages of using *code* for numerical calculations and data analysis is that you end up with a record of exactly what you have done. You, or anyone else, can read the code, to understand how you reached your answer. You can think of this as &ldquo;showing your working&rdquo;, and it can be very helpful if you want to solve a *similar* problem in the future.

Often, reading only the code can get quite cryptic (it is called &ldquo;code&rdquo;, after all), which makes it difficult to understand what is happening. To explain what a particular piece of code does, or to explain *why* a piece of code is being used, you can include **comments**.

 
```python
# this is a comment
```

Any text that appears after a <span style="font-family:monospace; color:#438080">#</span> symbol is part of the comment, and is ignored when the code is run. 

Jupyter notebooks offer a second way to describe what you are doing: **Markdown cells**. A code cell can be converted to a Markdown cell by selecting Cell > Cell Type > Markdown from the menu.
<img src='figures/markdown.png' width=350/>
A Markdown cell can be used to type plain text, which is displayed when the cell is run. Markdown cells are useful for documenting a notebook, particularly when you want to write something more detailed than a short comment. Markdown cells can also contain basic text formatting, links, images, and equations (more information is [here](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html)).

## Mathematical functions and modules<a id="functions_and_modules"></a>

In mathematics, a **function** converts one number to another number; $y=f(x)$.

In programming, a **function** is more general than this, and converts an input into an output. For example, if we want to calculate a square root, we can use the `sqrt()` function.

>```python
sqrt(4)
```

This has given us an error:  

<span style="font-family:monospace"><span style="color:#890004">NameError</span><span style="color:#046308">: name 'sqrt' is not defined</span></span>. 

Python has a *lot* of built-in commands (functions). Solving any particular problem will only require a small subset of these. To keep Python code efficient, only a minimal set of these tools is available "out of the box". Other commands (such as mathematical functions) are collected in **modules** that we can load, to make these available in our notebook.

`sqrt` lives in the `math` module. We can load it like this:

>```python
from math import sqrt
sqrt(4)
```


Or we can import the entire math module:

>```python
import math
math.sqrt(4)
```

You can think of `math.sqrt()` as instructing the computer to &ldquo;use the `sqrt()` function provided inside the  `math` module&rdquo;.

The `math` module contains a [large set of common mathematical functions](https://docs.python.org/3/library/math.html), and the constants $\pi$ and $\mathrm{e}$ (the base of the natural logarithm).

>```python
from math import pi, sin, e, log
```

>```python
pi
```

>```python
sin( pi/2 )
```

>```python
e # base of the natural logarithm
```

>```python
log(e)
```

>```python
e**2
```
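One detail that is easy to miss: the trigonometric functions in `math` expect angles in radians, and `math.log()` is the natural logarithm. The short sketch below illustrates both points; the variable names are only for illustration.

```python
import math

# math.sin() expects an angle in radians, not degrees
angle_in_degrees = 90
angle_in_radians = math.radians(angle_in_degrees)  # convert degrees to radians

print(angle_in_radians)            # the same value as pi/2
print(math.sin(angle_in_radians))

# math.log() is the natural logarithm, so it undoes math.exp()
print(math.log(math.exp(2)))
```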

#  Variables<a id="variables"></a>

Many of the code examples we have already seen produce some information. When each code cell is executed, if that code **returns** a result, this is printed directly underneath the corresponding code cell, next to <span style="color:#D64423; font-family:monospace">Out[ ]:</span>.

>```python
72/4
```

>```python
print("Nothing is returned here!")
```

<span style="font-family:monospace"><span style="color:#046308">print</span>()</span> is another function, like `math.sqrt()` or `math.log()`. Instead of performing a mathematical calculation, <span style="font-family:monospace"><span style="color:#046308">print</span>()</span> just prints whatever is inside the brackets. In this example, this is exactly what happened. The text is printed to the screen, but the <span style="font-family:monospace"><span style="color:#046308">print</span>()</span> function does not return a value.

<span style="font-family:monospace"><span style="color:#046308">print</span>()</span> can print more than one variable if these are separated by commas:

>```python
print("72/4 =",72/4)
```

Import statements are another example of code that does not return anything.

>```python
from math import sqrt
```

If you run a code cell and nothing appears underneath, the code ran okay (and hopefully did what you expected). Any output under a cell will either be the **returned** result, or an error.

One aspect that makes programmatic data analysis useful comes from the ability to write complex procedures with many steps, that are then performed *identically* every time the code is run against new data sets. To build up more sophisticated data analysis workflows, we often want to keep a result of one step, to use in a later step. Storing results in computer memory is called **assigning** **variables**. A variable is just a name (a sequence of letters and numbers) that labels the stored result. Then, to access the value stored in the variable, we can use the label (the variable name) to refer to the original result.

>```python
# calculate 2 + 3
2 + 3
```

>```python
# calculate 2 + 3 and store the result in the variable `my_result`
my_result = 2 + 3
```

Notice there is no return value printed to <span style="color:#D64423; font-family:monospace">Out[ ]:</span>.  
Instead a variable `my_result` is created, and the value returned by the calculation is stored here.  

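Variable names can be nearly anything, as long as that name is not one of the words reserved by Python (e.g. `import` or `for`). The names of built-in functions such as `print` are technically allowed, but should be avoided, because reusing them hides the original function. Two further limitations are that names cannot begin with a number (but can contain numbers), and they cannot contain spaces. Underscores are commonly used instead of spaces to keep the code readable.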

>```python
1st_result = 2 + 3
```

>```python
this result = 2 + 3
```
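If you are unsure whether a name is allowed, Python can check for you. The sketch below uses the string method `isidentifier()` and the standard-library `keyword` module; the list of candidate names is made up for illustration.

```python
import keyword

candidate_names = ["my_result", "result_1", "1st_result", "this result", "import"]

results = {}
for name in candidate_names:
    # str.isidentifier() checks the spelling rules (no spaces, no leading digit)
    # keyword.iskeyword() checks whether Python has reserved the word
    results[name] = name.isidentifier() and not keyword.iskeyword(name)
    print(name, "->", results[name])
```

Only the first two names pass both checks.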

To check the value stored in a variable we can just type the variable name (which returns the stored value), or use `print()`.

>```python
my_result
```

>```python
print( my_result )
```

Variables can be used to store raw numbers, and can then be used for calculations.

>```python
the_number_six = 6
my_result + the_number_six
```

Any code that uses variables may itself return a further result, which can be assigned to a new variable, and used later (and so on).

>```python
yet_another_variable = my_result + the_number_six
print( yet_another_variable )
```

If you refer to a variable that has not yet been created you will get an error.

>```python
favourite_fruit = bananas
```

## Data types<a id="data_types"></a>

So far we have talked about &ldquo;data&rdquo; and &ldquo;results&rdquo;, but what are the pieces of information that we want to manipulate? Typically numbers (or groups of numbers) or text (or lists of text). Different kinds of data can be used for different things: numbers can be combined in mathematical expressions, text can be printed, searched, or reorganised; numbers can be arranged by magnitude, names can be arranged by alphabetical order. In Python, these differences are represented by different **data types**.

### Numbers: *int* and *float*<a id="numbers"></a>

We will discuss two kinds of numeric types: integers and floating point numbers. Python has other built in numeric data types, including complex numbers, which are useful in specialised cases.

Whole numbers, without decimal points are integers, e.g. <span style="color:#108714; font-family:monospace">1</span>, <span style="color:#108714; font-family:monospace">6</span>, <span style="color:#108714; font-family:monospace">2331</span>.  
Numbers with decimal points are floating point numbers or &ldquo;floats&rdquo;, e.g. <span style="color:#108714; font-family:monospace">1.0</span>, <span style="color:#108714; font-family:monospace">232.141</span>, <span style="color:#108714; font-family:monospace">1.3e5</span>.  
That last example uses scientific notation and is shorthand for <span style="color:#108714; font-family:monospace">130000.0</span>.

Note that <span style="color:#108714; font-family:monospace">1</span> and <span style="color:#108714; font-family:monospace">1.0</span> are different:

>```python
type(1) # `type()` returns the data-type of something
```

>```python
type(1.0)
```

>```python
1 is 1.0 # `is` tests whether two things are the same
```

Even though they both represent the number one, and have equal values (yes, this can be confusing), `1` and `1.0` are not **the same** because the first is an integer and the second is a float.  

To reassure ourselves slightly, we can test whether two things are equal using `==`

>```python
1 == 1.0
```

A common mistake is confusing &ldquo;is equal to&rdquo; (`==`) with variable assignment (`=`).

>```python
1 = 1.0
```

This gives an error, because we are trying to assign the floating point number <span style="color:#108714; font-family:monospace">1.0</span> to the variable `1`, which is not a valid variable name (it is already used to represent the integer <span style="color:#108714; font-family:monospace">1</span>).
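To see the difference between comparing values and comparing data types, and how to convert between the two numeric types, here is a small sketch. The variable names are only for illustration.

```python
an_integer = 1
a_float = 1.0

print(an_integer == a_float)              # compares the values
print(type(an_integer) == type(a_float))  # compares the data types

# int() and float() convert between the two numeric types
print(float(an_integer))
print(int(a_float))

# dividing with / always returns a float, even for whole-number results
print(type(4 / 2))
```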

### Strings<a id="strings"></a>

Strings are any sequence of text. We indicate that a sequence of text is a string, and not a Python command, by enclosing it in single or double quotes. Being able to use either quote type allows strings that themselves contain quotes.

>```python
'this is a string using single quotes'
```

>```python
"this is a string using double quotes"
```

>```python
'this string has "nested quotes"'
```

### Lists<a id="lists"></a>

Python also contains built-in data types for collections of things. For data analysis we often deal with sets of numbers. These can be collected in **lists**.

A list is denoted by a series of items separated by commas, and enclosed in square brackets:

>```python
my_list = [ 1, 2, 3, 4 ]
my_list
```

although lists can contain any set of Python objects, even other lists:

>```python
my_other_list = [ 4, 1.5, 'peach' ]
my_other_list
```

>```python
both_lists = [ my_list, my_other_list ]
both_lists
```

To refer to one element in a list, use the **index** of that element. Index numbering counts the number of jumps along the sequence, so starts at zero.

>```python
# 1st element (zero jumps along the sequence)
print( my_other_list[0] )
# 2nd element (one jump along the sequence)
print( my_other_list[1] ) 
# 3rd element (two jumps along the sequence)
print( my_other_list[2] ) 
```

Using an index outside the range of elements in the list will produce an error. For example, `my_other_list` has three elements, but `my_other_list[3]` tries to return the *4th* element (which does not exist).

>```python
print( my_other_list[3] )
```

You can also refer to a sequence of elements by giving a *range* as the index:

In [None]:
# run this cell to create the list `alphabet`
alphabet = [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 
             'h', 'i', 'j', 'k', 'l', 'm', 'n', 
             'o', 'p', 'q', 'r', 's', 't', 'u',
             'v', 'w', 'x', 'y', 'z' ]

>```python
alphabet[3:8]
```

→ start from three jumps, stop before eight jumps, i.e. the 4th to the 8th elements (`'d'` to `'h'`).

Negative numbers count backwards from the end of the sequence.

>```python
alphabet[-8:-3]
```

→ 8th from the end up to, and including, the 4th from the end (`'s'` to `'w'`).

And leaving out one of the numbers in the range will include all elements up to the start or end of the sequence.

>```python
alphabet[14:]
```

>```python
alphabet[:14]
```
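The slicing rules can be checked with a short, self-contained sketch. It rebuilds the same 26-letter list using the standard-library `string` module, which is just a shortcut to avoid typing the letters out again.

```python
import string

# build the same 26-letter list used above
alphabet = list(string.ascii_lowercase)

print(alphabet[3:8])    # indices 3, 4, 5, 6, 7
print(alphabet[-8:-3])  # 8th from the end up to the 4th from the end

# the number of elements in a slice is (stop - start)
print(len(alphabet[3:8]))

# splitting at the same index gives two pieces that rejoin into the original list
print(alphabet[:14] + alphabet[14:] == alphabet)
```

The element at the stop index is never included, which is why `[:14]` and `[14:]` fit together without overlapping.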

# numpy and arrays<a id='numpy'></a>

Although lists can be very useful for handling ordered collections of things, for data manipulation we usually deal with ordered lists of only numbers. The flexibility of lists means using them is (relatively) computationally slow. This is not an issue for small data sets, but can be prohibitive for large data sets, with perhaps millions or more entries.

An alternative data type, specifically designed for manipulating (large) numerical data sets is the **numpy array**. `numpy` is a module for numerical scientific computing with Python, and is conventionally imported via

```python
import numpy as np
```

This is similar to the <span style="color:#108714; font-family:monospace; font-weight:bold">import</span> <span style="font-family:monospace">math</span> we saw [above](#functions_and_modules), but uses the <span style="color:#108714; font-family:monospace; font-weight:bold">as</span> keyword to make `numpy` more convenient to work with.

>```python
import math as m
m.sqrt(4)
```

Having imported `numpy` (as `np`) we can store lists of numbers as `numpy` arrays.

>```python
import numpy as np
a = np.array( [ 1, 2, 3, 4 ] )
a
```

You can think of a 1-dimensional `numpy` array as a vector, and we can use very compact code to perform *vector* mathematical operations on the entire array.

>```python
a + 1
```

>```python
a**2
```

Remember that `**` is the &ldquo;power&rdquo; operator. This code calculates $a^2$ for every number stored in `a`.

In both these cases, the mathematical operation (add one; square) is applied to every element in the array, and a new array with *all* the results is returned.

If the mathematical expression contains two (or more) arrays, then an **element-by-element** operation is performed:

e.g. vector addition:

>```python
b = np.array( [ 5, 6, 7, 8 ] )
a + b
```

>```python
a * b
```
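It is worth comparing this with how plain Python lists respond to the same operators, because the results are quite different. A minimal sketch, with names chosen only for illustration:

```python
import numpy as np

list_a = [1, 2, 3, 4]
list_b = [5, 6, 7, 8]
array_a = np.array(list_a)
array_b = np.array(list_b)

print(list_a + list_b)    # lists: + joins the two lists end to end
print(array_a + array_b)  # arrays: + adds matching elements

print(list_a * 2)         # lists: * repeats the list
print(array_a * 2)        # arrays: * multiplies every element
```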

Let us try to calculate the square root of all the numbers in `a`:

>```python
from math import sqrt
sqrt(a)
```

This gives an error.  

Because `numpy` is not part of the standard Python library, the `sqrt` function provided by the `math` module does not know how to treat a `numpy` array of numbers. To do what we want we can use the `sqrt` function in `numpy` instead. 

>```python
np.sqrt(a)
```

`numpy` contains a great many functions for performing mathematical operations on arrays of numbers, which are all listed on the [`numpy` website](https://docs.scipy.org/doc/numpy/reference/routines.math.html).

Often, we will want to use `numpy` arrays to store experimental data. Other times we might just want a list of numbers, e.g. from 1 to 20. We could write these out to create the array:

```python
one_to_twenty = np.array( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 ] )
```

To save typing (and make your code easier to read) `numpy` contains a function for creating lists of numbers:

>```python
n = np.arange(1,21)
n
```

Notice that `arange` gives us numbers starting from 1, up to, but not including, 21.  

We can generate lists of numbers with different spacings by providing a step-size (which has a default value of 1)

>```python
m = np.arange(2,21,2)
m
```
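A related `numpy` function, not used elsewhere in this tutorial, is `np.linspace()`. Instead of a step size you give the number of points you want, and the end value is included. The sketch below puts the two side by side.

```python
import numpy as np

# arange: choose the step size; the stop value is not included
by_step = np.arange(0, 1, 0.25)

# linspace: choose the number of points; the stop value is included
by_count = np.linspace(0, 1, 5)

print(by_step)
print(by_count)
```

`np.linspace()` is often the more convenient choice when generating $x$ values for a plot, because the number of points is fixed in advance.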

# Plotting data with matplotlib<a id='matplotlib'></a>

To plot data we use another module: [`matplotlib`](http://matplotlib.org). This is a very powerful (and complicated) plotting library that can be used for quick analysis of experimental data, or to generate publication quality figures. It supports an enormous number of plot types. We are going to start with simple 2D $x,y$ plots.

>```python
import matplotlib.pyplot as plt
%matplotlib inline
```

The `import` statement loads up the part of the `matplotlib` library we will use for plotting, and lets us refer to this as `plt` for convenience later.

The `%matplotlib inline` command tells the Jupyter notebook that we want all our plots to appear &ldquo;inline&rdquo;, i.e. inside the notebook (alternatives include opening the plots in other windows, or saving them as graphics files). The `%` symbol at the start means this is a &ldquo;magic&rdquo; command for controlling the behaviour of this Jupyter notebook, and is not standard Python.

If you are using a high DPI or &ldquo;retina&rdquo; screen, you will also want to switch on high resolution figures.

>```python
%config InlineBackend.figure_format = 'retina'
```

Creating a plot uses `plt.plot()`. Remember, we have assigned `plt` as shorthand for `matplotlib.pyplot`.

>```python
print("a:",a)
print("b:",b)
plt.plot( a, b )
plt.show()
```

This can be used for plotting $y$ as a function of $x$, e.g. $y=x^2$.

>```python
x = np.array( [0, 1, 2, 3, 4, 5] )
y = x**2
plt.plot( x, y )
plt.show()
```

The default plot shows a connected line. To plot individual points, we can add a third argument to `plt.plot()` that specifies the appearance for that data set:

>```python
plt.plot( x, y, "o" )
plt.show()
```

A large number of marker types exist in `matplotlib` (a full list is [here](http://matplotlib.org/api/markers_api.html#module-matplotlib.markers)).

We can also control the line style, and combine code controlling marker and line appearance.

>```python
plt.plot( x, y, ":" ) # dotted line
plt.show()
```

>```python
plt.plot( x, y, "s:" ) # dotted line with squares
plt.show()
```

Adding axis labels and a title uses the `xlabel()`, `ylabel()`, and `title()` commands.

>```python
plt.plot( x, y, 'o-' )
plt.xlabel( 'x' )
plt.ylabel( 'y' )
plt.title( 'y = x^2' )
plt.show()
```
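Labels and titles can also contain typeset mathematics. `matplotlib` treats text between `$` signs as an equation, using a LaTeX-like syntax. The sketch below repeats the $y=x^2$ plot with mathematical labels; the `r` in front of each string tells Python to leave any backslashes alone.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5])
y = x**2

plt.plot(x, y, 'o-')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
title = plt.title(r'$y = x^2$')  # text between $ signs is typeset as an equation
plt.show()
```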

Plotting multiple data sets on the same graph uses multiple `plot()` commands. For an example, let us create three `numpy` arrays, `u`, `v`, and `w`.

>```python
# create three numpy arrays, u, v, and w
u = x + 1
v = x ** 2
w = np.sqrt( (x*2)+1 )
print('u = ',u)
print('v = ',v)
print('w = ',w)
```

Now we can plot $u$, $v$, and $w$ versus $x$ on the same figure.

>```python
plt.plot( x, u, 'o-',  label='x+1' )
plt.plot( x, v, 'x--', label='x**2' )
plt.plot( x, w, '*:',  label='sqrt((x*2)+1)' )
plt.xlabel( 'x' )
plt.ylabel( 'y' )
plt.title( 'y=f(x)')
plt.legend()
plt.show()
```

We have assigned text labels for each data set by setting `label=string` in each `plt.plot()` command. These labels are then shown in the legend produced by the `plt.legend()` command.

Nearly every part of the plot appearance can be controlled. Two further examples are line colours and thickness. A number of line colours are predefined and can be referred to with a [corresponding string](http://matplotlib.org/examples/color/named_colors.html).

In [None]:
# run this cell
plt.plot( x, u, 'o-',  label='x+1',           color='salmon',    linewidth=3 )
plt.plot( x, v, 'x--', label='x**2',          color='darkolivegreen',  linewidth=2 )
plt.plot( x, w, '*:',  label='sqrt((x*2)+1)', color='slategrey', linewidth=4 )
plt.xlabel( 'x' )
plt.ylabel( 'y' )
plt.title( 'y=f(x)')
plt.legend()
plt.show()

You can save a figure to an external file using `plt.savefig('filename')` instead of `plt.show()`.

<div class="alert alert-success">
Edit the cell above to replace <br/><br/><span style='font-family:monospace; margin-left: 40px'>plt.show()</span><br/><br/> with <br/><br/><span style='font-family:monospace; margin-left: 40px'>plt.savefig('my_figure.pdf')</span><br/><br/>Then run the cell to save the figure to the disk.
</div>
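The file extension you give to `plt.savefig()` decides the file format. For image formats such as PNG you can also set the resolution with the `dpi` argument. The sketch below is a minimal example; the file names are placeholders, and the `pathlib` check at the end is only there to confirm that a file was written.

```python
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5])
plt.plot(x, x**2, 'o-', label='x**2')
plt.legend()

# the file extension selects the format; dpi sets the resolution of image formats
plt.savefig('example_figure.png', dpi=300)
plt.savefig('example_figure.pdf')

print(Path('example_figure.png').exists())
```

PDF output is a vector format, so it stays sharp at any size and is usually a good choice for reports.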

## Creating tables using 2D numpy arrays<a id='2d_arrays'></a>

It can often be useful to collect different data sets together in a table.  
One way to do this is by combining `numpy` arrays into larger, two-dimensional, arrays.

>```python
# create an array `x` with the integers 1 to 5
x = np.arange(1,6)
# create three new arrays by performing calculations on `x`
y1 = x**2
y2 = x+3
y3 = x/2 + 1
print('x=',x)
print('y1=',y1)
print('y2=',y2)
print('y3=',y3)
```

We can combine numpy arrays into a table as **columns** using `np.column_stack()`

>```python
# combine x, y1, y2, and y3 as columns in a new table
column_table = np.column_stack( ( x, y1, y2, y3 ) )
print( column_table )
```

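Or as **rows** using `np.vstack()` (older code often uses `np.row_stack()`, an alias for the same function that recent versions of `numpy` have deprecated)

>```python
# arrange x, y1, y2, and y3 as rows in a new table
row_table = np.vstack( ( x, y1, y2, y3 ) )
print( row_table )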
```

A 1D `numpy` array can be indexed like a list.
>```python
my_1D_array = np.array( [ 1, 2, 3, 4, 5, 6] )
my_1D_array[2:5] 
# [2:5] selects from 2 jumps, up to, but not including, 5 jumps
```

A 2D `numpy` array can be treated like a [list of lists](#lists), and indexing returns selected rows.
>```python
row_table[1] # return the 2nd row (1 jump from the start)
```

Because each row is a 1D `numpy` array, we can use a second index to select a single entry.
>```python
row_table[1][3]
```

These two indices can be combined into a single bracket
>```python
row_table[1,3]
```

To select a single column, we make use of the range character `:`. Remember, for a list or 1D array, `:` lets us select a range of elements, and leaving out one of the numbers selects all elements up to the start, or end, of the list.

>```python
my_list = [ 'a', 'b', 'c', 'd', 'e' ]
my_list[1:]
```

Leaving out *both* numbers extends our selection up to both ends of the list or array.
>```python
my_list[:]
```

For a 2D array, you can think of this as &ldquo;every row&rdquo; or &ldquo;every column&rdquo;.

>```python
print( row_table )
print()
print( row_table[:,3] ) # all rows, jump 3 columns
```
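The same indexing rules can be tried out on a smaller table, where it is easier to see which entries are selected. The array `table` below is made up for illustration.

```python
import numpy as np

# a small table with 2 rows and 3 columns
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(table.shape)   # (number of rows, number of columns)
print(table[0])      # first row
print(table[0, :])   # also the first row: row 0, every column
print(table[:, 1])   # second column: every row, column 1
print(table[1, 2])   # single entry: row 1, column 2
```

The first index always refers to the row and the second to the column.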

<div class="alert alert-success">
Use a combination of row and column indexing to select <span style='font-family:monospace'>[ 6., 7., 8.]</span> from <span style='font-family:monospace'>row_table</span>
</div>

# Introduction to `pandas`<a id='pandas'></a>
It would be easier to remember what data these tables contain if we could label the different axes.  
We can do this using another module `pandas` (the name is derived from "panel data"), which is designed for manipulating tables of data in much the same way as you might use a spreadsheet application.

>```python
import pandas as pd
```

`pandas` stores tables of data as **Data Frames**
>```python
pd.DataFrame( column_table )
```

This gives us labelled rows and columns, and nicer formatting when we output the data.  

You can define your own column labels by including this information when you create the DataFrame

>```python
data = pd.DataFrame( column_table, columns = [ 'x', 'y1', 'y2', 'y3' ] )
data
```

This helps to describe *what* each column represents. You can also refer to a column label to access that data subset.

>```python
data['y1']
```

>```python
plt.plot( data['x'], data['y1'] )
plt.show()
```

`pandas` DataFrames also have their own `plot()` function, that will plot all the data in the table with the appropriate column labels.

>```python
data.plot()
```

This probably is not exactly what we wanted. The `pandas` `DataFrame.plot()` function will plot *all* of the columns, using the **index** as the $x$ values. In this case we want to plot $y_1, y_2, y_3$ against $x$. We can achieve this by rearranging the DataFrame.  

First we set the index to be the same as the column **x**.

>```python
indexed_data = data.set_index( data['x'] )
indexed_data
```

and &ldquo;drop&rdquo; the original **x** column:

>```python
final_data = indexed_data.drop( 'x', axis=1 )
final_data
```

The `axis=1` here means we want to drop a column. Using `axis=0` would try to drop a matching row.


>```python
final_data.plot()
```
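The two steps can also be combined. If `set_index()` is given the *name* of a column, rather than the column data, `pandas` moves that column into the index and removes it from the table in one go. A minimal sketch, using a smaller table built here for illustration:

```python
import numpy as np
import pandas as pd

x = np.arange(1, 6)
data = pd.DataFrame({'x': x, 'y1': x**2, 'y2': x + 3})

# passing the column *name* moves that column into the index in one step
final_data = data.set_index('x')
print(final_data)
```

Calling `final_data.plot()` on this table would then use the $x$ values for the horizontal axis. The original `data` table is left unchanged.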

# Data analysis and statistics with numpy<a id='data_analysis'></a>

`numpy` contains a lot of powerful functions for performing simple statistical analysis on our data. For example, consider the set of numbers 1 to 50:

>```python
a = np.arange(1,51)
a
```

To find the minimum and maximum values we can use `np.min()` and `np.max()`

>```python
np.min(a)
```

>```python
np.max(a)
```

To find the **sum** of all these numbers, we can use `np.sum()`

>```python
np.sum(a)
```

The **mean** of a set of numbers is defined as 

\begin{equation}
\mu = \frac{1}{N}\sum_{i=1}^N x_i
\end{equation}

which we could calculate with

>```python
np.sum(a) / len(a)
# len(a) returns the length of the array `a`
```

or with `np.mean()`

>```python
np.mean( a )
```

The **standard deviation**, $\sigma$, quantifies how much the numbers in our set deviate from the mean.

\begin{equation}
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2}
\end{equation}

where $\mu$ is the mean.

Again, we could write this out in code:

>```python
import math
sigma = math.sqrt( np.sum( ( a - np.mean(a))**2 ) / len(a) )
sigma
```

Or use the `np.std()` function

>```python
np.std(a)
```
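By default `np.std()` divides by $N$, as in the equation above, which gives the *population* standard deviation. Spreadsheets and statistics texts often use the *sample* standard deviation instead, which divides by $N-1$. In `numpy` this is selected with the `ddof` argument. The sketch below compares the two and checks each against the written-out formula.

```python
import numpy as np

a = np.arange(1, 51)

population_std = np.std(a)        # divides by N, as in the equation above
sample_std = np.std(a, ddof=1)    # divides by N - 1

# the same two quantities written out by hand
n = len(a)
squared_deviations = (a - np.mean(a))**2
by_hand_population = np.sqrt(np.sum(squared_deviations) / n)
by_hand_sample = np.sqrt(np.sum(squared_deviations) / (n - 1))

print(population_std, by_hand_population)
print(sample_std, by_hand_sample)
```

If your result differs slightly from one calculated in another program, this choice is a likely cause.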

### Linear Regression

Another commonly used data analysis technique is **linear regression**. This is used to calculate the relationship between two data sets, $X$ and $Y$, assuming that this relationship can be described by a straight line

\begin{equation}
y_i = m x_i + c.
\end{equation}

For any real data set, the data points are unlikely to all fall exactly on the same line. Linear regression is the process of calculating the line that &ldquo;best fits&rdquo; the given data.
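As a quick illustration of the idea, `numpy` can fit a straight line using `np.polyfit()` with a polynomial degree of 1. The data below are made up so that they lie close to the line $y = 2x + 1$; this is a sketch of one possible approach, not necessarily the method used for the chemical example.

```python
import numpy as np

# made-up data that lie close to the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# a degree-1 polynomial fit is a least-squares straight line
m, c = np.polyfit(x, y, 1)
print("slope m =", m)
print("intercept c =", c)

# values predicted by the fitted line
y_fit = m * x + c
```

`np.polyfit()` returns the coefficients starting with the highest power, so for a straight line the slope comes first and the intercept second.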

In your Key Skills Excel practical you used linear regression to analyse equilibrium constant data for the equilibrium between NO$_2$ and N$_2$O$_4$, to find $\Delta H_\mathrm{r}$ and $\Delta S_\mathrm{r}$ for this reaction.  

As an example of using linear regression in a Jupyter notebook, and applying this to a chemical problem, let us work through the same process in code.

#### Theory

The equilibrium reaction we have data for is

\begin{equation}
2\mathrm{NO}_2 \mathrm{(g)}\leftrightharpoons \mathrm{N}_2\mathrm{O}_4 \mathrm{(g)}
\end{equation}

Taking the equations relating $\Delta G$ to $K$ and to $\left\{\Delta H, \Delta S\right\}$:

\begin{equation}
\Delta G = -RT \ln K, \tag{1}
\end{equation}

\begin{equation}
\Delta G = \Delta H - T\Delta S; \tag{2}
\end{equation}

setting (1) equal to (2) and dividing through by $-RT$ gives

\begin{equation}
\ln K = -\frac{\Delta H}{RT}+\frac{\Delta S}{R}. \tag{3}
\end{equation}

This is in the form

\begin{equation}
y = mx + c
\end{equation}

\begin{equation}
\ln K = -\frac{\Delta H}{R}\frac{1}{T} + \frac{\Delta S}{R}, \tag{4}
\end{equation}

so plotting $\ln K$ against $\frac{1}{T}$ should give a straight line, with slope $-\frac{\Delta H}{R}$ and intercept $\frac{\Delta S}{R}$.

#### Analysis

The data from this experiment are stored in a text file in `data/equilibrium_constant.dat`, which looks like

```
# equilibrium constant data for 2 NO2 => N2O4  
# columns are: temperature (degrees Celsius), K
  
9   34.3
20  12
25  8.79
33  4.4
40  2.8
52  1.4
60  0.751
70  0.4
```

Not every line in this text file contains a data point. The first two lines describe the data set and tell us what is in each column and the units (where relevant). These &ldquo;non-data&rdquo; lines at the head of the file are usually called the **header**. Data files should always include a description of the data so that this is available for any later analysis.

To read the data into this notebook we can use `read_csv()` contained in `pandas`. The csv in `read_csv` stands for `comma-separated values`, which is a common data file format, and can be exported from spreadsheet software such as Excel. A &ldquo;comma-separated&rdquo; data file would look like:

```
x,y
0.3,2323
1.5,1442
3.7,2827
5.2,12332
```
This is easily processed by computers, and has the advantage that you can include entries with spaces, such as names of people. For pure numerical datasets, however, separating columns with **whitespace** means the original file can be easily read by humans. `read_csv()` can handle different separators between data fields (called **delimiters**), and has an optional extra setting for when the fields are separated by spaces.

>```python
data = pd.read_csv( 'data/equilibrium_constant.dat', 
                     delim_whitespace = True, 
                     comment='#', 
                     names = [ 'T (C)', 'K' ] )
```


This looks quite complicated, but we can understand the options for `read_csv()` in turn:  

First, we supply the filename as a string (including the name of the `data` directory).  

Second, we set `delim_whitespace = True`. This tells `read_csv()` that the file uses whitespace, rather than commas, to separate fields.  

Third, `comment='#'`: the data file contains comments, and these are indicated by lines that start with `#`. 

Finally, we define `names = [ 'T (C)', 'K' ]`. This provides labels for the columns in our final DataFrame, which is stored in `data`

>```python
data
```

Looking at `data` shows us 8 data points (numbered 0 to 7), where each has a temperature (in the **T (C)** column) and a measured equilibrium constant (in the **K** column).
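Recent pandas releases (2.2 onwards, at the time of writing) mark `delim_whitespace` as deprecated and suggest passing a whitespace pattern as the separator instead. The sketch below is an equivalent call using `sep=r'\s+'`. So that it runs on its own, the file contents are held in a string (`file_text`, a name introduced here for illustration) and the result is stored in `data_alt` rather than `data`.

```python
import io
import pandas as pd

# the contents of the data file shown above, held in a string
file_text = """# equilibrium constant data for 2 NO2 => N2O4
# columns are: temperature (degrees Celsius), K

9   34.3
20  12
25  8.79
33  4.4
40  2.8
52  1.4
60  0.751
70  0.4
"""

# sep=r'\s+' means "one or more whitespace characters" between fields
data_alt = pd.read_csv( io.StringIO( file_text ),
                        sep=r'\s+',
                        comment='#',
                        names = [ 'T (C)', 'K' ] )
data_alt
```

When reading from the real file, the first argument would be the filename string as before.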

Before we can plot these data, we need to convert the temperature to Kelvin, and calculate $\ln K$.

Remember that we can access a column in a DataFrame by using its label:

>```python
data['T (C)']
```

and can use this to create a *new* column.

>```python
data['T (K)'] = data['T (C)'] + 273.0
data
```

This has created a new column, with the label **T (K)**.
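The conversion above uses a rounded offset of 273.0. If more precision is wanted, the offset 273.15 is available as `scipy.constants.zero_Celsius`. A minimal sketch, using a two-row stand-in for `data` and a hypothetical column label so that the original **T (K)** column is left alone:

```python
import pandas as pd
from scipy.constants import zero_Celsius # 273.15

# two-row stand-in for the `data` DataFrame used in the text
data = pd.DataFrame( { 'T (C)': [ 9, 20 ], 'K': [ 34.3, 12.0 ] } )

# 'T (K), precise' is a label chosen here for illustration
data['T (K), precise'] = data['T (C)'] + zero_Celsius
data
```

The rest of this section keeps the 273.0 offset, so the numbers shown below correspond to that choice.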

Next, we do the same to calculate a set of $\ln K$ values:

>```python
data['ln K'] = np.log( data['K'] )
data
```

We are now ready to plot $\ln K$ versus $1/T$.

>```python
x = 1.0 / data['T (K)']
y = data['ln K']
plt.plot( x, y, 'o' )
plt.xlabel( '1/T' )
plt.ylabel( 'ln K' )
plt.show()
```

Notice that this code calculates all the inverse temperatures on the fly and stores them in `x`, without adding them to `data`. An alternative would be to create a new column in `data` holding the $1/T$ values, and then plot this directly.

The plot shows that the points lie approximately on a straight line.
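The alternative mentioned above, keeping the $1/T$ values in their own column, could look like the following sketch. It uses a two-row stand-in for `data`, and the column label `'1/T (1/K)'` is a choice made here, not something defined earlier.

```python
import numpy as np
import pandas as pd

# two-row stand-in for `data`, using values from the file above
data = pd.DataFrame( { 'T (K)': [ 282.0, 293.0 ],
                       'ln K': np.log( [ 34.3, 12.0 ] ) } )

# store the inverse temperatures as a new column
data['1/T (1/K)'] = 1.0 / data['T (K)']
data
```

The new column can then be passed to `plt.plot()` in place of `x`.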

There are a number of different ways to calculate the line of best-fit. One of the simplest is to use another module, [`scipy.stats`](https://docs.scipy.org/doc/scipy-0.18.1/reference/stats.html). As you might suspect, this contains an enormous set of statistical analysis tools. We want [`linregress()`](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.stats.linregress.html#scipy.stats.linregress), which works as follows:

>```python
from scipy.stats import linregress
linregress( x, y )
```

The output looks complicated, but it contains a set of values that includes the slope and the intercept. In fact you can treat the output like a [list](#lists), and use indexing to select a specific result.

>```python
linregress( x, y )[0] # use indexing to get the slope
```

Another option is to collect all five of the output values at once

>```python
slope, intercept, rvalue, pvalue, stderr = linregress( x, y )
print( "slope =", slope )
print( "intercept =", intercept )
```
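The result of `linregress()` can also be kept as a single object, with each value read by name rather than by position. The sketch below rebuilds `x` and `y` from the data file values so that it runs on its own; the variable names `fit` and `r_squared` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import linregress

# values from the equilibrium constant data file
T_celsius = np.array( [ 9, 20, 25, 33, 40, 52, 60, 70 ] )
K = np.array( [ 34.3, 12, 8.79, 4.4, 2.8, 1.4, 0.751, 0.4 ] )

x = 1.0 / ( T_celsius + 273.0 ) # 1/T, with T in Kelvin
y = np.log( K )                 # ln K

fit = linregress( x, y )
print( "slope =", fit.slope )
print( "intercept =", fit.intercept )

# the square of the correlation coefficient is a common measure of fit quality
r_squared = fit.rvalue**2
print( "R^2 =", r_squared )
```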

To plot the best-fit line against the original data, we need to generate a new data set according to $y=mx+c$, with $m$ and $c$ as the slope and intercept from `linregress`.

>```python
y_fit = slope * x + intercept # remember, x stores the values of 1.0 / data['T (K)']
plt.plot( x, y, 'o' )
plt.plot( x, y_fit, '-' )
plt.xlabel( '1/T' )
plt.ylabel( 'ln K' )
plt.show()
```


And because we have calculated the slope and intercept, we can derive $\Delta H$ and $\Delta S$ for the reaction.

\begin{equation}
\Delta H = -R \times \mathrm{slope}
\end{equation}

\begin{equation}
\Delta S = R \times \mathrm{intercept}
\end{equation}

To save us having to look up and type in the gas constant, $R$, we can use [`scipy.constants`](https://docs.scipy.org/doc/scipy-0.18.1/reference/constants.html#).

>```python
from scipy.constants import R # gas constant in J K^-1 mol^-1
R
```

>```python
delta_H = -R * slope    # slope = -Delta H / R
delta_S = R * intercept # intercept = Delta S / R
print( 'Delta H =', delta_H / 1000, 'kJ mol^-1' )
print( 'Delta S =', delta_S, 'J K^-1 mol^-1' )
```
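`linregress()` also returns the standard error of the slope (the `stderr` value collected earlier), which can be turned into a rough uncertainty on $\Delta H$ by multiplying by $R$. The sketch below assumes this simple scaling is an adequate error estimate, and rebuilds the fit from the data file values so that it runs on its own; `fit` and `delta_H_err` are names introduced here.

```python
import numpy as np
from scipy.constants import R # gas constant in J K^-1 mol^-1
from scipy.stats import linregress

# values from the equilibrium constant data file
T_celsius = np.array( [ 9, 20, 25, 33, 40, 52, 60, 70 ] )
K = np.array( [ 34.3, 12, 8.79, 4.4, 2.8, 1.4, 0.751, 0.4 ] )

x = 1.0 / ( T_celsius + 273.0 )
y = np.log( K )

fit = linregress( x, y )

delta_H = -R * fit.slope     # from slope = -Delta H / R
delta_H_err = R * fit.stderr # standard error of the slope, scaled by R

print( 'Delta H =', delta_H / 1000, '+/-', delta_H_err / 1000, 'kJ mol^-1' )
```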

To finish: although we have not done any particularly complicated analysis on the original data, we might still want to save our new data set, so that we do not have to repeat these steps.

Our modified data set is stored as a `pandas` DataFrame in `data`

>```python
data
```

To save this out to another file we can use [`DataFrame.to_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html).

>```python
filename = 'data/modified_equilibrium_data.dat'
data.to_csv( filename, sep=' ', index=False ) # index=False leaves out the row numbers
```

which saves the complete modified data set in plain text to `modified_equilibrium_data.dat` in the `data` directory:

```
"T (C)" K "T (K)" "ln K"
9 34.3 282.0 3.535145354171894
20 12.0 293.0 2.4849066497880004
25 8.79 298.0 2.1736147116970854
33 4.4 306.0 1.4816045409242156
40 2.8 313.0 1.0296194171811581
52 1.4 325.0 0.3364722366212129
60 0.7509999999999999 333.0 -0.28634962721800244
70 0.4 343.0 -0.916290731874155
```
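One way to check a saved file is to read it straight back and compare it with the DataFrame in memory. The sketch below does this with a two-row stand-in for `data`, and writes to an in-memory buffer instead of a real file so that it runs on its own; with a real file you would pass the filename to both calls.

```python
import io
import numpy as np
import pandas as pd

# two-row stand-in for the modified data set
data = pd.DataFrame( { 'T (C)': [ 9, 20 ], 'K': [ 34.3, 12.0 ] } )
data['T (K)'] = data['T (C)'] + 273.0
data['ln K'] = np.log( data['K'] )

# write to an in-memory buffer in the same format as the saved file
buffer = io.StringIO()
data.to_csv( buffer, sep=' ', index=False )

# rewind the buffer and read it back; the first line supplies the column labels
buffer.seek( 0 )
reloaded = pd.read_csv( buffer, sep=' ' )

# check that labels and values survived the round trip
print( list( reloaded.columns ) == list( data.columns ) )
print( np.allclose( reloaded.to_numpy(), data.to_numpy() ) )
```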

As a final note, it is important that you save *modified* data to a different filename from your *original* data, to prevent overwriting it. In this case our modified data set includes the original values, but without the original raw data file there would be no way of checking this in the future.

# Further Reading<a id='further_reading'></a>

Learning Python:
- [Codecademy](https://www.codecademy.com/learn/python)
- [Learn Python the Hard Way](https://learncodethehardway.org/python/)

pandas:
- [pandas Cookbook on github](https://github.com/jvns/pandas-cookbook)

# Additional info<a id='additional_info'></a>

## Installing Jupyter on your own computer

[instructions](http://jupyter.readthedocs.io/en/latest/install.html)

[Anaconda Installers](https://www.continuum.io/downloads)

### macOS — alternate installation from the terminal

Use [homebrew](http://brew.sh) to install `python3`:

```
brew install python3
```

Use `pip3` to install jupyter:

```
pip3 install jupyter
```

Additional modules can also be installed with `pip3`, e.g.

```
pip3 install pandas
```