# Contents

- [Plotting data with matplotlib](#matplotlib)
- [Creating Tables with 2D numpy arrays](#2d_arrays)
- [Introduction to pandas](#pandas)
- [Managing datasets with pandas](#pandas2)
- [Introductory data analysis with numpy](#data_analysis)

# Plotting data with matplotlib<a id='matplotlib'></a>

To plot data we use another module: [`matplotlib`](http://matplotlib.org). This is a very powerful (and complicated) plotting library that can be used for quick analysis of experimental data, or to generate publication-quality figures. It supports an enormous number of plot types. We are going to start with simple 2D $x,y$ plots.

>```python
import matplotlib.pyplot as plt
%matplotlib inline
```

The `import` statement loads up the part of the `matplotlib` library we will use for plotting, and lets us refer to this as `plt` for convenience later.

The `%matplotlib inline` command tells the Jupyter notebook that we want all our plots to appear &ldquo;inline&rdquo;, i.e. inside the notebook (alternatives include opening the plots in other windows, or saving them as graphics files). The `%` symbol at the start means this is a &ldquo;magic&rdquo; command for controlling the behaviour of this Jupyter notebook, and is not standard Python.

If you are using a high DPI or &ldquo;retina&rdquo; screen, you will also want to switch on high resolution figures.

>```python
%config InlineBackend.figure_format = 'retina'
```

Creating a plot uses `plt.plot()`. Remember, we have assigned `plt` as shorthand for `matplotlib.pyplot`.

>```python
print("a:",a)
print("b:",b)
plt.plot( a, b )
plt.show()
```
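The cell above assumes that `a` and `b` already hold numerical data of equal length. As a minimal sketch, with made-up values chosen purely for illustration (they are not course data), the following cell is self-contained:

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical example data: any two sequences of equal length will do
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

print("a:", a)
print("b:", b)

# plt.plot() returns a list with one Line2D object per plotted data set
lines = plt.plot(a, b)
plt.show()
```

The first argument supplies the $x$ values and the second the $y$ values.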

This can be used for plotting $y$ as a function of $x$, e.g. $y=x^2$.

>```python
import numpy as np # not needed if numpy has already been imported
x = np.array( [0, 1, 2, 3, 4, 5] )
y = x**2
plt.plot( x, y )
plt.show()
```

The default plot shows a connected line. To plot individual points, we can add a third argument to `plt.plot()` that specifies the appearance for that data set:

>```python
plt.plot( x, y, "o" )
plt.show()
```

A large number of marker types exist in `matplotlib` (a full list is [here](http://matplotlib.org/api/markers_api.html#module-matplotlib.markers)).

We can also control the line style, and combine code controlling marker and line appearance.

>```python
plt.plot( x, y, ":" ) # dotted line
plt.show()
```

>```python
plt.plot( x, y, "s:" ) # dotted line with squares
plt.show()
```
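The format string is a shorthand. As a rough sketch of what it stands for, the same appearance can be requested with the `marker` and `linestyle` keyword arguments of `plt.plot()`. The variable names `line_short` and `line_long`, and the offset of 5 that keeps the two curves apart, are only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5])
y = x**2

# format-string shorthand: square markers joined by a dotted line
line_short, = plt.plot(x, y, "s:")
# the same appearance spelled out with keyword arguments
line_long, = plt.plot(x, y + 5, marker="s", linestyle=":")
plt.show()

# both lines report the same marker and line style
print(line_short.get_marker(), line_long.get_marker())
print(line_short.get_linestyle(), line_long.get_linestyle())
```

The keyword form is longer, but can be easier to read when several appearance options are combined.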

Adding axis labels and a title uses the `xlabel()`, `ylabel()`, and `title()` commands.

>```python
plt.plot( x, y, 'o-' )
plt.xlabel( 'x' )
plt.ylabel( 'y' )
plt.title( 'y = x^2' )
plt.show()
```

Plotting multiple data sets on the same graph uses multiple `plot()` commands. For an example, let us create three `numpy` arrays, `u`, `v`, and `w`.

>```python
# create three numpy arrays, u, v, and w
u = x + 1
v = x ** 2
w = np.sqrt( (x*2)+1 )
print('u = ',u)
print('v = ',v)
print('w = ',w)
```

Now we can plot $u$, $v$, and $w$ versus $x$ on the same figure.

>```python
plt.plot( x, u, 'o-',  label='x+1' )
plt.plot( x, v, 'x--', label='x**2' )
plt.plot( x, w, '*:',  label='sqrt((x*2)+1)' )
plt.xlabel( 'x' )
plt.ylabel( 'y' )
plt.title( 'y=f(x)')
plt.legend()
plt.show()
```

We have assigned text labels for each data set by setting `label=string` in each `plt.plot()` command. These labels are then shown in the legend produced by the `plt.legend()` command.

Nearly every part of the plot appearance can be controlled. Two further examples are line colours and thickness. A number of line colours are predefined and can be referred to with a [corresponding string](http://matplotlib.org/examples/color/named_colors.html).

>```python
plt.plot( x, u, 'o-',  label='x+1',           color='salmon',    linewidth=3 )
plt.plot( x, v, 'x--', label='x**2',          color='darkolivegreen',  linewidth=2 )
plt.plot( x, w, '*:',  label='sqrt((x*2)+1)', color='slategrey', linewidth=4 )
plt.xlabel( 'x' )
plt.ylabel( 'y' )
plt.title( 'y=f(x)')
plt.legend()
plt.show()
```

You can save a figure to an external file using `plt.savefig('filename')` instead of `plt.show()`.

<div class="alert alert-success">
Edit the cell above to replace <br/><br/><span style='font-family:monospace; margin-left: 40px'>plt.show()</span><br/><br/> with <br/><br/><span style='font-family:monospace; margin-left: 40px'>plt.savefig('my_figure.pdf')</span><br/><br/>Then run the cell to save the figure to the disk.
</div>

# Creating tables using 2D numpy arrays<a id='2d_arrays'></a>

It can often be useful to collect different data sets together in a table.  
One way to do this is by combining `numpy` arrays into larger, two-dimensional, arrays.

>```python
# create an array `x` with the integers 1 to 5
x = np.arange(1,6)
# create three new arrays by performing calculations on `x`
y1 = x**2
y2 = x+3
y3 = x/2 + 1
print('x=',x)
print('y1=',y1)
print('y2=',y2)
print('y3=',y3)
```

We can combine numpy arrays into a table as **columns** using `np.column_stack()`

>```python
# combine x, y1, y2, and y3 as columns in a new table
column_table = np.column_stack( ( x, y1, y2, y3 ) )
print( column_table )
```

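Or as **rows** using `np.vstack()` (older code often uses `np.row_stack()`, an alias for the same function that is deprecated in recent `numpy` releases)

>```python
# arrange x, y1, y2, and y3 as rows in a new table
row_table = np.vstack( ( x, y1, y2, y3 ) )
print( row_table )
```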
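To see how the two layouts relate, it can help to compare their shapes. The sketch below uses only `x` and `y1`, and builds the row table with `np.vstack()`, which stacks rows in the same way as `np.row_stack()`.

```python
import numpy as np

x = np.arange(1, 6)
y1 = x**2

column_table = np.column_stack((x, y1))  # each input array becomes a column
row_table = np.vstack((x, y1))           # each input array becomes a row

# .shape is reported as (number of rows, number of columns)
print(column_table.shape)
print(row_table.shape)

# .T gives the transpose, swapping rows and columns
print(np.array_equal(column_table, row_table.T))
```

The column table is the transpose of the row table, so either can be converted to the other with `.T`.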

A 1D `numpy` array can be indexed like a list.
>```python
my_1D_array = np.array( [ 1, 2, 3, 4, 5, 6] )
my_1D_array[2:5] 
# [2:5] selects from 2 jumps, up to, but not including, 5 jumps
```

A 2D `numpy` array can be treated like a [list of lists](lists), and indexing returns selected rows.
>```python
row_table[1] # return the 2nd row (1 jump from the start)
```

Because each row is a 1D `numpy` array, we can use a second index to select a single entry.
>```python
row_table[1][3]
```

These two indices can be combined into a single bracket
>```python
row_table[1,3]
```

To select a single column, we make use of the range character `:`. Remember, for a list or 1D array, `:` lets us select a range of elements, and leaving out one of the numbers selects all elements up to the start, or end, of the list.

>```python
my_list = [ 'a', 'b', 'c', 'd', 'e' ]
my_list[1:]
```

Leaving out *both* numbers extends our selection up to both ends of the list or array.
>```python
my_list[:]
```

For a 2D array, you can think of this as &ldquo;every row&rdquo; or &ldquo;every column&rdquo;.

>```python
print( row_table )
print()
print( row_table[:,3] ) # all rows, jump 3 columns
```
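Row and column selections can also be combined in one bracket. The following sketch uses a small made-up table (`table` is a hypothetical name, unrelated to `row_table`) so that each selection is easy to check by eye.

```python
import numpy as np

# a small 3x4 table used only for illustration
table = np.arange(12).reshape(3, 4)
print(table)

print(table[1, :])      # row 1, every column
print(table[:, 2])      # every row, column 2
print(table[0:2, 1:3])  # rows 0 and 1, columns 1 and 2
```

The first index always refers to rows and the second to columns, and each can be a single number or a range.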

<div class="alert alert-success">
Use a combination of row and column indexing to select <span style='font-family:monospace'>[ 6., 7., 8.]</span> from <span style='font-family:monospace'>row_table</span>
</div>

# Introduction to `pandas`<a id='pandas'></a>
It would be easier to remember what data these tables contain if we could label the different axes.  
We can do this using another module `pandas` (the name is derived from "panel data"), which is designed for manipulating tables of data in much the same way as you might use a spreadsheet application.

>```python
import pandas as pd
```

`pandas` stores tables of data as **DataFrames**
>```python
pd.DataFrame( column_table )
```

This gives us labelled rows and columns, and nicer formatting when we output the data.  

You can define your own column labels by including this information when you create the DataFrame

>```python
data = pd.DataFrame( column_table, columns = [ 'x', 'y1', 'y2', 'y3' ] )
data
```

This helps to describe *what* each column represents. You can also refer to a column label to access that data subset.

>```python
data['y1']
```

>```python
plt.plot( data['x'], data['y1'] )
plt.show()
```

`pandas` DataFrames also have their own `plot()` function, that will plot all the data in the table with the appropriate column labels.

>```python
data.plot()
```

This is probably not exactly what we wanted. The `pandas` `DataFrame.plot()` function will plot *all* of the columns, using the **index** as the $x$ values. In this case we want to plot $y_1, y_2, y_3$ against $x$. We can achieve this by rearranging the DataFrame.  

First we set the index to be the same as the column **x**.

>```python
indexed_data = data.set_index( data['x'] )
indexed_data
```

and &ldquo;drop&rdquo; the original **x** column:

>```python
final_data = indexed_data.drop( 'x', axis=1 )
final_data
```

The `axis=1` here means we want to drop a column. Using `axis=0` would try to drop a matching row.


>```python
final_data.plot()
```
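The two steps can also be combined. As a sketch, passing the column *label* (rather than the column itself) to `set_index()` moves that column into the index in a single call. The names `two_step` and `one_step` are only for illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 6)
column_table = np.column_stack((x, x**2, x + 3, x / 2 + 1))
data = pd.DataFrame(column_table, columns=['x', 'y1', 'y2', 'y3'])

# two steps: set the index, then drop the original column
two_step = data.set_index(data['x']).drop('x', axis=1)

# one step: the labelled column is moved into the index
one_step = data.set_index('x')

print(one_step.equals(two_step))
print(one_step)
```

Either version can then be plotted with `.plot()`, which uses the index for the $x$ values.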

# Data analysis and statistics with numpy<a id='data_analysis'></a>

`numpy` contains a lot of powerful functions for performing simple statistical analysis on our data. For example, consider the set of numbers 1 to 50:

>```python
a = np.arange(1,51)
a
```

To find the minimum and maximum values we can use `np.min()` and `np.max()`

>```python
np.min(a)
```

>```python
np.max(a)
```

To find the **sum** of all these numbers, we can use `np.sum()`

>```python
np.sum(a)
```

The **mean** of a set of numbers is defined as 

\begin{equation}
\mu = \frac{\sum_{i=1}^N x_i}{N}
\end{equation}

which we could calculate with

>```python
np.sum(a) / len(a)
# len(a) returns the length of the array `a`
```

or with `np.mean()`

>```python
np.mean( a )
```

The **standard deviation**, $\sigma$, quantifies how much the numbers in our set deviate from the mean.

\begin{equation}
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2}
\end{equation}

where $\mu$ is the mean.

Again, we could write this out in code:

>```python
import math
sigma = math.sqrt( np.sum( ( a - np.mean(a))**2 ) / len(a) )
sigma
```

Or use the `np.std()` function

>```python
np.std(a)
```
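By default `np.std()` divides by $N$, matching the equation above (the &ldquo;population&rdquo; standard deviation). If you need the &ldquo;sample&rdquo; standard deviation, which divides by $N-1$, `np.std()` accepts a `ddof` argument. A short sketch comparing both against the written-out formulas:

```python
import numpy as np

a = np.arange(1, 51)
squared_deviations = (a - np.mean(a))**2

# population standard deviation: divide by N (the np.std default, ddof=0)
population_sigma = np.sqrt(np.sum(squared_deviations) / len(a))

# sample standard deviation: divide by N - 1 (ddof=1)
sample_sigma = np.sqrt(np.sum(squared_deviations) / (len(a) - 1))

print(np.isclose(population_sigma, np.std(a)))
print(np.isclose(sample_sigma, np.std(a, ddof=1)))
```

Spreadsheet functions and other libraries do not all use the same default, so it is worth checking which definition you are getting.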

### Linear Regression

Another commonly used data analysis technique is **linear regression**. This is used to calculate the relationship between two data sets, $X$ and $Y$, assuming that this relationship can be described by a straight line

\begin{equation}
y_i = m x_i + c.
\end{equation}

For any real data set, the data points are unlikely to all fall exactly on the same line. Linear regression is the process of calculating the line that &ldquo;best fits&rdquo; the given data.

In your Key Skills Excel practical you used linear regression to analyse equilibrium constant data for the equilibrium between NO$_2$ and N$_2$O$_4$, to find $\Delta H_\mathrm{r}$ and $\Delta S_\mathrm{r}$ for this reaction.  

As an example of using linear regression in a Jupyter notebook, and applying this to a chemical problem, let us work through the same process in code.

#### Theory

The equilibrium reaction we have data for is

\begin{equation}
2\mathrm{NO}_2 \mathrm{(g)}\leftrightharpoons \mathrm{N}_2\mathrm{O}_4 \mathrm{(g)}
\end{equation}

Taking the equations relating $\Delta G$ to $K$ and to $\left\{\Delta H, \Delta S\right\}$:

\begin{equation}
\Delta G = -RT \ln K, \tag{1}
\end{equation}

\begin{equation}
\Delta G = \Delta H - T\Delta S; \tag{2}
\end{equation}

and setting the two expressions equal, $-RT\ln K = \Delta H - T\Delta S$, then dividing by $-RT$, we get

\begin{equation}
\ln K = -\frac{\Delta H}{RT}+\frac{\Delta S}{R}. \tag{3}
\end{equation}

This is in the form

\begin{equation}
y = mx + c
\end{equation}

\begin{equation}
\ln K = -\frac{\Delta H}{R}\frac{1}{T} + \frac{\Delta S}{R}, \tag{4}
\end{equation}

and plotting $\ln K$ against $\frac{1}{T}$ should give a straight line, with slope $-\frac{\Delta H}{R}$ and intercept $\frac{\Delta S}{R}$.

#### Analysis

The data from this experiment are stored in a text file in `data/equilibrium_constant.dat`, which looks like

```
# equilibrium constant data for 2 NO2 => N2O4  
# columns are: temperature (degrees Celsius), K
  
9   34.3
20  12
25  8.79
33  4.4
40  2.8
52  1.4
60  0.751
70  0.4
```

Not every line in this text file contains a data point. The first two lines describe the data set and tell us what is in each column and the units (where relevant). These &ldquo;non-data&rdquo; lines at the head of the file are usually called the **header**. Data files should always include a description of the data so that this is available for any later analysis.

To read the data into this notebook we can use `read_csv()` contained in `pandas`. The csv in `read_csv` stands for `comma-separated values`, which is a common data file format, and can be exported from spreadsheet software such as Excel. A &ldquo;comma-separated&rdquo; data file would look like:

```
x,y
0.3,2323
1.5,1442
3.7,2827
5.2,12332
```
This is easily processed by computers, and has the advantage that you can include entries with spaces, such as names of people. For pure numerical datasets, however, separating columns with **whitespace** means the original file can be easily read by humans. `read_csv()` can handle different separators between data fields (called **delimiters**), and has an optional extra setting for when the fields are separated by spaces.

>```python
data = pd.read_csv( 'data/equilibrium_constant.dat', 
                     delim_whitespace = True, 
                     comment='#', 
                     names = [ 'T (C)', 'K' ] )
```


This looks quite complicated, but we can understand the options for `read_csv()` in turn:  

First, we supply the filename as a string (including the name of the `data` directory).  

Second, we set `delim_whitespace = True`. This does what you would expect: we are telling `read_csv()` that the file will use whitespace to separate fields. (Recent `pandas` releases deprecate this option in favour of the equivalent `sep=r'\s+'`.)

Third, `comment='#'`: the data file contains comments, and these are indicated by lines that start with `#`. 

Finally, we define `names = [ 'T (C)', 'K' ]`. This provides labels for the columns in our final DataFrame, which is stored in `data`.

>```python
data
```
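To experiment with these options without needing the data file, one approach is to read from a string instead. The sketch below copies the file contents shown earlier into a string and wraps it in `io.StringIO`, which `read_csv()` accepts in place of a filename. It uses `sep=r'\s+'` (one or more whitespace characters) as the delimiter, which should behave the same as `delim_whitespace = True` here; the name `file_contents` is only for illustration.

```python
import io
import pandas as pd

# the contents of the data file, copied into a string
file_contents = """# equilibrium constant data for 2 NO2 => N2O4
# columns are: temperature (degrees Celsius), K

9   34.3
20  12
25  8.79
33  4.4
40  2.8
52  1.4
60  0.751
70  0.4
"""

data = pd.read_csv(io.StringIO(file_contents),
                   sep=r'\s+',     # fields separated by whitespace
                   comment='#',    # ignore anything after a '#'
                   names=['T (C)', 'K'])
print(data)
```

The two header lines and the blank line are skipped, leaving only the eight data rows.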

Looking at `data` shows us 8 data points (numbered 0 to 7), where each has a temperature (in the **T (C)** column) and a measured equilibrium constant (in the **K** column).

Before we can plot these data, we need to convert the temperature to Kelvin, and calculate $\ln K$.

Remember that we can access a column in a DataFrame by using its label:

>```python
data['T (C)']
```

and can use this to create a *new* column.

>```python
data['T (K)'] = data['T (C)'] + 273.0
data
```

This has created a new column, with the label **T (K)**.

Next, we do the same to calculate a set of $\ln K$ values:

>```python
data['ln K'] = np.log( data['K'] )
data
```

We are now ready to plot $\ln K$ versus $1/T$.

>```python
x = 1.0 / data['T (K)']
y = data['ln K']
plt.plot( x, y, 'o' )
plt.xlabel( '1/T' )
plt.ylabel( 'ln K' )
plt.show()
```

The plot is approximately a straight line.

Notice that this code calculates all the inverse temperatures in place. An alternative way to do this would be to generate a new column in `data` with all the $1/T$ values, and then plot this directly.
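As a sketch of storing the inverse temperatures as a column, the cell below rebuilds the temperature columns from the values in the data file so that it runs on its own. The column label `'1/T (K^-1)'` is an arbitrary choice for illustration.

```python
import pandas as pd

# temperatures from the data file, converted to Kelvin as before
data = pd.DataFrame({'T (C)': [9, 20, 25, 33, 40, 52, 60, 70]})
data['T (K)'] = data['T (C)'] + 273.0

# store the inverse temperatures as a new column
data['1/T (K^-1)'] = 1.0 / data['T (K)']
print(data)
```

The new column can then be passed to `plt.plot()` in the same way as any other column.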

There are a number of different ways to calculate the line of best-fit. One of the simplest is to use another module, [`scipy.stats`](https://docs.scipy.org/doc/scipy-0.18.1/reference/stats.html). As you might suspect, this contains an enormous set of statistical analysis tools. We want [`linregress()`](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.stats.linregress.html#scipy.stats.linregress), which works as follows:

>```python
from scipy.stats import linregress
linregress( x, y )
```

The output looks complicated, but it contains a set of values that includes the slope and the intercept. In fact you can treat the output like a [list](lists), and use indexing to select a specific result.

>```python
linregress( x, y )[0] # use indexing to get the slope
```

Another option is to collect all five of the output values at once

>```python
slope, intercept, rvalue, pvalue, stderr = linregress( x, y )
print( "slope =", slope )
print( "intercept =", intercept )
```
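The result of `linregress()` can also be kept as a single object and its values read by name, which avoids having to remember their order. The sketch below uses synthetic data lying exactly on $y = 2x + 1$, made up for illustration, so the expected slope and intercept are known in advance.

```python
import numpy as np
from scipy.stats import linregress

# synthetic data lying exactly on y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0

result = linregress(x_demo, y_demo)
print("slope =", result.slope)
print("intercept =", result.intercept)
# rvalue is the correlation coefficient; its square measures goodness of fit
print("r^2 =", result.rvalue**2)
```

For real data, an $r^2$ value close to 1 indicates that a straight line describes the data well.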

To plot the best-fit line against the original data, we need to generate a new data set according to $y=mx+c$, with $m$ and $c$ as the slope and intercept from `linregress`.

>```python
y_fit = slope * x + intercept # remember, x is an array storing 1.0 / data['T (K)']
plt.plot( x, y, 'o' )
plt.plot( x, y_fit, '-' )
plt.xlabel( '1/T' )
plt.ylabel( 'ln K' )
plt.show()
```
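As a cross-check on the fit, a straight line can also be fitted with `np.polyfit()`, a different function that fits a polynomial of a chosen degree (degree 1 is a straight line). This sketch rebuilds `x` and `y` from the values in the data file so that it is self-contained; `fit`, `poly_slope` and `poly_intercept` are illustrative names.

```python
import numpy as np
from scipy.stats import linregress

# temperatures (converted to Kelvin) and equilibrium constants from the data file
T = np.array([9, 20, 25, 33, 40, 52, 60, 70]) + 273.0
K = np.array([34.3, 12, 8.79, 4.4, 2.8, 1.4, 0.751, 0.4])
x = 1.0 / T
y = np.log(K)

fit = linregress(x, y)

# for degree 1, np.polyfit returns the coefficients [slope, intercept]
poly_slope, poly_intercept = np.polyfit(x, y, 1)

print(np.isclose(fit.slope, poly_slope, rtol=1e-6))
print(np.isclose(fit.intercept, poly_intercept, rtol=1e-6))
```

Both functions perform a least-squares fit, so they should agree to within rounding error.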


And because we have calculated the slope and intercept, we can derive $\Delta H$ and $\Delta S$ for the reaction. Combining $\Delta G = -RT \ln K$ with $\Delta G = \Delta H - T\Delta S$ gives $\ln K = -\frac{\Delta H}{R}\frac{1}{T} + \frac{\Delta S}{R}$, so the slope is $-\Delta H/R$ and the intercept is $\Delta S/R$:

\begin{equation}
\Delta H = -R \times \mathrm{slope}
\end{equation}

\begin{equation}
\Delta S = R \times \mathrm{intercept}
\end{equation}

To save us having to look up and type in the gas constant, $R$, we can use [`scipy.constants`](https://docs.scipy.org/doc/scipy-0.18.1/reference/constants.html#).

>```python
from scipy.constants import R # gas constant in J K^-1 mol^-1
R 
```

>```python
delta_H = -R * slope # slope = -Delta H / R
delta_S = R * intercept # intercept = Delta S / R
print( 'Delta H =', delta_H / 1000, 'kJ mol^-1' )
print( 'Delta S =', delta_S, 'J K^-1 mol^-1' )
```

To finish, although we have not done any particularly complicated analysis on the original data, we might still want to save our new data set so that we do not have to repeat these steps.

Our modified data set is stored as a `pandas` DataFrame in `data`

>```python
data
```

To save this out to another file we can use [`DataFrame.to_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html).

>```python
filename = 'data/modified_equilibrium_data.dat'
data.to_csv( filename, sep=' ', index=False ) # index=False leaves out the row numbers
```

which saves the complete modified data set in plain text to `modified_equilibrium_data.dat` in the `data` directory:

```
"T (C)" K "T (K)" "ln K"
9 34.3 282.0 3.535145354171894
20 12.0 293.0 2.4849066497880004
25 8.79 298.0 2.1736147116970854
33 4.4 306.0 1.4816045409242156
40 2.8 313.0 1.0296194171811581
52 1.4 325.0 0.3364722366212129
60 0.7509999999999999 333.0 -0.28634962721800244
70 0.4 343.0 -0.916290731874155
```
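A quick way to check a saved file is to read it back and compare it with the original. The sketch below does this with a two-row stand-in for the data set and an in-memory buffer (`io.StringIO`) in place of a file on disk, so that nothing is written to your `data` directory. `index=False` is used so that the row numbers are not written out.

```python
import io
import numpy as np
import pandas as pd

# a two-row stand-in for the modified data set
data = pd.DataFrame({'T (C)': [9, 20], 'K': [34.3, 12.0]})
data['T (K)'] = data['T (C)'] + 273.0
data['ln K'] = np.log(data['K'])

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
data.to_csv(buffer, sep=' ', index=False)

buffer.seek(0)  # rewind to the start before reading back
reloaded = pd.read_csv(buffer, sep=' ')
print(reloaded)
print(np.allclose(reloaded.values, data.values))
```

Column labels that contain spaces are written inside quotation marks, which is how `read_csv()` can tell them apart from the separators.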

As a final note, it is important that you save modified data out to a *different* filename to your original data, to prevent overwriting it. In this case, our modified data set still includes the original data, but without the original raw data, there would be no way of checking this in the future.