<img style="float: right;" src="http://www2.le.ac.uk/liscb1.jpg">
# Analysing data

By: TJ Ragan  
Data: Software Carpentry

Python's real power lies in its libraries.  Implementing new data analysis algorithms or strategies yourself can take anywhere from hours to months.  However, you're probably not the first person to try to do most things, and if anyone else has tried it in Python, they've probably made a library so you can do it too.  The most common libraries for data analysis in Python are *numpy*, *pandas* and *matplotlib*.

We have some data from an inflammation study stored in `.csv` files in the `data` directory.  Each row represents one patient, and each column represents their inflammation score as the study progressed.  Each file is from a different group of patients.

Let's try to analyse the data a few different ways:

## 1. Analysis using just python and *matplotlib*

We start by getting a list of the files, using Python's *glob* library.  The function we need from it is also called `glob`:

In [None]:
import glob
glob.glob('../../data/*.csv')

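Now that we have a list of files, let's look at the first file.  `glob` returns its results in arbitrary order, so we sort the list to be sure of getting the same file every time:

In [None]:
data_filenames = sorted(glob.glob('../../data/*.csv'))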
first_filename = data_filenames[0]
print(first_filename)

We can look at the data using an IPython command to list the file contents, just like we would on the command line:  
*Note that you can click on the area to the left of the output to shrink it down.*

In [None]:
%cat ../../data/inflammation-01.csv

Now that we see what the data look like, we can formulate a strategy for analysing them:
1. Open the file
2. Read each line
3. Split the values at the commas
4. Convert each value into an integer
5. Add that patient's data to your study

Files are funny things.  If you open a file and forget to close it, bad things happen.  If your program crashes half-way through, bad things happen.  If you try to open it more than once, bad things happen.  Python has a trick that takes care of all of this for you:  `with open( ) as f:`

In [None]:
study_participants = []

# Open the file
with open(first_filename) as file:
    # Read each line
    for line in file:
        # Split the values at the commas
        split_line = line.split(',')
        inflamation_scores = []
        for inflamation_score in split_line:
            # Convert each value into an integer
            inflamation_scores.append(int(inflamation_score))
        # Add that patient's data to the study
        study_participants.append(inflamation_scores)
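
To see what `with` buys us, here is a small sketch of the "program crashes half-way through" case.  It does not use the workshop data: the file name `demo.csv` and its contents are made up for illustration, and the bad value is there on purpose.

```python
import os
import tempfile

# Write a tiny throwaway CSV so the example does not depend on the workshop data.
# The file name and contents are made up for illustration.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.csv')
with open(demo_path, 'w') as f:
    f.write('0,1,2\n3,oops,5\n')

rows = []
try:
    with open(demo_path) as f:
        for line in f:
            # int('oops') raises ValueError on the second line
            rows.append([int(value) for value in line.split(',')])
except ValueError as error:
    print('Parsing failed:', error)

# The with block closed the file even though the loop crashed part-way through
print('File closed?', f.closed)
print('Rows read before the error:', rows)
```

The file is closed as soon as the `with` block ends, whether it ends normally or because of an error.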

Now we can look at the data:

In [None]:
for participant in study_participants:
    print(participant)

We can also ask basic questions, like what the minimum, average, and maximum inflammation values are for each participant.

In [None]:
for participant in study_participants:
    minimum_inflamation = min(participant)
    average_inflamation = sum(participant) / len(participant)
    maximum_inflamation = max(participant)
    print('min:', minimum_inflamation, 'avg:', average_inflamation, 'max:', maximum_inflamation)

**What?  Everyone's got a minimum score of 0!**

Now we can plot each participant using *matplotlib* to see if we can see what's going on.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
for participant in study_participants:
    plt.plot(participant);  # The semicolon keeps the notebook from printing out extra information

1.  Looks very busy.
2. Looks very triangular!  

Now we could try to re-orient our list of lists to look at things along the other axis.  But we're really talking about 2D data here, so why not use a library meant to work with 2D (or nD) data?

## 2. Analysis using numpy and matplotlib

In [None]:
import numpy as np

In [None]:
study_participants_array = np.array(study_participants)
study_participants_array

Numpy arrays have all sorts of nice features.  For example, we can easily find out what the shape of the array is:

In [None]:
study_participants_array.shape

So we have 60 participants with 40 observations each.  It turns out that reading these files into Numpy is common enough that we don't need those nested *for* loops to do it - someone's already done it for us:

In [None]:
data = np.loadtxt(fname=first_filename, delimiter=',')
data.shape
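
One thing to be aware of is that `np.loadtxt` gives back floating point numbers unless told otherwise, whereas our hand-written loop produced integers.  The sketch below uses two made-up rows held in memory (via `io.StringIO`) rather than one of the study files, so it can be run anywhere.

```python
import io
import numpy as np

# Two made-up rows standing in for a data file; StringIO lets loadtxt read
# them as if they came from disk.
fake_file = io.StringIO('0,1,2\n3,4,5\n')
as_floats = np.loadtxt(fake_file, delimiter=',')
print(as_floats.dtype, as_floats.shape)

# Rewind and ask for integers instead, to match the hand-parsed version
fake_file.seek(0)
as_ints = np.loadtxt(fake_file, delimiter=',', dtype=int)
print(as_ints.dtype.kind, as_ints.tolist())
```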

#### Slicing
One of the nice features of numpy arrays is that we can easily select subsets of the data.  A 60 x 40 array of data is too big to look at easily, which is why we had all those `...` above, so we'll make a smaller one for this.

In [None]:
array = np.array([[1,2,3], [4,5,6], [7,8,9], [10,11,12]])
print(array)
print(array.shape)

Notice that arrays are oriented rows x columns.  This is the standard way of representing matrices in linear algebra (one of the primary uses of Numpy), but it can get a little confusing.  In the same way we could slice lists or tuples, we can slice nD arrays.  The only difference here is that we can work directly in nD.  

It's worth noting here that you must specify the rows to take.  If you leave out the columns, numpy assumes you want all of them.

In [None]:
# Both of these slices do the same thing
print(array[0:2])
print()
print(array[0:2, :])

In [None]:
# If you want all the rows, but a subset of columns, you have to be specific:
print(array[:, 0:2])
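
For comparison, here is a rough sketch of the same column selection done on the list-of-lists structure from section 1 and on an array.  The variable names are just for this example.

```python
import numpy as np

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
small = np.array(rows)

# With nested lists we have to loop over the rows to pick out columns...
first_two_columns_list = [row[0:2] for row in rows]

# ...whereas the array takes a row slice and a column slice in one go
first_two_columns_array = small[:, 0:2]

print(first_two_columns_list)
print(first_two_columns_array.tolist())

# A single column comes back as a 1D array
second_column = small[:, 1]
print(second_column)
```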

#### Plotting  
Because numpy is designed to work with arrays of values, we can easily remake the plot above, without the loop.  

By default, `plt.plot` draws one line per column of the array, so we get 40 lines (one for each observation), each with 60 data points along the x-axis for the 60 participants.

In [None]:
plt.plot(data);

What a mess.  What we wanted was a plot across the observations, not participants.  Fortunately, we can just swap the axes of the array using a transpose.  In the same way nD arrays carry around their shape in the `.shape` attribute, they carry around their transpose in the `.T` attribute.

In [None]:
plt.plot(data.T);
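
If it's not obvious what the transpose has done, it can help to try it on an array small enough to read.  This is only a sketch with made-up numbers:

```python
import numpy as np

small = np.array([[1, 2, 3], [4, 5, 6]])  # 2 rows, 3 columns
transposed = small.T                      # 3 rows, 2 columns

print(small.shape)
print(transposed.shape)
print(transposed)

# Element (row i, column j) of the original is element (j, i) of the transpose
print(small[0, 2] == transposed[2, 0])
```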

### EXERCISE 1 - Min, mean, max

1. Extract the data for the second patient
2. Calculate the minimum, mean, and maximum inflammation scores for that patient
3. Using the `axis=` parameter, calculate the minimum, mean, and maximum inflammation scores for each observation  
    *tip: since there are 60 patients with 40 observations each, you can check that you're working observation-wise and not patient-wise* 
4. Plot the minimum, average, and maximum inflammation scores per observation

__BONUS__

Ask Google how to add a figure legend to your plot.
1. Search Google for "add figure legend to matplotlib"
2. Choose the first link to `stackoverflow.com`
3. Look at the top answer, which generally should have a green check mark 

In [None]:
second_patient = data[1]
assert len(second_patient) == 40
print('Second patient min:', second_patient.min(), 'avg:', second_patient.mean(), 'max:', second_patient.max())

data_min = data.min(axis=0)
assert len(data_min) == 40
data_mean = data.mean(axis=0)
data_max = data.max(axis=0)

plt.plot(data_max, label='max')
plt.plot(data_mean, label='avg')
plt.plot(data_min, label='min')

plt.legend(loc='upper left');

Ok, so it's time to call the IRB and report someone for faking (badly) the data.  
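
If the `axis=` argument in the solution above still feels backwards, it may help to think of it as naming the axis that gets *collapsed*.  Here is a sketch on a tiny made-up array that can be checked by hand:

```python
import numpy as np

# 2 made-up patients (rows) with 3 observations each (columns)
scores = np.array([[0, 2, 4],
                   [1, 5, 3]])

per_observation = scores.max(axis=0)  # collapse the rows: one value per column
per_patient = scores.max(axis=1)      # collapse the columns: one value per row

print(per_observation)
print(per_patient)
```

With rows as patients, `axis=0` leaves one number per observation, which is why the length check in the solution expects 40 and not 60.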

### EXERCISE 2 - multiple files
Re-create the plot above for the first three `.csv` files.  In order to get a new figure, use: `plt.figure()`

__BONUS__

Ask Google how to add a figure title to each plot, so that you can tell which file it comes from.

In [None]:
for f in data_filenames[:3]:
    data = np.loadtxt(fname=f, delimiter=',')
    plt.figure()
    plt.plot(data.max(axis=0), label='max')
    plt.plot(data.mean(axis=0), label='avg')
    plt.plot(data.min(axis=0), label='min')
    plt.title(f)
    plt.legend(loc='upper left')

Ok, so it's time to call the IRB and report everyone for faking (badly) the data.  

## 3. Analysis using pandas and matplotlib

Numpy is designed and built for doing array manipulations.  It's good at doing the kinds of table-like operations we've been doing so far, but it's really meant for doing math.  The Pandas library, on the other hand, is built from the ground up for working with tables of data.  

The feature of Pandas you're most likely to use is the DataFrame (if you're familiar with the R programming language, these are much the same as data frames in that language).  These are 2D tables of data that can behave like both Excel spreadsheets and database tables.  More on Excel later...

Like Numpy and Matplotlib, Pandas is a large, powerful library, and we're only going to look at a small portion in this workshop.

In [None]:
import pandas as pd

In [None]:
pd.read_csv('../../data/inflammation-01.csv')

Oops!  By default, Pandas assumes that the first row is the column names.  As our data has no header, we need to tell Pandas that the header is `None`.

In [None]:
pd.read_csv('../../data/inflammation-01.csv', header=None)
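
The effect of `header=None` is easy to check on something small.  The sketch below reads two made-up rows from memory (using `io.StringIO`) rather than from the study files:

```python
import io
import pandas as pd

csv_text = '0,1,2\n3,4,5\n'  # made-up data with no header row

with_default = pd.read_csv(io.StringIO(csv_text))
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

# By default the first row is swallowed as column names, so a row goes missing
print(with_default.shape, list(with_default.columns))
# With header=None both rows are data and the columns are simply numbered
print(without_header.shape, list(without_header.columns))
```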

While Numpy arrays are designed to do math, Pandas dataframes are designed to hold data, so in general you should try to treat them as immutable.  They also tend to be column focused, so while you think you may be doing a Numpy type slice, Pandas may think you're asking for either a column or some rows:

In [None]:
inflammation_01 = pd.read_csv('../../data/inflammation-01.csv', header=None)
inflammation_01[0]  # Column 0

In [None]:
inflammation_01[0:3]  # but this is the first three rows.

__Ouch!__  

To simplify matters, Pandas provides a location indexer, `.loc[]`, that looks a lot like Numpy indexing:

In [None]:
inflammation_01.loc[1]

You may have noticed that the slice we've just taken brings its own index with it (the numbers 0 through 39).  Remember that Pandas behaves like a spreadsheet.  While `.loc[]` looks like it uses 'positional' slicing, what you're actually doing is slicing based on the index labels.  

If we create a dataframe with labeled rows and columns, the behaviour becomes clearer:

In [None]:
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], index=['one', 'two', 'three'], columns = ['a', 'b', 'c'])
df

In [None]:
df[['a','b']]  # We can ask for a list of columns

Or we can use `.loc[]` indexing to take a slice much as we would with Numpy (note that a label slice includes its end point):

In [None]:
df.loc['one':'two']

But we can also provide a list of indices, and Pandas will give us those back in the order we ask for them.

In [None]:
df.loc[['three','one'], ['c', 'a', 'b']]
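
Because `.loc[]` works on labels, its slices include both end points.  If you really do want Numpy-style positional slicing, Pandas also has `.iloc[]`, which this notebook doesn't otherwise use.  A small sketch of the difference, rebuilding the same labeled table under the made-up name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                       index=['one', 'two', 'three'],
                       columns=['a', 'b', 'c'])

by_label = df_demo.loc['one':'two']  # label slice: 'two' is included
by_position = df_demo.iloc[0:1]      # positional slice: stops before row 1, as in Numpy

print(list(by_label.index))
print(list(by_position.index))
```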

So what do we get for this increased complexity?

### EXERCISE 3 - General information on our csv file with Pandas
1. Load the second csv file into a Pandas DataFrame called inflammation_02
2. Create a variable called inflammation_02_description using the .describe() method of the DataFrame
3. Examine the contents of the inflammation_02_description variable  

4. Using the `.loc[]` indexer, extract the max, min, and average values and convert them to lists
5. Plot the max, min, and average values using plt.plot()

In [None]:
inflammation_02 = pd.read_csv('../../data/inflammation-02.csv',header=None)
inflammation_02_description = inflammation_02.describe()
inflammation_02_description

In [None]:
print('max:')
print(list(inflammation_02_description.loc['max']))
print()
print('min:')
print(list(inflammation_02_description.loc['min']))
print()
print('average:')
print(list(inflammation_02_description.loc['mean']))

In [None]:
plt.plot(inflammation_02_description.loc['max'])
plt.plot(inflammation_02_description.loc['min'])
plt.plot(inflammation_02_description.loc['mean'])

Because Pandas dataframes behave like Excel spreadsheets, the people who created pandas decided that you should be able to work with Excel spreadsheets.  

Load all the sheets from the `inflammation.xlsx` Excel file by telling the `read_excel` function not to take a specific one (note that by default it takes the first one).  This will give us a dictionary of sheets:

In [None]:
inflammation_workbook = pd.read_excel('../../data/inflammation.xlsx', sheet_name=None, header=None)
inflammation_workbook.keys()

In [None]:
inflammation_workbook['inflammation-01']

One final bit of Pandas.  It plots, too.

In [None]:
inflammation_02_description.loc['min'].plot()
inflammation_02_description.loc['mean'].plot()
inflammation_02_description.loc['max'].plot()

### EXERCISE 4 - Plot all the things!

Plot min, average, and max values for all the sheets in your Excel file.  

__bonus__:  
Plot them in order.  
*tip: you can order the keys in a dictionary using `sorted(dictionary.keys())`*
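
A quick sketch of what `sorted` does with dictionary keys.  The sheet names and values below are invented for the example.  Note that strings sort alphabetically, so names only come out in numerical order if the numbers are zero-padded:

```python
# Hypothetical sheet names, deliberately out of order
sheets = {'inflammation-03': 'c', 'inflammation-01': 'a', 'inflammation-02': 'b'}
ordered_names = sorted(sheets.keys())
print(ordered_names)

# Without zero padding, 'sheet-10' sorts before 'sheet-2'
unpadded = sorted(['sheet-10', 'sheet-2', 'sheet-1'])
print(unpadded)
```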

In [None]:
for name in sorted(inflammation_workbook.keys()):
    plt.figure()
    sheet_description = inflammation_workbook[name].describe()
    sheet_description.loc[['min', 'mean', 'max']].T.plot()
    plt.title(name)

Hmm,...  That last one looks funny.

## One final plot...  
As our data is 2-dimensional, one final way we can plot it is to show it as an image.  Because Pandas and Matplotlib work together well, it's easy.  We just use the `imshow` function from `pyplot`, and then add a colorbar to show the range of values.

In [None]:
plt.imshow(inflammation_workbook['inflammation-01'], cmap='jet')
plt.colorbar()

MathWorks, the company that makes Matlab, used to use the Jet colormap (and so did Matplotlib before version 2.0).  This colormap was so popular (and pretty) that it became the default in most packages.  Unfortunately, Jet is a terrible colormap.  Recently, there has been a lot of research into the effects of different colormaps on our perception of data, which has shown how truly awful the Jet colormap is.  In [one recent study](http://www.eecs.harvard.edu/~kgajos/papers/2011/borkin11-infoviz.pdf), physicians who were switched from Jet to a perceptually 'appropriate' map showed a 47% increase in the ability to detect potential sites of coronary artery disease.

Because Jet was so bad, MathWorks changed their default map to one called *parula*, which is much better, but still not ideal.  Matplotlib has gone for a map called *viridis*, which is perceptually uniform, looks the same if you have red-green colour blindness, and prints nicely in black and white.
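
One rough way to see the problem for yourself is to estimate how bright each colour in a map is.  The sketch below uses a simple weighted sum of the red, green and blue channels (the Rec. 601 luma weights) as a stand-in for perceived lightness.  That is an approximation chosen for this example, not the measure used to design viridis, and the helper `approximate_lightness` is made up here.

```python
import numpy as np
import matplotlib.pyplot as plt

def approximate_lightness(cmap_name, n=256):
    """Rough brightness of each colour in a colormap (Rec. 601 luma weights)."""
    rgba = plt.get_cmap(cmap_name)(np.linspace(0, 1, n))
    return rgba[:, 0] * 0.299 + rgba[:, 1] * 0.587 + rgba[:, 2] * 0.114

for name in ['jet', 'viridis']:
    lightness = approximate_lightness(name)
    brightest = int(lightness.argmax())
    print(name,
          'starts at', round(float(lightness[0]), 2),
          'ends at', round(float(lightness[-1]), 2),
          'brightest colour is number', brightest, 'of', len(lightness))
```

Jet starts dark, gets bright in the middle, and ends dark again, so two very different data values can look equally bright.  A map like viridis is meant to get steadily lighter from one end to the other.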

As an example, we'll plot the same data as above using both a greyscale map and viridis.

In [None]:
plt.figure(figsize=(6,6))  # Notice that you can change the figure size by providing a (width, height) tuple, in inches

plt.imshow(inflammation_workbook['inflammation-01'], cmap='Greys_r')
plt.colorbar()

plt.figure(figsize=(6,6))

plt.imshow(inflammation_workbook['inflammation-01'], cmap='viridis')
plt.colorbar()

### EXERCISE 5 - Try some other colormaps

Repeat the plot above using *at least* three different maps.

__tip: Google 'matplotlib colormap'__


In [None]:
plt.imshow(inflammation_workbook['inflammation-01'], cmap='inferno')
plt.colorbar()

plt.figure()

plt.imshow(inflammation_workbook['inflammation-01'], cmap='plasma')
plt.colorbar()

plt.figure()

plt.imshow(inflammation_workbook['inflammation-01'], cmap='magma')
plt.colorbar()

plt.figure()

plt.imshow(inflammation_workbook['inflammation-01'], cmap='bwr')
plt.colorbar()

### EXERCISE 6 - I promise Jet is bad

Plot all the spreadsheets in the viridis, Greys_r, and jet colormaps.

Do you see anything interesting???

In [None]:
for name in sorted(inflammation_workbook.keys()):
    for cm in ['viridis', 'Greys_r', 'jet']:
        plt.figure()
        plt.imshow(inflammation_workbook[name], cmap=cm)
        plt.colorbar()
        plt.title(name + ': ' + cm)