<img style="float: right;" src="http://www2.le.ac.uk/liscb1.jpg">

# Scientific Python for Programmers

Python's real power lies in its libraries.

Implementing new data analysis algorithms or strategies can take anywhere from hours to months. However, you're probably not the first person to attempt most tasks, and if someone else has tried it in Python, they've probably published a library so you can do it too.

The most common libraries for data analysis in Python are *numpy*, *pandas* and *matplotlib*.

We have some data from an inflammation study stored in .csv files in the data directory. Each row represents one patient, and each column represents their inflammation score as the study progressed. Each file is from a different group of patients.

Let's try to analyse the data a few different ways:

## 1. Analysis using just python and *matplotlib*

We start by getting a list of the files, using Python's *glob* library, which contains the function `glob`.

The `glob` function finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. More details can be found at: https://docs.python.org/3/library/glob.html

In [None]:
import glob

# Use * as wild card character to get list of all csv files. 
glob.glob('data/*.csv')

**_Please note that the list of files is not guaranteed to be in order._**

Now that we have a list of files, let's look at the first file:

In [None]:
data_filenames = glob.glob('data/*.csv')
first_filename = data_filenames[0]
print(first_filename)
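
Because the order returned by `glob` is not guaranteed, the "first" file can differ from one machine to another. A minimal sketch of one way to get a reproducible order is to wrap the call in `sorted()`. The temporary directory and the file names below are made up for illustration, so that the example runs without the workshop's data folder.

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few empty CSV files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['inflammation-03.csv', 'inflammation-01.csv', 'inflammation-02.csv']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # sorted() returns a new list in alphabetical order,
    # whatever order glob happened to use.
    sorted_filenames = sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))
    sorted_basenames = [os.path.basename(path) for path in sorted_filenames]

print(sorted_basenames)
```

With the workshop data, the equivalent call would be `sorted(glob.glob('data/*.csv'))`.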

We can look at the data using an IPython command to list the file contents, just like we would on the command line:  
*Note that you can click on the area to the left of the output to shrink it down.*

In [None]:
%cat data/inflammation-08.csv

Now that we see what the data look like, we can formulate a strategy for analysing it:
1. Open the file
2. Read each line
3. Split the values at the commas
4. Convert each value into an integer
5. Add that patient's data to your study

Files are funny things.  If you open a file and forget to close it, bad things happen.  If your program crashes half-way through, bad things happen.  If you try to open it more than once, bad things happen.  Python has a trick that takes care of all of this for you:  `with open(Your_file_name) as Your_file_handle:`
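
The effect of `with` can be checked directly through the `closed` attribute of the file handle. The sketch below is only an illustration: it writes a small throwaway file (the name `example.csv` is made up) and shows that the file is open inside the block and closed as soon as the block ends.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small throwaway file so the example is self-contained.
    example_path = os.path.join(tmp_dir, 'example.csv')
    with open(example_path, 'w') as out_file:
        out_file.write('0,1,2\n')

    with open(example_path) as in_file:
        first_line = in_file.readline()
        print('Closed inside the block?', in_file.closed)

    # The variable still exists, but the file behind it has been closed for us.
    print('Closed after the block? ', in_file.closed)
```

The file is also closed if an error is raised inside the block, which is the main reason to prefer `with` over calling `close()` by hand.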

Let us walk through the 5 steps above in a simpler setting, using a file that contains just one line: `first_line_inflammation-08.txt`

In [None]:
%cat data/first_line_inflammation-08.txt

<b>Step 1 & Step 2:

Let us open the file and read one line.

In [None]:
my_file_name = 'data/first_line_inflammation-08.txt'

# Open the file
with open(my_file_name) as file:
    
    # Read each line
    for line in file:
        print(line)

# Also let us see the data type of the line that we just read.
print('Data type of line is :-', type(line))

<b> Step 3:

Let us now split the values at commas so that we can manipulate each value as we wish.

In [None]:
print('LINE:')
print(line)
print('Data type of line is  :-', type(line))

print()

split_line = line.split(',')
print('SPLIT LINE:')
print(split_line)
print('Data type of split_line is  :-', type(split_line))


<b>Step 4:

Let us look at the type of the individual values in `split_line` and convert them into a suitable data type so that we can perform mathematical operations.

In [None]:
print('Data type of value in split_line is :-', type(split_line[0]))

As we can see, each value in `split_line` is of type string (`str`). We will now convert them to integers so that we can perform mathematical operations.

In [None]:
# Initialise converted split line to an empty list.
converted_split_line = []

for value in split_line:
    
    #Convert to int values.
    converted_split_line.append(int(value))

print(converted_split_line)
print('Data type of value in converted_split_line is :-', type(converted_split_line[0]))
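
Steps 3 and 4 are often written more compactly with a list comprehension. The following is a sketch on a made-up line in the same format as the data files, not a replacement for the step-by-step version above.

```python
line = '0,0,1,3,1,2\n'  # made-up line in the same format as the data files

# strip() removes the trailing newline, split(',') cuts at the commas,
# and int() converts each piece. All three steps happen in one expression.
scores = [int(value) for value in line.strip().split(',')]

print(scores)
print(type(scores[0]))
```

The `strip()` call is optional here, because `int()` ignores surrounding whitespace, but it makes the intent explicit.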

Let us now combine all of the steps above to read every line in one csv file and look at the data.

In [None]:
print(first_filename)

# This is the main list that will store all the values in the current file.
study_participants = []

# Open the file
with open(first_filename) as file:
    
    # Read each line
    for line in file:
        
        # Split the values at the commas
        split_line = line.split(',')
        
        # This list stores the values of one row/line before it is appended to study_participants.
        inflamation_scores = []        
        for inflamation_score in split_line:
            inflamation_scores.append(int(inflamation_score))
        
        study_participants.append(inflamation_scores)

# Look at the data
for participant in study_participants:
    print(participant)

We can also ask basic questions, like what's the minimum, average, and maximum inflammation value for each participant.

In [None]:
for participant in study_participants:
    minimum_inflamation = min(participant)
    average_inflamation = sum(participant) / len(participant)
    maximum_inflamation = max(participant)
    print('min:', minimum_inflamation, 'avg:', average_inflamation, 'max:', maximum_inflamation)

<b>Everyone's got a minimum score of 0!

Now we can plot each participant using *matplotlib* to see if we can see what's going on.

### 1.1 Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. You can find more details at: https://matplotlib.org/.

It is common practice to import the `matplotlib.pyplot` module as `plt` to save some typing. We will use the same convention in this notebook.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Let us look at some plotting functions in matplotlib. 

As the first example, let us plot the curve of $y = x^2$. 

<b>Step 1:

Let us define the $x$ range we want to plot. We will use the Python function `range` for this purpose.

The syntax of `range` is `range(start, stop, step)`. It generates an iterable object starting from `start` and incrementing by `step` until it reaches `stop`. Note that `stop` is not included.

In [None]:
x_values = list(range(-10, 11, 1))  # list() converts the iterable into a list of values.
print(x_values)
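
A few more calls may help to show how `start`, `stop` and `step` interact. This is a small sketch with arbitrary numbers chosen for illustration.

```python
# Only stop given: start defaults to 0 and step defaults to 1.
first_five = list(range(5))

# A step larger than 1 skips values; stop (10) is never included.
every_third = list(range(2, 10, 3))

# A negative step counts down; again stop (0) is not included.
countdown = list(range(10, 0, -2))

print(first_five)
print(every_third)
print(countdown)
```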

<b>Step 2:

Let us calculate the value of y for each x.

In [None]:
y_values = [] 

for x in x_values:
    y = x ** 2
    y_values.append(y)

print(y_values)

<b>Step 3:

Let us plot the curve now using matplotlib.

In [None]:
plt.plot(x_values, y_values)

<b>A. Add label to the graph

In [None]:
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x_values, y_values)

<b>B. Add title to the graph

In [None]:
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x_values, y_values)
plt.title('Graph of some simple mathematical functions')

<b>C. Add markers to the graph and change line type

In [None]:
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x_values, y_values, '--')
plt.plot(x_values, y_values, 'ro')
plt.title('Graph of some simple mathematical functions')

<b>D. Adding more than one curve on the same graph

Let us add the graph of $y=10x$ on the same graph.

In [None]:
#Since x values are same, we use them to calculate y values.
linear_y_values = []

for x in x_values:
    linear_y_values.append(10*x)

print(linear_y_values)

In [None]:
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x_values, y_values, 'ro-')
plt.plot(x_values, linear_y_values, 'g*--')
plt.title('Graph of some simple mathematical functions')

The plot above looks _OK_, but we should always add a legend so that readers can tell which curve is which.

<b>E. Adding legend to the graph

In [None]:
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x_values, y_values, 'ro-')
plt.plot(x_values, linear_y_values, 'g*--')
plt.title('Graph of some simple mathematical functions')
plt.legend(['$y=x^2$', '$y=10x$'])
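
Passing a list to `plt.legend` relies on the entries being in the same order as the `plt.plot` calls. An alternative sketch is to attach a `label` to each curve and call `plt.legend()` with no arguments. The data are rebuilt here so that the block runs on its own.

```python
import matplotlib.pyplot as plt

x_values = list(range(-10, 11))
y_values = [x ** 2 for x in x_values]
linear_y_values = [10 * x for x in x_values]

# Each curve carries its own label, so the legend cannot get out of step
# with the order of the plot calls.
plt.plot(x_values, y_values, 'ro-', label='$y=x^2$')
plt.plot(x_values, linear_y_values, 'g*--', label='$y=10x$')
plt.xlabel('x')
plt.ylabel('y')

legend = plt.legend()  # picks up the labels given above
legend_texts = [text.get_text() for text in legend.get_texts()]
print(legend_texts)
```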

<b>F. Changing axes limits of the graph

Let us change the limits of the x and y axes so that only the part of the graph with non-negative x values is shown.

In [None]:
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x_values, y_values, 'ro-')
plt.plot(x_values, linear_y_values, 'g*--')
plt.title('Graph of some simple mathematical functions')
plt.legend(['$y=x^2$', '$y=10x$'])
plt.xlim(0, 10)
plt.ylim(-5, 105)

<b>G. Adding subplots

Subplots are groups of axes that share a single matplotlib figure.

Consider the following example:

In [None]:
#Adding subplots.
fig,axs = plt.subplots(1,2)

axs[0].plot(x_values, y_values,'cx--')
axs[0].set_xlabel('$X1$')
axs[0].set_ylabel('$Y1$')
axs[0].set_title('Graph of $y=x^2$')

axs[1].plot(x_values, linear_y_values,'mo:')
axs[1].set_xlabel('$X2$')
axs[1].set_ylabel('$Y2$')
axs[1].set_title('Graph of $y=10x$')

fig.tight_layout()

<b>H. Adding more than one figure

In [None]:
#Adding more than 1 figure
fig1 = plt.figure(1)
plt.plot(x_values, y_values,'b*--')
plt.xlabel('$X$')
plt.ylabel('$Y$')
plt.title('Graph of $y=x^2$')

fig2 = plt.figure(2)
plt.plot(x_values, linear_y_values,'gs:')
plt.xlabel('$X$')
plt.ylabel('$Y$')
plt.title('Graph of $y=10x$')

#plt.figure(1)
#plt.xlabel('x axis')

### 1.2 Plot Participant

Now let us turn back to our patient data and plot those values on a graph. Later, in the next section, we will learn some more ways to read and plot the same data.

In [None]:
for participant in study_participants:
    plt.plot(participant)

1.  Looks very busy.
2. Looks very triangular!  

Now we could try to re-orient our list of lists to look at things along the other axis.

But we're really talking about 2D data here, so why not use a library meant to work with 2D (or nD) data?

## 2. Analysis using numpy and matplotlib

### 2.1. Numpy Basics
### Numpy arrays (ndarrays)

Numpy arrays (often written as ndarrays, for n-dimensional arrays) are one of the most commonly used
collection types. Even though they are not part of the core Python libraries, they are so useful in
scientific Python that we'll include them here in the core lesson. Numpy arrays are collections of things,
all of which must be the same type, that work similarly to lists (as we've described them so far). The most important differences are:

1. You can easily perform elementwise operations (and matrix algebra) on arrays
1. Arrays can be n-dimensional
1. There is no equivalent to append, although arrays can be concatenated

Arrays can be created from existing collections such as lists, or instantiated "from scratch" in a 
few useful ways.

When getting started with scientific Python, you will probably want to try to use ndarrays whenever
you're doing math or dealing with numerical data, saving the other types of collections for those cases when you have a specific reason to use them.

In [None]:
# We need to import the numpy library to have access to it 
# We can also create an alias for a library, this is something you will commonly see with numpy
import numpy as np

<b>A. Creating simple numpy arrays

In [None]:
# Make an array from a list
alist = [2, 3, 4]
blist = [5, 6, 7]

a = np.array(alist)
b = np.array(blist)

print(a, type(a))
print(b, type(b))
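
To see how arrays differ from the lists they were built from, compare what `+` does in each case. This is a minimal sketch; the `mixed` array is an extra made-up example showing that all elements of an array share one type.

```python
import numpy as np

alist = [2, 3, 4]
blist = [5, 6, 7]

joined = alist + blist                      # lists: + concatenates
summed = np.array(alist) + np.array(blist)  # arrays: + adds element by element

print(joined)
print(summed)

# All elements share one type: here the integers are stored as floats.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```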

<b>B. Concatenating values to numpy arrays

In [None]:
# Let us create a 2D array.
c = np.array([ 
      [1, 2, 3],
      [4, 5, 6]
    ])

d = np.array([[1, 8, 9]])
e = np.concatenate((c, d), axis=0)
print('Concatenation (Axis = 0) :\n', e)

In [None]:
#Now Let us use axis argument.
d = np.array([[1, 8]])
e = np.concatenate((c, d.T), axis=1)     #Pay attention to the transpose applied to 'd' numpy array.
print('Concatenation (Axis = 1) :\n', e)

In [None]:
# There is yet another option called axis=None. Let us try that.
d = np.array([[1, 8, 7, 9]])
e = np.concatenate((c, d), axis=None)
print('Concatenation (Axis = None) :\n', e)

<b>C. Arithmetic on numpy arrays

In [None]:
# Do element-wise arithmetic on arrays
print(a)
print(b)
print(a**2)
print(np.sin(a))
print(a * b)

In [None]:
# Do linear algebra on arrays
print(a.dot(b))
print(np.dot(a, b))

<b>D. Boolean operations on numpy arrays

In [None]:
# Boolean operators work on arrays too, and they return boolean arrays
print(a > 2)
print(b == 6)

c = a > 2
print(c)
print(type(c))
print(c.dtype)

<b>E. Indexing and Slicing Numpy arrays

Indexing: In the simplest terms, indexing means accessing a particular element (or elements) based on its position. In Python, indices start from zero.

Slicing: Used when we want a portion of the array. In any dimension, we can use the syntax: `Your_array_name[start:stop:step]` (the `stop` index is not included in the sliced array).

Please note that:
+ default `start` is `zero`.
+ default `stop` is `length of array in that dimension`.
+ default `step` is `1`.

For 2D arrays, the slicing can be extended as `Your_2D_array[row_slicing, column_slicing]` where each slicing can have the form `start:stop:step`.

In [None]:
# Indexing arrays
print(a)
print(a[0:2])
print()
print()
c = np.random.rand(3, 3)
print(c)
print(c[1:3, 0:2])
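
The examples above leave `step` at its default. A short sketch with a non-default `step`, using small made-up arrays so that the results are easy to check by eye:

```python
import numpy as np

numbers = np.arange(10)                # 0, 1, ..., 9
every_other = numbers[1:8:2]           # start at 1, stop before 8, step 2
reversed_numbers = numbers[::-1]       # negative step walks backwards

grid = np.arange(12).reshape(3, 4)     # 3 rows, 4 columns
corner_rows = grid[::2, 1:3]           # rows 0 and 2, columns 1 and 2

print(every_other)
print(reversed_numbers)
print(grid)
print(corner_rows)
```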

Please note that you can skip specifying rows or columns explicitly and use `:` instead which means all rows or all columns. Consider the following example for clarity.

In [None]:
# Let us create a 4*3 numpy array.
array = np.array([[1,2,3], [4,5,6], [7,8,9], [10,11,12]])
print(array)

Let us print all columns of first two rows.

In [None]:
# Both of these slices do the same thing
print(array[0:2])
print()
print(array[0:2, :])

How about printing all rows but only the second column?

In [None]:
print(array[:, 1])

In [None]:
# If you want all the rows, but a subset of columns, you have to be specific:
print(array[:, 0:2])

In [None]:
# Let us replace zeroth row of c with array a.
c[0, :] = a   # Using ':' as index for either row or column means all rows or all columns.
print(c)

In [None]:
# Arrays can also be indexed with other boolean arrays
print('a =', a)
print('b =', b)

print(a > 2)
print('a[a > 2] = ', a[a > 2])
print('b[a > 2] = ', b[a > 2])

b[a == 3] = 77
print(b)
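
Boolean arrays can also be combined before they are used as an index. The sketch below uses made-up scores. Note that numpy needs `&` and `|` here rather than Python's `and` and `or`, and that each comparison must be wrapped in parentheses.

```python
import numpy as np

scores = np.array([0, 3, 7, 12, 5, 18])  # made-up values

# Element-wise "and": True where both conditions hold.
in_range = (scores > 2) & (scores < 10)
print(in_range)
print(scores[in_range])

# Element-wise "or": True where either condition holds.
low_or_high = scores[(scores < 3) | (scores > 10)]
print(low_or_high)

# True counts as 1, so summing a boolean array counts the matches.
number_in_range = in_range.sum()
print(number_in_range)
```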

<b>F. Attributes and methods of numpy arrays

In [None]:
# ndarrays have attributes...
#c.
print('Shape of array c is =', c.shape)
print('Number of dimensions in c =', c.ndim)
print('Number of bytes consumed by c =', c.nbytes)
print('\n')

# ...and methods
print(a.prod())     # Will lead to multiplication of all elements in array.
print(c.flatten())  # Will reduce the number of dimensions to just one.

<b>G. Some easy ways to create and initialise arrays

In [None]:
# There are handy ways to make arrays full of ones and zeros
print(np.zeros((5, 5)), '\n')
print(np.ones(5), '\n')
print(np.identity(5), '\n')

In [None]:
# You can also easily make arrays of number sequences
print(np.arange(0, 10, 2))

### 2.2 Using numpy with matplotlib to analyse our data

In [None]:
study_participants_array = np.array(study_participants)
study_participants_array

Numpy arrays have all sorts of nice features.  For example, we can easily find out what the shape of the array is:

In [None]:
study_participants_array.shape

So we have 60 participants with 40 observations each.  It turns out that reading these files into Numpy is common enough that we don't need those nested *for* loops to do it - someone's already done it for us:

In [None]:
print(first_filename)
data = np.loadtxt(fname=first_filename, delimiter=',')
data.shape
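
One difference from our hand-written loop is worth noting: `np.loadtxt` returns floating point numbers by default, whereas we converted every value with `int()`. The sketch below illustrates this on a tiny made-up CSV held in memory (via `io.StringIO`) instead of a file on disk.

```python
import io

import numpy as np

csv_text = '0,1,2\n3,4,5\n'  # made-up data: two rows, three columns

# By default every value is parsed as a float.
as_floats = np.loadtxt(io.StringIO(csv_text), delimiter=',')
print(as_floats.dtype, as_floats.shape)

# The dtype argument asks for a different element type.
as_ints = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=int)
print(as_ints)
```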

#### Plotting  
Because numpy is designed to work with arrays of values, we can easily remake the plot above, without the loop.  

As you can see, the default orientation plots one line per observation, meaning we see 40 lines (one for each observation), each with 60 data points on the x-axis for the 60 participants.

In [None]:
plt.plot(data);

What a mess.  What we wanted was a plot across the observations, not participants.  Fortunately, we can just swap the axes of the array using a transpose.  In the same way nD arrays carry around their shape in the `.shape` attribute, they carry around their transpose in the `.T` attribute.

In [None]:
plt.plot(data.T);
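
Many array methods accept an `axis` argument that chooses the direction in which they work. Here is a minimal sketch on a small made-up array (2 rows, 3 columns), where the results are easy to verify by hand:

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])   # 2 rows, 3 columns

# axis=0 collapses the rows: one result per column.
max_per_column = small.max(axis=0)

# axis=1 collapses the columns: one result per row.
max_per_row = small.max(axis=1)

print(max_per_column)
print(max_per_row)

# The transpose swaps the two axes, and with them the shape.
print(small.shape, small.T.shape)
```

A quick way to check which direction you are working in is to look at the length of the result.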

<b>EXERCISE 1 - min, mean, max

1. Extract the data for the second patient
2. Calculate the minimum, mean, and maximum inflammation scores for that patient
3. Using the `axis=` parameter, calculate the minimum, mean, and maximum inflammation scores for each observation  
    *tip: since there are 60 patients with 40 observations each, you can check that you're working observation-wise and not patient-wise by confirming that each result has 40 values* 
4. Plot the minimum, average, and maximum inflammation scores per observation

<b>BONUS

Ask Google how to add a figure legend to your plot.
1. Search Google for "add figure legend to matplotlib"
2. Choose the first link to `stackoverflow.com`
3. Look at the top answer, which generally should have a green check mark 

In [None]:
# Exercise 1: Your solution goes here.


<b>EXERCISE 2 - multiple files

Re-create the plot above for the first three `.csv` files.  In order to get a new figure, use: `plt.figure()`

<b>BONUS

Ask Google how to add a figure title to each plot, so that you can tell which file it comes from.

In [None]:
# Exercise 2: Your solution goes here.


## 3. Analysis using pandas and matplotlib

Numpy is designed and built for array manipulation.  It copes with the kinds of table-like operations we've been doing so far, but it's really meant for doing math.  The Pandas library, on the other hand, is built from the ground up for working with tabular data.  

The primary two components of pandas are the `Series` and `DataFrame`. A `Series` is essentially a column or row, and a `DataFrame` is a 2-dimensional table made up of a collection of `Series` (if you're familiar with the R programming language, these are the same as data frames in that language).

Like Numpy and Matplotlib, Pandas is a large, powerful library, and we're only going to look at a small portion in this workshop.

In [None]:
import pandas as pd

In [None]:
# Creating DataFrames from scratch. Each (key, value) pair in data corresponds to a column in the resulting DataFrame.
data = {'Semester': [3, 1, 3, 3, 1],
        'Name': ['Sarah', 'John', 'George', 'Julia', 'Peter'],
        'Grade': [7.0, 6.8, 7.2, 7.8, 7.1]}

students = pd.DataFrame(data)
students

Pandas provides a location indexer (`.loc[]`) that behaves in a similar manner to Numpy.

In [None]:
students.loc[1]

Remember that Pandas behaves as a spreadsheet.  While using `.loc[]` looks like it uses 'positional' slicing, what you're actually doing is slicing based on the index label.  

If we create a dataframe with labeled rows and columns, the behaviour becomes more clear.

In [None]:
# You can also provide your own index
df = pd.DataFrame(data, index=['student2', 'student2', 'student3', 'student4', 'student5'])  # having the string student2 twice is not a typo. You will see below.
df

In [None]:
# You can now locate an entry with...
# print(df.loc[0])  # This doesn't work with this DataFrame
df.loc['student2']

You can also provide a column name.

In [None]:
df.loc['student3', 'Name']

But we can also provide a list of indices, and Pandas will give us those back in the order we ask for them.

In [None]:
df.loc[['student4', 'student3'], ['Grade', 'Name']]

Or we can use the `.loc[]` indexer to do the same kind of slicing we do with Numpy. Note that label-based slices include the end label, unlike positional slices.

In [None]:
df.loc['student3':'student5', 'Semester':'Name']
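
Label-based slicing with `.loc[]` differs from positional slicing in one important way: the end label is included. Pandas also offers `.iloc[]` for purely positional access, which follows the usual Python rule of excluding the stop position. The sketch below uses a made-up DataFrame with hypothetical index labels.

```python
import pandas as pd

grades = pd.DataFrame({'Grade': [7.0, 6.8, 7.2, 7.8]},
                      index=['anna', 'ben', 'cara', 'dev'])  # made-up labels

# Label-based: both 'ben' and 'cara' are returned.
by_label = grades.loc['ben':'cara']

# Position-based: stops before position 2, so only 'ben' is returned.
by_position = grades.iloc[1:2]

print(by_label)
print(by_position)
```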

Pandas provides various methods and attributes that give you details about a DataFrame. For example, `.info()` reports the essentials: the number of rows and columns, the number of non-null values, the type of data in each column, and how much memory the DataFrame is using. The `.shape` attribute is a tuple of (rows, columns), while `.columns` and `.index` hold the column and row labels respectively.

In [None]:
# Getting info about your data
print(students.info())

In [None]:
# More info about a DataFrame
print(students.shape)
print(students.columns)
print(students.index)

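We can also add rows to the end of our DataFrame. The supported way to do this is the `pd.concat()` function, which joins a list of DataFrames. To add a single row, wrap the dictionary in a one-row DataFrame first; to add multiple rows, pass another DataFrame.

Note: as of pandas v2.0 the `.append()` method was removed. The `._append()` method still exists, but it is private and may change without notice, so we avoid it here.

In [None]:
new_student = {'Semester': 1, 'Name': 'Mark', 'Grade': 6.9}
# pd.concat is not in-place, so we need to assign the result
modified_students = pd.concat([students, pd.DataFrame([new_student])], ignore_index=True)
modified_students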

Another way to create a DataFrame is by passing a list of lists, along with the index and columns arguments if needed, e.g.

In [None]:
pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], index=['one', 'two', 'three'], columns=['a', 'b', 'c', 'd'])

We can loop over the rows of a DataFrame with the `.iterrows()` method as (index, Series) pairs.

In [None]:
for index, row in students.iterrows():
    print(index, row['Semester'], row['Name'], row['Grade'])

In the previous section we used a built-in numpy function (`np.loadtxt()`) to parse the comma-separated values file. Pandas provides similar functionality via the `pd.read_csv()` function, which gives back a DataFrame instead of a numpy array.

In [None]:
pd.read_csv('data/inflammation-01.csv')

Oops!  By default, Pandas assumes that the first row is the column names.  As our data has no header, we need to tell Pandas that the header is `None`.

In [None]:
pd.read_csv('data/inflammation-01.csv', header=None)
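
The effect of the `header` argument is easiest to see on a tiny example. The sketch below reads the same made-up two-line CSV from memory (via `io.StringIO`) with and without `header=None`.

```python
import io

import pandas as pd

csv_text = '0,1,2\n3,4,5\n'  # made-up data: two rows, no header

# Default: the first row is consumed as column names, leaving one data row.
with_default = pd.read_csv(io.StringIO(csv_text))
print(with_default.shape, list(with_default.columns))

# header=None: both rows are data, and the columns are numbered 0, 1, 2.
without_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(without_header.shape, list(without_header.columns))
```

In the default case a whole patient silently disappears into the column names, which is why checking `.shape` after loading is a useful habit.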

While Numpy arrays are designed to do math, Pandas dataframes are designed to hold data, so in general you should try to treat them as immutable.

In [None]:
inflammation_01 = pd.read_csv('data/inflammation-01.csv', header=None)

So what do we get for this increased complexity?

<b>EXERCISE 3 - General information on our csv file with Pandas

1. Load the second csv file into a Pandas DataFrame called inflammation_02
2. Create a variable called inflammation_02_description using the `.describe()` method of the DataFrame
3. Examine the contents of the inflammation_02_description variable  
4. Using the `.loc[]` indexer, extract the min, mean, max values and convert them to lists
5. Plot the min, mean, max values using `plt.plot()`

In [None]:
# Exercise 3: Your solution goes here.


### 3.1 Reading excel spreadsheets with Pandas

Because Pandas dataframes behave like Excel spreadsheets, the people who created pandas decided that you should be able to work with Excel spreadsheets.  

We can load all the sheets from the `inflammation.xlsx` Excel file by passing `sheet_name=None` to the `read_excel` function, so that it doesn't pick a specific one (by default it gets the first one).  This will give us a dictionary of DataFrames:

In [None]:
inflammation_workbook = pd.read_excel('data/inflammation.xlsx', sheet_name=None, header=None, engine="openpyxl")
inflammation_workbook.keys()

In [None]:
inflammation_workbook['inflammation-01']

One final bit of Pandas. It plots, too.

In [None]:
inflammation_01_description = inflammation_01.describe()
inflammation_01_description.loc['min'].plot()
inflammation_01_description.loc['mean'].plot()
inflammation_01_description.loc['max'].plot()


plt.xlabel('Observation')
plt.ylabel('Value')
plt.legend(['Minimum', 'Mean', 'Maximum'])
plt.title('Data for file data/inflammation-01.csv')

<b>EXERCISE 4 - Plot all the things!

Plot min, mean, max values for all the sheets in your excel file.

<b>BONUS

Plot them in order.  
*tip: you can order the keys in a dictionary using `sorted(dictionary.keys())`*

In [None]:
# Exercise 4: Your solution goes here.


Hmm,...  That last one looks funny.

#### One final plot...  
As our data is 2-dimensional, one final way we can plot it is to show it as an image.  Because Pandas and Matplotlib work together well, it's easy.  We just use the `imshow` function of the `pyplot` module, and then add a colorbar to show the range of values.

In [None]:
plt.imshow(inflammation_workbook['inflammation-01'], cmap='jet')
plt.colorbar()

MathWorks, the company that makes Matlab, used to use the Jet colormap (and so did Matplotlib before version 2.0).  This colormap was so popular (and pretty) that it became the default in most packages.  Unfortunately, Jet is a terrible colormap. Recently, there has been a lot of research about the effects of different colormaps on our perception of data that has shown how truly awful the Jet colormap is.  In [one recent study](http://www.eecs.harvard.edu/~kgajos/papers/2011/borkin11-infoviz.pdf), physicians who were switched from Jet to a perceptually 'appropriate' map showed a 47% increase in the ability to detect potential sites of coronary artery disease.

Because Jet was so bad, MathWorks changed their default map to one called *parula*, which is much better, but still not ideal.  Matplotlib has gone for a map called *viridis* which is perceptually uniform, looks the same if you have red-green colour blindness, and prints nicely in black and white.

As an example, we'll plot the same data as above using both a greyscale map and viridis.

In [None]:
plt.figure(figsize=(6, 6))  # Notice that you can change the figure size by providing a (width, height) tuple in inches
plt.imshow(inflammation_workbook['inflammation-01'], cmap='Greys_r')
plt.colorbar()

plt.figure(figsize=(6, 6))
plt.imshow(inflammation_workbook['inflammation-01'], cmap='viridis')
plt.colorbar()

<b>EXERCISE 5 - Try some other colormaps

Repeat the plot above using *at least* three different colormaps.

<b>tip: Google 'matplotlib colormap'


In [None]:
# Exercise 5: Your solution goes here.


<b>EXERCISE 6 - I promise Jet is bad

Plot all the spreadsheets in the viridis, Greys_r, and jet colormaps.

Do you see anything interesting?

In [None]:
# Exercise 6: Your solution goes here.


## Congratulations, You made it. ##