# Assignment 3

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

from scipy.optimize import curve_fit

In [None]:
from itertools import product

This assignment is broadly based on preliminary work done for the article [*Quantifying massively parallel microbial growth with spatially mediated interactions*](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011585), which takes a data science approach to analysing high-throughput growth data from [*Scan-o-matic: High-Resolution Microbial Phenomics at a Massive Scale*](https://www.g3journal.org/content/6/9/3003.short).
Please read the paper ; it is also in your reading list.

### Loading the data

Let's start by loading the **time series**. The file `curves_raw.npy` is stored in NumPy's binary `.npy` format, so we use `np.load` to read it :

In [None]:
data = np.load("curves_raw.npy")
n_plates, n_rows, n_columns, n_points = data.shape

In [None]:
data.shape

The variable `data` now contains the time series as an `np.array`. It has 4 dimensions, which represent :
1. the plate number
2. the row number
3. the column number
4. the time point
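
As a minimal sketch of how such a 4-dimensional array is saved, loaded and indexed, here is a round trip on a made-up miniature array. The names `toy`, `one_series` and `one_slice`, as well as the tiny shape, are assumptions for illustration and not part of the assignment data.

```python
import os
import tempfile

import numpy as np

# Hypothetical miniature dataset: 2 plates, 3 rows, 4 columns, 5 time points.
toy = np.arange(2 * 3 * 4 * 5, dtype=float).reshape((2, 3, 4, 5))

# Round trip through the .npy format, the same format as curves_raw.npy.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.npy")
    np.save(path, toy)
    loaded = np.load(path)

one_series = loaded[1, 2, 3]        # all time points of one population
one_slice  = loaded[0, :, :, -1]    # whole plate 0 at the last time point

print(one_series.shape, one_slice.shape)
```

Indexing with three integers leaves only the time axis, while fixing the plate and the time point leaves a rows-by-columns grid.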

### Transforming into a `pd.DataFrame`

This version of the data is very compact, but **not easily readable**. Let's turn it into a `pd.DataFrame`, so we can add more descriptive index and column names :

In [None]:
Nt = pd.DataFrame(
    data    = data.reshape((n_plates * n_rows * n_columns, n_points)).T,
    columns = pd.MultiIndex.from_product(
        (range(n_plates), range(n_rows), range(n_columns)),
        names = ["plate", "row", "column"]
    ),
    index   = pd.Index(range(n_points), name = "time point")
)

In [None]:
Nt

This way, a specific time series can be accessed in much the same way as in the above `np.array` :

In [None]:
Nt[0, 0, 0]
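
To check that the reshaping keeps every time series attached to the right (plate, row, column) label, the same construction can be tried on a made-up miniature array. This is only a sketch: `toy`, `toy_Nt` and the small shape are hypothetical names and values.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset: 2 plates, 3 rows, 4 columns, 5 time points.
n_p, n_r, n_c, n_t = 2, 3, 4, 5
toy = np.random.default_rng(0).random((n_p, n_r, n_c, n_t))

toy_Nt = pd.DataFrame(
    data    = toy.reshape((n_p * n_r * n_c, n_t)).T,
    columns = pd.MultiIndex.from_product(
        (range(n_p), range(n_r), range(n_c)),
        names = ["plate", "row", "column"]
    ),
    index   = pd.Index(range(n_t), name = "time point")
)

# Every column of the DataFrame should match the corresponding array entry.
matches = all(
    np.array_equal(toy_Nt[p, r, c].to_numpy(), toy[p, r, c])
    for p in range(n_p) for r in range(n_r) for c in range(n_c)
)
print(matches)

# A partial key selects a whole sub-block, here every population of plate 1.
plate1 = toy_Nt[1]
print(plate1.shape)
```

Selecting with a single plate number returns a smaller `pd.DataFrame` whose columns are indexed by row and column only, which can be convenient when working plate by plate.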

### Additional calculation : distance to the border of the plate

This assignment requires one additional piece of data : the **inverse distance** of each grid point to the **closest border of the plate**, defined as $1 / (1 + m)$ where $m$ is the number of grid steps to the nearest wall.
The calculation is done for you here :

In [None]:
dists = np.empty((n_rows, n_columns))

for r in range(n_rows):
    for c in range(n_columns):
        # distance to the closest wall (top/left/bottom/right)
        m = min(r, c, n_rows-1 - r, n_columns-1 - c)
        
        dists[r, c] = 1 / (1 + m)

In [None]:
_, ax = plt.subplots()
sns.heatmap(dists, ax = ax)
ax.set_xlabel("column")
ax.set_ylabel("row");
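
The nested loops can also be written with NumPy broadcasting. The sketch below compares both versions on an assumed 32 x 48 layout ; the names `loop`, `vec`, `row_d` and `col_d` are illustrative, and the layout is an assumption rather than a value read from the data.

```python
import numpy as np

n_r, n_c = 32, 48   # assumed plate layout, for illustration only

# Loop version, following the same formula as the cell above.
loop = np.empty((n_r, n_c))
for r in range(n_r):
    for c in range(n_c):
        loop[r, c] = 1 / (1 + min(r, c, n_r - 1 - r, n_c - 1 - c))

# Vectorised version: distance to the nearest wall along each axis,
# then the element-wise minimum of the two via broadcasting.
rows = np.arange(n_r)
cols = np.arange(n_c)
row_d = np.minimum(rows, n_r - 1 - rows)[:, None]   # shape (n_r, 1)
col_d = np.minimum(cols, n_c - 1 - cols)[None, :]   # shape (1, n_c)
vec = 1 / (1 + np.minimum(row_d, col_d))

print(np.allclose(loop, vec))
```

Both versions give the same grid ; the loop is arguably easier to read, while the broadcast version avoids Python-level iteration.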

# Exercise 1

Currently, our data looks just like a lot of _numbers_ ; obviously, they do not really make sense when considered individually. The most common way to make sense of such a dataset is to **visualise** it graphically.

Indeed, when starting to work with a new dataset, the first step is very often to look at it in its entirety through various visualisations, so consider Exercise 1 as a generic "protocol". The first part of this exercise is to _plot_ the data in various ways ; later parts use automated methods to _group_ similar parts of the data.

## 1a. Plotting individual time series

Since the data represents **time series**, it makes sense to plot the _number of cells_ for each _time point_. In this exercise, we will do that first for a few selected populations of each plate, and then for all populations of each plate.

### A selection of arbitrary populations

First, let's create a plot for each plate, displaying the time series of the **locations** _(0, 0), (1, 1), (16, 24), (20, 12), (24, 40)_ and _(31, 47)_ :

In [None]:
coords = [(0, 0), (1, 1), (16, 24), (20, 12), (24, 40), (31, 47)]

**Reminder :** The time series of the population located at _(0, 0)_ can be displayed for each plate the following way :

In [None]:
_, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (16, 10))

for p in range(n_plates):
    axes[p//2, p%2].plot(
        Nt[p, 0, 0],
        label = f"({0}, {0})"
    )
    
    axes[p//2, p%2].set_title(f"plate {p+1}")
    axes[p//2, p%2].legend()

Copy the above code into the cell below, and modify it so that it displays the trajectories of all the coordinates contained in the `coords` variable :

**Hint :** In the `for` loop iterating through the plates, nest another `for` loop iterating through the `coords` variable, and place the right function calls into that loop.

You should observe that the populations **close to the border** behave very differently from the ones **far from the border**.

#### Adding colour

Add colour to the lines according to the _inverse distance_ of the populations _to the closest border_ (use a gradient from red to black if in doubt) :

**Hint 1 :** The `dists` variable holds the value you need for every location on a plate : the _inverse distance_ of the location _(r, c)_ _to the closest border_ is `dists[r, c]`.
(Note that these values are greater than 0 and at most 1.)

**Hint 2 :** _PyPlot_'s `plot` function has a `color` parameter which accepts a colour given as a triplet of **RGB** values, each between 0 and 1. A gradient from black to red can therefore be obtained by varying the red component of the triplet you pass to `color`. [More on colours here.](https://matplotlib.org/stable/tutorials/colors/colors.html "Really, read the docs !")
A few examples :
* **black** is _(0, 0, 0)_
* **red** is _(1, 0, 0)_
* **cyan** is _(0, 1, 1)_
* **orange** is _(1, 0.5, 0)_
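
As a small sketch of how a number between 0 and 1 can be turned into such a triplet, the hypothetical helper `red_gradient` below (not part of the assignment) builds the colour, and Matplotlib's `to_rgba` is used only to check that the triplet is a valid colour.

```python
from matplotlib.colors import to_rgba

def red_gradient(value):
    """Map a value in [0, 1] to an RGB triplet between black and red."""
    return (float(value), 0.0, 0.0)

for v in (0.0, 0.5, 1.0):
    rgb = red_gradient(v)
    # to_rgba raises a ValueError if Matplotlib cannot interpret the colour.
    print(v, rgb, to_rgba(rgb))
```

Such a triplet can be passed as is to the `color` argument of `plot`.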

### All the trajectories

Now plot **all** trajectories of each plate instead of just the few arbitrary coordinates from above, with the same colour scheme :

(By the way, it is normal for this plot to take more time to generate.)

**Note :** As this plot gets messy very fast, you may want to add transparency to the lines. Use the `alpha` parameter of the `plot` function for that (0 means fully transparent, 1 means fully opaque).

#### Comment here on how the plates differ

**Answer :** 

## 1b. Plotting statistics

As can be seen from the trajectories, there is a **high diversity** of population sizes on each plate. There are two ways to display it : as a _histogram_, which shows the distribution of the values, or as a _heatmap_, which also shows where on the two-dimensional plate each value sits.

### Histograms

A common way to summarise datasets is to use a **histogram** ; _PyPlot_ has a function called [hist](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) which can plot a histogram for you, if provided with a 1-dimensional array of values.

This function can be used the following way, if for example you want to plot a histogram of the cell numbers of each plate at coordinate _(0, 0)_ :

In [None]:
_, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (16, 10))

for p in range(n_plates):
    axes[p//2, p%2].hist(data[p, 0, 0], bins = 100)
    axes[p//2, p%2].set_title(f"plate {p+1}")
    axes[p//2, p%2].set_xlabel("N")

Plot here for each plate a histogram of the number of cells at **time 0** (use 100 bins) :

and at **the last time point** :

**Hint :** A time slice of the data for one plate is a 2-dimensional matrix (of shape `(n_rows, n_columns)`), and you need a 1-dimensional array. Use the `reshape` method to make that transformation.
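
A minimal sketch of that flattening step, on a made-up grid (the names `grid` and `flat` are only illustrative) :

```python
import numpy as np

# Hypothetical 3 x 4 grid standing in for one time slice of one plate.
grid = np.arange(12).reshape((3, 4))

flat = grid.reshape(-1)     # -1 lets NumPy infer the length (here 12)
print(grid.shape, flat.shape)
```

The flattened array lists the grid row after row, so the order of the values follows the (row, column) order of the plate.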

#### Comment here on how the variance changes between the two times

**Answer :** 

#### Comment here on what the outliers at the last time point represent, on plates 2-4

**Answer :** 

### Heatmaps

Another common way, particularly relevant for 2-dimensional data such as a grid, is to visualise it as a **heatmap** ; though _PyPlot_'s function called `imshow` can be used for that purpose, the _seaborn_ module has a [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html "Again, read the docs ! ;)") function which is much more powerful.

Plot here for each plate a heatmap of the number of cells at **time 0** :

and at **the last time point** :

#### Comment here on how the structure of the sizes differs between the two times

**Answer :** 

## 1c. Dimensionality reduction and clustering

Fortunately, we can use intuition, such as the distance to the border of the plate, to distinguish different time series. But this kind of intuition does not take us any further in categorising the data.

The standard technique for investigating data in an automated fashion is to first **reduce its dimensionality** and then to run a **clustering** algorithm, allowing us to group similar data.

### PCA

#### All plates

Let's start with dimensionality reduction ; one very common algorithm that allows that is **PCA**. We can use it on the whole data regardless of the plate, with _two components_, and project the data into the new space (read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html "Docs often have an example section.") to understand how the transformation to a new space is performed) :

In [None]:
pca_all = pd.DataFrame(
    data    = PCA(n_components = 2).fit_transform(Nt.T),
    columns = ["component 1", "component 2"],
    index   = Nt.columns
)
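
To get a feeling for what the projection does, here is a sketch of the same idea written with plain NumPy instead of scikit-learn : centre the data, compute a singular value decomposition, and project onto the first two axes. The data `X` is made up, and scikit-learn's own solver may differ in details such as the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples with 50 features each, playing the role
# of Nt.T (one row per population, one column per time point).
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
X += 0.01 * rng.normal(size=X.shape)

# 1. centre every feature, 2. decompose, 3. project on the first two axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:2].T

# Share of the total variance carried by each component.
explained = (S ** 2) / np.sum(S ** 2)
print(projected.shape, explained[:2])
```

Each row of `projected` is one sample expressed in the two directions along which the centred data varies the most.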

A _two-component_ decomposition represents each time series as a **point in a 2D space**. This means we can draw them as a _scatter plot_ ; when doing that, try to colour each point according to the plate it belongs to :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.scatterplot(
    x       = "component 1",
    y       = "component 2",
    hue     = "plate",
    data    = pca_all.reset_index(level = 0),
    palette = sns.color_palette(n_colors = n_plates),
    alpha   = 0.2,
    ax      = ax
);

#### Comment here on how well the plates are separable

**Answer :** 

#### Individual plates

Instead of grouping the whole data from all plates, this time, reduce the dimensionality of each plate individually :

Plot the decompositions for each plate on a different _scatter plot_, but this time give the colours according to the **distance to the border of the plate** :

**Hint :** You can pass the `dists` variable to the _hue_ parameter, provided you flatten it into a 1-dimensional array.

#### Comment here on how well the populations are separable

**Answer :** 

### t-SNE (bonus exercise)

Another very popular algorithm for dimensionality reduction is **t-SNE** ; it is substantially slower than PCA, but in return often separates the data better. Repeat what you did with PCA, this time using t-SNE (the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) shows that it is used the same way as PCA in the previous exercise) :

#### Comment here on how the performance of t-SNE compares with that of PCA

**Answer :** 

### K-Means (bonus exercise)

So far we have (again !) made some assumptions about how various aspects of the data can be separated or grouped : for example, we set colours according to which plate a point belongs to, or where it is placed on a plate. But how about avoiding that step too ?

There is a set of **clustering** algorithms that allow us to do that ; **K-Means** is one of them. Unfortunately, these algorithms may run quite slowly on data with too many dimensions ; this is why we reduce the dimensionality beforehand.

In this part of the exercise, run K-Means on one of the all-plates decompositions from before (obtained with either PCA or t-SNE) and, in particular, observe how it behaves when you change the number of clusters (in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) you can find a method analogous to the one you used in the last two exercises) :

In [None]:
pca_all["cluster"] = KMeans(n_clusters = 4).fit_predict(pca_all[["component 1", "component 2"]])
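
One way to observe the effect of the number of clusters is to look at the _inertia_ of the fitted model. The sketch below does this on made-up 2D blobs rather than on the assignment data ; `points`, `centres` and the values of `n_init` and `random_state` are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical 2D data: three well-separated blobs of 100 points each.
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.concatenate([c + rng.normal(size=(100, 2)) for c in centres])

# Inertia is the sum of squared distances to the closest cluster centre.
inertias = {}
for k in (1, 2, 3, 4, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this, the inertia drops sharply until the number of clusters reaches the number of blobs and only slowly afterwards ; real data rarely shows such a clean break.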

Make a _scatter plot_ as done above, but use the **cluster prediction** as colour :

Repeat this with one of the decompositions where the plates are separated (use 5 clusters) :

And the corresponding scatter plot :

#### Comment here on why the clusters look different from the distance-based ones

**Answer :** 

# Exercise 2

The previous exercise was about _exploring_ the time series data and making _sense_ of it. Exercise 2, on the other hand, is about **fitting** a _model_ to the data ; after all, we want to **explain** the data formally, not just compare numbers and features.

The model we choose is a variant of the _generalised logistic curve_ or **Richards' curve**, which appears frequently in biology :
$$
    N(t) = \beta + \frac{L_f}{(1 + \nu e^{-k(t-t_m)})^{1/\nu}}
$$

## 2a. Training : linear fit

The fits for the above model are going to be difficult to perform. In order to learn to use a fitting tool called [curve_fit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html), we will start by fitting a **linear model** :
$$
    N(t) = at + b
$$
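The idea of a fit is to find values for the parameters $a$ and $b$ such that the model curve lies as close as possible to the data ; `curve_fit` does this by minimising the sum of squared differences between the two.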

Let's do this for plate 1 at coordinate _(16, 16)_ :

In [None]:
ts = np.arange(n_points)

In [None]:
(a, b), _ = curve_fit(
    f     = lambda t, a, b: a * t + b,
    xdata = ts,
    ydata = Nt[0, 16, 16]
)

and comparing the fit to the actual data :

In [None]:
plt.plot(ts, Nt[0, 16, 16], label = "N(t)")
plt.plot(ts, a * ts + b, label = "linear fit")
plt.title(f"({16}, {16})")
plt.legend();

**Note :** I'm providing here a _lambda_ function to the `curve_fit` function (you may want to take the time to understand the concept of an [anonymous function](https://en.wikipedia.org/wiki/Closure_(computer_programming)#Anonymous_functions "Merci Wikipedia !")). A lambda creates a function on the fly ; here, that function represents the linear model $at + b$.
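
To illustrate that a lambda and a named function are interchangeable here, the sketch below fits the same made-up straight line both ways. The synthetic data and the names `linear`, `named`, `a_err` and `b_err` are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical noisy straight line with slope 2 and intercept 5.
t = np.arange(50, dtype=float)
y = 2.0 * t + 5.0 + rng.normal(scale=0.5, size=t.size)

def linear(t, a, b):
    return a * t + b

# The named function and the lambda describe the same model.
named, _    = curve_fit(f=linear, xdata=t, ydata=y)
(a, b), cov = curve_fit(f=lambda t, a, b: a * t + b, xdata=t, ydata=y)

# The square roots of the covariance diagonal estimate the uncertainty
# of each fitted parameter.
a_err, b_err = np.sqrt(np.diag(cov))
print(a, b, a_err, b_err)
```

The second value returned by `curve_fit`, discarded as `_` in the cells above, is the estimated covariance matrix of the fitted values.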

Perform a series of linear fits for the set of coordinates below :

In [None]:
coords = [(0, 0), (1, 1), (16, 24), (20, 12), (24, 40), (31, 47)]

Start by performing the fits, and store the obtained values into this variable :

In [None]:
params = np.empty((len(coords), 2))

**Hint :** Use a `for` loop to iterate through the coordinates, and store the computed parameters directly into the `params` variable.

Now plot the linear fits against the data as done above, for every coordinate (use 2 rows and 3 columns) :

#### Comment here on why the fits work well or not

**Answer :** 

## 2b. Upgrade : logistic model

A common model encountered in growth problems is the **logistic curve** :
$$
    N(t) = \frac{L}{1 + e^{-k(t-t_0)}}
$$
This model has 3 parameters ($L$, $k$ and $t_0$) which are to be fitted.

In [None]:
def lm(t, l, k, t0):
    return l / (1 + np.exp(-k * (t - t0)))

In [None]:
plt.plot(ts, lm(ts, l = 1, k = 0.1, t0 = 100))
plt.title("example of logistic curve")
plt.xlabel("t")
plt.ylabel("N(t)");

### The catastrophic failure

Adapt the code you wrote above so that it fits the `lm` function given above instead of a linear model :

In [None]:
params = np.empty((len(coords), 3))

**Note :** There will be an _OptimizeWarning_ about the _covariance of the parameters_ not being _estimated_. Don't panic, this is expected.

Plot the fitted curves for every coordinate :

Have a look at the values stored in `params` (in particular the third column, the $t_0$) :

In [None]:
pd.DataFrame(
    data    = params,
    columns = ["L", "k", "t0"],
    index   = coords
)

#### Comment here on what you observe about these failed fits

**Answer :** 

### The improvement

In order to obtain a _sigmoid_, the values of $t_0$ and $k$ have to be constrained. The solver is more likely to find a sensible solution if we restrict the values the parameters can take ; one way is to set bounds on them. We will choose the following constraints :
* $L \in ]0, \infty[$
* $k \in ]0, 1]$
* $t_0 \in [0, \infty[$

Copy your code from above, this time passing these constraints to `curve_fit` :

**Hint :** There is a link to the documentation for `curve_fit` earlier in this exercise ; you may want to look at the `bounds` part.
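
As a sketch of the syntax only, here is `bounds` used on a deliberately different toy model (an exponential decay, not the logistic curve). The function `decay`, the synthetic data and the chosen bounds are assumptions made for this example. Note that `curve_fit` treats bounds as inclusive, so an open interval can only be approximated.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical toy model (not the logistic curve): exponential decay.
def decay(t, amplitude, rate):
    return amplitude * np.exp(-rate * t)

t = np.linspace(0, 20, 100)
y = decay(t, 5.0, 0.3) + rng.normal(scale=0.05, size=t.size)

# bounds = (lower bounds, upper bounds), one entry per fitted parameter,
# in the order of the model's arguments; np.inf means "unbounded".
(amplitude, rate), _ = curve_fit(
    f      = decay,
    xdata  = t,
    ydata  = y,
    bounds = ([0, 0], [np.inf, 1])
)
print(amplitude, rate)
```

The fitted values are guaranteed to stay inside the given intervals, which is what prevents the solver from wandering off to unrealistic values.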

Plot the fitted curves for every coordinate :

#### Comment here on those upgraded fits

**Answer :** 

## 2c. The generalised logistic curve

The model used from now on is **Richards' curve**, which is :
$$
    N(t) = \beta + \frac{L_f}{(1 + \nu e^{-k(t-t_m)})^{1/\nu}}
$$

In [None]:
def richards(t, beta, l_f, nu, k, t_m):
    denom = 1 + nu * np.exp(-k * (t - t_m))
    return beta + l_f / np.power(denom, 1/nu)
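
A quick sanity check, sketched on a made-up time axis : with $\beta = 0$ and $\nu = 1$, Richards' curve reduces to the logistic curve of 2b. Both functions are repeated so that the snippet runs on its own ; the parameter values are arbitrary.

```python
import numpy as np

def lm(t, l, k, t0):
    return l / (1 + np.exp(-k * (t - t0)))

def richards(t, beta, l_f, nu, k, t_m):
    denom = 1 + nu * np.exp(-k * (t - t_m))
    return beta + l_f / np.power(denom, 1/nu)

t = np.arange(300, dtype=float)   # hypothetical time axis

# With beta = 0 and nu = 1, Richards' curve is the logistic curve.
same = np.allclose(
    richards(t, beta=0, l_f=2.0, nu=1, k=0.1, t_m=100),
    lm(t, l=2.0, k=0.1, t0=100)
)

# beta shifts the whole curve: early values sit near beta,
# late values near beta + l_f.
curve = richards(t, beta=1.0, l_f=2.0, nu=0.5, k=0.1, t_m=100)
print(same, curve[0], curve[-1])
```

This also shows the role of the two extra parameters : $\beta$ sets the starting level and $\nu$ changes the asymmetry of the rise.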

### A 5-parameter failure

Do the fits as above, but this time using the `richards` function and setting the bounds $]0, \infty[$ for every parameter :

In [None]:
params = np.empty((len(coords), 5))

**Note :** This time you will get two _RuntimeWarning_ messages ; as before, don't be alarmed : this is expected.

Plot the curves here :

Let's see what those values look like :

In [None]:
pd.DataFrame(
    data    = params,
    columns = ["beta", "Lf", "nu", "k", "tm"],
    index   = coords
)

#### Comment here on what you suppose happened this time

**Answer :** 

### A 3-parameter variant

The remedy for such a failure is to reduce the number of parameters to fit, but this requires prior knowledge of their values. Fortunately, there are two parameters in the above model that we already know :
* $\beta$ : the number of cells at the start of the experiment
* $L_f$ : the difference between the numbers of cells at the start and end of the experiment

In this part, we will fit only the other three parameters. Find a way to make `curve_fit` vary only the last three parameters while keeping $\beta$ and $L_f$ fixed at the above values (hint : try to understand the concept of a _nested function_ and how the `lambda` keyword works) :

In [None]:
params = np.empty((len(coords), 5))

Set the first two columns of `params` to the right values for $\beta$ and $L_f$ (which you compute for each coordinate) :

In [None]:
for i, (r, c) in enumerate(coords):
    n0, nf = Nt[0, r, c].iloc[[0, -1]]
    
    params[i, 0] = # beta
    params[i, 1] = # Lf

Now adapt the code you wrote previously so that it calls the `richards` function inside a _lambda_ function instead of directly, allowing you to fix $\beta$ and $L_f$ to the values you computed by hand :

**Hint 1 :** Your lambda function should take only `t` and the remaining parameters.

**Hint 2 :** You will have to find a way to store the values returned by `curve_fit` (fewer than 5) into the right locations in your `params` variable.
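
The general pattern can be sketched on a different toy model (an offset plus an exponential decay, not Richards' curve). Everything here is hypothetical : the function `model`, the synthetic data, and the assumption that the offset is known beforehand.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical toy model (not Richards' curve): offset + exponential decay.
def model(t, offset, amplitude, rate):
    return offset + amplitude * np.exp(-rate * t)

t = np.linspace(0, 20, 100)
y = model(t, 2.0, 3.0, 0.5) + rng.normal(scale=0.05, size=t.size)

known_offset = 2.0      # assumed to be known beforehand

# One row of results: the known value first, then the fitted ones.
row = np.empty(3)
row[0] = known_offset

# The lambda exposes only the free parameters to curve_fit and passes
# the known one through to the full model.
row[1:], _ = curve_fit(
    f     = lambda t, amplitude, rate: model(t, known_offset, amplitude, rate),
    xdata = t,
    ydata = y
)
print(row)
```

Because `curve_fit` only sees the arguments of the lambda, it varies two values here instead of three, and a slice assignment puts them next to the fixed one.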

Plot the fitted curves here :

#### Comment here on what you observe

**Answer :** 

### Back to 5 parameters

Fitting problems can become much easier if we provide good **initial parameters** ; in the previous exercise, you fixed the first two parameters and fitted the last three, which means that you now have such initial values for all five.

Repeat the 5-parameter fit, this time providing these initial parameters :

In [None]:
params3 = params.copy()
params  = np.empty((len(coords), 5))

**Hint :** There is a link to the documentation for `curve_fit` earlier in this exercise ; you may want to look at the `p0` part.
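
As a sketch of the syntax only, here is `p0` used on a deliberately different toy model (a sine wave, not Richards' curve). The function `wave`, the synthetic data and the starting values are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical toy model (not Richards' curve): a sine wave.
def wave(t, amplitude, frequency):
    return amplitude * np.sin(frequency * t)

t = np.linspace(0, 10, 200)
y = wave(t, 2.0, 2.0) + rng.normal(scale=0.1, size=t.size)

# p0 holds one starting value per fitted parameter, in the order of the
# model's arguments; without it (and without bounds), curve_fit starts
# every parameter at 1.
(amplitude, frequency), _ = curve_fit(
    f     = wave,
    xdata = t,
    ydata = y,
    p0    = [1.5, 1.9]
)
print(amplitude, frequency)
```

Models with several local optima, such as this one, are where a starting point close to the expected solution tends to matter most.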

Plot the fitted curves here :

#### Comment here on why the last part fits better now

**Answer :** 