# 4 - Further work with confocal image datasets

## 4.1 - Overview

### Introduction

Yesterday you worked with some statistics extracted from an Imaris file. Today,
you will learn how to work directly with the file to extract additional
information. 

Imaris stores its data in an openly documented file format (`.ims`) that is
based on HDF5. Many languages have packages for reading HDF5 files. For Python,
the two main packages are `h5py` and PyTables (imported as `tables`). We will
use PyTables for this exercise.

### The data

Up to 25 auditory nerve fibers synapse onto individual inner hair cells in
normal-hearing individuals. However, these synapses can be permanently lost due
to aging, exposure to noise or ototoxic drugs.  In experiments that study
hearing loss, we need a way of quantifying the number of synapses per inner
hair cell.

One approach is to dissect the cochlea out of the experimental animals and use
whole-mount immunohistochemistry to label the tissue with antibodies for
pre-synaptic ribbons (CtBP2), post-synaptic receptors (GluR2) and cytoskeleton
(Myosin VIIa). In a second step, each antibody is tagged with a fluorescent dye
that can be illuminated using a laser (much like how a black light can cause
certain materials to glow).

The distribution of these fluorescent dyes (which map to the underlying
distribution of the proteins of interest) can be captured by taking a series of
two-dimensional images at various depths in the tissue.  These images are then
"stacked" to create a three-dimensional image known as a Z-stack (since the
third dimension is commonly referred to as the Z-axis).

For this exercise, the dataset has been trimmed down to a small subset
containing only two inner hair cells (the full dataset is 0.5 GB in size).
Fig. 1a shows the CtBP2 label and fig. 1b shows the GluR2 label.

<table>
	<tbody>
		<tr>
			<td>1A. CtBP2 (pre-synaptic ribbon)</td>
			<td>1B. GluR2 (post-synaptic glutamate receptor)</td>
		</tr>
		<tr>
			<td><img src="data/CtBP2.png" /></td>
			<td><img src="data/GluR2.png" /></td>
		</tr>
	</tbody>
</table>

###  The problem

A functional inner hair cell synapse requires both a pre-synaptic ribbon and a
post-synaptic glutamate receptor. The next step in our analysis is to determine
whether each CtBP2 puncta is near a GluR2 label. 

This dataset was analyzed using Imaris to identify all CtBP2 puncta (white dots
in fig. 2a). If you look closely at the composite (fig. 2b), you'll see that
not all puncta have a glutamate receptor patch next to them! We should not
count these for the purpose of analysis, so we need a way to detect these
false hits and eliminate them.

<table>
	<tbody>
		<tr>
			<td>2A. CtBP2 puncta</td>
			<td>2B. CtBP2 puncta overlaid on GluR2</td>
		</tr>
		<tr>
			<td><img src="data/CtBP2+points.png" /></td>
			<td><img src="data/CtBP2+GluR2+points.png" /></td>
		</tr>
	</tbody>
</table>

One approach is to extract a fixed volume around each CtBP2 puncta (e.g., a
1 µm cube) and quantify the amount of GluR2 label in that volume. However, we
don't yet know much about the format of the data, so we need to do a little
exploration first.

## 4.2 - Getting started

First, let's import a few modules we'll need. Most of them are common third-party modules. The exception is `imaris`, a helper module I have written to extract the CtBP2 puncta data from the file. Yesterday you loaded this data from a comma-separated values (CSV) file; today we will load it directly from the HDF5 file itself.

In [None]:
import pylab as pl
import tables as tb
import numpy as np

import imaris

%matplotlib inline

Now, let's open the file.

In [None]:
fh = tb.open_file('data/confocal dataset.ims')

We need to figure out how to find the image data. We can start by taking a look at the file. Since HDF5 is a hierarchical data format, data inside the file is stored in a tree-like structure. We can view this structure using `print`.

In [None]:
print(fh)

This is a lot of information! However, in scanning the list there are several
things that jump out as important clues. First, remember that the tissue has
three labels (CtBP2, GluR2 and MyosinVIIa). In confocal imaging, each label is
acquired using a separate channel. At the bottom of the list we see several
rows that mention `Channel 0`, `Channel 1` and `Channel 2`. This is most likely
the data we need.

However, the channels appear several times (under `ResolutionLevel 0` and
`ResolutionLevel 1`). Which one do we want? Our intuition as programmers tells
us that Imaris likely stores the dataset at multiple resolutions and displays
the one appropriate for the current zoom level. For quantitative analysis, we
probably want the highest resolution level.

Take another look at the list. You'll notice that at the end of each line
there's an indicator in parentheses (`Group`, `Array`, `CArray`). These are the
different types of nodes (i.e., entries) in the HDF5 file. The simplest way to
think of an HDF5 file is as a self-contained filesystem. A `group` node
is equivalent to a folder. A `leaf` node (e.g., `Array`, `CArray`, `Table`) is
equivalent to a file. Group nodes are used to organize the data in the HDF5
file.
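To make the folder/file analogy concrete, here is a minimal sketch that builds a tiny in-memory HDF5 file with PyTables and lists its groups and leaves. The node names (`Level0`, `Level1`) and the array shapes are made up for illustration; they are not the Imaris layout.

```python
import numpy as np
import tables as tb

# Build a tiny HDF5 file in memory (nothing is written to disk).
# The node names below are invented for this demo.
demo = tb.open_file('demo.h5', mode='w', driver='H5FD_CORE',
                    driver_core_backing_store=0)
demo.create_array('/DataSet/Level0', 'Data', np.zeros((4, 8, 8)),
                  createparents=True)
demo.create_array('/DataSet/Level1', 'Data', np.zeros((2, 4, 4)),
                  createparents=True)

# Group nodes behave like folders ...
group_paths = [group._v_pathname for group in demo.walk_groups('/')]

# ... and leaf nodes (here, Array nodes) behave like files holding data.
leaf_shapes = {leaf._v_pathname: leaf.shape
               for leaf in demo.walk_nodes('/', classname='Array')}

print(group_paths)
print(leaf_shapes)
demo.close()
```

The same `walk_groups` and `walk_nodes` calls should work on the file handle `fh` if you want a more targeted listing than `print(fh)`.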

Now, let's look at the `Channel 0/Data` line for each resolution level. Each
line reports the shape of the array. The larger array is under
`ResolutionLevel 0`, so that level holds the highest-resolution data, while
`ResolutionLevel 1` holds a lower-resolution version. Apart from the
resolution, the two should show the same image.

Let's take a look at a single node so we can understand how to work with
the data.

In [None]:
node = fh.get_node('/DataSet/ResolutionLevel 0/TimePoint 0/Channel 0/Data')
data_CtBP2 = node.read()
print(data_CtBP2.shape)

This is a 3D NumPy array containing the image data. Each element in the array represents a voxel (i.e., a 3D pixel). The first dimension is the Z-axis, the second the Y-axis and the third (last) the X-axis. For example, to pull out the voxel located at XYZ coordinates (60, 20, 50), index the array in ZYX order.

In [None]:
data_CtBP2[50, 20, 60]
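The reversed index order is easy to get wrong, so here is a small sketch on a synthetic array (not the confocal data) in which every voxel stores its own flat position. The array shape and coordinates are arbitrary choices for the demo.

```python
import numpy as np

# Synthetic volume stored in (Z, Y, X) order, like the confocal stack.
n_z, n_y, n_x = 4, 3, 5
volume = np.arange(n_z * n_y * n_x).reshape(n_z, n_y, n_x)

# A point given as XYZ coordinates ...
x, y, z = 4, 1, 2

# ... must be looked up with the indices reversed (Z first, X last).
voxel = volume[z, y, x]
print(voxel)  # 39, i.e. z * (n_y * n_x) + y * n_x + x
```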

There are ways to visualize 3D data in Python. However, these approaches are
not readily available out of the box for Jupyter notebooks. Let's focus on simple 2D plotting
instead. A common way of presenting confocal image stacks is to take the
maximum projection along an axis (i.e., dimension). Let's take the maximum
projection along the first axis (i.e., Z-axis) and plot the resulting 2D image using `imshow`.

The `origin='lower'` argument to `pl.imshow` indicates that the data at `projection[0, 0]` should appear at the lower left corner of the axes instead of the default location (the upper left).

In [None]:
projection = data_CtBP2.max(axis=0)
pl.imshow(projection, origin='lower')
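To see what a maximum projection does, here is a minimal sketch on a synthetic 3 x 2 x 2 stack (the values are made up). For each XY position, the projection keeps the brightest value found at any depth.

```python
import numpy as np

# Synthetic stack in (Z, Y, X) order with a few bright voxels.
stack = np.zeros((3, 2, 2))
stack[0, 0, 0] = 5   # dim voxel near the top of the stack
stack[2, 0, 0] = 9   # brighter voxel at the same XY position, deeper down
stack[1, 1, 1] = 4

# Collapse the Z-axis, keeping the maximum at each (Y, X) position.
toy_projection = stack.max(axis=0)
print(toy_projection.shape)  # (2, 2)
print(toy_projection)
```

Note that the brighter of the two voxels at position `[0, 0]` wins, and the depth information is lost.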

## 4.3 - Exercise - cropping the dataset

It looks like the image has been "padded" with empty data by Imaris, making it a bit ugly to look at. Let's crop out that extra data. To do this, we need to find out what the actual image extents are in pixels. There is a way to do this by looking at the HDF5 file, but this is outside the scope of the exercise. For now, we provide the numbers for you.

Using these numbers, use Numpy indexing to crop out the empty regions and replot the data.
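As a reminder of how slicing works on a 3D array, here is a sketch on a synthetic padded volume. It assumes the padding sits at the high-index end of each axis; the array sizes are invented for the demo.

```python
import numpy as np

# Synthetic padded volume: only the first 3 x 4 x 5 voxels hold real data.
padded = np.zeros((4, 6, 8))
padded[:3, :4, :5] = 1.0

# Known extents of the real data, in (Z, Y, X) order.
n_z, n_y, n_x = 3, 4, 5

# Keep indices 0 .. n-1 along each axis and drop the rest.
cropped = padded[:n_z, :n_y, :n_x]
print(cropped.shape)  # (3, 4, 5)
```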

In [None]:
x_pixels = 161
y_pixels = 194
z_pixels = 135

# Your solution for the exercise:
## crop the dataset
## compute the maximum projection of the cropped dataset
## plot the new maximum projection

## 4.4 - Exercise - understanding the documentation

The units on the X and Y axes are in pixels. Let's convert them to actual image dimensions (in microns). First, you need to know the actual dimensions (this can be done by looking at the HDF5 file, but we provide the numbers for you). These are the dimensions of the cropped dataset.

In [None]:
x_um = 22.7418
y_um = 27.442
z_um = 21.526

Now, remember how to get help on a function? If not, go back and take a look at *insert reference to section here*. Take a look at the documentation for `pl.imshow`. Any clues as to which arguments can be used to get `imshow` to properly map each pixel to its spatial location in microns? As a bonus, be sure to label the X and Y axes too!

In [2]:
# Solution for exercise
## update imshow with the needed argument to plot the axes correctly
## label the x and y axes

## 4.5 - Pulling in extra data

Now, we need to load the data about the CtBP2 puncta that were identified using Imaris. Specifically, we need to know the XYZ location of each puncta. A helper function is provided to extract this information from the Imaris file and return it as a dataframe.

In [None]:
stats_CtBP2 = imaris.load_node_stats(fh, 'CtBP2', 'point')

## 4.6 - Exercise - overlaying a scatterplot on the image

Our goal is to take the plot we've created using `imshow` with the axes showing the correct spatial location in microns and overlay a scatterplot showing the location of each CtBP2 puncta identified by Imaris.

You've already learned how to inspect the contents of a dataframe (for a reminder, see *reference to section*). Take a look at the dataframe. What type of information does it have? What are the units (e.g., pixels or microns)?

Once you have figured out how to obtain the X and Y coordinates for each puncta, you can plot them using `pl.plot(x_coordinates, y_coordinates, 'r+')` (the `'r+'` specifies a red cross marker). Do the coordinates align with the puncta observed in the image?

In [3]:
# Solution to exercise
## figure out what you need to pull out of the dataframe to plot the x and y coordinates of each puncta
## cut and paste your answer from the previous exercise to plot the image
## add the pl.plot command to plot the puncta

## 4.7 - Extracting data from each image

We want to use this data to extract a 1$\mu m$ x 1$\mu m$ x 1$\mu m$ cube centered around each puncta. To do this we need to convert from $\mu m$ to pixels. Since we know the dimensions in pixels and $\mu m$, we can calculate the size of each pixel.

In [None]:
x_size = x_um/x_pixels
y_size = y_um/y_pixels
z_size = z_um/z_pixels
print(x_size, y_size, z_size)

This means that each pixel is 0.14 microns along the X and Y axes and 0.16 microns along the Z axis. If we want to convert from microns to pixels, we can simply divide by the pixel size. This means that a 1$\mu m$ x 1$\mu m$ x 1$\mu m$ cube is approximately 7 x 7 x 6 pixels in size (rounded to the nearest pixel). For simplicity, let's assume that the cube should be 7 x 7 x 7 pixels in size.
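The arithmetic can be checked with a few lines of plain Python. This sketch repeats the pixel-size calculation and then divides the cube edge (1 micron) by each pixel size; the variable names `cube_um` and `cube_pixels` are introduced here for illustration.

```python
# Image dimensions in microns and in pixels (values given earlier).
x_um, y_um, z_um = 22.7418, 27.442, 21.526
x_pixels, y_pixels, z_pixels = 161, 194, 135

# Size of a single pixel along each axis, in microns.
pixel_sizes = [x_um / x_pixels, y_um / y_pixels, z_um / z_pixels]

# Edge length of the cube in microns, converted to pixels along X, Y and Z.
cube_um = 1.0
cube_pixels = [round(cube_um / size) for size in pixel_sizes]

print([f'{size:.2f}' for size in pixel_sizes])  # ['0.14', '0.14', '0.16']
print(cube_pixels)                              # [7, 7, 6]
```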


## 4.8 - Exercise 

Now that you know how to convert from microns to pixels, let's pull out the first puncta in the dataframe and plot the maximum projection of the 1 x 1 x 1 $\mu m$ region centered around the puncta.

If you don't remember how to extract the first row of a dataframe, see *insert reference here*.

In [None]:
# Solution to exercise
## Extract first row of dataframe
## Convert coordinates stored in first row of dataframe to pixels.
## Extract cube from `data_CtBP2`. Be sure to verify its size is 7 x 7 x 7.
## Compute maximum projection along z-axis and plot it. Ensure axes are labeled appropriately.

## 4.9 - Exercise (doing the same for GluR2)

Looks like we've adequately identified the cube we need. Now, let's load the GluR2 data so we can plot the amount of GluR2 signal within this region as well.

In [None]:
node = fh.get_node('/DataSet/ResolutionLevel 0/TimePoint 0/Channel 1/Data')
data_GluR2 = node.read()
data_GluR2 = data_GluR2[:z_pixels, :y_pixels, :x_pixels]

In [6]:
# Solution to exercise (cut and paste your solution from the prior exercise 
# and adapt it to work with data_GluR2 instead of data_CtBP2)

## 4.10 - Quantifying the GluR2 signal

Looks like there's some GluR2 signal next to the CtBP2 signal. Great! Now how do we quantify this? Maybe we can just take the average intensity within this GluR2 subset?

In [None]:
subset_GluR2.mean()

## 4.11 - Exercise

The next step would be to loop through each row (i.e., puncta) in the dataframe and extract the mean GluR2 signal. This can then be saved back as a new column in the dataframe. We can loop through the rows using the `iterrows` method. Flesh out the for loop below to create a list of the GluR2 signal intensity near each puncta.
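If the `iterrows` pattern is unfamiliar, here is a minimal sketch on a made-up dataframe. The column names (`x_um`, `y_um`) and values are hypothetical and do not come from the Imaris file; the point is only the loop-then-assign pattern.

```python
import pandas as pd

# Made-up table with one row per point.
table = pd.DataFrame({'x_um': [1.0, 2.5], 'y_um': [0.5, 3.0]})

# iterrows yields (index, row) pairs; each row acts like a dictionary of columns.
distances = []
for _, row in table.iterrows():
    distance = (row['x_um'] ** 2 + row['y_um'] ** 2) ** 0.5
    distances.append(distance)

# A list with one entry per row can be saved back as a new column.
table['distance_um'] = distances
print(table)
```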

In [None]:
signal = []
for _, puncta in stats_CtBP2.iterrows():
    # Write code to extract cube from `data_GluR2` and compute mean value
    signal.append(mean_GluR2_signal)
    
# Here, we can save the GluR2 signal back to the statistics dataframe as a new column
stats_CtBP2['GluR2'] = signal
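One thing to watch for when looping over every puncta: a cube centered close to the edge of the volume can run past the array bounds, and a negative start index silently wraps around in NumPy. The sketch below shows one possible way to guard against this by clamping the slice bounds. The helper name `extract_cube` and the synthetic volume are hypothetical and are not part of the `imaris` module.

```python
import numpy as np

def extract_cube(volume, center_zyx, half_width=3):
    """Return the voxels within half_width of center_zyx, clipped to the volume."""
    slices = []
    for center, size in zip(center_zyx, volume.shape):
        lower = max(center - half_width, 0)          # never below index 0
        upper = min(center + half_width + 1, size)   # never past the last index
        slices.append(slice(lower, upper))
    return volume[tuple(slices)]

# Synthetic volume in (Z, Y, X) order, filled with random values.
volume = np.random.default_rng(0).random((20, 30, 40))

interior = extract_cube(volume, (10, 15, 20))  # far from every edge
edge = extract_cube(volume, (1, 15, 39))       # near the Z and X boundaries

print(interior.shape)  # (7, 7, 7)
print(edge.shape)      # (5, 7, 4)
```

A clipped cube contains fewer voxels, so its mean is computed over a smaller region. Whether to keep or discard such puncta is a judgment call for the analysis.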

## 4.12 - Exercise

Now, let's plot a histogram of the GluR2 signal near each CtBP2 puncta. This was covered in *insert reference*. Look at the histogram. Are there any obvious outliers? Is there an obvious cutoff threshold? Based on this, how many functional synapses are there?

In [None]:
# E

Looks like there's an obvious cutoff threshold we can use (i.e., < 10). How many functional synapses are there?
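Counting the values above a cutoff is a one-liner with a boolean mask. The sketch below uses made-up intensities rather than the real GluR2 measurements, so the count it prints is not the answer to the exercise.

```python
import numpy as np

# Made-up signal intensities (not the real data).
toy_signal = np.array([2.1, 35.0, 41.2, 4.8, 38.7, 29.9])
threshold = 10

# Comparing an array to a number gives one True/False per element.
above_threshold = toy_signal >= threshold

# True counts as 1 when summed, so the sum is the number of hits.
n_above = int(above_threshold.sum())
print(n_above)  # 4
```

The same mask can be used to index a dataframe and keep only the rows that pass the threshold.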

## Bonus - Creating a composite image

In the above images, `imshow` is using a color map in which purple marks regions with no signal and yellow marks regions with the most signal. But what if we'd like to merge the three channels into a single image where red is mapped to CtBP2, green to GluR2 and blue to MyosinVIIa? How can we do this? Let's take another look at the documentation for `imshow`.

In [None]:
pl.imshow?

It looks like `imshow` can take a 3D array where the last dimension maps to the three colors (i.e., `x[..., 0]` is red, `x[..., 1]` is green and `x[..., 2]` is blue). The documentation also warns that the values in the array must be in the range 0 ... 1 for this to work. Let's check that.

In [None]:
# Reload each channel and crop it to the real image extents (Z, Y, X order)
node = fh.get_node('/DataSet/ResolutionLevel 0/TimePoint 0/Channel 0/Data')
data_CtBP2 = node.read()
data_CtBP2 = data_CtBP2[:z_pixels, :y_pixels, :x_pixels]
node = fh.get_node('/DataSet/ResolutionLevel 0/TimePoint 0/Channel 1/Data')
data_GluR2 = node.read()
data_GluR2 = data_GluR2[:z_pixels, :y_pixels, :x_pixels]

In [None]:
data_CtBP2.max()

Uh oh. We need to fix that. The simplest way to coerce data to the range 0 ... 1 is to divide by the maximum value. Let's do this and check that we did OK.

In [None]:
data_CtBP2 = data_CtBP2/np.max(data_CtBP2)
data_GluR2 = data_GluR2/np.max(data_GluR2)

In [None]:
data_CtBP2.max()

Great. Now we need to make the 2D image for each color and then merge them into a 3D array. A list of 2D images can be stacked into a 3D array using NumPy's `dstack` function. We also need a blank image for the blue color. The quickest way to make one is NumPy's `zeros_like` function, which creates an array of the same shape, but filled with zeros.

In [None]:
projection_CtBP2 = data_CtBP2.max(axis=0)
projection_GluR2 = data_GluR2.max(axis=0)
projection_blue = np.zeros_like(projection_CtBP2)

data = [projection_CtBP2, projection_GluR2, projection_blue]
projection = np.dstack(data)

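# Assumes `extent` was defined in your solution to exercise 4.4;
# origin='lower' keeps the orientation consistent with the earlier plots.
pl.imshow(projection, origin='lower', extent=extent)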
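If the shapes involved are hard to picture, here is a minimal sketch of the same steps on two tiny made-up 2 x 2 images. It checks that the stacked array has the color channel as its last dimension and that all values fall in the range 0 ... 1.

```python
import numpy as np

# Two tiny made-up single-channel images (8-bit integers).
red = np.array([[0, 50], [100, 200]], dtype=np.uint8)
green = np.array([[10, 20], [30, 40]], dtype=np.uint8)

# Dividing by the maximum converts to floats in the range 0 ... 1.
red_scaled = red / red.max()
green_scaled = green / green.max()

# A blank blue channel with the same shape and type as the others.
blue_scaled = np.zeros_like(red_scaled)

# dstack joins 2D images along a new last axis: (rows, columns, color).
rgb = np.dstack([red_scaled, green_scaled, blue_scaled])
print(rgb.shape)  # (2, 2, 3)
print(rgb[1, 1])  # brightest pixel in both red and green, so it shows as yellow
```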