# Introduction to Biomedical Image Preprocessing and Segmentation

[Image 1 download](https://s3.amazonaws.com/bebi103.caltech.edu/data/bsub_100x_phase.tif), [Image 2 download](https://s3.amazonaws.com/bebi103.caltech.edu/data/bsub_100x_cfp.tif)

<hr>

In [None]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade bebi103 iqplot scikit-image watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np

# Our image processing tools
import skimage.filters
import skimage.io
import skimage.morphology

import bebi103
import iqplot

import bokeh.io
import bokeh.layouts
import bokeh.palettes

bokeh.io.output_notebook()

<hr>

In this lesson, we will learn some basic techniques for image processing using [scikit-image](http://scikit-image.org) with Python.

## Loading and viewing images

We will now load and view the test images we will use for segmentation. We load an image using `skimage.io.imread()`, which returns it as a NumPy array. Each entry in the array is a pixel value. This is an important point: **a digital image is data**! It is a set of numbers with spatial positions.
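
To make the point concrete, here is a minimal sketch using a tiny made-up array as a stand-in for an image (the values are invented for illustration and are not taken from the test images). Rows index vertical position and columns index horizontal position.

```python
import numpy as np

# A made-up 3x4 "image": each entry is one pixel's intensity
toy_im = np.array(
    [
        [10, 12, 11, 10],
        [11, 95, 90, 12],
        [10, 11, 12, 11],
    ],
    dtype=np.uint16,
)

# Shape is (number of rows, number of columns)
print(toy_im.shape)

# Pixel value at row 1, column 2
print(toy_im[1, 2])

# Ordinary array operations work on images
print(toy_im.max(), toy_im.mean())
```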

Today, we'll be looking at some images of  *Bacillus subtilis*, a gram-positive bacterium famous for its ability to enter [a form of "suspended animation" known as sporulation](https://en.wikipedia.org/wiki/Sporulation_in_Bacillus_subtilis) when environmental conditions get rough. In these images, all cells have been engineered to express Cyan Fluorescent Protein (CFP) once they enter a particular genetic state known as [competence](https://en.wikipedia.org/wiki/Natural_competence). These cells have been imaged under phase contrast (`bsub_100x_phase.tif`) and epifluorescence (`bsub_100x_cfp.tif`) microscopy. These images were acquired by former Caltech graduate student [Griffin Chure](https://gchure.github.io).

Let's go ahead and load an image.

In [None]:
# Load the phase contrast image.
im_phase = skimage.io.imread(os.path.join(data_path, 'bsub_100x_phase.tif'))

# Take a look
im_phase

We indeed have a NumPy array of integer values. To properly display images, we also need to specify the **interpixel distance**, the physical distance corresponding to neighboring pixels in an image. Interpixel distances are calibrated for an optical setup by various means. For this particular setup, the interpixel distance was 62.6 nm.

In [None]:
# Store the interpixel distance in units of microns
ip_distance = 0.0626
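
As a quick sketch of what the interpixel distance buys us, we can convert between pixel units and physical units. The image shape below is a hypothetical placeholder chosen for illustration; in the notebook you would use `im_phase.shape` instead.

```python
# Interpixel distance in microns (62.6 nm, as stated in the text)
ip_distance = 0.0626

# Hypothetical image shape (rows, columns), for illustration only
n_rows, n_cols = 1000, 1200

# Physical extent of the field of view in microns
height_um = n_rows * ip_distance
width_um = n_cols * ip_distance

# Area represented by a single pixel, in square microns
pixel_area_um2 = ip_distance ** 2

print(f"field of view: {width_um:.1f} x {height_um:.1f} microns")
print(f"area per pixel: {pixel_area_um2:.5f} square microns")
```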

Now that we have the image loaded and know the interpixel distance, we would like to view it. Really, I should say "plot it," because *an image is data*.

### Viewing images with Bokeh and bebi103

The `bebi103.image.imshow()` function enables easy viewing of images.

In [None]:
# Create a rendering of the image
p = bebi103.image.imshow(im_phase)

# Use bokeh to display
bokeh.io.show(p)

The image is displayed with the bebi103 default colormap (more on that in a moment). By default, the axes are in units of pixels. We would rather have the axes marked in units of microns so we know the physical distances. We can specify that with the `interpixel_distance` keyword argument of `bebi103.image.imshow()`, passing in the interpixel distance we stored above.

To demonstrate additional options, we will also set the displayed height of the image to 200 pixels and use a grayscale colormap, which we can do with the `frame_height` and `cmap` keyword arguments, respectively.

In [None]:
p1 = bebi103.image.imshow(
    im_phase,
    frame_height=200,
    interpixel_distance=ip_distance,
    cmap=bokeh.palettes.gray(256),
    x_axis_label="µm",
    y_axis_label="µm",
)

bokeh.io.show(p1)

Conveniently, the image is fully interactive; we can zoom to our heart's content.

## Lookup tables


In the above image, we used a gray colormap. Below are a few different colormaps we could use instead. As I will discuss momentarily, you will *almost always* want to use Viridis.

In [None]:
p2 = bebi103.image.imshow(
    im_phase,
    frame_height=200,
    cmap=bokeh.palettes.magma(256),
    interpixel_distance=ip_distance,
    x_axis_label="µm",
    y_axis_label="µm",
)


p3 = bebi103.image.imshow(
    im_phase,
    frame_height=200,
    cmap=bokeh.palettes.viridis(256),
    interpixel_distance=ip_distance,
    x_axis_label="µm",
    y_axis_label="µm",
)

p4 = bebi103.image.imshow(
    im_phase,
    frame_height=200,
    cmap=bokeh.palettes.turbo(256),
    interpixel_distance=ip_distance,
    x_axis_label="µm",
    y_axis_label="µm",
)

# make a grid
grid = bokeh.layouts.gridplot([[p1, p2], [p3, p4]])

bokeh.io.show(grid)

The axes of the above plots are not linked unless we explicitly link them. To link them so that zooming in one plot zooms all of them to the same region, we can use the following code.

In [None]:
# Set the ranges for each plot equal to one another, so that they zoom to the same region
p2.x_range = p1.x_range
p2.y_range = p1.y_range
p3.x_range = p1.x_range
p3.y_range = p1.y_range
p4.x_range = p1.x_range
p4.y_range = p1.y_range

bokeh.io.show(grid)

In image processing, a colormap is called a **lookup table** (LUT). A LUT is a mapping of pixel values to colors. This sometimes helps visualize images, especially when we use false coloring. Remember, a digital image is data, and false coloring an image is **not** manipulation of data. It is simply a different way of plotting it.
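
One way to picture a lookup table is as an array of colors indexed by pixel value. The sketch below builds a made-up four-entry grayscale LUT and applies it to a toy image by array indexing. Real palettes such as `bokeh.palettes.viridis(256)` have 256 entries, and the plotting library handles the mapping of pixel values onto them; this is only meant to illustrate the concept.

```python
import numpy as np

# Toy image with pixel values 0 through 3
toy_im = np.array([[0, 1], [2, 3]])

# A made-up LUT: row i is the RGB color assigned to pixel value i
lut = np.array(
    [
        [0, 0, 0],        # value 0 -> black
        [85, 85, 85],     # value 1 -> dark gray
        [170, 170, 170],  # value 2 -> light gray
        [255, 255, 255],  # value 3 -> white
    ],
    dtype=np.uint8,
)

# Indexing the LUT with the image looks up a color for every pixel
colored = lut[toy_im]

# Result has shape (rows, columns, 3); the pixel values are untouched
print(colored.shape)
print(toy_im)
```

Swapping in a different `lut` changes only how the image looks, not the data in `toy_im`.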

As we just saw, we specify a lookup table with a **colormap**. There is lots of debate about what the best colormaps (LUTs) are. The data visualization community seems to universally reject rainbow colormaps. See, e.g., [D. Borland and R. M. Taylor, Rainbow Color Map (Still) Considered Harmful, IEEE Computer Graphics and Applications, 27, 14-17, 2007](http://doi.ieeecomputersociety.org/10.1109/MCG.2007.46). In the lower right example, I use a rainbow colormap (Turbo). You can see how the brightest parts of the image are not the ones that have the highest values. Rainbow colormaps, as discussed in the above publication, have issues with colorblind accessibility and data emphasis. You should **NOT** use rainbow colormaps.

Viridis [has been designed](http://bids.github.io/colormap/) to be perceptually uniform across a large range of values.

Importantly, the false coloring helps us see that the intensity of the pixels in the middle of cell clusters is similar to that of the background. As we will see, this becomes an issue when we begin our segmentation.

## Introductory segmentation

As mentioned before, **segmentation** is the process by which we separate regions of an image according to their identity for easier analysis. E.g., if we have an image of bacteria and we want to determine what is "bacteria" and what is "not bacteria," we would do some segmentation. We will use bacterial test images for this purpose.

### Histograms

As we begin segmentation, remember that viewing an image is just a way of plotting the digital image data. We can also plot a **histogram**, where we plot the number of pixels with a given value against pixel values. This helps us see some patterns in the pixel values and is often an important first step toward segmentation.
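
As a minimal sketch of what goes into such a plot, `np.unique()` with `return_counts=True` gives the distinct pixel values and how many pixels have each one. The toy array is made up for illustration.

```python
import numpy as np

toy_im = np.array(
    [
        [3, 3, 7, 7],
        [3, 1, 7, 3],
        [3, 3, 1, 3],
    ]
)

# Distinct pixel values and the number of pixels having each value
values, counts = np.unique(toy_im, return_counts=True)

for v, c in zip(values, counts):
    print(f"value {v}: {c} pixels")
```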

The histogram of an image is simply a list of counts of pixel values. When we plot the histogram, we can often readily see breaks in which pixel values are most frequently encountered. There are many ways of looking at histograms. The `spike()` function of iqplot can be used to conveniently display a histogram.

In [None]:
p = iqplot.spike(
    data=im_phase.flatten(),
    q='intensity',
    style='spike',
)

bokeh.io.show(p)

We see that there is some structure in the histogram of the phase image. While our eyes are drawn to the large peak around 380, we should keep in mind that our bacteria are black on a bright background and occupy only a small area of the image. We can see a smaller peak in the vicinity of 200, which likely represents our bugs of interest. The peak to the right is brighter, so it likely represents the background. Therefore, if we can find where the valley between the two peaks is, we may take pixels with intensity below that value to be bacteria and those above to be background. Eyeballing it, I think this critical pixel value is about 300.

### Thresholding

The process of taking pixels above or below a certain value is called **thresholding**. It is one of the simplest ways to segment an image. We call every pixel with a value below 300 part of a bacterium and everything above *not* part of a bacterium.

In [None]:
# Threshold value, as obtained by eye
thresh_phase = 300

# Generate thresholded image
im_phase_bw = im_phase < thresh_phase

# Display phase and thresholded image
p1 = bebi103.image.imshow(
    im_phase,
    frame_height=200,
    interpixel_distance=ip_distance,
    x_axis_label="µm",
    y_axis_label="µm",
)

p2 = bebi103.image.imshow(
    im_phase_bw,
    frame_height=200,
    interpixel_distance=ip_distance,
    x_axis_label="µm",
    y_axis_label="µm",
    cmap=["black", "white"],
)

p2.x_range = p1.x_range
p2.y_range = p1.y_range

plots = [p1, p2]

bokeh.io.show(bokeh.layouts.row(plots))

We can overlay these images to get a good view. To do this, we will make an RGB image and saturate the green channel where the thresholded image is white. We can then display it as an RGB image.

In [None]:
# Build RGB image by stacking grayscale images
im_phase_rgb = np.dstack(3 * [im_phase / im_phase.max()])

# Saturate green channel wherever there are white pixels in thresh image
im_phase_rgb[im_phase_bw, 1] = 1.0

# Show the result; cmap="rgb" asks for an RGB merge of the three channels
bokeh.io.show(bebi103.image.imshow(im_phase_rgb, cmap="rgb"))

We see that we did a decent job finding bacteria, but we also pick up quite a bit of garbage sitting around the cells. We can also see that in some of the bigger clusters, we do not effectively label the bacteria in the middle of colonies. This is because of the "halo" of high intensity signal near boundaries of the bacteria that we get from using phase contrast microscopy.

### Using the CFP channel

One way around these issues is to use bacteria that constitutively express a fluorescent protein and to segment using the fluorescence channel. Let's try the same procedure with the CFP channel. First, let's look at the image.

In [None]:
# Load image
im_cfp = skimage.io.imread(os.path.join(data_path, "bsub_100x_cfp.tif"))

# Display the image
bokeh.io.show(bebi103.image.imshow(im_cfp))

We see that the bacteria are typically brighter than the background (which is impressively uniform), so this might help us in segmentation.

### Filtering noise: the median filter

While it may not be obvious from this image, the non-bacterial pixels are not completely dark due to autofluorescence of the immobilization substrate as well as some issues in our camera. In fact, the camera on which these images were acquired has a handful of "bad" pixels which are always much higher than the "real" value. This could cause issues in situations where we would want to make quantitative measurements of intensity. We can zoom in on one of these "bad" pixels below (ignore the axis values in the display, which refer to the cropped region).

In [None]:
bokeh.io.show(bebi103.image.imshow(im_cfp[150:250, 450:550]))

We see a single bright pixel. In addition to throwing off our colormap a bit, this could alter the measured intensity of a cell if there happen to be any other bad pixels hiding within the bacteria. We can remove this noise by using a **median filter**. The concept is simple. We take a shape of pixels, called a **structuring element** or mask, and pass it over the image. The value of the center pixel in the mask is replaced by the median value of all pixels in the mask. To do this, we first need to construct a mask. This is done using the `skimage.morphology` module. The filtering is then done using `skimage.filters.median()`. Let's try it with a 3$\times$3 square mask.

In [None]:
# Make the structuring element
selem = skimage.morphology.square(3)

# Perform the median filter
im_cfp_filt = skimage.filters.median(im_cfp, selem)

# Display image
bokeh.io.show(bebi103.image.imshow(im_cfp_filt))

Now that we have dealt with the noisy pixels, we can see more clearly that some cells are very bright compared with others.
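
To make the median filter idea concrete, here is a hand-rolled sketch of a 3×3 median filter in plain NumPy, applied to a made-up array with a single "bad" pixel. It is for illustration only: the function name `median_filter_3x3` is our own, borders are handled by repeating edge values (which may differ from what `skimage.filters.median()` does), and the explicit loops are far slower than the library routine.

```python
import numpy as np


def median_filter_3x3(im):
    """Replace each pixel with the median of its 3x3 neighborhood."""
    # Pad by repeating edge values so border pixels have full neighborhoods
    padded = np.pad(im, 1, mode="edge")
    out = np.empty_like(im)
    for i in range(im.shape[0]):
        for j in range(im.shape[1]):
            # 3x3 window centered on pixel (i, j) of the original image
            out[i, j] = np.median(padded[i : i + 3, j : j + 3])
    return out


# Uniform background with a single hot pixel
toy_im = np.full((5, 5), 100, dtype=np.uint16)
toy_im[2, 2] = 4000

filtered = median_filter_3x3(toy_im)

print(toy_im[2, 2], "->", filtered[2, 2])
```

Because the hot pixel is outnumbered by its eight neighbors in every window that contains it, the median ignores it entirely.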

### Thresholding in the CFP channel

We'll proceed by plotting the histogram and finding the threshold value.

In [None]:
p = iqplot.spike(
    data=im_cfp_filt.flatten(),
    q="intensity",
    style="spike",
)

bokeh.io.show(p)

Yeesh. There are lots of bright pixels, but it is kind of hard to see where (or even if) there is a valley in the histogram. It sometimes helps to plot the histogram with the y-axis on a log scale. When we do this, we can eyeball the threshold value to be about 140.

In [None]:
# Use style='dot' when using a log scale, since there is no zero
p = iqplot.spike(
    data=im_cfp_filt.flatten(),
    q="intensity",
    style="dot",
    y_axis_type="log",
    y_range=[1e2, 1e6],
)

bokeh.io.show(p)

Now let's try thresholding the image.

In [None]:
# Threshold value, as obtained by eye
thresh_cfp = 140

# Generate thresholded image
im_cfp_bw = im_cfp_filt > thresh_cfp

# Display
plots = [
    bebi103.image.imshow(im_cfp, frame_height=200),
    bebi103.image.imshow(im_cfp_bw, frame_height=200, cmap=['black', 'white'])
]

bokeh.io.show(bokeh.layouts.row(plots))

Looks like we're doing much better! Let's try overlaying the images now.

In [None]:
# Build RGB image by stacking grayscale images
im_rgb = np.dstack(3 * [im_phase / im_phase.max()])

# Saturate green channel wherever there are white pixels in thresh image
im_rgb[im_cfp_bw, 1] = 1.0

# Show the result; cmap="rgb" asks for an RGB merge of the three channels
bokeh.io.show(bebi103.image.imshow(im_rgb, cmap="rgb"))

Very nice! In general, it is often much easier to segment bacteria with fluorescence.
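
Since we built the overlay the same way twice, it may be convenient to wrap the steps in a small helper. The function name `overlay_mask` and its arguments are our own invention for this sketch; they are not part of bebi103 or scikit-image.

```python
import numpy as np


def overlay_mask(im, mask, channel=1):
    """Return an RGB image of `im` with `channel` saturated where `mask` is True."""
    # Scale to [0, 1] and stack three copies to make a gray RGB image
    im_float = im / im.max()
    im_rgb = np.dstack(3 * [im_float])

    # Saturate the chosen channel (1 is green) inside the mask
    im_rgb[mask, channel] = 1.0
    return im_rgb


# Toy example: a dark 2x2 "cell" on a brighter background
toy_im = np.full((4, 4), 400, dtype=np.uint16)
toy_im[1:3, 1:3] = 150
toy_mask = toy_im < 300

toy_rgb = overlay_mask(toy_im, toy_mask)
print(toy_rgb.shape)
print(toy_rgb[1, 1])
```

In the notebook, `overlay_mask(im_phase, im_cfp_bw)` would build the same array as the cell above.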

### Otsu's method for thresholding

It turns out that there is an automated way to find the threshold value, as opposed to eyeballing it like we have been doing. [Otsu's method](https://en.wikipedia.org/wiki/Otsu%27s_method) provides this functionality.

In [None]:
# Compute Otsu thresholds for phase and cfp
thresh_phase_otsu = skimage.filters.threshold_otsu(im_phase)
thresh_cfp_otsu = skimage.filters.threshold_otsu(im_cfp_filt)

# Compare results to eyeballing it
print("Phase by eye: ", thresh_phase, "   CFP by eye: ", thresh_cfp)
print("Phase by Otsu:", thresh_phase_otsu, "   CFP by Otsu:", thresh_cfp_otsu)

We see that for the CFP channel, the Otsu method did very well. However, for phase, we see a big difference. This is because the Otsu method assumes a bimodal distribution of pixels. If we look at the histograms on a log scale, we see more clearly that the phase image has a long tail, which will trip up the Otsu algorithm. The moral of the story is that you can use automated thresholding, but you should always do sanity checks to make sure it is working as expected.
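
For intuition, here is a minimal sketch of the idea behind Otsu's method: try every candidate threshold and keep the one that maximizes the between-class variance of the two resulting groups of pixels. This is a brute-force illustration on made-up integer data, not scikit-image's implementation, and its output need not match `skimage.filters.threshold_otsu()` exactly. The name `otsu_sketch` is our own.

```python
import numpy as np


def otsu_sketch(im):
    """Brute-force Otsu threshold for an integer-valued image."""
    pixels = im.ravel()
    best_thresh, best_var = None, -1.0

    # Candidate thresholds: every distinct value except the largest,
    # so that both classes are nonempty
    for t in np.unique(pixels)[:-1]:
        low = pixels[pixels <= t]
        high = pixels[pixels > t]

        # Fraction of pixels in each class
        w_low = low.size / pixels.size
        w_high = high.size / pixels.size

        # Between-class variance
        var_between = w_low * w_high * (low.mean() - high.mean()) ** 2

        if var_between > best_var:
            best_thresh, best_var = t, var_between

    return best_thresh


# Made-up bimodal data: dim background near 100, bright cells near 200
rng = np.random.default_rng(0)
background = rng.integers(90, 111, size=900)
cells = rng.integers(190, 211, size=100)
toy_im = np.concatenate([background, cells]).reshape(40, 25)

thresh = otsu_sketch(toy_im)
print(thresh)
```

On cleanly bimodal data like this, the threshold lands between the two groups. With a long-tailed histogram such as that of the phase image, the maximizer can end up somewhere less useful, which is why the sanity check matters.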

## Determining the bacterial area

Now that we have a thresholded image, we can determine the total area taken up by bacteria. It's as simple as summing up the pixel values of the thresholded image, since each `True` pixel counts as one!

In [None]:
# Compute bacterial area
bacterial_area_pix = (im_cfp_filt > thresh_cfp_otsu).sum()

# Print out the result
print("bacterial area =", bacterial_area_pix, "pixels")

If we want to get the total area that is bacterial in units of µm², we can use the interpixel distance to get the area represented by each pixel. For this setup, the interpixel distance is 0.0626 µm, so each pixel represents an area of 0.0626² µm². We can then compute the bacterial area as follows.

In [None]:
# Compute bacterial area
bacterial_area_micron = bacterial_area_pix * ip_distance**2

# Print total area
print('bacterial area =', bacterial_area_micron, 'square microns')

## Computing environment

In [None]:
%load_ext watermark
%watermark -v -p numpy,skimage,bokeh,iqplot,bebi103,jupyterlab