# MK_Analysis
This Jupyter notebook analyzes proplatelet production by megakaryocytes from TIFF time-lapses acquired on the Incucyte Zoom (10X, 1392 x 1040 px, 1699.85 x 1549.40 um).

The Analysis workflow occurs in 3 steps:

1 ilastikProcessing - Unpacks tiff stacks & generates probability masks from phase images through the ilastik project (.ilp) file.


2 Quantification Pipeline - Primary pipeline for proplatelet production analysis. Important output files are listed as follows:
    
    a. 'results_Image.csv' -> raw results
    
    b. 'results_cell/pplt.properties' -> use in CPA to train classifiers
    
    c. 'labels' folder, containing 16-bit labels of proplatelet objects used for the Skeleton pipeline
    
    d. 'overlay' of phase images, labeling megs as red & pplts as green
    
    e. 'Raw.csv','Area.csv','PPlt_Pct.csv' -> Calculated & Formatted Results
    
3 Skeletonization Pipeline - Secondary pipeline for proplatelet structure analysis. Important output files are listed as follows:

In [1]:
%matplotlib inline
import glob
import h5py
import matplotlib
import numpy
import os
import os.path
import pandas
import re
from shutil import copy2
import skimage
import skimage.color
import skimage.exposure
import skimage.io
import subprocess
# from tkinter.filedialog import askdirectory
from tqdm.notebook import tqdm_notebook



# ilastikProcessing
To use ilastik effectively, some formatting must be done before and after ilastik processes the images. The input is assumed to be a time-series of images stored in a multi-page TIFF.

# Unpack (or separate) input images into single files.

## Update input image variables
Update the variable *image_directory* with the path to a folder that contains the input images. Update the *regex_image* variable to process only the images that match the regular expression. If the *regex_image* variable is equal to `(.*)\.tif`, then any *.tif* in the folder will be processed.

If a filename matches the *regex_single* regular expression, then it is assumed that this image has already been unpacked. An unpacked single image has the timepoint appended to the end of the filename, following the pattern `\-\d{4}\.tif`

*path_to_ilastik* is a string with the path to the ilastik software for [running headless](http://ilastik.org/documentation/basics/headless.html).

In [2]:
image_directory = r"C:\Users\Prakrith\Desktop\test"
path_to_ilastik = r"C:\Program Files\ilastik-1.3.0b4\run-ilastik.bat"
path_to_project = r"C:\Users\Prakrith\Documents\GitHub\Test\ilps\180517_Zoom.ilp"
path_to_cellprofiler = r"C:\Program Files (x86)\CellProfiler\CellProfiler.exe"
path_to_cp_pipeline = r"C:\Users\Prakrith\Documents\GitHub\Test\pipelines\180512_MP.cppipe"
regex_image = r"(.*)\.tif" #stack
regex_single = r".*\-\d{4}\.tif" #slice
re_image = re.compile(regex_image)
re_single = re.compile(regex_single)
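
As a quick check of how the two patterns interact, the sketch below classifies a few filenames. The filenames and the `classify` helper are hypothetical and only illustrate the rule: a file counts as a stack when it matches the image pattern but not the single-slice pattern.

```python
import re

# Same patterns as in the configuration cell, written as raw strings
re_image = re.compile(r"(.*)\.tif")
re_single = re.compile(r".*\-\d{4}\.tif")


def classify(filename):
    """Return 'slice', 'stack', or 'ignored' for a filename."""
    if re_single.match(filename):
        return "slice"      # already unpacked, e.g. name-0003.tif
    if re_image.match(filename):
        return "stack"      # multi-page TIFF that still needs unpacking
    return "ignored"        # not a .tif file


# Hypothetical filenames, for illustration only
for name in ["A1.tif", "A1-0003.tif", "notes.txt"]:
    print(name, "->", classify(name))
```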

In [3]:
os.makedirs(os.path.join(image_directory, "single_images"), exist_ok=True)
imdir_single = os.path.join(image_directory,"single_images")
os.makedirs(os.path.join(image_directory, "ilastik"), exist_ok=True)
imdir_ilastik = os.path.join(image_directory,"ilastik")
os.makedirs(os.path.join(image_directory, "output"), exist_ok=True)
output_directory = os.path.join(image_directory,"output")

In [None]:
# image_directory = r"C:\Users\Prakrith\Desktop\test"
# imdir_single = r"C:\Users\Prakrith\Desktop\test\single_images"
# imdir_ilastik = r"C:\Users\Prakrith\Desktop\test\ilastik"
# output_directory = r"C:\Users\Prakrith\Desktop\test\output"

## Methods to import image metadata
* *is_my_file* will use the regular expression to filter image files to be processed.
* *make_dict* parses a file to be processed and places metadata into a dictionary.

Parse the files to be processed and then place the metadata into a Pandas dataframe.

In [None]:
def is_my_file(filename, re_image, re_single):
    
    # Keep multi-page stacks; skip files that are already unpacked single images
    return (    re_image.match(filename) is not None
            and re_single.match(filename) is None
           )


def make_dict(filename, path, re_obj):
    
    my_dict = re_obj.match(filename).groupdict()
    
    my_dict["filename"] = filename
    
    my_dict["path"] = path
    
    return my_dict

In [None]:
image_files_dict = [make_dict(f, image_directory, re_image) for f in os.listdir(image_directory) if is_my_file(f, re_image, re_single)]
image_df = pandas.DataFrame(image_files_dict)
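
Because *make_dict* relies on `groupdict()`, only named groups in the regular expression become metadata columns; the default pattern has none, so only the filename and path are recorded. The sketch below shows the idea under an assumed naming scheme (`well_site.tif`). The pattern and the group names `well` and `site` are hypothetical.

```python
import re

# Hypothetical pattern: each named group becomes a key in the metadata dictionary
re_named = re.compile(r"(?P<well>[A-H]\d{1,2})_(?P<site>\d+)\.tif")


def make_dict(filename, path, re_obj):
    """Collect named regex groups plus the filename and path."""
    my_dict = re_obj.match(filename).groupdict()
    my_dict["filename"] = filename
    my_dict["path"] = path
    return my_dict


record = make_dict("B3_2.tif", "images", re_named)
print(record)
```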

## Unpack the multi-page TIFF images
For every image, create a single image file for each timepoint. The input images are assumed to be RGB, which has 3 dimensions (length, width, color). The multipage TIFF of RGB images will have 4 dimensions (timepoints, length, width, color). 

*If the upstream workflow changes and the input image format is altered, then the conditional logic below will need to be updated, specifically the logic based on the shape of the input images.*
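
The shape-based logic can be sketched without reading any image data. The helper below, `planned_outputs` (a hypothetical name), returns the single-image filenames that unpacking would produce for a given array shape, assuming fewer than 4 dimensions means one RGB image and 4 dimensions means a stack of timepoints.

```python
def planned_outputs(stem, shape):
    """Return the single-image filenames expected for an input of this shape."""
    if len(shape) < 4:
        n_timepoints = 1          # (length, width, color): a single RGB image
    else:
        n_timepoints = shape[0]   # (timepoints, length, width, color)
    return ["{0}-{1:04d}.tif".format(stem, i) for i in range(n_timepoints)]


# Hypothetical stack with 3 timepoints at the Incucyte Zoom resolution
print(planned_outputs("A1", (3, 1040, 1392, 3)))
# A single RGB image yields exactly one file
print(planned_outputs("A1", (1040, 1392, 3)))
```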

### Are the images across experiments similar enough to treat equally
One concern is overfitting when training a classification model in either ilastik or CellProfiler Analyst. The training set needs to be representative of the possibility space. This is accomplished by choosing a large enough image set that includes images of all states of interest, including undifferentiated and fully differentiated megakaryocytes.

We also want to eliminate noise from known sources of variability that could weaken the classifier. The primary sources of noise in the images are non-uniform illumination and differences in exposure. Non-uniform illumination is difficult to correct because the background sits in the middle of the intensity range, while the signal occupies both high and low intensities.



In [None]:
#filelist = glob.glob("D:\Prakrith\MK_Differentiation_Kyle\images\single_images\*.tif")

#for f in filelist:
    
#    im = skimage.io.imread(f)
    
#    im2 = skimage.color.rgb2gray(im)
        
#    im2 = skimage.img_as_ubyte(im2)

#    skimage.io.imsave(f, im2)

In [None]:
# ****DOWNSAMPLE

# def df_stack_image(p):
    
#     im = skimage.io.imread(os.path.join(p["path"], p["filename"]))
    
#     if len(im.shape) < 4:
        
#         retest = re_image.match(p["filename"])

#         retest.group(1)

#         fname = "{0}-{1:04d}.tif".format(retest.group(1), 0)
        
#         im2 = skimage.transform.rescale(im, 0.5)
        
#         im2 = skimage.color.rgb2gray(im2)
        
#         im2 = skimage.img_as_ubyte(im2)
        
#         skimage.io.imsave(os.path.join(p["path"], fname), im2)
        
#     else:
    
#         number_of_timepoints = im.shape[0]

#         for i in range(number_of_timepoints):

#             retest = re_image.match(p["filename"])

#             retest.group(1)

#             fname = "{0}-{1:04d}.tif".format(retest.group(1), i)
            
#             im2 = skimage.transform.rescale(im[i,:,:,:], 0.5)
            
#             im2 = skimage.color.rgb2gray(im2)
        
#             im2 = skimage.img_as_ubyte(im2)
            
#             skimage.io.imsave(os.path.join(p["path"], "single_images", fname), im2)

In [None]:
def df_stack_image(p):
    
    im = skimage.io.imread(os.path.join(p["path"], p["filename"]))
    
    # Filename without the extension, captured by the first group of regex_image
    stem = re_image.match(p["filename"]).group(1)
    
    # A single RGB image has 3 dimensions; treat it as a stack with one timepoint
    if len(im.shape) < 4:
        
        im = im[numpy.newaxis, ...]
    
    number_of_timepoints = im.shape[0]
    
    for i in range(number_of_timepoints):
        
        fname = "{0}-{1:04d}.tif".format(stem, i)
        
        # Convert RGB to 8-bit grayscale before saving
        im2 = skimage.color.rgb2gray(im[i])
        
        im2 = skimage.img_as_ubyte(im2)
        
        # Always save into single_images, the folder that the ilastik step reads from
        skimage.io.imsave(os.path.join(p["path"], "single_images", fname), im2)

In [None]:
if not image_df.empty:
    
    # Note that this can fail if the input images aren't in the expected format
    # If you receive an error, double check the format of the input images, e.g. are they RGB?
    tqdm_notebook.pandas(desc="unpack")
    _ = image_df.progress_apply(df_stack_image, axis=1)

else:
    
    print("no images to unpack")

# Run ilastik

Using the single images created earlier, process the images using ilastik. First, create another dataframe with the single image metadata. Note, this has been written for running on Windows.

## Process ilastik output for CellProfiler
ilastik will output an HDF5 file that must be parsed for use as input to CellProfiler. This workflow assumes the default export settings are being used in ilastik. We have observed performance costs when changing the export settings to formats beyond the standard ilastik HDF5 file. For example, exporting TIFF images changes the shape of the exported data from yxc (the default) to cyx. This rearrangement will cause downstream errors, because the code as written expects the channel to be the third dimension.
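
If a different export format ever has to be used, the axis order can be normalized before the channels are split. This is a minimal sketch on synthetic data, not part of the workflow itself; the helper name `to_yxc` and the `axis_order` argument are hypothetical.

```python
import numpy


def to_yxc(probabilities, axis_order):
    """Return the probability array with axes ordered (y, x, c)."""
    if axis_order == "yxc":
        return probabilities
    if axis_order == "cyx":
        # Move the channel axis from the front to the end
        return numpy.moveaxis(probabilities, 0, -1)
    raise ValueError("unsupported axis order: {}".format(axis_order))


# Synthetic data: 4 channels, 5 rows, 6 columns
cyx = numpy.zeros((4, 5, 6))
yxc = to_yxc(cyx, "cyx")
print(yxc.shape)
```

After this step the channel is the third dimension again, which is what the parsing code expects.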

### ilastik stage-2 labels
The project file *Mouse_MK.ilp* has the following labels that are stored in the same order within the HDF5 output.
1. background
1. border_white
1. cell
1. protrusion
1. background_border
1. not_cell
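
Since the channel order in the HDF5 output follows the label order, each channel index can be paired with its label name when naming the exported slices. The sketch below uses the label order listed above and the same filename pattern as the parsing code; the image name `A1-0000` is a hypothetical example.

```python
# Label order as listed above; channel i of the HDF5 output belongs to labels[i]
labels = ["background", "border_white", "cell", "protrusion",
          "background_border", "not_cell"]


def slice_filenames(filename_noext, labels):
    """Return one output filename per probability channel."""
    return ["{}_{}_prbstg2_{}.png".format(filename_noext, label, i)
            for i, label in enumerate(labels)]


names = slice_filenames("A1-0000", labels)
for name in names:
    print(name)
```

The list of label names must have one entry per exported channel; otherwise indexing by channel number fails.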

In [None]:
def is_my_file(filename, re_obj):
    
    # True when the filename matches the regular expression
    return re_obj.match(filename) is not None


def make_dict(filename, path, re_obj):
    
    my_dict = re_obj.match(filename).groupdict()
    
    my_dict["filename"] = filename
    
    my_dict["path"] = path
    
    return my_dict

In [None]:
image_files_dict = [make_dict(f, imdir_single, re_single) for f in os.listdir(imdir_single) if is_my_file(f, re_single)]
image_df = pandas.DataFrame(image_files_dict)

In [None]:
def df_ilastik(p):
    
    filename = os.path.join(p["path"], p["filename"])
    
    filename_noext = os.path.splitext(p["filename"])[0]
    
    filename_h5 = "{}_Probabilities Stage 2.h5".format(filename_noext)
    
    # Run ilastik using subprocess
    
    process = subprocess.Popen([path_to_ilastik, 
                  "--headless",
                  "--export_source=probabilities stage 2",
                  "--output_format=hdf5",
                  r"--project={}".format(path_to_project),
                  filename
                 ], stdout=subprocess.PIPE)
    
    out, err = process.communicate()
    
    # unpack the HDF5 file
    
    label_list = ["background", "protrusion", "cell_boundary", "cell"]
    
    path_h5 = os.path.join(p["path"], filename_h5)
    
    with h5py.File(path_h5, "r") as ilastik_hdf5:
    
        ilastik_probabilities = ilastik_hdf5["exported_data"][()]
    
        for i in range(ilastik_probabilities.shape[2]):
            im = skimage.img_as_uint(ilastik_probabilities[:, :, i])
        
            filename_slice = "{}_{}_prbstg2_{}.png".format(filename_noext, label_list[i], i)
        
            skimage.io.imsave(os.path.join(p["path"], "..", "ilastik", filename_slice), im)
    
    os.remove(path_h5)

In [None]:
tqdm_notebook.pandas(desc="run ilastik")
_ = image_df.progress_apply(df_ilastik, axis=1)

# Run CellProfiler

## Make a filelist
Add the paths to each file that will be processed by CellProfiler into a text file.

In [4]:
CPA_Rules = r'C:\Users\Prakrith\Desktop\CPTemp_in\fgb_rules_pplt.txt' #path to the CellProfiler Analyst rules file
copy2(CPA_Rules, image_directory)

'C:\\Users\\Prakrith\\Desktop\\test\\fgb_rules_pplt.txt'

In [5]:
def sorted_nicely(l):
    convert = lambda text: int(text) if text.isdigit() else text
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key = alphanum_key)
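
The natural sort matters because a plain lexicographic sort places `A10` before `A2`, which would scramble the order of wells in the file list. The sketch below compares the two orderings on hypothetical filenames.

```python
import re


def sorted_nicely(l):
    # Split each name into text and number chunks so numbers compare numerically
    convert = lambda text: int(text) if text.isdigit() else text
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)


# Hypothetical filenames, for illustration only
names = ["A1-0010.tif", "A1-0002.tif", "A10-0000.tif", "A2-0000.tif"]
print(sorted(names))         # lexicographic: A10 sorts before A2
print(sorted_nicely(names))  # natural: A2 sorts before A10
```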

In [6]:
single_list = sorted_nicely(glob.glob(os.path.join(imdir_single,"*.tif")))
ilastik_list = sorted_nicely(glob.glob(os.path.join(imdir_ilastik,"*.png")))
big_list = single_list + ilastik_list
with open(os.path.join(image_directory,"filelist.txt"), 'w') as f:
    for item in big_list:
        f.write("{}\n".format(item))

## Quantification Pipeline
Use subprocess to run CellProfiler on the images to be processed.

Note that a model that filters protrusions was trained in CellProfiler Analyst outside of this workflow. The model has to be in the input folder to be found by CellProfiler.

In [7]:
process = subprocess.Popen([path_to_cellprofiler,
                  "--run-headless",
                  "--pipeline={}".format(path_to_cp_pipeline),
                  "--file-list={}".format(os.path.join(image_directory,"filelist.txt")),
                  "--image-directory={}".format(image_directory),
                  "--output-directory={}".format(output_directory)
                 ], stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    
out, err = process.communicate()
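
The cell above captures `out` and `err` but does not inspect them, so a failed run can go unnoticed. One possible way to surface failures is to check the exit code, as in this minimal sketch. The `run_checked` helper is hypothetical, and the Python interpreter is used as a stand-in command because CellProfiler may not be installed where this is run.

```python
import subprocess
import sys


def run_checked(command):
    """Run a command and return its stdout; raise if the exit code is non-zero."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = process.communicate()
    if process.returncode != 0:
        raise RuntimeError("command failed ({}): {}".format(
            process.returncode, err.decode(errors="replace")))
    return out.decode(errors="replace")


# Stand-in command for illustration; substitute the CellProfiler argument list
print(run_checked([sys.executable, "-c", "print('ok')"]))
```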

## Quantify Proplatelet Production
From the csvs generated by CellProfiler, the file 'results_Image' is parsed, & proplatelet production is quantified.
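
The two derived quantities are simple ratios. The sketch below works them through using the field-of-view dimensions stated at the top of this notebook (1392 x 1040 px, 1699.85 x 1549.40 um); the function names are hypothetical.

```python
TOTAL_AREA_UM2 = 1699.85 * 1549.40   # field of view in square micrometers
TOTAL_AREA_PX = 1392 * 1040          # field of view in pixels


def pplt_percent(count_proplatelets, count_megs):
    """Proplatelet count as a percentage of megakaryocyte count."""
    if count_megs == 0:
        return float("nan")   # undefined when no megakaryocytes were counted
    return count_proplatelets / count_megs * 100


def area_px_to_um2(area_px):
    """Convert an area in pixels to square micrometers."""
    return area_px * TOTAL_AREA_UM2 / TOTAL_AREA_PX


print(pplt_percent(5, 20))            # 5 proplatelets among 20 megs
print(area_px_to_um2(TOTAL_AREA_PX))  # the whole field of view
```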

In [None]:
# df.columns = range(df.shape[1]) #drops column headers\
# df = df.reindex(index=natsorted(df.index));

# def sorted_nicely(l):
#     convert = lambda text: int(text) if text.isdigit() else text
#     alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
#     return sorted(l, key = alphanum_key)

In [8]:
header = ["URL_phase","AreaOccupied_AreaOccupied_proplatelets","Count_proplatelets","Count_megs","ImageNumber"] #"AreaOccupied_AreaOccupied_megs"
# pass the path directly so that pandas opens and closes the file itself
df = pandas.read_csv(os.path.join(output_directory, "results_Image.csv"), usecols=header, index_col=False)
df = df.set_index("URL_phase") #won't allow setting URL_phase as index in read_csv
df = df.reindex(index=sorted_nicely(df.index)) #reorder df alphanumerically

In [9]:
def f(x):
    # (count_proplatelet/count_meg)*100; NaN when no megs were counted
    if x["Count_megs"] == 0:
        return numpy.nan
    return (x["Count_proplatelets"] / x["Count_megs"]) * 100
    
def g(x):
    # hard coded from IJ, [(area * total_area_um)/total_area_px]; AR = 1.22 um/px
    return (x["AreaOccupied_AreaOccupied_proplatelets"] * 2633747.59) / 1447680

df['Pct'] = df.apply(f, axis=1)
df['Area_um'] = df.apply(g, axis=1)
df.to_csv(os.path.join(output_directory,'Raw.csv'))
# keep only ImageNumber, Pct and Area_um for the formatted tables
df = df.drop(columns=["AreaOccupied_AreaOccupied_proplatelets","Count_proplatelets","Count_megs"])
df = df.set_index("ImageNumber")
df2 = pandas.DataFrame()

In [11]:
stack_list = sorted_nicely(glob.glob(os.path.join(image_directory,"*.tif")))
t = len(single_list) #total num images
n = len(single_list) // len(stack_list) #slices per stack

for i in range(0,t,n):
    slc = df.iloc[i:i+n]
    slc = slc.reset_index(drop=True)
    df2 = pandas.concat([df2,slc],axis=1,ignore_index=True) #iter df by stack length (n), and concat
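
The loop above turns one long column of per-image results into one column per stack. The same chunking can be sketched with plain lists, assuming (as the loop does) that every stack has the same number of timepoints. The helper name and the values are hypothetical.

```python
def split_by_stack(values, n_stacks):
    """Split a flat list of per-image values into one list per stack."""
    n = len(values) // n_stacks   # slices per stack
    return [values[i:i + n] for i in range(0, len(values), n)]


# Hypothetical results for 2 stacks with 3 timepoints each, in file-list order
values = [10, 11, 12, 20, 21, 22]
columns = split_by_stack(values, 2)
print(columns)
```

If the stacks differ in length, the integer division no longer gives a valid chunk size and the columns will be misaligned.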

In [12]:
def h(df,sl,n,name):
    #function adds headers to columns, fixes index, and creates final csv
    df = df.copy() #work on a copy so that df2 is not modified
    df.columns = sl
    df['Timepoint'] = list(range(1,n+1))
    df = df.set_index("Timepoint")
    df.to_csv(os.path.join(output_directory,name))
    
a = df2.iloc[:,1::2] #odd columns hold Area_um
h(a,stack_list,n,r'Area.csv')
p = df2.iloc[:,::2] #even columns hold Pct
h(p,stack_list,n,r'PPlt_Pct.csv')

# Run CellProfiler - step #2

## Make skeleton filelist
Include the paths to each label file that will also be processed by CellProfiler into a text file.

## Skeleton Pipeline
Use subprocess to run CellProfiler on the images/labels to be processed.

In [None]:
path_to_sk_pipeline = r"C:\Users\Prakrith\Documents\GitHub\Test\pipelines\Kyle_Skel.cppipe" #second pipeline

In [None]:
os.makedirs(os.path.join(image_directory, "skeleton"), exist_ok=True)
skeleton_directory = os.path.join(image_directory,"skeleton");

In [None]:
imdir_label = os.path.join(output_directory,"labels");
label_list = sorted_nicely(glob.glob(os.path.join(imdir_label,"*.tiff")))
skel_list = single_list + label_list
with open(os.path.join(image_directory,"filelist2.txt"), 'w') as f:
    for item in skel_list:
        f.write("{}\n".format(item))

In [None]:
process = subprocess.Popen([path_to_cellprofiler,
                  "--run-headless",
                  "--pipeline={}".format(path_to_sk_pipeline),
                  "--file-list={}".format(os.path.join(image_directory,"filelist2.txt")),
                  "--image-directory={}".format(image_directory),
                  "--output-directory={}".format(skeleton_directory)
                 ], stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

out, err = process.communicate()

In [None]:
df = pandas.read_csv(os.path.join(skeleton_directory, "vertices.csv"))
df.columns = ['image_number', 'vertex_number','y','x','labels','kind'] #rename i,j to y,x
cols = ['image_number', 'vertex_number', 'x','y','labels', 'kind']
df = df[cols] #swap y,x to x,y in vertices_csv

In [None]:
# def node_find(vert,i):
#     slc = vert[vert['image_number'] == i]; #slice by image
#     slc = slc.drop(columns=['image_number']);
#     nodes = slc.set_index('vertex_number').T.to_dict('list');
#     return nodes; #takes all nodes from vert and converts to dict of lists

In [None]:
e_df = pandas.read_csv(os.path.join(skeleton_directory, "edges.csv"))

In [None]:
# def node_finder(vert,i,location): #vert=vertex dataframe, i=image number, node location in (x,y)
#     slc = vert[vert['image_number'] == i];
#     slc2 = slc[(slc['x'] == location[0])]; #slice vertices in regards to x
#     slc3 = slc2[(slc2['y'] == location[1])]; #reduce to 1 specific vertex via y
#     node = slc3.iloc[0]['vertex_number']; #pull vertex_number
#     kind = slc3.iloc[0]['kind']; #pull node,kind
#     return node,kind;

In [None]:
def next_node(edge,i,v1,v2): #edge=edge dataframe, i=image number, v1/v2=vertex numbers
    slc = edge[edge['image_number'] == i] #slice by image
    slc2 = slc[slc['v1'] == v1] #slice vertex 1
    slc3 = slc2[slc2['v2'] == v2] #reduce to single branch
    length = slc3.iloc[0]['length'] #pull length between nodes
    return length

In [None]:
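import networkx as nx

def branch_gen(v1,v2,length):
    g = nx.Graph()
    g.add_node(v1)
    g.add_node(v2)
    g.add_edge(v1, v2, length=length) #store the branch length as an edge attribute
    return g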

In [None]:
def node_find2(vert,i,l):
    slc = vert[vert['image_number'] == i]
    slc2 = slc[slc['labels'] == l];
    slc2 = slc2.drop(columns=['image_number','labels']);
    nodes = slc2.set_index('vertex_number').T.to_dict('list');
    return nodes; #takes nodes from specific proplatelet struct within specific img, return as dict of list

In [None]:
def edge_find(edge,i):
    slc = edge[edge['image_number'] == i]
    slc = slc.drop(columns=['image_number','total_intensity'])
    edges = slc.to_dict('records') #one dict per edge within the image
    return edges

In [None]:
# G = DiGraph()
# nodes = csv.DictReader(open(nodeFile, 'rU'), ['index', 'label', 'type'])
# for row in nodes:
#     G.add_node(row['index'], {'index':row['index'], 'label':row['label'], 'type':row['type']})
# edges = csv.DictReader(open(edgeFile, 'rU'), ['v1', 'v2', 'weight'])
# for row in edges:
#     G.add_edge(row['v1'], row['v2'], row['weight'])
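
The graph-building idea in the draft cells above can also be sketched with the standard library alone. This is an illustrative sketch, not the finished skeleton analysis: it builds an undirected adjacency map from edge rows and sums the branch lengths. The column names `v1`, `v2` and `length` follow the edge table used above; the rows themselves are hypothetical.

```python
from collections import defaultdict


def build_graph(edge_rows):
    """Build an undirected adjacency map {vertex: {neighbor: length}}."""
    graph = defaultdict(dict)
    for row in edge_rows:
        graph[row["v1"]][row["v2"]] = row["length"]
        graph[row["v2"]][row["v1"]] = row["length"]
    return graph


def total_length(graph):
    """Sum all branch lengths; each edge is stored twice, so halve the sum."""
    return sum(length for neighbors in graph.values()
               for length in neighbors.values()) / 2


# Hypothetical edges of one proplatelet skeleton: vertex 2 is a branch point
edges = [
    {"v1": 1, "v2": 2, "length": 4.0},
    {"v1": 2, "v2": 3, "length": 2.5},
    {"v1": 2, "v2": 4, "length": 1.5},
]
graph = build_graph(edges)
print(sorted(graph[2]))      # neighbors of the branch point
print(total_length(graph))   # total skeleton length
```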