# Data Preprocessing: Transforming MPP Data into Images for Predicting Transcription Rate (TS)

This notebook is meant to convert raw cell data from several wells into multichannel images (along with their corresponding masks, targets and metadata).

Data was taken from:
`/storage/groups/ml01/datasets/raw/20201020_Pelkmans_NascentRNA_hannah.spitzer/` and server `vicb-submit-01`. 

The objective of the preprocessing done in this notebook is to create an 'imaged' version of the MPP data.

The selection of the input channels (`input_channels`) and of the target variable is done during the conversion into a TensorFlow dataset!

Considerations:
- NO discrimination of channels is done! All the channels are saved in the same order, and all of them are also projected into scalars and saved as targets. However, if `input_channels` and `output_channels` are given in the parameters file, then the filtering of channels/targets is done when saving to disk.
- To avoid data duplication, the cell images for each well are saved right after the preprocessing and not at the end.
- NO train, val and test splitting is done here! That (and data normalization) is done during the creation of the TFDS.
- There are several ways of saving the images. This behaviour is defined by the parameter `img_saving_mode`:
    - **original_img**: save the original cell image with its original size and shape (rectangular or square). The drawback of this method is that images are saved with different sizes and aspect ratios.
    - **original_img_and_squared**: save the original cell image without a fixed size but with a fixed shape (square). Although all images are saved with the same shape (square), the drawback of this method is that they are saved with different sizes.
    - **original_img_and_fixed_size**: save the original cell image with a fixed size and shape (square). The drawback of this method is that if zooming in is to be used as a data augmentation technique, each image needs to be processed individually on the fly to avoid cropping the cell information.
    - **fixed_cell_size**: save the image with a fixed size and shape (square), maximizing the cell size within the image. This means that up- or downsampling may be needed. The drawback of this method is that some distortion of the original data (image) is inevitable during up- or downsampling (zoom in/out). There are several options for the up/downsampling (interpolation), which can be selected through the parameter `img_interpolation_method`. To see the complete list of available interpolation methods, please visit:<br>https://www.tensorflow.org/api_docs/python/tf/image/ResizeMethod
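
To make the difference between the padding-based modes more concrete, here is a minimal NumPy sketch. It is not the code used by `MPPData`; the helper names `pad_to_square` and `pad_to_fixed_size` are hypothetical, and raising an error when a cell does not fit into `img_size` is an assumption of this sketch. It only illustrates how a rectangular cell image could be centered on a square canvas (as in **original_img_and_squared**) or on a canvas of fixed size (as in **original_img_and_fixed_size**) without rescaling.

```python
import numpy as np

def pad_to_square(img, pad_value=0):
    """Pad an (H, W, C) image with a constant so that H == W, keeping the content centered."""
    h, w = img.shape[:2]
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    out = np.full((side, side, img.shape[2]), pad_value, dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out

def pad_to_fixed_size(img, img_size, pad_value=0):
    """Center an (H, W, C) image on an (img_size, img_size, C) canvas without rescaling."""
    h, w = img.shape[:2]
    if max(h, w) > img_size:
        # Assumption of this sketch: cells that do not fit are rejected
        raise ValueError('Image ({}x{}) does not fit into img_size={}'.format(h, w, img_size))
    top, left = (img_size - h) // 2, (img_size - w) // 2
    out = np.full((img_size, img_size, img.shape[2]), pad_value, dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out

# Toy rectangular "cell" with 3 channels
cell = np.ones((40, 60, 3), dtype=np.float32)
squared = pad_to_square(cell)          # square, but size depends on the cell
fixed = pad_to_fixed_size(cell, 224)   # square and same size for every cell
print(squared.shape, fixed.shape)
```

In both cases the pixel values are untouched. **fixed_cell_size** differs in that the cell is additionally rescaled, for example with `tf.image.resize` and one of the `tf.image.ResizeMethod` options, which is where the distortion mentioned above comes from.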
    

# 1.- Preparation

Load libraries:

In [None]:
# For Development and debugging:
# Reload modules without restarting the kernel
#%load_ext autoreload
#%autoreload 2

In [None]:
import json
import math
import os
import socket
import sys
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# To display all the columns
pd.options.display.max_columns = None

# Set terminal output (to send messages to the terminal stdout)
terminal_output = open('/dev/stdout', 'w')
print('Execution of Notebook started at {}'.format(datetime.now()), file=terminal_output)

Load external libraries:

In [None]:
# Load external libraries
if socket.gethostname() == 'hughes-machine':
    external_libs_path = '/home/hhughes/Documents/Master_Thesis/Project/workspace/libs'
else:
    external_libs_path = '/storage/groups/ml01/code/andres.becker/master_thesis/workspace/libs'
print('External libs path: \n'+external_libs_path, file=terminal_output)

if not os.path.exists(external_libs_path):
    msg = 'External library path {} does not exist!'.format(external_libs_path)
    raise Exception(msg)
    
# Add EXTERNAL_LIBS_PATH to sys paths (for loading libraries)
sys.path.insert(1, external_libs_path)
# Load external libraries
from pelkmans.mpp_data_V2 import MPPData as MPPData
from Utils import create_directory as create_directory
from Utils import print_stdout_and_log as print_stdout_and_log

Load Parameters:

In [None]:
# Do not touch the value of PARAMETERS_FILE!
# When this notebook is executed with jupyter-nbconvert (from script),
# it will be replaced automatically
#PARAMETERS_FILE = '/home/hhughes/Documents/Master_Thesis/Project/workspace/scripts/Parameters/MPP_to_imgs_no_split_local.json'
PARAMETERS_FILE = 'dont_touch_me-input_parameters_file'

if not os.path.exists(PARAMETERS_FILE):
    raise Exception('Parameter file {} does not exist!'.format(PARAMETERS_FILE))
    
# Open parameters
with open(PARAMETERS_FILE) as params_file:
    p = json.load(params_file)
    
# Save parameter file path and libs path
p['parameters_file_path'] = PARAMETERS_FILE
p['external_libs_path'] = external_libs_path

# Set some default parameters in case they are not given
if 'input_channels' not in p.keys():
    p['input_channels'] = None
    
if 'output_channels' not in p.keys():
    p['output_channels'] = None

for key in p.keys():
    print_stdout_and_log('{}: {}'.format(key, p[key]))
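
The cells in this notebook read a number of keys from the parameters file. As a rough orientation, the sketch below lists the keys that are accessed unconditionally and checks a parameters dictionary against them. All values in `example_params` are purely illustrative assumptions (paths, `dir_type`, `aggregate_output`, `img_size`, etc. are not taken from a real configuration), and `missing_keys` is a hypothetical helper, not part of `Utils`.

```python
import json

# Keys that the notebook accesses as p[...] without a fallback
REQUIRED_KEYS = [
    'log_file', 'raw_data_dir', 'output_pp_data_path', 'output_pp_data_dir_name',
    'perturbations_and_wells', 'dir_type',
    'add_cell_cycle_to_metadata', 'add_well_info_to_metadata',
    'subtract_background', 'project_into_scalar', 'aggregate_output',
    'img_size', 'img_saving_mode', 'img_interpolation_method', 'images_dtype',
]

def missing_keys(params):
    """Return the sorted list of required keys that are absent from params."""
    return sorted(k for k in REQUIRED_KEYS if k not in params)

# Illustrative values only (assumptions, not a real configuration)
example_params = {
    'log_file': '/tmp/mpp_to_imgs.log',
    'raw_data_dir': '/path/to/raw_data',
    'output_pp_data_path': '/path/to/output',
    'output_pp_data_dir_name': 'data',
    'perturbations_and_wells': {'184A1_hannah_unperturbed': ['I11', 'I09']},
    'dir_type': 'placeholder_dir_type',
    'add_cell_cycle_to_metadata': False,
    'add_well_info_to_metadata': False,
    'subtract_background': False,
    'project_into_scalar': True,
    'aggregate_output': 'avg',
    'img_size': 224,
    'img_saving_mode': 'original_img_and_fixed_size',
    'img_interpolation_method': 'nearest',
    'images_dtype': 'float32',
}

# Optional keys default to None, as in the cell above
example_params.setdefault('input_channels', None)
example_params.setdefault('output_channels', None)

print('Missing keys:', missing_keys(example_params))
print(json.dumps(example_params, indent=4))
```

Keys such as `cell_cycle_file`, `well_info_file`, `background_value` and `filter_values` are only read when the corresponding switch (`add_cell_cycle_to_metadata`, `add_well_info_to_metadata`, `subtract_background`, `filter_criteria`) is set, so they are left out of the required list here.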

Set logging:

In [None]:
# Set logging configuration
import logging
logging.basicConfig(
    filename=p['log_file'],
    filemode='w', 
    level=logging.INFO
)
print_stdout_and_log('Parameters loaded from file:\n{}'.format(PARAMETERS_FILE))
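
One detail worth keeping in mind: `logging.basicConfig` does nothing if the root logger already has handlers, and module-level calls such as `logging.info(...)` install a default handler implicitly when none exists. The implementation of `print_stdout_and_log` is not shown here, so whether this affects the notebook is an assumption; if it does log through the root logger before the configuration cell runs, the log file would not be created. The sketch below shows the effect and one possible remedy, the `force=True` argument (available since Python 3.8). The log path is a temporary, hypothetical one.

```python
import logging
import os
import tempfile

# Hypothetical log path; in the notebook this would be p['log_file']
log_file = os.path.join(tempfile.mkdtemp(), 'preprocessing.log')

# An early call on the root logger implicitly installs a default stderr handler
# (if no handler is configured yet)
logging.warning('emitted before configuration')

# force=True removes any existing root handlers before applying the configuration
logging.basicConfig(filename=log_file, filemode='w', level=logging.INFO, force=True)
logging.info('Parameters loaded')

with open(log_file) as f:
    log_content = f.read()
print(log_content)
```

Without `force=True`, the second configuration would be silently ignored in this scenario and `log_file` would never be written.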

Set paths and create output directories:

In [None]:
# Load data path
DATA_DIR = p['raw_data_dir']
if not os.path.exists(DATA_DIR):
    raise Exception('Data path {} does not exist!'.format(DATA_DIR))
print_stdout_and_log('DATA_DIR: {}'.format(DATA_DIR))

# Create dirs to save data
outdir = p['output_pp_data_path']
create_directory(dir_path=outdir, clean_if_exist=False)

# Create directories to save images
output_data_path = os.path.join(outdir, p['output_pp_data_dir_name'])
create_directory(dir_path=output_data_path, clean_if_exist=True)
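
Note that the two calls above use different values for `clean_if_exist`. The source of `Utils.create_directory` is not part of this notebook, so the following is only a guess at its behaviour, written as a hypothetical stand-in (`create_directory_sketch`): with `clean_if_exist=True` an existing directory is assumed to be emptied, which would mean that previously saved cell images in the output directory are deleted on every run.

```python
import os
import shutil
import tempfile

def create_directory_sketch(dir_path, clean_if_exist=False):
    """Create dir_path; if it already exists and clean_if_exist is True, remove it first."""
    if os.path.exists(dir_path) and clean_if_exist:
        shutil.rmtree(dir_path)
    os.makedirs(dir_path, exist_ok=True)

# Demonstration on a temporary directory
base = tempfile.mkdtemp()
out = os.path.join(base, 'data')
create_directory_sketch(out)
open(os.path.join(out, 'old.npz'), 'w').close()

create_directory_sketch(out, clean_if_exist=False)
kept = os.listdir(out)           # the old file is still there

create_directory_sketch(out, clean_if_exist=True)
after_clean = os.listdir(out)    # the directory was recreated empty
print(kept, after_clean)
```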

# 2.- Prepare selected data to process (wells and I/O channels)

Check available data (Perturbations and Wells):

In [None]:
print_stdout_and_log('Reading local available perturbations-wells...')
# Save available local Perturbations and Wells
perturbations = [per for per in os.listdir(DATA_DIR) if os.path.isdir(os.path.join(DATA_DIR, per))]
local_data = {}
#print('Local available perturbations-wells:\n')
for per in perturbations:
    pertur_dir = os.path.join(DATA_DIR, per)
    wells = [w for w in os.listdir(pertur_dir) if os.path.isdir(os.path.join(pertur_dir, w))]
    #print('{}\n\t{}\n'.format(p, wells))
    local_data[per] = wells
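
The scan above assumes a two-level layout, `DATA_DIR/<perturbation>/<well>/`, and ignores plain files at either level. A self-contained sketch of the same idea on a temporary directory tree is shown below; `list_subdirs` and `scan_perturbations` are hypothetical helper names, the perturbation and well names are the ones mentioned in the commented example of this notebook, and the file name `some_file.csv` is made up.

```python
import os
import tempfile

def list_subdirs(path):
    """Return the sorted names of the direct subdirectories of path."""
    return sorted(d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d)))

def scan_perturbations(data_dir):
    """Map each perturbation directory to the list of its well directories."""
    return {per: list_subdirs(os.path.join(data_dir, per)) for per in list_subdirs(data_dir)}

# Build a toy directory tree
demo_dir = tempfile.mkdtemp()
for per, well in [('184A1_hannah_unperturbed', 'I11'),
                  ('184A1_hannah_unperturbed', 'I09'),
                  ('184A1_hannah_TSA', 'J20')]:
    os.makedirs(os.path.join(demo_dir, per, well))
# A plain file next to the perturbation directories is ignored by the scan
open(os.path.join(demo_dir, 'some_file.csv'), 'w').close()

local_data_demo = scan_perturbations(demo_dir)
print(local_data_demo)
```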

Select perturbations and their wells to process:

In [None]:
msg = 'Local available perturbations-wells:\n{}'.format(local_data)
print(msg)
logging.debug(msg)

# In case you only want to load some specific perturbations and/or wells here:
#selected_data = {
#    '184A1_hannah_unperturbed': ['I11', 'I09'],
#    '184A1_hannah_TSA': ['J20', 'I16'],
#}

# Load perturbations-wells from parameters file
selected_data = p['perturbations_and_wells']
# How many wells will be processed?
n_wells = sum(len(wells) for wells in selected_data.values())

print('\nSelected perturbations-wells:\n{}'.format(selected_data))

# Generate and save data dirs
data_dirs = []
for per in selected_data.keys():
    for w in selected_data[per]:
        d = os.path.join(DATA_DIR, per, w)
        data_dirs.append(d)
        if not os.path.exists(d):
            msg = '{} does not exist!\nCheck that selected_data contains only elements from the local_data dict.'.format(d)
            raise Exception(msg)
p['data_dirs'] = data_dirs
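
The cell above stops at the first missing directory. As an alternative, the selection could be compared against the dictionary of locally available data so that all missing perturbation-well pairs are reported at once. The following is a minimal sketch of that idea; `find_missing` is a hypothetical helper and the dictionaries are toy data.

```python
def find_missing(selected, available):
    """Return {perturbation: [wells]} for every selected well that is not available."""
    missing = {}
    for per, wells in selected.items():
        absent = [w for w in wells if w not in available.get(per, [])]
        if absent:
            missing[per] = absent
    return missing

# Toy data: what is on disk vs. what the parameters file asks for
available_demo = {
    '184A1_hannah_unperturbed': ['I11', 'I09'],
    '184A1_hannah_TSA': ['J20'],
}
selected_demo = {
    '184A1_hannah_unperturbed': ['I11'],
    '184A1_hannah_TSA': ['J20', 'I16'],
}

missing_demo = find_missing(selected_demo, available_demo)
print(missing_demo)
```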

# 3.- Process data and save into disk as images

Process data:

In [None]:
msg = 'Starting processing of {} wells...'.format(n_wells)
logging.info(msg)

metadata_df = pd.DataFrame()
channels_df = pd.DataFrame()

for w, data_dir in enumerate(p['data_dirs'], 1):
    msg = 'Processing well {}/{} from dir {}...'.format(w, n_wells, data_dir)
    logging.info(msg)
    print('\n\n'+msg)
    # Load data as an MPPData object
    mpp_temp = MPPData.from_data_dir(data_dir, dir_type=p['dir_type'])
    
    # Validate same channels across wells
    if channels_df.shape[0] == 0:
        channels_df = mpp_temp.channels
    # Compare as lists so that a different number of channels is also detected
    if list(channels_df.name) != list(mpp_temp.channels.name):
        raise Exception('Channels across MPPData instances are not the same!')
    
    # Add cell cycle to metadata (G1, S, G2)
    # Important! If mapobject_id_cell is not in cell_cycle_file =>
    # its corresponding cell is in Mitosis phase!
    if p['add_cell_cycle_to_metadata']:
        print_stdout_and_log('Adding cell cycle to metadata...')
        mpp_temp.add_cell_cycle_to_metadata(os.path.join(DATA_DIR, p['cell_cycle_file']))
    
    # Add well info to metadata
    if p['add_well_info_to_metadata']:
        print_stdout_and_log('Adding well info to metadata...')
        mpp_temp.add_well_info_to_metadata(os.path.join(DATA_DIR, p['well_info_file']))
    
    # Remove unwanted cells
    if p.get('filter_criteria', None) is not None:
        print_stdout_and_log('Removing unwanted cells...')
        mpp_temp.filter_cells(p['filter_criteria'], p['filter_values'])

    # Subtract background values for each channel
    if p['subtract_background']:
        print_stdout_and_log('Subtracting background...')
        mpp_temp.subtract_background(os.path.join(DATA_DIR, p['background_value']))
    
    # Project each single-channel image into a scalar for further analysis
    if p['project_into_scalar']:
        print_stdout_and_log('Projecting data...')
        mpp_temp.add_scalar_projection(p['aggregate_output'])
        
        
    # Convert MPP into image and save to disk
    print_stdout_and_log('Creating well images and saving into disk...')
    mpp_temp.save_img_mask_and_target_into_fs(outdir=output_data_path,
                                              input_channels=p['input_channels'], 
                                              output_channels=p['output_channels'],
                                              projection_method=p['aggregate_output'],
                                              img_size=p['img_size'],
                                              img_saving_mode=p['img_saving_mode'],
                                              img_interpolation_method=p['img_interpolation_method'],
                                              pad=0, 
                                              dtype=p['images_dtype']
                                             )

    # Concatenate well metadata (channels_df was already set during validation)
    if metadata_df.shape[0] == 0:
        metadata_df = mpp_temp.metadata
    else:
        metadata_df = pd.concat((metadata_df, mpp_temp.metadata), axis=0, ignore_index=True)
    
    # Free memory before loading the next well
    del mpp_temp
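
For readers unfamiliar with the conversion step, the sketch below illustrates the general idea of turning pixel-wise data into an image, a mask and per-channel targets. It is not the implementation of `save_img_mask_and_target_into_fs`. It assumes that the data of one cell is available as integer pixel coordinates `x`, `y` and an `(N, C)` array of channel intensities, and that the scalar projection is a per-channel mean. The function `mpp_to_image` and the toy values are hypothetical.

```python
import numpy as np

def mpp_to_image(x, y, values, pad=0):
    """Place N pixel profiles (values: (N, C)) on the smallest grid that contains them.

    Returns the (H, W, C) image and a boolean (H, W) mask of measured pixels.
    """
    x = np.asarray(x) - np.min(x) + pad
    y = np.asarray(y) - np.min(y) + pad
    height = int(y.max()) + 1 + pad
    width = int(x.max()) + 1 + pad
    img = np.zeros((height, width, values.shape[1]), dtype=values.dtype)
    mask = np.zeros((height, width), dtype=bool)
    img[y, x] = values   # one channel vector per measured pixel
    mask[y, x] = True
    return img, mask

# Toy cell: 3 measured pixels, 2 channels
x = np.array([10, 11, 11])
y = np.array([5, 5, 6])
values = np.array([[1., 10.], [2., 20.], [3., 30.]], dtype=np.float32)

img, mask = mpp_to_image(x, y, values)
# Assumed projection: mean intensity per channel over the measured pixels
targets = values.mean(axis=0)
print(img.shape, int(mask.sum()), targets)
```

Pixels inside the bounding box that do not belong to the cell stay at zero in the image and `False` in the mask, which is why the mask is stored alongside the image.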

Take a look at the metadata:

In [None]:
metadata_df

# 4.- Save Metadata and parameters


In [None]:
msg = 'Saving Parameters and Metadata...'
logging.info(msg)

# save params
with open(os.path.join(outdir, 'params.json'), 'w') as file:
    json.dump(p, file, indent=4)

# save metadata
with open(os.path.join(outdir, 'metadata.csv'), 'w') as file:
    metadata_df.to_csv(file, index=False)

# Save used channels
with open(os.path.join(outdir, 'channels.csv'), 'w') as file:
    channels_df.to_csv(file, index=False)

Finally, load one saved file and take a look at its content to check that everything was done correctly:

In [None]:
cell_id = np.random.choice(metadata_df['mapobject_id_cell'].values)
file = os.path.join(output_data_path, str(cell_id)+'.npz')
cell = np.load(file)
cell_img = cell['img']
cell_img = cell_img / np.max(cell_img, axis=(0,1))
cell_mask = cell['mask']
cell_targets = cell['targets']

print('Cell image shape: {}\n'.format(cell_img.shape))
print('Cell mask shape: {}\n'.format(cell_mask.shape))
print('Cell target shape: {}\n'.format(cell_targets.shape))

# Now take a look at its image (channels 10 to 12 are shown as RGB;
# matplotlib ignores cmap, vmin and vmax for RGB data)
plt.figure(figsize=(2 * 10,10))
plt.subplot(1,2,1)
plt.imshow(cell_img[:,:,10:13],
           cmap=plt.cm.PiYG,
           vmin=0, vmax=1,
           aspect='equal'
          )
plt.title('Cell image')
plt.grid(False)
plt.subplot(1,2,2)
plt.imshow(cell_mask,
           cmap=plt.cm.Greys,
           vmin=0, vmax=1,
           aspect='equal'
          )
plt.title('Cell mask')
plt.grid(False)
plt.show()

print('\nCell targets: {}\n'.format(cell_targets))

logging.info('\n\nPREPROCESSING FINISHED!!!!----------------------')
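
A caveat about the check above: dividing by the per-channel maximum yields NaN or infinite values whenever a channel has a maximum of zero, which may happen for empty channels (for instance after background subtraction). The sketch below shows a round trip through an `.npz` file with the same keys (`img`, `mask`, `targets`) and a normalization that leaves such channels untouched. The helper `normalize_per_channel`, the file name and the toy data are assumptions for illustration only.

```python
import os
import tempfile
import numpy as np

def normalize_per_channel(img):
    """Scale each channel of an (H, W, C) image by its maximum, skipping empty channels."""
    max_per_channel = np.max(img, axis=(0, 1))
    safe_max = np.where(max_per_channel > 0, max_per_channel, 1)
    return img / safe_max

# Toy cell with 3 channels; the last channel is completely empty
img = np.zeros((4, 4, 3), dtype=np.float32)
img[1, 1, 0] = 5.0
img[2, 2, 1] = 2.0
mask = img.sum(axis=-1) > 0
targets = img.sum(axis=(0, 1))

# Save and reload with the same keys used in this notebook
path = os.path.join(tempfile.mkdtemp(), '123.npz')
np.savez(path, img=img, mask=mask, targets=targets)
with np.load(path) as cell:
    loaded_img = normalize_per_channel(cell['img'])
    loaded_mask = cell['mask']
    loaded_targets = cell['targets']

print(loaded_img.shape, loaded_mask.shape, loaded_targets)
```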