In [1]:
import cv2
from time import time
import numpy as np
import os
from random import shuffle
from random import sample
from random import choice
from tqdm import tqdm_notebook as tqdm
import tensorflow as tf
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import *
from keras.optimizers import *
from keras.callbacks import TensorBoard
from sklearn.model_selection import train_test_split
%matplotlib inline

Using TensorFlow backend.


# Training the classifiers on curated training data

Enter paths to your training data. In our training, to increase the number of training samples and to achieve class balance when training one vs. all classifiers, we created two separate folders for each class: 1) original files without any augmentation and 2) files augmented so that the number of augmented images in the training class approximately equals the number of un-augmented images in all other classes combined.
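The balancing rule can be made concrete with a little arithmetic. The sketch below uses made-up image counts: the `raw_counts` values and the `augmentation_targets` helper are illustrative assumptions, not figures or code from our training set. It shows roughly how many augmented images each class would need so that it matches the un-augmented images of all other classes combined.

```python
# Hypothetical un-augmented image counts per class (illustrative values only).
raw_counts = {
    'artifacts_A': 400,
    'artifacts_B': 250,
    'lysed': 600,
    'outFocus': 300,
    'bad_dense': 450,
}

def augmentation_targets(counts):
    """For each class, return the number of augmented in-class images needed
    to match the un-augmented images of all other classes combined."""
    total = sum(counts.values())
    return {cl: total - n for cl, n in counts.items()}

targets = augmentation_targets(raw_counts)
for cl, target in targets.items():
    # ratio of augmented images to original images for this class
    factor = target / raw_counts[cl]
    print(f"{cl}: {raw_counts[cl]} originals -> {target} augmented (x{factor:.1f})")
```

Smaller classes need a larger augmentation factor, which is why the augmented folder is built separately for each class.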

## PLEASE NOTE:
This notebook is meant to be adapted to the user's own training data. Training data is not hosted on this GitHub repository, so this notebook will not run without supplied training data. It is a guide to training a model the way we did.

In [9]:
aug_train_data_dir = '/notebooks/storage/master_trainingSet/balanced'
notaug_train_data_dir = '/notebooks/storage/master_trainingSet/'
test_data_dir = '/notebooks/storage/master_trainingSet/vallidation/'

A list of all classes being trained

In [10]:
train_classes = ['artifacts_A','artifacts_B','lysed','outFocus','bad_dense']

A function that returns a vector label for an image. If a training image is in the training class, it returns [1,0], otherwise it returns [0,1]

In [2]:
def label(given_class,train_class):
    if given_class == train_class:
        vec = [1,0]
    else:
        vec = [0,1]
    return vec
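
As a quick illustration of the one vs. all labelling, the sketch below re-declares `label` (same rule as the function above, written compactly) and shows the label every class would receive while the `lysed` classifier is being trained. The `labels_for_lysed` name is only used for this example.

```python
def label(given_class, train_class):
    # same rule as above: [1, 0] for the class being trained, [0, 1] otherwise
    return [1, 0] if given_class == train_class else [0, 1]

train_classes = ['artifacts_A', 'artifacts_B', 'lysed', 'outFocus', 'bad_dense']

# labels every class would receive while training the 'lysed' classifier
labels_for_lysed = {cl: label(cl, 'lysed') for cl in train_classes}
print(labels_for_lysed)
```

Only the class being trained maps to `[1, 0]`; the other four classes all collapse into the `[0, 1]` "rest" label.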

This is a function to load all training data with proper labels before training each one vs. all classifier

In [3]:
def train_data(train_class):
    # Load the augmented images of every class and label each one relative to
    # train_class: [1,0] for in-class images, [0,1] for all others
    train_images = []
    for cl in train_classes:
        file_dir = aug_train_data_dir + '/' + cl + 'aug'
        for i in os.listdir(file_dir):
            path = os.path.join(file_dir,i)
            img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            # cv2.imread returns None for files it cannot read; skip those
            if img is None:
                continue
            img = cv2.resize(img, (64,64))
            train_images.append([np.array(img).reshape(-1,64,64,1), label(cl,train_class)])

    shuffle(train_images)

    return train_images

We used a grid search (implemented with the scikit-learn package) to optimize hyperparameters for each one vs. all classifier.

In [11]:
training_params = {'artifacts_A': ['categorical_crossentropy', 0.3, 'Adam'],
 'artifacts_B': ['categorical_crossentropy', 0.1, 'Adam'],
 'lysed': ['logcosh', 0.3, 'Adam'],
 'outFocus': ['categorical_crossentropy', 0.3, 'Adam'],
 'bad_dense': ['categorical_crossentropy', 0.25, 'Adam']}

When training on GPUs with batches, if the training data does not divide evenly into batches (i.e. one batch has fewer images than the rest), you will see a sudden spike in training loss. While the effects of this are negated after a few epochs of training, you can achieve cleaner training by ensuring that each batch has the same number of images.

In [5]:
def find_batch_size(tr_set_size):
    # make list of factors
    factors = []
    for i in range(1, tr_set_size + 1):
        if tr_set_size % i ==0:
            factors.append(i)
    possible_batches = list(filter(lambda a: a > 20, factors))
    return possible_batches
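
To see how this function is meant to be used, the sketch below pairs it with the trimming loop that appears later in `train_model`: samples are dropped one at a time until the training set size has a factor larger than 20 and smaller than 100. The `choose_batch_size` helper and its guard for very small sets are assumptions added for illustration; the notebook itself performs the loop inline.

```python
def find_batch_size(tr_set_size):
    # all factors of the training set size that are larger than 20
    return [i for i in range(1, tr_set_size + 1)
            if tr_set_size % i == 0 and i > 20]

def choose_batch_size(n_samples):
    """Hypothetical helper: trim samples until a factor in (20, 100) exists.
    Returns (batch_size, number_of_samples_kept)."""
    if n_samples <= 20:
        raise ValueError("need more than 20 samples to find a batch size")
    while True:
        smallest = find_batch_size(n_samples)[0]
        if smallest < 100:
            return smallest, n_samples
        n_samples -= 1  # drop one sample and try again

print(choose_batch_size(1000))  # 1000 = 25 * 40
print(choose_batch_size(1009))  # 1009 is prime, so one sample is dropped
```

With 1000 samples the smallest usable factor is 25, so nothing is trimmed. With 1009 samples one image is dropped and the batch size becomes 21 (1008 = 21 * 48).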

Building the model in Keras. Here we use three consecutive convolutional + maxpooling layers, followed by a densely connected layer and a final 2 neuron layer (2 neurons for binary classification). If you wish to use transfer learning, taking our models as a starting point and training them further to specialize on your desired cell line or phenotype, you will need to adjust the code below. This article is a useful introduction to transfer learning using Keras: https://towardsdatascience.com/keras-transfer-learning-for-beginners-6c9b8b7143e
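
The layer sizes reported by `model.summary()` can be checked by hand. The sketch below is plain arithmetic rather than Keras code: it assumes the settings used in `train_model` (5x5 kernels, `padding='same'`, pooling with `pool_size=5` and the default stride equal to the pool size). The helper names `conv_params` and `same_pool_output` are made up for this example.

```python
import math

def conv_params(kernel_size, in_channels, filters):
    # weights (k * k * in_channels per filter) plus one bias per filter
    return kernel_size * kernel_size * in_channels * filters + filters

def same_pool_output(size, pool_size):
    # with padding='same' and the default stride (equal to pool_size),
    # the output length is ceil(input / stride)
    return math.ceil(size / pool_size)

size, channels = 64, 1
layer_report = []
for filters in (32, 50, 80):
    params = conv_params(5, channels, filters)  # a 'same' conv keeps the size
    size = same_pool_output(size, 5)
    channels = filters
    layer_report.append((filters, params, size))
    print(f"Conv2D({filters}): {params} params, pooled to {size}x{size}x{filters}")
```

The spatial size shrinks from 64 to 13 to 3 to 1, so the `Flatten` layer receives only 80 values per image.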

In [8]:
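def train_model(x_train, y_train, x_val, y_val, params, train_class):
    # NOTE: `params` is accepted but not used below; the loss, dropout rate,
    # and optimizer are hard-coded in this function.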
    
    #########
    #
    # ENTER THE PATH WHERE YOU WANT TO SAVE YOUR MODELS HERE:
    #########
    
    model_save_dir = '/notebooks/storage/hierarchical_files/models/'
    
    # find batch size: drop trailing samples until the training set size has a
    # factor larger than 20 and smaller than 100, so every batch is the same size
    while True:
        batch_size = find_batch_size(len(x_train))[0]
        if batch_size < 100:
            break
        x_train = x_train[:-1]
        y_train = y_train[:-1]
    
    model = Sequential()

    model.add(InputLayer(input_shape=[64,64,1]))
    model.add(Conv2D(filters=32,kernel_size=5,strides=1,padding='same',activation='relu'))
    model.add(MaxPool2D(pool_size=5,padding='same'))

    model.add(Conv2D(filters=50,kernel_size=5,strides=1,padding='same',activation='relu'))
    model.add(MaxPool2D(pool_size=5,padding='same'))

    model.add(Conv2D(filters=80,kernel_size=5,strides=1,padding='same',activation='relu'))
    model.add(MaxPool2D(pool_size=5,padding='same'))

    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(512,activation='relu'))
    model.add(Dropout(rate=0.5))
    # Changing to 2 neurons to reflect 2 classes
    model.add(Dense(2,activation='softmax'))
    optimizer = Adam(lr=0.000001)

    model.compile(optimizer=optimizer,loss='categorical_crossentropy',metrics=['accuracy'])
    
    #####
    #
    # IF USING TENSORBOARD, CHANGE DIRECTORY AND NAME AS YOU'D PREFER:
    ####
    
    tensorboard = TensorBoard(log_dir='/notebooks/storage/tensorboard_logs/'+train_class+"attemptX{}".format(time()),histogram_freq=0,write_graph=True, write_images=True)
    model.fit(x=x_train,y=y_train,
              epochs=10000,
              batch_size=batch_size,
              validation_data=(x_val,y_val),
              verbose=0,
              callbacks=[tensorboard])
    
    model.summary()
    
    # serialize model to JSON
    model_json = model.to_json()
    with open(model_save_dir + train_class+"_model.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    model.save_weights(model_save_dir + train_class+"_weights.h5")
    print("Saved model to disk")
    
    
    return model
   

Training each one vs. all classifier. We trained our models on a P6000 GPU instance on Paperspace overnight (each model takes ~1-1.5 hours). If you are using

In [15]:
# train one v all classifiers
model_dict = {}
for cl in tqdm(train_classes):
    training_images = train_data(cl)
    X = np.array([i[0] for i in training_images]).reshape(-1,64,64,1)
    Y = np.array([i[1] for i in training_images])
    # create train/validation split (12.5% of train images are validation)
    x_train, x_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.125, shuffle= True)
    params = training_params[cl]
    model_ = train_model(x_train,y_train,x_valid,y_valid,params,cl)
    model_dict[cl] = model_
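
The array handling in this loop can be tried without any image data. The sketch below builds 16 random stand-in samples in the same `[image, label]` layout that `train_data` returns and then holds out 12.5% for validation. It uses a plain NumPy permutation in place of `train_test_split` so that it runs without scikit-learn; that substitution, the sample count, and the random data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 16 fake samples in the same [image, label] layout that train_data returns
training_images = [
    [rng.random((1, 64, 64, 1)), [1, 0] if i % 2 == 0 else [0, 1]]
    for i in range(16)
]

X = np.array([s[0] for s in training_images]).reshape(-1, 64, 64, 1)
Y = np.array([s[1] for s in training_images])

# hold out 12.5% of the samples for validation (NumPy stand-in for train_test_split)
n_valid = int(np.ceil(0.125 * len(X)))
order = rng.permutation(len(X))
valid_idx, train_idx = order[:n_valid], order[n_valid:]
x_train, y_train = X[train_idx], Y[train_idx]
x_valid, y_valid = X[valid_idx], Y[valid_idx]

print(X.shape, Y.shape)
print(x_train.shape, x_valid.shape)
```

The reshape removes the extra leading axis that each stored image carries, giving one `(64, 64, 1)` array per sample.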


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 64, 64, 1)         0         
_________________________________________________________________
conv2d_19 (Conv2D)           (None, 64, 64, 32)        832       
_________________________________________________________________
max_pooling2d_19 (MaxPooling (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_20 (Conv2D)           (None, 13, 13, 50)        40050     
_________________________________________________________________
max_pooling2d_20 (MaxPooling (None, 3, 3, 50)          0         
_________________________________________________________________
conv2d_21 (Conv2D)           (None, 3, 3, 80)          100080    
_________________________________________________________________
max_pooling2d_21 (MaxPooling (None, 1, 1, 80)          0         
__________

Saved model to disk
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        (None, 64, 64, 1)         0         
_________________________________________________________________
conv2d_34 (Conv2D)           (None, 64, 64, 32)        832       
_________________________________________________________________
max_pooling2d_34 (MaxPooling (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_35 (Conv2D)           (None, 13, 13, 50)        40050     
_________________________________________________________________
max_pooling2d_35 (MaxPooling (None, 3, 3, 50)          0         
_________________________________________________________________
conv2d_36 (Conv2D)           (None, 3, 3, 80)          100080    
_________________________________________________________________
max_pooling2d_36 (MaxPooling (None, 1, 1, 80)          0