The CNN gives good results, but what if we want to combine the CNN trained on the images with other information, such as cell fluctuations or patient history? We can stack another model on top of the CNN predictions! See the Jan 2020 report for details.

In [1]:
#load the usual suspects
import sys
sys.path.insert(0, "/home/ubuntu/data/code/Modules/")
import numpy as np
import pickle
import DataGenerator
import random
import CNN_Module as cnn_module
import models

Using TensorFlow backend.





Prep cells and images as before

In [2]:
#prep cells for feeding into CNN
control_cells = np.array(cnn_module.findallcells_indir('/home/ubuntu/data/resistant/'))
sus_cells     = np.array(cnn_module.findallcells_indir('/home/ubuntu/data/susceptible/'))

#label cells, and split cells into test and train, making sure to have equal proportion of sus and ctrl in both
control_cells_label = cnn_module.create_label_dict(control_cells,0)
sus_cells_label  = cnn_module.create_label_dict(sus_cells,1)

train_ctrl, test_ctrl = cnn_module.split_train_test(control_cells_label,0.9)
train_sus, test_sus = cnn_module.split_train_test(sus_cells_label,0.9)

train_labels = train_ctrl+train_sus
test_labels = test_ctrl+test_sus

random.shuffle(train_labels)
random.shuffle(test_labels)

video_path = '/cropped_video80'
sample_gap = 10
im_paths_train, im_labels_train, im_paths_test, im_labels_test = cnn_module.get_labels_images(train_labels,test_labels,video_path,sample_gap)
random.shuffle(im_paths_train)
random.shuffle(im_paths_test)

In [3]:
#get our model params
model = models.get_luke_model(80)
augment_train, augment_valid = cnn_module.get_augmentations_train_test()
params_train, params_test = cnn_module.get_params_train_test(80,80,augment_train,augment_valid)













Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [4]:
training_generator = DataGenerator.DataGenerator(im_paths_train, im_labels_train, **params_train)
prediction_generator = DataGenerator.DataGenerator(im_paths_test, im_labels_test, **params_test)

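We can now call cnn_module.k_fold_train to train on 80% of the training data and predict on the remaining 20%. Repeating this with a different 20% holdout each time gives predictions for the entire training set. We also train a model on the entire training set, which we use to predict on the test set. In the cell below the k-fold call is commented out, and cnn_module.fit_model_from_labels is used to get the training-set predictions instead.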
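The holdout scheme described above can be sketched on toy data. This is only an illustration of out-of-fold prediction, assuming scikit-learn's KFold and a LogisticRegression stand-in for the CNN; it is not the cnn_module.k_fold_train implementation, and the function name and toy arrays are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def out_of_fold_predictions(X, y, n_splits=5, seed=0):
    """Return one held-out prediction per training row."""
    oof = np.zeros(len(X))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, holdout_idx in folds.split(X):
        # Fit on 80% of the rows, predict on the 20% the model never saw.
        clf = LogisticRegression()
        clf.fit(X[fit_idx], y[fit_idx])
        oof[holdout_idx] = clf.predict_proba(X[holdout_idx])[:, 1]
    return oof


# Toy data standing in for the real training set (assumption).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

oof_scores = out_of_fold_predictions(X_toy, y_toy)
print(oof_scores.shape)
```

Because every row is predicted by a model that did not train on it, the second-stage model sees scores that look like test-time scores rather than overfitted ones.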

In [5]:
#select either normal or k-fold training
sample_gap=10
epochs=2
stepsperepoch=100
video_path = '/cropped_video80'
validate_steps = len(im_paths_test)//32

#get kfold predictions from training set 
#train_preds = cnn_module.k_fold_train(train_labels,video_path,model,sample_gap,epochs,stepsperepoch,params_train,params_test)

#don't use k-fold
train_preds = cnn_module.fit_model_from_labels(train_labels,video_path,model,sample_gap,epochs,stepsperepoch,params_train,params_test)

#get a fully trained model for the test predictions
fully_train_model = cnn_module.fit_model(model,training_generator,prediction_generator,epochs,stepsperepoch,validate_steps,params_train,params_test)
#make the test predictions.
test_predictions = cnn_module.predict(test_labels,video_path,fully_train_model,sample_gap,params_test)




Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2


Now we want to get our fluctuations and lengths. We can do this using the Get_Simple_Fluctuations_Average_Intensity module.

In [6]:
import Get_Simple_Fluctuations_Average_Intensity as simp

In [7]:
control_flucs_csv =  simp.flucs_from_csv('/home/ubuntu/data/resistant/')
print(control_flucs_csv)
sus_flucs_csv = simp.flucs_from_csv('/home/ubuntu/data/susceptible/')
print(sus_flucs_csv)
#combine dictionaries
all_flucs = {**control_flucs_csv, **sus_flucs_csv} 

control_lengths = simp.getlengths('/home/ubuntu/data/resistant/')
sus_lengths = simp.getlengths('/home/ubuntu/data/susceptible/')
all_lengths = control_lengths+sus_lengths

{'/home/ubuntu/data/resistant/sample52/cell052': 0.4105601323258871, '/home/ubuntu/data/resistant/sample52/cell053': 0.2492287167989228, '/home/ubuntu/data/resistant/sample52/cell054': 0.1912419898435697, '/home/ubuntu/data/resistant/sample52/cell055': 0.25145341079894984, '/home/ubuntu/data/resistant/sample52/cell056': 0.37586532491543895, '/home/ubuntu/data/resistant/sample52/cell057': 0.4540137046534464, '/home/ubuntu/data/resistant/sample52/cell058': 0.3237048144351004, '/home/ubuntu/data/resistant/sample52/cell059': 0.3445670103204346, '/home/ubuntu/data/resistant/sample52/cell060': 0.3714476252166599, '/home/ubuntu/data/resistant/sample52/cell061': 0.2445139433575501, '/home/ubuntu/data/resistant/sample52/cell062': 0.3911178476272494, '/home/ubuntu/data/resistant/sample52/cell063': 0.2915990885935849, '/home/ubuntu/data/resistant/sample52/cell064': 0.4364501098072686, '/home/ubuntu/data/resistant/sample52/cell065': 0.4474303337743871, '/home/ubuntu/data/resistant/sample52/cell066

We also want the labels

In [8]:
targets = simp.gettargets('/home/ubuntu/data/resistant/','/home/ubuntu/data/susceptible/')

We can compile these together. We iterate through the all_lengths list, taking the path and length from each element. simp.getlengths also returns the centre position, which we ignore by assigning it to _.

We then make a new dictionary, where each key is the path of the cell, and the value is (length, fluctuation, label)

In [9]:
result_dict = {}
for path, length, _ in all_lengths:
    result_dict[path]=(length,all_flucs[path],targets[path])


Now we have 3 results dictionaries. These are the train CNN predictions, the test CNN predictions, and the fluctuations and lengths.

The train/test CNN predictions are per image, but the flucs/lengths are per cell, so we need to do some work to get cell-level predictions from the image-level predictions.

The code for gluing the CNN predictions to the flucs/lengths lives in the glue_code module.
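As a rough illustration of the image-to-cell step (not the glue_code implementation), the sketch below groups image scores by cell and computes the proportion of images scored as susceptible and the mean score. The path layout, the 0.5 threshold, the function name and the toy scores are all assumptions made for the example.

```python
import os
from collections import defaultdict


def cell_level_predictions(image_scores, threshold=0.5):
    """Map each cell to (proportion of images above threshold, mean score)."""
    by_cell = defaultdict(list)
    for image_path, score in image_scores.items():
        # Assumed layout: <cell dir>/<video dir>/<image file>
        cell = os.path.dirname(os.path.dirname(image_path))
        by_cell[cell].append(score)
    return {
        cell: (sum(s > threshold for s in scores) / len(scores),
               sum(scores) / len(scores))
        for cell, scores in by_cell.items()
    }


# Hypothetical image-level scores (probability of "susceptible").
toy_scores = {
    "/data/susceptible/sample01/cell001/cropped_video80/0.png": 0.9,
    "/data/susceptible/sample01/cell001/cropped_video80/10.png": 0.7,
    "/data/resistant/sample02/cell002/cropped_video80/0.png": 0.2,
    "/data/resistant/sample02/cell002/cropped_video80/10.png": 0.6,
}
cell_preds = cell_level_predictions(toy_scores)
for cell, (proportion, mean_score) in cell_preds.items():
    print(cell, proportion, round(mean_score, 3))
```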

In [10]:
import glue_code

In [11]:
#First find the cells in the train and test partitions
test_cells = sorted(glue_code.partition_to_keys(list(test_predictions.keys())))
train_cells = sorted(glue_code.partition_to_keys(list(train_preds.keys())))

In [12]:
#go from list of cells, to list of lists of images from each cell
cell_ims_test = glue_code.split_preds_into_cells(test_predictions,test_cells)
cell_ims_train = glue_code.split_preds_into_cells(train_preds,train_cells)

#get the av score for susceptible, and proportion of images classified as susceptible
cell_predictions_test = glue_code.get_cell_predictions(cell_ims_test,test_predictions)
cell_predictions_train = glue_code.get_cell_predictions(cell_ims_train,train_preds)

In [13]:
#glue together CNN predictions and fluctuation/av/lengths/labels

test_final = glue_code.glue_flucs_preds(cell_predictions_test,result_dict)
train_final = glue_code.glue_flucs_preds(cell_predictions_train,result_dict)

We now have 2 beautiful lists, containing test and train data. Each list element is (path of cell, proportion of cell images identified as susceptible, average CNN susceptible image score, cell length, cell fluctuation, correct label)

For training a further algorithm on these, we drop the pesky paths, as they are strings, convert to arrays and shuffle.
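One possible variation, shown here on made-up rows, is to drop the path before building the array so that NumPy never stores the numbers as strings. The toy values and the seeded generator are assumptions for illustration only.

```python
import numpy as np

# Toy rows in the order described above:
# (path, proportion susceptible, mean score, length, fluctuation, label)
toy_final = [
    ("/data/resistant/sample01/cell001", 0.1, 0.2, 35.0, 0.41, 0),
    ("/data/susceptible/sample02/cell002", 0.9, 0.8, 28.0, 0.25, 1),
    ("/data/susceptible/sample02/cell003", 0.7, 0.6, 30.0, 0.33, 1),
]

# Drop the path (first element) while building the array.
toy_arr = np.array([row[1:] for row in toy_final], dtype=float)

# A seeded generator makes the row shuffle reproducible.
rng = np.random.default_rng(0)
rng.shuffle(toy_arr)  # shuffles rows in place; each row stays intact

# Last column is the label, the rest are predictors.
X_toy, y_toy = toy_arr[:, :-1], toy_arr[:, -1]
print(X_toy.shape, y_toy.shape)
```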

In [14]:
test_arr = np.array(test_final)[:, 1:].astype(float)
train_arr = np.array(train_final)[:, 1:].astype(float)

np.random.shuffle(test_arr)
np.random.shuffle(train_arr)

Now we split the labels we are trying to predict off from the predictors.

In [15]:
Xtrain = train_arr[:,:-1]
Ytrain = train_arr[:,-1]

Xtest = test_arr[:,:-1]
Ytest = test_arr[:,-1]

Now we can train our new algorithm! Here we use a random forest (a single depth-1 tree), but try uncommenting AdaBoost and gradient boosting too, and play around with their parameters.

In [16]:
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
#clf = AdaBoostClassifier(n_estimators=3, random_state=0)
clf = RandomForestClassifier(max_depth=1, n_estimators=1, random_state=0)
#clf = GradientBoostingClassifier(n_estimators=10, max_depth=1)
clf.fit(Xtrain, Ytrain)
print(clf.score(Xtest, Ytest))

0.5735294117647058
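Rather than uncommenting one classifier at a time, the candidates can be compared side by side. The sketch below does this with cross-validation on synthetic data from make_classification, which stands in for Xtrain and Ytrain; the scores it prints say nothing about the real cell data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 4 predictors and binary label (assumption).
X_toy, y_toy = make_classification(n_samples=200, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)

# Same settings as the classifiers in the cell above.
candidates = {
    "adaboost": AdaBoostClassifier(n_estimators=3, random_state=0),
    "random_forest": RandomForestClassifier(max_depth=1, n_estimators=1,
                                            random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=10,
                                                    max_depth=1),
}

# Mean 5-fold accuracy for each candidate.
cv_scores = {name: cross_val_score(clf, X_toy, y_toy, cv=5).mean()
             for name, clf in candidates.items()}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validating on the training set keeps the test set untouched until a single classifier has been chosen.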
