###  <span style="color:red">**This Notebook can be run from Google Colab:**</span>

https://colab.research.google.com

# **<span style="color:red">Background and motivation:</span>**

#### First, we trained a model on positive patches only (patches that actually contain regions of growing bacterial colonies in the petri dish) to differentiate among the 8 bacterial species in our dataset. The confusion matrix of that model showed that it has difficulty differentiating between classes 'C1' and 'C2-3' and between classes 'C4-7' and 'C5'.

#### Next, we trained a model specifically to differentiate 'C1' vs 'C2-3' vs 'all_other' classes. This model has only 3 classes.

#### Similarly, we also trained a model to specifically learn to differentiate 'C4-7' vs 'C5' vs 'all_other' classes.

#### The last model we trained was a model to specifically differentiate between positive bacterial colony patches (of any class) and negative patches (either petri-dish background, petri-dish border or white image background). For this, we just combined all positive patches (regardless of the bacterial species) in a single 'positive' class and all negative patches in a single 'negative' class.  

#### We now have 4 models: the first produces 8 predicted probabilities (one for each bacterial species), the second produces 3 predicted probabilities ('C1', 'C2-3', 'all_other'), the third also produces 3 predicted probabilities ('C4-7', 'C5', 'all_other') and the fourth produces 2 predicted probabilities (positive_patch, negative_patch), for a total of 16 predicted probabilities.

#### As a final step, before testing, we want to **combine** our models. To do this, we will use the 16 predicted probabilities as features and train an SVM or another simple classifier to predict either 'negative' or the correct bacterial species from the probabilities produced by the 4 models above.
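The combination step described above can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: `fake_softmax_scores` is a hypothetical stand-in for the four trained models' `predict` outputs, and the only detail taken from the text is the 8 + 3 + 3 + 2 = 16 column layout.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_softmax_scores(n_samples, n_classes):
    """Hypothetical stand-in for a model's predicted probabilities:
    random rows that are non-negative and sum to 1."""
    logits = rng.normal(size=(n_samples, n_classes))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

n_patches = 5
# Output widths of the four models: 8 species, C1/C2-3/other,
# C4-7/C5/other, and positive/negative.
block_sizes = [8, 3, 3, 2]

blocks = [fake_softmax_scores(n_patches, k) for k in block_sizes]

# One row per patch, one column per predicted probability.
stacked_features = np.hstack(blocks)
print(stacked_features.shape)
```

Each row of `stacked_features` then serves as the feature vector for the second-stage classifier.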

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
import zipfile
import shutil
import json
import pickle
from google.colab import files

import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.models import load_model

from sklearn.metrics import accuracy_score, confusion_matrix, \
                            classification_report, balanced_accuracy_score

# Import PyDrive and associated libraries (to connect with GoogleDrive):
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Mount my GoogleDrive:
from google.colab import drive
drive.mount('/content/gdrive')

# disable warnings
import warnings
warnings.simplefilter("ignore")
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

Using TensorFlow backend.


Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


### **Check if we are using GPU:**

In [2]:
from keras import backend as K
if K.backend() == "tensorflow":
    import tensorflow as tf
    device_name = tf.test.gpu_device_name()
    if device_name == '':
        device_name = "None"
    print('Using TensorFlow version:', tf.__version__, ', GPU:', device_name)

Using TensorFlow version: 1.15.0 , GPU: /device:GPU:0


### **Download Validation ('Control') patches from GoogleDrive:**

### This dataset contains 9 classes in total (8 bacterial species + negative patches).

#### *Validation patches were augmented with Patch_Generator, using 'stride=22' and rotations every 20 degrees through a full turn. Patches were then balanced by downsampling the majority classes, so that model accuracy can be compared across classes.*

###  **NOTE: Validation patches were generated from original, non-preprocessed images. This helps ensure that our model performs well at testing time, when pre-processing may not be feasible. For example, creating masks or image annotations may not be feasible on testing data.**
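Balancing by downsampling, as mentioned above, can be sketched as follows. How Patch_Generator actually balances the patches is not shown in this notebook, so the function `downsample_to_smallest_class` and the toy labels are assumptions made only for illustration.

```python
import numpy as np

def downsample_to_smallest_class(labels, seed=0):
    """Return sorted indices that keep the same number of samples per class
    (the size of the smallest class), chosen at random without replacement."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    kept = [rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
            for c in classes]
    return np.sort(np.concatenate(kept))

# Toy labels: class 0 has 10 samples, class 1 has 4, class 2 has 7.
toy_labels = np.array([0] * 10 + [1] * 4 + [2] * 7)
balanced_idx = downsample_to_smallest_class(toy_labels)
print(np.bincount(toy_labels[balanced_idx]))  # [4 4 4]
```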



In [7]:
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

#"Control_for_final_training_9_classes_128_vs22_minval_1024_rotate_every_20_full_balance.zip":
file_id = '1bebJEgFoWq04jX-6ZfqZPxAoxH5j4G9e' # Control only, rotate every 20, full balance
#file_id = '1ikOi87uvs6wCovfoY2Wyucj7kfod85Ty' # Added Serial positive+border

downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile(downloaded['title'])
print('Downloaded content: "{}"'.format(downloaded['title']))
print('Root dir content: {}'.format(os.listdir()))
patches_zip = downloaded['title']

Downloaded content: "Control_for_final_training_9_classes_128_vs22_minval_1024_rotate_every_20_full_balance.zip"
Root dir content: ['.config', 'Patches', 'gdrive', 'adc.json', 'Control_for_final_training_9_classes_128_vs22_minval_1024_rotate_every_20_full_balance.zip', 'sample_data']


### **Unzip the Validation ('Control') patches:**

In [8]:
# Remove 'Patches' dir if it already exists
if 'Patches' in os.listdir():
    shutil.rmtree('./Patches')
with zipfile.ZipFile(patches_zip, "r") as zip_file:  # avoid shadowing built-in zip()
    zip_file.extractall()
os.remove(downloaded['title'])
print('Root dir content: {}'.format(os.listdir()))

Root dir content: ['.config', 'Patches', 'gdrive', 'adc.json', 'sample_data']


### **Let's count patches by type and class:**

In [9]:
classes = ['C1','C2-3','C4-7','C5','C6','C8','C9','C10','neg']
class_weights = {} # empty dictionary to store class weights

grand_total = 0
for type_ in ['Serial', 'Control', 'Streak']:
    print("\nTotal '{}' Patches per location:".format(type_))
    n_type = 0
    class_weights[type_] = {} # nested empty dictionary to store class weights
    for cls in classes:
        if cls != 'neg':
            pos_folder = './Patches/{}/{}_pos'.format(type_,cls)
        else:
            pos_folder = './Patches/{}/{}'.format(type_,cls)
        n_pos = len(os.listdir(pos_folder))
        n_type += n_pos
        #print(pos_folder, n_pos)
        print('total_{}: {}'.format(cls,n_pos))
        class_weights[type_]['{}'.format(cls)] = 1/n_pos if n_pos else 0
    print('Total {}: {}'.format(type_,n_type))
    for loc in class_weights[type_].keys():
        class_weights[type_][loc] *= n_type
    grand_total += n_type
print('\nGRAND TOTAL: {}'.format(grand_total))


Total 'Serial' Patches per location:
total_C1: 0
total_C2-3: 0
total_C4-7: 0
total_C5: 0
total_C6: 0
total_C8: 0
total_C9: 0
total_C10: 0
total_neg: 0
Total Serial: 0

Total 'Control' Patches per location:
total_C1: 6610
total_C2-3: 6610
total_C4-7: 6610
total_C5: 6610
total_C6: 6610
total_C8: 6610
total_C9: 6610
total_C10: 6610
total_neg: 6610
Total Control: 59490

Total 'Streak' Patches per location:
total_C1: 0
total_C2-3: 0
total_C4-7: 0
total_C5: 0
total_C6: 0
total_C8: 0
total_C9: 0
total_C10: 0
total_neg: 0
Total Streak: 0

GRAND TOTAL: 59490


#### **Let's build the validation generator, using keras.preprocessing.image.ImageDataGenerator, rescaling image pixel values from [0,  255] to [0, 1]:**

In [10]:
img_size = (128,128,3)
val_batch_size = 64

val_datagen = ImageDataGenerator(rescale=1./255)

val_generator = val_datagen.flow_from_directory(
        './Patches/Control',
        target_size=(img_size[0],img_size[1]),
        batch_size=val_batch_size,
        class_mode='categorical',
        shuffle=False)

Found 59490 images belonging to 9 classes.


#### **Let's check which index the data generator assigns to each class:**

In [11]:
print('validation_generator.class_indices:', str(json.dumps(val_generator.class_indices, indent=2, default=str)))

validation_generator.class_indices: {
  "C10_pos": 0,
  "C1_pos": 1,
  "C2-3_pos": 2,
  "C4-7_pos": 3,
  "C5_pos": 4,
  "C6_pos": 5,
  "C8_pos": 6,
  "C9_pos": 7,
  "neg": 8
}
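The ordering above (with 'C10_pos' at index 0, before 'C1_pos') follows from the alphanumeric ordering that Keras documents for classes inferred from subdirectory names. The sketch below reproduces the mapping with a plain `sorted` call; it mirrors the documented behaviour and is not the Keras implementation. The folder names are those built in the patch-counting cell.

```python
# Sub-directory names under ./Patches/Control, as built in the counting cell.
folder_names = ['C1_pos', 'C2-3_pos', 'C4-7_pos', 'C5_pos', 'C6_pos',
                'C8_pos', 'C9_pos', 'C10_pos', 'neg']

# Plain string sorting: 'C10_pos' sorts before 'C1_pos' because the
# character '0' (code 48) comes before '_' (code 95).
class_indices = {name: i for i, name in enumerate(sorted(folder_names))}
print(class_indices)
```

Because of this ordering, index 1 (not 0) corresponds to 'C1_pos' in every evaluation cell that follows.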


### **Let's download the 4 final models from GoogleDrive:**

In [12]:
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

eight_classes_id = '1w0u_EKaSG8zkMRtYkNjFd3IOnR3IpQsJ' # model_8_classes_0.8465
C1_C2_3_id = '18De1DbqyxD1JlNpue6LIUZXd73VgAN-f' # model C1 vs C2-3 vs all_other
C4_7_C5_id = '1-4e6W-yR13q3ckpgo8O9QVMTncWITHwg' # model_C4-7_vs_C5_vs_all-other
pos_vs_neg_id = '1-BxPnguFXE7PHmzKadW0AnwWO9VqTywR' # model pos-neg

files_ids_dict = {'model_eight_classes': eight_classes_id,
                  'model_C1_C2_3': C1_C2_3_id,
                  'model_C4_7_C5': C4_7_C5_id,
                  'model_pos_vs_neg': pos_vs_neg_id}

models_names_dict = {}
for model_name, file_id in files_ids_dict.items():
    downloaded = drive.CreateFile({'id': file_id})
    downloaded.GetContentFile(downloaded['title'])
    print('Downloaded content: "{}"'.format(downloaded['title']))
    models_names_dict[model_name] = downloaded['title']

print('\nmodels_names_dict:', str(json.dumps(models_names_dict, indent=2, default=str)))
print('\nRoot dir content: {}\n'.format(os.listdir()))

Downloaded content: "model_8_classes_08465.h5"
Downloaded content: "model_C1_C2-3_08983.h5"
Downloaded content: "model_C4-7_C5_083.h5"
Downloaded content: "model_pos_neg_09973.h5"

models_names_dict: {
  "model_eight_classes": "model_8_classes_08465.h5",
  "model_C1_C2_3": "model_C1_C2-3_08983.h5",
  "model_C4_7_C5": "model_C4-7_C5_083.h5",
  "model_pos_vs_neg": "model_pos_neg_09973.h5"
}

Root dir content: ['.config', 'Patches', 'gdrive', 'model_C1_C2-3_08983.h5', 'adc.json', 'model_C4-7_C5_083.h5', 'model_pos_neg_09973.h5', 'model_8_classes_08465.h5', 'sample_data']



#### **Let's load all 4 models from downloaded files:**

In [0]:
eight_classes_model = load_model(models_names_dict['model_eight_classes'])
#eight_classes_model.summary() # summarize model.

In [0]:
C1_C2_3_model = load_model(models_names_dict['model_C1_C2_3'])
#C1_C2_3_model.summary() # summarize model.

In [0]:
C4_7_C5_model = load_model(models_names_dict['model_C4_7_C5'])
#C4_7_C5_model.summary() # summarize model.

In [0]:
pos_vs_neg_model = load_model(models_names_dict['model_pos_vs_neg'])
#pos_vs_neg_model.summary() # summarize model.

## **Let's *'evaluate'* the models, just to make sure they loaded correctly:**

#### **Let's evaluate '8 classes' model on the validation set:**

In [18]:
y_true = val_generator.classes
eight_classes_scores = eight_classes_model.predict_generator(val_generator)
y_pred = np.argmax(eight_classes_scores, axis=1)

# remove 'neg' class (index = 8):
y_true, y_pred = y_true[(y_true != 8)], y_pred[(y_true != 8)]

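# all 8 remaining classes have equal support (6610 patches each), so plain
# accuracy is identical to balanced accuracy here:
val_acc = accuracy_score(y_true, y_pred)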
cm = confusion_matrix(y_true, y_pred)
class_names = [k for k in val_generator.class_indices if k != 'neg']
c_report = classification_report(y_true, y_pred, target_names=class_names)

print('\nbalanced val_acc:\n', val_acc)
print('\nConfusion Matrix:\n', cm)
print('\nClassification Report:\n', c_report)


balanced val_acc:
 0.8236951588502269

Confusion Matrix:
 [[6545   26   34    0    0    5    0    0]
 [  30 5824  657    0    0    1   98    0]
 [   0 2043 4269    0    0  107  187    4]
 [   0    0   48 5458 1048    2   54    0]
 [   0   14    0 3891 2696    0    9    0]
 [  54    1   61   47    5 6359   82    1]
 [   0  190    9  581   34    0 5796    0]
 [   0    0    0    0    0    0    0 6610]]

Classification Report:
               precision    recall  f1-score   support

     C10_pos       0.99      0.99      0.99      6610
      C1_pos       0.72      0.88      0.79      6610
    C2-3_pos       0.84      0.65      0.73      6610
    C4-7_pos       0.55      0.83      0.66      6610
      C5_pos       0.71      0.41      0.52      6610
      C6_pos       0.98      0.96      0.97      6610
      C8_pos       0.93      0.88      0.90      6610
      C9_pos       1.00      1.00      1.00      6610

    accuracy                           0.82     52880
   macro avg       0.84      



**Let's remember the val_generator class indices (we will need them next):**

In [19]:
print('validation_generator.class_indices:', str(json.dumps(val_generator.class_indices, indent=2, default=str)))

validation_generator.class_indices: {
  "C10_pos": 0,
  "C1_pos": 1,
  "C2-3_pos": 2,
  "C4-7_pos": 3,
  "C5_pos": 4,
  "C6_pos": 5,
  "C8_pos": 6,
  "C9_pos": 7,
  "neg": 8
}


#### **Let's evaluate 'C1_C2_3' model on the validation set:**

In [20]:
y_true = val_generator.classes
print('\nunique values in y_true', np.unique(y_true))

C1_C2_3_scores = C1_C2_3_model.predict_generator(val_generator)
y_pred = np.argmax(C1_C2_3_scores, axis=1)

# remove 'neg' class (index = 8):
y_true, y_pred = y_true[(y_true != 8)], y_pred[(y_true != 8)]

# class indices for C1_C2_3_model are as follows:
C1_C2_3_model_class_dict = {0:"C1_pos", 1:"C2-3_pos", 2:"all_other_pos"}

# let's map y_pred class indices from 0,1,2 to 1,2,3 (same as val_generator):
y_pred +=1

# in y_true, convert classes != 'C1_pos'(1),'C2-3_pos'(2) into 'all_other'(3):
y_true[(y_true!=1) & (y_true!=2)] = 3

val_acc = balanced_accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
class_names = [v for v in C1_C2_3_model_class_dict.values()]
c_report = classification_report(y_true, y_pred, target_names=class_names)

print('\nbalanced val_acc:\n', val_acc)
print('\nConfusion Matrix:\n', cm)
print('\nClassification Report:\n', c_report)


unique values in y_true [0 1 2 3 4 5 6 7 8]

balanced val_acc:
 0.8449235165574046

Confusion Matrix:
 [[ 6352   232    26]
 [ 1995  3890   725]
 [  271   312 39077]]

Classification Report:
                precision    recall  f1-score   support

       C1_pos       0.74      0.96      0.83      6610
     C2-3_pos       0.88      0.59      0.70      6610
all_other_pos       0.98      0.99      0.98     39660

     accuracy                           0.93     52880
    macro avg       0.87      0.84      0.84     52880
 weighted avg       0.94      0.93      0.93     52880
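The gap between the balanced accuracy (0.84) and the plain accuracy (0.93) reported above comes from the much larger 'all_other_pos' class. A short sketch, using only the confusion matrix printed above, recomputes both figures by hand: balanced accuracy is the mean of the per-class recalls, while plain accuracy is the fraction of all patches classified correctly.

```python
import numpy as np

# Confusion matrix of the C1 / C2-3 / all_other model (rows = true class).
cm = np.array([[6352,  232,    26],
               [1995, 3890,   725],
               [ 271,  312, 39077]])

# Recall of each class: correct predictions divided by the row total.
per_class_recall = np.diag(cm) / cm.sum(axis=1)

balanced_acc = per_class_recall.mean()   # every class counts equally
plain_acc = np.trace(cm) / cm.sum()      # every patch counts equally

print(per_class_recall)
print(balanced_acc, plain_acc)
```

The low recall of 'C2-3_pos' pulls the balanced figure down, whereas the plain accuracy is dominated by the 39660 'all_other_pos' patches.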



**Again, just remember the val_generator class indices (we will need them next):**

In [21]:
print('validation_generator.class_indices:', str(json.dumps(val_generator.class_indices, indent=2, default=str)))

validation_generator.class_indices: {
  "C10_pos": 0,
  "C1_pos": 1,
  "C2-3_pos": 2,
  "C4-7_pos": 3,
  "C5_pos": 4,
  "C6_pos": 5,
  "C8_pos": 6,
  "C9_pos": 7,
  "neg": 8
}


#### **Let's evaluate 'C4-7_C5' model on the validation set:**

In [22]:
y_true = val_generator.classes
print('\nunique values in y_true', np.unique(y_true))

C4_7_C5_scores = C4_7_C5_model.predict_generator(val_generator)
y_pred = np.argmax(C4_7_C5_scores, axis=1)

# remove 'neg' class (index = 8):
y_true, y_pred = y_true[(y_true != 8)], y_pred[(y_true != 8)]

# class indices for C4-7_C5_model are as follows:
C4_7_C5_model_class_dict = {0:"C4-7_pos", 1:"C5_pos",2: "all_other_pos"}

# in y_true, convert classes != 'C4-7_pos'(3),'C5_pos'(4) into 'all_other'(2):
y_true[(y_true!=3) & (y_true!=4)] = 2

# now, let's map 'C4-7_pos' from 3 to 0 and 'C5_pos' from 4 to 1:
y_true[y_true==3] = 0
y_true[y_true==4] = 1

val_acc = balanced_accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
class_names = [v for v in C4_7_C5_model_class_dict.values()]
c_report = classification_report(y_true, y_pred, target_names=class_names)

print('\nbalanced val_acc:\n', val_acc)
print('\nConfusion Matrix:\n', cm)
print('\nClassification Report:\n', c_report)


unique values in y_true [0 1 2 3 4 5 6 7 8]

balanced val_acc:
 0.82523953605648

Confusion Matrix:
 [[ 3943  2506   161]
 [  499  6054    57]
 [ 1005   450 38205]]

Classification Report:
                precision    recall  f1-score   support

     C4-7_pos       0.72      0.60      0.65      6610
       C5_pos       0.67      0.92      0.78      6610
all_other_pos       0.99      0.96      0.98     39660

     accuracy                           0.91     52880
    macro avg       0.80      0.83      0.80     52880
 weighted avg       0.92      0.91      0.91     52880
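The relabelling used in the cell above (9-class generator indices mapped to the 3-class indices of the C4-7/C5 model) can also be written as a lookup table. This is an alternative sketch, not the notebook's code: it builds a new array instead of assigning in place, so the result does not depend on the order of the assignments. The toy labels are hypothetical.

```python
import numpy as np

# lookup[i] is the 3-class label for generator index i:
# 'C4-7_pos' (3) -> 0, 'C5_pos' (4) -> 1, everything else -> 2 ('all_other').
# Index 8 ('neg') also maps to 2 here, but the notebook removes 'neg'
# patches before this step, so it is never looked up.
lookup = np.full(9, 2)
lookup[3] = 0
lookup[4] = 1

labels_9 = np.array([0, 1, 3, 4, 4, 7, 3, 5])  # toy positive-class labels
labels_3 = lookup[labels_9]                     # fancy indexing returns a new array

print(labels_3)   # [2 2 0 1 1 2 0 2]
print(labels_9)   # unchanged: [0 1 3 4 4 7 3 5]
```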



**Again, just remember the val_generator class indices (we will need them next):**

In [23]:
print('validation_generator.class_indices:', str(json.dumps(val_generator.class_indices, indent=2, default=str)))

validation_generator.class_indices: {
  "C10_pos": 0,
  "C1_pos": 1,
  "C2-3_pos": 2,
  "C4-7_pos": 3,
  "C5_pos": 4,
  "C6_pos": 5,
  "C8_pos": 6,
  "C9_pos": 7,
  "neg": 8
}


#### **Let's evaluate 'pos_vs_neg' model on the validation set:**

In [24]:
y_true = val_generator.classes
print('\nunique values in y_true', np.unique(y_true))

pos_vs_neg_scores = pos_vs_neg_model.predict_generator(val_generator)
y_pred = np.argmax(pos_vs_neg_scores, axis=1)

# class indices for pos_vs_neg_model are as follows:
pos_vs_neg_model_class_dict = {0:'neg', 1:'pos'}

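# in y_true, convert classes != 'neg'(8) to 1.
# NOTE: no filtering copy is made in this cell, so y_true is the same array
# object as val_generator.classes and these in-place assignments also
# overwrite the generator's own labels:
y_true[y_true!=8] = 1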

# now, let's map 'neg' from 8 to 0:
y_true[y_true==8] = 0

val_acc = balanced_accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
class_names = [v for v in pos_vs_neg_model_class_dict.values()]
c_report = classification_report(y_true, y_pred, target_names=class_names)

print('\nbalanced val_acc:\n', val_acc)
print('\nConfusion Matrix:\n', cm)
print('\nClassification Report:\n', c_report)


unique values in y_true [0 1 2 3 4 5 6 7 8]

balanced val_acc:
 0.9968891830559758

Confusion Matrix:
 [[ 6575    35]
 [   49 52831]]

Classification Report:
               precision    recall  f1-score   support

         neg       0.99      0.99      0.99      6610
         pos       1.00      1.00      1.00     52880

    accuracy                           1.00     59490
   macro avg       1.00      1.00      1.00     59490
weighted avg       1.00      1.00      1.00     59490



### **All models performed as expected. Let's now build a composite array of training data, where the features are the 16 scores produced by the 4 models and the target is the actual class (one of the 9 classes in the downloaded patches):**

#### **First, let's check the sizes to make sure they are correct:**

In [25]:
y_true = val_generator.classes
print('\nunique values in y_true', np.unique(y_true))
print('y_true.shape', y_true.shape)
print('eight_classes_scores.shape', eight_classes_scores.shape)
print('C1_C2_3_scores.shape', C1_C2_3_scores.shape)
print('C4_7_C5_scores.shape', C4_7_C5_scores.shape)
print('pos_vs_neg_scores.shape', pos_vs_neg_scores.shape)


unique values in y_true [0 1]
y_true.shape (59490,)
eight_classes_scores.shape (59490, 8)
C1_C2_3_scores.shape (59490, 3)
C4_7_C5_scores.shape (59490, 3)
pos_vs_neg_scores.shape (59490, 2)


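#### **y_true now contains only 0 and 1: in the 'pos_vs_neg' evaluation cell, y_true referred to the same array as val_generator.classes (not a copy), so relabelling it in place overwrote the generator's labels. Let's rebuild val_generator to recover the correct y_true:**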

In [26]:
val_datagen = ImageDataGenerator(rescale=1./255)

val_generator = val_datagen.flow_from_directory(
        './Patches/Control',
        target_size=(img_size[0],img_size[1]),
        batch_size=val_batch_size,
        class_mode='categorical',
        shuffle=False)

y_true = val_generator.classes
print('\nunique values in y_true', np.unique(y_true))

Found 59490 images belonging to 9 classes.

unique values in y_true [0 1 2 3 4 5 6 7 8]
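The overwritten labels seen above are consistent with NumPy aliasing: assigning an array attribute to a new name does not copy it, so in-place edits through that name change the original. The sketch below shows the effect and the `.copy()` remedy; `FakeGenerator` is a hypothetical stand-in that exposes a `classes` array the way the Keras generator does.

```python
import numpy as np

class FakeGenerator:
    """Hypothetical stand-in exposing a .classes label array."""
    def __init__(self):
        self.classes = np.array([0, 3, 8, 8, 5])

# Case 1: plain assignment gives a second name for the SAME array.
gen_aliased = FakeGenerator()
y_alias = gen_aliased.classes
y_alias[y_alias != 8] = 1
y_alias[y_alias == 8] = 0
print(gen_aliased.classes)   # [1 1 0 0 1]  (generator labels overwritten)

# Case 2: .copy() gives an independent array, so the labels survive.
gen_copied = FakeGenerator()
y_copy = gen_copied.classes.copy()
y_copy[y_copy != 8] = 1
y_copy[y_copy == 8] = 0
print(gen_copied.classes)    # [0 3 8 8 5]  (generator labels intact)
```

Boolean filtering such as `y_true[y_true != 8]` also returns a new array, which is why the earlier evaluation cells that removed the 'neg' class did not alter the generator.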


#### **Let's combine the outputs from all models:**

In [27]:
X = np.hstack((eight_classes_scores,C1_C2_3_scores,C4_7_C5_scores,pos_vs_neg_scores))
print(X.shape)

(59490, 16)


#### **Let's add y_true as the last column:**

In [28]:
y = np.reshape(y_true, (len(y_true),-1)) # reshape from 1D to 2D with a single column
train_data = np.hstack((X, y))
print(train_data.shape)

(59490, 17)


#### **Let's save train_data as a CSV file on GoogleDrive:**

In [0]:
dest = 'gdrive/My Drive/Capstone/train_data.csv'
np.savetxt(dest, train_data, delimiter=",") 
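For reference, a file written this way can be read back and split into features and labels as sketched below. The toy array and the temporary path are assumptions made for illustration; only the layout (16 feature columns followed by the label column, comma-delimited) comes from the cells above.

```python
import os
import tempfile
import numpy as np

# Toy array with the same layout as train_data: 16 scores + 1 label column.
toy_scores = np.random.default_rng(0).random((4, 16))
toy_labels = np.array([[0.], [3.], [8.], [1.]])
toy_train_data = np.hstack((toy_scores, toy_labels))

# Hypothetical temporary location (the notebook writes to Google Drive).
toy_path = os.path.join(tempfile.mkdtemp(), 'train_data_toy.csv')
np.savetxt(toy_path, toy_train_data, delimiter=",")

# Read the file back and separate features from the label column.
loaded = np.loadtxt(toy_path, delimiter=",")
X_loaded = loaded[:, :-1]
y_loaded = loaded[:, -1].astype(int)  # labels are stored as floats in the CSV

print(X_loaded.shape, y_loaded)
```

Note that `np.savetxt` stores the integer class labels as floating-point text, so they need to be cast back to integers after loading.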

# **Next Steps:**



#### As a final step, before testing, we will use the 16 combined predicted probabilities as features to train an SVM or another simple classifier that predicts either 'negative' or the correct bacterial species. The features come from the 4 models above and are stored, together with y_true, in a CSV file on GoogleDrive.
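A minimal sketch of that second-stage classifier follows, assuming scikit-learn's `SVC` with default settings. The data here is synthetic (noisy vectors in which the column matching the label stands out), not the saved CSV, and the kernel, split ratio and variable names are illustrative assumptions rather than choices made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_features, n_per_class = 9, 16, 40

# Synthetic stand-in for the 16 stacked scores: small uniform noise, plus a
# strong signal in the column whose index equals the class label.
y_toy = np.repeat(np.arange(n_classes), n_per_class)
X_toy = rng.random((y_toy.size, n_features)) * 0.2
X_toy[np.arange(y_toy.size), y_toy] += 1.0

# Hold out part of the data so the classifier is scored on unseen rows.
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

svm_clf = SVC(kernel='rbf')          # default RBF-kernel SVM (an assumption)
svm_clf.fit(X_fit, y_fit)

holdout_accuracy = svm_clf.score(X_holdout, y_holdout)
print('held-out accuracy:', holdout_accuracy)
```

With the real CSV, the same pattern would apply after loading the 16 feature columns and the label column; keeping a held-out portion matters because the stacked scores were themselves computed on the validation patches.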