# Capstone Notebook 3: Testing Models and Evaluation

In this notebook, we evaluate the four trained models (CNN, VGG16, ResNet50 and InceptionNet) on the test partition of the dataset.

In [37]:
import numpy as np
import pandas as pd
from PIL import Image

from keras.models import load_model
from keras.preprocessing import image
from keras.optimizers import Adam

df_test = pd.read_csv('./balanced-one-partition/pneumo_dataset_ITI_rev_clean.tsv', sep="\t")
# Image height, image width, and number of channels (1 = grayscale).
y, x, in_channel = 524, 524, 1

In [48]:
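from pneumo_data_generator import DataGenerator

# Keep only PA (posteroanterior) projections from the 'te' (test) partition.
data_filter = df_test['Projection'] == 'PA'
data_filter &= df_test['Partition'] == 'te'
df_pneumo_2d_test = df_test[data_filter]

# The frame is already filtered, so it is passed to the generator as-is.
# shuffle=False keeps the predictions in the same order as the rows of
# df_pneumo_2d_test, so they can be compared against the labels.
data_generator_test = DataGenerator(df_pneumo_2d_test, y, x, in_channel, batch_size=1, shuffle=False)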


In [20]:
df_pneumo_2d_test.shape

(2378, 39)

In [39]:
# CNN model
CNNmodel= load_model('./models/modelCNN-train.hdf5')



In [49]:
predict_CNN = CNNmodel.predict(data_generator_test,steps = len(data_generator_test),verbose=1,
                               workers=1, use_multiprocessing=False)
#predict = np.argmax(predict, axis=1)



In [None]:
# Spot check: predict on a single batch (index 1) from the generator.
#data_generator_test[1]
CNNmodel.predict(x=data_generator_test[1])

In [47]:
predict_CNN

array([[0.20709012, 0.79290986],
       [0.20709012, 0.79290986],
       [0.20709012, 0.79290986],
       ...,
       [0.20709012, 0.79290986],
       [0.20709012, 0.79290986],
       [0.20709012, 0.79290986]], dtype=float32)
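
The commented-out `np.argmax` line above hints at the next step: turning each row of class probabilities into a predicted class index. A minimal sketch, using a small stand-in array in place of `predict_CNN` (the array `probs` and the helper `to_labels` are illustrative names, not part of the notebook):

```python
import numpy as np

def to_labels(probabilities):
    """Return the index of the most probable class for each row."""
    return np.argmax(np.asarray(probabilities), axis=1)

# Stand-in for predict_CNN: three rows copied from the output above.
probs = np.array([[0.20709012, 0.79290986]] * 3, dtype=np.float32)
labels = to_labels(probs)
```

Applied to the real array, this would be `to_labels(predict_CNN)`, giving one class index per test image.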

In [8]:
# VGG model
vggmodel= load_model('./models/vggmodel-train.hdf5')

In [9]:
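# predict() accepts a generator directly, as in the other cells of this
# notebook; predict_generator() is deprecated in recent Keras versions.
predict_vgg = vggmodel.predict(data_generator_test, steps=len(data_generator_test), verbose=1,
                               workers=1, use_multiprocessing=False)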




In [21]:
predict_vgg

array([[0.20379415, 0.7962058 ],
       [0.20379415, 0.7962058 ],
       [0.20379415, 0.7962058 ],
       ...,
       [0.20379418, 0.7962058 ],
       [0.20379418, 0.7962058 ],
       [0.20379412, 0.7962058 ]], dtype=float32)

In [9]:
# resnet model
resnetmodel= load_model('./models/resnetmodel-train.hdf5')

In [11]:
predict_resnet = resnetmodel.predict(data_generator_test,steps = len(data_generator_test),verbose=1)




In [12]:
predict_resnet

array([[0.19495799, 0.805042  ],
       [0.19495799, 0.805042  ],
       [0.19495799, 0.805042  ],
       ...,
       [0.19495799, 0.805042  ],
       [0.19495799, 0.805042  ],
       [0.19495799, 0.805042  ]], dtype=float32)

In [5]:
#inception model
inceptionmodel= load_model('./models/inceptionmodel-train.hdf5')

  


In [6]:
predict_inception = inceptionmodel.predict(data_generator_test,steps = len(data_generator_test),verbose=1)




In [7]:
predict_inception

array([[0.19546415, 0.80453587],
       [0.19546415, 0.80453587],
       [0.19546415, 0.80453587],
       ...,
       [0.19546415, 0.80453587],
       [0.19546415, 0.80453587],
       [0.19546415, 0.80453587]], dtype=float32)
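
Every row displayed in the four prediction arrays looks the same, so it may be worth confirming whether the outputs vary across test images at all before comparing the models. A minimal sketch, assuming each prediction array has shape `(n_samples, 2)`; `prediction_spread` is a hypothetical helper and `demo` is a stand-in built from the rows shown above:

```python
import numpy as np

def prediction_spread(probabilities):
    """Return the per-class range (max - min) across all samples."""
    p = np.asarray(probabilities, dtype=np.float64)
    return p.max(axis=0) - p.min(axis=0)

# Stand-in for predict_inception: five copies of the displayed row.
demo = np.array([[0.19546415, 0.80453587]] * 5)
spread = prediction_spread(demo)

# True when both class probabilities are effectively identical in every row.
is_constant = bool(np.all(spread < 1e-6))
```

A spread close to zero on a real prediction array would suggest that the model returns the same output for every input, in which case the test score would mostly reflect the class balance of the test set.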

All four models score within about 0.012 of each other on the test set (0.7930 to 0.8050), so accuracy alone does little to differentiate them. This brings us to the next criteria: the number of parameters and the time taken to train each model. Both are summarised in the table below:

| Model | Train Score | Validation Score | Test Score | No. of Params | Training Time |
|---|---|---|---|---|---|
| CNN | 0.8044 | 0.8111 | 0.7930 | 60,940,898 | 17 min |
| VGG16 | 0.8044 | 0.8111 | 0.7962 | 15,238,018 | 32 min |
| ResNet50 | 0.8041 | 0.8111 | 0.8050 | 25,678,786 | 47 min |
| InceptionNet | 0.8042 | 0.8111 | 0.8045 | 55,221,090 | 1 h |
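
The trade-off in the table can be made explicit with a few lines of plain Python. This is only a sketch: the numbers are copied from the table, and the variable names are chosen for illustration.

```python
# Values copied from the table above:
# (test score, number of parameters, training time in minutes).
results = {
    "CNN": (0.7930, 60_940_898, 17),
    "VGG16": (0.7962, 15_238_018, 32),
    "ResNet50": (0.8050, 25_678_786, 47),
    "InceptionNet": (0.8045, 55_221_090, 60),
}

# Best test score and the gap between the best and worst models.
test_scores = {name: row[0] for name, row in results.items()}
best = max(test_scores, key=test_scores.get)
score_range = max(test_scores.values()) - min(test_scores.values())

# Extra training minutes each model needs relative to the fastest one.
fastest = min(row[2] for row in results.values())
extra_minutes = {name: row[2] - fastest for name, row in results.items()}
```

Here `score_range` comes out at roughly 0.012, while `extra_minutes` shows that the best-scoring model needs 30 more minutes of training than the fastest one.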

While ResNet50 and InceptionNet scored marginally better on the test set, the time they needed to train (47 min and 1 h, against 17 min for the CNN) makes them impractical to train on all but the most powerful computing devices.

On the other hand, prediction remained relatively inexpensive: each run over the 2,378 images in the test set completed in less than 3 min. Thus, ResNet50 and InceptionNet may still be usable for local inference on hospital machines, which could make them useful for diagnosis.
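
To put the prediction cost in per-image terms, the sketch below turns the "less than 3 min" figure into an upper bound on latency, using the 2,378 rows reported by `df_pneumo_2d_test.shape`. It also includes a small, hypothetical `time_call` helper that could be wrapped around a `predict` call to measure the time directly.

```python
import time

n_images = 2378        # rows in df_pneumo_2d_test
max_seconds = 3 * 60   # "less than 3 min", used as an upper bound

# Upper bound on time per image and lower bound on throughput.
max_seconds_per_image = max_seconds / n_images
min_images_per_second = n_images / max_seconds

def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Demonstration on a cheap built-in; in the notebook the call could be
# time_call(resnetmodel.predict, data_generator_test).
total, elapsed = time_call(sum, [1, 2, 3])
```

Under these figures each image takes at most about 0.076 s, or at least roughly 13 images per second.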

Another factor is the size of each model's .hdf5 file, which is on the order of megabytes. This may make it difficult for hospitals to retrieve or store the models locally, especially since models for many other diseases that rely on machine-learning-based image diagnosis would need to be stored alongside them.
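
A rough lower bound on the file sizes can be estimated from the parameter counts in the table. The sketch below assumes every parameter is stored as a 32-bit float (4 bytes) and ignores optimizer state and file metadata, so the real files may be larger.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as float32

# Parameter counts copied from the results table.
param_counts = {
    "CNN": 60_940_898,
    "VGG16": 15_238_018,
    "ResNet50": 25_678_786,
    "InceptionNet": 55_221_090,
}

# Approximate weight storage in megabytes (1 MB = 10**6 bytes).
approx_mb = {
    name: count * BYTES_PER_FLOAT32 / 1e6
    for name, count in param_counts.items()
}
```

By this estimate the weights alone would take between about 61 MB (VGG16) and about 244 MB (CNN). The actual size on disk can be read with `os.path.getsize` on each `.hdf5` file.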


## Future Work
These models could be extended to Computed Tomography (CT) scans, which are X-ray images captured in three dimensions. A relatively simple 3-layer model was reported to achieve 83% accuracy in a similar pneumonia study:
https://keras.io/examples/vision/3D_image_classification/
