### Benchmark model

#### Import Dataset

##### Resize data on disk

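Code snippet that resizes the images on disk to 224x224 pixels to save limited computer resources such as RAM and disk space. The code is commented out, and `path` has to be pointed at each class directory in turn.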

In [1]:
### #!/usr/bin/python
### from PIL import Image
### import os, sys
### 
### path = "OCT2017-RESIZED-V1/train/NORMAL/"
### dirs = os.listdir( path )
### 
### def resize():
###     for item in dirs:
###         if os.path.isfile(path+item):
###             im = Image.open(path+item)
###             f, e = os.path.splitext(path+item)
###             imResize = im.resize((224,224), Image.LANCZOS)
###             imResize.save(f + '.jpeg', 'JPEG', quality=100)
### 
### resize()

#### Import necessary libraries

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline  

from sklearn.datasets import load_files       
from keras.utils import np_utils
import numpy as np
from glob import glob

from keras.preprocessing import image                  
from tqdm import tqdm

from PIL import ImageFile    

from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential
from keras.callbacks import ModelCheckpoint  

Using TensorFlow backend.


Settings

In [3]:
train_directory = 'OCT2017-RESIZED-V1/train'
valid_directory = 'OCT2017-RESIZED-V1/valid'
test_directory = 'OCT2017-RESIZED-V1/test'

##### Data Exploration

Import dataset

In [4]:
# define function to load train, test, and validation datasets
def load_dataset(path):
    data = load_files(path)
    oct_files = np.array(data['filenames'])
    oct_targets = np_utils.to_categorical(np.array(data['target']), 4)
    return oct_files, oct_targets

# load train, test, and validation datasets
train_files, train_targets = load_dataset(train_directory)
valid_files, valid_targets = load_dataset(valid_directory)
test_files, test_targets = load_dataset(test_directory)

# load list of oct category names (strip the directory prefix and the trailing separator)
oct_names = [item[len(train_directory) + 1:-1] for item in sorted(glob(train_directory + "/*/"))]

# print statistics about the dataset
print('There are %d total oct categories.' % len(oct_names))
print('There are %s total oct images.\n' % len(np.hstack([train_files, valid_files, test_files])))
print('There are %d training oct images.' % len(train_files))
print('There are %d validation oct images.' % len(valid_files))
print('There are %d test oct images.'% len(test_files))

There are 4 total oct categories.
There are 7020 total oct images.

There are 5082 training oct images.
There are 938 validation oct images.
There are 1000 test oct images.
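
`load_dataset` converts the integer class indices returned by `load_files` into one-hot rows with `np_utils.to_categorical`. The following is a minimal NumPy sketch of that encoding, not the Keras implementation. It assumes the four classes are indexed in alphabetical folder order (CNV, DME, DRUSEN, NORMAL); the helper name `one_hot` and the sample labels are hypothetical.

```python
import numpy as np

# Assumed class order: alphabetical folder names.
CLASS_NAMES = ["CNV", "DME", "DRUSEN", "NORMAL"]

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Hypothetical integer labels for three images.
targets = one_hot([0, 3, 1], num_classes=len(CLASS_NAMES))

# argmax along each row recovers the class index, and from it the class name.
decoded = [CLASS_NAMES[i] for i in targets.argmax(axis=1)]
print(targets)
print(decoded)
```

Taking `argmax` along each row inverts the encoding, which is the same step used later to compare predictions with `test_targets`.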


#### Similarities between images of different categories

CNV | DME | DRUSEN | NORMAL
--- | --- | --- | ---
<img src="OCT2017-RESIZED-V1/test/CNV/CNV-6256161-1.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/test/DME/DME-4940184-1.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/test/DRUSEN/DRUSEN-7373858-1.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/test/NORMAL/NORMAL-3103940-1.jpeg" width="224">

#### Distinctive Images of CNV

CNV | CNV | CNV | CNV
--- | --- | --- | ---
<img src="OCT2017-RESIZED-V1/train/CNV/CNV-154835-109.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/CNV/CNV-154835-83.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/CNV/CNV-172472-41.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/CNV/CNV-172472-39.jpeg" width="224">



#### Distinctive Images of DME

DME | DME | DME | DME
--- | --- | --- | ---
<img src="OCT2017-RESIZED-V1/train/DME/DME-306172-74.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/DME/DME-323904-8.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/DME/DME-462675-36.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/DME/DME-633268-67.jpeg" width="224">

#### Distinctive Images of DRUSEN

DRUSEN | DRUSEN | DRUSEN | DRUSEN
--- | --- | --- | ---
<img src="OCT2017-RESIZED-V1/train/DRUSEN/DRUSEN-457907-12.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/DRUSEN/DRUSEN-2128644-16.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/DRUSEN/DRUSEN-7106073-1.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/DRUSEN/DRUSEN-8023853-38.jpeg" width="224">

#### Distinctive Images of NORMAL

NORMAL | NORMAL | NORMAL | NORMAL
--- | --- | --- | ---
<img src="OCT2017-RESIZED-V1/train/NORMAL/NORMAL-216250-2.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/NORMAL/NORMAL-258763-36.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/NORMAL/NORMAL-286318-1.jpeg" width="224"> | <img src="OCT2017-RESIZED-V1/train/NORMAL/NORMAL-778975-37.jpeg" width="224">

In [5]:


def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type
    img = image.load_img(img_path, target_size=(224, 224))
    # convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return 4D tensor
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)
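
The shape bookkeeping in `path_to_tensor` and `paths_to_tensor` can be checked without reading any image files. The sketch below is only an illustration under that assumption: `fake_image_tensor` is a hypothetical stand-in that uses a zero array in place of `image.img_to_array(img)`.

```python
import numpy as np

def fake_image_tensor(height=224, width=224, channels=3):
    """Hypothetical stand-in for path_to_tensor that skips the disk read."""
    # One image as a 3D array, as img_to_array would return it.
    x = np.zeros((height, width, channels), dtype="float32")
    # Same step as path_to_tensor: add a leading batch axis.
    return np.expand_dims(x, axis=0)

single = fake_image_tensor()
# Same step as paths_to_tensor: stack the (1, H, W, C) arrays along axis 0.
batch = np.vstack([fake_image_tensor() for _ in range(5)])

print(single.shape)
print(batch.shape)
```

Each image becomes a batch of one, so `np.vstack` can concatenate any number of them into a single 4D array whose first axis is the image count.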

In [6]:
                        
ImageFile.LOAD_TRUNCATED_IMAGES = True                 

# pre-process the data for Keras
train_tensors = paths_to_tensor(train_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255
test_tensors = paths_to_tensor(test_files).astype('float32')/255

100%|█████████████████████████████████████████████████████████████████████████████| 5082/5082 [00:08<00:00, 603.70it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 938/938 [01:02<00:00, 15.12it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:25<00:00, 38.47it/s]
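
Because RAM is a stated constraint, it can help to estimate how much memory the three tensors occupy once loaded. The sketch below is a back-of-the-envelope calculation, assuming 4 bytes per `float32` value and the image counts printed earlier; `tensor_bytes` is a hypothetical helper name.

```python
def tensor_bytes(num_images, height=224, width=224, channels=3, bytes_per_value=4):
    """Size in bytes of a dense float32 tensor of shape (N, H, W, C)."""
    return num_images * height * width * channels * bytes_per_value

# Image counts taken from the dataset statistics printed above.
splits = {"train": 5082, "valid": 938, "test": 1000}

sizes_gib = {name: tensor_bytes(count) / 2**30 for name, count in splits.items()}
for name, gib in sizes_gib.items():
    print(f"{name}: {gib:.2f} GiB")
```

These figures cover only the final arrays. Peak usage during loading is likely higher, since stacking and type conversion can create temporary copies.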


---
<a id='step0'></a>
## Step N: Create benchmark model

### Define the model architecture

The benchmark is a small convolutional network trained from scratch. It builds on the variables populated earlier through the `load_files` function from the scikit-learn library:
- `train_files`, `valid_files`, `test_files` - numpy arrays containing file paths to images
- `train_targets`, `valid_targets`, `test_targets` - numpy arrays containing one-hot-encoded classification labels
- `oct_names` - list of string-valued OCT category names for translating labels

In [7]:
model = Sequential([
    
    #Convolutional layer: locally connected, so it has fewer weights than a dense layer
    #Break the image up into smaller pieces
    #Use 75 filters to identify the most general patterns
    #Use a small kernel_size of 2
    Conv2D(filters=75, kernel_size=2, padding='same', activation='relu', input_shape=(224,224,3)),
    
    #Reduce dimensionality of convolutional layer,
    #Reduce by taking the maximum value in the filter
    MaxPooling2D(pool_size=2),
    
    #Use 100 filters to identify more specific patterns
    #Use a small kernel_size of 2
    Conv2D(filters=100, kernel_size=2, padding='same', activation='relu'),
    MaxPooling2D(pool_size=2),
    
    #Use 125 filters to identify even more specific patterns
    #Use a small kernel_size of 2
    Conv2D(filters=125, kernel_size=2, padding='same', activation='relu'),
    
    MaxPooling2D(pool_size=2),
    
    
    
    #Further Conv2D/MaxPooling2D blocks, left disabled
    #Conv2D(filters=125, kernel_size=2, padding='same', activation='relu'),
    #MaxPooling2D(pool_size=2),
    #Conv2D(filters=125, kernel_size=2, padding='same', activation='relu'),
    #MaxPooling2D(pool_size=2),
    #Conv2D(filters=125, kernel_size=2, padding='same', activation='relu'),
    #MaxPooling2D(pool_size=2),
    #Conv2D(filters=125, kernel_size=2, padding='same', activation='relu'),
    #MaxPooling2D(pool_size=2),

    Dropout(0.3),
    Flatten(),
    # Dense output layer with a softmax over the 4 categories
    Dense(4, activation='softmax')
])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 224, 224, 75)      975       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 112, 112, 75)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 112, 112, 100)     30100     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 56, 56, 100)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 56, 56, 125)       50125     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 28, 28, 125)       0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 28, 28, 125)       0         
__________
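
The parameter counts and output shapes in the summary can be reproduced by hand from the layer definitions. The sketch below does that arithmetic in plain Python. Because the printed summary is cut off, the Flatten and Dense figures are derived from the model code and not read from the output; the helper names are hypothetical.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights (k * k * in_channels per filter) plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    """One weight per input per unit, plus one bias per unit."""
    return (inputs + 1) * units

# (input channels, filters) for the three Conv2D layers, all with kernel_size=2.
layers = [(3, 75), (75, 100), (100, 125)]
conv_params = [conv2d_params(2, c_in, f) for c_in, f in layers]

# padding='same' keeps the spatial size; each MaxPooling2D(pool_size=2) halves it.
side = 224
for _ in layers:
    side //= 2

flattened = side * side * layers[-1][1]     # inputs to the Dense layer after Flatten
output_params = dense_params(flattened, 4)  # Dense(4, activation='softmax')
total_params = sum(conv_params) + output_params

print(conv_params, side, flattened, output_params, total_params)
```

The three convolutional counts match the 975, 30100, and 50125 shown in the summary. In this arithmetic most of the weights sit in the final Dense layer, because it is connected to every value of the 28x28x125 feature map.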

In [8]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

In [9]:
epochs = 7

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.from_scratch.hdf5', 
                               verbose=1, save_best_only=True)

model.fit(train_tensors, train_targets, 
          validation_data=(valid_tensors, valid_targets),
          epochs=epochs, batch_size=20, callbacks=[checkpointer], verbose=2)





Train on 5082 samples, validate on 938 samples
Epoch 1/7
 - 165s - loss: 1.1929 - acc: 0.4866 - val_loss: 1.4747 - val_acc: 0.3614

Epoch 00001: val_loss improved from inf to 1.47466, saving model to saved_models/weights.best.from_scratch.hdf5
Epoch 2/7
 - 104s - loss: 0.7626 - acc: 0.7050 - val_loss: 1.1320 - val_acc: 0.5362

Epoch 00002: val_loss improved from 1.47466 to 1.13203, saving model to saved_models/weights.best.from_scratch.hdf5
Epoch 3/7
 - 103s - loss: 0.5408 - acc: 0.7934 - val_loss: 1.5254 - val_acc: 0.4414

Epoch 00003: val_loss did not improve from 1.13203
Epoch 4/7
 - 104s - loss: 0.3929 - acc: 0.8597 - val_loss: 1.2960 - val_acc: 0.5608

Epoch 00004: val_loss did not improve from 1.13203
Epoch 5/7
 - 104s - loss: 0.2858 - acc: 0.8981 - val_loss: 1.3064 - val_acc: 0.5650

Epoch 00005: val_loss did not improve from 1.13203
Epoch 6/7
 - 104s - loss: 0.2097 - acc: 0.9256 - val_loss: 1.4265 - val_acc: 0.5458

Epoch 00006: val_loss did not improve from 1.13203
Epoch 7/7
 

<keras.callbacks.History at 0x233af50ceb8>
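
With `save_best_only=True`, the checkpoint file is overwritten only when the validation loss improves, so the weights loaded in the next cell come from the best epoch and not the last one. The sketch below replays that selection on the values copied from the log above. Epoch 7 is left out because its line is missing from the output.

```python
# Losses copied from the training log (epochs 1 to 6).
train_loss = {1: 1.1929, 2: 0.7626, 3: 0.5408, 4: 0.3929, 5: 0.2858, 6: 0.2097}
val_loss = {1: 1.4747, 2: 1.1320, 3: 1.5254, 4: 1.2960, 5: 1.3064, 6: 1.4265}

# Epoch whose weights would be kept by a best-only checkpoint on val_loss.
best_epoch = min(val_loss, key=val_loss.get)

# Gap between validation and training loss per epoch; a widening gap suggests overfitting.
gap = {epoch: round(val_loss[epoch] - train_loss[epoch], 4) for epoch in val_loss}

print("best epoch:", best_epoch)
print("val - train gap:", gap)
```

The training loss keeps falling while the validation loss stays above its epoch 2 value, which is consistent with the model overfitting the training set after the second epoch.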

In [10]:
model.load_weights('saved_models/weights.best.from_scratch.hdf5')

In [11]:
# get index of predicted diagnosis for each image in test set
diagnosis_predictions = [np.argmax(model.predict(np.expand_dims(tensor, axis=0))) for tensor in test_tensors]

# report test accuracy
test_accuracy = 100*np.sum(np.array(diagnosis_predictions)==np.argmax(test_targets, axis=1))/len(diagnosis_predictions)
print('Test accuracy: %.4f%%' % test_accuracy)

Test accuracy: 58.8000%
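
The accuracy calculation compares the `argmax` of each predicted probability row with the `argmax` of the matching one-hot target. The sketch below shows the same computation on a small made-up example; `accuracy_percent`, `toy_probs`, and `toy_targets` are hypothetical and unrelated to the real test set.

```python
import numpy as np

def accuracy_percent(probabilities, one_hot_targets):
    """Percentage of rows whose highest-probability class matches the target class."""
    predicted = np.argmax(probabilities, axis=1)
    actual = np.argmax(one_hot_targets, axis=1)
    return 100.0 * np.mean(predicted == actual)

# Made-up softmax outputs for four images and four classes.
toy_probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
    [0.6, 0.2, 0.1, 0.1],
])
# Made-up true classes 0, 1, 2, 0 as one-hot rows.
toy_targets = np.eye(4)[[0, 1, 2, 0]]

toy_accuracy = accuracy_percent(toy_probs, toy_targets)
print("Toy accuracy: %.4f%%" % toy_accuracy)
```

With the real model, a single `model.predict(test_tensors)` call should return the same kind of `(N, 4)` probability array that `accuracy_percent` expects, which avoids predicting one image at a time.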
