We first take a look at our data. We check the balance of our classes before starting any classification task. Since we have images of both eyes for each patient, we can also check whether diabetic retinopathy is more prevalent in either eye in our dataset.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
from scipy import misc
from skimage import color
import glob
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
import os
sns.set(style="white", palette="muted", color_codes=True)

In [2]:
import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D

Using TensorFlow backend.


In [3]:
labelpath = '/home/aremirata/thesis/kaggle/trainLabels.csv'
df_labels = pd.read_csv(labelpath)

In [4]:
df_labels.head()

Unnamed: 0,image,level
0,10_left,0
1,10_right,0
2,13_left,0
3,13_right,0
4,15_left,1


In [5]:
Counter(df_labels.level)

Counter({0: 25810, 1: 2443, 2: 5292, 3: 873, 4: 708})

In [6]:
frequency = df_labels.level.value_counts()/len(df_labels)

In [7]:
frequency

0    0.734783
2    0.150658
1    0.069550
3    0.024853
4    0.020156
Name: level, dtype: float64

The levels of retinopathy in this dataset are:

$$
\begin{array}{ll}
0 & \text{No DR} \\
1 & \text{Mild} \\
2 & \text{Moderate} \\
3 & \text{Severe} \\
4 & \text{Proliferative DR}
\end{array}
$$

Clearly, we have many more instances of normal eyes than eyes with retinopathy in our training set: about 73% of the labels are level 0. Oddly, level 2 (Moderate) is more common than any other nonzero level. Kaggle states that these images were rated by clinicians in varying settings. Perhaps clinicians avoid giving Mild or Severe ratings for some reason?
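
One common response to this kind of imbalance is to weight each class inversely to its frequency during training. The sketch below computes such weights from the counts printed above. The "balanced" formula `n_samples / (n_classes * count)` is one conventional choice, not something this notebook applies, and `class_weight` is a hypothetical name.

```python
from collections import Counter

# Class counts copied from the Counter output above.
counts = Counter({0: 25810, 1: 2443, 2: 5292, 3: 873, 4: 708})

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" weighting: rare classes get proportionally larger weights.
# class_weight is a hypothetical name, not used elsewhere in this notebook.
class_weight = {
    level: n_samples / (n_classes * count)
    for level, count in sorted(counts.items())
}

for level, weight in class_weight.items():
    print(level, round(weight, 3))
```

A dictionary of this shape could be passed to the `class_weight` argument of the Keras training call, although whether that helps on this dataset would need to be tested.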


We check whether retinopathy is more prevalent in the left or the right eye.

In [8]:
df_labels['location'] = df_labels['image'].apply(lambda s: 'left' if 'left' in s.lower() else 'right') 

In [9]:
df_labels.head()

Unnamed: 0,image,level,location
0,10_left,0,left
1,10_right,0,right
2,13_left,0,left
3,13_right,0,right
4,15_left,1,left


It seems that if you have diabetic retinopathy in one eye, you are more likely to have it in the other. As a sanity check, we count how many data points belong to the left and the right eye.

In [10]:
Counter(df_labels.location)

Counter({'left': 17563, 'right': 17563})
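
The cells above only count images per eye, so the statement about both eyes being affected together is not yet backed by a number. A minimal sketch of how it could be checked follows. It uses hypothetical `(image, level)` pairs in the same naming scheme as `trainLabels.csv`, not the real labels, and all variable names are illustrative.

```python
from collections import defaultdict

# Hypothetical (image, level) pairs; only the naming scheme mirrors the dataset.
rows = [
    ("10_left", 0), ("10_right", 0),
    ("13_left", 0), ("13_right", 0),
    ("15_left", 1), ("15_right", 2),
    ("16_left", 4), ("16_right", 4),
    ("17_left", 0), ("17_right", 1),
]

# Group the two eyes of each patient: patient id -> {"left": level, "right": level}.
by_patient = defaultdict(dict)
for image, level in rows:
    patient, eye = image.split("_")
    by_patient[patient][eye] = level

pairs = [(p["left"], p["right"]) for p in by_patient.values()
         if "left" in p and "right" in p]

# Share of diseased eyes per side (level > 0 means some retinopathy).
left_rate = sum(l > 0 for l, _ in pairs) / len(pairs)
right_rate = sum(r > 0 for _, r in pairs) / len(pairs)

# Among patients with a diseased left eye, how often is the right eye diseased too?
left_sick = [(l, r) for l, r in pairs if l > 0]
both_rate = sum(r > 0 for _, r in left_sick) / len(left_sick)

print(left_rate, right_rate, both_rate)
```

Comparing `both_rate` with `right_rate` is the point of the exercise: if the conditional rate is much higher than the overall rate, the two eyes are correlated.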

In [11]:
training_images_resize = glob.glob('/home/aremirata/thesis/kaggle/train_resize/*')

In [12]:
test_resize_dir = '/home/aremirata/thesis/kaggle/test_resize/'

# Create A Big Matrix For Training Images

## Store Each Of The Training Images in DataFrames

In [13]:
resize_train_images = glob.glob('/home/aremirata/thesis/kaggle/train_resize/*')

In [14]:
index = range(0, len(resize_train_images))
df_image = pd.DataFrame(index=index, columns=['image','actual_image'])
for i in range(len(resize_train_images)):
    # File name without directory; [0:-5] drops a five-character extension such as ".jpeg".
    df_image['image'][i] = resize_train_images[i].split('/')[-1][0:-5]
    # Pixel array of the resized image.
    df_image['actual_image'][i] = misc.imread(resize_train_images[i])

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.
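
The loop above derives the image name by slicing five characters off the end of the file name, which suggests the files end in `.jpeg`. A sketch of an extension-agnostic alternative using `os.path` follows; `image_id` and the example paths are hypothetical.

```python
import os

def image_id(path):
    """Return the file name without directory or extension, e.g. '10_left'."""
    return os.path.splitext(os.path.basename(path))[0]

# Hypothetical paths; only the naming scheme mirrors the dataset.
print(image_id("/data/train_resize/10_left.jpeg"))
print(image_id("/data/train_resize/10_left.jpg"))

# The fixed slice only works for five-character extensions.
print("10_left.jpg"[0:-5])
```

The names must match the `image` column of the label file exactly, because the later merge joins on that column.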


In [15]:
df_image.head()

Unnamed: 0,image,actual_image
0,34380_right,"[[[6, 0, 0], [6, 0, 0], [6, 0, 0], [6, 0, 0], ..."
1,26313_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
2,35365_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
3,1299_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
4,17021_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."


In [16]:
train_df = pd.merge(df_image, df_labels, on='image', how='inner')

In [17]:
train_df.head()

Unnamed: 0,image,actual_image,level,location
0,34380_right,"[[[6, 0, 0], [6, 0, 0], [6, 0, 0], [6, 0, 0], ...",0,right
1,26313_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,right
2,35365_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,left
3,1299_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,right
4,17021_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,right


In [18]:
len(train_df), len(df_image), len(df_labels)

(28530, 28530, 35126)
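
The lengths above show that the inner merge kept all 28,530 loaded images, while 35,126 - 28,530 = 6,596 labelled images were never loaded from the resize folder. A small sketch of how such gaps could be listed with set differences follows. The names are hypothetical stand-ins for `df_labels['image']` and `df_image['image']`.

```python
# Hypothetical image names standing in for the two 'image' columns.
label_names = {"10_left", "10_right", "13_left", "13_right", "15_left"}
loaded_names = {"10_left", "13_left", "13_right"}

missing_images = sorted(label_names - loaded_names)   # labelled but not loaded
missing_labels = sorted(loaded_names - label_names)   # loaded but not labelled

print(len(missing_images), missing_images)
print(len(missing_labels), missing_labels)
```

Checking which classes the missing images belong to would show whether the loaded subset has the same class balance as the full label file.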

# Work on Test Images

In [19]:
testlabelpath = '/home/aremirata/thesis/kaggle/retinopathy_solution.csv'
test_labels = pd.read_csv(testlabelpath)

In [20]:
test_labels.head()

Unnamed: 0,image,level,Usage
0,1_left,0,Private
1,1_right,0,Private
2,2_left,0,Public
3,2_right,0,Public
4,3_left,2,Private


In [21]:
resize_test_images = glob.glob('/home/aremirata/thesis/kaggle/test_resize/*')

In [22]:
index = range(0, len(resize_test_images))
df_test_image = pd.DataFrame(index=index, columns=['image','actual_image'])
for i in range(len(resize_test_images)):
    # File name without directory; [0:-5] drops a five-character extension such as ".jpeg".
    df_test_image['image'][i] = resize_test_images[i].split('/')[-1][0:-5]
    # Pixel array of the resized image.
    df_test_image['actual_image'][i] = misc.imread(resize_test_images[i])

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.


In [23]:
df_test_image.head()

Unnamed: 0,image,actual_image
0,37581_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
1,1191_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
2,38146_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
3,36599_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."
4,1143_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ..."


In [24]:
test_df = pd.merge(df_test_image, test_labels, on='image', how='inner')

In [25]:
test_df.head()

Unnamed: 0,image,actual_image,level,Usage
0,37581_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,Private
1,1191_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,Private
2,38146_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,Private
3,36599_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,Private
4,1143_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",1,Private


# Obtain Training and Test Matrices

In [26]:
train_df.head()

Unnamed: 0,image,actual_image,level,location
0,34380_right,"[[[6, 0, 0], [6, 0, 0], [6, 0, 0], [6, 0, 0], ...",0,right
1,26313_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,right
2,35365_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,left
3,1299_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,right
4,17021_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,right


In [27]:
X_training_set = np.stack(train_df['actual_image'].values, axis=0)  # .as_matrix() is deprecated in pandas

In [28]:
X_training_set.shape

(28530, 256, 256, 3)

In [29]:
y_training_set = np.stack(train_df['level'].values, axis=0)  # .as_matrix() is deprecated in pandas

In [30]:
y_training_set.shape

(28530,)

In [31]:
test_df.head()

Unnamed: 0,image,actual_image,level,Usage
0,37581_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,Private
1,1191_left,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,Private
2,38146_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,Private
3,36599_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",0,Private
4,1143_right,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",1,Private


In [33]:
X_test_set = np.stack(test_df['actual_image'].values, axis=0)  # .as_matrix() is deprecated in pandas

In [None]:
X_test_set.shape

In [None]:
y_test_set = np.stack(test_df['level'].values, axis=0)  # .as_matrix() is deprecated in pandas

In [None]:
y_test_set.shape

## Deep Learning on Kaggle Diabetic Retinopathy Image Set

In [None]:
from sklearn.metrics import roc_auc_score
from keras.callbacks import Callback

class custom_callback(Callback):
    """Report macro-averaged ROC AUC on the training and validation sets after each epoch."""
    def __init__(self, training_data, validation_data):
        super().__init__()
        self.x, self.y = training_data
        self.x_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs=None):
        # Avoid a mutable default argument; Keras normally passes a dict here.
        logs = logs if logs is not None else {}
        # Predicting on the full training set every epoch is slow,
        # but it makes the training and validation scores comparable.
        y_pred = self.model.predict(self.x)
        roc = roc_auc_score(self.y, y_pred)
        y_pred_val = self.model.predict(self.x_val)
        roc_val = roc_auc_score(self.y_val, y_pred_val)
        print('\rroc-auc: %s - roc-auc_val: %s' % (str(round(roc,4)),str(round(roc_val,4))),end=100*' '+'\n')
        logs["roc-auc"] = roc
        logs["roc-auc_val"] = roc_val

In [None]:
x_test = X_test_set

In [None]:
x_train, x_val, y_train, y_val = train_test_split(X_training_set, y_training_set, 
                                                    test_size=0.33, random_state=42)

In [None]:
y_test = y_test_set

In [None]:
len(np.unique(y_train)), len(np.unique(y_test)), len(np.unique(y_val))

In [None]:
# Convert class vectors to binary class matrices.
y_train = keras.utils.to_categorical(y_train, 5)
y_test = keras.utils.to_categorical(y_test, 5)
y_val = keras.utils.to_categorical(y_val, 5)
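
For readers unfamiliar with the conversion above, here is a plain-Python sketch of what one-hot encoding produces. `one_hot` is a hypothetical helper written for illustration and is not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label's index."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

encoded = one_hot([0, 2, 4], num_classes=5)
for row in encoded:
    print(row)
```

Passing the number of classes explicitly, as the cell above does with `5`, keeps the column count the same for every split even if one split happens to lack the rarest level.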

In [None]:
x_train.shape, x_val.shape, y_train.shape, y_val.shape, x_test.shape, y_test.shape

In [None]:
batch_size = 32
num_classes = 5
epochs = 10
data_augmentation = True
num_predictions = 10

In [None]:
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',
                 input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
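
To see how large the network above is, the spatial size can be traced through the three convolutional blocks by hand. The sketch below assumes the Keras defaults that the cell relies on (stride 1 and `'valid'` padding for `Conv2D` unless `padding='same'` is given, `'valid'` padding for `MaxPooling2D`). The helper names are hypothetical.

```python
def conv_out(size, kernel, padding):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool, stride):
    """Spatial size after max pooling with 'valid' padding."""
    return (size - pool) // stride + 1

size = 256  # input images are 256 x 256 x 3
for block in range(3):
    size = conv_out(size, 3, "same")    # first Conv2D of the block
    size = conv_out(size, 3, "valid")   # second Conv2D of the block
    size = pool_out(size, 3, 2)         # MaxPooling2D(pool_size=3, strides=2)
    print("after block", block + 1, "->", size)

last_filters = 64  # filters in the final Conv2D layer
flat = size * size * last_filters
dense_params = flat * 256 + 256  # weights plus biases of Dense(256)
print(flat, dense_params)
```

Under these assumptions the feature map shrinks from 256 to 126, 61, and finally 29 pixels per side, so `Flatten` yields 53,824 values and the first dense layer alone holds about 13.8 million parameters.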


In [None]:

# Train with the Adam optimizer; the RMSprop alternative below is left disabled.
#opt = keras.optimizers.rmsprop(lr=0.0001, decay=1e-6)

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_val, y_val),  # validate on the held-out split, not the test set
              shuffle=True)
else:
    print('Using real-time data augmentation.')
    # This will do preprocessing and realtime data augmentation:
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

    # Compute quantities required for feature-wise normalization
    # (std, mean, and principal components if ZCA whitening is applied).
    datagen.fit(x_train)

    # Fit the model on the batches generated by datagen.flow().
    cnn_callback = model.fit_generator(datagen.flow(x_train, y_train,
                                     batch_size=batch_size),
                        steps_per_epoch=10,
                        epochs=epochs,
                        callbacks = [custom_callback(
                                training_data=(x_train, y_train),
                                validation_data=(x_val, y_val))],
                        validation_data=(x_val, y_val),
                        use_multiprocessing=True,
                        workers=4)

score = model.evaluate(x_val, y_val, verbose=0)
    
print('Validation loss:', score[0])
print('Validation accuracy:', score[1])

pred_test = model.predict(x_test)
roc_score_test = roc_auc_score(y_test, pred_test)
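
The training call above uses `steps_per_epoch=10`, which is worth putting in proportion to the size of the training split. The arithmetic below is a sketch: it assumes the validation split is rounded up, as scikit-learn does for a fractional `test_size`, and the variable names are illustrative.

```python
import math

n_total = 28530          # rows in X_training_set
test_size = 0.33         # fraction held out by train_test_split
batch_size = 32

# Assumption: the held-out count is rounded up for a fractional test_size.
n_val = math.ceil(test_size * n_total)
n_train = n_total - n_val

steps_used = 10
images_per_epoch = steps_used * batch_size
full_pass_steps = math.ceil(n_train / batch_size)

print(n_train, n_val)
print(images_per_epoch, full_pass_steps)
```

With these numbers each epoch draws 320 augmented images out of roughly 19,000, whereas one full pass over the training split would take about 598 steps. That may be one reason the model has little chance to learn in 10 epochs.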


In [63]:
roc_score_test

0.5105357069256602
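
A test score of about 0.51 is close to what an uninformative model would obtain. When given one-hot targets, `roc_auc_score` computes one AUC per class and, by default, averages them without weighting. The sketch below reimplements that idea in plain Python on a hypothetical three-class example; `binary_auc` and `macro_auc` are illustrative helpers, not scikit-learn functions.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(y_onehot, y_prob):
    """Unweighted mean of the per-class (one-vs-rest) AUCs."""
    n_classes = len(y_onehot[0])
    per_class = [
        binary_auc([row[k] for row in y_onehot], [row[k] for row in y_prob])
        for k in range(n_classes)
    ]
    return sum(per_class) / n_classes

# Hypothetical example: a model that outputs the same probabilities for every image.
y_onehot = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
constant = [[0.7, 0.2, 0.1]] * 4
print(macro_auc(y_onehot, constant))
```

A model that always predicts the majority class falls into this constant-output case, which is why accuracy can look respectable on an imbalanced dataset while the AUC stays near 0.5.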