# Kaggle Diabetic Retinopathy Detection Analysis

This notebook does basic analysis on the training images.

Link to competition: https://www.kaggle.com/c/aptos2019-blindness-detection

Forked from "Orig_TFDataset_Analysis_V01"

Sample pandas examples:

https://github.com/rasbt/pattern_classification/blob/master/data_viz/matplotlib_viz_gallery.ipynb


### Processing for using Google Drive, Kaggle and normal includes



In [None]:

#"""
# Google Colab specific setup
from google.colab import drive
drive.mount('/content/drive')

import os
!ls "/content/drive/My Drive"

USING_COLLAB = True
# Force to use 2.x version of Tensorflow
%tensorflow_version 2.x
#"""

In [None]:
# Upload your "kaggle.json" file that you created from your Kaggle Account tab
# If you downloaded it, it would be in your "Downloads" directory

from google.colab import files
files.upload()

In [None]:
# To start, install kaggle libs
#!pip install -q kaggle

# Workaround to install the newest version
# https://stackoverflow.com/questions/58643979/google-colaboratory-use-kaggle-server-version-1-5-6-client-version-1-5-4-fai
!pip install kaggle --upgrade --force-reinstall --no-deps

In [None]:
# On your VM, create kaggle directory and modify access rights

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
#!kaggle competitions list
# Takes about 4 mins to download
!kaggle competitions download -c aptos2019-blindness-detection

In [None]:
# Takes about 5 mins to unzip
!unzip -uq aptos2019-blindness-detection.zip 

In [None]:
!ls

In [None]:
# Cleanup to add some space....
!rm -r test_images
!rm aptos2019-blindness-detection.zip 

In [None]:
# Setup sys.path to find MachineLearning lib directory

# Check if "USING_COLLAB" is defined, if yes, then we are using Colab, otherwise set to False
try: USING_COLLAB
except NameError: USING_COLLAB = False

%load_ext autoreload
%autoreload 2

# set path env var
import sys
if "MachineLearning" in sys.path[0]:
    pass
else:
    print(sys.path)
    if USING_COLLAB:
        sys.path.insert(0, '/content/drive/My Drive/GitHub/MachineLearning/lib')  ###### CHANGE FOR SPECIFIC ENVIRONMENT
    else:
        sys.path.insert(0, '/Users/john/Documents/GitHub/MachineLearning/lib')  ###### CHANGE FOR SPECIFIC ENVIRONMENT
    
    print(sys.path)
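
The path setup above only checks the first entry of `sys.path`. A minimal sketch of an idempotent alternative is shown here; `add_lib_path` and the demo paths are hypothetical names used for illustration and are not part of the MachineLearning library.

```python
import sys

def add_lib_path(lib_path, search_path=sys.path):
    """Insert lib_path at the front of search_path unless it is already listed."""
    if lib_path not in search_path:
        search_path.insert(0, lib_path)
        return True
    return False

# Hypothetical paths; substitute the lib directory for your environment.
demo_path = ["/usr/lib/python3"]
first_call = add_lib_path("/content/lib", demo_path)   # inserts the path
second_call = add_lib_path("/content/lib", demo_path)  # no-op, already present
print(first_call, second_call, demo_path)
```

Re-running the cell then leaves the search path unchanged instead of stacking duplicate entries.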

In [None]:
# Normal includes...

from __future__ import absolute_import, division, print_function, unicode_literals

import os, sys, random, warnings, time, copy, csv
import numpy as np 

import IPython.display as display
from PIL import Image
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import tensorflow as tf
print(tf.__version__)

# This allows the runtime to decide how best to optimize CPU/GPU usage
AUTOTUNE = tf.data.experimental.AUTOTUNE

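from tqdm import tqdm  # progress bars; imported explicitly in case TrainingUtils does not export it
from TrainingUtils import *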

#warnings.filterwarnings("ignore", category=DeprecationWarning)
#warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", "(Possibly )?corrupt EXIF data", UserWarning)

## General Setup

- Create a dictionary wrapped by a class for global values.  This is how I manage global vars in my notebooks.



In [None]:
# Set root directory path to data
if USING_COLLAB:
    #ROOT_PATH = "/content/drive/My Drive/ImageData/KaggleDiabeticRetinopathy/Data"  ###### CHANGE FOR SPECIFIC ENVIRONMENT
    ROOT_PATH = ""  ###### CHANGE FOR SPECIFIC ENVIRONMENT
else:
    ROOT_PATH = ""
        
# Establish global dictionary
parms = GlobalParms(ROOT_PATH=ROOT_PATH,
                    TRAIN_DIR="train_images", 
                    NUM_CLASSES=5,
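                    # NOTE: names are listed in class-frequency order, not diagnosis-code order
                    # (codes: 0=No DR, 1=Mild, 2=Moderate, 3=Severe, 4=Proliferative DR).
                    CLASS_NAMES=['Normal', 'Moderate', 'Mild', 'Proliferative', 'Severe'],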
                    IMAGE_ROWS=224,
                    IMAGE_COLS=224,
                    IMAGE_CHANNELS=3,
                    BATCH_SIZE=1,  # must be one if you want to see different image sizes
                    IMAGE_EXT=".png")

parms.print_contents()
print("Classes: {}  Labels: {}  {}".format(parms.NUM_CLASSES, len(parms.CLASS_NAMES), parms.CLASS_NAMES) )
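
For readers without access to `TrainingUtils`, the "dictionary wrapped by a class" idea can be sketched roughly as follows. This is not the actual `GlobalParms` implementation: `ParmsSketch`, its defaults, and the way `TRAIN_PATH` is derived from `ROOT_PATH` and `TRAIN_DIR` are all assumptions made for illustration.

```python
import os

class ParmsSketch:
    """Hypothetical stand-in for GlobalParms: a dict exposed through attributes."""

    def __init__(self, **kwargs):
        # Defaults are assumptions made for this sketch only.
        self._values = {"ROOT_PATH": "", "TRAIN_DIR": "", "BATCH_SIZE": 32}
        self._values.update(kwargs)
        # Assumed derived value: TRAIN_PATH = ROOT_PATH joined with TRAIN_DIR.
        self._values["TRAIN_PATH"] = os.path.join(
            self._values["ROOT_PATH"], self._values["TRAIN_DIR"])

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self.__dict__["_values"][name]
        except KeyError:
            raise AttributeError(name)

    def print_contents(self):
        # Print every stored value, one per line, sorted by key.
        for key in sorted(self._values):
            print("{:12} {}".format(key, self._values[key]))

demo = ParmsSketch(ROOT_PATH="data", TRAIN_DIR="train_images", BATCH_SIZE=1)
demo.print_contents()
```

Keeping the values in one object means later cells can read `demo.BATCH_SIZE` or `demo.TRAIN_PATH` without scattering module-level globals through the notebook.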

In [None]:
# Simple helper method to display batches of images with labels....        
def show_batch(image_batch, label_batch, number_to_show=25, r=5, c=5, print_shape=False):
    show_number = min(number_to_show, parms.BATCH_SIZE)

    if show_number < 8: #if small number, then change row, col and figure size
        if parms.IMAGE_COLS > 64 or parms.IMAGE_ROWS > 64:
            plt.figure(figsize=(25,25)) 
        else:
            plt.figure(figsize=(10,10))  
        r = 4
        c = 2 
    else:
        plt.figure(figsize=(10,10))  

    for n in range(show_number):
        if print_shape:
            print("Image shape: {}  Max: {}  Min: {}".format(image_batch[n].shape, np.max(image_batch[n]), np.min(image_batch[n])))

        ax = plt.subplot(r,c,n+1)
        plt.imshow(tf.keras.preprocessing.image.array_to_img(image_batch[n]))
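        label = np.asarray(label_batch[n])
        # One-hot labels need argmax; scalar diagnosis codes (as in this notebook) are used directly.
        # CLASS_NAMES is not in diagnosis-code order, so show the code itself.
        code = int(np.argmax(label)) if label.ndim > 0 else int(label)
        plt.title("Diagnosis: {}".format(code))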
        plt.axis('off')

### Load csv file

- Load list of filenames and diagnosis
- Perform initial analysis on dataframe


In [None]:
train_df = pd.read_csv(os.path.join(parms.ROOT_PATH, "train.csv"))
train_df["file_path"] = parms.TRAIN_PATH + "/" + train_df["id_code"] + parms.IMAGE_EXT
images_list_len = len(train_df)
print("Training set is {}".format(len(train_df)))
train_df.head()


In [None]:
train_df['diagnosis'].hist()
train_df['diagnosis'].value_counts()

In [None]:
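# Plot diagnosis distribution
# value_counts() sorts by frequency, so label each slice from its diagnosis code
# rather than relying on the order of a separate list.
diagnosis_names = {0: 'Normal', 1: 'Mild', 2: 'Moderate', 3: 'Severe', 4: 'Proliferative'}
sizes = train_df.diagnosis.value_counts()

fig1, ax1 = plt.subplots(figsize=(10,7))
ax1.pie(sizes, labels=[diagnosis_names[code] for code in sizes.index], autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis("equal")

plt.title("Diabetic retinopathy labels")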
plt.show()

### Create dataset and normal mappings

Pipeline Flow:

create dataset -> map "process_path" -> repeat forever -> batch

The mappings open and read an image.  These next cells should be changed based on your specific needs.
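
The flow above can be mimicked with plain Python iterators to show why a `take(steps)` call is needed once the data repeats forever. This is only an analogy, not the `tf.data` API: `load` and `in_batches` are hypothetical helpers, and `itertools.cycle` replays cached items whereas `tf.data` re-runs the mapping on each pass.

```python
from itertools import cycle, islice

def load(path):
    # Stand-in for process_path: return a (fake image, label) pair.
    return ("pixels of " + path, len(path) % 5)

def in_batches(items, batch_size):
    # Group consecutive items from an iterator into lists of batch_size.
    while True:
        batch = list(islice(items, batch_size))
        if not batch:
            return
        yield batch

paths = ["a.png", "bb.png", "ccc.png"]       # create dataset
mapped = map(load, paths)                    # map "process_path"
repeated = cycle(mapped)                     # repeat forever
batches = in_batches(repeated, 2)            # batch

# Without a limit this would never end, so take a fixed number of batches.
first_three = list(islice(batches, 3))
for batch in first_three:
    print([label for _, label in batch])
```

Because the repeat comes before the batch step, a batch can straddle the end of one pass and the start of the next, as the second batch does here.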


In [None]:
# Decode the image, convert to float, normalize by 255 and resize
def decode_img(image: tf.Tensor) -> tf.Tensor:

    # convert the compressed string to a 3D uint8 tensor
    image = tf.image.decode_png(image, channels=parms.IMAGE_CHANNELS)
    # Use `convert_image_dtype` to convert to floats in the [0,1] range.
    image = tf.image.convert_image_dtype(image, parms.IMAGE_DTYPE)

    # uncomment to resize the image to the desired size.
    #image = tf.image.resize(image, [parms.IMAGE_ROWS, parms.IMAGE_COLS])

    #gamma = tf.math.reduce_mean(image) + 0.5
    #image = tf.image.adjust_gamma(image, gamma=gamma)

    return image

# Method mapped over the dataset to load, resize and apply any augmentations
def process_path(file_path: tf.Tensor, label: tf.Tensor) -> tuple:
    # load the raw data from the file as a string
    image = tf.io.read_file(file_path)
    image = decode_img(image)

    return image, label
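
If the commented-out resize in `decode_img` is enabled, forcing every image to `[IMAGE_ROWS, IMAGE_COLS]` stretches non-square images. A minimal sketch of computing a target size that keeps the aspect ratio is shown here; `fit_within` is a hypothetical helper, and the example sizes are the largest and one of the smallest (rows, cols) pairs reported at the end of this notebook. `tf.image.resize` also accepts a `preserve_aspect_ratio` argument that may serve the same purpose.

```python
def fit_within(rows, cols, max_rows=224, max_cols=224):
    """Return (new_rows, new_cols) scaled to fit the box, keeping aspect ratio."""
    scale = min(max_rows / rows, max_cols / cols)
    return max(1, round(rows * scale)), max(1, round(cols * scale))

largest = fit_within(2848, 4288)
smallest = fit_within(614, 819)
print(largest)   # (149, 224)
print(smallest)  # (168, 224)
```

The resulting images would then need padding or cropping to reach a common shape before they can be batched with a batch size above one.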

### Create dataset from list of images and apply mappings

In [None]:
# Create Dataset from list of images
full_dataset = tf.data.Dataset.from_tensor_slices(
                            (train_df["file_path"].values, 
                             tf.cast(train_df['diagnosis'].values, tf.int32)))

# Verify image paths were loaded and save one path for later in "some_image"
for f, l in full_dataset.take(2):
    some_image = f.numpy().decode("utf-8")
    print(f.numpy(), l.numpy())
    
print("Some Image: ", some_image)

In [None]:
# map training images to processing, includes any augmentation
full_dataset = full_dataset.map(process_path, num_parallel_calls=AUTOTUNE)

# Verify the mapping worked
for image, label in full_dataset.take(1):
    print("Image shape: {}  Max: {}  Min: {}".format(image.numpy().shape, np.max(image.numpy()), np.min(image.numpy())))
    print("Label: ", label.numpy())

# Repeat forever
full_dataset = full_dataset.repeat()

# set the batch size
full_dataset = full_dataset.batch(parms.BATCH_SIZE)


In [None]:
# Show the images, execute this cell multiple times to see the images

steps = 1
for image_batch, label_batch in tqdm(full_dataset.take(steps)):
    show_batch(image_batch.numpy(), label_batch.numpy())

### Collect image information

This loops over every image and collects per-image information into a list, which is then turned into a pandas DataFrame for reporting.  The DataFrame can also be saved for later analysis.

This is also where you can customize what information is collected.

The images are not resized here.  You can change that so each image matches exactly what will be used for training, but I've found that the raw image information is more helpful than information from resized images.


In [None]:
# Collect various information about each image
def dataset_analysis(dataset, steps, test=False):
    # In test mode, only look at a few batches
    if test:
        steps = 4

    image_info = []

    for image_batch, label_batch in tqdm(dataset.take(steps)):
        #show_batch(image_batch.numpy(), label_batch.numpy())

        for j in range(parms.BATCH_SIZE):
            image = image_batch[j].numpy()
            label = label_batch[j].numpy()
            #label = np.argmax(label)
            r = image.shape[0]
            c = image.shape[1]
            d = 0
            mean0=0
            mean1=0
            mean2=0
            if parms.IMAGE_CHANNELS == 3:
                d = image.shape[2]
                mean0 = np.mean(image[:,:,0])
                mean1 = np.mean(image[:,:,1])
                mean2 = np.mean(image[:,:,2])
            image_info.append([label, r, c, d, np.mean(image), np.std(image), mean0, mean1, mean2])

            if test:
                print(image_info[-1])
                
    return image_info

In [None]:
# Build image_info list
# Round up so a final partial batch is still counted; take() needs an integer
steps = int(np.ceil(len(train_df) / parms.BATCH_SIZE))
image_info = dataset_analysis(full_dataset, steps=steps, test=False)
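
The step count is meant to cover every training image once. A small worked example of the arithmetic follows; `steps_per_pass` is a hypothetical helper, and 3662 is the image total implied by the counts recorded later in this notebook (974 + 2688).

```python
import math

def steps_per_pass(num_images, batch_size):
    """Number of batches needed to see every image at least once."""
    return math.ceil(num_images / batch_size)

steps_bs1 = steps_per_pass(3662, 1)
steps_bs32 = steps_per_pass(3662, 32)
floor_bs32 = 3662 // 32   # floor division would skip the last partial batch

print(steps_bs1)    # 3662
print(steps_bs32)   # 115
print(floor_bs32)   # 114
```

With a batch size of 1 the two approaches agree. With larger batches, rounding up on a dataset that repeats before batching means the last batch wraps around and a few images are counted twice.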

In [None]:
# Build pandas dataframe
image_info_df = pd.DataFrame(image_info, columns =['label', 'row','col', 'dim', 'mean', 'std', "chmean0", "chmean1", "chmean2"])
print(image_info_df.describe())
image_info_df.head()

In [None]:
#https://jamesrledoux.com/code/group-by-aggregate-pandas
image_info_df.groupby('label').agg({'mean': ['count', 'mean', 'min', 'max'], 'std': ['mean', 'min', 'max'], 'row': ['mean', 'min', 'max'],'col': ['mean', 'min', 'max'], 'chmean0':['mean'],'chmean1':['mean'],'chmean2':['mean'] })


In [None]:
image_info_df.agg({'mean': ['mean', 'min', 'max'], 'std': ['mean', 'min', 'max'], 'row': ['mean', 'min', 'max'],'col': ['mean', 'min', 'max'] })



In [None]:
image_mean = image_info_df["mean"]
print("Mean: ", np.mean(image_mean), "  STD: ", np.std(image_mean))

In [None]:
image_info_df["label"].value_counts().plot.bar()

In [None]:
image_info_df["label"].value_counts().plot.pie()

In [None]:
image_info_df.hist(column='mean')

In [None]:
image_info_df.plot.scatter(x='row', y='col', color='Blue', label='Row Col')

In [None]:
# Plot Histograms and KDE plots for images from the training set
# Source: https://www.kaggle.com/chewzy/eda-weird-images-with-new-updates
import seaborn as sns

plt.figure(figsize=(14,6))
plt.subplot(121)
sns.distplot(image_info_df["col"], kde=False, label='Train Col')
sns.distplot(image_info_df["row"], kde=False, label='Train Row')
plt.legend()
plt.title('Training Dimension Histogram', fontsize=15)

plt.subplot(122)
sns.kdeplot(image_info_df["col"], label='Train Col')
sns.kdeplot(image_info_df["row"], label='Train Row')
plt.legend()
plt.title('Training Dimension KDE Plot', fontsize=15)

plt.tight_layout()
plt.show()

In [None]:
# Save results
# If saved on VM, need to copy to storage
result_path = os.path.join(parms.ROOT_PATH, "image-info.pkl")
image_info_df.to_pickle(result_path)  

In [None]:
image_info_df["c-r"] = image_info_df["col"] - image_info_df["row"]

In [None]:
image_info_df.head()
#print(np.count_nonzero(a < 4))
c_r = image_info_df["c-r"].values.tolist()
c_r_np = np.array(c_r)
print("  == 0, ", np.count_nonzero(c_r_np == 0))
print("   < 0, ", np.count_nonzero(c_r_np < 0))
print("   > 0, ", np.count_nonzero(c_r_np > 0))
print(" > 500, ", np.count_nonzero(c_r_np > 500))
print("> 1000, ", np.count_nonzero(c_r_np > 1000))

# Recorded results:  == 0: 974,  < 0: none,  > 0: 2688,  > 500: 2340,  > 1000: 462

In [None]:
# open and read saved file
image_info_df = pd.read_pickle(result_path)
image_info_df.head()

### Final Thoughts.....
- The classes are very imbalanced; training will need augmentation to add images for the rare classes, or some other way to account for the imbalance.

- Most images are rectangular rather than square, so resizing (up or down) needs to preserve the features, e.g. by keeping the aspect ratio. Size per diagnosis as (rows, cols):

  - 0, Ave (1206, 1433), Min (614, 819), Max(2136, 3216)

  - 1, Ave (1784, 2471), Min (614, 819), Max(2848, 4288)

  - 2, Ave (1846, 2603), Min (358, 474), Max(2848, 4288)

  - 3, Ave (1840, 2590), Min (480, 640), Max(2848, 4288)

  - 4, Ave (1874, 2632), Min (358, 474), Max(2848, 4288)


- Average image size generally increases as the diagnosis gets worse. This may imply that the information encoding a diagnosis needs more features as the level increases, so shrinking the images could lose information and make it harder to tell level 4 from 3, or 3 from 2.  Images might need to be kept as large as possible, or pre-trained networks used.  (Open question: more network parameters for small image sizes, fewer for large image sizes?)

- Normal images are the smallest, so perhaps less information is needed to recognize a "good" eye. My guess is that normal eyes look similar or the same.

- The diagnosis is cumulative: level 2 is really 1 + 2, level 3 is really 1 + 2 + 3, level 4 is really 1 + 2 + 3 + 4.
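
One common way to account for the class imbalance noted in the first point is to weight each class inversely to its frequency. A minimal sketch, assuming the heuristic weight = n_samples / (n_classes * class_count); `balanced_class_weights` is a hypothetical helper and the labels are made-up toy values, not the competition's counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to how often it appears."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt)
            for cls, cnt in sorted(counts.items())}

# Toy labels for illustration only: class 0 dominates.
toy_labels = [0] * 6 + [1] * 2 + [2] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary of this shape could, for example, be passed as the `class_weight` argument of Keras `model.fit`, so rare classes contribute more to the loss.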

