<a href="https://colab.research.google.com/github/ajaythakur3369/CodeClause-Internship/blob/main/Blindness_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name - Blindness Detection**
## **Developed By - Ajay Thakur (ajaythakur3369@gmail.com)**
## **Branch Name - Electronics and Communication Engineering**
## **Institute Name - Indian Institute of Information Technology Kota**
## **Submitted To - CodeClause**
## **Project Link (GitHub) - [Click here](https://github.com/ajaythakur3369/CodeClause-Internship)**




## **Importing necessary libraries**

In [None]:
import numpy as np
import pandas as pd
from random import randrange
import os
import matplotlib.pyplot as plt
from PIL import Image
import seaborn as sns

In [None]:
# Import only the transforms used in this notebook. The IAA* transforms from the
# original import list may be unavailable in newer Albumentations releases.
from albumentations import CLAHE, GaussNoise, RandomBrightnessContrast, PadIfNeeded

## **Introduction**

##### Diabetic retinopathy is the leading cause of blindness among working-age adults, and millions of people suffer from it. It is an eye condition that people with diabetes can develop when high blood sugar levels damage blood vessels in the retina. These vessels can swell and leak, or they can close, obstructing blood flow. In some cases, abnormal new blood vessels grow on the retina. These changes can ultimately lead to blindness.

## **Stages of Diabetic Eye Disease**

##### **NPDR (non-proliferative diabetic retinopathy):** With NPDR, tiny blood vessels leak, causing the retina to swell. When the macula swells, it is called macular edema, which is the most common reason why people with diabetes lose their vision. Also, with NPDR, blood vessels in the retina can close off, a condition known as macular ischemia. When this happens, blood cannot reach the macula. Sometimes, tiny particles called exudates can form in the retina, which can also affect vision.



##### **PDR (proliferative diabetic retinopathy):** PDR is the more advanced stage of diabetic eye disease. It occurs when the retina begins to grow new blood vessels, a condition known as neovascularization. These fragile new vessels often bleed into the vitreous. If they bleed only a little, you might see a few dark floaters. If they bleed a lot, it might block all vision. Moreover, these new blood vessels can form scar tissue, which can cause problems with the macula or lead to a detached retina. PDR is very serious and can affect both your central and peripheral (side) vision.

## **Load Data**

Load training and testing CSV files containing image filenames and corresponding labels (only for the training set):

In [None]:
# Mount Google Drive so the notebook can read the data files
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
# Load CSV files with labels as Pandas DataFrames
train = pd.read_csv('/content/drive/MyDrive/Colab_Notebook/Internship_Name/CodeClause/File_Name/Training_dataset.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab_Notebook/Internship_Name/CodeClause/File_Name/Testing_dataset.csv')

In [None]:
# Find out the number of images in the test and train sets
print('The number of images in the training set is {}'.format(len(train)))
print('The number of images in the test set is {}'.format(len(test)))

In [None]:
# Plot a pie chart
labels = 'Train', 'Test'
sizes = len(train), len(test)

fig1, ax1 = plt.subplots(figsize = (5, 5))
ax1.pie(sizes, labels = labels, autopct = '%1.1f%%', shadow = True, startangle = 90)
ax1.axis('equal')

plt.title('Train and Test sets')
plt.show()

Neither the training nor the testing dataset is very large.

The training dataset is about three times larger than the testing dataset.

## **Analyze Train Set Labels**

Plot a pie chart showing the percentage of images for each diabetic retinopathy severity condition:

In [None]:
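# Plot a pie chart
# Build the labels from the value_counts() index so each name matches its slice
label_names = {0: 'No DR', 1: 'Mild', 2: 'Moderate', 3: 'Severe', 4: 'Proliferative DR'}
sizes = train.diagnosis.value_counts()
labels = [label_names[code] for code in sizes.index]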

fig1, ax1 = plt.subplots(figsize = (10, 7))
ax1.pie(sizes, labels = labels, autopct = '%1.1f%%', shadow = True, startangle = 90)
ax1.axis('equal')

plt.title('Diabetic retinopathy condition labels')
plt.show()

We can see that the training dataset is **very imbalanced**. There are ten times more images with no DR than images with the severe DR condition.

**Data augmentation** will be needed to compensate for this imbalance when training a classifier.
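
The degree of imbalance can also be put into numbers. The sketch below is only an illustration: the labels are synthetic placeholders (not the real dataset counts), and `class_weights` is a hypothetical helper that applies the common inverse-frequency formula `n_samples / (n_classes * class_count)`.

```python
import pandas as pd

# Synthetic labels standing in for train.diagnosis (NOT the real counts)
diagnosis = pd.Series([0] * 50 + [1] * 10 + [2] * 25 + [3] * 5 + [4] * 10)

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = labels.value_counts().sort_index()
    return (len(labels) / (len(counts) * counts)).to_dict()

counts = diagnosis.value_counts().sort_index()
imbalance_ratio = counts.max() / counts.min()  # largest class / smallest class
weights = class_weights(diagnosis)

print(counts.to_dict())
print('Largest / smallest class ratio:', imbalance_ratio)
print(weights)
```

In the notebook, the same helper could be called on `train.diagnosis`; rare classes receive larger weights.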

## **Visualize Sample Images**

Let's plot fundus photographs from the training set for each condition:

In [None]:
# Define paths to train and test images
TRAIN_IMG_PATH = "/content/drive/MyDrive/Colab_Notebook/Internship_Name/CodeClause/Folder_Name/blindness_train_images/"
TEST_IMG_PATH = "/content/drive/MyDrive/Colab_Notebook/Internship_Name/CodeClause/Folder_Name/blindness_test_images/"

# Function to plot a grid of images
def view_fundus_images(images, title = ''):

    """
    Function to plot grid with several examples of fundus images.
    INPUT:
        images - array with image filenames (id codes, without the extension); the first 10 are plotted
        title - plot title
    OUTPUT: None
    """

    width = 5
    height = 2
    fig, axs = plt.subplots(height, width, figsize = (15, 5))

    for im in range(0, height * width):

        # Open image
        image = Image.open(os.path.join(TRAIN_IMG_PATH, images[im] + '.png'))
        i = im // width
        j = im % width

        # Plot the data
        axs[i, j].imshow(image)
        axs[i, j].axis('off')

    # Set suptitle
    plt.suptitle(title)
    plt.show()

In [None]:
view_fundus_images(train[train['diagnosis'] == 0][:10].id_code.values, title = 'Images without DR')

In [None]:
view_fundus_images(train[train['diagnosis'] == 1][:10].id_code.values, title = 'Images with Mild condition')

In [None]:
view_fundus_images(train[train['diagnosis'] == 2][:10].id_code.values, title = 'Images with Moderate condition')

In [None]:
view_fundus_images(train[train['diagnosis'] == 3][:10].id_code.values, title = 'Images with Severe condition')

In [None]:
view_fundus_images(train[train['diagnosis'] == 4][:10].id_code.values, title = 'Images with Proliferative DR')

Just glancing through various images, we can observe:

* Images are of different sizes, and the height and width ratios vary. Therefore, image cropping or padding is necessary.
* Pictures are taken at various scales, indicating the need for random cropping augmentation.
* Lighting and colors vary greatly, suggesting the need for augmentations that adjust brightness and color scales.
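
One possible way to handle the varying black borders noted above is to crop each image to the bounding box of its bright region. This is a minimal sketch on a synthetic array, not the preprocessing used in this project; `crop_to_content` and its `threshold` value are assumptions made for illustration.

```python
import numpy as np

def crop_to_content(img, threshold=10):
    """Crop away near-black borders around the bright region of an image."""
    gray = img.mean(axis=2) if img.ndim == 3 else img
    mask = gray > threshold
    if not mask.any():
        return img  # nothing brighter than the threshold; leave unchanged
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Synthetic stand-in for a fundus photo: a bright block on a black canvas
canvas = np.zeros((60, 100, 3), dtype=np.uint8)
canvas[10:50, 30:70] = 200

cropped = crop_to_content(canvas)
print(canvas.shape, '->', cropped.shape)
```

A real image opened with PIL would first be converted with `np.array(image)`.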

## **Analyze Image Sizes**

Plot histograms of the image sizes (the analysis code is adapted from a public kernel):

In [None]:
def get_image_sizes(df, train = True):

    '''
    Function to get sizes of images from test and train sets
    INPUT:
        df - dataframe containing image filenames
        train - indicates whether we are getting sizes of images from train or test set
    OUTPUT:
        widths, heights - lists with the width and height of every image
        max_im, min_im - images with the largest and the smallest width
    '''

    if train:
        path = TRAIN_IMG_PATH
    else:
        path = TEST_IMG_PATH

    widths = []
    heights = []

    max_im, min_im = None, None
    max_width, min_width = -1, float('inf')

    # Iterate over the filenames directly so the result does not depend on the index
    for id_code in df.id_code:
        image = Image.open(os.path.join(path, id_code + '.png'))
        width, height = image.size

        # Keep a running maximum and minimum instead of rescanning the lists
        if width > max_width:
            max_width, max_im = width, image

        if width < min_width:
            min_width, min_im = width, image

        widths.append(width)
        heights.append(height)

    return widths, heights, max_im, min_im

In [None]:
# Get sizes of images from test and train sets
train_widths, train_heights, max_train, min_train = get_image_sizes(train, train = True)
test_widths, test_heights, max_test, min_test = get_image_sizes(test, train = False)

In [None]:
print('The maximum width for the training set is {}'.format(max(train_widths)))
print('The minimum width for the training set is {}'.format(min(train_widths)))
print('The maximum height for the training set is {}'.format(max(train_heights)))
print('The minimum height for the training set is {}'.format(min(train_heights)))

In [None]:
print('The maximum width for the test set is {}'.format(max(test_widths)))
print('The minimum width for the test set is {}'.format(min(test_widths)))
print('The maximum height for the test set is {}'.format(max(test_heights)))
print('The minimum height for the test set is {}'.format(min(test_heights)))

In [None]:
# Plot histograms and KDE plots for images from the training set
plt.figure(figsize = (14, 6))
plt.subplot(121)
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(train_widths, label = 'Train Width')
sns.histplot(train_heights, label = 'Train Height')
plt.legend()
plt.title('Training Image Dimension Histogram', fontsize = 15)

plt.subplot(122)
sns.kdeplot(train_widths, label = 'Train Width')
sns.kdeplot(train_heights, label = 'Train Height')
plt.legend()
plt.title('Train Image Dimension KDE Plot', fontsize = 15)

plt.tight_layout()
plt.show()

In [None]:
# Plot histograms and KDE plots for images from the test set
plt.figure(figsize = (14, 6))
plt.subplot(121)
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(test_widths, label = 'Test Width')
sns.histplot(test_heights, label = 'Test Height')
plt.legend()
plt.title('Test Image Dimension Histogram', fontsize = 15)

plt.subplot(122)
sns.kdeplot(test_widths, label = 'Test Width')
sns.kdeplot(test_heights, label = 'Test Height')
plt.legend()
plt.title('Test Image Dimension KDE Plot', fontsize = 15)

plt.tight_layout()
plt.show()

The image-size distributions of the training and test sets differ markedly.
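
To back the visual comparison with numbers, the width and height lists can be reduced to a few summary statistics. The sketch below uses made-up sizes purely for illustration; `summarize_sizes` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np

def summarize_sizes(widths, heights):
    """Return median width, median height and median aspect ratio (width / height)."""
    widths = np.asarray(widths, dtype=float)
    heights = np.asarray(heights, dtype=float)
    return {
        'median_width': float(np.median(widths)),
        'median_height': float(np.median(heights)),
        'median_aspect': float(np.median(widths / heights)),
    }

# Synthetic sizes for illustration only (not measured from the dataset)
demo_train = summarize_sizes([2000, 3000, 4000], [1500, 2000, 3000])
demo_test = summarize_sizes([640, 640, 1024], [480, 480, 768])
print(demo_train)
print(demo_test)
```

In the notebook, the helper could be called as `summarize_sizes(train_widths, train_heights)` and `summarize_sizes(test_widths, test_heights)`.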

## **Plot largest and smallest images**

Let's look at the largest and the smallest images from both sets.

Image with the largest width from the training set:

In [None]:
plt.axis('off')

# Plot the data
plt.imshow(max_train)

Image with the smallest width from the training set:

In [None]:
plt.axis('off')

# Plot the data
plt.imshow(min_train)

Image with the largest width from the test set:

In [None]:
plt.axis('off')

# Plot the data
plt.imshow(max_test)

Image with the smallest width from the test set:

In [None]:
plt.axis('off')

# Plot the data
plt.imshow(min_test)

## **Playing with Augmentations**

Finally, I would like to experiment with some augmentations. This will help to get an impression of the augmented dataset.

In [None]:
# Define the dictionary for labels
diagnosis_dict = {
    0: 'No DR',
    1: 'Mild',
    2: 'Moderate',
    3: 'Severe',
    4: 'Proliferative DR'
}

In [None]:
# Function to plot a grid of images
def view_fundus_images_labels(train, rand_indices, aug = None, title = ''):

    """
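    Function to plot grid with several examples of fundus images.
    INPUT:
        train - dataframe with image filenames (id_code) and condition labels (diagnosis)
        rand_indices - positional indices of the images to plot (at most 10)
        aug - optional Albumentations transform applied to each image before plotting
        title - plot title

    OUTPUT: None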
    """

    width = 5
    height = 2
    counter = 0
    fig, axs = plt.subplots(height, width, figsize = (15, 5))

    for im in rand_indices:

        # Open image
        image = Image.open(os.path.join(TRAIN_IMG_PATH, train.iloc[im].id_code + '.png'))

        if aug is not None:
            image = aug(image = np.array(image))['image']

        i = counter // width
        j = counter % width

        # Plot the data
        axs[i, j].imshow(image)
        axs[i, j].axis('off')

        # Look up the label of the same row that supplied the image
        diagnosis = train.iloc[im].diagnosis

        axs[i, j].set_title(diagnosis_dict[diagnosis])
        counter += 1

    # Set supertitle
    plt.suptitle(title)
    plt.show()

Plot random images from the training set.

In [None]:
# Get some random image indices from the training set
rand_indices = [randrange(len(train)) for _ in range(10)]
rand_indices
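
Note that `randrange` can return the same index more than once and yields different images on every run. A hedged alternative, assuming reproducible and distinct samples are wanted, is sketched below; `sample_indices` and the seed value are hypothetical choices.

```python
import random

def sample_indices(n_items, k=10, seed=42):
    """Pick k distinct indices in [0, n_items) reproducibly."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(range(n_items), k)

indices = sample_indices(1000)
print(indices)
```

In the notebook this would be called as `sample_indices(len(train))`.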

In [None]:
# Plot original images
view_fundus_images_labels(train, rand_indices, title = 'Original images')

Now let's experiment with some Albumentations transforms:

Augment the images with CLAHE:

In [None]:
aug = CLAHE(p = 1)
view_fundus_images_labels(train, rand_indices, aug, title = 'CLAHE')

Try adding some Gaussian noise:

In [None]:
aug = GaussNoise(p = 1)
view_fundus_images_labels(train, rand_indices, aug, title = 'GaussNoise')
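
Conceptually, a Gaussian-noise augmentation adds zero-mean random values to every pixel. The following NumPy sketch illustrates the idea on a synthetic image; it is a simplified stand-in and not the exact implementation behind `GaussNoise`, and the `std` and `seed` values are assumptions.

```python
import numpy as np

def add_gaussian_noise(img, std=20.0, seed=0):
    """Add zero-mean Gaussian noise to a uint8 image and clip to [0, 255]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, std, size=img.shape)
    noisy = img.astype(np.float32) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Flat gray synthetic image used only for demonstration
gray = np.full((32, 32, 3), 128, dtype=np.uint8)
noisy = add_gaussian_noise(gray)
print(gray.std(), noisy.std())
```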

Playing with brightness and contrast:

In [None]:
aug = RandomBrightnessContrast(brightness_limit = 1, contrast_limit = 1, p = 1)
view_fundus_images_labels(train, rand_indices, aug, title = 'RandomBrightnessContrast')

Random brightness and contrast changes alter the images considerably. Since lighting already varies widely across the dataset, this transform should certainly be used for data augmentation.
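
To make the effect of such a transform concrete, the sketch below applies a simple linear model: pixel values are scaled by a contrast factor and shifted by a brightness offset. This is an assumed, simplified formulation for illustration and not necessarily how `RandomBrightnessContrast` is implemented; `adjust_brightness_contrast`, `alpha`, and `beta` are hypothetical names.

```python
import numpy as np

def adjust_brightness_contrast(img, alpha=1.0, beta=0.0):
    """Scale pixel values by alpha (contrast) and shift by beta * 255 (brightness)."""
    out = img.astype(np.float32) * alpha + beta * 255.0
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

pixel_row = np.array([[0, 64, 128, 255]], dtype=np.uint8)
more_contrast = adjust_brightness_contrast(pixel_row, alpha=1.5)
brighter = adjust_brightness_contrast(pixel_row, beta=1.0)
print(more_contrast)
print(brighter)
```

A random augmentation would draw `alpha` and `beta` from a range on every call; large ranges can push many pixels to pure black or white.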

Padding images:

In [None]:
aug = PadIfNeeded(min_height = 1024, min_width = 1024, p = 1)
view_fundus_images_labels(train, rand_indices, aug, title = 'Padding Images')
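
The idea behind padding to a minimum size can be sketched in plain NumPy. The version below pads symmetrically with a constant (black) value, which is an assumption of this sketch; the border mode that `PadIfNeeded` uses by default may differ between library versions. `pad_to_min_size` is a hypothetical helper.

```python
import numpy as np

def pad_to_min_size(img, min_height, min_width, fill=0):
    """Pad an image symmetrically with a constant value up to a minimum size."""
    h, w = img.shape[:2]
    pad_h = max(min_height - h, 0)
    pad_w = max(min_width - w, 0)
    top, left = pad_h // 2, pad_w // 2
    # Pad only the two spatial axes; any channel axis is left untouched
    pads = [(top, pad_h - top), (left, pad_w - left)] + [(0, 0)] * (img.ndim - 2)
    return np.pad(img, pads, mode='constant', constant_values=fill)

small = np.full((60, 80, 3), 255, dtype=np.uint8)
padded = pad_to_min_size(small, 100, 100)
large = pad_to_min_size(np.zeros((200, 300, 3), dtype=np.uint8), 100, 100)
print(small.shape, '->', padded.shape)
print(large.shape)
```

Images that already exceed the minimum size are returned unchanged, so padding alone does not bring all images to a common size.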

## **Conclusion**

After the EDA, we can conclude the following:

* The dataset is heavily imbalanced, necessitating data augmentation.
* The distribution of image sizes differs between the train and test sets, which may affect classification results.
* Additionally, in this EDA, we explored augmented images to gain an impression of what the augmented dataset will look like.