### Data Loading and Image Path Addition

This cell loads three CSV files containing labels and metadata for retinal images, adds an 'IMG_DIR' column holding each image's file path, and concatenates the three tables into a single DataFrame.

#### Data Loading

- Three CSV files containing labels and metadata are loaded into DataFrames using the `pd.read_csv` function. The files are:
  - `TRAIN_CSV_DIR`: Training labels.
  - `VAL_CSV_DIR`: Validation labels.
  - `TEST_CSV_DIR`: Testing labels.

#### Adding Image Path Column

- A custom function `add_image_path_column` is defined to add an 'IMG_DIR' column to a given DataFrame. This column is generated by applying a lambda function that constructs the image path based on the 'ID' column and the specified image directory.

- A dictionary `directories` maps dataset names to their respective image directories.

- A loop iterates over each dataset name and directory in the `directories` dictionary. For each dataset, the 'IMG_DIR' column is added to the corresponding DataFrame.
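
The path-building step can also be sketched without looking up variables through `globals()`. The version below is a minimal illustration, assuming the frames are kept in a dictionary; the helper name `with_image_paths`, the toy frames, and the directory strings are hypothetical and not part of the notebook.

```python
import os
import pandas as pd

def with_image_paths(df, directory, ext="png"):
    """Return a copy of df with an IMG_DIR column built from the ID column."""
    paths = [os.path.join(directory, f"{image_id}.{ext}") for image_id in df["ID"]]
    return df.assign(IMG_DIR=paths)

# Toy frames standing in for the label files (values are made up).
frames = {
    "train": pd.DataFrame({"ID": [1, 2]}),
    "val": pd.DataFrame({"ID": [1]}),
}
# Hypothetical directories, for illustration only.
directories = {"train": "data/Training", "val": "data/Validation"}

# Build a new dictionary so the input frames are left untouched.
frames = {name: with_image_paths(df, directories[name]) for name, df in frames.items()}
print(frames["train"]["IMG_DIR"].tolist())
```

Returning a new frame instead of mutating the argument keeps the helper free of side effects, at the cost of one extra copy per dataset.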

#### Data Concatenation

- The code concatenates the three DataFrames (`train_labels`, `val_labels`, and `test_labels`) into a single DataFrame `rfmid` using `pd.concat`. The resulting DataFrame `rfmid` contains all the labels and metadata from the three datasets.

- The `df` variable is assigned the `rfmid` DataFrame, which can be used for further data processing and analysis.
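
When the same ID can occur in more than one split, a column recording the source split keeps rows distinguishable after concatenation. The sketch below is illustrative only: the `SPLIT` column and the toy data are assumptions and do not appear in the notebook.

```python
import pandas as pd

# Toy frames with overlapping IDs (values are made up).
train = pd.DataFrame({"ID": [1, 2], "DR": [1, 0]})
val = pd.DataFrame({"ID": [1], "DR": [0]})
test = pd.DataFrame({"ID": [1], "DR": [1]})

combined = pd.concat(
    [train.assign(SPLIT="train"), val.assign(SPLIT="val"), test.assign(SPLIT="test")],
    ignore_index=True,  # rebuild a 0..n-1 index, as the notebook does
)

# ID alone is ambiguous here, but (SPLIT, ID) identifies each row.
assert combined["ID"].duplicated().any()
assert not combined.duplicated(subset=["SPLIT", "ID"]).any()
print(combined)
```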

This code is useful for loading and preparing data for retinal image analysis, making it accessible for further analysis or model training.




In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
pd.options.display.max_columns = 50

TRAIN_CSV_DIR = "/kaggle/input/retinal-disease-classification/Training_Set/Training_Set/RFMiD_Training_Labels.csv"
VAL_CSV_DIR = '/kaggle/input/retinal-disease-classification/Evaluation_Set/Evaluation_Set/RFMiD_Validation_Labels.csv'
TEST_CSV_DIR='/kaggle/input/retinal-disease-classification/Test_Set/Test_Set/RFMiD_Testing_Labels.csv'

train_labels = pd.read_csv(TRAIN_CSV_DIR)
val_labels = pd.read_csv(VAL_CSV_DIR)
test_labels=pd.read_csv(TEST_CSV_DIR)

import os

def add_image_path_column(df, directory):
    df['IMG_DIR'] = df['ID'].apply(lambda image_id: os.path.join(directory, f"{image_id}.png"))


directories = {
    "train_labels": '/kaggle/input/retinal-disease-classification/Training_Set/Training_Set/Training',
    "val_labels"  : '/kaggle/input/retinal-disease-classification/Evaluation_Set/Evaluation_Set/Validation',
    "test_labels" : '/kaggle/input/retinal-disease-classification/Test_Set/Test_Set/Test'
}


for dataset_name, directory in directories.items():
    add_image_path_column(globals()[dataset_name], directory)
test_labels


rfmid = pd.concat([train_labels, val_labels, test_labels], ignore_index=True)
df=rfmid
df.head()

Unnamed: 0,ID,Disease_Risk,DR,ARMD,MH,DN,MYA,BRVO,TSLN,ERM,LS,MS,CSR,ODC,CRVO,TV,AH,ODP,ODE,ST,AION,PT,RT,RS,CRS,EDN,RPEC,MHL,RP,CWS,CB,ODPM,PRH,MNF,HR,CRAO,TD,CME,PTCR,CF,VH,MCA,VS,BRAO,PLQ,HPED,CL,IMG_DIR
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
1,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
2,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
3,4,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
4,5,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...


### Label Aggregation and Data Restructuring

This cell aggregates the less frequent labels into a single 'Other' column and restructures the dataset around a reduced label set for further analysis or modeling.

#### Label Aggregation

- `label_columns` stores the names of the label columns. The slice `df.columns[2:-1]` skips the first two columns ('ID' and 'Disease_Risk') and the last column ('IMG_DIR').

- `label_counts` calculates the sum of labels for each label column, providing a count of positive instances for each label.

- `top_27_labels` holds the names of the 27 labels with the highest positive counts, obtained by sorting `label_counts` in descending order and taking the first 27 entries.

- A new column 'Other' is added to the DataFrame, initialized with zeros.

- The code iterates through each row in the DataFrame using a `for` loop. For each row, it checks whether the sum of labels in the 'label_columns' minus the sum of labels in 'top_27_labels' is greater than 0. If this condition is met, the 'Other' column for that row is set to 1.
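
The row-by-row check can also be expressed as a single vectorized operation. The sketch below assumes every label column holds only 0 or 1, in which case "sum of all labels minus sum of top labels is greater than 0" is the same as "at least one non-top label is set". The toy frame and the choice of two top labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Disease_Risk": [1, 1, 1, 0],
    "A": [1, 1, 1, 0],   # 3 positives
    "B": [1, 1, 0, 0],   # 2 positives
    "C": [0, 1, 0, 0],   # 1 positive, the rare label
    "IMG_DIR": ["1.png", "2.png", "3.png", "4.png"],
})

label_columns = df.columns[2:-1]
label_counts = df[label_columns].sum()
top_labels = label_counts.sort_values(ascending=False).head(2).index

# Labels outside the top set are the ones folded into 'Other'.
rare_labels = label_columns.difference(top_labels)
df["Other"] = df[rare_labels].any(axis=1).astype(int)
print(df[["ID", "Other"]])
```

On a few thousand rows the loop is still fast enough, so this is mainly a readability choice.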

#### Data Restructuring

- `columns_to_remove` is a list comprehension that identifies label columns not present in 'top_27_labels' and stores them in a list.

- The code removes the columns specified in 'columns_to_remove' using the `df.drop` function.

- The `column_order` variable is defined as a list that specifies the desired order of columns. It reorders the DataFrame columns to place 'Other' and 'IMG_DIR' at the end of the DataFrame.

- The DataFrame `df` is updated to reflect the new column order.

This code is useful for reducing the dimensionality of label columns by aggregating less frequent labels into an 'Other' category. It also restructures the dataset for further analysis or modeling with a focus on the top 27 labels.




In [2]:
label_columns = df.columns[2:-1]
label_counts = df[label_columns].sum()

top_27_labels = label_counts.sort_values(ascending=False).head(27).index


df['Other'] = 0

for index, row in df.iterrows():
    if row[label_columns].sum() - row[top_27_labels].sum() > 0:
        df.at[index, 'Other'] = 1

columns_to_remove = [col for col in label_columns if col not in top_27_labels]


df = df.drop(columns=columns_to_remove)

column_order = list(df.columns[:-2]) + ['Other', 'IMG_DIR']
df = df[column_order]
df

Unnamed: 0,ID,Disease_Risk,DR,ARMD,MH,DN,MYA,BRVO,TSLN,ERM,LS,MS,CSR,ODC,CRVO,TV,AH,ODP,ODE,ST,AION,PT,RT,RS,CRS,EDN,RPEC,MHL,RP,Other,IMG_DIR
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
1,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
2,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
3,4,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
4,5,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3195,636,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
3196,637,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
3197,638,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,/kaggle/input/retinal-disease-classification/T...
3198,639,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,/kaggle/input/retinal-disease-classification/T...


In [3]:
df.to_csv('/kaggle/working/rfmid_28.csv', index=False)

### Retinal Image Cropping and Data Transformation

This cell crops the retinal images and writes an updated table that points to the cropped files.

#### Image Cropping

- The code defines the `Retinal_Crop` class, which is designed for cropping retinal images. It uses the Albumentations library to perform cropping based on specified conditions. The conditions are determined by the input image's shape, and the corresponding cropper is applied.

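- Croppers are defined for three image sizes, (1424, 2144, 3), (1536, 2048, 3), and (2848, 4288, 3), together with their rotated counterparts (2144, 1424, 3), (2048, 1536, 3), and (4288, 2848, 3). The first two sizes are center-cropped to squares, while the largest is cropped to a fixed 4100 x 2848 window. Images of any other shape are returned unchanged.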

- The `transform` method of the `Retinal_Crop` class selects the appropriate cropper based on the input image's shape and applies it.
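
The shape-based selection can be sketched as a dictionary lookup instead of an `if`/`elif` chain. The example below is a NumPy stand-in for the center-crop cases only: it does not use Albumentations and does not cover the fixed-window crop applied to the largest images. The names `center_crop`, `CROP_SIZES`, and `crop_by_shape` are hypothetical.

```python
import numpy as np

def center_crop(image, height, width):
    """Crop a (H, W, C) array to height x width around its centre."""
    h, w = image.shape[:2]
    top = (h - height) // 2
    left = (w - width) // 2
    return image[top:top + height, left:left + width]

# Map an input shape to its crop size; shapes not listed pass through unchanged.
CROP_SIZES = {
    (1424, 2144, 3): (1424, 1424),
    (2144, 1424, 3): (1424, 1424),
    (1536, 2048, 3): (1536, 1536),
    (2048, 1536, 3): (1536, 1536),
}

def crop_by_shape(image):
    size = CROP_SIZES.get(image.shape)
    return image if size is None else center_crop(image, *size)

landscape = np.zeros((1424, 2144, 3), dtype=np.uint8)
unknown = np.zeros((100, 200, 3), dtype=np.uint8)
print(crop_by_shape(landscape).shape)  # (1424, 1424, 3)
print(crop_by_shape(unknown).shape)    # (100, 200, 3)
```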

#### Data Processing

- The CSV file specified by `csv_file` (the `rfmid_28.csv` table written in the previous step) is read using pandas. It holds the reduced label set and the image paths.

- A `cropper` object is instantiated, which is an instance of the `Retinal_Crop` class.

- An output directory named `output_image_folder` is created to store the cropped images.

- The code uses a progress bar (`tqdm`) to track the processing of images. It iterates through each row in the DataFrame (loaded from the CSV file), opens the image, applies the appropriate cropping, and saves the cropped image to the output directory. The image path in the DataFrame is updated to point to the cropped image.

- The code generates a new DataFrame with the updated image paths and saves it to a new CSV file specified by `new_csv_file`.
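
When images from several source folders are written into one output folder, files that share a base name overwrite each other. One way to avoid this, sketched below under the assumption that the parent folder name identifies the split, is to prefix the output name with that folder. The helper `output_name` and the example paths are hypothetical; any later step that locates images by name would have to use the same scheme.

```python
import os

def output_name(img_path):
    """Build '<split>_<file>' from a path such as '.../Training/12.png'."""
    split = os.path.basename(os.path.dirname(img_path))
    return f"{split}_{os.path.basename(img_path)}"

paths = ["data/Training/12.png", "data/Validation/12.png"]  # hypothetical paths
names = [output_name(p) for p in paths]
print(names)  # ['Training_12.png', 'Validation_12.png']
```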

This step pre-processes the retinal images by cropping them according to their shapes and records the new image paths in the DataFrame, so that later steps work on the cropped files only.



In [4]:
import os
import pandas as pd
from PIL import Image
import numpy as np
from albumentations import Compose, CenterCrop, Crop
from tqdm import tqdm

class Retinal_Crop:
    def __init__(self):
        self.cropper_alpha = Compose([CenterCrop(width=1424,
                                                 height=1424,
                                                 p=1.0, always_apply=True)])
        self.cropper_beta = Compose([CenterCrop(width=1536,
                                                height=1536,
                                                p=1.0, always_apply=True)])
        self.cropper_gamma = Compose([Crop(x_min=100, x_max=4200,
                                           y_min=0, y_max=2848,
                                           p=1.0, always_apply=True)])
        self.cropper_a = Compose([Crop(x_min=0, x_max=2848,
                                           y_min=100, y_max=4200,
                                           p=1.0, always_apply=True)])
    def transform(self, image):
        if image.shape == (1424, 2144, 3):
            return self.cropper_alpha(image=image)["image"]
        elif image.shape == (1536, 2048, 3):
            return self.cropper_beta(image=image)["image"]
        elif image.shape == (2848, 4288, 3):
            return self.cropper_gamma(image=image)["image"]
        elif image.shape == (4288, 2848, 3):
            return self.cropper_a(image=image)["image"]
        elif image.shape == (2048, 1536, 3):
            return self.cropper_beta(image=image)["image"]
        elif image.shape == (2144, 1424, 3):
            return self.cropper_alpha(image=image)["image"]
        else:
            print("No matching condition, returning original image.")
            return image


csv_file = '/kaggle/working/rfmid_28.csv'
data_df = pd.read_csv(csv_file)

cropper = Retinal_Crop()


output_image_folder = 'images_crop'
os.makedirs(output_image_folder, exist_ok=True)


tqdm_bar = tqdm(total=len(data_df), desc='Processing Images', unit='image')

processed_data = []

for index, row in data_df.iterrows():
    img_path = row['IMG_DIR']
    img = Image.open(img_path)
    img_array = np.array(img)
    cropped_img = cropper.transform(img_array)
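    # Split folder name (e.g. Training); computed here but not used below.
    img_split = os.path.dirname(img_path).split('/')[-1]
    # NOTE: the output name is just "<ID>.png". IDs appear to restart in each
    # split, so same-named images from different splits overwrite one another.
    new_img_name = os.path.basename(img_path)
    output_image_path = os.path.join(output_image_folder, new_img_name)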
    Image.fromarray(cropped_img).save(output_image_path)

    new_row = row.copy()
    new_row['IMG_DIR'] = output_image_path  
    processed_data.append(new_row)

    tqdm_bar.update(1) 

tqdm_bar.close()  


processed_df = pd.DataFrame(processed_data)


new_csv_file = 'image_crop.csv'
processed_df.to_csv(new_csv_file, index=False)


Processing Images: 100%|██████████| 3200/3200 [1:45:53<00:00,  1.99s/image]  


### Resizing and Padding of Retinal Images

This cell resizes the cropped retinal images to a fixed size so that they have consistent dimensions for further processing or modeling.

#### Image Preprocessing

- The code defines a function `square_resize_and_pad` that takes an image file path and a target size as input. The function opens the image using the PIL library and resizes it to `target_size` x `target_size` pixels with Lanczos resampling. Despite its name, it neither pads the image nor preserves the aspect ratio, so a non-square input is stretched to fit. The resulting image is returned.
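
For reference, a resize that keeps the aspect ratio and pads the remainder can be sketched with Pillow's `ImageOps.pad`. This is an illustrative helper (`resize_and_pad` is a hypothetical name, and the black fill color is an assumption), not the function used in the cell below.

```python
from PIL import Image, ImageOps

def resize_and_pad(image, target_size, fill=(0, 0, 0)):
    """Fit the image inside a target_size square, keep its aspect ratio, pad the rest."""
    return ImageOps.pad(
        image,
        (target_size, target_size),
        method=Image.LANCZOS,
        color=fill,
    )

# A synthetic 2:1 white image stands in for a cropped retinal image.
wide = Image.new("RGB", (400, 200), (255, 255, 255))
out = resize_and_pad(wide, 380)
print(out.size)  # (380, 380)
```

The image content ends up centered, with the fill color above and below it for landscape inputs or to the left and right for portrait inputs.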

#### Processing the Dataset

- An `output_folder` is specified to store the resized images.

- The CSV file "image_crop.csv", written by the cropping step, is read using pandas. It contains the labels and the paths of the cropped images.

- The code ensures that the output folder exists, creating it if necessary.

- An empty DataFrame named `updated_data` is created with columns identical to the input dataset.

- The code iterates through each row in the input dataset, retrieves the image path, and applies the `square_resize_and_pad` function to resize the image to 380x380.

- The code generates a new image name and path for the resized image in the specified output folder.

- The image path in the current row is updated to point to the new resized image path.

- The current row is added to the `updated_data` DataFrame.

- After processing all rows, the `updated_data` DataFrame is saved to a new CSV file named "data_files.csv" without the index column.
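
Appending with `pd.concat` inside a loop copies the accumulated frame on every iteration. A common alternative, sketched here with the image processing left out, is to collect plain dictionaries in a list and build the DataFrame once at the end. The toy paths are made up.

```python
import os
import pandas as pd

input_data = pd.DataFrame({"ID": [1, 2], "IMG_DIR": ["crop/1.png", "crop/2.png"]})
output_folder = "resized_images"

records = []
for row in input_data.to_dict("records"):
    # Point the row at the file's new location; resizing and saving are omitted.
    row["IMG_DIR"] = os.path.join(output_folder, os.path.basename(row["IMG_DIR"]))
    records.append(row)

# Build the frame once, keeping the original column order.
updated_data = pd.DataFrame(records, columns=input_data.columns)
print(updated_data)
```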

This code is useful for standardizing the dimensions of retinal images, making them consistent for analysis or modeling. It is a common step in image preprocessing for various computer vision tasks.




In [5]:
import os
import pandas as pd
from PIL import Image

def square_resize_and_pad(image_path, target_size):
    # Despite the name, this only resizes to a square; no padding is applied.
    image = Image.open(image_path)
    resized_image = image.resize((target_size, target_size), Image.LANCZOS)
    return resized_image


output_folder = "resized_images"
input_data = pd.read_csv("/kaggle/working/image_crop.csv")
os.makedirs(output_folder, exist_ok=True)
updated_data = pd.DataFrame(columns=input_data.columns)

for index, row in input_data.iterrows():
    img_path = row['IMG_DIR']
    resized_image = square_resize_and_pad(img_path, 380)
    # Keep the original file name; only the folder changes.
    new_img_name = os.path.basename(img_path)
    new_img_path = os.path.join(output_folder, new_img_name)
    resized_image.save(new_img_path)
    row['IMG_DIR'] = new_img_path
    updated_data = pd.concat([updated_data, row.to_frame().T])

updated_csv = 'data_files.csv'
updated_data.to_csv(updated_csv, index=False)



### Data Augmentation and Balancing for Retinal Images

This cell applies data augmentation and class balancing to the retinal image dataset. The goal is to ensure that each class has a sufficient number of samples while expanding the dataset through augmentation.

#### Data Analysis and Balancing

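- The code starts by counting the positive samples in every label column (`Disease_Risk`, the retained disease labels, and `Other`) and identifying labels that have fewer than `N_class` samples.

- Rows are grouped by their exact combination of positive labels. Each combination that contains at least one under-represented label is stored in `labels_pairings` together with the IDs of the images that carry it.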

- A loop is used to perform data augmentation and class balancing. For each identified class combination (pair):
  - Randomly select an image (index) from the existing dataset.
  - Apply data augmentation techniques to the selected image to create a new augmented image.
  - Save the augmented image with a new unique index.
  - Add the new index and corresponding labels to the dataset.
  - Update the labels_pairings dictionary to include the new index in the corresponding pair.

- The process continues until every label column has at least `N_class` positive samples.
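
The grouping step can be illustrated on a toy table. The sketch below assumes 0/1 labels and uses a small threshold so the effect is visible; the data and the names `groups` and `rare` are made up, and only the grouping is shown, not the augmentation.

```python
import pandas as pd

N_class = 3  # toy threshold; the notebook uses a larger value
dt = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "Disease_Risk": [1, 1, 1, 1, 0],
    "DR": [1, 1, 1, 1, 0],
    "MH": [0, 0, 1, 1, 0],
})

label_cols = dt.columns[1:]
class_freq = dt[label_cols].sum()
rare = set(class_freq[class_freq < N_class].index)  # under-represented labels

groups = {}
for _, row in dt.iterrows():
    positives = [c for c in label_cols if row[c] == 1]
    if rare.isdisjoint(positives):
        continue  # this row carries no under-represented label
    groups.setdefault("|".join(positives), []).append(row["ID"])

print(groups)  # {'Disease_Risk|DR|MH': ['c', 'd']}
```

Only the combination containing `MH` is kept, because `MH` is the only label below the threshold; augmenting images from that group raises the count of every label in the combination at once.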

#### Data Augmentation Layer

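- The `CustomImageAugmentation` class defines an image augmentation layer that can be applied to images. Its `call` method implements random horizontal flipping, rotation by a random multiple of 90 degrees, and random brightness, contrast, saturation, and hue adjustment. The constructor also accepts flags for further options (scale, crop, grid distortion, compression, Gaussian noise, Gaussian blur, downscaling, gamma, and elastic transform), but these are only stored and have no effect in `call`.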
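
The geometric and brightness parts of such a layer can be sketched without TensorFlow. The function below is a NumPy stand-in, not the layer used in this notebook: it assumes a float image in [0, 1] with shape (H, W, C) and omits contrast, saturation, and hue. The name `augment` and the parameter values are illustrative.

```python
import numpy as np

def augment(image, rng):
    """Random horizontal flip, random 90-degree rotation, small brightness shift."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                      # flip left-right
    k = int(rng.integers(0, 4))                     # 0, 1, 2 or 3 quarter turns
    image = np.rot90(image, k=k)                    # rotates the first two axes
    delta = rng.uniform(-0.2, 0.2)                  # brightness shift on the [0, 1] scale
    return np.clip(image + delta, 0.0, 1.0)

rng = np.random.default_rng(0)
img = np.full((4, 6, 3), 0.5)
out = augment(img, rng)
print(out.shape, float(out.min()), float(out.max()))
```

Note that the brightness shift only has a visible effect when pixel values are on the [0, 1] scale; on a 0 to 255 scale a shift of 0.2 is negligible.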

#### Image Loading and Augmentation

- The code uses the `image_loader` function to load an image from the dataset by index. It can convert the image to grayscale if specified.

- The `perform_augmentation` function performs the actual augmentation. It randomly selects an image, applies augmentation using the `CustomImageAugmentation` layer, and saves the augmented image with a new unique index.

#### Dataset Directory Structure

- The code ensures that the necessary output directories exist for saving augmented images and the final balanced dataset.

- Existing images are copied to the augmentation directory.

- The entire process is repeated until class balancing is achieved.

- The final balanced dataset is saved as "data.csv" in the specified output directory.

This code is valuable for addressing class imbalance in datasets and expanding the dataset using data augmentation techniques. It ensures that each class has a sufficient number of samples for training machine learning models effectively.



In [13]:
import os
import pandas as pd
import numpy as np
from shutil import copyfile
import uuid
from PIL import Image
from multiprocessing.pool import ThreadPool
import tensorflow as tf

path_riadd = "/kaggle/working/"


path_aug = "/kaggle/working/"

N_class = 180
N_pair = 36


path_images = os.path.join(path_riadd, "resized_images")
path_csv = "/kaggle/working/data_files.csv"


dt = pd.read_csv(path_csv, sep=",")
dt=dt.drop(['IMG_DIR'], axis=1)

def analyse_classes(dt):
    class_freq = dt.iloc[:, 1:].sum(axis=0).to_dict()


    n_diseased = (dt["Disease_Risk"] == 1).sum()


    labels_pairings = {}
    labels_pairings_ohe = {}
    col_list = dt.columns[1:]
    for index, row in dt.iterrows():
        pair = []
        for j, col in enumerate(col_list):
            # Look up by column name; integer keys on a labelled Series are deprecated.
            if row[col] == 1 : pair.append(col)
        if not any(class_freq[c] < N_class for c in pair) : continue
        key = "|".join(pair)
        if key in labels_pairings : labels_pairings[key].append(str(row["ID"]))
        else:
            labels_pairings[key] = [str(row["ID"])]
            labels_pairings_ohe[key] = row[1:].tolist()

    return dt, class_freq, labels_pairings, labels_pairings_ohe



class CustomImageAugmentation(tf.keras.layers.Layer):
    def __init__(self, flip=True, rotate=True, brightness=True,
                 contrast=True, saturation=True, hue=True, scale=False,
                 crop=False, grid_distortion=False, compression=False,
                 gaussian_noise=False, gaussian_blur=False,
                 downscaling=False, gamma=False, elastic_transform=False, **kwargs):
        super(CustomImageAugmentation, self).__init__(**kwargs)
        self.flip = flip
        self.rotate = rotate
        self.brightness = brightness
        self.contrast = contrast
        self.saturation = saturation
        self.hue = hue
        self.scale = scale
        self.crop = crop
        self.grid_distortion = grid_distortion
        self.compression = compression
        self.gaussian_noise = gaussian_noise
        self.gaussian_blur = gaussian_blur
        self.downscaling = downscaling
        self.gamma = gamma
        self.elastic_transform = elastic_transform

    def call(self, inputs, apply=True):
        if apply:
            augmented = tf.image.random_flip_left_right(inputs) if self.flip else inputs
            augmented = tf.image.rot90(augmented, k=tf.random.uniform(shape=[], minval=0, maxval=4, dtype=tf.int32)) if self.rotate else augmented
            augmented = tf.image.random_brightness(augmented, max_delta=0.2) if self.brightness else augmented
            augmented = tf.image.random_contrast(augmented, lower=0.5, upper=1.5) if self.contrast else augmented
            augmented = tf.image.random_saturation(augmented, lower=0.5, upper=1.5) if self.saturation else augmented
            augmented = tf.image.random_hue(augmented, max_delta=0.2) if self.hue else augmented
        
            return augmented
        else:
            return inputs
img_aug = CustomImageAugmentation()
def image_loader(index, folder_path, image_format="png", grayscale=False):
    image_path = os.path.join(folder_path, f"{index}.{image_format}")
    image = Image.open(image_path)
    
    if grayscale and image.mode != "L":
        print("gray")
        image = image.convert("L")
    
    return image

def perform_augmentation(index_list, pair):

    ir = np.random.choice(len(index_list), 1, replace=True)
    index = index_list[ir[0]]

    img = image_loader(index, path_images_aug, image_format="png",
                       grayscale=False)

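    # Scale pixels to [0, 1]: the tf.image brightness and hue deltas assume that range.
    img_tensor = tf.keras.preprocessing.image.img_to_array(img) / 255.0
    img_tensor = tf.expand_dims(img_tensor, axis=0)

    augmented_tensor = img_aug(img_tensor, apply=True)
    # Brightness and contrast changes can leave [0, 1]; clip before converting back.
    augmented_tensor = tf.clip_by_value(augmented_tensor, 0.0, 1.0)

    augmented_image = tf.keras.preprocessing.image.array_to_img(augmented_tensor[0])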

    index_new = str(uuid.uuid4())

    path_img = os.path.join(path_images_aug, index_new + ".png")
    augmented_image.save(path_img)

    return index_new



if not os.path.exists(path_aug) : os.mkdir(path_aug)
path_images_aug = os.path.join(path_aug, "images")
if not os.path.exists(path_images_aug) : os.mkdir(path_images_aug)


for file in os.listdir(path_images):
    copyfile(os.path.join(path_images, file),
             os.path.join(path_images_aug, file))


class_freq = dt.iloc[:, 1:].sum(axis=0).to_dict()
while any(class_freq[c] < N_class for c in class_freq):
    dt, class_freq, labels_pairings, labels_pairings_ohe = analyse_classes(dt)

    for j in range(0, N_pair):

        wk_list = [(labels_pairings[pair], pair) for pair in labels_pairings]

        with ThreadPool(92) as pool:
            list_newIndicies = pool.starmap(perform_augmentation, wk_list)

        for i, new_index in enumerate(list_newIndicies):
            pair = wk_list[i][1]
            new_entry = [new_index] + labels_pairings_ohe[pair]
            dt.loc[len(dt)] = new_entry
            labels_pairings[pair].append(new_index)


n_diseased = (dt["Disease_Risk"] == 1).sum()



dt.to_csv(os.path.join(path_aug, "data.csv"), sep=",", header=True, index=False)