<a href="https://colab.research.google.com/github/c4bath/cf860/blob/main/AptosSampler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APTOS 2019 Blindness Detection Dataset

Sampling (to fit compute and memory constraints):

* A stratified 10% sample (366 of the 3,662 images in the original APTOS training set) is split 70/30 into train_small and test_small.

* The CSV files train_small.csv and test_small.csv are written with the 'id_code' and 'diagnosis' columns for the corresponding sample sets.

* The original class balance is preserved in both subsets.

A clinician has rated each image for the severity of diabetic retinopathy on a scale of 0 to 4:

0 - No DR

1 - Mild

2 - Moderate

3 - Severe

4 - Proliferative DR
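
For readable tables and plots, the integer grades can be mapped to the names above. A minimal sketch; the names `DIAGNOSIS_LABELS` and `example` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Severity grades as listed above (0 = No DR ... 4 = Proliferative DR)
DIAGNOSIS_LABELS = {
    0: 'No DR',
    1: 'Mild',
    2: 'Moderate',
    3: 'Severe',
    4: 'Proliferative DR',
}

# Hypothetical example column; in the notebook this would be df['diagnosis']
example = pd.Series([0, 2, 4], name='diagnosis')
example_named = example.map(DIAGNOSIS_LABELS)
print(example_named.tolist())
```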


train_small: 256 files

test_small: 110 files
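
These file counts follow from the per-class rounding and the 70/30 split. A small sketch of the arithmetic, assuming the original class proportions printed at the end of this notebook and assuming that `train_test_split` rounds a fractional test size up and gives the remainder to the training set:

```python
import math
import numpy as np

total_images = 3662

# Original class proportions as printed by the notebook (diagnosis -> share)
proportions = {0: 0.492900, 2: 0.272802, 1: 0.101038, 4: 0.080557, 3: 0.052703}

# Recover the integer class counts from the printed proportions
class_counts = {k: round(p * total_images) for k, p in proportions.items()}

# Same per-class rule as the sampling cell: round 10% of each class with np.rint
sampled_counts = {k: int(np.rint(0.1 * n)) for k, n in class_counts.items()}
n_sampled = sum(sampled_counts.values())

# Assumed split rule: test size rounded up, training set takes the rest
n_test = math.ceil(0.3 * n_sampled)
n_train = n_sampled - n_test

print(n_sampled, n_train, n_test)
```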



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Import libraries
import os
import pandas as pd
import numpy as np
import shutil
from sklearn.model_selection import train_test_split


In [3]:
data_dir = '/content/drive/MyDrive/cfPublicData/aptos2'
image_dir = f'{data_dir}/train_images'
csv_file = f'{data_dir}/train.csv'
train_small_dir = f'{data_dir}/train_small'
test_small_dir = f'{data_dir}/test_small'

In [6]:
os.makedirs(train_small_dir, exist_ok=True)
os.makedirs(test_small_dir, exist_ok=True)

# Read the CSV file
df = pd.read_csv(csv_file)

# Draw a stratified 10% sample: take 10% of each diagnosis class (rounded)
# so the class balance is preserved; random_state makes the draw repeatable
df_sampled = (
    df.groupby('diagnosis', group_keys=False)
      .apply(lambda x: x.sample(int(np.rint(0.1 * len(x))), random_state=27))
      .reset_index(drop=True)
)

# Split the sampled data into 70% train and 30% test with stratification
train_df, test_df = train_test_split(df_sampled, test_size=0.3, stratify=df_sampled['diagnosis'], random_state=27)

train_small_csv = f'{data_dir}/train_small.csv'
test_small_csv = f'{data_dir}/test_small.csv'

# Save to CSV
train_df.to_csv(train_small_csv, index=False)
test_df.to_csv(test_small_csv, index=False)

# Function to copy images to the respective directories
def copy_images(df, source_dir, dest_dir):
    """Copy '<id_code>.png' for each row of df from source_dir to dest_dir."""
    for id_code in df['id_code']:
        src_path = os.path.join(source_dir, id_code + '.png')
        dst_path = os.path.join(dest_dir, id_code + '.png')
        shutil.copy(src_path, dst_path)
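
The stratified draw in the cell above can also be written with `DataFrameGroupBy.sample`, available in pandas 1.1 and later. The sketch below uses a made-up toy frame (`toy`, with 100 rows of class 0 and 50 of class 1) rather than the APTOS data, and is an alternative formulation, not the notebook's method:

```python
import pandas as pd

# Toy stand-in for train.csv: 100 images of class 0 and 50 of class 1 (hypothetical data)
toy = pd.DataFrame({
    'id_code': [f'img{i:03d}' for i in range(150)],
    'diagnosis': [0] * 100 + [1] * 50,
})

# Sample 10% within each diagnosis class; random_state makes the draw repeatable
toy_sampled = (
    toy.groupby('diagnosis')
       .sample(frac=0.1, random_state=27)
       .reset_index(drop=True)
)

print(toy_sampled['diagnosis'].value_counts().sort_index())
```

How fractional group sizes are rounded may differ from the `np.rint` rule used in the notebook, so it is worth comparing the per-class counts before swapping this in.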


In [7]:
copy_images(train_df, image_dir, train_small_dir)
copy_images(test_df, image_dir, test_small_dir)
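
After copying, it can be worth confirming that every `id_code` has a matching file in its destination directory. A minimal sketch; `missing_images` is a hypothetical helper, demonstrated here on a temporary directory with made-up ids rather than on Drive:

```python
import os
import tempfile
import pandas as pd

def missing_images(df, directory, ext='.png'):
    """Return the id_codes in df that have no '<id_code><ext>' file in directory."""
    present = set(os.listdir(directory))
    return [code for code in df['id_code'] if code + ext not in present]

# Demonstration with made-up ids: one file exists, one does not
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'aaa.png'), 'wb').close()
    demo_df = pd.DataFrame({'id_code': ['aaa', 'bbb'], 'diagnosis': [0, 2]})
    demo_missing = missing_images(demo_df, tmp)

print(demo_missing)
```

In this notebook the equivalent calls would be `missing_images(train_df, train_small_dir)` and `missing_images(test_df, test_small_dir)`, each of which should return an empty list if the copy succeeded.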

In [8]:
print("Class distribution in the original data:")
print(df['diagnosis'].value_counts(normalize=True))

print("\nClass distribution in the sampled data (10%):")
print(df_sampled['diagnosis'].value_counts(normalize=True))

print("\nClass distribution in the train_small subset (70% of 10%):")
print(train_df['diagnosis'].value_counts(normalize=True))

print("\nClass distribution in the test_small subset (30% of 10%):")
print(test_df['diagnosis'].value_counts(normalize=True))

print(f'\nCopied {len(train_df)} images to {train_small_dir}')
print(f'Copied {len(test_df)} images to {test_small_dir}')

Class distribution in the original data:
0    0.492900
2    0.272802
1    0.101038
4    0.080557
3    0.052703
Name: diagnosis, dtype: float64

Class distribution in the sampled data (10%):
0    0.491803
2    0.273224
1    0.101093
4    0.081967
3    0.051913
Name: diagnosis, dtype: float64

Class distribution in the train_small subset (70% of 10%):
0    0.492188
2    0.273438
1    0.101562
4    0.082031
3    0.050781
Name: diagnosis, dtype: float64

Class distribution in the test_small subset (30% of 10%):
0    0.490909
2    0.272727
1    0.100000
4    0.081818
3    0.054545
Name: diagnosis, dtype: float64

Copied 256 images to /content/drive/MyDrive/cfPublicData/aptos2/train_small
Copied 110 images to /content/drive/MyDrive/cfPublicData/aptos2/test_small
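
The outputs above show the class proportions staying close to the original in all three subsets. One way to quantify this is the largest absolute difference per subset; the sketch below hard-codes the printed proportions, and the variable names are illustrative:

```python
import pandas as pd

# Proportions copied from the printed output, indexed by diagnosis grade
props = pd.DataFrame(
    {
        'original':    [0.492900, 0.101038, 0.272802, 0.052703, 0.080557],
        'sampled':     [0.491803, 0.101093, 0.273224, 0.051913, 0.081967],
        'train_small': [0.492188, 0.101562, 0.273438, 0.050781, 0.082031],
        'test_small':  [0.490909, 0.100000, 0.272727, 0.054545, 0.081818],
    },
    index=pd.Index([0, 1, 2, 3, 4], name='diagnosis'),
)

# Largest absolute deviation from the original proportions, per subset
max_dev = (
    props.drop(columns='original')
         .sub(props['original'], axis=0)
         .abs()
         .max()
)
print(max_dev)
```

With these numbers every subset stays within 0.25 percentage points of the original class shares.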
