In [1]:
import os
import numpy as np
from PIL import Image
import tensorflow as tf 
import pandas as pd

### Data Splitting for Disease Classification

This code demonstrates how to split a multi-label dataset for disease classification so that each class is represented in the training, validation, and test sets. The dataset is first filtered to the rows where `Disease_Risk=1` and then divided into the three subsets.

#### Data Filtering

- The code starts by reading the dataset from a CSV file and filters it to include only rows where `Disease_Risk=1`.

- It calculates the size of the filtered data (`filtered_size`) and extracts the label columns.

- The positive instances for each label are counted and sorted.
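
The filtering and counting steps can be sketched on a toy table. Every column name except `Disease_Risk` (`ID`, `DR`, `ARMD`, `MH`, `IMG_DIR`) and all values are made up for illustration. The slice `columns[2:-1]` assumes, as the notebook's code does, that the label columns sit between the first two columns and the last one.

```python
import pandas as pd

# Toy stand-in for the real CSV; all names except Disease_Risk are hypothetical.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Disease_Risk": [1, 1, 0, 1, 1],
    "DR": [1, 0, 0, 1, 1],
    "ARMD": [0, 1, 0, 0, 0],
    "MH": [0, 1, 0, 1, 0],
    "IMG_DIR": ["a.png", "b.png", "c.png", "d.png", "e.png"],
})

# Keep only the diseased rows.
toy_filtered = toy[toy["Disease_Risk"] == 1]

# Label columns: everything between the first two columns and the last one.
toy_labels = toy_filtered.columns[2:-1]

# Positives per label, rarest first (sort_values sorts ascending by default).
toy_counts = toy_filtered[toy_labels].sum().sort_values()
print(toy_counts)
```

Sorting the counts in ascending order is what lets the later loop handle the rarest labels first.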

#### Data Splitting

- The labels are processed one at a time, from the rarest to the most common (the order of `label_counts`). For each label:
  - The remaining rows that are positive for that label are split 80/10/10 into a training part, a validation part, and a test part with two calls to `train_test_split` from scikit-learn (`test_size=0.2`, then `test_size=0.5` on the held-out part).
  - The row indices of each part are stored per label in three dictionaries: `train_splits`, `val_splits`, and `test_splits`.
  - The rows just assigned are dropped from `remaining_data`, so a row with several positive labels is assigned only once, when its rarest label is processed.

- After all labels have been processed, the stored indices are flattened and used to select the final training, validation, and test sets from `filtered_data`.
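
In the notebook's code the three-way split comes from two consecutive `train_test_split` calls (`test_size=0.2`, then `test_size=0.5`). As a small illustration, the sketch below applies the same two calls to 100 dummy rows. The frame `demo` and the `demo_*` names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 dummy rows standing in for the rows that are positive for one label.
demo = pd.DataFrame({"x": range(100)})

# Stage 1: hold out 20% of the rows.
demo_train, demo_temp = train_test_split(demo, test_size=0.2, random_state=42)

# Stage 2: halve the held-out part into validation and test.
demo_val, demo_test = train_test_split(demo_temp, test_size=0.5, random_state=42)

print(len(demo_train), len(demo_val), len(demo_test))  # 80 10 10
```

When the row count is not a multiple of ten the parts are only approximately 80/10/10, because scikit-learn rounds the test size up to a whole number of rows.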

#### Summary

- The approach places positive instances of each label in the training, validation, and test sets, which matters for disease classification when some labels are rare.

- The final expression in the cell returns the sizes of the resulting sets: `train_data`, `val_data`, and `test_data`.

This keeps each label's share of the three subsets at roughly 80/10/10. It does not rebalance the labels against each other, so a label that is rare in the full dataset stays rare in every subset.
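
One way to check a split like this is to confirm that the subsets share no rows and to count the positives of each label per subset. The helper below is a minimal sketch and is not part of the original notebook. The name `check_splits` and the toy frame are hypothetical.

```python
import pandas as pd

def check_splits(splits, label_columns):
    """Return positives per label per split; raise if two splits share rows."""
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = splits[first].index.intersection(splits[second].index)
            if len(overlap) > 0:
                raise ValueError(f"{first} and {second} share {len(overlap)} rows")
    # One column per split, one row per label.
    return pd.DataFrame(
        {name: df[label_columns].sum() for name, df in splits.items()}
    )

# Toy data: two hypothetical labels, six rows, split 4/1/1.
toy = pd.DataFrame({"DR": [1, 1, 0, 1, 0, 1], "MH": [0, 1, 1, 0, 1, 0]})
toy_splits = {"train": toy.iloc[:4], "val": toy.iloc[4:5], "test": toy.iloc[5:]}

report = check_splits(toy_splits, ["DR", "MH"])
print(report)
```

With the notebook's variables the call would look like `check_splits({"train": train_data, "val": val_data, "test": test_data}, label_columns)`. A zero in the report marks a label that is missing from that subset.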




In [2]:
data = pd.read_csv('/kaggle/input/updated-csv/data_file.csv')
filtered_data = data[data['Disease_Risk'] == 1]

filtered_size = len(filtered_data)
label_columns = filtered_data.columns[2:-1]

label_counts = filtered_data[label_columns].sum().sort_values()

from sklearn.model_selection import train_test_split
from collections import defaultdict


train_splits = defaultdict(list)
val_splits = defaultdict(list)
test_splits = defaultdict(list)

remaining_data = filtered_data.copy()

for label in label_counts.index:
    # Rows not yet assigned to any split that are positive for this label.
    label_rows = remaining_data[remaining_data[label] == 1]

    # 80/10/10 split. Every row here has the same value (1) for `label`,
    # so `stratify` sees a single class and does not change the proportions.
    train_part, temp_part = train_test_split(
        label_rows, test_size=0.2, random_state=42, stratify=label_rows[label]
    )
    val_part, test_part = train_test_split(
        temp_part, test_size=0.5, random_state=42, stratify=temp_part[label]
    )

    train_splits[label].extend(train_part.index.tolist())
    val_splits[label].extend(val_part.index.tolist())
    test_splits[label].extend(test_part.index.tolist())

    # Drop the assigned rows so that later labels cannot reuse them.
    remaining_data = remaining_data.drop(label_rows.index)


train_idx = [idx for idx_list in train_splits.values() for idx in idx_list]
val_idx = [idx for idx_list in val_splits.values() for idx in idx_list]
test_idx = [idx for idx_list in test_splits.values() for idx in idx_list]

train_data = filtered_data.loc[train_idx]
val_data = filtered_data.loc[val_idx]
test_data = filtered_data.loc[test_idx]

len(train_data), len(val_data), len(test_data)

(7502, 934, 953)

### Splitting Normal Data and Combining with Diseased Data

In this code the normal (non-diseased) instances are split and combined with the diseased splits created above, so that the training, validation, and test sets each contain both classes.

#### Splitting Normal Data

- The code selects the rows where `Disease_Risk=0` (normal instances) and stores them in the `normal` variable.

- The normal data is split 60/20/20: `train_test_split` with `test_size=0.4` holds out a temporary part, which is then halved into validation and test parts. The resulting DataFrames are stored in `normaltrain_data`, `normalval_data`, and `normaltest_data`.

#### Combining with Diseased Data

- The code concatenates each normal split with the corresponding diseased split obtained previously, giving the final `train_data`, `val_data`, and `test_data`.

- The diseased data was split 80/10/10 and the normal data 60/20/20, so the share of normal samples is higher in the validation and test sets than in the training set.

This exposes the model to both normal and diseased samples during training and evaluation.
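
The class mix of each combined split can be inspected by tabulating the share of each `Disease_Risk` value. The sketch below uses made-up frames, and the helper name `class_shares` is hypothetical. The toy sizes mimic 10 diseased rows split 8/1/1 and 10 normal rows split 6/2/2.

```python
import pandas as pd

def class_shares(splits, column="Disease_Risk"):
    """Fraction of rows per class value in each split (one column per split)."""
    return pd.DataFrame({
        name: df[column].value_counts(normalize=True)
        for name, df in splits.items()
    }).fillna(0.0).sort_index()

# Made-up splits: 1 = diseased, 0 = normal.
toy_splits = {
    "train": pd.DataFrame({"Disease_Risk": [1] * 8 + [0] * 6}),
    "val": pd.DataFrame({"Disease_Risk": [1] * 1 + [0] * 2}),
    "test": pd.DataFrame({"Disease_Risk": [1] * 1 + [0] * 2}),
}

shares = class_shares(toy_splits)
print(shares.round(2))
```

With the notebook's variables this would be called on `train_data`, `val_data`, and `test_data` after the concatenation.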




In [3]:
normal = data[data['Disease_Risk'] == 0]
normaltrain_data, temp_data = train_test_split(normal, test_size=0.4, random_state=42)
normalval_data, normaltest_data = train_test_split(temp_data, test_size=0.5, random_state=42)
train_data = pd.concat([train_data,normaltrain_data ])
val_data = pd.concat([val_data,normalval_data ])
test_data = pd.concat([test_data,normaltest_data ])

### Organizing Data into Split Directories

In this code the dataset is organized into split directories for the training, validation, and test sets. Here's what the code does:

#### Destination Directory

- The code defines the `base_dest_dir`, which is the destination directory where the split images will be copied. You can specify your desired destination directory.

#### Split Folders

- The code specifies the split folders, which include 'train', 'val', and 'test'. These folders will contain the respective data splits.

#### Iterating Over Splits

- The code iterates over each split, creating a destination directory for each split (e.g., `/kaggle/working/data/train`, `/kaggle/working/data/val`, `/kaggle/working/data/test`).

- For each split, it iterates over the rows in the corresponding data (e.g., `train_data`, `val_data`, `test_data`), where each row represents an image and its associated information.

- It extracts the image path from the dataset (you may need to replace `'IMG_DIR'` with the actual column name) and obtains the image file name using `os.path.basename(image_path)`.

- It constructs the destination path for the image in the split directory.

- The image file is copied from its original location to the destination directory using `shutil.copy`.

This code structure is commonly used to prepare data for machine learning experiments, ensuring that the data is organized into distinct training, validation, and test sets within separate directories.
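
The copy step can be sketched as a small function that also records source files that do not exist, instead of stopping at the first missing one. This is an illustrative variant, not the notebook's code. The name `copy_into_split` and the demo file names are hypothetical, and the demo runs in a temporary directory.

```python
import os
import shutil
import tempfile

def copy_into_split(image_paths, split_dest_dir):
    """Copy files into split_dest_dir; return (copied, missing) path lists."""
    os.makedirs(split_dest_dir, exist_ok=True)
    copied, missing = [], []
    for image_path in image_paths:
        if not os.path.isfile(image_path):
            missing.append(image_path)  # record and skip instead of raising
            continue
        dest_path = os.path.join(split_dest_dir, os.path.basename(image_path))
        shutil.copy(image_path, dest_path)
        copied.append(dest_path)
    return copied, missing

# Demo: one file that exists and one that does not.
work_dir = tempfile.mkdtemp()
src_dir = os.path.join(work_dir, "src")
os.makedirs(src_dir)
real_image = os.path.join(src_dir, "img_001.png")
with open(real_image, "wb") as fh:
    fh.write(b"placeholder bytes, not a real image")
ghost_image = os.path.join(src_dir, "img_002.png")  # never created

copied, missing = copy_into_split(
    [real_image, ghost_image], os.path.join(work_dir, "data", "train")
)
print(len(copied), "copied,", len(missing), "missing")
```

Because only the base name of each file is kept, two source images with the same file name in different folders would overwrite each other in the destination directory.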



In [8]:
import os
import shutil


base_dest_dir = '/kaggle/working/data'


split_folders = ['train', 'val', 'test']


# `split_df` is used as the loop variable so that the `data` DataFrame
# loaded earlier is not overwritten.
for split, split_df in zip(split_folders, [train_data, val_data, test_data]):
    split_dest_dir = os.path.join(base_dest_dir, split)
    os.makedirs(split_dest_dir, exist_ok=True)

    for index, row in split_df.iterrows():
        image_path = row['IMG_DIR']  # column holding the source image path
        image_name = os.path.basename(image_path)
        dest_path = os.path.join(split_dest_dir, image_name)
        shutil.copy(image_path, dest_path)

In [9]:
train_data.to_csv('/kaggle/working/train_data.csv', index=False)

In [10]:
val_data.to_csv('/kaggle/working/val_data.csv', index=False)

In [11]:
test_data.to_csv('/kaggle/working/test_data.csv', index=False)

In [13]:
import shutil


directory_to_zip = '/kaggle/working/data'


zip_file_name = 'data'


shutil.make_archive(zip_file_name, 'zip', directory_to_zip)

print(f'"{directory_to_zip}" has been zipped to "{zip_file_name}"')


"/kaggle/working/data" has been zipped to "data"

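`shutil.make_archive` appends the format's extension to the base name and returns the path of the archive it wrote, so the file produced above is `data.zip`. The sketch below builds a small placeholder directory tree in a temporary folder, zips it, and lists the archive contents with `zipfile`. The directory layout and file names are made up for illustration.

```python
import os
import shutil
import tempfile
import zipfile

# Placeholder tree: data/train, data/val, data/test with one file each.
work_dir = tempfile.mkdtemp()
data_dir = os.path.join(work_dir, "data")
for split in ("train", "val", "test"):
    os.makedirs(os.path.join(data_dir, split))
    with open(os.path.join(data_dir, split, f"{split}_img.png"), "wb") as fh:
        fh.write(b"placeholder")

# The return value is the full path of the archive, including ".zip".
archive_path = shutil.make_archive(os.path.join(work_dir, "data"), "zip", data_dir)

# List the files stored in the archive (directory entries are skipped).
with zipfile.ZipFile(archive_path) as zf:
    names = sorted(n for n in zf.namelist() if not n.endswith("/"))

print(os.path.basename(archive_path))
print(names)
```

Printing the returned path rather than the base name shows exactly which file was written.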