# Initial Data Structuring and Mapping Configuration

This notebook constitutes the foundational step in preparing the X-ray image dataset for a multi-stage deep learning pipeline focused on COVID-19, Viral Pneumonia, and Lung Opacity detection. Its primary function is to **aggregate all raw data paths and metadata** and establish the **hierarchical classification framework** used throughout the project.

---

## Key Data Preparation Tasks

### 1. Data Aggregation and Consolidation
The script systematically accesses the raw image and mask files across the four diagnostic categories (**COVID**, **Normal**, **Viral Pneumonia**, and **Lung\_Opacity**) and their associated metadata (stored in `.xlsx` files).

* **Image-Mask Pairing:** It creates a comprehensive index of all image paths and their corresponding segmentation mask paths, handling cases where a mask may be absent.
* **Structured Data Output:** The aggregated paths and metadata are then persisted into two structured CSV files, **`all_image_mask_pairs.csv`** and **`all_metadata.csv`**, which serve as the single source of truth for all subsequent training, validation, and testing procedures.

### 2. Defining Hierarchical Classification Mappings
A core objective of this notebook is to define the necessary mappings to support the project's **two-step classification strategy**:

1.  **Stage 1: Binary Classification (Triage)**: Distinguishing between **Healthy** (Normal) and **Unhealthy** (all disease classes) lung conditions. This is defined in `healthy_binary_mapping.json`.
2.  **Stage 2: Specific Disease Classification**: For images identified as Unhealthy, classifying the specific pathology among **COVID**, **Viral Pneumonia**, and **Lung Opacity**. This specialized task is configured in `unhealthy_mapping.json`.

In addition, a standard **Four-Class Multi-class Mapping** (`class_mapping.json`) is created for baseline model training and comparison. All class-to-index and index-to-class mappings are saved as JSON files for consistent use across all model training and evaluation notebooks.
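
The two-step strategy can be illustrated with a minimal sketch. The helper `to_binary_label` and the `sample_labels` list are hypothetical names introduced only for this example; the two dictionaries mirror the mappings defined later in this notebook. It assumes that `Normal` is the only class treated as Healthy, as described in Stage 1.

```python
# Mappings copied from the definitions later in this notebook.
class_mapping = {'COVID': 0, 'Normal': 1, 'Viral Pneumonia': 2, 'Lung_Opacity': 3}
healthy_binary_mapping = {'Healthy': 0, 'Unhealthy': 1}


def to_binary_label(class_name):
    """Collapse a four-class label into a Stage 1 label (hypothetical helper)."""
    if class_name not in class_mapping:
        raise KeyError(f'Unknown class: {class_name!r}')
    # Assumption: 'Normal' is the only healthy class; everything else is a disease.
    return 'Healthy' if class_name == 'Normal' else 'Unhealthy'


# Hypothetical labels, as they would appear in the 'class' column of the pairs CSV.
sample_labels = ['COVID', 'Normal', 'Lung_Opacity', 'Normal']

# Stage 1: every image receives a Healthy/Unhealthy label and its numeric index.
stage1_labels = [to_binary_label(c) for c in sample_labels]
stage1_indices = [healthy_binary_mapping[b] for b in stage1_labels]

# Stage 2: only the images flagged as Unhealthy go on to disease classification.
stage2_labels = [c for c in sample_labels if to_binary_label(c) == 'Unhealthy']

print(stage1_labels)
print(stage1_indices)
print(stage2_labels)
```

Deriving the Stage 1 label from the four-class label, instead of storing it separately, keeps the two label sets from drifting apart.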

This notebook ensures the dataset is correctly structured and encoded according to the defined machine learning objectives before any model training begins.

In [1]:
# Import necessary libraries
import os
import tensorflow as tf
import numpy as np
import pandas as pd
from glob import glob
import json

2025-11-08 20:27:46.241745: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-08 20:27:46.301783: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-08 20:27:47.482577: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


## Section 1: Utility Functions and Data Persistence

This section defines two helper functions, one that pairs image and segmentation mask file paths and one that loads class-specific metadata, and then calls them in a loop over the four classes. After the loop, it prints the aggregated record counts and saves the complete dataset index and metadata to two master CSV files.

---

### 1.1 Utility Functions Definition

#### 1.1.1 `get_image_mask_pairs(cls, DATA_DIR)`
This function is crucial for preparing the data for both segmentation and classification tasks by mapping input images to their corresponding ground-truth segmentation masks.

* **Functionality:** It constructs file paths for images and masks within the specified class (`cls`) subdirectory. It iterates through all found image files (`.png`) and checks for the existence of a mask file with the identical name in the mask directory.
* **Output Structure:** It returns a list of tuples formatted as `(image_path, mask_path, class)`. Importantly, if a mask is not found for an image, the `mask_path` element is set to `None`, ensuring that all image records are captured, regardless of mask availability.

In [2]:
def get_image_mask_pairs(cls, DATA_DIR= None):
    '''
    Retrieves and pairs image and corresponding mask file paths for a specific class.

    This function iterates through all images in a class directory and attempts
    to find a matching segmentation mask.

    Args:
        cls (str): The name of the class (e.g., 'COVID', 'Normal').
        DATA_DIR (str, optional): The root directory of the dataset. Defaults to None.

    Returns:
        list: A list of tuples, where each tuple contains 
              (image_path, mask_path, class_name).
    '''
    # Construct the directory paths for images and masks based on the class and data root
    img_dir = os.path.join(DATA_DIR, cls, 'images')
    mask_dir = os.path.join(DATA_DIR, cls, 'masks')

    # Get a sorted list of all image file paths (assuming .png format).
    # Masks are looked up by file name in the loop below, so no separate
    # listing of the mask directory is needed.
    images = sorted(glob(os.path.join(img_dir, '*.png')))

    image_mask_pairs = []
    # Iterate through each image path
    for img_path in images:
        # Extract the base file name (e.g., 'patient001.png')
        fname = os.path.basename(img_path)
        # Construct the expected full path for the corresponding mask file
        mask_path = os.path.join(mask_dir, fname)

        # Check if the expected mask file exists
        if os.path.exists(mask_path):
            # If the mask exists, pair the image and mask paths
            image_mask_pairs.append((img_path, mask_path, cls))
        else:
            # If the mask does not exist, keep the image and record None for the mask
            image_mask_pairs.append((img_path, None, cls))

    return image_mask_pairs

#### 1.1.2 `get_metadata(cls, DATA_DIR)`
This function standardizes the process of loading supplemental clinical or descriptive data associated with each class.

* **Functionality:** It looks for an Excel file named `{cls}.metadata.xlsx` (for example, `COVID.metadata.xlsx`) directly under `DATA_DIR`.
* **Output Structure:** If the file exists, it is loaded into a pandas DataFrame; otherwise, the function returns `None`.

In [3]:
def get_metadata(cls, DATA_DIR= None):
    '''
    Loads the metadata Excel file associated with a specific diagnostic class.

    The function constructs the expected file path for the metadata based on the class
    name and attempts to read it into a pandas DataFrame.

    Args:
        cls (str): The name of the class (e.g., 'COVID', 'Normal').
        DATA_DIR (str, optional): The root directory of the dataset. Defaults to None.

    Returns:
        pandas.DataFrame or None: The metadata DataFrame if the file exists, 
                                  otherwise returns None.
    '''
    # Construct the full path to the metadata Excel file using f-strings
    metadata_path = os.path.join(DATA_DIR, f'{cls}.metadata.xlsx')
    
    # Check if the metadata file exists at the constructed path
    if os.path.exists(metadata_path):
        # Read the Excel file into a pandas DataFrame
        df = pd.read_excel(metadata_path)
        return df
        
    # If the file does not exist, return None
    return None

### 1.2 Data Consolidation and Saving

After all individual class data have been aggregated into `all_pairs` (list of image/mask tuples) and `all_meta` (list of metadata DataFrames), this part of the script finalizes the dataset preparation.

#### 1.2.1 Data Summary
The total counts of the aggregated records are printed to confirm that the data loading step was complete:
* The script confirms the **Total image/Mask pairs** count using `len(all_pairs)`.
* The script also confirms the **Total metadata records** count after concatenating all class-specific metadata.

| Class | Image/mask pairs loaded | Metadata records loaded |
| :--- | :--- | :--- |
| `COVID` | 3616 | 3616 |
| `Normal` | 10192 | 10192 |
| `Viral Pneumonia` | 1345 | 1345 |
| `Lung_Opacity` | 6012 | 6012 |
| `Healthy` (derived: `Normal` only) | 10192 | 10192 |
| `Unhealthy` (derived: all disease classes) | 10973 | 10973 |
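
The two derived rows follow directly from the four per-class counts. As a small arithmetic check, using the counts reported in the table (the variable names are illustrative only):

```python
# Per-class counts as reported in the table above.
per_class_counts = {
    'COVID': 3616,
    'Normal': 10192,
    'Viral Pneumonia': 1345,
    'Lung_Opacity': 6012,
}

# Healthy is the Normal class alone; Unhealthy is the sum of the disease classes.
healthy_count = per_class_counts['Normal']
unhealthy_count = sum(n for cls, n in per_class_counts.items() if cls != 'Normal')
total_count = healthy_count + unhealthy_count

print(f'Healthy: {healthy_count}')
print(f'Unhealthy: {unhealthy_count}')
print(f'Total: {total_count}')
```

The Unhealthy group is only slightly larger than the Healthy group, so the Stage 1 task is close to balanced even though the four-class task is not.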

In [4]:
# Define the root directory for the dataset
DATA_DIR = './data'
# Define the four classes for which data is being loaded
classes = ['COVID', 'Normal', 'Viral Pneumonia', 'Lung_Opacity']

# Initialize lists to store all image/mask pairs and metadata DataFrames
all_pairs, all_meta = [], []

# Loop through each class name to load and aggregate data
for cls in classes:
    # 1. Load image and mask paths for the current class using the utility function
    pairs = get_image_mask_pairs(cls, DATA_DIR= DATA_DIR)
    # Extend the list with the paths found for the current class
    all_pairs.extend(pairs)
    # Print a confirmation message
    print(f'{cls}: {len(pairs)} image-mask pairs loaded Successfully !')
    
    # 2. Load the metadata DataFrame for the current class
    meta = get_metadata(cls, DATA_DIR= DATA_DIR)
    
    # Check if the metadata was successfully loaded
    if meta is not None:
        # Append the loaded DataFrame to the list of all metadata
        all_meta.append(meta)
        # Print a confirmation message
        print(f'{cls}: Metadata loaded successfully ({len(meta)} records).')

COVID: 3616 image-mask pairs loaded Successfully !
COVID: Metadata loaded successfully (3616 records).
Normal: 10192 image-mask pairs loaded Successfully !
Normal: Metadata loaded successfully (10192 records).
Viral Pneumonia: 1345 image-mask pairs loaded Successfully !
Viral Pneumonia: Metadata loaded successfully (1345 records).
Lung_Opacity: 6012 image-mask pairs loaded Successfully !
Lung_Opacity: Metadata loaded successfully (6012 records).


In [5]:
# Print the total count of all image/mask pairs loaded
print(f'Total image/Mask pairs: {len(all_pairs)}')

# Check if the list of metadata DataFrames is not empty
if all_meta:
    # Concatenate all individual metadata DataFrames into a single master DataFrame
    meta_df = pd.concat(all_meta, ignore_index= True)
    # Print the total count of all consolidated metadata records
    print(f'Total metadata records: {len(meta_df)}')
else:
    # If no metadata was loaded, set the master DataFrame to None
    meta_df = None

Total image/Mask pairs: 21165
Total metadata records: 21165
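
Note that the pair count includes every image, whether or not a mask was found for it, so the total alone does not show how many masks are missing. One possible check is sketched below on a small hypothetical list, `sample_pairs`, that uses the same `(image_path, mask_path, class)` layout as `all_pairs`; the paths in it are made up for illustration.

```python
from collections import Counter

# Hypothetical records in the same layout as all_pairs.
sample_pairs = [
    ('data/COVID/images/a.png', 'data/COVID/masks/a.png', 'COVID'),
    ('data/COVID/images/b.png', None, 'COVID'),
    ('data/Normal/images/c.png', 'data/Normal/masks/c.png', 'Normal'),
]

# Count, per class, the records whose mask path is None.
missing_per_class = Counter(
    cls for _, mask_path, cls in sample_pairs if mask_path is None
)
total_missing = sum(missing_per_class.values())

print(f'Images without a mask: {total_missing}')
for cls, count in sorted(missing_per_class.items()):
    print(f'  {cls}: {count}')
```

Running the same comprehension over `all_pairs` would report the real numbers for this dataset.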


#### 1.2.2 Final File Persistence
The aggregated data is structured into two pandas DataFrames and saved as CSV files for efficient loading in downstream notebooks.
* The image and mask paths are converted to `pairs_df` and saved as **`./data/all_image_mask_pairs.csv`**.
* The concatenated metadata is saved as **`./data/all_metadata.csv`**.

This step marks the completion of the physical data indexing and preparation stage.

In [6]:
# Create a pandas DataFrame from the aggregated list of image/mask paths
# Columns are explicitly named: 'image_path', 'mask_path', and 'class'
pairs_df = pd.DataFrame(all_pairs, columns= ['image_path', 'mask_path', 'class'])

# Save the DataFrame containing all image and mask indices to a CSV file
# index=False prevents pandas from writing row numbers to the file
pairs_df.to_csv('./data/all_image_mask_pairs.csv', index= False)

# Print a confirmation message for the image/mask pairs file
print('all_image_mask_pairs CSV file created successfully !')

# Save the consolidated metadata to a separate CSV file.
# meta_df is None when no metadata file was found, so guard the call.
if meta_df is not None:
    meta_df.to_csv('./data/all_metadata.csv', index= False)
    # Print a confirmation message for the metadata file
    print('all_metadata CSV file created successfully !')

all_image_mask_pairs CSV file created successfully !
all_metadata CSV file created successfully !
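
One detail matters when a downstream notebook reads the pairs file back: a `None` mask path is written as an empty CSV field, and `pd.read_csv` returns it as `NaN` rather than `None`. The sketch below shows the round trip on a tiny hypothetical DataFrame, written to an in-memory buffer instead of `./data/`; the name `demo_pairs_df` and its rows are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical two-row index; the second image has no mask.
demo_pairs_df = pd.DataFrame(
    [('img_a.png', 'mask_a.png', 'COVID'), ('img_b.png', None, 'Normal')],
    columns=['image_path', 'mask_path', 'class'],
)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
demo_pairs_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The missing mask comes back as NaN, so test with notna() rather than `is None`.
has_mask = reloaded['mask_path'].notna()
segmentation_ready = reloaded[has_mask]

print(reloaded)
print(f'Rows usable for segmentation: {len(segmentation_ready)}')
```

Filtering with `notna()` keeps classification code working on all rows while segmentation code sees only the rows that have a mask.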


## Section 2: Classification Mapping Configuration

This section is dedicated to establishing the **taxonomy** for the entire deep learning project. It defines the three hierarchical classification schemes (four-class, binary, and three-class unhealthy) and creates both the **class-to-index** and **index-to-class** mappings. These six dictionaries are then permanently saved as JSON files, ensuring consistent numerical encoding of class labels across all subsequent training and evaluation stages.

---

### 2.1 Configuration of Output File Paths

The first step defines the file paths under `./data/` where the six mapping files will be stored. Note that the four-class reverse mapping is written to `index_mapping.json`, not `class_index_mapping.json`.

| Path variable | JSON file | Purpose |
| :--- | :--- | :--- |
| `class_mapping_path` | `class_mapping.json` | Four-class mapping (Class $\rightarrow$ Index) |
| `healthy_binary_mapping_path` | `healthy_binary_mapping.json` | Binary mapping (Class $\rightarrow$ Index) |
| `unhealthy_mapping_path` | `unhealthy_mapping.json` | Unhealthy-specific mapping (Class $\rightarrow$ Index) |
| `class_index_mapping_path` | `index_mapping.json` | Four-class reverse mapping (Index $\rightarrow$ Class) |
| `healthy_binary_index_mapping_path` | `healthy_binary_index_mapping.json` | Binary reverse mapping (Index $\rightarrow$ Class) |
| `unhealthy_index_mapping_path` | `unhealthy_index_mapping.json` | Unhealthy-specific reverse mapping (Index $\rightarrow$ Class) |

In [19]:
class_mapping_path = './data/class_mapping.json'
healthy_binary_mapping_path = './data/healthy_binary_mapping.json'
unhealthy_mapping_path = './data/unhealthy_mapping.json'
class_index_mapping_path = './data/index_mapping.json'
healthy_binary_index_mapping_path = './data/healthy_binary_index_mapping.json'
unhealthy_index_mapping_path = './data/unhealthy_index_mapping.json'

### 2.2 Creating and Saving Class-to-Index Mappings

Three primary dictionary mappings are defined to convert descriptive class names into numerical indices suitable for model training. Each is immediately saved to a JSON file.

#### 2.2.1 Four-Class Mapping (`class_mapping.json`)
This is the standard multi-class mapping used for general classification training.

$$\text{class\_mapping} = \{\text{'COVID': 0, 'Normal': 1, 'Viral Pneumonia': 2, 'Lung\_Opacity': 3}\}$$


In [20]:
class_mapping = {'COVID': 0, 'Normal': 1, 'Viral Pneumonia': 2, 'Lung_Opacity': 3}

with open(class_mapping_path , 'w') as f:
    json.dump(class_mapping, f, indent= 4)

print('Class Mapping json file created Successfully !')

Class Mapping json file created Successfully !


#### 2.2.2 Binary Mapping (`healthy_binary_mapping.json`)
This mapping supports the **Stage 1: Binary Classification (Triage)** objective, distinguishing between healthy and diseased states.

$$\text{healthy\_binary\_mapping} = \{\text{'Healthy': 0, 'Unhealthy': 1}\}$$

In [21]:
healthy_binary_mapping = {'Healthy': 0, 'Unhealthy': 1}
with open(healthy_binary_mapping_path, 'w') as f:
    json.dump(healthy_binary_mapping, f, indent= 2)

print('Healthy binary mapping json file created successfully !')

Healthy binary mapping json file created successfully !


#### 2.2.3 Unhealthy Mapping (`unhealthy_mapping.json`)
This mapping supports the **Stage 2: Specific Disease Classification** objective, focusing only on the three pathological classes.

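$$\text{unhealthy\_mapping} = \{\text{'COVID': 0, 'Viral Pneumonia': 1, 'Lung\_Opacity': 2}\}$$

In [22]:
# 'Lung_Opacity' is spelled with an underscore to match the class label used in
# `classes` and `class_mapping`, so lookups from the 'class' column succeed.
unhealthy_mapping = {'COVID': 0, 'Viral Pneumonia': 1, 'Lung_Opacity': 2}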
with open(unhealthy_mapping_path, 'w') as f:
    json.dump(unhealthy_mapping, f, indent= 3)

print('Unhealthy mapping json file created successfully !')

Unhealthy mapping json file created successfully !


### 2.3 Generating and Saving Index-to-Class Mappings

The reverse mappings are generated using a simple dictionary comprehension on the primary mappings. These are essential for converting a model's numerical output (index) back into a human-readable class label during prediction and evaluation.

* **`class_index_mapping`** is generated from `class_mapping`.
* **`healthy_binary_index_mapping`** is generated from `healthy_binary_mapping`.
* **`unhealthy_index_mapping`** is generated from `unhealthy_mapping`.

All three reverse mappings are then saved to their respective JSON files, concluding the data configuration stage.
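
A caveat applies when these reverse mappings are loaded again: JSON object keys are always strings, so the integer indices are stored as `"0"`, `"1"`, and so on. The sketch below shows the round trip in memory with `json.dumps` and `json.loads` and one way to restore integer keys; the name `restored` is an assumption made for the example.

```python
import json

class_mapping = {'COVID': 0, 'Normal': 1, 'Viral Pneumonia': 2, 'Lung_Opacity': 3}
class_index_mapping = {index: name for name, index in class_mapping.items()}

# Serialize and parse in memory; json.dump / json.load on a file behave the same way.
serialized = json.dumps(class_index_mapping, indent=4)
loaded = json.loads(serialized)

# The integer keys were converted to strings during serialization.
print(list(loaded.keys()))

# Convert the keys back to int so that a model's predicted index can be looked up.
restored = {int(index): name for index, name in loaded.items()}
print(restored[2])
```

Without the conversion, a lookup such as `loaded[2]` raises `KeyError`, because only the string key `'2'` exists after loading.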

In [23]:
class_index_mapping = {index: class_name for class_name, index in class_mapping.items()}
healthy_binary_index_mapping = {index: healthy_class for healthy_class, index in healthy_binary_mapping.items()}
unhealthy_index_mapping = {index: unhealthy_class for unhealthy_class, index in unhealthy_mapping.items()}

In [24]:
with open(class_index_mapping_path , 'w') as f:
    json.dump(class_index_mapping, f, indent= 4)

print('Class Index Mapping json file created Successfully !')

Class Index Mapping json file created Successfully !


In [26]:
with open(healthy_binary_index_mapping_path , 'w') as f:
    json.dump(healthy_binary_index_mapping, f, indent= 4)

print('Healthy Binary Index Mapping json file created Successfully !')

Healthy Binary Index Mapping json file created Successfully !


In [27]:
with open(unhealthy_index_mapping_path , 'w') as f:
    json.dump(unhealthy_index_mapping, f, indent= 4)

print('Unhealthy Index Mapping json file created Successfully !')

Unhealthy Index Mapping json file created Successfully !
