# ZhangLab Chest X-ray Dataset Preprocessing Guide

This notebook walks through downloading and preprocessing the ZhangLab chest X-ray dataset for use with ART-ASyn.

## Dataset Information
- **Source**: [ZhangLab Data Chest X-ray](https://datasetninja.com/zhang-lab-data-chest-xray)
- **Type**: Chest X-ray images for medical diagnosis
- **Format**: JPEG images
- **Classes**: Normal (Healthy) vs. Diseased conditions

## Overview
The ZhangLab dataset contains chest X-ray images, each labeled with one of two classes:
- **NORMAL**: Healthy chest X-rays
- **ABNORMAL**: Diseased chest X-rays

The preprocessing steps below copy each image into a `healthy/` or `diseased/` folder under `train/` or `test/`, a per-class layout suited to machine learning workflows.

---

In [1]:
# Required Libraries
import shutil
from pathlib import Path

from tqdm import tqdm

print("✅ All required libraries imported successfully")

✅ All required libraries imported successfully


## Setup and Configuration

### Step 1: Download the Dataset

Before running this notebook, you need to:

1. **Visit the dataset source**: [ZhangLab Data Chest X-ray](https://datasetninja.com/zhang-lab-data-chest-xray)
2. **Download the dataset** to your local machine
   - **File Size**: 1.18GB
   - **Downloaded File**: `zhanglabdata-chest-xray-DatasetNinja.tar`
3. **Extract the dataset** to create the `../ZhangLabData` folder structure:
   ```
   ZhangLabData/           # Extract here
   ├── train/
   │   └── img/
   │       ├── NORMAL-0001-1.jpeg
   │       ├── ABNORMAL-0001-2.jpeg
   │       └── ...
   └── test/
       └── img/
           ├── NORMAL-0002-1.jpeg
           └── ...
   ```
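
The extraction step can also be done from Python. The following is a minimal sketch, assuming the archive was saved one level above the notebook folder under the file name listed above; `extract_dataset` is a hypothetical helper, not part of the original notebook. It also assumes the archive unpacks `train/` and `test/` directly rather than inside an extra top-level folder, which is worth checking after extraction.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive, dest):
    """Extract a .tar archive into dest and return the top-level entry names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive) as tar:
        tar.extractall(dest, **kwargs)
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the archive was saved.
    archive = Path("../zhanglabdata-chest-xray-DatasetNinja.tar")
    if archive.exists():
        print(extract_dataset(archive, "../ZhangLabData"))
    else:
        print(f"Archive not found: {archive}")
```

If the printed entries do not include `train` and `test`, the archive probably contains a wrapper folder and `root` in Step 2 should point inside it.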

**Note**: The main notebook directory should be at the same level as the `ZhangLabData` folder:
```
data/
├── ZhangLab/
│   └── Data Preprocessing.ipynb
└── ZhangLabData/          # Dataset folder
```

### Step 2: Configure Data Paths

Set the root directory where your ZhangLab dataset is located:

In [2]:
# Configure dataset root directory
# Change this path to match your local dataset location
root = "../ZhangLabData"

# Verify the dataset directory exists
if Path(root).exists():
    print(f"✅ Dataset directory found: {root}")
    print(f"📁 Directory contents: {list(Path(root).iterdir())}")
else:
    print(f"❌ Dataset directory not found: {root}")
    print("Please ensure you have downloaded and extracted the dataset correctly.")
    print("Expected structure: ../ZhangLabData/{train,test}/{img/*.jpeg}")

✅ Dataset directory found: ../ZhangLabData
📁 Directory contents: [WindowsPath('../ZhangLabData/LICENSE.md'), WindowsPath('../ZhangLabData/meta.json'), WindowsPath('../ZhangLabData/README.md'), WindowsPath('../ZhangLabData/test'), WindowsPath('../ZhangLabData/train')]


## Data Preprocessing

### Overview

The preprocessing pipeline will:
1. **Organize images** by class (healthy vs diseased)
2. **Create structured directories** for ML workflows
3. **Separate training and testing** data
4. **Skip files that already exist** at the destination, so re-running the notebook does not copy them again
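
The four steps above can be sketched as one reusable function that handles either split. This is an illustrative sketch, not the notebook's own code: `organize_split`, `CLASS_DIRS`, and the `out_root` parameter are hypothetical names. Unlike the cells below, which treat every non-`NORMAL` file as diseased, this version skips files whose prefix is neither `NORMAL` nor `ABNORMAL`.

```python
import shutil
from pathlib import Path

# Mapping from the filename class prefix to the output folder name.
CLASS_DIRS = {"NORMAL": "healthy", "ABNORMAL": "diseased"}


def organize_split(root, split, out_root="."):
    """Copy root/split/img/*.jpeg into out_root/split/{healthy,diseased}.

    Returns per-class file counts plus the number of files copied in this
    call; files already present at the destination are not copied again.
    """
    counts = {"healthy": 0, "diseased": 0, "copied": 0}
    for source in sorted((Path(root) / split / "img").glob("*.jpeg")):
        prefix = source.stem.split("-")[0]
        folder = CLASS_DIRS.get(prefix)
        if folder is None:
            continue  # skip names that match neither class prefix
        dest_dir = Path(out_root) / split / folder
        dest_dir.mkdir(parents=True, exist_ok=True)
        destination = dest_dir / source.name
        counts[folder] += 1
        if not destination.exists():
            shutil.copy(source, destination)
            counts["copied"] += 1
    return counts
```

Calling `organize_split("../ZhangLabData", "train")` and `organize_split("../ZhangLabData", "test")` would cover both splits; a second call on the same split should report `copied` as 0.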

### Dataset Structure After Preprocessing

```
📂 ZhangLab/
├── train/
│   ├── healthy/     # NORMAL class images
│   └── diseased/   # ABNORMAL class images
└── test/
    ├── healthy/     # NORMAL class images
    └── diseased/   # ABNORMAL class images
```

### File Naming Convention

Original files follow the pattern: `{CLASS}-{STUDY}-{INDEX}.jpeg`
- `CLASS`: NORMAL or ABNORMAL
- `STUDY`: Study identifier number
- `INDEX`: Image index within the study

**Example**: `NORMAL-0001-1.jpeg` means:
- Class: Normal (Healthy)
- Study ID: 0001
- Image index: 1
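
As a small illustration of this convention, the sketch below splits a filename into its three fields and rejects names that do not fit the pattern. `parse_filename` is a hypothetical helper introduced here for illustration; the processing cells below perform the same split inline without validation.

```python
from pathlib import Path


def parse_filename(name):
    """Split 'CLASS-STUDY-INDEX.jpeg' into (class, study, index) strings."""
    parts = Path(name).stem.split("-")
    if len(parts) != 3 or parts[0] not in ("NORMAL", "ABNORMAL"):
        raise ValueError(f"Unexpected filename: {name}")
    return tuple(parts)


print(parse_filename("NORMAL-0001-1.jpeg"))  # ('NORMAL', '0001', '1')
```

The study and index fields are kept as strings so that leading zeros such as `0001` are preserved.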

## Training Data Preprocessing

The following code processes the training dataset:

In [3]:
# Process Training Data
print("🔄 Processing training data...")
print("=" * 50)

# Create organized directory structure
healthy_dir = Path("train/healthy")
diseased_dir = Path("train/diseased")

healthy_dir.mkdir(parents=True, exist_ok=True)
diseased_dir.mkdir(parents=True, exist_ok=True)

print(f"📁 Created directories: {healthy_dir}, {diseased_dir}")

# Process each image file
study_idx = []
total_processed = 0
normal_count = 0
abnormal_count = 0

for source in tqdm(list(Path(f"{root}/train/img").glob("*.jpeg")), desc="Processing training images"):
    # Parse filename: CLASS-STUDY-INDEX.jpeg
    status, study, idx = source.stem.split("-")
    
    # Determine destination based on class
    if status == "NORMAL":
        destination = healthy_dir / source.name
        normal_count += 1
    else:  # ABNORMAL
        destination = diseased_dir / source.name
        abnormal_count += 1
    
    # Copy only files that are not already present, so total_processed
    # counts new copies and stays at 0 when the notebook is re-run
    if not destination.exists():
        shutil.copy(source, destination)
        total_processed += 1
    
    study_idx.append((status, study, idx))

# Print summary statistics
print("\n📊 Training Data Processing Summary:")
print(f"   Total images processed: {total_processed}")
print(f"   Normal (Healthy) images: {normal_count}")
print(f"   Abnormal (Diseased) images: {abnormal_count}")
print(f"   Total original files: {normal_count + abnormal_count}")

print("✅ Training data preprocessing completed!")

🔄 Processing training data...
📁 Created directories: train\healthy, train\diseased

Processing training images: 100%|██████████| 5232/5232 [00:00<00:00, 35020.92it/s]

📊 Training Data Processing Summary:
   Total images processed: 0
   Normal (Healthy) images: 1349
   Abnormal (Diseased) images: 3883
   Total original files: 5232
✅ Training data preprocessing completed!

## Testing Data Preprocessing

The following code processes the testing dataset using the same methodology as the training data:

In [4]:
# Process Testing Data
print("🔄 Processing testing data...")
print("=" * 50)

# Create organized directory structure
healthy_dir = Path("test/healthy")
diseased_dir = Path("test/diseased")

healthy_dir.mkdir(parents=True, exist_ok=True)
diseased_dir.mkdir(parents=True, exist_ok=True)

print(f"📁 Created directories: {healthy_dir}, {diseased_dir}")

# Process each image file
study_idx = []
total_processed = 0
normal_count = 0
abnormal_count = 0

for source in tqdm(list(Path(f"{root}/test/img").glob("*.jpeg")), desc="Processing testing images"):
    # Parse filename: CLASS-STUDY-INDEX.jpeg
    status, study, idx = source.stem.split("-")
    
    # Determine destination based on class
    if status == "NORMAL":
        destination = healthy_dir / source.name
        normal_count += 1
    else:  # ABNORMAL
        destination = diseased_dir / source.name
        abnormal_count += 1
    
    # Copy only files that are not already present, so total_processed
    # counts new copies and stays at 0 when the notebook is re-run
    if not destination.exists():
        shutil.copy(source, destination)
        total_processed += 1
    
    study_idx.append((status, study, idx))

# Print summary statistics
print("\n📊 Testing Data Processing Summary:")
print(f"   Total images processed: {total_processed}")
print(f"   Normal (Healthy) images: {normal_count}")
print(f"   Abnormal (Diseased) images: {abnormal_count}")
print(f"   Total original files: {normal_count + abnormal_count}")

print("✅ Testing data preprocessing completed!")

🔄 Processing testing data...
📁 Created directories: test\healthy, test\diseased

Processing testing images: 100%|██████████| 624/624 [00:00<00:00, 18546.64it/s]

📊 Testing Data Processing Summary:
   Total images processed: 0
   Normal (Healthy) images: 234
   Abnormal (Diseased) images: 390
   Total original files: 624
✅ Testing data preprocessing completed!

## Final Dataset Summary

The final cell counts the files in each output folder and reports the class distribution for each split:

In [5]:
# Final Dataset Verification and Summary
print("📋 Final Dataset Summary")
print("=" * 60)

def count_files(directory):
    """Count files in a directory"""
    return len(list(Path(directory).glob("*"))) if Path(directory).exists() else 0

# Define dataset paths
dataset_paths = {
    "Training - Healthy": "train/healthy",
    "Training - Diseased": "train/diseased", 
    "Testing - Healthy": "test/healthy",
    "Testing - Diseased": "test/diseased"
}

# Count files and display summary
total_train = 0
total_test = 0

print("📁 Dataset Structure:")
for label, path in dataset_paths.items():
    count = count_files(path)
    print(f"   {label:<20}: {count:>6} images")
    
    if "Training" in label:
        total_train += count
    else:
        total_test += count

print("-" * 60)
print(f"{'Total Training':<20}: {total_train:>6} images")
print(f"{'Total Testing':<20}: {total_test:>6} images")
print(f"{'Total Dataset':<20}: {total_train + total_test:>6} images")

# Calculate class distribution
train_healthy = count_files("train/healthy")
train_diseased = count_files("train/diseased")
test_healthy = count_files("test/healthy")
test_diseased = count_files("test/diseased")

print("\n📊 Class Distribution:")
print("   Training Set:")
print(f"      Healthy  : {train_healthy:>6} ({train_healthy/total_train*100:.1f}%)")
print(f"      Diseased : {train_diseased:>6} ({train_diseased/total_train*100:.1f}%)")
print("   Testing Set:")
print(f"      Healthy  : {test_healthy:>6} ({test_healthy/total_test*100:.1f}%)")
print(f"      Diseased : {test_diseased:>6} ({test_diseased/total_test*100:.1f}%)")

print("\n✅ Dataset preprocessing completed successfully!")

📋 Final Dataset Summary
📁 Dataset Structure:
   Training - Healthy  :   1349 images
   Training - Diseased :   3883 images
   Testing - Healthy   :    234 images
   Testing - Diseased  :    390 images
------------------------------------------------------------
Total Training      :   5232 images
Total Testing       :    624 images
Total Dataset       :   5856 images

📊 Class Distribution:
   Training Set:
      Healthy  :   1349 (25.8%)
      Diseased :   3883 (74.2%)
   Testing Set:
      Healthy  :    234 (37.5%)
      Diseased :    390 (62.5%)

✅ Dataset preprocessing completed successfully!
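
Counts alone do not show whether every source image reached its destination. As an optional extra check, the sketch below compares file names between the source folder and the organized folders for one split. `check_split` and its `out_root` parameter are hypothetical names introduced here, and the sketch assumes the folder layout described earlier in this notebook.

```python
from pathlib import Path


def check_split(root, split, out_root="."):
    """Compare source image names with the names copied for one split."""
    source = {p.name for p in (Path(root) / split / "img").glob("*.jpeg")}
    copied = {
        p.name
        for folder in ("healthy", "diseased")
        for p in (Path(out_root) / split / folder).glob("*.jpeg")
    }
    return {
        "missing": sorted(source - copied),  # in the source, not yet copied
        "extra": sorted(copied - source),    # copied, but absent from the source
    }
```

Both lists should be empty for `check_split("../ZhangLabData", "train")` and `check_split("../ZhangLabData", "test")` once preprocessing has finished.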
