# Dataset Ingestion and Cleaning (Chest X-ray Dataset)

This notebook ingests, checks, analyzes, and filters the ChestMNIST dataset. It covers data loading, data-quality checks, basic statistics, visualizations, and filtering driven by short natural-language prompts (matched by keyword). The exercises refine a statistical prompt and apply a statistical test.

**Author**: Mohammad Rezapourian
<br>
**Date**: May 12, 2025
<br>
**License**: Apache-2.0

## Table of Contents
1. [Initial Setup](#initial-setup)
2. [Data Loading and Cleaning](#data-loading-and-cleaning)
3. [Basic Statistics](#basic-statistics)
4. [Interactive Filtering](#interactive-filtering)
5. [Conclusion](#conclusion)

## Initial Setup

Set up the environment for Google Colab. If running locally, skip the Colab-specific lines in this cell and follow the note after it.

In [None]:
# Colab setup
!wget -q -O - https://github.com/University-Clinic-of-Neuroradiology/python-bootcamp/archive/refs/heads/main.tar.gz | tar -xzf - --strip-components=2 python-bootcamp-main/notebooks/DeepLearning
import os
import sys
from google.colab import output
output.enable_custom_widget_manager()
sys.path.insert(0, 'DeepLearning')
os.chdir(sys.path[0])
%pip install -q numpy matplotlib seaborn pandas scipy
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import scipy.stats as stats

**Note**: If running locally, skip the `wget` and `google.colab` lines, install the dependencies (`pip install numpy matplotlib seaborn pandas scipy`), run the imports, and use `%matplotlib inline` or `%matplotlib notebook`.

## Data Loading and Cleaning

Load the ChestMNIST dataset and check it for missing values, duplicate rows, invalid labels, and out-of-range pixel values.

In [None]:
# Define paths
data_path = './Data/ChestMNIST/chestmnist.npz'
csv_path = './Data/ChestMNIST/dataset.csv'

# Load NumPy dataset
try:
    ds = np.load(data_path)
    print('Dataset keys:', list(ds.keys()))
except FileNotFoundError:
    print('Error: chestmnist.npz not found. Ensure the dataset is in ./Data/ChestMNIST/')
    raise

# Load CSV
try:
    df = pd.read_csv(csv_path)
    print('CSV columns:', df.columns.tolist())
except FileNotFoundError:
    print('Error: dataset.csv not found. Ensure the CSV is in ./Data/ChestMNIST/')
    raise

# Display sizes
print(f'Training images: {len(ds["train_images"])}')
print(f'Validation images: {len(ds["val_images"])}')
print(f'Testing images: {len(ds["test_images"])}')
print(f'CSV rows: {len(df)}')
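Beyond the number of images, it is often useful to look at the shape and dtype of every array in the archive before going further. The following is a minimal sketch, not part of the original notebook: `summarize_npz` is a hypothetical helper, and the synthetic archive (including the `train_labels` key and the 28x28 image size) is an assumption that stands in for `chestmnist.npz` so the example runs without the dataset.

```python
import io

import numpy as np


def summarize_npz(npz):
    """Return {key: (shape, dtype name)} for every array in an opened .npz archive."""
    return {key: (npz[key].shape, str(npz[key].dtype)) for key in npz.files}


# Synthetic stand-in for chestmnist.npz (assumed layout), written to memory instead of disk.
buffer = io.BytesIO()
np.savez(
    buffer,
    train_images=np.zeros((4, 28, 28), dtype=np.uint8),
    train_labels=np.zeros((4, 14), dtype=np.uint8),
)
buffer.seek(0)

with np.load(buffer) as demo_ds:
    summary = summarize_npz(demo_ds)

for key, (shape, dtype) in summary.items():
    print(f'{key}: shape={shape}, dtype={dtype}')
```

With the real dataset, the same helper could be called as `summarize_npz(ds)`.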

In [None]:
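# Data-quality checks (these report problems; nothing is modified here)
condition_columns = ['Atelectasis', 'Cardiomegaly', 'Effusion', 'Infiltration', 'Mass', 'Nodule',
                    'Pneumonia', 'Pneumothorax', 'Consolidation', 'Edema', 'Emphysema',
                    'Fibrosis', 'Pleural Thickening', 'Hernia']
# Missing values per column and exact duplicate rows
print('Missing values:')
print(df.isnull().sum())
print('Duplicates:', df.duplicated().sum())
# Labels must be binary; anything other than 0 or 1 (including NaN) counts as invalid
invalid_labels = df[condition_columns].apply(lambda x: ~x.isin([0, 1])).sum()
print('Invalid labels:')
print(invalid_labels)
# Pixel intensities should lie in [0, 255]; if the array is stored as uint8 this
# check cannot fail, so the dtype is printed for reference
print('Train image dtype:', ds['train_images'].dtype)
invalid_pixels = (ds['train_images'] < 0) | (ds['train_images'] > 255)
print('Invalid pixels in train images:', np.any(invalid_pixels))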
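The checks above only report problems. If any turn up, one possible follow-up is to drop the offending rows. The sketch below is an illustration under assumptions: `clean_labels` is a hypothetical helper, the toy DataFrame replaces `dataset.csv`, and dropping rows (rather than repairing them) is only one of several reasonable policies.

```python
import pandas as pd


def clean_labels(frame, label_columns):
    """Drop exact duplicate rows and rows whose labels are missing or not 0/1."""
    deduplicated = frame.drop_duplicates()
    # isin([0, 1]) is False for NaN, so missing labels are removed as well.
    valid_rows = deduplicated[label_columns].isin([0, 1]).all(axis=1)
    return deduplicated[valid_rows].reset_index(drop=True)


# Toy data: one duplicate row, one out-of-range label, one missing label.
toy = pd.DataFrame({
    'Effusion': [0, 1, 1, 2, None],
    'Edema': [0, 1, 1, 0, 1],
})
cleaned = clean_labels(toy, ['Effusion', 'Edema'])
print(f'{len(toy)} rows before cleaning, {len(cleaned)} after')
```

On the real data this would be `df = clean_labels(df, condition_columns)`, run before the statistics in the next section.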

## Basic Statistics

Compute summary statistics and visualize condition prevalence.

In [None]:
# Summary statistics
print('Summary statistics:')
print(df[condition_columns].describe())
print('Samples per split:')
print(df['split'].value_counts())

# Prevalence
prevalence = df[condition_columns].mean() * 100
plt.figure(figsize=(10, 5))
sns.barplot(x=condition_columns, y=prevalence)
plt.xticks(rotation=45)
plt.title('Condition Prevalence (%)')
plt.ylabel('Prevalence (%)')
plt.show()

# Number of conditions per sample
df['num_conditions'] = df[condition_columns].sum(axis=1)
condition_counts = df['num_conditions'].value_counts().sort_index()
plt.figure(figsize=(8, 5))
sns.barplot(x=condition_counts.index, y=condition_counts.values)
plt.title('Number of Conditions per Sample')
plt.xlabel('Number of Conditions')
plt.ylabel('Count')
plt.show()

**Exercise**: Refine the prompt to show percentages instead of counts for the number of conditions.

**Prompt**: "Show the percentage of samples for each number of conditions."


In [None]:
# Exercise solution
total_samples = len(df)
condition_percentages = (condition_counts / total_samples) * 100
plt.figure(figsize=(8, 5))
sns.barplot(x=condition_percentages.index, y=condition_percentages.values)
plt.title('Percentage of Samples by Number of Conditions')
plt.xlabel('Number of Conditions')
plt.ylabel('Percentage (%)')
plt.show()
print('Percentages:')
for num, perc in condition_percentages.items():
    print(f'{num} conditions: {perc:.2f}%')
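The same percentages can also be obtained in one step with `value_counts(normalize=True)`, which returns fractions instead of counts. A small sketch on made-up values (the Series below is illustrative data, not taken from the dataset):

```python
import pandas as pd

# Illustrative "number of conditions" values for eight samples.
num_conditions = pd.Series([0, 0, 1, 2, 0, 1, 0, 3])

# normalize=True yields fractions that sum to 1; multiply by 100 for percentages.
percentages = num_conditions.value_counts(normalize=True).sort_index() * 100

for num, perc in percentages.items():
    print(f'{num} conditions: {perc:.2f}%')
```

Applied to the notebook's data, this would be `df['num_conditions'].value_counts(normalize=True).sort_index() * 100`.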

## Interactive Filtering

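Filter the dataset using short natural-language prompts. The function below does not parse language; it matches a few fixed, case-sensitive keywords in the prompt and returns the unfiltered data when none of them is found.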

In [None]:
def filter_data(prompt, df):
    if 'Pneumonia' in prompt:
        filtered = df[df['Pneumonia'] == 1]
        print(f'Filtered {len(filtered)} samples with Pneumonia.')
    elif 'Effusion and Edema' in prompt:
        filtered = df[(df['Effusion'] == 1) & (df['Edema'] == 1)]
        print(f'Filtered {len(filtered)} samples with Effusion and Edema.')
    elif 'healthy' in prompt:
        filtered = df[df[condition_columns].sum(axis=1) == 0]
        print(f'Filtered {len(filtered)} healthy samples.')
    else:
        filtered = df
        print('No filter applied.')
    return filtered

prompts = ['Show samples with Pneumonia.', 'Filter samples with both Effusion and Edema.', 'Get healthy samples.']
for prompt in prompts:
    print(f'Prompt: {prompt}')
    filtered_df = filter_data(prompt, df)
    print('Prevalence (%):')
    print(filtered_df[condition_columns].mean() * 100)
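The keyword matching in `filter_data` handles three hard-coded cases. One way to generalize it, shown here only as a sketch, is to look for every condition name in the prompt and require all mentioned conditions to be present. `filter_by_conditions` is a hypothetical function, not part of the original notebook, and the toy DataFrame is made up. Matching is by plain substring, so a name such as "Mass" would also match the word "massive".

```python
import pandas as pd


def filter_by_conditions(prompt, frame, label_columns):
    """Keep rows positive for every condition whose name appears in the prompt.

    Matching is case-insensitive. The word 'healthy' selects rows with no
    positive label. A prompt that mentions no known condition returns the
    frame unchanged.
    """
    text = prompt.lower()
    if 'healthy' in text:
        return frame[frame[label_columns].sum(axis=1) == 0]
    mentioned = [col for col in label_columns if col.lower() in text]
    if not mentioned:
        return frame
    return frame[(frame[mentioned] == 1).all(axis=1)]


toy = pd.DataFrame({
    'Pneumonia': [1, 0, 0, 0],
    'Effusion': [0, 1, 1, 0],
    'Edema': [0, 1, 0, 0],
})
columns = list(toy.columns)

pneumonia = filter_by_conditions('Show samples with Pneumonia.', toy, columns)
both = filter_by_conditions('samples with effusion and edema', toy, columns)
healthy = filter_by_conditions('Get healthy samples.', toy, columns)
unmatched = filter_by_conditions('Show everything.', toy, columns)

print(len(pneumonia), len(both), len(healthy), len(unmatched))
```

Passing the label columns as an argument, instead of reading a global, keeps the function usable on any DataFrame with binary label columns.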

**Exercise**: Filter samples with at least three conditions and test Effusion prevalence.

**Prompt**: "Filter samples with at least three conditions."


In [None]:
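# Exercise solution
# proportions_ztest lives in statsmodels, not scipy.stats (pip install statsmodels if needed)
from statsmodels.stats.proportion import proportions_ztest

filtered_df = df[df['num_conditions'] >= 3]
# Compare against the remaining samples so the two groups do not overlap
rest_df = df[df['num_conditions'] < 3]
print(f'Filtered {len(filtered_df)} samples.')
effusion_all = df['Effusion'].mean()
effusion_filtered = filtered_df['Effusion'].mean()
effusion_rest = rest_df['Effusion'].mean()
print(f'Effusion prevalence (all): {effusion_all * 100:.2f}%')
print(f'Effusion prevalence (filtered): {effusion_filtered * 100:.2f}%')
print(f'Effusion prevalence (rest): {effusion_rest * 100:.2f}%')
count_filtered = filtered_df['Effusion'].sum()
n_filtered = len(filtered_df)
count_rest = rest_df['Effusion'].sum()
n_rest = len(rest_df)
# Two-sided z-test for the difference between two independent proportions
z_stat, p_value = proportions_ztest([count_filtered, count_rest], [n_filtered, n_rest])
print(f'Z-statistic: {z_stat:.2f}, P-value: {p_value:.4f}')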
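To make clear what a two-proportion z-test computes, here is a minimal sketch of the pooled, two-sided version using only the standard library. `two_proportion_ztest` is a hypothetical helper written for illustration, the counts are made up, and the test assumes two independent groups that are large enough for the normal approximation to hold.

```python
import math


def two_proportion_ztest(count_a, n_a, count_b, n_b):
    """Two-sided pooled z-test for the difference between two independent proportions."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Pooled proportion under the null hypothesis that both groups share one rate.
    pooled = (count_a + count_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Two-sided tail probability of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 60 of 100 positives in one group, 40 of 100 in the other.
z, p = two_proportion_ztest(60, 100, 40, 100)
print(f'Z-statistic: {z:.2f}, P-value: {p:.4f}')
```

A small p-value here indicates that the difference in prevalence is unlikely under the hypothesis of equal rates; it says nothing about why the rates differ.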

## Conclusion

This notebook ingested, cleaned, and analyzed the ChestMNIST dataset, with interactive filtering and statistical tests.

**References**:
- https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community
- https://pandas.pydata.org/docs/
- https://seaborn.pydata.org/