# Natural-Language Queries on Tabular Data (Chest X-ray Dataset)

This notebook demonstrates how to answer natural-language queries about the ChestMNIST dataset, a collection of chest X-ray images with multi-label binary annotations for 14 medical conditions. We explore the dataset, answer each query with Chain-of-Thought (CoT) reasoning, verify the results, and provide exercises for double-checking the findings with Pandas. The notebook is designed to be run step by step.

**Author**: Mohammad Rezapourian
<br>
**First version**: May 12, 2025
<br>
**License**: Apache-2.0

## Table of Contents
1. [Initial Setup](#initial-setup)
   - [Setup for Google Colab](#setup-for-google-colab)
   - [Setup for Offline Use](#setup-for-offline-use)
2. [Data Loading and Exploration](#data-loading-and-exploration)
   - [Loading the ChestMNIST Dataset](#loading-the-chestmnist-dataset)
   - [Natural-Language Query Prompts](#natural-language-query-prompts)
   - [Visualizing Data](#visualizing-data)
3. [Chain-of-Thought Reasoning and Verification](#chain-of-thought-reasoning-and-verification)
   - [Query 1: Distribution of Labels](#query-1-distribution-of-labels)
   - [Query 2: Healthy vs. Diseased Samples](#query-2-healthy-vs-diseased-samples)
   - [Query 3: Most Common Co-occurring Conditions](#query-3-most-common-co-occurring-conditions)
4. [Exercises: Double-Checking with Pandas](#exercises-double-checking-with-pandas)
   - [Exercise 1: Verify Label Distribution](#exercise-1-verify-label-distribution)
   - [Exercise 2: Verify Healthy Samples](#exercise-2-verify-healthy-samples)
5. [Conclusion](#conclusion)
   - [References](#references)

## Initial Setup

This section sets up the environment for running the notebook, either in Google Colab or offline. Execute the appropriate setup based on your environment.
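If you are unsure which branch applies, one commonly used heuristic is to check whether the `google.colab` module is already loaded. The sketch below is a convenience under that assumption and not part of the original setup; `running_in_colab` is a hypothetical helper name.

```python
import sys

def running_in_colab() -> bool:
    """Return True if the google.colab module is already loaded (hypothetical helper)."""
    # Colab loads the google.colab package into the interpreter, so its presence
    # in sys.modules is a widely used heuristic, not a guarantee.
    return 'google.colab' in sys.modules

if running_in_colab():
    print('Colab detected: run the "Setup for Google Colab" cells.')
else:
    print('No Colab detected: run the "Setup for Offline Use" cells.')
```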

### Setup for Google Colab
<u>Execute these code blocks only in Google Colab!</u>

In [None]:
!wget -q -O - https://github.com/University-Clinic-of-Neuroradiology/python-bootcamp/archive/refs/heads/main.tar.gz | tar -xzf - --strip-components=2 python-bootcamp-main/notebooks/DeepLearning

In [None]:
import os
import sys
from google.colab import output
output.enable_custom_widget_manager()

sys.path.insert(0, 'DeepLearning')
os.chdir(sys.path[0])

In [None]:
%pip install -q ipympl numpy matplotlib seaborn pandas

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

### Setup for Offline Use

In [None]:
%matplotlib widget

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

## Data Loading and Exploration

We will load the ChestMNIST dataset and explore its structure using natural-language queries. The dataset contains chest X-ray images (28x28 grayscale) and multi-label binary classifications for 14 medical conditions.

### Loading the ChestMNIST Dataset

In [None]:
# Define the path to the dataset
data_path = './Data/ChestMNIST/chestmnist.npz'

# Load the dataset
ds = np.load(data_path)
print('Content of dataset:', list(ds.keys()))

In [None]:
# Display dataset sizes
print(f"Number of training images: {len(ds['train_images'])} and labels: {len(ds['train_labels'])}")
print(f"Number of validation images: {len(ds['val_images'])} and labels: {len(ds['val_labels'])}")
print(f"Number of testing images: {len(ds['test_images'])} and labels: {len(ds['test_labels'])}")

The dataset is split into training, validation, and testing sets. Each image is 28x28 pixels, and labels are binary vectors indicating the presence (1) or absence (0) of 14 conditions:
- 0: Atelectasis
- 1: Cardiomegaly
- 2: Effusion
- 3: Infiltration
- 4: Mass
- 5: Nodule
- 6: Pneumonia
- 7: Pneumothorax
- 8: Consolidation
- 9: Edema
- 10: Emphysema
- 11: Fibrosis
- 12: Pleural Thickening
- 13: Hernia
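To make the index-to-name mapping above concrete, the following sketch decodes a binary label vector into condition names. The vector `example` is made up for illustration and `decode_labels` is a hypothetical helper, not part of the dataset tooling.

```python
import numpy as np

# Condition names in label-index order, as listed above
condition_names = ['Atelectasis', 'Cardiomegaly', 'Effusion', 'Infiltration', 'Mass', 'Nodule',
                   'Pneumonia', 'Pneumothorax', 'Consolidation', 'Edema', 'Emphysema',
                   'Fibrosis', 'Pleural Thickening', 'Hernia']

def decode_labels(label_vector):
    """Return the names of all conditions flagged with 1 (hypothetical helper)."""
    return [name for name, flag in zip(condition_names, label_vector) if flag == 1]

# Made-up example vector: indices 2 (Effusion) and 9 (Edema) are set
example = np.zeros(14, dtype=np.uint8)
example[[2, 9]] = 1

print(decode_labels(example))        # ['Effusion', 'Edema']
print(decode_labels(np.zeros(14)))   # [] because no listed condition is flagged
```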

### Natural-Language Query Prompts

We will simulate natural-language queries to explore the dataset: each query is posed in plain English and answered with hand-written code. Example queries include:
- "How are the labels distributed across the training set?"
- "What percentage of samples are healthy (no conditions)?"
- "Which conditions frequently co-occur in the same image?"

These queries will be answered using NumPy and Pandas and visualized with Matplotlib/Seaborn.
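Since each query in this notebook is answered by hand-written code, "simulating" a query amounts to pairing a question with a computation. A minimal sketch of that pairing, assuming a small made-up label table (`toy_labels`) and a hypothetical `QUERY_HANDLERS` dictionary that does not exist in the original notebook:

```python
import pandas as pd

# Made-up toy label table: 4 samples, 3 conditions (illustration only)
toy_labels = pd.DataFrame(
    [[1, 0, 0],
     [0, 0, 0],
     [1, 1, 0],
     [0, 0, 0]],
    columns=['Atelectasis', 'Cardiomegaly', 'Effusion'],
)

def label_distribution(df):
    """Count positive samples per condition."""
    return df.sum(axis=0)

def healthy_percentage(df):
    """Percentage of rows in which every label is 0."""
    return (df.sum(axis=1) == 0).mean() * 100

# Hypothetical mapping from query text to the function that answers it
QUERY_HANDLERS = {
    'How are the labels distributed across the training set?': label_distribution,
    'What percentage of samples are healthy (no conditions)?': healthy_percentage,
}

for query, handler in QUERY_HANDLERS.items():
    print(query)
    print(handler(toy_labels))
```

Keeping the question text next to the function that answers it makes it easy to see which computation backs which claim.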

### Visualizing Data

Let's visualize a sample image and its corresponding labels to understand the data better.

In [None]:
# Visualize a sample image
data_idx = 42
plt.figure()
plt.imshow(ds['train_images'][data_idx], cmap='gray')
plt.colorbar()
plt.grid(False)
plt.title('Sample Chest X-ray Image')
plt.show()

# Display corresponding labels
labels = ds['train_labels'][data_idx]
condition_names = ['Atelectasis', 'Cardiomegaly', 'Effusion', 'Infiltration', 'Mass', 'Nodule',
                   'Pneumonia', 'Pneumothorax', 'Consolidation', 'Edema', 'Emphysema',
                   'Fibrosis', 'Pleural Thickening', 'Hernia']
print('Labels for sample image:')
for label, name in zip(labels, condition_names):
    if label == 1:
        print(f'- {name}')

## Chain-of-Thought Reasoning and Verification

We will answer three natural-language queries using Chain-of-Thought (CoT) reasoning, followed by verification using Pandas. Each query includes a reasoning explanation and cross-checking code.

### Query 1: Distribution of Labels

**Query**: "How are the labels distributed across the training set?"

**CoT Reasoning**:
1. The training labels are stored in `ds['train_labels']`, a NumPy array of shape (num_samples, 14), where each column represents one of the 14 conditions.
2. To find the distribution, we need to count how many times each condition is present (value = 1) across all samples.
3. We can sum the labels along the sample axis to get the total occurrences of each condition.
4. To make the results interpretable, we'll map the counts to condition names and visualize them using a bar plot.
5. For verification, we'll load the accompanying CSV file (`dataset.csv`) and compare the label counts.

**Code**:


In [None]:
# Calculate label distribution
label_counts = np.sum(ds['train_labels'], axis=0)
df_counts = pd.DataFrame({'Condition': condition_names, 'Count': label_counts})

# Visualize
plt.figure(figsize=(12, 6))
sns.barplot(x='Condition', y='Count', data=df_counts)
plt.xticks(rotation=45)
plt.title('Distribution of Conditions in Training Set')
plt.show()

# Print counts
print('Label counts in training set:')
for condition, count in zip(condition_names, label_counts):
    print(f'{condition}: {count}')

**Verification**:
To verify, we load the `dataset.csv` file and compute the sum of each condition column for the training split.

In [None]:
# Load CSV and verify
df_csv = pd.read_csv('./Data/ChestMNIST/dataset.csv')
train_df = df_csv[df_csv['split'] == 'train']
csv_counts = train_df[condition_names].sum()

# Compare
print('Verification - CSV counts:')
for condition, count in csv_counts.items():
    print(f'{condition}: {count}')
print('\nMatch:', np.allclose(label_counts, csv_counts.values))

**Conclusion**: The bar plot and counts show how the conditions are distributed, with some (e.g., Effusion) being more common than others (e.g., Hernia). If the `Match` line above prints `True`, the counts derived from the CSV agree with the counts computed from the NumPy arrays.
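Raw counts are easier to compare across splits of different sizes when they are expressed as a share of samples. The sketch below shows the idea on made-up data; `toy_labels` and `toy_names` are illustrative and not taken from ChestMNIST.

```python
import numpy as np
import pandas as pd

# Made-up toy labels: 5 samples x 3 conditions (not real ChestMNIST data)
toy_names = ['Effusion', 'Mass', 'Hernia']
toy_labels = np.array([[1, 0, 0],
                       [1, 1, 0],
                       [0, 0, 0],
                       [1, 0, 0],
                       [0, 1, 0]], dtype=np.uint8)

counts = toy_labels.sum(axis=0)                # positives per condition
prevalence = counts / len(toy_labels) * 100    # percent of samples with the condition

df_prev = pd.DataFrame({'Condition': toy_names,
                        'Count': counts,
                        'Percent': prevalence}).sort_values('Count', ascending=False)
print(df_prev.to_string(index=False))
```

Because one image can carry several labels, these percentages need not add up to 100.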

### Query 2: Healthy vs. Diseased Samples

**Query**: "What percentage of samples in the training set are healthy (no conditions)?"

**CoT Reasoning**:
1. A sample counts as healthy if all 14 labels are 0 (none of the listed conditions is flagged), i.e., the sum of the label vector is 0.
2. We can compute the sum of labels for each sample and count how many have a sum of 0.
3. The percentage is the number of healthy samples divided by the total number of samples, multiplied by 100.
4. For verification, we'll check the CSV file to ensure the count of healthy samples matches.

**Code**:


In [None]:
# Calculate healthy samples
label_sums = np.sum(ds['train_labels'], axis=1)
healthy_count = np.sum(label_sums == 0)
total_samples = len(ds['train_labels'])
healthy_percentage = (healthy_count / total_samples) * 100

print(f'Healthy samples: {healthy_count}')
print(f'Total samples: {total_samples}')
print(f'Percentage of healthy samples: {healthy_percentage:.2f}%')

# Visualize
plt.figure(figsize=(6, 6))
sns.barplot(x=['Healthy', 'Diseased'], y=[healthy_count, total_samples - healthy_count])
plt.title('Healthy vs. Diseased Samples in Training Set')
plt.ylabel('Count')
plt.show()

**Verification**:
We verify by checking the CSV file for rows in the training split where all condition columns are 0.

In [None]:
# Verify with CSV
healthy_csv = train_df[train_df[condition_names].sum(axis=1) == 0]
healthy_csv_count = len(healthy_csv)
print(f'Healthy samples (CSV): {healthy_csv_count}')
print('Match:', healthy_count == healthy_csv_count)

**Conclusion**: The percentage of healthy samples is calculated and visualized, showing the balance between healthy and diseased cases. If the `Match` line above prints `True`, the healthy count from the CSV agrees with the NumPy result.
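The per-sample label sums used in this query carry more information than the healthy/diseased split: they also tell how many conditions each sample has. A sketch on made-up data (`toy_labels` is illustrative only), using `np.bincount` to tabulate the number of labels per sample:

```python
import numpy as np

# Made-up toy labels: 6 samples x 4 conditions (illustration only)
toy_labels = np.array([[0, 0, 0, 0],
                       [1, 0, 0, 0],
                       [0, 0, 0, 0],
                       [1, 1, 0, 0],
                       [0, 0, 1, 0],
                       [1, 1, 1, 0]], dtype=np.uint8)

# Sum with a signed integer dtype: np.bincount does not accept uint64 input
labels_per_sample = toy_labels.sum(axis=1, dtype=np.int64)
histogram = np.bincount(labels_per_sample, minlength=toy_labels.shape[1] + 1)

for n_labels, n_samples in enumerate(histogram):
    print(f'{n_labels} condition(s): {n_samples} sample(s)')

# Entry 0 of the histogram is the healthy count used in Query 2
healthy_percentage = histogram[0] / len(toy_labels) * 100
print(f'Healthy: {healthy_percentage:.2f}%')
```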

### Query 3: Most Common Co-occurring Conditions

**Query**: "Which conditions frequently co-occur in the same training-set image?"

**CoT Reasoning**:
1. Co-occurring conditions are pairs of conditions that are both 1 in the same sample.
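2. We can compute a correlation matrix of the label columns. For binary labels, a high positive correlation marks pairs that appear together more often than their individual frequencies would suggest.
3. Alternatively, we can count the number of samples where each pair of conditions is present. This measures raw co-occurrence and tends to favor common conditions.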
4. We'll visualize the correlations using a heatmap and list the top pairs.
5. For verification, we'll manually count co-occurrences for a few pairs using Pandas.

**Code**:


In [None]:
# Compute correlation matrix
label_df = pd.DataFrame(ds['train_labels'], columns=condition_names)
corr_matrix = label_df.corr()

# Visualize
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation of Conditions in Training Set')
plt.show()

# Find top co-occurring pairs
# Keep only the upper triangle (k=1) so that self-correlations and mirrored
# duplicates such as (A, B) / (B, A) are excluded
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
corr_pairs = corr_matrix.where(upper_mask).stack().dropna().sort_values(ascending=False)
print('Top 5 correlated condition pairs:')
for (cond1, cond2), corr in corr_pairs.head(5).items():
    print(f'{cond1} & {cond2}: Correlation = {corr:.3f}')

**Verification**:
We manually count co-occurrences for the top pair using Pandas.

In [None]:
# Verify top pair (e.g., Effusion & Edema)
top_pair = corr_pairs.index[0]  # First pair
cond1, cond2 = top_pair
cooccur_count = len(label_df[(label_df[cond1] == 1) & (label_df[cond2] == 1)])
total_samples = len(label_df)
print(f'Co-occurrences of {cond1} & {cond2}: {cooccur_count}')
print(f'Percentage: {(cooccur_count / total_samples) * 100:.2f}%')

**Conclusion**: The heatmap and correlation values identify condition pairs that tend to appear together. The manual count shows how often the top-correlated pair actually co-occurs; note that a high correlation does not by itself imply a high absolute count.
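Counting pairs directly, the alternative mentioned in the reasoning steps, can be done with a single matrix product. The sketch below runs on made-up data (`toy_labels` and `toy_names` are illustrative, not ChestMNIST values) and lists each unordered pair once.

```python
import numpy as np
import pandas as pd

# Made-up toy labels: 5 samples x 3 conditions (illustration only)
toy_names = ['Effusion', 'Edema', 'Hernia']
toy_labels = np.array([[1, 1, 0],
                       [1, 0, 0],
                       [1, 1, 0],
                       [0, 0, 1],
                       [0, 0, 0]], dtype=np.int64)

# For binary labels, (L^T @ L)[i, j] counts the samples where both i and j are 1;
# the diagonal holds each condition's own count.
cooccur = pd.DataFrame(toy_labels.T @ toy_labels, index=toy_names, columns=toy_names)
print(cooccur)

# List each unordered pair once (upper triangle, diagonal excluded), largest first
rows, cols = np.triu_indices(len(toy_names), k=1)
pairs = sorted(
    ((toy_names[i], toy_names[j], int(cooccur.iat[i, j])) for i, j in zip(rows, cols)),
    key=lambda p: p[2], reverse=True,
)
for cond1, cond2, count in pairs:
    print(f'{cond1} & {cond2}: {count}')
```

If the real label array is stored in a small integer type such as `uint8`, cast it to a wider type (for example with `astype(np.int64)`) before the matrix product, since counts above 255 would otherwise wrap around.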

## Exercises: Double-Checking with Pandas

These exercises encourage you to verify the results using simple Pandas commands, reinforcing the findings from the queries.

### Exercise 1: Verify Label Distribution

**Task**: Use Pandas to compute the number of occurrences of 'Pneumonia' in the training set and compare it with the result from Query 1.

**Code** (reference solution; try writing it yourself before reading on):


In [None]:
# Your code here
pneumonia_count = train_df['Pneumonia'].sum()
print(f'Pneumonia occurrences: {pneumonia_count}')

# Compare with Query 1 result
query1_pneumonia = label_counts[condition_names.index('Pneumonia')]
print(f'Match with Query 1: {pneumonia_count == query1_pneumonia}')

**Solution**:
The code above sums the 'Pneumonia' column in the training split of the CSV and compares it with the count from Query 1. The result should match, confirming the label distribution.

### Exercise 2: Verify Healthy Samples

**Task**: Use Pandas to count the number of healthy samples in the validation set and compare with the method used in Query 2.

**Code** (reference solution; try writing it yourself before reading on):


In [None]:
# Your code here
val_df = df_csv[df_csv['split'] == 'val']
healthy_val = len(val_df[val_df[condition_names].sum(axis=1) == 0])
print(f'Healthy samples in validation set: {healthy_val}')

# Compare with NumPy method
val_label_sums = np.sum(ds['val_labels'], axis=1)
healthy_val_numpy = np.sum(val_label_sums == 0)
print(f'Match with NumPy: {healthy_val == healthy_val_numpy}')

**Solution**:
The code filters the validation split and counts rows with no conditions. It compares with the NumPy method from Query 2, ensuring consistency.
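The same array-versus-table comparison recurs throughout this notebook, so it can be wrapped in one function and applied to any split. The sketch below is one possible way to do that on made-up data; `cross_check` is a hypothetical helper and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def cross_check(labels, df, names):
    """Compare per-condition counts and healthy counts between an array and a table.

    Hypothetical helper: `labels` is a (num_samples, len(names)) binary array,
    `df` a DataFrame with one column per entry in `names`.
    """
    counts_match = bool(np.array_equal(labels.sum(axis=0), df[names].sum().to_numpy()))
    healthy_array = int((labels.sum(axis=1) == 0).sum())
    healthy_table = int((df[names].sum(axis=1) == 0).sum())
    return {'counts_match': counts_match, 'healthy_match': healthy_array == healthy_table}

# Made-up toy data standing in for one split
names = ['Pneumonia', 'Hernia']
labels = np.array([[1, 0], [0, 0], [1, 1]])
table = pd.DataFrame(labels, columns=names)

print(cross_check(labels, table, names))
```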

## Conclusion

This notebook demonstrated how to perform natural-language queries on the ChestMNIST dataset using Pandas, with a focus on data exploration, Chain-of-Thought reasoning, and verification. We answered queries about label distribution, healthy samples, and co-occurring conditions, using visualizations and cross-checking with CSV data. The exercises reinforced these findings with simple Pandas commands.

### References
- NIH Clinical Center chest X-ray dataset announcement (source data for ChestMNIST): https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community
- Pandas Documentation: https://pandas.pydata.org/docs/
- Seaborn Documentation: https://seaborn.pydata.org/