# CheXpert dataset

This notebook explores the CheXpert dataset (Kaggle version: CheXpert-v1.0-small).

Goals of this notebook:

- Understand the dataset structure
- Inspect labels and metadata

This notebook is only for dataset understanding and concept preparation.


In [None]:
# Imports

import os
import pandas as pd
import numpy as np

from collections import Counter

In [None]:
# Dataset path

DATASET_ROOT = "./data"

TRAIN_CSV = os.path.join(DATASET_ROOT, "train.csv")
VALID_CSV = os.path.join(DATASET_ROOT, "valid.csv")

print("Train CSV exists:", os.path.exists(TRAIN_CSV))
print("Valid CSV exists:", os.path.exists(VALID_CSV))

In [None]:
# Load CSV files

train_df = pd.read_csv(TRAIN_CSV)
valid_df = pd.read_csv(VALID_CSV)

print("Train size:", len(train_df))
print("Valid size:", len(valid_df))

train_df.head()

In [None]:
# List all columns
print(train_df.columns.tolist())


In [None]:
# Separate metadata vs label columns

METADATA_COLUMNS = [
    "Path",
    "Sex",
    "Age",
    "Frontal/Lateral",
    "AP/PA"
]

print("Metadata columns:")
for c in METADATA_COLUMNS:
    print("-", c)


In [None]:
# Define clinical label columns (candidate concepts)

LABEL_COLUMNS = [c for c in train_df.columns if c not in METADATA_COLUMNS]

print("Number of label columns:", len(LABEL_COLUMNS))
print(LABEL_COLUMNS)


We define the concept set C as the clinical observations provided by CheXpert.

These concepts are:

- No Finding
- Enlarged Cardiomediastinum
- Cardiomegaly
- Lung Opacity
- Lung Lesion
- Edema
- Consolidation
- Pneumonia
- Atelectasis
- Pneumothorax
- Pleural Effusion
- Pleural Other
- Fracture
- Support Devices

These concepts are human-interpretable and clinically meaningful.
They form the bottleneck layer of the concept bottleneck model (CBM).
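As a sanity check on the concept set, the sketch below compares the label columns found in a CSV header against the 14 concepts listed above. The names `EXPECTED_CONCEPTS` and `check_concepts` are hypothetical helpers introduced here for illustration, and the toy column list stands in for `train_df.columns`.

```python
# The 14 CheXpert observations listed above (the concept set C).
EXPECTED_CONCEPTS = [
    "No Finding", "Enlarged Cardiomediastinum", "Cardiomegaly",
    "Lung Opacity", "Lung Lesion", "Edema", "Consolidation",
    "Pneumonia", "Atelectasis", "Pneumothorax", "Pleural Effusion",
    "Pleural Other", "Fracture", "Support Devices",
]

METADATA = ["Path", "Sex", "Age", "Frontal/Lateral", "AP/PA"]


def check_concepts(columns, metadata_columns, expected=EXPECTED_CONCEPTS):
    """Return (missing, unexpected) concept names for a list of columns.

    Hypothetical helper: `missing` are expected concepts absent from the
    columns, `unexpected` are non-metadata columns not in the concept set.
    """
    label_columns = [c for c in columns if c not in metadata_columns]
    missing = [c for c in expected if c not in label_columns]
    unexpected = [c for c in label_columns if c not in expected]
    return missing, unexpected


# Toy header standing in for train_df.columns (assumed layout).
toy_columns = METADATA + EXPECTED_CONCEPTS
missing, unexpected = check_concepts(toy_columns, METADATA)
print("Missing concepts:", missing)
print("Unexpected columns:", unexpected)
```

With the real data, the same call would be `check_concepts(train_df.columns, METADATA_COLUMNS)`. Two empty lists indicate that the label columns match the concept set exactly.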


In [None]:
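# Inspect the value distribution for one concept

def inspect_label_distribution(df, label):
    """Count each label value in one column, including NaN."""
    return df[label].value_counts(dropna=False)

# LABEL_COLUMNS[0] is the first non-metadata column in train.csv
inspect_label_distribution(train_df, LABEL_COLUMNS[0])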


In [None]:
# Show distributions for all concepts

summary = {}

for label in LABEL_COLUMNS:
    counts = train_df[label].value_counts(dropna=False)
    summary[label] = counts

summary_df = pd.DataFrame(summary).fillna(0).astype(int)
summary_df


CheXpert labels take the following values:

-  1  → positive finding
-  0  → negative finding
- -1  → uncertain finding
- NaN → not mentioned in the report

The training labels are extracted automatically from radiology reports
by a rule-based NLP labeler; they are not manual image annotations.
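To make the four label values easier to read in tables, they can be mapped to names. This is a minimal sketch on a toy series; `LABEL_MEANING` and `decode_labels` are hypothetical names introduced here, not part of the dataset or of pandas.

```python
import numpy as np
import pandas as pd

# Meaning of each numeric label value, as listed above.
LABEL_MEANING = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}


def decode_labels(series):
    """Map numeric CheXpert labels to readable names.

    NaN becomes "not mentioned". Caveat: any value outside
    {1, 0, -1} would also end up as "not mentioned" in this sketch.
    """
    return series.map(LABEL_MEANING).fillna("not mentioned")


# Toy column covering all four cases (not real data).
toy = pd.Series([1.0, 0.0, -1.0, np.nan], name="Edema")
decoded = decode_labels(toy)
print(decoded.tolist())
```

On the real data, `decode_labels(train_df["Edema"]).value_counts()` would give the same counts as `inspect_label_distribution`, with names instead of numbers.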


In [None]:
# Check missing rate (NaN) for each concept

missing_rates = {}

for c in LABEL_COLUMNS:
    missing_rates[c] = train_df[c].isna().mean()

pd.Series(missing_rates).sort_values(ascending=False)



Two rates are checked for each concept:

- Missing rate: how often the concept is not mentioned at all (NaN).
- Uncertainty rate: how often the concept is mentioned but marked uncertain (-1).

In [None]:
# Check uncertainty rate (-1) for each concept

uncertainty_rates = {}

for c in LABEL_COLUMNS:
    uncertainty_rates[c] = (train_df[c] == -1).mean()

pd.Series(uncertainty_rates).sort_values(ascending=False)
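The missing and uncertainty rates can also be viewed side by side with the positive and negative rates. The sketch below builds one table with a row per concept, using a toy DataFrame in place of `train_df`. The helper name `label_rate_table` is hypothetical.

```python
import numpy as np
import pandas as pd


def label_rate_table(df, label_columns):
    """Return per-concept rates of positive, negative, uncertain, missing.

    Each row sums to 1 as long as the columns only contain
    1, 0, -1 and NaN.
    """
    labels = df[label_columns]
    return pd.DataFrame({
        "positive": (labels == 1).mean(),
        "negative": (labels == 0).mean(),
        "uncertain": (labels == -1).mean(),
        "missing": labels.isna().mean(),
    })


# Toy data standing in for train_df (values are made up).
toy_df = pd.DataFrame({
    "Edema": [1.0, -1.0, np.nan, 0.0],
    "Fracture": [np.nan, np.nan, np.nan, 1.0],
})

rates = label_rate_table(toy_df, ["Edema", "Fracture"])
print(rates.sort_values("missing", ascending=False))
```

With the real data, the call would be `label_rate_table(train_df, LABEL_COLUMNS)`. Sorting by `missing` or `uncertain` shows which concepts have the least usable supervision.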