# CheXpert dataset

This notebook uses the Kaggle release of the CheXpert chest X-ray dataset (CheXpert-v1.0-small).

Goals of this notebook:

- Understand the dataset structure
- Inspect labels and metadata

This notebook is only for dataset understanding and concept preparation.


In [1]:
# Imports

import os
import pandas as pd
import numpy as np

from collections import Counter

In [2]:
# Dataset path

DATASET_ROOT = "./data"

TRAIN_CSV = os.path.join(DATASET_ROOT, "train.csv")
VALID_CSV = os.path.join(DATASET_ROOT, "valid.csv")

print("Train CSV exists:", os.path.exists(TRAIN_CSV))
print("Valid CSV exists:", os.path.exists(VALID_CSV))

Train CSV exists: True
Valid CSV exists: True


In [3]:
# Load CSV files

train_df = pd.read_csv(TRAIN_CSV)
valid_df = pd.read_csv(VALID_CSV)

print("Train size:", len(train_df))
print("Valid size:", len(valid_df))

train_df.head()

Train size: 223414
Valid size: 234


,Path,Sex,Age,Frontal/Lateral,AP/PA,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Opacity,Lung Lesion,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
0,CheXpert-v1.0-small/train/patient00001/study1/...,Female,68,Frontal,AP,1.0,,,,,,,,,0.0,,,,1.0
1,CheXpert-v1.0-small/train/patient00002/study2/...,Female,87,Frontal,AP,,,-1.0,1.0,,-1.0,-1.0,,-1.0,,-1.0,,1.0,
2,CheXpert-v1.0-small/train/patient00002/study1/...,Female,83,Frontal,AP,,,,1.0,,,-1.0,,,,,,1.0,
3,CheXpert-v1.0-small/train/patient00002/study1/...,Female,83,Lateral,,,,,1.0,,,-1.0,,,,,,1.0,
4,CheXpert-v1.0-small/train/patient00003/study1/...,Male,41,Frontal,AP,,,,,,1.0,,,,0.0,,,,
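
The `Path` values above appear to encode a patient and a study as directory components (`patientNNNNN/studyN`). A minimal sketch, assuming that layout holds for every row, pulls both out with a regular expression. The column names `patient_id` and `study_id` are hypothetical, and the toy frame simply reuses the truncated paths shown in the output above.

```python
import pandas as pd

# Toy frame reusing the truncated paths from the head() output.
toy_df = pd.DataFrame({
    "Path": [
        "CheXpert-v1.0-small/train/patient00001/study1/...",
        "CheXpert-v1.0-small/train/patient00002/study2/...",
        "CheXpert-v1.0-small/train/patient00002/study1/...",
        "CheXpert-v1.0-small/train/patient00002/study1/...",
        "CheXpert-v1.0-small/train/patient00003/study1/...",
    ]
})

# Assumption: every path contains "patient<digits>/study<digits>".
# Named groups become the column names of the extracted frame.
ids = toy_df["Path"].str.extract(
    r"(?P<patient_id>patient\d+)/(?P<study_id>study\d+)"
)
toy_df = toy_df.join(ids)

# Number of images (rows) per patient.
images_per_patient = toy_df.groupby("patient_id").size()
print(images_per_patient)
```

Grouping by patient like this may be useful later, for example to check that one patient can contribute several studies and several views per study.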


In [4]:
# List all columns
print(train_df.columns.tolist())


['Path', 'Sex', 'Age', 'Frontal/Lateral', 'AP/PA', 'No Finding', 'Enlarged Cardiomediastinum', 'Cardiomegaly', 'Lung Opacity', 'Lung Lesion', 'Edema', 'Consolidation', 'Pneumonia', 'Atelectasis', 'Pneumothorax', 'Pleural Effusion', 'Pleural Other', 'Fracture', 'Support Devices']


In [5]:
# Separate metadata vs label columns

METADATA_COLUMNS = [
    "Path",
    "Sex",
    "Age",
    "Frontal/Lateral",
    "AP/PA"
]

print("Metadata columns:")
for c in METADATA_COLUMNS:
    print("-", c)


Metadata columns:
- Path
- Sex
- Age
- Frontal/Lateral
- AP/PA


In [6]:
# Define clinical label columns (visual concepts)

LABEL_COLUMNS = [c for c in train_df.columns if c not in METADATA_COLUMNS]

print("Number of label columns:", len(LABEL_COLUMNS))
print(LABEL_COLUMNS)


Number of label columns: 14
['No Finding', 'Enlarged Cardiomediastinum', 'Cardiomegaly', 'Lung Opacity', 'Lung Lesion', 'Edema', 'Consolidation', 'Pneumonia', 'Atelectasis', 'Pneumothorax', 'Pleural Effusion', 'Pleural Other', 'Fracture', 'Support Devices']


We define the concept set C as the 14 clinical observations provided by CheXpert:

- No Finding
- Enlarged Cardiomediastinum
- Cardiomegaly
- Lung Opacity
- Lung Lesion
- Edema
- Consolidation
- Pneumonia
- Atelectasis
- Pneumothorax
- Pleural Effusion
- Pleural Other
- Fracture
- Support Devices

These concepts are human-interpretable and clinically meaningful.
They form the bottleneck layer of the concept bottleneck model (CBM).
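
For reference, a concept bottleneck model is commonly written as a composition of two maps. The sketch below assumes that standard formulation. The symbols $x$ (input image), $g$ (concept predictor), $f$ (label predictor), and the downstream target $y$ are not defined in this notebook and are introduced only for illustration.

```latex
\hat{c} = g(x) \in [0, 1]^{|C|}, \qquad \hat{y} = f(\hat{c}), \qquad |C| = 14
```

Under this reading, each of the 14 label columns supervises one coordinate of $\hat{c}$.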


In [7]:
# Inspect value distribution for one concept

def inspect_label_distribution(df, label):
    values = df[label].value_counts(dropna=False)
    return values

inspect_label_distribution(train_df, LABEL_COLUMNS[0])


No Finding
NaN    201033
1.0     22381
Name: count, dtype: int64

In [8]:
# Show distributions for all concepts

summary = {}

for label in LABEL_COLUMNS:
    counts = train_df[label].value_counts(dropna=False)
    summary[label] = counts

summary_df = pd.DataFrame(summary).fillna(0).astype(int)
summary_df


,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Opacity,Lung Lesion,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
-1.0,0,12403,8087,5598,1488,12984,27742,18770,33739,3145,11628,2653,642,1079
0.0,0,21638,11116,6599,1270,20726,28097,2799,1328,56341,35396,316,2512,6137
1.0,22381,10798,27000,105581,9186,52246,14783,6039,33376,19448,86187,3523,9040,116001
NaN,201033,178575,177211,105636,211470,137458,152792,195806,154971,144480,90203,216922,211220,100197


Each CheXpert label takes one of four values:

-  1  → positive finding
-  0  → negative finding
- -1  → uncertain finding
- NaN → not mentioned in the report

Not every concept uses all four values: in the training set, No Finding is only ever 1 or NaN.

These labels are extracted automatically from radiology reports
by a rule-based NLP system; they are not manual image annotations.
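
Before these labels can supervise concepts, the four-value encoding has to be reduced to something a loss function can consume. The sketch below shows one possible policy, masking: only explicit positives and negatives count as supervision, while uncertain and unmentioned entries are excluded. This is an illustrative assumption, not the policy chosen in this notebook, and the helper name `to_targets_and_mask` is hypothetical. Other policies, such as mapping -1 to 0 or to 1, are equally plausible.

```python
import numpy as np
import pandas as pd

# Toy labels covering all four CheXpert values (hypothetical rows).
toy_labels = pd.DataFrame({
    "Edema": [1.0, 0.0, -1.0, np.nan],
    "Fracture": [np.nan, 1.0, np.nan, -1.0],
})

def to_targets_and_mask(labels):
    """Split raw labels into binary targets and a validity mask.

    The mask is True only for explicit positives (1) and negatives (0).
    Uncertain (-1) and unmentioned (NaN) entries are masked out and
    receive a placeholder target of 0.
    """
    mask = labels.isin([0.0, 1.0])
    targets = labels.where(mask, 0.0)
    return targets, mask

targets, mask = to_targets_and_mask(toy_labels)
print(targets)
print(mask)
```

A training loop could then multiply the per-concept loss by `mask` so that placeholder targets never contribute.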


In [9]:
# Check missing rate (NaN) for each concept

missing_rates = {}

for c in LABEL_COLUMNS:
    missing_rates[c] = train_df[c].isna().mean()

pd.Series(missing_rates).sort_values(ascending=False)



Pleural Other                 0.970942
Lung Lesion                   0.946539
Fracture                      0.945420
No Finding                    0.899823
Pneumonia                     0.876427
Enlarged Cardiomediastinum    0.799301
Cardiomegaly                  0.793196
Atelectasis                   0.693649
Consolidation                 0.683896
Pneumothorax                  0.646692
Edema                         0.615261
Lung Opacity                  0.472826
Support Devices               0.448481
Pleural Effusion              0.403748
dtype: float64

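Missing rate vs. uncertainty rate


Missing rate

- fraction of training images for which the concept is not mentioned at all (NaN)

Uncertainty rate

- fraction of training images for which the concept is mentioned but labeled uncertain (-1); the denominator is all images, not only those that mention the concept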
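
The two rates can also be computed side by side, without an explicit loop. The sketch below runs on a small hypothetical frame rather than `train_df` and adds a third, assumed quantity, `uncertain_given_mentioned`: the share of uncertain labels among only those rows where the concept is mentioned. That column is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy labels (hypothetical rows, two concepts).
toy = pd.DataFrame({
    "Edema": [1.0, -1.0, np.nan, 0.0],
    "Pneumonia": [np.nan, np.nan, -1.0, np.nan],
})

rates = pd.DataFrame({
    # Fraction of all rows where the concept is NaN.
    "missing_rate": toy.isna().mean(),
    # Fraction of all rows where the concept is -1.
    "uncertainty_rate": (toy == -1).mean(),
})

# Fraction of -1 among rows where the concept is mentioned (not NaN).
rates["uncertain_given_mentioned"] = (toy == -1).sum() / toy.notna().sum()
print(rates)
```

A rarely mentioned concept can have a low uncertainty rate overall and still be uncertain most of the times it is mentioned, which is why the conditional column can be worth a look.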

In [10]:
# Check uncertainty rate (-1) for each concept

uncertainty_rates = {}

for c in LABEL_COLUMNS:
    uncertainty_rates[c] = (train_df[c] == -1).mean()

pd.Series(uncertainty_rates).sort_values(ascending=False)

Atelectasis                   0.151016
Consolidation                 0.124173
Pneumonia                     0.084014
Edema                         0.058116
Enlarged Cardiomediastinum    0.055516
Pleural Effusion              0.052047
Cardiomegaly                  0.036197
Lung Opacity                  0.025057
Pneumothorax                  0.014077
Pleural Other                 0.011875
Lung Lesion                   0.006660
Support Devices               0.004830
Fracture                      0.002874
No Finding                    0.000000
dtype: float64