## What is MIMIC-CXR?

MIMIC-CXR is a large-scale dataset of chest X-ray images collected
from real clinical practice at Beth Israel Deaconess Medical Center.

Each study contains:
- one or more chest X-ray images
- a free-text radiology report

The dataset is fully de-identified.



## How are labels / annotations obtained?

Important point: MIMIC-CXR does NOT provide manual bounding boxes or pixel-level labels.

The structured labels are obtained by automatic NLP processing of the radiology reports, using tools such as:
- CheXpert labeler
- NegBio

The labels are therefore noisy and uncertain by design.

Typical label values:
- 1   : positive finding
- 0   : negative
- -1  : uncertain
- NaN : not mentioned
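
Because the raw labels take four values, they usually have to be collapsed to 0/1 before training. Below is a minimal sketch, assuming a hypothetical helper `binarize_labels` and a toy frame with made-up values (neither is part of this notebook). Treating "not mentioned" as negative, and mapping "uncertain" to 0 or to 1, are modelling choices rather than something the dataset prescribes.

```python
import numpy as np
import pandas as pd


def binarize_labels(labels: pd.DataFrame, uncertain_as: float = 0.0) -> pd.DataFrame:
    """Collapse {1, 0, -1, NaN} labels to {1.0, 0.0}.

    NaN (not mentioned) is treated as negative; -1 (uncertain) is mapped to
    `uncertain_as`. Both choices are assumptions made for this sketch.
    """
    out = labels.fillna(0.0)               # not mentioned -> negative
    out = out.replace(-1.0, uncertain_as)  # uncertain -> chosen value
    return out.astype(float)


# Toy example with made-up values, one row per study.
raw = pd.DataFrame({
    "Edema": [1.0, 0.0, -1.0, np.nan],
    "Pneumonia": [np.nan, -1.0, 1.0, 0.0],
})

as_negative = binarize_labels(raw, uncertain_as=0.0)
as_positive = binarize_labels(raw, uncertain_as=1.0)
print(as_negative)
print(as_positive)
```

The CSV loaded later in this notebook shows only 0.0 and 1.0 in the concept columns, so it appears to have been binarized upstream already.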


## What are the possible concepts for CBM?

The natural concepts for a concept bottleneck model (CBM) are the clinical findings extracted from the reports.

The concepts used here are a subset of the CheXpert label set:


- Atelectasis
- Cardiomegaly
- Consolidation
- Edema
- Enlarged Cardiomediastinum
- Lung Lesion
- Lung Opacity
- Pleural Effusion
- Pneumonia
- Pneumothorax


These are high-level semantic clinical concepts.
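
In a concept bottleneck model, the input is first mapped to concept scores, and the final prediction is computed from those scores only. Below is a minimal NumPy sketch of that two-stage shape. The feature size, the two-class target and the random weights are placeholders invented for illustration, not anything taken from this notebook.

```python
import numpy as np

CONCEPTS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Lung Lesion", "Lung Opacity",
    "Pleural Effusion", "Pneumonia", "Pneumothorax",
]


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# Hypothetical sizes: 32 image features in, 2 target classes out.
n_features, n_concepts, n_targets = 32, len(CONCEPTS), 2

# Untrained placeholder weights for the two stages.
W_concept = rng.normal(scale=0.1, size=(n_features, n_concepts))
W_target = rng.normal(scale=0.1, size=(n_concepts, n_targets))


def predict(x: np.ndarray):
    """Stage 1: features -> concept probabilities. Stage 2: concepts -> target logits."""
    concept_probs = sigmoid(x @ W_concept)     # shape (batch, n_concepts)
    target_logits = concept_probs @ W_target   # depends on x only through the concepts
    return concept_probs, target_logits


x = rng.normal(size=(4, n_features))  # stand-in for image features
concept_probs, target_logits = predict(x)
print(concept_probs.shape, target_logits.shape)
```

The point of the bottleneck is that the second stage never sees `x` directly, so each prediction can be inspected (or corrected) at the level of the named findings.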


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import os

pd.set_option("display.max_columns", 50)


In [3]:
BASE_DIR = Path("data/MIMIC-CXR")

CSV_PATH = BASE_DIR / "mimic-cxr.csv"

print(CSV_PATH.exists())


True


In [15]:
df = pd.read_csv(CSV_PATH)

print("Dataset shape:", df.shape)
df.head(10)


Dataset shape: (86003, 14)


Unnamed: 0,filename,split,label,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Lung Lesion,Lung Opacity,Normal,Pleural Effusion,Pneumonia,Pneumothorax
0,s50000014.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,s50000052.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,s50000125.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,s50000173.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,s50000198.jpg,train,Consolidation,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,s50001042.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,s50001080.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7,s50001166.jpg,train,"Pleural Effusion, Pneumonia",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
8,s50001349.jpg,train,Lung Opacity,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9,s50001417.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [5]:
df.columns.tolist()


['filename',
 'split',
 'label',
 'Atelectasis',
 'Cardiomegaly',
 'Consolidation',
 'Edema',
 'Enlarged Cardiomediastinum',
 'Lung Lesion',
 'Lung Opacity',
 'Normal',
 'Pleural Effusion',
 'Pneumonia',
 'Pneumothorax']

## Define metadata, concepts and target

In [16]:
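# Columns that identify an image and its split
metadata_cols = [
    "filename",
    "split"
]

# Finding columns used as concepts ("Normal" is in the CSV but not listed here)
concept_cols = [
    "Atelectasis",
    "Cardiomegaly",
    "Consolidation",
    "Edema",
    "Enlarged Cardiomediastinum",
    "Lung Lesion",
    "Lung Opacity",
    "Pleural Effusion",
    "Pneumonia",
    "Pneumothorax"
]

# Text label column used as the prediction target
target_col = "label"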
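
In the rows displayed above, the `label` string looks like the comma-separated names of the positive finding columns, with `Normal` when none is set. The sketch below checks that reading on a toy frame that mirrors four of those rows. `label_from_findings` is a hypothetical helper, and the rule itself is an assumption inferred from the sample rows, not documented behaviour of the CSV.

```python
import pandas as pd

# Small subset of finding columns, enough for the toy rows below.
finding_cols = ["Consolidation", "Lung Opacity", "Pleural Effusion", "Pneumonia"]


def label_from_findings(row: pd.Series) -> str:
    """Join the names of positive findings; fall back to 'Normal' when there are none."""
    positives = [c for c in finding_cols if row[c] == 1.0]
    return ", ".join(positives) if positives else "Normal"


toy = pd.DataFrame({
    "label": ["Normal", "Consolidation", "Pleural Effusion, Pneumonia", "Lung Opacity"],
    "Consolidation": [0.0, 1.0, 0.0, 0.0],
    "Lung Opacity": [0.0, 0.0, 0.0, 1.0],
    "Pleural Effusion": [0.0, 0.0, 1.0, 0.0],
    "Pneumonia": [0.0, 0.0, 1.0, 0.0],
})

rebuilt = toy.apply(label_from_findings, axis=1)
mismatches = toy[rebuilt != toy["label"]]
print("rows where label and findings disagree:", len(mismatches))
```

On the real frame, the same check would use `concept_cols` in place of `finding_cols`; any mismatching rows would show where the assumption fails.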


In [17]:
df[metadata_cols + [target_col] + concept_cols].head()

Unnamed: 0,filename,split,label,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Lung Lesion,Lung Opacity,Pleural Effusion,Pneumonia,Pneumothorax
0,s50000014.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,s50000052.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,s50000125.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,s50000173.jpg,train,Normal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,s50000198.jpg,train,Consolidation,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
df[concept_cols].describe()

Unnamed: 0,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Lung Lesion,Lung Opacity,Pleural Effusion,Pneumonia,Pneumothorax
count,86003.0,86003.0,86003.0,86003.0,86003.0,86003.0,86003.0,86003.0,86003.0,86003.0
mean,0.215016,0.183901,0.04951,0.125321,0.031987,0.029034,0.24012,0.24748,0.073149,0.04358
std,0.410836,0.387405,0.216931,0.331085,0.175967,0.167902,0.427158,0.43155,0.260382,0.20416
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
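
Since the displayed concept values are all 0.0 or 1.0, the `mean` row can be read as the fraction of positive studies per concept. The short sketch below turns three of those means into approximate positive counts and a negatives-per-positive ratio. The numbers are copied from the table above, `n_rows` is the row count reported by `df.shape`, and the ratio is just one simple way to express class imbalance.

```python
import pandas as pd

n_rows = 86003  # from "Dataset shape"

# Means copied from the describe() output for three concepts.
prevalence = pd.Series({
    "Pleural Effusion": 0.247480,
    "Lung Opacity": 0.240120,
    "Lung Lesion": 0.029034,
})

approx_positives = (prevalence * n_rows).round().astype(int)
negatives_per_positive = (1.0 - prevalence) / prevalence

summary = pd.DataFrame({
    "prevalence": prevalence,
    "approx_positives": approx_positives,
    "negatives_per_positive": negatives_per_positive.round(1),
})
print(summary)
```

On the real frame, `df[concept_cols].mean()` and `df[concept_cols].sum()` give the same quantities for all ten concepts without rounding.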


In [19]:
df[concept_cols].head(10)

Unnamed: 0,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Lung Lesion,Lung Opacity,Pleural Effusion,Pneumonia,Pneumothorax
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
df["split"].value_counts()

split
train    83837
test      1455
valid      711
Name: count, dtype: int64
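
The splits are very uneven in size. A quick sketch using the counts printed above shows the share of each split; on the real frame, `df["split"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
split_counts = pd.Series({"train": 83837, "test": 1455, "valid": 711}, name="count")

split_share = split_counts / split_counts.sum()
print(split_share.round(4))
```

With well under 1% of the rows in `valid`, per-concept validation metrics for the rarer findings will rest on few positive examples, which is worth keeping in mind when comparing models.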