In [1]:
import pandas as pd
import os
import glob

# Datasets:
1. MIMIC-CXR-JPG
2. MIMIC-CXR
3. CheXpert

Access MIMIC-CXR at: https://physionet.org/content/mimic-cxr/   
Access MIMIC-CXR-JPG at: https://physionet.org/content/mimic-cxr-jpg/   
GitHub code repository: https://github.com/MIT-LCP/mimic-cxr   


# 1. MIMIC-CXR-JPG
* The MIMIC-CXR-JPG dataset is wholly derived from MIMIC-CXR, providing JPG format files derived from the DICOM images and structured labels derived from the free-text reports
* The aim of MIMIC-CXR-JPG is to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels
* The dataset contains 377,110 JPG format images and structured labels derived from the 227,827 free-text radiology reports associated with these images.

MIMIC-CXR-JPG v2.0.0 contains:

* A set of 10 folders, each with ~6,500 sub-folders corresponding to all the JPG format images for an individual patient.
* mimic-cxr-2.0.0-metadata.csv.gz - a compressed CSV file providing useful metadata for the images including view position, patient orientation, and an anonymized date of image acquisition time allowing chronological ordering of the images.
* mimic-cxr-2.0.0-split.csv.gz - a compressed CSV file providing recommended train/validation/test data splits.
* mimic-cxr-2.0.0-chexpert.csv.gz - a compressed CSV file listing all studies with labels generated by the CheXpert labeler.
* mimic-cxr-2.0.0-negbio.csv.gz - a compressed CSV file listing all studies with labels generated by the NegBio labeler.


## mimic-cxr-2.0.0-metadata.csv.gz


The mimic-cxr-2.0.0-metadata.csv.gz file contains useful meta-data derived from the original DICOM files in MIMIC-CXR. The columns are:

* **dicom_id** - An identifier for the DICOM file. The stem of each JPG image filename is equal to the dicom_id.
* **PerformedProcedureStepDescription** - The type of study performed ("CHEST (PA AND LAT)", "CHEST (PORTABLE AP)", etc).
* **ViewPosition** - The orientation in which the chest radiograph was taken ("AP", "PA", "LATERAL", etc).
* **Rows** - The height of the image in pixels.
* **Columns** - The width of the image in pixels.
* **StudyDate** - An anonymized date for the radiographic study. All images from the same study will have the same date and time. Dates are anonymized, but chronologically consistent for each patient. Intervals between two scans have not been modified during de-identification.
* **StudyTime** - The time of the study in hours, minutes, seconds, and fractional seconds. The time of the study was not modified during de-identification.
* **ProcedureCodeSequence_CodeMeaning** - The human readable description of the coded procedure (e.g. "CHEST (PA AND LAT)"). Descriptions follow Simon-Leeming codes [11].
* **ViewCodeSequence_CodeMeaning** - The human readable description of the coded view orientation for the image (e.g. "postero-anterior", "antero-posterior", "lateral").
* **PatientOrientationCodeSequence_CodeMeaning** - The human readable description of the patient orientation during the image acquisition. Three values are possible: "Erect", "Recumbent", or a null value (missing).

The names of the columns (aside from dicom_id) are defined as the Keyword from their corresponding DICOM data element, e.g. ViewPosition (0018, 5101). Column names for metadata derived from length-1 sequences are presented as KeywordSequence_KeywordSubitem, e.g. PatientOrientationCodeSequence_CodeMeaning is sourced from the DICOM standard Patient Orientation Code Sequence (0054, 0410), under Code Meaning (0008, 0104).

In [4]:
df_metadata = pd.read_csv('./data/MIMIC-III-CXR-JPG/mimic-cxr-2.0.0-metadata.csv.gz')
df_metadata.head()

Unnamed: 0,dicom_id,subject_id,study_id,PerformedProcedureStepDescription,ViewPosition,Rows,Columns,StudyDate,StudyTime,ProcedureCodeSequence_CodeMeaning,ViewCodeSequence_CodeMeaning,PatientOrientationCodeSequence_CodeMeaning
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,10000032,50414267,CHEST (PA AND LAT),PA,3056,2544,21800506,213014.531,CHEST (PA AND LAT),postero-anterior,Erect
1,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,10000032,50414267,CHEST (PA AND LAT),LATERAL,3056,2544,21800506,213014.531,CHEST (PA AND LAT),lateral,Erect
2,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,10000032,53189527,CHEST (PA AND LAT),PA,3056,2544,21800626,165500.312,CHEST (PA AND LAT),postero-anterior,Erect
3,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,10000032,53189527,CHEST (PA AND LAT),LATERAL,3056,2544,21800626,165500.312,CHEST (PA AND LAT),lateral,Erect
4,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,10000032,53911762,CHEST (PORTABLE AP),AP,2705,2539,21800723,80556.875,CHEST (PORTABLE AP),antero-posterior,
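
The column descriptions above say that StudyDate and StudyTime allow chronological ordering of a patient's images. A minimal sketch of that ordering follows, assuming StudyDate is stored as a YYYYMMDD number and StudyTime as an HHMMSS.fff number, as the preview suggests. The toy frame `df_meta` and its `dicom_id` values are made up for illustration; only the column names come from the metadata file.

```python
import pandas as pd

# Toy rows mirroring the metadata columns; dicom_id values are placeholders.
df_meta = pd.DataFrame({
    "dicom_id": ["c", "a", "b"],
    "subject_id": [10000032, 10000032, 10000032],
    "study_id": [53911762, 50414267, 53189527],
    "StudyDate": [21800723, 21800506, 21800626],
    "StudyTime": [80556.875, 213014.531, 165500.312],
})

# Both columns are numeric (assumption), so a plain numeric sort
# yields chronological order within each patient.
df_sorted = df_meta.sort_values(
    ["subject_id", "StudyDate", "StudyTime"]
).reset_index(drop=True)

# Optionally parse the date part to compute intervals between studies.
study_date = pd.to_datetime(df_sorted["StudyDate"].astype(str), format="%Y%m%d")
first_date = study_date.groupby(df_sorted["subject_id"]).transform("min")
days_since_first = (study_date - first_date).dt.days
```

Because the intervals between scans are preserved by de-identification, `days_since_first` is meaningful even though the absolute dates are shifted.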



## mimic-cxr-2.0.0-split.csv.gz

The mimic-cxr-2.0.0-split.csv.gz file contains:

* dicom_id - An identifier for the DICOM file. The stem of each JPG image filename is equal to the dicom_id.
* study_id - An integer unique for an individual study (i.e. an individual radiology report with one or more images associated with it).
* subject_id - An integer unique for an individual patient.
* split - a string field indicating the data split for this file, one of 'train', 'validate', or 'test'.

The split file is intended to provide a reference dataset split for studies using MIMIC-CXR-JPG.




In [5]:
df_split = pd.read_csv('./data/MIMIC-III-CXR-JPG/mimic-cxr-2.0.0-split.csv.gz')
df_split.head()

Unnamed: 0,dicom_id,study_id,subject_id,split
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,50414267,10000032,train
1,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,50414267,10000032,train
2,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,53189527,10000032,train
3,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,53189527,10000032,train
4,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,53911762,10000032,train


In [7]:
# Number of images per split
df_split['split'].value_counts()

train       368960
test          5159
validate      2991
Name: split, dtype: int64

In [9]:
# Number of unique studies per split
df_split.drop_duplicates(subset="study_id")['split'].value_counts()

train       222758
test          3269
validate      1808
Name: split, dtype: int64

In [12]:
# Number of unique patients per split
df_split.drop_duplicates(subset="subject_id")['split'].value_counts()

train       64586
validate      500
test          293
Name: split, dtype: int64
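
The three counts above can also be produced in one pass, together with a check that no patient appears in more than one split. This is a sketch on a toy frame with made-up identifiers; the names `counts` and `leaking_patients` are illustrative, and the check makes no claim about what the real split file contains.

```python
import pandas as pd

# Toy stand-in for df_split; identifiers are placeholders.
df = pd.DataFrame({
    "dicom_id": ["d1", "d2", "d3", "d4", "d5"],
    "study_id": [1, 1, 2, 3, 4],
    "subject_id": [100, 100, 100, 200, 300],
    "split": ["train", "train", "train", "validate", "test"],
})

# Unique images, studies, and patients per split in a single table.
counts = df.groupby("split").agg(
    images=("dicom_id", "nunique"),
    studies=("study_id", "nunique"),
    patients=("subject_id", "nunique"),
)

# Patients assigned to more than one split would indicate leakage.
splits_per_patient = df.groupby("subject_id")["split"].nunique()
leaking_patients = splits_per_patient[splits_per_patient > 1]
```

Unlike `drop_duplicates`, which keeps only the first row seen for each identifier, `nunique` counts an identifier in every split where it occurs, so the two approaches agree only when the splits are disjoint.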

## Structured labels (mimic-cxr-2.0.0-chexpert.csv.gz and mimic-cxr-2.0.0-negbio.csv.gz)

The mimic-cxr-2.0.0-chexpert.csv.gz and mimic-cxr-2.0.0-negbio.csv.gz files are compressed comma-separated value (CSV) files. A total of 227,827 studies are assigned a label by CheXpert and NegBio. Eight studies could not be labeled due to a lack of a findings or impression section. The first two columns are:

* subject_id - An integer unique for an individual patient
* study_id - An integer unique for an individual study (i.e. an individual radiology report with one or more images associated with it)

The remaining columns are labels as presented in the CheXpert article [8]:

* Atelectasis
* Cardiomegaly
* Consolidation
* Edema
* Enlarged Cardiomediastinum
* Fracture
* Lung Lesion
* Lung Opacity
* Pleural Effusion
* Pneumonia
* Pneumothorax
* Pleural Other
* Support Devices
* No Finding

Note that "No Finding" is the absence of any of the 13 descriptive labels and a check that the text does not mention a specified set of other common findings beyond those covered by the descriptive labels. Thus, it is possible for a study in the CheXpert set to have no labels assigned. For example, study 57321224 has the following findings/impression text: "Hyperinflation.  No evidence of acute disease.". Normally this would be assigned a label of "No Finding", but the use of "hyperinflation" suppresses the labeling of no finding. For details see the CheXpert article [8]; the list of phrases is publicly available in their code repository (phrases/mention/no_finding.txt). There are 2414 studies which do not have a label assigned by CheXpert. Conversely, all studies present in the provided files have been assigned a label by NegBio.
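
Studies with no label at all can be found by testing whether every label column is missing. The sketch below uses a toy frame with three of the label columns and made-up identifiers; `label_cols` and `unlabeled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a label file; identifiers are placeholders.
df_labels = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "study_id": [11, 22, 33],
    "Atelectasis": [np.nan, 1.0, np.nan],
    "Pneumothorax": [np.nan, 0.0, np.nan],
    "No Finding": [1.0, np.nan, np.nan],
})

# Every column other than the two identifiers is treated as a label.
label_cols = [c for c in df_labels.columns if c not in ("subject_id", "study_id")]

# A study is unlabeled when all of its label columns are missing.
unlabeled = df_labels[df_labels[label_cols].isna().all(axis=1)]
```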

Each label column contains one of four values: 1.0, -1.0, 0.0, or missing. These labels have the following interpretation:

* 1.0 - The label was positively mentioned in the associated study, and is present in one or more of the corresponding images
    * e.g. "A large pleural effusion"
* 0.0 - The label was negatively mentioned in the associated study, and therefore should not be present in any of the corresponding images
    * e.g. "No pneumothorax."
* -1.0 - The label was either: (1) mentioned with uncertainty in the report, and therefore may or may not be present to some degree in the corresponding image, or (2) mentioned with ambiguous language in the report and it is unclear if the pathology exists or not
    * Explicit uncertainty: "The cardiac size cannot be evaluated."
    * Ambiguous language: "The cardiac contours are stable."
* Missing (empty element) - No mention of the label was made in the report
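
Most classifiers need binary targets, so the uncertain (-1.0) and missing values have to be mapped somewhere. The sketch below shows two simple conventions: treating uncertain as negative or as positive. Treating a missing mention as negative is an assumption made here for illustration, not something the dataset prescribes, and the helper `binarize` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy label columns covering all four possible values.
labels = pd.DataFrame({
    "Pleural Effusion": [1.0, 0.0, -1.0, np.nan],
    "Pneumothorax": [np.nan, 0.0, 1.0, -1.0],
})

def binarize(frame, uncertain_as):
    """Map -1.0 to `uncertain_as` and missing to 0 (assumed negative)."""
    return frame.replace(-1.0, uncertain_as).fillna(0.0).astype(int)

uncertain_negative = binarize(labels, 0.0)  # uncertain counted as absent
uncertain_positive = binarize(labels, 1.0)  # uncertain counted as present
```

Which convention is appropriate depends on the finding and the downstream task, so it is worth comparing both rather than fixing one up front.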


### negbio

In [15]:
df_negbio = pd.read_csv('./data/MIMIC-III-CXR-JPG/mimic-cxr-2.0.0-negbio.csv.gz')
df_negbio.head()

Unnamed: 0,subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,10000032,50414267,,,,,,,,,1.0,,,,,
1,10000032,53189527,,,,,,,,,1.0,,,,,
2,10000032,53911762,,,,,,,,,1.0,,,,,
3,10000032,56699142,,,,,,,,,1.0,,,,,
4,10000764,57375967,,,1.0,,,,,,,,,-1.0,,


### chexpert

In [14]:
df_chexpert = pd.read_csv('./data/MIMIC-III-CXR-JPG/mimic-cxr-2.0.0-chexpert.csv.gz')
df_chexpert.head()

Unnamed: 0,subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,10000032,50414267,,,,,,,,,1.0,,,,,
1,10000032,53189527,,,,,,,,,1.0,,,,,
2,10000032,53911762,,,,,,,,,1.0,,,,,
3,10000032,56699142,,,,,,,,,1.0,,,,,
4,10000764,57375967,,,1.0,,,,,,,,,-1.0,,
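
The labels are given per study while the split file is per image, so the two are typically joined on `subject_id` and `study_id` to obtain one labeled row per image. A sketch on toy frames with made-up identifiers follows; `df_images` and `train` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_split and df_chexpert; identifiers are placeholders.
df_split_toy = pd.DataFrame({
    "dicom_id": ["d1", "d2", "d3"],
    "study_id": [11, 11, 22],
    "subject_id": [1, 1, 2],
    "split": ["train", "train", "test"],
})
df_chexpert_toy = pd.DataFrame({
    "subject_id": [1, 2],
    "study_id": [11, 22],
    "Pneumonia": [1.0, np.nan],
    "No Finding": [np.nan, 1.0],
})

# Left join keeps every image; validate guards against duplicate study rows.
df_images = df_split_toy.merge(
    df_chexpert_toy,
    on=["subject_id", "study_id"],
    how="left",
    validate="many_to_one",
)

# Every image in a study inherits that study's labels.
train = df_images[df_images["split"] == "train"]
```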


# 2. MIMIC-CXR

MIMIC-CXR v2.0.0 contains:

* A set of 10 folders (p10 - p19), each with ~6,500 sub-folders. Sub-folders are named according to the patient identifier, and contain free-text reports and DICOM files for all studies for that patient
* cxr-record-list.csv.gz - a compressed file providing the link between an image, its corresponding study identifier, and its corresponding patient identifier
* cxr-study-list.csv.gz - a compressed file providing a link between anonymous study and patient identifiers
* mimic-cxr-reports.zip - for convenience, all free-text reports have been compressed in a single archive file


## cxr-record-list.csv.gz
* This file lists all DICOM images available in the dataset.
* It also provides a mapping of these DICOM images to their corresponding anonymous study and subject identifiers.

In [33]:
df_record_list = pd.read_csv("./data/MIMIC-III-CXR/cxr-record-list.csv.gz")
df_record_list.head()

Unnamed: 0,subject_id,study_id,dicom_id,path
0,10000032,50414267,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,files/p10/p10000032/s50414267/02aa804e-bde0afd...
1,10000032,50414267,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,files/p10/p10000032/s50414267/174413ec-4ec4c1f...
2,10000032,53189527,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,files/p10/p10000032/s53189527/2a2277a9-b0ded15...
3,10000032,53189527,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,files/p10/p10000032/s53189527/e084de3b-be89b11...
4,10000032,53911762,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,files/p10/p10000032/s53911762/68b5c4b1-227d048...
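
The `path` column follows the pattern `files/p<first two digits>/p<subject_id>/s<study_id>/<dicom_id>`. A sketch of building the matching JPG path for MIMIC-CXR-JPG follows, assuming the JPG release mirrors this folder layout with a `.jpg` extension. The helper `jpg_path` and its `root` parameter are hypothetical.

```python
from pathlib import PurePosixPath

def jpg_path(subject_id, study_id, dicom_id, root="files"):
    """Build a relative JPG path from the three identifiers.

    Assumes the layout files/pXX/pSUBJECT/sSTUDY/DICOM.jpg.
    """
    subject = f"p{subject_id}"
    # The top-level folder is "p" plus the first two digits of the subject id.
    return str(
        PurePosixPath(root) / subject[:3] / subject / f"s{study_id}" / f"{dicom_id}.jpg"
    )

example = jpg_path(10000032, 50414267, "02aa804e-bde0afdd-112c0b34-7bc16630-4e384014")
```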


## cxr-study-list.csv.gz
* This file lists all studies available in the dataset, and provides a mapping of these anonymous study identifiers to the patient identifier.

In [None]:
# Use a separate name so the record list loaded above is not overwritten
df_study_list = pd.read_csv("./data/MIMIC-III-CXR/cxr-study-list.csv.gz")
df_study_list.head()

## mimic-cxr-reports.zip
* This is a compressed archive containing all text reports in the dataset.
* The text reports are also available within each patient folder.

In [38]:
import zipfile

path_to_zip_file = "./data/MIMIC-III-CXR/mimic-cxr-reports.zip"
directory_to_extract_to = "./data/MIMIC-III-CXR/mimic-cxr-reports/"
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

In [39]:
# Recursively collect every extracted report (.txt) under the data folder
allFiles = glob.glob('./data/MIMIC-III-CXR/**/*.txt', recursive=True)
len(allFiles)

227835

In [40]:
################# remove the unzipped data #################
# os.system('rm -rf ./data/MIMIC-III-CXR/mimic-cxr-reports/')
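
If the extracted copy is only needed for counting or occasional reads, the archive can be read in place, which avoids both the extraction and the cleanup. A sketch using only the standard library follows. The helpers `count_reports` and `read_report` are hypothetical, and the demo builds a tiny in-memory archive with placeholder content rather than touching the real file.

```python
import io
import zipfile

def count_reports(zip_source):
    """Count .txt members in the archive without extracting them."""
    with zipfile.ZipFile(zip_source) as zf:
        return sum(name.endswith(".txt") for name in zf.namelist())

def read_report(zip_source, member):
    """Return the text of one archive member."""
    with zipfile.ZipFile(zip_source) as zf:
        return zf.read(member).decode("utf-8")

# Demo on an in-memory archive; member names and text are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("files/p10/p10000032/s50414267.txt", "placeholder report text")
    zf.writestr("files/README", "not a report")

n_reports = count_reports(buffer)
text = read_report(buffer, "files/p10/p10000032/s50414267.txt")
```

With the real archive, `zip_source` would be the path used above, for example `count_reports("./data/MIMIC-III-CXR/mimic-cxr-reports.zip")`.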