This project contains source code for fully replicating MIMIC-Extract.
Warning: This repository contains over 1.5 GB of MIMIC-Extract output files tracked with Git LFS. If you want to save time cloning this repository, and don't care about the output files, clone with the following command:
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/atwalsh/DL4H-Project.git
NOTE: The output/ directory data was removed as MIMIC cannot be distributed publicly. Please run the extract script to obtain output data.
.
│ # Copy of MIT-LCP/mimic-code
├── mimic-code/
│ # Customized version of MIMIC-Extract
├── MIMIC_Extract/
│ # Script to easily set up and run MIMIC-Extract database resources
├── mimic_ez.py
│ # Sample outputs from MIMIC-Extract
├── output/
│ # Default settings; over 20 GB when unzipped
│ ├── full.zip
│ # Default settings with population size of 25
│ ├── population-25/
│ # Default settings with population size of 1000
│ ├── population-1000/
│ # Ablated data, skipping unit conversion with population size of 1000
│ └── population-1000-no-unit-conversion/
├── README.md
│ # Notebook comparing default exported data with ablated dataset
└── sample_notebook.ipynb
Note: .h5 and .npy output files are tracked with Git LFS.
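If the repository was cloned with GIT_LFS_SKIP_SMUDGE=1, the .h5 and .npy files on disk are small text pointers rather than real data. The sketch below is one way to detect that, assuming the standard Git LFS pointer format, whose first line starts with "version https://git-lfs.github.com/spec/v1". The function names are hypothetical helpers and are not part of this repository.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file (standard pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` starts with the Git LFS pointer header."""
    with open(str(path), "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def unfetched_outputs(root, suffixes=(".h5", ".npy")):
    """List files under `root` with the given suffixes that are still pointers."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes and is_lfs_pointer(p)
    )
```

Calling unfetched_outputs("output") returns the files that still need to be fetched, for example with git lfs pull.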
The output/ directory contains the following outputs from MIMIC-Extract, generated by running ./MIMIC_Extract/mimic_direct_extract.py:
- full.zip: A compressed program output folder using the default settings and the full population size
- population-25: Program output with a population size set to 25
- population-1000: Program output with a population size set to 1000
- population-1000-no-unit-conversion: Program output with a population size set to 1000, and skipping the unit conversion function of MIMIC-Extract
- MIT-LCP/mimic-code -> MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases.
- atwalsh/MIMIC_Extract -> A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III. Customized for this application.
- MIMIC-III Clinical Database via PhysioNet.
This repository provides two methods of configuring your local environment for running MIMIC-Extract:
- Using the mimic_ez.py script to automatically run all necessary steps to build the MIMIC-III database and configure MIMIC_Extract
- Following the instructions below to build MIMIC-III and configure MIMIC_Extract manually
Both require that you have downloaded the MIMIC-III Clinical Database, which requires the following steps:
- Register for an account on the PhysioNet website, which hosts the MIMIC-III database.
- Become a credentialed user on PhysioNet.
- Complete the CITI Program training in human subjects research and HIPAA privacy rules.
- Sign the data use agreement (DUA) form for the dataset.
Before following either method to configure MIMIC_Extract, you must have the following dependencies installed:
- The mimic-code repository suggests that users should reserve 100 GB of space for the PostgreSQL database. It will also likely take many (6+) hours to build the database.
- While users may download individual files of the mimic-code repository, the MIMIC-Extract codebase expects the full database. Users who download only part of the dataset should therefore watch for errors caused by the missing data.
- Downloading the full ZIP file from PhysioNet may take an hour or more.
- In our test case, MIMIC-Extract took between 1.5 and 2 hours to run on the full population. Running with a smaller population (e.g., 25) takes only a few minutes.
We were able to run MIMIC-Extract against the full MIMIC-III database with a 2021 MacBook Pro with an Apple M1 Max and 64 GB of RAM.
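Since the build needs roughly 100 GB for the PostgreSQL database, it can help to check free disk space before starting a multi-hour run. This is a minimal sketch using only the Python standard library. The helper name and the decimal GB convention (10^9 bytes) are assumptions made for illustration.

```python
import shutil

GB = 10 ** 9  # decimal gigabyte; an assumption about how "100 GB" is counted


def has_free_space(path, required_gb):
    """Return (ok, free_gb) for the filesystem that holds `path`."""
    free_gb = shutil.disk_usage(path).free / GB
    return free_gb >= required_gb, free_gb


if __name__ == "__main__":
    # "." is a placeholder; point this at the volume that stores PostgreSQL data.
    ok, free_gb = has_free_space(".", 100)
    print("free: {:.1f} GB, enough for the database: {}".format(free_gb, ok))
```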
Gain access to the MIMIC-III Clinical Database on PhysioNet:
- Register for an account on the PhysioNet website, which hosts the MIMIC-III database.
- Become a credentialed user on PhysioNet.
- Complete the CITI Program training in human subjects research and HIPAA privacy rules.
- Sign the data use agreement (DUA) form for the dataset (at the bottom of the MIMIC-III PhysioNet page).
- Download the ZIP file for the MIMIC-III Clinical Database (~6.2 GB in size).
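Because MIMIC-Extract expects the full database, a quick check of the unzipped download can catch a partial dataset before the long database build. The sketch below assumes the tables are stored as files named after the table (for example ADMISSIONS.csv.gz or ADMISSIONS.csv). The helper name is hypothetical, and the table tuple is a deliberately partial illustration that should be extended to match the table list on the PhysioNet page.

```python
from pathlib import Path

# Partial, illustrative list of MIMIC-III table names; not the complete set.
EXAMPLE_TABLES = ("ADMISSIONS", "PATIENTS", "ICUSTAYS", "CHARTEVENTS", "LABEVENTS")


def missing_tables(data_dir, tables=EXAMPLE_TABLES):
    """Return the names in `tables` that have no matching file in `data_dir`."""
    present = {
        p.name.split(".")[0].upper()  # "ADMISSIONS.csv.gz" -> "ADMISSIONS"
        for p in Path(data_dir).iterdir()
        if p.is_file()
    }
    return [name for name in tables if name not in present]
```

An empty list from missing_tables("/path/to/mimic/unzipped/") means every listed table was found. It does not verify file contents or sizes.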
The below steps outline how to successfully build MIMIC-III and run the MIMIC-Extract program. This was tested on a 2021 MacBook Pro with an Apple M1 Max and 64 GB of RAM, using PostgreSQL 13.5.
- Clone this repository with submodules initialized
- Install dependencies and create the Conda environment:
  - Run
    export MACOSX_DEPLOYMENT_TARGET=10.9
  - Run
    conda env create --force -f ./MIMIC_Extract/mimic_extract_env_py36.yml
    - This creates a Conda environment using our modified dependency file
  - Activate the environment:
    conda activate mimic_extract_env_py36
- Run the mimic_ez.py script:
    python mimic_ez.py \
    --mimic_zip_path $UNZIPPED_MIMIC_DATABASE_DIR \
    --mimic_code_path $CLONED_MIMIC_CODE_DIR \
    --mimic_extract_path $CLONED_MIMIC_EXTRACT_DIR \
    --pg_host localhost \
    --pg_user postgres \
    --pg_password postgres \
    --pg_port 5432 \
    --pg_db mimicez
- Change directory to $CLONED_MIMIC_EXTRACT_DIR and run python mimic_direct_extract.py
  - Set the --out-path parameter to the desired extraction output location
  - Set the --pop_size parameter to set a population size
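The mimic_ez.py invocation above can also be assembled programmatically, which avoids shell line-continuation mistakes. This is a minimal sketch: the flag names and default values are copied from the command shown above, the helper name is hypothetical, and the paths in the usage example are placeholders. It only builds and prints the command; it does not run it.

```python
import shlex


def mimic_ez_command(mimic_zip_path, mimic_code_path, mimic_extract_path,
                     pg_host="localhost", pg_user="postgres",
                     pg_password="postgres", pg_port=5432, pg_db="mimicez"):
    """Build the argument list for mimic_ez.py using the flags shown above."""
    return [
        "python", "mimic_ez.py",
        "--mimic_zip_path", str(mimic_zip_path),
        "--mimic_code_path", str(mimic_code_path),
        "--mimic_extract_path", str(mimic_extract_path),
        "--pg_host", pg_host,
        "--pg_user", pg_user,
        "--pg_password", pg_password,
        "--pg_port", str(pg_port),
        "--pg_db", pg_db,
    ]


if __name__ == "__main__":
    # Placeholder paths; replace with your own directories.
    cmd = mimic_ez_command("/data/mimic-iii", "./mimic-code", "./MIMIC_Extract")
    # Print a shell-safe version for inspection before running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it and raise an error if the script exits with a non-zero status.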
- Clone mimic-code
- Configure a new PostgreSQL database with the name mimic
- Follow the steps in mimic-code to create MIMIC-III in a local Postgres database
  - Unzip the MIMIC-III database
  - Open mimic-code/mimic-iii/buildmimic/postgres/ in Terminal
  - Run
    make create-user mimic-gz datadir="/path/to/mimic/unzipped/"
    - NOTE: This will likely take many (6+) hours to complete
    - By default, the Makefile uses the following parameters:
      - Database name: mimic
      - User name: postgres
      - Password: postgres
      - Schema: mimiciii
      - Host: none (defaults to localhost)
      - Port: none (defaults to 5432)
- Follow the steps in mimic-code to generate concepts in PostgreSQL
  - Change directory to mimic-code/mimic-iii/concepts_postgres/
  - Run
    psql -d mimic
  - Run
    SET search_path TO mimiciii;
  - Run
    \i postgres-functions.sql
  - Run
    \i postgres-make-concepts.sql
  - Exit the database with
    \q
- Clone MIMIC-Extract
- Create the Conda environment (see the updated Conda .yml in atwalsh/MIMIC-Extract)
  - Open the cloned MIMIC_Extract/utils/ folder in Terminal
  - Run
    export MACOSX_DEPLOYMENT_TARGET=10.9
  - Run
    conda env create --force -f ../mimic_extract_env_py36.yml
  - Activate the environment with
    conda activate mimic_extract_env_py36
  - Install the English language model for spaCy:
    python -m spacy download en_core_web_sm
- Build additional MIMIC-Extract concepts in PostgreSQL
  - Run
    bash postgres_make_extended_concepts.sh
  - Run
    psql -d mimic
  - Run
    \i niv-durations.sql
- Run the extraction script with
    python mimic_direct_extract.py
  - Set the --out-path parameter to the desired extraction output location
  - Set the --pop_size parameter to set a population size
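The interactive psql steps above (set the search path, then load SQL files with \i) can also be expressed as a single non-interactive command, which is easier to script and rerun. The sketch below is an alternative to the interactive session, not the method this repository uses. It assumes PostgreSQL 9.6 or later, where psql accepts repeated -c and -f options and runs them in order within one session, and it uses the database and schema names from the Makefile defaults listed above. The helper name is hypothetical.

```python
import shlex


def psql_script_command(scripts, db="mimic", schema="mimiciii"):
    """Build a psql command that sets the search path, then runs each script."""
    cmd = [
        "psql", "-d", db,
        "-v", "ON_ERROR_STOP=1",  # stop at the first SQL error
        "-c", "SET search_path TO {};".format(schema),
    ]
    for script in scripts:
        cmd += ["-f", script]  # equivalent of \i <script> in the interactive session
    return cmd


if __name__ == "__main__":
    # Run from mimic-code/mimic-iii/concepts_postgres/, as in the steps above.
    cmd = psql_script_command(["postgres-functions.sql", "postgres-make-concepts.sql"])
    print(" ".join(shlex.quote(part) for part in cmd))
```

The same helper covers the later step by passing ["niv-durations.sql"] from the MIMIC_Extract/utils/ folder.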
- Original MIMIC-Extract repository: https://github.com/MLforHealth/MIMIC_Extract
- Original mimic-code repository: https://github.com/MIT-LCP/mimic-code
Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann, and Marzyeh Ghassemi. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III. arXiv:1907.08322.