# Data Description Notebook

This notebook describes the MIMIC II (Multiparameter Intelligent Monitoring in Intensive Care) database,
a freely available Intensive Care Unit (ICU) database. The database and its user guide are available from:
http://www.physionet.org/mimic2

As of version 2.6 (April 2011), MIMIC II contains around 33,000 patients, of
which approximately 25,000 are adults (age ≥ 15 years at the time of last
admission) and around 8,000 are neonates (age ≤ 1 month at the time of first
admission). These patients experienced over 36,000 hospital admissions
and over 40,000 ICU stays.

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

In [2]:
import os 

In [3]:
os.listdir('.')

['.ipynb_checkpoints',
 'data_description.ipynb',
 'data_preprocessing.ipynb',
 'page 19 major MIMIC 2 clinical database component',
 'page 20 patient-to-ICD9 abd diagnosis-related froup code relationship',
 'page 21 around caregiver table',
 'page 22 around careunit table',
 'page 25 noteevents',
 'page 27 demographic information',
 'page 28 patient medication',
 'page 29 patient chart data',
 'page 30 patient IO data',
 'page 31 notes and reports',
 'page 33 laboratory and microbiology tests',
 'page 33-2 procedureevents',
 'User Guide for MIMIC II Database.pdf']

In [4]:
import fnmatch

In [5]:
for page in os.listdir('.'):
    if fnmatch.fnmatch(page, '*19*'):
        for table in os.listdir(os.path.join('.', page)):
            if fnmatch.fnmatch(table, '*.txt'):
                print(table)

a_meddurations.txt
additives.txt
admissions.txt
a_chartdurations.txt
a_iodurations.txt
censusevents.txt
deliveries.txt
d_patients.txt
icd9.txt
ioevents.txt
medevents.txt
noteevents.txt
totalbalevents.txt


In [6]:
# admissions and icd9
admissions_path = os.path.join('.', 'page 19 major MIMIC 2 clinical database component', 'admissions.txt')
icd9_path = os.path.join('.', 'page 19 major MIMIC 2 clinical database component', 'icd9.txt')

In [7]:
admissions_df = pd.read_csv(admissions_path, sep='|')
icd9_df = pd.read_csv(icd9_path, sep='|')

In [8]:
admissions_df.head() # subject_id, hadm_id, admit_dt

Unnamed: 0,subject_id,hadm_id,admit_dt,disch_dt
0,2,25967,2806-06-15 00:00:00,2806-06-19 00:00:00
1,3,2075,2682-09-07 00:00:00,2682-09-18 00:00:00
2,4,17296,3399-04-03 00:00:00,3399-04-10 00:00:00
3,5,1946,2579-04-09 00:00:00,2579-04-11 00:00:00
4,6,23467,3389-07-07 00:00:00,3389-07-23 00:00:00
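
The `Unnamed: 0` column in the output above is what pandas produces when the first header field of a file is empty, which usually means a row index was saved along with the data. A minimal sketch of loading such a file without the extra column, assuming the file really does begin with an index column (the inline `sample` text is a hypothetical stand-in built from the rows shown above, not the real file):

```python
import io

import pandas as pd

# Hypothetical stand-in for admissions.txt: pipe-delimited, with an
# unnamed leading index column (an assumption based on the output above).
sample = """|subject_id|hadm_id|admit_dt|disch_dt
0|2|25967|2806-06-15 00:00:00|2806-06-19 00:00:00
1|3|2075|2682-09-07 00:00:00|2682-09-18 00:00:00
"""

# index_col=0 uses the first column as the row index instead of
# loading it as a data column named "Unnamed: 0".
admissions_sample = pd.read_csv(io.StringIO(sample), sep="|", index_col=0)
print(list(admissions_sample.columns))
```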


In [9]:
icd9_df.head() # hadm_id, code

Unnamed: 0,subject_id,hadm_id,sequence,code,description
0,2,25967,1,V30.01,SINGLE LIVEBORN BORN IN HOSPITAL DELIVERED BY ...
1,2,25967,2,V05.3,NEED FOR PROPHYLACTIC VACCINATION AND INOCULAT...
2,2,25967,3,V29.0,OBSERVATION FOR SUSPECTED INFECTIOUS CONDITION
3,3,2075,1,038.9,UNSPECIFIED SEPTICEMIA
4,3,2075,2,785.59,OTHER SHOCK WITHOUT TRAUMA


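Each patient has one subject_id.

Each hospital admission of a patient has one hadm_id (hospital admission ID).

Each admission goes through a series of diagnoses, ordered by sequence.

Each diagnosis is assigned one icd9 code, which identifies the condition.

Conditions can be grouped into broad categories by code.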

icd9: International Classification of Diseases, Ninth Revision (ICD-9)
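
The patient, admission, diagnosis hierarchy described above can be checked directly on the table. The following is a small sketch that uses only the rows visible in the `icd9_df.head()` output (descriptions omitted); the names `icd9_sample` and `codes_per_admission` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Rows copied from the icd9_df.head() output above (descriptions omitted).
icd9_sample = pd.DataFrame(
    {
        "subject_id": [2, 2, 2, 3, 3],
        "hadm_id": [25967, 25967, 25967, 2075, 2075],
        "sequence": [1, 2, 3, 1, 2],
        "code": ["V30.01", "V05.3", "V29.0", "038.9", "785.59"],
    }
)

# Collect the codes of each admission in diagnosis order.
codes_per_admission = (
    icd9_sample.sort_values(["hadm_id", "sequence"])
    .groupby(["subject_id", "hadm_id"])["code"]
    .apply(list)
)
print(codes_per_admission)

# Within one admission, each sequence number should appear only once.
assert not icd9_sample.duplicated(["hadm_id", "sequence"]).any()
```

On the full table, the same `duplicated` check would flag any admission that lists a sequence number twice.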

In [10]:
# d_patients
d_patients_path = os.path.join('.', 'page 19 major MIMIC 2 clinical database component', 'd_patients.txt')
d_patients_df = pd.read_csv(d_patients_path, sep='|')
d_patients_df.head()

Unnamed: 0,subject_id,sex,dob,dod,hospital_expire_flg
0,1,F,2840-08-10 00:00:00,,N
1,2,M,2806-06-15 00:00:00,,N
2,3,M,2606-02-28 00:00:00,2683-05-02 00:00:00,N
3,4,F,3351-05-30 00:00:00,,N
4,5,M,2579-04-09 00:00:00,,N


Patient sex (sex), date of birth (dob), and date of death (dod).
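
Combining `dob` with `admit_dt` gives the age at admission, which is what the adult/neonate split in the introduction is based on. The years in this extract (for example 2806 and 3399) lie beyond the upper limit of pandas' default nanosecond timestamps (year 2262), so this sketch parses the strings with the standard library instead. The helper `age_in_years` and the two dictionaries are illustrative names; the values are copied from the `head()` outputs above.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def age_in_years(dob, admit_dt):
    """Whole years between two timestamp strings in DATE_FORMAT."""
    born = datetime.strptime(dob, DATE_FORMAT)
    admitted = datetime.strptime(admit_dt, DATE_FORMAT)
    # Subtract one year if the birthday has not yet occurred in the admission year.
    before_birthday = (admitted.month, admitted.day) < (born.month, born.day)
    return admitted.year - born.year - int(before_birthday)


# dob from d_patients_df.head(), admit_dt from admissions_df.head()
dob_by_subject = {
    2: "2806-06-15 00:00:00",
    3: "2606-02-28 00:00:00",
    4: "3351-05-30 00:00:00",
}
admit_by_subject = {
    2: "2806-06-15 00:00:00",
    3: "2682-09-07 00:00:00",
    4: "3399-04-03 00:00:00",
}

ages = {
    sid: age_in_years(dob_by_subject[sid], admit_by_subject[sid])
    for sid in dob_by_subject
}
print(ages)
```

Subject 2 is admitted on the date of birth, which is consistent with the newborn diagnosis codes listed for that admission.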

In [11]:
icd9_class = {
    ("001", "139"): "infectious and parasitic diseases",
    ("140", "239"): "neoplasms",
    ("240", "279"): "metabolic diseases",
    ("280", "289"): "diseases of the blood and blood-forming organs",
    ("290", "319"): "mental disorders",
    ("320", "389"): "neurologic disease",
    ("390", "392"): "acute rheumatic fever",
    ("393", "398"): "chronic rheumatic heart disease",
    ("401", "405"): "hypertensive disease",
    ("410", "414"): "ischemic heart disease",
    ("415", "417"): "diseases of pulmonary circulation",
    ("428", "428"): "heart failure",
    ("420", "429"): "other forms of heart disease",
    ("430", "438"): "cerebrovascular disease",
    ("440", "459"): "arteries and veins",
    ("460", "519"): "pulmonary disease",
    ("520", "579"): "digestive disease",
    ("580", "629"): "renal insufficiency",
    ("630", "677"): "Complications of pregnancy, childbirth, and the puerperium",
    ("680", "709"): "diseases of the skin and subcutaneous tissue",
    ("710", "739"): "diseases of the musculoskeletal system & connective tissue",
    ("740", "759"): "congenital anomalies",
    ("780", "799"): "symptoms, signs, and ill-defined conditions",
    ("800", "959"): "trauma",
    ("960", "989"): "poisoning",
    ("990", "995"): "other and unspecified effects of external causes",
    ("996",): "complications peculiar to certain specified procedures",
    ("997",): "complications affecting specified body systems, not elsewhere classified",
    ("998",): "other complications of procedures, NEC",
    ("999",): "complications of medical care, not elsewhere classified",
    ("E800", "E999"): "supplementary classification of external causes of injury and poisoning",
    ("V81", "V86"): "supplementary classification of factors influencing health status and contact with health services",
}

# lower bounds: description
lbs = []
lb_desc = {}

for bounds, description in icd9_class.items():
    lbs.append(bounds[0])
    lb_desc[bounds[0]] = description
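
Because the table stores both a lower and an upper bound, a code can also be classified by testing the full range rather than the lower bound alone. The sketch below is one possible approach, not the notebook's method: it uses a hypothetical list of `(low, high, description)` triples covering only a few of the ranges above, and it lists heart failure (428) before the overlapping 420-429 range so that the narrower category wins.

```python
# A small subset of the ranges above; heart failure (428) overlaps the
# broader 420-429 range and is listed first so that it takes precedence.
icd9_ranges = [
    ("001", "139", "infectious and parasitic diseases"),
    ("428", "428", "heart failure"),
    ("420", "429", "other forms of heart disease"),
    ("780", "799", "symptoms, signs, and ill-defined conditions"),
]


def classify_icd9(code, ranges=icd9_ranges):
    """Return the first category whose range contains the code, else None."""
    # Compare only the part before the decimal point, e.g. "785.59" -> "785".
    major = code.split(".")[0]
    for low, high, description in ranges:
        # The length check keeps "E800"-style codes out of 3-character ranges.
        if len(major) == len(low) and low <= major <= high:
            return description
    return None


print(classify_icd9("785.59"))
print(classify_icd9("428.0"))
```

Returning `None` for codes outside every listed range makes gaps in the table visible instead of silently assigning them to a neighbouring category.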


In [12]:
lbs 

['001',
 '140',
 '240',
 '280',
 '290',
 '320',
 '390',
 '393',
 '401',
 '410',
 '415',
 '428',
 '420',
 '430',
 '440',
 '460',
 '520',
 '580',
 '630',
 '680',
 '710',
 '740',
 '780',
 '800',
 '960',
 '990',
 '996',
 '997',
 '998',
 '999',
 'E800',
 'V81']

In [13]:
lb_desc

{'001': 'infectious and parasitic diseases',
 '140': 'neoplasms',
 '240': 'metabolic diseases',
 '280': 'diseases of the blood and blood-forming organs',
 '290': 'mental disorders',
 '320': 'neurologic disease',
 '390': 'acute rheumatic fever',
 '393': 'chronic rheumatic heart disease',
 '401': 'hypertensive disease',
 '410': 'ischemic heart disease',
 '415': 'diseases of pulmonary circulation',
 '428': 'heart failure',
 '420': 'other forms of heart disease',
 '430': 'cerebrovascular disease',
 '440': 'arteries and veins',
 '460': 'pulmonary disease',
 '520': 'digestive disease',
 '580': 'renal insufficiency',
 '630': 'Complications of pregnancy, childbirth, and the puerperium',
 '680': 'diseases of the skin and subcutaneous tissue',
 '710': 'diseases of the musculoskeletal system & connective tissue',
 '740': 'congenital anomalies',
 '780': 'symptoms, signs, and ill-defined conditions',
 '800': 'trauma',
 '960': 'poisoning',
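 '990': 'other and unspecified effects of external causes',
 '996': 'complications peculiar to certain specified procedures',
 '997': 'complications affecting specified body systems, not elsewhere classified',
 '998': 'other complications of procedures, NEC',
 '999': 'complications of medical care, not elsewhere classified',
 'E800': 'supplementary classification of external causes of injury and poisoning',
 'V81': 'supplementary classification of factors influencing health status and contact with health services'}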


In [14]:
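def find_largest_lb(code, sorted_lbs):
    # Binary search for the largest lower bound that is <= code.
    # sorted_lbs must be in ascending (string) order.
    left = 0
    right = len(sorted_lbs) - 1
    while left <= right:
        mid = (left + right) // 2
        if sorted_lbs[mid] <= code:
            left = mid + 1   # answer is at mid or to its right
        else:
            right = mid - 1  # answer is strictly left of mid

    return sorted_lbs[right]

# lbs is in insertion order ('428' precedes '420'), so sort it before searching
find_largest_lb("785.59", sorted(lbs))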

'780'
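
The same lookup can be written with the standard library's `bisect` module. This is an alternative sketch, not the notebook's own implementation: `largest_lower_bound` and the shortened `bounds` list are illustrative, and the function returns `None` when a code sorts before every bound.

```python
import bisect


def largest_lower_bound(code, sorted_bounds):
    """Return the largest bound <= code, or None if code precedes every bound."""
    # bisect_right gives the insertion point after any entries equal to code,
    # so the element just before it is the largest bound <= code.
    idx = bisect.bisect_right(sorted_bounds, code)
    return sorted_bounds[idx - 1] if idx > 0 else None


# A shortened, sorted list of lower bounds taken from the table above.
bounds = sorted(["001", "140", "420", "428", "780", "800"])
print(largest_lower_bound("785.59", bounds))
```

A lower-bound lookup ignores the upper end of each range, so a code that falls in a gap of the table (for example 760-779, which has no entry) would be reported under the preceding category.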