## NOTE : This notebook was originally created by Dr. Brian Chapman and others.  It has been modified slightly for our 2017 course.

# Identifying Patient Cohorts in [MIMIC-II](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3124312/)


[MIMIC-II](https://physionet.org/mimic2/mimic2_clinical_overview.shtml) is a freely available database of ICU patients. To access the full database (now migrated to [MIMIC-III](https://www.nature.com/articles/sdata201635.pdf)), you must sign a data use agreement (DUA). However, there is a [demo data set](https://physionet.org/mimic2/demo/) based on 4000 deceased patients that can be used without signing a DUA.

## How to Use the MIMIC-II Database
* [MIMIC-II Cookbook](https://physionet.org/mimic2/demo/MIMICIICookBook_v1.pdf)
* [MIMIC Data Dictionaries](http://physionet.incor.usp.br/physiobank/database/dictionaries/)


## The Varieties of...Data
The data set is rich, which makes it a good resource for exploring the varieties of clinical data.

![MIMIC Paper](./images/mimic_paper_header.jpg)
![MIMIC Publications](./images/mimic_publications.jpg)
(Sources : https://mimic.physionet.org/)

The data include free-text notes (nursing, radiology, discharge summaries, etc.), input/output events, test results, procedure codes, diagnosis codes, etc.

# Very Short FAQ : 
* Q : What is the difference between MIMIC-II and MIMIC-III?
* A : MIMIC-II spans the time period of 2001 to 2008.  MIMIC-III spans 2001 to 2012, so it contains more data.  In addition, some data structures have been improved to make MIMIC-III easier to work with, and some data quality issues have been resolved.


* Q : How can I get access to MIMIC-III for my own research?
* A : You'll need to do CITI training and then some other steps.  Start here: https://mimic.physionet.org/gettingstarted/access/

In [None]:
%matplotlib inline

In [None]:
import getpass
import pandas as pd
import pymysql
import seaborn as sns

In [None]:
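# Connect to the course MySQL server; the password is prompted for rather than stored in the notebook
conn = pymysql.connect(host="mysql",
                       port=3306,
                       user="jovyan",
                       passwd=getpass.getpass("Enter MySQL passwd for jovyan"),
                       db="mimic2")
cursor = conn.cursor()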

## Example Query: Identifying ICD9 Codes for Patients

In [None]:
icd9_codes = pd.read_sql('SELECT subject_id, code, description FROM icd9', conn)


In [None]:
# On a Series, value_counts() needs no column argument; call it bare to get raw counts per description
icd9_counts = icd9_codes["description"].value_counts().to_frame(name="ICD9 Counts")
icd9_counts.head(10)
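As a small aside on the counting step, here is a minimal sketch on a toy DataFrame (the rows below are invented for illustration and are not MIMIC data). It shows the difference between raw counts and proportions: the first positional parameter of `Series.value_counts` is `normalize`, so passing any truthy value there yields fractions rather than counts.

```python
import pandas as pd

# Hypothetical toy rows standing in for the icd9 table; not real MIMIC data.
toy_codes = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3],
    "description": [
        "HYPERTENSION NOS",
        "CHR AIRWAY OBSTRUCT NEC",
        "HYPERTENSION NOS",
        "HYPERTENSION NOS",
        "ATRIAL FIBRILLATION",
    ],
})

# Raw number of rows per description, most frequent first.
raw_counts = toy_codes["description"].value_counts()

# Same tally expressed as a fraction of all rows.
proportions = toy_codes["description"].value_counts(normalize=True)

# Turn the counts into a one-column DataFrame with a readable column name.
counts_frame = raw_counts.to_frame(name="ICD9 Counts")
print(counts_frame.head(10))
print(proportions.round(2))
```

Note that these are counts of rows, not of patients: a patient who carries the same code more than once contributes more than one row.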

## Selecting Cohorts

Our most interesting explorations will be when we use information from multiple tables to limit/select cases. Here is an example selecting radiology reports for patients with COPD.

## Select all the radiology reports for patients with COPD
### [Codes obtained from CDC](http://www.cdc.gov/niosh/pdfs/98-157-d.pdf)
* chronic bronchitis (ICD-9 codes 490-491)
* emphysema (ICD-9 code 492)
* bronchiectasis (ICD-9 code 494)
* chronic airway obstruction (ICD-9 code 496)

The `\` character at the end of a line indicates that the statement continues on the next line.

In [None]:
copd_data = \
pd.read_sql("""SELECT noteevents.subject_id, 
                      noteevents.category, 
                      noteevents.text, 
                      icd9.code 
               FROM noteevents INNER JOIN icd9 ON 
                      noteevents.subject_id = icd9.subject_id 
               WHERE (   icd9.code LIKE '490%' OR
                         icd9.code LIKE '491%' OR
                         icd9.code LIKE '492%' OR
                         icd9.code LIKE '494%' OR
                         icd9.code LIKE '496%'
                      ) 
                      AND noteevents.category = 'RADIOLOGY_REPORT'""",conn)
copd_data.head(20)
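One property of this kind of join is worth checking on a small example: joining on `subject_id` returns one row per (note, matching code) pair, so a patient with two COPD codes will have each radiology report listed twice. The sketch below reproduces the query shape against a hypothetical in-memory SQLite database with invented rows, so it runs without the course MySQL server. The table contents, the `copd_prefixes` list, and the `DISTINCT` variant are assumptions made for illustration. SQLite uses `?` placeholders, whereas PyMySQL uses `%s`.

```python
import sqlite3

# Hypothetical in-memory stand-in for the two tables; all rows are invented.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE icd9 (subject_id INTEGER, code TEXT);
    CREATE TABLE noteevents (subject_id INTEGER, category TEXT, text TEXT);
    INSERT INTO icd9 VALUES (1, '491.21'), (1, '496'), (2, '401.9');
    INSERT INTO noteevents VALUES
        (1, 'RADIOLOGY_REPORT', 'chest film A'),
        (1, 'Nursing', 'note B'),
        (2, 'RADIOLOGY_REPORT', 'chest film C');
""")

# Build the LIKE conditions from the list of code prefixes.
copd_prefixes = ["490", "491", "492", "494", "496"]
like_clause = " OR ".join("icd9.code LIKE ?" for _ in copd_prefixes)
params = [prefix + "%" for prefix in copd_prefixes]

# Same shape as the notebook query: one row per (note, matching code) pair.
joined_sql = f"""
    SELECT noteevents.subject_id, noteevents.text, icd9.code
    FROM noteevents INNER JOIN icd9
         ON noteevents.subject_id = icd9.subject_id
    WHERE ({like_clause})
      AND noteevents.category = 'RADIOLOGY_REPORT'
"""
joined_rows = toy.execute(joined_sql, params).fetchall()

# Dropping the code column and adding DISTINCT keeps each report once.
distinct_sql = f"""
    SELECT DISTINCT noteevents.subject_id, noteevents.text
    FROM noteevents INNER JOIN icd9
         ON noteevents.subject_id = icd9.subject_id
    WHERE ({like_clause})
      AND noteevents.category = 'RADIOLOGY_REPORT'
"""
distinct_rows = toy.execute(distinct_sql, params).fetchall()

print(len(joined_rows), len(distinct_rows))
toy.close()
```

Whether the duplicated rows matter depends on the analysis: they are useful if you want to know which code qualified each report, and misleading if you are counting reports.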

<img src="images/stopsign.png">
## STOP!  We don't have time to continue these exercises with the MIMIC data set, but we encourage you to come back to them later to understand the data set and how to work with it.

## Exercise

* Based on the query described on page 20 of the [MIMIC-II Cookbook](../Resources/MIMICIICookBook_v1.pdf), create a dataframe of urine output values from the database. Limit the query to a reasonable number of results.
* Create a visualization of the values.

## Exercise 

Come up with a visualization of the top ICD9 codes.

## Exercise

If you do not know the details of your database, how might you use Pandas to discover its nature? For example, how might you learn the possible values for ``category`` in ``chart_events``?

## Exercise

1. Use online resources (e.g. [findacode](https://www.findacode.com/search/search.php), [CMS](https://www.cms.gov/medicare/coding/ICD9providerdiagnosticcodes/codes.html)) or clinical knowledge to select patients with a disease (or diseases) of interest to you. Use the MIMIC-II Cookbook or the data dictionaries to identify variables of interest.

<br/><br/>This material was presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2018.<br/>
Presenters : Dr. Wendy Chapman, Jianlin Shi <br> Acknowledgement: Many thanks to Kelly Peterson; parts of this material are adapted from his previous work.