<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

In [None]:
from helpers import *
import pandas as pd

In [None]:
conn = connect_to_mimic()

In [None]:
import seaborn as sns
sns.set()

# Diagnosis Data in MIMIC-II

## Standards and Terminologies
**Terminologies** are collections of concepts used to describe data. Each concept represents a single, unique item and has a unique identifier, also called a **code**. Medical data has terminologies to represent diagnoses, medications, and procedures. Furthermore, these terminologies are **standardized** so that they can be used across institutions: a concept used to represent a disease in one healthcare system means the same thing in another.

One example of a terminology is the [**International Classification of Diseases (ICD)**](https://www.who.int/standards/classifications/classification-of-diseases) system. ICD codes are used to represent patient diagnoses and are used in healthcare systems across the world. There are a few different versions of the ICD system. In the US, ICD-9 codes were used until 2015, at which point ICD-10 became the main system. Since MIMIC-II data was generated before 2015, it uses ICD-9 codes to represent patient diagnoses.

The table `icd9` contains the diagnoses assigned to patient hospitalizations. Here are the first 10 rows of `icd9`. A hospitalization can have one or more ICD-9 codes, and the codes are ordered by importance in the `sequence` column.

In [None]:
query = """
SELECT * FROM icd9
LIMIT 10;
"""
df = pd.read_sql(query, conn)
df.head(10)

### Most common codes
Let's see which codes are used most frequently in MIMIC.

#### TODO
Write a query to get the `code` and `description` columns from `icd9`, along with a column called `n` which counts how many times each code/description pair appears in the table.

In [None]:
query = """
SELECT code, ____, ____ n
FROM ____
GROUP BY ____, ____
ORDER BY COUNT(*) DESC
"""
icd_counts = pd.read_sql(query, conn)
icd_counts.head()

#### TODO
How many *unique* diagnosis codes are there in `icd9`?

In [None]:
# RUN CELL TO SEE QUIZ
quiz_unique_icd

#### TODO
How many *total* diagnosis codes are there in `icd9`?

In [None]:
# RUN CELL TO SEE QUIZ
quiz_total_icd9

#### TODO
Create a plot showing the counts of the **10 most common** ICD-9 codes. Display the **description** on one of the axes.

In [None]:
# RUN CELL TO SEE HINT
hint_plot_icd9_counts

## Creating patient cohorts

Research projects typically create a dataset from a particular **patient cohort**, which is defined by attributes shared by a set of patients. These criteria often include a particular diagnosis. For example, if we want to create a cohort of patients with diabetes, we could run a query like this to identify all hospitalizations with the code **250.00: Diabetes Mellitus w/o Complications Type II**.

In [None]:
query = """
SELECT *
FROM icd9
WHERE code = '250.00'
LIMIT 10
"""
pd.read_sql(query, conn)

### `DISTINCT` and `LIKE`
The last query gave us the first 10 rows of `icd9` which had a particular diabetes code. But there are many codes which represent diabetes. Additionally, instead of identifying all hospitalizations which had this code, maybe we just want a unique list of ICD-9 codes to use as a **value set** for building datasets.

Two keywords which can help us here are `DISTINCT` and `LIKE`. The `DISTINCT` keyword deduplicates the values in your `SELECT` statement. So the code below returns all unique code/description pairs for this particular ICD-9 code:

In [None]:
query = """
SELECT DISTINCT code, description
FROM icd9
WHERE code = '250.00';
"""
pd.read_sql(query, conn)

The `LIKE` operator lets us do wildcard searches to match part of a text column, where `'%'` matches any sequence of zero or more characters. So by changing the `WHERE` clause above to `description LIKE '%diabetes%'`, we can find all rows in the table where the description column contains "diabetes". Then we can use `DISTINCT` to deduplicate them.

In [None]:
query = """
SELECT DISTINCT code, description
FROM icd9
WHERE description LIKE '%diabetes%';
"""
pd.read_sql(query, conn)
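
To see what `DISTINCT` and `LIKE` each contribute, here is a minimal sketch on a toy table. It assumes an in-memory SQLite database rather than the MIMIC connection, and the rows and descriptions are made up for illustration. SQLite's `LIKE` is case-insensitive for ASCII text; other databases may behave differently.

```python
import sqlite3

# Hypothetical toy stand-in for the icd9 table (not real MIMIC data).
toy = sqlite3.connect(":memory:")
toy.execute(
    "CREATE TABLE icd9 (subject_id INTEGER, hadm_id INTEGER, code TEXT, description TEXT)"
)
toy.executemany(
    "INSERT INTO icd9 VALUES (?, ?, ?, ?)",
    [
        (1, 10, "250.00", "type ii diabetes w/o complications"),
        (1, 11, "250.00", "type ii diabetes w/o complications"),
        (2, 20, "250.01", "type i diabetes w/o complications"),
        (2, 20, "401.9", "hypertension nos"),
        (3, 30, "486", "pneumonia, organism nos"),
    ],
)

# Without DISTINCT: one result row per matching diagnosis row.
all_rows = toy.execute(
    "SELECT code, description FROM icd9 WHERE description LIKE '%diabetes%'"
).fetchall()

# With DISTINCT: one result row per unique code/description pair.
unique_rows = toy.execute(
    "SELECT DISTINCT code, description FROM icd9 "
    "WHERE description LIKE '%diabetes%' ORDER BY code"
).fetchall()

print(len(all_rows), "matching rows")     # 3 matching rows
print(len(unique_rows), "unique pairs")   # 2 unique pairs
```

The repeated `250.00` row collapses into a single entry once `DISTINCT` is applied, which is the shape we want for a value set.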

#### TODO
Write and execute a query which returns all *unique* code/description pairs containing the word **pneumonia**. Save the result as `pna_codes`.

In [None]:
query = """

"""
pna_codes = pd.read_sql(query, conn)

In [None]:
pna_codes.head()

In [None]:
# RUN CELL TO TEST VALUE
test_pna_codes.test(pna_codes)

## Patient characteristics
Now that we know how to identify particular diagnoses, let's next study the characteristics of patients with those diagnoses.

The `icd9` table contains two columns which can be used to join to the tables `d_patients` and `demographic_detail`.

In [None]:
# RUN CELL TO SEE QUIZ
quiz_icd9_join_d_patients

In [None]:
# RUN CELL TO SEE QUIZ
quiz_icd9_join_demographic_detail

### `COUNT(DISTINCT ...)`
`icd9` and `demographic_detail` are both at the **hospitalization** level, meaning each row is tied to a hospitalization (`hadm_id`), while `d_patients` is at the **patient** level. That means that selecting `COUNT(*)` from `demographic_detail` or `icd9` will count hospitalization-level rows, not patients.

One way we could count the number of patients is by selecting `COUNT(DISTINCT subject_id)`. This first deduplicates the results by `subject_id`, then returns a count of the deduplicated set of patients.

Let's say we want to count the total number of patients who have had diabetes during any of their hospitalizations. We could write the following query:

In [None]:
query = """
SELECT COUNT(DISTINCT subject_id) n
FROM icd9
WHERE description LIKE '%diabetes%'
"""
pd.read_sql(query, conn)
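
The difference between counting rows, hospitalizations, and patients is easier to see on a handful of rows. The sketch below assumes a toy in-memory SQLite table with made-up data; only the column names `subject_id`, `hadm_id`, `code`, and `description` are taken from the queries in this notebook.

```python
import sqlite3

# Hypothetical toy stand-in for the icd9 table (not real MIMIC data).
toy = sqlite3.connect(":memory:")
toy.execute(
    "CREATE TABLE icd9 (subject_id INTEGER, hadm_id INTEGER, code TEXT, description TEXT)"
)
toy.executemany(
    "INSERT INTO icd9 VALUES (?, ?, ?, ?)",
    [
        (1, 10, "250.00", "diabetes a"),  # patient 1, first stay
        (1, 10, "250.40", "diabetes b"),  # same stay, second diabetes code
        (1, 11, "250.00", "diabetes a"),  # patient 1, second stay
        (2, 20, "250.00", "diabetes a"),  # patient 2
        (3, 30, "486", "pneumonia"),      # patient 3, no diabetes code
    ],
)

n_rows, n_stays, n_patients = toy.execute(
    """
    SELECT COUNT(*),
           COUNT(DISTINCT hadm_id),
           COUNT(DISTINCT subject_id)
    FROM icd9
    WHERE description LIKE '%diabetes%'
    """
).fetchone()

print(n_rows, n_stays, n_patients)  # 4 3 2
```

Four diagnosis rows match, but they belong to only three hospitalizations and two patients, so the choice of what to count changes the answer.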

In [None]:
# RUN CELL TO SEE QUIZ
quiz_count_distinct

We can also use `COUNT(DISTINCT ...)` with `GROUP BY` queries to count the number of distinct values in each group. For example, the query below counts the number of distinct patients in each ethnic group who had a code for diabetes:

In [None]:
query = """
SELECT 
    e.ethnicity_descr, COUNT(DISTINCT i.subject_id) n
FROM icd9 i
    INNER JOIN demographic_detail e
        ON i.hadm_id = e.hadm_id
WHERE description LIKE '%diabetes%'
GROUP BY e.ethnicity_descr
ORDER BY n DESC
"""

pd.read_sql(query, conn)
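
As a rough illustration of how `COUNT(DISTINCT ...)` behaves inside each group, the sketch below runs the same kind of join on toy tables in an in-memory SQLite database. The rows and the group labels `GROUP A` and `GROUP B` are hypothetical placeholders, not MIMIC values.

```python
import sqlite3

# Hypothetical toy tables (not real MIMIC data).
toy = sqlite3.connect(":memory:")
toy.execute(
    "CREATE TABLE icd9 (subject_id INTEGER, hadm_id INTEGER, code TEXT, description TEXT)"
)
toy.execute("CREATE TABLE demographic_detail (hadm_id INTEGER, ethnicity_descr TEXT)")
toy.executemany(
    "INSERT INTO icd9 VALUES (?, ?, ?, ?)",
    [
        (1, 10, "250.00", "diabetes"),
        (1, 11, "250.00", "diabetes"),  # same patient, second stay
        (2, 20, "250.00", "diabetes"),
        (3, 30, "250.00", "diabetes"),
        (4, 40, "486", "pneumonia"),
    ],
)
toy.executemany(
    "INSERT INTO demographic_detail VALUES (?, ?)",
    [(10, "GROUP A"), (11, "GROUP A"), (20, "GROUP A"), (30, "GROUP B"), (40, "GROUP B")],
)

per_group = toy.execute(
    """
    SELECT e.ethnicity_descr,
           COUNT(*) AS n_rows,
           COUNT(DISTINCT i.subject_id) AS n_patients
    FROM icd9 i
        INNER JOIN demographic_detail e
            ON i.hadm_id = e.hadm_id
    WHERE i.description LIKE '%diabetes%'
    GROUP BY e.ethnicity_descr
    ORDER BY n_patients DESC, e.ethnicity_descr
    """
).fetchall()

for row in per_group:
    print(row)
```

In this toy data, `GROUP A` has three matching rows but only two distinct patients, because patient 1 was hospitalized twice.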

#### TODO
Count the number of *unique* patients by sex who had a code containing **"pneumonia"**.

In [None]:
# RUN CELL TO SEE HINT
hint_pna_by_sex

In [None]:
query = """

"""
pd.read_sql(query, conn)

In [None]:
# RUN CELL TO SEE QUIZ
quiz_count_pna_by_sex

## Comorbidities
We're often interested in knowing about the **"comorbidity"** of a disease. A comorbidity is a condition which a patient has in addition to another condition. For example, if a patient has diabetes and they are also diagnosed with hypertension, then these two conditions would be comorbid. 

This is useful if we want to understand what conditions a population of patients might be at risk for based on the conditions they already have, or for measuring how certain diseases interact.

In this exercise we will identify what comorbidities patients have. But first, let's look at one more SQL technique called **subqueries**.

### Subqueries
A **subquery** is a query nested within a larger query. The subqueries in this notebook appear in the `FROM` clause, where they are surrounded by parentheses and need to have an alias:
```sql
SELECT * FROM (
    SELECT column
    FROM table
) AS sub
```

Instead of directly querying a table, this selects from the subquery. This can be useful if we want to reduce the results of one table before joining with another.

For example, the following query first identifies unique patients who have a diabetes code, then joins with `d_patients`:

In [None]:
query = """
SELECT p.*
FROM (
    SELECT DISTINCT subject_id 
    FROM icd9
    WHERE description LIKE '%diabetes%'
) sub
    INNER JOIN d_patients p
        ON sub.subject_id = p.subject_id
LIMIT 10
"""
pd.read_sql(query, conn)
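
To show why deduplicating in the subquery matters, here is a small sketch comparing a direct join with the subquery version. It assumes toy tables in an in-memory SQLite database; the data is made up, and the `sex` column on `d_patients` is an assumption used only for illustration.

```python
import sqlite3

# Hypothetical toy tables (not real MIMIC data).
toy = sqlite3.connect(":memory:")
toy.execute(
    "CREATE TABLE icd9 (subject_id INTEGER, hadm_id INTEGER, code TEXT, description TEXT)"
)
toy.execute("CREATE TABLE d_patients (subject_id INTEGER, sex TEXT)")
toy.executemany(
    "INSERT INTO icd9 VALUES (?, ?, ?, ?)",
    [
        (1, 10, "250.00", "diabetes"),
        (1, 11, "250.00", "diabetes"),  # patient 1 has two diabetes rows
        (2, 20, "486", "pneumonia"),
        (3, 30, "250.01", "diabetes"),
    ],
)
toy.executemany("INSERT INTO d_patients VALUES (?, ?)", [(1, "F"), (2, "M"), (3, "M")])

# Joining icd9 directly repeats a patient once per matching diagnosis row.
direct = toy.execute(
    """
    SELECT p.*
    FROM icd9 i
        INNER JOIN d_patients p
            ON i.subject_id = p.subject_id
    WHERE i.description LIKE '%diabetes%'
    ORDER BY p.subject_id
    """
).fetchall()

# Reducing icd9 to unique subject_ids first yields one row per patient.
via_subquery = toy.execute(
    """
    SELECT p.*
    FROM (
        SELECT DISTINCT subject_id
        FROM icd9
        WHERE description LIKE '%diabetes%'
    ) sub
        INNER JOIN d_patients p
            ON sub.subject_id = p.subject_id
    ORDER BY p.subject_id
    """
).fetchall()

print(direct)        # [(1, 'F'), (1, 'F'), (3, 'M')]
print(via_subquery)  # [(1, 'F'), (3, 'M')]
```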

Let's see how this can help us identify comorbidities. 

In the query below, the subquery first identifies patients who have any code whose description contains "diabetes". Then we join that set of patients with the `icd9` table to get all of their codes. We'll also filter out the diabetes codes in the outer query, since we already know that those patients have them:

In [None]:
query = """
SELECT 
    i.*
FROM 
    icd9 i
    INNER JOIN 
        /* Subquery: unique subject_ids of patients with a diabetes code */
        (
            SELECT DISTINCT
            subject_id
            FROM icd9
            WHERE description LIKE '%diabetes%'
        ) AS sub
    ON i.subject_id = sub.subject_id
WHERE i.description NOT LIKE '%diabetes%' -- filter out rows with 'diabetes'
LIMIT 10
"""
pd.read_sql(query, conn)

The next query then counts the number of *unique* patients who have each comorbidity and returns the 10 most common comorbidities.

In [None]:
query = """
SELECT 
    code,
    description,
    COUNT(DISTINCT i.subject_id) n
FROM 
    icd9 i
    INNER JOIN 
        /* Subquery: unique subject_ids of patients with a diabetes code */
        (
            SELECT DISTINCT
            subject_id
            FROM icd9
            WHERE description LIKE '%diabetes%'
        ) AS sub
    ON i.subject_id = sub.subject_id
WHERE i.description NOT LIKE '%diabetes%' -- filter out rows with 'diabetes'
GROUP BY code, description
ORDER BY COUNT(DISTINCT i.subject_id) desc
LIMIT 10
"""
pd.read_sql(query, conn)
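
The comorbidity count can be checked by hand on a few rows. The sketch below assumes a toy in-memory SQLite table with made-up data and applies the same subquery pattern, so that each step of the result can be traced.

```python
import sqlite3

# Hypothetical toy stand-in for the icd9 table (not real MIMIC data).
toy = sqlite3.connect(":memory:")
toy.execute(
    "CREATE TABLE icd9 (subject_id INTEGER, hadm_id INTEGER, code TEXT, description TEXT)"
)
toy.executemany(
    "INSERT INTO icd9 VALUES (?, ?, ?, ?)",
    [
        (1, 10, "250.00", "diabetes"),
        (1, 10, "401.9", "hypertension"),
        (1, 11, "401.9", "hypertension"),  # repeat stay, same patient
        (2, 20, "250.00", "diabetes"),
        (2, 20, "401.9", "hypertension"),
        (2, 20, "486", "pneumonia"),
        (3, 30, "401.9", "hypertension"),  # patient 3 has no diabetes code
        (3, 30, "486", "pneumonia"),
    ],
)

comorbid = toy.execute(
    """
    SELECT i.code, i.description, COUNT(DISTINCT i.subject_id) AS n
    FROM icd9 i
        INNER JOIN (
            SELECT DISTINCT subject_id
            FROM icd9
            WHERE description LIKE '%diabetes%'
        ) AS sub
        ON i.subject_id = sub.subject_id
    WHERE i.description NOT LIKE '%diabetes%'
    GROUP BY i.code, i.description
    ORDER BY n DESC, i.code
    LIMIT 10
    """
).fetchall()

for row in comorbid:
    print(row)
```

Patient 3 is excluded by the join because they never had a diabetes code, and patient 1's repeated hypertension row is counted once because the query counts distinct `subject_id` values.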

#### TODO
Write a query which returns the 10 most common comorbidities for patients with pneumonia. Save the results as `pna_cmrbd`. Make sure to count distinct patients.

In [None]:
query = """

"""
pna_cmrbd = pd.read_sql(query, conn)
pna_cmrbd

In [None]:
# RUN CELL TO SEE QUIZ
quiz_pna_cmrbd

#### Advanced
Can you add a column `prop` to the table above which is the **proportion** of all patients with pneumonia who have each comorbid condition?

In [None]:
# RUN CELL TO SEE HINT
hint_pna_prop_cmrbd