<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>


In [None]:
import pandas as pd
from helpers import *

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

In [None]:
conn = connect_to_mimic()

# Labs and Vitals
In the last two notebooks, we focused mainly on categorical data elements such as patient ethnicity and diagnoses. In this notebook we'll start looking at more numeric variables: lab results and patient vitals.

## I. Labs
Lab tests are used for diagnostic purposes. In MIMIC, the lab measurements are stored in `labevents`. Let's look at the first 10 rows of this table:

In [None]:
query = """SELECT * FROM labevents LIMIT 10;"""
df = pd.read_sql(query, conn)
df.head()

To understand what these values are, we'll need to turn to another terminology called **LOINC**. [**LOINC**](https://en.wikipedia.org/wiki/LOINC) is a standardized terminology representing laboratory tests and microbiology tests. Just as we used ICD-9 codes to study patient diagnoses, we'll now look at LOINC codes to study lab tests.

Metadata about the tests, such as a LOINC code and description, are stored in a separate table called `d_labitems`. As we discussed earlier today, this is common in relational database modeling since it means we don't need to store the name of the test every single time. Let's look at the first 10 rows of `d_labitems`. Note that there is information about the test, but no actual results.

In [None]:
query = """SELECT * FROM d_labitems LIMIT 10;"""
df = pd.read_sql(query, conn)
df.head()

To get the test metadata along with the test results, we can join these two tables together using the `itemid` column. 
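
As a rough illustration of this lookup-table pattern, here is a minimal sketch using an in-memory SQLite database rather than MIMIC. The tables `results` and `tests` and all of their rows are invented for the example; only the shared `itemid` key mirrors the structure described above.

```python
import sqlite3

# In-memory toy database; the table names and rows are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE results (row_id INTEGER, itemid INTEGER, valuenum REAL);
CREATE TABLE tests (itemid INTEGER, label TEXT);
INSERT INTO results VALUES (1, 100, 7.4), (2, 200, 95.0), (3, 100, 7.3);
INSERT INTO tests VALUES (100, 'pH'), (200, 'Glucose');
""")

# Each result row picks up its label from the lookup table via itemid,
# so the label is stored once per test rather than once per result.
joined = conn_demo.execute("""
SELECT results.row_id, tests.label, results.valuenum
FROM results
    INNER JOIN tests
        ON results.itemid = tests.itemid
ORDER BY results.row_id;
""").fetchall()

for row in joined:
    print(row)
```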

#### TODO
Finish the query below to get all lab results, along with the metadata about the tests, for hospital admission `28766`. Save the result as `labs_28766`.

In [None]:
query = """
SELECT * 
FROM ____
    ____ ____ d_labitems
        ON labevents.itemid = ____.____
WHERE ____ = 28766;
"""
____ = pd.read_sql(query, conn)
labs_28766.head()

In [None]:
# RUN CELL TO TEST VALUE
test_labs_28766.test(labs_28766)

Let's focus on a specific lab test. We'll look at the LOINC code [2345-7](https://loinc.org/2345-7/), which measures the amount of glucose in a patient's blood. This test is relevant for testing whether a patient has diabetes. Here is a description from the LOINC website:
***
<strong>
Glucose (C6H12O6) is a simple monosaccharide and monomer of carbohydrates. Glucose provides energy for cellular processes and aids metabolism within the body. When food is ingested, the carbohydrates within the food are broken down into glucose molecules. Blood glucose content is significant in determining an individual's overall state of health. An elevated blood glucose level is called hyperglycemia and a deficient blood glucose level is called hypoglycemia. When an individual is hyperglycemic and cannot properly regulate their blood glucose level they are considered diabetic. Type 1 diabetes is caused by the immune system attacking pancreatic beta cells (cells that produce insulin) and Type 2 diabetes is caused by insulin resistance. [MedlinePlus Encyclopedia:003482]
</strong>
***

Let's specifically analyze the results of this test and generate some descriptive statistics. 

We'll do this in stages: preview the matching rows, compute summary statistics in SQL, then pull a random sample into a DataFrame called `glucose` so we can describe it with `pandas` and plot it with Seaborn.

#### TODO
Join `labevents` and `d_labitems` and filter to rows where the LOINC code is **'2345-7'**. Limit to 10 rows to get a preview.

In [None]:
query = """
SELECT * 
FROM ____
    ____ ____ d_labitems
        ON ____
WHERE loinc_code = '____'
LIMIT 10;
"""
pd.read_sql(query, conn)


In [None]:
# RUN CELL TO SEE QUIZ
quiz_category_glucose

The numeric value of the test result is stored in the column `valuenum` (not `value`; what do you think the difference is?).

#### TODO
Select the `COUNT`, `MIN`, `MAX`, and `AVG` of `valuenum` for tests with LOINC code `2345-7`.

In [None]:
query = """
SELECT ____
    FROM labevents
    INNER JOIN d_labitems
        ON labevents.itemid = d_labitems.itemid
WHERE ____
"""
pd.read_sql(query, conn)

In [None]:
# RUN CELL TO SEE QUIZ
quiz_avg_glucose

Let's do some more detailed analysis using `pandas`. Because `labevents` is a big table, let's take a random sample of glucose tests. One way to take a random sample in SQL is to order the results by a random number using the `RAND()` function and then limit the results to the number of rows we want to sample:

```sql
ORDER BY RAND()
LIMIT k
```
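
To see the order-then-limit idea in isolation, here is a small sketch against an in-memory SQLite table. Two assumptions to note: the `measurements` table and its rows are made up, and SQLite spells the function `RANDOM()` rather than `RAND()`, so the query text differs slightly from the pattern above.

```python
import sqlite3

# Hypothetical toy table with 100 rows.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE measurements (id INTEGER, valuenum REAL)")
demo.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(i, i * 1.5) for i in range(100)],
)

# Shuffle the rows with a random sort key, then keep the first k of them.
k = 5
sample = demo.execute(
    "SELECT id, valuenum FROM measurements ORDER BY RANDOM() LIMIT ?",
    (k,),
).fetchall()

print(sample)  # a different set of k rows on each run
```

Because every row gets a random sort key before the limit is applied, each row appears at most once in the sample.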


#### TODO
Write a query which returns the `subject_id`, `hadm_id`, `valuenum`, and `flag` columns for a random sample of 1,000 glucose tests from MIMIC. Save the result as `glucose`.

In [None]:
query = """
SELECT subject_id, hadm_id, valuenum, flag
FROM labevents
    INNER JOIN d_labitems
        ON labevents.itemid = d_labitems.itemid
WHERE loinc_code = '2345-7'
ORDER BY ____
____ ____;
"""
glucose = pd.read_sql(query, conn)
glucose.head()

#### TODO
Earlier we used SQL to calculate some summary statistics of the `valuenum` for all 2345-7 tests. Now use `pandas` to calculate summary statistics for your random sample. Are they similar to the values of the entire table? 

In [None]:
# RUN CELL TO SEE HINT
hint_summary_glucose

#### TODO
Create a plot visualizing the results in your sample of glucose tests.

In [None]:
# RUN CELL TO SEE HINT
hint_viz_glucose

###  Flag attribute
Unless you're a clinician, the `valuenum` probably doesn't tell you much about the meaning of the test result. The `flag` column is there to tell us whether the test was outside the expected range. An `abnormal` value may be interpreted as a positive result.


#### TODO
The code below shows all distinct values for the `flag` column. What do you think a value of `None` means?

In [None]:
set(glucose["flag"])

In [None]:
# RUN CELL TO SEE QUIZ
quiz_none_glucose

### Replacing missing values
The `None` values above are examples of **missing values**. Missing values can mean different things, so you need to be careful about how you handle them. In this case, since the only values are `None` and `abnormal`, it's pretty clear that this column only contains a string value if the flag is **"abnormal"** and is `NULL` (the SQL equivalent of `None`) otherwise.

We will want to fill these nulls with the value **"normal"**. Let's do this first in SQL and then in Python.

### Replacing NULL with SQL
We can fill these null values in our SQL query by using the `coalesce` function. This will take the first non-null value in a list. So, for example,

`coalesce(null, 'world!')` would return 'world!', while `coalesce('hello,', null)` would return 'hello,'.
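
The sketch below shows the same behavior on a toy in-memory SQLite table, under the assumption that SQLite's `COALESCE` matches the description above. The `readings` table, its `status` column, and the fill value `'ok'` are all invented for illustration. It also shows that SQL `NULL` arrives in Python as `None`.

```python
import sqlite3

# Hypothetical toy table where some rows have no status recorded.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE readings (id INTEGER, status TEXT)")
demo.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, "high"), (2, None), (3, None)],
)

# Without COALESCE, NULLs come back as None.
raw = demo.execute("SELECT status FROM readings ORDER BY id").fetchall()

# With COALESCE, each NULL is replaced by the fallback value.
filled = demo.execute(
    "SELECT COALESCE(status, 'ok') FROM readings ORDER BY id"
).fetchall()

print(raw)
print(filled)
```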

#### TODO
What would `coalesce('hello', 'world')` return?

In [None]:
# RUN CELL TO SEE QUIZ
quiz_coalesce_helloworld = MultipleChoiceQuiz(answer="'hello'", options=["'hello'", "'world'", "NULL"])
quiz_coalesce_helloworld

#### TODO
Edit the query below so that SQL will return the value of the column `flag` if it is not null and will return `'normal'` otherwise. Take a random sample of 100 rows and save it as `glucose2`.

In [None]:
query = """
SELECT subject_id, hadm_id, valuenum, 
    ____(____, ____) AS flag
FROM labevents
    INNER JOIN d_labitems
        ON labevents.itemid = d_labitems.itemid
WHERE loinc_code = '2345-7'
____ __ RAND()
LIMIT 100;
"""
glucose2 = pd.read_sql(query, conn)

In [None]:
# RUN CELL TO TEST VALUE
test_glucose_coalesce.test(glucose2)

In [None]:
glucose2

### Replacing missing values in `pandas`
We can also fill in missing values directly in our dataframe. The method `Series.fillna(new_value)` returns a new series with all missing values filled in with `new_value`. We can then reassign the column to this new, non-missing column.
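
Here is a minimal sketch of that return-then-reassign pattern on a made-up DataFrame. The frame `toy`, its `status` column, and the fill value `"ok"` are placeholders, not part of the MIMIC data.

```python
import pandas as pd

# Invented toy data; `status` stands in for a column with missing entries.
toy = pd.DataFrame({"id": [1, 2, 3], "status": ["high", None, None]})

# fillna returns a new Series and leaves the original column untouched...
filled = toy["status"].fillna("ok")
missing_before = int(toy["status"].isna().sum())

# ...so the result has to be assigned back to the column to take effect.
toy["status"] = filled
missing_after = int(toy["status"].isna().sum())

print(missing_before, missing_after)
```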

#### TODO
Edit the code below so that all missing values in the `flag` column of `glucose` are filled in with `"normal"`.

In [None]:
glucose[____] = ____["flag"].fillna(____)

In [None]:
test_glucose_coalesce.test(glucose)

Now that we've filled in the missing values of `flag`, let's compare the distribution of `valuenum` between normal and abnormal results.

#### TODO
First calculate summary statistics of `glucose["valuenum"]` stratified by `flag`. Then create a visualization comparing the distributions in the two groups.

In [None]:
# RUN CELL TO SEE HINT
hint_glucose_value_by_flag

## II. Vital Signs
The `chartevents` table in MIMIC-II contains vitals measurements. The table `d_chartitems` defines what these measurements represent. Let's look at the first 25 vital measurement labels in alphabetical order:

In [None]:
query = """
SELECT
    DISTINCT d_chartitems.label
FROM d_chartitems
ORDER BY d_chartitems.label
LIMIT 25;
"""
df = pd.read_sql(query, conn)
df

Now let's query the first 100 rows from `chartevents` to see what some actual measurements look like:

In [None]:
query = """
SELECT *
FROM chartevents c
    INNER JOIN d_chartitems d
        ON c.itemid = d.itemid
LIMIT 100
"""
pd.read_sql(query, conn)

### Blood pressure
For our first analysis, let's focus on measurements of blood pressure. Here is a value set of `itemid` values that you can use for blood pressure: `(6, 51, 455, 6701)`. Let's first generate counts of how many times each of these is used.

### `WHERE column IN (...)`
Earlier, when we filtered our results to particular code values (like for pneumonia or a glucose test), we used individual codes. We now have four codes. One way we could do this is by using an `OR` in our `WHERE` statement:

```sql
WHERE itemid = 6
OR itemid = 51
OR itemid = 455
OR itemid = 6701
```

But a more concise way to do this would be to use the `IN` keyword, which checks if a value is in a list of values within parentheses:

```sql
WHERE itemid IN (6, 51, 455, 6701)
```
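
As a quick check that the two forms select the same rows, here is a sketch on a toy in-memory SQLite table. The `items` table and its contents are hypothetical; the real `d_chartitems` table is not involved.

```python
import sqlite3

# Hypothetical lookup table with four items.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE items (itemid INTEGER, label TEXT)")
demo.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
)

# Filter with chained OR conditions.
with_or = demo.execute(
    "SELECT itemid FROM items WHERE itemid = 1 OR itemid = 3 ORDER BY itemid"
).fetchall()

# Filter with IN and a list of values.
with_in = demo.execute(
    "SELECT itemid FROM items WHERE itemid IN (1, 3) ORDER BY itemid"
).fetchall()

print(with_or == with_in)
```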

#### TODO
Write and execute a query which returns the `itemid` and `label` columns from `d_chartitems` for each of the four `itemid` values in the value set above. Use the `IN` keyword in your query.

In [None]:
query = """
SELECT itemid, label
FROM d_chartitems
____ ____
"""
pd.read_sql(query, conn)

In [None]:
# RUN CELL TO SEE QUIZ
quiz_label_455 

In the next section, we'll pull a set of blood pressure measurements and analyze them. The **semantics** of this table aren't always clear, so we can refer to the MIMIC documentation for some explanation.

For these rows, the values which we're interested in are:
- `"value1num"` - this represents the systolic blood pressure
- `"value2num"` - this represents the diastolic blood pressure
#### TODO
Query the **first 5,000 rows** of blood pressure measurements. Note that this shouldn't be a random sample, just the first 5,000. 

Select the following columns and assign aliases as appropriate:
- `subject_id`
- `value1num` as `systolic_bp`
- `value2num` as `diastolic_bp`

Name it `bp`. 

In [None]:
query = """
SELECT 
    subject_id,
    ____ AS systolic_bp,
    ____ AS ____
FROM chartevents c
WHERE itemid IN (6, 51, 455, 6701)
LIMIT 5000
"""
bp = pd.read_sql(query, conn)

In [None]:
bp.head()

#### TODO
How related do you think diastolic and systolic blood pressure are? Create a visualization comparing the two values and come up with a quantitative measure of their relationship.


In [None]:
hint_systolic_v_diastolic

#### Discussion
Just like the lab results from earlier, some of the systolic and diastolic values are missing. With the glucose labs, we decided that missing values of `flag` meant that the test results were normal. What do you think is the most likely cause of blood pressure values being missing? What are some options for dealing with these missing values?

In [None]:
# RUN CELL TO SEE QUIZ
quiz_missing_bp

### Option 1: Drop rows with missing values
One thing we can do is simply drop any row that is missing a blood pressure measurement. We can do this using the `dropna()` method. You can specify the columns in which to look for missing values by using the `subset` argument:

```python
df = df.dropna(subset=column_name)
```
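
To show what `subset` changes, here is a small sketch on an invented DataFrame. The columns `a`, `b`, and `note` are placeholders; `note` is there only to show that missing values outside the subset do not cause a row to be dropped.

```python
import pandas as pd

# Invented toy data with missing values scattered across three columns.
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],
    "b": [10.0, 20.0, None, 40.0],
    "note": ["x", "y", "z", None],
})

# Only rows missing `a` or `b` are dropped; a missing `note` is ignored.
kept = toy.dropna(subset=["a", "b"])

# Without `subset`, a missing value in any column drops the row.
kept_any = toy.dropna()

print(len(toy), len(kept), len(kept_any))
```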

#### TODO
Create a new dataframe called `bp2` which has dropped any row which is missing `systolic_bp` or `diastolic_bp`.

In [None]:
bp2 = bp.____(subset=["____", "____"])

### Option 2: Imputing missing values with the mean
A second option is to fill in the missing values with the sample mean. This lets us keep these rows in our dataset instead of dropping them. However, make sure to note in your analysis how you are treating these missing values!
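
The general shape of mean imputation can be sketched on a toy column as follows. The DataFrame `toy` and its `reading` column are made up for illustration, and the filled values are written to a separate column here so the original stays visible for comparison.

```python
import pandas as pd

# Invented toy data: three observed readings and two missing ones.
toy = pd.DataFrame({"reading": [100.0, None, 120.0, None, 110.0]})

# Series.mean() skips missing values by default, so this is the mean
# of the observed readings only.
mean_reading = toy["reading"].mean()

# Replace each missing reading with that mean.
toy["reading_filled"] = toy["reading"].fillna(mean_reading)

print(mean_reading)
print(toy)
```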

#### TODO
Fill in missing rows of `systolic_bp` and `diastolic_bp` with their respective means. This is a similar process to what we did with the `flag` column above, but we need to first calculate the value we'll be using to replace missing values.

In [None]:
# ...

In [None]:
# Now check if there are any NaN's left (count of missing values per column)
bp[["systolic_bp", "diastolic_bp"]].isna().sum()