In [None]:
import pandas as pd
import pymysql
import getpass

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

In [None]:
# Let's connect to our database
username = "" # Enter your username here
conn = pymysql.connect(host="35.233.174.193",port=3306,
                       user=username,
                       passwd=getpass.getpass("Enter password for MIMIC2 database"),
                       db='mimic2')

# Labs
Lab tests are used for diagnostic purposes. In MIMIC, the lab measurements are stored in `labevents`. Let's look at the first 10 rows of this table:

In [None]:
query = """SELECT * FROM labevents LIMIT 10;"""
df = pd.read_sql(query, conn)
df.head()

Metadata about the tests, such as a LOINC code and description, are stored in a separate table called `d_labitems`. This is common in relational database modeling. Let's look at the first 10 rows of `d_labitems`. Note that there is information about the test, but no actual results.

In [None]:
query = """SELECT * FROM d_labitems LIMIT 10;"""
df = pd.read_sql(query, conn)
df.head()

To get the test metadata along with the test results, we can join these two tables together using the **"itemid"** column. 
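As a rough illustration of what a join does, here is a minimal pandas sketch on two made-up tables. The names `events`, `items`, and `joined`, and all of the values, are hypothetical stand-ins rather than MIMIC data; the real exercise below uses SQL instead.

```python
import pandas as pd

# Toy stand-ins for labevents and d_labitems (made-up values, not MIMIC data)
events = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "itemid": [100, 200, 100],
    "valuenum": [95.0, 4.1, 180.0],
})
items = pd.DataFrame({
    "itemid": [100, 200, 300],
    "test_name": ["GLUCOSE", "POTASSIUM", "SODIUM"],
})

# An inner join keeps only the rows whose itemid appears in both tables,
# so every result row carries both the measurement and its metadata
joined = events.merge(items, on="itemid", how="inner")
print(joined)
```

Note that "SODIUM" does not appear in the result, because no row in `events` refers to its `itemid`.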

### TODO
Join the `labevents` and `d_labitems`. Select the top 10 rows.

In [None]:
query = """
SELECT * 
FROM labevents
    ____ ____ d_labitems
        ON labevents.itemid = ____.itemid
LIMIT 10;
"""
df = pd.read_sql(query, conn)
df.head()

Let's focus on a specific lab test. We'll look at the LOINC code [2345-7](https://loinc.org/2345-7/), which measures the amount of glucose in a patient's blood. This test is relevant for testing whether a patient has diabetes. Here is a description from the LOINC website:
***
<strong>
Glucose (C6H12O6) is a simple monosaccharide and monomer of carbohydrates. Glucose provides energy for cellular processes and aids metabolism within the body. When food is ingested, the carbohydrates within the food are broken down into glucose molecules. Blood glucose content is significant in determining an individual's overall state of health. An elevated blood glucose level is called hyperglycemia and a deficient blood glucose level is called hypoglycemia. When an individual is hyperglycemic and cannot properly regulate their blood glucose level they are considered diabetic. Type 1 diabetes is caused by the immune system attacking pancreatic beta cells (cells that produce insulin) and Type 2 diabetes is caused by insulin resistance. [MedlinePlus Encyclopedia:003482]
</strong>
***

Let's specifically analyze the results of this test and generate some descriptive statistics. 

### TODO
1. Join `labevents` and `d_labitems` and filter to rows where the LOINC code is **'2345-7'**. Limit to 10 rows to get a preview
2. Using SQL, select the **minimum**, **maximum**, and **average** values of this test
3. Using Python, select the first 10,000 rows. Call the resulting DataFrame `glucose`
4. Generate descriptive statistics of the DataFrame
5. Generate a box plot with Seaborn

In [None]:
# Join labevents and d_labitems and filter to rows where the LOINC code is '2345-7'. 
# Limit to 10 rows to get a preview
query = """
SELECT * 
FROM labevents
    ____ ____ d_labitems
        ON ____.itemid = ____._____
WHERE _____ = _____
LIMIT 10;
"""
df = pd.read_sql(query, conn)
df.head()

In [None]:
# Using SQL, select the minimum, maximum, and average values of this test
query = """
SELECT 
    _____,
    _____,
    _____
FROM labevents
    INNER JOIN d_labitems
        ON labevents.itemid = d_labitems.itemid
____ loinc_code = '2345-7';"""
pd.read_sql(query, conn)

In [None]:
# Using Python, select the first 10,000 rows. Call the resulting DataFrame glucose
query = """
SELECT * 
FROM labevents
    _____
        _____
_____ _____
LIMIT 10000;
"""
glucose = pd.read_sql(_____, conn)
glucose.head()

In [None]:
# Generate descriptive statistics
glucose["valuenum"].____()

In [None]:
# Create a boxplot
sns.boxplot(____)

## Flag attribute

A lab's value often doesn't mean much on its own. Sometimes we're mainly interested in whether or not a lab result is within the expected range. If it's outside of the range, then this might be an indicator that something is wrong. We can look at the **"flag"** attribute in `labevents` to see if it is abnormal or normal. 

Let's take a look at the values in the **"flag"** column of the DataFrame `glucose`:

In [None]:
glucose.groupby("flag").size()

In [None]:
len(glucose)

### Discussion
Note that the only value in this column is **"abnormal"**, and only 6,773 / 10,000 rows have this value. What about the other rows? How can we know which rows are normal?

## Replacing NULL values
This column only contains a string value if the flag is **"abnormal"**. Otherwise, the column is left blank. We may want to fill these nulls with the value **"normal"**. We can do this with either SQL or Python. Here we'll use SQL; we'll see a Python example later.
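The following is a small sketch, on made-up data, of why the blank rows did not show up in the counts: pandas skips missing group keys by default. The DataFrame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy flag column: None plays the role of a SQL NULL (made-up values)
toy = pd.DataFrame({"flag": ["abnormal", None, "abnormal", None, None]})

# groupby skips missing keys by default, so only "abnormal" is counted
counts_default = toy.groupby("flag").size()

# Counting the missing values explicitly shows where the other rows went
n_missing = toy["flag"].isna().sum()
counts_all = toy["flag"].value_counts(dropna=False)

print(counts_default)
print(n_missing)
print(counts_all)
```

Comparing the two sets of counts makes it clear that the rows are still in the DataFrame; they simply have no value in `flag`.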

### Replacing NULL with SQL
We can fill these null values in our SQL query by using the `coalesce` function. This will take the first non-null value in a list. So, for example,

`coalesce(null, 'world!')` would return 'world!', while `coalesce('hello', null)` would return 'hello'.
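To try `COALESCE` without touching the course database, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table `toy_people` and its columns are hypothetical, and `COALESCE` is assumed to behave the same way in both databases for this simple case.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for the MySQL server
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_people (id INTEGER, nickname TEXT, name TEXT)")
con.executemany(
    "INSERT INTO toy_people VALUES (?, ?, ?)",
    [(1, "Al", "Albert"), (2, None, "Barbara")],
)

# COALESCE returns the nickname when it is not NULL, and the name otherwise
rows = con.execute(
    "SELECT id, COALESCE(nickname, name) FROM toy_people ORDER BY id"
).fetchall()
print(rows)
con.close()
```

The arguments to `COALESCE` can be column names, literal values, or a mix of both.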

### Discussion
What would `coalesce('hello', 'world')` return?

### TODO
Change the query below so that SQL will return the value of the column `flag` if it is not null and will return `'normal'` otherwise.

In [None]:
query = """
SELECT labevents.subject_id,
    hadm_id,
    valuenum,
    COALESCE(____, ____) AS 'flag',
    labevents.valueuom AS 'units',
    d_labitems.test_name,
    d_labitems.loinc_code,
    d_labitems.loinc_description
FROM labevents
    INNER JOIN d_labitems
    ON labevents.itemid = d_labitems.itemid
WHERE loinc_code = '2345-7'
LIMIT 1000
"""
glucose2 = pd.read_sql(query, conn)
glucose2.head()

In [None]:
glucose2.groupby("flag").size()

### TODO
Create two plots:
1. Generate a histogram of the `flag` column of `glucose2` using either Pandas or Seaborn
2. Generate a boxplot of the `valuenum` column stratified by flag (**hint:** remember when we stratified patient age of death by gender?)

In [None]:
____

In [None]:
ax = sns.____(x=____, ____=____, data=glucose2, order=['abnormal', 'normal'])

## Aggregate functions
Note that the tests above include multiple values for the same patient, taken a few hours apart. It might be useful to group all of a single patient's values together and summarize them. Let's use aggregate functions to determine the min, max, and average values for each patient.
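For comparison, the same idea can be sketched in pandas on made-up data. The DataFrame `toy`, its values, and the name `summary` are hypothetical; the exercise below asks for the SQL version.

```python
import pandas as pd

# Made-up glucose readings: three for patient 1, two for patient 2
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "valuenum": [90.0, 150.0, 120.0, 200.0, 100.0],
})

# Collapse each patient's readings into one row of summary statistics
summary = (
    toy.groupby("subject_id")["valuenum"]
       .agg(min_value="min", max_value="max", avg_value="mean")
       .reset_index()
)
print(summary)
```

Each aggregate turns many rows per patient into a single number, which is why the result has one row per `subject_id`.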

### TODO
Write a query which retrieves lab results for the LOINC code '2345-7' and groups the results together by **subject_id**. Calculate the minimum, maximum, and average values for each patient and name them 'min_value', 'max_value', and 'avg_value'.

In [None]:
query = """
SELECT 
    labevents.subject_id,
    ____(valuenum) as 'min_value',
    MAX(valuenum) as ____,
    ____(____) as ____
FROM labevents
    INNER JOIN d_labitems
    ON labevents.itemid = d_labitems.itemid
WHERE loinc_code = '2345-7'
GROUP BY subject_id
LIMIT 100
"""
df = pd.read_sql(query, conn)
df.head()

### Bonus
Plot the 'avg_value' column from the dataframe above. What kind of distribution does the lab test have?

In [None]:
# histplot replaces the deprecated distplot in recent versions of Seaborn
sns.histplot(df["avg_value"], kde=True)

# Vital Signs
As we saw in Week 4, vital signs are taken frequently in medical visits. In a setting such as the ICU, vital signs will be monitored constantly in order to quickly detect and alert if anything is wrong.

The `chartevents` table in MIMIC-II contains vitals measurements. Just like lab values, metadata about the measurements are stored separately. The table `d_chartitems` defines what these measurements are. Let's look at what the first 25 alphabetical vital measurements are:

In [None]:
query = """
SELECT 
    DISTINCT d_chartitems.label
FROM d_chartitems
ORDER BY d_chartitems.label
LIMIT 25;
"""
df = pd.read_sql(query, conn)
df

Now let's query the first 1000 rows from `chartevents` to see what some actual measurements look like:

In [None]:
query = """
SELECT *
FROM mimic2.chartevents
    INNER JOIN d_chartitems ON chartevents.itemid = d_chartitems.itemid
LIMIT 1000;
"""
df = pd.read_sql(query, conn)
df.head()

## Blood pressure
Let's look at some measurements for blood pressure. I checked beforehand and found four tests which we could use. Their IDs in `d_chartitems` are (6, 51, 455, 6701). Let's look at what these tests are:

In [None]:
# Blood pressure
query = """
SELECT *
FROM d_chartitems
WHERE itemid IN (6, 51, 455, 6701);
"""
df = pd.read_sql(query, conn)
df.head()

Again, let's query `chartevents` to see what these measurements actually look like:

In [None]:
# Blood pressure
query = """
SELECT *
FROM mimic2.chartevents
    INNER JOIN d_chartitems 
        ON chartevents.itemid = d_chartitems.itemid
WHERE d_chartitems.itemid in (6, 51, 455, 6701)
LIMIT 5;
"""
df = pd.read_sql(query, conn)
df.head()

The **semantics** of this table aren't always clear, so we can refer to the MIMIC documentation for some explanation.

The values which we're interested in here are:
- `"value1num"` - this represents the systolic blood pressure
- `"value2num"` - this represents the diastolic blood pressure

### TODO
Write a query for these blood pressure measurements and assign aliases to these two values:
- `"value1num"` should be called "systolic_bp"
- `"value2num"` should be called "diastolic_bp"

In [None]:
# Blood pressure
query = """
SELECT
    subject_id, 
    icustay_id, 
    charttime, 
    value1num as ____,
    ____ as ____,
    label
FROM mimic2.chartevents
    INNER JOIN d_chartitems 
        ON chartevents.itemid = d_chartitems.itemid
WHERE d_chartitems.itemid in (6, 51, 455, 6701)
LIMIT 1000;
"""
df = pd.read_sql(query, conn)

In [None]:
df.head(10)

## Missing values
### Discussion
Some measurements are 0. Others are null. What do you think this means? What should we do with these rows?


### Dealing with missing values in Python
Earlier, we replaced `NULL` values using SQL. Let's look at some alternative ways to deal with this in Python.

We can see which rows contain NULL values for a column by using **boolean indexing** with the `isna()` method of a Pandas Series:

In [None]:
df[df["systolic_bp"].isna()]

### Option 1: Drop rows with missing values
Use the `dropna()` method to drop rows with missing values. You can specify the columns in which to look for missing values by using the `subset` argument (the default is to drop any row with **any** missing value).
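One caveat is that `dropna()` only removes true missing values; a reading stored as 0 is kept. The sketch below, on made-up data, shows one possible way to handle both. It assumes that a blood pressure of 0 is a placeholder rather than a real measurement, which is a judgment call; the DataFrame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up readings: one NaN and one 0 in systolic, one NaN in diastolic
toy = pd.DataFrame({
    "systolic_bp": [120.0, np.nan, 0.0, 135.0],
    "diastolic_bp": [80.0, 70.0, 60.0, np.nan],
})

# dropna only removes rows with missing values; the row with 0.0 survives
only_na_dropped = toy.dropna(subset=["systolic_bp", "diastolic_bp"])

# Assumption: 0 is a placeholder, so convert it to NaN before dropping
cleaned = toy.replace(0, np.nan).dropna(subset=["systolic_bp", "diastolic_bp"])

print(only_na_dropped)
print(cleaned)
```

Whether zeros should be dropped, kept, or imputed depends on what they mean in the source data, so it is worth checking the documentation before choosing.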

In [None]:
# Option 1: Drop rows with NA values (rows with a value of 0 are kept)
df2 = df.dropna(subset=["systolic_bp", "diastolic_bp"])

### Option 2: Fill missing values
The second option is to fill missing values with some calculated value from the column, such as the mean. This is called **data imputation** and is a common solution for when you don't want to throw out rows due to a missing value.

- Calculate the mean value of **"systolic_bp"**. Save this as `systolic_mean`
- Call `df["systolic_bp"].fillna()`. This returns a new Series with the missing values replaced with `systolic_mean`. Save this as `systolic_no_na`
- Assign this value to the **"systolic_bp"** column

In [None]:
# Option 2: Fill with mean (one-line version; the next three cells do the same thing step by step)
df["systolic_bp"] = df["systolic_bp"].fillna(df["systolic_bp"].mean())

In [None]:
systolic_mean = df["systolic_bp"].mean()

In [None]:
systolic_no_na = df["systolic_bp"].fillna(systolic_mean)

In [None]:
df["systolic_bp"] = systolic_no_na

Now let's check whether any rows still have missing values in **"systolic_bp"**. The result should be empty.

In [None]:
df[df["systolic_bp"].isna()]

### TODO
Repeat this step with **"diastolic_bp"**.

## Plotting Vitals
Now, let's plot these variables.

### TODO
Plot the distribution of systolic and diastolic blood pressures side-by-side. I've created two subplots next to each other. Call the `hist` method on the appropriate columns of the DataFrame.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True)
df["systolic_bp"].____(ax=ax1)
df[____].____(ax=ax2)

## Correlation
Let's look at how these two readings are **correlated** with one another. The correlation of two variables measures how dependent the two variables are on one another: it tells us how closely related they are.

We can do this in two ways. First, we'll plot a **scatterplot** which will allow us to visualize the relationship between one variable (systolic blood pressure) and another (diastolic). Next, we can calculate the correlation coefficient of the two variables by using the `corr` method of the columns in the dataframe.
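To make the coefficient less of a black box, here is a small sketch on made-up readings that computes the Pearson correlation by hand and compares it with the `corr` method (which uses Pearson by default). The DataFrame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up blood pressure readings (not MIMIC data)
toy = pd.DataFrame({
    "systolic_bp": [110.0, 120.0, 130.0, 140.0, 150.0],
    "diastolic_bp": [70.0, 78.0, 80.0, 88.0, 95.0],
})

# Built-in Pearson correlation between the two columns
r_pandas = toy["diastolic_bp"].corr(toy["systolic_bp"])

# The same quantity by hand: center each column, then divide the sum of
# products by the square root of the product of the sums of squares
x = toy["diastolic_bp"] - toy["diastolic_bp"].mean()
y = toy["systolic_bp"] - toy["systolic_bp"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```

The coefficient always falls between -1 and 1. Values near 1 mean the two variables tend to rise together, values near -1 mean one tends to fall as the other rises, and values near 0 mean there is little linear relationship.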

### TODO
Call the function `sns.scatterplot`. Plot 'diastolic_bp' on the x axis and 'systolic_bp' on the y axis.

In [None]:
sns.scatterplot(x=____, __=____, data=df)

### TODO
Call the `.corr()` method on `df['diastolic_bp']`. Pass in `df['systolic_bp']` as an argument.

In [None]:
df['diastolic_bp'].____(____)

### Discussion
Look at the scatterplot of the two blood pressure readings and the correlation coefficient returned by `.corr()`. What does this tell us about the relationship between these two variables?

# Next Steps
For homework, complete the following notebook. When you're done, save it as an HTML file and submit it via Canvas:

[./03-homework.ipynb](./03-homework.ipynb)