In [None]:
import pandas as pd
import pymysql
import getpass

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

In [None]:
# Let's connect to our database
username = "" # Enter your username here
conn = pymysql.connect(host="35.233.174.193",port=3306,
                       user=username,
                       passwd=getpass.getpass("Enter password for MIMIC2 database"),
                       db='mimic2')

# 01. Administrative and demographic data
As we discussed in Week 4, administrative and demographic data describe general information about the patient. This data includes:
- Name
- Sex
- Date of birth
- Insurance information

In MIMIC, patient data is stored in a table called `d_patients`. Additional demographic data is stored in a table called `demographic_detail`.

See this page in the MIMIC-II guide for more information about patient entities in the database:
https://mimic.mit.edu/archive/mimic-ii-guide.pdf#page=24

## `d_patients`

Let's first select all columns (`SELECT *`) for the first 5 patients in `d_patients`. What columns are returned? What do they represent? In other words, what are the **semantics** of the data?

In [None]:
query = """
SELECT * 
FROM d_patients 
LIMIT 5;
"""
df = pd.read_sql(query, conn)
df

We can look at data for a specific patient by using a `WHERE` clause to filter to a specific `subject_id`:

In [None]:
query = """
SELECT * 
FROM d_patients 
WHERE subject_id = 31;
"""
df = pd.read_sql(query, conn)
df

## `demographic_detail`

`d_patients` contains just a few of the attributes for patients in the MIMIC database. A number of other attributes are stored in the `demographic_detail` table.

#### TODO
Select the first 10 rows from the `demographic_detail` table. Discuss the columns that are returned.

In [None]:
query = """

"""
df = pd.read_sql(query, conn)
df

## Joining tables
In a relational database like MIMIC, different attributes for entities are stored in different tables. These tables can then be combined in a query using a `JOIN` clause. The column `subject_id`, which is the identifier for a patient, is consistent between these two tables and can be used to join them together.
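
The same idea can be sketched in pandas with `merge`. This is only an illustration on made-up data: the tables `toy_patients` and `toy_details` and the `marital_status` values are hypothetical and are not taken from MIMIC.

```python
import pandas as pd

# Hypothetical toy tables (not MIMIC data) that share a subject_id key
toy_patients = pd.DataFrame({"subject_id": [1, 2, 3], "sex": ["F", "M", "F"]})
toy_details = pd.DataFrame(
    {"subject_id": [1, 3, 4], "marital_status": ["MARRIED", "SINGLE", "WIDOWED"]}
)

# An inner join keeps only the subject_ids that appear in both tables
joined = toy_patients.merge(toy_details, on="subject_id", how="inner")
print(joined)
```

Subjects 2 and 4 drop out of the result because each appears in only one of the two toy tables, which is the behavior to expect from an inner join.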

#### TODO
Join the demographics and patients tables using the `subject_id` column in both as the joining key. Select the **first 10** rows.

In [None]:
query = """
SELECT * 
FROM d_patients
    INNER JOIN ____ 
        ON ____.subject_id = ____.____
____ 10;
"""
df = pd.read_sql(query, conn)
df

# Analyzing administrative and demographic data
Now that we know what data we have, let's perform some analysis using these two tables. 

## Sex
Let's compare the number of male vs. female patients. We can do this in two ways:
1. **Pandas**: query all of the rows from `d_patients` and then use pandas to generate counts and plots
2. **SQL**: using a `GROUP BY` query to get the counts of rows with male and female patients

### 1. Pandas

In [None]:
query = """
SELECT * 
FROM d_patients;
"""
patients = pd.read_sql(query, conn)
print(len(patients))
patients.head()

In [None]:
patients.groupby("sex").size()

In [None]:
patients.groupby("sex").size().plot.bar()

### 2. SQL

In [None]:
query = """
SELECT 
    sex, 
    COUNT(*) 
FROM d_patients
GROUP BY sex;
"""
df = pd.read_sql(query, conn)
df

## Age at death

Let's say that we want to know what age patients were when they died. This will take a little more effort: there is no column containing this attribute, so we'll have to calculate it using the columns which are there.

#### DISCUSSION
What are the relevant columns in either `d_patients` or `demographic_detail` which will allow us to calculate how old a patient was when they died?

We'll again do this in two different ways: first using SQL and then pandas.

### 1. SQL
Just as we use functions in Python, SQL offers functions for performing common operations in our queries. One of these is `DATEDIFF(a, b)`, which subtracts the date `b` from the date `a` and returns the difference in days. Dividing that number of days by 365 gives an approximate value in years.
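
To make the arithmetic concrete, here is a small Python sketch of the same calculation. The helper `datediff_days` is a hypothetical stand-in written for this example, and the dates are made up.

```python
from datetime import date

def datediff_days(end, start):
    """Hypothetical helper mirroring DATEDIFF(end, start): days from start to end."""
    return (end - start).days

# The first argument is the later date; swapping the arguments flips the sign
days = datediff_days(date(2000, 1, 11), date(2000, 1, 1))
print(days)

# Dividing by 365 ignores leap days, so the result is slightly more than 70
approx_years = datediff_days(date(2020, 1, 1), date(1950, 1, 1)) / 365
print(approx_years)
```

Because of leap years, dividing by 365 slightly overstates the age. That is usually acceptable for descriptive statistics, but it is worth keeping in mind when reading the results.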

#### TODO
Edit the query below to extract the relevant columns and to generate a new column called `age_at_death` which contains the difference:

In [None]:
query = """
SELECT 
    subject_id, 
    sex, 
    dob, 
    dod,
    DATEDIFF(___, ___) / 365  AS '___'
FROM mimic2.d_patients
LIMIT 100;
"""
df = pd.read_sql(query, conn)
df

We can sort the DataFrame by the **"age_at_death"** column to see both the oldest and youngest patients among the rows returned. To sort a DataFrame based on a column, we use the `df.sort_values()` method. We'll pass in the following arguments:
- `by`: The name of the column to use for sorting
- `ascending`: Whether to sort in order of lowest to highest. Default is `True`

So, to get the 5 youngest patients, we'll use `sort_values` and then call the `head` method to see the first few rows:

In [None]:
df.sort_values("age_at_death", ascending=True).head()

#### TODO
Show the 5 oldest patients who died in the hospital.

In [None]:
df.sort_values("___", ascending=___).head()

In addition to looking at individual patients based on this value, we can do some analysis at a population level by calculating descriptive statistics for this attribute. Let's calculate the max, min, and average ages. We can use aggregate functions to do this.
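
As a rough illustration of how aggregate functions behave, the sketch below runs them against an in-memory SQLite database. SQLite is used here only so the example is self-contained; the table `toy_stays` and its values are hypothetical and unrelated to MIMIC.

```python
import sqlite3

# In-memory SQLite database with a hypothetical toy table (not MIMIC data)
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE toy_stays (stay_id INTEGER, length_of_stay REAL)")
toy_conn.executemany(
    "INSERT INTO toy_stays VALUES (?, ?)",
    [(1, 2.0), (2, 5.0), (3, 11.0)],
)

# Each aggregate function reduces the whole column to a single value
row = toy_conn.execute(
    "SELECT COUNT(*), MAX(length_of_stay), MIN(length_of_stay), AVG(length_of_stay) "
    "FROM toy_stays"
).fetchone()
print(row)
toy_conn.close()
```

The query returns a single row no matter how many rows the table holds, which is what distinguishes an aggregate from an ordinary column expression.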

#### TODO
Edit the query below to calculate the max, min, and average ages of death.

In [None]:
query = """
SELECT
    COUNT(1) as 'number_of_patients',
    ___(DATEDIFF(dod, dob) / 365) AS 'max_age_at_death',
    MIN(DATEDIFF(dod, dob) / 365)  AS '___',
    ___(___(dod, dob) / 365)  AS '___'
FROM mimic2.d_patients
"""
df = pd.read_sql(query, conn)
df

### 2. pandas
Now, let's use Python to do something similar. Earlier, we queried the entire `d_patients` table and stored it as a DataFrame called `patients`. Let's use this DataFrame to calculate a new column called **"age_at_death"** and then analyze it using pandas.

In [None]:
# Here is the DataFrame we created earlier containing all the rows from d_patients
patients.head()

#### TODO
Subtract **"dob"** from **"dod"**. Save the result as a variable called `days_at_death`.

In [None]:
days_at_death = patients[___] - ___["dob"]
days_at_death.head()

Let's take a look at what this column contains by accessing the first row and looking at its value. This is a different datatype than we're used to:

In [None]:
delta = days_at_death.iloc[0]
delta

In [None]:
type(delta)

This object represents the time span between **"dob"** and **"dod"**. We can get the number of days as an integer from its `days` attribute:

In [None]:
delta.days
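
For a feel of how this works outside the patient data, here is a minimal sketch with two made-up timestamps. The names `start`, `end`, and `toy_delta` are chosen only for this example.

```python
import pandas as pd

# Hypothetical dates, chosen only for illustration
start = pd.Timestamp("2000-01-01")
end = pd.Timestamp("2000-01-11 12:00")

# Subtracting two Timestamps produces a pandas Timedelta
toy_delta = end - start
print(type(toy_delta).__name__)

# .days holds the whole days only; the extra 12 hours are not included
print(toy_delta.days)
```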

#### TODO
Write a function `delta_to_years` which takes a `Timedelta` object, gets the number of days, and then returns that time span in years.

In [None]:
def ____(delta):
    ____

Now, we can use the pandas method `apply` to run this function on all of the rows in `days_at_death`. This will return a new Series whose values are the result of running `delta_to_years` on each row of `days_at_death`.
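
The pattern looks like this on a small made-up Series. The function `days_to_weeks` is a hypothetical helper used only to show how `apply` is called.

```python
import pandas as pd

def days_to_weeks(n_days):
    """Hypothetical helper: convert a number of days to weeks."""
    return n_days / 7

toy_days = pd.Series([7, 14, 35])

# Pass the function itself (no parentheses); pandas calls it once per value
toy_weeks = toy_days.apply(days_to_weeks)
print(toy_weeks.tolist())
```

Note that `apply` receives the function object, not the result of calling it, so the function name is written without parentheses.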

#### TODO
Pass the name of our new function as an argument to `days_at_death.apply`.

In [None]:
years_at_death = days_at_death.apply(____)

Now, finally, save this new computed series as a column in `patients`. We can then use the `describe` method to get descriptive statistics of this column:

In [None]:
patients[____] = years_at_death

In [None]:
patients.head()

In [None]:
patients["age_at_death"].describe()

## Plotting age at death

Now, let's use some additional Python libraries to plot this data in a histogram. Last week, we used a mix of `pandas`, `matplotlib`, and `seaborn` to plot BMI measurements. We'll now use some of those same methods to analyze the age of patients' death in MIMIC:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
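# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar figure
ax = sns.histplot(patients['age_at_death'], kde=True)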

In [None]:
# We can also use a boxplot:
ax = sns.boxplot( y='age_at_death', data=patients)

# Combining attributes
In this notebook, we looked at two patient attributes: **sex** and **age at death**. Let's now combine these two variables to analyze whether the age at death differs between male and female patients. We'll first do this by calculating descriptive statistics, then we'll create some visualizations to aid our analysis.

#### TODO
Call `patients.groupby()` to group the patients table by **"sex"**. Then, call the `describe` method to get descriptive statistics about the **"age_at_death"** column.  

In [None]:
patients[["sex", "age_at_death"]].____(____).describe()

#### TODO
Plot a boxplot using the `sns.boxplot` method. Like before, we'll plot **"age_at_death"** as the y variable. But we can break it up by sex by plotting **"sex"** as the x-axis variable.

In [None]:
ax = sns.____(_='age_at_death', x=____, data=patients, order=['F', 'M'])

We can also use pandas to break the female and male datapoints into two histograms:

In [None]:
_ = patients.hist('age_at_death', by='sex', sharey=True, sharex=True)

### Discussion
Looking at these statistics and the two plots we generated, what can you say about the difference between the age of death for men and women?

# Next Steps
In our next notebook, we will look at lab and vitals measurements in MIMIC.

[02-labs_vitals.ipynb](./02-labs_vitals.ipynb)