Let's begin by loading our three main libraries: pandas, plotnine, and sqlite3:

In [0]:
import pandas as pd
import plotnine as p9
import sqlite3

Let's go ahead and mount Google Drive to get easy access to the course data:

In [0]:
# mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Let's create a database.

In [0]:
conn = sqlite3.connect('exercises.db')

Now populate the database with data from MIMIC and the Pima Indian diabetes dataset:

In [0]:
for filename, table in [
          ("mimic_iii/PATIENTS.csv", "patients"),
          ("mimic_iii/DIAGNOSES_ICD.csv", "diagnoses"),
          ("mimic_iii/D_ICD_DIAGNOSES.csv", "d_diagnoses"),
          ("diabetes.csv","diabetes")
]:
  data = pd.read_csv(f'/content/gdrive/My Drive/[YCMI_CBDS Summer Course] Data/{filename}')
  data.to_sql(table, conn, if_exists='replace', index=False)

## Exercise 1
Do a SQL query using `pd.read_sql_query` to get all the gender information from the `patients` table and plot it in a bar chart:

If all went well, you should have reproduced a figure from Monday. Next, select the `gender` and `icd9_code` information from the combination of the `patients` and `diagnoses` tables where their `subject_id`s coincide:

Your data frame should have 1761 rows and 2 columns.
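As a rough sketch of the mechanics only, the cell below runs the same kind of two-table query against a tiny in-memory database. The connection name `toy_conn` and every row in it are invented for illustration, so the row count will not match the course data.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for the course tables; all rows are invented.
toy_conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"subject_id": [1, 2, 3], "gender": ["F", "M", "F"]}
).to_sql("patients", toy_conn, index=False)
pd.DataFrame(
    {"subject_id": [1, 1, 2, 4], "icd9_code": ["c1", "c2", "c1", "c3"]}
).to_sql("diagnoses", toy_conn, index=False)

# Keep only the row pairs whose subject_id matches in both tables.
query = """
    SELECT patients.gender, diagnoses.icd9_code
    FROM patients, diagnoses
    WHERE patients.subject_id = diagnoses.subject_id
"""
toy_joined = pd.read_sql_query(query, toy_conn)
print(toy_joined.shape)  # (rows, columns)
```

Checking `.shape` this way is a quick test of whether your own result has the expected number of rows and columns.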

Now plot just the `icd9_code` information in a bar graph:

Whoa, that's way too many codes. Use `.value_counts()` to find out how many distinct `icd9_code` values there are.

Compare the `.value_counts()` to 10 and `sum` the result to find the number of diagnoses that appeared more than 10 times. Repeat for more than 20.
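The counting trick can be sketched on a small invented Series. The threshold of 1 is used here only because the toy data is tiny; swap in 10 or 20 for the real column.

```python
import pandas as pd

# Invented codes, only to show the mechanics.
toy_codes = pd.Series(["A", "A", "A", "B", "B", "C"], name="icd9_code")

counts = toy_codes.value_counts()  # one entry per distinct code
n_distinct = len(counts)           # number of distinct codes

# Comparing gives a boolean Series; summing it counts the True values.
n_frequent = (counts > 1).sum()

print(n_distinct, n_frequent)
```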

Add this additional condition to your `WHERE` clause with an `AND` to extract the data only for diagnoses given more than 20 times:
```
icd9_code IN (
    SELECT icd9_code
    FROM diagnoses
    GROUP BY icd9_code
    HAVING COUNT(icd9_code) > 20
)
```

The `SELECT` inside the parentheses constructs a one-column table listing only those `icd9_code` values that appear more than 20 times.
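To see the subquery at work, here is a minimal sketch on an invented in-memory database. The tables, rows, and the name `toy_conn` are assumptions for illustration, and the threshold is lowered to 2 so that the small toy data has something to filter.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for the course tables; all rows are invented.
toy_conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"subject_id": [1, 2, 3, 4], "gender": ["F", "M", "F", "M"]}
).to_sql("patients", toy_conn, index=False)
pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4, 1],
        "icd9_code": ["c1", "c1", "c1", "c2", "c2"],
    }
).to_sql("diagnoses", toy_conn, index=False)

# The inner SELECT lists the codes that occur more than twice;
# the outer query keeps only rows whose code is in that list.
query = """
    SELECT patients.gender, diagnoses.icd9_code
    FROM patients, diagnoses
    WHERE patients.subject_id = diagnoses.subject_id
      AND diagnoses.icd9_code IN (
        SELECT icd9_code
        FROM diagnoses
        GROUP BY icd9_code
        HAVING COUNT(icd9_code) > 2
      )
"""
toy_common = pd.read_sql_query(query, toy_conn)
print(toy_common)
```

In this toy data `c1` appears three times and `c2` only twice, so only the `c1` rows survive.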

Using this smaller dataset, make a bar chart comparing the frequencies of various `icd9_code`s:

Apply a `fill` aesthetic argument, based on the `gender`:



It's clear that `icd9_code` 99592 is assigned to males more than to females, but what about `icd9_code` 2859? It's hard to tell with stacked bar charts. Recall that we can make the colors appear side by side by passing `position='dodge'` to `p9.geom_bar()`. Do that:

We now see clearly that more females got diagnosis code 2859 than males. What do these codes mean, though? To find out, redo your SQL query, this time also pulling in the `short_title` field from `d_diagnoses` where `d_diagnoses.icd9_code = diagnoses.icd9_code`.

Check the query results to make sure that they look like what you expect. Strictly speaking, you don't need to have `icd9_code` in your table, but I kept it in, giving me three columns and 426 rows.
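The three-table version can be sketched the same way on invented data. The titles, codes, and the name `toy_conn` below are placeholders for illustration and have nothing to do with real ICD-9 descriptions.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for the course tables; all rows are invented.
toy_conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"subject_id": [1, 2], "gender": ["F", "M"]}
).to_sql("patients", toy_conn, index=False)
pd.DataFrame(
    {"subject_id": [1, 2, 2], "icd9_code": ["c1", "c1", "c2"]}
).to_sql("diagnoses", toy_conn, index=False)
pd.DataFrame(
    {"icd9_code": ["c1", "c2"], "short_title": ["title one", "title two"]}
).to_sql("d_diagnoses", toy_conn, index=False)

# Two matching conditions: patient to diagnosis, then diagnosis to its title.
query = """
    SELECT patients.gender, diagnoses.icd9_code, d_diagnoses.short_title
    FROM patients, diagnoses, d_diagnoses
    WHERE patients.subject_id = diagnoses.subject_id
      AND d_diagnoses.icd9_code = diagnoses.icd9_code
"""
toy_titled = pd.read_sql_query(query, toy_conn)
print(toy_titled.shape)  # (rows, columns)
```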

Now plot it, putting the `short_title` on the x-axis.

Remember that adding `p9.theme(axis_text_x=p9.element_text(angle=90))` will rotate the x-axis labels by 90 degrees, making things easier to read. Do that:

Visually, you can now quickly find the names of the four common diagnosis codes that are assigned to females more than to males.

## Exercise 2

Get only the `Pregnancies`, `Glucose`, `BMI`, and `Outcome` columns from the `diabetes` table (each row represents a single observation of a patient).

`Pregnancies` is the number of pregnancies for each subject. We want to create a new variable, `Pregnancy`, that is True if the patient has ever been pregnant and False if not.
We also want to recode `Outcome`, replacing its values with descriptive labels ("diabetes" or "no diabetes").
Finally, we want to set the `Glucose` and `BMI` values that are 0 to NA, as they represent missingness rather than actual 0 values.
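One possible way to do these three recodings in pandas is sketched below on an invented data frame. It assumes `Outcome` is coded as 1 for diabetes and 0 for no diabetes; check that against your own data before relying on it.

```python
import numpy as np
import pandas as pd

# Invented rows, only to show the mechanics.
toy = pd.DataFrame(
    {
        "Pregnancies": [0, 2, 5],
        "Glucose": [148, 0, 110],
        "BMI": [33.6, 26.6, 0.0],
        "Outcome": [1, 0, 1],
    }
)

# True when the subject has had at least one pregnancy.
toy["Pregnancy"] = toy["Pregnancies"] > 0

# Assumption: 1 means diabetes and 0 means no diabetes.
toy["Outcome"] = toy["Outcome"].map({1: "diabetes", 0: "no diabetes"})

# Zeros in these two columns stand for missing measurements.
toy[["Glucose", "BMI"]] = toy[["Glucose", "BMI"]].replace(0, np.nan)

print(toy)
```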

Look at your data frame to see if it looks the way you expect:

We don't need to, but go ahead and drop the `Pregnancies` column, then check to see if your dataframe changed the way you expected.
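Dropping a column might look like the following sketch (invented data again). Note that `drop` returns a new data frame, so the result has to be assigned back.

```python
import pandas as pd

# Invented rows, only to show the mechanics.
toy = pd.DataFrame(
    {
        "Pregnancies": [0, 2],
        "Pregnancy": [False, True],
        "BMI": [33.6, 26.6],
    }
)

# drop() does not modify toy in place unless the result is reassigned.
toy = toy.drop(columns="Pregnancies")
print(toy.columns.tolist())
```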

Make box plots comparing the `Glucose` distribution by `Outcome`.

**INTERPRET YOUR RESULTS**

Using `p9.facet_wrap`, facet this data by whether or not the patient has been pregnant with `Pregnancy`.

**INTERPRET THIS GRAPH** How does prior pregnancy affect the likelihood that a diabetic has high glucose? A non-diabetic? How does it affect the glucose value?

By specifying aesthetic mappings for `x`, `y`, and `fill`, create a box plot figure without faceting that compares the `BMI` for different `Outcome` and `Pregnancy` states.

How does the median `BMI` compare between pregnant and non-pregnant women in each case? What happens if you switch `fill` and `x`? Does that make it easier or harder to answer this question?
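If reading the medians off the plot is difficult, a numeric cross-check with `groupby` is one option. The sketch below uses invented values, so its numbers are not a statement about the real data.

```python
import pandas as pd

# Invented rows, only to show the mechanics.
toy = pd.DataFrame(
    {
        "Outcome": ["diabetes"] * 4 + ["no diabetes"] * 4,
        "Pregnancy": [True, True, False, False] * 2,
        "BMI": [30.0, 32.0, 34.0, 36.0, 24.0, 26.0, 28.0, 30.0],
    }
)

# One median per (Outcome, Pregnancy) combination; NA values,
# if present, are skipped by median() by default.
medians = toy.groupby(["Outcome", "Pregnancy"])["BMI"].median()
print(medians)
```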

Now, repeat the above, but make a violin plot instead.

Compare the violin plot with the box plot. Can you find an advantage for each type of graph?

Plot `Glucose` vs `BMI` in a way that allows comparing the two `Outcome` cases. How do the values of `Glucose` for those with diabetes compare to those not having diabetes?

Use `p9.geom_smooth` to compare trend lines for how `Glucose` changes as `BMI` changes by `Outcome`. Is this a valid trendline?

Do diabetics and non-diabetics show different body mass indices? Start by making a box plot to compare them:

Non-diabetics have a slightly lower median, but the ranges mostly overlap. We know what the subdivision by pregnancy looks like from our previous plot: women who have had pregnancies have a slightly lower BMI, irrespective of diabetic outcome.
When we swap our `fill` and `x`, we observe that the division between diabetics and non-diabetics is clearer within the pregnancy categories.