<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>


In [2]:
from helpers import *

# Introduction to Relational Databases and SQL
In Module 3, we'll move beyond pure Python and start using the **SQL** programming language to get data from clinical databases. This will be an important addition to your data science skillset and will allow you to build your own datasets from raw clinical data.

This notebook will begin with a brief overview of what relational databases are and why they're so useful in healthcare. We'll then be introduced to the database we'll be working with in this class and how to write queries to pull data.

- What is a database?
- Introduction to MIMIC
- SQL basics

## Electronic Health Record Data
Electronic Health Records (EHRs) allow us to collect and store massive amounts of patient data. This data is extremely valuable both for providing care to patients and for performing research. Using EHR data allows us to analyze how patients were treated and what outcomes they had, along with covariates such as demographics and past medical history. This can offer a cheaper, quicker alternative to clinical trials by taking advantage of data that already exists.

Sounds easy, right? Well, it's not quite that simple. EHR data comes with a host of challenges. The first is **size**. The massive volume of EHR data puts it into the category of "big data", so we need to be able to store it and retrieve it in a way that is efficient and scalable. The second is **purpose**. The primary purpose of EHR data is to support clinical care; research is one of several "secondary" uses. That means the data may not be represented in the way that is most useful for research, so we need to do some extra work to transform it into the format we need. The third is **messiness**. Most real-world data is messy: prone to errors, inconsistencies, and missing values. Because clinical care is so complex, this is especially true of clinical data. So we also need to "wrangle" our data and clean it up so that it's consistent and usable for our research.

(Note: Even once you get the data, there are still many challenges like confounding and bias, but we will be mainly focusing on the issues of just getting the data.)

#### Discussion
What are some examples of data stored in the EHR? How might those be useful to researchers?

## What is a Database?
Broadly defined, a database is any ["organized collection of data"](https://en.wikipedia.org/wiki/Database). In this class, we'll specifically be focusing on [**relational databases**](https://en.wikipedia.org/wiki/Relational_database), the most common type of database used in healthcare and many other applications.

Here's a simple example of a relational database. Let's say we wanted to use a database to store a list of all of our patients and their diagnoses. If it's a small list of patients, we could do this very simply in an Excel spreadsheet. This would, technically, count as a "database". We'll refer to each sheet in the spreadsheet as a **table** because of its tabular format.

### A Simple Example

A nice format for our example spreadsheet is to store the data in two **tables**: a `patient` table and a `diagnosis` table. In the `patient` table, each row would represent a different patient, and there would only be *one* row per patient. In the `diagnosis` table, each row would represent a single diagnosis for a single patient. A patient might appear in the table more than once, but a unique diagnosis for a patient should only appear once (although what "unique" means may not be clear: what if they are diagnosed with the same disease more than once?).

![patient_table](./media/example_patient_table.png) 
![patient_diagnosis_table](./media/example_patient_diagnosis_table.png)

### Problems
This is nice and simple, but there are a few problems with the implementation of our mini-database:
1. **Redundancy**: We store the full patient name in both `patient` and `diagnosis`. That's fine when we have three patients with just a few diagnoses, but eventually we're going to have many more patients and this will take up a lot more space.
2. **Uniqueness**: While we're unlikely to see another patient named Thor in our clinic, most people's names are not so unique. Eventually we'll end up with two patients named "John Smith". How will we tell them apart in the `diagnosis` table?
3. **Granularity**: The `Patient Name` column has both the first and last name for the patient. But what if we just want one of those values? Similarly with `Location`, the entire city/state/country/planet are stored in one cell.
4. **Consistency**: Looking again at `Location`, we see a few different formats: Tony Stark's location is a city and state. But Natasha Romanoff's is a city followed by a country. Thor doesn't have either of these - just a planet. So the format and meaning of the values are *not consistent* with one another.
5. **Size**: As we see more patients, our `patient` sheet will eventually have thousands or millions of rows. And we'll get much more than just diagnoses for each patient, so we could have hundreds of sheets, each with millions of rows. Eventually, Excel is just not going to be enough.
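
The uniqueness problem can be made concrete with a small sketch in pandas. The rows below are invented for illustration (they are not the rows pictured in the tables above), and the column names `patient_name` and `location` are assumptions. Linking the two sheets on the name alone attaches the one "John Smith" diagnosis to both patients with that name.

```python
import pandas as pd

# Hypothetical rows, invented for this illustration
patient = pd.DataFrame({
    "patient_name": ["John Smith", "John Smith", "Thor"],
    "location": ["Salt Lake City, UT", "Provo, UT", "Asgard"],
})
diagnosis = pd.DataFrame({
    "patient_name": ["John Smith", "Thor"],
    "diagnosis": ["Hypertension", "Concussion"],
})

# Link the two sheets using the only column they share: the name
linked = patient.merge(diagnosis, on="patient_name")
print(linked)

# Count how many linked rows carry the single "John Smith" diagnosis
n_john_rows = int((linked["patient_name"] == "John Smith").sum())
print(n_john_rows)
```

The diagnosis shows up once for each John Smith, so there is no way to tell which of the two patients it actually belongs to.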

### Relational Database
Well-designed relational databases solve many of these issues. Here are some steps we could take to have a better database: 
1. To address the **redundancy** issue, we'll instead just store the value of patient names in the `patient` table and then **link** other tables to that table using an identifier (typically a number) which is less expensive to store. This is why we call this type of database *relational*.
2. To fix the **uniqueness** issue, we'll make that identifier unique. 
3. To allow for more **granular** analysis, we'll split the `Patient Name` column into `last_name` and `first_name`, and the `Location` column into `city`, `state`, `country`, and `planet`.
4. Splitting these columns will also make the data more **consistent**, since we'll know which elements are recorded for each patient and can compare them appropriately.
5. For our **size** issue, the steps above might help us reduce our space, but we may ultimately need to move out of Excel and into some other framework.
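
As a rough sketch of steps 1 through 3, the snippet below turns a flat sheet into a `patient` table and a `diagnosis` table using pandas. The patient names come from the example above, but the diagnoses and the starting column name `patient_name` are made up for illustration. Splitting on the first space is a simplification that would not handle single-word names such as Thor.

```python
import pandas as pd

# Hypothetical flat sheet: the diagnoses are invented for this illustration
flat = pd.DataFrame({
    "patient_name": ["Tony Stark", "Tony Stark", "Natasha Romanoff"],
    "diagnosis": ["Chest pain", "Anxiety", "Fracture"],
})

# Steps 1-2: one row per patient, each with a unique numeric identifier
patient = flat[["patient_name"]].drop_duplicates().reset_index(drop=True)
patient["subject_id"] = patient.index + 1

# Step 3: split the full name into more granular columns
name_parts = patient["patient_name"].str.split(" ", n=1, expand=True)
patient["first_name"] = name_parts[0]
patient["last_name"] = name_parts[1]

# The diagnosis table keeps only the identifier instead of repeating the name
diagnosis = flat.merge(patient[["patient_name", "subject_id"]], on="patient_name")
diagnosis = diagnosis[["subject_id", "diagnosis"]]
patient = patient[["subject_id", "last_name", "first_name"]]

print(patient)
print(diagnosis)
```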

Below is a diagram showing a possible **schema** for our database with an added third table of `encounter`. The columns in **bold** are **keys** which are used to identify a particular entity - such as a patient, diagnosis, or encounter - in the various tables. The arrows show how the keys link the three tables with each other. 

For example, `subject_id` is the primary identifier for the `patient` table - it represents a single, unique patient. It's used in both the `diagnosis` and `encounter` tables to join those with the `patient` table.

![example_schema](./media/example_schema.png)
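
A schema like this can also be written down in SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a real database server. The column data types and the `diagnosis_id` column are assumptions made for illustration, and the columns in the diagram may differ from the ones listed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE patient (
    subject_id INTEGER PRIMARY KEY,      -- one row per unique patient
    last_name  TEXT,
    first_name TEXT
);
CREATE TABLE encounter (
    encounter_id   INTEGER PRIMARY KEY,
    subject_id     INTEGER NOT NULL REFERENCES patient(subject_id),
    encounter_type TEXT,
    date           TEXT
);
CREATE TABLE diagnosis (
    diagnosis_id INTEGER PRIMARY KEY,    -- hypothetical key, not named in the text
    subject_id   INTEGER NOT NULL REFERENCES patient(subject_id),
    encounter_id INTEGER REFERENCES encounter(encounter_id),
    diagnosis    TEXT,
    date         TEXT
);
""")

conn.execute("INSERT INTO patient VALUES (1, 'Stark', 'Tony')")

# A diagnosis that points at a patient who does not exist should be refused
try:
    conn.execute(
        "INSERT INTO diagnosis (subject_id, diagnosis) VALUES (99, 'Concussion')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

Declaring the keys lets the database itself guard the links between tables, rather than relying on everyone who enters data to keep the identifiers in sync.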

## Joining tables in a relational database
To get data, we might need elements from multiple different tables. To do that we need to **join** them using the relationships shown in the diagram above.

For example, let's say we wanted a list of patient first/last names, diagnoses, and type of encounter:

![joined_values](./media/example_joined_values.png)

To get these values, we might do the following, alternating between pulling a value from a table and joining to another table to get the next value:

- **(Get Value)** Get the columns `last_name` and `first_name` from the `patient` table
- **(Join Table)** Use the `subject_id` column to join `patient` and `diagnosis` tables
- **(Get Value)** Take the `diagnosis` column from the `diagnosis` table
- **(Join Table)** Use the `encounter_id` column to join `diagnosis` and `encounter`
- **(Get Value)** Take the `encounter_type` column from `encounter`
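
The steps above could be written as a single SQL query. This is a minimal sketch that runs against an in-memory SQLite database through Python's `sqlite3` module; the rows are made up for illustration, and the tables are pared down to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (subject_id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT);
CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, subject_id INTEGER, encounter_type TEXT);
CREATE TABLE diagnosis (subject_id INTEGER, encounter_id INTEGER, diagnosis TEXT);

-- Made-up rows for illustration
INSERT INTO patient VALUES (1, 'Stark', 'Tony'), (2, 'Romanoff', 'Natasha');
INSERT INTO encounter VALUES (10, 1, 'Inpatient'), (11, 2, 'Outpatient');
INSERT INTO diagnosis VALUES (1, 10, 'Chest pain'), (2, 11, 'Fracture');
""")

query = """
SELECT p.last_name, p.first_name,      -- values from patient
       d.diagnosis,                    -- value from diagnosis
       e.encounter_type                -- value from encounter
FROM patient AS p
JOIN diagnosis AS d ON d.subject_id = p.subject_id      -- join on subject_id
JOIN encounter AS e ON e.encounter_id = d.encounter_id  -- join on encounter_id
ORDER BY p.subject_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each `JOIN ... ON` line corresponds to one "Join Table" step, and each entry in the `SELECT` list corresponds to one "Get Value" step.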

#### TODO
Let's say that the leaders of our clinic ask us to pull some data from our database for a report. Which columns would we need to get the requested information? Think both about the values that we want to present and about the ones needed to link the appropriate tables together.

In the notation below, `encounter.date` refers to the date column of `encounter`, while `diagnosis.date` refers to the date column of `diagnosis`.

In [40]:
# RUN CELL TO SEE QUIZ
quiz_relational_columns1

Quiz prompt: The IDs of all patients who have had Covid-19.



In [41]:
# RUN CELL TO SEE QUIZ
quiz_relational_columns2

Quiz prompt: The number of outpatient encounters between January and May of 2022.



In [42]:
# RUN CELL TO SEE QUIZ
quiz_relational_columns3

Quiz prompt: The age at time of diagnosis for patients with cancer.



Instead of working with an example of a superhero clinic, in the rest of this module we'll use real clinical data from a deidentified clinical database called **MIMIC-II**.

## Introduction to MIMIC-II

MIMIC is an openly available clinical database. It is **de-identified**, meaning that any information which would connect a patient to their data has been removed or altered. That means that we have access to it as researchers, students, and developers. 

The research database has since been updated to MIMIC-III, which contains data for living patients, while MIMIC-II has only deceased patients. MIMIC-III requires a data usage agreement, so we will instead use the older version. The two versions are very similar and contain much of the same data.

Here is a description of MIMIC-III from the [MIMIC website](https://mimic.physionet.org/):

***
<strong>
MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital).
</strong>
***

There is a PDF containing MIMIC-II documentation here:
https://mimic.mit.edu/archive/mimic-ii-guide.pdf

The documentation is quite detailed and technical, but it's useful if you have a specific question about a table or schema. The diagram below shows a high-level summary of the types of data contained in the MIMIC database. If you want to see it enlarged, you can open it in a separate window: [MIMIC architecture](../media/mimic-ii-architecture.png).

This diagram shows the architecture of the database along with column names and relationships between tables. While it's more complex than the fictional superhero clinic, it's fundamentally designed the same way.


![MIMIC architecture](../media/mimic-ii-architecture.png)

The file `mimic_tables.csv` contains a list of tables in the database:

In [1]:
import pandas as pd
pd.read_csv("mimic_tables.csv")

Unnamed: 0,Tables_in_mimic2
0,a_chartdurations
1,a_iodurations
2,a_meddurations
3,additives
4,admissions
5,censusevents
6,chartevents
7,comorbidity_scores
8,d_caregivers
9,d_careunits


#### Discussion
Think of clinical data you might want to use in research. Where might you find these data elements in MIMIC? Which tables and column names would you need?
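
One way to start looking is to search the table listing for a keyword. The sketch below types in the ten table names shown above by hand so that it runs without `mimic_tables.csv`; the `keyword` variable is just an illustrative name. In the notebook itself you could apply the same filter to the DataFrame returned by `pd.read_csv`.

```python
import pandas as pd

# The ten table names shown in the listing above, entered by hand
tables = pd.DataFrame({"Tables_in_mimic2": [
    "a_chartdurations", "a_iodurations", "a_meddurations", "additives",
    "admissions", "censusevents", "chartevents", "comorbidity_scores",
    "d_caregivers", "d_careunits",
]})

# Case-insensitive search for table names that contain a keyword
keyword = "chart"
is_match = tables["Tables_in_mimic2"].str.contains(keyword, case=False, regex=False)
matches = tables.loc[is_match, "Tables_in_mimic2"].tolist()
print(matches)
```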

In the next few notebooks, we'll go through some of these tables in more detail and sharpen our SQL skills along the way.