# Overview

Last week, we discussed clinical information and how it is generated, recorded, and represented. This week we will work with a real-world dataset of de-identified patient data called **"MIMIC-II"**. We'll use **SQL** to query the database and Python to visualize and transform the data.


In this notebook we'll have a quick overview of the environments we're working in and the tools we'll be using.

# I. Jupyter Notebooks
Jupyter Notebooks are an environment which can be used for running code, displaying results and visualizations, and sharing human-readable information. Jupyter notebooks consist of *cells*, and each cell holds either a piece of code or a block of formatted text.

The cell which you're reading now is called a *Markdown* cell: It is meant to be human-readable and allows formatting like:
- Bulleted or numbered lists
- **bold text**
- *italics*

Double-click on this cell to see what the raw markdown looks like. Then run the cell by either clicking the "Run" button above (it looks like a "play" button) or pressing "Shift+Enter".

In [None]:
# This cell is a "Code" cell. It is meant to contain Python code to be executed
# The '#' symbol means that this is a comment, so it won't be run like code
# Run this cell and see what is displayed in the output:
print("Hello!")

Note that in the code cell above, when we execute the cell it runs the single line of code and then displays the output underneath.

# II. Python
Python is a popular general-purpose programming language. It's relatively easy to read and understand, so people often learn it as their first language. It's also great for quick development and iteration.

In Python, you can display output by calling the `print` function:

In [None]:
print("Hello again!")

You can assign variables by using the `=` operator. Fill in the quotation marks below with your name and see what is printed out:

In [None]:
name = ''
print("My name is: ", name)

Variables can be of any Python datatype. Note that `my_list` is a list which can contain several other objects - in this case, it contains the variables we just defined.

In [None]:
my_int = 1
my_float = 1.43
my_string = "I'm a string"
my_list = [my_int, my_float, my_string]
print(my_list)
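
To check which datatype each object in a list has, one option is Python's built-in `type` function. The following is a small sketch that rebuilds the same list with literal values and prints the type name of each item; the variable `type_names` is introduced here only for illustration.

```python
# Rebuild the list from the cell above using the same literal values
my_list = [1, 1.43, "I'm a string"]

# Collect the name of each item's datatype (type_names is a helper for this sketch)
type_names = []
for item in my_list:
    type_names.append(type(item).__name__)
    print(item, "is of type", type(item).__name__)
```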

We can perform operations on these Python objects:

In [None]:
# Add two numbers together
1 + 2

In [None]:
# Multiplication / division
2 * 4

In [None]:
2 / 4

In [None]:
# Exponents
2**3

In [None]:
# Add two strings together
"My name is " + name

A Python `function` is a piece of code which is defined once so that it can be run multiple times. A function takes in one or more **arguments** (the values in parentheses after the function name) and produces a **return value**. For example, here is a function which takes two numbers and multiplies them together:

In [None]:
def multiply(a, b):
    return a * b

In [None]:
multiply(2, 3)

In [None]:
multiply(8, 21)
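
A return value can also be stored in a variable and reused later. As a further sketch, here is a hypothetical function (not part of any library) that converts a temperature from Fahrenheit to Celsius; the name `fahrenheit_to_celsius` and the example temperatures are assumptions chosen for illustration.

```python
def fahrenheit_to_celsius(temp_f):
    # Subtract the 32-degree offset, then rescale by 5/9
    return (temp_f - 32) * 5 / 9

# Store the return values in variables so they can be reused
freezing_c = fahrenheit_to_celsius(32)
boiling_c = fahrenheit_to_celsius(212)

print("Freezing point in Celsius:", freezing_c)
print("Boiling point in Celsius:", boiling_c)
```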

We can `import` other libraries into Python so that we can use code which other people have written. For example, we'll import the `matplotlib` library and use it to generate a line plot:

In [None]:
import matplotlib.pyplot as plt
# Plot each x value against its square
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25], marker='o')
plt.show()

If you haven't installed a library and try to import it, you'll get an error. To install a library, you can run a `pip install` command in either a notebook or the terminal:

In [None]:
!pip install matplotlib
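
If you'd like to check whether a library is available before importing it, one possible approach uses the standard library's `importlib.util.find_spec`. This is only a sketch: the helper `is_installed` and the made-up library name are assumptions for illustration.

```python
import importlib.util

def is_installed(module_name):
    # find_spec returns None when no top-level module with this name can be found
    return importlib.util.find_spec(module_name) is not None

# "json" ships with Python, so it should always be found
print(is_installed("json"))

# A made-up name that should not be installed anywhere
print(is_installed("a_library_that_does_not_exist"))
```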

# III. SQL
SQL stands for "Structured Query Language". It is used to retrieve data from relational databases, perform aggregations on that data, and return it in a useful format.

Here is a very quick overview of how SQL works:

## SQL Query
SQL code consists of **queries** which are executed to run commands against the database you're working in. At a very basic level, a query consists of a `select` clause, a `from` clause, and a number of optional clauses such as `where`, `limit`, and `order by`.

## `select` statement
This part of the query specifies which columns of data to return. For example, the following query will select the `employee_id`, `employee_first_name`, and `employee_last_name` values from an imaginary table:

```
select employee_id, employee_first_name, employee_last_name
...
```

## `from` statement
This part of the query tells SQL which tables to retrieve the data from. In this example, we want to get the employee ids and names from a table called `employees`:

```
select employee_id, employee_first_name, employee_last_name
from employees
...
```

## `where` statement
This filters the results of our query to only look at certain values. This query will only return data for employees whose first name is "Alex":
```
select employee_id, employee_first_name, employee_last_name
from employees
where employee_first_name = "Alex"
```

## Other statements:
Only the `select` and `from` statements are needed to run a query. But there are many statements which can be very useful, such as:
- `limit 100`: Limit to the first 100 rows (or whatever number)
- `order by employee_last_name`: Sort the results in alphabetical order according to a specific column, such as `employee_last_name`

Here's an example of a query which puts all of these together:

```
select employee_id, employee_first_name, employee_last_name
from employees
where employee_first_name = "Alex"
order by employee_last_name
limit 100;
```
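
To see a query like this actually run, here is a minimal sketch that uses Python's built-in `sqlite3` module with a throwaway in-memory database. The `employees` table and its three rows are made up for illustration (they are not part of MIMIC), and SQLite stands in for the MySQL database used later in this notebook. The sketch sorts by the `employee_last_name` column and uses single quotes around the text value, which is the standard SQL way to write a string.

```python
import sqlite3

# Create a temporary in-memory database (hypothetical data, not MIMIC)
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table employees ("
    "employee_id integer, employee_first_name text, employee_last_name text)"
)
conn.executemany(
    "insert into employees values (?, ?, ?)",
    [
        (1, "Alex", "Young"),
        (2, "Sam", "Baker"),
        (3, "Alex", "Chen"),
    ],
)

# select + from + where + order by + limit, all in one query
query = """
select employee_id, employee_first_name, employee_last_name
from employees
where employee_first_name = 'Alex'
order by employee_last_name
limit 100;
"""

# Run the query and fetch every matching row as a list of tuples
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

conn.close()
```

Only the two employees named "Alex" are returned, sorted by last name; the row for "Sam" is filtered out by the `where` clause.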

# IV. MIMIC-II
MIMIC is an openly available clinical database. It's **de-identified**, meaning that any information which would connect a patient to their data has been removed or altered. That means that we have access to it as researchers, students, and developers. 

MIMIC-II has been updated to MIMIC-III, which is similar but contains data for living patients, while MIMIC-II has only deceased patients. MIMIC-III requires a data usage agreement, so we will instead use the older version. The two versions are very similar and contain a lot of the same data.

Here is a description of MIMIC-III from the [MIMIC website](https://mimic.physionet.org/):

***
MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital).

MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors:

- it is freely available to researchers worldwide
- it encompasses a diverse and very large population of ICU patients
- it contains high temporal resolution data including lab results, electronic documentation, and bedside monitor trends and waveforms.
***

We will use Python and SQL to access a MySQL database which is set up on Google Cloud. You'll need a password to access it; ask your instructor if you don't know it.

First, we'll import the libraries which will allow us to connect to the database:

In [None]:
# Pandas is a library which allows us to work with tabular data from a number of different formats,
# including SQL
import pandas as pd

# pymysql lets us connect to a MySQL database from Python
import pymysql

# Finally, getpass will allow us to type our password in:
import getpass

The host name, username, and database name have been defined for you. When prompted, enter your password:

In [None]:
conn = pymysql.connect(host="35.233.174.193", port=3306,
                       user="jovyan",
                       passwd=getpass.getpass("Enter password for MIMIC2 database: "),
                       db='mimic2')

If you didn't get an error, then that means it worked! Let's run our first query against MIMIC to see what tables are in the database:

In [None]:
# Define a query as a string
query = """
show tables;
"""

# Pass the query and our MySQL connection to pandas.
# Store the result in a variable called df (DataFrame)
df = pd.read_sql(query, conn)
df

# TODO and Discussion items
Throughout these notebooks, you'll see sections marked **TODO** or **Discussion**. Whenever you see a TODO section, you'll be asked to edit or complete code. Parts of code which you need to edit will have placeholders of three underscores: `___`. You should replace these underscores with the correct snippet of code.

For example, if you are asked to edit the cell below to print out the text "Hello, world!", you should change this cell (which won't run):

In [None]:
___("Hello, world!")

To this:

In [None]:
print("Hello, world!")