# SQL and RDBMS

 * Data is generally stored in databases, rather than in flat files
     * Reduced redundancy
     * More consistent
     * Better backups!
     * Data entry/storage/retrieval is more efficient

There are many types of databases!

 * Sometimes, data is stored in tabular formats
 * Sometimes as documents (hierarchical)
 * Sometimes as a graph (network-based)

Today, we will look at Relational Database Management Systems (RDBMS)

![](./assets/relational_model.png)

## Characteristics of a Relational Database

 * Data are stored as tables (Rows and Columns)
 * All values are scalar (each row/column entry has exactly 1 value)
 * Each column has exactly 1 type (numeric, text, etc.)
 * Tables have Key columns, which are used to index the table
 * A Primary key is a column (or set of columns) that *uniquely* identifies a row in a table.
     * Must be unique
     * cannot be NULL
 * A Foreign key is a column whose value is required to match the primary key of another table
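
To make the key rules above concrete, here is a minimal sketch using an in-memory SQLite database. The `patient` and `visit` tables and their columns are hypothetical and are not the course dataset. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been set on the connection.

```python
import sqlite3

# In-memory database with two hypothetical tables (not the course dataset).
demo = sqlite3.connect(":memory:")
demo.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on

demo.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, gender TEXT)")
demo.execute("""
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patient (patient_id)
    )
""")

demo.execute("INSERT INTO patient VALUES (1, 'Female')")
demo.execute("INSERT INTO visit VALUES (100, 1)")  # accepted: patient 1 exists


def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        demo.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


# Reusing primary key 1 violates uniqueness.
duplicate_key = try_insert("INSERT INTO patient VALUES (1, 'Male')")
# Patient 999 does not exist, so the foreign key has nothing to match.
orphan_visit = try_insert("INSERT INTO visit VALUES (101, 999)")

print(duplicate_key)
print(orphan_visit)
```

Both inserts should be rejected: the first because the primary key must be unique, the second because the foreign key must match an existing primary key.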

## Benefits of the Relational Model

 * Data is easy to retrieve and query
 * Flexible (easy to add/delete tables)
 * Reduced redundancy
 

## Disadvantages
 * Sometimes slow and difficult to scale
 * Not ideal for storing hierarchical data
 * Must adhere to a fixed schema (bad for unstructured data)

# Structured Query Language (SQL)

 * A language used to query data (and more!) from relational databases
 * Many different flavors depending on the database:
     * Oracle
     * Microsoft SQL Server
     * MySQL
     * etc.

## Database Tables

 * Every table has a name 
 * Contains records (rows)

In [None]:
import sqlite3 # library for working with sqlite database
conn = sqlite3.connect("./data/diabetes.db") # Create a connection to the on-disk database

In [None]:
import pandas as pd

#### Example Query

In [None]:
pd.read_sql("SELECT * FROM patient LIMIT 25", conn)

In [None]:
pd.read_sql("SELECT * FROM sqlite_master where type='table'", conn)

### SELECT clause

```
SELECT column_name1, column_name2
FROM table1
```

In [None]:
pd.read_sql("SELECT race, gender FROM patient", conn)

#### SELECT DISTINCT

`SELECT DISTINCT` returns only the unique combinations of values in the selected columns.

In [None]:
pd.read_sql("SELECT DISTINCT race, gender FROM patient", conn)

### WHERE Clauses

WHERE clauses allow you to *filter* data in your SQL query. There are many logical operators that you can use with the WHERE clause. Here is a simple one:

In [None]:
pd.read_sql("SELECT DISTINCT race, gender FROM patient WHERE race = 'Caucasian'", conn)

In [None]:
pd.read_sql("SELECT patient_nbr FROM patient WHERE patient_nbr = 77586282", conn)

Here is a list of some operators you can use in your WHERE clause. 

![](./assets/WHEREclause.png)

#### BETWEEN

In [None]:
pd.read_sql("SELECT * FROM patient WHERE patient_nbr BETWEEN 10000 AND 99999", conn)

#### LIKE

The `LIKE` operator lets you match patterns in text, much like regular expressions, although it is considerably less powerful. The two things to know are the `%` and `_` wildcards.

`%` matches zero, one, or many characters

`_` matches exactly one character

In [None]:
pd.read_sql("SELECT DISTINCT race, gender FROM patient WHERE gender LIKE '%male'", conn)

In [None]:
pd.read_sql("SELECT DISTINCT race, gender FROM patient WHERE gender LIKE '_male'", conn)

#### IN 

In [None]:
pd.read_sql("SELECT DISTINCT race, gender FROM patient WHERE race in ('Caucasian', 'Hispanic')", conn)

### LOGIC (AND/OR/NOT)

As you might imagine, you can create complex WHERE clauses by using the `AND`, `OR`, and `NOT` keywords. Because `AND` is evaluated before `OR`, wrap subclauses in parentheses whenever you need them to be evaluated as a unit.

In [None]:
pd.read_sql("""SELECT DISTINCT race, gender 
                FROM patient 
                WHERE race in ('Caucasian', 'Hispanic')
                    AND gender = 'Female'
            """, conn)


In [None]:
pd.read_sql("""SELECT DISTINCT race, gender 
                FROM patient 
                WHERE race in ('Caucasian', 'Hispanic')
                    OR gender = 'Female'
            """, conn)

In [None]:
pd.read_sql("""SELECT DISTINCT race, gender 
                FROM patient 
                WHERE race in ('Caucasian', 'Hispanic')
                    OR NOT gender = 'Female'
            """, conn)

In [None]:
pd.read_sql("""SELECT DISTINCT race, gender
                FROM patient
                WHERE gender = 'Female' AND (race = 'Other' OR race = 'Asian')
""", conn)

### ORDER BY

In [None]:
pd.read_sql("""SELECT patient_nbr, gender
                FROM patient
                ORDER BY patient_nbr
""", conn)

In [None]:
pd.read_sql("""SELECT patient_nbr, gender
                FROM patient
                ORDER BY patient_nbr DESC
""", conn)

### NULL values

In [None]:
pd.read_sql("""SELECT patient_nbr, gender
                FROM patient
                WHERE gender IS NULL
""", conn)

In [None]:
pd.read_sql("""SELECT patient_nbr, gender
                FROM patient
                WHERE gender = NULL -- never true, so no rows are returned; use IS NULL instead
""", conn)

In [None]:
pd.read_sql("""SELECT patient_nbr, gender
                FROM patient
                WHERE gender IS NOT NULL
""", conn)

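The three queries above behave differently because any comparison with `NULL` using `=` evaluates to `NULL` rather than true, so the row is filtered out. The sketch below reproduces this on a tiny in-memory table; the three rows are made up for illustration.

```python
import sqlite3

# Toy table with one missing gender value (made-up data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE patient (patient_nbr INTEGER, gender TEXT)")
demo.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [(1, "Female"), (2, None), (3, "Male")],  # Python None is stored as SQL NULL
)

is_null = demo.execute(
    "SELECT patient_nbr FROM patient WHERE gender IS NULL"
).fetchall()
equals_null = demo.execute(
    "SELECT patient_nbr FROM patient WHERE gender = NULL"
).fetchall()
not_null = demo.execute(
    "SELECT patient_nbr FROM patient WHERE gender IS NOT NULL ORDER BY patient_nbr"
).fetchall()

print(is_null)      # [(2,)]
print(equals_null)  # []
print(not_null)     # [(1,), (3,)]
```
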
### Min, Max, Count, Avg, Sum

SQL also has some built-in functions for summarizing data. For example, calling `MIN(column_name)` in a SELECT statement returns the minimum value of that column.

In [None]:
pd.read_sql("""SELECT MIN(num_medications), MAX(num_medications), AVG(num_medications)
                FROM encounter
""", conn)

### Aliases

Sometimes SQL queries contain long table or column names, and it is easier to refer to them by another name, or alias. In addition, derived columns like those returned from MIN(), MAX(), etc. are easier to read when given a descriptive name.

##### Column name alias

In [None]:
pd.read_sql("""SELECT MIN(num_medications) as Minimum_medications, MAX(num_medications), AVG(num_medications)
                FROM encounter
""", conn)

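##### Table name alias

Tables can be aliased in the same way, which helps most once a query involves several tables. The following is a small sketch on a made-up in-memory `encounter` table (only the table and column names mirror the ones used above; the rows are invented).

```python
import sqlite3

# Made-up rows; only the table and column names mirror the course dataset.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE encounter (encounter_id INTEGER, num_medications INTEGER)")
demo.executemany("INSERT INTO encounter VALUES (?, ?)", [(1, 4), (2, 10), (3, 7)])

cursor = demo.execute("""
    SELECT MIN(e.num_medications) AS min_medications,
           MAX(e.num_medications) AS max_medications
    FROM encounter AS e   -- 'e' is now shorthand for the encounter table
""")
columns = [col[0] for col in cursor.description]  # column names as returned by the query
result = cursor.fetchall()

print(columns)  # ['min_medications', 'max_medications']
print(result)   # [(4, 10)]
```
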
# Joining Tables

The power of relational databases comes from their relational structure, which lets the user join tables in order to combine information across them.

In [None]:
pd.read_sql("""SELECT *
                FROM diagnosis
                LIMIT 10
""", conn)

In [None]:
pd.read_sql("""SELECT "ICD-9-CM CODE"
                FROM ccs_crosswalk
                LIMIT 10""", conn)

### 3 Important Types of Joins:

### Inner Join

![](./assets/innerjoin.png)

As the figure above shows, an inner join returns only the rows whose join-column values appear in BOTH tables.

In [None]:
pd.read_sql("""SELECT *
                FROM diagnosis
                INNER JOIN ccs_crosswalk ON diagnosis.diag_1 = ccs_crosswalk."ICD-9-CM CODE"
""", conn)

### Left Join

![](./assets/leftjoin.png)
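
A left join keeps every row from the left table and fills the right table's columns with NULL where no match exists. The sketch below uses two made-up in-memory tables; `code_lookup` and its columns are hypothetical stand-ins, not the course's `ccs_crosswalk` table.

```python
import sqlite3

# Hypothetical toy tables: one diagnosis code ('999') has no entry in the lookup.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE diagnosis (encounter_id INTEGER, diag_1 TEXT)")
demo.execute("CREATE TABLE code_lookup (code TEXT, description TEXT)")
demo.executemany("INSERT INTO diagnosis VALUES (?, ?)", [(1, "250"), (2, "999")])
demo.execute("INSERT INTO code_lookup VALUES ('250', 'Example category A')")

rows = demo.execute("""
    SELECT d.encounter_id, d.diag_1, c.description
    FROM diagnosis AS d
    LEFT JOIN code_lookup AS c ON d.diag_1 = c.code
    ORDER BY d.encounter_id
""").fetchall()

print(rows)  # [(1, '250', 'Example category A'), (2, '999', None)]
```

With an inner join, encounter 2 would have been dropped; here it is kept with `None` (SQL NULL) as its description.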

### Full Join (Full outer join)

![](./assets/fulljoin.png)
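
A full outer join keeps unmatched rows from both tables. SQLite supports `FULL OUTER JOIN` directly only from version 3.39.0 onward, so the sketch below emulates it with two left joins combined by `UNION ALL`, which should run on older versions too. The tables and rows are made up for illustration.

```python
import sqlite3

# Hypothetical toy tables: '999' appears only in diagnosis, '401' only in code_lookup.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE diagnosis (encounter_id INTEGER, diag_1 TEXT)")
demo.execute("CREATE TABLE code_lookup (code TEXT, description TEXT)")
demo.executemany("INSERT INTO diagnosis VALUES (?, ?)", [(1, "250"), (2, "999")])
demo.executemany(
    "INSERT INTO code_lookup VALUES (?, ?)",
    [("250", "Example category A"), ("401", "Example category B")],
)

rows = demo.execute("""
    -- Part 1: every diagnosis row, with lookup columns where a match exists
    SELECT d.encounter_id, d.diag_1, c.code, c.description
    FROM diagnosis AS d
    LEFT JOIN code_lookup AS c ON d.diag_1 = c.code
    UNION ALL
    -- Part 2: lookup rows that matched no diagnosis row at all
    SELECT d.encounter_id, d.diag_1, c.code, c.description
    FROM code_lookup AS c
    LEFT JOIN diagnosis AS d ON d.diag_1 = c.code
    WHERE d.diag_1 IS NULL
""").fetchall()

for row in rows:
    print(row)
```

The result contains three rows: the matched pair for code `250`, the unmatched diagnosis `999` with NULL lookup columns, and the unmatched lookup code `401` with NULL diagnosis columns.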

### Group By:
Just like in Pandas, we have group by functionality in SQL as well.

In [None]:
pd.read_sql("""SELECT ccs_crosswalk."CCS CATEGORY DESCRIPTION", count(ccs_crosswalk."CCS CATEGORY DESCRIPTION")
                FROM diagnosis
                INNER JOIN ccs_crosswalk ON diagnosis.diag_1 = ccs_crosswalk."ICD-9-CM CODE"
                GROUP BY ccs_crosswalk."CCS CATEGORY DESCRIPTION"
""", conn)

### Having

In [None]:
pd.read_sql("""SELECT ccs_crosswalk."CCS CATEGORY DESCRIPTION", count(ccs_crosswalk."CCS CATEGORY DESCRIPTION") 
                FROM diagnosis
                INNER JOIN ccs_crosswalk ON diagnosis.diag_1 = ccs_crosswalk."ICD-9-CM CODE"
                GROUP BY ccs_crosswalk."CCS CATEGORY DESCRIPTION"
                HAVING count(ccs_crosswalk."CCS CATEGORY DESCRIPTION") > 1000
""", conn)

### Case Statements

In [None]:
pd.read_sql("""SELECT num_medications, 
                (CASE 
                    WHEN num_medications > 5 THEN 'Greater than 5 medications'
                    WHEN num_medications <= 5 THEN 'Less than or equal to 5 medications'
                END) AS GreaterThan5
                FROM encounter
""", conn)

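## Execution Order

Under the hood, the clauses of a query are evaluated in the logical order below, which differs from the order in which they are written (adapted from [SQLBolt](https://sqlbolt.com/lesson/select_queries_order_of_execution)).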

### 1. FROM and JOINs
The FROM clause, and subsequent JOINs are first executed to determine the total working set of data that is being queried. This includes subqueries in this clause, and can cause temporary tables to be created under the hood containing all the columns and rows of the tables being joined.

### 2. WHERE
Once we have the total working set of data, the first-pass WHERE constraints are applied to the individual rows, and rows that do not satisfy the constraint are discarded. Each of the constraints can only access columns directly from the tables requested in the FROM clause. Aliases in the SELECT part of the query are not accessible in most databases since they may include expressions dependent on parts of the query that have not yet executed.

### 3. GROUP BY
The remaining rows after the WHERE constraints are applied are then grouped based on common values in the column specified in the GROUP BY clause. As a result of the grouping, there will only be as many rows as there are unique values in that column. Implicitly, this means that you should only need to use this when you have aggregate functions in your query.

### 4. HAVING
If the query has a GROUP BY clause, the constraints in the HAVING clause are then applied to the grouped rows, discarding the groups that don't satisfy the constraint. As with the WHERE clause, aliases are not accessible from this step in most databases.

### 5. SELECT
Any expressions in the SELECT part of the query are finally computed.

### 6. DISTINCT
Of the remaining rows, rows with duplicate values in the column marked as DISTINCT will be discarded.

### 7. ORDER BY
If an order is specified by the ORDER BY clause, the rows are then sorted by the specified data in either ascending or descending order. Since all the expressions in the SELECT part of the query have been computed, you can reference aliases in this clause.

### 8. LIMIT / OFFSET
Finally, the rows that fall outside the range specified by the LIMIT and OFFSET are discarded, leaving the final set of rows to be returned from the query.
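
The steps above can be traced on a small example. The sketch below uses a made-up in-memory `encounter` table: WHERE removes individual rows first, GROUP BY and HAVING then work on what is left, and ORDER BY can refer to the `n_encounters` alias because SELECT has already been evaluated by that point.

```python
import sqlite3

# Made-up rows: (patient_nbr, num_medications)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE encounter (patient_nbr INTEGER, num_medications INTEGER)")
demo.executemany(
    "INSERT INTO encounter VALUES (?, ?)",
    [(1, 2), (1, 12), (2, 8), (2, 9), (3, 1)],
)

rows = demo.execute("""
    SELECT patient_nbr, COUNT(*) AS n_encounters
    FROM encounter                 -- 1. working set: all 5 rows
    WHERE num_medications > 5      -- 2. keeps (1, 12), (2, 8), (2, 9)
    GROUP BY patient_nbr           -- 3. patient 1 has 1 row, patient 2 has 2 rows
    HAVING COUNT(*) >= 2           -- 4. only patient 2 remains
    ORDER BY n_encounters DESC     -- 7. the alias is available here
    LIMIT 1                        -- 8. at most one row is returned
""").fetchall()

print(rows)  # [(2, 2)]
```

Patient 1 has two encounters in the table, but only one survives the WHERE filter, so the HAVING condition removes that group.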

### And much, much more...

### Resources:
[Interactive SQL book](https://selectstarsql.com/)

[Quick Reference](https://www.w3schools.com/sql/sql_quickref.asp)

## In-class exercises:

### Answer the following questions from HW2 with SQL statements instead of Pandas (answers may differ slightly due to preprocessing)

**How many unique encounters are there? How many unique patients?**

**What is the largest number of encounters that a single patient has in the dataset?**

**What is the average number of labs administered by age category?**

**Create a new column that has the value of 1 if the medical specialty in that row contains the word Surgery and 0 otherwise.**

For this question, you can use the `LIKE` operator and `CASE` statements (see [Case Statements](https://www.w3schools.com/sql/sql_case.asp)).