### SQL and Pandas Data Frames

- Pandas can read/write SQL databases to/from data frames
- Works with many databases
- SQLite3 support is built-in

Getting Started:

1. Create a new notebook in the directory with the `cd4.db` file
2. Import `pandas` and `sqlite3`

In [15]:
import pandas as pd
import sqlite3

Let's see what's in our directory

In [None]:
%ls 

### Reading Data Frame from SQL

First, you need to get a database connection. Pandas doesn't read the database file directly; it needs a connection object.

In [None]:
# conn = sqlite3.connect()

In [16]:
conn = sqlite3.connect('cd4.db')

Pandas can now issue SQL queries to that connection and create a **DataFrame**

We know the database has a `cd4` table, so we can select from it and order the rows by name:
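
If you don't already know which tables a SQLite database contains, you can ask the database itself. The sketch below queries SQLite's built-in `sqlite_master` catalog. It uses an in-memory database with an empty, made-up `cd4` table as a stand-in for `cd4.db`, so the column definitions here are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for cd4.db (hypothetical schema)
conn = sqlite3.connect(':memory:')
conn.execute('create table cd4 (name text, cd4_baseline real, cd4_followup real)')
conn.commit()

# sqlite_master is SQLite's built-in catalog of tables, indexes and views
tables = pd.read_sql("select name from sqlite_master where type = 'table'", conn)
table_names = tables['name'].tolist()
print(table_names)

conn.close()
```

The same query should work against the `conn` opened on `cd4.db` above.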

In [None]:
# pd.read_sql(query, conn)

In [None]:
pd.read_sql('select * from cd4 order by name asc', conn)

Notice that SQL `NULL` values have become `NaN` in the data frame.
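
To see the `NULL` handling in isolation, here is a small sketch on an in-memory database. The two rows and their values are invented for illustration and are not taken from `cd4.db`.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for cd4.db with made-up rows
conn = sqlite3.connect(':memory:')
conn.execute('create table cd4 (name text, cd4_baseline real, cd4_followup real)')
conn.executemany(
    'insert into cd4 values (?, ?, ?)',
    [('subject_a', 350.0, 410.0), ('subject_b', 280.0, None)],  # None is stored as NULL
)
conn.commit()

df = pd.read_sql('select * from cd4 order by name asc', conn)

# isna() flags the value that was NULL in the database
missing = df['cd4_followup'].isna()
print(df)
print(missing.tolist())

conn.close()
```

Because missing values arrive as `NaN`, the usual pandas tools such as `isna()`, `dropna()` and `fillna()` apply to them.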

And these are Data Frames like any other. We can get their info or describe them:

In [21]:
cd4 = pd.read_sql('select * from cd4',conn)

In [None]:
cd4.info()  # info() prints its summary itself and returns None
cd4.describe()

Or add a column:

In [None]:
# diff = cd4_baseline - cd4_followup

In [25]:
cd4['diff'] = cd4['cd4_baseline'] - cd4['cd4_followup']

In [None]:
cd4

But the data frame is a copy of the data in the database. Changing the data frame does not change the underlying database:

In [None]:
pd.read_sql('select * from cd4',conn)

This should not be surprising: a data frame read from CSV behaves the same way.
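
The sketch below shows the same behavior end to end on an in-memory database with one made-up row. It also shows that writing the change back is a separate, explicit step; here that is done with `to_sql` and `if_exists='replace'`, which is one option among several rather than the only way to do it.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for cd4.db with a single made-up row
conn = sqlite3.connect(':memory:')
conn.execute('create table cd4 (name text, cd4_baseline real, cd4_followup real)')
conn.execute("insert into cd4 values ('subject_a', 350.0, 410.0)")
conn.commit()

cd4 = pd.read_sql('select * from cd4', conn)
cd4['diff'] = cd4['cd4_baseline'] - cd4['cd4_followup']

# The table itself still has only its original three columns
columns_before = pd.read_sql('select * from cd4', conn).columns.tolist()

# Writing the frame back replaces the table, so the new column is now stored
cd4.to_sql('cd4', conn, if_exists='replace', index=False)
columns_after = pd.read_sql('select * from cd4', conn).columns.tolist()

print(columns_before)
print(columns_after)
conn.close()
```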

### Exercise: Custom SQL to Data Frame

Create a data frame from the cd4 database using pd.read_sql:

1. with rows ordered by **cd4_baseline** (ascending)
2. with only the **name** and **cd4_baseline** columns
3. with all columns, adding the **diff** column as `cd4_baseline - cd4_followup`

In [None]:
# Starting query
pd.read_sql('select * from cd4', conn)

In [None]:
conn.close()

## Interoperability with CSV

Start with a data frame, e.g. from CSV:

In [7]:
# pd.read_csv(filename)
# Anything weird?

In [None]:
long_data = pd.read_csv('long_data_cleaned.csv', index_col=0)
long_data

In [None]:
long_data.info()

In [None]:
long_data[0:5]

We can take this CSV data and write it to a database.
Again, create a connection first.

In [None]:
# sqlite3.connect()
# df.to_sql(table, conn)


In [45]:
long_data_conn = sqlite3.connect('long_data.db')
long_data.to_sql('long_data',long_data_conn, if_exists='replace')

Let's read that back to see how it compares

In [None]:
# pd.read_sql

In [None]:
pd.read_sql("select * from long_data", long_data_conn, index_col='index')
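
The round trip can be sketched on a tiny frame to show where the `index` column comes from. The frame below is a hypothetical stand-in for `long_data`: the `value` column and all of the numbers are made up. When a data frame's index has no name, `to_sql` stores it in a column called `index`, which is why `index_col='index'` restores it on the way back.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for long_data; column names and values are illustrative
frame = pd.DataFrame(
    {'analyte': ['p24', 'p31', 'p31'], 'value': [1.5, 2.0, 3.25]},
    index=[10, 11, 12],
)

conn = sqlite3.connect(':memory:')
frame.to_sql('long_data', conn, if_exists='replace')

# Without index_col, the stored index comes back as an ordinary column
raw = pd.read_sql('select * from long_data', conn)

# With index_col, it becomes the index of the data frame again
restored = pd.read_sql('select * from long_data', conn, index_col='index')
conn.close()

print(raw.columns.tolist())
print(restored.index.tolist())
```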

### Exercise: Filter and export data

Use pandas and the to_sql method to:

1. Write a table containing all columns from the data frame, but only rows for the **p31** analyte
2. Write a table containing rows with **p31** analyte but only the following columns:
    - analyte
    - fi-bkgd
3. Append data for the **p24** analyte to the table from step 2


In [None]:
p31 = long_data[long_data['analyte'] == 'p31']
p24 = long_data[long_data['analyte'] == 'p24']
subset = pd.concat([p31, p24])  # DataFrame.append was removed in pandas 2.0

In [None]:
long_data_conn = sqlite3.connect('long_data.db')
subset.to_sql('long_data_subset',long_data_conn, if_exists='replace')
long_data_conn.close()
subset
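
The cells above combine the two analytes in pandas and write the result once. Steps 2 and 3 can also be done as two separate writes with `if_exists='append'`. The sketch below is one possible approach, not the only one; it uses an in-memory database and a small made-up frame in place of `long_data`, and the `gp41` row and all values are invented.

```python
import sqlite3
import pandas as pd

# Made-up stand-in for long_data with just the two columns from step 2
long_data = pd.DataFrame({
    'analyte': ['p24', 'p31', 'p31', 'gp41'],
    'fi-bkgd': [120.0, 85.5, 90.0, 40.0],
})

columns = ['analyte', 'fi-bkgd']
p31 = long_data.loc[long_data['analyte'] == 'p31', columns]
p24 = long_data.loc[long_data['analyte'] == 'p24', columns]

conn = sqlite3.connect(':memory:')

# Step 2: create (or overwrite) the table with the p31 rows
p31.to_sql('long_data_subset', conn, if_exists='replace')

# Step 3: add the p24 rows to the existing table
p24.to_sql('long_data_subset', conn, if_exists='append')

result = pd.read_sql('select * from long_data_subset', conn, index_col='index')
conn.close()

print(result)
```

Note that `if_exists='append'` adds rows every time the cell runs, so re-running it without the `replace` step first would duplicate the p24 rows.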