### SQL and Pandas Data Frames

- Pandas can read SQL query results into data frames and write data frames to SQL tables
- Works with many databases (through SQLAlchemy)
- SQLite support works directly with Python's built-in `sqlite3` module
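
As a rough sketch of the round trip described above, the snippet below writes a small data frame to SQLite and reads it back. The in-memory database, the `toy` table, and its contents are made up for illustration and are not part of the course data.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, so no file is created (illustrative only).
conn = sqlite3.connect(':memory:')

# Hypothetical toy data, not taken from cd4.db.
toy = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})

# Write the data frame to a table, then read it back with a SQL query.
toy.to_sql('toy', conn, index=False)
round_trip = pd.read_sql('SELECT * FROM toy ORDER BY name', conn)
print(round_trip)

conn.close()
```

Passing `index=False` keeps the data frame index out of the table; without it, pandas stores the index as an extra column.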

Getting Started:

1. Create a new notebook in the directory with the `cd4.db` file
2. Import `pandas` and `sqlite3`

In [1]:
import pandas as pd
import sqlite3

Let's see what's in our directory:

In [2]:
%ls

01-intro-relational-databases.ipynb
02-relational-databases-and-dataframes.ipynb
README.md
cd4.db
long_data.db
long_data_cleaned.csv
long_data_cleaned.db


### Reading Data Frame from SQL

First, you need a database connection. Pandas doesn't read the database file directly; it needs a connection object.

In [3]:
conn = sqlite3.connect('cd4.db')

Pandas can now issue SQL queries over that connection and build a **DataFrame** from the result.

We know the database has a `cd4` table, so we can select from it and order the rows by name.

In [4]:
pd.read_sql("SELECT * FROM cd4 ORDER BY name", conn, index_col='name')

name,cd4_baseline,cd4_followup
Jane,364,448.0
Jill,836,
Joe,2117,1959.0
John,815,792.0


Note that the SQL NULL in Jill's `cd4_followup` has become NaN in the data frame.
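
The NULL-to-NaN conversion can be reproduced on a throwaway table. This is only a sketch: the in-memory database, the `demo` table, and its rows are hypothetical and unrelated to `cd4.db`.

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database with one NULL value (illustrative only).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo (name TEXT, score INTEGER)')
conn.executemany('INSERT INTO demo VALUES (?, ?)', [('x', 10), ('y', None)])

# The row inserted with Python None is stored as SQL NULL.
demo = pd.read_sql('SELECT * FROM demo ORDER BY name', conn)
print(demo)

# isna() flags the missing value that came from the NULL.
missing = demo['score'].isna().tolist()
print(missing)

conn.close()
```

Because NaN is a floating-point value, a numeric column that contains NULLs generally comes back as floats, which would explain why `cd4_followup` shows values such as `448.0`.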

These are data frames like any other, so we can get their info or describe them:

In [5]:
cd4 = pd.read_sql('select * from cd4',conn)

In [6]:
print(cd4.info())  # info() prints its report and returns None, hence the trailing "None"
cd4.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
name            4 non-null object
cd4_baseline    4 non-null float64
cd4_followup    3 non-null float64
dtypes: float64(2), object(1)
memory usage: 128.0+ bytes
None


,cd4_baseline,cd4_followup
count,4.0,3.0
mean,1033.0,1066.333333
std,754.751615,791.974958
min,364.0,448.0
25%,702.25,620.0
50%,825.5,792.0
75%,1156.25,1375.5
max,2117.0,1959.0


Or add a column:

In [7]:
cd4['diff'] = cd4['cd4_baseline'] - cd4['cd4_followup']
cd4

,name,cd4_baseline,cd4_followup,diff
0,Jane,364,448.0,-84.0
1,Jill,836,,
2,Joe,2117,1959.0,158.0
3,John,815,792.0,23.0


But the data frame is a copy of the data in the database: changing the data frame does not change the underlying database.

In [8]:
pd.read_sql('select * from cd4',conn)

,name,cd4_baseline,cd4_followup
0,Jane,364,448.0
1,Jill,836,
2,Joe,2117,1959.0
3,John,815,792.0


This should not be surprising; a data frame read from a CSV file behaves the same way.
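
A minimal sketch of this copy behavior, again using a made-up in-memory `toy` table rather than `cd4.db`: a column added to the data frame only reaches the database once the frame is explicitly written back with `to_sql`.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table used only for this illustration.
conn = sqlite3.connect(':memory:')
pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]}).to_sql('toy', conn, index=False)

frame = pd.read_sql('SELECT * FROM toy', conn)
frame['doubled'] = frame['value'] * 2  # changes only the in-memory copy

# The table in the database still has its original columns.
cols_before = pd.read_sql('SELECT * FROM toy', conn).columns.tolist()

# Writing the frame back replaces the table with the modified data.
frame.to_sql('toy', conn, if_exists='replace', index=False)
cols_after = pd.read_sql('SELECT * FROM toy', conn).columns.tolist()

print(cols_before)
print(cols_after)

conn.close()
```

Here `if_exists='replace'` drops and recreates the table, so it is a blunt tool; it is used only to show that nothing changes in the database until a write is requested.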

### Exercise: Custom SQL to Data Frame

Create a data frame from the cd4 database using pd.read_sql:

1. with rows ordered by **cd4_baseline** (ascending)
2. with only the **name** and **cd4_baseline** columns
3. with all columns, adding the **diff** column as `cd4_baseline - cd4_followup`

In [9]:
# Starting query
pd.read_sql('select *, cd4_baseline - cd4_followup as diff from cd4 order by cd4_baseline asc', conn)

,name,cd4_baseline,cd4_followup,diff
0,Jane,364,448.0,-84.0
1,John,815,792.0,23.0
2,Jill,836,,
3,Joe,2117,1959.0,158.0


In [10]:
conn.close()

## Interoperability with CSV

Start with a data frame, e.g. from CSV:

In [11]:
long_data = pd.read_csv('long_data_cleaned.csv', index_col=0)
long_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1760 entries, 0 to 1760
Data columns (total 11 columns):
dilution          1760 non-null int64
analyte           1760 non-null object
fi-bkgd           1760 non-null float64
fi-bkgd-neg       1408 non-null float64
cv                1760 non-null float64
participant_id    1760 non-null object
visit_code        1760 non-null int64
visit_date        1760 non-null object
sample_type       1760 non-null object
buffer            1760 non-null object
bead_number       1760 non-null int64
dtypes: float64(3), int64(3), object(5)
memory usage: 165.0+ KB


In [12]:
long_data.head()

,dilution,analyte,fi-bkgd,fi-bkgd-neg,cv,participant_id,visit_code,visit_date,sample_type,buffer,bead_number
0,50,p24,474.8,454.8,0.0372,URN2,0,10/14/1899,PLA,PBS,19
1,50,gp41,470.8,452.8,0.1387,URN2,0,10/14/1899,PLA,PBS,44
2,50,Con 6 gp120/B,52.5,44.5,0.1183,URN2,0,10/14/1899,PLA,PBS,72
3,50,B.con.env03 140 CF,55.5,46.5,0.1709,URN2,0,10/14/1899,PLA,PBS,65
4,50,Blank,29.0,,0.0527,URN2,0,10/14/1899,PLA,PBS,53


We can take this CSV data and write it to a database.
Again, start by creating a connection.

In [13]:
conn = sqlite3.connect('long_data_cleaned.db')
long_data.to_sql('long_data_cleaned', conn, if_exists='replace')
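
The `if_exists` argument controls what `to_sql` does when the table is already there. The sketch below compares the three modes on a hypothetical in-memory `toy` table; the table name and values are invented for illustration.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')
toy = pd.DataFrame({'analyte': ['p24', 'p31'], 'value': [1.0, 2.0]})

# First write creates the table.
toy.to_sql('toy', conn, index=False)

# Default mode is if_exists='fail': a second write raises ValueError.
try:
    toy.to_sql('toy', conn, index=False)
    failed = False
except ValueError:
    failed = True

# 'append' adds the rows to the existing table.
toy.to_sql('toy', conn, if_exists='append', index=False)
n_after_append = len(pd.read_sql('SELECT * FROM toy', conn))

# 'replace' drops the table and writes it again from scratch.
toy.to_sql('toy', conn, if_exists='replace', index=False)
n_after_replace = len(pd.read_sql('SELECT * FROM toy', conn))

print(failed, n_after_append, n_after_replace)

conn.close()
```

Using `'replace'`, as in the cell above, makes the notebook safe to re-run, at the cost of discarding whatever the table held before.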


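Let's read that back to see how it compares. By default, `to_sql` stored the data frame index in a column named `index`, so we pass `index_col='index'` to restore it.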

In [14]:
pd.read_sql('select * from long_data_cleaned', conn, index_col='index')

index,dilution,analyte,fi-bkgd,fi-bkgd-neg,cv,participant_id,visit_code,visit_date,sample_type,buffer,bead_number
0,50,p24,474.8,454.800000,0.0372,URN2,0,10/14/1899,PLA,PBS,19
1,50,gp41,470.8,452.800000,0.1387,URN2,0,10/14/1899,PLA,PBS,44
2,50,Con 6 gp120/B,52.5,44.500000,0.1183,URN2,0,10/14/1899,PLA,PBS,72
3,50,B.con.env03 140 CF,55.5,46.500000,0.1709,URN2,0,10/14/1899,PLA,PBS,65
4,50,Blank,29.0,,0.0527,URN2,0,10/14/1899,PLA,PBS,53
5,50,Con S gp140 CFI,82.0,62.000000,0.1799,URN2,0,10/14/1899,PLA,PBS,3
6,50,p31,474.4,455.400000,0.0885,URN2,0,10/14/1899,PLA,PBS,50
7,50,p66 (RT),69.4,50.400000,0.0527,URN2,0,10/14/1899,PLA,PBS,42
8,50,MulVgp70_His6,205.4,,0.0861,URN2,0,10/14/1899,PLA,PBS,49
9,50,gp70_B.CaseA_V1_V2,40.5,-64.766667,0.0615,URN2,0,10/14/1899,PLA,PBS,12


## Exercise: Filter and export data

Use pandas and the to_sql method to:

1. Write a table containing all columns from the data frame, but only rows for the **p31** analyte
2. Write a table containing rows with **p31** analyte but only the following columns:
    - analyte
    - fi-bkgd
3. Append the rows for the **p24** analyte to the table from step 2
