### SQL and Pandas Data Frames

- Pandas can read/write SQL databases to/from data frames
- Works with many databases
- SQLite3 support is built-in

First, import pandas and sqlite3

In [None]:
import pandas as pd
import sqlite3

Let's see what's in our directory

In [None]:
!ls *.db

### Reading Data Frame from SQL

First, you need a database connection. Pandas doesn't read the database file directly; it needs a connection object.

In [None]:
play_conn = sqlite3.connect('play.db')
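A connection stays open until it is closed explicitly. As a minimal sketch, assuming an in-memory database and a made-up table `t` rather than `play.db`, `contextlib.closing` from the standard library can close the connection automatically once the data has been read:

```python
import sqlite3
from contextlib import closing

import pandas as pd

# ":memory:" and the table t are assumptions for this sketch;
# the notebook itself connects to 'play.db'.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("create table t (x integer)")
    conn.execute("insert into t values (1), (2)")
    df = pd.read_sql("select * from t", conn)

# The data frame is an in-memory copy, so it is still usable
# after the connection has been closed.
print(df)
```

This is optional in a notebook, where keeping one connection open across cells is often more convenient.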

Pandas can now issue SQL queries to that connection and create a **DataFrame**

In [68]:
pd.read_sql('select * from playwrights',play_conn)

Unnamed: 0,first_name,last_name,year_of_birth,year_of_death
0,William,Shakespeare,1564,1616
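Queries are not limited to `select *`. The sketch below, which assumes an in-memory copy of the table with one extra illustrative row, passes a value through the `params` argument of `read_sql` so that it is bound to the `?` placeholder instead of being pasted into the SQL string:

```python
import sqlite3

import pandas as pd

# In-memory stand-in for play.db (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table playwrights "
    "(first_name text, last_name text, year_of_birth integer, year_of_death integer)"
)
conn.executemany(
    "insert into playwrights values (?, ?, ?, ?)",
    [
        ("William", "Shakespeare", 1564, 1616),
        ("Henrik", "Ibsen", 1828, 1906),  # extra row added for illustration
    ],
)

# sqlite3 uses ? as its placeholder; params supplies the value to bind.
early = pd.read_sql(
    "select * from playwrights where year_of_birth < ?", conn, params=(1600,)
)
print(early)
conn.close()
```

Other database drivers may use a different placeholder style, so the `?` here is specific to SQLite.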


And these are Data Frames like any other. We can get their info:

In [69]:
p = pd.read_sql('select * from playwrights',play_conn)
p.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
first_name       1 non-null object
last_name        1 non-null object
year_of_birth    1 non-null int64
year_of_death    1 non-null int64
dtypes: int64(2), object(2)
memory usage: 40.0+ bytes


Or add a column:

In [70]:
p['age'] = p['year_of_death'] - p['year_of_birth']

In [71]:
p

Unnamed: 0,first_name,last_name,year_of_birth,year_of_death,age
0,William,Shakespeare,1564,1616,52


But it's a copy of the data: changing the data frame does not change the underlying database.

In [72]:
pd.read_sql('select * from playwrights',play_conn)

Unnamed: 0,first_name,last_name,year_of_birth,year_of_death
0,William,Shakespeare,1564,1616


This should not be surprising; a data frame read from a CSV file behaves the same way. To write the new column back to the database, we'll use `to_sql`:

In [73]:
p.to_sql('playwrights', play_conn)

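ValueError: Table 'playwrights' already exists.

By default, `to_sql` refuses to overwrite an existing table. Write the data frame to a new table, `playwrights_age`, instead:

In [None]:
p.to_sql('playwrights_age', play_conn)

In [None]:
pd.read_sql('select * from playwrights_age', play_conn)

In [None]:
play_conn.close()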
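The `if_exists` argument of `to_sql` controls what happens when the target table already exists. The following sketch, which assumes a throwaway in-memory database and a hypothetical table `t`, walks through the three accepted values: `'fail'` (the default), `'append'`, and `'replace'`.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")   # throwaway database for this sketch
df = pd.DataFrame({"x": [1, 2]})     # hypothetical data

df.to_sql("t", conn, index=False)    # first write creates the table

try:
    df.to_sql("t", conn, index=False)  # default is if_exists='fail'
    raised = False
except ValueError:
    raised = True

# 'append' adds the rows to the existing table.
df.to_sql("t", conn, index=False, if_exists="append")
n_after_append = len(pd.read_sql("select * from t", conn))

# 'replace' drops the table and writes it again from scratch.
df.to_sql("t", conn, index=False, if_exists="replace")
n_after_replace = len(pd.read_sql("select * from t", conn))

print(raised, n_after_append, n_after_replace)
conn.close()
```

Note that `'replace'` discards whatever the table held before, so it is best reserved for tables you can regenerate.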

## Interoperability with CSV

Start with a data frame, e.g. from CSV:

In [None]:
long_data = pd.read_csv('long_data.csv')

In [None]:
long_data.info()

In [None]:
long_data[0:5]

We can take this CSV data and write it to a database.
Again, we first create a connection.

In [None]:
long_data_conn = sqlite3.connect('long_data.db')
long_data.to_sql('long_data',long_data_conn, if_exists='replace')


Let's read that back to see how it compares

In [49]:
pd.read_sql("select * from long_data", long_data_conn)

Unnamed: 0,index,Dilution,Analyte,FI-Bkgd,FI-Bkgd-Neg,CV,Participant ID,Visit Code,Visit Date,Sample Type,Buffer
0,0,50,p24 (19),474.8,454.800000,0.0372,URN2,0,10/14/1899,PLA,PBS
1,1,50,gp41 (44),470.8,452.800000,0.1387,URN2,0,10/14/1899,PLA,PBS
2,2,50,Con 6 gp120/B (72),52.5,44.500000,0.1183,URN2,0,10/14/1899,PLA,PBS
3,3,50,B.con.env03 140 CF (65),55.5,46.500000,0.1709,URN2,0,10/14/1899,PLA,PBS
4,4,50,Blank (53),29.0,,0.0527,URN2,0,10/14/1899,PLA,PBS
5,5,50,Con S gp140 CFI (3),82.0,62.000000,0.1799,URN2,0,10/14/1899,PLA,PBS
6,6,50,p31 (50),474.4,455.400000,0.0885,URN2,0,10/14/1899,PLA,PBS
7,7,50,p66 (RT) (42),69.4,50.400000,0.0527,URN2,0,10/14/1899,PLA,PBS
8,8,50,MulVgp70_His6 (49),205.4,,0.0861,URN2,0,10/14/1899,PLA,PBS
9,9,50,gp70_B.CaseA_V1_V2 (12),40.5,-64.766667,0.0615,URN2,0,10/14/1899,PLA,PBS
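The extra `index` column in the output above appears because `to_sql` writes the data frame's index as a column by default. A minimal sketch, assuming a two-row stand-in for `long_data` and an in-memory database, shows how `index=False` leaves it out:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # in-memory database for this sketch
# Two-row stand-in with only two of the columns from long_data.
df = pd.DataFrame({"Dilution": [50, 50], "CV": [0.0372, 0.1387]})

df.to_sql("with_index", conn)                  # index=True is the default
df.to_sql("without_index", conn, index=False)  # do not store the index

with_index = pd.read_sql("select * from with_index", conn)
without_index = pd.read_sql("select * from without_index", conn)

print(list(with_index.columns))     # ['index', 'Dilution', 'CV']
print(list(without_index.columns))  # ['Dilution', 'CV']
conn.close()
```

Whether to keep the index depends on the data: a default 0..n-1 index carries no information, while a meaningful one (dates, IDs) is usually worth storing.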


## Exercise: Filter and export data

Write a new table containing just the long_data rows with the following analytes:

- **p31 (50)**
- **p24 (19)**

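Hint: There is more than one way to do this, depending on which selections you choose to concatenate, or how you filter.

In [66]:
# Select the rows for each analyte, then stack the two selections.
p31 = long_data[long_data['Analyte'] == 'p31 (50)']
p24 = long_data[long_data['Analyte'] == 'p24 (19)']
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
subset = pd.concat([p31, p24])
subset

In [None]:
# Write the subset to its own table, then close the connection.
long_data_conn = sqlite3.connect('long_data.db')
subset.to_sql('long_data_subset',long_data_conn)
long_data_conn.close()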
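Two other possible approaches are sketched below, assuming a small stand-in for `long_data` with only the `Analyte` and `CV` columns: filtering once in pandas with `isin`, or loading the full table and letting SQL filter with `where ... in`.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for long_data with only the columns needed here.
long_data = pd.DataFrame({
    "Analyte": ["p24 (19)", "gp41 (44)", "p31 (50)", "Blank (53)"],
    "CV": [0.0372, 0.1387, 0.0885, 0.0527],
})
wanted = ["p31 (50)", "p24 (19)"]

# Option 1: one boolean mask in pandas covering both analytes.
subset_pandas = long_data[long_data["Analyte"].isin(wanted)]

# Option 2: write everything to SQLite and filter in the query.
conn = sqlite3.connect(":memory:")
long_data.to_sql("long_data", conn, index=False)
subset_sql = pd.read_sql(
    "select * from long_data where Analyte in (?, ?)", conn, params=wanted
)
conn.close()

print(subset_pandas)
print(subset_sql)
```

Unlike concatenating two separate selections, `isin` keeps the rows in their original order.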