### SQL and Pandas Data Frames

- Pandas can read SQL query results into data frames and write data frames to SQL tables
- Works with many database systems
- SQLite support is built in, through Python's standard `sqlite3` module

First, import pandas and sqlite3:

In [1]:
import pandas as pd
import sqlite3

Let's see what's in our directory

In [2]:
!ls *.db

cd4.db


### Reading Data Frame from SQL

First, you need a database connection. Pandas doesn't read the database file directly; it needs a connection object.

In [21]:
conn = sqlite3.connect('cd4.db')

Pandas can now issue SQL queries over that connection and return the result as a **DataFrame**:

In [22]:
pd.read_sql('select * from cd4 order by name',conn)

|   | name | cd4_baseline | cd4_followup |
|---|------|--------------|--------------|
| 0 | Jane | 364 | 448.0 |
| 1 | Jill | 836 | NaN |
| 2 | Joe | 2117 | 1959.0 |
| 3 | John | 815 | 792.0 |


Note that the SQL NULL in Jill's `cd4_followup` has become NaN.
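The NULL-to-NaN conversion can be reproduced in isolation. The following is a minimal sketch that assumes a throwaway in-memory database rather than `cd4.db`; the table layout is a guess at what the real table looks like, and only two of the rows shown above are inserted.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for cd4.db (schema is an assumption)
conn = sqlite3.connect(':memory:')
conn.execute(
    'create table cd4 (name text, cd4_baseline integer, cd4_followup integer)'
)
conn.executemany(
    'insert into cd4 values (?, ?, ?)',
    [('Jane', 364, 448), ('Jill', 836, None)],  # Python None is stored as NULL
)

df = pd.read_sql('select * from cd4 order by name', conn)
conn.close()

# Jill's missing follow-up comes back as NaN
print(df['cd4_followup'].isna().tolist())
print(df.dtypes)
```

Printing `df.dtypes` is a quick way to check how pandas represented a column that contains missing values.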

And these are Data Frames like any other. We can get their info or describe them:

In [23]:
cd4 = pd.read_sql('select * from cd4',conn)
cd4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
name            4 non-null object
cd4_baseline    4 non-null float64
cd4_followup    3 non-null float64
dtypes: float64(2), object(1)
memory usage: 128.0+ bytes


Or add a column:

In [24]:
cd4['diff'] = cd4['cd4_baseline'] - cd4['cd4_followup']

In [25]:
cd4

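|   | name | cd4_baseline | cd4_followup | diff |
|---|------|--------------|--------------|------|
| 0 | Jane | 364 | 448.0 | -84.0 |
| 1 | Jill | 836 | NaN | NaN |
| 2 | Joe | 2117 | 1959.0 | 158.0 |
| 3 | John | 815 | 792.0 | 23.0 |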


But the data frame is a copy of the data: changing the data frame does not change the underlying database.

In [26]:
pd.read_sql('select * from cd4',conn)

|   | name | cd4_baseline | cd4_followup |
|---|------|--------------|--------------|
| 0 | Jane | 364 | 448.0 |
| 1 | Jill | 836 | NaN |
| 2 | Joe | 2117 | 1959.0 |
| 3 | John | 815 | 792.0 |
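The copy semantics can be checked with a small self-contained sketch. It assumes an in-memory database and a made-up `doubled` column, neither of which is part of the notebook's data.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for cd4.db, seeded with two rows
conn = sqlite3.connect(':memory:')
seed = pd.DataFrame({'name': ['Jane', 'Joe'], 'cd4_baseline': [364, 2117]})
seed.to_sql('cd4', conn, index=False)

local = pd.read_sql('select * from cd4', conn)
local['doubled'] = local['cd4_baseline'] * 2  # changes only the in-memory copy

fresh = pd.read_sql('select * from cd4', conn)  # re-query the database
conn.close()

print(list(local.columns))  # includes 'doubled'
print(list(fresh.columns))  # the table itself is unchanged
```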


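This should not be surprising; a data frame read from CSV behaves the same way. To save the new column to the database, we'll use `to_sql`:

In [18]:
cd4.to_sql('cd4', conn)

ValueError: Table 'cd4' already exists.

By default, `to_sql` refuses to overwrite an existing table (`if_exists='fail'`). Write the data frame to a new table instead, then read it back:

In [None]:
cd4.to_sql('cd4_diff', conn, if_exists='replace')
pd.read_sql('select * from cd4_diff', conn)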

conn.close()
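The three values of the `if_exists` parameter can be compared side by side. This is a sketch against an in-memory database, with a small made-up table rather than the real `cd4` data.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # hypothetical stand-in database
df = pd.DataFrame({'name': ['Jane', 'Jill'], 'cd4_baseline': [364, 836]})
df.to_sql('cd4', conn, index=False)  # first write creates the table


def row_count(connection):
    """Count the rows currently stored in the cd4 table."""
    return int(pd.read_sql('select count(*) as n from cd4', connection)['n'][0])


# Default is if_exists='fail': a second write to the same table is refused
try:
    df.to_sql('cd4', conn, index=False)
    failed = False
except ValueError as err:
    failed = True
    print('refused:', err)

df.to_sql('cd4', conn, index=False, if_exists='append')   # adds rows
n_after_append = row_count(conn)

df.to_sql('cd4', conn, index=False, if_exists='replace')  # drops and rewrites
n_after_replace = row_count(conn)
conn.close()

print(n_after_append, n_after_replace)
```

`'replace'` discards whatever was in the table, so `'append'` or a new table name is the safer choice when the existing rows matter.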

## Interoperability with CSV

Start with a data frame, e.g. from CSV:

In [34]:
long_data = pd.read_csv('long_data_cleaned.csv', index_col=0)

In [35]:
long_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1760 entries, 0 to 1760
Data columns (total 11 columns):
dilution          1760 non-null int64
analyte           1760 non-null object
fi-bkgd           1760 non-null float64
fi-bkgd-neg       1408 non-null float64
cv                1760 non-null float64
participant_id    1760 non-null object
visit_code        1760 non-null int64
visit_date        1760 non-null object
sample_type       1760 non-null object
buffer            1760 non-null object
bead_number       1760 non-null int64
dtypes: float64(3), int64(3), object(5)
memory usage: 165.0+ KB


In [36]:
long_data[0:5]

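|   | dilution | analyte | fi-bkgd | fi-bkgd-neg | cv | participant_id | visit_code | visit_date | sample_type | buffer | bead_number |
|---|----------|---------|---------|-------------|----|----------------|------------|------------|-------------|--------|-------------|
| 0 | 50 | p24 | 474.8 | 454.8 | 0.0372 | URN2 | 0 | 10/14/1899 | PLA | PBS | 19 |
| 1 | 50 | gp41 | 470.8 | 452.8 | 0.1387 | URN2 | 0 | 10/14/1899 | PLA | PBS | 44 |
| 2 | 50 | Con 6 gp120/B | 52.5 | 44.5 | 0.1183 | URN2 | 0 | 10/14/1899 | PLA | PBS | 72 |
| 3 | 50 | B.con.env03 140 CF | 55.5 | 46.5 | 0.1709 | URN2 | 0 | 10/14/1899 | PLA | PBS | 65 |
| 4 | 50 | Blank | 29.0 | NaN | 0.0527 | URN2 | 0 | 10/14/1899 | PLA | PBS | 53 |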


We can take this CSV data and write it to a database. Again, start by creating a connection:

In [37]:
long_data_conn = sqlite3.connect('long_data.db')
long_data.to_sql('long_data',long_data_conn, if_exists='replace')


Let's read that back to see how it compares

In [41]:
pd.read_sql("select * from long_data", long_data_conn)

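|   | index | dilution | analyte | fi-bkgd | fi-bkgd-neg | cv | participant_id | visit_code | visit_date | sample_type | buffer | bead_number |
|---|-------|----------|---------|---------|-------------|----|----------------|------------|------------|-------------|--------|-------------|
| 0 | 0 | 50 | p24 | 474.8 | 454.800000 | 0.0372 | URN2 | 0 | 10/14/1899 | PLA | PBS | 19 |
| 1 | 1 | 50 | gp41 | 470.8 | 452.800000 | 0.1387 | URN2 | 0 | 10/14/1899 | PLA | PBS | 44 |
| 2 | 2 | 50 | Con 6 gp120/B | 52.5 | 44.500000 | 0.1183 | URN2 | 0 | 10/14/1899 | PLA | PBS | 72 |
| 3 | 3 | 50 | B.con.env03 140 CF | 55.5 | 46.500000 | 0.1709 | URN2 | 0 | 10/14/1899 | PLA | PBS | 65 |
| 4 | 4 | 50 | Blank | 29.0 | NaN | 0.0527 | URN2 | 0 | 10/14/1899 | PLA | PBS | 53 |
| 5 | 5 | 50 | Con S gp140 CFI | 82.0 | 62.000000 | 0.1799 | URN2 | 0 | 10/14/1899 | PLA | PBS | 3 |
| 6 | 6 | 50 | p31 | 474.4 | 455.400000 | 0.0885 | URN2 | 0 | 10/14/1899 | PLA | PBS | 50 |
| 7 | 7 | 50 | p66 (RT) | 69.4 | 50.400000 | 0.0527 | URN2 | 0 | 10/14/1899 | PLA | PBS | 42 |
| 8 | 8 | 50 | MulVgp70_His6 | 205.4 | NaN | 0.0861 | URN2 | 0 | 10/14/1899 | PLA | PBS | 49 |
| 9 | 9 | 50 | gp70_B.CaseA_V1_V2 | 40.5 | -64.766667 | 0.0615 | URN2 | 0 | 10/14/1899 | PLA | PBS | 12 |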
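The extra `index` column in the result appears because `to_sql` stores the data frame's index as an ordinary column by default. A sketch of two ways to handle that, using a tiny made-up frame and an in-memory database (the table names `with_index` and `no_index` are illustrative only):

```python
import sqlite3
import pandas as pd

# Small illustrative frame; values borrowed from the first two rows above
df = pd.DataFrame({'analyte': ['p24', 'gp41'], 'cv': [0.0372, 0.1387]})
conn = sqlite3.connect(':memory:')

df.to_sql('with_index', conn)             # index saved as a column named 'index'
df.to_sql('no_index', conn, index=False)  # index not saved at all

with_index = pd.read_sql('select * from with_index', conn)
no_index = pd.read_sql('select * from no_index', conn)
# Or keep the stored index and turn it back into the frame's index on read
restored = pd.read_sql('select * from with_index', conn, index_col='index')
conn.close()

print(list(with_index.columns))
print(list(no_index.columns))
print(list(restored.columns), restored.index.tolist())
```

Passing `index=False` is usually the simpler option when the index is just the default row numbering; `index_col` is more useful when the index carries meaning.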


## Exercise: Filter and export data

Write a new table containing just the long_data rows with the following analytes:

- **p31**
- **p24**

Hint: There is more than one way to do this, depending on whether you filter each analyte separately and combine the results, or filter for both at once.

In [46]:
p31 = long_data[long_data['analyte'] == 'p31']
p24 = long_data[long_data['analyte'] == 'p24']
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
subset = pd.concat([p31, p24])

long_data_conn = sqlite3.connect('long_data.db')
subset.to_sql('long_data_subset', long_data_conn, if_exists='replace')
long_data_conn.close()
subset


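|   | dilution | analyte | fi-bkgd | fi-bkgd-neg | cv | participant_id | visit_code | visit_date | sample_type | buffer | bead_number |
|---|----------|---------|---------|-------------|----|----------------|------------|------------|-------------|--------|-------------|
| 6 | 50 | p31 | 474.4 | 455.400000 | 0.0885 | URN2 | 0 | 10/14/1899 | PLA | PBS | 50 |
| 16 | 50 | p31 | 227.4 | 222.544444 | 0.0026 | URN2 | 0 | 10/14/1899 | PLA | CIT | 50 |
| 26 | 50 | p31 | 494.4 | 440.400000 | 0.0701 | URN2 | 9 | 01/04/1901 | PLA | PBS | 50 |
| 36 | 50 | p31 | 242.4 | 222.800000 | 0.7263 | URN2 | 9 | 01/04/1901 | PLA | CIT | 50 |
| 46 | 50 | p31 | 580.4 | 504.400000 | 0.0367 | URN2 | 8 | 12/30/1900 | PLA | PBS | 50 |
| 56 | 50 | p31 | 248.8 | 242.800000 | 0.0957 | URN2 | 8 | 12/30/1900 | PLA | CIT | 50 |
| 66 | 50 | p31 | 424.4 | 264.044444 | 0.0180 | URN2 | 7 | 12/27/1900 | PLA | PBS | 50 |
| 76 | 50 | p31 | 74.5 | 65.744444 | 0.2797 | URN2 | 7 | 12/27/1900 | PLA | CIT | 50 |
| 86 | 50 | p31 | 250.0 | 204.000000 | 0.0595 | URN2 | 6 | 12/23/1900 | PLA | PBS | 50 |
| 96 | 50 | p31 | 40.0 | 45.500000 | 0.1107 | URN2 | 6 | 12/23/1900 | PLA | CIT | 50 |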
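Two other ways to approach the exercise are sketched here: filtering once in pandas with `isin`, or letting the database do the filtering with a parameterized `where ... in` query. The four-row frame is a made-up stand-in for `long_data`, and the in-memory database replaces `long_data.db`.

```python
import sqlite3
import pandas as pd

# Tiny stand-in for long_data (only two of its columns, four rows)
long_data = pd.DataFrame({
    'analyte': ['p24', 'gp41', 'p31', 'Blank'],
    'cv': [0.0372, 0.1387, 0.0885, 0.0527],
})
wanted = ['p31', 'p24']

# Option 1: one boolean filter in pandas
subset = long_data[long_data['analyte'].isin(wanted)]

# Option 2: filter in SQL, passing the values as query parameters
conn = sqlite3.connect(':memory:')
long_data.to_sql('long_data', conn, index=False)
placeholders = ', '.join('?' for _ in wanted)  # one '?' per wanted analyte
from_sql = pd.read_sql(
    f'select * from long_data where analyte in ({placeholders})',
    conn,
    params=wanted,
)
conn.close()

print(subset)
print(from_sql)
```

Filtering in SQL avoids loading rows that will be thrown away, which matters more as the table grows; for a frame already in memory, `isin` is the shorter route.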
