# Augmenting Pandas With SQLite

So far, we've explored a few different ways we can work with medium-sized data sets in pandas. First, we learned how to reduce a dataframe's memory footprint by selecting the optimal data types for each column. Then, we discussed how to work with dataframe chunks and modify our processing logic. In this mission, we'll explore how to augment pandas with SQLite.

While pandas stores and works with data in memory, a database tool like SQLite can represent data on disk. This means that while pandas is limited by the amount of available memory (usually a few gigabytes), **SQLite is limited only by the amount of available disk space** (usually hundreds of gigabytes to a few terabytes). This difference is even more pronounced when working with servers in the cloud, because extra disk storage is much cheaper than extra memory. With a [maximum supported file size](https://www.sqlite.org/limits.html) of well over 100 terabytes (the exact figure depends on the SQLite version), we can store large data sets in a SQLite database, write SQL queries to extract a subset of the data we want to work with, and use pandas to explore, analyze, and visualize the subset.

We'll continue to work with the data set on MOMA Exhibitions from the first two missions in this course, which you can read more about and download at [data.world](https://data.world/moma/exhibitions). First, we'll create a new SQLite database file and load the entire data set into a single SQLite table. Let's work **under the constraint that we have very limited memory**, so we need to read the data set in chunks and append each chunk to a table in SQLite.

One way to do this would be to iterate over each line in a CSV file, manually parse each row, and insert it into SQLite using insert statements. Instead of doing that, let's read chunks of the file in as dataframes and use the `DataFrame.to_sql()` method to append the rows in each dataframe to a SQLite database table. At a minimum, we need to specify the table we want to add the rows to and pass in the SQLite connection object:

```python
>>> import sqlite3
>>> import pandas as pd
>>> conn = sqlite3.connect('test.db')
>>> df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
>>> df.to_sql('test', conn)
```

By default, the `DataFrame.to_sql()` method won't append to an existing table. The `if_exists` parameter defaults to `'fail'`, so the method raises a `ValueError` if the table already exists. When exporting chunked dataframes to SQLite, we need to set the `if_exists` parameter to `'append'` to tell the method to append multiple chunks to the same table. Finally, we need to set the `index` parameter to `False` so that it doesn't add the index values from the dataframes to the SQLite database table.
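To make the effect of `if_exists` concrete, here is a minimal sketch on a throwaway in-memory database. The table name `test` and the toy dataframe are illustrative assumptions, not part of the MOMA data set.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway database, nothing is written to disk
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})

# The first call creates the table.
df.to_sql('test', conn, index=False)

# A second call with the default if_exists='fail' raises ValueError.
try:
    df.to_sql('test', conn, index=False)
    second_write_failed = False
except ValueError:
    second_write_failed = True

# With if_exists='append', the rows are added to the existing table.
df.to_sql('test', conn, if_exists='append', index=False)

row_count = conn.execute('SELECT COUNT(*) FROM test').fetchone()[0]
columns = [row[1] for row in conn.execute('PRAGMA table_info(test)')]
print(second_write_failed, row_count, columns)
conn.close()
```

Because `index=False` is passed on every call, the table only contains the dataframe's own columns.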

* Connect to the `moma.db` SQLite database.
* Create an iterator using `pandas.read_csv()` that will process chunks of `1000` rows from `moma.csv` at a time. Assign this iterator to `moma_iter`.
* Use `moma_iter` to read in chunks of `1000` rows into a dataframe.
* Append each dataframe chunk to the `exhibitions` table in `moma.db` without including the index values.

```python
import pandas as pd
import sqlite3

conn = sqlite3.connect('moma.db')
moma_iter = pd.read_csv('data/moma.csv', chunksize=1000)

# Append each 1000-row chunk to the exhibitions table, leaving out the index.
for chunk in moma_iter:
    chunk.to_sql('exhibitions', conn, if_exists='append', index=False)
```

## Pandas Types vs. SQLite Types

When we use the `DataFrame.to_sql()` method to add rows to a SQLite database, pandas automatically converts the dataframe's data types to the equivalent SQLite data types. You can find a list of the SQLite data types in the [documentation](https://www.sqlite.org/datatype3.html), which we've recreated below. You'll notice that **SQLite doesn't have special data types for representing datetime, boolean, or categorical values**.

Type|Description
---|---
NULL|The value is a NULL value
INTEGER|The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes, depending on the magnitude of the value
REAL|The value is a floating point value, stored as an 8-byte IEEE floating point number
TEXT|The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE)
BLOB|The value is a blob of data, stored exactly as it was entered

When we export rows from a pandas dataframe to SQLite using the `DataFrame.to_sql()` method, pandas chooses the closest SQLite data type for each column.
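As a rough illustration of that mapping, the sketch below writes a small dataframe to an in-memory database and reads back the declared column types. The table name `types_demo` and the column names are made up for this example; the boolean column is included to show where a type with no dedicated SQLite equivalent ends up.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')
df = pd.DataFrame({
    'an_int': [1, 2],
    'a_float': [0.5, 1.5],
    'a_string': ['x', 'y'],
    'a_bool': [True, False],
})
df.to_sql('types_demo', conn, index=False)

# PRAGMA table_info returns one row per column, including its declared type.
info = pd.read_sql('PRAGMA table_info(types_demo);', conn)
declared = dict(zip(info['name'], info['type']))
print(declared)
conn.close()
```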

We can use the `pandas.read_sql()` function to query a SQLite database. The function parses the results into a pandas dataframe and returns them in that structure.

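```python
>>> import sqlite3
>>> import pandas as pd
>>> conn = sqlite3.connect('test.db')
>>> df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
>>> df.to_sql('test', conn, if_exists='replace')  # overwrite any earlier 'test' table
>>> pd.read_sql('select A from test', conn)
   A
0  0
1  1
2  2
```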



* Use the `pandas.read_sql()` function to query `moma.db` and return the column types for the `exhibitions` table. Assign the resulting dataframe to `results_df`.
* Display `results_df` using the `print()` function.

```python
query = '''
        PRAGMA table_info(exhibitions);
'''
results_df = pd.read_sql(query, conn)
print(results_df)
```

```
    cid                    name     type  notnull dflt_value  pk
0     0            ExhibitionID  INTEGER        0       None   0
1     1        ExhibitionNumber     TEXT        0       None   0
2     2         ExhibitionTitle     TEXT        0       None   0
3     3  ExhibitionCitationDate     TEXT        0       None   0
4     4     ExhibitionBeginDate     TEXT        0       None   0
5     5       ExhibitionEndDate     TEXT        0       None   0
6     6     ExhibitionSortOrder  INTEGER        0       None   0
7     7           ExhibitionURL     TEXT        0       None   0
8     8          ExhibitionRole     TEXT        0       None   0
9     9           ConstituentID     REAL        0       None   0
10   10         ConstituentType     TEXT        0       None   0
11   11             DisplayName     TEXT        0       None   0
12   12               AlphaSort     TEXT        0       None   0
13   13               FirstName     TEXT        0       None   0
14   14              Midd
```

## Setting Appropriate Types

The SQLite database consumes **11.1 megabytes on disk**, which closely matches the amount of disk space the original CSV file consumes. Let's see if we can reduce the amount of disk space the SQLite database requires by optimizing the data types of the columns in the `exhibitions` table.

If you'll recall from the first mission in this course, we had to clean the data a bit to be able to convert some of the text columns to a numeric type. Unfortunately, data cleaning in SQLite can be cumbersome for many tasks. Because we're reading in values to pandas first anyway, let's focus our type optimization efforts there. **Selecting the correct types in SQLite reduces the disk footprint of the database file, and can make some SQLite operations faster**.
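A disk footprint like the one quoted above can be checked from Python with `os.path.getsize()`. The sketch below is an illustration on a temporary database rather than on `moma.db`; the helper name `db_size_mb`, the file name `demo.db`, and the toy table are all assumptions made for the example.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def db_size_mb(path):
    """Return the size of the file at `path` in megabytes."""
    return os.path.getsize(path) / (1024 ** 2)


with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, 'demo.db')
    conn = sqlite3.connect(db_path)
    pd.DataFrame({'n': range(10_000)}).to_sql('numbers', conn, index=False)
    conn.close()  # close the connection before measuring the file
    size_mb = db_size_mb(db_path)

print(f'{size_mb:.3f} MB')
```

Calling the same helper on `'moma.db'` before and after a change to the column types would show whether the change actually affected the file size.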

* For each dataframe chunk, convert the `ExhibitionSortOrder` column to the `int16` data type using the `pandas.to_numeric()` function.
* Use the `pandas.read_sql()` function to query `moma.db` and return the column types for the `exhibitions` table. Assign the resulting dataframe to `results_df`.
* Display `results_df` using the `print()` function.

```python
# The iterator from the earlier step is exhausted, so start over:
# drop the old table and re-create the chunk iterator.
conn.execute('DROP TABLE IF EXISTS exhibitions')
moma_iter = pd.read_csv('data/moma.csv', chunksize=1000)

for chunk in moma_iter:
    # Alternative: pd.to_numeric(chunk['ExhibitionSortOrder'], downcast='integer')
    chunk['ExhibitionSortOrder'] = chunk['ExhibitionSortOrder'].astype('int16')
    chunk.to_sql('exhibitions', conn, if_exists='append', index=False)

results_df = pd.read_sql('PRAGMA table_info(exhibitions);', conn)
print(results_df)
```

```
    cid                    name     type  notnull dflt_value  pk
0     0            ExhibitionID  INTEGER        0       None   0
1     1        ExhibitionNumber     TEXT        0       None   0
2     2         ExhibitionTitle     TEXT        0       None   0
3     3  ExhibitionCitationDate     TEXT        0       None   0
4     4     ExhibitionBeginDate     TEXT        0       None   0
5     5       ExhibitionEndDate     TEXT        0       None   0
6     6     ExhibitionSortOrder  INTEGER        0       None   0
7     7           ExhibitionURL     TEXT        0       None   0
8     8          ExhibitionRole     TEXT        0       None   0
9     9           ConstituentID     REAL        0       None   0
10   10         ConstituentType     TEXT        0       None   0
11   11             DisplayName     TEXT        0       None   0
12   12               AlphaSort     TEXT        0       None   0
13   13               FirstName     TEXT        0       None   0
14   14              Midd
```

## Computing Primarily in SQL

Generating a pandas dataframe from our result set unlocks a few different workflows for us. One involves doing most of the computation in SQL, then parsing the results as a dataframe. Another workflow involves doing the data selection with SQL, but the iterative exploration and analysis in pandas. Each workflow has its own set of trade-offs, which we'll explore in greater depth in this mission. We'll start by performing all of the processing in SQL itself.

* Query the `exhibitions` table in `moma.db` to return both the unique values in the `ExhibitionID` column and the counts **in descending order by the counts** as a dataframe. Here's how the dataframe should be formatted:


 |ExhibitionID|counts
---|---|---
0|NaN|429
1|7.0|321

* Assign the resulting dataframe to `eid_counts` and display the first 10 rows.

```python
# GROUP BY already yields one row per ExhibitionID, so DISTINCT isn't needed.
query = '''
        SELECT ExhibitionID,
               COUNT(*) AS counts
        FROM exhibitions
        GROUP BY ExhibitionID
        ORDER BY counts DESC
'''

eid_counts = pd.read_sql(query, conn)
eid_counts.head(10)
```

 |ExhibitionID|counts
---|---|---
0|NaN|429
1|7.0|321
2|3838.0|302
3|3030.0|284
4|3988.0|275
5|2600.0|262
6|79.0|259
7|10601.0|256
8|3939.0|254
9|3036.0|244


## Computing Primarily in Pandas

In the last step, we expressed the entire computation in SQL. Because this was a small task, the results came back relatively quickly. **SQLite will take much longer for larger computations because it's not operating in memory**. We can treat SQLite primarily as an archival data store instead, and only use SQL to return subsets of the data. **If we restrict the size of the subsets and ensure the results can fit into memory as a dataframe, we can move all of our heavier computations to pandas**.
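As a small sketch of restricting the subset in SQL before handing it to pandas, the example below filters rows with a parameterized `WHERE` clause and counts values in pandas afterwards. The table `toy` and the role values `'Artist'` and `'Curator'` are invented for illustration; only the column names come from the `exhibitions` table.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'ExhibitionID': [1, 1, 2, 3, 3, 3],
    'ExhibitionRole': ['Artist', 'Curator', 'Artist', 'Artist', 'Artist', 'Curator'],
}).to_sql('toy', conn, index=False)

# Let SQL pick the rows and the single column we need; `?` is sqlite3's placeholder.
subset = pd.read_sql(
    'SELECT ExhibitionID FROM toy WHERE ExhibitionRole = ?',
    conn,
    params=('Artist',),
)

# The heavier analysis then runs in memory on the smaller dataframe.
counts = subset['ExhibitionID'].value_counts()
print(counts)
conn.close()
```

Selecting only the needed columns and rows keeps the resulting dataframe small, which is the point of using SQLite as the data store and pandas as the compute layer.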

Pandas has several advantages over SQLite. 
* First, pandas has a large suite of functions and methods for performing common operations. 
* It also has a diverse type system we can use to save space and improve code running speed. 
* Finally, pandas works in memory and will be much quicker for most tasks.

Let's rewrite the task we performed in the last step using a mix of pandas and SQLite.

* From the `exhibitions` table in `moma.db`, return the `ExhibitionID` column as a dataframe.
* Calculate the unique value counts of the `ExhibitionID` column in pandas, and assign them to `eid_pandas_counts`. Display the first 10 rows.

```python
query = '''
        SELECT ExhibitionID FROM exhibitions;
'''
eid_pandas_counts = pd.read_sql(query, conn).iloc[:, 0].value_counts()
eid_pandas_counts.head(10)
```

```
7.0        321
3838.0     302
3030.0     284
3988.0     275
2600.0     262
79.0       259
10601.0    256
3939.0     254
3036.0     244
2749.0     225
Name: ExhibitionID, dtype: int64
```
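Note that the pandas result has no row for the 429 missing `ExhibitionID` values that the SQL version reported. SQL's `GROUP BY` puts `NULL` values in a group of their own, while `Series.value_counts()` drops missing values unless `dropna=False` is passed. A minimal sketch on a toy table (the table name `toy` and its values are assumptions for illustration):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')
pd.DataFrame({'ExhibitionID': [7.0, 7.0, 79.0, None, None, None]}).to_sql(
    'toy', conn, index=False)

# SQL: NULLs form their own group and are counted.
sql_counts = pd.read_sql(
    '''
    SELECT ExhibitionID, COUNT(*) AS counts
    FROM toy
    GROUP BY ExhibitionID
    ORDER BY counts DESC
    ''',
    conn,
)

ids = pd.read_sql('SELECT ExhibitionID FROM toy', conn)['ExhibitionID']
default_counts = ids.value_counts()            # missing values are dropped
with_missing = ids.value_counts(dropna=False)  # missing values are counted

print(sql_counts)
print(default_counts)
print(with_missing)
conn.close()
```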

## Reading in SQL Results Using Chunks

The techniques we've learned so far in this mission are useful whenever the dataframe representing the results of a SQL query fits in memory. When we're working with a data set that consumes multiple terabytes on disk as a SQLite database file, we may find ourselves wanting to explore a subset of the data in pandas, but lacking sufficient memory to read the subset in as a dataframe. In cases like these, we can read the results in as dataframe chunks and then batch process the chunks, just like we did earlier in this course.

```python
q = 'select exhibitionid from exhibitions;'
chunk_iter = pd.read_sql(q, conn, chunksize=100)
for chunk in chunk_iter:
    # Process each chunk here; printing its shape is just a placeholder.
    print(chunk.shape)
```

You'll often find yourself working on a data science team that has databases containing millions of rows. While you can perform most of the computation in SQL itself, pandas (and the Python ecosystem more generally) has richer support for mathematical operations and data visualization. Querying the data in SQL and working with batches of the result set will help you make the most of SQL and pandas.
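When counting values chunk by chunk, the same value can appear in several chunks, so the per-chunk counts have to be added together at the end. Here is a minimal sketch of that pattern on a toy table; the table name `toy`, its values, and the chunk size of 3 are assumptions chosen to keep the example small.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')
pd.DataFrame({'ExhibitionID': [7, 79, 7, 7, 79, 3838, 7]}).to_sql(
    'toy', conn, index=False)

# Count values within each chunk separately.
partial_counts = []
for chunk in pd.read_sql('SELECT ExhibitionID FROM toy;', conn, chunksize=3):
    partial_counts.append(chunk['ExhibitionID'].value_counts())

# After concatenation the same ID can appear once per chunk,
# so group by the index and sum to get the overall totals.
combined = pd.concat(partial_counts)
totals = combined.groupby(combined.index).sum().sort_values(ascending=False)
print(totals)
conn.close()
```

Only the small per-chunk count series are kept in memory, never the full result set.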

* Use this step to experiment with different chunk sizes, and observe their effects on runtime.

```python
%%timeit
q = 'SELECT ExhibitionID FROM exhibitions;'
chunk_iter = pd.read_sql(q, conn, chunksize=100)
eid_pandas_counts = []

for chunk in chunk_iter:
    eid_pandas_counts.append(chunk.iloc[:,0].value_counts())

eid_pandas_counts = pd.concat(eid_pandas_counts)
eid_pandas_counts_sum = eid_pandas_counts.groupby(eid_pandas_counts.index).sum().sort_values(ascending=False)
eid_pandas_counts_sum.head(10)
```

```
572 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
%%timeit
q = 'SELECT ExhibitionID FROM exhibitions;'
chunk_iter = pd.read_sql(q, conn, chunksize=500)
eid_pandas_counts = []

for chunk in chunk_iter:
    eid_pandas_counts.append(chunk.iloc[:,0].value_counts())

eid_pandas_counts = pd.concat(eid_pandas_counts)
eid_pandas_counts_sum = eid_pandas_counts.groupby(eid_pandas_counts.index).sum().sort_values(ascending=False)
eid_pandas_counts_sum.head(10)
```

```
170 ms ± 8.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
%%timeit
q = 'SELECT ExhibitionID FROM exhibitions;'
chunk_iter = pd.read_sql(q, conn, chunksize=1000)
eid_pandas_counts = []

for chunk in chunk_iter:
    eid_pandas_counts.append(chunk.iloc[:,0].value_counts())

eid_pandas_counts = pd.concat(eid_pandas_counts)
eid_pandas_counts_sum = eid_pandas_counts.groupby(eid_pandas_counts.index).sum().sort_values(ascending=False)
eid_pandas_counts_sum.head(10)
```

```
106 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```


## Next Steps

In this mission, we explored how to augment our pandas workflow with SQLite to handle larger data sets. We learned that while SQLite can represent larger data sets on disk, it's slower for most computations. In the next mission, we'll learn the basics of building data pipelines to operationalize our data processing work.