# Data in Databases

Sometimes data is not stored in one file or a handful of files, but as a series of connected tables in a relational database. In this notebook we will demonstrate how you can use Python to interact with such data sources.

## What we will accomplish

In this notebook we will:
- Define the concept of a relational database,
- Introduce the structured query language (SQL),
- Show how to access data in a database with SQLAlchemy and
- Show how to read a database with pandas.

## What is a relational database

A relational database is a collection of tables or data sets that are related to one another through a series of key values.

As an example, consider all the data involved with operating a business that sells physical goods. For simplicity, let's say that there are two sets of data of which the business owner wants to keep track: purchases and customers. A relational database would store such data in a Purchases Table and a Customer Table. Each row of the purchases table would contain all of the data associated with an individual purchase, including the customer that made the purchase and a unique identifier for the purchase. Similarly, each row of the customer table would contain all the data for a unique customer, including a unique identifier code. Each purchase can then be linked to the data for the customer who made it through these unique keys. The purchases table will contain a column holding the id of the customer that made the purchase, like so:
<img src="database_example.png" width="70%"></img>

Linking data in this way gives you the ability to store and query the data in an efficient manner to answer questions like:
- Are men more likely to buy item A than women?
- Who should we target our advertising dollars to for product X?
- What groups should receive coupon emails for items similar to item Y?
- etc.
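
The linkage described above can be sketched in a few lines. This is only an illustration, not the database used later in this notebook: it builds a throwaway in-memory database with Python's built-in `sqlite3` module, and the table contents (names, ids, prices) are made up.

```python
import sqlite3

# Build a throwaway in-memory database; every row below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        pretax_price REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO purchases VALUES (101, 2, 9.50), (102, 1, 20.00), (103, 2, 4.25);
    """
)

# The shared customer_id key lets us look up who made each purchase.
linked = conn.execute(
    """
    SELECT purchases.purchase_id, customers.name, purchases.pretax_price
    FROM purchases
    JOIN customers ON purchases.customer_id = customers.customer_id
    ORDER BY purchases.purchase_id
    """
).fetchall()
conn.close()

for row in linked:
    print(row)
```

Each printed row pairs a purchase with the name stored once in the customers table, which is the point of keeping the two tables separate and linking them by key.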

You may have to interact with a database in many data project settings. Let's learn how we can use Python to extract the data we need from a database.

## Structured query language (SQL)

While I just said we will learn how to use Python to interact with databases, we first have to take a detour into SQL. SQL, or structured query language, is a programming language whose purpose is to submit <i>queries</i> to databases. You can use SQL queries to:
- Create databases and tables,
- Enter data into tables,
- Remove data from tables,
- Delete tables and databases,
- Retrieve data that meets your specifications and
- More.

We will focus on learning the syntax to retrieve data in this notebook, and will touch on some of the other common SQL tasks in the accompanying `Practice Problems` notebook.

### SELECTing data

In SQL, you retrieve data from a table within a database with a `SELECT` statement. The syntax of a `SELECT` statement is as follows:
<blockquote>
    SELECT * FROM table_name WHERE conditional_statement
</blockquote>
Here:

- `SELECT` informs the database that you want to retrieve some data,
- The ` * ` portion is where you specify the precise columns you would like returned. If you want all of the columns returned, you input a ` * `,
- `FROM table_name` tells the database what table you would like to get data from and
- `WHERE conditional_statement` is an optional clause you can include if you only want entries that satisfy a certain conditional statement.

This is the template for the most basic SQL `SELECT` statement you can make. In the accompanying `Practice Problems` notebook we touch on more complicated `SELECT` statements that include `JOIN` statements allowing you to cross-reference multiple tables.
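
To make the template concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `pets` table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# A small in-memory table; the table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT, species TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [("Milo", "cat", 3), ("Rex", "dog", 7), ("Luna", "cat", 9)],
)

# SELECT * returns every column of every row.
all_rows = conn.execute("SELECT * FROM pets").fetchall()

# Naming columns and adding WHERE narrows both the columns and the rows.
older_cats = conn.execute(
    "SELECT name, age FROM pets WHERE species = 'cat' AND age > 5"
).fetchall()
conn.close()

print(all_rows)
print(older_cats)
```

The first query returns all three rows with all three columns, while the second returns only the `name` and `age` of the rows that satisfy the condition.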

## `SQLAlchemy`

The Python package that will allow us to execute a SQL `SELECT` command with Python is `SQLAlchemy`, <a href="https://www.sqlalchemy.org/">https://www.sqlalchemy.org/</a>. Let's check that you have it installed first.

In [1]:
## Make sure this runs first
## This way we can make sure that you have it installed
import sqlalchemy

In [2]:
## what version do you have?
## I had 1.4.29 when I wrote this notebook
print(sqlalchemy.__version__)

1.4.29


If you were unable to import `SQLAlchemy` and check your package version, you will need to install the package before moving forward in this notebook. For installation guides see:
- Via conda, <a href="https://anaconda.org/anaconda/sqlalchemy">https://anaconda.org/anaconda/sqlalchemy</a>,
- Via pip, <a href="https://docs.sqlalchemy.org/en/14/intro.html#installation-guide">https://docs.sqlalchemy.org/en/14/intro.html#installation-guide</a>.

### Submitting SQL queries with `SQLAlchemy`

`SQLAlchemy` works by establishing a connection to a database and then allowing you to submit SQL queries to that connected database. We will demonstrate that process now with the `cat_store.db` database in this folder. This database contains two tables, a `customers` table and a `purchases` table.

There is a specific procedure you have to follow in order to use `SQLAlchemy`, which we will go through right now.

#### Creating an engine

In [3]:
## The first step is to create an engine
## The sqlalchemy engine is how we 
## communicate with the database
## docs: https://docs.sqlalchemy.org/en/14/core/engines.html
from sqlalchemy import create_engine

In [4]:
## When we create the engine we give it a database URL.
## The part before the colon is the Dialect, this is the
## backend language of the database. For us this is SQLite,
## which is what I used to create the database.
## The part after sqlite:/// is the path to the database file.
## Because cat_store.db is stored in this folder,
## the file name alone is enough.
engine = create_engine("sqlite:///cat_store.db")
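
If it helps to see how the string handed to `create_engine` breaks down, the sketch below uses SQLAlchemy's `make_url` helper to split a URL into its parts without touching any database. The absolute path in the second URL is hypothetical.

```python
from sqlalchemy.engine.url import make_url

# Three slashes: a path relative to the current working directory.
relative = make_url("sqlite:///cat_store.db")

# Four slashes: an absolute path (this particular path is hypothetical).
absolute = make_url("sqlite:////tmp/example/cat_store.db")

# drivername is the dialect, database is the file location.
print(relative.drivername, relative.database)
print(absolute.drivername, absolute.database)
```

For other database systems the URL would also carry a user name, host, and port, but for a local SQLite file the dialect and the path are all that is needed.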

#### Connect to the database

With an engine in place we can connect to the database.

In [5]:
## next we have to actually connect the engine
## to the database
conn = engine.connect()

#### Submitting queries

Now that we are connected we can submit queries to the database.

In [6]:
## we'll use this to display the data nicely
import pandas as pd

In [8]:
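## Write the SQL statement inside a string,
## wrap it in sqlalchemy.text(), then place it in conn.execute
## (SQLAlchemy 2.0 no longer accepts a plain string here)
results = conn.execute(sqlalchemy.text("SELECT * FROM purchases"))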

## To print all the results of the query you can use
## fetchall()
pd.DataFrame(results.fetchall(),
                 columns = results.keys())

## note here, results.keys() returns the columns of the table


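|   | purchase_id | customer_id | number_of_items | pretax_price | purchase_type |
|---|---|---|---|---|---|
| 0 | 1 | 3 | 4 | 18.9 | credit |
| 1 | 2 | 2 | 2 | 22.2 | cash |
| 2 | 3 | 7 | 1 | 7.89 | debit |
| 3 | 4 | 1 | 11 | 109.89 | check |
| 4 | 5 | 4 | 3 | 33.3 | cash |
| 5 | 6 | 9 | 2 | 10.99 | debit |
| 6 | 7 | 5 | 4 | 39.9 | credit |
| 7 | 8 | 8 | 6 | 71.89 | check |
| 8 | 9 | 6 | 20 | 209.89 | cash |
| 9 | 10 | 4 | 3 | 17.54 | cash |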


In [9]:
results = conn.execute(sqlalchemy.text("SELECT * FROM purchases"))

## there is also fetchone
## which returns a tuple corresponding to the next
## returned row (here, the first one)
results.fetchone()

(1, 3, 4, 18.9, 'credit')

In [10]:
## and fetchmany(n)
## which returns the next n sequential returned rows
results.fetchmany(4)

[(2, 2, 2, 22.2, 'cash'),
 (3, 7, 1, 7.89, 'debit'),
 (4, 1, 11, 109.89, 'check'),
 (5, 4, 3, 33.3, 'cash')]

Note that when using `fetchone` or `fetchmany` the results are returned sequentially: each call picks up where the previous one left off. Let's check that you understand that concept now.

##### Practice

What do you expect to be returned with the following code chunk?

In [11]:
results.fetchone()

(6, 9, 2, 10.99, 'debit')

What about this code chunk?

In [12]:
results.fetchmany(4)

[(7, 5, 4, 39.9, 'credit'),
 (8, 8, 6, 71.89, 'check'),
 (9, 6, 20, 209.89, 'cash'),
 (10, 4, 3, 17.54, 'cash')]
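
The sequential behavior can be reproduced on a tiny example, including what happens once the rows run out. This is a minimal sketch that assumes an in-memory SQLite database and a hypothetical five-row `numbers` table rather than `cat_store.db`.

```python
from sqlalchemy import create_engine, text

# An in-memory SQLite database with a hypothetical five-row table.
engine = create_engine("sqlite://")
with engine.connect() as conn:
    conn.execute(text("CREATE TABLE numbers (n INTEGER)"))
    conn.execute(text("INSERT INTO numbers VALUES (1), (2), (3), (4), (5)"))

    results = conn.execute(text("SELECT n FROM numbers ORDER BY n"))
    first = tuple(results.fetchone())                    # consumes row 1
    next_two = [tuple(r) for r in results.fetchmany(2)]  # consumes rows 2 and 3
    rest = [tuple(r) for r in results.fetchall()]        # whatever is left
    after_end = results.fetchone()                       # nothing left to return
engine.dispose()

print(first, next_two, rest, after_end)
```

Once every row has been consumed, `fetchone` returns `None` instead of raising an error, so to read the rows again you have to execute the query again.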

#### Using SQL for basic stats

We can also use SQL to compute basic statistics of numeric columns like the mean, max, min and how many observations there are. Let's demonstrate how.

In [15]:
## COUNT
## This gives you how many rows were matched by your query
results = conn.execute(sqlalchemy.text("SELECT COUNT(*) FROM purchases"))

results.fetchall()

[(20,)]

In [16]:
## MAX
## This gives the maximum value of the specified column
results = conn.execute(sqlalchemy.text("SELECT MAX(pretax_price) FROM purchases"))

results.fetchall()

[(209.89,)]

In [17]:
## MIN
## This gives the minimum value of the specified column
results = conn.execute(sqlalchemy.text("SELECT MIN(pretax_price) FROM purchases"))

results.fetchall()

[(0.99,)]

In [18]:
## AVG
## This gives the mean value of the specified column
results = conn.execute(sqlalchemy.text("SELECT AVG(pretax_price) FROM purchases"))

results.fetchall()

[(46.4955,)]
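
The four statistics above do not need four separate queries. As a sketch, the example below asks for all of them at once and uses `AS` to give each result column a readable name. It assumes an in-memory SQLite database with a hypothetical `sales` table, not the `purchases` table from `cat_store.db`.

```python
from sqlalchemy import create_engine, text

# In-memory database with a hypothetical three-row table.
engine = create_engine("sqlite://")
with engine.connect() as conn:
    conn.execute(text("CREATE TABLE sales (price REAL)"))
    conn.execute(text("INSERT INTO sales VALUES (2.0), (4.0), (9.0)"))

    # One SELECT can compute several aggregates; AS names each output column.
    result = conn.execute(
        text(
            "SELECT COUNT(*) AS n, MIN(price) AS lowest, "
            "MAX(price) AS highest, AVG(price) AS mean FROM sales"
        )
    )
    column_names = list(result.keys())
    stats = tuple(result.fetchone())
engine.dispose()

print(dict(zip(column_names, stats)))
```

Because an aggregate query like this collapses the table to a single row, one `fetchone` call retrieves everything.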

#### Using `pandas` with `SQLAlchemy`

While you can create a DataFrame like we did above with one of the `fetch` commands, you can also use the `pandas` package directly.

##### `pandas.read_sql_query`

The command `pandas.read_sql_query` allows you to write a SQL query and have it executed via `pandas`.

In [19]:
## First write your SQL query as a string
## Then give your SQLAlchemy connection object
## docs: https://pandas.pydata.org/docs/reference/api/pandas.read_sql_query.html
pd.read_sql_query("SELECT * FROM customers", conn)

|   | customer_id | name | age | years_customer | email | phone |
|---|---|---|---|---|---|---|
| 0 | 1 | Mike Evans | 34 | 2 | mik.evans@yahoo.com | 323-333-4545 |
| 1 | 2 | Francine Frensky | 22 | 1 | arthurfan@gmail.com | 339-300-4453 |
| 2 | 3 | Melanie PBody | 16 | 0 | mel@aol.com | 222-506-9040 |
| 3 | 4 | Mark Ruffalo | 40 | 7 | hulk@marvel.com | 899-334-2980 |
| 4 | 6 | Richard Frank | 25 | 3 | letsbefrank@hotmail.com | 849-333-1223 |
| 5 | 7 | Olivia Olive | 19 | 2 | oliveyou@hotmail.com | 588-309-8593 |
| 6 | 8 | Frances Paris | 23 | 5 | iseelondon@gmail.com | 543-222-3958 |
| 7 | 9 | Paul London | 54 | 11 | bakeitwork@gmail.com | 853-200-4930 |
| 8 | 10 | Jenny Gump | 65 | 20 | runforrestrun@aol.com | 883-234-5504 |


##### `pandas.read_sql_table`

You can also just read a table directly into a DataFrame with `pandas` using `pandas.read_sql_table`.

In [20]:
## First write the table name
## Then give your SQLAlchemy connection object
## docs: https://pandas.pydata.org/docs/reference/api/pandas.read_sql_table.html#pandas.read_sql_table
pd.read_sql_table("customers", conn)

|   | customer_id | name | age | years_customer | email | phone |
|---|---|---|---|---|---|---|
| 0 | 1 | Mike Evans | 34 | 2 | mik.evans@yahoo.com | 323-333-4545 |
| 1 | 2 | Francine Frensky | 22 | 1 | arthurfan@gmail.com | 339-300-4453 |
| 2 | 3 | Melanie PBody | 16 | 0 | mel@aol.com | 222-506-9040 |
| 3 | 4 | Mark Ruffalo | 40 | 7 | hulk@marvel.com | 899-334-2980 |
| 4 | 6 | Richard Frank | 25 | 3 | letsbefrank@hotmail.com | 849-333-1223 |
| 5 | 7 | Olivia Olive | 19 | 2 | oliveyou@hotmail.com | 588-309-8593 |
| 6 | 8 | Frances Paris | 23 | 5 | iseelondon@gmail.com | 543-222-3958 |
| 7 | 9 | Paul London | 54 | 11 | bakeitwork@gmail.com | 853-200-4930 |
| 8 | 10 | Jenny Gump | 65 | 20 | runforrestrun@aol.com | 883-234-5504 |
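
`read_sql_table` always loads the whole table, whereas `read_sql_query` lets the database do the filtering before anything reaches `pandas`. The sketch below illustrates that difference in spirit. It assumes an in-memory SQLite database with a hypothetical three-row `customers` table, and a `pandas` version that is compatible with the installed `SQLAlchemy`.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory database; the rows are invented for illustration.
engine = create_engine("sqlite://")
with engine.connect() as conn:
    conn.execute(
        text("CREATE TABLE customers (customer_id INTEGER, name TEXT, age INTEGER)")
    )
    conn.execute(
        text(
            "INSERT INTO customers VALUES "
            "(1, 'Ada', 36), (2, 'Grace', 28), (3, 'Edsger', 52)"
        )
    )

    # Only the requested columns and matching rows are loaded into the DataFrame.
    over_30 = pd.read_sql_query(
        "SELECT name, age FROM customers WHERE age > 30 ORDER BY age", conn
    )
engine.dispose()

print(over_30)
```

For a table this small the difference is cosmetic, but for a large table selecting only the rows and columns you need can save a good deal of memory.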


#### Closing the connection

When you are done submitting queries you need to close the connection to the database.

In [21]:
## When we're done we close the connection
conn.close()

#### Disposing of the engine

Once you are completely done with the database, you dispose of the engine. This releases any connections the engine is still holding on to.

In [22]:
## then dispose the engine
engine.dispose()
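
If you would rather not remember to call `close` yourself, a connection can also be used in a `with` block. A minimal sketch, assuming an in-memory SQLite database so that it runs without `cat_store.db`:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")

# Leaving the with-block closes the connection, even if a query raises an error.
with engine.connect() as conn:
    answer = conn.execute(text("SELECT 1 + 1")).scalar()

# The connection object still exists, but it is now closed.
is_closed = conn.closed

# The engine is disposed of explicitly, as before.
engine.dispose()

print(answer, is_closed)
```

The explicit `conn.close()` shown earlier does the same job; the `with` form simply makes it harder to forget.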

## Summary

In this notebook we introduced the concept of a relational database and discussed how you can access data stored within one. The content presented here is just a start. If you are interested in more complicated database commands, please go through the accompanying notebook stored in the `Practice Problems` folder of this repository.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).