##### CSCI 303
# Introduction to Data Science
<p/>

### 12 - Working with SQL Databases (1)

![Relational database icon](sql.png)

## This Lecture
---
- Relational database basic concepts
- Basic SQL retrieval queries
- Connecting to SQL databases from Python

## Databases
---
At the most general level, a *database* is simply a store of data together with some mechanism for retrieving and modifying the data.

The dominant database model since about 1980 has been the *relational* model.

SQL is a standardized language for manipulating and retrieving data from relational databases.

While new database types are cropping up, especially in a Big Data context, many frameworks for working with Big Data use variants of SQL or concepts borrowed from SQL (joining, grouping, and so on).

An awful lot of data lives in databases; you need to know at least a little SQL to get at this data!

## The Relational Database
---
We won't get too heavy into the theory.

Basically, data is stored in tables, also called *relations*.

Tables contain rows of data, sometimes called *tuples*.

Each row of data is organized into the same set of named columns, or *attributes*. 

A table named **employees** might look like this:


| name    | age | salary |
|---------|-----|--------|
|Laura    | 43  | 102760 |
|Shashi   | 49  |  83010 |
|Raluca   | 33  |  95500 |

A relational database might contain many named tables (and other objects which behave just like tables).

**Note:** Rows in a table are not ordered in any fashion.

In fact, the order can change from one query to the next (although it usually doesn't unless the data has also changed).

You can impose an order on the data by sorting on one or more columns (a topic for next time).

## SQL
---
SQL is the language used to "talk to" relational databases.

It is more or less standardized; each database vendor has its own dialect, but if you learn one, you can usually work with any other SQL database.

SQL is a declarative, rather than imperative, language.

- You don't tell the database what to *do*
- You tell the database what you *want*

To ask the database for something, you issue a *query* in SQL.

Example:

```SELECT name FROM employees WHERE age < 40;```

which means, "I want the name data from the employees table for employees whose age is less than 40."

## Retrieval
---
We will focus on retrieval queries in this course.

To learn how to store and manipulate data in SQL, take CSCI 403!

The mechanism for retrieving data from a relational database is the SELECT query.

We saw an example of this already:

```SELECT name FROM employees WHERE age < 40;```

Recalling our table above:

| name    | age | salary |
|---------|-----|--------|
|Laura    | 43  | 102760 |
|Shashi   | 49  |  83010 |
|Raluca   | 33  |  95500 |


the query `SELECT name FROM employees WHERE age < 40;` results in something much like an anonymous table, containing a subset of the rows and columns of the employees table:

| name   |
|--------|
| Raluca |
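
As a quick check, the result above can be reproduced with Python's built-in `sqlite3` module (introduced later in this lecture). This is only a sketch: it assumes a throwaway in-memory database, and the column types in the `CREATE TABLE` statement are a guess that matches the example data.

```python
import sqlite3

# Build a throwaway in-memory database holding the example table.
# The column types here are an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Laura", 43, 102760), ("Shashi", 49, 83010), ("Raluca", 33, 95500)],
)

# Run the query from the text; each result row comes back as a tuple.
young = conn.execute("SELECT name FROM employees WHERE age < 40").fetchall()
print(young)  # [('Raluca',)]

conn.close()
```

The result is a list with one single-column row, which mirrors the one-row, one-column table shown above.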

You can probably infer the basic form already:

```
SELECT column1, column2, ...
FROM table
WHERE condition;
```

The WHERE clause is optional - omit it to get all rows.

If you want all columns, a shortcut is to use * in place of column names:

```
SELECT * FROM table WHERE condition;
```

The WHERE clause condition can be any Boolean expression, and can use column names in the expression, such as

```age < 40```

You can also make compound expressions using AND, OR and NOT:

```
SELECT * FROM employees
WHERE age > 40 AND salary < 100000 OR name = 'Raluca';
```
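
One detail worth knowing about compound conditions: in SQL, AND binds more tightly than OR, so the query above is read as `(age > 40 AND salary < 100000) OR name = 'Raluca'`. The sketch below illustrates this on an in-memory copy of the example table; the helper `names` is hypothetical and exists only for this demonstration.

```python
import sqlite3

# In-memory copy of the example employees table (column types assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Laura", 43, 102760), ("Shashi", 49, 83010), ("Raluca", 33, 95500)],
)

def names(where):
    """Hypothetical helper: sorted names matching a hard-coded WHERE condition."""
    rows = conn.execute("SELECT name FROM employees WHERE " + where)
    return sorted(r[0] for r in rows)

# No parentheses: AND is evaluated before OR.
unparenthesized = names("age > 40 AND salary < 100000 OR name = 'Raluca'")
# Explicit parentheses matching the default grouping give the same rows.
and_first = names("(age > 40 AND salary < 100000) OR name = 'Raluca'")
# Grouping the OR first changes the meaning, and the result.
or_first = names("age > 40 AND (salary < 100000 OR name = 'Raluca')")

print(unparenthesized)  # ['Raluca', 'Shashi']
print(and_first)        # ['Raluca', 'Shashi']
print(or_first)         # ['Shashi']

conn.close()
```

When mixing AND and OR, adding parentheses makes the intent explicit and costs nothing.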

There are many other refinements to explore, such as applying various functions to columns, doing string matching, and dealing with NULL values.

More on all that next time.

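One last note: SQL keywords are *not* case sensitive, and neither (in most databases) are unquoted table and column names; string values used in comparisons, however, may be.  These queries mean exactly the same thing: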

```SELECT name FROM employees WHERE age < 40;```

```Select name From employees Where age < 40;```

```select NAME from EMPLOYEES where AGE < 40;```
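
A small experiment, assuming an in-memory SQLite copy of the example table, suggests where case matters and where it does not. Changing the case of keywords and column names leaves the result unchanged, while changing the case of a string value does not. SQLite compares text case-sensitively by default; other databases may behave differently.

```python
import sqlite3

# In-memory copy of the example employees table (column types assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Laura", 43, 102760), ("Shashi", 49, 83010), ("Raluca", 33, 95500)],
)

# Keywords and identifiers: case does not change the rows returned.
upper_keywords = conn.execute("SELECT name FROM employees WHERE age < 40").fetchall()
mixed_case = conn.execute("select NAME from EMPLOYEES where AGE < 40").fetchall()

# String values are data: SQLite's default '=' on text is case sensitive.
exact = conn.execute("SELECT name FROM employees WHERE name = 'Raluca'").fetchall()
wrong_case = conn.execute("SELECT name FROM employees WHERE name = 'RALUCA'").fetchall()

print(upper_keywords == mixed_case)  # True
print(exact)                         # [('Raluca',)]
print(wrong_case)                    # []

conn.close()
```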

## On To Coding!
---
The obligatory setup code...

In [1]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

## Raw Python DB Access
---
You can access pretty much any relational database via modules specific to the database vendor.

Conveniently, these modules follow a standard (the Python DB-API, PEP 249), so the same methods work on pretty much all of them.

The API is pretty simple.

Here's an example, connecting to a database I created for this class.

In [2]:
import sqlite3   # We'll be using a simple file-based SQLite3 database

# You will need the csci303.sqlite3 file - put it in the same directory
# as this notebook.

try:
    db = sqlite3.connect('csci303.sqlite3')
    cursor = db.cursor()
except sqlite3.Error as e:
    print(e)

In [3]:
try:
    # issue a query
    cursor.execute('SELECT * FROM employees')
except sqlite3.Error as e:
    # handle oopsies
    print(e)
    db.rollback()

# get results of query
for r in cursor.fetchall():
    print(r)

('Laura', 43, 102760)
('Shashi', 49, 83010)
('Raluca', 33, 95500)


The exception handling code is not essential, but will save you some time if you mess up.

It will a) tell you what you did wrong and b) give you a chance to clear the errors by issuing a rollback.

Example:

In [4]:
try:
    results = cursor.execute('SELECT arglebargle FROM employees')
except sqlite3.Error as e:
    print(e)
    db.rollback()

no such column: arglebargle
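
One way to package this pattern is a small helper that only fetches results when the query actually succeeded. The sketch below is one possible arrangement, not the only one: the name `run_query` is hypothetical, and it is demonstrated on an in-memory table rather than the class database.

```python
import sqlite3

def run_query(conn, sql):
    """Hypothetical helper: return all rows, or an empty list if the query fails."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
    except sqlite3.Error as e:
        # Report the problem and clear any pending transaction state.
        print("query failed:", e)
        conn.rollback()
        return []
    else:
        # Only reached when execute() raised no exception.
        return cursor.fetchall()
    finally:
        cursor.close()

# Demonstration on an in-memory copy of the example table (types assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Laura", 43, 102760), ("Shashi", 49, 83010), ("Raluca", 33, 95500)],
)

good = run_query(conn, "SELECT name FROM employees")
bad = run_query(conn, "SELECT arglebargle FROM employees")  # prints the error

conn.close()
```

Putting `fetchall()` in the `else` branch means a failed query never tries to read results that do not exist.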


Of course, what we really want is to get the data into a pandas DataFrame so we can manipulate it further.

In [5]:
cursor.execute('SELECT * FROM employees')
DataFrame(cursor.fetchall())

|   | 0      | 1  | 2      |
|---|--------|----|--------|
| 0 | Laura  | 43 | 102760 |
| 1 | Shashi | 49 | 83010  |
| 2 | Raluca | 33 | 95500  |


Note, we unfortunately didn't get the column names.

We can get them from the cursor's `description` property.

In [6]:
cursor.description

(('name', None, None, None, None, None, None),
 ('age', None, None, None, None, None, None),
 ('salary', None, None, None, None, None, None))

In [7]:
cursor.execute('SELECT * FROM employees') 
DataFrame(cursor.fetchall(), columns=[r[0] for r in cursor.description])

|   | name   | age | salary |
|---|--------|-----|--------|
| 0 | Laura  | 43  | 102760 |
| 1 | Shashi | 49  | 83010  |
| 2 | Raluca | 33  | 95500  |
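
The same `description` trick works without pandas. As a minimal sketch (again using an in-memory copy of the example table, with assumed column types), the column names can be zipped with each row to produce dictionaries:

```python
import sqlite3

# In-memory copy of the example employees table (column types assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Laura", 43, 102760), ("Shashi", 49, 83010), ("Raluca", 33, 95500)],
)

cursor = conn.cursor()
cursor.execute("SELECT * FROM employees WHERE age < 40")

# The first element of each description entry is the column name.
columns = [d[0] for d in cursor.description]

# Pair every value in a row with its column name.
records = [dict(zip(columns, row)) for row in cursor.fetchall()]
print(records)  # [{'name': 'Raluca', 'age': 33, 'salary': 95500}]

cursor.close()
conn.close()
```

A list of dictionaries like `records` can also be passed directly to `DataFrame`, which takes the column names from the keys.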


Another example... I loaded the Boston Housing dataset into the database as the table 'boston':

In [8]:
# select the crim, indus, rm, and medv columns from the boston table
cursor.execute('SELECT crim, indus, rm, medv FROM boston')
# put the rows into a DataFrame and show the first 10
DataFrame(cursor.fetchall(), columns=[r[0] for r in cursor.description])[:10]

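|   | crim    | indus | rm    | medv |
|---|---------|-------|-------|------|
| 0 | 0.13158 | 10.01 | 6.176 | 21.2 |
| 1 | 0.15098 | 10.01 | 6.021 | 19.2 |
| 2 | 0.13058 | 10.01 | 5.872 | 20.4 |
| 3 | 0.14476 | 10.01 | 5.731 | 19.3 |
| 4 | 0.06899 | 25.65 | 5.87  | 22.0 |
| 5 | 0.07165 | 25.65 | 6.004 | 20.3 |
| 6 | 0.08447 | 4.05  | 5.859 | 22.6 |
| 7 | 0.06664 | 4.05  | 6.546 | 29.4 |
| 8 | 0.07022 | 4.05  | 6.02  | 23.2 |
| 9 | 0.05425 | 4.05  | 6.315 | 24.6 |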


## Cleaning Up
---
It is probably a good idea to close your database connections when you no longer need them.

Python will close them for you when you stop the kernel for your notebook, but if you leave it running, the database connection will live on...

In [9]:
# use .close() to close database connections
cursor.close()
db.close()
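
If you would rather not remember to call `.close()` yourself, one option is `contextlib.closing` from the standard library, which closes the object when the `with` block ends, even if an error occurs inside it. The sketch below uses an in-memory database purely for illustration.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on the object when the with-block exits.
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cursor:
        cursor.execute("SELECT 1 + 1")
        answer = cursor.fetchone()[0]

# After the block, the connection is closed; using it raises an error.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(answer, still_open)  # 2 False
```

Note that using a `sqlite3` connection directly in a `with` statement manages transactions (commit or rollback) but does not close the connection, which is why `closing` is used here.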

## The pandas Shortcut
---
Since:

- We frequently want data from SQL databases
- We further want to convert data to a DataFrame
- pandas is designed to make our lives easier

it follows that pandas should provide an easy way to get data from a SQL database into a DataFrame.

And thus it does.

We can issue specific SELECT queries via `pandas.read_sql_query`:

In [10]:
dburi = 'sqlite:///csci303.sqlite3'

pd.read_sql_query('SELECT * FROM boston WHERE chas', dburi)[:10]

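|   | crim    | zn  | indus | chas | nox   | rm    | age   | dis    | rad | tax   | ptratio | b      | lstat | medv |
|---|---------|-----|-------|------|-------|-------|-------|--------|-----|-------|---------|--------|-------|------|
| 0 | 3.32105 | 0.0 | 19.58 | 1    | 0.871 | 5.403 | 100.0 | 1.3216 | 5.0 | 403.0 | 14.7    | 396.9  | 26.82 | 13.4 |
| 1 | 1.12658 | 0.0 | 19.58 | 1    | 0.871 | 5.012 | 88.0  | 1.6102 | 5.0 | 403.0 | 14.7    | 343.28 | 12.12 | 15.3 |
| 2 | 1.41385 | 0.0 | 19.58 | 1    | 0.871 | 6.129 | 96.0  | 1.7494 | 5.0 | 403.0 | 14.7    | 321.02 | 15.12 | 17.0 |
| 3 | 3.53501 | 0.0 | 19.58 | 1    | 0.871 | 6.152 | 82.6  | 1.7455 | 5.0 | 403.0 | 14.7    | 88.01  | 15.02 | 15.6 |
| 4 | 1.27346 | 0.0 | 19.58 | 1    | 0.605 | 6.25  | 92.6  | 1.7984 | 5.0 | 403.0 | 14.7    | 338.92 | 5.5   | 27.0 |
| 5 | 1.83377 | 0.0 | 19.58 | 1    | 0.605 | 7.802 | 98.2  | 2.0407 | 5.0 | 403.0 | 14.7    | 389.61 | 1.92  | 50.0 |
| 6 | 1.51902 | 0.0 | 19.58 | 1    | 0.605 | 8.375 | 93.9  | 2.162  | 5.0 | 403.0 | 14.7    | 388.45 | 3.32  | 50.0 |
| 7 | 0.13587 | 0.0 | 10.59 | 1    | 0.489 | 6.064 | 59.1  | 4.2392 | 4.0 | 277.0 | 18.6    | 381.32 | 14.66 | 24.4 |
| 8 | 0.43571 | 0.0 | 10.59 | 1    | 0.489 | 5.344 | 100.0 | 3.875  | 4.0 | 277.0 | 18.6    | 396.9  | 23.09 | 20.0 |
| 9 | 0.17446 | 0.0 | 10.59 | 1    | 0.489 | 5.96  | 92.1  | 3.8771 | 4.0 | 277.0 | 18.6    | 393.25 | 17.27 | 21.7 |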
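
`pandas.read_sql_query` also accepts an already-open `sqlite3` connection in place of a database URI, which can be handy when you are already working with one. The following is a minimal sketch on an in-memory copy of the employees table (column types assumed), not the class database.

```python
import sqlite3
import pandas as pd

# In-memory copy of the example employees table (column types assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Laura", 43, 102760), ("Shashi", 49, 83010), ("Raluca", 33, 95500)],
)

# Pass the open connection instead of a URI string.
df = pd.read_sql_query("SELECT name, salary FROM employees WHERE age < 45", conn)
print(df)

conn.close()
```

The column names arrive in the DataFrame automatically, so there is no need to read them from `cursor.description`.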


Or, if we simply want the whole table, use `pandas.read_sql_table`:

In [11]:
# read_sql_table needs SQLAlchemy; it does not accept a raw sqlite3 connection
pd.read_sql_table('employees', dburi)

## Some Useful Code
---
How can you find out what tables are even in a database?

Every query tool has a mechanism to do this, but right now I know of no super-easy way to do it from Jupyter.

The SQLAlchemy database engine (which is what pandas is using under the covers) can do this for us, with a little code.

You might want to keep this snippet handy.

From http://stackoverflow.com/questions/21310549/list-database-tables-with-sqlalchemy

In [12]:
from sqlalchemy import create_engine, inspect
engine = create_engine("sqlite:///csci303.sqlite3")

inspector = inspect(engine)

# returns the list of schemas (SQLite calls its default schema 'main')
inspector.get_schema_names()

['main']

In [13]:
# returns the list of tables in a schema
inspector.get_table_names('main')

['boston',
 'customers',
 'departments',
 'employees',
 'employees2',
 'occupation_assignments',
 'occupation_attainments',
 'occupation_codes',
 'scifi_author',
 'scifi_work',
 'univ_act',
 'univ_carnegie',
 'univ_cbsa',
 'univ_cbsatype',
 'univ_ccbasic',
 'univ_ccenrprf',
 'univ_ccipgrad',
 'univ_ccipug',
 'univ_ccsizset',
 'univ_ccugprof',
 'univ_cngdstcd',
 'univ_control',
 'univ_countycd',
 'univ_csa',
 'univ_cyactive',
 'univ_deggrant',
 'univ_dfrcgid',
 'univ_dfrcuscg',
 'univ_f1syscod',
 'univ_f1systyp',
 'univ_field_descriptions',
 'univ_fips',
 'univ_groffer',
 'univ_hbcu',
 'univ_hdegofr1',
 'univ_hloffer',
 'univ_hospital',
 'univ_iclevel',
 'univ_instcat',
 'univ_instsize',
 'univ_landgrnt',
 'univ_locale',
 'univ_medical',
 'univ_necta',
 'univ_obereg',
 'univ_opeflag',
 'univ_openpubl',
 'univ_postsec',
 'univ_pseflag',
 'univ_pset4flg',
 'univ_rptmth',
 'univ_sector',
 'univ_stabbr',
 'univ_tribal',
 'univ_ugoffer',
 'universities']

In [14]:
# now get the column names and types for a specific table
[(c['name'],c['type']) for c in inspector.get_columns('employees','main')]

[('name', TEXT()), ('age', INTEGER()), ('salary', INTEGER())]

You could even dump all the column info into pandas to get a nice layout...

In [15]:
pd.DataFrame(inspector.get_columns('employees','main'))

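|   | name   | type    | nullable | default | autoincrement | primary_key |
|---|--------|---------|----------|---------|---------------|-------------|
| 0 | name   | TEXT    | True     |         | auto          | 0           |
| 1 | age    | INTEGER | True     |         | auto          | 0           |
| 2 | salary | INTEGER | True     |         | auto          | 0           |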
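
For SQLite specifically, similar information can be pulled out with plain SQL and the standard `sqlite3` module. SQLite keeps its table list in a built-in catalog called `sqlite_master`, and `PRAGMA table_info` describes a table's columns. Both are SQLite features, so this sketch will not carry over to other databases. The two tables created below are stand-ins for illustration, and the layout of `departments` is made up.

```python
import sqlite3

# Stand-in database; the departments layout is invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary INTEGER)")
conn.execute("CREATE TABLE departments (name TEXT)")

# sqlite_master holds one row per table, index, view, and so on.
tables = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(r[1], r[2]) for r in conn.execute("PRAGMA table_info(employees)")]

print(tables)   # ['departments', 'employees']
print(columns)  # [('name', 'TEXT'), ('age', 'INTEGER'), ('salary', 'INTEGER')]

conn.close()
```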


## Things To Try
---
- Do some SELECT queries on the 'boston' table
- Explore the 'universities' table
  - Get more info about the columns in 'universities'
  - The related 'univ_xxx' tables give text descriptions for the values in many of the columns
  - See what you can learn about Colorado School of Mines

In [16]:
pd.read_sql_query('SELECT * FROM boston WHERE indus = 18.10', dburi)[:10]


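|   | crim    | zn  | indus | chas | nox   | rm    | age  | dis    | rad  | tax   | ptratio | b      | lstat | medv |
|---|---------|-----|-------|------|-------|-------|------|--------|------|-------|---------|--------|-------|------|
| 0 | 8.98296 | 0.0 | 18.1  | 1    | 0.77  | 6.212 | 97.4 | 2.1222 | 24.0 | 666.0 | 20.2    | 377.73 | 17.6  | 17.8 |
| 1 | 3.8497  | 0.0 | 18.1  | 1    | 0.77  | 6.395 | 91.0 | 2.5052 | 24.0 | 666.0 | 20.2    | 391.34 | 13.27 | 21.7 |
| 2 | 5.20177 | 0.0 | 18.1  | 1    | 0.77  | 6.127 | 83.4 | 2.7227 | 24.0 | 666.0 | 20.2    | 395.43 | 11.48 | 22.7 |
| 3 | 4.22239 | 0.0 | 18.1  | 1    | 0.77  | 5.803 | 89.0 | 1.9047 | 24.0 | 666.0 | 20.2    | 353.04 | 14.64 | 16.8 |
| 4 | 3.47428 | 0.0 | 18.1  | 1    | 0.718 | 8.78  | 82.9 | 1.9047 | 24.0 | 666.0 | 20.2    | 354.55 | 5.29  | 21.9 |
| 5 | 5.66998 | 0.0 | 18.1  | 1    | 0.631 | 6.683 | 96.8 | 1.3567 | 24.0 | 666.0 | 20.2    | 375.33 | 3.73  | 50.0 |
| 6 | 6.53876 | 0.0 | 18.1  | 1    | 0.631 | 7.016 | 97.5 | 1.2024 | 24.0 | 666.0 | 20.2    | 392.05 | 2.96  | 50.0 |
| 7 | 8.26725 | 0.0 | 18.1  | 1    | 0.668 | 5.875 | 89.6 | 1.1296 | 24.0 | 666.0 | 20.2    | 347.88 | 8.88  | 50.0 |
| 8 | 4.26131 | 0.0 | 18.1  | 0    | 0.77  | 6.112 | 81.3 | 2.5091 | 24.0 | 666.0 | 20.2    | 390.74 | 12.67 | 22.6 |
| 9 | 4.54192 | 0.0 | 18.1  | 0    | 0.77  | 6.398 | 88.0 | 2.5182 | 24.0 | 666.0 | 20.2    | 374.56 | 7.79  | 25.0 |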


## Next Lecture
---
- More SELECT queries
  - More WHERE clause expressions
  - Functions
  - Sorting
  - Grouping and aggregating
  - Joining tables