In [6]:
import pandas as pd
import numpy as np

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Intro to SQL
Week 5 | Day 2

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Connect to a local or remote database using Python or Pandas
- Connect to a local or remote database using SQLite Manager (for SQLite) or Postico (for PostgreSQL)
- Perform queries using SELECT
- Perform simple aggregations such as COUNT, MAX, MIN, and SUM

## SQLite

SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. **SQLite reads and writes directly to ordinary disk files.** A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file.

SQLite is not directly comparable to client/server SQL database engines such as MySQL, Oracle, PostgreSQL, or SQL Server since SQLite is trying to solve a different problem.

SQLite emphasizes economy, efficiency, reliability, independence, and simplicity.

**SQLite does not compete with client/server databases. SQLite competes with fopen().**
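
To make the single-file point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The scratch directory, the file name `demo.sqlite`, and the `notes` table are assumptions made for illustration; they are not part of the lesson's data.

```python
import os
import sqlite3
import tempfile

# Hypothetical scratch location, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")

# Connecting creates the file if it does not exist; no server is started.
demo_conn = sqlite3.connect(demo_path)
demo_conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
demo_conn.commit()
demo_conn.close()

# The whole database now lives in one ordinary file on disk.
file_exists = os.path.isfile(demo_path)
file_size = os.path.getsize(demo_path)
print(file_exists, file_size)
```

Copying or deleting that one file copies or deletes the entire database, which is the sense in which SQLite is closer to `fopen()` than to a database server.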

## Let's connect to our db file with SQLite

In [1]:
import sqlite3
sqlite_db = 'dsi-db.sqlite'
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

## Now we'll create a table using our cursor

In [2]:
# IF NOT EXISTS lets this cell be re-run without raising
# "OperationalError: table houses already exists"
c.execute('CREATE TABLE IF NOT EXISTS houses \
          (field1 INTEGER PRIMARY KEY, sqft INTEGER,\
           bdrms INTEGER, age INTEGER, price INTEGER);')

# Save (commit) the changes
conn.commit()

## Now, we'll add rows to our table

In [3]:
last_sale = (None, 4000, 5, 22, 619000)
c.execute('INSERT INTO houses VALUES (?,?,?,?,?)', last_sale)

# Remember to commit the changes
conn.commit()
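
The `None` in the first position is what lets SQLite pick the key. A minimal sketch of that behavior, assuming an in-memory database (`:memory:`) and the hypothetical names `demo_conn` and `demo_cur` so that `dsi-db.sqlite` is left untouched:

```python
import sqlite3

# In-memory database so the sketch does not touch dsi-db.sqlite.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute(
    "CREATE TABLE houses (field1 INTEGER PRIMARY KEY, sqft INTEGER, "
    "bdrms INTEGER, age INTEGER, price INTEGER)"
)

insert_sql = "INSERT INTO houses VALUES (?,?,?,?,?)"

# Each ? is bound to the matching tuple element. None becomes NULL,
# and SQLite fills an INTEGER PRIMARY KEY column with the next row id.
demo_cur.execute(insert_sql, (None, 4000, 5, 22, 619000))
first_id = demo_cur.lastrowid
demo_cur.execute(insert_sql, (None, 2390, 4, 34, 319000))
second_id = demo_cur.lastrowid

demo_conn.commit()
print(first_id, second_id)
```

Binding values through `?` placeholders, rather than formatting them into the SQL string, also keeps quoting and escaping in the hands of the driver.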

## We can bring up DB Browser to see our results...

## Now let's bulk add multiple rows with executemany()

In [4]:
recent_sales = [
  (None, 2390, 4, 34, 319000),
  (None, 1870, 3, 14, 289000),
  (None, 1505, 3, 90, 269000),
]

c.executemany('INSERT INTO houses VALUES (?, ?, ?, ?, ?)', recent_sales)

conn.commit()
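
An easy mistake with bulk inserts is forgetting the commit. One possible pattern, sketched here against an in-memory database with the hypothetical name `demo_conn`, is to use the connection as a context manager: it commits when the block succeeds and rolls back if an exception is raised.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE houses (field1 INTEGER PRIMARY KEY, sqft INTEGER, "
    "bdrms INTEGER, age INTEGER, price INTEGER)"
)

recent_sales = [
    (None, 2390, 4, 34, 319000),
    (None, 1870, 3, 14, 289000),
    (None, 1505, 3, 90, 269000),
]

# The with-block wraps the inserts in a transaction:
# commit on success, rollback on error.
with demo_conn:
    demo_conn.executemany("INSERT INTO houses VALUES (?, ?, ?, ?, ?)", recent_sales)

row_count = demo_conn.execute("SELECT COUNT(*) FROM houses").fetchone()[0]
print(row_count)
```

Note that the `with` block manages the transaction only; it does not close the connection.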

## Now, we'll bulk add using a csv

In [7]:
pd.read_csv('https://www.dropbox.com/s/1k9cgsd7bzce0yk/housing-data.csv?dl=1').head(2)

Unnamed: 0,sqft,bdrms,age,price
0,2104,3,70,399900
1,1600,3,28,329900


In [8]:
# check that our entries match
c.execute('PRAGMA table_info(houses)')
c.fetchall()

[(0, u'field1', u'INTEGER', 0, None, 1),
 (1, u'sqft', u'INTEGER', 0, None, 0),
 (2, u'bdrms', u'INTEGER', 0, None, 0),
 (3, u'age', u'INTEGER', 0, None, 0),
 (4, u'price', u'INTEGER', 0, None, 0)]
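
Each tuple returned by `PRAGMA table_info` has the form `(cid, name, type, notnull, dflt_value, pk)`. A short sketch of pulling out the column names and the primary key, assuming an in-memory copy of the table under the hypothetical name `demo_conn`:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE houses (field1 INTEGER PRIMARY KEY, sqft INTEGER, "
    "bdrms INTEGER, age INTEGER, price INTEGER)"
)

info = demo_conn.execute("PRAGMA table_info(houses)").fetchall()

# Index 1 is the column name; index 5 is non-zero for primary key columns.
column_names = [row[1] for row in info]
pk_columns = [row[1] for row in info if row[5]]

print(column_names)
print(pk_columns)
```

Comparing `column_names` with the CSV header is a quick way to check that the file and the table line up before inserting.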

In [9]:
from numpy import genfromtxt

# import into nparray of ints, then convert to list of lists
data = (genfromtxt('https://www.dropbox.com/s/1k9cgsd7bzce0yk/housing-data.csv?dl=1', dtype='i8', delimiter=',', skip_header=1)).tolist()

## Here we see what that looks like

In [12]:
data[0:3]


[[2104L, 3L, 70L, 399900L],
 [1600L, 3L, 28L, 329900L],
 [2400L, 3L, 44L, 369000L]]

## To auto-increment the primary key, each row needs a leading None, so we'll add that

In [13]:
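# prepend a None value to each sub-list; the check keeps the cell safe to
# re-run, since a second None would give each row six values instead of five
for d in data:
    if d[0] is not None:
        d.insert(0, None)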

## Here we see that result

In [12]:
data[0:3]

[[None, 2104L, 3L, 70L, 399900L],
 [None, 1600L, 3L, 28L, 329900L],
 [None, 2400L, 3L, 44L, 369000L]]

## Now we'll loop through and add each

In [16]:
# loop through data, running an INSERT on each record (i.e. sublist)
# each record must hold exactly five values (None plus the four CSV columns);
# any other length raises "ProgrammingError: Incorrect number of bindings supplied"
for d in data:
    c.execute('INSERT INTO houses VALUES (?, ?, ?, ?, ?)', d)

conn.commit()
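
The number of bound values has to match the number of `?` placeholders exactly. An alternative that avoids the placeholder `None` altogether is to name the columns being inserted and leave the key out. This is a sketch against an in-memory database (hypothetical name `demo_conn`), not the approach used in the cells above:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE houses (field1 INTEGER PRIMARY KEY, sqft INTEGER, "
    "bdrms INTEGER, age INTEGER, price INTEGER)"
)

# Four values per row, matching the four columns named in the INSERT.
rows = [
    (2104, 3, 70, 399900),
    (1600, 3, 28, 329900),
    (2400, 3, 44, 369000),
]

with demo_conn:
    demo_conn.executemany(
        "INSERT INTO houses (sqft, bdrms, age, price) VALUES (?, ?, ?, ?)",
        rows,
    )

# field1 was omitted, so SQLite assigned 1, 2, 3 on its own.
stored = demo_conn.execute(
    "SELECT field1, sqft FROM houses ORDER BY field1"
).fetchall()
print(stored)
```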

## Again, we can see the results with DB Browser

## We can also see our efforts with a query

In [17]:
# Similar syntax as before
results = c.execute("SELECT * FROM houses WHERE bdrms = 4")

# Here results is a cursor object - use fetchall() to extract a list
results.fetchall()

[(2, 2390, 4, 34, 319000),
 (9, 3000, 4, 75, 539900),
 (10, 1985, 4, 61, 299900),
 (15, 1940, 4, 7, 239999),
 (20, 2300, 4, 77, 449900),
 (23, 2609, 4, 5, 499998),
 (24, 3031, 4, 21, 599000),
 (28, 1962, 4, 53, 259900),
 (37, 2040, 4, 75, 314900),
 (39, 1811, 4, 24, 285900),
 (42, 2132, 4, 28, 345000),
 (43, 4215, 4, 66, 549000),
 (44, 2162, 4, 43, 287000),
 (47, 2567, 4, 57, 314000),
 (50, 1852, 4, 64, 299900),
 (53, 2390, 4, 34, 319000)]
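
Plain tuples are fine for a quick look, but it is easy to lose track of which position holds which column. As a hedged sketch, setting `row_factory = sqlite3.Row` lets rows be read by column name; the in-memory database and the name `demo_conn` are assumptions for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.row_factory = sqlite3.Row  # rows now support access by column name
demo_conn.execute(
    "CREATE TABLE houses (field1 INTEGER PRIMARY KEY, sqft INTEGER, "
    "bdrms INTEGER, age INTEGER, price INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO houses VALUES (?, ?, ?, ?, ?)",
    [
        (None, 2390, 4, 34, 319000),
        (None, 1870, 3, 14, 289000),
        (None, 3000, 4, 75, 539900),
    ],
)

# A cursor can be iterated directly instead of calling fetchall().
four_bed = []
query = "SELECT * FROM houses WHERE bdrms = ? ORDER BY field1"
for row in demo_conn.execute(query, (4,)):
    four_bed.append((row["sqft"], row["price"]))

print(four_bed)
```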

## Using pandas

In [18]:
import pandas as pd
from pandas.io import sql

In [19]:
import pandas as pd

data = pd.read_csv('https://www.dropbox.com/s/1k9cgsd7bzce0yk/housing-data.csv?dl=1', low_memory=False)
data.head()

Unnamed: 0,sqft,bdrms,age,price
0,2104,3,70,399900
1,1600,3,28,329900
2,2400,3,44,369000
3,1416,2,49,232000
4,3000,4,75,539900


## Now we'll use the pandas .to_sql() method to write another table

In [20]:
data.to_sql('houses_pandas',
            con=conn,
            if_exists='replace',
            index=False)

## Again, we can see our table using SQLite browser

## And we can query to read it

In [21]:
sql.read_sql('select * from houses_pandas limit 5', con=conn)

Unnamed: 0,sqft,bdrms,age,price
0,2104,3,70,399900
1,1600,3,28,329900
2,2400,3,44,369000
3,1416,2,49,232000
4,3000,4,75,539900


## Exercise

- Create a new database file using SQLite
- Create a table in that db file called students
- Insert the names of all the people sitting at your table
- Create the table with an auto-incrementing primary key
- Add a column for their favorite color and number
- Once the table is populated, use pandas to select all the data from the table


## SQL Operators

## SELECT

```SQL
SELECT
<columns>
FROM
<table>
```

## Now within pandas

In [24]:
sql.read_sql('select * from houses_pandas limit 10', con=conn)

Unnamed: 0,sqft,bdrms,age,price
0,2104,3,70,399900
1,1600,3,28,329900
2,2400,3,44,369000
3,1416,2,49,232000
4,3000,4,75,539900
5,1985,4,61,299900
6,1534,3,12,314900
7,1427,3,57,198999
8,1380,3,14,212000
9,1494,3,15,242500


```SQL
SELECT *
```
returns all of the columns in the table.

### We can also select individual columns

```SQL
SELECT
<col1>, <col2>, <coln>
FROM
<table>
```
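
The same template can be tried directly with `sqlite3`. In this sketch the table `houses_demo` and the connection `demo_conn` are hypothetical stand-ins holding two rows from the housing data; `cursor.description` reports the names of the selected columns.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE houses_demo (sqft INTEGER, bdrms INTEGER, age INTEGER, price INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO houses_demo VALUES (?, ?, ?, ?)",
    [(2104, 3, 70, 399900), (1600, 3, 28, 329900)],
)

# Only the listed columns come back, in the order they were requested.
cur = demo_conn.execute("SELECT age, price FROM houses_demo")
column_names = [d[0] for d in cur.description]
rows = cur.fetchall()

print(column_names)
print(rows)
```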

## And with pandas...

In [23]:
sql.read_sql('select age, price from houses_pandas limit 10', con=conn)

Unnamed: 0,age,price
0,70,399900
1,28,329900
2,44,369000
3,49,232000
4,75,539900
5,61,299900
6,12,314900
7,57,198999
8,14,212000
9,15,242500


## Exercise

- Write a query that returns only bedrooms, sq. footage, and price from our houses_pandas table
- Implement the query in pandas

In [25]:
sql.read_sql('select bdrms, sqft, price from houses_pandas limit 10', con=conn)


Unnamed: 0,bdrms,sqft,price
0,3,2104,399900
1,3,1600,329900
2,3,2400,369000
3,2,1416,232000
4,4,3000,539900
5,4,1985,299900
6,3,1534,314900
7,3,1427,198999
8,3,1380,212000
9,3,1494,242500


## WHERE

### WHERE is used to filter the data

```SQL
SELECT
<columns>
FROM
<table>
WHERE
<condition>
```

### Example in SQL

```SQL
SELECT
sqft, bdrms, age, price
FROM houses_pandas
WHERE bdrms = 2 and price < 500000;
```
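
When the filter values come from variables rather than being typed into the query, they can be bound as parameters. A minimal sketch using `sqlite3` named placeholders, with a hypothetical in-memory table `houses_demo` holding four rows from the housing data:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE houses_demo (sqft INTEGER, bdrms INTEGER, age INTEGER, price INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO houses_demo VALUES (?, ?, ?, ?)",
    [
        (1416, 2, 49, 232000),
        (1320, 2, 62, 299900),
        (3000, 4, 75, 539900),
        (2104, 3, 70, 399900),
    ],
)

# :bdrms and :max_price are filled from the dictionary passed to execute().
query = """
    SELECT sqft, bdrms, age, price
    FROM houses_demo
    WHERE bdrms = :bdrms AND price < :max_price
    ORDER BY price
"""
matches = demo_conn.execute(query, {"bdrms": 2, "max_price": 500000}).fetchall()
print(matches)
```

Both conditions must hold for a row to be returned, so the four-bedroom and three-bedroom rows are filtered out even though the three-bedroom one is under the price limit.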

## Now, we'll execute it in pandas

In [26]:
sql.read_sql('select sqft, bdrms, age, price from houses_pandas\
             where bdrms = 2 and price < 500000', con=conn)

Unnamed: 0,sqft,bdrms,age,price
0,1416,2,49,232000
1,1320,2,62,299900
2,1888,2,79,255000
3,1839,2,40,349900
4,1664,2,40,368500
5,852,2,70,179900


## Exercise

- Write a query that returns the sqft, bdrms, and age for houses older than 60 years with at least 3 bedrooms.
- Implement the query in pandas

In [30]:
sql.read_sql('SELECT sqft, bdrms, age FROM houses_pandas WHERE age > 60 and bdrms >=3', con=conn)

Unnamed: 0,sqft,bdrms,age
0,2104,3,70
1,3000,4,75
2,1985,4,61
3,2300,4,77
4,1236,3,78
5,2040,4,75
6,3137,3,67
7,4215,4,66
8,1200,3,76
9,1852,4,64


## AGGREGATIONS

- Average (i.e., arithmetic mean)
- Count
- Maximum
- Minimum
- Median
- Mode
- Sum
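
Several of these map directly onto SQL functions: `AVG`, `COUNT`, `MAX`, `MIN`, and `SUM`. As far as I know, SQLite's core function set has no built-in median or mode, so the sketch below computes the mode with a grouped count instead. The table `houses_demo` and the connection `demo_conn` are hypothetical and hold five rows from the housing data.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE houses_demo (sqft INTEGER, bdrms INTEGER, age INTEGER, price INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO houses_demo VALUES (?, ?, ?, ?)",
    [
        (2104, 3, 70, 399900),
        (1600, 3, 28, 329900),
        (2400, 3, 44, 369000),
        (1416, 2, 49, 232000),
        (3000, 4, 75, 539900),
    ],
)

# Built-in aggregates collapse the whole table into a single row.
count, avg_age, min_price, max_price, total_price = demo_conn.execute(
    "SELECT COUNT(*), AVG(age), MIN(price), MAX(price), SUM(price) FROM houses_demo"
).fetchone()

# Mode of bdrms: count rows per value and keep the most frequent one.
mode_bdrms, mode_count = demo_conn.execute(
    "SELECT bdrms, COUNT(*) AS n FROM houses_demo "
    "GROUP BY bdrms ORDER BY n DESC LIMIT 1"
).fetchone()

print(count, avg_age, min_price, max_price, total_price)
print(mode_bdrms, mode_count)
```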

## Example SQL

```SQL
SELECT COUNT(price)
FROM houses_pandas;
```

In [32]:
sql.read_sql('SELECT COUNT(price) FROM houses_pandas', con=conn)

Unnamed: 0,COUNT(price)
0,47


## Another example

```SQL
SELECT AVG(sqft), MIN(price), MAX(price)
FROM houses_pandas
WHERE bdrms = 2;
```

In [33]:
sql.read_sql('SELECT AVG(sqft), MIN(price), \
MAX(price) FROM houses_pandas WHERE bdrms = 2', con=conn)

Unnamed: 0,AVG(sqft),MIN(price),MAX(price)
0,1496.5,179900,368500


## Exercise 

- Write a query to find the average price per sq ft for one-bedroom houses
- Write another to find the average price per sq ft for houses with more than 3 bedrooms
- Implement both queries in pandas

In [45]:
sql.read_sql('SELECT AVG(price/sqft) FROM houses_pandas WHERE bdrms = 1', con=conn)



Unnamed: 0,AVG(price/sqft)
0,169.0


In [42]:
sql.read_sql('SELECT AVG(price/sqft) FROM houses_pandas WHERE bdrms > 3', con=conn)

Unnamed: 0,AVG(price/sqft)
0,156.066667
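
One caveat about `price/sqft`: when both operands are integers, SQLite performs integer division and drops the fractional part of each ratio before `AVG` sees it. The sketch below illustrates the difference on two rows from the housing data; the table `houses_demo` and the connection `demo_conn` are hypothetical, and multiplying by `1.0` is just one way to force real-valued division.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE houses_demo (sqft INTEGER, bdrms INTEGER, age INTEGER, price INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO houses_demo VALUES (?, ?, ?, ?)",
    [(1416, 2, 49, 232000), (3000, 4, 75, 539900)],
)

# Integer / integer truncates: 232000/1416 -> 163 and 539900/3000 -> 179.
int_avg = demo_conn.execute(
    "SELECT AVG(price / sqft) FROM houses_demo"
).fetchone()[0]

# Multiplying by 1.0 makes the division real-valued before averaging.
real_avg = demo_conn.execute(
    "SELECT AVG(price * 1.0 / sqft) FROM houses_demo"
).fetchone()[0]

print(int_avg, real_avg)
```

The truncated version always comes out at or below the real-valued one, so it slightly understates the average price per square foot.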


## Independent Practice

Practice querying the **PostgreSQL database** using Postico. You can find the DB at:

```
url: dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com
port: 5432
database: dsi
user: dsi_student
password: gastudents
```

Questions:

- What's the average price per room for 1-bedroom apartments?
- What's the average price per room for 2-bedroom apartments?
- What's the most frequent apartment size (in terms of bedrooms)?
- How many apartments of that size are there?
- What fraction of the total number are of that size?
- How old is the oldest 3-bedroom apartment?
- How old is the newest apartment?
- What's the average age for the whole dataset?
- What's the average age for each bedroom size?

Try to answer all these in SQL.

If you finish, try completing the first sections of [PostgreSQL Exercises](https://pgexercises.com/questions/basic/selectall.html) or [SQL zoo](http://www.sqlzoo.net/).

## Conclusion

- We've seen how to use a SQL database and a SQLite file database to make queries
- We've seen how to create a table and populate it
- We've seen how to use pandas sql to make queries
- We've seen how to use SELECT, WHERE, and aggregations