Reading Normalized Data Quickly using a Database
------------------------------------------------------------

Your read-only database contains an un-normalized table called *home_value_by_zip* with 4,466,776 records.  The same data is also stored in normalized form across a few other tables in the database.  Part of this assignment is about speed: you must examine the tables, figure out how to connect them, and then construct an efficient SQL query that retrieves the requested data.
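
One way to examine the tables is to ask PostgreSQL's `information_schema.columns` view for the column names and types of each table. The sketch below assumes a psycopg2-style connection object; `describe_table` and `COLUMNS_SQL` are hypothetical names introduced here for illustration and are not part of the assignment.

```python
# Sketch: list the columns of one table through information_schema.
# describe_table is a hypothetical helper, not something the course provides.
COLUMNS_SQL = '''SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = %s
ORDER BY ordinal_position;'''

def describe_table(conn, table_name):
    """Return [(column_name, data_type), ...] for the named table."""
    cur = conn.cursor()
    try:
        # The table name is passed as a query parameter, not pasted into the SQL.
        cur.execute(COLUMNS_SQL, (table_name,))
        return cur.fetchall()
    finally:
        cur.close()
```

With the connection opened later in this notebook, calling `describe_table(conn, 'home_value')` and comparing the result with the other tables should show which columns end in `_id`; those are the likely candidates for a join condition.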

In [1]:
# https://www.pg4e.com/code/ipynb/04-normalization.ipynb

import psycopg2
import pandas as pd
import time

In [2]:
# If you are going to send this file to the autograder, you cannot use hidden.py
# and must uncomment the sql_string assignment statement below and put in your values

sql_string = None
# sql_string = 'dbname=pg4e_data user=pg4e_data_read password=pg4e_pass_94e5d host=35.239.113.162 port=10001'
    
# If we leave sql_string as None, we can use hidden.py - but this only works for demo/test - not autograder
if sql_string is None:
    import hidden
    secrets = hidden.readonly()
    sql_string = hidden.psycopg2(secrets)
    print('PostgreSQL connection data taken from hidden.py')

PostgreSQL connection data taken from hidden.py


In [3]:
conn = psycopg2.connect(sql_string, connect_timeout=3)

You are to construct a query using the normalized tables that will return the same results as:

    SELECT state, avg(ym_val) AS average FROM home_value_by_zip
    GROUP BY state ORDER BY average DESC LIMIT 10;

Here are the first few rows of the expected output:

     state |       average       
    -------+---------------------
     CA    | 429388.882710557533
     HI    | 384304.615036999379
     DC    | 373415.607524148449
     NJ    | 313458.077439427195
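
To see why a join over normalized tables can return the same rows as the un-normalized query, here is a minimal sketch on toy data. It uses an in-memory SQLite database from the Python standard library rather than the course PostgreSQL server, and the five sample rows are invented. The table and column names mirror the ones used in this notebook, but the real schema should still be checked directly.

```python
import sqlite3

# Toy data in an in-memory SQLite database (an assumption for illustration;
# the assignment itself runs against PostgreSQL).
db = sqlite3.connect(':memory:')
cur = db.cursor()

# Un-normalized: the state string is repeated on every row.
cur.execute('CREATE TABLE home_value_by_zip (state TEXT, ym_val REAL)')
rows = [('CA', 400.0), ('CA', 500.0), ('HI', 380.0), ('NJ', 310.0), ('NJ', 320.0)]
cur.executemany('INSERT INTO home_value_by_zip VALUES (?, ?)', rows)

# Normalized: each state is stored once and referenced by an integer key.
cur.execute('CREATE TABLE home_state (id INTEGER PRIMARY KEY, state TEXT UNIQUE)')
cur.execute('CREATE TABLE home_value (state_id INTEGER, ym_val REAL)')
cur.execute('INSERT INTO home_state (state) SELECT DISTINCT state FROM home_value_by_zip')
cur.execute('''INSERT INTO home_value (state_id, ym_val)
    SELECT home_state.id, home_value_by_zip.ym_val
    FROM home_value_by_zip JOIN home_state
    ON home_value_by_zip.state = home_state.state''')

# The same aggregate, once over the flat table and once over the join.
flat = cur.execute('''SELECT state, avg(ym_val) AS average FROM home_value_by_zip
    GROUP BY state ORDER BY average DESC LIMIT 10''').fetchall()
joined = cur.execute('''SELECT state, avg(ym_val) AS average FROM home_value
    JOIN home_state ON home_value.state_id = home_state.id
    GROUP BY state ORDER BY average DESC LIMIT 10''').fetchall()

print(flat == joined)
db.close()
```

The join replaces each integer key with its state name before grouping, so both queries aggregate the same values under the same labels.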


In [18]:
sql = None

### BEGIN SOLUTION
sql = '''SELECT state, avg(ym_val) AS average FROM home_value
JOIN home_state ON state_id = home_state.id
GROUP BY state ORDER BY average DESC LIMIT 10;'''

# sql = '''SELECT state, avg(ym_val) AS average FROM home_value_by_zip
# GROUP BY state ORDER BY average DESC LIMIT 10;'''
### END SOLUTION

if sql is not None:
    df = pd.read_sql_query(sql, conn)
    df.head()

In [19]:
if sql is None:
    raise Exception('You need to define the sql query')
    
assert df['state'][1] == 'HI'
assert df['average'][1] > 384304
assert df['average'][1] < 384305

### BEGIN HIDDEN TESTS
if 'join' not in sql.lower():
    raise Exception('You need to have a JOIN in your query')

start = time.time()
df = pd.read_sql_query(sql, conn)
df.head()
delta = time.time() - start
print('Query execution time', delta)
if delta > 4.0 :
    raise Exception('Your query took too long')

assert df['state'][3] == 'NJ'
assert df['average'][3] > 313458
assert df['average'][3] < 313459
### END HIDDEN TESTS

Query execution time 1.6620681285858154
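
The timing check above can be reproduced locally before submitting. The sketch below wraps any callable in a wall-clock timer; `timed` is a hypothetical helper name, and the 4.0 second default simply copies the limit used in the test cell.

```python
import time

def timed(func, *args, limit=4.0, **kwargs):
    """Call func and return (result, seconds); raise if it exceeds limit seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    if elapsed > limit:
        raise RuntimeError(f'Call took {elapsed:.2f}s, limit is {limit:.2f}s')
    return result, elapsed
```

In this notebook it could be used as `df, seconds = timed(pd.read_sql_query, sql, conn)`. The measurement includes network transfer as well as query execution, so expect it to vary from run to run.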


You are to construct a query using the normalized tables that will return the same results as:

    SELECT city, avg(ym_val) AS average FROM home_value_by_zip
    GROUP BY city ORDER BY average DESC LIMIT 10;

Here are the first few rows of the expected output:

          city       |       average        
    -----------------+----------------------
     Atherton        | 3625292.526690391459
     Portola Valley  | 2218466.548042704626
     Fisher Island   | 2078791.814946619217
     Montecito       | 1939405.693950177936


In [12]:
sql = None

### BEGIN SOLUTION
sql = '''SELECT city, avg(ym_val) AS average FROM home_value
JOIN home_city ON city_id = home_city.id
GROUP BY city ORDER BY average DESC LIMIT 10;'''

# sql = '''SELECT city, avg(ym_val) AS average FROM home_value_by_zip
# GROUP BY city ORDER BY average DESC LIMIT 10;'''
### END SOLUTION

if sql is not None:
    df = pd.read_sql_query(sql, conn)
    df.head()

In [14]:
if sql is None:
    raise Exception('You need to define the sql query')

assert df['city'][1] == 'Portola Valley'
assert df['average'][1] > 2218466
assert df['average'][1] < 2218467

### BEGIN HIDDEN TESTS
if 'join' not in sql.lower():
    raise Exception('You need to have a JOIN in your query')

start = time.time()
my_df = pd.read_sql_query(sql, conn)
my_df.head()
delta = time.time() - start
print('Query execution time', delta)
if delta > 4.0:
    raise Exception('Your query took too long')

assert my_df['city'][3] == 'Montecito'
assert my_df['average'][3] > 1939405
assert my_df['average'][3] < 1939406

### END HIDDEN TESTS

Query execution time 1.680393934249878
