### Accessing Reviews Data in Postgres Database
The purpose of this notebook is to show the run time of a query on our PostgreSQL database. Once you have this time, you can run the Spark Example notebook and compare the run time of the same query under the two methods.
 
Note: It is also possible to run the two notebooks simultaneously. However, it is best to complete the Spark notebook in one sitting; otherwise, the server may glitch.

#### Database Structure
* reviews

| Field | Type | Allow Null | Default Value |
|-------|------|------------|---------------|
| product_id | varchar(20) | Yes | x |
| price | float4 | Yes | x |
| user_id | varchar(50) | Yes | x |
| profile_name | varchar(100) | Yes | x |
| time | timestamp(6) WITH TIME ZONE | Yes | x |
| score | float4 | Yes | x |
| title | text | Yes | x |
| summary | text | Yes | x |
| text | text | Yes | x |
| helpfulness_score | float4 | Yes | x |
| helpfulness_count | float4 | Yes | x |
| counter_id | int4 | No | nextval('reviews_counter_id_seq'::regclass) |


**In each of the cells below, make sure you wait for the [\*] to turn into a number before running the next cell. This code takes a while to come back.**

In [1]:
# Importing libraries. We need the time library to use the time.time() function.
import psycopg2, time


try:
    # connecting to the postgres 'reviews' database
    connect_str = "dbname='reviews' user='dsa_ro_user' host='dbase.dsa.missouri.edu' " + \
                  "password='readonly'"
    # use our connection values to establish a connection
    conn = psycopg2.connect(connect_str)
    # create a psycopg2 cursor that can execute queries
    cursor = conn.cursor()
    # time our query for counting the number of distinct titles in the reviews table 
    # NOTE: When I ran this, it took about ten minutes, so be sure to have a solid internet connection. :) 
    start = time.time()
    cursor.execute("""SELECT COUNT(DISTINCT title) FROM reviews""")
    rows = cursor.fetchall()
    end = time.time()
    print(rows)
    print(end - start)
except Exception as e:
    # This branch also catches query errors, not only failed connections.
    print("Uh oh, the connection or query failed. Invalid dbname, user or password?")
    print(e)

[(1854362,)]
583.4322063922882


### Large Databases Without Spark are Slower
You should notice that this query takes almost twice as long as the Spark query in the Spark Example notebook. The time.time() function returns the number of seconds since the epoch, and the epoch itself is platform dependent. Since we are computing the difference between two readings, the choice of epoch cancels out, which makes this an appropriate way to measure run time.
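The start/end pattern used above can be wrapped in a small helper so that every query is timed the same way. The sketch below is only an illustration: `time_call` is a hypothetical name, and it uses `time.perf_counter()` from the standard library, a monotonic clock intended for measuring intervals, in place of `time.time()`. The `sum(range(...))` workload is a stand-in so the cell runs without a database connection.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload (assumption): in this notebook, the timed function would
# wrap the cursor.execute(...) and cursor.fetchall() calls instead.
result, elapsed = time_call(sum, range(1_000_000))
print(result)
print(f"{elapsed:.4f} seconds")
```

As with `end - start` above, only the difference between the two readings matters, so the clock's starting point does not affect the result.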

## Play with the Reviews Database
 * What is the range of helpfulness scores? 
 * What is the average helpfulness score?
 * What is the average rating?

Do your work below

In [7]:
# This is how to get the range of helpfulness scores. Notice that we have to reconnect to the database.
import psycopg2

try:
    connect_str = "dbname='reviews' user='dsa_ro_user' host='dbase.dsa.missouri.edu' " + \
                  "password='readonly'"
    conn = psycopg2.connect(connect_str)
    cursor = conn.cursor()
    # query to get the range: subtract the min value from the max value
    cursor.execute("""SELECT max(helpfulness_score) - min(helpfulness_score) FROM reviews""")
    rows = cursor.fetchall()
    print(rows)
except Exception as e:
    # This branch also catches query errors, not only failed connections.
    print("Uh oh, the connection or query failed. Invalid dbname, user or password?")
    print(e)

[(47516.0,)]


In [3]:
# This is how to get the average helpfulness score.
cursor.execute("""SELECT avg(helpfulness_score) FROM reviews""")
rows = cursor.fetchall()
print(rows)

[(3.72431562109408,)]


In [8]:
# This is how to get the average "rating". (The rating is the score.) 
# The avg() function makes everything super easy!
cursor.execute("""SELECT avg(score) FROM reviews""")
rows = cursor.fetchall()
print(rows)

[(4.17185119113233,)]
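The three answers above each come from a separate query. Since all three are aggregates over the same table, they can also be requested in a single statement. The sketch below illustrates the idea against an in-memory SQLite table rather than the real server: the rows are made-up values for demonstration only, and `sqlite3` is a stand-in chosen so the cell runs without a connection. The `SELECT` itself uses only `max`, `min`, and `avg`, so the same statement should work when passed to `cursor.execute` on the `reviews` database.

```python
import sqlite3

# In-memory stand-in for the reviews table (hypothetical rows, not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (helpfulness_score REAL, score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(0.0, 5.0), (2.0, 4.0), (10.0, 3.0)],
)

# One statement: range of helpfulness, average helpfulness, average rating.
row = conn.execute(
    """SELECT max(helpfulness_score) - min(helpfulness_score),
              avg(helpfulness_score),
              avg(score)
       FROM reviews"""
).fetchone()
score_range, avg_helpfulness, avg_score = row
print(score_range, avg_helpfulness, avg_score)
conn.close()
```

Combining the aggregates means the result comes back as one row with three columns instead of three separate single-value results.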


## Note
If you plan on running either or both of the average queries again, you need to run the cell with the connection
string first! We need to be connected to the database before we can query it for information, so be sure to do the 
connection 