### Spark Cluster Performance
The purpose of this notebook is to demonstrate the advantages of a Spark cluster query over a Postgres database query. It is best to complete the entire notebook in one sitting.
If you leave it and come back later, you will most likely experience glitches in the server.
It is also important to run the cells from top to bottom, and to WAIT until one cell has finished before starting the next.

#### Database Structure
* reviews

| Field | Type | Allow Null | Default Value |
|-------|------|------------|---------------|
| product_id | varchar(20) | Yes | x |
| price | float4 | Yes | x |
| user_id | varchar(50) | Yes | x |
| profile_name | varchar(100) | Yes | x |
| time | timestamp(6) WITH TIME ZONE | Yes | x |
| score | float4 | Yes | x |
| title | text | Yes | x |
| summary | text | Yes | x |
| text | text | Yes | x |
| helpfulness_score | float4 | Yes | x |
| helpfulness_count | float4 | Yes | x |
| counter_id | int4 | No | nextval('reviews_counter_id_seq'::regclass) |

#### Accessing Spark
First we import the libraries we need and create the SQL context from the Spark context. The SQL context is the entry point into the relational functionality of our Spark cluster. We need the `time` library in order to use the `time.time()` function.

**In each of the cells below, make sure you wait for the [\*] to turn into a number before running the next cell. This code does take a while to come back.**

In [None]:
import time
from pyspark import SQLContext, SparkContext, SparkConf
# `sc` is the SparkContext that the notebook environment is expected to provide.
sqlContext = SQLContext(sc)


The code below creates a DataFrame from the SQLContext created above. The DataFrame is based on the contents of our JSON file, which holds the **Amazon Product Reviews Dataset** (the same content as the reviews database).

DataFrames allow us to manipulate and interact with structured data via a domain-specific language. Here, that domain-specific language is the DataFrame API, used from Python.

In [None]:
## EXCEPTION to the rule above: this cell often does not visibly come back and leaves the [*] in the execution counter to the left.
## The "if" block prevents you from re-reading the data into the variable if you already have! Saves time. Saves lives.
## Checking globals() first avoids a NameError the first time through, when `df` does not exist yet.
if "df" not in globals() or df is None:
    df = sqlContext.read.json("In/reviews.json")
else:
    df.printSchema()

In [None]:
# This prints the schema. 

df.printSchema()

In [None]:
# This is a basic query that shows twenty scores (a.k.a. ratings) of different movies and TV shows.
# The query itself does not specify how many rows to return; show() prints only the first twenty rows
# by default, which prevents the accidental retrieval of about thirty-four million rows.

# show() only prints and returns None, so there is nothing useful to assign to a variable here.
df.select("score").show()

In [None]:
# You can register a DataFrame as a table for the purpose of running an SQL query on the dataset.
# The column names in the query must match the schema printed above (for example, product_id).

sqlContext.registerDataFrameAsTable(df, "reviews")
df3 = sqlContext.sql("SELECT product_id, price FROM reviews LIMIT 5")
df3.collect()

In [None]:
# Here we will run a more time-consuming query since our dataset is gigantic. (When I ran this, it took almost 
# five minutes, and my internet connection is fairly good.)

start = time.time()
df4 = sqlContext.sql("SELECT COUNT(DISTINCT title) FROM reviews")
df4.collect()
end = time.time()
print(end - start)
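
The start/end timing pattern above can be wrapped in a small helper so that every timed query is measured the same way. This is only a sketch: the `timed` helper and the stand-in workload are hypothetical names introduced here, and `time.perf_counter()` is used instead of `time.time()` because it is intended for measuring elapsed intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the wrapped block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s")


# Stand-in workload so the sketch runs anywhere; in the notebook you would
# put a Spark action such as df4.collect() inside the with-block instead.
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

Because the elapsed time is printed in a `finally` clause, it is reported even if the query inside the block raises an error.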

In [None]:
# You should see a large difference between this time and the runtime of the Postgres query. (Mine was about half
# the time of the Postgres query.)

In [None]:
## Look at scores greater than 4
start = time.time()
# show() only prints and returns None, so its result is not assigned to a variable.
df.filter(df.score > 4).show()
end = time.time()
print(end - start)
df.printSchema()

In [None]:
# Let's look at helpfulness scores greater than 40
from pyspark.sql.functions import col

numeric_filtered = df.where(col('helpfulness_score') > 40)
numeric_filtered.show()

In [None]:
# Let's look at helpfulness scores greater than 40 and a score greater than 4
from pyspark.sql.functions import col

numeric_filtered = df.where(col('helpfulness_score') > 40)
numeric_filtered.show()

# Filter the already-filtered DataFrame so that both conditions apply.
numeric_filtered2 = numeric_filtered.where(col('score') > 4)
numeric_filtered2.show()

# I print both out so that you can see the first group has the helpfulness score greater than 40 and the second group
# meets that condition AND has a score greater than 4.

## Imagine Your Own Queries of the Amazon Reviews Database
What are some queries you might imagine for Amazon product reviews?
 - Is there a product keyword you would like to know the average rating for?
 - Which rating levels are the most useful? High or low?

In [None]:
### Insert your queries here:
