### Spark Cluster Performance
The purpose of this notebook is to demonstrate the advantages of a Spark cluster query over a Postgres database query. It is best to complete the entire notebook in one sitting. If you leave it and come back later, you will most likely experience glitches in the server.
It is also important to run the cells in order, top to bottom, and WAIT until one cell has finished before starting the next.

In [None]:
# Importing libraries. We need the time library to use the time.time() function.
# We then create the SQL context from the Spark context (sc), which is the entry point into the relational
# functionality of our Spark cluster.

import time
from pyspark import SQLContext, SparkContext, SparkConf
sqlContext = SQLContext(sc)


The code below creates the DataFrame from the SQLContext created above.  The DataFrame is based on the contents of our JSON file. 

This file contains the content from the **Amazon Product Reviews Dataset**. DataFrames allow us to manipulate and interact with structured data via a domain-specific language, which we use here through its Python API.

In [None]:
# This creates the DataFrame from the SQLContext based on the contents of our JSON file, which holds the
# review data.

df = sqlContext.read.json("In/reviews.json")

In [None]:
# This prints the schema. 

df.printSchema()

In [None]:
# This is a basic query that selects scores (a.k.a. ratings) of different movies and TV shows. The query itself
# does not specify how many rows to return; show() prints only the first twenty by default, which prevents the
# accidental retrieval of about thirty-four million rows.
# show() returns None, so we keep the selected DataFrame in df2 and call show() on it separately.

df2 = df.select("score")
df2.show()
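
Because `show()` only prints rows and returns `None`, it cannot be used when the rows themselves are needed in Python. A minimal sketch of one alternative follows; the helper name `preview_column` is hypothetical and not part of the original notebook. It relies only on the DataFrame methods `select`, `limit`, and `collect`.

```python
def preview_column(df, column, n=20):
    """Return up to n rows of one column as a Python list.

    Hypothetical helper: unlike show(), which prints and returns None,
    this hands the rows back to the caller. limit(n) keeps the result
    small so that collect() does not pull the whole dataset to the driver.
    """
    return df.select(column).limit(n).collect()
```

In this notebook, `preview_column(df, "score", 5)` would return a list of at most five rows from the `score` column.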

In [None]:
# You can register a DataFrame as a table for the purpose of running an SQL query on the dataset. 

sqlContext.registerDataFrameAsTable(df, "reviews")
df3 = sqlContext.sql("SELECT productId, price FROM reviews LIMIT 5")
df3.collect()

In [None]:
# Here we will run a more time-consuming query since our dataset is gigantic. (When I ran this, it took almost 
# five minutes, and my internet connection is fairly good.)

start = time.time()
df4 = sqlContext.sql("SELECT COUNT(DISTINCT title) FROM reviews")
df4.collect()
end = time.time()
print(end - start)
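
Since the same start/stop pattern is needed for both the Spark query and the Postgres query, it may help to wrap it in a small function. The following is a sketch only: the name `time_call` is hypothetical, and it uses `time.perf_counter()` rather than `time.time()` because that clock is designed for measuring intervals. A stand-in workload is timed so the example runs without a cluster.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock intended for interval timing
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload (summing a range) so this sketch runs anywhere.
result, seconds = time_call(sum, range(1_000_000))
print(f"result={result}, elapsed={seconds:.3f}s")
```

To time the Spark query this way, the call to `collect()` has to sit inside the timed function, for example `time_call(lambda: sqlContext.sql("SELECT COUNT(DISTINCT title) FROM reviews").collect())`. Spark evaluates lazily, so `sqlContext.sql(...)` alone only builds the query plan and returns almost immediately.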

In [None]:
# You should see a large difference between this time and the runtime of the Postgres query. (My Spark runtime was
# about half that of the Postgres query.)