# Spark SQL

In this example, we demonstrate interactive data analysis using text files, Spark SQL, dataframes, and functional programming. Spark SQL jobs typically involve a few key steps:

1. Connect to the environment
2. Create a Spark Context
3. Create a Spark SQL context
4. Read in the file to be analyzed
5. Transform it into a dataframe
6. Begin to query that dataframe using Spark SQL.

Detailed documentation can be found in the [Spark SQL programming guide](http://spark.apache.org/docs/latest/sql-programming-guide.html) on the Apache website. Note: this is for Spark 1.5.2.

Lastly, when using Hive, the best practice is to skip the 4th and 5th steps by connecting to Hive tables directly.
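As a rough sketch of that shortcut, assuming a Spark build with Hive support and a table that already exists in the Hive metastore, a dataframe can be obtained straight from the table. The helper name load_hive_table and the table name used in the usage note are hypothetical and not part of the original example.

```python
def load_hive_table(sc, table_name):
    """Return a dataframe backed by an existing Hive table.

    Assumes Spark was built with Hive support and that table_name
    already exists in the Hive metastore.
    """
    # Imported here so the helper can be defined even where Hive support is absent.
    from pyspark.sql import HiveContext

    hive_context = HiveContext(sc)
    # The schema comes from the metastore, so no text parsing or Row mapping is needed.
    return hive_context.table(table_name)
```

Once a Spark context exists, a call such as load_hive_table(sc, "clickinfo") would replace the read and transform steps, provided a Hive table with that name is available.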

In [1]:
from IPython.display import Image
import getspark
from pyspark import SparkContext 
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
sc = SparkContext( 'local', 'pyspark')
sqlContext = SQLContext(sc)

As with previous examples, we begin by reading in the text file for the click data. This version is in CSV format. After it is read in as an RDD, the next step is to split each line by its delimiter, then remove the header and check the resulting headerless RDD.

In [2]:
#Define the lineage graph for the rdd from the clickinfo text file
rdd = sc.textFile(r"C:\Spark\clickinfo.csv")

In [3]:
#Split it by its delimiter
rdd = rdd.map(lambda line: line.split(","))  # split each line on commas (a lazy transformation)

In [4]:
#Strip out the header
header = rdd.first()  # extract the header row (an action)
data = rdd.filter(lambda x: x != header)  # keep every row that is not the header

In [5]:
#Check out your fancy new rdd
data.take(5)

[[u'1', u'0', u'3', u'1'],
 [u'2', u'0', u'3', u'1'],
 [u'3', u'0', u'3', u'1'],
 [u'4', u'0', u'3', u'1'],
 [u'5', u'0', u'11', u'1']]
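The split and header-filter logic can be checked without Spark. The following is a minimal plain-Python sketch of the same two steps, assuming a small in-memory sample in place of clickinfo.csv. The header text is an assumption (the original file's header is not shown); the data rows mirror the output above.

```python
# Hypothetical sample standing in for the lines of clickinfo.csv.
# The header text is assumed; the data rows mirror the output shown above.
lines = [
    "clicks,gender,impression,signedin",
    "1,0,3,1",
    "2,0,3,1",
    "5,0,11,1",
]

# Same logic as rdd.map(lambda line: line.split(","))
rows = [line.split(",") for line in lines]

# Same logic as rdd.first() followed by rdd.filter(lambda x: x != header)
header = rows[0]
data = [row for row in rows if row != header]

print(data)
```

Two caveats follow from this logic: the filter drops every row equal to the header, not just the first one, and a plain split on commas does not handle quoted fields that contain commas.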

### Why go through the trouble?

* Schema = table + columns + types
* Column names to index by, instead of positions
* Leverage SQL and relational theory

### Why not schemas and SQL?

* They impose structure on your data
* Fragility

When reviewing our schema, note that one of the columns is a flag where male == 1 and female == 0. To ease interpretation in reports, we can find and replace the values with male and female respectively. Note: currently only strings exist in the RDD; in the next step they will be cast to specific types such as ints, floats, strings, etc. The Row class in Spark SQL is helpful because it allows you to programmatically define the schema up front.

In [6]:
#Rather than loading the file through the SQL context directly, we build each Row ourselves and define the schema as we go
def genmap(gender):
    if gender == '0':
        return "Female"
    else:
        return "male"

In [7]:
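#Cast each field to its type, map the gender flag to a label, and convert the RDD of Rows to a dataframe
dfmappd = data.map(lambda line: Row(clicks=int(line[0]),
                                    gender=genmap(line[1]),
                                    impression=int(line[2]),
                                    signedin=int(line[3]))).toDF()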
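Calling toDF() on Row objects lets Spark infer the column types. An alternative, sketched here on the assumption that you want explicit control over those types, is to build a StructType and pass it to createDataFrame. The helper names to_typed_tuple and build_click_schema are hypothetical; genmap is repeated only so the block is self-contained.

```python
def genmap(gender):
    # Same mapping as the genmap function defined earlier in this example.
    if gender == '0':
        return "Female"
    else:
        return "male"


def to_typed_tuple(line):
    """Cast one split CSV line (a list of strings) to (int, str, int, int)."""
    return (int(line[0]), genmap(line[1]), int(line[2]), int(line[3]))


def build_click_schema():
    """Build an explicit schema whose field order matches to_typed_tuple."""
    # Imported lazily so the pure-Python helpers above work without Spark installed.
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    return StructType([
        StructField("clicks", IntegerType(), True),
        StructField("gender", StringType(), True),
        StructField("impression", IntegerType(), True),
        StructField("signedin", IntegerType(), True),
    ])


print(to_typed_tuple(['1', '0', '3', '1']))
```

With the sqlContext and data objects from the cells above, sqlContext.createDataFrame(data.map(to_typed_tuple), build_click_schema()) should then yield a dataframe with the declared integer columns rather than inferred ones.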

With the dataframe created, we can review the schema and take a topline sample of the data. Once we have confirmed that the schema matches the desired format, the next step is to register a temporary table so that we can execute SQL queries against the data as we would against any normal SQL table.

In [8]:
dfmappd.printSchema() # Prints the schema of the dataframe

root
 |-- clicks: long (nullable = true)
 |-- gender: string (nullable = true)
 |-- impression: long (nullable = true)
 |-- signedin: long (nullable = true)



In [9]:
dfmappd.show(5) # Shows a snippet of the data frame

+------+------+----------+--------+
|clicks|gender|impression|signedin|
+------+------+----------+--------+
|     1|Female|         3|       1|
|     2|Female|         3|       1|
|     3|Female|         3|       1|
|     4|Female|         3|       1|
|     5|Female|        11|       1|
+------+------+----------+--------+
only showing top 5 rows



In [10]:
dfmappd.registerTempTable("clickinfo")

In [11]:
sqlContext.sql("""SELECT gender, clicks, SUM(impression) as impressions 
                  FROM clickinfo 
                  GROUP BY gender, clicks 
                  ORDER BY SUM(impression) DESC""").show(5)

+------+------+-----------+
|gender|clicks|impressions|
+------+------+-----------+
|Female| 16063|         18|
|Female| 31110|         17|
|Female| 56240|         16|
|  male| 25224|         16|
|Female| 51154|         16|
+------+------+-----------+
only showing top 5 rows
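To make the semantics of the query concrete, here is a minimal plain-Python sketch of what GROUP BY, SUM, and ORDER BY ... DESC compute. It runs on a handful of hypothetical rows, not on clickinfo.csv, and is not how Spark executes the query.

```python
from collections import defaultdict

# Hypothetical (gender, clicks, impression) rows, not taken from clickinfo.csv.
rows = [
    ("Female", 1, 3),
    ("Female", 1, 4),
    ("male", 2, 11),
    ("Female", 2, 2),
]

# GROUP BY gender, clicks with SUM(impression)
totals = defaultdict(int)
for gender, clicks, impression in rows:
    totals[(gender, clicks)] += impression

# ORDER BY SUM(impression) DESC
result = sorted(
    ((gender, clicks, total) for (gender, clicks), total in totals.items()),
    key=lambda r: r[2],
    reverse=True,
)

print(result)
```

Each distinct (gender, clicks) pair becomes one output row, which is why the same gender can appear several times in the result above.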

