In [None]:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// For implicit transformation of RDDs to DataFrames
import sqlContext.implicits._

// For telling Spark to look in the local file system
import java.io._
def localpath(path: String): String = {
    "file://" + new java.io.File(".").getCanonicalPath + "/" + path
}

// For timing expression evaluation
def time[R](block: => R): R = {
    val start: Long = System.nanoTime()
    val result = block
    val end: Long = System.nanoTime()
    val duration: Double = (end - start) / 1000000000.0
    println("Elapsed time: " + duration + "s")
    result
}

println("Using Spark version " + sc.version)

In [None]:
/**
// If you have a Hive install, you can connect it to Spark:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'small_data/employer/hashmap.csv' INTO TABLE src")

// Queries are expressed in HiveQL
hiveContext.sql("FROM src SELECT key, value").collect().foreach(println)
*/

import math._

# Spark DataFrames


## Motivation and Spark SQL


Spark SQL is Spark's module for working with structured data and writing SQL queries. Newer versions support Hive, Parquet, and other data sources. [Docs](http://spark.apache.org/docs/latest/sql-programming-guide.html)

The key feature of Spark SQL is the use of DataFrames instead of RDDs. A DataFrame is a distributed collection of data organized into named columns. Operations on DataFrames first pass through a query optimizer, which streamlines the request and may even reorder it before execution. The keyword to search for here is Catalyst.

Under the hood, operations on DataFrames are boiled down to operations on RDDs, but the RDDs are created by the execution engine, and not directly by the user. It is also possible to convert RDDs to DataFrames and vice versa.

The Spark ML package, unlike MLlib, uses DataFrames as inputs and outputs.

**Question:** What is an example of a "bad" sequence of operations which should be reordered for optimal performance?

DataFrames are...

* Immutable, like RDDs
* Resilient, like RDDs (lineage is remembered)
* Lazily executed, like RDDs

So why do we care?


DataFrames are an abstraction that lets us think of data in a familiar form (a pandas DataFrame, an R data.frame, a SQL table, etc.).

We can use a similar API to RDDs!

Because the data has a known schema and a columnar layout, we get access to SQL-like optimizations and cost analysis.

What about type safety?

What are these UDF things?

In [None]:
val data = sc.parallelize((1 until 10001)).
    map(x => (random, random))

In [None]:
// This isn't always so easy. You may need to explicitly define a schema.
val df = data.toDF()
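
When `toDF()` cannot infer what you want, one option is to build the schema by hand and pass it to `createDataFrame`. The following is a minimal sketch, assuming the notebook's `sc` and `sqlContext` are in scope; the names `schema`, `rowRDD`, and `dfExplicit` and the two sample rows are made up for illustration.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Describe the columns explicitly: name, type, and nullability.
val schema = StructType(Seq(
  StructField("x", DoubleType, nullable = false),
  StructField("y", DoubleType, nullable = false)))

// createDataFrame with an explicit schema expects an RDD of Row objects.
// (Illustrative data, not the notebook's `data` RDD.)
val rowRDD = sc.parallelize(Seq((0.1, 0.9), (0.7, 0.2)))
  .map { case (x, y) => Row(x, y) }

val dfExplicit = sqlContext.createDataFrame(rowRDD, schema)
dfExplicit.printSchema
```

Unlike the `_1`, `_2` columns produced by `toDF()` on an RDD of tuples, the columns here carry the names given in the schema.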

In [None]:
df.printSchema

In [None]:
// Parquet format will be used by default
df
    .withColumnRenamed("_1", "x")
    .withColumnRenamed("_2", "y")
    .write
    .save("spark_parquet_demo")

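Try rerunning the above cell. It should fail, because the default save mode (`error`) refuses to write to a path that already exists.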

Save modes:
* error
* append
* overwrite
* ignore (i.e. CREATE TABLE IF NOT EXISTS)
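
The modes can also be passed as `SaveMode` constants instead of strings. Here is a small sketch of how three of them behave, assuming the notebook's `sc` and `sqlContext`; the path `spark_save_mode_demo` and the DataFrame `small` are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.SaveMode
import sqlContext.implicits._

val demoPath = "spark_save_mode_demo"  // hypothetical output path
val small = sc.parallelize(Seq((0.1, 0.2), (0.3, 0.4))).toDF("x", "y")

// Overwrite: replace whatever is at the path (2 rows on disk afterwards).
small.write.mode(SaveMode.Overwrite).parquet(demoPath)

// Append: add to the existing data (4 rows on disk afterwards).
small.write.mode(SaveMode.Append).parquet(demoPath)

// Ignore: the path already exists, so nothing is written.
small.write.mode(SaveMode.Ignore).parquet(demoPath)

val rowsOnDisk = sqlContext.read.parquet(demoPath).count()
println(rowsOnDisk)
```

Using `SaveMode.ErrorIfExists` at this point would throw, which is the same behavior as rerunning the default save.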

In [None]:
df.write.mode("ignore").format("parquet").save("spark_parquet_demo")

In [None]:
val dfp = sqlContext.read.load("spark_parquet_demo")

In [None]:
dfp.describe("x").show()

In [None]:
val filteredDF = dfp.filter(dfp("x") < 0.5)

In [None]:
filteredDF.count()

## Exploring the Catalyst Optimizer

In [None]:
filteredDF.explain(true)

In [None]:
val filteredDF = df.filter(df("_1") < 0.5)

In [None]:
filteredDF.explain(true)

In [None]:
val filteredDF = df.filter(df("_1") < 0.5).filter(df("_2") < 0.5)

In [None]:
filteredDF.explain(true)

In [None]:
val filteredDFP = dfp.filter(dfp("x") < 0.5).filter(dfp("y") < 0.5)

In [None]:
filteredDFP.explain(true)

Under the hood, it's just manipulating trees based on rules.
The introductory [blog post](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html) has good pictures.
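
To make "manipulating trees based on rules" concrete, here is a toy version in plain Scala, loosely modeled on the constant-folding example in the blog post. It is not Catalyst's actual code: `Expr`, `Literal`, `Attribute`, `Add`, and `transformUp` are simplified stand-ins defined here.

```scala
// A tiny expression tree.
sealed trait Expr
final case class Literal(value: Int) extends Expr
final case class Attribute(name: String) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

// Apply `rule` bottom-up: rewrite the children first, then the node itself.
// Nodes the rule does not match are left unchanged.
def transformUp(expr: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val withNewChildren = expr match {
    case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
    case leaf      => leaf
  }
  rule.applyOrElse(withNewChildren, (e: Expr) => e)
}

// A rule is just a pattern match: the sum of two literals becomes one literal.
val constantFolding: PartialFunction[Expr, Expr] = {
  case Add(Literal(a), Literal(b)) => Literal(a + b)
}

// x + (1 + 2)
val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
val optimized = transformUp(tree)(constantFolding)
println(optimized)  // Add(Attribute(x),Literal(3))
```

An optimizer built this way is a collection of such rules applied repeatedly until the tree stops changing.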


### Project Tungsten


* Explicit memory management (avoids the overhead of JVM objects and garbage collection)
* Cache-aware computation
* Codegen (compile queries into Java bytecode)

Cache-aware computation example:
* Case 1: pointer -> key, value
* Case 2: key, pointer -> key, value

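Sorting has to compare keys. In Case 1, every comparison follows a pointer to a record stored somewhere else in memory, which tends to cause cache misses. In Case 2, the key sits right next to the pointer, so the sort can read keys sequentially and find them faster.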
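
The two cases can be imitated in plain Scala. This is only a sketch of the idea, not Tungsten's implementation: `Record` and the sample data are made up, array indices stand in for pointers, and packing the key and index into one `Long` assumes non-negative `Int` keys.

```scala
final case class Record(key: Int, value: String)

val records = Array(
  Record(42, "d"), Record(7, "a"), Record(19, "c"), Record(7, "b"))

// Case 1: sort "pointers" (indices). Every comparison has to look up
// records(i) to find the key.
val byPointer: Array[Int] =
  records.indices.toArray.sortBy(i => records(i).key)

// Case 2: store the key next to the pointer. Key goes in the upper 32 bits
// and the index in the lower 32 bits of a Long, so sorting the primitive
// array never touches the records themselves.
val packed: Array[Long] = records.zipWithIndex.map {
  case (r, i) => (r.key.toLong << 32) | i.toLong
}
java.util.Arrays.sort(packed)
val byPrefix: Array[Int] = packed.map(p => (p & 0xFFFFFFFFL).toInt)

println(byPointer.mkString(", "))  // 1, 3, 2, 0
println(byPrefix.mkString(", "))   // 1, 3, 2, 0
```

Both approaches produce the same ordering; the difference is in how much memory each comparison has to touch.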

[More](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html)


### DataFrame performance and tuning


See [here](http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning) for details.

## SQL and DataFrames

In [None]:
// Requires Hive to permanently store tables
// In Spark 2.0+, registerTempTable is deprecated in favor of createOrReplaceTempView
dfp.registerTempTable("nums")  // This is NOT the same as a temp table in SQL proper
val sqlDF = sqlContext.sql("select x, y from nums where y > 0.9 limit 3")
sqlDF.show()

In [None]:
sqlDF.explain(true)
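
SQL strings and DataFrame method calls are two front ends to the same engine. As a rough illustration, assuming the notebook's `sc` and `sqlContext`, the sketch below runs one query both ways; `numsDemo`, the table name `nums_demo`, and the sample rows are invented for this example.

```scala
import sqlContext.implicits._

val numsDemo = sc.parallelize(Seq((0.1, 0.95), (0.2, 0.30), (0.3, 0.99)))
  .toDF("x", "y")
numsDemo.registerTempTable("nums_demo")

// The same query, once as SQL and once through the DataFrame API.
val viaSql = sqlContext.sql("select x, y from nums_demo where y > 0.9")
val viaApi = numsDemo.filter(numsDemo("y") > 0.9).select("x", "y")

// Compare the plans that Catalyst produces for each.
viaSql.explain(true)
viaApi.explain(true)
```

The parsed plans differ, but the optimized and physical plans should look very similar.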

*Reminder:* Check the UI (port 4040 by default) for tables in memory.

*Reminder:* A number of interactive tutorials are available on the Databricks [community cloud](https://community.cloud.databricks.com). I highly recommend making an account and checking out the guide.

This is also a good place to learn about connecting to databases like Cassandra or connecting over JDBC.

## Adding columns and functions


Because DataFrames are immutable, adding new information means creating a new DataFrame with extra columns appended to an existing one.

In [None]:
// Currying lets us specify some of a function's arguments and delay specifying the rest until later.
// Remember how an argument can be passed by value or by name (unevaluated)? `f: => Int`

def prediction(threshold: Double)(num: Double): Int = {
    if (num >= threshold) 1 else 0
}
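
Both ideas from the comments above can be tried in plain Scala without Spark. In this sketch, `labelAbove` is a hypothetical helper with the same shape as `prediction`, and `twice` is a made-up function that takes a by-name parameter.

```scala
// Two parameter lists: the threshold first, the value to test second.
def labelAbove(threshold: Double)(num: Double): Int =
  if (num >= threshold) 1 else 0

// Supplying only the first list yields an ordinary function Double => Int.
val aboveHalf: Double => Int = labelAbove(0.5) _
val labels = Seq(0.1, 0.5, 0.9).map(aboveHalf)
println(labels)  // List(0, 1, 1)

// A by-name parameter (`f: => Int`) is re-evaluated each time it is used.
var evaluations = 0
def twice(f: => Int): Int = f + f
val total = twice { evaluations += 1; 21 }
println(total + " after " + evaluations + " evaluations")  // 42 after 2 evaluations
```

The partially applied function is what gets wrapped in a UDF: a UDF for one column needs a one-argument function, and currying fixes the threshold ahead of time.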

In [None]:
import org.apache.spark.sql.functions.udf

val xLabelizer = udf(prediction(0.5) _)
val yLabelizer = udf(prediction(0.9) _)

In [None]:
val newDF = dfp.withColumn("xLabel", xLabelizer(dfp("x"))).withColumn("yLabel", yLabelizer(dfp("y")))

In [None]:
newDF.show()
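
A UDF is a black box to the optimizer, so when a built-in column function can express the same logic it is often worth using that instead. The sketch below produces the same kind of labels with `when` / `otherwise` from `org.apache.spark.sql.functions`, assuming the notebook's `sc` and `sqlContext`; `points` and `labeled` are illustrative names with made-up data.

```scala
import org.apache.spark.sql.functions.{col, when}
import sqlContext.implicits._

val points = sc.parallelize(Seq((0.2, 0.95), (0.7, 0.10))).toDF("x", "y")

// Same thresholds as xLabelizer and yLabelizer, expressed as column expressions.
val labeled = points
  .withColumn("xLabel", when(col("x") >= 0.5, 1).otherwise(0))
  .withColumn("yLabel", when(col("y") >= 0.9, 1).otherwise(0))

labeled.show()
```

Running `labeled.explain(true)` shows the conditions as expressions inside the plan, rather than as an opaque function call.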

## Type safety and Datasets

In [None]:
val rdd = newDF.rdd
val row = rdd.take(1)
row

In [None]:
row.getClass.getName

In [None]:
// Remember that `take` always returns an Array of results
val r = row(0)
println(r.schema)

// The fields are the column names by default
r.fieldIndex("yLabel")

In Python, we're not too worried about type safety. In Scala and Java, however, it's important to note that `Row` objects do not carry the static types of the values inside them, so type safety can be lost when converting from RDDs to DataFrames. [Datasets](http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) (fleshed out in Spark 2.0) are a newer incarnation of DataFrames that add encoding information to preserve that type safety.

In Spark 2.0+, `DataFrame` is just a type alias for `Dataset[Row]`, and a `Row` is a light wrapper around `Array[Any]`, so a wrong type assumption is only caught at runtime.
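
The kind of problem an `Array[Any]` invites can be seen in plain Scala. In this sketch, `untyped` is a stand-in for the contents of a `Row` (it is not Spark's `Row` class), and `TypedObs` is a hypothetical case class used for contrast.

```scala
import scala.util.Try

// Stand-in for a Row: the compiler only knows each element is `Any`.
val untyped: Array[Any] = Array(0.25, 0.75, 0, 1)

// A correct cast works...
val x = untyped(0).asInstanceOf[Double]

// ...but a wrong one also compiles, and only fails when it runs.
val wrongCast = Try(untyped(2).asInstanceOf[String].length)
println(wrongCast.isFailure)  // true

// With a case class, the field types are known at compile time.
final case class TypedObs(x: Double, y: Double, xLabel: Int, yLabel: Int)
val typed = TypedObs(0.25, 0.75, 0, 1)
val label: Int = typed.xLabel  // `typed.xLabel.length` would not compile
```

Encoders give Datasets the second behavior: the element type is part of the Dataset's type, so mistakes like this are caught by the compiler.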

We can redefine the encoding with `.as`, specifying either a tuple of types or a case class.

In [None]:
// The "old way" of doing it...
// We can always drill into Row objects to extract the information we want.
println(r(0).getClass.getName + ": " + r(0))
println(r(2).getClass.getName + ": " + r(2))

// Or by field name
println("Trying to get the field 'x' as Double... " + r.getAs[Double]("x"))
println("Trying to get the field 'xLabel' as Int... " + r.getAs[Int]("xLabel"))

In [None]:
// Datasets are often easier
// Note that going to a lower-precision type won't be allowed
val newDS = newDF.as[(Double, Double, Float, Long)]
newDS.printSchema

In [None]:
val newRDD: org.apache.spark.rdd.RDD[(Double, Double, Float, Long)] = newDS.rdd

In [None]:
val dsrow = newRDD.take(1)(0)
dsrow.getClass

In [None]:
// No longer a Row
println(dsrow._2.getClass.getName)
println(dsrow._2)
println(dsrow._4.getClass.getName)
println(dsrow._4)

### Using case classes

In [None]:
// Matching is done by column name
// Note that again, you are responsible for making this type matching make sense.
case class Observation(
    x: Double,
    y: Double,
    xLabel: Float,
    yLabel: Long)

In [None]:
val ccDS = newDF.as[Observation]
val ccRDD = ccDS.rdd
ccRDD.getClass

In [None]:
ccRDD.map(obs => (obs.x, obs.xLabel)).take(2)

*Copyright &copy; 2018 The Data Incubator.  All rights reserved.*