<style>
    div.container {
      max-width: 800px!important;
    }
</style>

# Spark SQL Basics

Spark SQL provides the means for working with structured data within Apache Spark.  Structured data is represented by the `DataFrame` abstraction (which is a type alias for `Dataset[Row]`), and we can act on a `DataFrame` using either familiar-looking SQL queries or the `DataFrame` API.  In this lesson, we cover `DataFrame` basics, including:

* creating `DataFrame`s in code
* creating `DataFrame`s from external sources (CSV, Parquet, Hive, PostgreSQL, etc.)
* manipulating and summarising `DataFrame`s using both SQL and the `DataFrame` API

## Preliminaries

This workbook makes use of the [Almond Scala kernel for Jupyter](https://almond.sh/).  To use Spark, we have to first add a few libraries to the classpath, which we can do as follows:

In [1]:
def init: Unit = {
  import ammonite.ops._
  val jars = ls! root/'opt/'spark/'jars |? (_.ext == "jar")
  jars.foreach(interp.load.cp(_))   
}

init

defined [32mfunction[39m [36minit[39m

Spark is also pretty verbose with respect to logging, so it can be useful to change the logging policy to de-clutter our outputs:

In [2]:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.ERROR)

[32mimport [39m[36morg.apache.log4j.{Level, Logger}
[39m

And just to get it out of the way up front, we import a number of objects that we need throughout the rest of the document:

In [3]:
import org.apache.spark.sql._

[32mimport [39m[36morg.apache.spark.sql._[39m

Finally, a code block will sometimes produce a large amount of output, much of it unimportant, which obscures the result we care about.  To hide this, we sometimes wrap things in an object like so:

```scala
object foo {
  val x = 1
  val y = 2
}
import foo._  // bring x and y into scope

x + y
```

The object `foo` serves no functional purpose here other than to hide the interpreter output that results from the assignment of `x` and `y`.

## Creating a `SparkSession`

As of Spark 2.x, the usual method of interacting with Spark is to create a `SparkSession`, which functions as a single entry point.  In our case, we do this as follows:

In [4]:
val spark = SparkSession
  .builder
  .config("hive.metastore.uris","thrift://localhost:9083") 
  .config("spark.sql.warehouse.dir", "/data/hive/warehouse")
  .master("local[*]")
  .appName("Spark SQL Basics")
  .enableHiveSupport()
  .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties


[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@45be1e88

Here we explicitly configure our session to use Hive by setting values for `hive.metastore.uris` and `spark.sql.warehouse.dir`.  It is common for this to work automatically by configuration (via the file `hive-site.xml`), but this does not appear to be the case in this context.  We also tell Spark to work in pseudo-distributed mode by setting `master` to `local[*]`.  Various other configurations are possible, but they are out of scope here.  See [Configuration - Spark 2.4.3 Documentation](https://spark.apache.org/docs/latest/configuration.html) for details.
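
One way to check that such settings took effect is to read them back from the running session.  The following is a minimal sketch rather than part of the original workbook: the application name and the `/tmp/spark-warehouse-sketch` path are hypothetical, and Hive support is left out so that it runs without a metastore.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; getOrCreate returns an existing session if
// there is one, in which case the values below are those of that session.
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("config-check-sketch")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse-sketch")
  .getOrCreate()

// Read the settings back from the live session.
val warehouseDir = spark.conf.get("spark.sql.warehouse.dir")
val master = spark.sparkContext.master

println(s"master = $master, warehouse = $warehouseDir")
```

Depending on the Spark version, the warehouse directory may be reported as a qualified URI (for example with a `file:` prefix) rather than the bare path that was supplied.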

When working with Spark SQL, it is very common to use the object `spark.sparkContext`.  So for convenience, we also assign this to a variable, commonly `sc`, as follows:

In [5]:
val sc = spark.sparkContext

[36msc[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32mSparkContext[39m = org.apache.spark.SparkContext@457deaa1

Note that we already reduced the amount of logging above.  We can also do this via the `SparkContext` object by running `sc.setLogLevel("ERROR")`, but then we'd still be subjected to the logging that occurs as a result of creating the `SparkSession` itself.
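
For completeness, the `setLogLevel` route looks roughly like this.  It is a sketch under the assumption that a local session is acceptable; the application name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("log-level-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Only affects logging from this point on; messages emitted while the
// session was being created have already been printed.
// Valid levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN.
sc.setLogLevel("ERROR")
```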

## Resilient Distributed Dataset (`RDD`)

As noted, `DataFrame` is the central data abstraction when working with structured data.  However, `DataFrame`s build on an earlier abstraction called the Resilient Distributed Dataset (`RDD`), and one will still have occasion to use it.  An `RDD` is essentially just a normal Scala collection that's been parallelised for use with Spark.  For example:

In [6]:
val beatles = Seq("John", "Paul", "Ringo", "George")
val distributedBeatles = spark.sparkContext.parallelize(beatles)

[36mbeatles[39m: [32mSeq[39m[[32mString[39m] = [33mList[39m([32m"John"[39m, [32m"Paul"[39m, [32m"Ringo"[39m, [32m"George"[39m)
[36mdistributedBeatles[39m: [32morg[39m.[32mapache[39m.[32mspark[39m.[32mrdd[39m.[32mRDD[39m[[32mString[39m] = ParallelCollectionRDD[0] at parallelize at cmd5.sc:2

We can treat the resulting `RDD` in much the same way as the original collection, but the `RDD` will be worked on in parallel.  This means that the order in which entries are processed when iterating over an `RDD` is not stable.  Because of this, certain operations that require a strictly ordered sequence, like `head` and `tail`, are not available.  The lack of a stable ordering is easily demonstrated:

In [7]:
// stable order for Scala Seq type
println(beatles.fold("")(_ + _))
println(beatles.fold("")(_ + _))

JohnPaulRingoGeorge
JohnPaulRingoGeorge


In [8]:
// but not for a parallelized collection
println(distributedBeatles.fold("")(_ + _))
println(distributedBeatles.fold("")(_ + _))

RingoPaulJohnGeorge
JohnPaulGeorgeRingo
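
String concatenation is not commutative, which is why the result above depends on the order in which partition results are merged.  The sketch below, which is an illustration rather than part of the original workbook, shows that a commutative and associative operation gives a stable answer, and that `take` and `collect` offer `head`-like access.  The partition count of 4 and the application name are arbitrary assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("rdd-order-sketch")
  .getOrCreate()

val names = Seq("John", "Paul", "Ringo", "George")
// Ask for 4 partitions explicitly so the work really is split up.
val rdd = spark.sparkContext.parallelize(names, 4)

// Integer addition is associative and commutative, so the order in which
// partition results are merged cannot change the answer.
val totalLength = rdd.map(_.length).fold(0)(_ + _)

// collect() gathers the partitions in index order, which for a parallelized
// Seq matches the order of the original collection.
val collected = rdd.collect().toSeq

// take(n) stands in for head-like access on an RDD.
val firstTwo = rdd.take(2).toSeq

println(totalLength)
```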


## Creating `DataFrame`s Programmatically

There are a number of ways we can create a `DataFrame`.  Since we just introduced `RDD`s, let us first demonstrate how we can create a `DataFrame` from an `RDD`: 

In [9]:
val data = Seq(
  ("George", "Harrison"),
  ("Ringo", "Starr"),
  ("John", "Lennon"),
  ("Paul", "McCartney")
)

data: Seq[(String, String)] = List(
  ("George", "Harrison"),
  ("Ringo", "Starr"),
  ("John", "Lennon"),
  ("Paul", "McCartney")
)

In [10]:
import spark.implicits._

sc
  .parallelize(data)
  .toDF("firstName", "lastName")
  .show

+---------+---------+
|firstName| lastName|
+---------+---------+
|   George| Harrison|
|    Ringo|    Starr|
|     John|   Lennon|
|     Paul|McCartney|
+---------+---------+



[32mimport [39m[36mspark.implicits._

[39m
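
Two other common routes are sketched below: converting a local `Seq` directly, and pairing an `RDD[Row]` with an explicit schema.  This is an illustrative sketch, not the workbook's own code; the variable names, the application name, and the two-row sample are assumptions.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("create-df-sketch")
  .getOrCreate()
import spark.implicits._

val people = Seq(("George", "Harrison"), ("Ringo", "Starr"))

// 1. Straight from a local Seq of tuples; no intermediate RDD is required.
val fromSeq = people.toDF("firstName", "lastName")

// 2. From an RDD[Row] plus an explicit schema, which helps when the column
//    names and types are only known at runtime.
val schema = StructType(Seq(
  StructField("firstName", StringType, nullable = false),
  StructField("lastName", StringType, nullable = false)
))
val rows = spark.sparkContext.parallelize(people.map { case (f, l) => Row(f, l) })
val fromRows = spark.createDataFrame(rows, schema)

fromRows.printSchema()
```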

## Creating `DataFrame`s from External Sources

`DataFrame`s can also be created from external sources such as CSV, Parquet, Hive, and PostgreSQL.
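
As a minimal sketch of reading file-based sources, the example below writes a tiny `DataFrame` to a temporary directory as CSV and Parquet and reads both back.  The paths, variable names, and sample rows are hypothetical and chosen only to keep the example self-contained.  Hive and PostgreSQL are not shown because they need a running metastore or database.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("external-sources-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical scratch location; any path or URI Spark can reach would do.
val tmp = Files.createTempDirectory("spark-sql-basics").toString

// Write a small DataFrame out so there is something to read back.
val original = Seq(("AA", "American Airlines Inc."), ("DL", "Delta Air Lines Inc."))
  .toDF("carrier", "name")
original.write.option("header", "true").csv(s"$tmp/airlines_csv")
original.write.parquet(s"$tmp/airlines_parquet")

// CSV carries no type information, so we say that the first line is a
// header and ask Spark to infer the column types.
val fromCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(s"$tmp/airlines_csv")

// Parquet files store their own schema, so no options are needed.
val fromParquet = spark.read.parquet(s"$tmp/airlines_parquet")

fromCsv.show(false)
```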

## Executing SQL 

In [3]:
import spark.implicits._
import spark.sql

import spark.implicits._
import spark.sql


In [10]:
sql("show databases").show()

+------------+
|databaseName|
+------------+
|     default|
|  nycflights|
+------------+



In [11]:
sql("SELECT * FROM nycflights.airlines").show()

+-------+--------------------+
|carrier|                name|
+-------+--------------------+
|carrier|                name|
|     9E|   Endeavor Air Inc.|
|     AA|American Airlines...|
|     AS|Alaska Airlines Inc.|
|     B6|     JetBlue Airways|
|     DL|Delta Air Lines Inc.|
|     EV|ExpressJet Airlin...|
|     F9|Frontier Airlines...|
|     FL|AirTran Airways C...|
|     HA|Hawaiian Airlines...|
|     MQ|           Envoy Air|
|     OO|SkyWest Airlines ...|
|     UA|United Air Lines ...|
|     US|     US Airways Inc.|
|     VX|      Virgin America|
|     WN|Southwest Airline...|
|     YV|  Mesa Airlines Inc.|
+-------+--------------------+
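
The first data row in the output above repeats the column names.  The sketch below shows one way such a row could be filtered out, written once as SQL and once with the `DataFrame` API.  It uses a small in-memory stand-in for `nycflights.airlines` so that it runs without Hive; the view name `airlines_sketch` and the sample rows are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("sql-vs-api-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for nycflights.airlines, including a stray header row.
val airlines = Seq(
  ("carrier", "name"),
  ("9E", "Endeavor Air Inc."),
  ("AA", "American Airlines Inc."),
  ("AS", "Alaska Airlines Inc.")
).toDF("carrier", "name")

// Registering a temporary view makes the DataFrame visible to SQL.
airlines.createOrReplaceTempView("airlines_sketch")

// SQL version: drop the repeated header row and sort by carrier code.
val viaSql = spark.sql(
  "SELECT carrier, name FROM airlines_sketch WHERE carrier <> 'carrier' ORDER BY carrier"
)

// DataFrame API version of the same query.
val viaApi = airlines
  .filter($"carrier" =!= "carrier")
  .orderBy($"carrier")
  .select("carrier", "name")

viaSql.show(false)
```

Both forms describe the same query, so the choice between them is largely a matter of taste and of whether the query is assembled programmatically.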



In [None]:
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
val warehouseLocation = "/data/hive/warehouse"