# Spark Basics
Authors: Daniel Hinojosa

## Spark Architecture

![Spark Architecture](../images/spark_architecture.png)

## Reason for Spark’s Existence
* CPUs went from single-core to multi-core
* Hard drive storage became cheap over time
* Together, these make it affordable to store and process large amounts of data

## Spark Architecture
* A single computer is too slow to process large amounts of data
* A cluster can process that data faster by working in parallel
* A Spark application is split into:
  * A _driver_ process
  * A set of _executor_ processes

## The Driver
* The process that runs the `main()` function of your application
* Maintains information about the application
* Responds to external programs
* Analyzes work across executors
* Distributes work across executors
* Schedules work across executors

## The Executor

* Executes code assigned to it by the driver
* Reports the state of the computation back to the driver

## Spark Extras

* **_MLlib_** - Machine Learning with Spark
* **_GraphX_** - for Graph Processing
* **_SparkR_** - for working with Clusters using R

## Cluster Manager

* Spark splits work into pieces and distributes them, so it needs multiple machines
* It therefore needs a cluster manager to manage the remote nodes; running this way is known as cluster mode
* Controls the physical machines
* Allocates resources to Spark applications
* The cluster manager can be one of:
  * Spark's built-in standalone cluster manager
  * YARN
  * Mesos

## Local Mode

* Instead of remote machines, everything runs on your own machine
* Convenient for testing and in-house demonstrations
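
As a minimal sketch of local mode, the session below is created with `master("local[*]")`, which runs the whole application inside a single JVM on your machine. The application name `local-mode-demo` is made up for illustration. In a notebook or `spark-shell` a `spark` session usually exists already, in which case `getOrCreate()` simply returns it.

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a session that runs entirely on this machine.
// "local[*]" asks for one worker thread per available CPU core.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-mode-demo") // hypothetical name, used only for this example
  .getOrCreate()

// The master URL tells you which mode the application is running in.
val master = spark.sparkContext.master
println(s"Running with master = $master")

// A tiny job to confirm that work is actually being executed.
// range(1, 100) produces the values 1 to 99, since the end is exclusive.
val rowCount = spark.range(1, 100).count()
println(s"Counted $rowCount rows")
```

Switching to cluster mode is mostly a matter of pointing the master at a cluster manager instead of `local[*]`; the rest of the code stays the same.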

## Languages Supported

* Scala (Spark's default language)
* Python (does nearly everything that Scala does)
* Java (more verbose than Scala)
* SQL (Spark SQL lets you query data with standard SQL)
* R with SparkR (the classic Big Data language)

For this class/workshop we will be using Scala, since it is less verbose than Java and has features that Java does not have. It is also Spark's default language.

## Spark Intro
* Big data processing framework
* Variety of packages built upon Spark engine
* Contains two APIs
  * _Unstructured API_
    * Lower Level
    * `RDD`
    * Accumulators
    * Broadcast Variables
  * _Structured API_
    * Higher Level
    * Optimized
    * `DataFrame`
    * `Dataset`
    * `Spark SQL`
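
To make the difference between the two APIs concrete, here is a small sketch that computes the squares of the even numbers from 1 to 10 twice: once with an `RDD` and once with a `DataFrame`. The variable names and the application name are made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In a notebook this returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("two-apis-demo") // hypothetical name
  .getOrCreate()

// Unstructured API: an RDD of plain Scala values, transformed with
// ordinary functions that Spark treats as opaque.
val evenSquaresRdd = spark.sparkContext
  .parallelize(1 to 10)
  .filter(n => n % 2 == 0)
  .map(n => n * n)
  .collect()
  .toSeq
  .sorted

// Structured API: a DataFrame with a named column, transformed with
// column expressions that the optimizer can inspect.
val evenSquaresDf = spark.range(1, 11)
  .toDF("n")
  .where(col("n") % 2 === 0)
  .select((col("n") * col("n")).alias("square"))
  .collect()
  .map(_.getLong(0))
  .toSeq
  .sorted

println(evenSquaresRdd)
println(evenSquaresDf)
```

Both versions produce the same numbers. The structured version describes what to compute in terms of columns, which is what lets Spark optimize it.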

## You want to use `DataFrame`, `Dataset`, and Spark SQL over `RDD`

Source: High Performance Spark
by Holden Karau; Rachel Warren
Published by O'Reilly Media, Inc., 2017

![execution_time.png](../images/execution_time.png)

## `DataFrame`

* The most efficient option, thanks to the Catalyst optimizer
* Available in all supported languages
* A table of data with rows and columns
* Analogous to a spreadsheet or a database table
* Distributed: it spans multiple machines
* Easiest to use, particularly for non-functional programmers

In [None]:
val dataFrame = spark.range(1, 100)
                .toDF("mappedRange")

## `show()`

* Prints the contents of the `DataFrame`
* Shows the first 20 rows by default
* Pass a row count to change that, for example `show(40)`

In [3]:
dataFrame.show(40)

+-----------+
|mappedRange|
+-----------+
|          1|
|          2|
|          3|
|          4|
|          5|
|          6|
|          7|
|          8|
|          9|
|         10|
|         11|
|         12|
|         13|
|         14|
|         15|
|         16|
|         17|
|         18|
|         19|
|         20|
|         21|
|         22|
|         23|
|         24|
|         25|
|         26|
|         27|
|         28|
|         29|
|         30|
|         31|
|         32|
|         33|
|         34|
|         35|
|         36|
|         37|
|         38|
|         39|
|         40|
+-----------+
only showing top 40 rows



This is a download from Kaggle.com called the delay flights dataset located at https://www.kaggle.com/giovamata/airlinedelaycauses/downloads/airlinedelaycauses.zip/2

## `spark.read`
* Reads data from a file system
* You should specify the file format
* Preferably reads from a distributed file system such as HDFS
* Uses `load` to load the data from a location, for example:
  * Use `"hdfs://"` to load from HDFS
  * Use `"s3a://"` to load from S3 on AWS
* Here we will use the local file system
* **Question**: What is wrong with the results? Hint: You may want to view the [data file](../data/books.csv)

In [2]:
val booksDF = spark.read
                   .format("csv")
                   .load("../data/books.csv")
booksDF.show()

+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|   _c0|                 _c1|                 _c2|           _c3|       _c4|          _c5|          _c6|        _c7|          _c8|               _c9|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|# num_pages|ratings_count|text_reviews_count|
|     1|Harry Potter and ...|J.K. Rowling-Mary...|          4.56|0439785960|9780439785969|          eng|        652|      1944099|             26249|
|     2|Harry Potter and ...|J.K. Rowling-Mary...|          4.49|0439358078|9780439358071|          eng|        870|      1996446|             27613|
|     3|Harry Potter and ...|J.K. Rowling-Mary...|          4.47|0439554934|9780439554930|          

booksDF: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 8 more fields]


## There is something else wrong

* Inspect the schema of the `DataFrame` above by calling `printSchema()`
* This prints each column's name, type, and nullability
* **Question**: What do you think is wrong with the schema that was determined?

In [3]:
booksDF.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)



## Inferring the schema and bringing in the header

* Setting the option `inferSchema` to `true` derives the column types from the data
* Setting the option `header` to `true` uses the first row as the column names

In [4]:
val booksDF = spark.read.format("csv")
                     .option("inferSchema", "true")
                     .option("header", "true")
                     .load("../data/books.csv")

booksDF: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 8 more fields]


In [5]:
booksDF.show(5)

+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|# num_pages|ratings_count|text_reviews_count|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|     1|Harry Potter and ...|J.K. Rowling-Mary...|          4.56|0439785960|9780439785969|          eng|        652|      1944099|             26249|
|     2|Harry Potter and ...|J.K. Rowling-Mary...|          4.49|0439358078|9780439358071|          eng|        870|      1996446|             27613|
|     3|Harry Potter and ...|J.K. Rowling-Mary...|          4.47|0439554934|9780439554930|          eng|        320|      5629932|             70390|
|     4|Harry Potter and ...|        J.K. Rowling|          4.41|0439554896|9780439554893|          

## Show all the columns

In [6]:
booksDF.columns

res4: Array[String] = Array(bookID, title, authors, average_rating, isbn, isbn13, language_code, # num_pages, ratings_count, text_reviews_count)


## Select a subset of the columns

In [7]:
val smallerSelectionDF = booksDF.select("title", "isbn", "# num_pages")
smallerSelectionDF.show()

+--------------------+----------+-----------+
|               title|      isbn|# num_pages|
+--------------------+----------+-----------+
|Harry Potter and ...|0439785960|        652|
|Harry Potter and ...|0439358078|        870|
|Harry Potter and ...|0439554934|        320|
|Harry Potter and ...|0439554896|        352|
|Harry Potter and ...|043965548X|        435|
|Harry Potter Boxe...|0439682584|       2690|
|Unauthorized Harr...|0976540606|        152|
|Harry Potter Coll...|0439827604|       3342|
|The Ultimate Hitc...|0517226952|        815|
|The Ultimate Hitc...|0345453743|        815|
|The Hitchhiker's ...|1400052920|        215|
|The Hitchhiker's ...|0739322206|          6|
|The Ultimate Hitc...|0517149257|        815|
|A Short History o...|076790818X|        544|
|Bill Bryson's Afr...|0767915062|         55|
|Bryson's Dictiona...|0767910435|        256|
|In a Sunburned Co...|0767903862|        335|
|I'm a Stranger He...|076790382X|        304|
|The Lost Continen...|0060920084| 

smallerSelectionDF: org.apache.spark.sql.DataFrame = [title: string, isbn: string ... 1 more field]


## Does `show` look cramped for space?

* `show` has several other signatures that are worth investigating
* The [API](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset) lists the following signatures
   * `show(numRows: Int, truncate: Int, vertical: Boolean): Unit`
   * `show(numRows: Int, truncate: Int): Unit`
   * `show(numRows: Int, truncate: Boolean): Unit`
   * `show(truncate: Boolean): Unit`
   * `show(): Unit`
   * `show(numRows: Int): Unit`
* `numRows` is the number of rows you wish to show
* `truncate` as a `Boolean`: if `true`, strings longer than 20 characters are truncated; if `false`, the full text is shown
* `truncate` as an `Int`: if greater than `0`, strings are truncated to `truncate` characters and all cells are right-aligned
* `vertical = true` prints each record as a vertical list of fields for easier viewing
* Let's try each in turn using `smallerSelectionDF`

In [8]:
smallerSelectionDF.show(numRows=30, truncate=false)

+------------------------------------------------------------------------------------------------------------+----------+-----------+
|title                                                                                                       |isbn      |# num_pages|
+------------------------------------------------------------------------------------------------------------+----------+-----------+
|Harry Potter and the Half-Blood Prince (Harry Potter  #6)                                                   |0439785960|652        |
|Harry Potter and the Order of the Phoenix (Harry Potter  #5)                                                |0439358078|870        |
|Harry Potter and the Sorcerer's Stone (Harry Potter  #1)                                                    |0439554934|320        |
|Harry Potter and the Chamber of Secrets (Harry Potter  #2)                                                  |0439554896|352        |
|Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)  

In [1]:
smallerSelectionDF.show(numRows=30, vertical=true, truncate=27)


## Getting Columns

* A `Column` represents a column of a `DataFrame` and can be used in queries
* There are four different ways to refer to a column in Scala
* `col` and `column` come from `org.apache.spark.sql.functions`; `$"..."` and the `Symbol` form come from `spark.implicits._`
* For example, each of these refers to a column

In [9]:
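import org.apache.spark.sql.functions.{col, column}
import spark.implicits._ // enables $"..." and the Symbol conversion

col("someColumnName")
column("someColumnName")
$"someColumnName" // String interpolator from spark.implicits._
'someColumnName   // Scala Symbol, converted to a Column where one is expected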

res7: Symbol = 'someColumnName


## Calling up columns

## Selecting one column
* To select columns, pass one or more column names or `Column` objects to `select`
* To select a single column, pass just that one, for example `select(col("title"))`
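
The sketch below shows both styles of `select` on a tiny stand-in for `booksDF`. The rows are made up for illustration; only the column names match the real data, and `shiftedID` is a hypothetical column added to show that `Column` objects can be transformed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In a notebook this returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("select-demo") // hypothetical name
  .getOrCreate()

// A tiny stand-in for booksDF (made-up rows, real column names).
val sampleBooks = spark.createDataFrame(Seq(
  (1, "Book A", "Author X"),
  (2, "Book B", "Author Y")
)).toDF("bookID", "title", "authors")

// select with column names given as Strings
val byName = sampleBooks.select("title", "authors")

// select with Column objects, which can also be transformed on the way
val byColumn = sampleBooks.select(
  col("title"),
  (col("bookID") + 100).alias("shiftedID")
)

// a single column
val titlesOnly = sampleBooks.select(col("title"))

byName.show()
byColumn.show()
titlesOnly.show()
```

Note that one call to `select` takes either all `String` names or all `Column` objects; the two forms cannot be mixed in the same call.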

### Querying columns
* You can also filter rows with `where`, passing a condition built from
  * Any of the column forms shown above
  * A `String` containing a SQL expression

In [10]:
val stephenKing = booksDF.where($"authors".contains("Stephen King"))
stephenKing.show()

+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|# num_pages|ratings_count|text_reviews_count|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|  4978|Wolves of the Cal...|Stephen King-Bern...|          4.19|141651693X|9781416516934|          eng|        931|       120906|              2640|
|  5094|The Drawing of th...|        Stephen King|          4.23|0451210859|9780451210852|          eng|        463|       163647|              4846|
|  5095|The Waste Lands (...|        Stephen King|          4.24|034082977X|9780340829776|          eng|        584|         1073|                82|
|  5096|Wizard and Glass ...|Stephen King-Dave...|          4.25|0340829788|9780340829783|          

stephenKing: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [bookID: int, title: string ... 8 more fields]


### Complex Expressions

* Expressions are composable, so you can build up complex conditions
* For example: `col("High") + 5 < col("Low") - 2`
* Spark represents this as a directed acyclic graph (a tree of operations)
* You can also write the entire expression as a `String` and pass it to `expr`
  * `expr("HIGH + 5 < LOW - 2")`
* This is the foundation that makes Spark SQL work
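
As a small sketch of that equivalence, the two filters below express the same condition, once with `Column` operators and once as a `String` passed to `expr`. The `readings` data is made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

// In a notebook this returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("expr-demo") // hypothetical name
  .getOrCreate()

// Made-up rows with HIGH and LOW columns.
val readings = spark.createDataFrame(Seq(
  ("a", 1, 10), // 1 + 5 < 10 - 2  is  6 < 8,  true
  ("b", 5, 10), // 5 + 5 < 10 - 2  is 10 < 8,  false
  ("c", 0, 3)   // 0 + 5 <  3 - 2  is  5 < 1,  false
)).toDF("id", "HIGH", "LOW")

// The same condition, built two ways.
val asColumns = col("HIGH") + 5 < col("LOW") - 2
val asString = expr("HIGH + 5 < LOW - 2")

val fromColumns = readings.where(asColumns)
  .select("id").collect().map(_.getString(0)).toSeq
val fromString = readings.where(asString)
  .select("id").collect().map(_.getString(0)).toSeq

println(fromColumns)
println(fromString)
```

Both filters keep only row `a`. Which form to use is largely a matter of taste: the `Column` form is checked by the Scala compiler for syntax, while the `String` form reads like SQL.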

In [12]:
val harryPotters = smallerSelectionDF.where(expr("title like '%Harry Potter%'"))
harryPotters.show(truncate=false)

+----------------------------------------------------------------------------------------------------------------------------------+----------+-----------+
|title                                                                                                                             |isbn      |# num_pages|
+----------------------------------------------------------------------------------------------------------------------------------+----------+-----------+
|Harry Potter and the Half-Blood Prince (Harry Potter  #6)                                                                         |0439785960|652        |
|Harry Potter and the Order of the Phoenix (Harry Potter  #5)                                                                      |0439358078|870        |
|Harry Potter and the Sorcerer's Stone (Harry Potter  #1)                                                                          |0439554934|320        |
|Harry Potter and the Chamber of Secrets (Harry Potter  #2)     

harryPotters: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [title: string, isbn: string ... 1 more field]


## Sorting the data

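We can sort the data with the `sort` method by providing a column. The default is ascending order. Note that the values below are compared as text rather than as numbers, so `1820` sorts before `195`.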

In [13]:
harryPotters.sort($"# num_pages").show()

+--------------------+----------+-----------+
|               title|      isbn|# num_pages|
+--------------------+----------+-----------+
|Harry Potter und ...|3895849618|         13|
|Unauthorized Harr...|0976540606|        152|
|Harry Potter Boxe...|0439434866|       1820|
|Mapping the World...|1932100598|        195|
|Mugglenet.Com's W...|1569755833|        216|
|Looking for God i...|1414306342|        234|
|Harry Potter Scho...|043932162X|        240|
|Harry Potter and ...|0812694554|        243|
|Harry Potter and ...|158234681X|        250|
|Harry Potter Y La...|0613359607|        254|
|Harry Potter Boxe...|0439682584|       2690|
|Harry Potter y la...|8498380138|        288|
|Harry Potter and ...|0439554934|        320|
|Harry Potter und ...|3551354014|        334|
|Harry Potter Coll...|0439827604|       3342|
|Harry Potter and ...|0439064864|        341|
|Harry Potter et l...|2070541304|        349|
|Harry Potter und ...|3551552096|        351|
|Harry Potter and ...|0439554896| 

## Bring in some more functions

* For descending order we need some help; a common import is `org.apache.spark.sql.functions._`
* It contains a wide range of functions that complement what is in the `DataFrame` API
* Nearly all of the functions of this Scala `object` is 

In [15]:
import org.apache.spark.sql.functions._

import org.apache.spark.sql.functions._


## Sorting in Descending

* `desc` from `org.apache.spark.sql.functions` takes the column name as a `String`; passing a `Column` such as `$"# num_pages"` fails with a type mismatch
* Given a `Column`, call `.desc` on it instead, for example `$"# num_pages".desc`

In [16]:
harryPotters.sort(desc("# num_pages"))

In [22]:
harryPotters.printSchema()

root
 |-- title: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- # num_pages: string (nullable = true)



In [31]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val converted = harryPotters.withColumn("# num_pages", $"# num_pages".cast(IntegerType))
converted.printSchema()

root
 |-- title: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- # num_pages: integer (nullable = true)



import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
converted: org.apache.spark.sql.DataFrame = [title: string, isbn: string ... 1 more field]
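
To see why the cast matters for sorting, here is a minimal sketch on made-up page counts stored as strings, as `# num_pages` is above. The column name `num_pages` and the values are assumptions chosen for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}
import org.apache.spark.sql.types.IntegerType

// In a notebook this returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-sort-demo") // hypothetical name
  .getOrCreate()

// Made-up page counts, stored as strings.
val pages = spark.createDataFrame(Seq(
  Tuple1("152"), Tuple1("1820"), Tuple1("13"), Tuple1("195")
)).toDF("num_pages")

// Sorting strings compares them character by character.
val asStrings = pages.sort(col("num_pages"))
  .collect().map(_.getString(0)).toSeq

// Cast to integers first, then sort numerically.
val numeric = pages.withColumn("num_pages", col("num_pages").cast(IntegerType))
val ascending = numeric.sort(col("num_pages"))
  .collect().map(_.getInt(0)).toSeq
val descending = numeric.sort(desc("num_pages"))
  .collect().map(_.getInt(0)).toSeq

println(asStrings)  // text order
println(ascending)  // numeric order, smallest first
println(descending) // numeric order, largest first
```

As strings, `1820` lands between `152` and `195`; after the cast it moves to the end of the ascending order, where it belongs.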


## Creating Rows

## Adding Rows

## Dropping Rows

## Taking Rows

## Creating Columns

val total = converted.agg(sum($"# num_pages"))
total.show()

In [40]:
harryPotters.select("title").where($"title".contains("Prisoner")).show(20, false)

+-----------------------------------------------------------+
|title                                                      |
+-----------------------------------------------------------+
|Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)|
|Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)|
+-----------------------------------------------------------+



In [None]:
collect_list($"title")

In [54]:
booksDF.groupBy($"authors").agg(collect_list($"title").alias("titles")).show()

+--------------------+--------------------+
|             authors|              titles|
+--------------------+--------------------+
|Abraham Lincoln-D...|[Speeches and Wri...|
|    Amanda Eyre Ward|    [How to Be Lost]|
|         Ann Beattie|[The Doctor's Hou...|
|         Ann Rinaldi|[A Break with Cha...|
|Charles Dickens-S...|[A Tale of Two Ci...|
|          Dava Sobel|[Galileo's Daught...|
|        Doug Stanton|[In Harm's Way: T...|
|     Eric Klinenberg|[Heat Wave: A Soc...|
|Gayle Lynds-Rober...|[The Altman Code ...|
|Haruki Murakami-U...|    [Naokos Lächeln]|
|          Ian Ogilvy|[Measle and the D...|
|J.E. Austen Leigh...|[A Memoir of Jane...|
|        Jack Meadows|[The Future of th...|
|          James Frey|[A Million Little...|
|Johanna Hurwitz-V...|[Anne Frank: Life...|
|John  Baxter-Mel Bay|[Deluxe Encyclope...|
|Jonathan Swift-YKids|[Gulliver's Travels]|
|     Karen Armstrong|[A History of God...|
|Laura  Jordan-San...|   [Anhelos ocultos]|
|    Laurence Olivier|[Confessio

Lab: What is the average rating of the Harry Potter books?