# Spark Basics
Authors: Daniel Hinojosa

## Spark Intro
* Big data processing framework
* Variety of packages built upon Spark engine
* Contains two APIs
  * _Unstructured API_
    * Lower Level
    * `RDD`
    * Accumulators
    * Broadcast Variables
  * _Structured API_
    * Higher Level
    * Optimized
    * `DataFrame`
    * `Dataset`
    * `Spark SQL`
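
To make the two API levels concrete, here is a minimal sketch that computes the same total once with an `RDD` and once with a `DataFrame`. The application name, the variable names, and the input `1 to 5` are illustrative assumptions, not part of the original notebook. In a notebook or `spark-shell` a session already exists, and `getOrCreate()` simply reuses it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Reuse the running session if there is one, otherwise start a local one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("api-comparison-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Unstructured API: an RDD of plain Scala values, transformed with functions
val numbersRDD = spark.sparkContext.parallelize(1 to 5)
val rddTotal: Int = numbersRDD.map(_ * 2).reduce(_ + _)

// Structured API: a DataFrame with a named column, transformed with column expressions
val numbersDF = (1 to 5).toDF("n")
val dfTotal: Long = numbersDF.agg(sum($"n" * 2)).first().getLong(0)
```

Both totals describe the same computation. The structured version expresses it as column expressions that Spark can optimize, while the RDD version runs the supplied Scala functions as written.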

## Spark Architecture

![Spark Architecture](../images/spark_architecture.png)

In [1]:
val dataFrame = spark.range(1, 100)
                     .toDF("mappedRange")

Intitializing Scala interpreter ...

Spark Web UI available at http://7e47b21ebc4f:4040
SparkContext available as 'sc' (version = 2.4.3, master = local[*], app id = local-1561668367505)
SparkSession available as 'spark'


dataFrame: org.apache.spark.sql.DataFrame = [mappedRange: bigint]


In [2]:
dataFrame.show()

+-----------+
|mappedRange|
+-----------+
|          1|
|          2|
|          3|
|          4|
|          5|
|          6|
|          7|
|          8|
|          9|
|         10|
|         11|
|         12|
|         13|
|         14|
|         15|
|         16|
|         17|
|         18|
|         19|
|         20|
+-----------+
only showing top 20 rows
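
The output above stops at 20 rows because `show()` prints at most 20 rows by default. The following sketch, which assumes a local session and uses illustrative variable names, checks how many rows the range actually holds (the end bound of `spark.range` is exclusive) and passes an explicit row count to `show`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

// Reuse the running session if there is one, otherwise start a local one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("range-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

val dataFrame = spark.range(1, 100).toDF("mappedRange")

// The end bound is exclusive, so the values run from 1 to 99
val rowCount = dataFrame.count()
val bounds = dataFrame.agg(min($"mappedRange"), max($"mappedRange")).first()
val firstAndLast = (bounds.getLong(0), bounds.getLong(1))

dataFrame.show(5)                       // print only the first 5 rows
dataFrame.show(100, truncate = false)   // print up to 100 rows without truncating cell contents
```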



The Kaggle.com delay flights dataset is available at https://www.kaggle.com/giovamata/airlinedelaycauses/downloads/airlinedelaycauses.zip/2. The cells below, however, read a books dataset from `../data/books.csv`.

In [34]:
val booksDF = spark.read.format("csv")
                     .option("inferSchema", "true")
                     .option("header", "true")
                     .load("../data/books.csv")

booksDF: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 8 more fields]
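
To see what the `inferSchema` option in the cell above changes, here is a small sketch that reads the same CSV twice, with and without inference. The file contents, the temporary path, and the variable names are hypothetical stand-ins for illustration, not the real `books.csv`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one, otherwise start a local one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-sketch") // hypothetical name
  .getOrCreate()

// Hypothetical sample data written to a temporary file
val csvPath = Files.createTempFile("books-sketch", ".csv")
val csvText = "bookID,title,average_rating\n1,First Book,4.56\n2,Second Book,4.49\n"
Files.write(csvPath, csvText.getBytes(StandardCharsets.UTF_8))

// With inference, Spark samples the data to choose a type per column
val withInference = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csvPath.toUri.toString)

// Without inference, every column is read as a string
val withoutInference = spark.read.format("csv")
  .option("header", "true")
  .load(csvPath.toUri.toString)

val inferredTypes = withInference.dtypes.toMap
val defaultTypes = withoutInference.dtypes.toMap
```

Inference costs an extra pass over the data, and a column with even a few non-numeric values is typically inferred as a string.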


In [4]:
booksDF.show(5)

+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|# num_pages|ratings_count|text_reviews_count|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|     1|Harry Potter and ...|J.K. Rowling-Mary...|          4.56|0439785960|9780439785969|          eng|        652|      1944099|             26249|
|     2|Harry Potter and ...|J.K. Rowling-Mary...|          4.49|0439358078|9780439358071|          eng|        870|      1996446|             27613|
|     3|Harry Potter and ...|J.K. Rowling-Mary...|          4.47|0439554934|9780439554930|          eng|        320|      5629932|             70390|
|     4|Harry Potter and ...|        J.K. Rowling|          4.41|0439554896|9780439554893|          

Show all the columns:

In [10]:
booksDF.columns

res7: Array[String] = Array(bookID, title, authors, average_rating, isbn, isbn13, language_code, # num_pages, ratings_count, text_reviews_count)


Select a few of the columns:

In [16]:
val smallerSelectionDF = booksDF.select("title", "isbn", "# num_pages")
smallerSelectionDF.show()

+--------------------+----------+-----------+
|               title|      isbn|# num_pages|
+--------------------+----------+-----------+
|Harry Potter and ...|0439785960|        652|
|Harry Potter and ...|0439358078|        870|
|Harry Potter and ...|0439554934|        320|
|Harry Potter and ...|0439554896|        352|
|Harry Potter and ...|043965548X|        435|
|Harry Potter Boxe...|0439682584|       2690|
|Unauthorized Harr...|0976540606|        152|
|Harry Potter Coll...|0439827604|       3342|
|The Ultimate Hitc...|0517226952|        815|
|The Ultimate Hitc...|0345453743|        815|
|The Hitchhiker's ...|1400052920|        215|
|The Hitchhiker's ...|0739322206|          6|
|The Ultimate Hitc...|0517149257|        815|
|A Short History o...|076790818X|        544|
|Bill Bryson's Afr...|0767915062|         55|
|Bryson's Dictiona...|0767910435|        256|
|In a Sunburned Co...|0767903862|        335|
|I'm a Stranger He...|076790382X|        304|
|The Lost Continen...|0060920084| 

smallerSelectionDF: org.apache.spark.sql.DataFrame = [title: string, isbn: string ... 1 more field]


In [18]:
val harryPotters = smallerSelectionDF.where($"title".contains("Harry Potter"))
harryPotters.show()

+--------------------+----------+-----------+
|               title|      isbn|# num_pages|
+--------------------+----------+-----------+
|Harry Potter and ...|0439785960|        652|
|Harry Potter and ...|0439358078|        870|
|Harry Potter and ...|0439554934|        320|
|Harry Potter and ...|0439554896|        352|
|Harry Potter and ...|043965548X|        435|
|Harry Potter Boxe...|0439682584|       2690|
|Unauthorized Harr...|0976540606|        152|
|Harry Potter Coll...|0439827604|       3342|
|Harry Potter Scho...|043932162X|        240|
|J.K. Rowling's Ha...|0826452329|         96|
|Harry Potter and ...|0747584664|        768|
|Harry Potter Y La...|0613359607|        254|
|Harry Potter and ...|074757362X|        480|
|Looking for God i...|1414306342|        234|
|Mugglenet.Com's W...|1569755833|        216|
|Harry Potter y el...|8478889930|        602|
|Harry Potter y la...|8498380138|        288|
|Harry Potter y la...|8478888845|        893|
|Ultimate Unoffici...|0972393617| 

harryPotters: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [title: string, isbn: string ... 1 more field]


In [21]:
harryPotters.sort($"# num_pages").show()

+--------------------+----------+-----------+
|               title|      isbn|# num_pages|
+--------------------+----------+-----------+
|Harry Potter und ...|3895849618|         13|
|Unauthorized Harr...|0976540606|        152|
|Harry Potter Boxe...|0439434866|       1820|
|Mapping the World...|1932100598|        195|
|Mugglenet.Com's W...|1569755833|        216|
|Looking for God i...|1414306342|        234|
|Harry Potter Scho...|043932162X|        240|
|Harry Potter and ...|0812694554|        243|
|Harry Potter and ...|158234681X|        250|
|Harry Potter Y La...|0613359607|        254|
|Harry Potter Boxe...|0439682584|       2690|
|Harry Potter y la...|8498380138|        288|
|Harry Potter and ...|0439554934|        320|
|Harry Potter und ...|3551354014|        334|
|Harry Potter Coll...|0439827604|       3342|
|Harry Potter and ...|0439064864|        341|
|Harry Potter et l...|2070541304|        349|
|Harry Potter und ...|3551552096|        351|
|Harry Potter and ...|0439554896| 

In [22]:
harryPotters.printSchema()

root
 |-- title: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- # num_pages: string (nullable = true)
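
The schema explains the ordering in the sorted output above: `# num_pages` is a string column, so `sort` compares the values character by character and `1820` lands before `195`. A minimal sketch of the difference, using a hypothetical four-row DataFrame and a made-up column name `pages`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

// Reuse the running session if there is one, otherwise start a local one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sort-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Page counts stored as strings, as in the schema above
val pages = Seq("195", "13", "1820", "152").toDF("pages")

// String sort: lexicographic, character by character
val sortedAsStrings = pages.sort($"pages").as[String].collect().toList

// Numeric sort: cast to an integer column first, then sort
val sortedAsInts = pages
  .select($"pages".cast(IntegerType).as("pages"))
  .sort($"pages")
  .as[Int]
  .collect()
  .toList
```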



In [31]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val converted = harryPotters.withColumn("# num_pages", $"# num_pages".cast(IntegerType))
converted.printSchema()

root
 |-- title: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- # num_pages: integer (nullable = true)



import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
converted: org.apache.spark.sql.DataFrame = [title: string, isbn: string ... 1 more field]


In [33]:
val total = converted.agg(sum($"# num_pages"))
total.show()

+----------------+
|sum(# num_pages)|
+----------------+
|           22112|
+----------------+



total: org.apache.spark.sql.DataFrame = [sum(# num_pages): bigint]


In [40]:
harryPotters.select("title").where($"title".contains("Prisoner")).show(20, false)

+-----------------------------------------------------------+
|title                                                      |
+-----------------------------------------------------------+
|Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)|
|Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)|
+-----------------------------------------------------------+



In [None]:
collect_list($"title")

In [54]:
booksDF.groupBy($"authors").agg(collect_list($"title").alias("titles")).show()

+--------------------+--------------------+
|             authors|              titles|
+--------------------+--------------------+
|Abraham Lincoln-D...|[Speeches and Wri...|
|    Amanda Eyre Ward|    [How to Be Lost]|
|         Ann Beattie|[The Doctor's Hou...|
|         Ann Rinaldi|[A Break with Cha...|
|Charles Dickens-S...|[A Tale of Two Ci...|
|          Dava Sobel|[Galileo's Daught...|
|        Doug Stanton|[In Harm's Way: T...|
|     Eric Klinenberg|[Heat Wave: A Soc...|
|Gayle Lynds-Rober...|[The Altman Code ...|
|Haruki Murakami-U...|    [Naokos Lächeln]|
|          Ian Ogilvy|[Measle and the D...|
|J.E. Austen Leigh...|[A Memoir of Jane...|
|        Jack Meadows|[The Future of th...|
|          James Frey|[A Million Little...|
|Johanna Hurwitz-V...|[Anne Frank: Life...|
|John  Baxter-Mel Bay|[Deluxe Encyclope...|
|Jonathan Swift-YKids|[Gulliver's Travels]|
|     Karen Armstrong|[A History of God...|
|Laura  Jordan-San...|   [Anhelos ocultos]|
|    Laurence Olivier|[Confessio
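
The grouping above can be reproduced on a tiny in-memory DataFrame, which makes the shape of the result easier to inspect. The author and title values below are hypothetical. `collect_list` does not guarantee the order of the collected elements, so this sketch wraps it in `sort_array` to get a stable result.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, sort_array}

// Reuse the running session if there is one, otherwise start a local one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows: (authors, title)
val books = Seq(
  ("Author A", "Title 1"),
  ("Author B", "Title 2"),
  ("Author A", "Title 3")
).toDF("authors", "title")

// One row per author, with that author's titles gathered into an array column
val grouped = books
  .groupBy($"authors")
  .agg(sort_array(collect_list($"title")).alias("titles"))
  .orderBy($"authors")

// Bring the small result back to the driver as plain Scala values
val result = grouped.collect()
  .map(row => (row.getString(0), row.getSeq[String](1).toList))
  .toList
```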

Lab: What is the average rating of the Harry Potter books?