# Gentle Intro to Apache Spark
## What is Apache Spark
Apache Spark is a *unified computing engine* and a set of *libraries* for parallel data processing on computer clusters.

![](images/eco.png)

## Philosophy
1. Unified
2. Computing engine
3. Libraries

# 2 Structure
## 2.1 Driver vs. Executors

![](images/architech.png)




## Spark Session

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession\
        .builder\
        .master('local[2]')\
        .appName('Cha1')\
        .getOrCreate()

SparkSession and Language API

![](images/sparksession.png)

## DataFrames
- most common Structured API
- *schema*: list that defines the columns and the types within those columns
![](images/df.png)

## Partitions
A partition is a collection of rows that sits on one physical machine in the cluster.
- A DataFrame’s partitions represent how the data is physically distributed across the cluster of machines during execution. 
- If you have one partition, Spark will have a parallelism of only one, even if you have thousands of executors. If you have many partitions but only one executor, Spark will still have a parallelism of only one because there is only one computation resource.
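
The rule in the last bullet can be written down as a tiny model. This is a plain-Python sketch, not a Spark API: the function name `effective_parallelism` and the idea of an "executor slot" (one task at a time per slot) are assumptions made for illustration.

```python
def effective_parallelism(num_partitions: int, num_executor_slots: int) -> int:
    """Simplified upper bound on how many tasks can run at the same time.

    Assumes one task per partition and one running task per executor slot.
    """
    if num_partitions < 0 or num_executor_slots < 0:
        raise ValueError("counts must be non-negative")
    return min(num_partitions, num_executor_slots)


print(effective_parallelism(1, 1000))  # one partition, thousands of executors
print(effective_parallelism(1000, 1))  # many partitions, one executor
print(effective_parallelism(8, 4))     # limited by the smaller of the two
```

Both extreme cases from the text come out as 1, because the smaller of the two numbers is always the bottleneck.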


In [6]:
myRange = spark.range(1000).toDF('number')
myRange

DataFrame[number: bigint]

## Transformation

In [7]:
divisBy2 = myRange.where('number % 2 = 0 ')
divisBy2

DataFrame[number: bigint]


- Narrow Transformations: each input partition contributes to only one output partition  
![](images/narrow-trans.png)

- Wide Transformations (Shuffles): each input partition can contribute to many output partitions  
![](images/wide-trans.png)
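
The difference between the two kinds of transformations can be imitated with plain Python lists standing in for partitions. This is only a sketch under that assumption: the helper names `narrow_filter` and `wide_repartition_by_key` are hypothetical and are not part of Spark.

```python
# Three hypothetical partitions holding the numbers 0..8.
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]


def narrow_filter(parts, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[x for x in part if predicate(x)] for part in parts]


def wide_repartition_by_key(parts, key_fn, num_output_partitions):
    """Wide: any input partition may send rows to any output partition (a shuffle)."""
    out = [[] for _ in range(num_output_partitions)]
    for part in parts:
        for x in part:
            out[key_fn(x) % num_output_partitions].append(x)
    return out


evens = narrow_filter(partitions, lambda x: x % 2 == 0)
by_remainder = wide_repartition_by_key(partitions, lambda x: x % 3, 3)
print(evens)         # [[0, 2], [4], [6, 8]]
print(by_remainder)  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

The filter never moves a row out of its partition, while the repartitioning step draws every output partition from all three inputs.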

#### Lazy Evaluation
In Spark, instead of modifying the data immediately when we express some operation, we build up a plan of transformations that we would like to apply to our source data. This provides immense benefits to the end user because Spark can optimize the entire data flow from end to end.
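
The idea of building a plan first and running it later can be mimicked in a few lines of plain Python. The class `LazyRange` below is a hypothetical toy, not how Spark is implemented; it only records predicates in `where` and does the work when `count` is called.

```python
class LazyRange:
    """Toy stand-in for a DataFrame: records transformations, runs them on an action."""

    def __init__(self, n, plan=()):
        self._n = n
        self._plan = plan  # tuple of predicates; nothing has been executed yet

    def where(self, predicate):
        # Transformation: return a new object with a longer plan, leaving self unchanged.
        return LazyRange(self._n, self._plan + (predicate,))

    def count(self):
        # Action: only now is the data generated and the whole plan applied.
        return sum(1 for x in range(self._n) if all(p(x) for p in self._plan))


my_range = LazyRange(1000)
divis_by_2 = my_range.where(lambda x: x % 2 == 0)  # no rows are touched here
print(len(divis_by_2._plan))  # 1
print(divis_by_2.count())     # 500
```

Because the full plan is known before anything runs, an engine has the chance to reorder or combine steps, which is the benefit the paragraph above describes.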



## Actions
An action triggers the computation of the planned transformations.


In [8]:
divisBy2.count()

500


Three kinds of actions
- actions to view data in the console;
- actions to collect data to native objects in the respective language;
- and actions to write to output data sources

## Spark UI
Monitor the progress of Spark jobs.

Local Port: http://localhost:4040

## An End-to-End example

In [9]:
dir_path='../Spark-The-Definitive-Guide-master'
data_path="/data/flight-data/csv/2015-summary.csv"

In [10]:
flightData2015 = spark\
    .read\
    .option('inferSchema', 'true')\
    .option('header', 'true')\
    .csv(''.join([dir_path, data_path]))

![](images/read_csv.png)

In [11]:
# similar to df.head()
flightData2015.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]

![](images/sort.png)

In [12]:
# call explain on any DataFrame object 
# to see the DataFrame’s lineage
# how Spark will execute this query
flightData2015.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#23 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#23 ASC NULLS FIRST, 200)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#21,ORIGIN_COUNTRY_NAME#22,count#23] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/binb/Desktop/spark/Spark-The-Definitive-Guide-master/data/flight-dat..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>


## Execute the plan
Calling an action turns the planned transformations into actual computation.

In [13]:
## default: shuffle output 200 partitions
## now set it to 5 to simplify
spark.conf.set('spark.sql.shuffle.partitions', '5')

In [14]:
flightData2015.sort('count').take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

![](images/shuffle.png)

### Experiment with different partitions
We do not manipulate the physical data; instead, we configure the physical execution characteristics of Spark jobs
through settings such as the shuffle partitions parameter that we set a few moments ago.

When experimenting with different values, you should see drastically different runtimes. Remember that you can
monitor job progress by navigating to the Spark UI on port 4040, which shows the physical and logical execution characteristics of your jobs.
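
The number of shuffle partitions changes how the work is split, not what the sort returns. The plain-Python sketch below imitates a range-partitioned sort under a strong simplification: it uses equal-width ranges, whereas Spark chooses its range boundaries itself. The function `range_partition_sort` and the sample values are hypothetical.

```python
def range_partition_sort(values, num_partitions):
    """Route values into ordered range buckets, sort each bucket, then concatenate."""
    if not values:
        return [], []
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_partitions or 1  # avoid a zero width when all values are equal
    buckets = [[] for _ in range(num_partitions)]
    for v in values:
        idx = min(int((v - lo) / width), num_partitions - 1)
        buckets[idx].append(v)
    flat = [v for bucket in buckets for v in sorted(bucket)]
    return flat, [len(bucket) for bucket in buckets]


# Illustrative 'count' values; a few appear in the outputs of this notebook, the rest are made up.
counts = [15, 1, 344, 1, 370002, 62, 588]

sorted_5, sizes_5 = range_partition_sort(counts, 5)
sorted_500, sizes_500 = range_partition_sort(counts, 500)
print(sorted_5[:2], sorted_500[:2])  # same first two values either way
print(sizes_5)                       # how unevenly the rows fall into 5 ranges
```

Both settings give the same sorted order; only the number and size of the buckets differ, which is why the runtime changes while the result does not.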

In [15]:
# partitions = 5
spark.conf.set('spark.sql.shuffle.partitions', '5')
flightData2015.sort('count').take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

In [16]:
# partitions = 500
spark.conf.set('spark.sql.shuffle.partitions', '500')
flightData2015.sort('count').take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

## DataFrames and SQL

**DataFrames are immutable**

Register a DataFrame as a table or view so that it can be queried with SQL

In [17]:
flightData2015.createOrReplaceTempView('flight_data_2015')

In [21]:
sqlWay = spark.sql('''
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME''')

dataFrameWay = flightData2015\
    .groupBy('DEST_COUNTRY_NAME')\
    .count()

In [26]:
sqlWay.explain()
print('====='*20)
dataFrameWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#21], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#21, 500)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#21], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/binb/Desktop/spark/Spark-The-Definitive-Guide-master/data/flight-dat..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#21], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#21, 500)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#21], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/binb/Desktop/spark/Spark-The-Definitive-Guide-master/data/flight-dat..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY

**The two plans are the same**: the SQL query and the DataFrame code compile to the same physical plan.
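
Both plans above show the same three stages: a `partial_count` inside each input partition, an `Exchange hashpartitioning` step, and a final `count`. The plain-Python sketch below imitates those stages on made-up data; the partitions, the use of Python's built-in `hash`, and the variable names are all assumptions for illustration, not Spark internals.

```python
from collections import Counter

# Hypothetical input partitions of DEST_COUNTRY_NAME values.
partitions = [
    ["United States", "United States", "Canada"],
    ["Mexico", "United States"],
    ["Canada", "Mexico", "Mexico"],
]
num_shuffle_partitions = 2

# Stage 1 (partial_count): each partition counts its own rows locally.
partial = [Counter(part) for part in partitions]

# Stage 2 (exchange): send all partial counts for one key to the same shuffle partition.
shuffled = [[] for _ in range(num_shuffle_partitions)]
for counts in partial:
    for key, n in counts.items():
        shuffled[hash(key) % num_shuffle_partitions].append((key, n))

# Stage 3 (final count): merge the partial counts per key.
final = Counter()
for bucket in shuffled:
    for key, n in bucket:
        final[key] += n

print(sorted(final.items()))  # [('Canada', 2), ('Mexico', 3), ('United States', 3)]
```

Counting locally first means that only one small `(key, count)` pair per key and partition has to cross the network, instead of every row.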

ex1: Find the Maximum 'count'

In [33]:
## sql
spark.sql('''
SELECT max(count) 
FROM flight_data_2015''').take(1)

[Row(max(count)=370002)]

In [34]:
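## python
# import the functions module under an alias so that Spark's max
# does not shadow Python's built-in max
from pyspark.sql import functions as F
flightData2015.select(F.max('count')).take(1)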

[Row(max(count)=370002)]

ex2: find top five destination countries 

In [45]:
# sql
maxSql = spark.sql('''
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY destination_total DESC
LIMIT 5''')
maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [41]:
# python
from pyspark.sql.functions import desc
flightData2015\
    .groupBy('DEST_COUNTRY_NAME')\
    .sum('count')\
    .withColumnRenamed('sum(count)', 'destination_total')\
    .sort(desc('destination_total'))\
    .limit(5)\
    .show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



#### Execution plan
The true execution plan (the one visible in explain) will differ from that shown in Figure 2-10 because of optimizations in the physical execution.

The execution plan is a *directed acyclic graph* (DAG) of transformations.  
Each transformation results in a new, immutable DataFrame.  
An action (e.g. `df.show()`) is what generates the result.

![](images/ex2.png)

- **1st step**: *read* in the data.   
The DataFrame is not read in until an action is called on that DataFrame or one derived from the original df.
- **2nd step**: *grouping*.   
`groupBy` returns a grouped object (`RelationalGroupedDataset` in Scala, `GroupedData` in PySpark): a DataFrame that has a grouping specified but needs the user to specify an aggregation before it can be queried further. We group by a key (or set of keys) and then perform an aggregation over each of those keys.
- **3rd step**: specify the *aggregation*.   
`sum` aggregation method: takes a column expression or a column name as input. The result of the `sum`
call is a *new DataFrame* with a new *schema* that knows the type of each column. (*transformation*: no computation has been performed yet)
- **4th step**: *renaming*.   
`withColumnRenamed` method: takes two arguments (original column name, new column name). (*transformation*)
- **5th step**: *sorts* the data  
*transformation*  
Note: the `desc` function does not return a string but a Column. In general, many DataFrame methods accept strings (as column names), Column types, or expressions. (Columns and expressions are the same thing.)
- **6th step**: specify a *limit*.   
Returns only the first five rows of the final DataFrame.
- **7th step**: *action!*  
Only now does Spark run the plan. An action such as `collect` returns the results as a list or array in the language we are executing; `show`, used above, prints them to the console.
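
To make the seven steps concrete, here is the same pipeline written in plain Python on a handful of rows. It is a sketch of the logic only, not of how Spark executes it: the first three rows come from the `take(3)` output earlier, the remaining rows are hypothetical, and everything runs eagerly in one process.

```python
from collections import defaultdict

# Step 1: the "read". Rows are (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count).
rows = [
    ("United States", "Romania", 15),
    ("United States", "Croatia", 1),
    ("United States", "Ireland", 344),
    ("Canada", "United States", 8),   # hypothetical row
    ("Mexico", "United States", 5),   # hypothetical row
    ("Canada", "Mexico", 2),          # hypothetical row
]

# Steps 2-3: group by destination and sum the counts within each group.
totals = defaultdict(int)
for dest, _origin, count in rows:
    totals[dest] += count

# Step 4: rename the aggregate to destination_total.
renamed = [
    {"DEST_COUNTRY_NAME": dest, "destination_total": total}
    for dest, total in totals.items()
]

# Steps 5-6: sort in descending order, then keep at most five rows.
top5 = sorted(renamed, key=lambda r: r["destination_total"], reverse=True)[:5]

# Step 7: the "action", here simply printing the result.
for r in top5:
    print(r["DEST_COUNTRY_NAME"], r["destination_total"])
```

In Spark, steps 1 to 6 would only extend the plan, and nothing would be computed until step 7.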

In [46]:
## explain plan
flightData2015\
    .groupBy("DEST_COUNTRY_NAME")\
    .sum('count')\
    .withColumnRenamed('sum(count)','destination_total')\
    .sort(desc('destination_total'))\
    .limit(5)\
    .explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#253L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#21,destination_total#253L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#21], functions=[sum(cast(count#23 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#21, 500)
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#21], functions=[partial_sum(cast(count#23 as bigint))])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#21,count#23] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/binb/Desktop/spark/Spark-The-Definitive-Guide-master/data/flight-dat..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>
