# 2. Read CSV Data, Try Sorts and Counts, and View Shuffles and Execution Plans #
Examples based on Chapter 2 of "Spark: The Definitive Guide: Big Data Processing Made Simple"

In this example, sample flight data for 2010 to 2015 is read from CSV files, simple sort and aggregation operations are run against it, and the execution plan for a wide transformation (shuffle) is examined.  The sample data can be downloaded to `./datain/flight-data` with the `data-download.ipynb` notebook.


In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.\
        builder.\
        appName("pyspark-notebook-2").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "512m").\
        config("spark.eventLog.enabled", "true").\
        config("spark.eventLog.dir", "file:///opt/workspace/events").\
        getOrCreate()      

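Read all the flight-data CSV files from the sample data directory, `/opt/workspace/datain/flight-data`, inferring the column types and treating the first line of each file as a header: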

In [2]:
flightData = spark.read.option("inferSchema", True).option("header", True).csv("/opt/workspace/datain/flight-data/*.csv")

View a small sample of the data-set:

In [3]:
flightData.take(5)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=264),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=69),
 Row(DEST_COUNTRY_NAME='Egypt', ORIGIN_COUNTRY_NAME='United States', count=24),
 Row(DEST_COUNTRY_NAME='Equatorial Guinea', ORIGIN_COUNTRY_NAME='United States', count=1)]

#### Viewing Execution Plans ####
The Spark `explain()` method shows the execution strategy that Spark will use to run a statement, without actually running it.

In the example below, `sort()` is a wide transformation (not an action): rows from every partition must be compared with one another, so Spark performs a *shuffle*, also known as a *partition exchange*, which appears in the execution plan as *Exchange rangepartitioning*.  The plan is read from the bottom up: the *FileScan* step reads the data first, then the exchange and the sort follow.

In [4]:
flightData.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#12 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#12 ASC NULLS FIRST, 200)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/workspace/datain/flight-data/2015-summary.csv, file:/opt/workspace/da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>
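
To make the *Exchange rangepartitioning* step more concrete, here is a minimal plain-Python sketch of the idea (it is an illustration only, not Spark's implementation). It assumes split points are picked from a sample of the data, each value is routed to the partition that owns its range, and each partition is then sorted on its own. The function name `range_partition_sort`, its parameters, and the list of counts are all hypothetical.

```python
import bisect
import random


def range_partition_sort(values, num_partitions, sample_size=20, seed=0):
    """Illustrative only: sort values via range partitioning, in plain Python."""
    if not values:
        return [[] for _ in range(num_partitions)]
    rng = random.Random(seed)
    # Choose split points from a sorted sample of the data.
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    step = len(sample) / num_partitions
    bounds = [sample[int(step * i)] for i in range(1, num_partitions)]
    # "Exchange": route each value to the partition that owns its range.
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        partitions[bisect.bisect_left(bounds, v)].append(v)
    # "Sort": each partition is sorted independently of the others.
    return [sorted(p) for p in partitions]


# Hypothetical counts, standing in for the `count` column.
counts = [1, 264, 69, 24, 1, 15, 62, 588, 40, 325]
parts = range_partition_sort(counts, num_partitions=2)
# Reading the partitions in order yields a globally sorted result.
flat = [v for p in parts for v in p]
print(parts)
print(flat == sorted(counts))
```

Because every value in one partition is no larger than any value in the next, no further merging is needed once each partition is sorted; the cost is the data movement in the exchange step.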


#### Set the Shuffle Partition Configuration ####
The `spark.sql.shuffle.partitions` parameter specifies how many partitions are used in a data shuffle operation.  The default is 200 (the value shown in the *Exchange* step of the plan above); a 2-node cluster probably needs only 2.

In [5]:
spark.conf.set("spark.sql.shuffle.partitions", "2")
# Because Spark evaluates lazily, a sort only runs now, when take() is called (explain() did not run it)
flightData.sort("count", ascending=False).take(5)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='United States', count=370002),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='United States', count=358354),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='United States', count=352742),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='United States', count=348113),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='United States', count=347452)]
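
The lazy behaviour noted in the code comment can be pictured with a toy model in plain Python. This is only an analogy, assuming a hypothetical `LazyFrame` class that records transformations and applies them when an action is called; it does not reflect how Spark is implemented.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (description, function) steps

    def sort(self, key, ascending=True):
        # Transformation: nothing is sorted here, a step is only recorded.
        step = (
            f"Sort by {key} {'ASC' if ascending else 'DESC'}",
            lambda rows: sorted(rows, key=lambda r: r[key], reverse=not ascending),
        )
        return LazyFrame(self._rows, self._plan + (step,))

    def explain(self):
        # Describes the recorded steps without touching the data.
        return [name for name, _ in self._plan]

    def take(self, n):
        # Action: only now are the recorded steps applied.
        rows = self._rows
        for _, fn in self._plan:
            rows = fn(rows)
        return rows[:n]


# Hypothetical rows for illustration.
rows = [{"dest": "A", "count": 3}, {"dest": "B", "count": 10}, {"dest": "C", "count": 1}]
lazy = LazyFrame(rows).sort("count", ascending=False)
plan = lazy.explain()  # describes the work; no sorting has happened yet
top = lazy.take(2)     # the sort actually runs here
print(plan, top)
```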

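#### Example of a Narrow Operation Followed by a Wide One ####
The sum of the `count` column can be computed within each partition (narrow), and the partial results are then brought together in a single partition and combined into one total (wide / shuffle).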

In [6]:
from pyspark.sql import functions as F
flightData.select(F.sum("count")).collect()

[Row(sum(count)=2580915)]

In [7]:
flightData.select(F.sum("count")).explain()

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(count#12 as bigint))])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(count#12 as bigint))])
      +- *(1) FileScan csv [count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/workspace/datain/flight-data/2015-summary.csv, file:/opt/workspace/da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<count:int>
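
The `partial_sum`, `Exchange SinglePartition`, and final `sum` steps in this plan can be mimicked in a few lines of plain Python. This is a sketch of the two-stage pattern only, not Spark code; the function name `two_stage_sum` and the partition contents are made up for illustration.

```python
def two_stage_sum(partitions):
    """Illustrative only: sum per partition, then combine the partial sums."""
    # Stage 1 (narrow): each partition computes its own partial sum locally.
    partials = [sum(p) for p in partitions]
    # Exchange: only the small partial results are gathered in one place.
    # Stage 2: the partial sums are combined into the final total.
    return partials, sum(partials)


# Hypothetical partitions of the `count` column.
partitions = [[1, 264, 69], [24, 1], [15, 62, 588]]
partials, total = two_stage_sum(partitions)
print(partials, total)
```

The point of the pattern is that only one number per partition crosses the shuffle, rather than every row.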


In [8]:
spark.stop()