In [1]:
import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext("local", "pyspark-shell")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Improving Performance

## Caching

Performance can be improved by using caching. Caching in Spark refers to storing the results of a DataFrame in memory or on disk on the processing nodes in a cluster. This improves the speed of later transformations/actions, since the data no longer needs to be retrieved from the original data source.

Very large data sets may not fit in memory. Caching is incredibly useful, but only if you plan to use the DataFrame again; for a single task it is not worth it. If normal caching doesn't seem to help, try creating intermediate Parquet representations. Calling .cache() is lazy, like a Spark transformation, so nothing is actually cached until an action is called. Use .unpersist() to remove the object from the cache.

### Caching a DataFrame

In [2]:
import time
departures_df = spark.read.csv("AA_DFW_2017_Departures_Short.csv", header= True)
start_time = time.time()
departures_df = departures_df.distinct().cache()

print("Counting %d rows took %f seconds" % (departures_df.count(), time.time() - start_time))
start_time = time.time()
print("Counting %d rows took %f seconds" % (departures_df.count(), time.time() - start_time))

Counting 139358 rows took 9.225997 seconds
Counting 139358 rows took 2.340655 seconds


### Removing a DataFrame from cache

In [3]:
print("Is departures_df cached?: %s" % departures_df.is_cached)
print("Removing departures_df from cache")

departures_df.unpersist()

print("Is departures_df cached?: %s" % departures_df.is_cached)

Is departures_df cached?: True
Removing departures_df from cache
Is departures_df cached?: False


## Improve import performance

Spark clusters are made of two types of processes: a driver process and worker processes. The driver handles task assignments and consolidation of the data results from the workers. The workers handle the actual transformation/action tasks of a Spark job. Once assigned tasks, they operate independently and report results back to the driver.

The more import objects there are, the better the cluster can divvy up the job. One large file will usually perform worse than many smaller ones. Depending on the configuration of your cluster, you may not be able to process larger files, but could easily handle the same amount of data split between smaller files.

You can define a single import statement, even if there are multiple files, by using a wildcard symbol (*) in the path.

Spark performs better if objects are of similar size.

A well-defined schema will improve import performance by avoiding reading the data multiple times.

If a DataFrame will be used often, a simple method is to read in the single file, then write it back out as Parquet.


### File import performance

In [4]:
full_df = spark.read.csv("departures_full.csv")
# split_df = spark.read.csv("C:\\Users\\Buğra\\Downloads\\AA_DFW_*_Departures_Short.csv.gz")
# Couldn't figure out how to read multiple files at the same time. For some reason, reading multiple files took too long.

start_time_a = time.time()
print("Total rows in full DataFrame:\t%d" % full_df.count())
print("Time to run: %f" % (time.time() - start_time_a))

# start_time_b = time.time()
# print("Total rows in split DataFrame:\t%d" % split_df.count())
# print("Time to run: %f" % (time.time() - start_time_b))

Total rows in full DataFrame:	583723
Time to run: 0.614811


### Cluster configurations

Spark contains many configuration settings, and these can be modified to match your needs. The configurations are available in the configuration files, via the Spark web interface, and via the run-time code. Use spark.conf.get(< configuration name >) to read a setting and spark.conf.set(< configuration name >, < value >) to change it.

Spark deployment options: single node clusters, standalone clusters with dedicated machines as the driver and workers, or managed clusters whose components are handled by a third-party cluster manager (YARN, Mesos, Kubernetes).

The driver handles task assignment to the various nodes/processes in the cluster, as well as result consolidation.

As a rule of thumb, the driver node should have double the memory of a worker. Fast local storage is helpful.

A Spark worker handles running tasks assigned by the driver and communicates those results back to the driver. More worker nodes are often better than fewer, larger nodes. You need to test configurations to find the right balance.

### Reading Spark configurations

In [5]:
app_name = spark.conf.get('spark.app.name')
driver_tcp_port = spark.conf.get('spark.driver.port')
num_partitions = spark.conf.get('spark.sql.shuffle.partitions')
print("Name: %s"%app_name)
print("Driver TCP port: %s" % driver_tcp_port)
print("Number of partitions: %s" % num_partitions)

Name: pyspark-shell
Driver TCP port: 51673
Number of partitions: 200


### Writing Spark configurations


In [6]:
before = departures_df.rdd.getNumPartitions()

spark.conf.set("spark.sql.shuffle.partitions", 500)

departures_df = spark.read.csv("AA_DFW_2017_Departures_Short.csv", header= True).distinct()

print("Partition count before change: %d" % before)
print("Partition count after change: %d" % departures_df.rdd.getNumPartitions())

Partition count before change: 200
Partition count after change: 500


### Performance improvements

This section covers improving the performance of Spark tasks in general. To understand the performance implications of a Spark job, you need to understand what Spark is doing under the hood. For that, use the explain() function on a DataFrame.

Shuffling refers to moving data around to various workers to complete a task. It hides complexity from the user, but it can be slow to complete and lowers overall throughput. It is often necessary, but try to minimize it.

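Repartitioning with .repartition() is quite costly because it triggers a full shuffle. If you only need to reduce the number of partitions, use the .coalesce() function instead. Joins can often cause shuffle operations as well.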

Broadcasting provides a copy of an object to each worker. This decreases the need for communication between nodes: it limits data shuffles and makes it more likely that a node can fulfill its tasks independently. Broadcasting can speed up .join() operations.

### Normal joins

In [7]:
airports_df = spark.read.csv("airport.csv").select("_c1", "_c4")
airports_df = airports_df.withColumnRenamed("_c1", "AIRPORTNAME")
airports_df = airports_df.withColumnRenamed("_c4", "IATA")

flights_df = spark.read.csv("AA_DFW_2017_Departures_Short.csv", header=True)

normal_df = flights_df.join(airports_df, flights_df["Destination Airport"] == airports_df["IATA"])
normal_df.explain()

== Physical Plan ==
*(2) BroadcastHashJoin [Destination Airport#364], [IATA#343], Inner, BuildRight, false
:- *(2) Filter isnotnull(Destination Airport#364)
:  +- FileScan csv [Date (MM/DD/YYYY)#362,Flight Number#363,Destination Airport#364,Actual elapsed time (Minutes)#365] Batched: false, DataFilters: [isnotnull(Destination Airport#364)], Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/Buğra/Datacamp-jupyter_notebook/PySpark/Cleaning Data with PySpa..., PartitionFilters: [], PushedFilters: [IsNotNull(Destination Airport)], ReadSchema: struct<Date (MM/DD/YYYY):string,Flight Number:string,Destination Airport:string,Actual elapsed ti...
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true]),false), [id=#237]
   +- *(1) Project [_c1#311 AS AIRPORTNAME#340, _c4#314 AS IATA#343]
      +- *(1) Filter isnotnull(_c4#314)
         +- FileScan csv [_c1#311,_c4#314] Batched: false, DataFilters: [isnotnull(_c4#314)], Format: CSV, Location: InMemoryFileIndex[file:/C:

### Using broadcasting on Spark joins


In [8]:
from pyspark.sql.functions import broadcast

broadcast_df = flights_df.join(broadcast(airports_df), flights_df["Destination Airport"] == airports_df["IATA"])

broadcast_df.explain()

== Physical Plan ==
*(2) BroadcastHashJoin [Destination Airport#364], [IATA#343], Inner, BuildRight, false
:- *(2) Filter isnotnull(Destination Airport#364)
:  +- FileScan csv [Date (MM/DD/YYYY)#362,Flight Number#363,Destination Airport#364,Actual elapsed time (Minutes)#365] Batched: false, DataFilters: [isnotnull(Destination Airport#364)], Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/Buğra/Datacamp-jupyter_notebook/PySpark/Cleaning Data with PySpa..., PartitionFilters: [], PushedFilters: [IsNotNull(Destination Airport)], ReadSchema: struct<Date (MM/DD/YYYY):string,Flight Number:string,Destination Airport:string,Actual elapsed ti...
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true]),false), [id=#269]
   +- *(1) Project [_c1#311 AS AIRPORTNAME#340, _c4#314 AS IATA#343]
      +- *(1) Filter isnotnull(_c4#314)
         +- FileScan csv [_c1#311,_c4#314] Batched: false, DataFilters: [isnotnull(_c4#314)], Format: CSV, Location: InMemoryFileIndex[file:/C:

### Comparing broadcast vs normal joins

In [10]:
start_time = time.time()

normal_count = normal_df.count()
normal_duration = time.time() - start_time

start_time = time.time()

broadcast_count = broadcast_df.count()
broadcast_duration = time.time() - start_time

print("Normal count:\t\t%d\tduration: %f" % (normal_count, normal_duration))
print("Broadcast count:\t%d\tduration: %f" % (broadcast_count, broadcast_duration))

Normal count:		139358	duration: 0.577108
Broadcast count:	139358	duration: 0.398370
