## Improving performance in Spark


In [6]:
import findspark
findspark.init('/home/rich/spark/spark-2.4.3-bin-hadoop2.7')
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import functions as F
import time

In [3]:
spark = SparkSession.builder.master('local[*]').appName('caching').getOrCreate()

# Load the CSV file, treating the first row as column names
departures_df = spark.read.csv('./data/AA_DFW_2017_Departures_Short.csv.gz', header=True)


In [5]:
departures_df.show(5)

+-----------------+-------------+-------------------+-----------------------------+
|Date (MM/DD/YYYY)|Flight Number|Destination Airport|Actual elapsed time (Minutes)|
+-----------------+-------------+-------------------+-----------------------------+
|       01/01/2017|         0005|                HNL|                          537|
|       01/01/2017|         0007|                OGG|                          498|
|       01/01/2017|         0037|                SFO|                          241|
|       01/01/2017|         0043|                DTW|                          134|
|       01/01/2017|         0051|                STL|                           88|
+-----------------+-------------+-------------------+-----------------------------+
only showing top 5 rows



### Caching a DataFrame

Caching can improve performance when a DataFrame is reused. The cell below caches the unique rows of `departures_df` and times two consecutive counts to measure the effect.

In [7]:
start_time = time.time()

# Add caching to the unique rows in departures_df
departures_df = departures_df.distinct().cache()

# Count the unique rows in departures_df, noting how long the operation takes
print("Counting %d rows took %f seconds" % (departures_df.count(), time.time() - start_time))

# Count the rows again, noting the variance in time of a cached DataFrame
start_time = time.time()
print("Counting %d rows again took %f seconds" % (departures_df.count(), time.time() - start_time))

Counting 139358 rows took 8.546151 seconds
Counting 139358 rows again took 1.414361 seconds


Although `cache()` has been applied, it doesn't take effect until an action is run. The first `count()` triggers the `distinct()` computation and populates the cache once it completes. The second `count()` has nothing to recalculate, so it returns much faster (about 1.4 seconds versus 8.5 seconds in this run).
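The timing cells in this notebook repeat the same start/stop bookkeeping. A minimal sketch of a reusable helper follows; the name `timed` is hypothetical and not part of the original notebook, and the workload is a plain Python stand-in so the sketch runs without a Spark session.

```python
import time

def timed(action, *args, **kwargs):
    """Run a callable once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; any callable can be passed the same way.
rows, seconds = timed(sum, range(1_000_000))
print("Summing took %f seconds" % seconds)
```

With the DataFrame above, `timed(departures_df.count)` would return the row count together with the elapsed time, so calling it twice gives the uncached and cached timings.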

### Removing a DataFrame from cache

Remove the DataFrame from the cache to prevent any excess memory usage on your cluster.

In [10]:
# Determine if departures_df is in the cache
print("Is departures_df cached?: %s" % departures_df.is_cached)
print("Removing departures_df from cache")

# Remove departures_df from the cache
departures_df.unpersist()

# Check the cache status again
print("Is departures_df cached?: %s" % departures_df.is_cached)

Is departures_df cached?: True
Removing departures_df from cache
Is departures_df cached?: False


### File import performance

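Take a copy of the same file and rename it so that it matches the wildcard pattern used below. Loading files through a `*` pattern lets Spark read several files in parallel, whereas a single gzip file cannot be split across tasks, so the wildcard import should be faster.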

In [13]:
# Import the full and split files into DataFrames
full_df = spark.read.csv('./data/AA_DFW_2014_Departures_Short.csv.gz')
split_df = spark.read.csv('./data/AA_DFW_2014_Departures_Short_0*.csv.gz')

# Print the count and run time for each DataFrame
start_time_a = time.time()
print("Total rows in full DataFrame:\t%d" % full_df.count())
print("Time to run: %f" % (time.time() - start_time_a))

start_time_b = time.time()
print("Total rows in split DataFrame:\t%d" % split_df.count())
print("Time to run: %f" % (time.time() - start_time_b))

Total rows in full DataFrame:	157199
Time to run: 0.636141
Total rows in split DataFrame:	157199
Time to run: 0.435705


In this run, importing through the wildcard pattern ran faster than importing the single large file (about 0.44 seconds versus 0.64 seconds).
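One way to produce files that match such a wildcard is to distribute the lines of a single gzipped CSV across several gzipped parts. The sketch below is an assumption about how that could be done, not the method used to prepare the files in this notebook; `split_gzip_lines` and the `demo_*` file names are hypothetical. It does not repeat a header row in each part, and it assumes no CSV field contains an embedded newline.

```python
import gzip
import os
import tempfile

def split_gzip_lines(source_path, output_pattern, parts):
    """Distribute lines of one gzipped text file round-robin across `parts` files.

    `output_pattern` needs one %d placeholder; returns the written paths.
    """
    paths = [output_pattern % i for i in range(parts)]
    outputs = [gzip.open(p, 'wt', newline='') for p in paths]
    try:
        with gzip.open(source_path, 'rt', newline='') as source:
            for index, line in enumerate(source):
                outputs[index % parts].write(line)
    finally:
        for out in outputs:
            out.close()
    return paths

# Demo on made-up rows in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'demo.csv.gz')
    with gzip.open(source, 'wt', newline='') as f:
        f.writelines('row %d\n' % n for n in range(10))
    parts = split_gzip_lines(source, os.path.join(tmp, 'demo_%03d.csv.gz'), 4)
    total = 0
    for path in parts:
        with gzip.open(path, 'rt') as f:
            total += sum(1 for _ in f)
    print("Lines across %d parts: %d" % (len(parts), total))
```

The resulting parts could then be loaded together with a pattern such as `demo_0*.csv.gz`. Round-robin distribution changes row order across files, which does not matter for a row count.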

### Reading Spark configurations

In [14]:
# Name of the Spark application instance
app_name = spark.conf.get('spark.app.name')

# Driver TCP port
driver_tcp_port = spark.conf.get('spark.driver.port')

# Number of partitions used when shuffling data for joins and aggregations
num_partitions = spark.conf.get('spark.sql.shuffle.partitions')

# Show the results
print("Name: %s" % app_name)
print("Driver TCP port: %s" % driver_tcp_port)
print("Number of partitions: %s" % num_partitions)

Name: caching
Driver TCP port: 37219
Number of partitions: 200


### Writing Spark configurations

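Modify some of the settings to tune Spark to your needs, then import some data to verify that your changes have taken effect.

The `spark.sql.shuffle.partitions` setting initially has its default value of 200. In this Spark 2.4.3 setup, a DataFrame produced by a shuffle such as `distinct()` has that many partitions, so the partition count can be used to check the setting.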

In [21]:
# Store the number of partitions in variable
before = departures_df.rdd.getNumPartitions()

# Configure Spark to use 500 partitions
spark.conf.set('spark.sql.shuffle.partitions', 500)

# Recreate the DataFrame using the departures data file
departures_df = spark.read.csv('./data/AA_DFW_2017_Departures_Short.csv.gz').distinct()

# Print the number of partitions for each instance
print("Partition count before change: %d" % before)
print("Partition count after change: %d" % departures_df.rdd.getNumPartitions())

Partition count before change: 200
Partition count after change: 500


+----------+----+---+---+
|       _c0| _c1|_c2|_c3|
+----------+----+---+---+
|01/01/2017|2332|ORD|119|
|01/01/2017|2583|DTW|143|
|01/02/2017|1010|STL|109|
|01/02/2017|2333|ORF|182|
|01/04/2017|1459|MCI| 87|
|01/04/2017|2589|AUS| 73|
|01/07/2017|1209|SMF|248|
|01/08/2017|2655|LBB| 56|
|01/09/2017|1414|AUS| 54|
|01/09/2017|2555|MCO|152|
|01/12/2017|0320|SAT| 65|
|01/13/2017|2302|MSY| 75|
|01/17/2017|0542|PHX|158|
|01/17/2017|2195|JAC|159|
|01/19/2017|2322|PBI|152|
|01/22/2017|1568|MCI| 85|
|01/22/2017|1606|RDU|158|
|01/22/2017|2361|BWI|182|
|01/22/2017|2396|OMA|120|
|01/25/2017|0702|PHL|163|
+----------+----+---+---+
only showing top 20 rows

