# Chapter 7: Optimizing and Tuning Spark Applications
Christoph Windheuser    
May, 2022   
Python examples of chapter 7 (page 173 ff) in the book *Learning Spark*

In [1]:
# Import required Python Spark libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, when, concat, lit, avg


In [2]:
# Create a SparkSession

spark = (SparkSession
         .builder
         .enableHiveSupport()
         .appName("Chapter_7")
         .getOrCreate())


In [3]:
# Show the content of the environment variable $SPARK_HOME:
!echo $SPARK_HOME

/opt/spark


In [None]:
# Show all config files
!ls -l $SPARK_HOME/conf

In [None]:
# Get a single Spark configuration value:
print(spark.conf.get("spark.sql.warehouse.dir"))

In [None]:
# Get the whole configuration of the SparkContext:
scConf = spark.sparkContext.getConf().getAll()

for key, value in scConf:
    print(key + ":")
    print(value)
    print()


In [None]:
# Change single Spark config variables
spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)

In [None]:
# Show the Spark SQL-specific Spark configs:
spark.sql("SET -v").select("key", "value").show(truncate=False)

## Spark's Web Interface
To see Spark's web interface, go to the web address http://127.0.0.1:4040    
The *Environment* tab shows all environment variables. In the web interface the variables are *read-only*; they cannot be modified.

## Set configuration variables in a Spark program

In [None]:
# First check whether a configuration variable is modifiable:

# Example:
spark.conf.isModifiable("spark.sql.shuffle.partitions")

In [None]:
# Get the current value of the variable:
spark.conf.get("spark.sql.shuffle.partitions")

In [None]:
# Set the variable to a new value and check:
spark.conf.set("spark.sql.shuffle.partitions", 5)
spark.conf.get("spark.sql.shuffle.partitions")

In [None]:
# Set it back to the old value:
spark.conf.set("spark.sql.shuffle.partitions", 16)
spark.conf.get("spark.sql.shuffle.partitions")
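
Setting the value back by hand requires knowing the old value (16 in the cell above). A small context manager can remember and restore it instead. The following is a minimal sketch, not part of the book: the helper name `temporary_conf` and the stand-in class `DictConf` are hypothetical. The helper only assumes an object with `get(key)` and `set(key, value)` methods, such as `spark.conf`, and a key that already has a value. The stand-in exists only so that the sketch runs without a Spark session.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` inside the with-block, then restore the old value.

    `conf` is any object with get(key) and set(key, value) methods,
    for example `spark.conf`. The helper name is hypothetical.
    """
    old_value = conf.get(key)        # remember the current setting
    conf.set(key, value)
    try:
        yield conf
    finally:
        conf.set(key, old_value)     # restore even if the block raises


class DictConf:
    """Tiny in-memory stand-in for spark.conf (hypothetical, for this demo only)."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        # spark.conf.get returns strings, so the stand-in stores strings too
        self._values[key] = str(value)


conf = DictConf({"spark.sql.shuffle.partitions": "16"})
with temporary_conf(conf, "spark.sql.shuffle.partitions", 5):
    inside = conf.get("spark.sql.shuffle.partitions")
after = conf.get("spark.sql.shuffle.partitions")
print(inside, after)
```

In the notebook, the same pattern would read `with temporary_conf(spark.conf, "spark.sql.shuffle.partitions", 5): ...`, with the queries that need the changed setting placed inside the block.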

## Partitions
Page 181 ff.

In [None]:
# Create a big DataFrame:
numDF = spark.range(1000 * 1000)

In [None]:
# Get the default number of partitions of this DataFrame
numDF.rdd.getNumPartitions()

In [None]:
# Now change the number of partitions to another value
numDF = spark.range(1000 * 1000).repartition(32)

In [None]:
# Check the number of partitions:
numDF.rdd.getNumPartitions()

## Caching of Data
Page 183 ff.

Create a DataFrame with 10M records.  

The time difference between *Count and load into cache* and *Count in cache* (roughly 18x in the run recorded below) can only be demonstrated when this code is run for the first time in the notebook. In subsequent executions the DataFrame is already cached, so there is basically no time difference.

In [4]:


import time

start = time.time()
df = spark.range(1 * 10000000).toDF("id")
end = time.time()
print("Step 1 - Create:                    %f seconds" %(end - start))

start = time.time()
df = df.withColumn("square", df.id * df.id)
end = time.time()
print("Step 2 - Add Column:                %f seconds" %(end - start))

start = time.time()
df.cache()
end = time.time()
print("Step 3 - Cache df:                  %f seconds" %(end - start))

start = time.time()
df.count()
end = time.time()
print("Step 4 - Count and load into cache: %f seconds" %(end - start))

start = time.time()
df.count()
end = time.time()
print("Step 5 - Count in cache:            %f seconds" %(end - start))


Step 1 - Create:                    1.442420 seconds
Step 2 - Add Column:                0.086140 seconds
Step 3 - Cache df:                  0.292966 seconds
Step 4 - Count and load into cache: 3.301321 seconds
Step 5 - Count in cache:            0.181523 seconds
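
The cell above repeats the same start/stop/print pattern five times. A small timing context manager removes that repetition. This is a sketch, not from the book: the helper name `timed` and the optional `results` dictionary are assumptions made for illustration. Note that steps 1 to 3 only build the query plan and mark the DataFrame for caching, because Spark evaluates lazily. The actual computation happens in the `count()` actions of steps 4 and 5.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time spent inside the with-block.

    If `results` is a dict, the elapsed seconds are also stored under `label`.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%-36s %f seconds" % (label + ":", elapsed))


timings = {}
with timed("Demo - sleep 50 ms", timings):
    time.sleep(0.05)    # stand-in for an action such as df.count()
```

With the DataFrame from the cell above, step 4 would become `with timed("Step 4 - Count and load into cache"): df.count()`.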


### Caching Tables and Views in SQL
It is also possible to cache tables or views:

In [8]:
df.createOrReplaceTempView("dfTable")
spark.sql("CACHE TABLE dfTable")
spark.sql("SELECT count(*) FROM dfTable").show()

+--------+
|count(1)|
+--------+
|10000000|
+--------+



## Persistence of Data
Page 184 ff

Persisting data is synonymous with caching it, but lets you specify how the data is persisted via the `storageLevel` parameter, set to a value of the form `pyspark.StorageLevel.LEVEL`.

Because the persistence here is on disk only, the speed-up is smaller than in the caching example above: roughly 6x in the run recorded below. In the web interface at http://127.0.0.1:4040/ you can see that, for all partitions, the data is persisted on disk and not in memory.

In [7]:
start = time.time()
df2 = spark.range(1 * 10000000).toDF("id")
end = time.time()
print("Step 1 - Create:                    %f seconds" %(end - start))

start = time.time()
df2 = df2.withColumn("square", df2.id * df2.id)
end = time.time()
print("Step 2 - Add Column:                %f seconds" %(end - start))

start = time.time()
df2.persist(storageLevel=pyspark.StorageLevel.DISK_ONLY)
end = time.time()
print("Step 3 - Persist df DISK_ONLY:      %f seconds" %(end - start))

start = time.time()
df2.count()
end = time.time()
print("Step 4 - Count and load into cache: %f seconds" %(end - start))

start = time.time()
df2.count()
end = time.time()
print("Step 5 - Count in cache:            %f seconds" %(end - start))


Step 1 - Create:                    0.019116 seconds
Step 2 - Add Column:                0.006613 seconds
Step 3 - Persist df DISK_ONLY:      0.031600 seconds
Step 4 - Count and load into cache: 1.571908 seconds
Step 5 - Count in cache:            0.259982 seconds


## Shuffle Sort Merge Join (SMJ)
Page 189 ff.
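
To make the name concrete: in the physical plan at the end of this section, an `Exchange hashpartitioning` step shuffles rows so that equal keys end up in the same partition, a `Sort` step orders each side by the join key, and `SortMergeJoin` walks both sorted sides in step. The following plain-Python sketch shows only the sort and merge phases for a single partition. It is an illustration under these assumptions, not Spark's implementation, and the function name `sort_merge_join` and the sample rows are hypothetical.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Inner join of two lists of dicts via a sort phase and a merge phase."""
    # Sort phase: order both sides by their join key
    left = sorted(left, key=lambda row: row[left_key])
    right = sorted(right, key=lambda row: row[right_key])

    joined, i, j = [], 0, 0
    # Merge phase: advance whichever side currently has the smaller key
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key ...
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            # ... and pair it with every left row that has the same key
            while i < len(left) and left[i][left_key] == lk:
                for right_row in right[j:j_end]:
                    joined.append({**left[i], **right_row})
                i += 1
            j = j_end
    return joined


orders = [
    {"transaction_id": 0, "users_id": 2},
    {"transaction_id": 1, "users_id": 1},
    {"transaction_id": 2, "users_id": 2},
    {"transaction_id": 3, "users_id": 7},
]
users = [
    {"uid": 1, "login": "user_1"},
    {"uid": 2, "login": "user_2"},
    {"uid": 3, "login": "user_3"},
]
result = sort_merge_join(orders, users, "users_id", "uid")
for row in result:
    print(row)
```

Because both inputs are sorted, each side is scanned once, apart from re-reading runs of duplicate keys. Order 3 (user 7) and user 3 have no partner and drop out of the inner join.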

In [3]:
from random import randint

# Disable broadcast join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

In [4]:
# Generate synthetic data for two data frames
states = ['AZ', 'CO', 'CA', 'TX', 'NY', 'MI']
items  = ['SKU-0', 'SKU-1', 'SKU-2', 'SKU-3', 'SKU-4']

In [5]:
userDF = spark.range(1000000).toDF("uid")

In [6]:
userDF.show(n=3)

+---+
|uid|
+---+
|  0|
|  1|
|  2|
+---+
only showing top 3 rows



In [7]:
# Note: randint() is evaluated once on the driver, so lit() assigns the same state to every row.
userDF = (userDF
            .withColumn("login", concat(lit("user_"), expr("uid")))
            .withColumn("email", concat(lit("user_"), expr("uid"), lit("@databricks.com")))
            .withColumn("user_state", lit(states[randint(0, 5)]))
         )

In [8]:
userDF.show(n=5)

+---+------+--------------------+----------+
|uid| login|               email|user_state|
+---+------+--------------------+----------+
|  0|user_0|user_0@databricks...|        TX|
|  1|user_1|user_1@databricks...|        TX|
|  2|user_2|user_2@databricks...|        TX|
|  3|user_3|user_3@databricks...|        TX|
|  4|user_4|user_4@databricks...|        TX|
+---+------+--------------------+----------+
only showing top 5 rows
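
Every row in the output above carries the same `user_state`, because `randint` runs once in Python before the column expression is built. To vary the value per row, the random draw has to happen once per row. One way, sketched here under the assumption that the data is small enough to be generated on the driver, is to build the rows in plain Python first. The helper name `make_user_rows` and the `seed` parameter are hypothetical.

```python
import random

states = ['AZ', 'CO', 'CA', 'TX', 'NY', 'MI']


def make_user_rows(n, states, seed=None):
    """Return n (uid, login, email, user_state) tuples with a per-row random state."""
    rng = random.Random(seed)            # seeded generator for reproducible data
    return [
        (uid,
         "user_%d" % uid,
         "user_%d@databricks.com" % uid,
         rng.choice(states))             # drawn anew for every row
        for uid in range(n)
    ]


rows = make_user_rows(1000, states, seed=42)
print(rows[:3])
print(sorted({row[3] for row in rows}))  # distinct states that occur
```

The rows can then be turned into a DataFrame with `spark.createDataFrame(rows, ["uid", "login", "email", "user_state"])`. For larger data sets, a column expression built on `pyspark.sql.functions.rand()` is probably the better fit, since it is evaluated per row inside Spark.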



In [9]:
orderDF = spark.range(1000000).toDF("transaction_id")

In [10]:
orderDF.show(n=3)

+--------------+
|transaction_id|
+--------------+
|             0|
|             1|
|             2|
+--------------+
only showing top 3 rows



In [11]:
# Note: each randint() call is evaluated once on the driver, so users_id, state and items are constant across all rows.
orderDF = (orderDF
            .withColumn("quantity", orderDF.transaction_id)
            .withColumn("users_id", lit(randint(0, 10000)))
            .withColumn("amount",   orderDF.transaction_id * 2.0)
            .withColumn("state",    lit(states[randint(0,5)]))
            .withColumn("items",    lit(items[randint(0,4)]))
          )

In [12]:
orderDF.show(n=5)

+--------------+--------+--------+------+-----+-----+
|transaction_id|quantity|users_id|amount|state|items|
+--------------+--------+--------+------+-----+-----+
|             0|       0|     820|   0.0|   MI|SKU-4|
|             1|       1|     820|   2.0|   MI|SKU-4|
|             2|       2|     820|   4.0|   MI|SKU-4|
|             3|       3|     820|   6.0|   MI|SKU-4|
|             4|       4|     820|   8.0|   MI|SKU-4|
+--------------+--------+--------+------+-----+-----+
only showing top 5 rows



In [13]:
userOrderDF = orderDF.join(userDF, orderDF.users_id == userDF.uid)

In [14]:
userOrderDF.show()

+--------------+--------+--------+------+-----+-----+---+--------+--------------------+----------+
|transaction_id|quantity|users_id|amount|state|items|uid|   login|               email|user_state|
+--------------+--------+--------+------+-----+-----+---+--------+--------------------+----------+
|             0|       0|     820|   0.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|             1|       1|     820|   2.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|             2|       2|     820|   4.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|             3|       3|     820|   6.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|             4|       4|     820|   8.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|             5|       5|     820|  10.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|             6|       6|     820|  12.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|         

In [15]:
userOrderDF.explain()

== Physical Plan ==
CartesianProduct
:- *(1) Project [id#38L AS transaction_id#40L, id#38L AS quantity#47L, 820 AS users_id#50, (cast(id#38L as double) * 2.0) AS amount#54, MI AS state#59, SKU-4 AS items#65]
:  +- *(1) Range (0, 1000000, step=1, splits=16)
+- *(2) Project [id#0L AS uid#2L, concat(user_, cast(id#0L as string)) AS login#9, concat(user_, cast(id#0L as string), @databricks.com) AS email#12, TX AS user_state#16]
   +- *(2) Filter (820 = id#0L)
      +- *(2) Range (0, 1000000, step=1, splits=16)




In [21]:
(userDF
     .orderBy(col("uid").asc())
     .write.format("parquet")
     .bucketBy(8, "uid")
     .mode("overwrite")
     .saveAsTable("UserTbl")
)

In [22]:
(orderDF
     .orderBy(col("users_id").asc())
     .write.format("parquet")
     .bucketBy(8, "users_id")
     .mode("overwrite")
     .saveAsTable("OrderTbl")
)

In [23]:
spark.sql("CACHE TABLE UserTbl")
spark.sql("CACHE TABLE OrderTbl")

DataFrame[]

In [24]:
userBucketDF  = spark.table("UserTbl")
orderBucketDF = spark.table("OrderTbl")


In [27]:
joinUserOrderBucketDF = orderBucketDF.join(userBucketDF, orderBucketDF.users_id == userBucketDF.uid)

In [28]:
joinUserOrderBucketDF.show()

+--------------+--------+--------+---------+-----+-----+---+--------+--------------------+----------+
|transaction_id|quantity|users_id|   amount|state|items|uid|   login|               email|user_state|
+--------------+--------+--------+---------+-----+-----+---+--------+--------------------+----------+
|        750000|  750000|     820|1500000.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|        750001|  750001|     820|1500002.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|        750002|  750002|     820|1500004.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|        750003|  750003|     820|1500006.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|        750004|  750004|     820|1500008.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|        750005|  750005|     820|1500010.0|   MI|SKU-4|820|user_820|user_820@databric...|        TX|
|        750006|  750006|     820|1500012.0|   MI|SKU-4|820|user_820|user_820@data

In [29]:
joinUserOrderBucketDF.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [cast(users_id#331 as bigint)], [uid#192L], Inner
   :- Sort [cast(users_id#331 as bigint) ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(cast(users_id#331 as bigint), 200), ENSURE_REQUIREMENTS, [id=#394]
   :     +- Filter isnotnull(users_id#331)
   :        +- Scan In-memory table OrderTbl [transaction_id#329L, quantity#330L, users_id#331, amount#332, state#333, items#334], [isnotnull(users_id#331)]
   :              +- InMemoryRelation [transaction_id#329L, quantity#330L, users_id#331, amount#332, state#333, items#334], StorageLevel(disk, memory, deserialized, 1 replicas)
   :                    +- *(1) ColumnarToRow
   :                       +- FileScan parquet default.ordertbl[transaction_id#329L,quantity#330L,users_id#331,amount#332,state#333,items#334] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/christoph/Dev/LearningSpark/spark-warehouse/