# Configuration and tuning

For help on optimizing your programs, the configuration and tuning guides provide information on best practices. They are especially important for making sure that your data is stored in memory in an efficient format.

# Tuning

STRONGLY ADVISED: read https://spark.apache.org/docs/latest/tuning.html

[SDG] chapter 19 "Performance Tuning"

A nice resource calculator is available at http://spark-configuration.luminousmen.com/

## Data Serialization


<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*_3-owXxQnlBBvjBwaKD10g.png" >

[https://teepika-r-m.medium.com/serialization-in-apache-spark-cdbb49099a8e]

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application.

Spark provides two serialization libraries: Java serialization and Kryo.

Kryo can be up to 10x faster than Java serialization, but it is not the default.

You can switch to using Kryo by initializing your job with a SparkConf and calling `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")`. This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk.
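
As a rough, Spark-independent illustration of why the byte size of a serialized format matters, the sketch below serializes the same list of doubles twice using only the Python standard library: once with `pickle` and once as a packed array of raw 8-byte values. It is an analogy, not a benchmark of Java serialization or Kryo. The variable names and the sample size are assumptions chosen for illustration.

```python
import pickle
import random
from array import array

# Hypothetical sample: 100,000 random doubles (size chosen only for illustration).
n = 100_000
values = [random.gauss(0.0, 1.0) for _ in range(n)]

# General-purpose object serialization: every float is written with an
# extra opcode byte, plus list and framing overhead.
pickled = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

# Compact binary layout: 8 bytes per double, no per-element metadata.
packed = array("d", values).tobytes()

print(f"pickle: {len(pickled):,} bytes")
print(f"packed: {len(packed):,} bytes")
print(f"ratio:  {len(pickled) / len(packed):.2f}x")
```

The gap is modest for plain doubles, but it tends to grow for nested objects, where a general-purpose format also has to describe the structure of every record.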


In [None]:
#TODO implement a test case that shows the performance diff between default and kryo serializer.

## Memory Tuning
https://spark.apache.org/docs/latest/tuning.html#memory-tuning

### Determining Memory Consumption

In [None]:
from pyspark.mllib.random import RandomRDDs
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.appName('tuning')\
    .config("spark.kryoserializer.buffer.max", "512m")\
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
    .getOrCreate()
sc = spark.sparkContext

In [None]:
# ONLY when running in jupyter:
#spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

# normalRDD expects a SparkContext as its first argument.
u = RandomRDDs.normalRDD(sc, 10000000, 8)
u = u.map(lambda x: (x,)) # convert to tuple so we can transform into DF
u.cache()
schema = StructType([  StructField('c1', DoubleType(), True)])
# we can move from RDD to Dataframe and back. 
df = spark.createDataFrame(u, schema)

# An action must run on u before its storage usage shows up in the UI.

In [None]:
df.agg({'c1': 'sum'}).show()
#df.show()

How many bytes are needed to store 10M double-precision values? In C, it would take 10M * 8 bytes = 80MB.

Open the UI at http://localhost:4040 and look at the “Storage” page.
In my setup, it shows 92MB.
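
The comparison between the raw estimate and the cached size can be worked out in a few lines. This is a minimal sketch: the 92MB value is the figure reported for one particular setup, and whether the UI means decimal megabytes or binary mebibytes is treated here as an assumption.

```python
# Raw size of 10 million IEEE 754 double-precision values.
n_values = 10_000_000
bytes_per_double = 8
raw_bytes = n_values * bytes_per_double

raw_mb = raw_bytes / 1_000_000      # decimal megabytes
raw_mib = raw_bytes / (1024 ** 2)   # binary mebibytes

# Figure read from the "Storage" page in one setup (assumed to be in MB).
observed_mb = 92
overhead = observed_mb / raw_mb

print(f"raw: {raw_mb:.0f} MB ({raw_mib:.1f} MiB)")
print(f"observed / raw = {overhead:.2f}")
```

Under these assumptions, the cached representation costs roughly 15% more than the raw payload.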

In [None]:
# Let's create another RDD to consume more memory and see what happens.
u2 = RandomRDDs.normalRDD(sc, 40000000, 4)
u2 = u2.map(lambda x: (x,)) # convert to tuple so we can transform into DF
u2.cache()
schema = StructType([  StructField('c1', DoubleType(), True)])
df2 = spark.createDataFrame(u2, schema)
df2.agg({'c1': 'sum'}).show()

While computing the above cell, I got
```
MemoryStore: Not enough space to cache rdd_16_2 in memory! (computed 73.9 MiB so far)

23/01/04 11:29:43 WARN BlockManager: Putting block rdd_16_0 failed
23/01/04 11:29:43 ERROR Utils: Uncaught exception in thread stdout writer for python3
java.lang.OutOfMemoryError: Java heap space
```

This happens because the Java heap is currently limited to 400MB. Once that space fills up, some cached content is flushed or discarded.
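
A back-of-the-envelope check shows why the second RDD does not fit. The sketch below counts only the raw 8 bytes per double for both cached RDDs and compares the total with the 400MB figure quoted above. It ignores object overhead and everything else Spark keeps on the heap, so the real pressure is higher.

```python
BYTES_PER_DOUBLE = 8
MB = 1_000_000

first_rdd_mb = 10_000_000 * BYTES_PER_DOUBLE / MB    # 80 MB raw
second_rdd_mb = 40_000_000 * BYTES_PER_DOUBLE / MB   # 320 MB raw
heap_limit_mb = 400                                   # figure quoted for this setup

total_raw_mb = first_rdd_mb + second_rdd_mb
print(f"raw total: {total_raw_mb:.0f} MB, limit: {heap_limit_mb} MB")
print("fits with headroom:", total_raw_mb < heap_limit_mb)
```

The raw payload alone already equals the limit, which leaves no room for per-object overhead or for Spark's own working memory.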


In [None]:
df2.count()

# Configuration

The default configuration might be good for your job, but it also might cause severe bottlenecks.
You should be able to identify the major configuration settings and update them.
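
In PySpark, `sc.getConf().getAll()` returns a list of `(key, value)` pairs, which can be long. A small helper makes it easier to spot the settings of interest. The sketch below is illustrative only: `filter_conf` is a hypothetical helper, and `sample` is made-up data standing in for a real configuration.

```python
from typing import Iterable, List, Tuple


def filter_conf(pairs: Iterable[Tuple[str, str]], prefix: str) -> List[Tuple[str, str]]:
    """Return the (key, value) pairs whose key starts with `prefix`, sorted by key."""
    return sorted((k, v) for k, v in pairs if k.startswith(prefix))


# Hypothetical sample standing in for the output of sc.getConf().getAll().
sample = [
    ("spark.app.name", "tuning"),
    ("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),
    ("spark.kryoserializer.buffer.max", "512m"),
    ("spark.master", "local[*]"),
]

for key, value in filter_conf(sample, "spark.kryo"):
    print(f"{key} = {value}")
```

In a live session, the same helper could be called as `filter_conf(sc.getConf().getAll(), "spark.serializer")`.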

In [None]:
# Get all the current Spark configuration settings.
# This does NOT include the cluster manager (YARN, for example)!
sc.getConf().getAll()

In [None]:
spark