# Tuning and Optimizing Spark

In [2]:
from pyspark.sql import SparkSession

In [5]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [6]:
spark = SparkSession.builder.getOrCreate()

### Managing Spark Configurations

* We can inspect Spark's environment and configuration in the `Environment` tab of the Spark UI
* `spark.conf.isModifiable()` reports whether a configuration can be changed at runtime
* Configurations can be set in `spark-defaults.conf`, passed as command-line arguments to `spark-submit`, or set on the SparkSession
* Order of precedence, from lowest to highest: `spark-defaults.conf`, `spark-submit`, SparkSession (a value set later in this order overrides an earlier one)
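
The precedence order can be pictured as layered dictionaries, where each later layer overrides the earlier ones. This is only a sketch of the idea, not Spark's own code; the function name `resolve_spark_conf` and the sample values are made up for illustration.

```python
def resolve_spark_conf(defaults_file, submit_flags, session_settings):
    """Merge three configuration layers; later layers override earlier ones."""
    resolved = {}
    # Lowest precedence first, highest precedence last.
    for layer in (defaults_file, submit_flags, session_settings):
        resolved.update(layer)
    return resolved


# Hypothetical values for each layer.
resolved = resolve_spark_conf(
    defaults_file={"spark.sql.shuffle.partitions": "200", "spark.executor.memory": "1g"},
    submit_flags={"spark.executor.memory": "4g"},
    session_settings={"spark.sql.shuffle.partitions": "6"},
)
print(resolved)
```

Here the SparkSession value wins for `spark.sql.shuffle.partitions`, while `spark.executor.memory` comes from the `spark-submit` layer because nothing later overrides it.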

In [7]:
spark.conf.isModifiable("spark.sql.shuffle.partitions")

True

#### Setting and Getting Spark Configurations

In [8]:
spark.conf.get("spark.sql.shuffle.partitions")

'200'

In [9]:
spark.conf.set("spark.sql.shuffle.partitions", 6)

In [10]:
spark.conf.get("spark.sql.shuffle.partitions")

'6'
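
When a setting should only apply to part of a notebook, one option is to set it temporarily and restore the previous state afterwards. The helper below is a minimal sketch under that assumption; `temporary_conf` is a hypothetical name, and it expects an object that behaves like `spark.conf` (with `get`, `set`, and `unset`).

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(runtime_conf, key, value):
    """Set `key` to `value` inside a with-block, then restore the old state."""
    # A default of None avoids an exception when the key was never set.
    previous = runtime_conf.get(key, None)
    runtime_conf.set(key, value)
    try:
        yield runtime_conf
    finally:
        if previous is None:
            # The key had no explicit value before, so remove our override.
            runtime_conf.unset(key)
        else:
            runtime_conf.set(key, previous)
```

With the session from the cells above, usage would look like `with temporary_conf(spark.conf, "spark.sql.shuffle.partitions", "6"): ...`.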

### Scaling Spark for Large Workloads

Spark configurations affect three Spark components:  
* The Spark driver
* The executor
* The shuffle service running on the executor

#### Static Versus Dynamic Resource Allocation

* Resources specified as command-line arguments to `spark-submit` are static: they cap what the application can use for its whole lifetime
* With dynamic resource allocation, the Spark driver can request more or fewer compute resources as the demand of the workload rises and falls

In [11]:
# Some configurations can be set from the Spark REPL; reading a key that was never set raises an error
spark.conf.get("spark.dynamicAllocation.enabled")

Py4JJavaError: An error occurred while calling o26.get.
: java.util.NoSuchElementException: spark.dynamicAllocation.enabled
	at org.apache.spark.sql.errors.QueryExecutionErrors$.noSuchElementExceptionError(QueryExecutionErrors.scala:1678)
	at org.apache.spark.sql.internal.SQLConf.$anonfun$getConfString$3(SQLConf.scala:4577)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:4577)
	at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:72)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
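
The error above occurs because the key has no value in the running session. Passing a default as the second argument to `get()` avoids the exception. The helper below is a small sketch of that pattern; the function name is hypothetical, and `runtime_conf` is assumed to behave like `spark.conf`.

```python
def is_dynamic_allocation_enabled(runtime_conf):
    """Return True only if dynamic allocation is explicitly set to true.

    Supplying "false" as the default means an unset key is reported as
    disabled instead of raising NoSuchElementException.
    """
    value = runtime_conf.get("spark.dynamicAllocation.enabled", "false")
    # Values come back as strings such as 'true' or 'True'.
    return str(value).strip().lower() == "true"
```

With the session from the cells above, `is_dynamic_allocation_enabled(spark.conf)` would return `False` at this point instead of failing.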


In [12]:
# Setting the config programmatically

In [13]:
from pyspark import SparkConf

In [14]:
conf = (SparkConf()
       .setAppName("MyApp")
       .setMaster("local")
)

In [19]:
conf.set("spark.dynamicAllocation.enabled", True)

<pyspark.conf.SparkConf at 0x7fa77ca29e50>

In [20]:
conf.get("spark.dynamicAllocation.enabled")

'True'

In [21]:
spark = SparkSession.builder.config(conf=conf).getOrCreate()

23/05/06 12:57:42 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [22]:
spark.conf.get("spark.dynamicAllocation.enabled")

'True'

Set the following for dynamic allocation:  
* `spark.dynamicAllocation.enabled true`  
* `spark.dynamicAllocation.minExecutors 2`  
* `spark.dynamicAllocation.maxExecutors 20`  
* `spark.dynamicAllocation.schedulerBacklogTimeout 1m`  
* `spark.dynamicAllocation.executorIdleTimeout 2min`  

New executors are requested each time the backlog timeout (`spark.dynamicAllocation.schedulerBacklogTimeout`) is exceeded. With the settings above, whenever tasks have been pending for more than 1 minute, the driver requests a new executor to run the backlogged tasks, up to a maximum of 20. Conversely, if an executor finishes its tasks and stays idle for 2 minutes (`spark.dynamicAllocation.executorIdleTimeout`), the Spark driver terminates it.
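
As the `WARN SparkSession` message earlier in this section suggests, these settings need to be in place before the session starts. The sketch below shows two ways to supply them: as `--conf key=value` arguments for `spark-submit`, or on the session builder in a fresh process. The helper names `to_submit_args` and `build_session` are hypothetical, and whether dynamic allocation actually takes effect depends on the cluster manager in use.

```python
# The dynamic allocation settings discussed above, stored as strings.
DYNAMIC_ALLOCATION_SETTINGS = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1m",
    "spark.dynamicAllocation.executorIdleTimeout": "2min",
}


def to_submit_args(settings):
    """Render settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


def build_session(settings, app_name="MyApp"):
    """Create a session with the settings applied on the builder.

    pyspark is imported lazily so that to_submit_args() works without it.
    If a session already exists, getOrCreate() returns that one and only
    runtime SQL configurations are applied.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(" ".join(to_submit_args(DYNAMIC_ALLOCATION_SETTINGS)))
```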

### Configuring Spark Executors Memory

* The amount of memory available to each executor is controlled by `spark.executor.memory`
* Executor memory is divided into three sections (default split, after setting aside the reserved memory):  
  * Execution Memory (60%)
  * Storage Memory (40%)
  * Reserved Memory (300 MB)
* The split can be adjusted through configurations such as `spark.memory.fraction`
* If storage memory is not being used, Spark can use it for execution memory, and vice versa  
  
![Spark Memory layout](/home/thulasiram/personal/data_engineering/images/spark_memory_layout.png)

In [26]:
conf.get("spark.memory.fraction")

In [27]:
conf.set("spark.memory.fraction", 0.5)

<pyspark.conf.SparkConf at 0x7fa77ca29e50>

In [28]:
conf.get("spark.memory.fraction")

'0.5'

* Execution memory is used for shuffles, joins, sorts, and aggregations
* Storage memory is used for caching user data structures and partitions derived from DataFrames
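
To get a feel for the numbers, the following sketch applies the simplified 60/40 split from this section to a given executor size. It is a rough estimate only: the function name and the 4300 MB example are assumptions, and Spark's actual accounting is governed by `spark.memory.fraction` and `spark.memory.storageFraction`.

```python
RESERVED_MB = 300  # reserved memory, as listed in this section


def split_executor_memory(executor_memory_mb, execution_share=0.6, storage_share=0.4):
    """Estimate execution and storage memory using the simplified split."""
    usable_mb = executor_memory_mb - RESERVED_MB
    if usable_mb <= 0:
        raise ValueError("executor memory must exceed the reserved memory")
    return {
        "reserved_mb": RESERVED_MB,
        "execution_mb": usable_mb * execution_share,
        "storage_mb": usable_mb * storage_share,
    }


# Hypothetical executor with 4300 MB, leaving 4000 MB after the reserve.
layout = split_executor_memory(4300)
print(layout)
```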

### Maximizing Spark Parallelism

* A Spark job has many stages, and each stage has many tasks
* Spark schedules one task per core
* Each task processes one partition
* Ideally, there are at least as many partitions as there are cores on the executors, so that no core sits idle  
  
![cores and partitions](/home/thulasiram/personal/data_engineering/images/core_vs_partitions.png)

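* When reading files, the maximum size of a partition is set by `spark.sql.files.maxPartitionBytes` (default 128 MB)
* Decreasing this size too far produces many small partitions (the "small file problem"), which increases disk I/O and degrades performance
* We can control the number of partitions, for example with `repartition()` or `coalesce()` on a DataFrame, or with `spark.sql.shuffle.partitions` for shuffles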

In [30]:
spark.conf.get("spark.sql.files.maxPartitionBytes")

'134217728b'
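
The relationship between input size, partition size, and cores can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not Spark's actual file-splitting logic (which also accounts for other settings); the helper names, the 1 GiB input, and the 4 cores are assumptions for illustration.

```python
import math


def parse_bytes(value):
    """Convert a plain byte count such as '134217728b' to an int."""
    return int(str(value).rstrip("b"))


def estimate_partitions(total_bytes, max_partition_bytes):
    """Rough number of partitions needed to read `total_bytes` of input."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))


def scheduling_waves(num_partitions, total_cores):
    """Number of rounds needed when each core runs one task at a time."""
    return math.ceil(num_partitions / total_cores)


max_bytes = parse_bytes("134217728b")  # value returned by the cell above
partitions = estimate_partitions(1024**3, max_bytes)  # hypothetical 1 GiB input
print(partitions, scheduling_waves(partitions, total_cores=4))
# prints: 8 2
```

With 8 partitions and 4 cores, the tasks run in 2 waves; fewer partitions than cores would leave some cores idle.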