# Tuning and Optimizing Spark

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.getOrCreate()

23/05/06 12:00:30 WARN Utils: Your hostname, thulasiram resolves to a loopback address: 127.0.1.1; using 192.168.0.105 instead (on interface wlp0s20f3)
23/05/06 12:00:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/05/06 12:00:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/06 12:00:31 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### Managing Spark Configurations

* The current Spark configuration is listed on the `Environment` tab of the Spark UI.
* Use the `spark.conf.isModifiable()` method to check whether a configuration can be changed at runtime.
* Configurations can be changed with command-line arguments to `spark-submit` or set in the `SparkSession`.
* Configurations are read in this order: `spark-defaults.conf`, then `spark-submit` arguments, then `SparkSession`. Later sources override earlier ones, so values set in the `SparkSession` take precedence.
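The precedence rule can be pictured as a layered merge. The sketch below is only a plain-Python model of that rule, not Spark's own code: the function name `resolve_conf` and the sample values are made up for illustration.

```python
def resolve_conf(defaults_file, submit_flags, session_settings):
    """Merge three configuration layers; later layers override earlier ones."""
    resolved = {}
    # Read order: spark-defaults.conf, then spark-submit, then SparkSession.
    for layer in (defaults_file, submit_flags, session_settings):
        resolved.update(layer)
    return resolved


# Hypothetical values, chosen only to show which layer wins.
defaults_file = {"spark.sql.shuffle.partitions": "200", "spark.executor.memory": "1g"}
submit_flags = {"spark.executor.memory": "2g"}
session_settings = {"spark.sql.shuffle.partitions": "6"}

effective = resolve_conf(defaults_file, submit_flags, session_settings)
print(effective)
# {'spark.sql.shuffle.partitions': '6', 'spark.executor.memory': '2g'}
```

In this model the `spark-submit` value wins for executor memory because the session never sets it, while the session value wins for shuffle partitions.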

In [5]:
spark.conf.isModifiable("spark.sql.shuffle.partitions")

True

#### Setting and Getting Spark Configurations

In [7]:
spark.conf.get("spark.sql.shuffle.partitions")

'200'

In [8]:
spark.conf.set("spark.sql.shuffle.partitions", 6)

In [9]:
spark.conf.get("spark.sql.shuffle.partitions")

'6'
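When a setting should apply to one step only, the get/set pair above can be wrapped so the old value comes back afterwards. This is a minimal sketch: `temporary_conf` is a hypothetical helper, and it assumes only that the object passed in offers `get(key, default)`, `set(key, value)` and `unset(key)`, as `spark.conf` does.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` inside the `with` block, then restore the old state."""
    previous = conf.get(key, None)  # None means the key had no value before
    conf.set(key, value)
    try:
        yield conf
    finally:
        if previous is None:
            conf.unset(key)
        else:
            conf.set(key, previous)
```

With the session from the cells above, usage would look like `with temporary_conf(spark.conf, "spark.sql.shuffle.partitions", "6"): ...`, and the previous value is restored when the block exits.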

### Scaling Spark for Large Workloads

Spark configurations affect three Spark components:
* The Spark driver
* The executor
* The shuffle service running on the executor

#### Static Versus Dynamic Resource Allocation

* Resources specified as command-line arguments to `spark-submit` cap what the application can use; the allocation is static.
* With dynamic resource allocation, the Spark driver can request more or fewer compute resources as the workload changes.
* Some configurations cannot be set from the Spark REPL and must be set programmatically.
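A dynamic allocation setup usually involves a handful of related properties. The sketch below collects them in a dictionary and renders them as `--conf` flags. The property names are real Spark settings, but the values are arbitrary placeholders, and both the helper `to_submit_args` and the script name `my_app.py` are hypothetical. Depending on the cluster manager, dynamic allocation may also need the external shuffle service or shuffle tracking enabled.

```python
# Arbitrary example values; suitable numbers depend on the workload.
dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1m",
    "spark.dynamicAllocation.executorIdleTimeout": "2m",
}


def to_submit_args(settings):
    """Turn a {property: value} mapping into spark-submit `--conf` arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# my_app.py is a placeholder application script.
command = ["spark-submit", *to_submit_args(dynamic_allocation), "my_app.py"]
print(" ".join(command))
```

The same dictionary could be applied programmatically with `SparkConf().setAll(list(dynamic_allocation.items()))` before the session is created.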

In [14]:
from pyspark import SparkConf, SparkContext

In [15]:
conf = (SparkConf()
       .setAppName("MyApp")
       .setMaster("local")
)

In [16]:
conf.set("spark.dynamicAllocation.enabled", "true")

<pyspark.conf.SparkConf at 0x7f5eb91c6070>

In [17]:
conf.get("spark.dynamicAllocation.enabled")

'true'

In [20]:
spark = SparkSession.builder.config(conf=conf).getOrCreate()

23/05/06 12:50:38 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [21]:
spark.conf.get("spark.dynamicAllocation.enabled")

'true'
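The value reads back as `'true'`, but the warning printed by `getOrCreate()` says that only runtime SQL configurations take effect on an existing session, so the running application is probably unchanged. One way to tell the two kinds of settings apart is to sort keys by whether they are modifiable at runtime. This is a sketch: `split_by_modifiability` is a hypothetical helper that accepts any predicate, such as `spark.conf.isModifiable`.

```python
def split_by_modifiability(keys, is_modifiable):
    """Partition config keys into (runtime-modifiable, must-be-set-at-startup)."""
    runtime, startup = [], []
    for key in keys:
        # Keys the predicate rejects cannot be changed on a live session.
        (runtime if is_modifiable(key) else startup).append(key)
    return runtime, startup
```

For example, `split_by_modifiability(["spark.sql.shuffle.partitions", "spark.dynamicAllocation.enabled"], spark.conf.isModifiable)` returns the two groups. Keys in the second group need to be in place before the session starts, for instance by calling `spark.stop()` and then building a new session with the desired `SparkConf`.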