<img src="uva_seal.png">  

## Spark SQL and DataFrames

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023

---  

### SparkSession

This notebook provides details about setting up and configuring the `SparkSession`.

The `SparkSession` is the unified entry point to all Spark operations and data. In PySpark it can also be used as a context manager (in a `with` block), which stops the session on exit.  

Here is an example of building a SparkSession with several configs:

---  
```
from pyspark.sql import SparkSession

# Wrap the chain in parentheses: a comment placed after a backslash
# line continuation is a syntax error in Python.
spark = (
    SparkSession.builder
    .master("local[*]")                           # use all cores on local machine
    .appName("Python Spark SQL basic example")    # will see appName on cluster manager
    .config("spark.executor.memory", "20g")       # RAM per executor (worker)
    .config("spark.executor.cores", "5")          # cores available to EACH executor
    .config("spark.executor.instances", "17")     # total number of executors
    .config("spark.driver.memory", "1g")          # RAM for driver, generally lower need than a worker
    .getOrCreate()
)

# for details see:
# https://spark.apache.org/docs/latest/configuration.html
```
---  

### Setting up Cores, Executors, RAM 
 
- Setting these configs is best codified in a function
- Spark supplies default values for these configs, but the defaults are not always optimal
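
One way to codify the settings in a function is sketched below. The helper name `apply_configs` and the `settings` dictionary are hypothetical and used only for illustration; the function relies solely on the builder's chainable `.config(key, value)` method, so it should work with `SparkSession.builder`.

```python
def apply_configs(builder, settings):
    """Apply each key/value pair in `settings` to a SparkSession builder.

    `builder` is expected to be `SparkSession.builder`, or any object that
    exposes a chainable `.config(key, value)` method.
    """
    for key, value in settings.items():
        # .config() returns the builder, so reassigning keeps the chain going
        builder = builder.config(key, str(value))
    return builder


# Hypothetical settings, matching the values used in the example above
settings = {
    "spark.executor.memory": "20g",
    "spark.executor.cores": 5,
    "spark.executor.instances": 17,
    "spark.driver.memory": "1g",
}

# Usage (requires pyspark to be installed):
# from pyspark.sql import SparkSession
# spark = apply_configs(SparkSession.builder.appName("demo"), settings).getOrCreate()
```

Keeping the settings in a dictionary makes it easy to swap in a different set of values per cluster without touching the session-building code.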

---  

<span style="color:red">**Example: Hardware consists of 6 nodes, each with 16 cores, 64GB RAM**</span>

RESOURCE OVERHEAD:  
$O1$. On each node, 1 core and 1 GB of RAM are consumed by the OS and Hadoop daemons.  
This leaves 15 cores and 63 GB of RAM available on each node.  
$O2$. The resource manager (e.g., YARN) requires an overhead of roughly 1 GB of RAM per node.  
$O3$. One executor is required for the driver.

**Number of cores**  
More cores means more concurrent processing, but running more than about 5 concurrent tasks per executor generally doesn't perform well.  
We therefore cap this at **spark.executor.cores = 5**.  

**Executor instances**  
With 15 usable cores per node (16 minus the 1 reserved in $O1$), we get 15 cores_per_node / 5 cores_per_executor = 3 executors_per_node.

Given 6 nodes and 3 executors per node, there is room for $6 \times 3 = 18$ executors in total.  
One of these executors is required for the driver $(O3)$.  
Thus, we set **spark.executor.instances = 17**.  

**Executor memory**  
Available RAM is 63GB per node $(O1)$. For 3 executors per node, this gives 63GB/3 = 21GB per executor.  
The resource manager requires an overhead of roughly 1GB $(O2)$, so we round down and set **spark.executor.memory = 20g**.
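
The arithmetic above can be sketched as a small function. This is a minimal illustration, not a general sizing tool: the function name `size_executors` and its parameters are hypothetical, and it mirrors the text's rule of thumb by taking about 1 GB of overhead off each executor's share of RAM.

```python
def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   os_cores=1, os_ram_gb=1,
                   max_cores_per_executor=5, overhead_gb=1):
    """Return executor settings following the rule-of-thumb steps above."""
    usable_cores = cores_per_node - os_cores        # O1: 16 - 1 = 15
    usable_ram_gb = ram_gb_per_node - os_ram_gb     # O1: 64 - 1 = 63
    if usable_cores < 1:
        raise ValueError("no cores left after reserving OS/daemon cores")

    # Cap cores per executor (5 in the text), but never exceed what a node has
    cores = min(max_cores_per_executor, usable_cores)
    executors_per_node = usable_cores // cores      # 15 // 5 = 3

    # O3: one executor slot is set aside for the driver
    instances = nodes * executors_per_node - 1      # 6 * 3 - 1 = 17

    # O2: assumed here as ~1 GB off each executor's share, as in the text
    memory_gb = usable_ram_gb // executors_per_node - overhead_gb  # 21 - 1 = 20

    return {
        "spark.executor.cores": str(cores),
        "spark.executor.instances": str(instances),
        "spark.executor.memory": f"{memory_gb}g",
    }


conf = size_executors(nodes=6, cores_per_node=16, ram_gb_per_node=64)
print(conf)
# {'spark.executor.cores': '5', 'spark.executor.instances': '17', 'spark.executor.memory': '20g'}
```

The returned dictionary uses the same keys as the `.config(...)` calls in the `SparkSession` example, so each pair can be passed straight to the builder.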

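**NOTE:** If `spark.executor.cores` is not set, the default depends on the cluster manager: 1 on YARN, and all available cores on the worker in standalone mode. Relying on the default is simpler, but not always optimal.  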

---  