# Spark Performance Tuning & Best Practices

## Use DataFrame/Dataset over RDD

- For Spark jobs, prefer Dataset/DataFrame over RDD, because the Dataset and DataFrame APIs include several optimization modules that improve the performance of Spark workloads.
- In PySpark, use DataFrame over RDD, as Datasets are not supported in PySpark applications.

- RDD is still the building block of Spark programming: even when we use DataFrame/Dataset, Spark internally uses RDDs to execute operations and queries.

## Why is RDD slow?
- Using RDDs directly leads to performance issues because Spark cannot see inside the functions you pass to RDD operations, so it does not know how to apply its optimization techniques to them.

## Use coalesce() over repartition()
- When you want to reduce the number of partitions, prefer coalesce(): it merges existing partitions and avoids the full shuffle that repartition() always performs.

## Use the Parquet data format

## Avoid UDFs (User Defined Functions)

- Avoid Spark/PySpark UDFs wherever possible, and use one only when no built-in Spark function does the job. A UDF is a black box to Spark, so the optimizations applied to built-in functions do not apply to it.

## Persisting data
- With the persist() method, Spark provides an optimization mechanism that stores the intermediate computation of a DataFrame or RDD so it can be reused in subsequent actions.
- Suppose you have written a few transformations to be performed on an RDD.
- Each time you call an action on that RDD, Spark recomputes the RDD and all its dependencies.
- This can turn out to be quite expensive.

In [0]:
l1 = [1, 2, 3, 4]

rdd1 = sc.parallelize(l1)
rdd2 = rdd1.map(lambda x: x*x)
rdd3 = rdd2.map(lambda x: x+2)

# count() runs every transformation in the lineage; this took 0.1 s in the original run.
print(rdd3.count())

# collect() runs all the transformations again and still takes 0.1 s.
print(rdd3.collect())

- So how do we get out of this vicious cycle? Persist!

In [0]:
from pyspark import StorageLevel

# Persist to memory, spilling to disk when memory is short.
# (Calling rdd.persist() with no argument keeps the RDD in memory only.)
rdd3.persist(StorageLevel.MEMORY_AND_DISK)

# First action: the RDD is computed and persisted here.
print(rdd3.count())

# Later actions read the persisted data instead of recomputing it.
print(rdd3.collect())

- In the previous code, all we have to do is persist the final RDD.
- This way, the first time we call an action on the RDD, the computed data is stored in the cluster.
- Any subsequent action on the same RDD is then much faster, because the previous result has already been stored.

## Reduce expensive Shuffle operations

- A Spark shuffle is triggered by certain transformations such as groupByKey(), reduceByKey(), and join().
- A shuffle is an expensive operation because it involves the following:
  - Disk I/O
  - Data serialization and deserialization
  - Network I/O

- By default, a DataFrame operation that shuffles data (join(), aggregation functions) produces 200 partitions, regardless of how many partitions the input had.
- You can change this default with the spark.sql.shuffle.partitions setting.

In [0]:
spark.conf.set("spark.sql.shuffle.partitions",100)

- Spark 3.x adds a new feature called Adaptive Query Execution (AQE).
- When spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both set to true, Spark can adjust the number of shuffle partitions dynamically at runtime.

## Getting the right size of the shuffle partition
- Depending on your dataset size, number of cores, and memory, the shuffle partition count can help or hurt your jobs.
- When you are dealing with a small amount of data, you should typically reduce the number of shuffle partitions. Otherwise you end up with many partitions holding only a few records each, which means running many tasks with very little data to process.
- On the other hand, when you have a lot of data and too few partitions, you get fewer, longer-running tasks, and you may also hit out-of-memory errors.
- Getting the right size of the shuffle partition is always tricky and takes many runs with different values to achieve the optimized number.
- This is one of the key properties to look for when you have performance issues on Spark jobs.

## Don’t Collect Data
- When we call the collect() action, the entire result is returned to the driver node.
- If you are working with huge amounts of data, the driver node can easily run out of memory.
- A safer alternative is the take(n) action, which returns only the first n elements. It scans the first partition and reads further partitions only if it still needs more elements.

## Aggregate with Accumulators

- Suppose you want to aggregate some value across the dataset. A natural first attempt is an ordinary variable used as a counter.

In [0]:
file = sc.textFile("/FileStore/tables/log.txt")

# variable counter
warningCount = 0

def extractWarning(line):
    global warningCount
    if ("WARNING" in line):
        warningCount +=1

lines = file.flatMap(lambda x: x.split(","))
lines.foreach(extractWarning)

# output variable
warningCount

### But there is a caveat here.

- When we view the result on the driver node, we get 0.
- This is because the code is executed on the worker nodes, where each worker updates its own local copy of the variable.
- The updated value is never sent back to the driver node. To overcome this problem, we use accumulators.

- Accumulators are shared variables provided by Spark.
- They are meant for associative and commutative operations, such as counters and sums.
- For example, if you want to count the number of blank lines in a text file or determine the amount of corrupted data, accumulators can be very helpful.

In [0]:
file = sc.textFile("/FileStore/tables/log.txt")

# accumulator
warningCount = sc.accumulator(0)

def extractWarning(line):
    # Workers update the accumulator with add(); no global statement is needed.
    if "WARNING" in line:
        warningCount.add(1)

lines = file.flatMap(lambda x: x.split(","))
lines.foreach(extractWarning)

# accumulator value
warningCount.value  # output 4

- One thing to remember when working with accumulators is that worker nodes can only write to them.
- Only the driver node can read the value.

## Broadcast Large Variables

- Just like accumulators, Spark has another kind of shared variable called the broadcast variable.
- Broadcast variables are read-only and are cached on every worker node in the cluster.
- This comes in handy when you have to send a large look-up table to all nodes.

- Assume a file contains country codes (like IND for India) along with other kinds of information.
- You have to transform these codes into country names.
- This is where broadcast variables come in handy: they let us cache the look-up table on the worker nodes.

In [0]:
# lookup
country = {"IND":"India","USA":"United States of America","SA":"South Africa"}

# broadcast
broadcaster = sc.broadcast(country)

# data
userData = [("Johnny","USA"),("Faf","SA"),("Sachin","IND")]

# create rdd
rdd_data = sc.parallelize(userData)

# use broadcast variable
def convert(code):
    return broadcaster.value[code]

# transformation
output = rdd_data.map(lambda x: (x[0], convert(x[1])))

# action
output.collect()

## Be shrewd with Partitioning

- One of the cornerstones of Spark is its ability to process data in parallel.
- Spark splits data into several partitions, each containing some subset of the complete data.
- For example, if a DataFrame contains 10,000 rows and there are 10 partitions, then each partition holds about 1,000 rows when the data is evenly distributed.
- By default, the number of partitions depends on the number of cores in the cluster and is controlled by the driver node.
- Each task that Spark runs processes a single partition.

- However, this number is adjustable and should be tuned for better performance.
- With too few partitions, some of your resources sit idle.
- With too many partitions, you have a large number of small partitions shuffling data frequently, which can become highly inefficient.
- So what is the right number?
- A common guideline comes from Spark's default of 128 MB for spark.sql.files.maxPartitionBytes, the maximum number of bytes packed into a single partition when reading files.
- So, if we have 128000 MB of data, we should have about 1000 partitions. Treat this number as a starting point rather than a rigid rule.
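
The arithmetic above can be sketched as a small helper. The function name `estimate_partitions` and its parameters are hypothetical, introduced only for this illustration; the 128 MB target is the guideline from the text, not a hard limit.

```python
import math

TARGET_PARTITION_MB = 128  # guideline size per partition, taken from the text above


def estimate_partitions(total_mb, target_mb=TARGET_PARTITION_MB):
    """Return a rough partition count so each partition holds at most target_mb."""
    # Round up so leftover data still gets a partition, and never return 0.
    return max(1, math.ceil(total_mb / target_mb))


print(estimate_partitions(128000))  # 1000
print(estimate_partitions(500))     # 4
```

The result is only a first estimate; the best value still depends on the number of cores and on how the data is distributed.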

## Monitoring of Job Stages
- Most developers write and run their code without looking at how it executes, but monitoring job stages and tasks is essential. The best way to do this is to inspect the DAG of each job and look for ways to reduce the number of stages.

## Use Predicate Pushdown
- Predicate pushdown is a feature of Spark and Parquet that can improve query performance by reducing the amount of data read from Parquet files.
- Pushdown means that filters are applied at the data source, so rows that fail the filter are skipped there instead of being loaded into Spark and filtered afterwards.