# Spark Performance Tuning & Best Practices

## Use DataFrame/Dataset over RDD

- For Spark jobs, prefer Dataset/DataFrame over RDD: the Dataset and DataFrame APIs include several optimization modules that improve the performance of Spark workloads.
- In PySpark, use DataFrame over RDD, since Datasets are not supported in PySpark applications.

- The RDD is still the building block of Spark programming: even when we use DataFrame/Dataset, Spark internally uses RDDs to execute operations and queries.

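## Why is RDD slow?
- Using RDDs directly leads to performance issues because the functions passed to RDD operations are opaque to Spark, so it cannot apply the query optimizations (such as the Catalyst optimizer) that it applies to DataFrames and Datasets.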

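## Use coalesce() over repartition()
- When you want to reduce the number of partitions, prefer coalesce(): it merges existing partitions and avoids a full shuffle, whereas repartition() always shuffles all the data.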

## Use the Parquet data format

## Avoid UDFs (User Defined Functions)

- Avoid Spark/PySpark UDFs wherever possible, and use them only when no built-in Spark function does the job.

## Persisting data
- With the persist() method, Spark provides an optimization mechanism that stores the intermediate result of an RDD or DataFrame computation so it can be reused in subsequent actions.
- Suppose you have written a few transformations to be performed on an RDD.
- Each time you call an action on that RDD, Spark recomputes the RDD and all its dependencies.
- This can turn out to be quite expensive.

In [None]:
from pyspark import SparkContext

# Reuse the running SparkContext (predefined as `sc` in the PySpark shell), or create one
sc = SparkContext.getOrCreate()

l1 = [1, 2, 3, 4]

rdd1 = sc.parallelize(l1)
rdd2 = rdd1.map(lambda x: x*x)
rdd3 = rdd2.map(lambda x: x+2)

# count() is an action: all the transformations above are performed, and it takes 0.1 s to complete.
print(rdd3.count())

# collect() is another action: all the transformations are run again, and it still takes 0.1 s.
print(rdd3.collect())

- So how do we get out of this vicious cycle? Persist!

In [None]:
from pyspark import StorageLevel

# Keep the data in memory and spill to disk if it does not fit
# (calling rdd3.persist() with no argument would use MEMORY_ONLY)
rdd3.persist(StorageLevel.MEMORY_AND_DISK)

# First action after persist(): the RDD is computed and then stored
print(rdd3.count())

# Later actions reuse the stored data instead of recomputing it
print(rdd3.collect())

- In our previous code, all we have to do is persist the final RDD.
- This way, when we first call an action on the RDD, the final data generated is stored in the cluster.
- Any subsequent action on the same RDD is then much faster, because the previous result is already stored.

## Reduce expensive Shuffle operations

- Spark shuffles data when we perform certain transformations such as groupByKey(), reduceByKey(), and join().
- A shuffle is an expensive operation because it involves the following:
  - Disk I/O
  - Data serialization and deserialization
  - Network I/O

- By default, DataFrame operations that shuffle data (join(), aggregation functions) produce 200 partitions.
- You can change this default through the spark.sql.shuffle.partitions setting:

In [None]:
spark.conf.set("spark.sql.shuffle.partitions",100)

## Don’t Collect Data
- When we call the collect() action, the entire result is returned to the driver node.
- If you are working with huge amounts of data, the driver node can easily run out of memory.
- One way to avoid this is the take(n) action, which returns only the first n elements. It starts by scanning a single partition and reads further partitions only if it needs more elements.

## Broadcast Large Variables

- Besides accumulators, Spark offers another kind of shared variable: the broadcast variable.
- Broadcast variables are read-only and are cached on every worker node in the cluster.
- This comes in handy when you have to send a large look-up table to all nodes.

- Assume a file whose records contain a shorthand country code (like IND for India) along with other kinds of information.
- You have to transform these codes into country names.
- This is where broadcast variables come in handy: we can use them to cache the lookup table on the worker nodes.

In [None]:
# lookup
country = {"IND":"India","USA":"United States of America","SA":"South Africa"}

# broadcast
broadcaster = sc.broadcast(country)

# data
userData = [("Johnny","USA"),("Faf","SA"),("Sachin","IND")]

# create rdd
rdd_data = sc.parallelize(userData)

# use broadcast variable
def convert(code):
    return broadcaster.value[code]

# transformation
output = rdd_data.map(lambda x: (x[0], convert(x[1])))

# action
output.collect()

## Be shrewd with Partitioning

- One of the cornerstones of Spark is its ability to process data in parallel.
- Spark splits data into several partitions, each containing some subset of the complete data.
- For example, if a DataFrame contains 10,000 rows evenly spread over 10 partitions, each partition holds 1,000 rows.
- When Spark runs a task, it runs on a single partition in the cluster.

- Choose too few partitions, and you have resources sitting idle.
- Choose too many partitions, and you have a large number of small partitions shuffling data frequently, which can become highly inefficient.
- So what’s the right number?
- Spark's default maximum partition size when reading files is 128 MB (spark.sql.files.maxPartitionBytes), which makes 128 MB a common upper bound for a single partition.
- So, if we have 128,000 MB of data, we should have about 1,000 partitions. This number is a starting point rather than a rigid rule.
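
The arithmetic above can be sketched as a small helper. The function name and the rounding choice (round up, never below one partition) are assumptions made for this example, not a Spark API.

```python
import math

# Rule-of-thumb target size per partition, taken from the text above.
TARGET_PARTITION_MB = 128


def suggested_partitions(total_size_mb, target_mb=TARGET_PARTITION_MB):
    """Return a starting-point partition count for a dataset of the given size.

    Hypothetical helper: rounds up so no partition exceeds the target size,
    and always returns at least one partition.
    """
    return max(1, math.ceil(total_size_mb / target_mb))


print(suggested_partitions(128000))
print(suggested_partitions(129))
print(suggested_partitions(100))
```

For the 128,000 MB example this gives 1,000 partitions; a 129 MB dataset rounds up to 2, and anything at or below 128 MB fits in 1.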

## Monitoring of Job Stages
- Many developers write and run their code without looking further, but monitoring jobs and tasks is essential. The most effective way to do this is to inspect the job's DAG (for example, in the Spark UI) and look for ways to reduce the number of stages.

## Use Predicate Pushdown
- Predicate pushdown is a feature of Spark and Parquet that can improve query performance by reducing the amount of data read from Parquet files.
- "Pushdown" means that filters are applied at the data source, so rows that fail them are never loaded into Spark.