# Understanding partitions

[**Watch the video**](https://panoptotech.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=1860120d-7b89-47ac-958f-afa80149771a)

To allow every executor to perform work in parallel, Spark breaks up the data into chunks called
partitions. A partition is a collection of rows that sit on one physical machine in your cluster. A
DataFrame’s partitions represent how the data is physically distributed across the cluster of
machines during execution. If you have one partition, Spark will have a parallelism of only one,
even if you have thousands of executors. If you have many partitions but only one executor,
Spark will still have a parallelism of only one because there is only one computation resource. [SDG]
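
The two limiting cases above can be modelled with a few lines of plain Python. This is only a sketch: the helper names `effective_parallelism` and `task_waves` are hypothetical (they are not Spark functions), and "task slots" is assumed to mean the total number of cores available to the application.

```python
import math


def effective_parallelism(num_partitions: int, task_slots: int) -> int:
    """Tasks that can run at the same time: one per partition, capped by slots."""
    return min(num_partitions, task_slots)


def task_waves(num_partitions: int, task_slots: int) -> int:
    """Rounds of tasks needed to process every partition once."""
    return math.ceil(num_partitions / task_slots)


print(effective_parallelism(1, 1000))   # 1: one partition, many executors
print(effective_parallelism(200, 1))    # 1: many partitions, one slot
print(task_waves(200, 8))               # 25
```

With more partitions than slots, the tasks simply run in several rounds, which is why having more partitions than cores is not a problem in itself.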


Clusters will not be fully utilized unless the level of parallelism for each operation is high enough. A general recommendation for Spark is to have about 4 partitions for every core available to the application. As a limit on how far to split the data, each task should take at least 100 ms to execute: if tasks finish faster than that, the partitions are too small and the application may spend more time distributing tasks than doing work. [2]
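
As a rough illustration of the sizing rule above, the following plain-Python sketch turns a core count into a starting partition count. The helper name `suggested_partitions` is hypothetical (Spark has no such function), and the factor of 4 is simply the recommendation quoted in the text.

```python
def suggested_partitions(total_cores: int, factor: int = 4) -> int:
    """Return a starting partition count: `factor` partitions per core.

    Hypothetical helper for illustration; not part of Spark.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return total_cores * factor


# Example: an application that was granted 8 cores.
target = suggested_partitions(8)
print(target)  # 32
```

With a live session, a value like this could be passed to `df.repartition(target)`. Whether it actually helps still depends on how long each task takes.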

<hr>

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [None]:
df_1m=spark.range(1*1000*1000)


In [None]:
# When running locally (e.g. local[*]), the default number of partitions equals the number of cores on this machine.
df_1m.rdd.getNumPartitions()

In [None]:
df_4p=df_1m.repartition(4)
print(f"this df has {df_4p.rdd.getNumPartitions()} partitions")

In [None]:
# Does this statement affect the DataFrame's partitioning?
# No. It changes the number of shuffle partitions from the default of 200 to 5,
# but that value is used only when data is shuffled.
spark.conf.set("spark.sql.shuffle.partitions", 5)

spark.range(1*1000*1000).rdd.getNumPartitions()

## Partitioning types

### Hash Partitioning
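Uses Java’s Object.hashCode method to determine the partition: partition = key.hashCode() % numPartitions. Because % in Java can return a negative number, Spark adjusts the result so that it is always a valid, non-negative partition index.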
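
A minimal plain-Python sketch of the same idea follows. It is not Spark's implementation: CRC32 is used here only as a deterministic stand-in for the hash function, and the names `stable_hash` and `hash_partition` are hypothetical.

```python
import zlib


def stable_hash(key: str) -> int:
    """Deterministic stand-in for a hash function (CRC32 of the UTF-8 bytes)."""
    return zlib.crc32(key.encode("utf-8"))


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions)."""
    # Python's % already returns a non-negative result for a positive modulus.
    return stable_hash(key) % num_partitions


keys = ["USA", "Poland", "Germany", "USA", "Russia", "USA"]
assignment = [(key, hash_partition(key, 3)) for key in keys]
print(assignment)
```

Equal keys always land in the same partition, which is what makes hash partitioning useful for joins and aggregations by key, and also what causes skew when one key dominates.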

### Range Partitioning
Assigns each key to a partition according to the range of values it falls in, so that every partition holds one contiguous range of keys. This method is suitable when the keys have a natural ordering.
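
The sketch below models range partitioning in plain Python. The boundaries are assumed to be given up front, the rule that a key equal to a bound goes to the lower partition is an assumption of this example, and the name `range_partition` is hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def range_partition(key: int, upper_bounds: Sequence[int]) -> int:
    """Return the index of the first range whose upper bound is >= key.

    `upper_bounds` must be sorted; keys above the last bound go to the
    final partition, so there are len(upper_bounds) + 1 partitions.
    """
    return bisect_left(upper_bounds, key)


bounds = [100, 200]  # partitions: up to 100, 101 to 200, above 200
for key in [5, 100, 150, 999]:
    print(key, "->", range_partition(key, bounds))
```

Because neighbouring keys end up in the same partition, the partitions taken in order form a globally sorted dataset once each one is sorted internally.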

## Repartition and Coalesce

Another important optimization opportunity is to partition the data according to some frequently
filtered columns, which control the physical layout of data across the cluster including the
partitioning scheme and the number of partitions.

Repartition will **incur a full shuffle of the data**, regardless of whether one is necessary. This
means that you should typically only repartition when the future number of partitions is greater
than your current number of partitions or when you are looking to partition by a set of columns.

Another opportunity to partition data is when writing it to disk: with DataFrameWriter.partitionBy, rows are written to a separate folder for each distinct value of the partitioning columns, so later reads that filter on those columns can skip whole folders. See [DataFrameWriter.partitionBy](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrameWriter.partitionBy.html?highlight=partitionby#pyspark.sql.DataFrameWriter.partitionBy).
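
To make the on-disk layout concrete, here is a plain-Python model of the folder structure, using `column=value` directory names. It writes nothing to disk; the function `partition_by_layout` and the base path `out` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def partition_by_layout(rows, column, base_path="out"):
    """Group rows into folders named <base_path>/<column>=<value>."""
    folders = defaultdict(list)
    for row in rows:
        folders[f"{base_path}/{column}={row[column]}"].append(row)
    return dict(folders)


rows = [
    {"name": "Bob", "country": "USA"},
    {"name": "Marek", "country": "Poland"},
    {"name": "James", "country": "USA"},
]
layout = partition_by_layout(rows, "country")
for folder, members in sorted(layout.items()):
    print(folder, len(members))
```

With a real DataFrame the corresponding call would look like `df.write.partitionBy("country").parquet(path)`. A query that filters on `country` then only needs to read the matching folders.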

### Coalesce

Coalesce reduces the number of partitions in a DataFrame. It avoids a full shuffle: instead of creating new partitions and redistributing every row, it merges existing partitions into fewer ones. This means it can only decrease the number of partitions.
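
The following plain-Python sketch models why coalesce can only reduce the partition count: whole partitions are combined, and rows stay grouped with the partition they came from. It is a simplification, since Spark also takes data locality into account when choosing what to merge, and the function below is a hypothetical stand-in rather than Spark's algorithm.

```python
def coalesce(partitions, num_partitions):
    """Merge whole partitions into at most `num_partitions` groups.

    Simplified model: adjacent partitions are concatenated. Asking for
    more partitions than exist leaves the partitioning unchanged.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions >= len(partitions):
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        # Spread the source partitions evenly over the target partitions.
        merged[index * num_partitions // len(partitions)].extend(part)
    return merged


parts = [[1, 2], [3], [4, 5, 6], [7]]
print(coalesce(parts, 2))        # [[1, 2, 3], [4, 5, 6, 7]]
print(len(coalesce(parts, 10)))  # 4: the count cannot increase
```

Note that the merged partitions can be uneven in size, because no rows are rebalanced. When evenly sized partitions matter more than avoiding the shuffle, repartition is the better choice.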


In [None]:
# We can repartition a DF:
df_10p = df_4p.repartition(10)
df_10p.rdd.getNumPartitions()

### Unbalanced partition sizes

If one partition is much larger than the others, processing is slower because the whole stage waits for the task that handles it, and the executor that processes the large partition can also fail with an Out Of Memory error.

Read [Add SALT to Compensate](https://towardsdatascience.com/skewed-data-in-spark-add-salt-to-compensate-16d44404088b)
 for a discussion on one way to reduce this problem.
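
The salting idea from that article can be modelled in plain Python without a cluster. This is a sketch under stated assumptions: CRC32 modulo the partition count stands in for Spark's hash partitioning, the number of salts (8) is an arbitrary choice, and the helper names are hypothetical.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for hash partitioning (CRC32 modulo N)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Skewed input: 80% of the rows share one key.
keys = ["USA"] * 800 + ["Poland"] * 100 + ["Germany"] * 100

# Salting: append a rotating suffix so the hot key becomes several keys.
NUM_SALTS = 8  # assumed value, tune per workload
salted = [f"{key}_{i % NUM_SALTS}" for i, key in enumerate(keys)]

print("plain :", partition_sizes(keys, 4))
print("salted:", partition_sizes(salted, 4))
```

Without the salt, all 800 `USA` rows share one partition. With it, they are spread over several. In a real join the other table typically has to be expanded with every salt value so that the salted keys still match.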
 
 

In [None]:
%%time
# This example *should* demonstrate SKEWED partitions, i.e. partitions of unbalanced size.

# Source: https://luminousmen.com/post/spark-tips-partition-tuning
# However, when I ran this code, 'transactions' had 1 partition after repartition(),
# and 'df' had 1 partition.
# Before calling repartition() it had 8.
# Tested on Spark 3.2.0 standalone.
# A likely cause (an assumption, not verified here): adaptive query execution,
# enabled by default since Spark 3.2, coalesces small shuffle partitions.

import pandas as pd
import numpy as np
from pyspark.sql import functions as F

# set smaller number of partitions so they can fit the screen
spark.conf.set('spark.sql.shuffle.partitions', 3)
# disable broadcast join to see the shuffle
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
length = 1000000
names = np.random.choice(['Bob', 'James', 'Marek', 'Johannes', None], length)
amounts = np.random.randint(0, 1000000, length)

# generate skewed data
country = np.random.choice(
    ['United Kingdom', 'Poland', 'USA', 'Germany', 'Russia'],
    length,
    p = [0.05, 0.05, 0.8, 0.05, 0.05]
)
data = pd.DataFrame({'name': names, 'amount': amounts, 'country': country})

In [None]:
%%time
transactions = spark.createDataFrame(data).repartition('country') # <<<<<< here we repartition!

print(f"transactions has {transactions.rdd.getNumPartitions()} partitions")

countries = spark.createDataFrame(pd.DataFrame({
    'id': [11, 12, 13, 14, 15],
    'country': ['United Kingdom', 'Poland', 'USA', 'Germany', 'Russia']
}))

In [None]:
%%time
df = transactions.join(countries, 'country')

In [None]:
print(f"df has {df.rdd.getNumPartitions()} partitions")
# check the partitions data
# I use glom() ONLY for this demonstration. You don't really want to pull per-partition data into Python!
for i, part in enumerate(df.rdd.glom().collect()):
    print({i: part[0:50]})

## Check yourself
* What happens if there are fewer partitions than executors?
* What happens if there are more partitions than executors?
* Who is responsible for partitioning the data?
* What is the criterion for deciding which partition a row belongs to?
* Can an RDD be partitioned by columns only (vertical partitioning)?
* Is partitioning automatic? Can I influence it?

Answer [here](https://forms.gle/9eu69HhgWRCE8CCMA) and see your results