# Understanding partitions

To allow every executor to perform work in parallel, Spark breaks up the data into chunks called
partitions. A partition is a collection of rows that sit on one physical machine in your cluster. A
DataFrame’s partitions represent how the data is physically distributed across the cluster of
machines during execution. If you have one partition, Spark will have a parallelism of only one,
even if you have thousands of executors. If you have many partitions but only one executor,
Spark will still have a parallelism of only one because there is only one computation resource. [SDG]
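
The two cases above reduce to a rule of thumb: the number of tasks running at the same time is capped by both the partition count and the available cores. The sketch below is plain Python, not a Spark API. The helper names `effective_parallelism` and `task_waves` are made up for illustration, and it assumes one task per core (and, to match the text above, one core per executor).

```python
import math

def effective_parallelism(num_partitions: int, total_cores: int) -> int:
    """Upper bound on tasks that can run concurrently in one stage (illustrative helper)."""
    return min(num_partitions, total_cores)

def task_waves(num_partitions: int, total_cores: int) -> int:
    """Number of sequential rounds needed to process every partition once."""
    return math.ceil(num_partitions / effective_parallelism(num_partitions, total_cores))

print(effective_parallelism(1, 1000))  # 1: one partition, thousands of cores
print(effective_parallelism(200, 1))   # 1: many partitions, a single core
print(task_waves(200, 8))              # 25: 200 partitions processed 8 at a time
```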


Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. A general recommendation for Spark is to have about four times as many partitions as there are cores available to the application. As an upper bound on the partition count, each task should take at least 100 ms to execute. If tasks finish faster than that, the partitions are too small and the application may spend more time distributing tasks than doing useful work. [2]
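
As a rough illustration of this guideline, the following sketch turns the two numbers from the paragraph above (a factor of 4 and a 100 ms floor) into small helpers. The function names and default arguments are hypothetical and are not part of Spark. Treat the result as a starting point for tuning rather than a fixed rule.

```python
def suggested_partitions(total_cores: int, factor: int = 4) -> int:
    """Partition count suggested by the 'about 4x the cores' guideline."""
    return max(1, total_cores * factor)

def tasks_too_small(avg_task_ms: float, min_task_ms: float = 100.0) -> bool:
    """True when tasks are shorter than the ~100 ms floor, i.e. there are too many partitions."""
    return avg_task_ms < min_task_ms

print(suggested_partitions(8))  # 32 for an 8-core machine
print(tasks_too_small(20.0))    # True: 20 ms tasks are dominated by scheduling overhead
print(tasks_too_small(250.0))   # False
```

The suggested number could then be passed to `repartition()`, for example `df.repartition(suggested_partitions(8))`.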

<hr>

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/02/13 11:09:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/02/13 11:09:11 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
df_1m=spark.range(1*1000*1000)

In [3]:
# When running on a single machine (local or standalone mode), the default number of
# partitions equals the number of cores on that machine (8 here).
df_1m.rdd.getNumPartitions()

8

In [4]:
df_4p=df_1m.repartition(4)
print(f"this df has {df_4p.rdd.getNumPartitions()} partitions")

this df has 4 partitions


In [6]:
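# Does this setting affect the number of partitions of a newly created DataFrame? (No:
# spark.sql.shuffle.partitions only applies to operations that shuffle, such as joins.)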
spark.conf.set("spark.sql.shuffle.partitions", 5)

spark.range(1*1000*1000).rdd.getNumPartitions()

8

## Repartition and Coalesce

Another important optimization opportunity is to partition the data according to frequently
filtered columns. This controls the physical layout of data across the cluster, including the
partitioning scheme and the number of partitions.
Repartition will **incur a full shuffle of the data**, regardless of whether one is necessary. This
means that you should typically only repartition when the future number of partitions is greater
than your current number of partitions or when you are looking to partition by a set of columns.
Coalesce, on the other hand, does not incur a full shuffle and tries to combine existing partitions,
so prefer it when you only need to reduce the number of partitions.

Another opportunity to partition data is when writing it to disk with `DataFrameWriter.partitionBy`: rows are written to a separate folder for each distinct value of the partitioning columns, so queries that filter on those columns can skip the folders they do not need. See https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrameWriter.partitionBy.html?highlight=partitionby#pyspark.sql.DataFrameWriter.partitionBy
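
To make the on-disk layout concrete, here is a plain-Python sketch that mimics how `partitionBy` groups rows into `column=value` folders. It does not call Spark or write any files. The helper `partition_dirs` and the sample rows are invented for illustration.

```python
from collections import defaultdict

def partition_dirs(rows, partition_col):
    """Group rows under 'column=value' folder names, mimicking a partitionBy layout."""
    layout = defaultdict(list)
    for row in rows:
        # The partition value is encoded in the folder name, so it is dropped from the row.
        remaining = {k: v for k, v in row.items() if k != partition_col}
        layout[f"{partition_col}={row[partition_col]}"].append(remaining)
    return dict(layout)

rows = [
    {"name": "Bob", "amount": 10, "country": "USA"},
    {"name": "Marek", "amount": 20, "country": "Poland"},
    {"name": "James", "amount": 30, "country": "USA"},
]
layout = partition_dirs(rows, "country")
for directory in sorted(layout):
    print(directory, len(layout[directory]))
# country=Poland 1
# country=USA 2
```

A reader that filters on `country == 'Poland'` only needs to open the `country=Poland` folder.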

In [7]:
# We can repartition a DF:
df_10p = df_4p.repartition(10)
df_10p.rdd.getNumPartitions()

10

### Unbalanced partition sizes

In [12]:
%%time
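# This example *should* demonstrate skewed partitions, i.e. partitions of unbalanced size.

# Source: https://luminousmen.com/post/spark-tips-partition-tuning
# However, when I ran this code, 'transactions' had 1 partition after repartition().
# 'df' had 1 partition as well.
# Before the repartition() call the DataFrame had 8 partitions.
# Tested on Spark 3.2.0 standalone.
# A likely cause (not verified here): adaptive query execution, enabled by default since
# Spark 3.2.0, coalesces small shuffle partitions. Setting spark.sql.adaptive.enabled
# to "false" may restore the 3 shuffle partitions configured below.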

import pandas as pd
import numpy as np
from pyspark.sql import functions as F

# set smaller number of partitions so they can fit the screen
spark.conf.set('spark.sql.shuffle.partitions', 3)
# disable broadcast join to see the shuffle
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
length = 1000000
names = np.random.choice(['Bob', 'James', 'Marek', 'Johannes', None], length)
amounts = np.random.randint(0, 1000000, length)

# generate skewed data
country = np.random.choice(
    ['United Kingdom', 'Poland', 'USA', 'Germany', 'Russia'],
    length,
    p = [0.05, 0.05, 0.8, 0.05, 0.05]
)
data = pd.DataFrame({'name': names, 'amount': amounts, 'country': country})

CPU times: user 157 ms, sys: 56.3 ms, total: 213 ms
Wall time: 212 ms


In [None]:
%%time
transactions = spark.createDataFrame(data).repartition('country') # <<<<<< here we repartition!

print(f"transactions has {transactions.rdd.getNumPartitions()} partitions")

countries = spark.createDataFrame(pd.DataFrame({
    'id': [11, 12, 13, 14, 15],
    'country': ['United Kingdom', 'Poland', 'USA', 'Germany', 'Russia']
}))

In [None]:
%%time
df = transactions.join(countries, 'country')

In [None]:
print(f"df has {df.rdd.getNumPartitions()} partitions")
# check the partitions data
# I use glom() ONLY for this demonstration. You don't really want to pull per-partition data into Python!
for i, part in enumerate(df.rdd.glom().collect()):
    print({i: part[0:50]})
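
Because the cells above did not reproduce the skew on the tested setup, the effect can also be sketched without Spark. The snippet below assigns each row to a partition as `hash(key) % num_partitions`, which is the general idea behind hash partitioning by a column. It uses `zlib.crc32` as a stand-in hash (Spark uses its own hash function, so the actual partition ids will differ), and the row counts are fixed to match the 5% / 5% / 80% / 5% / 5% distribution used above.

```python
import zlib
from collections import Counter

def partition_of(key: str, num_partitions: int) -> int:
    """Stand-in for hash partitioning: deterministic hash of the key, modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# 1,000 rows following the same skewed distribution as the example above.
rows_per_country = {"United Kingdom": 50, "Poland": 50, "USA": 800, "Germany": 50, "Russia": 50}
num_partitions = 3

sizes = Counter()
for country, n_rows in rows_per_country.items():
    # Every row with the same key lands in the same partition.
    sizes[partition_of(country, num_partitions)] += n_rows

for p in range(num_partitions):
    print(f"partition {p}: {sizes[p]} rows")
```

Whatever the hash values turn out to be, all 800 `USA` rows end up in a single partition, so one task handles at least 80% of the data while the others are nearly idle.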

# Check yourself
* What happens if there are fewer partitions than executors?
* What happens if there are more partitions than executors?
* Who is responsible for partitioning the data?
* What is the criterion for deciding which partition a row belongs to?
* Can an RDD be partitioned by columns only (vertical partitioning)?
* Is partitioning automatic? Can I influence it?

Answer here [link here] and see your results