# Today's topic: Repartition (and partition recap)

# 0. Set-Ups

General hints for this notebook:
- The Spark UI is usually accessible at http://localhost:4040/ or http://localhost:4041/
- A deep dive into the Spark UI follows in later episodes
- sc.setJobDescription("Description") replaces the job description of an action in the Spark UI with your own
- sdf.rdd.getNumPartitions() returns the number of partitions of the current Spark DataFrame
- sdf.write.format("noop").mode("overwrite").save() is a good way to trigger an action and analyze the transformations without the side effects of an actual write
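
The set-and-reset pattern for job descriptions that is used throughout this notebook could be wrapped in a small helper. This is only a minimal sketch: `job_description` is a hypothetical helper name, not a Spark API, and it assumes that passing `None` to `setJobDescription` clears the custom description again. It only needs an object with a `setJobDescription` method, so a small stand-in class is used here to keep the example runnable without a Spark session.

```python
from contextlib import contextmanager


@contextmanager
def job_description(context, description):
    """Label all jobs triggered inside the `with` block (hypothetical helper)."""
    context.setJobDescription(description)
    try:
        yield
    finally:
        # Assumption: None removes the custom description again.
        context.setJobDescription(None)


class _RecordingContext:
    """Stand-in for a SparkContext so the sketch runs without a cluster."""

    def __init__(self):
        self.calls = []

    def setJobDescription(self, value):
        self.calls.append(value)


demo_context = _RecordingContext()
with job_description(demo_context, "Baseline 4 partitions"):
    pass  # e.g. sdf.write.format("noop").mode("overwrite").save()

print(demo_context.calls)  # ['Baseline 4 partitions', None]
```

With a real session, the stand-in would simply be replaced by the SparkContext of that session.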

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
import pyspark

In [2]:
spark = SparkSession \
    .builder \
    .appName("Data with Nikk the Greek Spark Session") \
    .master("local[4]") \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext

In [53]:
# Turn off AQE, as it generates more jobs, which might be confusing for this scenario.
spark.conf.set("spark.sql.adaptive.enabled", "false")
# Do not cache DataFrames (Databricks-specific setting), since caching may lead to non-repeatable results.
spark.conf.set("spark.databricks.io.cache.enabled", "false")

In [3]:
def sdf_generator(num_rows: int, num_partitions: int = None) -> "DataFrame":
    # Generate ids 0..num_rows-1 plus a few derived helper columns
    return (
        spark.range(num_rows, numPartitions=num_partitions)
        .withColumn("date", f.current_date())
        .withColumn("timestamp", f.current_timestamp())
        .withColumn("idstring", f.col("id").cast("string"))
        # First and last character of the id string (substr positions are 1-based)
        .withColumn("idfirst", f.col("idstring").substr(1, 1))
        .withColumn("idlast", f.col("idstring").substr(-1, 1))
    )

In [4]:
sdf_gen = sdf_generator(20)
sdf_gen.count()

20

In [5]:
sdf_gen.show()

+---+----------+--------------------+--------+-------+------+
| id|      date|           timestamp|idstring|idfirst|idlast|
+---+----------+--------------------+--------+-------+------+
|  0|2024-01-14|2024-01-14 09:40:...|       0|      0|     0|
|  1|2024-01-14|2024-01-14 09:40:...|       1|      1|     1|
|  2|2024-01-14|2024-01-14 09:40:...|       2|      2|     2|
|  3|2024-01-14|2024-01-14 09:40:...|       3|      3|     3|
|  4|2024-01-14|2024-01-14 09:40:...|       4|      4|     4|
|  5|2024-01-14|2024-01-14 09:40:...|       5|      5|     5|
|  6|2024-01-14|2024-01-14 09:40:...|       6|      6|     6|
|  7|2024-01-14|2024-01-14 09:40:...|       7|      7|     7|
|  8|2024-01-14|2024-01-14 09:40:...|       8|      8|     8|
|  9|2024-01-14|2024-01-14 09:40:...|       9|      9|     9|
| 10|2024-01-14|2024-01-14 09:40:...|      10|      1|     0|
| 11|2024-01-14|2024-01-14 09:40:...|      11|      1|     1|
| 12|2024-01-14|2024-01-14 09:40:...|      12|      1|     2|
| 13|202

In [24]:
def rows_per_partition(sdf: "DataFrame", num_rows: int) -> None:
    # Show the row count and the percentage share of each partition
    sdf_part = sdf.withColumn("partition_id", f.spark_partition_id())
    sdf_part_count = sdf_part.groupBy("partition_id").count()
    sdf_part_count = sdf_part_count.withColumn("count_perc", 100*f.col("count")/num_rows)
    sdf_part_count.orderBy("partition_id").show()

In [26]:
def rows_per_partition_col(sdf: "DataFrame", num_rows: int, col: str) -> None:
    # Show the row count and the percentage share per partition and value of `col`
    sdf_part = sdf.withColumn("partition_id", f.spark_partition_id())
    sdf_part_count = sdf_part.groupBy("partition_id", col).count()
    sdf_part_count = sdf_part_count.withColumn("count_perc", 100*f.col("count")/num_rows)
    sdf_part_count.orderBy("partition_id", col).show()


# 1. Recap partitioning

## Most important: you want good parallelization
- This means your number of partitions should always depend on the number of cores you have available (in Spark terms: spark.sparkContext.defaultParallelism). A common recommendation is a factor of 2-4 times the number of cores, but it really depends on memory and data size. Small datasets run perfectly well with a factor of 1.
- For good parallelization you also need a well-distributed dataset (ideally uniform, at worst normally distributed). Even in narrow transformations, data skew can make your whole execution depend on a single partition or task, as we saw before.
## Partition size
- If your partitions are really big (> 1 GB), you might run into OOM (out of memory) errors, garbage collection (GC) problems and other errors
- Recommendations on the internet range anywhere between 100 and 1000 MB. Spark, for example, sets spark.sql.files.maxPartitionBytes to 128 MB by default. It really depends on your machine and the available memory, of course. Definitely don't push the limits of the available memory.
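
The two rules of thumb so far (a multiple of the core count, and a cap on the partition size) can be combined into a rough estimate. The function below is only a sketch under stated assumptions: `suggest_num_partitions` is a hypothetical helper, the factor of 2 and the 128 MB limit are just the defaults mentioned above, and the data size has to be estimated by you.

```python
import math

MB = 1024 * 1024


def suggest_num_partitions(total_bytes, cores, factor=2, max_partition_bytes=128 * MB):
    """Rough heuristic: enough partitions to stay under the size limit,
    at least factor * cores, rounded up to a multiple of the core count."""
    by_size = math.ceil(total_bytes / max_partition_bytes)
    by_cores = factor * cores
    wanted = max(by_size, by_cores, 1)
    return math.ceil(wanted / cores) * cores


print(suggest_num_partitions(10 * 1024 * MB, cores=4))     # 10 GB on 4 cores -> 80
print(suggest_num_partitions(50 * MB, cores=4, factor=1))  # small data, factor 1 -> 4
```

Treat the result as a starting point for experiments, not as a fixed rule.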
## Distribution overhead
- As we saw in previous experiments, too high a number of partitions leads to a lot of scheduling and distribution overhead.
- A warning sign is if the actual execution time makes up less than 90 % of the total task time. Also, if your tasks take less than 100 ms, they are usually too short.
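
These two warning signs can be written down as a tiny check. This is a sketch only: `overhead_warnings` is a hypothetical helper, and it assumes you read the total task time and the actual execution time of a task manually from the Spark UI.

```python
def overhead_warnings(task_time_ms, run_time_ms, min_run_share=0.9, min_task_ms=100):
    """Return warning strings for one task, following the rules of thumb above."""
    warnings = []
    if task_time_ms < min_task_ms:
        warnings.append(f"task shorter than {min_task_ms} ms")
    if task_time_ms > 0 and run_time_ms / task_time_ms < min_run_share:
        warnings.append(f"execution time below {min_run_share:.0%} of task time")
    return warnings


print(overhead_warnings(task_time_ms=2000, run_time_ms=1950))  # []
print(overhead_warnings(task_time_ms=40, run_time_ms=20))      # both warnings
```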

See also here: https://stackoverflow.com/questions/64600212/how-to-determine-the-partition-size-in-an-apache-spark-dataframe

# 2. How repartitioning works
- Documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html#pyspark.sql.DataFrame.repartition
- Repartition can both increase and decrease the number of partitions
- Repartition requires shuffling data, which can be less efficient than coalesce
- On the other hand, it creates a uniform distribution (when repartitioning by number only), unlike coalesce, which only merges existing partitions
- Instead of partitioning only by a number of partitions, you can also partition by one or more columns
- If no number of partitions is given, the default depends on spark.sql.shuffle.partitions, which defaults to 200 (this becomes important when we evaluate wide transformations in later episodes)
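
To build some intuition for the difference between repartitioning by number and by column, the following pure-Python sketch models both. It is a simplified model, not Spark's implementation: rows are dealt out round-robin for the number-only case, and `zlib.crc32` stands in for Spark's own hash function in the column case, so the partition ids will not match what Spark produces.

```python
import zlib
from collections import Counter


def round_robin_counts(num_rows, num_partitions):
    """Rows per partition when row i is sent to partition i % num_partitions."""
    return Counter(i % num_partitions for i in range(num_rows))


def hash_counts(keys, num_partitions):
    """Rows per partition when each key is sent to hash(key) % num_partitions.
    zlib.crc32 is only a stand-in for Spark's hash function."""
    return Counter(zlib.crc32(key.encode()) % num_partitions for key in keys)


demo_rows = 20000
first_digits = [str(i)[0] for i in range(demo_rows)]

rr = round_robin_counts(demo_rows, 3)
hp = hash_counts(first_digits, 200)

print(sorted(rr.values(), reverse=True))  # [6667, 6667, 6666], near-uniform
print(max(hp.values()), len(hp))          # largest partition, number of non-empty partitions
```

With only ten distinct keys, at most ten of the 200 partitions can receive rows, and the partition that gets the key "1" holds at least 11111 of the 20000 rows. The number-only variant ignores the row content and therefore stays balanced.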

In [7]:
num_rows = 20000

In [8]:
sdf1 = sdf_generator(num_rows, 4)
sdf1.rdd.getNumPartitions()

4

In [9]:
row_count = sdf1.count()
print(row_count)

20000


In [25]:
rows_per_partition(sdf1, num_rows)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0| 5000|      25.0|
|           1| 5000|      25.0|
|           2| 5000|      25.0|
|           3| 5000|      25.0|
+------------+-----+----------+



In [27]:
rows_per_partition_col(sdf1, num_rows, "idfirst")

+------------+-------+-----+----------+
|partition_id|idfirst|count|count_perc|
+------------+-------+-----+----------+
|           0|      0|    1|     0.005|
|           0|      1| 1111|     5.555|
|           0|      2| 1111|     5.555|
|           0|      3| 1111|     5.555|
|           0|      4| 1111|     5.555|
|           0|      5|  111|     0.555|
|           0|      6|  111|     0.555|
|           0|      7|  111|     0.555|
|           0|      8|  111|     0.555|
|           0|      9|  111|     0.555|
|           1|      5| 1000|       5.0|
|           1|      6| 1000|       5.0|
|           1|      7| 1000|       5.0|
|           1|      8| 1000|       5.0|
|           1|      9| 1000|       5.0|
|           2|      1| 5000|      25.0|
|           3|      1| 5000|      25.0|
+------------+-------+-----+----------+



In [15]:
sc.setJobDescription("Baseline 4 partitions")
sdf1.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

In [56]:
sdf1.rdd.getNumPartitions()

4

In [31]:
sdf_3 = sdf1.repartition(3)
sdf_3.rdd.getNumPartitions()

3

In [32]:
rows_per_partition(sdf_3, num_rows)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0| 6667|    33.335|
|           1| 6667|    33.335|
|           2| 6666|     33.33|
+------------+-----+----------+



In [34]:
sdf_12 = sdf1.repartition(12)
sdf_12.rdd.getNumPartitions()

12

In [40]:
rows_per_partition(sdf_12, num_rows)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0| 1667|     8.335|
|           1| 1666|      8.33|
|           2| 1666|      8.33|
|           3| 1666|      8.33|
|           4| 1667|     8.335|
|           5| 1667|     8.335|
|           6| 1667|     8.335|
|           7| 1667|     8.335|
|           8| 1666|      8.33|
|           9| 1667|     8.335|
|          10| 1667|     8.335|
|          11| 1667|     8.335|
+------------+-----+----------+



In [37]:
spark.conf.set("spark.sql.shuffle.partitions", 200)
sdf_col_200 = sdf1.repartition("idfirst")
sdf_col_200.rdd.getNumPartitions()

200

In [38]:
rows_per_partition(sdf_col_200, num_rows)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           3| 1111|     5.555|
|          18| 1111|     5.555|
|          26| 1111|     5.555|
|          35|    1|     0.005|
|          49| 1111|     5.555|
|          75| 1111|     5.555|
|         139| 1111|     5.555|
|         144|11111|    55.555|
|         166| 1111|     5.555|
|         189| 1111|     5.555|
+------------+-----+----------+



In [41]:
rows_per_partition_col(sdf_col_200, num_rows, "idfirst")

+------------+-------+-----+----------+
|partition_id|idfirst|count|count_perc|
+------------+-------+-----+----------+
|           3|      7| 1111|     5.555|
|          18|      3| 1111|     5.555|
|          26|      8| 1111|     5.555|
|          35|      0|    1|     0.005|
|          49|      5| 1111|     5.555|
|          75|      6| 1111|     5.555|
|         139|      9| 1111|     5.555|
|         144|      1|11111|    55.555|
|         166|      4| 1111|     5.555|
|         189|      2| 1111|     5.555|
+------------+-------+-----+----------+



In [42]:
spark.conf.set("spark.sql.shuffle.partitions", 20)
sdf_col_20 = sdf1.repartition("idfirst")
sdf_col_20.rdd.getNumPartitions()

20

In [43]:
rows_per_partition(sdf_col_20, num_rows)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           3| 1111|     5.555|
|           4|11111|    55.555|
|           6| 2222|     11.11|
|           9| 2222|     11.11|
|          15| 1112|      5.56|
|          18| 1111|     5.555|
|          19| 1111|     5.555|
+------------+-----+----------+



In [44]:
rows_per_partition_col(sdf_col_20, num_rows, "idfirst")

+------------+-------+-----+----------+
|partition_id|idfirst|count|count_perc|
+------------+-------+-----+----------+
|           3|      7| 1111|     5.555|
|           4|      1|11111|    55.555|
|           6|      4| 1111|     5.555|
|           6|      8| 1111|     5.555|
|           9|      2| 1111|     5.555|
|           9|      5| 1111|     5.555|
|          15|      0|    1|     0.005|
|          15|      6| 1111|     5.555|
|          18|      3| 1111|     5.555|
|          19|      9| 1111|     5.555|
+------------+-------+-----+----------+



In [45]:
sdf_col_10 = sdf1.repartition(10, "idfirst")
sdf_col_10.rdd.getNumPartitions()

10

In [46]:
rows_per_partition(sdf_col_10, num_rows)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           3| 1111|     5.555|
|           4|11111|    55.555|
|           5| 1112|      5.56|
|           6| 2222|     11.11|
|           8| 1111|     5.555|
|           9| 3333|    16.665|
+------------+-----+----------+



In [47]:
rows_per_partition_col(sdf_col_10, num_rows, "idfirst")

+------------+-------+-----+----------+
|partition_id|idfirst|count|count_perc|
+------------+-------+-----+----------+
|           3|      7| 1111|     5.555|
|           4|      1|11111|    55.555|
|           5|      0|    1|     0.005|
|           5|      6| 1111|     5.555|
|           6|      4| 1111|     5.555|
|           6|      8| 1111|     5.555|
|           8|      3| 1111|     5.555|
|           9|      2| 1111|     5.555|
|           9|      5| 1111|     5.555|
|           9|      9| 1111|     5.555|
+------------+-------+-----+----------+



In [49]:
sdf_col_5 = sdf1.repartition(5, "idfirst")
sdf_col_5.rdd.getNumPartitions()

5

In [50]:
rows_per_partition(sdf_col_5, num_rows)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0| 1112|      5.56|
|           1| 2222|     11.11|
|           3| 2222|     11.11|
|           4|14444|     72.22|
+------------+-----+----------+



In [57]:
rows_per_partition_col(sdf_col_5, num_rows, "idfirst")

+------------+-------+-----+----------+
|partition_id|idfirst|count|count_perc|
+------------+-------+-----+----------+
|           0|      0|    1|     0.005|
|           0|      6| 1111|     5.555|
|           1|      4| 1111|     5.555|
|           1|      8| 1111|     5.555|
|           3|      3| 1111|     5.555|
|           3|      7| 1111|     5.555|
|           4|      1|11111|    55.555|
|           4|      2| 1111|     5.555|
|           4|      5| 1111|     5.555|
|           4|      9| 1111|     5.555|
+------------+-------+-----+----------+



In [59]:
sc.setJobDescription("Repartition from 4 to 3")
sdf_3.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

In [60]:
sc.setJobDescription("Repartition from 4 to 12")
sdf_12.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

In [61]:
sc.setJobDescription("Repartition from 4 to 5 with col")
sdf_col_5.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

# 3. When to use repartitioning (and partly coalesce)
- You want to rebalance your data across a different number of partitions so that it matches the number of cores, as explained before. E.g. if you initially have 7 partitions but 4 cores, you would increase to 8. (today)
- You have data skew, which leads to one partition/task running longer than the others. It can also affect joins and other wide transformations. Repartitioning can resolve this. (today)
- Join operations can benefit from repartitioning the data beforehand, since it reduces the shuffling of data during the join. (will learn this in a later episode)
- Any operation that shuffles based on a column key, e.g. a join as explained, but also an orderBy, can benefit from it. (today)
- Large filter operations can leave a lot of empty partitions. E.g. you have 10 million rows and 1000 partitions, and after a filter only 10 rows are left. The many empty partitions are then pure overhead for any following operation. (today)
- Optimize or influence your writes. You will learn later that writes depend on the number of partitions. A high number of partitions can speed up your writes. File creation, e.g. with parquet, also depends on the number of partitions just before the write. (will learn this in a later episode)
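
Two of the cases above can be illustrated with plain Python. This is a sketch with hypothetical helper names. The filter model assumes equally sized, contiguous partitions and randomly chosen surviving rows, which is a simplification of what a real filter does.

```python
import math
import random


def next_multiple_of_cores(num_partitions, cores):
    """Smallest multiple of `cores` that is >= num_partitions."""
    return math.ceil(num_partitions / cores) * cores


def non_empty_partitions_after_filter(num_rows, num_partitions, surviving_rows, seed=0):
    """Count partitions that still hold rows after a very selective filter."""
    rows_per_partition = num_rows // num_partitions
    rng = random.Random(seed)
    survivors = rng.sample(range(num_rows), surviving_rows)
    return len({row // rows_per_partition for row in survivors})


print(next_multiple_of_cores(7, 4))  # 8

kept = non_empty_partitions_after_filter(10_000_000, 1000, 10)
print(f"{kept} of 1000 partitions still contain rows")
```

Whatever the seed, at most 10 of the 1000 partitions can be non-empty, so at least 990 tasks would be scheduled for nothing.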

See also here: https://medium.com/@zaiderikat/apache-spark-repartitioning-101-f2b37e7d8301

# 4. Filter and Sorting
- orderBy
- count

# 5. Performance comparison with Coalesce

In [17]:
sdf3 = sdf_generator(num_rows, 3)
print(sdf3.rdd.getNumPartitions())
sc.setJobDescription("3 Partitions")
sdf3.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

3


In [18]:
sdf4 = sdf_generator(num_rows, 8)
print(sdf4.rdd.getNumPartitions())
sc.setJobDescription("8 Partitions")
sdf4.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

8


In [19]:
sdf5 = sdf4.coalesce(4)
print(sdf5.rdd.getNumPartitions())
sc.setJobDescription("Coalesce from 8 to 4")
sdf5.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

4
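
Why coalesce can be cheaper but less balanced than repartition can be sketched with a simplified model. The functions below are hypothetical and only work on partition sizes: the coalesce model merges neighbouring partitions without moving rows between groups, while the repartition model spreads all rows evenly, which in Spark requires a shuffle. Spark's real coalesce logic considers more factors, so treat this purely as an illustration.

```python
def coalesce_sizes(partition_sizes, num_partitions):
    """Merge neighbouring partitions into `num_partitions` contiguous groups."""
    n = len(partition_sizes)
    if num_partitions >= n:
        return list(partition_sizes)  # coalesce does not increase the partition count
    merged = []
    for i in range(num_partitions):
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        merged.append(sum(partition_sizes[start:end]))
    return merged


def repartition_sizes(partition_sizes, num_partitions):
    """Round-robin model of repartition: rows are spread evenly after a shuffle."""
    base, extra = divmod(sum(partition_sizes), num_partitions)
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]


print(coalesce_sizes([2500] * 8, 4))     # [5000, 5000, 5000, 5000]
print(coalesce_sizes([4000] * 5, 2))     # [8000, 12000]
print(repartition_sizes([4000] * 5, 2))  # [10000, 10000]
```

Going from 8 to 4 partitions merges cleanly, but when the old partition count is not a multiple of the new one, the merged partitions end up uneven.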


In [34]:
sdf6 = sdf_generator(num_rows, 200001)
print(sdf6.rdd.getNumPartitions())
sc.setJobDescription("200001 partitions")
sdf6.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

200001


In [21]:
sdf7 = sdf6.coalesce(4)
print(sdf7.rdd.getNumPartitions())
sc.setJobDescription("Coalesce from 200001 to 4")
sdf7.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

4


In [22]:
sdf8 = sdf_generator(num_rows, 40)
print(sdf8.rdd.getNumPartitions())
sc.setJobDescription("40 partitions")
sdf8.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

40


# 6. References
- Interesting discussion on Stack Overflow about coalesce vs. repartition speed: https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
- When to use repartition: https://medium.com/@zaiderikat/apache-spark-repartitioning-101-f2b37e7d8301
- Factors to consider for no. of partitions: https://stackoverflow.com/questions/64600212/how-to-determine-the-partition-size-in-an-apache-spark-dataframe