# Today's topic: Spark Partitioning in action with Repartition and Coalesce

# 0. Setup

General hints for this notebook:
- Spark UI is usually accessible at http://localhost:4040/ or http://localhost:4041/
- A deep dive into the Spark UI follows in later episodes
- sc.setJobDescription("Description") replaces the job description of an action in the Spark UI with your own
- sdf.rdd.getNumPartitions() returns the number of partitions of the current Spark DataFrame
- sdf.write.format("noop").mode("overwrite").save() is a good way to trigger an action and analyze the transformations before it without the side effects of an actual write
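
Most cells below repeat the same three steps: set a job description, trigger a noop write, reset the description. As a sketch, these steps could be wrapped in two small helpers. The names `job_description` and `run_noop` are hypothetical and not part of the original notebook, and the helper resets the description to the string "None" only because the cells below do the same.

```python
from contextlib import contextmanager


@contextmanager
def job_description(sc, description: str):
    """Label all Spark jobs triggered inside the with-block (hypothetical helper)."""
    sc.setJobDescription(description)
    try:
        yield
    finally:
        # Reset to the same placeholder string the notebook cells use.
        sc.setJobDescription("None")


def run_noop(sdf) -> None:
    """Execute all transformations of sdf without writing any data."""
    sdf.write.format("noop").mode("overwrite").save()
```

With these helpers, a measurement cell would shrink to `with job_description(sc, "Base line 12"): run_noop(sdf)`.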

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
import pyspark

In [2]:
spark = SparkSession \
    .builder \
    .appName("Data with Nikk the Greek Spark Session") \
    .master("local[4]") \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
# Turn off AQE, as it generates additional jobs that might be confusing in this scenario.
spark.conf.set("spark.sql.adaptive.enabled", "false")
# Disable the Databricks IO cache, as cached data may make results non-repeatable.
spark.conf.set("spark.databricks.io.cache.enabled", "false")

In [4]:
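def sdf_generator(num_rows: int, num_partitions: int = None) -> "DataFrame":
    # Generate num_rows rows with ids 0..num_rows-1 spread over num_partitions partitions
    # (Spark falls back to the default parallelism if num_partitions is None).
    return (
        spark.range(num_rows, numPartitions=num_partitions)
        .withColumn("date", f.current_date())
        .withColumn("timestamp", f.current_timestamp())
        .withColumn("idstring", f.col("id").cast("string"))
        # substr positions are 1-based: first and last character of the id string
        .withColumn("idfirst", f.col("idstring").substr(1, 1))
        .withColumn("idlast", f.col("idstring").substr(-1, 1))
    )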

In [5]:
sdf_gen = sdf_generator(20)
sdf_gen.count()

20

In [6]:
sdf_gen.show()

+---+----------+--------------------+--------+-------+------+
| id|      date|           timestamp|idstring|idfirst|idlast|
+---+----------+--------------------+--------+-------+------+
|  0|2024-01-16|2024-01-16 21:08:...|       0|      0|     0|
|  1|2024-01-16|2024-01-16 21:08:...|       1|      1|     1|
|  2|2024-01-16|2024-01-16 21:08:...|       2|      2|     2|
|  3|2024-01-16|2024-01-16 21:08:...|       3|      3|     3|
|  4|2024-01-16|2024-01-16 21:08:...|       4|      4|     4|
|  5|2024-01-16|2024-01-16 21:08:...|       5|      5|     5|
|  6|2024-01-16|2024-01-16 21:08:...|       6|      6|     6|
|  7|2024-01-16|2024-01-16 21:08:...|       7|      7|     7|
|  8|2024-01-16|2024-01-16 21:08:...|       8|      8|     8|
|  9|2024-01-16|2024-01-16 21:08:...|       9|      9|     9|
| 10|2024-01-16|2024-01-16 21:08:...|      10|      1|     0|
| 11|2024-01-16|2024-01-16 21:08:...|      11|      1|     1|
| 12|2024-01-16|2024-01-16 21:08:...|      12|      1|     2|
... (remaining output truncated)

In [7]:
def rows_per_partition(sdf: "DataFrame") -> None:
    num_rows = sdf.count()
    sdf_part = sdf.withColumn("partition_id", f.spark_partition_id())
    sdf_part_count = sdf_part.groupBy("partition_id").count()
    sdf_part_count = sdf_part_count.withColumn("count_perc", 100*f.col("count")/num_rows)
    sdf_part_count.orderBy("partition_id").show()

In [8]:
def rows_per_partition_col(sdf: "DataFrame", num_rows: int, col: str) -> None:
    sdf_part = sdf.withColumn("partition_id", f.spark_partition_id())
    sdf_part_count = sdf_part.groupBy("partition_id", col).count()
    sdf_part_count = sdf_part_count.withColumn("count_perc", 100*f.col("count")/num_rows)
    sdf_part_count.orderBy("partition_id", col).show()


# 1. Recap partitioning

## The most important thing: you want good parallelization
- This means your number of partitions should always depend on the number of cores you have available, in Spark terms: spark.sparkContext.defaultParallelism. Recommendations are 2-4 partitions per core, but it really depends on memory and data size. Small datasets run perfectly well with a factor of 1.
- For good parallelization you also need a well-distributed dataset (ideally uniform, at worst normally distributed). Even in narrow transformations, data skew can make your whole execution depend on a single partition or task, as we saw before.
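
The rule of thumb above is plain arithmetic. The following sketch uses a hypothetical helper `target_partitions` (not part of the notebook) and assumes 4 cores, which is what the local[4] session of this notebook should report as defaultParallelism.

```python
def target_partitions(num_cores: int, factor: int = 2) -> int:
    """Rule-of-thumb partition count: a small multiple of the available cores."""
    if num_cores < 1 or factor < 1:
        raise ValueError("num_cores and factor must be positive")
    return num_cores * factor


# Assumption: 4 cores, as configured by .master("local[4]") in this notebook.
for factor in (1, 2, 3, 4):
    print(factor, target_partitions(4, factor))
```

A factor of 3 gives 12, the target number of partitions used in most experiments below.
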
## Partition size
- If your partitions are really big (> 1 GB), you may run into OOM (out of memory) errors, garbage collection (GC) pressure, and other problems.
- Recommendations on the internet range anywhere between 100 and 1000 MB. Spark, for example, sets its max partition bytes parameter (spark.sql.files.maxPartitionBytes) to 128 MB by default. It really depends on your machine and the available memory, of course. Definitely don't push the limits of the available memory.
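
To turn a target partition size into a partition count, a rough estimate is enough. The sketch below uses a hypothetical helper `partitions_for_size`. The 128 MiB default mirrors the Spark parameter mentioned above, and the total size is something you would have to estimate yourself.

```python
import math


def partitions_for_size(total_bytes: int, target_partition_bytes: int = 128 * 1024**2) -> int:
    """Estimate how many partitions keep each partition near the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Example: 10 GiB of data at 128 MiB per partition.
print(partitions_for_size(10 * 1024**3))  # 80
```
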
## Distribution overhead
- As we saw in previous experiments, too high a number of partitions leads to a lot of scheduling and distribution overhead.
- A warning sign is when the actual execution time makes up less than 90 % of the total task time. Tasks that run for less than 100 ms are usually too short as well.
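
The two warning signs above can be written down as a small check. This is only a sketch: `task_overhead_warnings` is a hypothetical helper, and the two timings are values you would read off the task metrics in the Spark UI yourself.

```python
from typing import List


def task_overhead_warnings(run_time_ms: float, total_task_time_ms: float) -> List[str]:
    """Apply the two rules of thumb: execution share below 90 % and tasks below 100 ms."""
    if total_task_time_ms <= 0:
        raise ValueError("total_task_time_ms must be positive")
    warnings = []
    share = run_time_ms / total_task_time_ms
    if share < 0.9:
        warnings.append(f"only {share:.0%} of the task time is actual execution")
    if total_task_time_ms < 100:
        warnings.append("task is shorter than 100 ms")
    return warnings


print(task_overhead_warnings(950, 1000))  # []
print(task_overhead_warnings(40, 80))
```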

See also here: https://stackoverflow.com/questions/64600212/how-to-determine-the-partition-size-in-an-apache-spark-dataframe

# 2. How repartitioning works
- Documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html#pyspark.sql.DataFrame.repartition
- Repartition can both increase and decrease the number of partitions
- Repartition requires shuffling the data, which can be less efficient than coalesce
- On the other hand, it creates a uniform distribution, unlike coalesce, which only unions partitions together
- Instead of partitioning by a number of partitions only, you can also partition by one or more columns
- If you partition by columns without defining a number of partitions, the number of partitions is taken from spark.sql.shuffle.partitions, which defaults to 200 (important when evaluating wide transformations in later episodes)
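
Why repartition produces a uniform distribution can be illustrated with a pure-Python model. This is a simplified sketch, not Spark's implementation: it assumes rows are dealt out round-robin to the target partitions and ignores the actual shuffle. `round_robin_repartition` is a hypothetical name.

```python
from typing import List


def round_robin_repartition(partitions: List[List[int]], num_partitions: int) -> List[List[int]]:
    """Deal all rows out to num_partitions target partitions, one row at a time."""
    result: List[List[int]] = [[] for _ in range(num_partitions)]
    position = 0
    for partition in partitions:
        for row in partition:
            result[position % num_partitions].append(row)
            position += 1
    return result


# Three skewed input partitions with 90, 5 and 5 rows.
skewed = [list(range(0, 90)), list(range(90, 95)), list(range(95, 100))]
balanced = round_robin_repartition(skewed, 4)
print([len(p) for p in balanced])  # [25, 25, 25, 25]
```

Every row moves to a new partition regardless of where it started, which is why the result is balanced but also why a full shuffle is needed.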

# 3. How coalesce works
- Documentation: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.coalesce.html
- Narrow transformation
- Can only reduce, not increase, the number of partitions. It does not raise an error; it simply ignores a value higher than the number of partitions initially available
- Coalesce can skew the data across partitions, which leads to lower performance and some tasks running much longer. The reason: it just unions partitions together.
- Coalesce can help to efficiently reduce a high number of small partitions and thereby improve performance. Remember that too high a number of partitions leads to a lot of scheduling overhead.
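
The skew caused by coalesce can be seen in a small model. The sketch below assumes that contiguous input partitions are merged into evenly sized groups. Spark's actual grouping also takes data locality into account, so the real assignment may differ. `coalesce_model` is a hypothetical name.

```python
from typing import List


def coalesce_model(sizes: List[int], num_partitions: int) -> List[int]:
    """Merge contiguous input partitions into groups and return the merged sizes."""
    n = len(sizes)
    if num_partitions >= n:
        # A higher target is ignored: coalesce never increases the partition count.
        return list(sizes)
    merged = []
    for i in range(num_partitions):
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        merged.append(sum(sizes[start:end]))
    return merged


# 15 equally sized partitions coalesced to 12: three groups hold two inputs each.
print(coalesce_model([10] * 15, 12))
# [10, 10, 10, 20, 10, 10, 10, 20, 10, 10, 10, 20]
```

No row is moved between the merged groups, so the operation is cheap, but three partitions end up twice as large as the rest.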

# 4. When to use repartitioning and coalesce
- You want to rebalance your data across a different number of partitions so that it fits the number of cores, as explained before, e.g. initially you have 7 partitions but 4 cores, so you would increase to 8 (today)
- You have data skew, which leads to one partition/task running longer than the others. It can also affect joins and other wide transformations. Repartitioning will handle this. (today)
- Join operations can benefit from repartitioning the data beforehand, because it reduces the shuffling of data during the join. The same holds for other shuffle operations based on a column key, like order by (will learn this in a later episode)
- Filter operations can become more efficient. (today)
- Bigger filter operations can lead to a lot of empty partitions. E.g. you have 10 million rows and 1000 partitions, and after a filter only 10 rows are left. The many empty partitions are then pure overhead for any following operation, e.g. a count. (today)
- Exploding structured fields in a DataFrame can increase the partition size. (later)
- Optimize or influence your writes. You will learn later that writes depend on the number of partitions. A high number of partitions can make your writes suboptimal. The file creation, e.g. with parquet, also depends on the number of partitions just before the write. (will learn this in a later episode)

See also here: https://medium.com/@zaiderikat/apache-spark-repartitioning-101-f2b37e7d8301

# 5. Reducing the number of partitions

In [9]:
num_rows = 200000000

## 5.1. Scenario 1

12 is our target number of partitions, but we have 13 partitions as input
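
Why a single extra partition can matter is easiest to see by counting task "waves". The sketch below is a simplified model that assumes 4 cores (local[4]) and tasks of equal duration that run in lockstep. `task_waves` is a hypothetical helper.

```python
import math
from typing import Tuple


def task_waves(num_partitions: int, num_cores: int) -> Tuple[int, int]:
    """Return (number of waves, number of tasks running in the last wave)."""
    waves = math.ceil(num_partitions / num_cores)
    last_wave = num_partitions - (waves - 1) * num_cores
    return waves, last_wave


for n in (12, 13):
    print(n, task_waves(n, 4))
# 12 (3, 4)
# 13 (4, 1)
```

With 13 partitions a fourth wave is needed in which only one of the four cores is busy.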

In [10]:
sdf = sdf_generator(num_rows, 12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [11]:
sdf = sdf_generator(num_rows, 13)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 13")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

13


In [12]:
sdf = sdf_generator(num_rows, 13).coalesce(12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Coalesce 13 to 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [13]:
sdf = sdf_generator(num_rows, 13).repartition(12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition 13 to 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


## 5.2. Scenario 2

12 is our target number of partitions, but we have 20001 partitions as input

In [14]:
sdf = sdf_generator(num_rows, 20001)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 20001")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

20001


In [15]:
sdf = sdf_generator(num_rows, 20001).coalesce(12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Coalesce 20001 to 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [16]:
sdf = sdf_generator(num_rows, 20001).repartition(12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition 20001 to 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


## 5.3. Scenario 3

40 is our target number of partitions, but we have 90 partitions as input

In [17]:
sdf = sdf_generator(num_rows, 40)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 40")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

40


In [18]:
sdf = sdf_generator(num_rows, 90)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 90")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

90


In [19]:
sdf = sdf_generator(num_rows, 90).coalesce(40)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Coalesce 90 to 40")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

40


In [20]:
sdf = sdf_generator(num_rows, 90).repartition(40)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition 90 to 40")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

40


# 6. Increasing the number of partitions

In [21]:
sdf = sdf_generator(num_rows, 1)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 1")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

1


In [22]:
sdf = sdf_generator(num_rows, 10)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 10")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

10


In [23]:
sdf = sdf_generator(num_rows, 12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [24]:
sdf = sdf_generator(num_rows, 1).repartition(12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition 1 to 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [25]:
sdf = sdf_generator(num_rows, 10).repartition(12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition 10 to 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


# 7. Data Skew

Skew can be generated with a filter or with coalesce. Here we use coalesce.
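
To put a number on skew, one simple option is the ratio of the largest partition to the mean partition size. This is a sketch with a hypothetical helper `skew_factor`. The row counts are made-up illustration values, not measurements from this notebook.

```python
from typing import List


def skew_factor(partition_sizes: List[int]) -> float:
    """Largest partition relative to the mean partition size (1.0 means perfectly even)."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean


even = [100] * 12
skewed = [100] * 9 + [200] * 3  # hypothetical: three partitions hold twice the rows
print(skew_factor(even))    # 1.0
print(skew_factor(skewed))  # 1.6
```

Since a stage only finishes when its slowest task finishes, a factor of 1.6 suggests the largest task takes roughly 1.6 times as long as the average one.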

In [26]:
sdf = sdf_generator(num_rows, 12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [27]:
sdf = sdf_generator(num_rows, 15)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 15")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

15


In [28]:
sdf = sdf_generator(num_rows, 15)
sdf = sdf.coalesce(12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line skewed 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [29]:
sdf = sdf_generator(num_rows, 15)
sdf = sdf.coalesce(12)
sdf = sdf.coalesce(8)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Coalesce for Skew 8")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

8


In [30]:
sdf = sdf_generator(num_rows, 15)
sdf = sdf.coalesce(12)
sdf = sdf.repartition(12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition for Skew 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [31]:
sdf = sdf_generator(num_rows, 15)
sdf = sdf.coalesce(12)
sdf = sdf.repartition(8)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition for Skew 8")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

8


# 8. Filter operations become more efficient

In [32]:
sdf = sdf_generator(num_rows, 12)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 12")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [33]:
sdf = sdf_generator(num_rows, 12)
sdf = sdf.filter(f.col("id") < 1000)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 12 with filter id")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [34]:
sdf = sdf_generator(num_rows, 12)
sdf = sdf.repartition(12, "id")
sdf = sdf.filter(f.col("id") < 1000)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition filter 12 id")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [35]:
sdf = sdf_generator(num_rows, 12)
sdf = sdf.filter(f.col("idfirst") == "1")
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 12 with filter idfirst")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [36]:
sdf = sdf_generator(num_rows, 12)
sdf = sdf.repartition(12, "idfirst")
sdf = sdf.filter(f.col("idfirst") == "1")
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition filter 12 idfirst")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [52]:
sdf = sdf_generator(num_rows, 12)
sdf = sdf.filter(f.col("idlast") == "1")
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 12 with filter idlast")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12


In [53]:
sdf = sdf_generator(num_rows, 12)
sdf = sdf.repartition(12, "idlast")
sdf = sdf.filter(f.col("idlast") == "1")
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Repartition filter 12 idlast")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

12
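
The cells above repartition by a column before filtering on that same column. Repartitioning by a column sends all rows with the same value to the same partition, so an equality filter afterwards leaves rows in a single partition only. The following pure-Python model sketches the idea. `hash_partition` is a hypothetical name, and `zlib.crc32` is only a stable stand-in, since Spark uses its own hash function.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def hash_partition(keys: List[str], num_partitions: int) -> Dict[int, List[str]]:
    """Assign each key to a partition by hashing it (stand-in hash, not Spark's)."""
    partitions: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        pid = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[pid].append(key)
    return dict(partitions)


# The first digit of the ids 0..999, like the idfirst column.
idfirst = [str(i)[0] for i in range(1000)]
parts = hash_partition(idfirst, 12)
holding_1 = [pid for pid, rows in parts.items() if "1" in rows]
print(len(parts), len(holding_1))
```

Because idfirst has only 10 distinct values, at most 10 of the 12 partitions can contain data, and the rows with value "1" all sit in exactly one of them.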


# 9. Bigger filter operations and empty partitions

In [37]:
sdf = sdf_generator(num_rows, 20)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 20")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

20


In [38]:
rows_per_partition(sdf)

+------------+--------+----------+
|partition_id|   count|count_perc|
+------------+--------+----------+
|           0|10000000|       5.0|
|           1|10000000|       5.0|
|           2|10000000|       5.0|
|           3|10000000|       5.0|
|           4|10000000|       5.0|
|           5|10000000|       5.0|
|           6|10000000|       5.0|
|           7|10000000|       5.0|
|           8|10000000|       5.0|
|           9|10000000|       5.0|
|          10|10000000|       5.0|
|          11|10000000|       5.0|
|          12|10000000|       5.0|
|          13|10000000|       5.0|
|          14|10000000|       5.0|
|          15|10000000|       5.0|
|          16|10000000|       5.0|
|          17|10000000|       5.0|
|          18|10000000|       5.0|
|          19|10000000|       5.0|
+------------+--------+----------+



In [39]:
sdf = sdf_generator(num_rows, 4)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 4")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

4


In [40]:
rows_per_partition(sdf)

+------------+--------+----------+
|partition_id|   count|count_perc|
+------------+--------+----------+
|           0|50000000|      25.0|
|           1|50000000|      25.0|
|           2|50000000|      25.0|
|           3|50000000|      25.0|
+------------+--------+----------+



In [41]:
sdf = sdf_generator(num_rows, 20)
sdf = sdf.filter(f.col("id") < 200)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line 200 filter")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

20


In [42]:
rows_per_partition(sdf)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0|  200|     100.0|
+------------+-----+----------+
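
The output above lists only partition 0 because rows_per_partition groups by partition id, so partitions without rows do not show up. The DataFrame still has 20 partitions, 19 of them empty. The sketch below models this. It assumes that spark.range assigns contiguous, evenly sized id ranges to the partitions, which is consistent with the 10,000,000 rows per partition shown earlier. `range_partition_counts` is a hypothetical name.

```python
from typing import List


def range_partition_counts(num_rows: int, num_partitions: int, max_id_exclusive: int) -> List[int]:
    """Count the rows with id < max_id_exclusive in each contiguous id range."""
    counts = []
    for i in range(num_partitions):
        start = i * num_rows // num_partitions
        end = (i + 1) * num_rows // num_partitions
        counts.append(max(0, min(end, max_id_exclusive) - start))
    return counts


counts = range_partition_counts(200_000_000, 20, 200)
print(counts[0], counts.count(0))  # 200 19
```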



In [43]:
sdf = sdf_generator(num_rows, 20)
sdf = sdf.filter(f.col("id") < 200)
sdf = sdf.coalesce(4)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("4 coalesce")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

4


In [44]:
rows_per_partition(sdf)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0|  200|     100.0|
+------------+-----+----------+



In [45]:
sdf = sdf_generator(num_rows, 20)
sdf = sdf.filter(f.col("id") < 200)
sdf = sdf.repartition(4)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("4 repartition")
sdf.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

4


In [46]:
rows_per_partition(sdf)

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0|   50|      25.0|
|           1|   50|      25.0|
|           2|   50|      25.0|
|           3|   50|      25.0|
+------------+-----+----------+



# 10. Bigger filtering and Count

In [47]:
sdf = sdf_generator(num_rows, 20)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line Count 20")
print(sdf.count())
sc.setJobDescription("None")

20
200000000


In [48]:
sdf = sdf_generator(num_rows, 4)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line Count 4")
print(sdf.count())
sc.setJobDescription("None")

4
200000000


In [49]:
sdf = sdf_generator(num_rows, 20)
sdf = sdf.filter(f.col("id") < 200)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("Base line count 20 filter")
print(sdf.count())
sc.setJobDescription("None")

20
200


In [50]:
sdf = sdf_generator(num_rows, 20)
sdf = sdf.filter(f.col("id") < 200)
sdf = sdf.coalesce(4)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("4 coalesce count")
print(sdf.count())
sc.setJobDescription("None")

4
200


In [51]:
sdf = sdf_generator(num_rows, 20)
sdf = sdf.filter(f.col("id") < 200)
sdf = sdf.repartition(4)
print(sdf.rdd.getNumPartitions())
sc.setJobDescription("4 repartition count")
print(sdf.count())
sc.setJobDescription("None")

4
200


# 11. Final comments
- We have to balance the performance of the current process against the resulting data distribution. The distribution is what following processes benefit from, e.g. queries on the saved data or operations like joins, sorts etc.
- Even though we deep dive here, some optimizations only make sense when you hit significant performance bottlenecks. E.g. is it worth the effort if you gain 10 sec of execution time? But if a job on a small amount of data runs 3 h daily and you can reduce it to 15 min, then it is worth improving.
- Some things you should always check with a quick look at the Spark UI:
    - Is a lot of driver memory consumed, meaning driver execution like collect()? (later)
    - Are all cores used?
    - Do you have spill to disk? (later)