# Today's topic: Coalesce

# 0. Setup

General hints for this notebook:
- The Spark UI is usually accessible at http://localhost:4040/ or http://localhost:4041/
- A deep dive into the Spark UI follows in later episodes
- sc.setJobDescription("Description") replaces the job description of an action in the Spark UI with your own
- sdf.rdd.getNumPartitions() returns the number of partitions of the current Spark DataFrame
- sdf.write.format("noop").mode("overwrite").save() triggers an action, so transformations can be analyzed without the side effects of an actual write
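
The set/reset pattern around sc.setJobDescription is repeated in many cells below, so it could be wrapped in a small context manager. This is only a sketch: job_description is a hypothetical helper, not part of PySpark, and it assumes that passing None clears the description again.

```python
from contextlib import contextmanager


@contextmanager
def job_description(sc, description):
    """Label every Spark job triggered inside the with-block (hypothetical helper)."""
    sc.setJobDescription(description)
    try:
        yield
    finally:
        # Assumption: passing None removes the custom description again.
        sc.setJobDescription(None)


# Usage inside this notebook (needs the SparkContext `sc` created below):
# with job_description(sc, "Baseline 4 partitions"):
#     sdf1.write.format("noop").mode("overwrite").save()
```

Because the reset sits in a finally block, the label is also removed when the action inside the with-block fails.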

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
import pyspark

In [2]:
spark = SparkSession \
    .builder \
    .appName("Data with Nikk the Greek Spark Session") \
    .master("local[4]") \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
def sdf_generator(num_rows: int, num_partitions: int = None) -> "DataFrame":
    # Builds a test DataFrame with num_rows rows; num_partitions=None uses Spark's default parallelism
    return (
        spark.range(num_rows, numPartitions=num_partitions)
        .withColumn("date", f.current_date())
        .withColumn("timestamp", f.current_timestamp())
        .withColumn("idstring", f.col("id").cast("string"))
        # substr positions are 1-based in Spark
        .withColumn("idfirst", f.col("idstring").substr(1, 1))
        .withColumn("idlast", f.col("idstring").substr(-1, 1))
    )

In [4]:
sdf_gen = sdf_generator(20)
sdf_gen.count()

20

In [5]:
sdf_gen.show()

+---+----------+--------------------+--------+-------+------+
| id|      date|           timestamp|idstring|idfirst|idlast|
+---+----------+--------------------+--------+-------+------+
|  0|2024-01-13|2024-01-13 13:13:...|       0|      0|     0|
|  1|2024-01-13|2024-01-13 13:13:...|       1|      1|     1|
|  2|2024-01-13|2024-01-13 13:13:...|       2|      2|     2|
|  3|2024-01-13|2024-01-13 13:13:...|       3|      3|     3|
|  4|2024-01-13|2024-01-13 13:13:...|       4|      4|     4|
|  5|2024-01-13|2024-01-13 13:13:...|       5|      5|     5|
|  6|2024-01-13|2024-01-13 13:13:...|       6|      6|     6|
|  7|2024-01-13|2024-01-13 13:13:...|       7|      7|     7|
|  8|2024-01-13|2024-01-13 13:13:...|       8|      8|     8|
|  9|2024-01-13|2024-01-13 13:13:...|       9|      9|     9|
| 10|2024-01-13|2024-01-13 13:13:...|      10|      1|     0|
| 11|2024-01-13|2024-01-13 13:13:...|      11|      1|     1|
| 12|2024-01-13|2024-01-13 13:13:...|      12|      1|     2|
| 13|202

# 1. How coalesce works 
- Documentation: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.coalesce.html
- Narrow transformation
- Can only reduce, not increase, the number of partitions. It does not raise an error; it silently ignores a value higher than the number of partitions currently available
- Coalesce can skew the data across partitions, which lowers performance because some tasks run much longer than others
- Coalesce can efficiently reduce a high number of small partitions and thereby improve performance
- Too small a number of partitions (i.e., bigger partitions) can result in OOM errors or other issues. A factor of 2-4 times the number of cores is recommended, but it really depends on the memory available.
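
As a quick illustration of the 2-4x rule of thumb from the last bullet, the arithmetic can be written down as a tiny helper. target_partition_range and its parameter names are hypothetical, and the result is only a starting point, since the right number also depends on the memory available.

```python
from typing import Tuple


def target_partition_range(num_cores: int,
                           low_factor: int = 2,
                           high_factor: int = 4) -> Tuple[int, int]:
    """Return the (min, max) partition count suggested by the 2-4x cores rule of thumb."""
    if num_cores < 1:
        raise ValueError("num_cores must be a positive integer")
    return num_cores * low_factor, num_cores * high_factor


# This notebook runs with master("local[4]"), i.e. 4 cores.
print(target_partition_range(4))  # (8, 16)
```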

In [23]:
num_rows = 2000000000

In [7]:
sdf1 = sdf_generator(num_rows, 4)
sdf1.rdd.getNumPartitions()

4

In [8]:
row_count1 = sdf1.count()
print(row_count1)

2000000000


In [9]:
sdf_part1 = sdf1.withColumn("partition_id", f.spark_partition_id())
sdf_part_count1 = sdf_part1.groupBy("partition_id").count()
sdf_part_count1 = sdf_part_count1.withColumn("count_perc", 100*f.col("count")/row_count1)
sdf_part_count1.show()

+------------+---------+----------+
|partition_id|    count|count_perc|
+------------+---------+----------+
|           0|500000000|      25.0|
|           1|500000000|      25.0|
|           2|500000000|      25.0|
|           3|500000000|      25.0|
+------------+---------+----------+



In [10]:
sc.setJobDescription("Baseline 4 partitions")
sdf1.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

In [28]:
sdf1.coalesce(2).rdd.getNumPartitions()

2

In [30]:
sdf1.coalesce(12).rdd.getNumPartitions()

4
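
The two results above follow a simple rule: the partition count after coalesce is the smaller of the current and the requested value. A minimal sketch of that rule in plain Python follows; coalesced_partition_count is a hypothetical helper for illustration, not a Spark API.

```python
def coalesced_partition_count(current: int, requested: int) -> int:
    """Partition count after coalesce: it can shrink but never grow."""
    if requested < 1:
        raise ValueError("requested must be a positive integer")
    return min(current, requested)


print(coalesced_partition_count(4, 2))   # 2
print(coalesced_partition_count(4, 12))  # 4
```

To actually increase the number of partitions, sdf1.repartition(12) would be the tool, at the cost of a full shuffle.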

In [31]:
sdf2 = sdf1.coalesce(3)

In [32]:
row_count2 = sdf2.count()
print(row_count2)

2000000000


In [33]:
sdf_part2 = sdf2.withColumn("partition_id", f.spark_partition_id())
sdf_part_count2 = sdf_part2.groupBy("partition_id").count()
sdf_part_count2 = sdf_part_count2.withColumn("count_perc", 100*f.col("count")/row_count2)
sdf_part_count2.show()

+------------+----------+----------+
|partition_id|     count|count_perc|
+------------+----------+----------+
|           0| 500000000|      25.0|
|           1| 500000000|      25.0|
|           2|1000000000|      50.0|
+------------+----------+----------+
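
The 25/25/50 split above can be reproduced with a simplified model in plain Python. The model assumes that coalesce merges neighbouring parent partitions using integer range boundaries when no data-locality preferences are involved. That is an assumption about Spark internals, not documented behaviour, and the function names are hypothetical.

```python
from typing import List


def coalesce_groups(num_parent: int, num_target: int) -> List[List[int]]:
    """Model which parent partitions end up in each coalesced partition."""
    if num_target < 1:
        raise ValueError("num_target must be a positive integer")
    if num_target >= num_parent:
        # Coalesce cannot increase the partition count: nothing is merged.
        return [[i] for i in range(num_parent)]
    groups = []
    for i in range(num_target):
        start = i * num_parent // num_target
        end = (i + 1) * num_parent // num_target
        groups.append(list(range(start, end)))
    return groups


def skew_ratio(rows_per_partition: List[int]) -> float:
    """Largest partition relative to the mean partition size (1.0 = balanced)."""
    return max(rows_per_partition) * len(rows_per_partition) / sum(rows_per_partition)


parent_rows = [500_000_000] * 4                       # the 4 even partitions of sdf1
groups = coalesce_groups(4, 3)
merged_rows = [sum(parent_rows[j] for j in g) for g in groups]

print(groups)                   # [[0], [1], [2, 3]]
print(merged_rows)              # [500000000, 500000000, 1000000000]
print(skew_ratio(merged_rows))  # 1.5
```

A skew ratio of 1.5 means the slowest task processes 50% more rows than the average task. In the same model, going from 8 to 4 partitions merges exactly two parents per group and stays balanced.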



In [16]:
sc.setJobDescription("Coalesce from 4 to 3")
sdf2.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

In [17]:
sdf3 = sdf_generator(num_rows, 3)
print(sdf3.rdd.getNumPartitions())
sc.setJobDescription("3 Partitions")
sdf3.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

3


In [18]:
sdf4 = sdf_generator(num_rows, 8)
print(sdf4.rdd.getNumPartitions())
sc.setJobDescription("8 Partitions")
sdf4.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

8


In [19]:
sdf5 = sdf4.coalesce(4)
print(sdf5.rdd.getNumPartitions())
sc.setJobDescription("Coalesce from 8 to 4")
sdf5.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

4


In [34]:
sdf6 = sdf_generator(num_rows, 200001)
print(sdf6.rdd.getNumPartitions())
sc.setJobDescription("200001 partitions")
sdf6.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

200001


In [21]:
sdf7 = sdf6.coalesce(4)
print(sdf7.rdd.getNumPartitions())
sc.setJobDescription("Coalesce from 200001 to 4")
sdf7.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

4


In [22]:
sdf8 = sdf_generator(num_rows, 40)
print(sdf8.rdd.getNumPartitions())
sc.setJobDescription("40 partitions")
sdf8.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

40
