# Today's topic: Spark partitions

# 0. Setup

General hints for this notebook:
- The Spark UI is usually accessible at http://localhost:4040/ or http://localhost:4041/
- A deep dive into the Spark UI follows in later episodes
- sc.setJobDescription("Description") replaces the job description of an action in the Spark UI with your own
- sdf.rdd.getNumPartitions() returns the number of partitions of the current Spark DataFrame
- sdf.write.format("noop").mode("overwrite").save() is a good way to trigger an action and analyze the transformations without the side effects of an actual write
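
The last three hints are used together in almost every experiment of this notebook, so they can be wrapped once. The following is only a minimal sketch: the helper names `job_description` and `run_noop` are hypothetical and not part of PySpark, and the reset value mirrors this notebook's convention of setting the description back to the string "None".

```python
from contextlib import contextmanager


@contextmanager
def job_description(sc, description, reset_to="None"):
    """Label all Spark jobs triggered inside the with-block in the Spark UI."""
    sc.setJobDescription(description)
    try:
        yield
    finally:
        # Assumption: reset to the string "None", as the cells in this notebook do.
        sc.setJobDescription(reset_to)


def run_noop(sc, sdf, description):
    """Run the full plan of `sdf` through the noop sink under a custom job description."""
    with job_description(sc, description):
        sdf.write.format("noop").mode("overwrite").save()
```

With the session from this notebook, a call would look like `run_noop(sc, sdf_gen1, "Gen1_Exp1")`. The `finally` block resets the description even if the write fails.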

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
import pyspark

In [2]:
spark = SparkSession \
    .builder \
    .appName("Data with Nikk the Greek Spark Session") \
    .master("local[4]") \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
def sdf_generator1(num_iter: int = 1) -> "DataFrame":
    d = [
        {"a":"a", "b": 1},
        {"a":"b", "b": 2},
        {"a":"c", "b": 3},
        {"a":"d", "b": 4},
        {"a":"e", "b": 5},
        {"a":"e", "b": 6},
        {"a":"f", "b": 7},
        {"a":"g", "b": 8},
        {"a":"h", "b": 9},
        {"a":"i", "b": 10},
    ]

    data = []
    for i in range(0, num_iter):
        data.extend(d)
    ddl_schema = "a string, b int"
    return spark.createDataFrame(data, schema=ddl_schema)


In [4]:
sdf_gen1 = sdf_generator1(2)
sdf_gen1.count()


20

In [5]:
sdf_gen1.show()

+---+---+
|  a|  b|
+---+---+
|  a|  1|
|  b|  2|
|  c|  3|
|  d|  4|
|  e|  5|
|  e|  6|
|  f|  7|
|  g|  8|
|  h|  9|
|  i| 10|
|  a|  1|
|  b|  2|
|  c|  3|
|  d|  4|
|  e|  5|
|  e|  6|
|  f|  7|
|  g|  8|
|  h|  9|
|  i| 10|
+---+---+



In [3]:
def sdf_generator2(num_rows: int, num_partitions: int = None) -> "DataFrame":
    return (
        spark.range(num_rows, numPartitions=num_partitions)
        .withColumn("today", f.current_date())
        .withColumn("timestamp",f.current_timestamp())
        .withColumn("idstring", f.col("id").cast("string"))
        .withColumn("idfirst", f.col("idstring").substr(0,1))
        .withColumn("idlast", f.col("idstring").substr(-1,1))
        )

In [7]:
sdf_gen2 = sdf_generator2(20)
sdf_gen2.count()

20

In [8]:
sdf_gen2.show()

+---+----------+--------------------+--------+-------+------+
| id|     today|           timestamp|idstring|idfirst|idlast|
+---+----------+--------------------+--------+-------+------+
|  0|2024-01-13|2024-01-13 10:43:...|       0|      0|     0|
|  1|2024-01-13|2024-01-13 10:43:...|       1|      1|     1|
|  2|2024-01-13|2024-01-13 10:43:...|       2|      2|     2|
|  3|2024-01-13|2024-01-13 10:43:...|       3|      3|     3|
|  4|2024-01-13|2024-01-13 10:43:...|       4|      4|     4|
|  5|2024-01-13|2024-01-13 10:43:...|       5|      5|     5|
|  6|2024-01-13|2024-01-13 10:43:...|       6|      6|     6|
|  7|2024-01-13|2024-01-13 10:43:...|       7|      7|     7|
|  8|2024-01-13|2024-01-13 10:43:...|       8|      8|     8|
|  9|2024-01-13|2024-01-13 10:43:...|       9|      9|     9|
| 10|2024-01-13|2024-01-13 10:43:...|      10|      1|     0|
| 11|2024-01-13|2024-01-13 10:43:...|      11|      1|     1|
| 12|2024-01-13|2024-01-13 10:43:...|      12|      1|     2|
| 13|202

# 1. Partition Size based on Cores and Data Amount with spark.createDataFrame
- In the Spark UI under Executors you can see the number of cores available to your cluster. In our case it's 4, as configured in the Spark session above
- Spark splits datasets created in memory with spark.createDataFrame into equally distributed partitions based on the number of available cores. That's why we have 4 partitions. You can check the value of spark.sparkContext.defaultParallelism
- If we reduce the number of cores, the number of partitions is reduced accordingly
- Looking at the row distribution below, and also in the Spark UI, we can confirm a uniform distribution
- Narrow transformations run most efficiently when the partitions are uniform and the number of partitions is divisible by the number of cores, so that no core sits idle
- The size of the data does not have any influence on the number of partitions. If it is too large, it just leads to OOM or out-of-disk errors and/or a long processing time. This is important to
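
The point about divisibility can be made concrete with a little arithmetic. The sketch below is a simplified model and not a Spark API: it assumes one task per partition, one task per core at a time, and tasks of equal duration. The function name `core_utilisation` is hypothetical.

```python
import math


def core_utilisation(num_partitions: int, num_cores: int) -> dict:
    """Estimate scheduling waves and idle core slots for one stage.

    Assumption: all tasks take equally long, so the stage runs in
    ceil(partitions / cores) waves and the last wave may leave cores idle.
    """
    waves = math.ceil(num_partitions / num_cores)
    total_slots = waves * num_cores
    idle_slots = total_slots - num_partitions
    return {
        "waves": waves,
        "idle_slots": idle_slots,
        "utilisation": num_partitions / total_slots,
    }


# Compare a few partition counts on the 4 cores configured in this notebook.
for parts in (3, 4, 6, 8):
    print(parts, core_utilisation(parts, num_cores=4))
```

Under these assumptions, 4 and 8 partitions keep all 4 cores busy, while 3 and 6 partitions leave core slots unused in the last wave.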

In [5]:
spark.sparkContext.defaultParallelism

3

In [10]:
sdf_gen1_1 = sdf_generator1(2)
sdf_gen1_1.rdd.getNumPartitions()

4

In [11]:
sdf_part1_1 = sdf_gen1_1.withColumn("partition_id", f.spark_partition_id())
sdf_part1_1.show()

+---+---+------------+
|  a|  b|partition_id|
+---+---+------------+
|  a|  1|           0|
|  b|  2|           0|
|  c|  3|           0|
|  d|  4|           0|
|  e|  5|           0|
|  e|  6|           1|
|  f|  7|           1|
|  g|  8|           1|
|  h|  9|           1|
|  i| 10|           1|
|  a|  1|           2|
|  b|  2|           2|
|  c|  3|           2|
|  d|  4|           2|
|  e|  5|           2|
|  e|  6|           3|
|  f|  7|           3|
|  g|  8|           3|
|  h|  9|           3|
|  i| 10|           3|
+---+---+------------+



In [12]:
row_count = sdf_gen1_1.count()
print(row_count)

20


In [13]:
sdf_part_count1_1 = sdf_part1_1.groupBy("partition_id").count()
sdf_part_count1_1 = sdf_part_count1_1.withColumn("count_perc", 100*f.col("count")/row_count)
sdf_part_count1_1.show()

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0|    5|      25.0|
|           1|    5|      25.0|
|           2|    5|      25.0|
|           3|    5|      25.0|
+------------+-----+----------+



In [14]:
sc.setJobDescription("Gen1_Exp1")
sdf_gen1_1.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

In [15]:
sdf_gen1_2 = sdf_generator1(2000)
sdf_gen1_2.rdd.getNumPartitions()

4

In [16]:
sdf_part1_2 = sdf_gen1_2.withColumn("partition_id", f.spark_partition_id())
sdf_part1_2.show()

+---+---+------------+
|  a|  b|partition_id|
+---+---+------------+
|  a|  1|           0|
|  b|  2|           0|
|  c|  3|           0|
|  d|  4|           0|
|  e|  5|           0|
|  e|  6|           0|
|  f|  7|           0|
|  g|  8|           0|
|  h|  9|           0|
|  i| 10|           0|
|  a|  1|           0|
|  b|  2|           0|
|  c|  3|           0|
|  d|  4|           0|
|  e|  5|           0|
|  e|  6|           0|
|  f|  7|           0|
|  g|  8|           0|
|  h|  9|           0|
|  i| 10|           0|
+---+---+------------+
only showing top 20 rows



In [17]:
row_count = sdf_gen1_2.count()
print(row_count)

20000


In [18]:
sdf_part_count1_2 = sdf_part1_2.groupBy("partition_id").count()
sdf_part_count1_2 = sdf_part_count1_2.withColumn("count_perc", 100*f.col("count")/row_count)
sdf_part_count1_2.show()

+------------+-----+----------+
|partition_id|count|count_perc|
+------------+-----+----------+
|           0| 5120|      25.6|
|           1| 5120|      25.6|
|           2| 5120|      25.6|
|           3| 4640|      23.2|
+------------+-----+----------+
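
To judge how uniform such a distribution is without reading the table by eye, the counts can be reduced to a single number. This is only a sketch in plain Python: the helper names and the max-to-mean ratio are my own choice for illustration, not something the notebook or Spark prescribes. The counts are copied from the output above.

```python
def partition_share(counts):
    """Return the percentage of all rows held by each partition."""
    total = sum(counts)
    return [100 * c / total for c in counts]


def skew_ratio(counts):
    """Largest partition divided by the mean partition size (1.0 means perfectly uniform)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


counts = [5120, 5120, 5120, 4640]  # per-partition counts from the table above
print(partition_share(counts))
print(skew_ratio(counts))
```

A ratio close to 1.0 means the largest partition is barely bigger than the average, which is the case here.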



In [19]:
sc.setJobDescription("Gen1_Exp2")
sdf_gen1_2.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

# 2. Partition Size based on Cores and Data Amount with spark.range
- The same results as for spark.createDataFrame also apply here, even though spark.range is a Spark function that generates the data itself

In [20]:
sdf_gen2_1 = sdf_generator2(2000000)
sdf_gen2_1.rdd.getNumPartitions()

4

In [21]:
sdf_part2_1 = sdf_gen2_1.withColumn("partition_id", f.spark_partition_id())
sdf_part2_1.show()

+---+----------+--------------------+--------+-------+------+------------+
| id|     today|           timestamp|idstring|idfirst|idlast|partition_id|
+---+----------+--------------------+--------+-------+------+------------+
|  0|2024-01-13|2024-01-13 10:51:...|       0|      0|     0|           0|
|  1|2024-01-13|2024-01-13 10:51:...|       1|      1|     1|           0|
|  2|2024-01-13|2024-01-13 10:51:...|       2|      2|     2|           0|
|  3|2024-01-13|2024-01-13 10:51:...|       3|      3|     3|           0|
|  4|2024-01-13|2024-01-13 10:51:...|       4|      4|     4|           0|
|  5|2024-01-13|2024-01-13 10:51:...|       5|      5|     5|           0|
|  6|2024-01-13|2024-01-13 10:51:...|       6|      6|     6|           0|
|  7|2024-01-13|2024-01-13 10:51:...|       7|      7|     7|           0|
|  8|2024-01-13|2024-01-13 10:51:...|       8|      8|     8|           0|
|  9|2024-01-13|2024-01-13 10:51:...|       9|      9|     9|           0|
| 10|2024-01-13|2024-01-1

In [22]:
row_count = sdf_gen2_1.count()
print(row_count)

2000000


In [23]:
sdf_part_count2_1 = sdf_part2_1.groupBy("partition_id").count()
sdf_part_count2_1 = sdf_part_count2_1.withColumn("count_perc", 100*f.col("count")/row_count)
sdf_part_count2_1.show()

+------------+------+----------+
|partition_id| count|count_perc|
+------------+------+----------+
|           0|500000|      25.0|
|           1|500000|      25.0|
|           2|500000|      25.0|
|           3|500000|      25.0|
+------------+------+----------+
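
The even 25 % split is what one would expect if the id range is cut into contiguous slices of nearly equal length. The sketch below models that in plain Python. It is an assumption about the slicing rule, not a copy of Spark's implementation, and `range_slices` is a hypothetical helper name.

```python
def range_slices(num_rows: int, num_partitions: int):
    """Split [0, num_rows) into contiguous (start, end) slices of nearly equal length.

    Assumption: slice i starts at floor(i * num_rows / num_partitions).
    """
    bounds = [i * num_rows // num_partitions for i in range(num_partitions + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


slices = range_slices(2_000_000, 4)
for partition_id, (start, end) in enumerate(slices):
    print(partition_id, start, end, end - start)
```

For 2,000,000 rows and 4 partitions this model yields four slices of 500,000 ids each, which matches the counts in the table above.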



In [24]:
sc.setJobDescription("Gen2_Exp1")
sdf_gen2_1.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

In [4]:
sdf_gen2_2 = sdf_generator2(2000000000000000000)
sdf_gen2_2.rdd.getNumPartitions()

3

# 3. Influence of Spark partitions on performance
- In the Stage details of each job we can see that we make the best use of our available capacity when the number of partitions is divisible by the number of cores
- We can also see that when we increase the number of partitions for relatively small datasets, the GC (garbage collection) time spent cleaning up unused objects and the scheduler time increase significantly and make the process much slower
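
Before looking at the Spark UI, it helps to see how small the individual tasks become in the experiments of this section. The sketch below only does the arithmetic for 20,000,000 rows on 4 cores with the partition counts used in this section. It assumes one task per partition, and `task_profile` is a hypothetical helper, not a Spark function.

```python
def task_profile(num_rows: int, num_partitions: int, num_cores: int):
    """Return (rows per task, tasks per core), assuming one task per partition."""
    rows_per_task = num_rows / num_partitions
    tasks_per_core = num_partitions / num_cores
    return rows_per_task, tasks_per_core


NUM_ROWS = 20_000_000
NUM_CORES = 4

# Partition counts used in the experiments of this section.
for parts in (4, 8, 3, 6, 200, 20_000):
    rows_per_task, tasks_per_core = task_profile(NUM_ROWS, parts, NUM_CORES)
    print(f"{parts:>6} partitions: {rows_per_task:>12,.1f} rows/task, "
          f"{tasks_per_core:>8,.2f} tasks/core")
```

With 20,000 partitions each task handles only 1,000 rows while every core has to work through 5,000 tasks, so the fixed cost of scheduling each task carries far more weight than with 4 or 8 partitions.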

In [4]:
sdf1 = sdf_generator2(20000000, 4)
print(sdf1.rdd.getNumPartitions())
sc.setJobDescription("Part Exp1")
sdf1.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

4


In [5]:
sdf2 = sdf_generator2(20000000, 8)
print(sdf2.rdd.getNumPartitions())
sc.setJobDescription("Part Exp2")
sdf2.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

8


In [6]:
sdf3 = sdf_generator2(20000000, 3)
print(sdf3.rdd.getNumPartitions())
sc.setJobDescription("Part Exp3")
sdf3.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

3


In [7]:
sdf4 = sdf_generator2(20000000, 6)
print(sdf4.rdd.getNumPartitions())
sc.setJobDescription("Part Exp4")
sdf4.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

6


In [8]:
sdf5 = sdf_generator2(20000000, 200)
print(sdf5.rdd.getNumPartitions())
sc.setJobDescription("Part Exp5")
sdf5.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

200


In [9]:
sdf6 = sdf_generator2(20000000, 20000)
print(sdf6.rdd.getNumPartitions())
sc.setJobDescription("Part Exp6")
sdf6.write.format("noop").mode("overwrite").save()
sc.setJobDescription("None")

20000
