# Spark Optimization

### Partitioning

- Partitioning refers to dividing data into logical chunks (partitions) across nodes.
- Effective partitioning improves parallelism, reduces shuffle, and enhances query performance.

#### Partitioning in memory

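- Repartition
    - lets you specify the target number of partitions and, optionally, the columns to partition by
    - performs a full shuffle to redistribute the data into that many partitions
    - can either increase or decrease the number of partitions

- Coalesce
    - reduces the number of partitions by merging existing ones, without a full shuffle
    - useful for cheaply decreasing the partition count, for example before writing a small number of output files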
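
The difference between the two operations can be modelled in plain Python. This is only a sketch, not Spark's implementation: `stable_hash`, `hash_repartition`, and `coalesce` are hypothetical helpers, CRC32 stands in for the Murmur3 hash Spark uses, and merging neighbouring partitions is a simplification of how Spark groups them.

```python
import zlib


def stable_hash(value):
    # Deterministic stand-in for Spark's hash (assumption: crc32 is for illustration only).
    return zlib.crc32(str(value).encode("utf-8"))


def hash_repartition(partitions, num_partitions, key):
    """Model of a full shuffle: every row is re-assigned by the hash of its key."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for row in part:
            out[stable_hash(row[key]) % num_partitions].append(row)
    return out


def coalesce(partitions, num_partitions):
    """Model of coalesce: whole partitions are merged, rows are never re-hashed."""
    if num_partitions >= len(partitions):
        return [list(p) for p in partitions]  # coalesce cannot increase the count
    out = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        out[i * num_partitions // len(partitions)].extend(part)
    return out


# Made-up rows spread over four input partitions.
names = ["France", "Finland", "Cuba", "France", "Peru", "Cuba", "Finland", "Peru"]
rows = [{"id": i, "country": c} for i, c in enumerate(names)]
partitions = [rows[0:2], rows[2:4], rows[4:6], rows[6:8]]

by_country = hash_repartition(partitions, 3, "country")
merged = coalesce(partitions, 2)

print([len(p) for p in by_country])
print([len(p) for p in merged])
```

After `hash_repartition`, all rows for a given country sit in the same partition. After `coalesce`, the rows keep their original order inside each merged partition because nothing was re-hashed.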

#### Partitioning on disk


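- The `partitionBy()` method of the DataFrame writer partitions the data on the file system, creating one sub-directory per distinct value of the partition column (for example `country=Finland/`).
- This improves read performance for downstream queries that filter on the partition column, because Spark can skip the directories that do not match (partition pruning).
- It accepts one or more columns and is applied while writing a DataFrame to disk.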
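
To make the directory layout concrete, here is a minimal plain-Python sketch of the idea. It does not use Spark: `write_partitioned` and `read_partition` are hypothetical helpers, the rows are made up, and CSV files stand in for Parquet. It only mimics the `column=value` directory naming and shows why a filter on the partition column needs to open a single directory.

```python
import csv
import os
import tempfile


def write_partitioned(rows, base_dir, partition_col):
    """Write rows into one sub-directory per distinct value of partition_col."""
    for row in rows:
        part_dir = os.path.join(base_dir, f"{partition_col}={row[partition_col]}")
        os.makedirs(part_dir, exist_ok=True)
        # The partition column is encoded in the directory name, not stored in the file.
        data = {k: v for k, v in row.items() if k != partition_col}
        path = os.path.join(part_dir, "part-00000.csv")
        is_new = not os.path.exists(path)
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(data))
            if is_new:
                writer.writeheader()
            writer.writerow(data)


def read_partition(base_dir, partition_col, value):
    """Read only the directory matching the filter value (partition pruning)."""
    part_dir = os.path.join(base_dir, f"{partition_col}={value}")
    if not os.path.isdir(part_dir):
        return []
    rows = []
    for name in sorted(os.listdir(part_dir)):
        with open(os.path.join(part_dir, name), newline="") as f:
            for row in csv.DictReader(f):
                row[partition_col] = value  # restore the column from the path
                rows.append(row)
    return rows


# Made-up sample rows.
customers = [
    {"customer_id": "c1", "country": "Finland", "total_spent": "120.5"},
    {"customer_id": "c2", "country": "France", "total_spent": "80.0"},
    {"customer_id": "c3", "country": "Finland", "total_spent": "42.0"},
]

base_dir = tempfile.mkdtemp()
write_partitioned(customers, base_dir, "country")

layout = sorted(os.listdir(base_dir))
finland = read_partition(base_dir, "country", "Finland")

print(layout)
print(finland)
```

The `France` directory is never opened when reading the `Finland` rows, which is the effect partition pruning has on a filtered read.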



[Spark performance tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html)



In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CustomerOrdersDemo").getOrCreate()





In [3]:
# load the data
from pyspark.sql.functions import *

orders_df = spark.read.csv("file:///workspaces/trng-2286/datasets/orders.csv", inferSchema=True, header=True)

orders_df.show()

+--------------------+--------------------+----------+----------+------+
|            order_id|         customer_id|order_date|product_id|amount|
+--------------------+--------------------+----------+----------+------+
|02a777e0-5571-42c...|0e99a07c-c7a5-43d...|2023-04-21|     P1031|375.94|
|1c5a3e4d-f8de-47b...|3a69ac3e-6726-431...|2021-09-25|     P1086|373.51|
|a5b65d4d-3ac0-45d...|3a69ac3e-6726-431...|2024-01-04|     P1054| 61.73|
|b752df2c-aa68-41e...|c63cab5f-dc06-484...|2024-01-16|     P1029| 64.97|
|23e8adb9-330d-4ce...|50b165d0-6486-4d5...|2021-10-27|     P1091| 289.4|
|f76db88b-9100-4b0...|50b165d0-6486-4d5...|2024-09-23|     P1057|221.37|
|3bb7142d-b348-486...|50b165d0-6486-4d5...|2023-03-26|     P1000|408.53|
|b4002908-2ad1-4d7...|50b165d0-6486-4d5...|2022-07-26|     P1043|355.11|
|600f5736-35ec-476...|50b165d0-6486-4d5...|2023-03-26|     P1097| 94.63|
|f35f20fb-bc93-417...|4657a2b1-abae-49a...|2023-06-02|     P1012| 64.52|
|98ce805f-e468-458...|4657a2b1-abae-49a...|2024-11-

In [4]:
customers_df = spark.read.parquet("file:///workspaces/trng-2286/datasets/final_customer_data.parquet")

customers_df.show()

+--------------------+--------------------+----+-------+--------------------+-----------+-------------------+---------+-----------+---------+---------------+------------------+-------------+----------+---------+
|         customer_id|               email| age| gender|             country|signup_date|         last_login|is_active|total_spent|age_group|pref_newsletter|pref_notifications|pref_language|first_name|last_name|
+--------------------+--------------------+----+-------+--------------------+-----------+-------------------+---------+-----------+---------+---------------+------------------+-------------+----------+---------+
|0e99a07c-c7a5-43d...|robinjackson@wrig...|50.0| Female|              France| 2023-03-01|2025-05-29 22:36:25|     true|     1438.4|    Adult|           true|              push|           en|    Thomas|     Lamb|
|3a69ac3e-6726-431...|susan51@johnson-g...|20.0|   Male|       Guinea-Bissau| 2020-12-14|2025-03-21 23:52:55|     true|    2364.98|    Young|           

                                                                                

In [5]:
# partitioning in memory - repartition

customers_df.repartition(4, "country").write.mode("overwrite").parquet("./datasets/customers_partitioned")

                                                                                

In [None]:
partitioned_customer_df = customers_df.repartitionByRange(4, "country").sortWithinPartitions("total_spent")

| Method                       | Function                                  | Shuffles?                                        | Partitions  | Use Case                        |
| ---------------------------- | ----------------------------------------- | ------------------------------------------------ | ----------- | ------------------------------- |
| `repartition(n, col)`        | Hash partitioning on `col`                | Yes (full shuffle)                               | Exact count | General repartitioning          |
| `repartitionByRange(n, col)` | Range partitioning on `col`               | Yes (full shuffle; range boundaries are sampled) | Exact count | Sorted / range-based processing |
| `sortWithinPartitions(col)`  | Sorts rows locally inside each partition  | No                                               | As is       | Ordered rows per partition      |
| `spark_partition_id()`       | Column holding each row's partition ID    | No                                               | -           | Debugging/analysis              |
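
The contrast between hash and range partitioning, followed by a local sort, can be sketched in plain Python. This is an illustrative model, not Spark code: `hash_partition` and `range_partition` are hypothetical helpers, CRC32 stands in for Spark's hash function, and the boundaries are computed from all values, whereas Spark estimates them from a sample.

```python
import bisect
import zlib


def hash_partition(values, n):
    """Assign each value to a partition by hash; neighbouring values get scattered."""
    parts = [[] for _ in range(n)]
    for v in values:
        parts[zlib.crc32(v.encode("utf-8")) % n].append(v)
    return parts


def range_partition(values, n):
    """Assign each value to one of n contiguous, ordered ranges."""
    ordered = sorted(set(values))
    # n - 1 boundaries split the distinct values into n roughly equal ranges.
    bounds = [ordered[len(ordered) * i // n] for i in range(1, n)]
    parts = [[] for _ in range(n)]
    for v in values:
        parts[bisect.bisect_right(bounds, v)].append(v)
    return parts, bounds


# Made-up sample values.
countries = ["Cuba", "Peru", "Belarus", "Finland", "Argentina", "Zambia", "France", "Norway"]

hashed = hash_partition(countries, 4)
ranged, bounds = range_partition(countries, 4)

# Local sort inside each partition; no rows move between partitions.
sorted_parts = [sorted(p) for p in ranged]

print(bounds)
print(sorted_parts)
```

With range partitioning, every value in partition `i` sorts before every value in partition `i + 1`, so sorting each partition locally yields data that is ordered end to end. Hash partitioning gives no such guarantee.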


In [8]:
partitioned_customer_df.withColumn("partition_id", spark_partition_id()) \
.select("partition_id", "country", "total_spent") \
.orderBy("partition_id", "total_spent") \
.show(50, truncate=False)

+------------+---------------------------------+-----------+
|partition_id|country                          |total_spent|
+------------+---------------------------------+-----------+
|0           |Bouvet Island (Bouvetoya)        |505.95     |
|0           |Cocos (Keeling) Islands          |509.83     |
|0           |Belarus                          |901.85     |
|0           |Azerbaijan                       |1800.07    |
|0           |Brunei Darussalam                |2351.05    |
|0           |Argentina                        |2475.12    |
|0           |Cuba                             |2707.21    |
|0           |Bhutan                           |2888.23    |
|0           |Cambodia                         |2997.65    |
|0           |Bangladesh                       |3189.7     |
|0           |Central African Republic         |3580.51    |
|0           |Congo                            |3986.54    |
|0           |Colombia                         |4283.76    |
|0           |Anguilla  

In [12]:
# partitioning in memory - coalesce

customers_df.coalesce(1).write.mode("overwrite").parquet("./datasets/customers_coalesce")

In [13]:
# partitioning on disk - partitionBy

customers_df.write.mode("overwrite").partitionBy("country").parquet("./datasets/customers_by_country")

                                                                                

In [14]:
df_partitioned = spark.read.parquet("file:///workspaces/trng-2286/datasets/customers_by_country")

df_partitioned.filter(col("country") == "Finland").explain(True)

                                                                                

== Parsed Logical Plan ==
'Filter '`=`('country, Finland)
+- Relation [customer_id#131,email#132,age#133,gender#134,signup_date#135,last_login#136,is_active#137,total_spent#138,age_group#139,pref_newsletter#140,pref_notifications#141,pref_language#142,first_name#143,last_name#144,country#145] parquet

== Analyzed Logical Plan ==
customer_id: string, email: string, age: double, gender: string, signup_date: date, last_login: timestamp, is_active: boolean, total_spent: double, age_group: string, pref_newsletter: boolean, pref_notifications: string, pref_language: string, first_name: string, last_name: string, country: string
Filter (country#145 = Finland)
+- Relation [customer_id#131,email#132,age#133,gender#134,signup_date#135,last_login#136,is_active#137,total_spent#138,age_group#139,pref_newsletter#140,pref_notifications#141,pref_language#142,first_name#143,last_name#144,country#145] parquet

== Optimized Logical Plan ==
Filter (isnotnull(country#145) AND (country#145 = Finland))
+- Re


### Bucketing

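- Bucketing organizes data into a fixed number of buckets by hashing a column; rows with the same value always land in the same bucket.

**Benefits:**
- Reduces shuffle during joins and aggregations on the bucketed column.
- Enables bucketed sort-merge joins that skip the shuffle when both tables are bucketed on the join key into the same number of buckets.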
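
Why matching buckets remove the need for a shuffle can be shown with a small plain-Python model. It is a sketch under stated assumptions: `bucket_id`, `write_buckets`, and `bucketed_join` are hypothetical helpers, `NUM_BUCKETS` is an arbitrary value, the rows are made up, and CRC32 stands in for Spark's hash function.

```python
import zlib

NUM_BUCKETS = 4  # assumption: any fixed count works as long as both sides use the same one


def bucket_id(key, num_buckets=NUM_BUCKETS):
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_buckets(rows, key, num_buckets=NUM_BUCKETS):
    """Model of writing a bucketed table: rows are grouped by the hash of the key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_id(row[key], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket i of the left side only with bucket i of the right side."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        lookup = {}
        for r in right:
            lookup.setdefault(r[key], []).append(r)
        for l in left:
            for r in lookup.get(l[key], []):
                joined.append({**l, **r})
    return joined


# Made-up sample rows.
customers = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c2", "name": "Ben"},
    {"customer_id": "c3", "name": "Cy"},
]
orders = [
    {"order_id": "o1", "customer_id": "c1"},
    {"order_id": "o2", "customer_id": "c3"},
    {"order_id": "o3", "customer_id": "c1"},
    {"order_id": "o4", "customer_id": "c9"},  # no matching customer
]

joined = bucketed_join(
    write_buckets(orders, "customer_id"),
    write_buckets(customers, "customer_id"),
    "customer_id",
)
print(sorted(r["order_id"] for r in joined))
```

Because both sides hash the same key into the same number of buckets, matching rows are guaranteed to be in buckets with the same index, so no row has to be moved at join time.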

In [None]:
# bucketing
spark.sql("DROP TABLE IF EXISTS bucketed_customers")

customers_df.write.bucketBy(8, "customer_id") \
        .sortBy("age") \
        .mode("overwrite") \
        .saveAsTable("bucketed_customers")


spark.sql("DROP TABLE IF EXISTS bucketed_orders")

orders_df.write.bucketBy(8, "customer_id") \
        .sortBy("order_date") \
        .mode("overwrite") \
        .saveAsTable("bucketed_orders")


In [17]:
bucketed_customers = spark.table("bucketed_customers")
bucketed_orders = spark.table("bucketed_orders")

bucketed_orders.join(bucketed_customers, "customer_id").explain(True)

== Parsed Logical Plan ==
'Join UsingJoin(Inner, [customer_id])
:- SubqueryAlias spark_catalog.default.bucketed_orders
:  +- Relation spark_catalog.default.bucketed_orders[order_id#167,customer_id#168,order_date#169,product_id#170,amount#171] parquet
+- SubqueryAlias spark_catalog.default.bucketed_customers
   +- Relation spark_catalog.default.bucketed_customers[customer_id#152,email#153,age#154,gender#155,country#156,signup_date#157,last_login#158,is_active#159,total_spent#160,age_group#161,pref_newsletter#162,pref_notifications#163,pref_language#164,first_name#165,last_name#166] parquet

== Analyzed Logical Plan ==
customer_id: string, order_id: string, order_date: date, product_id: string, amount: double, email: string, age: double, gender: string, country: string, signup_date: date, last_login: timestamp, is_active: boolean, total_spent: double, age_group: string, pref_newsletter: boolean, pref_notifications: string, pref_language: string, first_name: string, last_name: string
Proj

| Feature               | **Partitioning**                                                 | **Bucketing**                                                            |
| --------------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------------ |
| **Definition**        | Divides data into **directory-based partitions** by column value | Divides data into a **fixed number of buckets** using a hash of a column |
| **Granularity**       | Coarse-grained (one directory per distinct value)                | Fine-grained (fixed number of buckets, regardless of distinct values)    |
| **Data Layout**       | Creates a folder for each partition column value                 | Creates bucket files inside the table's folder                           |
| **Shuffle Required?** | Not for the write itself (each task writes files for the values it holds); pruning on read needs no shuffle | Not for the write itself (each task writes a file per bucket it holds); joins and aggregations on the bucket column can skip the shuffle on read |
| **Use Case**          | Partition pruning for filters on the partition column (`WHERE country = 'IN'`) | Efficient **joins** and **sampling** on large datasets     |
| **Syntax (Write)**    | `.write.partitionBy("col")`                                      | `.write.bucketBy(4, "col").sortBy("col").saveAsTable(...)`               |
| **Sort Support**      | Not sorted by default                                            | Can be sorted within each bucket (`.sortBy(...)`)                        |
| **Join Optimization** | Not directly useful                                              | Enables **bucketed joins** (avoids shuffle if both sides are bucketed on the join key with the same bucket count) |
| **Flexibility**       | Number of directories follows the data                           | Requires fixed bucket count defined in advance                           |
| **Storage Format**    | Works with any file-based format (Parquet, Delta, etc.)          | Only supported when using `.saveAsTable()`; Spark's bucket layout is not compatible with Hive's |


### Joins

In PySpark, joins combine rows from two DataFrames based on a common key (just like SQL joins).

| Join Type | Description                                                    |
| --------- | -------------------------------------------------------------- |
| `inner`   | Keep only rows that match in both DataFrames                   |
| `left`    | All rows from left + matching rows from right (NULL otherwise) |
| `right`   | All rows from right + matching rows from left (NULL otherwise) |
| `outer`   | All rows from both sides (NULL where there is no match)        |
| `semi`    | Rows from left **if a match exists** in right (left columns only) |
| `anti`    | Rows from left **if no match exists** in right (left columns only) |
| `cross`   | Cartesian product (every left row paired with every right row) |
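
The semantics of the less familiar join types are easy to see on a tiny example. The plain-Python sketch below is not PySpark and uses made-up data (`Chen`, `Legal`, and `Ops` are hypothetical values). It assumes each key appears at most once on the right side, so a dictionary lookup is enough.

```python
# Made-up sample data: (emp_id, name, dept) and (dept, dept_desc).
employees = [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Chen", "Legal")]
departments = [("HR", "Human Resources"), ("IT", "Information Technology"), ("Ops", "Operations")]

# Lookup from join key to the right-hand row's extra column.
dept_desc = dict(departments)

# inner: only employees whose dept exists on the right.
inner = [(i, n, d, dept_desc[d]) for i, n, d in employees if d in dept_desc]

# left: every employee; None plays the role of NULL when there is no match.
left = [(i, n, d, dept_desc.get(d)) for i, n, d in employees]

# semi: employees with a match, keeping only the left-hand columns.
semi = [(i, n, d) for i, n, d in employees if d in dept_desc]

# anti: employees without a match, keeping only the left-hand columns.
anti = [(i, n, d) for i, n, d in employees if d not in dept_desc]

print(inner)
print(left)
print(semi)
print(anti)
```

Note that `semi` and `anti` never add columns from the right side; they only use it as a filter.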



### Broadcast Joins

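A broadcast join is an optimization in Spark that sends a copy of a small DataFrame to every executor, so the larger DataFrame can be joined in place without being shuffled.

Why it helps:

- Regular joins are expensive because they shuffle data across nodes
- A broadcast join eliminates that shuffle when one DataFrame is small enough to fit in memory on each executor
- Spark also picks a broadcast join automatically when a table's estimated size is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default)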
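
The mechanism can be modelled in a few lines of plain Python. This is a sketch, not Spark's implementation: `broadcast_hash_join` is a hypothetical helper, and the split of the employee rows into two partitions is made up for illustration.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Join each partition of the large side against a shared lookup table."""
    # "Broadcast" step: build one hash table from the small side.
    # In a cluster, every executor would receive its own copy of it.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for part in large_partitions:
        out = []
        for row in part:
            # Each partition is joined locally; its rows never leave the partition.
            for match in lookup.get(row[key], []):
                out.append({**row, **match})
        joined_partitions.append(out)
    return joined_partitions


# Larger side, already split into two partitions (the split is an assumption).
employee_partitions = [
    [
        {"emp_id": 1, "name": "Alice", "dept": "HR"},
        {"emp_id": 2, "name": "Bob", "dept": "IT"},
    ],
    [
        {"emp_id": 3, "name": "Charlie", "dept": "IT"},
        {"emp_id": 4, "name": "David", "dept": "Finance"},
        {"emp_id": 5, "name": "Eve", "dept": "HR"},
    ],
]

# Small side: the table that gets broadcast.
departments = [
    {"dept": "Finance", "dept_desc": "Finance & Accounts"},
    {"dept": "IT", "dept_desc": "Information Technology"},
    {"dept": "HR", "dept_desc": "Human Resources"},
]

result = broadcast_hash_join(employee_partitions, departments, "dept")
for part in result:
    print(part)
```

The output keeps the same two partitions as the input, which is the point: only the small table is copied around, while the large side stays where it is.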

In [22]:
# large df
employees_df = spark.createDataFrame([
    (1, "Alice", "HR"),
    (2, "Bob", "IT"),
    (3, "Charlie", "IT"),
    (4, "David", "Finance"),
    (5, "Eve", "HR")
], ["emp_id", "name", "dept"])

# small df 

depratments_df = spark.createDataFrame([
    ("Finance", "Finance & Accounts"),
    ("IT", "Information Technology"),
    ("HR", "Human Resources")
], ["dept", "dept_desc"])

In [23]:
# default join on department

joined_df = employees_df.join(depratments_df, on="dept", how="inner")

joined_df.show()


+-------+------+-------+--------------------+
|   dept|emp_id|   name|           dept_desc|
+-------+------+-------+--------------------+
|Finance|     4|  David|  Finance & Accounts|
|     HR|     1|  Alice|     Human Resources|
|     HR|     5|    Eve|     Human Resources|
|     IT|     2|    Bob|Information Techn...|
|     IT|     3|Charlie|Information Techn...|
+-------+------+-------+--------------------+



                                                                                

In [24]:
# broadcast join on small df

broadcast_joined_df = employees_df.join(broadcast(depratments_df), on="dept", how="inner")

broadcast_joined_df.explain(True)


== Parsed Logical Plan ==
'Join UsingJoin(Inner, [dept])
:- LogicalRDD [emp_id#184L, name#185, dept#186], false
+- ResolvedHint (strategy=broadcast)
   +- LogicalRDD [dept#187, dept_desc#188], false

== Analyzed Logical Plan ==
dept: string, emp_id: bigint, name: string, dept_desc: string
Project [dept#186, emp_id#184L, name#185, dept_desc#188]
+- Join Inner, (dept#186 = dept#187)
   :- LogicalRDD [emp_id#184L, name#185, dept#186], false
   +- ResolvedHint (strategy=broadcast)
      +- LogicalRDD [dept#187, dept_desc#188], false

== Optimized Logical Plan ==
Project [dept#186, emp_id#184L, name#185, dept_desc#188]
+- Join Inner, (dept#186 = dept#187), rightHint=(strategy=broadcast)
   :- Filter isnotnull(dept#186)
   :  +- LogicalRDD [emp_id#184L, name#185, dept#186], false
   +- Filter isnotnull(dept#187)
      +- LogicalRDD [dept#187, dept_desc#188], false

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [dept#186, emp_id#184L, name#185, dept_desc#188]
   +- Broadc