
### Repartition vs. Coalesce

These functions control the number of partitions in a DataFrame, impacting parallelism and performance. The key difference is whether they trigger a full data shuffle.

#### `repartition(n)` — Increases/Decreases Partitions with Shuffle

*   **Purpose**: Redistributes data across `n` partitions. When only a count is given, rows are assigned round-robin, so the resulting partitions are roughly equal in size.
*   **Behavior**: *Always* involves a **full shuffle** of data. Each row can move to any partition.
*   **Use Cases**:
    *   **Increasing Parallelism**: When current partitions are too few for CPU-bound tasks.
    *   **Even Distribution**: Decreasing partitions while ensuring balanced data distribution.
    *   **Balancing Data Skew**: Rebalancing data when some partitions are much larger than others.
    *   **Changing Partitioning Key**: `repartition(n, "col")` hash-partitions by the given column(s), so rows with the same key land in the same partition. This can help later joins or aggregations on that key.
*   **Overhead**: High network I/O and disk I/O due to the full shuffle.

#### `coalesce(n)` — Decreases Partitions Without Full Shuffle

*   **Purpose**: Decreases the number of partitions.
*   **Behavior**: Merges existing partitions into fewer, larger ones instead of redistributing individual rows, which **avoids a full shuffle**. It cannot increase the number of partitions: if `n` is greater than the current count, the DataFrame keeps its current number of partitions. Because no shuffle boundary is introduced, a drastic reduction such as `coalesce(1)` can also reduce the parallelism of the upstream computation.
*   **Use Cases**:
    *   **Reducing Partitions for Writing**: When writing to a single file or a small number of files to avoid creating many tiny files (which are inefficient in distributed file systems).
    *   **Minimizing Shuffle**: When you need to reduce partitions, but want to avoid the high cost of a full shuffle.
*   **Overhead**: Lower overhead than `repartition()` as it avoids a full shuffle.

#### Example Code

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RepartitionCoalesce").getOrCreate()

data = [(i,) for i in range(100)] # 100 rows
df = spark.createDataFrame(spark.sparkContext.parallelize(data, 10), ["value"]) # Start with 10 partitions

print(f"Initial partitions: {df.rdd.getNumPartitions()}") # Should be 10 (or Spark's default if not specified)

# Repartition to 10 partitions (no change if starting with 10, but demonstrates repartition)
df_repartitioned = df.repartition(10)
print(f"Partitions after repartition(10): {df_repartitioned.rdd.getNumPartitions()}")
df_repartitioned.explain() # Look for 'Exchange' for shuffle

# Repartition to 2 partitions
df_repartitioned_small = df.repartition(2)
print(f"Partitions after repartition(2): {df_repartitioned_small.rdd.getNumPartitions()}")
df_repartitioned_small.explain() # Look for 'Exchange'

# Coalesce to 5 partitions (decreases without full shuffle)
df_coalesced = df.coalesce(5)
print(f"Partitions after coalesce(5): {df_coalesced.rdd.getNumPartitions()}")
df_coalesced.explain() # Should not show 'Exchange' for valid decreases

# Coalesce to 1 partition (useful for writing a single file)
df_coalesced_single = df.coalesce(1)
print(f"Partitions after coalesce(1): {df_coalesced_single.rdd.getNumPartitions()}")

# Try to coalesce to a number higher than initial partitions - it won't increase
df_coalesced_larger = df.coalesce(15)
print(f"Partitions after coalesce(15) (will not increase): {df_coalesced_larger.rdd.getNumPartitions()}")

spark.stop()
```

```text
Initial partitions: 10
Partitions after repartition(10): 10
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   ShuffleQueryStage 0
   +- Exchange RoundRobinPartitioning(10), REPARTITION_BY_NUM, [plan_id=15]
      +- *(1) Scan ExistingRDD[value#0L]
+- == Initial Plan ==
   Exchange RoundRobinPartitioning(10), REPARTITION_BY_NUM, [plan_id=10]
   +- Scan ExistingRDD[value#0L]

Partitions after repartition(2): 2
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   ShuffleQueryStage 0
   +- Exchange RoundRobinPartitioning(2), REPARTITION_BY_NUM, [plan_id=31]
      +- *(1) Scan ExistingRDD[value#0L]
+- == Initial Plan ==
   Exchange RoundRobinPartitioning(2), REPARTITION_BY_NUM, [plan_id=26]
   +- Scan ExistingRDD[value#0L]

Partitions after coalesce(5): 5
== Physical Plan ==
Coalesce 5
+- *(1) Scan ExistingRDD[value#0L]

Partitions after coalesce(1): 1
Partitions after coalesce(15) (will not increase): 10
```

#### When to Use Which: Comparison Table

| Feature         | `repartition(n)`                                | `coalesce(n)`                                                  |
| :-------------- | :---------------------------------------------- | :------------------------------------------------------------- |
| **Shuffle**     | **Always** triggers a full data shuffle.        | Avoids a full shuffle by merging existing partitions.          |
| **Partitions**  | Can increase or decrease partitions.            | Can only decrease partitions; a larger `n` leaves the count unchanged. |
| **Distribution**| Roughly even partition sizes when called with a count only (round-robin); when called with columns, sizes follow the key distribution. | May result in unevenly sized partitions. |
| **Performance** | Higher overhead due to network I/O and disk I/O. | Lower overhead when reducing partitions, but a drastic reduction can limit upstream parallelism. |
| **Use Cases**   | - Increasing parallelism<br>- Balancing skewed data<br>- Changing partitioning key for joins | - Reducing number of small files when writing<br>- Minimizing network I/O when decreasing partitions |

#### General Guidelines and Best Practices

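*   **Write Optimization**: When writing a DataFrame to a few files (e.g., one file), use `coalesce(1)` or `coalesce(N)` (where N is small) to prevent many tiny output files. If the DataFrame is large or the upstream computation is expensive, keep in mind that `coalesce(1)` can pull that work into a single task; `repartition(1)` costs a shuffle but lets the upstream stages run in parallel.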
*   **Shuffle Control**: Be mindful of `repartition()`; it's an expensive operation. Use it only when a full data redistribution is absolutely necessary.
*   **Increasing Parallelism**: If tasks are CPU-bound and your current number of partitions is too low, use `repartition()` to boost parallelism.
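*   **Data Skew**: If the Spark UI shows data skew (a few tasks taking much longer than the rest), `repartition(n)` with a count only spreads rows round-robin and evens out partition sizes. Repartitioning by the skewed column itself does not help, because every row with a hot key still hashes to the same partition; add a salt column or repartition on a more uniformly distributed column instead.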
*   **Best Practice**: Understand your data and workload. Monitor Spark UI (tasks, durations, shuffle read/write bytes) to determine if your partitioning strategy is optimal.
