Here’s a practical, “what to use when” guide to **`repartition`** and **`coalesce`** in PySpark—focused on **parallelism**, **skew**, and **small files**.

---

# What they do (in one line)

* **`repartition()`**: **wide** transformation → causes a **shuffle**. Can **increase or decrease** partitions. Can also **hash-partition by columns** for better balance/locality.
* **`coalesce()`**: **narrow** transformation → **no shuffle**. Can only **decrease** partitions by merging existing ones.

---

## 1) Managing Parallelism (number of tasks)

### When you need **more** parallelism (e.g., many cores idle, slow stages)

Use **`repartition(n)`** to increase partitions and spread work across executors.

```python
# Increase from, say, 50 to 400 partitions to use the cluster better
df = df.repartition(400)  # round-robin shuffle
```

If operations are keyed (joins, windows), prefer **column-based** repartition to keep related rows together:

```python
df = df.repartition("user_id")              # hash by user_id, #parts = spark.sql.shuffle.partitions
df = df.repartition(800, "user_id")         # explicit partition count + hash by column
```

### When you need **less** parallelism (too many tiny tasks)

* If data is already fairly balanced, use **`coalesce(target)`** (cheaper, no shuffle):

```python
df = df.coalesce(100)  # merge existing partitions in place, no shuffle
```

* If data is **not** balanced (some partitions huge, others tiny), use **`repartition(target)`** to **rebalance** via a shuffle.

> Tip: A common steady-state target is \~**128–512 MB per partition** (depends on executor memory/IO).
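
The choice described in this section can be sketched as a small helper. This is a minimal sketch, not a standard API: `resize_partitions` and its `balanced` flag are hypothetical names, `df` is assumed to be a `pyspark.sql.DataFrame`, and `balanced` is the caller's own judgment of whether partitions are roughly equal in size.

```python
def resize_partitions(df, target, balanced=True):
    """Return `df` with `target` partitions, using the cheaper operation when safe.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame and
    `balanced` is the caller's estimate of whether partitions are similar in size.
    """
    current = df.rdd.getNumPartitions()
    if target == current:
        return df
    if target > current:
        # coalesce() cannot add partitions, so growing always needs a shuffle.
        return df.repartition(target)
    if balanced:
        # Narrow merge of existing partitions, no shuffle.
        return df.coalesce(target)
    # Uneven input: pay for a round-robin shuffle to even out sizes.
    return df.repartition(target)
```

With AQE enabled, the partition count Spark actually uses after a shuffle can differ from the requested one, so treat the result as a starting point and check it in the Spark UI.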

---

## 2) Handling Skew (one or few partitions much larger than others)

**Symptoms**: One task runs far longer; joins “hang” on the last reducer.
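
Before picking a fix, it can help to confirm the skew with numbers. The sketch below is one way to do that: `skew_ratio` and `rows_per_partition` are hypothetical helper names, `df` is assumed to be a `pyspark.sql.DataFrame`, and the PySpark import is done lazily so the pure-Python part works on its own.

```python
import statistics


def skew_ratio(partition_sizes):
    """Largest non-empty partition divided by the median non-empty partition.

    1.0 means perfectly even; empty partitions are ignored.
    """
    sizes = [s for s in partition_sizes if s > 0]
    if not sizes:
        return 1.0
    return max(sizes) / statistics.median(sizes)


def rows_per_partition(df):
    """Row count of each non-empty Spark partition of `df`.

    Runs a Spark job, so use it while debugging or on a sample.
    """
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    rows = df.groupBy(F.spark_partition_id().alias("pid")).count().collect()
    return [row["count"] for row in rows]
```

A ratio well above 1 after `df.repartition("join_key")` suggests a few keys dominate; `df.groupBy("join_key").count()` sorted by count shows which ones.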

**Tools & patterns**

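1. **Repartition by the join key with more partitions.** This helps only when skew comes from many mid-sized keys colliding in the same partition. Hash partitioning sends every row with the same key to the same partition, so it cannot split a single hot key; use salting or AQE for that:

   ```python
   big = big.repartition(2000, "join_key")  # fewer keys per partition; one hot key still lands in one partition
   ```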

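2. **Salting** the skewed key (split a hot key across multiple buckets), then drop the salt after the join. The other side must be replicated once per salt value and the join must include the salt, otherwise the salt has no effect:

   ```python
   from pyspark.sql import functions as F

   salt_buckets = 20
   # Big side: tag each row with a random salt in [0, salt_buckets)
   big_salted = big.withColumn("salt", (F.rand() * salt_buckets).cast("int"))
   # Small side: replicate each row once per salt value so every (key, salt) pair finds its match
   salts = spark.range(salt_buckets).select(F.col("id").cast("int").alias("salt"))
   small_salted = small.crossJoin(salts)
   # Join on key + salt, then drop the helper column
   joined = big_salted.join(small_salted, ["join_key", "salt"], "inner").drop("salt")
   ```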

3. **AQE (Adaptive Query Execution)** in Spark 3+ (enabled by default since Spark 3.2):

   ```python
   spark.conf.set("spark.sql.adaptive.enabled", "true")
   spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
   spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
   ```

   AQE can split skewed shuffle partitions in sort-merge joins and coalesce small shuffle partitions at runtime.

4. **Broadcast** the small side of a skewed join when possible:

   ```python
   joined = big.join(F.broadcast(small), "join_key")
   ```

**Rule of thumb**: If a hot key is the bottleneck, prefer **AQE**, **salting**, or a **broadcast join**. **`repartition(by columns)`** with more partitions only helps when several large keys share a partition, and **`coalesce()`** won’t fix skew at all (it only merges, doesn’t rebalance).
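
To see why more partitions alone do not dilute a hot key while salting does, here is a small pure-Python simulation. It is only an illustration: `zlib.crc32` stands in for Spark's hash function (Spark uses a different one), the key distribution is made up, and the salt is assigned deterministically instead of with `rand()` so the run is repeatable.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Stand-in for hash partitioning: stable hash of the key modulo partition count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def largest_partition(keys, num_partitions):
    """Row count of the fullest partition after hash-partitioning `keys`."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return max(counts.values())


hot_rows, salt_buckets = 9_000, 20
keys = ["hot"] * hot_rows + [f"user_{i}" for i in range(1_000)]

# Plain hash partitioning: every "hot" row lands together, however many partitions exist.
plain_200 = largest_partition(keys, 200)
plain_2000 = largest_partition(keys, 2000)

# Salting: the hot key becomes `salt_buckets` distinct partitioning keys.
salted_keys = [f"{k}|{i % salt_buckets}" for i, k in enumerate(keys)]
salted_200 = largest_partition(salted_keys, 200)

print(plain_200, plain_2000, salted_200)
```

Going from 200 to 2000 partitions leaves the largest partition at least as big as the hot key, while salting breaks it into much smaller pieces.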

---

## 3) Reducing Small Files (object stores like S3/ADLS/GCS, HDFS)

Each Spark **task writes at least one file** per output directory it holds rows for. Too many partitions ⇒ too many **small files**.

### General, non-partitioned writes

* If your final dataset is already balanced:

  * **`coalesce(target_files)`** (fast, no shuffle) just before `.write`:

    ```python
    target_files = 200
    df.coalesce(target_files).write.mode("overwrite").parquet(path)
    ```
* If it’s **not** balanced (or you want more consistent file sizes):

  * **`repartition(target_files)`** before write:

    ```python
    df.repartition(200).write.parquet(path)
    ```

### Partitioned writes (e.g., `partitionBy("date")`)

Want **one file per partition value** (or a controlled small number)?

* Use **`repartition` on the same partition columns before writing**. This ensures all rows for a given partition value live in the **same Spark partition**, so **only one task** writes that directory → typically **one file per value**.

  ```python
  out = df.repartition("date")  # hash by date; each date goes to exactly one partition
  out.write.partitionBy("date").parquet(path)  # usually one file per date value
  ```
* For multi-column partitions:

  ```python
  out = df.repartition("country", "date")
  out.write.partitionBy("country", "date").parquet(path)
  ```
* If you need a **specific number of files per partition value** (e.g., 2 files per `date`), there isn’t a native “coalesce per key” primitive. Two approaches:

  * Repartition by the partition columns so each value is written by a single task, then **use `maxRecordsPerFile`** to cap the rows per file. A value with more rows than the cap is split across several files, so this bounds file size rather than fixing an exact file count:

    ```python
    (df.repartition("date")
       .write
       .option("maxRecordsPerFile", 5_000_000)  # coarse file size control
       .partitionBy("date")
       .parquet(path))
    ```
  * Or write each partition value in a loop (only for moderate cardinalities).
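
Since `maxRecordsPerFile` counts rows rather than bytes, a rough conversion helps when the goal is a file size. The sketch below assumes you have measured an average on-disk row size yourself (for example from an existing file's size and row count); the function names and the `mode` default are hypothetical, and `df` is assumed to be a `pyspark.sql.DataFrame`.

```python
import math


def records_per_file(target_file_bytes, avg_row_bytes):
    """Row cap that keeps a file near `target_file_bytes`.

    `avg_row_bytes` is an assumed, measured average size of one row on disk.
    """
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, target_file_bytes // avg_row_bytes)


def expected_files(rows_in_value, max_records):
    """Files a single task writes for one partition value under the row cap."""
    return math.ceil(rows_in_value / max_records)


def write_partitioned(df, path, partition_cols, max_records, mode="error"):
    """Cluster rows by the partition columns, then cap rows per file on write."""
    (df.repartition(*partition_cols)
       .write
       .option("maxRecordsPerFile", max_records)
       .partitionBy(*partition_cols)
       .mode(mode)
       .parquet(path))
```

For example, a 256 MB target with rows averaging 128 bytes gives a cap of about 2.1 million rows, so a `date` holding 5 million rows would come out as 3 files.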

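> Avoid `coalesce(1)` on big data: it forces a single task to write a single file (slow, memory pressure), and because `coalesce` is narrow, the upstream stage can collapse into that one task too. It’s fine only for tiny outputs; if you truly need one file from a heavy pipeline, `repartition(1)` keeps the upstream work parallel at the cost of a shuffle.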

### Read-side small files (many tiny input files)

Small input files create too many splits and task overhead. Options:

* Increase split size:

  ```python
  # 128 MB is already the default, so raise it (here to 256 MB) before reading
  spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
  ```
* Compact upstream (Delta Lake `OPTIMIZE`, Iceberg `rewrite_data_files`, or a periodic compaction job).
* After reading, use `coalesce(n)` if subsequent stages need fewer, larger tasks, or `repartition(n)` if the partitions are also uneven.
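
To get a feel for how the split-size setting interacts with many tiny files, the following pure-Python sketch approximates how Spark's file sources pack files into read partitions. It is based on my understanding of Spark 3.x behavior and may differ between versions; the function names are hypothetical, `min_partitions` stands in for the cluster's default parallelism, the 4 MB default mirrors `spark.sql.files.openCostInBytes`, and every file is assumed to be smaller than the split limit.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, min_partitions,
                    max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Approximate per-partition byte limit used when planning a file scan."""
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // min_partitions
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimate_read_partitions(file_sizes, min_partitions,
                             max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Greedy packing of small files into partitions, largest files first."""
    limit = max_split_bytes(file_sizes, min_partitions,
                            max_partition_bytes, open_cost_bytes)
    partitions = 0
    current = 0            # bytes plus open costs in the partition being filled
    open_partition = False
    for size in sorted(file_sizes, reverse=True):
        if open_partition and current + size > limit:
            partitions += 1
            current = 0
            open_partition = False
        # Each file also "costs" open_cost_bytes toward the limit.
        current += size + open_cost_bytes
        open_partition = True
    return partitions + (1 if open_partition else 0)


files = [1 * MB] * 10_000
print(estimate_read_partitions(files, 100, 128 * MB))
print(estimate_read_partitions(files, 100, 256 * MB))
```

Because each file is charged an open cost on top of its size, tiny files fill a partition sooner than their raw bytes suggest, which is why compaction upstream usually beats tuning alone.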

---

## Choosing between `repartition` and `coalesce`: Decision guide

* **Increase parallelism** → `repartition(n)`
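* **Data skew hurting performance** → AQE skew join, salting, or broadcast for hot keys; `repartition(n)` to even out partition sizes; `repartition(n, *cols)` only when several large keys collide in one partition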
* **Just fewer output files; data already even** → `coalesce(n)`
* **Fewer output files; data uneven** → `repartition(n)`
* **Partitioned write and want \~1 file per partition value** → `repartition(partitionCols)` **then** `write.partitionBy(partitionCols)`
* **Absolute minimal shuffle cost** → prefer `coalesce()` (only when safe)

---

## Common code snippets

### Balance & write (non-partitioned), target \~256MB files

```python
# Estimate target partitions (rough heuristic if you know size)
total_size_bytes = 1_000_000_000_000  # 1 TB
target_file_size = 256 * 1024 * 1024  # 256 MB
target_parts = int((total_size_bytes + target_file_size - 1) // target_file_size)

balanced = df.repartition(target_parts)  # or coalesce if already balanced
(balanced
  .write
  .option("maxRecordsPerFile", 10_000_000)  # optional
  .mode("overwrite")
  .parquet(path))
```

### Partitioned write with one file per partition value

```python
df.repartition("event_date").write.partitionBy("event_date").parquet(path)
```

### Skewed join quick fixes

```python
from pyspark.sql import functions as F

# Let AQE split oversized shuffle partitions in sort-merge joins
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# If the small side fits in memory, broadcast it: the big side is not shuffled,
# so key skew stops mattering and no repartition is needed first
joined = big.join(F.broadcast(small), "user_id", "inner")
```

---

## Pitfalls & tips

* `coalesce()` can create **few very large partitions** → risk of OOM/GC pressure on those tasks.
* `repartition()` always shuffles (costly). `repartition(n)` **evens out** partition sizes, while `repartition(cols)` groups rows by key and can itself be skewed. Use it when size/latency justify it.
* `repartition(col)` uses **hash partitioning**: **all rows for the same key go to the same partition** (good for per-key file consolidation).
* Don’t forget that **`spark.sql.shuffle.partitions`** defaults to 200. Set it deliberately for your workload or rely on **AQE coalescing**.
* For very large datasets, prefer **many medium files** (e.g., 128–512 MB) over a few huge ones or thousands of tiny ones.

---

If you share your workload shape (cluster cores, dataset size, partition columns, and where skew appears), I can suggest exact partition counts and a drop-in snippet.
