
# 🧠 PySpark: Partitioning vs Bucketing (Local)

In this notebook, we explore **partitioning** and **bucketing** using PySpark, locally.

We'll cover:
- ✅ What is partitioning?
- ✅ What is bucketing?
- ✅ Use cases and differences
- ✅ Performance implications
- ✅ Sample code and test data

---


In [None]:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Partitioning_vs_Bucketing") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")


## 📁 Step 1: Create Sample Dataset

In [None]:

from pyspark.sql.functions import col, rand

# Create synthetic taxi-like data: 1M rows, 5 boroughs, 20 vendors
df = spark.range(0, 1000000).withColumn("pickup_borough", (col("id") % 5).cast("string"))
df = df.withColumn("vendor_id", (col("id") % 20).cast("string"))
# Seed rand() so the fares are repeatable each time `df` is recomputed
df = df.withColumn("fare_amount", (col("id") % 100) * rand(seed=42))

df.write.mode("overwrite").parquet("data/trips_raw")
print("✅ Sample data written to 'data/trips_raw'")


## 📦 Step 2: Partition the Dataset by `pickup_borough`

In [None]:

df.write.partitionBy("pickup_borough").mode("overwrite").parquet("data/trips_partitioned")
print("✅ Data written with partitioning.")
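
`partitionBy` writes one subdirectory per distinct value of the partition column, named `column=value`. The helper below is a small stdlib-only sketch for checking that layout on disk. `list_partition_dirs` is a hypothetical name, not part of PySpark, and the sketch assumes the output path lives on the local filesystem.

```python
import os

def list_partition_dirs(root):
    """Return the `column=value` subdirectories directly under root."""
    return sorted(
        name
        for name in os.listdir(root)
        if "=" in name and os.path.isdir(os.path.join(root, name))
    )

# Only inspect the directory if the partitioned write has actually run.
if os.path.isdir("data/trips_partitioned"):
    print(list_partition_dirs("data/trips_partitioned"))
```

With the data from Step 2 this should list five directories, `pickup_borough=0` through `pickup_borough=4`. A filter on `pickup_borough` lets Spark read only the matching directory.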


## 🎯 Step 3: Bucket the Dataset by `vendor_id` into 8 buckets

In [None]:

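spark.sql("DROP TABLE IF EXISTS trips_bucketed")
# bucketBy only works with saveAsTable, because the bucket spec is stored in the table catalog
df.write.bucketBy(8, "vendor_id").sortBy("vendor_id").mode("overwrite").saveAsTable("trips_bucketed")
print("✅ Data written with bucketing (as a Spark catalog table).")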
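
Bucketing assigns each row to a bucket by hashing the bucket column and taking the result modulo the number of buckets, so the same key always lands in the same bucket. The sketch below illustrates that idea in plain Python. It uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration only: Spark uses its own Murmur3-based hash, so the actual bucket numbers will differ. `bucket_for` is a hypothetical helper, not a PySpark function.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8  # matches bucketBy(8, "vendor_id") in Step 3

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket index with a stand-in hash (NOT Spark's Murmur3)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# Group the 20 synthetic vendor ids ("0".."19") by their bucket.
buckets = defaultdict(list)
for vendor in (str(i) for i in range(20)):
    buckets[bucket_for(vendor)].append(vendor)

for index in sorted(buckets):
    print(index, buckets[index])
```

Because 20 keys are spread over 8 buckets, some buckets hold several vendors. Two tables bucketed the same way on `vendor_id` can be joined bucket to bucket, which is how bucketing lets Spark avoid a shuffle.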


## 🧪 Step 4: Compare Query Performance (Partitioned vs Bucketed)

In [None]:

from time import time

# Note: the two queries filter on different columns, so the timings are
# illustrative only, not a like-for-like benchmark.

# Read partitioned
df_partitioned = spark.read.parquet("data/trips_partitioned")
start = time()
df_partitioned.filter("pickup_borough = '2'").groupBy("pickup_borough").sum("fare_amount").show()
print(f"⏱️ Partitioned query took {time() - start:.2f} seconds")

# Read bucketed (load the table before starting the timer, as for the partitioned read)
df_bucketed = spark.table("trips_bucketed")
start = time()
df_bucketed.filter("vendor_id = '5'").groupBy("vendor_id").sum("fare_amount").show()
print(f"⏱️ Bucketed query took {time() - start:.2f} seconds")
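
A single wall-clock measurement also includes one-off costs such as JVM warm-up and filesystem caching, so the first query timed tends to look slower than it is. One way to reduce that noise is to repeat each query and report the median. The helper below is a minimal sketch of that approach; `time_action` is a hypothetical name, and the workload shown is a stand-in so the block runs without Spark.

```python
from statistics import median
from time import perf_counter

def time_action(action, repeats=3):
    """Call action() `repeats` times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        action()
        durations.append(perf_counter() - start)
    return median(durations)

# Stand-in workload; in the notebook, pass a lambda that triggers a Spark action.
elapsed = time_action(lambda: sum(range(100_000)))
print(f"median: {elapsed:.4f} seconds")
```

In this notebook it could be called as `time_action(lambda: df_partitioned.filter("pickup_borough = '2'").count())`, and likewise for `df_bucketed`.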



## ✅ Summary: Partitioning vs Bucketing

| Feature          | Partitioning                                 | Bucketing                                                        |
|------------------|----------------------------------------------|------------------------------------------------------------------|
| Works on         | File system level (one folder per value)     | Table level (requires `saveAsTable` and the table catalog)       |
| Use case         | Filtering on the partition column            | Efficient joins / aggregations on the bucket column              |
| Example column   | `pickup_borough` (low cardinality)           | `vendor_id` (higher cardinality)                                 |
| Performance gain | Partition pruning (skips non-matching folders) | Rows are pre-hashed into buckets, so the shuffle can be avoided |
| Format support   | ✅ Parquet, Delta, ORC                        | ✅ Parquet, ORC (via `saveAsTable` only); ❌ Delta Lake            |

---
Next: Broadcast vs Distributed Join 🔄
