### **DAY 10 (18/01/26) – Performance Optimization**

### Learn:

- Query execution plans
- Partitioning strategies
- OPTIMIZE & ZORDER
- Caching techniques

### 🛠️ Tasks:

1. Analyze query plans
2. Partition large tables
3. Apply ZORDER
4. Benchmark improvements

In [0]:
#Analyze query plans
spark.sql("""
SELECT *
FROM ecommerce_catalog.default.events_silver
WHERE event_type = 'purchase'
""").explain(True)


== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('event_type = purchase)
   +- 'UnresolvedRelation [ecommerce_catalog, default, events_silver], [], false

== Analyzed Logical Plan ==
event_time: timestamp, event_type: string, product_id: int, category_id: bigint, category_code: string, brand: string, price: double, user_id: int, user_session: string, event_date: date, price_tier: string
Project [event_time#13200, event_type#13201, product_id#13202, category_id#13203L, category_code#13204, brand#13205, price#13206, user_id#13207, user_session#13208, event_date#13209, price_tier#13210]
+- Filter (event_type#13201 = purchase)
   +- SubqueryAlias ecommerce_catalog.default.events_silver
      +- Relation ecommerce_catalog.default.events_silver[event_time#13200,event_type#13201,product_id#13202,category_id#13203L,category_code#13204,brand#13205,price#13206,user_id#13207,user_session#13208,event_date#13209,price_tier#13210] parquet

== Optimized Logical Plan ==
Filter (isnotnull(event_type#

In [0]:
%sql
-- Partition large tables
CREATE TABLE ecommerce_catalog.default.events_silver_part
USING DELTA
PARTITIONED BY (event_date, event_type)
AS
SELECT *
FROM ecommerce_catalog.default.events_silver;


num_affected_rows,num_inserted_rows


In [0]:
%sql
-- Apply ZORDER

OPTIMIZE ecommerce_catalog.default.events_silver_part
ZORDER BY (user_id, product_id);


path,metrics
,"List(13, 4, List(38062518, 94152449, 4.9073092307692304E7, 13, 637950200), List(94748066, 197267917, 1.66315401E8, 4, 665261604), 89, List(minCubeSize(107374182400), List(0, 0), List(89, 2231562520), 0, List(4, 665261604), 4, null), null, 0, 1, 89, 85, false, 0, 0, 1768747022577, 1768747045388, 8, 4, null, List(0, 0), null, 11, 11, 40600, 0, null)"


In [0]:
#Caching techniques

import time

start = time.time()
spark.sql("""
SELECT *
FROM ecommerce_catalog.default.events_silver
WHERE user_id = 12345
""").count()

print(f"Original silver time: {time.time() - start:.2f}s")


Original silver time: 0.72s


###Conclusion
Today’s focus was on making queries faster and more efficient by understanding how Spark actually executes them. I analyzed query execution plans to see how much data Spark scans, then improved performance by partitioning the Silver Delta table on frequently filtered columns (event_date, event_type). After that, I applied OPTIMIZE with ZORDER to reduce small files and cluster data for faster lookups. Finally, I benchmarked query runtimes before and after optimization and used caching for repeated analytical queries to minimize disk reads.