### Delta Lake & Lakehouse Optimization Use Cases

![](/Workspace/Users/infoblisstech@gmail.com/databricks-code-repo/5_all_databricks_workouts/DELTA OPTIMIZATIONS.png)

#### 1. Handling Data Skew & Query Performance (Optimize & Z-Order)
Scenario: The analytics team reports that queries filtering silver_shipments by source_city and shipment_date are becoming slow as data volume grows.

Task: Run the OPTIMIZE command with ZORDER on the silver_shipments table to co-locate related data in the same files.

Outcome:
Why did we choose source_city and shipment_date for Z-Ordering instead of shipment_id? Think about which columns the queries actually filter on versus which column merely has high cardinality.

In [0]:
%sql
OPTIMIZE logistics_proj.shipment_logistics_data.logistics_shipment_silver_tbl ZORDER BY (shipment_date,source_city);
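
To make the outcome question concrete, here is a toy model of file skipping in plain Python. It is only a sketch under stated assumptions: the city names and row counts are hypothetical, each "file" is a list of 500 rows, and a single-column sort stands in for Z-Ordering (a real Z-order curve interleaves several columns). A file is scanned only when its min/max range for source_city could contain the filter value.

```python
import random

def files_scanned(rows, rows_per_file, city):
    """Return (files scanned, total files) for a filter on source_city.

    A file is scanned when its [min, max] source_city range could contain `city`.
    """
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    scanned = 0
    for f in files:
        cities = [r["source_city"] for r in f]
        if min(cities) <= city <= max(cities):
            scanned += 1
    return scanned, len(files)

random.seed(7)
# Hypothetical city values; the real table's values are not shown in this notebook.
cities = ["Austin", "Boston", "Chicago", "Denver", "New York", "Seattle"]
rows = [{"shipment_id": i, "source_city": random.choice(cities)} for i in range(6000)]

# Layout ordered by shipment_id: every file holds a mix of cities.
by_id = files_scanned(sorted(rows, key=lambda r: r["shipment_id"]), 500, "Denver")

# Layout ordered by the filter column: each file covers a narrow city range.
by_city = files_scanned(sorted(rows, key=lambda r: r["source_city"]), 500, "Denver")

print("ordered by shipment_id:", by_id)
print("ordered by source_city:", by_city)
```

In this model, ordering by shipment_id leaves every file with a wide city range, so nothing can be skipped, while ordering by the filter column lets most files be skipped.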

#### 2. Speeding up Regional Queries (Partition Pruning)
Scenario: The dashboard team reports that queries on the gold_core_curated_tbl table that filter origin_hub_city for "New York" shipments are scanning the entire dataset (terabytes of data), even though New York accounts for only 5% of the data. This is driving up compute costs.

Task: Re-create the gold_core_curated_tbl table partitioned by origin_hub_city. Run a query filtering for one city to demonstrate "Partition Pruning" (where Spark skips files that don't match the filter).

Outcome: Verify whether partition filtering is applied by running an explain plan and checking for PartitionFilters in the output.

In [0]:
# Read the silver staff table and save it as a managed Delta table partitioned by origin_hub_city.
df1 = spark.sql("select * from logistics_proj.shipment_logistics_data.staff_silver_tbl")
df1.write.mode("overwrite").format("delta").partitionBy("origin_hub_city").saveAsTable("logistics_proj.shipment_logistics_data.staff_gold_tbl")

In [0]:
# Write the same data to a Volume path so the partition folders can be inspected directly.
df2 = spark.sql("select * from logistics_proj.shipment_logistics_data.staff_silver_tbl")
df2.write.mode("overwrite").format("delta").partitionBy("origin_hub_city").save("/Volumes/logistics_proj/shipment_logistics_data/silver")

Partitioned data in the folder:

![image_1770739942518.png](./image_1770739942518.png "image_1770739942518.png")

In [0]:
%sql
select * from logistics_proj.shipment_logistics_data.staff_gold_tbl

![image_1770720793361.png](./image_1770720793361.png "image_1770720793361.png")

Explain plan after partitioning:

![image_1770720894876.png](./image_1770720894876.png "image_1770720894876.png")

In [0]:
%sql
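-- The literal must match the stored partition value exactly (the scenario text spells the city 'New York').
EXPLAIN SELECT * FROM logistics_proj.shipment_logistics_data.staff_gold_tbl WHERE origin_hub_city = 'Newyork';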
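
The idea behind partition pruning can be sketched in plain Python. This is a toy model, not Spark's implementation: the staff rows and the staff_id column are hypothetical, and each partition is represented as a Hive-style directory name such as `origin_hub_city=New York`. A filter on the partition column lets the reader open only the matching directory.

```python
from collections import defaultdict

def partition_by(rows, column):
    """Group rows into Hive-style partition directories, e.g. 'origin_hub_city=Dallas'."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{column}={row[column]}"].append(row)
    return dict(partitions)

def read_with_pruning(partitions, column, value):
    """Open only the directory that matches the filter; return (rows, directories read)."""
    wanted = f"{column}={value}"
    dirs_read = [d for d in partitions if d == wanted]
    rows = [r for d in dirs_read for r in partitions[d]]
    return rows, dirs_read

# Hypothetical rows for illustration only.
staff = [
    {"staff_id": 1, "origin_hub_city": "New York"},
    {"staff_id": 2, "origin_hub_city": "Chicago"},
    {"staff_id": 3, "origin_hub_city": "New York"},
    {"staff_id": 4, "origin_hub_city": "Dallas"},
]

partitions = partition_by(staff, "origin_hub_city")
rows, dirs_read = read_with_pruning(partitions, "origin_hub_city", "New York")

print("directories on disk:", sorted(partitions))
print("directories read:", dirs_read)
```

In the real explain plan, the equivalent evidence is the filter on origin_hub_city appearing under PartitionFilters.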


#### 3. Storage Cost Savings (Vacuum)
Scenario: Your project pipeline runs every hour, creating many small files and obsolete versions of data. Your storage costs are rising. You need to clean up files that are no longer needed for time travel.

Task: Execute a VACUUM command to remove data files older than the retention threshold.

Outcome: Performance improvement, cost saving, best practices.

Observation: Run DESCRIBE HISTORY on the table and check whether the VACUUM operation completed.

In [0]:
%sql
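-- With no RETAIN clause, VACUUM uses the default retention threshold of 7 days (168 hours).
VACUUM logistics_proj.shipment_logistics_data.staff_gold_tbl;
-- Retaining less than 168 hours fails the safety check unless spark.databricks.delta.retentionDurationCheck.enabled is set to false.
-- VACUUM logistics_proj.shipment_logistics_data.logistics_shipment_silver_tbl RETAIN 45 HOURS;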
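
How the retention threshold decides which files go can be sketched with a little date arithmetic. This is a simplified model under stated assumptions: the file names, the clock, and the `vacuum_candidates` helper are hypothetical, and each file is assumed to be already unreferenced by the current table version, with a known time at which it became obsolete.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_HOURS = 168  # Delta Lake's default threshold (7 days)

def vacuum_candidates(removed_files, now, retention_hours=DEFAULT_RETENTION_HOURS):
    """Return the unreferenced files that became obsolete before the retention cutoff."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(name for name, removed_at in removed_files.items() if removed_at < cutoff)

now = datetime(2026, 2, 10, 12, 0, tzinfo=timezone.utc)  # hypothetical clock
removed_files = {  # hypothetical obsolete files and when they stopped being referenced
    "part-0001.parquet": now - timedelta(hours=200),
    "part-0002.parquet": now - timedelta(hours=100),
    "part-0003.parquet": now - timedelta(hours=10),
}

default_run = vacuum_candidates(removed_files, now)
short_run = vacuum_candidates(removed_files, now, retention_hours=45)

print("default retention:", default_run)
print("45-hour retention:", short_run)
```

A shorter retention window frees more storage but also shortens how far back time travel can reach, which is why the default check guards against values below 168 hours.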

#### 4. Modern Data Layout (Liquid Clustering)
Scenario: You are redesigning the silver_shipments table. You want to avoid the "small files" problem and need a flexible layout that adapts to changing query patterns automatically without rewriting the table.

Task: Re-create the silver_shipments table using Liquid Clustering on the shipment_id column.

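Outcome: Liquid Clustering is preferred over traditional partitioning when the cardinality of the key is very high, as it is for shipment_id, because partitioning on such a column would create a separate folder for nearly every value.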

In [0]:
%sql
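-- This statement does not rewrite existing data; run OPTIMIZE afterwards to cluster rows that were already written.
ALTER TABLE logistics_proj.shipment_logistics_data.logistics_shipment_silver_tbl CLUSTER BY (shipment_id);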

In [0]:
%sql
EXPLAIN SELECT * FROM logistics_proj.shipment_logistics_data.logistics_shipment_silver_tbl WHERE shipment_id = 5008467;
DESCRIBE DETAIL logistics_proj.shipment_logistics_data.logistics_shipment_silver_tbl;
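
The cardinality argument in the outcome above comes down to simple arithmetic. The sketch below uses made-up numbers (a 500 GB table, 50 cities, 5 million shipment ids) and a hypothetical helper, assuming data is spread evenly and each partition holds one file.

```python
def avg_file_size_mb(table_size_gb, distinct_values, files_per_partition=1):
    """Average file size if the table is partitioned on a column with `distinct_values` values."""
    return table_size_gb * 1024 / (distinct_values * files_per_partition)

# Hypothetical sizes and cardinalities, for illustration only.
by_city = avg_file_size_mb(500, 50)             # low-cardinality partition column
by_shipment = avg_file_size_mb(500, 5_000_000)  # one partition per shipment_id

print(f"partitioned by city: {by_city:.0f} MB per file")
print(f"partitioned by shipment_id: {by_shipment:.4f} MB per file")
```

Partitioning on the high-cardinality column yields millions of tiny files, which is the small-files problem that clustering on shipment_id avoids.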

#### 5. Cost-Efficient Environment Cloning (Shallow Clone)
Scenario: The QA team needs to test an update on the gold_core_curated_tbl table. The table is 5TB in size. You cannot afford to duplicate the storage cost just for a test and the update should not affect the original table.

Task: Create a Shallow Clone of the gold table for the QA team.

Outcome: If we delete records from the source table (gold_core_curated_tbl), will the QA table (gold_core_curated_tbl_qa) be affected & vice versa? Why or why not?

In [0]:
%sql
CREATE TABLE IF NOT EXISTS logistics_proj.shipment_logistics_data.logistics_shipment_gold_curated_tbl_shallow_clone SHALLOW CLONE logistics_proj.shipment_logistics_data.logistics_shipment_gold_curated_tbl;

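-- A shallow clone copies only the metadata: its own Delta log points at the source's data files as of the cloned version.
-- Deletes or updates on the source after cloning write new files and new source log entries, so they do not show up in the clone.
-- The real risk is VACUUM on the source: if it removes data files the clone still references, queries on the clone can fail.
-- Updates to the shallow clone do not affect the source, because the clone writes new files to its own location and records them in its own Delta log.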
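
The isolation between a source table and its shallow clone can be illustrated with a toy model in plain Python. This is a sketch, not Delta Lake's implementation: `ToyTable`, the file names, and the integer rows are all hypothetical. Each table keeps its own list of referenced files over a shared file store, and a delete is modeled as copy-on-write.

```python
class ToyTable:
    """A table is a set of referenced file names over a shared file store."""

    def __init__(self, storage, files):
        self.storage = storage   # shared dict: file name -> list of rows
        self.files = set(files)  # files referenced by this table's current version

    def read(self):
        return sorted(r for f in self.files for r in self.storage[f])

    def shallow_clone(self):
        # Copy only the file references, never the data files themselves.
        return ToyTable(self.storage, self.files)

    def delete_where(self, predicate, new_file):
        # Copy-on-write: surviving rows go to a new file; old files stay on disk.
        self.storage[new_file] = [r for r in self.read() if not predicate(r)]
        self.files = {new_file}

storage = {"src-0.parquet": [1, 2, 3, 4]}
source = ToyTable(storage, ["src-0.parquet"])
qa = source.shallow_clone()

# Delete from the source: the clone still references the original file.
source.delete_where(lambda r: r > 2, "src-1.parquet")
source_rows, qa_rows = source.read(), qa.read()

# Delete from the clone: the source is untouched.
qa.delete_where(lambda r: r == 1, "qa-0.parquet")
source_after_qa_write, qa_after_write = source.read(), qa.read()

print(source_rows, qa_rows, source_after_qa_write, qa_after_write)
```

In this model the only way the source can hurt the clone is by physically removing `src-0.parquet` while the clone still references it, which is what a VACUUM on the source could do.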

#### 6. Disaster Recovery (Time Travel & Restore)
Scenario: A junior data engineer accidentally ran a logic error that corrupted the gold_core_curated_tbl table 15 minutes ago. You need to revert the table to its previous state immediately.

Task: Use Delta Lake's Restore feature to roll back the table.

Outcome: What is the difference between querying with VERSION AS OF (Time Travel) and running RESTORE?

In [0]:
%sql
RESTORE TABLE logistics_proj.shipment_logistics_data.logistics_shipment_gold_curated_tbl TO TIMESTAMP AS OF '2026-01-30T16:24:16.000+00:00';
-- OR
-- RESTORE TABLE logistics_proj.shipment_logistics_data.logistics_shipment_gold_curated_tbl TO VERSION AS OF 1;
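
One way to reason about the outcome question is a toy version history in plain Python. This is a sketch with a hypothetical `ToyHistory` class and made-up rows, not Delta Lake's API: time travel reads an old snapshot and leaves the table as it is, while restore commits a new version whose contents match the old snapshot.

```python
class ToyHistory:
    """A table as an append-only list of snapshots; the list index is the version number."""

    def __init__(self, rows):
        self.versions = [list(rows)]

    def write(self, rows):
        self.versions.append(list(rows))

    def read(self, version=None):
        """Time travel: read a snapshot without changing the history."""
        return list(self.versions[-1 if version is None else version])

    def restore(self, version):
        """Restore: commit a new version whose contents equal the old snapshot."""
        self.versions.append(list(self.versions[version]))

table = ToyHistory(["good"])       # version 0
table.write(["good", "good-2"])    # version 1
table.write(["corrupted"])         # version 2, the bad run

snapshot = table.read(version=1)   # time travel: the current data is still corrupted
current_before = table.read()

table.restore(1)                   # restore: adds version 3 with the version 1 contents
current_after = table.read()

print(snapshot, current_before, current_after, len(table.versions))
```

Note that the corrupted version stays in the history after the restore, so the bad run can still be inspected later.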