# OPTIMIZE and VACUUM Lab 🧪

Welcome to the OPTIMIZE and VACUUM lab! In this hands-on lab, you'll learn how to maintain and optimize Delta Lake tables for better performance and storage efficiency.

---

## 🎯 Learning Objectives

By the end of this lab, you will be able to:

1. ✅ Understand why small files are a performance problem
2. ✅ Use **OPTIMIZE** to compact small files
3. ✅ Use **ZORDER** to co-locate related data
4. ✅ Use **VACUUM** to remove old files and save storage
5. ✅ Monitor table health with DESCRIBE DETAIL
6. ✅ Check table history with DESCRIBE HISTORY
7. ✅ Understand retention periods and data lifecycle

---

## 📊 What are OPTIMIZE and VACUUM?

### **OPTIMIZE**
* **Problem:** Many small files slow down queries
* **Solution:** Combine small files into larger, optimized files
* **Benefit:** Faster queries, better compression
* **Optional:** ZORDER for data clustering

### **VACUUM**
* **Problem:** Old files accumulate and waste storage
* **Solution:** Delete files no longer needed
* **Benefit:** Reduced storage costs
* **Caution:** Removes the files that time travel to older versions depends on

---

## 🛠️ Lab Structure

This lab has **9 tasks** to complete:

1. Create a table with small files (simulate the problem)
2. Check table details (see the small files issue)
3. Run OPTIMIZE to compact files
4. Verify optimization results
5. Add more data and use ZORDER
6. Preview a VACUUM with DRY RUN
7. Run VACUUM to delete old files
8. Check table history
9. Run a final table health check

**Each task includes:**
* 📝 Clear instructions
* 💡 Hints to guide you
* ✅ Solutions at the end (try first!)

---

## 📁 Dataset

We'll create our own table with booking data to simulate real-world scenarios.

---

**Let's get started!** 🚀

## Task 1: Create a Table with Small Files 📁

**The Problem:**

When data is written in many small batches, Delta tables end up with many small files. Each file adds its own open and metadata overhead, so queries slow down as the file count grows.

**Your Challenge:**

Create a Delta table with booking data by writing multiple small batches.

**Requirements:**

1. Create a table called `main.default.bookings_lab`
2. Write data in **5 separate batches** (simulating incremental writes)
3. Each batch should have 1000 rows
4. Use `.coalesce(1)` so that each batch is written as a single small file
5. Use mode `append` for batches 2-5

**Data structure:**
* `booking_id` - INT (sequential)
* `customer_id` - INT (random 1-500)
* `booking_date` - STRING in `YYYY-MM-DD` format (random dates in 2024; the hints and solution store it as a string rather than a DATE)
* `amount` - DOUBLE (random 50-1000)
* `region` - STRING (random: 'North', 'South', 'East', 'West')

---

**Write your code in the cell below:**

In [0]:
# TODO: Generate and write 5 batches of booking data
# Each batch: 1000 rows, coalesce(1) to create small files
# Batch 1: mode("overwrite"), Batches 2-5: mode("append")



### 💡 Hints for Task 1

<details>
<summary><b>Hint 1:</b> Generating sample data (click to expand)</summary>

Use Python to generate data:
```python
import random
from datetime import datetime, timedelta

# ID bounds for one batch; range() excludes end_id, so this yields 1-1000
start_id, end_id = 1, 1001

data = [
    (i, random.randint(1, 500),
     (datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
     round(random.uniform(50, 1000), 2),
     random.choice(['North', 'South', 'East', 'West']))
    for i in range(start_id, end_id)
]
```
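
The same generation logic can be wrapped in a function so that every batch is produced the same way. This is a minimal sketch, not part of the lab's solution: the name `make_batch`, the `seed` parameter, and the `batch_bounds` list are illustrative assumptions.

```python
import random
from datetime import date, timedelta

REGIONS = ["North", "South", "East", "West"]

def make_batch(start_id, end_id, seed=None):
    """Return booking rows with booking_id in [start_id, end_id)."""
    rng = random.Random(seed)  # a private RNG keeps a seeded batch reproducible
    rows = []
    for booking_id in range(start_id, end_id):
        booking_date = date(2024, 1, 1) + timedelta(days=rng.randint(0, 365))
        rows.append((
            booking_id,
            rng.randint(1, 500),              # customer_id
            booking_date.isoformat(),         # 'YYYY-MM-DD' string
            round(rng.uniform(50, 1000), 2),  # amount
            rng.choice(REGIONS),              # region
        ))
    return rows

# Batch n covers IDs (n-1)*1000+1 through n*1000.
batch_bounds = [((n - 1) * 1000 + 1, n * 1000 + 1) for n in range(1, 6)]
first_batch = make_batch(*batch_bounds[0], seed=1)
print(len(first_batch), first_batch[0][0], first_batch[-1][0])  # 1000 1 1000
```

Each tuple matches the column order of the table, so the returned list can be passed to `spark.createDataFrame` together with a schema.
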
</details>

<details>
<summary><b>Hint 2:</b> Creating DataFrame (click to expand)</summary>

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema = StructType([
    StructField("booking_id", IntegerType()),
    StructField("customer_id", IntegerType()),
    StructField("booking_date", StringType()),
    StructField("amount", DoubleType()),
    StructField("region", StringType())
])

df = spark.createDataFrame(data, schema)
```
</details>

<details>
<summary><b>Hint 3:</b> Writing batches (click to expand)</summary>

```python
# Batch 1 (overwrite)
df1.coalesce(1).write.mode("overwrite").saveAsTable("main.default.bookings_lab")

# Batch 2-5 (append)
df2.coalesce(1).write.mode("append").saveAsTable("main.default.bookings_lab")
```
</details>

## Task 2: Check Table Details 🔍

**Your Challenge:**

Inspect the table to see the small files problem.

**Requirements:**

1. Use `DESCRIBE DETAIL` to view table metadata
2. Look for:
   * `numFiles` - Should be around 5 (one per batch)
   * `sizeInBytes` - Total table size
   * `location` - Where the table is stored
3. Calculate the average file size

**Questions to answer:**
* How many files does the table have?
* What's the average file size?
* Is this optimal? (Hint: Ideal file size is 128MB-1GB)
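
The arithmetic behind these questions can be sketched in plain Python. The helper names and the sample numbers below are hypothetical; in a notebook you would read `sizeInBytes` and `numFiles` from the `DESCRIBE DETAIL` result instead.

```python
TARGET_MIN_MB = 128  # lower bound of the ideal range quoted above

def average_file_size_mb(size_in_bytes, num_files):
    """Average file size in MiB, or 0.0 for a table with no files."""
    if num_files == 0:
        return 0.0
    return size_in_bytes / num_files / (1024 * 1024)

def looks_fragmented(size_in_bytes, num_files):
    """True when there are several files and the average is below the target."""
    return num_files > 1 and average_file_size_mb(size_in_bytes, num_files) < TARGET_MIN_MB

# Hypothetical values, not measured from a real table.
example_size, example_files = 600_000, 5
print(round(average_file_size_mb(example_size, example_files), 3))  # 0.114
print(looks_fragmented(example_size, example_files))                # True
```

A single-file table is never flagged, because compaction cannot reduce its file count any further.
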

---

**Write your code in the cell below:**

In [0]:
%sql
-- TODO: Use DESCRIBE DETAIL to inspect the table
-- Look at numFiles and sizeInBytes



### 💡 Hints for Task 2

<details>
<summary><b>Hint 1:</b> DESCRIBE DETAIL syntax (click to expand)</summary>

```sql
DESCRIBE DETAIL catalog.schema.table_name
```
</details>

<details>
<summary><b>Hint 2:</b> Key columns to look at (click to expand)</summary>

Important columns:
* `numFiles` - Number of data files
* `sizeInBytes` - Total size in bytes
* `location` - Table location
* `format` - Should be 'delta'
</details>

<details>
<summary><b>Hint 3:</b> Calculate average file size (click to expand)</summary>

```sql
SELECT 
  numFiles,
  sizeInBytes,
  ROUND(sizeInBytes / numFiles / 1024 / 1024, 2) AS avg_file_size_mb
FROM (
  DESCRIBE DETAIL main.default.bookings_lab
)
```
</details>

## Task 3: Run OPTIMIZE ⚡

**The Problem:**

Your table has 5 small files. Reading many small files is slow because:
* More file open/close operations
* Less efficient compression
* More metadata to track
* Slower query performance

**Your Challenge:**

Run OPTIMIZE to compact the small files into larger, optimized files.

**Requirements:**

1. Use the `OPTIMIZE` command on your table
2. Run it using SQL (`%sql`)
3. Observe the output metrics:
   * `numFilesAdded` - New optimized files created
   * `numFilesRemoved` - Old small files marked for removal

**Syntax:**
```sql
OPTIMIZE catalog.schema.table_name
```

---

**Write your code in the cell below:**

In [0]:
%sql
-- TODO: Run OPTIMIZE on main.default.bookings_lab



### 💡 Hints for Task 3

<details>
<summary><b>Hint 1:</b> OPTIMIZE syntax (click to expand)</summary>

```sql
OPTIMIZE main.default.bookings_lab
```

That's it! Just one line.
</details>

<details>
<summary><b>Hint 2:</b> Understanding the output (click to expand)</summary>

OPTIMIZE returns metrics:
* `numFilesAdded` - New optimized files (usually 1)
* `numFilesRemoved` - Old files marked for deletion (should be 5)
* `totalFilesSkipped` - Files already optimal
* `totalTimeMs` - Time taken
</details>

<details>
<summary><b>Hint 3:</b> What happens (click to expand)</summary>

OPTIMIZE:
1. Reads all small files
2. Combines them into larger files
3. Writes new optimized files
4. Marks old files for deletion (but doesn't delete yet!)
5. Updates transaction log
</details>
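
To make the "marked for deletion, but not deleted" idea concrete, here is a toy model of a table as a set of files on storage plus a log of add/remove actions. It is a simplified sketch for intuition only: the `ToyTable` class and its methods are hypothetical and do not reflect Delta Lake's actual log format.

```python
from dataclasses import dataclass, field

@dataclass
class ToyTable:
    """Toy model: files on storage plus a log of add/remove actions."""
    on_disk: set = field(default_factory=set)
    log: list = field(default_factory=list)  # one list of (action, file) per version

    def commit(self, actions):
        for action, name in actions:
            if action == "add":
                self.on_disk.add(name)  # new data files are written to storage
        self.log.append(list(actions))  # a "remove" only changes the log

    def live_files(self, version=None):
        """Replay the log up to `version` to find the files that version references."""
        last = len(self.log) - 1 if version is None else version
        live = set()
        for actions in self.log[: last + 1]:
            for action, name in actions:
                if action == "add":
                    live.add(name)
                else:
                    live.discard(name)
        return live

    def optimize(self, compacted_name):
        """Replace all live files with one compacted file in a single commit."""
        old = sorted(self.live_files())
        self.commit([("add", compacted_name)] + [("remove", f) for f in old])

table = ToyTable()
for n in range(1, 6):
    table.commit([("add", f"part-{n}.parquet")])
table.optimize("compacted.parquet")

print(table.live_files())   # {'compacted.parquet'}
print(len(table.on_disk))   # 6: the five old files are still on storage
```

Replaying the log only up to version 4 (`table.live_files(version=4)`) still returns the five original files, which is why older versions stay readable until the files are physically removed.
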

## Task 4: Verify Optimization Results ✅

**Your Challenge:**

Check if OPTIMIZE worked by inspecting the table again.

**Requirements:**

1. Run `DESCRIBE DETAIL` again on your table
2. Compare with Task 2 results:
   * `numFiles` - Should be fewer (ideally 1)
   * `sizeInBytes` - Should be similar (same data)
3. Calculate the new average file size

**Expected results:**
* Before OPTIMIZE: ~5 small files
* After OPTIMIZE: 1 larger file
* File size: Much larger per file

---

**Write your code in the cell below:**

In [0]:
%sql
-- TODO: Run DESCRIBE DETAIL again to see the changes
-- Compare numFiles before and after



### 💡 Hints for Task 4

<details>
<summary><b>Hint 1:</b> Same command as Task 2 (click to expand)</summary>

```sql
DESCRIBE DETAIL main.default.bookings_lab
```
</details>

<details>
<summary><b>Hint 2:</b> What to look for (click to expand)</summary>

Compare:
* **Before:** numFiles = 5, small avg file size
* **After:** numFiles = 1, larger avg file size
* **Data:** sizeInBytes should be similar (same data, better compressed)
</details>

<details>
<summary><b>Hint 3:</b> Why larger files are better (click to expand)</summary>

Larger files mean:
* Fewer file operations
* Better compression
* Faster queries
* More efficient storage
</details>

## Task 5: Add More Data and Use ZORDER 🎯

**What is ZORDER?**

ZORDER BY co-locates related data in the same files, making queries on those columns much faster.
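
The name comes from the Z-order (Morton) curve, which maps several column values to one sort key by interleaving their bits. The sketch below shows that general idea for two small integers; it is an illustration only and makes no claim about how Delta Lake implements ZORDER internally. The function name `interleave_bits` is hypothetical.

```python
def interleave_bits(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x takes the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y takes the odd bit positions
    return key

# Sort a 4x4 grid of (x, y) points by their interleaved key.
points = [(x, y) for x in range(4) for y in range(4)]
points.sort(key=lambda p: interleave_bits(*p))
print(points[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Points that are close in both dimensions end up next to each other in the sorted order, so rows with similar values in either column tend to land in the same file.
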

**Your Challenge:**

Add more data and optimize with ZORDER.

**Requirements:**

**Part A: Add more data**
1. Generate 2000 more rows (booking_id 5001-7000)
2. Append to the table
3. Use `.coalesce(2)` to create 2 more small files

**Part B: OPTIMIZE with ZORDER**
1. Run OPTIMIZE with ZORDER BY on the `region` column
2. This will cluster data by region for faster region-based queries

**Syntax:**
```sql
OPTIMIZE table_name ZORDER BY (column_name)
```

---

**Write your code in the cells below:**

In [0]:
# TODO: Generate 2000 more rows (booking_id 5001-7000)
# Append to the table with coalesce(2)



In [0]:
%sql
-- TODO: Run OPTIMIZE with ZORDER BY (region)



### 💡 Hints for Task 5

<details>
<summary><b>Hint 1:</b> Generating more data (click to expand)</summary>

```python
# Similar to Task 1, but different ID range
data_batch6 = [
    (i, random.randint(1, 500), 
     (datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
     round(random.uniform(50, 1000), 2),
     random.choice(['North', 'South', 'East', 'West']))
    for i in range(5001, 7001)
]
```
</details>

<details>
<summary><b>Hint 2:</b> ZORDER syntax (click to expand)</summary>

```sql
OPTIMIZE main.default.bookings_lab
ZORDER BY (region)
```

You can ZORDER by multiple columns:
```sql
ZORDER BY (region, booking_date)
```
</details>

<details>
<summary><b>Hint 3:</b> When to use ZORDER (click to expand)</summary>

Use ZORDER on columns that are:
* Frequently used in WHERE clauses
* Used for joins
* High cardinality (many distinct values)
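* Not partition columns (partition pruning already covers those)

In our case, `region` has only four distinct values, so it is not a high-cardinality column; it is used here to keep the lab simple. On a real table, a frequently filtered column with many distinct values (for example `customer_id`) usually benefits more.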
</details>

## Task 6: Understand VACUUM 🧪

**The Problem:**

After OPTIMIZE, the old small files are still on disk! They're marked for deletion in the transaction log, but physically still exist.

**Why?**
* Time travel needs old files
* Concurrent readers might be using them
* Safety buffer before permanent deletion

**Your Challenge:**

Run VACUUM DRY RUN to see what would be deleted.

**Requirements:**

1. Use `VACUUM` with `DRY RUN` option
2. Set retention to 0 hours (for demo purposes only!)
3. Observe what files would be deleted

**Syntax:**
```sql
VACUUM table_name RETAIN 0 HOURS DRY RUN
```

**⚠️ Important:**
* DRY RUN shows what WOULD be deleted (doesn't actually delete)
* RETAIN 0 HOURS is only for demos - never use in production!
* Default retention is 7 days (168 hours)
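
The retention period translates into a cutoff timestamp: only unreferenced files older than the cutoff are candidates for deletion. A minimal sketch of that arithmetic, with a hypothetical helper name and a fixed clock so the result is reproducible:

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_HOURS = 168  # 7 days

def vacuum_cutoff(now, retention_hours=DEFAULT_RETENTION_HOURS):
    """Timestamp before which unreferenced files become eligible for deletion."""
    return now - timedelta(hours=retention_hours)

now = datetime(2024, 6, 15, 12, 0, tzinfo=timezone.utc)  # fixed clock for the example
print(vacuum_cutoff(now))     # 2024-06-08 12:00:00+00:00
print(vacuum_cutoff(now, 0))  # 2024-06-15 12:00:00+00:00
```

With 0 hours the cutoff equals the current time, so every unreferenced file qualifies immediately, which is why that setting is reserved for demos.
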

---

**Write your code in the cell below:**

In [0]:
%sql
-- TODO: Run VACUUM with DRY RUN to preview deletions
-- Use RETAIN 0 HOURS for this demo



### 💡 Hints for Task 6

<details>
<summary><b>Hint 1:</b> VACUUM DRY RUN syntax (click to expand)</summary>

```sql
VACUUM main.default.bookings_lab RETAIN 0 HOURS DRY RUN
```
</details>

<details>
<summary><b>Hint 2:</b> Disable retention check (click to expand)</summary>

For RETAIN 0 HOURS, you need to disable the safety check first:
```sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM main.default.bookings_lab RETAIN 0 HOURS DRY RUN
```
</details>

<details>
<summary><b>Hint 3:</b> Understanding output (click to expand)</summary>

DRY RUN shows:
* List of files that would be deleted
* These are the old small files from before OPTIMIZE
* No files are actually deleted (safe to run)
</details>

## Task 7: Run VACUUM (Actually Delete Files) 🗑️

**Your Challenge:**

Now run VACUUM for real to delete the old files.

**Requirements:**

1. Disable the retention check (required for 0 hours)
2. Run VACUUM with RETAIN 0 HOURS (without DRY RUN)
3. Observe the output (the command returns the table path once the files are removed; DRY RUN is what lists individual files)

**Commands needed:**
```sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM table_name RETAIN 0 HOURS
```

**⚠️ Warning:**
* This permanently deletes files!
* Time travel to versions that relied on the deleted files will fail after VACUUM
* In production, use 7+ days retention
* We use 0 hours only for demo purposes

---

**Write your code in the cell below:**

In [0]:
%sql
-- TODO: Disable retention check and run VACUUM
-- This will actually delete the old files



### 💡 Hints for Task 7

<details>
<summary><b>Hint 1:</b> Two commands needed (click to expand)</summary>

```sql
-- Command 1: Disable safety check
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Command 2: Run VACUUM
VACUUM main.default.bookings_lab RETAIN 0 HOURS
```
</details>

<details>
<summary><b>Hint 2:</b> What happens (click to expand)</summary>

VACUUM:
1. Finds files that the current table version no longer references
2. Keeps any of them that are newer than the retention period
3. Permanently deletes the rest from storage
4. Frees up disk space
5. Returns the table path (use DRY RUN to list the files)
</details>
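
The selection rule can be sketched as a small pure-Python function. This is a simplification for intuition: the function name, the `removed_at` mapping, and the file names are hypothetical, and the real command works from the transaction log and storage listing rather than from in-memory dictionaries.

```python
from datetime import datetime, timedelta

def files_to_vacuum(all_files, live_files, removed_at, now, retention_hours):
    """Return files that are unreferenced and were removed before the cutoff.

    all_files:  names present in storage
    live_files: names referenced by the current table version
    removed_at: maps each unreferenced file to the time its removal was committed
    """
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        name for name in all_files
        if name not in live_files and removed_at[name] <= cutoff
    )

now = datetime(2024, 6, 15, 12, 0)
all_files = {"old-1.parquet", "old-2.parquet", "compacted.parquet"}
live_files = {"compacted.parquet"}
removed_at = {
    "old-1.parquet": now - timedelta(days=10),  # removed long ago
    "old-2.parquet": now - timedelta(hours=1),  # removed by a recent OPTIMIZE
}

print(files_to_vacuum(all_files, live_files, removed_at, now, 168))  # ['old-1.parquet']
print(files_to_vacuum(all_files, live_files, removed_at, now, 0))    # ['old-1.parquet', 'old-2.parquet']
```

With the 7-day default, the file removed an hour ago survives, so a reader or a time travel query that still needs it keeps working.
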

<details>
<summary><b>Hint 3:</b> Production retention (click to expand)</summary>

In production, use:
```sql
VACUUM table_name RETAIN 168 HOURS  -- 7 days (default)
VACUUM table_name RETAIN 720 HOURS  -- 30 days
```

Balance:
* Longer retention = More time travel, more storage cost
* Shorter retention = Less time travel, less storage cost
</details>

## Task 8: Check Table History 📜

**Your Challenge:**

View the complete history of operations on your table.

**Requirements:**

1. Use `DESCRIBE HISTORY` to view all transactions
2. Look for these operations:
   * `CREATE OR REPLACE TABLE AS SELECT` (initial creation by the overwrite write)
   * `WRITE` (your append operations)
   * `OPTIMIZE` (file compaction)
   * `VACUUM START` / `VACUUM END` (file deletion)
3. Examine the `operationMetrics` to see:
   * How many files were added/removed
   * How much data was processed

**Questions to answer:**
* How many versions does your table have?
* Which operations created the most files?
* What did OPTIMIZE do (check `numAddedFiles`/`numRemovedFiles` in `operationMetrics`)?

---

**Write your code in the cell below:**

In [0]:
%sql
-- TODO: Use DESCRIBE HISTORY to view all operations



### 💡 Hints for Task 8

<details>
<summary><b>Hint 1:</b> DESCRIBE HISTORY syntax (click to expand)</summary>

```sql
DESCRIBE HISTORY main.default.bookings_lab
```

Limit to recent operations:
```sql
DESCRIBE HISTORY main.default.bookings_lab LIMIT 10
```
</details>

<details>
<summary><b>Hint 2:</b> Key columns to examine (click to expand)</summary>

Important columns:
* `version` - Transaction number
* `timestamp` - When it happened
* `operation` - Type of operation
* `operationMetrics` - Detailed metrics (files, rows, bytes)
* `userName` - Who did it
</details>

<details>
<summary><b>Hint 3:</b> Analyzing OPTIMIZE metrics (click to expand)</summary>

For OPTIMIZE operations, look at:
```
operationMetrics:
  numAddedFiles: 1 (new optimized file)
  numRemovedFiles: 5 (old small files)
  minFileSize: ...
  maxFileSize: ...
```
</details>
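
Once history rows are collected into Python (for example as dictionaries), the metrics can be totalled per operation. The sketch below runs on hand-written sample rows, not real output. The key names `numAddedFiles` and `numRemovedFiles` are assumptions about the history metrics; they can differ from the names in the OPTIMIZE command's own output, so check them against your runtime.

```python
def summarize_history(rows, operation="OPTIMIZE"):
    """Sum file counts across history rows for one operation type.

    Metric values are treated as strings, as in a string-to-string map,
    so they are converted to int before adding.
    """
    added = removed = 0
    for row in rows:
        if row["operation"] != operation:
            continue
        metrics = row["operationMetrics"]
        added += int(metrics.get("numAddedFiles", 0))
        removed += int(metrics.get("numRemovedFiles", 0))
    return {"files_added": added, "files_removed": removed}

# Hand-written sample rows (hypothetical), shaped like history output.
history = [
    {"version": 7, "operation": "OPTIMIZE",
     "operationMetrics": {"numAddedFiles": "1", "numRemovedFiles": "3"}},
    {"version": 6, "operation": "WRITE",
     "operationMetrics": {"numFiles": "2", "numOutputRows": "2000"}},
    {"version": 5, "operation": "OPTIMIZE",
     "operationMetrics": {"numAddedFiles": "1", "numRemovedFiles": "5"}},
]

print(summarize_history(history))  # {'files_added': 2, 'files_removed': 8}
```
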

## Task 9: Final Table Health Check 🎯

**Your Challenge:**

Perform a final health check on your optimized table.

**Requirements:**

1. Run `DESCRIBE DETAIL` one more time
2. Verify the table is in good shape:
   * `numFiles` - Should be 1 or very few
   * `sizeInBytes` - Total data size
   * Calculate average file size (should be larger now)
3. Query the table to ensure data is intact
4. Count total rows (should be 7000)

**Success criteria:**
* ✅ Fewer files than before
* ✅ Larger average file size
* ✅ All data still accessible
* ✅ No data loss

---

**Write your code in the cells below:**

In [0]:
%sql
-- TODO: Run DESCRIBE DETAIL to check final state



In [0]:
%sql
-- TODO: Query the table and count rows
-- Should have 7000 rows total



### 💡 Hints for Task 9

<details>
<summary><b>Hint 1:</b> Check table details (click to expand)</summary>

```sql
DESCRIBE DETAIL main.default.bookings_lab
```
</details>

<details>
<summary><b>Hint 2:</b> Verify row count (click to expand)</summary>

```sql
SELECT COUNT(*) AS total_rows
FROM main.default.bookings_lab
```

Should be 7000 (5 batches of 1000 + 1 batch of 2000)
</details>

<details>
<summary><b>Hint 3:</b> Sample the data (click to expand)</summary>

```sql
SELECT * 
FROM main.default.bookings_lab
ORDER BY booking_id
LIMIT 20
```

Verify:
* booking_id starts at 1 and increases sequentially (sort descending to confirm the maximum is 7000)
* All columns present
* Data looks correct
</details>

---

# 📝 Complete Solutions

**⚠️ Only look at these if you're stuck or want to verify your work!**

Try to solve the challenges yourself first. Learning happens through struggle and problem-solving!

---

## ✅ Solution: Task 1 (Create Table with Small Files)

<details>
<summary><b>Click to reveal solution</b></summary>

```python
import random
from datetime import datetime, timedelta
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Define schema
schema = StructType([
    StructField("booking_id", IntegerType()),
    StructField("customer_id", IntegerType()),
    StructField("booking_date", StringType()),
    StructField("amount", DoubleType()),
    StructField("region", StringType())
])

# Generate and write 5 batches
for batch_num in range(1, 6):
    start_id = (batch_num - 1) * 1000 + 1
    end_id = batch_num * 1000 + 1
    
    # Generate data
    data = [
        (i, 
         random.randint(1, 500),
         (datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
         round(random.uniform(50, 1000), 2),
         random.choice(['North', 'South', 'East', 'West']))
        for i in range(start_id, end_id)
    ]
    
    df = spark.createDataFrame(data, schema)
    
    # Write batch
    if batch_num == 1:
        df.coalesce(1).write.mode("overwrite").saveAsTable("main.default.bookings_lab")
    else:
        df.coalesce(1).write.mode("append").saveAsTable("main.default.bookings_lab")
    
    print(f"✅ Batch {batch_num} written: {len(data)} rows")

print("\n✅ Table created with 5000 rows in 5 small files")
```

**Key concepts:**
* Loop to create multiple batches
* `coalesce(1)` forces one file per batch
* First batch uses `overwrite`, rest use `append`
* This simulates real-world incremental writes

</details>

## ✅ Solution: Task 2 (Check Table Details)

<details>
<summary><b>Click to reveal solution</b></summary>

```sql
DESCRIBE DETAIL main.default.bookings_lab
```

**Calculate average file size:**
```sql
SELECT 
  numFiles,
  sizeInBytes,
  ROUND(sizeInBytes / numFiles / 1024 / 1024, 2) AS avg_file_size_mb
FROM (
  DESCRIBE DETAIL main.default.bookings_lab
)
```

**What you should see:**
* `numFiles`: 5
* `sizeInBytes`: small, well under 1 MB in total (the exact value depends on the random data)
* `avg_file_size_mb`: Very small (< 1 MB)

**Why this is bad:**
* Small files = many file operations
* Inefficient for Spark to process
* Slower queries
* More metadata overhead

</details>

## ✅ Solution: Task 3 (Run OPTIMIZE)

<details>
<summary><b>Click to reveal solution</b></summary>

```sql
OPTIMIZE main.default.bookings_lab
```

**Expected output:**
```
metrics:
  numFilesAdded: 1
  numFilesRemoved: 5
  totalFilesSkipped: 0
  totalTimeMs: ~1000-5000
```

**What happened:**
1. Delta Lake read all 5 small files
2. Combined them into 1 larger file
3. Wrote the new optimized file
4. Marked old files for deletion (in transaction log)
5. Old files still exist on disk (until VACUUM)

**Benefits:**
* Queries now read 1 file instead of 5
* Better compression
* Faster performance

</details>

## ✅ Solution: Task 4 (Verify Optimization)

<details>
<summary><b>Click to reveal solution</b></summary>

```sql
DESCRIBE DETAIL main.default.bookings_lab
```

**What changed:**

**Before OPTIMIZE:**
* numFiles: 5
* avg file size: roughly one fifth of the table size

**After OPTIMIZE:**
* numFiles: 1
* avg file size: roughly the whole table size, now in one file

**Verification:**
```sql
SELECT COUNT(*) AS total_rows
FROM main.default.bookings_lab
```
Should still be 5000 rows (no data lost!)

**Key insight:** OPTIMIZE doesn't change data, just reorganizes files for better performance.

</details>

## ✅ Solution: Task 5 (Add Data and Use ZORDER)

<details>
<summary><b>Click to reveal solution</b></summary>

**Part A: Add more data**
```python
# Generate 2000 more rows (reuses the imports and `schema` from the Task 1 solution)
data_batch6 = [
    (i, 
     random.randint(1, 500),
     (datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
     round(random.uniform(50, 1000), 2),
     random.choice(['North', 'South', 'East', 'West']))
    for i in range(5001, 7001)
]

df_batch6 = spark.createDataFrame(data_batch6, schema)
df_batch6.coalesce(2).write.mode("append").saveAsTable("main.default.bookings_lab")

print("✅ Added 2000 more rows in 2 files")
```

**Part B: OPTIMIZE with ZORDER**
```sql
OPTIMIZE main.default.bookings_lab
ZORDER BY (region)
```

**What ZORDER does:**
* Co-locates data with same region values
* Makes queries filtering by region much faster
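* Example: on a table with many files, `WHERE region = 'North'` can skip files whose statistics rule out 'North' (a table this small fits in one file, so there is nothing to skip here)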

**When to use ZORDER:**
* Columns frequently used in WHERE clauses
* High cardinality columns
* Columns used for joins

</details>

## ✅ Solution: Task 6 (VACUUM DRY RUN)

<details>
<summary><b>Click to reveal solution</b></summary>

```sql
-- Disable retention check (required for 0 hours)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Run VACUUM DRY RUN
VACUUM main.default.bookings_lab RETAIN 0 HOURS DRY RUN
```

**Expected output:**
* List of file paths that would be deleted
* These are the old files from before OPTIMIZE
* No files are actually deleted (DRY RUN is safe)

**Why DRY RUN is useful:**
* Preview what will be deleted
* Verify retention period is correct
* Check if important files would be removed
* Safe to run in production

**💡 Best practice:** Always run DRY RUN first before actual VACUUM!

</details>

## ✅ Solution: Task 7 (Run VACUUM)

<details>
<summary><b>Click to reveal solution</b></summary>

```sql
-- Disable retention check
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Run VACUUM (actually delete files)
VACUUM main.default.bookings_lab RETAIN 0 HOURS
```

**Expected output:**
* The table path rather than a list of files (the file list comes from DRY RUN)
* The files previewed in Task 6 are permanently removed from storage
* Storage space is freed

**What happened:**
1. VACUUM found files older than 0 hours
2. Verified they're not in current table version
3. Permanently deleted them from cloud storage
4. Freed up disk space

**⚠️ Important:**
* Time travel to versions that relied on the deleted files will now fail
* This is permanent - files cannot be recovered
* In production, use 7+ days retention

**Production example:**
```sql
-- Safe production VACUUM (7 days retention)
VACUUM main.default.bookings_lab RETAIN 168 HOURS
```

</details>

## ✅ Solution: Task 8 (Check Table History)

<details>
<summary><b>Click to reveal solution</b></summary>

```sql
DESCRIBE HISTORY main.default.bookings_lab
```

**What you should see:**

Multiple versions showing (numbers will be higher if you reran any cell):
1. Version 0: `CREATE OR REPLACE TABLE AS SELECT` (batch 1)
2. Versions 1-4: `WRITE` (append operations)
3. Version 5: `OPTIMIZE`
4. Version 6: `WRITE` (batch 6)
5. Version 7: `OPTIMIZE` (with ZORDER)
6. Versions 8-9: `VACUUM START` and `VACUUM END`

**Analyzing OPTIMIZE metrics:**
```sql
SELECT 
  version,
  operation,
  operationMetrics.numAddedFiles AS files_added,
  operationMetrics.numRemovedFiles AS files_removed
FROM (
  DESCRIBE HISTORY main.default.bookings_lab
)
WHERE operation = 'OPTIMIZE'
```

**Key insights:**
* Each operation creates a new version
* OPTIMIZE shows files added/removed
* Complete audit trail of all changes

</details>

## ✅ Solution: Task 9 (Final Table Health Check)

<details>
<summary><b>Click to reveal solution</b></summary>

**Check table details:**
```sql
DESCRIBE DETAIL main.default.bookings_lab
```

**Expected results:**
* `numFiles`: 1 (or very few)
* `sizeInBytes`: still small, far below the 128 MB target (the exact value depends on the random data)
* Much better than 5+ small files!

**Verify data integrity:**
```sql
SELECT COUNT(*) AS total_rows
FROM main.default.bookings_lab
```
Should be 7000 rows.

**Sample the data:**
```sql
SELECT * 
FROM main.default.bookings_lab
ORDER BY booking_id
LIMIT 20
```

**Success criteria:**
* ✅ Fewer files (1 vs 5+)
* ✅ All 7000 rows present
* ✅ Data intact and queryable
* ✅ Better performance for queries

**What we accomplished:**
1. Created table with small files (the problem)
2. Used OPTIMIZE to compact files (the solution)
3. Used ZORDER to cluster data (performance boost)
4. Used VACUUM to clean up old files (save storage)
5. Verified everything works (data integrity)

</details>

## 📚 Best Practices Summary

### **OPTIMIZE Best Practices**

✅ **Run regularly** - After many small writes (daily/weekly)  
✅ **Use ZORDER** - On frequently filtered columns  
✅ **Monitor file count** - Keep numFiles reasonable  
✅ **Schedule during off-peak** - Can be resource-intensive  
✅ **Check metrics** - Use DESCRIBE HISTORY to verify  

**When to OPTIMIZE:**
* After many incremental writes
* When queries are slow
* When numFiles is high (> 100)
* Before important queries/reports

**ZORDER columns:**
* High cardinality (many distinct values)
* Frequently in WHERE clauses
* Used for joins
* Not partition columns

---

### **VACUUM Best Practices**

✅ **Use appropriate retention** - 7 days minimum (168 hours)  
✅ **Always DRY RUN first** - Preview before deleting  
✅ **Consider time travel needs** - Longer retention for auditing  
✅ **Schedule regularly** - Weekly or monthly  
✅ **Monitor storage** - Track space savings  

**Retention guidelines:**
* **7 days (168 hours)** - Minimum, default
* **30 days (720 hours)** - Good for most use cases
* **90 days (2160 hours)** - Compliance/audit requirements
* **Never use 0 hours** - Only for demos!

**When to VACUUM:**
* After multiple OPTIMIZE operations
* When storage costs are high
* Regularly scheduled maintenance
* After large DELETE/UPDATE operations

---

### **Monitoring**

✅ **DESCRIBE DETAIL** - Check file count and sizes  
✅ **DESCRIBE HISTORY** - Track operations  
✅ **Set up alerts** - Monitor table health  
✅ **Track metrics** - Files, size, query performance  

---

### **Common Workflow**

```sql
-- 1. Check table health
DESCRIBE DETAIL table_name

-- 2. Optimize if needed (many small files)
OPTIMIZE table_name ZORDER BY (frequently_filtered_column)

-- 3. Preview vacuum
VACUUM table_name RETAIN 168 HOURS DRY RUN

-- 4. Run vacuum
VACUUM table_name RETAIN 168 HOURS

-- 5. Verify results
DESCRIBE DETAIL table_name
```
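
For a scheduled job, the same workflow can be generated from parameters. The helper below is a hedged sketch: `maintenance_statements` is a hypothetical name, it only builds SQL strings, and it refuses retention below 168 hours as a guard of its own. Table and column names are inserted as-is, so they must come from a trusted source.

```python
def maintenance_statements(table_name, zorder_columns=(), retention_hours=168):
    """Build the SQL statements for the workflow, in execution order."""
    if retention_hours < 168:
        raise ValueError("retention below 7 days (168 hours) is refused by this helper")
    optimize = f"OPTIMIZE {table_name}"
    if zorder_columns:
        optimize += f" ZORDER BY ({', '.join(zorder_columns)})"
    vacuum = f"VACUUM {table_name} RETAIN {retention_hours} HOURS"
    return [
        f"DESCRIBE DETAIL {table_name}",  # 1. check table health
        optimize,                         # 2. compact (and optionally cluster)
        f"{vacuum} DRY RUN",              # 3. preview vacuum
        vacuum,                           # 4. run vacuum
        f"DESCRIBE DETAIL {table_name}",  # 5. verify results
    ]

for statement in maintenance_statements("main.default.bookings_lab", ["region"]):
    print(statement)  # in a notebook, each string could be passed to spark.sql(...)
```
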

## 💡 Key Concepts Review

### **Small Files Problem**

**Causes:**
* Incremental writes (streaming, frequent inserts)
* Many small transactions
* Unoptimized data ingestion

**Impact:**
* Slow query performance
* More metadata overhead
* Inefficient resource usage

**Solution:** OPTIMIZE

---

### **OPTIMIZE**

**What it does:**
* Combines small files into larger files
* Improves compression
* Updates transaction log
* Marks old files for deletion

**Syntax:**
```sql
OPTIMIZE table_name
OPTIMIZE table_name ZORDER BY (column)
```

**When to run:**
* After many small writes
* When numFiles is high
* Before important queries
* Scheduled maintenance

---

### **ZORDER**

**What it does:**
* Co-locates related data
* Clusters data by column values
* Enables data skipping
* Faster filtered queries
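
Data skipping relies on per-file minimum and maximum values: a file is read only if the filter value could fall inside its range. The toy model below compares an unclustered layout with a sorted one. It is a sketch for intuition, with hypothetical function names and synthetic values, not Delta Lake's actual statistics format.

```python
def files_scanned(files, wanted):
    """Count files whose [min, max] range could contain the wanted value."""
    return sum(1 for low, high in files if low <= wanted <= high)

def min_max_per_file(values, rows_per_file):
    """Split values into files and keep only each file's min/max statistics."""
    chunks = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [(min(chunk), max(chunk)) for chunk in chunks]

# The values 1..100, written in a scattered (unclustered) order.
unclustered = [(i * 37) % 100 + 1 for i in range(100)]
clustered = sorted(unclustered)

print(files_scanned(min_max_per_file(unclustered, 25), wanted=42))  # 4
print(files_scanned(min_max_per_file(clustered, 25), wanted=42))    # 1
```

In the scattered layout every file's range spans the wanted value, so nothing can be skipped; after clustering, three of the four files are ruled out by their statistics alone.
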

**Best columns for ZORDER:**
* High cardinality
* Frequently filtered
* Used in joins
* Not partition columns

---

### **VACUUM**

**What it does:**
* Permanently deletes old files
* Frees storage space
* Removes files older than retention
* Cannot be undone

**Syntax:**
```sql
VACUUM table_name RETAIN n HOURS
VACUUM table_name RETAIN n HOURS DRY RUN
```

**Retention trade-offs:**
* **Longer retention:**
  - ✅ More time travel capability
  - ❌ Higher storage costs
* **Shorter retention:**
  - ✅ Lower storage costs
  - ❌ Less time travel capability

---

### **Transaction Log**

**Role in OPTIMIZE/VACUUM:**
* Tracks which files are current
* Marks old files for deletion
* Enables time travel
* VACUUM uses it to find deletable files

## 🎉 Lab Complete!

Congratulations! You've successfully completed the OPTIMIZE and VACUUM lab.

### **What You Accomplished:**

✅ Created a table with small files (simulated the problem)  
✅ Used DESCRIBE DETAIL to diagnose issues  
✅ Ran OPTIMIZE to compact files  
✅ Used ZORDER to cluster data  
✅ Ran VACUUM DRY RUN to preview deletions  
✅ Ran VACUUM to free storage  
✅ Monitored table health with DESCRIBE HISTORY  
✅ Verified data integrity  

---

### **Key Takeaways:**

1. **Small files hurt performance** - Many small files slow down queries
2. **OPTIMIZE is your friend** - Compact files regularly
3. **ZORDER boosts queries** - Cluster by frequently filtered columns
4. **VACUUM saves money** - Delete old files to reduce storage costs
5. **Balance retention** - Time travel vs storage costs
6. **Monitor regularly** - Use DESCRIBE DETAIL and DESCRIBE HISTORY

---

### **Production Checklist:**

☐ Schedule OPTIMIZE (daily/weekly)  
☐ Use ZORDER on key columns  
☐ Schedule VACUUM (weekly/monthly)  
☐ Set appropriate retention (7-30 days)  
☐ Monitor table health  
☐ Track storage costs  
☐ Document maintenance procedures  

---

### **Next Steps:**

* Apply OPTIMIZE/VACUUM to your production tables
* Set up scheduled jobs for maintenance
* Monitor query performance improvements
* Explore Delta Lake table properties
* Learn about liquid clustering (alternative to ZORDER)

---

### **Resources:**

* [OPTIMIZE Documentation](https://docs.databricks.com/sql/language-manual/delta-optimize.html)
* [VACUUM Documentation](https://docs.databricks.com/sql/language-manual/delta-vacuum.html)
* [Delta Lake Best Practices](https://docs.databricks.com/delta/best-practices.html)
* [File Management Guide](https://docs.databricks.com/delta/file-mgmt.html)

---

**You're now ready to maintain production Delta tables!** 🚀

*Happy optimizing!*