
This notebook is NOT for pipeline execution.
It is for learning to think the way the Databricks exam expects.

It answers questions like:

- Where should this logic go? (Bronze / Silver / Gold)
- Which API should I use? (DF vs SQL)
- Which operation is correct? (MERGE vs INSERT)
- What improves performance?


**Exam Decision Rules: Spark, Delta & Lakehouse**

This notebook summarizes decision-making rules
used in Databricks certification exams and
real-world Spark pipelines.

Focus:
- Bronze vs Silver vs Gold
- DataFrame vs SQL
- Delta operations
- Performance & correctness

Why this cell exists

> This is your revision notebook.
> Before the exam → read this once → confidence ↑


## Layer Responsibilities (Exam Critical)

### Bronze
- Raw ingestion
- Append-only
- No filters, joins, or aggregations
- Schema-on-read

### Silver
- Cleansing & standardization
- Deduplication
- Joins
- Business rules
- MERGE operations

### Gold
- Aggregations
- KPIs & metrics
- BI-ready tables
- Window functions for analytics

**Exam Rule**
- ❓ Where do you remove duplicates?
- ❌ Bronze
- ✅ Silver

## Delta Write Mode Rules

| Scenario | Correct Choice |
|-------|----------------|
| Raw ingestion | append |
| Incremental updates | MERGE |
| Full recompute | overwrite |
| Slowly changing dimension | MERGE |

**Exam Trap**
- ❓ Incremental customer updates arriving daily
- ❌ overwrite
- ✅ MERGE
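The table above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not a definitive implementation: `WRITE_CHOICE`, `write_batch`, and the table name in the usage note are hypothetical, and `df` is assumed to be a Spark DataFrame. The detail worth noticing is that `append` and `overwrite` are DataFrameWriter save modes, while MERGE is a separate Delta command and cannot be passed to `.mode()`.

```python
# Scenario -> choice, mirroring the table above (keys are lower-cased).
WRITE_CHOICE = {
    "raw ingestion": "append",
    "incremental updates": "merge",
    "full recompute": "overwrite",
    "slowly changing dimension": "merge",
}


def write_batch(df, target_table: str, scenario: str) -> str:
    """Write `df` to a Delta table using the save mode the scenario calls for.

    `df` is assumed to be a Spark DataFrame; `target_table` is a hypothetical name.
    """
    choice = WRITE_CHOICE[scenario.lower()]
    if choice == "merge":
        # MERGE is a Delta command, not a DataFrameWriter save mode,
        # so it has to go through MERGE INTO (or DeltaTable.merge) instead.
        raise ValueError(f"{scenario!r} needs MERGE INTO, not df.write.mode(...)")
    # "append" and "overwrite" are both valid DataFrameWriter save modes.
    df.write.format("delta").mode(choice).saveAsTable(target_table)
    return choice
```

In a notebook, a call such as `write_batch(raw_df, "main.bronze.events", "Raw ingestion")` would append the batch; the table name here is only an illustration.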

## Delta Operations Decision

MERGE:
- A record may or may not already exist in the target
- Upserts
- SCD Type 1 / Type 2

INSERT:
- Append-only data
- Logs, events

UPDATE:
- Rarely used alone
- Usually inside MERGE

**Exam Shortcut**

If the question says “existing records may change” → MERGE
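As a rough illustration of the shortcut, an upsert (SCD Type 1 style) might look like the sketch below. The helper names `build_upsert_sql` and `upsert`, the temp view name `updates`, and the table and key names in the usage note are assumptions made for this example; `spark` is the session a Databricks notebook provides.

```python
def build_upsert_sql(target: str, source_view: str, key: str) -> str:
    """Return a Delta MERGE statement that upserts `source_view` into `target`."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source_view} AS s\n"
        f"ON t.{key} = s.{key}\n"
        # Existing rows may change: overwrite them with the incoming values.
        "WHEN MATCHED THEN UPDATE SET *\n"
        # Rows that do not exist yet are inserted.
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, updates_df, target: str, key: str) -> None:
    """Register the incoming batch as a temp view and merge it into `target`."""
    updates_df.createOrReplaceTempView("updates")  # hypothetical view name
    spark.sql(build_upsert_sql(target, "updates", key))
```

A call such as `upsert(spark, daily_customers_df, "main.silver.customers", "customer_id")` would then apply one day's changes. Delta rejects a MERGE in which several source rows match the same target row, so the incoming batch is typically deduplicated on the key first.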

## API Selection Rules

### Use DataFrame API when:
- Writing pipelines
- Applying complex logic
- Using window functions programmatically

### Use Spark SQL when:
- Ad-hoc analysis
- BI queries
- Simple aggregations

**Exam Rule**
- ❓ Which is better for production pipelines?
- ✅ DataFrame API
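To make the comparison concrete, here is the same aggregation written both ways. It is a sketch with assumed names: the `events` view, the `subscriber_id` column, and both function names are hypothetical. Both versions are planned by the same Spark optimizer, so the choice is mostly about composability and testability rather than raw speed.

```python
# Spark SQL version: reads from a temp view named `events` (hypothetical name).
EVENTS_PER_SUBSCRIBER_SQL = """
SELECT subscriber_id, COUNT(*) AS event_count
FROM events
GROUP BY subscriber_id
"""


def events_per_subscriber_df(events_df):
    """DataFrame API version: a plain function that is easy to compose and test."""
    return (
        events_df.groupBy("subscriber_id")
        .count()  # adds a column named "count"
        .withColumnRenamed("count", "event_count")
    )


def events_per_subscriber_sql(spark, events_df):
    """Spark SQL version of the same aggregation."""
    events_df.createOrReplaceTempView("events")
    return spark.sql(EVENTS_PER_SUBSCRIBER_SQL)
```

The DataFrame function can be chained into a larger pipeline without registering a view, which is one reason the exam tends to favor it for production code.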

## When to Use Window Functions

Use window functions for:
- Ranking (ROW_NUMBER, RANK)
- Running totals
- Deduplication based on latest record
- Time-based comparisons

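Example (Dedup in Silver)

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Newest event first within each subscriber, so row 1 is the latest record.
w = Window.partitionBy("subscriber_id").orderBy(col("event_time").desc())

# `df` is the Silver input DataFrame; keep row 1 per key, then drop the helper column.
latest_df = (
    df.withColumn("rn", row_number().over(w))
      .filter(col("rn") == 1)
      .drop("rn")
)
```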

**Exam Pattern**

“Keep latest record per key” → Window + row_number()


## Performance Decision Rules

- Use partitioning on low-cardinality columns
- Use Z-ORDER on high-cardinality filter columns
- Cache only reused DataFrames
- Avoid SELECT *

**Exam Rule**

- ❓ Fast point lookups on subscriber_id
- ✅ OPTIMIZE the table with ZORDER BY (subscriber_id)
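The first two rules can be sketched as follows. This assumes a Delta table and uses hypothetical helper names (`build_zorder_sql`, `write_partitioned`); the table and column names in the usage note are illustrative only.

```python
def build_zorder_sql(table: str, *columns: str) -> str:
    """Return an OPTIMIZE statement that Z-orders `table` by `columns`."""
    if not columns:
        raise ValueError("ZORDER BY needs at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


def write_partitioned(df, table: str, partition_col: str) -> None:
    """Overwrite `table`, partitioned by a low-cardinality column such as a date."""
    (
        df.write.format("delta")
        .partitionBy(partition_col)
        .mode("overwrite")
        .saveAsTable(table)
    )
```

For the point-lookup question above, `spark.sql(build_zorder_sql("main.silver.events", "subscriber_id"))` would co-locate rows with nearby `subscriber_id` values, while a column like `event_date` would be the more natural partition column.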

## Schema Evolution

- Bronze → allow schema drift
- Silver → enforce schema
- Use mergeSchema cautiously

**Exam Rule**
- ❓ New column arrives in raw JSON
- ✅ Allow in Bronze
- ❌ Break Silver
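One way to picture these rules, as a sketch rather than a prescribed pattern: Bronze appends with `mergeSchema` so new columns are accepted, and Silver selects only the columns it expects so a new raw column cannot break it. The helper names (`schema_drift`, `append_to_bronze`, `to_silver`) and the column names in the usage note are assumptions for illustration.

```python
def schema_drift(incoming, expected):
    """Compare column names and report which are new and which are missing."""
    incoming_set, expected_set = set(incoming), set(expected)
    return {
        "new": sorted(incoming_set - expected_set),
        "missing": sorted(expected_set - incoming_set),
    }


def append_to_bronze(raw_df, bronze_table: str) -> None:
    """Bronze: append and let Delta add columns it has not seen before."""
    (
        raw_df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable(bronze_table)
    )


def to_silver(bronze_df, expected_columns):
    """Silver: fail on missing columns, ignore unexpected ones."""
    drift = schema_drift(bronze_df.columns, expected_columns)
    if drift["missing"]:
        raise ValueError(f"Bronze is missing columns: {drift['missing']}")
    # Selecting an explicit column list keeps the Silver schema stable.
    return bronze_df.select(*expected_columns)
```

For instance, `schema_drift(["id", "name", "promo_code"], ["id", "name"])` reports `promo_code` as new and nothing as missing, so Bronze would store the extra column while Silver would carry on unchanged.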

## Unity Catalog Rules

- Tables live in schemas
- Files live in volumes
- Use saveAsTable for managed tables
- Avoid hardcoding paths in Silver/Gold

**Exam Rule**
- ❓ Where should raw files live?
- ✅ Unity Catalog Volume
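A small sketch of how these rules might show up in code. Volume paths follow the `/Volumes/<catalog>/<schema>/<volume>/...` layout and tables use three-level names; the helper names and the `main`, `telecom`, and `landing` identifiers in the usage note are hypothetical.

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a Unity Catalog volume path for raw files."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


def table_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog table name."""
    return f"{catalog}.{schema}.{table}"


def ingest_raw_json(spark, source_path: str, target_table: str) -> None:
    """Read raw JSON files from a volume and append them to a managed table."""
    raw_df = spark.read.json(source_path)
    # saveAsTable with a table name (no path) creates or appends to a managed table.
    raw_df.write.format("delta").mode("append").saveAsTable(target_table)
```

A Bronze ingestion step could then call `ingest_raw_json(spark, volume_path("main", "telecom", "landing", "events"), table_name("main", "telecom", "bronze_events"))`, and Silver and Gold would refer to tables by name only, with no hardcoded paths.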

## 10-Second Exam Checklist

- Raw data? → Bronze
- Clean + dedup? → Silver
- KPI / aggregation? → Gold
- Existing rows may change? → MERGE
- Latest per key? → Window function
- Performance issue? → Partition or Z-ORDER


- You don’t execute this notebook daily
- You read it before the exam
- It converts confusion → confidence
- It mirrors how Databricks questions are framed
