# MERGE INTO - Upserts & Slowly Changing Dimensions Demo

Welcome! This demo will teach you how to update and insert data without creating duplicates.

---

## 🎯 The Duplicate Problem

**Common scenario:**
* You have a target table with customer data
* New data arrives with updates and new customers
* Need to: UPDATE existing customers, INSERT new ones

**Naive approach (DON'T DO THIS!):**
```sql
-- This creates DUPLICATES! ❌
INSERT INTO customers SELECT * FROM new_customers
```

**Problems:**
* ❌ Creates duplicate rows for existing customers
* ❌ Doesn't update changed data
* ❌ No way to track history
* ❌ Data quality issues

---

## ✅ The Solution: MERGE INTO

**MERGE INTO** (often called an upsert) combines INSERT, UPDATE, and optionally DELETE in one atomic operation:

* **WHEN MATCHED** → UPDATE existing rows
* **WHEN NOT MATCHED** → INSERT new rows
* **WHEN NOT MATCHED BY SOURCE** → DELETE rows (optional)

**Benefits:**
* ✅ No duplicates (as long as the source has at most one row per key)
* ✅ Atomic operation
* ✅ Efficient (one statement instead of separate UPDATE and INSERT jobs)
* ✅ Supports complex logic
* ✅ Enables SCD patterns

---

## 🎯 What You'll Learn

1. **Basic MERGE** - Syntax and operations
2. **SCD Type 1** - Overwrite changes (no history)
3. **SCD Type 2** - Track historical changes
4. **Hands-On Challenges** - Practice exercises
5. **Alternative Methods** - Other upsert patterns
6. **Best Practices** - Performance and optimization

**Let's get started!** 🚀

## 1. Basic MERGE Syntax 🔄

Let's start with the fundamentals of MERGE INTO.

**Basic structure:**
```sql
MERGE INTO target_table
USING source_table
ON merge_condition
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
WHEN NOT MATCHED BY SOURCE THEN DELETE
```

**Three clauses:**
1. **WHEN MATCHED** - Row exists in both target and source → UPDATE
2. **WHEN NOT MATCHED** - Row only in source → INSERT
3. **WHEN NOT MATCHED BY SOURCE** - Row only in target → DELETE (optional)
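
To make the three clauses concrete, here is a minimal plain-Python model of what a MERGE decides for each row. It is only a sketch of the semantics, not how Delta Lake executes the statement; the function name `merge_rows` and the `delete_unmatched` flag are made up for this illustration.

```python
def merge_rows(target, source, delete_unmatched=False):
    """Model MERGE semantics on dicts keyed by the join column.

    target, source: {key: row_dict}. Returns a new dict; inputs are not mutated.
    """
    result = {}
    for key, row in target.items():
        if key in source:
            # WHEN MATCHED: key exists on both sides -> UPDATE from source
            result[key] = {**row, **source[key]}
        elif not delete_unmatched:
            # Key only in target and no DELETE clause: row is left untouched
            result[key] = dict(row)
        # else: WHEN NOT MATCHED BY SOURCE -> DELETE (row is dropped)
    for key, row in source.items():
        if key not in target:
            # WHEN NOT MATCHED: key exists only in source -> INSERT
            result[key] = dict(row)
    return result


target = {
    1: {"name": "Alice", "city": "New York"},
    2: {"name": "Bob", "city": "Los Angeles"},
}
source = {
    2: {"name": "Bob", "city": "San Francisco"},  # existing key -> update
    3: {"name": "Frank", "city": "Seattle"},      # new key -> insert
}

upserted = merge_rows(target, source)
synced = merge_rows(target, source, delete_unmatched=True)
print(sorted(upserted))  # [1, 2, 3]
print(sorted(synced))    # [2, 3]
```

Without the delete clause, customer 1 survives because nothing in the source refers to it; with it, the target is synchronized to exactly the keys present in the source.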

In [0]:
%sql
-- Create our target table with initial customer data

CREATE OR REPLACE TABLE main.default.customers_target (
  customer_id INT,
  name STRING,
  email STRING,
  city STRING,
  total_purchases DOUBLE,
  last_updated TIMESTAMP
)
USING DELTA

In [0]:
%sql
-- Insert initial customer data

INSERT INTO main.default.customers_target VALUES
  (1, 'Alice Johnson', 'alice@example.com', 'New York', 1500.00, CURRENT_TIMESTAMP()),
  (2, 'Bob Smith', 'bob@example.com', 'Los Angeles', 2300.00, CURRENT_TIMESTAMP()),
  (3, 'Carol White', 'carol@example.com', 'Chicago', 1800.00, CURRENT_TIMESTAMP()),
  (4, 'David Brown', 'david@example.com', 'Houston', 950.00, CURRENT_TIMESTAMP()),
  (5, 'Eve Davis', 'eve@example.com', 'Phoenix', 1200.00, CURRENT_TIMESTAMP());

-- View initial data
SELECT * FROM main.default.customers_target ORDER BY customer_id

In [0]:
%sql
-- Create a source table with updates and new customers

CREATE OR REPLACE TEMP VIEW customers_updates AS
SELECT * FROM VALUES
  (2, 'Bob Smith', 'bob.smith@example.com', 'San Francisco', 2800.00),  -- Updated: email, city, purchases
  (3, 'Carol White', 'carol@example.com', 'Chicago', 2100.00),           -- Updated: purchases only
  (6, 'Frank Miller', 'frank@example.com', 'Seattle', 500.00),           -- New customer
  (7, 'Grace Lee', 'grace@example.com', 'Boston', 750.00)                -- New customer
AS updates(customer_id, name, email, city, total_purchases);

-- View the updates
SELECT * FROM customers_updates ORDER BY customer_id

In [0]:
%sql
-- Perform MERGE operation
-- This will UPDATE existing customers and INSERT new ones

MERGE INTO main.default.customers_target AS target
USING customers_updates AS source
ON target.customer_id = source.customer_id

-- When customer exists: UPDATE
WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.email = source.email,
    target.city = source.city,
    target.total_purchases = source.total_purchases,
    target.last_updated = CURRENT_TIMESTAMP()

-- When customer is new: INSERT
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, city, total_purchases, last_updated)
  VALUES (source.customer_id, source.name, source.email, source.city, source.total_purchases, CURRENT_TIMESTAMP())

In [0]:
%sql
-- Check the results
-- Should have 7 customers total:
-- - Customer 2 and 3 updated
-- - Customer 6 and 7 inserted
-- - Customer 1, 4, 5 unchanged

SELECT * FROM main.default.customers_target ORDER BY customer_id

### 🔍 MERGE with Conditional Logic

You can add conditions to each clause:

```sql
MERGE INTO target
USING source
ON target.id = source.id

-- Only update if source value is higher
WHEN MATCHED AND source.value > target.value THEN
  UPDATE SET target.value = source.value

-- Only insert if value meets threshold
WHEN NOT MATCHED AND source.value > 100 THEN
  INSERT (id, value) VALUES (source.id, source.value)
```

**Use cases:**
* Update only if data is newer
* Insert only if meets criteria
* Conditional deletes
* Business rule enforcement
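
When a MERGE has several conditional `WHEN MATCHED` clauses, they are evaluated in order and only the first clause whose condition holds is applied to a given row. The plain-Python sketch below models that behaviour; `conditional_merge`, its parameters, and the sample values are hypothetical names chosen for this illustration, not part of any Spark API.

```python
def conditional_merge(target, source, matched_clauses, insert_if=lambda s: True):
    """Model WHEN MATCHED AND <cond> / WHEN NOT MATCHED AND <cond>.

    matched_clauses: ordered list of (condition, action) pairs, where
    condition(t, s) -> bool and action(t, s) -> new row, or None to delete.
    """
    result = {k: dict(v) for k, v in target.items()}
    for key, s in source.items():
        if key in target:
            t = target[key]
            for condition, action in matched_clauses:
                if condition(t, s):
                    new_row = action(t, s)
                    if new_row is None:
                        del result[key]        # conditional DELETE
                    else:
                        result[key] = new_row  # conditional UPDATE
                    break  # later clauses are skipped for this row
        elif insert_if(s):
            result[key] = dict(s)              # conditional INSERT
    return result


target = {1: {"value": 50}, 2: {"value": 300}, 3: {"value": 10}}
source = {1: {"value": 80}, 2: {"value": 200}, 3: {"value": 0},
          4: {"value": 150}, 5: {"value": 20}}

clauses = [
    # Delete when the source marks the row with a zero value
    (lambda t, s: s["value"] == 0, lambda t, s: None),
    # Otherwise update only if the source value is higher
    (lambda t, s: s["value"] > t["value"], lambda t, s: {**t, **s}),
]

merged = conditional_merge(target, source, clauses,
                           insert_if=lambda s: s["value"] > 100)
print(merged)  # {1: {'value': 80}, 2: {'value': 300}, 4: {'value': 150}}
```

Row 2 matches but satisfies neither condition, so it is left unchanged; row 5 is new but fails the insert threshold, so it never reaches the target.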

In [0]:
%sql
-- MERGE can also DELETE rows
-- Let's create a scenario where we remove inactive customers

CREATE OR REPLACE TEMP VIEW active_customers AS
SELECT * FROM VALUES
  (1, 'Alice Johnson', 'alice@example.com', 'New York', 1500.00),
  (2, 'Bob Smith', 'bob.smith@example.com', 'San Francisco', 2800.00),
  (6, 'Frank Miller', 'frank@example.com', 'Seattle', 500.00)
AS active(customer_id, name, email, city, total_purchases);

-- MERGE with DELETE clause
MERGE INTO main.default.customers_target AS target
USING active_customers AS source
ON target.customer_id = source.customer_id

WHEN MATCHED THEN
  UPDATE SET target.last_updated = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, city, total_purchases, last_updated)
  VALUES (source.customer_id, source.name, source.email, source.city, source.total_purchases, CURRENT_TIMESTAMP())

WHEN NOT MATCHED BY SOURCE THEN
  DELETE;  -- Remove customers not in active list

-- View results - customers 3, 4, 5, 7 should be deleted
SELECT * FROM main.default.customers_target ORDER BY customer_id

## 2. SCD Type 1 - Overwrite Changes 🔄

**What is SCD Type 1?**

**Slowly Changing Dimension Type 1** overwrites old values with new values - **no history tracking**.

**Characteristics:**
* ❌ No historical data preserved
* ✅ Simple to implement
* ✅ Saves storage space
* ✅ Always shows current state

**Use cases:**
* Correcting data errors
* Updating non-critical attributes
* When history doesn't matter
* Master data management

**Example:**
```
Before: customer_id=1, city='New York'
Update: customer_id=1, city='Boston'
After:  customer_id=1, city='Boston'  (old value lost)
```

In [0]:
%sql
-- Create a fresh table for SCD Type 1 demo

CREATE OR REPLACE TABLE main.default.customers_scd1 (
  customer_id INT,
  name STRING,
  email STRING,
  city STRING,
  status STRING,
  total_purchases DOUBLE,
  last_updated TIMESTAMP
)
USING DELTA;

-- Insert initial data
INSERT INTO main.default.customers_scd1 VALUES
  (101, 'John Doe', 'john@example.com', 'New York', 'active', 5000.00, CURRENT_TIMESTAMP()),
  (102, 'Jane Smith', 'jane@example.com', 'Chicago', 'active', 3000.00, CURRENT_TIMESTAMP()),
  (103, 'Mike Johnson', 'mike@example.com', 'Boston', 'active', 2000.00, CURRENT_TIMESTAMP());

SELECT * FROM main.default.customers_scd1 ORDER BY customer_id

In [0]:
%sql
-- New data arrives with changes

CREATE OR REPLACE TEMP VIEW customers_scd1_updates AS
SELECT * FROM VALUES
  (101, 'John Doe', 'john.doe@example.com', 'Los Angeles', 'active', 5500.00),  -- Moved cities, new email
  (102, 'Jane Smith', 'jane@example.com', 'Chicago', 'inactive', 3000.00),      -- Status changed
  (104, 'Sarah Williams', 'sarah@example.com', 'Miami', 'active', 1500.00)      -- New customer
AS updates(customer_id, name, email, city, status, total_purchases);

SELECT * FROM customers_scd1_updates ORDER BY customer_id

In [0]:
%sql
-- SCD Type 1: Simply overwrite with new values

MERGE INTO main.default.customers_scd1 AS target
USING customers_scd1_updates AS source
ON target.customer_id = source.customer_id

WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.email = source.email,
    target.city = source.city,
    target.status = source.status,
    target.total_purchases = source.total_purchases,
    target.last_updated = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, city, status, total_purchases, last_updated)
  VALUES (source.customer_id, source.name, source.email, source.city, source.status, source.total_purchases, CURRENT_TIMESTAMP())

In [0]:
%sql
-- View the results
-- Notice: Old values are GONE (John's old city, Jane's old status)

SELECT 
  customer_id,
  name,
  city,
  status,
  total_purchases,
  last_updated
FROM main.default.customers_scd1
ORDER BY customer_id

### 📊 SCD Type 1 Characteristics

**What happened:**
* Customer 101: City changed from 'New York' → 'Los Angeles' (old value lost)
* Customer 102: Status changed from 'active' → 'inactive' (old value lost)
* Customer 104: New customer inserted
* Customer 103: Unchanged (not in source)

**Pros:**
* ✅ Simple to implement
* ✅ Easy to understand
* ✅ Saves storage (no history)
* ✅ Always current data
* ✅ Fast queries (no filtering needed)

**Cons:**
* ❌ No historical data
* ❌ Can't audit changes
* ❌ Can't answer "what was the value on date X?"
* ❌ Can't undo mistakes

**When to use:**
* Correcting errors (typos, wrong data)
* Non-critical attributes (phone numbers, addresses)
* When storage is limited
* When history has no business value

## 3. SCD Type 2 - Track Historical Changes 📊

**What is SCD Type 2?**

**Slowly Changing Dimension Type 2** preserves historical data by creating **new rows** for changes.

**Characteristics:**
* ✅ Full history preserved
* ✅ Can query historical state
* ✅ Audit trail maintained
* ❌ More complex to implement
* ❌ Uses more storage

**Key columns:**
* **Surrogate key** - Unique identifier for each version
* **Business key** - Natural identifier (customer_id)
* **Effective dates** - When this version was valid
* **Current flag** - Is this the current version?

**Example:**
```
Before:
id | customer_id | city      | effective_from | effective_to | is_current
1  | 101         | New York  | 2024-01-01     | 9999-12-31   | true

After update:
id | customer_id | city      | effective_from | effective_to | is_current
1  | 101         | New York  | 2024-01-01     | 2024-06-15   | false  ← Closed
2  | 101         | Boston    | 2024-06-15     | 9999-12-31   | true   ← New version
```
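
The close-then-insert logic behind that before/after picture can be sketched in plain Python. This is an illustrative model only: `apply_scd2`, `OPEN_END`, and the `tracked` parameter are names invented here, and the function mutates the `history` list in place.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel meaning "still current"


def apply_scd2(history, updates, tracked, today):
    """Apply an SCD Type 2 change set to a list of version rows (in place)."""
    current = {r["customer_id"]: r for r in history if r["is_current"]}
    for new in updates:
        old = current.get(new["customer_id"])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # nothing changed: keep the current version as-is
        if old is not None:
            # Step 1: close the old version
            old["effective_to"] = today
            old["is_current"] = False
        # Step 2: insert a new version (changed key or brand-new key)
        next_id = max((r["id"] for r in history), default=0) + 1
        history.append({
            "id": next_id,  # surrogate key
            **new,
            "effective_from": today,
            "effective_to": OPEN_END,
            "is_current": True,
        })
    return history


history = [{
    "id": 1, "customer_id": 101, "city": "New York",
    "effective_from": date(2024, 1, 1), "effective_to": OPEN_END,
    "is_current": True,
}]
updates = [{"customer_id": 101, "city": "Boston"}]

apply_scd2(history, updates, tracked=["city"], today=date(2024, 6, 15))
for row in history:
    print(row["id"], row["city"], row["effective_to"], row["is_current"])
# 1 New York 2024-06-15 False
# 2 Boston 9999-12-31 True
```

Applying the same update a second time adds nothing, because the incoming values already equal the current version.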

In [0]:
%sql
-- Create table with SCD Type 2 columns

CREATE OR REPLACE TABLE main.default.customers_scd2 (
  surrogate_key BIGINT GENERATED ALWAYS AS IDENTITY,  -- Auto-incrementing surrogate key
  customer_id INT,                                     -- Business key
  name STRING,
  email STRING,
  city STRING,
  status STRING,
  total_purchases DOUBLE,
  effective_from DATE,                                 -- When this version became effective
  effective_to DATE,                                   -- When this version ended
  is_current BOOLEAN                                   -- Is this the current version?
)
USING DELTA

In [0]:
%sql
-- Insert initial customer data
-- All rows are current (is_current = true, effective_to = 9999-12-31)

INSERT INTO main.default.customers_scd2 
  (customer_id, name, email, city, status, total_purchases, effective_from, effective_to, is_current)
VALUES
  (101, 'John Doe', 'john@example.com', 'New York', 'active', 5000.00, CURRENT_DATE(), '9999-12-31', true),
  (102, 'Jane Smith', 'jane@example.com', 'Chicago', 'active', 3000.00, CURRENT_DATE(), '9999-12-31', true),
  (103, 'Mike Johnson', 'mike@example.com', 'Boston', 'active', 2000.00, CURRENT_DATE(), '9999-12-31', true);

SELECT * FROM main.default.customers_scd2 ORDER BY customer_id, effective_from

In [0]:
%sql
-- New data with changes

CREATE OR REPLACE TEMP VIEW customers_scd2_updates AS
SELECT * FROM VALUES
  (101, 'John Doe', 'john.doe@example.com', 'Los Angeles', 'active', 5800.00),  -- City changed
  (102, 'Jane Smith', 'jane@example.com', 'Chicago', 'inactive', 3000.00),      -- Status changed
  (104, 'Emily Davis', 'emily@example.com', 'Seattle', 'active', 1000.00)       -- New customer
AS updates(customer_id, name, email, city, status, total_purchases);

SELECT * FROM customers_scd2_updates ORDER BY customer_id

In [0]:
%sql
-- SCD Type 2 requires TWO steps:
-- Step 1: Close out (expire) the old current records that changed

MERGE INTO main.default.customers_scd2 AS target
USING customers_scd2_updates AS source
ON target.customer_id = source.customer_id 
   AND target.is_current = true

-- When matched AND data changed: Close the old record
WHEN MATCHED AND (
  target.email != source.email OR
  target.city != source.city OR
  target.status != source.status OR
  target.total_purchases != source.total_purchases
) THEN
  UPDATE SET
    target.effective_to = CURRENT_DATE(),
    target.is_current = false

In [0]:
%sql
-- Check the table - old records should be closed (is_current = false)
SELECT * FROM main.default.customers_scd2 ORDER BY customer_id, effective_from

In [0]:
%sql
-- Step 2: Insert new versions for changed records AND new customers
-- After Step 1, changed customers no longer have a current row, so they fall
-- into WHEN NOT MATCHED together with brand-new customers.

MERGE INTO main.default.customers_scd2 AS target
USING customers_scd2_updates AS source
ON target.customer_id = source.customer_id AND target.is_current = true

-- When NOT matched: Insert new version (either new customer or changed customer)
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, city, status, total_purchases, effective_from, effective_to, is_current)
  VALUES (source.customer_id, source.name, source.email, source.city, source.status, source.total_purchases, 
          CURRENT_DATE(), '9999-12-31', true)

In [0]:
%sql
-- View the complete history
-- Notice: We have multiple versions for customers 101 and 102!

SELECT 
  surrogate_key,
  customer_id,
  name,
  city,
  status,
  effective_from,
  effective_to,
  is_current
FROM main.default.customers_scd2
ORDER BY customer_id, effective_from

In [0]:
%sql
-- Get only current records (typical query pattern)

SELECT 
  customer_id,
  name,
  city,
  status,
  total_purchases
FROM main.default.customers_scd2
WHERE is_current = true
ORDER BY customer_id

In [0]:
%sql
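-- Query: What was customer 101's city on a specific date?
-- This is the power of SCD Type 2!
-- Validity is treated as the half-open interval [effective_from, effective_to),
-- so a row closed on a given date does not overlap the version that replaced it.
-- CURRENT_DATE() is used here; substitute any date, e.g. DATE'2024-06-15'.

SELECT 
  customer_id,
  name,
  city,
  effective_from,
  effective_to
FROM main.default.customers_scd2
WHERE customer_id = 101
  AND effective_from <= CURRENT_DATE()
  AND effective_to > CURRENT_DATE()
ORDER BY effective_from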
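
Because a closed version and its replacement share the same boundary date, a point-in-time lookup is safest when validity is read as the half-open interval from `effective_from` (inclusive) to `effective_to` (exclusive). The sketch below shows this in plain Python; `version_as_of` and the sample rows are hypothetical and only mirror the example layout used earlier.

```python
from datetime import date


def version_as_of(history, customer_id, as_of):
    """Return the version valid on `as_of`, treating validity as [from, to)."""
    for row in history:
        if (row["customer_id"] == customer_id
                and row["effective_from"] <= as_of < row["effective_to"]):
            return row
    return None  # no version covers that date


history = [
    {"customer_id": 101, "city": "New York",
     "effective_from": date(2024, 1, 1), "effective_to": date(2024, 6, 15)},
    {"customer_id": 101, "city": "Boston",
     "effective_from": date(2024, 6, 15), "effective_to": date(9999, 12, 31)},
]

print(version_as_of(history, 101, date(2024, 3, 1))["city"])   # New York
print(version_as_of(history, 101, date(2024, 6, 15))["city"])  # Boston
```

With a closed interval on both ends, the lookup for 2024-06-15 would match both versions; the exclusive upper bound guarantees at most one.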

### 📊 SCD Type 2 Characteristics

**What happened:**
* Customer 101: Two rows (New York → Los Angeles)
* Customer 102: Two rows (active → inactive)
* Customer 103: One row (unchanged)
* Customer 104: One row (new customer)

**Pros:**
* ✅ Complete history preserved
* ✅ Can query historical state
* ✅ Full audit trail
* ✅ Can analyze trends over time
* ✅ Supports compliance requirements

**Cons:**
* ❌ More complex to implement
* ❌ Uses more storage
* ❌ Queries need is_current filter
* ❌ Requires surrogate keys
* ❌ Two-step MERGE process

**When to use:**
* Regulatory compliance (audit requirements)
* Historical analysis (trend analysis)
* Critical attributes (pricing, status)
* When you need to answer "what was the value on date X?"
* Data warehousing scenarios

## 4. Hands-On Challenges 🎯

Time to practice! Complete these challenges to master MERGE operations.

**Instructions:**
* Read each challenge carefully
* Write your code in the empty cell provided
* Run your code to test it
* Use the hints if you get stuck
* **Solutions are at the end of the notebook** - scroll to the "Challenge Solutions" section

### 💪 Challenge 1: Basic MERGE (Easy)

**Scenario:**
You have a products table and need to update prices and add new products.

**Setup:**
```sql
-- Target table (already created below)
product_id | product_name | price | stock
1          | Laptop       | 999   | 50
2          | Mouse        | 29    | 200
3          | Keyboard     | 79    | 150

-- New data (already created below)
product_id | product_name | price | stock
2          | Mouse        | 25    | 250   -- Price reduced, stock increased
3          | Keyboard     | 79    | 100   -- Stock decreased
4          | Monitor      | 299   | 75    -- New product
```

**Your task:**
Write a MERGE statement that:
1. UPDATES existing products (id 2 and 3)
2. INSERTS new products (id 4)
3. Updates the `last_updated` timestamp

**Hints:**
* Use `ON target.product_id = source.product_id` for the join condition
* You need both WHEN MATCHED and WHEN NOT MATCHED clauses
* Use CURRENT_TIMESTAMP() for the last_updated field

**Write your code in the cell below!**

*Solution available at the end of the notebook*

In [0]:
%sql
-- Setup for Challenge 1 (run this first)

CREATE OR REPLACE TABLE main.default.products_challenge (
  product_id INT,
  product_name STRING,
  price DOUBLE,
  stock INT,
  last_updated TIMESTAMP
);

INSERT INTO main.default.products_challenge VALUES
  (1, 'Laptop', 999.00, 50, CURRENT_TIMESTAMP()),
  (2, 'Mouse', 29.00, 200, CURRENT_TIMESTAMP()),
  (3, 'Keyboard', 79.00, 150, CURRENT_TIMESTAMP());

CREATE OR REPLACE TEMP VIEW products_updates AS
SELECT * FROM VALUES
  (2, 'Mouse', 25.00, 250),
  (3, 'Keyboard', 79.00, 100),
  (4, 'Monitor', 299.00, 75)
AS updates(product_id, product_name, price, stock);

SELECT 'Target table:' AS table_type, * FROM main.default.products_challenge
UNION ALL
SELECT 'Source updates:' AS table_type, *, CAST(NULL AS TIMESTAMP) FROM products_updates
ORDER BY table_type DESC, product_id

In [0]:
%sql
-- YOUR CODE HERE
-- Write your MERGE statement below



### 💪 Challenge 2: SCD Type 1 with Conditions (Medium)

**Scenario:**
You're managing employee data. You need to:
* UPDATE employees if their salary INCREASED
* INSERT new employees
* Do NOT update if salary decreased (data quality check)

**Setup:**
```sql
-- Target table
emp_id | name          | department | salary
201    | Alice Brown   | Sales      | 75000
202    | Bob Wilson    | IT         | 85000
203    | Carol Davis   | Marketing  | 70000

-- New data
emp_id | name          | department | salary
201    | Alice Brown   | Sales      | 80000   -- Salary increased ✅
202    | Bob Wilson    | IT         | 80000   -- Salary decreased ❌ (don't update)
204    | David Lee     | IT         | 90000   -- New employee
```

**Your task:**
Write a MERGE with a condition that only updates when `source.salary > target.salary`

**Hints:**
* Add a condition to the WHEN MATCHED clause: `WHEN MATCHED AND condition THEN`
* The condition should compare source.salary with target.salary
* Bob's salary should NOT be updated (85000 stays, not 80000)

**Write your code in the cell below!**

*Solution available at the end of the notebook*

In [0]:
%sql
-- Setup for Challenge 2

CREATE OR REPLACE TABLE main.default.employees_challenge (
  emp_id INT,
  name STRING,
  department STRING,
  salary DOUBLE,
  last_updated TIMESTAMP
);

INSERT INTO main.default.employees_challenge VALUES
  (201, 'Alice Brown', 'Sales', 75000.00, CURRENT_TIMESTAMP()),
  (202, 'Bob Wilson', 'IT', 85000.00, CURRENT_TIMESTAMP()),
  (203, 'Carol Davis', 'Marketing', 70000.00, CURRENT_TIMESTAMP());

CREATE OR REPLACE TEMP VIEW employees_updates AS
SELECT * FROM VALUES
  (201, 'Alice Brown', 'Sales', 80000.00),
  (202, 'Bob Wilson', 'IT', 80000.00),
  (204, 'David Lee', 'IT', 90000.00)
AS updates(emp_id, name, department, salary);

SELECT 'Target:' AS type, * FROM main.default.employees_challenge
UNION ALL
SELECT 'Updates:' AS type, *, CAST(NULL AS TIMESTAMP) FROM employees_updates
ORDER BY type DESC, emp_id

In [0]:
%sql
-- YOUR CODE HERE
-- Write your MERGE statement with conditional UPDATE



### 💪 Challenge 3: Implement SCD Type 2 (Hard)

**Scenario:**
You're tracking product prices over time for a pricing history table.

**Setup:**
```sql
-- Target table (SCD Type 2 structure)
product_id | product_name | price | effective_from | effective_to | is_current
301        | Widget A     | 100   | 2024-01-01     | 9999-12-31   | true
302        | Widget B     | 150   | 2024-01-01     | 9999-12-31   | true

-- New prices
product_id | product_name | price
301        | Widget A     | 120   -- Price increased
303        | Widget C     | 200   -- New product
```

**Your task:**
Implement SCD Type 2 in TWO steps:

**Step 1:** Close old records (set effective_to = CURRENT_DATE, is_current = false)  
**Step 2:** Insert new versions (set effective_from = CURRENT_DATE, is_current = true)

**Hints:**
* **Step 1:** Join on `product_id` AND `is_current = true`, add condition to check if price changed
* **Step 2:** Use WHEN NOT MATCHED to insert new versions (works for both changed and new products)
* Remember: After Step 1, changed products are no longer "current", so Step 2 will insert them

**Write your code in the cells below!**

*Solution available at the end of the notebook*

In [0]:
%sql
-- Setup for Challenge 3

CREATE OR REPLACE TABLE main.default.products_scd2_challenge (
  surrogate_key BIGINT GENERATED ALWAYS AS IDENTITY,
  product_id INT,
  product_name STRING,
  price DOUBLE,
  effective_from DATE,
  effective_to DATE,
  is_current BOOLEAN
);

INSERT INTO main.default.products_scd2_challenge 
  (product_id, product_name, price, effective_from, effective_to, is_current)
VALUES
  (301, 'Widget A', 100.00, '2024-01-01', '9999-12-31', true),
  (302, 'Widget B', 150.00, '2024-01-01', '9999-12-31', true);

CREATE OR REPLACE TEMP VIEW products_new_prices AS
SELECT * FROM VALUES
  (301, 'Widget A', 120.00),
  (303, 'Widget C', 200.00)
AS updates(product_id, product_name, price);

SELECT 'Target:' AS type, product_id, product_name, price, effective_from, effective_to, is_current 
FROM main.default.products_scd2_challenge
UNION ALL
SELECT 'Updates:' AS type, product_id, product_name, price, CAST(NULL AS DATE), CAST(NULL AS DATE), CAST(NULL AS BOOLEAN)
FROM products_new_prices
ORDER BY type DESC, product_id

In [0]:
%sql
-- YOUR CODE HERE - Step 1
-- Close old records (set effective_to and is_current)



In [0]:
%sql
-- YOUR CODE HERE - Step 2
-- Insert new versions



## 5. Alternative Methods to Avoid Duplicates 🛠️

Besides MERGE, there are other ways to update data without duplicates.

### 1️⃣ INSERT OVERWRITE

**What it does:**
Replaces **all data** in the table (or partition) with new data.

**Syntax:**
```sql
INSERT OVERWRITE TABLE target_table
SELECT * FROM source_table
```

**Characteristics:**
* ✅ No duplicates (replaces everything)
* ✅ Simple syntax
* ❌ Loses data not in source
* ❌ Not incremental
* ❌ Dangerous if source is incomplete

**When to use:**
* Full refresh scenarios
* Rebuilding dimension tables
* When you have complete source data
* Partitioned tables (overwrite specific partitions)

**⚠️ Warning:** Use carefully - it deletes existing data!

In [0]:
%sql
-- Create a simple table
CREATE OR REPLACE TABLE main.default.daily_summary (
  report_date DATE,
  total_sales DOUBLE,
  order_count INT
);

INSERT INTO main.default.daily_summary VALUES
  ('2024-01-01', 10000.00, 50),
  ('2024-01-02', 12000.00, 60);

SELECT 'Before:' AS status, * FROM main.default.daily_summary
UNION ALL
SELECT 'After:' AS status, * FROM (
  VALUES 
    ('2024-01-02', 12500.00, 65),  -- Updated
    ('2024-01-03', 11000.00, 55)   -- New
  AS new_data(report_date, total_sales, order_count)
)
ORDER BY status, report_date

In [0]:
%sql
-- INSERT OVERWRITE replaces ALL data

INSERT OVERWRITE TABLE main.default.daily_summary
VALUES
  ('2024-01-02', 12500.00, 65),
  ('2024-01-03', 11000.00, 55);

-- Notice: 2024-01-01 is GONE!
SELECT * FROM main.default.daily_summary ORDER BY report_date

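### 2️⃣ replaceWhere (Selective Overwrite)

**What it does:**
Atomically replaces only the rows that match a predicate (typically one partition or date range) and leaves the rest of the table untouched. It is a different feature from dynamic partition overwrite, which infers the partitions to replace from the incoming data.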

**Syntax (Python):**
```python
df.write.mode("overwrite") \
  .option("replaceWhere", "date = '2024-01-15'") \
  .saveAsTable("table")
```

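**Characteristics:**
* ✅ Overwrites only matching data
* ✅ Preserves other data
* ✅ Efficient for partitioned tables
* ❌ The DataFrame option shown here is Python/Scala; recent Databricks runtimes also offer a SQL form (`INSERT INTO ... REPLACE WHERE`)
* ❌ New rows must satisfy the predicate, and the predicate prunes best on a partition column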

**When to use:**
* Reprocessing specific partitions
* Daily/monthly data refreshes
* Correcting data for specific time periods
* Partitioned tables
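
The effect of a selective overwrite can be modelled in a few lines of plain Python: rows matching the predicate are dropped, the new rows take their place, and new rows that fall outside the predicate are rejected. That last check mirrors the validation Delta Lake is documented to perform by default. `replace_where` and the sample rows are hypothetical names for this sketch, not a Spark API.

```python
def replace_where(rows, new_rows, predicate):
    """Model a selective overwrite: replace the rows matching `predicate`."""
    outside = [r for r in new_rows if not predicate(r)]
    if outside:
        # New data must fall inside the region being replaced
        raise ValueError(f"{len(outside)} new row(s) do not match the predicate")
    kept = [r for r in rows if not predicate(r)]  # untouched rows
    return kept + list(new_rows)


table = [
    {"sale_id": 1, "amount": 100.0, "sale_date": "2024-01-01"},
    {"sale_id": 2, "amount": 150.0, "sale_date": "2024-01-01"},
    {"sale_id": 3, "amount": 200.0, "sale_date": "2024-01-02"},
    {"sale_id": 4, "amount": 250.0, "sale_date": "2024-01-02"},
]
new_rows = [
    {"sale_id": 5, "amount": 300.0, "sale_date": "2024-01-02"},
    {"sale_id": 6, "amount": 350.0, "sale_date": "2024-01-02"},
]


def is_jan_2(row):
    return row["sale_date"] == "2024-01-02"


result = replace_where(table, new_rows, is_jan_2)
print([r["sale_id"] for r in result])  # [1, 2, 5, 6]
```

Rerunning the same call on `result` gives the same table again, which is what makes this pattern safe for reprocessing a single day.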

In [0]:
# Create sample data for the replaceWhere demo

# Create the Delta table (left unpartitioned to keep the demo small)
spark.sql("""
  CREATE OR REPLACE TABLE main.default.sales_by_date (
    sale_id INT,
    amount DOUBLE,
    sale_date DATE
  )
  USING DELTA
""")

# Insert initial data
spark.sql("""
  INSERT INTO main.default.sales_by_date VALUES
    (1, 100.00, '2024-01-01'),
    (2, 150.00, '2024-01-01'),
    (3, 200.00, '2024-01-02'),
    (4, 250.00, '2024-01-02')
""")

print("Before replaceWhere:")
display(spark.table("main.default.sales_by_date").orderBy("sale_date", "sale_id"))

In [0]:
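# Replace only data for 2024-01-02 (leave 2024-01-01 untouched)
from datetime import date
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, DateType

# Explicit schema so the DataFrame types match the table (INT, DOUBLE, DATE);
# inferred types would be bigint and string, which Delta schema enforcement may reject
schema = StructType([
    StructField("sale_id", IntegerType()),
    StructField("amount", DoubleType()),
    StructField("sale_date", DateType()),
])

new_data = spark.createDataFrame([
    (5, 300.00, date(2024, 1, 2)),
    (6, 350.00, date(2024, 1, 2))
], schema)

new_data.write.mode("overwrite") \
  .option("replaceWhere", "sale_date = '2024-01-02'") \
  .saveAsTable("main.default.sales_by_date")

print("After replaceWhere:")
print("Notice: 2024-01-01 data preserved, 2024-01-02 replaced")
display(spark.table("main.default.sales_by_date").orderBy("sale_date", "sale_id"))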

### 3️⃣ Deduplication Pattern

**What it does:**
Remove duplicates using window functions before inserting.

**Pattern:**
```sql
-- Deduplicate source data first
WITH deduped AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM source_data
)
SELECT * FROM deduped WHERE rn = 1
```

Then use MERGE or INSERT OVERWRITE.

**When to use:**
* Source data has duplicates
* Need to pick "best" record (latest, highest priority)
* Data quality issues in source

**Characteristics:**
* ✅ Handles duplicate source data
* ✅ Flexible (choose which duplicate to keep)
* ✅ Works with any insert method
* ❌ Extra processing step
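
The "keep the latest row per key" rule is the same thing `ROW_NUMBER() ... WHERE rn = 1` expresses. A minimal plain-Python equivalent, with `keep_latest` as a hypothetical helper name and sample rows shaped like an orders feed, looks like this:

```python
def keep_latest(rows, key, order_by):
    """Keep one row per `key`: the one with the greatest `order_by` value."""
    latest = {}
    for row in rows:
        k = row[key]
        # Strict '>' keeps the first row seen when the ordering values tie
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


orders = [
    {"order_id": 1001, "amount": 500.0, "ts": "2024-01-15 10:00:00"},
    {"order_id": 1001, "amount": 525.0, "ts": "2024-01-15 11:00:00"},  # later version
    {"order_id": 1002, "amount": 300.0, "ts": "2024-01-15 10:30:00"},
    {"order_id": 1002, "amount": 300.0, "ts": "2024-01-15 10:30:00"},  # exact duplicate
    {"order_id": 1003, "amount": 750.0, "ts": "2024-01-15 12:00:00"},
]

deduped = keep_latest(orders, key="order_id", order_by="ts")
print([(r["order_id"], r["amount"]) for r in deduped])
# [(1001, 525.0), (1002, 300.0), (1003, 750.0)]
```

The timestamps compare correctly as strings only because they share one fixed `YYYY-MM-DD HH:MM:SS` layout; with mixed formats they would need parsing first.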

In [0]:
%sql
-- Source data with duplicates
CREATE OR REPLACE TEMP VIEW orders_with_dupes AS
SELECT * FROM VALUES
  (1001, 'Order A', 500.00, '2024-01-15 10:00:00'),
  (1001, 'Order A', 525.00, '2024-01-15 11:00:00'),  -- Duplicate! (later timestamp)
  (1002, 'Order B', 300.00, '2024-01-15 10:30:00'),
  (1002, 'Order B', 300.00, '2024-01-15 10:30:00'),  -- Exact duplicate
  (1003, 'Order C', 750.00, '2024-01-15 12:00:00')
AS orders(order_id, order_name, amount, order_timestamp);

SELECT * FROM orders_with_dupes ORDER BY order_id, order_timestamp

In [0]:
%sql
-- Deduplicate using ROW_NUMBER window function
-- Keep the latest record for each order_id

WITH deduped_orders AS (
  SELECT 
    order_id,
    order_name,
    amount,
    order_timestamp,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_timestamp DESC) AS rn
  FROM orders_with_dupes
)
SELECT 
  order_id,
  order_name,
  amount,
  order_timestamp
FROM deduped_orders
WHERE rn = 1
ORDER BY order_id

### 4️⃣ DELETE + INSERT Pattern

**What it does:**
Manually delete existing records, then insert new ones.

**Pattern:**
```sql
-- Step 1: Delete existing records
DELETE FROM target_table
WHERE id IN (SELECT id FROM source_table);

-- Step 2: Insert all records from source
INSERT INTO target_table
SELECT * FROM source_table;
```

**Characteristics:**
* ✅ Simple to understand
* ✅ Works in any SQL database
* ❌ NOT atomic (two separate transactions)
* ❌ Risk of data loss if INSERT fails
* ❌ Less efficient than MERGE

**When to use:**
* Legacy systems without MERGE support
* Very simple scenarios
* When MERGE is not available

**⚠️ Recommendation:** Use MERGE instead - it's atomic and more efficient!
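
On engines that support multi-statement transactions, the data-loss risk can be reduced by running both steps inside one transaction so a failed INSERT also undoes the DELETE. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database; it is an assumption for illustration and does not describe Delta Lake behaviour. The table names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT NOT NULL)")
conn.execute("CREATE TABLE source (id INTEGER, value TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old-1"), (2, "old-2")])
# The second source row has a NULL value, which violates NOT NULL on target
conn.executemany("INSERT INTO source VALUES (?, ?)", [(2, "new-2"), (3, None)])
conn.commit()


def delete_then_insert(connection):
    # 'with connection' commits on success and rolls back on any exception,
    # so the DELETE is undone if the INSERT fails
    with connection:
        connection.execute("DELETE FROM target WHERE id IN (SELECT id FROM source)")
        connection.execute("INSERT INTO target SELECT id, value FROM source")


try:
    delete_then_insert(conn)
except sqlite3.IntegrityError as exc:
    print("rolled back:", exc)

rows = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()
print(rows)  # [(1, 'old-1'), (2, 'old-2')]
```

Row 2 is still present with its old value after the failure. Run as two independent statements, the DELETE would already have been committed and row 2 would be gone.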

### 📊 Comparison of All Methods

| Method | Atomic | Incremental | Preserves Other Data | Complexity | Best For |
|--------|--------|-------------|---------------------|------------|----------|
| **MERGE INTO** | ✅ Yes | ✅ Yes | ✅ Yes | Medium | Most use cases |
| **INSERT OVERWRITE** | ✅ Yes | ❌ No | ❌ No | Low | Full refresh |
| **replaceWhere** | ✅ Yes | ✅ Yes | ✅ Yes | Low | Partition refresh |
| **DELETE + INSERT** | ❌ No | ✅ Yes | ✅ Yes | Low | Legacy systems |
| **Deduplication** | N/A | N/A | N/A | Medium | Data quality |

**Recommendation priority:**
1. **MERGE INTO** - Default choice for upserts
2. **replaceWhere** - For partition-level updates
3. **INSERT OVERWRITE** - For full table refresh
4. **DELETE + INSERT** - Only if MERGE not available

## ✅ Solution 3: Implement SCD Type 2

**Challenge:** Track price history with effective dates.

**Key concepts:**
* This demo implements SCD Type 2 with TWO MERGE statements
* Step 1: Close old records (expire them)
* Step 2: Insert new versions
* Must join on business key AND is_current = true

---
---
---

# 📝 Challenge Solutions

**⚠️ Spoiler Alert!**

This section contains solutions to all challenges. Try to solve them yourself first!

---

## 🔄 SCD Type 1 vs Type 2 Comparison

| Aspect | SCD Type 1 | SCD Type 2 |
|--------|------------|------------|
| **History** | ❌ No history | ✅ Full history |
| **Storage** | ✅ Minimal | ❌ More storage |
| **Complexity** | ✅ Simple | ❌ Complex |
| **Query complexity** | ✅ Simple | ❌ Need filters |
| **Audit trail** | ❌ No | ✅ Yes |
| **Point-in-time queries** | ❌ No | ✅ Yes |
| **Use case** | Corrections, non-critical | Compliance, analysis |
| **Rows per entity** | 1 | Multiple |
| **MERGE steps** | 1 | 2 |

**Decision guide:**
* **Need history?** ‚Üí Type 2
* **Compliance required?** ‚Üí Type 2
* **Storage limited?** ‚Üí Type 1
* **Simple corrections?** ‚Üí Type 1
* **Trend analysis?** ‚Üí Type 2
* **Current state only?** ‚Üí Type 1

### 💡 Solution 2 Explanation

**What the code does:**

1. **WHEN MATCHED AND source.salary > target.salary** - Key part!
   * Only updates if new salary is HIGHER than current
   * This is the data quality check

**Expected results:**
* Employee 201 (Alice): Updated (75000 → 80000) ✅
* Employee 202 (Bob): NOT updated (stays 85000, rejects 80000) ✅
* Employee 203 (Carol): Unchanged (not in source)
* Employee 204 (David): Inserted (new employee)

**Why Bob wasn't updated:**
* Current salary: 85000
* New salary: 80000
* Condition: 80000 > 85000 = FALSE
* Result: WHEN MATCHED clause skipped

**Total rows:** 4

### 🚀 MERGE Performance Tips

**1. Optimize join columns:**
```sql
-- Include partition or clustering columns in the ON clause so Delta can skip files
ON target.id = source.id AND target.date = source.date
```

**2. Filter source data:**
```sql
-- Reduce source data before MERGE
USING (SELECT * FROM source WHERE date >= CURRENT_DATE() - 7) AS source
```

**3. Use OPTIMIZE:**
```sql
-- Compact files after many MERGEs
OPTIMIZE target_table;
-- Optionally co-locate rows by the merge key as well
OPTIMIZE target_table ZORDER BY (id);
```

**4. Partition your tables:**
```sql
CREATE TABLE table (...) PARTITIONED BY (date)
-- MERGE can prune partitions when the ON clause filters on the partition column
```

**5. Monitor with DESCRIBE HISTORY:**
```sql
DESCRIBE HISTORY table
-- Check operationMetrics for performance insights
```

**6. Use appropriate data types:**
* Use INT instead of STRING for IDs
* Use DATE instead of STRING for dates
* Proper types improve join performance

In [0]:
%sql
-- SOLUTION for Challenge 3 - Step 1: Close old records

MERGE INTO main.default.products_scd2_challenge AS target
USING products_new_prices AS source
ON target.product_id = source.product_id 
   AND target.is_current = true

-- When matched AND price changed: Close the old record
WHEN MATCHED AND target.price != source.price THEN
  UPDATE SET
    target.effective_to = CURRENT_DATE(),
    target.is_current = false;

-- Check results
SELECT * FROM main.default.products_scd2_challenge ORDER BY product_id, effective_from

## 🎉 Congratulations!

You've completed the MERGE INTO & SCD demo!

### **What You Learned:**

✅ **MERGE INTO** - Upsert operations without duplicates  
✅ **Three clauses** - WHEN MATCHED, NOT MATCHED, NOT MATCHED BY SOURCE  
✅ **SCD Type 1** - Overwrite changes (no history)  
✅ **SCD Type 2** - Track historical changes  
✅ **Hands-On Practice** - Completed 3 challenges  
✅ **Alternative Methods** - INSERT OVERWRITE, replaceWhere, deduplication  
✅ **Best Practices** - Performance and production patterns  

---

### **Key Takeaways:**

1. **Never use plain INSERT for updates** - Creates duplicates
2. **MERGE is atomic** - All or nothing operation
3. **Choose SCD type wisely** - Based on business needs
4. **SCD Type 2 takes two steps in this demo** - Close old, insert new
5. **Test with edge cases** - Duplicates, NULLs, no matches
6. **Monitor performance** - Use OPTIMIZE and ZORDER

---

### **Decision Matrix:**

| Need | Solution |
|------|----------|
| Update + Insert | MERGE INTO |
| No history needed | SCD Type 1 |
| Track history | SCD Type 2 |
| Full table refresh | INSERT OVERWRITE |
| Partition refresh | replaceWhere |
| Duplicate source data | Deduplicate first |

---

### **Next Steps:**

* Implement MERGE in your ETL pipelines
* Choose appropriate SCD type for your dimensions
* Explore MERGE with complex conditions
* Learn about Change Data Capture (CDC)
* Study Delta Lake optimization techniques

---

### **Resources:**

* [MERGE INTO Documentation](https://docs.databricks.com/sql/language-manual/delta-merge-into.html)
* [SCD Patterns Guide](https://docs.databricks.com/delta/merge.html)
* [Delta Lake Best Practices](https://docs.databricks.com/delta/best-practices.html)
* [Slowly Changing Dimensions](https://en.wikipedia.org/wiki/Slowly_changing_dimension)

---

**You're now ready to handle complex data updates in production!** 🚀

*Happy merging!*

### 💡 Solution 3 Explanation

**Step 1: Close old records**

1. **ON target.product_id = source.product_id AND target.is_current = true**
   * Only matches CURRENT versions
   * Product 301 matches (current version exists)
   * Product 303 doesn't match (new product)

2. **WHEN MATCHED AND target.price != source.price**
   * Only closes if price actually changed
   * Product 301: 100 != 120, so close it

3. **UPDATE SET effective_to = CURRENT_DATE(), is_current = false**
   * Marks the old version as expired

**After Step 1:**
* Product 301: is_current = false (closed)
* Product 302: is_current = true (unchanged)

---

**Step 2: Insert new versions**

1. **ON target.product_id = source.product_id AND target.is_current = true**
   * Product 301: No match (we closed it in Step 1!)
   * Product 303: No match (new product)

2. **WHEN NOT MATCHED**
   * Product 301: Insert new version with price 120
   * Product 303: Insert new product

**Final results:**
* Product 301: 2 rows (old version + new version)
* Product 302: 1 row (unchanged)
* Product 303: 1 row (new product)

**Total rows:** 4

**🔑 Key insight:** After closing a record in Step 1, it's no longer "current", so Step 2 treats it as NOT MATCHED and inserts a new version!
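
One way to confirm the two steps left the history consistent is to count current rows per product. This is a minimal sketch against the same `main.default.products_scd2_challenge` table; the `current_rows` alias is just a name chosen for this example.

```sql
-- Sanity check after Step 2: every product should have exactly one current row.
-- An empty result means no product has zero or multiple current versions.
SELECT
  product_id,
  SUM(CASE WHEN is_current THEN 1 ELSE 0 END) AS current_rows
FROM main.default.products_scd2_challenge
GROUP BY product_id
HAVING SUM(CASE WHEN is_current THEN 1 ELSE 0 END) <> 1;
```

If only Step 1 has run, a changed product such as 301 would show up here with `current_rows = 0`, which is a quick signal that Step 2 is still pending.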

In [0]:
%sql
-- SOLUTION for Challenge 1

MERGE INTO main.default.products_challenge AS target
USING products_updates AS source
ON target.product_id = source.product_id

WHEN MATCHED THEN
  UPDATE SET
    target.product_name = source.product_name,
    target.price = source.price,
    target.stock = source.stock,
    target.last_updated = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN
  INSERT (product_id, product_name, price, stock, last_updated)
  VALUES (source.product_id, source.product_name, source.price, source.stock, CURRENT_TIMESTAMP());

-- Verify results
SELECT * FROM main.default.products_challenge ORDER BY product_id

## ✅ Solution 2: SCD Type 1 with Conditions

**Challenge:** Only update if salary increased (data quality check).

**Key concepts:**
* Add condition to WHEN MATCHED clause
* Use AND to add business logic
* Prevents bad data from being applied

In [0]:
%sql
-- SOLUTION for Challenge 3 - Step 2: Insert new versions

MERGE INTO main.default.products_scd2_challenge AS target
USING products_new_prices AS source
ON target.product_id = source.product_id AND target.is_current = true

-- When NOT matched: Insert new version (changed product or new product)
WHEN NOT MATCHED THEN
  INSERT (product_id, product_name, price, effective_from, effective_to, is_current)
  VALUES (source.product_id, source.product_name, source.price, CURRENT_DATE(), '9999-12-31', true);

-- Final results: Should have 2 rows for product 301 (history), 1 row each for 302 and 303
SELECT * FROM main.default.products_scd2_challenge ORDER BY product_id, effective_from

## ⚠️ Common Pitfalls

### **MERGE Pitfalls**

❌ **Wrong join condition** - Creates duplicates or misses updates
```sql
-- BAD: Missing is_current check for SCD Type 2
ON target.id = source.id  -- Matches ALL versions!

-- GOOD: Include is_current
ON target.id = source.id AND target.is_current = true
```

❌ **Forgetting WHEN NOT MATCHED** - New records not inserted

❌ **Not handling NULLs** - `NULL != NULL` evaluates to NULL, not TRUE, so the clause is skipped
```sql
-- Null-safe comparison: <=> treats two NULLs as equal and NULL vs. value as different
WHEN MATCHED AND NOT (target.col <=> source.col) THEN ...
```
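
To see why a plain `!=` can silently skip a change, it may help to evaluate the comparisons directly. The query below is a small sketch using Spark SQL's null-safe equality operator `<=>`; the column aliases are only illustrative.

```sql
-- Compare plain and null-safe operators on NULL inputs.
SELECT
  NULL != NULL      AS plain_not_equal,      -- NULL: a WHEN MATCHED AND ... condition would not fire
  NULL <=> NULL     AS null_safe_equal,      -- true: both sides are NULL
  NOT (NULL <=> 1)  AS null_safe_not_equal;  -- true: NULL vs. a value counts as a change
```

Because a MERGE condition only fires when it evaluates to TRUE, wrapping the comparison in `NOT (a <=> b)` catches changes to and from NULL that `a != b` would miss.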

❌ **Using MERGE for full refresh** - Use INSERT OVERWRITE instead

### **SCD Type 2 Pitfalls**

❌ **Forgetting to close old records** - Creates duplicates with is_current=true

❌ **Wrong effective dates** - Use CURRENT_DATE() consistently

❌ **Not filtering on is_current** - Queries return all versions
```sql
-- Always filter for current records
WHERE is_current = true
```

❌ **Comparing all columns** - Inefficient, use hash or key columns

### **General Pitfalls**

❌ **Not testing with duplicates** - Source data may have dupes

❌ **Ignoring performance** - MERGE can be slow on large tables

❌ **Not using Delta Lake** - MERGE requires Delta format
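
For the duplicate-source pitfall, a common approach is to reduce the source to one row per key before merging. The sketch below assumes the `employees_updates` source from Challenge 2 and a hypothetical `updated_at` column that orders competing versions; neither that column nor the `employees_updates_deduped` view exists in this demo.

```sql
-- Keep only the latest row per emp_id before running MERGE.
-- ASSUMPTION: `updated_at` is a hypothetical column on the source that orders versions.
CREATE OR REPLACE TEMPORARY VIEW employees_updates_deduped AS
SELECT emp_id, name, department, salary
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY emp_id ORDER BY updated_at DESC) AS rn
  FROM employees_updates
) AS ranked
WHERE rn = 1;
```

The MERGE would then read `USING employees_updates_deduped AS source`, so at most one source row can match each target row.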

In [0]:
%sql
-- SOLUTION for Challenge 2

MERGE INTO main.default.employees_challenge AS target
USING employees_updates AS source
ON target.emp_id = source.emp_id

-- Only update if salary INCREASED
WHEN MATCHED AND source.salary > target.salary THEN
  UPDATE SET
    target.name = source.name,
    target.department = source.department,
    target.salary = source.salary,
    target.last_updated = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN
  INSERT (emp_id, name, department, salary, last_updated)
  VALUES (source.emp_id, source.name, source.department, source.salary, CURRENT_TIMESTAMP());

-- Verify: Alice updated (75k->80k), Bob NOT updated (85k stays), David inserted
SELECT * FROM main.default.employees_challenge ORDER BY emp_id

### 📊 SCD Best Practices

**SCD Type 1:**

✅ **Document what you're overwriting** - Know what history you're losing  
✅ **Use for non-critical attributes** - Phone numbers, addresses  
✅ **Keep it simple** - Straightforward MERGE  
✅ **Add last_updated timestamp** - Track when changes occurred  

**SCD Type 2:**

✅ **Use surrogate keys** - IDENTITY columns work great  
✅ **Use effective dates** - effective_from, effective_to  
✅ **Use is_current flag** - Makes queries easier  
✅ **Use 9999-12-31 for open records** - Standard convention  
✅ **Cluster on the business key** - Delta has no indexes; ZORDER BY the key to speed up lookups  
✅ **Document the pattern** - Team needs to understand it  

**General:**

✅ **Choose the right SCD type** - Based on business requirements  
✅ **Test thoroughly** - Especially edge cases  
✅ **Monitor storage** - SCD Type 2 grows over time  
✅ **Consider SCD Type 3** - For limited history (not covered here)  
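
The SCD Type 2 recommendations above can be combined into a single table definition. This is a sketch only: the table name `main.default.customers_scd2_sketch` and its columns are hypothetical and are not part of the demo's challenges.

```sql
-- Hypothetical SCD Type 2 dimension with a surrogate key, effective dates, and a current flag.
CREATE TABLE IF NOT EXISTS main.default.customers_scd2_sketch (
  customer_sk    BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key: one value per version
  customer_id    INT,                                  -- business key: shared by all versions
  customer_name  STRING,
  effective_from DATE,
  effective_to   DATE,                                 -- 9999-12-31 while the version is open
  is_current     BOOLEAN
) USING DELTA;
```

Because `customer_sk` is generated, a MERGE that inserts into a table like this would list the other columns explicitly and leave the surrogate key out.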

## 6. Best Practices ✅

Production-ready patterns for MERGE operations.

### 🚀 MERGE Performance Tips

**1. Optimize join columns:**
```sql
-- Include partition or clustering columns in the ON clause so files can be skipped
ON target.id = source.id AND target.date = source.date
```

**2. Filter source data:**
```sql
-- Reduce source data before MERGE
USING (SELECT * FROM source WHERE date >= CURRENT_DATE() - 7) AS source
```

**3. Use OPTIMIZE:**
```sql
-- Compact files after many MERGEs
OPTIMIZE target_table
OPTIMIZE target_table ZORDER BY (id)
```

**4. Partition your tables:**
```sql
CREATE TABLE target_table (...) PARTITIONED BY (date)
-- MERGE can prune partitions when the ON clause filters on the partition column
```

**5. Monitor with DESCRIBE HISTORY:**
```sql
DESCRIBE HISTORY target_table
-- Check operationMetrics for performance insights
```

**6. Use appropriate data types:**
* Use INT instead of STRING for IDs
* Use DATE instead of STRING for dates
* Proper types improve join performance
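
Tips 1, 2, and 4 tend to work best together. The following is a rough sketch under stated assumptions: `main.default.events_target`, `events_updates`, `event_id`, and `event_date` are hypothetical names, the target is assumed to be partitioned by `event_date`, and the source and target are assumed to share the same columns so that `UPDATE SET *` and `INSERT *` apply.

```sql
-- Hypothetical MERGE that filters the source and constrains the target partition range.
MERGE INTO main.default.events_target AS t
USING (
  -- Tip 2: only bring the last 7 days of source data into the MERGE
  SELECT * FROM events_updates
  WHERE event_date >= date_sub(current_date(), 7)
) AS s
ON  t.event_id   = s.event_id
AND t.event_date = s.event_date                       -- Tip 1: join on the partition column
AND t.event_date >= date_sub(current_date(), 7)       -- Tip 4: explicit range helps partition pruning
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

The explicit date range on the target side repeats what the source filter already implies, which gives the engine a static predicate it can use to skip older partitions.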

## ✅ Solution 1: Basic MERGE

**Challenge:** Update existing products and insert new ones.

**Key concepts:**
* Use WHEN MATCHED for updates
* Use WHEN NOT MATCHED for inserts
* Join on product_id
* Update all columns including last_updated

### 💡 Solution 1 Explanation

**What the code does:**

1. **MERGE INTO** - Targets the products_challenge table
2. **USING** - Sources data from products_updates view
3. **ON** - Joins on product_id (the business key)
4. **WHEN MATCHED** - Products 2 and 3 exist, so UPDATE their values
5. **WHEN NOT MATCHED** - Product 4 is new, so INSERT it

**Expected results:**
* Product 1: Unchanged (not in source)
* Product 2: Updated (price 29 → 25, stock 200 → 250)
* Product 3: Updated (stock 150 → 100)
* Product 4: Inserted (new product)

**Total rows:** 4
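
Since the point of MERGE is to avoid duplicates, a quick check on the business key is a reasonable follow-up. This sketch queries the same `main.default.products_challenge` table; the `row_count` alias is just a label for this example.

```sql
-- After the MERGE, no product_id should appear more than once.
-- An empty result means the upsert did not create duplicates.
SELECT
  product_id,
  COUNT(*) AS row_count
FROM main.default.products_challenge
GROUP BY product_id
HAVING COUNT(*) > 1;
```

Running the same check after the naive `INSERT INTO ... SELECT` approach would list every product that already existed in the target.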

## 📚 Quick Reference Guide

### **Basic MERGE Syntax**
```sql
MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id

WHEN MATCHED THEN
  UPDATE SET target.col = source.col

WHEN NOT MATCHED THEN
  INSERT (col1, col2) VALUES (source.col1, source.col2)

WHEN NOT MATCHED BY SOURCE THEN
  DELETE
```

### **SCD Type 1 Template**
```sql
MERGE INTO target AS t
USING source AS s
ON t.business_key = s.business_key

WHEN MATCHED THEN
  UPDATE SET
    t.attribute = s.attribute,
    t.last_updated = CURRENT_TIMESTAMP()

-- INSERT * requires the source columns to line up with the target columns
WHEN NOT MATCHED THEN
  INSERT *
```

### **SCD Type 2 Template**
```sql
-- Step 1: Close old records
MERGE INTO target AS t
USING source AS s
ON t.business_key = s.business_key AND t.is_current = true
WHEN MATCHED AND (t.attribute != s.attribute) THEN
  UPDATE SET
    t.effective_to = CURRENT_DATE(),
    t.is_current = false;

-- Step 2: Insert new versions
MERGE INTO target AS t
USING source AS s
ON t.business_key = s.business_key AND t.is_current = true
WHEN NOT MATCHED THEN
  INSERT (business_key, attribute, effective_from, effective_to, is_current)
  VALUES (s.business_key, s.attribute, CURRENT_DATE(), '9999-12-31', true)
```

### **Query Patterns**
```sql
-- Current records only (SCD Type 2)
SELECT * FROM dim_table WHERE is_current = true;

-- Historical state at a specific date
-- effective_to is exclusive: a closed row ends on the day its successor starts
SELECT * FROM dim_table
WHERE effective_from <= '2024-01-15'
  AND effective_to > '2024-01-15';

-- All history for a key
SELECT * FROM dim_table
WHERE business_key = 123
ORDER BY effective_from;
```