# Implementing an SCD Type 2 dimension from a CDC source using Snowflake's Stored Procedure and Data Quality Checks

<img src = "img/stored_procedure_scd_type_2.jpg">

## SCD Type 2 Table

### Source Data (PRODUCT_STATUS_CDC table)

| PRODUCT_KEY | STATUS | CHANGE_TYPE | CHANGE_TIME           | CDC_LOG_POSITION |
|-------------|--------|-------------|------------------------|-------------------|
| 1           | 10     | INSERT      | 2019-01-01 10:00:00   | 1                 |
| 1           | 10     | INSERT      | 2019-01-01 10:00:00   | 1                 |
| 1           | 10     | UPDATE      | 2019-01-01 10:30:00   | 2                 |
| 1           | 20     | UPDATE      | 2019-01-01 11:00:00   | 3                 |
| 1           | 20     | DELETE      | 2019-01-01 12:00:00   | 4                 |
| 1           | 10     | INSERT      | 2019-01-01 14:00:00   | 5                 |

### CTE 1: deduplicated_cdc
* Deduplicates rows where all fields (excluding CDC_LOG_POSITION) are identical. Here, the first two rows are duplicates, so one is removed.

| PRODUCT_KEY | STATUS | CHANGE_TYPE | CHANGE_TIME           | CDC_LOG_POSITION |
|-------------|--------|-------------|------------------------|-------------------|
| 1           | 10     | INSERT      | 2019-01-01 10:00:00   | 1                 |
| 1           | 10     | UPDATE      | 2019-01-01 10:30:00   | 2                 |
| 1           | 20     | UPDATE      | 2019-01-01 11:00:00   | 3                 |
| 1           | 20     | DELETE      | 2019-01-01 12:00:00   | 4                 |
| 1           | 10     | INSERT      | 2019-01-01 14:00:00   | 5                 |
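
The deduplication step could be sketched in Snowflake SQL roughly as follows. This is a minimal illustration, not the stored procedure's actual code: the sample rows are inlined with VALUES so the query runs on its own, and the use of QUALIFY with ROW_NUMBER() is an assumption about how the duplicates are removed.

```sql
-- Sketch only: sample rows are inlined; the real procedure reads PRODUCT_STATUS_CDC.
WITH product_status_cdc AS (
    SELECT
        column1 AS product_key,
        column2 AS status,
        column3 AS change_type,
        TO_TIMESTAMP_NTZ(column4) AS change_time,
        column5 AS cdc_log_position
    FROM VALUES
        (1, 10, 'INSERT', '2019-01-01 10:00:00', 1),
        (1, 10, 'INSERT', '2019-01-01 10:00:00', 1),
        (1, 10, 'UPDATE', '2019-01-01 10:30:00', 2)
),

deduplicated_cdc AS (
    SELECT *
    FROM product_status_cdc
    -- Keep one row from each group of rows that match on every field
    -- except CDC_LOG_POSITION (assumed approach).
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY product_key, status, change_type, change_time
        ORDER BY cdc_log_position
    ) = 1
)

SELECT *
FROM deduplicated_cdc
ORDER BY cdc_log_position;
```

With the three inlined rows, the two identical INSERT rows collapse into one and the UPDATE row is kept.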

### CTE 2: hash_table_1
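* Computes _hash_1 for each row based on the STATUS and CHANGE_TYPE. In the table below, INSERT and UPDATE rows with the same STATUS share a hash, while a DELETE row hashes differently.
* Sets _valued_changed_1 to TRUE for rows where _hash_1 differs from the previous row’s _hash_1, including the first row of a product, which has no previous hash.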

| PRODUCT_KEY | STATUS | CHANGE_TYPE | CHANGE_TIME           | CDC_LOG_POSITION | _hash_1      | _previous_hash_1 | _valued_changed_1 |
|-------------|--------|-------------|------------------------|-------------------|--------------|------------------|--------------------|
| 1           | 10     | INSERT      | 2019-01-01 10:00:00   | 1                 | hash(10)     | NULL             | TRUE               |
| 1           | 10     | UPDATE      | 2019-01-01 10:30:00   | 2                 | hash(10)     | hash(10)         | FALSE              |
| 1           | 20     | UPDATE      | 2019-01-01 11:00:00   | 3                 | hash(20)     | hash(10)         | TRUE               |
| 1           | 20     | DELETE      | 2019-01-01 12:00:00   | 4                 | hash(20, del)| hash(20)         | TRUE               |
| 1           | 10     | INSERT      | 2019-01-01 14:00:00   | 5                 | hash(10)     | hash(20, del)    | TRUE               |
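
One way to express this change detection is sketched below. It is an assumption-laden illustration, not the original code: the rows of deduplicated_cdc are inlined, the intermediate CTE names `hashed` and `lagged` are hypothetical, and hashing STATUS together with a delete flag is a guess chosen because it matches the hash column shown in the table above.

```sql
-- Sketch only: the rows of deduplicated_cdc are inlined for illustration.
WITH deduplicated_cdc AS (
    SELECT
        column1 AS product_key,
        column2 AS status,
        column3 AS change_type,
        TO_TIMESTAMP_NTZ(column4) AS change_time,
        column5 AS cdc_log_position
    FROM VALUES
        (1, 10, 'INSERT', '2019-01-01 10:00:00', 1),
        (1, 10, 'UPDATE', '2019-01-01 10:30:00', 2),
        (1, 20, 'UPDATE', '2019-01-01 11:00:00', 3),
        (1, 20, 'DELETE', '2019-01-01 12:00:00', 4),
        (1, 10, 'INSERT', '2019-01-01 14:00:00', 5)
),

hashed AS (
    SELECT
        *,
        -- Assumption: hash STATUS plus a delete flag, so INSERT and UPDATE
        -- rows with the same STATUS produce the same hash.
        HASH(status, change_type = 'DELETE') AS _hash_1
    FROM deduplicated_cdc
),

lagged AS (
    SELECT
        *,
        -- Hash of the previous change for the same product (NULL for the first row).
        LAG(_hash_1) OVER (
            PARTITION BY product_key
            ORDER BY change_time, cdc_log_position
        ) AS _previous_hash_1
    FROM hashed
)

SELECT
    *,
    -- IS DISTINCT FROM treats NULL as a comparable value, so the first row is TRUE.
    _hash_1 IS DISTINCT FROM _previous_hash_1 AS _valued_changed_1
FROM lagged
ORDER BY cdc_log_position;
```

Under these assumptions, only the UPDATE at 10:30:00 gets _valued_changed_1 = FALSE, which agrees with the table above.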


### CTE 3: hash_table_2
* Filters rows to retain only those where _valued_changed_1 is TRUE. This drops the UPDATE at 10:30:00, which did not change the status.

| PRODUCT_KEY | STATUS | CHANGE_TYPE | CHANGE_TIME           | CDC_LOG_POSITION |
|-------------|--------|-------------|------------------------|-------------------|
| 1           | 10     | INSERT      | 2019-01-01 10:00:00   | 1                 |
| 1           | 20     | UPDATE      | 2019-01-01 11:00:00   | 3                 |
| 1           | 20     | DELETE      | 2019-01-01 12:00:00   | 4                 |
| 1           | 10     | INSERT      | 2019-01-01 14:00:00   | 5                 |

### CTE 4: final_table
* Adds the is_active column, where DELETE rows are marked is_active = FALSE.

| PRODUCT_KEY | STATUS | CHANGE_TYPE | CHANGE_TIME           | CDC_LOG_POSITION | is_active |
|-------------|--------|-------------|------------------------|-------------------|-----------|
| 1           | 10     | INSERT      | 2019-01-01 10:00:00   | 1                 | TRUE      |
| 1           | 20     | UPDATE      | 2019-01-01 11:00:00   | 3                 | TRUE      |
| 1           | 20     | DELETE      | 2019-01-01 12:00:00   | 4                 | FALSE     |
| 1           | 10     | INSERT      | 2019-01-01 14:00:00   | 5                 | TRUE      |
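
A minimal sketch of how the is_active flag might be derived, assuming it depends only on CHANGE_TYPE. The rows of hash_table_2 are inlined so the query is self-contained; the original procedure may compute the flag differently.

```sql
-- Sketch only: the four rows of hash_table_2 are inlined for illustration.
WITH hash_table_2 AS (
    SELECT
        column1 AS product_key,
        column2 AS status,
        column3 AS change_type,
        TO_TIMESTAMP_NTZ(column4) AS change_time,
        column5 AS cdc_log_position
    FROM VALUES
        (1, 10, 'INSERT', '2019-01-01 10:00:00', 1),
        (1, 20, 'UPDATE', '2019-01-01 11:00:00', 3),
        (1, 20, 'DELETE', '2019-01-01 12:00:00', 4),
        (1, 10, 'INSERT', '2019-01-01 14:00:00', 5)
)

SELECT
    *,
    -- DELETE rows are inactive; every other change type is active.
    change_type <> 'DELETE' AS is_active
FROM hash_table_2
ORDER BY cdc_log_position;
```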

### CTE 5: scd_table
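* Calculates scd_start_time and scd_end_time. The scd_start_time is the row's CHANGE_TIME; the scd_end_time uses LEAD to take the next CHANGE_TIME for the same product, or a default future date (2999-01-01 00:00:00) if there is none.
* The DELETE row closes the previous status: the status 20 row ends at 12:00:00, when the DELETE occurs.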

| product_status_product_key | product_status_status | scd_start_time         | scd_end_time           | is_active |
|----------------------------|-----------------------|-------------------------|-------------------------|-----------|
| 1                          | 10                    | 2019-01-01 10:00:00    | 2019-01-01 11:00:00     | TRUE      |
| 1                          | 20                    | 2019-01-01 11:00:00    | 2019-01-01 12:00:00     | TRUE      |
| 1                          | 20                    | 2019-01-01 12:00:00    | 2019-01-01 12:00:00     | FALSE     |
| 1                          | 10                    | 2019-01-01 14:00:00    | 2999-01-01 00:00:00     | TRUE      |
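
The interval calculation can be sketched as below, with the rows of final_table inlined. Two points are assumptions rather than facts about the original procedure: the window is ordered by CHANGE_TIME alone, and inactive (DELETE) rows are given an end time equal to their own start time, which is what the table above shows for the 12:00:00 row.

```sql
-- Sketch only: the rows of final_table are inlined for illustration.
WITH final_table AS (
    SELECT
        column1 AS product_key,
        column2 AS status,
        TO_TIMESTAMP_NTZ(column3) AS change_time,
        column4 AS is_active
    FROM VALUES
        (1, 10, '2019-01-01 10:00:00', TRUE),
        (1, 20, '2019-01-01 11:00:00', TRUE),
        (1, 20, '2019-01-01 12:00:00', FALSE),
        (1, 10, '2019-01-01 14:00:00', TRUE)
)

SELECT
    product_key AS product_status_product_key,
    status AS product_status_status,
    change_time AS scd_start_time,
    IFF(
        is_active,
        -- Active rows end when the next change for the same product begins,
        -- or at the default future date when there is no later change.
        COALESCE(
            LEAD(change_time) OVER (PARTITION BY product_key ORDER BY change_time),
            TO_TIMESTAMP_NTZ('2999-01-01 00:00:00')
        ),
        -- Assumption: DELETE rows get a zero-length interval.
        change_time
    ) AS scd_end_time,
    is_active
FROM final_table
ORDER BY scd_start_time;
```

Because the final select filters out inactive rows, the end time chosen for DELETE rows does not affect the resulting dimension.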

### Final Select
* Filters for is_active = TRUE and assigns ROW_NUMBER() to product_status_scd_key, ordered by product_status_product_key and scd_start_time.

| product_status_scd_key | product_status_product_key | product_status_status | scd_start_time         | scd_end_time           |
|-------------------------|----------------------------|-----------------------|-------------------------|-------------------------|
| 1                       | 1                          | 10                    | 2019-01-01 10:00:00    | 2019-01-01 11:00:00     |
| 2                       | 1                          | 20                    | 2019-01-01 11:00:00    | 2019-01-01 12:00:00     |
| 3                       | 1                          | 10                    | 2019-01-01 14:00:00    | 2999-01-01 00:00:00     |
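
A minimal sketch of this last step, assuming the rows of scd_table shown earlier are inlined. It illustrates the filter and the surrogate key assignment only and is not the procedure's actual statement.

```sql
-- Sketch only: the rows of scd_table are inlined for illustration.
WITH scd_table AS (
    SELECT
        column1 AS product_status_product_key,
        column2 AS product_status_status,
        TO_TIMESTAMP_NTZ(column3) AS scd_start_time,
        TO_TIMESTAMP_NTZ(column4) AS scd_end_time,
        column5 AS is_active
    FROM VALUES
        (1, 10, '2019-01-01 10:00:00', '2019-01-01 11:00:00', TRUE),
        (1, 20, '2019-01-01 11:00:00', '2019-01-01 12:00:00', TRUE),
        (1, 20, '2019-01-01 12:00:00', '2019-01-01 12:00:00', FALSE),
        (1, 10, '2019-01-01 14:00:00', '2999-01-01 00:00:00', TRUE)
)

SELECT
    -- Surrogate key: sequential number over product key and start time.
    ROW_NUMBER() OVER (
        ORDER BY product_status_product_key, scd_start_time
    ) AS product_status_scd_key,
    product_status_product_key,
    product_status_status,
    scd_start_time,
    scd_end_time
FROM scd_table
WHERE is_active  -- Drop the DELETE marker rows
ORDER BY product_status_scd_key;
```

Note that a key generated with ROW_NUMBER() is only stable if the whole table is rebuilt in the same order on every run.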


## Queries

```sql
-- Distinct products for a given state and date
SELECT DISTINCT product_status_product_key AS product_key
FROM CAIOCVELASCO.DATA_ENGINEER.PRODUCT_STATUS_HST
WHERE product_status_status = '10'  -- Replace '10' with the state you want to test
  AND scd_start_time <= CAST('2019-01-01' AS TIMESTAMP_NTZ) + INTERVAL '1 DAY' - INTERVAL '1 SECOND'  -- Record starts on or before the end of the test date; replace '2019-01-01' with your desired test date
  AND scd_end_time >= CAST('2019-01-01' AS TIMESTAMP_NTZ);  -- Record ends on or after the start of the test date; use the same date as above
```

* This query selects the distinct products that were in a given state (here product_status_status = '10') at any point during a specific day, based on the historical records in the PRODUCT_STATUS_HST table.

Result set:
| product_key |
|-------------|
| 1           |

## Data Quality

### Query 1 - Check for Existence of Product Keys

```sql
-- 1. Check for Existence of Product Keys
-- Ensure that all unique product keys in the CDC table exist in the SCD table.
-- Outcome: If any product keys are returned, it indicates that those keys are missing in the SCD table.
SELECT DISTINCT cdc.product_key
FROM CAIOCVELASCO.DATA_ENGINEER.PRODUCT_STATUS_CDC AS cdc
LEFT JOIN CAIOCVELASCO.DATA_ENGINEER.PRODUCT_STATUS_HST AS scd
ON cdc.product_key = scd.product_status_product_key
WHERE scd.product_status_product_key IS NULL;
```

Result set:
| product_key |
|-------------|
| EMPTY       |

### Query 2 - Check for Duplicates in SCD

```sql
-- 2. Check for Duplicates in SCD
-- Ensure that there are no duplicate records for the same product key in the SCD table.
-- The surrogate key product_status_scd_key is left out of the grouping: it differs on every row,
-- so including it would hide duplicates.
-- Outcome: If any rows are returned, it indicates that there are duplicates for those product keys in the SCD table.
SELECT product_status_product_key, product_status_status, scd_start_time, scd_end_time, COUNT(*) AS duplicate_count
FROM CAIOCVELASCO.DATA_ENGINEER.PRODUCT_STATUS_HST
GROUP BY product_status_product_key, product_status_status, scd_start_time, scd_end_time
HAVING COUNT(*) > 1;
```

Result set:
| product_status_product_key | product_status_status | scd_start_time | scd_end_time | duplicate_count |
|----------------------------|-----------------------|----------------|--------------|-----------------|
| EMPTY                      | EMPTY                 | EMPTY          | EMPTY        | EMPTY           |

### Query 3 - Cross Check for Active Records with CDC Table


```sql
-- 3. Check for Active Records
-- Verify that all active records in the SCD table (where scd_end_time is the default future date, 2999-01-01)
-- correspond to entries in the CDC table. This ensures that the current status of a product is accurately represented.
-- Outcome: Any active record in the SCD table without a matching product key in the CDC table is returned,
-- indicating a potential inconsistency in the data.
SELECT DISTINCT s.product_status_product_key
FROM CAIOCVELASCO.DATA_ENGINEER.PRODUCT_STATUS_HST s
LEFT JOIN CAIOCVELASCO.DATA_ENGINEER.PRODUCT_STATUS_CDC c
ON s.product_status_product_key = c.product_key
WHERE s.scd_end_time = '2999-01-01 00:00:00' -- Active records
  -- After the LEFT JOIN, c.product_key is NULL only when no CDC row matched,
  -- meaning the active SCD record is not represented in the CDC table.
  AND c.product_key IS NULL;
```

Result set:
| product_status_product_key |
|----------------------------|
| EMPTY                      |

### Query 4 - Last Status Consistency Check
* The WHERE clause keeps only the rows where the latest status in the CDC table disagrees with the corresponding active status in the SCD table.
* Specifically, the query looks for:
    * No active status in the SCD table (s.product_status_status IS NULL), or more than one active record for the same product (s.rn <> 1).
    * A mismatch between the statuses in the SCD and CDC tables (s.product_status_status <> c.latest_status).
```sql
WITH latest_cdc AS (
    -- Retrieve the latest status for each product key from the CDC table
    SELECT 
        product_key,
        status AS latest_status,
        -- rn = 1 marks the latest change; cdc_log_position breaks ties on change_time
        ROW_NUMBER() OVER (PARTITION BY product_key ORDER BY change_time DESC, cdc_log_position DESC) AS rn
    FROM CAIOCVELASCO.DATA_ENGINEER.PRODUCT_STATUS_CDC
),

latest_scd AS (
    -- Get the latest active status for each product key from the SCD table
    SELECT 
        product_status_product_key,
        product_status_status,
        -- rn = 1 marks the most recent active record; rn > 1 means several active records exist
        ROW_NUMBER() OVER (PARTITION BY product_status_product_key ORDER BY scd_start_time DESC) AS rn
    FROM CAIOCVELASCO.DATA_ENGINEER.PRODUCT_STATUS_HST
    WHERE scd_end_time = '2999-01-01 00:00:00' -- Only consider active records
)

SELECT 
    c.product_key,
    c.latest_status AS cdc_latest_status,
    s.product_status_status AS scd_latest_status
FROM latest_cdc c
LEFT JOIN latest_scd s ON c.product_key = s.product_status_product_key
WHERE c.rn = 1 -- Take the latest status from CDC and compare it with the one from SCD
    AND (s.rn <> 1 OR s.product_status_status IS NULL -- Extra active record in SCD, or no active record at all
         OR s.product_status_status <> c.latest_status); -- Check if the status is different
```

Result set:
| product_key | cdc_latest_status | scd_latest_status |
|-------------|------------------|------------------|
| EMPTY       | EMPTY            | EMPTY            |