# 🥇 Incremental Load of the `FactSales` Fact Table

## Objective

This notebook implements an **incremental process** to populate the `FactSales` fact table in the `gold` schema, using data from the `silver` schema.

---


### Process Parameter Configuration
This cell defines the configuration variables required to run the process.  
Each parameter controls how and where data will be extracted and loaded.

- **`catalog`**  
  - Name of the Spark catalog or database where the tables reside.  
  - Usually corresponds to the current workspace or execution context in Databricks or another compatible platform.  

- **`cdc_col`**  
  - Column used for **Change Data Capture** (CDC).  
  - Holds the date and time each record was last modified, and is used to select only new or updated data.  

- **`backdate_refresh`**  
  - Optional field to force a reload starting from a specific date.  
  - If empty (`""`), the last load date detected in the target will be used.  
  - If a date is provided (`"YYYY-MM-DD HH:MM:SS"`), it will be used as the starting point to reload data.

- **`source_object`**  
  - Name of the **source** fact table in the *silver* zone.  

- **`source_schema`**  
  - Name of the schema where the source fact table is located (*silver layer*).  

- **`target_schema`**  
  - Name of the schema where the target fact table will be stored (*gold layer*).  

- **`target_object`**  
  - Name of the **target** fact table in the *gold* zone.  

- **`fact_key`**  
  - Name of the fact table key.  
  - Used as a unique identifier to perform *upserts* in Delta Lake (MERGE between source and target).  
  - Leave empty (`""`) if the fact table does not have one.


In [0]:
catalog = "workspace"
cdc_col  = "modified_date"
backdate_refresh = ""
source_object = "silver_sales"
source_schema = "silver"
target_schema = "gold"
target_object = "FactSales"
fact_key = "sales_sk"


- **`dimensions`**  
  - List of dictionaries with the configuration of each dimension to be joined to the fact table.  
  - Each element includes:  
    - **`table`** → Full name of the dimension table (including catalog and schema).  
    - **`alias`** → Alias to be used in SQL when referring to the dimension.  
    - **`join_keys`** → List of tuples indicating the join columns between the fact table and the dimension.  
      - Each tuple has the format `(fact_col, dim_col)` where:  
        - `fact_col` is the column in the fact table.  
        - `dim_col` is the column in the dimension table.  
    - **`surrogate_key`** → Surrogate key column of the dimension that should be selected in the `SELECT` statement.  

- **`fact_table`**  
  - Full path of the source fact table.

- **`fact_columns`**  
  - List of column names to be extracted directly from the fact table during the incremental query, excluding foreign keys to dimensions.


In [0]:
dimensions = [
    {
        "table": f"{catalog}.{target_schema}.DimProducts",
        "alias": "DimProducts",
        "join_keys": [("product_sk", "product_sk")], # (fact_col, dim_col)
        "surrogate_key": "product_sk"
    },
    {
        "table": f"{catalog}.{target_schema}.DimStores",
        "alias": "DimStores",
        "join_keys": [("store_sk", "store_sk")], 
        "surrogate_key": "store_sk"
    },
    {
        "table": f"{catalog}.{target_schema}.DimSalesPersons",
        "alias": "DimSalesPersons",
        "join_keys": [("salesperson_sk", "salesperson_sk")],
        "surrogate_key": "salesperson_sk"
    },
    {
        "table": f"{catalog}.{target_schema}.DimTimes",
        "alias": "DimTimes",
        "join_keys": [("time_sk", "time_sk")],
        "surrogate_key": "time_sk"
    },
    {
        "table": f"{catalog}.{target_schema}.DimDates",
        "alias": "DimDates",
        "join_keys": [("date_sk", "date_sk")],
        "surrogate_key": "date_sk"
    },
    {
        "table": f"{catalog}.{target_schema}.DimCustomers",
        "alias": "DimCustomers",
        "join_keys": [("customer_sk", "customer_sk")],
        "surrogate_key": "customer_sk"
    },
    {
        "table": f"{catalog}.{target_schema}.DimCampaigns",
        "alias": "DimCampaigns",
        "join_keys": [("campaign_sk", "campaign_sk")],
        "surrogate_key": "campaign_sk"
    }
]

fact_table = f"{catalog}.{source_schema}.{source_object}"

fact_columns = ["sales_sk", "sales_id", "total_amount", "modified_date"]
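
Because every later step depends on the shape of `dimensions`, a quick structural check can catch typos early. The helper below is a minimal sketch, not part of the original notebook: the name `validate_dimensions` and its rules (required keys, unique aliases, two-element tuples) are assumptions based on the parameter description above.

```python
def validate_dimensions(dimensions):
    """Return a list of human-readable problems found in the dimension config.

    Hypothetical helper: the checks mirror the documented structure
    (table, alias, join_keys, surrogate_key) and nothing more.
    """
    problems = []
    required = ("table", "alias", "join_keys", "surrogate_key")
    seen_aliases = set()

    for i, dim in enumerate(dimensions):
        missing = [key for key in required if key not in dim]
        if missing:
            problems.append(f"dimension #{i}: missing keys {missing}")
            continue  # the remaining checks need all keys to be present

        if dim["alias"] in seen_aliases:
            problems.append(f"dimension #{i}: duplicate alias {dim['alias']!r}")
        seen_aliases.add(dim["alias"])

        if not dim["join_keys"]:
            problems.append(f"dimension #{i}: join_keys is empty")
        for pair in dim["join_keys"]:
            if not (isinstance(pair, tuple) and len(pair) == 2):
                problems.append(
                    f"dimension #{i}: join key {pair!r} is not a (fact_col, dim_col) tuple"
                )

    return problems
```

Calling `validate_dimensions(dimensions)` on the configuration above should return an empty list; any other result names the offending entry by position.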

Calculates the cutoff date for the incremental load.  
- **If `backdate_refresh` is empty** (`""`):  
  - Checks if the target table (`target_schema.target_object`) exists in the Spark catalog.  
    - **If it exists** → Executes a SQL query to get the maximum value of the `cdc_col` (`modified_date`) in the target table and assigns it to `last_load`.  
    - **If it does not exist** → Assigns the date `"1900-01-01 00:00:00"` to force a full load.  
- **If `backdate_refresh` has a value**:  
  - Uses that value directly as `last_load`, without calculating it from the target table.



In [0]:
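# Fully qualified name, so the existence check and the query target the same table
target_table = f"{catalog}.{target_schema}.{target_object}"

if len(backdate_refresh) == 0:
    if spark.catalog.tableExists(target_table):
        # Latest modification timestamp already loaded into the target
        last_load = spark.sql(f"select max({cdc_col}) from {target_table}").collect()[0][0]
    else:
        # No target table yet: use a very old date to force a full load
        last_load = "1900-01-01 00:00:00"
else:
    last_load = backdate_refresh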
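
One edge case the cell above does not cover: if the target table exists but is empty, `max(cdc_col)` returns `NULL`, and `last_load` becomes `None`. The sketch below isolates the decision logic as a plain function so that case can be handled and tested without Spark. The function name `resolve_last_load` and its arguments are hypothetical; the caller is assumed to pass in the result of the existence check and of the `max` query.

```python
DEFAULT_START = "1900-01-01 00:00:00"  # same full-load sentinel used in the notebook


def resolve_last_load(backdate_refresh, table_exists, max_cdc_value):
    """Pick the cutoff for the incremental load.

    backdate_refresh: "" or an explicit "YYYY-MM-DD HH:MM:SS" override.
    table_exists:     whether the target table is present.
    max_cdc_value:    max(cdc_col) read from the target, or None if it is empty.
    """
    if backdate_refresh:
        # An explicit override always wins
        return backdate_refresh
    if table_exists and max_cdc_value is not None:
        return str(max_cdc_value)
    # Missing or empty target: fall back to a full load
    return DEFAULT_START
```

With this shape, an empty target table behaves like a missing one and triggers a full load.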

**`generate_fact_query_incremental`**  
Builds the SQL query that retrieves only the new or updated records from the fact table and enriches them with dimension data.  

**How it works**:  
- Assigns alias `"f"` to the fact table to simplify the SQL.  
- Constructs the list of columns to select:  
  - All columns defined in `fact_columns` (from the fact table).  
  - The surrogate key of each dimension (from its `surrogate_key` setting).  
- For each dimension:  
  - Generates a `LEFT JOIN` using the columns defined in `join_keys`.  
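- Creates the `WHERE` clause filtering records where `cdc_column >= DATE(processing_date)`. Because `DATE()` truncates the cutoff to midnight, every record from the cutoff day onward is selected again.  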
- Returns the SQL query ready to execute in Spark.


In [0]:
def generate_fact_query_incremental(fact_table, dimensions, fact_columns, cdc_column, processing_date):
    fact_alias = "f"

    select_cols = [f"{fact_alias}.{col}" for col in fact_columns]
    join_clauses = []

    for dim in dimensions:
        table_full = dim["table"]
        alias = dim["alias"]

        # Select the dimension's surrogate key alongside the fact columns
        select_cols.append(f"{alias}.{dim['surrogate_key']}")

        # One equality per (fact_col, dim_col) pair, so composite join keys are supported
        on_conditions = [f"{fact_alias}.{fk} = {alias}.{dk}" for fk, dk in dim["join_keys"]]
        join_clause = f"LEFT JOIN {table_full} {alias} ON " + " AND ".join(on_conditions)
        join_clauses.append(join_clause)

    select_clause = ",\n       ".join(select_cols)
    joins = "\n".join(join_clauses)
    where_clause = f"{fact_alias}.{cdc_column} >= DATE('{processing_date}')"

    query = f"""
    SELECT {select_clause}
    FROM {fact_table} {fact_alias}
    {joins}
    WHERE {where_clause}
    """.strip()

    return query
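
Since `join_keys` is a list, a dimension can be joined on more than one column. The snippet below is a small sketch of just the join-building step, extracted into a standalone helper for illustration. The helper name `build_join_clause` and the dimension `DimStoreProducts` are hypothetical and do not appear in the notebook's configuration.

```python
def build_join_clause(dim, fact_alias="f"):
    """Build one LEFT JOIN clause from a dimension config entry."""
    on_conditions = [
        f"{fact_alias}.{fact_col} = {dim['alias']}.{dim_col}"
        for fact_col, dim_col in dim["join_keys"]
    ]
    return f"LEFT JOIN {dim['table']} {dim['alias']} ON " + " AND ".join(on_conditions)


# Hypothetical dimension joined on two columns (illustration only)
example_dim = {
    "table": "workspace.gold.DimStoreProducts",
    "alias": "DimStoreProducts",
    "join_keys": [("store_sk", "store_sk"), ("product_sk", "product_sk")],
    "surrogate_key": "store_product_sk",
}

join_sql = build_join_clause(example_dim)
print(join_sql)
```

Each `(fact_col, dim_col)` pair becomes one equality, and the pairs are combined with `AND` in the `ON` clause.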


In [0]:
query = generate_fact_query_incremental(fact_table, dimensions, fact_columns, cdc_col, last_load)
print(query)

SELECT f.sales_sk,
       f.sales_id,
       f.total_amount,
       f.modified_date,
       DimProducts.product_sk,
       DimStores.store_sk,
       DimSalesPersons.salesperson_sk,
       DimTimes.time_sk,
       DimDates.date_sk,
       DimCustomers.customer_sk,
       DimCampaigns.campaign_sk
    FROM workspace.silver.silver_sales f
    LEFT JOIN workspace.gold.DimProducts DimProducts ON f.product_sk = DimProducts.product_sk
LEFT JOIN workspace.gold.DimStores DimStores ON f.store_sk = DimStores.store_sk
LEFT JOIN workspace.gold.DimSalesPersons DimSalesPersons ON f.salesperson_sk = DimSalesPersons.salesperson_sk
LEFT JOIN workspace.gold.DimTimes DimTimes ON f.time_sk = DimTimes.time_sk
LEFT JOIN workspace.gold.DimDates DimDates ON f.date_sk = DimDates.date_sk
LEFT JOIN workspace.gold.DimCustomers DimCustomers ON f.customer_sk = DimCustomers.customer_sk
LEFT JOIN workspace.gold.DimCampaigns DimCampaigns ON f.campaign_sk = DimCampaigns.campaign_sk
    WHERE f.modified_date >= DATE('202

The result is loaded into a DataFrame (`df_fact`), which contains:

- The incremental records of the fact table.
- The surrogate keys of the joined dimensions.


In [0]:
df_fact = spark.sql(query)

In [0]:
display(df_fact.limit(5))

| sales_sk | sales_id | total_amount | modified_date | product_sk | store_sk | salesperson_sk | time_sk | date_sk | customer_sk | campaign_sk |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SALES_0000001 | 2421.54 | 2025-08-19T18:16:31.999Z | 167 | 191 | 1442 | 929 | 50 | 56504 | 3 |
| 2 | SALES_0000002 | 2487.22 | 2025-08-19T18:16:31.999Z | 100 | 422 | 1996 | 548 | 74 | 59945 | 8 |
| 3 | SALES_0000003 | 2915.91 | 2025-08-19T18:16:31.999Z | 71 | 411 | 1388 | 831 | 276 | 54709 | 35 |
| 4 | SALES_0000004 | 4086.9 | 2025-08-19T18:16:31.999Z | 137 | 242 | 1211 | 1218 | 127 | 77739 | 6 |
| 5 | SALES_0000005 | 2425.33 | 2025-08-19T18:16:31.999Z | 207 | 380 | 570 | 1330 | 236 | 97840 | 19 |


### **Loading Data into the Target Table (`FactSales`)**  

- **If the target table exists**  
  1. The table is loaded as a `DeltaTable` object (`dlt_object`).  
  2. A **MERGE** is executed between:
     - **`src`** → incremental DataFrame (`df_fact`).
     - **`trg`** → target table.
  3. MERGE conditions:
     - **Match by fact key**
     - **When matched** → update all columns if `modified_date` in `src` is greater than or equal to that in `trg`.
     - **When not matched** → insert the full record.

- **If the target table does not exist**  
  → It is created by writing `df_fact` in Delta format, in `append` mode.

Fact tables are usually append-only, so an upsert is rarely needed. It is kept here because updated source records do occasionally arrive; ideally, each run receives only new records.


In [0]:
from delta.tables import DeltaTable

In [0]:
if spark.catalog.tableExists(f"{catalog}.{target_schema}.{target_object}"):
    dlt_object = DeltaTable.forName(spark, f"{catalog}.{target_schema}.{target_object}")
    dlt_object.alias("trg").merge(df_fact.alias("src"), f"src.{fact_key} = trg.{fact_key}")\
        .whenMatchedUpdateAll(condition = f"src.{cdc_col} >= trg.{cdc_col}")\
        .whenNotMatchedInsertAll()\
        .execute()
else:
    df_fact.write.format("delta")\
        .mode("append")\
        .saveAsTable(f"{catalog}.{target_schema}.{target_object}")
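
To make the MERGE rules above concrete, here is a plain-Python simulation on lists of dictionaries. It is only an illustration of the matching logic, not Delta Lake code: the function `simulate_merge` and the sample rows are made up for this sketch.

```python
def simulate_merge(target_rows, source_rows, key, cdc_col):
    """Mimic the MERGE rules on lists of dicts (illustration only).

    - matched and src[cdc_col] >= trg[cdc_col] -> replace the target row
    - matched but src is older                 -> keep the target row
    - not matched                              -> insert the source row
    """
    merged = {row[key]: dict(row) for row in target_rows}
    for src in source_rows:
        trg = merged.get(src[key])
        if trg is None or src[cdc_col] >= trg[cdc_col]:
            merged[src[key]] = dict(src)
    return sorted(merged.values(), key=lambda row: row[key])


# Made-up sample rows; timestamps are strings in a sortable format
target = [
    {"sales_sk": 1, "total_amount": 100.0, "modified_date": "2025-08-19 10:00:00"},
    {"sales_sk": 2, "total_amount": 200.0, "modified_date": "2025-08-19 10:00:00"},
]
source = [
    {"sales_sk": 1, "total_amount": 150.0, "modified_date": "2025-08-20 09:00:00"},  # newer: updates
    {"sales_sk": 2, "total_amount": 999.0, "modified_date": "2025-08-18 09:00:00"},  # older: ignored
    {"sales_sk": 3, "total_amount": 300.0, "modified_date": "2025-08-20 09:00:00"},  # new: inserted
]

result = simulate_merge(target, source, key="sales_sk", cdc_col="modified_date")
print([row["total_amount"] for row in result])
```

One difference worth keeping in mind: this simulation processes source rows one by one, whereas a Delta Lake MERGE fails when several source rows match the same target row. If `df_fact` can contain duplicates of `fact_key`, it would need to be deduplicated before the merge.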

If the fact table has no key of its own, the combination of all dimension surrogate keys can serve as the merge condition:


In [0]:
# If the fact table doesn't have an ID (example: sales_id, sales_sk):
# fact_key_cols = ["customer_sk", "product_sk", "store_sk",
#                  "salesperson_sk", "campaign_sk", "date_sk", "time_sk"]

#fact_key_cols_str = " AND ".join([f"src.{col} = trg.{col}" for col in fact_key_cols])
#print(fact_key_cols_str)

# if spark.catalog.tableExists(f"{catalog}.{target_schema}.{target_object}"):
#     dlt_object = DeltaTable.forName(spark, f"{catalog}.{target_schema}.{target_object}")
#     dlt_object.alias("trg").merge(df_fact.alias("src"), fact_key_cols_str)\
#         .whenMatchedUpdateAll(condition = f"src.{cdc_col} >= trg.{cdc_col}")\
#         .whenNotMatchedInsertAll()\
#         .execute()
# else:
#     df_fact.write.format("delta")\
#         .mode("append")\
#         .saveAsTable(f"{catalog}.{target_schema}.{target_object}")
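
The two variants (single `fact_key` versus a combination of surrogate keys) can be folded into one helper that builds the merge condition. This is a minimal sketch under stated assumptions: the name `build_merge_condition` and the `null_safe` flag are hypothetical additions. The flag switches to Spark SQL's null-safe equality operator `<=>`, which may matter here because the surrogate keys come from `LEFT JOIN`s and can be `NULL`, and a plain `=` never matches `NULL` values.

```python
def build_merge_condition(fact_key, fallback_cols, null_safe=False):
    """Return the MERGE ON expression for the `src` and `trg` aliases.

    Uses fact_key when it is set; otherwise combines fallback_cols with AND.
    null_safe=True emits `<=>`, which treats two NULLs as equal.
    """
    cols = [fact_key] if fact_key else list(fallback_cols)
    if not cols:
        raise ValueError("Provide fact_key or at least one fallback column")
    op = "<=>" if null_safe else "="
    return " AND ".join(f"src.{col} {op} trg.{col}" for col in cols)


# With a key: same condition as the MERGE cell above
print(build_merge_condition("sales_sk", []))

# Without a key: fall back to the surrogate keys
print(build_merge_condition("", ["date_sk", "time_sk"], null_safe=True))
```

The returned string can be passed as the second argument of `merge(...)`, in place of the hard-coded condition.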