# Silver layer - FactFinance

This notebook transforms **bronze.FactFinance** into **silver.FactFinance** for a single year partition. It loads the table's validation rules, validates foreign keys against the silver dimension tables, removes fact rows that reference missing dimension keys, adds run metadata, and writes the curated table.

Parameters used by the pipeline to control execution:

* **in_parameter_run_id**: unique identifier for the pipeline run
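* **in_parameter_process_date**: execution date in `yyyy-MM-dd` format, stored for lineage
* **in_parameter_year**: year partition to process
* **out_parameter_count_processed**: number of rows processed, returned to the pipeline when the notebook exits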

In [None]:
in_parameter_run_id = 0
in_parameter_process_date = ""
in_parameter_year = 2010
out_parameter_count_processed = 0
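
The defaults above include an empty `in_parameter_process_date`, which cannot be parsed as a date. A minimal sketch of a fail-fast check is shown below. `validate_parameters` and the accepted year range are assumptions for illustration and are not part of the original notebook.

```python
from datetime import date, datetime


def validate_parameters(run_id: int, process_date: str, year: int) -> date:
    """Hypothetical helper: reject malformed pipeline parameters early."""
    if run_id < 0:
        raise ValueError(f"run_id must be non-negative, got {run_id}")
    if not 1900 <= year <= 2100:  # assumed plausible range, adjust as needed
        raise ValueError(f"year out of range: {year}")
    # strptime raises ValueError for "" or any string that does not match the format
    return datetime.strptime(process_date, "%Y-%m-%d").date()


parsed_date = validate_parameters(1, "2010-12-31", 2010)
```

Calling a check like this right after the parameter cell would stop a run before any data is read.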

Variables

In [None]:
v_table_name = "FactFinance"

## 1. Load validation rules

In [None]:
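# Import only the functions this notebook uses; a wildcard import would shadow Python built-ins such as sum, min, and max.
from pyspark.sql.functions import col, lit, to_date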

In [None]:
df_validation_rules = spark.read.table("control.validation_rules").filter(col("table_name") == v_table_name).toPandas()

## 2. Extract data

In [None]:
df = spark.read.format("delta").table(f"bronze.{v_table_name}")
df = df.filter(col("year") == in_parameter_year)

## 3. Validation

### 3.1. Validate foreign keys

#### Load dimension tables, selecting only the key columns

DimOrganization

In [None]:
df_dim_organization = spark.read.format("delta").table("silver.DimOrganization")
df_dim_organization = df_dim_organization.select("OrganizationKey").dropDuplicates()

DimDepartmentGroup

In [None]:
df_dim_department_group = spark.read.format("delta").table("silver.DimDepartmentGroup")
df_dim_department_group = df_dim_department_group.select("DepartmentGroupKey").dropDuplicates()

DimScenario

In [None]:
df_dim_scenario = spark.read.format("delta").table("silver.DimScenario")
df_dim_scenario = df_dim_scenario.select("ScenarioKey").dropDuplicates()

DimAccount

In [None]:
df_dim_account = spark.read.format("delta").table("silver.DimAccount")
df_dim_account = df_dim_account.select("AccountKey").dropDuplicates()

#### Identify missing foreign keys

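For each dimension, create a DataFrame of fact rows whose foreign key does not exist in the corresponding dimension, using a left anti join. Fact rows with a null key are also returned, because a null never satisfies the equality condition.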

DimOrganization

In [None]:
missing_dim_organization_fk = (
    df.alias("f")
    .join(df_dim_organization.alias("d"),
          on = col("f.OrganizationKey") == col("d.OrganizationKey"),
          how = "left_anti")
    .select("f.FinanceKey", col("f.OrganizationKey").alias("DimColumn"))
    .withColumn("DimTable", lit("DimOrganization"))
)

DimDepartmentGroup

In [None]:
missing_dim_department_group_fk = (
    df.alias("f")
    .join(df_dim_department_group.alias("d"),
          on = col("f.DepartmentGroupKey") == col("d.DepartmentGroupKey"),
          how = "left_anti")
    .select("f.FinanceKey", col("f.DepartmentGroupKey").alias("DimColumn"))
    .withColumn("DimTable", lit("DimDepartmentGroup"))
)

DimScenario

In [None]:
missing_df_dim_scenario_fk = (
    df.alias("f")
    .join(df_dim_scenario.alias("d"),
          on = col("f.ScenarioKey") == col("d.ScenarioKey"),
          how = "left_anti")
    .select("f.FinanceKey", col("f.ScenarioKey").alias("DimColumn"))
    .withColumn("DimTable", lit("DimScenario"))
)

DimAccount

In [None]:
missing_df_dim_account_fk = (
    df.alias("f")
    .join(df_dim_account.alias("d"),
          on = col("f.AccountKey") == col("d.AccountKey"),
          how = "left_anti")
    .select("f.FinanceKey", col("f.AccountKey").alias("DimColumn"))
    .withColumn("DimTable", lit("DimAccount"))
)
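
The four cells above apply the same left anti join pattern. The plain-Python sketch below mirrors its semantics on in-memory rows, including the treatment of null keys. It illustrates the logic only and does not replace the Spark code; `find_missing_fk` and the sample data are hypothetical.

```python
def find_missing_fk(fact_rows, dim_keys, fk_column, dim_table):
    """Return one record per fact row whose foreign key has no match in dim_keys."""
    # Drop None from the dimension side: in SQL a null key never equals anything.
    known = {key for key in dim_keys if key is not None}
    return [
        {"FinanceKey": row["FinanceKey"], "DimColumn": row[fk_column], "DimTable": dim_table}
        for row in fact_rows
        if row[fk_column] not in known  # None is never in `known`, so null keys are reported
    ]


# Hypothetical sample data, not taken from the real tables.
fact_rows = [
    {"FinanceKey": 1, "ScenarioKey": 1},
    {"FinanceKey": 2, "ScenarioKey": 9},     # 9 is not a known scenario
    {"FinanceKey": 3, "ScenarioKey": None},  # null foreign key
]
missing = find_missing_fk(fact_rows, dim_keys=[1, 2, 3],
                          fk_column="ScenarioKey", dim_table="DimScenario")
```

The output records have the same shape as the `missing_*_fk` DataFrames: `FinanceKey`, `DimColumn`, and `DimTable`.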

#### Filter invalid fact rows

Remove invalid fact rows by left-anti-joining the fact table against each “missing FK” set on FinanceKey, leaving only valid fact records for silver.

In [None]:
df = df.alias("f").join(missing_dim_organization_fk.alias("m"), on = col("f.FinanceKey") == col("m.FinanceKey"), how = "left_anti")
df = df.alias("f").join(missing_dim_department_group_fk.alias("m"), on = col("f.FinanceKey") == col("m.FinanceKey"), how = "left_anti")
df = df.alias("f").join(missing_df_dim_scenario_fk.alias("m"), on = col("f.FinanceKey") == col("m.FinanceKey"), how = "left_anti")
df = df.alias("f").join(missing_df_dim_account_fk.alias("m"), on = col("f.FinanceKey") == col("m.FinanceKey"), how = "left_anti")
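
A fact row can fail more than one foreign key check, so the number of rejected rows is not simply the sum of the four "missing FK" sets. The plain-Python sketch below shows the filtering logic on hypothetical data and derives a per-dimension count that could feed a validation log. `split_valid` and the sample records are assumptions for illustration.

```python
from collections import Counter

# Hypothetical "missing FK" records in the same shape as the DataFrames above.
missing = [
    {"FinanceKey": 2, "DimColumn": 99, "DimTable": "DimOrganization"},
    {"FinanceKey": 2, "DimColumn": 77, "DimTable": "DimAccount"},
    {"FinanceKey": 3, "DimColumn": None, "DimTable": "DimScenario"},
]
fact_keys = [1, 2, 3, 4]


def split_valid(fact_keys, missing):
    """Return the valid FinanceKeys and the number of failures per dimension."""
    invalid = {m["FinanceKey"] for m in missing}  # a key counts once even if it fails twice
    valid = [key for key in fact_keys if key not in invalid]
    failures_per_dimension = Counter(m["DimTable"] for m in missing)
    return valid, failures_per_dimension


valid_keys, failures_per_dimension = split_valid(fact_keys, missing)
```

Here FinanceKey 2 fails two checks but is removed only once, which matches the effect of chaining the four anti joins.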

## 4. Load silver data

Add run metadata to track when the data was loaded and which run loaded it.

In [None]:
df = df.withColumn("year", lit(in_parameter_year))
# Parse the yyyy-MM-dd string parameter into a date column in one step
df = df.withColumn("process_date", to_date(lit(in_parameter_process_date), "yyyy-MM-dd"))
df = df.withColumn("run_id", lit(in_parameter_run_id))

Save the data by overwriting only the partition for the selected year (replaceWhere), making the load rerunnable without duplicates.

In [None]:
df.write.format("delta").mode("overwrite").option("replaceWhere", f"year = {in_parameter_year}").saveAsTable(f"silver.{v_table_name}")
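
The `replaceWhere` predicate above is built by interpolating the year parameter into a string. Pipeline parameters may arrive as strings, so a small guard can ensure that only an integer reaches the predicate. This is a minimal sketch; `build_replace_where` is a hypothetical helper and not part of the original notebook.

```python
def build_replace_where(year) -> str:
    """Hypothetical helper: build the replaceWhere predicate from a validated year."""
    # int(str(...)) raises ValueError for anything that is not a plain integer,
    # such as "2010 OR 1=1" or 2010.7
    validated_year = int(str(year))
    return f"year = {validated_year}"


predicate = build_replace_where("2010")
```

The resulting string could be passed as `.option("replaceWhere", predicate)` in the write call.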

Calculate the total number of rows processed and return it to the pipeline.

In [None]:
out_parameter_count_processed = df.count()

In [None]:
mssparkutils.notebook.exit(out_parameter_count_processed)