# Gold Layer: Dimension Customers
This notebook creates the final `dim_customers` table by joining CRM and ERP data sources.
- **Goal**: Create a Single Version of Truth (SVOT) for customer data.
- **Logic**: 
    - `gender` is taken from CRM unless it is missing or `'n/a'`; in that case the ERP value is used, falling back to `'n/a'` when ERP has none (`COALESCE`).
    - Implements a **Surrogate Key** using `ROW_NUMBER()`.
    - Joins CRM Customers, ERP Customers, and ERP Locations.
- **Output**: `workspace.gold.dim_customers`
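
The `gender` rule listed under Logic can be read as a small decision function. The following is a plain-Python sketch of that precedence only, not the Spark implementation; `resolve_gender` and `NOT_AVAILABLE` are hypothetical names, and `None` stands in for SQL `NULL`.

```python
from typing import Optional

NOT_AVAILABLE = "n/a"  # placeholder assumed to mark unknown values in the Silver tables


def resolve_gender(crm_gender: Optional[str], erp_gender: Optional[str]) -> str:
    """Return the CRM gender when it is known, else the ERP gender, else 'n/a'."""
    # CRM wins whenever it carries a real value (not NULL and not the placeholder).
    if crm_gender is not None and crm_gender != NOT_AVAILABLE:
        return crm_gender
    # Otherwise fall back to ERP; a missing ERP value becomes the placeholder.
    return erp_gender if erp_gender is not None else NOT_AVAILABLE


print(resolve_gender("Male", "Female"))  # CRM value is kept
print(resolve_gender("n/a", "Female"))   # ERP fills the gap
print(resolve_gender("n/a", None))       # nothing known in either source
```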

In [0]:
%python
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Define source and target table paths
SILVER_CRM_CUST = "workspace.silver.crm_customers"
SILVER_ERP_CUST = "workspace.silver.erp_customers"
SILVER_ERP_LOC  = "workspace.silver.erp_locations"
GOLD_TARGET     = "workspace.gold.dim_customers"

In [0]:
%python
# 1. Extraction & Integration: Joining disparate Silver tables
query = f"""
SELECT
    ci.customer_id,
    ci.customer_number,
    ci.first_name,
    ci.last_name,
    la.country_name AS country,
    ci.marital_status,
    CASE 
        WHEN ci.gender <> 'n/a' THEN ci.gender
        ELSE COALESCE(ca.gender, 'n/a')
    END AS gender,
    ca.birth_date AS birthdate,
    ci.created_date AS create_date
FROM {SILVER_CRM_CUST} ci
LEFT JOIN {SILVER_ERP_CUST} ca
    ON ci.customer_number = ca.customer_number
LEFT JOIN {SILVER_ERP_LOC} la
    ON ci.customer_number = la.customer_number
"""

df_raw = spark.sql(query)

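# 2. Data Cleansing: Deduplicating by keeping the most recent record per customer_id.
# orderBy() followed by dropDuplicates() does not guarantee which row survives,
# so rank the rows explicitly and keep the first one in each group.
w_latest = Window.partitionBy("customer_id").orderBy(F.col("create_date").desc())
df_clean = df_raw.withColumn("_rn", F.row_number().over(w_latest)) \
                 .filter(F.col("_rn") == 1) \
                 .drop("_rn")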

# 3. Surrogate Key Assignment: Creating a unique numeric key for the Data Warehouse
w = Window.orderBy("customer_id")
df_gold = df_clean.withColumn("customer_key", F.row_number().over(w)) \
                  .withColumn("gold_ingestion_ts", F.current_timestamp())

# 4. Final Load: Writing to the Gold Layer in Delta format
df_gold.write.mode("overwrite").format("delta").saveAsTable(GOLD_TARGET)

print(f"✅ Customer Dimension created. Removed {df_raw.count() - df_clean.count()} duplicate records.")
display(df_gold.limit(10))
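
Steps 2 and 3 can also be illustrated outside Spark. This is a minimal plain-Python sketch, assuming rows are dictionaries with `customer_id` and `create_date` keys (ISO date strings here, so string comparison follows chronological order). `latest_per_customer` and `assign_surrogate_keys` are hypothetical helper names, not part of the notebook.

```python
from typing import Dict, List


def latest_per_customer(rows: List[dict]) -> List[dict]:
    """Keep one row per customer_id: the one with the greatest create_date."""
    latest: Dict[int, dict] = {}
    for row in rows:
        current = latest.get(row["customer_id"])
        if current is None or row["create_date"] > current["create_date"]:
            latest[row["customer_id"]] = row
    return list(latest.values())


def assign_surrogate_keys(rows: List[dict]) -> List[dict]:
    """Number rows 1..n in customer_id order, like ROW_NUMBER() OVER (ORDER BY customer_id)."""
    ordered = sorted(rows, key=lambda r: r["customer_id"])
    return [{**row, "customer_key": key} for key, row in enumerate(ordered, start=1)]


sample = [
    {"customer_id": 20, "create_date": "2025-01-05"},
    {"customer_id": 10, "create_date": "2025-01-01"},
    {"customer_id": 10, "create_date": "2025-03-01"},  # newer duplicate of customer 10
]
dim = assign_surrogate_keys(latest_per_customer(sample))
for row in dim:
    print(row)
```

Because the keys are regenerated from scratch on every overwrite, a customer's `customer_key` can change between runs when a new `customer_id` sorts ahead of it.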

In [0]:
%python
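# Standalone load cell: overwrite mode replaces the whole table,
# so running it after step 4 rewrites the same data in Delta format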
df_gold.write \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable(GOLD_TARGET)

print(f"Successfully created: {GOLD_TARGET}")

## Data Quality Check

In [0]:
%sql
SELECT 
    country, 
    count(*) as total_customers,
    count(birthdate) as has_birthdate,
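    -- gender defaults to 'n/a' rather than NULL, so count only known values
    count(CASE WHEN gender <> 'n/a' THEN 1 END) as has_gender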
FROM workspace.gold.dim_customers
GROUP BY ALL
ORDER BY total_customers DESC   
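
The counts returned by this check are easier to compare across countries as ratios. This is a small plain-Python sketch, assuming each result row is available as a dictionary keyed by the query's column names; `completeness` is a hypothetical helper, and the sample figures are made up for illustration.

```python
def completeness(row: dict) -> dict:
    """Convert the per-country counts into fractions of total_customers."""
    total = row["total_customers"]
    if total == 0:
        # Avoid dividing by zero for an empty group.
        return {"country": row["country"], "birthdate_ratio": 0.0, "gender_ratio": 0.0}
    return {
        "country": row["country"],
        "birthdate_ratio": row["has_birthdate"] / total,
        "gender_ratio": row["has_gender"] / total,
    }


# Illustrative row only; these are not real query results.
example = {"country": "Germany", "total_customers": 200, "has_birthdate": 150, "has_gender": 200}
result = completeness(example)
print(result)
```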