## Bronze Layer - Incremental Load (Azure SQL Database)
- **Purpose**: Extract data from SpaceParts training database
- **Layer**: Bronze (Raw Data)
- **Load Type**: Incremental

---
### Parameters
---


In [None]:
import os
from datetime import datetime, timedelta
from pyspark.sql.functions import *
from pyspark.sql.types import *

execution_date = os.environ.get("execution_date", datetime.now().isoformat())
lookback_days = int(os.environ.get("lookback_days", "1"))  # Days to look back for the incremental load


---
### Dependencies
---

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import StorageLevel
from datetime import datetime, timedelta
import logging
import pandas as pd
import re, unicodedata
from collections import defaultdict
from pyspark.sql.functions import col, max as fmax


---
### Optimization settings
---

In [None]:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")


---
### Logging configuration
---

In [None]:
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


---
### Connection credentials
---

In [None]:
class BronzeIncrementalLoader:
    def __init__(self, spark_session):
        self.spark = spark_session
        self.server = "te3-training-eu.database.windows.net"
        self.database = "SpacePartsCoDW"
        self.username = "dwreader@te3-training-eu"
        self.password = "TE3#reader!"
    
    def get_jdbc_connection_properties(self):
        jdbc_url = f"jdbc:sqlserver://{self.server}:1433;database={self.database};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
    
        return {
            "url": jdbc_url,
            "user": self.username,
            "password": self.password,
            "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
        }
    
    def get_last_execution_timestamp(self, table_name: str):
        """Return the timestamp of the last successful run for a table"""
        try:
            if not self.spark._jsparkSession.catalog().tableExists("bronze_incremental_control"):
                bronze_table_name = clean_bronze_table_name(*table_name.split('.'))
                if self.spark._jsparkSession.catalog().tableExists(bronze_table_name):
                    max_date = self.spark.table(bronze_table_name).agg(max("dwcreateddate")).collect()[0][0]
                    logger.info(f"Usando fecha máxima de tabla bronze para {table_name}: {max_date}")
                    return max_date
                return None
                
            control_df = self.spark.table("bronze_incremental_control")
            last_run = control_df.filter(
                (col("table_name") == table_name) & 
                (col("status") == "success")
            ).orderBy(col("execution_timestamp").desc()).limit(1).collect()
            
            if last_run:
                return last_run[0]["last_extracted_timestamp"]
            else:
                bronze_table_name = clean_bronze_table_name(*table_name.split('.'))
                if self.spark._jsparkSession.catalog().tableExists(bronze_table_name):
                    max_date = self.spark.table(bronze_table_name).agg(max("dwcreateddate")).collect()[0][0]
                    logger.info(f"Primera ejecución incremental para {table_name}. Usando fecha máxima: {max_date}")
                    return max_date
                return None
        except Exception as e:
            logger.warning(f"No se pudo obtener timestamp para {table_name}: {e}")
            return None
    
    def update_execution_control(self, table_name: str, last_extracted_timestamp, status: str, record_count: int = 0):
        """Record the details of this run in the control table"""
        control_data = [(
            table_name,
            execution_date,
            datetime.now(),
            last_extracted_timestamp,
            status,
            record_count
        )]
        
        control_schema = StructType([
            StructField("table_name", StringType(), True),
            StructField("execution_id", StringType(), True),
            StructField("execution_timestamp", TimestampType(), True),
            StructField("last_extracted_timestamp", TimestampType(), True),
            StructField("status", StringType(), True),
            StructField("record_count", IntegerType(), True)
        ])
        
        control_df = self.spark.createDataFrame(control_data, control_schema)
        control_df.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable("bronze_incremental_control")


In [None]:
bronze_loader = BronzeIncrementalLoader(spark)
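
The class above embeds the server, user, and password directly in the notebook. A minimal sketch of one alternative, assuming the values are supplied through environment variables, is shown below. The variable names (`BRONZE_SQL_SERVER`, `BRONZE_SQL_DATABASE`, `BRONZE_SQL_USER`, `BRONZE_SQL_PASSWORD`) and the helper `jdbc_properties_from_env` are hypothetical; only the JDBC URL format and the driver class are taken from the cell above.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable names; adapt them to however secrets are provided.
REQUIRED_VARS = (
    "BRONZE_SQL_SERVER",
    "BRONZE_SQL_DATABASE",
    "BRONZE_SQL_USER",
    "BRONZE_SQL_PASSWORD",
)


def jdbc_properties_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the same properties dict as get_jdbc_connection_properties,
    reading every value from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    url = (
        f"jdbc:sqlserver://{env['BRONZE_SQL_SERVER']}:1433;"
        f"database={env['BRONZE_SQL_DATABASE']};"
        "encrypt=true;trustServerCertificate=false;"
        "hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
    )
    return {
        "url": url,
        "user": env["BRONZE_SQL_USER"],
        "password": env["BRONZE_SQL_PASSWORD"],
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.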


---
### Get the list of tables
---

In [None]:
try:
    tables_query = """(
        SELECT TABLE_SCHEMA, TABLE_NAME 
        FROM INFORMATION_SCHEMA.TABLES 
        WHERE TABLE_TYPE='BASE TABLE' 
        AND TABLE_SCHEMA IN ('dim', 'fact')
    ) as tables_query"""
    
    connection_props = bronze_loader.get_jdbc_connection_properties()
    
    tables_df = spark.read \
        .format("jdbc") \
        .option("url", connection_props["url"]) \
        .option("dbtable", tables_query) \
        .option("user", connection_props["user"]) \
        .option("password", connection_props["password"]) \
        .option("driver", connection_props["driver"]) \
        .load()
    
    tables_df = tables_df.orderBy("TABLE_SCHEMA", "TABLE_NAME")    
    tables_list = tables_df.collect()
    
    print(f"Se encontraron {len(tables_list)} tablas para carga incremental:")
    for row in tables_list:
        print(f"  - {row.TABLE_SCHEMA}.{row.TABLE_NAME}")

except Exception as e:
    print(f"Error obteniendo lista de tablas: {str(e)}")
    tables_list = [
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Brands'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Budget-Rate'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Customers'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Employees'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Exchange-Rate'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Invoice-DocType'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Order-DocType'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Order-Status'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Products'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'dim', 'TABLE_NAME': 'Regions'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'fact', 'TABLE_NAME': 'Budget'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'fact', 'TABLE_NAME': 'Forecast'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'fact', 'TABLE_NAME': 'Invoices'}),
        type('obj', (object,), {'TABLE_SCHEMA': 'fact', 'TABLE_NAME': 'Orders'})
    ]


Se encontraron 14 tablas para carga incremental:
  - dim.Brands
  - dim.Budget-Rate
  - dim.Customers
  - dim.Employees
  - dim.Exchange-Rate
  - dim.Invoice-DocType
  - dim.Order-DocType
  - dim.Order-Status
  - dim.Products
  - dim.Regions
  - fact.Budget
  - fact.Forecast
  - fact.Invoices
  - fact.Orders


---
### Normalization functions (kept from the original)
---

In [None]:
FORBIDDEN_CHARS = r"[ ,;{}\(\)\n\t=]+"
RESERVED = {
    "select","from","where","group","order","by","having","limit","offset",
    "and","or","not","as","on","join","inner","left","right","full","cross",
    "desc","asc","table","column","index","view","database","schema","create",
    "drop","alter","insert","update","delete","merge","into","values","set",
    "case","when","then","else","end","union","all","distinct","true","false",
    "null"
}

def strip_accents(text: str) -> str:
    t = unicodedata.normalize("NFKD", str(text))
    return "".join([c for c in t if not unicodedata.combining(c)])

def clean_identifier(name: str) -> str:
    if name is None: return "col"
    s = strip_accents(str(name).strip())
    s = re.sub(FORBIDDEN_CHARS, "_", s)
    s = s.replace(".", "_").replace("-", "_").replace("/", "_").replace("\\", "_")
    s = re.sub(r"[^0-9a-zA-Z_]", "", s)
    s = re.sub(r"_+", "_", s).strip("_").lower()
    if re.match(r"^[0-9]", s): s = "c_" + s
    if s in RESERVED: s = s + "_col"
    if not s: s = "col"
    return s[:128]

def split_schema_table(table_name: str):
    val = str(table_name).strip()
    if "." in val:
        s, t = val.split(".", 1)
    else:
        s, t = "dbo", val
    return s.strip(), t.strip()

def clean_bronze_table_name(schema: str, table: str) -> str:
    schema_c = clean_identifier(schema)
    table_c  = clean_identifier(table)
    return f"bronze_{schema_c}_{table_c}"



---
### Incremental extraction
---

In [None]:
if not hasattr(BronzeIncrementalLoader, "extract_table_incremental"):
    def _extract_table_incremental(self, schema: str, table: str, timestamp_column: str = "DWCreatedDate"):
        """Extract incremental data based on the last load timestamp"""
        props = self.get_jdbc_connection_properties()
        full_name = f"[{schema}].[{table}]"
        last_timestamp = self.get_last_execution_timestamp(f"{schema}.{table}")
        
        if last_timestamp is None:
            logger.info(f"Primera carga para {full_name} - extrayendo todos los datos")
            query = f"SELECT * FROM {full_name}"
            load_type = "initial"
        else:
            safe_timestamp = last_timestamp - timedelta(hours=1)  # 1-hour overlap
            logger.info(f"Carga incremental para {full_name} desde {safe_timestamp}")
            query = f"""
                SELECT * FROM {full_name} 
                WHERE {timestamp_column} > '{safe_timestamp.strftime('%Y-%m-%d %H:%M:%S')}'
            """
            load_type = "incremental"

        df = (
            self.spark.read.format("jdbc")
            .option("url", props["url"])
            .option("user", props["user"])
            .option("password", props["password"])
            .option("driver", props["driver"])
            .option("query", query)
            .option("fetchsize", "10000")
            .load()
        )
        
        return df, load_type
    
    BronzeIncrementalLoader.extract_table_incremental = _extract_table_incremental

if not hasattr(BronzeIncrementalLoader, "save_bronze_table_incremental"):
    def _save_bronze_table_incremental(self, df_spark, bronze_table_name: str, load_type: str) -> int:
        """Save data in incremental mode: append when the table exists (a prior full load is assumed), otherwise create it"""
        record_count = df_spark.count()
        
        if record_count == 0:
            logger.info(f"No hay datos nuevos para {bronze_table_name}")
            return 0
        
        table_exists = self.spark._jsparkSession.catalog().tableExists(bronze_table_name)
        
        if not table_exists:
            logger.warning(f"Tabla {bronze_table_name} no existe. ¿Se ejecutó el full load primero?")
            df_spark.write.mode("overwrite").format("delta").saveAsTable(bronze_table_name)
            logger.info(f"Tabla creada para {bronze_table_name}: {record_count:,} records")
        else:
            df_spark.write.mode("append").format("delta").saveAsTable(bronze_table_name)
            logger.info(f"Datos incrementales agregados a {bronze_table_name}: {record_count:,} records")
            
        return record_count
    
    BronzeIncrementalLoader.save_bronze_table_incremental = _save_bronze_table_incremental
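
The extraction method rewinds the stored watermark by one hour before filtering, so rows inside that window are read again on every run, and the append in `save_bronze_table_incremental` writes them again. The sketch below isolates the query-building step as a pure function so the window is easy to inspect. `build_incremental_query` and its `overlap` parameter are hypothetical names, not part of the loader.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def build_incremental_query(
    schema: str,
    table: str,
    last_timestamp: Optional[datetime],
    timestamp_column: str = "DWCreatedDate",
    overlap: timedelta = timedelta(hours=1),
) -> Tuple[str, str]:
    """Return (query, load_type) following the same rules as the loader."""
    full_name = f"[{schema}].[{table}]"
    if last_timestamp is None:
        # No watermark yet: read the whole table.
        return f"SELECT * FROM {full_name}", "initial"

    # Rewind the watermark so late-arriving rows are not missed.
    safe_timestamp = last_timestamp - overlap
    literal = safe_timestamp.strftime("%Y-%m-%d %H:%M:%S")
    query = f"SELECT * FROM {full_name} WHERE {timestamp_column} > '{literal}'"
    return query, "incremental"


query, load_type = build_incremental_query(
    "dim", "Brands", datetime(2023, 2, 10, 14, 52, 7)
)
print(load_type)
print(query)
```

Because the overlap window is re-read, a later step would typically need to drop repeated rows (for example by key and timestamp) unless duplicates in the bronze tables are acceptable.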



---
### Name normalization for loading
---

In [None]:
FORBIDDEN_CHARS = r"[ ,;{}\(\)\n\t=]+"
RESERVED = {
    "select","from","where","group","order","by","having","limit","offset",
    "and","or","not","as","on","join","inner","left","right","full","cross",
    "desc","asc","table","column","index","view","database","schema","create",
    "drop","alter","insert","update","delete","merge","into","values","set",
    "case","when","then","else","end","union","all","distinct","true","false",
    "null"
}

def strip_accents(text: str) -> str:
    t = unicodedata.normalize("NFKD", str(text))
    return "".join([c for c in t if not unicodedata.combining(c)])

def clean_identifier(name: str) -> str:
    if name is None:
        return "col"
    s = strip_accents(str(name).strip())
    s = re.sub(FORBIDDEN_CHARS, "_", s)
    s = s.replace(".", "_").replace("-", "_").replace("/", "_").replace("\\", "_")
    s = re.sub(r"[^0-9a-zA-Z_]", "", s)
    s = re.sub(r"_+", "_", s).strip("_").lower()
    if re.match(r"^[0-9]", s):
        s = "c_" + s
    if s in RESERVED:
        s = s + "_col"
    if not s:
        s = "col"
    return s[:128]

def split_schema_table(table_name: str):
    val = str(table_name).strip()
    if "." in val:
        s, t = val.split(".", 1)
    else:
        s, t = "dbo", val
    return s.strip(), t.strip()

def clean_bronze_table_name(schema: str, table: str) -> str:
    schema_c = clean_identifier(schema)
    table_c  = clean_identifier(table)
    return f"bronze_{schema_c}_{table_c}"

def build_column_mapping_from_df(columns_df):
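    # rename() expects {old_name: new_name}, so map each original name to its lowercased, stripped form
    cols_norm = {c: c.lower().strip() for c in columns_df.columns}
    df = columns_df.rename(columns=cols_norm)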

    table_col  = next((c for c in df.columns if c in ["table","table_name","tabla","table_name_full","tabla_origen"]), None)
    column_col = next((c for c in df.columns if c in ["column","column_name","columna","nombre_columna"]), None)
    if table_col is None or column_col is None:
        raise ValueError(f"Se requieren columnas 'table_name' y 'column_name' (o equivalentes). Encontradas: {list(df.columns)}")

    per_table_seen = defaultdict(set)
    mapping = defaultdict(dict)
    for _, r in df.iterrows():
        schema, table = split_schema_table(r[table_col])
        full = f"{schema}.{table}"
        old  = str(r[column_col]).strip()
        new  = clean_identifier(old)

        base = new; k = 1
        while new in per_table_seen[full]:
            k += 1
            new = f"{base}_{k}"
        per_table_seen[full].add(new)

        mapping[full][old] = new
    return mapping

def apply_column_mapping(df_spark, schema: str, table: str, mapping_by_table: dict):
    full = f"{schema}.{table}"
    if full not in mapping_by_table:
        return df_spark
    mp = mapping_by_table[full]
    for old_col, new_col in mp.items():
        if old_col in df_spark.columns and old_col != new_col:
            df_spark = df_spark.withColumnRenamed(old_col, new_col)
    return df_spark

def normalize_df_with_mapping_or_clean(df_spark, schema: str, table: str, mapping_by_table: dict):
    """
    1) Apply the per-table mapping if one exists.
    2) Clean any remaining column with clean_identifier to ensure compatibility with Delta.
    """
    df_norm = apply_column_mapping(df_spark, schema, table, mapping_by_table)
    fixes = {}
    for c in df_norm.columns:
        cleaned = clean_identifier(c)
        if cleaned != c:
            fixes[c] = cleaned
    for old_col, new_col in fixes.items():
        if old_col in df_norm.columns and old_col != new_col:
            df_norm = df_norm.withColumnRenamed(old_col, new_col)
    return df_norm

def find_col(df_spark, candidates):
    """
    Return the actual name of the first column that matches any of the candidates (case-insensitive).
    If none matches, retry with underscores removed.
    """
    cols = [c.lower() for c in df_spark.columns]
    for cand in candidates:
        c = cand.lower()
        if c in cols:
            return df_spark.columns[cols.index(c)]
    cols_rel = [c.replace("_", "") for c in cols]
    for cand in candidates:
        c = cand.lower().replace("_", "")
        if c in cols_rel:
            return df_spark.columns[cols_rel.index(c)]
    return None

try:
    mapping_by_table
except NameError:
    mapping_by_table = {}  
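
`build_column_mapping_from_df` adds a numeric suffix when two source columns clean to the same name, but `normalize_df_with_mapping_or_clean` renames leftover columns without that check. A minimal sketch of the same suffixing idea, applied to a plain list of already-cleaned names, follows. `dedupe_names` is a hypothetical helper, not part of the notebook.

```python
from typing import Iterable, List


def dedupe_names(names: Iterable[str]) -> List[str]:
    """Make names unique by appending _2, _3, ... to later repeats,
    mirroring the suffix rule in build_column_mapping_from_df."""
    seen = set()
    unique = []
    for name in names:
        candidate, k = name, 1
        while candidate in seen:
            k += 1
            candidate = f"{name}_{k}"
        seen.add(candidate)
        unique.append(candidate)
    return unique


print(dedupe_names(["order_id", "order_id", "amount", "order_id"]))
# → ['order_id', 'order_id_2', 'amount', 'order_id_3']
```

The resulting list could then be applied in one step with `DataFrame.toDF(*names)`, assuming the list is in the same order as the DataFrame's columns.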


---
### Running the incremental process
---

In [None]:
extraction_results = []
total_records = 0

for table_row in tables_list:
    schema = table_row.TABLE_SCHEMA
    table = table_row.TABLE_NAME
    bronze_table_name = clean_bronze_table_name(schema, table)
    
    df_extracted = None  # reset per table so `finally` never sees an undefined or stale DataFrame
    try:
        df_extracted, load_type = bronze_loader.extract_table_incremental(schema, table)

        if df_extracted is not None and len(df_extracted.columns) > 0:
            df_extracted = normalize_df_with_mapping_or_clean(df_extracted, schema, table, mapping_by_table)

        if df_extracted is not None:
            df_extracted = df_extracted.persist(StorageLevel.MEMORY_AND_DISK)

        max_timestamp = None
        if df_extracted is not None and df_extracted.count() > 0:
            ts_col = find_col(df_extracted, ["DWCreatedDate", "dwcreateddate", "dw_created_date"])
            if ts_col:
                max_timestamp = df_extracted.agg(fmax(col(ts_col))).collect()[0][0]
            if max_timestamp is None:
                max_timestamp = datetime.now()

        record_count = bronze_loader.save_bronze_table_incremental(
            df_extracted, bronze_table_name, load_type
        )

        bronze_loader.update_execution_control(
            f"{schema}.{table}", 
            max_timestamp, 
            "success", 
            record_count
        )

        extraction_results.append({
            "source_table": f"{schema}.{table}",
            "bronze_table": bronze_table_name,
            "record_count": record_count,
            "load_type": load_type,
            "status": "success",
            "last_timestamp": max_timestamp
        })
        total_records += record_count
        print(f"{bronze_table_name} ({load_type}): {record_count:,} records")

    except Exception as e:
        bronze_loader.update_execution_control(f"{schema}.{table}", None, "failed", 0)
        extraction_results.append({
            "source_table": f"{schema}.{table}",
            "bronze_table": bronze_table_name,
            "record_count": 0,
            "load_type": "failed",
            "status": "failed",
            "error": str(e)
        })
        print(f"{bronze_table_name}: FAILED - {str(e)}")
    finally:
        try:
            if df_extracted is not None:
                df_extracted.unpersist()
        except Exception:
            pass
    
    print("-" * 50)

print("-" * 60)



INFO:__main__:Primera ejecución incremental para dim.Brands. Usando fecha máxima: 2023-02-10 14:52:07.983000
INFO:__main__:Carga incremental para [dim].[Brands] desde 2023-02-10 13:52:07.983000
INFO:__main__:Datos incrementales agregados a bronze_dim_brands: 20 records
INFO:__main__:Primera ejecución incremental para dim.Budget-Rate. Usando fecha máxima: 2023-02-10 14:52:08.250000
INFO:__main__:Carga incremental para [dim].[Budget-Rate] desde 2023-02-10 13:52:08.250000
INFO:__main__:Datos incrementales agregados a bronze_dim_budget_rate: 15 records
INFO:__main__:Primera ejecución incremental para dim.Customers. Usando fecha máxima: 2023-02-10 14:52:09.250000
INFO:__main__:Carga incremental para [dim].[Customers] desde 2023-02-10 13:52:09.250000
INFO:__main__:Datos incrementales agregados a bronze_dim_customers: 3,911 records
INFO:__main__:Primera ejecución incremental para dim.Employees. Usando fecha máxima: 2023-02-10 14:52:07.963000
INFO:__main__:Carga incremental para [dim].[Employe

bronze_dim_brands (incremental): 20 records
--------------------------------------------------
bronze_dim_customers (incremental): 3,911 records
--------------------------------------------------
bronze_dim_employees (incremental): 893 records
--------------------------------------------------
bronze_dim_order_doctype (incremental): 4 records
--------------------------------------------------
bronze_dim_order_status (incremental): 6 records
--------------------------------------------------
bronze_dim_products (incremental): 256,293 records
--------------------------------------------------
bronze_dim_regions (incremental): 181 records
--------------------------------------------------
bronze_fact_forecast (incremental): 5,197 records
--------------------------------------------------
bronze_fact_orders (incremental): 16,910,069 records
--------------------------------------------------
------------------------------------------------------------


---
### Improved execution log
---

In [None]:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

execution_log_data = [(
    execution_date,
    "bronze_incremental_load_azure",
    datetime.now(),
    "completed" if all(r["status"] == "success" for r in extraction_results) else "completed_with_errors",
    "bronze",
    "incremental", 
    total_records,
    len([r for r in extraction_results if r["status"] == "success"]),
    len([r for r in extraction_results if r["status"] == "failed"]),
    len([r for r in extraction_results if r.get("load_type") == "initial"]),
    len([r for r in extraction_results if r.get("load_type") == "incremental"]),
    str(extraction_results)[:1000]
)]

execution_log_schema = StructType([
    StructField("execution_id", StringType(), True),
    StructField("pipeline_name", StringType(), True),
    StructField("execution_timestamp", TimestampType(), True),
    StructField("status", StringType(), True),
    StructField("layer", StringType(), True),
    StructField("load_type", StringType(), True),
    StructField("total_records", LongType(), True),
    StructField("successful_tables", IntegerType(), True),
    StructField("failed_tables", IntegerType(), True),
    StructField("initial_loads", IntegerType(), True),
    StructField("incremental_loads", IntegerType(), True),
    StructField("details", StringType(), True)
])

execution_log = spark.createDataFrame(execution_log_data, execution_log_schema)

execution_log.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("bronze_execution_log")



---
### Incremental process summary
---

In [None]:
successful_loads = len([r for r in extraction_results if r["status"] == "success"])
failed_loads = len([r for r in extraction_results if r["status"] == "failed"])
initial_loads = len([r for r in extraction_results if r.get("load_type") == "initial"])
incremental_loads = len([r for r in extraction_results if r.get("load_type") == "incremental"])

execution_summary = {
    "status": "completed" if failed_loads == 0 else "completed_with_errors",
    "successful_tables": successful_loads,
    "failed_tables": failed_loads,
    "initial_loads": initial_loads,
    "incremental_loads": incremental_loads,
    "total_records": total_records,
    "execution_date": execution_date,   
    "data_source": "azure_sql_database",
    "load_type": "incremental"
}

print("INCREMENTAL EXTRACTION SUMMARY:")
print(f"Successful extractions: {successful_loads}")
print(f"  - Initial loads: {initial_loads}")
print(f"  - Incremental loads: {incremental_loads}")
print(f"Failed extractions: {failed_loads}")
print(f"Total records extracted: {total_records:,}")

if successful_loads > 0:
    print("\nDetalle de cargas exitosas:")
    for result in extraction_results:
        if result["status"] == "success":
            print(f"  {result['bronze_table']} ({result['load_type']}): {result['record_count']:,} records")



INCREMENTAL EXTRACTION SUMMARY:
Successful extractions: 14
  - Initial loads: 0
  - Incremental loads: 14
Failed extractions: 0
Total records extracted: 38,641,746

Detalle de cargas exitosas:
  bronze_dim_brands (incremental): 20 records
  bronze_dim_budget_rate (incremental): 15 records
  bronze_dim_customers (incremental): 3,911 records
  bronze_dim_employees (incremental): 893 records
  bronze_dim_exchange_rate (incremental): 57,900 records
  bronze_dim_invoice_doctype (incremental): 5 records
  bronze_dim_order_doctype (incremental): 4 records
  bronze_dim_order_status (incremental): 6 records
  bronze_dim_products (incremental): 256,293 records
  bronze_dim_regions (incremental): 181 records
  bronze_fact_budget (incremental): 2,947,811 records
  bronze_fact_forecast (incremental): 5,197 records
  bronze_fact_invoices (incremental): 18,459,441 records
  bronze_fact_orders (incremental): 16,910,069 records


---
### Post-load optimization
---

### Objective
Run the **Delta Lake** `OPTIMIZE` command on every table that was loaded **incrementally and successfully**, in order to **improve read performance** and reduce file fragmentation.

---

### How it works

1. **Iterate over the results**  
   Loop over the entries in `extraction_results`, one per processed table.  

2. **Filter the tables**  
   Only tables that meet both conditions are selected:  
   - `status = "success"` → successful loads.  
   - `load_type = "incremental"` → incremental loads only.  

3. **Optimize with Delta Lake**  
   - Run `spark.sql("OPTIMIZE <table>")`.  
   - This command **compacts small Parquet files** into fewer, larger files.  
   - Result: faster queries and lower latency in analysis.  

4. **Error handling**  
   - If an optimization fails (permissions, missing table, etc.), the exception is caught and the error is printed.  

5. **Completion**  
   - When the loop ends, a message reports that the incremental load process has finished.  

---

### Benefit
`OPTIMIZE` is especially useful in incremental processes because it:  
- Reduces the number of small files generated by each load.  
- Can significantly improve query efficiency on Delta tables.  
- Prepares the tables for better performance in later analysis. 
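
The selection and command-building steps described in this section can be sketched without a Spark session. `optimize_statements` is a hypothetical helper. It assumes each entry in `extraction_results` carries the `status`, `load_type`, and `bronze_table` keys recorded by the loading loop.

```python
from typing import Dict, Iterable, List


def optimize_statements(results: Iterable[Dict]) -> List[str]:
    """Return one OPTIMIZE statement per successful incremental load."""
    return [
        f"OPTIMIZE {r['bronze_table']}"
        for r in results
        if r.get("status") == "success" and r.get("load_type") == "incremental"
    ]


# Illustrative entries only; real values come from the loading loop.
sample_results = [
    {"bronze_table": "bronze_dim_brands", "status": "success", "load_type": "incremental"},
    {"bronze_table": "bronze_dim_regions", "status": "success", "load_type": "initial"},
    {"bronze_table": "bronze_fact_orders", "status": "failed", "load_type": "failed"},
]

for statement in optimize_statements(sample_results):
    print(statement)  # in the notebook, this string would be passed to spark.sql(...)
```

With the sample entries, only `bronze_dim_brands` qualifies, since the other two are either an initial load or a failed one.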


In [None]:
for result in extraction_results:
    if result["status"] == "success" and result["load_type"] == "incremental":
        try:
            spark.sql(f"OPTIMIZE {result['bronze_table']}")
            print(f"Optimizada: {result['bronze_table']}")
        except Exception as e:
            print(f"Error optimizando {result['bronze_table']}: {e}")

print("Proceso de carga incremental completado.")


Optimizada: bronze_dim_brands
Optimizada: bronze_dim_budget_rate
Optimizada: bronze_dim_customers
Optimizada: bronze_dim_employees
Optimizada: bronze_dim_exchange_rate
Optimizada: bronze_dim_invoice_doctype
Optimizada: bronze_dim_order_doctype
Optimizada: bronze_dim_order_status
Optimizada: bronze_dim_products
Optimizada: bronze_dim_regions
Optimizada: bronze_fact_budget
Optimizada: bronze_fact_forecast
Optimizada: bronze_fact_invoices
Optimizada: bronze_fact_orders
Proceso de carga incremental completado.
