# Cleaning and enrichment of macroeconomic series – Silver layer

**Description:**<br>
This notebook takes the Parquet files stored in the *Bronze* layer for six macroeconomic indicators (FEDFUNDS, CPI, UNRATE, GDP, GS10, STLFSI2) and builds a **complete daily panel**: it fills date gaps with *forward-fill*, computes daily percentage changes where applicable, and flags the quality of each record.
It then writes the *Silver* layer, partitioned by `year=YYYY`, to Azure Data Lake (*silver/macros_clean/*), uploading only the partitions in which new rows are detected.<br>

**Objectives**:<br>
- Produce a clean, densified version (full calendar) of each series.<br>
- Impute missing values with *ffill* and label their origin through `data_quality_flag`.<br>
- Compute derived metrics (daily percentage change) where applicable.<br>
- Persist the *Silver* layer as Parquet partitioned by year, ready for downstream analysis.<br>
- Implement idempotent incremental updates to minimize writes and I/O costs.
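
The densify, impute, and flag steps listed above can be illustrated on a toy series. This is a minimal sketch with made-up values (two weekly observations), not the notebook's actual data; only the `data_quality_flag` column name is taken from the pipeline.

```python
import numpy as np
import pandas as pd

# Toy weekly series (hypothetical values): two observations 7 days apart.
obs = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-05", "2024-01-12"]), "value": [1.0, 2.0]}
).set_index("date")

# Densify to a full daily calendar; days without an observation become NaN.
calendar = pd.date_range(obs.index.min(), obs.index.max(), freq="D")
dense = obs.reindex(calendar)

# Record which days were missing BEFORE imputing, then forward-fill.
was_missing = dense["value"].isna()
dense["value"] = dense["value"].ffill()
dense["data_quality_flag"] = np.where(was_missing, "ffill", "source")

print(dense)
```

Capturing the missing-value mask before calling `ffill()` is what makes the flag possible: once the values are imputed, observed and filled rows are indistinguishable.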

## 1. Basic configuration, libraries, and global pipeline parameters

In [1]:
# ================================================================
# 1.1 Basic configuration and libraries
# ================================================================

import os, io, datetime as dt, pandas as pd, numpy as np
from dotenv import load_dotenv
import adlfs                            # fsspec driver for ADLS

from utils_adls import upload_bytes, _client, CONTAINER

# ► Credentials and environment parameters ---------------------
load_dotenv()                           # requires AZ_STORAGE_ACCOUNT / AZ_ACCOUNT_KEY
ACCOUNT = os.getenv("AZ_STORAGE_ACCOUNT")
KEY     = os.getenv("AZ_ACCOUNT_KEY")

# Logical prefixes INSIDE the container (without prepending "market/")
BRONZE_PREFIX = "bronze/macros"        # input
SILVER_PREFIX = "silver/macros_clean"  # output
TMP_DIR = "data/silver_tmp"; os.makedirs(TMP_DIR, exist_ok=True)

# Connection instances -------------------------------------------------------------
fs  = adlfs.AzureBlobFileSystem(account_name=ACCOUNT, account_key=KEY)
svc = _client()

# Current year (for future filters or validations)
year_now = dt.date.today().year
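
Because `os.getenv` returns `None` for an unset variable, a missing credential would only surface later as a connection error. A small guard can fail earlier with a clearer message. The helper below, `require_env`, is hypothetical and not part of `utils_adls`; it is a sketch of one way to do this.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Possible usage in the configuration cell (after load_dotenv()):
# ACCOUNT = require_env("AZ_STORAGE_ACCOUNT")
# KEY     = require_env("AZ_ACCOUNT_KEY")
```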

## 2. Read the Bronze layer, clean, and sample the initial data

In [2]:
# ================================================================
# 2.1 Read Bronze (full history)
# ================================================================
# Gathers every Bronze Parquet file to build a single DataFrame.
# -------------------------------------------------------------------------------
paths = fs.glob(f"{CONTAINER}/{BRONZE_PREFIX}/*/year=*/*.parquet")
if not paths:
    raise RuntimeError("No Bronze files found")

def _read_parquet(path):
    # Close each remote handle as soon as the file has been read.
    with fs.open(path, "rb") as f:
        return pd.read_parquet(f)

# Combine all Parquet files into one DataFrame sorted by series and date.
# The original per-file row index is kept; kind="stable" preserves ties.
bronze_df = (
    pd.concat(_read_parquet(p) for p in paths)
      .sort_values(["series", "date"], kind="stable")
)

In [3]:
display(bronze_df.tail(10))

index,date,value,series,year,ingest_ts,source
20,2025-05-21,6688726.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
21,2025-05-28,6673244.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
22,2025-06-04,6672885.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
23,2025-06-11,6677155.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
24,2025-06-18,6681056.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
25,2025-06-25,6662200.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
26,2025-07-02,6659598.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
27,2025-07-09,6661912.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
28,2025-07-16,6659273.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
29,2025-07-23,6657715.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED
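
The next step reindexes each series by date, and `DataFrame.reindex` raises a `ValueError` when the index contains duplicate labels. If Bronze could ever hold the same (series, date) pair more than once, for example after a re-ingestion, a deduplication pass before reindexing would avoid that failure. The sketch below uses invented rows and assumes `ingest_ts` is an ISO-8601 string in a single, consistent format, so that lexicographic order matches chronological order.

```python
import pandas as pd

# Hypothetical Bronze rows: the same (series, date) ingested twice.
bronze = pd.DataFrame({
    "date": pd.to_datetime(["2025-07-16", "2025-07-16", "2025-07-23"]),
    "value": [1.0, 1.5, 2.0],
    "series": ["DEMO", "DEMO", "DEMO"],
    "ingest_ts": [
        "2025-07-20T00:00:00+00:00",
        "2025-07-28T00:00:00+00:00",
        "2025-07-28T00:00:00+00:00",
    ],
})

# Keep the most recently ingested row for each (series, date) pair.
deduped = (
    bronze.sort_values(["series", "date", "ingest_ts"])
          .drop_duplicates(subset=["series", "date"], keep="last")
          .reset_index(drop=True)
)

print(deduped)
```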


In [4]:
# ================================================================
# 2.2 Build the daily panel
# ================================================================
# • For each series, reindex to a full daily calendar.
# • Impute missing values with forward-fill and flag their origin.
# • Compute pct_change_daily for daily-frequency series.
# -----------------------------------------------------------------

panel = []

for serie, grp in bronze_df.groupby("series"):
    grp = grp.set_index("date").sort_index()

    # Full calendar from the first observation through today ----
    cal = pd.date_range(grp.index.min(), dt.date.today(), freq="D")
    filled = grp.reindex(cal)

    # Flags and forward-fill ------------------------------------
    orig_na = filled["value"].isna()          # dates with no observed value
    filled["value"] = filled["value"].ffill() # imputation

    # Forward-fill metadata for traceability --------------------
    for col in ["ingest_ts", "source"]:
        if col in filled.columns:
            filled[col] = filled[col].ffill()

    # Quality flag: 'source' (observed) vs 'ffill' (imputed)
    filled["data_quality_flag"] = np.where(orig_na, "ffill", "source")

    # Derived metrics --------------------------------------------
    # inferred_freq is "D" only when the observations fall on every
    # calendar day; business-day or lower-frequency series skip this column.
    if grp.index.inferred_freq == "D":
        filled["pct_change_daily"] = filled["value"].pct_change().round(6)

    # Restore columns and add metadata ---------------------------
    filled = (
        filled.reset_index()
              .rename(columns={"index": "date"})
              .assign(series=serie,
                      year=lambda x: x["date"].dt.year)
    )
    panel.append(filled)

# Consolidated Silver DataFrame
silver_df = (
    pd.concat(panel, ignore_index=True)
      .sort_values(["series", "date"])
)

In [5]:
display(silver_df.head(60))

index,date,value,series,year,ingest_ts,source,data_quality_flag
0,1992-01-03,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,source
1,1992-01-04,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
2,1992-01-05,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
3,1992-01-06,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
4,1992-01-07,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
5,1992-01-08,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
6,1992-01-09,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
7,1992-01-10,-0.79068,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,source
8,1992-01-11,-0.79068,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
9,1992-01-12,-0.79068,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
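
The preview above has no `pct_change_daily` column, which suggests that no series in this run passed the `inferred_freq == "D"` check. That check is strict: it matches only an index with one observation per calendar day. The sketch below, using invented indexes rather than the notebook's data, prints what pandas infers for three common observation patterns.

```python
import pandas as pd

# Three hypothetical observation calendars.
calendar_days = pd.date_range("2024-01-01", periods=10, freq="D")   # every day
business_days = pd.bdate_range("2024-01-01", periods=10)            # Mon-Fri only
weekly = pd.date_range("2024-01-03", periods=10, freq="7D")         # once a week

for label, idx in [
    ("calendar", calendar_days),
    ("business", business_days),
    ("weekly", weekly),
]:
    # inferred_freq returns a frequency string, or None if no regular pattern fits.
    print(label, idx.inferred_freq)
```

Only the first index yields `"D"`. A series published on business days would therefore never receive `pct_change_daily` under the current condition, even though it is conceptually daily.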


## 3. Write the Silver layer (local and Azure Data Lake Storage)

In [6]:
# ================================================================
# 3.1 Latest date in Silver
# ================================================================
# Returns the latest date available in Silver for a given (series, year).
# Used to determine which records are new.
# ----------------------------------------------------------------
def last_date_silver(series: str, year: int):
    file_path = f"{SILVER_PREFIX}/{series}/year={year}/{series}_{year}.parquet"
    full_path = f"{CONTAINER}/{file_path}"
    if not fs.exists(full_path):
        return None
    with fs.open(full_path, "rb") as f:
        return pd.read_parquet(f, columns=["date"])["date"].max().date()

silver_df.head()

index,date,value,series,year,ingest_ts,source,data_quality_flag
0,1992-01-03,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,source
1,1992-01-04,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
2,1992-01-05,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
3,1992-01-06,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
4,1992-01-07,-0.79596,ANFCI,1992,2025-07-28T06:55:11.181425+00:00,FRED,ffill
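
The "latest stored date" lookup is what makes a rerun idempotent: when the input has not changed, no row is newer than the stored maximum, so nothing is selected for upload. The sketch below reproduces that selection logic in memory, without any storage access. `new_rows_after` is a hypothetical helper and the data is invented.

```python
import pandas as pd

def new_rows_after(rebuilt: pd.DataFrame, last_date) -> pd.DataFrame:
    """Return rows of `rebuilt` dated strictly after `last_date` (all rows if None)."""
    if last_date is None:
        return rebuilt
    return rebuilt[rebuilt["date"] > pd.Timestamp(last_date)]

# Hypothetical partition rebuilt from Bronze: five daily rows.
rebuilt = pd.DataFrame({
    "date": pd.date_range("2025-07-20", periods=5, freq="D"),
    "value": [1.0, 1.0, 2.0, 2.0, 2.0],
})

# First run: nothing is stored yet, so every row counts as new.
first_run = new_rows_after(rebuilt, None)

# Second run on the same input: the stored maximum equals the latest row.
stored_last = first_run["date"].max().date()
second_run = new_rows_after(rebuilt, stored_last)

print(len(first_run), len(second_run))
```

The strict `>` comparison matters here. With `>=`, the last stored row would be selected again on every run.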


In [7]:
# ================================================================
# 3.2 Incremental write to the Silver layer
# ================================================================
# • Writes a local Parquet file per (series, year) partition.
# • Uploads a partition only when it contains rows not yet in Silver.
# ----------------------------------------------------------------

uploads = 0
for (serie, yr), sub in silver_df.groupby(["series", "year"]):
    last_dt = last_date_silver(serie, yr)

    # If the partition already exists in Silver, detect the new rows
    new_rows = sub if last_dt is None else sub[sub["date"] > pd.to_datetime(last_dt)]

    if new_rows.empty:
        print(f"{serie} {yr}: sin novedades"); continue

    # Save a local temp file -----------------------------------
    # The upload overwrites the remote file, so write the full partition
    # (rebuilt from the complete Bronze history) rather than new_rows only;
    # otherwise the rows already stored in Silver would be lost.
    tmp = f"{TMP_DIR}/{serie}_{yr}.parquet"
    sub.to_parquet(tmp, index=False)

    # Upload / overwrite in ADLS -------------------------------
    remote = f"{SILVER_PREFIX}/{serie}/year={yr}/{serie}_{yr}.parquet"
    with open(tmp, "rb") as f:
        upload_bytes(remote, f.read())          # overwrite=True

    uploads += len(new_rows)
    print(f"▲ {serie} {yr}: +{len(new_rows)} filas")

print("✅ Silver-Macros FULL build terminado – filas nuevas:", uploads)

▲ ANFCI 1992: +364 filas
▲ ANFCI 1993: +365 filas
▲ ANFCI 1994: +365 filas
▲ ANFCI 1995: +365 filas
▲ ANFCI 1996: +366 filas
▲ ANFCI 1997: +365 filas
▲ ANFCI 1998: +365 filas
▲ ANFCI 1999: +365 filas
▲ ANFCI 2000: +366 filas
▲ ANFCI 2001: +365 filas
▲ ANFCI 2002: +365 filas
▲ ANFCI 2003: +365 filas
▲ ANFCI 2004: +366 filas
▲ ANFCI 2005: +365 filas
▲ ANFCI 2006: +365 filas
▲ ANFCI 2007: +365 filas
▲ ANFCI 2008: +366 filas
▲ ANFCI 2009: +365 filas
▲ ANFCI 2010: +365 filas
▲ ANFCI 2011: +365 filas
▲ ANFCI 2012: +366 filas
▲ ANFCI 2013: +365 filas
▲ ANFCI 2014: +365 filas
▲ ANFCI 2015: +365 filas
▲ ANFCI 2016: +366 filas
▲ ANFCI 2017: +365 filas
▲ ANFCI 2018: +365 filas
▲ ANFCI 2019: +365 filas
▲ ANFCI 2020: +366 filas
▲ ANFCI 2021: +365 filas
▲ ANFCI 2022: +365 filas
▲ ANFCI 2023: +365 filas
▲ ANFCI 2024: +366 filas
▲ ANFCI 2025: +209 filas
▲ BAA10Y 1992: +366 filas
▲ BAA10Y 1993: +365 filas
▲ BAA10Y 1994: +365 filas
▲ BAA10Y 1995: +365 filas
▲ BAA10Y 1996: +366 filas
▲ BAA10Y 1997: +365 