# Financial feature engineering – Gold layer

**Description:**<br>
Takes the daily prices of stock-market indices (*Silver/indices_clean* layer) and the cleaned macroeconomic series (*Silver/macros_clean*), merges them onto a daily master calendar, and generates a set of **technical features** (lags, moving averages, volatility, RSI‑14, Bollinger Bands) together with forward-return target variables at 5, 20 and 180 days. The resulting data is stored in the *Gold* layer of the data lake, partitioned by ticker and year, for direct use in ML modelling or advanced analysis.<br>

**Objectives**:<br>
- Efficiently read the *Silver* index and macro data from ADLS.<br>
- Build a complete daily panel that combines prices and macro variables.<br>
- Generate standard technical features for each index (lag 1/5, SMA‑20/50, 20‑day volatility, RSI‑14, 20‑day Bollinger Bands).<br>
- Compute the target variables (forward returns at 5, 20 and 180 days) for supervised training.<br>
- Persist the *Gold* layer in Parquet format, laid out as `<ticker>/year=YYYY`, ready for consumption.<br>
- Operate incrementally: an existing partition should be overwritten only when new dates are available (the loop in section 3.2 currently rewrites every partition on each run).<br>

## 1. Basic configuration, libraries and global pipeline parameters

In [1]:
# ================================================================
# 1.1 Basic configuration and libraries
# ================================================================
import os, io, datetime as dt, pandas as pd, numpy as np
from dotenv import load_dotenv
import adlfs    # fsspec‑ADLS
from utils_adls import upload_bytes, _client, CONTAINER   # in-house helpers

# Environment variables (Azure account & key) --------------------
load_dotenv()
ACCOUNT = os.getenv("AZ_STORAGE_ACCOUNT")
KEY = os.getenv("AZ_ACCOUNT_KEY")

# Prefixes inside the «market» container (container name not included in the path)
SILVER_IDX_PREFIX  = "silver/indices_clean"
SILVER_MAC_PREFIX  = "silver/macros_clean"
GOLD_PREFIX        = "gold/features"          # <ticker>/year=YYYY/…

LOCAL_TMP = "data/gold_tmp"
os.makedirs(LOCAL_TMP, exist_ok=True)

# fsspec filesystem and ADLS client ------------------------------
fs  = adlfs.AzureBlobFileSystem(account_name=ACCOUNT, account_key=KEY)
svc = _client()

# Explicit list of indices to process ----------------------------
tickers = ["SP500", "DJIA", "NASDAQ", "RUSSELL2000", "WILSHIRE5000"]
pd.set_option("display.max_columns", None)

## 2. Loading Silver data (indices and macroeconomic variables)

In [2]:
# ================================================================
# 2.1 Load Silver indices and macro data
# ================================================================
# • Reads every Silver Parquet file for indices and macro series via fsspec.
# • Builds "wide" tables for CLOSE, daily returns and macro variables.
# ----------------------------------------------------------------

# Absolute paths inside the container (market/…) -----------------
paths_idx = fs.glob(f"{CONTAINER}/{SILVER_IDX_PREFIX}/*/year=*/*.parquet")
paths_mac = fs.glob(f"{CONTAINER}/{SILVER_MAC_PREFIX}/*/year=*/*.parquet")

# Load and concatenate -------------------------------------------
idx_df = pd.concat(pd.read_parquet(fs.open(p, "rb")) for p in paths_idx)
mac_df = pd.concat(pd.read_parquet(fs.open(p, "rb")) for p in paths_mac)

# Indices in wide format -----------------------------------------
close_wide = idx_df.pivot(index="date", columns="ticker", values="close").ffill()
ret_wide   = idx_df.pivot(index="date", columns="ticker", values="daily_return")

# Macro series to wide (ffill already applied in Silver; the extra bfill
# only fills leading gaps, using the first later observation) ----
macro_wide = mac_df.pivot(index="date", columns="series", values="value").ffill().bfill()

# Master calendar ------------------------------------------------
date_range = pd.date_range(close_wide.index.min(), close_wide.index.max(), freq="D")
base = pd.DataFrame(index=date_range).join(macro_wide, how="left")

In [3]:
macro_wide.head()

date,ANFCI,BAA10Y,BAMLH0A0HYM2,CPIAUCSL,DCOILWTICO,DFII10,DGS10,DGS2,DTWEXBGS,FEDFUNDS,GDP,HOUST,ICSA,INDPRO,M2SL,NFCI,PAYEMS,PERMIT,STLFSI2,T10Y2Y,T10Y3M,T5YIE,TEDRATE,THREEFYTP10,UMCSENT,UNRATE,VIXCLS,WALCL
1992-01-01,-0.79596,2.33,3.13,138.3,19.43,2.43,6.78,4.77,101.4155,4.03,6363.102,1176.0,432000.0,61.4823,3381.2,-0.67185,108365.0,1077.0,-0.3803,2.01,2.82,1.3,0.33,2.0654,67.5,7.3,18.95,719542.0
1992-01-02,-0.79596,2.33,3.13,138.3,19.43,2.43,6.78,4.77,101.4155,4.03,6363.102,1176.0,432000.0,61.4823,3381.2,-0.67185,108365.0,1077.0,-0.3803,2.01,2.82,1.3,0.33,2.0654,67.5,7.3,18.95,719542.0
1992-01-03,-0.79596,2.26,3.13,138.3,19.22,2.43,6.85,4.8,101.4155,4.03,6363.102,1176.0,432000.0,61.4823,3381.2,-0.67185,108365.0,1077.0,-0.3803,2.05,2.9,1.3,0.34,2.1013,67.5,7.3,18.75,719542.0
1992-01-04,-0.79596,2.26,3.13,138.3,19.22,2.43,6.85,4.8,101.4155,4.03,6363.102,1176.0,432000.0,61.4823,3381.2,-0.67185,108365.0,1077.0,-0.3803,2.05,2.9,1.3,0.34,2.1013,67.5,7.3,18.75,719542.0
1992-01-05,-0.79596,2.26,3.13,138.3,19.22,2.43,6.85,4.8,101.4155,4.03,6363.102,1176.0,432000.0,61.4823,3381.2,-0.67185,108365.0,1077.0,-0.3803,2.05,2.9,1.3,0.34,2.1013,67.5,7.3,18.75,719542.0

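The calendar step in cell 2 (left-join onto a daily date range, then carry the last known value forward) can be illustrated on toy data. This is a minimal sketch under stated assumptions: the `WALCL` values and dates below are made up for illustration and are not taken from the Silver layer.

```python
import pandas as pd

# Toy weekly macro series (hypothetical values): one observation per week.
macro = pd.DataFrame(
    {"WALCL": [100.0, 101.0]},
    index=pd.to_datetime(["2025-01-01", "2025-01-08"]),
)

# Daily master calendar covering the span of interest.
calendar = pd.date_range("2025-01-01", "2025-01-10", freq="D")

# Left-join onto the calendar, then forward-fill so that every day carries
# the most recent published value (never a later one).
daily = pd.DataFrame(index=calendar).join(macro, how="left").ffill()

print(daily)
```

Forward-filling only propagates past values, which is why it is generally the safer choice for a modelling panel; a backward fill, by contrast, copies later observations into earlier rows.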

In [4]:
display(mac_df.tail(25))

Unnamed: 0,date,value,series,year,ingest_ts,source,data_quality_flag
184,2025-07-04,6659598.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill
185,2025-07-05,6659598.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill
186,2025-07-06,6659598.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill
187,2025-07-07,6659598.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill
188,2025-07-08,6659598.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill
189,2025-07-09,6661912.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,source
190,2025-07-10,6661912.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill
191,2025-07-11,6661912.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill
192,2025-07-12,6661912.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill
193,2025-07-13,6661912.0,WALCL,2025,2025-07-28T06:57:06.165976+00:00,FRED,ffill


## 3. Technical feature functions

In [5]:
# ================================================================
# 3.1 Technical feature functions
# ================================================================
# add_tech_features: given a DataFrame and a price column, adds:
#   • Lags (1 and 5 days)
#   • Moving averages (20 and 50)
#   • Volatility (std. dev. of log returns over 20 days)
#   • RSI‑14 (simple rolling means of gains and losses)
#   • 20‑day standard deviation and Bollinger Bands (SMA‑20 ± 2 std)
# Windows count rows; on the daily master calendar one row is one calendar day.
# Mutates the DataFrame in place and returns it.
# ---------------------------------------------------------------

def add_tech_features(df: pd.DataFrame, price_col: str, prefix: str):
    p = df[price_col]
    lr = np.log(p / p.shift(1))

    df[f"{prefix}_lag1"]   = p.shift(1)
    df[f"{prefix}_lag5"]   = p.shift(5)
    df[f"{prefix}_sma20"]  = p.rolling(20).mean()
    df[f"{prefix}_sma50"]  = p.rolling(50).mean()
    df[f"{prefix}_vol20"]  = lr.rolling(20).std()

    # ► RSI‑14 --------------------------------------------------
    delta = p.diff()
    gain  = delta.clip(lower=0).rolling(14).mean()
    loss  = (-delta.clip(upper=0)).rolling(14).mean()
    rs    = gain / loss
    df[f"{prefix}_rsi14"] = 100 - (100 / (1 + rs))

    # ► Bollinger Bands -----------------------------------------
    rolling_std = p.rolling(20).std()
    df[f"{prefix}_std20"] = rolling_std
    df[f"{prefix}_bollinger_upper"] = df[f"{prefix}_sma20"] + 2 * rolling_std
    df[f"{prefix}_bollinger_lower"] = df[f"{prefix}_sma20"] - 2 * rolling_std

    return df

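The RSI in `add_tech_features` averages gains and losses with a simple 14-row rolling mean. Many charting tools instead use Wilder's exponential smoothing, so the values can differ from those of other sources. For comparison, here is a minimal sketch of that variant; `rsi_wilder` is a hypothetical helper, not part of this notebook, and it seeds the average with the first observation rather than with Wilder's initial simple average.

```python
import numpy as np
import pandas as pd

def rsi_wilder(prices: pd.Series, window: int = 14) -> pd.Series:
    """RSI using Wilder-style exponential smoothing (alpha = 1 / window)."""
    delta = prices.diff()
    gain = delta.clip(lower=0)        # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)     # magnitude of negative moves, zero otherwise
    # adjust=False yields the recursion avg_t = (1 - a) * avg_{t-1} + a * x_t
    avg_gain = gain.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    avg_loss = loss.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Sanity check on two extreme toy series.
rising = pd.Series(np.arange(1.0, 31.0))
falling = rising[::-1].reset_index(drop=True)
print(rsi_wilder(rising).iloc[-1], rsi_wilder(falling).iloc[-1])
```

A strictly rising series has no losses, so the RSI saturates at 100; a strictly falling one has no gains, so it sits at 0. The first `window` rows are NaN because `min_periods` requires 14 observed price changes.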
In [6]:
# ================================================================
# 3.2 Feature generation and persistence in the Gold layer
# ================================================================
# For each index:
#   - Copy the master calendar + macro variables.
#   - Add the index's price and return series.
#   - Create technical features and the targets (5-, 20- and 180-day forward returns).
#   - Write Parquet partitioned by year to the Gold layer and upload it to the lake.
# ----------------------------------------------------------------

for tk in tickers:
    print(f"▶ Procesando {tk} …")

    # --- assemble ticker + macro panel --------------------------
    df = base.copy()
    df[tk]       = close_wide[tk]
    df[f"{tk}_ret"] = ret_wide[tk]

    # Add technical features ------------------------------------
    df = add_tech_features(df, price_col=tk, prefix=tk.lower())

    # Targets: forward return at 5, 20 and 180 days --------------
    for h in [5, 20, 180]:
        df[f"y_{tk.lower()}_ret_{h}d"] = df[tk].shift(-h) / df[tk] - 1

    # Keep only rows with an actual price -----------------------
    df = df[df[tk].notna()].copy()
    df["year"] = df.index.year

    # Tidy format: move the date index into a "date" column -----
    df = df.reset_index().rename(columns={"index": "date"})

    # --- persist one file per year partition -------------------
    for yr, part in df.groupby("year"):
        tmp = f"{LOCAL_TMP}/{tk}_{yr}.parquet"
        part.to_parquet(tmp, index=False, engine="pyarrow")

        remote = f"{GOLD_PREFIX}/{tk}/year={yr}/{tk}_{yr}.parquet"
        with open(tmp, "rb") as f:
            # The helper is expected to overwrite (overwrite=True), keeping re-runs idempotent
            upload_bytes(remote, f.read())
        print(f"   ▲ {tk} {yr} -> {remote}  ({len(part)} filas)")

print("✅ Gold por índice creado")

▶ Procesando SP500 …
   ▲ SP500 1992 -> gold/features/SP500/year=1992/SP500_1992.parquet  (365 filas)
   ▲ SP500 1993 -> gold/features/SP500/year=1993/SP500_1993.parquet  (365 filas)
   ▲ SP500 1994 -> gold/features/SP500/year=1994/SP500_1994.parquet  (365 filas)
   ▲ SP500 1995 -> gold/features/SP500/year=1995/SP500_1995.parquet  (365 filas)
   ▲ SP500 1996 -> gold/features/SP500/year=1996/SP500_1996.parquet  (366 filas)
   ▲ SP500 1997 -> gold/features/SP500/year=1997/SP500_1997.parquet  (365 filas)
   ▲ SP500 1998 -> gold/features/SP500/year=1998/SP500_1998.parquet  (365 filas)
   ▲ SP500 1999 -> gold/features/SP500/year=1999/SP500_1999.parquet  (365 filas)
   ▲ SP500 2000 -> gold/features/SP500/year=2000/SP500_2000.parquet  (366 filas)
   ▲ SP500 2001 -> gold/features/SP500/year=2001/SP500_2001.parquet  (365 filas)
   ▲ SP500 2002 -> gold/features/SP500/year=2002/SP500_2002.parquet  (365 filas)
   ▲ SP500 2003 -> gold/features/SP500/year=2003/SP500_2003.parquet  (365 filas)
   ▲ SP

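The persistence loop above uploads every year partition on each run. The stated goal of rewriting a partition only when it would gain new dates could be approached with a small guard such as the one below. This is a sketch only: `partition_needs_update` is a hypothetical helper that does not exist in `utils_adls`, and the dates are toy values.

```python
import pandas as pd

def partition_needs_update(existing_dates, new_dates) -> bool:
    """Return True when new_dates contains at least one date absent from existing_dates.

    Hypothetical helper for illustration; not part of the notebook's utilities.
    """
    existing = pd.DatetimeIndex(existing_dates)
    incoming = pd.DatetimeIndex(new_dates)
    return len(incoming.difference(existing)) > 0

# Toy usage: the stored partition ends on Jan 5, the fresh panel on Jan 7.
stored = pd.date_range("2025-01-01", "2025-01-05", freq="D")
fresh = pd.date_range("2025-01-01", "2025-01-07", freq="D")

print(partition_needs_update(stored, fresh))   # new dates present
print(partition_needs_update(fresh, fresh))    # nothing new
```

Inside the loop, one possible design would be to read only the `date` column of the remote file when it already exists and skip the upload when the guard returns `False`. Note that a date-only check would miss revisions to values on dates already stored.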
## 4. Quick inspection of the results

In [7]:
# ================================================================
# 4.1 Inspect the results
# ================================================================
for idx in tickers:                                            # ['SP500', ...]
    paths = fs.glob(f"{CONTAINER}/{GOLD_PREFIX}/{idx}/year=*/*.parquet")
    print(f"\n===== {idx}: {len(paths)} archivos encontrados =====")
    for p in paths[:5]:
        print(" •", p)

    if not paths:                          # no partitions yet for this index
        continue

    # Show the schema and first rows of the last file (normally the most recent year)
    sample_path = paths[-1]
    df_sample = pd.read_parquet(fs.open(sample_path, "rb"))

    print("\nColumnas y tipos:")
    print(df_sample.dtypes)

    print("\nPrimeras 5 filas:")
    display(df_sample.head())


===== SP500: 34 archivos encontrados =====
 • market/gold/features/SP500/year=1992/SP500_1992.parquet
 • market/gold/features/SP500/year=1993/SP500_1993.parquet
 • market/gold/features/SP500/year=1994/SP500_1994.parquet
 • market/gold/features/SP500/year=1995/SP500_1995.parquet
 • market/gold/features/SP500/year=1996/SP500_1996.parquet

Columnas y tipos:
date                     datetime64[ns]
ANFCI                           float64
BAA10Y                          float64
BAMLH0A0HYM2                    float64
CPIAUCSL                        float64
DCOILWTICO                      float64
DFII10                          float64
DGS10                           float64
DGS2                            float64
DTWEXBGS                        float64
FEDFUNDS                        float64
GDP                             float64
HOUST                           float64
ICSA                            float64
INDPRO                          float64
M2SL                            float64
NF

Unnamed: 0,date,ANFCI,BAA10Y,BAMLH0A0HYM2,CPIAUCSL,DCOILWTICO,DFII10,DGS10,DGS2,DTWEXBGS,FEDFUNDS,GDP,HOUST,ICSA,INDPRO,M2SL,NFCI,PAYEMS,PERMIT,STLFSI2,T10Y2Y,T10Y3M,T5YIE,TEDRATE,THREEFYTP10,UMCSENT,UNRATE,VIXCLS,WALCL,SP500,SP500_ret,sp500_lag1,sp500_lag5,sp500_sma20,sp500_sma50,sp500_vol20,sp500_rsi14,sp500_std20,sp500_bollinger_upper,sp500_bollinger_lower,y_sp500_ret_5d,y_sp500_ret_20d,y_sp500_ret_180d,year
0,2025-01-01,-0.49691,1.42,2.92,317.603,72.44,2.24,4.58,4.25,129.488,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.33,0.21,2.38,0.09,0.6872,74.0,4.1,17.35,6852491.0,5881.629883,0.0,5881.629883,5970.839844,5974.210449,5989.426377,0.008688,51.4073,70.881458,6115.973365,5832.447534,0.015939,0.028497,0.054971,2025
1,2025-01-02,-0.49691,1.45,2.88,317.603,73.79,2.23,4.57,4.25,129.6666,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.32,0.21,2.41,0.09,0.6851,74.0,4.1,17.93,6852491.0,5868.549805,-0.002224,5881.629883,5970.839844,5965.083447,5987.089775,0.008683,50.213344,72.201154,6109.485756,5820.681138,0.006898,0.037117,0.05614,2025
2,2025-01-03,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,5942.470215,0.012596,5868.549805,5970.839844,5959.652466,5986.955781,0.009233,51.63848,69.422988,6098.498442,5820.80649,-0.004076,0.029658,0.047951,2025
3,2025-01-04,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,5942.470215,0.0,5942.470215,5906.939941,5954.221484,5988.392783,0.009233,51.63848,66.060578,6086.34264,5822.100329,-0.004076,0.026718,0.05669,2025
4,2025-01-05,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,5942.470215,0.0,5942.470215,5881.629883,5947.640991,5989.829785,0.00917,51.63848,59.745932,6067.132855,5828.149128,-0.019425,0.026718,0.05669,2025



===== DJIA: 34 archivos encontrados =====
 • market/gold/features/DJIA/year=1992/DJIA_1992.parquet
 • market/gold/features/DJIA/year=1993/DJIA_1993.parquet
 • market/gold/features/DJIA/year=1994/DJIA_1994.parquet
 • market/gold/features/DJIA/year=1995/DJIA_1995.parquet
 • market/gold/features/DJIA/year=1996/DJIA_1996.parquet

Columnas y tipos:
date                    datetime64[ns]
ANFCI                          float64
BAA10Y                         float64
BAMLH0A0HYM2                   float64
CPIAUCSL                       float64
DCOILWTICO                     float64
DFII10                         float64
DGS10                          float64
DGS2                           float64
DTWEXBGS                       float64
FEDFUNDS                       float64
GDP                            float64
HOUST                          float64
ICSA                           float64
INDPRO                         float64
M2SL                           float64
NFCI                         

Unnamed: 0,date,ANFCI,BAA10Y,BAMLH0A0HYM2,CPIAUCSL,DCOILWTICO,DFII10,DGS10,DGS2,DTWEXBGS,FEDFUNDS,GDP,HOUST,ICSA,INDPRO,M2SL,NFCI,PAYEMS,PERMIT,STLFSI2,T10Y2Y,T10Y3M,T5YIE,TEDRATE,THREEFYTP10,UMCSENT,UNRATE,VIXCLS,WALCL,DJIA,DJIA_ret,djia_lag1,djia_lag5,djia_sma20,djia_sma50,djia_vol20,djia_rsi14,djia_std20,djia_bollinger_upper,djia_bollinger_lower,y_djia_ret_5d,y_djia_ret_20d,y_djia_ret_180d,year
0,2025-01-01,-0.49691,1.42,2.92,317.603,72.44,2.24,4.58,4.25,129.488,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.33,0.21,2.38,0.09,0.6872,74.0,4.1,17.35,6852491.0,42544.21875,0.0,42544.21875,42992.210938,43065.353125,43797.163906,0.007473,56.103517,488.807992,44042.969109,42087.737141,0.003816,0.034825,0.036446,2025
1,2025-01-02,-0.49691,1.45,2.88,317.603,73.79,2.23,4.57,4.25,129.6666,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.32,0.21,2.41,0.09,0.6851,74.0,4.1,17.93,6852491.0,42392.269531,-0.003572,42544.21875,42992.210938,42993.563672,43765.845469,0.007486,51.304872,476.167772,43945.899216,42041.228128,0.00321,0.041622,0.0496,2025
2,2025-01-03,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,42732.128906,0.008017,42392.269531,42992.210938,42938.767188,43745.470859,0.007786,46.926176,436.486719,43811.740625,42065.79375,-0.002268,0.042894,0.041006,2025
3,2025-01-04,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,42732.128906,0.0,42732.128906,42573.730469,42883.970703,43731.213672,0.007786,46.926176,384.687128,43653.344959,42114.596447,-0.002268,0.039598,0.049059,2025
4,2025-01-05,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,42732.128906,0.0,42732.128906,42544.21875,42834.703125,43716.956484,0.007785,46.926176,331.77918,43498.261484,42171.144766,-0.018573,0.039598,0.049059,2025



===== NASDAQ: 34 archivos encontrados =====
 • market/gold/features/NASDAQ/year=1992/NASDAQ_1992.parquet
 • market/gold/features/NASDAQ/year=1993/NASDAQ_1993.parquet
 • market/gold/features/NASDAQ/year=1994/NASDAQ_1994.parquet
 • market/gold/features/NASDAQ/year=1995/NASDAQ_1995.parquet
 • market/gold/features/NASDAQ/year=1996/NASDAQ_1996.parquet

Columnas y tipos:
date                      datetime64[ns]
ANFCI                            float64
BAA10Y                           float64
BAMLH0A0HYM2                     float64
CPIAUCSL                         float64
DCOILWTICO                       float64
DFII10                           float64
DGS10                            float64
DGS2                             float64
DTWEXBGS                         float64
FEDFUNDS                         float64
GDP                              float64
HOUST                            float64
ICSA                             float64
INDPRO                           float64
M2SL            

Unnamed: 0,date,ANFCI,BAA10Y,BAMLH0A0HYM2,CPIAUCSL,DCOILWTICO,DFII10,DGS10,DGS2,DTWEXBGS,FEDFUNDS,GDP,HOUST,ICSA,INDPRO,M2SL,NFCI,PAYEMS,PERMIT,STLFSI2,T10Y2Y,T10Y3M,T5YIE,TEDRATE,THREEFYTP10,UMCSENT,UNRATE,VIXCLS,WALCL,NASDAQ,NASDAQ_ret,nasdaq_lag1,nasdaq_lag5,nasdaq_sma20,nasdaq_sma50,nasdaq_vol20,nasdaq_rsi14,nasdaq_std20,nasdaq_bollinger_upper,nasdaq_bollinger_lower,y_nasdaq_ret_5d,y_nasdaq_ret_20d,y_nasdaq_ret_180d,year
0,2025-01-01,-0.49691,1.42,2.92,317.603,72.44,2.24,4.58,4.25,129.488,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.33,0.21,2.38,0.09,0.6872,74.0,4.1,17.35,6852491.0,19310.789062,0.0,19310.789062,19722.029297,19733.416406,19460.833359,0.010923,47.072106,276.253957,20285.924319,19180.908493,0.028699,0.023095,0.054837,2025
1,2025-01-02,-0.49691,1.45,2.88,317.603,73.79,2.23,4.57,4.25,129.6666,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.32,0.21,2.41,0.09,0.6851,74.0,4.1,17.93,6852491.0,19280.789062,-0.001554,19310.789062,19722.029297,19701.119824,19461.834727,0.010905,46.735278,289.886756,20280.893337,19121.346311,0.010834,0.037786,0.047825,2025
2,2025-01-03,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,19621.679688,0.01768,19280.789062,19722.029297,19685.867773,19472.115312,0.011718,51.58347,285.381995,20256.631763,19115.103784,-0.007278,0.022016,0.039316,2025
3,2025-01-04,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,19621.679688,0.0,19621.679688,19486.789062,19670.615723,19490.946523,0.011718,51.58347,279.931594,20230.478911,19110.752534,-0.007278,0.016952,0.049915,2025
4,2025-01-05,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,19621.679688,0.0,19621.679688,19310.789062,19643.005176,19509.777734,0.01131,51.58347,253.681785,20150.368746,19135.641605,-0.023446,0.016952,0.049915,2025



===== RUSSELL2000: 34 archivos encontrados =====
 • market/gold/features/RUSSELL2000/year=1992/RUSSELL2000_1992.parquet
 • market/gold/features/RUSSELL2000/year=1993/RUSSELL2000_1993.parquet
 • market/gold/features/RUSSELL2000/year=1994/RUSSELL2000_1994.parquet
 • market/gold/features/RUSSELL2000/year=1995/RUSSELL2000_1995.parquet
 • market/gold/features/RUSSELL2000/year=1996/RUSSELL2000_1996.parquet

Columnas y tipos:
date                           datetime64[ns]
ANFCI                                 float64
BAA10Y                                float64
BAMLH0A0HYM2                          float64
CPIAUCSL                              float64
DCOILWTICO                            float64
DFII10                                float64
DGS10                                 float64
DGS2                                  float64
DTWEXBGS                              float64
FEDFUNDS                              float64
GDP                                   float64
HOUST                   

Unnamed: 0,date,ANFCI,BAA10Y,BAMLH0A0HYM2,CPIAUCSL,DCOILWTICO,DFII10,DGS10,DGS2,DTWEXBGS,FEDFUNDS,GDP,HOUST,ICSA,INDPRO,M2SL,NFCI,PAYEMS,PERMIT,STLFSI2,T10Y2Y,T10Y3M,T5YIE,TEDRATE,THREEFYTP10,UMCSENT,UNRATE,VIXCLS,WALCL,RUSSELL2000,RUSSELL2000_ret,russell2000_lag1,russell2000_lag5,russell2000_sma20,russell2000_sma50,russell2000_vol20,russell2000_rsi14,russell2000_std20,russell2000_bollinger_upper,russell2000_bollinger_lower,y_russell2000_ret_5d,y_russell2000_ret_20d,y_russell2000_ret_180d,year
0,2025-01-01,-0.49691,1.42,2.92,317.603,72.44,2.24,4.58,4.25,129.488,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.33,0.21,2.38,0.09,0.6872,74.0,4.1,17.35,6852491.0,2230.159912,0.0,2230.159912,2244.590088,2268.804517,2339.772012,0.011861,49.493777,48.48689,2365.778297,2171.830736,0.016362,0.039374,-0.024716,2025
1,2025-01-02,-0.49691,1.45,2.88,317.603,73.79,2.23,4.57,4.25,129.6666,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.32,0.21,2.41,0.09,0.6851,74.0,4.1,17.93,6852491.0,2231.669922,0.000677,2230.159912,2244.590088,2263.043018,2337.018008,0.011861,54.072853,45.471092,2353.985201,2172.100834,0.008124,0.032285,-0.015293,2025
2,2025-01-03,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,2268.469971,0.01649,2231.669922,2244.590088,2259.121521,2335.648608,0.012585,59.269729,41.022905,2341.16733,2177.075712,-0.013009,0.020335,-0.018554,2025
3,2025-01-04,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,2268.469971,0.0,2268.469971,2227.780029,2255.200024,2334.941206,0.012585,59.269729,35.577552,2326.355129,2184.04492,-0.013009,0.017311,-0.008565,2025
4,2025-01-05,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,2268.469971,0.0,2268.469971,2230.159912,2250.524023,2334.233804,0.012449,59.269729,25.53035,2301.584724,2199.463323,-0.034931,0.017311,-0.008565,2025



===== WILSHIRE5000: 34 archivos encontrados =====
 • market/gold/features/WILSHIRE5000/year=1992/WILSHIRE5000_1992.parquet
 • market/gold/features/WILSHIRE5000/year=1993/WILSHIRE5000_1993.parquet
 • market/gold/features/WILSHIRE5000/year=1994/WILSHIRE5000_1994.parquet
 • market/gold/features/WILSHIRE5000/year=1995/WILSHIRE5000_1995.parquet
 • market/gold/features/WILSHIRE5000/year=1996/WILSHIRE5000_1996.parquet

Columnas y tipos:
date                            datetime64[ns]
ANFCI                                  float64
BAA10Y                                 float64
BAMLH0A0HYM2                           float64
CPIAUCSL                               float64
DCOILWTICO                             float64
DFII10                                 float64
DGS10                                  float64
DGS2                                   float64
DTWEXBGS                               float64
FEDFUNDS                               float64
GDP                                    float64
H

Unnamed: 0,date,ANFCI,BAA10Y,BAMLH0A0HYM2,CPIAUCSL,DCOILWTICO,DFII10,DGS10,DGS2,DTWEXBGS,FEDFUNDS,GDP,HOUST,ICSA,INDPRO,M2SL,NFCI,PAYEMS,PERMIT,STLFSI2,T10Y2Y,T10Y3M,T5YIE,TEDRATE,THREEFYTP10,UMCSENT,UNRATE,VIXCLS,WALCL,WILSHIRE5000,WILSHIRE5000_ret,wilshire5000_lag1,wilshire5000_lag5,wilshire5000_sma20,wilshire5000_sma50,wilshire5000_vol20,wilshire5000_rsi14,wilshire5000_std20,wilshire5000_bollinger_upper,wilshire5000_bollinger_lower,y_wilshire5000_ret_5d,y_wilshire5000_ret_20d,y_wilshire5000_ret_180d,year
0,2025-01-01,-0.49691,1.42,2.92,317.603,72.44,2.24,4.58,4.25,129.488,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.33,0.21,2.38,0.09,0.6872,74.0,4.1,17.35,6852491.0,58970.019531,0.0,58970.019531,59833.5,59942.236328,60258.355859,0.009015,50.910195,765.071084,61472.378497,58412.094159,0.017076,0.032476,0.050374,2025
1,2025-01-02,-0.49691,1.45,2.88,317.603,73.79,2.23,4.57,4.25,129.6666,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.48658,158942.0,1480.0,-0.8509,0.32,0.21,2.41,0.09,0.6851,74.0,4.1,17.93,6852491.0,58879.191406,-0.00154,58970.019531,59833.5,59842.321484,60231.731719,0.009012,50.568606,766.984879,61376.291243,58308.351726,0.007439,0.039094,0.051513,2025
2,2025-01-03,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,209000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,59659.671875,0.013256,58879.191406,59833.5,59781.430664,60229.095156,0.009602,52.404594,727.818913,61237.06849,58325.792838,-0.004561,0.030732,0.043513,2025
3,2025-01-04,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,59659.671875,0.0,59659.671875,59204.589844,59720.539844,60242.271562,0.009602,52.404594,680.712324,61081.964492,58359.115195,-0.004561,0.028062,0.052285,2025
4,2025-01-05,-0.49995,1.44,2.81,317.603,74.64,2.26,4.6,4.28,129.6721,4.48,29962.047,1514.0,205000.0,103.0447,21441.8,-0.49156,158942.0,1480.0,-0.8509,0.32,0.26,2.4,0.09,0.7016,74.0,4.1,16.13,6852491.0,59659.671875,0.0,59659.671875,58970.019531,59647.380859,60255.447969,0.009533,52.404594,595.341481,60838.063822,58456.697897,-0.019927,0.028062,0.052285,2025


| **Column**              | **Description**                                                                                         |
| ----------------------- | ------------------------------------------------------------------------------------------------------- |
| `date`                  | Observation date (`datetime64[ns]`); one row per calendar day, weekends and holidays included.          |
| `CPIAUCSL`              | U.S. Consumer Price Index, used to measure inflation.                                                   |
| `FEDFUNDS`              | Effective federal funds rate.                                                                           |
| `GDP`                   | Nominal U.S. Gross Domestic Product.                                                                    |
| `DGS10`                 | Yield on 10-year U.S. Treasury securities.                                                              |
| `STLFSI2`               | St. Louis Fed Financial Stress Index.                                                                   |
| `UNRATE`                | U.S. unemployment rate.                                                                                 |
| `TICKER`                | Closing level of the index, named after it in upper case (e.g. `SP500`); forward-filled on non-trading days. |
| `TICKER_ret`            | Daily return of the index, taken from the Silver `daily_return` column.                                 |
| `ticker_lag1`           | Close one row (calendar day) earlier. Feature columns use the lower-case index name (e.g. `sp500_lag1`). |
| `ticker_lag5`           | Close five rows (calendar days) earlier.                                                                |
| `ticker_sma20`          | Simple moving average of the close over 20 rows. Smooths the short-term trend.                          |
| `ticker_sma50`          | Simple moving average of the close over 50 rows. Captures longer trends.                                |
| `ticker_vol20`          | Standard deviation of daily log returns over 20 rows; measures recent return variability.               |
| `ticker_rsi14`          | 14-row Relative Strength Index (simple-average variant). Oscillator for overbought/oversold conditions. |
| `ticker_std20`          | Standard deviation of the close over the last 20 rows; measures price dispersion.                       |
| `ticker_bollinger_upper`| Upper Bollinger Band (`sma20` + 2 × `std20`).                                                           |
| `ticker_bollinger_lower`| Lower Bollinger Band (`sma20` - 2 × `std20`).                                                           |
| `y_ticker_ret_5d`       | Forward return over the next 5 rows. Label for short-term prediction models.                            |
| `y_ticker_ret_20d`      | Forward return over the next 20 rows. Target for medium-term prediction.                                |
| `y_ticker_ret_180d`     | Forward return over the next 180 rows. Target for long-term prediction.                                 |
| `year`                  | Year of the observation (extracted from `date`).                                                        |

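The `y_*_ret_*d` targets in the table follow the expression used in the generation loop, `df[tk].shift(-h) / df[tk] - 1`. The sketch below reproduces it on a toy price series (hypothetical values) to show two practical consequences: the horizon counts rows rather than trading sessions, and the last `h` rows have no target.

```python
import pandas as pd

def forward_return(prices: pd.Series, horizon: int) -> pd.Series:
    """Simple return from row t to row t + horizon; NaN where the future is unknown."""
    return prices.shift(-horizon) / prices - 1

# Toy daily series: a flat weekend (forward-filled close) followed by two gains.
prices = pd.Series(
    [100.0, 100.0, 100.0, 110.0, 121.0],
    index=pd.date_range("2025-01-03", periods=5, freq="D"),
)

target = forward_return(prices, horizon=2)
print(target)
```

Rows whose target is NaN (the final `horizon` rows of the full history) would normally be dropped before training, since their label is not yet observable.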
In [8]:
import pandas as pd
import numpy as np
import pyarrow.parquet as pq   # fast metadata reads

def quick_stats(paths):
    """
    Return total rows, total columns, % nulls and % outliers, using only
    Parquet metadata for the counts and a single sample file (the last path,
    normally the most recent year) for the statistics.
    """
    if not paths:
        return dict(n_rows=0, n_cols=0,
                    pct_nan=np.nan, pct_outliers=np.nan)

    # --------- 1. row and column counts from metadata only ----------
    total_rows = 0
    n_cols = None
    for f in paths:
        meta = pq.ParquetFile(fs.open(f, "rb")).metadata
        total_rows += meta.num_rows
        if n_cols is None:
            n_cols = meta.num_columns  # reading it once is enough

    # --------- 2. basic statistics from a single file ---------------
    df_sample = pd.read_parquet(fs.open(paths[-1], "rb"))
    num = df_sample.select_dtypes("number")

    # Share of numeric cells more than 5 standard deviations from their column mean
    z = (num - num.mean()) / num.std(ddof=0)
    pct_out = 100 * (np.abs(z) > 5).sum().sum() / (len(df_sample) * len(num.columns))

    pct_nan = 100 * df_sample.isna().sum().sum() / (len(df_sample) * n_cols)

    return dict(n_rows=total_rows,
                n_cols=n_cols,
                pct_nan=round(pct_nan, 2),
                pct_outliers=round(pct_out, 3))


rows = []
for idx in tickers:                            
    paths = fs.glob(f"{CONTAINER}/{GOLD_PREFIX}/{idx}/year=*/*.parquet")
    stats = quick_stats(paths)
    stats["index"] = idx
    rows.append(stats)

df_summary = (pd.DataFrame(rows)
                .set_index("index")
                .sort_index())

display(df_summary)

| index        | n_rows | n_cols | pct_nan | pct_outliers |
|--------------|--------|--------|---------|--------------|
| DJIA         | 12259  | 44     | 2.26    | 0.034        |
| NASDAQ       | 12259  | 44     | 2.26    | 0.023        |
| RUSSELL2000  | 12259  | 44     | 2.26    | 0.023        |
| SP500        | 12259  | 44     | 2.26    | 0.034        |
| WILSHIRE5000 | 12259  | 44     | 2.26    | 0.023        |


In [9]:
# ================================================================
# 3.3 Graphical summary and comparison tables for the five indices
# ================================================================
import os, numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.diagnostic import acorr_ljungbox

os.makedirs("eda_outputs", exist_ok=True)

TICKERS = ["SP500", "DJIA", "NASDAQ", "RUSSELL2000", "WILSHIRE5000"]
summary, bars_q, adf_rows = [], [], []

for tk in TICKERS:
    # ---------------------------------------------------------------------
    # Load the index's FULL history (1992-2025)
    # ---------------------------------------------------------------------
    paths = fs.glob(f"{CONTAINER}/{GOLD_PREFIX}/{tk}/year=*/*.parquet")
    df = (pd.concat(pd.read_parquet(fs.open(p, "rb")) for p in paths)
            .set_index("date")
            .sort_index())   # chronological order for the time-series tests below

    # ---------- 1) Data quality ------------------------------------------
    pct_nan = df.isna().mean().mean() * 100
    num = df.select_dtypes("number")   # z-scores are only defined for numeric columns
    z = (num - num.mean()) / num.std(ddof=0)
    pct_out = (np.abs(z) > 5).mean().mean() * 100
    bars_q.append([pct_nan, pct_out])

    # ---------- 2) Statistics for the quality table ----------------------
    r1d = df[f"{tk}_ret"].dropna()
    kurtosis = r1d.kurtosis()
    p_lb20 = acorr_ljungbox(r1d, lags=[20], return_df=True)["lb_pvalue"].iloc[0]
    p_ret1d = adfuller(r1d, autolag="AIC")[1]
    summary.append([tk, len(df), df.shape[1], kurtosis, p_ret1d, pct_nan, pct_out, p_lb20])

    # ---------- 3) ADF p-values for the log price and the 4 return series -
    p_price   = adfuller(np.log(df[tk]).dropna(), autolag="AIC")[1]
    p_ret5d   = adfuller(df[f"y_{tk.lower()}_ret_5d"  ].dropna(), autolag="AIC")[1]
    p_ret20d  = adfuller(df[f"y_{tk.lower()}_ret_20d" ].dropna(), autolag="AIC")[1]
    p_ret180d = adfuller(df[f"y_{tk.lower()}_ret_180d"].dropna(), autolag="AIC")[1]
    adf_rows.append([tk, p_price, p_ret1d, p_ret5d, p_ret20d, p_ret180d])

# -----------------------------------------------------------------
# 4) Save the quality CSV
# -----------------------------------------------------------------
cols_metrics = ["Index","Rows","Cols","Kurtosis","p_ADF_ret1d",
                "%NaN","%Outliers","p_LB20"]
pd.DataFrame(summary, columns=cols_metrics).set_index("Index") \
  .to_csv("eda_outputs/metrics_summary.csv", float_format="%.4g")

# -----------------------------------------------------------------
# 5) Save the ADF p-value CSV (price + 4 return series)
# -----------------------------------------------------------------
cols_adf = ["Index","p_ADF_price","p_ADF_ret1d",
            "p_ADF_ret5d","p_ADF_ret20d","p_ADF_ret180d"]
pd.DataFrame(adf_rows, columns=cols_adf).set_index("Index") \
  .to_csv("eda_outputs/adf_pvalues.csv", float_format="%.4g")

# -----------------------------------------------------------------
# 6) Stacked bar: %NaN + %Outliers
# -----------------------------------------------------------------
bars_q = np.array(bars_q)
fig, ax = plt.subplots(figsize=(6,3))
ax.bar(TICKERS, bars_q[:,0], label="% NaN")
ax.bar(TICKERS, bars_q[:,1], bottom=bars_q[:,0], label="% Outliers")
ax.set_ylabel("Percentage"); ax.set_title("Data quality by index"); ax.legend()
plt.tight_layout(); fig.savefig("eda_outputs/quality_bar.png", dpi=300); plt.close()

# -----------------------------------------------------------------
# 7) Side-by-side histograms (1×5): 1-day returns
# -----------------------------------------------------------------
fig, axes = plt.subplots(1,5, figsize=(15,3), sharey=True)
for i, tk in enumerate(TICKERS):
    r1d_recent = pd.read_parquet(fs.open(
        fs.glob(f"{CONTAINER}/{GOLD_PREFIX}/{tk}/year=*/*.parquet")[-1], "rb")
    )[f"{tk}_ret"].dropna()
    sns.histplot(r1d_recent, bins=80, kde=True, ax=axes[i])
    axes[i].set_title(tk); axes[i].set_xlabel("")
# Only the last partition file is read above, so the title names that scope
fig.suptitle("Distribution of daily returns (most recent year partition)")
plt.tight_layout(); fig.savefig("eda_outputs/hist_grid.png", dpi=300); plt.close()

# -----------------------------------------------------------------
# 8) ACF/PACF grid (2×5): 1-day returns
# -----------------------------------------------------------------
fig, axes = plt.subplots(2,5, figsize=(15,6), sharex=True)
for i, tk in enumerate(TICKERS):
    r1d_recent = pd.read_parquet(fs.open(
        fs.glob(f"{CONTAINER}/{GOLD_PREFIX}/{tk}/year=*/*.parquet")[-1], "rb")
    )[f"{tk}_ret"].dropna()
    plot_acf (r1d_recent, lags=30, ax=axes[0,i]); axes[0,i].set_title(tk)
    plot_pacf(r1d_recent, lags=30, ax=axes[1,i])
fig.suptitle("ACF and PACF of daily returns")
plt.tight_layout(); fig.savefig("eda_outputs/acf_pacf_grid.png", dpi=300); plt.close()

print("EDA figures and tables updated in 'eda_outputs/'")

EDA figures and tables updated in 'eda_outputs/'
