# Historical ingestion + incremental load of stock market indices into Azure Data Lake

**Description:**<br>
This Python script downloads daily prices for the major stock market indices with the yfinance library (1992 to present), normalizes and validates them, saves them locally as Parquet, and replicates them to Azure Data Lake Storage Gen2. It includes quality checks, ingestion logs, and incremental load logic.<br>

**Objectives:**

- Automate data extraction from the Yahoo Finance API.
- Standardize the schema.
- Persist the data in an efficient format (Parquet), both locally and in ADLS, enabling compression and columnar reads.
- Ensure baseline quality through coverage, null, and duplicate checks.
- Implement an idempotent incremental load, partitioned by year=YYYY.
- Record the activity (ingestion log and file sizes).
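
The idempotent, year-partitioned load named in the objectives can be illustrated with a minimal sketch. It assumes an in-memory dict standing in for ADLS and a hypothetical helper `upsert_partitions`; neither is part of the pipeline. New rows are merged into each `year=YYYY` partition and de-duplicated on `(date, ticker)`, so re-running the same batch leaves the store unchanged.

```python
import pandas as pd

def upsert_partitions(store: dict, new_rows: pd.DataFrame, name: str) -> dict:
    """Merge new_rows into per-year partitions kept in `store` (hypothetical stand-in for ADLS)."""
    new_rows = new_rows.copy()
    new_rows["year"] = pd.to_datetime(new_rows["date"]).dt.year
    for yr, group in new_rows.groupby("year"):
        key = f"bronze/indices/{name}/year={yr}/{name}_{yr}.parquet"
        existing = store.get(key)
        merged = group if existing is None else pd.concat([existing, group])
        # Keep one row per (date, ticker) so repeated loads do not add duplicates
        store[key] = (
            merged.drop_duplicates(subset=["date", "ticker"], keep="last")
                  .sort_values("date")
                  .reset_index(drop=True)
        )
    return store

# Synthetic batch spanning two calendar years
batch = pd.DataFrame({
    "date": pd.to_datetime(["2024-12-30", "2024-12-31", "2025-01-02"]),
    "close": [1.0, 2.0, 3.0],
    "ticker": "SP500",
})
store = {}
upsert_partitions(store, batch, "SP500")
upsert_partitions(store, batch, "SP500")   # second run changes nothing
print({key: len(part) for key, part in store.items()})
```

Merging before writing is one possible design; rewriting the whole partition from the source is another, and is simpler when the source can always be re-downloaded.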

## 1. Basic configuration, libraries, and global pipeline parameters

In [1]:
# ================================================================
# 1.1 Basic configuration and libraries
# ================================================================
import os
from datetime import date

import pandas as pd
import yfinance as yf

# Show all columns when printing DataFrames
pd.set_option("display.max_columns", None)

In [2]:
# ================================================================
# 1.2 · Global pipeline parameters
# ================================================================
TICKERS = {
    "SP500": "^GSPC",
    "DJIA": "^DJI",
    "NASDAQ": "^IXIC",
    "RUSSELL2000": "^RUT",
    "WILSHIRE5000": "^W5000",
}

START_DATE = "1992-01-01"   # Fixed start date of the history
END_DATE   = date.today().isoformat()  # Dynamic end date (today); yfinance treats `end` as exclusive
DATA_DIR   = "data/raw/indices"      # Local folder for the RAW zone
os.makedirs(DATA_DIR, exist_ok=True) # Create the folder if it does not exist

## 2. Download and normalization function (RAW)

In [3]:
# ================================================================
# 2.1 · Download daily prices
# ================================================================
def download_index(ticker: str, name: str) -> pd.DataFrame:
    """
    Download daily prices and return flat columns:
    open, high, low, close, volume, ticker
    (no adj_close column: with auto_adjust=True the prices are already adjusted)
    """

    # --- Download from the Yahoo Finance API ---------------------
    df = yf.download(
        ticker,
        start=START_DATE,
        end=END_DATE,
        interval="1d",
        auto_adjust=True, # Adjusts OHLC prices for splits and dividends
        progress=False    # Suppresses the progress bar
    )

    # --- 1· Flatten headers --------------------------------------
    if isinstance(df.columns, pd.MultiIndex):
        # Current example: MultiIndex( ('open','^GSPC'), … )
        df.columns = df.columns.get_level_values(0)        # → open, high, low…
    else:
        # Sometimes loose tuples arrive; keep the first element (assumed to be the field name)
        df.columns = [col[0] if isinstance(col, tuple) else col for col in df.columns]

    # --- 2· Normalize names --------------------------------------
    df.columns = [str(c).lower().replace(" ", "_") for c in df.columns]

    # --- 3· Add metadata and return ------------------------------
    df.index.name = "date"   # index → date
    df["ticker"] = name      # market identifier
    return df
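
The flattening and renaming steps can be tried offline on a synthetic frame. This is a minimal sketch: `flatten_columns` is a hypothetical helper that mirrors steps 1 and 2 of `download_index`, and the two-level header is built by hand rather than downloaded.

```python
import pandas as pd

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with single-level, lowercase, snake_case column names."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # Keep only the first level (the field name) and drop the ticker level
        out.columns = out.columns.get_level_values(0)
    out.columns = [str(c).lower().replace(" ", "_") for c in out.columns]
    return out

# Synthetic frame with a (field, ticker) two-level header
cols = pd.MultiIndex.from_tuples(
    [("Close", "^GSPC"), ("Adj Close", "^GSPC"), ("Volume", "^GSPC")]
)
sample = pd.DataFrame(
    [[1.0, 1.0, 100]], columns=cols, index=pd.to_datetime(["2024-01-02"])
)
flat = flatten_columns(sample)
print(list(flat.columns))   # ['close', 'adj_close', 'volume']
```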


## 3. Initial ingestion (full history) -> Azure Data Lake Storage

In [4]:
# ================================================================
# 3.1 Initial sample to inspect the returned columns
# ================================================================
test = download_index("^GSPC", "SP500")
print(test.columns)

Index(['close', 'high', 'low', 'open', 'volume', 'ticker'], dtype='object')


In [5]:
# ================================================================
# 3.2 Initial ingestion (full RAW history → ADLS)
# ================================================================

from utils_adls import upload_bytes
frames = []  # Keeps the DataFrames in memory for later EDA
for name, ticker in TICKERS.items():
    print(f"⏬ {name} ({ticker})")
    df_i = download_index(ticker, name)
    print(f"   {df_i.shape[0]} rows since {df_i.index.min().date()}")

    # 1) Save a local Parquet file (RAW zone)
    local_file = f"{DATA_DIR}/{name}.parquet"
    df_i.to_parquet(local_file)

    # 2) Replicate the RAW file to Azure Data Lake Storage Gen2
    remote_raw = f"raw/indices/{name}.parquet"
    with open(local_file, "rb") as f:
        upload_bytes(remote_raw, f.read())     # Overwrites if it already exists

    frames.append(df_i)

⏬ SP500 (^GSPC)
   8451 rows since 1992-01-02
⏬ DJIA (^DJI)
   8451 rows since 1992-01-02
⏬ NASDAQ (^IXIC)
   8451 rows since 1992-01-02
⏬ RUSSELL2000 (^RUT)
   8451 rows since 1992-01-02
⏬ WILSHIRE5000 (^W5000)
   8444 rows since 1992-01-02
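
The notebook imports `upload_bytes` from a local `utils_adls` module that is not shown. The sketch below is one way such a helper might look, not the module's actual code. The environment variable names, the default container name, and the extra `service` parameter (added so the function can be exercised without Azure) are all assumptions. It uses `DataLakeServiceClient.get_file_client` and `DataLakeFileClient.upload_data(..., overwrite=True)` from `azure-storage-file-datalake`.

```python
import os
from typing import Optional

CONTAINER = os.environ.get("ADLS_CONTAINER", "datalake")  # assumed variable and default

def _client():
    """Build a DataLakeServiceClient from a connection string (assumed variable name)."""
    from azure.storage.filedatalake import DataLakeServiceClient  # imported lazily
    return DataLakeServiceClient.from_connection_string(
        os.environ["ADLS_CONNECTION_STRING"]
    )

def upload_bytes(path: str, data: bytes,
                 content_type: Optional[str] = None, service=None) -> None:
    """Write `data` to `path` inside CONTAINER, replacing any existing file."""
    service = service or _client()
    file_client = service.get_file_client(CONTAINER, path)
    kwargs = {}
    if content_type is not None:
        from azure.storage.filedatalake import ContentSettings
        kwargs["content_settings"] = ContentSettings(content_type=content_type)
    file_client.upload_data(data, overwrite=True, **kwargs)

# --- Offline stand-ins, used only to exercise the function ---------
class _FakeFile:
    def __init__(self, store, key):
        self.store, self.key = store, key

    def upload_data(self, data, overwrite=False, **kwargs):
        if self.key in self.store and not overwrite:
            raise FileExistsError(self.key)
        self.store[self.key] = data

class _FakeService:
    def __init__(self):
        self.store = {}

    def get_file_client(self, file_system, file_path):
        return _FakeFile(self.store, f"{file_system}/{file_path}")

fake = _FakeService()
upload_bytes("raw/indices/SP500.parquet", b"v1", service=fake)
upload_bytes("raw/indices/SP500.parquet", b"v2", service=fake)  # overwrites the first upload
print(fake.store)
```

Because every upload overwrites its target, re-running the ingestion cell replaces the RAW files instead of accumulating copies.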


## 4. Quick quality checks

In [6]:
# ================================================================
# 4.1 Combine all Parquet files for analysis (quick EDA)
# ================================================================
parquet_paths = [f"{DATA_DIR}/{name}.parquet" for name in TICKERS]
market_df = pd.concat([pd.read_parquet(p) for p in parquet_paths])
market_df.tail()

| date | close | high | low | open | volume | ticker |
|---|---|---|---|---|---|---|
| 2025-07-21 | 62981.320312 | 63327.730469 | 62947.539062 | 62947.539062 | 0 | WILSHIRE5000 |
| 2025-07-22 | 63076.570312 | 63149.789062 | 62757.300781 | 62981.320312 | 0 | WILSHIRE5000 |
| 2025-07-23 | 63586.96875 | 63594.929688 | 63076.570312 | 63076.570312 | 0 | WILSHIRE5000 |
| 2025-07-24 | 63560.398438 | 63745.820312 | 63559.71875 | 63586.96875 | 0 | WILSHIRE5000 |
| 2025-07-25 | 63834.25 | 63898.480469 | 63560.398438 | 63560.398438 | 0 | WILSHIRE5000 |


In [7]:
# ================================================================
# 4.2 Time coverage
# ================================================================
coverage = (
    market_df.reset_index()
             .groupby("ticker")["date"]
             .agg(["min", "max", "count"])
             .rename(columns={"min": "fecha_min", "max": "fecha_max", "count": "num_dias"})
)
print("\nTime coverage per index:\n", coverage)


Time coverage per index:
               fecha_min  fecha_max  num_dias
ticker                                      
DJIA         1992-01-02 2025-07-25      8451
NASDAQ       1992-01-02 2025-07-25      8451
RUSSELL2000  1992-01-02 2025-07-25      8451
SP500        1992-01-02 2025-07-25      8451
WILSHIRE5000 1992-01-02 2025-07-25      8444
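
The coverage table shows WILSHIRE5000 with 8444 rows against 8451 for the other indices. A minimal sketch for listing which dates are missing relative to a reference index follows. `missing_dates` is a hypothetical helper and the data here is synthetic; in the notebook it could be called on `market_df`.

```python
import pandas as pd

def missing_dates(df: pd.DataFrame, ticker: str, reference: str) -> pd.DatetimeIndex:
    """Dates present for `reference` but absent for `ticker` (df is indexed by date)."""
    ref_dates = df.index[(df["ticker"] == reference).to_numpy()]
    own_dates = df.index[(df["ticker"] == ticker).to_numpy()]
    return ref_dates.difference(own_dates)

# Synthetic example: the second index lacks 2024-01-03
days = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])
toy = pd.concat([
    pd.DataFrame({"close": [1.0, 2.0, 3.0], "ticker": "SP500"}, index=days),
    pd.DataFrame({"close": [1.0, 3.0], "ticker": "WILSHIRE5000"}, index=days[[0, 2]]),
])
toy.index.name = "date"

gaps = missing_dates(toy, "WILSHIRE5000", "SP500")
print(gaps)
```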


In [8]:
# ================================================================
# 4.3 Null values
# ================================================================
nulls_pct = (market_df.isna().mean() * 100).round(3)
print("Percentage of NaNs per column:\n", nulls_pct)

Percentage of NaNs per column:
 close     0.0
high      0.0
low       0.0
open      0.0
volume    0.0
ticker    0.0
dtype: float64


In [9]:
# ================================================================
# 4.4 Date-ticker duplicates
# ================================================================
dupes = (
    market_df.reset_index()
             .duplicated(subset=["date", "ticker"])
             .sum()
)
print("Duplicates:", dupes)
assert dupes == 0, "Duplicate dates found; check the download."

Duplicates: 0


## 5. Parquet file size check and ingestion logging

In [10]:
# ================================================================
# 5.1 Parquet file size check
# ================================================================
import pathlib

filesize_mb = {
    p.name: round(p.stat().st_size / 1_048_576, 2)   # bytes → MB
    for p in pathlib.Path(DATA_DIR).glob("*.parquet")
}
print("File sizes (MB):", filesize_mb)

File sizes (MB): {'DJIA.parquet': 0.37, 'NASDAQ.parquet': 0.38, 'RUSSELL2000.parquet': 0.37, 'SP500.parquet': 0.38, 'WILSHIRE5000.parquet': 0.32}


In [11]:
# ================================================================
# 5.2 Ingestion logging
# ================================================================
import logging
import datetime as dt

LOG_DIR = "logs"
os.makedirs(LOG_DIR, exist_ok=True)
logging.basicConfig(
    filename=f"{LOG_DIR}/ingest_indices_{dt.date.today()}.log",
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
)

# filesize_mb maps file name → size in MB
for name, size_mb in filesize_mb.items():
    logging.info("%s: %s MB saved in %s", name, size_mb, DATA_DIR)
logging.info("Checks: duplicates: %s, NaNs: %s", dupes, nulls_pct.to_dict())
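
`logging.basicConfig` does nothing when the root logger already has handlers, which can happen in a notebook session, so the log file may silently stay empty. A minimal sketch of an alternative uses a dedicated named logger with its own `FileHandler`. The logger name and the helper `get_ingest_logger` are hypothetical, and the demo writes to a temporary directory.

```python
import logging
import os
import tempfile

def get_ingest_logger(log_path: str) -> logging.Logger:
    """Return a named logger that writes to log_path, adding the handler only once."""
    logger = logging.getLogger("ingest_indices")   # hypothetical logger name
    logger.setLevel(logging.INFO)
    target = os.path.abspath(log_path)
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == target
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger

log_file = os.path.join(tempfile.mkdtemp(), "ingest_demo.log")
log = get_ingest_logger(log_file)
log = get_ingest_logger(log_file)      # re-running the cell does not duplicate the handler
log.info("%s: %s MB", "SP500.parquet", 0.38)

for h in log.handlers:
    h.flush()
with open(log_file, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
print(lines)
```

Passing `force=True` to `logging.basicConfig` (Python 3.8+) is a shorter alternative when replacing the root configuration is acceptable.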

# Bronze stage

## 6. Bronze write test

In [12]:
# ================================================================
# 6.1 Write test
# ================================================================
from utils_adls import upload_bytes
upload_bytes(
    "bronze/test.txt",
    b"hola mundo",          # Arbitrary content
    content_type="text/plain"
)
print("✔ Test file created/overwritten without errors in Bronze")

✔ Test file created/overwritten without errors in Bronze


## 7. Incremental load from the last loaded year

In [13]:
# ================================================================
# 7.1 Get the last loaded year (for the incremental load)
# ================================================================
from azure.core.exceptions import ResourceNotFoundError
from utils_adls import _client, CONTAINER
BRONZE_ADLS_PREFIX = "bronze/indices"
TICKERS_LIST = list(TICKERS.keys())

svc = _client()           # ADLS client

ultimo_anio = {}          # {ticker: last year found}

for tk in TICKERS_LIST:
    dir_path = f"{BRONZE_ADLS_PREFIX}/{tk}"
    dir_client = svc.get_directory_client(CONTAINER, dir_path)

    try:
        years = [
            int(p.name.split("=")[1])
            for p in dir_client.get_paths() if p.is_directory
        ]
        ultimo_anio[tk] = max(years) if years else None
    except ResourceNotFoundError:
        # The directory does not exist yet: fall back to the first year of the history
        ultimo_anio[tk] = 1992

print("\nLast loaded year per index:", ultimo_anio)



Last loaded year per index: {'SP500': 1992, 'DJIA': 1992, 'NASDAQ': 1992, 'RUSSELL2000': 1992, 'WILSHIRE5000': 1992}
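
The expression `p.name.split("=")[1]` raises an error as soon as a directory under the prefix does not follow the `year=YYYY` pattern. A minimal sketch of a more tolerant parser is shown here, working on plain path strings so it can be tested without ADLS. `latest_year` and the sample paths are hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches a path segment of the form year=YYYY
_YEAR_RE = re.compile(r"(?:^|/)year=(\d{4})(?:/|$)")

def latest_year(paths: Iterable[str]) -> Optional[int]:
    """Return the highest year=YYYY found in the paths, or None if there is none."""
    years = []
    for path in paths:
        match = _YEAR_RE.search(path)
        if match:
            years.append(int(match.group(1)))
    return max(years) if years else None

sample_paths = [
    "bronze/indices/SP500/year=1992",
    "bronze/indices/SP500/year=2025/SP500_2025.parquet",
    "bronze/indices/SP500/_tmp",          # ignored: no year=YYYY segment
]
print(latest_year(sample_paths))   # 2025
print(latest_year([]))             # None
```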


In [14]:
# ================================================================
# 7.2 Incremental load (RAW → Bronze, by year)
# ================================================================
import io
import datetime as dt

import pandas as pd
import yfinance as yf
from azure.core.exceptions import ResourceNotFoundError
from utils_adls import _client, upload_bytes, CONTAINER

def max_date_for(ticker: str):
    """Return the latest date stored in Bronze for a ticker, or None if nothing is stored."""

    # svc and BRONZE_ADLS_PREFIX come from cell 7.1
    dir_client = svc.get_directory_client(
        CONTAINER,
        f"{BRONZE_ADLS_PREFIX}/{ticker}"    # ← already points at the index
    )

    try:
        # 1) Find the year=YYYY directories
        years = sorted(
            int(p.name.split("=")[1])
            for p in dir_client.get_paths() if p.is_directory
        )
        if not years:
            return None
        latest_year = years[-1]

        # 2) Take the first Parquet file inside the latest year
        latest_file = next(
            (
                p.name for p in dir_client.get_paths()            # ← no extra path!
                if (not p.is_directory) and f"year={latest_year}/" in p.name
            ),
            None,
        )
        if latest_file is None:     # the year directory exists but holds no file
            return None

        # 3) Download it and read only the date column
        file_client = svc.get_file_client(CONTAINER, latest_file)
        raw = file_client.download_file().readall()

        df = pd.read_parquet(io.BytesIO(raw), columns=["date"])
        return pd.to_datetime(df["date"]).max().date()
    except ResourceNotFoundError:
        return None

# ---------- Main loop, one ticker at a time -----------------------
today = dt.date.today()
for name, y_ticker in TICKERS.items():
    last_dt = max_date_for(name)

    # yfinance treats `end` as exclusive, so yesterday is the newest date available
    if last_dt and last_dt + dt.timedelta(days=1) >= today:
        print(f"{name}: already up to date through {last_dt}")
        continue

    # Restart at January 1 of the last loaded year: every upload overwrites the
    # whole year=YYYY file, so that partition has to be rebuilt in full.
    start_dt = dt.date(last_dt.year, 1, 1) if last_dt else dt.date(1992, 1, 1)

    print(f"{name}: downloading from {start_dt} to {today}")
    df_new = yf.download(
        y_ticker,
        start=start_dt.isoformat(),
        end=today.isoformat(),
        interval="1d",
        auto_adjust=True,
        progress=False,
    )

    if df_new.empty:
        print(f"{name}: nothing new")
        continue

    # --- Normalize the columns again ------------------------------
    if isinstance(df_new.columns, pd.MultiIndex):
        df_new.columns = df_new.columns.get_level_values(0)
    df_new.columns = [str(c).lower().replace(" ", "_") for c in df_new.columns]
    df_new.index.name = "date"
    df_new["ticker"] = name

    # Reset the index and add the year partition column
    df_new = df_new.reset_index()
    df_new["year"] = pd.to_datetime(df_new["date"]).dt.year

    # --- Upload one Parquet file per year -------------------------
    for yr, group in df_new.groupby("year"):
        buf = io.BytesIO()
        group.to_parquet(buf, index=False, engine="pyarrow")
        buf.seek(0)

        remote = f"{BRONZE_ADLS_PREFIX}/{name}/year={yr}/{name}_{yr}.parquet"
        upload_bytes(remote, buf.read())        # overwrite=True inside the function
        print(f"▲ {name} {yr} -> {remote}   ({len(group)} rows)")

print("\nProcess completed ✅")

SP500: downloading from 1992-01-01 to 2025-07-28
▲ SP500 1992 -> bronze/indices/SP500/year=1992/SP500_1992.parquet   (254 rows)
▲ SP500 1993 -> bronze/indices/SP500/year=1993/SP500_1993.parquet   (253 rows)
▲ SP500 1994 -> bronze/indices/SP500/year=1994/SP500_1994.parquet   (252 rows)
▲ SP500 1995 -> bronze/indices/SP500/year=1995/SP500_1995.parquet   (252 rows)
▲ SP500 1996 -> bronze/indices/SP500/year=1996/SP500_1996.parquet   (254 rows)
▲ SP500 1997 -> bronze/indices/SP500/year=1997/SP500_1997.parquet   (253 rows)
▲ SP500 1998 -> bronze/indices/SP500/year=1998/SP500_1998.parquet   (252 rows)
▲ SP500 1999 -> bronze/indices/SP500/year=1999/SP500_1999.parquet   (252 rows)
▲ SP500 2000 -> bronze/indices/SP500/year=2000/SP500_2000.parquet   (252 rows)
▲ SP500 2001 -> bronze/indices/SP500/year=2001/SP500_2001.parquet   (248 rows)
▲ SP500 2002 -> bronze/indices/SP500/year=2002/SP500_2002.parquet   (252 rows)
▲ SP500 2003 -> bronze/indices/SP500/year=2003/SP500_2003.parquet  