<a href="https://colab.research.google.com/github/afcabre/git-25-09-gh/blob/main/ValoracionEmpresas_USA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FINAL PROJECT: Fundamental Valuation of Publicly Traded U.S. Companies and a SQL Agent with LangChain

**Author:** Andrés Fernando Cabrera - Fundamentals of LLMs and Data course  
**Session:** Data Preprocessing and SQL Agents

---
## 1. Introduction: From Financial Statements and Prices to Conversational Investment Intelligence

The scenario is investment analysis, and specifically the recurring questions an investor faces before opening or closing a position. Is the price being paid or received fair, cheap, or expensive relative to the company's peers? Are there signs of financial risk to identify before buying? And can fundamental analysis explain clearly why a stock looks attractive or expensive at a given moment?
Traditionally, answering these questions means writing separate SQL queries, joining financial statements with prices, computing derived metrics (TTM, margins, yields, leverage), and then consolidating the findings into reports. That workflow is slow, repetitive, and hard to scale as the questions multiply.
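
Of the derived metrics listed, TTM (trailing twelve months) is simply the sum of the four most recent quarters. The sketch below shows that arithmetic in pandas on made-up numbers. The input column names mirror the SimFin income file, while the output columns (`revenue_ttm`, `net_income_ttm`, `net_margin_ttm`) are hypothetical names chosen for illustration, and the code assumes the quarters are consecutive with no gaps.

```python
import pandas as pd

# Made-up quarterly figures for one hypothetical company (SimFinId = 1).
income = pd.DataFrame({
    "SimFinId": [1, 1, 1, 1, 1],
    "Report Date": pd.to_datetime(
        ["2023-03-31", "2023-06-30", "2023-09-30", "2023-12-31", "2024-03-31"]
    ),
    "Revenue": [100.0, 110.0, 120.0, 130.0, 140.0],
    "Net Income": [10.0, 12.0, 9.0, 15.0, 16.0],
}).sort_values(["SimFinId", "Report Date"])


def ttm(frame: pd.DataFrame, col: str) -> pd.Series:
    """Sum of the last 4 quarters per company; NaN until 4 quarters exist."""
    return frame.groupby("SimFinId")[col].transform(
        lambda s: s.rolling(4, min_periods=4).sum()
    )


income["revenue_ttm"] = ttm(income, "Revenue")
income["net_income_ttm"] = ttm(income, "Net Income")
# TTM net margin = TTM net income / TTM revenue
income["net_margin_ttm"] = income["net_income_ttm"] / income["revenue_ttm"]

print(income[["Report Date", "revenue_ttm", "net_income_ttm", "net_margin_ttm"]])
```

The first three quarters have no TTM value because fewer than four quarters are available.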

This final project turns that manual process into a **conversational analysis** system backed by an **LLM agent connected to a SQL database**. The pipeline takes public SimFin data (quarterly financial statements and daily prices for U.S. companies), processes it into a relational schema, and exposes an agent that answers questions in natural language. The expected result is an interface where a user with no SQL knowledge can explore, filter, and explain opportunities based on fundamentals, with traceability back to the dataset's source columns.

### Learning objectives (applied focus)

The goal is to demonstrate a working command of the course material on SQL agent architectures, through design decisions and tests that connect the data to real financial analysis questions:

**Relational schema design with the agent in mind:**  
The tables and relationships are designed to make reasoning easier: dimensions (companies, industries), facts (balance, income, cashflow, prices), and a layer of derived metrics with traceability. The aim is not just normalization but queries that are repeatable and understandable for a conversational assistant.

**Agent orchestration for natural-language questions:**  
The SQL agent must not only translate questions into queries; it must also keep context and speak the language of the domain. For example, “cheap versus its industry” implies comparing against sector percentiles; “risk signals” implies checking debt, liquidity, and coverage; “sustained improvement” implies trends and stability, not a single quarter. These capabilities are tested with a set of guiding questions and test cases.

By the end, the goal is a working prototype and, above all, a practical understanding of how to **structure data and metrics to maximize their usefulness in SQL-backed conversational AI applications**.
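
As a rough illustration of the dimension/fact/metrics layering and of a percentile-based reading of “cheap versus its industry”, the sketch below builds a toy in-memory SQLite database. The table names (`dim_industry`, `dim_company`, `metrics`), the `pe_ttm` column, and every value are hypothetical; they are not the project's actual schema or data.

```python
import sqlite3

# Hypothetical names and made-up values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_industry (
    industry_id INTEGER PRIMARY KEY,
    industry    TEXT,
    sector      TEXT
);
CREATE TABLE dim_company (
    simfin_id   INTEGER PRIMARY KEY,
    ticker      TEXT,
    industry_id INTEGER REFERENCES dim_industry(industry_id)
);
CREATE TABLE metrics (
    simfin_id INTEGER REFERENCES dim_company(simfin_id),
    as_of     TEXT,
    pe_ttm    REAL
);
""")
conn.executemany("INSERT INTO dim_industry VALUES (?, ?, ?)",
                 [(1, "Toy Industry", "Toy Sector")])
conn.executemany("INSERT INTO dim_company VALUES (?, ?, ?)",
                 [(10, "AAA", 1), (11, "BBB", 1), (12, "CCC", 1)])
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)",
                 [(10, "2024-12-31", 8.0),
                  (11, "2024-12-31", 15.0),
                  (12, "2024-12-31", 30.0)])

# Rank each company's multiple inside its own industry:
# 0.0 = cheapest in the industry, 1.0 = most expensive.
query = """
SELECT c.ticker,
       m.pe_ttm,
       PERCENT_RANK() OVER (
           PARTITION BY c.industry_id ORDER BY m.pe_ttm
       ) AS pe_pct_in_industry
FROM metrics m
JOIN dim_company c ON c.simfin_id = m.simfin_id
ORDER BY pe_pct_in_industry
"""
rows = conn.execute(query).fetchall()
for ticker, pe, pct in rows:
    print(f"{ticker}: P/E {pe:.1f}, industry percentile {pct:.2f}")
```

`PERCENT_RANK()` is a window function and needs SQLite 3.25 or newer. Keeping the metric in its own table, keyed by `simfin_id`, lets an agent answer peer comparisons with one join instead of recomputing the metric in every query.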

---

# 2. Initial Data Exploration

## 2.1 Installing Dependencies
Install the base libraries for CSV ingestion, SQL (SQLite), and the conversational agent (LangChain + OpenAI).

In [1]:
!pip -q install -U \
  "pandas==2.2.2" \
  "numpy==2.0.2" \
  pyarrow sqlalchemy tabulate \
  langchain langchain-openai langchain-community \
    openai tiktoken

## 2.2 Importing Libraries
Load the working libraries (pandas/numpy for DataFrames, pathlib for paths, IPython for display).

In [2]:
import os
import io
import pandas as pd
import numpy as np
import gc
import sqlite3
from sqlalchemy import create_engine
from pathlib import Path
from IPython.display import display
from datetime import datetime

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 140)

## 2.3 Loading the Original Dataset

Unzip us-shareprices-daily.zip into /content/data to obtain us-shareprices-daily.csv. Load the 6 CSV files (separator ;) from /content/data into DataFrames, confirm the dimensions of each file, and show the first rows of each DataFrame to validate that the load was correct.

In [3]:
# 2.3 Loading the original dataset (with ZIP support for prices)

from pathlib import Path
import zipfile
import pandas as pd

DATA_DIR = Path("/content/data")

# --- 2.3.1: Unzip prices if it arrives as a ZIP ---
zip_prices = DATA_DIR / "us-shareprices-daily.zip"
csv_prices = DATA_DIR / "us-shareprices-daily.csv"

if (not csv_prices.exists()) and zip_prices.exists():
    print("📦 Found us-shareprices-daily.zip and the CSV does not exist. Unzipping...")
    with zipfile.ZipFile(zip_prices, "r") as z:
        z.extractall(DATA_DIR)
    print("✓ ZIP extracted to:", DATA_DIR)

# Check that the prices CSV exists
if not csv_prices.exists():
    print("❌ ERROR: us-shareprices-daily.csv does not exist in /content/data")
    print("   - If you have the ZIP, make sure it is named: us-shareprices-daily.zip")
    print("   - Files present (top 20):", [p.name for p in sorted(DATA_DIR.glob("*"))[:20]])
    raise FileNotFoundError("Missing us-shareprices-daily.csv (or the ZIP did not extract correctly).")

print(f"✓ prices CSV ready: {csv_prices.name} | size_MB={csv_prices.stat().st_size/(1024**2):.2f}\n")

# --- 2.3.2: Load the 6 CSV files ---
files = {
    "industries": "industries.csv",
    "companies": "us-companies.csv",
    "balance_q": "us-balance-quarterly.csv",
    "income_q": "us-income-quarterly.csv",
    "cashflow_q": "us-cashflow-quarterly.csv",
    "prices_d": "us-shareprices-daily.csv",
}

dfs = {}
print("📚 Reading CSV files (sep=';')...\n")

for name, fname in files.items():
    file_path = DATA_DIR / fname
    try:
        df = pd.read_csv(file_path, sep=";", low_memory=False)
        dfs[name] = df
        print(f"✓ {fname} loaded successfully")
        print(f"  - {name}: {df.shape[0]:,} rows × {df.shape[1]} columns\n")
    except FileNotFoundError:
        print(f"❌ ERROR: File not found: {fname}")
        print(f"   Expected path: {file_path}")
        raise

# --- 2.3.3: Minimal sanity check of the large file (5 rows) ---
print("🔎 Quick sanity check of prices (5 rows):")
df_prices_test = pd.read_csv(csv_prices, sep=";", nrows=5, low_memory=False)
display(df_prices_test)
print("✓ prices columns:", df_prices_test.columns.tolist())

# --- 2.3.4: Visual validation of the loaded files ---
for name, df in dfs.items():
    print("\n" + "="*90)
    print(f"{name} | shape: {df.shape[0]:,} × {df.shape[1]}")
    display(df.head(5))


✓ prices CSV ready: us-shareprices-daily.csv | size_MB=413.49

📚 Reading CSV files (sep=';')...

✓ industries.csv loaded successfully
  - industries: 74 rows × 3 columns

✓ us-companies.csv loaded successfully
  - companies: 6,525 rows × 11 columns

✓ us-balance-quarterly.csv loaded successfully
  - balance_q: 52,098 rows × 30 columns

✓ us-income-quarterly.csv loaded successfully
  - income_q: 52,106 rows × 28 columns

✓ us-cashflow-quarterly.csv loaded successfully
  - cashflow_q: 52,103 rows × 28 columns

✓ us-shareprices-daily.csv loaded successfully
  - prices_d: 6,210,379 rows × 11 columns

🔎 Quick sanity check of prices (5 rows):


Unnamed: 0,Ticker,SimFinId,Date,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
0,A,45846,2020-03-30,71.06,73.18,71.06,72.67,69.86,1486203,0.18,309651359
1,A,45846,2020-03-31,72.34,72.8,70.5,71.62,68.85,1822122,,309651359
2,A,45846,2020-04-01,69.47,70.23,68.15,68.92,66.26,2173595,,309651359
3,A,45846,2020-04-02,68.27,72.45,68.14,72.29,69.5,1840311,,309651359
4,A,45846,2020-04-03,71.71,72.33,69.66,70.42,67.7,2052642,,309651359


✓ prices columns: ['Ticker', 'SimFinId', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj. Close', 'Volume', 'Dividend', 'Shares Outstanding']

industries | shape: 74 × 3


Unnamed: 0,IndustryId,Industry,Sector
0,100001,Industrial Products,Industrials
1,100002,Business Services,Industrials
2,100003,Engineering & Construction,Industrials
3,100004,Waste Management,Industrials
4,100005,Industrial Distribution,Industrials



companies | shape: 6,525 × 11


Unnamed: 0,Ticker,SimFinId,Company Name,IndustryId,ISIN,End of financial year (month),Number Employees,Business Summary,Market,CIK,Main Currency
0,,18692750,,,,,,,us,1997711.0,USD
1,,18847915,,,,,,,us,1769731.0,USD
2,,18538670,,,,,,,us,1734107.0,USD
3,,18657366,,,,,,,us,1899830.0,USD
4,,18667300,,,,,,,us,1178819.0,USD



balance_q | shape: 52,098 × 30


Unnamed: 0,Ticker,SimFinId,Currency,Fiscal Year,Fiscal Period,Report Date,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),"Cash, Cash Equivalents & Short Term Investments",Accounts & Notes Receivable,Inventories,Total Current Assets,"Property, Plant & Equipment, Net",Long Term Investments & Receivables,Other Long Term Assets,Total Noncurrent Assets,Total Assets,Payables & Accruals,Short Term Debt,Total Current Liabilities,Long Term Debt,Total Noncurrent Liabilities,Total Liabilities,Share Capital & Additional Paid-In Capital,Treasury Stock,Retained Earnings,Total Equity,Total Liabilities & Equity
0,A,45846,USD,2020,Q2,2020-04-30,2020-06-01,2020-06-01,309000000.0,312000000.0,1324000000.0,886000000.0,750000000.0,3171000000.0,836000000.0,141000000.0,5307000000.0,6284000000.0,9455000000,333000000.0,700000000.0,1945000000.0,1788000000.0,2742000000.0,4687000000.0,5291000000.0,,15000000.0,4768000000.0,9455000000
1,A,45846,USD,2020,Q3,2020-07-31,2020-09-01,2020-09-01,309000000.0,312000000.0,1358000000.0,930000000.0,746000000.0,3245000000.0,846000000.0,148000000.0,5307000000.0,6301000000.0,9546000000,311000000.0,40000000.0,1314000000.0,2283000000.0,3251000000.0,4565000000.0,5327000000.0,,130000000.0,4981000000.0,9546000000
2,A,45846,USD,2020,Q4,2020-10-31,2020-12-18,2021-12-17,308000000.0,311000000.0,1441000000.0,1038000000.0,720000000.0,3415000000.0,845000000.0,158000000.0,5209000000.0,6212000000.0,9627000000,639000000.0,75000000.0,1467000000.0,2284000000.0,3287000000.0,4754000000.0,5314000000.0,,81000000.0,4873000000.0,9627000000
3,A,45846,USD,2021,Q1,2021-01-31,2021-03-02,2021-03-02,306000000.0,309000000.0,1329000000.0,1087000000.0,755000000.0,3483000000.0,866000000.0,165000000.0,5160000000.0,6191000000.0,9674000000,656000000.0,314000000.0,1687000000.0,2185000000.0,3183000000.0,4870000000.0,5269000000.0,,4000000.0,4804000000.0,9674000000
4,A,45846,USD,2021,Q2,2021-04-30,2021-06-01,2021-06-01,306000000.0,306000000.0,1380000000.0,1075000000.0,791000000.0,3514000000.0,884000000.0,188000000.0,5812000000.0,6884000000.0,10398000000,738000000.0,205000000.0,1758000000.0,2727000000.0,3830000000.0,5588000000.0,5274000000.0,,-12000000.0,4810000000.0,10398000000



income_q | shape: 52,106 × 28


Unnamed: 0,Ticker,SimFinId,Currency,Fiscal Year,Fiscal Period,Report Date,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),Revenue,Cost of Revenue,Gross Profit,Operating Expenses,"Selling, General & Administrative",Research & Development,Depreciation & Amortization,Operating Income (Loss),Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
0,A,45846,USD,2020,Q2,2020-04-30,2020-06-01,2021-06-01,309000000.0,312000000.0,1238000000.0,-581000000.0,657000000.0,-555000000.0,-358000000.0,-197000000.0,,102000000.0,19000000.0,-17000000.0,121000000.0,,121000000,-20000000.0,101000000,,101000000,101000000
1,A,45846,USD,2020,Q3,2020-07-31,2020-09-01,2021-09-01,309000000.0,312000000.0,1261000000.0,-592000000.0,669000000.0,-439000000.0,-347000000.0,-92000000.0,,230000000.0,-11000000.0,-18000000.0,219000000.0,,219000000,-20000000.0,199000000,,199000000,199000000
2,A,45846,USD,2020,Q4,2020-10-31,2020-12-18,2021-09-01,308000000.0,311000000.0,1483000000.0,-695000000.0,788000000.0,-489000000.0,-387000000.0,-102000000.0,,299000000.0,-16000000.0,-18000000.0,283000000.0,,283000000,-61000000.0,222000000,,222000000,222000000
3,A,45846,USD,2021,Q1,2021-01-31,2021-03-02,2022-03-03,306000000.0,309000000.0,1548000000.0,-710000000.0,838000000.0,-510000000.0,-407000000.0,-103000000.0,,328000000.0,-16000000.0,-19000000.0,312000000.0,,312000000,-24000000.0,288000000,,288000000,288000000
4,A,45846,USD,2021,Q2,2021-04-30,2021-06-01,2022-05-31,306000000.0,306000000.0,1525000000.0,-708000000.0,817000000.0,-529000000.0,-420000000.0,-109000000.0,,288000000.0,-15000000.0,-19000000.0,273000000.0,,273000000,-57000000.0,216000000,,216000000,216000000



cashflow_q | shape: 52,103 × 28


Unnamed: 0,Ticker,SimFinId,Currency,Fiscal Year,Fiscal Period,Report Date,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),Net Income/Starting Line,Depreciation & Amortization,Non-Cash Items,Change in Working Capital,Change in Accounts Receivable,Change in Inventories,Change in Accounts Payable,Change in Other,Net Cash from Operating Activities,Change in Fixed Assets & Intangibles,Net Change in Long Term Investment,Net Cash from Acquisitions & Divestitures,Net Cash from Investing Activities,Dividends Paid,Cash from (Repayment of) Debt,Cash from (Repurchase of) Equity,Net Cash from Financing Activities,Net Change in Cash
0,A,45846,USD,2020,Q2,2020-04-30,2020-06-01,2021-03-02,309000000.0,312000000.0,101000000.0,76000000.0,98000000.0,38000000.0,65000000.0,-53000000.0,5000000.0,21000000.0,313000000.0,-33000000.0,,,-53000000.0,-55000000.0,25000000.0,-126000000.0,-156000000.0,97000000
1,A,45846,USD,2020,Q3,2020-07-31,2020-09-01,2021-06-01,309000000.0,312000000.0,199000000.0,77000000.0,34000000.0,-20000000.0,-24000000.0,-1000000.0,-25000000.0,30000000.0,290000000.0,-24000000.0,,,-32000000.0,-56000000.0,-161000000.0,-9000000.0,-231000000.0,35000000
2,A,45846,USD,2020,Q4,2020-10-31,2020-12-18,2021-09-01,308000000.0,311000000.0,222000000.0,76000000.0,61000000.0,18000000.0,,,,,377000000.0,-27000000.0,,,-27000000.0,-55000000.0,35000000.0,-246000000.0,-269000000.0,83000000
3,A,45846,USD,2021,Q1,2021-01-31,2021-03-02,2022-03-03,306000000.0,309000000.0,288000000.0,76000000.0,80000000.0,-206000000.0,-31000000.0,-35000000.0,43000000.0,-183000000.0,238000000.0,-41000000.0,,,-42000000.0,-59000000.0,134000000.0,-319000000.0,-316000000.0,-111000000
4,A,45846,USD,2021,Q2,2021-04-30,2021-06-01,2022-03-03,306000000.0,306000000.0,216000000.0,77000000.0,37000000.0,142000000.0,14000000.0,-45000000.0,8000000.0,165000000.0,472000000.0,-31000000.0,,-547000000.0,-587000000.0,-59000000.0,427000000.0,-194000000.0,166000000.0,51000000



prices_d | shape: 6,210,379 × 11


Unnamed: 0,Ticker,SimFinId,Date,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
0,A,45846,2020-03-30,71.06,73.18,71.06,72.67,69.86,1486203,0.18,309651359.0
1,A,45846,2020-03-31,72.34,72.8,70.5,71.62,68.85,1822122,,309651359.0
2,A,45846,2020-04-01,69.47,70.23,68.15,68.92,66.26,2173595,,309651359.0
3,A,45846,2020-04-02,68.27,72.45,68.14,72.29,69.5,1840311,,309651359.0
4,A,45846,2020-04-03,71.71,72.33,69.66,70.42,67.7,2052642,,309651359.0


## 2.4 Basic Statistical Exploration

Inspect data types, missing values, and basic descriptive statistics for each dataset as an initial check before cleaning.


In [4]:
# 2.4 Basic statistical exploration (complete, and robust for prices_d)

SEP_LINE = "\n" + "="*80 + "\n"
SAMPLE_N = 100_000   # sample size for large datasets (e.g. prices_d)
TOP_NULLS = 20       # top columns with the most nulls
TOP_EXAMPLES = 5     # top values per categorical column

for name, df in dfs.items():
    # View to analyze: prices_d is very large, so use a sample (avoids high RAM use and long runtimes)
    if name == "prices_d" and len(df) > SAMPLE_N:
        df_view = df.sample(SAMPLE_N, random_state=42)
        view_note = f"(view: random sample n={SAMPLE_N:,})"
    else:
        df_view = df
        view_note = "(view: full)"

    print("\n" + "#"*90)
    print(f"DATASET: {name} {view_note}")
    print(f"Original shape: {df.shape[0]:,} × {df.shape[1]}  |  View shape: {df_view.shape[0]:,} × {df_view.shape[1]}")
    print("#"*90)

    # 1) General information about the DataFrame
    print("Dataset information:")
    buf = io.StringIO()
    df_view.info(buf=buf)
    print(buf.getvalue())
    print(SEP_LINE)

    # 2) Statistical summary of numeric columns
    print("Statistical summary (numeric):")
    df_num = df_view.select_dtypes(include=[np.number])
    if df_num.shape[1] == 0:
        print("ℹ️ No numeric columns for describe().")
    else:
        display(df_num.describe().T)
    print(SEP_LINE)

    # 3) Check null values (count + % + top-N)
    print("Null values per column (top):")
    null_count = df_view.isnull().sum()
    null_count = null_count[null_count > 0].sort_values(ascending=False)

    if len(null_count) == 0:
        print("✓ No nulls")
    else:
        null_pct = (null_count / len(df_view) * 100).round(2)
        null_summary = pd.DataFrame({"null_count": null_count, "null_pct": null_pct}).head(TOP_NULLS)
        display(null_summary)
    print(SEP_LINE)

    # 4) For categorical columns: nunique + top-N example values
    print("Unique values in categorical columns:")
    obj_cols = df_view.select_dtypes(include=["object"]).columns.tolist()

    if len(obj_cols) == 0:
        print("ℹ️ No categorical (object) columns.")
    else:
        for col in obj_cols:
            nunq = df_view[col].nunique(dropna=True)
            print(f"\n{col}: {nunq:,} unique values")

            examples = df_view[col].value_counts(dropna=True).head(TOP_EXAMPLES).to_dict()
            print(f"Examples (top {TOP_EXAMPLES}): {examples}")


##########################################################################################
DATASET: industries (view: full)
Original shape: 74 × 3  |  View shape: 74 × 3
##########################################################################################
Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   IndustryId  74 non-null     int64 
 1   Industry    74 non-null     object
 2   Sector      74 non-null     object
dtypes: int64(1), object(2)
memory usage: 1.9+ KB



Statistical summary (numeric):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
IndustryId,74.0,104329.797297,3293.071327,100001.0,102001.25,104002.5,107002.75,111001.0




Null values per column (top):
✓ No nulls


Unique values in categorical columns:

Industry: 74 unique values
Examples (top 5): {'Industrial Products': 1, 'Business Services': 1, 'Engineering & Construction': 1, 'Waste Management': 1, 'Industrial Distribution': 1}

Sector: 12 unique values
Examples (top 5): {'Industrials': 13, 'Consumer Cyclical': 11, 'Healthcare': 8, 'Financial Services': 8, 'Energy': 7}

##########################################################################################
DATASET: companies (view: full)
Original shape: 6,525 × 11  |  View shape: 6,525 × 11
##########################################################################################
Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6525 entries, 0 to 6524
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Ticker                         6488 non-n

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SimFinId,6525.0,6991677.0,6783715.0,18.0,663113.0,6744552.0,12444309.0,19976457.0
IndustryId,6224.0,104050.9,2817.199,100001.0,101004.0,104002.0,106002.0,111001.0
End of financial year (month),6492.0,10.9846,2.656196,1.0,12.0,12.0,12.0,12.0
Number Employees,5700.0,7500.325,31340.1,0.0,136.0,880.5,3700.0,1298000.0
CIK,6513.0,1270512.0,528434.9,0.0,928054.0,1402436.0,1697862.0,2079173.0




Null values per column (top):


Unnamed: 0,null_count,null_pct
ISIN,1182,18.11
Number Employees,825,12.64
IndustryId,301,4.61
Business Summary,294,4.51
Ticker,37,0.57
Company Name,34,0.52
End of financial year (month),33,0.51
CIK,12,0.18




Unique values in categorical columns:

Ticker: 6,488 unique values
Examples (top 5): {'ZYXI': 1, 'A': 1, 'A21': 1, 'AA': 1, 'AAC': 1}

Company Name: 6,473 unique values
Examples (top 5): {'The Liberty Braves Group': 2, 'LifeMD, Inc.': 2, 'CS Disco, Inc.': 2, 'Nicolet Bankshares, Inc.': 2, 'CECO Environmental Corp.': 2}

ISIN: 5,340 unique values
Examples (top 5): {'US2941001024': 2, 'US44975P1030': 2, 'US9682232064': 2, 'US68619K2042': 1, 'US68621F1021': 1}

Business Summary: 6,207 unique values
Examples (top 5): {'Baker Hughes, a GE Co is a fullstream provider of integrated oilfield products, services, and digital solutions. The company offers the full spectrum of services to oil and gas companies, from upstream to downstream.': 2, 'GGP Inc is a self-administered and self-managed real estate investment trust. It is engaged in owning, managing, leasing, and redeveloping high-quality retail properties throughout the United States.': 2, 'ProFrac Holding Corp., a vertically 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SimFinId,52098.0,4461161.0,5225436.0,18.0,446361.0,1033570.0,10383340.0,19439000.0
Fiscal Year,52098.0,2022.069,1.40883,2019.0,2021.0,2022.0,2023.0,2025.0
Shares (Basic),51777.0,2202123000.0,222309000000.0,1.0,22741000.0,54759000.0,143640000.0,36691290000000.0
Shares (Diluted),51632.0,2034672000.0,170298200000.0,1.0,23369560.0,56058300.0,147100000.0,24463730000000.0
"Cash, Cash Equivalents & Short Term Investments",51909.0,1336755000.0,23328690000.0,0.0,32711000.0,142535000.0,458914000.0,2477360000000.0
Accounts & Notes Receivable,41367.0,790904800.0,3318918000.0,-6152378000.0,19105000.0,101271000.0,463156500.0,103771000000.0
Inventories,28362.0,868222400.0,2792455000.0,0.0,19714840.0,121084500.0,564082500.0,69229000000.0
Total Current Assets,52074.0,3144279000.0,26505310000.0,4.0,103768800.0,377470000.0,1415269000.0,2518935000000.0
"Property, Plant & Equipment, Net",48721.0,2434031000.0,11245550000.0,-4910000000.0,11382000.0,109923000.0,833723000.0,299543000000.0
Long Term Investments & Receivables,11975.0,2125613000.0,18283340000.0,-7236621000.0,13000000.0,67713000.0,345007600.0,829905200000.0




Null values per column (top):


Unnamed: 0,null_count,null_pct
Long Term Investments & Receivables,40123,77.01
Treasury Stock,34481,66.18
Short Term Debt,24366,46.77
Inventories,23736,45.56
Long Term Debt,14467,27.77
Accounts & Notes Receivable,10731,20.6
"Property, Plant & Equipment, Net",3377,6.48
Retained Earnings,2055,3.94
Other Long Term Assets,1725,3.31
Total Noncurrent Liabilities,1515,2.91




Unique values in categorical columns:

Ticker: 3,704 unique values
Examples (top 5): {'ADBE': 21, 'COST': 21, 'ENSG': 20, 'ENS': 20, 'ENR': 20}

Currency: 1 unique values
Examples (top 5): {'USD': 52098}

Fiscal Period: 4 unique values
Examples (top 5): {'Q1': 13128, 'Q2': 13063, 'Q3': 13035, 'Q4': 12872}

Report Date: 61 unique values
Examples (top 5): {'2022-03-31': 2547, '2022-06-30': 2543, '2023-03-31': 2532, '2022-09-30': 2523, '2023-06-30': 2502}

Publish Date: 1,336 unique values
Examples (top 5): {'2024-08-08': 393, '2024-11-07': 375, '2022-08-04': 370, '2020-11-05': 368, '2021-08-05': 366}

Restated Date: 1,423 unique values
Examples (top 5): {'2024-08-08': 404, '2024-11-07': 382, '2022-08-04': 373, '2021-08-05': 371, '2022-05-05': 365}

##########################################################################################
DATASET: income_q (view: full)
Original shape: 52,106 × 28  |  View shape: 52,106 × 28
#######################################

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SimFinId,52106.0,4459729.0,5224714.0,18.0,446361.0,1033570.0,10383342.0,19439000.0
Fiscal Year,52106.0,2022.068,1.408801,2019.0,2021.0,2022.0,2023.0,2025.0
Shares (Basic),51783.0,2201874000.0,222296100000.0,1.0,22741500.0,54756024.0,143605000.0,36691290000000.0
Shares (Diluted),51638.0,2034442000.0,170288300000.0,1.0,23372750.0,56056301.5,147092027.0,24463730000000.0
Revenue,46459.0,1638471000.0,10569360000.0,-901149000000.0,38374000.0,195766000.0,847068000.0,1065665000000.0
Cost of Revenue,40767.0,-1147415000.0,6260103000.0,-624570000000.0,-540900000.0,-113425000.0,-20542000.0,71826000000.0
Gross Profit,40780.0,660249100.0,3731707000.0,-24353000000.0,22236750.0,96268500.0,360498000.0,441095000000.0
Operating Expenses,52025.0,-473747900.0,24651980000.0,-5576217000000.0,-192996000.0,-58412972.0,-16904000.0,382296600000.0
"Selling, General & Administrative",49827.0,-321765500.0,25009710000.0,-5576217000000.0,-113253000.0,-30232000.0,-7418006.0,90895530000.0
Research & Development,25392.0,-97756810.0,2243379000.0,-83500000000.0,-38398250.0,-14049500.0,-4040000.0,237131400000.0




Null values per column (top):


Unnamed: 0,null_count,null_pct
Net Extraordinary Gains (Losses),48813,93.68
Depreciation & Amortization,31091,59.67
Research & Development,26714,51.27
Abnormal Gains (Losses),23809,45.69
Cost of Revenue,11339,21.76
Gross Profit,11326,21.74
"Income Tax (Expense) Benefit, Net",11200,21.49
"Interest Expense, Net",6771,12.99
Revenue,5647,10.84
"Selling, General & Administrative",2279,4.37




Unique values in categorical columns:

Ticker: 3,701 unique values
Examples (top 5): {'COST': 21, 'APOG': 21, 'ENTA': 20, 'ENSG': 20, 'ENS': 20}

Currency: 1 unique values
Examples (top 5): {'USD': 52106}

Fiscal Period: 4 unique values
Examples (top 5): {'Q1': 13131, 'Q2': 13066, 'Q3': 13036, 'Q4': 12873}

Report Date: 62 unique values
Examples (top 5): {'2022-03-31': 2547, '2022-06-30': 2542, '2023-03-31': 2532, '2022-09-30': 2523, '2023-06-30': 2502}

Publish Date: 1,376 unique values
Examples (top 5): {'2024-08-08': 392, '2024-11-07': 374, '2022-08-04': 368, '2020-11-05': 367, '2021-08-05': 366}

Restated Date: 1,296 unique values
Examples (top 5): {'2024-11-07': 715, '2025-11-06': 642, '2022-11-03': 556, '2024-11-12': 528, '2023-11-09': 524}

##########################################################################################
DATASET: cashflow_q (view: full)
Original shape: 52,103 × 28  |  View shape: 52,103 × 28
####################################

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SimFinId,52103.0,4460777.0,5225334.0,18.0,446361.0,1033570.0,10383340.0,19439000.0
Fiscal Year,52103.0,2022.068,1.40892,2019.0,2021.0,2022.0,2023.0,2025.0
Shares (Basic),51779.0,2202035000.0,222304700000.0,1.0,22739640.0,54736979.0,143605000.0,36691290000000.0
Shares (Diluted),51634.0,2034590000.0,170294900000.0,1.0,23364680.0,56054000.0,147092000.0,24463730000000.0
Net Income/Starting Line,51624.0,1727580.0,28090960000.0,-6289205000000.0,-13450250.0,900000.0,46292000.0,759225000000.0
Depreciation & Amortization,49476.0,84486540.0,838356800.0,-11674050000.0,1003000.0,8118500.0,38200000.0,161373000000.0
Non-Cash Items,51567.0,104561700.0,15294030000.0,-108493400000.0,287000.0,4221000.0,19943000.0,3459275000000.0
Change in Working Capital,51679.0,-21081250.0,1143478000.0,-100494800000.0,-15781500.0,-547945.0,6604500.0,107602300000.0
Change in Accounts Receivable,721.0,-64109320.0,1010596000.0,-9355000000.0,-89000000.0,-5000000.0,51200000.0,14037000000.0
Change in Inventories,493.0,-34761690.0,416549600.0,-3899000000.0,-77944000.0,-4392000.0,20000000.0,2622000000.0




Null values per column (top):


Unnamed: 0,null_count,null_pct
Change in Inventories,51610,99.05
Change in Accounts Payable,51493,98.83
Change in Accounts Receivable,51382,98.62
Change in Other,51158,98.19
Net Cash from Acquisitions & Divestitures,35258,67.67
Dividends Paid,33962,65.18
Net Change in Long Term Investment,33561,64.41
Cash from (Repayment of) Debt,13313,25.55
Cash from (Repurchase of) Equity,12131,23.28
Change in Fixed Assets & Intangibles,4421,8.49




Unique values in categorical columns:

Ticker: 3,704 unique values
Examples (top 5): {'APOG': 21, 'COST': 21, 'ENTA': 20, 'ENSG': 20, 'ENS': 20}

Currency: 1 unique values
Examples (top 5): {'USD': 52103}

Fiscal Period: 4 unique values
Examples (top 5): {'Q1': 13134, 'Q2': 13061, 'Q3': 13033, 'Q4': 12875}

Report Date: 66 unique values
Examples (top 5): {'2022-03-31': 2547, '2022-06-30': 2543, '2023-03-31': 2532, '2022-09-30': 2524, '2023-06-30': 2502}

Publish Date: 1,389 unique values
Examples (top 5): {'2024-08-08': 390, '2024-11-07': 375, '2021-08-05': 371, '2020-11-05': 371, '2022-08-04': 370}

Restated Date: 1,302 unique values
Examples (top 5): {'2024-05-09': 703, '2025-05-08': 653, '2023-05-04': 617, '2023-05-09': 611, '2022-05-05': 536}

##########################################################################################
DATASET: prices_d (view: random sample n=100,000)
Original shape: 6,210,379 × 11  |  View shape: 100,000 × 11
###############

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SimFinId,100000.0,6875920.0,6628685.0,18.0,682408.0,6767429.0,11819770.0,19937590.0
Open,100000.0,35942.03,1710728.0,0.0,7.4,19.47,52.07,100000000.0
High,100000.0,36484.31,1723215.0,0.0,7.6,19.88,53.0225,100000000.0
Low,100000.0,35555.76,1704572.0,0.0,7.19,19.075,51.2,100000000.0
Close,100000.0,35982.27,1713159.0,0.0,7.39,19.48,52.09,100000000.0
Adj. Close,100000.0,35979.91,1713159.0,0.0,6.92,18.1,48.92,100000000.0
Volume,100000.0,1621350.0,16254000.0,0.0,32946.25,239140.5,943464.5,3352070000.0
Dividend,596.0,0.5137081,1.447287,0.0,0.12,0.26,0.49,28.19
Shares Outstanding,91469.0,498173300000.0,56791200000000.0,0.0,17194000.0,50093590.0,141700000.0,6667887000000000.0




Null values per column (top):


Unnamed: 0,null_count,null_pct
Dividend,99404,99.4
Shares Outstanding,8531,8.53
Ticker,13,0.01




Unique values in categorical columns:

Ticker: 5,792 unique values
Examples (top 5): {'FTNT': 38, 'WRLD': 37, 'RVPH': 34, 'FBIO': 34, 'NMTC': 34}

Date: 1,237 unique values
Examples (top 5): {'2023-08-17': 111, '2023-09-20': 111, '2022-06-03': 111, '2022-01-05': 110, '2023-01-25': 110}


# 3. Data Preprocessing and Cleaning

## 3.1 Handling Missing Values
Apply per-dataset null-handling rules (critical keys and company universe) and count the rows removed.
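
Before the full cell, here is a minimal sketch of the same rules on toy data: normalize `Ticker`, drop companies without a valid one, flag missing industries, then restrict a fact table to the surviving company universe and drop rows missing a key. The frames and their values are made up; only the column names mirror the SimFin files.

```python
import pandas as pd

# Toy frames with made-up values.
companies = pd.DataFrame({
    "Ticker": ["A", "  ", None, "nan", "AA"],
    "SimFinId": [1, 2, 3, 4, 5],
    "IndustryId": [100001, None, 100002, 100003, None],
})
prices = pd.DataFrame({
    "SimFinId": [1, 1, 2, 5, 5],
    "Date": ["2024-01-02", None, "2024-01-02", "2024-01-02", "2024-01-03"],
    "Close": [10.0, 10.5, 3.0, 7.0, 7.1],
})

# Rule 1: strip whitespace and treat placeholder strings as missing tickers.
companies["Ticker"] = companies["Ticker"].astype("string").str.strip()
invalid = companies["Ticker"].isna() | companies["Ticker"].isin(
    ["", "nan", "None", "NaN", "null", "NULL"]
)
companies.loc[invalid, "Ticker"] = pd.NA
companies = companies.dropna(subset=["Ticker"]).copy()

# Rule 2: keep companies without an industry, but flag them.
companies["has_industry"] = companies["IndustryId"].notna().astype("int8")

# Rule 3: fact rows must belong to the company universe and have their keys.
listed_ids = set(companies["SimFinId"].tolist())
in_universe = prices[prices["SimFinId"].isin(listed_ids)]
prices_clean = in_universe.dropna(subset=["SimFinId", "Date"])

print(f"universe: {sorted(listed_ids)}")
print(f"prices: {len(prices)} -> {len(in_universe)} -> {len(prices_clean)} rows")
```

Recording the row count after each rule separately, as the cell below does, makes it possible to tell how much data was lost to the universe filter and how much to missing keys.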

In [5]:
# 3.1 Handling missing values (treatment + impact)

# Working copy of the DataFrames loaded in 2.x
# (expects a `dfs` dictionary holding the original tables)
if "dfs" not in globals() or not isinstance(dfs, dict):
    raise NameError("`dfs` is not in memory. Run 2.3 (dataset loading) first.")

dfs_31 = {name: df.copy() for name, df in dfs.items()}

# Audit log: one entry per cleaning step, with row counts before and after
impact_rows = []

def _impact(table_name, stage, before_rows, after_rows, rule):
    """Record how many rows a cleaning rule removed from a table."""
    impact_rows.append({
        "table": table_name,
        "stage": stage,
        "rows_before": int(before_rows),
        "rows_after": int(after_rows),
        "rows_removed": int(before_rows - after_rows),
        "rule": rule
    })

print("3.1 Manejo de valores nulos")

# 1) industries: IndustryId critico
before = len(dfs_31["industries"])
dfs_31["industries"] = dfs_31["industries"].dropna(subset=["IndustryId"])
_impact("industries", "nulos", before, len(dfs_31["industries"]), "dropna IndustryId")

# 2) companies: eliminar Ticker nulo/vacio y crear has_industry
comp = dfs_31["companies"].copy()
if "Ticker" not in comp.columns:
    raise KeyError("companies: falta columna 'Ticker'")
if "SimFinId" not in comp.columns:
    raise KeyError("companies: falta columna 'SimFinId'")

comp["Ticker"] = comp["Ticker"].astype("string").str.strip()
invalid_ticker = comp["Ticker"].isna() | comp["Ticker"].isin(["", "nan", "None", "NaN", "null", "NULL"])
comp.loc[invalid_ticker, "Ticker"] = pd.NA

before = len(comp)
comp = comp.dropna(subset=["Ticker"])
_impact("companies", "nulos", before, len(comp), "dropna Ticker (incluye vacios/string nan)")

comp["has_industry"] = comp["IndustryId"].notna().astype("int8") if "IndustryId" in comp.columns else 0
dfs_31["companies"] = comp
unknown_industry = int((comp["has_industry"] == 0).sum())

# Tradable universe: SimFinId of companies with a valid Ticker
listed_ids = set(
    pd.to_numeric(comp["SimFinId"], errors="coerce").dropna().astype("Int64").tolist()
)
print(f"Empresas en universo transable (Ticker valido): {len(listed_ids):,}")

# 3) quarterly facts: filter to the universe + critical keys
keys_q = ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date"]
for t in ["balance_q", "income_q", "cashflow_q"]:
    dfq = dfs_31[t].copy()

    # 3a) Tradable-universe filter
    before = len(dfq)
    sim = pd.to_numeric(dfq["SimFinId"], errors="coerce").astype("Int64")
    dfq = dfq[sim.isin(listed_ids)].copy()
    _impact(t, "universo", before, len(dfq), "filtrar SimFinId en universo transable")

    # 3b) Critical keys
    before = len(dfq)
    dfq = dfq.dropna(subset=keys_q)
    _impact(t, "nulos", before, len(dfq), f"dropna llaves trimestrales {keys_q}")

    dfs_31[t] = dfq

# 4) daily prices: filter to the universe + critical keys
prd = dfs_31["prices_d"].copy()

# 4a) Tradable-universe filter
before = len(prd)
sim = pd.to_numeric(prd["SimFinId"], errors="coerce").astype("Int64")
prd = prd[sim.isin(listed_ids)].copy()
_impact("prices_d", "universo", before, len(prd), "filtrar SimFinId en universo transable")

# 4b) Critical keys
before = len(prd)
prd = prd.dropna(subset=["SimFinId", "Date"])
_impact("prices_d", "nulos", before, len(prd), "dropna llaves diarias ['SimFinId', 'Date']")

dfs_31["prices_d"] = prd

# Compact impact summary
impact_df = pd.DataFrame(impact_rows).sort_values(["table", "stage"])
print("\nImpacto por tabla:")
print(impact_df.to_string(index=False))

print(f"\ncompanies sin IndustryId (has_industry=0): {unknown_industry:,}")

print("\nTop-5 nulos post-limpieza por tabla:")
for t in ["industries", "companies", "balance_q", "income_q", "cashflow_q", "prices_d"]:
    dft = dfs_31[t]
    null_counts = dft.isna().sum()
    null_counts = null_counts[null_counts > 0].sort_values(ascending=False).head(5)
    if null_counts.empty:
        print(f"[{t}] sin nulos")
    else:
        null_pct = (null_counts / len(dft) * 100).round(2)
        out = pd.DataFrame({"null_count": null_counts, "null_pct": null_pct})
        print(f"\n[{t}]")
        print(out.to_string())

# Output for 3.2
dfs_proc = dfs_31
print("\n3.1 OK -> dfs_proc listo para 3.2")


3.1 Manejo de valores nulos
Empresas en universo transable (Ticker valido): 6,488

Impacto por tabla:
     table    stage  rows_before  rows_after  rows_removed                                                                                   rule
 balance_q    nulos        52098       52098             0 dropna llaves trimestrales ['SimFinId', 'Fiscal Year', 'Fiscal Period', 'Report Date']
 balance_q universo        52098       52098             0                                                 filtrar SimFinId en universo transable
cashflow_q    nulos        52103       52103             0 dropna llaves trimestrales ['SimFinId', 'Fiscal Year', 'Fiscal Period', 'Report Date']
cashflow_q universo        52103       52103             0                                                 filtrar SimFinId en universo transable
 companies    nulos         6525        6488            37                                              dropna Ticker (incluye vacios/string nan)
  income_q    nulos   

In [6]:
# Validate the overlap between the IDs removed for a null Ticker and each table
comp_raw = dfs["companies"].copy()
comp_raw["Ticker"] = comp_raw["Ticker"].astype("string").str.strip()
invalid = comp_raw["Ticker"].isna() | comp_raw["Ticker"].isin(["", "nan", "None", "NaN", "null", "NULL"])

ids_all = set(pd.to_numeric(comp_raw["SimFinId"], errors="coerce").dropna().astype("Int64"))
ids_valid = set(pd.to_numeric(comp_raw.loc[~invalid, "SimFinId"], errors="coerce").dropna().astype("Int64"))
ids_removed = ids_all - ids_valid

print(f"IDs removidos del universo transable: {len(ids_removed)}")

for t in ["balance_q", "income_q", "cashflow_q", "prices_d"]:
    sim = pd.to_numeric(dfs[t]["SimFinId"], errors="coerce").dropna().astype("Int64")
    rows_to_remove = int(sim.isin(ids_removed).sum())
    ids_in_table = len(set(sim) & ids_removed)
    print(f"{t}: ids_en_tabla={ids_in_table}, filas_a_remover={rows_to_remove}")


IDs removidos del universo transable: 37
balance_q: ids_en_tabla=0, filas_a_remover=0
income_q: ids_en_tabla=0, filas_a_remover=0
cashflow_q: ids_en_tabla=0, filas_a_remover=0
prices_d: ids_en_tabla=1, filas_a_remover=662


# 3.2 Data Type Correction

Convert key columns to consistent types (dates, identifiers, periods, and numeric metrics) to enable reliable joins, robust calculations, and a clean load into SQLite.
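
The conversions described above can be sketched on a toy frame. This is an illustration under assumed inputs, not the notebook's tables: the values in `raw` are invented, and the point is only that `errors="coerce"` turns unparseable entries into missing values instead of raising.

```python
import pandas as pd

# Toy input where every column arrives as text (illustrative values).
raw = pd.DataFrame({
    "SimFinId": ["45846", "660121", "n/a"],
    "Report Date": ["2020-04-30", "2021-12-31", "not a date"],
    "Revenue": ["100.5", "", "300"],
})

typed = raw.copy()
# Identifiers: numeric parse, then nullable integer so missing IDs stay missing.
typed["SimFinId"] = pd.to_numeric(typed["SimFinId"], errors="coerce").astype("Int64")
# Dates: unparseable strings become NaT.
typed["Report Date"] = pd.to_datetime(typed["Report Date"], errors="coerce")
# Metrics: unparseable strings become NaN.
typed["Revenue"] = pd.to_numeric(typed["Revenue"], errors="coerce")

print(typed.dtypes)
print(typed.isna().sum())
```

The nullable `Int64` dtype is what lets an identifier column hold missing values without falling back to floats, which matters when the column is later used as a join key.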


In [7]:
# 3.2 Data type correction

if "dfs_proc" not in globals() or not isinstance(dfs_proc, dict):
    raise NameError("No existe `dfs_proc` en memoria. Ejecuta primero 3.1.")

dfs_proc2 = {name: df.copy() for name, df in dfs_proc.items()}

conv_rows = []

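# changed_rows counts rows whose string representation differs after conversion,
# so a pure dtype change (e.g. 12.0 -> 12) is counted even when the value is the same.
def _conv(table, col, target_type, changed_rows):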
    conv_rows.append({
        "table": table,
        "column": col,
        "target_type": target_type,
        "changed_rows": int(changed_rows)
    })

def _to_datetime(df, col, table):
    if col in df.columns:
        before = df[col].copy()
        df[col] = pd.to_datetime(df[col], errors="coerce")
        changed = (before.astype("string").fillna("<NA>") != df[col].astype("string").fillna("<NA>")).sum()
        _conv(table, col, "datetime64[ns]", changed)

def _to_int64_nullable(df, col, table):
    if col in df.columns:
        before = df[col].copy()
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
        changed = (before.astype("string").fillna("<NA>") != df[col].astype("string").fillna("<NA>")).sum()
        _conv(table, col, "Int64", changed)

def _to_numeric(df, col, table):
    if col in df.columns:
        before = df[col].copy()
        df[col] = pd.to_numeric(df[col], errors="coerce")
        changed = (before.astype("string").fillna("<NA>") != df[col].astype("string").fillna("<NA>")).sum()
        _conv(table, col, str(df[col].dtype), changed)

def _to_category(df, col, table):
    if col in df.columns:
        before = df[col].copy()
        df[col] = df[col].astype("string").astype("category")
        changed = (before.astype("string").fillna("<NA>") != df[col].astype("string").fillna("<NA>")).sum()
        _conv(table, col, "category", changed)

# 1) industries
ind = dfs_proc2["industries"].copy()
_to_int64_nullable(ind, "IndustryId", "industries")
for c in ["Industry", "Sector"]:
    _to_category(ind, c, "industries")
dfs_proc2["industries"] = ind

# 2) companies
comp = dfs_proc2["companies"].copy()
for c in ["SimFinId", "IndustryId"]:
    _to_int64_nullable(comp, c, "companies")
if "End of financial year (month)" in comp.columns:
    _to_int64_nullable(comp, "End of financial year (month)", "companies")
for c in ["Ticker", "Market", "Main Currency"]:
    _to_category(comp, c, "companies")
if "Number Employees" in comp.columns:
    _to_numeric(comp, "Number Employees", "companies")
dfs_proc2["companies"] = comp

# 3) quarterly tables (balance/income/cashflow)
quarterly_tables = ["balance_q", "income_q", "cashflow_q"]
for t in quarterly_tables:
    dfq = dfs_proc2[t].copy()

    # Keys and dates
    _to_int64_nullable(dfq, "SimFinId", t)
    _to_int64_nullable(dfq, "Fiscal Year", t)
    _to_category(dfq, "Fiscal Period", t)
    for c in ["Report Date", "Publish Date", "Restated Date"]:
        _to_datetime(dfq, c, t)

    # Numeric candidates: everything except identifiers/text and dates
    excluded = {
        "Ticker", "Currency", "Fiscal Period",
        "Report Date", "Publish Date", "Restated Date"
    }
    for c in dfq.columns:
        if c in excluded:
            continue
        if c in ["SimFinId", "Fiscal Year"]:
            continue
        if str(dfq[c].dtype).startswith("datetime"):
            continue
        _to_numeric(dfq, c, t)

    # Currency usually has low cardinality
    _to_category(dfq, "Currency", t)

    dfs_proc2[t] = dfq

# 4) daily prices
pr = dfs_proc2["prices_d"].copy()
_to_int64_nullable(pr, "SimFinId", "prices_d")
_to_datetime(pr, "Date", "prices_d")
_to_category(pr, "Ticker", "prices_d")
for c in ["Open", "High", "Low", "Close", "Adj. Close", "Volume", "Dividend", "Shares Outstanding"]:
    _to_numeric(pr, c, "prices_d")
dfs_proc2["prices_d"] = pr

# --- Compact verification ---
print("Tipos de datos después de conversión (resumen):")
dtype_summary = []
for t, df in dfs_proc2.items():
    dtype_summary.append({
        "table": t,
        "rows": len(df),
        "cols": df.shape[1],
        "datetime_cols": int((df.dtypes.astype(str).str.contains("datetime")).sum()),
        "category_cols": int((df.dtypes.astype(str) == "category").sum()),
        "int64_nullable_cols": int((df.dtypes.astype(str) == "Int64").sum()),
        "numeric_cols": int(df.select_dtypes(include=["number"]).shape[1])
    })
print(pd.DataFrame(dtype_summary).sort_values("table").to_string(index=False))

print("\nColumnas convertidas (top 25 por cambios):")
conv_df = pd.DataFrame(conv_rows).sort_values(["changed_rows", "table", "column"], ascending=[False, True, True])
print(conv_df.head(25).to_string(index=False))

print("\nDtypes clave por tabla:")
key_cols = {
    "companies": ["Ticker", "SimFinId", "IndustryId", "Main Currency", "has_industry"],
    "balance_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date", "Currency"],
    "income_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date", "Currency"],
    "cashflow_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date", "Currency"],
    "prices_d": ["Ticker", "SimFinId", "Date", "Adj. Close", "Shares Outstanding"],
    "industries": ["IndustryId", "Industry", "Sector"]
}
for t, cols in key_cols.items():
    if t not in dfs_proc2:
        continue
    dft = dfs_proc2[t]
    existing = [c for c in cols if c in dft.columns]
    pairs = [f"{c}: {dft[c].dtype}" for c in existing]
    print(f"[{t}] " + " | ".join(pairs))

# Output for 3.3
dfs_proc = dfs_proc2
print("\n3.2 OK -> dfs_proc listo para 3.3")


Tipos de datos después de conversión (resumen):
     table    rows  cols  datetime_cols  category_cols  int64_nullable_cols  numeric_cols
 balance_q   52098    30              3              2                    2            24
cashflow_q   52103    28              3              2                    2            22
 companies    6488    12              0              3                    3             6
  income_q   52106    28              3              2                    2            22
industries      74     3              0              2                    1             1
  prices_d 6209717    11              1              1                    1             9

Columnas convertidas (top 25 por cambios):
    table                                          column    target_type  changed_rows
companies                   End of financial year (month)          Int64          6488
companies                                      IndustryId          Int64          6223
balance_q      

# 3.3 Normalization and Standardization

Normalize and standardize categorical/text fields per table (Ticker, Currency, Fiscal Period, and descriptors), then verify the results with impact metrics and unique-value profiles to ensure consistency before duplicate removal.
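
A small sketch of this kind of normalization, applied to a fiscal-period column with invented values: trim and upper-case the text, strip non-alphanumeric characters, map a few aliases, and null out anything outside the accepted set. The helper name `normalize_fiscal_period` and the shortened alias table are assumptions made for this example.

```python
import pandas as pd

# Alias map and accepted values (a reduced, illustrative set).
ALIASES = {"1Q": "Q1", "2Q": "Q2", "3Q": "Q3", "4Q": "Q4"}
VALID = {"Q1", "Q2", "Q3", "Q4", "FY", "TTM"}

def normalize_fiscal_period(s: pd.Series) -> pd.Series:
    """Return the series with canonical period labels; unknown labels become NA."""
    x = s.astype("string").str.strip().str.upper()
    x = x.str.replace(r"[^A-Z0-9]", "", regex=True)
    x = x.replace(ALIASES)
    return x.where(x.isin(VALID), pd.NA)

periods = pd.Series([" q1 ", "2Q", "fy", "H1", None])
normalized = normalize_fiscal_period(periods)
print(normalized.tolist())
```

Values that cannot be mapped are set to NA rather than guessed, so a later null check on the column reveals how many labels fell outside the expected vocabulary.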

In [8]:
# 3.3 Normalization and standardization

if "dfs_proc" not in globals() or not isinstance(dfs_proc, dict):
    raise NameError("No existe `dfs_proc` en memoria. Ejecuta primero 3.2.")

dfs_proc3 = {name: df.copy() for name, df in dfs_proc.items()}

change_rows = []

def _register(table, col, rule, changed_rows):
    change_rows.append({
        "table": table,
        "column": col,
        "rule": rule,
        "changed_rows": int(changed_rows)
    })

def _normalize_text(s: pd.Series) -> pd.Series:
    # trim + collapse spaces + empty tokens to NA
    out = s.astype("string").str.strip().str.replace(r"\s+", " ", regex=True)
    out = out.replace({"": pd.NA, "nan": pd.NA, "NaN": pd.NA, "None": pd.NA, "NULL": pd.NA, "null": pd.NA})
    return out

def _count_changes(before: pd.Series, after: pd.Series) -> int:
    b = before.astype("string").fillna("<NA>")
    a = after.astype("string").fillna("<NA>")
    return int((b != a).sum())

def _normalize_ticker(df: pd.DataFrame, col: str, table: str):
    if col not in df.columns:
        return
    before = df[col].copy()
    x = _normalize_text(df[col]).str.upper().str.replace(r"\s+", "", regex=True)
    df[col] = x.astype("category") if str(df[col].dtype) == "category" else x
    _register(table, col, "upper + remove spaces + empty->NA", _count_changes(before, df[col]))

def _normalize_currency(df: pd.DataFrame, col: str, table: str):
    if col not in df.columns:
        return
    before = df[col].copy()
    x = _normalize_text(df[col]).str.upper().str.replace(r"[^A-Z]", "", regex=True)
    x = x.where(x.str.len().eq(3), pd.NA)  # keep ISO-like 3-letter codes
    df[col] = x.astype("category") if str(df[col].dtype) == "category" else x
    _register(table, col, "upper + letters only + len==3 else NA", _count_changes(before, df[col]))

def _normalize_fiscal_period(df: pd.DataFrame, col: str, table: str):
    if col not in df.columns:
        return
    before = df[col].copy()
    x = _normalize_text(df[col]).str.upper().str.replace(r"[^A-Z0-9]", "", regex=True)
    x = x.replace({
        "1Q": "Q1", "2Q": "Q2", "3Q": "Q3", "4Q": "Q4",
        "QUARTER1": "Q1", "QUARTER2": "Q2", "QUARTER3": "Q3", "QUARTER4": "Q4"
    })
    valid = {"Q1", "Q2", "Q3", "Q4", "FY", "TTM"}
    x = x.where(x.isin(valid), pd.NA)
    df[col] = x.astype("category") if str(df[col].dtype) == "category" else x
    _register(table, col, "map aliases + keep {Q1..Q4,FY,TTM}", _count_changes(before, df[col]))

def _normalize_plain_text(df: pd.DataFrame, col: str, table: str):
    if col not in df.columns:
        return
    before = df[col].copy()
    x = _normalize_text(df[col])
    df[col] = x.astype("category") if str(df[col].dtype) == "category" else x
    _register(table, col, "trim/collapse spaces/empty->NA", _count_changes(before, df[col]))

# 1) companies
comp = dfs_proc3["companies"].copy()
_normalize_ticker(comp, "Ticker", "companies")
_normalize_currency(comp, "Main Currency", "companies")
_normalize_plain_text(comp, "Company Name", "companies")
_normalize_plain_text(comp, "Market", "companies")
if "IndustryId" in comp.columns:
    comp["has_industry"] = comp["IndustryId"].notna().astype("int8")
    _register("companies", "has_industry", "1 if IndustryId not null else 0", 0)
dfs_proc3["companies"] = comp

# 2) industries
ind = dfs_proc3["industries"].copy()
_normalize_plain_text(ind, "Industry", "industries")
_normalize_plain_text(ind, "Sector", "industries")
dfs_proc3["industries"] = ind

# 3) quarterly tables
for t in ["balance_q", "income_q", "cashflow_q"]:
    dfq = dfs_proc3[t].copy()
    _normalize_ticker(dfq, "Ticker", t)
    _normalize_currency(dfq, "Currency", t)
    _normalize_fiscal_period(dfq, "Fiscal Period", t)
    dfs_proc3[t] = dfq

# 4) daily prices
pr = dfs_proc3["prices_d"].copy()
_normalize_ticker(pr, "Ticker", "prices_d")
if "Currency" in pr.columns:
    _normalize_currency(pr, "Currency", "prices_d")
dfs_proc3["prices_d"] = pr

# ---------- Verification 1: compact impact ----------
print("3.3 Normalización y Estandarización")
impact_df = pd.DataFrame(change_rows).sort_values(["table", "column"])
print("\nImpacto de cambios por columna:")
print(impact_df.to_string(index=False))

print("\nNulos clave post-3.3 (%):")
null_checks = {
    "companies": ["Ticker", "Main Currency", "IndustryId"],
    "balance_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date", "Currency"],
    "income_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date", "Currency"],
    "cashflow_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date", "Currency"],
    "prices_d": ["SimFinId", "Date", "Ticker"]
}
for t, cols in null_checks.items():
    if t not in dfs_proc3:
        continue
    df = dfs_proc3[t]
    parts = []
    for c in cols:
        if c in df.columns:
            pct = round(df[c].isna().mean() * 100, 2)
            parts.append(f"{c}={pct}%")
    print(f"[{t}] " + " | ".join(parts))

# ---------- Verification 2: template style (compact) ----------
print("\nValores únicos después de normalización:")
MAX_CAT_COLS_PER_TABLE = 6  # avoids excessive output
for t, df in dfs_proc3.items():
    cat_cols = df.select_dtypes(include=["object", "category", "string"]).columns.tolist()
    if not cat_cols:
        continue
    print(f"\n[{t}] columnas categóricas evaluadas: {min(len(cat_cols), MAX_CAT_COLS_PER_TABLE)}/{len(cat_cols)}")
    for col in cat_cols[:MAX_CAT_COLS_PER_TABLE]:
        nunq = df[col].nunique(dropna=True)
        print(f"  {col}: {nunq} valores únicos")
        print(df[col].value_counts(dropna=False).head(10).to_string())

# Output for 3.4
dfs_proc = dfs_proc3
print("\n3.3 OK -> dfs_proc listo para 3.4 (eliminación de duplicados)")


3.3 Normalización y Estandarización

Impacto de cambios por columna:
     table        column                                  rule  changed_rows
 balance_q      Currency upper + letters only + len==3 else NA             0
 balance_q Fiscal Period    map aliases + keep {Q1..Q4,FY,TTM}             0
 balance_q        Ticker     upper + remove spaces + empty->NA           112
cashflow_q      Currency upper + letters only + len==3 else NA             0
cashflow_q Fiscal Period    map aliases + keep {Q1..Q4,FY,TTM}             0
cashflow_q        Ticker     upper + remove spaces + empty->NA           112
 companies  Company Name        trim/collapse spaces/empty->NA             0
 companies Main Currency upper + letters only + len==3 else NA             0
 companies        Market        trim/collapse spaces/empty->NA             0
 companies        Ticker     upper + remove spaces + empty->NA            48
 companies  has_industry       1 if IndustryId not null else 0             0
  inc

# 3.4 Duplicate Removal

Identify and handle duplicate records according to each table's grain. For quarterly facts, keep the most recent version, prioritizing Restated Date and then Publish Date; for dimensions and prices, keep a single row per key.
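
The "keep the most recent version" rule can be sketched with a sort followed by `drop_duplicates(keep="last")`. The frame `toy` and its values are invented for illustration; only the column names follow the quarterly tables.

```python
import pandas as pd

keys = ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date"]

# Three versions of the same quarter for company 1, one row for company 2.
toy = pd.DataFrame({
    "SimFinId": [1, 1, 1, 2],
    "Fiscal Year": [2022, 2022, 2022, 2022],
    "Fiscal Period": ["Q1", "Q1", "Q1", "Q1"],
    "Report Date": pd.to_datetime(["2022-03-31"] * 4),
    "Publish Date": pd.to_datetime(["2022-05-01", "2022-05-01", "2022-05-10", "2022-05-02"]),
    "Restated Date": pd.to_datetime([None, "2023-02-01", None, None]),
    "Revenue": [100.0, 105.0, 101.0, 50.0],
})

# Within each key group: missing dates first, then ascending Restated Date,
# then ascending Publish Date, so the preferred version ends up last.
ordered = toy.sort_values(keys + ["Restated Date", "Publish Date"], na_position="first")
deduped = ordered.drop_duplicates(subset=keys, keep="last")

print(deduped[["SimFinId", "Restated Date", "Publish Date", "Revenue"]])
```

Sorting missing dates first means a restated row always outranks a never-restated one, and Publish Date only breaks ties among rows with the same Restated Date.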

In [9]:
# 3.4 Duplicate removal

if "dfs_proc" not in globals() or not isinstance(dfs_proc, dict):
    raise NameError("No existe `dfs_proc` en memoria. Ejecuta primero 3.3.")

dfs_proc4 = {name: df.copy() for name, df in dfs_proc.items()}

# Grain keys per table
grain_keys = {
    "industries": ["IndustryId"],
    "companies": ["SimFinId"],
    "balance_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date"],
    "income_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date"],
    "cashflow_q": ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date"],
    "prices_d": ["SimFinId", "Date"],
}

quarterly_tables = {"balance_q", "income_q", "cashflow_q"}
priority_dates = ["Restated Date", "Publish Date"]

summary_rows = []

for t, keys in grain_keys.items():
    if t not in dfs_proc4:
        continue

    df = dfs_proc4[t].copy()
    missing = [c for c in keys if c not in df.columns]
    if missing:
        raise KeyError(f"{t}: faltan columnas llave {missing}")

    rows_before = len(df)
    duplicated_groups_rows = int(df.duplicated(subset=keys, keep=False).sum())
    duplicated_excess_rows = int(df.duplicated(subset=keys, keep="first").sum())

    print(f"[{t}] Registros duplicados encontrados (exceso por llave): {duplicated_excess_rows:,}")

    if duplicated_excess_rows > 0:
        if t in quarterly_tables:
            # Priority: most recent version by Restated Date, then Publish Date
            sort_cols = keys + [c for c in priority_dates if c in df.columns]
            for c in priority_dates:
                if c in df.columns and not str(df[c].dtype).startswith("datetime"):
                    df[c] = pd.to_datetime(df[c], errors="coerce")
            df = df.sort_values(sort_cols, kind="mergesort", na_position="first")
            df = df.drop_duplicates(subset=keys, keep="last")
        else:
            # Dimensions/prices: keep one row per key
            df = df.drop_duplicates(subset=keys, keep="last")

    rows_after = len(df)
    print(f"[{t}] Total de registros después de limpieza: {rows_after:,}")

    summary_rows.append({
        "table": t,
        "rows_before": rows_before,
        "dup_rows_groups": duplicated_groups_rows,
        "dup_rows_excess": duplicated_excess_rows,
        "rows_after": rows_after,
        "rows_removed": rows_before - rows_after,
        "rule": "quarterly keep latest by Restated/Publish" if t in quarterly_tables else "keep one row per grain key"
    })

    dfs_proc4[t] = df

print("\nResumen 3.4 (compacto):")
summary_df = pd.DataFrame(summary_rows).sort_values("table")
print(summary_df.to_string(index=False))

# Output for 4.x
dfs_proc = dfs_proc4
print("\n3.4 OK -> dfs_proc listo para 4.x")


[industries] Registros duplicados encontrados (exceso por llave): 0
[industries] Total de registros después de limpieza: 74
[companies] Registros duplicados encontrados (exceso por llave): 0
[companies] Total de registros después de limpieza: 6,488
[balance_q] Registros duplicados encontrados (exceso por llave): 0
[balance_q] Total de registros después de limpieza: 52,098
[income_q] Registros duplicados encontrados (exceso por llave): 0
[income_q] Total de registros después de limpieza: 52,106
[cashflow_q] Registros duplicados encontrados (exceso por llave): 0
[cashflow_q] Total de registros después de limpieza: 52,103
[prices_d] Registros duplicados encontrados (exceso por llave): 0
[prices_d] Total de registros después de limpieza: 6,209,717

Resumen 3.4 (compacto):
     table  rows_before  dup_rows_groups  dup_rows_excess  rows_after  rows_removed                                      rule
 balance_q        52098                0                0       52098             0 quart

### 4.1 Derived Features - Feature 1 (CurrentRatio_Q)
Join the quarterly balance, income, and cashflow data into a single table by company and quarter, and compute `CurrentRatio_Q` (current assets / current liabilities). If any required input is missing or current liabilities are not positive, the result is left null (no values are filled in).
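
The integration and the calculability rule can be sketched together on toy frames. The balance figures for SimFinId 45846 and 660121 are the ones shown in this section's outputs; the remaining rows, the `Revenue` values, and the reduced three-column key are invented for illustration.

```python
import numpy as np
import pandas as pd

keys = ["SimFinId", "Fiscal Year", "Fiscal Period"]  # reduced key for the sketch

balance = pd.DataFrame({
    "SimFinId": [45846, 660121, 999],
    "Fiscal Year": [2020, 2022, 2021],
    "Fiscal Period": ["Q2", "Q1", "Q4"],
    "Total Current Assets": [3171000000.0, 3948687.0, 500.0],
    "Total Current Liabilities": [1945000000.0, -23656361.0, np.nan],
})
income = pd.DataFrame({
    "SimFinId": [45846, 777],
    "Fiscal Year": [2020, 2021],
    "Fiscal Period": ["Q2", "Q3"],
    "Revenue": [1.0e9, 2.0e6],  # illustrative values
})

# Union of keys from both statements, then left-merge each statement onto it.
base = pd.concat([balance[keys], income[keys]], ignore_index=True).drop_duplicates()
merged = base.merge(balance, on=keys, how="left").merge(income, on=keys, how="left")

# Ratio only when both inputs exist and the denominator is positive; otherwise NaN.
tca = merged["Total Current Assets"]
tcl = merged["Total Current Liabilities"]
merged["CurrentRatio_Q"] = (tca / tcl).where(tca.notna() & tcl.notna() & (tcl > 0))

print(merged[keys + ["CurrentRatio_Q"]])
```

Only the first row yields a ratio. Negative liabilities, missing liabilities, and a quarter with no balance row all stay null instead of producing a misleading number.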


In [10]:
# 4.1 Derived Features - Feature 1 (CurrentRatio_Q)

if "dfs_proc" not in globals() or not isinstance(dfs_proc, dict):
    raise NameError("No existe `dfs_proc` en memoria. Ejecuta primero 3.4.")

# Quarterly integration keys
keys_q = ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date"]

# Minimal validation of required tables
for t in ["companies", "balance_q", "income_q", "cashflow_q"]:
    if t not in dfs_proc:
        raise KeyError(f"Falta tabla requerida: {t}")

# Minimum columns for the 4.x integration
bal_required = ["Total Current Assets", "Total Current Liabilities"]
cf_required = ["Net Cash from Operating Activities", "Change in Fixed Assets & Intangibles"]
for c in bal_required:
    if c not in dfs_proc["balance_q"].columns:
        raise KeyError(f"Falta columna requerida en balance_q: {c}")
for c in cf_required:
    if c not in dfs_proc["cashflow_q"].columns:
        raise KeyError(f"Falta columna requerida en cashflow_q: {c}")
for c in ["SimFinId", "Ticker", "Main Currency"]:
    if c not in dfs_proc["companies"].columns:
        raise KeyError(f"Falta columna requerida en companies: {c}")

# Column selection
bal_cols = keys_q + [c for c in ["Ticker", "Currency", "Total Current Assets", "Total Current Liabilities"] if c in dfs_proc["balance_q"].columns]
inc_cols = keys_q + [c for c in ["Revenue", "Operating Income (Loss)", "Net Income (Common)"] if c in dfs_proc["income_q"].columns]
cf_cols = keys_q + [c for c in ["Net Cash from Operating Activities", "Change in Fixed Assets & Intangibles"] if c in dfs_proc["cashflow_q"].columns]

bal_41 = dfs_proc["balance_q"][bal_cols].copy()
inc_41 = dfs_proc["income_q"][inc_cols].copy()
cf_41 = dfs_proc["cashflow_q"][cf_cols].copy()

# Integrated quarterly base (union of keys from the 3 tables)
base_keys = (
    pd.concat([bal_41[keys_q], inc_41[keys_q], cf_41[keys_q]], ignore_index=True)
    .drop_duplicates()
)

# Key-based integration (logical outer join through base_keys)
df_procesado = (
    base_keys
    .merge(bal_41.assign(_from_balance=1), on=keys_q, how="left")
    .merge(inc_41.assign(_from_income=1), on=keys_q, how="left")
    .merge(cf_41.assign(_from_cashflow=1), on=keys_q, how="left")
)

for flag in ["_from_balance", "_from_income", "_from_cashflow"]:
    df_procesado[flag] = df_procesado[flag].fillna(0).astype("int8")

# Fill Ticker and Currency from companies (master table)
master = dfs_proc["companies"][["SimFinId", "Ticker", "Main Currency"]].copy()
master["Ticker"] = master["Ticker"].astype("string").str.strip()
master["Main Currency"] = master["Main Currency"].astype("string").str.strip().str.upper()

invalid_ticker = master["Ticker"].isna() | master["Ticker"].isin(["", "nan", "None", "NaN", "null", "NULL"])
master.loc[invalid_ticker, "Ticker"] = pd.NA

invalid_currency = master["Main Currency"].isna() | master["Main Currency"].isin(["", "nan", "None", "NaN", "null", "NULL"])
master.loc[invalid_currency, "Main Currency"] = pd.NA
master["Main Currency"] = master["Main Currency"].str.replace(r"[^A-Z]", "", regex=True)
master["Main Currency"] = master["Main Currency"].where(master["Main Currency"].str.len().eq(3), pd.NA)

master = master.dropna(subset=["SimFinId"]).drop_duplicates(subset=["SimFinId"])
master = master.rename(columns={"Ticker": "Ticker_master", "Main Currency": "Currency_master"})

df_procesado = df_procesado.merge(master, on="SimFinId", how="left")

if "Ticker" in df_procesado.columns:
    df_procesado["Ticker"] = df_procesado["Ticker"].fillna(df_procesado["Ticker_master"])
else:
    df_procesado["Ticker"] = df_procesado["Ticker_master"]

if "Currency" in df_procesado.columns:
    df_procesado["Currency"] = df_procesado["Currency"].fillna(df_procesado["Currency_master"])
else:
    df_procesado["Currency"] = df_procesado["Currency_master"]

df_procesado.drop(columns=["Ticker_master", "Currency_master"], inplace=True)

# Strict universe: remove residual rows without a Ticker
# (Currency is filled for integrity but is not used as a hard filter)
ticker_na_before = int(df_procesado["Ticker"].isna().sum())
if ticker_na_before > 0:
    before = len(df_procesado)
    df_procesado = df_procesado[df_procesado["Ticker"].notna()].copy()
    removed = before - len(df_procesado)
else:
    removed = 0

currency_na_final = int(df_procesado["Currency"].isna().sum())

# Feature 1: CurrentRatio_Q = Total Current Assets / Total Current Liabilities
# Calculability rule: only if both components exist and the denominator is > 0
tca = pd.to_numeric(df_procesado["Total Current Assets"], errors="coerce")
tcl = pd.to_numeric(df_procesado["Total Current Liabilities"], errors="coerce")
df_procesado["CurrentRatio_Q"] = np.where(
    tca.notna() & tcl.notna() & (tcl > 0),
    tca / tcl,
    np.nan
)

# Compact verification
print("4.1 Feature 1 - CurrentRatio_Q")
print(f"Filas del df integrado trimestral: {len(df_procesado):,}")
print(
    "Cobertura fuentes: "
    f"balance={int(df_procesado['_from_balance'].sum()):,}, "
    f"income={int(df_procesado['_from_income'].sum()):,}, "
    f"cashflow={int(df_procesado['_from_cashflow'].sum()):,}"
)
print(f"Ticker nulo detectado antes de filtro estricto: {ticker_na_before:,}")
print(f"Filas removidas por Ticker nulo residual: {removed:,}")
print(f"Ticker nulo final: {int(df_procesado['Ticker'].isna().sum()):,}")
print(f"Currency nulo final (post maestro): {currency_na_final:,}")

calc_ok = int(df_procesado["CurrentRatio_Q"].notna().sum())
calc_pct = (calc_ok / len(df_procesado) * 100) if len(df_procesado) > 0 else 0
print(f"CurrentRatio_Q calculable: {calc_ok:,} / {len(df_procesado):,} ({calc_pct:.2f}%)")
print(f"Casos con pasivo corriente <= 0: {int((tcl <= 0).sum()):,}")

print("\nEstadísticas de la nueva feature:")
print(df_procesado["CurrentRatio_Q"].describe())

# Sneak peek: start, middle, and end
preview_cols = [c for c in [
    "SimFinId", "Fiscal Year", "Fiscal Period", "Report Date",
    "Ticker", "Currency", "Total Current Assets", "Total Current Liabilities", "CurrentRatio_Q"
] if c in df_procesado.columns]

n = len(df_procesado)
mid_start = max((n // 2) - 1, 0)

peek_head = df_procesado[preview_cols].head(3).copy()
peek_head["segment"] = "inicio"
peek_mid = df_procesado[preview_cols].iloc[mid_start:mid_start + 3].copy()
peek_mid["segment"] = "medio"
peek_tail = df_procesado[preview_cols].tail(3).copy()
peek_tail["segment"] = "final"

peek = pd.concat([peek_head, peek_mid, peek_tail], ignore_index=True)
print("\nSneak peek del DataFrame integrado (3 inicio + 3 medio + 3 final):")
print(peek.to_string(index=False))


4.1 Feature 1 - CurrentRatio_Q
Filas del df integrado trimestral: 52,228
Cobertura fuentes: balance=52,098, income=52,106, cashflow=52,103
Ticker nulo detectado antes de filtro estricto: 0
Filas removidas por Ticker nulo residual: 0
Ticker nulo final: 0
Currency nulo final (post maestro): 0
CurrentRatio_Q calculable: 52,027 / 52,228 (99.62%)
Casos con pasivo corriente <= 0: 3

Estadísticas de la nueva feature:
count    5.202700e+04
mean     1.929931e+01
std      2.265169e+03
min      8.905806e-07
25%      1.199766e+00
50%      2.036414e+00
75%      4.073680e+00
max      5.022542e+05
Name: CurrentRatio_Q, dtype: float64

Sneak peek del DataFrame integrado (3 inicio + 3 medio + 3 final):
 SimFinId  Fiscal Year Fiscal Period Report Date Ticker Currency  Total Current Assets  Total Current Liabilities  CurrentRatio_Q segment
    45846         2020            Q2  2020-04-30      A      USD          3171000000.0               1945000000.0        1.630334  inicio
    45846         2020      

In [11]:
# External validation: problematic CurrentRatio_Q cases
if "df_procesado" not in globals():
    raise NameError("No existe `df_procesado`. Ejecuta primero 4.1.")

tca = pd.to_numeric(df_procesado["Total Current Assets"], errors="coerce")
tcl = pd.to_numeric(df_procesado["Total Current Liabilities"], errors="coerce")
cr  = pd.to_numeric(df_procesado["CurrentRatio_Q"], errors="coerce")

mask_ratio_le0 = cr.notna() & (cr <= 0)
mask_tcl_le0   = tcl.notna() & (tcl <= 0)

casos = df_procesado[mask_ratio_le0 | mask_tcl_le0].copy()

cols = [c for c in [
    "SimFinId", "Ticker", "Fiscal Year", "Fiscal Period", "Report Date",
    "Total Current Assets", "Total Current Liabilities", "CurrentRatio_Q"
] if c in casos.columns]

print(f"Casos con CurrentRatio_Q <= 0: {int(mask_ratio_le0.sum()):,}")
print(f"Casos con Total Current Liabilities <= 0: {int(mask_tcl_le0.sum()):,}")
print(f"Total casos a revisar (unión): {len(casos):,}\n")

if len(casos) == 0:
    print("No se encontraron casos.")
else:
    print(casos[cols].sort_values(["SimFinId", "Fiscal Year", "Fiscal Period"]).to_string(index=False))


Casos con CurrentRatio_Q <= 0: 0
Casos con Total Current Liabilities <= 0: 3
Total casos a revisar (unión): 3

 SimFinId Ticker  Fiscal Year Fiscal Period Report Date  Total Current Assets  Total Current Liabilities  CurrentRatio_Q
   660121   CYCA         2022            Q1  2021-12-31             3948687.0                -23656361.0             NaN
   660121   CYCA         2022            Q2  2022-03-31             2552229.0                -25186798.0             NaN
   660121   CYCA         2022            Q3  2022-06-30             1715762.0                -26163568.0             NaN


### 4.2 Derived Features - Feature 2 (DebtToEquity_Q)
Compute `DebtToEquity_Q` as a proxy for quarterly leverage: total financial debt divided by total equity, subject to the calculability rule (no imputation, and the denominator must be positive).

In [12]:
# 4.2 Derived Features - Feature 2 (DebtToEquity_Q)

if "dfs_proc" not in globals() or not isinstance(dfs_proc, dict):
    raise NameError("No existe `dfs_proc` en memoria. Ejecuta primero 3.4.")
if "df_procesado" not in globals() or not isinstance(df_procesado, pd.DataFrame):
    raise NameError("No existe `df_procesado`. Ejecuta primero 4.1.")

# Quarterly keys (must exist in the consolidated DataFrame)
keys_q = ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date"]
missing_keys = [k for k in keys_q if k not in df_procesado.columns]
if missing_keys:
    raise KeyError(f"Faltan llaves en df_procesado: {missing_keys}")

# Columns required from the balance sheet
needed = ["Short Term Debt", "Long Term Debt", "Total Equity"]
for c in needed:
    if c not in dfs_proc["balance_q"].columns:
        raise KeyError(f"Falta columna requerida en balance_q: {c}")

# Avoid duplicated columns if the cell is run more than once
for c in ["Short Term Debt", "Long Term Debt", "Total Equity", "TotalDebt_Q", "DebtToEquity_Q"]:
    if c in df_procesado.columns:
        df_procesado = df_procesado.drop(columns=[c])

# Bring the balance-sheet columns into the consolidated DataFrame (4.x)
bal_42 = dfs_proc["balance_q"][keys_q + needed].copy()

df_procesado = df_procesado.merge(
    bal_42,
    on=keys_q,
    how="left"
)

# Feature computation
std = pd.to_numeric(df_procesado["Short Term Debt"], errors="coerce")
ltd = pd.to_numeric(df_procesado["Long Term Debt"], errors="coerce")
te = pd.to_numeric(df_procesado["Total Equity"], errors="coerce")

# Agreed conservative rule: TotalDebt only when both debt components exist
df_procesado["TotalDebt_Q"] = np.where(std.notna() & ltd.notna(), std + ltd, np.nan)

# DebtToEquity_Q only when TotalDebt exists and Equity > 0
df_procesado["DebtToEquity_Q"] = np.where(
    df_procesado["TotalDebt_Q"].notna() & te.notna() & (te > 0),
    df_procesado["TotalDebt_Q"] / te,
    np.nan
)

# Compact verification
print("4.2 Feature 2 - DebtToEquity_Q")
calc_ok = int(df_procesado["DebtToEquity_Q"].notna().sum())
rows = len(df_procesado)
calc_pct = (calc_ok / rows * 100) if rows > 0 else 0
print(f"DebtToEquity_Q calculable: {calc_ok:,} / {rows:,} ({calc_pct:.2f}%)")
print(f"Casos con Total Equity <= 0: {int((te <= 0).sum()):,}")
print(f"Nulos en Short Term Debt: {int(std.isna().sum()):,}")
print(f"Nulos en Long Term Debt: {int(ltd.isna().sum()):,}")

print("\nEstadísticas de la nueva feature:")
print(df_procesado["DebtToEquity_Q"].describe())

# Sneak peek: start, middle and end (9 rows)
preview_cols = [c for c in [
    "SimFinId", "Fiscal Year", "Fiscal Period", "Report Date",
    "Ticker", "Currency", "TotalDebt_Q", "Total Equity", "DebtToEquity_Q"
] if c in df_procesado.columns]

n = len(df_procesado)
mid_start = max((n // 2) - 1, 0)

peek_head = df_procesado[preview_cols].head(3).copy()
peek_head["segment"] = "inicio"
peek_mid = df_procesado[preview_cols].iloc[mid_start:mid_start + 3].copy()
peek_mid["segment"] = "medio"
peek_tail = df_procesado[preview_cols].tail(3).copy()
peek_tail["segment"] = "final"

peek = pd.concat([peek_head, peek_mid, peek_tail], ignore_index=True)
print("\nSneak peek de la feature 4.2 (3 inicio + 3 medio + 3 final):")
print(peek.to_string(index=False))


4.2 Feature 2 - DebtToEquity_Q
DebtToEquity_Q calculable: 23,006 / 52,228 (44.05%)
Casos con Total Equity <= 0: 4,241
Nulos en Short Term Debt: 24,496
Nulos en Long Term Debt: 14,597

Estadísticas de la nueva feature:
count    23006.000000
mean         2.238020
std         24.146008
min         -0.472317
25%          0.335318
50%          0.715573
75%          1.443597
max       2265.891720
Name: DebtToEquity_Q, dtype: float64

Sneak peek de la feature 4.2 (3 inicio + 3 medio + 3 final):
 SimFinId  Fiscal Year Fiscal Period Report Date Ticker Currency  TotalDebt_Q  Total Equity  DebtToEquity_Q segment
    45846         2020            Q2  2020-04-30      A      USD 2488000000.0  4768000000.0        0.521812  inicio
    45846         2020            Q3  2020-07-31      A      USD 2323000000.0  4981000000.0        0.466372  inicio
    45846         2020            Q4  2020-10-31      A      USD 2359000000.0  4873000000.0        0.484096  inicio
  1841448         2023            Q2  2023
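
Because the rule in 4.2 only keeps rows with `Total Equity > 0`, the negative minimum in the statistics above (-0.472317) can only come from rows where `TotalDebt_Q` itself is negative. A minimal sketch for isolating such rows is shown below. It uses hypothetical toy data (the tickers and values are illustrative and do not come from the dataset), and the name `suspect` is an assumption introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: BBB has negative debt, CCC has negative equity.
toy = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC"],
    "TotalDebt_Q": [50.0, -20.0, 30.0],
    "Total Equity": [100.0, 40.0, -10.0],
})

# Same calculability rule as in 4.2: debt present and equity strictly positive.
te = toy["Total Equity"]
toy["DebtToEquity_Q"] = np.where(
    toy["TotalDebt_Q"].notna() & te.notna() & (te > 0),
    toy["TotalDebt_Q"] / te,
    np.nan,
)

# Rows that pass the equity rule but still yield a negative ratio.
suspect = toy[toy["DebtToEquity_Q"] < 0]
print(suspect.to_string(index=False))
```

Applied to `df_procesado`, the same filter would list the company-quarters worth reviewing before the ratio is used for ranking or bucketing.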

### 4.3 Derived Features - Feature 3 (FCF_Q)
Compute `FCF_Q` as quarterly free cash flow, `CFO_Q - CapexProxy_Q`, where `CapexProxy_Q = -Change in Fixed Assets & Intangibles`, subject to the calculability rule (no imputation).
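
As a quick check of the sign convention, the sketch below reproduces the arithmetic for a single row. The input values are the ones printed in this notebook for ticker A, fiscal year 2020, Q2; the variable names are illustrative only.

```python
# Values printed in this notebook for ticker A, FY2020 Q2.
cfo_q = 313_000_000.0                # Net Cash from Operating Activities
change_fixed_assets = -33_000_000.0  # Change in Fixed Assets & Intangibles (cash outflow)

# A negative change (an outflow) becomes a positive capex proxy.
capex_proxy_q = -change_fixed_assets

# Free cash flow is operating cash flow minus the capex proxy.
fcf_q = cfo_q - capex_proxy_q

print(f"CapexProxy_Q = {capex_proxy_q:,.0f}")
print(f"FCF_Q        = {fcf_q:,.0f}")
```

The result matches the `CapexProxy_Q` and `FCF_Q` values shown for that row in the cell outputs.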


In [13]:
# 4.3 Derived Features - Feature 3 (FCF_Q)

if "dfs_proc" not in globals() or not isinstance(dfs_proc, dict):
    raise NameError("No existe `dfs_proc` en memoria. Ejecuta primero 3.4.")
if "df_procesado" not in globals() or not isinstance(df_procesado, pd.DataFrame):
    raise NameError("No existe `df_procesado`. Ejecuta primero 4.1/4.2.")

keys_q = ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date"]
missing_keys = [k for k in keys_q if k not in df_procesado.columns]
if missing_keys:
    raise KeyError(f"Faltan llaves en df_procesado: {missing_keys}")

required_cf = ["Net Cash from Operating Activities", "Change in Fixed Assets & Intangibles"]
for c in required_cf:
    if c not in dfs_proc["cashflow_q"].columns:
        raise KeyError(f"Falta columna requerida en cashflow_q: {c}")

# If source columns are missing from df_procesado, bring them in from cashflow_q
missing_in_df = [c for c in required_cf if c not in df_procesado.columns]
if missing_in_df:
    cf_src = dfs_proc["cashflow_q"][keys_q + missing_in_df].copy()
    df_procesado = df_procesado.merge(cf_src, on=keys_q, how="left")

# Avoid accumulating columns on re-run
for c in ["CFO_Q", "CapexProxy_Q", "FCF_Q"]:
    if c in df_procesado.columns:
        df_procesado = df_procesado.drop(columns=[c])

cfo = pd.to_numeric(df_procesado["Net Cash from Operating Activities"], errors="coerce")
chg_fixed = pd.to_numeric(df_procesado["Change in Fixed Assets & Intangibles"], errors="coerce")

df_procesado["CFO_Q"] = cfo
# Agreed rule: CapexProxy_Q = -Change in Fixed Assets & Intangibles
df_procesado["CapexProxy_Q"] = np.where(chg_fixed.notna(), -chg_fixed, np.nan)

df_procesado["FCF_Q"] = np.where(
    df_procesado["CFO_Q"].notna() & df_procesado["CapexProxy_Q"].notna(),
    df_procesado["CFO_Q"] - df_procesado["CapexProxy_Q"],
    np.nan
)

# Compact verification
print("4.3 Feature 3 - FCF_Q")
rows = len(df_procesado)
calc_ok = int(df_procesado["FCF_Q"].notna().sum())
calc_pct = (calc_ok / rows * 100) if rows > 0 else 0
print(f"FCF_Q calculable: {calc_ok:,} / {rows:,} ({calc_pct:.2f}%)")
print(f"Nulos en CFO_Q: {int(df_procesado['CFO_Q'].isna().sum()):,}")
print(f"Nulos en CapexProxy_Q: {int(df_procesado['CapexProxy_Q'].isna().sum()):,}")

print("\nEstadísticas de la nueva feature:")
print(df_procesado["FCF_Q"].describe())

# Sneak peek: start, middle and end (9 rows)
preview_cols = [c for c in [
    "SimFinId", "Fiscal Year", "Fiscal Period", "Report Date",
    "Ticker", "Currency", "CFO_Q", "CapexProxy_Q", "FCF_Q"
] if c in df_procesado.columns]

n = len(df_procesado)
mid_start = max((n // 2) - 1, 0)

peek_head = df_procesado[preview_cols].head(3).copy()
peek_head["segment"] = "inicio"
peek_mid = df_procesado[preview_cols].iloc[mid_start:mid_start + 3].copy()
peek_mid["segment"] = "medio"
peek_tail = df_procesado[preview_cols].tail(3).copy()
peek_tail["segment"] = "final"

peek = pd.concat([peek_head, peek_mid, peek_tail], ignore_index=True)
print("\nSneak peek de la feature 4.3 (3 inicio + 3 medio + 3 final):")
print(peek.to_string(index=False))


4.3 Feature 3 - FCF_Q
FCF_Q calculable: 47,681 / 52,228 (91.29%)
Nulos en CFO_Q: 126
Nulos en CapexProxy_Q: 4,546

Estadísticas de la nueva feature:
count    4.768100e+04
mean     9.693454e+07
std      1.364900e+10
min     -2.776736e+12
25%     -1.311800e+07
50%      3.752000e+06
75%      6.900000e+07
max      7.785530e+11
Name: FCF_Q, dtype: float64

Sneak peek de la feature 4.3 (3 inicio + 3 medio + 3 final):
 SimFinId  Fiscal Year Fiscal Period Report Date Ticker Currency       CFO_Q  CapexProxy_Q       FCF_Q segment
    45846         2020            Q2  2020-04-30      A      USD 313000000.0    33000000.0 280000000.0  inicio
    45846         2020            Q3  2020-07-31      A      USD 290000000.0    24000000.0 266000000.0  inicio
    45846         2020            Q4  2020-10-31      A      USD 377000000.0    27000000.0 350000000.0  inicio
  1841448         2023            Q2  2023-06-30   KNTE      USD -29423000.0       -9000.0 -29414000.0   medio
  1841448         2023       

### 4.4 Preprocessing Summary
Summarize the final state of `df_procesado` after cleaning, integration and feature engineering, checking dimensions, columns and the coverage of the derived variables to ensure analytical consistency.


In [14]:
# 4.4 Preprocessing Summary

if "df_procesado" not in globals() or not isinstance(df_procesado, pd.DataFrame):
    raise NameError("No existe `df_procesado`. Ejecuta primero 4.1, 4.2 y 4.3.")

# Show the final processed DataFrame
print(f"Dimensiones del DataFrame procesado: {df_procesado.shape}")
print("\nPrimeras filas del DataFrame procesado:")
print(df_procesado.head(5).to_string(index=False))

print("\nColumnas finales:")
print(df_procesado.columns.tolist())

# Compact summary of derived features
feat_cols = ["CurrentRatio_Q", "DebtToEquity_Q", "CFO_Q", "CapexProxy_Q", "FCF_Q", "TotalDebt_Q"]
feat_present = [c for c in feat_cols if c in df_procesado.columns]

print("\nCobertura de features derivadas (no nulos):")
for c in feat_present:
    non_null = int(df_procesado[c].notna().sum())
    pct = (non_null / len(df_procesado) * 100) if len(df_procesado) else 0
    print(f"- {c}: {non_null:,} / {len(df_procesado):,} ({pct:.2f}%)")

# Null check on key columns
key_cols = ["SimFinId", "Fiscal Year", "Fiscal Period", "Report Date", "Ticker", "Currency"]
key_existing = [c for c in key_cols if c in df_procesado.columns]
print("\nNulos en columnas clave (%):")
for c in key_existing:
    pct = round(df_procesado[c].isna().mean() * 100, 2)
    print(f"- {c}: {pct}%")


Dimensiones del DataFrame procesado: (52228, 25)

Primeras filas del DataFrame procesado:
 SimFinId  Fiscal Year Fiscal Period Report Date Ticker Currency  Total Current Assets  Total Current Liabilities  _from_balance      Revenue  Operating Income (Loss)  Net Income (Common)  _from_income  Net Cash from Operating Activities  Change in Fixed Assets & Intangibles  _from_cashflow  CurrentRatio_Q  Short Term Debt  Long Term Debt  Total Equity  TotalDebt_Q  DebtToEquity_Q       CFO_Q  CapexProxy_Q       FCF_Q
    45846         2020            Q2  2020-04-30      A      USD          3171000000.0               1945000000.0              1 1238000000.0              102000000.0          101000000.0             1                         313000000.0                           -33000000.0               1        1.630334      700000000.0    1788000000.0  4768000000.0 2488000000.0        0.521812 313000000.0    33000000.0 280000000.0
    45846         2020            Q3  2020-07-31      A      USD  

### 4.5 Semantic Features (extension)
Add a block of semantic features that make SQL queries easier to interpret: margins, liquidity and leverage buckets, a profitability flag, feature completeness and YoY growth.


In [15]:
# 4.5 Semantic features (extension)

if "df_procesado" not in globals() or not isinstance(df_procesado, pd.DataFrame):
    raise NameError("No existe `df_procesado`. Ejecuta primero 4.1, 4.2 y 4.3.")

# Avoid accumulating columns on re-run
new_cols = [
    "fcf_margin_q",
    "liquidity_bucket_q",
    "leverage_bucket_q",
    "profitability_flag_q",
    "yoy_revenue_growth_q",
    "yoy_fcf_growth_q",
    "feature_completeness_q"
]
for c in new_cols:
    if c in df_procesado.columns:
        df_procesado = df_procesado.drop(columns=[c])

# Minimum required source columns
if "CurrentRatio_Q" not in df_procesado.columns:
    raise KeyError("Falta CurrentRatio_Q (ejecuta 4.1).")
if "DebtToEquity_Q" not in df_procesado.columns:
    raise KeyError("Falta DebtToEquity_Q (ejecuta 4.2).")
if "FCF_Q" not in df_procesado.columns:
    raise KeyError("Falta FCF_Q (ejecuta 4.3).")

# Resolve the base columns for Revenue and EBIT
revenue_col = "Revenue_Q" if "Revenue_Q" in df_procesado.columns else ("Revenue" if "Revenue" in df_procesado.columns else None)
ebit_col = "EBIT_Q" if "EBIT_Q" in df_procesado.columns else ("Operating Income (Loss)" if "Operating Income (Loss)" in df_procesado.columns else None)

if revenue_col is None:
    raise KeyError("No se encontró Revenue_Q ni Revenue en df_procesado.")
if ebit_col is None:
    raise KeyError("No se encontró EBIT_Q ni Operating Income (Loss) en df_procesado.")

# 1) fcf_margin_q = FCF_Q / Revenue_Q (only if Revenue > 0)
revenue = pd.to_numeric(df_procesado[revenue_col], errors="coerce")
fcf = pd.to_numeric(df_procesado["FCF_Q"], errors="coerce")
df_procesado["fcf_margin_q"] = np.where(
    fcf.notna() & revenue.notna() & (revenue > 0),
    fcf / revenue,
    np.nan
)

# Helpers for quantile-based buckets (terciles)
labels = ["bajo", "medio", "alto"]
def _bucket_terciles(s: pd.Series) -> pd.Series:
    x = pd.to_numeric(s, errors="coerce")
    valid = x.dropna()
    out = pd.Series(pd.NA, index=s.index, dtype="string")
    if valid.nunique() < 3:
        return out.astype("category")
    try:
        b = pd.qcut(valid, q=3, labels=labels, duplicates="drop")
        out.loc[b.index] = b.astype("string")
    except Exception:
        r = valid.rank(method="average")
        b = pd.qcut(r, q=3, labels=labels, duplicates="drop")
        out.loc[b.index] = b.astype("string")
    return out.astype("category")

# 2) liquidity_bucket_q
# 3) leverage_bucket_q
df_procesado["liquidity_bucket_q"] = _bucket_terciles(df_procesado["CurrentRatio_Q"])
df_procesado["leverage_bucket_q"] = _bucket_terciles(df_procesado["DebtToEquity_Q"])

# 4) profitability_flag_q (1 if EBIT > 0, 0 if EBIT <= 0, NA if missing)
ebit = pd.to_numeric(df_procesado[ebit_col], errors="coerce")
profit = np.where(ebit.notna(), (ebit > 0).astype("int8"), np.nan)
df_procesado["profitability_flag_q"] = pd.Series(profit, index=df_procesado.index).astype("Int8")

# Preparation for YoY (sorted by company and fiscal period)
for c in ["SimFinId", "Fiscal Year", "Fiscal Period"]:
    if c not in df_procesado.columns:
        raise KeyError(f"Falta columna clave para YoY: {c}")

work = df_procesado[["SimFinId", "Fiscal Year", "Fiscal Period", revenue_col, "FCF_Q"]].copy()
work["Fiscal Year"] = pd.to_numeric(work["Fiscal Year"], errors="coerce")
work[revenue_col] = pd.to_numeric(work[revenue_col], errors="coerce")
work["FCF_Q"] = pd.to_numeric(work["FCF_Q"], errors="coerce")
work["Fiscal Period"] = work["Fiscal Period"].astype("string")

# Fiscal ordering, for a stable sort
period_order = {"Q1":1, "Q2":2, "Q3":3, "Q4":4, "FY":5, "TTM":6}
work["_period_ord"] = work["Fiscal Period"].map(period_order)
work = work.sort_values(["SimFinId", "Fiscal Period", "Fiscal Year", "_period_ord"])

# 5) yoy_revenue_growth_q
prev_rev = work.groupby(["SimFinId", "Fiscal Period"])[revenue_col].shift(1)
yoy_rev = np.where(
    work[revenue_col].notna() & prev_rev.notna() & (prev_rev != 0),
    (work[revenue_col] / prev_rev) - 1,
    np.nan
)

# 6) yoy_fcf_growth_q
prev_fcf = work.groupby(["SimFinId", "Fiscal Period"])["FCF_Q"].shift(1)
yoy_fcf = np.where(
    work["FCF_Q"].notna() & prev_fcf.notna() & (prev_fcf != 0),
    (work["FCF_Q"] / prev_fcf) - 1,
    np.nan
)

# Realign to the original index
work["yoy_revenue_growth_q"] = yoy_rev
work["yoy_fcf_growth_q"] = yoy_fcf
work = work.sort_index()

df_procesado["yoy_revenue_growth_q"] = work["yoy_revenue_growth_q"]
df_procesado["yoy_fcf_growth_q"] = work["yoy_fcf_growth_q"]

# 7) feature_completeness_q (from 0 to 6)
feat_for_completeness = [
    "fcf_margin_q",
    "liquidity_bucket_q",
    "leverage_bucket_q",
    "profitability_flag_q",
    "yoy_revenue_growth_q",
    "yoy_fcf_growth_q"
]
df_procesado["feature_completeness_q"] = df_procesado[feat_for_completeness].notna().sum(axis=1).astype("int8")

# Compact verification
print("4.5 Features semánticas (extensión)")
print("\nCobertura no nula:")
for c in new_cols:
    nn = int(df_procesado[c].notna().sum())
    pct = (nn / len(df_procesado) * 100) if len(df_procesado) else 0
    print(f"- {c}: {nn:,} / {len(df_procesado):,} ({pct:.2f}%)")

print("\nDistribución buckets/flag:")
for c in ["liquidity_bucket_q", "leverage_bucket_q", "profitability_flag_q", "feature_completeness_q"]:
    vc = df_procesado[c].value_counts(dropna=False).head(10)
    print(f"\n{c}")
    print(vc.to_string())

print("\nEstadísticas numéricas principales:")
for c in ["fcf_margin_q", "yoy_revenue_growth_q", "yoy_fcf_growth_q"]:
    print(f"\n{c}")
    print(df_procesado[c].describe())

# Sneak peek, 9 rows
preview_cols = [c for c in [
    "SimFinId", "Fiscal Year", "Fiscal Period", "Report Date", "Ticker",
    "fcf_margin_q", "liquidity_bucket_q", "leverage_bucket_q",
    "profitability_flag_q", "yoy_revenue_growth_q", "yoy_fcf_growth_q", "feature_completeness_q"
] if c in df_procesado.columns]

n = len(df_procesado)
mid_start = max((n // 2) - 1, 0)
peek_head = df_procesado[preview_cols].head(3).copy(); peek_head["segment"] = "inicio"
peek_mid = df_procesado[preview_cols].iloc[mid_start:mid_start+3].copy(); peek_mid["segment"] = "medio"
peek_tail = df_procesado[preview_cols].tail(3).copy(); peek_tail["segment"] = "final"
peek = pd.concat([peek_head, peek_mid, peek_tail], ignore_index=True)

print("\nSneak peek 4.5 (3 inicio + 3 medio + 3 final):")
print(peek.to_string(index=False))


4.5 Features semánticas (extensión)

Cobertura no nula:
- fcf_margin_q: 44,031 / 52,228 (84.31%)
- liquidity_bucket_q: 52,027 / 52,228 (99.62%)
- leverage_bucket_q: 23,006 / 52,228 (44.05%)
- profitability_flag_q: 52,103 / 52,228 (99.76%)
- yoy_revenue_growth_q: 34,067 / 52,228 (65.23%)
- yoy_fcf_growth_q: 34,778 / 52,228 (66.59%)
- feature_completeness_q: 52,228 / 52,228 (100.00%)

Distribución buckets/flag:

liquidity_bucket_q
liquidity_bucket_q
alto     17343
bajo     17342
medio    17342
<NA>       201

leverage_bucket_q
leverage_bucket_q
<NA>     29222
alto      7669
bajo      7669
medio     7668

profitability_flag_q
profitability_flag_q
1       29605
0       22498
<NA>      125

feature_completeness_q
feature_completeness_q
5    16262
6    16092
3     8779
4     6901
2     4047
1      115
0       32

Estadísticas numéricas principales:

fcf_margin_q
count    4.403100e+04
mean     7.292994e-01
std      2.289179e+04
min     -3.442431e+06
25%     -1.310365e-01
50%      4.51127
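
The YoY features in 4.5 take the previous available row for the same company and fiscal period through `shift(1)`. If a fiscal year is missing for a company, that previous row may be more than one year older. The sketch below shows a stricter variant that also requires a gap of exactly one fiscal year. It runs on hypothetical toy data, and the column name `yoy_revenue_growth_strict` is an assumption introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: company 1 has no Q1 row for fiscal year 2021.
toy = pd.DataFrame({
    "SimFinId": [1, 1, 1],
    "Fiscal Year": [2020, 2022, 2023],
    "Fiscal Period": ["Q1", "Q1", "Q1"],
    "Revenue": [100.0, 150.0, 165.0],
})

toy = toy.sort_values(["SimFinId", "Fiscal Period", "Fiscal Year"])
grp = toy.groupby(["SimFinId", "Fiscal Period"])
prev_rev = grp["Revenue"].shift(1)
prev_year = grp["Fiscal Year"].shift(1)

# Keep the growth only when the previous row is exactly one fiscal year earlier.
one_year_gap = (toy["Fiscal Year"] - prev_year) == 1
valid = prev_rev.notna() & (prev_rev != 0) & one_year_gap
toy["yoy_revenue_growth_strict"] = np.where(valid, toy["Revenue"] / prev_rev - 1, np.nan)

print(toy.to_string(index=False))
```

In this toy case the 2022 row is left empty because its previous Q1 is from 2020, while the 2023 row is computed against 2022.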

## 5. Database Schema Design
### Conceptual design
#### Number of tables: 3
#### Main table: fact_fundamentals_q

Purpose: represent one company-quarter observation with fundamental metrics and financial features for analysis.
What each row represents:
1 row = 1 SimFinId + 1 Fiscal Year + 1 Fiscal Period + 1 Report Date

#### Secondary table 1: dim_companies
##### Purpose: master catalog of companies, used for descriptive context and segmentation.
##### Relationship to the main table:
fact_fundamentals_q.simfin_id → dim_companies.simfin_id (N:1 relationship)

#### Secondary table 2: fact_prices_d
##### Purpose: daily market series for price and market-capitalization analysis at a daily horizon.
##### Relationship to the main table:
fact_prices_d is related analytically to fact_fundamentals_q through simfin_id plus a temporal condition, with dim_companies as the shared dimension.
The temporal condition is of the as-of type: a daily price may only be linked to the latest quarterly fundamental available on that date (report_date <= date), never to reports published after it.
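
The as-of condition can be sketched with `pandas.merge_asof`, which matches each price row with the latest fundamentals row whose `report_date` is on or before the price `date`. This is an illustrative sketch on hypothetical toy rows, not the notebook's own implementation; the column names follow the snake_case schema used for the SQLite tables.

```python
import pandas as pd

# Hypothetical toy rows for a single company.
fundamentals = pd.DataFrame({
    "simfin_id": [1, 1],
    "report_date": pd.to_datetime(["2020-03-31", "2020-06-30"]),
    "fcf_q": [10.0, 12.0],
})
prices = pd.DataFrame({
    "simfin_id": [1, 1, 1],
    "date": pd.to_datetime(["2020-03-15", "2020-05-15", "2020-07-15"]),
    "close": [50.0, 55.0, 60.0],
})

# merge_asof requires both frames to be sorted by their time key.
asof = pd.merge_asof(
    prices.sort_values("date"),
    fundamentals.sort_values("report_date"),
    left_on="date",
    right_on="report_date",
    by="simfin_id",
    direction="backward",  # latest row with report_date <= date
)

print(asof[["date", "close", "report_date", "fcf_q"]].to_string(index=False))
```

The first price (2020-03-15) precedes every report, so it gets no fundamentals; the other two are linked to the most recent report dated on or before them.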

### Design rationale
A three-table schema was chosen because the data differ in granularity and in the type of query they serve: quarterly fundamentals (`fact_fundamentals_q`), daily prices (`fact_prices_d`) and a company dimension (`dim_companies`). This structure reduces ambiguity in the queries written by the SQL agent, keeps the time axis consistent and avoids duplicating daily data inside the quarterly grain. The `dim_companies` dimension supplies sector and industry context, and `fact_prices_d` supports as-of valuation analysis (given the price on a particular date…).

### Normalization vs. denormalization decisions
The schema was normalized by granularity: quarterly facts (`fact_fundamentals_q`) were separated from daily facts (`fact_prices_d`), and descriptive context was centralized in `dim_companies`. This avoids mixing different frequencies in a single table and prevents the row explosion that daily repetition would cause.

In addition, the main quarterly table `fact_fundamentals_q` was denormalized internally: it concentrates the most frequently used metrics and features (liquidity, leverage, cash, growth and semantic variables). This reduces joins in common queries and makes it easier for the SQL agent to reason over indicators that are directly available.

Some columns are deliberately repeated (for example `ticker` and `currency` in the fact table) to improve traceability and query readability.

`industry_id`, `industry` and `sector` were included directly in `dim_companies` to denormalize the sector context and simplify queries. That is, the descriptive names (`industry`, `sector`) are materialized in the same dimension to avoid joins with an additional industries table.
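
With the sector stored directly in `dim_companies`, a sector-level aggregate needs a single join. The sketch below illustrates this on an in-memory SQLite database with hypothetical rows. The reduced column set and the `PRIMARY KEY` declaration are assumptions made for the example, not a description of how the notebook creates its tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_companies (
    simfin_id INTEGER PRIMARY KEY,  -- assumption: key declared only for this sketch
    ticker TEXT,
    sector TEXT
);
CREATE TABLE fact_fundamentals_q (
    simfin_id INTEGER,
    fiscal_year INTEGER,
    fiscal_period TEXT,
    fcf_margin_q REAL
);
-- Hypothetical rows
INSERT INTO dim_companies VALUES (1, 'AAA', 'Technology'), (2, 'BBB', 'Energy');
INSERT INTO fact_fundamentals_q VALUES
    (1, 2023, 'Q1', 0.20), (1, 2023, 'Q2', 0.30), (2, 2023, 'Q1', 0.10);
""")

# One join is enough to aggregate a quarterly metric by sector.
rows = conn.execute("""
SELECT c.sector, ROUND(AVG(f.fcf_margin_q), 2) AS avg_fcf_margin
FROM fact_fundamentals_q AS f
JOIN dim_companies AS c ON c.simfin_id = f.simfin_id
GROUP BY c.sector
ORDER BY c.sector
""").fetchall()

print(rows)
conn.close()
```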

In general, full normalization is not the goal; the design favors performance and ease of conversational analysis (this is an expectation, to be checked in the tests in later sections).


## 6. Database Implementation
### 6.1 Final DataFrame Preparation
Prepare the final DataFrames for loading into SQLite (three-table schema): select columns, normalize names to `snake_case` and validate the final structure.

#### Note:
A daily-frequency feature, `market_cap_d`, was added to `fact_prices_d` during the final preparation for SQLite (6.1). It follows the calculability rule: if either of the two inputs is missing, `market_cap_d` is NULL/NaN.
market_cap_d = adj_close * shares_outstanding


In [16]:
# 6.1 Final DataFrame Preparation

if "df_procesado" not in globals() or not isinstance(df_procesado, pd.DataFrame):
    raise NameError("No existe `df_procesado`. Ejecuta primero 4.1-4.5.")
if "dfs_proc" not in globals() or not isinstance(dfs_proc, dict):
    raise NameError("No existe `dfs_proc`. Ejecuta primero 3.x.")

# -----------------------------
# A) fact_fundamentals_q (main table)
# -----------------------------
source_priority = {
    "simfin_id": ["SimFinId"],
    "fiscal_year": ["Fiscal Year"],
    "fiscal_period": ["Fiscal Period"],
    "report_date": ["Report Date"],
    "ticker": ["Ticker"],
    "currency": ["Currency"],
    "total_current_assets": ["Total Current Assets"],
    "total_current_liabilities": ["Total Current Liabilities"],
    "short_term_debt": ["Short Term Debt"],
    "long_term_debt": ["Long Term Debt"],
    "total_equity": ["Total Equity"],
    "revenue_q": ["Revenue_Q", "Revenue"],
    "ebit_q": ["EBIT_Q", "Operating Income (Loss)"],
    "net_income_common_q": ["NetIncomeCommon_Q", "Net Income (Common)"],
    "cfo_q": ["CFO_Q", "Net Cash from Operating Activities"],
    "capex_proxy_q": ["CapexProxy_Q", "Change in Fixed Assets & Intangibles"],
    "fcf_q": ["FCF_Q"],
    "total_debt_q": ["TotalDebt_Q"],
    "current_ratio_q": ["CurrentRatio_Q"],
    "debt_to_equity_q": ["DebtToEquity_Q"],
    "fcf_margin_q": ["fcf_margin_q"],
    "liquidity_bucket_q": ["liquidity_bucket_q"],
    "leverage_bucket_q": ["leverage_bucket_q"],
    "profitability_flag_q": ["profitability_flag_q"],
    "yoy_revenue_growth_q": ["yoy_revenue_growth_q"],
    "yoy_fcf_growth_q": ["yoy_fcf_growth_q"],
    "feature_completeness_q": ["feature_completeness_q"],
}

df_final_fundamentals = pd.DataFrame(index=df_procesado.index)
for target, candidates in source_priority.items():
    chosen = next((c for c in candidates if c in df_procesado.columns), None)
    if chosen is None:
        df_final_fundamentals[target] = pd.NA
    else:
        df_final_fundamentals[target] = df_procesado[chosen]

# Sign adjustment if capex_proxy_q comes from the raw column
if "CapexProxy_Q" not in df_procesado.columns and "Change in Fixed Assets & Intangibles" in df_procesado.columns:
    x = pd.to_numeric(df_final_fundamentals["capex_proxy_q"], errors="coerce")
    df_final_fundamentals["capex_proxy_q"] = np.where(x.notna(), -x, np.nan)

# Minimum types
for c in ["simfin_id", "fiscal_year", "feature_completeness_q", "profitability_flag_q"]:
    df_final_fundamentals[c] = pd.to_numeric(df_final_fundamentals[c], errors="coerce").astype("Int64")

for c in [
    "total_current_assets", "total_current_liabilities", "short_term_debt", "long_term_debt", "total_equity",
    "revenue_q", "ebit_q", "net_income_common_q", "cfo_q", "capex_proxy_q", "fcf_q", "total_debt_q",
    "current_ratio_q", "debt_to_equity_q", "fcf_margin_q", "yoy_revenue_growth_q", "yoy_fcf_growth_q"
]:
    df_final_fundamentals[c] = pd.to_numeric(df_final_fundamentals[c], errors="coerce")

df_final_fundamentals["report_date"] = pd.to_datetime(df_final_fundamentals["report_date"], errors="coerce")

# Drop duplicates on the grain of the main table
key_f = ["simfin_id", "fiscal_year", "fiscal_period", "report_date"]
df_final_fundamentals = df_final_fundamentals.drop_duplicates(subset=key_f, keep="last")

# -----------------------------
# B) dim_companies
# -----------------------------
comp = dfs_proc["companies"].copy()

# Enrich with industry/sector if they exist in industries
if "industries" in dfs_proc and "IndustryId" in comp.columns and "IndustryId" in dfs_proc["industries"].columns:
    ind_cols = [c for c in ["IndustryId", "Industry", "Sector"] if c in dfs_proc["industries"].columns]
    comp = comp.merge(dfs_proc["industries"][ind_cols].drop_duplicates(subset=["IndustryId"]), on="IndustryId", how="left")

comp_map = {
    "simfin_id": "SimFinId",
    "ticker": "Ticker",
    "company_name": "Company Name",
    "market": "Market",
    "main_currency": "Main Currency",
    "industry_id": "IndustryId",
    "industry": "Industry",
    "sector": "Sector",
    "has_industry": "has_industry",
}

df_final_companies = pd.DataFrame()
for t, s in comp_map.items():
    df_final_companies[t] = comp[s] if s in comp.columns else pd.NA

df_final_companies["simfin_id"] = pd.to_numeric(df_final_companies["simfin_id"], errors="coerce").astype("Int64")
df_final_companies["industry_id"] = pd.to_numeric(df_final_companies["industry_id"], errors="coerce").astype("Int64")
df_final_companies["has_industry"] = pd.to_numeric(df_final_companies["has_industry"], errors="coerce").fillna(0).astype("Int64")

df_final_companies = df_final_companies.drop_duplicates(subset=["simfin_id"], keep="last")

# -----------------------------
# C) fact_prices_d
# -----------------------------
pr = dfs_proc["prices_d"].copy()
price_map = {
    "simfin_id": "SimFinId",
    "date": "Date",
    "ticker": "Ticker",
    "open": "Open",
    "high": "High",
    "low": "Low",
    "close": "Close",
    "adj_close": "Adj. Close",
    "volume": "Volume",
    "shares_outstanding": "Shares Outstanding",
}

df_final_prices = pd.DataFrame()
for t, s in price_map.items():
    df_final_prices[t] = pr[s] if s in pr.columns else pd.NA

# market_cap_d = adj_close * shares_outstanding
adj = pd.to_numeric(df_final_prices["adj_close"], errors="coerce")
shr = pd.to_numeric(df_final_prices["shares_outstanding"], errors="coerce")
df_final_prices["market_cap_d"] = np.where(adj.notna() & shr.notna(), adj * shr, np.nan)

# Types
df_final_prices["simfin_id"] = pd.to_numeric(df_final_prices["simfin_id"], errors="coerce").astype("Int64")
df_final_prices["date"] = pd.to_datetime(df_final_prices["date"], errors="coerce")
for c in ["open", "high", "low", "close", "adj_close", "volume", "shares_outstanding", "market_cap_d"]:
    df_final_prices[c] = pd.to_numeric(df_final_prices[c], errors="coerce")

df_final_prices = df_final_prices.drop_duplicates(subset=["simfin_id", "date"], keep="last")

# -----------------------------
# Final result for the database load
# -----------------------------
# To follow the template, df_final points to the main table
columnas_finales = df_final_fundamentals.columns.tolist()
df_final = df_final_fundamentals[columnas_finales].copy()

print("DataFrame final para carga a base de datos (tabla principal):")
print(df_final.head(5).to_string(index=False))
print(f"\nTotal de registros a cargar (fact_fundamentals_q): {len(df_final):,}")

print("\nResumen de tablas finales del esquema:")
print(f"- fact_fundamentals_q: {df_final_fundamentals.shape}")
print(f"- dim_companies:       {df_final_companies.shape}")
print(f"- fact_prices_d:       {df_final_prices.shape}")


DataFrame final para carga a base de datos (tabla principal):
 simfin_id  fiscal_year fiscal_period report_date ticker currency  total_current_assets  total_current_liabilities  short_term_debt  long_term_debt  total_equity    revenue_q      ebit_q  net_income_common_q       cfo_q  capex_proxy_q       fcf_q  total_debt_q  current_ratio_q  debt_to_equity_q  fcf_margin_q liquidity_bucket_q leverage_bucket_q  profitability_flag_q  yoy_revenue_growth_q  yoy_fcf_growth_q  feature_completeness_q
     45846         2020            Q2  2020-04-30      A      USD          3171000000.0               1945000000.0      700000000.0    1788000000.0  4768000000.0 1238000000.0 102000000.0          101000000.0 313000000.0     33000000.0 280000000.0  2488000000.0         1.630334          0.521812      0.226171              medio             medio                     1                   NaN               NaN                       4
     45846         2020            Q3  2020-07-31      A      USD       

## Task: memory release
Release intermediate DataFrames to reduce memory usage and avoid failures when loading into SQLite, while preserving the final load inputs (`df_final_fundamentals`, `df_final_companies`, `df_final_prices`).

In [17]:
# Cleanup before loading into SQLite
import gc

# 1. Make sure the load inputs exist
required = ["df_final_fundamentals", "df_final_companies", "df_final_prices"]
for name in required:
    if name not in globals() or not isinstance(globals()[name], pd.DataFrame):
        raise NameError(f"Falta {name}. Ejecuta primero 6.1.")

print("Insumos de carga verificados:")
for name in required:
    print(f"- {name}: {globals()[name].shape}")

# 2. Release only intermediate DataFrames
to_delete = [
    "dfs", "dfs_31", "dfs_proc", "dfs_proc2", "dfs_proc3", "dfs_proc4",
    "df_procesado", "df_final",  # df_final is a temporary alias, not critical
    "bal_41", "inc_41", "cf_41", "base_keys", "peek"
]

for var_name in to_delete:
    if var_name in globals() and var_name not in required:
        del globals()[var_name]
        print(f"Liberada variable: {var_name}")

gc.collect()
print("Memoria de objetos intermedios liberada (insumos de carga preservados).")

Insumos de carga verificados:
- df_final_fundamentals: (52228, 27)
- df_final_companies: (6488, 9)
- df_final_prices: (6209717, 11)
Liberada variable: dfs
Liberada variable: dfs_31
Liberada variable: dfs_proc
Liberada variable: dfs_proc2
Liberada variable: dfs_proc3
Liberada variable: dfs_proc4
Liberada variable: df_procesado
Liberada variable: df_final
Liberada variable: bal_41
Liberada variable: inc_41
Liberada variable: cf_41
Liberada variable: base_keys
Liberada variable: peek
Memoria de objetos intermedios liberada (insumos de carga preservados).
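
Before deciding which objects to release, it can help to measure how much memory each DataFrame actually holds. The following is a minimal sketch, not part of the original notebook: `dataframe_memory_mb` is a hypothetical helper, and the toy frames only stand in for the notebook's intermediates.

```python
import pandas as pd

def dataframe_memory_mb(frames):
    """Return {name: size in MiB} for each DataFrame in a dict, largest first."""
    sizes = {
        name: df.memory_usage(deep=True).sum() / 1024**2
        for name, df in frames.items()
        if isinstance(df, pd.DataFrame)
    }
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))

# Toy frames standing in for the notebook's intermediates (hypothetical data)
demo = {
    "small": pd.DataFrame({"a": range(10)}),
    "large": pd.DataFrame({"a": range(100_000), "b": ["x"] * 100_000}),
}
report = dataframe_memory_mb(demo)
for name, mb in report.items():
    print(f"{name}: {mb:.2f} MiB")
```

In the notebook, the same helper could be called with a dict built from `globals()` to list the largest DataFrames before running the cleanup.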


### 6.2 Creating the SQLite Database
Create the SQLite database and load each table of the final schema (`fact_fundamentals_q`, `dim_companies`, `fact_prices_d`).

In [18]:
# 6.2.0 SQLite connection (run once)

from sqlalchemy import create_engine

db_path = "agenteSQLInvesrionesBolsaUSA.db"
engine = create_engine(f"sqlite:///{db_path}")
print(f"Base de datos creada/conectada en: {db_path}")


Base de datos creada/conectada en: agenteSQLInvesrionesBolsaUSA.db


In [19]:
# 6.2.1 Load fact_fundamentals_q

if "df_final_fundamentals" not in globals():
    raise NameError("No existe df_final_fundamentals. Ejecuta primero 6.1.")
if "engine" not in globals():
    raise NameError("No existe engine. Ejecuta primero la celda de conexión.")

df_final = df_final_fundamentals.copy()
df_final.to_sql("fact_fundamentals_q", engine, if_exists="replace", index=False)

print("Tabla creada: fact_fundamentals_q")
print("\nResumen de carga:")
print(f"- fact_fundamentals_q: {len(df_final):,} filas")


Tabla creada: fact_fundamentals_q

Resumen de carga:
- fact_fundamentals_q: 52,228 filas


In [20]:
# 6.2.2 Load dim_companies

if "df_final_companies" not in globals():
    raise NameError("No existe df_final_companies. Ejecuta primero 6.1.")
if "engine" not in globals():
    raise NameError("No existe engine. Ejecuta primero la celda de conexión.")

df_final_companies.to_sql("dim_companies", engine, if_exists="replace", index=False)

print("Tabla creada: dim_companies")
print("\nResumen de carga:")
print(f"- dim_companies: {len(df_final_companies):,} filas")


Tabla creada: dim_companies

Resumen de carga:
- dim_companies: 6,488 filas


## Task: Memory release
Release memory to avoid a failure when loading the prices table.

In [21]:
import gc

for v in ["df_final_fundamentals", "df_final_companies", "df_final"]:
    if v in globals():
        del globals()[v]
        print(f"Liberada variable: {v}")

gc.collect()
print("Memoria liberada antes de cargar fact_prices_d.")


Liberada variable: df_final_fundamentals
Liberada variable: df_final_companies
Liberada variable: df_final
Memoria liberada antes de cargar fact_prices_d.


In [22]:
# 6.2.3 Load fact_prices_d in chunks (RAM management)

import gc
import pandas as pd

if "df_final_prices" not in globals():
    raise NameError("No existe df_final_prices. Ejecuta primero 6.1.")
if "engine" not in globals():
    raise NameError("No existe engine. Ejecuta primero la celda de conexión.")

# Optional downcast to reduce the RAM footprint
# (note: float32 keeps only about 7 significant digits)
for c in ["open", "high", "low", "close", "adj_close", "volume", "shares_outstanding", "market_cap_d"]:
    if c in df_final_prices.columns:
        df_final_prices[c] = pd.to_numeric(df_final_prices[c], errors="coerce", downcast="float")

chunk_size = 150_000
total = len(df_final_prices)

for start in range(0, total, chunk_size):
    end = min(start + chunk_size, total)
    chunk = df_final_prices.iloc[start:end].copy()

    mode = "replace" if start == 0 else "append"
    chunk.to_sql("fact_prices_d", engine, if_exists=mode, index=False)

    if start == 0 or end == total or (start // chunk_size) % 10 == 0:
        print(f"Cargadas filas: {end:,}/{total:,}")

    del chunk
    gc.collect()

print("Tabla creada: fact_prices_d")
print(f"\nResumen de carga:\n- fact_prices_d: {total:,} filas")


Cargadas filas: 150,000/6,209,717
Cargadas filas: 1,650,000/6,209,717
Cargadas filas: 3,150,000/6,209,717
Cargadas filas: 4,650,000/6,209,717
Cargadas filas: 6,150,000/6,209,717
Cargadas filas: 6,209,717/6,209,717
Tabla creada: fact_prices_d

Resumen de carga:
- fact_prices_d: 6,209,717 filas
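
With more than six million rows in `fact_prices_d`, the agent's queries may benefit from an index created after the bulk load. The sketch below is an assumption-laden illustration, not a step the notebook performs: it uses an in-memory toy table, the index name `idx_prices_ticker_date` is hypothetical, and the `date` column name is assumed (only `ticker` and `adj_close` are confirmed by the notebook's queries).

```python
import sqlite3

# In-memory toy table standing in for fact_prices_d (the date column name is an assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_prices_d (ticker TEXT, date TEXT, adj_close REAL)")
conn.executemany(
    "INSERT INTO fact_prices_d VALUES (?, ?, ?)",
    [
        ("NVDA", "2024-01-02", 48.1),
        ("NVDA", "2024-01-03", 47.5),
        ("MSFT", "2024-01-02", 370.9),
    ],
)

# Create the index once, after the bulk load has finished
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_prices_ticker_date ON fact_prices_d (ticker, date)"
)

# Ask SQLite how it would execute a typical per-ticker aggregation
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT AVG(adj_close) FROM fact_prices_d WHERE ticker = 'NVDA'"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
conn.close()
```

The index is used here because the filter compares `ticker` directly. Wrapping the column in a function, as in `UPPER(ticker) = 'NVDA'`, generally prevents SQLite from using a plain index on that column.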


### 6.3 Load Verification
Validate that the tables were loaded correctly into SQLite through counts, samples, and basic aggregations for each table in the schema.


In [23]:
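# 6.3 Load verification

import sqlite3
import pandas as pd

if "db_path" not in globals():
    # Fall back to the file created in 6.2; any other name would make
    # sqlite3.connect() silently create a new, empty database
    db_path = "agenteSQLInvesrionesBolsaUSA.db"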

conn = sqlite3.connect(db_path)

try:
    print(f"Verificando base de datos: {db_path}\n")

    # 1) Conteo de registros por tabla
    print("=== Conteos por tabla ===")
    query_counts = """
    SELECT 'fact_fundamentals_q' AS table_name, COUNT(*) AS total_rows FROM fact_fundamentals_q
    UNION ALL
    SELECT 'dim_companies' AS table_name, COUNT(*) AS total_rows FROM dim_companies
    UNION ALL
    SELECT 'fact_prices_d' AS table_name, COUNT(*) AS total_rows FROM fact_prices_d
    """
    counts_df = pd.read_sql_query(query_counts, conn)
    print(counts_df.to_string(index=False))

    # 2) Muestra de filas por tabla
    print("\n=== Primeras 5 filas: fact_fundamentals_q ===")
    print(pd.read_sql_query("SELECT * FROM fact_fundamentals_q LIMIT 5", conn).to_string(index=False))

    print("\n=== Primeras 5 filas: dim_companies ===")
    print(pd.read_sql_query("SELECT * FROM dim_companies LIMIT 5", conn).to_string(index=False))

    print("\n=== Primeras 5 filas: fact_prices_d ===")
    print(pd.read_sql_query("SELECT * FROM fact_prices_d LIMIT 5", conn).to_string(index=False))

    # 3) Agregaciones de prueba (sanity checks)
    print("\n=== Agregaciones de prueba ===")

    agg_fund = pd.read_sql_query(
        """
        SELECT
            AVG(current_ratio_q) AS avg_current_ratio_q,
            AVG(debt_to_equity_q) AS avg_debt_to_equity_q,
            AVG(fcf_q) AS avg_fcf_q
        FROM fact_fundamentals_q
        """,
        conn,
    )
    print("\n[fundamentals]")
    print(agg_fund.to_string(index=False))

    agg_comp = pd.read_sql_query(
        """
        SELECT
            COUNT(*) AS total_companies,
            SUM(CASE WHEN has_industry = 1 THEN 1 ELSE 0 END) AS companies_with_industry
        FROM dim_companies
        """,
        conn,
    )
    print("\n[companies]")
    print(agg_comp.to_string(index=False))

    agg_prices = pd.read_sql_query(
        """
        SELECT
            AVG(adj_close) AS avg_adj_close,
            AVG(market_cap_d) AS avg_market_cap_d
        FROM fact_prices_d
        """,
        conn,
    )
    print("\n[prices]")
    print(agg_prices.to_string(index=False))

    print("\nVerificación completada sin errores SQL.")

finally:
    conn.close()
    print("Conexión SQLite cerrada.")


Verificando base de datos: agenteSQLInvesrionesBolsaUSA.db

=== Conteos por tabla ===
         table_name  total_rows
fact_fundamentals_q       52228
      dim_companies        6488
      fact_prices_d     6209717

=== Primeras 5 filas: fact_fundamentals_q ===
 simfin_id  fiscal_year fiscal_period                report_date ticker currency  total_current_assets  total_current_liabilities  short_term_debt  long_term_debt  total_equity    revenue_q      ebit_q  net_income_common_q       cfo_q  capex_proxy_q       fcf_q  total_debt_q  current_ratio_q  debt_to_equity_q  fcf_margin_q liquidity_bucket_q leverage_bucket_q  profitability_flag_q  yoy_revenue_growth_q  yoy_fcf_growth_q  feature_completeness_q
     45846         2020            Q2 2020-04-30 00:00:00.000000      A      USD          3171000000.0               1945000000.0      700000000.0    1788000000.0  4768000000.0 1238000000.0 102000000.0          101000000.0 313000000.0     33000000.0 280000000.0  2488000000.0         1.63033

#### Cross-Validation Table of Totals (6.1 vs 6.2 vs 6.3)
Consolidate in a single table the record counts reported during preparation, loading, and SQL querying, to show that the load into SQLite is consistent and to document traceability by section.


In [24]:
# Cross-validation table: 6.1 Preparation vs 6.2 Load vs 6.3 SQL query
# (values taken from the outputs reported in each section)

import pandas as pd

# One tuple per (stage, table): stage label, table, reported total, source of the figure
filas = [
    ("6.1 Preparación", "fact_fundamentals_q", 52228, "Salida 6.1 (shape/total de carga)"),
    ("6.1 Preparación", "dim_companies", 6488, "Salida 6.1 (shape/total de carga)"),
    ("6.1 Preparación", "fact_prices_d", 6209717, "Salida 6.1 (shape/total de carga)"),
    ("6.2 Cargue", "fact_fundamentals_q", 52228, "Salida 6.2 (resumen de carga)"),
    ("6.2 Cargue", "dim_companies", 6488, "Salida 6.2 (resumen de carga)"),
    ("6.2 Cargue", "fact_prices_d", 6209717, "Salida 6.2 (resumen de carga por chunks)"),
    ("6.3 Consulta SQL", "fact_fundamentals_q", 52228, "Salida 6.3 (SELECT COUNT(*))"),
    ("6.3 Consulta SQL", "dim_companies", 6488, "Salida 6.3 (SELECT COUNT(*))"),
    ("6.3 Consulta SQL", "fact_prices_d", 6209717, "Salida 6.3 (SELECT COUNT(*))"),
]
val_totales = pd.DataFrame(filas, columns=["Numeral", "Tabla", "Total_registros", "Fuente"])

# Ordered view for the screenshot
print("Validación de totales por etapa:\n")
print(val_totales.to_string(index=False))

# Automatic consistency check per table
pivot = val_totales.pivot(index="Tabla", columns="Numeral", values="Total_registros")
pivot["Coincide_6.1_6.2_6.3"] = (
    (pivot["6.1 Preparación"] == pivot["6.2 Cargue"]) &
    (pivot["6.2 Cargue"] == pivot["6.3 Consulta SQL"])
)

print("\nResumen de consistencia:")
print(pivot.to_string())


Validación de totales por etapa:

         Numeral               Tabla  Total_registros                                   Fuente
 6.1 Preparación fact_fundamentals_q            52228        Salida 6.1 (shape/total de carga)
 6.1 Preparación       dim_companies             6488        Salida 6.1 (shape/total de carga)
 6.1 Preparación       fact_prices_d          6209717        Salida 6.1 (shape/total de carga)
      6.2 Cargue fact_fundamentals_q            52228            Salida 6.2 (resumen de carga)
      6.2 Cargue       dim_companies             6488            Salida 6.2 (resumen de carga)
      6.2 Cargue       fact_prices_d          6209717 Salida 6.2 (resumen de carga por chunks)
6.3 Consulta SQL fact_fundamentals_q            52228             Salida 6.3 (SELECT COUNT(*))
6.3 Consulta SQL       dim_companies             6488             Salida 6.3 (SELECT COUNT(*))
6.3 Consulta SQL       fact_prices_d          6209717             Salida 6.3 (SELECT COUNT(*))

Resumen de 
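
The cross-validation above relies on totals copied by hand from earlier outputs. As an alternative, the comparison could be computed directly, so that a mismatch cannot be hidden by a transcription error. This is a minimal sketch under stated assumptions: `count_rows` is a hypothetical helper, and the in-memory database and toy frames stand in for the notebook's real tables.

```python
import sqlite3
import pandas as pd

def count_rows(conn, tables):
    """Return {table: row count} using SELECT COUNT(*) on each table."""
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0] for t in tables}

# Toy frames standing in for the real load inputs (hypothetical data)
frames = {
    "dim_companies": pd.DataFrame({"simfin_id": [1, 2, 3]}),
    "fact_fundamentals_q": pd.DataFrame({"simfin_id": [1, 1, 2, 3]}),
}

conn = sqlite3.connect(":memory:")
for name, df in frames.items():
    df.to_sql(name, conn, index=False)

# Expected totals come from the DataFrames, loaded totals from SQL
expected = {name: len(df) for name, df in frames.items()}
loaded = count_rows(conn, expected)
conn.close()

check = pd.DataFrame({"expected": expected, "loaded": loaded})
check["match"] = check["expected"] == check["loaded"]
print(check)
```

In the notebook, `expected` would have to be captured before the DataFrames are released in the memory cleanup steps.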

## 7. Configuring the SQL Agent with LangChain
### 7.1 API Key Configuration
Configure the OpenAI credential securely, using Colab Secrets with interactive input as a fallback.


In [25]:
# 7.1 API Key configuration

# Secure configuration of OpenAI credentials
import os

# Try to load from Colab Secrets when running in Colab
try:
    from google.colab import userdata
    openai_api_key = userdata.get("OPENAI_API_KEY")

    if openai_api_key is None or str(openai_api_key).strip() == "":
        raise ValueError("OPENAI_API_KEY no encontrada en Colab Secrets.")

    os.environ["OPENAI_API_KEY"] = openai_api_key
    print("✓ API Key cargada desde Colab Secrets")

except Exception:
    # Outside Colab, or when the secret does not exist, ask for the API key interactively
    from getpass import getpass
    os.environ["OPENAI_API_KEY"] = getpass("Ingresa tu OpenAI API Key: ")
    print("✓ API Key configurada")

# Minimal check (without exposing the key)
print("Longitud de API Key:", len(os.environ.get("OPENAI_API_KEY", "")))


✓ API Key cargada desde Colab Secrets
Longitud de API Key: 164


### 7.2 Initializing the SQL Agent
Initialize the SQLite connection and create a SQL agent with LangChain + OpenAI to run natural-language queries over the loaded schema.


In [45]:
# 7.2 SQL agent initialization

import os

from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent
from langchain_community.callbacks.manager import get_openai_callback
from langchain_openai import ChatOpenAI

# Minimal validations
if "db_path" not in globals():
    # Fall back to the file created in 6.2
    db_path = "agenteSQLInvesrionesBolsaUSA.db"

if not os.path.exists(db_path):
    raise FileNotFoundError(f"No se encontró la base de datos: {db_path}. Ejecuta primero 6.2.")

if os.environ.get("OPENAI_API_KEY", "").strip() == "":
    raise ValueError("OPENAI_API_KEY no configurada. Ejecuta primero 7.1.")

# Connect to the SQLite database
db = SQLDatabase.from_uri(
    f"sqlite:///{db_path}",
    #sample_rows_in_table_info=1
)

# Initialize the language model
#llm = ChatOpenAI(
#    model="gpt-4",      # Or "gpt-3.5-turbo" for lower cost
#    temperature=0       # More deterministic responses for SQL
#)

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

# Create the SQL agent
agent_executor = create_sql_agent(
    llm=llm,
    db=db,
    verbose=True,        # Show the agent's reasoning/steps
    max_iterations=10,   # Cap the number of loops the agent can make against the API
    early_stopping_method="generate",  # Defines what happens when the limit is reached (stop, or return the last thought)
    agent_type="tool-calling",
    agent_executor_kwargs={"handle_parsing_errors": True}
)

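# Refactor: the agent is built inside a function so it can be recreated on each call/query,
# receiving the tables to include as a parameter
def configurar_agente(tablas_personalizadas=None):
    """
    Create a fresh instance of the SQL agent.
    If tablas_personalizadas is None, the agent sees the whole database.
    """
    # The table restriction is applied on the SQLDatabase object (its include_tables
    # argument), because the agent's tools read the schema from that object.
    db_agente = (
        SQLDatabase.from_uri(f"sqlite:///{db_path}", include_tables=list(tablas_personalizadas))
        if tablas_personalizadas
        else db
    )
    return create_sql_agent(
        llm=llm,
        db=db_agente,
        verbose=True,
        #max_iterations=5,
        early_stopping_method="generate",
        agent_type="tool-calling",
        agent_executor_kwargs={
          "handle_parsing_errors": True,
          "return_intermediate_steps": True
        }
    )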

agente_dinamico = configurar_agente()

print("✓ Agente SQL inicializado correctamente")


✓ Agente SQL inicializado correctamente


### 7.2.1 Model version | Add-on

In [27]:
# Make the call
msg = llm.invoke("Responde solo: OK")

# Extract metadata
full_model_name = msg.response_metadata.get("model_name", "N/A")
usage = msg.response_metadata.get("token_usage", {})

# Split the model family from the version
# Example: 'gpt-4o-mini-2024-07-18' -> Model: GPT-4O-MINI | Version: 2024-07-18
parts = full_model_name.split('-')
if len(parts) > 3:
    modelo_base = "-".join(parts[:3]) # gpt-4o-mini
    version_tag = "-".join(parts[3:]) # 2024-07-18
else:
    modelo_base = full_model_name
    version_tag = "LATEST"

print("\n" + "‚Äî" * 50)
print(f"ü§ñ MODELO:  \033[1;36m{modelo_base.upper()}\033[0m")
print(f"üìå VERSI√ìN: \033[1;34m{version_tag}\033[0m")
print("‚Äî" * 50)
print(f"üìä TOKENS:   {usage.get('total_tokens', 0)}")
print("‚Äî" * 50 + "\n")



‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî
ü§ñ MODELO:  [1;36mGPT-4O-MINI[0m
üìå VERSI√ìN: [1;34m2024-07-18[0m
‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî
üìä TOKENS:   13
‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî
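
Splitting the model name by counting hyphens works for `gpt-4o-mini-2024-07-18`, but a name with a shorter family, such as `gpt-4o-2024-08-06`, would be cut in the wrong place. A possible alternative is to look for a trailing date instead. The sketch below assumes the version suffix always has the form `YYYY-MM-DD`; `split_model_name` is a hypothetical helper, and `"LATEST"` follows the notebook's own convention for undated names.

```python
import re

# Family name, a hyphen, then a trailing YYYY-MM-DD date (assumed suffix format)
_DATE_SUFFIX = re.compile(r"^(?P<base>.+?)-(?P<version>\d{4}-\d{2}-\d{2})$")

def split_model_name(full_name):
    """Return (base, version); version is 'LATEST' when no date suffix is present."""
    match = _DATE_SUFFIX.match(full_name)
    if match:
        return match.group("base"), match.group("version")
    return full_name, "LATEST"

for name in ["gpt-4o-mini-2024-07-18", "gpt-4o-2024-08-06", "gpt-4o-mini"]:
    print(name, "->", split_model_name(name))
```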



### 7.2.2 Table selector | Add-on
Implement a table selector using Structured Output so that only the relevant schema is sent to the OpenAI API, reducing operating costs.

In [28]:
from typing import List
from pydantic import BaseModel, Field

# --- DYNAMIC QUERY OPTIMIZER ---

# ==========================================
# ⚙️ SWITCH CONFIGURATION
# ==========================================
usar_filtro = False  # <--- True to save tokens, False for the full schema
# ==========================================

# Define the data structure the LLM must return
class TableSelection(BaseModel):
    """Schema for the selection of relevant tables."""
    tablas: List[str] = Field(
        description="Lista de nombres de tablas estrictamente necesarios para la consulta."
    )

# Function that filters the tables passed to the agent
def obtener_tablas_relevantes(query, db, llm):
    """
    Act as a 'smart filter' using Structured Output.
    Analyze the question and only let the agent see the tables it needs.
    """
    # 1. Get ONLY the table names (few tokens)
    nombres_all = db.get_usable_table_names()

    # 2. Prepare the LLM for structured output
    structured_llm = llm.with_structured_output(TableSelection)

    prompt = f"""
    Analiza la base de datos con estas tablas: {nombres_all}
    Pregunta del usuario: "{query}"

    INSTRUCCIONES:
    1. Identifica las tablas indispensables para responder.
    2. Si la pregunta es genérica, ambigua o no estás seguro, incluye TODAS las tablas de la lista.
    3. Responde únicamente con los nombres de las tablas que existan en la lista proporcionada.
    """

    try:
        # Structured-output call
        resultado = structured_llm.invoke(prompt)

        # Check that the tables really exist in the DB (hallucination filter)
        tablas_validas = [t for t in resultado.tablas if t in nombres_all]

        # Safety rule: if the model returns nothing valid, send all tables
        return tablas_validas if tablas_validas else nombres_all

    except Exception:
        # If anything fails (network, API, etc.), return all tables so the process does not stop
        return nombres_all

# --- MENSAJE DE NOTIFICACI√ìN ESTILIZADO ---
print("\n" + "üöÄ" + "‚Äî" * 60)
print("\033[1;32m‚úÖ OPTIMIZADOR DE CONSULTAS CARGADO\033[0m")
print("‚Äî" * 60)
print("Esta versi√≥n utiliza 'with_structured_output' para mayor precisi√≥n:")
print(f"üîπ \033[1mFiltro de tablas:\033[0m Menos ruido, menos errores de SQL.")
print(f"üîπ \033[1mSeguridad Fail-Safe:\033[0m Si hay duda o error, se cargan todas las tablas.")
print(f"üîπ \033[1mControl:\033[0m Usa \033[1;36musar_filtro = True\033[0m para activar el ahorro.")
print("‚Äî" * 60 + "\n")



üöÄ‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî
[1;32m‚úÖ OPTIMIZADOR DE CONSULTAS CARGADO[0m
‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî
Esta versi√≥n utiliza 'with_structured_output' para mayor precisi√≥n:
üîπ [1mFiltro de tablas:[0m Menos ruido, menos errores de SQL.
üîπ [1mSeguridad Fail-Safe:[0m Si hay duda o error, se cargan todas las tablas.
üîπ [1mControl:[0m Usa [1;36musar_filtro = True[0m para activar el ahorro.
‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî‚Äî
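
The fail-safe part of the selector (drop table names that do not exist, fall back to the full list when nothing valid remains) can be exercised without calling the API. The following is a minimal offline sketch of that rule only; `validate_tables` is a hypothetical name, and the duplicate removal is an addition that the notebook's function does not perform.

```python
def validate_tables(proposed, available):
    """Keep proposed names that exist in `available`; fall back to all when none survive."""
    available = list(available)
    # Hallucination filter: discard names that are not real tables
    valid = [t for t in proposed if t in available]
    # Remove duplicates while preserving order (extra step, not in the notebook)
    valid = list(dict.fromkeys(valid))
    # Safety rule: an empty selection means "use every table"
    return valid if valid else available

tables = ["dim_companies", "fact_fundamentals_q", "fact_prices_d"]
print(validate_tables(["fact_prices_d", "fact_orders"], tables))
print(validate_tables(["made_up_table"], tables))
print(validate_tables([], tables))
```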



### 7.3 Basic Connection Test
Validate the minimal operation of the SQL agent with a counting question, to confirm connection, execution, and response.


In [29]:
from langchain_community.callbacks.manager import get_openai_callback

pregunta_test = "¿Cuántas filas hay en la tabla fact_fundamentals_q?"

print(f"Pregunta: {pregunta_test}")
print("\n" + "="*80 + "\n")

with get_openai_callback() as cb:

    # 1. Selection phase (dynamic filter)
    if usar_filtro:
        print("🔍 [INFO] Aplicando filtro inteligente de tablas...")
        tablas_relevantes = obtener_tablas_relevantes(pregunta_test, db, llm)
        print(f"✅ Tablas seleccionadas: {tablas_relevantes}")
    else:
        print("⚠️ [INFO] Modo estándar (Sin filtro). Cargando base de datos completa...")
        tablas_relevantes = None

    # 2. Create the agent with configurar_agente (defined in 7.2)
    agent_executor_test = configurar_agente(tablas_relevantes)

    # 3. Run the agent
    respuesta = agent_executor_test.invoke({"input": pregunta_test})

# --- Model name/version extraction and formatting ---
full_model = llm.model_name
parts = full_model.split('-', 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# --- REPORTE FINAL ---
status_txt = "\033[1;32mOPTIMIZADO (ON)\033[0m" if usar_filtro else "\033[1;31mEST√ÅNDAR (OFF)\033[0m"

print(f"\nRespuesta: {respuesta['output']}")
print("\n" + "‚Äî" * 60)
print(f"‚öôÔ∏è  ESTADO FILTRO: {status_txt}")
print(f"ü§ñ MODELO:        \033[1;36m{modelo_base.upper()}\033[0m")
print(f"üìå VERSI√ìN:       \033[1;34m{version_tag.upper()}\033[0m")
print("‚Äî" * 60)
print(f"üìä TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"üí∞ COSTO:    \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("‚Äî" * 60 + "\n")


Pregunta: ¿Cuántas filas hay en la tabla fact_fundamentals_q?


⚠️ [INFO] Modo estándar (Sin filtro). Cargando base de datos completa...


[1m> Entering new SQL Agent Executor chain...[0m
[32;1m[1;3mAction: sql_db_list_tables  
Action Input: ""  [0m[38;5;200m[1;3mdim_companies, fact_fundamentals_q, fact_prices_d[0m[32;1m[1;3mI need to check the schema of the table `fact_fundamentals_q` to understand its structure and confirm that it exists.  
Action: sql_db_schema  
Action Input: "fact_fundamentals_q"  [0m[33;1m[1;3m
CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year BIGINT, 
	fiscal_period TEXT, 
	report_date DATETIME, 
	ticker TEXT, 
	currency TEXT, 
	total_current_assets FLOAT, 
	total_current_liabilities FLOAT, 
	short_term_debt FLOAT, 
	long_term_debt FLOAT, 
	total_equity FLOAT, 
	revenue_q FLOAT, 
	ebit_q FLOAT, 
	net_income_common_q FLOAT, 
	cfo_q FLOAT, 
	capex_proxy_q FLOAT, 
	fcf_q FLOAT, 
	total_debt_q FLOAT, 
	current_ratio_q FLOAT, 
	d

# 8. SQL Agent Tests
## 8.1 Test categories


1.   Simple Aggregation Queries: validate basic calculations (COUNT, AVG, SUM) and overall numerical consistency.
2.   Temporal Queries: evaluate trends, period-over-period comparisons, and year-over-year (YoY) metrics.
3.   Filtering and Comparison Queries: test multi-condition filters and comparisons between groups (tickers, sectors, buckets).
4.   Ranking and Sorting Queries: evaluate top/bottom N and sorting by financial or semantic metrics.
5.   Complex or Ambiguous Queries: measure behavior on open-ended questions, implicit criteria, and the need for clarification.
6.   Schema Awareness Queries: verify whether the agent understands which tables/fields exist, the time coverage, and the kinds of questions it can answer.
7.   Linguistic Variation Tests: check robustness when the same intent is rephrased in different language.
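
Each test in these categories pairs an agent answer with a direct SQL validation result, and the comparison is done by eye. One way to automate part of it is to check whether the validated figures appear in the agent's free-text answer. This is a rough sketch, not the notebook's method: `numbers_in_text` and `answer_contains` are hypothetical helpers, and the parsing assumes English-style number formatting (comma as thousands separator, period as decimal point).

```python
import re

def numbers_in_text(text):
    """Extract numeric values from free text, tolerating thousands separators."""
    found = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return [float(token.replace(",", "")) for token in found]

def answer_contains(text, expected, rel_tol=1e-6):
    """True when every expected value appears (within tolerance) among the numbers in text."""
    found = numbers_in_text(text)
    return all(
        any(abs(n - e) <= rel_tol * max(1.0, abs(e)) for n in found)
        for e in expected
    )

print(numbers_in_text("There are 52,228 rows in the table."))
print(answer_contains("MSFT: 40 observations, TSLA: 38.", [40, 38]))
print(answer_contains("MSFT: 40 observations", [40, 38]))
```

A check like this only confirms that the numbers are mentioned, not that they are attributed to the right ticker or industry, so it complements the manual review rather than replacing it.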

## 8.2 Simple aggregation queries
### Question 1


In [30]:
# Test 1 - Compact output style
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_1 = "¿Cuántas observaciones trimestrales hay en fundamentales de las empresas con ticker MSFT y TSLA?"

sql_validacion_1 = """
SELECT
    c.ticker,
    COUNT(*) AS observaciones_trimestrales
FROM fact_fundamentals_q f
JOIN dim_companies c
  ON c.simfin_id = f.simfin_id
WHERE UPPER(c.ticker) IN ('MSFT', 'TSLA')
GROUP BY c.ticker
ORDER BY c.ticker;
"""

if "db_path" not in globals():
    raise NameError("No existe db_path. Ejecuta 6.2/6.3 primero.")
if "configurar_agente" not in globals():
    raise NameError("No existe configurar_agente(...). Ejecuta 7.2 primero.")

# 1) Direct SQL validation
conn = sqlite3.connect(db_path)
resultado_sql_1 = pd.read_sql_query(sql_validacion_1, conn)
conn.close()

# 2) Call the agent through configurar_agente
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_1, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_1 = agent_executor_test.invoke({"input": pregunta_1})

# 3) Extract model/version
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZADO (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mESTÁNDAR (OFF)\033[0m"

print("\n" + "‚Äî" * 60)
print("üìù CONSULTA (USUARIO)")
print("‚Äî" * 60)
print(pregunta_1)

print("\n" + "‚Äî" * 60)
print("üìå QUERY VALIDACI√ìN SQL")
print("‚Äî" * 60)
print(sql_validacion_1.strip())

print("\n" + "‚Äî" * 60)
print("üßæ RESULTADO VALIDACI√ìN SQL")
print("‚Äî" * 60)
print(resultado_sql_1.to_string(index=False))

print("\n" + "‚Äî" * 60)
print("\033[1;32m‚úÖ RESPUESTA DEL AGENTE:\033[0m")
print("\033[1;96m" + str(respuesta_1["output"]) + "\033[0m")
print("‚Äî" * 60)
print(f"‚öôÔ∏è  ESTADO FILTRO: {status_txt}")
print(f"ü§ñ MODELO:        \033[1;36m{modelo_base.upper()}\033[0m")
print(f"üìå VERSI√ìN:       \033[1;34m{version_tag.upper()}\033[0m")
print("‚Äî" * 60)
print(f"üìä TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"üí∞ COSTO:    \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("‚Äî" * 60 + "\n")




[1m> Entering new SQL Agent Executor chain...[0m
[32;1m[1;3mAction: sql_db_list_tables  
Action Input: ""  [0m[38;5;200m[1;3mdim_companies, fact_fundamentals_q, fact_prices_d[0m[32;1m[1;3mI need to check the schema of the `fact_fundamentals_q` table, as it likely contains the quarterly observations for companies. I will also check the `dim_companies` table to find the relevant company IDs for MSFT and TSLA.  
Action: sql_db_schema  
Action Input: "dim_companies,fact_fundamentals_q"  [0m[33;1m[1;3m
CREATE TABLE dim_companies (
	simfin_id BIGINT, 
	ticker TEXT, 
	company_name TEXT, 
	market TEXT, 
	main_currency TEXT, 
	industry_id BIGINT, 
	industry TEXT, 
	sector TEXT, 
	has_industry BIGINT
)

/*
3 rows from dim_companies table:
simfin_id	ticker	company_name	market	main_currency	industry_id	industry	sector	has_industry
45846	A	AGILENT TECHNOLOGIES INC	us	USD	106001	Medical Diagnostics & Research	Healthcare	1
1333027	A21	Li Auto Inc.	us	USD	None	None	None	0
367153	AA	Alco

### Question 2 - Simple Aggregation Query (reduced version)

**Query (user):**  
For Nvidia stock, what is the average number of shares outstanding?





In [31]:
# Test 2 - Compact output style (reduced version)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_2 = "Para la acción de Nvidia, ¿cuál es el promedio de acciones en circulación?"

sql_validacion_2 = """
SELECT
    UPPER(ticker) AS ticker,
    AVG(shares_outstanding) AS promedio_acciones_circulacion
FROM fact_prices_d
WHERE UPPER(ticker) = 'NVDA'
GROUP BY UPPER(ticker);
"""

if "db_path" not in globals():
    raise NameError("No existe db_path. Ejecuta 6.2/6.3 primero.")
if "configurar_agente" not in globals():
    raise NameError("No existe configurar_agente(...). Ejecuta 7.2 primero.")

# 1) Direct SQL validation
conn = sqlite3.connect(db_path)
resultado_sql_2 = pd.read_sql_query(sql_validacion_2, conn)
conn.close()

# 2) Call the agent through configurar_agente
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_2, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_2 = agent_executor_test.invoke({"input": pregunta_2})

# 3) Extract model/version
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZADO (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mESTÁNDAR (OFF)\033[0m"

print("\n" + "‚Äî" * 60)
print("üìù CONSULTA (USUARIO)")
print("‚Äî" * 60)
print(pregunta_2)

print("\n" + "‚Äî" * 60)
print("üìå QUERY VALIDACI√ìN SQL")
print("‚Äî" * 60)
print(sql_validacion_2.strip())

print("\n" + "‚Äî" * 60)
print("üßæ RESULTADO VALIDACI√ìN SQL")
print("‚Äî" * 60)
print(resultado_sql_2.to_string(index=False))

print("\n" + "‚Äî" * 60)
print("\033[1;32m‚úÖ RESPUESTA DEL AGENTE:\033[0m")
print("\033[1;96m" + str(respuesta_2["output"]) + "\033[0m")
print("‚Äî" * 60)
print(f"‚öôÔ∏è  ESTADO FILTRO: {status_txt}")
print(f"ü§ñ MODELO:        \033[1;36m{modelo_base.upper()}\033[0m")
print(f"üìå VERSI√ìN:       \033[1;34m{version_tag.upper()}\033[0m")
print("‚Äî" * 60)
print(f"üìä TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"üí∞ COSTO:    \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("‚Äî" * 60 + "\n")




[1m> Entering new SQL Agent Executor chain...[0m
[32;1m[1;3mAction: sql_db_list_tables  
Action Input: ""  [0m[38;5;200m[1;3mdim_companies, fact_fundamentals_q, fact_prices_d[0m[32;1m[1;3mI should check the schema of the relevant tables to find out where the information about Nvidia and its shares outstanding might be stored. The "fact_fundamentals_q" table seems like a good candidate for financial metrics, including shares outstanding. 

Action: sql_db_schema  
Action Input: fact_fundamentals_q  [0m[33;1m[1;3m
CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year BIGINT, 
	fiscal_period TEXT, 
	report_date DATETIME, 
	ticker TEXT, 
	currency TEXT, 
	total_current_assets FLOAT, 
	total_current_liabilities FLOAT, 
	short_term_debt FLOAT, 
	long_term_debt FLOAT, 
	total_equity FLOAT, 
	revenue_q FLOAT, 
	ebit_q FLOAT, 
	net_income_common_q FLOAT, 
	cfo_q FLOAT, 
	capex_proxy_q FLOAT, 
	fcf_q FLOAT, 
	total_debt_q FLOAT, 
	current_ratio_q FLOAT, 
	debt_to_equi

### Question 3

**Query (user):**  
For the Banks, Biotechnology, and Application Software industries, how many companies are there per industry?


In [32]:
# Test 3 - Compact output style
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_3 = "Para las industrias Banks, Biotechnology y Application Software, ¿cuántas empresas hay por industria?"

sql_validacion_3 = """
SELECT
    industry,
    COUNT(DISTINCT simfin_id) AS total_empresas
FROM dim_companies
WHERE industry IN ('Banks', 'Biotechnology', 'Application Software')
GROUP BY industry
ORDER BY industry;
"""

if "db_path" not in globals():
    raise NameError("No existe db_path. Ejecuta 6.2/6.3 primero.")
if "configurar_agente" not in globals():
    raise NameError("No existe configurar_agente(...). Ejecuta 7.2 primero.")

# 1) Direct SQL validation
conn = sqlite3.connect(db_path)
resultado_sql_3 = pd.read_sql_query(sql_validacion_3, conn)
conn.close()

# 2) Call the agent through configurar_agente
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_3, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_3 = agent_executor_test.invoke({"input": pregunta_3})

# 3) Extract model/version
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZADO (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mESTÁNDAR (OFF)\033[0m"

print("\n" + "‚Äî" * 60)
print("üìù CONSULTA (USUARIO)")
print("‚Äî" * 60)
print(pregunta_3)

print("\n" + "‚Äî" * 60)
print("üìå QUERY VALIDACI√ìN SQL")
print("‚Äî" * 60)
print(sql_validacion_3.strip())

print("\n" + "‚Äî" * 60)
print("üßæ RESULTADO VALIDACI√ìN SQL")
print("‚Äî" * 60)
print(resultado_sql_3.to_string(index=False))

print("\n" + "‚Äî" * 60)
print("\033[1;32m‚úÖ RESPUESTA DEL AGENTE:\033[0m")
print("\033[1;96m" + str(respuesta_3["output"]) + "\033[0m")
print("‚Äî" * 60)
print(f"‚öôÔ∏è  ESTADO FILTRO: {status_txt}")
print(f"ü§ñ MODELO:        \033[1;36m{modelo_base.upper()}\033[0m")
print(f"üìå VERSI√ìN:       \033[1;34m{version_tag.upper()}\033[0m")
print("‚Äî" * 60)
print(f"üìä TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"üí∞ COSTO:    \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("‚Äî" * 60 + "\n")




[1m> Entering new SQL Agent Executor chain...[0m
[32;1m[1;3mAction: sql_db_list_tables  
Action Input: ""  [0m[38;5;200m[1;3mdim_companies, fact_fundamentals_q, fact_prices_d[0m[32;1m[1;3mI need to check the schema of the `dim_companies` table, as it likely contains information about the companies and their respective industries.  
Action: sql_db_schema  
Action Input: "dim_companies"  [0m[33;1m[1;3m
CREATE TABLE dim_companies (
	simfin_id BIGINT, 
	ticker TEXT, 
	company_name TEXT, 
	market TEXT, 
	main_currency TEXT, 
	industry_id BIGINT, 
	industry TEXT, 
	sector TEXT, 
	has_industry BIGINT
)

/*
3 rows from dim_companies table:
simfin_id	ticker	company_name	market	main_currency	industry_id	industry	sector	has_industry
45846	A	AGILENT TECHNOLOGIES INC	us	USD	106001	Medical Diagnostics & Research	Healthcare	1
1333027	A21	Li Auto Inc.	us	USD	None	None	None	0
367153	AA	Alcoa Corp	us	USD	110004	Metals & Mining	Basic Materials	1
*/
I can see that the `dim_c

## 8.3 Time-Based Questions

### Question 4

**Query (user):**  
How has average liquidity in the Healthcare sector evolved year by year?
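
A small sketch of the yearly aggregation behind this question, on invented rows. It also illustrates a standard SQL detail: `AVG` skips NULL values on its own, so an explicit `IS NOT NULL` filter changes `COUNT(*)` but not the average. The two toy tables only mimic the columns used here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_companies (simfin_id INTEGER, sector TEXT);
CREATE TABLE fact_fundamentals_q (
    simfin_id INTEGER, fiscal_year INTEGER, current_ratio_q REAL
);
-- Invented rows, for illustration only.
INSERT INTO dim_companies VALUES (1, 'Healthcare'), (2, 'Healthcare'), (3, 'Technology');
INSERT INTO fact_fundamentals_q VALUES
    (1, 2023, 2.0), (2, 2023, 4.0), (3, 2023, 9.0),
    (1, 2024, 1.0), (2, 2024, NULL), (3, 2024, 9.0);
""")

# Per year: average ratio, rows with a value, and all rows in the sector.
yearly = conn.execute("""
    SELECT f.fiscal_year,
           AVG(f.current_ratio_q),
           COUNT(f.current_ratio_q),
           COUNT(*)
    FROM fact_fundamentals_q f
    JOIN dim_companies c ON c.simfin_id = f.simfin_id
    WHERE c.sector = 'Healthcare'
    GROUP BY f.fiscal_year
    ORDER BY f.fiscal_year
""").fetchall()
conn.close()

print(yearly)
```

Reporting the number of non-null observations next to each average makes it easier to judge how representative a given year is.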



In [33]:
# Test 4 - time-based query (yearly evolution of liquidity in Healthcare)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_4 = "How has average liquidity in the Healthcare sector evolved year by year?"

sql_validacion_4 = """
SELECT
    f.fiscal_year,
    AVG(f.current_ratio_q) AS promedio_liquidez
FROM fact_fundamentals_q f
JOIN dim_companies c
  ON c.simfin_id = f.simfin_id
WHERE c.sector = 'Healthcare'
  AND f.current_ratio_q IS NOT NULL
GROUP BY f.fiscal_year
ORDER BY f.fiscal_year;
"""

if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run 7.2 first.")

# 1) Direct SQL validation
conn = sqlite3.connect(db_path)
resultado_sql_4 = pd.read_sql_query(sql_validacion_4, conn)
conn.close()

# 2) Call the agent through the integrated helper
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_4, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_4 = agent_executor_test.invoke({"input": pregunta_4})

# 3) Extract model name and version tag
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 60)
print("📝 QUERY (USER)")
print("-" * 60)
print(pregunta_4)

print("\n" + "-" * 60)
print("📌 SQL VALIDATION QUERY")
print("-" * 60)
print(sql_validacion_4.strip())

print("\n" + "-" * 60)
print("🧾 SQL VALIDATION RESULT")
print("-" * 60)
print(resultado_sql_4.to_string(index=False))

print("\n" + "-" * 60)
print("\033[1;32m✅ AGENT RESPONSE:\033[0m")
print("\033[1;96m" + str(respuesta_4["output"]) + "\033[0m")
print("-" * 60)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 60)
print(f"📊 TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"💰 COST:     \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("-" * 60 + "\n")


> Entering new SQL Agent Executor chain...
Action: sql_db_list_tables
Action Input: ""
dim_companies, fact_fundamentals_q, fact_prices_d
I need to check the schema of the relevant tables to find out where the liquidity data is stored, particularly in the context of the Healthcare sector. The `fact_fundamentals_q` table likely contains financial metrics, including liquidity. I'll check its schema.

Action: sql_db_schema
Action Input: "fact_fundamentals_q"
CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year BIGINT, 
	fiscal_period TEXT, 
	report_date DATETIME, 
	ticker TEXT, 
	currency TEXT, 
	total_current_assets FLOAT, 
	total_current_liabilities FLOAT, 
	short_term_debt FLOAT, 
	long_term_debt FLOAT, 
	total_equity FLOAT, 
	revenue_q FLOAT, 
	ebit_q FLOAT, 
	net_income_common_q FLOAT, 
	cfo_q FLOAT, 
	capex_proxy_q FLOAT, 
	fcf_q FLOAT, 
	total_debt_q FLOAT, 
	current_ratio_q FLOAT, 
	

### Question 5 (adjusted)

**Query (user):**  
For the stock MSFT, what was the percentage increase in quarterly revenue between Q4 2023 and Q4 2024?
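
The percentage increase asked for here is `(new - old) / old * 100`. A minimal sketch of that arithmetic follows, with the same guard against a zero or missing base that a SQL `CASE` expression would need. The helper name `pct_increase` and the sample figures are hypothetical.

```python
from typing import Optional


def pct_increase(old: Optional[float], new: Optional[float]) -> Optional[float]:
    """Percentage change from `old` to `new`; None when it is undefined."""
    if old is None or new is None or old == 0:
        return None  # mirrors a NULL result in SQL
    return (new - old) / old * 100.0


# Invented figures, not MSFT's reported revenue.
print(pct_increase(200.0, 250.0))
print(pct_increase(0.0, 250.0))
```

Returning `None` instead of raising keeps the helper usable inside a column-wise transformation where some periods are missing.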


In [34]:
# Test 5 - version scoped to a single company (MSFT)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_5 = "For the stock MSFT, what was the percentage increase in quarterly revenue between Q4 2023 and Q4 2024?"

sql_validacion_5 = """
WITH base AS (
  SELECT
    UPPER(ticker) AS ticker,
    fiscal_year,
    fiscal_period,
    report_date,
    revenue_q,
    ROW_NUMBER() OVER (
      PARTITION BY UPPER(ticker), fiscal_year, fiscal_period
      ORDER BY report_date DESC
    ) AS rn
  FROM fact_fundamentals_q
  WHERE UPPER(ticker) = 'MSFT'
    AND fiscal_period = 'Q4'
    AND fiscal_year IN (2023, 2024)
    AND revenue_q IS NOT NULL
),
clean AS (
  SELECT ticker, fiscal_year, revenue_q
  FROM base
  WHERE rn = 1
),
pivoted AS (
  SELECT
    ticker,
    MAX(CASE WHEN fiscal_year = 2023 THEN revenue_q END) AS revenue_q4_2023,
    MAX(CASE WHEN fiscal_year = 2024 THEN revenue_q END) AS revenue_q4_2024
  FROM clean
  GROUP BY ticker
)
SELECT
  ticker,
  revenue_q4_2023,
  revenue_q4_2024,
  (revenue_q4_2024 - revenue_q4_2023) AS incremento_abs,
  CASE
    WHEN revenue_q4_2023 <> 0
    THEN ((revenue_q4_2024 - revenue_q4_2023) / revenue_q4_2023) * 100.0
    ELSE NULL
  END AS incremento_pct
FROM pivoted;
"""

if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run 7.2 first.")

conn = sqlite3.connect(db_path)
resultado_sql_5 = pd.read_sql_query(sql_validacion_5, conn)
conn.close()

with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_5, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_5 = agent_executor_test.invoke({"input": pregunta_5})

full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 60)
print("📝 QUERY (USER)")
print("-" * 60)
print(pregunta_5)

print("\n" + "-" * 60)
print("📌 SQL VALIDATION QUERY")
print("-" * 60)
print(sql_validacion_5.strip())

print("\n" + "-" * 60)
print("🧾 SQL VALIDATION RESULT")
print("-" * 60)
print(resultado_sql_5.to_string(index=False))

print("\n" + "-" * 60)
print("\033[1;32m✅ AGENT RESPONSE:\033[0m")
print("\033[1;96m" + str(respuesta_5["output"]) + "\033[0m")
print("-" * 60)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 60)
print(f"📊 TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"💰 COST:     \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("-" * 60 + "\n")


> Entering new SQL Agent Executor chain...
Action: sql_db_list_tables
Action Input: ""
dim_companies, fact_fundamentals_q, fact_prices_d
I need to check the schema of the relevant tables to find the necessary columns for the quarterly revenue data. The `fact_fundamentals_q` table seems to be the most relevant for quarterly financial data. I will check its schema.
Action: sql_db_schema
Action Input: "fact_fundamentals_q"
CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year BIGINT, 
	fiscal_period TEXT, 
	report_date DATETIME, 
	ticker TEXT, 
	currency TEXT, 
	total_current_assets FLOAT, 
	total_current_liabilities FLOAT, 
	short_term_debt FLOAT, 
	long_term_debt FLOAT, 
	total_equity FLOAT, 
	revenue_q FLOAT, 
	ebit_q FLOAT, 
	net_income_common_q FLOAT, 
	cfo_q FLOAT, 
	capex_proxy_q FLOAT, 
	fcf_q FLOAT, 
	total_debt_q FLOAT, 
	current_ratio_q FLOAT, 
	debt_to_equity_q FLOAT, 
	fcf_marg

## 8.4 Filtering and Comparison Queries

### Question 6

**Query (user):**  
In the Healthcare sector, which tickers have liquidity = 'baja' (low) and leverage = 'alto' (high) in the latest available year?
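
The filter described in this question can be sketched in plain Python on invented rows: find the latest fiscal year, keep Healthcare rows whose buckets match case-insensitively, and count matching quarters per ticker. The tickers and bucket values below are made up; only the bucket labels `'baja'` and `'alto'` come from the question.

```python
from collections import Counter

# (ticker, sector, fiscal_year, liquidity_bucket_q, leverage_bucket_q) - invented rows
rows = [
    ("AAA", "Healthcare", 2024, "Baja", "Alto"),
    ("AAA", "Healthcare", 2024, "baja", "alto"),
    ("BBB", "Healthcare", 2024, "alta", "alto"),   # liquidity does not match
    ("CCC", "Healthcare", 2023, "baja", "alto"),   # not the latest year
    ("DDD", "Technology", 2024, "baja", "alto"),   # other sector
]

# Latest year over the whole table, not only over Healthcare rows.
latest_year = max(year for _, _, year, _, _ in rows)

quarters_matching = Counter(
    ticker
    for ticker, sector, year, liquidity, leverage in rows
    if sector == "Healthcare"
    and year == latest_year
    and liquidity.lower() == "baja"
    and leverage.lower() == "alto"
)

print(latest_year, dict(quarters_matching))
```

Whether "latest available year" should be computed over the whole table or only over the sector is a judgment call; the sketch uses the whole table.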


In [35]:
# Test 6 - filtering and comparison (multiple conditions in the latest year)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_6 = (
    "In the Healthcare sector, which tickers have liquidity = 'baja' and "
    "leverage = 'alto' in the latest available year?"
)

sql_validacion_6 = """
WITH ultimo_anio AS (
    SELECT MAX(fiscal_year) AS anio
    FROM fact_fundamentals_q
)
SELECT
    f.fiscal_year,
    f.ticker,
    COUNT(*) AS trimestres_cumplen_condicion
FROM fact_fundamentals_q f
JOIN dim_companies c
  ON c.simfin_id = f.simfin_id
JOIN ultimo_anio u
  ON f.fiscal_year = u.anio
WHERE c.sector = 'Healthcare'
  AND LOWER(f.liquidity_bucket_q) = 'baja'
  AND LOWER(f.leverage_bucket_q) = 'alto'
GROUP BY f.fiscal_year, f.ticker
ORDER BY trimestres_cumplen_condicion DESC, f.ticker;
"""

if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run 7.2 first.")

# 1) Direct SQL validation
conn = sqlite3.connect(db_path)
resultado_sql_6 = pd.read_sql_query(sql_validacion_6, conn)
conn.close()

# 2) Call the agent through the integrated helper
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_6, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_6 = agent_executor_test.invoke({"input": pregunta_6})

# 3) Extract model name and version tag
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 60)
print("📝 QUERY (USER)")
print("-" * 60)
print(pregunta_6)

print("\n" + "-" * 60)
print("📌 SQL VALIDATION QUERY")
print("-" * 60)
print(sql_validacion_6.strip())

print("\n" + "-" * 60)
print("🧾 SQL VALIDATION RESULT")
print("-" * 60)
if len(resultado_sql_6) == 0:
    print("No results for this condition in the latest available year.")
else:
    print(resultado_sql_6.to_string(index=False))
    print(f"\nTotal tickers found: {resultado_sql_6['ticker'].nunique()}")

print("\n" + "-" * 60)
print("\033[1;32m✅ AGENT RESPONSE:\033[0m")
print("\033[1;96m" + str(respuesta_6["output"]) + "\033[0m")
print("-" * 60)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 60)
print(f"📊 TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"💰 COST:     \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("-" * 60 + "\n")


> Entering new SQL Agent Executor chain...
Action: sql_db_list_tables
Action Input: ""
dim_companies, fact_fundamentals_q, fact_prices_d
I need to check the schema of the relevant tables to find the necessary columns for the query. The `dim_companies` table likely contains information about the companies, while the `fact_fundamentals_q` table may contain liquidity and leverage data. I'll check the schema of both tables to find the relevant columns.
Action: sql_db_schema
Action Input: "dim_companies, fact_fundamentals_q"
CREATE TABLE dim_companies (
	simfin_id BIGINT, 
	ticker TEXT, 
	company_name TEXT, 
	market TEXT, 
	main_currency TEXT, 
	industry_id BIGINT, 
	industry TEXT, 
	sector TEXT, 
	has_industry BIGINT
)

/*
3 rows from dim_companies table:
simfin_id	ticker	company_name	market	main_currency	industry_id	industry	sector	has_industry
45846	A	AGILENT TECHNOLOGIES INC	us	USD	106001	Medical Diagn

### Question 7

**Query (user):**  
In 2024, among the tickers INO, ELYM, and LRMR, which had the highest average leverage and which had the highest average liquidity?
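
A comparison like this one can break when a ticker has no value for one of the metrics. The sketch below, on invented quarterly figures, averages only the values that are present and excludes tickers with no data from the "highest" pick. The helper names `avg` and `top` are hypothetical.

```python
from statistics import mean

# ticker -> list of (debt_to_equity, current_ratio) per quarter; invented values
quarters = {
    "INO": [(0.2, 3.0), (0.4, 5.0)],
    "ELYM": [(None, 8.0), (None, 10.0)],  # leverage missing in every quarter
    "LRMR": [(0.1, 4.0)],
}


def avg(values):
    """Mean of the non-missing values, or None if nothing is available."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


summary = {
    ticker: (avg([q[0] for q in qs]), avg([q[1] for q in qs]))
    for ticker, qs in quarters.items()
}


def top(metric_index):
    """Ticker with the highest average for the metric, ignoring missing ones."""
    candidates = {t: m[metric_index] for t, m in summary.items() if m[metric_index] is not None}
    return max(candidates, key=candidates.get)


print("Highest leverage:", top(0))
print("Highest liquidity:", top(1))
```

Dropping missing averages before taking the maximum avoids reporting a ticker as "highest" only because its value could not be computed.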


In [36]:
# Test 7 - filtering and comparison (companies, same year)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_7 = (
    "In 2024, among the tickers INO, ELYM, and LRMR, which had the highest "
    "average leverage and which had the highest average liquidity?"
)

sql_validacion_7 = """
WITH base AS (
    SELECT
        UPPER(ticker) AS ticker,
        AVG(debt_to_equity_q) AS apalancamiento_promedio,
        AVG(current_ratio_q) AS liquidez_promedio
    FROM fact_fundamentals_q
    WHERE fiscal_year = 2024
      AND UPPER(ticker) IN ('INO', 'ELYM', 'LRMR')
    GROUP BY UPPER(ticker)
),
ranked AS (
    SELECT
        ticker,
        apalancamiento_promedio,
        liquidez_promedio,
        RANK() OVER (ORDER BY apalancamiento_promedio DESC) AS rk_apalancamiento,
        RANK() OVER (ORDER BY liquidez_promedio DESC) AS rk_liquidez
    FROM base
)
SELECT
    ticker,
    apalancamiento_promedio,
    liquidez_promedio,
    rk_apalancamiento,
    rk_liquidez
FROM ranked
ORDER BY ticker;
"""

if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run 7.2 first.")

# 1) Direct SQL validation
conn = sqlite3.connect(db_path)
resultado_sql_7 = pd.read_sql_query(sql_validacion_7, conn)
conn.close()

# 2) Call the agent through the integrated helper
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_7, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_7 = agent_executor_test.invoke({"input": pregunta_7})

# 3) Extract model name and version tag
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 60)
print("📝 QUERY (USER)")
print("-" * 60)
print(pregunta_7)

print("\n" + "-" * 60)
print("📌 SQL VALIDATION QUERY")
print("-" * 60)
print(sql_validacion_7.strip())

print("\n" + "-" * 60)
print("🧾 SQL VALIDATION RESULT")
print("-" * 60)
if len(resultado_sql_7) == 0:
    print("No results for the given tickers/period.")
else:
    print(resultado_sql_7.to_string(index=False))

    # Sort treating None/NaN as missing (they go last)
    top_ap = resultado_sql_7.sort_values("apalancamiento_promedio", ascending=False, na_position="last").head(1)
    top_liq = resultado_sql_7.sort_values("liquidez_promedio", ascending=False, na_position="last").head(1)

    def fmt_num(x):
        return "NA" if pd.isna(x) else f"{float(x):.6f}"

    print("\nDirect SQL summary:")
    print(f"- Highest average leverage: {top_ap.iloc[0]['ticker']} ({fmt_num(top_ap.iloc[0]['apalancamiento_promedio'])})")
    print(f"- Highest average liquidity: {top_liq.iloc[0]['ticker']} ({fmt_num(top_liq.iloc[0]['liquidez_promedio'])})")

print("\n" + "-" * 60)
print("\033[1;32m✅ AGENT RESPONSE:\033[0m")
print("\033[1;96m" + str(respuesta_7["output"]) + "\033[0m")
print("-" * 60)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 60)
print(f"📊 TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"💰 COST:     \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("-" * 60 + "\n")


> Entering new SQL Agent Executor chain...
Action: sql_db_list_tables
Action Input: ""
dim_companies, fact_fundamentals_q, fact_prices_d
I need to check the schema of the relevant tables to find out where the leverage and liquidity data is stored. The `fact_fundamentals_q` table likely contains the leverage and liquidity information, while the `fact_prices_d` table may contain ticker information. I'll check the schema of both tables.
Action: sql_db_schema
Action Input: fact_fundamentals_q, fact_prices_d
CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year BIGINT, 
	fiscal_period TEXT, 
	report_date DATETIME, 
	ticker TEXT, 
	currency TEXT, 
	total_current_assets FLOAT, 
	total_current_liabilities FLOAT, 
	short_term_debt FLOAT, 
	long_term_debt FLOAT, 
	total_equity FLOAT, 
	revenue_q FLOAT, 
	ebit_q FLOAT, 
	net_income_common_q FLOAT, 
	cfo_q FLOAT, 
	capex_proxy_q FLOAT, 
	fcf_q FLOAT

## 8.5 Ranking and Ordering Queries

### Question 8

**Query (user):**  
In 2024, which 5 tickers had the highest average quarterly free cash flow, considering only companies with at least 3 quarters of available data?
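
The ranking rule in this question (average per ticker, minimum of 3 quarters with data, top 5) can be sketched in plain Python on invented figures. Ties are broken alphabetically by ticker, which is an assumption of this sketch.

```python
from collections import defaultdict

# (ticker, fcf_q) for fiscal year 2024; invented values, None = missing quarter
fcf_2024 = [
    ("AAA", 10.0), ("AAA", 20.0), ("AAA", 30.0),
    ("BBB", 100.0), ("BBB", 200.0),                 # only 2 quarters: excluded
    ("CCC", 5.0), ("CCC", 5.0), ("CCC", 5.0), ("CCC", None),
]

values_by_ticker = defaultdict(list)
for ticker, fcf in fcf_2024:
    if fcf is not None:  # missing quarters do not count toward the minimum
        values_by_ticker[ticker].append(fcf)

ranking = sorted(
    (
        (ticker, len(vals), sum(vals) / len(vals))
        for ticker, vals in values_by_ticker.items()
        if len(vals) >= 3
    ),
    key=lambda row: (-row[2], row[0]),  # highest average first, then ticker
)[:5]

print(ranking)
```

The minimum-quarters rule matters: without it, BBB would top the list on the strength of just two observations.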


In [37]:
# Test 8 - ranking and ordering query (top 5 by average FCF in 2024)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_8 = (
    "In 2024, which 5 tickers had the highest average quarterly free cash flow, "
    "considering only companies with at least 3 quarters of available data?"
)

sql_validacion_8 = """
WITH base AS (
    SELECT
        UPPER(ticker) AS ticker,
        COUNT(fcf_q) AS trimestres_con_dato,
        AVG(fcf_q) AS fcf_promedio_q
    FROM fact_fundamentals_q
    WHERE fiscal_year = 2024
      AND fcf_q IS NOT NULL
      AND ticker IS NOT NULL
    GROUP BY UPPER(ticker)
)
SELECT
    ticker,
    trimestres_con_dato,
    fcf_promedio_q
FROM base
WHERE trimestres_con_dato >= 3
ORDER BY fcf_promedio_q DESC, ticker
LIMIT 5;
"""

if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run 7.2 first.")

# 1) Direct SQL validation
conn = sqlite3.connect(db_path)
resultado_sql_8 = pd.read_sql_query(sql_validacion_8, conn)
conn.close()

# 2) Call the agent through the integrated helper
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_8, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_8 = agent_executor_test.invoke({"input": pregunta_8})

# 3) Extract model name and version tag
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 60)
print("📝 QUERY (USER)")
print("-" * 60)
print(pregunta_8)

print("\n" + "-" * 60)
print("📌 SQL VALIDATION QUERY")
print("-" * 60)
print(sql_validacion_8.strip())

print("\n" + "-" * 60)
print("🧾 SQL VALIDATION RESULT")
print("-" * 60)
if len(resultado_sql_8) == 0:
    print("No results for the given condition.")
else:
    print(resultado_sql_8.to_string(index=False))

print("\n" + "-" * 60)
print("\033[1;32m✅ AGENT RESPONSE:\033[0m")
print("\033[1;96m" + str(respuesta_8["output"]) + "\033[0m")
print("-" * 60)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 60)
print(f"📊 TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"💰 COST:     \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("-" * 60 + "\n")


> Entering new SQL Agent Executor chain...
Action: sql_db_list_tables
Action Input: ""
dim_companies, fact_fundamentals_q, fact_prices_d
I need to check the schema of the tables to find relevant columns for companies and their cash flow data. The `fact_fundamentals_q` table likely contains the cash flow information, while `dim_companies` will provide the tickers. I'll check the schema of both tables.
Action: sql_db_schema
Action Input: "dim_companies, fact_fundamentals_q"
CREATE TABLE dim_companies (
	simfin_id BIGINT, 
	ticker TEXT, 
	company_name TEXT, 
	market TEXT, 
	main_currency TEXT, 
	industry_id BIGINT, 
	industry TEXT, 
	sector TEXT, 
	has_industry BIGINT
)

/*
3 rows from dim_companies table:
simfin_id	ticker	company_name	market	main_currency	industry_id	industry	sector	has_industry
45846	A	AGILENT TECHNOLOGIES INC	us	USD	106001	Medical Diagnostics & Research	Healthcare	1
1333027	A21	Li Aut

### Question 9

**Query (user):**  
For the years 2020, 2021, 2022, 2023, and 2024, which 3 tickers had the largest market capitalization at the close of each year?
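
"Close of each year" needs a concrete rule. One possible reading, sketched below on invented rows, takes each ticker's last available trading date within the year and ranks on that day's market cap. Under this rule a ticker whose data stops in November still competes with its last known value, which may or may not be what the user intends.

```python
# (ticker, ISO date, market_cap_d); invented rows for illustration
prices = [
    ("AAA", "2023-12-28", 90.0), ("AAA", "2023-12-29", 100.0),
    ("BBB", "2023-12-29", 80.0),
    ("CCC", "2023-11-15", 500.0),  # data ends in November
    ("DDD", "2023-12-29", 70.0),
]

# Keep the latest row per (year, ticker); ISO dates sort correctly as strings.
last_row = {}
for ticker, date, cap in prices:
    key = (date[:4], ticker)
    if key not in last_row or date > last_row[key][0]:
        last_row[key] = (date, cap)


def top_n(year, n=3):
    """Top-n tickers by market cap on each ticker's last date of `year`."""
    caps = [(ticker, cap) for (y, ticker), (_, cap) in last_row.items() if y == year]
    return sorted(caps, key=lambda tc: (-tc[1], tc[0]))[:n]


print(top_n("2023"))
```

A stricter alternative would require the last date to fall on the market's final trading day of the year, which would exclude CCC here.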


In [38]:
# Test 9 - ranking and ordering (top 3 market cap at year-end, per year)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_9 = (
    "For the years 2020, 2021, 2022, 2023, and 2024, which 3 tickers "
    "had the largest market capitalization at the close of each year?"
)

sql_validacion_9 = """
WITH cierre_ticker_anio AS (
    SELECT
        UPPER(ticker) AS ticker,
        CAST(strftime('%Y', date) AS INTEGER) AS anio,
        MAX(date) AS fecha_cierre_ticker
    FROM fact_prices_d
    WHERE market_cap_d IS NOT NULL
      AND ticker IS NOT NULL
      AND CAST(strftime('%Y', date) AS INTEGER) IN (2020, 2021, 2022, 2023, 2024)
    GROUP BY UPPER(ticker), CAST(strftime('%Y', date) AS INTEGER)
),
mc_cierre AS (
    SELECT
        c.anio,
        c.ticker,
        p.market_cap_d
    FROM cierre_ticker_anio c
    JOIN fact_prices_d p
      ON UPPER(p.ticker) = c.ticker
     AND p.date = c.fecha_cierre_ticker
),
ranking AS (
    SELECT
        anio,
        ticker,
        market_cap_d,
        ROW_NUMBER() OVER (PARTITION BY anio ORDER BY market_cap_d DESC, ticker) AS rk
    FROM mc_cierre
)
SELECT
    anio,
    rk,
    ticker,
    market_cap_d
FROM ranking
WHERE rk <= 3
ORDER BY anio, rk;
"""

if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run 7.2 first.")

# 1) Direct SQL validation
conn = sqlite3.connect(db_path)
resultado_sql_9 = pd.read_sql_query(sql_validacion_9, conn)
conn.close()

# 2) Call the agent through the integrated helper
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_9, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_9 = agent_executor_test.invoke({"input": pregunta_9})

# 3) Extract model name and version tag
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Ordered output (compact = clear sections and a final table without extra noise)
status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 60)
print("📝 QUERY (USER)")
print("-" * 60)
print(pregunta_9)

print("\n" + "-" * 60)
print("📌 SQL VALIDATION QUERY")
print("-" * 60)
print(sql_validacion_9.strip())

print("\n" + "-" * 60)
print("🧾 SQL VALIDATION RESULT")
print("-" * 60)
if len(resultado_sql_9) == 0:
    print("No results for the requested years.")
else:
    print(resultado_sql_9.to_string(index=False))

print("\n" + "-" * 60)
print("\033[1;32m✅ AGENT RESPONSE:\033[0m")
print("\033[1;96m" + str(respuesta_9["output"]) + "\033[0m")
print("-" * 60)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 60)
print(f"📊 TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"💰 COST:     \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("-" * 60 + "\n")


> Entering new SQL Agent Executor chain...
Action: sql_db_list_tables
Action Input: ""
dim_companies, fact_fundamentals_q, fact_prices_d
I need to check the schema of the relevant tables to find the necessary columns for my query. The `fact_fundamentals_q` table likely contains information about market capitalization, while the `fact_prices_d` table may have the ticker symbols and dates. I'll start by checking the schema of both tables.

Action: sql_db_schema
Action Input: fact_fundamentals_q, fact_prices_d
CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year BIGINT, 
	fiscal_period TEXT, 
	report_date DATETIME, 
	ticker TEXT, 
	currency TEXT, 
	total_current_assets FLOAT, 
	total_current_liabilities FLOAT, 
	short_term_debt FLOAT, 
	long_term_debt FLOAT, 
	total_equity FLOAT, 
	revenue_q FLOAT, 
	ebit_q FLOAT, 
	net_income_common_q FLOAT, 
	cfo_q FLOAT, 
	capex_proxy_q FLOAT, 
	fcf_q FL

In [39]:
import sqlite3
import pandas as pd

conn = sqlite3.connect(db_path)

query = """
WITH objetivos(anio, ticker) AS (
    VALUES
    (2020,'AAPL'), (2020,'AMZN'), (2020,'MSFT'),
    (2021,'BBVA'), (2021,'XP'),   (2021,'ATCC'),
    (2022,'BBVA'), (2022,'XP'),   (2022,'AAPL'),
    (2023,'BBVA'), (2023,'FLJ'),  (2023,'AAPL'),
    (2024,'FLJ'),  (2024,'AAPL'), (2024,'NVDA')
),
cierre AS (
    SELECT
        o.anio,
        o.ticker,
        MAX(p.date) AS fecha_precio
    FROM objetivos o
    JOIN fact_prices_d p
      ON UPPER(p.ticker) = o.ticker
     AND CAST(strftime('%Y', p.date) AS INTEGER) = o.anio
    GROUP BY o.anio, o.ticker
)
SELECT
    c.anio,
    c.ticker,
    c.fecha_precio,
    p.adj_close,
    p.shares_outstanding,
    p.market_cap_d,
    dc.company_name,
    dc.sector,
    dc.industry
FROM cierre c
JOIN fact_prices_d p
  ON UPPER(p.ticker) = c.ticker
 AND p.date = c.fecha_precio
LEFT JOIN dim_companies dc
  ON dc.simfin_id = p.simfin_id
ORDER BY c.anio, c.ticker;
"""

df_check = pd.read_sql_query(query, conn)
conn.close()

print(df_check.to_string(index=False))


 anio ticker               fecha_precio  adj_close  shares_outstanding  market_cap_d                          company_name             sector                     industry
 2020   AAPL 2020-12-31 00:00:00.000000     129.06        1.682326e+10  2.171210e+12                             APPLE INC         Technology            Computer Hardware
 2020   AMZN 2020-12-31 00:00:00.000000     162.85        1.006000e+10  1.638271e+12                        AMAZON COM INC  Consumer Cyclical Retail - Apparel & Specialty
 2020   MSFT 2020-12-31 00:00:00.000000     212.92        7.546000e+09  1.606694e+12                        MICROSOFT CORP         Technology         Application Software
 2021   ATCC 2021-12-31 00:00:00.000000       0.22        7.239574e+12  1.592706e+12                Ameritrust Corporation Financial Services   Brokers, Exchanges & Other
 2021   BBVA 2021-12-31 00:00:00.000000       4.51        6.667887e+15  3.007217e+16 Banco Bilbao Vizcaya Argentaria, S.A. Financial Services    
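
In the check output just before this point, the `shares_outstanding` values for ATCC and BBVA are several orders of magnitude larger than AAPL's, which may point to a units or data-quality issue in the source (an assumption; it is not verified here). A minimal sanity-check sketch follows, using three rows copied from that output. The threshold `MAX_PLAUSIBLE_SHARES` is a hypothetical cut-off chosen only for illustration.

```python
# (ticker, adj_close, shares_outstanding, market_cap_d), copied from the check output
rows = [
    ("AAPL", 129.06, 1.682326e10, 2.171210e12),
    ("ATCC", 0.22, 7.239574e12, 1.592706e12),
    ("BBVA", 4.51, 6.667887e15, 3.007217e16),
]

MAX_PLAUSIBLE_SHARES = 1e12  # hypothetical threshold, not taken from the notebook


def is_consistent(price, shares, cap, rel_tol=1e-3):
    """True when market cap matches price * shares within a relative tolerance."""
    return abs(price * shares - cap) <= rel_tol * cap


# Tickers whose share count looks implausibly large.
flagged = [ticker for ticker, _, shares, _ in rows if shares > MAX_PLAUSIBLE_SHARES]
# Whether market_cap_d is internally consistent with price * shares.
all_consistent = all(is_consistent(p, s, c) for _, p, s, c in rows)

print("Flagged:", flagged)
print("market_cap_d == adj_close * shares_outstanding:", all_consistent)
```

For these rows the market cap is consistent with price times shares, so any distortion in the ranking would come from the share counts rather than from the multiplication.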

## 8.6 Complex or Ambiguous Queries
### Question 10

**Query (user):**  
In the latest available year, which sectors showed operating improvement but deteriorating free cash flow?
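
"Operating improvement" and "deteriorating free cash flow" are ambiguous until they are tied to specific metrics. One possible operationalization, sketched below on invented sector averages, treats positive average year-over-year revenue growth as improvement and negative average year-over-year FCF growth as deterioration. Other proxies, such as EBIT growth, would be equally defensible.

```python
# sector -> (avg YoY revenue growth, avg YoY FCF growth); invented values
sector_metrics = {
    "Healthcare": (0.08, -0.12),
    "Technology": (0.15, 0.05),
    "Energy": (-0.03, -0.20),
    "Utilities": (0.02, None),  # FCF growth unavailable
}

# Keep sectors with growing revenue and shrinking FCF; skip missing values.
flagged = sorted(
    sector
    for sector, (rev_growth, fcf_growth) in sector_metrics.items()
    if rev_growth is not None
    and fcf_growth is not None
    and rev_growth > 0
    and fcf_growth < 0
)

print(flagged)
```

Because the agent has to choose its own proxy for each term, its answer can legitimately differ from a validation query that encodes a single definition.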


In [40]:
# Test 10 - complex/ambiguous query (sector: operating improvement + free cash flow deterioration)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_10 = "In the latest available year, which sectors showed operating improvement but deteriorating free cash flow?"

sql_validacion_10 = """
WITH ultimo_anio AS (
    SELECT MAX(fiscal_year) AS anio
    FROM fact_fundamentals_q
),
sector_metrics AS (
    SELECT
        c.sector,
        AVG(f.yoy_revenue_growth_q) AS yoy_revenue_promedio,
        AVG(f.yoy_fcf_growth_q) AS yoy_fcf_promedio,
        COUNT(*) AS observaciones
    FROM fact_fundamentals_q f
    JOIN dim_companies c
      ON c.simfin_id = f.simfin_id
    JOIN ultimo_anio u
      ON f.fiscal_year = u.anio
    WHERE c.sector IS NOT NULL
    GROUP BY c.sector
)
SELECT
    sector,
    yoy_revenue_promedio,
    yoy_fcf_promedio,
    observaciones
FROM sector_metrics
WHERE yoy_revenue_promedio > 0
  AND yoy_fcf_promedio < 0
ORDER BY yoy_revenue_promedio DESC, yoy_fcf_promedio ASC, sector;
"""

if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run 7.2 first.")
if "db" not in globals() or "llm" not in globals():
    raise NameError("db or llm is not defined. Run 7.2 first.")

# 1) Direct SQL validation (the connection is closed even if the query fails)
conn = sqlite3.connect(db_path)
try:
    resultado_sql_10 = pd.read_sql_query(sql_validacion_10, conn)
finally:
    conn.close()

# 2) Call the agent through the integrated helper function
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_10, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_10 = agent_executor_test.invoke({"input": pregunta_10})

# 3) Extract model/version
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 60)
print("📝 QUERY (USER)")
print("-" * 60)
print(pregunta_10)

print("\n" + "-" * 60)
print("📌 SQL VALIDATION QUERY")
print("-" * 60)
print(sql_validacion_10.strip())

print("\n" + "-" * 60)
print("🧾 SQL VALIDATION RESULT")
print("-" * 60)
if len(resultado_sql_10) == 0:
    print("No sectors show operating improvement together with cash deterioration in the latest available year.")
else:
    print(resultado_sql_10.to_string(index=False))
    print(f"\nSectors found: {len(resultado_sql_10)}")

print("\n" + "-" * 60)
print("\033[1;32m✅ AGENT RESPONSE:\033[0m")
print("\033[1;96m" + str(respuesta_10["output"]) + "\033[0m")
print("-" * 60)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 60)
print(f"📊 TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"💰 COST:     \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("-" * 60 + "\n")




> Entering new SQL Agent Executor chain...
Action: sql_db_list_tables
Action Input: ""
dim_companies, fact_fundamentals_q, fact_prices_d
I need to check the schema of the relevant tables to understand how to formulate my query. The tables that seem relevant are `fact_fundamentals_q` for operational improvements and `fact_prices_d` for cash flow data. I'll check the schema of both tables.
Action: sql_db_schema
Action Input: "fact_fundamentals_q, fact_prices_d"
CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year BIGINT, 
	fiscal_period TEXT, 
	report_date DATETIME, 
	ticker TEXT, 
	currency TEXT, 
	total_current_assets FLOAT, 
	total_current_liabilities FLOAT, 
	short_term_debt FLOAT, 
	long_term_debt FLOAT, 
	total_equity FLOAT, 
	revenue_q FLOAT, 
	ebit_q FLOAT, 
	net_income_common_q FLOAT, 
	cfo_q FLOAT, 
	capex_proxy_q FLOAT, 
	fcf_q FLOAT, 
	total_debt_q FLOAT, 
	current_ratio_q FLO
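
One detail of `sql_validacion_10` is worth keeping in mind when reading its `observaciones` column: in SQL, `AVG` skips `NULL` values while `COUNT(*)` counts every row, so the count can be larger than the number of rows that actually fed each average. The following is a minimal sketch of that behavior on an in-memory SQLite table; the table `toy_growth` and its values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_growth (sector TEXT, yoy_fcf_growth_q REAL)")
conn.executemany(
    "INSERT INTO toy_growth VALUES (?, ?)",
    [
        ("Technology", -0.10),
        ("Technology", None),   # missing growth value (NULL)
        ("Technology", -0.30),
    ],
)

# COUNT(*) counts all rows; COUNT(column) and AVG(column) ignore NULLs.
total_rows, non_null_rows, avg_growth = conn.execute(
    "SELECT COUNT(*), COUNT(yoy_fcf_growth_q), AVG(yoy_fcf_growth_q) "
    "FROM toy_growth"
).fetchone()
conn.close()

print(f"rows={total_rows}, rows used by AVG={non_null_rows}, avg={avg_growth:.2f}")
```

If the count is meant to describe the sample behind an average, `COUNT(column)` on the averaged column is the closer match.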

### Question 10 (reformulated)

In [41]:
# Test 10 - Complex/ambiguous query (reformulated and scoped to sectors)
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

pregunta_10 = (
    "Which sectors (not companies or tickers), in the latest available year, show "
    "simultaneously an average improvement in operating performance and an average "
    "deterioration in year-over-year free cash flow growth? Present the result as a "
    "table with the sector and both averages."
)

sql_validacion_10 = """
WITH ultimo_anio AS (
    SELECT MAX(fiscal_year) AS anio
    FROM fact_fundamentals_q
),
sector_metrics AS (
    SELECT
        c.sector,
        AVG(f.yoy_revenue_growth_q) AS promedio_desempeno_operativo,
        AVG(f.yoy_fcf_growth_q) AS promedio_crecimiento_caja_libre,
        COUNT(*) AS observaciones
    FROM fact_fundamentals_q f
    JOIN dim_companies c
      ON c.simfin_id = f.simfin_id
    JOIN ultimo_anio u
      ON f.fiscal_year = u.anio
    WHERE c.sector IS NOT NULL
    GROUP BY c.sector
)
SELECT
    sector,
    promedio_desempeno_operativo,
    promedio_crecimiento_caja_libre,
    observaciones
FROM sector_metrics
WHERE promedio_desempeno_operativo > 0
  AND promedio_crecimiento_caja_libre < 0
ORDER BY promedio_desempeno_operativo DESC, promedio_crecimiento_caja_libre ASC, sector;
"""

if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run 7.2 first.")
if "db" not in globals() or "llm" not in globals():
    raise NameError("db or llm is not defined. Run 7.2 first.")

# 1) Direct SQL validation (the connection is closed even if the query fails)
conn = sqlite3.connect(db_path)
try:
    resultado_sql_10 = pd.read_sql_query(sql_validacion_10, conn)
finally:
    conn.close()

# 2) Call the agent through the integrated helper function
with get_openai_callback() as cb:
    if "usar_filtro" in globals() and usar_filtro:
        tablas_relevantes = obtener_tablas_relevantes(pregunta_10, db, llm)
    else:
        tablas_relevantes = None

    agent_executor_test = configurar_agente(tablas_relevantes)
    respuesta_10 = agent_executor_test.invoke({"input": pregunta_10})

# 3) Extract model/version
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"

# 4) Compact, readable output
status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 60)
print("📝 QUERY (USER)")
print("-" * 60)
print(pregunta_10)

print("\n" + "-" * 60)
print("📌 SQL VALIDATION QUERY")
print("-" * 60)
print(sql_validacion_10.strip())

print("\n" + "-" * 60)
print("🧾 SQL VALIDATION RESULT")
print("-" * 60)
if len(resultado_sql_10) == 0:
    print("No sectors meet the condition in the latest available year.")
else:
    print(resultado_sql_10.to_string(index=False))
    print(f"\nSectors found: {len(resultado_sql_10)}")

print("\n" + "-" * 60)
print("\033[1;32m✅ AGENT RESPONSE:\033[0m")
print("\033[1;96m" + str(respuesta_10["output"]) + "\033[0m")
print("-" * 60)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 60)
print(f"📊 TOKENS:   {cb.total_tokens} (In: {cb.prompt_tokens} | Out: {cb.completion_tokens})")
print(f"💰 COST:     \033[1;33m${cb.total_cost:.6f} USD\033[0m")
print("-" * 60 + "\n")




> Entering new SQL Agent Executor chain...
Action: sql_db_list_tables
Action Input: ""
dim_companies, fact_fundamentals_q, fact_prices_d
I need to check the schema of the relevant tables to understand how to construct my query. The `fact_fundamentals_q` table seems to be the most relevant for operational performance and free cash flow growth. I'll check its schema.
Action: sql_db_schema
Action Input: "fact_fundamentals_q"
CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year BIGINT, 
	fiscal_period TEXT, 
	report_date DATETIME, 
	ticker TEXT, 
	currency TEXT, 
	total_current_assets FLOAT, 
	total_current_liabilities FLOAT, 
	short_term_debt FLOAT, 
	long_term_debt FLOAT, 
	total_equity FLOAT, 
	revenue_q FLOAT, 
	ebit_q FLOAT, 
	net_income_common_q FLOAT, 
	cfo_q FLOAT, 
	capex_proxy_q FLOAT, 
	fcf_q FLOAT, 
	total_debt_q FLOAT, 
	current_ratio_q FLOAT, 
	debt_to_equity_q FLOAT, 
	fcf_ma
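
The test cells in this section derive a base model name and a version tag from `llm.model_name` by splitting on hyphens. The sketch below isolates that rule in a helper so its behavior can be inspected; the function name `split_model_name` is hypothetical, and the model names passed to it are only example inputs, not a statement of which model the notebook used.

```python
def split_model_name(full_model: str) -> tuple[str, str]:
    """Split a hyphenated model name into (base, version) using the
    same rule as the test cells: the first three hyphen-separated
    segments form the base, and anything after them is the version."""
    parts = full_model.split("-", 3)
    base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
    version = parts[3] if len(parts) > 3 else "LATEST"
    return base, version


# Example inputs (assumed names, used only to exercise the rule).
for name in ("gpt-4o-mini-2024-07-18", "gpt-4o-mini", "gpt-4o-2024-08-06"):
    print(name, "->", split_model_name(name))
```

The rule assumes a three-segment base name. A name with a two-segment base and a date suffix, such as the last example, is split into `gpt-4o-2024` and `08-06`, so the reported version should be read with that limitation in mind.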

## 8.7 Linguistic Variation Tests

The base question and three rephrasings are run, comparing the consistency of the results and showing the agent's intermediate steps.

**Base question:**  
How has the average liquidity of the Healthcare sector evolved by year?
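
Comparing consistency across rephrasings can be done by eye, but a rough automated check is also possible. The sketch below is one assumed approach, not the method used in this notebook: it extracts decimal numbers from each free-text answer and verifies that every reference value from the SQL result appears among them after rounding. The helper names `extract_numbers` and `covers_reference`, the sample answers, and the reference values are all invented for illustration.

```python
import re


def extract_numbers(text: str, decimals: int = 2) -> set[float]:
    """Collect decimal numbers (e.g. 2.41) from free text, rounded.
    Integers such as years are ignored because they have no decimal point."""
    return {round(float(m), decimals) for m in re.findall(r"-?\d+\.\d+", text)}


def covers_reference(answer: str, reference_values: list[float], decimals: int = 2) -> bool:
    """True if every reference value appears in the answer after rounding."""
    found = extract_numbers(answer, decimals)
    return all(round(v, decimals) in found for v in reference_values)


# Invented reference values standing in for the SQL result column.
reference = [2.41, 2.57]

# Invented agent answers, one per phrasing label.
answers = {
    "base": "2019: 2.41, 2020: 2.57",
    "var_1": "The average was 2.410 in 2019 and 2.570 in 2020.",
    "var_2": "Liquidity rose from 2.41 to 2.60.",
}

consistency = {label: covers_reference(text, reference) for label, text in answers.items()}
print(consistency)
```

A check like this only confirms that the reference figures are present; it does not verify that each figure is attached to the right year, so it complements rather than replaces reading the answers.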




In [46]:
# =========================================================
# 2) Linguistic variation - (base + 3 variants)
# =========================================================
import sqlite3
import pandas as pd
from langchain_community.callbacks.manager import get_openai_callback

# Questions
preguntas_b = {
    "base": "How has the average liquidity of the Healthcare sector evolved by year?",
    "var_1": "Show me the annual trend of average liquidity in Healthcare.",
    "var_2": "For each year, what was the average liquidity of the Healthcare sector?",
    "var_3": "In the Healthcare sector, summarize year by year the behavior of the average liquidity ratio."
}

# Reference SQL
sql_validacion_b = """
SELECT
    f.fiscal_year,
    AVG(f.current_ratio_q) AS promedio_liquidez
FROM fact_fundamentals_q f
JOIN dim_companies c
  ON c.simfin_id = f.simfin_id
WHERE c.sector = 'Healthcare'
  AND f.current_ratio_q IS NOT NULL
GROUP BY f.fiscal_year
ORDER BY f.fiscal_year;
"""

# Minimal validations
if "db_path" not in globals():
    raise NameError("db_path is not defined. Run 6.2/6.3 first.")
if "configurar_agente" not in globals():
    raise NameError("configurar_agente(...) is not defined. Run the configuration cell first.")
if "db" not in globals() or "llm" not in globals():
    raise NameError("db or llm is not in memory. Run 7.2 first.")

# 1) Direct SQL (the connection is closed even if the query fails)
conn = sqlite3.connect(db_path)
try:
    resultado_sql_b = pd.read_sql_query(sql_validacion_b, conn)
finally:
    conn.close()

# 2) Run the agent for each phrasing
resultados_agente = []

for etiqueta, pregunta in preguntas_b.items():
    with get_openai_callback() as cb:
        if "usar_filtro" in globals() and usar_filtro:
            tablas_relevantes = obtener_tablas_relevantes(pregunta, db, llm)
        else:
            tablas_relevantes = None

        agent_executor_test = configurar_agente(tablas_relevantes)
        respuesta = agent_executor_test.invoke({"input": pregunta})

    resultados_agente.append({
        "etiqueta": etiqueta,
        "pregunta": pregunta,
        "respuesta": str(respuesta.get("output", "")),
        "pasos": respuesta.get("intermediate_steps", []),
        "tokens_total": cb.total_tokens,
        "tokens_in": cb.prompt_tokens,
        "tokens_out": cb.completion_tokens,
        "costo_usd": cb.total_cost
    })

# 3) Report
full_model = getattr(llm, "model_name", "N/A")
parts = full_model.split("-", 3)
modelo_base = "-".join(parts[:3]) if len(parts) >= 3 else full_model
version_tag = parts[3] if len(parts) > 3 else "LATEST"
status_txt = "\033[1;32mOPTIMIZED (ON)\033[0m" if ("usar_filtro" in globals() and usar_filtro) else "\033[1;31mSTANDARD (OFF)\033[0m"

print("\n" + "-" * 70)
print("🧾 SQL VALIDATION (REFERENCE)")
print("-" * 70)
print(sql_validacion_b.strip())
print("\nSQL result:")
print(resultado_sql_b.to_string(index=False))

print("\n" + "-" * 70)
print("✅ AGENT RESPONSES (BASE + VARIATIONS)")
print("-" * 70)

for r in resultados_agente:
    print("\n" + "·" * 70)
    print(f"Label: {r['etiqueta']}")
    print(f"Question: {r['pregunta']}")
    print("\033[1;96m" + r["respuesta"] + "\033[0m")
    print(f"Tokens: {r['tokens_total']} (In: {r['tokens_in']} | Out: {r['tokens_out']})")
    print(f"Cost: ${r['costo_usd']:.6f} USD")

    print("\n🧠 Agent steps:")
    # Steps are present only if the executor returns "intermediate_steps"
    # (for example, when built with return_intermediate_steps=True).
    pasos = r.get("pasos", [])
    if not pasos:
        print("  - No intermediate steps were captured.")
    else:
        for i, step in enumerate(pasos, 1):
            if isinstance(step, tuple) and len(step) == 2:
                action, observation = step
                tool = getattr(action, "tool", "N/A")
                tool_input = getattr(action, "tool_input", "N/A")
            else:
                tool = "N/A"
                tool_input = "N/A"
                observation = step

            print(f"  Step {i}")
            print(f"    Tool: {tool}")
            print(f"    Input: {str(tool_input)[:300]}")
            print(f"    Output: {str(observation)[:500]}")

# 4) Metrics summary table
df_metricas = pd.DataFrame(resultados_agente)[["etiqueta", "tokens_total", "tokens_in", "tokens_out", "costo_usd"]]
print("\n" + "-" * 70)
print("📊 SUMMARY METRICS")
print("-" * 70)
print(df_metricas.to_string(index=False))

print("\n" + "-" * 70)
print(f"⚙️  FILTER STATUS: {status_txt}")
print(f"🤖 MODEL:         \033[1;36m{modelo_base.upper()}\033[0m")
print(f"📌 VERSION:       \033[1;34m{version_tag.upper()}\033[0m")
print("-" * 70 + "\n")




> Entering new SQL Agent Executor chain...

Invoking: `sql_db_list_tables` with `{}`

dim_companies, fact_fundamentals_q, fact_prices_d

Invoking: `sql_db_schema` with `{'table_names': 'dim_companies'}`

CREATE TABLE dim_companies (
	simfin_id BIGINT, 
	ticker TEXT, 
	company_name TEXT, 
	market TEXT, 
	main_currency TEXT, 
	industry_id BIGINT, 
	industry TEXT, 
	sector TEXT, 
	has_industry BIGINT
)

/*
3 rows from dim_companies table:
simfin_id	ticker	company_name	market	main_currency	industry_id	industry	sector	has_industry
45846	A	AGILENT TECHNOLOGIES INC	us	USD	106001	Medical Diagnostics & Research	Healthcare	1
1333027	A21	Li Auto Inc.	us	USD	None	None	None	0
367153	AA	Alcoa Corp	us	USD	110004	Metals & Mining	Basic Materials	1
*/

Invoking: `sql_db_schema` with `{'table_names': 'fact_fundamentals_q'}`

CREATE TABLE fact_fundamentals_q (
	simfin_id BIGINT, 
	fiscal_year