# Download SPX Data to `market_data` and Run agent-alpha

This notebook does two things:

1. Downloads real SPX universe + prices data using `agent_alpha.data.download_spx_data`.
2. Runs factor evaluation: features and factor values are computed on all downloaded prices, while RankIC, ICIR, and ex-ante IR are evaluated only on rows inside the universe mask.

Optional: run the full `AgentAlphaWorkflow` if `OPENAI_API_KEY` is set.
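
The split described in step 2 can be illustrated on toy data. The sketch below is not agent-alpha's implementation: the column names, the made-up values, and the `daily_rank_ic` helper are all hypothetical. It scores every row with a factor, then measures the per-date Spearman rank correlation twice, once on all rows and once only on rows flagged as in-universe.

```python
import pandas as pd

# Toy long-format data: 2 dates x 4 tickers. All values are made up.
toy = pd.DataFrame(
    {
        "date": ["2020-01-01"] * 4 + ["2020-01-02"] * 4,
        "ticker": ["A", "B", "C", "D"] * 2,
        "factor": [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],
        "fwd_ret": [0.01, 0.02, 0.03, -0.05, 0.04, 0.03, 0.02, 0.09],
        "in_universe": [1, 1, 1, 0, 1, 1, 1, 0],
    }
)


def daily_rank_ic(frame: pd.DataFrame) -> pd.Series:
    """Per-date Spearman correlation between factor and forward return."""
    out = {}
    for date, group in frame.groupby("date"):
        # Pearson correlation of within-date ranks equals Spearman correlation.
        out[date] = group["factor"].rank().corr(group["fwd_ret"].rank())
    return pd.Series(out, name="rank_ic")


# The factor exists for every row; only the evaluation sample changes.
ic_all = daily_rank_ic(toy)
ic_universe = daily_rank_ic(toy[toy["in_universe"] == 1])
print("mean RankIC, all rows:     ", ic_all.mean())
print("mean RankIC, universe only:", ic_universe.mean())
```

In this toy sample the single out-of-universe ticker `D` is enough to change the sign of the mean RankIC, which is why the evaluation scope matters.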


In [1]:
from __future__ import annotations

import sys
from pathlib import Path

import pandas as pd

CURRENT_DIR = Path.cwd().resolve()
candidate_roots = [CURRENT_DIR, CURRENT_DIR.parent, CURRENT_DIR.parent.parent]
REPO_ROOT = next((p for p in candidate_roots if (p / "agent_alpha").exists()), None)
if REPO_ROOT is None:
    raise FileNotFoundError(
        "Could not find project root containing 'agent_alpha'. Open this notebook from agent-alpha/notebooks."
    )
NOTEBOOK_DIR = REPO_ROOT / "notebooks"
if not NOTEBOOK_DIR.exists():
    raise FileNotFoundError(f"Expected notebooks directory at: {NOTEBOOK_DIR}")

if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

MARKET_DATA_DIR = NOTEBOOK_DIR / "market_data"
MARKET_DATA_DIR.mkdir(parents=True, exist_ok=True)

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("REPO_ROOT:", REPO_ROOT)
print("MARKET_DATA_DIR:", MARKET_DATA_DIR)

NOTEBOOK_DIR: D:\python scripts\qalpha\agent-alpha\notebooks
REPO_ROOT: D:\python scripts\qalpha\agent-alpha
MARKET_DATA_DIR: D:\python scripts\qalpha\agent-alpha\notebooks\market_data


If your environment is missing the data-download dependencies, run this once:

```python
%pip install yfinance requests beautifulsoup4 lxml
```
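
To see whether the install is needed at all, a small check like the following can be run first. It is a sketch using only the standard library; the `REQUIRED_MODULES` mapping and the `missing_packages` helper are hypothetical names, and the mapping covers only the four packages listed above.

```python
import importlib.util

# pip package name -> import name. beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = {
    "yfinance": "yfinance",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(REQUIRED_MODULES)
print("Missing packages:", missing or "none")
```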

In [2]:
from agent_alpha.data.download_spx_data import build_spx_data

START_DATE = "2015-01-01"
END_DATE = pd.Timestamp.today().strftime("%Y-%m-%d")
BATCH_SIZE = 100
PAUSE_SECONDS = 0.2

summary = build_spx_data(
    start_date=START_DATE,
    end_date=END_DATE,
    output_dir=MARKET_DATA_DIR,
    batch_size=BATCH_SIZE,
    pause_seconds=PAUSE_SECONDS,
    auto_adjust=True,
)

summary

$BCR: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$ALXN: possibly delisted; no timezone found
$AGN: possibly delisted; no timezone found
$ALTR: possibly delisted; no timezone found
$ARG: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$ARNC: possibly delisted; no timezone found
$AVP: possibly delisted; no timezone found
$BHI: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$ATVI: possibly delisted; no timezone found
$ACE: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$ANSS: possibly delisted; no timezone found
$ADS: possibly delisted; no timezone found
$ABMD: possibly delisted; no timezone found

13 Failed downloads:
['BCR', 'ARG', 'BHI', 'ACE']: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
['ALXN', 'AGN', 'ALTR', 'ARNC', 'AVP', 'ATVI', 'ANSS', 'ADS', 'ABMD']: possibly delisted; no timezone found


Batch 1: tickers=100 rows=228,270


$BRCM: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$CERN: possibly delisted; no timezone found
$CTXS: possibly delisted; no timezone found
$CXO: possibly delisted; no timezone found
$CSC: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$CELG: possibly delisted; no timezone found
$BXLT: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$CPGX: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$DFS: possibly delisted; no timezone found
$COV: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$CVC: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$CMCSK: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$CFN: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$CHK: possibly delisted; no timezone found
$CTLT: possibly delisted; no timezone found

15 Failed downloads:
['BRCM', 'CSC', 'BXLT', 'CPGX', 'COV', 'CVC', 'CMCSK', 'CFN'

Batch 2: tickers=100 rows=215,177


$GAS: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$DTV: possibly delisted; no timezone found
$DISCK: possibly delisted; no timezone found
$FRC: possibly delisted; no timezone found
$DO: possibly delisted; no timezone found
$GGP: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$DPS: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$ETFC: possibly delisted; no timezone found
$FLIR: possibly delisted; no timezone found
$DISCA: possibly delisted; no timezone found
$DWDP: possibly delisted; no timezone found
$ENDP: possibly delisted; no timezone found
$FTR: possibly delisted; no timezone found
$DISH: possibly delisted; no timezone found
$ESV: possibly delisted; no timezone found
$DRE: possibly delisted; no timezone found
$FDO: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$FL: possibly delisted; no timezone found
$DNR: possibly delisted; no timezone found
$FBHS: possibly delisted; no timezone found


Batch 3: tickers=100 rows=205,675


$HSP: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$HBI: possibly delisted; no timezone found
$KRFT: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$K: possibly delisted; no timezone found
$GPS: possibly delisted; no timezone found
$JOY: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$HES: possibly delisted; no timezone found
$JNPR: possibly delisted; no timezone found
$HCBK: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$JWN: possibly delisted; no timezone found
$HFC: possibly delisted; no timezone found
$GMCR: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$IPG: possibly delisted; no timezone found
$KSU: possibly delisted; no timezone found

14 Failed downloads:
['HSP', 'KRFT', 'JOY', 'HCBK', 'GMCR']: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
['HBI', 'K', 'GPS', 'HES', 'JNPR', 'JWN', 'HFC', 'IPG', 'KSU']: possibly delisted; no timezone foun

Batch 4: tickers=100 rows=229,125


$MNK: possibly delisted; no timezone found
$NLSN: possibly delisted; no timezone found
$MRO: possibly delisted; no timezone found
$MXIM: possibly delisted; no timezone found
$LM: possibly delisted; no timezone found
$LO: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$MON: possibly delisted; no timezone found
$MJN: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$LVLT: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$LLTC: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$LLL: possibly delisted; no timezone found
$NBL: possibly delisted; no timezone found

12 Failed downloads:
['MNK', 'NLSN', 'MRO', 'MXIM', 'LM', 'MON', 'LLL', 'NBL']: possibly delisted; no timezone found
['LO', 'MJN', 'LVLT', 'LLTC']: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)


Batch 5: tickers=100 rows=237,695


$PBCT: possibly delisted; no timezone found
$SIAL: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$SIVB: possibly delisted; no timezone found
$PCP: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$RAI: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$PXD: possibly delisted; no timezone found
$QEP: possibly delisted; no timezone found
$PLL: possibly delisted; no timezone found
$PDCO: possibly delisted; no timezone found
$PETM: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$RTN: possibly delisted; no timezone found
$RHT: possibly delisted; no timezone found
$SNI: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)

13 Failed downloads:
['PBCT', 'SIVB', 'PXD', 'QEP', 'PLL', 'PDCO', 'RTN', 'RHT']: possibly delisted; no timezone found
['SIAL', 'PCP', 'RAI', 'PETM', 'SNI']: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)


Batch 6: tickers=100 rows=219,678


$TWC: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$TYC: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$SWY: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$VAR: possibly delisted; no timezone found
$TWTR: possibly delisted; no timezone found
$TSS: possibly delisted; no timezone found
$SWN: possibly delisted; no timezone found
$VIAB: possibly delisted; no timezone found
$TIF: possibly delisted; no timezone found
$STJ: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$WCG: possibly delisted; no timezone found
$WBA: possibly delisted; no timezone found

12 Failed downloads:
['TWC', 'TYC', 'SWY', 'STJ']: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
['VAR', 'TWTR', 'TSS', 'SWN', 'VIAB', 'TIF', 'WCG', 'WBA']: possibly delisted; no timezone found


Batch 7: tickers=100 rows=227,603


$WYN: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$WFM: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
$YHOO: possibly delisted; no timezone found
$XLNX: possibly delisted; no timezone found
$XEC: possibly delisted; no timezone found
$XL: possibly delisted; no timezone found
$WIN: possibly delisted; no timezone found

7 Failed downloads:
['WYN', 'WFM']: possibly delisted; no price data found  (1d 2015-01-01 -> 2026-02-18)
['YHOO', 'XLNX', 'XEC', 'XL', 'WIN']: possibly delisted; no timezone found


Batch 8: tickers=30 rows=64,108


{'universe_path': 'D:\\python scripts\\qalpha\\agent-alpha\\notebooks\\market_data\\spx_universe.csv',
 'prices_path': 'D:\\python scripts\\qalpha\\agent-alpha\\notebooks\\market_data\\spx_prices.csv',
 'filtered_universe_path': 'D:\\python scripts\\qalpha\\agent-alpha\\notebooks\\market_data\\spx_universe_filtered.csv',
 'agent_panel_path': 'D:\\python scripts\\qalpha\\agent-alpha\\notebooks\\market_data\\spx_agent_panel.csv',
 'universe_rows': 1468631,
 'prices_rows': 1627331,
 'panel_rows': 1297210,
 'n_universe_tickers': 730,
 'n_price_tickers': 623,
 'n_panel_tickers': 606,
 'n_panel_dates': 2797}
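
The counts in this summary can be sanity-checked with a few lines of arithmetic. The sketch below copies the numbers from the output above and assumes that the downloader splits the universe ticker list into chunks of `batch_size`. That assumption matches the log (seven batches of 100 tickers and one of 30) but is not confirmed from the library source.

```python
import math

# Values copied from the summary printed above.
summary_counts = {
    "n_universe_tickers": 730,
    "n_price_tickers": 623,
    "n_panel_tickers": 606,
}
batch_size = 100

# Assumed batching rule: consecutive chunks of at most `batch_size` tickers.
n_batches = math.ceil(summary_counts["n_universe_tickers"] / batch_size)
last_batch_size = summary_counts["n_universe_tickers"] - (n_batches - 1) * batch_size

# Share of universe tickers that ended up with prices / in the agent panel.
missing_prices = summary_counts["n_universe_tickers"] - summary_counts["n_price_tickers"]
price_coverage = summary_counts["n_price_tickers"] / summary_counts["n_universe_tickers"]
panel_coverage = summary_counts["n_panel_tickers"] / summary_counts["n_universe_tickers"]

print(f"batches: {n_batches} (last batch: {last_batch_size} tickers)")
print(f"tickers without prices: {missing_prices}")
print(f"price coverage: {price_coverage:.1%}, panel coverage: {panel_coverage:.1%}")
```

Tickers without prices are mostly historical index members that the data source no longer serves, as the "possibly delisted" messages indicate.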

In [2]:
from IPython.display import display

for file_path in sorted(MARKET_DATA_DIR.glob("spx_*.csv")):
    size_mb = file_path.stat().st_size / (1024 * 1024)
    print(f"{file_path.name}: {size_mb:.2f} MB")

prices_csv_path = MARKET_DATA_DIR / "spx_prices.csv"
universe_csv_path = MARKET_DATA_DIR / "spx_universe_filtered.csv"

display(pd.read_csv(prices_csv_path, nrows=5))
display(pd.read_csv(universe_csv_path, nrows=5))


spx_agent_panel.csv: 122.27 MB
spx_prices.csv: 152.94 MB
spx_universe.csv: 25.42 MB
spx_universe_filtered.csv: 22.67 MB


Unnamed: 0,date,ticker,open,high,low,close,volume
0,2015-01-02,A,37.621143,37.739909,36.881144,37.054726,1529200.0
1,2015-01-02,AA,35.550392,35.796805,35.102374,35.572796,4340408.0
2,2015-01-02,AAL,51.430493,51.733694,50.284015,51.079918,10748600.0
3,2015-01-02,AAP,140.634759,142.077385,137.679545,138.632553,509800.0
4,2015-01-02,AAPL,24.671155,24.68223,23.776357,24.214897,212818400.0


Unnamed: 0,date,ticker,in_universe
0,2015-01-02,A,1
1,2015-01-02,AA,1
2,2015-01-02,AAPL,1
3,2015-01-02,ABBV,1
4,2015-01-02,ABT,1


In [3]:
prices_csv_path = MARKET_DATA_DIR / "spx_prices.csv"
universe_csv_path = MARKET_DATA_DIR / "spx_universe_filtered.csv"

prices_df = pd.read_csv(prices_csv_path, parse_dates=["date"])
universe_mask = pd.read_csv(universe_csv_path, parse_dates=["date"])

panel = (
    prices_df.rename(
        columns={
            "date": "datetime",
            "ticker": "instrument",
            "open": "$open",
            "high": "$high",
            "low": "$low",
            "close": "$close",
            "volume": "$volume",
        }
    )
    .set_index(["datetime", "instrument"])
    .sort_index()[["$open", "$high", "$low", "$close", "$volume"]]
)

dates = panel.index.get_level_values("datetime")
instruments = panel.index.get_level_values("instrument")
print(
    {
        "panel_rows": int(len(panel)),
        "panel_tickers": int(instruments.nunique()),
        "panel_dates": int(dates.nunique()),
        "start": str(dates.min().date()),
        "end": str(dates.max().date()),
        "universe_rows": int(len(universe_mask)),
        "universe_tickers": int(universe_mask["ticker"].nunique()),
        "universe_snapshot_dates": int(universe_mask["date"].nunique()),
    }
)
panel.head()


{'panel_rows': 1627331, 'panel_tickers': 623, 'panel_dates': 2797, 'start': '2015-01-02', 'end': '2026-02-17', 'universe_rows': 1310954, 'universe_tickers': 623, 'universe_snapshot_dates': 2797}


datetime,instrument,$open,$high,$low,$close,$volume
2015-01-02,A,37.621143,37.739909,36.881144,37.054726,1529200.0
2015-01-02,AA,35.550392,35.796805,35.102374,35.572796,4340408.0
2015-01-02,AAL,51.430493,51.733694,50.284015,51.079918,10748600.0
2015-01-02,AAP,140.634759,142.077385,137.679545,138.632553,509800.0
2015-01-02,AAPL,24.671155,24.68223,23.776357,24.214897,212818400.0


In [4]:
import json

from agent_alpha.evaluator import FactorEvaluator

example_ast = {
    "version": "1",
    "root": {
        "type": "call",
        "op": "RANK",
        "args": [
            {
                "type": "call",
                "op": "DELTA",
                "args": [
                    {"type": "var", "name": "$close"},
                    {"type": "const", "value": 5},
                ],
            }
        ],
    },
}

evaluator = FactorEvaluator(periods=[1, 5, 10], min_cross_section=5)
factor = evaluator.calculate_factor(panel, example_ast)
forward_returns = evaluator.calculate_forward_returns(panel, periods=[1, 5, 10])
metrics_all = evaluator.calculate_ex_ante_ir(factor, forward_returns)
metrics_universe = evaluator.calculate_ex_ante_ir(
    factor,
    forward_returns,
    universe_mask=universe_mask,
)

print("Example AST rank_ic_ir (all rows):", metrics_all["rank_ic_ir"])
print("Example AST rank_ic_ir (universe-filtered):", metrics_universe["rank_ic_ir"])
print("Universe evaluation scope:", json.dumps(metrics_universe.get("evaluation_scope", {}), indent=2))
display(factor.dropna().head(10))
metrics_universe


Example AST rank_ic_ir (all rows): 0.00017761636530499308
Example AST rank_ic_ir (universe-filtered): 0.012379439305265695
Universe evaluation scope: {
  "universe_filter_applied": true,
  "rows_total": 1627331,
  "rows_in_scope": 1420309,
  "n_tickers_total": 623,
  "n_tickers_in_scope": 623
}


datetime    instrument
2015-01-09  A             0.566787
            AA            0.722022
            AAL           0.155235
            AAP           0.871841
            AAPL          0.741877
            ABBV          0.521661
            ABT           0.638989
            ACGL          0.624549
            ACN           0.781588
            ADBE          0.380866
Name: factor, dtype: float64
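
For intuition, the example AST `RANK(DELTA($close, 5))` can be approximated in plain pandas. This is a sketch under assumptions: it treats `DELTA(x, n)` as the per-instrument difference over `n` rows and `RANK` as a cross-sectional percentile rank per date. Both readings are consistent with the output above (first values on 2015-01-09, all values between 0 and 1), but the evaluator's actual operator definitions are not shown here. The `rank_of_delta` helper and the toy prices are hypothetical.

```python
import pandas as pd


def rank_of_delta(close: pd.Series, periods: int = 5) -> pd.Series:
    """Cross-sectional percentile rank of the per-instrument n-row change."""
    # Assumed DELTA: value minus the value `periods` rows earlier, per instrument.
    delta = close.groupby(level="instrument").diff(periods)
    # Assumed RANK: percentile rank across instruments within each date.
    return delta.groupby(level="datetime").rank(pct=True).rename("factor")


# Toy panel: 7 business days x 3 instruments with made-up prices.
# X rises by 1 per day, Y by 2, Z by 3.
dates = pd.bdate_range("2015-01-02", periods=7)
index = pd.MultiIndex.from_product(
    [dates, ["X", "Y", "Z"]], names=["datetime", "instrument"]
)
close = pd.Series(
    [10 + day * step for day in range(7) for step in (1.0, 2.0, 3.0)],
    index=index,
    name="$close",
)

toy_factor = rank_of_delta(close)
print(toy_factor.dropna())
```

The first five dates carry no factor value because a 5-row difference needs five earlier observations, which mirrors the real output starting on 2015-01-09.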

{'rank_ic': 0.0005762436425004962,
 'rank_ic_ir': 0.012379439305265695,
 'period_metrics': {'ret_1': {'rank_ic': 0.0028186791176473967,
   'rank_ic_std': 0.044993780892988874,
   'rank_ic_ir': 0.0626459715477393,
   'n_days': 2792},
  'ret_5': {'rank_ic': -0.002056539793829896,
   'rank_ic_std': 0.04440275896905576,
   'rank_ic_ir': -0.04631558582346418,
   'n_days': 2792},
  'ret_10': {'rank_ic': 0.0009665916036839877,
   'rank_ic_std': 0.04645303506312933,
   'rank_ic_ir': 0.020807932191521973,
   'n_days': 2788}},
 'primary_metric': 'rank_ic',
 'evaluation_scope': {'universe_filter_applied': True,
  'rows_total': 1627331,
  'rows_in_scope': 1420309,
  'n_tickers_total': 623,
  'n_tickers_in_scope': 623}}
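
The relationships between these numbers can be checked by hand. The sketch below copies the per-horizon values from `period_metrics` and reproduces the reported figures under two readings inferred from the output rather than from the library source: each horizon's `rank_ic_ir` is `rank_ic / rank_ic_std`, and the top-level `rank_ic` and `rank_ic_ir` are simple averages across the three horizons.

```python
# Per-horizon values copied from `period_metrics` in the output above.
period_metrics = {
    "ret_1": {"rank_ic": 0.0028186791176473967, "rank_ic_std": 0.044993780892988874},
    "ret_5": {"rank_ic": -0.002056539793829896, "rank_ic_std": 0.04440275896905576},
    "ret_10": {"rank_ic": 0.0009665916036839877, "rank_ic_std": 0.04645303506312933},
}

# Inferred rule 1: ICIR per horizon = mean daily RankIC / std of daily RankIC.
icir = {
    name: values["rank_ic"] / values["rank_ic_std"]
    for name, values in period_metrics.items()
}

# Inferred rule 2: headline numbers are equal-weighted averages over horizons.
mean_rank_ic = sum(v["rank_ic"] for v in period_metrics.values()) / len(period_metrics)
mean_icir = sum(icir.values()) / len(icir)

for name, value in icir.items():
    print(f"{name}: rank_ic_ir = {value:.6f}")
print(f"average rank_ic    = {mean_rank_ic:.6e}")
print(f"average rank_ic_ir = {mean_icir:.6f}")
```

Note that the 1-day and 5-day horizons have opposite signs, so the averaged headline value is much smaller in magnitude than either horizon on its own.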

In [4]:
import os


Optional full workflow with LLM (`gpt-5-mini`). This requires `OPENAI_API_KEY` in your environment.
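
A guard like the following makes the optional part explicit before any LLM call is attempted. It is a minimal sketch: the `llm_workflow_enabled` helper is a hypothetical name, and it only checks that the variable is present and non-empty, not that the key is valid.

```python
import os
from collections.abc import Mapping


def llm_workflow_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when a non-empty OPENAI_API_KEY is present in `env`."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if llm_workflow_enabled():
    print("OPENAI_API_KEY found: the optional workflow cells can be run.")
else:
    print("OPENAI_API_KEY not set: skip the optional workflow cells.")
```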

In [5]:
from agent_alpha.workflow import AgentAlphaWorkflow

workflow = AgentAlphaWorkflow(model_name="gpt-5-mini", periods=[1], max_attempts=2)
# Adjacent string literals concatenate into a single prompt string.
# A comma between them would silently make user_goal a tuple.
user_goal = (
    "Generate a robust SPX cross-sectional alpha hypothesis. "
    "Keep the expression compact and interpretable."
)
state = workflow.run(
    user_goal=user_goal,
    panel=panel,
    max_attempts=2,
    universe_mask=universe_mask,
)



In [6]:
print("error:", state.get("error"))
print("hypothesis:", state.get("hypothesis"))
print("blueprint_json", state.get("blueprint_json"))
print("ast_expression:", state.get("ast_expression"))
print("metrics:", state.get("metrics"))
print("ast_summary:", state.get("ast_summary"))

error: None
hypothesis: Alpha ∝ rank( μ_{20}[log(C/O)] / σ_{20}[returns] ) — long top decile, short bottom decile across SPX.
blueprint_json {'hypothesis': 'Alpha ∝ rank( μ_{20}[log(C/O)] / σ_{20}[returns] ). Approximated by (μ20_close - μ20_open) / μ20_open, scaled by 20-day realized volatility, cross-sectionally ranked. Long top decile, short bottom decile across SPX.', 'components': [{'id': 'close_ma_20', 'feature': 'sma', 'params': {'period': 20, 'value': 'close', 'shift': 0}}, {'id': 'open_ma_20', 'feature': 'sma', 'params': {'period': 20, 'value': 'open', 'shift': 0}}, {'id': 'rv_20', 'feature': 'rv', 'params': {'period': 20, 'shift': 0}}], 'combine': {'type': 'call', 'op': 'RANK', 'args': [{'type': 'call', 'op': 'DIVIDE', 'args': [{'type': 'call', 'op': 'DIVIDE', 'args': [{'type': 'call', 'op': 'SUBTRACT', 'args': [{'type': 'component', 'id': 'close_ma_20'}, {'type': 'component', 'id': 'open_ma_20'}]}, {'type': 'component', 'id': 'open_ma_20'}]}, {'type': 'component', 'id': 'rv_
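
The hypothesis text above can be approximated directly in pandas for a quick cross-check. The sketch below is not the workflow's code, and the `blueprint_factor` helper and toy panel are hypothetical. It assumes `sma` is a per-instrument rolling mean and `rv` is the rolling standard deviation of daily close-to-close returns. The library's own definitions of these components are not shown in the output, so results may differ.

```python
import numpy as np
import pandas as pd


def blueprint_factor(panel: pd.DataFrame, window: int = 20) -> pd.Series:
    """Rank of (SMA(close) - SMA(open)) / SMA(open), scaled by rolling volatility."""
    by_inst = panel.groupby(level="instrument")
    close_ma = by_inst["$close"].transform(lambda s: s.rolling(window).mean())
    open_ma = by_inst["$open"].transform(lambda s: s.rolling(window).mean())
    # Assumed "rv": rolling std of daily close-to-close returns per instrument.
    returns = panel["$close"] / by_inst["$close"].shift(1) - 1.0
    rv = returns.groupby(level="instrument").transform(
        lambda s: s.rolling(window).std()
    )
    raw = (close_ma - open_ma) / open_ma / rv
    # Cross-sectional percentile rank within each date.
    return raw.groupby(level="datetime").rank(pct=True).rename("factor")


# Toy panel with made-up prices: 12 business days x 3 instruments.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=12)
index = pd.MultiIndex.from_product(
    [dates, ["X", "Y", "Z"]], names=["datetime", "instrument"]
)
open_px = 100 + rng.normal(0, 1, len(index)).cumsum()
close_px = open_px + rng.normal(0, 0.5, len(index))
toy_panel = pd.DataFrame({"$open": open_px, "$close": close_px}, index=index)

# A short window keeps the toy example small; the blueprint uses 20.
toy_blueprint = blueprint_factor(toy_panel, window=5)
print(toy_blueprint.dropna().head(6))
```

On the real data, `blueprint_factor(panel)` would use the 20-day window named in the hypothesis, and the long/short decile step described there would be applied to the resulting ranks.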

