## Weekly Features & Targets (S&P 500 Members Only)

This cell builds **weekly price features** and a **training target** that compares each stock’s *next-week return* to the *next-week cross-sectional median*, **restricted to weeks when the stock is in the S&P 500**.

### 1) Prices → weekly bars
- Load daily `adj_close` and `volume` from `sp500_prices_daily_yahoo`.
- Create a **Friday-ending week key** with `to_period("W-FRI")` and map to `week_end`.
- For each `(ticker_latest, week_end)`:
  - `adj_close_week_last`: last close in that week (proxy for week-end close).
  - `adj_close_week_avg`: average adjusted close across the week.
  - `volume_week_sum`: sum of daily volume in the week.
- Compute **weekly return**: `ret_week = pct_change(adj_close_week_last)` per ticker.
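
The steps above can be tried on a toy frame before touching the database. This is a minimal sketch under assumptions: the ticker `AAA` and all prices are made up, and it uses a single named aggregation rather than the three separate groupbys in the cell below. The second week has no Friday bar (Good Friday 2024), so it also shows `adj_close_week_last` falling back to Thursday.

```python
import pandas as pd

# Toy daily prices for one hypothetical ticker. Friday 2024-03-29 (Good Friday)
# is absent, so that week's last trading day is Thursday 2024-03-28.
daily = pd.DataFrame({
    "ticker_latest": ["AAA"] * 6,
    "date": pd.to_datetime([
        "2024-03-21", "2024-03-22",                              # Thu, Fri
        "2024-03-25", "2024-03-26", "2024-03-27", "2024-03-28",  # Mon-Thu
    ]),
    "adj_close": [10.0, 11.0, 11.0, 12.0, 12.5, 12.1],
    "volume": [100, 200, 150, 150, 100, 300],
})

# Friday-ending week key: every date maps to the Friday that closes its week.
daily["week_end"] = daily["date"].dt.to_period("W-FRI").dt.end_time.dt.date
daily = daily.sort_values(["ticker_latest", "date"])

# One pass over each (ticker, week) group: last close, mean close, summed volume.
weekly = (daily.groupby(["ticker_latest", "week_end"], as_index=False)
               .agg(adj_close_week_last=("adj_close", "last"),
                    adj_close_week_avg=("adj_close", "mean"),
                    volume_week_sum=("volume", "sum")))

# Week-over-week return on the last close, computed per ticker.
weekly["ret_week"] = weekly.groupby("ticker_latest")["adj_close_week_last"].pct_change()
print(weekly)
```

The holiday week is still keyed to Friday 2024-03-29 even though its last bar is from Thursday, which is what lets it join to the membership keys.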

### 2) Membership filter (no leakage)
- Build weekly S&P keys from `sp500_long_latest_profiles` by taking each row’s `date`, truncating to start of week and **adding 4 days ⇒ Friday** to match the prices’ `week_end`.
- Use `COALESCE(latest_ticker, ticker)` so legacy rows still match.
- **Inner join** prices to membership on `(ticker_latest, week_end)` ⇒ keep only weeks a ticker is actually in the index.
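
The two week keys come from different systems, so it is worth checking that they agree. The sketch below reproduces the SQL expression `date_trunc('week', date)::date + 4` in pandas and compares it with the `W-FRI` key used on the price side. The dates are hypothetical; whether `sp500_long_latest_profiles` actually contains weekend dates is not known from this notebook.

```python
import pandas as pd

# Hypothetical membership snapshot dates: Monday, Wednesday, Friday, Saturday.
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-05", "2024-01-06"]))

# pandas equivalent of date_trunc('week', date)::date + 4 (ISO weeks start on Monday).
monday = dates - pd.to_timedelta(dates.dt.weekday, unit="D")
sql_style_week_end = (monday + pd.Timedelta(days=4)).dt.date

# Week key used on the price side.
price_style_week_end = dates.dt.to_period("W-FRI").dt.end_time.dt.date

check = pd.DataFrame({
    "date": dates.dt.date,
    "weekday": dates.dt.day_name(),
    "sql_style": sql_style_week_end,
    "price_style": price_style_week_end,
})
check["match"] = check["sql_style"] == check["price_style"]
print(check)
```

Weekday dates agree. A Saturday or Sunday date would not: the SQL expression maps it back to the preceding Friday, while `W-FRI` assigns it to the following one. If the membership table only holds trading days, the two keys coincide.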

### 3) Targets (computed on the filtered universe)
- **Forward 1-week return** per ticker: `ret_week_fwd1 = ret_week.shift(-1)`.
- **Cross-sectional median** of `ret_week` by `week_end`, then **pulled back one week** with `.shift(-1)` so each row carries the *following* week’s median: `median_ret_fwd1`.
- Classification target:  
  `target_gt_median = 1` if `ret_week_fwd1 > median_ret_fwd1`, else `0` (nullable Int for missing tail rows).
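
The target logic can be sketched on a tiny panel. The tickers and returns are invented for illustration. One detail worth knowing: in pandas a comparison against `NaN` evaluates to `False`, so casting the raw comparison to `Int64` yields `0`, not `<NA>`, on tail rows. The sketch masks those rows explicitly with `.where(...)`, which is one possible way to get the nullable behaviour described above.

```python
import numpy as np
import pandas as pd

# Toy weekly panel: three hypothetical tickers over three consecutive Fridays.
weeks = pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-19"]).date
panel = pd.DataFrame({
    "ticker_latest": np.repeat(["AAA", "BBB", "CCC"], 3),
    "week_end": np.tile(weeks, 3),
    "ret_week": [0.01, 0.03, -0.02,
                 0.02, 0.00, 0.01,
                 -0.01, 0.02, 0.04],
}).sort_values(["ticker_latest", "week_end"]).reset_index(drop=True)

# Forward return: next week's ret_week for the same ticker.
panel["ret_week_fwd1"] = panel.groupby("ticker_latest")["ret_week"].shift(-1)

# Cross-sectional median per week, pulled back one week with shift(-1).
med = (panel.groupby("week_end", as_index=False)["ret_week"].median()
            .rename(columns={"ret_week": "median_ret"}))
med["median_ret_fwd1"] = med["median_ret"].shift(-1)
panel = panel.merge(med[["week_end", "median_ret_fwd1"]], on="week_end", how="left")

# Mask rows with no forward window so the label stays <NA> instead of 0.
has_fwd = panel["ret_week_fwd1"].notna() & panel["median_ret_fwd1"].notna()
panel["target_gt_median"] = ((panel["ret_week_fwd1"] > panel["median_ret_fwd1"])
                             .astype("Int64")
                             .where(has_fwd))
print(panel)
```

Because the inequality is strict, a ticker whose forward return equals the median is labelled `0`.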

### 4) Final tidy
- Ensure `week_end` is `date`, deduplicate `(ticker_latest, week_end)`, and sort.
- Add a convenience feature: **dollar volume** proxy  
  `dollar_vol_week_avgp = adj_close_week_avg * volume_week_sum`.
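
A two-day worked example, with invented numbers, shows why this is called a proxy: average price times total volume is not the same as the sum of daily price times volume when volume is uneven across the week.

```python
import pandas as pd

# Two hypothetical trading days in one week.
daily = pd.DataFrame({"adj_close": [10.0, 20.0], "volume": [100, 300]})

# Proxy used in this notebook: weekly average price x weekly volume sum.
proxy = daily["adj_close"].mean() * daily["volume"].sum()

# Day-by-day alternative: sum of (price x volume) over the week.
exact = (daily["adj_close"] * daily["volume"]).sum()

print(proxy, exact)  # 6000.0 7000.0
```

The gap grows when heavy-volume days coincide with prices far from the weekly mean. Both figures are in adjusted-price terms, since the input is `adj_close`.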

### Why this setup works
- **Calendar alignment:** using `W-FRI` ensures price and membership weeks line up on Fridays.
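- **Leakage-safe:** the membership filter is applied **before** the targets are computed, so the cross-sectional median is taken over index members only. Only the label columns (`ret_week_fwd1`, `median_ret_fwd1`) look one week ahead; the features on a row use data up to its own `week_end`.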
- **Robust week close:** “last close of the week” respects market holidays (it’s just the last available trading day in that week).
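
One assumption behind the forward shift is that consecutive rows for a ticker are consecutive weeks. After the membership inner join that may not hold: a ticker that leaves and later re-enters the index has a gap, and `shift(-1)` would then pull a return from several weeks ahead. The sketch below is a hedged guard, not part of the original cell. The column `ret_week_fwd1_strict` is a hypothetical name and the data is invented.

```python
import pandas as pd

# Hypothetical ticker that drops out after 2024-01-12 and re-enters on 2024-02-02.
panel = pd.DataFrame({
    "ticker_latest": ["AAA"] * 4,
    "week_end": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-02-02", "2024-02-09"]),
    "ret_week": [0.01, 0.02, -0.03, 0.04],
}).sort_values(["ticker_latest", "week_end"]).reset_index(drop=True)

# Plain forward shift, as in the notebook.
panel["ret_week_fwd1"] = panel.groupby("ticker_latest")["ret_week"].shift(-1)

# Date of the next available row for the same ticker.
next_week = panel.groupby("ticker_latest")["week_end"].shift(-1)

# Keep the forward return only when the next row is exactly one week later.
is_consecutive = (next_week - panel["week_end"]) == pd.Timedelta(days=7)
panel["ret_week_fwd1_strict"] = panel["ret_week_fwd1"].where(is_consecutive)
print(panel)
```

Here the 2024-01-12 row would otherwise be paired with the return for the week ending 2024-02-02. The same check applies to the weekly median series if any week is missing from the panel.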

### Notes / tweaks
- If you prefer **OHLC resampling**, you can `set_index('date')` and `resample('W-FRI')` per ticker; the current approach avoids multi-index juggling.
- For extremely sparse symbols, consider requiring a **minimum days per week** before computing `adj_close_week_avg`.
- You can switch the target to **top-tercile** or a **regression target** by replacing the median step with quantiles or keeping the raw `ret_week_fwd1`.
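
As an illustration of the last tweak, here is a minimal sketch of a top-tercile label. It takes the 2/3 quantile of `ret_week_fwd1` directly within each week rather than shifting a per-week statistic, which is an assumption of this sketch and not how the median is built in the cell below. The column name `target_top_tercile` and the data are hypothetical.

```python
import pandas as pd

# Toy forward returns for six hypothetical tickers in a single week.
df = pd.DataFrame({
    "week_end": ["2024-01-05"] * 6,
    "ticker_latest": list("ABCDEF"),
    "ret_week_fwd1": [-0.03, -0.01, 0.00, 0.01, 0.02, 0.05],
})

# Per-week 2/3 quantile of the forward return: the top-tercile cutoff.
cutoff = (df.groupby("week_end")["ret_week_fwd1"]
            .transform(lambda s: s.quantile(2 / 3)))

# 1 if the stock lands above the cutoff; <NA> where the forward return is missing.
df["target_top_tercile"] = ((df["ret_week_fwd1"] > cutoff)
                            .astype("Int64")
                            .where(df["ret_week_fwd1"].notna()))
print(df)
```

For a regression target, the quantile step can be dropped and `ret_week_fwd1` used as is.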



In [2]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

ENGINE_URL = "postgresql://postgres:CSDBMS623@localhost:5432/SP500_ML"
engine = create_engine(ENGINE_URL)

# ---------- 1) Daily prices ----------
px = pd.read_sql("""
    SELECT date, ticker_latest, adj_close, volume
    FROM sp500_prices_daily_yahoo
""", engine, parse_dates=["date"]).dropna(subset=["date","ticker_latest","adj_close"])

px["ticker_latest"] = px["ticker_latest"].str.upper().str.strip()
px["week_end"] = px["date"].dt.to_period("W-FRI").dt.end_time.dt.date
px = px.sort_values(["ticker_latest","date"])

# Weekly last/avg/vol
weekly_last = (px.groupby(["ticker_latest","week_end"], as_index=False)
                 .tail(1)[["ticker_latest","week_end","adj_close"]]
                 .rename(columns={"adj_close":"adj_close_week_last"}))
weekly_avg  = (px.groupby(["ticker_latest","week_end"], as_index=False)["adj_close"]
                 .mean().rename(columns={"adj_close":"adj_close_week_avg"}))
weekly_vol  = (px.groupby(["ticker_latest","week_end"], as_index=False)["volume"]
                 .sum().rename(columns={"volume":"volume_week_sum"}))

wk = (weekly_last.merge(weekly_avg, on=["ticker_latest","week_end"], how="left")
                 .merge(weekly_vol, on=["ticker_latest","week_end"], how="left"))
wk = wk.sort_values(["ticker_latest","week_end"])

# Weekly return (based on week-end last close)
wk["ret_week"] = wk.groupby("ticker_latest")["adj_close_week_last"].pct_change()

# ---------- 2) Membership keys (S&P members only by week) ----------
members = pd.read_sql("""
    SELECT DISTINCT
           (date_trunc('week', date)::date + 4) AS week_end,
           COALESCE(latest_ticker, ticker)      AS ticker_latest
    FROM sp500_long_latest_profiles
""", engine, parse_dates=["week_end"])

members["week_end"] = members["week_end"].dt.date
members["ticker_latest"] = members["ticker_latest"].str.upper().str.strip()

# Keep only member rows
wk_mem = (wk.merge(members, on=["ticker_latest","week_end"], how="inner")
            .sort_values(["ticker_latest","week_end"])
            .reset_index(drop=True))

# Dollar volume rollup: avg weekly price × weekly volume sum
wk_mem["dollar_vol_week_avgp"] = wk_mem["adj_close_week_avg"] * wk_mem["volume_week_sum"]

# ---------- 3) Targets on the FILTERED universe (no leakage) ----------
# next-week return (within membership weeks)
wk_mem["ret_week_fwd1"] = wk_mem.groupby("ticker_latest")["ret_week"].shift(-1)

# cross-sectional median by week, then next week’s median
med = (wk_mem.groupby("week_end", as_index=False)["ret_week"]
             .median().rename(columns={"ret_week":"median_ret"}))
med["median_ret_fwd1"] = med["median_ret"].shift(-1)

wk_mem = wk_mem.merge(med[["week_end","median_ret_fwd1"]], on="week_end", how="left")
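# A comparison against NaN is False, so mask explicitly to keep <NA> on rows with no forward window
has_fwd = wk_mem["ret_week_fwd1"].notna() & wk_mem["median_ret_fwd1"].notna()
wk_mem["target_gt_median"] = ((wk_mem["ret_week_fwd1"] > wk_mem["median_ret_fwd1"])
                              .astype("Int64").where(has_fwd))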

# ---------- 4) Final tidy ----------
wk_mem["week_end"] = pd.to_datetime(wk_mem["week_end"]).dt.date
wk_mem = (wk_mem
          .drop_duplicates(subset=["ticker_latest","week_end"], keep="last")
          .reset_index(drop=True))

print("wk_mem rows:", len(wk_mem))
print(wk_mem.head())

wk_mem rows: 288609
  ticker_latest    week_end  adj_close_week_last  adj_close_week_avg  \
0             A  2013-09-27            33.471466           33.496452   
1             A  2013-10-04            33.568722           33.394971   
2             A  2013-10-11            33.361267           32.889301   
3             A  2013-10-18            34.353157           33.638732   
4             A  2013-10-25            33.627052           33.276983   

   volume_week_sum  ret_week  dollar_vol_week_avgp  ret_week_fwd1  \
0         17581527 -0.007686          5.889188e+08       0.002906   
1         15957051  0.002906          5.328853e+08      -0.006180   
2         14187603 -0.006180          4.666203e+08       0.029732   
3         18159600  0.029732          6.108659e+08      -0.021136   
4         18690840 -0.021136          6.219748e+08      -0.014845   

   median_ret_fwd1  target_gt_median  
0         0.000690                 1  
1         0.009711                 0  
2         0.023

In [6]:
wk_mem.to_csv('membership_checks.csv')

In [9]:
# === Next-week outperformance vs S&P 500 (^GSPC), aligned to W-FRI ===
import pandas as pd
import yfinance as yf

# 1) Pull market (^GSPC) from just before your window
start_date = pd.to_datetime(wk_mem["week_end"]).min() - pd.Timedelta(days=7)
mkt = yf.download("^GSPC", start=start_date, auto_adjust=False, progress=False)

# 2) Normalize columns (yfinance sometimes returns MultiIndex)
if isinstance(mkt.columns, pd.MultiIndex):
    mkt.columns = ["_".join([str(x) for x in tup if x not in (None, "")]) for tup in mkt.columns]

if "Adj Close" in mkt.columns:
    mkt = mkt.rename(columns={"Adj Close": "adj_close_mkt"})
elif "Adj Close_^GSPC" in mkt.columns:
    mkt = mkt.rename(columns={"Adj Close_^GSPC": "adj_close_mkt"})
elif "Adj_Close" in mkt.columns:
    mkt = mkt.rename(columns={"Adj_Close": "adj_close_mkt"})
else:
    cand = [c for c in mkt.columns if "adj" in c.lower() and "close" in c.lower()]
    assert cand, "Couldn't find adjusted close in market DataFrame"
    mkt = mkt.rename(columns={cand[0]: "adj_close_mkt"})

mkt = mkt.reset_index().rename(columns={"Date":"date"})
mkt["week_end"] = mkt["date"].dt.to_period("W-FRI").dt.end_time.dt.date

# 3) Weekly last close and weekly market return
mkt_week = (mkt.sort_values("date")
              .groupby("week_end", as_index=False)
              .tail(1)[["week_end","adj_close_mkt"]]
              .sort_values("week_end"))
mkt_week["mkt_ret_week"] = mkt_week["adj_close_mkt"].pct_change()

# 4) Use *next week's* market return as the benchmark for our label
mkt_week["mkt_ret_week_fwd1"] = mkt_week["mkt_ret_week"].shift(-1)

# 5) Merge onto your membership-filtered weekly panel and make the target
wk_mem = wk_mem.merge(mkt_week[["week_end","mkt_ret_week_fwd1"]],
                      on="week_end", how="left")

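# Caveat: a comparison against NaN evaluates to False, so rows without a next-week
# return are labeled 0 here rather than <NA> (hence the full count printed below).
# Mask with .where(...) on the two inputs if those rows should stay missing.
wk_mem["target_gt_sp500"] = (wk_mem["ret_week_fwd1"] > wk_mem["mkt_ret_week_fwd1"]).astype("Int64")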

print("Rows with next-week SP500 label:",
      wk_mem["target_gt_sp500"].notna().sum(), "/", len(wk_mem))
wk_mem.head()


Rows with next-week SP500 label: 288609 / 288609


| | ticker_latest | week_end | adj_close_week_last | adj_close_week_avg | volume_week_sum | ret_week | dollar_vol_week_avgp | ret_week_fwd1 | median_ret_fwd1 | target_gt_median | mkt_ret_week_fwd1 | target_gt_sp500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 2013-09-27 | 33.471466 | 33.496452 | 17581527 | -0.007686 | 588918800.0 | 0.002906 | 0.00069 | 1 | -0.000739 | 1 |
| 1 | A | 2013-10-04 | 33.568722 | 33.394971 | 15957051 | 0.002906 | 532885300.0 | -0.00618 | 0.009711 | 0 | 0.007513 | 0 |
| 2 | A | 2013-10-11 | 33.361267 | 32.889301 | 14187603 | -0.00618 | 466620300.0 | 0.029732 | 0.023547 | 1 | 0.024249 | 1 |
| 3 | A | 2013-10-18 | 34.353157 | 33.638732 | 18159600 | 0.029732 | 610865900.0 | -0.021136 | 0.01016 | 0 | 0.008753 | 0 |
| 4 | A | 2013-10-25 | 33.627052 | 33.276983 | 18690840 | -0.021136 | 621974800.0 | -0.014845 | 0.0 | 0 | 0.001063 | 0 |


In [13]:
# Align S&P-based target to your existing schema names
# Requires wk_mem has: ['mkt_ret_week_fwd1', 'target_gt_sp500'] from the prior step

import pandas as pd

wk_mem = wk_mem.copy()

# 1) Normalize keys
wk_mem["week_end"] = pd.to_datetime(wk_mem["week_end"]).dt.date
wk_mem["ticker_latest"] = wk_mem["ticker_latest"].astype(str).str.upper().str.strip()

# 2) Reuse existing schema names
#    - Put next-week S&P return into 'median_ret_fwd1'
#    - Put "beats S&P next week?" into 'target_gt_median'
wk_mem["median_ret_fwd1"] = wk_mem["mkt_ret_week_fwd1"]
wk_mem["target_gt_median"] = wk_mem["target_gt_sp500"].astype("Int64")

# (optional) drop the original S&P columns to avoid confusion in staging
wk_mem = wk_mem.drop(columns=[c for c in ["mkt_ret_week_fwd1", "target_gt_sp500"] if c in wk_mem.columns])

# 3) Ensure required columns exist for your loader
required = [
    "week_end","ticker_latest",
    "adj_close_week_last","adj_close_week_avg","volume_week_sum","dollar_vol_week_avgp",
    "ret_week","ret_week_fwd1","median_ret_fwd1","target_gt_median"
]
missing = [c for c in required if c not in wk_mem.columns]
if missing:
    raise ValueError(f"wk_mem missing required columns for ingest: {missing}")

# 4) (optional) enforce one row per (week_end, ticker_latest)
wk_mem = (wk_mem
          .sort_values(["ticker_latest","week_end"])
          .drop_duplicates(subset=["week_end","ticker_latest"], keep="last"))


In [17]:
from sqlalchemy import text
from sqlalchemy.dialects.postgresql import DATE, VARCHAR, DOUBLE_PRECISION, BIGINT, INTEGER

TABLE   = "sp500_weekly_rollups"
STAGING = TABLE + "_stg"

# 1) Ensure target table
with engine.begin() as conn:
    conn.execute(text(f"""
        CREATE TABLE IF NOT EXISTS "{TABLE}" (
            week_end              date         NOT NULL,
            ticker_latest         varchar(16)  NOT NULL,
            adj_close_week_last   double precision,
            adj_close_week_avg    double precision,
            volume_week_sum       bigint,
            dollar_vol_week_avgp  double precision,
            ret_week              double precision,
            ret_week_fwd1         double precision,
            median_ret_fwd1       double precision,
            target_gt_median      integer,
            PRIMARY KEY (week_end, ticker_latest)
        );
    """))

# 2) Stage
wk_mem.to_sql(
    STAGING, engine, if_exists="replace", index=False, method="multi", chunksize=50_000,
    dtype={
        "week_end": DATE(),
        "ticker_latest": VARCHAR(16),
        "adj_close_week_last": DOUBLE_PRECISION(),
        "adj_close_week_avg": DOUBLE_PRECISION(),
        "volume_week_sum": BIGINT(),
        "dollar_vol_week_avgp": DOUBLE_PRECISION(),
        "ret_week": DOUBLE_PRECISION(),
        "ret_week_fwd1": DOUBLE_PRECISION(),
        "median_ret_fwd1": DOUBLE_PRECISION(),
        "target_gt_median": INTEGER(),
    }
)

# 3) UPSERT & clean staging
with engine.begin() as conn:
    conn.execute(text(f"""
        INSERT INTO "{TABLE}" (
            week_end, ticker_latest,
            adj_close_week_last, adj_close_week_avg, volume_week_sum, dollar_vol_week_avgp,
            ret_week, ret_week_fwd1, median_ret_fwd1, target_gt_median
        )
        SELECT
            s.week_end, s.ticker_latest,
            s.adj_close_week_last, s.adj_close_week_avg, s.volume_week_sum, s.dollar_vol_week_avgp,
            s.ret_week, s.ret_week_fwd1, s.median_ret_fwd1, s.target_gt_median
        FROM "{STAGING}" s
        ON CONFLICT (week_end, ticker_latest) DO UPDATE SET
            adj_close_week_last  = EXCLUDED.adj_close_week_last,
            adj_close_week_avg   = EXCLUDED.adj_close_week_avg,
            volume_week_sum      = EXCLUDED.volume_week_sum,
            dollar_vol_week_avgp = EXCLUDED.dollar_vol_week_avgp,
            ret_week             = EXCLUDED.ret_week,
            ret_week_fwd1        = EXCLUDED.ret_week_fwd1,
            median_ret_fwd1      = EXCLUDED.median_ret_fwd1,
            target_gt_median     = EXCLUDED.target_gt_median;
        DROP TABLE "{STAGING}";
    """))

print("Upsert complete →", TABLE)


Upsert complete → sp500_weekly_rollups


In [11]:
wk_mem.to_csv('membership2_checks.csv')