# Dense Networks ¬∑ SP500 Next‚ÄëDay Direction (Binary Classification)

**Objective**: Predict tomorrow's SP500 move (up/down) using **lagged cross‚Äësectional stock returns** with a **Dense (MLP) network**.

You can run this notebook in **Google Colab**. It includes two ready experiments:
- **A:** Few stocks, many lags  
- **B:** Many stocks, few lags

üá¨üáß Note: This exercise belongs to the Dense Networks topic. Students compare how the number of lags and the number of stocks affect the model‚Äôs stability and performance.

üá∑üá∫ –ü–æ–¥—Å–∫–∞–∑–∫–∞: —ç—Ç–æ –ø—Ä–∞–∫—Ç–∏–∫—É–º –¥–ª—è —Ç–µ–º—ã *Dense Networks*. –°—Ç—É–¥–µ–Ω—Ç—ã —Å—Ä–∞–≤–Ω–∏–≤–∞—é—Ç –≤–ª–∏—è–Ω–∏–µ **—á–∏—Å–ª–∞ –ª–∞–≥–æ–≤** –∏ **—á–∏—Å–ª–∞ –∞–∫—Ü–∏–π** –Ω–∞ —É—Å—Ç–æ–π—á–∏–≤–æ—Å—Ç—å –∏ –∫–∞—á–µ—Å—Ç–≤–æ –º–æ–¥–µ–ª–∏.

## 0) Setup

In [25]:
# If running in Colab:
# !pip -q install yfinance pandas numpy scikit-learn torch==2.4.1 matplotlib

In [26]:
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score, confusion_matrix
from sklearn.model_selection import TimeSeriesSplit

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
torch.manual_seed(7)
np.random.seed(7)

print("Torch:", torch.__version__)

Torch: 2.9.0+cu128


## 1) Data: choose one of the options

### Option 1 ‚Äî Download with `yfinance` (easiest in Colab)
- Includes **SPY** for market proxy and a basket of large‚Äëcap tickers.

In [27]:
use_yfinance = True  # set to False if you upload your own CSV of returns

tickers_all = [
    "AAPL","MSFT","NVDA","AMZN","GOOG","META","BRK-B","TSLA","AVGO","LLY",
    "V","JPM","XOM","UNH","MA","HD","PG","COST","ADBE","NFLX",
    "PEP","CRM","KO","MRK","ABBV"
]
spy = "SPY"
start_date = "2012-01-01"
end_date = None  # or e.g. "2025-01-01"

if use_yfinance:
    try:
        import yfinance as yf
        data = yf.download([spy] + tickers_all, start=start_date, end=end_date, auto_adjust=True, progress=False)["Close"]
        data = data.dropna(how="all")
        data = data.asfreq("B").ffill()  # align to business days
        rets = data.pct_change().dropna()
        #rets.columns = [c.replace(" ", "_").replace("-", "_") for c in rets.columns]
        #tickers_all = [t.replace(" ", "_").replace("-", "_") for t in tickers_all]

        spy_col = spy
        print("Data downloaded. Shape:", rets.shape)
    except Exception as e:
        print("yfinance failed:", e)
        use_yfinance = False
else:
    print("Set use_yfinance=True to auto-download, or upload your own CSV in the next cell.")

Data downloaded. Shape: (3352, 26)


### Option 2 ‚Äî Upload your own **returns** CSV
- Must contain **SPY** column (market proxy) and several stock columns
- Index must be a **date** (parseable)

In [28]:
# If not using yfinance, upload your own daily RETURNS CSV with columns: SPY, AAPL, MSFT, ...
# Example:
# from google.colab import files
# uploaded = files.upload()
# rets = pd.read_csv(list(uploaded.keys())[0], index_col=0, parse_dates=True)
# spy_col = "SPY"

In [29]:
assert 'rets' in globals(), "Please prepare `rets` (returns DataFrame) and `spy_col`."
assert spy_col in rets.columns, f"`{spy_col}` column required."
rets = rets.sort_index()
rets.head()

Ticker,AAPL,ABBV,ADBE,AMZN,AVGO,BRK-B,COST,CRM,GOOG,HD,...,MSFT,NFLX,NVDA,PEP,PG,SPY,TSLA,UNH,V,XOM
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2013-01-03,-0.012622,-0.008257,-0.015389,0.004547,0.005224,0.004507,0.010251,-0.014372,0.000581,-0.002836,...,-0.013396,0.049777,0.000786,0.000433,-0.006341,-0.002259,-0.016685,-0.046755,0.000772,-0.001803
2013-01-04,-0.027855,-0.012633,0.010066,0.002592,-0.00642,0.002457,-0.00322,0.005335,0.01976,-0.001896,...,-0.018716,-0.006315,0.032993,0.001442,0.002031,0.004392,-0.010642,0.001923,0.008167,0.00463
2013-01-07,-0.005882,0.002035,-0.004983,0.035925,-0.005539,-0.004262,-0.007733,-0.003714,-0.004363,-0.005382,...,-0.00187,0.033549,-0.028897,-0.000144,-0.006803,-0.002733,-0.001744,0.0,0.007144,-0.011578
2013-01-08,0.002691,-0.021764,0.005272,-0.007748,-0.006807,0.003852,-0.001875,0.005859,-0.001974,0.006047,...,-0.005245,-0.020565,-0.021926,0.003024,-0.001603,-0.002878,-0.01922,-0.013246,0.00931,0.006255
2013-01-09,-0.015629,0.005636,0.013634,-0.000113,0.022118,-0.005223,0.000494,0.010943,0.006573,-0.000791,...,0.00565,-0.012865,-0.022418,0.005024,0.005401,0.002542,-0.001187,0.018872,0.015248,-0.003843


## 2) Target and Base Features

In [30]:
# Target = tomorrow's SP500/market sign
y = (rets[spy_col].shift(-1) > 0).astype(int)

# Base cross-sectional features (exclude SPY to avoid leakage)
X_base = rets.drop(columns=[spy_col])
X_base.head(), y.head()

(Ticker          AAPL      ABBV      ADBE      AMZN      AVGO     BRK-B  \
 Date                                                                     
 2013-01-03 -0.012622 -0.008257 -0.015389  0.004547  0.005224  0.004507   
 2013-01-04 -0.027855 -0.012633  0.010066  0.002592 -0.006420  0.002457   
 2013-01-07 -0.005882  0.002035 -0.004983  0.035925 -0.005539 -0.004262   
 2013-01-08  0.002691 -0.021764  0.005272 -0.007748 -0.006807  0.003852   
 2013-01-09 -0.015629  0.005636  0.013634 -0.000113  0.022118 -0.005223   
 
 Ticker          COST       CRM      GOOG        HD  ...       MRK      MSFT  \
 Date                                                ...                       
 2013-01-03  0.010251 -0.014372  0.000581 -0.002836  ...  0.023947 -0.013396   
 2013-01-04 -0.003220  0.005335  0.019760 -0.001896  ... -0.008504 -0.018716   
 2013-01-07 -0.007733 -0.003714 -0.004363 -0.005382  ...  0.003574 -0.001870   
 2013-01-08 -0.001875  0.005859 -0.001974  0.006047  ...  0.001425 -0.005

## 3) Helper: build lagged features

In [31]:
def build_lagged_features(df: pd.DataFrame, n_lags: int) -> pd.DataFrame:
    cols = {}
    for lag in range(1, n_lags + 1):
        lagged = df.shift(lag).add_suffix(f"_lag{lag}")
        cols[lag] = lagged
    X = pd.concat([cols[lag] for lag in cols], axis=1)
    return X

# Quick test
_ = build_lagged_features(X_base.iloc[:10], 3)

## 4) PyTorch dataset & model

In [32]:
class TabDataset(Dataset):
    def __init__(self, X: np.ndarray, y: np.ndarray):
        self.X = X.astype(np.float32)
        self.y = y.astype(np.float32).reshape(-1, 1)
    def __len__(self): return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

class MLP(nn.Module):
    def __init__(self, in_dim: int, hidden=(64, 32), p_dropout=0.1):
        super().__init__()
        layers = []
        last = in_dim
        for h in hidden:
            layers += [nn.Linear(last, h), nn.ReLU(), nn.Dropout(p_dropout)]
            last = h
        layers += [nn.Linear(last, 1)]
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)  # logits

In [33]:
def train_model(X_train, y_train, X_val, y_val, epochs=50, batch_size=128, lr=1e-3, weight_decay=1e-4, hidden=(64,32)):

    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_val_s   = scaler.transform(X_val)

    train_ds = TabDataset(X_train_s, y_train)
    val_ds   = TabDataset(X_val_s,   y_val)
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True, drop_last=False)
    val_loader   = DataLoader(val_ds,   batch_size=batch_size, shuffle=False, drop_last=False)

    model = MLP(in_dim=X_train.shape[1], hidden=hidden, p_dropout=0.1)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    best_val = float("inf")
    best_state = None
    patience, patience_left = 8, 8

    for epoch in range(1, epochs+1):
        model.train()
        train_loss = 0.0
        for xb, yb in train_loader:
            optimizer.zero_grad()
            logits = model(xb)
            loss = criterion(logits, yb)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * len(xb)
        train_loss /= len(train_ds)

        # validation
        model.eval()
        with torch.no_grad():
            val_loss = 0.0
            preds = []
            ys = []
            for xb, yb in val_loader:
                logits = model(xb)
                loss = criterion(logits, yb)
                val_loss += loss.item() * len(xb)
                preds.append(torch.sigmoid(logits).cpu().numpy())
                ys.append(yb.cpu().numpy())
            val_loss /= len(val_ds)
        if val_loss < best_val - 1e-5:
            best_val = val_loss
            best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
            patience_left = patience
        else:
            patience_left -= 1
            if patience_left <= 0:
                # early stop
                break

    if best_state is not None:
        model.load_state_dict(best_state)

    return model, scaler

In [34]:
def evaluate_model(model, scaler, X, y, threshold=0.5, label="VAL"):
    Xs = scaler.transform(X).astype(np.float32)
    with torch.no_grad():
        logits = model(torch.from_numpy(Xs))
        probs = torch.sigmoid(logits).numpy().ravel()

    y_pred = (probs >= threshold).astype(int)
    acc  = accuracy_score(y, y_pred)
    bacc = balanced_accuracy_score(y, y_pred)
    try:
        roc  = roc_auc_score(y, probs)
    except ValueError:
        roc = np.nan
    cm = confusion_matrix(y, y_pred)

    print(f"[{label}] Accuracy={acc:.3f} | BalancedAcc={bacc:.3f} | ROC-AUC={roc:.3f}")
    print("Confusion matrix:\n", cm)
    return {"acc": acc, "bacc": bacc, "roc": roc, "cm": cm, "probs": probs, "pred": y_pred}

## 5) Time split: train / val / test

In [35]:
def time_split_idx(n, train_frac=0.7, val_frac=0.15):
    train_end = int(n * train_frac)
    val_end   = int(n * (train_frac + val_frac))
    return slice(0, train_end), slice(train_end, val_end), slice(val_end, n)

## 6) Experiments A & B

In [42]:
def run_experiment(name, stock_list, n_lags, hidden=(64,32), epochs=60):
    print(f"\n=== Experiment {name}: stocks={len(stock_list)}, lags={n_lags} ===")

    # Validate and filter stock_list to only include available columns
    available_stocks = [stock for stock in stock_list if stock in X_base.columns]
    missing_stocks = [stock for stock in stock_list if stock not in X_base.columns]

    if missing_stocks:
        print(f"Warning: The following stocks are not available and will be skipped: {missing_stocks}")

    if not available_stocks:
        print(f"Error: None of the requested stocks are available in X_base")
        return None


    X_cs = X_base[available_stocks]
    X = build_lagged_features(X_cs, n_lags)
    y_aligned = y.loc[X.index]

    # drop NA from shifts
    mask = X.notna().all(axis=1) & y_aligned.notna()
    X = X[mask]
    y_arr = y_aligned[mask].values

    n = len(X)
    tr, va, te = time_split_idx(n, 0.7, 0.15)
    X_train, y_train = X.iloc[tr].values, y_arr[tr]
    X_val,   y_val   = X.iloc[va].values, y_arr[va]
    X_test,  y_test  = X.iloc[te].values, y_arr[te]

    model, scaler = train_model(X_train, y_train, X_val, y_val, epochs=epochs, hidden=hidden)
    eval_val = evaluate_model(model, scaler, X_val, y_val, label=f"{name}-VAL")
    eval_te  = evaluate_model(model, scaler, X_test, y_test, label=f"{name}-TEST")

    return {"name": name, "eval_val": eval_val, "eval_test": eval_te}

# Define stock subsets
few_stocks  = ["AAPL","MSFT","NVDA","AMZN","GOOG"]
many_stocks = tickers_all if 'tickers_all' in globals() else list(X_base.columns)[:20]

# Run two contrasting experiments
res_A = run_experiment("A (few stocks, many lags)", stock_list=few_stocks,  n_lags=15, hidden=(256,128))
res_B = run_experiment("B (many stocks, few lags)", stock_list=many_stocks, n_lags=5, hidden=(128,64))


=== Experiment A (few stocks, many lags): stocks=5, lags=15 ===
[A (few stocks, many lags)-VAL] Accuracy=0.499 | BalancedAcc=0.516 | ROC-AUC=0.522
Confusion matrix:
 [[ 54 210]
 [ 41 196]]
[A (few stocks, many lags)-TEST] Accuracy=0.547 | BalancedAcc=0.506 | ROC-AUC=0.502
Confusion matrix:
 [[ 40 179]
 [ 48 234]]

=== Experiment B (many stocks, few lags): stocks=25, lags=5 ===
[B (many stocks, few lags)-VAL] Accuracy=0.462 | BalancedAcc=0.488 | ROC-AUC=0.506
Confusion matrix:
 [[ 23 244]
 [ 26 209]]
[B (many stocks, few lags)-TEST] Accuracy=0.543 | BalancedAcc=0.491 | ROC-AUC=0.534
Confusion matrix:
 [[ 20 199]
 [ 31 253]]


### üß© Concept Notes

#### **Bias‚ÄìVariance Trade-off**
When we increase model complexity ‚Äî for example, by adding more **lags**, **neurons**, or **layers** ‚Äî the model can fit the training data more precisely but may lose its ability to generalize.  
This balance between **bias** (systematic error) and **variance** (sensitivity to noise) is central to all machine learning:

| Scenario | Bias | Variance | Description |
|-----------|------|-----------|--------------|
| Too simple model (few lags, small net) | high | low | underfitting ‚Äî misses true patterns |
| Too complex model (many lags, deep net) | low | high | overfitting ‚Äî memorizes noise |
| Balanced model | moderate | moderate | captures signal, ignores noise |

> üí° In this exercise: adding too many lags can make the model memorize daily fluctuations instead of learning meaningful temporal structure.

---

#### **Cross-sectional Breadth**
Cross-sectional means ‚Äúacross many assets at the same time.‚Äù  
If **lags** add *temporal depth*, then **breadth** adds *market width* ‚Äî the diversity of stocks used as simultaneous features.

| Scenario | What it means | Effect |
|-----------|----------------|---------|
| Few stocks, many lags | Focus on temporal behavior of each stock | Captures time-series structure but loses market context |
| Many stocks, few lags | Focus on market snapshot at each date | Captures cross-sectional structure but adds noise |

> üí° Example: if the market index rises because only tech stocks surge while others fall, this is a **cross-sectional signal** ‚Äî it can‚Äôt be seen from SP500 alone.


## 7) Discuss

EN:
- **Bias‚ÄìVariance Trade-off**: Increasing the number of lags (time dimension) while keeping the same number of stocks expands the feature space ‚Äî this raises the risk of overfitting, especially with shorter time series.

- **Cross-sectional Breadth**: Using more stocks with fewer lags adds cross-sectional information (market dispersion on a given day) but may introduce additional noise.

- **Metrics ‚âà 0.50‚Äì0.55** on test data are typical for a clean setup without data leakage.
If ROC-AUC < 0.5, the model has learned a weak **anti-signal** ‚Äî try inverting the target or adjusting feature definitions.


RU:
- **Bias‚ÄìVariance trade‚Äëoff**: –±–æ–ª—å—à–µ –ª–∞–≥–æ–≤ (–≤—Ä–µ–º–µ–Ω–∏) –ø—Ä–∏ —Ñ–∏–∫—Å–∏—Ä–æ–≤–∞–Ω–Ω–æ–º —á–∏—Å–ª–µ –∞–∫—Ü–∏–π —É–≤–µ–ª–∏—á–∏–≤–∞–µ—Ç —Ä–∞–∑–º–µ—Ä –ø—Ä–∏–∑–Ω–∞–∫–æ–≤–æ–≥–æ –ø—Ä–æ—Å—Ç—Ä–∞–Ω—Å—Ç–≤–∞ ‚Üí —Ä–∏—Å–∫ –ø–µ—Ä–µ–æ–±—É—á–µ–Ω–∏—è –ø—Ä–∏ –∫–æ—Ä–æ—Ç–∫–∏—Ö —Ä—è–¥–∞—Ö.  
- **Cross‚Äësectional breadth**: –±–æ–ª—å—à–µ –∞–∫—Ü–∏–π –ø—Ä–∏ –º–∞–ª–æ–º —á–∏—Å–ª–µ –ª–∞–≥–æ–≤ –¥–æ–±–∞–≤–ª—è–µ—Ç –ø–æ–ø–µ—Ä–µ—á–Ω—É—é –∏–Ω—Ñ–æ—Ä–º–∞—Ü–∏—é (—Ä–∞—Å–ø—Ä–µ–¥–µ–ª–µ–Ω–∏–µ —Ä—ã–Ω–∫–∞ –≤ –∫–æ–Ω–∫—Ä–µ—Ç–Ω—ã–π –¥–µ–Ω—å), –Ω–æ –º–æ–∂–µ—Ç —É—Å–∏–ª–∏–≤–∞—Ç—å —à—É–º.  
- **–ú–µ—Ç—Ä–∏–∫–∏ ‚âà 0.50‚Äì0.55** –Ω–∞ —Ç–µ—Å—Ç–µ ‚Äî —ç—Ç–æ –Ω–æ—Ä–º–∞–ª—å–Ω–æ –¥–ª—è ¬´—á–µ—Å—Ç–Ω–æ–π¬ª –ø–æ—Å—Ç–∞–Ω–æ–≤–∫–∏ –±–µ–∑ —É—Ç–µ—á–∫–∏. –ï—Å–ª–∏ ROC‚ÄëAUC < 0.5 ‚Äî —ç—Ç–æ **–∞–Ω—Ç–∏‚Äë—Å–∏–≥–Ω–∞–ª**; –ø–æ–ø—Ä–æ–±—É–π—Ç–µ –∏–Ω–≤–µ—Ä—Ç–∏—Ä–æ–≤–∞—Ç—å –ø–æ—Ä–æ–≥ –∏–ª–∏ –ø—Ä–∏–∑–Ω–∞–∫–∏.

