# 01 - Data Preparation

This notebook prepares the raw data for the downstream notebooks. If Alpaca API keys are absent, it falls back to synthetic data.
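
The fallback itself is not shown in this notebook. As a rough sketch of the idea, assuming the keys are read from the environment variables `APCA_API_KEY_ID` and `APCA_API_SECRET_KEY` (an assumption about this project's configuration), a synthetic price table could be generated as below. `have_alpaca_keys` and `synthetic_prices` are hypothetical names, not functions from `src`.

```python
import os

import numpy as np
import pandas as pd


def have_alpaca_keys() -> bool:
    """True if both credentials are set. The env var names are an assumption."""
    key_id = os.environ.get("APCA_API_KEY_ID")
    secret = os.environ.get("APCA_API_SECRET_KEY")
    return bool(key_id) and bool(secret)


def synthetic_prices(tickers, n_days=5, seed=42) -> pd.DataFrame:
    """Random-walk closes, one row per (date, ticker). Illustrative only."""
    rng = np.random.default_rng(seed)
    dates = pd.bdate_range("2024-01-02", periods=n_days)
    frames = []
    for ticker in tickers:
        daily_returns = rng.normal(0.0, 0.01, size=n_days)
        close = 100.0 * np.cumprod(1.0 + daily_returns)
        frames.append(pd.DataFrame({"date": dates, "ticker": ticker, "close": close}))
    return pd.concat(frames, ignore_index=True)


if not have_alpaca_keys():
    # Placeholder tickers; a real fallback would also need the other schema columns.
    fallback_prices = synthetic_prices(["AAA", "BBB"])
```

Seeding the generator keeps the synthetic table identical across runs, which matches the reproducibility goal of the notebook.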

## Overview

- Set seed for reproducibility
- Load raw CSVs via `src.utils`
- Validate required columns and check missingness
- Score headlines with rule-based sentiment; aggregate by date and symbol
- Merge prices, returns, and sentiment into a tidy dataset and save it


## Seed, Imports, and Paths


In [None]:
from pathlib import Path

import pandas as pd

from src.utils import set_seed, load_prices, load_returns, load_headlines, validate_columns
from src.features import add_sentiment_scores

# Reproducibility
set_seed(42)

DATA_DIR = Path("data")
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"
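
The body of `set_seed` lives in `src.utils` and is not shown here. A minimal sketch of what such a helper might do, assuming it seeds Python's `random` module and NumPy's global generator (the real implementation may differ), is:

```python
import random

import numpy as np


def set_seed_sketch(seed: int) -> None:
    """Hypothetical stand-in for src.utils.set_seed; the actual helper is not shown."""
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding with the same value reproduces the same draws.
set_seed_sketch(42)
first = (random.random(), np.random.rand())
set_seed_sketch(42)
second = (random.random(), np.random.rand())
assert first == second
```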


## Data Dictionary (Schemas)

- prices.csv: `date`, `ticker`, `sector`, `close`, `volume`, `volatility`
- returns.csv: `date`, `ticker`, `return`
- headlines.csv: `date`, `symbol`, `headline`, `source` (optional), `created_at` (optional)
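
The schemas above can also be written down as data, which makes the column checks easy to reuse. The sketch below is illustrative: `REQUIRED_COLUMNS`, `OPTIONAL_COLUMNS`, and `missing_columns` are hypothetical names, and the real `validate_columns` in `src.utils` may behave differently (for example, by raising an error).

```python
import pandas as pd

# Required columns per file, copied from the data dictionary.
REQUIRED_COLUMNS = {
    "prices": ["date", "ticker", "sector", "close", "volume", "volatility"],
    "returns": ["date", "ticker", "return"],
    "headlines": ["date", "symbol", "headline"],
}
# Optional headline columns are tracked separately and not enforced.
OPTIONAL_COLUMNS = {"headlines": ["source", "created_at"]}


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


example = pd.DataFrame({"date": ["2024-01-02"], "ticker": ["AAPL"]})
print(missing_columns(example, REQUIRED_COLUMNS["returns"]))  # ['return']
```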


## Load and Validate Raw Data


In [None]:
prices = load_prices(RAW_DIR / "prices.csv")
returns = load_returns(RAW_DIR / "returns.csv")
headlines = load_headlines(RAW_DIR / "headlines.csv")

# Validate columns explicitly (defensive); empty frames are skipped
if not prices.empty:
    validate_columns(prices, ["date", "ticker", "sector", "close", "volume", "volatility"])
if not returns.empty:
    validate_columns(returns, ["date", "ticker", "return"])
if not headlines.empty:
    validate_columns(headlines, ["date", "symbol", "headline"])


## Missingness Checks


In [None]:
def summarize_missing(df, name: str):
    if df.empty:
        print(f"{name}: EMPTY")
        return
    print(name)
    display(df.isna().sum())

summarize_missing(prices, "prices")
summarize_missing(returns, "returns")
summarize_missing(headlines, "headlines")
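
Raw counts are hard to compare across tables of different lengths. One possible extension, sketched here with a hypothetical `missing_report` helper and a made-up toy frame, reports the missing fraction next to the count:

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count and fraction, worst columns first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    return report.sort_values("frac_missing", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
    "close": [101.5, None, 102.0, 103.25],
})
report = missing_report(toy)
print(report)
```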


## Rule-based Sentiment Scoring and Aggregation


In [None]:
# Score sentiment per headline
headlines_scored = add_sentiment_scores(headlines) if not headlines.empty else headlines

# Aggregate by date + symbol (ticker)
if not headlines_scored.empty:
    agg = (
        headlines_scored
        .groupby(["date", "symbol"], as_index=False)
        .agg(sentiment_score=("sentiment_score", "mean"),
             n_headlines=("headline", "count"))
    )
else:
    # pandas is already imported as pd in the first cell
    agg = pd.DataFrame(columns=["date", "symbol", "sentiment_score", "n_headlines"])
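
The scoring rules inside `add_sentiment_scores` are defined in `src.features` and are not shown in this notebook. To illustrate what "rule-based" can mean, the sketch below uses two small, made-up word lists and a hypothetical `score_headline` function, then applies the same date and symbol aggregation as the cell above. It is an assumption about the general approach, not the project's lexicon.

```python
import pandas as pd

# Illustrative word lists only; the lexicon used by src.features is not shown here.
POSITIVE = {"beats", "surges", "upgrade", "record"}
NEGATIVE = {"misses", "plunges", "downgrade", "lawsuit"}


def score_headline(text: str) -> float:
    """Score in [-1, 1]: (positive hits - negative hits) / total hits, 0.0 if no hits."""
    words = text.lower().split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


# Toy headlines for illustration only.
toy = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-02", "2024-01-02"],
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "headline": [
        "Apple beats estimates",
        "Apple faces lawsuit",
        "Microsoft holds annual event",
    ],
})
toy["sentiment_score"] = toy["headline"].map(score_headline)

daily = (
    toy.groupby(["date", "symbol"], as_index=False)
    .agg(sentiment_score=("sentiment_score", "mean"),
         n_headlines=("headline", "count"))
)
print(daily)
```

Averaging means one positive and one negative headline on the same day cancel out, so `n_headlines` is worth keeping alongside the score to distinguish "mixed news" from "no news".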


## Merge into Tidy Dataset and Save


In [None]:
# Align symbol->ticker for join
if not agg.empty:
    agg = agg.rename(columns={"symbol": "ticker"})

# Inner-join prices with returns, then left-join sentiment for a tidy panel
merged = None
if not prices.empty and not returns.empty:
    pr = prices.merge(returns, on=["date", "ticker"], how="inner")
    merged = pr.merge(agg, on=["date", "ticker"], how="left") if not agg.empty else pr.copy()
elif not prices.empty:
    merged = prices.copy()
elif not returns.empty:
    merged = returns.copy()

# Write processed output (header-only if empty)
from src.utils import write_csv_safe

if merged is None:
    merged = pd.DataFrame(columns=[
        "date", "ticker", "sector", "close", "volume", "volatility",
        "return", "sentiment_score", "n_headlines",
    ])

write_csv_safe(merged, PROCESSED_DIR / "merged.csv", index=False)
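
The join order in this cell can be checked on a tiny example. The sketch below uses made-up values and adds two things the notebook does not do: the `validate` argument of `DataFrame.merge`, which raises if the join keys are duplicated, and a fill of `n_headlines` with 0 for days without news. Both are optional suggestions.

```python
import pandas as pd

# Toy frames for illustration only.
prices = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03"],
    "ticker": ["AAPL", "AAPL"],
    "close": [185.6, 184.3],
})
returns = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03"],
    "ticker": ["AAPL", "AAPL"],
    "return": [0.001, -0.007],
})
sentiment = pd.DataFrame({
    "date": ["2024-01-02"],
    "ticker": ["AAPL"],
    "sentiment_score": [0.5],
    "n_headlines": [2],
})

keys = ["date", "ticker"]
# validate= raises pandas.errors.MergeError if keys are duplicated on either side.
panel = prices.merge(returns, on=keys, how="inner", validate="one_to_one")
panel = panel.merge(sentiment, on=keys, how="left", validate="one_to_one")

# Days without headlines have NaN sentiment after the left join.
# Treating them as zero headlines is a modelling choice, not part of the notebook.
panel["n_headlines"] = panel["n_headlines"].fillna(0).astype(int)
print(panel)
```

The left join keeps every price and return row, so missing sentiment shows up as `NaN` in `sentiment_score` instead of silently dropping the day.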


## Notes

- The cells above are laid out but have not been executed at this stage, so no outputs are shown.
- Downstream notebooks will consume `data/processed/merged.csv`.
