## Notebook 01 – Universe Construction

This notebook builds the stock universe by loading raw price data, applying basic filters,
and saving valid tickers for downstream analysis. It also exports the filtered price panel and daily log returns.

### Step 0 - Import packages and functions

In [1]:
import sys, os
sys.path.append(os.path.abspath("../src"))
import pandas as pd

# Helper functions to load price data and compute log returns
from data_loader import load_price_data
from factor_calculations import compute_log_returns



### Step 1 - Load raw wide-format prices

In [2]:
price_dir = "../data/raw_prices"
price_wide_raw = load_price_data(price_dir)

In [3]:
# Print initial universe
# Reference universe size (503 tickers in original S&P 500 file)
original_tickers = set(price_wide_raw.columns)
print(f"Initial universe: {len(original_tickers)} tickers")

Initial universe: 503 tickers


### Step 2 - Clean tickers and filter coverage

In [4]:
# Drop weekends
price_wide = price_wide_raw[price_wide_raw.index.dayofweek < 5]

In [5]:
# Drop duplicate timestamps
price_wide = price_wide[~price_wide.index.duplicated()]
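
The two cleaning steps above can be checked on a small toy panel. The following is a minimal sketch; the ticker `AAA` and the dates are made up for illustration, and the only assumption carried over from the notebook is that the index is a `DatetimeIndex`.

```python
import pandas as pd

# Toy price panel: a Monday that appears twice, a Tuesday, and a Saturday.
idx = pd.to_datetime([
    "2024-01-01",  # Monday
    "2024-01-01",  # duplicate Monday row
    "2024-01-02",  # Tuesday
    "2024-01-06",  # Saturday
])
toy = pd.DataFrame({"AAA": [10.0, 10.5, 11.0, 11.2]}, index=idx)

# dayofweek runs Monday=0 through Sunday=6, so "< 5" keeps Monday to Friday.
weekdays_only = toy[toy.index.dayofweek < 5]

# duplicated() flags every repeat after the first occurrence,
# so negating it keeps the first row for each timestamp.
deduped = weekdays_only[~weekdays_only.index.duplicated()]

print(deduped)
```

Note that the default `keep="first"` silently keeps the earliest row for a repeated timestamp; if the later row is the corrected one, `duplicated(keep="last")` would be the variant to use.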

In [6]:
# A price-based filter would normally drop tickers with potentially faulty prices, as follows:

# Keep tickers where at least 95% of *observed* prices exceed the 1st percentile of all prices
# price_threshold = price_wide.stack().quantile(0.01)  # or set manually if desired

# valid_price_fraction = (price_wide > price_threshold).sum() / price_wide.notna().sum()
# price_wide = price_wide.loc[:, valid_price_fraction >= 0.95]

# For this project: no price filter is applied; the universe relies solely on coverage and ticker list membership
# (assumes the input is already a curated S&P 500 list)
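
For reference, here is how the commented-out price filter behaves on toy data. This is only an illustrative sketch: the tickers `GOOD` and `BAD` and their prices are invented, and the filter is not part of this project's pipeline.

```python
import numpy as np
import pandas as pd

# Toy panel: GOOD has missing days but sane prices; BAD has ten near-zero prints.
toy = pd.DataFrame({
    "GOOD": [np.nan] * 20 + [50.0] * 80,
    "BAD": [0.01] * 10 + [50.0] * 90,
})

# 1st percentile of all observed prices, pooled across tickers (NaNs are ignored).
price_threshold = toy.stack().quantile(0.01)

# Comparisons with NaN are False, so the numerator counts observed prices above
# the threshold and the denominator counts all observed prices.
valid_price_fraction = (toy > price_threshold).sum() / toy.notna().sum()

kept = toy.loc[:, valid_price_fraction >= 0.95]
print(valid_price_fraction.to_dict())
print(list(kept.columns))
```

Because the fraction is taken over observed prices only, missing days do not count against a ticker here; they are handled separately by the coverage filter.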

In [7]:
# Coverage filter (in-sample universe)
min_coverage = 0.85  # Minimum fraction of trading days with a non-missing price
trading_days = price_wide.shape[0]  # Number of rows (trading days) in the cleaned panel
coverage = price_wide.notna().sum() / trading_days  # Per-ticker fraction of days with a price
valid_tickers = coverage[coverage >= min_coverage].index  # Tickers with at least 85% coverage
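
A small worked example makes the coverage rule concrete. The sketch below uses three invented tickers over ten rows; only the 85% threshold is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Ten "trading days" with different amounts of missing data per ticker.
toy = pd.DataFrame({
    "FULL": [1.0] * 10,                    # 10/10 observed
    "EDGE": [np.nan] + [1.0] * 9,          # 9/10 observed
    "SPARSE": [np.nan] * 3 + [1.0] * 7,    # 7/10 observed
})

min_coverage = 0.85
coverage = toy.notna().sum() / toy.shape[0]
valid = coverage[coverage >= min_coverage].index

print(coverage.to_dict())
print(list(valid))
```

`toy.notna().mean()` gives the same per-ticker fractions in one call. Since coverage is measured over the full sample, a ticker that listed late and has prices for less than 85% of the rows is dropped even if its recent history is complete.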

In [8]:
# Print universe size after the coverage filter
baseline_count = len(original_tickers)  # Tickers in the raw universe
valid_count = len(valid_tickers)  # Tickers that passed the coverage filter
coverage_pct = valid_count / baseline_count  # Share of the raw universe retained (not per-ticker coverage)

print(f"{valid_count} / {baseline_count} tickers passed full cleaning and coverage filter "
      f"({coverage_pct:.2%})")

450 / 503 tickers passed full cleaning and coverage filter (89.46%)


### Step 3 - Create and export filtered prices and valid tickers

In [9]:
# Final filtered price panel
price_filtered = price_wide[valid_tickers]

# Export filtered prices as a parquet file
price_filtered.to_parquet('../data/processed/price_filtered.parquet')

# Save metadata
valid_tickers.to_series().to_csv('../data/metadata/universe_tickers.csv', index=False)
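
Downstream notebooks will need to read the ticker list back. The sketch below round-trips a hypothetical three-ticker index through a temporary file, mirroring the `to_csv` call above; the temporary path and the tickers are placeholders, not the project's real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

valid_tickers = pd.Index(["AAA", "BBB", "CCC"])  # hypothetical tickers

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "universe_tickers.csv"

    # Same call as in the notebook: one header line, then one ticker per row.
    valid_tickers.to_series().to_csv(path, index=False)

    # keep_default_na=False stops pandas from parsing a symbol such as "NA" as missing.
    loaded = pd.read_csv(path, keep_default_na=False).iloc[:, 0].tolist()

print(loaded)
```

Selecting the first column by position avoids depending on the header name that pandas assigns to an unnamed Series.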

### Step 4 - Print length of price_filtered

In [10]:
# Print the number of trading days (rows) in the filtered panel
print(f"{len(price_filtered)} number of trading days")

3862 number of trading days


### Step 5 - Create and export log returns and log prices

In [11]:
# Check prices before taking log
if (price_filtered <= 0).any().any():
    raise ValueError("Filtered price data contains non-positive values. Cannot compute log.")

# Compute daily log returns
log_returns = compute_log_returns(price_filtered)

# Export log returns as a parquet file
log_returns.to_parquet("../data/processed/log_returns.parquet")
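
The body of `compute_log_returns` lives in `factor_calculations` and is not shown in this notebook. A plausible minimal version, assuming it computes the usual daily log return r_t = ln(P_t) - ln(P_{t-1}) per column, could look like the sketch below; `log_returns_sketch` is a hypothetical stand-in, not the project's function.

```python
import numpy as np
import pandas as pd


def log_returns_sketch(prices: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: first difference of log prices, column by column."""
    return np.log(prices).diff()


toy = pd.DataFrame({"AAA": [100.0, 110.0, 99.0]})
r = log_returns_sketch(toy)

# The first row is NaN because there is no previous price to compare against.
print(r)

# Log returns add up over time: the sum equals ln(last price / first price).
print(np.isclose(r["AAA"].sum(), np.log(99.0 / 100.0)))
```

The non-positive price check above matters because the log of zero or a negative price is undefined. Comparisons with `NaN` evaluate to `False`, so missing prices pass that check and simply propagate as `NaN` returns.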