# QuantDL Tutorial

A comprehensive guide to using QuantDL for alpha research.

**Contents:**
1. Setup & Configuration
2. Security Resolution
3. Daily Price Data
4. Fundamentals & Metrics
5. Time-Series Operators
6. Cross-Sectional Operators
7. Alpha Factor Example
8. Caching

---
## 1. Setup & Configuration

In [None]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Import quantdl
from quantdl import QuantDLClient, SecurityInfo
from quantdl.operators import (
    # Time-series
    ts_mean, ts_sum, ts_std, ts_min, ts_max, ts_delta, ts_delay,
    # Cross-sectional
    rank, zscore, demean, scale
)
import polars as pl
from datetime import date

# Initialize client
client = QuantDLClient()
print(f"QuantDL version: {__import__('quantdl').__version__}")
print("Client initialized successfully")

### Internal: Client Architecture

```
QuantDLClient
     |
     +---> S3StorageBackend ---> Polars scan_parquet() ---> S3 bucket
     |          |
     |          +---> storage_options (AWS credentials)
     |
     +---> DiskCache ---> ~/.quantdl/cache/
     |          |
     |          +---> metadata.json (LRU tracking)
     |          +---> data/*.parquet (cached files)
     |
     +---> SecurityMaster ---> PIT symbol resolution
```

**Code references:**
- `client.py:37-75` - Client initialization
- `storage/s3.py:20-52` - S3 backend with Polars native integration
- `storage/cache.py:28-57` - Cache initialization with LRU/TTL

---
## 2. Security Resolution

Resolve symbols, CIKs, or security IDs to `SecurityInfo` with point-in-time accuracy.

In [None]:
# Resolve single symbol
info = client.resolve("AAPL")
if info:
    print(f"Symbol: {info.symbol}")
    print(f"Security ID: {info.security_id}")
    print(f"Company: {info.company}")
    print(f"CIK: {info.cik}")
    print(f"CUSIP: {info.cusip}")
    print(f"PERMNO: {info.permno}")
else:
    print("AAPL not found in security master")

In [None]:
# Resolve by CIK (requires v0.1.1+ for int64 CIK columns)
try:
    by_cik = client.resolve("0000320193")  # Apple's CIK
    print(f"Resolved by CIK: {by_cik.symbol if by_cik else 'Not found'}")
except Exception as e:
    print(f"CIK resolution error (update to v0.1.1+): {type(e).__name__}")

In [None]:
# Search by partial name
try:
    results = client.security_master.search("Apple", limit=5)
    print(f"Found {len(results)} results:")
    for r in results:
        print(f"  {r.symbol}: {r.company}")
except Exception as e:
    print(f"Search error: {e}")

### Internal: Point-in-Time Resolution

```
resolve(identifier, as_of)
     |
     +---> Load security_master.parquet (cached)
     |
     +---> Filter: start_date <= as_of AND (end_date IS NULL OR end_date >= as_of)
     |
     +---> Match identifier against: symbol, security_id, cik, cusip, permno
     |          (v0.1.1+: cast to string for comparison)
     |
     +---> Return SecurityInfo or None
```

**Code references:**
- `data/security_master.py:60-101` - resolve() with PIT filtering
- `data/security_master.py:87-88` - Cast to Utf8 for type-safe comparison
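
The filter-then-match flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not QuantDL's implementation: the `Record` dataclass, the made-up rows, and the `resolve_pit` helper are hypothetical. Only the date filter and the compare-as-string step mirror the diagram.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One row of a hypothetical security master (illustrative fields only)."""
    security_id: int
    symbol: str
    cik: int
    start_date: date
    end_date: Optional[date]  # None means the record is still active


def resolve_pit(records, identifier, as_of):
    """Return the first record active on `as_of` that matches `identifier`."""
    for rec in records:
        active = rec.start_date <= as_of and (
            rec.end_date is None or rec.end_date >= as_of
        )
        if not active:
            continue
        # Compare as strings so the text "2002" matches the integer 2002
        candidates = {rec.symbol, str(rec.security_id), str(rec.cik)}
        if str(identifier) in candidates:
            return rec
    return None


# Made-up rows: the ticker "XYZ" is reused by two different securities
records = [
    Record(1, "XYZ", 1001, date(2000, 1, 1), date(2010, 12, 31)),
    Record(2, "XYZ", 2002, date(2015, 1, 1), None),
]

print(resolve_pit(records, "XYZ", date(2005, 6, 30)).security_id)  # 1
print(resolve_pit(records, "XYZ", date(2020, 6, 30)).security_id)  # 2
print(resolve_pit(records, "XYZ", date(2012, 6, 30)))              # None
print(resolve_pit(records, 2002, date(2020, 6, 30)).symbol)        # XYZ
```

The same ticker resolves to different securities depending on `as_of`, which is the point of the date filter. A zero-padded CIK string would need normalizing before this comparison; the sketch does not attempt that.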

---
## 3. Daily Price Data

Fetch OHLCV data as wide tables (dates as rows, symbols as columns).

In [None]:
# Try to fetch daily data from S3
try:
    prices = client.daily(
        ["AAPL", "MSFT", "GOOGL"],
        field="close",
        start="2024-01-01",
        end="2024-03-31"
    )
    print(f"Fetched from S3: {prices.shape}")
    print(prices.head())
    USE_MOCK_DATA = False
except Exception as e:
    print(f"S3 data not available: {type(e).__name__}")
    print("Using mock data for operator examples...")
    USE_MOCK_DATA = True
    
    # Create mock price data: one cumulative random walk per symbol
    import random
    from itertools import accumulate
    random.seed(42)
    dates = pl.date_range(date(2024, 1, 1), date(2024, 3, 31), eager=True)
    n = len(dates)

    def random_walk(start, drift, vol):
        # Each price is the previous price plus one Gaussian step
        steps = (random.gauss(drift, vol) for _ in range(n - 1))
        return list(accumulate(steps, initial=start))

    prices = pl.DataFrame({
        "timestamp": dates,
        "AAPL": random_walk(185.0, 0.5, 2),
        "MSFT": random_walk(375.0, 0.3, 1.5),
        "GOOGL": random_walk(140.0, 0.2, 1.8),
    })
    print(f"Mock data shape: {prices.shape}")
    print(prices.head())

### Internal: Daily Data Fetching

```
daily(symbols, field, start, end)
     |
     +---> Resolve symbols to security_ids via SecurityMaster
     |
     +---> Async parallel fetch (ThreadPoolExecutor, max_concurrency=10)
     |          |
     |          +---> For each security_id:
     |                   +---> Check cache
     |                   +---> If miss: S3 read data/raw/ticks/daily/{security_id}/history.parquet
     |                   +---> Filter by date range
     |
     +---> Build wide table:
     |          +---> Tag each df with symbol
     |          +---> Concat all: (timestamp, symbol, value)
     |          +---> Pivot: rows=timestamp, cols=symbol
     |
     +---> Sort by timestamp, return
```

**Code references:**
- `client.py:167-235` - daily() method
- `client.py:139-165` - Async parallel fetch
- `client.py:107-137` - Single security fetch with caching
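
The fetch-and-pivot flow above can be approximated with the standard library alone. This is a rough sketch, not the code in `client.py`: `FAKE_STORE`, `local_cache`, `fetch_one`, and `daily_wide` are hypothetical stand-ins, and plain dicts replace both S3 and the disk cache.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the S3 bucket and the local cache
FAKE_STORE = {
    101: [("2024-01-02", 10.0), ("2024-01-03", 11.0), ("2024-01-04", 12.0)],
    202: [("2024-01-03", 20.0), ("2024-01-04", 21.0)],
}
local_cache = {}


def fetch_one(security_id, start, end):
    """Return (date, value) rows for one security, using the cache if possible."""
    if security_id not in local_cache:  # cache miss: pretend to read from S3
        local_cache[security_id] = FAKE_STORE[security_id]
    # ISO date strings sort chronologically, so string comparison is enough
    return [(d, v) for d, v in local_cache[security_id] if start <= d <= end]


def daily_wide(symbol_to_id, start, end, max_workers=10):
    """Fetch all symbols in parallel, then pivot to rows=date, cols=symbol."""
    symbols = list(symbol_to_id)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(
            pool.map(lambda s: fetch_one(symbol_to_id[s], start, end), symbols)
        )

    # Long format: (timestamp, symbol, value)
    long_rows = [
        (d, sym, v) for sym, rows in zip(symbols, results) for d, v in rows
    ]

    # Pivot: one dict per date; symbols with no value on a date stay None
    by_date = {}
    for d, sym, v in long_rows:
        by_date.setdefault(d, dict.fromkeys(symbols))[sym] = v
    return [{"timestamp": d, **by_date[d]} for d in sorted(by_date)]


table = daily_wide({"AAA": 101, "BBB": 202}, "2024-01-02", "2024-01-04")
for row in table:
    print(row)
```

Dates on which a symbol has no data appear as `None` in that symbol's column, which is how gaps typically show up after a long-to-wide pivot.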

---
## 4. Fundamentals & Metrics

Fetch SEC filing fundamentals and derived metrics.

In [None]:
# Fundamentals (requires data in S3)
try:
    revenue = client.fundamentals(["AAPL"], concept="Revenue", start="2023-01-01", end="2024-06-30")
    print("Revenue data:")
    print(revenue)
except Exception as e:
    print(f"Fundamentals not available: {type(e).__name__}")

In [None]:
# Metrics (requires data in S3)
try:
    pe = client.metrics(["AAPL"], metric="pe_ratio", start="2023-01-01", end="2024-06-30")
    print("PE Ratio data:")
    print(pe)
except Exception as e:
    print(f"Metrics not available: {type(e).__name__}")

### Internal: Fundamentals/Metrics Fetching

```
fundamentals(symbols, concept, start, end)
     |
     +---> Resolve symbols to SecurityInfo (need CIK)
     |
     +---> Async parallel fetch by CIK:
     |          +---> S3 read data/raw/fundamental/{cik}/fundamental.parquet
     |
     +---> Filter by concept, pivot to wide table

metrics(symbols, metric, start, end)
     |
     +---> Same flow, but reads from:
              data/derived/features/fundamental/{cik}/metrics.parquet
```

**Code references:**
- `client.py:292-348` - fundamentals()
- `client.py:404-458` - metrics()

---
## 5. Time-Series Operators

Column-wise operations applied over time (down each column).

These operators work on wide DataFrames where:
- First column = timestamp
- Remaining columns = symbol values

In [None]:
print(f"Working with prices data: {prices.shape}")
print(prices.head())

In [None]:
# ts_mean: Rolling mean (moving average)
ma_20 = ts_mean(prices, 20)
print("20-day moving average:")
print(ma_20.tail())

In [None]:
# ts_std: Rolling standard deviation (of price levels here; apply it to returns to estimate volatility)
price_std = ts_std(prices, 20)
print("20-day rolling std of prices:")
print(price_std.tail())

In [None]:
# ts_delta: Difference (daily change)
daily_change = ts_delta(prices, 1)
print("1-day price change:")
print(daily_change.head(10))

In [None]:
# ts_delay: Lag values (each row takes the value from n rows earlier)
lagged = ts_delay(prices, 5)
print("Prices lagged by 5 days:")
print(lagged.head(10))

In [None]:
# ts_min, ts_max: Rolling min/max
rolling_high = ts_max(prices, 20)
rolling_low = ts_min(prices, 20)
print("20-day rolling high:")
print(rolling_high.tail())

### Internal: Time-Series Operators

```
ts_mean(df, window)
     |
     +---> Identify: date_col = columns[0], value_cols = columns[1:]
     |
     +---> For each value column:
     |          +---> Apply pl.col(c).rolling_mean(window)
     |
     +---> Return df with same structure
```

| Operator | Polars Function |
|----------|----------------|
| `ts_mean` | `rolling_mean` |
| `ts_sum` | `rolling_sum` |
| `ts_std` | `rolling_std` |
| `ts_min` | `rolling_min` |
| `ts_max` | `rolling_max` |
| `ts_delta` | `diff` |
| `ts_delay` | `shift` |

**Code references:**
- `operators/timeseries.py:10-85` - All 7 TS operators
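
To make the window arithmetic concrete, here is a plain-Python sketch of three of these operators on a single column. The helpers `rolling_mean`, `delay`, and `delta` are illustrative names, not QuantDL functions. The sketch returns `None` until a full window is available, which matches the default behaviour of Polars rolling functions; whether QuantDL changes that default is not stated here.

```python
def rolling_mean(values, window):
    """Mean of the last `window` values; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def delay(values, n):
    """Shift values down by n rows (0 <= n <= len), padding with None."""
    return [None] * n + values[: len(values) - n]


def delta(values, n):
    """Difference between each value and the value n rows earlier."""
    lagged = delay(values, n)
    return [None if p is None else v - p for v, p in zip(values, lagged)]


series = [10.0, 12.0, 11.0, 15.0, 14.0]
print(rolling_mean(series, 3))  # first two entries are None, third is 11.0
print(delay(series, 2))         # [None, None, 10.0, 12.0, 11.0]
print(delta(series, 1))         # [None, 2.0, -1.0, 4.0, -1.0]
```

The leading `None` values explain why the examples above print `.tail()` for rolling results: the first `window - 1` rows carry no value.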

---
## 6. Cross-Sectional Operators

Row-wise operations applied across symbols (across each row, i.e. at each date).

In [None]:
# rank: Cross-sectional rank (1 to N)
ranked = rank(prices)
print("Cross-sectional rank (each row: ranks 1,2,3 -> sum=6):")
print(ranked.head())

In [None]:
# zscore: Standardize across symbols (mean=0, std=1 per row)
standardized = zscore(prices)
print("Cross-sectional z-score:")
print(standardized.head())

In [None]:
# demean: Subtract row mean
demeaned = demean(prices)
print("Demeaned (each row sums to ~0):")
print(demeaned.head())

# Verify: sum across columns should be ~0
row_sums = demeaned.select(pl.sum_horizontal(pl.exclude("timestamp"))).to_series()
print(f"\nRow sums (should be ~0): {row_sums.head().to_list()}")

In [None]:
# scale: Scale so sum(|weights|) = target (gross exposure; used for dollar-neutral portfolios)
weights = scale(demeaned, target=1.0)
print("Scaled weights (sum of |weights| = 1):")
print(weights.head())

# Verify: sum of absolute values should be ~1
abs_sums = weights.select(pl.sum_horizontal(*[pl.col(c).abs() for c in weights.columns[1:]])).to_series()
print(f"\nAbs sums (should be ~1): {abs_sums.head().to_list()}")

### Internal: Cross-Sectional Operators

**rank():** Uses unpivot/pivot pattern
```
rank(df)
     |
     +---> Unpivot to long: (timestamp, symbol, value)
     |
     +---> Rank within each timestamp: pl.col("value").rank().over("timestamp")
     |
     +---> Pivot back to wide format
```

**zscore(), demean(), scale():** Use horizontal operations
```
zscore(df)
     |
     +---> row_mean = pl.mean_horizontal(*value_cols)
     |
     +---> row_std = pl.concat_list(...).list.eval(pl.element().std())
     |
     +---> Transform: (value - mean) / std for each column
```

**Code references:**
- `operators/crosssectional.py:16-50` - rank()
- `operators/crosssectional.py:53-76` - zscore()
- `operators/crosssectional.py:79-97` - demean()
- `operators/crosssectional.py:100-121` - scale()
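
The four operators reduce to simple arithmetic on one row. The sketch below uses plain Python and hypothetical helper names (`rank_row`, `demean_row`, `zscore_row`, `scale_row`). It assumes the sample standard deviation (ddof=1), which is the default for Polars `std()`, and it breaks rank ties by position; how QuantDL handles ties is not covered here.

```python
from statistics import mean, stdev


def rank_row(row):
    """Rank values 1..N (smallest = 1); ties are broken by position."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0] * len(row)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def demean_row(row):
    """Subtract the row mean from every value."""
    m = mean(row)
    return [x - m for x in row]


def zscore_row(row):
    """(value - mean) / sample standard deviation."""
    m, s = mean(row), stdev(row)
    return [(x - m) / s for x in row]


def scale_row(row, target=1.0):
    """Rescale so the absolute values sum to `target`."""
    total = sum(abs(x) for x in row)
    return [x * target / total for x in row]


row = [185.0, 375.0, 140.0]  # one date, three symbols
print(rank_row(row))         # [2, 3, 1]
print(demean_row(row))
print(zscore_row(row))
print(scale_row(demean_row(row)))
```

Demeaning before scaling is what makes the result dollar-neutral: the scaled values still sum to zero, while their absolute values sum to the target.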

---
## 7. Alpha Factor Example

Build a simple **20-day momentum** factor and create **dollar-neutral portfolio weights**.

### Strategy Concept
Stocks that went up recently tend to continue going up (momentum effect).

### Pipeline
1. Calculate 20-day returns (momentum)
2. Rank stocks by momentum cross-sectionally
3. Convert ranks to z-scores (center around 0)
4. Scale to dollar-neutral weights

In [None]:
# Expand to more symbols for better demonstration
if USE_MOCK_DATA:
    import random
    from itertools import accumulate
    random.seed(42)
    symbols = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "TSLA", "JPM", "V", "JNJ"]
    dates = pl.date_range(date(2024, 1, 1), date(2024, 6, 30), eager=True)
    n = len(dates)
    
    data = {"timestamp": dates}
    base_prices = {"AAPL": 185.0, "MSFT": 375.0, "GOOGL": 140.0, "AMZN": 150.0, "NVDA": 500.0,
                   "META": 350.0, "TSLA": 250.0, "JPM": 170.0, "V": 275.0, "JNJ": 160.0}
    for sym in symbols:
        drift = random.uniform(-0.1, 0.5)  # Random drift
        vol = random.uniform(1.0, 3.0)     # Random volatility
        # Cumulative random walk: each price is the previous price plus one Gaussian step
        steps = (random.gauss(drift, vol) for _ in range(n - 1))
        data[sym] = list(accumulate(steps, initial=base_prices[sym]))
    
    prices = pl.DataFrame(data)

print(f"Price data: {prices.shape}")
print(f"Symbols: {prices.columns[1:]}")
print(prices.head())

In [None]:
# Step 1: Calculate 20-day momentum (return)
# Momentum = (price_today - price_20d_ago) / price_20d_ago

price_20d_ago = ts_delay(prices, 20)
value_cols = prices.columns[1:]

# Calculate returns
momentum_data = {"timestamp": prices["timestamp"]}
for col in value_cols:
    momentum_data[col] = (prices[col] - price_20d_ago[col]) / price_20d_ago[col]
momentum = pl.DataFrame(momentum_data)

print("20-day momentum (returns):")
print(momentum.tail())

In [None]:
# Step 2: Rank momentum cross-sectionally
# Higher rank = higher momentum = we want to be long

momentum_rank = rank(momentum)
print(f"Momentum ranks (1-{len(value_cols)}):")
print(momentum_rank.tail())

In [None]:
# Step 3: Convert ranks to z-scores
# This centers around 0 (some positive, some negative)

alpha = zscore(momentum_rank)
print("Alpha (z-scored ranks):")
print(alpha.tail())

In [None]:
# Step 4: Scale to dollar-neutral weights
# Sum of |weights| = 1 (gross exposure); the weights themselves sum to ~0

weights = scale(alpha, target=1.0)
print("Portfolio weights (positive=long, negative=short):")
print(weights.tail())

In [None]:
# Verify dollar-neutrality
last_row = weights.tail(1)
weight_values = [last_row[col][0] for col in weights.columns[1:]]

print(f"Sum of weights: {sum(weight_values):.6f} (should be ~0)")
print(f"Sum of |weights|: {sum(abs(w) for w in weight_values):.6f} (should be ~1)")
print(f"\nLong positions: {sum(1 for w in weight_values if w > 0)}")
print(f"Short positions: {sum(1 for w in weight_values if w < 0)}")

### Alpha Pipeline Summary

```
prices
   |
   +---> ts_delay(20) ---> price_20d_ago
   |                            |
   +------- (prices - price_20d_ago) / price_20d_ago -----> momentum
                                                                |
                                                           rank()
                                                                |
                                                           zscore()
                                                                |
                                                           scale(target=1)
                                                                |
                                                           weights
                                                    (dollar-neutral portfolio)
```

**Interpretation:**
- **Positive weight** = Long position (bet stock goes up)
- **Negative weight** = Short position (bet stock goes down)
- **Dollar-neutral** = Equal dollar amounts long and short, so net exposure is ~0 and profit comes from relative performance (this is not the same as beta-neutral)
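
The pipeline can be traced by hand for a single date. The sketch below uses made-up prices for four hypothetical symbols and plain Python instead of the QuantDL operators, so the numbers are illustrative only.

```python
from statistics import mean, stdev

# Made-up prices for four hypothetical symbols, today and 20 rows earlier
price_now = {"A": 110.0, "B": 95.0, "C": 102.0, "D": 130.0}
price_then = {"A": 100.0, "B": 100.0, "C": 100.0, "D": 100.0}

# Step 1: momentum = (price_now - price_then) / price_then
momentum = {s: (price_now[s] - price_then[s]) / price_then[s] for s in price_now}

# Step 2: rank 1..N, so the highest momentum gets the highest rank
ordered = sorted(momentum, key=momentum.get)
ranks = {s: i for i, s in enumerate(ordered, start=1)}

# Step 3: z-score the ranks (sample standard deviation)
rank_values = list(ranks.values())
m, sd = mean(rank_values), stdev(rank_values)
alpha = {s: (r - m) / sd for s, r in ranks.items()}

# Step 4: scale so the absolute weights sum to 1
gross = sum(abs(a) for a in alpha.values())
weights = {s: a / gross for s, a in alpha.items()}

for s in price_now:
    print(f"{s}: momentum={momentum[s]:+.2f} rank={ranks[s]} weight={weights[s]:+.3f}")
```

Here the long side (A and D) sums to +0.5 and the short side (B and C) to -0.5. Because scaling divides out the standard deviation again, the final weights depend only on each rank's distance from the mean rank.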

---
## 8. Caching

Monitor and manage the local disk cache.

In [None]:
# Check cache statistics
stats = client.cache_stats()
print(f"Cached entries: {stats['entries']}")
print(f"Cache size: {stats['total_size_bytes'] / 1024 / 1024:.2f} MB")
print(f"Max size: {stats['max_size_bytes'] / 1024 / 1024 / 1024:.1f} GB")
print(f"TTL: {stats['ttl_seconds'] / 3600:.1f} hours")
print(f"Location: {stats['cache_dir']}")

In [None]:
# Clear cache if needed (uncomment to run)
# client.clear_cache()
# print("Cache cleared")

### Internal: Cache Architecture

```
~/.quantdl/cache/
     |
     +---> metadata.json
     |          {
     |            "s3_path": {
     |              "local_path": "data/abc123.parquet",
     |              "size_bytes": 1024000,
     |              "fetched_at": 1704067200.0,
     |              "last_accessed": 1704153600.0
     |            }
     |          }
     |
     +---> data/
              +---> {hash}.parquet
              ...
```

**Eviction policies:**
- **TTL (Time-to-Live):** Entries older than 24 hours are evicted on their next access
- **LRU (Least Recently Used):** When the cache exceeds 10 GB, the least recently accessed entries are removed first

**Code references:**
- `storage/cache.py:97-121` - LRU and TTL eviction
- `storage/cache.py:123-169` - get() with cache hit/miss
- `storage/cache.py:171-200` - put() with atomic writes
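
The two eviction policies can be shown together in a small in-memory sketch. `TinyCache` is a hypothetical class written for this illustration; it tracks only metadata, uses the field names from `metadata.json` above, and takes an injectable clock so the example can fake the passage of time. It is not the `DiskCache` implementation.

```python
import time


class TinyCache:
    """Illustrative TTL + LRU bookkeeping keyed by path."""

    def __init__(self, max_size_bytes, ttl_seconds, clock=time.time):
        self.max_size_bytes = max_size_bytes
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the example can control time
        self.meta = {}      # path -> size_bytes, fetched_at, last_accessed

    def get(self, path):
        entry = self.meta.get(path)
        if entry is None:
            return None  # cache miss
        now = self.clock()
        if now - entry["fetched_at"] > self.ttl_seconds:
            del self.meta[path]  # expired: evicted on access
            return None
        entry["last_accessed"] = now  # refresh LRU position
        return entry

    def put(self, path, size_bytes):
        now = self.clock()
        self.meta[path] = {
            "size_bytes": size_bytes,
            "fetched_at": now,
            "last_accessed": now,
        }
        # Evict least recently used entries until the cache fits again
        while sum(e["size_bytes"] for e in self.meta.values()) > self.max_size_bytes:
            oldest = min(self.meta, key=lambda p: self.meta[p]["last_accessed"])
            del self.meta[oldest]


now = [0.0]
demo_cache = TinyCache(max_size_bytes=100, ttl_seconds=60, clock=lambda: now[0])
demo_cache.put("a", 40)
now[0] = 10.0
demo_cache.put("b", 40)
now[0] = 20.0
demo_cache.get("a")             # "a" is now more recently used than "b"
now[0] = 30.0
demo_cache.put("c", 40)         # over budget, so "b" is evicted
print(sorted(demo_cache.meta))  # ['a', 'c']
now[0] = 100.0
print(demo_cache.get("a"))      # None: fetched at t=0, past the 60 s TTL
```

Note that TTL is measured from `fetched_at` while LRU ordering uses `last_accessed`, so a frequently read entry can still expire.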

---
## Summary

### Data APIs

| API | Description | Returns |
|-----|-------------|--------|
| `client.resolve(id)` | Symbol/CIK/CUSIP lookup | `SecurityInfo` or `None` |
| `client.daily(symbols, field)` | OHLCV prices | Wide DataFrame |
| `client.fundamentals(symbols, concept)` | SEC fundamentals | Wide DataFrame |
| `client.metrics(symbols, metric)` | Derived metrics | Wide DataFrame |

### Operators

| Operator | Type | Description |
|----------|------|-------------|
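| `ts_mean(df, n)` | Time-series | Rolling mean |
| `ts_sum(df, n)` | Time-series | Rolling sum |
| `ts_std(df, n)` | Time-series | Rolling std |
| `ts_min(df, n)` | Time-series | Rolling min |
| `ts_max(df, n)` | Time-series | Rolling max |
| `ts_delta(df, n)` | Time-series | Difference |
| `ts_delay(df, n)` | Time-series | Lag |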
| `rank(df)` | Cross-sectional | Rank 1 to N |
| `zscore(df)` | Cross-sectional | Standardize |
| `demean(df)` | Cross-sectional | Center around 0 |
| `scale(df, target)` | Cross-sectional | Normalize weights |

In [None]:
# Cleanup
client.close()
print("Done!")