# Feature Engineering Pitfalls for Time Series

## Common Mistakes That Create Lookahead Bias

---

## 🚨 If You Know sklearn But Not Time Series Feature Engineering, Read This First

**What you already know (from standard ML)**:
- `StandardScaler().fit(X)` on the whole dataset is fine
- `df['feature'] = df['value'].rolling(7).mean()` before split is fine
- `SelectKBest().fit(X, y)` on all data is fine
- Centered windows smooth nicely

**What's different with time series** (every one of these is a bug!):

| Standard ML | Time Series Bug | Why It's Wrong |
|-------------|-----------------|----------------|
| Fit scaler on all X | Test info leaks into training | Scaler knows test distribution |
| Rolling mean before split | Future leaks to past | Unshifted, `rolling_mean[t]` includes `y[t]`; test-row windows hold test-period values |
| SelectKBest on all data | Feature selection leaks | Selection optimized for test |
| Centered window | Uses future values | `center=True` looks ahead |

**The pattern**: Any statistic fitted "before split" or "on all data" can leak future information.

**Real-world impact**: A practitioner computed a 7-day rolling mean on the full series before splitting. The model reported a 35% improvement over baseline. After the fix, the improvement was 3%. The remaining 32 percentage points were pure leakage.

---

**What you'll learn:**
1. Why standard feature engineering patterns create leakage in time series
2. How rolling statistics, feature selection, and windowed calculations can look ahead
3. Safe patterns for creating lag features and rolling stats
4. How to validate your features with `gate_shuffled_target`

**Prerequisites:** Notebooks 01, 05

---

In [None]:
# Setup
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_absolute_error

from temporalcv.cv import WalkForwardCV
from temporalcv.gates import gate_shuffled_target, GateStatus

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete.")

In [None]:
# Generate AR(1) process
def generate_ar1(n=500, phi=0.9, sigma=1.0, seed=42):
    """Generate AR(1) process."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    y[0] = rng.normal(0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, n):
        y[t] = phi * y[t-1] + sigma * rng.normal()
    return y

# Generate data
series = generate_ar1(n=600, phi=0.9, seed=42)
print(f"Generated series with {len(series)} observations")

---

## Section 1: Rolling Statistics Before Split (Bug Category #2)

### The Problem

Computing rolling statistics on the **full series** before train/test split uses future information.

```python
# WRONG: Rolling mean computed on full series
df['rolling_mean'] = df['target'].rolling(10).mean()
train, test = df[:split], df[split:]
# Bug: Test features used future data!
```

### Why It's Wrong

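A 10-step rolling mean at time `t` averages the values from `t-9` to `t`. For rows in the test period, that window contains test-period observations, including `y[t]` itself unless the feature is shifted. A feature built this way already "knows" values the model is supposed to predict.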
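To make the window contents concrete, here is a minimal sketch on a toy series (the values are illustrative, not taken from this notebook). It compares an unshifted rolling mean with one shifted by one step. Using `shift(1)` is an assumption about how the feature would feed a model that predicts `y[t]`; it is one common way to keep the current value out of its own feature.

```python
import pandas as pd

# Toy series; values are illustrative only
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Unshifted: the window ending at t includes y[t] itself
unshifted = y.rolling(3).mean()

# Shifted: the window ending at t-1 becomes the feature for time t
shifted = y.shift(1).rolling(3).mean()

print(unshifted.iloc[3])  # mean of y[1], y[2], y[3] -> 3.0
print(shifted.iloc[3])    # mean of y[0], y[1], y[2] -> 2.0
```

The unshifted feature at index 3 already contains `y[3]`, so a model predicting `y[3]` from it gets part of the answer for free.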

In [None]:
# Demonstrate the rolling stats bug
window = 10
split_idx = 400

# WRONG: Rolling mean on full series
rolling_mean_wrong = pd.Series(series).rolling(window).mean().values

# Show where the problem occurs
print("Rolling Statistics Bug")
print("=" * 60)
print(f"\nWindow size: {window}")
print(f"Split index: {split_idx}")
print(f"\nFor the last training observation at index {split_idx - 1}:")
print(f"  Rolling mean uses indices: {split_idx - window} to {split_idx - 1}")
print(f"  This is FINE (all from training period)")

print(f"\nFor test observation at index {split_idx + 5}:")
print(f"  Rolling mean uses indices: {split_idx + 5 - window + 1} to {split_idx + 5}")
print(f"  Some of these ({split_idx}, {split_idx + 1}, ...) are from TEST period!")
print(f"  BUG: This rolling mean 'knows' test period values")

In [None]:
# Visualize the problem
fig, ax = plt.subplots(figsize=(14, 5))

t = np.arange(len(series))

# Mark train/test regions
ax.axvspan(0, split_idx, alpha=0.2, color='blue', label='Train period')
ax.axvspan(split_idx, len(series), alpha=0.2, color='red', label='Test period')

# Plot series and rolling mean
ax.plot(t, series, 'b-', alpha=0.5, linewidth=0.5, label='Original')
ax.plot(t, rolling_mean_wrong, 'r-', linewidth=2, label=f'Rolling mean (window={window})')

# Highlight problematic region
ax.axvspan(split_idx, split_idx + window, alpha=0.5, color='orange', 
           label='Contaminated region')

ax.set_xlabel('Time Index', fontsize=11)
ax.set_ylabel('Value', fontsize=11)
ax.set_title('Rolling Statistics Bug: Test Features Use Future Data',
             fontsize=13, fontweight='bold', color='red')
ax.legend(loc='upper left')
ax.set_xlim(split_idx - 50, split_idx + 50)

plt.tight_layout()
plt.show()

print("\nThe orange region shows where rolling mean is 'contaminated' by test data.")

In [None]:
# The fix: Compute rolling stats within each fold
def create_features_wrong(series):
    """WRONG: Rolling stats on full series."""
    df = pd.DataFrame({'y': series})
    df['lag_1'] = df['y'].shift(1)
    df['rolling_mean'] = df['y'].rolling(10).mean()  # BUG: Uses future!
    df['rolling_std'] = df['y'].rolling(10).std()    # BUG: Uses future!
    return df.dropna()

def create_features_right(series, is_train_mask):
    """RIGHT: Rolling stats computed from training data only."""
    df = pd.DataFrame({'y': series})
    df['lag_1'] = df['y'].shift(1)
    
    # Compute rolling stats, then blank out every test-period row so that
    # no window containing test observations survives
    rolling = df['y'].rolling(10)
    df['rolling_mean'] = rolling.mean().where(is_train_mask)
    df['rolling_std'] = rolling.std().where(is_train_mask)
    
    # Forward fill for test period (use last known training value)
    df['rolling_mean'] = df['rolling_mean'].ffill()
    df['rolling_std'] = df['rolling_std'].ffill()
    
    return df.dropna()

# Compare
print("Comparing WRONG vs RIGHT Approaches")
print("=" * 60)

df_wrong = create_features_wrong(series)
print(f"\nWRONG approach: Rolling stats computed on full series")
print(f"  At index {split_idx + 5}:")
print(f"  rolling_mean = {df_wrong.loc[split_idx + 5, 'rolling_mean']:.4f}")

is_train = np.zeros(len(series), dtype=bool)
is_train[:split_idx] = True
df_right = create_features_right(series, is_train)
print(f"\nRIGHT approach: Rolling stats from training only")
print(f"  At index {split_idx + 5}:")
print(f"  rolling_mean = {df_right.loc[split_idx + 5, 'rolling_mean']:.4f}")
print(f"  (Uses last known training mean, forward-filled)")

---

## Section 2: Feature Selection on Full Data (Bug Category #4)

### The Problem

Selecting features using the **full dataset** (including test data) creates leakage.

```python
# WRONG: Select features on full data
selector = SelectKBest(score_func=f_regression, k=5).fit(X_all, y_all)
X_selected = selector.transform(X_all)
train, test = X_selected[:split], X_selected[split:]
# Bug: Selection used test data information!
```

### Why It's Wrong

Feature selection scores (correlation, F-statistics) computed on the full dataset include test-period target values. The selected features are "optimized" for test data they shouldn't know about.
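One way to keep selection inside each fold is to wrap the selector and the model in a scikit-learn `Pipeline`, so that every training fold refits both. The sketch below is illustrative only: it uses synthetic data and scikit-learn's `TimeSeriesSplit` as a stand-in splitter, not the `WalkForwardCV` class used elsewhere in this notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (assumption for illustration): only column 0 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = 0.8 * X[:, 0] + rng.normal(size=300)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=5)),
    ("model", Ridge(alpha=1.0)),
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the selector never sees the corresponding test fold
scores = cross_val_score(
    pipe, X, y,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_absolute_error",
)
print(-scores)  # per-fold MAE
```

Because the selector is a pipeline step, there is no separate `fit` call on the full data to forget about.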

In [None]:
# Demonstrate feature selection leakage
rng = np.random.default_rng(42)

# Create features: some useful, some noise
n = len(series)
n_features = 20
n_useful = 5

# First n_useful features have some signal
X_all = np.zeros((n, n_features))
for i in range(n_useful):
    X_all[:, i] = np.roll(series, i + 1)  # Lag features
X_all[:(n_useful), :n_useful] = np.nan  # np.roll wraps around: mask the wrapped rows

# Rest are noise
X_all[:, n_useful:] = rng.normal(0, 1, (n, n_features - n_useful))

# Remove NaN rows
valid = ~np.isnan(X_all).any(axis=1)
X_all = X_all[valid]
y_all = series[valid]

# Split
split = int(len(X_all) * 0.7)
X_train, X_test = X_all[:split], X_all[split:]
y_train, y_test = y_all[:split], y_all[split:]

print("Feature Selection Comparison")
print("=" * 60)
print(f"\nTotal features: {n_features}")
print(f"Useful features: {n_useful} (lag features)")
print(f"Noise features: {n_features - n_useful}")

In [None]:
# WRONG: Feature selection on full data
selector_wrong = SelectKBest(score_func=f_regression, k=5)
selector_wrong.fit(X_all, y_all)  # Uses ALL data!
selected_wrong = selector_wrong.get_support(indices=True)

# RIGHT: Feature selection on training data only
selector_right = SelectKBest(score_func=f_regression, k=5)
selector_right.fit(X_train, y_train)  # Uses only training!
selected_right = selector_right.get_support(indices=True)

print("\nWRONG: Selection on full data")
print(f"  Selected features: {selected_wrong}")

print("\nRIGHT: Selection on training only")
print(f"  Selected features: {selected_right}")

# Check which are actually useful (lag features are indices 0-4)
useful_idx = set(range(n_useful))
print(f"\nActual useful features: {list(useful_idx)}")
print(f"WRONG correctly identified: {len(set(selected_wrong) & useful_idx)}/{n_useful}")
print(f"RIGHT correctly identified: {len(set(selected_right) & useful_idx)}/{n_useful}")

In [None]:
# Test for leakage with gate_shuffled_target
print("Leakage Detection with gate_shuffled_target")
print("=" * 60)

model = Ridge(alpha=1.0)

# Test WRONG approach
X_selected_wrong = selector_wrong.transform(X_all)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result_wrong = gate_shuffled_target(
        model, X_selected_wrong, y_all,
        method='effect_size',
        random_state=42
    )

print(f"\nWRONG (selection on full data):")
print(f"  Status: {result_wrong.status.value}")
print(f"  Improvement: {result_wrong.details.get('improvement_ratio', 0)*100:.1f}%")

# Test RIGHT approach (selection within CV)
# For proper testing, we need to do selection inside CV
print(f"\nRIGHT approach requires selection INSIDE each CV fold.")
print(f"See the safe feature engineering template below.")

---

## Section 3: Centered Windows (Bug Category #6)

### The Problem

Centered window calculations include **future** observations.

```python
# WRONG: Centered rolling mean
smoothed[t] = np.mean(series[t-3:t+4])  # Uses t+1, t+2, t+3!

# RIGHT: Backward-only window
smoothed[t] = np.mean(series[t-6:t+1])  # Uses only past + present
```

### Common Culprits
- `center=True` in pandas rolling
- Symmetric smoothing filters
- Moving average with equal weights on both sides
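A short check makes the lookahead in `center=True` explicit: for an odd window of 7, the centered mean at time `t` is the trailing mean at `t+3`. The sketch below uses a toy series chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy series; values are illustrative only
y = pd.Series(np.arange(20, dtype=float))

centered = y.rolling(7, center=True).mean()
trailing = y.rolling(7).mean()

# Pull the trailing mean back by 3 steps and compare with the centered mean
same = np.allclose(centered, trailing.shift(-3), equal_nan=True)
print(same)  # True
```

In other words, a centered 7-step feature is a backward-looking feature read three steps early.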

In [None]:
# Demonstrate centered vs backward-only windows
window = 7

# Centered window (WRONG for time series)
centered = pd.Series(series).rolling(window, center=True).mean().values

# Backward-only window (RIGHT)
backward = pd.Series(series).rolling(window, center=False).mean().values

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

t_demo = 50  # Time point to demonstrate
t_range = np.arange(t_demo - 10, t_demo + 10)

# Left: Centered (WRONG)
ax = axes[0]
ax.plot(t_range, series[t_range], 'b-o', markersize=4, label='Series')
ax.axvline(x=t_demo, color='black', linestyle='--', label=f'Time t={t_demo}')

# Highlight window for centered
half = window // 2
ax.axvspan(t_demo - half, t_demo + half, alpha=0.3, color='red', 
           label=f'Centered window')
ax.scatter([t_demo], [centered[t_demo]], color='red', s=100, zorder=5,
           marker='X', label=f'Centered mean')

ax.set_xlabel('Time')
ax.set_ylabel('Value')
ax.set_title(f'WRONG: Centered Window (center=True)\nUses FUTURE data!',
             fontsize=12, fontweight='bold', color='red')
ax.legend(loc='upper right', fontsize=9)

# Right: Backward (RIGHT)
ax = axes[1]
ax.plot(t_range, series[t_range], 'b-o', markersize=4, label='Series')
ax.axvline(x=t_demo, color='black', linestyle='--', label=f'Time t={t_demo}')

# Highlight window for backward
ax.axvspan(t_demo - window + 1, t_demo + 1, alpha=0.3, color='green',
           label=f'Backward window')
ax.scatter([t_demo], [backward[t_demo]], color='green', s=100, zorder=5,
           marker='o', label=f'Backward mean')

ax.set_xlabel('Time')
ax.set_ylabel('Value')
ax.set_title(f'RIGHT: Backward Window (center=False)\nUses only PAST data',
             fontsize=12, fontweight='bold', color='green')
ax.legend(loc='upper right', fontsize=9)

plt.tight_layout()
plt.show()

print(f"\nAt time t={t_demo}:")
print(f"  Centered mean (WRONG): uses data from t-{half} to t+{half}")
print(f"  Backward mean (RIGHT): uses data from t-{window-1} to t")

In [None]:
# Test centered vs backward windows for leakage
print("Centered vs Backward Window Leakage Test")
print("=" * 60)

# Create features with centered window (WRONG)
df = pd.DataFrame({'y': series})
df['lag_1'] = df['y'].shift(1)
df['centered_mean'] = df['y'].rolling(7, center=True).mean()  # WRONG!
df_wrong = df.dropna()

X_centered = df_wrong[['lag_1', 'centered_mean']].values
y_centered = df_wrong['y'].values

# Create features with backward window (RIGHT)
df = pd.DataFrame({'y': series})
df['lag_1'] = df['y'].shift(1)
df['backward_mean'] = df['y'].rolling(7, center=False).mean()  # RIGHT
df_right = df.dropna()

X_backward = df_right[['lag_1', 'backward_mean']].values
y_backward = df_right['y'].values

model = Ridge(alpha=1.0)

# Test centered
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result_centered = gate_shuffled_target(
        model, X_centered, y_centered,
        method='permutation',
        n_shuffles=30,
        random_state=42
    )

print(f"\nCentered window (center=True):")
print(f"  Status: {result_centered.status.value}")
print(f"  p-value: {result_centered.details.get('pvalue', float('nan')):.4f}")

# Test backward
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result_backward = gate_shuffled_target(
        model, X_backward, y_backward,
        method='permutation',
        n_shuffles=30,
        random_state=42
    )

print(f"\nBackward window (center=False):")
print(f"  Status: {result_backward.status.value}")
print(f"  p-value: {result_backward.details.get('pvalue', float('nan')):.4f}")

---

## Section 4: Safe Feature Engineering Template

### The Golden Rule

**All statistics must be computed using only past data relative to each observation.**
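The rule can be turned into a quick mechanical check: alter every value after some time `t` and confirm that the features up to `t` do not change. The helper below, `uses_only_past`, is a hypothetical name introduced for this sketch; it is not part of `temporalcv`.

```python
import numpy as np
import pandas as pd

def uses_only_past(feature_fn, series, t):
    """Return True if features up to index t ignore all values after t."""
    altered = series.copy()
    altered[t + 1:] = altered[t + 1:] + 1000.0  # perturb the future only
    before = feature_fn(series)[: t + 1]
    after = feature_fn(altered)[: t + 1]
    return bool(np.allclose(before, after, equal_nan=True))

def trailing_mean(x):
    return pd.Series(x).rolling(5).mean().to_numpy()

def centered_mean(x):
    return pd.Series(x).rolling(5, center=True).mean().to_numpy()

series = np.random.default_rng(0).normal(size=100)
print(uses_only_past(trailing_mean, series, t=60))  # True
print(uses_only_past(centered_mean, series, t=60))  # False
```

This check only catches features that read values after `t`. It does not catch a feature that includes `y[t]` itself, so it complements rather than replaces the other checks in this notebook.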

### Safe Pattern: Compute Inside CV Folds

```python
for train_idx, test_idx in cv.split(X, y):
    # Compute statistics using ONLY training data
    train_mean = X[train_idx].mean(axis=0)
    train_std = X[train_idx].std(axis=0)
    
    # Apply to both train and test
    X_train_normalized = (X[train_idx] - train_mean) / train_std
    X_test_normalized = (X[test_idx] - train_mean) / train_std  # Same stats!
```

In [None]:
# Safe feature engineering template
class SafeFeatureEngineer:
    """
    Feature engineering that respects temporal boundaries.
    
    All statistics computed during fit() use only training data.
    Transform() applies the same statistics to new data.
    """
    
    def __init__(self, window=10):
        self.window = window
        self.train_mean = None
        self.train_std = None
        
    def fit(self, series, y=None):
        """Compute statistics from training data only."""
        self.train_mean = np.mean(series)
        self.train_std = np.std(series)
        
        # Store last window values for rolling features
        self.last_window = series[-self.window:].copy()
        return self
    
    def transform(self, series):
        """Create features using only past information."""
        n = len(series)
        features = []
        
        # Lag features (always safe)
        for lag in [1, 2, 3]:
            lagged = np.concatenate([[np.nan]*lag, series[:-lag]])
            features.append(lagged)
        
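        # Backward-only rolling mean of strictly past values (shifted one step
        # so the current value, which is the target, is excluded)
        past = pd.Series(series).shift(1)
        rolling = past.rolling(self.window, min_periods=1).mean().values
        features.append(rolling)
        
        # Previous value, normalized using training statistics
        normalized = (past.values - self.train_mean) / self.train_std
        features.append(normalized)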
        
        X = np.column_stack(features)
        return X
    
    def fit_transform(self, series):
        return self.fit(series).transform(series)


# Demonstrate safe feature engineering in CV
print("Safe Feature Engineering in Walk-Forward CV")
print("=" * 60)

cv = WalkForwardCV(n_splits=5, window_type='expanding', test_size=50)
model = Ridge(alpha=1.0)
engineer = SafeFeatureEngineer(window=10)

all_maes = []
for i, (train_idx, test_idx) in enumerate(cv.split(series)):
    # Fit engineer on TRAINING data only
    engineer.fit(series[train_idx])
    
    # Transform both sets using training statistics
    X_train = engineer.transform(series[train_idx])
    X_test = engineer.transform(series[test_idx])
    y_train = series[train_idx]
    y_test = series[test_idx]
    
    # Remove NaN rows
    valid_train = ~np.isnan(X_train).any(axis=1)
    valid_test = ~np.isnan(X_test).any(axis=1)
    
    X_train = X_train[valid_train]
    y_train = y_train[valid_train]
    X_test = X_test[valid_test]
    y_test = y_test[valid_test]
    
    # Train and evaluate
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    all_maes.append(mae)
    
    print(f"  Fold {i+1}: Train size={len(X_train)}, Test size={len(X_test)}, MAE={mae:.4f}")

print(f"\nMean MAE: {np.mean(all_maes):.4f}")

In [None]:
# Validate the safe approach has no leakage
print("\nValidating Safe Approach with gate_shuffled_target")
print("=" * 60)

# Create features for full validation
engineer_full = SafeFeatureEngineer(window=10)
X_safe = engineer_full.fit_transform(series)
y_safe = series

# Remove NaN rows
valid = ~np.isnan(X_safe).any(axis=1)
X_safe = X_safe[valid]
y_safe = y_safe[valid]

model = Ridge(alpha=1.0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = gate_shuffled_target(
        model, X_safe, y_safe,
        method='permutation',
        n_shuffles=50,
        random_state=42
    )

print(f"\nSafe feature engineering:")
print(f"  Status: {result.status.value}")
print(f"  p-value: {result.details.get('pvalue', float('nan')):.4f}")
print(f"\n  A PASS status means the shuffled target test found no sign of leakage.")

---

## Pitfall Section

### Pitfall 1: Rolling Stats Before Split

```python
# WRONG: Compute on full data before split
df['rolling_mean'] = df['y'].rolling(10).mean()
train, test = df[:split], df[split:]

# RIGHT: Use expanding window or compute in CV
# Option A: Expanding window (always looks back)
df['expanding_mean'] = df['y'].expanding().mean()

# Option B: Compute inside CV folds
for train_idx, test_idx in cv.split(X, y):
    train_mean = pd.Series(y[train_idx]).rolling(10).mean()
    last_train_mean = train_mean.iloc[-1]  # Apply last training mean to test
```

### Pitfall 2: Feature Selection on Full Data

```python
# WRONG: Select features using all data
selector = SelectKBest(score_func=f_regression, k=5).fit(X_all, y_all)

# RIGHT: Select features inside each CV fold
for train_idx, test_idx in cv.split(X, y):
    selector = SelectKBest(score_func=f_regression, k=5).fit(X[train_idx], y[train_idx])
    X_train = selector.transform(X[train_idx])
    X_test = selector.transform(X[test_idx])
```

### Pitfall 3: Centered Windows

```python
# WRONG: Centered window uses future
df['smoothed'] = df['y'].rolling(7, center=True).mean()

# RIGHT: Backward-only window
df['smoothed'] = df['y'].rolling(7, center=False).mean()
```

### Pitfall 4: Standardization on Full Data

```python
# WRONG: Fit scaler on all data
scaler = StandardScaler().fit(X_all)

# RIGHT: Fit scaler on training only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Same scaler!
```
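How far the fitted statistics move depends on the data. A small synthetic example with an upward trend (an assumption chosen to make the effect obvious) shows the scaler's mean shifting once test rows are included:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic upward trend: later (test) values are larger than earlier ones
X = np.arange(100, dtype=float).reshape(-1, 1)
X_train, X_test = X[:70], X[70:]

leaky = StandardScaler().fit(X)        # sees the test period
safe = StandardScaler().fit(X_train)   # training period only

print(leaky.mean_[0], safe.mean_[0])  # 49.5 34.5
```

The leaky scaler centers the training rows around a mean that already reflects where the series goes next.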

### Pitfall 5: Target Encoding Without Temporal Awareness

```python
# WRONG: Category means using all data
category_means = df.groupby('category')['target'].mean()

# RIGHT: Category means from training only
train_means = df.iloc[:split].groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(train_means)
```
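The training-only mapping above relies on a single split point. A variant sometimes used when every row needs its own past-only encoding is an expanding mean over earlier rows of the same category. The sketch below assumes the rows are already sorted by time and uses toy data; it is an illustration, not a pattern taken from this notebook.

```python
import pandas as pd

# Toy data, assumed to be in chronological order
df = pd.DataFrame({
    "category": ["a", "b", "a", "a", "b", "a"],
    "target":   [1.0, 10.0, 3.0, 5.0, 20.0, 7.0],
})

# For each row, average the targets of EARLIER rows in the same category.
# shift(1) drops the current row's target from its own encoding.
df["category_encoded"] = (
    df.groupby("category")["target"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df)
```

The first occurrence of each category has no history, so its encoding is `NaN` and needs a fallback such as a global prior computed from training data.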

In [None]:
# Quick reference: Safe vs Unsafe patterns
print("QUICK REFERENCE: Safe vs Unsafe Patterns")
print("=" * 70)

patterns = [
    ("Rolling stats", 
     "df['y'].rolling(10).mean() before split",
     "Compute inside each CV fold"),
    ("Feature selection",
     "SelectKBest().fit(X_all, y_all)",
     "SelectKBest().fit(X_train, y_train)"),
    ("Window type",
     "rolling(7, center=True)",
     "rolling(7, center=False)"),
    ("Standardization",
     "StandardScaler().fit(X_all)",
     "StandardScaler().fit(X_train)"),
    ("Target encoding",
     "groupby('cat')['y'].mean() on full df",
     "groupby('cat')['y'].mean() on train only"),
]

print(f"\n{'Operation':<20} {'WRONG':<35} {'RIGHT'}")
print("-" * 90)
for op, wrong, right in patterns:
    print(f"{op:<20} {wrong:<35} {right}")

---

## Key Insights

### 1. Order Matters in Time Series
Every computation must respect temporal ordering. What works in tabular ML creates leakage in time series.

### 2. "Before Split" = Leakage
Any statistic computed before train/test split potentially uses future information.

### 3. Centered Windows Look Ahead
Always use `center=False` for rolling statistics in time series.

### 4. Feature Selection Must Happen Inside CV
Selecting features on the full dataset leaks test information into the selection process.

### 5. Validate with `gate_shuffled_target`
After feature engineering, run the shuffled target test to verify no leakage.

---

## Next Steps

- **07_threshold_leakage.ipynb**: Regime and percentile computation without lookahead
- **08_validation_workflow.ipynb**: Complete HALT/WARN/PASS pipeline
- **05_shuffled_target_gate.ipynb**: Review if needed

---

*"In time series, your features can only see the past. Anything else is cheating."*