# TimingAgent Training with DQN (Phase 2)

This notebook trains a Deep Q-Network (DQN) agent for single-asset market timing on SPY.

## Objectives:
1. Load featured data and create TimingEnv
2. Initialize and train DQN agent
3. Evaluate on validation set
4. Compare against baseline strategies
5. Analyze learned policy

**Key Innovation**: The RL agent learns entry/exit timing while accounting for transaction costs

In [None]:
# Setup
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import torch
warnings.filterwarnings('ignore')

# RL imports
from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import EvalCallback

# Project imports
from src.utils.config import ConfigLoader
from src.environments.timing_env import TimingEnv
from src.backtesting.baselines import BuyAndHold, SMAcrossover, compare_baselines

# Style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

# Check device availability
if hasattr(torch, 'xpu') and torch.xpu.is_available():  # older torch builds have no torch.xpu
    device = 'xpu'
    print(f"✓ Using Intel XPU: {torch.xpu.get_device_name(0)}")
else:
    device = 'cpu'
    print("✓ Using CPU (XPU not available)")

print("✓ Imports successful")

## 1. Load Configuration and Data

In [2]:
# Load configs
config_loader = ConfigLoader('../config')
timing_config = config_loader.load('timing_config')
cv_config = config_loader.load('cv_config')

print("TimingAgent Configuration:")
print(f"  Algorithm: {timing_config['agent']['algorithm']}")
print(f"  Reward type: {timing_config['environment']['reward_type']}")
print(f"  Initial cash: ${timing_config['environment']['initial_cash']:,}")
print(f"  Transaction costs: {timing_config['environment']['transaction_costs']}")
print(f"\nTraining:")
print(f"  Total timesteps: {timing_config['training']['total_timesteps']:,}")
print(f"  Eval frequency: {timing_config['training']['eval_freq']:,}")

# Load featured data
data = pd.read_parquet('../data/features/featured_data.parquet')
print(f"\n✓ Loaded featured data: {data.shape}")

# Filter to SPY
ticker = timing_config['data']['ticker']
if 'ticker' in data.columns:
    data = data[data['ticker'] == ticker].copy()
    print(f"  Filtered to {ticker}: {data.shape}")

print(f"  Date range: {data.index.min()} to {data.index.max()}")

INFO:src.utils.config:Loaded config: timing_config
INFO:src.utils.config:Loaded config: cv_config


TimingAgent Configuration:
  Algorithm: DQN
  Reward type: sharpe
  Initial cash: $100,000
  Transaction costs: {'commission_bps': 10, 'slippage_bps': 5}

Training:
  Total timesteps: 100,000
  Eval frequency: 5,000

✓ Loaded featured data: (19452, 45)
  Filtered to SPY: (4863, 45)
  Date range: 2005-09-02 00:00:00 to 2024-12-30 00:00:00
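
The configured costs are quoted in basis points (1 bps = 0.01%). As a rough illustration, the sketch below assumes both costs are charged in proportion to the traded notional; how `TimingEnv` applies them internally is not shown in this notebook, and `trade_cost` is a hypothetical helper rather than part of the project.

```python
def trade_cost(notional: float, commission_bps: float, slippage_bps: float) -> float:
    """Cost of one trade, assuming both fees scale linearly with traded notional."""
    total_bps = commission_bps + slippage_bps
    return notional * total_bps / 10_000


# Values from the configuration printed above: 10 bps commission, 5 bps slippage.
cost = trade_cost(100_000, commission_bps=10, slippage_bps=5)
print(f"Cost of trading $100,000 once: ${cost:,.2f}")
```

Under this assumption a round trip (entry plus exit) costs about 0.30% of the position, so a policy that switches positions frequently pays a substantial drag before any market move is counted.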


## 2. Train/Validation Split

In [3]:
# Split data (exclude test set)
test_start = pd.Timestamp(cv_config['test_set']['start_date'])
train_val_data = data[data.index < test_start].copy()

# Simple 85/15 split for initial training
train_size = int(len(train_val_data) * 0.85)
train_data = train_val_data.iloc[:train_size]
val_data = train_val_data.iloc[train_size:]

print(f"Data Split:")
print(f"  Train: {len(train_data):,} rows ({train_data.index.min()} to {train_data.index.max()})")
print(f"  Val:   {len(val_data):,} rows ({val_data.index.min()} to {val_data.index.max()})")
print(f"  Test:  Held out until {test_start.date()}")

Data Split:
  Train: 2,850 rows (2005-09-02 00:00:00 to 2016-12-28 00:00:00)
  Val:   504 rows (2016-12-29 00:00:00 to 2018-12-31 00:00:00)
  Test:  Held out until 2019-01-01


## 3. Create Trading Environment

In [4]:
# Get feature list
features = timing_config['data']['features']

# Validate features exist
missing_features = [f for f in features if f not in train_data.columns]
if missing_features:
    print(f"Warning: Missing features: {missing_features}")
    features = [f for f in features if f in train_data.columns]

print(f"Using {len(features)} features:")
for i, f in enumerate(features, 1):
    print(f"  {i:2d}. {f}")

# Create training environment
train_env = TimingEnv(
    data=train_data,
    config=timing_config['environment'],
    features=features
)

# Wrap in Monitor
train_env = Monitor(train_env, './logs/timing_agent/train')

print(f"\n✓ Created TimingEnv")
print(f"  Observation space: {train_env.observation_space.shape}")
print(f"  Action space: Discrete({train_env.action_space.n})")
print(f"  Actions: 0=Hold, 1=Long, 2=Short")

INFO:src.environments.timing_env:TimingEnv initialized: 17 features, 3 actions


Using 17 features:
   1. return_1d
   2. return_5d
   3. return_10d
   4. rsi
   5. rsi_norm
   6. macd
   7. macd_signal
   8. macd_diff
   9. sma_50
  10. sma_crossover
  11. ema_12
  12. ema_26
  13. bb_width
  14. atr
  15. atr_pct
  16. volume_ratio
  17. obv

✓ Created TimingEnv
  Observation space: (17,)
  Action space: Discrete(3)
  Actions: 0=Hold, 1=Long, 2=Short


## 4. Test Environment (Sanity Check)

In [5]:
# Test environment with random actions
print("Testing environment with 100 random actions...\n")

obs, info = train_env.reset()
episode_reward = 0

for step in range(100):
    action = train_env.action_space.sample()  # Random action
    obs, reward, done, truncated, info = train_env.step(action)
    episode_reward += reward
    
    if done or truncated:
        break

print(f"Random agent results:")
print(f"  Steps: {step + 1}")
print(f"  Episode reward: {episode_reward:.2f}")
print(f"  Final portfolio value: ${info['portfolio_value']:,.2f}")
print(f"  Total trades: {info['total_trades']}")

# Get episode stats
stats = train_env.env.get_episode_stats()
print(f"\nRandom Policy Performance:")
print(f"  Total return: {stats['total_return']:.2%}")
print(f"  Sharpe ratio: {stats['sharpe_ratio']:.2f}")
print(f"  Max drawdown: {stats['max_drawdown']:.2%}")

print("\n✓ Environment works correctly!")

Testing environment with 100 random actions...

Random agent results:
  Steps: 100
  Episode reward: -6867.68
  Final portfolio value: $77,413.58
  Total trades: 73

Random Policy Performance:
  Total return: -22.59%
  Sharpe ratio: -6.70
  Max drawdown: 22.59%

✓ Environment works correctly!
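
To make the reported statistics easier to interpret, here is a minimal sketch of how total return, maximum drawdown, and an annualized Sharpe ratio can be computed from a series of portfolio values. The exact formulas inside `get_episode_stats` are not shown in this notebook, so the function names, the 252-day annualization, and the zero risk-free rate are all assumptions.

```python
import math


def total_return(values):
    """Fractional change from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values):
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def sharpe_ratio(values, periods_per_year=252):
    """Annualized mean/std of per-step returns (risk-free rate assumed zero)."""
    returns = [b / a - 1.0 for a, b in zip(values, values[1:])]
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    std = math.sqrt(var)
    if std == 0:
        return 0.0
    return mean / std * math.sqrt(periods_per_year)


# Made-up portfolio path, for illustration only.
values = [100_000.0, 98_000.0, 99_000.0, 90_000.0, 80_000.0]
print(f"Total return: {total_return(values):.2%}")
print(f"Max drawdown: {max_drawdown(values):.2%}")
print(f"Sharpe ratio: {sharpe_ratio(values):.2f}")
```

When a path never rises above its starting value and ends at its lowest point, the maximum drawdown equals the magnitude of the total return. That is consistent with the random-policy output above, where both are 22.59%.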


## 5. Create DQN Agent

In [None]:
# Create evaluation environment
val_env = TimingEnv(
    data=val_data,
    config=timing_config['environment'],
    features=features
)
val_env = Monitor(val_env, './logs/timing_agent/eval')

# Create DQN agent
agent_config = timing_config['agent']

model = DQN(
    policy=agent_config['policy'],
    env=train_env,
    learning_rate=agent_config['learning_rate'],
    buffer_size=agent_config['buffer_size'],
    learning_starts=agent_config['learning_starts'],
    batch_size=agent_config['batch_size'],
    tau=agent_config['tau'],
    gamma=agent_config['gamma'],
    train_freq=agent_config['train_freq'],
    gradient_steps=agent_config['gradient_steps'],
    exploration_fraction=agent_config['exploration_fraction'],
    exploration_initial_eps=agent_config['exploration_initial_eps'],
    exploration_final_eps=agent_config['exploration_final_eps'],
    target_update_interval=agent_config['target_update_interval'],
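    tensorboard_log=agent_config['tensorboard_log'],
    policy_kwargs=agent_config['policy_kwargs'],  # without this, the net_arch printed below is never applied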
    device=device,  # Use XPU if available
    verbose=1
)

print("✓ Created DQN agent")
print(f"  Policy: {agent_config['policy']}")
print(f"  Device: {device}")
print(f"  Learning rate: {agent_config['learning_rate']}")
print(f"  Buffer size: {agent_config['buffer_size']:,}")
print(f"  Network architecture: {agent_config['policy_kwargs']['net_arch']}")
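
The three `exploration_*` settings control the epsilon-greedy schedule. Stable-Baselines3 documents `exploration_fraction` as the fraction of the training period over which the exploration rate is reduced; the sketch below assumes that reduction is linear. The function name and the numeric values are illustrative assumptions, since the actual values come from `timing_config` and are not printed in this notebook.

```python
def linear_epsilon(step, total_timesteps, initial_eps, final_eps, exploration_fraction):
    """Linearly anneal epsilon over the first `exploration_fraction` of training."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


# Illustrative values only; the notebook reads the real ones from timing_config.
for step in (0, 5_000, 10_000, 50_000):
    eps = linear_epsilon(
        step, 100_000, initial_eps=1.0, final_eps=0.05, exploration_fraction=0.1
    )
    print(f"step {step:>6,}: epsilon = {eps:.3f}")
```

With these example values the agent acts almost entirely at random for the first few thousand steps and settles at 5% random actions after 10,000 steps, so most of a 100,000-step run would be spent near the final epsilon.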

## 6. Train the Agent

**Note**: This will take 5-15 minutes depending on your system.

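Because the agent is created with `tensorboard_log` set, the `tensorboard` package must be installed (`pip install tensorboard`) before training; otherwise `model.learn` raises an ImportError. Once it is installed, you can monitor training progress in TensorBoard: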
```bash
tensorboard --logdir=./logs/timing_agent
```

In [7]:
# Create evaluation callback
eval_callback = EvalCallback(
    val_env,
    best_model_save_path='./models/timing_agent/best',
    log_path='./logs/timing_agent/eval',
    eval_freq=timing_config['training']['eval_freq'],
    n_eval_episodes=5,
    deterministic=True,
    render=False,
    verbose=1
)

# Train!
print("Starting training...\n")
print(f"Total timesteps: {timing_config['training']['total_timesteps']:,}")
print("This may take 5-15 minutes...\n")

model.learn(
    total_timesteps=timing_config['training']['total_timesteps'],
    callback=eval_callback,
    log_interval=100,
    progress_bar=True
)

print("\n✓ Training complete!")

# Save final model
model.save('./models/timing_agent/final_model')
print("✓ Saved final model")

Starting training...

Total timesteps: 100,000
This may take 5-15 minutes...



ImportError: Trying to log data to tensorboard but tensorboard is not installed.
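
The run above stopped because TensorBoard logging was requested but the package was missing. One way to avoid this is to check for the package before building the agent and fall back to `tensorboard_log=None`, which disables TensorBoard logging in Stable-Baselines3. This is a minimal sketch; `resolve_tensorboard_log` is a hypothetical helper, not part of the project or of Stable-Baselines3.

```python
import importlib.util


def resolve_tensorboard_log(log_dir, module_name="tensorboard"):
    """Return log_dir if the logging package is importable, otherwise None."""
    if importlib.util.find_spec(module_name) is None:
        print(f"'{module_name}' not installed; disabling TensorBoard logging")
        return None
    return log_dir


tensorboard_log = resolve_tensorboard_log("./logs/timing_agent")
print(f"tensorboard_log = {tensorboard_log!r}")
```

The returned value can then be passed as the `tensorboard_log` argument when constructing `DQN`, so training proceeds either way.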

## 7. Load Best Model and Evaluate

In [None]:
# Load best model from training
best_model = DQN.load('./models/timing_agent/best/best_model')
print("✓ Loaded best model (based on validation performance)\n")

# Evaluate on validation set
print("Evaluating on validation set (10 episodes)...\n")

val_rewards = []
val_stats = []

for ep in range(10):
    obs, info = val_env.reset()
    done = False
    truncated = False
    episode_reward = 0
    
    while not (done or truncated):
        action, _ = best_model.predict(obs, deterministic=True)
        obs, reward, done, truncated, info = val_env.step(action)
        episode_reward += reward
    
    stats = val_env.env.get_episode_stats()
    val_rewards.append(episode_reward)
    val_stats.append(stats)
    
    print(f"Episode {ep+1:2d}: Return={stats['total_return']:+.2%}, Sharpe={stats['sharpe_ratio']:.2f}, Trades={stats['total_trades']}")

# Aggregate results
print(f"\n{'='*60}")
print("DQN Agent Validation Results:")
print(f"{'='*60}")
print(f"Mean reward: {np.mean(val_rewards):.2f} ± {np.std(val_rewards):.2f}")
print(f"Mean return: {np.mean([s['total_return'] for s in val_stats]):.2%}")
print(f"Mean Sharpe: {np.mean([s['sharpe_ratio'] for s in val_stats]):.2f}")
print(f"Mean max DD: {np.mean([s['max_drawdown'] for s in val_stats]):.2%}")
print(f"Mean trades: {np.mean([s['total_trades'] for s in val_stats]):.1f}")

## 8. Compare with Baseline Strategies

In [None]:
# Run baseline strategies on validation set
print("Running baseline strategies on validation set...\n")

baseline_results = compare_baselines(val_data)

print("Baseline Strategy Results:")
print(baseline_results.to_string(index=False))

# Add DQN results
dqn_results = pd.DataFrame([{
    'strategy': 'DQN Agent',
    'total_return': np.mean([s['total_return'] for s in val_stats]),
    'sharpe_ratio': np.mean([s['sharpe_ratio'] for s in val_stats]),
    'max_drawdown': np.mean([s['max_drawdown'] for s in val_stats]),
    'total_trades': np.mean([s['total_trades'] for s in val_stats]),
    'final_value': timing_config['environment']['initial_cash'] * (1 + np.mean([s['total_return'] for s in val_stats]))
}])

all_results = pd.concat([baseline_results, dqn_results], ignore_index=True)

print(f"\n{'='*80}")
print("COMPARISON: DQN vs Baselines")
print(f"{'='*80}")
print(all_results.to_string(index=False))
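
For context on what the SMA crossover baseline does, here is a minimal sketch of the signal: hold a long position while a fast moving average is above a slow one, and stay flat otherwise. The internals of `compare_baselines` and `SMAcrossover` are not shown in this notebook, so the function names, the window lengths, and the long/flat rule are assumptions chosen for illustration.

```python
def sma(prices, window):
    """Simple moving average; None until `window` prices are available."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window : i + 1]) / window)
    return out


def sma_crossover_positions(prices, fast=2, slow=4):
    """1 (long) when the fast SMA is above the slow SMA, else 0 (flat)."""
    fast_sma, slow_sma = sma(prices, fast), sma(prices, slow)
    return [
        1 if f is not None and s is not None and f > s else 0
        for f, s in zip(fast_sma, slow_sma)
    ]


# Toy price series; very short windows are used so the crossover is visible.
prices = [10, 11, 12, 13, 12, 11, 10, 9]
print(sma_crossover_positions(prices))
```

In a backtest, the position computed on one bar would normally be applied to the next bar's return to avoid look-ahead bias.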

## 9. Visualize Performance

In [None]:
# Run single episode to get detailed trajectory
obs, info = val_env.reset()
done = False
truncated = False

portfolio_values = []
actions_taken = []
prices = []

while not (done or truncated):
    action, _ = best_model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = val_env.step(action)
    
    portfolio_values.append(info['portfolio_value'])
    actions_taken.append(int(action))  # predict() returns a NumPy array; store a plain int
    prices.append(info['current_price'])

# Create figure with subplots
fig, axes = plt.subplots(3, 1, figsize=(16, 12))

# Plot 1: Portfolio Value
axes[0].plot(portfolio_values, linewidth=2, color='darkgreen', label='DQN Agent')
axes[0].axhline(timing_config['environment']['initial_cash'], color='gray', linestyle='--', alpha=0.5, label='Initial Capital')
axes[0].set_title('Portfolio Value Over Time', fontweight='bold', fontsize=14)
axes[0].set_ylabel('Portfolio Value ($)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: SPY Price
axes[1].plot(prices, linewidth=1.5, color='steelblue', label='SPY Price')
axes[1].set_title('SPY Price', fontweight='bold', fontsize=14)
axes[1].set_ylabel('Price ($)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Actions Taken
action_colors = ['gray', 'green', 'red']  # Hold, Long, Short
action_labels = ['Hold', 'Long', 'Short']
for action_val in [0, 1, 2]:
    action_steps = [i for i, a in enumerate(actions_taken) if a == action_val]
    if action_steps:
        axes[2].scatter(action_steps, [action_val] * len(action_steps), 
                       c=action_colors[action_val], label=action_labels[action_val],
                       alpha=0.6, s=20)

axes[2].set_title('Agent Actions', fontweight='bold', fontsize=14)
axes[2].set_ylabel('Action')
axes[2].set_xlabel('Time Step')
axes[2].set_yticks([0, 1, 2])
axes[2].set_yticklabels(['Hold', 'Long', 'Short'])
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Action distribution
action_counts = pd.Series(actions_taken).value_counts().sort_index()
print("\nAction Distribution:")
for action_val, count in action_counts.items():
    pct = count / len(actions_taken) * 100
    print(f"  {action_labels[action_val]:6s}: {count:4d} ({pct:5.1f}%)")

## 10. Performance Comparison Chart

In [None]:
# Create comparison bar charts
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Total Return
axes[0].bar(all_results['strategy'], all_results['total_return'] * 100, 
           color=['steelblue', 'orange', 'lightcoral', 'darkgreen'])
axes[0].set_title('Total Return', fontweight='bold')
axes[0].set_ylabel('Return (%)')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

# Sharpe Ratio
axes[1].bar(all_results['strategy'], all_results['sharpe_ratio'], 
           color=['steelblue', 'orange', 'lightcoral', 'darkgreen'])
axes[1].set_title('Sharpe Ratio', fontweight='bold')
axes[1].set_ylabel('Sharpe Ratio')
axes[1].tick_params(axis='x', rotation=45)
axes[1].axhline(0, color='black', linestyle='-', linewidth=0.5)
axes[1].grid(True, alpha=0.3, axis='y')

# Max Drawdown
axes[2].bar(all_results['strategy'], all_results['max_drawdown'] * 100, 
           color=['steelblue', 'orange', 'lightcoral', 'darkgreen'])
axes[2].set_title('Max Drawdown', fontweight='bold')
axes[2].set_ylabel('Drawdown (%)')
axes[2].tick_params(axis='x', rotation=45)
axes[2].invert_yaxis()  # Lower is better
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## Summary

In this notebook, we:
1. ✓ Created and tested TimingEnv trading environment
2. ✓ Initialized DQN agent with stable-baselines3
3. ✓ Trained agent on SPY historical data
4. ✓ Evaluated on validation set
5. ✓ Compared against Buy & Hold and SMA Crossover baselines
6. ✓ Visualized learned policy and performance

**Key Findings** (to verify after a completed training run):
- Whether the DQN agent learns a non-random policy (check the action distribution)
- How its risk-adjusted returns (Sharpe ratio) compare with the baselines
- How strongly transaction costs (10 bps commission + 5 bps slippage) affect profitability
- Whether the agent's behavior makes intuitive sense or needs debugging

**Next Steps**:
1. **Hyperparameter Tuning**: Try different learning rates, buffer sizes, reward functions
2. **Longer Training**: Increase total_timesteps to 500k-1M
3. **Advanced Rewards**: Experiment with Sortino or Drawdown-aware rewards
4. **Walk-Forward Training**: Train on multiple CV folds for robustness
5. **Test Set Evaluation**: Final evaluation on held-out 2019-2024 data
6. **Phase 3**: Move to PortfolioAgent with continuous actions!
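
As a starting point for the walk-forward step, the sketch below generates expanding-window folds over row indices: each fold trains on everything before its validation block, and the blocks move forward in time. The fold layout in `cv_config` is not shown in this notebook, so the function name, the number of folds, and the 252-row validation size are assumptions.

```python
def walk_forward_folds(n_rows, n_folds, val_size):
    """Yield (train_end, val_end) index pairs for expanding-window folds.

    Fold k trains on rows [0, train_end) and validates on [train_end, val_end).
    """
    first_train_end = n_rows - n_folds * val_size
    if first_train_end <= 0:
        raise ValueError("Not enough rows for the requested folds")
    for k in range(n_folds):
        train_end = first_train_end + k * val_size
        yield train_end, train_end + val_size


# 3,354 rows is the train+validation span used in this notebook (2,850 + 504).
for train_end, val_end in walk_forward_folds(n_rows=3_354, n_folds=3, val_size=252):
    print(f"train rows [0, {train_end}), validate rows [{train_end}, {val_end})")
```

Each pair can be used with `train_val_data.iloc[:train_end]` and `train_val_data.iloc[train_end:val_end]` to build a fresh training and validation environment per fold.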

**Model Artifacts**:
- Best model: `./models/timing_agent/best/best_model.zip`
- Final model: `./models/timing_agent/final_model.zip`
- Logs: `./logs/timing_agent/`

Load saved model later:
```python
from stable_baselines3 import DQN
model = DQN.load('./models/timing_agent/best/best_model')
```