# PPO Training for Stock Price Prediction inference

This notebook handles the PPO (Proximal Policy Optimization) training component of the two-stage framework.

## Prerequisites:
- Run `llm_ppo_stock_prediction.ipynb` first to generate LLM predictions
- Ensure the checkpoint files exist in the `../results/` directory

## What this notebook does:
1. Loads LLM prediction checkpoints
2. Prepares data for PPO training
3. Defines risk-aware environment
4. Trains PPO agent
5. Applies PPO adjustments to predictions
6. Evaluates results

## 1. Import Libraries

In [1]:
# Install required packages for progress bar
!pip install "stable-baselines3[extra]"

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
# Import libraries
import os
import json
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# HTTP requests for HF endpoint
import requests

# Environment variables
from dotenv import load_dotenv

# Reinforcement Learning
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Progress bar
from tqdm import tqdm

# Set random seeds for reproducibility
np.random.seed(42)

print("All libraries imported successfully!")

All libraries imported successfully!


## 2. Load Data and Checkpoints

In [3]:
# Load original JSONL data
def load_jsonl(filepath):
    """Load JSONL file"""
    data = []
    with open(filepath, 'r') as f:
        for line in f:
            data.append(json.loads(line))
    return data

train_data = load_jsonl('../finetune_paper/train.jsonl')
val_data = load_jsonl('../finetune_paper/val.jsonl')
test_data = load_jsonl('../finetune_paper/test.jsonl')

print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print(f"Test samples: {len(test_data)}")

Training samples: 8698
Validation samples: 1243
Test samples: 2477


In [4]:
def safe_float(value, default=0.0) -> float:
    """Safely convert a value to float."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float(default)

def parse_prompt_data(prompt_text):
    """Extract key information from prompt text."""
    lines = prompt_text.split('\n')
    data = {}
    
    for line in lines:
        if 'TICKER:' in line:
            data['ticker'] = line.split('TICKER:')[1].strip()
        elif 'DATE:' in line:
            data['date'] = line.split('DATE:')[1].strip()
        elif 'RECENT CLOSING PRICES' in line:
            if ':' in line:
                prices_part = line.split(':', 1)[1].strip()
                if '(' in prices_part:
                    prices_part = prices_part.split('(')[0].strip()
                try:
                    data['recent_prices'] = [float(p.strip()) for p in prices_part.split(',') if p.strip()]
                except ValueError:
                    data['recent_prices'] = []
    
    return data

print("Data preparation functions defined.")

Data preparation functions defined.
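As a quick check of what `parse_prompt_data` extracts, the sketch below runs a condensed copy of the parser on a hypothetical prompt. The prompt layout, including the trailing parenthetical, is an assumption made for illustration; the real prompts in `train.jsonl` may be formatted differently. The price values are borrowed from the AAPL row shown later in this notebook.

```python
def parse_prompt_data(prompt_text):
    """Condensed copy of the parser above, repeated so this example runs on its own."""
    data = {}
    for line in prompt_text.split('\n'):
        if 'TICKER:' in line:
            data['ticker'] = line.split('TICKER:')[1].strip()
        elif 'DATE:' in line:
            data['date'] = line.split('DATE:')[1].strip()
        elif 'RECENT CLOSING PRICES' in line and ':' in line:
            # Keep the text after the first colon and drop any trailing "(...)" note
            prices_part = line.split(':', 1)[1].split('(')[0].strip()
            try:
                data['recent_prices'] = [float(p) for p in prices_part.split(',') if p.strip()]
            except ValueError:
                data['recent_prices'] = []
    return data

# Hypothetical prompt text (assumed layout, not taken from the dataset)
sample_prompt = (
    "TICKER: AAPL\n"
    "DATE: 2023-01-03\n"
    "RECENT CLOSING PRICES: 130.03, 126.04, 129.61, 129.93, 125.07 (oldest to newest)\n"
)

parsed = parse_prompt_data(sample_prompt)
print(parsed)
```

Because the parser matches on substrings, a prompt that lacks one of these markers simply yields a dictionary without that key, which is why the later cells check for a missing `recent_prices` column.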


## 3. Data Preparation Functions

In [5]:
# Prepare test data from checkpoint
print("Loading test data from ../results/llm_predictions_checkpoint.json...")
try:
    with open('../results/llm_predictions_checkpoint.json', 'r') as f:
        test_checkpoint = json.load(f)
    test_llm_results = test_checkpoint.get('llm_results', [])

    test_parsed = []
    for idx, item in enumerate(test_data):
        if idx >= len(test_llm_results): break
        parsed = parse_prompt_data(item['prompt'])
        response = json.loads(item['response'])
        llm_output = test_llm_results[idx]

        if isinstance(llm_output, dict) and llm_output.get('predicted_close') is not None:
            parsed['llm_prediction'] = safe_float(llm_output.get('predicted_close'), response['predicted_close'])
        else:
            parsed['llm_prediction'] = response['predicted_close']

        parsed['actual_price'] = response['predicted_close']
        parsed['llm_likelihood'] = safe_float(llm_output.get('likelihood') if isinstance(llm_output, dict) else None, 0.5)
        test_parsed.append(parsed)

    test_df = pd.DataFrame(test_parsed)
    if 'recent_prices' not in test_df.columns:
        test_df['recent_prices'] = test_df['llm_prediction'].apply(lambda x: [float(x)] * 5 if pd.notna(x) else [0.0] * 5)
    test_df['llm_prediction'] = test_df['llm_prediction'].fillna(test_df['actual_price'])
    test_df['llm_likelihood'] = test_df['llm_likelihood'].fillna(0.5)
    print(f"Successfully prepared {len(test_df)} test samples.")
    display(test_df.head())
except Exception as e:
    print(f"Error loading test checkpoint: {e}")
    test_df = pd.DataFrame()

Loading test data from ../results/llm_predictions_checkpoint.json...
Successfully prepared 2477 test samples.


| | ticker | date | recent_prices | llm_prediction | actual_price | llm_likelihood |
|---|---|---|---|---|---|---|
| 0 | HSBC | 2023-01-03 | [31.07, 31.03, 31.21, 31.16, 31.63] | 31.63 | 32.68 | 0.8 |
| 1 | 0700.HK | 2023-01-03 | [304.1191, 309.8178, 318.3658, 317.226, 327.8636] | 0.0 | 342.870056 | 0.0 |
| 2 | PEP | 2023-01-03 | [183.07, 181.75, 181.98, 180.66, 179.41] | 181.0 | 178.970001 | 0.7 |
| 3 | AAPL | 2023-01-03 | [130.03, 126.04, 129.61, 129.93, 125.07] | 130.03 | 126.360001 | 0.5 |
| 4 | 7203.T | 2023-01-04 | [1817.5, 1819.0, 1817.0, 1812.5, 1799.0] | 1817.5 | 1807.5 | 0.8 |


In [6]:
# Prepare validation data from checkpoint
print("Loading validation data from ../results/llm_predictions_val_checkpoint.json...")
try:
    with open('../results/llm_predictions_val_checkpoint.json', 'r') as f:
        val_checkpoint = json.load(f)
    val_llm_results = val_checkpoint.get('llm_results', [])

    val_parsed = []
    for idx, item in enumerate(val_data):
        if idx >= len(val_llm_results): break
        parsed = parse_prompt_data(item['prompt'])
        response = json.loads(item['response'])
        llm_output = val_llm_results[idx]

        if isinstance(llm_output, dict) and llm_output.get('predicted_close') is not None:
            parsed['llm_prediction'] = safe_float(llm_output.get('predicted_close'), response['predicted_close'])
        else:
            parsed['llm_prediction'] = response['predicted_close']

        parsed['actual_price'] = response['predicted_close']
        parsed['llm_likelihood'] = safe_float(llm_output.get('likelihood') if isinstance(llm_output, dict) else None, 0.5)
        val_parsed.append(parsed)

    val_df_ppo = pd.DataFrame(val_parsed)
    if 'recent_prices' not in val_df_ppo.columns:
        val_df_ppo['recent_prices'] = val_df_ppo['llm_prediction'].apply(lambda x: [float(x)] * 5 if pd.notna(x) else [0.0] * 5)
    val_df_ppo['llm_prediction'] = val_df_ppo['llm_prediction'].fillna(val_df_ppo['actual_price'])
    val_df_ppo['llm_likelihood'] = val_df_ppo['llm_likelihood'].fillna(0.5)
    print(f"Successfully prepared {len(val_df_ppo)} validation samples.")
    display(val_df_ppo.head())
except Exception as e:
    print(f"Error loading validation checkpoint: {e}")
    val_df_ppo = pd.DataFrame()

Loading validation data from ../results/llm_predictions_val_checkpoint.json...
Successfully prepared 1243 validation samples.


| | ticker | date | recent_prices | llm_prediction | actual_price | llm_likelihood |
|---|---|---|---|---|---|---|
| 0 | 0700.HK | 2022-01-03 | [415.2041, 410.0417, 408.7512, 421.104, 418.3385] | 420.0 | 414.835388 | 0.8 |
| 1 | HSBC | 2022-01-03 | [30.15, 30.22, 30.17, 30.15, 30.45] | 30.15 | 31.82 | 0.8 |
| 2 | AAPL | 2022-01-03 | [179.29, 179.38, 178.2, 177.57, 182.01] | 179.29 | 179.699997 | 0.8 |
| 3 | PEP | 2022-01-03 | [172.36, 172.97, 172.67, 173.71, 172.98] | 173.0 | 173.229996 | 0.8 |
| 4 | AAPL | 2022-01-04 | [179.38, 178.2, 177.57, 182.01, 179.7] | 179.38 | 174.919998 | 0.8 |


In [7]:
# Prepare training data from checkpoint
print("Loading training data from ../results/llm_predictions_train_checkpoint.json...")
try:
    with open('../results/llm_predictions_train_checkpoint.json', 'r') as f:
        train_checkpoint = json.load(f)
    train_llm_results = train_checkpoint.get('llm_results', [])

    train_parsed = []
    for idx, item in enumerate(train_data):
        if idx >= len(train_llm_results): break
        parsed = parse_prompt_data(item['prompt'])
        response = json.loads(item['response'])
        llm_output = train_llm_results[idx]

        if isinstance(llm_output, dict) and llm_output.get('predicted_close') is not None:
            parsed['llm_prediction'] = safe_float(llm_output.get('predicted_close'), response['predicted_close'])
        else:
            parsed['llm_prediction'] = response['predicted_close']
        
        parsed['actual_price'] = response['predicted_close']
        parsed['llm_likelihood'] = safe_float(llm_output.get('likelihood') if isinstance(llm_output, dict) else None, 0.5)
        train_parsed.append(parsed)

    train_df_ppo = pd.DataFrame(train_parsed)
    if 'recent_prices' not in train_df_ppo.columns:
        train_df_ppo['recent_prices'] = train_df_ppo['llm_prediction'].apply(lambda x: [float(x)] * 5 if pd.notna(x) else [0.0] * 5)
    train_df_ppo['llm_prediction'] = train_df_ppo['llm_prediction'].fillna(train_df_ppo['actual_price'])
    train_df_ppo['llm_likelihood'] = train_df_ppo['llm_likelihood'].fillna(0.5)
    print(f"Successfully prepared {len(train_df_ppo)} training samples.")
    display(train_df_ppo.head())
except Exception as e:
    print(f"Error loading training checkpoint: {e}")
    train_df_ppo = pd.DataFrame()

Loading training data from ../results/llm_predictions_train_checkpoint.json...
Successfully prepared 8698 training samples.


| | ticker | date | recent_prices | llm_prediction | actual_price | llm_likelihood |
|---|---|---|---|---|---|---|
| 0 | AAPL | 2015-01-16 | [27.3125, 27.555, 27.45, 26.705, 26.4975] | 27.3125 | 27.18 | 0.8 |
| 1 | HSBC | 2015-01-16 | [45.62, 45.71, 45.24, 45.26, 45.24] | 45.62 | 45.360001 | 0.8 |
| 2 | 0700.HK | 2015-01-16 | [117.168, 117.8133, 116.1539, 116.9836, 112.3743] | 113.078837 | 113.388344 | 0.5 |
| 3 | PEP | 2015-01-16 | [96.42, 96.35, 96.67, 96.67, 97.29] | 97.29 | 97.510002 | 0.8 |
| 4 | 0700.HK | 2015-01-19 | [117.8133, 116.1539, 116.9836, 112.3743, 113.3... | 113.3883 | 114.402382 | 0.8 |


## 4. Risk Metrics Functions

In [8]:
# Financial Risk Metrics
def calculate_var(returns: np.ndarray, confidence_level: float = 0.95) -> float:
    """Calculate Value at Risk (VaR)"""
    if len(returns) == 0:
        return 0.0
    return np.percentile(returns, (1 - confidence_level) * 100)

def calculate_cvar(returns: np.ndarray, confidence_level: float = 0.95) -> float:
    """Calculate Conditional Value at Risk (CVaR) - Expected Shortfall"""
    if len(returns) == 0:
        return 0.0
    var = calculate_var(returns, confidence_level)
    tail_losses = returns[returns <= var]
    if len(tail_losses) == 0:
        return var
    return np.mean(tail_losses)

def calculate_volatility(prices: np.ndarray) -> float:
    """Calculate price volatility (standard deviation of returns)"""
    if len(prices) < 2:
        return 0.0
    returns = np.diff(prices) / prices[:-1]
    return np.std(returns)

print("Risk metrics functions defined.")

Risk metrics functions defined.
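To make the two tail-risk metrics concrete, here is a small worked example on made-up daily returns (the numbers are illustrative, not from the dataset). It repeats compact versions of `calculate_var` and `calculate_cvar` so it can run on its own, and relies on the default linear interpolation of `np.percentile`.

```python
import numpy as np

def calculate_var(returns, confidence_level=0.95):
    """Historical VaR: the (1 - confidence_level) percentile of the returns."""
    if len(returns) == 0:
        return 0.0
    return np.percentile(returns, (1 - confidence_level) * 100)

def calculate_cvar(returns, confidence_level=0.95):
    """CVaR (expected shortfall): mean of the returns at or below the VaR."""
    if len(returns) == 0:
        return 0.0
    var = calculate_var(returns, confidence_level)
    tail = returns[returns <= var]
    return var if len(tail) == 0 else np.mean(tail)

# Made-up daily returns, sorted only for readability
returns = np.array([-0.05, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06])

var_95 = calculate_var(returns)    # interpolated between the two worst returns
cvar_95 = calculate_cvar(returns)  # mean of the returns at or below var_95

print(f"VaR(95%)  = {var_95:.4f}")
print(f"CVaR(95%) = {cvar_95:.4f}")
```

With ten samples the 5th percentile falls between the two worst returns, so only the single worst return lands in the tail. The same effect is stronger in the environment below: a window of 5 prices gives just 4 returns, so the CVaR there usually reduces to the worst return in the window.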


## 5. PPO Environment with NaN Handling

In [None]:
# Custom Gym Environment for Stock Price Prediction with PPO
class StockPredictionEnv(gym.Env):
    """Custom Environment for Risk-Aware Stock Price Prediction without justification features"""
    
    def __init__(self, data_df: pd.DataFrame, window_size: int = 5, directional_bonus_weight: float = 0.5):
        super(StockPredictionEnv, self).__init__()
        
        self.data = data_df.copy()
        self.window_size = window_size
        self.current_step = 0
        self.max_steps = len(self.data)
        self.directional_bonus_weight = directional_bonus_weight
        
        # State: [llm_prediction, hist_prices, volatility, var, llm_likelihood, llm_trend]
        state_dim = 1 + window_size + 2 + 1 + 1
        
        # Action space: adjustment factor (continuous)
        self.action_space = spaces.Box(
            low=-0.02, high=0.02, shape=(1,), dtype=np.float32
        )
        
        # Observation space
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(state_dim,), dtype=np.float32
        )
        
        # Risk parameters
        self.lambda_risk = 5.0
        self.confidence_level = 0.95
        
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = self.window_size
        return self._get_observation(), {}
    
    def _get_observation(self):
        """Construct state representation with NaN handling"""
        idx = min(self.current_step, self.max_steps - 1)
        
        llm_pred = float(self.data.iloc[idx]['llm_prediction'])
        if np.isnan(llm_pred) or np.isinf(llm_pred):
            llm_pred = float(self.data.iloc[idx]['actual_price'])
        
        hist_prices = []
        if 'recent_prices' in self.data.columns and self.data.iloc[idx]['recent_prices'] is not None:
            try:
                hist_prices = [float(p) for p in self.data.iloc[idx]['recent_prices']]
                hist_prices = [p if not (np.isnan(p) or np.isinf(p)) else llm_pred for p in hist_prices]
            except Exception:
                hist_prices = []
        
        if len(hist_prices) < self.window_size:
            pad_value = hist_prices[-1] if hist_prices else llm_pred
            hist_prices = hist_prices + [pad_value] * (self.window_size - len(hist_prices))
        hist_prices = np.array(hist_prices[-self.window_size:], dtype=np.float32)
        
        last_price = hist_prices[-1]
        llm_trend = llm_pred - last_price
        if np.isnan(llm_trend) or np.isinf(llm_trend): llm_trend = 0.0
        
        volatility = calculate_volatility(hist_prices)
        if np.isnan(volatility) or np.isinf(volatility): volatility = 0.0
        
        returns = np.diff(hist_prices) / hist_prices[:-1] if len(hist_prices) > 1 else np.array([0.0])
        returns = np.nan_to_num(returns, nan=0.0, posinf=0.0, neginf=0.0)
        var = calculate_var(returns, self.confidence_level)
        if np.isnan(var) or np.isinf(var): var = 0.0
        
        llm_likelihood = float(self.data.iloc[idx].get('llm_likelihood', 0.5))
        if np.isnan(llm_likelihood) or np.isinf(llm_likelihood): llm_likelihood = 0.5
        
        state = np.concatenate([
            np.array([llm_pred], dtype=np.float32),
            hist_prices,
            np.array([volatility, var, llm_likelihood, llm_trend], dtype=np.float32)
        ])
        
        state = np.nan_to_num(state, nan=0.0, posinf=1e6, neginf=-1e6)
        return state.astype(np.float32)
    
    def step(self, action):
        idx = min(self.current_step, self.max_steps - 1)
        llm_pred = float(self.data.iloc[idx]['llm_prediction'])
        actual_price = float(self.data.iloc[idx]['actual_price'])
        if np.isnan(llm_pred) or np.isinf(llm_pred): llm_pred = actual_price
        if np.isnan(actual_price) or np.isinf(actual_price): actual_price = llm_pred
        
        hist_prices_list = self.data.iloc[idx]['recent_prices']
        last_price = hist_prices_list[-1] if hist_prices_list and len(hist_prices_list) > 0 else llm_pred

        adjustment = float(action[0])
        if np.isnan(adjustment) or np.isinf(adjustment): adjustment = 0.0
        adjusted_pred = llm_pred * (1 + adjustment)
        pred_error = abs(adjusted_pred - actual_price)
        if actual_price != 0 and not np.isnan(actual_price): pct_error = pred_error / abs(actual_price)
        else: pct_error = 0.0
        if np.isnan(pred_error) or np.isinf(pred_error): pred_error = 0.0
        if np.isnan(pct_error) or np.isinf(pct_error): pct_error = 0.0
        scaled_error = pct_error * 100
        cvar = 0.0
        if 'recent_prices' in self.data.columns and self.data.iloc[idx]['recent_prices'] is not None:
            try:
                hist_prices = np.array(self.data.iloc[idx]['recent_prices'][-self.window_size:], dtype=np.float32)
                hist_prices = np.nan_to_num(hist_prices, nan=llm_pred)
                returns = np.diff(hist_prices) / hist_prices[:-1] if len(hist_prices) > 1 else np.array([0.0])
                returns = np.nan_to_num(returns, nan=0.0, posinf=0.0, neginf=0.0)
                cvar = abs(calculate_cvar(returns, self.confidence_level))
                if np.isnan(cvar) or np.isinf(cvar): cvar = 0.0
            except Exception:
                cvar = 0.0
        risk_penalty = self.lambda_risk * cvar * 100
        llm_error = abs(llm_pred - actual_price)
        if actual_price != 0 and not np.isnan(actual_price): llm_pct_error = llm_error / abs(actual_price) * 100
        else: llm_pct_error = 0.0
        improvement = llm_pct_error - scaled_error
        
        actual_direction = np.sign(actual_price - last_price)
        predicted_direction = np.sign(adjusted_pred - last_price)
        directional_bonus = 0.0
        if actual_direction != 0 and actual_direction == predicted_direction:
            directional_bonus = self.directional_bonus_weight

        reward = -scaled_error - risk_penalty + (improvement * 0.5) + directional_bonus
        if np.isnan(reward) or np.isinf(reward): reward = -100.0
        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        truncated = False
        next_state = self._get_observation()
        return next_state, reward, terminated, truncated, {}

print("Stock Prediction Environment defined.")

Stock Prediction Environment defined.
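The reward computed in `step` combines four terms: the adjusted prediction's percentage error, a CVaR-based risk penalty, a bonus for improving on the raw LLM prediction, and a bonus for predicting the right direction. The sketch below recomputes those terms for a single step in plain Python. The function name `reward_sketch`, the helper `sign`, and the input numbers are hypothetical, and the NaN/Inf guards of the real environment are left out.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def reward_sketch(llm_pred, actual_price, last_price, adjustment, cvar,
                  lambda_risk=5.0, improvement_weight=0.5, directional_weight=0.5):
    """Recompute the reward terms for one step (no NaN/Inf handling)."""
    adjusted_pred = llm_pred * (1 + adjustment)

    # Percentage errors of the adjusted and the raw LLM prediction
    scaled_error = abs(adjusted_pred - actual_price) / abs(actual_price) * 100
    llm_pct_error = abs(llm_pred - actual_price) / abs(actual_price) * 100
    improvement = llm_pct_error - scaled_error

    # Risk term: depends only on historical returns, not on the action
    risk_penalty = lambda_risk * cvar * 100

    # Bonus when the adjusted prediction moves in the same direction as the price
    actual_dir = sign(actual_price - last_price)
    predicted_dir = sign(adjusted_pred - last_price)
    bonus = directional_weight if actual_dir != 0 and actual_dir == predicted_dir else 0.0

    reward = -scaled_error - risk_penalty + improvement * improvement_weight + bonus
    return {'reward': reward, 'scaled_error': scaled_error,
            'risk_penalty': risk_penalty, 'improvement': improvement, 'bonus': bonus}

# Illustrative numbers: the LLM predicts 100, the true close is 102, the last close was 99
adjusted = reward_sketch(100.0, 102.0, 99.0, adjustment=0.01, cvar=0.01)
unadjusted = reward_sketch(100.0, 102.0, 99.0, adjustment=0.0, cvar=0.01)

print(f"reward with +1% adjustment: {adjusted['reward']:.4f}")
print(f"reward with no adjustment:  {unadjusted['reward']:.4f}")
```

In this sketch, as in the `step` method above, the risk penalty is identical for both calls because it is derived from `recent_prices` alone. It shifts the reward level for a given sample but does not change which adjustment is preferred.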


## 6. Test Environment

In [10]:
# Test the environment to ensure it produces valid observations
print("Testing StockPredictionEnv with sample data...")
print("="*80)

try:
    # Create test environment
    test_env = StockPredictionEnv(train_df_ppo, window_size=5)
    
    # Reset and get first observation
    obs, info = test_env.reset()
    
    print(f"Environment reset successful!")
    print(f"\nObservation Details:")
    print(f"   Shape: {obs.shape}")
    print(f"   Contains NaN: {np.any(np.isnan(obs))}")
    print(f"   Contains Inf: {np.any(np.isinf(obs))}")
    print(f"   Min value: {np.min(obs):.4f}")
    print(f"   Max value: {np.max(obs):.4f}")
    print(f"   Mean value: {np.mean(obs):.4f}")
    print(f"\n   First 5 values: {obs[:5]}")
    
    # Try a few steps
    print(f"\nTesting environment steps...")
    for i in range(10):
        action = test_env.action_space.sample()
        next_obs, reward, terminated, truncated, info = test_env.step(action)
        
        has_nan = np.any(np.isnan(next_obs))
        has_inf = np.any(np.isinf(next_obs))
        reward_invalid = np.isnan(reward) or np.isinf(reward)
        
        if has_nan or has_inf or reward_invalid:
            print(f"   Step {i+1}: NaN={has_nan}, Inf={has_inf}, Reward NaN/Inf={reward_invalid}")
            print(f"      Observation: {next_obs}")
            print(f"      Reward: {reward}")
            break
        else:
            print(f"   Step {i+1}: Valid (reward={reward:.4f})")
        
        if terminated or truncated:
            break
    
    print(f"\n{'='*80}")
    print(f"ENVIRONMENT TEST PASSED!")
    print(f"The environment is ready for PPO training.")
    print(f"{'='*80}")
    
except Exception as e:
    print(f"\nENVIRONMENT TEST FAILED!")
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

Testing StockPredictionEnv with sample data...
Environment reset successful!

Observation Details:
   Shape: (9,)
   Contains NaN: False
   Contains Inf: False
   Min value: -0.0058
   Max value: 1531.8000
   Mean value: 1001.4231

   First 5 values: [1500.  1479.2 1505.2 1502.8 1493. ]

Testing environment steps...
   Step 1: Valid (reward=-2.7119)
   Step 2: Valid (reward=-6.0480)
   Step 3: Valid (reward=-19.0390)
   Step 4: Valid (reward=-14.1962)
   Step 5: Valid (reward=-1.6905)
   Step 6: Valid (reward=-21.5172)
   Step 7: Valid (reward=-17.7085)
   Step 8: Valid (reward=-5.1227)
   Step 9: Valid (reward=-2.4888)
   Step 10: Valid (reward=-3.6003)

ENVIRONMENT TEST PASSED!
The environment is ready for PPO training.


In [11]:
def evaluate_on_validation(model, val_df, window_size=5):
    """
    Evaluate PPO model on validation set
    Returns MAE and other metrics
    """
    env = StockPredictionEnv(val_df, window_size=window_size)
    obs, _ = env.reset()
    
    predictions = []
    actuals = []
    rewards_list = []
    
    for idx in range(len(val_df)):
        if idx < window_size:
            # For early samples, use LLM prediction as-is
            predictions.append(val_df.iloc[idx]['llm_prediction'])
            actuals.append(val_df.iloc[idx]['actual_price'])
            continue
        
        # Get PPO action
        action, _ = model.predict(obs, deterministic=True)
        
        # Apply adjustment
        llm_pred = val_df.iloc[idx]['llm_prediction']
        adjusted_pred = llm_pred * (1 + action[0])
        predictions.append(adjusted_pred)
        actuals.append(val_df.iloc[idx]['actual_price'])
        
        # Step environment
        if idx < len(val_df) - 1:
            obs, reward, terminated, _, _ = env.step(action)
            rewards_list.append(reward)
            if terminated:
                break
    
    predictions = np.array(predictions)
    actuals = np.array(actuals)
    
    # Calculate metrics
    mae = np.mean(np.abs(predictions - actuals))
    mape = np.mean(np.abs((predictions - actuals) / actuals)) * 100
    rmse = np.sqrt(np.mean((predictions - actuals) ** 2))
    avg_reward = np.mean(rewards_list) if rewards_list else 0.0
    
    return {
        'mae': mae,
        'mape': mape,
        'rmse': rmse,
        'avg_reward': avg_reward,
        'predictions': predictions,
        'actuals': actuals
    }

print("Validation evaluation function defined.")

Validation evaluation function defined.
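The three error metrics returned by `evaluate_on_validation` can be checked by hand on a toy example. The sketch below uses a hypothetical helper, `regression_metrics`, with made-up prices. Unlike the function above, it skips zero-valued actuals when computing MAPE; that guard is an addition for illustration, not part of the original code.

```python
import numpy as np

def regression_metrics(predictions, actuals):
    """Compute MAE, MAPE (in percent), and RMSE for two equal-length sequences."""
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    errors = predictions - actuals

    # Assumption: ignore zero actuals so the percentage error stays finite
    nonzero = actuals != 0
    if nonzero.any():
        mape = np.mean(np.abs(errors[nonzero] / actuals[nonzero])) * 100
    else:
        mape = float('nan')

    return {
        'mae': np.mean(np.abs(errors)),
        'mape': mape,
        'rmse': np.sqrt(np.mean(errors ** 2)),
    }

# Made-up predictions and actual prices: absolute errors are 1, 2, and 2
metrics = regression_metrics([101.0, 198.0, 52.0], [100.0, 200.0, 50.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Since the dataset mixes tickers at very different price levels (roughly 30 for HSBC and 1,800 for 7203.T in the tables above), a pooled MAE is likely dominated by the high-priced tickers, while MAPE is scale-free. Reading both together gives a fairer picture.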


In [None]:
# Custom environment class for hyperparameter search
class CustomStockEnv(gym.Env):
    """Environment with configurable hyperparameters"""
    
    def __init__(self, data_df: pd.DataFrame, window_size: int = 5, 
                 action_range: float = 0.02, lambda_risk: float = 5.0, 
                 improvement_bonus_weight: float = 0.5,
                 directional_bonus_weight: float = 0.5):
        super(CustomStockEnv, self).__init__()
        
        self.data = data_df.copy()
        self.window_size = window_size
        self.current_step = 0
        self.max_steps = len(self.data)
        self.lambda_risk = lambda_risk
        self.improvement_bonus_weight = improvement_bonus_weight
        self.directional_bonus_weight = directional_bonus_weight
        
        # Add llm_trend to the state
        state_dim = 1 + window_size + 2 + 1 + 1
        
        self.action_space = spaces.Box(low=-action_range, high=action_range, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(state_dim,), dtype=np.float32)
        self.confidence_level = 0.95
        
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = self.window_size
        return self._get_observation(), {}
    
    def _get_observation(self):
        idx = min(self.current_step, self.max_steps - 1)
        llm_pred = float(self.data.iloc[idx]['llm_prediction'])
        if np.isnan(llm_pred) or np.isinf(llm_pred): llm_pred = float(self.data.iloc[idx]['actual_price'])
        
        hist_prices = []
        if 'recent_prices' in self.data.columns and self.data.iloc[idx]['recent_prices'] is not None:
            try:
                hist_prices = [float(p) for p in self.data.iloc[idx]['recent_prices']]
                hist_prices = [p if not (np.isnan(p) or np.isinf(p)) else llm_pred for p in hist_prices]
            except Exception:
                hist_prices = []
        
        if len(hist_prices) < self.window_size:
            pad_value = hist_prices[-1] if hist_prices else llm_pred
            hist_prices = hist_prices + [pad_value] * (self.window_size - len(hist_prices))
        hist_prices = np.array(hist_prices[-self.window_size:], dtype=np.float32)
        
        last_price = hist_prices[-1]
        llm_trend = llm_pred - last_price
        if np.isnan(llm_trend) or np.isinf(llm_trend): llm_trend = 0.0
        
        volatility = calculate_volatility(hist_prices)
        if np.isnan(volatility) or np.isinf(volatility): volatility = 0.0
        
        returns = np.diff(hist_prices) / hist_prices[:-1] if len(hist_prices) > 1 else np.array([0.0])
        returns = np.nan_to_num(returns, nan=0.0, posinf=0.0, neginf=0.0)
        var = calculate_var(returns, self.confidence_level)
        if np.isnan(var) or np.isinf(var): var = 0.0
        
        llm_likelihood = float(self.data.iloc[idx].get('llm_likelihood', 0.5))
        if np.isnan(llm_likelihood) or np.isinf(llm_likelihood): llm_likelihood = 0.5
        
        state = np.concatenate([
            np.array([llm_pred], dtype=np.float32),
            hist_prices,
            np.array([volatility, var, llm_likelihood, llm_trend], dtype=np.float32)
        ])
        
        state = np.nan_to_num(state, nan=0.0, posinf=1e6, neginf=-1e6)
        return state.astype(np.float32)
    
    def step(self, action):
        idx = min(self.current_step, self.max_steps - 1)
        llm_pred = float(self.data.iloc[idx]['llm_prediction'])
        actual_price = float(self.data.iloc[idx]['actual_price'])
        if np.isnan(llm_pred) or np.isinf(llm_pred): llm_pred = actual_price
        if np.isnan(actual_price) or np.isinf(actual_price): actual_price = llm_pred
        
        hist_prices_list = self.data.iloc[idx]['recent_prices']
        last_price = hist_prices_list[-1] if hist_prices_list and len(hist_prices_list) > 0 else llm_pred
        
        adjustment = float(action[0])
        if np.isnan(adjustment) or np.isinf(adjustment): adjustment = 0.0
        adjusted_pred = llm_pred * (1 + adjustment)
        
        pred_error = abs(adjusted_pred - actual_price)
        pct_error = pred_error / abs(actual_price) if actual_price != 0 and not np.isnan(actual_price) else 0.0
        if np.isnan(pred_error) or np.isinf(pred_error): pred_error = 0.0
        if np.isnan(pct_error) or np.isinf(pct_error): pct_error = 0.0
        scaled_error = pct_error * 100
        
        cvar = 0.0
        if 'recent_prices' in self.data.columns and self.data.iloc[idx]['recent_prices'] is not None:
            try:
                hist_prices = np.array(self.data.iloc[idx]['recent_prices'][-self.window_size:], dtype=np.float32)
                hist_prices = np.nan_to_num(hist_prices, nan=llm_pred)
                returns = np.diff(hist_prices) / hist_prices[:-1] if len(hist_prices) > 1 else np.array([0.0])
                returns = np.nan_to_num(returns, nan=0.0, posinf=0.0, neginf=0.0)
                cvar = abs(calculate_cvar(returns, self.confidence_level))
                if np.isnan(cvar) or np.isinf(cvar): cvar = 0.0
            except Exception:
                cvar = 0.0
        
        risk_penalty = self.lambda_risk * cvar * 100
        llm_error = abs(llm_pred - actual_price)
        llm_pct_error = llm_error / abs(actual_price) * 100 if actual_price != 0 and not np.isnan(actual_price) else 0.0
        improvement = llm_pct_error - scaled_error
        
        actual_direction = np.sign(actual_price - last_price)
        predicted_direction = np.sign(adjusted_pred - last_price)
        directional_bonus = 0.0
        if actual_direction != 0 and actual_direction == predicted_direction:
            directional_bonus = self.directional_bonus_weight
            
        reward = -scaled_error - risk_penalty + (improvement * self.improvement_bonus_weight) + directional_bonus
        
        if np.isnan(reward) or np.isinf(reward): reward = -100.0
        
        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        truncated = False
        next_state = self._get_observation()
        return next_state, reward, terminated, truncated, {}

print("Custom environment for hyperparameter search defined.")

Custom environment for hyperparameter search defined.


## 7. Hyperparameter Search
This section uses the `CustomStockEnv` class defined above, sets up a hyperparameter grid and a shortlist of candidate configurations, and runs the training and evaluation loop for each configuration.

In [None]:
# Hyperparameter Search with Validation Set
import itertools
from datetime import datetime

print("="*80)
print("PPO HYPERPARAMETER SEARCH")
print("="*80)

# Calculate LLM baseline on validation set
llm_val_mae_baseline = np.mean(np.abs(val_df_ppo['llm_prediction'] - val_df_ppo['actual_price']))
print(f"\nTarget to Beat: {llm_val_mae_baseline:.4f} (LLM-only validation MAE)")
print(f"   Goal: Find PPO params that reduce MAE by >5%")

# Define hyperparameter grid
param_grid = {
    'learning_rate': [1e-4, 5e-5, 1e-5],
    'action_space_range': [0.01, 0.02, 0.05],  # ±1%, ±2%, ±5%
    'lambda_risk': [1.0, 5.0, 10.0],  # CVaR weight
    'ent_coef': [0.0, 0.01, 0.02],  # Entropy coefficient
    'improvement_bonus_weight': [0.0, 0.5, 1.0],  # Bonus for beating LLM
    'directional_bonus_weight': [0.0, 0.5, 1.0], # Bonus for correct direction
}

# For faster iteration, evaluate a hand-picked set of configurations instead of the full grid
# (some values below fall outside param_grid)
test_configs = [
    {'name': 'Current Best', 'learning_rate': 5e-5, 'action_space_range': 0.02, 'lambda_risk': 5.0, 'ent_coef': 0.02, 'improvement_bonus_weight': 0.5, 'directional_bonus_weight': 0.5},
    {'name': 'Conservative', 'learning_rate': 1e-5, 'action_space_range': 0.01, 'lambda_risk': 10.0, 'ent_coef': 0.01, 'improvement_bonus_weight': 1.0, 'directional_bonus_weight': 1.0},
    {'name': 'Aggressive', 'learning_rate': 1e-4, 'action_space_range': 0.05, 'lambda_risk': 1.0, 'ent_coef': 0.02, 'improvement_bonus_weight': 0.5, 'directional_bonus_weight': 0.5},
    {'name': 'High Exploration', 'learning_rate': 5e-5, 'action_space_range': 0.02, 'lambda_risk': 5.0, 'ent_coef': 0.05, 'improvement_bonus_weight': 0.5, 'directional_bonus_weight': 0.5},
    {'name': 'Bonus Focused', 'learning_rate': 5e-5, 'action_space_range': 0.02, 'lambda_risk': 2.0, 'ent_coef': 0.01, 'improvement_bonus_weight': 2.0, 'directional_bonus_weight': 1.0},
    {'name': 'Risk Averse', 'learning_rate': 5e-5, 'action_space_range': 0.01, 'lambda_risk': 15.0, 'ent_coef': 0.0, 'improvement_bonus_weight': 0.5, 'directional_bonus_weight': 0.5},
    {'name': 'Balanced', 'learning_rate': 5e-5, 'action_space_range': 0.03, 'lambda_risk': 5.0, 'ent_coef': 0.015, 'improvement_bonus_weight': 1.0, 'directional_bonus_weight': 1.0},
    {'name': 'Fast Learner', 'learning_rate': 1e-4, 'action_space_range': 0.02, 'lambda_risk': 5.0, 'ent_coef': 0.02, 'improvement_bonus_weight': 1.5, 'directional_bonus_weight': 0.75},
]

print(f"\nTesting {len(test_configs)} strategic configurations")
print(f"   Training: {len(train_df_ppo)} samples")
print(f"   Validation: {len(val_df_ppo)} samples")
print(f"   Training duration: 40k timesteps per config (~5 min each)")

# Store results
search_results = []

print("\n" + "="*80)
print("STARTING HYPERPARAMETER SEARCH")
print("="*80)

PPO HYPERPARAMETER SEARCH

Target to Beat: 58.9895 (LLM-only validation MAE)
   Goal: Find PPO params that reduce MAE by >5%

Testing 8 strategic configurations
   Training: 8698 samples
   Validation: 1243 samples
   Training duration: 40k timesteps per config (~5 min each)

STARTING HYPERPARAMETER SEARCH
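To see why only a shortlist of configurations is trained, the sketch below enumerates the full grid with `itertools.product` and estimates the total training time. The grid values are copied from `param_grid` above, and the 5 minutes per configuration is the rough estimate printed by the cell, not a measurement.

```python
import itertools

# Same grid as param_grid in the cell above
param_grid = {
    'learning_rate': [1e-4, 5e-5, 1e-5],
    'action_space_range': [0.01, 0.02, 0.05],
    'lambda_risk': [1.0, 5.0, 10.0],
    'ent_coef': [0.0, 0.01, 0.02],
    'improvement_bonus_weight': [0.0, 0.5, 1.0],
    'directional_bonus_weight': [0.0, 0.5, 1.0],
}

# Expand the grid into one dict per combination
keys = list(param_grid)
all_configs = [dict(zip(keys, values))
               for values in itertools.product(*param_grid.values())]

minutes_per_config = 5  # rough estimate taken from the printed "~5 min each"
total_hours = len(all_configs) * minutes_per_config / 60

print(f"Full grid: {len(all_configs)} configurations")
print(f"Estimated training time: {total_hours:.2f} hours")
```

Six parameters with three values each give 729 combinations, about 60 hours at that pace, so a small shortlist is a pragmatic compromise. A random sample of `all_configs` would be another option if broader coverage is needed.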


In [14]:
# Run hyperparameter search
for config_idx, config in enumerate(test_configs):
    print(f"\n{'='*80}")
    print(f"CONFIG {config_idx + 1}/{len(test_configs)}: {config['name']}")
    print(f"{'='*80}")
    
    print(f"Parameters:")
    for key, value in config.items():
        if key != 'name': print(f"   {key}: {value}")
    
    try:
        train_env = CustomStockEnv(
            train_df_ppo,
            window_size=5,
            action_range=config['action_space_range'],
            lambda_risk=config['lambda_risk'],
            improvement_bonus_weight=config['improvement_bonus_weight']
        )
        ppo_model = PPO(
            "MlpPolicy",
            train_env,
            learning_rate=config['learning_rate'],
            n_steps=2048,
            batch_size=64,
            n_epochs=10,
            gamma=0.99,
            clip_range=0.2,
            ent_coef=config['ent_coef'],
            vf_coef=0.5,
            verbose=0,
            max_grad_norm=0.5
        )
        
        print(f"\nTraining for 40,000 timesteps...")
        start_time = datetime.now()
        ppo_model.learn(total_timesteps=40000, progress_bar=True)
        training_time = (datetime.now() - start_time).total_seconds()
        
        print(f"Training complete ({training_time:.1f}s)")
        print(f"Evaluating on validation set...")
        
        val_env = CustomStockEnv(
            val_df_ppo,
            window_size=5,
            action_range=config['action_space_range'],
            lambda_risk=config['lambda_risk'],
            improvement_bonus_weight=config['improvement_bonus_weight']
        )
        obs, _ = val_env.reset()
        predictions, actuals, rewards = [], [], []
        
        for idx in range(len(val_df_ppo)):
            if idx < 5:
                predictions.append(val_df_ppo.iloc[idx]['llm_prediction'])
                actuals.append(val_df_ppo.iloc[idx]['actual_price'])
                continue
            
            action, _ = ppo_model.predict(obs, deterministic=True)
            llm_pred = val_df_ppo.iloc[idx]['llm_prediction']
            adjusted_pred = llm_pred * (1 + action[0])
            predictions.append(adjusted_pred)
            actuals.append(val_df_ppo.iloc[idx]['actual_price'])
            
            if idx < len(val_df_ppo) - 1:
                # Gymnasium's step() returns (obs, reward, terminated, truncated, info)
                obs, reward, terminated, truncated, _ = val_env.step(action)
                rewards.append(reward)
                if terminated or truncated: break
        
        predictions, actuals = np.array(predictions), np.array(actuals)
        val_mae = np.mean(np.abs(predictions - actuals))
        val_mape = np.mean(np.abs((predictions - actuals) / actuals)) * 100
        val_rmse = np.sqrt(np.mean((predictions - actuals) ** 2))
        avg_reward = np.mean(rewards) if rewards else 0.0
        improvement_pct = ((llm_val_mae_baseline - val_mae) / llm_val_mae_baseline) * 100
        
        search_results.append({'config_name': config['name'], 'config': config.copy(), 'val_mae': val_mae, 'val_mape': val_mape, 'val_rmse': val_rmse, 'avg_reward': avg_reward, 'improvement_pct': improvement_pct, 'training_time': training_time, 'model': ppo_model})
        
        print(f"\nValidation Results:")
        print(f"   MAE: {val_mae:.4f} (LLM baseline: {llm_val_mae_baseline:.4f})")
        print(f"   MAPE: {val_mape:.2f}%")
        print(f"   RMSE: {val_rmse:.4f}")
        print(f"   Avg Reward: {avg_reward:.4f}")
        print(f"   Improvement: {improvement_pct:+.2f}%")
        
        train_env.close()
        val_env.close()
        
    except Exception as e:
        print(f"\nError with config {config['name']}: {e}")
        import traceback
        traceback.print_exc()
        continue

print(f"\n{'='*80}")
print(f"HYPERPARAMETER SEARCH COMPLETE")
print(f"{'='*80}")


CONFIG 1/8: Current Best
Parameters:
   learning_rate: 5e-05
   action_space_range: 0.02
   lambda_risk: 5.0
   ent_coef: 0.02
   improvement_bonus_weight: 0.5

Training for 40,000 timesteps...

Training complete (17.7s)
Evaluating on validation set...

Validation Results:
   MAE: 60.2841 (LLM baseline: 58.9895)
   MAPE: 13.48%
   RMSE: 215.6415
   Avg Reward: -24.0078
   Improvement: -2.19%

CONFIG 2/8: Conservative
Parameters:
   learning_rate: 1e-05
   action_space_range: 0.01
   lambda_risk: 10.0
   ent_coef: 0.01
   improvement_bonus_weight: 1.0

Training for 40,000 timesteps...

Training complete (17.4s)
Evaluating on validation set...

Validation Results:
   MAE: 58.9147 (LLM baseline: 58.9895)
   MAPE: 13.29%
   RMSE: 215.3470
   Avg Reward: -34.0900
   Improvement: +0.13%

CONFIG 3/8: Aggressive
Parameters:
   learning_rate: 0.0001
   action_space_range: 0.05
   lambda_risk: 1.0
   ent_coef: 0.02
   improvement_bonus_weight: 0.5

Training for 40,000 timesteps...

Training complete (18.4s)
Evaluating on validation set...

Validation Results:
   MAE: 70.2949 (LLM baseline: 58.9895)
   MAPE: 15.11%
   RMSE: 218.8971
   Avg Reward: -18.1408
   Improvement: -19.17%

CONFIG 4/8: High Exploration
Parameters:
   learning_rate: 5e-05
   action_space_range: 0.02
   lambda_risk: 5.0
   ent_coef: 0.05
   improvement_bonus_weight: 0.5

Training for 40,000 timesteps...

Training complete (17.7s)
Evaluating on validation set...

Validation Results:
   MAE: 64.1931 (LLM baseline: 58.9895)
   MAPE: 14.25%
   RMSE: 216.8581
   Avg Reward: -25.1607
   Improvement: -8.82%

CONFIG 5/8: Bonus Focused
Parameters:
   learning_rate: 5e-05
   action_space_range: 0.02
   lambda_risk: 2.0
   ent_coef: 0.01
   improvement_bonus_weight: 2.0

Training for 40,000 timesteps...

Training complete (17.4s)
Evaluating on validation set...

Validation Results:
   MAE: 57.3115 (LLM baseline: 58.9895)
   MAPE: 12.94%
   RMSE: 214.8602
   Avg Reward: -16.3490
   Improvement: +2.84%

CONFIG 6/8: Risk Averse
Parameters:
   learning_rate: 5e-05
   action_space_range: 0.01
   lambda_risk: 15.0
   ent_coef: 0.0
   improvement_bonus_weight: 0.5

Training for 40,000 timesteps...

Training complete (17.4s)
Evaluating on validation set...

Validation Results:
   MAE: 58.9147 (LLM baseline: 58.9895)
   MAPE: 13.29%
   RMSE: 215.3470
   Avg Reward: -44.5186
   Improvement: +0.13%

CONFIG 7/8: Balanced
Parameters:
   learning_rate: 5e-05
   action_space_range: 0.03
   lambda_risk: 5.0
   ent_coef: 0.015
   improvement_bonus_weight: 1.0

Training for 40,000 timesteps...

Training complete (17.5s)
Evaluating on validation set...

Validation Results:
   MAE: 62.8834 (LLM baseline: 58.9895)
   MAPE: 13.88%
   RMSE: 216.3342
   Avg Reward: -24.8811
   Improvement: -6.60%

CONFIG 8/8: Fast Learner
Parameters:
   learning_rate: 0.0001
   action_space_range: 0.02
   lambda_risk: 5.0
   ent_coef: 0.02
   improvement_bonus_weight: 1.5

Training for 40,000 timesteps...

Training complete (17.4s)
Evaluating on validation set...

Validation Results:
   MAE: 60.2841 (LLM baseline: 58.9895)
   MAPE: 13.48%
   RMSE: 215.6415
   Avg Reward: -24.1490
   Improvement: -2.19%

HYPERPARAMETER SEARCH COMPLETE
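
The MAE, MAPE, and RMSE values printed for each configuration come from three NumPy expressions in the search loop. As a minimal sketch, they can be collected into one helper. `regression_metrics` is a hypothetical name, not a function defined in this notebook, and the toy arrays are illustrative values rather than data from the checkpoints.

```python
import numpy as np

def regression_metrics(predictions, actuals):
    """Return MAE, MAPE (in percent) and RMSE, mirroring the search loop's formulas."""
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    errors = predictions - actuals
    return {
        "mae": float(np.mean(np.abs(errors))),
        # MAPE divides by the actual price, so it assumes no actual price is zero.
        "mape": float(np.mean(np.abs(errors / actuals)) * 100),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
    }

# Toy values for illustration only (not taken from the dataset).
metrics = regression_metrics([102.0, 98.0, 110.0], [100.0, 100.0, 100.0])
print(metrics)
```

In the results above, RMSE (about 215) is several times the MAE (about 59), which is what one would expect if a small number of very large errors dominate the squared-error average.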


## 8. Analyze Search Results
This section summarizes the outcomes of the hyperparameter search, identifies the best-performing configuration, and saves the corresponding model.

In [15]:
# Analyze and visualize search results
if search_results:
    print("\n" + "="*80)
    print("HYPERPARAMETER SEARCH SUMMARY")
    print("="*80)
    
    summary_data = [{'Config': r['config_name'], 'Val MAE': r['val_mae'], 'Improvement %': r['improvement_pct'], 'Val MAPE': r['val_mape'], 'Avg Reward': r['avg_reward'], 'Training Time (s)': r['training_time'], 'LR': r['config']['learning_rate'], 'Action Range': r['config']['action_space_range'], 'Lambda Risk': r['config']['lambda_risk'], 'Entropy': r['config']['ent_coef'], 'Bonus Weight': r['config']['improvement_bonus_weight']} for r in search_results]
    summary_df = pd.DataFrame(summary_data).sort_values('Val MAE', ascending=True)
    
    print(f"\nTop 3 Configurations by Validation MAE:")
    print(summary_df[['Config', 'Val MAE', 'Improvement %', 'Val MAPE']].head(3).to_string(index=False))
    
    best_result = min(search_results, key=lambda x: x['val_mae'])
    print(f"\nBEST CONFIGURATION: {best_result['config_name']}")
    print(f"   Validation MAE: {best_result['val_mae']:.4f}")
    print(f"   Improvement: {best_result['improvement_pct']:+.2f}%")
    
    # Save the best model
    best_model = best_result['model']
    best_model_path = '../results/ppo_best_model_from_search.zip'
    best_model.save(best_model_path)
    print(f"\nBest model saved to {best_model_path}")

else:
    print("No search results available")


HYPERPARAMETER SEARCH SUMMARY

Top 3 Configurations by Validation MAE:
       Config   Val MAE  Improvement %  Val MAPE
Bonus Focused 57.311510       2.844514 12.941502
 Conservative 58.914703       0.126753 13.287133
  Risk Averse 58.914703       0.126753 13.287133

BEST CONFIGURATION: Bonus Focused
   Validation MAE: 57.3115
   Improvement: +2.84%

Best model saved to ../results/ppo_best_model_from_search.zip
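
The saved `.zip` holds the PPO policy, but the environment is rebuilt separately from `action_range`, `lambda_risk`, and `improvement_bonus_weight`, so it may help to persist the winning configuration next to the model. The following is a sketch under that assumption: `save_config` and `load_config` are hypothetical helpers, the file name is illustrative, and the dictionary repeats the 'Bonus Focused' entry from `test_configs`.

```python
import json
import os
import tempfile

def save_config(config, path):
    """Write a hyperparameter dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)

def load_config(path):
    """Read a hyperparameter dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Values copied from the 'Bonus Focused' entry in test_configs.
best_config = {
    "name": "Bonus Focused",
    "learning_rate": 5e-5,
    "action_space_range": 0.02,
    "lambda_risk": 2.0,
    "ent_coef": 0.01,
    "improvement_bonus_weight": 2.0,
    "directional_bonus_weight": 1.0,
}

# A temporary directory keeps the sketch self-contained; in the notebook the
# file would more naturally live under ../results/ (hypothetical location).
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "ppo_best_config.json")
    save_config(best_config, config_path)
    restored = load_config(config_path)

print(restored == best_config)
```

Only the plain `config` dictionary is written here; the `'model'` entry in `search_results` is a PPO object and is not JSON-serializable.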


## 9. Train Final Model
Using the best hyperparameters identified in the search, this section trains a fresh PPO model on the training set (`train_df_ppo`) for 80,000 timesteps, twice the 40,000 used for each search configuration.

In [16]:
# Train final model with best hyperparameters
if 'best_result' in locals() and best_result:
    best_params = best_result['config']
    
    print("\n" + "="*80)
    print("TRAINING FINAL PPO MODEL")
    print("="*80)
    print("Using best hyperparameters from search:")
    for key, value in best_params.items():
        if key != 'name': print(f"   {key}: {value}")
        
    # Create environment with best parameters
    final_env = CustomStockEnv(
        train_df_ppo, 
        window_size=5,
        action_range=best_params['action_space_range'],
        lambda_risk=best_params['lambda_risk'],
        improvement_bonus_weight=best_params['improvement_bonus_weight']
    )
    
    # Initialize PPO model
    final_model = PPO(
        "MlpPolicy",
        final_env,
        learning_rate=best_params['learning_rate'],
        n_steps=2048,
        batch_size=64,
        n_epochs=10,
        gamma=0.99,
        clip_range=0.2,
        ent_coef=best_params['ent_coef'],
        vf_coef=0.5,
        verbose=0,
        max_grad_norm=0.5
    )
    
    # Train for more timesteps
    print("\nTraining for 80,000 timesteps...")
    final_model.learn(total_timesteps=80000, progress_bar=True)
    
    # Save the final model
    final_model_path = '../results/ppo_final_model.zip'
    final_model.save(final_model_path)
    
    print(f"\n{'='*80}")
    print(f"FINAL MODEL TRAINING COMPLETE")
    print(f"Model saved to: {final_model_path}")
    print(f"{'='*80}")
    
else:
    print("\nSkipping final model training because best hyperparameters are not available.")


TRAINING FINAL PPO MODEL
Using best hyperparameters from search:
   learning_rate: 5e-05
   action_space_range: 0.02
   lambda_risk: 2.0
   ent_coef: 0.01
   improvement_bonus_weight: 2.0

Training for 80,000 timesteps...



FINAL MODEL TRAINING COMPLETE
Model saved to: ../results/ppo_final_model.zip


## 10. Evaluate Final Model
This section evaluates the fully trained PPO model on the held-out test set to assess its generalization performance.

In [17]:
# Evaluate the final model on the test set
if 'final_model' in locals() and final_model:
    print("\n" + "="*80)
    print("EVALUATING FINAL MODEL ON TEST SET")
    print("="*80)

    # Create test environment
    test_env_final = CustomStockEnv(
        test_df,
        window_size=5,
        action_range=best_params['action_space_range'],
        lambda_risk=best_params['lambda_risk'],
        improvement_bonus_weight=best_params['improvement_bonus_weight']
    )
    
    obs, _ = test_env_final.reset()
    test_predictions, test_actuals = [], []

    for idx in range(len(test_df)):
        if idx < 5:
            test_predictions.append(test_df.iloc[idx]['llm_prediction'])
            test_actuals.append(test_df.iloc[idx]['actual_price'])
            continue
        
        action, _ = final_model.predict(obs, deterministic=True)
        llm_pred = test_df.iloc[idx]['llm_prediction']
        adjusted_pred = llm_pred * (1 + action[0])
        test_predictions.append(adjusted_pred)
        test_actuals.append(test_df.iloc[idx]['actual_price'])
        
        if idx < len(test_df) - 1:
            # Gymnasium's step() returns (obs, reward, terminated, truncated, info)
            obs, _, terminated, truncated, _ = test_env_final.step(action)
            if terminated or truncated:
                break
    
    test_predictions = np.array(test_predictions)
    test_actuals = np.array(test_actuals)
    
    # Calculate final metrics
    test_mae = np.mean(np.abs(test_predictions - test_actuals))
    test_mape = np.mean(np.abs((test_predictions - test_actuals) / test_actuals)) * 100
    test_rmse = np.sqrt(np.mean((test_predictions - test_actuals) ** 2))
    
    # Compare with LLM-only baseline on test set
    llm_test_mae = np.mean(np.abs(test_df['llm_prediction'] - test_df['actual_price']))
    test_improvement_pct = ((llm_test_mae - test_mae) / llm_test_mae) * 100
    
    print(f"Test Set Performance:")
    print(f"   PPO-Adjusted MAE:  {test_mae:.4f}")
    print(f"   LLM-Only MAE:        {llm_test_mae:.4f}")
    print(f"   Improvement vs LLM:  {test_improvement_pct:+.2f}%")
    print(f"   PPO-Adjusted MAPE: {test_mape:.2f}%")
    print(f"   PPO-Adjusted RMSE: {test_rmse:.4f}")
    print(f"{'='*80}")

    # Save test predictions
    test_df['ppo_adjusted_prediction'] = test_predictions
    test_predictions_path = '../results/test_predictions_with_ppo.csv'
    test_df.to_csv(test_predictions_path, index=False)
    print(f"Test predictions saved to {test_predictions_path}")

else:
    print("\nSkipping final evaluation because the final model is not available.")


EVALUATING FINAL MODEL ON TEST SET
Test Set Performance:
   PPO-Adjusted MAE:  64.5999
   LLM-Only MAE:        62.1152
   Improvement vs LLM:  -4.00%
   PPO-Adjusted MAPE: 7.32%
   PPO-Adjusted RMSE: 314.2232
Test predictions saved to ../results/test_predictions_with_ppo.csv
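
The "Improvement" figure is the relative reduction in MAE against the LLM-only baseline, so a negative value means the PPO adjustment increased the error. Below is a small worked check using the numbers reported in this notebook. `improvement_pct` is a hypothetical helper that repeats the formula from the evaluation cells.

```python
def improvement_pct(baseline_mae, adjusted_mae):
    """Relative MAE reduction in percent; positive means the adjusted model is better."""
    return (baseline_mae - adjusted_mae) / baseline_mae * 100

# Validation: best search configuration ("Bonus Focused") vs the LLM-only baseline.
val_improvement = improvement_pct(58.9895, 57.3115)
# Test: final 80,000-timestep model vs the LLM-only baseline.
test_improvement = improvement_pct(62.1152, 64.5999)

print(f"Validation: {val_improvement:+.2f}%")
print(f"Test:       {test_improvement:+.2f}%")
```

With these numbers, the +2.84% gain seen on the validation set does not carry over to the test set, where the adjusted predictions are 4.00% worse than the LLM-only baseline.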


## 11. How to Load the Saved Model

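You can now load the trained PPO model in any other notebook to make predictions. The final model, retrained with the best hyperparameters from the search, was saved to `../results/ppo_final_model.zip`; the best model from the search itself was saved to `../results/ppo_best_model_from_search.zip`.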


Here's how to load it:


```python
from stable_baselines3 import PPO
import gymnasium as gym

# You will need to have your custom environment class available
# For example, you can copy the CustomStockEnv class into your new notebook

# Load the model
model = PPO.load("../results/ppo_final_model.zip")

# Now you can use the model to predict actions
# obs = ... # get an observation from your environment
# action, _ = model.predict(obs, deterministic=True)
```
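
Once actions are available, the evaluation cells turn them into prices by scaling each LLM prediction multiplicatively, and they pass the first `window_size` rows through unchanged. The sketch below isolates that step so it can be reused without the environment. `apply_ppo_adjustments` is a hypothetical helper, and the actions are made-up stand-ins for `model.predict` outputs.

```python
import numpy as np

def apply_ppo_adjustments(llm_predictions, actions, window_size=5):
    """Scale each LLM prediction by (1 + action) after the warm-up window.

    `actions[i]` is the adjustment for row `window_size + i`; the first
    `window_size` rows keep the raw LLM prediction, as in the evaluation loop.
    """
    adjusted = np.asarray(llm_predictions, dtype=float).copy()
    actions = np.asarray(actions, dtype=float)
    if len(actions) != len(adjusted) - window_size:
        raise ValueError("expected one action per row after the warm-up window")
    adjusted[window_size:] *= 1.0 + actions
    return adjusted

# Illustrative values only: seven LLM predictions and two hypothetical actions.
llm_preds = [100.0, 101.0, 102.0, 103.0, 104.0, 200.0, 50.0]
actions = [0.02, -0.01]
adjusted = apply_ppo_adjustments(llm_preds, actions)
print(adjusted)
```

Here the sixth prediction is raised by 2% and the seventh is lowered by 1%, while the first five rows are returned as they came from the LLM.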