# PPO Inference on Normal-Based Predictions

This notebook loads a pre-trained PPO model and applies it to the test-set predictions generated by the small base LLM.

## Prerequisites:
- A trained PPO model saved at `../results/ppo_final_model.zip`
- Justification-based LLM predictions at `../results/llm_predictions_checkpoint.json`
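
Both files can be checked up front so that a missing artifact is reported before any cell depends on it. The sketch below is one minimal way to do this; the helper name `missing_prerequisites` is hypothetical, and the two paths are the ones listed above.

```python
from pathlib import Path

# Paths taken from the prerequisites list above.
REQUIRED_FILES = [
    "../results/ppo_final_model.zip",
    "../results/llm_predictions_checkpoint.json",
]

def missing_prerequisites(paths=REQUIRED_FILES):
    """Return the entries of `paths` that do not exist as files (hypothetical helper)."""
    return [p for p in paths if not Path(p).is_file()]

missing = missing_prerequisites()
if missing:
    print("Missing prerequisite files:", ", ".join(missing))
else:
    print("All prerequisite files found.")
```

Paths are resolved relative to the notebook's working directory, so the check only makes sense when run from the same location as the notebook.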

## What this notebook does:
1. Loads the justification-based LLM predictions for the test set.
2. Loads the pre-trained PPO model.
3. Defines the RL environment required for inference.
4. Applies the PPO model to adjust the LLM's predictions.
5. Evaluates and saves the final, adjusted predictions.

## 1. Import Libraries

In [11]:
# Install required packages for progress bar
!pip install "stable-baselines3[extra]"

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [12]:
# Import libraries
import os
import json
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# HTTP requests for HF endpoint
import requests

# Environment variables
from dotenv import load_dotenv

# Reinforcement Learning
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Progress bar
from tqdm import tqdm

# Set random seeds for reproducibility
np.random.seed(42)

print("All libraries imported successfully!")

All libraries imported successfully!


## 2. Data Loading and Preparation

In [13]:


def safe_float(value, default=0.0) -> float:
    try:
        return float(value)
    except (TypeError, ValueError):
        return float(default)

def parse_prompt_data(prompt_text):
    """Extract key information from prompt"""
    lines = prompt_text.split('\n')
    data = {}
    
    for line in lines:
        if 'TICKER:' in line:
            data['ticker'] = line.split('TICKER:')[1].strip()
        elif 'DATE:' in line:
            data['date'] = line.split('DATE:')[1].strip()
        elif 'RECENT CLOSING PRICES' in line:
            # Prices are on the same line after the colon
            if ':' in line:
                prices_part = line.split(':', 1)[1].strip()
                # Remove any parenthetical text like "(most recent last)"
                if '(' in prices_part:
                    prices_part = prices_part.split('(')[0].strip()
                # Extract comma-separated prices
                try:
                    data['recent_prices'] = [float(p.strip()) for p in prices_part.split(',') if p.strip()]
                except ValueError:
                    # If parsing fails, set empty list
                    data['recent_prices'] = []
    
    return data

print("Data preparation functions defined.")

Data preparation functions defined.
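
To make the expected prompt layout concrete, the sketch below runs a condensed restatement of the `parse_prompt_data` logic on a made-up prompt. The prompt text is an assumption inferred from the markers the parser looks for (`TICKER:`, `DATE:`, `RECENT CLOSING PRICES`); the real prompts in `test.jsonl` may contain additional lines.

```python
def parse_prompt_sketch(prompt_text):
    """Condensed restatement of parse_prompt_data, for illustration only."""
    data = {}
    for line in prompt_text.split("\n"):
        if "TICKER:" in line:
            data["ticker"] = line.split("TICKER:")[1].strip()
        elif "DATE:" in line:
            data["date"] = line.split("DATE:")[1].strip()
        elif "RECENT CLOSING PRICES" in line and ":" in line:
            # Keep the text after the first colon and drop any trailing "(...)" note
            prices_part = line.split(":", 1)[1].split("(")[0]
            try:
                data["recent_prices"] = [float(p) for p in prices_part.split(",") if p.strip()]
            except ValueError:
                data["recent_prices"] = []
    return data

# Hypothetical prompt; the field values are illustrative.
sample_prompt = (
    "TICKER: AAPL\n"
    "DATE: 2023-01-03\n"
    "RECENT CLOSING PRICES: 130.03, 126.04, 129.61, 129.93, 125.07 (most recent last)\n"
)
parsed = parse_prompt_sketch(sample_prompt)
print(parsed)
```

A prices line that cannot be parsed yields an empty list rather than an exception, which is why the environment later pads short price windows.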


In [14]:
def load_jsonl(filepath):
    """Load a JSONL file into a list of records, skipping blank lines"""
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data

In [15]:
# Load the pre-computed LLM predictions from the checkpoint file
checkpoint_file = '../results/llm_predictions_checkpoint.json'
print(f"Loading pre-computed LLM predictions from {checkpoint_file}...")

try:
    # Load original test data to get prompts
    test_data = load_jsonl('../finetune_paper/test.jsonl')
    
    with open(checkpoint_file, 'r') as f:
        checkpoint_data = json.load(f)
    
    llm_results = checkpoint_data.get('llm_results', [])
    
    # Align with the test set only
    if len(llm_results) != len(test_data):
        raise ValueError(f"Checkpoint size ({len(llm_results)}) does not match test data size ({len(test_data)})")

    # Prepare the test DataFrame
    parsed_data = []
    for idx, item in enumerate(test_data):
        parsed = parse_prompt_data(item['prompt'])
        response = json.loads(item['response'])
        llm_output = llm_results[idx]

        parsed['llm_prediction'] = safe_float(llm_output.get('predicted_close'), response['predicted_close'])
        parsed['actual_price'] = response['predicted_close']
        parsed['llm_likelihood'] = safe_float(llm_output.get('likelihood'), 0.5)
        # The model was not trained on justification features, so we do not extract them here.
        # justification_text = llm_output.get('justification', '')
        # parsed.update(extract_justification_features(justification_text))
        parsed_data.append(parsed)

    test_df = pd.DataFrame(parsed_data)

    # Data Cleaning
    if 'recent_prices' not in test_df.columns:
        test_df['recent_prices'] = test_df['llm_prediction'].apply(lambda x: [float(x)] * 5 if pd.notna(x) else [0.0] * 5)
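    # Assign the filled columns back; in-place fillna on a column selection is unreliable in pandas 2.x.
    # Missing predictions are filled with the ground-truth price, so those rows score zero error.
    test_df['llm_prediction'] = test_df['llm_prediction'].fillna(test_df['actual_price'])
    test_df['llm_likelihood'] = test_df['llm_likelihood'].fillna(0.5)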

    print(f"Successfully loaded and prepared {len(test_df)} test samples.")
    display(test_df.head())

except FileNotFoundError:
    print(f"ERROR: Checkpoint file not found at {checkpoint_file}")
    test_df = pd.DataFrame()
except Exception as e:
    print(f"An error occurred: {e}")
    test_df = pd.DataFrame()

Loading pre-computed LLM predictions from ../results/llm_predictions_checkpoint.json...
Successfully loaded and prepared 2477 test samples.


|   | ticker | date | recent_prices | llm_prediction | actual_price | llm_likelihood |
|---|---|---|---|---|---|---|
| 0 | HSBC | 2023-01-03 | [31.07, 31.03, 31.21, 31.16, 31.63] | 31.63 | 32.68 | 0.8 |
| 1 | 0700.HK | 2023-01-03 | [304.1191, 309.8178, 318.3658, 317.226, 327.8636] | 0.0 | 342.870056 | 0.0 |
| 2 | PEP | 2023-01-03 | [183.07, 181.75, 181.98, 180.66, 179.41] | 181.0 | 178.970001 | 0.7 |
| 3 | AAPL | 2023-01-03 | [130.03, 126.04, 129.61, 129.93, 125.07] | 130.03 | 126.360001 | 0.5 |
| 4 | 7203.T | 2023-01-04 | [1817.5, 1819.0, 1817.0, 1812.5, 1799.0] | 1817.5 | 1807.5 | 0.8 |
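
Row 1 (0700.HK) carries an `llm_prediction` and `llm_likelihood` of 0.0. The source does not say why; one possibility is that the checkpoint already stores 0.0 for that sample. The sketch below repeats the `safe_float` definition from the data-preparation cell to show that the fallback only applies to missing or unparsable values, so a stored zero passes through unchanged.

```python
import math

def safe_float(value, default=0.0) -> float:
    # Same definition as in the data-preparation cell above.
    try:
        return float(value)
    except (TypeError, ValueError):
        return float(default)

print(safe_float("31.63", 32.68))        # parsable string: converted, fallback unused
print(safe_float(None, 32.68))           # missing value: fallback used
print(safe_float("n/a", 32.68))          # unparsable string: fallback used
print(safe_float(0.0, 342.87))           # valid zero: kept as-is
print(math.isnan(safe_float(float("nan"), 1.0)))  # NaN is a valid float, so it is kept too
```

NaN values that survive this step are the ones the subsequent `fillna` calls are meant to catch.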


In [16]:
# Financial Risk Metrics
def calculate_var(returns: np.ndarray, confidence_level: float = 0.95) -> float:
    """Calculate Value at Risk (VaR)"""
    if len(returns) == 0:
        return 0.0
    return np.percentile(returns, (1 - confidence_level) * 100)

def calculate_cvar(returns: np.ndarray, confidence_level: float = 0.95) -> float:
    """Calculate Conditional Value at Risk (CVaR) - Expected Shortfall"""
    if len(returns) == 0:
        return 0.0
    var = calculate_var(returns, confidence_level)
    tail_losses = returns[returns <= var]
    if len(tail_losses) == 0:
        return var
    return np.mean(tail_losses)

def calculate_volatility(prices: np.ndarray) -> float:
    """Calculate price volatility (standard deviation of returns)"""
    if len(prices) < 2:
        return 0.0
    returns = np.diff(prices) / prices[:-1]
    return np.std(returns)

print("Risk metrics functions defined.")

Risk metrics functions defined.
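
As a worked example, the sketch below applies the same historical VaR and CVaR definitions to the five HSBC closing prices shown in the table above. The combined helper `var_cvar` is a hypothetical restatement so the block runs on its own. With a five-price window there are only four returns, so the 5th percentile is interpolated between the two lowest returns and the tail contains only the worst one.

```python
import numpy as np

def var_cvar(returns, confidence_level=0.95):
    """Historical VaR and CVaR, restating the two functions above in one helper."""
    var = np.percentile(returns, (1 - confidence_level) * 100)
    tail = returns[returns <= var]
    cvar = tail.mean() if len(tail) else var
    return float(var), float(cvar)

prices = np.array([31.07, 31.03, 31.21, 31.16, 31.63])  # HSBC row from the table above
returns = np.diff(prices) / prices[:-1]                  # four simple returns
var, cvar = var_cvar(returns)

print("returns:     ", np.round(returns, 5))
print(f"VaR (95%):    {var:.5f}")
print(f"CVaR (95%):   {cvar:.5f}")
print(f"worst return: {returns.min():.5f}")
```

In this setting the CVaR coincides with the single worst return in the window, which is worth keeping in mind when interpreting the risk penalty used by the environment.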


## 3. PPO Environment Definition

In [17]:
# Custom Gym Environment for Stock Price Prediction with PPO
class StockPredictionEnv(gym.Env):
    """Custom Environment for Risk-Aware Stock Price Prediction without justification features"""
    
    def __init__(self, data_df: pd.DataFrame, window_size: int = 5):
        super(StockPredictionEnv, self).__init__()
        
        self.data = data_df.copy()
        self.window_size = window_size
        self.current_step = 0
        self.max_steps = len(self.data)
        
        # State: [llm_prediction, historical_prices (window), volatility, var, llm_likelihood]
        state_dim = 1 + window_size + 2 + 1  # 1 for llm_likelihood
        
        # Action space: adjustment factor (continuous)
        self.action_space = spaces.Box(
            low=-0.02, high=0.02, shape=(1,), dtype=np.float32
        )
        
        # Observation space
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(state_dim,), dtype=np.float32
        )
        
        # Risk parameters
        self.lambda_risk = 5.0
        self.confidence_level = 0.95
        
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = self.window_size
        return self._get_observation(), {}
    
    def _get_observation(self):
        """Construct state representation with NaN handling"""
        idx = min(self.current_step, self.max_steps - 1)
        
        llm_pred = float(self.data.iloc[idx]['llm_prediction'])
        if np.isnan(llm_pred) or np.isinf(llm_pred):
            llm_pred = float(self.data.iloc[idx]['actual_price'])
        
        hist_prices = []
        if 'recent_prices' in self.data.columns and self.data.iloc[idx]['recent_prices'] is not None:
            try:
                hist_prices = [float(p) for p in self.data.iloc[idx]['recent_prices']]
                hist_prices = [p if not (np.isnan(p) or np.isinf(p)) else llm_pred for p in hist_prices]
            except Exception:
                hist_prices = []
        
        if len(hist_prices) < self.window_size:
            pad_value = hist_prices[-1] if hist_prices else llm_pred
            hist_prices = hist_prices + [pad_value] * (self.window_size - len(hist_prices))
        hist_prices = np.array(hist_prices[-self.window_size:], dtype=np.float32)
        
        volatility = calculate_volatility(hist_prices)
        if np.isnan(volatility) or np.isinf(volatility): volatility = 0.0
        
        returns = np.diff(hist_prices) / hist_prices[:-1] if len(hist_prices) > 1 else np.array([0.0])
        returns = np.nan_to_num(returns, nan=0.0, posinf=0.0, neginf=0.0)
        var = calculate_var(returns, self.confidence_level)
        if np.isnan(var) or np.isinf(var): var = 0.0
        
        llm_likelihood = float(self.data.iloc[idx].get('llm_likelihood', 0.5))
        if np.isnan(llm_likelihood) or np.isinf(llm_likelihood): llm_likelihood = 0.5
        
        state = np.concatenate([
            np.array([llm_pred], dtype=np.float32),
            hist_prices,
            np.array([volatility, var, llm_likelihood], dtype=np.float32)
        ])
        
        state = np.nan_to_num(state, nan=0.0, posinf=1e6, neginf=-1e6)
        return state.astype(np.float32)
    
    def step(self, action):
        idx = min(self.current_step, self.max_steps - 1)

        # Read prediction and target; if one is non-finite, fall back to the other
        llm_pred = float(self.data.iloc[idx]['llm_prediction'])
        actual_price = float(self.data.iloc[idx]['actual_price'])
        if np.isnan(llm_pred) or np.isinf(llm_pred):
            llm_pred = actual_price
        if np.isnan(actual_price) or np.isinf(actual_price):
            actual_price = llm_pred

        # Apply the multiplicative adjustment chosen by the agent
        adjustment = float(action[0])
        if np.isnan(adjustment) or np.isinf(adjustment):
            adjustment = 0.0
        adjusted_pred = llm_pred * (1 + adjustment)

        # Percentage error of the adjusted prediction
        pred_error = abs(adjusted_pred - actual_price)
        if actual_price != 0 and not np.isnan(actual_price):
            pct_error = pred_error / abs(actual_price)
        else:
            pct_error = 0.0
        if np.isnan(pct_error) or np.isinf(pct_error):
            pct_error = 0.0
        scaled_error = pct_error * 100

        # Risk penalty: absolute CVaR of the returns in the recent price window
        cvar = 0.0
        if 'recent_prices' in self.data.columns and self.data.iloc[idx]['recent_prices'] is not None:
            try:
                hist_prices = np.array(self.data.iloc[idx]['recent_prices'][-self.window_size:], dtype=np.float32)
                hist_prices = np.nan_to_num(hist_prices, nan=llm_pred)
                returns = np.diff(hist_prices) / hist_prices[:-1] if len(hist_prices) > 1 else np.array([0.0])
                returns = np.nan_to_num(returns, nan=0.0, posinf=0.0, neginf=0.0)
                cvar = abs(calculate_cvar(returns, self.confidence_level))
                if np.isnan(cvar) or np.isinf(cvar):
                    cvar = 0.0
            except Exception:
                cvar = 0.0
        risk_penalty = self.lambda_risk * cvar * 100

        # Improvement of the adjusted prediction over the raw LLM prediction
        llm_error = abs(llm_pred - actual_price)
        if actual_price != 0 and not np.isnan(actual_price):
            llm_pct_error = llm_error / abs(actual_price) * 100
        else:
            llm_pct_error = 0.0
        improvement = llm_pct_error - scaled_error

        reward = -scaled_error - risk_penalty + (improvement * 0.5)
        if np.isnan(reward) or np.isinf(reward):
            reward = -100.0

        # Advance to the next sample
        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        truncated = False
        next_state = self._get_observation()
        return next_state, reward, terminated, truncated, {}

print("Stock Prediction Environment defined.")

Stock Prediction Environment defined.
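
The reward arithmetic in `step` can be isolated into a small function to see which terms the agent can actually influence. The sketch below is a restatement for finite inputs and a non-zero `actual_price` (both are assumptions; the environment adds guards for the other cases), and `reward_sketch` is a hypothetical name. The PEP values come from the data table in section 2.

```python
def reward_sketch(llm_pred, adjustment, actual_price, cvar, lambda_risk=5.0):
    """Restates the reward arithmetic of StockPredictionEnv.step for finite inputs."""
    adjusted_pred = llm_pred * (1 + adjustment)
    scaled_error = abs(adjusted_pred - actual_price) / abs(actual_price) * 100
    llm_pct_error = abs(llm_pred - actual_price) / abs(actual_price) * 100
    risk_penalty = lambda_risk * cvar * 100
    improvement = llm_pct_error - scaled_error
    return -scaled_error - risk_penalty + 0.5 * improvement

# PEP row: llm_prediction 181.0, actual_price 178.970001; the CVaR value is illustrative
for adjustment in (-0.02, 0.0, 0.02):
    print(adjustment, round(reward_sketch(181.0, adjustment, 178.970001, cvar=0.01), 4))
```

Two properties follow directly from the formula. The risk penalty depends only on the price window, not on the action, so it shifts every action's reward by the same amount for a given sample. And when `llm_pred` is zero, the adjusted prediction is zero for every action, so the reward is identical across the whole action range.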


## 4. Load Pre-Trained PPO Model

In [18]:
# Load the saved PPO model
try:
    model_path = '../results/ppo_final_model.zip'
    model = PPO.load(model_path)
    print(f"Successfully loaded PPO model from {model_path}")
except Exception as e:
    print(f"Error loading PPO model: {e}")
    model = None

Successfully loaded PPO model from ../results/ppo_final_model.zip
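
A loaded policy only works if the environment produces observations of the size it was trained on. The sketch below derives the expected state size from the environment definition (1 prediction, `window_size` prices, volatility and VaR, 1 likelihood) and compares it with a given shape. Both helper names are hypothetical, and the commented call assumes the loaded model exposes `observation_space` the way Stable-Baselines3 algorithms do.

```python
def expected_state_dim(window_size=5):
    """llm_prediction + price window + (volatility, VaR) + llm_likelihood."""
    return 1 + window_size + 2 + 1

def check_observation_shape(model_shape, window_size=5):
    """Raise if the policy expects a different state size (hypothetical helper)."""
    expected = (expected_state_dim(window_size),)
    if tuple(model_shape) != expected:
        raise ValueError(
            f"Model expects observations of shape {tuple(model_shape)}, "
            f"but the environment produces {expected}"
        )
    return expected

# In the notebook this could be called as:
#   check_observation_shape(model.observation_space.shape)
print(check_observation_shape((9,)))
```

Running this check right after loading surfaces a window-size mismatch as a clear error instead of a shape failure inside `model.predict`.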


## 5. Apply PPO to Test Data

In [19]:
# Apply PPO adjustments to test predictions
def apply_ppo_adjustment(model, test_df):
    """Apply trained PPO model to adjust predictions"""
    adjusted_predictions = []
    
    env = StockPredictionEnv(test_df, window_size=5)
    obs, _ = env.reset()
    
    for idx in range(len(test_df)):
        if idx < env.window_size:
            # For early samples, use LLM prediction as-is
            adjusted_predictions.append(test_df.iloc[idx]['llm_prediction'])
            continue
        
        # Get PPO action
        action, _ = model.predict(obs, deterministic=True)
        
        # Apply adjustment
        llm_pred = test_df.iloc[idx]['llm_prediction']
        adjusted_pred = llm_pred * (1 + action[0])
        adjusted_predictions.append(adjusted_pred)
        
        # Step environment
        if idx < len(test_df) - 1:
            obs, _, terminated, _, _ = env.step(action)
            if terminated:
                break
    
    return adjusted_predictions

print("Applying PPO adjustments to test set...")
test_df['ppo_adjusted_prediction'] = apply_ppo_adjustment(model, test_df)
print("PPO adjustments applied!")

# Display results
test_df[['ticker', 'llm_prediction', 'ppo_adjusted_prediction', 'actual_price']].head(10)

Applying PPO adjustments to test set...
PPO adjustments applied!


|   | ticker | llm_prediction | ppo_adjusted_prediction | actual_price |
|---|---|---|---|---|
| 0 | HSBC | 31.63 | 31.63 | 32.68 |
| 1 | 0700.HK | 0.0 | 0.0 | 342.870056 |
| 2 | PEP | 181.0 | 181.0 | 178.970001 |
| 3 | AAPL | 130.03 | 130.03 | 126.360001 |
| 4 | 7203.T | 1817.5 | 1817.5 | 1807.5 |
| 5 | HSBC | 31.63 | 30.997401 | 33.759998 |
| 6 | PEP | 181.75 | 178.115003 | 177.100006 |
| 7 | AAPL | 123.456 | 120.986882 | 125.019997 |
| 8 | 0700.HK | 0.0 | 0.0 | 347.799988 |
| 9 | AAPL | 129.61 | 127.017802 | 129.619995 |
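
The first five rows are passed through unchanged because the loop skips indices below `window_size`. For the remaining rows, the adjustment the policy applied can be recovered by inverting `adjusted = llm_pred * (1 + a)`. The sketch below does this for rows 5 to 9 of the output above; `implied_adjustment` is a hypothetical helper.

```python
# (llm_prediction, ppo_adjusted_prediction) pairs copied from rows 5-9 above
rows = [
    (31.63, 30.997401),
    (181.75, 178.115003),
    (123.456, 120.986882),
    (0.0, 0.0),
    (129.61, 127.017802),
]

def implied_adjustment(llm_pred, adjusted_pred):
    """Recover the multiplicative adjustment, or None when llm_pred is zero."""
    if llm_pred == 0:
        return None  # 0 * (1 + a) is 0 for every a, so the action has no effect
    return adjusted_pred / llm_pred - 1

for llm_pred, adjusted_pred in rows:
    a = implied_adjustment(llm_pred, adjusted_pred)
    print(llm_pred, "->", "n/a" if a is None else f"{a:+.4f}")
```

Every non-zero row in this sample is consistent with an adjustment of -0.02, the lower bound of the action space, and the zero-valued prediction cannot be changed by a multiplicative adjustment at all. Whether the policy sits at the bound across the full test set would need to be checked on all rows.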


## 6. Save and Evaluate Test Results

In [20]:
# Save test predictions with PPO adjustments
output_path = '../results/test_predictions_with_ppo.csv'
test_df.to_csv(output_path, index=False)
print(f"Test predictions with PPO adjustments saved to {output_path}")

# Quick comparison
llm_mae = np.mean(np.abs(test_df['llm_prediction'] - test_df['actual_price']))
ppo_mae = np.mean(np.abs(test_df['ppo_adjusted_prediction'] - test_df['actual_price']))

print(f"\nQuick Comparison:")
print(f"LLM MAE: {llm_mae:.4f}")
print(f"LLM-PPO MAE: {ppo_mae:.4f}")
print(f"Improvement: {((llm_mae - ppo_mae) / llm_mae * 100):.2f}%")

Test predictions with PPO adjustments saved to ../results/test_predictions_with_ppo.csv

Quick Comparison:
LLM MAE: 62.1152
LLM-PPO MAE: 64.5999
Improvement: -4.00%
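
The reported improvement is the relative change in MAE, so a negative value means the PPO-adjusted predictions are worse than the raw LLM predictions on this test set. The sketch below reproduces that arithmetic from the printed MAE values and, as a diagnostic only, computes the MAE on the ten rows displayed in section 5 with and without zero-valued predictions. Excluding zeros is an assumption made here for illustration, not part of the notebook's evaluation.

```python
def mae(predictions, actuals):
    """Mean absolute error over paired sequences."""
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)

def relative_improvement(baseline_mae, new_mae):
    """Percentage reduction in MAE relative to the baseline (negative means worse)."""
    return (baseline_mae - new_mae) / baseline_mae * 100

# MAE values printed by the cell above
print(f"Improvement: {relative_improvement(62.1152, 64.5999):.2f}%")

# (llm_prediction, actual_price) for the ten rows displayed in section 5
rows = [
    (31.63, 32.68), (0.0, 342.870056), (181.0, 178.970001), (130.03, 126.360001),
    (1817.5, 1807.5), (31.63, 33.759998), (181.75, 177.100006),
    (123.456, 125.019997), (0.0, 347.799988), (129.61, 129.619995),
]
nonzero = [(p, a) for p, a in rows if p != 0]

mae_all = mae(*zip(*rows))
mae_nonzero = mae(*zip(*nonzero))
print(f"LLM MAE, all ten rows:        {mae_all:.4f}")
print(f"LLM MAE, zero predictions excluded: {mae_nonzero:.4f}")
```

Because MAE is measured in price units, a few zero-valued predictions on high-priced tickers can dominate the average; a percentage-based metric or a per-ticker breakdown would make the comparison less sensitive to them.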
