# Two-Stage Framework for Stock Price Prediction: LLM-Based Forecasting with Risk-Aware PPO Adjustment

This notebook replicates the methodology from the paper:
**"A Two-Stage Framework for Stock Price Prediction: LLM-Based Forecasting with Risk-Aware PPO Adjustment"**

## Framework Overview:
1. **Stage 1**: LLM-based stock price prediction using historical data, technical indicators, and sentiment analysis
2. **Stage 2**: Risk-aware PPO adjustment incorporating VaR and CVaR to refine predictions
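
For readers unfamiliar with the two risk measures named in Stage 2, the sketch below computes VaR and CVaR by historical simulation on a list of returns. It is a generic textbook estimator, not the paper's formulation, which this part of the notebook does not show. The function name `historical_var_cvar`, the `alpha` parameter, and the sample returns are all illustrative assumptions.

```python
import math

def historical_var_cvar(returns, alpha=0.95):
    """Return (VaR, CVaR) as loss fractions at confidence level alpha.

    Assumption: plain empirical quantile of the losses, no interpolation.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    losses = sorted(-r for r in returns)  # a loss is a negated return
    # Position of the alpha-quantile among the sorted losses (0-based)
    k = max(0, min(len(losses) - 1, math.ceil(alpha * len(losses)) - 1))
    var = losses[k]
    tail = losses[k:]  # losses at or beyond VaR
    cvar = sum(tail) / len(tail)
    return var, cvar

# Made-up daily returns, for illustration only
sample_returns = [-0.05, -0.02, 0.01, 0.03, -0.10, 0.02, 0.00, 0.01, -0.01, 0.04]
var_80, cvar_80 = historical_var_cvar(sample_returns, alpha=0.8)
print(f"VaR(80%) = {var_80:.4f}, CVaR(80%) = {cvar_80:.4f}")
```

CVaR is never smaller than VaR at the same level, because it averages the losses in the tail that VaR only marks the start of.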

## Dataset:
- Training, validation, and test data from the `finetune_paper` directory
- Stocks: AAPL, HSBC, PEP, 0700.HK (Tencent), 7203.T (Toyota)

## 1. Environment Setup and Dependencies

In [1]:
# Install required packages (run once)
#!pip install -r ../requirements.txt

In [2]:
# Install Hugging Face packages (run once if using local Llama)
# !pip install transformers accelerate bitsandbytes torch

In [3]:
# Import libraries
import os
import json
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Standard library
import time
import pickle

# GROQ API
from groq import Groq
from dotenv import load_dotenv

# # Machine Learning
# from sklearn.svm import SVR
# from sklearn.preprocessing import StandardScaler
# from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
# from xgboost import XGBRegressor

# Deep Learning
import torch
import torch.nn as nn

# Reinforcement Learning
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Progress bar
from tqdm import tqdm

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("All libraries imported successfully!")


All libraries imported successfully!


## 2. GROQ API Configuration

In [4]:
# Load environment variables
load_dotenv('../.env')

# LLM Configuration
LLM_MODEL = "llama-3.1-8b-instant"  # Llama 3.1 on GROQ
MAX_TOKENS = 1024
TEMPERATURE = 0.0

# Initialize GROQ client
groq_api_key = os.getenv('GROQ_API_key')
if not groq_api_key:
    raise ValueError("GROQ API key not found in .env file")

client = Groq(api_key=groq_api_key)

print(f"GROQ API configured successfully!")
print(f"Model: {LLM_MODEL}")
print(f"Max Tokens: {MAX_TOKENS}")
print(f"Temperature: {TEMPERATURE}")

GROQ API configured successfully!
Model: llama-3.1-8b-instant
Max Tokens: 1024
Temperature: 0.0


## 3. Data Loading and Preprocessing

In [5]:
# Load datasets
def load_jsonl(filepath):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))
    return data

# Load train, val, test data
train_data = load_jsonl('../finetune_paper/train.jsonl')
val_data = load_jsonl('../finetune_paper/val.jsonl')
test_data = load_jsonl('../finetune_paper/test.jsonl')

# Load supervised labels
all_labels = pd.read_csv('../finetune_paper/all_supervised_price_labels.csv')

print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")
print(f"Test samples: {len(test_data)}")
print(f"\nAll labels shape: {all_labels.shape}")
print(f"\nStocks in dataset: {all_labels['ticker'].unique()}")

Training samples: 8698
Validation samples: 1243
Test samples: 2477

All labels shape: (12418, 16)

Stocks in dataset: ['AAPL' 'HSBC' '0700.HK' 'PEP' '7203.T']


In [6]:
# Display sample data
print("Sample training data:")
print(f"Prompt (first 500 chars): {train_data[0]['prompt'][:500]}...")
print(f"\nResponse: {train_data[0]['response']}")

print("\n" + "="*80 + "\n")
print("Sample supervised labels:")
all_labels.head()

Sample training data:
Prompt (first 500 chars): You are a financial analyst with expertise in stock market forecasting.
Your task is to analyze market data and predict the next trading day stock price.
Use historical price trends, technical indicators, and sentiment analysis to provide an informed forecast.
Ensure that your predictions are well-justified, considering multiple financial factors.

• Predicted Stock Price: The forecasted close price for the next trading day.
• Price Movement Likelihood: The likelihood of the predicted stock pric...

Response: {"predicted_close": 27.18000030517578, "likelihood": 0.5, "justification": "n/a"}


Sample supervised labels:


Unnamed: 0,Date,SMA_20,SMA_50,EMA_12,EMA_26,RSI_14,MACD,MACD_signal,MACD_hist,BB_width_20_2,headline_count,sent_compound_mean,titles_joined,next_close,confidence_proxy,ticker
0,2015-01-16 00:00:00+00:00,,,27.159062,27.234398,13.536208,-0.075335,-0.01569,-0.059645,,4.0,-0.07955,,27.18,0.5,AAPL
1,2015-01-16 00:00:00+00:00,,,45.765558,46.231136,4.645025,-0.465578,-0.348537,-0.117041,,6.0,0.308567,Which London business pays the highest busines...,45.360001,0.9,HSBC
2,2015-01-16 00:00:00+00:00,,,113.078837,109.846862,68.406756,3.231975,2.607665,0.624309,,1.0,0.0,,113.388344,0.5,0700.HK
3,2015-01-16 00:00:00+00:00,,,96.059458,95.400737,36.54659,0.658721,0.41146,0.247261,,10.0,0.08298,"Audrey P. ""Pep"" Landry Obituary January 16, 20...",97.510002,0.5,PEP
4,2015-01-19 00:00:00+00:00,,,113.126453,110.109194,70.079261,3.017259,2.689584,0.327675,,1.0,0.3612,WeChat apologizes for showering Chinese users ...,114.402382,0.5,0700.HK


In [7]:
# Parse test data for evaluation
POSITIVE_JUSTIFICATION_KEYWORDS = {
    "increase", "growth", "upward", "bullish", "positive", "gain", "improve", "strength", "rally", "optimistic"
}
NEGATIVE_JUSTIFICATION_KEYWORDS = {
    "decrease", "decline", "downward", "bearish", "negative", "loss", "drop", "weakness", "sell", "pessimistic"
}
RISK_JUSTIFICATION_KEYWORDS = {
    "volatility", "volatile", "risk", "uncertain", "uncertainty", "caution", "concern", "warning", "downside"
}

def parse_prompt_data(prompt_text):
    """Extract ticker, date, and recent closing prices from a prompt."""
    lines = prompt_text.split('\n')
    data = {}

    for i, line in enumerate(lines):
        if 'TICKER:' in line:
            data['ticker'] = line.split('TICKER:')[1].strip()
        elif 'DATE:' in line:
            data['date'] = line.split('DATE:')[1].strip()
        elif 'RECENT CLOSING PRICES' in line and i + 1 < len(lines):
            # Prices are expected as a comma-separated list on the following line
            prices_line = lines[i + 1]
            if prices_line.strip():
                data['recent_prices'] = [float(p.strip()) for p in prices_line.split(',') if p.strip()]

    return data

def safe_float(value, default=0.0) -> float:
    try:
        return float(value)
    except (TypeError, ValueError):
        return float(default)

def extract_justification_features(justification: str) -> Dict[str, float]:
    base = {
        "justification_pos_ratio": 0.0,
        "justification_neg_ratio": 0.0,
        "justification_risk_ratio": 0.0,
        "justification_polarity": 0.0,
        "justification_length": 0.0,
    }
    if not justification:
        return base.copy()
    tokens = re.findall(r"[a-zA-Z']+", justification.lower())
    token_count = max(len(tokens), 1)
    pos_hits = sum(token in POSITIVE_JUSTIFICATION_KEYWORDS for token in tokens)
    neg_hits = sum(token in NEGATIVE_JUSTIFICATION_KEYWORDS for token in tokens)
    risk_hits = sum(token in RISK_JUSTIFICATION_KEYWORDS for token in tokens)
    base.update({
        "justification_pos_ratio": float(pos_hits / token_count),
        "justification_neg_ratio": float(neg_hits / token_count),
        "justification_risk_ratio": float(risk_hits / token_count),
        "justification_polarity": float((pos_hits - neg_hits) / token_count),
        "justification_length": float(np.log1p(token_count)),
    })
    return base

# Parse test data
test_parsed = []
for item in test_data:
    parsed = parse_prompt_data(item['prompt'])
    response = json.loads(item['response'])
    parsed['predicted_close'] = response['predicted_close']
    parsed['likelihood'] = response['likelihood']
    test_parsed.append(parsed)

test_df = pd.DataFrame(test_parsed)
print(f"Parsed test data shape: {test_df.shape}")
test_df.head()


Parsed test data shape: (2477, 4)


Unnamed: 0,ticker,date,predicted_close,likelihood
0,HSBC,2023-01-03,32.68,0.9
1,0700.HK,2023-01-03,342.870056,0.5
2,PEP,2023-01-03,178.970001,0.9
3,AAPL,2023-01-03,126.360001,0.5
4,7203.T,2023-01-04,1807.5,0.7
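
To make the ratio arithmetic in `extract_justification_features` concrete, here is a condensed, self-contained restatement applied to one made-up sentence. The three keyword sets are small illustrative subsets of the ones defined above, and `keyword_ratios` is a hypothetical name used only for this example.

```python
import re

# Illustrative subsets of the keyword sets defined in the cell above
POSITIVE = {"bullish", "growth", "gain"}
NEGATIVE = {"bearish", "decline", "loss"}
RISK = {"volatility", "risk", "uncertainty"}

def keyword_ratios(text):
    """Share of tokens that hit each keyword set, plus a net polarity."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    n = max(len(tokens), 1)  # avoid division by zero on empty text
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    risk = sum(t in RISK for t in tokens)
    return {
        "pos_ratio": pos / n,
        "neg_ratio": neg / n,
        "risk_ratio": risk / n,
        "polarity": (pos - neg) / n,
        "n_tokens": n,
    }

example = "Bullish momentum and earnings growth, but volatility remains a risk."
print(keyword_ratios(example))
```

The sentence has 10 tokens, two positive hits and two risk hits, so the positive ratio, risk ratio, and polarity are each 0.2. Matching is exact per token, so inflected forms such as "increased" or "risks" are not counted by this sketch or by the original function.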


## 4. Stage 1: LLM-Based Stock Price Prediction

In [8]:
def llm_predict_stock_price(prompt: str, model: str = LLM_MODEL) -> Dict:
    """Use GROQ LLM to predict stock price"""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
        )
        
        # Parse JSON response
        content = response.choices[0].message.content
        # Extract JSON from response
        if '{' in content and '}' in content:
            json_start = content.index('{')
            json_end = content.rindex('}') + 1
            json_str = content[json_start:json_end]
            result = json.loads(json_str)
            return result
        else:
            return {"predicted_close": None, "likelihood": 0.5, "justification": "Parse error"}
    except Exception as e:
        print(f"Error in LLM prediction: {e}")
        return {"predicted_close": None, "likelihood": 0.5, "justification": str(e)}

# Test LLM prediction on a sample to verify API is working
print("🧪 Testing LLM API with a sample prediction...")
print("="*80)
sample_prompt = test_data[0]['prompt']
print("Sample prompt (first 300 chars):")
print(sample_prompt[:300] + "...\n")

llm_result = llm_predict_stock_price(sample_prompt)
print("LLM Prediction Result:")
print(json.dumps(llm_result, indent=2))

actual_response = json.loads(test_data[0]['response'])
print(f"\nActual Target Price: {actual_response['predicted_close']}")
print("\n✅ LLM API is working! Ready to generate predictions for all data.")
print("="*80)

🧪 Testing LLM API with a sample prediction...
Sample prompt (first 300 chars):
You are a financial analyst with expertise in stock market forecasting.
Your task is to analyze market data and predict the next trading day stock price.
Use historical price trends, technical indicators, and sentiment analysis to provide an informed forecast.
Ensure that your predictions are well-j...

LLM Prediction Result:
{
  "predicted_close": 31.5,
  "likelihood": 0.65,
  "justification": "The predicted close price of 31.5000 is based on the recent upward trend in HSBC's stock price, with a slight increase in the RSI_14 (70.01903430263613) indicating overbought conditions. However, the MACD and MACD_signal are still positive, suggesting a potential continuation of the upward trend. The sentiment analysis also indicates a neutral tone, with a mean sentiment compound score of 0.072325, which does not strongly influence the prediction."
}

Actual Target Price: 32.68000030517578

✅ LLM API is working!

## 4 (Alternative). Hugging Face Dedicated Endpoint - Fast & Unlimited!

If you've hit GROQ's daily token limit (500K tokens/day), you can use a Hugging Face Dedicated Endpoint instead.

**Benefits:**
- ✅ **No daily token quota**: requests are not capped per day, although the endpoint can still time out or return 503 (both occur in the training run logged in section 4.1)
- ✅ **Fast**: ~1-2s per prediction expected (similar to GROQ); the sample call recorded below took 16.82 s
- ✅ **Same model**: Meta's Llama 3.1 8B Instruct
- ✅ **No downloads**: Model already deployed on HF infrastructure
- ✅ **Dedicated**: Your own private endpoint

**Setup:**
1. You already have a Dedicated Endpoint: `https://o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud`
2. Get HF token: https://huggingface.co/settings/tokens
3. Add `HF_TOKEN=your_token_here` to your `.env` file
4. Run the cells below to configure the endpoint

In [11]:
# Hugging Face Dedicated Endpoint Setup (Alternative to GROQ)
import requests

# Your Dedicated Endpoint URL
HF_ENDPOINT_URL = "https://o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud"

# Get HF token
hf_token = os.getenv('HF_TOKEN')
if not hf_token:
    print("⚠️ HF_TOKEN not found in .env file")
    print("To use Hugging Face Dedicated Endpoint:")
    print("1. Get token: https://huggingface.co/settings/tokens")
    print("2. Add 'HF_TOKEN=your_token_here' to your .env file")
    hf_endpoint_loaded = False
else:
    print(f"✅ Hugging Face Dedicated Endpoint configured!")
    print(f"   Endpoint: {HF_ENDPOINT_URL}")
    print(f"   Model: Llama 3.1 8B Instruct")
    print(f"   Rate limits: UNLIMITED! 🎉")
    print(f"   Speed: ~1-2s per prediction (fast!)")
    print(f"\n💡 Your own dedicated infrastructure - no sharing with others!")
    hf_endpoint_loaded = True

# HF Endpoint Prediction Function (Alternative to GROQ)
def hf_endpoint_predict_stock_price(prompt: str) -> Dict:
    """Use Hugging Face Dedicated Endpoint to predict stock price"""
    try:
        headers = {
            "Accept": "application/json",
            "Authorization": f"Bearer {hf_token}",
            "Content-Type": "application/json"
        }
        
        payload = {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": MAX_TOKENS,
                "temperature": TEMPERATURE if TEMPERATURE > 0 else 0.1,
                "return_full_text": False
            }
        }
        
        response = requests.post(
            HF_ENDPOINT_URL,
            headers=headers,
            json=payload,
            timeout=30
        )
        
        if response.status_code != 200:
            print(f"HF Endpoint Error: {response.status_code} - {response.text}")
            return {"predicted_close": None, "likelihood": 0.5, "justification": f"API Error: {response.status_code}"}
        
        result_data = response.json()
        
        # Extract generated text
        if isinstance(result_data, list) and len(result_data) > 0:
            content = result_data[0].get('generated_text', '')
        elif isinstance(result_data, dict):
            content = result_data.get('generated_text', result_data.get('text', ''))
        else:
            content = str(result_data)
        
        # Parse JSON response
        if '{' in content and '}' in content:
            json_start = content.index('{')
            json_end = content.rindex('}') + 1
            json_str = content[json_start:json_end]
            result = json.loads(json_str)
            return result
        else:
            return {"predicted_close": None, "likelihood": 0.5, "justification": "Parse error"}
            
    except Exception as e:
        print(f"Error in HF endpoint prediction: {e}")
        return {"predicted_close": None, "likelihood": 0.5, "justification": str(e)}

# Test HF Endpoint (if configured)
if hf_endpoint_loaded:
    print("🧪 Testing Hugging Face Dedicated Endpoint with a sample prediction...")
    print("="*80)
    sample_prompt = test_data[0]['prompt']
    print("Sample prompt (first 300 chars):")
    print(sample_prompt[:300] + "...\n")
    
    print("⏰ Generating prediction...")
    start_time = time.time()
    
    try:
        hf_result = hf_endpoint_predict_stock_price(sample_prompt)
        elapsed = time.time() - start_time
        
        print(f"\n⏱️ Inference time: {elapsed:.2f} seconds")
        print("\nHF Endpoint Prediction Result:")
        print(json.dumps(hf_result, indent=2))
        
        actual_response = json.loads(test_data[0]['response'])
        print(f"\nActual Target Price: {actual_response['predicted_close']}")
        print(f"\n✅ HF Dedicated Endpoint is working!")
        print(f"💡 Speed: ~{elapsed:.1f}s per prediction")
        # The estimate below covers the training set only (len(train_data) samples)
        print(f"💡 Total time estimate: ~{(elapsed * len(train_data)) / 3600:.1f} hours for the training set")
        print(f"💡 No rate limits - run unlimited predictions!")
        
    except Exception as e:
        print(f"❌ HF Endpoint test failed: {e}")
        print("Falling back to GROQ API...")
    
    print("="*80)
else:
    print("⏭️ Skipping HF endpoint test - HF_TOKEN not configured")

✅ Hugging Face Dedicated Endpoint configured!
   Endpoint: https://o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud
   Model: Llama 3.1 8B Instruct
   Rate limits: UNLIMITED! 🎉
   Speed: ~1-2s per prediction (fast!)

💡 Your own dedicated infrastructure - no sharing with others!
🧪 Testing Hugging Face Dedicated Endpoint with a sample prediction...
Sample prompt (first 300 chars):
You are a financial analyst with expertise in stock market forecasting.
Your task is to analyze market data and predict the next trading day stock price.
Use historical price trends, technical indicators, and sentiment analysis to provide an informed forecast.
Ensure that your predictions are well-j...

⏰ Generating prediction...

⏱️ Inference time: 16.82 seconds

HF Endpoint Prediction Result:
{
  "predicted_close": 31.63,
  "likelihood": 0.8,
  "justification": "Based on recent closing prices, technical indicators, and sentiment analysis, the predicted close price for HSBC on 2023-0

### 🔄 How to Switch from GROQ to HF Dedicated Endpoint

**To use HF Dedicated Endpoint instead of GROQ:**

1. **Run the switch cell below**: it uses the HF endpoint automatically if `HF_TOKEN` is configured

2. **Resume your existing checkpoint**: the checkpointing system works with either provider

**Why switch to HF Dedicated Endpoint?**
- ✅ GROQ daily limit hit (500K tokens/day)
- ✅ No daily token quota 🚀
- ✅ Expected speed of ~1-2s per prediction, same as GROQ
- ✅ Your own dedicated infrastructure
- ✅ Can run 24/7 without stopping

**Speed comparison:**
- **GROQ**: ~1-2s per prediction, limited to 500K tokens/day (~500-600 predictions max)
- **HF Dedicated Endpoint**: ~1-2s per prediction expected, no daily cap ✨ (the sample call above took 16.82 s, and the training run in section 4.1 logged roughly 7-16 s per sample)
- **Total time for all data**: ~4-8 hours at 1-2s per prediction (not feasible within GROQ's daily limit)
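
The time estimates above follow from simple arithmetic, which the sketch below makes explicit. It assumes strictly sequential calls and uses the dataset sizes printed in section 3; `estimated_hours` is a hypothetical helper, and the 16.82 s figure is the single sample latency recorded in the endpoint test, not an average.

```python
def estimated_hours(n_samples, seconds_per_call, delay_seconds=0.0):
    """Wall-clock hours for sequential inference over n_samples prompts."""
    return n_samples * (seconds_per_call + delay_seconds) / 3600

# Train + validation + test sizes printed when the data was loaded
N_SAMPLES = 8698 + 1243 + 2477

for label, seconds in [("1 s/call", 1.0), ("2 s/call", 2.0), ("16.82 s/call", 16.82)]:
    print(f"{label}: {estimated_hours(N_SAMPLES, seconds):.1f} h")

# Implied token budget per prediction under the 500K tokens/day limit
for n_predictions in (500, 600):
    print(f"{n_predictions} predictions/day -> {500_000 / n_predictions:.0f} tokens each")
```

At 1-2 s per call the full 12,418 samples take roughly 3.4 to 6.9 hours, while at the observed 16.82 s per call the same run would take about 58 hours.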

In [12]:
# 🔄 SWITCH BETWEEN GROQ AND HF DEDICATED ENDPOINT
# Run this cell to switch inference providers

if hf_endpoint_loaded:
    # Use HF Dedicated Endpoint (no rate limits!)
    llm_predict_stock_price = hf_endpoint_predict_stock_price
    print("✅ Switched to HF Dedicated Endpoint")
    print(f"   Endpoint: {HF_ENDPOINT_URL}")
    print(f"   Model: Llama 3.1 8B Instruct")
    print(f"   Rate limits: NONE! 🎉")
    print(f"   Token limits: UNLIMITED! 🚀")
    print(f"   Speed: ~1-2s per prediction (FAST!)")
    print(f"\n💡 You can now run all predictions without any limits!")
    print(f"💡 Checkpoints will work seamlessly - just resume if interrupted")
    print(f"💡 Estimated time for all data: ~4-8 hours (vs impossible with GROQ)")
else:
    print("📌 Currently using: GROQ API (llama-3.1-8b-instant)")
    print("   Rate limit: 500K tokens/day (LIMITING!)")
    print("   Requests: 30 per minute")
    print("\n💡 To switch to HF Dedicated Endpoint (unlimited), add HF_TOKEN to .env")



✅ Switched to HF Dedicated Endpoint
   Endpoint: https://o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud
   Model: Llama 3.1 8B Instruct
   Rate limits: NONE! 🎉
   Token limits: UNLIMITED! 🚀
   Speed: ~1-2s per prediction (FAST!)

💡 You can now run all predictions without any limits!
💡 Checkpoints will work seamlessly - just resume if interrupted
💡 Estimated time for all data: ~4-8 hours (vs impossible with GROQ)


### ⚠️ Important: LLM Inference Process

This section **actually calls the LLM API** (GROQ by default, or the HF endpoint if the switch cell above was run) to generate predictions for all data:

**Data Split:**
- **Training data** (8,698 samples): Generate LLM predictions for reference
- **Validation data** (1,243 samples): Generate LLM predictions → Used to train PPO agent
- **Test data** (2,477 samples): Generate LLM predictions → Used for final evaluation

**Features:**
- ✅ **Checkpointing**: Progress saved every 50 samples
- ✅ **Rate limit handling**: Stops execution and saves checkpoint when rate limit is hit
- ✅ **Resume capability**: Simply re-run the cell to continue from the last checkpoint
- ⏰ **Estimated time**: the 0.5s delay per request alone adds about 1.7 hours over all 12,418 samples; inference time comes on top of that

**How it works:**
1. Each cell checks for existing checkpoint and resumes if found
2. If rate limit is hit, checkpoint is saved and execution stops
3. Wait a few minutes, then re-run the same cell to continue
4. Repeat until all samples are processed

**API Costs:**
- Total samples: 12,418
- Check GROQ pricing for your plan
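
The steps above handle a rate limit by stopping and re-running the cell by hand. For transient failures such as the read timeouts and 503 responses that appear in the HF endpoint log in section 4.1, an automatic retry with exponential backoff is one possible complement. The sketch below is an assumption-laden illustration: `call_with_retries`, its parameters, and the fake `flaky` function are hypothetical and not part of the notebook.

```python
import time

def call_with_retries(fn, is_success, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() until is_success(result) is true or attempts run out.

    Waits base_delay * 2**attempt seconds between attempts and returns
    the last result either way, so the caller can still log a failure.
    """
    result = None
    for attempt in range(max_attempts):
        result = fn()
        if is_success(result):
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return result

# Demo with a fake predictor that fails twice, then succeeds
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    price = 31.6 if state["calls"] >= 3 else None
    return {"predicted_close": price, "likelihood": 0.5, "justification": "demo"}

waits = []  # record the delays instead of actually sleeping
outcome = call_with_retries(flaky, lambda r: r["predicted_close"] is not None, sleep=waits.append)
print(outcome["predicted_close"], state["calls"], waits)
```

Because the notebook's prediction functions return a dict with `predicted_close` set to `None` instead of raising, the success check inspects the result rather than catching exceptions. In the notebook this could wrap a call such as `lambda: llm_predict_stock_price(item['prompt'])`.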

**Checkpoints saved to:**
- `../results/llm_predictions_train_checkpoint.json`
- `../results/llm_predictions_val_checkpoint.json`
- `../results/llm_predictions_checkpoint.json` (test)

**Checkpoint Format (JSON):**
Each checkpoint file contains:
- `predictions`: List of predicted closing prices
- `actual_prices`: List of actual target prices
- `llm_results`: List of full LLM responses including `predicted_close`, `likelihood`, and `justification`
- `last_idx`: Last processed index (for resuming)
- `completed`: Boolean indicating if all samples are processed
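
The checkpoint format above can be read and written with two small helpers, sketched here under stated assumptions. `save_checkpoint` and `load_checkpoint` are hypothetical names, and the write-to-temp-then-rename step is an addition that the notebook's own cells do not perform; it is included so that an interrupted write cannot leave a truncated JSON file.

```python
import json
import os
import tempfile

def save_checkpoint(path, predictions, actual_prices, llm_results, total):
    """Atomically write a checkpoint using the keys listed above."""
    checkpoint = {
        "predictions": predictions,
        "actual_prices": actual_prices,
        "llm_results": llm_results,
        "last_idx": len(predictions) - 1,
        "completed": len(predictions) == total,
    }
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file, then rename it over the real checkpoint
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(checkpoint, f, indent=2)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return (predictions, actual_prices, llm_results, start_idx)."""
    if not os.path.exists(path):
        return [], [], [], 0
    with open(path, "r") as f:
        checkpoint = json.load(f)
    return (
        checkpoint["predictions"],
        checkpoint["actual_prices"],
        checkpoint.get("llm_results", []),
        checkpoint["last_idx"] + 1,  # resume at the next unprocessed sample
    )
```

Deriving `last_idx` from the list length keeps the resume index and the stored lists in step, which matters because the loop appends to the lists and resumes by index.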

**Note:** You can run each dataset separately. For testing, you might want to start with just the validation and test sets (skip training if not needed).

### 4.1 Run LLM Inference on Training Data

We'll generate LLM predictions for the training dataset to use for PPO training later.
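
One caveat when using these predictions downstream: the inference cell below stores the ground-truth price in `predictions` whenever the LLM call fails or returns no price, so error metrics computed over the whole list are biased toward zero. One way to handle this, sketched here, is to score only the rows whose entry in `llm_results` has a non-null `predicted_close`. The helper name `mape_on_valid_predictions` is hypothetical, MAPE is used only as an example metric, and actual prices are assumed to be non-zero.

```python
def mape_on_valid_predictions(predictions, actual_prices, llm_results):
    """MAPE (%) over rows where the LLM itself returned a price.

    Returns (mape_percent, n_valid, n_failed); mape_percent is None
    when no row is valid.
    """
    errors = []
    n_failed = 0
    for pred, actual, result in zip(predictions, actual_prices, llm_results):
        if result.get("predicted_close") is None:
            n_failed += 1  # this slot holds the ground-truth fallback
            continue
        errors.append(abs(pred - actual) / abs(actual))
    mape = 100 * sum(errors) / len(errors) if errors else None
    return mape, len(errors), n_failed

# Made-up numbers: the middle row is a failed call filled with the label
preds = [110.0, 50.0, 90.0]
actuals = [100.0, 50.0, 100.0]
results = [{"predicted_close": 110.0}, {"predicted_close": None}, {"predicted_close": 90.0}]
print(mape_on_valid_predictions(preds, actuals, results))
```

Reporting the failed-row count alongside the metric also shows how much of the dataset the score actually covers.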

In [22]:
# Run LLM predictions on TRAINING data with checkpointing
checkpoint_file_train = '../results/llm_predictions_train_checkpoint.json'

# Load existing checkpoint if available
if os.path.exists(checkpoint_file_train):
    print(f"Loading existing training checkpoint from {checkpoint_file_train}")
    with open(checkpoint_file_train, 'r') as f:
        checkpoint = json.load(f)
    train_llm_predictions = checkpoint['predictions']
    train_actual_prices = checkpoint['actual_prices']
    train_llm_results = checkpoint.get('llm_results', [])  # Full LLM responses
    start_idx = checkpoint['last_idx'] + 1
    print(f"Resuming from index {start_idx}/{len(train_data)}")
else:
    train_llm_predictions = []
    train_actual_prices = []
    train_llm_results = []
    start_idx = 0
    print("Starting fresh LLM predictions on training data...")

# Run LLM predictions
print(f"\n🔄 Generating LLM predictions for {len(train_data)} TRAINING samples...")
print("⏰ This will take considerable time. You can stop and resume later.")

os.makedirs('../results', exist_ok=True)

def save_train_checkpoint(last_idx, completed=False):
    """Write the current prediction lists to the training checkpoint file."""
    checkpoint = {
        'predictions': train_llm_predictions,
        'actual_prices': train_actual_prices,
        'llm_results': train_llm_results,  # Full LLM responses with justification
        'last_idx': last_idx,
        'completed': completed,
    }
    with open(checkpoint_file_train, 'w') as f:
        json.dump(checkpoint, f, indent=2)

for idx in tqdm(range(start_idx, len(train_data)), desc="Training LLM Inference"):
    item = train_data[idx]
    actual_close = json.loads(item['response'])['predicted_close']

    # llm_predict_stock_price catches its own exceptions and reports them in
    # 'justification', so a rate limit has to be detected from that text.
    llm_result = llm_predict_stock_price(item['prompt'])
    predicted_close = llm_result.get('predicted_close')
    error_msg = str(llm_result.get('justification', '')).lower()

    if predicted_close is None and ('rate_limit' in error_msg or 'too many requests' in error_msg):
        print(f"\n❌ RATE LIMIT HIT at index {idx}!")
        print("Saving checkpoint and stopping execution...")
        save_train_checkpoint(last_idx=idx - 1)
        print(f"✅ Checkpoint saved to: {checkpoint_file_train}")
        print(f"📊 Progress: {idx}/{len(train_data)} samples completed")
        print("💡 Run this cell again to resume from where you left off.")
        break  # Stop execution

    # Store full LLM result (including justification)
    train_llm_results.append(llm_result)

    # Fallback: when the LLM returned no price, store the ground-truth price so
    # the lists stay aligned. These rows leak the label, so filter them out via
    # llm_results (predicted_close is None) before computing any metric.
    train_llm_predictions.append(predicted_close if predicted_close is not None else actual_close)
    train_actual_prices.append(actual_close)

    # Delay to avoid rate limiting
    time.sleep(0.5)

    # Checkpoint every 50 samples
    if (idx + 1) % 50 == 0:
        save_train_checkpoint(last_idx=idx)

# Final save
save_train_checkpoint(
    last_idx=len(train_llm_predictions) - 1,
    completed=len(train_llm_predictions) == len(train_data),
)

if len(train_llm_predictions) == len(train_data):
    print(f"\n✅ Training LLM predictions completed: {len(train_llm_predictions)} samples")
else:
    print(f"\n⚠️ Partial completion: {len(train_llm_predictions)}/{len(train_data)} samples")
print(f"Checkpoint saved to: {checkpoint_file_train}")

Loading existing training checkpoint from ../results/llm_predictions_train_checkpoint.json
Resuming from index 2500/8698

🔄 Generating LLM predictions for 8698 TRAINING samples...
⏰ This will take considerable time. You can stop and resume later.


Training LLM Inference:   1%|          | 33/6198 [06:34<15:51:04,  9.26s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 194 (char 193)


Training LLM Inference:   1%|          | 52/6198 [10:32<22:47:15, 13.35s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 194 (char 193)


Training LLM Inference:   2%|▏         | 113/6198 [24:13<26:30:40, 15.68s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:   2%|▏         | 121/6198 [25:53<19:01:21, 11.27s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 184 (char 183)


Training LLM Inference:   2%|▏         | 132/6198 [28:28<20:39:43, 12.26s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 305 (char 304)


Training LLM Inference:   3%|▎         | 168/6198 [35:26<17:41:38, 10.56s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:   3%|▎         | 179/6198 [37:33<16:13:50,  9.71s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:   4%|▎         | 231/6198 [46:53<12:25:12,  7.49s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:   4%|▍         | 245/6198 [50:25<20:41:34, 12.51s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:   6%|▌         | 368/6198 [1:14:20<14:59:13,  9.25s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:   6%|▌         | 376/6198 [1:15:57<15:02:47,  9.30s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:   7%|▋         | 424/6198 [1:25:01<16:07:38, 10.06s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 194 (char 193)


Training LLM Inference:   7%|▋         | 457/6198 [1:30:13<12:54:08,  8.09s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:   8%|▊         | 515/6198 [1:42:35<23:01:57, 14.59s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 194 (char 193)


Training LLM Inference:   9%|▉         | 550/6198 [1:50:33<22:29:52, 14.34s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:   9%|▉         | 570/6198 [1:54:31<22:24:26, 14.33s/it]

Error in HF endpoint prediction: Expecting ',' delimiter: line 1 column 1194 (char 1193)


Training LLM Inference:  11%|█         | 651/6198 [2:09:57<23:35:00, 15.31s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 194 (char 193)


Training LLM Inference:  11%|█         | 659/6198 [2:11:38<20:19:52, 13.21s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:  11%|█         | 671/6198 [2:14:25<16:59:57, 11.07s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:  11%|█         | 684/6198 [2:17:29<18:26:34, 12.04s/it]

Error in HF endpoint prediction: Extra data: line 3 column 1 (char 66)


Training LLM Inference:  12%|█▏        | 718/6198 [2:23:58<18:27:05, 12.12s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 194 (char 193)


Training LLM Inference:  12%|█▏        | 732/6198 [2:26:30<16:14:31, 10.70s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:  12%|█▏        | 751/6198 [2:30:18<16:51:44, 11.14s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:  13%|█▎        | 802/6198 [2:40:28<16:12:55, 10.82s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:  15%|█▍        | 908/6198 [3:03:49<19:49:29, 13.49s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:  15%|█▍        | 923/6198 [3:06:53<14:16:49,  9.75s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 186 (char 185)


Training LLM Inference:  15%|█▍        | 927/6198 [3:08:03<23:09:21, 15.82s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:  16%|█▌        | 968/6198 [3:17:42<22:21:26, 15.39s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:  16%|█▋        | 1017/6198 [3:27:26<16:30:13, 11.47s/it]

Error in HF endpoint prediction: Extra data: line 3 column 1 (char 66)


Training LLM Inference:  17%|█▋        | 1040/6198 [3:31:42<15:20:13, 10.70s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:  17%|█▋        | 1072/6198 [3:38:41<19:05:22, 13.41s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 178 (char 177)


Training LLM Inference:  18%|█▊        | 1090/6198 [3:42:10<16:38:24, 11.73s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 194 (char 193)


Training LLM Inference:  19%|‚ñà‚ñä        | 1147/6198 [3:53:09<15:47:36, 11.26s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 190 (char 189)


Training LLM Inference:  19%|‚ñà‚ñâ        | 1193/6198 [4:03:50<20:44:07, 14.91s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:  19%|‚ñà‚ñâ        | 1199/6198 [4:05:21<21:24:10, 15.41s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:  22%|‚ñà‚ñà‚ñè       | 1393/6198 [4:46:03<15:48:10, 11.84s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:  24%|‚ñà‚ñà‚ñç       | 1482/6198 [5:03:54<12:19:53,  9.41s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 448 (char 447)


Training LLM Inference:  24%|‚ñà‚ñà‚ñç       | 1507/6198 [5:09:06<12:49:41,  9.84s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 198 (char 197)


Training LLM Inference:  24%|‚ñà‚ñà‚ñç       | 1511/6198 [5:10:01<15:37:43, 12.00s/it]

Error in HF endpoint prediction: Invalid control character at: line 1 column 196 (char 195)


Training LLM Inference:  25%|‚ñà‚ñà‚ñç       | 1541/6198 [5:15:46<13:51:24, 10.71s/it]

Error in HF endpoint prediction: HTTPSConnectionPool(host='o988k6zvcj6ifd2u.us-east-1.aws.endpoints.huggingface.cloud', port=443): Read timed out. (read timeout=30)


Training LLM Inference:  25%|‚ñà‚ñà‚ñç       | 1542/6198 [6:32:55<1805:59:30, 1396.39s/it]

HF Endpoint Error: 503 - {"error":"503 Service Unavailable"}


Training LLM Inference:  25%|‚ñà‚ñà‚ñç       | 1543/6198 [6:32:58<1265:05:21, 978.37s/it] 

HF Endpoint Error: 503 - {"error":"503 Service Unavailable"}


Training LLM Inference:  25%|‚ñà‚ñà‚ñç       | 1544/6198 [6:33:03<887:09:04, 686.24s/it] 

HF Endpoint Error: 503 - {"error":"503 Service Unavailable"}


Training LLM Inference:  25%|‚ñà‚ñà‚ñç       | 1545/6198 [6:33:05<621:27:08, 480.81s/it]

HF Endpoint Error: 503 - {"error":"503 Service Unavailable"}


Training LLM Inference:  25%|‚ñà‚ñà‚ñç       | 1546/6198 [6:33:06<435:30:19, 337.02s/it]

HF Endpoint Error: 503 - {"error":"503 Service Unavailable"}


Training LLM Inference:  25%|‚ñà‚ñà‚ñç       | 1547/6198 [6:43:06<20:11:56, 15.63s/it]  


KeyboardInterrupt: 

### 4.2 Run LLM Inference on Validation Data

Generate predictions for validation data (used for PPO training).
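
The training log above shows two recurring parse failures, `Invalid control character` and `Extra data`. Both are raised by `json.loads` when the model output contains raw newlines inside a string or trailing text after the JSON object. The sketch below shows one tolerant way to parse such output using only the standard library. It assumes `llm_predict_stock_price` parses the endpoint text with `json.loads`; the helper name `parse_llm_json` is hypothetical and not part of the notebook.

```python
import json

def parse_llm_json(raw: str) -> dict:
    """Parse the first JSON object in `raw`, tolerating common LLM output quirks.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # strict=False permits literal control characters (e.g. raw newlines) inside strings.
    decoder = json.JSONDecoder(strict=False)
    # raw_decode parses one value starting at `start` and ignores whatever follows it.
    obj, _end = decoder.raw_decode(raw, start)
    return obj

# The "\n" inside the justification becomes a real newline character in the string,
# which strict parsing rejects; the trailing text would trigger "Extra data".
raw = '{"predicted_close": 101.5, "justification": "line one\nline two"}\n\nextra text'
result = parse_llm_json(raw)
print(result["predicted_close"])
```

Parsing this way would recover samples that currently fall through to the fallback branch, though it cannot help with timeouts or 503 responses.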

In [None]:
# Run LLM predictions on VALIDATION data with checkpointing
checkpoint_file_val = '../results/llm_predictions_val_checkpoint.json'

if os.path.exists(checkpoint_file_val):
    print(f"Loading existing validation checkpoint from {checkpoint_file_val}")
    with open(checkpoint_file_val, 'r') as f:
        checkpoint = json.load(f)
    val_llm_predictions = checkpoint['predictions']
    val_actual_prices = checkpoint['actual_prices']
    val_llm_results = checkpoint.get('llm_results', [])
    start_idx = checkpoint['last_idx'] + 1
    print(f"Resuming from index {start_idx}/{len(val_data)}")
else:
    val_llm_predictions = []
    val_actual_prices = []
    val_llm_results = []
    start_idx = 0
    print("Starting fresh LLM predictions on validation data...")

print(f"\nüîÑ Generating LLM predictions for {len(val_data)} VALIDATION samples...")

for idx in tqdm(range(start_idx, len(val_data)), desc="Validation LLM Inference"):
    item = val_data[idx]
    
    try:
        llm_result = llm_predict_stock_price(item['prompt'])
        
        # Store full LLM result
        val_llm_results.append(llm_result)
        
        if llm_result['predicted_close'] is not None:
            val_llm_predictions.append(llm_result['predicted_close'])
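        else:
            # Fallback when the LLM returns no price: reuse the reference close from the dataset
            # (the same value stored as the actual price, so such samples score zero error)
            response = json.loads(item['response'])
            val_llm_predictions.append(response['predicted_close'])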
        
        response = json.loads(item['response'])
        val_actual_prices.append(response['predicted_close'])
        
        # time.sleep(0.5)
        
        if (idx + 1) % 50 == 0:
            checkpoint = {
                'predictions': val_llm_predictions,
                'actual_prices': val_actual_prices,
                'llm_results': val_llm_results,
                'last_idx': idx
            }
            os.makedirs('../results', exist_ok=True)
            with open(checkpoint_file_val, 'w') as f:
                json.dump(checkpoint, f, indent=2)
    
    except Exception as e:
        error_msg = str(e)
        
        if 'rate_limit' in error_msg.lower() or 'too many requests' in error_msg.lower():
            print(f"\n❌ RATE LIMIT HIT at index {idx}!")
            print(f"Saving checkpoint and stopping execution...")
            checkpoint = {
                'predictions': val_llm_predictions,
                'actual_prices': val_actual_prices,
                'llm_results': val_llm_results,
                'last_idx': idx - 1
            }
            os.makedirs('../results', exist_ok=True)
            with open(checkpoint_file_val, 'w') as f:
                json.dump(checkpoint, f, indent=2)
            print(f"✅ Checkpoint saved to: {checkpoint_file_val}")
            print(f"📊 Progress: {idx}/{len(val_data)} samples completed")
            print(f"💡 Run this cell again to resume from where you left off.")
            break  # Stop execution
        else:
            print(f"\n⚠️ Error at index {idx}: {error_msg}")
            error_result = {"predicted_close": None, "likelihood": 0.5, "justification": f"Error: {error_msg}"}
            val_llm_results.append(error_result)
            response = json.loads(item['response'])
            val_llm_predictions.append(response['predicted_close'])
            val_actual_prices.append(response['predicted_close'])

checkpoint = {
    'predictions': val_llm_predictions,
    'actual_prices': val_actual_prices,
    'llm_results': val_llm_results,
    'last_idx': len(val_llm_predictions) - 1,
    'completed': len(val_llm_predictions) == len(val_data)
}
with open(checkpoint_file_val, 'w') as f:
    json.dump(checkpoint, f, indent=2)

if len(val_llm_predictions) == len(val_data):
    print(f"\n✅ Validation LLM predictions completed: {len(val_llm_predictions)} samples")
else:
    print(f"\n⚠️ Partial completion: {len(val_llm_predictions)}/{len(val_data)} samples")
print(f"Checkpoint saved to: {checkpoint_file_val}")
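
The cell above writes the checkpoint directly to its final path, so an interruption during `json.dump` (such as the `KeyboardInterrupt` seen in the training run) could leave a truncated file that fails to load on resume. A minimal sketch of an atomic write follows. The helper name `save_checkpoint` and the demo values are assumptions made for illustration; the payload keys mirror the ones used in the notebook.

```python
import json
import os
import tempfile

def save_checkpoint(path, predictions, actual_prices, llm_results, last_idx):
    """Write the checkpoint to a temporary file, then swap it into place.

    Hypothetical helper; the dictionary keys match the notebook's checkpoints.
    """
    payload = {
        "predictions": predictions,
        "actual_prices": actual_prices,
        "llm_results": llm_results,
        "last_idx": last_idx,
    }
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f, indent=2)
    # os.replace overwrites the destination in a single step.
    os.replace(tmp_path, path)

# Demo with made-up values in a throwaway directory.
demo_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
save_checkpoint(demo_path, [101.5], [102.0], [{"predicted_close": 101.5}], last_idx=0)
with open(demo_path) as f:
    loaded = json.load(f)
print(loaded["last_idx"])
```

A single helper like this could also replace the three near-identical checkpoint blocks in each inference cell.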

### 4.3 Run LLM Inference on Test Data

Generate predictions for test data (used for final evaluation).
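
The training run logged repeated `Read timed out` and `503 Service Unavailable` responses from the endpoint. These are usually transient, so retrying with an increasing delay is one option before falling back. The sketch below is an assumption-laden illustration: `call_with_retry` is a hypothetical wrapper, and it only helps if the wrapped function raises on failure. The log suggests `llm_predict_stock_price` currently prints and swallows these errors, so it would need to re-raise them for this wrapper to take effect.

```python
import time

def call_with_retry(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with exponential backoff.

    Hypothetical helper: waits base_delay * 2**attempt seconds between attempts
    and re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo: a stand-in function that fails twice, then succeeds.
calls = {"n": 0}

def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("Read timed out")
    return {"predicted_close": 100.0}

delays = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky, "prompt", sleep=delays.append)
print(result, delays)
```

In the inference loop the call would then read `call_with_retry(llm_predict_stock_price, item['prompt'])`, leaving the existing rate-limit handling in the `except` branch unchanged.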

In [None]:
# Run LLM predictions on test data with checkpointing
import time

# Checkpoint file to save progress
checkpoint_file = '../results/llm_predictions_checkpoint.json'

# Load existing checkpoint if available
if os.path.exists(checkpoint_file):
    print(f"Loading existing checkpoint from {checkpoint_file}")
    with open(checkpoint_file, 'r') as f:
        checkpoint = json.load(f)
    llm_predictions = checkpoint['predictions']
    actual_prices = checkpoint['actual_prices']
    llm_results = checkpoint.get('llm_results', [])
    start_idx = checkpoint['last_idx'] + 1
    print(f"Resuming from index {start_idx}/{len(test_data)}")
else:
    llm_predictions = []
    actual_prices = []
    llm_results = []
    start_idx = 0
    print("Starting fresh LLM predictions...")

# Run LLM predictions with rate limiting and checkpointing
print(f"Generating LLM predictions for {len(test_data)} samples...")
print("This may take a while due to API rate limits...")

for idx in tqdm(range(start_idx, len(test_data)), desc="LLM Inference"):
    item = test_data[idx]
    
    try:
        # Get LLM prediction
        llm_result = llm_predict_stock_price(item['prompt'])
        
        # Store full LLM result
        llm_results.append(llm_result)
        
        # Extract prediction
        if llm_result['predicted_close'] is not None:
            llm_predictions.append(llm_result['predicted_close'])
        else:
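            # Fallback when the LLM returns no price: reuse the reference close from the dataset
            # (the same value stored as the actual price, so such samples score zero error)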
            response = json.loads(item['response'])
            llm_predictions.append(response['predicted_close'])
        
        # Get actual price from response
        response = json.loads(item['response'])
        actual_prices.append(response['predicted_close'])
        
        # Small delay to avoid rate limiting (adjust based on your API limits)
        #time.sleep(0.5)

        # Checkpoint every 50 samples
        if (idx + 1) % 50 == 0:
            checkpoint = {
                'predictions': llm_predictions,
                'actual_prices': actual_prices,
                'llm_results': llm_results,
                'last_idx': idx
            }
            os.makedirs('../results', exist_ok=True)
            with open(checkpoint_file, 'w') as f:
                json.dump(checkpoint, f, indent=2)
            print(f"\nCheckpoint saved at index {idx + 1}")
    
    except Exception as e:
        error_msg = str(e)
        
        # Handle rate limiting
        if 'rate_limit' in error_msg.lower() or 'too many requests' in error_msg.lower():
            print(f"❌ RATE LIMIT HIT at index {idx}!")
            print(f"Saving checkpoint and stopping execution...")
            
            # Save checkpoint
            checkpoint = {
                'predictions': llm_predictions,
                'actual_prices': actual_prices,
                'llm_results': llm_results,
                'last_idx': idx - 1
            }
            os.makedirs('../results', exist_ok=True)
            with open(checkpoint_file, 'w') as f:
                json.dump(checkpoint, f, indent=2)
            
            print(f"✅ Checkpoint saved to: {checkpoint_file}")
            print(f"📊 Progress: {idx}/{len(test_data)} samples completed")
            print(f"💡 Run this cell again to resume from where you left off.")
            break  # Stop execution
        else:
            print(f"⚠️ Error at index {idx}: {error_msg}")
            # Store error result
            error_result = {"predicted_close": None, "likelihood": 0.5, "justification": f"Error: {error_msg}"}
            llm_results.append(error_result)
            # Use fallback
            response = json.loads(item['response'])
            llm_predictions.append(response['predicted_close'])
            actual_prices.append(response['predicted_close'])

# Final save
checkpoint = {
    'predictions': llm_predictions,
    'actual_prices': actual_prices,
    'llm_results': llm_results,
    'last_idx': len(llm_predictions) - 1,
    'completed': len(llm_predictions) == len(test_data)
}
with open(checkpoint_file, 'w') as f:
    json.dump(checkpoint, f, indent=2)

# Merge with test_df (assumes a complete run: both lists must have len(test_df) entries)
test_df['llm_prediction'] = llm_predictions
test_df['actual_price'] = actual_prices

if len(llm_results) == len(test_df):
    justifications = []
    likelihoods = []
    feature_rows = []
    for res in llm_results:
        res = res if isinstance(res, dict) else {}
        justification = res.get('justification', '')
        justifications.append(justification)
        likelihoods.append(safe_float(res.get('likelihood'), 0.5))
        feature_rows.append(extract_justification_features(justification))
else:
    justifications = [''] * len(test_df)
    likelihoods = [0.5] * len(test_df)
    feature_rows = [extract_justification_features('') for _ in range(len(test_df))]

if feature_rows:
    feature_keys = list(feature_rows[0].keys())
else:
    feature_keys = list(extract_justification_features('').keys())

test_df['llm_justification'] = justifications
test_df['llm_likelihood'] = likelihoods
for key in feature_keys:
    test_df[key] = [row[key] for row in feature_rows]

if len(llm_predictions) == len(test_data):
    print(f"✅ LLM predictions completed: {len(llm_predictions)} samples")
else:
    print(f"⚠️ Partial completion: {len(llm_predictions)}/{len(test_data)} samples")
print(f"Checkpoint saved to: {checkpoint_file}")
print("Sample predictions:")
print(test_df[['ticker', 'llm_prediction', 'actual_price']].head())


## 5. Stage 2: Risk-Aware PPO Environment Setup

In [None]:
# Financial Risk Metrics
def calculate_var(returns: np.ndarray, confidence_level: float = 0.95) -> float:
    """Calculate Value at Risk (VaR)"""
    if len(returns) == 0:
        return 0.0
    return np.percentile(returns, (1 - confidence_level) * 100)

def calculate_cvar(returns: np.ndarray, confidence_level: float = 0.95) -> float:
    """Calculate Conditional Value at Risk (CVaR) - Expected Shortfall"""
    if len(returns) == 0:
        return 0.0
    var = calculate_var(returns, confidence_level)
    # CVaR is the average of losses beyond VaR
    tail_losses = returns[returns <= var]
    if len(tail_losses) == 0:
        return var
    return np.mean(tail_losses)

def calculate_volatility(prices: np.ndarray) -> float:
    """Calculate price volatility (standard deviation of returns)"""
    if len(prices) < 2:
        return 0.0
    returns = np.diff(prices) / prices[:-1]
    return np.std(returns)

print("Risk metrics functions defined.")
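
As a worked example of the two risk metrics, the sketch below repeats the same percentile-based definitions on a small, made-up return series. The numbers are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Five hypothetical daily returns, chosen only to make the arithmetic easy to follow.
returns = np.array([-0.05, -0.02, 0.0, 0.01, 0.03])
confidence_level = 0.95

# VaR: the 5th percentile of returns (linear interpolation between sorted values).
var = np.percentile(returns, (1 - confidence_level) * 100)

# CVaR: the mean of all returns at or below the VaR threshold.
tail = returns[returns <= var]
cvar = tail.mean() if len(tail) > 0 else var

print(f"VaR: {var:.4f}, CVaR: {cvar:.4f}")
```

Here the VaR is interpolated between the two worst returns (about -0.044), and the CVaR equals the single worst return (-0.05). With `window_size=5` the environment only sees four returns per step, so its CVaR will likewise tend to collapse to the worst return in the window.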

In [None]:
# Custom Gym Environment for Stock Price Prediction with PPO
class StockPredictionEnv(gym.Env):
    """Custom Environment for Risk-Aware Stock Price Prediction"""
    
    def __init__(self, data_df: pd.DataFrame, window_size: int = 5):
        super(StockPredictionEnv, self).__init__()
        
        self.data = data_df.copy()
        self.window_size = window_size
        self.current_step = 0
        self.max_steps = len(self.data)
        
        # Dynamic state space includes LLM justification signals
        self.extra_feature_cols = [
            'llm_likelihood',
            'justification_pos_ratio',
            'justification_neg_ratio',
            'justification_risk_ratio',
            'justification_polarity',
            'justification_length'
        ]
        self.available_extra_cols = [c for c in self.extra_feature_cols if c in self.data.columns]
        
        # State: [llm_prediction, historical_prices (window), volatility, var, justification features]
        state_dim = 1 + window_size + 2 + len(self.available_extra_cols)
        
        # Action space: adjustment factor (continuous)
        self.action_space = spaces.Box(
            low=-0.1, high=0.1, shape=(1,), dtype=np.float32
        )
        
        # Observation space
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(state_dim,), dtype=np.float32
        )
        
        # Risk parameters
        self.lambda_risk = 0.5  # Risk penalty weight
        self.confidence_penalty_weight = 0.05
        self.justification_weight = 0.1
        self.sentiment_weight = 0.05
        self.confidence_level = 0.95
        
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = self.window_size
        return self._get_observation(), {}
    
    def _get_observation(self):
        """Construct state representation"""
        idx = min(self.current_step, self.max_steps - 1)
        
        # LLM prediction
        llm_pred = float(self.data.iloc[idx]['llm_prediction'])
        
        # Historical prices (window)
        hist_prices = []
        if 'recent_prices' in self.data.columns and self.data.iloc[idx]['recent_prices'] is not None:
            hist_prices = list(self.data.iloc[idx]['recent_prices'])
        if len(hist_prices) < self.window_size:
            pad_value = hist_prices[-1] if hist_prices else llm_pred
            hist_prices = hist_prices + [pad_value] * (self.window_size - len(hist_prices))
        hist_prices = np.array(hist_prices[-self.window_size:], dtype=np.float32)
        
        # Volatility
        volatility = calculate_volatility(hist_prices)
        
        # VaR (using historical returns)
        returns = np.diff(hist_prices) / hist_prices[:-1] if len(hist_prices) > 1 else np.array([0.0])
        var = calculate_var(returns, self.confidence_level)
        
        # Justification-driven features
        extra_features = [
            float(self.data.iloc[idx].get(col, 0.0))
            for col in self.available_extra_cols
        ]
        
        state = np.concatenate([
            np.array([llm_pred], dtype=np.float32),
            hist_prices,
            np.array([volatility, var], dtype=np.float32),
            np.array(extra_features, dtype=np.float32)
        ])
        
        return state.astype(np.float32)
    
    def step(self, action):
        """Execute one step"""
        idx = min(self.current_step, self.max_steps - 1)
        
        # Get LLM prediction and actual price
        llm_pred = float(self.data.iloc[idx]['llm_prediction'])
        actual_price = float(self.data.iloc[idx]['actual_price'])
        
        # Apply action (adjustment)
        adjustment = float(action[0])
        adjusted_pred = llm_pred * (1 + adjustment)
        
        # Calculate prediction error (relative if possible)
        pred_error = abs(adjusted_pred - actual_price)
        if actual_price != 0:
            scaled_error = pred_error / abs(actual_price)
        else:
            scaled_error = pred_error
        
        # Calculate risk penalty (using CVaR)
        if 'recent_prices' in self.data.columns and self.data.iloc[idx]['recent_prices'] is not None:
            hist_prices = np.array(self.data.iloc[idx]['recent_prices'][-self.window_size:], dtype=np.float32)
            returns = np.diff(hist_prices) / hist_prices[:-1] if len(hist_prices) > 1 else np.array([0.0])
            cvar = abs(calculate_cvar(returns, self.confidence_level))
        else:
            cvar = 0.0
        
        confidence = float(self.data.iloc[idx].get('llm_likelihood', 0.5))
        justification_risk = float(self.data.iloc[idx].get('justification_risk_ratio', 0.0))
        justification_polarity = float(self.data.iloc[idx].get('justification_polarity', 0.0))
        
        risk_penalty = self.lambda_risk * cvar
        confidence_penalty = self.confidence_penalty_weight * (1 - confidence)
        justification_penalty = self.justification_weight * justification_risk
        sentiment_penalty = self.sentiment_weight * max(-justification_polarity, 0.0)
        
        reward = -scaled_error - risk_penalty - confidence_penalty - justification_penalty - sentiment_penalty
        
        # Move to next step
        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        truncated = False
        
        # Next observation
        next_state = self._get_observation()
        
        return next_state, reward, terminated, truncated, {}

print("Stock Prediction Environment defined.")
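
To make the reward in `step` easier to check by hand, the sketch below restates it as a standalone function and evaluates it on made-up inputs. The function name `risk_aware_reward` and the sample values are assumptions for illustration; the weights are the defaults set in `__init__`.

```python
def risk_aware_reward(llm_pred, adjustment, actual_price, cvar, confidence,
                      justification_risk, justification_polarity,
                      lambda_risk=0.5, confidence_penalty_weight=0.05,
                      justification_weight=0.1, sentiment_weight=0.05):
    """Standalone restatement of the reward computed in StockPredictionEnv.step."""
    adjusted_pred = llm_pred * (1 + adjustment)
    pred_error = abs(adjusted_pred - actual_price)
    # Relative error when the actual price is nonzero, absolute error otherwise.
    scaled_error = pred_error / abs(actual_price) if actual_price != 0 else pred_error
    penalties = (
        lambda_risk * abs(cvar)
        + confidence_penalty_weight * (1 - confidence)
        + justification_weight * justification_risk
        + sentiment_weight * max(-justification_polarity, 0.0)
    )
    return -scaled_error - penalties

# Hypothetical inputs: the LLM predicts 100, the agent nudges it up by 2%, the actual close is 101.
common = dict(actual_price=101.0, cvar=-0.03, confidence=0.8,
              justification_risk=0.1, justification_polarity=-0.2)
reward = risk_aware_reward(100.0, 0.02, **common)
reward_no_adjust = risk_aware_reward(100.0, 0.0, **common)
print(f"reward with +2% adjustment: {reward:.6f}")
print(f"reward with no adjustment:  {reward_no_adjust:.6f}")
```

In both calls the penalty total is 0.045 and only the scaled error changes. The four penalty terms do not depend on the action, so within a given state they shift the reward by a constant and do not change which adjustment scores best.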


## 6. PPO Training on Training Data

Train the PPO agent on the training set to learn risk-aware adjustments to LLM predictions.

In [None]:
# Prepare training data for PPO using training set with LLM predictions
train_parsed = []
for idx, item in enumerate(train_data):
    parsed = parse_prompt_data(item['prompt'])
    response = json.loads(item['response'])
    llm_output = train_llm_results[idx] if idx < len(train_llm_results) else {}

    if isinstance(llm_output, dict) and llm_output.get('predicted_close') is not None:
        parsed['llm_prediction'] = safe_float(llm_output.get('predicted_close'), response['predicted_close'])
    elif idx < len(train_llm_predictions):
        parsed['llm_prediction'] = train_llm_predictions[idx]
    else:
        parsed['llm_prediction'] = response['predicted_close']

    if idx < len(train_actual_prices):
        parsed['actual_price'] = train_actual_prices[idx]
    else:
        parsed['actual_price'] = response['predicted_close']

    llm_likelihood = safe_float(llm_output.get('likelihood') if isinstance(llm_output, dict) else None, response.get('likelihood', 0.5))
    parsed['llm_likelihood'] = llm_likelihood
    parsed['likelihood'] = llm_likelihood

    justification_text = llm_output.get('justification', '') if isinstance(llm_output, dict) else ''
    parsed['llm_justification'] = justification_text
    parsed.update(extract_justification_features(justification_text))

    train_parsed.append(parsed)

train_df_ppo = pd.DataFrame(train_parsed)
print(f"Training data prepared for PPO training: {len(train_df_ppo)} samples")
print(f"With LLM predictions: {sum(train_df_ppo['llm_prediction'].notna())} samples")
train_df_ppo.head()


In [None]:
# Create and train PPO model
print("Creating PPO training environment...")

# Create environment using TRAINING data (more samples = better RL learning)
env = StockPredictionEnv(train_df_ppo, window_size=5)

# Initialize PPO agent
print("\nInitializing PPO agent...")
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1
)

# Train PPO model on training data
print("\nTraining PPO model on training data...")
print(f"Training samples: {len(train_df_ppo)}")
print("This may take several minutes...")

# Scale total_timesteps with the training set size (20 steps per sample),
# capped at 200,000
total_timesteps = min(200000, len(train_df_ppo) * 20)
print(f"Total timesteps: {total_timesteps}")

model.learn(total_timesteps=total_timesteps)

print("\n✅ PPO training completed!")
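
For a sense of scale, the sketch below works through the timestep budget implied by the cell above. It assumes `train_df_ppo` has 6,198 rows, the sample count shown in the training progress bar; the variable names are illustrative.

```python
import math

n_samples = 6198    # assumed: training sample count from the inference progress bar
window_size = 5     # as passed to StockPredictionEnv
n_steps = 2048      # PPO rollout length used above

# Same rule as the training cell: 20 steps per sample, capped at 200,000.
total_timesteps = min(200_000, n_samples * 20)

# reset() starts at window_size and the episode ends at len(data),
# so one episode covers n_samples - window_size steps.
episode_length = n_samples - window_size

passes_over_data = total_timesteps / episode_length
rollouts = math.ceil(total_timesteps / n_steps)

print(total_timesteps, episode_length, round(passes_over_data, 2), rollouts)
```

Under these assumptions the budget is 123,960 steps, roughly 20 passes over a 6,193-step episode, collected in 61 rollouts of 2,048 steps. As far as I know, Stable-Baselines3 collects whole rollouts, so the actual step count would be slightly above the requested total.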

### 6.1 (Optional) Validate PPO on Validation Set

Before applying to test data, optionally evaluate PPO performance on validation data.

In [None]:
# Optional: Prepare and evaluate on validation data
val_parsed = []
for idx, item in enumerate(val_data):
    parsed = parse_prompt_data(item['prompt'])
    response = json.loads(item['response'])
    llm_output = val_llm_results[idx] if idx < len(val_llm_results) else {}

    if isinstance(llm_output, dict) and llm_output.get('predicted_close') is not None:
        parsed['llm_prediction'] = safe_float(llm_output.get('predicted_close'), response['predicted_close'])
    elif idx < len(val_llm_predictions):
        parsed['llm_prediction'] = val_llm_predictions[idx]
    else:
        parsed['llm_prediction'] = response['predicted_close']

    if idx < len(val_actual_prices):
        parsed['actual_price'] = val_actual_prices[idx]
    else:
        parsed['actual_price'] = response['predicted_close']

    llm_likelihood = safe_float(llm_output.get('likelihood') if isinstance(llm_output, dict) else None, response.get('likelihood', 0.5))
    parsed['llm_likelihood'] = llm_likelihood
    parsed['likelihood'] = llm_likelihood

    justification_text = llm_output.get('justification', '') if isinstance(llm_output, dict) else ''
    parsed['llm_justification'] = justification_text
    parsed.update(extract_justification_features(justification_text))

    val_parsed.append(parsed)

val_df = pd.DataFrame(val_parsed)

# Apply PPO to validation set
val_env = StockPredictionEnv(val_df, window_size=5)
val_obs, _ = val_env.reset()

val_ppo_predictions = []
for idx in range(len(val_df)):
    if idx < val_env.window_size:
        val_ppo_predictions.append(val_df.iloc[idx]['llm_prediction'])
        continue
    
    action, _ = model.predict(val_obs, deterministic=True)
    llm_pred = val_df.iloc[idx]['llm_prediction']
    adjusted_pred = llm_pred * (1 + action[0])
    val_ppo_predictions.append(adjusted_pred)
    
    if idx < len(val_df) - 1:
        val_obs, _, terminated, _, _ = val_env.step(action)
        if terminated:
            break

val_df['ppo_adjusted_prediction'] = val_ppo_predictions

# Quick validation metrics
val_llm_mae = np.mean(np.abs(val_df['llm_prediction'] - val_df['actual_price']))
val_ppo_mae = np.mean(np.abs(val_df['ppo_adjusted_prediction'] - val_df['actual_price']))

print("\nValidation Set Results:")
print(f"LLM MAE: {val_llm_mae:.4f}")
print(f"LLM-PPO MAE: {val_ppo_mae:.4f}")
print(f"Improvement: {((val_llm_mae - val_ppo_mae) / val_llm_mae * 100):.2f}%")
print("\n✅ Validation complete! Proceeding to test set...")


## 7. Apply PPO Adjustments to Test Set

In [None]:
# Apply PPO adjustments to test predictions
def apply_ppo_adjustment(model, test_df):
    """Apply trained PPO model to adjust predictions"""
    adjusted_predictions = []
    
    env = StockPredictionEnv(test_df, window_size=5)
    obs, _ = env.reset()
    
    for idx in range(len(test_df)):
        if idx < env.window_size:
            # For early samples, use LLM prediction as-is
            adjusted_predictions.append(test_df.iloc[idx]['llm_prediction'])
            continue
        
        # Get PPO action
        action, _ = model.predict(obs, deterministic=True)
        
        # Apply adjustment
        llm_pred = test_df.iloc[idx]['llm_prediction']
        adjusted_pred = llm_pred * (1 + action[0])
        adjusted_predictions.append(adjusted_pred)
        
        # Step environment
        if idx < len(test_df) - 1:
            obs, _, terminated, _, _ = env.step(action)
            if terminated:
                break
    
    return adjusted_predictions

print("Applying PPO adjustments to test set...")
test_df['ppo_adjusted_prediction'] = apply_ppo_adjustment(model, test_df)
print("PPO adjustments applied!")

# Display results
test_df[['ticker', 'llm_prediction', 'ppo_adjusted_prediction', 'actual_price']].head(10)

## 8. Baseline Models Implementation (COMMENTED OUT - Only using LLM and LLM-PPO)

<!-- Baseline models (SVR, XGBoost, LSTM) are commented out to focus on LLM and LLM-PPO comparison -->

In [None]:
# # Prepare features from all_labels for baseline models
# # Filter for test period (last 30% of data)
# all_labels['Date'] = pd.to_datetime(all_labels['Date'])
# all_labels = all_labels.sort_values('Date')

# # Create feature set
# feature_cols = ['SMA_20', 'SMA_50', 'EMA_12', 'EMA_26', 'RSI_14', 
#                 'MACD', 'MACD_signal', 'MACD_hist', 'BB_width_20_2',
#                 'headline_count', 'sent_compound_mean']

# # Fill NaN values
# all_labels[feature_cols] = all_labels[feature_cols].fillna(0)

# # Split by date (70% train, 30% test)
# train_size = int(len(all_labels) * 0.7)
# train_labels = all_labels.iloc[:train_size]
# test_labels = all_labels.iloc[train_size:]

# X_train = train_labels[feature_cols].values
# y_train = train_labels['next_close'].values
# X_test = test_labels[feature_cols].values
# y_test = test_labels['next_close'].values

# # Standardize features
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# print(f"Training set: {X_train.shape}")
# print(f"Test set: {X_test.shape}")

print("Baseline models commented out - only using LLM and LLM-PPO")

In [None]:
# # Train SVR model
# print("Training SVR model...")
# svr_model = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
# svr_model.fit(X_train_scaled, y_train)
# svr_predictions = svr_model.predict(X_test_scaled)
# print("SVR training completed!")

print("SVR model commented out")

In [None]:
# # Train XGBoost model
# print("Training XGBoost model...")
# xgb_model = XGBRegressor(
#     n_estimators=100,
#     learning_rate=0.1,
#     max_depth=5,
#     random_state=42
# )
# xgb_model.fit(X_train_scaled, y_train)
# xgb_predictions = xgb_model.predict(X_test_scaled)
# print("XGBoost training completed!")

print("XGBoost model commented out")

In [None]:
# # Build LSTM model
# print("Building and training LSTM model...")

# # Reshape data for LSTM (samples, timesteps, features)
# X_train_lstm = X_train_scaled.reshape((X_train_scaled.shape[0], 1, X_train_scaled.shape[1]))
# X_test_lstm = X_test_scaled.reshape((X_test_scaled.shape[0], 1, X_test_scaled.shape[1]))

# # Build LSTM model
# lstm_model = Sequential([
#     LSTM(50, activation='relu', input_shape=(1, X_train_scaled.shape[1])),
#     Dense(25, activation='relu'),
#     Dense(1)
# ])

# lstm_model.compile(optimizer='adam', loss='mse')

# # Train LSTM
# history = lstm_model.fit(
#     X_train_lstm, 
#     y_train,
#     epochs=50,
#     batch_size=32,
#     validation_split=0.1,
#     verbose=0
# )

# lstm_predictions = lstm_model.predict(X_test_lstm).flatten()
# print("LSTM training completed!")

print("LSTM model commented out")

## 9. Evaluation Metrics Implementation

In [None]:
# Evaluation metric functions
def calculate_mape(y_true, y_pred):
    """Calculate Mean Absolute Percentage Error"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

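def calculate_rmse(y_true, y_pred):
    """Calculate Root Mean Square Error"""
    # Computed with NumPy because the sklearn import (mean_squared_error) is commented out above
    y_true, y_pred = np.array(y_true, dtype=float), np.array(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))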

def calculate_returns(prices):
    """Calculate returns from prices"""
    prices = np.array(prices)
    return np.diff(prices) / prices[:-1]

def calculate_sharpe_ratio(returns, risk_free_rate=0.0):
    """Calculate Sharpe Ratio"""
    excess_returns = returns - risk_free_rate
    if np.std(returns) == 0:
        return 0.0
    return np.mean(excess_returns) / np.std(returns)

def calculate_sortino_ratio(returns, risk_free_rate=0.0):
    """Calculate Sortino Ratio"""
    excess_returns = returns - risk_free_rate
    downside_returns = returns[returns < 0]
    if len(downside_returns) == 0 or np.std(downside_returns) == 0:
        return 0.0
    return np.mean(excess_returns) / np.std(downside_returns)

def calculate_max_drawdown(prices):
    """Calculate Maximum Drawdown"""
    prices = np.array(prices)
    cummax = np.maximum.accumulate(prices)
    drawdowns = (prices - cummax) / cummax
    return np.min(drawdowns)

def calculate_cumulative_return(prices):
    """Calculate Cumulative Return"""
    prices = np.array(prices)
    return (prices[-1] - prices[0]) / prices[0]

print("Evaluation metrics defined.")
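
The sketch below runs the same formulas on tiny, made-up inputs so the metric definitions can be verified by hand. The arrays are illustrative assumptions, not results from the notebook.

```python
import numpy as np

# Error metrics on two hypothetical predictions.
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 190.0])
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # mean of 10% and 5%
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))            # both errors are 10

# Path metrics on a hypothetical price series.
prices = np.array([100.0, 110.0, 99.0, 105.0, 120.0])
running_peak = np.maximum.accumulate(prices)
max_drawdown = np.min((prices - running_peak) / running_peak)  # worst drop from a peak
cumulative_return = (prices[-1] - prices[0]) / prices[0]

print(mape, rmse, max_drawdown, cumulative_return)
```

The MAPE is 7.5%, the RMSE is 10, the maximum drawdown is -10% (the fall from 110 to 99), and the cumulative return is 20%.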

In [None]:
# Evaluate all models by ticker
def evaluate_model_by_ticker(predictions, actual_prices, test_labels):
    """Evaluate model performance for each ticker"""
    results = {}
    
    for ticker in test_labels['ticker'].unique():
        ticker_mask = test_labels['ticker'] == ticker
        ticker_pred = predictions[ticker_mask]
        ticker_actual = actual_prices[ticker_mask]
        
        # Calculate metrics
        mape = calculate_mape(ticker_actual, ticker_pred)
        rmse = calculate_rmse(ticker_actual, ticker_pred)
        
        # Returns-based metrics
        returns = calculate_returns(ticker_pred)
        sharpe = calculate_sharpe_ratio(returns)
        sortino = calculate_sortino_ratio(returns)
        max_dd = calculate_max_drawdown(ticker_pred)
        cum_return = calculate_cumulative_return(ticker_pred)
        
        results[ticker] = {
            'MAPE': mape,
            'RMSE': rmse,
            'Sharpe Ratio': sharpe,
            'Sortino Ratio': sortino,
            'Max Drawdown': max_dd,
            'Cumulative Return': cum_return
        }
    
    return results

print("Model evaluation function defined.")
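
The per-ticker split above uses a boolean mask for each ticker. As a cross-check, the same per-ticker MAPE can be computed with a pandas `groupby`. The sketch below uses a small made-up frame; the column names `pred` and `actual` are assumptions and differ from the notebook's `llm_prediction` and `actual_price`.

```python
import pandas as pd

# Hypothetical predictions for two tickers.
df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "PEP"],
    "pred":   [110.0, 190.0, 49.0],
    "actual": [100.0, 200.0, 50.0],
})

# Absolute percentage error per row, then averaged within each ticker.
ape = (df["actual"] - df["pred"]).abs() / df["actual"].abs()
mape_by_ticker = ape.groupby(df["ticker"]).mean() * 100

print(mape_by_ticker)
```

Note that in `evaluate_model_by_ticker` the Sharpe ratio, Sortino ratio, drawdown, and cumulative return are computed on the predicted price series itself, so they describe the shape of the forecast path and not the returns of a trading strategy.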

## 10. Results Comparison and Analysis (LLM vs LLM-PPO)

In [None]:
# Compile all model predictions
models_results = {}

# # Baseline models (COMMENTED OUT)
# models_results['SVR'] = evaluate_model_by_ticker(svr_predictions, y_test, test_labels)
# models_results['XGBoost'] = evaluate_model_by_ticker(xgb_predictions, y_test, test_labels)
# models_results['LSTM'] = evaluate_model_by_ticker(lstm_predictions, y_test, test_labels)

# For LLM and LLM-PPO, we need to evaluate from test_df
# Evaluate LLM predictions
if 'llm_prediction' in test_df.columns:
    llm_predictions = test_df['llm_prediction'].values
    actual_prices = test_df['actual_price'].values
    models_results['LLM'] = evaluate_model_by_ticker(llm_predictions, actual_prices, test_df)

# Evaluate LLM-PPO predictions
if 'ppo_adjusted_prediction' in test_df.columns:
    ppo_predictions = test_df['ppo_adjusted_prediction'].values
    actual_prices = test_df['actual_price'].values
    models_results['LLM-PPO'] = evaluate_model_by_ticker(ppo_predictions, actual_prices, test_df)

print("Model evaluation completed!")
print(f"\nNumber of models evaluated: {len(models_results)}")
print(f"Models: {list(models_results.keys())}")

In [None]:
# Create comparison table
def create_comparison_table(models_results):
    """Create a comprehensive comparison table"""
    comparison_data = []
    
    for model_name, ticker_results in models_results.items():
        for ticker, metrics in ticker_results.items():
            row = {
                'Model': model_name,
                'Ticker': ticker,
                **metrics
            }
            comparison_data.append(row)
    
    return pd.DataFrame(comparison_data)

comparison_df = create_comparison_table(models_results)
print("\nModel Comparison Results:")
comparison_df

In [None]:
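# Calculate average metrics across all tickers.
# Note: RMSE is in price units and the tickers trade at different price levels and
# currencies, so its mean tends to be dominated by high-priced tickers; MAPE is scale-free.
metric_cols = ['MAPE', 'RMSE', 'Sharpe Ratio', 'Sortino Ratio',
               'Max Drawdown', 'Cumulative Return']
avg_metrics = comparison_df.groupby('Model')[metric_cols].mean()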

print("\nAverage Performance Across All Tickers:")
avg_metrics.round(4)
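
Averages across tickers can hide per-stock differences, so a per-ticker relative change between the two models is a useful complement. The following is a minimal sketch on made-up numbers (not results from this notebook); it reuses the `Model`, `Ticker`, and `MAPE` column names from `comparison_df`, while `toy_comparison`, `mape_wide`, and the `Change (%)` column are hypothetical names introduced here.

```python
import pandas as pd

# Made-up numbers for illustration only; these are not results from the notebook.
toy_comparison = pd.DataFrame({
    "Model":  ["LLM", "LLM", "LLM-PPO", "LLM-PPO"],
    "Ticker": ["AAPL", "PEP", "AAPL", "PEP"],
    "MAPE":   [2.0, 4.0, 1.5, 4.4],
})

# One row per ticker, one column per model.
mape_wide = toy_comparison.pivot(index="Ticker", columns="Model", values="MAPE")

# Percentage change in MAPE when moving from LLM to LLM-PPO.
# Negative values mean the PPO adjustment reduced the error for that ticker.
mape_wide["Change (%)"] = (
    (mape_wide["LLM-PPO"] - mape_wide["LLM"]) / mape_wide["LLM"] * 100
)

print(mape_wide.round(2))
```

Because the change is expressed relative to each ticker's own LLM error, it can be compared across stocks that trade at very different price levels.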

## 11. Visualizations

In [None]:
# Plot MAPE comparison
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
comparison_df_pivot = comparison_df.pivot(index='Ticker', columns='Model', values='MAPE')
comparison_df_pivot.plot(kind='bar', ax=plt.gca())
plt.title('MAPE Comparison by Ticker', fontsize=14, fontweight='bold')
plt.xlabel('Ticker')
plt.ylabel('MAPE (%)')
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

plt.subplot(1, 2, 2)
avg_metrics['MAPE'].plot(kind='bar', color='steelblue')
plt.title('Average MAPE Across All Tickers', fontsize=14, fontweight='bold')
plt.xlabel('Model')
plt.ylabel('MAPE (%)')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Plot RMSE comparison
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
comparison_df_pivot = comparison_df.pivot(index='Ticker', columns='Model', values='RMSE')
comparison_df_pivot.plot(kind='bar', ax=plt.gca())
plt.title('RMSE Comparison by Ticker', fontsize=14, fontweight='bold')
plt.xlabel('Ticker')
plt.ylabel('RMSE')
plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

plt.subplot(1, 2, 2)
avg_metrics['RMSE'].plot(kind='bar', color='coral')
plt.title('Average RMSE Across All Tickers', fontsize=14, fontweight='bold')
plt.xlabel('Model')
plt.ylabel('RMSE')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Plot risk-adjusted metrics
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Sharpe Ratio
avg_metrics['Sharpe Ratio'].plot(kind='bar', ax=axes[0, 0], color='green', alpha=0.7)
axes[0, 0].set_title('Sharpe Ratio (Higher is Better)', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Model')
axes[0, 0].set_ylabel('Sharpe Ratio')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].grid(axis='y', alpha=0.3)

# Sortino Ratio
avg_metrics['Sortino Ratio'].plot(kind='bar', ax=axes[0, 1], color='blue', alpha=0.7)
axes[0, 1].set_title('Sortino Ratio (Higher is Better)', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Model')
axes[0, 1].set_ylabel('Sortino Ratio')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].grid(axis='y', alpha=0.3)

# Maximum Drawdown
avg_metrics['Max Drawdown'].plot(kind='bar', ax=axes[1, 0], color='red', alpha=0.7)
axes[1, 0].set_title('Maximum Drawdown (Closer to 0 is Better)', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Model')
axes[1, 0].set_ylabel('Max Drawdown')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(axis='y', alpha=0.3)

# Cumulative Return
avg_metrics['Cumulative Return'].plot(kind='bar', ax=axes[1, 1], color='purple', alpha=0.7)
axes[1, 1].set_title('Cumulative Return (Higher is Better)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Model')
axes[1, 1].set_ylabel('Cumulative Return')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].grid(axis='y', alpha=0.3)

plt.suptitle('Risk-Adjusted Performance Metrics', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# Sample prediction visualization (requires ticker, actual price, and LLM prediction columns)
if {'ticker', 'actual_price', 'llm_prediction'}.issubset(test_df.columns):
    # Select one ticker for detailed visualization
    sample_ticker = test_df['ticker'].iloc[0]
    ticker_data = test_df[test_df['ticker'] == sample_ticker].head(50)
    
    plt.figure(figsize=(15, 6))
    
    x = range(len(ticker_data))
    plt.plot(x, ticker_data['actual_price'].values, 'ko-', label='Actual Price', linewidth=2, markersize=6)
    plt.plot(x, ticker_data['llm_prediction'].values, 'bs--', label='LLM Prediction', linewidth=1.5, markersize=5, alpha=0.7)
    
    if 'ppo_adjusted_prediction' in ticker_data.columns:
        plt.plot(x, ticker_data['ppo_adjusted_prediction'].values, 'r^--', label='LLM-PPO Prediction', linewidth=1.5, markersize=5, alpha=0.7)
    
    plt.title(f'Stock Price Predictions for {sample_ticker} (First 50 Test Samples)', fontsize=14, fontweight='bold')
    plt.xlabel('Sample Index')
    plt.ylabel('Stock Price')
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 12. Key Findings and Summary

In [None]:
# Summary statistics
print("="*80)
print("SUMMARY OF RESULTS")
print("="*80)

print("\n1. PREDICTION ACCURACY (Lower is Better)")
print("-" * 80)
accuracy_summary = avg_metrics[['MAPE', 'RMSE']].round(4)
print(accuracy_summary)

print("\n2. RISK-ADJUSTED RETURNS (Higher is Better for Ratios)")
print("-" * 80)
risk_summary = avg_metrics[['Sharpe Ratio', 'Sortino Ratio']].round(4)
print(risk_summary)

print("\n3. DRAWDOWN (Closer to 0 is Better) AND CUMULATIVE RETURN (Higher is Better)")
print("-" * 80)
drawdown_summary = avg_metrics[['Max Drawdown', 'Cumulative Return']].round(4)
print(drawdown_summary)

print("\n" + "="*80)
print("CONCLUSION")
print("="*80)
print("""
The two-stage LLM-PPO framework:
1. Generates initial predictions with an LLM from historical prices, technical
   indicators, and news sentiment
2. Refines those predictions with PPO using risk-aware adjustments (VaR, CVaR)

Intended benefits (check each against the tables above):
- Combines market data with qualitative information (news sentiment)
- Balances prediction accuracy against financial risk
- Aims for better risk-adjusted returns through the CVaR-based reward function
- Stability relative to pure ML/DL approaches is a claimed benefit; the SVR,
  XGBoost, and LSTM baselines are commented out here, so this notebook does
  not test it

Whether combining an LLM with reinforcement learning improves on the LLM alone
is shown by the LLM vs LLM-PPO comparison above.
""")

## 13. Save Results

In [None]:
# Save comparison results
output_dir = '../results'
os.makedirs(output_dir, exist_ok=True)

# Save comparison table
comparison_df.to_csv(f'{output_dir}/model_comparison_results.csv', index=False)
print(f"Comparison results saved to {output_dir}/model_comparison_results.csv")

# Save average metrics
avg_metrics.to_csv(f'{output_dir}/average_metrics.csv')
print(f"Average metrics saved to {output_dir}/average_metrics.csv")

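# Save PPO model (skipped if no PPO model was trained in this session)
if 'model' in globals():
    model.save(f'{output_dir}/ppo_stock_prediction_model')
    print(f"PPO model saved to {output_dir}/ppo_stock_prediction_model.zip")
else:
    print("PPO model not found; skipping model save.")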

# Save test predictions
if 'ppo_adjusted_prediction' in test_df.columns:
    test_df.to_csv(f'{output_dir}/test_predictions.csv', index=False)
    print(f"Test predictions saved to {output_dir}/test_predictions.csv")

print("\nAll results saved successfully!")
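
To confirm that a saved metrics file can be read back unchanged, a round trip through CSV is a quick check. This is a minimal sketch using a temporary directory and made-up values; `toy_avg` is a hypothetical stand-in for `avg_metrics`. It relies on the fact that `avg_metrics` is written with its `Model` index, so the index column has to be named when reloading.

```python
import os
import tempfile

import pandas as pd

# Made-up values for illustration only; shaped like avg_metrics (indexed by Model).
toy_avg = pd.DataFrame(
    {"MAPE": [2.5, 2.0], "RMSE": [3.0, 2.8]},
    index=pd.Index(["LLM", "LLM-PPO"], name="Model"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "average_metrics.csv")
    toy_avg.to_csv(path)  # the index is written as the first column
    reloaded = pd.read_csv(path, index_col="Model")

# The round trip should preserve labels and values.
pd.testing.assert_frame_equal(toy_avg, reloaded)
print(reloaded)

# The PPO model itself would be reloaded with stable_baselines3's PPO.load(path);
# that step is not exercised in this sketch.
```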

## 14. Next Steps and Extensions

### Potential Improvements:
1. **Fine-tune LLM**: Fine-tune the Llama model on financial data for better domain-specific predictions
2. **Enhanced PPO**: Experiment with different reward functions and hyperparameters
3. **More Baselines**: Re-enable the commented-out SVR, XGBoost, and LSTM baselines and implement a TCN (Temporal Convolutional Network) for comparison
4. **Real-time Prediction**: Adapt the framework for real-time stock prediction
5. **Portfolio Optimization**: Extend to multi-stock portfolio management
6. **Risk Metrics**: Incorporate additional risk metrics (CVaR at different confidence levels)
7. **Ensemble Methods**: Combine multiple models for more robust predictions
8. **Market Regime Detection**: Adapt strategy based on market conditions (bull/bear markets)
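
As a starting point for item 6, VaR and CVaR can be estimated at several confidence levels by historical simulation. The sketch below is one common formulation, not necessarily the one used in this notebook's PPO environment; `historical_var_cvar` is a hypothetical helper and the returns are synthetic.

```python
import numpy as np

def historical_var_cvar(returns, confidence=0.95):
    """Historical-simulation VaR and CVaR, both reported as loss fractions."""
    losses = -np.asarray(returns, dtype=float)  # a loss is a negated return
    var = np.quantile(losses, confidence)       # loss exceeded in roughly (1 - confidence) of cases
    tail = losses[losses >= var]                # outcomes at or beyond the VaR threshold
    cvar = tail.mean()                          # average loss within that tail
    return float(var), float(cvar)

# Synthetic daily returns for illustration only.
rng = np.random.default_rng(42)
toy_returns = rng.normal(loc=0.0005, scale=0.02, size=1000)

for level in (0.90, 0.95, 0.99):
    var, cvar = historical_var_cvar(toy_returns, level)
    print(f"confidence={level:.2f}  VaR={var:.4f}  CVaR={cvar:.4f}")
```

By construction CVaR is at least as large as VaR at the same level, and both grow as the confidence level rises, which makes the higher levels more sensitive to a small number of extreme observations.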

### Research Directions:
- Study the interpretability of LLM predictions
- Analyze the impact of different sentiment sources
- Investigate transfer learning across different stocks
- Explore attention mechanisms in the PPO policy network