# Generate Justifications and Q&A with OpenAI Batch API

This notebook augments your stock prediction data with generated justifications and Q&A pairs using **OpenAI's Batch API**. Batch jobs run asynchronously, which avoids the rate-limiting issues of sending one request per sample and scales well to large datasets.

## Workflow Overview:

**RECOMMENDED: Use the Async Workflow (Sections 5A-5C)**
1. **Section 5A**: Submit all batches at once (non-blocking)
2. Wait while OpenAI processes the batches (hours, up to a day)
3. **Section 5B**: Check status periodically
4. **Section 5C**: Retrieve all completed results
5. **Section 6**: Save augmented data

**Alternative: Use Section 5 for Blocking Workflow** (waits for each batch sequentially - slower)

**Note**: Batch jobs are processed by OpenAI within a 24-hour window.

## 1. Setup and Imports

In [1]:
# Setup
import os
import json
import time
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Configuration
MODEL_NAME = "gpt-4o-mini"
BATCH_INPUT_DIR = "batch_inputs"
os.makedirs(BATCH_INPUT_DIR, exist_ok=True)

# File paths
TRAIN_FILE = "../finetune_paper/train.jsonl"
VAL_FILE = "../finetune_paper/val.jsonl"
TEST_FILE = "../finetune_paper/test.jsonl"

## 2. Load Original Data

In [2]:
def load_jsonl(file_path):
    """Load JSONL file"""
    data = []
    with open(file_path, 'r') as f:
        for line in f:
            data.append(json.loads(line))
    return data

# Load data
print("Loading datasets...")
train_data = load_jsonl(TRAIN_FILE)
val_data = load_jsonl(VAL_FILE)
test_data = load_jsonl(TEST_FILE)
print(f"Loaded {len(train_data)} train, {len(val_data)} val, {len(test_data)} test samples")

Loading datasets...
Loaded 8698 train, 1243 val, 2477 test samples
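
The helper functions in the next section index every sample with `sample["prompt"]` and `sample["response"]`, so a quick field check right after loading can surface malformed rows early. The sketch below is only an illustration; `missing_fields` is a hypothetical helper that is not part of the original notebook.

```python
def missing_fields(samples, required=("prompt", "response")):
    """Return the indices of samples that lack any of the required keys.

    Hypothetical helper: the key names mirror what prepare_batch_file reads.
    """
    return [
        idx
        for idx, sample in enumerate(samples)
        if any(key not in sample for key in required)
    ]


# Tiny in-memory example; in the notebook this would be train_data, val_data, test_data
example = [
    {"prompt": "some prompt", "response": "some response"},
    {"prompt": "prompt without a response"},
]
print(missing_fields(example))
```

An empty list means every sample has both fields and is safe to pass to the batch preparation step.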


## 3. Helper Functions

In [4]:
def prepare_batch_file(data, task_type, output_file, chunk_size=1000):
    """
    Prepare batch file for OpenAI Batch API (with chunking support)
    
    Args:
        data: List of samples
        task_type: "justification" or "qa_chain"
        output_file: Path to save batch input file
        chunk_size: Max samples per file (default 1000 to stay under token limit)
    
    Returns:
        List of created batch file paths
    """
    batch_requests = []
    
    for idx, sample in enumerate(data):
        # Get the prompt and response from the sample
        prompt = sample["prompt"]
        response = sample["response"]
        
        if task_type == "justification":
            system_prompt = "You are a financial analyst who provides clear, concise explanations for stock price predictions."
            user_prompt = f"{prompt}\n\nThe predicted answer is: {response}\n\nProvide a 2-3 sentence justification explaining WHY this prediction makes sense based on the indicators and sentiment provided in the context above."
            response_format = None
        else:  # qa_chain
            system_prompt = "You are a financial analyst who helps explain stock predictions through Q&A."
            user_prompt = f"{prompt}\n\nThe predicted answer is: {response}\n\nGenerate 3-4 question-answer pairs that help explain the reasoning behind this prediction based on the market data provided. Return as JSON with format: {{\"qa_pairs\": [{{\"question\": \"...\", \"answer\": \"...\"}}]}}"
            response_format = {"type": "json_object"}

        
        request = {
            "custom_id": f"{task_type}_{idx}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": MODEL_NAME,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                "max_tokens": 500
            }
        }
        
        if response_format:
            request["body"]["response_format"] = response_format
        
        batch_requests.append(request)
    
    # Split into chunks and write multiple files if needed
    created_files = []
    num_chunks = (len(batch_requests) + chunk_size - 1) // chunk_size
    
    for chunk_idx in range(num_chunks):
        start_idx = chunk_idx * chunk_size
        end_idx = min(start_idx + chunk_size, len(batch_requests))
        chunk = batch_requests[start_idx:end_idx]
        
        # Create chunk filename
        if num_chunks > 1:
            base_name = output_file.replace('.jsonl', f'_chunk{chunk_idx+1}.jsonl')
        else:
            base_name = output_file
        
        # Write chunk to file
        with open(base_name, 'w') as f:
            for request in chunk:
                f.write(json.dumps(request) + '\n')
        
        print(f"Created batch file: {base_name} with {len(chunk)} requests")
        created_files.append(base_name)
    
    return created_files


def upload_file_and_wait(file_path):
    """Upload a batch input file to OpenAI and return its file ID (returns as soon as the upload finishes)"""
    print(f"Uploading {file_path}...")
    with open(file_path, "rb") as f:
        batch_input_file = client.files.create(file=f, purpose="batch")
    print(f"Uploaded file ID: {batch_input_file.id}")
    return batch_input_file.id


def retrieve_existing_batches():
    """
    Retrieve all existing batch jobs and organize by description
    Returns dict mapping description to batch info
    """
    batches = client.batches.list(limit=100)
    
    existing = {}
    for batch in batches.data:
        # metadata is None for batches that were created without it
        desc = (batch.metadata or {}).get('description', '')
        if desc:
            existing[desc] = {
                'id': batch.id,
                'status': batch.status,
                'output_file_id': batch.output_file_id if batch.status == 'completed' else None,
                'request_counts': batch.request_counts
            }
    
    return existing


def process_results(output_file_id, original_data):
    """Download and process batch results"""
    print(f"\nDownloading results from {output_file_id}...")
    
    # Download the output file
    file_response = client.files.content(output_file_id)
    results = []
    
    for line in file_response.text.strip().split('\n'):
        results.append(json.loads(line))
    
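    # Sort by the numeric suffix of custom_id to match original order
    # Handles both "justification_123" and "qa_chain_123" formats
    results.sort(key=lambda x: int(x['custom_id'].split('_')[-1]))
    
    # Results are matched to samples by position, so the counts must agree
    if len(results) != len(original_data):
        raise ValueError(
            f"Expected {len(original_data)} results but got {len(results)}; "
            "some requests may have failed or expired"
        )
    
    # Extract the generated content
    augmented_data = []
    for i, result in enumerate(results):
        original_sample = original_data[i]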
        response_content = result['response']['body']['choices'][0]['message']['content']
        
        augmented_sample = original_sample.copy()
        augmented_sample['generated_content'] = response_content
        augmented_data.append(augmented_sample)
    
    print(f"Processed {len(augmented_data)} results")
    return augmented_data

def generate_sample_prompts(data, num_samples=2):
    """
    Print sample prompts and request bodies to show what gets sent to the model (MODEL_NAME)
    """
    
    for idx in range(min(num_samples, len(data))):
        sample = data[idx]
        prompt = sample["prompt"]
        response = sample["response"]
        
        print(f"\n{'='*80}")
        print(f"SAMPLE {idx + 1}")
        print(f"{'='*80}\n")
        
        # ==================== JUSTIFICATION TASK ====================
        print(" TASK: JUSTIFICATION GENERATION")
        print("-" * 80)
        
        system_prompt_just = "You are a financial analyst who provides clear, concise explanations for stock price predictions."
        user_prompt_just = f"{prompt}\n\nThe predicted answer is: {response}\n\nProvide a 2-3 sentence justification explaining WHY this prediction makes sense based on the indicators and sentiment provided in the context above."
        
        print("\n SYSTEM PROMPT:")
        print(system_prompt_just)
        
        print("\n USER PROMPT:")
        print(user_prompt_just)
        
        print("\n FULL API REQUEST BODY:")
        request_body_just = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": system_prompt_just},
                {"role": "user", "content": user_prompt_just}
            ],
            "max_tokens": 500
        }
        print(json.dumps(request_body_just, indent=2)[:1000] + "..." if len(json.dumps(request_body_just, indent=2)) > 1000 else json.dumps(request_body_just, indent=2))
        
        print("\n" + "="*80 + "\n")
        
        # ==================== Q&A TASK ====================
        print(" TASK: Q&A CHAIN GENERATION")
        print("-" * 80)
        
        system_prompt_qa = "You are a financial analyst who helps explain stock predictions through Q&A."
        user_prompt_qa = f"{prompt}\n\nThe predicted answer is: {response}\n\nGenerate 3-4 question-answer pairs that help explain the reasoning behind this prediction based on the market data provided. Return as JSON with format: {{\"qa_pairs\": [{{\"question\": \"...\", \"answer\": \"...\"}}]}}"
        
        print("\n SYSTEM PROMPT:")
        print(system_prompt_qa)
        
        print("\n USER PROMPT:")
        print(user_prompt_qa)
        
        print("\n FULL API REQUEST BODY:")
        request_body_qa = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": system_prompt_qa},
                {"role": "user", "content": user_prompt_qa}
            ],
            "max_tokens": 500,
            "response_format": {"type": "json_object"}
        }
        print(json.dumps(request_body_qa, indent=2)[:1000] + "..." if len(json.dumps(request_body_qa, indent=2)) > 1000 else json.dumps(request_body_qa, indent=2))
        
        print("\n" + "="*80 + "\n")




## Sample Prompts (visible in finetune_ablations/batch_inputs)

In [5]:
# Generate sample prompts from test data
print(" Generating Sample Prompts for GPT-4o")
print("="*80)
generate_sample_prompts(test_data, num_samples=1)

 Generating Sample Prompts for GPT-4o

SAMPLE 1

 TASK: JUSTIFICATION GENERATION
--------------------------------------------------------------------------------

 SYSTEM PROMPT:
You are a financial analyst who provides clear, concise explanations for stock price predictions.

 USER PROMPT:
You are a financial analyst with expertise in stock market forecasting.
Your task is to analyze market data and predict the next trading day stock price.
Use historical price trends, technical indicators, and sentiment analysis to provide an informed forecast.
Ensure that your predictions are well-justified, considering multiple financial factors.

• Predicted Stock Price: The forecasted close price for the next trading day.
• Price Movement Likelihood: The likelihood of the predicted stock price.
• Justification: Provide an explanation for the predicted stock price and the corresponding likelihood, considering the following:
  - Historical market data (e.g., recent closing prices).
  - Techni
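
Each line of a batch input file is one JSON request with `custom_id`, `method`, `url`, and `body`, as built in `prepare_batch_file`. Before uploading, it may be worth checking a file for malformed lines or repeated `custom_id` values. The following is a minimal sketch under that assumption; `validate_batch_lines` is a hypothetical helper, not something the notebook or the OpenAI SDK provides.

```python
import json

# Keys that prepare_batch_file writes for every request
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_batch_lines(lines):
    """Return a list of human-readable problems found in JSONL request lines."""
    problems = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        try:
            request = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(request, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - set(request)
        if missing:
            problems.append(f"line {line_no}: missing keys {sorted(missing)}")
            continue
        if request["custom_id"] in seen_ids:
            problems.append(f"line {line_no}: duplicate custom_id {request['custom_id']!r}")
        seen_ids.add(request["custom_id"])
    return problems
```

To check a generated file, pass its lines, for example `validate_batch_lines(open(path).read().splitlines())`, and upload only when the returned list is empty.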

## 4. Utility: Check Batch Status

In [4]:
# List all existing batches
existing = retrieve_existing_batches()
print(f"Found {len(existing)} existing batches:\n")
for desc, info in existing.items():
    print(f"{desc}: {info['status']} - {info['request_counts']}")

Found 31 existing batches:

test_qa_chain_2: completed - BatchRequestCounts(completed=477, failed=0, total=477)
test_qa_chain_1: completed - BatchRequestCounts(completed=1000, failed=0, total=1000)
test_qa_chain_0: completed - BatchRequestCounts(completed=1000, failed=0, total=1000)
test_justification_2: completed - BatchRequestCounts(completed=477, failed=0, total=477)
test_justification_1: completed - BatchRequestCounts(completed=1000, failed=0, total=1000)
test_justification_0: completed - BatchRequestCounts(completed=1000, failed=0, total=1000)
val_qa_chain_1: completed - BatchRequestCounts(completed=243, failed=0, total=243)
val_qa_chain_0: completed - BatchRequestCounts(completed=1000, failed=0, total=1000)
val_justification_1: completed - BatchRequestCounts(completed=243, failed=0, total=243)
val_justification_0: completed - BatchRequestCounts(completed=1000, failed=0, total=1000)
train_qa_chain_8: completed - BatchRequestCounts(completed=698, failed=0, total=698)
train_qa_chain
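
The request counts in this listing (1000, 477, 243, 698) follow from the ceiling division that `prepare_batch_file` uses to split each dataset into chunks of at most 1000 requests. The sketch below reproduces that arithmetic for the dataset sizes loaded earlier; `chunk_sizes` is a hypothetical helper introduced only for illustration.

```python
def chunk_sizes(num_samples, chunk_size=1000):
    """Return the number of requests in each chunk, using ceiling division."""
    num_chunks = (num_samples + chunk_size - 1) // chunk_size
    return [
        min(chunk_size, num_samples - chunk_idx * chunk_size)
        for chunk_idx in range(num_chunks)
    ]


# Dataset sizes as reported by the loading cell
for name, n in {"train": 8698, "val": 1243, "test": 2477}.items():
    sizes = chunk_sizes(n)
    print(f"{name}: {len(sizes)} chunks, last chunk has {sizes[-1]} requests")
```

This gives 9 chunks for train (last one 698), 2 for val (last one 243), and 3 for test (last one 477). Note that chunk files are numbered from 1 (`_chunk1.jsonl`), while batch descriptions are numbered from 0 (`train_justification_0`).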

## 5A. RECOMMENDED: Submit All Batches (Async - Non-blocking)

**Use this approach!** Submit all batches at once and check back later.

In [5]:
def submit_all_batches(chunk_size=1000):
    """
    Submit all batch jobs without waiting for completion
    Returns dict of submitted batch IDs organized by dataset/task/chunk
    """
    submitted_batches = {}
    existing_batches = retrieve_existing_batches()
    
    datasets = {
        'train': train_data,
        'val': val_data,
        'test': test_data
    }
    
    for dataset_name, data in datasets.items():
        print(f"\n{'='*60}")
        print(f"Submitting {dataset_name.upper()} batches")
        print(f"{'='*60}")
        
        for task_type in ['justification', 'qa_chain']:
            print(f"\n--- Task: {task_type} ---")
            
            # Prepare batch files
            batch_file_base = os.path.join(BATCH_INPUT_DIR, f"{dataset_name}_{task_type}_batch.jsonl")
            batch_files = prepare_batch_file(data, task_type, batch_file_base, chunk_size=chunk_size)
            
            # Submit each chunk
            for chunk_idx, batch_file in enumerate(batch_files):
                batch_desc = f"{dataset_name}_{task_type}_{chunk_idx}"
                
                # Check if already exists
                if batch_desc in existing_batches:
                    existing = existing_batches[batch_desc]
                    print(f"✓ Batch '{batch_desc}' already exists: {existing['status']}")
                    submitted_batches[batch_desc] = existing['id']
                    continue
                
                # Upload and submit new batch
                file_id = upload_file_and_wait(batch_file)
                
                print(f"Submitting batch job: {batch_desc}")
                batch = client.batches.create(
                    input_file_id=file_id,
                    endpoint="/v1/chat/completions",
                    completion_window="24h",
                    metadata={"description": batch_desc}
                )
                
                print(f"✓ Submitted batch ID: {batch.id} - Status: {batch.status}")
                submitted_batches[batch_desc] = batch.id
    
    print(f"\n{'='*60}")
    print(f"✓ Submitted {len(submitted_batches)} batch jobs!")
    print(f"{'='*60}")
    print("\nBatch IDs:")
    for desc, batch_id in submitted_batches.items():
        print(f"  {desc}: {batch_id}")
    
    return submitted_batches

# Submit all batches
submitted_batches = submit_all_batches(chunk_size=1000)


Submitting TRAIN batches

--- Task: justification ---
Created batch file: batch_inputs/train_justification_batch_chunk1.jsonl with 1000 requests
Created batch file: batch_inputs/train_justification_batch_chunk2.jsonl with 1000 requests
Created batch file: batch_inputs/train_justification_batch_chunk3.jsonl with 1000 requests
Created batch file: batch_inputs/train_justification_batch_chunk4.jsonl with 1000 requests
Created batch file: batch_inputs/train_justification_batch_chunk5.jsonl with 1000 requests
Created batch file: batch_inputs/train_justification_batch_chunk6.jsonl with 1000 requests
Created batch file: batch_inputs/train_justification_batch_chunk7.jsonl with 1000 requests
Created batch file: batch_inputs/train_justification_batch_chunk8.jsonl with 1000 requests
Created batch file: batch_inputs/train_justification_batch_chunk9.jsonl with 698 requests
✓ Batch 'train_justification_0' already exists: completed
✓ Batch 'train_justification_1' already exists: completed
✓ Bat
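
The notebook recovers batch IDs later by looking up each batch's `description` metadata through the API. As an optional complement, the returned `submitted_batches` dict could also be written to disk so the mapping survives a kernel restart. This is a minimal sketch; `save_batch_ids`, `load_batch_ids`, and the file name `submitted_batches.json` are hypothetical and not part of the original workflow.

```python
import json
from pathlib import Path


def save_batch_ids(submitted_batches, path):
    """Write the {description: batch_id} mapping to a JSON file."""
    Path(path).write_text(json.dumps(submitted_batches, indent=2))


def load_batch_ids(path):
    """Read the mapping back; return an empty dict if the file does not exist."""
    file_path = Path(path)
    return json.loads(file_path.read_text()) if file_path.exists() else {}
```

A typical use would be `save_batch_ids(submitted_batches, "submitted_batches.json")` right after submission and `load_batch_ids("submitted_batches.json")` in a later session.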

## 5B. Check Status of All Batches

Run this cell periodically to check progress.

In [6]:
def check_all_batches_status():
    """Check status of all batches"""
    existing = retrieve_existing_batches()
    
    status_summary = {
        'completed': [],
        'in_progress': [],
        'validating': [],
        'failed': [],
        'other': []
    }
    
    print("Batch Status Summary:")
    print("=" * 80)
    
    for desc, info in existing.items():
        status = info['status']
        if status == 'completed':
            status_summary['completed'].append(desc)
        elif status in ['in_progress', 'finalizing']:
            status_summary['in_progress'].append(desc)
        elif status == 'validating':
            status_summary['validating'].append(desc)
        elif status == 'failed':
            status_summary['failed'].append(desc)
        else:
            status_summary['other'].append(desc)
        
        print(f"\n{desc}")
        print(f"  Status: {status}")
        print(f"  Counts: {info['request_counts']}")
    
    print(f"\n{'='*80}")
    print("Summary:")
    print(f"  ✓ Completed: {len(status_summary['completed'])}")
    print(f"  ⏳ In Progress: {len(status_summary['in_progress'])}")
    print(f"  🔄 Validating: {len(status_summary['validating'])}")
    print(f"  ✗ Failed: {len(status_summary['failed'])}")
    print(f"  ? Other: {len(status_summary['other'])}")
    
    return status_summary

# Check status
status = check_all_batches_status()

Batch Status Summary:

test_qa_chain_2
  Status: completed
  Counts: BatchRequestCounts(completed=477, failed=0, total=477)

test_qa_chain_1
  Status: completed
  Counts: BatchRequestCounts(completed=1000, failed=0, total=1000)

test_qa_chain_0
  Status: completed
  Counts: BatchRequestCounts(completed=1000, failed=0, total=1000)

test_justification_2
  Status: completed
  Counts: BatchRequestCounts(completed=477, failed=0, total=477)

test_justification_1
  Status: completed
  Counts: BatchRequestCounts(completed=1000, failed=0, total=1000)

test_justification_0
  Status: completed
  Counts: BatchRequestCounts(completed=1000, failed=0, total=1000)

val_qa_chain_1
  Status: completed
  Counts: BatchRequestCounts(completed=243, failed=0, total=243)

val_qa_chain_0
  Status: completed
  Counts: BatchRequestCounts(completed=1000, failed=0, total=1000)

val_justification_1
  Status: completed
  Counts: BatchRequestCounts(completed=243, failed=0, total=243)

val_justification_0
  Status: co
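
Instead of re-running the status cell by hand, the check can be wrapped in a polling loop that stops once every batch reaches a terminal status. The sketch below is one possible shape, not the notebook's method: `wait_for_batches` is a hypothetical helper, and the status lookup is passed in as a callable so the loop itself makes no API calls.

```python
import time

# Batch statuses after which no further progress is expected
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batches(batch_ids, fetch_status, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll until every batch is in a terminal status; return {batch_id: status}.

    fetch_status is any callable mapping a batch ID to its current status string.
    max_polls optionally caps the number of polling rounds.
    """
    statuses = {}
    pending = set(batch_ids)
    polls = 0
    while pending:
        for batch_id in sorted(pending):
            statuses[batch_id] = fetch_status(batch_id)
        pending = {b for b in pending if statuses[b] not in TERMINAL_STATUSES}
        polls += 1
        if not pending or (max_polls is not None and polls >= max_polls):
            break
        sleep(poll_seconds)
    return statuses
```

With the notebook's client, the lookup could be `lambda batch_id: client.batches.retrieve(batch_id).status`, called as `wait_for_batches(submitted_batches.values(), ...)`. Batches that end as `expired` or `cancelled` would land in the `other` bucket of the summary above, so they are worth checking separately.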

## 5C. Retrieve All Completed Results

Run this once all batches are completed to download results.

In [8]:
def retrieve_all_results(chunk_size=1000):
    """
    Retrieve results from all completed batches
    Returns dict of augmented data organized by dataset and task
    """
    results = {}
    existing_batches = retrieve_existing_batches()
    
    datasets = {
        'train': train_data,
        'val': val_data,
        'test': test_data
    }
    
    for dataset_name, data in datasets.items():
        print(f"\n{'='*60}")
        print(f"Retrieving {dataset_name.upper()} results")
        print(f"{'='*60}")
        
        for task_type in ['justification', 'qa_chain']:
            print(f"\n--- Task: {task_type} ---")
            
            # Calculate number of chunks
            num_chunks = (len(data) + chunk_size - 1) // chunk_size
            
            all_augmented = []
            for chunk_idx in range(num_chunks):
                batch_desc = f"{dataset_name}_{task_type}_{chunk_idx}"
                
                if batch_desc not in existing_batches:
                    print(f"⚠ Warning: Batch '{batch_desc}' not found!")
                    continue
                
                batch_info = existing_batches[batch_desc]
                
                if batch_info['status'] != 'completed':
                    print(f"⚠ Batch '{batch_desc}' not completed yet (status: {batch_info['status']})")
                    continue
                
                if not batch_info['output_file_id']:
                    print(f"⚠ Batch '{batch_desc}' has no output file!")
                    continue
                
                # Get data slice for this chunk
                start_idx = chunk_idx * chunk_size
                end_idx = min(start_idx + chunk_size, len(data))
                data_slice = data[start_idx:end_idx]
                
                # Process results
                print(f"✓ Retrieving chunk {chunk_idx+1}/{num_chunks} from {batch_info['output_file_id']}")
                augmented_chunk = process_results(batch_info['output_file_id'], data_slice)
                all_augmented.extend(augmented_chunk)
            
            # Store results
            if all_augmented:
                results[f"{dataset_name}_{task_type}"] = all_augmented
                print(f"✓ Retrieved {len(all_augmented)} samples for {dataset_name}_{task_type}")
            else:
                print(f"✗ No results retrieved for {dataset_name}_{task_type}")
    
    print(f"\n{'='*60}")
    print(f"✓ Retrieval complete!")
    print(f"{'='*60}")
    
    return results

# Retrieve all completed results
results = retrieve_all_results(chunk_size=1000)


Retrieving TRAIN results

--- Task: justification ---
✓ Retrieving chunk 1/9 from file-Ciu5etNRMzs1cMNSnAjhfD

Downloading results from file-Ciu5etNRMzs1cMNSnAjhfD...
Processed 1000 results
✓ Retrieving chunk 2/9 from file-KHVjZWSgmdfrREWDvfSuxs

Downloading results from file-KHVjZWSgmdfrREWDvfSuxs...
Processed 1000 results
✓ Retrieving chunk 3/9 from file-1SuuHMQk2zusEBxHCw4Drq

Downloading results from file-1SuuHMQk2zusEBxHCw4Drq...
Processed 1000 results
✓ Retrieving chunk 4/9 from file-2wwTfNDBqJnjWPiEdvPXbq

Downloading results from file-2wwTfNDBqJnjWPiEdvPXbq...
Processed 1000 results
✓ Retrieving chunk 5/9 from file-1WDW2bfe33chEUoAAFNg8B

Downloading results from file-1WDW2bfe33chEUoAAFNg8B...
Processed 1000 results
✓ Retrieving chunk 6/9 from file-WqxGSTtJ99v1KvzPH5nxxB

Downloading results from file-WqxGSTtJ99v1KvzPH5nxxB...
Processed 1000 results
✓ Retrieving chunk 7/9 from file-AvtZBKssNNCqSZsJbz23Mq

Downloading results from file-AvtZBKssNNCqSZsJbz23Mq...
Pr
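
`process_results` pairs results with samples by list position after sorting, which only lines up when every request in a chunk produced a response. Because `prepare_batch_file` numbers `custom_id` across the whole dataset before chunking, results can instead be matched by that index plus the chunk's starting offset. The following is a sketch of that alternative; `align_results` is a hypothetical helper, and it assumes each output line carries a `response` object with `status_code` and `body` fields.

```python
def align_results(results, data_slice, start_idx):
    """Attach generated content to samples by custom_id index rather than position.

    Returns (augmented_samples, missing_indices), where missing_indices lists the
    dataset-wide indices that have no successful response.
    """
    # Index results by the numeric suffix of custom_id, e.g. "qa_chain_1002" -> 1002
    by_index = {int(r["custom_id"].rsplit("_", 1)[-1]): r for r in results}

    augmented, missing = [], []
    for offset, sample in enumerate(data_slice):
        result = by_index.get(start_idx + offset)
        response = (result or {}).get("response") or {}
        choices = (response.get("body") or {}).get("choices") or []
        if response.get("status_code") != 200 or not choices:
            missing.append(start_idx + offset)
            continue
        augmented.append({**sample, "generated_content": choices[0]["message"]["content"]})
    return augmented, missing
```

In `retrieve_all_results`, the offset would be `chunk_idx * chunk_size`, the same value used to build `data_slice`. Any indices reported as missing can then be resubmitted instead of silently shifting the remaining samples.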

## 6. Save Augmented Data

Save the final datasets in multiple formats for fine-tuning.

In [11]:
def save_augmented_data(results):
    """Save results in various formats for fine-tuning"""
    
    for dataset_name in ['train', 'val', 'test']:
        print(f"\nSaving {dataset_name} data...")
        
        just_key = f"{dataset_name}_justification"
        qa_key = f"{dataset_name}_qa_chain"
        
        if just_key not in results or qa_key not in results:
            print(f"Missing results for {dataset_name}, skipping...")
            continue
        
        # Get the data
        just_data = results[just_key]
        qa_data = results[qa_key]
        
        # Format 1: With justification - use LLM response directly
        finetuning_just = []
        for sample in just_data:
            ft_sample = {
                "prompt": sample["prompt"],
                "response": sample['generated_content']
            }
            finetuning_just.append(ft_sample)
        
        output_file = f"../finetune_paper/{dataset_name}_with_justifications.jsonl"
        with open(output_file, 'w') as f:
            for sample in finetuning_just:
                f.write(json.dumps(sample) + '\n')
        print(f"  ✓ Saved with justifications: {output_file}")
        
        # Format 2: With Q&A - create full response with predicted_close, likelihood, and Q&A in justification
        finetuning_qa = []
        for sample in qa_data:
            # Parse the original response to get predicted_close and likelihood
            try:
                original_response = json.loads(sample['response'])
                predicted_close = original_response.get('predicted_close', 0.0)
                likelihood = original_response.get('likelihood', 0.5)
            except (json.JSONDecodeError, KeyError):
                predicted_close = 0.0
                likelihood = 0.5
            
            # Parse the Q&A content and extract the array
            try:
                qa_content = json.loads(sample['generated_content'])
                if 'qa_pairs' in qa_content:
                    # Extract just the qa_pairs array
                    qa_array = qa_content['qa_pairs']
                else:
                    qa_array = []
            except (json.JSONDecodeError, KeyError):
                qa_array = []
            
            # Build complete response JSON with Q&A in justification field
            complete_response = {
                "predicted_close": predicted_close,
                "likelihood": likelihood,
                "justification": qa_array
            }
            
            ft_sample = {
                "prompt": sample["prompt"],
                "response": json.dumps(complete_response)
            }
            finetuning_qa.append(ft_sample)
        
        output_file = f"../finetune_paper/{dataset_name}_with_qa.jsonl"
        with open(output_file, 'w') as f:
            for sample in finetuning_qa:
                f.write(json.dumps(sample) + '\n')
        print(f"  ✓ Saved with Q&A: {output_file}")

# Save all the data
save_augmented_data(results)
print("\n✓ All data saved successfully!")


Saving train data...
  ✓ Saved with justifications: ../finetune_paper/train_with_justifications.jsonl
  ✓ Saved with Q&A: ../finetune_paper/train_with_qa.jsonl

Saving val data...
  ✓ Saved with justifications: ../finetune_paper/val_with_justifications.jsonl
  ✓ Saved with Q&A: ../finetune_paper/val_with_qa.jsonl

Saving test data...
  ✓ Saved with justifications: ../finetune_paper/test_with_justifications.jsonl
  ✓ Saved with Q&A: ../finetune_paper/test_with_qa.jsonl

✓ All data saved successfully!
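
The save step falls back to an empty Q&A list whenever `generated_content` is not valid JSON or lacks a `qa_pairs` key, and it does so without reporting it. A small check like the one below could show how many samples hit that fallback before the files are used for fine-tuning. This is a sketch only; `parse_qa_pairs` and `count_fallbacks` are hypothetical helpers that mirror the parsing logic in `save_augmented_data`.

```python
import json


def parse_qa_pairs(generated_content):
    """Return (qa_pairs, ok); ok is False when the content lacks the expected shape."""
    try:
        parsed = json.loads(generated_content)
    except (json.JSONDecodeError, TypeError):
        return [], False
    pairs = parsed.get("qa_pairs") if isinstance(parsed, dict) else None
    if not isinstance(pairs, list):
        return [], False
    return pairs, True


def count_fallbacks(samples):
    """Count samples whose generated_content would be replaced by an empty list."""
    return sum(
        1 for sample in samples
        if not parse_qa_pairs(sample.get("generated_content"))[1]
    )
```

Running `count_fallbacks(results["train_qa_chain"])` before saving gives a quick sense of how many training samples would end up with an empty `justification` field.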


## 7. Convert to Instruction/Input/Output Format

Convert the data into instruction/input/output format for fine-tuning.

In [12]:
def convert_to_instruction_input_output(input_file, output_file):
    """
    Convert prompt/response format to instruction/input/output format
    
    instruction: The system prompt (general task description)
    input: The specific data for this prediction (ticker, date, prices, indicators, etc.)
    output: The response
    """
    
    # Define the instruction (the general task without specific data)
    instruction = """You are a financial analyst with expertise in stock market forecasting.
Your task is to analyze market data and predict the next trading day stock price.
Use historical price trends, technical indicators, and sentiment analysis to provide an informed forecast.
Ensure that your predictions are well-justified, considering multiple financial factors.

• Predicted Stock Price: The forecasted close price for the next trading day.
• Price Movement Likelihood: The likelihood of the predicted stock price.
• Justification: Provide an explanation for the predicted stock price and the corresponding likelihood, considering the following:
  - Historical market data (e.g., recent closing prices).
  - Technical indicators (e.g., SMA, EMA, RSI, MACD, Bollinger Bands).
  - Sentiment analysis (e.g., news sentiment, market sentiment).

Please weigh these signals and justify the predicted stock price.

Return STRICT JSON with keys:
- predicted_close (float, next-day close price),
- likelihood (float in [0,1]),
- justification (string, 1–2 sentences)."""
    
    converted_data = []
    
    with open(input_file, 'r') as f:
        for line in f:
            sample = json.loads(line)
            prompt = sample['prompt']
            response = sample['response']
            
            # Extract the input part (everything after "Please weigh these signals...")
            # Find where the specific data starts (after the instruction part)
            if "TICKER:" in prompt:
                # Split at TICKER to separate instruction from input
                parts = prompt.split("TICKER:", 1)
                if len(parts) == 2:
                    # Extract just the data portion
                    input_data = "TICKER:" + parts[1].strip()
                    
                    # Remove the "Return STRICT JSON..." part from input if it exists
                    if "Return STRICT JSON" in input_data:
                        input_data = input_data.split("Return STRICT JSON")[0].strip()
                    
                    converted_sample = {
                        "instruction": instruction,
                        "input": input_data,
                        "output": response
                    }
                    converted_data.append(converted_sample)
    
    # Save converted data
    with open(output_file, 'w') as f:
        for sample in converted_data:
            f.write(json.dumps(sample) + '\n')
    
    print(f"✓ Converted {len(converted_data)} samples")
    print(f"✓ Saved to: {output_file}")
    return converted_data


# Convert all datasets
print("Converting datasets to instruction/input/output format...")
print("="*60)

for dataset_name in ['train', 'val', 'test']:
    print(f"\n{dataset_name.upper()} dataset:")
    
    # Convert base dataset
    convert_to_instruction_input_output(
        f"../finetune_paper/{dataset_name}.jsonl",
        f"../finetune_paper/{dataset_name}_instruction_format.jsonl"
    )
    
    # Convert with justifications
    convert_to_instruction_input_output(
        f"../finetune_paper/{dataset_name}_with_justifications.jsonl",
        f"../finetune_paper/{dataset_name}_with_justifications_instruction_format.jsonl"
    )
    
    # Convert with Q&A
    convert_to_instruction_input_output(
        f"../finetune_paper/{dataset_name}_with_qa.jsonl",
        f"../finetune_paper/{dataset_name}_with_qa_instruction_format.jsonl"
    )

print("\n" + "="*60)
print("✓ All conversions complete!")

Converting datasets to instruction/input/output format...

TRAIN dataset:
✓ Converted 8698 samples
✓ Saved to: ../finetune_paper/train_instruction_format.jsonl
✓ Converted 8698 samples
✓ Saved to: ../finetune_paper/train_with_justifications_instruction_format.jsonl
✓ Converted 8698 samples
✓ Saved to: ../finetune_paper/train_with_qa_instruction_format.jsonl

VAL dataset:
✓ Converted 1243 samples
✓ Saved to: ../finetune_paper/val_instruction_format.jsonl
✓ Converted 1243 samples
✓ Saved to: ../finetune_paper/val_with_justifications_instruction_format.jsonl
✓ Converted 1243 samples
✓ Saved to: ../finetune_paper/val_with_qa_instruction_format.jsonl

TEST dataset:
✓ Converted 2477 samples
✓ Saved to: ../finetune_paper/test_instruction_format.jsonl
✓ Converted 2477 samples
✓ Saved to: ../finetune_paper/test_with_justifications_instruction_format.jsonl
✓ Converted 2477 samples
✓ Saved to: ../finetune_paper/test_with_qa_instruction_format.jsonl

✓ All con
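
The conversion keeps only prompts that contain a `TICKER:` marker and trims everything from `Return STRICT JSON` onward. The splitting step can be isolated into a small function for testing, as sketched below. `split_prompt` is a hypothetical helper that mirrors the logic in `convert_to_instruction_input_output`, and the example prompt (ticker and date lines) is made up for illustration.

```python
def split_prompt(prompt, marker="TICKER:", trailer="Return STRICT JSON"):
    """Return the data portion of a prompt, or None when the marker is absent."""
    if marker not in prompt:
        return None
    # Keep everything from the marker onward, as the conversion cell does
    input_data = marker + prompt.split(marker, 1)[1].strip()
    # Drop the trailing output-format instructions if they are present
    return input_data.split(trailer)[0].strip()


# Made-up prompt used only to demonstrate the split
example_prompt = (
    "Instructions...\n\n"
    "TICKER: AAPL\nDATE: 2024-01-02\n\n"
    "Return STRICT JSON with keys: ..."
)
print(repr(split_prompt(example_prompt)))
```

Two behaviors are worth knowing. The `strip()` call removes whitespace directly after the marker, so `TICKER: AAPL` becomes `TICKER:AAPL` in the output. Prompts without the marker return `None` here and are skipped by the conversion cell, so comparing the converted count with the input count (8698, 1243, and 2477 in the run above) confirms that nothing was dropped.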