# Reasoning use cases: Comparing reasoning and GPT models

### Cell 2
This cell imports the necessary libraries, loads environment variables, and initializes the Azure OpenAI client. Set `GPT_MODEL` and `REASONING_MODEL` to your GPT and reasoning deployment names, and make sure `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` are defined in the `.env` file.

In [30]:
from pydantic import BaseModel, Field
from openai import AzureOpenAI
import os
from dotenv import load_dotenv
import json
import copy
import textwrap
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, Image, Markdown
# Load environment variables
load_dotenv("./.env")

client = AzureOpenAI(  
    api_version="2025-04-01-preview",  
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),  
    api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)  

GPT_MODEL = 'gpt-4o-mini'
REASONING_MODEL = 'o4-mini'
print(" GPT-4o Model Name for the first time...", GPT_MODEL)
print(" Reasoning Model Name for the first time...", REASONING_MODEL)

 GPT-4o Model Name for the first time... gpt-4o-mini
 Reasoning Model Name for the first time... o4-mini
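
An unset endpoint or key is easier to diagnose when it is reported by name before the client is constructed. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the two variables checked are the ones this cell reads.

```python
import os

# The two variables the client setup above reads from the environment.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` would flag a misnamed or absent `.env` entry before any request is made.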


### Cell 4
This cell defines the `get_chat_completion` function used to request model responses.

In [31]:
def get_chat_completion(model, prompt):
    """
    Calls the Azure OpenAI chat completions API and returns the reply text.

    :param model: The model to use for the completion.
    :param prompt: The prompt to send to the model.
    :return: The completion response from the model.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
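
Because the notebook compares two models, it may also be useful to record how long each call takes. The sketch below is an assumption and not part of the original notebook: `timed_call` is a hypothetical helper, and `fake_completion` is a stand-in for `get_chat_completion` so that the example runs without network access.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in for get_chat_completion; it makes no API request.
def fake_completion(model, prompt):
    return f"[{model}] {prompt.upper()}"

output, seconds = timed_call(fake_completion, "demo-model", "hello")
print(output, f"({seconds:.4f}s)")
```

In the notebook, `timed_call(get_chat_completion, GPT_MODEL, demo_prompt)` would time the real request in the same way.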

### Cell 6
This cell holds the multi-line `demo_prompt`. Provide your own prompt here, or use one of the prompts from the use cases in this repo.

In [32]:
demo_prompt = """
# Cross-Asset Volatility and Sentiment Forecasting: A Multi-Modal Analysis

## Datasets
Below are multiple micro datasets provided in markdown format. They include structured tables, time-series data, and unstructured text meant to help you assess market sentiment, volatility drivers, and macroeconomic conditions. Note that some datasets (marked with *DS*) are included as potential distractions.

**Dataset 1: Volatility Indices Data** (Visual: Line Graph)

| Date       | VIX (S&P 500) | MOVE (10Y Treasuries) | Implied Volatility - Brent (%) | Implied Volatility - Gold (%) |
|------------|---------------|-----------------------|--------------------------------|-------------------------------|
| 2023-09-01 | 18.2          | 9.5                   | 22.0                           | 12.5                          |
| 2023-09-08 | 19.0          | 10.0                  | 22.5                           | 12.8                          |
| 2023-09-15 | 20.1          | 10.2                  | 23.0                           | 13.0                          |
| 2023-09-22 | 21.0          | 10.5                  | 23.8                           | 13.5                          |
| 2023-09-29 | 20.5          | 10.3                  | 23.2                           | 13.2                          |
| 2023-10-06 | 22.0          | 10.7                  | 24.0                           | 13.8                          |
| 2023-10-13 | 22.5          | 11.0                  | 24.5                           | 14.0                          |

**Dataset 2: Market Sentiment Scores** (Visual: Bar Chart)

| Date       | Central Bank Sentiment (1-10) | Financial News Tone (1-10) | Reddit/Twitter Sentiment (1-10) |
|------------|-------------------------------|----------------------------|---------------------------------|
| 2023-09-01 | 7                             | 6                          | 5                               |
| 2023-09-08 | 6                             | 5                          | 4                               |
| 2023-09-15 | 5                             | 4                          | 3                               |
| 2023-09-22 | 4                             | 3                          | 2                               |
| 2023-09-29 | 5                             | 4                          | 3                               |
| 2023-10-06 | 6                             | 5                          | 4                               |
| 2023-10-13 | 7                             | 6                          | 5                               |

**Dataset 3: Macroeconomic Indicators** (Visual: Clustered Bar Chart)

| Date       | CPI YoY (%) | PMI | NFP (in thousands) | Unemployment Rate (%) |
|------------|-------------|-----|--------------------|-----------------------|
| 2023-09-01 | 2.1         | 55  | 150                | 4.0                   |
| 2023-09-15 | 2.2         | 54  | 145                | 4.1                   |
| 2023-09-29 | 2.3         | 53  | 140                | 4.2                   |
| 2023-10-13 | 2.4         | 52  | 135                | 4.3                   |

**Dataset 4: Option Term Structure Data** (Visual: Line Graph)
| Date       | S&P 500 30d IV (%) | 10Y T-Note 30d IV (%) | Brent 30d IV (%) | Gold 30d IV (%) |
|------------|--------------------|-----------------------|------------------|-----------------|
| 2023-09-01 | 18.0               | 10.0                  | 22.0             | 12.5            |
| 2023-10-01 | 19.5               | 10.5                  | 23.5             | 13.0            |

**Dataset 5: News Headlines and Central Bank Statements** (Unstructured Data)

- "Central Bank signals gradual tightening amid global uncertainty." (Date: 2023-10-05)
- "Financial news: Equity markets rally despite mixed earnings reports." (Date: 2023-10-06)
- "Breaking: New policy measures expected to cushion market headwinds." (Date: 2023-10-07)
- "Central Bank statement: 'We maintain cautious optimism despite rising inflation.'" (Date: 2023-10-08)

**Dataset 6: ETF Flows and Institutional Positioning *DS*** (Visual: Bar Chart)

| Date       | ETF Flows (USD Millions) | Institutional Net Long (%) |
|------------|--------------------------|----------------------------|
| 2023-09-01 | 500                      | 55                         |
| 2023-09-15 | 450                      | 53                         |
| 2023-09-29 | 520                      | 56                         |
| 2023-10-13 | 480                      | 54                         |

**Dataset 7: Social Media Sentiment Aggregates** (Visual: Scatter Plot)

| Date       | Twitter Sentiment Score | Reddit Sentiment Score |
|------------|-------------------------|------------------------|
| 2023-09-01 | 5.0                     | 4.8                    |
| 2023-09-15 | 4.2                     | 4.0                    |
| 2023-09-29 | 3.5                     | 3.4                    |
| 2023-10-13 | 4.0                     | 3.9                    |

**Dataset 8: Commodity Price Indexes** (Visual: Line Graph)

| Date       | Brent Crude Spot Price (USD/barrel) | Gold Spot Price (USD/oz) |
|------------|-------------------------------------|--------------------------|
| 2023-09-01 | 85                                  | 1800                     |
| 2023-09-15 | 88                                  | 1780                     |
| 2023-09-29 | 90                                  | 1750                     |
| 2023-10-13 | 92                                  | 1765                     |

**Dataset 9: Historical Asset Returns *DS*** (Visual: Line Charts)

| Date Range                 | Asset Class  | Average Daily Return (%) | Volatility (%) |
|----------------------------|--------------|--------------------------|----------------|
| 2023-07-01 to 2023-09-30   | Equities     | 0.05                     | 1.2            |
| 2023-07-01 to 2023-09-30   | Fixed Income | 0.02                     | 0.8            |

## Question
As a senior strategist at a global macro hedge fund, your objective is to deliver a high-conviction forecast for cross-asset volatility and market sentiment over the next 30 days. Based on the provided datasets, please:

1. Classify the current market sentiment regime (e.g., risk-on, risk-off, or transitional) by integrating the sentiment scores from central bank communications, financial news tone, and social media data.
2. Forecast the 30-day forward implied and realized volatility for the following asset classes:
   - Equities (S&P 500 via VIX dynamics)
   - Fixed Income (10Y Treasury via MOVE index)
   - Commodities (Brent Crude and Gold)
3. Identify and quantify the key drivers (e.g., macroeconomic indicators, option term structures, central bank statements) influencing the forecast and highlight any potential volatility clusters or contagion risks across asset classes.
4. Recommend tactical hedging strategies (e.g., VIX calls, long gamma positions, or correlation dispersion trades) based on your forecast.

## Instruction
Please include all intermediate calculations, working steps, and references to the appropriate datasets as evidence in your analysis. Begin your final answer with a brief executive summary that highlights the key findings and recommendations of your forecast. Be sure to illustrate your thought process and justify your conclusions using both quantitative data and qualitative insights drawn from the datasets.
"""

### Cell 8
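This cell loads the prompt from the selected use-case template file, falling back to the default `demo_prompt` if the file is not found, and then uses the GPT model to generate a response.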

In [33]:
# Read the content of the template file and assign it to demo_prompt
#template_path = '../usecases/credit-risk-assessment-and-management/prompt.md'
#template_path = '../usecases/customer-relationship-management/prompt.md'
#template_path = '../usecases/data-analysis/prompt.md'
#template_path = '../usecases/fraud-detection-and-prevention/prompt.md'
#template_path = '../usecases/insurance-claims-processing/prompt.md'
#template_path = '../usecases/insurance-plan/prompt.md'
#template_path = '../usecases/loan-agreement/prompt.md'
#template_path = '../usecases/market-sentiment-and-volatility-forecasting/prompt.md'
#template_path = '../usecases/portfolio-optimization/prompt.md'
template_path = '../usecases/risk-assessment-for-underwriting-health-auto-life-property/prompt.md'
#template_path = '../usecases/underwriting-analysis/prompt1.md'

try:
    with open(template_path, 'r', encoding='utf-8') as file:
        demo_prompt = file.read()
except FileNotFoundError:
    # Keep the demo_prompt defined in the previous cell as the fallback.
    print(f"Error: The file '{template_path}' was not found. Using default demo prompt.")

# Generate GPT output using the prompt
gpt_output = get_chat_completion(GPT_MODEL, demo_prompt)
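
The load-with-fallback logic in this cell could be factored into a small helper so that the default prompt is passed explicitly. This is a minimal sketch under that assumption: `load_prompt` is a hypothetical name, and the path in the example is deliberately nonexistent.

```python
from pathlib import Path

def load_prompt(path, default):
    """Return the file's text, or `default` if the file does not exist."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"'{path}' was not found. Using the default prompt.")
        return default

# A path that does not exist, so the fallback branch is taken.
prompt = load_prompt("no-such-dir/prompt.md", "fallback prompt")
print(prompt)
```

In the notebook this would be called as `demo_prompt = load_prompt(template_path, demo_prompt)`.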

### Cell 10
This cell prints the output from the GPT model.

In [34]:
print(gpt_output)

### Executive Summary

After a thorough analysis of John Miller's auto insurance application, several key findings reveal a moderate risk profile. His driving history displays multiple infractions, and he has documented accident claims. However, recent improvements in his driving behavior, positive telematics data, and a strong credit score contribute to a more favorable overall assessment. Based on the evaluation, it is recommended that John Miller be offered comprehensive auto insurance coverage with standard premium rates but under a condition of monitored telematics for ongoing risk evaluation.

---

### Detailed Analysis

#### 1. **Applicant Profile (Dataset 1)**
- **Age**: 28 years
- **Vehicle Year**: 2018 (Moderately new vehicle)
- **Annual Mileage**: 12,000 miles (Average; typical driving mileage reflects a lower risk)
- **Resident Area**: Urban (Higher risk exposure typically associated with urban driving)

#### 2. **Driving Record History (Dataset 2)**
- **Traffic Infractions

### Cell 12
This cell uses the reasoning model to generate a response using the same prompt.

In [35]:
reasoning_output = get_chat_completion(REASONING_MODEL, demo_prompt)

### Cell 14
This cell prints the output from the reasoning model.

In [36]:
print(reasoning_output)

Executive Summary  
• Decision: Approve John Miller for comprehensive auto coverage at standard premium rates with a usage-based monitoring discount program (no surcharge).  
• Key Risk Metrics:  
  – Accident frequency: 3 accidents over 5 years = 0.6 accidents/year  
  – Total claim amount: $7,000 over 3 claims; average $2,333/claim  
  – Violation load: 7 points over 4.4 years ≈1.6 points/year; $420 total fines  
  – Telematics: 3.83 harsh-braking events/month; 2.83 rapid-acceleration events/month  
  – Credit quality: Score 720, 25% debt-to-income, excellent payment history  
• Conditions: Continued telematics enrollment; if harsh events drop below 3/month for next 6 months → 5% premium discount  

Underwriting Decision Rationale  
1. Applicant Profile (Dataset 1)  
   – Age 28, licensed since 2015 → 8 years licensed, meets minimum experience.  
   – Vehicle: 2018 Toyota Camry with average usage 12,000 miles/year in urban area (moderate risk).  

2. Driving Record History (Dataset 2

### Cell 16
This cell evaluates and compares both model outputs using a specialized prompt.

In [37]:
result = get_chat_completion(
    REASONING_MODEL,
    f"""
You are a seasoned expert in evaluating and comparing outputs from large language models. You are analyzing responses to a complex analytical reasoning task that tests a model’s ability to:

- Interpret and synthesize information from multiple sources
- Perform multi-step quantitative reasoning
- Generate structured and actionable insights
- Present logic transparently and clearly
- Cross-reference data or contextual elements
- Follow domain-specific best practices

Given the question prompt and the two model outputs, evaluate and compare them on the following dimensions:

1. **Clarity:** How clear, well-organized, and easy to follow is each response?
2. **Accuracy & Correctness:** Are the facts, interpretations, and calculations correct? Is the reasoning sound?
3. **Completeness:** Does the answer fully address all parts of the prompt?
4. **Relevance & Adherence:** How well does each answer follow the instructions and respond to the task as described?
5. **Analytical Depth:** Does the answer show strong reasoning, meaningful insight, and appropriate use of supporting evidence?
6. **Multi-Dataset Synthesis:** How well does each model integrate insights from the datasets (e.g., transcripts, checklists, historical data)?
7. **Robustness to Ambiguity:** How well does each model handle faint, incomplete, or unclear legal text?
8. **Format & Usability:** Which response is more practically useful for a legal, compliance, or due diligence team?


For each model’s output:
- Provide specific strengths and weaknesses
- Reference examples from the text
- Highlight any errors, gaps, or standout reasoning

Finally, provide a **concise summary** stating which answer is better and why, backed by concrete observations.

---

**Question Prompt:**  
"{demo_prompt}"  

**Answer 1 (Reasoning Model):**  
{reasoning_output}  

**Answer 2 (GPT Model):**  
{gpt_output}  
"""
)
display(Markdown(result))

Evaluation of Answer 1 (Reasoning Model)

Strengths  
• Clarity & Organization  
  – Begins with a concise Executive Summary with decision, key metrics, and conditions.  
  – Uses numbered sections for each dataset, making it easy to follow.  
  – Clearly demarcates Relevant vs. Excluded datasets (8–10).  

• Accuracy & Correctness  
  – Correctly extracts points, fines, claim amounts, and telematics counts.  
  – Computes accident frequency (3/5 years ≈0.6/year) and violation rate (7 points/4.4 years ≈1.6 points/year).  
  – Uses industry‐standard thresholds (e.g., 2 points/year trigger) to benchmark risk.  

• Completeness  
  – Addresses every requested element: profile, driving record, accidents, telematics, maintenance, credit, transcript.  
  – Provides intermediate calculations (e.g., totals, averages).  
  – Lists underwriting decision, rationale, and conditions.  

• Relevance & Adherence  
  – References specific data points by date and value.  
  – Follows the instruction to “include all intermediate calculations and working steps.”  
  – Excludes irrelevant datasets with justification.  

• Analytical Depth  
  – Applies meaningful thresholds to categorize risk (e.g., “harsh events <4/month = moderate risk”).  
  – Leverages self-reported defensive driving course from transcript to corroborate telematics.  

• Multi-Dataset Synthesis  
  – Integrates quantitative (claims, infractions, telematics) and qualitative (call transcript) inputs.  
  – Cross-references maintenance history to reduce mechanical failure risk.  

• Robustness to Ambiguity  
  – Recognizes and discards non-material external data (sports, restaurants, weather).  
  – Clearly states why those datasets are excluded.  

• Format & Usability  
  – Structured for legal/underwriting teams: executive summary, data sections, calculations, decision.  
  – Actionable: lays out discount program triggers and monitoring conditions.  

Weaknesses  
• Minor inconsistency in time denominators (at times uses 5 vs. 5.3 years for accident frequency).  
• Could have more explicitly weighted credit metrics in the premium recommendation.  

Evaluation of Answer 2 (GPT Model)

Strengths  
• Clarity  
  – Uses a clear Executive Summary and sectional headings for each dataset.  
  – Summarizes risk profile and recommendation up front.  

• Relevance  
  – Recognizes the importance of telematics monitoring as a conditional requirement.  
  – Mentions positive credit and maintenance histories as risk mitigators.  

Weaknesses  
• Accuracy & Correctness  
  – Miscalculates telematics averages (states 4.2 harsh braking vs. actual ~3.8).  
  – Infraction and accident rates are computed using inconsistent periods (e.g., 6 years for infractions, 4 years for accidents).  

• Completeness  
  – Omits the customer call transcript entirely—misses a key qualitative input.  
  – Does not provide intermediate calculations or working steps as instructed.  
  – Fails to mention total fines or precise point thresholds.  

• Analytical Depth  
  – Relies on generic language (“concerning driving behavior”) without applying specific underwriting thresholds.  
  – Lacks any numerical benchmarks for surcharge triggers or detailed discount mechanics.  

• Multi-Dataset Synthesis  
  – References only Datasets 1–6; ignores the transcript (Dataset 7) and does not explain exclusion of Datasets 8–10.  

• Robustness to Ambiguity  
  – Does not explicitly address or discard irrelevant datasets, leaving potential confusion.  

• Format & Usability  
  – While readable, it lacks the detailed calculations and annexed conditions that compliance teams require.  
  – Provides no structured decision table or conditional clauses (e.g., exact premium adjustments).  

Concise Summary

Answer 1 is the superior response. It is more clearly organized, adheres strictly to the instruction to show intermediate calculations, synthesizes all relevant datasets (including the call transcript), and applies domain-specific thresholds to reach a transparent underwriting decision. Answer 2, by contrast, contains calculation errors, omits parts of the prompt (no transcript analysis, missing finer details on fines and points), and fails to deliver the actionable, data-driven rationale required for a compliance-focused underwriting report.
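
The evaluation prompt in Cell 16 labels each answer with the model that produced it and uses the reasoning model as the judge, which may influence the verdict. One possible mitigation, sketched here as an assumption and not as part of the original notebook, is to present the answers under neutral labels in random order and keep the mapping aside. `blind_pair` is a hypothetical helper name.

```python
import random

def blind_pair(reasoning_output, gpt_output, rng=None):
    """Shuffle two answers and return (answers, key).

    answers maps neutral labels ("Answer 1", "Answer 2") to the text;
    key maps the same labels to the source ("reasoning" or "gpt").
    """
    rng = rng or random.Random()
    labelled = [("reasoning", reasoning_output), ("gpt", gpt_output)]
    rng.shuffle(labelled)
    answers = {}
    key = {}
    for index, (source, text) in enumerate(labelled, start=1):
        label = f"Answer {index}"
        answers[label] = text
        key[label] = source
    return answers, key

answers, key = blind_pair("reasoning text", "gpt text")
for label, text in answers.items():
    print(label, "->", text)
```

The judge prompt would then reference only `Answer 1` and `Answer 2`, and `key` would be consulted after the evaluation to reveal which model wrote which answer.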