# Reasoning use cases: Model comparison, reasoning vs. GPT models

### Step 1
This cell imports the necessary libraries, loads environment variables, and initializes the Azure OpenAI client. Set `GPT_MODEL` and `REASONING_MODEL` to your GPT and reasoning deployment names. Make sure `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` are defined in your `.env` file.

In [22]:
from pydantic import BaseModel, Field
from openai import AzureOpenAI
import os
from dotenv import load_dotenv
import json
import copy
import textwrap
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, Image, Markdown
# Load environment variables
load_dotenv("./.env")

client = AzureOpenAI(  
    api_version="2025-04-01-preview",  
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),  
    api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)  

GPT_MODEL = 'gpt-4o-mini'
REASONING_MODEL = 'o4-mini'
print(" GPT-4o Model Name for the first time...", GPT_MODEL)
print(" Reasoning Model Name for the first time...", REASONING_MODEL)


 GPT-4o Model Name for the first time... gpt-4o-mini
 Reasoning Model Name for the first time... o4-mini
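
If either environment variable is missing, `os.getenv` returns `None` and the problem only shows up later when the client is created or called. A minimal sketch of an early check follows; `missing_env_vars` is a hypothetical helper, not part of the original notebook.

```python
import os

# Variable names taken from Step 1 of this notebook.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names in `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` points at a misconfigured `.env` file before any API call is attempted.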


### Step 2
This cell defines the `get_chat_completion` function used to request model responses.

In [23]:
def get_chat_completion(model, prompt):
    """
    Calls the Azure OpenAI chat completions API with a single user message.

    :param model: The deployment name of the model to use.
    :param prompt: The prompt to send to the model.
    :return: The text content of the first choice in the response.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
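
When comparing a GPT model with a reasoning model, it can also help to record how long each call takes. The sketch below is an assumption-laden illustration: `timed_call` is a hypothetical helper, and `fake_completion` stands in for `get_chat_completion` so the example runs without credentials.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in for get_chat_completion; it has the same (model, prompt) signature.
def fake_completion(model, prompt):
    return f"[{model}] echo: {prompt}"

output, seconds = timed_call(fake_completion, "demo-model", "hello")
print(f"{seconds:.4f}s -> {output}")
```

In the notebook, the same wrapper could be used as `timed_call(get_chat_completion, GPT_MODEL, demo_prompt)`.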

### Step 3
This cell holds the multi-line `demo_prompt`. Provide your own prompt here, or use one of the prompts from the `usecases` directory in this repo.

In [24]:
demo_prompt = """


# Cross-Asset Volatility and Sentiment Forecasting: A Multi-Modal Analysis

## Datasets
Below are multiple micro datasets provided in markdown format. They include structured tables, time-series data, and unstructured text meant to help you assess market sentiment, volatility drivers, and macroeconomic conditions. Note that some datasets (marked with *DS*) are included as potential distractions.

**Dataset 1: Volatility Indices Data** (Visual: Line Graph)

| Date       | VIX (S&P 500) | MOVE (10Y Treasuries) | Implied Volatility - Brent (%) | Implied Volatility - Gold (%) |
|------------|---------------|-----------------------|--------------------------------|-------------------------------|
| 2023-09-01 | 18.2          | 9.5                   | 22.0                           | 12.5                          |
| 2023-09-08 | 19.0          | 10.0                  | 22.5                           | 12.8                          |
| 2023-09-15 | 20.1          | 10.2                  | 23.0                           | 13.0                          |
| 2023-09-22 | 21.0          | 10.5                  | 23.8                           | 13.5                          |
| 2023-09-29 | 20.5          | 10.3                  | 23.2                           | 13.2                          |
| 2023-10-06 | 22.0          | 10.7                  | 24.0                           | 13.8                          |
| 2023-10-13 | 22.5          | 11.0                  | 24.5                           | 14.0                          |

**Dataset 2: Market Sentiment Scores** (Visual: Bar Chart)

| Date       | Central Bank Sentiment (1-10) | Financial News Tone (1-10) | Reddit/Twitter Sentiment (1-10) |
|------------|-------------------------------|----------------------------|---------------------------------|
| 2023-09-01 | 7                             | 6                          | 5                               |
| 2023-09-08 | 6                             | 5                          | 4                               |
| 2023-09-15 | 5                             | 4                          | 3                               |
| 2023-09-22 | 4                             | 3                          | 2                               |
| 2023-09-29 | 5                             | 4                          | 3                               |
| 2023-10-06 | 6                             | 5                          | 4                               |
| 2023-10-13 | 7                             | 6                          | 5                               |

**Dataset 3: Macroeconomic Indicators** (Visual: Clustered Bar Chart)

| Date       | CPI YoY (%) | PMI | NFP (in thousands) | Unemployment Rate (%) |
|------------|-------------|-----|--------------------|-----------------------|
| 2023-09-01 | 2.1         | 55  | 150                | 4.0                   |
| 2023-09-15 | 2.2         | 54  | 145                | 4.1                   |
| 2023-09-29 | 2.3         | 53  | 140                | 4.2                   |
| 2023-10-13 | 2.4         | 52  | 135                | 4.3                   |

**Dataset 4: Option Term Structure Data** (Visual: Line Graph)

| Date       | S&P 500 30d IV (%) | 10Y T-Note 30d IV (%) | Brent 30d IV (%) | Gold 30d IV (%) |
|------------|--------------------|-----------------------|------------------|-----------------|
| 2023-09-01 | 18.0               | 10.0                  | 22.0             | 12.5            |
| 2023-10-01 | 19.5               | 10.5                  | 23.5             | 13.0            |

**Dataset 5: News Headlines and Central Bank Statements** (Unstructured Data)

- "Central Bank signals gradual tightening amid global uncertainty." (Date: 2023-10-05)
- "Financial news: Equity markets rally despite mixed earnings reports." (Date: 2023-10-06)
- "Breaking: New policy measures expected to cushion market headwinds." (Date: 2023-10-07)
- "Central Bank statement: 'We maintain cautious optimism despite rising inflation.'" (Date: 2023-10-08)

**Dataset 6: ETF Flows and Institutional Positioning *DS*** (Visual: Bar Chart)

| Date       | ETF Flows (USD Millions) | Institutional Net Long (%) |
|------------|--------------------------|----------------------------|
| 2023-09-01 | 500                      | 55                         |
| 2023-09-15 | 450                      | 53                         |
| 2023-09-29 | 520                      | 56                         |
| 2023-10-13 | 480                      | 54                         |

**Dataset 7: Social Media Sentiment Aggregates** (Visual: Scatter Plot)

| Date       | Twitter Sentiment Score | Reddit Sentiment Score |
|------------|-------------------------|------------------------|
| 2023-09-01 | 5.0                     | 4.8                    |
| 2023-09-15 | 4.2                     | 4.0                    |
| 2023-09-29 | 3.5                     | 3.4                    |
| 2023-10-13 | 4.0                     | 3.9                    |

**Dataset 8: Commodity Price Indexes** (Visual: Line Graph)

| Date       | Brent Crude Spot Price (USD/barrel) | Gold Spot Price (USD/oz) |
|------------|-------------------------------------|--------------------------|
| 2023-09-01 | 85                                  | 1800                     |
| 2023-09-15 | 88                                  | 1780                     |
| 2023-09-29 | 90                                  | 1750                     |
| 2023-10-13 | 92                                  | 1765                     |

**Dataset 9: Historical Asset Returns *DS*** (Visual: Line Charts)

| Date Range                 | Asset Class  | Average Daily Return (%) | Volatility (%) |
|----------------------------|--------------|--------------------------|----------------|
| 2023-07-01 to 2023-09-30   | Equities     | 0.05                     | 1.2            |
| 2023-07-01 to 2023-09-30   | Fixed Income | 0.02                     | 0.8            |

## Question
As a senior strategist at a global macro hedge fund, your objective is to deliver a high-conviction forecast for cross-asset volatility and market sentiment over the next 30 days. Based on the provided datasets, please:

1. Classify the current market sentiment regime (e.g., risk-on, risk-off, or transitional) by integrating the sentiment scores from central bank communications, financial news tone, and social media data.
2. Forecast the 30-day forward implied and realized volatility for the following asset classes:
   - Equities (S&P 500 via VIX dynamics)
   - Fixed Income (10Y Treasury via MOVE index)
   - Commodities (Brent Crude and Gold)
3. Identify and quantify the key drivers (e.g., macroeconomic indicators, option term structures, central bank statements) influencing the forecast and highlight any potential volatility clusters or contagion risks across asset classes.
4. Recommend tactical hedging strategies (e.g., VIX calls, long gamma positions, or correlation dispersion trades) based on your forecast.

## Instruction
Please include all intermediate calculations, working steps, and references to the appropriate datasets as evidence in your analysis. Begin your final answer with a brief executive summary that highlights the key findings and recommendations of your forecast. Be sure to illustrate your thought process and justify your conclusions using both quantitative data and qualitative insights drawn from the datasets.


"""
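
The prompt asks the model to show intermediate calculations. As a reference point for one such calculation, the sketch below computes the percentage change of each Dataset 1 series between its first row (2023-09-01) and last row (2023-10-13). The values are copied from the table above; the `pct_change` helper is hypothetical.

```python
# First (2023-09-01) and last (2023-10-13) rows of Dataset 1.
first = {"VIX": 18.2, "MOVE": 9.5, "Brent IV": 22.0, "Gold IV": 12.5}
last = {"VIX": 22.5, "MOVE": 11.0, "Brent IV": 24.5, "Gold IV": 14.0}

def pct_change(start, end):
    """Percentage change from start to end."""
    return (end - start) / start * 100.0

changes = {name: pct_change(first[name], last[name]) for name in first}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Figures like these give a simple baseline against which a model's stated intermediate steps can be compared.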

### Step 4
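This cell optionally replaces `demo_prompt` with the contents of a use-case template file (falling back to the Step 3 prompt if the file is not found), then uses the GPT model to generate a response based on the prompt.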

In [25]:
# Read the content of the template file and assign it to demo_prompt
#template_path = '../usecases/credit-risk-assessment-and-management/prompt.md'
#template_path = '../usecases/customer-relationship-management/prompt.md'
#template_path = '../usecases/data-analysis/prompt.md'
#template_path = '../usecases/fraud-detection-and-prevention/prompt.md'
#template_path = '../usecases/insurance-claims-processing/prompt.md'
#template_path = '../usecases/insurance-plan/prompt.md'
#template_path = '../usecases/loan-agreement/prompt.md'
#template_path = '../usecases/market-sentiment-and-volatility-forecasting/prompt.md'
#template_path = '../usecases/portfolio-optimization/prompt.md'
template_path = '../usecases/risk-assessment-for-underwriting-health-auto-life-property/prompt.md'
#template_path = '../usecases/underwriting-analysis/prompt1.md'

try:
    with open(template_path, 'r', encoding='utf-8') as file:
        demo_prompt = file.read()
except FileNotFoundError:
    # Keep the demo_prompt defined in Step 3 as the fallback.
    print(f"Error: The file '{template_path}' was not found. Using default demo prompt.")

# Generate GPT output using the prompt
gpt_output = get_chat_completion(GPT_MODEL, demo_prompt)
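
The load-with-fallback logic in this cell can also be factored into a small function, which makes it easier to switch between use-case templates. This is a sketch only; `load_prompt` is a hypothetical helper and the path used in the example is deliberately one that does not exist.

```python
from pathlib import Path

def load_prompt(path, default):
    """Return the text of `path`, or `default` if the file does not exist."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"Prompt file '{path}' not found; using the default prompt.")
        return default

# The path below is intentionally missing, so the default is returned.
prompt = load_prompt("no/such/prompt.md", default="fallback prompt")
```

In the notebook this would read `demo_prompt = load_prompt(template_path, demo_prompt)`.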


### Step 5
This cell prints the output from the GPT model.

In [26]:
print(gpt_output)

### Executive Summary
After a thorough evaluation of John Miller's auto insurance application using multiple datasets, the following key findings and risk metrics were identified:

1. **Driving Record History**: John has a history of three traffic infractions, accumulating a total of 7 points. While he has taken steps to improve his driving behavior (including a defensive driving course), the points present a moderate risk.
   
2. **Accident & Claim History**: He has had three accidents since 2018, one of which resulted in a moderate claim. The frequency of these claims suggests a potential for higher risk.
   
3. **Telematics Driving Behavior**: His telematics records show a relatively high average monthly driving speed (averaging around 46 mph) and some concern with harsh braking events, peaking at 6 in June. Such behavior could indicate aggressive driving patterns, increasing the risk profile.
   
4. **Credit History**: John has an excellent credit score of 720 and a reasonable debt

### Step 6
This cell uses the reasoning model to generate a response using the same prompt.

In [27]:
reasoning_output = get_chat_completion(REASONING_MODEL, demo_prompt)

### Step 7
This cell prints the output from the reasoning model.

In [28]:
print(reasoning_output)

Executive Summary  
John Miller is a 28-year-old urban driver with a 2018 Toyota Camry and 12,000 annual miles. Over the past ~4.5 years he has:  
• Incurred 3 traffic infractions (total 7 points, $420 fines) for an average violation frequency of 0.67/year.  
• Filed 3 claims (accident frequency ≈0.67/year) totaling $7,000; one moderate (2020 side-impact).  
• Generated an average of 3.8 harsh braking events and 2.8 rapid accelerations per month over six months (≈3.8 and 2.8 events per 1,000 miles).  
• Maintained his vehicle regularly (brake pads, oil, tire rotation) and completed a defensive driving course post-2019.  
• Holds a strong credit profile (score 720, DTI 25%, excellent payment history).  

Underwriting Decision:  
Offer comprehensive coverage at standard base rates with:  
• A 5% surcharge to offset moderate accident/violation history.  
• Enrollment in usage-based monitoring with a tiered safe-driving discount (up to 10%) contingent on reducing harsh events below 3/month

### Step 8
This cell evaluates and compares both model outputs using a specialized prompt.

In [29]:
result = get_chat_completion(
    REASONING_MODEL,
    f"""
You are a seasoned expert in evaluating and comparing outputs from large language models. You are analyzing responses to a complex analytical reasoning task that tests a model’s ability to:

- Interpret and synthesize information from multiple sources
- Perform multi-step quantitative reasoning
- Generate structured and actionable insights
- Present logic transparently and clearly
- Cross-reference data or contextual elements
- Follow domain-specific best practices

Given the question prompt and the two model outputs, evaluate and compare them on the following dimensions:

1. **Clarity:** How clear, well-organized, and easy to follow is each response?
2. **Accuracy & Correctness:** Are the facts, interpretations, and calculations correct? Is the reasoning sound?
3. **Completeness:** Does the answer fully address all parts of the prompt?
4. **Relevance & Adherence:** How well does each answer follow the instructions and respond to the task as described?
5. **Analytical Depth:** Does the answer show strong reasoning, meaningful insight, and appropriate use of supporting evidence?
6. **Multi-Dataset Synthesis:** How well does each model integrate insights from the datasets (e.g., transcripts, checklists, historical data)?
7. **Robustness to Ambiguity:** How well does each model handle faint, incomplete, or unclear legal text?
8. **Format & Usability:** Which response is more practically useful for a legal, compliance, or due diligence team?


For each model’s output:
- Provide specific strengths and weaknesses
- Reference examples from the text
- Highlight any errors, gaps, or standout reasoning

Finally, provide a **concise summary** stating which answer is better and why, backed by concrete observations.

---

**Question Prompt:**  
"{demo_prompt}"  

**Answer 1 (Reasoning Model):**  
{reasoning_output}  

**Answer 2 (GPT Model):**  
{gpt_output}  
"""
)
display(Markdown(result))


Below is a side-by-side evaluation of Answer 1 (Reasoning Model) and Answer 2 (GPT Model) against the eight dimensions you provided. After the comparison, you’ll find a concise recommendation.

1. Clarity  
Answer 1 Strengths:  
• Begins with a labeled Executive Summary that highlights key metrics.  
• Uses clear subsections (Applicant Profile, Driving Record, etc.) and numbered steps.  
• Presents data in concise bullets and tabular‐style summaries.  

Answer 1 Weaknesses:  
• A couple of parenthetical “industry average” notes could distract a non-actuarial reader.  

Answer 2 Strengths:  
• Has an Executive Summary and clearly titled sections.  
• Uses plain language.  

Answer 2 Weaknesses:  
• Much more narrative and paragraph-driven—harder to scan.  
• Does not separate out datasets or calculations clearly.  

2. Accuracy & Correctness  
Answer 1 Strengths:  
• Infractions: 2+3+2=7 points, $100+$200+$120=$420.  
• Violation frequency 3 events/4.5 yrs≈0.67 /yr; points/yr≈7/4.5=1.56.  
• Accident frequency 3/4.5≈0.67/yr; average claim $7,000/3≈$2,333.  
• Telematics: (3+4+2+5+3+6)=23 harsh brakings →23/6≈3.8/mo.  

Answer 1 Weaknesses:  
• Assumes exactly 4.5 years since first infraction—license issued in 2015, infractions from 2017–2021 (actually ~4.3 years). This rounding is minor and typical in underwriting.  

Answer 2 Strengths:  
• Correctly totals 7 points from infractions.  
• Notes credit score and maintenance correctly.  

Answer 2 Weaknesses:  
• Accident summary cites only two claim amounts ($1,200 and $5,000) and omits the $800 animal collision.  
• Accident frequency given as 0.75/yr (3 over ~4 yrs) is a crude approximation; the actual window (Dec 2018–Mar 2022) is ~3.3 yrs (~0.9/yr).  
• Telematics average harsh braking “3.5” is understated (actual 23/6≈3.8).  

3. Completeness  
Answer 1 Strengths:  
• Addresses every requested dataset (1–7) in detail and explicitly rules out 8–10.  
• Provides intermediate calculations for rates, averages, surcharges, discounts.  
• References the call transcript to confirm the defensive-driving course.  

Answer 1 Weaknesses:  
• None significant.  

Answer 2 Strengths:  
• Covers core datasets: driving record, accidents, telematics, credit, maintenance.  

Answer 2 Weaknesses:  
• Omits any mention of the customer call transcript or how qualitative insights were used.  
• Does not quantify fines or point rates.  
• Fails to comment on irrelevant datasets (8–10) or justify their exclusion.  

4. Relevance & Adherence  
Answer 1 Strengths:  
• Faithfully follows the instruction to “include all intermediate calculations” and “reference specific data points.”  
• Presents a clear underwriting decision with surcharge and usage-based monitoring.  

Answer 1 Weaknesses:  
• Slight overuse of actuarial jargon without lay translation.  

Answer 2 Strengths:  
• Provides a decision on premium rates and conditions.  

Answer 2 Weaknesses:  
• Lacks detailed rationale on exactly how much and why the premium should change.  
• Does not show “all intermediate calculations,” as instructed.  

5. Analytical Depth  
Answer 1 Strengths:  
• Benchmarks John’s metrics against typical cohort thresholds (e.g., ≤3 harsh events/1,000 miles).  
• Calculates severity, frequency, and correlates credit profile to claim propensity.  
• Proposes tiered safe-driving discount tied to telematics targets.  

Answer 1 Weaknesses:  
• Could have briefly noted that weather patterns (DS 10) have minimal direct underwriting impact.  

Answer 2 Strengths:  
• Identifies moderate-risk flags (infractions, accident frequency).  

Answer 2 Weaknesses:  
• Surface-level—does not benchmark against industry norms or derive actionable rate adjustments.  
• Offers “review every 6–12 months” but no structured monitoring plan.  

6. Multi-Dataset Synthesis  
Answer 1 Strengths:  
• Seamlessly integrates profile, infractions, claims, telematics, maintenance, credit, and transcript.  
• Explicitly calls out non-insurance datasets as irrelevant.  

Answer 1 Weaknesses:  
• None significant.  

Answer 2 Strengths:  
• Mentions each major category but without cross-referencing (e.g., no tie from telematics improvement claim in transcript to data).  

Answer 2 Weaknesses:  
• Fails to integrate the transcript or vehicle maintenance into the formal risk score.  

7. Robustness to Ambiguity  
Answer 1 Strengths:  
• Clarifies assumptions (e.g., mileage per month = annual/12).  
• Acknowledges gaps (no infractions in last 18 months).  

Answer 1 Weaknesses:  
• Slight rounding in time periods.  

Answer 2 Strengths:  
• Takes a conservative stance recommending periodic reviews.  

Answer 2 Weaknesses:  
• Leaves unaddressed how to treat the unstructured call transcript.  

8. Format & Usability  
Answer 1 Strengths:  
• Easy-to-scan sections, numbered risk metrics, clear bullets.  
• Readily usable by underwriting/compliance teams to plug into rate models.  

Answer 1 Weaknesses:  
• Slightly dense for non-technical stakeholders, but that is typical in compliance docs.  

Answer 2 Strengths:  
• Plain text may be friendlier to non-actuaries.  

Answer 2 Weaknesses:  
• Lacks the precision and structure underwriters need for audit trails.  

──────────────────────────────────────────────────  
Concise Recommendation  
Answer 1 is clearly superior. It delivers a well-structured, fully referenced analysis; accurate intermediate calculations; explicit rationale tied to each dataset; and a concrete underwriting decision including surcharge, discount levers, and monitoring terms. Answer 2 is readable but falls short on completeness, quantitative rigor, dataset integration, and adherence to the instruction to show detailed working steps.  
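
Because the evaluation above is itself model-generated, the figures it quotes under "Accuracy & Correctness" are worth re-checking independently. The sketch below recomputes them from the numbers quoted in the evaluation text (not from the underlying prompt file, which is not shown here); the 4.5-year window is the evaluation's own assumption.

```python
# Inputs as quoted in the evaluation text above.
points = [2, 3, 2]
fines = [100, 200, 120]
harsh_braking = [3, 4, 2, 5, 3, 6]
years = 4.5          # window assumed by Answer 1
total_claims = 7000
n_claims = 3

checks = {
    "total_points": sum(points),
    "total_fines": sum(fines),
    "violations_per_year": round(len(points) / years, 2),
    "points_per_year": round(sum(points) / years, 2),
    "avg_claim": round(total_claims / n_claims),
    "harsh_braking_per_month": round(sum(harsh_braking) / len(harsh_braking), 1),
}
for name, value in checks.items():
    print(f"{name}: {value}")
```

The recomputed values can then be compared line by line with the ones cited in the evaluation.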
