# Reasoning use cases: Model comparison, o1 vs. gpt-4o

In [33]:
from pydantic import BaseModel, Field
from openai import AzureOpenAI
import os
from dotenv import load_dotenv
import json
import copy
import textwrap
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, Image, Markdown
# Load environment variables
load_dotenv("./.env")

client = AzureOpenAI(  
    api_version="2024-12-01-preview",  
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),  
    api_key=os.getenv("AZURE_OPENAI_API_KEY")  
)  

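# Azure deployment names (not model IDs); they must match the deployments
# configured in your own Azure OpenAI resource.
GPT_MODEL = 'gpt-40-globalStandard'
O1_MODEL = 'o1'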

In [34]:
def get_chat_completion(model, prompt):
    """
    Calls the Azure OpenAI chat completions API with a single user message.

    :param model: The Azure deployment name of the model to use.
    :param prompt: The prompt to send as the user message.
    :return: The text content of the first completion choice.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
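
When comparing a reasoning model with a standard chat model, it can help to record token usage alongside the text. The sketch below is a hypothetical variant of `get_chat_completion` (the name `get_chat_completion_with_usage` and the explicit `client` argument are assumptions, not part of the notebook). It reads the `usage` fields exposed by the `openai` Python SDK; whether `reasoning_tokens` is populated depends on the deployment and API version, so the code treats it as optional. A stub client is used so the example runs offline.

```python
from types import SimpleNamespace


def get_chat_completion_with_usage(client, model, prompt):
    """Hypothetical helper: return the reply text together with token counts."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = getattr(response, "usage", None)
    # reasoning_tokens is only reported for reasoning models, so read it defensively.
    details = getattr(usage, "completion_tokens_details", None)
    return {
        "content": response.choices[0].message.content,
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
        "reasoning_tokens": getattr(details, "reasoning_tokens", None),
    }


class _StubCompletions:
    """Offline stand-in for client.chat.completions (illustration only)."""

    def create(self, model, messages):
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content="stub reply"))],
            usage=SimpleNamespace(
                prompt_tokens=12,
                completion_tokens=30,
                completion_tokens_details=SimpleNamespace(reasoning_tokens=18),
            ),
        )


stub_client = SimpleNamespace(chat=SimpleNamespace(completions=_StubCompletions()))
result = get_chat_completion_with_usage(stub_client, "o1", "hello")
print(result)
```

In the notebook itself, the `client` created above would be passed in place of `stub_client`.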

In [35]:
demo_prompt = """


# Cross-Asset Volatility and Sentiment Forecasting: A Multi-Modal Analysis

## Datasets
Below are multiple micro datasets provided in markdown format. They include structured tables, time-series data, and unstructured text meant to help you assess market sentiment, volatility drivers, and macroeconomic conditions. Note that some datasets (marked with *DS*) are included as potential distractions.

**Dataset 1: Volatility Indices Data** (Visual: Line Graph)

| Date       | VIX (S&P 500) | MOVE (10Y Treasuries) | Implied Volatility - Brent (%) | Implied Volatility - Gold (%) |
|------------|---------------|-----------------------|--------------------------------|-------------------------------|
| 2023-09-01 | 18.2          | 9.5                   | 22.0                           | 12.5                          |
| 2023-09-08 | 19.0          | 10.0                  | 22.5                           | 12.8                          |
| 2023-09-15 | 20.1          | 10.2                  | 23.0                           | 13.0                          |
| 2023-09-22 | 21.0          | 10.5                  | 23.8                           | 13.5                          |
| 2023-09-29 | 20.5          | 10.3                  | 23.2                           | 13.2                          |
| 2023-10-06 | 22.0          | 10.7                  | 24.0                           | 13.8                          |
| 2023-10-13 | 22.5          | 11.0                  | 24.5                           | 14.0                          |

**Dataset 2: Market Sentiment Scores** (Visual: Bar Chart)

| Date       | Central Bank Sentiment (1-10) | Financial News Tone (1-10) | Reddit/Twitter Sentiment (1-10) |
|------------|-------------------------------|----------------------------|---------------------------------|
| 2023-09-01 | 7                             | 6                          | 5                               |
| 2023-09-08 | 6                             | 5                          | 4                               |
| 2023-09-15 | 5                             | 4                          | 3                               |
| 2023-09-22 | 4                             | 3                          | 2                               |
| 2023-09-29 | 5                             | 4                          | 3                               |
| 2023-10-06 | 6                             | 5                          | 4                               |
| 2023-10-13 | 7                             | 6                          | 5                               |

**Dataset 3: Macroeconomic Indicators** (Visual: Clustered Bar Chart)

| Date       | CPI YoY (%) | PMI | NFP (in thousands) | Unemployment Rate (%) |
|------------|-------------|-----|--------------------|-----------------------|
| 2023-09-01 | 2.1         | 55  | 150                | 4.0                   |
| 2023-09-15 | 2.2         | 54  | 145                | 4.1                   |
| 2023-09-29 | 2.3         | 53  | 140                | 4.2                   |
| 2023-10-13 | 2.4         | 52  | 135                | 4.3                   |

**Dataset 4: Option Term Structure Data** (Visual: Line Graph)

| Date       | S&P 500 30d IV (%) | 10Y T-Note 30d IV (%) | Brent 30d IV (%) | Gold 30d IV (%) |
|------------|--------------------|-----------------------|------------------|-----------------|
| 2023-09-01 | 18.0               | 10.0                  | 22.0             | 12.5            |
| 2023-10-01 | 19.5               | 10.5                  | 23.5             | 13.0            |

**Dataset 5: News Headlines and Central Bank Statements** (Unstructured Data)

- "Central Bank signals gradual tightening amid global uncertainty." (Date: 2023-10-05)
- "Financial news: Equity markets rally despite mixed earnings reports." (Date: 2023-10-06)
- "Breaking: New policy measures expected to cushion market headwinds." (Date: 2023-10-07)
- "Central Bank statement: 'We maintain cautious optimism despite rising inflation.'" (Date: 2023-10-08)

**Dataset 6: ETF Flows and Institutional Positioning *DS*** (Visual: Bar Chart)

| Date       | ETF Flows (USD Millions) | Institutional Net Long (%) |
|------------|--------------------------|----------------------------|
| 2023-09-01 | 500                      | 55                         |
| 2023-09-15 | 450                      | 53                         |
| 2023-09-29 | 520                      | 56                         |
| 2023-10-13 | 480                      | 54                         |

**Dataset 7: Social Media Sentiment Aggregates** (Visual: Scatter Plot)

| Date       | Twitter Sentiment Score | Reddit Sentiment Score |
|------------|-------------------------|------------------------|
| 2023-09-01 | 5.0                     | 4.8                    |
| 2023-09-15 | 4.2                     | 4.0                    |
| 2023-09-29 | 3.5                     | 3.4                    |
| 2023-10-13 | 4.0                     | 3.9                    |

**Dataset 8: Commodity Price Indexes** (Visual: Line Graph)

| Date       | Brent Crude Spot Price (USD/barrel) | Gold Spot Price (USD/oz) |
|------------|-------------------------------------|--------------------------|
| 2023-09-01 | 85                                  | 1800                     |
| 2023-09-15 | 88                                  | 1780                     |
| 2023-09-29 | 90                                  | 1750                     |
| 2023-10-13 | 92                                  | 1765                     |

**Dataset 9: Historical Asset Returns *DS*** (Visual: Line Charts)

| Date Range                 | Asset Class  | Average Daily Return (%) | Volatility (%) |
|----------------------------|--------------|--------------------------|----------------|
| 2023-07-01 to 2023-09-30   | Equities     | 0.05                     | 1.2            |
| 2023-07-01 to 2023-09-30   | Fixed Income | 0.02                     | 0.8            |

## Question
As a senior strategist at a global macro hedge fund, your objective is to deliver a high-conviction forecast for cross-asset volatility and market sentiment over the next 30 days. Based on the provided datasets, please:

1. Classify the current market sentiment regime (e.g., risk-on, risk-off, or transitional) by integrating the sentiment scores from central bank communications, financial news tone, and social media data.
2. Forecast the 30-day forward implied and realized volatility for the following asset classes:
   - Equities (S&P 500 via VIX dynamics)
   - Fixed Income (10Y Treasury via MOVE index)
   - Commodities (Brent Crude and Gold)
3. Identify and quantify the key drivers (e.g., macroeconomic indicators, option term structures, central bank statements) influencing the forecast and highlight any potential volatility clusters or contagion risks across asset classes.
4. Recommend tactical hedging strategies (e.g., VIX calls, long gamma positions, or correlation dispersion trades) based on your forecast.

## Instruction
Please include all intermediate calculations, working steps, and references to the appropriate datasets as evidence in your analysis. Begin your final answer with a brief executive summary that highlights the key findings and recommendations of your forecast. Be sure to illustrate your thought process and justify your conclusions using both quantitative data and qualitative insights drawn from the datasets.


"""

In [36]:
gpt_output = get_chat_completion(GPT_MODEL, demo_prompt)

In [37]:
print(gpt_output)

Below is the complete analysis and forecast, including our step‐by‐step reasoning and our references to the datasets. Read the executive summary first, then follow the detailed calculations and narrative.

─────────────────────────────
EXECUTIVE SUMMARY

Our analysis indicates that markets have shifted into a risk‐off, transitional regime marked by increasing uncertainty. Sentiment indicators—from central bank communications, financial news tone, and social media scores—suggest an overall cautious mood despite some recent positive sentiment rebounds that remain fragile. Implied volatilities across asset classes are trending upward. For the next 30 days, we forecast that forward implied and realized volatility will continue to elevate in equities (S&P 500), fixed income (10‐year Treasuries), and certain commodities (notably Brent crude and gold). Key drivers include decelerating sentiment, a steepening option term structure, and worsening macroeconomic indicators (rising inflation, soft

In [38]:
o1_output = get_chat_completion(O1_MODEL, demo_prompt)

In [39]:
print(o1_output)

Below is the final high‐conviction forecast and analysis based on a careful review and synthesis of the multi‐modal datasets.

──────────────────────────────
Executive Summary

Our integrated analysis suggests that the market is currently in a transitional, risk‐off regime. Recent central bank communications, cautious news headlines, and declining sentiment scores (Datasets 2 and 5) point to mounting concerns about inflation and slower economic momentum (Dataset 3). Cross‐asset volatility measures—evident in the rising VIX (from 18.2 to 22.5, Dataset 1) and MOVE indices plus the upward drift in option term structures (Dataset 4)—indicate that both implied and realized volatility are likely to rise over the next 30 days. In our forecast we see the following:
• Equities (S&P 500) are expected to see forward 30‐day implied volatility climb into the mid‐20 percent range with realized daily volatility increasing by roughly 10–15% over current levels.
• Fixed income (10Y Treasury) volatility

In [40]:
result = get_chat_completion(
    O1_MODEL,
    f"""
You are a seasoned expert in evaluating and comparing outputs from large language models. You are analyzing responses to a complex analytical reasoning task that tests a model’s ability to:

- Interpret and synthesize information from multiple sources
- Perform multi-step quantitative reasoning
- Generate structured and actionable insights
- Present logic transparently and clearly
- Cross-reference data or contextual elements
- Follow domain-specific best practices

Given the question prompt and the two model outputs, evaluate and compare them on the following dimensions:

1. **Clarity:** How clear, well-organized, and easy to follow is each response?
2. **Accuracy & Correctness:** Are the facts, interpretations, and calculations correct? Is the reasoning sound?
3. **Completeness:** Does the answer fully address all parts of the prompt?
4. **Relevance & Adherence:** How well does each answer follow the instructions and respond to the task as described?
5. **Analytical Depth:** Does the answer show strong reasoning, meaningful insight, and appropriate use of supporting evidence?
6. **Multi-Dataset Synthesis:** How well does each model integrate insights from the datasets (e.g., transcripts, checklists, historical data)?
7. **Robustness to Ambiguity:** How well does each model handle faint, incomplete, or unclear legal text?
8. **Format & Usability:** Which response is more practically useful for a legal, compliance, or due diligence team?


For each model’s output:
- Provide specific strengths and weaknesses
- Reference examples from the text
- Highlight any errors, gaps, or standout reasoning

Finally, provide a **concise summary** stating which answer is better and why, backed by concrete observations.

---

**Question Prompt:**  
"{demo_prompt}"  

**Answer 1 (O1):**  
{o1_output}  

**Answer 2 (GPT-4o):**  
{gpt_output}  
"""
)
display(Markdown(result))


Below is an evaluation and comparison of the two outputs across the requested dimensions, along with specific examples drawn from each answer. At the end, a concise summary states which answer is preferable and why.

──────────────────────────────
1. Clarity

• Answer 1 is highly structured, with clear section headings (Executive Summary, numbered sections, intermediate calculations, etc.). For example, it neatly breaks down “Market Sentiment Regime,” “30-Day Forward Volatility Forecast,” “Key Drivers,” and “Tactical Hedging Strategies” – making it easy to follow the line of reasoning.

• Answer 2 is also well organized with an executive summary and detailed numbered steps. However, its narrative style is a bit denser in places. While it uses bullet lists, the presentation is slightly less compartmentalized than Answer 1.

Strengths:
 – Answer 1’s visual separation (using horizontal rules and bullet points) makes the response more accessible.
Weaknesses:
 – Answer 2, though complete, occasionally mixes introductory statements with details, which may require an extra pass to extract the key steps.

──────────────────────────────
2. Accuracy & Correctness

• Both answers correctly identify the prevailing market sentiment as “risk‑off” or “transitional” and reference central bank, news, and sentiment datasets appropriately.

• Answer 1 computes specific numbers (e.g., forecasting the VIX to rise from 22.5 to roughly 25.4 by linear extrapolation) and provides intermediate calculations (e.g., weekly increase of 0.72 points) that are internally consistent.

• Answer 2 also produces credible forecasts (e.g., forecasting equities IV in the 22%–23% range and fixed-income IV around 11%–11.5%). Although the precise numerical targets differ slightly between answers, both remain within reasonable bounds given the datasets.

Strengths:
 – Answer 1’s detailed step‐by‐step calculations provide confidence in the numbers.
 – Answer 2 cross-checks data trends and incorporates mean reversion concepts.
Weaknesses:
 – The small variance in forecasts (for equity implied volatility, for instance) means either approach might be subject to model assumptions. However, Answer 1 is slightly more explicit about its numerical assumptions.
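
The linear extrapolation cited in this section (a weekly increase of about 0.72 points, taking the VIX from 22.5 to roughly 25.4) can be reproduced from the VIX column of Dataset 1. The sketch below shows one way those figures could arise, assuming a 30-day horizon is treated as four weekly steps; the truncated model output does not show the model's actual working, so this is a plausibility check rather than a reconstruction.

```python
# VIX readings from Dataset 1 (weekly, 2023-09-01 through 2023-10-13).
vix = [18.2, 19.0, 20.1, 21.0, 20.5, 22.0, 22.5]

weeks_elapsed = len(vix) - 1                        # 6 weekly intervals
weekly_change = (vix[-1] - vix[0]) / weeks_elapsed  # average points per week
horizon_weeks = 4                                   # assumption: 30 days ~ 4 weeks
projected = vix[-1] + weekly_change * horizon_weeks

print(f"{weekly_change:.2f} points/week -> projected VIX {projected:.1f}")
```

A straight-line projection ignores the dip on 2023-09-29 and any mean reversion, which is one reason the two answers can reasonably land on different ranges.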

──────────────────────────────
3. Completeness

• Both answers fully address all parts of the prompt – from classifying the sentiment regime and forecasting volatilities for equities, fixed income, and commodities, to identifying key drivers and tactical hedging recommendations.

• Answer 1 includes very detailed intermediate steps and draws on multiple datasets (e.g., referencing datasets 1, 3, and 4 for numerical trends), ensuring that no part of the instruction is overlooked.

Strengths:
 – Both responses cover all required areas.
 – Answer 1’s in-depth computation and reference to every dataset lends additional assurance.
Weaknesses:
 – Answer 2, while comprehensive, is slightly less explicit in showing every calculation step.

──────────────────────────────
4. Relevance & Adherence

• Both answers adhere closely to the task. They directly reference the datasets (e.g., sentiment scores, volatility indices, macroeconomic indicators) and clearly tie them to the forecasting and hedging strategy.

• Answer 1 uses explicit quotes (e.g., “rising VIX from 18.2 to 22.5” or “CPI rising from 2.1% to 2.4%”) that directly map to the dataset values, demonstrating tight adherence to the instructions.

Strengths:
 – Both responses are highly relevant.
 – Answer 1 stands out with its specific, traceable data references.
Weaknesses:
 – Answer 2 is equally appropriate but sometimes gives slightly broader ranges without detailing each computation round.

──────────────────────────────
5. Analytical Depth

• Both responses offer strong reasoning and meaningful insights. They interpret macroeconomic signals, option term structures, and sentiment data to support their forecasts and hedging recommendations.

• Answer 1 provides a detailed breakdown of calculations (e.g., computing weekly changes in the VIX) and ties these to tactical recommendations, such as using VIX calls or long gamma positions.

• Answer 2 discusses mean‑reverting properties and contagion risks in a similarly robust manner but relies a bit more on qualitative description.

Strengths:
 – Answer 1’s step-by-step logic and clear linking of numbers to market behavior display deep analytical thought.
 – Answer 2 is also thorough in discussing cross‑asset contagion and macro drivers.
Weaknesses:
 – Answer 2, while comprehensive, sometimes presents conclusions without exposing all intermediate steps as transparently as Answer 1.

──────────────────────────────
6. Multi-Dataset Synthesis

• Both outputs integrate insights from multiple datasets (volatility indices, sentiment measures, macroeconomic indicators, option term structures, news headlines).

• Answer 1 explicitly cites dataset values in its narrative (e.g., “Dataset 2 shows...,” “Dataset 4 reveals...”) and ties these robustly to both quantitative and qualitative analyses.

• Answer 2 covers the integration well and refers to the datasets systematically, though its internal cross-checks are less detailed in computation.

Strengths:
 – Answer 1 shines in explicitly triangulating values from different datasets to derive forecast numbers.
Weaknesses:
 – Answer 2 might benefit from more concrete numerical references in some parts of its synthesis.

──────────────────────────────
7. Robustness to Ambiguity

• Both answers display competence in handling potential ambiguities in the unstructured text (central bank statements and news headlines). They manage to extract consistent signals of caution and potential volatility.

• Answer 1 refines ambiguous aspects by directly linking sentiment drops and subsequent partial recoveries to its classification of a “transitional, risk-off environment.”

• Answer 2 similarly notes the rebound in the sentiment scores yet emphasizes that the early “risk‑off” signals dominate the longer trend.

Strengths:
 – Both responses effectively clarify ambiguous data points.
 – Answer 1’s methodical breakdown minimizes any ambiguity in its interpretation.
Weaknesses:
 – There is little to fault with either approach given the available information.

──────────────────────────────
8. Format & Usability

• For a legal, compliance, or due diligence team, the structured, bullet-point format and clearly delineated sections in Answer 1 provide a strong, readily digestible format.

• Answer 2 is also usable but its narrative style—with longer paragraphs interspersed with bullet points—may require additional review to quickly locate key data points or calculations.

Strengths:
 – Answer 1’s format is particularly practical for teams needing clear evidence trails and calculation details.
Weaknesses:
 – Answer 2, while thorough, might be less immediately scannable for a team seeking quick insights.

──────────────────────────────
Concise Summary

Overall, Answer 1 is the stronger response. It is more clearly structured, presents detailed intermediate calculations that transparently link dataset values to its forecasts, and does an excellent job of integrating multiple data sources into actionable recommendations. While Answer 2 is also thorough and accurate, Answer 1’s explicit breakdown of reasoning and its format make it particularly practical and robust for legal, compliance, or due diligence teams.
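
In the evaluation prompt above, the answers are labeled with their model names and o1 is judging its own output, which may bias the comparison. A minimal sketch of one mitigation follows: shuffle the answers and give them anonymous labels, keeping a private key to map labels back to models afterward. The function name `build_blind_comparison` and its return shape are assumptions for illustration.

```python
import random


def build_blind_comparison(outputs, seed=None):
    """Shuffle model outputs and label them anonymously.

    outputs: dict mapping model name -> response text.
    Returns (fragment, key): the text to embed in the judge prompt, and a
    dict mapping each anonymous label ('Answer 1', ...) to its model name.
    """
    rng = random.Random(seed)
    names = list(outputs)
    rng.shuffle(names)  # randomize order to reduce position bias
    key = {f"Answer {i}": name for i, name in enumerate(names, start=1)}
    fragment = "\n\n".join(
        f"**{label}:**\n{outputs[name]}" for label, name in key.items()
    )
    return fragment, key


fragment, key = build_blind_comparison(
    {"o1": "first response text", "gpt-4o": "second response text"}, seed=0
)
print(fragment)
print(key)  # keep this mapping out of the judge prompt
```

The `fragment` would replace the two named answer sections in the judge prompt, and `key` is consulted only after the verdict is returned.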