# Experimental Plan: ReAct with Chain of Thought (CoT) for Validation Reports
This notebook implements the ReAct framework combined with Chain of Thought (CoT) prompting to generate validation reports.

## **Steps:**
1. Define validation assessment instructions.
2. Implement ReAct with step-by-step reasoning before generating conclusions.
3. Evaluate generated validation reports.
4. Compare outputs with different prompting techniques.


### **Experimental Plan to Test ReAct with Chain of Thought (CoT) Prompting for Generating Validation Reports**

#### **Objective:**
This experiment aims to evaluate the effectiveness of combining the **ReAct** (Reasoning + Acting) prompting technique with **Chain of Thought (CoT)** prompting to generate structured validation reports based on a given set of validation assessment instructions.

---

### **Experimental Steps:**

1. **Define the Validation Assessment Instructions:**
   - Select a set of example validation guidelines (e.g., assessing the consistency of model assumptions, evaluating data integrity, verifying model performance).

2. **Implement ReAct with Chain of Thought (CoT):**
   - Use the **Python ReAct snippet** described in [Simon Willison’s post](https://til.simonwillison.net/llms/python-react-pattern).
   - Modify the ReAct approach to include **step-by-step reasoning** before taking an action.

3. **Prompt Structure:**
   - Construct prompts that encourage the model to first **explain** its thought process before retrieving additional data or generating conclusions.
   - Example prompt:
     ```
     You are an expert model validator. Your task is to analyze the following validation guideline step by step.
     Step 1: Break down the key aspects of the requirement.
     Step 2: Retrieve any necessary supporting evidence.
     Step 3: Assess the evidence and form a logical conclusion.
     Step 4: Generate a structured validation report.
     ```

4. **Run the Experiment with Different Models:**
   - Test the approach using **Llama 3.2 8B** and **GPT-4** to compare results.

5. **Evaluate Output Quality:**
   - Use automated metrics such as **relevancy, hallucination, groundedness, and comprehensiveness** to assess the quality of the generated reports.
   - Compare outputs with human expert reviews.

6. **Compare with Standard ReAct:**
   - Run the same validation process with **ReAct alone (without CoT)** and compare the quality of reports.

7. **Refinement and Iteration:**
   - Adjust the prompt structure to optimize the clarity, accuracy, and completeness of the generated validation reports.
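
The prompt structure from step 3 and the ReAct-only baseline from step 6 can be sketched as two small prompt builders. This is a minimal sketch, not the notebook's implementation: the names `COT_STEPS`, `build_cot_prompt`, and `build_react_only_prompt`, as well as the wording of the baseline prompt, are assumptions introduced for illustration.

```python
# The four reasoning steps are taken from the example prompt in step 3.
COT_STEPS = [
    "Break down the key aspects of the requirement.",
    "Retrieve any necessary supporting evidence.",
    "Assess the evidence and form a logical conclusion.",
    "Generate a structured validation report.",
]


def build_cot_prompt(guideline: str) -> str:
    """ReAct + CoT variant: list the reasoning steps, then the guideline."""
    steps = "\n".join(
        f"Step {i}: {text}" for i, text in enumerate(COT_STEPS, start=1)
    )
    return (
        "You are an expert model validator. Your task is to analyze the "
        "following validation guideline step by step.\n"
        f"{steps}\n\n"
        f"Validation guideline: {guideline}"
    )


def build_react_only_prompt(guideline: str) -> str:
    """Baseline variant (hypothetical wording): no explicit reasoning steps."""
    return (
        "You are an expert model validator. Generate a structured validation "
        "report for the following validation guideline.\n\n"
        f"Validation guideline: {guideline}"
    )


guideline = "Assess whether core model requirements are aligned with model objectives."
print(build_cot_prompt(guideline))
```

Keeping both builders side by side makes it easier to hold everything except the reasoning steps constant when the two conditions are compared.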

---



In [1]:
import openai
import json

In [2]:
# Set your OpenAI API Key
OPENAI_API_KEY = ""
# openai.api_key = OPENAI_API_KEY

In [3]:
# Set your API key
client = openai.OpenAI(api_key=OPENAI_API_KEY)

# Call GPT-4o with a quick sanity-check question
chat_completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the difference between GPT-4 and GPT-4o?"}
    ]
)

# Print the assistant's reply
print(chat_completion.choices[0].message.content)

As of my last update, there is no model officially named "GPT-4o." It's possible that you might be referring to a specific variant or custom implementation of GPT-4 by an organization or a typo. 

If you're asking about any variations or improvements that could have been made in a variant of GPT-4, it would most likely involve differences in architecture, training data, applications, or specific features like enhanced performance, optimizations for certain tasks, or better integration with specific platforms.

For the most accurate and up-to-date information, checking the latest announcements from OpenAI or other involved organizations would be advisable. Additionally, if by any chance "GPT-4o" is a recent development beyond my latest update, you may want to look into the latest AI research or technology news resources.


In [4]:


import openai
import os
import json
import re
from getpass import getpass

# Securely set your API key
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Call LLM using GPT-4o
def call_llm(prompt, temperature=0.3):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert model validation reviewer."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature
    )
    return response.choices[0].message.content

# Evaluate context quality using LLM + fallback logic
def evaluate_context_quality(context, validation_instruction, max_retries=2):
    """
    Use structured JSON output from LLM to evaluate retrieved context.
    Includes retry logic and fallback regex-based parsing.
    Returns average score and individual metric scores.
    """
    evaluation_prompt = f"""
Evaluate the following retrieved context in relation to the validation instruction using three criteria:

1. Relevancy – How directly does the context address the instruction?
2. Completeness – Are all required elements present to answer the instruction fully?
3. Specificity – Does the context cite specific items (terms, metrics, definitions) from the document?

Validation Instruction:
"{validation_instruction}"

Retrieved Context:
"{context}"

Return your scores as a valid JSON object using this format:
{{
  "Relevancy": X,
  "Completeness": X,
  "Specificity": X
}}

Where X is a number from 1 (very poor) to 5 (excellent).
Do not include any explanation or commentary.
"""

    for attempt in range(max_retries):
        response = call_llm(evaluation_prompt, temperature=0.3)

        # Try parsing as JSON
        try:
            scores = json.loads(response.strip())
            if all(metric in scores for metric in ["Relevancy", "Completeness", "Specificity"]):
                avg_score = sum(scores.values()) / 3
                return avg_score, scores
        except Exception:
            pass

        # Fallback: regex extraction. The optional quote lets the pattern match
        # JSON-style keys such as "Relevancy": 5 as well as plain Relevancy: 5.
        try:
            relevancy = re.search(r'Relevancy"?\s*[:=]\s*([1-5])', response, re.IGNORECASE)
            completeness = re.search(r'Completeness"?\s*[:=]\s*([1-5])', response, re.IGNORECASE)
            specificity = re.search(r'Specificity"?\s*[:=]\s*([1-5])', response, re.IGNORECASE)

            if relevancy and completeness and specificity:
                scores = {
                    "Relevancy": int(relevancy.group(1)),
                    "Completeness": int(completeness.group(1)),
                    "Specificity": int(specificity.group(1))
                }
                avg_score = sum(scores.values()) / 3
                return avg_score, scores
        except Exception:
            pass

    # Default if all parsing fails
    return 0.0, {"Relevancy": 0, "Completeness": 0, "Specificity": 0}


Enter your OpenAI API key: ··········
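
Chat models often wrap JSON in a Markdown code fence or add a sentence around it, in which case `json.loads(response.strip())` raises and the function falls through to the regex fallback. A more tolerant parser could look like the sketch below. `extract_json_object` is a hypothetical helper, not part of the notebook, and it assumes the reply contains one top-level JSON object.

```python
import json


def extract_json_object(response: str):
    """Return the first top-level JSON object found in an LLM reply, or None."""
    decoder = json.JSONDecoder()
    # Try each "{" as a possible start; raw_decode ignores any trailing text.
    start = response.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(response, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = response.find("{", start + 1)
    return None


# Illustrative reply (not real model output): JSON wrapped in a code fence.
fence = "`" * 3
reply = (
    "Here are the scores:\n"
    f"{fence}json\n"
    '{"Relevancy": 5, "Completeness": 4, "Specificity": 4}\n'
    f"{fence}"
)
print(extract_json_object(reply))
```

Returning `None` instead of raising lets the caller decide whether to retry, fall back to regex extraction, or use default scores.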


In [5]:
mdd_context = """
Example MDD Sections
Section 2.1 – Model Objectives
The purpose of this model is to identify instances of elder financial abuse (EFA) in retail banking customer interactions. The model is intended to support analysts and fraud investigators by prioritizing transaction records and banker notes that are likely to involve EFA.

Specifically, the model aims to:

Detect unreported cases of elder financial abuse through natural language cues in transaction narratives.

Improve operational efficiency by reducing manual screening workload by at least 25%.

Maintain high precision to minimize false positives and avoid unnecessary customer inquiries.

Operate effectively across customer segments, including different age groups, geographies, and banking product types.

Support decision-making with interpretable outputs that can be audited by compliance and risk teams.

Section 3.1 – Core Model Requirements
The core requirements for the model are as follows:

Model Type: A supervised classification model using XGBoost with TF-IDF features derived from banker notes.

Data: Use pre-2023 transaction records and customer interactions; exclude age and gender fields to mitigate bias.

Performance Thresholds:

Minimum Precision: ≥ 0.80

Minimum Recall: ≥ 0.65

Minimum AUC: ≥ 0.85

Robustness: Model should demonstrate stable performance (±5% variation) across key customer segments.

Interpretability: Use SHAP values to provide explanation of top 5 features contributing to a positive prediction.

Deployment: Model should score within 250 ms per transaction and be containerized for cloud deployment.

Governance: Adhere to internal model documentation standards and satisfy third-party audit readiness.
"""
validation_instructions = "Assess whether core model requirements are aligned with model objectives."


In [6]:
relevancy_score, metric_breakdown = evaluate_context_quality(mdd_context, validation_instructions)
print(f"Context Quality Scores: {metric_breakdown} (Average: {relevancy_score})")

Context Quality Scores: {'Relevancy': 5, 'Completeness': 5, 'Specificity': 5} (Average: 5.0)
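
The reported average is the mean of the three metric scores. Note that `sum(scores.values()) / 3` assumes the parsed object holds exactly three numeric values, so an extra key or a string score would skew the result or raise. The sketch below averages only the expected metrics and checks the 1 to 5 range. `METRICS` and `average_score` are hypothetical names introduced for illustration.

```python
METRICS = ("Relevancy", "Completeness", "Specificity")


def average_score(scores: dict) -> float:
    """Average the three expected metrics, rejecting missing or out-of-range values."""
    values = []
    for metric in METRICS:
        value = scores.get(metric)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{metric} is missing or not numeric: {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{metric} is outside the 1-5 scale: {value!r}")
        values.append(value)
    return sum(values) / len(values)


print(average_score({"Relevancy": 5, "Completeness": 5, "Specificity": 5}))  # 5.0
```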


In [7]:
import openai
import os
import json
import re
from getpass import getpass

# Securely set your API key
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Call LLM using GPT-4o
def call_llm(prompt, temperature=0.3, iteration=None, step=None):
    print("\n" + "=" * 60)
    print(f"🌀 Iteration {iteration} | Step: {step}")
    print("-" * 60)
    print("📤 Prompt:\n", prompt.strip())

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert model validation reviewer."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature
    )

    content = response.choices[0].message.content
    print("-" * 60)
    print("📥 Response:\n", content.strip())
    print("=" * 60 + "\n")

    return content

# Evaluate context quality using LLM + fallback logic
def evaluate_context_quality(context, validation_instruction, max_retries=2, iteration=None):
    """
    Use structured JSON output from LLM to evaluate retrieved context.
    Includes retry logic and fallback regex-based parsing.
    Returns average score and individual metric scores.
    """
    evaluation_prompt = f"""
Evaluate the following retrieved context in relation to the validation instruction using three criteria:

1. Relevancy – How directly does the context address the instruction?
2. Completeness – Are all required elements present to answer the instruction fully?
3. Specificity – Does the context cite specific items (terms, metrics, definitions) from the document?

Validation Instruction:
"{validation_instruction}"

Retrieved Context:
"{context}"

Return your scores as a valid JSON object using this format:
{{
  "Relevancy": X,
  "Completeness": X,
  "Specificity": X
}}

Where X is a number from 1 (very poor) to 5 (excellent).
Do not include any explanation or commentary.
"""

    for attempt in range(max_retries):
        response = call_llm(
            prompt=evaluation_prompt,
            temperature=0.3,
            iteration=iteration,
            step=f"Evaluation Attempt {attempt + 1}"
        )

        # Try parsing as JSON
        try:
            scores = json.loads(response.strip())
            if all(metric in scores for metric in ["Relevancy", "Completeness", "Specificity"]):
                avg_score = sum(scores.values()) / 3
                return avg_score, scores
        except Exception:
            pass

        # Fallback: regex extraction. The optional quote lets the pattern match
        # JSON-style keys such as "Relevancy": 5 as well as plain Relevancy: 5.
        try:
            relevancy = re.search(r'Relevancy"?\s*[:=]\s*([1-5])', response, re.IGNORECASE)
            completeness = re.search(r'Completeness"?\s*[:=]\s*([1-5])', response, re.IGNORECASE)
            specificity = re.search(r'Specificity"?\s*[:=]\s*([1-5])', response, re.IGNORECASE)

            if relevancy and completeness and specificity:
                scores = {
                    "Relevancy": int(relevancy.group(1)),
                    "Completeness": int(completeness.group(1)),
                    "Specificity": int(specificity.group(1))
                }
                avg_score = sum(scores.values()) / 3
                return avg_score, scores
        except Exception:
            pass

    # Default if all parsing fails
    return 0.0, {"Relevancy": 0, "Completeness": 0, "Specificity": 0}


Enter your OpenAI API key: ··········


In [8]:
relevancy_score, metric_breakdown = evaluate_context_quality(mdd_context, validation_instructions)
print(f"Context Quality Scores: {metric_breakdown} (Average: {relevancy_score})")


🌀 Iteration None | Step: Evaluation Attempt 1
------------------------------------------------------------
📤 Prompt:
 Evaluate the following retrieved context in relation to the validation instruction using three criteria:

1. Relevancy – How directly does the context address the instruction?
2. Completeness – Are all required elements present to answer the instruction fully?
3. Specificity – Does the context cite specific items (terms, metrics, definitions) from the document?

Validation Instruction:
"Assess whether core model requirements are aligned with model objectives."

Retrieved Context:
"
Example MDD Sections
Section 2.1 – Model Objectives
The purpose of this model is to identify instances of elder financial abuse (EFA) in retail banking customer interactions. The model is intended to support analysts and fraud investigators by prioritizing transaction records and banker notes that are likely to involve EFA.

Specifically, the model aims to:

Detect unreported cases of elder 

In [9]:
## Another test

validation_instructions = "Assess whether the implementation test was successful."

relevancy_score, metric_breakdown = evaluate_context_quality(mdd_context, validation_instructions)
print(f"Context Quality Scores: {metric_breakdown} (Average: {relevancy_score})")


🌀 Iteration None | Step: Evaluation Attempt 1
------------------------------------------------------------
📤 Prompt:
 Evaluate the following retrieved context in relation to the validation instruction using three criteria:

1. Relevancy – How directly does the context address the instruction?
2. Completeness – Are all required elements present to answer the instruction fully?
3. Specificity – Does the context cite specific items (terms, metrics, definitions) from the document?

Validation Instruction:
"Assess whether the implementation test was successful."

Retrieved Context:
"
Example MDD Sections
Section 2.1 – Model Objectives
The purpose of this model is to identify instances of elder financial abuse (EFA) in retail banking customer interactions. The model is intended to support analysts and fraud investigators by prioritizing transaction records and banker notes that are likely to involve EFA.

Specifically, the model aims to:

Detect unreported cases of elder financial abuse thr

In [10]:
## UPDATED code to provide justifications
def evaluate_context_quality(context, validation_instruction, max_retries=2):
    """
    Evaluate retrieved context against a validation instruction using:
    - Relevancy
    - Completeness
    - Specificity

    Each criterion includes both a numeric score (1-5) and justification.
    """
    evaluation_prompt = f"""
Evaluate the following retrieved context in relation to the validation instruction using three criteria:

1. Relevancy – How directly does the context address the instruction?
2. Completeness – Are all required elements present to answer the instruction fully?
3. Specificity – Does the context cite specific items (terms, metrics, definitions) from the document?

Validation Instruction:
"{validation_instruction}"

Retrieved Context:
"{context}"

Return your evaluation as a JSON object in the following format:

{{
  "Relevancy": {{
    "Score": X,
    "Justification": "Explanation for why this score was given for relevancy..."
  }},
  "Completeness": {{
    "Score": Y,
    "Justification": "Explanation for completeness..."
  }},
  "Specificity": {{
    "Score": Z,
    "Justification": "Explanation for specificity..."
  }}
}}

Where X, Y, Z are scores from 1 (very poor) to 5 (excellent).
Do not include any commentary outside the JSON structure.
"""

    for attempt in range(max_retries):
        response = call_llm(evaluation_prompt, temperature=0.3)

        try:
            scores = json.loads(response.strip())
            if all(metric in scores for metric in ["Relevancy", "Completeness", "Specificity"]):
                avg_score = (
                    scores["Relevancy"]["Score"]
                    + scores["Completeness"]["Score"]
                    + scores["Specificity"]["Score"]
                ) / 3
                return avg_score, scores
        except Exception:
            pass  # Optionally log the response here for debugging

    # Fallback if parsing fails
    return 0.0, {
        "Relevancy": {"Score": 0, "Justification": "Parsing failed."},
        "Completeness": {"Score": 0, "Justification": "Parsing failed."},
        "Specificity": {"Score": 0, "Justification": "Parsing failed."}
    }
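
With justifications included, printing the raw dictionary becomes hard to read. A small formatter for the nested `{"Score", "Justification"}` structure might look like this. `format_scores` is a hypothetical helper, and the example input reuses the fallback values from the function above rather than real model output.

```python
def format_scores(scores: dict) -> str:
    """Render the nested score structure as one line per metric."""
    lines = []
    for metric, detail in scores.items():
        lines.append(f"{metric}: {detail['Score']}/5 - {detail['Justification']}")
    return "\n".join(lines)


# Placeholder input: the fallback structure returned when parsing fails.
example = {
    "Relevancy": {"Score": 0, "Justification": "Parsing failed."},
    "Completeness": {"Score": 0, "Justification": "Parsing failed."},
    "Specificity": {"Score": 0, "Justification": "Parsing failed."},
}
print(format_scores(example))
```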


In [11]:
## Another test

validation_instructions = "Assess whether the implementation test was successful."

relevancy_score, metric_breakdown = evaluate_context_quality(mdd_context, validation_instructions)
print(f"Context Quality Scores: {metric_breakdown} (Average: {relevancy_score})")


🌀 Iteration None | Step: None
------------------------------------------------------------
📤 Prompt:
 Evaluate the following retrieved context in relation to the validation instruction using three criteria:

1. Relevancy – How directly does the context address the instruction?
2. Completeness – Are all required elements present to answer the instruction fully?
3. Specificity – Does the context cite specific items (terms, metrics, definitions) from the document?

Validation Instruction:
"Assess whether the implementation test was successful."

Retrieved Context:
"
Example MDD Sections
Section 2.1 – Model Objectives
The purpose of this model is to identify instances of elder financial abuse (EFA) in retail banking customer interactions. The model is intended to support analysts and fraud investigators by prioritizing transaction records and banker notes that are likely to involve EFA.

Specifically, the model aims to:

Detect unreported cases of elder financial abuse through natural lan

In [24]:
import json
import time

# Call the LLM through the OpenAI client (assumes `client` was created in an earlier cell)
def call_llm(prompt, temperature=0.3, max_tokens=500, system_prompt=""):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Use the caller's system prompt when given; otherwise fall back to the reviewer persona
            {"role": "system", "content": system_prompt or "You are an expert model validation reviewer."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content

def evaluate_context_quality(context, validation_instruction, max_retries=2):
    evaluation_prompt = f"""
Evaluate the following retrieved context in relation to the validation instruction using three criteria:

1. Relevancy – How directly does the context address the instruction?
2. Completeness – Are all required elements present to answer the instruction fully?
3. Specificity – Does the context cite specific items (terms, metrics, definitions) from the document?

Validation Instruction:
"{validation_instruction}"

Retrieved Context:
"{context}"

Return your evaluation as a JSON object in the following format:

{{
  "Relevancy": {{
    "Score": X,
    "Justification": "Explanation for why this score was given for relevancy..."
  }},
  "Completeness": {{
    "Score": Y,
    "Justification": "Explanation for completeness..."
  }},
  "Specificity": {{
    "Score": Z,
    "Justification": "Explanation for specificity..."
  }}
}}

Where X, Y, Z are scores from 1 (very poor) to 5 (excellent).
Do not include any commentary outside the JSON structure.
"""

    for attempt in range(max_retries):
        response = call_llm(evaluation_prompt, temperature=0.3)

        try:
            scores = json.loads(response.strip())
            if all(metric in scores for metric in ["Relevancy", "Completeness", "Specificity"]):
                avg_score = (
                    scores["Relevancy"]["Score"]
                    + scores["Completeness"]["Score"]
                    + scores["Specificity"]["Score"]
                ) / 3

                explanations = {
                    metric: scores[metric]["Justification"]
                    for metric in ["Relevancy", "Completeness", "Specificity"]
                }

                return avg_score, scores, explanations
        except Exception:
            pass

    fallback_scores = {
        "Relevancy": {"Score": 0, "Justification": "Parsing failed."},
        "Completeness": {"Score": 0, "Justification": "Parsing failed."},
        "Specificity": {"Score": 0, "Justification": "Parsing failed."}
    }
    fallback_explanations = {k: v["Justification"] for k, v in fallback_scores.items()}
    return 0.0, fallback_scores, fallback_explanations

def react_validation_assessment(validation_instruction, text):
    reasoning_steps = []
    context = ""
    iteration = 0
    relevancy_score = 0

    while relevancy_score < RELEVANCY_THRESHOLD and iteration < MAX_ITERATIONS:
        iteration += 1
        print("\n" + "-" * 60)
        print(f"Iteration {iteration}...")

        thought_prompt = f"""
        Given the validation instruction: "{validation_instruction}"
        and the retrieved context: "{context}"
        What additional information is needed for a thorough assessment? Provide a concise answer in one sentence.
        """
        thought = call_llm(thought_prompt, temperature=0.3, max_tokens=500, system_prompt=system_prompt)
        print("\nThought Prompt:\n", thought_prompt)
        print("Generated Thought:\n", thought)
        reasoning_steps.append(f"Thought {iteration}: {thought}")

        action_prompt_for_query_formulation = f"""
        Based on the thought: "{thought}"
        Formulate a query to retrieve missing contextual details from the Model Development Document.
        """
        action_1 = call_llm(action_prompt_for_query_formulation, temperature=0.3, max_tokens=500, system_prompt=system_prompt)
        print("\nQuery Formulation Prompt:\n", action_prompt_for_query_formulation)
        print("Generated Action (Query):\n", action_1)
        reasoning_steps.append(f"Action {iteration} (Query Formulation): {action_1}")

        action_prompt_retrieve_context = f"""
        Answer the QUERY using the provided CONTEXT.
        QUERY: "{action_1}"
        CONTEXT: "{text}"
        """
        additional_context = call_llm(action_prompt_retrieve_context, temperature=0.3, max_tokens=500, system_prompt="")
        print("\nRetrieve Context Prompt:\n", action_prompt_retrieve_context)
        print("Retrieved Additional Context:\n", additional_context)
        reasoning_steps.append(f"Action {iteration} (Context Retrieved): {additional_context}")

        avg_score, scores, explanations = evaluate_context_quality(additional_context, validation_instruction)
        relevancy_score = avg_score

        print("\nContext Evaluation Scores:")
        for metric, score in scores.items():
            explanation = explanations.get(metric, "")
            print(f"{metric}: {score['Score']} — {explanation}")
        print(f"Average Relevancy Score: {avg_score}")

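        # Keep every retrieved passage, even below the threshold, so that later
        # thoughts and the final report are grounded in what was retrieved.
        context += "\n" + additional_context

        if relevancy_score >= RELEVANCY_THRESHOLD:
            print("\nSufficient relevant context retrieved. Proceeding to report generation...")
            break
        else:
            print("\nContext not relevant enough, refining search...")
            time.sleep(2)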

    report_prompt = f"""
    Based on the final observations and retrieved context, generate a structured validation report.
    Validation Assessment: {validation_instruction}
    Context: {context}
    Provide a detailed and structured response.
    """
    validation_report = call_llm(report_prompt, temperature=0.7, max_tokens=2000, system_prompt=system_prompt)
    print("\nFinal Report Prompt:\n", report_prompt)
    print("Generated Validation Report:\n", validation_report)

    return validation_report, reasoning_steps

RELEVANCY_THRESHOLD = 4
MAX_ITERATIONS = 1
max_tokens = 500
system_prompt = "Provide accurate answers only. Exclude extra details. Be concise."

# Example call (make sure you define these variables properly in your actual usage):
validation_instruction = "Assess whether the core model requirements are aligned with model objectives."
react_validation_assessment(validation_instruction, mdd_context)



------------------------------------------------------------
Iteration 1...

Thought Prompt:
 
        Given the validation instruction: "Assess whether the core model requirements are aligned with model objectives."
        and the retrieved context: ""
        What additional information is needed for a thorough assessment? Provide a concise answer in one sentence.
        
Generated Thought:
 Additional information needed includes the specific model objectives, the core model requirements, and any documentation detailing how these requirements are intended to support the objectives.

Query Formulation Prompt:
 
        Based on the thought: "Additional information needed includes the specific model objectives, the core model requirements, and any documentation detailing how these requirements are intended to support the objectives."
        Formulate a query to retrieve missing contextual details from the Model Development Document.
        
Generated Action (Query):
 Certainly! He

("**Model Validation Report**\n\n**Title:** Assessment of Core Model Requirements Alignment with Model Objectives\n\n**Date:** [Insert Date]\n\n**Prepared by:** [Your Name/Team]\n\n---\n\n**1. Introduction**\n\nThe purpose of this report is to assess whether the core requirements of the model under review align with its intended objectives. This evaluation is crucial to ensure that the model is not only theoretically sound but also practically applicable in achieving its designated purpose.\n\n---\n\n**2. Model Overview**\n\n- **Model Name:** [Insert Model Name]\n- **Model Type:** [Insert Model Type]\n- **Development Team:** [Insert Team/Organization Name]\n- **Intended Use:** [Describe the primary purpose of the model, e.g., risk assessment, decision support, etc.]\n- **Deployment Environment:** [Describe where and how the model is intended to be used]\n\n---\n\n**3. Objective Assessment**\n\n- **Objective Definition:** Clearly defined objectives are essential for a successful model. 
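
The control flow of the loop (thought, query, retrieval, evaluation, then stop at a score threshold or an iteration cap) can be exercised without API calls by passing the LLM in as a function. The sketch below works under that assumption. `react_loop`, `fake_llm`, and `fake_evaluate` are hypothetical names, the prompts are shortened paraphrases, and the stubs return canned values rather than model output. The sketch accumulates every retrieved passage so that later thoughts can see earlier results.

```python
from typing import Callable, List, Tuple


def react_loop(
    instruction: str,
    document: str,
    llm: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    threshold: float = 4.0,
    max_iterations: int = 3,
) -> Tuple[str, List[str]]:
    """Thought -> query -> retrieval -> evaluation, until the score is high enough."""
    context, steps = "", []
    for iteration in range(1, max_iterations + 1):
        thought = llm(f'Instruction: "{instruction}"\nContext so far: "{context}"\nWhat is missing?')
        steps.append(f"Thought {iteration}: {thought}")
        query = llm(f'Based on the thought: "{thought}"\nFormulate a retrieval query.')
        steps.append(f"Action {iteration} (Query): {query}")
        retrieved = llm(f'Answer the QUERY using the CONTEXT.\nQUERY: "{query}"\nCONTEXT: "{document}"')
        steps.append(f"Action {iteration} (Retrieved): {retrieved}")
        context += "\n" + retrieved  # accumulate across iterations
        if evaluate(retrieved, instruction) >= threshold:
            break
    return context.strip(), steps


# Stubs standing in for the model and the evaluator (canned values only).
def fake_llm(prompt: str) -> str:
    return "stub reply"


canned_scores = iter([2.0, 4.5])


def fake_evaluate(context: str, instruction: str) -> float:
    return next(canned_scores)


context, steps = react_loop("Assess alignment.", "document text", fake_llm, fake_evaluate)
print(len(steps))  # 6: two iterations, three steps each
```

Injecting `llm` and `evaluate` keeps the loop testable and makes it straightforward to swap models when comparing conditions.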

In [29]:
import json
import time

# Call the LLM through the OpenAI client (assumes `client` was created in an earlier cell)
def call_llm(prompt, temperature=0.3, max_tokens=500, system_prompt=""):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Use the caller's system prompt when given; otherwise fall back to the reviewer persona
            {"role": "system", "content": system_prompt or "You are an expert model validation reviewer."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content

def evaluate_context_quality(context, validation_instruction, max_retries=2):
    evaluation_prompt_template = """
Evaluate the following retrieved context in relation to the validation instruction using three criteria:

1. Relevancy – How directly does the context address the instruction?
2. Completeness – Are all required elements present to answer the instruction fully?
3. Specificity – Does the context cite specific items (terms, metrics, definitions) from the document?

Validation Instruction:
"{instruction}"

Retrieved Context:
"{ctx}"

Return your evaluation as a JSON object in the following format:

{{
  "Relevancy": {{
    "Score": X,
    "Justification": "Explanation for why this score was given for relevancy..."
  }},
  "Completeness": {{
    "Score": Y,
    "Justification": "Explanation for completeness..."
  }},
  "Specificity": {{
    "Score": Z,
    "Justification": "Explanation for specificity..."
  }}
}}

Where X, Y, Z are scores from 1 (very poor) to 5 (excellent).
Do not include any commentary outside the JSON structure.
"""

    for attempt in range(max_retries):
        evaluation_prompt = evaluation_prompt_template.format(instruction=validation_instruction, ctx=context)
        response = call_llm(evaluation_prompt, temperature=0.3)

        try:
            scores = json.loads(response.strip())
            if all(metric in scores for metric in ["Relevancy", "Completeness", "Specificity"]):
                avg_score = (
                    scores["Relevancy"]["Score"]
                    + scores["Completeness"]["Score"]
                    + scores["Specificity"]["Score"]
                ) / 3

                explanations = {
                    metric: scores[metric]["Justification"]
                    for metric in ["Relevancy", "Completeness", "Specificity"]
                }

                return avg_score, scores, explanations
        except Exception as e:
            print(f"[Attempt {attempt+1}] JSON parsing failed: {e}")
            continue

    # Fallback: use a neutral default instead of zeros
    fallback_scores = {
        "Relevancy": {"Score": 3, "Justification": "LLM output could not be parsed, assigning neutral score."},
        "Completeness": {"Score": 3, "Justification": "LLM output could not be parsed, assigning neutral score."},
        "Specificity": {"Score": 3, "Justification": "LLM output could not be parsed, assigning neutral score."}
    }
    fallback_explanations = {k: v["Justification"] for k, v in fallback_scores.items()}
    avg_score = sum(v["Score"] for v in fallback_scores.values()) / 3

    return avg_score, fallback_scores, fallback_explanations

def react_validation_assessment(validation_instruction, text):
    reasoning_steps = []
    context = ""
    iteration = 0
    relevancy_score = 0

    while relevancy_score < RELEVANCY_THRESHOLD and iteration < MAX_ITERATIONS:
        iteration += 1
        print("\n" + "-" * 60)
        print(f"Iteration {iteration}...")

        thought_prompt = f"""
        Given the validation instruction: "{validation_instruction}"
        and the retrieved context: "{context}"
        What additional information is needed for a thorough assessment? Provide a concise answer in one sentence.
        """
        thought = call_llm(thought_prompt, temperature=0.3, max_tokens=500, system_prompt=system_prompt)
        print("\nThought Prompt:\n", thought_prompt)
        print("Generated Thought:\n", thought)
        reasoning_steps.append(f"Thought {iteration}: {thought}")

        action_prompt_for_query_formulation = f"""
        Based on the thought: "{thought}"
        Formulate a query to retrieve missing contextual details from the Model Development Document.
        """
        action_1 = call_llm(action_prompt_for_query_formulation, temperature=0.3, max_tokens=500, system_prompt=system_prompt)
        print("\nQuery Formulation Prompt:\n", action_prompt_for_query_formulation)
        print("Generated Action (Query):\n", action_1)
        reasoning_steps.append(f"Action {iteration} (Query Formulation): {action_1}")

        action_prompt_retrieve_context = f"""
        Answer the QUERY using the provided CONTEXT.
        QUERY: "{action_1}"
        CONTEXT: "{text}"
        """
        additional_context = call_llm(action_prompt_retrieve_context, temperature=0.3, max_tokens=500, system_prompt="")
        print("\nRetrieve Context Prompt:\n", action_prompt_retrieve_context)
        print("Retrieved Additional Context:\n", additional_context)
        reasoning_steps.append(f"Action {iteration} (Context Retrieved): {additional_context}")

        avg_score, scores, explanations = evaluate_context_quality(additional_context, validation_instruction)
        relevancy_score = avg_score

        print("\nContext Evaluation Scores:")
        for metric, score in scores.items():
            explanation = explanations.get(metric, "")
            print(f"{metric}: {score['Score']} — {explanation}")
        print(f"Average Relevancy Score: {avg_score}")

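        # Keep the retrieved text in either case so that the next thought prompt and the
        # final report can use it; otherwise `context` stays empty whenever no iteration
        # reaches the threshold.
        context += "\n" + additional_context

        if relevancy_score >= RELEVANCY_THRESHOLD:
            print("\nSufficient relevant context retrieved. Proceeding to report generation...")
            break
        else:
            print("\nContext not relevant enough, refining search...")
            time.sleep(2)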

    report_prompt = f"""
    Given the Validation Assessment Instruction and the Retrieved Context, generate a one-paragraph summary of the validation assessment and the conclusion of the assessment:
    Validation Assessment Instruction: {validation_instruction}
    Retrieved Context: {context}
    """
    validation_report = call_llm(report_prompt, temperature=0.7, max_tokens=2000, system_prompt=system_prompt)
    print("\nFinal Report Prompt:\n", report_prompt)
    print("Generated Validation Report:\n", validation_report)

    return validation_report, reasoning_steps
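
The stopping rule used in the loop (stop once the average relevancy score reaches `RELEVANCY_THRESHOLD`, or after `MAX_ITERATIONS` rounds) can be exercised without any LLM calls. The following is a minimal sketch under stated assumptions: `retrieve_until_relevant`, `retrieve`, `score`, and the fake lookup tables are hypothetical names introduced only for illustration, and keeping the best-scoring candidate as a fallback is one possible design choice, not necessarily what `react_validation_assessment` does.

```python
from typing import Callable, Tuple


def retrieve_until_relevant(
    retrieve: Callable[[int], str],
    score: Callable[[str], float],
    threshold: float = 4.0,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Retry retrieval until the score meets the threshold or the budget runs out.

    Returns the best-scoring context seen, its score, and the iterations used.
    """
    best_context, best_score = "", float("-inf")
    iterations = 0
    for iteration in range(1, max_iterations + 1):
        iterations = iteration
        candidate = retrieve(iteration)
        candidate_score = score(candidate)
        # Remember the strongest candidate so a fallback exists if no round passes.
        if candidate_score > best_score:
            best_context, best_score = candidate, candidate_score
        if candidate_score >= threshold:
            break
    return best_context, best_score, iterations


# Hypothetical stand-ins for the LLM-backed retrieval and scoring steps.
fake_contexts = {1: "unrelated text", 2: "partial match", 3: "objectives and requirements"}
fake_scores = {"unrelated text": 1.0, "partial match": 3.5, "objectives and requirements": 4.5}

context, score_value, used = retrieve_until_relevant(fake_contexts.get, fake_scores.get)
print(context, score_value, used)
```

Separating the control flow from the model calls in this way makes it possible to check the threshold and iteration-budget behaviour deterministically before spending tokens on real runs.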

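# Loop configuration; these must be defined before react_validation_assessment is called.
RELEVANCY_THRESHOLD = 4  # minimum average relevancy score needed to stop retrieving
MAX_ITERATIONS = 3       # upper bound on thought/query/retrieve rounds
max_tokens = 500         # not read by react_validation_assessment, which passes max_tokens explicitly
system_prompt = "Provide accurate answers only. Exclude extra details. Be concise."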

# Example call (make sure you define these variables properly in your actual usage):
mdd_context = """
Example MDD Sections
Section 2.1 – Model Objectives
The purpose of this model is to identify instances of elder financial abuse (EFA) in retail banking customer interactions. The model is intended to support analysts and fraud investigators by prioritizing transaction records and banker notes that are likely to involve EFA.

Specifically, the model aims to:

Detect unreported cases of elder financial abuse through natural language cues in transaction narratives.

Improve operational efficiency by reducing manual screening workload by at least 25%.

Maintain high precision to minimize false positives and avoid unnecessary customer inquiries.

Operate effectively across customer segments, including different age groups, geographies, and banking product types.

Support decision-making with interpretable outputs that can be audited by compliance and risk teams.

Section 3.1 – Core Model Requirements
The core requirements for the model are as follows:

Model Type: A supervised classification model using XGBoost with TF-IDF features derived from banker notes.

Data: Use pre-2023 transaction records and customer interactions; exclude age and gender fields to mitigate bias.

Performance Thresholds:

Minimum Precision: ≥ 0.80

Minimum Recall: ≥ 0.65

Minimum AUC: ≥ 0.85

Robustness: Model should demonstrate stable performance (±5% variation) across key customer segments.

Interpretability: Use SHAP values to provide explanation of top 5 features contributing to a positive prediction.

Deployment: Model should score within 250 ms per transaction and be containerized for cloud deployment.

Governance: Adhere to internal model documentation standards and satisfy third-party audit readiness.
"""
validation_instruction = "Assess whether the core model requirements are aligned with model objectives."
react_validation_assessment(validation_instruction, mdd_context)



------------------------------------------------------------
Iteration 1...

Thought Prompt:
 
        Given the validation instruction: "Assess whether the core model requirements are aligned with model objectives."
        and the retrieved context: ""
        What additional information is needed for a thorough assessment? Provide a concise answer in one sentence.
        
Generated Thought:
 Additional information needed includes a detailed description of the model's objectives and the specific core model requirements to determine their alignment.

Query Formulation Prompt:
 
        Based on the thought: "Additional information needed includes a detailed description of the model's objectives and the specific core model requirements to determine their alignment."
        Formulate a query to retrieve missing contextual details from the Model Development Document.
        
Generated Action (Query):
 Could you provide a detailed description of the model's objectives and outline the 

("To generate a comprehensive summary of the validation assessment regarding the alignment of core model requirements with model objectives, it's crucial to examine whether the foundational elements of the model are designed to effectively achieve the intended outcomes. This involves evaluating if the model's structure, inputs, processes, and outputs are in harmony with the overarching goals it aims to fulfill. In this context, the assessment would typically involve a detailed analysis of the model's framework to ensure that each component has a clear purpose and contributes directly to the model's objectives. The conclusion of such an assessment should indicate whether there is a strong alignment between the model requirements and objectives, highlighting any potential discrepancies or areas for improvement that could enhance the model's effectiveness in achieving its goals. If the assessment reveals a robust alignment, it suggests that the model is well-positioned to deliver its inte