# Agentic AI Underwriting Demo (BOP)

*A hands-on look at how “agentic” AI can support straight-through underwriting for Businessowners Policies (BOP)—with transparent reasoning, safe hand-offs, and repeatable evaluation.*

## What this shows
- **Practical workflow:** An AI assistant that breaks underwriting into smaller decisions (check appetite, gather missing context, confirm evidence, and finalize or refer).
- **Clear reasons:** Each decision includes short, human-readable rationale and the specific rule or document section it relied on.
- **Safe-fail behavior:** When the case is ambiguous or information is insufficient, the system recommends **send-to-human** review instead of forcing a decision.
- **Repeatable testing:** The demo runs on a **realistic synthetic dataset**, so new AI methods can be evaluated fairly as the technology evolves.

## Why it matters for actuaries
- **Consistency & speed:** Routine cases can move faster while edge cases are routed for expert review.
- **Auditability:** Decisions are tied to evidence, which helps with governance and peer review.
- **Future-proofing:** Because the dataset is reusable, we can compare today’s models with tomorrow’s “next big thing” on the same scenarios.

## What you’ll see in the notebook
1. **Setup:** Load the demo dataset (guidebook excerpts + sample BOP applications).
2. **Agent flow:** The assistant checks appetite, resolves uncertainties, and either finalizes or refers.
3. **Reason capture:** The assistant records short reasons and the rule(s) consulted.
4. **Evaluation:** We score outcomes (accept/reject/refer) and rationale quality against ground truth.
5. **Summary:** Simple tables/plots that show accuracy, referral behavior, and reason alignment.

> **No code required to browse:** You can scroll to the outputs to see example cases, decisions, and summaries. Running cells is optional.

## What makes this different
- **Agentic design:** Instead of one big answer, the assistant takes **small, verifiable steps** and can reconsider when signals conflict.
- **Evidence-first:** Rationales point back to the exact guideline passages used.
- **Human-in-the-loop by design:** The system prefers referral when rules or data don’t clearly support a decision.
- **Evaluation you can trust:** Results are produced on a curated, **non-proprietary** dataset that can be shared and rerun.
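
The step-wise, referral-first pattern described in the bullets above can be sketched in plain Python. This is a minimal illustration and not the notebook's implementation: the step functions, the field names (`sic_ok`, `prior_losses`), and the loss threshold are all hypothetical.

```python
from typing import Callable, Optional

# Each step returns "ACCEPT", "REJECT", or None (None = this step cannot decide).
Step = Callable[[dict], Optional[str]]

def appetite_step(app: dict) -> Optional[str]:
    # Hypothetical rule: reject when the class of business is out of appetite.
    if app.get("sic_ok") is False:
        return "REJECT"
    return None

def loss_history_step(app: dict) -> Optional[str]:
    # Hypothetical rule: a missing loss count is ambiguous, so do not decide.
    losses = app.get("prior_losses")
    if losses is None:
        return None
    return "ACCEPT" if losses <= 1 else "REJECT"

def run_steps(app: dict, steps: list[Step]) -> tuple[str, list[str]]:
    """Run steps in order; refer to a human if no step reaches a decision."""
    trail = []
    for step in steps:
        outcome = step(app)
        trail.append(f"{step.__name__}: {outcome or 'undecided'}")
        if outcome is not None:
            return outcome, trail
    return "REQUIRES_HUMAN_REVIEW", trail

decision, trail = run_steps({"sic_ok": True}, [appetite_step, loss_history_step])
print(decision)  # REQUIRES_HUMAN_REVIEW, because the loss count is missing
print(trail)
```

The `trail` list plays the role of the short, per-step rationale: every step leaves a record whether or not it decided the case.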

## About the dataset
This demo uses a **synthetic underwriting guidebook** and **scripted application scenarios** that mimic real-world cases (clean approvals, clear declines, and ambiguous files with missing information). Because no private data is used, the materials are easy to share and extend.
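
As a rough guide to what one scenario file looks like, the sketch below parses a hypothetical record. The four top-level keys and the nested field names are the ones the notebook's code reads; every value is invented for illustration.

```python
import json

# Hypothetical scenario record: key names follow the notebook's loading code,
# but all values here are made up.
raw = """
{
  "Application Data": {"NATURE OF BUSINESS": "Dessert Shop", "SIC": "5812"},
  "Third-Party Data": {"Number of Claims": 0},
  "Final Decision": "Accept",
  "Final Reason": "Illustrative ground-truth rationale."
}
"""

record = json.loads(raw)
application = record.get("Application Data", {})
third_party = record.get("Third-Party Data", {})
# Normalize the label the same way the evaluation loop does.
expected_decision = record.get("Final Decision", "").strip().upper()
expected_reason = record.get("Final Reason", "")

print(expected_decision)
print(sorted(application))
```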

## How to read the results
- **Decision table:** Counts of Accept / Reject / Refer (human review) outcomes.
- **Rationale check:** How often the stated reason matches the expected reason type.
- **Example cases:** A few end-to-end samples with the cited rule text and the assistant’s short explanation.
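
To make the first two summaries concrete, here is a small sketch that computes them with the standard library. The cases and the reason-type labels are invented, and exact label matching is a simplification: the notebook itself compares reasons by embedding similarity.

```python
from collections import Counter

# Hypothetical scored cases: (expected decision, predicted decision,
# expected reason type, predicted reason type). All values are invented.
cases = [
    ("ACCEPT", "ACCEPT", "meets_guidelines", "meets_guidelines"),
    ("ACCEPT", "REQUIRES_HUMAN_REVIEW", "meets_guidelines", "rating_factor"),
    ("REJECT", "REJECT", "fire_code", "fire_code"),
    ("REJECT", "ACCEPT", "prior_loss", "rating_factor"),
]

# Decision table: how many cases landed in each predicted outcome.
decision_table = Counter(predicted for _, predicted, _, _ in cases)
# Share of cases where the predicted decision equals the expected one.
decision_accuracy = sum(e == p for e, p, _, _ in cases) / len(cases)
# Rationale check: share of cases where the reason types agree.
rationale_match = sum(er == pr for _, _, er, pr in cases) / len(cases)

print(dict(decision_table))
print(f"decision accuracy: {decision_accuracy:.2f}")
print(f"rationale match:   {rationale_match:.2f}")
```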

## Extend the demo (optional)
- Swap in your own rules or forms to see how the flow adapts.
- Compare alternative approaches (e.g., different retrieval settings or model versions) using the same dataset.
- Add checkpoints for **claims triage** or **trend/change detection** using the same agent pattern.

---

**Author:** Robert Richardson  
**Contact:** richardson@stat.byu.edu  
**Notebook:** [Agentic_RAG_Underwriting-2.ipynb](Agentic_RAG_Underwriting-2.ipynb)


Install necessary packages (specifically for use on Google Colab)

In [None]:
%pip install langchain openai faiss-cpu pymupdf langchain-community tiktoken langchain-openai langgraph pypdf



Load required libraries

In [None]:
# Enhanced Non-Linear Agentic RAG Underwriting Pipeline (fully integrated)

# Standard library
import json
import re
from datetime import datetime
from typing import Tuple

# Third-party
import numpy as np
import openai
import pandas as pd
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Loaders and vector stores live in langchain_community and the OpenAI
# integrations in langchain_openai; the older langchain.* paths are deprecated.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langgraph.graph import StateGraph


The pipeline currently uses the OpenAI API; the key is read from Colab secrets.

In [None]:
# Read the OpenAI API key from Colab secrets and expose it as an environment variable
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
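
The cell above only works inside Google Colab. For running the notebook elsewhere, one possible approach is to fall back to an environment variable. The helper below is a sketch under that assumption; `load_openai_key` is a hypothetical name and is not part of the original notebook.

```python
import os

def load_openai_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is not configured there.
        pass
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return value
```

A caller would then write `os.environ["OPENAI_API_KEY"] = load_openai_key()` in place of the direct `userdata.get` call.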

Convert the policy guideline into a vector store

In [None]:
# Load and Vectorize Policy Guide
loader = PyPDFLoader('https://raw.githubusercontent.com/drbob-richardson/Actuarial_Agentic_AI/main/bop_agentic_rag/Application_Data_Generation/bop_policyguide_draft2.pdf')
pages = loader.load()  # one Document per PDF page; chunking happens once, in the next step

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(pages)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()

# Load SIC codes
sic_codes_df = pd.read_csv('https://raw.githubusercontent.com/drbob-richardson/Actuarial_Agentic_AI/main/bop_agentic_rag/Application_Data_Generation/sic-codes.csv')
acceptable_sics = set(sic_codes_df['SIC'].astype(str))
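
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window chunker. It is only an illustration of the overlap idea: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name and the toy sizes are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-width character windows; each starts chunk_size - chunk_overlap after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(25))  # 25 characters: 0123456789...
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
for c in chunks:
    print(c)
```

Each chunk repeats the last four characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.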



Define helper functions and build the pipeline

In [None]:


import re  # used by parse_money


def parse_money(value):
    """Convert a money-like value (e.g. '$1,250,000') to float; return NaN if it cannot be parsed."""
    try:
        if isinstance(value, str):
            # Keep only digits, periods, and minus signs (drops '$', commas, spaces)
            cleaned = re.sub(r'[^\d\.\-]', '', value)
            return float(cleaned)
        return float(value)
    except Exception as e:
        print(f"[parse_money error] Could not parse value '{value}': {e}")
        return np.nan



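# Multiplicative rating-factor function. The rest of the notebook calls this the
# "logistic" score, but it is a product of rating factors, not a fitted logistic regression.
# Returns (uncapped factor, factor capped to [0.5, 2.5], SIC base rate).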
def calculate_combined_bop_factor(
    year_built: int,
    sq_ft: float,
    building_limit: float,
    distance_to_hydrant: float,
    prior_loss_count: int,
    annual_sales: float,
    sic_code: str,
    current_year: int = datetime.now().year
) -> Tuple[float, float, float]:
    sic_code = str(sic_code).zfill(4)
    sic_base_rates = {"20": 0.40, "50": 0.40, "52": 0.35, "53": 0.35, "54": 0.35,
                      "55": 0.35, "56": 0.35, "57": 0.35, "58": 0.70, "59": 0.35,
                      "72": 0.55, "73": 0.55, "75": 0.55, "76": 0.55, "80": 0.50,
                      "81": 0.50, "82": 0.50, "87": 0.50, "40": 0.45, "41": 0.45,
                      "42": 0.45, "48": 0.45}
    base_rate = sic_base_rates.get(sic_code[:2], 0.50)
    age_factor = .75 + 0.0075 * min(current_year - year_built, 80)
    sq_ft_factor = (sq_ft / 2000) ** 0.15
    limit_factor = min(1 + 0.002 * max(0, building_limit / 1000 - 250), 3.0)
    hydrant_factor = min(1 + 0.00003 * distance_to_hydrant, 1.30)
    loss_factor = min(1.0 * (1.25 ** prior_loss_count), 3.0)
    sales_factor = (annual_sales / 100_000) ** 0.10
    total_factor = (age_factor * sq_ft_factor * limit_factor * hydrant_factor * loss_factor * sales_factor * base_rate)
    return round(total_factor, 3), round(max(min(total_factor, 2.5), 0.5), 3), round(base_rate, 3)

def retrieve_underwriting_guidelines(business_type, application_data=None):
    # Basic
    business_docs = retriever.invoke(f"Underwriting guidelines for {business_type}")
    global_docs = retriever.invoke("General underwriting guidelines and exclusions")

    # LLM-enhanced retrieval
    if application_data:
        query = generate_guideline_retrieval_query(application_data)
        smart_docs = retriever.invoke(query)
    else:
        smart_docs = []

    # Merge and deduplicate
    seen = set()
    all_docs = []
    for doc in business_docs + global_docs + smart_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            all_docs.append(doc)

    return "\n\n---\n\n".join(doc.page_content for doc in all_docs)



# LLM setup
llm = ChatOpenAI(model_name='gpt-4.1-mini', temperature=0)
qa = RetrievalQA.from_chain_type(llm, retriever=retriever)

def check_sic_node(state):
    application = state['application']
    sic = application.get('SIC')
    prompt = f"""
    Is SIC code {sic} clearly appropriate for a business described as '{application['NATURE OF BUSINESS']}'?
    Respond with one of the following formats exactly:

    - YES - followed by reasoning
    - NO - followed by reasoning

    Example: YES - The SIC code aligns well with the described antique retail business.

    Your answer:
    """
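    evaluation = llm.invoke(prompt).content.strip().upper()

    # Compare the leading token only: a substring test like `'NO' in evaluation`
    # also matches words such as "NOT" or "KNOWN" inside a YES explanation.
    verdict = evaluation.lstrip('- ')
    if verdict.startswith('YES'):
        state.update({'decision': 'CONTINUE', 'reason': evaluation})
    elif verdict.startswith('NO'):
        state.update({'decision': 'REJECT', 'reason': evaluation})
    else:
        state.update({'decision': 'REQUIRES_HUMAN_REVIEW', 'reason': f'Ambiguous SIC evaluation: {evaluation}'})
    return state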


def guidelines_eval_node(state):
    application = state['application']
    guidelines = retrieve_underwriting_guidelines(application['NATURE OF BUSINESS'])

    prompt = (
        f"Given this application: {json.dumps(application)}\n\n"
        f"Guidelines:\n{guidelines}\n\n"
        "Categorize the application's underwriting acceptability as:\n"
        "- CLEARLY_ACCEPTABLE\n"
        "- CLEARLY_REJECTABLE\n"
        "- BORDERLINE_REQUIRES_THIRD_PARTY\n"
        "- APPETITE_UNCLEAR\n\n"
        "Respond with a category and a short explanation. Example:\n"
        "CLEARLY_ACCEPTABLE - The business aligns fully with listed criteria."
    )

    evaluation = llm.invoke(prompt).content.upper()

    if 'CLEARLY_ACCEPTABLE' in evaluation:
        decision = 'CLEARLY_ACCEPTABLE'
    elif 'CLEARLY_REJECTABLE' in evaluation:
        decision = 'CLEARLY_REJECTABLE'
    elif 'BORDERLINE_REQUIRES_THIRD_PARTY' in evaluation:
        decision = 'BORDERLINE_REQUIRES_THIRD_PARTY'
    elif 'APPETITE_UNCLEAR' in evaluation:
        decision = 'APPETITE_UNCLEAR'
    elif 'ACCEPTABLE' in evaluation and 'UNCLEAR' not in evaluation:
        decision = 'CLEARLY_ACCEPTABLE'  # fallback from soft language
    else:
        decision = 'APPETITE_UNCLEAR'  # default to unclear, but allow reflection

    state.update({
        'decision': decision,
        'guidelines': guidelines,
        'reason': evaluation
    })
    return state



def logistic_eval_node(state):
    application = state['application']
    third_party_data = state.get('third_party_data', {})

    try:
        total_factor, total_factor_capped, base_rate = calculate_combined_bop_factor(
            year_built=int(application['Year Built']),
            sq_ft=float(application['Square Feet']),
            building_limit=parse_money(application['Building limit']),
            annual_sales=parse_money(application['ANNUAL SALES']),
            distance_to_hydrant=float(application['Distance to hydrant'].split()[0]),
            prior_loss_count=int(third_party_data.get('Number of Claims', 0)),
            sic_code=application['SIC']
        )


        decision = 'ACCEPT' if total_factor_capped < 1.5 else 'REFER' if total_factor_capped < 2.0 else 'REJECT'
        logistic_reason = f'Logistic factor capped: {total_factor_capped}'

    except Exception as e:
        # Record why the score failed so the error survives into the final state
        total_factor_capped = float('nan')
        logistic_reason = f"Logistic score could not be calculated: {e}"
        decision = 'REJECT'
    state.update({
        'final_decision': decision,
        'logistic_reason': logistic_reason,
        'reason': logistic_reason
    })
    return state


def reflection_node(state):
    state['reflection_count'] = state.get('reflection_count', 0) + 1

    if state['reflection_count'] > 2:
        # Allow fallback to logistic scoring instead of forcing human review
        state['decision'] = 'CLEAR_FOR_LOGISTIC'
        state['reason'] = 'Guideline uncertainty unresolved after 3 reflections — proceeding to logistic evaluation.'
        return state

    application = state['application']
    uncertainties = state.get('decision', 'appetite level unclear')
    guidelines = retrieve_underwriting_guidelines(application['NATURE OF BUSINESS'])

    prompt = (
        f"Clarify the underwriting appetite decision based on the current uncertainty: '{uncertainties}'.\n\n"
        f"Application:\n{json.dumps(application)}\n\n"
        f"Guidelines:\n{guidelines}\n\n"
        "Respond with one of the following categories ONLY, followed by a short justification:\n"
        "- CLEARLY_ACCEPTABLE\n- CLEARLY_REJECTABLE\n- BORDERLINE_REQUIRES_THIRD_PARTY\n- APPETITE_UNCLEAR\n"
    )
    reflection_result = llm.invoke(prompt).content
    state.update({'reflection': reflection_result})
    return state







# Third-party data evaluation node
def third_party_eval_node(state):
    application = state['application']
    guidelines = retrieve_underwriting_guidelines(application['NATURE OF BUSINESS'])

    prompt = (
        f"Third-party data:\n{json.dumps(state['third_party_data'])}\n\n"
        f"Guidelines:\n{guidelines}\n\n"
        "Categorize this data as one of:\n"
        "- DISQUALIFYING_FACTORS_PRESENT\n"
        "- UNCLEAR_REQUIRES_REVIEW\n"
        "- CLEAR\n\n"
        "Respond with the category and a brief explanation."
    )
    evaluation = llm.invoke(prompt).content.upper()

    if 'DISQUALIFYING_FACTORS_PRESENT' in evaluation:
        state.update({'final_decision': 'REJECT', 'reason': evaluation})
    elif 'UNCLEAR_REQUIRES_REVIEW' in evaluation:
        state.update({'decision': 'CLEAR_FOR_LOGISTIC', 'third_party_evaluation': evaluation})
    elif 'CLEAR' in evaluation:
        state.update({'third_party_evaluation': evaluation, 'decision': 'CLEAR_FOR_LOGISTIC'})
    else:
        # Assume fallback to logistic
        state.update({'decision': 'CLEAR_FOR_LOGISTIC', 'third_party_evaluation': f'Ambiguous: {evaluation}'})

    return state






def concerning_details_check_node(state):
    application = state['application']
    guidelines = retrieve_underwriting_guidelines(application['NATURE OF BUSINESS'])

    prompt = (
        f"Review this application: {json.dumps(application)}\n\n"
        f"Using the following underwriting guidelines:\n\n{guidelines}\n\n"
        "List any potentially concerning details in the application that are NOT directly addressed or clearly covered by the guidelines.\n"
        "Only include items that may require clarification or pose potential issues beyond what's defined in the guidelines.\n\n"
        "Respond with either:\n"
        "- NO CONCERNS if everything aligns or is irrelevant to the guidelines\n"
        "- A numbered list of unclear or missing information\n\n"
        "Do not suggest rejection or human review — only identify the guideline coverage status of the concerns."
    )

    concerns = llm.invoke(prompt).content.strip()

    if "NO CONCERNS" in concerns.upper():
        state.update({'concerning_details': None})
        return state
    else:
        state.update({
            'concerning_details': concerns,
            'decision': 'REFLECT_CONCERNS',
            'reason': concerns
        })
        return state




import re

def reflect_concerns_node(state):
    application = state['application']
    concerns = state.get('concerning_details')
    business_type = application.get('NATURE OF BUSINESS', 'Retail')

    # LLM-based retrieval query for reflection
    smart_query = generate_guideline_retrieval_query(application)
    smart_docs = retriever.invoke(smart_query)
    business_docs = retriever.invoke(f"Underwriting guidelines for {business_type}")
    global_docs = retriever.invoke("General underwriting guidelines and exclusions")

    # Combine and deduplicate guidelines
    seen = set()
    all_docs = []
    for doc in smart_docs + business_docs + global_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            all_docs.append(doc)

    guidelines = "\n\n---\n\n".join(doc.page_content for doc in all_docs)

    prompt = (
        f"The following concerns were identified in the application:\n{concerns}\n\n"
        f"Underwriting Guidelines:\n{guidelines}\n\n"
        "Question:\nDo ANY of these concerns explicitly violate the underwriting guidelines?\n\n"
        "Respond with one of the following formats only:\n"
        "- YES – followed by a short summary of the violating concern(s)\n"
        "- NO – followed by a statement confirming no explicit violations\n\n"
        "Be concise. Do not re-list the concerns or over-explain."
    )

    reflection_output = llm.invoke(prompt).content.strip()
    state['reflection_outcome'] = reflection_output
    state['retrieval_query'] = smart_query  # Optional audit field

    reflection_upper = reflection_output.upper()

    if reflection_upper.startswith("YES"):
        state['final_decision'] = 'REJECT'
        state['reason'] = f"Rejected due to guideline violation: {reflection_output}"
    elif reflection_upper.startswith("NO"):
        state['decision'] = 'CLEAR_FOR_LOGISTIC'
        state['reason'] = "No explicit guideline violations found."
    else:
        state['final_decision'] = 'REQUIRES_HUMAN_REVIEW'
        state['reason'] = f"Uninterpretable reflection result:\n{reflection_output}"

    return state




def generate_guideline_retrieval_query(application_data):
    prompt = f"""
    Given the following business insurance application, generate a precise and specific query to retrieve underwriting guidelines that may apply to edge-case concerns or risks.

    Application Data:
    {json.dumps(application_data, indent=2)}

    Focus especially on:
    - Risky or unusual fields
    - Claimed losses, prior incidents, or unusual coverages
    - Fields that might match underwriting exclusions

    Respond with a single-line retrieval query targeting specific underwriting rules or exclusions:
    """
    query = llm.invoke(prompt).content.strip()
    return query






def final_reject_node(state):
    state['final_decision'] = 'REJECT'

    # Prefer the logistic reason if present; otherwise keep any existing reason
    if state.get('logistic_reason'):
        state['reason'] = state['logistic_reason']
    elif not state.get('reason'):
        state['reason'] = "Rejected without specified reason"

    return state




def human_review_node(state):
    state['final_decision'] = 'REQUIRES_HUMAN_REVIEW'
    state['reason'] = state.get('reason', 'Ambiguity or risk details require human intervention.')
    return state


from typing import TypedDict, Optional

class UnderwritingState(TypedDict):
    application: dict
    third_party_data: Optional[dict]
    decision: Optional[str]
    final_decision: Optional[str]
    reason: Optional[str]
    guidelines: Optional[str]
    reflection: Optional[str]
    reflection_count: Optional[int]
    logistic_reason: Optional[str]
    concerning_details: Optional[str]
    reflection_outcome: Optional[str]
    third_party_evaluation: Optional[str]


flow = StateGraph(UnderwritingState)

# Register the nodes defined above
flow.add_node('SIC_CHECK', check_sic_node)
flow.add_node('GUIDELINES_EVAL', guidelines_eval_node)
flow.add_node('THIRD_PARTY_EVAL', third_party_eval_node)
flow.add_node('REFLECTION_NODE', reflection_node)
flow.add_node('LOGISTIC_EVAL', logistic_eval_node)
flow.add_node('CONCERNING_DETAILS_CHECK', concerning_details_check_node)
flow.add_node('REFLECT_CONCERNS', reflect_concerns_node)
flow.add_node('HUMAN_REVIEW', human_review_node)
flow.add_node('FINAL_REJECT', final_reject_node)

# Entry Point
flow.set_entry_point('SIC_CHECK')

# SIC_CHECK always proceeds to GUIDELINES_EVAL: its verdict is written to state but is not
# used for routing, and GUIDELINES_EVAL then overwrites state['decision']
flow.add_edge('SIC_CHECK', 'GUIDELINES_EVAL')

# GUIDELINES_EVAL Conditional Router
def guidelines_router(state):
    decision = state['decision']
    reflections = state.get('reflection_count', 0)

    if decision == 'CLEARLY_REJECTABLE':
        return 'FINAL_REJECT'
    elif decision == 'BORDERLINE_REQUIRES_THIRD_PARTY':
        return 'THIRD_PARTY_EVAL'
    elif decision == 'APPETITE_UNCLEAR':
        if reflections >= 2:
            return 'LOGISTIC_EVAL'  # fallback path
        return 'REFLECTION_NODE'
    elif decision == 'CLEARLY_ACCEPTABLE':
        return 'CONCERNING_DETAILS_CHECK'
    else:
        return 'HUMAN_REVIEW'



flow.add_conditional_edges('GUIDELINES_EVAL', guidelines_router)

# THIRD_PARTY_EVAL Conditional Router
def third_party_router(state):
    if state.get('final_decision') == 'REJECT':
        return 'FINAL_REJECT'
    elif state.get('final_decision') == 'REQUIRES_HUMAN_REVIEW':
        return 'HUMAN_REVIEW'
    elif state.get('decision') == 'CLEAR_FOR_LOGISTIC':
        return 'LOGISTIC_EVAL'
    else:
        return 'HUMAN_REVIEW'


flow.add_conditional_edges('THIRD_PARTY_EVAL', third_party_router)

# REFLECTION_NODE loops back to GUIDELINES_EVAL
flow.add_edge('REFLECTION_NODE', 'GUIDELINES_EVAL')

# CONCERNING_DETAILS_CHECK Conditional Router
def concerns_router(state):
    if state.get('decision') == 'REFLECT_CONCERNS':
        return 'REFLECT_CONCERNS'
    else:
        return 'LOGISTIC_EVAL'


flow.add_conditional_edges('CONCERNING_DETAILS_CHECK', concerns_router)

# REFLECT_CONCERNS Conditional Router
def reflect_concerns_router(state):
    if state.get('decision') == 'CLEAR_FOR_LOGISTIC':
        return 'LOGISTIC_EVAL'
    elif state.get('final_decision') == 'REJECT':
        return 'FINAL_REJECT'
    else:
        return 'HUMAN_REVIEW'



flow.add_conditional_edges('REFLECT_CONCERNS', reflect_concerns_router)

# LOGISTIC_EVAL Conditional Router (final decision)
def logistic_router(state):
    decision = state.get('final_decision', '').upper()
    if decision == 'ACCEPT':
        return 'ACCEPT'  # routes to the ACCEPT node registered below
    elif decision == 'REJECT':
        return 'FINAL_REJECT'
    else:  # REFER or ambiguous outcomes
        return 'HUMAN_REVIEW'

# Add explicit ACCEPT node for clarity
def accept_node(state):
    state['reason'] = state.get('logistic_reason', 'Accepted per logistic evaluation.')
    return state

flow.add_node('ACCEPT', accept_node)

flow.add_conditional_edges('LOGISTIC_EVAL', logistic_router)



app = flow.compile()
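
The final banding step in `logistic_eval_node` can be read as a small threshold function. The sketch below reuses the 1.5 and 2.0 cut-offs from the code above and a few factor values that appear in the results table later in the notebook; `band_decision` is a hypothetical name, and its NaN handling is an assumption that differs from the notebook.

```python
import math

def band_decision(capped_factor: float) -> str:
    """Map a capped rating factor to ACCEPT / REFER / REJECT using the 1.5 and 2.0 cut-offs."""
    if math.isnan(capped_factor):
        # Assumption for this sketch: an uncomputable score is referred,
        # whereas the notebook's comparison chain sends NaN to REJECT.
        return "REFER"
    if capped_factor < 1.5:
        return "ACCEPT"
    if capped_factor < 2.0:
        return "REFER"
    return "REJECT"

for factor in (1.384, 1.57, 1.876, 2.5, float("nan")):
    print(factor, band_decision(factor))
```

In the compiled graph a REFER outcome is routed to `HUMAN_REVIEW`, so it surfaces as `REQUIRES_HUMAN_REVIEW` in the final state.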


Access data

In [None]:
import requests
from tqdm import tqdm
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report


# 1. Locate the application files in the public GitHub folder (served via raw.githubusercontent.com)
base_url = "https://raw.githubusercontent.com/drbob-richardson/Actuarial_Agentic_AI/main/bop_agentic_rag/Application_Data_Generation/Application_Data/ToAccept/"

# The folder contains a filelist.txt index with one filename per line
filelist_url = base_url + "filelist.txt"

response = requests.get(filelist_url)
response.raise_for_status()

# Get the list of filenames
filenames = [line.strip() for line in response.text.splitlines() if line.strip()]
print(f"Loaded {len(filenames)} filenames from GitHub.")



Loaded 127 filenames from GitHub.
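
Several filenames contain spaces, so they need to be percent-encoded before being appended to the base URL. The sketch below shows the idea with the standard library's `urllib.parse.quote`, which is assumed here to behave the same as the `requests.utils.quote` call used in the next cell. The base URL is a placeholder, not the notebook's.

```python
from urllib.parse import quote

# Placeholder base URL for illustration only.
base = "https://example.com/ToAccept/"
fname = "Delicatessens _ Sandwich Shops_84718d.json"

# safe="()" leaves parentheses untouched while spaces become %20.
url = base + quote(fname, safe="()")
print(url)
```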


Run the data through the pipeline


In [None]:
filenames = filenames[51:60]  # evaluate a 9-file slice of the list; remove this line to run every file

def get_embedding(text: str, model: str = "text-embedding-ada-002") -> list:
    response = openai.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

def cosine_similarity(vec1: list, vec2: list) -> float:
    a = np.array(vec1)
    b = np.array(vec2)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


results = []
for fname in tqdm(filenames):
    url = base_url + requests.utils.quote(fname, safe="()")
    try:
        response = requests.get(url)
        response.raise_for_status()
        data = response.json()

        app_data = data.get("Application Data", {})
        third_party = data.get("Third-Party Data", {})
        expected_decision = data.get("Final Decision", "").strip().upper()
        expected_reason = data.get("Final Reason", "")

        result = app.invoke({"application": app_data, "third_party_data": third_party})

        predicted = result.get("final_decision", "MISSING").strip().upper()
        reason = result.get("reason", "")
        print("Expected decision values:")
        print(expected_decision)

        print("\nPredicted decision values:")
        print(predicted)

        results.append({
            "file": fname,
            "expected_decision": expected_decision,
            "predicted_decision": predicted,
            "expected_reason": expected_reason,
            "predicted_reason": reason,
        })

    except Exception as e:
        results.append({
            "file": fname,
            "expected_decision": "ERROR",
            "predicted_decision": "ERROR",
            "expected_reason": "",
            "predicted_reason": str(e),
        })

# 2. Evaluation
df = pd.DataFrame(results)
# .copy() so that adding the similarity column later does not write to a view of df
valid = df[~df["expected_decision"].isin(["ERROR", ""]) & ~df["predicted_decision"].isin(["ERROR", "MISSING"])].copy()

# Accuracy and classification metrics
y_true = valid["expected_decision"]
y_pred = valid["predicted_decision"]
print("\n=== Decision Evaluation ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(confusion_matrix(y_true, y_pred, labels=["ACCEPT", "REFER", "REJECT", "REQUIRES_HUMAN_REVIEW"]))
print(classification_report(y_true, y_pred, zero_division=0))  # classes with no support report 0 without warnings

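# 3. Semantic similarity of reasons (optional, uses OpenAI embeddings)
def safe_similarity(a, b):
    try:
        # The keyword must match get_embedding's signature, which takes `model`
        emb1 = get_embedding(a, model="text-embedding-ada-002")
        emb2 = get_embedding(b, model="text-embedding-ada-002")
        return cosine_similarity(emb1, emb2)
    except Exception as e:
        # Report the failure instead of silently returning None
        print(f"[safe_similarity error] {e}")
        return None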

print("\n=== Reason Similarity Evaluation ===")
valid["reason_similarity"] = valid.apply(lambda row: safe_similarity(row["expected_reason"], row["predicted_reason"]), axis=1)
print(f"Average semantic similarity of reasons: {valid['reason_similarity'].dropna().mean():.3f}")

# Save results to CSV
valid.to_csv("decision_eval_results.csv", index=False)
print("\nEvaluation results saved to 'decision_eval_results.csv'")


 11%|█         | 1/9 [00:16<02:12, 16.54s/it]

Expected decision values:
ACCEPT

Predicted decision values:
ACCEPT


 22%|██▏       | 2/9 [00:39<02:21, 20.26s/it]

Expected decision values:
ACCEPT

Predicted decision values:
REQUIRES_HUMAN_REVIEW


 33%|███▎      | 3/9 [00:52<01:42, 17.00s/it]

Expected decision values:
ACCEPT

Predicted decision values:
REJECT


 44%|████▍     | 4/9 [01:04<01:14, 14.97s/it]

Expected decision values:
ACCEPT

Predicted decision values:
REQUIRES_HUMAN_REVIEW


 56%|█████▌    | 5/9 [01:17<00:57, 14.26s/it]

Expected decision values:
ACCEPT

Predicted decision values:
ACCEPT


 67%|██████▋   | 6/9 [01:29<00:40, 13.45s/it]

Expected decision values:
ACCEPT

Predicted decision values:
ACCEPT


 78%|███████▊  | 7/9 [01:39<00:24, 12.45s/it]

Expected decision values:
ACCEPT

Predicted decision values:
ACCEPT


 89%|████████▉ | 8/9 [01:49<00:11, 11.51s/it]

Expected decision values:
ACCEPT

Predicted decision values:
ACCEPT


100%|██████████| 9/9 [02:03<00:00, 13.77s/it]

Expected decision values:
ACCEPT

Predicted decision values:
ACCEPT

=== Decision Evaluation ===
Accuracy: 0.667
[[6 0 1 2]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
                       precision    recall  f1-score   support

               ACCEPT       1.00      0.67      0.80         9
               REJECT       0.00      0.00      0.00         0
REQUIRES_HUMAN_REVIEW       0.00      0.00      0.00         0

             accuracy                           0.67         9
            macro avg       0.33      0.22      0.27         9
         weighted avg       1.00      0.67      0.80         9


=== Reason Similarity Evaluation ===
Average semantic similarity of reasons: nan

Evaluation results saved to 'decision_eval_results.csv'
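
The headline numbers in the report can be reproduced by hand from the first row of the confusion matrix. The short sketch below does the arithmetic with the counts printed above; the variable names are illustrative.

```python
# Counts from the ACCEPT row of the confusion matrix above: 9 files expected
# ACCEPT; predicted ACCEPT 6, REJECT 1, REQUIRES_HUMAN_REVIEW 2.
true_accept_pred_accept = 6
true_accept_pred_other = 1 + 2
pred_accept_total = 6  # no non-ACCEPT file exists, so none was predicted ACCEPT

accuracy = true_accept_pred_accept / 9
precision = true_accept_pred_accept / pred_accept_total
recall = true_accept_pred_accept / (true_accept_pred_accept + true_accept_pred_other)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because every file in this slice comes from the `ToAccept` folder, accuracy and ACCEPT recall are the same quantity here, and the REJECT and REQUIRES_HUMAN_REVIEW rows have zero support.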





In [None]:
df

|   | file | expected_decision | predicted_decision | expected_reason | predicted_reason |
|---|------|-------------------|--------------------|-----------------|------------------|
| 0 | `Delicatessens _ Sandwich Shops_84718d.json` | ACCEPT | ACCEPT | The application aligns with underwriting guide... | Logistic factor capped: 1.384 |
| 1 | `Department _ Discount Stores_ef2690.json` | ACCEPT | REQUIRES_HUMAN_REVIEW | The application aligns with underwriting guide... | Logistic factor capped: 1.876 |
| 2 | `Dessert Shops_030e06.json` | ACCEPT | REJECT | The application meets all underwriting guideli... | Rejected due to guideline violation: YES – The... |
| 3 | `Detective _ Security Services_840160.json` | ACCEPT | REQUIRES_HUMAN_REVIEW | The application aligns with the underwriting g... | Logistic factor capped: 1.57 |
| 4 | `Diaper _ Linen Services_610616.json` | ACCEPT | ACCEPT | The application meets all underwriting guideli... | Logistic factor capped: 1.347 |
| 5 | `Door _ Window Installation_Sales_09bad5.json` | ACCEPT | ACCEPT | The application aligns with the underwriting g... | Logistic factor capped: 1.376 |
| 6 | `Dry Cleaning _ Laundry Services_d6d47b.json` | ACCEPT | ACCEPT | The application aligns with underwriting guide... | Logistic factor capped: 1.18 |
| 7 | `Educational _ School Supply Stores_ebbb0b.json` | ACCEPT | ACCEPT | The application meets all underwriting guideli... | Logistic factor capped: 0.983 |
| 8 | `Electrical Equipment _ Supplies_3f501f.json` | ACCEPT | ACCEPT | The application meets all underwriting guideli... | Logistic factor capped: 1.073 |


In [None]:
df.iloc[:,3:5].values


array([["The application is rejected due to a prior theft claim of $2,500, which raises concerns about the business's risk profile and security measures, violating underwriting exclusions for this business type.",
        'Logistic factor capped: 0.705'],
       ['The applicant has uncorrected fire code violations, which violates the underwriting guidelines for this business type.',
        'Logistic factor capped: nan'],
       ['The business has uncorrected fire code violations, which raises significant liability concerns, violating underwriting guidelines.',
        'Logistic factor capped: 2.5'],
       ['The application is rejected due to uncorrected fire code violations, which pose a significant safety risk to the children in care, violating underwriting guidelines.',
        'CLEARLY_REJECTABLE - THE APPLICATION INDICATES UNCORRECTED FIRE CODE VIOLATIONS, WHICH POSE A SIGNIFICANT SAFETY RISK IN A CHILD CARE SETTING AND ARE EXPLICITLY NOTED AS A MAJOR REASON FOR REJECTION. ADDITI