### **Project: Credit Risk Scorecard AI Agent**

**Repositories: ing-bank/skorecard, guillermo-navas-palencia/optbinning, levist7/Credit_Risk_Modelling**

#### **Day 1: Data Ingestion for Credit Risk Scorecard Docs**

**Imports and Setup**

In [1]:
import io
import json
import zipfile
import requests
import frontmatter

**Testing How Frontmatter Works**

In [2]:
example = """---
title: "Getting Started with AI"
author: "John Doe"
tags: ["ai", "machine-learning"]
---
# Hello World
This is the content of the document.
"""

post = frontmatter.loads(example)
print("Metadata:", dict(post))
print("Content:", post.content[:50])

Metadata: {'title': 'Getting Started with AI', 'author': 'John Doe', 'tags': ['ai', 'machine-learning']}
Content: # Hello World
This is the content of the document.


**Downloading the Repo as a Zip**

In [3]:
repos = [
    ('ing-bank', 'skorecard', 'main'),
    ('guillermo-navas-palencia', 'optbinning', 'master'),
    ('levist7', 'Credit_Risk_Modelling', 'main'),
]

for owner, name, branch in repos:
    url = f'https://codeload.github.com/{owner}/{name}/zip/refs/heads/{branch}'
    resp = requests.get(url)
    print(f"{owner}/{name}: Status {resp.status_code}")

ing-bank/skorecard: Status 200
guillermo-navas-palencia/optbinning: Status 200
levist7/Credit_Risk_Modelling: Status 200


**Processing the Zip in Memory**

In [4]:
repository_data = []

url = 'https://codeload.github.com/ing-bank/skorecard/zip/refs/heads/main'
resp = requests.get(url)
zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    if filename.endswith('.md') or filename.endswith('.mdx'):
        with zf.open(file_info) as f_in:
            content = f_in.read()
            post = frontmatter.loads(content)
            data = post.to_dict()
            _, filename_repo = file_info.filename.split('/', maxsplit=1)
            data['filename'] = filename_repo
            repository_data.append(data)

    elif filename.endswith('.ipynb'):
        with zf.open(file_info) as f_in:
            nb = json.loads(f_in.read())
            text = ''
            for cell in nb.get('cells', []):
                source = ''.join(cell.get('source', []))
                text += source + '\n\n'
            _, filename_repo = file_info.filename.split('/', maxsplit=1)
            data = {'content': text, 'filename': filename_repo}
            repository_data.append(data)

    elif filename.endswith('.rst'):
        with zf.open(file_info) as f_in:
            content = f_in.read().decode('utf-8', errors='ignore')
            _, filename_repo = file_info.filename.split('/', maxsplit=1)
            data = {'content': content, 'filename': filename_repo}
            repository_data.append(data)

zf.close()

print(f"Documents loaded: {len(repository_data)}")

Documents loaded: 44
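
The loop above drops the first path component because GitHub zip archives wrap every file in a top-level folder (typically `<repo>-<branch>`). A minimal sketch of that step on a synthetic archive, so it runs without network access; the `demo-main` folder and file names are made up for illustration:

```python
import io
import zipfile

# Build a small in-memory archive shaped like a GitHub download (hypothetical contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf_out:
    zf_out.writestr("demo-main/README.md", "# Demo\n")
    zf_out.writestr("demo-main/docs/guide.md", "## Guide\n")
    zf_out.writestr("demo-main/setup.py", "print('hi')\n")

found = {}
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as zf_in:
    for info in zf_in.infolist():
        # Keep only markdown files, as the notebook does for .md/.mdx
        if not info.filename.lower().endswith(".md"):
            continue
        # Strip the top-level folder so paths are relative to the repo root
        _, relative = info.filename.split("/", maxsplit=1)
        found[relative] = zf_in.read(info).decode("utf-8")

print(sorted(found))
```

Opening the archive with `with` closes it even if parsing fails partway through, which the explicit `zf.close()` call does not guarantee.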


**Checking What We Got**

In [5]:
print(repository_data[1])

{'content': '# Contributing guidelines\n\nMake sure to discuss any changes you would like to make in the issue board, before putting in any work.\n\n## Setup\n\nDevelopment install:\n\n```shell\npip install -e \'.[all]\'\n```\n\nUnit testing:\n\n```shell\npytest\n```\n\nWe use [pre-commit](https://pre-commit.com/) hooks to ensure code styling. Install with:\n\n```shell\npre-commit install\n```\n\nNow if you install it (which you are encouraged to do), you are encouraged to do the following command before committing your work:\n\n```shell\npre-commit run --all-files\n```\n\nThis will allow you to quickly see if the work you made contains some adaptions that you still might need to make before a pull request is accepted.\n\n## Documentation\n\nWe use [mkdocs](https://www.mkdocs.org) with [mkdocs-material](https://squidfunk.github.io/mkdocs-material/) theme. The docs are structured using the [divio documentation system](https://documentation.divio.com/). To view the docs locally:\n\n```sh

**Building a Reusable Function**

In [6]:
def read_repo_data(repo_owner, repo_name, branch='main'):
    prefix = 'https://codeload.github.com'
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/{branch}'
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()  # fail early instead of trying to unzip an error page

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))

    for file_info in zf.infolist():
        filename = file_info.filename.lower()

        if filename.endswith('.md') or filename.endswith('.mdx'):
            with zf.open(file_info) as f_in:
                content = f_in.read()
                post = frontmatter.loads(content)
                data = post.to_dict()
                _, filename_repo = file_info.filename.split('/', maxsplit=1)
                data['filename'] = filename_repo
                repository_data.append(data)

        elif filename.endswith('.ipynb'):
            with zf.open(file_info) as f_in:
                nb = json.loads(f_in.read())
                text = ''
                for cell in nb.get('cells', []):
                    source = ''.join(cell.get('source', []))
                    text += source + '\n\n'
                _, filename_repo = file_info.filename.split('/', maxsplit=1)
                data = {'content': text, 'filename': filename_repo}
                repository_data.append(data)

        elif filename.endswith('.rst'):
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                _, filename_repo = file_info.filename.split('/', maxsplit=1)
                data = {'content': content, 'filename': filename_repo}
                repository_data.append(data)

    zf.close()
    return repository_data

**Pulling Data from All Three Repos**

In [7]:
skorecard_docs = read_repo_data('ing-bank', 'skorecard')
optbinning_docs = read_repo_data('guillermo-navas-palencia', 'optbinning', branch='master')
credit_risk_docs = read_repo_data('levist7', 'Credit_Risk_Modelling')

print(f"skorecard documents: {len(skorecard_docs)}")
print(f"optbinning documents: {len(optbinning_docs)}")
print(f"Credit Risk Modelling documents: {len(credit_risk_docs)}")

all_docs = skorecard_docs + optbinning_docs + credit_risk_docs
print(f"\nTotal documents: {len(all_docs)}")

skorecard documents: 44
optbinning documents: 44
Credit Risk Modelling documents: 5

Total documents: 93


**Checking Document Sizes**

In [8]:
for doc in all_docs[:15]:
    title = doc.get('description', doc['filename'])[:60]
    print(f"{title:.<65} {len(doc['content']):>6} chars")

CHANGELOG.md.....................................................    607 chars
CONTRIBUTING.md..................................................   2044 chars
README.md........................................................   3780 chars
docs/api/bucketers/AgglomerativeClusteringBucketer.md............     55 chars
docs/api/bucketers/AsIsCategoricalBucketer.md....................     47 chars
docs/api/bucketers/AsIsNumericalBucketer.md......................     45 chars
docs/api/bucketers/DecisionTreeBucketer.md.......................     44 chars
docs/api/bucketers/EqualFrequencyBucketer.md.....................     46 chars
docs/api/bucketers/EqualWidthBucketer.md.........................     42 chars
docs/api/bucketers/OptimalBucketer.md............................     39 chars
docs/api/bucketers/OrdinalCategoricalBucketer.md.................     50 chars
docs/api/bucketers/UserInputBucketer.md..........................     41 chars
docs/api/datasets/load_uci_credit_card.md...........

#### **Day 2: Chunking and Intelligent Processing for Data**

**Sliding Window Function**

In [9]:
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result
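
To see the overlap concretely, here is a small sketch on a ten-character string. The function is a compact restatement of the one above so the cell runs on its own; the toy input and parameters are chosen only for illustration:

```python
def sliding_window(seq, size, step):
    # Same logic as the notebook's function, repeated here so the example is standalone
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    result = []
    for i in range(0, len(seq), step):
        result.append({'start': i, 'chunk': seq[i:i + size]})
        if i + size >= len(seq):
            break  # the current window already reaches the end
    return result

toy = "abcdefghij"
windows = sliding_window(toy, size=4, step=2)
for w in windows:
    print(w['start'], w['chunk'])
```

With `step` smaller than `size`, each window repeats the tail of the previous one, which is what keeps a sentence that straddles a boundary intact in at least one chunk.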

**Test Sliding Window on One Document**

In [10]:
text = all_docs[1]['content']
chunks = sliding_window(text, 2000, 1000)
print(f"Generated {len(chunks)} chunks")

Generated 2 chunks


**Apply Sliding Window to All Documents**

In [11]:
all_chunks = []

for doc in all_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    all_chunks.extend(chunks)

print(f"Total chunks: {len(all_chunks)}")

Total chunks: 399


**Section Splitting Function**

In [12]:
import re

def split_markdown_by_level(text, level=2):
    """
    Split markdown text by a specific header level.
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        header = parts[i] + parts[i+1]
        header = header.strip()

        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections
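
Two properties of this approach are easy to miss: text before the first matching header is not returned, and a document with no headers at the chosen level yields an empty list. A short sketch of what `pattern.split` produces, using a made-up sample document:

```python
import re

# Same pattern the function builds for level=2
pattern = re.compile(r'^(#{2} )(.+)$', re.MULTILINE)

sample = "# Title\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it.\n"
parts = pattern.split(sample)

# parts[0] is the text before the first "## " header; the loop in the function skips it
prelude = parts[0]
headers = [parts[i] + parts[i + 1] for i in range(1, len(parts), 3)]

# A document without any "## " header is returned as a single part, so no sections result
no_header_parts = pattern.split("plain text with no headers")

print(headers)
print(len(no_header_parts))
```

If such documents need to stay searchable, one option is to fall back to the sliding window when the section list comes back empty.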

**Test Section Splitting on One Document**

In [13]:
text = all_docs[1]['content']
sections = split_markdown_by_level(text, level=2)
print(f"Sections: {len(sections)}")

Sections: 5


**Apply Section Splitting to All Documents**

In [14]:
all_chunks = []

for doc in all_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        all_chunks.append(section_doc)

print(f"Sections: {len(all_chunks)}")

Sections: 122


**Load API Key**

In [15]:
import os
from dotenv import load_dotenv
load_dotenv('.env', override=True)
print("Key loaded:", "OPENAI_API_KEY" in os.environ)

Key loaded: True


In [16]:
print("Key loaded:", "OPENAI_API_KEY" in os.environ)
print("Key ends with:", os.environ.get('OPENAI_API_KEY', '')[-4:])

Key loaded: True
Key ends with: 3IEA


**OpenAI Client and LLM Helper**

In [18]:
from openai import OpenAI

openai_client = OpenAI()

def llm(prompt, model='gpt-4o-mini'):
    messages = [
        {"role": "user", "content": prompt}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages
    )

    return response.choices[0].message.content

**Prompt Template**

In [19]:
prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.

Each section should be self-contained and cover
a specific topic or concept.

<DOCUMENT>
{document}
</DOCUMENT>

Use this format:

## Section Name

Section content with all relevant details

---

## Another Section Name

Another section content

---
""".strip()

**Intelligent Chunking Function**

In [20]:
def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    response = llm(prompt)
    sections = response.split('---')
    sections = [s.strip() for s in sections if s.strip()]
    return sections
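
Splitting on the bare string `'---'` also cuts wherever those characters appear inside a section, for example in a markdown table separator row. A hedged alternative, assuming the model puts each separator on a line of its own as the prompt requests; `split_sections` is a hypothetical helper, not part of the notebook:

```python
import re

def split_sections(response):
    # Split only on lines that consist of '---' (optionally padded with spaces or tabs)
    parts = re.split(r'^[ \t]*---[ \t]*$', response, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

# Made-up model response containing a table with a '---' separator row
response = (
    "## Bins\n\n| a | b |\n| --- | --- |\n| 1 | 2 |\n\n---\n\n"
    "## Scores\n\nText\n\n---"
)

line_based = split_sections(response)
naive = [s.strip() for s in response.split('---') if s.strip()]

print(len(line_based), len(naive))
```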

**Apply Intelligent Chunking to All Documents**

In [21]:
from tqdm.auto import tqdm

all_chunks = []

for doc in tqdm(all_docs):
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        all_chunks.append(section_doc)

print(f"LLM chunks: {len(all_chunks)}")

  0%|          | 0/93 [00:00<?, ?it/s]

LLM chunks: 675


**Compare All 3 Chunking Methods**

In [22]:
doc = all_docs[1]
text = doc['content']

print(f"Original document: {len(text)} chars")
print(f"Title: {doc.get('description', doc['filename'])}")
print("=" * 60)

Original document: 2044 chars
Title: CONTRIBUTING.md


In [23]:
# Method 1: Sliding window
sw_chunks = sliding_window(text, 2000, 1000)
print(f"\n1. SLIDING WINDOW: {len(sw_chunks)} chunks")
for i, c in enumerate(sw_chunks[:3]):
    print(f"   Chunk {i+1} ({len(c['chunk'])} chars): {c['chunk'][:80]}...")


1. SLIDING WINDOW: 2 chunks
   Chunk 1 (2000 chars): # Contributing guidelines

Make sure to discuss any changes you would like to ma...
   Chunk 2 (1044 chars): # Releases and versioning

We use [semver](https://semver.org/) for versioning. ...


In [24]:
# Method 2: Section splitting
sec_chunks = split_markdown_by_level(text, level=2)
print(f"\n2. SECTION SPLITTING: {len(sec_chunks)} sections")
for i, s in enumerate(sec_chunks[:3]):
    first_line = s.split('\n')[0]
    print(f"   Section {i+1} ({len(s)} chars): {first_line}")


2. SECTION SPLITTING: 5 sections
   Section 1 (551 chars): ## Setup
   Section 2 (312 chars): ## Documentation
   Section 3 (449 chars): ## Releases and versioning


In [25]:
# Method 3: LLM chunking (only if API key works)
# Uncomment below after fixing your API key:
# llm_sec = intelligent_chunking(text)
# print(f"\n3. LLM CHUNKING: {len(llm_sec)} sections")
# for i, s in enumerate(llm_sec[:3]):
#     first_line = s.split('\n')[0]
#     print(f"   Section {i+1} ({len(s)} chars): {first_line}")


In [26]:
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
print("""
Sliding window cuts text into fixed-size pieces with overlap.
Good for unstructured text, but can split topics mid-sentence.

Section splitting uses ## headers as natural boundaries.
Each chunk covers one complete topic. Best for well-structured docs.

LLM chunking lets AI decide where to split. Most accurate but
costs money and is slow. Only needed for messy documents.

For this project, SECTION SPLITTING is the best choice because
the credit risk scorecard docs are well-structured with ## headers
and each section covers one scorecard concept.
""")


SUMMARY

Sliding window cuts text into fixed-size pieces with overlap.
Good for unstructured text, but can split topics mid-sentence.

Section splitting uses ## headers as natural boundaries.
Each chunk covers one complete topic. Best for well-structured docs.

LLM chunking lets AI decide where to split. Most accurate but
costs money and is slow. Only needed for messy documents.

For this project, SECTION SPLITTING is the best choice because
the credit risk scorecard docs are well-structured with ## headers
and each section covers one scorecard concept.



#### **Day 3: Add Search**

**Prepare Chunks for Search (if restarting kernel)**

In [None]:
skorecard_docs = read_repo_data('ing-bank', 'skorecard')
optbinning_docs = read_repo_data('guillermo-navas-palencia', 'optbinning', branch='master')
credit_risk_docs = read_repo_data('levist7', 'Credit_Risk_Modelling')

all_docs = skorecard_docs + optbinning_docs + credit_risk_docs

all_chunks = []

for doc in all_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    all_chunks.extend(chunks)

**Text Search: Index the Chunks**

In [None]:
from minsearch import Index

index = Index(
    text_fields=["chunk", "description", "filename"],
    keyword_fields=[]
)

index.fit(all_chunks)

**Text Search: Test It**

In [None]:
query = 'How do I create WoE bins for a credit feature?'
results = index.search(query)
print(results[0])

**Vector Search: Load the Embedding Model**

In [40]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

**Vector Search: Test Embeddings and Similarity**

In [None]:
record = all_chunks[2]
text = record['chunk']
v_doc = embedding_model.encode(text)

query = 'What IV threshold should I use for feature selection?'
v_query = embedding_model.encode(query)

similarity = v_query.dot(v_doc)
print(f"Similarity: {similarity}")
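
The dot product serves as the similarity score here on the assumption that the model returns unit-length vectors, as the `-cos-` in its name suggests. For unit vectors the dot product equals cosine similarity; for unnormalized vectors it also grows with vector length. A small pure-Python sketch with made-up 2-D vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale the vector to length 1
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                          # depends on vector length
unit_dot = dot(normalize(a), normalize(b))   # dot product after normalization
cos = cosine(a, b)                           # cosine similarity of the originals

print(raw_dot, unit_dot, cos)
```

If a different embedding model is swapped in, it is worth checking whether its vectors are normalized before relying on the plain dot product.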

**Vector Search: Create Embeddings for All Chunks**

In [None]:
from tqdm.auto import tqdm
import numpy as np

all_embeddings = []

for d in tqdm(all_chunks):
    v = embedding_model.encode(d['chunk'])
    all_embeddings.append(v)

all_embeddings = np.array(all_embeddings)

**Vector Search: Build the Index**

In [None]:
from minsearch import VectorSearch

credit_risk_vindex = VectorSearch()
credit_risk_vindex.fit(all_embeddings, all_chunks)

**Vector Search: Test It**

In [None]:
query = 'How do I build a scorecard from logistic regression?'
q = embedding_model.encode(query)
results = credit_risk_vindex.search(q)
print(results[0])

**Hybrid Search: Combine Both**

In [None]:
query = 'How do I calculate PSI for model monitoring?'

text_results = index.search(query, num_results=5)

q = embedding_model.encode(query)
vector_results = credit_risk_vindex.search(q, num_results=5)

final_results = text_results + vector_results
print(f"Total results: {len(final_results)}")

**Reusable Search Functions**

In [None]:
def text_search(query):
    return index.search(query, num_results=5)

def vector_search(query):
    q = embedding_model.encode(query)
    return credit_risk_vindex.search(q, num_results=5)

def hybrid_search(query):
    text_results = text_search(query)
    vector_results = vector_search(query)
    
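    # Combine and deduplicate results.
    # Several chunks share a filename, so the key also includes the chunk's start offset.
    seen_ids = set()
    combined_results = []

    for result in text_results + vector_results:
        chunk_id = (result['filename'], result.get('start'))
        if chunk_id not in seen_ids:
            seen_ids.add(chunk_id)
            combined_results.append(result)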
    
    return combined_results
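
Because sliding-window chunks from one file all share the same `filename`, a deduplication key needs something more specific than the filename to keep distinct chunks apart. A minimal sketch, assuming each result carries the `start` offset that `sliding_window` adds; `dedupe_results` and the sample hits are hypothetical:

```python
def dedupe_results(results):
    """Keep the first occurrence of each (filename, start) pair, preserving order."""
    seen = set()
    unique = []
    for result in results:
        key = (result["filename"], result.get("start"))
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique

# Made-up hits: both searches return the chunk at offset 1000 of docs/a.md
text_hits = [
    {"filename": "docs/a.md", "start": 0},
    {"filename": "docs/a.md", "start": 1000},
]
vector_hits = [
    {"filename": "docs/a.md", "start": 1000},
    {"filename": "docs/b.md", "start": 0},
]

merged = dedupe_results(text_hits + vector_hits)
print(len(merged))
```

Keyed on `filename` alone, the same input would collapse to two results and silently drop the second chunk of `docs/a.md`.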

**Test All Three Search Functions**

In [None]:
query = 'What IV threshold should I use for feature selection?'

print("Text search:")
print(text_search(query)[0])

print("\nVector search:")
print(vector_search(query)[0])

print("\nHybrid search:")
print(hybrid_search(query)[0])

#### **Day 4: Agents and Tools**

**LLM Without Tools**

In [None]:
import openai

openai_client = openai.OpenAI()

user_prompt = "How do I create WoE bins for a credit feature?"

chat_messages = [
    {"role": "user", "content": user_prompt}
]

response = openai_client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
)

print(response.output_text)

**Define the Search Function for OpenAI**

In [55]:
def text_search(query):
    return index.search(query, num_results=5)

**Describe the Function for OpenAI**

In [None]:
text_search_tool = {
    "type": "function",
    "name": "text_search",
    "description": "Search the credit risk scorecard knowledge base",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query text to look up in the credit risk docs."
            }
        },
        "required": ["query"],
        "additionalProperties": False
    }
}

**Send Question with Tool Available**

In [None]:
system_prompt = """
You are a helpful assistant for credit risk scorecard development. 
"""

question = "How do I create WoE bins for a credit feature?"

chat_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question}
]

response = openai_client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=[text_search_tool]
)

**Inspect the Tool Call**

In [65]:
print(response.output)



In [None]:
system_prompt = """
You are a helpful assistant for credit risk scorecard development. 
Always use the search tool before answering any question. Never answer without searching first.
"""

question = "How do I create WoE bins for a credit feature?"

chat_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question}
]

response = openai_client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=[text_search_tool]
)

print(response.output)

**Execute Tool Call and Send Results Back**

In [69]:
import json

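# Pick the function call explicitly; the output list can also hold message items
call = next(item for item in response.output if item.type == "function_call")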

arguments = json.loads(call.arguments)
result = text_search(**arguments)

call_output = {
    "type": "function_call_output",
    "call_id": call.call_id,
    "output": json.dumps(result),
}

chat_messages.append(call)
chat_messages.append(call_output)

response = openai_client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=[text_search_tool]
)

print(response.output_text)

To detect data drift in your credit scoring model, consider the following steps:

1. **Monitor Data Drift**: Track changes in the input feature distributions. This involves comparing the statistical properties of the current input data to those of a baseline dataset (the training data, for example).

2. **Output Drift Detection**: Observe shifts in model predictions over time. If the model's predictions change significantly without corresponding changes in the input data, it may indicate issues with model performance.

3. **Statistical Methods**: Use statistical tests or distance metrics to assess distribution drift. Common methods include:
   - Kolmogorov-Smirnov tests for univariate features.
   - Chi-squared tests for categorical variables.
   - Kullback-Leibler divergence for comparing distributions.

4. **Set Thresholds**: Define what constitutes significant drift using acceptable thresholds. These could be based on confidence levels or specific numeric thresholds.

5. **Segmentat
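
A response can contain several output items, and more than one of them may be a tool call. A minimal sketch of executing every `function_call` item and building the matching `function_call_output` entries. It uses stand-in objects instead of a live API response; `run_tool_calls`, `fake_search`, and the sample items are assumptions for illustration:

```python
import json
from types import SimpleNamespace

def run_tool_calls(output_items, registry):
    """Run each function_call item and return the outputs to append to the chat."""
    outputs = []
    for item in output_items:
        if getattr(item, "type", None) != "function_call":
            continue  # skip plain messages and other item types
        func = registry[item.name]
        kwargs = json.loads(item.arguments)
        outputs.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(func(**kwargs)),
        })
    return outputs

def fake_search(query):
    # Stand-in for text_search so the sketch runs without an index
    return [{"filename": "docs/example.md", "chunk": f"match for {query}"}]

items = [
    SimpleNamespace(type="message"),
    SimpleNamespace(type="function_call", name="text_search",
                    call_id="call_1", arguments='{"query": "WoE bins"}'),
]

outputs = run_tool_calls(items, {"text_search": fake_search})
print(outputs[0]["call_id"])
```

Repeating this step until the response contains no further function calls is the loop that agent frameworks automate.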

**Better System Prompt**

In [None]:
system_prompt = """
You are a helpful assistant for credit risk scorecard development. 

Use the search tool to find relevant information from the credit risk and scorecard materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""

**Encouraging Multiple Searches**

In [None]:
system_prompt = """
You are a helpful assistant for credit risk scorecard development. 

Always search for relevant information before answering. 
If the first search doesn't give you enough information, try different search terms.

Make multiple searches if needed to provide comprehensive answers.
"""

**Pydantic AI: Redefine Search with Type Hints**

In [None]:
from typing import List, Any

def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the credit risk scorecard index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the index.
    """
    return index.search(query, num_results=5)

**Create the Agent**

In [None]:
from pydantic_ai import Agent

agent = Agent(
    name="credit_risk_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)

**Run the Agent**

In [None]:
question = "How do I create WoE bins for a credit feature?"

result = await agent.run(user_prompt=question)
print(result.output)

**Look Inside the Agent's Reasoning**

In [None]:
result.new_messages()

**Update System Prompt with Citations**

In [None]:
system_prompt = """
You are a helpful assistant for credit risk scorecard development.  

Use the search tool to find relevant information from the credit risk and scorecard materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, use the full path to the GitHub repository for the relevant repo:
- skorecard docs: "https://github.com/ing-bank/skorecard/blob/main/"
- optbinning docs: "https://github.com/guillermo-navas-palencia/optbinning/blob/master/"
- Credit Risk Modelling docs: "https://github.com/levist7/Credit_Risk_Modelling/blob/main/"
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.  
""".strip()

agent = Agent(
    name="credit_risk_agent_v2",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)

**Test the Updated Agent**

In [None]:
question = "What IV threshold should I use for feature selection?"

result = await agent.run(user_prompt=question)
print(result.output)

#### **Day 5: Evaluation**

**Recap: Set Up Agent (if kernel restarted)**

In [None]:
from typing import List, Any
from pydantic_ai import Agent


def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the credit risk scorecard index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the index.
    """
    return index.search(query, num_results=5)


system_prompt = """
You are a helpful assistant for credit risk scorecard development.  

Use the search tool to find relevant information from the credit risk and scorecard materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, use the full path to the GitHub repository for the relevant repo:
- skorecard docs: "https://github.com/ing-bank/skorecard/blob/main/"
- optbinning docs: "https://github.com/guillermo-navas-palencia/optbinning/blob/master/"
- Credit Risk Modelling docs: "https://github.com/levist7/Credit_Risk_Modelling/blob/main/"
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.  
""".strip()

agent = Agent(
    name="credit_risk_agent_v2",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)

**Test the Agent**

In [None]:
question = "How do I calculate PSI for model monitoring?"
result = await agent.run(user_prompt=question)
print(result.output)

**Build the Log Entry Extractor**

In [82]:
from pydantic_ai.messages import ModelMessagesTypeAdapter


def log_entry(agent, messages, source="user"):
    tools = []

    for ts in agent.toolsets:
        tools.extend(ts.tools.keys())

    dict_messages = ModelMessagesTypeAdapter.dump_python(messages)

    return {
        "agent_name": agent.name,
        "system_prompt": agent._instructions,
        "provider": agent.model.system,
        "model": agent.model.model_name,
        "tools": tools,
        "messages": dict_messages,
        "source": source
    }

**Write Logs to Files**

In [83]:
import json
import secrets
from pathlib import Path
from datetime import datetime


LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)


def serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    ts = entry['messages'][-1]['timestamp']
    
    if isinstance(ts, str):
        ts_obj = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    else:
        ts_obj = ts
    
    ts_str = ts_obj.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath
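
Two small pieces of this cell are worth seeing in isolation: how an ISO timestamp string becomes the file-name stamp, and how `default=serializer` lets `json.dump` handle `datetime` values. A standalone sketch with a made-up timestamp:

```python
import json
from datetime import datetime

def serializer(obj):
    # Called by json only for objects it cannot encode itself
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

# Hypothetical timestamp string with a trailing "Z", as ISO 8601 UTC values often have
raw = "2025-01-02T03:04:05Z"
ts_obj = datetime.fromisoformat(raw.replace("Z", "+00:00"))

stamp = ts_obj.strftime("%Y%m%d_%H%M%S")
encoded = json.dumps({"timestamp": ts_obj}, default=serializer)

print(stamp)
print(encoded)
```

Replacing `Z` with `+00:00` keeps the parsing compatible with Python versions whose `datetime.fromisoformat` does not accept the `Z` suffix.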

**Interactive Vibe Checking with Logging**

In [84]:
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

 What is history


History is the study of past events, particularly in human affairs. It encompasses various aspects such as political, social, economic, and cultural developments over time. Historians analyze written documents, oral accounts, and other forms of evidence to understand and interpret the events that have shaped societies and civilizations. The study of history helps people to understand how the past influences the present and future, providing context for current affairs and societal trends.


PosixPath('logs/ml_monitoring_agent_v2_20260212_230246_248a7f.json')

**Evaluation Prompt**

In [85]:
evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent's answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met. 

Checklist:

- instructions_follow: The agent followed the user's instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do  
- answer_relevant: The response directly addresses the user's question  
- answer_clear: The answer is clear and correct  
- answer_citations: The response includes proper citations or sources when required  
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: Is the search tool invoked? 

Output true/false for each check and provide a short explanation for your judgment.
""".strip()

**Structured Output Models**

In [86]:
from pydantic import BaseModel

class EvaluationCheck(BaseModel):
    check_name: str
    justification: str
    check_pass: bool

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str

**Create the Evaluation Agent**

In [87]:
eval_agent = Agent(
    name='eval_agent',
    model='gpt-4o-mini',
    instructions=evaluation_prompt,
    output_type=EvaluationChecklist
)

**Evaluation Input Template**

In [88]:
user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

**Load Log Files Helper**

In [89]:
def load_log_file(log_file):
    with open(log_file, 'r', encoding='utf-8') as f_in:
        log_data = json.load(f_in)
        log_data['log_file'] = log_file
        return log_data

**Check What Log Files Exist**

In [90]:
import os
for f in os.listdir('./logs'):
    print(f)

tech_interview_agent_20260122_131205_432929.json
ml_monitoring_agent_v2_20260212_230246_248a7f.json
tech_interview_agent_20260121_202846_16a492.json
tech_interview_agent_20260122_142625_dc4ce6.json


**Run Evaluation on a Single Log**

In [None]:
import os

log_files = sorted(LOG_DIR.glob('*.json'))
if log_files:
    log_record = load_log_file(log_files[0])

    instructions = log_record['system_prompt']
    question = log_record['messages'][0]['parts'][0]['content']
    answer = log_record['messages'][-1]['parts'][0]['content']
    log = json.dumps(log_record['messages'])

    user_prompt = user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )

    result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)

    checklist = result.output
    print(checklist.summary)

    for check in checklist.checklist:
        print(check)
else:
    print("No log files found. Run the agent first to generate logs.")

**Simplify Log Messages to Save Tokens**

In [93]:
def simplify_log_messages(messages):
    log_simplified = []

    for m in messages:
        parts = []
    
        for original_part in m['parts']:
            part = original_part.copy()
            kind = part['part_kind']
    
            if kind == 'user-prompt':
                part.pop('timestamp', None)
            if kind == 'tool-call':
                part.pop('tool_call_id', None)
            if kind == 'tool-return':
                part.pop('tool_call_id', None)
                part.pop('metadata', None)
                part.pop('timestamp', None)
                part['content'] = 'RETURN_RESULTS_REDACTED'
            if kind == 'text':
                part.pop('id', None)
    
            parts.append(part)
    
        message = {
            'kind': m['kind'],
            'parts': parts
        }
    
        log_simplified.append(message)
    return log_simplified

**Combined Evaluation Function**

In [94]:
async def evaluate_log_record(eval_agent, log_record):
    """Build the evaluation prompt from a log record and return the resulting EvaluationChecklist."""
    messages = log_record['messages']

    instructions = log_record['system_prompt']
    question = messages[0]['parts'][0]['content']
    answer = messages[-1]['parts'][0]['content']

    log_simplified = simplify_log_messages(messages)
    log = json.dumps(log_simplified)

    user_prompt = user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )

    result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)
    return result.output
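
`evaluate_log_record` reads the answer from `messages[-1]['parts'][0]`, which assumes the first part of the last message is the text answer. If a final message ever begins with a different part kind, a more defensive lookup may help. The helper below, `extract_final_text`, is hypothetical and not part of this notebook; it assumes each part carries a `part_kind` key, as in `simplify_log_messages`.

```python
def extract_final_text(messages):
    """Return the content of the last 'text' part in the log, or None if there is none."""
    for message in reversed(messages):
        for part in reversed(message['parts']):
            if part.get('part_kind') == 'text':
                return part['content']
    return None

# Hypothetical messages for illustration.
messages = [
    {'kind': 'request', 'parts': [{'part_kind': 'user-prompt', 'content': 'Q?'}]},
    {'kind': 'response', 'parts': [
        {'part_kind': 'tool-call', 'tool_name': 'search', 'args': '{}'},
        {'part_kind': 'text', 'content': 'Final answer.'},
    ]},
]
print(extract_final_text(messages))
```

Returning `None` instead of raising lets the caller decide whether a log without a text answer should be skipped or flagged.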

**Test the Combined Function**

In [None]:
log_files = sorted(LOG_DIR.glob('*.json'))
if log_files:
    log_record = load_log_file(log_files[-1])
    eval1 = await evaluate_log_record(eval_agent, log_record)
else:
    print("No log files found. Run the agent first to generate logs.")

**View the Evaluation Results**

In [96]:
print(eval1.summary)

for check in eval1.checklist:
    print(check)

The AI agent effectively followed the user's instructions, provided a relevant and clear answer, included the necessary citation, and thoroughly covered all relevant points regarding preparation for an AI Engineer course.
check_name='instructions_follow' justification='The agent provided a comprehensive answer aligned with the instructions, including recommendations for AI preparation.' check_pass=True
check_name='instructions_avoid' justification='The agent did not include any prohibited actions as per the instructions.' check_pass=True
check_name='answer_relevant' justification="The answer directly addresses the user's question about preparing for an AI Engineer course." check_pass=True
check_name='answer_clear' justification='The answer is structured in a clear and concise manner, using bullet points and headings for easy reading.' check_pass=True
check_name='answer_citations' justification='The agent included a citation to the Tech Interview Handbook as required, referencing its Gi

**Question Generator**

In [None]:
question_generation_prompt = """
You are helping to create test questions for an AI agent that answers questions about credit risk scorecard development.

Based on the provided documentation content, generate realistic questions that practitioners might ask.

The questions should:

- Be natural and varied in style
- Range from simple to complex
- Include both specific technical questions and general credit risk questions

Generate one question for each record.
""".strip()

class QuestionsList(BaseModel):
    questions: list[str]

question_generator = Agent(
    name="question_generator",
    instructions=question_generation_prompt,
    model='gpt-4o-mini',
    output_type=QuestionsList
)

**Generate Questions from Sampled Records**

In [None]:
import random

sample = random.sample(all_docs, 10)
prompt_docs = [d['content'] for d in sample]
prompt = json.dumps(prompt_docs)

result = await question_generator.run(prompt)
questions = result.output.questions
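
`random.sample(all_docs, 10)` raises `ValueError` when fewer than 10 documents are available, and it draws a different sample on every run. A minimal sketch of a guarded, repeatable variant follows; `sample_docs`, its `seed` parameter, and the toy `docs` list are assumptions made for illustration, not part of the notebook.

```python
import random

def sample_docs(docs, k=10, seed=None):
    """Sample up to k documents; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)   # local RNG, leaves the global random state alone
    k = min(k, len(docs))       # avoid ValueError on small collections
    return rng.sample(docs, k)

# Toy documents standing in for all_docs.
docs = [{'content': f'doc {i}'} for i in range(4)]
sample = sample_docs(docs, k=10, seed=42)
print(len(sample))
```

Fixing the seed keeps the generated question set stable across runs, which makes evaluation results easier to compare.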

**Run Agent on Generated Questions and Log Results**

In [99]:
from tqdm.auto import tqdm

for q in tqdm(questions):
    print(q)

    result = await agent.run(user_prompt=q)
    print(result.output)

    log_interaction_to_file(
        agent,
        result.new_messages(),
        source='ai-generated'
    )

    print()

  0%|          | 0/10 [00:00<?, ?it/s]

What are the key metrics to prioritize for effective ML monitoring?
To effectively monitor machine learning (ML) models, it's essential to prioritize the right metrics. Here are the key categories and examples of metrics to consider:

1. **Symptoms (Key Metrics)**: These are essential for identifying issues quickly and are typically used for alerting. They include metrics like accuracy and business KPIs (Key Performance Indicators). Examples:
   - 1-day accuracy.
   - KPIs related to business objectives.

2. **Summary Metrics**: These provide extra context to understand the performance of the model better. Examples include:
   - Prediction volume.
   - True positives and true negatives.
   - Share of missing data.

3. **Performance Profiling**: This involves a more detailed examination of model performance and outputs, such as:
   - Per-column descriptive statistics.
   - Distributions of outputs.
   - Segmented model performance information.

4. **Debugging and Analytics Data**: Metri

**Collect AI-Generated Logs for Evaluation**

In [None]:
eval_set = []

for log_file in LOG_DIR.glob('*.json'):
    if 'credit_risk_agent_v2' not in log_file.name:
        continue

    log_record = load_log_file(log_file)
    if log_record['source'] != 'ai-generated':
        continue

    eval_set.append(log_record)
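
The two filters above (agent name in the filename, `source` field in the record) can be tried out on a throwaway directory. The sketch below is an illustration under assumptions: `collect_logs` is a hypothetical helper that reads the JSON directly instead of calling `load_log_file`, and the three files are fabricated test fixtures, not real logs.

```python
import json
import tempfile
from pathlib import Path

def collect_logs(log_dir, agent_name, source):
    """Load JSON logs whose filename contains agent_name and whose 'source' matches."""
    selected = []
    for path in sorted(Path(log_dir).glob('*.json')):
        if agent_name not in path.name:
            continue
        record = json.loads(path.read_text(encoding='utf-8'))
        # .get() tolerates logs that have no 'source' field at all.
        if record.get('source') != source:
            continue
        record['log_file'] = path  # assumption: keep the path for later reporting
        selected.append(record)
    return selected

# Build a temporary directory with three hypothetical logs.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / 'credit_risk_agent_v2_a.json').write_text(json.dumps({'source': 'ai-generated'}))
    (tmp / 'credit_risk_agent_v2_b.json').write_text(json.dumps({'source': 'user'}))
    (tmp / 'other_agent_c.json').write_text(json.dumps({'source': 'ai-generated'}))
    logs = collect_logs(tmp, 'credit_risk_agent_v2', 'ai-generated')
    names = [r['log_file'].name for r in logs]

print(names)
```

Only the first file passes both filters; the second has the wrong source and the third belongs to a different agent.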

**Run Evaluation on All Logs**

In [101]:
eval_results = []

for log_record in tqdm(eval_set):
    eval_result = await evaluate_log_record(eval_agent, log_record)
    eval_results.append((log_record, eval_result))

  0%|          | 0/10 [00:00<?, ?it/s]

**Convert Results to DataFrame**

In [102]:
rows = []

for log_record, eval_result in eval_results:
    messages = log_record['messages']

    row = {
        'file': log_record['log_file'].name,
        'question': messages[0]['parts'][0]['content'],
        'answer': messages[-1]['parts'][0]['content'],
    }

    checks = {c.check_name: c.check_pass for c in eval_result.checklist}
    row.update(checks)

    rows.append(row)

**View Results and Pass Rates**

In [103]:
import pandas as pd

df_evals = pd.DataFrame(rows)
df_evals.head()

|   | file | question | answer | instructions_follow | instructions_avoid | answer_relevant | answer_clear | answer_citations | completeness | tool_call_search |
|---|------|----------|--------|---------------------|--------------------|-----------------|--------------|------------------|--------------|------------------|
| 0 | ml_monitoring_agent_v2_20260212_230933_289659.... | What is the difference between direct and prox... | In the context of monitoring model performance... | True | True | True | True | True | True | True |
| 1 | ml_monitoring_agent_v2_20260212_230959_a5651f.... | What are some strategies for monitoring embedd... | When monitoring embedding drift in high-dimens... | True | True | True | True | True | True | True |
| 2 | ml_monitoring_agent_v2_20260212_230912_dbb955.... | What are the key metrics to prioritize for eff... | To effectively monitor machine learning (ML) m... | True | True | True | True | True | True | True |
| 3 | ml_monitoring_agent_v2_20260212_231026_26470a.... | How does monitoring data integrity metrics hel... | Monitoring data integrity metrics plays a cruc... | True | True | True | True | True | True | True |
| 4 | ml_monitoring_agent_v2_20260212_230940_ed0f31.... | In what situations would minimalistic monitori... | Minimalistic monitoring for ML models is appro... | True | True | True | True | True | True | True |


**Overall Pass Rates**

In [104]:
df_evals.mean(numeric_only=True)

instructions_follow    1.0
instructions_avoid     1.0
answer_relevant        1.0
answer_clear           1.0
answer_citations       1.0
completeness           1.0
tool_call_search       1.0
dtype: float64
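
When every check passes, the mean hides nothing, but once failures appear it helps to see which files failed which check. The sketch below is a dependency-free illustration of the same aggregation plus a per-check failure list; the `rows` data is hypothetical and only mimics the shape of the rows built for `df_evals`.

```python
# Hypothetical rows shaped like those built for df_evals.
rows = [
    {'file': 'a.json', 'answer_relevant': True, 'answer_citations': True},
    {'file': 'b.json', 'answer_relevant': True, 'answer_citations': False},
    {'file': 'c.json', 'answer_relevant': False, 'answer_citations': False},
    {'file': 'd.json', 'answer_relevant': True, 'answer_citations': True},
]

# Treat every boolean column as a check.
check_names = [k for k, v in rows[0].items() if isinstance(v, bool)]

# Pass rate per check: share of rows where the check is True.
pass_rates = {
    name: sum(row[name] for row in rows) / len(rows)
    for name in check_names
}

# Files that failed each check, for manual inspection.
failures = {
    name: [row['file'] for row in rows if not row[name]]
    for name in check_names
}

print(pass_rates)
print(failures)
```

The failure lists point directly at the log files worth reading by hand, which is usually the next step after spotting a pass rate below 1.0.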