# Institution Checker - Comprehensive Test Harness
Use this notebook to exercise the search and LLM flows with either live services or deterministic fixtures.


## Quick Start
1. Adjust the configuration cell to switch between live calls and offline fixtures.
2. Run the environment check to confirm dependencies are available.
3. Execute the numbered sections in order (search demo -> LLM decision -> batch pipeline -> result inspection).
4. Run the cleanup cell before restarting the kernel or rerunning the notebook so pooled clients close cleanly.


## Notebook Modes
- RUN_NETWORK_TESTS = True: hits Bing and the LLM endpoint with real network requests.
- RUN_NETWORK_TESTS = False: uses deterministic fixtures that simulate realistic search results and decisions.
- Offline mode still exercises scoring logic, summarisation, and the async batch pipeline via monkey-patched fixtures.
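
The offline mode relies on monkey-patching, which the following minimal sketch illustrates in isolation. The `pipeline` namespace, `live_search`, `fixture_search`, and `lookup` are hypothetical stand-ins invented for this example; they are not part of the project's modules.

```python
import asyncio
import types
from unittest.mock import patch

# Hypothetical stand-in for a module whose search function would hit the network.
pipeline = types.SimpleNamespace()

async def live_search(name):
    raise RuntimeError("network access is disabled in this sketch")

pipeline.search = live_search

async def lookup(name):
    # Resolves pipeline.search at call time, so a patch takes effect here.
    return await pipeline.search(name)

async def fixture_search(name):
    # Deterministic replacement returned instead of a live result.
    return [{"title": f"{name} fixture", "url": "https://example.org"}]

async def main():
    # The attribute is swapped only inside the with-block and restored afterwards.
    with patch.object(pipeline, "search", new=fixture_search):
        return await lookup("Ada Lovelace")

# In a notebook cell, use `result = await main()` instead of asyncio.run.
result = asyncio.run(main())
print(result)
```

Because the patch is scoped to the `with` block, the original function is back in place once the block exits, so a later live run is unaffected.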


In [1]:
import asyncio
import importlib
import platform
import sys

packages = ("aiohttp", "httpx", "pandas")

print(f"Python: {sys.version.split()[0]}")
print(f"Platform: {platform.platform()}")

for name in packages:
    try:
        module = importlib.import_module(name)
        version = getattr(module, "__version__", "unknown")
        print(f"{name}: {version}")
    except Exception as exc:
        print(f"{name}: missing ({exc.__class__.__name__})")


Python: 3.12.7
Platform: Windows-11-10.0.26100-SP0
aiohttp: 3.10.5
httpx: 0.28.1
pandas: 2.3.1


In [2]:
# Toggle live calls (Bing + LLM). Set to False to use deterministic fixtures.
RUN_NETWORK_TESTS = True

# Primary test inputs
NAME_UNDER_TEST = "Robert Duncan"
BATCH_NAMES = ["Robert Duncan", "Jane Smith", "Ada Lovelace"]
MAX_RESULTS = 8

print(f"Live tests enabled? {RUN_NETWORK_TESTS}")
print(f"Primary name: {NAME_UNDER_TEST}")
print(f"Batch names: {BATCH_NAMES}")
print(f"Max results per search: {MAX_RESULTS}")


Live tests enabled? True
Primary name: Robert Duncan
Batch names: ['Robert Duncan', 'Jane Smith', 'Ada Lovelace']
Max results per search: 8


In [3]:
from copy import deepcopy
from pprint import pprint
from typing import Any, Dict, List

try:
    import pandas as pd
except Exception:
    pd = None

from institution_checker import INSTITUTION
from search import enhanced_search, close_search_clients, _compute_signals
from llm_processor import analyze_connection, close_session
from main import run_pipeline

print(f"Target institution: {INSTITUTION}")

_DEFAULT_NEGATIVE_DECISION: Dict[str, str] = {
    "connected": "N",
    "connection_detail": "No supporting evidence in fixture.",
    "current_or_past": "N/A",
    "supporting_url": "",
    "confidence": "medium",
    "temporal_evidence": "Fixture fallback: no Purdue-related evidence found.",
}

_RAW_FIXTURE_DATA: Dict[str, Dict[str, Any]] = {
    "Robert Duncan": {
        "search_entries": [
            {
                "title": "Robert Duncan - Assistant Professor of Computer Science - Purdue University",
                "url": "https://www.cs.purdue.edu/people/robert-duncan.html",
                "snippet": "Currently serves as an assistant professor of computer science at Purdue University in West Lafayette. Duncan joined Purdue in 2021 and led the 2024 AI faculty cluster hire initiative.",
            },
            {
                "title": "Purdue AI Institute welcomes Robert Duncan to Engineering Faculty",
                "url": "https://www.purdue.edu/ai/news/2024/robert-duncan.html",
                "snippet": "In March 2024 Purdue University announced that Robert Duncan joined the College of Engineering as an associate professor focusing on autonomous systems.",
            },
            {
                "title": "Computer Science Faculty Directory - Purdue University",
                "url": "https://www.cs.purdue.edu/people/faculty/duncan.html",
                "snippet": "The Purdue Computer Science faculty directory lists Robert Duncan as a current faculty member and research lead for autonomous systems research.",
            },
        ],
        "llm_decision": {
            "connected": "Y",
            "connection_detail": "Assistant professor of computer science at Purdue University since 2021.",
            "current_or_past": "current",
            "supporting_url": "https://www.cs.purdue.edu/people/robert-duncan.html",
            "confidence": "high",
            "temporal_evidence": "Currently listed as Purdue faculty; joined in 2021 and highlighted again in 2024 announcements.",
        },
    },
    "Jane Smith": {
        "search_entries": [
            {
                "title": "Jane Smith - Director of Alumni Relations, University of Michigan",
                "url": "https://alumni.umich.edu/people/jane-smith",
                "snippet": "Jane Smith currently serves as director of alumni relations for the University of Michigan alumni association in Ann Arbor.",
            },
            {
                "title": "Jane Smith joins University of Texas development office",
                "url": "https://news.utexas.edu/2023/08/14/jane-smith-joins-development-office/",
                "snippet": "The University of Texas announced in 2023 that Jane Smith joined its development office to lead new fundraising initiatives.",
            },
            {
                "title": "Professional profile for Jane Smith",
                "url": "https://www.linkedin.com/in/jane-smith",
                "snippet": "Jane Smith is a higher education advancement leader with roles at the University of Michigan and the University of Texas.",
            },
        ],
        "llm_decision": {
            "connected": "N",
            "connection_detail": "Search context only references roles at other universities.",
            "current_or_past": "N/A",
            "supporting_url": "",
            "confidence": "high",
            "temporal_evidence": "Evidence shows current roles at Michigan and Texas, none at Purdue University.",
        },
    },
}

def _build_fixture_results(name: str) -> List[Dict[str, Any]]:
    entries = _RAW_FIXTURE_DATA.get(name, {}).get("search_entries", [])
    compiled: List[Dict[str, Any]] = []
    for entry in entries:
        signals = _compute_signals(entry["title"], entry["snippet"], entry["url"], INSTITUTION, name)
        compiled.append({**entry, "signals": signals})
    return compiled

OFFLINE_FIXTURES: Dict[str, Dict[str, Any]] = {}
for person, payload in _RAW_FIXTURE_DATA.items():
    OFFLINE_FIXTURES[person] = {
        "search_results": _build_fixture_results(person),
        "llm_decision": {**_DEFAULT_NEGATIVE_DECISION, **payload.get("llm_decision", {})},
    }

def get_offline_fixture(name: str) -> Dict[str, Any]:
    base = OFFLINE_FIXTURES.get(name)
    if base:
        # Return a copy so callers cannot mutate the shared fixture between runs.
        return deepcopy(base)
    return {
        "search_results": [],
        "llm_decision": deepcopy(_DEFAULT_NEGATIVE_DECISION),
    }

search_results: List[Dict[str, Any]] = []
llm_decision: Dict[str, Any] = {}


Target institution: Purdue University


In [4]:
def summarize_search_results(results, limit=5):
    if not results:
        print("No results returned.")
        return
    highlight_keys = [
        "has_person_name",
        "has_institution",
        "has_academic_role",
        "has_current",
        "has_past",
        "has_recent_year",
        "career_transition",
    ]
    for idx, item in enumerate(results[:limit], 1):
        signals = item.get("signals", {})
        score = signals.get("relevance_score")
        flags = [key for key in highlight_keys if signals.get(key)]
        print(f"#{idx} {item.get('title', '[no title]')}")
        print(f"    URL: {item.get('url', '')}")
        print(f"    Score: {score} | Flags: {', '.join(flags) or 'none'}")
        snippet = item.get("snippet", '').strip()
        if snippet:
            print(f"    Snippet: {snippet}")
        print("")

def print_llm_decision(decision):
    if not decision:
        print("No decision payload received.")
        return
    print("LLM Decision")
    print("------------")
    for key in ("connected", "current_or_past", "confidence"):
        print(f"{key}: {decision.get(key)}")
    print(f"detail: {decision.get('connection_detail')}")
    print(f"url: {decision.get('supporting_url')}")
    print("temporal evidence:")
    print(decision.get("temporal_evidence", ""))

def verify_signal_consistency(name: str, results: List[Dict[str, Any]]) -> bool:
    mismatches = []
    for idx, item in enumerate(results, 1):
        recalculated = _compute_signals(
            item.get("title", ""),
            item.get("snippet", ""),
            item.get("url", ""),
            INSTITUTION,
            name,
        )
        expected = item.get("signals", {})
        if recalculated != expected:
            mismatches.append((idx, expected, recalculated))
    if mismatches:
        print("Signal mismatch detected; recalculated values differ.")
        for idx, expected, recalculated in mismatches:
            print(f"  Result #{idx}: expected {expected} vs recalculated {recalculated}")
        return False
    print(f"Signals verified for {len(results)} result(s).")
    return True

def show_results_table(results: List[Dict[str, Any]]):
    if not results:
        return
    if pd is not None:
        rows = []
        for item in results:
            signals = item.get("signals", {})
            rows.append(
                {
                    "title": item.get("title", ""),
                    "score": signals.get("relevance_score"),
                    "has_current": signals.get("has_current"),
                    "has_past": signals.get("has_past"),
                    "has_institution": signals.get("has_institution"),
                    "domain": signals.get("domain"),
                }
            )
        display(pd.DataFrame(rows).sort_values(by="score", ascending=False))
    else:
        pprint(results)


## 1. Enhanced Search Demo
Fetches results either live or from the offline fixtures to exercise scoring and summarisation.


In [5]:
if RUN_NETWORK_TESTS:
    search_results = await enhanced_search(
        NAME_UNDER_TEST,
        INSTITUTION,
        num_results=MAX_RESULTS,
        debug=False,
    )
    origin = "live Bing search"
else:
    fixture_payload = get_offline_fixture(NAME_UNDER_TEST)
    search_results = fixture_payload["search_results"][:MAX_RESULTS]
    origin = "offline fixture"
    print(f"Using offline fixture for {NAME_UNDER_TEST} with {len(search_results)} result(s).")

if search_results:
    print(f"Retrieved {len(search_results)} result(s) via {origin}.")
else:
    print(f"No results returned from {origin}.")

summarize_search_results(search_results, limit=min(MAX_RESULTS, 5))
if search_results:
    verify_signal_consistency(NAME_UNDER_TEST, search_results)
    show_results_table(search_results)


Retrieved 8 result(s) via live Bing search.
#1 Reducing the Digital Divide for Families: State and …
    URL: https://www.bing.com/ck/a?p=87b8684d9bce9fab3680f73635517829bd3efac6e1570ca690ad4f7f6253514fJmltdHM9MTc1ODg0NDgwMA&ptn=3&ver=2&hsh=4&fclid=1fab75c6-5ead-607b-06ea-63b55fd36135&u=a1aHR0cHM6Ly93d3cubmNmci5vcmcvcG9saWN5L3Jlc2VhcmNoLWFuZC1wb2xpY3ktYnJpZWZzL3JlZHVjaW5nLWRpZ2l0YWwtZGl2aWRlLWZhbWlsaWVzLXN0YXRlLWxvY2FsLXBvbGljeS1vcHBvcnR1bml0aWVz&ntb=1
    Score: 16 | Flags: has_person_name, has_institution, has_academic_role, has_recent_year
    Snippet: Jul 8, 2024 · Robert Duncan, Ph.D., is an Assistant Professor in the Department of Human Development and Family Science at Purdue University. He also serves as …

#2 Robert J Duncan - boiler.courses
    URL: https://www.bing.com/ck/a?p=509dd701be5ec05afaf61475e10b977ea40be30e9d8acc3cf7ad0fc411d24154JmltdHM9MTc1ODg0NDgwMA&ptn=3&ver=2&hsh=4&fclid=386d9e33-b563-6de9-150b-8840b41d6c69&u=a1aHR0cHM6Ly9ib2lsZXIuY291cnNlcy9wcm9mLzgxNQ&ntb=1
 

Unnamed: 0,title,score,has_current,has_past,has_institution,domain
0,Reducing the Digital Divide for Families: Stat...,16,False,False,True,www.bing.com
1,Robert J Duncan - boiler.courses,16,False,False,True,www.bing.com
2,Robert J Duncan - boiler.courses,16,False,False,True,www.bing.com
3,Robert Duncan - Assistant Professor at Purdue ...,14,False,True,True,www.bing.com
4,Robert J Duncan | Faculty | PU | 2019 | OpenPa...,14,False,False,True,www.bing.com
5,Robert J Duncan | Faculty | PU | OpenPayrolls,14,False,False,True,www.bing.com
6,Robert J Duncan | Faculty | PU | OpenPayrolls,14,False,False,True,www.bing.com
7,SEL I E C SEL Interventions in Early Childhood...,13,False,False,True,www.bing.com
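
Every row in the table above reports `www.bing.com` as the domain because the live results carry Bing click-tracking URLs rather than the target pages. In the URLs shown in this output, the `u` query parameter appears to hold the destination as `a1` followed by URL-safe base64. The sketch below decodes that form; `decode_bing_redirect` is a hypothetical helper, the `a1` convention is an observation from this output rather than a documented API, and the notebook does not show whether `search.py` already does something similar.

```python
import base64
from urllib.parse import parse_qs, urlparse

def decode_bing_redirect(url: str) -> str:
    """Return the target of a Bing /ck/ redirect, or the input URL unchanged."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith("bing.com") or not parsed.path.startswith("/ck/"):
        return url
    token = parse_qs(parsed.query).get("u", [""])[0]
    if not token.startswith("a1"):
        return url  # unknown encoding: leave the URL as it is
    encoded = token[2:]
    encoded += "=" * (-len(encoded) % 4)  # restore stripped base64 padding
    try:
        return base64.urlsafe_b64decode(encoded).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return url

redirect = (
    "https://www.bing.com/ck/a?p=509dd701be5ec05afaf61475e10b977ea40be30e9d8acc3cf7ad0fc411d24154"
    "JmltdHM9MTc1ODg0NDgwMA&ptn=3&ver=2&hsh=4&fclid=386d9e33-b563-6de9-150b-8840b41d6c69"
    "&u=a1aHR0cHM6Ly9ib2lsZXIuY291cnNlcy9wcm9mLzgxNQ&ntb=1"
)
target = decode_bing_redirect(redirect)
print(target)  # decodes to https://boiler.courses/prof/815
```

Decoding before computing the `domain` signal would let the table distinguish sources such as `boiler.courses` from one another, though that change belongs in the search module, not this harness.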


## 2. LLM Connection Decision
Analyzes the search results to determine whether the person is connected to the institution.


In [6]:
if RUN_NETWORK_TESTS:
    if not search_results:
        search_results = await enhanced_search(
            NAME_UNDER_TEST,
            INSTITUTION,
            num_results=MAX_RESULTS,
            debug=False,
        )
    llm_decision = await analyze_connection(
        NAME_UNDER_TEST,
        INSTITUTION,
        search_results,
        debug=True,
    )
    origin = "live LLM call"
else:
    fixture_payload = get_offline_fixture(NAME_UNDER_TEST)
    llm_decision = fixture_payload["llm_decision"]
    origin = "offline fixture"
    print(f"Using offline fixture decision for {NAME_UNDER_TEST}.")

print(f"Decision source: {origin}")
print_llm_decision(llm_decision)


[DEBUG] Prompt length: 5599 characters
[DEBUG] Attempt 1 status: 200
[DEBUG] Model content preview: ```json {   "connected": "Y",   "connection_detail": "Robert Duncan is an Assistant Professor at Purdue University in the Department of Human Development and Family Science.",   "current_or_past": "cu...
Decision source: live LLM call
LLM Decision
------------
connected: Y
current_or_past: current
confidence: high
detail: Robert Duncan is an Assistant Professor at Purdue University in the Department of Human Development and Family Science.
url: https://www.bing.com/ck/a?p=87b8684d9bce9fab3680f73635517829bd3efac6e1570ca690ad4f7f6253514fJmltdHM9MTc1ODg0NDgwMA&ptn=3&ver=2&hsh=4&fclid=1fab75c6-5ead-607b-06ea-63b55fd36135&u=a1aHR0cHM6Ly93d3cub...
temporal evidence:
The snippet from July 8, 2024, indicates that Robert Duncan is currently an Assistant Professor at Purdue University.
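
The debug preview above shows the model wrapping its JSON in a Markdown code fence. A minimal sketch of how such a reply could be unwrapped and checked for the decision keys used in this notebook follows; `parse_model_reply` is a hypothetical helper and does not reflect how `llm_processor` actually parses replies.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the example can sit inside a fence itself

REQUIRED_KEYS = (
    "connected", "connection_detail", "current_or_past",
    "supporting_url", "confidence", "temporal_evidence",
)

def parse_model_reply(content: str) -> dict:
    """Extract a JSON object from a reply that may be wrapped in a Markdown fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(\{.*\})\s*" + re.escape(FENCE)
    match = re.search(pattern, content, re.DOTALL)
    payload = match.group(1) if match else content.strip()
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data

# Illustrative reply; the field values are placeholders, not real model output.
sample = (
    FENCE + 'json { "connected": "Y", "connection_detail": "example", '
    '"current_or_past": "current", "supporting_url": "", '
    '"confidence": "high", "temporal_evidence": "example" } ' + FENCE
)
decision = parse_model_reply(sample)
print(decision["connected"], decision["confidence"])
```

Raising on missing keys surfaces malformed replies early instead of letting `None` values flow into the printed decision.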


## 3. Batch Pipeline
Exercises the async CLI pipeline with either live services or offline fixtures patched into place.
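
As a rough illustration of the batching pattern, the sketch below splits names into fixed-size batches and awaits each batch concurrently. It is written under the assumption that batches run one after another; `chunked`, `process_in_batches`, and `fake_worker` are hypothetical names, and the internals of `run_pipeline` are not shown in this notebook.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterator, List, Sequence

def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

async def process_in_batches(
    names: Sequence[str],
    worker: Callable[[str], Awaitable[Dict[str, Any]]],
    batch_size: int = 2,
) -> List[Dict[str, Any]]:
    results: List[Dict[str, Any]] = []
    for batch in chunked(names, batch_size):
        # Names within a batch run concurrently; gather preserves input order.
        results.extend(await asyncio.gather(*(worker(name) for name in batch)))
    return results

async def fake_worker(name: str) -> Dict[str, Any]:
    await asyncio.sleep(0)  # stands in for search + LLM latency
    return {"name": name, "connected": "N"}

names = ["Robert Duncan", "Jane Smith", "Ada Lovelace"]
batches = [list(batch) for batch in chunked(names, 2)]
# In a notebook cell, use `await process_in_batches(...)` instead of asyncio.run.
records = asyncio.run(process_in_batches(names, fake_worker, batch_size=2))
print(batches)
print([record["name"] for record in records])
```

With three names and a batch size of two, this yields two batches, matching the "3 name(s) in 2 batch(es)" shape of the pipeline log.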


In [7]:
batch_output = []

if RUN_NETWORK_TESTS:
    batch_output = await run_pipeline(
        BATCH_NAMES,
        batch_size=2,
        use_enhanced_search=True,
        debug=False,
    )
else:
    from unittest.mock import patch

    async def offline_enhanced(name: str, institution: str, num_results: int = MAX_RESULTS, debug: bool = False):
        payload = get_offline_fixture(name)
        results = payload["search_results"]
        if num_results:
            return results[:num_results]
        return results

    async def offline_llm(name: str, institution: str, results, debug: bool = False, max_retries: int = 2):
        payload = get_offline_fixture(name)
        return payload["llm_decision"]

    def _noop_writer(*args, **kwargs):
        print("[offline] Skipping write to disk.")

    async def _run_offline_pipeline():
        with patch("main.enhanced_search", new=offline_enhanced), \
             patch("main.analyze_connection", new=offline_llm), \
             patch("main.write_partial_results", new=_noop_writer), \
             patch("main.save_final_results", new=_noop_writer):
            return await run_pipeline(
                BATCH_NAMES,
                batch_size=2,
                use_enhanced_search=True,
                debug=False,
            )

    batch_output = await _run_offline_pipeline()

if batch_output:
    print(f"Processed {len(batch_output)} record(s).")
    if pd is not None:
        display(pd.DataFrame(batch_output))
    else:
        pprint(batch_output)
else:
    print("No batch output produced.")


[INFO] Processing 3 name(s) in 2 batch(es) using enhanced search

[INFO] Batch 1/2: 2 name(s)
[OK] Robert Duncan: connected (past, high) - Robert Duncan has been an Assistant Professor at Purdue University in the Department of Human Development and Family Science.
[OK] Jane Smith: connected (past, high) - Jane Smith has a genuine employment connection with Purdue University as a Senior Records Auditor, and she is also listed as an alumnus.
[INFO] Batch 1 completed in 44.2s

[INFO] Batch 2/2: 1 name(s)
[--] Ada Lovelace: no confirmed connection to Purdue University
[INFO] Batch 2 completed in 37.9s

[INFO] Finished processing in 82.1s (27.4s per name)
Processed 3 record(s).


Unnamed: 0,name,institution,connected,connection_detail,current_or_past,supporting_url,confidence,temporal_evidence
0,Robert Duncan,Purdue University,Y,Robert Duncan has been an Assistant Professor ...,past,https://www.bing.com/ck/a?p=87b8684d9bce9fab36...,high,Multiple sources confirm his role as Assistant...
1,Jane Smith,Purdue University,Y,Jane Smith has a genuine employment connection...,past,https://www.bing.com/ck/a?p=b3e2e1c14d58850779...,high,The search finding indicates Jane Smith was a ...
2,Ada Lovelace,Purdue University,N,Ada Lovelace is recognized as the first comput...,,,high,All search findings indicate Ada Lovace's hist...


## 4. Inspect Stored Results (Optional)
Check the current contents of data/results_partial.csv if it exists.


In [8]:
from pathlib import Path

partial_path = Path("data/results_partial.csv")
if partial_path.exists():
    print(f"Found partial results at {partial_path}")
    if pd is not None:
        display(pd.read_csv(partial_path))
    else:
        print(partial_path.read_text())
else:
    print("No partial results file present.")


Found partial results at data\results_partial.csv


Unnamed: 0,name,institution,connected,connection_detail,current_or_past,supporting_url,confidence,temporal_evidence
0,Robert Duncan,Purdue University,Y,Robert Duncan has been an Assistant Professor ...,past,https://www.bing.com/ck/a?p=87b8684d9bce9fab36...,high,Multiple sources confirm his role as Assistant...
1,Jane Smith,Purdue University,Y,Jane Smith has a genuine employment connection...,past,https://www.bing.com/ck/a?p=b3e2e1c14d58850779...,high,The search finding indicates Jane Smith was a ...
2,Ada Lovelace,Purdue University,N,Ada Lovelace is recognized as the first comput...,,,high,All search findings indicate Ada Lovace's hist...


## 5. Cleanup
Close shared clients so reruns start fresh.
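
If one close call raises, the cell below would stop before the remaining client is closed. One possible way to make cleanup more tolerant is sketched here with stand-in closers; `close_all`, `close_a`, `close_b`, and `close_c` are hypothetical and are not the project's `close_search_clients` or `close_session`.

```python
import asyncio
from typing import Awaitable, Callable, List

async def close_all(*closers: Callable[[], Awaitable[None]]) -> List[BaseException]:
    """Run every closer and collect failures instead of stopping at the first one."""
    outcomes = await asyncio.gather(
        *(closer() for closer in closers), return_exceptions=True
    )
    return [outcome for outcome in outcomes if isinstance(outcome, BaseException)]

closed: List[str] = []

async def close_a() -> None:
    closed.append("a")

async def close_b() -> None:
    raise RuntimeError("already closed")

async def close_c() -> None:
    closed.append("c")

# In a notebook cell, use `failures = await close_all(...)` instead of asyncio.run.
failures = asyncio.run(close_all(close_a, close_b, close_c))
print(sorted(closed), [type(error).__name__ for error in failures])
```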


In [9]:
await close_search_clients()
await close_session()
print("Cleanup complete.")


Cleanup complete.
