# Institution Checker - Quick Runner
A minimal notebook for processing one or more names against the configured institution.


## 📚 How to Use

### Quick Start (3 steps):
1. **🔄 RELOAD** (Cell 10): Load all fixes ⭐
2. **Configure** (Cell 4): Set INPUT_MODE and names
3. **Run Pipeline** (Cell 9): Process all names

### Full Workflow:
1. **Setup** → Cell 3: Import modules
2. **Configure** → Cell 4: Set input and parameters
3. **Resolve Names** → Cell 7: Validate name list
4. **🔄 RELOAD MODULE** → Cell 10: Load fixes ⭐ **START HERE**
5. **🔬 Diagnose** → Cell 13: Verify LLM is working correctly
6. **Debug Single** → Cell 15: Test problematic names
7. **Run Pipeline** → Cell 9: Process all names
8. **Quality Check** → Cell 19: Analyze results
9. **Retry Failed** → Cell 17: Auto-retry any remaining errors
10. **Export** → Cell 22: Save to Excel

### 🧠 What Was Fixed:
- ✅ **REMOVED max_tokens** - unlimited for thinking models
- ✅ **Fixed format string error** - escaped curly braces
- ✅ **IMPROVED current/past detection** - smarter signal counting
- ✅ **Enhanced prompt** - clearer temporal guidance
- ✅ **Fixed temperature** (0.0) - consistent reasoning

### 💡 Current/Past Detection Logic:
**Simple, evidence-based approach:**
- Counts "current" terms (is, currently, works at) vs "past" terms (was, former, retired)
- Detects ended date ranges like "2010-2015"
- Checks if person is currently at another institution
- Overrides LLM classification when evidence is strong (3+ signals)
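The counting approach above can be sketched roughly as follows. This is a minimal illustration, not the code inside `institution_checker`: the term lists, the `STRONG` threshold constant, and the `temporal_override` function are hypothetical names, and treating every closed year range as "ended" is an assumption.

```python
import re

# Hypothetical term lists taken from the bullets above; the real lists are not shown.
CURRENT_TERMS = ("is", "currently", "works at")
PAST_TERMS = ("was", "former", "retired")
# A closed year range such as "2010-2015" (hyphen or en dash) counts as ended.
ENDED_RANGE = re.compile(r"\b(?:19|20)\d{2}\s*[-\u2013]\s*(?:19|20)\d{2}\b")
STRONG = 3  # assumed number of signals needed to override the LLM label


def count_terms(text: str, terms: tuple) -> int:
    """Count whole-word (or whole-phrase) matches, case-insensitively."""
    return sum(
        len(re.findall(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE))
        for term in terms
    )


def temporal_override(text: str, llm_label: str, at_other_institution: bool = False) -> str:
    """Return 'current' or 'past' when the evidence is strong, else the LLM label."""
    current = count_terms(text, CURRENT_TERMS)
    past = count_terms(text, PAST_TERMS) + len(ENDED_RANGE.findall(text))
    if at_other_institution:
        past += 1  # being at another institution now is one more "past" signal
    if past >= STRONG and past > current:
        return "past"
    if current >= STRONG and current > past:
        return "current"
    return llm_label  # weak or mixed evidence: keep the LLM's classification
```

With weak or balanced evidence the function leaves the LLM label untouched, which matches the "override only when evidence is strong" rule.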

In [1]:
import asyncio
import platform
import sys
from pathlib import Path
from typing import List, Optional

import pandas as pd

from institution_checker import (
    INSTITUTION,
    FileSourceContext,
    close_search_clients,
    close_session,
    expand_results_to_source,
    resolve_names,
    run_pipeline,
)

print(f"Python: {sys.version.split()[0]} on {platform.platform()}")
print(f"Pandas: {pd.__version__}")
print(f"Institution: {INSTITUTION}")



Python: 3.12.7 on Windows-11-10.0.26100-SP0
Pandas: 2.3.1
Institution: Purdue University


In [2]:
# Input options: "single", "list", "file"
INPUT_MODE = "list"
SINGLE_NAME = "Christopher L. Eisgruber"
NAME_LIST = [
    "Robert Duncan",
    "Mitch Daniels",
    "Morgan Furze",
    "Connie Weaver",
    "Geoffrey Hinton",
    "Albert Warner Overhauser",
    "Isidor Isaac Rabi",
    "Chintamani Nagesa Ramachandra Rao",
    "Jane Goodall",
    "Marie Curie",
    "Nicholas Rauh",
    "France Cordova",
    "Steve Amireault",
    "Roy Dejoie",
    "Jozef Kokini",
    "Paul Alivisatos",
    "Vernon W. Ruttan",
    "William H. Gass",
    "Axel Hoffmann",
    "Matthew Lanham",
]
INPUT_FILE = r"E:\Ace\To be sorted\Purdue Trial Set HP Awardees 2024-10-23.xlsx"
FILE_COLUMN = "name"
FILE_SHEET: Optional[str] = None  # Only used for Excel sources

BATCH_SIZE = 8
INTER_BATCH_DELAY = 2.5  # Seconds to wait between batches (helps prevent rate limiting)
USE_ENHANCED_SEARCH = True
RUN_NETWORK_TESTS = True
EXPORT_PATH = "data/results.csv"  # Suffix is replaced with .xlsx on export; set to "" to skip writing

In [3]:
FILE_CONTEXT: Optional[FileSourceContext] = None


def prepare_names() -> List[str]:
    """Resolve names based on notebook configuration while tracking file context."""
    global FILE_CONTEXT
    mode = INPUT_MODE.lower().strip()
    if mode == "file":
        FILE_CONTEXT = FileSourceContext.from_path(
            INPUT_FILE,
            column=FILE_COLUMN,
            sheet=FILE_SHEET or None,
        )
        return FILE_CONTEXT.unique_names()

    FILE_CONTEXT = None
    return resolve_names(
        input_mode=mode,
        single_name=SINGLE_NAME,
        name_list=NAME_LIST,
        input_file=INPUT_FILE,
        file_column=FILE_COLUMN,
        file_sheet=FILE_SHEET or None,
    )



## Resolve Names
Build the list of names based on the selected input mode.


In [None]:
names_to_check: List[str] = []
try:
    names_to_check = prepare_names()
    if not names_to_check:
        print("No names to process. Update the configuration cell.")
    else:
        print(f"Prepared {len(names_to_check)} name(s):")
        for item in names_to_check:
            print(f" - {item}")
except Exception as exc:
    names_to_check = []
    print(f"Configuration error: {exc}")




Prepared 20 name(s):
 - Robert Duncan
 - Mitch Daniels
 - Morgan Furze
 - Connie Weaver
 - Geoffrey Hinton
 - Albert Warner Overhauser
 - Isidor Isaac Rabi
 - Chintamani Nagesa Ramachandra Rao
 - Jane Goodall
 - Marie Curie
 - Nicholas Rauh
 - France Cordova
 - Steve Amireault
 - Roy Dejoie
 - Jozef Kokini
 - Paul Alivisatos
 - Vernon W. Ruttan
 - William H. Gass
 - Axel Hoffmann
 - Matthew Lanham



## Run Pipeline
Runs the async pipeline in batches, using the enhanced search client unless `USE_ENHANCED_SEARCH` is `False`. The run is skipped entirely when `RUN_NETWORK_TESTS` is `False`.
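As a rough picture of how `batch_size` and `inter_batch_delay` interact, the sketch below splits names into batches, processes each batch concurrently, and pauses between batches. It is an assumption about the general pattern, not the implementation of `run_pipeline`; `run_in_batches` and `worker` are hypothetical names.

```python
import asyncio
import math
from typing import Any, Awaitable, Callable, List


async def run_in_batches(
    names: List[str],
    worker: Callable[[str], Awaitable[Any]],
    batch_size: int = 8,
    inter_batch_delay: float = 2.5,
) -> List[Any]:
    """Process names batch by batch, sleeping between batches to ease rate limits."""
    batch_size = max(1, int(batch_size))
    total_batches = math.ceil(len(names) / batch_size)
    results: List[Any] = []
    for index in range(total_batches):
        batch = names[index * batch_size:(index + 1) * batch_size]
        # Names inside one batch run concurrently; gather preserves input order.
        results.extend(await asyncio.gather(*(worker(name) for name in batch)))
        # No delay is needed after the final batch.
        if index < total_batches - 1 and inter_batch_delay > 0:
            await asyncio.sleep(inter_batch_delay)
    return results
```

With 20 names and a batch size of 8 this yields `ceil(20 / 8) = 3` batches. In a notebook, call it with `await run_in_batches(...)` rather than `asyncio.run(...)`, because Jupyter already runs an event loop.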


In [None]:
pipeline_results = []

if not names_to_check:
    print("No names to process.")
elif not RUN_NETWORK_TESTS:
    print("RUN_NETWORK_TESTS is False; skipping live pipeline execution.")
else:
    print("Starting pipeline with enhanced progress tracking and auto-retry...")
    pipeline_results = await run_pipeline(
        names_to_check,
        batch_size=max(1, int(BATCH_SIZE)),
        use_enhanced_search=USE_ENHANCED_SEARCH,
        inter_batch_delay=INTER_BATCH_DELAY,
        debug=False,  # Set to True for detailed logging
    )
    print(f"Pipeline completed (including auto-retry). Processed {len(pipeline_results)} record(s).")



Starting pipeline with enhanced progress tracking and auto-retry...
[PIPELINE] Starting: 20 name(s) in 3 batch(es) using enhanced search
[PIPELINE] Batch size: 8, Inter-batch delay: 2.5s

[PIPELINE] ===== BATCH 1/3 =====
[PIPELINE] Names in this batch: ['Robert Duncan', 'Mitch Daniels', 'Morgan Furze', 'Connie Weaver', 'Geoffrey Hinton', 'Albert Warner Overhauser', 'Isidor Isaac Rabi', 'Chintamani Nagesa Ramachandra Rao']
[BATCH] Processing 8 names: Robert Duncan, Mitch Daniels, Morgan Furze, Connie Weaver, Geoffrey Hinton, Albert Warner Overhauser, Isidor Isaac Rabi, Chintamani Nagesa Ramachandra Rao
[BATCH] Phase 1: Running searches in parallel for all 8 names
[PROGRESS] Starting search for: Robert Duncan
[PROGRESS] Trying basic search first for efficiency...
[PROGRESS] Starting search for: Mitch Daniels
[PROGRESS] Trying basic search first for efficiency...
[PROGRESS] Starting search for: Morgan Furze
[PROGRESS] Trying basic search first for efficiency...
[PROGRESS] Starting search 

RuntimeError: This event loop is already running

In [None]:
# Reload the module with ALL fixes for thinking models
import importlib
import institution_checker as ic

ic = importlib.reload(ic)

from institution_checker import (
    INSTITUTION,
    FileSourceContext,
    close_search_clients,
    close_session,
    expand_results_to_source,
    resolve_names,
    run_pipeline,
)

print("=" * 70)
print("✓ INSTITUTION CHECKER RELOADED - ALL FIXES APPLIED")
print("=" * 70)
print("\n🧠 YOUR MODEL: GPT-OSS 20B (Thinking Model)")
print("=" * 70)
print("\n🔧 KEY FIXES:")
print("  1. FIXED format string error (escaped curly braces)")
print("  2. REMOVED max_tokens - unlimited tokens for thinking")
print("  3. IMPROVED current/past detection:")
print("     • Counts temporal signals (current vs past terms)")
print("     • Detects date ranges (e.g., '2010-2015' = past)")
print("     • Checks if person is at another institution")
print("     • Overrides LLM when evidence is strong")
print("  4. ENHANCED prompt with clear current/past guidance")
print("  5. FIXED temperature to 0.0 - consistent output")
print("\n💡 CURRENT/PAST DETECTION:")
print("  • Past if: 3+ past terms, ended date ranges, or at other school")
print("  • Current if: 3+ current terms, recent years, more current signals")
print("  • Simple counting - no overcomplication!")
print("\n✅ This should fix classification errors!")
print("\n🎯 Next: Run Cell 9 to test improved detection!")
print("=" * 70)

✓ INSTITUTION CHECKER RELOADED - ALL FIXES APPLIED

🧠 YOUR MODEL: GPT-OSS 20B (Thinking Model)

🔧 KEY FIXES:
  1. FIXED format string error (escaped curly braces)
  2. REMOVED max_tokens - unlimited tokens for thinking
  3. IMPROVED current/past detection:
     • Counts temporal signals (current vs past terms)
     • Detects date ranges (e.g., '2010-2015' = past)
     • Checks if person is at another institution
     • Overrides LLM when evidence is strong
  4. ENHANCED prompt with clear current/past guidance
  5. FIXED temperature to 0.0 - consistent output

💡 CURRENT/PAST DETECTION:
  • Past if: 3+ past terms, ended date ranges, or at other school
  • Current if: 3+ current terms, recent years, more current signals
  • Simple counting - no overcomplication!

✅ This should fix classification errors!

🎯 Next: Run Cell 9 to test improved detection!


## Test Connection Fix
Quick tests to verify that the HTTP client connects and that JSON parsing tolerates malformed LLM responses.

In [None]:
# Quick connection and JSON parsing tests
from institution_checker.search import _get_http_client
from institution_checker.llm_processor import _parse_response, _extract_fields_with_regex
import httpx

print("🔍 RUNNING SYSTEM TESTS...")
print("=" * 70)

# Test 1: HTTP Client
print("\n1. Testing HTTP client...")
try:
    client = await _get_http_client()
    response = await client.get("https://www.bing.com", timeout=5.0)
    print(f"   ✓ HTTP client working (status {response.status_code})")
except Exception as e:
    print(f"   ✗ HTTP client failed: {e}")

# Test 2: JSON Parsing with malformed JSON
print("\n2. Testing robust JSON parsing...")
test_cases = [
    '{"connected": "Y", "confidence": "high"}',  # Valid
    '{"connected": "Y", "confidence": "high",}',  # Trailing comma
    '```json\n{"connected": "Y"}\n```',  # Code block
    '{"connected": "Y", "detail": "Professor at Purdue"}',  # Partial
]
passed = 0
for test in test_cases:
    try:
        result = _parse_response(test)
        if result.get("connected") in ["Y", "N"]:
            passed += 1
    except Exception:
        # Count any parsing failure as a failed case instead of aborting the loop
        pass
print(f"   ✓ Parsed {passed}/{len(test_cases)} test cases")

# Test 3: Regex fallback
print("\n3. Testing regex fallback parser...")
malformed = '"connected": "Y", "confidence": "high", "detail": "test'
try:
    result = _extract_fields_with_regex(malformed)
    if result.get("connected") == "Y":
        print("   ✓ Regex fallback working")
    else:
        print(f"   ⚠ Regex fallback returned: {result.get('connected')}")
except Exception as e:
    print(f"   ✗ Regex fallback failed: {e}")

print("\n" + "=" * 70)
print("✓ System tests complete. Ready to process names.")
print("=" * 70)

🔍 RUNNING SYSTEM TESTS...

1. Testing HTTP client...
   ✓ HTTP client working (status 200)

2. Testing robust JSON parsing...
   ✓ Parsed 4/4 test cases

3. Testing regex fallback parser...
   ✓ Regex fallback working

✓ System tests complete. Ready to process names.
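The parsing tests above cover code fences, trailing commas, and a regex fallback for truncated output. The sketch below shows one way such tolerant parsing could work. It is not the actual `_parse_response` or `_extract_fields_with_regex` from `institution_checker.llm_processor`; `parse_lenient` and the three patterns are hypothetical.

```python
import json
import re
from typing import Any, Dict

# Leading ```json / ``` fence and trailing ``` fence around the payload.
FENCE = re.compile(r"^```(?:json)?\s*|\s*```$", re.IGNORECASE)
# A comma directly before a closing brace or bracket.
# Caveat: this would also alter a string value that contains ",}".
TRAILING_COMMA = re.compile(r",\s*([}\]])")
# Complete "key": "value" pairs, used only when JSON decoding fails.
FIELD = re.compile(r'"(\w+)"\s*:\s*"([^"]*)"')


def parse_lenient(raw: str) -> Dict[str, Any]:
    """Parse LLM output as JSON, falling back to regex field extraction."""
    text = FENCE.sub("", raw.strip())
    text = TRAILING_COMMA.sub(r"\1", text)
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # Fallback: keep whichever string fields were fully emitted.
    return dict(FIELD.findall(text))
```

A field cut off mid-string (no closing quote) is dropped by the fallback rather than guessed.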


## 🔬 ROOT CAUSE DIAGNOSIS
Let's see what the LLM is actually returning and why it's malformed.

In [None]:
# Diagnose what the LLM is actually returning
import aiohttp
import json
from institution_checker.config import LLM_API_URL, LLM_API_KEY, MODEL_NAME

print("=" * 70)
print("DIAGNOSING LLM RESPONSES")
print("=" * 70)
print(f"\nModel: {MODEL_NAME}")
print(f"API: {LLM_API_URL}")

# Simple test prompt
test_prompt = """Return ONLY this JSON with NO additional text:
{
  "connected": "Y",
  "connection_detail": "Test professor",
  "current_or_past": "current",
  "supporting_url": "https://test.edu",
  "confidence": "high",
  "temporal_evidence": "Active since 2020"
}"""

async def test_llm_response():
    async with aiohttp.ClientSession() as session:
        headers = {
            "Authorization": f"Bearer {LLM_API_KEY}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": "You are a JSON generator. Return ONLY valid JSON with no text."},
                {"role": "user", "content": test_prompt},
            ],
            "temperature": 0.1,
            "max_tokens": 500,
        }
        
        print("\n📤 Sending test request...")
        async with session.post(LLM_API_URL, json=payload, headers=headers) as response:
            print(f"Status: {response.status}")
            
            text = await response.text()
            print(f"Response length: {len(text)} chars\n")
            
            # Parse the API response
            try:
                data = json.loads(text)
                print("✓ API response is valid JSON")
                
                # Get the actual content
                choices = data.get("choices", [])
                if choices:
                    content = choices[0].get("message", {}).get("content", "")
                    print(f"\n📝 LLM Content ({len(content)} chars):")
                    print("-" * 70)
                    print(content)
                    print("-" * 70)
                    
                    # Try to parse it
                    print("\n🔍 Attempting to parse as JSON...")
                    try:
                        parsed = json.loads(content)
                        print("✓ Content is VALID JSON!")
                        print(f"Keys: {list(parsed.keys())}")
                    except json.JSONDecodeError as e:
                        print(f"✗ Content is INVALID JSON!")
                        print(f"Error: {e}")
                        print(f"\nFirst 200 chars: {content[:200]}")
                        print(f"Last 200 chars: {content[-200:]}")
                        
                        # Check for common issues
                        if not content.strip().endswith("}"):
                            print("\n⚠️  ISSUE: Response doesn't end with }")
                            print("   → Likely TRUNCATED due to max_tokens limit")
                        if "```" in content:
                            print("\n⚠️  ISSUE: Response contains markdown code blocks")
                            print("   → Model not following 'no markdown' instruction")
                        if content.count('"') % 2 == 1:
                            print("\n⚠️  ISSUE: Odd number of quotes")
                            print("   → Likely truncated mid-string")
                else:
                    print("✗ No choices in API response")
                    print(f"Response data: {data}")
                    
            except json.JSONDecodeError as e:
                print(f"✗ API response is not valid JSON: {e}")
                print(f"Response text: {text[:500]}")

print("\n" + "=" * 70)
await test_llm_response()
print("=" * 70)

DIAGNOSING LLM RESPONSES

Model: gpt-oss:latest
API: https://genai.rcac.purdue.edu/api/chat/completions


📤 Sending test request...
Status: 200
Response length: 499 chars

✓ API response is valid JSON

📝 LLM Content (178 chars):
----------------------------------------------------------------------
{"connected":"Y","connection_detail":"Test professor","current_or_past":"current","supporting_url":"https://test.edu","confidence":"high","temporal_evidence":"Active since 2020"}
----------------------------------------------------------------------

🔍 Attempting to parse as JSON...
✓ Content is VALID JSON!
Keys: ['connected', 'connection_detail', 'current_or_past', 'supporting_url', 'confidence', 'temporal_evidence']

## Debug: Test Single Name Processing
Test search and LLM analysis for a specific name to see what's going wrong.

In [None]:
# Debug a specific name to see search results and LLM response
from institution_checker.search import enhanced_search
from institution_checker.llm_processor import analyze_connection, _build_prompt
from institution_checker import INSTITUTION

# Pick a name that should have a connection but was missed
# Try: "Jozef Kokini", "Robert Duncan", "Buzz Aldrin", "Nicholas Rauh"
TEST_NAME = "Jozef Kokini"  

print("=" * 70)
print(f"DEBUGGING: {TEST_NAME}")
print("=" * 70)

# Step 1: Test search
print("\n📍 STEP 1: SEARCH")
print("-" * 70)
search_results = await enhanced_search(TEST_NAME, INSTITUTION, num_results=20, debug=False)
print(f"\n✓ Found {len(search_results)} results\n")

if search_results:
    print("Top 5 results:")
    for i, result in enumerate(search_results[:5], 1):
        print(f"\n{i}. {result.get('title', 'No title')}")
        print(f"   URL: {result.get('url', 'No URL')[:80]}...")
        snippet = result.get('snippet', 'No snippet')
        print(f"   Snippet: {snippet[:120]}...")
        signals = result.get('signals', {})
        print(f"   Score: {signals.get('relevance_score', 0)}, "
              f"Has inst: {signals.get('has_institution', False)}, "
              f"Has person: {signals.get('has_person_name', False)}")
else:
    print("⚠ WARNING: No results found!")

# Step 2: Test LLM analysis
print("\n\n📍 STEP 2: LLM ANALYSIS")
print("-" * 70)
decision = await analyze_connection(TEST_NAME, INSTITUTION, search_results, debug=True)

print("\n\n📍 FINAL DECISION:")
print("-" * 70)
for key, value in decision.items():
    print(f"  {key}: {value}")

print("\n" + "=" * 70)

DEBUGGING: Jozef Kokini

📍 STEP 1: SEARCH
----------------------------------------------------------------------

✓ Found 19 results

Top 5 results:

1. EB Member Biography | Journal of Food and Nutrition | JSCHOLAR
   URL: https://www.bing.com/ck/a?p=57e30ff914952ab5d0482692bf6fd312af9e73dc41aaf140f2ca...
   Snippet: Dr. Jozef L. Kokini is currently the Scholle Endowed Chair in Food Processing in the Department of Food Science at Purdu...
   Score: 18, Has inst: True, Has person: True

2. Jozef L. Kokini | Scholle Endowed Chair | Rheology 2016 ...
   URL: https://www.bing.com/ck/a?p=0958d008dbb0f2da74ca0d4ab04c5d991d96126642fa4e9440c4...
   Snippet: Dr. Jozef L. Kokini is currently the Scholle Endowed Chair in Food Processing in the Department of Food Science at Purdu...
   Score: 18, Has inst: True, Has person: True

3. Speaker : The 12th International Congress on Engineering and
   URL: https://www.bing.com/ck/a?p=9c76e9945b6a3307bf21695af1a4e0ec6d3b380d8ecbad92863a...
   Snippet: D

## Re-run Failed Names
Use this cell to re-process any names that had errors in the previous run.
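The retry step boils down to two operations: detect records that carry an error marker, then overwrite them by name with the retried records. A minimal sketch, assuming each record is a dict with a unique `name` key; `is_failed` and `merge_retries` are hypothetical helper names, not part of `institution_checker`.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def is_failed(record: Record) -> bool:
    """A record failed if either text field carries an error marker."""
    detail = record.get("connection_detail") or ""  # tolerate None values
    evidence = record.get("temporal_evidence") or ""
    return "Error:" in detail or "error" in evidence.lower()


def merge_retries(original: List[Record], retried: List[Record]) -> List[Record]:
    """Replace original records with retried ones that share the same name."""
    merged = {record["name"]: record for record in original}
    for record in retried:
        merged[record["name"]] = record
    # dicts preserve insertion order, so the original ordering is kept.
    return list(merged.values())
```

Because the merge is keyed by name, duplicate names in the input collapse to a single record.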

In [None]:
# Find and re-run names that had errors
if pipeline_results:
    failed_names = [
        r['name'] for r in pipeline_results 
        if 'Error:' in r.get('connection_detail', '') or 'error' in r.get('temporal_evidence', '').lower()
    ]
    
    if failed_names:
        print(f"Found {len(failed_names)} failed names:")
        for name in failed_names:
            print(f"  - {name}")
        
        print("\nRe-running with improved parsing...")
        retry_results = await run_pipeline(
            failed_names,
            batch_size=max(1, int(BATCH_SIZE // 2)),  # Smaller batches for retry
            use_enhanced_search=USE_ENHANCED_SEARCH,
            inter_batch_delay=INTER_BATCH_DELAY,
            debug=True,  # Enable debug mode to see what's happening
        )
        
        # Merge results back
        result_dict = {r['name']: r for r in pipeline_results}
        for retry in retry_results:
            result_dict[retry['name']] = retry
        
        pipeline_results = list(result_dict.values())
        print(f"\n✓ Re-run complete. Updated {len(retry_results)} results.")
        
        # Show any still failing
        still_failed = [
            r['name'] for r in pipeline_results 
            if 'Error:' in r.get('connection_detail', '') or 'error' in r.get('temporal_evidence', '').lower()
        ]
        if still_failed:
            print(f"\n⚠ {len(still_failed)} names still have errors:")
            for name in still_failed:
                print(f"  - {name}")
        else:
            print("\n✓ All names processed successfully!")
    else:
        print("No failed names found. All results processed successfully!")
else:
    print("No pipeline_results found. Run the pipeline first.")

No failed names found. All results processed successfully!


## 🔍 Results Quality Check
Analyze results to identify any remaining issues or suspicious patterns.

In [None]:
# Analyze results for quality issues
if pipeline_results:
    print("=" * 70)
    print("RESULTS QUALITY ANALYSIS")
    print("=" * 70)
    
    total = len(pipeline_results)
    errors = [r for r in pipeline_results if 'Error:' in r.get('connection_detail', '') or 
              'error' in r.get('temporal_evidence', '').lower()]
    connected = [r for r in pipeline_results if r.get('connected') == 'Y']
    not_connected = [r for r in pipeline_results if r.get('connected') == 'N']
    
    print(f"\n📊 SUMMARY:")
    print(f"  Total processed: {total}")
    print(f"  Connected: {len(connected)} ({len(connected)/total*100:.1f}%)")
    print(f"  Not connected: {len(not_connected)} ({len(not_connected)/total*100:.1f}%)")
    print(f"  Errors: {len(errors)} ({len(errors)/total*100:.1f}%)")
    
    if errors:
        print(f"\n⚠️  ERRORS ({len(errors)}):")
        for r in errors:
            print(f"  - {r['name']}: {r.get('connection_detail', 'Unknown error')[:60]}...")
    
    # Check for suspiciously short details
    short_details = [r for r in connected if len(r.get('connection_detail', '')) < 30]
    if short_details:
        print(f"\n⚠️  SUSPICIOUSLY SHORT DETAILS ({len(short_details)}):")
        for r in short_details:
            print(f"  - {r['name']}: \"{r.get('connection_detail', '')}\"")
    
    # Check confidence distribution
    high_conf = len([r for r in pipeline_results if r.get('confidence') == 'high'])
    med_conf = len([r for r in pipeline_results if r.get('confidence') == 'medium'])
    low_conf = len([r for r in pipeline_results if r.get('confidence') == 'low'])
    
    print(f"\n📈 CONFIDENCE DISTRIBUTION:")
    print(f"  High: {high_conf} ({high_conf/total*100:.1f}%)")
    print(f"  Medium: {med_conf} ({med_conf/total*100:.1f}%)")
    print(f"  Low: {low_conf} ({low_conf/total*100:.1f}%)")
    
    # Show current vs past for connected
    if connected:
        current = len([r for r in connected if r.get('current_or_past') == 'current'])
        past = len([r for r in connected if r.get('current_or_past') == 'past'])
        print(f"\n🕒 TEMPORAL CLASSIFICATION (Connected only):")
        print(f"  Current: {current} ({current/len(connected)*100:.1f}%)")
        print(f"  Past: {past} ({past/len(connected)*100:.1f}%)")
    
    if errors:
        print(f"\n💡 RECOMMENDATION: Run the retry cell above to fix {len(errors)} errors")
    elif short_details:
        print(f"\n💡 RECOMMENDATION: Review {len(short_details)} results with short details")
    else:
        print(f"\n✅ ALL RESULTS LOOK GOOD!")
    
    print("=" * 70)
else:
    print("No results available. Run the pipeline first.")

RESULTS QUALITY ANALYSIS

📊 SUMMARY:
  Total processed: 20
  Connected: 14 (70.0%)
  Not connected: 6 (30.0%)
  Errors: 0 (0.0%)

📈 CONFIDENCE DISTRIBUTION:
  High: 14 (70.0%)
  Medium: 0 (0.0%)
  Low: 6 (30.0%)

🕒 TEMPORAL CLASSIFICATION (Connected only):
  Current: 6 (42.9%)
  Past: 8 (57.1%)

✅ ALL RESULTS LOOK GOOD!


## Review Final Results
Display the complete DataFrame after manual retry (if performed).


In [None]:
if pipeline_results:
    expanded = expand_results_to_source(pipeline_results, ctx=FILE_CONTEXT)
    display(expanded if isinstance(expanded, pd.DataFrame) else pd.DataFrame(pipeline_results))
else:
    print("No results to display.")



| | name | institution | connected | connection_detail | current_or_past | supporting_url | confidence | temporal_evidence |
|---|---|---|---|---|---|---|---|---|
| 0 | Robert Duncan | Purdue University | Y | Robert Duncan is employed as an Assistant Prof... | current | https://www.purdue.edu/academics/ogsp/ogsp/ogsp? | high | Employment listed for Spring 2024 and Fall 202... |
| 1 | Mitch Daniels | Purdue University | Y | Former president of Purdue University (2013-20... | past | https://en.wikipedia.org/wiki/Purdue_Universit... | high | 2013-2022 |
| 2 | Morgan Furze | Purdue University | Y | Assistant professor in the Botany & Plant Path... | current | https://ag.purdue.edu/btny/Pages/default.aspx | high | Assistant professor role documented on Sep 26,... |
| 3 | Connie Weaver | Purdue University | Y | Former professor and head of the Department of... | past | https://www.bing.com/ck/a?p=5b07d17a8eed81b4cf... | high | Joined Purdue in 1978, became department head ... |
| 4 | Geoffrey Hinton | Purdue University | N | No evidence that Geoffrey Hinton has held an e... | | | low | |
| 5 | Albert Warner Overhauser | Purdue University | Y | Professor (Stuart Distinguished Professor of P... | past | https://www.physics.purdue.edu/news/2012/overh... | high | 1973-2011 |
| 6 | Isidor Isaac Rabi | Purdue University | N | No employment, alumni, or official visiting/fe... | | https://www.nobelprize.org/prizes/physics/1944... | low | Rabi's career: 1920s-1930s Columbia, 1940s Cor... |
| 7 | Chintamani Nagesa Ramachandra Rao | Purdue University | Y | Alumnus (PhD graduate) from Purdue University ... | past | https://en-academic.com/dic.nsf/ewink/780658 | high | PhD earned in 1958 |
| 8 | Jane Goodall | Purdue University | N | No evidence of employment, student, alumni, or... | | | low | No official connection evidence found up to 2025. |
| 9 | Marie Curie | Purdue University | N | Marie Curie had no employment, student, or off... | | | low | Marie Curie died in 1934, and no historical re... |


## Optional Export
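Write the results to an Excel file when `EXPORT_PATH` is set. The path's suffix is replaced with `.xlsx`, so `data/results.csv` is written as `data/results.xlsx`.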
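The export cell changes the configured file suffix before writing, which is why a `.csv` setting produces an `.xlsx` file. A small sketch of that path handling, using `PurePosixPath` only to keep the printed form platform-independent (the notebook itself uses `Path`); `resolve_export_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_export_path(export_path: str) -> Optional[PurePosixPath]:
    """Return the .xlsx path for a configured export path, or None to skip export."""
    if not export_path:
        return None  # an empty string means "do not write anything"
    # with_suffix swaps the final extension, whatever it was, for .xlsx
    return PurePosixPath(export_path).with_suffix(".xlsx")


print(resolve_export_path("data/results.csv"))  # data/results.xlsx
```

On Windows the notebook's `Path` prints the same location with backslashes, as in the `data\results.xlsx` output.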


In [None]:
from ftfy import fix_text

if pipeline_results and EXPORT_PATH:
    export_path = Path(EXPORT_PATH).with_suffix(".xlsx")
    export_path.parent.mkdir(parents=True, exist_ok=True)
    expanded = expand_results_to_source(pipeline_results, ctx=FILE_CONTEXT)
    df_to_export = expanded if isinstance(expanded, pd.DataFrame) else pd.DataFrame(pipeline_results)
    object_columns = df_to_export.select_dtypes(include="object").columns
    if len(object_columns):
        # DataFrame.map replaces the deprecated applymap (requires pandas >= 2.1)
        df_to_export[object_columns] = df_to_export[object_columns].map(
            lambda value: fix_text(value) if isinstance(value, str) else value
        )
    df_to_export.to_excel(export_path, index=False)
    print(f"Results exported to {export_path}")
elif not pipeline_results:
    print("No results to export.")
else:
    print("EXPORT_PATH is empty; skipping export.")



Results exported to data\results.xlsx




## Cleanup
Close shared clients so future runs start fresh.


In [None]:
await close_search_clients()
await close_session()
print("Cleanup complete.")


Cleanup complete.
