Fix five OpenAI-integration bugs surfaced by end-to-end run by smodee · Pull Request #22 · algorithmicgovernance/BioScanCast

smodee · 2026-05-17T20:53:20Z

Summary

Five small, independent fixes for bugs that only show up when the insight and filtering stages actually call OpenAI. All were invisible to the existing test suite because it uses FakeLLMClient, which doesn't enforce the API's contracts.

Surfaced while running an end-to-end smoke test of extraction + insight against the seven sources in data/docling_eval/sources/ (using PRs #19 and #20 combined on a local test branch).

Commits

d3bd0a2 — Render table chunks as text when embedding. PDF table chunks have empty .text (content lives in .table_data); embed_chunks was sending "" to the OpenAI embeddings API, which 400s. Now renders table rows inline for embedding.
a041164 — List all extraction-schema fields as required (closes Bug: Insight extraction JSON schema rejected by OpenAI strict mode #12). OpenAI's strict: True JSON-schema mode requires every property in required. The optional fields already use nullable types, so this is just expanding the required array.
7a75372 — Survive truncated insight extraction responses. Two complementary fixes for the chunk-extractor failure mode where dense pages (e.g. ECDC CDTR) blew past the 1024 output-token cap and broke the whole question:
- New InsightConfig.extraction_max_output_tokens (default 4096), threaded through pipeline → chunk_extractor → generate_json.
- OpenAILLMClient.generate_json now catches json.JSONDecodeError, logs the failure with token usage + finish_reason, and returns an empty content dict with accurate token counts (so the budget tracker stays right).
97484df — Include 'json' keyword in filter prompt (closes Bug: LLM filter prompt missing 'json' keyword causes OpenAI 400 error #11). Required by response_format={'type': 'json_object'}; every real LLM-filter call was 400ing.
af7844c — Drop duplicate LLMClient Protocol in filtering (closes Consolidate duplicate LLMClient Protocol definitions #1). filtering/llm_filter.py defined the same Protocol as bioscancast/llm/client.py. Collapsed to a single import.

Test plan

pytest bioscancast/tests/ — 178 passed, no regressions.
End-to-end run against the 7 sources in data/docling_eval/sources/ (script lives on a separate test branch, requires Hybrid PDF extraction: add DoclingTableRefiner (closes #16) #19 + Switch extraction fetcher to curl_cffi for Cloudflare-fronted sources #20 merged):
- Before this PR: 1/5 questions produced records, 3/5 hit hard errors (empty embedding / strict schema / mid-string truncation).
- After this PR: 4/5 questions produce records (mpox 11, cholera 13, US measles 8, ECDC foodborne 14). Africa CDC is skipped because the source PDF requires OCR — expected behavior.
- Total spend: ~32k input + ~~5.7k output tokens on gpt-4o-mini (~~$0.006).
Reviewer: confirm the generate_json contract change (tolerant of JSONDecodeError) is acceptable. Current callers (extract_facts_from_chunk, filtering pipeline) already handle missing keys via .get(...), so the empty-content fallback is safe. Future callers should not rely on generate_json raising on malformed JSON.
Reviewer: the bioscancast.llm.__init__ FilteringLLMClient / InsightLLMClient aliases are intentionally left alone — they exist because filtering and insight still have different generate_json signatures. Unifying those signatures is a separate architectural decision.

Closes #1, closes #11, closes #12.

embed_chunks sent chunk.text directly to the OpenAI embeddings API, which 400s on empty strings. PDF table chunks have empty .text and keep their content in .table_data, so any document with a table chunk broke real-OpenAI insight runs. Surfaced by the test/stage3-combined smoke run against the WHO mpox sitrep, which produces a table chunk on page 1 with no surrounding prose.

OpenAI's strict JSON schema mode requires every key in `properties` to appear in `required`; the optional fields use nullable types (`["string", "null"]`) which is the correct way to express optionality under strict mode. Without this fix every real OpenAI call from the insight extraction stage failed with a 400. Tests using FakeLLMClient did not catch it because the fake client doesn't validate the schema.

Two complementary fixes for the chunk-extractor failure mode where the model produced JSON longer than the LLM client's default 1024 output-token cap and the partial response broke pipeline.run() for the whole question: 1. Make the output-token cap a config knob (InsightConfig.extraction_max_output_tokens, default 4096) and thread it through pipeline -> chunk_extractor -> generate_json. The 1024 default was small enough that a single dense page (e.g. the ECDC CDTR) blew past it. 2. Make OpenAILLMClient.generate_json tolerant of unparseable model output: instead of raising json.JSONDecodeError, log the failure (model, finish_reason, head of raw text) and return an LLMResponse with empty content and the real token usage. The HTTP call succeeded and we paid for the tokens, so the budget tracker should see them; the caller already handles missing keys via .get(). Surfaced by scripts/test_extraction_insight.py on the ECDC CDTR PDF.

OpenAI's response_format={'type': 'json_object'} mode requires the word 'json' to appear somewhere in the messages, otherwise the API returns 400. The filter prompt didn't contain it, so every real OpenAI call from the filtering stage failed. Not caught earlier because filtering integration tests use FakeLLMClient and the pipeline defaults to llm_client=None (the fail-closed mode), so the real call path was never exercised.

bioscancast/filtering/llm_filter.py defined a local LLMClient Protocol with the same shape as bioscancast/llm/client.LLMClient. Two definitions of the same Protocol are a foot-gun (drift over time, ambiguous type-checker errors); collapsing to a single source of truth. Filtering modules now import from bioscancast.llm.client directly, matching the pattern the search stage already uses (bioscancast/stages/search_stage/{pipeline,query_decomposition}.py). Note: the bioscancast.llm.__init__ aliases (FilteringLLMClient / InsightLLMClient) are left alone — they exist because the filtering and insight Protocols still have different signatures (string prompt vs. system/user/schema). Unifying those signatures is a separate architectural decision worth its own PR.

smodee added 5 commits May 17, 2026 22:40

rapsoj approved these changes May 18, 2026

View reviewed changes

rapsoj merged commit 463bc1a into main May 18, 2026

rapsoj deleted the fix/openai-integration-bugs branch May 18, 2026 16:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix five OpenAI-integration bugs surfaced by end-to-end run#22

Fix five OpenAI-integration bugs surfaced by end-to-end run#22
rapsoj merged 5 commits into
mainfrom
fix/openai-integration-bugs

smodee commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

smodee commented May 17, 2026

Summary

Commits

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants