Skip to content

Fix five OpenAI-integration bugs surfaced by end-to-end run#22

Merged
rapsoj merged 5 commits into
mainfrom
fix/openai-integration-bugs
May 18, 2026
Merged

Fix five OpenAI-integration bugs surfaced by end-to-end run#22
rapsoj merged 5 commits into
mainfrom
fix/openai-integration-bugs

Conversation

@smodee
Copy link
Copy Markdown
Collaborator

@smodee smodee commented May 17, 2026

Summary

Five small, independent fixes for bugs that only show up when the insight and filtering stages actually call OpenAI. All were invisible to the existing test suite because it uses FakeLLMClient, which doesn't enforce the API's contracts.

Surfaced while running an end-to-end smoke test of extraction + insight against the seven sources in data/docling_eval/sources/ (using PRs #19 and #20 combined on a local test branch).

Commits

  1. d3bd0a2 — Render table chunks as text when embedding. PDF table chunks have empty .text (content lives in .table_data); embed_chunks was sending "" to the OpenAI embeddings API, which 400s. Now renders table rows inline for embedding.
  2. a041164 — List all extraction-schema fields as required (closes Bug: Insight extraction JSON schema rejected by OpenAI strict mode #12). OpenAI's strict: True JSON-schema mode requires every property in required. The optional fields already use nullable types, so this is just expanding the required array.
  3. 7a75372 — Survive truncated insight extraction responses. Two complementary fixes for the chunk-extractor failure mode where dense pages (e.g. ECDC CDTR) blew past the 1024 output-token cap and broke the whole question:
    • New InsightConfig.extraction_max_output_tokens (default 4096), threaded through pipeline → chunk_extractor → generate_json.
    • OpenAILLMClient.generate_json now catches json.JSONDecodeError, logs the failure with token usage + finish_reason, and returns an empty content dict with accurate token counts (so the budget tracker stays right).
  4. 97484df — Include 'json' keyword in filter prompt (closes Bug: LLM filter prompt missing 'json' keyword causes OpenAI 400 error #11). Required by response_format={'type': 'json_object'}; every real LLM-filter call was 400ing.
  5. af7844c — Drop duplicate LLMClient Protocol in filtering (closes Consolidate duplicate LLMClient Protocol definitions #1). filtering/llm_filter.py defined the same Protocol as bioscancast/llm/client.py. Collapsed to a single import.

Test plan

  • pytest bioscancast/tests/ — 178 passed, no regressions.
  • End-to-end run against the 7 sources in data/docling_eval/sources/ (script lives on a separate test branch, requires Hybrid PDF extraction: add DoclingTableRefiner (closes #16) #19 + Switch extraction fetcher to curl_cffi for Cloudflare-fronted sources #20 merged):
    • Before this PR: 1/5 questions produced records, 3/5 hit hard errors (empty embedding / strict schema / mid-string truncation).
    • After this PR: 4/5 questions produce records (mpox 11, cholera 13, US measles 8, ECDC foodborne 14). Africa CDC is skipped because the source PDF requires OCR — expected behavior.
    • Total spend: ~32k input + 5.7k output tokens on gpt-4o-mini ($0.006).
  • Reviewer: confirm the generate_json contract change (tolerant of JSONDecodeError) is acceptable. Current callers (extract_facts_from_chunk, filtering pipeline) already handle missing keys via .get(...), so the empty-content fallback is safe. Future callers should not rely on generate_json raising on malformed JSON.
  • Reviewer: the bioscancast.llm.__init__ FilteringLLMClient / InsightLLMClient aliases are intentionally left alone — they exist because filtering and insight still have different generate_json signatures. Unifying those signatures is a separate architectural decision.

Closes #1, closes #11, closes #12.

smodee added 5 commits May 17, 2026 22:40
embed_chunks sent chunk.text directly to the OpenAI embeddings API,
which 400s on empty strings. PDF table chunks have empty .text and
keep their content in .table_data, so any document with a table
chunk broke real-OpenAI insight runs.

Surfaced by the test/stage3-combined smoke run against the WHO mpox
sitrep, which produces a table chunk on page 1 with no surrounding
prose.
OpenAI's strict JSON schema mode requires every key in `properties`
to appear in `required`; the optional fields use nullable types
(`["string", "null"]`) which is the correct way to express
optionality under strict mode.

Without this fix every real OpenAI call from the insight extraction
stage failed with a 400. Tests using FakeLLMClient did not catch it
because the fake client doesn't validate the schema.
Two complementary fixes for the chunk-extractor failure mode where
the model produced JSON longer than the LLM client's default 1024
output-token cap and the partial response broke pipeline.run() for
the whole question:

1. Make the output-token cap a config knob
   (InsightConfig.extraction_max_output_tokens, default 4096) and
   thread it through pipeline -> chunk_extractor -> generate_json.
   The 1024 default was small enough that a single dense page
   (e.g. the ECDC CDTR) blew past it.

2. Make OpenAILLMClient.generate_json tolerant of unparseable model
   output: instead of raising json.JSONDecodeError, log the failure
   (model, finish_reason, head of raw text) and return an LLMResponse
   with empty content and the real token usage. The HTTP call
   succeeded and we paid for the tokens, so the budget tracker should
   see them; the caller already handles missing keys via .get().

Surfaced by scripts/test_extraction_insight.py on the ECDC CDTR PDF.
OpenAI's response_format={'type': 'json_object'} mode requires the
word 'json' to appear somewhere in the messages, otherwise the API
returns 400. The filter prompt didn't contain it, so every real
OpenAI call from the filtering stage failed.

Not caught earlier because filtering integration tests use
FakeLLMClient and the pipeline defaults to llm_client=None (the
fail-closed mode), so the real call path was never exercised.
bioscancast/filtering/llm_filter.py defined a local LLMClient
Protocol with the same shape as bioscancast/llm/client.LLMClient.
Two definitions of the same Protocol are a foot-gun (drift over
time, ambiguous type-checker errors); collapsing to a single
source of truth.

Filtering modules now import from bioscancast.llm.client directly,
matching the pattern the search stage already uses
(bioscancast/stages/search_stage/{pipeline,query_decomposition}.py).

Note: the bioscancast.llm.__init__ aliases (FilteringLLMClient /
InsightLLMClient) are left alone — they exist because the filtering
and insight Protocols still have different signatures (string prompt
vs. system/user/schema). Unifying those signatures is a separate
architectural decision worth its own PR.
@rapsoj rapsoj merged commit 463bc1a into main May 18, 2026
@rapsoj rapsoj deleted the fix/openai-integration-bugs branch May 18, 2026 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants