Fix five OpenAI-integration bugs surfaced by end-to-end run#22
Merged
Conversation
embed_chunks sent chunk.text directly to the OpenAI embeddings API, which 400s on empty strings. PDF table chunks have empty .text and keep their content in .table_data, so any document with a table chunk broke real-OpenAI insight runs. Surfaced by the test/stage3-combined smoke run against the WHO mpox sitrep, which produces a table chunk on page 1 with no surrounding prose.
OpenAI's strict JSON schema mode requires every key in `properties` to appear in `required`; the optional fields use nullable types (`["string", "null"]`) which is the correct way to express optionality under strict mode. Without this fix every real OpenAI call from the insight extraction stage failed with a 400. Tests using FakeLLMClient did not catch it because the fake client doesn't validate the schema.
Two complementary fixes for the chunk-extractor failure mode where the model produced JSON longer than the LLM client's default 1024 output-token cap and the partial response broke pipeline.run() for the whole question: 1. Make the output-token cap a config knob (InsightConfig.extraction_max_output_tokens, default 4096) and thread it through pipeline -> chunk_extractor -> generate_json. The 1024 default was small enough that a single dense page (e.g. the ECDC CDTR) blew past it. 2. Make OpenAILLMClient.generate_json tolerant of unparseable model output: instead of raising json.JSONDecodeError, log the failure (model, finish_reason, head of raw text) and return an LLMResponse with empty content and the real token usage. The HTTP call succeeded and we paid for the tokens, so the budget tracker should see them; the caller already handles missing keys via .get(). Surfaced by scripts/test_extraction_insight.py on the ECDC CDTR PDF.
OpenAI's response_format={'type': 'json_object'} mode requires the
word 'json' to appear somewhere in the messages, otherwise the API
returns 400. The filter prompt didn't contain it, so every real
OpenAI call from the filtering stage failed.
Not caught earlier because filtering integration tests use
FakeLLMClient and the pipeline defaults to llm_client=None (the
fail-closed mode), so the real call path was never exercised.
bioscancast/filtering/llm_filter.py defined a local LLMClient
Protocol with the same shape as bioscancast/llm/client.LLMClient.
Two definitions of the same Protocol are a foot-gun (drift over
time, ambiguous type-checker errors); collapsing to a single
source of truth.
Filtering modules now import from bioscancast.llm.client directly,
matching the pattern the search stage already uses
(bioscancast/stages/search_stage/{pipeline,query_decomposition}.py).
Note: the bioscancast.llm.__init__ aliases (FilteringLLMClient /
InsightLLMClient) are left alone — they exist because the filtering
and insight Protocols still have different signatures (string prompt
vs. system/user/schema). Unifying those signatures is a separate
architectural decision worth its own PR.
rapsoj
approved these changes
May 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Five small, independent fixes for bugs that only show up when the insight and filtering stages actually call OpenAI. All were invisible to the existing test suite because it uses
FakeLLMClient, which doesn't enforce the API's contracts.Surfaced while running an end-to-end smoke test of extraction + insight against the seven sources in
data/docling_eval/sources/(using PRs #19 and #20 combined on a local test branch).Commits
d3bd0a2— Render table chunks as text when embedding. PDF table chunks have empty.text(content lives in.table_data);embed_chunkswas sending""to the OpenAI embeddings API, which 400s. Now renders table rows inline for embedding.a041164— List all extraction-schema fields as required (closes Bug: Insight extraction JSON schema rejected by OpenAI strict mode #12). OpenAI'sstrict: TrueJSON-schema mode requires every property inrequired. The optional fields already use nullable types, so this is just expanding therequiredarray.7a75372— Survive truncated insight extraction responses. Two complementary fixes for the chunk-extractor failure mode where dense pages (e.g. ECDC CDTR) blew past the 1024 output-token cap and broke the whole question:InsightConfig.extraction_max_output_tokens(default 4096), threaded throughpipeline → chunk_extractor → generate_json.OpenAILLMClient.generate_jsonnow catchesjson.JSONDecodeError, logs the failure with token usage + finish_reason, and returns an empty content dict with accurate token counts (so the budget tracker stays right).97484df— Include 'json' keyword in filter prompt (closes Bug: LLM filter prompt missing 'json' keyword causes OpenAI 400 error #11). Required byresponse_format={'type': 'json_object'}; every real LLM-filter call was 400ing.af7844c— Drop duplicate LLMClient Protocol in filtering (closes Consolidate duplicate LLMClient Protocol definitions #1).filtering/llm_filter.pydefined the same Protocol asbioscancast/llm/client.py. Collapsed to a single import.Test plan
pytest bioscancast/tests/— 178 passed, no regressions.data/docling_eval/sources/(script lives on a separate test branch, requires Hybrid PDF extraction: add DoclingTableRefiner (closes #16) #19 + Switch extraction fetcher to curl_cffi for Cloudflare-fronted sources #20 merged):5.7k output tokens on gpt-4o-mini ($0.006).generate_jsoncontract change (tolerant ofJSONDecodeError) is acceptable. Current callers (extract_facts_from_chunk, filtering pipeline) already handle missing keys via.get(...), so the empty-content fallback is safe. Future callers should not rely ongenerate_jsonraising on malformed JSON.bioscancast.llm.__init__FilteringLLMClient/InsightLLMClientaliases are intentionally left alone — they exist because filtering and insight still have differentgenerate_jsonsignatures. Unifying those signatures is a separate architectural decision.Closes #1, closes #11, closes #12.