feat: bastanteo-driven improvements (multi-file async, hybrid matcher, job filtering) #10
Merged
End-to-end work surfaced by running a real-world Spanish banking
bastanteo POC against flydesk-idp. Each change is independently
useful but they ship together because the POC needed all of them.
## Extraction
- ``MultimodalExtractor``: ``max_tokens=8192`` (was the 4096 default),
schema compression, and auto-retry of suspicious empty arrays. Fixes
the silent ``rows=[]`` truncation that bit multi-row personas /
apoderamientos schemas. Retry path uses a new
``extract_retry_arrays`` prompt template (English, registered in
PromptCatalog) so language stays consistent.
- ``extract.yaml`` v1.1: explicit ARRAY FIELDS directive with examples
for personas/line_items/signatories — pushes the model to emit
every row instead of summarising.
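A minimal sketch of the empty-array retry idea described above, not the actual ``MultimodalExtractor`` code. It assumes a JSON-Schema-like dict, treats any declared array that comes back empty as "suspicious", and takes the two model calls as plain callables; ``extract_with_array_retry``, ``suspicious_empty_arrays``, and the callable signatures are hypothetical names for illustration.

```python
from typing import Any, Callable

Schema = dict[str, Any]
Result = dict[str, Any]


def array_fields(schema: Schema) -> list[str]:
    """Names of top-level properties declared as arrays."""
    props = schema.get("properties", {})
    return [name for name, spec in props.items() if spec.get("type") == "array"]


def suspicious_empty_arrays(schema: Schema, result: Result) -> list[str]:
    """Array fields that are missing or empty in the result.

    Assumption: every empty declared array counts as suspicious.
    """
    return [name for name in array_fields(schema) if not result.get(name)]


def extract_with_array_retry(
    schema: Schema,
    extract: Callable[[Schema], Result],
    retry_arrays: Callable[[Schema, list[str]], Result],
) -> Result:
    """Run one extraction, then retry once for the arrays that came back empty."""
    result = extract(schema)
    empty = suspicious_empty_arrays(schema, result)
    if not empty:
        return result

    # Hypothetical second call, standing in for the ``extract_retry_arrays`` prompt.
    retried = retry_arrays(schema, empty)

    merged = dict(result)
    for name in empty:
        # Only overwrite when the retry actually produced rows.
        if retried.get(name):
            merged[name] = retried[name]
    return merged
```

A single retry keeps the extra cost bounded; a genuinely empty array simply stays empty after the second pass.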
## Bbox refinement
- ``HybridValueMatcher`` (new): rapidfuzz first (free, ms), LLM only
for the residual the deterministic pass cannot resolve. Default
matcher; ``llm`` and ``fuzzy`` remain available via
``FLYDESK_IDP_BBOX_REFINE_MATCHER``.
- ``BboxRefiner`` is now idempotent — fields with
``bbox.source ∈ {PDF_TEXT, OCR}`` are skipped on re-run, so the
out-of-band worker does not duplicate inline work.
- ``JobWorker`` mutates ``stages.bbox_refine=False`` before calling
the orchestrator on the async path. The dedicated
``BboxRefineWorker`` (triggered by the
``IDPBboxRefineRequested`` EDA event) is now the single refine
path on async. Eliminates the wasted 300s + the misleading
``Pipeline node failed [bbox_refine] error=unknown`` log on
multi-PDF bundles. Sync requests are unchanged.
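The hybrid matching and the idempotent skip can be sketched roughly as follows. This is an illustration under stated assumptions, not the ``HybridValueMatcher`` implementation: the standard library's ``difflib.SequenceMatcher`` is swapped in for rapidfuzz to keep the example dependency-free, and the ``Field`` shape, the 0.8 threshold, and the ``llm_match`` callable are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional

# Sources that mean a field is already grounded (taken from the PR text).
GROUNDED_SOURCES = {"PDF_TEXT", "OCR"}


@dataclass
class Field:
    name: str
    value: str
    bbox_source: Optional[str] = None  # None means not grounded yet


def fuzzy_match(value: str, candidates: list[str], threshold: float = 0.8) -> Optional[int]:
    """Index of the best candidate, or None if nothing reaches the threshold."""
    best_idx, best_score = None, 0.0
    for i, cand in enumerate(candidates):
        score = SequenceMatcher(None, value.lower(), cand.lower()).ratio()
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx if best_score >= threshold else None


def hybrid_match(
    fields: list[Field],
    candidates: list[str],
    llm_match: Callable[[list[Field], list[str]], dict[str, int]],
) -> dict[str, int]:
    """Map field name to candidate index: deterministic pass first, LLM for the rest."""
    # Idempotency: fields grounded by an earlier run are skipped.
    pending = [f for f in fields if f.bbox_source not in GROUNDED_SOURCES]

    matches: dict[str, int] = {}
    residual: list[Field] = []
    for f in pending:
        idx = fuzzy_match(f.value, candidates)
        if idx is None:
            residual.append(f)
        else:
            matches[f.name] = idx

    # The LLM is called only when the fuzzy pass left something unresolved.
    if residual:
        matches.update(llm_match(residual, candidates))
    return matches
```

When every field resolves in the fuzzy pass, ``llm_match`` is never invoked, which is the case the "all-fuzzy-hits skips LLM" test in this PR describes.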
## Async API
- ``SubmitJobHandler`` accepts ``documents: list[DocumentInput]``
for multi-file submissions in a single async job.
- ``GET /api/v1/jobs`` (new): list endpoint with filters for status,
bbox_refine_status, created_after/before, idempotency_key. Backed
by a new ``ListJobsQuery``/``ListJobsHandler`` pair and a typed
``JobListResponse``.
- ``GetJobResultHandler`` long-poll exits cleanly on
``bbox_refine_status ∈ {succeeded, failed}`` — fixes hangs in
``PARTIAL_SUCCEEDED → SUCCEEDED`` transitions.
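As a client-side illustration of the list endpoint, the sketch below builds a ``GET /api/v1/jobs`` URL from the filter names listed above. The helper ``build_list_jobs_url`` is hypothetical, comma-separated values follow the "CSV" note in the PR description, and ISO-8601 timestamps are an assumption about what the server accepts.

```python
from datetime import datetime
from typing import Optional, Sequence
from urllib.parse import urlencode


def build_list_jobs_url(
    base_url: str,
    *,
    status: Optional[Sequence[str]] = None,
    bbox_refine_status: Optional[Sequence[str]] = None,
    created_after: Optional[datetime] = None,
    created_before: Optional[datetime] = None,
    idempotency_key: Optional[str] = None,
) -> str:
    """Build the list-jobs URL, including only the filters that were provided."""
    params: dict[str, str] = {}
    if status:
        params["status"] = ",".join(status)  # CSV, per the PR description
    if bbox_refine_status:
        params["bbox_refine_status"] = ",".join(bbox_refine_status)
    if created_after is not None:
        params["created_after"] = created_after.isoformat()  # assumed format
    if created_before is not None:
        params["created_before"] = created_before.isoformat()  # assumed format
    if idempotency_key:
        params["idempotency_key"] = idempotency_key

    query = urlencode(params)  # percent-encodes commas, colons, and plus signs
    return f"{base_url}/api/v1/jobs" + (f"?{query}" if query else "")
```

For example, ``build_list_jobs_url("http://localhost:8000", status=["SUCCEEDED", "PARTIAL_SUCCEEDED"])`` yields a URL whose ``status`` parameter carries both values in one comma-separated field; the host is a placeholder.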
## Timeouts
- Per-stage env-tunable timeouts in ``IDPSettings``:
``extract_timeout_s=600``, ``judge_timeout_s=300``,
``bbox_refine_inline_timeout_s=300``,
``classifier_timeout_s=180``, ``splitter_timeout_s=180``,
``judge_escalation_timeout_s=600``. Orchestrator reads each
setting per pipeline node instead of hardcoded constants.
- ``async_timeout_s`` default raised 300→1200 so multi-file bundles
and the empty-array auto-retry don't hit a hard wall.
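The per-stage timeout pattern might look like the sketch below. The field names and defaults come from the list above; everything else is an assumption for illustration: the ``StageTimeouts`` class, the ``FLYDESK_IDP_`` prefix applied to each upper-cased field name (inferred from ``FLYDESK_IDP_BBOX_REFINE_MATCHER``), and the ``run_node`` helper. The real ``IDPSettings`` may load and name things differently.

```python
import asyncio
import os
from dataclasses import dataclass, fields
from typing import Any, Awaitable, Mapping, Optional

ENV_PREFIX = "FLYDESK_IDP_"  # assumed prefix for the timeout variables


@dataclass(frozen=True)
class StageTimeouts:
    extract_timeout_s: float = 600
    judge_timeout_s: float = 300
    bbox_refine_inline_timeout_s: float = 300
    classifier_timeout_s: float = 180
    splitter_timeout_s: float = 180
    judge_escalation_timeout_s: float = 600

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "StageTimeouts":
        """Override defaults from variables such as FLYDESK_IDP_EXTRACT_TIMEOUT_S."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(ENV_PREFIX + f.name.upper())
            if raw is not None:
                overrides[f.name] = float(raw)
        return cls(**overrides)


async def run_node(name: str, work: Awaitable[Any], timeouts: StageTimeouts) -> Any:
    """Run one pipeline node under its own timeout instead of a shared constant."""
    limit = getattr(timeouts, f"{name}_timeout_s")
    return await asyncio.wait_for(work, timeout=limit)
```

Looking up the limit by node name keeps the orchestrator free of hardcoded constants; a node that exceeds its limit raises ``asyncio.TimeoutError`` from ``asyncio.wait_for``.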
## Migration
- ``20260515_0003_widen_job_status``: widens
``extraction_jobs.status`` varchar to fit
``PARTIAL_SUCCEEDED`` + ``REFINING_BBOXES``.
## Tests
- ``test_hybrid_matcher.py`` — 4 tests covering empty input,
all-fuzzy-hits skips LLM, partial-fuzzy forwards residual,
all-fuzzy-misses forwards everything.
- ``test_submit_job_handler.py`` — 5 tests for single + multi-file
paths.
- ``test_list_jobs_handler.py`` — 3 tests for filter mapping.
- Full unit suite: 223 passed, 1 skipped.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
fix(lint): drop unused idx loop var and unused typing.Any import
Two ruff errors that landed in PR #10 CI:
- ``src/flydesk_idp/core/services/bbox/bbox_refiner.py:101`` (B007):
``idx`` unused inside the body of the already-grounded counter loop.
Switched to ``for field in leaves`` since the index was never read.
- ``tests/unit/test_hybrid_matcher.py:19`` (F401): ``typing.Any`` was
left from an earlier scaffold and is no longer referenced.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
fix(lint): apply ruff format to new files
PR #10 second CI run flagged ``ruff format --check`` differences on
the three new files added in the main commit. Reformatted in place;
no semantic changes.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
ancongui added a commit that referenced this pull request on May 31, 2026
…, job filtering) (#10)
Squashed commits:
* feat: bastanteo-driven improvements (multi-file async, hybrid matcher, job filtering)
* fix(lint): drop unused idx loop var and unused typing.Any import
* fix(lint): apply ruff format to new files
Co-authored-by: ancongui <andres.contreras@soon.es>
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
End-to-end work surfaced by running a real-world Spanish banking bastanteo + KYB POC against flydesk-idp. Each change is independently useful but they ship together because the POC needed all of them.
## Extraction reliability
- ``MultimodalExtractor`` now passes ``max_tokens=8192`` to the underlying agent, adds schema compression, and auto-retries suspicious empty arrays. Fixes the silent ``rows=[]`` truncation that bit multi-row personas/apoderamientos schemas.
- New ``extract_retry_arrays`` prompt (English, registered in ``PromptCatalog``) so the retry intention stays language-consistent.
- ``extract.yaml`` v1.1: explicit ARRAY FIELDS directive with concrete examples.
## Bbox refinement architecture
- ``HybridValueMatcher`` (new): rapidfuzz first (free, ms-scale), LLM only for the residual. Default; ``llm`` and ``fuzzy`` remain available via ``FLYDESK_IDP_BBOX_REFINE_MATCHER``.
- ``BboxRefiner`` is idempotent: fields with ``bbox.source ∈ {PDF_TEXT, OCR}`` skip on re-run.
- ``JobWorker`` skips inline ``bbox_refine`` on async: it mutates ``stages.bbox_refine=False`` before calling the orchestrator and lets the dedicated ``BboxRefineWorker`` (triggered by ``IDPBboxRefineRequested``) be the single refine path. Eliminates the wasted 300s + misleading ``Pipeline node failed`` log on multi-PDF bundles. Sync requests are untouched.
## Async API
- ``SubmitJobHandler`` accepts ``documents: list[DocumentInput]`` for multi-file submissions in a single job.
- ``GET /api/v1/jobs`` (new): list with filters for status (CSV), bbox_refine_status (CSV), created_after/before, idempotency_key. Backed by ``ListJobsQuery``/``ListJobsHandler`` + typed ``JobListResponse``.
- ``GetJobResultHandler`` long-poll exits on ``bbox_refine_status ∈ {succeeded, failed}``, which fixes hangs in ``PARTIAL_SUCCEEDED → SUCCEEDED`` transitions.
## Per-stage timeouts
- ``IDPSettings`` now exposes env-tunable timeouts per pipeline node (``extract_timeout_s``, ``judge_timeout_s``, ``bbox_refine_inline_timeout_s``, ``classifier_timeout_s``, ``splitter_timeout_s``, ``judge_escalation_timeout_s``). Orchestrator reads them per node.
- ``async_timeout_s`` raised 300 → 1200 so multi-file bundles + the empty-array auto-retry don't hit a hard wall.
## Migration
- ``20260515_0003_widen_job_status`` widens ``extraction_jobs.status`` to fit ``PARTIAL_SUCCEEDED`` + ``REFINING_BBOXES``.
## Test plan
- ``tests/unit/test_hybrid_matcher.py``: empty input · all-fuzzy-hits skips LLM · partial-fuzzy forwards residual · all-fuzzy-misses forwards everything
- ``tests/unit/test_submit_job_handler.py``: single-file + multi-file paths (5 tests)
- ``tests/unit/test_list_jobs_handler.py``: filter mapping (3 tests)
- … ``SUCCEEDED`` with grounded bboxes (61% PDF-text + OCR anchored), out-of-band refinement path validated under load

🤖 Generated with Claude Code