
feat: bastanteo-driven improvements (multi-file async, hybrid matcher, job filtering)#10

Merged
ancongui merged 3 commits into main from feat/bastanteo-driven-improvements on May 15, 2026

@ancongui
Contributor

Summary

End-to-end work surfaced by running a real-world Spanish banking bastanteo + KYB POC against flydesk-idp. Each change is independently useful but they ship together because the POC needed all of them.

Extraction reliability

  • MultimodalExtractor now passes max_tokens=8192 to the underlying agent + adds schema compression + auto-retry of suspicious empty arrays. Fixes the silent rows=[] truncation that bit multi-row personas/apoderamientos schemas.
  • New extract_retry_arrays prompt (English, registered in PromptCatalog) so the retry intention stays language-consistent.
  • extract.yaml v1.1: explicit ARRAY FIELDS directive with concrete examples.
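
The empty-array auto-retry can be pictured roughly as follows. This is a minimal sketch, not the `MultimodalExtractor` code: the helper names, the `extract` callable, and the rule "any schema-declared array that comes back empty is suspicious" are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any, Callable


def find_suspicious_empty_arrays(schema: dict[str, Any], result: dict[str, Any]) -> list[str]:
    """Names of top-level array fields declared in the schema but empty or missing in the result."""
    props = schema.get("properties", {})
    return [
        name
        for name, spec in props.items()
        if spec.get("type") == "array" and not result.get(name)
    ]


def extract_with_array_retry(
    schema: dict[str, Any],
    extract: Callable[[list[str] | None], dict[str, Any]],  # hypothetical extractor hook
    max_retries: int = 1,
) -> dict[str, Any]:
    """Run extraction once, then re-ask only for arrays that came back empty."""
    result = extract(None)  # None means "extract every field"
    for _ in range(max_retries):
        empty = find_suspicious_empty_arrays(schema, result)
        if not empty:
            break
        retry = extract(empty)  # focused retry on the empty arrays only
        for name in empty:
            if retry.get(name):
                result[name] = retry[name]
    return result
```

A retry that still returns an empty array leaves the field empty, so a document that genuinely has no rows costs one extra call and nothing more.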

Bbox refinement architecture

  • HybridValueMatcher (new): rapidfuzz first (free, ms-scale), LLM only for the residual. Default; llm and fuzzy remain available via FLYDESK_IDP_BBOX_REFINE_MATCHER.
  • BboxRefiner is idempotent — fields with bbox.source ∈ {PDF_TEXT, OCR} skip on re-run.
  • JobWorker skips inline bbox_refine on async — mutates stages.bbox_refine=False before calling the orchestrator and lets the dedicated BboxRefineWorker (triggered by IDPBboxRefineRequested) be the single refine path. Eliminates the wasted 300s + misleading Pipeline node failed log on multi-PDF bundles. Sync requests are untouched.
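
The fuzzy-first, LLM-for-the-residual flow might look like the sketch below. It is illustrative only: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for rapidfuzz so that it runs without dependencies, and `hybrid_match`, `fuzzy_match`, the `llm_match` callable, and the 0.9 threshold are hypothetical, not the `HybridValueMatcher` API.

```python
from __future__ import annotations

from difflib import SequenceMatcher
from typing import Callable

Match = tuple[str, float]  # (matched token text, similarity score in [0, 1])


def fuzzy_match(value: str, tokens: list[str], threshold: float) -> Match | None:
    """Best token whose case-insensitive similarity to `value` reaches the threshold."""
    best: Match | None = None
    for tok in tokens:
        score = SequenceMatcher(None, value.lower(), tok.lower()).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (tok, score)
    return best


def hybrid_match(
    values: dict[str, str],
    tokens: list[str],
    llm_match: Callable[[dict[str, str], list[str]], dict[str, Match]],  # hypothetical LLM hook
    threshold: float = 0.9,  # assumed cutoff
) -> dict[str, Match]:
    """Resolve what the deterministic pass can; forward only the residual to the LLM."""
    resolved: dict[str, Match] = {}
    residual: dict[str, str] = {}
    for field, value in values.items():
        hit = fuzzy_match(value, tokens, threshold)
        if hit is not None:
            resolved[field] = hit
        else:
            residual[field] = value
    if residual:  # the LLM is skipped entirely when every field was matched
        resolved.update(llm_match(residual, tokens))
    return resolved
```

When every field is a fuzzy hit, the LLM callable is never invoked, which is where the cost saving comes from.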

Async API

  • SubmitJobHandler accepts documents: list[DocumentInput] for multi-file submissions in a single job.
  • GET /api/v1/jobs (new) — list with filters: status (CSV), bbox_refine_status (CSV), created_after/before, idempotency_key. Backed by ListJobsQuery/ListJobsHandler + typed JobListResponse.
  • GetJobResultHandler long-poll exits on bbox_refine_status ∈ {succeeded, failed} — fixes hangs in PARTIAL_SUCCEEDED → SUCCEEDED transitions.
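
To make the filter set concrete, here is one way the query parameters of `GET /api/v1/jobs` could be mapped to a typed object. This is a sketch under assumptions: `ListJobsFilters` and `parse_filters` are hypothetical names (not the real `ListJobsQuery`), and ISO 8601 timestamps are assumed for `created_after`/`created_before`.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ListJobsFilters:
    """Hypothetical typed view of the list-endpoint filters."""

    status: tuple[str, ...] = ()
    bbox_refine_status: tuple[str, ...] = ()
    created_after: datetime | None = None
    created_before: datetime | None = None
    idempotency_key: str | None = None


def _csv(raw: str | None) -> tuple[str, ...]:
    """Split a comma-separated value, dropping blanks and surrounding whitespace."""
    if not raw:
        return ()
    return tuple(part.strip() for part in raw.split(",") if part.strip())


def _dt(raw: str | None) -> datetime | None:
    """Parse an ISO 8601 timestamp; the wire format is an assumption."""
    return datetime.fromisoformat(raw) if raw else None


def parse_filters(params: dict[str, str]) -> ListJobsFilters:
    """Map raw query parameters to the typed filter object."""
    return ListJobsFilters(
        status=_csv(params.get("status")),
        bbox_refine_status=_csv(params.get("bbox_refine_status")),
        created_after=_dt(params.get("created_after")),
        created_before=_dt(params.get("created_before")),
        idempotency_key=params.get("idempotency_key") or None,
    )
```

For example, `parse_filters({"status": "SUCCEEDED,PARTIAL_SUCCEEDED"})` yields a two-element `status` tuple and leaves every other filter unset.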

Per-stage timeouts

  • IDPSettings now exposes env-tunable timeouts per pipeline node (extract_timeout_s, judge_timeout_s, bbox_refine_inline_timeout_s, classifier_timeout_s, splitter_timeout_s, judge_escalation_timeout_s). Orchestrator reads them per node.
  • async_timeout_s raised 300 → 1200 so multi-file bundles + the empty-array auto-retry don't hit a hard wall.

Migration

  • 20260515_0003_widen_job_status widens extraction_jobs.status to fit PARTIAL_SUCCEEDED + REFINING_BBOXES.

Test plan

  • tests/unit/test_hybrid_matcher.py — empty input · all-fuzzy-hits skips LLM · partial-fuzzy forwards residual · all-fuzzy-misses forwards everything
  • tests/unit/test_submit_job_handler.py — single-file + multi-file paths (5 tests)
  • tests/unit/test_list_jobs_handler.py — filter mapping (3 tests)
  • Full unit suite: 223 passed, 1 skipped
  • End-to-end via bastanteo POC: 7-PDF bundle → SUCCEEDED with grounded bboxes (61% PDF-text + OCR anchored), out-of-band refinement path validated under load

🤖 Generated with Claude Code

ancongui and others added 3 commits May 15, 2026 17:55
…, job filtering)

End-to-end work surfaced by running a real-world Spanish banking
bastanteo POC against flydesk-idp. Each change is independently
useful but they ship together because the POC needed all of them.

## Extraction

- ``MultimodalExtractor``: ``max_tokens=8192`` (was the 4096 default)
  + schema compression + auto-retry of suspicious empty arrays. Fixes
  the silent ``rows=[]`` truncation that bit multi-row personas /
  apoderamientos schemas. Retry path uses a new
  ``extract_retry_arrays`` prompt template (English, registered in
  PromptCatalog) so language stays consistent.
- ``extract.yaml`` v1.1: explicit ARRAY FIELDS directive with examples
  for personas/line_items/signatories — pushes the model to emit
  every row instead of summarising.

## Bbox refinement

- ``HybridValueMatcher`` (new): rapidfuzz first (free, ms), LLM only
  for the residual the deterministic pass cannot resolve. Default
  matcher; ``llm`` and ``fuzzy`` remain available via
  ``FLYDESK_IDP_BBOX_REFINE_MATCHER``.
- ``BboxRefiner`` is now idempotent — fields with
  ``bbox.source ∈ {PDF_TEXT, OCR}`` are skipped on re-run, so the
  out-of-band worker does not duplicate inline work.
- ``JobWorker`` mutates ``stages.bbox_refine=False`` before calling
  the orchestrator on the async path. The dedicated
  ``BboxRefineWorker`` (triggered by the
  ``IDPBboxRefineRequested`` EDA event) is now the single refine
  path on async. Eliminates the wasted 300s + the misleading
  ``Pipeline node failed [bbox_refine] error=unknown`` log on
  multi-PDF bundles. Sync requests are unchanged.
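
A minimal sketch of the two behaviours described above: skipping already-grounded fields, and disabling the inline stage on the async path. The `Field` and `Stages` types, the `"LLM"` source value, and the function names are assumptions for illustration. The commit says `JobWorker` mutates the flag in place; the sketch returns a copy instead, which is a deliberate simplification.

```python
from __future__ import annotations

from dataclasses import dataclass, replace

# Sources that count as already grounded, per the commit message.
GROUNDED_SOURCES = {"PDF_TEXT", "OCR"}


@dataclass
class Field:
    """Hypothetical leaf field carrying the origin of its bounding box."""

    name: str
    bbox_source: str  # e.g. "LLM" (assumed), "PDF_TEXT", "OCR"


def fields_needing_refine(leaves: list[Field]) -> list[Field]:
    """Idempotency guard: a re-run only touches fields that are not grounded yet."""
    return [f for f in leaves if f.bbox_source not in GROUNDED_SOURCES]


@dataclass(frozen=True)
class Stages:
    """Hypothetical subset of the pipeline stage flags."""

    extract: bool = True
    bbox_refine: bool = True


def stages_for(stages: Stages, is_async: bool) -> Stages:
    """On the async path the inline refine is switched off; sync requests are untouched."""
    return replace(stages, bbox_refine=False) if is_async else stages
```

With the guard in place, running the out-of-band worker after an inline pass costs nothing for fields that were already anchored.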

## Async API

- ``SubmitJobHandler`` accepts ``documents: list[DocumentInput]``
  for multi-file submissions in a single async job.
- ``GET /api/v1/jobs`` (new): list endpoint with filters for status,
  bbox_refine_status, created_after/before, idempotency_key. Backed
  by a new ``ListJobsQuery``/``ListJobsHandler`` pair and a typed
  ``JobListResponse``.
- ``GetJobResultHandler`` long-poll exits cleanly on
  ``bbox_refine_status ∈ {succeeded, failed}`` — fixes hangs in
  ``PARTIAL_SUCCEEDED → SUCCEEDED`` transitions.
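
The long-poll exit condition can be sketched as below. This is an assumption-laden illustration, not `GetJobResultHandler`: `wait_for_result`, the `fetch_job` callable, the dict-shaped job, and the polling interval are hypothetical, and only the refine-status condition named in the commit is modelled (a real handler would presumably also handle jobs that fail before refinement).

```python
from __future__ import annotations

import asyncio
from typing import Any, Awaitable, Callable

# Refine statuses that end the wait, per the commit message.
TERMINAL_REFINE = {"succeeded", "failed"}


async def wait_for_result(
    fetch_job: Callable[[], Awaitable[dict[str, Any]]],  # hypothetical job lookup
    timeout_s: float,
    interval_s: float = 0.01,
) -> dict[str, Any]:
    """Poll until bbox refinement reaches a terminal status or the deadline passes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while True:
        job = await fetch_job()
        if job.get("bbox_refine_status") in TERMINAL_REFINE:
            return job  # refinement finished, successfully or not
        if loop.time() >= deadline:
            return job  # give up and return the latest snapshot
        await asyncio.sleep(interval_s)
```

Treating `failed` as terminal matters: without it, a job whose refinement failed would keep the client waiting until the deadline.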

## Timeouts

- Per-stage env-tunable timeouts in ``IDPSettings``:
  ``extract_timeout_s=600``, ``judge_timeout_s=300``,
  ``bbox_refine_inline_timeout_s=300``,
  ``classifier_timeout_s=180``, ``splitter_timeout_s=180``,
  ``judge_escalation_timeout_s=600``. Orchestrator reads each
  setting per pipeline node instead of hardcoded constants.
- ``async_timeout_s`` raised default 300→1200 so multi-file bundles
  + the empty-array auto-retry don't hit a hard wall.
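
As an illustration of per-node, env-tunable timeouts, the sketch below resolves a timeout for each pipeline node and enforces it with `asyncio.wait_for`. The defaults are the ones listed above; the environment variable naming (`FLYDESK_IDP_<NODE>_TIMEOUT_S`) and the helpers `timeout_for` and `run_node` are assumptions, not the actual `IDPSettings` or orchestrator code.

```python
from __future__ import annotations

import asyncio
import os
from typing import Any, Awaitable, Callable, Mapping

# Defaults taken from the commit message above.
DEFAULT_TIMEOUTS_S = {
    "extract": 600,
    "judge": 300,
    "bbox_refine_inline": 300,
    "classifier": 180,
    "splitter": 180,
    "judge_escalation": 600,
}


def timeout_for(node: str, env: Mapping[str, str] = os.environ) -> float:
    """Resolve a node's timeout: env override first (assumed naming), then the default."""
    raw = env.get(f"FLYDESK_IDP_{node.upper()}_TIMEOUT_S")
    return float(raw) if raw else float(DEFAULT_TIMEOUTS_S[node])


async def run_node(
    node: str,
    coro_fn: Callable[[], Awaitable[Any]],
    env: Mapping[str, str] = os.environ,
) -> Any:
    """Run one pipeline node under its own timeout instead of a shared constant."""
    return await asyncio.wait_for(coro_fn(), timeout=timeout_for(node, env))
```

A node that overruns raises `asyncio.TimeoutError`, so the caller can decide per node whether to fail the job or degrade gracefully.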

## Migration

- ``20260515_0003_widen_job_status``: widens
  ``extraction_jobs.status`` varchar to fit
  ``PARTIAL_SUCCEEDED`` + ``REFINING_BBOXES``.

## Tests

- ``test_hybrid_matcher.py`` — 4 tests covering empty input,
  all-fuzzy-hits skips LLM, partial-fuzzy forwards residual,
  all-fuzzy-misses forwards everything.
- ``test_submit_job_handler.py`` — 5 tests for single + multi-file
  paths.
- ``test_list_jobs_handler.py`` — 3 tests for filter mapping.
- Full unit suite: 223 passed, 1 skipped.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

fix(lint): drop unused idx loop var and unused typing.Any import

Two ruff errors that landed in PR #10 CI:

- ``src/flydesk_idp/core/services/bbox/bbox_refiner.py:101`` —
  B007: ``idx`` unused inside the body of the already-grounded
  counter loop. Switched to ``for field in leaves`` since the index
  was never read.
- ``tests/unit/test_hybrid_matcher.py:19`` — F401: ``typing.Any``
  was left from an earlier scaffold and is no longer referenced.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

fix(lint): apply ruff format to new files

PR #10 second CI run flagged ``ruff format --check`` differences on
the three new files added in the main commit. Reformatted in place;
no semantic changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ancongui ancongui merged commit b31b902 into main May 15, 2026
4 checks passed
@ancongui ancongui deleted the feat/bastanteo-driven-improvements branch May 15, 2026 16:03
ancongui added a commit that referenced this pull request May 31, 2026
…, job filtering) (#10)

* feat: bastanteo-driven improvements (multi-file async, hybrid matcher, job filtering)

* fix(lint): drop unused idx loop var and unused typing.Any import

* fix(lint): apply ruff format to new files

Co-authored-by: ancongui <andres.contreras@soon.es>
Co-authored-by: Claude <noreply@anthropic.com>