feat(extract): add document extraction pipeline with OCR and LLM analysis #475
Merged
Force-pushed from 0e2df39 to 86d4eba
Add the text extraction foundation for the document extraction pipeline (#200). Introduces ExtractedText and OCRData columns on Document, a pure-Go PDF text extractor using ledongthuc/pdf, and a design plan document.

- Add ExtractedText (string) and OCRData ([]byte) to Document model
- Implement ExtractText for PDF, text/*, and markdown MIME types
- Add IsScanned heuristic (empty/whitespace text = scanned)
- Include test fixture generator and sample.pdf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
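The IsScanned heuristic described above can be sketched as follows. This is a minimal illustration, assuming the function takes the extracted text as a string; the actual signature in the PR is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// IsScanned reports whether extracted text is empty or whitespace-only,
// which the heuristic treats as a sign that the document has no text layer.
// The signature is an assumption made for this sketch.
func IsScanned(extracted string) bool {
	return strings.TrimSpace(extracted) == ""
}

func main() {
	fmt.Println(IsScanned(" \n\t"))
	fmt.Println(IsScanned("Invoice #1042"))
}
```

A document that yields only whitespace from the text extractor would then be routed to OCR.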
Add the OCR layer for the document extraction pipeline (#200). Scanned PDFs and images are recognized via tesseract + pdftoppm when available, with graceful degradation and a one-time hint when tools are missing.

- Add OCR function with PDF rasterization (pdftoppm) and image OCR paths
- Parse tesseract TSV output preserving word/line/paragraph structure
- Add tool detection with sync.Once caching (HasTesseract, HasPDFToPPM)
- Add tesseract + poppler-utils to devShell in flake.nix
- Add one-time tesseract hint setting in data/settings.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
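Tool detection with sync.Once caching might look roughly like this. The sketch assumes detection is a PATH lookup via `exec.LookPath`; the variable names are hypothetical and the PR's actual HasTesseract may do more (for example, a version check).

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

var (
	tesseractOnce sync.Once // guards the one-time lookup
	tesseractOK   bool      // cached result of the lookup
)

// HasTesseract reports whether a tesseract binary is on PATH.
// The lookup runs once; later calls return the cached answer.
func HasTesseract() bool {
	tesseractOnce.Do(func() {
		_, err := exec.LookPath("tesseract")
		tesseractOK = err == nil
	})
	return tesseractOK
}

func main() {
	fmt.Println("tesseract available:", HasTesseract())
}
```

Caching keeps repeated uploads from re-scanning PATH, at the cost of not noticing a tool installed while the program is running.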
Add the LLM extraction layer for the document extraction pipeline (#200). When an extraction model is configured, documents are analyzed to extract vendor, amounts, dates, entity links, and maintenance schedules.

- Add ExtractionHints and EntityContext types with validation maps
- Build extraction prompt with entity context for LLM matching
- Parse flexible LLM JSON responses (money as string/float, multiple date formats, code-fenced responses)
- Add [extraction] config section with model, max_ocr_pages, enabled
- Wire extraction config through app Model and main.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
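One way to tolerate code-fenced responses and money given as either a string or a number is a custom `UnmarshalJSON`. The sketch below is illustrative only: `Cents`, `Hints`, `stripCodeFence`, and `parseHints` are hypothetical names, the JSON keys are invented, and amounts are assumed to be in dollars.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"strconv"
	"strings"
)

const fence = "\x60\x60\x60" // three backticks, the Markdown code fence

// Cents is a hypothetical money type accepting 12.5 or "$1,250.00".
type Cents int64

func (c *Cents) UnmarshalJSON(b []byte) error {
	var f float64
	if err := json.Unmarshal(b, &f); err == nil {
		*c = Cents(math.Round(f * 100))
		return nil
	}
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return fmt.Errorf("money must be a number or string: %w", err)
	}
	s = strings.NewReplacer("$", "", ",", "").Replace(strings.TrimSpace(s))
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return fmt.Errorf("parse money %q: %w", s, err)
	}
	*c = Cents(math.Round(f * 100))
	return nil
}

// Hints is a hypothetical subset of the extraction result.
type Hints struct {
	Vendor string `json:"vendor"`
	Total  Cents  `json:"total"`
}

// stripCodeFence removes a surrounding Markdown fence, if present.
func stripCodeFence(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, fence) {
		return s
	}
	i := strings.IndexByte(s, '\n')
	if i < 0 {
		return ""
	}
	s = strings.TrimSpace(s[i+1:]) // drop the opening fence line
	return strings.TrimSpace(strings.TrimSuffix(s, fence))
}

// parseHints decodes a raw model reply into Hints.
func parseHints(raw string) (Hints, error) {
	var h Hints
	err := json.Unmarshal([]byte(stripCodeFence(raw)), &h)
	return h, err
}

func main() {
	raw := fence + "json\n{\"vendor\":\"Acme\",\"total\":\"$1,250.00\"}\n" + fence
	h, err := parseHints(raw)
	fmt.Println(h.Vendor, h.Total, err)
}
```

Trying the numeric form first and falling back to a string keeps the decoder lenient without accepting arbitrary JSON types.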
Add the Pipeline orchestrator that sequences text extraction, OCR, and LLM extraction into a single Run call (#200). Wire it into the document upload flow with entity context from the database.

- Add Pipeline.Run orchestrating all three extraction layers
- Add Store.EntityNames for LLM entity matching context
- Rewrite parseDocumentFormData to use Pipeline with documentParseResult
- Add buildExtractionPipeline and showTesseractHint to app Model
- Pre-fill document title from LLM suggestion, notes from summary
- Surface non-fatal extraction errors via status bar

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
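The sequencing performed by a single Run call could be sketched as below. The struct fields, the `Result` type, and `demoRun` are assumptions for illustration, not the PR's actual types; the point is only that each layer's failure is recorded as non-fatal and later layers still run when they have input.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Pipeline holds the three layers as function fields (hypothetical shape).
// OCR and Analyze may be nil when tools or a model are not configured.
type Pipeline struct {
	ExtractText func(data []byte) (string, error)
	OCR         func(ctx context.Context, data []byte) (string, error)
	Analyze     func(ctx context.Context, text string) (string, error)
}

// Result carries whatever was recovered plus non-fatal errors.
type Result struct {
	Text  string
	Hints string
	Errs  []error
}

// Run tries text extraction, falls back to OCR when no text was found,
// then hands any text to the LLM step. Errors never abort the run.
func (p Pipeline) Run(ctx context.Context, data []byte) Result {
	var res Result
	text, err := p.ExtractText(data)
	if err != nil {
		res.Errs = append(res.Errs, fmt.Errorf("text: %w", err))
	}
	if strings.TrimSpace(text) == "" && p.OCR != nil {
		ocrText, err := p.OCR(ctx, data)
		if err != nil {
			res.Errs = append(res.Errs, fmt.Errorf("ocr: %w", err))
		} else {
			text = ocrText
		}
	}
	res.Text = text
	if strings.TrimSpace(text) != "" && p.Analyze != nil {
		hints, err := p.Analyze(ctx, text)
		if err != nil {
			res.Errs = append(res.Errs, fmt.Errorf("llm: %w", err))
		} else {
			res.Hints = hints
		}
	}
	return res
}

// demoRun simulates a scanned document with the model offline.
func demoRun() Result {
	p := Pipeline{
		ExtractText: func([]byte) (string, error) { return "", nil },
		OCR: func(context.Context, []byte) (string, error) {
			return "Acme invoice", nil
		},
		Analyze: func(context.Context, string) (string, error) {
			return "", errors.New("model offline")
		},
	}
	return p.Run(context.Background(), []byte("fake pdf bytes"))
}

func main() {
	res := demoRun()
	fmt.Println(res.Text, len(res.Errs))
}
```

Collecting errors in a slice rather than returning the first one is what would let a caller surface them in a status bar while still using partial results.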
…rogress

Add an interactive overlay that shows real-time progress when documents are processed through the extraction pipeline (text -> OCR -> LLM). Each step displays a spinner, elapsed time, and detail (page count, model name, character count). Users accept results with `a` or cancel with `esc`.

Key changes:

- Extraction overlay with step navigation (j/k), expand/collapse (enter)
- Channel-based OCR streaming with per-page and rasterization progress
- LLM token streaming in overlay with accumulated JSON display
- Accept/cancel flow: results held until user presses `a`
- Proper context cancellation: esc cancels all in-flight work
- OCR failure gracefully continues to LLM step
- currency_unit field in extraction schema for cents/dollars disambiguation
- Configurable pdftotext timeout (extraction.text_timeout / MICASA_TEXT_TIMEOUT)
- ExtractionPromptInput struct replaces 7 positional params
- Cached extraction LLM client on model
- Docs: configuration, keybindings, and documents guide updated

closes #200

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
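Channel-based progress streaming with context cancellation can be illustrated with a small sketch. The names `Progress`, `streamOCR`, and `collect` are hypothetical, and the per-page work is a stub; the PR's OCRWithProgress likely reports more detail.

```go
package main

import (
	"context"
	"fmt"
)

// Progress is one per-page event (hypothetical shape).
type Progress struct {
	Page, Total int
	Err         error
}

// streamOCR runs recognize for each page and emits one event per page.
// It stops early, and closes the channel, once ctx is cancelled.
func streamOCR(ctx context.Context, total int, recognize func(page int) error) <-chan Progress {
	ch := make(chan Progress)
	go func() {
		defer close(ch)
		for page := 1; page <= total; page++ {
			if ctx.Err() != nil {
				return
			}
			ev := Progress{Page: page, Total: total, Err: recognize(page)}
			select {
			case ch <- ev:
			case <-ctx.Done():
				return
			}
		}
	}()
	return ch
}

// collect drains the stream and cancels after cancelAfter events
// (0 means never cancel). It returns the number of events received.
func collect(total, cancelAfter int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	n := 0
	for range streamOCR(ctx, total, func(int) error { return nil }) {
		n++
		if n == cancelAfter {
			cancel()
		}
	}
	return n
}

func main() {
	fmt.Println(collect(3, 0))
}
```

Selecting on both the send and `ctx.Done()` means the worker goroutine cannot block forever if the consumer stops reading after a cancel.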
VHS tape that demonstrates the document extraction overlay: importing a scanned PDF, OCR progress, LLM extraction, and accepting results. Requires Ollama running with qwen3:0.6b.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Combine sample.pdf (digital text) and scanned-invoice.pdf (image pages) into a 109KB mixed-inspection.pdf fixture. The pipeline test verifies that pdftotext extracts digital pages while OCR handles scanned ones -- the common case for real-world inspection reports and permits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Without these tools, all OCR and PDF text extraction tests are skipped in CI. Install poppler-utils (pdftotext, pdftoppm) and tesseract-ocr on all three platforms so the pipeline tests exercise real code paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace checked-in binary test fixtures (sample.pdf, invoice.png, scanned-invoice.pdf, mixed-inspection.pdf) with 4 bash scripts that generate them on demand. Fixtures are now gitignored and generated via shell hook (local dev) or CI step (with shell: bash for Windows).

- gen-sample-pdf.bash: base64-embedded minimal PDF (no deps)
- gen-invoice-png.bash: magick-generated realistic invoice image
- gen-scanned-pdf.bash: magick image-to-PDF conversion
- gen-mixed-pdf.bash: pdfunite digital + scanned pages
- 5 nix apps: 4 individual + gen-testdata combined
- CI: imagemagick added to all platforms, shell: bash fixture step

closes #200

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use Shift+F to jump directly to Docs tab
- Add new document instead of editing existing one
- Increase terminal height for overlay centering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… tape

Clamp base view to terminal height before overlay compositing so the extraction overlay centers correctly when opened from an in-place form save (where the form content overflows the terminal).

Fix expanded log content: add left border pipe (│) to visually separate log output from step headers, add blank line spacing between expanded steps, and use directional triangles (▾ down when expanded, ▸ right when collapsed).

Fix the demo tape: enter Edit mode before Shift+A, use a temp directory with just the test PDF so the file picker navigation is deterministic.

closes #200

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ubuntu 22.04/24.04 ship ImageMagick v6 which only provides `convert`, not the v7 `magick` command. Symlink convert to magick on Linux CI so gen scripts work uniformly across all platforms. Add gen-sample-text-png.bash for the OCR image integration test fixture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Choco's poppler package extracts binaries into a nested directory that isn't on PATH. Split the install into two steps: choco install in PowerShell, then find pdfunite.exe and add its directory to GITHUB_PATH in bash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The choco poppler package only ships source code, not compiled binaries. Drop it from Windows CI and gracefully skip gen-mixed-pdf when pdfunite is unavailable. The mixed-PDF test already skips when the fixture is missing.
macOS ImageMagick needs ghostscript for text rendering (magick -annotate). Without it, fixture images are blank and tesseract returns empty text. Also add skipOrFatalCI helper: tests skip locally when tools are missing, but fail hard in CI on Linux/macOS where all tools should be installed.
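The skip-locally, fail-in-CI behavior could be sketched like this. The helper name comes from the commit message, but its signature, the `shouldFailHard` split, and the use of the `CI` environment variable to detect CI are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"testing"
)

// shouldFailHard reports whether a missing tool is an error rather than
// a reason to skip: only in CI, and only on Linux or macOS.
func shouldFailHard(ciEnv, goos string) bool {
	return ciEnv != "" && (goos == "linux" || goos == "darwin")
}

// skipOrFatalCI skips the test locally but fails it in CI on platforms
// where all tools are expected to be installed.
func skipOrFatalCI(t testing.TB, msg string) {
	t.Helper()
	if shouldFailHard(os.Getenv("CI"), runtime.GOOS) {
		t.Fatal(msg)
	}
	t.Skip(msg)
}

func main() {
	fmt.Println(shouldFailHard("true", "linux"))
	fmt.Println(shouldFailHard("", "linux"))
	fmt.Println(shouldFailHard("true", "windows"))
}
```

A test would call `skipOrFatalCI(t, "tesseract not found")` after a failed tool check, so a broken CI install surfaces as a failure instead of a silently skipped test.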
Tests for the two main gaps in coverage:

- Pipeline with LLM: mock httptest server returns canned extraction JSON, verifying the full text -> LLM -> parsed hints path. Also tests LLM server down, garbage response, and no-text skip.
- OCRWithProgress: empty data, context cancellation, and integration tests for image and PDF paths (need tesseract/pdftoppm in CI).
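The mock-server approach can be sketched with `net/http/httptest`. The response body and the `fetchVendor` client below are hypothetical stand-ins; the real tests exercise the project's LLM client against whatever API shape it expects.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newCannedServer returns a test server that always replies with body.
func newCannedServer(body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, body)
	}))
}

// fetchVendor is a stand-in client: it GETs url and decodes one field.
func fetchVendor(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct {
		Vendor string `json:"vendor"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", fmt.Errorf("decode: %w", err)
	}
	return out.Vendor, nil
}

// roundTrip starts a canned server, queries it, and shuts it down.
func roundTrip(body string) (string, error) {
	srv := newCannedServer(body)
	defer srv.Close()
	return fetchVendor(srv.URL)
}

func main() {
	fmt.Println(roundTrip(`{"vendor":"Acme"}`))
	fmt.Println(roundTrip("not json"))
}
```

The same server helper covers the garbage-response case by returning a non-JSON body; a server-down case can be simulated by closing the server before the request.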
Force-pushed from 0a8df47 to 3e75321
Summary
- `waitFor` pattern; text extraction stays synchronous
- `extracted_text`, `ocr_text`, `ocr_tsv` columns on documents

Test plan
- `go test -shuffle=on ./...` passes

🤖 Generated with Claude Code