Add Markdown export and bulk export workflow#19

Merged
bigbruno merged 3 commits into
biglinux:mainfrom
xathay:feature/markdown-export
May 22, 2026

Conversation

@xathay
Contributor

@xathay xathay commented May 21, 2026

Summary

Two related additions to the post-OCR conclusion page (and the CLI):

  • Markdown export — new `convert_pdf_to_markdown` reuses the existing `pdftotext + LayoutAnalyzer` pipeline to emit structured `.md` (ATX headings, GFM pipe tables, bold key-value lines, paragraph blocks). Optional YAML front-matter is handy for Obsidian/Hugo/LLM ingestion. Available both in the GUI (per-file `Gtk.MenuButton` that also subsumes the existing ODF export) and as `bigocrpdf-cli export-md [--front-matter]`.
  • Bulk export with selection mode — a toggle in the Generated Files card header turns the row into a selection view (checkboxes, Select all / Clear, Export selected ▾ menu). Exports run on a background thread with a cancellable progress dialog, name collisions auto-suffix `(1)`, `(2)`, and write-permission is pre-checked so a read-only destination fails immediately instead of N per-file errors.

The menus use `Gio.Menu` + `Gtk.PopoverMenu` so keyboard navigation (Up/Down/Enter) and screen-reader semantics come for free. A shared `_build_progress_dialog` helper backs both single-file and bulk flows; `convert_pdf_to_markdown` accepts a `cancel_event` so MD batches respond to Cancel mid-file (matching the ODF contract).
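As a rough illustration of the GFM pipe tables mentioned in the summary, here is a minimal sketch of table emission with cell escaping. The names `escape_cell` and `emit_table` are hypothetical and are not taken from the PR; the real converter works on `LayoutAnalyzer` output rather than plain lists.

```python
def escape_cell(text: str) -> str:
    """Escape characters that would break a GFM table cell (hypothetical helper)."""
    # Escape backslashes first so the pipe escape is not itself re-escaped.
    text = text.replace("\\", "\\\\").replace("|", "\\|")
    # A pipe-table cell must stay on one line: collapse whitespace runs, newlines included.
    return " ".join(text.split())


def emit_table(rows: list[list[str]]) -> str:
    """Render rows as a GFM pipe table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of cells.
    padded = [
        [escape_cell(cell) for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    lines = ["| " + " | ".join(padded[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in padded[1:]]
    return "\n".join(lines)


print(emit_table([["Name", "Value"], ["a|b", "1"]]))
```

Escaping the pipe inside a cell is what keeps OCR'd text such as `a|b` from being read as an extra column.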

Two commits, separable for review:

  1. `Add Markdown export with optional YAML front-matter`
  2. `Add bulk export with selection mode on the conclusion page`

Test plan

  • `pytest` — 358 passing (20 new, focused on escape rules, table-cell escaping, YAML control-char stripping, front-matter, `cancel_event` propagation, `_unique_path` auto-suffix, dismissed-dialog helper)
  • `ruff check` clean on `src` and `tests`
  • `ruff format --check` clean on every file touched
  • CLI: `bigocrpdf-cli export-md sample.pdf --front-matter` produces valid `.md` with YAML front-matter
  • GUI single-file: per-row Export ▾ → Markdown (.md) opens the dialog and writes the file; the optional Open after export launches the default app
  • GUI single-file MD: cancellable progress dialog mirrors the ODF flow
  • GUI bulk: selection toggle reveals checkboxes + action bar; Select all / Clear work; bulk export honors auto-suffix on name conflicts
  • GUI bulk: read-only destination folder produces a single clear toast (no spurious per-file failures)
  • GUI bulk: Cancel mid-run reports `Cancelled — saved X of Y` correctly
  • Keyboard navigation: `Up`/`Down`/`Enter` work inside both export menus
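The CLI check in the test plan could be wired up roughly as follows. This is a sketch only: `build_parser` is mentioned later in the PR, but the positional name `input_pdf`, the help strings, and the subparser layout here are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with an export-md subcommand (layout is assumed)."""
    parser = argparse.ArgumentParser(prog="bigocrpdf-cli")
    sub = parser.add_subparsers(dest="command", required=True)

    md = sub.add_parser("export-md", help="Export an OCR'd PDF to Markdown")
    # Hypothetical positional name; the PR only shows the PDF path being passed.
    md.add_argument("input_pdf", help="PDF that already has a text layer")
    md.add_argument(
        "--front-matter",
        action="store_true",
        help="Prepend YAML front-matter to the output",
    )
    return parser


# Mirrors the invocation from the test plan.
args = build_parser().parse_args(["export-md", "sample.pdf", "--front-matter"])
print(args.command, args.input_pdf, args.front_matter)
```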

Notes

  • No new runtime dependencies — front-matter date uses `datetime.timezone.utc` (Py 3.10+ stdlib) and Markdown emission stays inside the existing `tsv_odf_converter.py` module.
  • Removed the standalone Save as ODF icon button on each row; OpenDocument (.odt) is now the first item inside the new unified Export ▾ menu — same behavior, less visual clutter, ready to host future formats (Marker-based MD, JSON for LLMs, etc.) without inflating the action bar.
  • The bulk pipeline reuses the per-format converters; no new export logic was duplicated.

xathay added 2 commits May 21, 2026 02:24
Adds a third export format alongside the existing TXT/ODF: the conclusion
page now offers .md output through a unified Gtk.MenuButton (replacing
the standalone ODF icon), and the CLI gains an `export-md` subcommand
with a `--front-matter` flag.

Conversion reuses the existing pdftotext+LayoutAnalyzer pipeline:
headings become ATX prefixes, tables render as GitHub-flavored pipe
tables, key-value lines bold the key, and paragraphs are escaped only
where Markdown actually changes meaning (inline emphasis/links/pipes,
plus line-start block markers) so CPF/phone hyphens stay readable.
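
The selective escaping described above might look something like this. It is a minimal sketch under assumed rules, not the PR's code: the exact character set and the name `escape_paragraph_line` are hypothetical.

```python
import re

# Inline characters that change meaning in Markdown: emphasis, code, links, pipes.
_INLINE_SPECIALS = re.compile(r"([\\`*_\[\]|])")
# Block markers only matter at the start of a line:
# headings, block quotes, bullets, and ordered-list numbers.
_LINE_START = re.compile(r"^(\s*)(#{1,6}(?=\s|$)|>|[-+](?=\s)|\d+[.)](?=\s))")


def _escape_line_start(match: re.Match) -> str:
    indent, marker = match.group(1), match.group(2)
    if marker[0].isdigit():
        # Ordered list: escape the delimiter, not the number itself.
        return f"{indent}{marker[:-1]}\\{marker[-1]}"
    return f"{indent}\\{marker}"


def escape_paragraph_line(line: str) -> str:
    """Escape only what would be re-interpreted as Markdown syntax."""
    escaped = _INLINE_SPECIALS.sub(r"\\\1", line)
    return _LINE_START.sub(_escape_line_start, escaped)


print(escape_paragraph_line("CPF 123.456.789-00"))
print(escape_paragraph_line("- not a bullet"))
```

Because hyphens and dots are escaped only in line-start position, an identifier such as `123.456.789-00` passes through unchanged.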

The MenuButton uses Gio.Menu + Gtk.PopoverMenu for native keyboard
navigation and screen-reader semantics. Single-file MD export shows
the same cancellable progress dialog the ODF flow already had, backed
by a shared _build_progress_dialog helper. Front-matter dates are
emitted in UTC so they stay deterministic across timezones.
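
A sketch of how UTC-stamped front-matter with control-character stripping could be produced. The field names (`title`, `source`, `date`), the timestamp format, and both function names are assumptions for illustration; the PR only states that the date is emitted in UTC and that YAML control characters are stripped.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# C0 control characters other than tab/newline/CR, plus DEL.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def yaml_scalar(value: str) -> str:
    """Return a double-quoted, single-line YAML scalar."""
    cleaned = _CONTROL.sub("", value)
    cleaned = " ".join(cleaned.split())  # collapse whitespace and newlines
    escaped = cleaned.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_front_matter(title: str, source: str, now: Optional[datetime] = None) -> str:
    """Build a front-matter block; `now` is injectable so tests stay deterministic."""
    now = now or datetime.now(timezone.utc)
    # Normalise to UTC so the output does not depend on the machine's timezone.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "\n".join(
        [
            "---",
            f"title: {yaml_scalar(title)}",
            f"source: {yaml_scalar(source)}",
            f"date: {stamp}",
            "---",
            "",
        ]
    )


print(build_front_matter("Report", "scan.pdf"))
```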

20 new tests cover the conversion: escape rules, table cells with
Markdown specials, YAML control-char stripping, front-matter,
cancel_event propagation, _unique_path auto-suffix.

Lets the user export several OCR'd files at once. A toggle in the
"Generated Files" card header switches the row into a selection view:
per-file actions are hidden, every row gains a checkbox, and a bottom
bar appears with "Select all", "Clear" and an "Export selected ▾" menu
mirroring the per-file format choices (ODT, MD).

The bulk pipeline picks a destination folder up front, then walks the
selection in a background thread with a cancellable progress dialog
(reusing the same _build_progress_dialog helper from the single-file
flow). Name collisions are resolved by auto-suffixing "(1)", "(2)" so
the run never silently overwrites a file the user didn't pick. The
destination is pre-checked for existence and write permission so a
read-only folder fails immediately with a clear toast instead of N
per-file errors.
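
The collision handling and the up-front destination check could be sketched as below. `_unique_path` is named in the PR, but its signature and the exact suffix format are not shown, so `unique_path` and `destination_error` here are hypothetical stand-ins.

```python
import os
from pathlib import Path
from typing import Optional


def unique_path(path: Path) -> Path:
    """Return `path`, or the first free 'name (N).ext' variant if it exists."""
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem} ({counter}){path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1


def destination_error(folder: Path) -> Optional[str]:
    """Return a message if the folder cannot be written to, else None."""
    if not folder.is_dir():
        return f"Destination does not exist: {folder}"
    # Creating files in a directory needs both write and search permission.
    if not os.access(folder, os.W_OK | os.X_OK):
        return f"Destination is not writable: {folder}"
    return None
```

Running the destination check once before the loop is what lets a read-only folder surface as a single message. Note that an existence check followed by a write is not atomic, so a sketch like this can still race with other processes writing to the same folder.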

Cancellation mid-file is honored by both converters (convert_pdf_to_odf
already supported it; convert_pdf_to_markdown gained cancel_event in
the previous commit). The bulk worker distinguishes ExportCancelled
from a real failure so the summary toast reports counts correctly.
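
One way to keep cancellation separate from failure in the worker's counts is sketched here. `ExportCancelled` and `cancel_event` are named in the PR; `convert_one`, `bulk_export`, and the returned tuple are assumptions, and the converter is a toy stand-in.

```python
import threading


class ExportCancelled(Exception):
    """Raised by a converter when the cancel event is set mid-file."""


def convert_one(pages, cancel_event):
    """Toy converter: checks the event between pages, as a real one might."""
    out = []
    for page in pages:
        if cancel_event.is_set():
            raise ExportCancelled()
        out.append(page.upper())
    return "\n".join(out)


def bulk_export(jobs, cancel_event, convert=convert_one):
    """Return (saved, failed, cancelled) for a list of jobs."""
    saved = failed = 0
    cancelled = False
    for pages in jobs:
        if cancel_event.is_set():
            cancelled = True
            break
        try:
            convert(pages, cancel_event)
        except ExportCancelled:
            # A cancel is not a failure: stop without touching `failed`.
            cancelled = True
            break
        except Exception:
            failed += 1
        else:
            saved += 1
    return saved, failed, cancelled


print(bulk_export([["a"], ["b"]], threading.Event()))
```

In a GUI the loop would run on a worker thread while the dialog's Cancel button calls `cancel_event.set()`.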

The bulk menu uses the same Gio.Menu + Gtk.PopoverMenu pattern as the
per-row export button for consistent keyboard navigation.
@codacy-production

codacy-production Bot commented May 21, 2026

Not up to standards ⛔

🔴 Issues 50 high

Alerts:
⚠ 50 issues (≤ 0 issues of at least minor severity)

Results:
50 new issues

| Category | Results |
| --- | --- |
| Security | 50 high |

🟢 Metrics 150 complexity · 8 duplication

| Metric | Results |
| --- | --- |
| Complexity | 150 |
| Duplication | 8 |

Static-analysis cleanup driven by the bots that ran on the PR:

- Split create_markdown() into per-element emitters (_emit_heading,
  _emit_table, _emit_kv, _emit_paragraph) plus a tiny dispatch helper,
  bringing Cognitive Complexity from 37 down under the 15 threshold.
- Split _bulk_export_worker by extracting _bulk_convert_one (single-file
  conversion) and _safe_remove (partial-output cleanup) plus a
  _BULK_EXTENSIONS map for format → suffix, bringing the worker CC from
  25 down under 15. The try/except/else flow also became clearer.
- Replaced bare logger.error in the bulk except path with
  logger.exception("Bulk export failed for %s", pdf_path) so the
  traceback ends up in the log.
- Module-level constants for duplicated literals:
    * _EXPORT_FAILED_MSG = _("Export failed")  (4 callers)
    * _NOTIFY_ACTIVE = "notify::active"        (4 GTK signal binds)
    * input_pdf_with_text_help in build_parser (3 subcommands)
- Tests: replaced hardcoded "/tmp/*.pdf" mock paths with
  os.path.join(tempfile.gettempdir(), …) via a _mock_pdf_path helper —
  the path is never touched (parse_tsv_pages is mocked), but Sonar's
  "publicly writable directory" rule is happy now. Generic
  Exception(...) test mocks tightened to RuntimeError(...).

No behavior changes, no public API changes — 358/358 tests still pass,
ruff clean, MD output diff-identical against pre-refactor fixtures.
@sonarqubecloud

@bigbruno bigbruno merged commit 9c256a8 into biglinux:main May 22, 2026
1 of 2 checks passed
@xathay xathay deleted the feature/markdown-export branch May 22, 2026 22:04