Maya uploads a PDF and reviews the extracted specs #3

@danielnaab

User Story:

As a form creator (Maya), in order to digitize a paper form without technical skills, I want to upload a PDF form and review the structured specs the system extracted from it.

Preconditions:

  • Maya is authenticated (Slice 1)
  • PDF form available for upload

Acceptance Criteria:

  • Upload page accepts PDF files
  • System extracts structure from PDF and produces a DataCollectionSpec
  • System generates a default FormSpec based on the extracted DataCollectionSpec
  • Both specs are displayed in the catalog as browsable, reviewable content
  • Maya can see what fields were extracted, their types, grouping, and conditions
  • Maya can see the proposed form layout (pages, sections, delivery modes)
  • Extracted specs are persisted as a FormProject in git
  • Extraction errors or low-confidence fields are flagged for review
  • Form projects are stored as bare git repos with version history
  • Project detail page shows version history with commit-level snapshots
  • Projects are publicly viewable at user-scoped URLs (/:owner/:slug)
  • Mutations (delete, re-extract) are restricted to project owners via service-layer permission checks
  • Authenticated users can fork projects they do not own
  • User profile pages list a user's projects at /:owner
  • Git repository browsing (tree, blob, commits) available at GitHub-style URLs
  • Read-only git clone served over HTTP
  • Home page shows dashboard for authenticated users, landing page for anonymous visitors
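
The access rules in the criteria above (public viewing, owner-only mutations, forking by authenticated non-owners) could be expressed as one service-layer check. This is a minimal sketch, not the project's actual code: the `User`, `Project`, and `Action` shapes, the `canPerform` helper, and the in-memory `ProjectService` are all hypothetical names introduced for illustration.

```typescript
// Hypothetical shapes; the issue does not define these types.
type User = { id: string; username: string };
type Project = { owner: string; slug: string };
type Action = "view" | "fork" | "delete" | "re-extract";

// Returns true when `user` (null for an anonymous visitor) may perform `action`.
function canPerform(user: User | null, project: Project, action: Action): boolean {
  if (action === "view") {
    return true; // projects are publicly viewable
  }
  if (user === null) {
    return false; // everything else requires authentication
  }
  if (action === "fork") {
    return user.username !== project.owner; // fork only projects you do not own
  }
  return user.username === project.owner; // delete and re-extract are owner-only
}

// In-memory stand-in for the service layer; a real one would wrap the git repos.
class ProjectService {
  private projects = new Map<string, Project>();

  add(project: Project): void {
    this.projects.set(`${project.owner}/${project.slug}`, project);
  }

  get(owner: string, slug: string): Project | undefined {
    return this.projects.get(`${owner}/${slug}`);
  }

  // The permission check lives here, so a route handler can stay a thin wrapper.
  delete(user: User | null, owner: string, slug: string): void {
    const project = this.get(owner, slug);
    if (project === undefined) {
      throw new Error("project not found");
    }
    if (!canPerform(user, project, "delete")) {
      throw new Error("only the project owner may delete");
    }
    this.projects.delete(`${owner}/${slug}`);
  }
}
```

Keeping the rule in one pure function makes it straightforward to unit test without HTTP or git in the loop.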

Success Metrics:

  • Extraction accuracy: percentage of fields correctly identified vs. source PDF
  • Time from upload to reviewable spec < 30 seconds
  • Establish baseline evaluation metrics for LLM extraction quality
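
One way to make the extraction accuracy metric concrete is sketched below. It assumes a field counts as "correctly identified" when an extracted field with the same name and the same type exists in the ground truth; that definition, the `FieldSpec` shape, and the `extractionAccuracy` function are assumptions for illustration, not something the issue specifies.

```typescript
// Hypothetical minimal field shape; the real DataCollectionSpec is likely richer.
type FieldSpec = { name: string; type: string };

// Percentage (0 to 100) of ground-truth fields that were extracted
// with a matching name and type. Returns 0 for an empty ground truth.
function extractionAccuracy(extracted: FieldSpec[], groundTruth: FieldSpec[]): number {
  if (groundTruth.length === 0) {
    return 0;
  }
  const byName = new Map<string, FieldSpec>();
  for (const field of extracted) {
    byName.set(field.name, field);
  }
  let correct = 0;
  for (const truth of groundTruth) {
    const got = byName.get(truth.name);
    if (got !== undefined && got.type === truth.type) {
      correct += 1;
    }
  }
  return (correct / groundTruth.length) * 100;
}
```

For example, if the ground truth has four fields and the extraction gets three of them right (one has the wrong type), the function returns 75. Stricter variants could also compare grouping and conditions.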

Notes:

  • First LLM integration point; uses the Claude API (Opus/Sonnet baseline)
  • LLM service uses strategy pattern: PdfExtractor interface with ApiPdfExtractor implementation
  • Evaluation: compare extracted spec against manually-created ground truth for test PDFs
  • Future experiments: alternative models, prompting strategies, chunking approaches
  • Form projects stored as bare git repos at data/repos/<slug>.git
  • ProjectService layer enforces ownership permissions; route handlers are thin wrappers
  • GitHub-style URL structure: /:owner/:slug, /:owner/:slug/tree/:ref/*, /:owner/:slug/settings, etc.
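
The strategy pattern mentioned in the notes might look roughly like the sketch below. Only the names `PdfExtractor`, `ApiPdfExtractor`, and `DataCollectionSpec` come from this issue; the method signature, the spec fields, the injected `callModel` function (a stand-in for the real API client, which is not shown), the `FixedPdfExtractor` test double, and the 0.8 confidence threshold are all assumptions.

```typescript
// Hypothetical spec shape; the issue names DataCollectionSpec but not its fields.
interface DataCollectionSpec {
  fields: { name: string; type: string; confidence: number }[];
}

// Strategy interface: callers depend on this, not on a concrete extractor.
interface PdfExtractor {
  extract(pdf: Uint8Array): Promise<DataCollectionSpec>;
}

// API-backed strategy. `callModel` is an injected placeholder that is assumed
// to send the PDF to the LLM and return the spec as a JSON string.
class ApiPdfExtractor implements PdfExtractor {
  private readonly callModel: (pdf: Uint8Array) => Promise<string>;

  constructor(callModel: (pdf: Uint8Array) => Promise<string>) {
    this.callModel = callModel;
  }

  async extract(pdf: Uint8Array): Promise<DataCollectionSpec> {
    const raw = await this.callModel(pdf);
    return JSON.parse(raw) as DataCollectionSpec;
  }
}

// Swappable alternative: returns a canned spec, useful for tests and evaluation.
class FixedPdfExtractor implements PdfExtractor {
  private readonly spec: DataCollectionSpec;

  constructor(spec: DataCollectionSpec) {
    this.spec = spec;
  }

  async extract(_pdf: Uint8Array): Promise<DataCollectionSpec> {
    return this.spec;
  }
}

// Works with any strategy; returns the names of fields to flag for review.
async function lowConfidenceFields(
  extractor: PdfExtractor,
  pdf: Uint8Array,
  threshold: number = 0.8, // assumed cutoff, not from the issue
): Promise<string[]> {
  const spec = await extractor.extract(pdf);
  return spec.fields.filter((f) => f.confidence < threshold).map((f) => f.name);
}
```

Because `lowConfidenceFields` only sees the interface, the future experiments listed above (alternative models, prompting strategies, chunking) could each be added as another `PdfExtractor` implementation without touching callers.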

Definition of Done:

  • Acceptance criteria met
  • Threat model updated -- any new trust boundaries, data flows, or attack surfaces are reflected in catalog/architecture/threat-model.md
  • Technical documentation updated -- architecture docs and decisions are current
  • LLM extraction service has interface abstraction (swappable implementations)
  • At least one test PDF with ground truth for evaluation
  • Tests pass
  • Type checking passes
  • CI pipeline green
  • Deployed and demoable
