Maya's extractions use our fine-tuned model #65

@danielnaab

Description

User Story

As Maya, in order to demonstrate that a small fine-tuned model can match or beat prompted larger models on a narrow domain task, I want an extraction variant powered by a LoRA fine-tune of a small open model, served from our own inference endpoint.

Preconditions

None.

Acceptance Criteria

  • Training dataset committed under data/fine-tuning/extraction/ — at least 50 (PDF description → spec) pairs, either hand-curated from fixtures or Opus-generated and reviewed
  • LoRA fine-tuning script under scripts/fine-tune-extraction.ts (or a language appropriate to the trainer); a hedged training sketch follows this list
  • Trained adapter checkpoint committed or clearly documented with retrieval instructions
  • FastAPI inference endpoint deployed to the existing EC2 instance, reachable via the app's Caddy routing; runs as a separate NixOS service managed alongside the main app (see the endpoint sketch after this list)
  • New variant extraction/lora-v1 in the extraction registry that calls the inference endpoint via HTTP
  • Extraction tab in Settings → Variants lists the LoRA variant
  • Evaluation run comparing the LoRA variant against Sonnet/Haiku baselines
  • New catalog page catalog/experiments/pdf-field-extraction/lora-v1.md including: base model, training data size, LoRA rank/alpha, training hyperparameters, deployment architecture, and metric deltas vs baselines
  • catalog/experiments/_roadmap.md updated with shipped status and one-line finding
  • If the scope becomes unachievable (time or cost), the catalog page clearly marks the experiment as scope-deferred with reasoning; "attempted" is an honest outcome
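
To make the training step concrete, here is a minimal sketch of what the fine-tuning script could look like, assuming a Python trainer built on Hugging Face transformers and peft. File paths, the JSONL schema, the prompt format, and all hyperparameters are illustrative assumptions, not decisions recorded in this issue.

```python
# Hypothetical trainer sketch (the criteria above allow a language
# appropriate to the trainer, so Python is assumed here).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # smallest viable, per Notes

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# Wrap the base model with a LoRA adapter. Rank/alpha here are assumptions;
# the catalog page must record whatever values are actually used.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Assumed JSONL schema: one {"description": str, "spec": str} object per line,
# matching the (PDF description -> spec) pairs required above.
dataset = load_dataset("json", data_files="data/fine-tuning/extraction/train.jsonl")["train"]

def tokenize(row):
    text = f"{row['description']}\n### Spec:\n{row['spec']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=2048)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/extraction-lora-v1",
                           num_train_epochs=3, per_device_train_batch_size=2,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("checkpoints/extraction-lora-v1")  # adapter weights only
```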
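
A matching sketch of the inference service, again with assumed names: the route, request shape, and generation settings are illustrative. The real service would run as its own NixOS unit behind the app's Caddy routing, and the extraction/lora-v1 variant would call it with a plain HTTP POST.

```python
# Hypothetical FastAPI service loading the base model plus the trained adapter.
import torch
from fastapi import FastAPI
from peft import PeftModel
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
ADAPTER_PATH = "checkpoints/extraction-lora-v1"  # assumed path, per sketch above

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16),
    ADAPTER_PATH,
).eval()

app = FastAPI()

class ExtractionRequest(BaseModel):
    description: str  # the PDF description to extract a spec from

@app.post("/extract")
def extract(req: ExtractionRequest) -> dict:
    # Mirror the prompt format used at training time.
    prompt = f"{req.description}\n### Spec:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the echoed prompt.
    spec = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    return {"spec": spec}
```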

Success Metrics

  • Either the LoRA variant ships and achieves reasonable recall (>50%; a sketch of the measure follows this list), or the catalog honestly documents what was attempted and what blocked shipping
  • If shipped, cost-per-extraction documented vs API variants
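
The issue does not define how recall is computed, so the gate above presumably means a per-field measure averaged over the evaluation set; one plausible definition:

```python
# Hypothetical field-level recall for a single extraction. The ">50%" gate
# would then be the mean over all evaluation examples; this exact definition
# is an assumption, not something the issue specifies.
def field_recall(expected: dict, extracted: dict) -> float:
    """Fraction of expected fields recovered with the correct value."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return hits / len(expected)
```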

Notes

  • Class topic: fine-tuning (Ch 5), MLOps (Ch 4), production deployment (Ch 6) — the headline rubric item
  • Base model choice: smallest viable — Llama 3.2 3B or Mistral 7B
  • Cost gate: explicit user approval before GPU training spend
  • This is the most ambitious story; acceptable to scope-defer with a written explanation

Definition of Done

  • Acceptance criteria met (or clearly documented scope-deferral)
  • Tests pass
  • Type checking passes
  • Threat model updated (new service, new external training process)
  • CI pipeline green
  • Deployed and demoable
