feat: harden AI summaries and staging bakeoff tooling #12

Merged: darron merged 1 commit into main from model-fix on May 9, 2026
darron (Owner) commented on May 9, 2026

Clean up story and record summary generation so extracted page chrome, daemon metadata, front matter, share links, scripts, image URLs, and noisy HTML/Markdown artifacts are stripped before they can leak into rendered record pages. Tighten the summary prompts, disable model thinking for free-text summary calls, normalize synthesis classifications, and retry summaries that look like extractor noise instead of treating them as valid existing output.
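
The stripping step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the regular expressions, the chrome phrase list, and the `sanitize` name are all assumptions, and the real pattern set is likely broader.

```python
import re

# All patterns below are hypothetical stand-ins for the PR's chrome/noise rules.
FRONT_MATTER_RE = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
MD_IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
IMAGE_URL_RE = re.compile(
    r"https?://\S+\.(?:png|jpe?g|gif|webp|svg)\S*", re.IGNORECASE
)
TAG_RE = re.compile(r"<[^>]+>")
CHROME_LINE_RE = re.compile(
    r"^(skip to content|share( this| on \w+)?|copy link|sign (in|up)"
    r"|subscribe|menu|advertisement)\W*$",
    re.IGNORECASE,
)


def sanitize(text: str) -> str:
    """Strip front matter, scripts, images, tags, and chrome-only lines."""
    text = FRONT_MATTER_RE.sub("", text)   # leading YAML-style front matter
    text = SCRIPT_RE.sub(" ", text)        # <script>/<style> blocks with bodies
    text = MD_IMAGE_RE.sub(" ", text)      # Markdown image syntax
    text = IMAGE_URL_RE.sub(" ", text)     # bare image URLs
    text = TAG_RE.sub(" ", text)           # any remaining HTML tags
    kept = []
    for line in text.splitlines():
        line = " ".join(line.split())      # collapse runs of whitespace
        if line and not CHROME_LINE_RE.match(line):
            kept.append(line)
    return "\n".join(kept)
```

Because the chrome check only drops lines that consist entirely of a known phrase, a sentence that merely contains the word "share" survives; a short legitimate line such as "Menu" would still be removed, which is the heuristic risk noted under Risks.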

Add staging-only machine access for bearer-token record actions so summary and location-enrichment bakeoffs can run without a browser session. Add Taskfile targets and a deterministic seed script for creating staging test records from one or more URLs, queueing summarization, fetching rendered records, and inspecting stored summaries/location fields.
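
A staging-only bearer-token gate of the kind described above might look like the sketch below. The function name, the `"staging"` value for `APP_ENV`, and the way the ingest token is passed in are assumptions for illustration; the PR's actual middleware is not shown here.

```python
import hmac
from typing import Optional


def machine_access_allowed(
    app_env: str, auth_header: Optional[str], ingest_token: str
) -> bool:
    """Return True only on staging and only for a matching bearer token."""
    # Fail closed: an unset or unexpected APP_ENV never enables the path.
    if app_env != "staging":
        return False
    if not ingest_token or not auth_header:
        return False
    scheme, _, presented = auth_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(
        presented.strip().encode(), ingest_token.encode()
    )
```

Checking the environment first means a misconfigured `APP_ENV` disables machine access rather than exposing it, which is the safer failure mode for the risk listed below.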

Improve ingest matching for sources that mention a specific community near a larger reference city. Extraction now prefers the incident municipality, First Nation, reserve, or community over phrases like “east of Regina”, and candidate matching can attach when the specific source city plus year/province/counts strongly match an existing record.
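
As a rough illustration of the two ideas above, the sketch below first prefers a directly named community over a "direction of city" phrase, then attaches a source to an existing record only when city, province, year, and count all agree. The `Incident` fields, the regular expression, and the strict all-fields rule are assumptions; the real extraction is model-driven and the real matcher may weigh these signals differently.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical pattern for reference phrases such as "east of Regina".
REFERENCE_PHRASE_RE = re.compile(
    r"^(?:\d+\s*(?:km|kilometres|kilometers|miles)\s+)?"
    r"(?:north|south|east|west|northeast|northwest|southeast|southwest)"
    r"\s+of\s+|^near\s+",
    re.IGNORECASE,
)


def pick_incident_place(mentions: List[str]) -> Optional[str]:
    """Prefer a named community; fall back to the first mention."""
    specific = [m for m in mentions if not REFERENCE_PHRASE_RE.match(m.strip())]
    if specific:
        return specific[0]
    return mentions[0] if mentions else None


@dataclass
class Incident:
    city: str
    province: str
    year: int
    count: int  # hypothetical: whatever count the record tracks


def strongly_matches(source: Incident, record: Incident) -> bool:
    """Attach only when the specific city and all other signals agree."""
    return (
        source.city.casefold() == record.city.casefold()
        and source.province == record.province
        and source.year == record.year
        and source.count == record.count
    )
```

For example, `pick_incident_place(["east of Regina", "Balgonie"])` returns `"Balgonie"`, so a later source about the same incident is compared against the community name rather than the reference city.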

Behavior changes:

  • Staging exposes /admin/api/records/:id/summarize and /admin/api/records/:id/enrich-location to valid ingest-token callers only.
  • Existing noisy summaries may be regenerated instead of skipped.
  • Ingest proposals should avoid duplicate records when sources name a specific community near a better-known city.
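
The second behavior change, regenerating noisy summaries instead of skipping them, could be decided along these lines. The marker list and function names are hypothetical and only illustrate the shape of the check.

```python
from typing import Optional

# Hypothetical markers that suggest extractor noise leaked into a summary.
NOISE_MARKERS = ("skip to content", "copy link", "share this", "<script", "![")


def looks_like_extractor_noise(summary: str) -> bool:
    """Heuristic: flag summaries containing chrome phrases or raw markup."""
    lowered = summary.lower()
    if any(marker in lowered for marker in NOISE_MARKERS):
        return True
    # A summary that starts with front-matter delimiters is also suspect.
    return summary.lstrip().startswith("---")


def should_generate(existing: Optional[str]) -> bool:
    """Generate when there is no summary yet or the stored one looks noisy."""
    if existing is None or not existing.strip():
        return True
    return looks_like_extractor_noise(existing)
```

Previously a non-empty stored summary would have counted as valid existing output; with a check like this, only summaries that pass the noise heuristic are skipped.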

Risks:

  • The text sanitizer is heuristic and may remove some legitimate short page text if it resembles navigation or metadata.
  • The staging bearer-token path depends on APP_ENV being set correctly.

Follow-ups:

  • Bake off the staging summary targets against real problem URLs.
  • Watch production summaries for over-aggressive sanitization before widening the chrome/noise patterns further.

darron self-assigned this on May 9, 2026
darron added the bug (Something isn't working) label on May 9, 2026
darron merged commit 1823d29 into main on May 9, 2026