Clean up story and record summary generation so extracted page chrome, daemon metadata, front matter, share links, scripts, image URLs, and noisy HTML/Markdown artifacts are stripped before they can leak into rendered record pages. Tighten the summary prompts, disable model thinking for free-text summary calls, normalize synthesis classifications, and retry summaries that look like extractor noise instead of treating them as valid existing output.

Add staging-only machine access for bearer-token record actions so summary and location-enrichment bakeoffs can run without a browser session. Add Taskfile targets and a deterministic seed script for creating staging test records from one or more URLs, queueing summarization, fetching rendered records, and inspecting stored summaries/location fields.

Improve ingest matching for sources that mention a specific community near a larger reference city. Extraction now prefers the incident municipality, First Nation, reserve, or community over phrases like “east of Regina”, and candidate matching can attach when the specific source city plus year/province/counts strongly match an existing record.

Behavior changes:
- Staging exposes `/admin/api/records/:id/summarize` and `/admin/api/records/:id/enrich-location` to valid ingest-token callers only.
- Existing noisy summaries may be regenerated instead of skipped.
- Ingest proposals should avoid duplicate records when sources name a specific community near a better-known city.

Risks:
- The text sanitizer is heuristic and may remove some legitimate short page text if it resembles navigation or metadata.
- The staging bearer-token path depends on `APP_ENV` being set correctly.

Follow-ups:
- Bake off the staging summary targets against real problem URLs.
- Watch production summaries for over-aggressive sanitation before widening the chrome/noise patterns further.
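To make the sanitizer and the "looks like extractor noise" retry check more concrete, here is a minimal sketch in Python. It is not the code from this change: the pattern list, the function names `sanitize_text` and `looks_like_noise`, and the 50% threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; the actual chrome/noise list in this change is not shown here.
FRONT_MATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
SCRIPT_BLOCK = re.compile(r"<script\b.*?</script>", re.DOTALL | re.IGNORECASE)
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
IMAGE_URL = re.compile(r"https?://\S+\.(?:png|jpe?g|gif|webp|svg)\b\S*", re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")
CHROME_LINE = re.compile(
    r"^\s*(share( this| on \w+)?|sign in|sign up|subscribe|menu|skip to content)\s*$",
    re.IGNORECASE,
)


def sanitize_text(text: str) -> str:
    """Strip front matter, scripts, images, tags, and chrome-only lines."""
    text = FRONT_MATTER.sub("", text)
    text = SCRIPT_BLOCK.sub("", text)
    text = MARKDOWN_IMAGE.sub("", text)
    text = IMAGE_URL.sub("", text)
    text = HTML_TAG.sub("", text)
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not CHROME_LINE.match(line)
    ]
    return "\n".join(kept)


def looks_like_noise(summary: str) -> bool:
    """Assumed heuristic: flag a summary if sanitizing removes most of it."""
    cleaned = sanitize_text(summary)
    if not cleaned:
        return True
    return len(cleaned) < 0.5 * len(summary.strip())
```

A stored summary for which `looks_like_noise` returns true would be queued for regeneration rather than skipped. The risk noted above is visible in `CHROME_LINE`: a legitimate one-word line such as "Menu" would be dropped too.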
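The staging-only bearer-token gate could look roughly like the sketch below. This is an illustration, not the implementation in this change: the function name `machine_access_allowed`, the `INGEST_TOKEN` variable name, and the literal `"staging"` value for `APP_ENV` are assumptions.

```python
import hmac
import os


def machine_access_allowed(authorization_header, app_env=None, ingest_token=None):
    """Return True only for a valid bearer token in a staging environment.

    The environment variable names and the "staging" value are assumed.
    """
    if app_env is None:
        app_env = os.environ.get("APP_ENV", "")
    if ingest_token is None:
        ingest_token = os.environ.get("INGEST_TOKEN", "")

    # Fail closed: an unset or unexpected APP_ENV never enables the path.
    if app_env != "staging":
        return False
    # An empty configured token must not match an empty presented token.
    if not ingest_token:
        return False

    scheme, _, presented = (authorization_header or "").partition(" ")
    if scheme.lower() != "bearer" or not presented.strip():
        return False

    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(presented.strip().encode(), ingest_token.encode())
```

Because the check fails closed when `APP_ENV` is missing, a misconfigured deployment would most likely lose machine access on staging rather than gain it in production. A misconfiguration that sets production's `APP_ENV` to the staging value would still open the path.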