Make label_events writes idempotent on backfill (#25)#26
Open
hunnyboy1217 wants to merge 1 commit intoentrius:testfrom
Open
Make label_events writes idempotent on backfill (#25)#26hunnyboy1217 wants to merge 1 commit intoentrius:testfrom
hunnyboy1217 wants to merge 1 commit intoentrius:testfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make
label_eventswrites idempotent so backfill re-runs (and BullMQ retries) no longer duplicate rows.The webhook + backfill paths called
labelEventRepo.save()without a natural-key conflict path. BecauseLabelEvent.idis@PrimaryGeneratedColumn()andlabel_eventshad no UNIQUE constraint, every backfill INSERTed a fresh row for every label event already in the table. Thepr_labels_by_actor/issue_labels_by_actorviews collapse duplicates viaDISTINCT ON, so API output stayed correct while the table grew unbounded — eventually slowing the miners API as view scans degrade.Changes:
packages/db/11_label_events_dedup.sql— one-shot DELETE collapsing exact-duplicate rows by(repo_full_name, target_number, target_type, label_name, action, timestamp), keeping the lowestid. Idempotent (no-op on already-clean tables) and safe on fresh installs. Includes a batched form in the comments for very large production tables.packages/db/12_label_events_constraints.sql—CREATE UNIQUE INDEX ... NULLS NOT DISTINCTon the natural key. TheCONCURRENTLYvariant for online application to a running database is documented in the file's comments.packages/das/src/webhook/github-fetcher.service.ts—saveLabelTimelineEvents: per-noderepo.save()loop replaced with a single batchedinsert().values(rows).orIgnore().execute(). One round-trip per timeline instead of N, andON CONFLICT DO NOTHINGmakes re-runs no-ops.packages/das/src/webhook/handlers/label.handler.ts— samesave → insert().orIgnore()swap on the webhook write path. Defense-in-depth alongside the existingwebhook_deliveriesdelivery-id dedup.Deploy order for existing production databases:
orIgnore()is a no-op behavior change without the constraint).11_label_events_dedup.sql.12_label_events_constraints.sql(use theCONCURRENTLYvariant in the file comments).Reversing steps 2/3 before deploying the code would cause the old
save()calls to throw on unique-violations and fail backfill jobs.Related Issues
Fixes #25
Type of Change
Testing
No test framework exists in this repo; verified manually.
Reproduction (matches issue #25 steps):
Build / lint:
npm run build— passes (NestJS compiler, full TS typecheck)npm run lint— cleannpm run format:check— cleanOut of scope (separate follow-ups):
new Date().toISOString()while backfill uses GraphQLLabeledEvent.createdAt. The same logical labeling action can be represented by two rows with slightly different timestamps, which won't collide on the new UNIQUE. Bounded leak (≈1 extra row per labeling action that occurs while the app is running, per backfill that follows). Proper fix is to drop the webhook'slabel_eventswrite entirely and enqueue a refresh job — architectural change deserving its own PR.DISTINCT ON (repo, target, label_name) ORDER BY timestamp DESCin the labels views still seq-scans after dedupe. A purpose-built index would help, but is a separate perf PR.Checklist