[Bug] Non-idempotent label event insertion duplicates rows on every backfill #25

@web-dev0521

Description

saveLabelTimelineEvents in packages/das/src/webhook/github-fetcher.service.ts:986 calls labelEventRepo.save({...}) without an id. LabelEvent uses @PrimaryGeneratedColumn() (packages/das/src/entities/LabelEvent.entity.ts:5) and label_events has no unique constraint (packages/db/07_label_events.sql:6-18), so every backfill INSERTs a fresh row for each label event instead of upserting. As a result, the full label history of every PR and issue is duplicated again on every backfill run.
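To make the mechanism concrete, here is a minimal in-memory sketch of a table with an auto-generated primary key. It is not the project's code: FakeLabelEventTable, LabelEventData, and the sample events are hypothetical, and save only imitates the "update if the id exists, otherwise insert" behaviour described above.

```typescript
// Hypothetical shapes for illustration; the real entity may have other columns.
interface LabelEventData {
  repoFullName: string;
  targetNumber: number;
  labelName: string;
  action: string;
  timestamp: string;
}

interface LabelEventRow extends LabelEventData {
  id: number;
}

class FakeLabelEventTable {
  private nextId = 1;
  readonly rows: LabelEventRow[] = [];

  // Update when an id is supplied and found; otherwise insert with a new generated id.
  save(input: LabelEventData & { id?: number }): LabelEventRow {
    const id = input.id;
    if (id !== undefined) {
      for (let i = 0; i < this.rows.length; i++) {
        if (this.rows[i].id === id) {
          const updated: LabelEventRow = { ...input, id };
          this.rows[i] = updated;
          return updated;
        }
      }
    }
    const inserted: LabelEventRow = { ...input, id: this.nextId++ };
    this.rows.push(inserted);
    return inserted;
  }
}

// One backfill pass: every timeline event is saved without an id.
function backfillOnce(table: FakeLabelEventTable, events: LabelEventData[]): void {
  for (const e of events) {
    table.save(e);
  }
}

const demoTimeline: LabelEventData[] = [
  { repoFullName: "org/repo", targetNumber: 1, labelName: "bug", action: "labeled", timestamp: "2024-01-01T00:00:00Z" },
  { repoFullName: "org/repo", targetNumber: 1, labelName: "bug", action: "unlabeled", timestamp: "2024-01-02T00:00:00Z" },
];

const demoTable = new FakeLabelEventTable();
backfillOnce(demoTable, demoTimeline);
backfillOnce(demoTable, demoTimeline);
// demoTable.rows now holds 4 rows for 2 real events, each with a distinct generated id.
```

Because nothing in the input identifies an existing row, the second pass has no way to match what the first pass wrote.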

BullMQ is configured with attempts: 2 on backfill jobs (packages/das/src/api/admin.controller.ts:95), so any partial failure mid-backfill (e.g. GitHub rate limit) retries from the beginning and doubles every label event already written during that run. After N backfills, label_events holds N× the real rows; the consumer views (pr_labels_by_actor, issue_labels_by_actor) silently collapse duplicates via DISTINCT ON, so API output stays correct long enough that nobody notices the table is ballooning. Eventually full-table scans on the views slow the miners API to a crawl and timeouts begin.
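The retry effect can be sketched with a small simulation. This is a hypothetical model, not BullMQ itself: runBackfillAttempt and runJobWithRetries are made-up helpers, and the failure point is chosen arbitrarily. It assumes attempts is the total number of tries and that rows written before a failure are not rolled back.

```typescript
// One attempt: append a row per event; optionally fail before writing event `failAt`.
function runBackfillAttempt(sink: string[], events: string[], failAt: number | null): boolean {
  for (let i = 0; i < events.length; i++) {
    if (failAt !== null && i === failAt) {
      return false; // simulated rate-limit failure; earlier rows stay in the sink
    }
    sink.push(events[i]); // append-only write, like the non-idempotent save
  }
  return true;
}

// Each retry restarts from the first event. Returns the number of attempts used.
function runJobWithRetries(
  sink: string[],
  events: string[],
  attempts: number,
  failAtOnFirstAttempt: number | null
): number {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const failAt = attempt === 1 ? failAtOnFirstAttempt : null;
    if (runBackfillAttempt(sink, events, failAt)) {
      return attempt;
    }
  }
  return attempts;
}

const retryEvents = ["e1", "e2", "e3", "e4", "e5"];
const retrySink: string[] = [];
const attemptsUsed = runJobWithRetries(retrySink, retryEvents, 2, 3);
// First attempt writes e1..e3 then fails; the retry writes e1..e5 again.
// retrySink.length is 8 for 5 real events, and attemptsUsed is 2.
```

In this model a single job that fails once leaves the events written before the failure in the table twice, on top of whatever earlier backfills already duplicated.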

Failure mode: silent-bad-data + DoS through unbounded resource exhaustion.

Steps to Reproduce

  1. Register a repo with at least one labeled PR or issue via POST /api/v1/admin/repos/register (auto-enqueues a backfill), or call POST /api/v1/admin/backfill directly.
  2. Wait for the backfill job to complete, then run SELECT COUNT(*) FROM label_events;.
  3. Trigger another backfill on the same repo with POST /api/v1/admin/backfill.
  4. Re-run SELECT COUNT(*) FROM label_events; — the row count has increased by exactly the number of label events on GitHub, even though no new labeling activity occurred. Query pr_labels_by_actor / issue_labels_by_actor and observe API output is unchanged because DISTINCT ON masks the duplicates.
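Why step 4 shows unchanged API output can be illustrated with a rough in-memory analogue of DISTINCT ON ... ORDER BY timestamp DESC. The grouping key used here (target plus label) is an assumption for the example, since the actual DISTINCT ON column list is not quoted in this report. LabelRow and latestPerKey are hypothetical names.

```typescript
interface LabelRow {
  target: number;
  label: string;
  action: string;
  timestamp: string; // ISO-8601 UTC, so string comparison orders chronologically
}

// Keep only the newest row per key, similar in spirit to DISTINCT ON with timestamp DESC.
function latestPerKey(rows: LabelRow[]): LabelRow[] {
  const newest: { [key: string]: LabelRow } = {};
  for (const row of rows) {
    const key = row.target + "|" + row.label;
    const seen = newest[key];
    if (seen === undefined || row.timestamp > seen.timestamp) {
      newest[key] = row;
    }
  }
  return Object.keys(newest).sort().map((k) => newest[k]);
}

const rowsAfterOneBackfill: LabelRow[] = [
  { target: 1, label: "bug", action: "labeled", timestamp: "2024-01-01T00:00:00Z" },
  { target: 1, label: "bug", action: "unlabeled", timestamp: "2024-01-02T00:00:00Z" },
  { target: 2, label: "docs", action: "labeled", timestamp: "2024-01-03T00:00:00Z" },
];

// Three backfills: the same rows three times over.
const rowsAfterThreeBackfills = rowsAfterOneBackfill.concat(rowsAfterOneBackfill, rowsAfterOneBackfill);

const viewOnce = JSON.stringify(latestPerKey(rowsAfterOneBackfill));
const viewThrice = JSON.stringify(latestPerKey(rowsAfterThreeBackfills));
// viewOnce === viewThrice, although the underlying row count tripled.
```

The view result is identical in both cases, which is why the row count query in steps 2 and 4 is the reliable signal rather than the API response.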

Expected Behavior

Re-processing the same label timeline event should be idempotent. label_events should hold one row per (repo_full_name, target_number, target_type, label_name, action, timestamp) regardless of how many backfills or retries fire.
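A minimal sketch of the expected semantics, keyed on the six columns listed above. This is an in-memory illustration only: IdempotentLabelEventStore, naturalKey, and the field names are hypothetical. In the database, the equivalent guarantee would presumably come from a unique constraint on those columns combined with a conflict-aware insert.

```typescript
interface LabelEventInput {
  repoFullName: string;
  targetNumber: number;
  targetType: string;
  labelName: string;
  action: string;
  timestamp: string;
}

interface StoredLabelEvent extends LabelEventInput {
  id: number;
}

// Serialise the natural key; JSON keeps field boundaries unambiguous.
function naturalKey(e: LabelEventInput): string {
  return JSON.stringify([e.repoFullName, e.targetNumber, e.targetType, e.labelName, e.action, e.timestamp]);
}

class IdempotentLabelEventStore {
  private byKey: { [key: string]: StoredLabelEvent } = {};
  private nextId = 1;

  // Insert only when the natural key is new; otherwise return the existing row untouched.
  upsert(e: LabelEventInput): StoredLabelEvent {
    const key = naturalKey(e);
    const existing = this.byKey[key];
    if (existing !== undefined) {
      return existing;
    }
    const stored: StoredLabelEvent = { ...e, id: this.nextId++ };
    this.byKey[key] = stored;
    return stored;
  }

  count(): number {
    return Object.keys(this.byKey).length;
  }
}

const idempotentStore = new IdempotentLabelEventStore();
const sampleEvents: LabelEventInput[] = [
  { repoFullName: "org/repo", targetNumber: 7, targetType: "pr", labelName: "bug", action: "labeled", timestamp: "2024-01-01T00:00:00Z" },
  { repoFullName: "org/repo", targetNumber: 7, targetType: "pr", labelName: "bug", action: "unlabeled", timestamp: "2024-01-02T00:00:00Z" },
];

// Three backfills over the same timeline.
for (let run = 0; run < 3; run++) {
  for (const e of sampleEvents) {
    idempotentStore.upsert(e);
  }
}
// idempotentStore.count() stays at 2 no matter how many runs or retries fire.
```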

Actual Behavior

Every backfill (and every retry of a backfill, since attempts: 2) appends a full duplicate copy of every label event for every PR and issue in the repo. The table grows without bound, no error is thrown, and the API stays correct until view scans degrade and the miners endpoint starts timing out.

Environment

  • OS: Linux 6.17.0-23-generic
  • Runtime/Node version: Node.js v20.20.2 (container)
  • Browser (if applicable): n/a

Additional Context

Affected code paths:

  • packages/das/src/webhook/github-fetcher.service.ts:975-999: saveLabelTimelineEvents performs a non-idempotent save per node
  • packages/das/src/entities/LabelEvent.entity.ts:5: id is @PrimaryGeneratedColumn(); no natural-key unique index
  • packages/db/07_label_events.sql:6-18: table has only a (repo_full_name, target_number, timestamp) index, no UNIQUE constraint
  • packages/das/src/api/admin.controller.ts:95: attempts: 2 causes any partial failure to double the rows written before the failure point
  • packages/db/24_view_pr_labels_by_actor.sql, packages/db/25_view_issue_labels_by_actor.sql: both use DISTINCT ON (...) ORDER BY timestamp DESC, hiding the duplication from API consumers

Metadata

Labels

bug (Something isn't working)