Strategize materialized view refresh timing relative to collection phases #315

@shlokgilda

Spinning out from #292 review (cc @MoralCode).

The refresh task in collectoss/tasks/db/refresh_materialized_views.py runs every view on whatever schedule Celery beat says, regardless of:

  • Whether collection is mid-cycle for a given repo (a refresh can land partway through collection, leaving the view holding partial data until the next refresh).
  • Which collection phase (core / secondary / facade) feeds each view; we may refresh views whose source hasn't actually changed.
  • Concurrent inserts. REFRESH MATERIALIZED VIEW CONCURRENTLY doesn't block reads, but only one refresh of a given view can run at a time, and during heavy collection windows a long-running refresh can interleave with writes in surprising ways.

Stuff worth thinking through:

  • Trigger refresh after a collection phase completes for a repo group, instead of on a wall clock?
  • Tag each view in the registry with the phases that feed it; only refresh views whose phases just finished?
  • Track last_refreshed_at per view, skip if nothing changed?
  • issue_reporter_created_at lacks a unique index, so it can only refresh non-concurrently. That refresh takes a lock that blocks reads on the view, which is briefly disruptive. Schedule it separately, or add a unique index to bring it onto the concurrent path?
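
To make the options above concrete, here is a rough sketch of a phase-tagged registry with a last_refreshed_at skip check. Everything in it is hypothetical: `ViewSpec`, `REGISTRY`, `views_to_refresh`, `refresh_statement`, the `example_*` view names, and the phase assigned to issue_reporter_created_at are placeholders for discussion, not what the current task or registry does.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ViewSpec:
    """Hypothetical registry entry for one materialized view."""
    name: str
    phases: frozenset            # collection phases that feed this view
    concurrent: bool = True      # False when the view has no unique index
    last_refreshed_at: Optional[datetime] = None


# Illustrative registry. The view-to-phase mapping is an assumption.
REGISTRY = [
    ViewSpec("issue_reporter_created_at", frozenset({"core"}), concurrent=False),
    ViewSpec("example_commit_view", frozenset({"facade"})),
    ViewSpec("example_mixed_view", frozenset({"core", "secondary"})),
]


def views_to_refresh(finished_phase, source_changed_at, registry=REGISTRY):
    """Return views fed by finished_phase whose source changed since their last refresh."""
    due = []
    for view in registry:
        if finished_phase not in view.phases:
            continue  # this phase does not feed the view
        if (view.last_refreshed_at is not None
                and view.last_refreshed_at >= source_changed_at):
            continue  # already refreshed after the latest source change
        due.append(view)
    return due


def refresh_statement(view):
    """Build the REFRESH statement; names come from the trusted registry only."""
    keyword = "CONCURRENTLY " if view.concurrent else ""
    return f"REFRESH MATERIALIZED VIEW {keyword}{view.name}"
```

A phase-completion hook could call `views_to_refresh("core", changed_at)` and execute `refresh_statement(v)` for each result. How `source_changed_at` would be tracked per repo group is left open here, as is whether the non-concurrent views should go through the same path or a separate schedule.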

Library/fork choice for view + index management is a separate conversation — see #314.
