Spinning out from #292 review (cc @MoralCode).
The refresh task in collectoss/tasks/db/refresh_materialized_views.py refreshes every view on whatever schedule Celery beat dictates, regardless of:
- Whether collection is mid-cycle for a given repo (a refresh can land mid-collection and leave views holding partial data).
- Which collection phase (core / secondary / facade) feeds each view; we may refresh views whose source hasn't actually changed.
- Concurrent inserts.
REFRESH MATERIALIZED VIEW CONCURRENTLY doesn't block reads but does serialize against itself, and during heavy collection windows a long-running refresh can interleave with writes in surprising ways.
Stuff worth thinking through:
- Trigger refresh after a collection phase completes for a repo group, instead of on a wall clock?
- Tag each view in the registry with the phases that feed it; only refresh views whose phases just finished?
- Track last_refreshed_at per view, skip if nothing changed?
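
To make the phase-tagging and skip-if-unchanged ideas concrete, here is a rough sketch of what the selection logic might look like. Everything in it is hypothetical: `ViewSpec`, the registry contents, the placeholder view names, and using "phase finished at" as a stand-in for "source changed". It is not how the current task works.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ViewSpec:
    """Hypothetical registry entry for one materialized view."""
    name: str
    phases: frozenset  # collection phases that feed this view
    last_refreshed_at: Optional[datetime] = None


# Placeholder names and tags; the real mapping would need to be worked out per view.
REGISTRY = [
    ViewSpec("example_core_view", frozenset({"core"})),
    ViewSpec("example_mixed_view", frozenset({"core", "secondary"})),
    ViewSpec("example_facade_view", frozenset({"facade"})),
]


def views_to_refresh(registry, finished_phase, phase_finished_at):
    """Return views fed by finished_phase that have not been refreshed since it finished."""
    due = []
    for view in registry:
        if finished_phase not in view.phases:
            continue  # this phase does not feed the view
        if (view.last_refreshed_at is not None
                and view.last_refreshed_at >= phase_finished_at):
            continue  # already refreshed after the phase completed
        due.append(view)
    return due
```

The phase-completion hook would call `views_to_refresh(REGISTRY, "core", finished_at)` and refresh only what comes back, updating `last_refreshed_at` after each successful refresh.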
issue_reporter_created_at lacks a unique index, so it can only refresh non-concurrently. That path takes a lock that blocks reads on the view for the duration of the refresh, which is briefly disruptive. Schedule it separately, or add a unique index to bring it onto the concurrent path?
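
If we go the "schedule it separately" route, one possible shape is to pick the statement per view and split the views into two groups. A minimal sketch, assuming a hypothetical `has_unique_index` flag stored next to each view name (the function names are made up too):

```python
def refresh_statement(view_name: str, has_unique_index: bool) -> str:
    """Build the REFRESH statement for one view.

    view_name is assumed to come from our own registry, never from user
    input, since it is interpolated directly into SQL.
    """
    if has_unique_index:
        # Concurrent path: readers keep seeing the old contents during the refresh.
        return f"REFRESH MATERIALIZED VIEW CONCURRENTLY {view_name}"
    # Blocking path: readers wait until the refresh finishes.
    return f"REFRESH MATERIALIZED VIEW {view_name}"


def split_by_path(views: dict) -> tuple:
    """Split {view_name: has_unique_index} into (concurrent, blocking) name lists."""
    concurrent = [name for name, unique in views.items() if unique]
    blocking = [name for name, unique in views.items() if not unique]
    return concurrent, blocking
```

The blocking group could then get its own beat entry in a quiet window, while the concurrent group follows the collection phases.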
Library/fork choice for view + index management is a separate conversation — see #314.