scheduled jobs: avoid "top-of-the-hour" service level degradation #54537
Comments
cc @cockroachdb/disaster-recovery
I don't think we expect to implement this in the jobs system any time soon; users who are used to I think we could probably close this as unplanned.
Also noting that the Schema Telemetry Job does some of its own randomization before choosing
cockroach/pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go, lines 188 to 214 in 9c510f9
Other internal jobs could do something like this too.
107633: sql/schemachanger: DROP INDEX could drop unrelated foreign keys r=fqazi a=fqazi

Previously, when DROP INDEX was resolving and iterating over foreign keys, it did not validate that these foreign keys were related to the index we were dropping. As a result, if any table referred back to the target table with the index, we would analyze its foreign keys. If cascade wasn't specified, this could incorrectly end up blocking the DROP INDEX on unrelated foreign key references, assuming they need our index. Or worse, with cascade we could remove foreign key constraints in other tables. To address this, this patch filters the back references to only look at ones related to the target table, which causes the correct set to be analyzed / dropped.

Fixes: #107576

Release note (bug fix): Dropping an index could end up failing or cleaning foreign keys (when CASCADE is specified) on other tables referencing the target table with this index.

107646: sql: use a random minute for the sql-stats-compaction job default recurrence r=maryliag a=rafiss

### scheduledjobs: move MaybeRewriteCronExpr into package

This was moved from the schematelemetrycontroller package. There are no code changes in this commit.

----

### sql: use a random minute for the sql-stats-compaction job default recurrence

Now, the sql-stats-compaction job that is created during cluster initialization will be scheduled on a random minute in the hour, rather than at the top of the hour. This will only affect clusters that are initialized after this change is released. Any existing clusters will continue to keep whatever recurrence they had before, which defaulted to `@hourly`.

This change was made because we have observed that this job can cause CPU spikes on the serverless host clusters, since different tenants all had this job scheduled for the same time.

---

see: https://cockroachlabs.slack.com/archives/C04U1BTF8/p1688829944578639
refs: #54537
Epic: None
Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
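To illustrate the idea of moving an `@hourly` recurrence onto a random minute, here is a minimal Go sketch. It is not the `MaybeRewriteCronExpr` code from the repository; `randomizeHourly` and its `pick` parameter are hypothetical names introduced only for this example, and the sketch handles only the `@hourly` shorthand.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomizeHourly is a hypothetical helper, not CockroachDB's implementation.
// It rewrites the "@hourly" shorthand into a five-field cron expression that
// fires once an hour on a minute chosen by pick(60). Every other expression
// is returned unchanged.
func randomizeHourly(expr string, pick func(n int) int) string {
	if expr != "@hourly" {
		return expr
	}
	minute := pick(60) // expected to return a value in [0, 59]
	return fmt.Sprintf("%d * * * *", minute)
}

func main() {
	// rand.Intn has the signature func(int) int, so it can be passed directly.
	fmt.Println(randomizeHourly("@hourly", rand.Intn))
	// Expressions that are already explicit are left alone.
	fmt.Println(randomizeHourly("0 3 * * *", rand.Intn))
}
```

Passing the random source in as a function keeps the rewrite easy to test with a fixed value, and a caller could swap in a seeded source if the chosen minute needs to be reproducible.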
When creating a schedule, it's common to use simple cron syntax such as '@daily' or '@hourly'. The downside is that if every schedule is created with that syntax, cron runs them all at the top of the hour or day. And if every service across all of your systems uses similar scheduling syntax, then you get times where your services are hosed.
What we're looking for here is something where you can create schedules to avoid the above issues. "Random" or some sort of random jitter is probably a good enough heuristic to spread schedules out to avoid the problem. The ideal is to have a solution that can identify the best times to run to avoid "top of the hour" degradation.
For certain scheduled operations, you also want them to be on a consistent cadence but on staggered start times (e.g. every hour on the 23rd minute) -- namely Backups so that you can target an RPO.
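One way to get a consistent cadence with staggered start times is to derive the minute from a stable identifier instead of drawing it at random each time. The Go sketch below assumes each schedule has some stable string ID (a hypothetical input here); `staggeredMinute` and `hourlyAt` are illustrative names, not part of CockroachDB.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// staggeredMinute maps a stable identifier to a minute in [0, 59].
// The same id always yields the same minute, so the schedule keeps a
// fixed hourly cadence while different ids tend to land on different minutes.
func staggeredMinute(id string) int {
	h := fnv.New32a()
	h.Write([]byte(id)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % 60)
}

// hourlyAt builds a five-field cron expression that fires once an hour
// on the minute derived from id.
func hourlyAt(id string) string {
	return fmt.Sprintf("%d * * * *", staggeredMinute(id))
}

func main() {
	// The ids below are made up for the example.
	for _, id := range []string{"backup-tenant-a", "backup-tenant-b"} {
		fmt.Printf("%s -> %q\n", id, hourlyAt(id))
	}
}
```

Because the offset is a pure function of the ID, recreating the schedule does not shift its start time, which matters when the cadence backs an RPO target. Hashing spreads schedules out on average but does not guarantee that two IDs never share a minute.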
Epic CRDB-7909
Jira issue: CRDB-3742