scheduled jobs: avoid "top-of-the-hour" service level degradation #54537

mwang1026 · 2020-09-17T21:31:22Z

When creating a schedule it's common to use simple cron syntax like '@daily' or '@hourly'. The downside is that if every schedule is created with that syntax, every job runs at the top of the hour or day. And if every service across all of your systems uses similar scheduling syntax, you get times when your services are hosed.

What we're looking for here is a way to create schedules that avoids the above issues. Random jitter on the start time is probably a good enough heuristic to spread schedules out and avoid the problem. The ideal is a solution that can identify the best times to run to avoid "top of the hour" degradation.

For certain scheduled operations, you also want them to be on a consistent cadence but on staggered start times (e.g. every hour on the 23rd minute) -- namely Backups so that you can target an RPO.

Deliverables *

A way to specify randomness of the start time within the recurring window
That randomness is applied on the initial schedule and the specified cadence determines future schedule times

Other alternatives *

Load based determination of when a schedule should run
Randomness that's always applied on every run

Epic CRDB-7909

Jira issue: CRDB-3742

blathers-crl · 2023-07-09T04:32:55Z

cc @cockroachdb/disaster-recovery

dt · 2023-07-10T16:08:34Z

I don't think we expect to implement this in the jobs system any time soon; users who are used to @hourly in their cron system of choice may well expect it to be a shorthand for exactly 0 * * * *, as it is in most cron implementations, and those who want a random minute of the hour can already do that, just by using something like (random()*60)::int::string || ' * * * *' when they create their schedule instead of @hourly.

I think we could probably close this as unplanned.

rafiss · 2023-07-26T17:31:42Z

Also noting that the Schema Telemetry Job does some of its own randomization before choosing @weekly or @hourly:

cockroach/pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go

Lines 188 to 214 in 9c510f9

    
    // MaybeRewriteCronExpr is used to rewrite the interval-oriented cron exprs
    // into an equivalent frequency interval but with an offset derived from the
    // uuid. For a given pair of inputs, the output of this function will always
    // be the same. If the input cronExpr is not a special form as denoted by
    // the keys of cronExprRewrites, it will be returned unmodified. This rewrite
    // occurs in order to uniformly distribute the production of telemetry logs
    // over the intended time interval to avoid bursts.
    func MaybeRewriteCronExpr(id uuid.UUID, cronExpr string) string {
        if f, ok := cronExprRewrites[cronExpr]; ok {
            hash := fnv.New64a() // arbitrary hash function
            _, _ = hash.Write(id.GetBytes())
            return f(rand.New(rand.NewSource(int64(hash.Sum64()))))
        }
        return cronExpr
    }

    var cronExprRewrites = map[string]func(r *rand.Rand) string{
        cronWeekly: func(r *rand.Rand) string {
            return fmt.Sprintf("%d %d * * %d", r.Intn(60), r.Intn(23), r.Intn(7))
        },
        cronDaily: func(r *rand.Rand) string {
            return fmt.Sprintf("%d %d * * *", r.Intn(60), r.Intn(23))
        },
        cronHourly: func(r *rand.Rand) string {
            return fmt.Sprintf("%d * * * *", r.Intn(60))
        },
    }

Other internal jobs could do something like this too.

107633: sql/schemachanger: DROP INDEX could drop unrelated foreign keys r=fqazi a=fqazi

Previously, when DROP INDEX was resolving and iterating over foreign keys, it did not validate that these foreign keys were related to the index we were dropping. As a result, if any table referred back to the target table with the index, we would analyze its foreign keys. If cascade wasn't specified this could incorrectly end up blocking the DROP INDEX on unrelated foreign key references assuming they need our index. Or worse with cascade we could remove foreign key constraints in other tables. To address this, this patch filters the back references to only look at ones related to the target table, which causes the correct set to be analyzed / dropped.

Fixes: #107576

Release note (bug fix): Dropping an index could end up failing or cleaning foreign keys (when CASCADE is specified) on other tables referencing the target table with this index.

107646: sql: use a random minute for the sql-stats-compaction job default recurrence r=maryliag a=rafiss

### scheduledjobs: move MaybeRewriteCronExpr into package

This was moved from the schematelemetrycontroller package. There are no code changes in this commit.

----

### sql: use a random minute for the sql-stats-compaction job default recurrence

Now, the sql-stats-compaction job that is created during cluster initialization will be scheduled on a random minute in the hour, rather than at the top of the hour. This will only affect clusters that are initialized after this change is released. Any existing clusters will continue to keep whatever recurrence they had before, which defaulted to `@hourly`.

This change was made because we have observed that this job can cause CPU spikes on the serverless host clusters, since different tenants all had this job scheduled for the same time.

---

see: https://cockroachlabs.slack.com/archives/C04U1BTF8/p1688829944578639
refs: #54537
Epic: None
Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>

mwang1026 added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-disaster-recovery labels Sep 17, 2020

mwang1026 added this to Triage in Disaster Recovery Backlog via automation Sep 17, 2020

mwang1026 moved this from Triage to Backup/Restore/Dump in Disaster Recovery Backlog Sep 22, 2020

kenliu added the T-disaster-recovery label Dec 5, 2020

postamar mentioned this issue Aug 8, 2022

schematelemetry,eventpb: add schema telemetry #84761

Merged

rafiss added T-jobs and removed T-disaster-recovery labels Jul 9, 2023

blathers-crl bot added this to Triage in Jobs Jul 9, 2023

blathers-crl bot added the A-jobs label Jul 9, 2023

exalate-issue-sync bot added T-disaster-recovery and removed A-jobs T-jobs labels Jul 9, 2023

rafiss added T-jobs A-jobs and removed T-disaster-recovery labels Jul 9, 2023

rafiss mentioned this issue Jul 26, 2023

sql: use a random minute for the sql-stats-compaction job default recurrence #107646

Merged

blathers-crl bot mentioned this issue Jul 28, 2023

release-23.1: sql: use a random minute for the sql-stats-compaction job default recurrence #107782

Merged
