Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduled jobs: avoid "top-of-the-hour" service level degradation #54537

Open
mwang1026 opened this issue Sep 17, 2020 · 3 comments
Open

scheduled jobs: avoid "top-of-the-hour" service level degradation #54537

mwang1026 opened this issue Sep 17, 2020 · 3 comments
Labels
A-disaster-recovery A-jobs C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-jobs

Comments

@mwang1026
Copy link

mwang1026 commented Sep 17, 2020

When creating a schedule it's common to use simple cron syntax of '@daily' or '@hourly'. The downside of that is if every schedule is created with that syntax, cron runs at at the top of the hour / day. And if every service across all of your systems are using similar scheduling syntax then you get times where your services are hosed.

What we're looking for here is something where you can create schedules to avoid the above issues. "Random" or some sort of random jitter is probably a good enough heuristic to spread schedules out to avoid the problem. The ideal is to have a solution that can identify the best times to run to avoid "top of the hour" degradation.

For certain scheduled operations, you also want them to be on a consistent cadence but on staggered start times (e.g. every hour on the 23rd minute) -- namely Backups so that you can target an RPO.

  • Deliverables *
  • A way to specify randomness of the start time within the recurring window
  • That randomness is applied on the initial schedule and the specified cadence determines future schedule times
  • Other alternatives *
  • Load based determination of when a schedule should run
  • Randomness that's always applied on every run

Epic CRDB-7909

Jira issue: CRDB-3742

@mwang1026 mwang1026 added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-disaster-recovery labels Sep 17, 2020
@mwang1026 mwang1026 added this to Triage in Disaster Recovery Backlog via automation Sep 17, 2020
@mwang1026 mwang1026 moved this from Triage to Backup/Restore/Dump in Disaster Recovery Backlog Sep 22, 2020
@blathers-crl blathers-crl bot added this to Triage in Jobs Jul 9, 2023
@blathers-crl blathers-crl bot added the A-jobs label Jul 9, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 9, 2023

cc @cockroachdb/disaster-recovery

@dt
Copy link
Member

dt commented Jul 10, 2023

I don't think we expect to implement this in the jobs system any time soon; users who are used to @houly in their cron system of choice may well expect it to be a shorthand for exactly 0 * * * *, as it is in most cron implementations, and those who want a random minute of the hour can already do that, just be using something like (random()*60)::int::string || ' * * * *' when they create their schedule instead of @hourly.

I think we could probably close this as unplanned.

@rafiss
Copy link
Collaborator

rafiss commented Jul 26, 2023

Also noting that the Schema Telemetry Job does some of its own randomization before choosing @weekly or @hourly:

// MaybeRewriteCronExpr is used to rewrite the interval-oriented cron exprs
// into an equivalent frequency interval but with an offset derived from the
// uuid. For a given pair of inputs, the output of this function will always
// be the same. If the input cronExpr is not a special form as denoted by
// the keys of cronExprRewrites, it will be returned unmodified. This rewrite
// occurs in order to uniformly distribute the production of telemetry logs
// over the intended time interval to avoid bursts.
func MaybeRewriteCronExpr(id uuid.UUID, cronExpr string) string {
if f, ok := cronExprRewrites[cronExpr]; ok {
hash := fnv.New64a() // arbitrary hash function
_, _ = hash.Write(id.GetBytes())
return f(rand.New(rand.NewSource(int64(hash.Sum64()))))
}
return cronExpr
}
var cronExprRewrites = map[string]func(r *rand.Rand) string{
cronWeekly: func(r *rand.Rand) string {
return fmt.Sprintf("%d %d * * %d", r.Intn(60), r.Intn(23), r.Intn(7))
},
cronDaily: func(r *rand.Rand) string {
return fmt.Sprintf("%d %d * * *", r.Intn(60), r.Intn(23))
},
cronHourly: func(r *rand.Rand) string {
return fmt.Sprintf("%d * * * *", r.Intn(60))
},
}

Other internal jobs could do something like this too.

craig bot pushed a commit that referenced this issue Jul 28, 2023
107633: sql/schemachanger: DROP INDEX could drop unrelated foreign keys r=fqazi a=fqazi

Previously, when DROP INDEX was resolving and iterating over foreign keys, it did not validate that these foreign keys were related to the index we were dropping. As a result, if any table referred back to the target table with the index, we would analyze its foreign keys. If cascade wasn't specified this could incorrectly end up blocking the DROP INDEX on unrelated foreign key references assuming they need our index. Or worse with cascade we could remove foreign key constraints in other tables. To address this, this patch filters the back references to only look at ones related to the target table, which causes the correct set to be analuzed / dropped.

Fixes: #107576

Release note (bug fix): Dropping an index could end up failing or cleaning foreign keys (when CASCADE is specified) on other tables referencing the target table with this index.

107646: sql: use a random minute for the sql-stats-compaction job default recurrence r=maryliag a=rafiss

### scheduledjobs: move MaybeRewriteCronExpr into package

This was moved from the schematelemetrycontroller package.

There are no code changes in this commit.

----

### sql: use a random minute for the sql-stats-compaction job default recurrence

Now, the sql-stats-compaction job that is created during cluster
initialization will be scheduled on a random minute in the hour, rather
than at the top of the hour. This will only affect clusters that are
initialized after this change is released. Any existing clusters will
continue to keep whatever recurrence they had before, which defaulted to
`@hourly.`

This change was made because we have observed that this job can cause
CPU spikes on the serverless host clusters, since different tenants all
had this job scheduled for the same time.

---

see: https://cockroachlabs.slack.com/archives/C04U1BTF8/p1688829944578639
refs: #54537
Epic: None
Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery A-jobs C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-jobs
Projects
Jobs
Triage
Development

No branches or pull requests

4 participants