jobs: add expiration for paused jobs #84598

Open
amruss opened this issue Jul 18, 2022 · 13 comments
amruss (Contributor) commented Jul 18, 2022

Paused jobs are easily forgotten. Long-paused jobs can cause cluster issues because of the garbage they accumulate, and supporting long-paused jobs through cluster upgrades is hard.

We should:

  • Expire paused jobs after 3 days of being paused, and add the expiration time as a column in the jobs table (see the sketch below)
  • If a job is unpaused and then paused again, the expiration counter should reset

Rationale behind 3 days: it gives enough working-day time to have done something about it.
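
A rough sketch of the intent (the expiration column is hypothetical, and the manual sweep below assumes the `modified` column of `crdb_internal.jobs` as a stand-in for the time of pause):

```sql
-- Illustration only, not the shipped schema: the proposal is for the jobs
-- table to gain an expiration timestamp that is (re)set on every PAUSE JOB,
-- so unpausing and re-pausing restarts the 3-day clock.
--
-- Until such a column exists, an operator can approximate the sweep by hand:
CANCEL JOBS (
  SELECT job_id
  FROM crdb_internal.jobs
  WHERE status = 'paused'
    AND modified < now() - INTERVAL '3 days'
);
```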

Follow up work:

Jira issue: CRDB-17756

Epic: CRDB-20422

amruss added the C-enhancement label Jul 18, 2022
amruss added this to Triage in Jobs via automation Jul 18, 2022
blathers-crl bot added the T-jobs label Jul 18, 2022
amruss moved this from Triage to In Progress in Jobs Jul 18, 2022
amruss assigned dt and unassigned chengxiong-ruan Aug 11, 2022
amruss (Contributor, Author) commented Aug 15, 2022

Discussion: 10 days instead of 3

lunevalex (Collaborator) commented:

@amruss don't we recommend people do full backups once a day and incrementals throughout the day? In that case, if I am going to do a full backup in less than a day, why would we want to have a busted incremental linger for 3 or even 10 days?

irfansharif (Contributor) commented Sep 23, 2022

I find the suggestion of 3d already somewhat surprising given these PTS records are left around on the error path -- an accrual of up to 3d of MVCC garbage for the entire keyspan (full backups, for example) can be debilitating storage-wise, with effects on read latencies, etc. We can improve MVCC garbage observability and admission control for this work, and that's all good, but I still question anything longer than single-digit hours here. Certainly there will be users wanting to accommodate an incident response time of >= 3d, which configurability could let them do; I'm just questioning whether it should be made the default. +cc @williamkulju, @mwang1026.

irfansharif (Contributor) commented:

If we had some way of making PTS records dramatically cheaper with respect to holding up garbage collection, say by doing something like #86697, then perhaps a multi-day expiration time would be palatable; the data being kept around would be O(live-bytes). But today it's O(live-bytes * time), a compounding we should consider in the defaults we ship with. +cc @nvanbenschoten.
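
To make the compounding concrete with purely illustrative numbers: for a protected keyspan holding 100 GiB of live data whose workload rewrites roughly 1% of it per hour, a PTS record held for 72 hours pins on the order of 100 GiB × 1%/hour × 72 hours ≈ 72 GiB of otherwise-collectible garbage on top of the live bytes, whereas an O(live-bytes) scheme like #86697 would keep only the ~100 GiB of live data no matter how long the pause lasts.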

dt (Member) commented Sep 23, 2022

Expiring a paused job is not intended to be the defense against MVCC garbage accumulation due to PTS records. Garbage can accumulate for a variety of reasons other than a paused job (e.g. a non-paused job that is running slowly or is stuck, or not a job at all but a backup schedule that hasn't completed yet), so the answer to catching and fixing garbage accumulation is better KV observability tools, like garbage-bytes-percentage alerts, displaying garbage amounts on the database and table console pages, etc.

Expiring paused jobs after 3 or 10 days is just there to catch something that has been forgotten, so it doesn't linger indefinitely, e.g. across major versions.

amruss (Contributor, Author) commented Nov 10, 2022

Moving this back to triage to discuss with the CDC team.

joshimhoff (Collaborator) commented Dec 1, 2022

SRE just discussed this. Can we execute on this ticket ~now and include it in a patch update, with a cluster setting (disabled by default if needed) that we can flip on in CC immediately? The pause-job-if-backup-fails behavior implies a lot of regular operational work (read: toil) for SRE to keep CC backups from holding up PTO and spending lots of disk unnecessarily. Re: my suggestion to fix this ~now, in CC at least: I think landing pause-instead-of-fail in 22.1 without also landing expiration for paused jobs should be considered a regression.

dt (Member) commented Dec 1, 2022

I'm not exactly sure what a patch-release version of this would look like, since we don't track when a job was paused right now, so existing paused jobs don't reflect how long they've been paused. We'd need to start writing a new field at pause time to use for this, and any jobs that predate it would likely just be left alone?

joshimhoff (Collaborator) commented Dec 1, 2022

Is there nothing in the system that at least indirectly keeps track of when a job went from running to paused? Isn't there a heartbeated-at column, or something like that?

miretskiy (Contributor) commented:

I'm not too sure I will wind up adding job expiration, but I will add protected timestamp expiration.

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Feb 15, 2023
Changefeeds use the protected timestamp system (PTS) to ensure that the data targeted by a changefeed is not garbage collected prematurely. The PTS record is managed by the running changefeed, which periodically advances the record's timestamp so that data older than that timestamp may be GCed. However, if the changefeed stops running because it is paused (either due to operator action or due to the `on_error=pause` option), the PTS record remains so that the changefeed can be resumed later. It is then possible that an operator does not notice the job has been paused for too long, causing a buildup of garbage data.

Excessive buildup of GC work is not great: it impacts overall cluster performance, and, once GC can resume, its cost is proportional to how much GC work needs to be done. This PR introduces a new setting, `changefeed.protect_timestamp.expire_after`, to automatically expire PTS records that are too old. This automatic expiration is a safety mechanism, and the setting default is rather conservative. The operator is still expected to monitor changefeed jobs and to restart paused changefeeds expediently. If the changefeed job remains paused and the underlying PTS record is removed due to expiration, then the changefeed job will fail when it is restarted. This failure is preferable to an unbounded buildup of garbage data in the cluster.

Epic: CRDB-21953
This PR does not add expiration to the job itself, as requested by cockroachdb#84598, but it does accomplish the same goal.

Release note (enterprise change): Changefeed will automatically expire PTS records after `changefeed.protect_timestamp.expire_after` so that paused changefeed jobs do not cause GC data buildup in the cluster.
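
A usage sketch for the setting named in the message above (the duration is illustrative, not the shipped default):

```sql
-- Cap how long a paused changefeed's PTS record may hold back garbage
-- collection; '24h' here is an example value.
SET CLUSTER SETTING changefeed.protect_timestamp.expire_after = '24h';
```
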
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Feb 23, 2023
Changefeeds use the protected timestamp system (PTS) to ensure that the data targeted by a changefeed is not garbage collected prematurely. The PTS record is managed by the running changefeed, which periodically advances the record's timestamp so that data older than that timestamp may be GCed. However, if the changefeed stops running because it is paused (either due to operator action or due to the `on_error=pause` option), the PTS record remains so that the changefeed can be resumed later. It is then possible that an operator does not notice the job has been paused for too long, causing a buildup of garbage data.

Excessive buildup of GC work is not great: it impacts overall cluster performance, and, once GC can resume, its cost is proportional to how much GC work needs to be done. This PR introduces a new changefeed option, `gc_protect_expires_after`, to automatically expire PTS records that are too old. This automatic expiration is a safety mechanism for the case where a changefeed job is paused by an operator or due to an error while holding onto a PTS record because of the `protect_gc_on_pause` option. The operator is still expected to monitor changefeed jobs and to restart paused changefeeds expediently. If the changefeed job remains paused and the underlying PTS record expires, then the changefeed job will be canceled to prevent a buildup of GC data.

Epic: CRDB-21953
This PR does not add expiration to the job itself, as requested by cockroachdb#84598, but it does accomplish the same goal.

Release note (enterprise change): Changefeed will automatically expire PTS records for paused jobs if the changefeed is configured with the `gc_protect_expires_after` option.
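
A usage sketch for the option described in the message above (the table name and sink URI are placeholders, and the duration is illustrative):

```sql
-- Protect data from GC while the changefeed is paused, but let that
-- protection expire (and the job be canceled) if the pause exceeds 24 hours,
-- rather than holding back GC indefinitely.
CREATE CHANGEFEED FOR TABLE defaultdb.public.orders
  INTO 'kafka://broker:9092'
  WITH gc_protect_expires_after = '24h';
```
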
craig bot pushed a commit that referenced this issue Feb 24, 2023
97148: changefeedccl: Expire protected timestamps r=miretskiy a=miretskiy

97539: kvserver: fix deadlock on rebalance obj change r=kvoli a=kvoli

Previously, changing the rebalance objective could lead to an inconsistent locking order between the load-based splitter and the rebalance objective. When the objective was updated, the previous method also blocked batch requests from completing until every replica's load-based splitter was reset.

This commit moves the split objective to be a variable owned by the decider, rather than inferred on each decider operation. The split objective is updated on a rebalance objective change atomically over each replica, but not atomically over a store. This removes the need to block batch requests until every replica is updated.

Resolves: #97000
Resolves: #97445
Resolves: #97450
Resolves: #97452
Resolves: #97457

Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
miretskiy (Contributor) commented:

I'm going to close this issue as complete per the above PR.

irfansharif (Contributor) commented:

@miretskiy, @dt: am I correct in reading #97148 as only applying to protected timestamps left by changefeeds? But we also have backups that can be paused, and the motivating incidents above happened due to PTS records left behind when backups were paused. I'm going to re-open.

irfansharif reopened this Feb 24, 2023
miretskiy (Contributor) commented:

Sure; but we will not be working on other job implementations... Perhaps we can move this to DR...
