jobs: add expiration for paused jobs #84598
Comments
Discussion: 10 days instead of 3
@amruss don't we recommend people do full backups once a day and incrementals throughout the day? In that case, if I am going to do a full backup in less than a day, why would we want a busted incremental to linger for 3 or even 10 days?
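As a concrete reference for the cadence mentioned above, here is a minimal sketch of such a schedule: a daily full backup with hourly incrementals. The label and the `external://` storage URI are placeholders, not taken from this thread.

```sql
-- Hypothetical scheduled backup: hourly incrementals on top of a daily full.
-- A failed incremental here is superseded by the next full within a day,
-- which is the premise of the question above.
CREATE SCHEDULE hourly_backup
  FOR BACKUP INTO 'external://backup_storage'
  RECURRING '@hourly'
  FULL BACKUP '@daily';
```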
I find the suggestion of 3d already somewhat surprising, given these PTS records are left around on the error path -- an accrual of up to 3d of MVCC garbage for the entire keyspan (full backups, for example) can be debilitating storage-wise, with effects on read latencies, etc. We can improve MVCC garbage observability and admission control for this work, and that's all good, but I still question anything longer than single-digit hours here. Certainly there will be users wanting to accommodate an incident response time of >= 3d, which configurability could let them do; I'm just questioning whether it should be made the default. +cc @williamkulju, @mwang1026.
If we had some way of making PTS records dramatically cheaper WRT holding up garbage collection, say by doing something like #86697, then perhaps a multi-day expiration time would be palatable; the data being kept around would be O(live-bytes). But today it's O(live-bytes * time), a compounding we should consider in the defaults we ship with. +cc @nvanbenschoten.
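To make the O(live-bytes * time) compounding concrete, a back-of-envelope calculation with an assumed overwrite rate W across the protected keyspan (the 10 MB/s figure is illustrative, not from this thread):

$$
\text{garbage} \approx W \cdot t, \qquad \text{e.g. } 10\ \text{MB/s} \times 3\ \text{days} = 10\ \text{MB/s} \times 259{,}200\ \text{s} \approx 2.6\ \text{TB}
$$

of un-collectable MVCC garbage pinned by a single stale PTS record, on top of the live data.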
Expiring a paused job is not intended to be the defense against MVCC garbage accumulation due to PTS records. Note that garbage can accumulate for a variety of reasons other than a paused job (e.g. a non-paused job that is running slowly or is stuck, or not even a job at all but a backup schedule that hasn't completed yet), so the answer to catching and fixing garbage accumulation is better KV observability tools: garbage-bytes-percentage alerts, displaying garbage amounts on the databases and tables console pages, etc. Expiring paused jobs after 10 or 3 days is just there to catch something that is forgotten, so it doesn't linger indefinitely, i.e. across major versions.
Moving this back to triage to discuss w/ CDC team
SRE just discussed this. Can we execute on this ticket ~now and include it in a patch update, with a cluster setting (disabled by default if needed) that we can flip on in CC immediately? The pause-job-if-backup-fails behavior implies a lot of regular operational work (read: toil) for SRE to keep CC backups from holding up GC and spending lots of disk unnecessarily. Re: my suggestion to fix this ~now, in CC at least: I think landing pause-instead-of-fail in 22.1 without also landing expiration for paused jobs should be considered a regression.
I'm not exactly sure what a patch-release version of this would look like, since we don't track when a job was paused right now, so existing paused jobs don't reflect how long they've been paused. We'll need to start writing a new field at pause time to use for this, and any jobs that predate it will likely just be left alone?
There is nothing in the system that at least indirectly keeps track of when a job went from running to paused? Isn't there a heartbeated-at column, or something like that?
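For reference, a minimal sketch of what an operator can see today, assuming the current `SHOW JOBS` columns: there is a `status` and `created`/`modified` timestamps, but no paused-at time, so a query like the following is the closest approximation.

```sql
-- List paused jobs; `created` (and `modified`, if selected) are only
-- proxies for pause time, since no paused-at timestamp is recorded.
SELECT job_id, job_type, status, created
FROM [SHOW JOBS]
WHERE status = 'paused';
```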
I'm not sure I will wind up adding job expiration, but I will add protected timestamp expiration.
Changefeeds utilize the protected timestamp system (PTS) to ensure that the data targeted by a changefeed is not garbage collected prematurely. The PTS record is managed by the running changefeed, which periodically advances the record's timestamp so that data older than that timestamp may be GCed. However, if the changefeed stops running because it is paused (either due to operator action or due to the `on_error=pause` option), the PTS record remains so that the changefeed can be resumed at a later time. It is also possible that the operator may not notice that the job has been paused for too long, causing a buildup of garbage data. Excessive buildup of GC work is not great since it impacts overall cluster performance, and, once GC can resume, its cost is proportional to how much GC work needs to be done.

This PR introduces a new setting, `changefeed.protect_timestamp.expire_after`, to automatically expire PTS records that are too old. This automatic expiration is a safety mechanism, and the setting's default is rather conservative. The operator is still expected to monitor changefeed jobs and to restart paused changefeeds promptly. If the changefeed job remains paused and the underlying PTS record is removed due to expiration, then the changefeed job will fail when it is restarted. This failure is preferable to unbounded buildup of garbage data in the cluster.

Epic: CRDB-21953

This PR does not add expiration to the job itself, as requested by cockroachdb#84598, but it does accomplish the same goal.

Release note (enterprise change): Changefeeds will automatically expire PTS records after `changefeed.protect_timestamp.expire_after` so that paused changefeed jobs do not cause GC data buildup in the cluster.
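Under the cluster-setting approach described in this revision, enabling the expiration would presumably look like the following; the setting name is taken from the PR text above, and the `72h` value is illustrative rather than a recommended default.

```sql
-- Illustrative: expire changefeed PTS records older than 72 hours,
-- cluster-wide. Setting name per the PR description above.
SET CLUSTER SETTING changefeed.protect_timestamp.expire_after = '72h';
```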
Changefeeds utilize the protected timestamp system (PTS) to ensure that the data targeted by a changefeed is not garbage collected prematurely. The PTS record is managed by the running changefeed, which periodically advances the record's timestamp so that data older than that timestamp may be GCed. However, if the changefeed stops running because it is paused (either due to operator action or due to the `on_error=pause` option), the PTS record remains so that the changefeed can be resumed at a later time. It is also possible that the operator may not notice that the job has been paused for too long, causing a buildup of garbage data. Excessive buildup of GC work is not great since it impacts overall cluster performance, and, once GC can resume, its cost is proportional to how much GC work needs to be done.

This PR introduces a new changefeed option, `gc_protect_expires_after`, to automatically expire PTS records that are too old. This automatic expiration is a safety mechanism for the case where a changefeed job is paused by an operator or due to an error while holding onto a PTS record because of the `protect_gc_on_pause` option. The operator is still expected to monitor changefeed jobs and to restart paused changefeeds promptly. If the changefeed job remains paused and the underlying PTS record expires, then the changefeed job will be canceled to prevent buildup of GC data.

Epic: CRDB-21953

This PR does not add expiration to the job itself, as requested by cockroachdb#84598, but it does accomplish the same goal.

Release note (enterprise change): Changefeeds will automatically expire PTS records for paused jobs if the changefeed is configured with the `gc_protect_expires_after` option.
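With the per-changefeed option this revision settles on, usage would presumably look like the following; the table name and sink URI are placeholders.

```sql
-- Illustrative: protect data across pauses, but bound the protection's
-- lifetime so a forgotten paused changefeed cannot pin GC indefinitely.
CREATE CHANGEFEED FOR TABLE orders
  INTO 'kafka://broker:9092'
  WITH on_error = 'pause',
       protect_gc_on_pause,
       gc_protect_expires_after = '24h';
```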
97148: changefeedccl: Expire protected timestamps r=miretskiy a=miretskiy

(Same description as above.) Epic: [CRDB-21953](https://cockroachlabs.atlassian.net/browse/CRDB-21953) Informs #84598

Release note (enterprise change): Changefeeds will automatically expire PTS records for paused jobs if the changefeed is configured with the `gc_protect_expires_after` option.

97539: kvserver: fix deadlock on rebalance obj change r=kvoli a=kvoli

Previously, changing the rebalance objective could lead to an inconsistent locking order between the load-based splitter and the rebalance objective. When the objective was updated, the previous method also blocked batch requests from completing until every replica's load-based splitter was reset.

This commit moves the split objective to be a variable owned by the decider, rather than inferred on each decider operation. The split objective is updated on a rebalance objective change atomically for each replica, but not atomically over a store. This removes the need to block batch requests until every replica is updated.

Resolves: #97000
Resolves: #97445
Resolves: #97450
Resolves: #97452
Resolves: #97457

Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
I'm going to close this issue as complete per the above PR.
@miretskiy, @dt: am I correct in reading that #97148 only applies to protected timestamps left by changefeeds? But we also have backups that can be paused, and the motivating incidents above happened due to PTS records left behind when backups were paused. I'm going to re-open.
Sure; but we will not be working on other job implementations... Perhaps we can move this to DR...
Paused jobs can easily be forgotten about. Long-paused jobs can cause cluster issues because of the garbage they allow to accumulate. Additionally, supporting long-paused jobs through cluster upgrades is hard.
We should:
- Automatically cancel (expire) jobs that have been paused for longer than a threshold, defaulting to 3 days and configurable via a cluster setting.
Rationale behind 3 days: it gives you enough working-day time to have done something about the paused job.
Follow-up work:
Jira issue: CRDB-17756
Epic: CRDB-20422