release-24.1: sql/ttl: improve TTL replan decision logic #151490
Replace calculatePlanGrowth with detectNodeAvailabilityChanges to make TTL job replanning less sensitive to span changes. The new logic focuses specifically on detecting when nodes become unavailable rather than reacting to all plan differences.

The previous implementation would trigger replans for span splits and merges, which do not indicate that a restart would be beneficial. The new approach only considers nodes from the original plan that are missing from the new plan, which typically indicates node failures where work redistribution would benefit from restarting the job.

It also supports a stability window, so a replan is triggered only after the replan condition has been met on consecutive checks. This should help alleviate spurious plan changes caused by range cache issues.

Fixes cockroachdb#150343
Epic: none
Release note (ops change): The 'sql.ttl.replan_flow_threshold' setting may have been set to 0 to work around the TTL replanner being too sensitive. This fix alleviates that sensitivity, and any instance that had set replan_flow_threshold to 0 can be reset to the default.
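The decision logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `NodeID`, `missingNodeFraction`, and `replanDecider` are hypothetical names, and treating a threshold of 0 as "replanning disabled" is an assumption drawn from the release note.

```go
package main

// NodeID is a hypothetical stand-in for a node (SQL instance) identifier.
type NodeID int

// missingNodeFraction returns the fraction of nodes in the original plan that
// no longer appear in the current plan. Nodes that appear only in the current
// plan are ignored, and span changes are not considered at all.
// The original plan is assumed to list each node once.
func missingNodeFraction(original, current []NodeID) float64 {
	if len(original) == 0 {
		return 0
	}
	present := make(map[NodeID]struct{}, len(current))
	for _, n := range current {
		present[n] = struct{}{}
	}
	missing := 0
	for _, n := range original {
		if _, ok := present[n]; !ok {
			missing++
		}
	}
	return float64(missing) / float64(len(original))
}

// replanDecider fires only after the threshold has been met on `window`
// consecutive checks (the stability window).
type replanDecider struct {
	threshold   float64 // assumption: 0 disables replanning
	window      int     // number of consecutive checks required
	consecutive int     // current streak of checks that met the threshold
}

// shouldReplan records one check and reports whether a replan should start.
func (d *replanDecider) shouldReplan(original, current []NodeID) bool {
	if d.threshold <= 0 {
		return false
	}
	exceeded := missingNodeFraction(original, current) >= d.threshold
	if exceeded {
		d.consecutive++
	} else {
		// Any check that does not meet the threshold resets the streak.
		d.consecutive = 0
	}
	return exceeded && d.consecutive >= d.window
}
```

With a window of 2, a single check that sees a missing node does not trigger a replan; the node must still be missing on the next check. A window of 1 (or less) makes the decision fire on the first check that meets the threshold.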
The TTL restart test was flaky because the default stability window delayed replanning when nodes changed. The test would wait for TTL progress across all nodes, but the replanning logic would not trigger immediately when nodes were restarted. This change disables the stability window in the test.

This also fixes a bug in the logic that checks whether the TTL job is progressing. That logic looks for key removal across all ranges over time, but the existing check repeatedly changed the baseline. We now save the baseline once and compare against it on each check.

Release note: None
Epic: None
Closes cockroachdb#151011
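The baseline fix can be illustrated with a small sketch. The type and method names here are hypothetical and the per-range key counts are assumed to come from some external query; the point is only that the baseline is captured once and never overwritten.

```go
package main

// progressChecker is a hypothetical helper that tracks whether keys have been
// removed from every range since the first observation.
type progressChecker struct {
	baseline map[int]int // range ID -> key count at the first observation
}

// allRangesProgressed saves the first snapshot as the baseline and, on later
// calls, reports whether every baseline range now holds fewer keys than it
// did in that snapshot. The baseline is never replaced by later snapshots.
func (p *progressChecker) allRangesProgressed(counts map[int]int) bool {
	if p.baseline == nil {
		p.baseline = make(map[int]int, len(counts))
		for id, c := range counts {
			p.baseline[id] = c
		}
		return false
	}
	for id, base := range p.baseline {
		cur, ok := counts[id]
		if !ok || cur >= base {
			return false
		}
	}
	return len(p.baseline) > 0
}
```

If the baseline were replaced on every call, a range that lost keys between the first and second checks but not between the second and third would be reported as stalled, even though it made progress overall.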
The ttl_restart roachtest was flaky due to its reliance on having one lease per node. It attempted to enforce this distribution by relocating leases before starting the TTL job. However, this setup was not always effective, and the resulting imbalance sometimes caused the test to fail.

This change improves the test's resilience by checking the lease distribution after the TTL job has started running. If the TTL job does not have one lease per node at that point, the test logs an explanatory message and exits early, treating the run as successful.

Fixes cockroachdb#151112
Fixes cockroachdb#151113
Release note: None
Epic: none
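The precondition check described above amounts to something like the following sketch. `hasOneLeasePerNode` is a hypothetical helper, and node IDs are modeled as plain integers; how the test actually gathers leaseholders is not shown in this description.

```go
package main

// hasOneLeasePerNode reports whether every node in nodes holds exactly one
// lease, given the leaseholder node ID of each range in the table.
func hasOneLeasePerNode(leaseholders []int, nodes []int) bool {
	if len(leaseholders) != len(nodes) {
		return false
	}
	counts := make(map[int]int, len(nodes))
	for _, n := range leaseholders {
		counts[n]++
	}
	for _, n := range nodes {
		if counts[n] != 1 {
			return false
		}
	}
	return true
}
```

A test using this check would log a message and return early without failing when it returns false, since the rest of the test assumes the balanced distribution.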
Thanks for opening a backport. Before merging, please confirm that it falls into one of the following categories (select one):
Add a brief release justification to the PR description explaining your selection. Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy. All backports must be reviewed by the TL and EM for the owning area.
✅ PR #151490 is compliant with backport policy. Confidence: high.
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release
Release justification: bug fix that was hit by a customer