Skip to content

Conversation

spilchen
Copy link
Contributor

@spilchen spilchen commented Aug 6, 2025

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: bug fix that was hit by a customer

Replace calculatePlanGrowth with detectNodeAvailabilityChanges to make
TTL job replanning less sensitive to span changes. The new logic focuses
specifically on detecting when nodes become unavailable rather than
reacting to all plan differences.

The previous implementation would trigger replans for span splits/merges
that don't actually indicate beneficial restart scenarios. The new
approach only considers missing nodes from the original plan, which
typically indicates node failures where work redistribution would benefit
from restarting the job. It also supports a stability window so that
replan decisions need to fire consecutively. This should help eleviate
changes in plans due to range cache issues.

Fixes cockroachdb#150343

Epic: none
Release note (ops change): The 'sql.ttl.replan_flow_threshold' may
have been set to 0 to work around the TTL replanner being too sensitive.
This fix will alleviate that and any instance that had set
replan_flow_threshold to 0 can be reset back to the default.
The TTL restart test was experiencing flakiness due to the default
stability window causing delays in replanning when nodes changed. The
test would wait for TTL progress across all nodes but the replanning
logic wouldn't trigger immediately when nodes were restarted. This
change disables the stability window.

This also fixes a bug in the logic that checks if the TTL job is
progressing. It would look for key removal across all ranges over time.
The existing check repeatedly change the baseline. We now save that the
baseline and compare it with each check.

Release note: None

Epic: None

Closes cockroachdb#151011
The ttl_restart roachtest was flaky due to its reliance on having one
lease per node. It attempted to enforce this distribution by relocating
leases before starting the TTL job. However, this setup was not always
effective, and the resulting imbalance sometimes caused the test to
fail.

This change improves the test's resilience by checking the lease
distribution after the TTL job has started running. If the TTL job does
not have one lease per node at that point, the test logs an explanatory
message and exits early, treating the run as successful.

Fixes cockroachdb#151112
Fixes cockroachdb#151113

Release note: None
Epic: none
@spilchen spilchen self-assigned this Aug 6, 2025
@spilchen spilchen requested a review from a team as a code owner August 6, 2025 18:31
Copy link

blathers-crl bot commented Aug 6, 2025

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes. Includes test-only changes, build system changes, etc.
  • Fixes for serious issues. Defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to.

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Aug 6, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link

blathers-crl bot commented Aug 6, 2025

✅ PR #151490 is compliant with backport policy

Confidence: high
Critical bug criteria met: [Bugs that can cause the DB to return incorrect results or result in suboptimal performance]
Feature flag detected: Yes
Backward compatible: true
Explanation: The PR includes changes in both production and non-production files. The production changes within the pkg/sql/ttl/ttljob directory involve improvements to the TTL replan decision logic, which can be considered a critical bug fix since it directly impacts the performance and correctness of TTL operations in production environments. The introduction of settings like replanStabilityWindow and modifications to the job resumption logic to better handle node changes are critical for ensuring the reliability and performance of TTL processes. These changes do not introduce backward compatibility issues or remove any version gates, preserving existing functionality while enhancing performance.

The non-production file, pkg/cmd/roachtest/tests/ttl_restart.go, is also modified, focusing on enhancing test reliability and robustness, which does not impact production code and aligns with exceptions for test-only changes. The changes in the test files support the nature of the modifications in the production code, ensuring thorough testing of new functionality. Additionally, the PR correctly includes a release justification indicating it contains necessary bug fixes.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@spilchen spilchen merged commit b86ab1e into cockroachdb:release-24.1 Aug 7, 2025
16 checks passed
@spilchen spilchen deleted the backport24.1-150771-151063-151323 branch August 7, 2025 00:23
@crl-codesys-jira crl-codesys-jira added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Oct 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) v24.1.24
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants