Skip to content

backport: v2 add consistency checks across core_cluster_members, truststore, and dqlite (#515)#602

Merged
roosterfish merged 4 commits intocanonical:v2from
louiseschmidtgen:KU-5062/backport-member-check-v2
Feb 3, 2026
Merged

backport: v2 add consistency checks across core_cluster_members, truststore, and dqlite (#515)#602
roosterfish merged 4 commits intocanonical:v2from
louiseschmidtgen:KU-5062/backport-member-check-v2

Conversation

@louiseschmidtgen
Copy link
Contributor

Backport

This PR backports #515

Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>

(cherry picked from commit 2bd94bb)
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
(cherry picked from commit a0676a3)
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>

(cherry picked from commit ae40a37)
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>

(cherry picked from commit e69343e)
Copilot AI review requested due to automatic review settings January 28, 2026 14:47
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR backports consistency checks that prevent critical cluster operations when core_cluster_members (database), truststore, and dqlite cluster configuration are out of sync. The change adds validation before joins, removals, and token generation to ensure cluster health and prevent partial failure scenarios.

Changes:

  • Added CheckMembershipConsistency() method that validates consistency across three membership sources with automatic retry logic
  • Integrated consistency checks before join operations, member removals (unless forced), and token generation
  • Enhanced test infrastructure with membership consistency validation tests and parallel join scenarios

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
internal/state/state.go Implements core consistency checking logic with retry mechanism for handling transient inconsistencies during concurrent operations
internal/rest/resources/cluster.go Adds consistency checks before join and remove operations, with forced removal bypassing checks
internal/rest/resources/tokens.go Adds consistency check before token generation to prevent issuing tokens when cluster is in inconsistent state
example/test/main.sh Adds tests for membership consistency validation and parallel join scenarios, plus improved shutdown process safety

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +251 to +269
for {
coreClusterMembers, truststoreRemotes, dqliteNodes, err := s.getMembershipData(ctx)
if err != nil {
return fmt.Errorf("Failed to gather membership data for consistency check: %w", err)
}

err = s.checkMembershipConsistency(coreClusterMembers, truststoreRemotes, dqliteNodes)
if err != nil {
select {
case <-ctx.Done():
return fmt.Errorf("Membership consistency check failed after timeout: %w", err)
case <-time.After(200 * time.Millisecond):
continue
}
}

return nil
}
}
Copy link

Copilot AI Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry loop should check if the context has been cancelled before attempting to fetch membership data. Currently, if the context times out during the sleep, the next iteration will call getMembershipData with an expired context, which will fail with a generic context error instead of providing the more informative consistency check error message. Consider checking ctx.Done() at the start of the loop to ensure the timeout error message from line 261 is properly used.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@louiseschmidtgen something we could fix also in v3? I know v3 has an extra commit that doesn't anymore set the 30s deadline. But you could still get the generic context deadline exceeded error if it expires whilst calling getMembershipData.

Copy link
Contributor

@roosterfish roosterfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Please see my comment, but if you agree we can do this in a follow up PR.

Comment on lines +251 to +269
for {
coreClusterMembers, truststoreRemotes, dqliteNodes, err := s.getMembershipData(ctx)
if err != nil {
return fmt.Errorf("Failed to gather membership data for consistency check: %w", err)
}

err = s.checkMembershipConsistency(coreClusterMembers, truststoreRemotes, dqliteNodes)
if err != nil {
select {
case <-ctx.Done():
return fmt.Errorf("Membership consistency check failed after timeout: %w", err)
case <-time.After(200 * time.Millisecond):
continue
}
}

return nil
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@louiseschmidtgen something we could fix also in v3? I know v3 has an extra commit that doesn't anymore set the 30s deadline. But you could still get the generic context deadline exceeded error if it expires whilst calling getMembershipData.

@roosterfish
Copy link
Contributor

I see the CLA check failing also on other repos, should not be related to this particular PR.

@roosterfish roosterfish merged commit 38c2bb7 into canonical:v2 Feb 3, 2026
10 of 12 checks passed
@louiseschmidtgen louiseschmidtgen deleted the KU-5062/backport-member-check-v2 branch February 3, 2026 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants