backport: v2 add consistency checks across core_cluster_members, truststore, and dqlite (#515)#602
Conversation
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com> (cherry picked from commit 2bd94bb)
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com> (cherry picked from commit a0676a3)
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com> (cherry picked from commit ae40a37)
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com> (cherry picked from commit e69343e)
There was a problem hiding this comment.
Pull request overview
This PR backports consistency checks that prevent critical cluster operations when core_cluster_members (database), truststore, and dqlite cluster configuration are out of sync. The change adds validation before joins, removals, and token generation to ensure cluster health and prevent partial failure scenarios.
Changes:
- Added
CheckMembershipConsistency()method that validates consistency across three membership sources with automatic retry logic - Integrated consistency checks before join operations, member removals (unless forced), and token generation
- Enhanced test infrastructure with membership consistency validation tests and parallel join scenarios
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/state/state.go | Implements core consistency checking logic with retry mechanism for handling transient inconsistencies during concurrent operations |
| internal/rest/resources/cluster.go | Adds consistency checks before join and remove operations, with forced removal bypassing checks |
| internal/rest/resources/tokens.go | Adds consistency check before token generation to prevent issuing tokens when cluster is in inconsistent state |
| example/test/main.sh | Adds tests for membership consistency validation and parallel join scenarios, plus improved shutdown process safety |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for { | ||
| coreClusterMembers, truststoreRemotes, dqliteNodes, err := s.getMembershipData(ctx) | ||
| if err != nil { | ||
| return fmt.Errorf("Failed to gather membership data for consistency check: %w", err) | ||
| } | ||
|
|
||
| err = s.checkMembershipConsistency(coreClusterMembers, truststoreRemotes, dqliteNodes) | ||
| if err != nil { | ||
| select { | ||
| case <-ctx.Done(): | ||
| return fmt.Errorf("Membership consistency check failed after timeout: %w", err) | ||
| case <-time.After(200 * time.Millisecond): | ||
| continue | ||
| } | ||
| } | ||
|
|
||
| return nil | ||
| } | ||
| } |
There was a problem hiding this comment.
The retry loop should check if the context has been cancelled before attempting to fetch membership data. Currently, if the context times out during the sleep, the next iteration will call getMembershipData with an expired context, which will fail with a generic context error instead of providing the more informative consistency check error message. Consider checking ctx.Done() at the start of the loop to ensure the timeout error message from line 261 is properly used.
There was a problem hiding this comment.
@louiseschmidtgen something we could fix also in v3? I know v3 has an extra commit that doesn't anymore set the 30s deadline. But you could still get the generic context deadline exceeded error if it expires whilst calling getMembershipData.
roosterfish
left a comment
There was a problem hiding this comment.
Thanks! Please see my comment, but if you agree we can do this in a follow up PR.
| for { | ||
| coreClusterMembers, truststoreRemotes, dqliteNodes, err := s.getMembershipData(ctx) | ||
| if err != nil { | ||
| return fmt.Errorf("Failed to gather membership data for consistency check: %w", err) | ||
| } | ||
|
|
||
| err = s.checkMembershipConsistency(coreClusterMembers, truststoreRemotes, dqliteNodes) | ||
| if err != nil { | ||
| select { | ||
| case <-ctx.Done(): | ||
| return fmt.Errorf("Membership consistency check failed after timeout: %w", err) | ||
| case <-time.After(200 * time.Millisecond): | ||
| continue | ||
| } | ||
| } | ||
|
|
||
| return nil | ||
| } | ||
| } |
There was a problem hiding this comment.
@louiseschmidtgen something we could fix also in v3? I know v3 has an extra commit that doesn't anymore set the 30s deadline. But you could still get the generic context deadline exceeded error if it expires whilst calling getMembershipData.
|
I see the CLA check failing also on other repos, should not be related to this particular PR. |
Backport
This PR backports #515