-
Notifications
You must be signed in to change notification settings - Fork 78
test: fix space counter #1212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
test: fix space counter #1212
Conversation
WalkthroughExpanded migration test assertions in Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Migration Test
participant MUR as MasterUserRecord
participant Space as Space
participant SB as SpaceBinding
Note over Test,MUR: prepareDeactivatedUser / prepareBannedUser flow
Test->>MUR: verify MUR deletion
alt MUR deleted
Test->>Space: WaitUntilSpaceAndSpaceBindingsDeleted
Space-->>Test: confirmed deleted
Test->>SB: confirmed bindings deleted
else MUR still present
MUR-->>Test: still exists (retry/wait)
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/retest flaky test |
|
/retest branch on host side not updated with master |
metlos
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find! I think this will improve the situation but I'm not sure it is a 100% bulletproof solution, see the comments below.
I'm approving anyway, because this will make the situation better for sure..
test/migration/setup_runner.go
Outdated
| require.NoError(t, err) | ||
|
|
||
| // let's wait until ToolchainStatus is updated with the latest numbers from the space counter | ||
| _, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this potentially race as well? E.g. if the space counter can react quickly enough before this line is hit, we could theoretically timeout here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
We could capture the time before deactivating the user and use that instead of time.Now()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there's anything happening in parallel to this test in the cluster, we could end up exiting too early though. So care needs to be taken here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow what you mean by the timeout. The ToolchainStatus is updated every second in e2e tests
| toolchainStatusRefreshTime: '1s' |
which is exactly what we check in the
UntilToolchainStatusUpdatedAfter function: toolchain-e2e/testsupport/wait/host.go
Lines 1461 to 1468 in 6b150c2
| // UntilToolchainStatusUpdated returns a `ToolchainStatusWaitCriterion` which checks that the | |
| // ToolchainStatus ready condition was updated after the given time | |
| func UntilToolchainStatusUpdatedAfter(t time.Time) ToolchainStatusWaitCriterion { | |
| return ToolchainStatusWaitCriterion{ | |
| Match: func(actual *toolchainv1alpha1.ToolchainStatus) bool { | |
| cond, found := condition.FindConditionByType(actual.Status.Conditions, toolchainv1alpha1.ConditionReady) | |
| return found && t.Before(cond.LastUpdatedTime.Time) | |
| }, |
When the ToolchainStatus is updated, then it also syncs with the counters, so we can be sure that the content of the ToolchainStatus is up-to-date to whatever happened before this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just noticed that we already do the wait & check of the ToolchainStatus at the end of the setup runner here:
toolchain-e2e/test/migration/setup_runner.go
Lines 66 to 68 in 6b150c2
| // wait until the ToolchainStatus is updated to make sure that all counters are in sync | |
| _, err := r.Awaitilities.Host().WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now())) | |
| require.NoError(t, err) |
so I guess that we don't need to add the extra ones - only waiting until Space is being deleted should be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we don't need to add extra ones (it would be redundant)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we capture the time before all the operations? This way, we would look for an update that happened during or after all the operations, rather than after the operations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed, waiting until Space is gone and then for the update of the ToolchainStatus is completely sufficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deal, thanks!
test/migration/setup_runner.go
Outdated
| require.NoError(t, err) | ||
|
|
||
| // let's wait until ToolchainStatus is updated with the latest numbers from the space counter | ||
| _, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, I think this can potentially race. Is there a way of knowing e.g. the specific number of spaces that we should expect?
mfrancisc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Just a minor caveat about the timing issue that was mentioned by @metlos .
but we can address that as follow up in case we see tests timing out.
test/migration/setup_runner.go
Outdated
| require.NoError(t, err) | ||
|
|
||
| // let's wait until ToolchainStatus is updated with the latest numbers from the space counter | ||
| _, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
We could capture the time before deactivating the user and use that instead of time.Now()
MatousJobanek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome, thanks a lot 🚀
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alexeykazakov, MatousJobanek, metlos, mfrancisc, rsoaresd The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest merge conflict with the pairing |
|



Description
Lately, we are often hitting this flaky test:
The space count mismatch seems to happen between "setup migration" and "verify migration" steps. It can be happening for two reasons:
Premature Operator Shutdown
we simply kill the operators (at the end of the migration setup) too early, so it either doesn't properly decreased the counter or it doesn't store the decreased value to the ToolchainStatus
No Counter Reconciliation on Startup
we don't recount the Spaces at the start of the host operator in e2e tests
Slack thread
https://redhat-internal.slack.com/archives/CHK0J6HT6/p1759830691610259
Paired PR
codeready-toolchain/host-operator#1210
Issue ticket number and link
SANDBOX-1437
Summary by CodeRabbit