@rsoaresd rsoaresd commented Oct 7, 2025

Description

Lately, we are often hitting this flaky test:

space.go:108: 
        	Error Trace:	/go/src/github.com/codeready-toolchain/toolchain-e2e/testsupport/space/space.go:108
        	            				/go/src/github.com/codeready-toolchain/toolchain-e2e/test/e2e/space_autocompletion_test.go:73
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestAutomaticClusterAssignment/set_low_max_number_of_spaces_and_expect_that_space_won't_be_provisioned_but_added_on_waiting_list/increment_the_max_number_of_spaces_and_expect_that_first_space_will_be_provisioned.

The space count mismatch seems to arise between the "setup migration" and "verify migration" steps. There are two possible causes:

  • Premature Operator Shutdown
    We kill the operators too early (at the end of the migration setup), so the host operator either doesn't decrease the counter properly or doesn't store the decreased value in the ToolchainStatus.

  • No Counter Reconciliation on Startup
    We don't recount the Spaces when the host operator starts in e2e tests.
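
The two suspected causes above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `status` type and the `restart` function are hypothetical and do not reflect the actual host-operator code. It shows how a counter that was persisted before a decrement stays stale after a restart unless the Spaces are recounted.

```go
package main

import "fmt"

// status is a hypothetical stand-in for the counter persisted in ToolchainStatus.
type status struct{ SpaceCount int }

// restart models an operator start (assumption, not the real implementation):
// it either trusts the persisted counter or recounts the Spaces that exist.
func restart(persisted status, existingSpaces []string, recount bool) int {
	if recount {
		return len(existingSpaces)
	}
	return persisted.SpaceCount
}

// driftAfterEarlyShutdown models the case where one of two Spaces was deleted,
// but the operator was stopped before the decremented value was persisted.
func driftAfterEarlyShutdown() (trusted, recounted int) {
	persisted := status{SpaceCount: 2}
	existing := []string{"space-a"}
	return restart(persisted, existing, false), restart(persisted, existing, true)
}

func main() {
	trusted, recounted := driftAfterEarlyShutdown()
	fmt.Println("trusting persisted value:", trusted)
	fmt.Println("recounting on startup:", recounted)
}
```

In this model, either waiting for the decrement to be persisted before shutdown or recounting on startup would remove the mismatch.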

Slack thread

https://redhat-internal.slack.com/archives/CHK0J6HT6/p1759830691610259

Paired PR

codeready-toolchain/host-operator#1210

Issue ticket number and link

SANDBOX-1437

Summary by CodeRabbit

  • Tests
    • Increased test coverage for deactivated and banned user flows by adding explicit checks that user spaces and related bindings are fully removed.
    • Replaced informal waiting notes with definitive assertions and waits to reduce flakiness and improve reliability of migration/cleanup verifications.
    • Enhances confidence in automatic account-space cleanup and platform status reporting without changing runtime behavior.

@openshift-ci openshift-ci bot requested review from jrosental and metlos October 7, 2025 15:37
@openshift-ci openshift-ci bot added the approved label Oct 7, 2025
coderabbitai bot commented Oct 7, 2025

Walkthrough

Expanded migration test assertions in test/migration/setup_runner.go to explicitly wait for and verify that Space and SpaceBinding resources are deleted after MasterUserRecord removal in both user deactivation and banning flows.

Changes

Cohort / File(s): Migration tests: space deletion assertions (test/migration/setup_runner.go)
Change summary: After verifying MasterUserRecord deletion in the deactivation and banning flows, added explicit WaitUntilSpaceAndSpaceBindingsDeleted checks and require.NoError assertions for space deletion; removed an inline TODO and separated the space deletion verification from the MUR/bindings check.

Sequence Diagram(s)

sequenceDiagram
  participant Test as Migration Test
  participant MUR as MasterUserRecord
  participant Space as Space
  participant SB as SpaceBinding

  Note over Test,MUR: prepareDeactivatedUser / prepareBannedUser flow
  Test->>MUR: verify MUR deletion
  alt MUR deleted
    Test->>Space: WaitUntilSpaceAndSpaceBindingsDeleted
    Space-->>Test: confirmed deleted
    Test->>SB: confirmed bindings deleted
  else MUR still present
    MUR-->>Test: still exists (retry/wait)
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Areas to check: test/migration/setup_runner.go change locations where waits/assertions were added, and any test timeouts/retry behavior.

Suggested reviewers

  • jrosental
  • MatousJobanek
  • rajivnathan

Poem

I hop through tests with whiskers twitching bright,
Spaces and bindings vanish out of sight.
I wait and I nudge until deletions are done,
Then nibble a carrot — another test won. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name: Title check
Status: ❓ Inconclusive
Explanation: The pull request title is 'test: fix space counter', which refers to fixing a space counter issue. However, the actual changes in the PR focus on modifying test setup code in setup_runner.go to add explicit verification that spaces and space bindings are properly deleted during user preparation phases. The title is vague and does not accurately convey the specific nature of the code changes, which involve restructuring space deletion assertions and removing TODO comments rather than directly 'fixing a space counter'.
Resolution: Consider revising the title to more accurately reflect the actual changes, such as 'test: add explicit space deletion verification in user setup' or 'test: remove space deletion TODO and add explicit assertions'. This would make the title more specific and descriptive of what the changeset actually accomplishes.
✅ Passed checks (1 passed)
Check name: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.
📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ddc307 and dc9ff99.

📒 Files selected for processing (1)
  • test/migration/setup_runner.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/migration/setup_runner.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Unit Tests
  • GitHub Check: Build & push Developer Sandbox UI image for UI e2e tests
  • GitHub Check: GolangCI Lint
  • GitHub Check: Build & push operator bundles for e2e tests

rsoaresd commented Oct 7, 2025

/retest

flaky test

rsoaresd commented Oct 8, 2025

/retest

branch on host side not updated with master

@metlos metlos left a comment

Great find! I think this will improve the situation but I'm not sure it is a 100% bulletproof solution, see the comments below.

I'm approving anyway, because this will make the situation better for sure.

require.NoError(t, err)

// let's wait until ToolchainStatus is updated with the latest numbers from the space counter
_, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now()))
Contributor

Can this potentially race as well? E.g. if the space counter reacts quickly enough before this line is hit, we could theoretically time out here.

Contributor

+1

We could capture the time before deactivating the user and use that instead of time.Now()

Contributor

If anything is happening in the cluster in parallel to this test, we could end up exiting too early, though. So care needs to be taken here.
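
The trade-off between the two cutoff choices discussed here can be sketched with a fixed timeline. The `refresh` type, the timings, and the counts are hypothetical, and the sketch assumes that a status refresh after the Space deletion carries the decremented count. With the cutoff captured before the deactivation, a refresh that still reports the old count already satisfies the check; with the cutoff captured after the deletion, only the later refresh does.

```go
package main

import (
	"fmt"
	"time"
)

// refresh is a hypothetical record of one ToolchainStatus update.
type refresh struct {
	At         time.Time
	SpaceCount int
}

// firstMatch returns the first refresh that happened strictly after cutoff.
func firstMatch(refreshes []refresh, cutoff time.Time) (refresh, bool) {
	for _, r := range refreshes {
		if cutoff.Before(r.At) {
			return r, true
		}
	}
	return refresh{}, false
}

// compareCutoffs returns the space count seen by the first matching refresh
// for a cutoff taken before the deactivation and one taken after the deletion.
func compareCutoffs() (earlyCount, lateCount int) {
	beforeDeactivation := time.Date(2025, 10, 7, 12, 0, 0, 0, time.UTC)
	spaceDeleted := beforeDeactivation.Add(700 * time.Millisecond)
	refreshes := []refresh{
		{At: beforeDeactivation.Add(200 * time.Millisecond), SpaceCount: 1},  // Space still exists
		{At: beforeDeactivation.Add(1200 * time.Millisecond), SpaceCount: 0}, // after deletion
	}
	early, _ := firstMatch(refreshes, beforeDeactivation)
	late, _ := firstMatch(refreshes, spaceDeleted)
	return early.SpaceCount, late.SpaceCount
}

func main() {
	early, late := compareCutoffs()
	fmt.Println("cutoff before deactivation sees count:", early)
	fmt.Println("cutoff after deletion sees count:", late)
}
```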

Collaborator

I don't follow what you mean by the timeout. The ToolchainStatus is updated every second in e2e tests


which is exactly what we check in the UntilToolchainStatusUpdatedAfter function:
// UntilToolchainStatusUpdated returns a `ToolchainStatusWaitCriterion` which checks that the
// ToolchainStatus ready condition was updated after the given time
func UntilToolchainStatusUpdatedAfter(t time.Time) ToolchainStatusWaitCriterion {
	return ToolchainStatusWaitCriterion{
		Match: func(actual *toolchainv1alpha1.ToolchainStatus) bool {
			cond, found := condition.FindConditionByType(actual.Status.Conditions, toolchainv1alpha1.ConditionReady)
			return found && t.Before(cond.LastUpdatedTime.Time)
		},

When the ToolchainStatus is updated, then it also syncs with the counters, so we can be sure that the content of the ToolchainStatus is up-to-date to whatever happened before this line.
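
The criterion described above can be sketched without the toolchain dependencies. The `condition` type and the `findCondition` and `updatedAfter` helpers below are simplified, hypothetical stand-ins for the real API types; the sketch only shows that a Ready condition refreshed before the cutoff does not match, while one refreshed after it does.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a simplified stand-in for a ToolchainStatus condition.
type condition struct {
	Type            string
	LastUpdatedTime time.Time
}

// findCondition returns the condition of the given type, if present.
func findCondition(conds []condition, condType string) (condition, bool) {
	for _, c := range conds {
		if c.Type == condType {
			return c, true
		}
	}
	return condition{}, false
}

// updatedAfter reports whether the Ready condition was refreshed strictly after cutoff.
func updatedAfter(conds []condition, cutoff time.Time) bool {
	c, found := findCondition(conds, "Ready")
	return found && cutoff.Before(c.LastUpdatedTime)
}

// staleThenFresh evaluates the criterion against a status refreshed
// 500ms before the cutoff and one refreshed 500ms after it.
func staleThenFresh() (stale, fresh bool) {
	cutoff := time.Date(2025, 10, 7, 12, 0, 0, 0, time.UTC)
	old := []condition{{Type: "Ready", LastUpdatedTime: cutoff.Add(-500 * time.Millisecond)}}
	refreshed := []condition{{Type: "Ready", LastUpdatedTime: cutoff.Add(500 * time.Millisecond)}}
	return updatedAfter(old, cutoff), updatedAfter(refreshed, cutoff)
}

func main() {
	stale, fresh := staleThenFresh()
	fmt.Println("refreshed before cutoff matches:", stale)
	fmt.Println("refreshed after cutoff matches:", fresh)
}
```

Because the status is said to refresh every second in e2e tests, a wait on this criterion should be satisfied by the next periodic refresh rather than depending on a single event.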

Collaborator

I've just noticed that we already do the wait & check of the ToolchainStatus at the end of the setup runner here:

// wait until the ToolchainStatus is updated to make sure that all counters are in sync
_, err := r.Awaitilities.Host().WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now()))
require.NoError(t, err)

so I guess that we don't need to add the extra ones - only waiting until Space is being deleted should be sufficient.

Contributor Author

I agree that we don't need to add extra ones (it would be redundant)

Contributor Author

Should we capture the time before all the operations? This way, we would look for an update that happened during or after the operations, rather than only after them.

cc: @MatousJobanek @metlos @mfrancisc

Collaborator

not needed, waiting until Space is gone and then for the update of the ToolchainStatus is completely sufficient

Contributor Author

Deal, thanks!
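
The ordering agreed on in this thread (first wait until the Space is gone, then wait for a ToolchainStatus update) can be sketched with a deterministic in-memory fake. Everything here is hypothetical: `fakeHost`, its logical clock, and the `poll` helper are illustration only and are not part of the toolchain-e2e test support. The sketch assumes that each status refresh copies the current space count.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeHost is a hypothetical stand-in for the host cluster, driven by a logical clock.
type fakeHost struct {
	tick          int // logical time, advanced by step
	spaceGoneAt   int // tick at which the Space deletion completes
	refreshEvery  int // the status is refreshed every refreshEvery ticks
	lastRefresh   int // tick of the most recent status refresh
	actualCount   int // Spaces that really exist
	reportedCount int // count stored in the status at the last refresh
}

// step advances the clock, completes the deletion when due, and refreshes the status periodically.
func (h *fakeHost) step() {
	h.tick++
	if h.tick == h.spaceGoneAt {
		h.actualCount--
	}
	if h.tick%h.refreshEvery == 0 {
		h.lastRefresh = h.tick
		h.reportedCount = h.actualCount
	}
}

// poll advances the fake until done returns true or the attempts run out.
func poll(h *fakeHost, attempts int, done func() bool) error {
	for i := 0; i < attempts; i++ {
		if done() {
			return nil
		}
		h.step()
	}
	return errors.New("timed out")
}

// setupThenWait waits for the Space to disappear, takes the cutoff only then,
// and waits for a status refresh after that cutoff.
func setupThenWait() (int, error) {
	h := &fakeHost{spaceGoneAt: 4, refreshEvery: 3, actualCount: 1, reportedCount: 1}
	if err := poll(h, 20, func() bool { return h.actualCount == 0 }); err != nil {
		return 0, err
	}
	cutoff := h.tick
	if err := poll(h, 20, func() bool { return h.lastRefresh > cutoff }); err != nil {
		return 0, err
	}
	return h.reportedCount, nil
}

func main() {
	count, err := setupThenWait()
	fmt.Println(count, err)
}
```

In this fake, the refresh at tick 3 still reports one Space, but it is ignored because the cutoff is taken at tick 4, after the deletion; the refresh at tick 6 is the first one accepted.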

require.NoError(t, err)

// let's wait until ToolchainStatus is updated with the latest numbers from the space counter
_, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now()))
Contributor

Same as above, I think this can potentially race. Is there a way of knowing e.g. the specific number of spaces that we should expect?

@mfrancisc mfrancisc left a comment

Looks good!

Just a minor caveat about the timing issue that was mentioned by @metlos, but we can address that as a follow-up in case we see tests timing out.

require.NoError(t, err)

// let's wait until ToolchainStatus is updated with the latest numbers from the space counter
_, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now()))
Contributor

+1

We could capture the time before deactivating the user and use that instead of time.Now()

@MatousJobanek MatousJobanek left a comment

awesome, thanks a lot 🚀

openshift-ci bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeykazakov, MatousJobanek, metlos, mfrancisc, rsoaresd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [MatousJobanek,alexeykazakov,metlos,mfrancisc,rsoaresd]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rsoaresd commented Nov 5, 2025

/retest

merge conflict with the pairing
