
roachtest: fix mixed cluster and node versions for tenant span stats #105183

Merged

Conversation

@THardy98 THardy98 (Contributor) commented Jun 20, 2023

Epic: None

This change ensures we don't make assertions when the cluster is in a
finalizing state (i.e. waiting for the cluster version to upgrade). We
cannot make any meaningful assertions in this state without introducing
flakiness into the test; as such, we skip this state. Additionally, this
allows us to simplify the test logic.

Release note: None

@THardy98 THardy98 requested a review from a team June 20, 2023 13:48
@THardy98 THardy98 requested a review from a team as a code owner June 20, 2023 13:48
@THardy98 THardy98 requested review from smg260 and renatolabs and removed request for a team June 20, 2023 13:48
@cockroach-teamcity (Member)

This change is Reviewable

@maryliag maryliag (Contributor) left a comment

:lgtm:

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs and @smg260)

@THardy98 (Contributor, Author)

Stress testing this for a while before merging. A few fixes have been made before; I hope to catch any potential failures before merging this.

Comment on lines 171 to 174
// An error is expected if:
// - the cluster version is <23.1.0
// - or cluster version >=23.1 and node versions <23.1.0.
if !cv.AtLeast(v231) || len(h.Context().FromVersionNodes) == 4 {
Collaborator

A few comments.

  • cluster version >=23.1 is misleading: in this test, the cluster version will be at most V23_1 by definition.
  • cluster version >=23.1 and node versions <23.1.0 -- not sure what you meant here, but the cluster version being in 23.1 while nodes are <23.1 is not possible. Upgrades only happen once all nodes are running the 23.1 binary, and once the upgrade begins, older binaries are no longer able to join.
  • There's a TOCTOU issue with regard to the cluster version: if this function is called while the upgrade is finalizing, there are no guarantees that the value you're using here still reflects the actual cluster version. I believe that if you run this test using a seed that leads to this function being called during finalization, and sleep for 3 minutes on line 164, the test will fail with high probability (i.e., there's a timing dependency).

I think the issue with the test is that it may flake if this hook is called during finalization. Does it make sense to even perform the error check in this case? I'm having a hard time thinking of a meaningful assertion we can write in that environment.

If you decide to skip, you can add the following to the top of this function:

if h.Context().Finalizing {
    return nil
}

Contributor Author

In this case, can the versioning logic simply be gated on the cluster version? Do we need to check the node versions at all?

Collaborator

Could you elaborate? What "versioning logic" are you referring to?

Do we need to check the node versions at all?

In what context?


To make myself clearer (in case it wasn't), my suggestion is to replace the entire diff in this PR with the above check for Finalizing in the first line of the fetch span stats - mixed function. As I said, it will be quite tricky, if not impossible, to write a non-flaky assertion while the upgrade is finalizing.

@THardy98 THardy98 (Contributor, Author) commented Jun 21, 2023

Sorry, was in a couple long pairing sessions.

Could you elaborate? What "versioning logic" are you referring to?

Currently, in the test, there is a check to see if the nodes are on different binaries. If so, we check that we receive the expected error message. If the cluster version only reaches 23.1.0 once all nodes are at least at 23.1.0, then I suspect we don't need this check at all? We can just use a cluster version check.

To make myself clearer (in case it wasn't), my suggestion is to replace the entire diff in this PR with the above check for Finalizing in the first line of the fetch span stats - mixed function. As I said, it will be quite tricky, if not impossible, to write a non-flaky assertion while the upgrade is finalizing.

For my own knowledge, what does it mean for an upgrade to be finalizing? Could it be described as the period when the cluster version is upgrading, once all nodes have upgraded? It would help me understand why the diff I've written would cause flakes.

Collaborator

If the cluster version only reaches 23.1.0 once all nodes are at least at 23.1.0, then I suspect we don't need this check at all? We can just use a cluster version check.

I think there's some confusion here, 23.1.0 is the notation we use for released binaries. The cluster version is not quite the same. Instead of trying to write more words myself, I recommend you take a look at the comments in cockroach_versions.go and upgrademanager/manager.go first if you haven't already. Feel free to ask questions if something is still not clear.
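To make the distinction concrete, here's a minimal sketch (mine, not part of this PR) that reads both values from a node. It assumes an insecure local node on the default port; the builtin crdb_internal.node_executable_version() reports the binary's version, while SHOW CLUSTER SETTING version reports the active cluster version, which can lag behind the binaries while migrations run:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver; CockroachDB speaks the pgwire protocol.
)

func main() {
	// Assumes an insecure local node on the default port.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var binaryVersion, clusterVersion string
	// The version of the binary this node is running (e.g. "23.1").
	if err := db.QueryRow(
		"SELECT crdb_internal.node_executable_version()",
	).Scan(&binaryVersion); err != nil {
		log.Fatal(err)
	}
	// The active cluster version (e.g. "22.2-86" mid-upgrade).
	if err := db.QueryRow("SHOW CLUSTER SETTING version").Scan(&clusterVersion); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("binary version: %s, cluster version: %s\n", binaryVersion, clusterVersion)
}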

could it be described as the period when the cluster version is upgrading, once all nodes have upgraded?

Yes (sorry for not being clearer earlier). It means that all nodes are ahead of the current cluster version, and they are allowed to upgrade.

The way that mixed-version tests work is that we update each node's binaries (calling your user functions in between at random points in time), and when all nodes are running the latest binary, we allow the upgrade to finalize. This is done by resetting the cluster.preserve_downgrade_option cluster setting [1]. It's not too different from the process described in [2].

[1]

func (s finalizeUpgradeStep) Run(
	ctx context.Context, l *logger.Logger, c cluster.Cluster, helper *Helper,
) error {
	node, db := helper.RandomDB(s.prng, s.crdbNodes)
	l.Printf("resetting preserve_downgrade_option (via node %d)", node)
	_, err := db.ExecContext(ctx, "RESET CLUSTER SETTING cluster.preserve_downgrade_option")
	return err
}

[2] https://www.cockroachlabs.com/docs/stable/upgrade-cockroach-version.html

Collaborator

why the diff I've written would cause flakes.

It causes flakes in the following scenario:

  1. The fetch span stats - mixed function is called while the upgrade is finalizing (all nodes running a release-23.1 binary, migrations running in the background, cluster version advancing from V22_2 all the way to V23_1 while your function is running).
  2. You read the current cluster version in "fetch span stats". Let's say that returns 22.2-86 (i.e., by the time you read the cluster setting, quite a few migrations have completed).
  3. Since the cluster version is <23.1, you deem that you should get an error if you try to use your endpoint.
  4. In the meantime, the remaining migrations finish running, and the cluster version reaches V23_1.
  5. You call your endpoint. You expect an error, but no error is returned (the behavior is correct, the assertion is wrong).
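To see the race outside of roachtest, here's a self-contained simulation (illustrative only: the atomic value and the sleeps stand in for the background migrations; none of this is the real test code):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	// Simulated cluster version, advanced by background "migrations".
	var clusterVersion atomic.Value
	clusterVersion.Store("22.2-86")

	go func() {
		// Finalization completes while the test hook is still running.
		time.Sleep(10 * time.Millisecond)
		clusterVersion.Store("23.1")
	}()

	// Time of check: the version is still <23.1, so we expect an error.
	expectErr := clusterVersion.Load().(string) != "23.1"

	// Time of use: finalization has finished by now, so no error occurs.
	time.Sleep(20 * time.Millisecond)
	gotErr := clusterVersion.Load().(string) != "23.1"

	if expectErr && !gotErr {
		fmt.Println("flake: expected an error, but the call succeeded")
	}
}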

Contributor Author

Thank you for explaining! The flake scenario in particular is illuminating; skipping these assertions during finalization seems prudent.

Contributor Author

With that in mind, does it make sense to strip the entire else branch?

I.e.

} else {
	// All nodes are on one version, but we're in a mixed state
	// (i.e. the cluster version is on a different version).
	issueNodeID := c.All()[0]
	dialNodeID := c.All()[1]

	// Dial a node for span stats.
	l.Printf("Dial a node for span stats (different cluster version).")
	res, err = fetchSpanStatsFromNode(ctx, l, c, c.Node(issueNodeID), newReqBody(dialNodeID, startKey, endKey))
	if err != nil {
		return err
	}

	// Expect a mixed-version error in the stdout.
	// Ensure the result can be unmarshalled into a valid error response.
	err = json.Unmarshal([]byte(res.Stdout), &errOutput)
	if err != nil {
		return err
	}
	// Ensure we get the expected error.
	mixedClusterVersionErr := assertExpectedError(errOutput.Message, mixedVersionReqError)
	expectedUnknown := assertExpectedError(errOutput.Message, unknownFieldError)
	if !mixedClusterVersionErr && !expectedUnknown {
		return errors.Newf("expected '%s' or '%s' in error message, got: '%v'", mixedVersionReqError, unknownFieldError, errOutput.Error)
	}
}

Is there a state where all nodes are on one version, the cluster is on another, but we aren't in a finalizing state?

Collaborator

Agreed -- if we skip on Finalizing, we can remove the entire else branch -- we always expect an error in that case, regardless of which node we connect to.

Contributor Author

Sweet, made the change and updated the commit/PR description to reflect the new diff.

@THardy98 THardy98 force-pushed the update_mixed_version_failure_case branch from ce202eb to 2f313a7 Compare June 21, 2023 18:24
@THardy98 THardy98 requested a review from renatolabs June 21, 2023 18:24
if h.Context().Finalizing {
	return nil
}

// If we have nodes in mixed versions.
if len(h.Context().FromVersionNodes) > 0 && len(h.Context().ToVersionNodes) > 0 {
Collaborator

I'd also remove this condition here. Now that we don't run this validation when the upgrade is finalizing, we know that the cluster version is still at V22_2 every time this function is called, and therefore we should get the error every time, right?
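For reference, a rough sketch of what the hook body reduces to after both removals. It recombines the helpers already quoted in this thread (fetchSpanStatsFromNode, newReqBody, assertExpectedError, errOutput, the error constants) and assumes their surrounding declarations, so it's illustrative rather than compilable on its own:

// Skip: no meaningful assertion is possible while the upgrade finalizes.
if h.Context().Finalizing {
	return nil
}

// Until finalization begins, the cluster version is still at V22_2, so the
// endpoint must return the mixed-version error no matter which node we dial.
res, err := fetchSpanStatsFromNode(ctx, l, c, c.Node(c.All()[0]), newReqBody(c.All()[1], startKey, endKey))
if err != nil {
	return err
}
var errOutput errorOutput // hypothetical response type, standing in for the test's own
if err := json.Unmarshal([]byte(res.Stdout), &errOutput); err != nil {
	return err
}
if !assertExpectedError(errOutput.Message, mixedVersionReqError) &&
	!assertExpectedError(errOutput.Message, unknownFieldError) {
	return errors.Newf("expected '%s' or '%s' in error message, got: '%v'",
		mixedVersionReqError, unknownFieldError, errOutput.Error)
}
return nil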

Contributor Author

Good point, removed.

@THardy98 THardy98 force-pushed the update_mixed_version_failure_case branch from 2f313a7 to efbd78b Compare June 22, 2023 15:00
@THardy98 THardy98 force-pushed the update_mixed_version_failure_case branch from efbd78b to 0926ed3 Compare June 22, 2023 18:42
@THardy98 THardy98 merged commit 296dfd2 into cockroachdb:release-23.1 Jun 23, 2023
6 checks passed