roachtest: fix mixed cluster and node versions for tenant span stats #105183
Conversation
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs and @smg260)
Stress testing this for a while before merging. A few fixes have been made before; I hope to catch any potential failures before merging this.
// An error is expected if:
// - the cluster version is <23.1.0
// - or cluster version >=23.1 and node versions <23.1.0.
if !cv.AtLeast(v231) || len(h.Context().FromVersionNodes) == 4 {
A few comments.

- "cluster version >=23.1" is misleading because, in this test, the cluster version will be at most V23_1 by definition.
- "cluster version >=23.1 and node versions <23.1.0" -- not sure what you meant here, but the cluster version being in 23.1 while nodes are <23.1 is not possible. Upgrades only begin once all nodes are running the 23.1 binary, and once the upgrade begins, older binaries are no longer able to join.
- There's a TOCTOU issue with regard to the cluster version: if this function is called while the upgrade is finalizing, there's no guarantee that the value you're using here still reflects the actual cluster version. I believe that if you run this test with a seed that causes this function to be called during finalization, and sleep for 3 mins on line 164, the test will fail with high probability (i.e., there's a timing dependency).
I think the issue with the test is that it may flake if this hook is called during finalization. Does it make sense to even perform the error check in this case? I'm having a hard time thinking of a meaningful assertion we can write in that environment.
If you decide to skip, you can add the following to the top of this function:
if h.Context().Finalizing {
return nil
}
In this case, can the versioning logic simply be gated to the cluster version? Do we need to check the node versions at all?
Could you elaborate? What "versioning logic" are you referring to?
Do we need to check the node versions at all?
In what context?
To make myself clearer (in case it wasn't), my suggestion is to replace the entire diff in this PR with the above check for Finalizing in the first line of the "fetch span stats - mixed" function. As I said, it will be quite tricky, if not impossible, to write a non-flaky assertion while the upgrade is finalizing.
Sorry, I was in a couple of long pairing sessions.
Could you elaborate? What "versioning logic" are you referring to?
Currently, in the test, there is a check to see if the nodes are on different binaries. If so, we check that we receive the expected error message. If the cluster version only reaches 23.1.0 once all nodes are at least at 23.1.0, then I suspect we don't need this check at all? We can just use a cluster version check.
To make myself clearer (in case it wasn't), my suggestion is to replace the entire diff in this PR with the above check for Finalizing in the first line of the fetch span stats - mixed function. As I said, it will be quite tricky, if not impossible, to write a non-flaky assertion while the upgrade is finalizing.
For my own knowledge, what does it mean for an upgrade to be finalizing? Could it be described as the period when the cluster version is being upgraded once all nodes have upgraded? It would help me understand why the diff I've written would cause flakes.
If the cluster version is only at 23.1.0 when all nodes are are at least at 23.1.0, then I suspect we don't need this check at all? We can just use a cluster version check.
I think there's some confusion here: 23.1.0 is the notation we use for released binaries. The cluster version is not quite the same. Instead of trying to write more words myself, I recommend you take a look at the comments in cockroach_versions.go and upgrademanager/manager.go first if you haven't already. Feel free to ask questions if something is still not clear.
could it be described as when the cluster version is upgrading once all nodes have upgraded?
Yes (sorry for not being clearer earlier). It means that all nodes are ahead of the current cluster version, and they are allowed to upgrade.
The way that mixed-version tests work is that we update each node's binaries (calling your user functions in between at random points in time), and when all nodes are running the latest binary, we allow the upgrade to finalize. This is done by resetting the cluster.preserve_downgrade_option cluster setting [1]. It's not too different from the process described in [2].
[1]
cockroach/pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
Lines 628 to 635 in 6e2f434
func (s finalizeUpgradeStep) Run(
	ctx context.Context, l *logger.Logger, c cluster.Cluster, helper *Helper,
) error {
	node, db := helper.RandomDB(s.prng, s.crdbNodes)
	l.Printf("resetting preserve_downgrade_option (via node %d)", node)
	_, err := db.ExecContext(ctx, "RESET CLUSTER SETTING cluster.preserve_downgrade_option")
	return err
}
[2] https://www.cockroachlabs.com/docs/stable/upgrade-cockroach-version.html
why the diff I've written would cause flakes.
It causes flakes in the following scenario:

1. The "fetch span stats - mixed" function is called while the upgrade is finalizing (all nodes running a release-23.1 binary, migrations running in the background, the cluster version advancing from V22_2 all the way to V23_1 while your function is running).
2. You read the current cluster version in "fetch span stats". Let's say that returns 22.2-86 (i.e., by the time you read the cluster setting, quite a few migrations have completed).
3. Since the cluster version is <23.1, you deem that you should get an error if you try to use your endpoint.
4. In the meantime, the remaining migrations finish running, and the cluster version reaches V23_1.
5. You call your endpoint. You expect an error, but no error is returned (the behavior is correct; the assertion is wrong).
Thank you for explaining! The flake scenario in particular is illuminating; skipping these assertions during finalization seems prudent.
With that in mind, does it make sense to strip the entire else branch?
I.e.
} else {
// All nodes are on one version, but we're in mixed state (i.e. cluster version is on a different version)
issueNodeID := c.All()[0]
dialNodeID := c.All()[1]
// Dial a node for span stats.
l.Printf("Dial a node for span stats (different cluster version).")
res, err = fetchSpanStatsFromNode(ctx, l, c, c.Node(issueNodeID), newReqBody(dialNodeID, startKey, endKey))
if err != nil {
return err
}
// Expect an error in the stdout - mixed version error.
// Ensure the result can be marshalled into a valid error response.
err = json.Unmarshal([]byte(res.Stdout), &errOutput)
if err != nil {
return err
}
// Ensure we get the expected error.
mixedClusterVersionErr := assertExpectedError(errOutput.Message, mixedVersionReqError)
expectedUnknown := assertExpectedError(errOutput.Message, unknownFieldError)
if !mixedClusterVersionErr && !expectedUnknown {
return errors.Newf("expected '%s' or '%s' in error message, got: '%v'", mixedVersionReqError, unknownFieldError, errOutput.Error)
}
}
Is there a state where all nodes are on one version, the cluster is on another, but we aren't in a finalizing state?
Agreed -- if we skip on Finalizing, we can remove the entire else branch -- we always expect an error in that case, regardless of which node we connect to.
Sweet, made the change and updated the commit/PR desc to reflect the new diff.
Force-pushed from ce202eb to 2f313a7.
if h.Context().Finalizing {
	return nil
}

// If we have nodes in mixed versions.
if len(h.Context().FromVersionNodes) > 0 && len(h.Context().ToVersionNodes) > 0 {
I'd also remove this condition here. Now that we don't run this validation when the upgrade is finalizing, we know that the cluster version is still at V22_2 every time this function is called, and therefore we should get the error every time, right?
Good point, removed.
Force-pushed from 2f313a7 to efbd78b.
This change ensures we don't make assertions when the cluster is in a finalizing state (i.e. waiting for the cluster version to upgrade). We cannot make any meaningful assertions in this state without introducing flakiness into the test; as such, we skip this state. Additionally, this allows us to simplify the test logic. Release note: None
Force-pushed from efbd78b to 0926ed3.
Epic: None
Release note: None