New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mantle/{kola,platform}: include Ignition runtime in test timer #2660
Conversation
example output:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM! Would be good to have @saqibali-2k look at this also.
example output:
Cool. And of course, the VMs get torn down as expected?
@@ -48,6 +48,27 @@ func RetryConditional(attempts int, delay time.Duration, shouldRetry func(err er | |||
return err | |||
} | |||
|
|||
func RetryUntilTimeout(timeout, delay time.Duration, f func() error) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring would be useful for this with the exact semantics of timeout
and delay
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check the update to see if it's good.
mantle/platform/platform.go
Outdated
@@ -68,6 +68,9 @@ type Machine interface { | |||
// SSH runs a single command over a new SSH connection. | |||
SSH(cmd string) ([]byte, []byte, error) | |||
|
|||
// StartMachine sets up the journal and performs sanity checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Start
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check the update to see if it's good
Yep. They seem to get torn down as expected. |
I don't see any reason to wait up front rather than trying first.
Looking at the function signature of util.Retry (A) and the way we were calling it (B) A) func Retry(attempts int, delay time.Duration, f func() error) error B) util.Retry(sshRetries, sshTimeout, f) - where: - sshRetries = 60 - sshTimeout = 10 * time.Second it seems as if our intention was to be able to retry the SSH command up to 60 times having each one of those 60 tries timeout after 10 seconds. In reality what happens is that we run the ssh command, which eventually times itself out and then delay 10 seconds before running the ssh command the next time (60 times). In my experience with [1] the ssh command would take around 35s to time itself out. This means Retry() could take 60*(35s + 10s)=2700s=45m to actually stop trying. I don't think this was the original intention. Let's make this easier to follow by adding a RetryUntilTimeout() function and use that instead. Also let's get rid of the global variables as I think they aren't really needed. [1] coreos/fedora-coreos-tracker#1066 (comment)
…ll platforms It's more logical than the reverse. Prep for adding an option that will be used in all platforms. This is the same as c6f1b09 but migrating the rest of the platforms that don't already have it set up that way.
platform.StartMachine() blocks on SSH being available: StartMachine() |-> StartMachineAfterReboot() |-> j.Start() | -> util.Retry() <- ssh retry So if we call it as part of our NewMachine() bringup there's no way to start the test execution timer and have it include the time it takes to execute Ignition (which has to complete before SSH will become available). Let's add a Machine Option (SkipStartMachine) that will tell the NewMachine code to skip the call to StartMachine() and return as soon as the platform thinks the machine is running. Then in our test code we can start the timer and then call StartMachine() (via the new mach.Start() interface) manually to set up journal forwarding and perform sanity checks. This should help with issues like one we had recently [1] where our timeouts weren't properly timing out. This was because the provider timeout (implemented via WaitUntilReady()), had already moved out of that code but the test timeout (via ExecTimer) hadn't started yet because everything was blocking under StartMachine(). NOTE: This commit doesn't solve the problem for tests that bring up machines themselves by calling NewMachine() but should solve it for the large majority of tests. [1] coreos/fedora-coreos-tracker#1066 (comment)
c87979a
to
0736745
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and thanks for the work!
Aside: Should we consider some sort of mechanisms for de-duplicating the repeated code between platform/machine
?
❤️
Probably. |
We're timing out sometimes on this test because it takes a while for Ignition to set up RAID1 and Ignition provisioning time is now part of the overall test timeout[1]. Bump it to 15 mins. [1] coreos/coreos-assembler#2660
There are a few commits in here that should be reviewed individually. THe main commit is