Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mantle/{kola,platform}: include Ignition runtime in test timer #2660

Merged
merged 4 commits into from Jan 26, 2022

Conversation

dustymabe
Copy link
Member

There are a few commits in here that should be reviewed individually. THe main commit is

mantle/{kola,platform}: include Ignition runtime in test timer

platform.StartMachine() blocks on SSH being available:

StartMachine()
     |-> StartMachineAfterReboot()
         |-> j.Start()
             | -> util.Retry() <- ssh retry

So if we call it as part of our NewMachine() bringup there's
no way to start the test execution timer and have it include
the time it takes to execute Ignition (which has to complete
before SSH will become available).

Let's add a Machine Option (SkipStartMachine) that will tell the
NewMachine code to skip the call to StartMachine() and return as
soon as the platform thinks the machine is running. Then in our
test code we can start the timer and then call StartMachine()
(via the new mach.Start() interface) manually to set up journal
forwarding and perform sanity checks.

This should help with issues like one we had recently [1] where our
timeouts weren't properly timing out. This was because the provider
timeout (implemented via WaitUntilReady()), had already moved out of
that code but the test timeout (via ExecTimer) hadn't started yet
because everything was blocking under StartMachine().

NOTE: This commit doesn't solve the problem for tests that bring up
      machines themselves by calling NewMachine() but should solve it
      for the large majority of tests.

[1] https://github.com/coreos/fedora-coreos-tracker/issues/1066#issuecomment-1009978326

@dustymabe
Copy link
Member Author

example output:

...
=== RUN   rpmostree.upgrade-rollback
--- FAIL: coreos.ignition.ssh.key (633.97s)
        harness.go:101: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for i-0c2b754267beec527.
--- FAIL: fcos.metadata.aws (634.15s)
        harness.go:101: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for i-0f736a390cb6b4cc1.
--- FAIL: rpmostree.upgrade-rollback (634.19s)
        harness.go:101: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for i-0b33747c535355ccc.
--- FAIL: coreos.ignition.journald-log (634.25s)
        harness.go:101: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for i-01799e0a2e2c04889.
--- FAIL: ext.config.ignition.kargs (634.27s)
        harness.go:101: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for i-04f87e36ae4ceba43.
--- FAIL: ext.config.extensions.module (634.40s)
        harness.go:101: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for i-0eb388c0e471328ac.
--- FAIL: fcos.network.listeners (634.43s)
        harness.go:101: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for i-0716e767b6dc920cc.
--- FAIL: ext.config.extensions.package (634.65s)
        harness.go:101: TIMEOUT[10m0s]: SSH unsuccessful within allotted timeframe for i-0718ddb8f5f2453de.

jlebon
jlebon previously approved these changes Jan 25, 2022
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! Would be good to have @saqibali-2k look at this also.

example output:

Cool. And of course, the VMs get torn down as expected?

@@ -48,6 +48,27 @@ func RetryConditional(attempts int, delay time.Duration, shouldRetry func(err er
return err
}

func RetryUntilTimeout(timeout, delay time.Duration, f func() error) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring would be useful for this with the exact semantics of timeout and delay.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the update to see if it's good.

@@ -68,6 +68,9 @@ type Machine interface {
// SSH runs a single command over a new SSH connection.
SSH(cmd string) ([]byte, []byte, error)

// StartMachine sets up the journal and performs sanity checks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Start?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the update to see if it's good

@dustymabe
Copy link
Member Author

Cool. And of course, the VMs get torn down as expected?

Yep. They seem to get torn down as expected.

I don't see any reason to wait up front rather than trying first.
Looking at the function signature of util.Retry (A) and the way
we were calling it (B)

    A) func Retry(attempts int, delay time.Duration, f func() error) error
    B) util.Retry(sshRetries, sshTimeout, f)
        - where:
          - sshRetries = 60
          - sshTimeout = 10 * time.Second

it seems as if our intention was to be able to retry the SSH command
up to 60 times having each one of those 60 tries timeout after 10
seconds. In reality what happens is that we run the ssh command,
which eventually times itself out and then delay 10 seconds before
running the ssh command the next time (60 times). In my experience
with [1] the ssh command would take around 35s to time itself out.
This means Retry() could take 60*(35s + 10s)=2700s=45m to actually
stop trying.

I don't think this was the original intention. Let's make this easier
to follow by adding a RetryUntilTimeout() function and use that
instead. Also let's get rid of the global variables as I think they
aren't really needed.

[1] coreos/fedora-coreos-tracker#1066 (comment)
…ll platforms

It's more logical than the reverse. Prep for adding an option that
will be used in all platforms.

This is the same as c6f1b09 but migrating the rest of the platforms
that don't already have it set up that way.
platform.StartMachine() blocks on SSH being available:

StartMachine()
     |-> StartMachineAfterReboot()
         |-> j.Start()
             | -> util.Retry() <- ssh retry

So if we call it as part of our NewMachine() bringup there's
no way to start the test execution timer and have it include
the time it takes to execute Ignition (which has to complete
before SSH will become available).

Let's add a Machine Option (SkipStartMachine) that will tell the
NewMachine code to skip the call to StartMachine() and return as
soon as the platform thinks the machine is running. Then in our
test code we can start the timer and then call StartMachine()
(via the new mach.Start() interface) manually to set up journal
forwarding and perform sanity checks.

This should help with issues like one we had recently [1] where our
timeouts weren't properly timing out. This was because the provider
timeout (implemented via WaitUntilReady()), had already moved out of
that code but the test timeout (via ExecTimer) hadn't started yet
because everything was blocking under StartMachine().

NOTE: This commit doesn't solve the problem for tests that bring up
      machines themselves by calling NewMachine() but should solve it
      for the large majority of tests.

[1] coreos/fedora-coreos-tracker#1066 (comment)
Copy link
Member

@saqibali-2k saqibali-2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and thanks for the work!
Aside: Should we consider some sort of mechanisms for de-duplicating the repeated code between platform/machine?

@dustymabe
Copy link
Member Author

Looks good and thanks for the work!

❤️

Aside: Should we consider some sort of mechanisms for de-duplicating the repeated code between platform/machine?

Probably.

@jlebon jlebon merged commit c87a077 into coreos:main Jan 26, 2022
jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request Jan 27, 2022
We're timing out sometimes on this test because it takes a while for
Ignition to set up RAID1 and Ignition provisioning time is now part of
the overall test timeout[1]. Bump it to 15 mins.

[1] coreos/coreos-assembler#2660
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants