Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deflake integration tests #1599

Merged
merged 1 commit into from May 18, 2017

Conversation

ldez
Copy link
Member

@ldez ldez commented May 13, 2017

Description

Base on #1284

@ldez ldez added area/provider kind/enhancement a new or improved feature. labels May 13, 2017
@ldez ldez force-pushed the refactor-deflake-integration-tests branch from f4f6c39 to 4c566aa Compare May 14, 2017 01:53
Copy link
Contributor

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements @ldez 👍 Left a few comments.

Also:

  • We could go even farther on this, mimicing a bit what I started to do on docker integration request package — i.e. instead of requiring the test to create the Request, etc.. just call Request(…) with the correct arguments.
  • We could try to make Try more composable (so you could do Try(…, Request(…), conditions..)), etc..
  • Finaly, as it's a testing package, we could even make the Try function take a t *testing.T as first argument, and calling t.Fatal in case of error 👼

@@ -12,6 +12,7 @@ import (
"strings"
"time"

"github.com/containous/traefik/integration/utils"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this isn't related to this PR, but integration/try would make more sense (and removing Try from function names)

return nil
}

stopTime := time.Now().Add(timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stopTimer := time.NewTimer(timeout)
retryTick := time.NewTicker(interval)

for {
  select {
    case <-stopTimer.C:
      err = fmt.Errorf("try operation failed: %s", err)
      break
    case <-retryTick.C:
      if err = operation(); err == nil {
        // do your thing
        break
     }
  }
}

// This also can be done inside the loop, in the retryTick.C case
stopTimer.Stop()
retryTick.Stop()
return nil

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code doesn't work -> infinite timeout when the operation failed.

for {
	select {
	case <-time.After(timeout):
		return fmt.Errorf("try operation failed: %s", err)
	case <-time.Tick(wait):
		if err = operation(); err == nil {
			return nil
		}
	}
}

-> infinite timeout when the operation failed.

stopTimer := time.NewTimer(timeout)
retryTick := time.NewTicker(interval)

for {
  select {
    case <-stopTimer.C:
      err = fmt.Errorf("try operation failed: %s", err)
      stopTimer.Stop()
      retryTick.Stop()
      return err
    case <-retryTick.C:
      if err = operation(); err == nil {
        stopTimer.Stop()
        retryTick.Stop()
        return err
     }
  }
}

-> works but not very beautiful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code doesn't work -> infinite timeout when the operation failed.

Why would it ? at some point (timeout) stopTimer.C will get the message and we get out of the infinit loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know exactly why but it produce an infinite loop.
it produce 2 loops: one stop at the timeout and one infinite.

interval = maxInterval
}

ci := os.Getenv("CI")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not a huge fan of that. We shouldn't have a special case for CI. If we need a 3x multiplier for timeout in CI, let's just set all the timeout to 3x their value..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this doesn't work because we don't access to global env var in the integration test.

But why I have add this special case?
I don't want to wait 3 min in my local environment to known if a test fail, but in the CI some tests are more slow than a personal computer, we need to have a pragmatic approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this doesn't work because we don't access to global env var in the integration test.

hum.. 🤔 we have access to whichever the environment variable used to execute the tests, so we just need to pass them.

But why I have add this special case?
I don't want to wait 3 min in my local environment to known if a test fail, but in the CI some tests are more slow than a personal computer, we need to have a pragmatic approach.

hum, I don't really buy that but meh I won't fight again, it's just timeout stuff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this doesn't work because we don't access to global env var in the integration test.

hum.. 🤔 we have access to whichever the environment variable used to execute the tests, so we just need to pass them.

in our repo via .semaphoreci/vars we can pass global vars to the CI, but in this case the issue is the overlay docker level (which adds a filter).
To be visible during the tests, The CI global vars need to be passed in the make file like that: atbore-phx@6aef257

CI log

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not a huge fan of that. We shouldn't have a special case for CI. If we need a 3x multiplier for timeout in CI, let's just set all the timeout to 3x their value..

I'm agree "in theory" with this sentence if we are speaking about "timeout" stuff,
but i feel, I'm missing something, just to understand, why do we wait 3 min more, locally ? the value is a timeout so a max "break" value. isn't it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already experienced cases where I had run into integration test timeouts locally only when my laptop was running under high CPU load. Other machines with less or more resources could behave totally differently.

This is really a classical case of YMMV. I'm with @vdemeester that we should just swallow the pill and increase the timeouts globally. For the vast majority of cases, we won't feel the slowdown because the tests will complete quickly enough. For those cases where we experience lots of long-running failing tests, you'd probably start them individually anyway because nobody is going to wait 60, 120, or 180 seconds for a whole bunch of failing integration tests to complete.

Finally, the purpose of this PR is to reduce flakiness, and not waiting long enough is a source of flakiness. I'd rather eliminate this risk and pay the price that things will be a bit slower than necessary if things go south temporarily.


// Sleep pauses the current goroutine for at least the duration d.
// Use only when use an other Try[...] functions is not possible.
func Sleep(d time.Duration) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have a special case for CI, this doesn't make sense. We should be able to replace any time.Sleep from the tests code with a Try or sthg that is better than a plain old Sleep

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, in some case we really need to use time.Sleep, the Try logical cannot solve all case.
We need a second way with a stuff like: "Retry again for X time"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loads of cases should/can be handled with Try or something that acting almost the same.

If we Sleep :

  • when waiting for the process to run, we should use Try with a condition (i.e. something that checks the process is running, or some port is open, or for some output of the command to appear, or ?)
  • when waiting for a k/v value to be there, same, we should put the get into a condition

The less we use Sleep, the better really 👼

Copy link
Member Author

@ldez ldez May 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/containous/traefik/blob/master/integration/marathon_test.go#L39

This case need a continue approach, try doesn't work in this case.

PS: this test file seems to test nothing, because all request seems always give the same response, look at the comment block (I have failed to make this work).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also look at https://github.com/containous/traefik/blob/master/integration/docker_test.go#L180

Yes, this one could/should use TryRequest (with the condition on getting a 404 or sthg)

// Conditions are not allowed since it would complicate signaling if the
// response body needs to be closed or not. Callers are expected to close on
// their own if the function returns a nil error.
func TryGetResponseUntilStatusCode(url string, timeout time.Duration, statusCode int) (*http.Response, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is not used elsewhere

}
return nil
}
}

func doGetResponse(url string, timeout time.Duration, condition Condition) (*http.Response, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is not required, let's just pass the http method to doResponse and use only doRespose


_, err = ioutil.ReadAll(resp.Body)
cond := utils.ComposeCondition(utils.UntilStatusCodeIs(200), utils.BodyContains("Welcome to nginx!"))
err = utils.TryRequest(req, 5*time.Second, cond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have make TryRequest taking variadic args, so that you could do TryRequest(req, 5*time.Second, utils.UntilStatusCodeIs…, utils.Body…).


resp, err = client.Do(req)
utils.Sleep(3 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about that one, at all. What are we testing here : that the health was 500 before (it's tested above) and that in a given period of time, it will be 200. I don't see why we re-verify 500, wait for 3s and then wait for 200.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to wait the time to do the check of the "healthcheck"

c.Assert(err, checker.IsNil)
}

utils.Sleep(3 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost the same here, if we want to wait 3s and then have the TryRequest, let's just set a bigger timeout for the Try.

@atbore-phx
Copy link

atbore-phx commented May 14, 2017

when you think this PR reaches a steady state, just ping me and i could execute a series on the CI to verify if the Error Probability is decreasing like that: #1600.
IMHO: it could be a nice test to confirm if we are on the right direction before merge.
@ldez, @timoreimann it's a very nice work and improvement BTW.

@ldez
Copy link
Member Author

ldez commented May 14, 2017

  • Finaly, as it's a testing package, we could even make the Try function take a t *testing.T as first argument, and calling t.Fatal in case of error 👼

It's the first step, I will create the testing package after.

@ldez ldez force-pushed the refactor-deflake-integration-tests branch from f5cbc19 to c6c98da Compare May 14, 2017 17:10
@@ -71,14 +79,14 @@ func (s *AccessLogSuite) TestAccessLog(c *check.C) {
tokens, err := shellwords.Parse(line)
c.Assert(err, checker.IsNil)
c.Assert(len(tokens), checker.Equals, 13)
c.Assert(tokens[6], checker.Equals, "200")
c.Assert(regexp.MustCompile("^\\d\\d\\d$").MatchString(tokens[6]), checker.True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Please use backticks instead of quotes to avoid extra escaping.
  • We should be able to specify directly how many digits we expect, i.e., \d{3}.

@@ -9,34 +9,24 @@ import (

"github.com/go-check/check"

"errors"
"github.com/containous/traefik/integration/utils"
"github.com/containous/traefik/integration/try"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should go above line 10 to sort alphabetically.

@@ -65,12 +55,9 @@ func (s *AcmeSuite) TestRetrieveAcmeCertificate(c *check.C) {
client := &http.Client{Transport: tr}

// wait for traefik (generating acme account take some seconds)
err = utils.Try(30*time.Second, func() error {
err = try.Do(90*time.Second, func() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use try.GetRequest? Nevermind, sorry.

leader, err := consulClient.Status().Leader()

if err != nil || len(leader) == 0 {
return fmt.Errorf("Leader not find. %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/find/found/


_, err = ioutil.ReadAll(resp.Body)
err = try.Request(req, 5*time.Second, try.StatusCodeIs(200), try.BodyContains("Welcome to nginx!"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the extra BodyContains check? I'm worried it'll break if we change the Nginx image.

200 should be good enough, shouldn't it?

c.Assert(err, checker.IsNil)

time.Sleep(time.Second * 3)
// Verify frontend health : before
err = try.Request(req, 3*time.Second, try.StatusCodeIs(500))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this.


resp, err = client.Do(req)
try.Sleep(2 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want another query check at this point rather making sure whoami1Host's health check returns 200 again.

try.Sleep(2 * time.Second)

// Verify frontend health : 500
err = try.Request(req, 3*time.Second, try.StatusCodeIs(500))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...this should not be needed anymore.

resp, err = client.Do(req)
try.Sleep(2 * time.Second)

// Verify frontend health : after
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the block starting from here to line 91 (incl.) isn't needed as we just want the single, static request starting at line 93.

@@ -25,7 +26,7 @@ func (s *HTTPSSuite) TestWithSNIConfigHandshake(c *check.C) {
c.Assert(err, checker.IsNil)
defer cmd.Process.Kill()

time.Sleep(500 * time.Millisecond)
try.Sleep(500 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS, this should be the only try.Sleep call left after my comments regarding the other ones are addressed. What is this one waiting for? Isn't this just another "wait for Traefik to start" check that we tend to solve via a retriable request?

Copy link
Contributor

@timoreimann timoreimann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is already a tremendously good PR @ldez. 👏

I think there are still 3 open points:

  1. Health check test: I think the implementation is still susceptible to flakiness (primarily because of the try.Sleep usage) and that there's room for simplification. Please see my latest comments.
  2. try.Sleep: I'm quite confident we don't need them and can actually do better. I have made suggestions how to replace all of them, except for one where I'm not totally sure but it looks replaceable too.
  3. CI env var: As outlined, I'm not really in favor of it. It's the least important point to me, though.

}

// Waiting for Traefik healthcheck
try.Sleep(2 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to drop this one if we just increase the timeout in line 65 a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted all sleep except this one and refactor the whole test.

time.Sleep(time.Second * 3)

resp, err = client.Do(req)
// Verify frontend health : before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the remaining block up until line 89 can be simplified as following:

  1. Try-request for a 200 response code. Once we succeed, we know for sure that the first whoami host has successfully toggled its health status back to 200. Also, this way we don't have to rely on the try.Sleep in line 79.
  2. Try-request for a 500 response code. This verifies that the second whoami host is still unhealthy.

Paraphrased, we can remove the section between line 74 and line 80.

Copy link
Member

@emilevauge emilevauge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this PR is so amazing @ldez
LGTM 👏 👏 👏

xt77xwum9yh7znkfw0

return nil
})

err := try.GetRequest("http://"+s.boulderIP+":4000/directory", 120*time.Second, try.StatusCodeIs(200))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sed/200/http.StatusOK eventually ?


return nil
})
c.Assert(err, checker.IsNil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note that this is the reason why I think try.Do(t, 3*time.Second, …) would fit perfectly. But can be done in follow-ups.

@@ -66,7 +63,7 @@ func (s *EurekaSuite) TestSimpleConfiguration(c *check.C) {
}
}`

tmpl, err := template.New("eurekaTemlate").Parse(eurekaTemplate)
tmpl, err := template.New("eurekaTemplate").Parse(eurekaTemplate)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤓

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱

Copy link
Contributor

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small comments, but can definitely merged as is 😎 🎉 👏
LGTM 🦁 (probably not with 54 commits though 😝)


return nil
})
c.Assert(err, checker.IsNil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note that this is the reason why I think try.Do(t, 3*time.Second, …) would fit perfectly. But can be done in follow-ups.

@@ -66,7 +63,7 @@ func (s *EurekaSuite) TestSimpleConfiguration(c *check.C) {
}
}`

tmpl, err := template.New("eurekaTemlate").Parse(eurekaTemplate)
tmpl, err := template.New("eurekaTemplate").Parse(eurekaTemplate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱

// ResponseCondition is a retry condition function.
// It receives a response, and returns an error
// if the response failed the condition.
type ResponseCondition func(*http.Response) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could be unexported (but lint or vet might be not happy, not sure)


// DoCondition is a retry condition function.
// It returns an error
type DoCondition func() error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (same as above): could be unexported (but lint or vet might be not happy, not sure)


timeout = applyCIMultiplier(timeout)

var err error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I don't think we need to declare it before hand.

@ldez ldez force-pushed the refactor-deflake-integration-tests branch from bd3a0f9 to 2004de6 Compare May 17, 2017 13:23
- feat: add CI multiplier
- refactor: readability
- feat: custom Sleep function
- refactor(integration): use custom Sleep
- feat: show Try progress
- feat(try): try response with status code
- refactor(try): use a dedicate package.
- refactor(integration): Try everywhere
- feat(CI): pass CI env var to Integration Tests.
- refactor(acme): increase timeout.
- feat(acme): show Traefik logs
- refactor(integration): use `http.StatusXXX`
- refactor: remove Sleep
@ldez ldez force-pushed the refactor-deflake-integration-tests branch from 2004de6 to 59a48fe Compare May 17, 2017 14:34
Copy link
Contributor

@timoreimann timoreimann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldez ldez added this to the 1.4 milestone May 18, 2017
@ldez ldez merged commit 2610023 into traefik:master May 18, 2017
@ldez ldez deleted the refactor-deflake-integration-tests branch May 18, 2017 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/provider kind/enhancement a new or improved feature.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants