Make HTTP tests more robust by adding retries to the tests #9652

radeusgd · 2024-04-08T15:48:06Z

Pull Request Description

As requested by @hubertp, who was hitting flaky test failures on CI in Http_Spec and related specs, I'm adding retry logic to make such failures much less likely.
- To check this, I made the test server fail randomly 50% of the time, and with the retry logic the tests still pass. In practice the real failure rate is much lower than that (I'd estimate under 1%, since these tests usually passed and a single CI run makes a lot of requests), so the tests should now be much more robust.
- I moved the with_retries method to Test.with_retries, so it can be used for retry logic anywhere in our tests.
- It sleeps for 0.1s between retries. Not every kind of test needs this; it was added mainly to cover propagation delays in our Cloud tests. I considered making the delay configurable, but a 0.1s delay is not problematic, and if our tests sometimes fail because of high machine load, the delay may also help.
- This does not add retry logic to raw HTTP operations or Data.fetch. We may add that later, but it needs further design. If we do, some of the retries in the tests may become unnecessary and can be removed.
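To make the retry behaviour described above concrete, here is a minimal Python sketch of a fixed-delay retry helper. It is a hypothetical analogue of Test.with_retries, not the Enso implementation: the names `with_retries`, `max_retries` and `sleep_time` and the default of 10 retries are assumptions for illustration. Only the 0.1s delay, the progress message every 10 retries and re-raising the last failure are taken from the PR and its diff.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], max_retries: int = 10, sleep_time: float = 0.1) -> T:
    """Run `action`, retrying up to `max_retries` extra times with a fixed delay.

    Hypothetical Python analogue of the retry idea in this PR (not the Enso code).
    """
    retries = 0
    while True:
        try:
            return action()
        except Exception:
            retries += 1
            if retries > max_retries:
                # Give up and propagate the last failure (the Enso diff uses Panic.throw).
                raise
            if retries % 10 == 0:
                print(f"Still failing after {retries} retries.")
            # Constant delay between attempts; the PR uses 0.1s.
            time.sleep(sleep_time)


# Usage: an action that fails twice and then succeeds.
state = {"calls": 0}

def flaky_request() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"

print(with_retries(flaky_request))  # succeeds on the third attempt
```

Assuming each attempt fails independently with probability 0.5 (like the randomly failing test server), all 11 attempts allowed by `max_retries=10` fail with probability 0.5^11 = 1/2048, roughly 0.05%.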

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

The documentation has been updated, if necessary.
Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
All code has been tested:
- Unit tests have been written where possible.
- If GUI codebase was changed, the GUI was tested when built using ./run ide build.

GregoryTravis · 2024-04-08T15:53:48Z

distribution/lib/Standard/Test/0.0.0-dev/src/Test.enso

+                if i > max_iterations then Panic.throw caught_panic else
+                    if i % 10 == 0 then
+                        IO.println "Still failing after "+i.to_text+" retries. ("+loc.to_display_text+")"
+                    Thread.sleep (1000*sleep_time . floor)


The total time spent might be significantly longer if the action itself takes non-negligible time; it might be better to check the current time against (start_time + total_sleep_delay) rather than using a counter.

radeusgd · Apr 8, 2024

Fair point, but I feel the current behaviour is what we want.

If the action takes 3s to complete because of bad network conditions and then fails on a timeout, a total time budget of 2s means it will not be retried at all. But the whole point of this is to do some retries, so I think it's better to make the same number of retries regardless of how long the underlying action takes.

total_sleep_delay is only used to approximate the total wait time. I could rephrase this as a plain max_retries counter and remove total_sleep_delay altogether, if that would be clearer.
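The trade-off in this thread can be checked with a small simulation. The Python sketch below uses a simulated clock (no real sleeping) and an action that always fails after a fixed duration. The function names and the 20-retry budget (a 2s total delay divided by the 0.1s sleep) are assumptions for illustration. With a 3s action and a 2s deadline, the deadline-based policy makes a single attempt and never retries. The counter-based policy always retries, but its total time grows with the action's duration, which is the concern raised in the review.

```python
def simulate_deadline(action_duration: float, sleep_time: float, total_budget: float) -> tuple[int, float]:
    """Deadline policy: stop once the clock passes start + total_budget.
    Returns (attempts, elapsed seconds) for an action that always fails."""
    now = 0.0
    deadline = now + total_budget
    attempts = 0
    while True:
        attempts += 1
        now += action_duration  # the failing action itself consumes time
        if now >= deadline:
            return attempts, now
        now += sleep_time  # fixed delay before the next attempt


def simulate_counter(action_duration: float, sleep_time: float, max_retries: int) -> tuple[int, float]:
    """Counter policy: always make 1 + max_retries attempts before giving up.
    Returns (attempts, elapsed seconds) for an action that always fails."""
    now = 0.0
    attempts = 0
    while True:
        attempts += 1
        now += action_duration
        if attempts > max_retries:
            return attempts, now
        now += sleep_time


# A 3s action that times out, with a 2s total retry budget (0.1s * 20 retries).
print(simulate_deadline(3.0, 0.1, 2.0))  # one attempt, no retries
print(simulate_counter(3.0, 0.1, 20))    # 21 attempts, about 65s in total
```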

Akirathan · Apr 8, 2024

What about an exponential backoff? E.g., the first retry waits 2 seconds, the next 4 seconds, then 8 seconds, 16 seconds, and so on. The way you coded it, it will wait 100 seconds on CI after every retry, right?

radeusgd · Apr 9, 2024

It will wait 100 milliseconds between retries, not 100 seconds 😅

I feel like this complicates things unnecessarily. I want the tests to finish as soon as possible, and increasing the wait time works against that. This strategy has already been used successfully for running the Cloud tests with propagation delays, and I don't see value in complicating it until we have a reason to. For now I don't see one: it works well enough and is simple.
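The two strategies discussed here can be compared with plain arithmetic. This Python sketch uses hypothetical function names. It lists the delays for the fixed 0.1s strategy and for the doubling schedule starting at 2s suggested in review. It also converts the delay to milliseconds the way the diff does (the floor of 1000 * sleep_time), which gives 100 ms rather than 100 seconds. Over five retries the fixed strategy sleeps about 0.5s in total, whereas the doubling schedule sleeps 2 + 4 + 8 + 16 + 32 = 62s.

```python
import math


def fixed_delays(retries: int, delay: float = 0.1) -> list[float]:
    """Delays (seconds) for a constant-delay retry strategy."""
    return [delay] * retries


def doubling_delays(retries: int, first: float = 2.0) -> list[float]:
    """Delays (seconds) for an exponential backoff that doubles each time."""
    return [first * 2 ** k for k in range(retries)]


def sleep_millis(sleep_time: float) -> int:
    """Convert seconds to whole milliseconds, mirroring `1000*sleep_time . floor`."""
    return math.floor(1000 * sleep_time)


print(sleep_millis(0.1))        # 100
print(sum(fixed_delays(5)))     # ~0.5
print(sum(doubling_delays(5)))  # 62.0
```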

radeusgd added 7 commits April 8, 2024 17:14

move with_retries to Test

074b2fa

fail HTTP randomly for testing

289e16c

doc

b6bcb59

add retries to HTTP tests

a6a808a

a few more retries

9353839

Revert "fail HTTP randomly for testing"

6686c2c

This reverts commit 289e16c.

a few more retries (2)

b425aa9

radeusgd added the "CI: No changelog needed" label (Do not require a changelog entry for this PR) Apr 8, 2024

radeusgd self-assigned this Apr 8, 2024

radeusgd requested review from jdunkerley, GregoryTravis and AdRiley as code owners April 8, 2024 15:48

radeusgd requested a review from hubertp April 8, 2024 15:48

GregoryTravis approved these changes Apr 8, 2024

jdunkerley approved these changes Apr 8, 2024

radeusgd added 2 commits April 9, 2024 11:21

Merge branch 'refs/heads/develop' into wip/radeusgd/http-tests-retry

37e5b2d

CR: change names of variables

4fdcd77

radeusgd added the "CI: Ready to merge" label (This PR is eligible for automatic merge) Apr 9, 2024

mergify bot merged commit 354ee94 into develop Apr 9, 2024
34 of 36 checks passed

mergify bot deleted the wip/radeusgd/http-tests-retry branch April 9, 2024 10:07

unfurl-links bot mentioned this pull request Apr 15, 2024

Generate completion of Table.join join criteria using data from both joined tables #5629

Closed
