Make HTTP tests more robust by adding retries to the tests #9652

Merged
9 commits merged into develop from wip/radeusgd/http-tests-retry
Apr 9, 2024

Conversation

radeusgd
Member

@radeusgd radeusgd commented Apr 8, 2024

Pull Request Description

  • As requested by @hubertp, who was encountering flaky test failures on CI in Http_Spec and related specs, I'm adding retry logic to make such failures much less likely.
    • To check this, I made the test server randomly fail 50% of tests; with the retry logic the tests still pass, so this should be much more robust. In practice the failure rate is far lower (I imagine <1%, as these tests were passing most of the time and we do a ton of requests in a single CI run).
  • I moved the with_retries method to Test.with_retries, so it can be used anywhere in our tests that needs retry logic.
    • It sleeps for 0.1s between retries. Not all kinds of tests need the delay; it was mostly there for propagation delays in the Cloud. I considered making the delay configurable, but 0.1s is not problematic, and if our tests sometimes fail due to high machine load, the delay could also help.
  • This does not add retry logic to raw HTTP operations or Data.fetch. We may add that later, but it needs further design. In that case, we may remove some retries from the tests if they become unnecessary.
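
For readers unfamiliar with the pattern, the retry behaviour described above can be sketched as follows. This is an illustrative Python sketch, not the Enso implementation of Test.with_retries: the function names, the `max_retries` default, and the injectable `sleep` parameter are all assumptions made for the example. Only the fixed 0.1s delay and the 50% failure rate come from the description.

```python
import random
import time


def with_retries(action, max_retries=3, sleep_seconds=0.1, sleep=time.sleep):
    """Run `action`, retrying on any exception up to `max_retries` times.

    Sleeps a fixed `sleep_seconds` between attempts and re-raises the last
    exception once the retries are exhausted. The default of 3 retries is a
    hypothetical value chosen for this sketch.
    """
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the original failure
            attempt += 1
            sleep(sleep_seconds)


def make_flaky(failure_rate, rng):
    """Return an action that fails with probability `failure_rate`."""
    def action():
        if rng.random() < failure_rate:
            raise ConnectionError("simulated server failure")
        return "ok"
    return action


def chance_all_attempts_fail(failure_rate, max_retries):
    """Probability that the first attempt and every retry fail,
    assuming failures are independent."""
    return failure_rate ** (max_retries + 1)


if __name__ == "__main__":
    flaky = make_flaky(0.5, random.Random(0))
    print(with_retries(flaky, max_retries=20))
    print(chance_all_attempts_fail(0.5, 3))
```

Under the independence assumption, a 50% failure rate with 3 retries leaves a 0.5^4 = 0.0625 chance that every attempt fails, and each additional retry halves it again. That is why a test suite can keep passing even against a server that fails half the time.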

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@radeusgd radeusgd added the CI: No changelog needed Do not require a changelog entry for this PR. label Apr 8, 2024
@radeusgd radeusgd self-assigned this Apr 8, 2024
@radeusgd radeusgd requested a review from hubertp April 8, 2024 15:48
if i > max_iterations then Panic.throw caught_panic else
    if i % 10 == 0 then
        IO.println "Still failing after "+i.to_text+" retries. ("+loc.to_display_text+")"
    Thread.sleep (1000*sleep_time . floor)
Contributor

The total time spent might be significantly longer if the action itself takes non-negligible time; it might be better to check the current time against (start_time + total_sleep_delay) rather than using a counter.

Member Author

@radeusgd radeusgd Apr 8, 2024

Fair point, but I feel like the current behaviour is what we want.

If the action takes 3s to complete due to bad network conditions and it fails on a timeout, then with a retry delay of 2s - it will not retry at all... But the whole point of this is to do some retries. I think it's better to do the same number of retries regardless of how long the underlying action is taking.

The total_sleep_delay is only used to approximate the total wait time. But I guess I can rephrase this to be just a max_retries counter and remove total_sleep_delay altogether, if that would be clearer.

Member

What about an exponential backoff that doubles the delay? E.g., the first retry waits for 2 seconds, the next for 4 seconds, then 8 seconds, 16 seconds, etc. The way you coded it, it will wait for 100 seconds on the CI after every retry, right?

Member Author

> What about an exponential backoff that doubles the delay? E.g., the first retry waits for 2 seconds, the next for 4 seconds, then 8 seconds, 16 seconds, etc. The way you coded it, it will wait for 100 seconds on the CI after every retry, right?

It will wait for 100 milliseconds between every retry, not 100 seconds 😅

I feel like this unnecessarily complicates things. I want the test to finish as soon as possible, and increasing the wait time does not help with that. The strategy we have here was already used successfully for running cloud tests with propagation delays. I don't think there's value in complicating it until we have a reason to do so. For now, I don't see any: it works well enough and is simple.
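
To compare the two schedules discussed in this thread, here is a small Python sketch that lists the delays each one would produce. The function names are hypothetical and exist only for this illustration; the 0.1s fixed delay and the 2s starting delay for the doubling schedule come from the comments above.

```python
def fixed_delays(retries, delay_seconds):
    """Delays for a fixed-interval schedule: the same wait before every retry."""
    return [delay_seconds] * retries


def doubling_delays(retries, first_delay_seconds):
    """Delays for a doubling (exponential) backoff: each wait is twice the last."""
    return [first_delay_seconds * 2 ** i for i in range(retries)]


if __name__ == "__main__":
    fixed = fixed_delays(4, 0.1)
    doubling = doubling_delays(4, 2)
    print(fixed, sum(fixed))
    print(doubling, sum(doubling))
```

For four retries, the doubling schedule starting at 2s waits 2, 4, 8, and 16 seconds, 30 seconds in total, while the fixed schedule waits about 0.4 seconds in total. Backoff is usually meant to relieve an overloaded server; for tests that should finish quickly, the fixed short delay keeps the added time small.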

@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Apr 9, 2024
@mergify mergify bot merged commit 354ee94 into develop Apr 9, 2024
34 of 36 checks passed
@mergify mergify bot deleted the wip/radeusgd/http-tests-retry branch April 9, 2024 10:07
4 participants