Fix 20970 - Use a longer retry for curl on Windows #11878

Merged
merged 1 commit into from
Oct 17, 2020

Conversation

Geod24
Member

@Geod24 Geod24 commented Oct 17, 2020

This uses the backoff strategy built into curl instead of the short 5-second retry.
We now set retry-max-time, telling curl to give up after 2 minutes.
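
To make the 2-minute budget concrete, here is a rough model of the retry waits. It assumes curl's documented default backoff (wait 1 second before the first retry, then double the wait each time, capped at 10 minutes) and ignores the time spent on the transfer attempts themselves, so it is only an approximation of what curl does. The function name `backoff_schedule` and its parameters are made up for this illustration, and the retry count used in the actual change is not shown here.

```python
def backoff_schedule(max_total_seconds, first_delay=1, max_delay=600):
    """Return the waits (in seconds) that fit inside the retry time budget.

    Simplified model: only the waits between attempts are counted,
    not the duration of the attempts themselves.
    """
    delays = []
    elapsed = 0
    delay = first_delay
    # Keep scheduling waits while the next one still fits in the budget.
    while elapsed + delay <= max_total_seconds:
        delays.append(delay)
        elapsed += delay
        # Double the wait for the next retry, up to the cap.
        delay = min(delay * 2, max_delay)
    return delays


schedule = backoff_schedule(120)
print(schedule, sum(schedule))  # prints: [1, 2, 4, 8, 16, 32] 63
```

Under these assumptions, a 2-minute budget leaves room for about six retries spread over roughly a minute of waiting, rather than a single retry after a fixed 5 seconds.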

@dlang-bot
Contributor

Thanks for your pull request, @Geod24!

Bugzilla references

Auto-close Bugzilla Severity Description
20970 normal Test Suite Azure Pipelines Windows_LDC_Debug x64-debug-ldc failed due to heisenbug

Testing this PR locally

If you don't have a local development environment set up, you can use Digger to test this PR:

dub run digger -- build "master + dmd#11878"

@dlang-bot dlang-bot merged commit 915247a into dlang:master Oct 17, 2020
@WalterBright
Member

Thanks, @Geod24 !

Lots more of these network failures in the test suite:

https://issues.dlang.org/buglist.cgi?keywords=TestSuite&list_id=233523&resolution=---

@Geod24
Member Author

Geod24 commented Oct 17, 2020

@WalterBright : Yep I was looking at them. I think some of them are invalid because we already have retries in place, and sometimes things are just bound to fail (e.g. if the network issue exceeds 2 minutes).

But going over them, there are a few things we can do better. Buildkite, for example, retries a failed build twice before marking the job as failed, but Azure does not.

@Geod24 Geod24 deleted the fix-20970 branch October 17, 2020 12:23
@WalterBright
Member

and sometimes things are just bound to fail (e.g. if the network issue exceeds 2 minutes).

When that happens, going to sleep for 5 minutes and trying again is still better than me restarting the entire test suite which runs for maybe 30 minutes.
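
The "sleep, then try again" idea could be sketched as a small wrapper around a single network-dependent step. This is only an illustration, not code from the test suite: `run_with_retry`, its parameters, and the choice to treat `OSError` as "a network failure" are all assumptions made for the example.

```python
import time


def run_with_retry(step, attempts=2, pause_seconds=300, sleep=time.sleep):
    """Call step(); if it raises OSError, pause and call it again.

    Gives up and re-raises after `attempts` failed tries in total.
    OSError stands in for "network failure" here (an assumption).
    """
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: report the real failure
            sleep(pause_seconds)  # wait out the network issue, then retry


# Demo with a hypothetical step that fails once, and a fake sleep
# so the example does not actually wait 5 minutes.
calls = []

def flaky_download():
    calls.append("try")
    if len(calls) == 1:
        raise ConnectionError("simulated network failure")
    return "ok"

print(run_with_retry(flaky_download, sleep=lambda seconds: None))  # prints: ok
```

Only the failing step is repeated, so a 5-minute pause costs far less than rerunning a roughly 30-minute suite.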

@WalterBright
Member

It's commonplace for me to restart the entire suite multiple times, because the network failure randomly strikes a different test each time. It also gets really old having to examine the logs to see whether it is a real problem or a network one.

@Geod24
Member Author

Geod24 commented Oct 18, 2020

@WalterBright : Just FYI, you should be able to restart most CI on a case-by-case (matrix row) basis.
In the auto-tester, a specific run can be deprecated. Likewise, in Azure, a single matrix row can be restarted ("Azure Pipelines" is just an aggregator). Buildkite already tries to build a project 3 times before marking it as failed, and each project can be retriggered individually. CirrusCI and CircleCI can also be re-triggered.

The only two outliers are the documentation tester (only a push can re-trigger it, or waiting for a while), and GitHub CI, which only allows you to re-trigger the whole matrix. The latter rarely fails, and is the fastest CI we have (as it only builds DMD & co and tests C++ integration), so I haven't seen it being a blocker.

Re-triggering a CI isn't great, but I tend to just re-trigger it if the failure is not obvious at a glance, and take a closer look if the second run doesn't succeed either.
