
Increase the number of retries (1->3) #444

Merged
1 commit merged into dlang:master from Geod24:retry-more on Apr 12, 2021

Conversation

@Geod24 (Member) commented Apr 12, 2021
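
For reference, bumping the automatic retry limit in a Buildkite pipeline definition looks roughly like the sketch below; the step label and script path are hypothetical, and the actual pipeline file in dlang/ci may differ.

    steps:
      - label: "run testsuite"          # hypothetical step label
        command: "./buildkite/run.sh"   # hypothetical script path
        retry:
          automatic:
            limit: 3                    # was 1; re-run a failed job up to 3 times

An unconditional automatic retry re-runs every failed job, whether the failure is transient or a genuine bug, which is the trade-off raised below.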

@dlang-bot (Collaborator)

Thanks for your pull request, @Geod24!

@WalterBright (Member)

@Geod24 I hope this won't cause a retry if a test suite failure is actually caused by a bug and not a networking failure?

@dlang-bot merged commit d909146 into dlang:master on Apr 12, 2021
@Geod24 deleted the retry-more branch on April 12, 2021 at 09:36
@Geod24 (Member, Author) commented Apr 12, 2021

@WalterBright: That's the downside of it - it will.

@WalterBright (Member)

The obvious question - can we get a proper fix?

@PetarKirov (Member)

Given that even for a human it can be difficult to decide whether a bug is a Heisenbug or not, what would be the algorithm to determine that automatically?
"Agent lost" and OOM should ideally be handled by BuildKite, but for other things it's hard to say.

@MartinNowak (Member) commented Apr 12, 2021

Fine by me; the bill for those runners is fairly small. I did lower it to 1 in the past since many PR problems were not intermittent, but some are, and human time is quite valuable.

@Geod24 (Member, Author) commented Apr 12, 2021

@MartinNowak: Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well? I have a few servers that I would gladly use as permanent runners.
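
For illustration only: an extra agent can be run from a container along roughly these lines, assuming the official buildkite/agent image and a per-organization agent token (hypothetical values, not the dlang/ci setup).

    # docker-compose.yml sketch for a contributor-run agent (hypothetical values)
    services:
      agent:
        image: buildkite/agent:3               # official agent image
        restart: always
        environment:
          BUILDKITE_AGENT_TOKEN: "<org token>" # issued per Buildkite organization
          BUILDKITE_AGENT_TAGS: "queue=default"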

@WalterBright (Member)

the algorithm

All networking errors would be a great first approximation.

@PetarKirov (Member)

All networking errors would be a great first approximation.

Obviously, yes. The question is how to determine whether a failure is networking-related. For example, IIRC some (all?) std.socket unit tests run on localhost, so internet access is not a prerequisite there. A build could fail because dub can't fetch a dependency, which could be caused either by code.dlang.org (and its mirrors) being down, or by the project being built looking for a non-existent version, etc. While a restart is likely to resolve the first cause, it is unlikely to help with the second one.
The high-level idea is clear, but the implementation is not, especially given that we're running the test suites of third-party projects.
Also, IIRC, in the past several months network-related problems were much rarer than, say, a mismatch between compiler and druntime versions occurring when the dlang/druntime project is built on BuildKite.

@WalterBright (Member)

Over here dlang/dmd#12409 (comment) the failure is:

CI agent stopped responding!

Surely that's detectable.
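
For reference, Buildkite's retry rules can be scoped to specific failures; the agent-lost case surfaces as exit status -1, so a narrower retry might look like this (a sketch, not the configuration actually used here):

    retry:
      automatic:
        - exit_status: -1   # Buildkite reports -1 when the agent is lost
          limit: 3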

@MartinNowak (Member) commented Apr 13, 2021

There seems to be almost zero benefit to a smart retry over a blunt 3x retry; it wouldn't even be noticeably faster.
I'd suggest just sticking with the approach here instead of wasting time dealing with a huge error surface.

Over here dlang/dmd#12409 (comment) the failure is:

CI agent stopped responding!

IIRC there is a 5 min. wait-time for running jobs when downscaling agents.

stop_timeout: 5min # systemd TimeoutStopSec for graceful agent shutdown

If the problem occurs often, we could bump that a bit if there are many long-running jobs.

@MartinNowak: Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well? I have a few servers that I would gladly use as permanent runners.

What's the benefit of someone else running servers? Sounds nice in theory, but reliability on a heterogeneous infrastructure run by an uncoordinated group is likely to suffer.

Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well?

I guess a simpler dependency file might indeed help us update the machines. Is this a real problem?
I could try to find some time when possible, but I cannot promise anything.

@WalterBright (Member)

@MartinNowak thanks for the evaluation. I'll defer to your expertise in the matter!

@MartinNowak (Member)

I guess a simpler dependency file might indeed help us update the machines. Is this a real problem?
I could try to find some time when possible, but I cannot promise anything.

Any opinion on whether this is an actual problem, @Geod24?

@Geod24 (Member, Author) commented Apr 20, 2021

@MartinNowak: The lack of machines has definitely hit us in the past. Sometimes there are no agents running for a noticeable amount of time, although I don't recall it ever being more than an hour. I wasn't overly bothered by it because I just hit the retry button, but @WalterBright was.

@Geod24 (Member, Author) commented Apr 20, 2021

Something that is a bit more lacking is the ability for projects to control their dependencies. With the changes we're seeing in the CI ecosystem (Travis disappearing, GitHub Actions rising), I was hoping we could leverage the GitHub runner to simplify our current pipeline. That could theoretically make it easier for core contributors to run agents, too.
I know that the lack of control over dependencies is what has prevented me from adding our projects here.

@MartinNowak (Member)

Something that is a bit more lacking is the ability for projects to control their dependencies. With the changes we're seeing in the CI ecosystem (Travis disappearing, GitHub Actions rising), I was hoping we could leverage the GitHub runner to simplify our current pipeline. That could theoretically make it easier for core contributors to run agents, too.
I know that the lack of control over dependencies is what has prevented me from adding our projects here.

Indeed, we could rebuild the service on GitHub Actions 👍. It might be more accessible for everyone, though it would require some additional setup time (hopefully fine). Not sure how long their free open-source CI will last; I'd guess a while, given MSFT's current strategy.
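
A minimal sketch of what such a GitHub Actions job could look like, using a self-hosted runner so contributors could attach their own machines; the workflow name, trigger, and script path are assumptions, not an actual dlang/ci workflow.

    # .github/workflows/ci.yml sketch (hypothetical)
    name: CI
    on: [pull_request]
    jobs:
      test:
        runs-on: self-hosted            # contributors' machines; or ubuntu-latest for GitHub-hosted runners
        steps:
          - uses: actions/checkout@v2
          - run: ./buildkite/run.sh     # hypothetical reuse of the existing entry script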
