
Increase the number of retries (1->3) #444

Merged
1 commit merged into dlang:master from Geod24:retry-more on Apr 12, 2021

Conversation

@Geod24 (Member) commented Apr 12, 2021
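
For reference, bumping the automatic retry limit in a Buildkite pipeline definition looks roughly like the sketch below; the step label and script path are hypothetical, and the actual pipeline file in dlang/ci may differ.

    steps:
      - label: "run testsuite"          # hypothetical step label
        command: "./buildkite/run.sh"   # hypothetical script path
        retry:
          automatic:
            limit: 3                    # was 1; re-run a failed job up to 3 times

An unconditional automatic retry re-runs every failed job, whether the failure is transient or a genuine bug, which is the trade-off raised below.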

@dlang-bot (Collaborator)

Thanks for your pull request, @Geod24!

@WalterBright (Member)

@Geod24 I hope this won't cause a retry if a test suite failure is actually caused by a bug and not a networking failure?

@dlang-bot merged commit d909146 into dlang:master on Apr 12, 2021
@Geod24 deleted the retry-more branch on April 12, 2021 at 09:36
@Geod24 (Member, Author) commented Apr 12, 2021

@WalterBright: That's the downside of it - it will.

@WalterBright (Member)

The obvious question - can we get a proper fix?

@PetarKirov (Member)

Given that even for a human it can be difficult to decide whether a bug is a Heisenbug or not, what would be the algorithm to determine that automatically?
"Agent lost" and OOM should ideally be handled by BuildKite, but for other things it's hard to say.

@MartinNowak (Member) commented Apr 12, 2021

Fine by me; the bill for those runners is fairly small. I did lower it to 1 in the past since many PR problems were not intermittent, but some are, and human time is quite valuable.

@Geod24 (Member, Author) commented Apr 12, 2021

@MartinNowak: Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well? I have a few servers that I would gladly use as permanent runners.
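
For illustration only: an extra agent can be run from a container along roughly these lines, assuming the official buildkite/agent image and a per-organization agent token (hypothetical values, not the dlang/ci setup).

    # docker-compose.yml sketch for a contributor-run agent (hypothetical values)
    services:
      agent:
        image: buildkite/agent:3               # official agent image
        restart: always
        environment:
          BUILDKITE_AGENT_TOKEN: "<org token>" # issued per Buildkite organization
          BUILDKITE_AGENT_TAGS: "queue=default"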

@WalterBright (Member)

the algorithm

All networking errors would be a great first approximation.

@PetarKirov (Member)

All networking errors would be a great first approximation.

Obviously, yes. The question is how to determine whether a failure is networking-related. For example, IIRC some (all?) std.socket unit tests run on localhost, so internet access is not a prerequisite there. A build could fail because dub can't fetch a dependency, which could be caused either by code.dlang.org (and its mirrors) being down, or by the project being built looking for a non-existent version, etc. While a restart is likely to resolve the first cause, it is unlikely to help with the second one.
The high-level idea is clear, but the implementation is not, especially given that we're running the test suites of third-party projects.
Also, IIRC, in the past several months network-related problems were much rarer than, say, a mismatch between compiler and druntime versions occurring when the dlang/druntime project is built on BuildKite.

@WalterBright (Member)

Over here dlang/dmd#12409 (comment) the failure is:

CI agent stopped responding!

Surely that's detectable.
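
For reference, Buildkite's retry rules can be scoped to specific failures; the agent-lost case surfaces as exit status -1, so a narrower retry might look like this (a sketch, not the configuration actually used here):

    retry:
      automatic:
        - exit_status: -1   # Buildkite reports -1 when the agent is lost
          limit: 3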

@MartinNowak (Member) commented Apr 13, 2021

There seems to be almost zero benefit to a smart retry over a blunt 3x retry; it wouldn't even be noticeably faster.
I'd suggest just sticking with the approach here instead of wasting time dealing with a huge error surface.

Over here dlang/dmd#12409 (comment) the failure is:

CI agent stopped responding!

IIRC there is a 5 min. wait-time for running jobs when downscaling agents.

stop_timeout: 5min # systemd TimeoutStopSec for graceful agent shutdown

If the problem occurs often, we could bump that a bit if there are many long-running jobs.

@MartinNowak: Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well? I have a few servers that I would gladly use as permanent runners.

What's the benefit of someone else running servers? Sounds nice in theory, but reliability on a heterogeneous infrastructure run by an uncoordinated group is likely to suffer.

Perhaps you could take a look at https://github.com/dlang/ci/blob/master/buildkite/Dockerfile so contributors could run an agent as well?

I guess a simpler dependency file might indeed help us update the machines. Is this a real problem?
I could try to find some time when possible, but I cannot promise anything.

@WalterBright (Member)

@MartinNowak thanks for the evaluation. I'll defer to your expertise in the matter!

@MartinNowak (Member)

I guess a simpler dependency file might indeed help us update the machines. Is this a real problem?
I could try to find some time when possible, but I cannot promise anything.

Any opinion on whether this is an actual problem, @Geod24?

@Geod24 (Member, Author) commented Apr 20, 2021

@MartinNowak: The lack of machines has definitely hit us in the past. Sometimes there are no agents running for a noticeable amount of time, although I don't recall it ever being more than an hour. I wasn't overly bothered by it because I just hit the retry button, but @WalterBright was.

@Geod24 (Member, Author) commented Apr 20, 2021

Something that is a bit more lacking is the ability for projects to control their dependencies. With the changes we're seeing in the CI ecosystem (Travis disappearing, GitHub Actions rising), I was hoping we could leverage the GitHub runner to simplify our current pipeline. That could theoretically make it easier for core contributors to run agents, too.
I know that the lack of control over dependencies is what has prevented me from adding our projects here.

@MartinNowak (Member)

Something that is a bit more lacking is the ability for projects to control their dependencies. With the changes we're seeing in the CI ecosystem (Travis disappearing, GitHub Actions rising), I was hoping we could leverage the GitHub runner to simplify our current pipeline. That could theoretically make it easier for core contributors to run agents, too.
I know that the lack of control over dependencies is what has prevented me from adding our projects here.

Indeed, we could rebuild the service on GitHub Actions 👍. It might be more accessible for everyone, though it would require some additional setup time (hopefully fine). Not sure how long their free open-source CI will last; I'd guess a while, given MSFT's current strategy.
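
A minimal sketch of what such a GitHub Actions job could look like, using a self-hosted runner so contributors could attach their own machines; the workflow name, trigger, and script path are assumptions, not an actual dlang/ci workflow.

    # .github/workflows/ci.yml sketch (hypothetical)
    name: CI
    on: [pull_request]
    jobs:
      test:
        runs-on: self-hosted            # contributors' machines; or ubuntu-latest for GitHub-hosted runners
        steps:
          - uses: actions/checkout@v2
          - run: ./buildkite/run.sh     # hypothetical reuse of the existing entry script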
