
BuildKite failure tracking #27508

filipesilva opened this Issue Dec 6, 2018 · 10 comments


filipesilva commented Dec 6, 2018

We've recently added the BuildKite CI check and are now tracking how reliable it is, and how it fails.

So far the following failures have been observed:

  • Warning: Checkout failed (build log)
  • warning: failed to remove node_modules/... followed by Warning: Checkout failed! (build log 1, 2, 3)
    • these two seem to happen when the buildkite scripts end up in a state where the folder is no longer a repository. Reported to BuildKite as buildkite/agent#866.
  • Bazel's yarn_install fails (build log 1, 2, 3)
    • this failure happens when running Bazel and is always preceded by the warning Waiting for the other yarn instance to finish (8340). Perhaps related to bazelbuild/rules_nodejs#419. The same warnings are also seen in CircleCI builds but did not lead to failures there.
  • ENOSPC: no space left on device (build log)
    • we have increased disk space on the Windows VMs (32GB to 192GB) to validate whether this error was truly caused by lack of disk space. We are still seeing out-of-space errors (build 120) after increasing the disk size, so it's likely not really a disk space issue.
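Since the out-of-space errors persisted even after the disk was enlarged, it could help to log free disk space at the start of every build to see whether the ENOSPC errors correlate with actual exhaustion. A minimal sketch of such a guard; the 10 GB threshold is an arbitrary assumption, not a value from the CI scripts:

```python
import shutil

def check_disk_space(path=".", min_free_gb=10):
    """Return free space in GB, warning when below a (hypothetical) threshold."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / (1024 ** 3)
    if free_gb < min_free_gb:
        print(f"WARNING: only {free_gb:.1f} GB free on {path!r}")
    return free_gb

check_disk_space(".")
```

Running this at the top of the build script would make it obvious from the logs whether ENOSPC failures happen with plenty of host disk free (pointing at Docker's virtual disk instead).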

There are two other identified failure modes that are expected and will not be investigated:

  • Could not find a default pipeline configuration file. (build log 1, 2)
    • this happens when a PR that is not up to date with master fails to build because the buildkite configuration cannot be found. It is expected and should go away in a couple of weeks as PRs become up to date.
  • Canceled automatically (build log)
    • Intermediate builds are cancelled as new commits arrive. This does not affect the master branch.

cc @alexeagle @IgorMinar

@ngbot ngbot bot added this to the needsTriage milestone Dec 6, 2018


IgorMinar commented Dec 6, 2018

IgorMinar added a commit that referenced this issue Dec 6, 2018

Revert "ci: test ts-api-guardian on windows (#27205)"
This reverts commit 24ebdbd.

Buildkite is running out of space and causing PRs to go red: #27508

filipesilva commented Dec 6, 2018

We've removed the github status reporting for BuildKite, which means it will still run but not report anything on PRs. Once it is more stable we will re-enable it.

@ngbot ngbot bot modified the milestones: needsTriage, Backlog Dec 11, 2018

@filipesilva filipesilva referenced this issue Dec 18, 2018


ci: update recommended buildkite VM #27728


filipesilva commented Dec 18, 2018

Today I reprovisioned the VM with the following changes:

  • new version of buildkite agent (might fix checkout errors buildkite/agent#866 (comment))
  • more RAM (GCE was giving warnings about the VM being at the memory limit, also updated doc #27728)

Briefly looked into Docker for Windows "no space" errors and found a couple of promising hits: docker/for-win#1042, docker/for-win#745, docker/for-win#244.

There seem to be a few situations where Docker for Windows keeps taking more and more disk space. cc @gkalpak


filipesilva commented Dec 28, 2018

Checked the health of builds on master since the last comment:

The timeouts seem to be the worst ones. Normal green builds take around 8-13m, but the ones that have bazel yarn_install timeouts take 16-90m. @gkalpak also saw a green build that actually took 30m.

When the bazel yarn_install times out, it looks like the non-bazel yarn install also takes a while. Perhaps it's not really Bazel specific, but rather the connection from the CI machine to the yarn servers, or the yarn servers themselves.
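One way to separate a slow registry connection from a Bazel-specific problem would be to time the yarn install step explicitly and retry it on timeout. A rough sketch of such a wrapper; the timeout and retry values are assumptions, not values from the actual CI configuration:

```python
import subprocess
import time

def run_with_timing(cmd, timeout_s=600, retries=1):
    """Run a command, log its duration, and retry on timeout.

    Hypothetical helper for CI scripts; not part of the existing setup.
    """
    for attempt in range(retries + 1):
        start = time.monotonic()
        try:
            subprocess.run(cmd, check=True, timeout=timeout_s)
            elapsed = time.monotonic() - start
            print(f"{cmd[0]} finished in {elapsed:.1f}s")
            return elapsed
        except subprocess.TimeoutExpired:
            print(f"{cmd[0]} timed out after {timeout_s}s (attempt {attempt + 1})")
    raise RuntimeError(f"{cmd[0]} kept timing out")

# e.g. run_with_timing(["yarn", "install"], timeout_s=900)
```

If the non-bazel yarn install shows the same long durations, that would support the network (or registry) explanation over a Bazel-specific one.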

I also checked the host VM disk space. It was started with 128GB and seems to still have 110GB free.


filipesilva commented Jan 8, 2019

Another health check, for builds from 28 December 2018 to 7 January 2019:

The new failure is only present in, but very frequent for, builds from the last 24h. It is interspersed with successful builds, so it's not just a broken master on Windows. Those builds did not exhibit abnormally lengthy yarn install phases, so it doesn't seem like a slow network.

The previous step is always the yarn install, which seems to finish installing packages and then proceed to update webdriver in a postinstall script. I don't think it's related to the postinstall, because the error mentions @bazel_tools//tools/jdk:remote_jdk and the node postinstall does not know about bazel at all.

Seems to always happen around 2m 20s after the previous bazel log, which suggests a 120s timeout somewhere.
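The suspected 120s timeout could be confirmed by scanning the build-log timestamps for gaps close to that value. A rough sketch; the input format (seconds since build start) and the tolerance are assumptions, not taken from the actual BuildKite log format:

```python
def find_gaps(timestamps_s, suspect_gap_s=120, tolerance_s=30):
    """Return (prev_index, index, delta) for gaps near the suspect duration."""
    gaps = []
    for i in range(1, len(timestamps_s)):
        delta = timestamps_s[i] - timestamps_s[i - 1]
        if abs(delta - suspect_gap_s) <= tolerance_s:
            gaps.append((i - 1, i, delta))
    return gaps

# Example with fabricated timestamps (seconds since build start):
print(find_gaps([0, 5, 10, 135, 140]))  # → [(2, 3, 125)]
```

A consistent ~120s gap at the same log line across failing builds would point at a fixed timeout (network, Bazel repository fetch, or similar) rather than variable load.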


filipesilva commented Jan 8, 2019

Regarding the no such package '@remotejdk_win' error, @alexeagle highlights that e7f4338 was merged recently. These failures started happening 3 builds after that, so they might be related.

@meteorcloudy are you familiar with this error?

@laszlocsomor mentioned it's worth trying --noincompatible_strict_action_env to see if it's related, since that flag's default changed in 0.21 and it seems to be causing unexpected trouble.
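If someone wants to try this, the flag can be passed on the command line or added to the repository's .bazelrc. A sketch of what that might look like (placement and comments are assumptions, not from the actual angular/angular configuration):

```
# .bazelrc — diagnostic only: opt out of the strict action environment
# default that changed in Bazel 0.21. Remove once the failure is understood.
build --noincompatible_strict_action_env
test --noincompatible_strict_action_env
```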

bazelbuild/bazel#6656 is also noteworthy.


laszlocsomor commented Jan 8, 2019

@laszlocsomor mentioned it's worth trying --noincompatible_strict_action_env to see if it's related, since that flag's default changed in 0.21 and it seems to be causing unexpected trouble.

I want to clarify that this is just a shot in the dark.

Also, here's the list of changed flags in 0.21: https://blog.bazel.build/2018/12/19/bazel-0.21.html
If upgrading from Bazel 0.20 to 0.21 is indeed the culprit, then these flags should also be considered suspect.


meteorcloudy commented Jan 8, 2019

I have never seen this error before. @meisterT , do you have any idea what's causing this with 0.21?


meteorcloudy commented Jan 8, 2019

I tried to directly build @bazel_tools//tools/jdk:remote_jdk with 0.21.0 on my machine, and there was no problem.

filipesilva added a commit to filipesilva/angular that referenced this issue Jan 14, 2019

ci: build clean windows cache image on master
Shortly after angular#27990 the Windows CI started failing with `Service 'windows-test' failed to build: max depth exceeded`.

Looking up this error shows that docker images have a maximum of 127 layers. The current setup adds more and more layers over time, reaching this limit.

This PR addresses the problem by always creating the cache image from a clean base environment, without reusing the previous one. This only happens on master.

Related to angular#27508


filipesilva commented Jan 14, 2019

#28160 has a fix for the Service 'windows-test' failed to build: max depth exceeded errors introduced by #27990.
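For context on the max depth exceeded error: Docker images are limited to 127 layers, so a cache image that keeps reusing the previous image as its base accumulates layers until builds fail. A toy illustration of why this shows up only after many builds; the per-build layer counts are assumed numbers, not measured from the actual images:

```python
MAX_LAYERS = 127  # Docker's image layer limit

def builds_until_limit(base_layers, layers_per_build):
    """How many incremental rebuilds fit before hitting Docker's layer cap."""
    return (MAX_LAYERS - base_layers) // layers_per_build

# Assumed: 10 layers in the clean base image, 3 new layers per cached rebuild.
print(builds_until_limit(10, 3))  # → 39
```

Rebuilding from a clean base each time, as the fix does, resets the count instead of letting it grow without bound.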

ngfelixl added a commit to ngfelixl/angular that referenced this issue Jan 28, 2019

Revert "ci: test ts-api-guardian on windows (angular#27205)"
This reverts commit 24ebdbd.

Buildkite is running out of space and causing PRs to go red: angular#27508