no active session for <id>: context deadline exceeded #456
Comments
My first hunch is that some context is initialised with a short deadline, and that deadline expires because the build takes so long. The error is very ambiguous, though: buildx/vendor/github.com/moby/buildkit/session/manager.go, lines 176 to 178 in fb7b670.
Also: buildx/vendor/github.com/moby/buildkit/util/progress/progressui/printer.go, lines 151 to 158 in fb7b670.
All I can tell is that this came from here: buildx/util/progress/printer.go Lines 47 to 53 in fb7b670
I am not entirely sure where to go from here, this is seems like some kind of generic processing queue thing and this can be coming from just about anywhere in either buildkit or buildx code... |
Not much to add, but I'm seeing the same behavior here today with a similar setup: building multiarch across three nodes (amd64 + aarch64 + armv7l). The armv7l build took close to eight hours and then hit this. I'm using plain TCP socket contexts to the two ARM nodes and issuing the build on the amd64 node (UNIX socket). |
Was that export supposed to end with a push? What version of buildkit? |
@tonistiigi yes, in my case it was. |
In my case everything was built in the same docker instance, with qemu for some of the Arm stages. |
@tonistiigi ping! |
It looks like maybe the session connection just dropped because your build took almost 5h. You should see that from the daemon logs. If that is the case then maybe we could add logic to redial. Although not really different from build request connection itself dropping or if this happens at the same time session is being used it would still fail. |
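For anyone unsure how to check the daemon logs mentioned above: on a systemd-based host, something like the following sketch works (the unit name, time window, and grep patterns are assumptions, not from this thread):

```shell
#!/usr/bin/env bash
# Sketch: look for evidence of a dropped session in the Docker daemon logs
# around the time the build failed. Unit name and window are assumptions.
check_daemon_logs() {
  if command -v journalctl >/dev/null 2>&1; then
    journalctl -u docker.service --since "6 hours ago" 2>/dev/null \
      | grep -iE 'session|grpc|deadline' || echo "no matching log lines"
  else
    echo "journalctl not available; check /var/log for the daemon logs"
  fi
}

check_daemon_logs
```

On non-systemd hosts the daemon log location depends on how dockerd was started.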
Would you mind pointing me to where the redial logic would have to be implemented? Also, how can we improve the error message? I'm happy to have a go at fixing this one, just need some pointers. I'm pretty sure it's some sort of a drop. |
The session request happens in https://github.com/moby/buildkit/blob/master/client/solve.go#L165, but first you should figure out whether it is dropping and whether there is any error message or condition that causes it. |
The same happens to me, very frequently. In some cases it is almost impossible to push images to Docker Hub. In my case I'm cross-building on arm to other arm versions or amd64.
|
We are getting the same error while pushing to GitHub Container Registry. We thought this was related to the 5 GB limit on a single image layer, but we were able to push images with layer sizes >5 GB.
The following log is copied from an image with layer sizes above 5 GB. According to this log, pushing the manifest data fails in the example above.
|
moby/buildkit#2019 is potentially related, as it improves how the underlying error-handling machinery is implemented. |
For anyone else running into this issue as part of their CI pipeline, we've decided to downgrade to legacy builds instead because reliability is more important than speed. |
This also happens with short build times, in this case less than 10 minutes. Often the same build passes with a build time of about 12 minutes. I'd say it fails in about 50% of the jobs.
|
It looks like it is related to concurrent builds on the same buildkit instance.
|
The same problem. |
Does anybody know a workaround for this issue? |
@Oliveirakun Add a retry mechanism:

    export DOCKER_BUILDKIT=1
    for i in $(seq 1 3); do
      if docker buildx build --platform linux/amd64,linux/arm64 -t test:dev . --push; then
        break
      fi
      sleep 3
      if [[ "${i}" == "3" ]]; then
        echo "[Error]: build error"
        exit 1
      fi
    done |
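The workaround above can be given a more reusable shape as a small retry helper; the attempt count and 3-second delay are arbitrary choices, not anything buildx itself requires:

```shell
#!/usr/bin/env bash
# retry N CMD... : run CMD up to N times, sleeping 3s between attempts.
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0                 # success: stop retrying
    (( i < attempts )) && sleep 3    # wait before the next attempt
  done
  echo "[Error]: command failed after ${attempts} attempts" >&2
  return 1
}
```

Usage would then be, e.g., `retry 3 docker buildx build --platform linux/amd64,linux/arm64 -t test:dev . --push`.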
I'd like to see daemon debug logs for this case. Possible cases are
|
Same problem :( |
Same problem here... has anybody got a solution or hack for this? Alternatively, can we store multiarch images locally in our local image store and then try to push them? |
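On storing the image locally first: buildx can write the result to a local OCI archive instead of pushing, and a tool such as skopeo can push the archive afterwards, making the push step retryable on its own. A sketch (the registry reference is made up, and I haven't verified this end to end):

```shell
#!/usr/bin/env bash
# Sketch: split "build" and "push" into two independent steps.
# registry.example.com/myapp is a made-up image reference.

# Export the multi-arch build to a local OCI tarball instead of pushing.
build_to_archive() {
  docker buildx build --platform linux/amd64,linux/arm64 \
    --output type=oci,dest=image.tar .
}

# Push all platforms from the archive with skopeo in a separate step.
push_archive() {
  skopeo copy --all oci-archive:image.tar \
    docker://registry.example.com/myapp:latest
}

echo "run build_to_archive, then push_archive"
```

If the push fails, only `push_archive` needs to be rerun; the build result is already on disk.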
Hi everyone! I added a step before the "Build and push" step.
or
It works for me. |
I stopped seeing the problem after updating to buildkit moby/buildkit:v0.9.0-rootless and increasing the cache size from the default to 50 GB (--oci-worker-gc-keepstorage=50000). |
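For those driving buildkit through buildx rather than running buildkitd directly, the same gc-keepstorage setting can, as far as I can tell, be passed through at builder creation via `--buildkitd-flags`; the builder name below is arbitrary:

```shell
#!/usr/bin/env bash
# Sketch: create a docker-container builder whose buildkitd keeps up to
# 50 GB of cache. "bigcache" is an arbitrary builder name.
create_builder() {
  docker buildx create --name bigcache \
    --driver docker-container \
    --buildkitd-flags '--oci-worker-gc-keepstorage=50000' \
    --use
}

echo "run create_builder on a host with a working docker daemon"
```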
Still have it on buildkit moby/buildkit:v0.9.0-rootless (Azure Kubernetes Service, Azure Container Service). In my case it was related to 2 concurrent cache exports: 2021-08-24T12:19:53.5930996Z #26 exporting cache
|
Still have it on buildkit moby/buildkit:v0.9.0-rootless (Azure Kubernetes Service, Azure Container Service). In my case it was related to 2 concurrent cache exports:
Maybe you can format the reply so that others can read it more easily. |
moby/buildkit#2369 was merged. Hopefully this is fixed now. |
@ekaterinadimitrova2 are you running the master buildkit build with the patch? |
I can confirm that this is still happening with the master buildkit build. All local, using qemu for the emulated Arm64 bits. |
I've been consistently running into this with Docker for Mac 4.4.2 on my M1 MacBook Air, specifically when building a Dockerfile with a COPY instruction. Downgrading to 4.3.2 seems to have fixed the problem. |
Do you use 'docker buildx create'? |
I tried it both ways with the same result, but when I did use buildx create, the command was |
I've been trying to figure out why I get this dreaded error when doing a few concurrent
Btw, I don't get the error when DOCKER_HOST is blank; I only get it when using DOCKER_HOST=ssh://docker.example.com. I'm on that server, docker.example.com (passwordless ssh is configured and working). Anyway, I saw this closed bug and figured it's been fixed for so long that I must have the fix, but my docker 20.10.14 on ubuntu 20 still uses |
@jamshid does the error occur right away in your case or as a timeout? This was originally reported as a timeout case, if you are seeing it right away it might be a different bug altogether. |
@errordeveloper thanks, yes, it's pretty immediate. Where are the useful logs? I haven't had much luck narrowing it down to something easily reproducible, but I can file a bug with the logs. I'd still like to upgrade buildx to see if this is already fixed. |
Are you using the official Docker package for Ubuntu? I'd recommend trying the latest official packages; it's probably a good idea to also remove |
I have to admit, what I said above is just a general point about upgrading, but having looked more specifically at the details, I can see that you don't have to upgrade. Firstly, buildx v0.8.1 is only two weeks old and one patch release behind the latest (v0.8.2). You should try this to get an instance of buildkitd running in a container:
Having done that, you should get the latest stable version out of the box. |
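The suggestion above presumably refers to the docker-container driver, which runs buildkitd from the stable moby/buildkit image instead of using the buildkit embedded in dockerd. Something like the following sketch (the builder name is arbitrary):

```shell
#!/usr/bin/env bash
# Sketch: run buildkitd in a container so the buildkit version is
# independent of the installed dockerd. "mybuilder" is an arbitrary name.
create_container_builder() {
  docker buildx create --name mybuilder \
    --driver docker-container --use \
  && docker buildx inspect --bootstrap  # starts the container, prints status
}

echo "run create_container_builder on a host with a working docker daemon"
```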
I am using the latest official Ubuntu Docker packages. OK, I will file a new bug about the SSH transport. Sorry, my confusion about buildx versions was that I was looking at the buildkit releases under https://github.com/moby/buildkit/releases. That versioning is apparently unrelated to the buildx CLI plugin versioning (https://github.com/docker/buildx). |
Getting this error for GitHub Action - has anyone encountered anything similar? |
I was getting this error in GitHub Actions. Solved by bumping action versions:
UPD: still getting the error sometimes 😔 UPD2: solved by running on a machine with more memory. |
I confirmed that this is still an issue in version 4.9.1, this time testing on x86 macOS. As before, downgrading to 4.3.2 is an effective workaround. |
I was seeing this error crop up but only when using |
I’m seeing a buildx error like this:
Things to note: