[question] Concurrent uploads fail in CI/CD - Good practices in CI/CD #14175

Open

docbrown1955 opened this issue Jun 27, 2023 · 7 comments

docbrown1955 commented Jun 27, 2023

Hi,

I have an issue on a Gitlab CI/CD pipeline building and uploading a simple C++ package. I have 4 parallel jobs (they do not share any data, in particular they have independent conan caches) that all build & upload the same package with different configurations (typically debug/release + static/shared library).

By doing so, I randomly get failing jobs returning a "needs DELETE permission" error at the conan upload step. Sometimes all jobs pass, sometimes one or two fail with the same error (see the output below).

When re-run independently, these failed jobs are successful. To be more specific: the error arises during the upload of the sources (conan_sources.tgz), which are the same for all four jobs (only the build config is different). Example of the error I get when doing conan upload:

   conan upload -c test_parallel_ci* -r conan_remote
   Checking which revisions exist in the remote server
   Preparing artifacts to upload
   Compressing recipe sources...
   Compressing package...
   Uploading artifacts
   Uploading recipe 'test_parallel_ci/0.1.0#408bd9e0d39fd1700f55e5daf4dc2221'
   ERROR: Permission denied for user: '[MASKED]': 403: Not enough permissions to delete/overwrite artifact 
   'conan-local:_/test_parallel_ci/0.1.0/_/408bd9e0d39fd1700f55e5daf4dc2221/export/conan_sources.tgz' 
   (user: '[MASKED]' needs DELETE permission)..

It seems to be a concurrency issue on the (Artifactory) server side, and I naturally wonder if my use of conan is well defined or not. Having completely independent jobs firing uploads of the same recipe revision at the same time might not be a good idea, although it seems like something that is bound to happen in general. Is there a good practice I am missing here to automate the upload of several conan packages? Otherwise I guess I have to dig into the Artifactory server configuration?

OS: Linux
Docker image: conanio/gcc10
Conan version: 2.0.7
Minimal example to reproduce the bug: I used the template project generated by conan new cmake_lib -d name=test_parallel_ci -d version=0.1 and a minimal GitLab CI with 4 independent parallel jobs building the debug/release, static/shared versions of the library.

NB: I have no particular knowledge of the Artifactory server I am using.

Have you read the CONTRIBUTING guide?

  • I've read the CONTRIBUTING guide
@memsharded
Member

Hi @docbrown1955

Thanks for your question.
We will try to reproduce it, but this sounds like a potential, quasi-expected race condition, due to the lack of a global upload lock on the server side and the use of restricted permissions that don't allow overwrites.

If you are doing parallel uploads, I'd recommend the following to remove the potential race conditions completely:

  • First, upload just the recipe: conan upload ... --only-recipe
  • Then, fire the full conan upload ... in parallel for the different configurations (a minimal sketch follows below).
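A minimal sketch of what that could look like in the pipeline scripts (the recipe reference, remote name, and settings below are illustrative, not taken from this issue):

   # Run once, before the parallel jobs are started (e.g. in a preliminary job):
   conan export .
   conan upload -c "test_parallel_ci/*" -r conan_remote --only-recipe

   # Then, in each of the parallel jobs (debug/release, static/shared):
   conan create . -s build_type=Release -o "*:shared=True"
   conan upload -c "test_parallel_ci/*" -r conan_remote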

It is also planned to implement the upload of multiple binaries in parallel internally (something like conan upload ... --parallel, or maybe via some conf).

Please let us know if the above two-step upload helps. Also, if you manage to get the server version, that is relevant too.
For the parallel uploads built-in feature, you might want to track this ticket: #13224

@docbrown1955
Author

Hello @memsharded, thanks for the very quick answer!

I will search for a solution to guarantee the sources are uploaded only once. It is not straightforward though, because it is difficult to separate the build & upload parts (a job cannot be locked between two steps, and two jobs do not share the conan cache).

I see a few solutions:

  • Implement some ad-hoc solution retrying the upload a few times (not great, but should work fine; see the sketch after this list)
  • Share a single conan cache between all jobs, which would also avoid downloading dependencies each time
  • Other solutions would imply copying a lot of things across jobs or losing the guarantee that everything that is uploaded has been validated (built and tested)
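A minimal sketch of the retry idea, assuming a bash job script (the retry count, delay, and reference pattern are arbitrary choices, not something prescribed by conan):

   # Retry the upload a few times to ride out the server-side race
   status=1
   for attempt in 1 2 3; do
       if conan upload -c "test_parallel_ci/*" -r conan_remote; then
           status=0
           break
       fi
       echo "Upload attempt ${attempt} failed, retrying in 10s..."
       sleep 10
   done
   exit ${status}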

I have a follow-up question on sharing the conan cache:

From the conan point of view, would concurrent conan install and conan export-pkg commands behave well with respect to the cache?

From what I saw in the Conan 1.0 docs, as long as there is no delete operation it should be fine. Is this still true in Conan 2.0? If so, all jobs could share the cache, and a single subsequent job could upload all binaries at once (using the --parallel option when available).

On the server side: I do not have delete permissions, and I checked the Artifactory server version using the GET ***/artifactory/api/system/version API request and got the following reply (with the addons and license removed):

{
    "version": "7.59.9",
    "revision": "75909900",
    "entitlements": {
        "EVENT_BASED_PULL_REPLICATION": true,
        "SMART_REMOTE_TARGET_FOR_EDGE": false,
        "REPO_REPLICATION": true,
        "MULTIPUSH_REPLICATION": true
    }
}
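(For reference, this is the standard Artifactory "System Version" REST endpoint; it can be queried with a plain curl call like the one below, where the host and credentials are placeholders rather than the masked values from this issue:)

   curl -u "<user>:<password-or-api-key>" "https://<artifactory-host>/artifactory/api/system/version"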

NB: The issue does not seem to arise in a larger project (with jobs lasting a few minutes or more, I guess there is enough variation in the finish time of each job that the first upload arrives early enough compared to the others).

@memsharded
Member

memsharded commented Jun 28, 2023

I have a follow-up question on sharing the conan cache:

I am afraid this is not possible. The Conan cache is not concurrency-safe. Conan 1.X had some locking that could help to alleviate some cases (but not all), but still it wasn't guaranteed to be good for concurrent usage. Conan 2.0 initially removed all synchronization mechanisms in the cache, because it had to be redesigned from the ground up to be multi-revision ready, and it is guaranteed not to be safe for concurrent usage.

It is on the roadmap to try to make the 2.0 cache concurrent, and hopefully the new cache design will help to make it safer for concurrent usage. But this effort hasn't started yet and it might take a while; there are still other priorities.

NB: The issue does not seem to arise in a larger project (with jobs lasting a few minutes or more, I guess there is enough variation in the finish time of each job that the first upload arrives early enough compared to the others).

Yes, the larger the builds, the less likely it is to happen. Note that the race condition happens between the check for the existence of the revision on the server and the upload of the files. In most cases this window should be quite short, a couple of seconds at most.

It might be a bit longer if the conan_sources.tgz is very big; is this the case? If it is, a way to optimize it is to avoid packaging the sources and use a scm approach in which just the url+commit is stored in the recipe, but not the actual sources.

@memsharded memsharded added this to the 2.0.8 milestone Jun 28, 2023
@puetzk

puetzk commented Jun 28, 2023

If you are doing parallel uploads, I'd recommend the following to remove the potential race conditions completely:

First, upload just the recipe: conan upload ... --only-recipe

I don't see how separating the recipe and binary uploads into two commands would fix the race. It would seem that conan upload ... --only-recipe still suffers from the same TOCTOU race if the recipe did not exist initially, and multiple agents (e.g. building for different OSes) see this and decide to upload it.

I would think using something like X-Checksum-Deploy was the only real fix here (letting Artifactory accept the upload from whoever got there first, and returning HTTP 201 to any subsequent or concurrent attempts as long as the checksums match). Or I suppose conan could emulate the same: retry the pre-upload check for an existing file with the expected checksum after a failure, and quietly swallow the error and continue on if it would now have decided to skip the upload that failed.
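For context, an Artifactory checksum deploy is a PUT with checksum headers and no request body: if a blob with that checksum already exists, the file is deployed without re-transferring the content. A rough sketch of such a request (host, repository path, and credentials are placeholders):

   sha1=$(sha1sum conan_sources.tgz | cut -d' ' -f1)
   curl -u "<user>:<token>" -X PUT \
        -H "X-Checksum-Deploy: true" \
        -H "X-Checksum-Sha1: ${sha1}" \
        "https://<artifactory-host>/artifactory/<repo>/<path>/conan_sources.tgz"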

@memsharded
Member

I don't see how separating the recipe and binary uploads into two commands would fix the race. It would seem that conan upload ... --only-recipe still suffers from the same TOCTOU race if the recipe did not exist initially, and multiple agents (e.g. building for different OSes) see this and decide to upload it.

The idea is that this is done only once, by one agent. It can be done even before the agents start to build, by the main agent, with a conan export + conan upload --only-recipe.

I would think using something like X-Checksum-Deploy was the only real fix here (letting Artifactory accept the upload from whoever got there first, and returning HTTP 201 to any subsequent or concurrent attempts as long as the checksums match). Or I suppose conan could emulate the same: retry the pre-upload check for an existing file with the expected checksum after a failure, and quietly swallow the error and continue on if it would now have decided to skip the upload that failed.

The X-Checksum-Deploy is in place and working, and it will avoid most of the problems, as long as the overwrite/delete permissions allow it. But the missing DELETE permission (needed to overwrite) seems to cause X-Checksum-Deploy uploads to be rejected too (most likely because they can avoid the transfer, but they still change the metadata). I am pretty sure that this cannot be changed in Artifactory; it sounds very risky.

@docbrown1955
Author

I am afraid this is not possible. The Conan cache is not concurrency-safe. Conan 1.X had some locking that could help to alleviate some cases (but not all), but still it wasn't guaranteed to be good for concurrent usage. Conan 2.0 initially removed all synchronization mechanisms in the cache, because it had to be redesigned from the ground up to be multi-revision ready, and it is guaranteed not to be safe for concurrent usage.

It is on the roadmap to try to make the 2.0 cache concurrent, and hopefully the new cache design will help to make it safer for concurrent usage. But this effort hasn't started yet and it might take a while; there are still other priorities.

OK, then I will go for the retry solution, which at least works fine. I do not see another simple solution that does not imply copying a lot of stuff between independent CI/CD jobs.

I hope it will be possible to make the conan cache concurrency-safe at some point, as it could drastically improve CI/CD performance for projects with a lot of dependencies by sharing a single conan cache across job instances.

It might be a bit longer if the conan_sources.tgz is very big; is this the case?

The pathological case I discovered is on a small package with a very quick CI (about 30s in total). On a large package there is no problem; it would be interesting to see whether, on a sane build, adding a large "data.txt" is enough to make the concurrency problem appear again. But from the discussion above I guess it would.

If it is, a way to optimize it is to avoid packaging the sources and use a scm approach in which just the url+commit is stored in the recipe, but not the actual sources.

Thanks for the tip, I will keep this in mind as well :)

@memsharded memsharded modified the milestones: 2.0.8, 2.0.9 Jul 11, 2023
@memsharded memsharded modified the milestones: 2.0.9, 2.0.10 Jul 19, 2023
@memsharded memsharded modified the milestones: 2.0.10, 2.0.11 Aug 25, 2023
@memsharded memsharded modified the milestones: 2.0.11, 2.0.12 Sep 13, 2023
@czoido czoido modified the milestones: 2.0.12, 2.0.13, 2.0.14 Sep 26, 2023
@czoido czoido modified the milestones: 2.0.14, 2.0.15 Nov 7, 2023
@memsharded memsharded modified the milestones: 2.0.15, 2.1 Dec 18, 2023
@memsharded
Member

OK, then I will go for the retry solution, which at least works fine. I do not see another simple solution that does not imply copying a lot of stuff between independent CI/CD jobs.

Regarding this: as of Conan 2.0.17 there is the conan cache save/restore functionality, which can be used to centralize things in CI and then do a single upload from one job, with the binaries for the different configurations gathered together. This approach is described in https://blog.conan.io/2023/11/28/Conan-new-features-2-0-14.html
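A minimal sketch of that flow, assuming each build job saves its cache as a CI artifact and a final job restores them all before uploading (the package pattern, file names, and use of the GitLab CI_JOB_NAME variable are illustrative):

   # In each parallel build job, after conan create:
   conan cache save "test_parallel_ci/*:*" --file=cache_${CI_JOB_NAME}.tgz

   # In a final job that collects the artifacts from all build jobs:
   conan cache restore cache_build_debug_static.tgz
   conan cache restore cache_build_release_shared.tgz
   # ...restore the remaining configurations, then do a single upload:
   conan upload -c "test_parallel_ci/*" -r conan_remote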
