Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deterministic short_paths #3971

Closed
wagdav opened this issue Nov 20, 2018 · 21 comments
Closed

deterministic short_paths #3971

wagdav opened this issue Nov 20, 2018 · 21 comments

Comments

@wagdav
Copy link
Contributor

wagdav commented Nov 20, 2018

(the issue template is shamelessly stolen from https://github.com/concourse/concourse)

What challenge are you facing?

We're building a big C++ application in our CI system that uses many Conan packages. Build times can be as long as 1 hour so we use compilation caches (ccache/clcache) to speed up the build. This use case is also mentioned in #3502.

Some of our Conan packages depend on the short_paths option, without this they don't build on Windows, at all. The problem is that the generated short paths aren't deterministic and this makes it impossible to re-use the compilation cache database on a different worker. Only worker-local cache is viable.

Our workers are short lived, their virtual machines are destroyed and recreated daily. Using a worker-local compilation cache is sub-optimal because its lifetime is one day, hence the system needs multiple builds daily to warm up.

What would make this better?

A deterministic short path would allow the use of a global compilation cache greatly speeding up our builds.

The following proposal builds upon the idea dicussed in #3502. Currently the conans.util.windows.path_shortener generates a redirect pointing to (using the default values):

C:\.conan\<random_name>\1

Imagine, instead of this, having something like

C:\.conan\<deterministic_short_path>\1

where the

deterministic_short_path = hash(name, version, user, channel, package_id)

In other words, all the path components appearing in the Conan data directory (available from conan_reference and package_reference in conans.paths.py) will be hashed to a single, deterministic value.

To keep the path short we would truncate the output of a typical hash function such as MD5 or SHA. So the short path would look like:

C:\.conan\f5b96df\1

This would also solve issues like #1881

Are you interested in implementing this yourself?

If you think this proposal is feasible we would be very happy to implement this feature.

@memsharded
Copy link
Member

The only problem with this approach, is the possibility of a collision.

Known as the "birthday problem", the probability for 7 digits is here: source-foundry/font-v#2 (comment)

That will be not very likely, but still very possible. Might need to go for 8 digits. That is also 2 more than current random length, which also might eventually break some package that was really in the limit.

I think it could make sense to take the risk, but need to discuss it.

Thanks for the offer to implement it yourself! Will contact shortly if approved.

@wagdav
Copy link
Contributor Author

wagdav commented Nov 21, 2018

@memsharded thanks for your quick response.

The collisions are possible, however generated hashes don't have to be unique globally. The collisions need to be sufficiently rare on a particular host.

Picking the right length is important, though. Using the simple square approximation from this answer

p ~= (n^2)/(2m)

Where n is the number of items and m is the number of possibilities for each item. The number of possibilities for a hex string is 16^c where c is the number of characters.

  • n = 100, c=4 p ~= 0.08
  • n = 1000, c = 6 gives p ~= 0.03
  • n = 5000, c=7 gives p ~= 0.05
  • n = 10000, c=8 gives p ~=0.01

I guess the question is how many packages a typical host has in its database (assuming that the short_path feature is always on).

@marco-m
Copy link

marco-m commented Nov 21, 2018

(co-author of proposal here)
... and, since these hashes would have meaning only locally, we could even consider a user-customizable parameter for the hash length if really needed. This would not solve the problem of the path being too long in certain unlucky cases, but it would solve the problem of the collisions if the local host has a huge number of packages.

@memsharded
Copy link
Member

Yes, that means, that with a local cache of up to 1000 packages, if using 8 bits, the probability of collision would be around 0.0001 (0.01%)

@wagdav
Copy link
Contributor Author

wagdav commented Nov 21, 2018

Discussing with @marco-m we could make the redirect collision-free if we implemented this as follows:

  • compute the hash
  • try to create the redirect using the first N characters
  • if this path already exists under c:\.conan\ use an extra digit and try to create the redirect using the first N+1 characters

@memsharded
Copy link
Member

Not sure it is enough @wagdav

There is no guarantee that they will be evaluated in the same order

Creating of short paths:

  • Pkg/1@user/channel => hash f12345
  • Pkg2/1@user/channel => (collides) => hash f123456

If for some reason, not unlikely, a different run later that tries to use the same cache runs Pkg2/1@user/channel first, then won't it map to f12345?

Is there something I am missing?

@jgsogo
Copy link
Contributor

jgsogo commented Nov 21, 2018

I think that the approach could work, @memsharded. In your scenario, if Pkg2/1@user/channel is created first, then it could be already in the cache (will use the previously used path stored in the CONAN_LINK file) or created again and look for a new tmp directory not existing in the cache.

Something like this in the path_shortener function should work:

    full_hash = hash_function(path)  # Use path as input, all the settings/options/ids should already be there
    min_length = 4  # Any number will work
    max_length = 8
    assert len(full_hash) > max_length  # Should be easy enough for a hash function

    redirect = os.path.join(short_home, full_hash[:min_length])
    while os.path.exists(redirect) and min_length != max_length:
        min_length += 1
        redirect = os.path.join(short_home, full_hash[:min_length])

    # If everything exists, then fallback to previous behaviour
    if min_length == max_length:
        redirect = tempfile.mkdtemp(dir=short_home, prefix="")

Main issue here: if you remove a package from the Conan cache, will the linked folder be removed too? If not, all the proposed redirect will eventually fail for packages that has been created and removed several times.

@jgsogo
Copy link
Contributor

jgsogo commented Nov 23, 2018

After sharing this issue with the team, we are willing to accept this PR. The implementation should follow the idea above:

  • hash_function: any sha should be enough
  • let's get a deterministic path from 6 to 15 chars (fallback to mkdtemp). Although short paths should be erased when the path in the conan cache is deleted, we keep the temp-dir fallback just in case there is a bug for any package and the short paths are not deleted.
  • add a real_path.txt file inside the shortened path directory with the full path to the Conan cache it is linked from. This way we will be able to track the issue mentioned above.

Thanks very much for offering your help @wagdav !

@marco-m
Copy link

marco-m commented Nov 23, 2018

@memsharded, @lasote, @jgsogo we (@wagdav and myself, and surely also @piponazo) wanted to tell you that we admire how the Conan project treats contributors and users. You are doing a fantastic work.

@Croydon
Copy link
Contributor

Croydon commented Nov 24, 2018

hash_function: any sha should be enough

this might be the case, but I would strongly advise against using sha-1. Even this might not be a security relevant thing in this use case (debatable), insecure hashing algorithm should just go away altogether. I would say it is counterproductive from a "big picture" point of view to introduce broken hashing algorithm in new code no matter the usage.

(Also I'm pretty sure that I have heard that several e.g. government institutions are forbidden from using any software which contains broken hashing algorithm no matter the usage. But that was a while ago please don't pin me down on this.)

@memsharded
Copy link
Member

I don't see any security concern here at all, so sha1 should be perfectly fine. We are taking a known, public path, and mapping it to another path. If someone has access to the system, then the smallest problem is this hash. Furthermore, we are just going to keep a very few 6-7-8 characters of it...

Furthermore other hashing algorithms are already being used in Conan (md5 for file checksums, sha1 for https deduplication checks). There are plans to upgrade them, but at the moment I see no concern for using a sha1 at this moment. Even git is sha1 based :) :) :)

@memsharded
Copy link
Member

Hi @wagdav , @marco-m did you have the chance to work on this? Would you need some help or guidance? I am tentatively assigning @wagdav this issue, but tell otherwise and we will plan for it. Thanks!

@memsharded memsharded changed the title [proposal] deterministic short_paths deterministic short_paths Dec 14, 2018
@Croydon
Copy link
Contributor

Croydon commented Dec 14, 2018

I don't see any security concern here at all

Yes, but what I tried to say it no matter in what environment you are using it I don't think it is good to have a dependency on a broken hash algorithm.

There are plans to upgrade them, but at the moment I see no concern for using a sha1 at this moment. Even git is sha1 based :) :) :)

Even more when you have plans to upgrade, why introducing SHA-1 now once more?
And git is actively working to get rid of of SHA-1 since SHAttered and settled for SHA-256.

Maybe asking this in the other direction: is there a downside of using e.g. SHA-256? (even though, yes it gets cut of anyway and doesn't seem to be security relevant in this specific use case).

@memsharded
Copy link
Member

Fair point, yes, lets do sha2 then.

I was not arguing against it, I was arguing against the fact that using other algorithm here would imply some kind of vulnerability or insecurity in conan. Thanks!

@wagdav
Copy link
Contributor Author

wagdav commented Dec 14, 2018

@memsharded Thanks for the follow-up! We didn't forget about this issue! We are planning to work on this next week! I'll let you know if we're blocked!

@marco-m
Copy link

marco-m commented Dec 19, 2018

@memsharded FYI we have to postpone to beginning of next year.

@lasote lasote added this to the 1.12 milestone Jan 4, 2019
@lasote
Copy link
Contributor

lasote commented Jan 4, 2019

Hi, we want to release this for Conan 1.12 at the end of the month. Would you be able to do it? We will do it otherwise

@wagdav
Copy link
Contributor Author

wagdav commented Jan 4, 2019

Yes! We'll be submitting a PR next week! 🤞

@ghost ghost removed the stage: queue label Jan 10, 2019
lasote pushed a commit that referenced this issue Jan 10, 2019
* Add sha256 helper function

* Implement hashed_redirect

Generate deterministic short paths. This makes it possible to re-use the
compilation cache database after the re-installation of conan packages.

Resolves #3971

* Write the full path to the Conan cache in the shortened directory.

Resolves #3971
@marco-m
Copy link

marco-m commented Jan 10, 2019

Hello, I was thinking, what about the documentation ? :-)

@wagdav
Copy link
Contributor Author

wagdav commented Jan 10, 2019

Hello, I was thinking, what about the documentation ? :-)

The Conan doc repo documents the user facing behavior of the short path feature (how to enable/configure etc). This didn't change with this PR so I didn't modify anything.

@memsharded
Copy link
Member

If anything, it would be nice to add a line there that says that the short_paths version "tries" to use a hash of the long one to achieve some degree of repeatibility, but without global guarantees.

NoWiseMan pushed a commit to NoWiseMan/conan that referenced this issue Jan 11, 2019
* Add sha256 helper function

* Implement hashed_redirect

Generate deterministic short paths. This makes it possible to re-use the
compilation cache database after the re-installation of conan packages.

Resolves conan-io#3971

* Write the full path to the Conan cache in the shortened directory.

Resolves conan-io#3971
NoWiseMan pushed a commit to NoWiseMan/conan that referenced this issue Jan 11, 2019
* Add sha256 helper function

* Implement hashed_redirect

Generate deterministic short paths. This makes it possible to re-use the
compilation cache database after the re-installation of conan packages.

Resolves conan-io#3971

* Write the full path to the Conan cache in the shortened directory.

Resolves conan-io#3971
lasote pushed a commit that referenced this issue Feb 26, 2019
* add new arch [armv5el, armv5hf]

- armv5el -> armv5 soft-float
- armv5hf -> armv5 hard-float

There are still many devices based on these architectures especially embedded devices like Danelec devices, Synology etc...

*  support for armv5el, armv5hf

- made changes to b2 generator
- add tests

* Feature/deterministic short paths (#4238)

* Add sha256 helper function

* Implement hashed_redirect

Generate deterministic short paths. This makes it possible to re-use the
compilation cache database after the re-installation of conan packages.

Resolves #3971

* Write the full path to the Conan cache in the shortened directory.

Resolves #3971

* Adds new architectures: ppc32, armv8_32 and armv8.3 (#4195)

* new archs in settings.yaml

* Added new archs to migration

* Update migration to original settings.yml

* Updated b2 generator with new archs

* pure review of b2 generator

* Added new archs to get_gnu_triplet()

* revert changes on b2 generator

* review

* Back to aaarch64 and link for ilp32 clarification

* Conversion to apple arch

* added armv8.3 to get_gnu_triplet()

* feature/enabling environment variables unset action (#4224)

* Enabling the possibility of unsetting variables by means of appending such variable with value equal to None. Tests covering tools.env.environment_append have been included

* Fixing PEP8 line width requirement

* Fixing PEP8 line width requirement

* Fixing path separator used in the unit test considering its os-dependant value

* Fixing how the env_vars dictionary is being modified. Instead of removing values on-the-fly, It is previourly checked which should be removed by iterating on the keys.

* Fixing how the env_vars dictionary is being modified again. Iterating on keys didn't work enough and now it is decoupled the iteration by using the keys that should be removed and were previously stored

* Fix. Setting all the variables to None (unsetting all of them) effectively removes them from environment.

- Applying changes needed.
- Adding regression test.
- Adding additional test covering the case where no variables are asked to be removed (empty dict).
- Improve tests naming.

* Removed versions (#4267)

* Fixed bug with cached default profile (#4256)

* Fixed bug

* Docstring

* removed caching in client_cache of settings and profile

* Feature: link editable packages (#4181)

* first try

* fix includes

* minor changes

* more import fixes

* fix more imports

* short path as decorator

* move SimplePaths to conans/paths/simple_paths.py

* move back constants to conans.paths

* moving folders to conans.paths

* remove diffs caused just by ordering of imports

* remove diffs caused just by ordering of imports

* no diffs

* no diff

* move is_case_insensitive_os to conans.paths

* add rm_conandir to windows imports

* move package_layouts to paths

* fix imports

* move LINKED_FOLDER_SENTINEL to conans.paths.__init__.py

* move 'get_package_layout' to 'SimplePaths'

* get rid of PackageBaseLayout, go for duck typing

* add parser for the file that will contain paths to includes, res, bins, libs,...

* rename PackageUserLayout > PackageEditableLayout

* interface for installed_as_editable info

* test with a header only editable package

* working example for header only lib

* add test for a library with settings and options (still header only)

* fix py2 test (declare string in source as unicode)

* change import to avoid conans.tools one, so we need to mock a different function

* make it compatible for win/linux

* create the editable file sentinel using --editable argument in install command

* add needed functions to simple_paths

* add helper functions to simple paths

* add package_metadata to new layouts

* add missing named var (something went wrong merging from develop)

* remove lines, not anymore there

* add package metadata to users one

* handle error: package folders needs a ConanFileReference

* reorder imports

* add missing import

* reorder imports

* if an editable package is dirty, log an error

* moved parser to conan_file

* move parsing for conan_package_layout closer to the ConanFile, we will be able to change it in the future

* placeholders should be preprended with settings|options

* fix typos, remove unused code

* fix bug/typo

* order imports

* remove unreachable code

* remove unreachable code

* remove comments

* move unittests to its folder

* install as editable (requires the package to be already in the cache)

* declare tests

* refactor the editable hack: rely on member attribute

* move imports

* rename file to match class name

* notify user about editable condition

* gather logic together

* make it editable from the very beginning, we don't care if it is editable or not

* any error will require us to delete the editable flag

* add tests

* add tests related to editable package removal

* rename tests

* add test for other commands

* add tests for conan commands

* Update conans/test/command/editable_reference/create_editable_package_test.py

Co-Authored-By: jgsogo <jgsogo@gmail.com>

* add changes from review

* move tests to final location

* remove unused var

* change variable name (it was a py conanfile)

* don0t care about the output, but check something

* remove comments

* add tests installing an editable package (and installing its dependencies)

* remove unused import

* move cpp_info overrides to an standalone model

* minor changes related to reviews

* mode create/remove editables to client_cache class

* renamed test folder

* remove code related to 'install --editable'

* link/unlink editable packages

* consider these tests

* handle remove command on editable

* adapt message

* removal is on its own test

* check after unlink

* add tests using linked packages

* add tests for several commands

* fix test related to utf8 and ordering

* apply ordering

* remove prints

* putting path between quotes seems to solve windows issue

* review (add more checks in tests)

* function changed its name

* no import in __init__.py

* remove installed as editable

* order is not guaranteed on info command output

* no need for os.path.join

* move EDITABLE before locking package layout

* no different behaviour for update

* new model for editable_cpp_info

* check that cache file has more priority than the repo one

* review

* remove test, it is accepted behaviour now

* review

* revert change, this line is needed to pass tests in Windows

* weird substitution is not needed, but path normalization is

* inform about link removal

* use conditional 'strict' arg when reusing cache for several TestClients

* pass string to conan_api, build ConanFileReference inside it

* add tests case for editable package without files, it will use package_info data

* remove unused import

* remove uneeded command

* get the path to the conanfile.py from the argument to 'link' command inside the 'conan_api'

* remove 'EditableCppInfo.create', rename loader ones to 'load(file...)' and 'loads(teext...)'

* rename func argument

* move CONAN_PACKAGE_LAYOUT_FILE to conans/paths

* reorder variables inside file

* renamed variables

* pep8

* not using revisions in editable nodes, fix ordering issue

* rename variables

* fixing version ranges with spaces (#4273)

* fixing version ranges with spaces

* fix broken test

* simplified testing

* more error cases

* [WIP] Rename common variables to conventional ones (#4272)

* rename internal member 'PackageReference.conan' to 'PackageReference.ref'

* rename internal member 'PackageReference.conan' to 'PackageReference.ref'

* working on ConanFileRefs

* working on ConanFileRefs

* working on PackageReference

* working on ClientCache

* back to ConanFileReference

* rename member 'Node::conan_ref'

* rename member 'PythonRequires::conan_ref'

* renaming vars in comments, docs,...

* lost references

* remove duplicated uneeded conversion to str

* some loose fringes

* minor fix, review by @lasote #4272

* rename PackageReference::package_id' to 'PackageReference::id'

* loose fringes

*  support for armv5el, armv5hf

- made changes to b2 generator
- add tests

* add handling for arm5hf to get_gnu_triplet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants