Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH - Shorten hash used in environment path #611

Closed
Tracked by #507 ...
asmeurer opened this issue Oct 10, 2023 · 8 comments · Fixed by #652
Closed
Tracked by #507 ...

ENH - Shorten hash used in environment path #611

asmeurer opened this issue Oct 10, 2023 · 8 comments · Fixed by #652
Assignees
Labels
needs: discussion 💬 This item needs team-level discussion before scoping type: enhancement 💅🏼

Comments

@asmeurer
Copy link
Contributor

Context

Environment paths currently use a hash of the environment plus a timestamp plus the environment name. For example 99108419ad0fd922fdeff9bbc434b58d41f68e3f923a83f6a7ab19568463bc82-20230915-201619-848782-1-test for an environment named "test".

This path is very long (90 characters plus the environment name) and causes issues on Windows where the default filesystem limits paths to 260 characters, and some packages can have long paths. See #588. For example, notebook has a path with 122 characters (#588 (comment)), so 122 + 90 is 212 characters, leaving only 48 characters for the root of wherever the conda-store state directory is stored plus the environment name.

Long paths can also cause issues on other platforms for some packages that are built to only be able to be installed in paths up to 255 characters. See #588 (comment)

Value and/or benefit

On Windows, this can be fixed by setting a registry key, but this requires administrator privileges, so some deployments might not be able to do it.

Anything else?

See #588 for more technical details on the Windows path length problem.

My suggestion would be to use a shorter hash. We can incorporate the timestamp into the hash, since it is already stored in the database. However, we would need to make sure this is done in a way that is backwards compatible.

@asmeurer asmeurer added the type: question 🤔 Further information is requested label Oct 10, 2023
@trallard trallard added type: enhancement 💅🏼 needs: discussion 💬 This item needs team-level discussion before scoping and removed type: question 🤔 Further information is requested labels Oct 11, 2023
@trallard trallard changed the title Shorten hash used in environment path ENH - Shorten hash used in environment path Oct 11, 2023
@jaimergp jaimergp added this to the challenges 🛠 milestone Nov 2, 2023
@jaimergp
Copy link
Member

jaimergp commented Nov 2, 2023

I'd consider using a different encoding for the hash. Our current base16 could be safely extended to base32 and hence we can afford a shorter hash. The date could be a shorter (yet less readable) timestamp.

If we want readability, I'd keep a series of symlinks to the hashed path. If you want to map back, you can always query the database. Convenience shouldn't trump usability.

@nkaretnikov nkaretnikov self-assigned this Nov 2, 2023
@nkaretnikov
Copy link
Contributor

nkaretnikov commented Nov 4, 2023

[UPD: Nevermind, see my messages below. This is a conda limitation on all platforms, we cannot solve this just by using the extended length prefix.]

@jaimergp Not sure where base32 comes from, it's sha256 + some other stuff:

@property
def build_key(self):
    """A conda environment build is a function of the sha256 of the
    environment.yaml along with the time that the package was built.

    The last two parts of the build key are to assist finding the
    record in the database.

    The build key should be a key that allows for the environment
    build to be easily identified and found in the database.
    """
    datetime_format = "%Y%m%d-%H%M%S-%f"
    return f"{self.specification.sha256}-{self.scheduled_on.strftime(datetime_format)}-{self.id}-{self.specification.name}"

In #588, Aaron shows this example:

ERROR:root:[WinError 206] The filename or extension is too long: 'C:\\Users\\Aaron\\Quansight\\conda-store\\conda-store-server\\conda-store-state\\default\\99108419ad0fd922fdeff9bbc434b58d41f68e3f923a83f6a7ab19568463bc82-20230915-201619-848782-1-test\\Lib\\site-packages\\sympy\\parsing\\autolev\\test-examples\\pydy-example-repo'

Yes, our build_key contains redundant information, state directories could also use shorter names. However, even if we somehow had a small build_key that cannot exceed a certain size, part of the issue is the internal package directory structure, which we don't control. Yes, we can take a look at all existing packages and infer the current max directory path length, but it might still break one day.

IMO, a better solution, which shouldn't require changing much and won't affect other platforms is using the extended-length path prefix "\\?\" (32,767 chars max), as documented here:

https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation

IIUC, this will cause Windows to use different APIs, which don't have this legacy path limit. Also, based on what've read, the support for this should go back to Windows 7, but it needs to be verified. The only issue I can think of is not being able to browse these paths using Explorer, but I think that was only a problem on XP, and it was updated to use different APIs on Windows 7, but again, this needs to be verified. Also, it's not something that should regularly happen as these are internal. Finally, these extended paths are supposed to be absolute (cannot use . or ..), but this doesn't seem like a problem for us.

A number of projects seem to be using this approach:

So I'm going to explore the extended-length path prefix "\\?\" first.
[UPD: Nope, we'll need to change the build_key - see below.]

Not going to explore a different build_key for now, but it raises a number of questions if we decide to change it:

  • it has a lot of redundant information (why have a hash if you have build id, why have a timestamp and env name?)
  • build id and env name can be arbitrary large
  • env name is user-controlled
  • what will happen on collisions: a total failure or a rebuild? where is this used?
  • how to migrate existing installations to this new yet-to-be-defined build_key.

Requirements for new build_key if we decide to change it:

  • constant size or size with known upper bound
  • addresses the collision/uniqueness, depending on use (see where it's used in code)
  • check existing max directory structure in packages to know how much space we have left out of the 256 char limit
  • rename state directories to make them shorter
  • other properties it needs to have (the build_key docstring mentions some of those).

@nkaretnikov
Copy link
Contributor

nkaretnikov commented Nov 4, 2023

I forgot to mention advantages of the extended path prefix "\\?\":

  • it's a platform-dependent solution for a platform-dependent problem. macOS and Linux are unaffected by this change
  • on Windows, we can also put it behind a flag, like git does, so existing Windows installs are also not affected
  • it shouldn't change anything on the DB side since it mostly concerns filesystem access, so can be just done in code.

Going to write a PR and report if anything breaks.
[UPD: Going to shorten the build_key instead.]

@nkaretnikov
Copy link
Contributor

nkaretnikov commented Nov 4, 2023

OK, nevermind. Jaime told me this is needed because of point 1 here: #588 (comment). So there's no other way than to shorten the build_key.

conda-build will use a default prefix placeholder of 255 characters. This means that the prefix base path (e.g. ~/.local/conda-store/envs/env-name-1234abcd) always needs to be under 255 characters. There should be a check that validates the configuration and ensures that base installation path + intermediate directories + default length of the hashed environment <= 255. This is for all platforms.

Repro: #588 (comment)

@nkaretnikov
Copy link
Contributor

nkaretnikov commented Nov 4, 2023

@jaimergp I've thought more about this in the context of the conda prefix. Let's look at Aaron's example, which exceeded the limit:

/private/var/folders/wc/dppcpmxs1tlb36nqcw853wkm0000gn/T/pytest-of-aaronmeurer/pytest-12/test_conda_store_register_envi0/conda-store-state/pytest-namespace/35f604188f69ceb5d9e3fae2c93ffd48c2971d192f21c776e3ebfed7e1196868-20231005-205956-556312-1-pytest-name

  • I highlighted variable-lengths strings in bold.
  • First, note: conda-store-state now would be .conda-store/state (later we're planning to make this controlled via XDG, see #648)
  • We can tell the user to move to a shorter base, so everything before conda-store-state can potentially be c:\\ or /home/<user>, say, 10-15 chars
  • Then we have the namespace name. This is the first problem. This can be a namespace created via the API or the default user namespace, which is the username. When this namespace/username is created in the UI, the user has no idea it's part of the prefix and that they might run into problems due to the conda prefix limit
  • Then we have the hash, which is big but constant
  • Then the timestamp and build id. These can get arbitrary large. The build id will grow faster.
  • Then we also use the full environment name. This is currently only limited by the DB field length, which is 255 chars.

Summary: there are many things that are non-constant and can exceed the prefix length.

I think the long-term solution should be similar to what Nix uses: https://nixos.org/guides/nix-pills/nix-store-paths.
The ideal scheme should be: ROOT + constant value (hash). We can instruct users to have a shorter ROOT and we'll know exactly how long that ROOT can be to not exceed the prefix (determined by the hash length). The main question here is how to select this hash. Can it be just the hash of the conda lockfile? IIUC, lockfiles are supposed to be deterministic. This will also allow for sharing when multiple users end up building the same env. However, we need to be careful about access to this shared resource (such as on deletion). I'll think more about this, but I welcome comments on this idea.

For now, I'll implement the truncated hash + timestamp idea, but it's a bandaid rather than a real solution. We should use something else long-term.

UPD: More info on Nix hashes: https://nixos.wiki/wiki/Nix_Hash

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Nov 4, 2023
nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Nov 5, 2023
@asmeurer
Copy link
Contributor Author

asmeurer commented Nov 5, 2023

Conda itself would need to be updated to use \\?\. And anyway it could always happen that some other package gets installed that needs to access paths in the install prefix and doesn't use this API. The more robust fix is to use something shorter for the hash+timestamp.

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Nov 5, 2023
nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Nov 5, 2023
@nkaretnikov
Copy link
Contributor

Status update: submitted a WIP PR but need to address review feedback: #652.

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Nov 17, 2023
nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Nov 22, 2023
@nkaretnikov
Copy link
Contributor

Opened a separate issue for Nix-like hashes: #678

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs: discussion 💬 This item needs team-level discussion before scoping type: enhancement 💅🏼
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants