
Feature request: always_softlink for conda create --clone #7867

Closed
Tagar opened this issue Oct 17, 2018 · 15 comments
Labels
locked [bot] locked due to inactivity

Comments


Tagar commented Oct 17, 2018

We have a use case where we'd love to have an option for conda create --clone that forces the use of softlinks across filesystems instead of copying files, which is what conda create currently does.

The option could be called something like force_softlinks or always_softlinks, as mentioned in #3373.

PS. Some more detail about the use case:

The use case is around distributed compute environments, where the same conda virtual environment has to be available on all worker nodes. We already have shared NFS storage that is mounted on all worker nodes (at the same mount location). Currently, creating and cloning users' conda environments sometimes takes a lot of time (some of them specify anaconda as a dependency, so an environment can be as big as 2.5 GB with hundreds of packages). All of those worker nodes also have a root conda environment that lives locally on each worker (at another location, but one that is consistent across all servers).

If conda allowed cloning that root environment (that's what users would be cloning) using symlinks back to it, the amount of work required to create an environment would be much smaller. We expect it would speed up conda create --clone quite a bit in this scenario. It also has another nice side effect: worker servers would read most of that conda environment locally rather than from the NFS share. Reading from NFS is fine for one server, but when tens of servers start reading a few GB from NFS, it quickly adds up to significant network traffic.
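
To make the request concrete, here is roughly the workflow we have in mind - a sketch only, with example paths from our setup; always_softlink is the existing setting discussed in #3373, applied here to a clone that crosses filesystems:

```
# root environment lives locally on every worker, e.g. /opt/conda (example path);
# the clone would live on the shared NFS mount and symlink back to local files
conda config --set always_softlink true        # setting discussed in #3373
conda create --clone /opt/conda -p /nfs/conda/envs/user_env
```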

Thank you!

@mingwandroid (Contributor)

Soft links will not work because code will call realpath on them and mess them up.


Tagar commented Oct 18, 2018

@mingwandroid thank you. Is there a way to enforce keeping softlinks?
We think there should be an option to suppress realpath for this use case.
Commands like cp have options to control this, such as --dereference / --no-dereference, and conda already has certain options that are used "at the user's discretion", like --insecure / --clobber, etc.
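
For instance, GNU cp exposes exactly this choice to the caller (a trivial illustration; the paths are placeholders):

```
# -P / --no-dereference copies the symlink itself,
# -L / --dereference copies the file the link points to
cp -P myenv/bin/python3.7 /tmp/keep-link/
cp -L myenv/bin/python3.7 /tmp/copy-target/
```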
We feel strongly about this use case and would like to have a way to control how conda operates on symlinks. Thank you.

@mingwandroid (Contributor)

> We think there should be an option to suppress realpath for this use case.

Do you think if I was to interpose this glibc core C function things would continue to work ok? They would not.

> We feel strongly about this use case and would like to have a way to control how conda operates on symlinks. Thank you.

The strength of your feelings on this subject is not relevant, because it is technically unsupportable.

Even when we use hardlinks there are cases where we simply have to copy files. This is true for all executables on macOS and Linux and doubly true for symlinks.


Tagar commented Oct 18, 2018

> We think there should be an option to suppress realpath for this use case.
>
> Do you think if I was to interpose this glibc core C function things would continue to work ok? They would not.

realpath is just a function in glibc:

http://man7.org/linux/man-pages/man3/realpath.3.html

I understand how it works. I am not asking to change that function :-) What I was asking is whether conda could have an option not to call realpath, so that it would use symlinks correctly.

conda already partially supports symlinks - for example, always_softlink, as mentioned in #3373.
I don't understand why it can't work technically.

Thanks.

@mingwandroid (Contributor)

Please read what I wrote carefully. It is not conda calling realpath that is the problem; it is the thousands of packages we provide that can and will call realpath, and when they do, they will break.

@mingwandroid (Contributor)

Well, we used to allow symlinks for executables, but we had to stop doing that because glibc calls realpath on any executable immediately on startup, and the executable then thinks it is in $CONDA_ROOT/pkgs somewhere, despite not being there from the software's perspective. As soon as that happens, rpath/runpath $ORIGIN no longer works (it reports that the file is in the package cache) and shared libraries (which are loaded by virtue of being relative to $ORIGIN in RPATH) fail to load. Now, if the package cache were at the same depth as the envs folder this particular issue wouldn't matter, but trying to fix issues with symlinks is like playing whack-a-mole.

To explain that further: any of the hundreds of packages we build may call realpath on one of these symlinks, and when it does you're in the same position as with glibc, namely in the wrong directory - the conda package cache.
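
A minimal illustration of what goes wrong, using made-up paths (the package name and hash are hypothetical):

```
# the cloned env contains a symlink back into the package cache
ln -s /opt/conda/pkgs/python-3.7.0-h0371630_0/bin/python3.7 \
      /opt/conda/envs/clone/bin/python3.7

realpath /opt/conda/envs/clone/bin/python3.7
# -> /opt/conda/pkgs/python-3.7.0-h0371630_0/bin/python3.7
#
# once the executable has been resolved to the package cache, an RPATH of
# $ORIGIN/../lib points at /opt/conda/pkgs/python-3.7.0-h0371630_0/lib
# rather than /opt/conda/envs/clone/lib, so shared libraries that other
# packages installed into the env are never found
```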

Please stop insisting there's a way this can be made to work, there is not.


Tagar commented Oct 18, 2018

Thanks @mingwandroid - I understand now. You hadn't referenced "thousands of packages" in your previous response, so I was under the impression we were talking just about conda's own behavior/implementation. But I understand that now - thanks for clarifying.

I am surprised some packages would call realpath before they import a package. Can you please give an example of a single package where this happens? Thanks again.

@mingwandroid (Contributor)

> I am surprised some packages would call realpath before they import a package

Hmm, I'm not talking about just Python packages, and with respect to Python packages I am not talking about realpath before importing a package. I am talking about any use of the glibc realpath function by any software in our ecosystem, be that Python, C, C++, R, or Rust. As soon as that software uses that function (and very critical software such as GCC does this for each input file) and then does any file I/O relative to the result, bad things happen: files will not be found.


mingwandroid commented Oct 18, 2018

.. your real problem is that it is slow to create envs when copying is used and the source is NFS.

If you want to file a bug along those lines then we can think about ways to improve that. One idea would be to copy the tarball and extract it locally, though decompressing .tar.bz2 is very slow, so that may need to wait until we change to a compression format with much faster decompression speeds. Still, such an 'improvement' would be somewhat site-specific, i.e. slow end-user CPUs with very fast networks vs fast end-user CPUs with very slow networks.
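
For illustration, a rough sketch of that idea (the package filename and paths are hypothetical):

```
# copy the compressed package from the shared pkgs cache to local disk,
# then pay the (currently slow) bzip2 decompression cost on the worker itself
cp /nfs/conda/pkgs/numpy-1.15.2-py37_0.tar.bz2 /local/conda/pkgs/
mkdir -p /local/conda/pkgs/numpy-1.15.2-py37_0
tar xjf /local/conda/pkgs/numpy-1.15.2-py37_0.tar.bz2 \
    -C /local/conda/pkgs/numpy-1.15.2-py37_0
```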


Tagar commented Oct 18, 2018

Thanks a lot for the detailed explanation. Some of those details are obviously beyond my knowledge of conda - we're very new to the conda world, so please pardon our basic questions.

Thank you for providing the gcc example - I just ran ltrace to confirm this :-)

```
rdautkha@ gcc-test  $ ltrace gcc ab.cc 2>&1 |grep realpath
realpath(0x7ffee9fff710, 0x7ffee9ffe6f0, 0x7ffee9fff870, -1) = 0x7ffee9ffe6f0
realpath(0x7ffee9fff710, 0x7ffee9ffe6f0, 0x7ffee9fff870, -1) = 0x7ffee9ffe6f0
```

That's very interesting. I am not sure I understand why gcc (or other packages) would need to know the real path, since they could keep working with the symlinks and glibc would dereference the symlink each time, transparently to the application (or to any of those thousands of packages). But obviously I am missing something basic here.

> If you want to file a bug along those lines then we can think about ways to improve that. One idea would be to copy the tarball and extract it locally, though decompressing .tar.bz2 is very slow, so that may need to wait until we change to a compression format with much faster decompression speeds

I will think about whether I have other ideas. Decompressing locally is an option too, although it would be very slow, as you mentioned - a big overhead each time an application starts. Good points on CPU vs network; I agree. I also found conda-pack yesterday - https://github.com/conda/conda-pack - which seems like it might be somewhat helpful for some of our use cases, for example Spark on YARN - https://conda.github.io/conda-pack/spark.html - since YARN can distribute packages as part of application submission. Although, again, I will think about whether better, more generic options are possible here.
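
For reference, the workflow on that page looks roughly like this (the environment name and script are placeholders; see the linked doc for the exact invocation):

```
# pack the environment into a relocatable archive, then let YARN ship it
# to the executors as part of application submission
conda pack -n pyspark_env -o pyspark_env.tar.gz

PYSPARK_PYTHON=./environment/bin/python \
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --archives pyspark_env.tar.gz#environment \
  my_app.py
```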

Thanks again.

@mingwandroid (Contributor)

> Big overhead each time application starts

Is that not a big concern when symlinking to your NFS server? Each file access will go across the network. I really believe that copying is the best option for this kind of setup at present. Yes, it takes more space, but you should gain a good speed increase versus going over the network.


Tagar commented Oct 18, 2018

Nope. To start an application you need python, some shared libraries, and some basic python packages; those are not huge - versus the anaconda example I gave earlier, which is almost 3 GB of on-disk space. It would be a huge overhead to unpack those 3 GB. Most applications would use just a handful of packages from Anaconda, and after the first NFS access the OS will most likely cache those remote files. So I expect it would be much faster with the NFS approach I was suggesting earlier. Thank you.

@mingwandroid (Contributor)

> That would be a huge overhead to unpack those 3 GB for example

The Anaconda installers aren't geared towards such uses. They are more for students or people new to Python who want a big collection of packages pre-installed without having to spend time figuring out how to use conda (though they really should!).

It is a meta-package that consists of 200 actual packages. In my copy-the-package proposal, it is just the needed packages from those 200 that would be sent over the network, not all 200 of them.

I think you are probably better off using Miniconda than Anaconda - I mean, who's ever going to run grin now that we have ripgrep and ag packages, which are much faster and have far more features? Using Miniconda gives you the ability to install only what you need on your NFS server.
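
For illustration, a lean baseline on your NFS server could look something like this (the package list is just an example):

```
# start from Miniconda and install only what the applications actually need,
# instead of pulling in the whole anaconda metapackage
conda create -p /nfs/conda/envs/baseline python=3.6 numpy pandas
# users can then clone or extend that much smaller baseline
conda create -p /nfs/conda/envs/user_env --clone /nfs/conda/envs/baseline
```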


Tagar commented Oct 18, 2018

That's a totally valid point!
We actually wanted to have anaconda==5.3.0, for example, as a "baseline" so that other folks could "clone" the baseline environment if they need to override some versions or add something else.
We still have to switch our applications to using conda, and there are other options here, especially since most of our use cases revolve around distributed Python / PySpark applications.
One of the options I mentioned is the Spark on YARN approach - https://conda.github.io/conda-pack/spark.html.
Although, again, I agree that having a compact list of requirements per application might be ideal.
But we also have to think about how to satisfy our current requirements while most folks don't use conda and we treat that anaconda environment as the default root environment - the default python distro that all applications use.
Once we completely switch all applications to conda, what you're suggesting will definitely be brought up and discussed.


github-actions bot commented Sep 6, 2021

Hi there, thank you for your contribution to Conda!

This issue has been automatically locked since it has not had recent activity after it was closed.

Please open a new issue if needed.

github-actions bot added the "locked [bot] locked due to inactivity" label Sep 6, 2021
github-actions bot locked as resolved and limited conversation to collaborators Sep 6, 2021