CFEP 20 - package split #39
Thank you @SylvainCorlay!
> This CFEP proposes a policy on how to split packages into multiple outputs for:
>
> - the main runtime requirements (shared objects, executables)
> - the developer requirements (headers, static libraries, pkg-config and cmake artifacts)
Could static libraries be separated from this? Given that conda is primarily based around using shared libraries, it would be nice to have the static libraries split into a separate package from the other artifacts.
Yeah that would align with CFEP 18 where static libraries need to be dropped or at least split out into separate packages.
I don't think we need to split static libraries into a package other than `-dev`, which is typically not a runtime requirement.
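For concreteness, a minimal sketch of what such a split could look like in a multi-output recipe (the `foobar` name and file lists are purely illustrative, not a prescribed layout):

```yaml
# Sketch of a hypothetical recipe -- names and paths are illustrative.
outputs:
  - name: foobar            # runtime: shared objects and executables
    files:
      - lib/libfoobar.so*
      - bin/foobar
  - name: foobar-dev        # headers, static libs, pkg-config/cmake files
    files:
      - include/foobar/
      - lib/libfoobar.a
      - lib/pkgconfig/foobar.pc
      - lib/cmake/foobar/
    requirements:
      run:
        - {{ pin_subpackage('foobar', exact=True) }}
```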
> - However, `foobar-doc` does not require `foobar`.
> - Also, `foobar-dev` will typically depend on `foobar` (if it exists), pinned to the same version.
>
> Note: while we may expect use cases where, e.g. for statically linking with libfoobar, we only need foobar-dev, it is probably a good pattern to have foobar-dev always depend on foobar so as not to break expectations.
This is another reason to split the static library into a separate package from the headers.
I don't understand this argument. The static libraries would be packaged in the `-dev` package, which would never be a dependency of non-dev packages.
How would we make sure that static packages are never used if we have `-dev` packages in host when building? If the static packages are also brought in by `-dev`, we have no idea whether the static package was used or not.
If we are going to change how static libraries are handled, we need to update CFEP 18 as well since that already prescribes how static libraries should be handled.
> If the static packages are also brought in by -dev we have no idea whether the static package was used or not.

We don't know just by looking at the recipe `meta.yaml`. The recipe author should hopefully know which libraries they are linking against.
At the moment, we know because `-static` libraries are separate packages, and if they are not in the `host` requirements (directly or indirectly), they are not used. By adding them to `-dev`, they will always be in `host` indirectly and we will not know. Why change the current situation?
> Why change the current situation?

The rationale for changing is that static libraries and headers have roughly the same implications with respect to binary compatibility, and we will end up applying the same pattern for both with respect to exports (`run_exports`, and eventually `run_constraints` exports). Also, it does not make sense to have the static libraries without the headers.
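As a sketch of that export pattern (note: `run_exports` exists in conda-build today, while the `run_constraints` export shown for `-dev` is a hypothetical future mechanism; `foobar` is illustrative):

```yaml
outputs:
  - name: foobar
    build:
      run_exports:
        # packages linking the shared library get an ABI-compatible pin
        - {{ pin_subpackage('foobar', max_pin='x.x') }}
  - name: foobar-dev
    build:
      # hypothetical future key: constrain, rather than require, packages
      # that were built against these headers/static libraries
      run_constraints_exports:
        - {{ pin_subpackage('foobar', max_pin='x.x') }}
```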
I'm not talking about binary compatibility. What I mean is that having the static libraries in `host` might make a downstream package link to the static library instead of the shared library, which we do not want. We always want shared libraries to be used.
> Also, it does not make sense to have the static libraries without the headers.

It makes sense to have headers without static libraries. A `-static` output can have a dependency on `-dev`.
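A sketch of that layout (hypothetical `foobar`; file lists illustrative):

```yaml
outputs:
  - name: foobar-dev        # headers, cmake/pkg-config files, no static libs
    files:
      - include/foobar/
  - name: foobar-static     # static library; pulls the headers in with it
    files:
      - lib/libfoobar.a
    requirements:
      run:
        - {{ pin_subpackage('foobar-dev', exact=True) }}
```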
> What I mean is that having the static libraries in host might make a downstream link to the static library instead of the shared library which we do not want. We always want shared libraries to be used.
You fear that static libraries may be linked against by accident?
It'd be good to have some text on how we plan to implement this. Do we need migrations for feedstocks and their dependencies? What if people don't want to go along with it? How strictly are we enforcing this policy?
> ## Abstract
>
> This CFEP proposes a policy on how to split packages in multiple outputs for
We may want to specify exactly which kinds of packages we are referring to here.
I think this should be for all packages.
So we won't ship headers with numpy in your view?
I responded to that in the thread below:
> numpy headers are not installed in a header directory (they are package data). All package data should probably be installed alongside the runtime package.
Yup! I missed that when I sent this message or got my own wires crossed. Your answer makes sense. Thank you!
> ## Motivation
>
> At the moment, there is not a common pattern to split packages into multiple outputs including C/C++ headers, JS source maps, or artifacts of build systems that are not required at runtime but are necessary at build time. Packages have been split in various ways, but adopting a convention will allow us to create tooling, idioms, and best practices on how to deal with such split packages.
Again, text specifying which packages this applies to would be good. Do we want a `numpy-dev` for its headers? Are we going to rename `pybind11` to `pybind11-dev`?
> Again text specifically on which packages this applies to would be good.

Yes, probably for all.

> numpy

numpy's headers are not in the standard include directory, but in package data. Also, they are "needed" at runtime since there is a runtime function returning them, so I don't think there is a need for numpy-dev.

> Are we going to rename pybind11 to pybind11-dev

Yes, for the most part (though pybind11 is not quite header-only), so there should probably be both a pybind11 (with the Python part and the package-data duplication of the headers) and a pybind11-dev with the headers and the cmake files.
A few off-the-cuff questions for this:
Thanks for writing this @SylvainCorlay, and before I get into the details, I do welcome any attempts to standardize how packages are split! I am pretty against the way Debian splits packages, because it makes it so that users never really get the fully usable package (headers, documentation, etc.) that they are often looking for when they install a package. In my view, conda-forge should be "batteries included." People who need to strip out capabilities for performance reasons, such as building Dockerfiles, should be able to do that themselves. Such people usually have more experience than the standard user. Perhaps I missed it, but for this to maintain the current "batteries included" model, I think a couple of things need to be added.
Yep, we could have an

I have been turning this problem around for a while, as I was dealing with subtle issues of binary compatibility for several packages built with different versions of header-only libraries, and the Debian approach seems to handle this really well, especially if combined with run_exports and (future) run_constraints_exports for runtime and dev packages.
I think a huge number of them (most compiled things). As discussed in the gitter, we thought a CFEP would be a good place to start this conversation - even if we define it as an objective to tend to.
I think the main obstacle is the way recipe outputs currently work, which makes them cumbersome to use. In the specs discussion, we were actually looking into always having a list of outputs - even if there is only one, to make things more uniform. Other toolings may be
I think these two questions are related. Providing guidance for typical scenarios may be a way to go.
Thank you for the comments @SylvainCorlay!
I agree that the current outputs are not optimal for our work and I fully agree with the effort for new specs. However, whatever we put in this CFEP should be based on the assumption that it will be implemented with conda + conda-build + conda-forge as it stands now. If we are not working under this assumption, then we risk making policy decisions that are either impossible to implement or have no bearing since they are conditioned on tools existing that don't. We are simply not currently in a place to widely adopt any new recipe standards or recipe build tooling (e.g. boa) to make a CFEP happen.
Many of the ideas here will require some service to unpack the built recipes, inspect them, and report back to the user. For example, we could have the CI jobs run an inspection script and then send the results to the admin server so it can post them with a comment. There are some security and abuse things to think about, plus the maintenance burden of keeping such a system running.
This implies large migrations of recipes to use the new names and splits, which will be a ton of work for both maintainers and the core team. I am not sure if this is worth it, especially as we may create backwards compatibility issues.
Just to get this raised ... I very much believe the opposite. I certainly don't have any actual data to support this, but I would bet that most of our users do not care about headers or static libraries, and if they care about conda-forge providing packaged documentation I personally haven't seen any evidence to that effect. (Namely, I've never seen any requests of "please add the documentation to this package".) Building off of that belief, I think that a Debian-style dev split would be valuable, although it would certainly be a pain to implement. Specifically, I think there would be real value if we could distribute generally smaller packages containing fewer files. A packaging system written in Python is never going to be "lightweight", but I think that it's a good goal to strive for. I certainly believe that conda-forge should be "batteries included" in a certain sense, but I think that right now we're shipping around a lot of files that most of our users don't use. And I also think that the users who do use them are going to be quite familiar with the "dev package" concept and won't be thrown off by it.
If this point is referring to a general dev split, I strongly disagree. Well, definitely people with special needs should be able to do what they want, but the package maintainers are the people best positioned to know which files are needed for fine-grained subsets of functionality — and they are definitely the ones best able to maintain this information as an upstream package evolves over time. This is especially true for compiler-type tools and cross-compilation scenarios. Overall I've tried to emphasize that I've got a lot of "think" and "believe" here — I'm happy to look into gathering any relevant data that might be helpful.
Yeah, I guess one of the motivations for me in helping start conda-forge was that I was seeing so many users of Debian distros who were trying to transition to being developers getting continually tripped up by the package splitting. This is my anecdotal experience as well. But it has led me to the belief that "batteries included" is a more equitable way to distribute packages. In this light, fully functional and inclusive default packages are central to my notion of what conda-forge is and should be going forward.
Agree with @scopatz on this. One example to think about:
Absolutely. I was just pointing out that better support for multiple outputs will simplify things a lot. Note on the
Yeah, I think that this is a strong argument to not do this for Python.
More generally, a lot of developers are getting tripped up by packaging :) A serious issue in the Debian world is the tooling for creating packages.
Right, so the conda-forge model works better than the Debian model 😉
Not for everything though...
A first step would be to provide conventions for when we split things, rather than having package authors make heterogeneous choices when doing a split. Then, when things are split, we may decide to have the split packages all be suffixed (-run, -dev, -src, -dbg, -doc), and have the non-suffixed one point to one of them, or require several of them.
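For example (a sketch with a hypothetical `foobar`; not a prescribed scheme), the non-suffixed name could become a metapackage that requires the split outputs, preserving the batteries-included default:

```yaml
outputs:
  - name: foobar-run
  - name: foobar-dev
  - name: foobar-doc
  - name: foobar            # non-suffixed, batteries-included metapackage
    requirements:
      run:
        - {{ pin_subpackage('foobar-run', exact=True) }}
        - {{ pin_subpackage('foobar-dev', exact=True) }}
        - {{ pin_subpackage('foobar-doc', exact=True) }}
```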
That will work in cases where the conventions we decide on are not overlapping with existing outputs. However, what is the point of all of the work of repackaging things if we are not going to at least try and get the recipes to use the new split packages? We don't gain much until that happens, and that task is going to be a lot of work.
I fully agree that this is the kind of thing that we ought to be mindful of. In my ideal world there would be a way to split the python headers into a separate package but not introduce a usability papercut ... but that can get tricky. (You can add features to the package manager to try to guide things, like "package python recommends package python-headers", but I think that added complexity is a big usability hit.)
Since Debian was mentioned here, it would be great if equal consideration is given to the packaging conventions of other popular Linux distributions such as Fedora (rpm-based) and Arch Linux (pacman-based). These too have a very large user base, and they try to accommodate a large number of bleeding-edge packages as well. Similar to the conda-forge community, they have also spent considerable effort in splitting packages and providing meta packages for backwards compat. Although I am not a great fan of their naming conventions, I do like the way they follow strict rules for logically splitting portions of a package, and have strict guidelines for different families of packages (rust, python, perl, php ...). Also, while coming up with the implementation details for this CFEP, please consider the various scenarios which can easily lead to clobbering of files (for example, the DSOs in libxml=2.0 and libxml-lib=2.1 will easily step over each other). This might require heavy repodata patching or, depending on the situation, including the major version number in the package name for libraries only.
> ## Motivation
>
> At the moment, there is not a common pattern to split packages into multiple outputs including C/C++ headers, JS source maps, or artifacts of build systems that are not required at runtime but are necessary at build time. Packages have been split in various ways, but adopting a convention will allow us to create tooling, idioms, and best practices on how to deal with such split packages.
Would it be possible to provide a more detailed motivation section?
What are the benefits that we would reap from the proliferation of split packages?
Could we obtain these benefits without split packages?
> Would it be possible to provide a more detailed motivation section?
Yes @CJ-Wright, I will expand a bit on the motivation section today.
Looking over those guides is definitely making me think twice about this CFEP. One of the really nice things about conda-forge is how (relatively) easy it is to get involved, even for compiled packages. I worry that if we start adding a lot of rules about outputs and where they go, we are going to effectively stop contributions. Even if you are used to building and linking code, the kinds of package splits we are talking about here are pretty non-trivial. People would need to understand pages of documentation and the quirks of conda-build to get it right. The cost we pay for enabling people to easily distribute their compiled code is that our packages tend to be a bit bloated. This is by design and a conscious choice. We should think hard before we go back on it.
I know this is a very long shot, but I think it is an important mental exercise. First, some of the takeaways may be low-hanging fruit that we could achieve without tackling the more controversial parts.
Other discussions about
💯
Right. IMHO the patterns are good, but forcing people to do this in staged-recipes is bad. I think if we really want something like this to happen we'll need dedicated effort from core, much better tooling as mentioned above, or maybe a new "standards" subteam whose job is basically to go around fixing things. I'm not excited for us to adopt policies around package splits that are then routinely ignored or not used. This dilutes the overall process.
We split off `-static` to save space and to avoid static libs getting found when we didn't want them. This worked well and was basically zero-cost (other than recipe changes, which are not insignificant). I'd like conda-build to actually handle
Sounds good.
Are filenames and GNU `file` (and maybe LIEF) going to be enough here to do the splits we'd like? We could for sure come up with a split that satisfies most needs, I think. The canonical name would, for backwards-compat reasons, by necessity include all the sub-packages, but we could slowly migrate packages to using only the bits they need. Then our users who, e.g.

Having said that, I do also like the status quo a bit (but my personal want for
^^^ Motivated by the PR linked above, which stems from the recent
I understand this discussion is mostly about non-python packages, but for Python packages it might also be useful to have a separate
It is in scope here, @jorisvandenbossche. However, some python packages ship tests in the python package itself, and so extracting the test files is going to be difficult.
For reference, I split out the tests from
Right, @xhochy. If you have one test directory, sure. In practice, some projects embed their tests at pretty deep levels in their module hierarchy. This is more of a pain than it is worth IMHO.
Sure, the packages that want this might need to make some changes at the package level to make this possible or easier. But at least if it is possible and there are clear rules about how to do and name this on the conda side, that would be good, I think.
Filenames may be used at least to warn the recipe author that they may have, e.g., debug symbols in a non-dbg package.
I've come back to this discussion after a conversation with @ericdill around python tests and user expectations. One of the thorny issues that stuck out was the discrepancy between what authors intended to ship and what we provide. These discrepancies are not necessarily bad and can be required for conda-forge to operate properly, but they do interpose us between the software author and consumer. This CFEP could greatly increase this and may cause us to no longer satisfy user expectations that are created by the source code author.

One potential outcome to consider is that we have two (potentially) different abstract consumers of packages. On the one hand we have users who are installing pkgs into their environments. I think we service this reasonably well currently,

Overall, I agree with @scopatz that it would be better to keep our current names for the complete packages. Maintainers are usually better plugged in to policy changes than users and so would be in a better place to adapt. For example, having

I think it would be good to have numbers for some of the earlier points around the number of packages that would need to be converted. I'd also be curious how many files are expected to be cut from conda-forge. The artifacts repo may provide enough information to get started. It doesn't provide the sizes for all the files in each tarball, but we may be able to estimate (or extract them and put them into the metadata).
merging as deferred |
I believe a splitting convention like
As promised, this is a start for a CFEP on splitting packages in run, dev, dbg, and doc outputs.

Note: maybe we will also be able to include a `-src` output for a "source" package.