Packages from repo.continuum.io have the same name and version, but different MD5 checksums #4956
And this is concerning from a security standpoint... |
I am not sure exactly what caused this, but will raise it internally. |
CC @mcg1969 |
Isn't this related to our metadata hotfix discussion? We have two options: 1) fix the external copy of the metadata and leave the file the same, or 2) rebuild the package, changing the MD5. Bumping the build number is not a third option: when metadata is broken, it must be fixed. Until recently we generally favored option 1, but we've since moved to option 2 because option 1 breaks things for people who rebuild their external repodata indices directly from the packages. It seems to me, however, that assuming identical filenames means identical MD5s is problematic. We don't make that assumption within conda (or when we do, it's a bug). |
This probably belongs at github.com/ContinuumIO/anaconda-issues, but we can keep it here for now. |
No, it belongs here, because the question of how to handle broken metadata (especially incorrect dependencies) is something that goes above and beyond anaconda. |
There is an option 3, actually: 3) remove the broken build from the repo altogether. But that is likely to confound caches as well. |
I think I'm misunderstanding something. Why is rebuilding the package with a bumped build number not an option here? |
There's a lot of history here. It starts with a hard requirement: the metadata for a package needs to evolve over time, most prominently because upper bounds need to get added to a package's dependency version constraints over time. That information is unknowable at package build time. What is the source of truth for package metadata? Is it repodata.json or the metadata within the package itself? We've at this point settled on the source of truth being the metadata contained within the package itself. There are multiple reasons that's preferable; one is that it lets repodata indices be rebuilt directly from the packages themselves. Ilan has thus, over the last couple of months, gone through repo.continuum.io and updated metadata within packages. At some point, the ultimate solution to all of these problems will probably be a rider file alongside the package tarball that contains information that can't be included in the tarball itself (md5, signature, etc.) along with overrides to metadata content. There's probably a good deal of planning and work to build that facility out properly, though. |
This is why I cc'd you @mcg1969. Thought you'd give a better explanation than I would. |
We have been through this many times internally, I'm afraid, which suggests that it needs to be documented. The problem is that if you leave the old package in place and you don't change its metadata, then it's possible for the solver to keep selecting the broken build. For instance, suppose a package's dependency constraints turn out to be too loose: a newer version of one of its dependencies breaks it, yet the old build still claims compatibility. So what we'd like to do is modify the metadata for the broken build in place. On the other hand, if we refuse to update existing builds, then those builds of |
To those that +1'd the comment about security concerns, I'd like to know specifically what those concerns would be. I'm not denying that it could be a security concern, mind you. But I think it would likely be better for us to find alternate ways to mitigate those concerns than to allow users to break their conda environments because we refuse to fix known metadata errors. So if we have some specific scenarios where people feel this is an issue, we might be able to address them. For instance, what if we included a history of MD5 checksums in repodata.json, so that users could be sure that a given MD5 is among those officially generated by the package provider? |
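To make that last suggestion concrete, here is a minimal sketch (my illustration only; `md5_history` is a hypothetical field, not something repodata.json currently carries) of how a client could check a downloaded file against the current checksum plus a published history of checksums:

```python
import hashlib
import json


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file on disk."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_officially_published(pkg_filename, pkg_path, repodata_path):
    """Return True if the local file's MD5 matches the channel's current
    checksum or any value in the (hypothetical) md5_history list."""
    with open(repodata_path) as fh:
        repodata = json.load(fh)
    entry = repodata["packages"][pkg_filename]
    # "md5" exists in real repodata.json; "md5_history" is the proposed addition.
    accepted = {entry["md5"], *entry.get("md5_history", [])}
    return md5_of(pkg_path) in accepted
```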
@kalefranz weren't you planning to include more robust checksum/verification capability within conda? If we do so, it would seem important to find a way to exclude metadata from that checksum. |
Speaking of security, another candidate for metadata hotfixes is security updates to critical dependencies. If a new version of OpenSSL comes out, we'd like to be able to make sure that all new installations of packages that depend on it pull only the latest version. |
I'm a bit confused by this statement. My understanding is that the purpose of the package build number is to be able to supersede old packages while keeping the version number constant. If build numbers did not exist, then I would agree with you. We've been using this strategy quite successfully in conda-forge to release new builds for the same version number. Having package checksums change while leaving all else equal (version, build number) seems seriously problematic. First, I don't think modifying the metadata in existing packages is a good idea because many users cache the packages from repo.continuum.io behind their firewall. So it's either
Is there a reason why these aren't valid solutions? |
+1. I think it makes sense that any changes to the conda recipe (including just fixing a dependency, or even just updating metadata like the package description) constitute a new build of the conda package. |
Nothing prevents this in general, so why is the metadata being changed for specific dependencies? This attempts to relieve a package maintenance burden, but in fact makes it more difficult to maintain infrastructure based on conda. It is the responsibility of the person depending on a package to constrain the versions of their dependencies.
If it breaks, then it is the dependent's problem to constrain their dependencies. Currently, an opaque choice is being made for users, rather than forcing people to be conscious of what they are depending on.
Again, if their environment breaks, it is their responsibility to constrain versions of their dependencies.
Correct me if I'm wrong, but this isn't true if you bump the build number. A user will get an error because they are trying to install the latest version of the package. I still don't understand why this is being special-cased for certain packages whose constraints Continuum happens to have knowledge of, while for other packages there's absolutely nothing to be done about it. If the metadata is going to be part of the package, then changes to it must be reflected in the version in some way, because the MD5 sum reflects the version of the package including metadata. Ideally, metadata would be separate from code, but the current system doesn't work like that. With that in mind, bumping the build number seems like the best compromise here. |
@kalefranz, I simply don't have the time to keep re-hashing this argument. I have great respect for all of the minds on this thread, but there is nothing being questioned here that we haven't already thought of. When I get a bigger chunk of time, I will go through this again, here, but at that point it needs to somehow be encapsulated in a FAQ or other documentation. |
Though @kalefranz, to be clear, I am not wedded to the idea of updating the packages and breaking MD5s---just to the notion that metadata hotfixes are necessary for existing packages. |
Would you mind addressing the points we've raised about the build number? By creating new packages with a higher build number, "metadata hotfixes" to existing tarballs are not necessary. |
This is not the case. As I said above I will indeed talk about it, but I just don't have the time right this moment. |
OK, folks, sorry for the delay. Let's whittle down a simple example that will hopefully illustrate why it is problematic for packages to remain in the repository with broken metadata. Consider packages B and C, where one build of C (C 1.0.1-1) ships with broken dependency metadata; the developer of C later issues a corrected build.
Here are a number of ways that this package combination breaks. In each of these scenarios, a broken environment results. First, three fresh installs:
Scenario 4 bears explaining: in this case, conda has a choice between downgrading C one click and downgrading B two clicks. It's going to prefer the one-click downgrade, so it will select the older, broken build of C. And now, consider a properly functioning environment containing B and C:
In all cases, we're left with a broken environment instead of the result we actually wanted. It's important to emphasize that in cases 4-6, the environment was not broken before. Sure, the metadata was wrong, in hindsight, but users do not care about metadata when their environment is up and running; metadata matters only during installations, removals, and updates. And of course, we know this to be the case because the same environment was working fine with the very same packages installed. And frankly, in our experience, users don't find it satisfactory when we explain that broken metadata is the cause. They had a working environment, they ran a routine install or update, and now it's broken. It's also important to note that these breakages occurred even though the package developer for C was diligent about correcting his metadata and issuing a new build. It wasn't enough. Leaving the existing package in place ensures a variety of scenarios where the broken build will still be selected. Again, to me there are two different issues here: 1) whether metadata hotfixes for existing packages are necessary at all, and 2) whether those hotfixes should be applied by modifying the package files themselves, changing their MD5s.
I do not feel strongly about 2. In fact, I've argued against it in other forums. In those same discussions, however, I've argued forcefully in favor of 1, and have even influenced conda-forge policy on the matter. Continuum deals with these kinds of reports quite often. Some are self-inflicted problems with packages we have built and served, and some involve packages from other channels, including conda-forge. I am fully confident that more people are well-served by hotfixing metadata than are poorly served by its other consequences. |
For cases 1-3, why is conda considering earlier builds of the packages to be equivalent to the latest builds? Perhaps it should consider only the latest builds (or at least, latest build metadata) when doing the dependency resolution? A package with a larger build number should be considered a drop-in replacement for any earlier build with the same version number. Alternatively, as suggested above, if there is a huge problem with the metadata of C 1.0.1-1, it could be forcibly pulled from the repo or marked deprecated or superseded in some way (in my mind, the existence of a later build is essentially marking earlier builds as deprecated). In cases 4-6, how does Conda currently know to get the new metadata from the hotfixed C 1.0.1-1? The conda package C 1.0.1-1 I have on my system is not the same package that is now on the server (which is the confusing part we are arguing against). Could we apply the exact same logic of updating the metadata, but with the understanding that newer builds of the package have metadata that supersedes older builds? |
We've considered that, and we can re-consider it. However, there are situations where we have to pull a specific build, such as metapackages with pinned build numbers or production environments with pinned packages. That makes it difficult for us to determine whether or not conda needs to consider the older builds.
Because metadata during the install/update process is pulled from repodata.json. @kalefranz, correct me if I'm wrong here, but I believe the reason it was decided to insert the fixed metadata back into the packages themselves (thereby changing their MD5s) was to handle situations where the repodata index is regenerated directly from the packages. Allowing the latest build's metadata to supersede the bundled metadata for older builds---a sort of "live" metadata hotfixing approach---is something we could certainly consider. |
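For readers following along, the two copies of metadata live in different places: each tarball bundles its own `info/index.json`, while the channel serves `repodata.json`; a hotfix applied inside the package brings the two back into agreement. A rough sketch, assuming a locally downloaded `.tar.bz2` and a previously fetched `repodata.json`, of how one might spot a divergence between them:

```python
import json
import tarfile


def package_metadata(tarball_path):
    """Read the metadata bundled inside a conda package (info/index.json)."""
    with tarfile.open(tarball_path, "r:bz2") as tar:
        member = tar.extractfile("info/index.json")
        return json.load(member)


def diverged_depends(tarball_path, pkg_filename, repodata_path):
    """Compare the dependency list inside the tarball with the one the
    channel currently advertises in repodata.json for the same filename."""
    inside = set(package_metadata(tarball_path).get("depends", []))
    with open(repodata_path) as fh:
        advertised = set(json.load(fh)["packages"][pkg_filename].get("depends", []))
    return advertised ^ inside  # an empty set means the two copies agree
```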
Doesn't this essentially break, then, if you're changing builds under people's noses? Or do we need another level of pinning a package, perhaps by hash? It seems we can't have it both ways - either the package file is changing, which means I can't trust that I have the most recent information even if I have pinned a specific version and build, and I always have to fetch it (and re-install it) just in case it updated, or the package file never changes, so I can trust that I only need to consider updating if there is a new build advertised. |
And heck, if you're really wanting to be careful, you have to pin the channel, too. |
So two proposals seem to be:

(a) A package file in a channel is identified by (in order of specificity) the (major, minor, patch) version (referring to upstream information) and the (build string, build number, file hash) (referring to the conda packaging process). Any update to the hash is automatically picked up (i.e., newer hash metadata overrides older hash metadata automatically). How to compare hashes is unresolved. One proposal is to keep the set of previous hashes in the package metadata, so if you have two iterations, one file's hash will appear in the other package's metadata.

(b) A package file in a channel is identified by (in order of specificity) the (major, minor, patch) version (referring to upstream information) and the (build string, build number) (referring to the conda packaging process). The build number is an integer, so it can be naturally compared. Newer build metadata automatically overrides older build metadata.

So basically, hotfixing a build essentially moves the pinning process down one level. Arguably, if you're really trying to guarantee a specific package file, you should be pinning to a hash, because that is what can actually be checked. |
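To spell out the difference between the two proposals, here is a small sketch (the `previous_md5s` field and the dict layout are assumptions for illustration) of how a client might decide whether a remote entry supersedes a local one under (a) hash-history comparison versus (b) build-number comparison:

```python
def supersedes_by_hash(local, remote):
    """Proposal (a): same (name, version, build string, build number); the
    remote entry supersedes the local one if the local hash appears in the
    remote entry's (hypothetical) list of previous hashes."""
    same_identity = all(local[k] == remote[k]
                        for k in ("name", "version", "build", "build_number"))
    return same_identity and local["md5"] in remote.get("previous_md5s", [])


def supersedes_by_build(local, remote):
    """Proposal (b): same (name, version); the remote entry supersedes the
    local one if it carries a strictly higher build number."""
    same_identity = all(local[k] == remote[k] for k in ("name", "version"))
    return same_identity and remote["build_number"] > local["build_number"]
```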
To anyone still listening: I believe I have a solution to the hotfixing dilemma. I call it "virtual hotfixing via build groups", and I've detailed the approach, and what it requires, in the Gist linked below. Comments appreciated: https://gist.github.com/mcg1969/38589eeefb046c417720f1027f97085b |
@mcg1969 thanks for posting the proposal. I'm not sure I understand 100%, but it seems like the virtual-hotfix would apply to most existing conda packages as they are built now by default, even ones that aren't broken. If true, this would be a problem for a lot of the packages I deal with (pyzmq, petsc, mpi4py, fenics, others). I think the 'build siblings' notion makes sense, but I also think it's important that it be defined in a way that ensures it does not result in creating any sibling relationships between builds already on anaconda.org. There are plenty of packages correctly pinning their dependencies and updating build-time pinnings with new build numbers. For example, Since this is about solving an exceptional circumstance (hotfixes should only apply to known-bad builds), it makes more sense to me for this sort of behavior to be opt-in (require explicit sibling declaration), rather than opt-out (all builds are siblings by default). A similar proposal that makes this hotfixing explicit would be to have a special |
@minrk: @jjhelmus and I had a good internal discussion about my proposal above, and in hindsight we should have summarized our findings there or here. In short, we concluded that 1) it would cause some problems to automatically construct build groups according to the build string convention; and 2) changes coming to That said, it is my view that
I understand that this is difficult to accomplish without some serious Jinja wizardry, but I'm genuinely concerned that there are material consequences to the solver caused by this pinning practice. |
@mcg1969 your comment above explaining why hotfixes are necessary is very helpful. I appreciate the comments following on from that by @jasongrout and @minrk. I understand there are further conda-team-internal discussions that are happening as well.

Summary

Below I introduce terms for 5 principles: Containment, Invariance, Equivalence, Deprecation, Precedence. Using these I propose that packages, once published, are never changed, allowing MD5 sums to stay the same for all time, but that a published package can, optionally, be superseded by a new package with a higher build number and meta-data specifying that it supersedes and deprecates the previous package. Deprecated packages are labelled as such in the repository and therefore will effectively cease to be available. Mirroring mechanisms should pick up these changes. Local package caches and conda environments containing the deprecated package will warn the user when they realize they contain a deprecated version that has been superseded by a functionally equivalent package.

Details

My Personal View
Principles

This is a tricky set of conditions to satisfy, but let me try to provide some terminology for principles that can be used to build a proposal:
Proposal

@mcg1969 has outlined a number of scenarios above that are not edge cases but very accurately represent present reality. They come down to: "The package base content is just fine, but the package meta-data is now invalid due to some new conditions which have arisen, and as such can result in environment creation or package updates/installs that will be invalid." Here is a proposal for what I think should happen when this event occurs:
Implications
Concerns
|
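To make the deprecation/supersession idea above concrete, here is a hypothetical sketch (the `supersedes` and `deprecated` fields are assumptions, not existing conda metadata) of how a solver front end could drop deprecated entries from consideration:

```python
def usable_entries(repodata_packages):
    """Filter a repodata 'packages' mapping down to the entries a solver
    should consider, honoring hypothetical 'deprecated'/'supersedes' fields."""
    deprecated = set()
    for filename, entry in repodata_packages.items():
        # A newer build may authoritatively deprecate specific older files.
        deprecated.update(entry.get("supersedes", []))
    return {
        filename: entry
        for filename, entry in repodata_packages.items()
        if filename not in deprecated and not entry.get("deprecated", False)
    }
```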
@bkreider I should have tagged you on this, but you've probably picked it up from the internal conversations and references to this issue. |
Thanks @ijstokes it is a good proposal and I believe it is reasonable.
Does that have a chance to speed up the solver? It is – subjectively – starting to feel slow to me.
This has the drawback of potentially being forgotten. If the new package build indicates that it deprecates an older one, it would be great to have a way to enforce that the now-deprecated one actually gets moved.
I would start with something completely restrictive – only allow deprecating the exact same package except for the build number – and potentially make this more open with time. It is easier to remove restrictions than to add them. Side question – which may be a further refinement of this proposal – as the new package will, in most cases, only differ from the old one in the metadata, is there a plan to also distribute it as a patch on top of the old one? That would help decrease bandwidth and storage usage. Thanks for taking the time to write your extensive description. |
Perhaps. @mcg1969 would be able to give a better assessment of that. My sense is that this would not be an especially major performance improvement in the near term, but it could pave the way to reduce the package search space. @mcg1969 is already pretty clever about that. He knows a thing or two about optimization and solvers.
I haven't thought this completely through, but I believe a mechanism that allowed one package to authoritatively declare another package to be deprecated may mean it is not strictly necessary to label or move deprecated packages into a separate channel (or virtual-channel-by-way-of-label). So the deprecated package could just sit around and would be ignored by the solver. Maybe.
Yes, I agree.
I can't imagine this is going to happen in the near term, unfortunately. To do this we'd need to build git-like capabilities into conda packages so only the delta could be transferred, with the "new package" then reified on the client end so MD5 sums could be verified. And this process would need to work from any arbitrary past package to the current package (IOW, full git-like delta reconstruction of package state). Once we start using conda packages for software updates to Mars rovers or Voyager deep space explorers, perhaps NASA will pay us to implement (or port from git) this functionality. In fact, doing this would probably end up being a reliable mechanism to do client-side package builds that produce identical results to the original server-side build (for MD5 sum verification purposes). As I'm sure you know, there are a million challenges to making that viable.
Thanks to everyone for putting so much time and thought into this. Curiously enough, a variation on this topic was the core theoretical contribution of my PhD dissertation, but it has been over a decade since I gave this kind of problem serious thought. |
Sorry I'm late to the party, but I hit this as reported on the anaconda google group. Looks like 85 numpy packages are bad as of the middle of last week. If hot fixes will continue, then please also release new builds with an incremented build number. |
If you are interested, some code to:
And, from the thread referenced above, the mismatches I found last week (so this may have changed). |
😳 Turns out there may not be 85 bad packages... sometimes one of the headers (the ETag) returned by curl has the same value as the md5, sometimes it doesn't... and I was trying to flag bad packages, without actually downloading them, by using the ETag. |
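Since the ETag turned out not to be a reliable stand-in for the MD5, a heavier but more direct check is to hash locally cached tarballs and compare them against the `md5` values in the channel's `repodata.json`. A rough sketch, with hypothetical paths:

```python
import hashlib
import json
from pathlib import Path


def find_md5_mismatches(pkgs_dir, repodata_path):
    """Yield (filename, local_md5, advertised_md5) for cached tarballs whose
    checksum no longer matches what the channel advertises."""
    with open(repodata_path) as fh:
        advertised = json.load(fh)["packages"]
    for tarball in Path(pkgs_dir).glob("*.tar.bz2"):
        entry = advertised.get(tarball.name)
        if entry is None:
            continue  # not listed in this channel's index
        local_md5 = hashlib.md5(tarball.read_bytes()).hexdigest()
        if local_md5 != entry["md5"]:
            yield tarball.name, local_md5, entry["md5"]


# Example usage (hypothetical locations):
# for name, local, remote in find_md5_mismatches("/opt/conda/pkgs", "repodata.json"):
#     print(name, "cached", local, "!= advertised", remote)
```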
@ijstokes thanks, I think that using 'labels' to deprecate builds sounds like a pretty good solution. If I understand correctly, deprecation in this proposal amounts to applying a standardized 'deprecated' label to packages, and can be done at any time, yes? So the solution can be retroactively applied to existing packages without even uploading a new build. In that case, exactly the mechanism of deprecation on upload could be a somewhat separate question of defaults / conventions, etc. If that's the case, should deprecation be omitted from metadata and instead be only part of the upload command? e.g.
which would apply the 'deprecated' label. And perhaps a dedicated command for doing deprecations, though maybe not, if there's already an API for applying labels. Does this exist already? I see |
Is metadata being edited in packages on anaconda.org? Started getting some MD5 checksum mismatches in Appveyor and Travis CI today |
@wesm: yes, that is exactly what this thread is about. Packages are "hotfixed" to change their content, even with identical version and build numbers; only the meta-data is changed. This is done to better constrain the dependencies. E.g. imagine a conda package that said it worked with pandas >= 0.11.3, but then it turns out 0.14.0 breaks it -- that conda package works perfectly right up until the day pandas 0.14.0 is released and available, at which point new package plans will pull in the latest version of the dependencies, the package won't work, and the environment will be broken. This issue is trying to address how that situation should be resolved. |
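As a concrete illustration of the pandas example above (the package name and bounds are illustrative only, not an actual hotfix), the fix amounts to tightening the dependency constraint recorded in the package's metadata:

```python
def hotfix_depends(metadata):
    """Tighten an over-permissive pandas constraint in a package's metadata
    dict (as read from info/index.json or repodata.json). Illustrative only."""
    fixed = []
    for spec in metadata.get("depends", []):
        if spec.startswith("pandas"):
            # The original claimed compatibility with everything >=0.11.3;
            # after 0.14.0 turned out to break the package, add an upper bound.
            fixed.append("pandas >=0.11.3,<0.14.0")
        else:
            fixed.append(spec)
    metadata["depends"] = fixed
    return metadata
```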
I was referring to user packages -- are you editing metadata in artifacts that were uploaded by non-Anaconda Inc.-affiliated users with |
No we would never do that. If we found a dangerous package (containing a virus for example) I expect we'd quarantine it and get in touch with the user but modifying other people's packages? No way. |
Got it, thanks for clarifying. A few weeks ago we were getting MD5 checksum mismatches on some conda-forge packages from anaconda.org. In my case we had set the conda package cache to be stored using Travis CI's caching system. So it must have been some other kind of transient issue. |
Occasionally, I have packages--originally from repo.continuum.io--whose MD5 sum does not match the current package of the same name and version. I have not performed any operations on any of these packages.
Can the build number be bumped whenever packages undergo whatever transformation is making them yield a different MD5 sum (but doesn't require a version bump)?
This breaks many things that cache packages.