Packages from repo.continuum.io have the same name and version, but different MD5 checksums #4956

Closed
cpcloud opened this issue Mar 28, 2017 · 62 comments
@cpcloud commented Mar 28, 2017

Occasionally, I have packages--originally from repo.continuum.io--whose MD5 sum does not match the current package of the same name and version. I have not performed any operations on any of these packages.

Can the build number be bumped whenever packages undergo whatever transformation is making them yield a different MD5 sum (but doesn't require a version bump)?

This breaks many things that cache packages.
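A minimal sketch of the comparison being described, for anyone who wants to check their own cache -- the channel URL and cache path below are illustrative assumptions, not taken from an actual report:

```python
# Sketch: compare the MD5 of a locally cached conda package against the
# channel's current repodata.json. Channel URL and cache path are
# illustrative assumptions.
import hashlib
import json
import urllib.request

CHANNEL = "https://repo.continuum.io/pkgs/free/linux-64"
CACHED = "/opt/conda/pkgs/numpy-1.12.1-py36_0.tar.bz2"  # hypothetical path

with urllib.request.urlopen(f"{CHANNEL}/repodata.json") as resp:
    repodata = json.load(resp)

filename = CACHED.rsplit("/", 1)[-1]
expected = repodata["packages"][filename]["md5"]

digest = hashlib.md5()
with open(CACHED, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)

if digest.hexdigest() != expected:
    print(f"{filename}: local {digest.hexdigest()} != channel {expected}")
```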

@cpcloud commented Mar 28, 2017

cc @msarahan @kalefranz

@jasongrout commented:
> This breaks many things that cache packages.

And is concerning from a security standpoint...

@msarahan commented Apr 3, 2017

I am not sure exactly what caused this, but will raise it internally.

@kalefranz commented:
CC @mcg1969

@kalefranz added the source::community and pending::discussion labels and removed the type:enhancement label on Apr 3, 2017
@mcg1969 commented Apr 3, 2017

Isn't this related to our metadata hotfix discussion? We have two options: 1) fix the external copy of the metadata, and leave the file the same, or 2) rebuild the package, changing the MD5. There is no third option to bump the build number: when metadata is broken, it must be fixed.

Until recently we generally favored option 1, but we've since moved to option 2, because option 1 breaks things for people who rebuild their external repodata indices directly from packages.

It seems to me, however, that assuming identical filenames imply identical MD5s is problematic. We don't make that assumption within conda (or when we do, it's a bug).

@kalefranz commented:
This probably belongs at github.com/ContinuumIO/anaconda-issues, but we can keep it here for now.

@mcg1969 commented Apr 3, 2017

No, it belongs here, because the question of how to handle broken metadata (especially incorrect dependencies) is something that goes above and beyond anaconda.

@mcg1969 commented Apr 3, 2017

There is an option 3, actually: 3) remove the broken build from the repo altogether. But that is likely to confound caches as well.

@cpcloud commented Apr 3, 2017

> 2) rebuild the package, changing the MD5. There is no third option to bump the build number: when metadata is broken, it must be fixed.

I think I'm misunderstanding something. Why is rebuilding the package with build_number + 1 not an option?

@kalefranz commented:
There's a lot of history here. It starts with a hard requirement that the metadata for a package needs to evolve over time, most prominently because upper bounds need to be added to a package's dependency version constraints over time. This information is unknowable at package build time.

What is the source of truth for package metadata? Is it repodata.json, or the metadata within the package itself? We've at this point settled on the source of truth being the metadata contained within the package itself. There are multiple reasons that's preferable. One is that it lets conda index build repodata.json, as most users expect. It also ensures that the repodata known to anaconda.org is correct after an anaconda upload.

Ilan has thus, over the last couple of months, gone through repo.continuum.io and updated the metadata within packages.

At some point, the ultimate solution to all of these problems will probably be a rider file alongside the package tarball that contains information that can't be included in the tarball itself (md5, signature, etc.), along with overrides to metadata content. There's probably a good deal of planning and work needed to build that facility out properly, though.
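A rough sketch of why the tarball can serve as the source of truth: conda index aggregates each package's embedded info/index.json into repodata.json. This is heavily simplified -- the real tool records more fields and handles more cases:

```python
# Simplified sketch of what `conda index` does: read each tarball's
# embedded info/index.json and aggregate it into a channel-level
# repodata.json. The real implementation records more than this.
import glob
import hashlib
import json
import tarfile

packages = {}
for path in glob.glob("*.tar.bz2"):
    with tarfile.open(path, "r:bz2") as tf:
        meta = json.load(tf.extractfile("info/index.json"))
    with open(path, "rb") as f:
        meta["md5"] = hashlib.md5(f.read()).hexdigest()
    packages[path] = meta

with open("repodata.json", "w") as out:
    json.dump({"packages": packages}, out, indent=2)
```

This is also why rewriting the metadata inside a tarball necessarily changes its MD5: the checksum covers the embedded info/index.json along with everything else.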

@kalefranz commented:
> I think I'm misunderstanding something. Why is rebuilding the package with build_number + 1 not an option?

This is why I cc'd you @mcg1969. Thought you'd give a better explanation than I would.

@mcg1969 commented Apr 3, 2017

> I think I'm misunderstanding something. Why is rebuilding the package with build_number + 1 not an option?

We have been through this many times internally, I'm afraid, which suggests that it needs to be documented.

The problem is that if you leave the old package in place, and you don't change its metadata, then it's possible for conda to still pick up that broken package with specific combinations of dependencies.

For instance, suppose package foo 1.0 depends on bar. A new version bar 2.0 is released, and it breaks foo. Nobody knew this at the time foo was built, of course, so they just used a bare, unversioned bar dependency. Unfortunately, this means that if users update bar in those environments, foo is going to break.

So what we'd like to do is modify the metadata for foo 1.0 and change its dependency to bar <2.0. None of the program code is changing; this is strictly a metadata change. If we do this for all existing builds, then conda users get the benefit of this new metadata, and their environments will not break if they do conda update bar.

On the other hand, if we refuse to update existing builds, then those builds of foo are now broken in the current conda universe. Anyone who does conda install foo=1.0 bar=2.0, for instance, is going to get the broken build of foo.
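In repodata terms, the hotfix being described is a one-field change. The fragment below is illustrative of the structure, not copied from a real channel:

```python
# Illustrative repodata.json entries for the foo/bar example.
# Before the hotfix: foo 1.0 build 1 accepts any version of bar.
before = {"foo-1.0-1.tar.bz2": {"name": "foo", "version": "1.0",
                                "build_number": 1, "depends": ["bar"]}}
# After the hotfix: same filename and build number, constrained dependency.
# Writing this back into the tarball's info/index.json is what changes
# the MD5 without changing name, version, or build number.
after = {"foo-1.0-1.tar.bz2": {"name": "foo", "version": "1.0",
                               "build_number": 1, "depends": ["bar <2.0"]}}
```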

@mcg1969 commented Apr 3, 2017

To those who +1'd the comment about security concerns, I'd like to know specifically what those concerns are. I'm not denying that it could be a security concern, mind you. But I think it would likely be better for us to find alternate ways to mitigate those concerns than to allow users to break their conda environments because we refuse to fix known metadata errors.

So if we have some specific scenarios where people feel this is an issue, we might be able to address them. For instance, what if we included a history of MD5 checksums in repodata.json, so that users could be sure that a given MD5 is among those officially generated by the package provider?
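A minimal sketch of that suggestion, assuming a hypothetical md5_history field in each repodata.json entry -- no such field exists today:

```python
# Sketch of the suggested check, assuming a hypothetical "md5_history"
# list in each repodata.json entry. This field does not exist today.
def is_officially_published(local_md5: str, entry: dict) -> bool:
    history = entry.get("md5_history", [])
    return local_md5 == entry["md5"] or local_md5 in history
```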

@mcg1969 commented Apr 3, 2017

@kalefranz weren't you planning to include more robust checksum/verification capability within conda? If we do so, it would seem important to find a way to exclude metadata from that checksum.

@mcg1969 commented Apr 3, 2017

Speaking of security, another candidate for metadata hotfixes is security updates to critical dependencies. If a new version of OpenSSL comes out, we'd like to be able to make sure that all new installations of packages that depend on it pull only the latest version.

@wesm commented Apr 3, 2017

> On the other hand, if we refuse to update existing builds, then those builds of foo are now broken in the current conda universe. Anyone who does conda install foo=1.0 bar=2.0, for instance, is going to get the broken build of foo.

I'm a bit confused by this statement. My understanding is that the purpose of the package build number is to be able to supersede old packages while keeping the version number constant. If build numbers did not exist, then I would agree with you. We've been using this strategy quite successfully in conda-forge to release new builds for the same version number. Having package checksums change while leaving all else equal (version, build number) seems seriously problematic.

First, I don't think modifying the metadata of existing packages is a good idea, because many users cache the packages from repo.continuum.io behind their firewall. So it's either:

  • Increase the build number, and leave the existing packages (trusting conda to do the right thing)
  • Increase the build number, and remove the old packages

Is there a reason why these aren't valid solutions?

@jasongrout commented Apr 3, 2017

> My understanding is that the purpose of the package build number is to be able to supersede old packages while keeping the version number constant.

+1. I think it makes sense that any changes to the conda recipe (including just fixing a dependency, or even just updating metadata like the package description) constitute a new build of the conda package.

@cpcloud commented Apr 3, 2017

@mcg1969

> We have been through this many times internally, I'm afraid, which suggests that it needs to be documented.

> The problem is that if you leave the old package in place, then it's possible for conda to still pick up that broken package with specific combinations of dependencies.

Nothing prevents this in general, so why is the metadata being changed for specific dependencies? This attempts to relieve a package maintenance burden, but in fact makes it more difficult to maintain infrastructure based on conda. It is the responsibility of the person depending on bar to validate that using bar without any version constraint whatsoever works for their package. If it doesn't, that's on them for not being careful about constraining dependencies.

> For instance, suppose package foo 1.0 depends on bar. A new version bar 2.0 is released, and it breaks foo. Nobody knew this at the time foo was built, of course, so they just used a bare, unversioned bar dependency. Unfortunately, this means that if users update bar in those environments, foo is going to break.

If it breaks, then it is the dependent's problem to constrain their dependencies. Currently, an opaque choice is being made for users, rather than forcing people to be conscious of what they are depending on.

> So what we'd like to do is modify the metadata for foo 1.0 and change its dependency to bar <2.0. None of the program code is changing; this is strictly a metadata change. If we do this for all existing builds, then conda users get the benefit of this new metadata, and their environments will not break if they do conda update bar.

Again, if their environment breaks, it is their responsibility to constrain versions of their dependencies. conda update is a red herring because it's completely untenable in a multi-user setting.

> On the other hand, if we refuse to update existing builds, then those builds of foo are now broken in the current conda universe. Anyone who does conda install foo=1.0 bar=2.0, for instance, is going to get the broken build of foo.

Correct me if I'm wrong, but this isn't true if you bump the build number. A user would get an error, because they are trying to install the latest build of foo 1.0, which has a new build number and new metadata; conda would then know that bar must be less than 2.0 and would fail to install.

I still don't understand why this is being special-cased for certain packages whose constraints Continuum happens to have knowledge of, as if there's absolutely nothing else to be done about it.

If the metadata is going to be part of the package, then changes to it must be reflected in the version in some way, because the MD5 sum reflects the version of the package including its metadata. Ideally, metadata would be separate from code, but the current system doesn't work like that.

With that in mind, bumping the build number seems like the best compromise here.

@mcg1969 commented Apr 3, 2017

@kalefranz, I simply don't have the time to keep rehashing this argument. I have great respect for all of the minds on this thread, but there is nothing being questioned here that we haven't already thought of. When I get a bigger chunk of time, I will go through this again, here, but at that point it needs to somehow be encapsulated in a FAQ or other documentation.

@mcg1969 commented Apr 3, 2017

Though @kalefranz, to be clear, I am not wedded to the idea of updating the packages and breaking MD5s -- just to the notion that metadata hotfixes are necessary for existing packages.

@wesm commented Apr 3, 2017

> [I am ... wedded to] the notion that metadata hotfixes are necessary for existing packages

Would you mind addressing the points we've raised about the build number? By creating new packages with a higher build number, "metadata hotfixes" to existing tarballs are not necessary.

@mcg1969 commented Apr 3, 2017

This is not the case. As I said above I will indeed talk about it, but I just don't have the time right this moment.

@mcg1969 commented Apr 5, 2017

OK, folks, sorry for the delay. Let's work through a simple example that will hopefully illustrate why it is problematic for packages to remain in the repository with broken metadata.

Consider packages A, B, and C

  • A-1.0-1: depends on B >=2.
  • B-1.0-1, B-2.0-1, B-2.0-2.
  • C-1.0-1: depends on B, unversioned. However, when version 2 of package B was released, it was discovered to break C, so a new build was made:
  • C-1.0-2: depends on B <2

Here are a number of ways that this package combination breaks. In all seven of the scenarios below, a broken environment results. First, four fresh installs:

  1. conda create -n env A B C
  2. conda create -n env A C
  3. conda create -n env C B>=2
  4. conda create -n env C B

Scenario 4 bears explaining: in this case, conda has a choice between downgrading C one click, and downgrading B two clicks. It's going to prefer the one-click, so it will prefer C-1.0-1/B-2.0-2 to C-1.0-2/B-1.0-1. We can tweak conda's optimization algorithms to better handle that particular case, but it's very likely that it will come at a cost to other corner cases.

And now, consider a properly functioning environment containing C-1.0-1 and B-1.0-1:

  5. The user learns about some important new features in B, so he does conda update B. Package C remains untouched, because conda does not update packages unless it has to, in order to minimize disruption of user environments.
  6. The user performs conda update --all. Conda is faced with the same choice as scenario 4; noting that it can upgrade B more aggressively than C, it does so.
  7. The user decides he needs package A as well, so he does conda install A. This forces B to upgrade, and C-1.0-1 has no objection, so it proceeds unchallenged.

In all cases, we're left with a broken environment, when what should have happened was:
1,2,3. No environment is created due to dependency conflicts.
4. C-1.0-2/B-1.0-1 is installed.
5. Package B is not updated, because doing so would conflict with the existing version of C.
6. Package C is updated, but package B is not.
7. The attempted installation of A would fail because of a conflict with C.

It's important to emphasize that in cases 5-7, the environment was not broken before. Sure, the metadata was wrong, in hindsight, but users do not care about metadata when their environment is up and running; metadata matters only during installations, removals, and updates. And of course, we know this to be the case because the same environment with C-1.0-2 and B-1.0-1 would be functionally identical.

And frankly, in our experience, users don't find it satisfactory when we explain that broken metadata is the cause. They had a working environment, they did conda something something, and now it's broken. That's what matters to them.

It's also important to note that these breakages occurred even though the package developer for C was diligent about correcting his metadata and issuing a new build. It wasn't enough. Leaving the existing package in place ensures a variety of scenarios where the broken build will still be selected.

Again, to me there are two different issues here:

  1. Should the data in repodata.json be "hotfixed" to correct dependency issues like this?
  2. Should these metadata changes cause a conda package to be rebuilt?

I do not feel strongly about 2. In fact, I've argued against it in other forums. In those same discussions, however, I've argued forcefully in favor of 1, and have even influenced conda-forge policy on the matter.

Continuum deals with these kinds of reports quite often. Some are self-inflicted problems with packages we have built and served, and some involve packages from other channels, including conda-forge. I am fully confident that more people are well-served by hotfixing metadata than are poorly served by its other consequences.

@jasongrout commented:
For cases 1-3, why is conda considering earlier builds of the packages to be equivalent to the latest builds? Perhaps it should consider only the latest builds (or at least, the latest build's metadata) when doing the dependency resolution? A package with a larger build number should be considered a drop-in replacement for any earlier build with the same version number. Alternatively, as suggested above, if there is a huge problem with the metadata of C-1.0-1, it could be forcibly pulled from the repo or marked deprecated or superseded in some way (in my mind, the existence of a later build essentially marks earlier builds as deprecated).

In cases 5-7, how does conda currently know to get the new metadata from the hotfixed C-1.0-1? The conda package C-1.0-1 I have on my system is not the same package that is now on the server (which is the confusing part we are arguing against). Could we apply the exact same logic of updating the metadata, but with the understanding that newer builds of a package have metadata that supersedes older builds?

@mcg1969 commented Apr 5, 2017

> For cases 1-3, why is conda considering earlier builds of the packages to be equivalent to the latest builds?

We've considered that, and we can re-consider it. However, there are situations where we have to pull a specific build, such as metapackages with pinned build numbers or production environments with pinned packages. That makes it difficult for us to determine whether or not conda needs to consider the older builds.

> In cases 5-7, how does conda currently know to get the new metadata from the hotfixed C-1.0-1?

Because metadata during the install/update process is pulled from repodata.json, not the packages themselves. So hotfixing metadata used to involve just updating that file, and leaving the package intact.

@kalefranz, correct me if I'm wrong here, but I believe the reason it was decided to insert the fixed metadata back into the packages (thereby changing their MD5s) was to handle situations where repodata.json was being rebuilt from package data for some reason.

Allowing the latest build's metadata to supersede the bundled metadata of older builds -- a sort of "live" metadata hotfixing approach -- is something we could certainly consider.
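A rough sketch of what such "live" hotfixing could look like on the client side: within each (name, version) group, the highest build number's dependency list overrides the metadata of every earlier build. This is purely illustrative, and note that it would clobber builds whose dependency differences are intentional -- exactly the objection @minrk raises later in this thread:

```python
# Illustrative "live hotfix": within each (name, version) group, let the
# highest build_number's dependencies override all earlier builds.
from collections import defaultdict

def apply_live_hotfix(packages: dict) -> dict:
    groups = defaultdict(list)
    for filename, meta in packages.items():
        groups[(meta["name"], meta["version"])].append(meta)
    for builds in groups.values():
        latest = max(builds, key=lambda m: m["build_number"])
        for meta in builds:
            meta["depends"] = list(latest["depends"])
    return packages
```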

@jasongrout commented:
> such as metapackages with pinned build numbers or production environments with pinned packages.

Doesn't this essentially break, then, if you're changing builds under people's noses? Or do we need another level of pinning a package, perhaps by hash?

It seems we can't have it both ways. Either the package file can change, in which case I can't trust that I have the most recent information even if I have pinned a specific version and build, and I always have to fetch it (and re-install it) just in case it updated; or the package file never changes, so I can trust that I only need to consider updating when there is a new build advertised.

@mcg1969 commented Apr 5, 2017

And heck, if you're really wanting to be careful, you have to pin the channel, too.

@jasongrout commented Apr 5, 2017

So two proposals seem to be:

(a) A package file in a channel is identified by (in order of specificity) the (major, minor, patch) version (referring to upstream information) and the (build string, build number, file hash) (referring to the conda packaging process). Newer hash metadata automatically overrides older hash metadata. How to compare hashes is unresolved; one proposal is to keep the set of previous hashes in the package metadata, so that given two iterations, one file's hash will appear in the other package's metadata.

(b) A package file in a channel is identified by (in order of specificity) the (major, minor, patch) version (referring to upstream information) and the (build string, build number) (referring to the conda packaging process). The build number is an integer, so it can be naturally compared. Newer build metadata automatically overrides older build metadata.

So basically, hotfixing a build essentially moves the pinning process down one level. Arguably, if you're really trying to guarantee a specific package file, you should be pinning to a hash because that can be checked.

@mcg1969 commented May 5, 2017

To anyone still listening: I believe I have a solution to the hotfixing dilemma. I call it "virtual hotfixing via build groups", and I've detailed the approach in the Gist linked below. It requires conda to interpret packages just a little bit differently, but in a way that I think we would all agree is reasonably intuitive.

Comments appreciated:

https://gist.github.com/mcg1969/38589eeefb046c417720f1027f97085b

@minrk commented May 18, 2017

@mcg1969 thanks for posting the proposal. I'm not sure I understand 100%, but it seems like the virtual-hotfix would apply to most existing conda packages as they are built now by default, even ones that aren't broken. If true, this would be a problem for a lot of the packages I deal with (pyzmq, petsc, mpi4py, fenics, others).

I think the 'build siblings' notion makes sense, but I also think it's important that it be defined in a way that ensures it does not create any sibling relationships between builds already on anaconda.org. There are plenty of packages correctly pinning their dependencies and updating build-time pinnings with new build numbers. For example, conda-forge/pyzmq-16.0.2-py36_{0,1} pin their zeromq dependency to 4.1 and 4.2, respectively, and must not have the same dependency metadata or the package would be broken. Interpreting py36_1 as a hotfix for py36_0 would introduce precisely the problem it intends to solve, but in correctly specified packages instead of incorrectly specified ones. How does this proposal ensure that these packages are kept as correct builds with different dependencies, instead of a virtual hotfix applying incompatible metadata to earlier builds?

Since this is about solving an exceptional circumstance (hotfixes should only apply to known-bad builds), it makes more sense to me for this sort of behavior to be opt-in (require explicit sibling declaration), rather than opt-out (all builds are siblings by default). A similar proposal that makes this hotfixing explicit would be to have a special hotfix_N build-string suffix, so that uploading a new package package-x.y.z-py36_2_hotfix_1 overrides py36_2, etc. I hope a criterion for whatever is arrived upon is that it doesn't interpret any builds already on anaconda.org as being metadata hotfixes.

@mcg1969 commented May 18, 2017

@minrk: @jjhelmus and I had a good internal discussion about my proposal above, and in hindsight we should have summarized our findings there or here. In short, we concluded that 1) it would cause some problems to automatically construct build groups according to the build string convention; and 2) changes coming to conda-build would render this approach irrelevant anyway, because build strings are going to start being constructed according to a different convention.

That said, it is my view that pyzmq should not be doing what it is doing -- specifically, using build numbers to differentiate between packages that are no different except in their dependencies. This is something that should be resolved using build string differences instead, and the two should have identical build numbers. So, for instance, the packages should be something like this:

pyzmq-16.0.2-py36zmq41_0
pyzmq-16.0.2-py36zmq42_0

I understand that this is difficult to accomplish without some serious Jinja wizardry, but I'm genuinely concerned that there are material consequences to the solver caused by this pinning practice.

@ijstokes commented Jun 30, 2017

@mcg1969 your comment above explaining why hotfixes are necessary is very helpful. I appreciate the comments following on from that by @jasongrout and @minrk. I understand there are further conda-team-internal discussions that are happening as well.

Summary

Below I introduce terms for 5 principles: Containment, Invariance, Equivalence, Deprecation, Precedence. Using these, I propose that packages, once published, are never changed, allowing MD5 sums to stay the same for all time, but that a published package can, optionally, be superseded by a new package with a higher build number and metadata specifying that it supersedes and deprecates the previous package. Deprecated packages are labelled as such in the repository and therefore effectively cease to be available. Mirroring mechanisms should pick up these changes. Local package caches and conda environments containing the deprecated package will warn the user when they are found to contain a deprecated version that has been superseded by a functionally equivalent package.

Details

My Personal View

  1. I really don't think we can change MD5s for a publicly released package -- meaning, for a fixed channel, package name, platform, version, and build number. Replication and caching of packages sits at the core of the Internet and even of the way conda works. We break all sorts of implicit (and well-founded) assumptions when we start doing this. The purpose of generating these MD5 sums is to verify that the package has not been changed, whether by corruption, mirroring fault, or malice. MD5 sums are a standard way for people to verify that they have mirrored and retrieved the "correct" version (which is why we use them, and why conda aborts when it discovers an MD5 sum mismatch). The possibility of a package changing once published undermines the ability of millions of people to trust conda.

  2. The meta-data has an intimate relationship with the behavior of the package. The meta-data provides an operational specification that needs to be satisfied. The degree to which this is "pinned" can be decided, to a degree, independently, but whatever decision is made, and whoever makes that decision, the end user of the package, in most cases, wants to be sure they have received an exact copy of that decision. The most practical way to do this is to i) embed the meta-data into the conda package; and ii) use a single MD5 sum as the combined "package + meta-data" verification mechanism. A whole host of other complications will be introduced if the meta-data is separated, verified, and linked/synchronized independently from the code-only (or binary-only) package. I feel like those arguments (granted, described here in an entirely un-rigorous manner) may undermine the "build group" proposal which otherwise has some things to recommend it.

  3. The issues outlined above by @mcg1969 are totally legitimate: packages that were 100% correct at the time of creation, and valid when used to create an environment, should not be made invalid (i.e. guaranteed not to work) by the release of a dependency package that is no longer compatible with the base package, despite the base package's now-out-of-date dependency specification suggesting that it should be valid. Environments containing those packages should not be exposed to the risk that updates will break the environment.

Principles

This is a tricky set of conditions to satisfy, but let me try to provide some terminology for principles that can be used to build a proposal:

  1. Containment: meta-data is part of the package and needs to be embedded inside the package as the canonical version describing the content and specification. The reason is that the meta-data provides information that impacts the way the package behaves.

  2. Invariance: a specific release of a package, once issued, must never change. Correspondingly its MD5 sum will never change (given a standard and normalized packaging mechanism, such as tar-bz2 -- and as a side note, I'm not actually sure conda does normalize package contents before the tar-bz2 step). The reason is that conda specifically and the Internet in general rely on package caching and a (not unreasonable) assumption that things like versioned software packages won't change once released publicly, coupled with the use of MD5 sums to verify this invariance.

  3. Deprecation: a mechanism to make it clear that a previously issued package is deprecated and should not be used.

  4. Equivalence: a standard for specifying functional package equivalence. I believe packages that only vary by build numbers (in meta-data) satisfy this.

  5. Precedence: a standard for understanding package release (partial) ordering. I believe we already have this (otherwise I don't know how conda's solver would work).

Proposal

@mcg1969 has outlined a number of scenarios above that are not edge cases but very accurately represent present reality. They come down to: "The package base content is just fine, but the package meta-data is now invalid due to some new conditions which have arisen, and as such can result in environment creation or package updates/installs that will be invalid." Here is a proposal for what I think should happen when this event occurs (a sketch follows the list):

  1. A new package with an incremented build number is created. We need to create a new package anyway, the only difference from the current hotfix process is that this is new, not a replacement. The new package can explicitly list the old packages that it deprecates, perhaps in a way that conda can leverage (in the creation of a new repodata.json or when running the solver). This will satisfy the principles of invariance (old package isn't overwritten), equivalence (new package is functionally the same as old package), and precedence (new package is used in preference to old package).

  2. The old package is removed from standard package channels and moved to a "deprecated" channel. In this way the package is still accessible and can be mirrored if required, but it will not be picked up by the solver. Today I do not know what the implication is for end-user package caches. This (partially) solves the principle of deprecation (removing the old package from the standard package resolution set).

  3. Some TBD mechanism is used to flag the package as deprecated. This flag would allow package caches to flag the deprecated packages as "dirty" and treat them differently when doing local package updates and installations. This would be new conda/repo functionality. This solves the rest of the principle of deprecation.
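A minimal sketch of steps 1 and 3, assuming a hypothetical deprecates field in the new build's metadata -- not part of any current conda spec:

```python
# Sketch of the proposal: a new build explicitly lists the exact
# filenames it deprecates, and the solver drops those before resolving.
# The "deprecates" field is hypothetical, not part of any current spec.
new_build = {
    "foo-1.0-2.tar.bz2": {
        "name": "foo", "version": "1.0", "build_number": 2,
        "depends": ["bar <2.0"],
        "deprecates": ["foo-1.0-1.tar.bz2"],
    }
}

def drop_deprecated(packages: dict) -> dict:
    dead = {fn for meta in packages.values()
            for fn in meta.get("deprecates", [])}
    return {fn: meta for fn, meta in packages.items() if fn not in dead}
```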

Implications

  1. Anyone doing new/fresh package installs or environment creation will only pick up the "corrected" package

  2. Anyone wishing to apply this strategy will need to manage two channels: the "main" channel and the "deprecated" channel. Somewhat as an implementation detail Anaconda Repository (as it stands today) makes this very easy: labels can be used to create (effectively) new (sub) channels for things like "alpha", "beta", "rc1", "dev", and, if required, "deprecated" within any "top level" channel.

  3. Anyone with the old package installed in an environment that previously worked would need to have conda realize that the old package is deprecated and that a new package needs to be installed in its place. If it is a read-only environment then conda would need to flag this (deprecation warning) and either abort (unless a flag such as --allow-deprecated is used) or proceed (unless a flag such as --deny-deprecated is used), depending what is decided for the default behavior. Based on PLS I would suggest that the default should be to abort on the basis that the package was deprecated for a reason and the user would need to --allow-deprecated to continue, understanding this may produce undesirable results.

  4. It may make sense to have an option like update-deprecated, such that any time package operations of any sort are applied to a conda environment containing packages known to be deprecated, those packages will be updated to the superseding, functionally equivalent package, thereby bringing the updated meta-data and dependency specifications into effect.

  5. It reduces the number of packages that are considered "live" or "valid" at any given time.

  6. conda would be updated to have a formal understanding of a higher build number as a higher-precedence equivalent (if it doesn't already have this understanding). Some configuration mechanism would allow users to (optionally) specify a requirement to ALWAYS use a higher build number if one is available. EDIT: In fact, implicit build-number precedence may not be required if new packages could (optionally) explicitly mark old packages as deprecated and superseded by the new package -- in this way the solver would have explicit guidance to ignore the deprecated package and consider only the superseding package.

  7. In this world it should be the exception, not the norm, for a package's dependency specification to ever contain a build number. Instead, the specs should assume the principles of "equivalence" and "precedence" and leave their package dependency specifications open to any build of the given version.

  8. None of this implies that a new build number necessarily causes an older build to be deprecated, which I think was one of @minrk 's concerns about the "build-group" proposal.

Concerns

  1. Lots of packages will be shuffled into this "deprecated" state.

  2. How far back in previous package versions will this process be applied? This has the risk of being extensive.

  3. EDIT: Fixed logic may be required (and specified in the conda spec) to describe the parameters of the deprecated-package specification. For security reasons I believe it would be necessary to limit these specifications to packages with the same abstract name, and possibly (even probably) additional constraints (channel, platform, etc.). This would prevent foo from deprecating bar, or foo-py36-3.2.1_6 from deprecating (intentionally or accidentally) foo-py27-3.2.1_3.

@ijstokes commented Jul 1, 2017

@bkreider I should have tagged you on this, but you've probably picked it up from the internal conversations and references to this issue.

@Carreau commented Jul 1, 2017

Thanks @ijstokes, it is a good proposal and I believe it is reasonable.

> It reduces the number of packages that are considered "live" or "valid" at any given time.

Does that have a chance to speed up the solver? It is, subjectively, starting to feel slow to me.

> The old package is removed from standard package channels and moved to a "deprecated" channel

This has the drawback of potentially being forgotten. If a new package build indicates that it deprecates an older one, it would be great to have a way to enforce that the now-deprecated one actually gets moved.

> For security reasons I believe it would be necessary to limit these specifications only to packages with the same abstract name, and possibly (even probably) additional constraints (channel, platform, etc.)

I would start with something completely restrictive -- only allow deprecating the exact same package except for the build number -- and potentially open this up over time. It is easier to remove restrictions than to add them.

Side question, which may be a further refinement of this proposal: as the new package will, in most cases, differ from the old one only in its metadata, is there a plan to also distribute it as a patch on top of the old one? That would decrease bandwidth and storage usage.

Thanks for taking the time to write your extensive description.

@ijstokes commented Jul 1, 2017

@Carreau:

> > It reduces the number of packages that are considered "live" or "valid" at any given time.
>
> Does that have a chance to speed up the solver? It is, subjectively, starting to feel slow to me.

Perhaps. @mcg1969 would be able to give a better assessment of that. My sense is that this would not be an especially major performance improvement in the near term, but it could pave the way to reduce the package search space. @mcg1969 is already pretty clever about that. He knows a thing or two about optimization and solvers.

> This has the drawback of potentially being forgotten. If a new package build indicates that it deprecates an older one, it would be great to have a way to enforce that the now-deprecated one actually gets moved.

I haven't thought this completely through, but I believe a mechanism that allowed one package to authoritatively declare another package deprecated may mean that it is not strictly necessary to label or move deprecated packages into a separate channel (or virtual-channel-by-way-of-label). The deprecated package could just sit around and would be ignored by the solver. Maybe.

> I would start with something completely restrictive -- only allow deprecating the exact same package except for the build number -- and potentially open this up over time. It is easier to remove restrictions than to add them.

Yes, I agree.

> Side question, which may be a further refinement of this proposal: as the new package will, in most cases, differ from the old one only in its metadata, is there a plan to also distribute it as a patch on top of the old one? That would decrease bandwidth and storage usage.

I can't imagine this is going to happen in the near term, unfortunately. To do this we'd need to build git-like capabilities into conda packages so that only the delta could be transferred, with the "new package" then reified on the client end so MD5 sums could be verified. And this process would need to work from any arbitrary past package to the current package (IOW, full git-like delta reconstruction of package state). Once we start using conda packages for software updates to Mars rovers or Voyager deep space explorers, perhaps NASA will pay us to implement (or port from git) this functionality. In fact, doing this would probably end up being a reliable mechanism for client-side package builds that produce results identical to the original server-side build (for MD5 sum verification purposes). As I'm sure you know, there are a million challenges to making that viable.

> Thanks for taking the time to write your extensive description.

Thanks to everyone for putting so much time and thought into this. Curiously enough, a variation on this topic was the core theoretical contribution of my PhD dissertation, but it has been over a decade since I've given this kind of problem serious thought.

@mwcraig commented Jul 10, 2017

Sorry I'm late to the party, but I hit this as reported on the Anaconda Google group. It looks like 85 numpy packages are bad as of the middle of last week.

If hotfixes are going to continue, then please also release new builds with an incremented build number.

@mwcraig commented Jul 10, 2017

If you are interested, some code to:

And, from the thread referenced above, the mismatches I found last week (so this may have changed).

@mwcraig commented Jul 11, 2017

😳 Turns out there may not be 85 bad packages... sometimes one of the headers (the ETag) returned by curl has the same value as the MD5, and sometimes it doesn't...

...and I was trying to flag bad packages, without actually downloading them, by using the ETag.
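A sketch of that ETag spot check (assuming the requests library). ETags are only sometimes the MD5 of the file: some servers return the MD5 for simple uploads, but multipart uploads and CDN front-ends produce other values, which would explain the false positives:

```python
# Sketch of an ETag-based spot check: HEAD the package URL and compare
# the ETag to the expected MD5. An ETag is not guaranteed to be an MD5,
# so a mismatch here is a hint, not proof of a bad package.
from typing import Optional

import requests

def etag_matches_md5(url: str, expected_md5: str) -> Optional[bool]:
    etag = requests.head(url, allow_redirects=True).headers.get("ETag", "")
    etag = etag.strip('"')
    if len(etag) != 32 or "-" in etag:
        return None  # ETag is not a plain MD5; nothing can be concluded
    return etag.lower() == expected_md5.lower()
```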

@minrk commented Jul 12, 2017

@ijstokes thanks, I think that using 'labels' to deprecate builds sounds like a pretty good solution. If I understand correctly, deprecation in this proposal amounts to applying a standardized 'deprecated' label to packages, and can be done at any time, yes? So the solution can be retroactively applied to existing packages without even uploading a new build.

In that case, the exact mechanism of deprecation on upload could be a somewhat separate question of defaults/conventions, etc. If that's the case, should deprecation be omitted from the metadata and instead be only part of the upload command? e.g.

anaconda upload --deprecate-siblings mypackage-py36_0.tar.gz

which would apply the deprecated label to all packages that vary only in build number on the target channel.

And perhaps a dedicated command for doing deprecations, though that may not be needed if there's already an API for applying labels. Does this exist already? I see anaconda label, but it doesn't look like it can apply labels to individual packages.

@wesm commented Aug 3, 2017

Is metadata being edited in packages on anaconda.org? We started getting some MD5 checksum mismatches in AppVeyor and Travis CI today.

@ijstokes commented:
@wesm: yes, that is exactly what this thread is about. Packages are "hotfixed" to change their content, even with identical versions and build numbers; only the meta-data is changed. This is done to better constrain the dependencies. E.g. imagine a conda package that said it worked with pandas >= 0.11.3, but it then turns out that pandas 0.14.0 breaks it -- that conda package works perfectly right up until the day pandas 0.14.0 is released and available, at which point new package plans will pull the latest version of the dependencies, the package won't work, and the environment will be broken. This issue is trying to address how that situation should be resolved.

@wesm commented Aug 29, 2017

I was referring to user packages -- are you editing metadata in artifacts that were uploaded by non-Anaconda Inc.-affiliated users with anaconda upload?

@mingwandroid commented:
No, we would never do that. If we found a dangerous package (containing a virus, for example) I expect we'd quarantine it and get in touch with the user, but modifying other people's packages? No way.

@wesm commented Aug 30, 2017

Got it, thanks for clarifying. A few weeks ago we were getting MD5 checksum mismatches on some conda-forge packages from anaconda.org. In our case we had set the conda package cache to be stored using Travis CI's caching system, so it must have been some other kind of transient issue.

@github-actions commented:
Hi there, thank you for your contribution to Conda!

This issue has been automatically locked since it has not had recent activity after it was closed.

Please open a new issue if needed.

github-actions bot added the locked [bot] label, locked the issue as resolved, and limited conversation to collaborators on Sep 17, 2021.