separate conda package metadata under new root-level key in repodata #8639

kalefranz · 2019-05-07T22:15:09Z

This PR denormalizes the overlaying/entangled metadata in the original .conda package format work (i.e. conda_size, conda_sha256, conda_inner_sha256, and conda_outer_sha256 fields). With this PR, repodata.json should now add a new root-level packages.conda field containing information for all packages with the .conda extension.

This strategy has several advantages, including:

All existing versions of conda will simply not see the new packages.
New packages are decoupled from old packages, and so, in cases where someone only uploads the new format to a channel skipping the old format, current versions of conda won't crash.
Having a metadata entry for each package makes clear exactly which package the metadata refers to and where the source of truth is. That's especially useful for an dynamic anaconda.org repo situation where users can copy, move, and delete individual packages from channels.

The root key packages.conda was chosen over conda_packages so that sorting repodata.json keeps the info key as the top key in the sort.

An example of what repodata.json now looks like is included in this PR as tests/data/conda_format_repo/win-64/repodata.json.

{
  "info": {
    "subdir": "win-64"
  },
  "packages": {
    "zlib-1.2.11-h62dcd97_3.tar.bz2": {
      "build": "h62dcd97_3",
      "build_number": 3,
      "depends": [
        "vc >=14.1,<15.0a0"
      ],
      "license": "zlib",
      "license_family": "Other",
      "md5": "a46cf10ba0eece37dffcec2d45a1f4ec",
      "name": "zlib",
      "sha256": "10363f6c023d7fb3d11fdb4cc8de59b5ad5c6affdf960210dd95a252a3fced2b",
      "size": 131285,
      "subdir": "win-64",
      "timestamp": 1542815182812,
      "version": "1.2.11"
    }
  },
  "packages.conda": {
    "zlib-1.2.11-h62dcd97_3.conda": {
      "build": "h62dcd97_3",
      "build_number": 3,
      "depends": [
        "vc >=14.1,<15.0a0"
      ],
      "license": "zlib",
      "license_family": "Other",
      "md5": "edad165fc3d25636d4f0a61c42873fbc",
      "name": "zlib",
      "sha256": "2fb5900c4a2ca7e0f509ebc344b3508815d7647c86cfb6721a1690365222e55a",
      "size": 112305,
      "subdir": "win-64",
      "timestamp": 1542815182812,
      "version": "1.2.11"
    }
  },
  "removed": [],
  "repodata_version": 1
}

msarahan · 2019-05-09T15:09:12Z

This PR does not currently specify or demonstrate how new packages should look - what the actual data in repodata.json should be. We need that in tests. @kalefranz are you going to have time to develop that, or should we take that on?

The more hacky way that I had done things is very much still in the codebase, and this PR should probably remove it. It's not necessary to keep it because this is a better way that will require much less hacky stuff, but we need the tests for usage of the new file format to reflect this new way of doing things.

kalefranz · 2019-05-09T17:17:00Z

I've updated this PR with a second commit. In addition to functional code changes, it exercises all the package cache code pretty thoroughly with additions to tests/core/test_package_cache_data.py.

I need another hour or so on this though. The one remaining item on my TODO list here is to audit all uses of CONDA_TARBALL_EXTENSION to make sure they are still appropriate.

kalefranz · 2019-05-09T17:59:12Z

OK, pending test success, I'm happy with the state of this PR and welcome review and feedback.

Signed-off-by: Kale Franz <kfranz@continuum.io>

kalefranz

I've added some comments to the code. Need to make at least two more changes.

conda/models/channel.py

conda/models/dist.py

conda/models/match_spec.py

Signed-off-by: Kale Franz <kfranz@continuum.io>

kalefranz · 2019-05-09T21:42:13Z

The failing py27 test is tests/test_api_build.py::test_warning_on_file_clobbering, and it seems to be failing on other branches as well.

The failing conda-build test is tests/test_api_build.py::test_warning_on_file_clobbering. @msarahan do you have any insight into what got out of sync here?

kalefranz · 2019-05-09T21:47:41Z

conda/core/subdir_data.py

+        for fn, info in concatv(
+            iteritems(conda_packages),
+            ((k, legacy_packages[k]) for k in use_these_legacy_keys),
+        ):


@msarahan if we want to add a feature flag to opt-in to using the new package format (which we can switch to true by default in a later conda release) then the only change we have to make to do that is localized right here.

msarahan · 2019-05-09T22:05:27Z

conda/core/package_cache_data.py

@@ -275,17 +260,15 @@ def _check_writable(self):
        return i_wri

    @staticmethod
-    def _clean_tarball_path_and_get_checksums(tarball_path, sha256sum=None, md5sum=None):
+    def _clean_tarball_path_and_get_md5sum(tarball_path, md5sum=None):


It looks like you are reverting a lot of the work I did to prefer sha256 over md5 where possible. Is that intentional? What's your reasoning on that?

The purpose of using md5 in all of these places is only for quick identification, and to have reasonable assurance that we have the "right" artifact, in the context of distinguishing if the package flask-1.0.2-py_0.tar.bz2 in the package cache came from free, main, conda-forge, or some other channel. We know pretty close to what it should be by the file name, but because we don't include channel information with the bare tarballs in the package cache, we don't know what channel they come from. It's also a quick check to ensure that a package hasn't been updated in-place in the remote channel (and therefore needs to be downloaded again).

Matching the md5 to the expected md5 from repodata gives us the assurance that an artifact that has the right name in a package cache is actually the artifact that we want. To be clear: Using md5 here is not a security measure. Just as we use md5 as a checksum to ensure downloads have not been inadvertently corrupted, we use md5 here for a fast check to make sure the artifact is as we expect (or not).

Neither md5 nor sha256 in this context is sufficient as a security measure to guard against malicious attacks. The purpose of using md5 here is ONLY as a quick checksum. And in these quick checksum cases we should prefer md5 to sha256 since the former is indeed quite a bit faster.

When we add cryptographic signatures, we will be making specific and more targeted use of sha256 + size, in combination.

Thanks. That makes sense. Is it worth keeping at least for the file downloads, though?

Yeah, I don't see any problem with that. And while we're at it, we might as well add in size to that function too and get rid of one of those TODOs. Pushing another commit soon with the change.

Signed-off-by: Kale Franz <kfranz@continuum.io>

msarahan · 2019-05-10T13:12:15Z

conda/gateways/connection/download.py

-            log.debug("MD5 sums mismatch for download: %s (%s != %s), "
-                      "trying again" % (url, digest_builder.hexdigest(), md5sum))
-            raise MD5MismatchError(url, target_full_path, md5sum, actual_md5sum)
+        if md5:


Is it worthwhile to compute both sha256 and md5? My original thought was to only compute the most secure option available. I guess also computing md5 won't add much time, though.

Using sha256 instead of md5 isn't really a valid security measure. You need to couple sha256 with cryptographic signatures to actually have "security" in this context. Without signatures, the best assurance we can have here is that the artifact wasn't inadvertently corrupted by erroneous HTTP over the wire. It's false to think that md5 is any more "secure" in this use case than sha256. (Probably why s3 uses md5 for etags, as I mentioned in #8651.) So while it's not any more secure, you also incur the performance penalty of the heavier sha256.

All that said, we'll probably need to start using sha256 and incorporating that into any upcoming package signing implementation, so while we're really not adding any security now, it makes sense to prepare for the future. I'll make the change.

Is it worthwhile to compute both sha256 and md5?

Computing both is wasteful because computing SHA256 usually isn't x times but rather 0.x times slower than computing MD5. On my machine it's roughly 50 %.

Using sha256 instead of md5 isn't really a valid security measure.

Sure.

So while it's not any more secure, you also incur the performance penalty of the heavier sha256.
[...]
All that said, we'll probably need to start using sha256 and incorporating that into any upcoming package signing implementation

For performance reasons you might want to consider SHA512 or SHA512/256 (supported from openssl 1.1.1 onwards, available in Python via cryptography >=2.5). For me ( = not representative, of course), computing SHA512/256 is nearly as performant as MD5:

> podman run --rm -it continuumio/miniconda3 sh -lic ' conda create -qynopenssl openssl >/dev/null && conda activate openssl openssl version for hash in md5 sha256 sha512 sha512-256 ; do openssl speed -seconds 2 -bytes 1048576 -evp "${hash}" 2>/dev/null | tail -1 ; done ' OpenSSL 1.1.1b 26 Feb 2019 md5 790102.02k sha256 515857.24k sha512 767557.63k sha512-256 770179.07k

cool! thanks for the pointer.

…iles

msarahan · 2019-05-12T15:16:24Z

The 2 test failures are confusing to me. They're reproducible locally, but only when running that test file as a whole. If those tests are run individually, or run in a subset (--lf after a failing run), then they pass. Clearly something is more stateful than it should be. I have verified that they are not sharing a package cache dir (the test folder paths are very different). If anyone has any ideas for what else to check, please chime in.

Signed-off-by: Kale Franz <kfranz@continuum.io>

…package-metadata

…milar packages

…ackage-metadata

github-actions · 2021-08-27T06:02:18Z

Hi there, thank you for your contribution to Conda!

This pull request has been automatically locked since it has not had recent activity after it was closed.

Please open a new issue or pull request if needed.

separate conda package metadata under new root-level key in repodata

c76a322

kalefranz requested a review from a team as a code owner May 7, 2019 22:15

denormalize new package format metadata in repodata

e4861dd

kalefranz added 3 commits May 9, 2019 12:53

audit use of .tar.bz2 and CONDA_TARBALL_EXTENSION

860e13a

rename CONDA_TARBALL_EXTENSION to CONDA_PACKAGE_EXTENSION

09f39a2

avoid global call in tight loop

f1675ad

address remaining unit test errors

8be9786

Signed-off-by: Kale Franz <kfranz@continuum.io>

kalefranz commented May 9, 2019

View reviewed changes

conda/models/channel.py Outdated Show resolved Hide resolved

conda/models/dist.py Show resolved Hide resolved

conda/models/match_spec.py Show resolved Hide resolved

conda/models/match_spec.py Show resolved Hide resolved

conda/models/match_spec.py Outdated Show resolved Hide resolved

address outstanding comments

7227b17

Signed-off-by: Kale Franz <kfranz@continuum.io>

kalefranz commented May 9, 2019

View reviewed changes

msarahan reviewed May 9, 2019

View reviewed changes

msarahan mentioned this pull request May 10, 2019

implement new format indexing in its own repodata section conda/conda-build#3500

Closed

kalefranz added 2 commits May 9, 2019 22:01

flake8

301775c

Signed-off-by: Kale Franz <kfranz@continuum.io>

add sha256 and size to download()

769cead

kalefranz mentioned this pull request May 10, 2019

sharded repodata #8651

Closed

msarahan reviewed May 10, 2019

View reviewed changes

msarahan added this to the 4.7.0 milestone May 10, 2019

msarahan added 2 commits May 10, 2019 12:16

prefer sha256 over md5 for download verification

258e926

Merge branch 'master' into separate-conda-package-metadata

32aaae3

msarahan mentioned this pull request May 10, 2019

simply package info for .conda files conda/conda-package-handling#10

Merged

implement tests for preferring existing pkg cache when using .conda f…

304592d

…iles

msarahan and others added 4 commits May 13, 2019 08:25

move failing package_cache_data tests to integration

73308b6

add tests for checksum mismatch

7a053d7

add tests for checksum mismatch

8690b55

prefer sha256 over md5 when both are available

e6562af

Signed-off-by: Kale Franz <kfranz@continuum.io>

msarahan added 3 commits May 13, 2019 16:17

Merge branch 'separate-conda-package-metadata-2' into separate-conda-…

daa9cac

…package-metadata

allow .conda packages to use cached .tar.bz2 entries for otherwise si…

f87945e

…milar packages

flake8

9b64da0

msarahan mentioned this pull request May 14, 2019

Separate conda package metadata 2 #8667

Closed

msarahan added 4 commits May 13, 2019 20:38

fix file URL scope oops; comment sha256 enforcement test

571bd36

flake8

4de69db

Merge branch 'master' of github.com:conda/conda into separate-conda-p…

2cb8f6b

…ackage-metadata

handle empty size or md5 PackageCacheRecords better

ae90e90

msarahan merged commit 63fa8ce into conda:master May 14, 2019

This was referenced May 22, 2019

conda extremely slow #7239

Closed

Conda Install Extremely Slow #8087

Closed

github-actions bot added the locked [bot] locked due to inactivity label Aug 27, 2021

github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

separate conda package metadata under new root-level key in repodata #8639

separate conda package metadata under new root-level key in repodata #8639

kalefranz commented May 7, 2019 •

edited

msarahan commented May 9, 2019

kalefranz commented May 9, 2019

kalefranz commented May 9, 2019

kalefranz left a comment

kalefranz commented May 9, 2019

kalefranz May 9, 2019

msarahan May 9, 2019

kalefranz May 10, 2019

msarahan May 10, 2019

kalefranz May 10, 2019

msarahan May 10, 2019

kalefranz May 10, 2019

mbargull May 14, 2019

msarahan May 14, 2019

msarahan commented May 12, 2019

github-actions bot commented Aug 27, 2021

separate conda package metadata under new root-level key in repodata #8639

separate conda package metadata under new root-level key in repodata #8639

Conversation

kalefranz commented May 7, 2019 • edited

msarahan commented May 9, 2019

kalefranz commented May 9, 2019

kalefranz commented May 9, 2019

kalefranz left a comment

Choose a reason for hiding this comment

kalefranz commented May 9, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

msarahan commented May 12, 2019

github-actions bot commented Aug 27, 2021

kalefranz commented May 7, 2019 •

edited