Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

separate conda package metadata under new root-level key in repodata #8639

Merged
merged 23 commits into from May 14, 2019

Conversation

kalefranz
Copy link
Contributor

@kalefranz kalefranz commented May 7, 2019

This PR denormalizes the overlaying/entangled metadata in the original .conda package format work (i.e. conda_size, conda_sha256, conda_inner_sha256, and conda_outer_sha256 fields). With this PR, repodata.json should now add a new root-level packages.conda field containing information for all packages with the .conda extension.

This strategy has several advantages, including:

  1. All existing versions of conda will simply not see the new packages.
  2. New packages are decoupled from old packages, and so, in cases where someone only uploads the new format to a channel skipping the old format, current versions of conda won't crash.
  3. Having a metadata entry for each package makes clear exactly which package the metadata refers to and where the source of truth is. That's especially useful for an dynamic anaconda.org repo situation where users can copy, move, and delete individual packages from channels.

The root key packages.conda was chosen over conda_packages so that sorting repodata.json keeps the info key as the top key in the sort.

An example of what repodata.json now looks like is included in this PR as tests/data/conda_format_repo/win-64/repodata.json.

{
  "info": {
    "subdir": "win-64"
  },
  "packages": {
    "zlib-1.2.11-h62dcd97_3.tar.bz2": {
      "build": "h62dcd97_3",
      "build_number": 3,
      "depends": [
        "vc >=14.1,<15.0a0"
      ],
      "license": "zlib",
      "license_family": "Other",
      "md5": "a46cf10ba0eece37dffcec2d45a1f4ec",
      "name": "zlib",
      "sha256": "10363f6c023d7fb3d11fdb4cc8de59b5ad5c6affdf960210dd95a252a3fced2b",
      "size": 131285,
      "subdir": "win-64",
      "timestamp": 1542815182812,
      "version": "1.2.11"
    }
  },
  "packages.conda": {
    "zlib-1.2.11-h62dcd97_3.conda": {
      "build": "h62dcd97_3",
      "build_number": 3,
      "depends": [
        "vc >=14.1,<15.0a0"
      ],
      "license": "zlib",
      "license_family": "Other",
      "md5": "edad165fc3d25636d4f0a61c42873fbc",
      "name": "zlib",
      "sha256": "2fb5900c4a2ca7e0f509ebc344b3508815d7647c86cfb6721a1690365222e55a",
      "size": 112305,
      "subdir": "win-64",
      "timestamp": 1542815182812,
      "version": "1.2.11"
    }
  },
  "removed": [],
  "repodata_version": 1
}

@kalefranz kalefranz requested a review from a team as a code owner May 7, 2019 22:15
@msarahan
Copy link
Contributor

msarahan commented May 9, 2019

This PR does not currently specify or demonstrate how new packages should look - what the actual data in repodata.json should be. We need that in tests. @kalefranz are you going to have time to develop that, or should we take that on?

The more hacky way that I had done things is very much still in the codebase, and this PR should probably remove it. It's not necessary to keep it because this is a better way that will require much less hacky stuff, but we need the tests for usage of the new file format to reflect this new way of doing things.

@kalefranz
Copy link
Contributor Author

I've updated this PR with a second commit. In addition to functional code changes, it exercises all the package cache code pretty thoroughly with additions to tests/core/test_package_cache_data.py.

I need another hour or so on this though. The one remaining item on my TODO list here is to audit all uses of CONDA_TARBALL_EXTENSION to make sure they are still appropriate.

@kalefranz
Copy link
Contributor Author

OK, pending test success, I'm happy with the state of this PR and welcome review and feedback.

Signed-off-by: Kale Franz <kfranz@continuum.io>
Copy link
Contributor Author

@kalefranz kalefranz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some comments to the code. Need to make at least two more changes.

conda/models/channel.py Outdated Show resolved Hide resolved
conda/models/dist.py Show resolved Hide resolved
conda/models/match_spec.py Show resolved Hide resolved
conda/models/match_spec.py Show resolved Hide resolved
conda/models/match_spec.py Outdated Show resolved Hide resolved
Signed-off-by: Kale Franz <kfranz@continuum.io>
@kalefranz
Copy link
Contributor Author

The failing py27 test is tests/test_api_build.py::test_warning_on_file_clobbering, and it seems to be failing on other branches as well.

The failing conda-build test is tests/test_api_build.py::test_warning_on_file_clobbering. @msarahan do you have any insight into what got out of sync here?

for fn, info in concatv(
iteritems(conda_packages),
((k, legacy_packages[k]) for k in use_these_legacy_keys),
):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msarahan if we want to add a feature flag to opt-in to using the new package format (which we can switch to true by default in a later conda release) then the only change we have to make to do that is localized right here.

@@ -275,17 +260,15 @@ def _check_writable(self):
return i_wri

@staticmethod
def _clean_tarball_path_and_get_checksums(tarball_path, sha256sum=None, md5sum=None):
def _clean_tarball_path_and_get_md5sum(tarball_path, md5sum=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you are reverting a lot of the work I did to prefer sha256 over md5 where possible. Is that intentional? What's your reasoning on that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of using md5 in all of these places is only for quick identification, and to have reasonable assurance that we have the "right" artifact, in the context of distinguishing if the package flask-1.0.2-py_0.tar.bz2 in the package cache came from free, main, conda-forge, or some other channel. We know pretty close to what it should be by the file name, but because we don't include channel information with the bare tarballs in the package cache, we don't know what channel they come from. It's also a quick check to ensure that a package hasn't been updated in-place in the remote channel (and therefore needs to be downloaded again).

Matching the md5 to the expected md5 from repodata gives us the assurance that an artifact that has the right name in a package cache is actually the artifact that we want. To be clear: Using md5 here is not a security measure. Just as we use md5 as a checksum to ensure downloads have not been inadvertently corrupted, we use md5 here for a fast check to make sure the artifact is as we expect (or not).

Neither md5 nor sha256 in this context is sufficient as a security measure to guard against malicious attacks. The purpose of using md5 here is ONLY as a quick checksum. And in these quick checksum cases we should prefer md5 to sha256 since the former is indeed quite a bit faster.

When we add cryptographic signatures, we will be making specific and more targeted use of sha256 + size, in combination.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That makes sense. Is it worth keeping at least for the file downloads, though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't see any problem with that. And while we're at it, we might as well add in size to that function too and get rid of one of those TODOs. Pushing another commit soon with the change.

Signed-off-by: Kale Franz <kfranz@continuum.io>
@kalefranz kalefranz mentioned this pull request May 10, 2019
log.debug("MD5 sums mismatch for download: %s (%s != %s), "
"trying again" % (url, digest_builder.hexdigest(), md5sum))
raise MD5MismatchError(url, target_full_path, md5sum, actual_md5sum)
if md5:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worthwhile to compute both sha256 and md5? My original thought was to only compute the most secure option available. I guess also computing md5 won't add much time, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using sha256 instead of md5 isn't really a valid security measure. You need to couple sha256 with cryptographic signatures to actually have "security" in this context. Without signatures, the best assurance we can have here is that the artifact wasn't inadvertently corrupted by erroneous HTTP over the wire. It's false to think that md5 is any more "secure" in this use case than sha256. (Probably why s3 uses md5 for etags, as I mentioned in #8651.) So while it's not any more secure, you also incur the performance penalty of the heavier sha256.

All that said, we'll probably need to start using sha256 and incorporating that into any upcoming package signing implementation, so while we're really not adding any security now, it makes sense to prepare for the future. I'll make the change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worthwhile to compute both sha256 and md5?

Computing both is wasteful because computing SHA256 usually isn't x times but rather 0.x times slower than computing MD5. On my machine it's roughly 50 %.

Using sha256 instead of md5 isn't really a valid security measure.

Sure.

So while it's not any more secure, you also incur the performance penalty of the heavier sha256.
[...]
All that said, we'll probably need to start using sha256 and incorporating that into any upcoming package signing implementation

For performance reasons you might want to consider SHA512 or SHA512/256 (supported from openssl 1.1.1 onwards, available in Python via cryptography >=2.5). For me ( = not representative, of course), computing SHA512/256 is nearly as performant as MD5:

> podman run --rm -it continuumio/miniconda3 sh -lic '                                                                                                                                                                                                                      
conda create -qynopenssl openssl >/dev/null && conda activate openssl
openssl version
for hash in md5 sha256 sha512 sha512-256 ; do openssl speed -seconds 2 -bytes 1048576 -evp "${hash}" 2>/dev/null | tail -1 ; done
'
OpenSSL 1.1.1b  26 Feb 2019
md5             790102.02k
sha256          515857.24k
sha512          767557.63k
sha512-256      770179.07k

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! thanks for the pointer.

@msarahan msarahan added this to the 4.7.0 milestone May 10, 2019
@msarahan
Copy link
Contributor

The 2 test failures are confusing to me. They're reproducible locally, but only when running that test file as a whole. If those tests are run individually, or run in a subset (--lf after a failing run), then they pass. Clearly something is more stateful than it should be. I have verified that they are not sharing a package cache dir (the test folder paths are very different). If anyone has any ideas for what else to check, please chime in.

@msarahan msarahan merged commit 63fa8ce into conda:master May 14, 2019
This was referenced May 22, 2019
@github-actions
Copy link

Hi there, thank you for your contribution to Conda!

This pull request has been automatically locked since it has not had recent activity after it was closed.

Please open a new issue or pull request if needed.

@github-actions github-actions bot added the locked [bot] locked due to inactivity label Aug 27, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
locked [bot] locked due to inactivity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants