separate conda package metadata under new root-level key in repodata #8639
Conversation
This PR does not currently specify or demonstrate how new packages should look - what the actual data in repodata.json should be. We need that in tests. @kalefranz are you going to have time to develop that, or should we take that on? The more hacky way that I had done things is very much still in the codebase, and this PR should probably remove it. It's not necessary to keep it because this is a better way that will require much less hacky stuff, but we need the tests for usage of the new file format to reflect this new way of doing things.
I've updated this PR with a second commit. In addition to functional code changes, it exercises all the package cache code pretty thoroughly with additions to […]. I need another hour or so on this though. The one remaining item on my TODO list here is to audit all uses of […].
OK, pending test success, I'm happy with the state of this PR and welcome review and feedback.
Signed-off-by: Kale Franz <kfranz@continuum.io>
I've added some comments to the code. Need to make at least two more changes.
Signed-off-by: Kale Franz <kfranz@continuum.io>
The failing py27 test is […]. The failing conda-build test is […].
conda/core/subdir_data.py
```python
for fn, info in concatv(
    iteritems(conda_packages),
    ((k, legacy_packages[k]) for k in use_these_legacy_keys),
):
```
@msarahan if we want to add a feature flag to opt-in to using the new package format (which we can switch to `true` by default in a later conda release), then the only change we have to make to do that is localized right here.
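A sketch of what that localized opt-in could look like. The flag name `use_new_package_format`, the function name, and the dict shapes below are illustrative assumptions, not conda's actual configuration or code:

```python
from itertools import chain


def iter_repodata_packages(conda_packages, legacy_packages,
                           use_new_package_format=True):
    """Yield (filename, info) pairs from a channel's repodata.

    When the flag is off, only the legacy .tar.bz2 entries are yielded,
    so flipping the default later is a one-line change in this one spot.
    """
    if not use_new_package_format:
        return iter(legacy_packages.items())
    # Prefer .conda entries; fall back to .tar.bz2 only for packages
    # that have no .conda counterpart.
    conda_stems = {fn[:-len(".conda")] for fn in conda_packages}
    use_these_legacy_keys = [
        fn for fn in legacy_packages
        if fn[:-len(".tar.bz2")] not in conda_stems
    ]
    return chain(
        conda_packages.items(),
        ((fn, legacy_packages[fn]) for fn in use_these_legacy_keys),
    )
```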
```diff
@@ -275,17 +260,15 @@ def _check_writable(self):
         return i_wri

     @staticmethod
-    def _clean_tarball_path_and_get_checksums(tarball_path, sha256sum=None, md5sum=None):
+    def _clean_tarball_path_and_get_md5sum(tarball_path, md5sum=None):
```
It looks like you are reverting a lot of the work I did to prefer sha256 over md5 where possible. Is that intentional? What's your reasoning on that?
The purpose of using md5 in all of these places is only for quick identification, and to have reasonable assurance that we have the "right" artifact, in the context of distinguishing whether the package `flask-1.0.2-py_0.tar.bz2` in the package cache came from `free`, `main`, `conda-forge`, or some other channel. We know pretty close to what it should be by the file name, but because we don't include channel information with the bare tarballs in the package cache, we don't know what channel they came from. It's also a quick check to ensure that a package hasn't been updated in-place in the remote channel (and therefore needs to be downloaded again).
Matching the md5 to the expected md5 from repodata gives us the assurance that an artifact that has the right name in a package cache is actually the artifact that we want. To be clear: Using md5 here is not a security measure. Just as we use md5 as a checksum to ensure downloads have not been inadvertently corrupted, we use md5 here for a fast check to make sure the artifact is as we expect (or not).
Neither md5 nor sha256 in this context is sufficient as a security measure to guard against malicious attacks. The purpose of using md5 here is ONLY as a quick checksum. And in these quick checksum cases we should prefer md5 to sha256 since the former is indeed quite a bit faster.
When we add cryptographic signatures, we will be making specific and more targeted use of sha256 + size, in combination.
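As a concrete illustration of that quick-checksum role, a cache validation step could look roughly like the following. This is a hypothetical helper sketched for the discussion, not conda's actual implementation:

```python
import hashlib
import os


def matches_repodata_record(path, expected_md5=None, expected_size=None):
    """Fast identity check: does a cached tarball match its repodata entry?

    This only guards against corruption or a package having been replaced
    in-place upstream -- it is NOT a security measure; authenticity would
    require cryptographic signatures on top of any checksum.
    """
    # Size is the cheapest check, so do it first.
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    if expected_md5 is not None:
        digest = hashlib.md5()
        with open(path, "rb") as fh:
            # Hash in 1 MiB chunks to keep memory use flat for large files.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_md5:
            return False
    return True
```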
Thanks. That makes sense. Is it worth keeping at least for the file downloads, though?
Yeah, I don't see any problem with that. And while we're at it, we might as well add in size to that function too and get rid of one of those TODOs. Pushing another commit soon with the change.
```python
log.debug("MD5 sums mismatch for download: %s (%s != %s), "
          "trying again" % (url, digest_builder.hexdigest(), md5sum))
raise MD5MismatchError(url, target_full_path, md5sum, actual_md5sum)
if md5:
```
Is it worthwhile to compute both sha256 and md5? My original thought was to only compute the most secure option available. I guess also computing md5 won't add much time, though.
Using sha256 instead of md5 isn't really a valid security measure. You need to couple sha256 with cryptographic signatures to actually have "security" in this context. Without signatures, the best assurance we can have here is that the artifact wasn't inadvertently corrupted by erroneous HTTP over the wire. It's false to think that md5 is any more "secure" in this use case than sha256. (Probably why s3 uses md5 for etags, as I mentioned in #8651.) So while it's not any more secure, you also incur the performance penalty of the heavier sha256.
All that said, we'll probably need to start using sha256 and incorporating that into any upcoming package signing implementation, so while we're really not adding any security now, it makes sense to prepare for the future. I'll make the change.
> Is it worthwhile to compute both sha256 and md5?

Computing both is wasteful because computing SHA256 usually isn't `x` times but rather `0.x` times slower than computing MD5. On my machine it's roughly 50 %.
> Using sha256 instead of md5 isn't really a valid security measure.
Sure.
> So while it's not any more secure, you also incur the performance penalty of the heavier sha256.
> [...]
> All that said, we'll probably need to start using sha256 and incorporating that into any upcoming package signing implementation
For performance reasons you might want to consider SHA512 or SHA512/256 (supported from OpenSSL `1.1.1` onwards, available in Python via `cryptography >=2.5`). For me (= not representative, of course), computing SHA512/256 is nearly as performant as MD5:
```sh
podman run --rm -it continuumio/miniconda3 sh -lic '
  conda create -qyn openssl openssl >/dev/null && conda activate openssl
  openssl version
  for hash in md5 sha256 sha512 sha512-256 ; do openssl speed -seconds 2 -bytes 1048576 -evp "${hash}" 2>/dev/null | tail -1 ; done
'
```

```
OpenSSL 1.1.1b  26 Feb 2019
md5         790102.02k
sha256      515857.24k
sha512      767557.63k
sha512-256  770179.07k
```
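A similar comparison can be run from Python with nothing but `hashlib` and `timeit`. This is a rough sketch, not a rigorous benchmark; actual numbers depend on the local OpenSSL build, and `sha512_256` may be unavailable on older builds, which is why it is probed via `hashlib.new`:

```python
import hashlib
import timeit


def mib_per_second(algo, payload=b"\0" * (1 << 20), repeats=20):
    """Rough throughput estimate: hash a 1 MiB buffer `repeats` times."""
    seconds = timeit.timeit(
        lambda: hashlib.new(algo, payload).digest(), number=repeats
    )
    return repeats / seconds  # MiB hashed per second


for algo in ("md5", "sha256", "sha512", "sha512_256"):
    try:
        print("%-12s %8.0f MiB/s" % (algo, mib_per_second(algo)))
    except ValueError:
        # hashlib.new raises ValueError when OpenSSL lacks the algorithm.
        print("%-12s unavailable" % algo)
```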
cool! thanks for the pointer.
The 2 test failures are confusing to me. They're reproducible locally, but only when running that test file as a whole. If those tests are run individually, or run in a subset […], they pass.
Signed-off-by: Kale Franz <kfranz@continuum.io>
Hi there, thank you for your contribution to Conda! This pull request has been automatically locked since it has not had recent activity after it was closed. Please open a new issue or pull request if needed.
This PR denormalizes the overlaying/entangled metadata in the original `.conda` package format work (i.e. the `conda_size`, `conda_sha256`, `conda_inner_sha256`, and `conda_outer_sha256` fields). With this PR, `repodata.json` should now add a new root-level `packages.conda` field containing information for all packages with the `.conda` extension. This strategy has several advantages, including:

The root key `packages.conda` was chosen over `conda_packages` so that sorting `repodata.json` keeps the `info` key as the top key in the sort.

An example of what `repodata.json` now looks like is included in this PR as `tests/data/conda_format_repo/win-64/repodata.json`.
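An abbreviated sketch of the resulting shape (the filenames, checksums, and sizes here are illustrative placeholders, not the actual test data from the PR):

```json
{
  "info": {
    "subdir": "win-64"
  },
  "packages": {
    "flask-1.0.2-py_0.tar.bz2": {
      "md5": "…",
      "sha256": "…",
      "size": 123456
    }
  },
  "packages.conda": {
    "flask-1.0.2-py_0.conda": {
      "md5": "…",
      "sha256": "…",
      "size": 98765
    }
  }
}
```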