Handling of package metadata in repodata.json for artifact verification #10713

Open
adriendelsalle opened this issue Jun 8, 2021 · 5 comments
Labels
solver pertains to the solver stale::recovered [bot] recovered after being marked as stale tag::artifact-verification related to artifact verification and related content trust issues

Comments

adriendelsalle commented Jun 8, 2021

Hi!

This is a discussion about how package metadata from the repodata.json index file is handled. It concerns both the conda and conda-content-trust projects.

Introduction/Context

Package signature

Signing the metadata of a package listed in repodata.json is currently done over the canonicalized JSON object at /packages/<fn>. The resulting signature is stored at /signatures/<fn>/<keyid> as a JSON object.

Verifying the metadata obviously requires exactly the same data: any difference in content produces a verification/signature error.
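To illustrate why the bytes must match exactly, here is a minimal sketch. The `canonicalize` helper is a hypothetical stand-in (sorted keys, compact separators) and is not necessarily the canonical form conda-content-trust uses; a SHA-256 digest stands in for the payload a signature would cover.

```python
import hashlib
import json


def canonicalize(record: dict) -> bytes:
    """Hypothetical canonical form: sorted keys, compact separators, UTF-8."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def payload_digest(record: dict) -> str:
    """Digest of the canonical bytes; a signature would cover the same bytes."""
    return hashlib.sha256(canonicalize(record)).hexdigest()


signed = {"name": "test-package", "version": "0.1", "build_number": 0}
reordered = {"version": "0.1", "build_number": 0, "name": "test-package"}
modified = dict(signed, license="BSD")  # one key added after signing

# Key order does not matter once canonicalized, but any content change does.
same_after_reordering = payload_digest(signed) == payload_digest(reordered)
same_after_modification = payload_digest(signed) == payload_digest(modified)
print(same_after_reordering, same_after_modification)
```

Canonicalization absorbs differences in key order, but adding, removing or changing a value yields different bytes, so verification would fail.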

Verification process

That is why verification is performed before any update of the content parsed from repodata.json:
https://github.com/chenghlee/conda/blob/240287064b3b095c6ed304c3fcdadb659e888b76/conda/core/subdir_data.py#L554

From my understanding of the current implementation (do not hesitate to correct me if anything is missing or incorrect):

  • pros/benefits
    • the cache does not need to store any signature: all packages are already verified and the signature status is pickled as part of the PackageRecord
  • cons/drawbacks
    • the cost of verification may be huge for large files containing thousands of packages
    • many metadata_signature_status entries may be logged for packages that are not involved in the current operation
    • if strict verification is implemented later, a single failing package verification may interrupt the operation even when that package is not needed (and thus never fetched from the server)
    • if an operation run without artifact verification downloads a repodata.json, a later operation with artifact verification would treat the cache as trusted, which is dangerous
    • we may need a zero-trust policy on that file
      • interoperability between clients is then much easier
    • binary formats (pickled/.solv files) may store/cache a trusted verification status

Propositions

Package signature

Another implementation would be to check signatures of packages lazily before fetching them:

  • pros/benefits
    • verification is done on a small subset of the repodata
    • caching is still possible when a package has already been verified
  • cons/drawbacks
    • requires either storing the signed package metadata or being able to re-serialize it lazily when needed
      • keeping track of the "raw" signed package metadata means duplication and is not memory efficient (especially for large repodata.json files)
      • serialization means information has to be normalized and unchanged
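A minimal sketch of the lazy approach, under stated assumptions: `LazyVerifier` and `fake_verify` are hypothetical names, the real check would be a conda-content-trust call, and the raw signed metadata is assumed to still be available unmodified (which is exactly the drawback listed above).

```python
from typing import Callable, Dict, Iterable, Set


class LazyVerifier:
    """Verify package metadata only for the files about to be fetched."""

    def __init__(self, repodata: dict, verify: Callable[[dict, dict], bool]):
        self._packages: Dict[str, dict] = repodata.get("packages", {})
        self._signatures: Dict[str, dict] = repodata.get("signatures", {})
        self._verify = verify              # hypothetical verification callable
        self._verified: Set[str] = set()   # filenames that already passed
        self.calls = 0                     # number of real verifications performed

    def check(self, filenames: Iterable[str]) -> None:
        for fn in filenames:
            if fn in self._verified:
                continue  # already verified earlier, no new work
            self.calls += 1
            if not self._verify(self._packages[fn], self._signatures.get(fn, {})):
                raise ValueError(f"metadata signature check failed for {fn}")
            self._verified.add(fn)


def fake_verify(metadata: dict, signatures: dict) -> bool:
    """Stand-in for a real check: accepts any record with at least one signature."""
    return bool(signatures)


names = [f"pkg-{i}-0.tar.bz2" for i in range(1000)]
repodata = {
    "packages": {fn: {"name": fn.split("-0.")[0]} for fn in names},
    "signatures": {fn: {"keyid": {"signature": "00"}} for fn in names},
}

verifier = LazyVerifier(repodata, fake_verify)
verifier.check(["pkg-1-0.tar.bz2", "pkg-2-0.tar.bz2"])
verifier.check(["pkg-2-0.tar.bz2", "pkg-3-0.tar.bz2"])  # pkg-2 is not verified again
print(verifier.calls)
```

Only the packages selected for download are checked, and a package that already passed is not checked twice within the same process.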

Storage of signed package metadata

For the proposed implementation, as well as for other possible technical solutions, it would be useful to be able to reproduce or store the signed package metadata.

Here are a few possible solutions I would like to discuss:

Normalization

Normalizing the signed package metadata looks like a good way to provide flexibility about where signatures are verified, with no extra memory cost, no extra I/O and little CPU usage.

  • EDIT: It may or may not mean signing a subset of the information provided.
  • some keys may/must be mandatory
  • gives capability to serialize back the information easily

An example of file->deserialization->serialization difficulty is:

  • a missing depends key will be deserialized as an empty array/vector
  • serialization then is ambiguous, it may be:
    • an empty array depends: []
    • missing key/val pair
  • normalization may help here by stating explicitly how to handle that case
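The ambiguity can be reproduced with a small sketch. The `Record` dataclass is a hypothetical in-memory representation (not conda's PackageRecord), and `canonicalize` is an assumed canonical form used only for comparison.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Record:
    """Hypothetical in-memory record; a missing `depends` defaults to []."""
    name: str
    version: str
    depends: List[str] = field(default_factory=list)


def canonicalize(obj: dict) -> str:
    """Assumed canonical form: sorted keys, compact separators."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


raw = {"name": "test-package", "version": "0.1"}  # no "depends" key in the file
record = Record(**raw)                             # deserialization fills in []
round_tripped = asdict(record)                     # naive re-serialization

print(canonicalize(raw))
print(canonicalize(round_tripped))
```

The two strings differ only by the `depends` entry, yet that is enough to invalidate a signature computed over the original object.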

Here is a first proposition of keys:

  • build
  • build_number
  • depends
  • license
  • md5
  • name
  • noarch
  • sha256
  • size
  • subdir
  • timestamp
  • version

Type definitions are well specified in https://github.com/conda/schemas, and it could be nice to have a dedicated JSON schema for this. It might be redundant with https://github.com/conda/schemas/blob/master/repodata-1.schema.json, in which case that schema could be used as-is (TBD).
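As a sketch of what such a normalization could look like, assuming the signed content is restricted to the keys proposed above and that `depends` is always emitted (both are choices made for this example, not decisions): `normalize` and `signable_bytes` are hypothetical helpers.

```python
import json

# Keys taken from the proposition above.
SIGNED_KEYS = (
    "build", "build_number", "depends", "license", "md5", "name",
    "noarch", "sha256", "size", "subdir", "timestamp", "version",
)

# Assumption for this sketch: list-valued keys are always present, empty if absent.
LIST_KEYS = ("depends",)


def normalize(record: dict) -> dict:
    """Keep only the signed keys and make the empty-list case explicit."""
    out = {key: record[key] for key in SIGNED_KEYS if key in record}
    for key in LIST_KEYS:
        out.setdefault(key, [])
    return out


def signable_bytes(record: dict) -> bytes:
    """Bytes that would be signed/verified under this hypothetical rule."""
    return json.dumps(normalize(record), sort_keys=True, separators=(",", ":")).encode("utf-8")


# Same package seen as raw file content and as a modified in-memory object.
from_file = {"name": "test-package", "version": "0.1", "build": "0", "license_family": "BSD"}
in_memory = {"name": "test-package", "version": "0.1", "build": "0", "depends": [],
             "fn": "test-package-0.1-0.tar.bz2"}

print(signable_bytes(from_file) == signable_bytes(in_memory))
```

With an explicit rule, the bytes to verify can be rebuilt from the in-memory object at any point, regardless of extra keys added during deserialization.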

Storing signed metadata

Another option would be to parse the repodata.json file only once. Deserialization modifies the resulting structure(s) (some information is added, some is changed), so re-serialization is not 100% consistent with the initial data. To compensate, we could also attach the serialized initial metadata to the deserialized object.

This storage would amount to >95% duplicated information, and may be prohibitive for large repodata files (memory footprint).

Maybe signing a hash (SHA-256) of this metadata instead would make the storage feasible and affordable. The security impact has to be assessed.
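To give an idea of the storage difference, here is a small sketch comparing the serialized metadata of the sample package from this thread with its SHA-256 digest. The compact serialization is an assumption for this example, and the security implications are not covered.

```python
import hashlib
import json

record = {
    "build": "0", "build_number": 0, "depends": [], "license": "BSD",
    "license_family": "BSD", "md5": "2a8595f37faa2950e1b433acbe91d481",
    "name": "test-package", "noarch": "generic",
    "sha256": "b908ffce2d26d94c58c968abf286568d4bcf87d1cfe6c994958351724a6f6988",
    "size": 5719, "subdir": "noarch", "timestamp": 1613117294885, "version": "0.1",
}

# Assumed serialization of the signed metadata (sorted keys, compact separators).
raw = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
digest = hashlib.sha256(raw).digest()

print(len(raw), "bytes of serialized metadata")
print(len(digest), "bytes of digest")
```

The digest has a fixed size of 32 bytes per package, whatever the size of the metadata it covers.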

Parsing repodata multiple times

After solving and just before fetching data, the repodata.json files are stored in the cache folder, so it is still possible to parse them again to get the original package metadata for the targets to be downloaded.

This option has the advantage of working without any change, but I would not recommend it for performance reasons.
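A minimal sketch of this option, assuming the cached file has the same layout as repodata.json. `original_metadata` is a hypothetical helper and the temporary file is only a stand-in for a real cache entry.

```python
import json
import os
import tempfile
from typing import Dict, Iterable


def original_metadata(cache_path: str, filenames: Iterable[str]) -> Dict[str, dict]:
    """Re-read a cached repodata.json and return the untouched records and signatures."""
    with open(cache_path, "r", encoding="utf-8") as f:
        repodata = json.load(f)  # the whole file is parsed again: this is the costly part
    packages = repodata.get("packages", {})
    signatures = repodata.get("signatures", {})
    return {
        fn: {"metadata": packages[fn], "signatures": signatures.get(fn, {})}
        for fn in filenames
    }


# Tiny stand-in for a cached file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "repodata.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(
            {
                "packages": {"a-1-0.tar.bz2": {"name": "a"}, "b-1-0.tar.bz2": {"name": "b"}},
                "signatures": {"a-1-0.tar.bz2": {"keyid": {"signature": "00"}}},
            },
            f,
        )
    targets = original_metadata(path, ["a-1-0.tar.bz2"])

print(targets)
```

The second parse pays the full JSON decoding cost of each file even when only a handful of packages are needed.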

Feedback much appreciated!

adriendelsalle changed the title from "Normalize signed package metadata of repodata.json" to "Handling of package metadata in repodata.json for artifact verification" on Jun 10, 2021
@adriendelsalle
Author

adriendelsalle commented Jun 16, 2021

A few metrics:

Performance of artifact verification

It takes about 200 µs to verify one package metadata signature.
We can therefore estimate that a full verification of repodata.json for the conda-forge channel on Linux would take about 50 s:

  • 59934 + 188427 package metadata records to check (noarch and linux-64 subdirs, respectively)
  • assuming a single signature per package
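The estimate is simple arithmetic on the figures quoted above, spelled out here as a worked example:

```python
NOARCH_PACKAGES = 59934
LINUX64_PACKAGES = 188427
SECONDS_PER_SIGNATURE = 200e-6  # about 200 microseconds per verification

total_packages = NOARCH_PACKAGES + LINUX64_PACKAGES
total_seconds = total_packages * SECONDS_PER_SIGNATURE

print(total_packages, "packages,", round(total_seconds, 1), "seconds")
```

This gives 248361 packages and roughly 49.7 seconds, hence the "about 50 s" above.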
conda-content-trust snippet...

import json
import time

from conda_content_trust.authentication import verify_delegation as verify_trust_delegation
from conda_content_trust.signing import wrap_as_signable

PKG = "test-package-0.1-0.tar.bz2"
N = 10000  # number of repetitions used to average the timing

with open("./noarch/repodata.json", "r") as f:
    repodata = json.load(f)

# Wrap the package metadata as a signable and attach its signatures
signable = wrap_as_signable(repodata["packages"][PKG])
signable["signatures"].update(repodata["signatures"][PKG])

# Loaded for completeness; the timing loop below only uses key_mgr
with open("1.root.json") as f:
    trusted_root = json.load(f)

with open("key_mgr.json") as f:
    key_mgr = json.load(f)

# Time N verifications of the same package metadata against the key_mgr role
start = time.perf_counter()
for _ in range(N):
    verify_trust_delegation("pkg_mgr", signable, key_mgr)
elapsed = time.perf_counter() - start

print("Verification takes:", round(elapsed * 1e6 / N, 0), "us/package metadata")

repodata sample...

{
  "info": {
    "subdir": "noarch"
  },
  "packages": {
    "test-package-0.1-0.tar.bz2": {
      "build": "0",
      "build_number": 0,
      "depends": [],
      "license": "BSD",
      "license_family": "BSD",
      "md5": "2a8595f37faa2950e1b433acbe91d481",
      "name": "test-package",
      "noarch": "generic",
      "sha256": "b908ffce2d26d94c58c968abf286568d4bcf87d1cfe6c994958351724a6f6988",
      "size": 5719,
      "subdir": "noarch",
      "timestamp": 1613117294885,
      "version": "0.1"
    }
  },
  "packages.conda": {},
  "removed": [],
  "repodata_version": 1,
  "signatures": {
    "test-package-0.1-0.tar.bz2": {
      "f46b5a7caa43640744186564c098955147daa8bac4443887bc64d8bfee3d3569": {
        "signature": "0a50063539baf249970f1d08b07f00f544e2d87982826790e9ec6e80874ad90aec21a9607cf38bb58897163533c39cb4a4f1c741a7f8e9e4f67e2ff2087d2d00"
      }
    }
  }
}

roles...

Root role:

{
  "signatures": {
    "2b920f88531576643ada0a632915d1dcdd377557647093f29cbe251ba8c33724": {
      "other_headers": "04001608001d1621040673d781a8b80bcb7b002040ac7bc8bcf821360d050260b687a1",
      "signature": "8eecc8f58df848f7af0188fbb47f99a0f2622f8a32ab8ede6340507fc48b8785c96a217c17889d39154c290d99ac0bb6ca75c971f913778598dbab368b49040e"
    }
  },
  "signed": {
    "delegations": {
      "key_mgr": {
        "pubkeys": [
          "013ddd714962866d12ba5bae273f14d48c89cf0773dee2dbf6d4561e521c83f7"
        ],
        "threshold": 1
      },
      "root": {
        "pubkeys": [
          "2b920f88531576643ada0a632915d1dcdd377557647093f29cbe251ba8c33724"
        ],
        "threshold": 1
      }
    },
    "expiration": "2022-06-01T19:16:49Z",
    "metadata_spec_version": "0.6.0",
    "timestamp": "2021-06-01T19:16:49Z",
    "type": "root",
    "version": 1
  }
}

Key mgr role:

{
  "signatures": {
    "013ddd714962866d12ba5bae273f14d48c89cf0773dee2dbf6d4561e521c83f7": {
      "signature": "20d8728ae8ba212e6229f9a69b3de14cd747fcd20cfaa1c5d39111cc6aad7a94036187a6c49e13a531d08c282a0d11b07c276d0f0773dc5344f54a14fb0d7700"
    }
  },
  "signed": {
    "delegations": {
      "pkg_mgr": {
        "pubkeys": [
          "f46b5a7caa43640744186564c098955147daa8bac4443887bc64d8bfee3d3569"
        ],
        "threshold": 1
      }
    },
    "expiration": "2022-06-01T19:16:49Z",
    "metadata_spec_version": "0.6.0",
    "timestamp": "2021-06-01T19:16:49Z",
    "type": "key_mgr",
    "version": 1
  }
}

Storing signed metadata

Impact on the binary cache file (pickled/.solv) of storing package metadata as a string on the deserialized object:

  • all metadata
  • only extra key/val pairs that are not deserialized

The evaluation was made on the .solv format (mamba implementation using libsolv), using the conda-forge repodata files.

file \ repodata        noarch                                    linux-64
                       size       % of JSON  % of current .solv  size        % of JSON  % of current .solv
JSON                   32316251   ref        -                   113309150   ref        -
current .solv          8408388    26%        ref                 27190252    24%        ref
.solv + all metadata   35989396   111%       428%                124692708   110%       459%
.solv + extra pairs    9077862    28%        108%                29348174    26%        108%
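The percentages can be recomputed from the raw sizes with a short script. The sizes are copied from the figures above; the `percent` helper is only for illustration.

```python
sizes = {
    "noarch": {
        "JSON": 32316251,
        "current .solv": 8408388,
        ".solv + all metadata": 35989396,
        ".solv + extra pairs": 9077862,
    },
    "linux-64": {
        "JSON": 113309150,
        "current .solv": 27190252,
        ".solv + all metadata": 124692708,
        ".solv + extra pairs": 29348174,
    },
}


def percent(size: int, reference: int) -> int:
    """Size relative to a reference, as a rounded percentage."""
    return round(100 * size / reference)


for subdir, files in sizes.items():
    for label, size in files.items():
        vs_json = percent(size, files["JSON"])
        vs_solv = percent(size, files["current .solv"])
        print(f"{subdir:9} {label:22} {vs_json:4}% of JSON {vs_solv:4}% of current .solv")
```

Storing all metadata makes the .solv file larger than the JSON it was built from, while storing only the extra key/val pairs costs about 8% on top of the current .solv file.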

@jezdez jezdez added tag::artifact-verification related to artifact verification and related content trust issues solver pertains to the solver labels Sep 2, 2021
@mlschroe

mlschroe commented Jul 20, 2022

Is that with libsolv commit e13455d011710a99ef1dfb33432044cc7eae0efb?

@adriendelsalle
Author

It was done with the branch used for the related PR on libsolv linked just above

@github-actions

Hi there, thank you for your contribution!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed automatically if no further activity occurs.

If you would like this issue to remain open please:

  1. Verify that you can still reproduce the issue at hand
  2. Comment that the issue is still reproducible and include:
    - What OS and version you reproduced the issue on
    - What steps you followed to reproduce the issue

NOTE: If this issue was closed prematurely, please leave a comment.

Thanks!

@github-actions github-actions bot added the stale [bot] marked as stale due to inactivity label Jul 22, 2023
@jezdez
Member

jezdez commented Jul 27, 2023

Not stale

@github-actions github-actions bot added stale::recovered [bot] recovered after being marked as stale and removed stale [bot] marked as stale due to inactivity labels Jul 28, 2023