Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add convention for JSON metadata type names and port existing values to the new convention #1844

Closed
wagoodman opened this issue May 24, 2023 · 4 comments · Fixed by #1983
Closed
Assignees
Labels
breaking-change Change is not backwards compatible json Regarding JSON output
Milestone

Comments

@wagoodman
Copy link
Contributor

wagoodman commented May 24, 2023

The pkg.MetadataType constants will be removed from the core API in syft 1.0, however, the JSON format will still be outputting the string representation. There are inconsistencies with how these are crafted:

...
	ConanMetadataType              MetadataType = "ConanMetadataType"
	DartPubMetadataType            MetadataType = "DartPubMetadata"
...

Some of these end in Type and others do not. The correct answer is that they should not end in type. Why? These string constants represent type hints for the JSON format as to how to interpret the pkg.Metadata field. They describe the type names, not a "type of types".

The metadata type names should always be the same names as the structs. So for instance:

	RustCargoPackageMetadataType   MetadataType = "RustCargoPackageMetadata"

is not correct since the struct this represents is called CargoPackageMetadata.

@wagoodman wagoodman added this to the Stabilize user surfaces milestone May 24, 2023
@wagoodman wagoodman added the json Regarding JSON output label May 24, 2023
@wagoodman
Copy link
Contributor Author

Note that any changes we make should still be compatible in our syft decoders. That is, even if we change these names, we should still be able to convert old syft documents and their package metadata without issue.

@wagoodman
Copy link
Contributor Author

wagoodman commented Jul 19, 2023

When writing this issue originally it was really from the viewpoint of "there are a few metadata types which names are not consistent with the current naming convention". However, while digging I came to a realization that there is really not a strong enough convention other than "use the struct type name + add 'Metadata' to the end" which didn't seem grounded enough. I want to take a step back and try and define these types in a way that makes sense for what they need to be used for.

So! What is the MetadataType field used for? There is only one reason for it now: to understand the datashape of the pkg.Metadata field when unmarshaling syft-json.

Something interesting, these metadata type names are coupled to the json schema. Changing these names means there is a schema change, even if it's just the definition name. Does this mean if we rename a type definition that it's a breaking schema change? Using the struct name as the metadata type leaks what is essentially an internal detail into the json output. We want to be resilient to struct renames not causing parsing issues for json consumers.

I want to propose two things:

  1. adding guidelines for naming the MetadataType should be relative to what is being described and not just the type name and
  2. rename the existing metadata types to follow these guidelines

By convention the MetadataType value should follow these rules of thumb:

  • Only use lowercase letters, numbers, and hyphens. Use hyphens to separate words.
  • Try to anchor the name in the ecosystem, language, or packaging tooling it belongs to. For instance pubspec-lockfile is an OK name, but dart-pubspec-lockfile is better.
  • Be as specific as possible to what the data represents. For instance ruby-gem is NOT a good MetadataType value, but ruby-gemspec-guidefile is. Why? Ruby gem information can come from a gemspec file or a Gemfile.lock, which are very different. The latter name provides more context as to what to expect.
  • Should describe what the data is, not how it's used. For instance r-description-installed-file is NOT a good MetadataType value since it's trying to convey that we use the DESCRIPTION file in the R ecosystem to detect installed packages. Instead simply describe what the DESCRIPTION file is itself without context of how it's used: r-description-file.
  • Use guidefile and lockfile suffixes to distinct between manifest files that loosely describe package version requirements ("guide" files) and files that strongly specify one and only one version of a package ("lock" files). For instance swift-cocopods-podfile is ambiguous (this could be referring to a podfile or we only have one method which is parsing the podfile.lock). Using swift-cocopods-pod-guidefile and swift-cocopods-pod-lockfile removes this ambiguity. These should only be used with respect to package managers that have the guide and lock distinction, but would not be appropriate otherwise (e.g. rpm does not have a guide vs lock, so lockfile should not be used to describe a db entry).
  • Use the archive-manifest suffix to indicate a package archive (e.g. rpm file, apk file, etc) that describes the contents of the package.
  • Use the catalog-entry suffix to indicate information about a package that was found in a collection of flat files responsible for describing the state of packages on the system (e.g. apk has an "installed" file, alpm has several "desc" files, portage as several "content" files). When the information is stored in a single database file then use db-entry instead (e.g. entries in the RPM database).
  • Should not contain the phrase "package", though exceptions are allowed (say if the canonical name literally has the phrase package in it).
  • Try to keep exact filename+extensions out of the name. For instance pipfile.lock shouldn't really be in the name, instead try and describe what the file is: python-pipfile-lockfile.

Furthermore I think we should not overload Metadata to represent multiple use cases. They should represent the same thing no matter what instance of that metadata you are referring to . So today:

  • HackageMetadata seems to represent an amalgamation of stack.yaml, stack.lock, and cabal.project, all which are mutually exclusive (unlike with the JavaMetadata where these multiple sources are not mutually exclusive). These use cases should probably be different Metadatas.
  • RpmMetadata is really two use cases: from an RPM DB or RPM archive file, which are mutually exclusive.
  • PhpComposerJSONMetadata is really two use cases: from installed.json and composer.lock, which are mutually exclusive.
  • PythonPackageMetadata is really two use cases: egg metadata files and wheel metadata files, which again, are mutually exclusive.

This ensures when a new field needs to be added to the Metadata for one use case but not another that we don't need to do things like conditionally populate fields in syft or ignore consuming fields downstream.

I'll add another comment with specific impacts to the metadata type names shortly... to be continued! 🍿

@wagoodman
Copy link
Contributor Author

wagoodman commented Jul 19, 2023

Here are the proposed updates to the metadata type names, as they would appear in JSON output:

Today Proposed Comments
AlpmMetadata arch-alpm-catalog-entry
ApkMetadata alpine-apk-catalog-entry
BinaryMetadata binary-signature
CocoapodsMetadataType cocoapod-lockfile (or cocoa-pod-lockfile ?) Should this orient around the cocoa framework or the cocoapods package manager? (are these "pods" in the "cocoapods" canonical centralized package registry? or are the packages themselves "cocoapods"?)
ConanLockMetadataType c-conan-lockfile
ConanMetadataType c-conan-guidefile
DartPubMetadata dart-pubspec-lockfile
DotnetDepsMetadata dotnet-deps-guidefile
DpkgMetadata debian-dpkg-catalog-entry
GemMetadata ruby-gemspec-guidefile
GolangBinMetadata (aka GolangMetadata) go-binary-buildinfo
GolangModMetadata go-module-guidefile
HackageMetadata haskell-hackage-stack-guidefile, haskell-hackage-stack-lockfile, haskell-hackage-cabal-lockfile
JavaMetadata java-archive Though there are multiple use cases, they don't appear to be mutually exclusive
KbPackageMetadata microsoft-kb-package This breaks the rule of thumb by containing "package", but "kb" is odd since it doesn't necessarily mean a "package" (originally stood for "knowledge base")
LinuxKernelMetadata linux-kernel-archive
LinuxKernelModuleMetadata linux-kernel-module
MixLockMetadataType elixir-mix-lockfile
NixStoreMetadata nix-store-catalog-entry Since this is a directory name and not based on file contents it might not be accurate to refer to this as a -catalog-entry
NpmPackageJsonMetadata javascript-npm-guidefile
NpmPackageLockJsonMetadata javascript-npm-lockfile
PhpComposerJsonMetadata php-composer-lockfile, php-composer-catalog-entry Question: is installed.json a composer construct? or a php construct in general? (memory serves that it's composer, but need to look this up)
PortageMetadata gentoo-portage-catalog-entry
PythonPackageMetadata python-egg-catalog-entry, python-wheel-catalog-entry
PythonPipfileLockMetadata python-pipfile-lockfile
PythonRequirementsMetadata python-requirements-guidefile (or python-pip-guidefile) should this be oriented around pip or not? Not to be confused with Pipfile!
RebarLockMetadataType erlang-rebar-lockfile
RDescriptionFileMetadataType r-description-guidefile
RpmdbMetadata rpm-db-entry, rpm-archive-manifest
RustCargoPackageMetadata rust-cargo-lockfile

@wagoodman
Copy link
Contributor Author

wagoodman commented Jul 27, 2023

I've incorporated comments from a team sync and updated:

By convention the MetadataType value should follow these rules of thumb:

  • Only use lowercase letters, numbers, and hyphens. Use hyphens to separate words.
  • Try to anchor the name in the ecosystem, language, or packaging tooling it belongs to. For a package manager for a language ecosystem the language, framework or runtime should be used as a prefix. For instance pubspec-lock is an OK name, but dart-pubspec-lock is better. For an OS package manager, the common OS distribution ancestor name should be used. For instance alpine-apk-db-record is a good name, but apk-db-record is not.
  • Be as specific as possible to what the data represents. For instance ruby-gem is NOT a good MetadataType value, but ruby-gemspec is. Why? Ruby gem information can come from a gemspec file or a Gemfile.lock, which are very different. The latter name provides more context as to what to expect.
  • Should describe WHAT the data is, NOT HOW it's used. For instance r-description-installed-file is NOT a good MetadataType value since it's trying to convey that we use the DESCRIPTION file in the R ecosystem to detect installed packages. Instead simply describe what the DESCRIPTION file is itself without context of how it's used: r-description.
  • Use the lock suffix to distinct between manifest files that loosely describe package version requirements vs files that strongly specify one and only one version of a package ("lock" files). These should only be used with respect to package managers that have the guide and lock distinction, but would not be appropriate otherwise (e.g. rpm does not have a guide vs lock, so lock should NOT be used to describe a db entry).
  • Use the archive suffix to indicate a package archive (e.g. rpm file, apk file, etc) that describes the contents of the package. For example an RPM file that was cataloged would have a rpm-archive metadata type (not to be confused with an RPM DB record entry which would be rpm-db-record).
  • Use the db-record suffix to indicate information about a package that was found as an entry in a DB (or collection of flat files acting like a DB) responsible for describing the state of packages on the system (e.g. apk has an "installed" file, alpm has several "desc" files, portage as several "content" files).
  • Should NOT contain the phrase package, though exceptions are allowed (say if the canonical name literally has the phrase package in it).
  • Should NOT contain have a file suffix unless the canonical name has the term "file", such as a pipfile or gemfile. An example of a bad name for this rule isruby-gemspec-file; a better name would be ruby-gemspec.
  • Should NOT contain the exact filename+extensions. For instance pipfile.lock shouldn't really be in the name, instead try and describe what the file is: python-pipfile-lock (but shouldn't this be python-pip-lock you might ask? No, since the pip package manger is not related to the pipfile project).
  • Should NOT contain the phrase metadata, unless the canonical name has this term.
  • Should represent a single use case. For example, trying to describe Hackage metadata with a single HackageMetadata struct (and thus MetadataType) is not allowed since it represents 3 mutually exclusive use cases: representing a stack.yaml, stack.lock, or cabal.project file. Instead, each of these should have their own struct types and MetadataType values.

Regarding breaking up metadata types by use case, I found a one exception:

  • PythonPackageMetadata: though this is covering the egg and wheel files today, after a little digging through PEPs I found that they are described in the same spec.

Here are the proposed updates to the metadata type names, as they would appear in JSON output:

Today Proposed Comments
AlpmMetadata arch-alpm-db-record
ApkMetadata alpine-apk-db-record
BinaryMetadata binary-signature
CocoapodsMetadataType cocoa-podfile-lock
ConanLockMetadataType c-conan-lock
ConanMetadataType c-conan
DartPubMetadata dart-pubspec-lock
DotnetDepsMetadata dotnet-deps
DpkgMetadata debian-dpkg-db-record
GemMetadata ruby-gemspec
GolangBinMetadata go-module-binary-buildinfo
GolangModMetadata go-module-guidefile
HackageMetadata (split to HackageStackYamlLockMetadata and HackageStackYamlMetadata) haskell-hackage-stack, haskell-hackage-stack-lock
JavaMetadata java-archive
KbPackageMetadata microsoft-kb-patch Looking into if this can be removed
LinuxKernelMetadata linux-kernel-archive
LinuxKernelModuleMetadata linux-kernel-module
MixLockMetadataType elixir-mix-lock
NixStoreMetadata nix-store
NpmPackageJsonMetadata javascript-npm-package
NpmPackageLockJsonMetadata javascript-npm-package-lock
PhpComposerJsonMetadata (split into PhpComposerLockMetadata and PhpComposerInstalledMetadata) php-composer-lock, php-composer-installed
PortageMetadata gentoo-portage-db-record
PythonPackageMetadata python-package Though there are different use cases for this (egg and wheel) they are governed by the same spec
PythonPipfileLockMetadata python-pipfile-lock
PythonRequirementsMetadata python-pip-requirements
RebarLockMetadataType erlang-rebar-lock
RDescriptionFileMetadataType r-description
RpmdbMetadata (split into RpmDBMetadata and RpmArchiveMetadata) redhat-rpm-db-record, rpm-archive
RustCargoPackageMetadata rust-cargo-lockfile

@wagoodman wagoodman changed the title JSON metadata type names should be consistent Add convention for JSON metadata type names and port existing values to the new convention Jul 31, 2023
@wagoodman wagoodman added the breaking-change Change is not backwards compatible label Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change is not backwards compatible json Regarding JSON output
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant