
Top-level API should be more composable #558

Closed
5 tasks done
wagoodman opened this issue Oct 16, 2021 · 26 comments · Fixed by #2517
Labels: breaking-change (Change is not backwards compatible), enhancement (New feature or request)

@wagoodman
Contributor

wagoodman commented Oct 16, 2021

What would you like to be added:
More reusable primitives for when syft is used as a library, supporting at least the following tasks:

This should probably be done after (that is, this issue is most likely related to...):

Why is this needed:
The existing top-level API functions perform a full parse-and-catalog processing unit, whereas separating parsing and cataloging into separate functions is useful if you want to reuse the same parsed image (which takes a while) across different cataloger calls (which, relatively speaking, don't take that long).

This would additionally improve interoperability with anchorectl.

Tasks

@wagoodman wagoodman added the enhancement New feature or request label Oct 16, 2021
@wagoodman wagoodman added this to the Syft 1.0 milestone Oct 16, 2021
@wagoodman wagoodman added the breaking-change Change is not backwards compatible label Oct 19, 2021
@luhring
Contributor

luhring commented Oct 19, 2021

Adding my two cents:

  1. I love this. I see "Syft-as-a-library" consumption of this project becoming big as time goes on. ❤️
  2. It'd be great if we can try to use Syft as a library more ourselves during and even before taking this one, so we have a nice consumer-centric feedback loop to influence the API design.
  3. In the updated design, it'd be awesome to really think about the top-level "entrypoint". Today we have syft.CatalogPackages. This function's parameters are Syft types, which means this isn't a function consumers can easily start using right away — instead they have to figure out other Syft function calls before calling this function. It'd be great to think about an elegant way of simplifying the work of invoking Syft for callers, and allowing configurability on an "as needed" basis. One pattern I've seen a lot is to have a function take a config struct as an argument. We could add API sugar for creating and adjusting this struct ahead of time. And it'd be great to have the function's (and all inner functions') configuration be driven solely by config parameters, not side-effectful references to a global config or a config file, so that library consumers have a predictable and tightly controlled experience invoking Syft functionality.

@wagoodman
Contributor Author

Example of using syft as a lib today: https://gist.github.com/wagoodman/57ed59a6d57600c23913071b8470175b

@wagoodman
Contributor Author

From refinement:

  • Do we want to encapsulate the stereoscope primitives or not? Callers should not have to use stereoscope.* to interact with syft, even if objects from stereoscope are what we end up allowing to be passed around the API.
  • It would be ideal to not expose behavior (e.g. tasks) and instead expose declarations (e.g. cataloger configurations?)

@wagoodman
Contributor Author

wagoodman commented Mar 3, 2022

Update: much of this has been addressed in #864

Also, I think we should distinguish scoped capabilities of syft vs the pure definitions needed to interop with primitives that syft works with.

An example: Syft can encode and decode various formats (spdx, cyclonedx, etc). Additionally, syft has the definition of what a Format is... something that can encode an SBOM into bytes, decode an SBOM from bytes, and validate the underlying bytes to see if they represent the format the object represents.

These are probably not strongly separated enough in today's API.

The definition of a format should be more closely related to that of an SBOM, since that context is semantically useful (as opposed to a format on its own). For example, format.Format does not communicate what the format is for, whereas sbom.Format scopes the concept to formats that deal specifically with SBOMs.

Dealing with the capabilities of syft, these should probably be elevated to a top-level package concern, away from the Format definitions. That is, a format.Option type with a value of format.CycloneDxXMLOption seems like it deals with the semantics of a format, when it is really describing a user option for syft as a capability. This hints at elevating these concepts up to the syft package, where there would be a syft.FormatOption with a value of syft.CycloneDxXMLFormatOption.

Based on some of the above points, here is a suggested update to the syft API:

// in the top-level "syft" package

// migrated from today's "format" package
func FormatByOption(option FormatOption) sbom.Format 

func FormatByName(name string) sbom.Format

// migrated from today's "format" package
func IdentifyFormat(by []byte) sbom.Format

func Encode(s sbom.SBOM, f sbom.Format) ([]byte, error)

func Decode(reader io.Reader) (*sbom.SBOM, sbom.Format, error)

Edit: in case there are functions within syft that may require encoding/decoding functionality, this should be migrated to its own package (say, syft/format).

@wagoodman
Contributor Author

A topic that was brought up by @westonsteimel and @luhring was how syft versions are expressed in SBOMs when using syft as a lib. Today we have an internal/version package whose values are overridden via build options for the syft application, providing useful values based on the available git information. This allows SBOMs to have a descriptor block recording that syft at version vX.Y.Z produced the SBOM document. When using syft as a lib, the caller must provide this information.

Ideally the syft API should be able to surface this information automatically, but additionally allow an override in case wrapping tooling prefers to provide a different name and version.

@luhring suggested that we could use the runtime buildinfo section from the running binary to provide this information. I think this is a good idea, however we need to account for the edge case when the buildinfo section is stripped from the binary too.

@wagoodman
Contributor Author

wagoodman commented Mar 3, 2022

It would be ideal to not expose behavior (e.g. tasks) and instead expose declarations (e.g. cataloger configurations?)

Today we have this function call at the top-level syft package:

func CatalogPackages(src *source.Source, cfg cataloger.Config) (*pkg.Catalog, []artifact.Relationship, *linux.Release, error) 

A few points about this:

  • This only captures cataloging packages and not any other artifacts
  • Produces a package catalog and related data instead of an SBOM itself

It would be great to move towards something like this:

// where "options" are declarative and augment/select cataloging capabilities
func CatalogSource(src *source.Source, /*... options...*/ ) (*sbom.SBOM, error) 

or maybe injection of pre-constructed catalogers

func CatalogSource(src *source.Source, catalogers...) (*sbom.SBOM, error) 

The downside with the second suggestion:

  • There is no common cataloger interface, and probably no value in creating one since we catalog multiple kinds of artifacts
  • Caller construction hints at providing more helper functions to do this construction

... which makes me lean towards the first suggestion.

The upside of taking source.Source is that we can reuse the source object across multiple calls (getting the source object is very expensive, so enabling caller reuse is advised).

It'd be great to think about an elegant way of simplifying the work of invoking Syft for callers...

We could additionally introduce a single call that would wrap source creation and cataloging together into one call (which would use CatalogSource).

func Catalog(input string, /*... cataloger options...*/) (*sbom.SBOM, error) 

This could be the "one stop shop" call that the cmd package would leverage.

@wagoodman
Contributor Author

wagoodman commented Mar 14, 2022

Next up: what kind of cataloger options are we looking for here? Short answer: functional options that mostly represent existing configuration options seem the best fit, as they allow for flexibility and sane defaults, and aren't mutually exclusive with a user-provided struct instance approach.

Let's assume we have something that looks like:

func Catalog(src *source.Source, options ...CatalogingOption) (*sbom.SBOM, error)

Where CatalogingOption is:

type CatalogingOption func(*source.Source, *CatalogingConfig) error

I propose the following options as a start:

# run all catalogers in serial fashion (default is to run in parallel)
WithoutConcurrency() CatalogingOption

# set the file resolver scope for all catalogers (except the secrets cataloger)
WithScope(scope source.Scope) CatalogingOption

# the tool name and version to show in the SBOM descriptor block (defaults to syft and current library version if not provided)
WithToolIdentification(name, version string) CatalogingOption 

# in the syft-json format, we allow for outputting application configuration (defaults to empty)
WithToolConfiguration(c interface{}) CatalogingOption 

# set the package catalogers you wish to use (must be pre-constructed)
WithPackageCatalogers(catalogers ...pkg.Cataloger) CatalogingOption

# use the package catalogers suited by the source provided when cataloging (default is to use this over `WithPackageCatalogers`)
WithDefaultPackageCatalogers(cfg packages.SearchConfig) CatalogingOption

# enable the file metadata cataloger (default is off)
WithFileMetadata() CatalogingOption

# enable the file digests cataloger (default is off)
WithFileDigests(hashes ...crypto.Hash) CatalogingOption

# enable the secrets cataloger (default is off)... the argument is a struct with all of today's cataloger options
WithSecrets(secretConfig *file.SecretsCatalogerConfig) CatalogingOption

# enable the file classification cataloger with default classifiers (default is off)
WithFileClassification() CatalogingOption

# set the specific file classifiers, and enables the file classification (default is off)
WithFileClassifiers(classifiers ...file.Classifier) CatalogingOption

# enable the file content cataloger (default is off)
WithFileContents(globs ...string) CatalogingOption 

# set the file size limits for the contents and secrets catalogers (default is 1 MB for both)
WithFileSizeLimit(byteLimit int64) CatalogingOption

Additionally, in case an API user has their own way to generate a CatalogingConfig, it can be provided in place of any additional arguments:

# override an entire config
WithConfig(override CatalogingConfig) CatalogingOption 

This works well for situations where you wish to bring your own defaults, but still allows augmentation via more functional options.

This is also well suited for syft, the application, where we can build our own CatalogingConfig instance based off of the application configuration.

Using these options within the syft application would look something like this:

func generateSBOM(src *source.Source) (*sbom.SBOM, error) {
	catalogingConfig, err := appConfig.ToCatalogingConfig()
	if err != nil {
		return nil, err
	}

	return syft.Catalog(src,
		syft.WithConfig(*catalogingConfig),
	)
}

Though other external users may choose to use it with defaults and augment behavior:

# catalog all packages, file metadata, and digests
syft.Catalog(src,
		syft.WithFileMetadata(),
		syft.WithFileDigests(crypto.SHA256),
	)

Questions:

  • there seems to be a consistency problem with how you enable file-based catalogers and turn off/change package cataloging. Is there a better way to express this?

@wagoodman
Contributor Author

wagoodman commented Mar 14, 2022

Separate from the options (but related) is the configuration object used to describe behavior. Based on the options and today's configuration, I propose a CatalogingConfig that looks something like this:

type CatalogingConfig struct {
	// tool-specific information
	ToolName          string
	ToolVersion       string
	ToolConfiguration interface{}
	// applies to all catalogers
	Scope                source.Scope
	ProcessTasksInSerial bool
	// package
	PackageCatalogers []pkg.Cataloger
	// file metadata
	CaptureFileMetadata bool
	DigestHashes        []crypto.Hash
	// secrets
	CaptureSecrets bool
	SecretsConfig  file.SecretsCatalogerConfig // struct derived from today's secrets cataloger options
	SecretsScope   source.Scope
	// file classification
	ClassifyFiles   bool
	FileClassifiers []file.Classifier
	// file contents
	ContentsConfig file.ContentsCatalogerConfig // struct derived from today's contents cataloger options
}

The idea is that all field types are useful in their current form and require no further processing. For example, SecretsScope source.Scope is preferred over SecretsScope string since a string requires further processing and verification... so we stick with the "rich" object.

With the default options that can be fetched:

func DefaultCatalogingConfig() CatalogingConfig {
	return CatalogingConfig{
		Scope:           source.SquashedScope,
		ToolName:        internal.ApplicationName,
		ToolVersion:     extractSyftVersionFromLib(), // todo: this function does not exist, but differs from version.FromBuild()
		SecretsScope:    source.AllLayersScope,
		SecretsConfig:   file.DefaultSecretsCatalogerConfig(), // similar concept, but provided next to the cataloger
		FileClassifiers: file.DefaultClassifiers(),  // similar concept, but provided next to the cataloger
		ContentsConfig:  file.DefaultContentsCatalogerConfig(),  // similar concept, but provided next to the cataloger
	}
}

Why export the config and fields when we already have options? This enables someone to provide their own config and options as they see fit (using WithConfig, and with the syft application's use case in mind).

@wagoodman
Contributor Author

from refinement:

  • try to be minimal with the options
  • are there any more global default functional options that could be used here? (not just individual catalogers, but across different cataloger types)

@lyzs90

lyzs90 commented Mar 17, 2022

This is a great initiative, can't wait for it!

I have a use-case for defining a custom package cataloger. Pre-construction is achievable since the Cataloger interface is exposed. However, I'm not sure if there's a way to extend (not override) the list of default package catalogers with the proposed API.

@wagoodman
Contributor Author

wagoodman commented Mar 22, 2022

not sure if there's a way to extend (not override) the list of default package catalogers with the proposed API

@lyzs90 there is one way with the proposed API, but I admit it isn't straightforward:

config := syft.DefaultCatalogingConfig()
config.PackageCatalogers = append(config.PackageCatalogers, <your catalogers>...)
syft.Catalog(src,
    syft.WithConfig(config),
    //... any other overriding options
)

We could add an additional helper which would append and not override the list of package catalogers:

// current proposed "set/override" operation
WithPackageCatalogers(catalogers ...pkg.Cataloger) CatalogingOption

// possible new "append" operation
WithAdditionalPackageCatalogers(catalogers ...pkg.Cataloger) CatalogingOption

How does that sound?

@lyzs90

lyzs90 commented Mar 22, 2022

@wagoodman Now that I think of it, WithAdditionalPackageCatalogers may not be necessary after all, because we can just call WithPackageCatalogers with the extended list of catalogers.

@wagoodman
Contributor Author

wagoodman commented Mar 22, 2022

update: see #558 (comment) for a more up to date suggestion

Most of this conversation has been about the syft package, however, that top-level API may require callers to use other packages for inputs (such as cataloger instances). For this reason I wanted to talk through some thoughts around changes to package organization.

Here is the current cataloger organization today:

syft/
└── pkg/
    └── cataloger/
        ├── apkdb/
        ├── common/
        │   ├── cpe/                        // contains CPE generation logic
        │   ├── generic_cataloger.go        // used by most other catalogers today
        │   ├── generic_cataloger_test.go
        │   └── parser.go
        ├── deb/
        ├── golang/
        ├── java/
        ├── javascript/
        ├── php/
        ├── python/
        ├── rpmdb/
        ├── ruby/
        ├── rust/
        ├── catalog.go          // entrypoint for package cataloging -- cataloger.Catalog()
        ├── cataloger.go        // cataloger interface + image/dir cataloger sets -- cataloger.Cataloger & cataloger.ImageCatalogers()
        ├── config.go           // config needed for cataloging (references the search config)
        └── search_config.go    // config that is common to how all catalogers search

This is good, but there is room for improvement on what the package names convey:

  • common contains resources that are common to all package catalogers; however, it organizes both a common cataloger as well as resources (such as CPE generation). I think that, like the other catalogers, the common cataloger should be in its own package. Additionally, all things relating to CPEs should probably be elevated higher in the package tree.
  • cataloger organizes a set of cataloger packages but also has common "wiring" code as well. How you invoke these wiring functions is a little odd: cataloger.Catalog(). The package name is redundant relative to the verb and doesn't scope what is being cataloged (as the other cataloging packages do). We could put these wiring elements into pkg (which may even be ideal); however, there are overlapping concepts here (pkg.Catalog the collection store and pkg.Catalog which would come from cataloger.Catalog). I think the next most ideal spot would be to carve out a pkgs or packages package that indicates that this is for cataloging multiple kinds of packages.
  • The definition of a package cataloger is within cataloger and not pkg. The location for this abstraction seems misplaced, since there is no reason for it to reside far away from what it describes (how to catalog packages, so within the pkg package).

This culminates in a final tree that looks like so:

syft/
├── cpe/       // contains CPE generation logic
└── pkg/
    ├── cataloger/
    │   ├── apkdb/
    │   ├── generic/
    │   │   ├── cataloger.go
    │   │   ├── cataloger_test.go
    │   │   └── parser.go
    │   ├── deb/
    │   ├── golang/
    │   ├── java/
    │   ├── javascript/
    │   ├── packages/
    │   │   ├── catalog.go        // entrypoint for package cataloging -- packages.Catalog()
    │   │   ├── catalogers.go     // image/dir cataloger sets -- packages.ImageCatalogers()
    │   │   └── search_config.go  // config that is common to how all catalogers search
    │   ├── php/
    │   ├── python/
    │   ├── rpmdb/
    │   ├── ruby/
    │   └── rust/
    └── cataloger.go   // cataloger interface -- pkg.Cataloger

Lastly, providing a helper function that ties together source schemes and sets of package catalogers could be provided within the packages package:

package packages

func CatalogersBySourceScheme(scheme source.Scheme, cfg SearchConfig) []pkg.Cataloger {
	switch scheme {
	case source.ImageScheme:
		return InstalledCatalogers(cfg)
	case source.FileScheme:
		return AllCatalogers(cfg)
	case source.DirectoryScheme:
		return IndexCatalogers(cfg)
	}
	return nil
}

@wagoodman
Contributor Author

wagoodman commented Mar 22, 2022

In a similar vein to the last comment about package organization, I wanted to touch on the differences between the file and pkg packages.

The current organization of pkg is such:

syft/
└── pkg/
    ├── cataloger/
    │   └── (collections of catalogers, organized in packages)
    └── (basic definitions relating to packages)

The current organization of file is as such:

syft/
└── file/
    ├── (collections of catalogers, not organized in packages)
    └── (basic definitions relating to files)

Ideally the organization of file would be more like pkg, where:

  • capabilities (catalogers) are kept separate from basic file-related definitions
  • catalogers are organized into their own packages (e.g. syft/file/cataloger/secrets contains the secrets cataloger only... syft/file/cataloger/filemetadata contains the file metadata cataloger)

The final layout would look like:

syft/
└── file/
    ├── cataloger/
    │   └── (collections of catalogers, organized in packages)
    └── (basic definitions relating to files)

This paves the way to do a few more things:

  1. move byte-size-related definitions from internal/file to syft/file (KB, MB, GB, etc)
  2. move filename-glob-related functionality from internal/file to internal (GlobMatch)
  3. rename internal/file to internal/archive, as the package now has a smaller focus (relating to zip and tar helpers only)

@wagoodman
Contributor Author

wagoodman commented Mar 22, 2022

Focusing on the relationship between the source and file packages, there are some concepts that are currently in the source package that semantically belong in the file package. For instance, for concrete types:

  • source.FileType which enumerates different kinds of files (e.g. symlinks, hardlinks, etc) would be better fit as file.Type
  • source.FileMetadata which is a basic definition for data relating to a file would be better fit as file.Metadata
  • source.Location which relates a real filepath within a designated filesystem to an access path ("virtual path") would be better fit as file.Location (note: this embeds source.Coordinates)
  • source.Coordinates which is the minimal definition for a location would be better fit as file.Coordinates

Take for example the definition of sbom.Artifacts:

type Artifacts struct {
	PackageCatalog      *pkg.Catalog
	FileMetadata        map[source.Coordinates]source.FileMetadata
	FileDigests         map[source.Coordinates][]file.Digest
	FileClassifications map[source.Coordinates][]file.Classification
	FileContents        map[source.Coordinates]string
	Secrets             map[source.Coordinates][]file.SearchResult
	LinuxDistribution   *linux.Release
}

With the proposed source/file migrations, the nouns that make up sbom.Artifacts make more semantic sense:

type Artifacts struct {
	PackageCatalog      *pkg.Catalog
	FileMetadata        map[file.Coordinates]file.Metadata
	FileDigests         map[file.Coordinates][]file.Digest
	FileClassifications map[file.Coordinates][]file.Classification
	FileContents        map[file.Coordinates]string
	Secrets             map[file.Coordinates][]file.SearchResult
	LinuxDistribution   *linux.Release
}

Lastly, the same migrations should apply to file abstractions as well:

  • source.File*Resolver objects, which abstract how you extract file path, content, and metadata from an underlying representation, would be better fit as file.*Resolver
  • similarly, source.excludingResolver would be better fit as file.ExcludingResolver

The advantage to this last suggestion is that catalogers would only depend on objects in a package that has no relation to stereoscope image or source packages.

@wagoodman
Contributor Author

Focusing on the source package in particular:

  • it seems that source.Scheme could be disambiguated to be a source.Type, and the helper function source.DetectScheme() could be renamed to source.DetectTypeFromScheme(). In this way, user-input schemes are related to (but not the same as) source types.

@wagoodman
Contributor Author

I've made a mockup of the proposed changes in the above comments, feel free to try it out here https://github.com/anchore/syft/tree/api-wip (this is just for illustration and prototyping, not for merging or cleaning up)

@wagoodman
Contributor Author

wagoodman commented Mar 29, 2022

Incorporating comments from refinement

The proposed package structure for pkg has a few drawbacks:

  • cataloger/packages is a sibling package of the other individual package catalogers, which hints that they are hierarchically the same; however, this is not true, since the packages cataloger implementation is meant to take one or more individual package catalogers and return the populated package catalog.
  • Ideally packages.Catalog would be under pkg, since the function does not have any inherent capabilities (package catalogers are injected). The problem with this is that there is already a pkg.Catalog data structure.

Given these points, and combining with the original points in other comments (e.g. package names should communicate what they operate on) here is an alternative proposed package structure:

syft/
├── cpe/                       // contains CPE generation logic
├── cataloger/
│   ├── packages/              //... all of the package-based catalogers organized into subpackages
│   │   ├── apkdb/
│   │   ├── generic/
│   │   │   ├── cataloger.go
│   │   │   ├── cataloger_test.go
│   │   │   └── parser.go
│   │   ├── deb/
│   │   ├── golang/
│   │   ├── java/
│   │   ├── javascript/
│   │   ├── php/
│   │   ├── python/
│   │   ├── rpmdb/
│   │   ├── ruby/
│   │   ├── rust/
│   │   └── cataloger_groups.go     // image/dir cataloger sets -- packages.ImageCatalogers()
│   └── files/                      //... all of the file-based catalogers organized into subpackages
│       ├── filemetadata/
│       ├── filedigests/
│       ├── fileclassifier/
│       └── secrets/
├── file/
│   └── ...    // basic file definitions
└── pkg/       // basic package definitions
    ├── collection.go       // old pkg.Catalog struct, renamed to pkg.Collection
    ├── catalog.go          // entrypoint for package cataloging -- pkg.Catalog()
    ├── config.go           // config that is common to how all catalogers search
    └── cataloger.go        // cataloger interface -- pkg.Cataloger

Specific changes:

  • file/cataloger and pkg/cataloger have been moved to a common root-level package syft/cataloger (so syft/cataloger/files and syft/cataloger/packages). This further separates the basic definitions from the capabilities.
  • Renamed pkg.Catalog to pkg.Collection. There is a lot of use of the name Catalog/Cataloger, with a mixed use of noun and verb. The verb is used much more, so changing it would be harder (also, I feel the verb is a better fit, since it's descriptive as to what is being done). There is only one use of "catalog" as a noun, and that's pkg.Catalog, a data structure that holds a collection of package objects.

@houdini91
Contributor

houdini91 commented Apr 17, 2022

Hi guys,
I also found myself with the need to use syft as a library; it's awesome to see you are already two steps ahead.
I understand the vision is a very well defined library.
I took a couple of hours to try to create a more simplistic route for some tests (also as a self-learning process).
It's a bit relevant to this thread, so I am attaching a patch and PR with my changes.
scribe-security#7
package_lib_patch.txt

// LibPackagesExec runs the packages command as a library
// userInput: target
// cfg: syft configuration structure
// l: logger to attach to; nil for the default syft logger
// enable_ui: enable or disable UI output
// Returns an SBOM or an error.
func LibPackagesExec(userInput string, cfg *config.Application, l logger.Logger, enable_ui bool) (*sbom.SBOM, error)

@wagoodman
Contributor Author

wagoodman commented Jun 4, 2022

Summary of tasks

Group: organize definitions (details in #558 (comment), #558 (comment), and #558 (comment))

Group: organize catalogers (details in #558 (comment))

  • migrate package catalogers to syft/cataloger/packages/* (no longer doing this work)
  • migrate file catalogers to syft/file/cataloger/* (slightly different than in the above proposed comments... this starts to remove the hierarchical organization of catalogers, allowing them to be siblings) (see "Migrate location-related structs to the file package" #1751)

Group: new top-level API

@spiffcs
Contributor

spiffcs commented Jun 9, 2022

@samj1912 if you have more thoughts from the community meeting re: using syft as a library please feel free to add them here

@spiffcs spiffcs removed their assignment Dec 1, 2022
@tommyknows

After reading through this ticket, I have a few ideas & questions that I'd want to raise. First off, I generally like the structure as proposed in #558 (comment).

However, I find the usage of package and packages a bit confusing.

  • packages, as referred to in syft/cataloger/packages, is a package format like deb or apk, but also refers to language-specific artifacts like binaries from golang, rust, or others.
  • package is a generic representation of the items found by the packages catalogers, but also the files catalogers, if I understand correctly. Additionally, naming it pkg is to me a very weird usage of that abbreviation, as pkg commonly refers to the directory where users store all their Go package definitions (as this repo demonstrates, although there is some controversy about using pkg).

Also, maybe I've missed this, but what will happen to the currently existing pkg.Package?

My suggestion would be:

syft
├── artifact
│   ├── artifact.go
│   ├── artifacts.go
│   ├── config.go
│   ├── cataloger
│   │   ├── files
│   │   │   ├── classifier
│   │   │   ├── digest
│   │   │   ├── metadata
│   │   │   └── secret
│   │   ├── generic
│   │   ├── language
│   │   │   ├── go
│   │   │   ├── java
│   │   │   ├── javascript
│   │   │   ├── php
│   │   │   ├── python
│   │   │   ├── ruby
│   │   │   └── rust
│   │   └── os
│   │       ├── apk
│   │       ├── deb
│   │       └── rpm
│   └── cataloger.go
├── cpe
├── file
└── syft.go

I went quite heavy on the renaming here, but in general feel free to s/artifact/package. I'd argue, however, that artifact is a better term than package. YMMV.

Additionally, I've added the top-level syft.go file that would tie everything together. This would probably contain:

  • the top-level API with its options
  • the catalogers by source scheme

And I could imagine the API being pretty much what is proposed here, which would result in syft.Catalog(...)

I'm also curious to hear why the name "Cataloger", as "Scanner" is the term that comes to mind first for me. To make an example, I'd imagine a book cataloger to look at the cover and back and put the book into a specific section, while a scanner would actually check the contents of the book. That analogy also holds true for the files/artifacts that we're talking about here, I'd say.
(As that's something quite personal, I wouldn't say one or the other is wrong, I'm just curious about how that came to be)

@wagoodman
Contributor Author

Thanks @tommyknows for your thoughts! I've been starting to get some of these changes in recently with #1383 (just part of this issue) so good timing!

I find the usage of package and packages a bit confusing.

me too... this was an attempt to separate definitions and capabilities, which helps tremendously when avoiding package cycles. That being said, I think there is another approach we can take here (see the update below)

what will happen to the currently existing pkg.Package?

I didn't explicitly have that in the tree, but it would remain in the pkg package

Additionally, naming it pkg to me is a very weird usage of that abbreviation.

This came out of necessity, since package is a Go keyword, which prevents its use as a package name for organizing code. I'm aware of the pkg naming norm in the Go community; however, this particular project is very "package" centric, so this took precedence.

I'd argue however that artifact is a better term than package?

Today we use the term artifact to mean "something that was cataloged" (in a noun-less way)... so we catalog file metadata, digests, secrets (soon to be removed), packages, etc. Each of these is normalized by its ability to be identified and related to the others. The only spot this is inconsistent is the syft JSON output, where we have a top-level artifacts key under which all packages are stored. One day we may rename this to packages, or combine multiple types (files and packages) under artifacts in the syft JSON... there hasn't been a clear direction here, so we haven't made a change.

I'm quite hesitant to make such a large change with this issue (which helps us get to 1.0) but am open to talking about how it could change in the future.

I'm also curious to hear why the name "Cataloger", as to me "Scanner" is the term that comes to mind first for me

Yeah, naming is hard! Scanner came up during initial development, but when trying to craft a package-searching interface we wanted the nouns and verbs to be self-descriptive and as specific as possible. The verb "scan" doesn't say what the ultimate action is and can be used generically to describe any "scanning-like" action, whereas "catalog" has a more specific connotation. With the raw definitions:

  • scanning: look at all parts of (something) carefully in order to detect some feature.
  • cataloging: make a systematic list of (items of the same type).

While both definitely fit, the latter seemed more specific to what was happening: we're building a catalog or index of packages.

I think where the term catalog started to go sideways in the codebase was when we used the exact same word as a verb and a noun ("write packages to the catalog" and "catalog packages"). This is when it started to get confusing (for me).

additionally, I've added the top-level syft.go file, that would tie everything together. This would probably contain: the top-level API with its options and the catalogers by source scheme

Agreed! We ended up going with something like this in #1383; happy for opinions / comments!

Based on some of your comments, other internal feedback, and more time spent thinking about it, here's an updated package organization proposal:


syft/
├── cataloger/
│   ├── apkdb/
│   ├── generic/
│   │   ├── cataloger.go
│   │   ├── cataloger_test.go
│   │   └── parser.go
│   ├── deb/
│   ├── golang/
│   ├── java/
│   ├── javascript/
│   ├── php/
│   ├── python/
│   ├── rpmdb/
│   ├── ruby/
│   ├── rust/
|   ├── filemetadata/
|   ├── filedigests/
│   └── config.go           // configuration for cataloging behavior (cataloger.SearchConfig, cataloger.RelationshipsConfig, etc...)
│                           // overall captures what to catalog and how to catalog it. Used in the syft package for capturing
│                           // cataloging intent and for wiring up catalogers.
├── file/
│   └── ...                 // basic file definitions
│
├── pkg/
│   ├── collection.go       // old pkg.Catalog struct, renamed to pkg.Collection
│   ├── cataloger.go        // cataloger interface -- pkg.Cataloger
│   └── ...                 // other package definitions -- pkg.Package, pkg.*Metadata, etc...
│ 
├── create.go   // entrypoint for syft.CreateSBOM() and the syft.SBOMBuilderConfig type
├── task.go     // coordinates all catalogers into tasks that can be selected via configuration (responsible for wiring up catalogers)
└── ...

@tommyknows

Thanks a lot for the detailed feedback @wagoodman, very nice write-up! With that background, I definitely agree with what you're saying and proposing.

@luhring
Contributor

luhring commented Feb 13, 2023

Adding two cents as I've just looked again at using Syft as a lib with "fresh eyes" 😄.

I like the idea of sbom, err := syft.CreateSBOM(src, cfg) that I saw in #1383.

For consumers, I think that's "step 2" in the flow. The src := source.New(...) line is "step 1", and so it's where the first bit of friction (or potentially lack thereof) will be encountered.

I have two major thoughts so far: I want to be sure there are no "smarts" being invoked, and I want to learn about as few proprietary or library-specific concepts as possible before I can create the SBOM.

Smarts vs. no smarts

This is how I think about it: The Syft CLI experience has a lot of "smarts". It will do things like auto-detect the availability of the Docker daemon API in the environment, and use that when it can. It can parse "schemes" like sbom: and registry:. And it'll even retry image pulls in certain edge-case scenarios based on the scheme name. The Syft CLI's smarts are designed to make the experience of running Syft friendlier and more forgiving of user mistakes.

As a lib consumer, I don't want smarts. I'm usually writing a program where I want the flow to be as simple and reliable as possible, and I want to explicitly opt-in to any more complicated behavior if I ever need it — since I'm going to have to be the one to support the experience for my users.

As a concrete example, I'm interested in writing a tool that uses Syft (as a lib) to detect dark matter in container images. I want a straightforward flow — given an image reference (e.g. registry.fun/foo@sha:bar123), pull the image from the registry (no conditional logic w/ local Docker API), don't do any scheme parsing, and don't retry the pull implicitly on my behalf. Especially if I'm using a mainstream library to handle image retrieval like GGCR, there should be minimal work needed to generate the SBOM after I've correctly described the image in my logic.

Minimizing the number of custom objects to learn about before I can do a thing

I spent an hour or so this past weekend trying to figure out how I'd construct a Source to do this, and I ended up giving up. (So if I missed it, please point it out to me!)

I see functions like:

func New(in Input, registryOptions *image.RegistryOptions, exclusions []string) (*Source, func(), error)
func NewFromRegistry(in Input, registryOptions *image.RegistryOptions, exclusions []string) (*Source, func(), error)

So before I can even begin constructing a "source", I need to figure a few more things out, like this Input concept, and image.RegistryOptions which comes from a separate project.

Instead, I'd love 😍 for my journey to start with something like:

src, err := source.NewFromRemoteImageRef("registry.fun/foo@sha:bar123")

And then I'm good to call "make the SBOM!".

@wagoodman
Contributor Author

wagoodman commented Apr 27, 2023

@luhring I've got some draft code that we could talk through conceptually (with some options not entirely figured out yet)... think of this as a conversation starter (with a couple more fictitious source objects thrown in for good measure):

package source

import (
	"github.com/anchore/stereoscope/pkg/image"
	"github.com/anchore/syft/syft/artifact"
	v1 "github.com/google/go-containerregistry/pkg/v1"
)

type StereoscopeImageConfig struct {
	Reference       string
	From            image.Source
	Platform        *image.Platform
	RegistryOptions *image.RegistryOptions // TODO: takes platform? as string?
	// name?
}

type GGCRImageConfig struct {
	Image              v1.Image
	ContentPath        string
	AdditionalMetadata []image.AdditionalMetadata
	// name?
}

// the below configs could just be simple constructor args, but I kept it as a config to start...

type DirectoryConfig struct {
	Path string
	// name?
	// root?
}

type FileConfig struct {
	Path string
	// name?
	// root?
}

type GitConfig struct {
	URI string // url or path
	// name?
	// root?
}

func NewFromDirectory(cfg DirectoryConfig) (Source, error) {
	// ...
}

func NewFromFile(cfg FileConfig) (Source, error) {
	// ...
}

func NewFromImage(cfg StereoscopeImageConfig) (Source, error) {
	// ...
}

func NewFromGGCRImage(cfg GGCRImageConfig) (Source, error) {
	// ...
}

func NewFromGit(cfg GitConfig) (Source, error) {
	// ...
}

type StereoscopeImageSource struct {
	// ...

	// implements ImageInterpreter
}

type GGCRImageSource struct {
	// ...

	// implements ImageInterpreter
}

type DirectorySource struct {
	// ...

	// implements PathInterpreter
}

type FileSource struct {
	// ...

	// implements PathInterpreter
}

type GitSource struct {
	// ...

	// implements GitInterpreter
}

type ImageMetadata struct {
	UserInput      string          `json:"userInput"`
	ID             string          `json:"imageID"`
	ManifestDigest string          `json:"manifestDigest"`
	MediaType      string          `json:"mediaType"`
	Tags           []string        `json:"tags"`
	Size           int64           `json:"imageSize"`
	Layers         []LayerMetadata `json:"layers"`
	RawManifest    []byte          `json:"manifest"`
	RawConfig      []byte          `json:"config"`
	RepoDigests    []string        `json:"repoDigests"`
	Architecture   string          `json:"architecture"`
	Variant        string          `json:"architectureVariant,omitempty"`
	OS             string          `json:"os"`
}

// LayerMetadata represents all static metadata that defines what a container image layer is.
type LayerMetadata struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

type GitMetadata struct {
	Commit string
	// ...
}

type Source interface {
	artifact.Identifiable
	FileResolver(Scope) (FileResolver, error)
}

type ImageInterpreter interface {
	Metadata() ImageMetadata
}

type PathInterpreter interface {
	Path() string
}

type GitInterpreter interface {
	Metadata() GitMetadata
}
