
Top-level API should be more composable #558

Closed
5 tasks done
wagoodman opened this issue Oct 16, 2021 · 26 comments · Fixed by #2517
Labels: breaking-change (Change is not backwards compatible), enhancement (New feature or request)

@wagoodman
Contributor

wagoodman commented Oct 16, 2021

What would you like to be added:
More reusable primitives for when syft is used as a library, supporting at least the following tasks:

This should probably be done after (that is, this issue is most likely related to...):

Why is this needed:
The existing top-level API functions perform a full parse-and-catalog processing unit, whereas separating parsing and cataloging into separate functions is useful if you want to reuse the same parsed image (which takes a while) across different cataloger calls (which, relatively speaking, don't take that long).

This would additionally improve interoperability with anchorectl.

Tasks

@wagoodman wagoodman added the enhancement New feature or request label Oct 16, 2021
@wagoodman wagoodman added this to the Syft 1.0 milestone Oct 16, 2021
@wagoodman wagoodman added the breaking-change Change is not backwards compatible label Oct 19, 2021
@luhring
Contributor

luhring commented Oct 19, 2021

Adding my two cents:

  1. I love this. I see "Syft-as-a-library" consumption of this project becoming big as time goes on. ❤️
  2. It'd be great if we can try to use Syft as a library more ourselves during and even before taking this one, so we have a nice consumer-centric feedback loop to influence the API design.
  3. In the updated design, it'd be awesome to really think about the top-level "entrypoint". Today we have syft.CatalogPackages. This function's parameters are Syft types, which means this isn't a function consumers can easily start using right away — instead they have to figure out other Syft function calls before calling this function. It'd be great to think about an elegant way of simplifying the work of invoking Syft for callers, and allowing configurability on an "as needed" basis. One pattern I've seen a lot is to have a function take a config struct as an argument. We could add API sugar for creating and adjusting this struct ahead of time. And it'd be great to have the function's (and all inner functions') configuration be driven solely by config parameters, not side-effectful references to a global config or a config file, so that library consumers have a predictable and tightly controlled experience invoking Syft functionality.

@wagoodman
Contributor Author

Example of using syft as a lib today: https://gist.github.com/wagoodman/57ed59a6d57600c23913071b8470175b

@wagoodman
Contributor Author

From refinement:

  • Do we want to encapsulate the stereoscope primitives or not? Callers should not have to use stereoscope.* to interact with syft, even if objects from stereoscope are what we end up allowing to be passed around the API.
  • It would be ideal to not expose behavior (e.g. tasks) and instead expose declarations (e.g. cataloger configurations?)

@wagoodman
Contributor Author

wagoodman commented Mar 3, 2022

Update: much of this has been addressed in #864

Also, I think we should distinguish scoped capabilities of syft vs the pure definitions needed to interop with primitives that syft works with.

An example: Syft can encode and decode various formats (spdx, cyclonedx, etc). Additionally, syft has the definition of what a Format is... something that can encode an SBOM into bytes, decode an SBOM from bytes, and validate the underlying bytes to see if they represent the format the object represents.

These are probably not strongly separated enough in today's API.

The definition of a format should be more closely related to that of an SBOM, since that context is semantically useful (as opposed to a format on its own). For example, format.Format does not communicate what the format is for, whereas sbom.Format scopes the concept to formats that deal specifically with SBOMs.

Dealing with the capabilities of syft, these should probably be elevated to a top-level package concern, away from the Format definitions. That is, a format.Option type with a value of format.CycloneDxXMLOption seems like it deals with the semantics of a format, when it is really describing a user option for syft as a capability. This hints at elevating these concepts up to the syft package, where there would be a syft.FormatOption with a value of syft.CycloneDxXMLFormatOption.

Based on some of the above points, here is a suggested update to the syft API:

// in the top-level "syft" package

// migrated from today's "format" package
func FormatByOption(option FormatOption) sbom.Format 

func FormatByName(name string) sbom.Format

// migrated from today's "format" package
func IdentifyFormat(by []byte) sbom.Format

func Encode(s sbom.SBOM, f sbom.Format) ([]byte, error)

func Decode(reader io.Reader) (*sbom.SBOM, sbom.Format, error)

Edit: in case there are functions within syft that may require encoding/decoding functionality, this should be migrated to its own package (say, syft/format).

@wagoodman
Contributor Author

A topic that was brought up by @westonsteimel and @luhring was how syft versions are expressed in SBOMs when using syft as a lib. Today we have an internal/version package whose values are overridden via build options for the syft application, providing useful values based on the available git information. This allows SBOMs to have a descriptor block recording that syft at version vX.Y.Z produced the SBOM document. When using syft as a lib, the caller must provide this information.

Ideally the syft API should be able to surface this information automatically, but additionally allow an override in case wrapping tooling prefers to provide a different name and version.

@luhring suggested that we could use the runtime buildinfo section from the running binary to provide this information. I think this is a good idea, however we need to account for the edge case when the buildinfo section is stripped from the binary too.

@wagoodman
Contributor Author

wagoodman commented Mar 3, 2022

It would be ideal to not expose behavior (e.g. tasks) and instead expose declarations (e.g. cataloger configurations?)

Today we have this function call at the top-level syft package:

func CatalogPackages(src *source.Source, cfg cataloger.Config) (*pkg.Catalog, []artifact.Relationship, *linux.Release, error) 

A few points about this:

  • This only captures cataloging packages and not any other artifacts
  • Produces a package catalog and related data instead of an SBOM itself

It would be great to move towards something like this:

// where "options" are declarative and augment/select cataloging capabilities
func CatalogSource(src *source.Source, /*... options...*/ ) (*sbom.SBOM, error) 

or maybe injection of pre-constructed catalogers

func CatalogSource(src *source.Source, catalogers...) (*sbom.SBOM, error) 

The downside with the second suggestion:

  • There is no common cataloger interface, and probably no value in creating one since we catalog multiple kinds of artifacts
  • Caller construction hints at providing more helper functions to do this construction

... which makes me lean towards the first suggestion.

The upside of taking source.Source is that we can reuse the source object across multiple calls (getting the source object is very expensive, so enabling caller reuse is advised).

It'd be great to think about an elegant way of simplifying the work of invoking Syft for callers...

We could additionally introduce a single call that would wrap source creation and cataloging together into one call (which would use CatalogSource).

func Catalog(input string, /*... cataloger options...*/) (*sbom.SBOM, error) 

This could be the "one stop shop" call that the cmd package would leverage.

@wagoodman
Contributor Author

wagoodman commented Mar 14, 2022

Next up: what kind of cataloger options are we looking for here? Short answer: functional options that mostly represent existing configuration options seem the best fit, as they allow for flexibility and sane defaults, and aren't mutually exclusive with a user-provided struct instance approach.

Let's assume we have something that looks like:

func Catalog(src *source.Source, options ...CatalogingOption) (*sbom.SBOM, error)

Where CatalogingOption is:

type CatalogingOption func(*source.Source, *CatalogingConfig) error

I propose the following options as a start:

# run all catalogers in serial fashion (default is to run in parallel)
WithoutConcurrency() CatalogingOption

# set the file resolver scope for all catalogers (except the secrets cataloger)
WithScope(scope source.Scope) CatalogingOption

# the tool name and version to show in the SBOM descriptor block (defaults to syft and current library version if not provided)
WithToolIdentification(name, version string) CatalogingOption 

# in the syft-json format, we allow for outputting application configuration (defaults to empty)
WithToolConfiguration(c interface{}) CatalogingOption 

# set the package catalogers you wish to use (must be pre-constructed)
WithPackageCatalogers(catalogers ...pkg.Cataloger) CatalogingOption

# use the package catalogers suited by the source provided when cataloging (default is to use this over `WithPackageCatalogers`)
WithDefaultPackageCatalogers(cfg packages.SearchConfig) CatalogingOption

# enable the file metadata cataloger (default is off)
WithFileMetadata() CatalogingOption

# enable the file digests cataloger (default is off)
WithFileDigests(hashes ...crypto.Hash) CatalogingOption

# enable the secrets cataloger (default is off)... the argument is a struct with all of today's cataloger options
WithSecrets(secretConfig *file.SecretsCatalogerConfig) CatalogingOption

# enable the file classification cataloger with default classifiers (default is off)
WithFileClassification() CatalogingOption

# set the specific file classifiers, and enables the file classification (default is off)
WithFileClassifiers(classifiers ...file.Classifier) CatalogingOption

# enable the file content cataloger (default is off)
WithFileContents(globs ...string) CatalogingOption 

# set the file size limits for the contents and secrets catalogers (default is 1 MB for both)
WithFileSizeLimit(byteLimit int64) CatalogingOption

Additionally, in case an API user has their own way to generate a CatalogingConfig, it can be provided in place of any additional arguments:

# override an entire config
WithConfig(override CatalogingConfig) CatalogingOption 

This works well for situations where you wish to bring your own defaults, but still allows augmentation via more functional options.

This is also well suited for syft, the application, where we can build our own CatalogingConfig instance based off of the application configuration.

Using these options within the syft application would look something like this:

func generateSBOM(src *source.Source) (*sbom.SBOM, error) {
	catalogingConfig, err := appConfig.ToCatalogingConfig()
	if err != nil {
		return nil, err
	}

	return syft.Catalog(src,
		syft.WithConfig(*catalogingConfig),
	)
}

Though other external users may choose to use it with defaults and augment behavior:

# catalog all packages, file metadata, and digests
syft.Catalog(src,
		syft.WithFileMetadata(),
		syft.WithFileDigests(crypto.SHA256),
	)

Questions:

  • there seems to be a consistency problem with how you enable file-based catalogers and turn off/change package cataloging. Is there a better way to express this?

@wagoodman
Contributor Author

wagoodman commented Mar 14, 2022

Separate from the options (but related) is the configuration object used to describe behavior. Based on the options and today's configuration, I propose a CatalogingConfig that looks something like this:

type CatalogingConfig struct {
	// tool-specific information
	ToolName          string
	ToolVersion       string
	ToolConfiguration interface{}
	// applies to all catalogers
	Scope                source.Scope
	ProcessTasksInSerial bool
	// package
	PackageCatalogers []pkg.Cataloger
	// file metadata
	CaptureFileMetadata bool
	DigestHashes        []crypto.Hash
	// secrets
	CaptureSecrets bool
	SecretsConfig  file.SecretsCatalogerConfig // struct derived from today's secrets cataloger options
	SecretsScope   source.Scope
	// file classification
	ClassifyFiles   bool
	FileClassifiers []file.Classifier
	// file contents
	ContentsConfig file.ContentsCatalogerConfig // struct derived from today's contents cataloger options
}

The idea is that all field types are useful in their current form and require no further processing. For example, SecretsScope source.Scope is preferred over SecretsScope string since a string requires further processing and verification... so we stick with the "rich" object.

With the default options that can be fetched:

func DefaultCatalogingConfig() CatalogingConfig {
	return CatalogingConfig{
		Scope:           source.SquashedScope,
		ToolName:        internal.ApplicationName,
		ToolVersion:     extractSyftVersionFromLib(), // todo: this function does not exist, but differs from version.FromBuild()
		SecretsScope:    source.AllLayersScope,
		SecretsConfig:   file.DefaultSecretsCatalogerConfig(), // similar concept, but provided next to the cataloger
		FileClassifiers: file.DefaultClassifiers(),  // similar concept, but provided next to the cataloger
		ContentsConfig:  file.DefaultContentsCatalogerConfig(),  // similar concept, but provided next to the cataloger
	}
}

Why export the config and fields when we already have options? This enables someone to provide their own config and options as they see fit (using WithConfig, and with the syft application's use case in mind).

@wagoodman
Contributor Author

from refinement:

  • try to be minimal with the options
  • are there any more global default functional options that could be used here? (not just individual catalogers, but across different cataloger types)

@lyzs90

lyzs90 commented Mar 17, 2022

This is a great initiative, can't wait for it!

I have a use-case for defining a custom package cataloger. Pre-construction is achievable since the Cataloger interface is exposed. However, I'm not sure if there's a way to extend (not override) the list of default package catalogers with the proposed API.

@wagoodman
Contributor Author

wagoodman commented Mar 22, 2022

not sure if there's a way to extend (not override) the list of default package catalogers with the proposed API

@lyzs90 there is one way with the proposed API, but I admit it isn't straightforward:

config := syft.DefaultCatalogingConfig()
config.PackageCatalogers = append(config.PackageCatalogers, <your catalogers>...)
syft.Catalog(src,
    syft.WithConfig(config),
    //... any other overriding options
)

We could add an additional helper which would append and not override the list of package catalogers:

// current proposed "set/override" operation
WithPackageCatalogers(catalogers ...pkg.Cataloger) CatalogingOption

// possible new "append" operation
WithAdditionalPackageCatalogers(catalogers ...pkg.Cataloger) CatalogingOption

How does that sound?

@lyzs90

lyzs90 commented Mar 22, 2022

@wagoodman Now that I think of it, WithAdditionalPackageCatalogers may not be necessary after all, because we can just call WithPackageCatalogers with the extended list of catalogers.

@wagoodman
Contributor Author

wagoodman commented Mar 22, 2022

update: see #558 (comment) for a more up to date suggestion

Most of this conversation has been about the syft package, however, that top-level API may require callers to use other packages for inputs (such as cataloger instances). For this reason I wanted to talk through some thoughts around changes to package organization.

Here is the current cataloger organization today:

syft/
└── pkg/
    └── cataloger/
        ├── apkdb/
        ├── common/
        │   ├── cpe/                        // contains CPE generation logic
        │   ├── generic_cataloger.go        // used by most other catalogers today
        │   ├── generic_cataloger_test.go
        │   └── parser.go
        ├── deb/
        ├── golang/
        ├── java/
        ├── javascript/
        ├── php/
        ├── python/
        ├── rpmdb/
        ├── ruby/
        ├── rust/
        ├── catalog.go          // entrypoint for package cataloging -- cataloger.Catalog()
        ├── cataloger.go        // cataloger interface + image/dir cataloger sets -- cataloger.Cataloger & cataloger.ImageCatalogers()
        ├── config.go           // config needed for cataloging (references the search config)
        └── search_config.go    // config that is common to how all catalogers search

This is good, but there is room for improvement on what the package names convey:

  • common contains resources that are common to all package catalogers; however, it organizes both a common cataloger as well as resources (such as CPE generation). I think that, like the other catalogers, the common cataloger should be in its own package. Additionally, all things relating to CPEs should probably be elevated higher in the package tree.
  • cataloger organizes a set of cataloger packages but also has common "wiring" code as well. How you invoke these wiring functions is a little odd: cataloger.Catalog(). The package name is redundant relative to the verb and doesn't scope what is being cataloged (as the other cataloging packages do). We could put these wiring elements into pkg (which may even be ideal); however, there are overlapping concepts here (pkg.Catalog the collection store and pkg.Catalog which would come from cataloger.Catalog). I think the next most ideal spot would be to carve out a pkgs or packages package that indicates that this is for cataloging multiple kinds of packages.
  • The definition of a package cataloger is within cataloger and not pkg. The location for this abstraction seems misplaced, since there is no reason for it to reside far away from what it describes (how to catalog packages, so within the pkg package).

This culminates in a final tree that looks like so:

syft/
├── cpe/       // contains CPE generation logic
└── pkg/
    ├── cataloger/
    │   ├── apkdb/
    │   ├── generic/
    │   │   ├── cataloger.go
    │   │   ├── cataloger_test.go
    │   │   └── parser.go
    │   ├── deb/
    │   ├── golang/
    │   ├── java/
    │   ├── javascript/
    │   ├── packages/
    │   │   ├── catalog.go        // entrypoint for package cataloging -- packages.Catalog()
    │   │   ├── catalogers.go     // image/dir cataloger sets -- packages.ImageCatalogers()
    │   │   └── search_config.go  // config that is common to how all catalogers search
    │   ├── php/
    │   ├── python/
    │   ├── rpmdb/
    │   ├── ruby/
    │   └── rust/
    └── cataloger.go   // cataloger interface -- pkg.Cataloger

Lastly, providing a helper function that ties together source schemes and sets of package catalogers could be provided within the packages package:

package packages

func CatalogersBySourceScheme(scheme source.Scheme, cfg SearchConfig) []pkg.Cataloger {
	switch scheme {
	case source.ImageScheme:
		return InstalledCatalogers(cfg)
	case source.FileScheme:
		return AllCatalogers(cfg)
	case source.DirectoryScheme:
		return IndexCatalogers(cfg)
	}
	return nil
}

@wagoodman
Contributor Author

wagoodman commented Mar 22, 2022

In a similar vein to the last comment about package organization, I wanted to touch on the differences between the file and pkg packages.

The current organization of pkg is such:

syft/
└── pkg/
    ├── cataloger/
    │   └── (collections of catalogers, organized in packages)
    └── (basic definitions relating to packages)

The current organization of file is as such:

syft/
└── file/
    ├── (collections of catalogers, not organized in packages)
    └── (basic definitions relating to files)

Ideally the organization of file would be more like pkg, where:

  • capabilities (catalogers) are kept separate from basic file-related definitions
  • catalogers are organized into their own packages (e.g. syft/file/cataloger/secrets contains the secrets cataloger only... syft/file/cataloger/filemetadata contains the file metadata cataloger)

The final layout would look like:

syft/
└── file/
    ├── cataloger/
    │   └── (collections of catalogers, organized in packages)
    └── (basic definitions relating to files)

This paves the way to do a few more things:

  1. move byte-size-related definitions from internal/file to syft/file (KB, MB, GB, etc)
  2. move filename-glob-related functionality from internal/file to internal (GlobMatch)
  3. rename internal/file to internal/archive, as the package now has a smaller focus (relating to zip and tar helpers only)

@wagoodman
Contributor Author

wagoodman commented Mar 22, 2022

Focusing on the relationship between the source and file packages, there are some concepts that are currently in the source package that semantically belong in the file package. For instance, for concrete types:

  • source.FileType which enumerates different kinds of files (e.g. symlinks, hardlinks, etc) would be better fit as file.Type
  • source.FileMetadata which is a basic definition for data relating to a file would be better fit as file.Metadata
  • source.Location which relates a real filepath within a designated filesystem to an access path ("virtual path") would be better fit as file.Location (note: this embeds source.Coordinates)
  • source.Coordinates which is the minimal definition for a location would be better fit as file.Coordinates

Take for example the definition of sbom.Artifacts:

type Artifacts struct {
	PackageCatalog      *pkg.Catalog
	FileMetadata        map[source.Coordinates]source.FileMetadata
	FileDigests         map[source.Coordinates][]file.Digest
	FileClassifications map[source.Coordinates][]file.Classification
	FileContents        map[source.Coordinates]string
	Secrets             map[source.Coordinates][]file.SearchResult
	LinuxDistribution   *linux.Release
}

With the proposed source/file migrations, the nouns that make up sbom.Artifacts make more semantic sense:

type Artifacts struct {
	PackageCatalog      *pkg.Catalog
	FileMetadata        map[file.Coordinates]file.Metadata
	FileDigests         map[file.Coordinates][]file.Digest
	FileClassifications map[file.Coordinates][]file.Classification
	FileContents        map[file.Coordinates]string
	Secrets             map[file.Coordinates][]file.SearchResult
	LinuxDistribution   *linux.Release
}

Lastly, the same migrations should apply to file abstractions as well:

  • source.File*Resolver objects, which abstract how you extract file path, content, and metadata from an underlying representation, would be better fit as file.*Resolver
  • similarly, source.excludingResolver would be better fit as file.ExcludingResolver

The advantage to this last suggestion is that catalogers would only depend on objects in a package that has no relation to stereoscope image or source packages.

@wagoodman
Contributor Author

Focusing on the source package in particular:

  • it seems that source.Scheme could be disambiguated to be a source.Type, and the helper function source.DetectScheme() could be renamed to source.DetectTypeFromScheme(). In this way, user-input schemes are related to (but not the same as) source types.

@wagoodman
Contributor Author

I've made a mockup of the proposed changes in the above comments, feel free to try it out here https://github.com/anchore/syft/tree/api-wip (this is just for illustration and prototyping, not for merging or cleaning up)

@wagoodman
Contributor Author

wagoodman commented Mar 29, 2022

Incorporating comments from refinement

The proposed package structure for pkg has a few drawbacks:

  • cataloger/packages is a sibling package of the other individual package catalogers, which hints that they are hierarchically the same; however, this is not true, since the packages cataloger implementation is meant to take one or more individual package catalogers and return the populated package catalog.
  • Ideally packages.Catalog would be under pkg, since the function does not have any inherent capabilities (package catalogers are injected). The problem with this is that there is already a pkg.Catalog data structure.

Given these points, and combining with the original points in other comments (e.g. package names should communicate what they operate on) here is an alternative proposed package structure:

syft/
├── cpe/                       // contains CPE generation logic
├── cataloger/
│   ├── packages/              //... all of the package-based catalogers organized into subpackages
│   │   ├── apkdb/
│   │   ├── generic/
│   │   │   ├── cataloger.go
│   │   │   ├── cataloger_test.go
│   │   │   └── parser.go
│   │   ├── deb/
│   │   ├── golang/
│   │   ├── java/
│   │   ├── javascript/
│   │   ├── php/
│   │   ├── python/
│   │   ├── rpmdb/
│   │   ├── ruby/
│   │   ├── rust/
│   │   └── cataloger_groups.go     // image/dir cataloger sets -- packages.ImageCatalogers()
│   └── files/                      //... all of the file-based catalogers organized into subpackages
│       ├── filemetadata/
│       ├── filedigests/
│       ├── fileclassifier/
│       └── secrets/
├── file/
│   └── ...    // basic file definitions
└── pkg/       // basic package definitions
    ├── collection.go       // old pkg.Catalog struct, renamed to pkg.Collection
    ├── catalog.go          // entrypoint for package cataloging -- pkg.Catalog()
    ├── config.go           // config that is common to how all catalogers search
    └── cataloger.go        // cataloger interface -- pkg.Cataloger

Specific changes:

  • file/cataloger and pkg/cataloger have been moved to a common root-level package syft/cataloger (so syft/cataloger/files and syft/cataloger/packages). This further separates the basic definitions from the capabilities.
  • Renamed pkg.Catalog to pkg.Collection. There is a lot of use of the name Catalog/Cataloger, with a mixed use of noun and verb. The verb is used much more, so changing it would be harder (also, I feel the verb is a better fit, since it's descriptive as to what is being done). There is only one use of "catalog" as a noun, and that's pkg.Catalog, a data structure that holds a collection of package objects.

@houdini91
Contributor

houdini91 commented Apr 17, 2022

Hi guys,
I also found myself with the need to use syft as a library; it's awesome to see you are already two steps ahead.
I understand the vision is a very well defined library.
I took a couple of hours to try to create a more simplistic route for some tests (also as a self-learning process).
It's a bit relevant to this thread, so I am attaching a patch and PR with my changes.
scribe-security#7
package_lib_patch.txt

// LibPackagesExec runs the packages command as a library
// userInput: target
// cfg: syft configuration structure
// l: logger to attach to; nil for the default syft logger
// enable_ui: enable or disable UI output
// Returns an SBOM or an error.
func LibPackagesExec(userInput string, cfg *config.Application, l logger.Logger, enable_ui bool) (*sbom.SBOM, error)

@wagoodman
Contributor Author

wagoodman commented Jun 4, 2022

Summary of tasks

Group: organize definitions (details in #558 (comment), #558 (comment), and #558 (comment))

Group: organize catalogers (details in #558 (comment))

  • migrate package catalogers to syft/cataloger/packages/* (no longer doing this work)
  • migrate file catalogers to syft/file/cataloger/* (slightly different than in the above proposed comments... this starts to remove the hierarchical organization of catalogers, allowing them to be siblings) (see "Migrate location-related structs to the file package" #1751)

Group: new top-level API

@spiffcs
Contributor

spiffcs commented Jun 9, 2022

@samj1912 if you have more thoughts from the community meeting re: using syft as a library please feel free to add them here

@spiffcs spiffcs removed their assignment Dec 1, 2022
@tommyknows

After reading through this ticket, I have a few ideas & questions that I'd want to raise. First off, I generally like the structure as proposed in #558 (comment).

However, I find the usage of package and packages a bit confusing.

  • packages, as referred to in syft/cataloger/packages, is a package format like deb or apk, but also refers to language-specific artifacts like binaries from golang, rust, or others.
  • package is a generic representation of the items found by the packages catalogers, but also the files catalogers, if I understand correctly. Additionally, naming it pkg is to me a very weird usage of that abbreviation, as pkg commonly refers to the directory where users store all their Go package definitions (as this repo demonstrates, although there is some controversy about using pkg).

Also, maybe I've missed this, but what will happen to the currently existing pkg.Package?

My suggestion would be:

syft
├── artifact
│   ├── artifact.go
│   ├── artifacts.go
│   ├── config.go
│   ├── cataloger
│   │   ├── files
│   │   │   ├── classifier
│   │   │   ├── digest
│   │   │   ├── metadata
│   │   │   └── secret
│   │   ├── generic
│   │   ├── language
│   │   │   ├── go
│   │   │   ├── java
│   │   │   ├── javascript
│   │   │   ├── php
│   │   │   ├── python
│   │   │   ├── ruby
│   │   │   └── rust
│   │   └── os
│   │       ├── apk
│   │       ├── deb
│   │       └── rpm
│   └── cataloger.go
├── cpe
├── file
└── syft.go

I went quite heavy on the renaming here, but in general feel free to s/artifact/package. I'd argue, however, that artifact is a better term than package. YMMV.

Additionally, I've added the top-level syft.go file that would tie everything together. This would probably contain:

  • the top-level API with its options
  • the catalogers by source scheme

And I could imagine the API being pretty much what is proposed here, which would result in syft.Catalog(...)

I'm also curious to hear why the name "Cataloger", as "Scanner" is the term that comes to mind first for me. To make an example, I'd imagine a book cataloger to look at the cover and back and put the book into a specific section, while a scanner would actually check the contents of the book. That analogy also holds true for the files/artifacts that we're talking about here, I'd say.
(As that's something quite personal, I wouldn't say one or the other is wrong, I'm just curious about how that came to be)

@wagoodman
Contributor Author

Thanks @tommyknows for your thoughts! I've been starting to get some of these changes in recently with #1383 (just part of this issue) so good timing!

I find the usage of package and packages a bit confusing.

me too... this was an attempt to separate definitions and capabilities, which helps tremendously when avoiding package cycles. That being said, I think there is another approach we can take here (see the update below)

what will happen to the currently existing pkg.Package?

I didn't explicitly have that in the tree, but it would remain in the pkg package

Additionally, naming it pkg to me is a very weird usage of that abbreviation.

This came out of necessity, since package is a Go keyword, which prevents its use as a package name for organizing code. I'm aware of the pkg naming norm in the Go community; however, this particular project is very "package" centric, so this took precedence.

I'd argue however that artifact is a better term than package?

Today we use the term artifact to mean "something that was cataloged" (in a noun-less way)... so we catalog file metadata, digests, secrets (soon to be removed), packages, etc. Each of these is normalized by its ability to be identified and related to the others. The only spot this is inconsistent is the syft JSON output, where we have a top-level artifacts key under which all packages are stored. One day we may rename this to packages, or combine multiple types (files and packages) under artifacts in the syft JSON... there hasn't been a clear direction here, so we haven't made a change.

I'm quite hesitant to make such a large change with this issue (which helps us get to 1.0) but am open to talking about how it could change in the future.

I'm also curious to hear why the name "Cataloger", as to me "Scanner" is the term that comes to mind first for me

Yeah, naming is hard! Scanner came up during initial development, but when trying to craft a package-searching interface we wanted the nouns and verbs to be self-descriptive and as specific as possible. The verb "scan" doesn't say what the ultimate action is and can be used generically to describe any "scanning-like" action, whereas "catalog" has a more specific connotation. With the raw definitions:

  • scanning: look at all parts of (something) carefully in order to detect some feature.
  • cataloging: make a systematic list of (items of the same type).

While both definitely fit, the latter seemed more specific to what was happening: we're building a catalog or index of packages.

I think where the term catalog started to go sideways in the codebase was when we used the exact same word as a verb and a noun ("write packages to the catalog" and "catalog packages"). This is when it started to get confusing (for me).

additionally, I've added the top-level syft.go file, that would tie everything together. This would probably contain: the top-level API with its options and the catalogers by source scheme

Agreed! We ended up going with something like this in #1383; happy for opinions / comments!

Based on some of your comments, other internal feedback, and more time spent thinking about it, here's an updated package organization proposal:


syft/
├── cataloger/
│   ├── apkdb/
│   ├── generic/
│   │   ├── cataloger.go
│   │   ├── cataloger_test.go
│   │   └── parser.go
│   ├── deb/
│   ├── golang/
│   ├── java/
│   ├── javascript/
│   ├── php/
│   ├── python/
│   ├── rpmdb/
│   ├── ruby/
│   ├── rust/
|   ├── filemetadata/
|   ├── filedigests/
│   └── config.go           // configuration for cataloging behavior (cataloger.SearchConfig, cataloger.RelationshipsConfig, etc...)
│                           // overall captures what to catalog and how to catalog it. Used in the syft package for capturing
│                           // cataloging intent and for wiring up catalogers.
├── file/
│   └── ...                 // basic file definitions
│
├── pkg/
│   ├── collection.go       // old pkg.Catalog struct, renamed to pkg.Collection
│   ├── cataloger.go        // cataloger interface -- pkg.Cataloger
│   └── ...                 // other package definitions -- pkg.Package, pkg.*Metadata, etc...
│ 
├── create.go   // entrypoint for syft.CreateSBOM() and the syft.SBOMBuilderConfig type
├── task.go     // coordinates all catalogers into tasks that can be selected via configuration (responsible for wiring up catalogers)
└── ...

@tommyknows

Thanks a lot for the detailed feedback @wagoodman, very nice write-up! With that background, I definitely agree with what you're saying and proposing.

@luhring
Contributor

luhring commented Feb 13, 2023

Adding two cents as I've just looked again at using Syft as a lib with "fresh eyes" 😄.

I like the idea of sbom, err := syft.CreateSBOM(src, cfg) that I saw in #1383.

For consumers, I think that's "step 2" in the flow. The src := source.New(...) line is "step 1", and so it's where the first bit of friction (or potentially lack thereof) will be encountered.

I have two major thoughts so far: I want to be sure there are no "smarts" being invoked, and I want to learn about as few proprietary or library-specific concepts as possible before I can create the SBOM.

Smarts vs. no smarts

This is how I think about it: The Syft CLI experience has a lot of "smarts". It will do things like auto-detect the availability of the Docker daemon API in the environment, and use that when it can. It can parse "schemes" like sbom: and registry:. And it'll even retry image pulls in certain edge-case scenarios based on the scheme name. The Syft CLI's smarts are designed to make the experience of running Syft friendlier and more forgiving of user mistakes.

As a lib consumer, I don't want smarts. I'm usually writing a program where I want the flow to be as simple and reliable as possible, and I want to explicitly opt-in to any more complicated behavior if I ever need it — since I'm going to have to be the one to support the experience for my users.

As a concrete example, I'm interested in writing a tool that uses Syft (as a lib) to detect dark matter in container images. I want a straightforward flow — given an image reference (e.g. registry.fun/foo@sha:bar123), pull the image from the registry (no conditional logic w/ local Docker API), don't do any scheme parsing, and don't retry the pull implicitly on my behalf. Especially if I'm using a mainstream library to handle image retrieval like GGCR, there should be minimal work needed to generate the SBOM after I've correctly described the image in my logic.

Minimizing the number of custom objects to learn about before I can do a thing

I spent an hour or so this past weekend trying to figure out how I'd construct a Source to do this, and I ended up giving up. (So if I missed it, please point it out to me!)

I see functions like:

func New(in Input, registryOptions *image.RegistryOptions, exclusions []string) (*Source, func(), error)
func NewFromRegistry(in Input, registryOptions *image.RegistryOptions, exclusions []string) (*Source, func(), error)

So before I can even begin constructing a "source", I need to figure a few more things out, like this Input concept, and image.RegistryOptions which comes from a separate project.

Instead, I'd love 😍 for my journey to start with something like:

src, err := source.NewFromRemoteImageRef("registry.fun/foo@sha:bar123")

And then I'm good to call "make the SBOM!".

@wagoodman
Contributor Author

wagoodman commented Apr 27, 2023

@luhring I've got some draft code that we could talk through conceptually (with some options not entirely figured out yet)... think of this as a conversation starter (with a couple more fictitious source objects thrown in for good measure):

package source

import (
	"github.com/anchore/stereoscope/pkg/image"
	"github.com/anchore/syft/syft/artifact"
	v1 "github.com/google/go-containerregistry/pkg/v1"
)

type StereoscopeImageConfig struct {
	Reference       string
	From            image.Source
	Platform        *image.Platform
	RegistryOptions *image.RegistryOptions // TODO: takes platform? as string?
	// name?
}

type GGCRImageConfig struct {
	Image              v1.Image
	ContentPath        string
	AdditionalMetadata []image.AdditionalMetadata
	// name?
}

// the below configs could just be simple constructor args, but I kept it as a config to start...

type DirectoryConfig struct {
	Path string
	// name?
	// root?
}

type FileConfig struct {
	Path string
	// name?
	// root?
}

type GitConfig struct {
	URI string // url or path
	// name?
	// root?
}

func NewFromDirectory(cfg DirectoryConfig) (Source, error) {
	// ...
}

func NewFromFile(cfg FileConfig) (Source, error) {
	// ...
}

func NewFromImage(cfg StereoscopeImageConfig) (Source, error) {
	// ...
}

func NewFromGGCRImage(cfg GGCRImageConfig) (Source, error) {
	// ...
}

func NewFromGit(cfg GitConfig) (Source, error) {
	// ...
}

type StereoscopeImageSource struct {
	// ...

	// implements ImageInterpreter
}

type GGCRImageSource struct {
	// ...

	// implements ImageInterpreter
}

type DirectorySource struct {
	// ...

	// implements PathInterpreter
}

type FileSource struct {
	// ...

	// implements PathInterpreter
}

type GitSource struct {
	// ...

	// implements GitInterpreter
}

type ImageMetadata struct {
	UserInput      string          `json:"userInput"`
	ID             string          `json:"imageID"`
	ManifestDigest string          `json:"manifestDigest"`
	MediaType      string          `json:"mediaType"`
	Tags           []string        `json:"tags"`
	Size           int64           `json:"imageSize"`
	Layers         []LayerMetadata `json:"layers"`
	RawManifest    []byte          `json:"manifest"`
	RawConfig      []byte          `json:"config"`
	RepoDigests    []string        `json:"repoDigests"`
	Architecture   string          `json:"architecture"`
	Variant        string          `json:"architectureVariant,omitempty"`
	OS             string          `json:"os"`
}

// LayerMetadata represents all static metadata that defines what a container image layer is.
type LayerMetadata struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

type GitMetadata struct {
	Commit string
	// ...
}

type Source interface {
	artifact.Identifiable
	FileResolver(Scope) (FileResolver, error)
}

type ImageInterpreter interface {
	Metadata() ImageMetadata
}

type PathInterpreter interface {
	Path() string
}

type GitInterpreter interface {
	Metadata() GitMetadata
}
