Files are not output in CycloneDX formats #1710

Open
bureado opened this issue Apr 4, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@bureado
Contributor

bureado commented Apr 4, 2023

What happened:

When scanning an image such as debian:bullseye, syft catalogs the individual files found in the image but apparently outputs that information only in the spdx-tag-value, spdx-json and syft-json formats. As a result, users of those formats appear to be penalized with larger output files.

What you expected to happen:

I would have expected a combination of:

  1. This behavior being consistent across all output types
  2. Files not being catalogued if they are already "claimed" by a catalogued package
  3. More controls on the files that get catalogued (e.g., only the ones introduced in a given layer)
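As an illustration of expectation 2, here is a minimal Go sketch of excluding files that a catalogued package already claims. The `Package` type and the `unclaimedFiles` function are hypothetical simplifications made up for this example; they are not syft's actual types or API.

```go
package main

import (
	"fmt"
	"sort"
)

// Package is a hypothetical, simplified stand-in for a catalogued package
// and the file paths it claims. It is NOT syft's real package type.
type Package struct {
	Name  string
	Files []string
}

// unclaimedFiles returns the paths in allFiles that no package claims,
// sorted for stable output.
func unclaimedFiles(allFiles []string, pkgs []Package) []string {
	claimed := make(map[string]struct{})
	for _, p := range pkgs {
		for _, f := range p.Files {
			claimed[f] = struct{}{}
		}
	}
	var out []string
	for _, f := range allFiles {
		if _, ok := claimed[f]; !ok {
			out = append(out, f)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative data only; the package-to-file mapping is assumed.
	all := []string{"/bin/chgrp", "/usr/bin/scanelf", "/etc/custom.conf"}
	pkgs := []Package{
		{Name: "coreutils", Files: []string{"/bin/chgrp"}},
		{Name: "scanelf", Files: []string{"/usr/bin/scanelf"}},
	}
	fmt.Println(unclaimedFiles(all, pkgs))
}
```

With this kind of filter, only files that no package accounts for would be listed as standalone file entries.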

It's possible some output formats (e.g., syft-table) don't support files or files aren't a good fit for the use case, since files can be less helpful for downstream consumption scenarios such as grype performing a vulnerability assessment.

It's possible this is by design. I'm raising it because it surprised me when looking at "stage" scans produced by buildkitd (which currently uses syft). buildkitd only supports SPDX JSON, and I noticed some very large SBOM files containing file references that appear largely duplicative and of limited security interest.

Steps to reproduce the issue:

One of:

for format in syft-json cyclonedx-xml cyclonedx-json github-json spdx-tag-value spdx-json syft-table syft-text ; do echo $format ; syft packages -o $format --quiet debian:bullseye | grep chgrp ; echo ; done
for format in syft-json cyclonedx-xml cyclonedx-json github-json spdx-tag-value spdx-json syft-table syft-text ; do echo $format ; syft packages -o $format --quiet alpine:latest | grep /usr/bin/scanelf ; echo ; done

Anything else we need to know?:

This appears to correspond with the use of s.AllCoordinates only in the handlers for SPDX and Syft formats.

A compounding scenario is that this line:

digests = append(digests, file.Digest{Algorithm: "sha1", Value: "0000000000000000000000000000000000000000"})

sets all digests to blank SHA-1 values (unless the file-metadata cataloger is enabled), which makes the output larger with no added security value. In syft-json output, MD5 digests are present. It's unclear whether they can't be used in SPDX because of a spec constraint or for some other reason.
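To gauge how much of an SPDX JSON document is taken up by these placeholder digests, a small Go sketch like the following could count them. It assumes the SPDX 2.x JSON layout (a top-level `files` array whose entries carry `checksums` with `algorithm` and `checksumValue`); the `countBlankSHA1` function and the sample document are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// blankSHA1 is the all-zero placeholder value quoted in this issue.
const blankSHA1 = "0000000000000000000000000000000000000000"

// spdxDoc models only the fields this sketch needs from an SPDX 2.x JSON document.
type spdxDoc struct {
	Files []struct {
		FileName  string `json:"fileName"`
		Checksums []struct {
			Algorithm     string `json:"algorithm"`
			ChecksumValue string `json:"checksumValue"`
		} `json:"checksums"`
	} `json:"files"`
}

// countBlankSHA1 reports how many file entries carry the placeholder SHA-1
// digest, and how many file entries there are in total.
func countBlankSHA1(data []byte) (blank, total int, err error) {
	var doc spdxDoc
	if err = json.Unmarshal(data, &doc); err != nil {
		return 0, 0, err
	}
	for _, f := range doc.Files {
		total++
		for _, c := range f.Checksums {
			if strings.EqualFold(c.Algorithm, "SHA1") && c.ChecksumValue == blankSHA1 {
				blank++
				break // count each file at most once
			}
		}
	}
	return blank, total, nil
}

func main() {
	// Hand-written sample, not real syft output.
	sample := []byte(`{"files":[
	  {"fileName":"/bin/chgrp","checksums":[{"algorithm":"SHA1","checksumValue":"` + blankSHA1 + `"}]},
	  {"fileName":"/etc/os-release","checksums":[{"algorithm":"SHA1","checksumValue":"da39a3ee5e6b4b0d3255bfef95601890afd80709"}]}
	]}`)
	blank, total, err := countBlankSHA1(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d of %d file entries have a placeholder SHA-1\n", blank, total)
}
```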

Environment:

$ syft version
Application:        syft
Version:            0.76.0
JsonSchemaVersion:  7.0.1
BuildDate:          2023-03-31T15:11:44Z
GitCommit:          dfcc07e5122217ca9e2fc75817c593356fc0c405
GitDescription:     v0.76.0
Platform:           linux/amd64
GoVersion:          go1.19.7
Compiler:           gc
@bureado bureado added the bug Something isn't working label Apr 4, 2023
@bureado
Contributor Author

bureado commented Apr 4, 2023

Related to #1524 in that files belonging to said packages could be potentially excluded from the generic cataloguing path.

@bureado
Contributor Author

bureado commented Apr 4, 2023

Related to #1256 in potential SPDX specification constraints.

@tgerla
Contributor

tgerla commented Apr 13, 2023

Hey @bureado, thanks for the report. We're discussing this as a team and we think it might be useful if we were able to talk to you "live" about this issue and some of the related topics. Any chance you can join one of our upcoming community meetings? The next one will be April 27, at Noon Eastern time. Or ping us on Slack (https://get.anchore.com/join-anchore-community/) and we can have an async conversation. Much appreciated!

@kzantow kzantow changed the title from "Files are catalogued alongside other types of components only in the SPDX and Syft JSON formats" to "Files are not output in CycloneDX formats" Aug 10, 2023
@kzantow
Contributor

kzantow commented Aug 10, 2023

This issue has a few different changes described. If we limit this to outputting files in CycloneDX format, this would solve what I believe is the main issue: SPDX and CycloneDX formats are not equivalent in terms of files being output.

We could, of course, add more options to restrict what is actually being cataloged, but I think these would be separate issues from enhancing CycloneDX output.

@wagoodman
Contributor

I think this will be made a little better with #1383. Specifically, we'll be capturing file digests for files that are directly related to packages by default. That means that:

  • we won't be capturing 00000... for digests in spdx sboms
  • the core syft model would have digests for the cyclonedx encoder to reference
  • there shouldn't be use of s.AllCoordinates anymore for file sets in sboms

@coheigea
Contributor

@wagoodman What do you think about whether to make it configurable if we output information on files for SPDX? Other vendors like GitHub do not include this in the SBOM.

@wagoodman
Contributor

We could start doing that since we have format-specific configurations now. However, I think this will get much better with the upcoming work in #1383:

 cat /tmp/output-before-1383.spdx.json | grep '"checksumValue": "0000000000000000000000000000000000000000"' | wc -l
   11280

cat /tmp/output-after-1383.spdx.json | grep '"checksumValue": "0000000000000000000000000000000000000000"' | wc -l
      34

Where file selection would be controlled by configuration (regardless of the format):

file:
  metadata:
    # can be: all-files, owned-files, no-files
    selection: all-files
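As a rough sketch of what the three selection values could mean when deciding which files end up in an SBOM, consider the Go snippet below. The `File` type and `selectFiles` function are hypothetical, and the semantics assigned to each value are an assumption based only on the names in the config comment above; the actual behavior is defined by #1383.

```go
package main

import "fmt"

// File is a hypothetical, simplified file record; Owned marks files that
// belong to a catalogued package. It is NOT syft's real type.
type File struct {
	Path  string
	Owned bool
}

// selectFiles applies an assumed interpretation of the selection values:
// all-files keeps everything, owned-files keeps only package-owned files,
// and no-files keeps nothing. Unknown values are rejected.
func selectFiles(selection string, files []File) ([]File, error) {
	switch selection {
	case "all-files":
		return files, nil
	case "owned-files":
		var out []File
		for _, f := range files {
			if f.Owned {
				out = append(out, f)
			}
		}
		return out, nil
	case "no-files":
		return nil, nil
	default:
		return nil, fmt.Errorf("unknown file selection %q", selection)
	}
}

func main() {
	// Illustrative data only.
	files := []File{
		{Path: "/bin/chgrp", Owned: true},
		{Path: "/etc/custom.conf", Owned: false},
	}
	for _, sel := range []string{"all-files", "owned-files", "no-files"} {
		kept, err := selectFiles(sel, files)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s keeps %d file(s)\n", sel, len(kept))
	}
}
```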

@coheigea
Contributor

Sounds good, thanks @wagoodman. I ran a Syft-generated SPDX document through https://tools.spdx.org/app/validate/ and it complained that "Found analyzed files for package apk-tools when analyzedFiles is set to false", so the current output is not quite aligned. I'll check again once your PR is merged.
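For a quick local check of the same kind of inconsistency, a Go sketch along these lines could flag packages that declare `filesAnalyzed: false` but still have `CONTAINS` relationships pointing at file elements. This is only an approximation under the SPDX 2.x JSON field names; the validator's exact logic is not known here, and `packagesWithUnexpectedFiles` plus the sample document are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spdxDoc models only the fields this sketch needs from an SPDX 2.x JSON document.
type spdxDoc struct {
	Packages []struct {
		SPDXID        string `json:"SPDXID"`
		Name          string `json:"name"`
		FilesAnalyzed *bool  `json:"filesAnalyzed"` // nil when the field is omitted
	} `json:"packages"`
	Files []struct {
		SPDXID string `json:"SPDXID"`
	} `json:"files"`
	Relationships []struct {
		Element string `json:"spdxElementId"`
		Type    string `json:"relationshipType"`
		Related string `json:"relatedSpdxElement"`
	} `json:"relationships"`
}

// packagesWithUnexpectedFiles returns the names of packages that explicitly set
// filesAnalyzed to false yet CONTAIN at least one file element.
func packagesWithUnexpectedFiles(data []byte) ([]string, error) {
	var doc spdxDoc
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	fileIDs := make(map[string]bool)
	for _, f := range doc.Files {
		fileIDs[f.SPDXID] = true
	}
	containsFile := make(map[string]bool)
	for _, r := range doc.Relationships {
		if r.Type == "CONTAINS" && fileIDs[r.Related] {
			containsFile[r.Element] = true
		}
	}
	var names []string
	for _, p := range doc.Packages {
		if p.FilesAnalyzed != nil && !*p.FilesAnalyzed && containsFile[p.SPDXID] {
			names = append(names, p.Name)
		}
	}
	return names, nil
}

func main() {
	// Hand-written sample, not real syft output.
	sample := []byte(`{
	  "packages": [{"SPDXID": "SPDXRef-Package-apk-tools", "name": "apk-tools", "filesAnalyzed": false}],
	  "files": [{"SPDXID": "SPDXRef-File-sbin-apk"}],
	  "relationships": [{"spdxElementId": "SPDXRef-Package-apk-tools", "relationshipType": "CONTAINS", "relatedSpdxElement": "SPDXRef-File-sbin-apk"}]
	}`)
	names, err := packagesWithUnexpectedFiles(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```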

@wagoodman
Contributor

wagoodman commented Nov 15, 2023

Independent of #1383, I think a format configuration like you're suggesting wouldn't be a bad idea. Something like:

spdx:
    # common options...

    analyze-files: true

    # json specific options
    json:
       ....

    # xml specific options
    xml:
        ...

We'd need to adjust the app configuration to allow for this nesting; right now it's:

spdx-json:
    ...

spdx-xml:
   ...

Projects
Status: Backlog
Development

No branches or pull requests

5 participants