Speed up cataloging by replacing globs searching with index lookups #1510

Merged: wagoodman merged 57 commits into main from add-basename-catalog-indexes on Feb 9, 2023

@wagoodman wagoodman commented Jan 24, 2023

Leverages anchore/stereoscope#154 to improve cataloging speeds. Full-path glob searches are used everywhere, but in images with a large number of files this kind of operation is very expensive. This PR introduces new search objects that use the basename indexes in stereoscope to make the initial lookup against the index much faster, then prune the candidate results against any additional requirements.
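
To illustrate the idea of "look up by basename, then prune", here is a minimal sketch in Go. It is not the stereoscope filetree.Index or filetree.Searcher API: the basenameIndex type, buildIndex, and search are hypothetical names, and the "additional requirement" is assumed to be a simple path-suffix check.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// basenameIndex maps a basename to every full path that ends in it.
// Hypothetical type for illustration only (not the stereoscope API).
type basenameIndex map[string][]string

// buildIndex walks the path list once and groups paths by basename.
func buildIndex(paths []string) basenameIndex {
	idx := make(basenameIndex)
	for _, p := range paths {
		b := path.Base(p)
		idx[b] = append(idx[b], p)
	}
	return idx
}

// search fetches candidates with one map lookup, then prunes them with an
// extra requirement (assumed here to be a required path suffix).
func (idx basenameIndex) search(basename, requiredSuffix string) []string {
	var out []string
	for _, p := range idx[basename] {
		if strings.HasSuffix(p, requiredSuffix) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	idx := buildIndex([]string{
		"/lib/apk/db/installed",
		"/var/lib/dpkg/status",
		"/home/user/installed",
	})
	// Roughly what a glob like **/lib/apk/db/installed asks for.
	fmt.Println(idx.search("installed", "/lib/apk/db/installed"))
	// prints: [/lib/apk/db/installed]
}
```

The point of the sketch is that only paths sharing the basename are ever compared against the requirement, rather than matching a glob against every path in the image.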

Today's behavior:

time go run ./cmd/syft ~/images/virt-manager.tar -vvv  
# ...
real	4m34.067s
user	5m49.291s
sys	0m6.189s

With the new indexes being leveraged:

time go run ./cmd/syft ~/images/virt-manager.tar -vvv
# ...
real	38.155s
user	43.73s
sys	12.88s
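
For a rough sense of scale, the two wall-clock ("real") figures above can be compared directly. The snippet below is only arithmetic on the numbers quoted in this comment; the speedup helper is a hypothetical name, and the result reflects a single `go run` invocation each way (compile time included), not a controlled benchmark.

```go
package main

import (
	"fmt"
	"time"
)

// speedup reports how many times shorter `after` is than `before`.
// Both arguments use Go duration syntax, e.g. "4m34.067s".
func speedup(before, after string) (float64, error) {
	b, err := time.ParseDuration(before)
	if err != nil {
		return 0, err
	}
	a, err := time.ParseDuration(after)
	if err != nil {
		return 0, err
	}
	return b.Seconds() / a.Seconds(), nil
}

func main() {
	s, err := speedup("4m34.067s", "38.155s")
	if err != nil {
		panic(err)
	}
	fmt.Printf("wall-clock speedup: %.1fx\n", s)
	// prints: wall-clock speedup: 7.2x
}
```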

Specific changes:

  • Adds several tests around all source.FileResolver implementations to show that they behave very similarly (this PR addresses several inconsistencies)
  • The indexing done in the source.directoryResolver has been split out into another object entirely: source.directoryIndexer
  • Refactors the source.directoryResolver to use the new stereoscope filetree.Index and filetree.Searcher, to best leverage the performance enhancements from anchore/stereoscope#154 (Add additional catalog indexes for performance)
  • Fixes the directory resolver so it does not visit the same file twice during indexing (once from the real path and again from one or more virtual paths).
  • Adds a new form of testing around the catalogers: glob match testing. This is done within each package where a cataloger is implemented and ensures that the cataloger is wired appropriately to select all possible files that can be fetched from the cataloger via globs without testing specific parsing logic.
  • Replaces the source.FileMetadata object with the stereoscope file.Metadata object.
  • Replaces the source.FileType object with the stereoscope file.Type object.
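
The "do not visit the same file twice" fix can be pictured with a small Go sketch. This is not the source.directoryIndexer implementation: indexOnce and its resolve callback are hypothetical, with resolve standing in for symlink resolution (on a real filesystem something like filepath.EvalSymlinks could play that role).

```go
package main

import "fmt"

// indexOnce indexes each real path a single time, even when the walk reaches
// it again through virtual (symlinked) paths. It also records which virtual
// paths pointed at each real path. Hypothetical helper for illustration.
func indexOnce(visitOrder []string, resolve func(string) string) (indexed []string, aliases map[string][]string) {
	seen := make(map[string]bool)
	aliases = make(map[string][]string)
	for _, p := range visitOrder {
		target := resolve(p)
		if target != p {
			// Remember the virtual path without indexing the file again.
			aliases[target] = append(aliases[target], p)
		}
		if seen[target] {
			continue
		}
		seen[target] = true
		indexed = append(indexed, target)
	}
	return indexed, aliases
}

func main() {
	// Assumed fixture: two links that both point at /bin/busybox.
	links := map[string]string{
		"/usr/bin/sh": "/bin/busybox",
		"/sbin/init":  "/bin/busybox",
	}
	resolve := func(p string) string {
		if t, ok := links[p]; ok {
			return t
		}
		return p
	}
	indexed, aliases := indexOnce([]string{"/bin/busybox", "/usr/bin/sh", "/sbin/init"}, resolve)
	fmt.Println(indexed) // prints: [/bin/busybox]
	fmt.Println(len(aliases["/bin/busybox"]))
}
```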

Closes #1328
Closes #1480

@wagoodman wagoodman added the WIP work in progress / do not merge label Jan 24, 2023

github-actions bot commented Jan 24, 2023

Benchmark Test Results

Benchmark results from the latest changes vs base branch
goos: linux
goarch: amd64
pkg: github.com/anchore/syft/test/integration
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                                          │ ./.tmp/benchmark-ba8be9b.txt │
                                                          │            sec/op            │
ImagePackageCatalogers/alpmdb-cataloger-2                                   12.03m ± 29%
ImagePackageCatalogers/ruby-gemspec-cataloger-2                             855.6µ ±  2%
ImagePackageCatalogers/python-package-cataloger-2                           2.152m ±  8%
ImagePackageCatalogers/php-composer-installed-cataloger-2                   669.7µ ±  2%
ImagePackageCatalogers/javascript-package-cataloger-2                       346.6µ ±  1%
ImagePackageCatalogers/dpkgdb-cataloger-2                                   480.9µ ±  2%
ImagePackageCatalogers/rpm-db-cataloger-2                                   446.2µ ±  2%
ImagePackageCatalogers/java-cataloger-2                                     10.90m ±  1%
ImagePackageCatalogers/graalvm-native-image-cataloger-2                     7.963µ ±  2%
ImagePackageCatalogers/apkdb-cataloger-2                                    466.2µ ±  1%
ImagePackageCatalogers/go-module-binary-cataloger-2                         18.00µ ±  1%
ImagePackageCatalogers/dotnet-deps-cataloger-2                              945.6µ ±  1%
ImagePackageCatalogers/portage-cataloger-2                                  288.4µ ±  2%
ImagePackageCatalogers/sbom-cataloger-2                                     103.9µ ±  1%
ImagePackageCatalogers/binary-cataloger-2                                   139.5µ ±  0%
geomean                                                                     430.0µ

                                                          │ ./.tmp/benchmark-ba8be9b.txt │
                                                          │             B/op             │
ImagePackageCatalogers/alpmdb-cataloger-2                                   5.060Mi ± 0%
ImagePackageCatalogers/ruby-gemspec-cataloger-2                             141.9Ki ± 0%
ImagePackageCatalogers/python-package-cataloger-2                           767.5Ki ± 0%
ImagePackageCatalogers/php-composer-installed-cataloger-2                   155.9Ki ± 0%
ImagePackageCatalogers/javascript-package-cataloger-2                       95.83Ki ± 0%
ImagePackageCatalogers/dpkgdb-cataloger-2                                   144.7Ki ± 0%
ImagePackageCatalogers/rpm-db-cataloger-2                                   170.8Ki ± 0%
ImagePackageCatalogers/java-cataloger-2                                     2.723Mi ± 0%
ImagePackageCatalogers/graalvm-native-image-cataloger-2                     1.523Ki ± 0%
ImagePackageCatalogers/apkdb-cataloger-2                                    122.2Ki ± 0%
ImagePackageCatalogers/go-module-binary-cataloger-2                         3.102Ki ± 0%
ImagePackageCatalogers/dotnet-deps-cataloger-2                              314.3Ki ± 0%
ImagePackageCatalogers/portage-cataloger-2                                  75.51Ki ± 0%
ImagePackageCatalogers/sbom-cataloger-2                                     13.06Ki ± 0%
ImagePackageCatalogers/binary-cataloger-2                                   20.55Ki ± 0%
geomean                                                                     105.2Ki

                                                          │ ./.tmp/benchmark-ba8be9b.txt │
                                                          │          allocs/op           │
ImagePackageCatalogers/alpmdb-cataloger-2                                    86.71k ± 0%
ImagePackageCatalogers/ruby-gemspec-cataloger-2                              2.159k ± 0%
ImagePackageCatalogers/python-package-cataloger-2                            10.35k ± 0%
ImagePackageCatalogers/php-composer-installed-cataloger-2                    3.458k ± 0%
ImagePackageCatalogers/javascript-package-cataloger-2                        1.253k ± 0%
ImagePackageCatalogers/dpkgdb-cataloger-2                                    2.646k ± 0%
ImagePackageCatalogers/rpm-db-cataloger-2                                    3.759k ± 0%
ImagePackageCatalogers/java-cataloger-2                                      38.26k ± 0%
ImagePackageCatalogers/graalvm-native-image-cataloger-2                       40.00 ± 0%
ImagePackageCatalogers/apkdb-cataloger-2                                     3.234k ± 0%
ImagePackageCatalogers/go-module-binary-cataloger-2                           101.0 ± 0%
ImagePackageCatalogers/dotnet-deps-cataloger-2                               5.011k ± 0%
ImagePackageCatalogers/portage-cataloger-2                                   1.487k ± 0%
ImagePackageCatalogers/sbom-cataloger-2                                       392.0 ± 0%
ImagePackageCatalogers/binary-cataloger-2                                     609.0 ± 0%
geomean                                                                      2.112k

@wagoodman wagoodman closed this Jan 25, 2023
@wagoodman wagoodman deleted the add-basename-catalog-indexes branch January 25, 2023 13:41
@wagoodman wagoodman restored the add-basename-catalog-indexes branch January 25, 2023 13:41
@wagoodman wagoodman reopened this Jan 25, 2023
@wagoodman wagoodman requested a review from a team January 25, 2023 20:53
@wagoodman wagoodman added enhancement New feature or request and removed WIP work in progress / do not merge labels Jan 25, 2023
@wagoodman wagoodman changed the title Replace raw globs with index equivalent operations Speed up cataloging by replacing globs searching with index lookups Jan 25, 2023
syft/pkg/cataloger/alpm/parse_alpm_db_test.go (resolved)
syft/pkg/cataloger/apkdb/cataloger.go (outdated, resolved)
syft/pkg/cataloger/binary/cataloger.go (outdated, resolved)
syft/pkg/cataloger/deb/cataloger.go (outdated, resolved)
syft/pkg/cataloger/generic/cataloger.go (outdated, resolved)
@kzantow kzantow linked an issue Jan 30, 2023 that may be closed by this pull request
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

@spiffcs spiffcs left a comment

Small comments about whether values could be nil and cause a panic. I'm going to take another look at the resolvers in a second pass, since those had the most changes.

syft/source/image_all_layers_resolver.go (resolved)
syft/source/location.go (resolved)
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman linked an issue Feb 9, 2023 that may be closed by this pull request
…indexes

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman enabled auto-merge (squash) February 9, 2023 16:13
@wagoodman wagoodman merged commit 988041b into main Feb 9, 2023
@wagoodman wagoodman deleted the add-basename-catalog-indexes branch February 9, 2023 16:19
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
…nchore#1510)

* replace raw globs with index equivalent operations

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add cataloger test for alpm cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix import sorting for binary cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix linting for mock resolver

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* separate portage cataloger parser impl from cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* enhance cataloger pkgtest utils to account for resolver responses

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for alpm cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for apkdb cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for dpkg cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for cpp cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for dart cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for dotnet cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for elixir cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for erlang cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for golang cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for haskell cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for java cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for javascript cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for php cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for portage cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for python cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for rpm cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for rust cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for sbom cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for swift cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* allow generic cataloger to run all mimetype searches at once

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* remove stutter from php and javascript cataloger constructors

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* bump stereoscope

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add tests for generic.Search

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add exceptions for java archive git ignore entries

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* enhance basename and extension resolver methods to be variadic

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* don't allow * prefix on extension searches

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add glob-based cataloger tests for ruby cataloger

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* remove unnecessary string casting

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* incorporate surfacing of leaf link resolutions from stereoscope results

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* [wip] switch to stereoscope file metadata

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* [wip + failing] revert to old globs but keep new resolvers

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* index files, links, and dirs within the directory resolver

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix several resolver bugs and inconsistencies

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* move format testutils to internal package

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* update syft json to account for file type string normalization

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* split up directory resolver from indexing

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* update docs to include details about searching

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* [wip] bump stereoscope to development version

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix linting

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* adjust symlinks fixture to be fixed to digest

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix all-locations resolver tests

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix test fixture reference

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* rename file.Type

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* bump stereoscope

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix PR comment to exclude extra *

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* bump to dev version of stereoscope

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* bump to final version of stereoscope

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* move observing resolver to pkgtest

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

---------

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Labels
enhancement New feature or request

Successfully merging this pull request may close these issues.

Improve syft performance by memoizing filetree
Very long cataloging process

3 participants