Add additional catalog indexes for performance #154

Merged
merged 35 commits into from
Feb 8, 2023

Conversation

Contributor

@wagoodman wagoodman commented Jan 24, 2023

This adds additional indexes for:

  • basename
  • basename globs
  • file extension

These can be optionally leveraged for improved performance instead of using the standard search by glob against the filetree.
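To make the idea concrete, here is a minimal sketch of what indexing by basename, basename glob, and extension could look like. This is not the `filetree.Index` implementation from this PR: `pathIndex` and its methods are hypothetical names, and the sketch indexes plain path strings rather than file references.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// pathIndex is a hypothetical, simplified stand-in for a file catalog index.
type pathIndex struct {
	byBasename  map[string][]string // "package.json" -> full paths
	byExtension map[string][]string // ".json" -> full paths
}

func newPathIndex() *pathIndex {
	return &pathIndex{
		byBasename:  map[string][]string{},
		byExtension: map[string][]string{},
	}
}

// Add records p under its basename and, when it has one, its extension.
func (i *pathIndex) Add(p string) {
	base := path.Base(p)
	i.byBasename[base] = append(i.byBasename[base], p)
	if ext := path.Ext(base); ext != "" {
		i.byExtension[ext] = append(i.byExtension[ext], p)
	}
}

// ByBasename returns the paths whose basename equals name.
func (i *pathIndex) ByBasename(name string) []string {
	return sorted(i.byBasename[name])
}

// ByExtension returns the paths whose extension (including the dot) equals ext.
func (i *pathIndex) ByExtension(ext string) []string {
	return sorted(i.byExtension[ext])
}

// ByBasenameGlob matches pattern against the indexed basenames only, so the
// work scales with the number of distinct basenames, not the size of the tree.
func (i *pathIndex) ByBasenameGlob(pattern string) ([]string, error) {
	var out []string
	for base, paths := range i.byBasename {
		ok, err := path.Match(pattern, base)
		if err != nil {
			return nil, err
		}
		if ok {
			out = append(out, paths...)
		}
	}
	return sorted(out), nil
}

// sorted returns a sorted copy so that results are deterministic.
func sorted(in []string) []string {
	out := append([]string(nil), in...)
	sort.Strings(out)
	return out
}

func main() {
	idx := newPathIndex()
	for _, p := range []string{
		"/usr/lib/libssl.so",
		"/usr/lib/libcrypto.so",
		"/opt/app/package.json",
		"/srv/package.json",
	} {
		idx.Add(p)
	}
	fmt.Println(idx.ByBasename("package.json"))
	fmt.Println(idx.ByExtension(".so"))
	fmt.Println(idx.ByBasenameGlob("lib*.so"))
}
```

Each lookup is a map access (or a scan over distinct basenames for the glob case) instead of a walk over every node in the filetree.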

Preview of change in syft (today's behavior):

time go run ./cmd/syft ~/images/virt-manager.tar -vvv  
# ...
real	4m34.067s
user	5m49.291s
sys	0m6.189s

With the new indexes being leveraged:

time go run ./cmd/syft ~/images/virt-manager.tar -vvv
# ...
real	38.155s
user	43.73s
sys	12.88s

For the indexes to work correctly with link resolution, the file.References returned from the file tree must additionally carry link resolution results, for the leaf case only. By leaf case, I mean this only surfaces links on the basenames of paths, not in parent (directory) paths. Surfacing ancestor link resolution can be done but is outside the scope of this PR.

Why is this link resolution needed? With tree-based searches, the results are inherently in the tree. This is not the case with indexes: the indexes span all trees (say, all layer or squash trees). This means that the results from an index lookup need to be filtered down to nodes that are in the tree. It is important during this step that file catalog entries have the same reference ID as what is fetched from the tree; otherwise it would be possible to return a result that was fetched from the catalog but is not in the tree. However, consider a search against an index that returns a symlink which resolves to a real file that is in the tree (so it is a valid result). Without considering link resolution, results from these cases would be dropped.
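The filtering step can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code in this PR: `ID`, `entry`, `node`, `tree`, `resolution`, and `filterByTree` are all hypothetical names, the tree is a flat map of paths, and dangling or cyclic links are simply dropped.

```go
package main

import "fmt"

// ID identifies a catalog entry; the same ID must be used by the tree.
type ID int

// entry is an index hit. Hits may come from any tree (layer or squash).
type entry struct {
	Path string
	ID   ID
}

// node is a tree entry: a regular file, or a symlink when LinkTarget is set.
type node struct {
	ID         ID
	LinkTarget string
}

type tree map[string]node

// resolution records the requested path, the file finally reached, and the
// IDs of the leaf symlinks that were followed along the way.
type resolution struct {
	RequestPath string
	Final       ID
	Links       []ID
}

// lookup resolves p in the tree, following leaf symlinks.
func (t tree) lookup(p string) (resolution, bool) {
	res := resolution{RequestPath: p}
	seen := map[string]bool{}
	for {
		n, ok := t[p]
		if !ok || seen[p] {
			return resolution{}, false // missing path, dangling link, or cycle
		}
		if n.LinkTarget == "" {
			res.Final = n.ID
			return res, true
		}
		seen[p] = true
		res.Links = append(res.Links, n.ID)
		p = n.LinkTarget
	}
}

// references reports whether id is the resolved file or one of the links.
func (r resolution) references(id ID) bool {
	if r.Final == id {
		return true
	}
	for _, l := range r.Links {
		if l == id {
			return true
		}
	}
	return false
}

// filterByTree keeps only index hits that are present in the given tree.
func filterByTree(t tree, hits []entry) []resolution {
	var out []resolution
	for _, h := range hits {
		if r, ok := t.lookup(h.Path); ok && r.references(h.ID) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	t := tree{
		"/bin/busybox": {ID: 1},
		"/bin/sh":      {ID: 2, LinkTarget: "/bin/busybox"},
	}
	hits := []entry{
		{Path: "/bin/sh", ID: 2}, // symlink in this tree: kept
		{Path: "/bin/sh", ID: 9}, // same path, but from another tree: dropped
		{Path: "/old/sh", ID: 7}, // path not in this tree: dropped
	}
	fmt.Printf("%+v\n", filterByTree(t, hits))
}
```

If the filter compared only the final resolved ID against the hit, the `/bin/sh` symlink (ID 2) would be discarded because the tree lookup yields the busybox file (ID 1). Keeping the link IDs in the resolution is what lets the hit survive.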

Specific changes:

  • Splits the image.FileCatalog into two objects, where most of the index-specific implementation has been migrated to filetree.Index
  • filetree.FileTree now returns a file.Resolution, which illustrates basename link resolution with multiple file.Resolution objects. These objects show the path that was requested and the file.Reference that it resolved to.
  • Deprecates multiple Image and Layer methods whose names and functions made more sense with previous implementations (and no longer do with this one). This includes any FileContents*() and FileByMIMEType* methods.
  • Removes tar-specific fields from file.Metadata and raises the file.Metadata object to be used more agnostically (e.g. from syft's directory resolver). Additionally, this pivots the existing file.Type enumerations away from the standard lib tar package constants, swapping in new, tar-independent constants. A new type (int) is selected to ensure this is a breaking change from a type-system perspective and not a breaking change from a (subtle) value-semantics perspective.
  • Adds consistent set implementations for ID, path, and filenodes (which are leveraged more in this PR). This sets up for a generic implementation (which was removed from this PR for scope and go 1.19 generics bugs).
  • Adds a new filetree.Searcher abstraction and concrete filetree.searchContext implementation for leveraging optional indexes when searching by glob, path, or MIME type.
  • Adds a new glob parser to drive glob searching for the new filetree.searchContext, rewriting globs as needed to best match the available indexes from filetree.Index
  • Introduces filetree.Reader, filetree.Writer and filetree.ReadWriter interfaces to better restrain operations on filetrees from Image and Layer objects.
  • Adds a filetree.Builder to help construct both filetree.FileTrees and filetree.Indexes in a coordinated fashion.
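As an illustration of the tar-independent file type mentioned above, a sketch along these lines shows the general shape. The constant names and the `TypeFromTarFlag` / `TypeFromMode` helpers are assumptions made for this example and may not match what the PR actually defines.

```go
package main

import (
	"archive/tar"
	"fmt"
	"io/fs"
)

// Type is a file type that does not depend on tar typeflag values.
type Type int

const (
	TypeRegular Type = iota
	TypeHardLink
	TypeSymLink
	TypeCharacterDevice
	TypeBlockDevice
	TypeDirectory
	TypeFIFO
	TypeIrregular
)

// TypeFromTarFlag converts a tar header typeflag into a Type.
func TypeFromTarFlag(flag byte) Type {
	switch flag {
	case tar.TypeReg:
		return TypeRegular
	case tar.TypeLink:
		return TypeHardLink
	case tar.TypeSymlink:
		return TypeSymLink
	case tar.TypeChar:
		return TypeCharacterDevice
	case tar.TypeBlock:
		return TypeBlockDevice
	case tar.TypeDir:
		return TypeDirectory
	case tar.TypeFifo:
		return TypeFIFO
	}
	return TypeIrregular
}

// TypeFromMode converts a filesystem mode into a Type, which is what a
// directory-based (non-tar) source would need. Hard links cannot be
// detected from the mode alone, so they are reported as regular files.
func TypeFromMode(mode fs.FileMode) Type {
	switch {
	case mode.IsDir():
		return TypeDirectory
	case mode&fs.ModeSymlink != 0:
		return TypeSymLink
	case mode&fs.ModeNamedPipe != 0:
		return TypeFIFO
	case mode&fs.ModeCharDevice != 0: // checked before ModeDevice on purpose
		return TypeCharacterDevice
	case mode&fs.ModeDevice != 0:
		return TypeBlockDevice
	case mode.IsRegular():
		return TypeRegular
	}
	return TypeIrregular
}

func main() {
	fmt.Println(TypeFromTarFlag(tar.TypeDir) == TypeDirectory)
	fmt.Println(TypeFromMode(fs.ModeSymlink|0o777) == TypeSymLink)
}
```

With separate constructors per source, both the tar-based and directory-based callers end up with the same enumeration, and neither needs to know the other's raw values.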

github-actions bot commented Jan 24, 2023

Benchmark Test Results

Benchmark results from the latest changes vs base branch
latest: Pulling from library/ubuntu
goos: linux
goarch: amd64
pkg: github.com/anchore/stereoscope/pkg/file
cpu: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
docker: 
           │ ./.tmp/benchmark-42c7e84.txt │
           │            sec/op            │
TarIndex-2                   42.51µ ± ∞ ¹
¹ need >= 6 samples for confidence interval at level 0.95

           │ ./.tmp/benchmark-42c7e84.txt │
           │             B/op             │
TarIndex-2                  5.560Ki ± ∞ ¹
¹ need >= 6 samples for confidence interval at level 0.95

           │ ./.tmp/benchmark-42c7e84.txt │
           │          allocs/op           │
TarIndex-2                    93.00 ± ∞ ¹
¹ need >= 6 samples for confidence interval at level 0.95

pkg: github.com/anchore/stereoscope/test/integration
                                      │ ./.tmp/benchmark-42c7e84.txt │
                                      │            sec/op            │
SimpleImage_GetImage/docker-archive-2                   1.507m ± ∞ ¹
SimpleImage_GetImage/oci-archive-2                      1.293m ± ∞ ¹
SimpleImage_GetImage/oci-dir-2                          821.5µ ± ∞ ¹
geomean                                                 1.170m
¹ need >= 6 samples for confidence interval at level 0.95

                                      │ ./.tmp/benchmark-42c7e84.txt │
                                      │             B/op             │
SimpleImage_GetImage/docker-archive-2                  350.8Ki ± ∞ ¹
SimpleImage_GetImage/oci-archive-2                     632.6Ki ± ∞ ¹
SimpleImage_GetImage/oci-dir-2                         399.8Ki ± ∞ ¹
geomean                                                446.0Ki
¹ need >= 6 samples for confidence interval at level 0.95

                                      │ ./.tmp/benchmark-42c7e84.txt │
                                      │          allocs/op           │
SimpleImage_GetImage/docker-archive-2                   2.640k ± ∞ ¹
SimpleImage_GetImage/oci-archive-2                      1.550k ± ∞ ¹
SimpleImage_GetImage/oci-dir-2                          1.334k ± ∞ ¹
geomean                                                 1.761k
¹ need >= 6 samples for confidence interval at level 0.95

docker: Error response from daemon: Get "http://localhost/v2/": dial tcp [::1]:80: connect: connection refused.
                                                   │ ./.tmp/benchmark-42c7e84.txt │
                                                   │            sec/op            │
SimpleImage_FetchSquashedContents/docker-archive-2                   17.35µ ± ∞ ¹
¹ need >= 6 samples for confidence interval at level 0.95

                                                   │ ./.tmp/benchmark-42c7e84.txt │
                                                   │             B/op             │
SimpleImage_FetchSquashedContents/docker-archive-2                  2.648Ki ± ∞ ¹
¹ need >= 6 samples for confidence interval at level 0.95

                                                   │ ./.tmp/benchmark-42c7e84.txt │
                                                   │          allocs/op           │
SimpleImage_FetchSquashedContents/docker-archive-2                    21.00 ± ∞ ¹
¹ need >= 6 samples for confidence interval at level 0.95

@wagoodman wagoodman marked this pull request as ready for review January 24, 2023 17:54
@wagoodman wagoodman requested a review from a team January 24, 2023 17:54
@wagoodman wagoodman marked this pull request as draft January 25, 2023 13:30
@wagoodman wagoodman marked this pull request as ready for review January 25, 2023 13:38
Contributor

@spiffcs spiffcs left a comment

Initial LGTM - I need to go back and double check the test file since it was the largest diff from this change. The tests run correctly locally and no diffs between local env and CI.

Do you want to wait for the syft PR that incorporates these changes before doing the final merge?

I can also test the syft integration if that's up for review

pkg/image/content_helpers.go Outdated

// fetchFilesByBasenameGlob is a common helper function for resolving file references for a file basename glob pattern
// catalog relative to the given tree.
func fetchFilesByBasenameGlob(ft *filetree.FileTree, fileCatalog *FileCatalog, basenameGlob string) ([]file.Reference, error) {
Contributor

I know we're trying to move indexes away from Glob matching in this PR - Is this the best name here?

The function is written so that ** and / are not supported, so I'm trying to think through what kinds of globs are matched then and which of them improve performance. Let me also check out the test for this since that might help in my review. I'll follow up with other comments as I go through this.
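One way to picture which globs can be served by which index is a small classifier like the sketch below. It is only an illustration of the idea, not the glob parser added in this PR: `classify`, `indexKind`, and the rule that only `**/<basename pattern>` globs qualify are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// indexKind names the lookup strategy chosen for a glob (hypothetical).
type indexKind string

const (
	useBasename     indexKind = "basename"
	useBasenameGlob indexKind = "basename-glob"
	useExtension    indexKind = "extension"
	useTree         indexKind = "tree-glob" // fall back to a full tree search
)

// globMeta holds the characters that make a pattern more than a literal.
const globMeta = "*?[]{}\\"

// classify picks an index for globs of the form "**/<basename pattern>" and
// returns the value to look up. Anything else falls back to the tree search.
func classify(glob string) (indexKind, string) {
	if !strings.HasPrefix(glob, "**/") {
		return useTree, glob
	}
	base := strings.TrimPrefix(glob, "**/")
	if strings.Contains(base, "/") || strings.Contains(base, "**") {
		return useTree, glob
	}
	// No glob metacharacters left: an exact basename lookup is enough.
	if !strings.ContainsAny(base, globMeta) {
		return useBasename, base
	}
	// "*.ext" with a single literal extension maps to the extension index.
	if strings.HasPrefix(base, "*.") && len(base) > 2 {
		ext := base[1:] // keep the leading dot
		if !strings.ContainsAny(ext[1:], globMeta+".") {
			return useExtension, ext
		}
	}
	return useBasenameGlob, base
}

func main() {
	for _, g := range []string{
		"**/package.json",
		"**/*.json",
		"**/lib*.so",
		"/etc/**/*.conf",
	} {
		kind, value := classify(g)
		fmt.Printf("%-18s -> %s (%s)\n", g, kind, value)
	}
}
```

Under these assumptions, a basename glob never needs to contain `**` or `/`: those parts of the pattern are either stripped before the index lookup or force the fallback to the tree search.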

pkg/image/file_catalog.go Outdated
kzantow
kzantow previously approved these changes Jan 25, 2023
@wagoodman
Contributor Author

@spiffcs

Do you want to wait for the syft PR that incorporates these changes before doing the final merge?

Already done! anchore/syft#1510 (all PR checks passing, a lot of tests added)

@wagoodman wagoodman dismissed kzantow’s stale review January 27, 2023 22:30

I've added a lot of link resolution logic changes, which warrant another review

@wagoodman wagoodman marked this pull request as draft January 27, 2023 22:30
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
…ution)

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman requested a review from a team February 5, 2023 23:31
@wagoodman wagoodman marked this pull request as ready for review February 5, 2023 23:31
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Contributor

@kzantow kzantow left a comment

👍

internal/string_set.go

import "sort"

type IDSet map[ID]struct{}
Contributor

This is feeling like we could start to use generics

Contributor Author

@wagoodman wagoodman Feb 7, 2023

I already did, but I ran into a 1.19-specific generics bug that causes it to fail under a specific circumstance. For that reason I reverted and went back to plain ol' copy-paste. From the PR description:

Adds consistent set implementations for ID, path, and filenodes (which are leveraged more in this PR). This sets up for a generic implementation (which was removed from this PR for scope and go 1.19 generics bugs).

Contributor Author

Also, this is the commit where I implemented the generic approach 033d3e4 ... one of the tests fails in go 1.19 but passes in 1.20
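For readers following the thread, a generic set of the kind being discussed could look roughly like the sketch below. It is not the code from commit 033d3e4; the `Set` type and its methods are hypothetical, and the example element types are plain `string` and `int` rather than the ID, path, and filenode types in this repo. It requires Go 1.18 or later.

```go
package main

import "fmt"

// Set is a minimal generic set backed by a map, mirroring the shape of the
// hand-written map[T]struct{} sets.
type Set[T comparable] map[T]struct{}

// NewSet creates a set pre-populated with the given items.
func NewSet[T comparable](items ...T) Set[T] {
	s := make(Set[T], len(items))
	s.Add(items...)
	return s
}

// Add inserts each item; adding an existing item is a no-op.
func (s Set[T]) Add(items ...T) {
	for _, it := range items {
		s[it] = struct{}{}
	}
}

// Remove deletes each item; removing a missing item is a no-op.
func (s Set[T]) Remove(items ...T) {
	for _, it := range items {
		delete(s, it)
	}
}

// Contains reports whether item is in the set.
func (s Set[T]) Contains(item T) bool {
	_, ok := s[item]
	return ok
}

// Size returns the number of distinct items.
func (s Set[T]) Size() int {
	return len(s)
}

func main() {
	paths := NewSet("/etc/passwd", "/etc/hosts", "/etc/hosts")
	ids := NewSet(1, 2, 3)
	ids.Remove(2)
	fmt.Println(paths.Size(), paths.Contains("/etc/hosts"))
	fmt.Println(ids.Size(), ids.Contains(2))
}
```

One type like this would stand in for the separate ID, path, and filenode sets, at the cost of depending on generics behaving correctly in every supported Go version.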

type PathSet map[Path]struct{}

func NewPathSet() PathSet {
return make(PathSet)
func NewPathSet(is ...Path) PathSet {
Contributor

A 3rd set that could benefit by generics, maybe

Contributor Author

@wagoodman wagoodman Feb 7, 2023

pkg/file/reference_test.go Outdated
pkg/file/resolution.go
}

// nolint: funlen
func (sc searchContext) _pathsToNode(fn *filenode.FileNode, observedPaths file.PathSet, cache map[cacheRequest]cacheResult) (file.PathSet, error) {
Contributor

nit: is underscore naming a standard thing in go?

Contributor Author

@wagoodman wagoodman Feb 7, 2023

probably more of an alex-ism, carried over from python really

pkg/image/content_helpers.go

err = img.applyOverrideMetadata()
if err != nil {
t.Fatalf("could not create image: %+v", err)
}
for _, d := range deep.Equal(img, &test.image) {
if d := cmp.Diff(img, &test.image,
Contributor

nit: could an assert.JSONEq work here? The output can be easier to read than the cmp.Diff

Contributor Author

Since this compares struct instances, and I am not planning on testing the JSON representation of these, I think something like cmp.Diff is the right tool here.

Contributor

@spiffcs spiffcs left a comment

First pass done - I've got about 23 files left I've marked to go back through to figure out how they work.

.github/scripts/build.sh
.github/scripts/go-mod-tidy-check.sh
.github/workflows/benchmark-testing.yaml
.github/workflows/validations.yaml
DEVELOPING.md
pkg/image/image.go
pkg/image/layer.go
test/integration/fixture_image_simple_test.go Outdated
test/integration/fixture_image_simple_test.go Outdated
test/integration/fixture_image_simple_test.go Outdated
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman enabled auto-merge (squash) February 8, 2023 15:44
@wagoodman wagoodman merged commit 5a306f0 into main Feb 8, 2023
@wagoodman wagoodman deleted the add-basename-catalog-indexes branch February 8, 2023 15:46
gnmahanth pushed a commit to deepfence/stereoscope that referenced this pull request Jun 15, 2023
* add additional catalog indexes for performance

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* [wip] link resolution

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add leaf link resolution on tree responses (defer ancestor link resolution)

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add filetree search context

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add tests for new search context object

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* remove unused tar header fields from file.Metadata struct

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* use singular file type definitions

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add logging for filetree searches

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add limited support for glob classes and alternatives

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add failing test to show that index shortcircuits correct behavior

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add link resolution via filetree search context

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* allow index symlink resolution to function through cycles

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add tests for filetree.Index

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add search by parent basename and fix requirements filtering

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* sort search results

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* change file.Type to int + fix layer 0 squashed search context

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* more cleanup

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* switch to generic set implementation

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* update linter

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* replace generic set implementation with plain set (unstable in go1.19)

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* introduce filetree builder and foster usage of reader interfaces

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* rename content helper functions

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* update docs with background

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix get_xid for cross compilation

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* upgrade CI validations workflow

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* fix snapshot builds

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add tests for file.Index.GetByFileType

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* rename file.Type and file.Resolution

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* ensure that glob results match search facade

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* replace stringset implementation + move resolution tests

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* add note about podman dependency for testing

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* address PR comments

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* remove extra whitespace

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* constrain OS build support

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* update/remove TODO comments

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

---------

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>