
Support package base layer descriptor in package images #2932

Merged
merged 15 commits on Mar 7, 2022

Conversation

hasheddan
Member

Description of your changes

Updates the package manager (primarily the package revision controller) to honor the io.crossplane.xpkg: base layer annotation in Crossplane packages, as described in the xpkg specification. A few notes on the implementation:

  • The ImageBackend used for the package parser, which IMO was pretty bloated (and still is to some extent, but further improvements require restructuring the parser interface in crossplane-runtime), is refactored to deal only with fetching package images and extracting the YAML stream. It no longer handles caching.
  • If exactly one layer descriptor is annotated with io.crossplane.xpkg: base, the ImageBackend downloads only that layer to extract the YAML stream (see the sketch after this list). If no layer descriptor is annotated, the whole image is downloaded and flattened to extract the YAML stream, but only the stream contents are cached.
  • The revision reconciler is refactored to handle all caching directly, rather than deferring some of it to the ImageBackend, which created a leaky separation of concerns.
  • In the event that package contents are not in the cache, we read the YAML stream contents while also passing them along to the cache to be written to a gzipped file concurrently.
  • In the event that package contents are in the cache, we stream them through the gzip reader to the parser.
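
For the base-layer selection described above, a minimal sketch using github.com/google/go-containerregistry; the helper name findBaseLayer and the error text are illustrative, not the exact code in this PR:

```go
package revision

import (
	"errors"

	v1 "github.com/google/go-containerregistry/pkg/v1"
)

const (
	// Annotation key and value from the xpkg specification.
	layerAnnotation     = "io.crossplane.xpkg"
	baseAnnotationValue = "base"
)

// findBaseLayer returns the single layer annotated as the xpkg base, or nil
// if no layer is annotated.
func findBaseLayer(img v1.Image) (v1.Layer, error) {
	manifest, err := img.Manifest()
	if err != nil {
		return nil, err
	}
	var base v1.Layer
	for _, d := range manifest.Layers {
		if d.Annotations[layerAnnotation] != baseAnnotationValue {
			continue
		}
		// Verify that we aren't just using the first layer annotated as
		// xpkg base: more than one annotated layer violates the spec.
		if base != nil {
			return nil, errors.New("multiple layers annotated as package base")
		}
		if base, err = img.LayerByDigest(d.Digest); err != nil {
			return nil, err
		}
	}
	return base, nil
}
```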

Note: potential future optimizations can be made around the use of afero.TarFs, which reads all file contents into a map. The simplest improvement would be to return the tar.Reader directly for annotated packages, since we require that package.yaml is the only file in that layer. I am running some benchmarks to illustrate the change in performance before implementing.

Fixes #2888

Note: as detailed in the issue, additional machinery will be added to crank / crossplane-runtime to support writing annotated package layer descriptors.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Tested with both existing packages (registry.upbound.io/upbound/platform-ref-azure:v0.1.0) and one with annotated layer descriptors (hasheddan/cool-layers:v0.0.3). Cache used a local directory mount:

馃 (crossplane) k get pkgrev
NAME                                                                              HEALTHY   REVISION   IMAGE                                                   STATE    DEP-FOUND   DEP-INSTALLED   AGE
configurationrevision.pkg.crossplane.io/upbound-platform-ref-azure-2941a425862c   True      1          registry.upbound.io/upbound/platform-ref-azure:v0.1.0   Active   2           2               18s

NAME                                                                        HEALTHY   REVISION   IMAGE                                                   STATE    DEP-FOUND   DEP-INSTALLED   AGE
providerrevision.pkg.crossplane.io/crossplane-provider-azure-4ce0c0d45ac9   True      1          registry.upbound.io/crossplane/provider-azure:v0.18.1   Active                               14s
providerrevision.pkg.crossplane.io/crossplane-provider-helm-295bcd0e6dc3    True      1          registry.upbound.io/crossplane/provider-helm:v0.9.0     Active                               8s
providerrevision.pkg.crossplane.io/hasheddan-cool-layers-0a2ec46c4ef1       True      1          hasheddan/cool-layers:v0.0.3                            Active                               3m14s

馃 (crossplane) ls -la .work/inttest-package-cache/ --block-size=K
total 420K
drwxrwxrwx 2 dan   dan     4K Mar  1 14:37 .
drwxrwxr-x 6 dan   dan     4K Mar  1 14:31 ..
-rw-r--r-- 1 65532 65532  33K Mar  1 14:37 crossplane-provider-azure-4ce0c0d45ac9.gz
-rw-r--r-- 1 65532 65532  13K Mar  1 14:37 crossplane-provider-helm-295bcd0e6dc3.gz
-rw-r--r-- 1 65532 65532 347K Mar  1 14:35 hasheddan-cool-layers-0a2ec46c4ef1.gz
-rw-r--r-- 1 65532 65532   9K Mar  1 14:37 upbound-platform-ref-azure-2941a425862c.gz

Also verified that the unzipped contents of items in the package cache matched the expected YAML stream.

Refactors the image cache to accept and return an io.ReadCloser, which
it expects to deliver a valid package YAML stream. The stream is stored
as a gzipped file.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates the package cache tests to reflect the use of gzipped YAML
streams rather than full OCI images.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates the mock package cache to implement the required Has method.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates the path name builder to allow passing file extension.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates the crank build command to explicitly pass the .xpkg extension
to the path builder.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Refactors the image parser backend to not depend on cache and instead
only deal with pulling and extracting YAML streams from images. The
cache will now be managed separately from the image backend, and the
image backend will only be called in the event we don't have the data in
the cache.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates image backend tests to remove caching functionality and exercise
the layer annotation checks.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates the revision reconciler to directly manage the cache instead of
deferring the responsibility to the image backend. The uncompressed file
contents are read by the package parser and forwarded to the cache
concurrently in the event that the contents are not already cached.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates revision reconciler tests with direct caching such that image
backend is no longer responsible for writing new packages to the cache.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates the crossplane start entrypoint to use PackageCache instead of
ImageCache.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
gzipFileReader returns an io.ReadCloser while still ensuring that the
underlying file can be closed, while TeeReadCloser wraps a TeeReader
such that the Writer is closed when Close() is called.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
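
Roughly, the shape of that tee wrapper (a sketch, not the exact code in the PR):

```go
package revision

import "io"

// teeReadCloser reads from rc while copying everything read to w, and
// closes both when Close is called.
type teeReadCloser struct {
	r  io.Reader      // io.TeeReader(rc, w)
	rc io.ReadCloser  // underlying source, e.g. the extracted YAML stream
	w  io.WriteCloser // destination, e.g. the cache file being written
}

// TeeReadCloser constructs a teeReadCloser from the passed ReadCloser and
// WriteCloser.
func TeeReadCloser(rc io.ReadCloser, w io.WriteCloser) io.ReadCloser {
	return &teeReadCloser{r: io.TeeReader(rc, w), rc: rc, w: w}
}

func (t *teeReadCloser) Read(p []byte) (int, error) {
	return t.r.Read(p)
}

// Close closes the underlying source first, then the destination Writer.
func (t *teeReadCloser) Close() error {
	rerr := t.rc.Close()
	werr := t.w.Close()
	if rerr != nil {
		return rerr
	}
	return werr
}
```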
@hasheddan hasheddan requested review from negz and muvaf March 1, 2022 21:09
JoinedReadCloser joins an io.Reader and an io.Closer. It is typically
used in the case where a ReadCloser is wrapped by another Reader, but
the caller needs to be able to close the underlying ReadCloser when
finished.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
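
A sketch of that helper (names mirror the commit message; not necessarily the exact code):

```go
package revision

import "io"

// joinedReadCloser reads from r but delegates Close to c. Useful when a
// ReadCloser is wrapped by another Reader and the caller must still be
// able to close the underlying source.
type joinedReadCloser struct {
	r io.Reader
	c io.Closer
}

// JoinedReadCloser joins an io.Reader and an io.Closer into an io.ReadCloser.
func JoinedReadCloser(r io.Reader, c io.Closer) io.ReadCloser {
	return &joinedReadCloser{r: r, c: c}
}

func (j *joinedReadCloser) Read(p []byte) (int, error) { return j.r.Read(p) }
func (j *joinedReadCloser) Close() error               { return j.c.Close() }
```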
Updates the image backend for the package parser to only read the
package file contents from an image or image layer, rather than reading
the entire filesystem contents into memory just to be able to extract
the package file contents.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
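
A sketch of that direction, using the JoinedReadCloser described above: instead of building an afero.TarFs, scan the layer's tar stream and hand the parser a reader positioned at package.yaml (the file name check and error text are illustrative):

```go
package revision

import (
	"archive/tar"
	"errors"
	"io"
	"path/filepath"

	v1 "github.com/google/go-containerregistry/pkg/v1"
)

// packageStream scans a layer's tar stream for package.yaml and returns a
// reader positioned at its contents, joined with the layer stream's closer.
func packageStream(layer v1.Layer) (io.ReadCloser, error) {
	u, err := layer.Uncompressed() // uncompressed tar stream of layer contents
	if err != nil {
		return nil, err
	}
	t := tar.NewReader(u)
	for {
		h, err := t.Next()
		if errors.Is(err, io.EOF) {
			_ = u.Close()
			return nil, errors.New("package.yaml not found in layer")
		}
		if err != nil {
			_ = u.Close()
			return nil, err
		}
		if filepath.Base(h.Name) == "package.yaml" {
			// Reading from t yields only this entry's contents; closing
			// the joined ReadCloser closes the underlying layer stream.
			return JoinedReadCloser(t, u), nil
		}
	}
}
```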
Member Author

@hasheddan hasheddan left a comment


@muvaf @negz I updated to eliminate the tarfs usage as mentioned previously, and when using a non-annotated fat package (hasheddan/provider-aws-fat:v0.0.2 for testing), the reduction in allocations is significant because we are no longer reading the whole filesystem into memory just to get the package.yaml.

Note: these benchmarks are run only on ImageBackend.Init, not on the subsequent parsing, which is still quite inefficient due to how we are consuming the upstream k8s tooling.
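
For reference, the harness is roughly this shape (a sketch: the package-level backend wiring is omitted, and Init here follows the crossplane-runtime parser.Backend interface):

```go
package revision

import (
	"context"
	"io"
	"testing"

	"github.com/crossplane/crossplane-runtime/pkg/parser"
)

// backend is the parser backend under test; wiring it up with a stub
// fetcher that serves hasheddan/provider-aws-fat:v0.0.2 locally is omitted.
var backend parser.Backend

// BenchmarkInit measures only backend Init: fetching the image and
// extracting the YAML stream, without parsing it.
func BenchmarkInit(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		r, err := backend.Init(context.Background())
		if err != nil {
			b.Fatal(err)
		}
		// Drain the stream so the full extraction cost is measured.
		if _, err := io.Copy(io.Discard, r); err != nil {
			b.Fatal(err)
		}
		_ = r.Close()
	}
}
```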

With tarfs:

馃 (revision) go test -bench=. -benchmem -memprofile memprofiletarfs.out -benchtime=20s
goos: linux
goarch: amd64
pkg: github.com/crossplane/crossplane/internal/controller/pkg/revision
cpu: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
BenchmarkInit-12    	      15	1536714280 ns/op	291899693 B/op	  237475 allocs/op
PASS
ok  	github.com/crossplane/crossplane/internal/controller/pkg/revision	55.692s

Without:

馃 (revision) go test -bench=. -benchmem -memprofile memprofile.out -benchtime=20s
goos: linux
goarch: amd64
pkg: github.com/crossplane/crossplane/internal/controller/pkg/revision
cpu: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
BenchmarkInit-12    	      69	 336935786 ns/op	 4640937 B/op	    5012 allocs/op
PASS
ok  	github.com/crossplane/crossplane/internal/controller/pkg/revision	29.939s

@hasheddan
Member Author

A comparison of some of the cached content sizes as well:

馃 (crossplane) ls .work/inttest-package-cache/ -la --block-size=K
total 2668K
drwxrwxrwx 2 dan   dan      4K Mar  3 12:09 .
drwxrwxr-x 6 dan   dan      4K Mar  1 14:31 ..
-rw-r--r-- 1 65532 65532  364K Mar  3 12:09 crossplane-provider-aws-f78664a342f1.gz
-rw-r--r-- 1 65532 65532   33K Mar  1 14:37 crossplane-provider-azure-4ce0c0d45ac9.gz
-rw-r--r-- 1 65532 65532   13K Mar  1 14:37 crossplane-provider-helm-295bcd0e6dc3.gz
-rw-r--r-- 1 65532 65532  299K Mar  3 12:07 hasheddan-provider-aws-fat-6ae095a00184.gz

Note: hasheddan/provider-aws-fat is using a slightly older version of provider-aws, hence the size difference from crossplane/provider-aws.

hasheddan/provider-aws-fat shows the most obvious improvement here, as the compressed image size is 25MB but we are caching 299K.

Member

@negz negz left a comment


A few thoughts, but nothing blocking. I like the refactoring to pull the cache out of the image backend. Thanks @hasheddan!

// verify that we aren't just using the first layer annotated as xpkg
// base.
if foundAnnotated {
return nil, errors.New(errMultipleAnnotatedLayers)
Member

Would it make sense to fall back to using the artifacts from the 'flattened' image in this case? I'm thinking no; it would perhaps be more compatible, but also less predictable and in violation of the spec.

Member Author

It would be a violation of the spec, so I concur that we should not fall back when there are multiple annotated layers 👍🏻

// thread-safe manner.
type ImageCache struct {
// PackageCache stores and retrieves package content in a filesystem-backed
// cache in a thread-safe manner.
Member

I imagine changing from a cache of packages to a cache of YAML streams has some implications:

  1. We're slightly more committed to the idea that, as far as what the package manager needs to operate on a package, "the content of a Crossplane package" is a single YAML stream. Probably not a problem?
  2. Folks updating to this implementation will presumably have their package caches invalidated. Is it worth adding logic to clean up the old image-based cache? IMO probably not.

Member Author

  1. I don't think moving to caching just the YAML stream necessarily prohibits us from extending the package format in the future -- all packages would just need to be pulled again, which shouldn't be a huge issue.
  2. I thought about this, but given that there is no possibility of conflict with content previously written by the cache, I agree that the added complexity of trying to clean up is not worth it. One thing I do want to note in the release notes is that if you had manually populated the cache, you'll need to repopulate it with a different process. I don't think this is a common use case today though.

type ImageCache struct {
// PackageCache stores and retrieves package content in a filesystem-backed
// cache in a thread-safe manner.
type PackageCache struct {
Member

Naming nit: Aren't all hypothetical instances of Cache (the interface) "a package cache"? What do you think about:

  • type Cache interface -> type PackageCache interface
  • type PackageCache struct -> type FilesystemPackageCache struct (or similar)

Member Author

Great call out -- I think I'll opt for PackageCache and FsPackageCache 👍🏻

Comment on lines +39 to +40
Get(id string) (io.ReadCloser, error)
Store(id string, content io.ReadCloser) error
Member

Should we clarify what format content io.ReadCloser is expected to contain - i.e. a gzipped xpkg compliant YAML stream? In particular the fact that it's gzipped wasn't obvious to me until I read the implementation.

Member Author

The content here is actually not gzipped. Content comes into the cache uncompressed, Store wraps the file being written to in a gzip writer, and Get wraps the file being read in a gzip reader.
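
To make that concrete, a sketch of the Store/Get pair under those semantics; the field names and the GzipReadCloser helper (sketched further down in this thread) are assumptions, and locking for thread safety is omitted:

```go
package xpkg

import (
	"compress/gzip"
	"io"
	"path/filepath"

	"github.com/spf13/afero"
)

// FsPackageCache sketch: dir is the cache directory, fs the backing filesystem.
type FsPackageCache struct {
	dir string
	fs  afero.Fs
}

// Store compresses the uncompressed YAML stream on its way to disk.
func (c *FsPackageCache) Store(id string, content io.ReadCloser) error {
	f, err := c.fs.Create(filepath.Join(c.dir, id+".gz"))
	if err != nil {
		return err
	}
	w := gzip.NewWriter(f)
	if _, err := io.Copy(w, content); err != nil {
		return err
	}
	// Close the gzip writer first so its footer is flushed to the file.
	if err := w.Close(); err != nil {
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return content.Close()
}

// Get returns an uncompressed stream; closing it closes the cache file.
func (c *FsPackageCache) Get(id string) (io.ReadCloser, error) {
	f, err := c.fs.Open(filepath.Join(c.dir, id+".gz"))
	if err != nil {
		return nil, err
	}
	return GzipReadCloser(f)
}
```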

}

// GzipFileReader constructs a new gzipFileReader from the passed file.
func GzipFileReader(f afero.File) (io.ReadCloser, error) {
Member

Does this need to take afero.File as an argument? At a glance it seems like an io.ReadCloser would work.

Member Author

Nope -- good catch! This can just be a GzipReadCloser 👍🏻
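
Roughly, the renamed helper (a sketch):

```go
package xpkg

import (
	"compress/gzip"
	"io"
)

// gzipReadCloser decompresses from rc and closes both the gzip reader and
// the underlying ReadCloser on Close.
type gzipReadCloser struct {
	rc io.ReadCloser // underlying compressed stream, e.g. the cache file
	gr *gzip.Reader  // decompressing reader over rc
}

// GzipReadCloser constructs a gzipReadCloser from the passed ReadCloser.
func GzipReadCloser(rc io.ReadCloser) (io.ReadCloser, error) {
	gr, err := gzip.NewReader(rc)
	if err != nil {
		return nil, err
	}
	return &gzipReadCloser{rc: rc, gr: gr}, nil
}

func (g *gzipReadCloser) Read(p []byte) (int, error) {
	return g.gr.Read(p)
}

// Close closes the gzip reader first, then the underlying stream.
func (g *gzipReadCloser) Close() error {
	if err := g.gr.Close(); err != nil {
		_ = g.rc.Close()
		return err
	}
	return g.rc.Close()
}
```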

}

var rc io.ReadCloser
var err error
Member

Could this err be more tightly scoped? I think it's only used in the if r.cache.Has block, right?

Member Author

Absolutely, I think this is a relic of a previous iteration, thanks for pointing it out!
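
i.e., something like this (a sketch; the Has signature and surrounding names are assumed from the diff context):

```go
package revision

import "io"

// PackageCache is the subset of the cache interface used here, per this PR:
// Has, Get, Store.
type PackageCache interface {
	Has(id string) bool
	Get(id string) (io.ReadCloser, error)
	Store(id string, content io.ReadCloser) error
}

// loadFromCache illustrates the tightened scoping: err now lives only in
// the branch that actually uses it.
func loadFromCache(cache PackageCache, id string) io.ReadCloser {
	var rc io.ReadCloser
	if cache.Has(id) {
		var err error
		if rc, err = cache.Get(id); err != nil {
			return nil // handle the cache read failure appropriately
		}
	}
	return rc
}
```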

Changes gzipFileReader to gzipReadCloser, which accepts a ReadCloser
rather than an afero File.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Uses the more specific PackageCache name for the Cache interface and
FsPackageCache for the filesystem-backed implementation.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>