
chunked: store compressed digest if validated #2001

Merged

Conversation

@giuseppe (Member) commented Jul 5, 2024

If the compressed digest was validated, as happens when 'pull_options = {convert_images = "true"}' is set, then store it as well, so that reusing the blob by its compressed digest works.

Previously, when an image converted to zstd:chunked was pulled a second time, it was not recognized by its compressed digest, so the whole image had to be pulled again.
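For context, the pull_options setting quoted above lives in storage.conf under [storage.options]. A minimal illustrative snippet follows; only the convert_images value comes from this description, while the section name and the enable_partial_images key are an assumption about a typical configuration:

```toml
[storage.options]
# convert_images = "true" converts images to zstd:chunked at pull time, which
# is the case in which the compressed digest gets validated (see above).
pull_options = {enable_partial_images = "true", convert_images = "true"}
```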

openshift-ci bot (Contributor) commented Jul 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Jul 5, 2024
@mtrmac (Collaborator) left a review comment

I think this works for simple cases, but it is not optimal in general: A single Layer can have multiple compressed representations, and this only stores one.

For ordinary uncompressed layers, that is handled by putBlobToPendingFile (computing both digests and) calling RecordDigestUncompressedPair; that can fully record an M-to-1 mapping. And that mapping is then consulted in tryReusingBlobAsPending.

So, it seems to me that PutBlobPartial should call RecordDigestUncompressedPair if the uncompressed digest is available. It’s already making an assumption that the compressed digest has been validated … though having pkg/chunked explicitly return the value in DriverWithDifferOutput, as this PR does, and explicitly validating that CompressedDigest == blobDigest in TryReusingBlobAsPending, would be more obviously correct.
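A rough sketch of what the suggestion above could look like on the c/image side; the helper and variable names are illustrative (not the actual c/image code), and only RecordDigestUncompressedPair and the DriverWithDifferOutput fields come from the discussion:

```go
package partialpull

import (
	"github.com/containers/image/v5/types"
	graphdriver "github.com/containers/storage/drivers"
	digest "github.com/opencontainers/go-digest"
)

// recordValidatedDigests is a hypothetical helper for the PutBlobPartial path:
// record the compressed→uncompressed pair in the BlobInfoCache, but only when
// the compressed digest returned by c/storage matches the digest that was
// actually requested (i.e. it was validated).
func recordValidatedDigests(cache types.BlobInfoCache, requestedDigest digest.Digest, out graphdriver.DriverWithDifferOutput) {
	if out.CompressedDigest == "" || out.CompressedDigest != requestedDigest {
		return // not validated against the requested digest; do not record it
	}
	if out.UncompressedDigest != "" {
		// The BlobInfoCache keeps an M-to-1 compressed→uncompressed mapping,
		// so several compressed representations of the same layer can coexist.
		cache.RecordDigestUncompressedPair(out.CompressedDigest, out.UncompressedDigest)
	}
}
```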

@@ -2529,6 +2529,7 @@ func (r *layerStore) applyDiffFromStagingDirectory(id string, diffOutput *driver
 	layer.GIDs = diffOutput.GIDs
 	updateDigestMap(&r.byuncompressedsum, layer.UncompressedDigest, diffOutput.UncompressedDigest, layer.ID)
 	layer.UncompressedDigest = diffOutput.UncompressedDigest
+	layer.CompressedDigest = diffOutput.CompressedDigest
@mtrmac (Collaborator) commented on the added line:

  • This might overwrite a valid CompressedDigest value (hypothetically, from a template layer) with ""
  • If this field is modified, a corresponding updateDigestMap call needs to happen as well.

I think the other parts of this PR are clearly valuable; this one doesn’t really hurt but seems unnecessary for the purpose. OTOH handling the DriverWithDifferOutput fields consistently is somewhat valuable.

@giuseppe (Member, Author) replied:

Thanks, I'll fix it and call updateDigestMap.
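A sketch of what that fix might look like inside applyDiffFromStagingDirectory, mirroring the existing updateDigestMap call for the uncompressed digest shown in the diff above; the exact merged code may differ, and bycompressedsum is assumed to be the layerStore index backing LayersByCompressedDigest:

```go
// Do not overwrite an existing value with "" (first bullet above), and keep
// the by-compressed-digest index in sync with the field (second bullet),
// exactly as is already done for UncompressedDigest.
if diffOutput.CompressedDigest != "" {
	updateDigestMap(&r.bycompressedsum, layer.CompressedDigest, diffOutput.CompressedDigest, layer.ID)
	layer.CompressedDigest = diffOutput.CompressedDigest
}
```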

@giuseppe (Member, Author) commented Jul 8, 2024

> I think this works for simple cases, but it is not optimal in general: A single Layer can have multiple compressed representations, and this only stores one.

The Layer object already knows about the CompressedDigest, so I think we can use it for the case where we convert an image and validate the compressed digest.

When we pull a layer without any conversion happening, we store the following information in the overlay-layers/layers.json file:

[
  {
    "id": "94e5f06ff8e3d4441dc3cd8b090ff38dc911bfa8ebdb0dc28395bc98f82f983f",
    "created": "2024-07-08T09:28:18.17830826Z",
    "compressed-diff-digest": "sha256:ec99f8b99825a742d50fb3ce173d291378a46ab54b8ef7dd75e5654e2a296e99",
    "compressed-size": 3623844,
    "diff-digest": "sha256:94e5f06ff8e3d4441dc3cd8b090ff38dc911bfa8ebdb0dc28395bc98f82f983f",
    "diff-size": 8082944,
    "compression": 2,
    "uidset": [0],
    "gidset": [0, 42]
  }
]

Previously, without the current change, we weren't storing the compressed-diff-digest:

[
  {
    "id": "d51af96cf93e225825efd484ea457f867cb2b967f7415b9a3b7e65a2f803838a",
    "created": "2024-07-08T09:49:07.867152639Z",
    "diff-digest": "sha256:d51af96cf93e225825efd484ea457f867cb2b967f7415b9a3b7e65a2f803838a",
    "diff-size": 4495360,
    "uidset": [65534, 1, 8, 0],
    "gidset": [8, 0, 65534, 1],
    "big-data-names": ["zstd-chunked-manifest", "zstd-chunked-layer-data"]
  }
]

giuseppe force-pushed the chunked-store-compressed-digest branch from 7b9d43f to 4ba7ad8 on July 8, 2024 10:00
@mtrmac (Collaborator) commented Jul 8, 2024

> I think this works for simple cases, but it is not optimal in general: A single Layer can have multiple compressed representations, and this only stores one.

> The Layer object already knows about the CompressedDigest

It knows one per deduplicated layer. That loses information if the same layer is compressed multiple ways. (And, also, the information is lost when the layer is deleted.)

The way this is expected to work is that it is primarily recorded in the c/image BlobInfoCache.

mtrmac added a commit to mtrmac/image that referenced this pull request Jul 8, 2024
@mtrmac (Collaborator) commented Jul 8, 2024

@giuseppe containers/image#2478 is what I think code based on this PR can look like; that should work correctly even without the layerStore.applyDiffFromStagingDirectory part of this PR.

Alternatively, the WIP containers/image#2321, among various other changes, also updates the BIC without needing a new c/storage API (but hard-codes the same assumption about what c/storage validates).

@giuseppe (Member, Author) commented Jul 8, 2024

> It knows one per deduplicated layer. That loses information if the same layer is compressed multiple ways. (And, also, the information is lost when the layer is deleted.)

> The way this is expected to work is that it is primarily recorded in the c/image BlobInfoCache.

I am not opposing the change in c/image. I think we need that as well to deal with multiple compressed digests for the same uncompressed diff. I agree your change in containers/image#2478 is correct and we should have it. I still think, though, that we also need the current change in c/storage so it can be used by APIs like LayersByCompressedDigest.
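For illustration, this is the kind of caller-side reuse that recording the compressed digest in c/storage enables. A minimal sketch assuming an already-initialized store; the helper name is hypothetical, while LayersByCompressedDigest is the existing Store API:

```go
package reuse

import (
	"github.com/containers/storage"
	digest "github.com/opencontainers/go-digest"
)

// haveLayerForBlob reports whether a layer matching the compressed digest of a
// blob is already stored, which is what lets a second pull of a converted
// image skip downloading that blob again.
func haveLayerForBlob(store storage.Store, compressed digest.Digest) bool {
	layers, err := store.LayersByCompressedDigest(compressed)
	// An error here may simply mean nothing is recorded for this digest yet;
	// for this sketch, treat any failure as "cannot reuse".
	return err == nil && len(layers) > 0
}
```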

@mtrmac (Collaborator) commented Jul 8, 2024

LayersByCompressedDigest is fundamentally lossy … but, sure, as long as we have the API, it’s valuable to try to keep it as good as possible.

@giuseppe (Member, Author) commented Jul 9, 2024

> LayersByCompressedDigest is fundamentally lossy … but, sure, as long as we have the API, it’s valuable to try to keep it as good as possible.

So, can we merge the current PR? :-)

@mtrmac (Collaborator) left a review comment

I’m sorry, I screwed something up and this review remained “pending”.

LGTM otherwise.

giuseppe force-pushed the chunked-store-compressed-digest branch from 4ba7ad8 to 28cba30 on July 10, 2024 07:42
@rhatdan (Member) commented Jul 10, 2024

/lgtm

openshift-ci bot added the lgtm label Jul 10, 2024
openshift-merge-bot merged commit 11c8426 into containers:main Jul 10, 2024
18 checks passed
@mtrmac (Collaborator) commented Jul 10, 2024

LGTM, for the record.

mtrmac added a commit to mtrmac/image that referenced this pull request Jul 10, 2024