
chunked: various improvements #1092

Merged: 35 commits, Jan 10, 2022
Conversation

@giuseppe (Member) commented Dec 21, 2021

Various improvements for the pkg/chunked package.

Noteworthy:

  • support splitting a file into multiple chunks and performing network deduplication.
  • keep a local metadata cache for each layer, avoiding re-parsing the JSON metadata file of every layer in the storage.
  • use valyala/gozstd, which is much faster, instead of github.com/klauspost/compress.
  • copy local files from multiple goroutines.
  • use json-iterator to improve JSON manifest parsing.
  • reuse a static buffer for I/O operations.

More details in each commit.

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com

@giuseppe (Member, Author) commented Dec 21, 2021

@cgwalters PTAL

It is still a draft as it is barely tested, but I think this could help with the deduplication of the large Go binaries we were discussing.

The deduplication, in this case, would be just for the network data since locally it is not usable as reflinks require the data to be aligned.
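The splitting itself is not shown in this thread, but the idea behind content-defined chunking can be sketched with a simple rolling sum over a fixed window, declaring a boundary whenever the low bits of the sum match a mask. All names below are hypothetical; this is an illustration of the technique, not the algorithm in pkg/chunked:

```go
package main

import "fmt"

// splitChunks is a minimal content-defined chunker: a sum rolls over a
// fixed-size window, and a chunk boundary is declared whenever the low
// `bits` bits of the sum are all ones. Fewer bits produce smaller and
// more numerous chunks. Illustrative sketch only.
func splitChunks(data []byte, bits uint) [][]byte {
	const window = 64
	mask := uint32(1)<<bits - 1
	var chunks [][]byte
	var sum uint32
	start := 0
	for i := range data {
		sum += uint32(data[i])
		if i >= window {
			sum -= uint32(data[i-window]) // slide the window forward
		}
		// Only split once the current chunk is at least one window long.
		if i-start+1 >= window && sum&mask == mask {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:]) // trailing partial chunk
	}
	return chunks
}

func main() {
	data := make([]byte, 1<<20)
	for i := range data {
		data[i] = byte((i * 2654435761) >> 16) // arbitrary varied content
	}
	total := 0
	for _, c := range splitChunks(data, 13) {
		total += len(c)
	}
	fmt.Println(total == len(data)) // chunks always reassemble to the input
}
```

Because boundaries depend only on the bytes inside the window, shifting data by inserting bytes early in a file lets most later boundaries re-align, which is what makes deduplicating identical chunks across image versions possible.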

@giuseppe giuseppe force-pushed the chunked-intra-files branch 4 times, most recently from 989f9c0 to 98c459b on December 22, 2021 14:35
@cgwalters (Contributor) left a comment

Can you toss up some sort of even brief description or diagram (e.g. https://asciiflow.com/#/ ) of how the chunks are stored? How does it relate to the container layers, etc? It's hard to just jump in and look at code without that.

Also, it'd really help build confidence in this to have even some basic unit tests. See e.g. https://github.com/ostreedev/ostree/blob/main/tests/test-rollsum.c

The ostree static delta code which uses this is also tested by unit tests that verify that it regenerates exactly the same files.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe (Member, Author)

> Can you toss up some sort of even brief description or diagram (e.g. https://asciiflow.com/#/ ) of how the chunks are stored? How does it relate to the container layers, etc? It's hard to just jump in and look at code without that.
>
> Also, it'd really help build confidence in this to have even some basic unit tests. See e.g. https://github.com/ostreedev/ostree/blob/main/tests/test-rollsum.c
>
> The ostree static delta code which uses this is also tested by unit tests that verify that it regenerates exactly the same files.

yes, I need to write some documentation about it.

The metadata JSON file is based on the same format as estargz (https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md); chunks are represented using the estargz TOC file.

For the tarball itself there is not much difference: the only change is that compression must be restarted whenever a chunk is created, so we can record the offset and use it when generating the TOC JSON metadata file that is appended at the end of the tarball.

Locally there is no difference in how the files are stored. We look up the layers' metadata to know which files/chunks are present. If a chunk is already present locally, we read it directly from the checked-out file.

For example, we have the following entries in the metadata for a layer present in the local store:

    {
      "type": "reg",
      "name": "usr/bin/podman",
      "mode": 493,
      "size": 49984904,
      "modtime": "2021-12-23T13:10:02+01:00",
      "accesstime": "0001-01-01T00:00:00Z",
      "changetime": "0001-01-01T00:00:00Z",
      "digest": "sha256:8270b870b26cdca16bc158f6d230a1d6ca1d3d7503b2f97d9bcf85dbb6327f1b",
      "offset": 131,
      "endOffset": 17374600,
      "chunkSize": 1625,
      "chunkDigest": "sha256:1b1a58f4ba739b5d076472a4559c15a7c31b20182c7e1abd5bb55761bd185a8b"
    },
    {
      "type": "chunk",
      "name": "usr/bin/podman",
      "offset": 1055,
      "chunkSize": 9188,
      "chunkOffset": 1625,
      "chunkDigest": "sha256:5d7700542a3069fecd4cdfc5f1e1f5f7f50f5109e78ff137267be199af52f724"
    },
    {
      "type": "chunk",
      "name": "usr/bin/podman",
      "offset": 4720,
      "chunkSize": 2799,
      "chunkOffset": 10813,
      "chunkDigest": "sha256:ebb089e3b60dac8a238d1154e9259efee9ffd298d3da1ba1d13a34c3eda9340a"
    },
If we pull a new image that has a chunk with "chunkDigest": "sha256:ebb089e3b60dac8a238d1154e9259efee9ffd298d3da1ba1d13a34c3eda9340a", then we copy it from the file $LAYER_CHECKOUT/usr/bin/podman, reading 2799 bytes at offset 10813 (the chunkOffset recorded in the metadata).

Loading the JSON metadata for each layer in the storage is quite expensive; this is something that could be improved, but I've not looked into it yet.

I've done some tests with appending data to a big file, and the rolling checksum seems to help a lot. Unfortunately, I don't see a big improvement with Go binaries: comparing podman 3.4.0 and podman 3.4.4, I get ~12% dedup if I use 13 bits for the rolling checksum, and ~5% if using 16 bits as it is set now. With 13 bits, too many chunks are generated, and that has an impact on the final tarball size. I wonder if there is anything we can do with the linker settings to generate more similar binaries.

@giuseppe giuseppe changed the title [WIP] chunked: split files using a rolling checksum chunked: split files using a rolling checksum Dec 23, 2021
@giuseppe giuseppe force-pushed the chunked-intra-files branch 5 times, most recently from b92cb8e to 5dd63b3 on December 24, 2021 12:22
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
It solves a problem where the discard could be performed before the
compression handler was closed (through a deferred call).

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the chunked-intra-files branch 5 times, most recently from 0795a4d to 45e6852 on December 24, 2021 16:48
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the chunked-intra-files branch 3 times, most recently from b4340ae to c2defb9 on January 10, 2022 10:22
@giuseppe giuseppe changed the title chunked: split files using a rolling checksum chunked: various improvements Jan 10, 2022
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Avoid using slices. I've seen a drop of ~20M in memory
usage with a Fedora image.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
A reference to the just-created parent directory is already open, so
use it directly.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Avoid parsing each JSON TOC file for the layers in the local storage;
instead, attempt to create a lookaside cache in a custom format that is
faster to load (and potentially mmap'able).

The same cache is used to look up files, chunks and candidates for
deduplication with hard links.

There are three kinds of digests stored:

- digest(file.payload)
- digest(digest(file.payload) + file.UID + file.GID + file.mode + file.xattrs)
- digest(i) for each i in chunks(file.payload)

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
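The second digest kind in the commit message above can be sketched as follows; the exact serialization of the fields is an assumption made for illustration, since the real cache defines its own format:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// metadataDigest combines the payload digest with ownership, mode and
// xattrs, so two files are dedup candidates only when both content and
// metadata match. The concatenation order here is an assumption; the
// real lookaside cache defines its own serialization.
func metadataDigest(payloadDigest string, uid, gid, mode int, xattrs map[string]string) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s%d%d%d", payloadDigest, uid, gid, mode)
	keys := make([]string, 0, len(xattrs))
	for k := range xattrs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for map iteration
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s", k, xattrs[k])
	}
	return fmt.Sprintf("sha256:%x", h.Sum(nil))
}

func main() {
	payload := fmt.Sprintf("sha256:%x", sha256.Sum256([]byte("payload")))
	a := metadataDigest(payload, 0, 0, 0o755, nil)
	b := metadataDigest(payload, 0, 0, 0o644, nil)
	fmt.Println(a != b) // different mode, so different metadata digest
}
```

Keeping the payload digest and the metadata digest separate lets the cache answer two different questions: "is this content already present?" (for chunk reuse) and "can this file be hard-linked?" (which also requires matching ownership, mode and xattrs).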
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reduce the number of allocations done by the parser by reading into a
bytes.Buffer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@rhatdan (Member) commented Jan 10, 2022

LGTM

@rhatdan rhatdan merged commit a6837c9 into containers:main Jan 10, 2022