Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to perform syncfs after pull #9401

Merged
merged 2 commits into from
Dec 12, 2023
Merged

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Nov 20, 2023

It's to ensure the data integrity during unexpected power failure.

Background

Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.

As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).

The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.

We, containerd, received several issues about data loss after unexpected
power failure.

Solution

Option 1: SyncFs after unpack

Linux platform provides syncfs syscall to synchronize just the
filesystem containing a given file.

Option 2: Fsync directories recursively and fsync on regular file

The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.

However, based on xfstest-dev, there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.

Checkout generic/690, generic/520: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
filesystem implementation.

NOTE: All the related test cases has _flakey_drop_and_remount in
xfstest-dev.

Based on discussion about Documenting the crash-recovery guarantees of Linux file systems,
we can't rely on Fsync-on-parent.

Option 1 is winner

This patch is using option 1.

There is test result based on test-tool.
All the networking traffic created by pull is local.

  • Image: docker.io/library/golang:1.19.4 (992 MiB)

    • Current: 5.446738579s
      • WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
    • Option 1: 6.239686088s
      • WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
    • Option 2: 1m30.510934813s
      • WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
  • Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)

    • Current: 8.852718042s
      • WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
    • Option 1: 9.683387174s
      • WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
    • Option 2: 1m54.302103719s
      • WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568

The Option 1 will increase wios. So, the image_pull_with_sync_fs is
option in CRI plugin.

@fuweid
Copy link
Member Author

fuweid commented Nov 21, 2023

It's based on the discussion at weekly meeting - September 14, 2023

ping @dmcgowan @mikebrow @samuelkarp @neersighted

@zhuangqh
Copy link
Contributor

just writing file with O_direct?
write through to disk without page cache

@fuweid
Copy link
Member Author

fuweid commented Nov 25, 2023

just writing file with O_direct?

Hi @zhuangqh , there is not just regular file, but also symlink, block device and directory. For special files, it's unlikely to use O_direct to do persistence.

Comment on lines +60 to +67
case sync && len(mounts) == 1 && mounts[0].Type == "bind":
defer func() {
if retErr != nil {
return
}

retErr = doSyncFs(mounts[0].Source)
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my knowledge here, we need this logic because umount syscall does not flush the cache for bind mounts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on test result, yes.

The alpine image has only one layer. Based on the following test, the file cache is gone after reboot.

https://github.com/fuweid/go-dmflakey/blob/08b32bc47733b5be82ceb3e7460d250de83112f9/contrib/test/containerd/issue5854_test.go#L32

Copy link
Member

@henry118 henry118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@113xiaoji
Copy link

Mark

@@ -38,6 +38,8 @@ func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
path := mounts[0].Source
_, err := archive.Apply(ctx, path, r, opts...)
return err

// TODO: Do we need to sync all the filesystems?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we open an issue to track this TODO?

@@ -26,7 +26,8 @@ import (
"github.com/containerd/containerd/v2/mount"
)

func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
func apply(ctx context.Context, mounts []mount.Mount, r io.Reader, _sync bool) error {
// TODO: for windows, how to sync?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, let's have an issue to track. Note that _other isn't just Windows; FreeBSD will be in this codepath too (but there's no overlayfs, so there is a mount/umount operation).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. #9497 PTAL. Thanks

@fuweid
Copy link
Member Author

fuweid commented Dec 11, 2023

Ping @dmcgowan this pull request needs your help to skip those ghost pipelines. 😂

@fuweid fuweid disabled auto-merge December 12, 2023 02:18
It's flag to synchronize the underlying filesystem containing files
created during Apply.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
It's to ensure the data integrity during unexpected power failure.

Background:

Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.

As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).

The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.

We, containerd, received several issues about data loss after unexpected
power failure.

* containerd#5854
* containerd#3369 (comment)

Solution:

* Option 1: SyncFs after unpack

Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.

* Option 2: Fsync directories recursively and fsync on regular file

The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.

However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.

Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.

> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].

Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.

* Option 1 is winner

This patch is using option 1.

There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.

  * Image: docker.io/library/golang:1.19.4 (992 MiB)
    * Current: 5.446738579s
      * WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
    * Option 1: 6.239686088s
      * WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
    * Option 2: 1m30.510934813s
      * WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344

  * Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
    * Current: 8.852718042s
      * WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
    * Option 1: 9.683387174s
      * WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
    * Option 2: 1m54.302103719s
      * WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568

The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.

[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <https://github.com/fuweid/go-dmflakey/blob/a17fb2010db22654b3e54cf506b0dbb5ef7b33ca/contrib/syncfs/containerd/main_test.go#L51>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@dmcgowan dmcgowan added this pull request to the merge queue Dec 12, 2023
Merged via the queue into containerd:main with commit 1feb234 Dec 12, 2023
45 checks passed
@fuweid fuweid deleted the v2-mode branch December 12, 2023 15:28
@fuweid fuweid added cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Dec 12, 2023
@AkihiroSuda
Copy link
Member

@fuweid Do you still plan to cherry-pick this?

@fuweid
Copy link
Member Author

fuweid commented Jan 24, 2024

@AkihiroSuda will do

@dmcgowan dmcgowan changed the title *: introduce image_pull_with_sync_fs in CRI Add option to perform syncfs after pull Jan 29, 2024
@fuweid fuweid added cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Feb 7, 2024
@dmcgowan dmcgowan added the area/distribution Image Distribution label May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/distribution Image Distribution cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch impact/changelog size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants