Add option to perform syncfs after pull #9401

fuweid · 2023-11-20T16:48:47Z

It's to ensure the data integrity during unexpected power failure.

Background

Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.

As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).

The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.

We, containerd, received several issues about data loss after unexpected
power failure.

Solution

Option 1: SyncFs after unpack

Linux platform provides syncfs syscall to synchronize just the
filesystem containing a given file.

Option 2: Fsync directories recursively and fsync on regular file

The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.

However, based on xfstest-dev, there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.

Checkout generic/690, generic/520: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
filesystem implementation.

NOTE: All the related test cases has _flakey_drop_and_remount in
xfstest-dev.

Based on discussion about Documenting the crash-recovery guarantees of Linux file systems,
we can't rely on Fsync-on-parent.

Option 1 is winner

This patch is using option 1.

There is test result based on test-tool.
All the networking traffic created by pull is local.

Image: docker.io/library/golang:1.19.4 (992 MiB)
- Current: 5.446738579s
  - WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
- Option 1: 6.239686088s
  - WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
- Option 2: 1m30.510934813s
  - WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344
Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
- Current: 8.852718042s
  - WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
- Option 1: 9.683387174s
  - WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
- Option 2: 1m54.302103719s
  - WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568

The Option 1 will increase wios. So, the image_pull_with_sync_fs is
option in CRI plugin.

fuweid · 2023-11-21T10:54:31Z

It's based on the discussion at weekly meeting - September 14, 2023

ping @dmcgowan @mikebrow @samuelkarp @neersighted

zhuangqh · 2023-11-25T01:24:58Z

just writing file with O_direct?
write through to disk without page cache

fuweid · 2023-11-25T09:13:40Z

just writing file with O_direct?

Hi @zhuangqh , there is not just regular file, but also symlink, block device and directory. For special files, it's unlikely to use O_direct to do persistence.

henry118 · 2023-12-02T05:19:53Z

diff/apply/apply_linux.go

+	case sync && len(mounts) == 1 && mounts[0].Type == "bind":
+		defer func() {
+			if retErr != nil {
+				return
+			}
+
+			retErr = doSyncFs(mounts[0].Source)
+		}()


Just for my knowledge here, we need this logic because umount syscall does not flush the cache for bind mounts?

based on test result, yes.

The alpine image has only one layer. Based on the following test, the file cache is gone after reboot.

https://github.com/fuweid/go-dmflakey/blob/08b32bc47733b5be82ceb3e7460d250de83112f9/contrib/test/containerd/issue5854_test.go#L32

henry118

LGTM

113xiaoji · 2023-12-10T06:17:26Z

Mark

samuelkarp · 2023-12-10T08:10:06Z

diff/apply/apply_darwin.go

@@ -38,6 +38,8 @@ func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
 		path := mounts[0].Source
 		_, err := archive.Apply(ctx, path, r, opts...)
 		return err
+
+		// TODO: Do we need to sync all the filesystems?


Can we open an issue to track this TODO?

samuelkarp · 2023-12-10T08:11:58Z

diff/apply/apply_other.go

@@ -26,7 +26,8 @@ import (
 	"github.com/containerd/containerd/v2/mount"
 )

-func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
+func apply(ctx context.Context, mounts []mount.Mount, r io.Reader, _sync bool) error {
+	// TODO: for windows, how to sync?


Same here, let's have an issue to track. Note that _other isn't just Windows; FreeBSD will be in this codepath too (but there's no overlayfs, so there is a mount/umount operation).

Sure. #9497 PTAL. Thanks

fuweid · 2023-12-11T10:18:44Z

Ping @dmcgowan this pull request needs your help to skip those ghost pipelines. 😂

It's flag to synchronize the underlying filesystem containing files created during Apply. Signed-off-by: Wei Fu <fuweid89@gmail.com>

It's to ensure the data integrity during unexpected power failure. Background: Since release 1.3, in Linux system, containerD unpacks and writes files into overlayfs snapshot directly. It doesn’t involve any mount-umount operations so that the performance of pulling image has been improved. As we know, the umount syscall for overlayfs will force kernel to flush all the dirty pages into disk. Without umount syscall, the files’ data relies on kernel’s writeback threads or filesystem's commit setting (for instance, ext4 filesystem). The files in committed snapshot can be loss after unexpected power failure. However, the snapshot has been committed and the metadata also has been fsynced. There is data inconsistency between snapshot metadata and files in that snapshot. We, containerd, received several issues about data loss after unexpected power failure. * containerd#5854 * containerd#3369 (comment) Solution: * Option 1: SyncFs after unpack Linux platform provides [syncfs][syncfs] syscall to synchronize just the filesystem containing a given file. * Option 2: Fsync directories recursively and fsync on regular file The fsync doesn't support symlink/block device/char device files. We need to use fsync the parent directory to ensure that entry is persisted. However, based on [xfstest-dev][xfstest-dev], there is no case to ensure fsync-on-parent can persist the special file's metadata, for example, uid/gid, access mode. Checkout [generic/690][generic/690]: Syncing parent dir can persist symlink. But for f2fs, it needs special mount option. And it doesn't say that uid/gid can be persisted. All the details are behind the implemetation. > NOTE: All the related test cases has `_flakey_drop_and_remount` in [xfstest-dev]. Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity], we can't rely on Fsync-on-parent. * Option 1 is winner This patch is using option 1. There is test result based on [test-tool][test-tool]. All the networking traffic created by pull is local. * Image: docker.io/library/golang:1.19.4 (992 MiB) * Current: 5.446738579s * WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056 * Option 1: 6.239686088s * WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056 * Option 2: 1m30.510934813s * WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344 * Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes) * Current: 8.852718042s * WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712 * Option 1: 9.683387174s * WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016 * Option 2: 1m54.302103719s * WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568 The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is option in CRI plugin. [syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html> [xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git> [generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19> [kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/> [test-tool]: <https://github.com/fuweid/go-dmflakey/blob/a17fb2010db22654b3e54cf506b0dbb5ef7b33ca/contrib/syncfs/containerd/main_test.go#L51> Signed-off-by: Wei Fu <fuweid89@gmail.com>

AkihiroSuda · 2024-01-24T13:55:35Z

@fuweid Do you still plan to cherry-pick this?

fuweid · 2024-01-24T15:48:53Z

@AkihiroSuda will do

fuweid force-pushed the v2-mode branch 2 times, most recently from b05e1b4 to 82389b4 Compare November 20, 2023 16:55

e-minguez mentioned this pull request Nov 20, 2023

Error: failed to create containerd container: failed to stat parent: stat /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/214/fs: no such file or directory rancher/rke2#4961

Closed

henry118 reviewed Dec 2, 2023

View reviewed changes

henry118 approved these changes Dec 4, 2023

View reviewed changes

samuelkarp approved these changes Dec 10, 2023

View reviewed changes

samuelkarp enabled auto-merge December 10, 2023 08:15

fuweid mentioned this pull request Dec 10, 2023

[diff/applier plugin] support sync-fs for data persistence #9497

Open

7 tasks

fuweid disabled auto-merge December 12, 2023 02:18

fuweid added 2 commits December 12, 2023 10:18

api: introduce sync_fs to diff.ApplyRequest

bd5c602

It's flag to synchronize the underlying filesystem containing files created during Apply. Signed-off-by: Wei Fu <fuweid89@gmail.com>

fuweid force-pushed the v2-mode branch from 82389b4 to 23278c8 Compare December 12, 2023 02:18

k8s-ci-robot added the size/L label Dec 12, 2023

dmcgowan approved these changes Dec 12, 2023

View reviewed changes

dmcgowan added this pull request to the merge queue Dec 12, 2023

Merged via the queue into containerd:main with commit 1feb234 Dec 12, 2023
45 checks passed

fuweid deleted the v2-mode branch December 12, 2023 15:28

fuweid added cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Dec 12, 2023

pendo324 mentioned this pull request Jan 20, 2024

fix: properly reload persistent snapshotter data and restart services runfinch/finch#767

Merged

1 task

fuweid added the impact/changelog label Jan 21, 2024

dmcgowan changed the title *: introduce image_pull_with_sync_fs in CRI Add option to perform syncfs after pull Jan 29, 2024

fuweid mentioned this pull request Feb 6, 2024

[release/1.7] Add option to perform syncfs after pull #9769

Merged

fuweid added cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Feb 7, 2024

henry118 mentioned this pull request Feb 23, 2024

Set cache dir to volatile dir containernetworking/cni#1066

Closed

thaJeztah mentioned this pull request Mar 12, 2024

vendor: github.com/containerd/containerd v1.7.14 moby/moby#47552

Merged

dmcgowan mentioned this pull request Apr 19, 2024

Prepare release notes for v2.0.0-rc.1 #10098

Merged

dmcgowan added the area/distribution Image Distribution label May 3, 2024

dmcgowan mentioned this pull request May 3, 2024

Prepare release notes for api/v1.8.0-rc.0 #10167

Merged

dmcgowan mentioned this pull request May 17, 2024

Prepare release notes for v2.0.0-rc.2 #10240

Merged

ningmingxiao mentioned this pull request May 29, 2024

support to syncfs after pull by using diff plugin #10284

Open

dmcgowan mentioned this pull request Jun 13, 2024

Prepare v2.0.0-rc.3 release #10331

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to perform syncfs after pull #9401

Add option to perform syncfs after pull #9401

fuweid commented Nov 20, 2023 •

edited

Loading

fuweid commented Nov 21, 2023

zhuangqh commented Nov 25, 2023

fuweid commented Nov 25, 2023

henry118 Dec 2, 2023

fuweid Dec 2, 2023

henry118 left a comment

113xiaoji commented Dec 10, 2023

samuelkarp Dec 10, 2023

samuelkarp Dec 10, 2023

fuweid Dec 10, 2023

fuweid commented Dec 11, 2023

AkihiroSuda commented Jan 24, 2024

fuweid commented Jan 24, 2024

Add option to perform syncfs after pull #9401

Add option to perform syncfs after pull #9401

Conversation

fuweid commented Nov 20, 2023 • edited Loading

Background

Solution

Option 1: SyncFs after unpack

Option 2: Fsync directories recursively and fsync on regular file

Option 1 is winner

fuweid commented Nov 21, 2023

zhuangqh commented Nov 25, 2023

fuweid commented Nov 25, 2023

henry118 Dec 2, 2023

Choose a reason for hiding this comment

fuweid Dec 2, 2023

Choose a reason for hiding this comment

henry118 left a comment

Choose a reason for hiding this comment

113xiaoji commented Dec 10, 2023

samuelkarp Dec 10, 2023

Choose a reason for hiding this comment

samuelkarp Dec 10, 2023

Choose a reason for hiding this comment

fuweid Dec 10, 2023

Choose a reason for hiding this comment

fuweid commented Dec 11, 2023

AkihiroSuda commented Jan 24, 2024

fuweid commented Jan 24, 2024

fuweid commented Nov 20, 2023 •

edited

Loading