store: simplify imagestore implementation #1784

giuseppe · 2023-12-21T11:31:14Z

simplify the imagestore implementation and use it as a first-class store. It reduces the amount of code to deal with an image store in the driver (only overlay supports it at the moment).

Closes: #1779

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com

rhatdan · 2023-12-21T15:09:12Z

Can you run a test against podman and Buildah as well?

giuseppe · 2024-01-08T09:23:52Z

this is ready for review. Both Podman and Buildah pass

Podman test PR: containers/podman#21070
Buildah test PR: containers/buildah#5236

rhatdan · 2024-01-08T11:47:55Z

LGTM
@saschagrunert @mtrmac @nalind @vrothberg PTAL

mtrmac

There are parts of this, like removing all the d.imageStore conditions throughout overlay, which I very much like.

On the general principle, I suppose that’s pre-existing, but I just can’t see how this works. All of the read-only stores need to be RO locked, don’t they? And locks are handled in the generic code, drivers can’t do that for themselves. So any kind of iteration over stores in dir2 seems completely broken to me.

So… what’s really going on? Is dir2 only to handle cross-store parent layers during an overlay mount, or something? If that were so, the driver code should not be touching the other stores during any other operation; and still that cross-store parent situation would need to be somehow explicitly tracked and locked.

I’m afraid I can’t do a full review now. Maybe there is some way to split this into much smaller PRs?

mtrmac · 2024-01-08T17:08:51Z

drivers/overlay/overlay.go

+	if d.imageStore != "" {
+		_ = os.RemoveAll(filepath.Join(d.imageStore, stagingDir))
+	}
+	_ = os.RemoveAll(filepath.Join(d.home, stagingDir))


How does this work? stagingDir is here a per-driver directory, but getStagingDir creates per-layer-ID staging directories.

Similarly the EDIT ~~check use of~~ use in ListLayers seems inappropriate now — or quite possibly it needs to be retained for compatibility with older writers, but then it should be explicitly documented as such a non-obvious case.

Stylitically, I’d prefer:

Define a separate globalStagingDir/perLayerStagingDir constants; don’t use a single string for two different purposes!

As I always want, document top-level symbols.

mtrmac · 2024-01-08T17:35:34Z

store.go

+	stores := driver.AdditionalImageStores()
+	if s.imageStoreDir != "" {
+		stores = append([]string{s.graphRoot}, stores...)
+	}


I would greatly appreciate consolidating this data further:

Only ever call AdditionalImageStores() once in load

Only ever have one if s.imageStoreDir != "" condition, somewhere around there.

Together, populate something like s.rwImageLayerStoreDirs / s.roImageLayerStoreDirs. And then have all other graph iterations be trivial loops.

mtrmac · 2024-01-08T17:41:38Z

store.go

 func (s *store) createGraphDriverLocked() (drivers.Driver, error) {
-	driverRoot := s.imageStoreDir
-	imageStoreBase := s.graphRoot
-	if driverRoot == "" {
-		driverRoot = s.graphRoot
-		imageStoreBase = ""
-	}
 	config := drivers.Options{
-		Root:           driverRoot,
-		ImageStore:     imageStoreBase,
+		Root:           s.graphRoot,
+		ImageStore:     s.imageStoreDir,
 		RunRoot:        s.runRoot,


AFAICS this effectively reverses the semantics of the two fields. While I 100% support the resulting simpler code and semantics, I’m not smart enough to review such changes in a GitHub PR without ~re-doing the work.

One possible approach:

Define completely new fields with the new semantics. Set them in the first commit.

In the first and/or subsequent commits, migrate the users to the new fields.

Now that the old fields have no readers, possibly rename the fields back to the original names, with the new semantics,

mtrmac · 2024-01-08T17:47:08Z

store.go

+
+		ro := store != s.imageStoreDir && store != s.graphRoot
+
+		rls, err := s.newLayerStore(rlpath, glpath, s.graphDriver, false, ro)


Why is this introducing the “second read-write layer store” concept? Yes, it mirrors the “second read-write image store” thing (which I found dubious in #1578), but AFAICS the Delete code which uses the “second RW image store” never deletes layers from the second RW layer store.

So either that should be added as well, or (hopefully) we don’t need this complexity.

this is needed because we have this weird specification that a layer must be deleted from the current store if it is found there. We even have a test for it :/

https://github.com/containers/storage/blob/main/tests/split-store.bats#L96-L101

I am fine if we decide to break this specification, but let's do it separately after we verify with CRI-O folks that it is not necessary.

Fine… but I can’t see how that works. layersToRemove is only applied to one rlstore. That uses getLayerStoreLocked, and AFAICS never gets to write to this special member of roLayerStoresUseGetters.

I’m probably being stupid.

layers.go

drivers/vfs/driver.go

mtrmac · 2024-01-08T17:55:39Z

drivers/vfs/driver.go

+	newpath := filepath.Join(homedir, d.String(), "dir", filepath.Base(id))
+	if _, err := os.Stat(newpath); err != nil {
+		additionalHomes := d.homes[1:]
+		if d.imageStore != "" {


Can’t this condition be determined at driver initialization time?

Every single consumer of d.homes is either indexing [0] or [:1], suggesting that the data can be recorded in some more convenient way.

mtrmac · 2024-01-08T18:00:56Z

drivers/vfs/driver.go

+		homedir = d.homes[0]
+	}
+	newpath := filepath.Join(homedir, d.String(), "dir", filepath.Base(id))
+	if _, err := os.Stat(newpath); err != nil {


⚠️ How does this handle any locking? The generic code maintains RW/RO locks on the individual images / layers. This… does nothing?

this is still outstanding.

I don't see how this is different than the previous version, what locking do we need here?

mtrmac · 2024-01-08T18:01:12Z

drivers/overlay/overlay.go

 	}
+
+	newpath := path.Join(homedir, id)
+
 	if _, err := os.Stat(newpath); err != nil {


⚠️ How does this handle any locking? The generic code maintains RW/RO locks on the individual images / layers. This… does nothing?

this is still outstanding.

I don't see how this is different than the previous version, what locking do we need here?

Yes, see the top-level comment #1784 (review) .

One immediate thing that springs to mind is deleting the parent layer while the child is being created/mounted/unmounted. I don’t know that we can fully defend against those kinds of things (we don’t hold layers locked for the whole time a child is mounted, and mount reference counts are, I think, basically per-machine and not visible over network shares).

Or maybe a layer with the same ID (pulled layers have deterministic IDs) being simultaneously created in both stores, making the path returned by d.dir[2]() change over time — e.g. for getStagingDir as an example.

There might well be more, or maybe these are handled by some mechanism I don’t know about.

Assuming the concern is real:

I’d like to see the scope of the problem restricted to the unavoidable minimum (whatever that is); in all situations where the generic code is iterating over getROLayerStoresLocked to find the right layer, and handling locking, the driver should not be iterating over the secondary layer stores again / redundantly without locking. Structure the code, or document it, in a way that is likely to be maintained correctly over time.

In the situations where the problem remains outstanding afterwards and is not fixable short-term, we should loudly document the problem, the situations where it arises, and the constraints, for future maintainers (compare LOCKING BUG in layers.go)

Alternatively, if this is never going to be fixed, and users need to use the feature appropriately (“the image store must never be modified while it is used with outer stores” ?), clearly document that in end-user documentation, and maybe point at it in the code to acknowledge and justify the unsynchronized accesses.

(As another option, a few years ago overlay used to do its own per-layer locking. I have removed that because I thought it’s redundant; actually external users can get a pointer to the driver object, so it is not 100% redundant, and doing that locking here might also be an option. I would prefer not to add that cost to every operation.)

I could see an argument that this is out of scope of this PR … I’m thinking that with this touching dir2 and dir, while the feature is still comparatively new, now is the best chance for fixing this we’ll ever have. Admittedly the timing is not super convenient now, but it will probably never be ideal.

the assumption with additional image stores is that images/layers are not removed. There are cases where we don't grab any lock, e.g. a read-only file system, so that is already the current state.

When we use a read-only layer, the layer could disappear also while a container is running, and there is nothing that prevents the removal now:

# mkdir -p /tmp/store/{rw,ro,empty} # podman --root /tmp/store/rw pull busybox # mount --bind --read-only /tmp/store/rw /tmp/store/ro # podman --root /tmp/store/empty --storage-opt overlay.imagestore=/tmp/store/ro run -d busybox top 5d4baa2643c77deca49ff9b1a0a10df5c2965626b725b8a2225db54706e69980 # podman --root /tmp/store/rw rmi busybox Untagged: docker.io/library/busybox:latest Deleted: 9211bbaa0dbd68fed073065eb9f0a6ed00a75090a9235eca2554c62d1e75c58f

Yeah, the silent deletion of dependent parents seems hard to fix. It’s also not really documented that the users must prevent that, AFAICT.

What about concurrent creations, and other operations?

I've not thought about them, but are they very different than delete though? We would end up with the same layer in multiple stores and use it from the first one where we find it.

We could see e.g. a partial pull of a layer being started on our store, triggering the creation of a staging directory; before it finishes, the other image store creates the same layer; and now the code trying too commit the partially-pulled layer sees the new layer, determines a different staging directory path, and rejects the commit attempt. (I’m not saying that this is the only problem.)

let's address this in a separate PR. I've played with it this morning, but I am not sure yet how this should be done.

Do you've any suggestions?

store.go

mtrmac · 2024-01-10T16:01:16Z

store.go

+
+		ro := store != s.imageStoreDir && store != s.graphRoot
+
+		rls, err := s.newLayerStore(rlpath, glpath, s.graphDriver, false, ro)


Fine… but I can’t see how that works. layersToRemove is only applied to one rlstore. That uses getLayerStoreLocked, and AFAICS never gets to write to this special member of roLayerStoresUseGetters.

I’m probably being stupid.

drivers/vfs/driver.go

giuseppe · 2024-01-25T10:46:53Z

ready for review

giuseppe · 2024-01-30T09:10:05Z

@rhatdan @mtrmac can we please move this forward?

rhatdan · 2024-01-30T13:29:22Z

@nalind PTAL

mtrmac · 2024-01-30T19:22:57Z

Is this related to the Zstd work?

giuseppe · 2024-01-30T20:28:02Z

Is this related to the Zstd work?

no, this is not related to zstd

rhatdan · 2024-02-10T12:03:43Z

@giuseppe @mtrmac is this something we should get in before 5.0?

giuseppe · 2024-02-12T09:05:21Z

I don't think it is too important for Podman 5.0, but it solves an issue that was reported some time ago: #1779

giuseppe · 2024-02-14T12:07:34Z

@kolyshkin @nalind PTAL

rhatdan · 2024-02-14T13:17:29Z

@saschagrunert PTAL

giuseppe · 2024-02-21T08:49:12Z

@haircommander @saschagrunert this PR affects the "split store" functionality in CRI-O

haircommander · 2024-02-21T14:26:32Z

cc @kannon92

giuseppe · 2024-02-28T09:03:23Z

rebased

this is a preparation for the next commit, which will use this feature. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

simplify the imagestore implementation and use it as a first-class store. It reduces the amount of code to deal with an image store in the driver (only overlay supports it at the moment). Closes: containers#1779 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

when an image store is used, lock it as well as the graphroot. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

openshift-ci · 2024-03-01T08:13:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [giuseppe,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rhatdan · 2024-03-01T11:14:32Z

/lgtm

openshift-ci bot added the approved label Dec 21, 2023

This was referenced Dec 21, 2023

overlay: use dir for final diff/merged dirs #1781

Closed

WIP: test github.com/giuseppe/storage refactor-image-store containers/podman#21070

Closed

giuseppe force-pushed the refactor-image-store branch from f7ad9de to a5de3f7 Compare December 21, 2023 12:05

giuseppe marked this pull request as draft December 21, 2023 20:27

openshift-ci bot added the do-not-merge/work-in-progress label Dec 21, 2023

giuseppe force-pushed the refactor-image-store branch 4 times, most recently from f6e38aa to 5385212 Compare December 22, 2023 10:02

rhatdan mentioned this pull request Dec 22, 2023

[WIP] Testing storage pr containers/podman#21058

Closed

giuseppe force-pushed the refactor-image-store branch 2 times, most recently from bc35798 to 814fddc Compare December 23, 2023 09:54

giuseppe marked this pull request as ready for review January 8, 2024 09:23

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 8, 2024

mtrmac reviewed Jan 8, 2024

View reviewed changes

giuseppe force-pushed the refactor-image-store branch 6 times, most recently from f3ffbd6 to 5c00dd3 Compare January 10, 2024 14:43

mtrmac reviewed Jan 10, 2024

View reviewed changes

drivers/vfs/driver.go Outdated Show resolved Hide resolved

drivers/vfs/driver.go Outdated Show resolved Hide resolved

giuseppe force-pushed the refactor-image-store branch 3 times, most recently from 19fc9a6 to be5720d Compare January 15, 2024 12:20

giuseppe marked this pull request as ready for review January 25, 2024 10:43

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 25, 2024

giuseppe force-pushed the refactor-image-store branch from 591d900 to 7d1b3f3 Compare February 20, 2024 21:23

giuseppe force-pushed the refactor-image-store branch from 7d1b3f3 to 2b50bc4 Compare February 28, 2024 08:52

layers: add infra to lock multiple files

f31412a

this is a preparation for the next commit, which will use this feature. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

giuseppe force-pushed the refactor-image-store branch from 2b50bc4 to 2a625d5 Compare February 29, 2024 15:35

giuseppe added 3 commits February 29, 2024 16:35

store: lock both graphroot and the imagestore root

0ab61df

when an image store is used, lock it as well as the graphroot. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

overlay: support imageStore with partial images

d1cc476

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

giuseppe force-pushed the refactor-image-store branch from 2a625d5 to d1cc476 Compare February 29, 2024 15:36

saschagrunert approved these changes Mar 1, 2024

View reviewed changes

openshift-ci bot assigned rhatdan Mar 1, 2024

openshift-ci bot added the lgtm label Mar 1, 2024

openshift-merge-bot bot merged commit 226cffb into containers:main Mar 1, 2024
18 checks passed

kannon92 mentioned this pull request Mar 12, 2024

add patch for imagestore from 1.53 #1859

Closed

kingsleyzissou mentioned this pull request Mar 15, 2024

Podman image mount: Error: layer not known #1864

Closed


		ro := store != s.imageStoreDir && store != s.graphRoot

		rls, err := s.newLayerStore(rlpath, glpath, s.graphDriver, false, ro)

store: simplify imagestore implementation #1784

store: simplify imagestore implementation #1784

Conversation

giuseppe commented Dec 21, 2023 • edited

rhatdan commented Dec 21, 2023 • edited

giuseppe commented Jan 8, 2024

rhatdan commented Jan 8, 2024

mtrmac left a comment

Choose a reason for hiding this comment

mtrmac Jan 8, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mtrmac Jan 11, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

giuseppe commented Jan 25, 2024

giuseppe commented Jan 30, 2024

rhatdan commented Jan 30, 2024

mtrmac commented Jan 30, 2024

giuseppe commented Jan 30, 2024

rhatdan commented Feb 10, 2024

giuseppe commented Feb 12, 2024

giuseppe commented Feb 14, 2024

rhatdan commented Feb 14, 2024

giuseppe commented Feb 21, 2024 • edited

haircommander commented Feb 21, 2024

giuseppe commented Feb 28, 2024

openshift-ci bot commented Mar 1, 2024

rhatdan commented Mar 1, 2024

giuseppe commented Dec 21, 2023 •

edited

rhatdan commented Dec 21, 2023 •

edited

mtrmac Jan 8, 2024 •

edited

mtrmac Jan 11, 2024 •

edited

giuseppe commented Feb 21, 2024 •

edited