Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of consistency checks #1569

Merged
merged 1 commit into from
Apr 17, 2023
Merged

Conversation

nalind
Copy link
Member

@nalind nalind commented Apr 13, 2023

Add initial Check() and Repair() methods to Stores.

Check() checks for inconsistencies between the layers which the lower-level storage driver claims to know about and the ones which we know we're managing. It checks that layers referenced by layers, images, and containers are known to us and that images referenced by containers are known to us. It checks that data which we store alongside layers, images, and containers is still present, and to the extent which we store other information about that data (frequenly just the size of the data), verifies that it matches recorded expectations. Lastly, it checks that layers which are part of images (and which we therefore know what they should have in them) have the expected content, and nothing else.

Repair() removes any containers, images, and layers which have any errors associated with them. This is destructive, so its use should be considered and deliberate.

Add initial Check() and Repair() methods to Stores.

Check() checks for inconsistencies between the layers which the
lower-level storage driver claims to know about and the ones which we
know we're managing.  It checks that layers referenced by layers,
images, and containers are known to us and that images referenced by
containers are known to us.  It checks that data which we store
alongside layers, images, and containers is still present, and to the
extent which we store other information about that data (frequenly just
the size of the data), verifies that it matches recorded expectations.
Lastly, it checks that layers which are part of images (and which we
therefore know what they should have in them) have the expected content,
and nothing else.

Repair() removes any containers, images, and layers which have any
errors associated with them.  This is destructive, so its use should be
considered and deliberate.

Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
@nalind nalind changed the title [WIP] Initial implementation of consistency checks Initial implementation of consistency checks Apr 13, 2023
@rhatdan
Copy link
Member

rhatdan commented Apr 16, 2023

Nice work.
LGTM
@giuseppe @vrothberg @saschagrunert @flouthoc @umohnani8 @mtrmac PTAL

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@giuseppe giuseppe merged commit a9ace5f into containers:main Apr 17, 2023
18 checks passed
@nalind nalind deleted the check branch April 17, 2023 13:11
Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fairly brief single-pass skim; please treat it mostly as unbaked low-priority suggestions.


// ErrLayerUnaccounted describes a layer that is present in the lower-level storage driver,
// but which is not known to or managed by the higher-level driver-agnostic logic.
ErrLayerUnaccounted = errors.New("layer in lower level storage driver not accounted for")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many of these cases, I think it would be useful to use an error type (even if that type is currently an empty struct); that would allow later adding more information (like which store, or which layer) is involved to the error type as an API.

Once this is an errors.New, adding more fields would be a silent breaking change (users would need to switch from errors.Is to errors.As).

OTOH that’s a fairly generic concern; I can very well accept the argument that no callers are expected to inspect the details of these specific errors in code anyway.

// It returns the return value of fn, or its own error initializing the store.
func (s *store) readContainerStore(fn func() (bool, error)) (bool, error) {
if err := s.containerStore.startReading(); err != nil {
return true, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern of returning done == true looks a bit out of place here; I think we can do without.

Let me dust off a Go 1.18 generics-based update to these helpers, that should make things nicer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +134 to +135
flags.BoolVar(&jsonOutput, []string{"-json", "j"}, jsonOutput, "Prefer JSON output")
flags.StringVar(&maximumUnreferencedLayerAge, []string{"-max", "m"}, "24h", "Maximum allowed age for unreferenced layers")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These flags are not documented; should they be? At least the latter one; -json seems not to be documented for most of the commands.

if err != nil {
return 1, err
}
outputNonJSON := func(report storage.CheckReport) {
Copy link
Collaborator

@mtrmac mtrmac Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I’m not sure what’s the purpose of this nested function — it seems to me it could be just open-coded inside that if below, or an external function. But this works just fine.)

return 1, err
}
outputNonJSON := func(report storage.CheckReport) {
for id, errs := range report.Layers {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all the report entries have this regular structure, a helper vaguely like reportPerItemErrors(os.Stdout, "layer", report.Layers) would make things shorter.

// If the driver can tell us about which layers it knows about, we should have previously
// examined all of them. Any that we didn't are probably just wasted space.
// Note: if the driver doesn't support enumerating layers, it returns ErrNotSupported.
if err := s.startUsingGraphDriver(); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, should the graph lock be held for the lifetime of this function, to ensure a consistent view?

Comment on lines +670 to +673
if err != nil && !errors.Is(err, drivers.ErrNotSupported) {
return CheckReport{}, err
}
if !errors.Is(err, drivers.ErrNotSupported) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

switch {
case errors.Is(…):
case err != nil:
default:
}

for id := range report.Layers {
layersToDelete = append(layersToDelete, id)
}
depth := func(id string) int {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is O(N^2) in the worst case; building a map[layerId]depth would allow making this O(N). Probably not worth worrying about now…)

}
return d
}
isUnaccounted := func(errs []error) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be called for every single layer at least once — so building a map[layerID]bool would mean it is called O(N) times instead of O(N log N). times.

if _, ok := deletedLayers[id]; ok {
continue
}
for _, reportedErr := range report.Layers[id] {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could try to remove the same layer multiple times, is that intentional?

I’d expect a single attempt, at the driver or higher level depending on isUnaccounted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants