resilience: do simple existence check of replica on pool to avoid dark removes #4748
Motivation:
A recent experience with data loss from removes during
(more than likely) multiple concurrent "pool up" scans
suggested that relying on the namespace for replica
locations is inherently subject to a race between
resilience and the pool clear cache location message.
Testing has confirmed this.
See issue #4742
Modification:
Add a SpreadAndWait of the PoolCheckFileMessage to
each of the readable pool locations and return the
pools on which a replica for that pnfsid is found.
This is done for both copy and remove.
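
The check described above can be sketched roughly as
follows. This is a minimal illustration, not the dCache
implementation: ReplicaVerifier, verifiedLocations and
the poolHasFile callback are hypothetical stand-ins for
SpreadAndWait and PoolCheckFileMessage, and treating a
timeout or error as "no replica" is an assumption made
here, not something stated in this patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.BiPredicate;

// Hypothetical stand-in for the spread-and-wait existence check; not dCache code.
final class ReplicaVerifier {
    private ReplicaVerifier() {}

    /**
     * Asks every readable pool, concurrently, whether it holds a replica of
     * the given pnfsid, and returns only the pools that confirm it.
     * poolHasFile.test(pool, pnfsid) plays the role of the pool's reply.
     */
    static List<String> verifiedLocations(String pnfsid,
                                          List<String> readablePools,
                                          BiPredicate<String, String> poolHasFile,
                                          long timeoutMillis) {
        // Spread: send one query per pool without waiting for the others.
        List<CompletableFuture<Boolean>> replies = new ArrayList<>();
        for (String pool : readablePools) {
            replies.add(CompletableFuture
                    .supplyAsync(() -> poolHasFile.test(pool, pnfsid))
                    // Assumption: a pool that does not answer in time is not counted.
                    .completeOnTimeout(false, timeoutMillis, TimeUnit.MILLISECONDS)
                    // Assumption: a failed query is likewise not counted.
                    .exceptionally(t -> false));
        }

        // Wait: collect the replies in the original pool order.
        List<String> confirmed = new ArrayList<>();
        for (int i = 0; i < readablePools.size(); i++) {
            if (replies.get(i).join()) {
                confirmed.add(readablePools.get(i));
            }
        }
        return confirmed;
    }
}
```

In this sketch, a namespace location that the pool does
not confirm is simply dropped from the list, so both the
copy and the remove decision see only verified replicas.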
When the lag or inconsistency is detected, an alarm
at the warning level is issued at most once every
15 minutes, with new alarms keyed to an hourly timestamp.
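
One way to picture the alarm throttling is the sketch
below. It is illustrative only: InconsistencyAlarm,
maybeAlarm, keyFor and the key prefix are hypothetical
names, and the sketch assumes that alarms sharing a key
are treated as the same alarm, so a new hour yields a
new one.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Optional;

// Hypothetical throttle for the inconsistency alarm; not dCache code.
final class InconsistencyAlarm {
    static final long MIN_INTERVAL_MILLIS = 15 * 60 * 1000L;

    // Time of the last alarm actually sent; null until the first one.
    private Long lastSentMillis;

    /**
     * Returns the alarm key if an alarm may be sent at nowMillis,
     * or empty if one was already sent less than 15 minutes ago.
     */
    synchronized Optional<String> maybeAlarm(long nowMillis) {
        if (lastSentMillis != null
                && nowMillis - lastSentMillis < MIN_INTERVAL_MILLIS) {
            return Optional.empty();
        }
        lastSentMillis = nowMillis;
        return Optional.of(keyFor(nowMillis));
    }

    /** Builds a key from the timestamp truncated to the hour (UTC). */
    static String keyFor(long nowMillis) {
        Instant hour = Instant.ofEpochMilli(nowMillis).truncatedTo(ChronoUnit.HOURS);
        return "resilience-inconsistency-" + hour;
    }
}
```

Under these assumptions, at most four alarms per hour
are sent, and all of them carry the same key.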
Missing files (no replicas and no tape) continue
to be logged to the activity logger (.resilience file),
just as with inaccessible files, whenever they
are detected.
A small correctness adjustment has been made to
the method which determines the operation type
(in order to handle "excluded" pools correctly).
The JUnit tests have been modified to simulate
the new messaging (always returning true
for file existence, in order to maintain
consistency with existing tests).
Result:
Dark removes which can result in the
removal of all replicas for a given file
are prevented.
Target: master
Request: 5.0
Request: 4.2
Request: 4.1
Request: 4.0
Requires-notes: yes
Requires-book: no
Acked-by: Tigran