
resilience: do simple existence check of replica on pool to avoid dark removes #4748

Merged

Conversation

@alrossi (Member) commented Mar 27, 2019

resilience: do simple existence check of replica on pool to avoid dark removes

Motivation:

A recent incident of data loss, caused by removes during
(more than likely) multiple concurrent "pool up" scans,
suggested that relying on the namespace for replica
locations is inherently subject to a race between
resilience and the pool's clear cache location message.

Testing has confirmed this; see issue #4742.

Modification:

Send a PoolCheckFileMessage, via SpreadAndWait, to
each of the readable pool locations and return only
the pools on which a replica of that pnfsid is found.

This is done for both copy and remove.
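
As an illustration of the fan-out described above, here is a minimal,
self-contained Java sketch; the PoolStub interface and verifiedLocations
helper are hypothetical stand-ins, not dCache's actual
SpreadAndWait/PoolCheckFileMessage API.

```java
import java.util.*;
import java.util.concurrent.*;

public class ReplicaExistenceCheck {

    /** Hypothetical per-pool query: does this pool hold a replica of the pnfsid? */
    interface PoolStub {
        String name();
        boolean hasReplica(String pnfsId) throws Exception;
    }

    /**
     * Fan the existence check out to every readable location in parallel
     * ("spread"), wait for all replies, and return only the pools that
     * actually reported holding a replica.
     */
    static Set<String> verifiedLocations(String pnfsId,
                                         Collection<PoolStub> readableLocations,
                                         ExecutorService executor)
            throws InterruptedException {
        Map<String, Future<Boolean>> replies = new HashMap<>();
        for (PoolStub pool : readableLocations) {
            Callable<Boolean> query = () -> pool.hasReplica(pnfsId);
            replies.put(pool.name(), executor.submit(query));
        }
        Set<String> confirmed = new HashSet<>();
        for (Map.Entry<String, Future<Boolean>> reply : replies.entrySet()) {
            try {
                if (reply.getValue().get()) {     // wait for this pool's answer
                    confirmed.add(reply.getKey());
                }
            } catch (ExecutionException e) {
                // A pool that fails to answer is treated as not having the replica.
            }
        }
        return confirmed;
    }
}
```

Only the confirmed set is then used when deciding whether copies or
removes are safe to schedule.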

When such lag or inconsistency is detected, an alarm
at the WARNING level is issued at most once every
15 minutes, with new alarms keyed to an hourly timestamp.
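
A minimal sketch of that rate limiting, assuming a simple in-process
throttle; the class, field, and key names are illustrative and not the
actual resilience alarm code.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.TimeUnit;

public class InconsistencyAlarm {

    private static final long MIN_INTERVAL_MS = TimeUnit.MINUTES.toMillis(15);
    private static final DateTimeFormatter HOUR_KEY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH");

    private long lastSent;   // wall-clock time of the last alarm, in ms

    /** Emit at most one WARNING every 15 minutes; the alarm key changes hourly. */
    synchronized void maybeWarn(String pnfsId) {
        long now = System.currentTimeMillis();
        if (now - lastSent < MIN_INTERVAL_MS) {
            return;                               // suppressed: too soon
        }
        lastSent = now;
        String key = "resilience-inconsistency-"
                + LocalDateTime.now().format(HOUR_KEY);
        // Stand-in for the real alarm call; here we simply log to stderr.
        System.err.println("WARNING [" + key + "]: namespace/pool replica"
                + " inconsistency detected for " + pnfsId);
    }
}
```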

Missing files (no replicas and no tape copy) continue
to be logged to the activity logger (the .resilience file),
just as inaccessible files are, whenever they are detected.

A small correctness adjustment has been made to
the method which determines the operation type
(in order to handle "excluded" pools correctly).

The JUnit tests have been modified to simulate
the new messaging (always returning true for
file existence, in order to maintain consistency
with the existing tests).
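
For illustration only, a test in that spirit against the hypothetical
sketch above (JUnit 4); the real test harness and class names differ.

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import org.junit.Test;

public class ReplicaExistenceCheckTest {

    /** A pool stub that always reports the replica as present. */
    private static ReplicaExistenceCheck.PoolStub alwaysHas(final String name) {
        return new ReplicaExistenceCheck.PoolStub() {
            public String name() { return name; }
            public boolean hasReplica(String pnfsId) { return true; }
        };
    }

    @Test
    public void allReadableLocationsConfirmedWhenEveryPoolAnswersTrue()
            throws Exception {
        Set<String> confirmed = ReplicaExistenceCheck.verifiedLocations(
                "0000ABCD",
                Arrays.asList(alwaysHas("pool-a"), alwaysHas("pool-b")),
                Executors.newFixedThreadPool(2));
        assertEquals(new HashSet<>(Arrays.asList("pool-a", "pool-b")), confirmed);
    }
}
```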

Result:

Dark removes, which can result in the
removal of all replicas of a given file,
are prevented.

Target: master
Request: 5.0
Request: 4.2
Request: 4.1
Request: 4.0
Requires-notes: yes
Requires-book: no
Acked-by: Tigran

@mksahakyan merged commit 1e6d6ec into dCache:4.0 on Mar 28, 2019
@alrossi deleted the fix/4.0/resilience-check-replicas-simple branch on March 28, 2019 15:24