
resilience: do simple existence check of replica on pool to avoid dark removes #4748

Merged

Conversation

@alrossi (Member) commented Mar 27, 2019

resilience: do simple existence check of replica on pool to avoid dark removes

Motivation:

A recent incident of data loss, caused by removes during
(more than likely) multiple concurrent "pool up" scans,
suggested that relying on the namespace for replica
locations is inherently subject to a race between
resilience and the pool's clear cache location message.

Testing has confirmed this; see issue #4742.

Modification:

Send a PoolCheckFileMessage, via SpreadAndWait, to
each of the readable pool locations and return only
the pools on which a replica of that pnfsid is found.

This is done for both copy and remove.
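
As an illustration of the fan-out described above, here is a minimal,
self-contained Java sketch; the PoolStub interface and verifiedLocations
helper are hypothetical stand-ins, not dCache's actual
SpreadAndWait/PoolCheckFileMessage API.

```java
import java.util.*;
import java.util.concurrent.*;

public class ReplicaExistenceCheck {

    /** Hypothetical per-pool query: does this pool hold a replica of the pnfsid? */
    interface PoolStub {
        String name();
        boolean hasReplica(String pnfsId) throws Exception;
    }

    /**
     * Fan the existence check out to every readable location in parallel
     * ("spread"), wait for all replies, and return only the pools that
     * actually reported holding a replica.
     */
    static Set<String> verifiedLocations(String pnfsId,
                                         Collection<PoolStub> readableLocations,
                                         ExecutorService executor)
            throws InterruptedException {
        Map<String, Future<Boolean>> replies = new HashMap<>();
        for (PoolStub pool : readableLocations) {
            Callable<Boolean> query = () -> pool.hasReplica(pnfsId);
            replies.put(pool.name(), executor.submit(query));
        }
        Set<String> confirmed = new HashSet<>();
        for (Map.Entry<String, Future<Boolean>> reply : replies.entrySet()) {
            try {
                if (reply.getValue().get()) {     // wait for this pool's answer
                    confirmed.add(reply.getKey());
                }
            } catch (ExecutionException e) {
                // A pool that fails to answer is treated as not having the replica.
            }
        }
        return confirmed;
    }
}
```

Only the confirmed set is then used when deciding whether copies or
removes are safe to schedule.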

When such lag or inconsistency is detected, an alarm
at the WARNING level is issued at most once every
15 minutes, with new alarms keyed to an hourly timestamp.
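
A minimal sketch of that rate limiting, assuming a simple in-process
throttle; the class, field, and key names are illustrative and not the
actual resilience alarm code.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.TimeUnit;

public class InconsistencyAlarm {

    private static final long MIN_INTERVAL_MS = TimeUnit.MINUTES.toMillis(15);
    private static final DateTimeFormatter HOUR_KEY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH");

    private long lastSent;   // wall-clock time of the last alarm, in ms

    /** Emit at most one WARNING every 15 minutes; the alarm key changes hourly. */
    synchronized void maybeWarn(String pnfsId) {
        long now = System.currentTimeMillis();
        if (now - lastSent < MIN_INTERVAL_MS) {
            return;                               // suppressed: too soon
        }
        lastSent = now;
        String key = "resilience-inconsistency-"
                + LocalDateTime.now().format(HOUR_KEY);
        // Stand-in for the real alarm call; here we simply log to stderr.
        System.err.println("WARNING [" + key + "]: namespace/pool replica"
                + " inconsistency detected for " + pnfsId);
    }
}
```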

Missing files (no replicas and no tape copy) continue
to be logged to the activity logger (the .resilience file),
just as inaccessible files are, whenever they are detected.

A small correctness adjustment has been made to
the method which determines the operation type
(in order to handle "excluded" pools correctly).

The JUnit tests have been modified to simulate
the new messaging (always returning true for
file existence, in order to maintain consistency
with the existing tests).
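
For illustration only, a test in that spirit against the hypothetical
sketch above (JUnit 4); the real test harness and class names differ.

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import org.junit.Test;

public class ReplicaExistenceCheckTest {

    /** A pool stub that always reports the replica as present. */
    private static ReplicaExistenceCheck.PoolStub alwaysHas(final String name) {
        return new ReplicaExistenceCheck.PoolStub() {
            public String name() { return name; }
            public boolean hasReplica(String pnfsId) { return true; }
        };
    }

    @Test
    public void allReadableLocationsConfirmedWhenEveryPoolAnswersTrue()
            throws Exception {
        Set<String> confirmed = ReplicaExistenceCheck.verifiedLocations(
                "0000ABCD",
                Arrays.asList(alwaysHas("pool-a"), alwaysHas("pool-b")),
                Executors.newFixedThreadPool(2));
        assertEquals(new HashSet<>(Arrays.asList("pool-a", "pool-b")), confirmed);
    }
}
```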

Result:

Dark removes, which can result in the
removal of all replicas of a given file,
are prevented.

Target: master
Request: 5.0
Request: 4.2
Request: 4.1
Request: 4.0
Requires-notes: yes
Requires-book: no
Acked-by: Tigran

@mksahakyan merged commit 1e6d6ec into dCache:4.0 on Mar 28, 2019
@alrossi deleted the fix/4.0/resilience-check-replicas-simple branch on March 28, 2019 15:24