resilience: do simple existence check of replica on pool to avoid dar… #4748

Merged

Commits on Mar 27, 2019

  1. resilience: do simple existence check of replica on pool to avoid dark removes
    
    Motivation:
    
    A recent experience with data loss from removes during
    (more than likely) multiple concurrent "pool up" scans
    suggested that relying on the namespace for replica
    locations is inherently subject to a race between
    resilience and the pool clear cache location message.
    
    Testing has confirmed this.
    See issue dCache#4742
    
    Modification:
    
    Add a SpreadAndWait of the PoolCheckFileMessage to
    each of the readable pool locations and return the
    pools on which a replica for that pnfsid is found.
    
    This is done for both copy and remove.
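
The spread-and-wait check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `ReplicaVerifier`, `verifiedLocations`, and the `check` callback are hypothetical names standing in for the real messaging (SpreadAndWait with PoolCheckFileMessage), and treating a timeout or error as "replica not confirmed" is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.BiFunction;

// Hypothetical helper; it does not use the dCache messaging API.
final class ReplicaVerifier {
    private ReplicaVerifier() {}

    /**
     * Asks every readable pool whether it holds a replica of pnfsid and
     * returns only the pools that confirm it.
     *
     * @param check hypothetical stand-in for sending a file-check message
     *              to a pool: (pool, pnfsid) -> future of "replica exists"
     */
    static List<String> verifiedLocations(
            String pnfsid,
            Collection<String> readablePools,
            BiFunction<String, String, CompletableFuture<Boolean>> check,
            long timeoutMillis) {
        // Spread: send all queries before waiting on any of them.
        Map<String, CompletableFuture<Boolean>> pending = new LinkedHashMap<>();
        for (String pool : readablePools) {
            pending.put(pool, check.apply(pool, pnfsid)
                    // Assumption: no answer in time counts as "not confirmed".
                    .completeOnTimeout(false, timeoutMillis, TimeUnit.MILLISECONDS)
                    // Assumption: a failed query also counts as "not confirmed".
                    .exceptionally(t -> false));
        }
        // Wait: collect the pools whose reply was positive.
        List<String> confirmed = new ArrayList<>();
        for (Map.Entry<String, CompletableFuture<Boolean>> e : pending.entrySet()) {
            if (Boolean.TRUE.equals(e.getValue().join())) {
                confirmed.add(e.getKey());
            }
        }
        return confirmed;
    }
}
```

The point of the sketch is that the copy and remove decisions are then based on the confirmed list rather than on the locations recorded in the namespace.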
    
    When the lag or inconsistency is detected, an alarm
    at the warning level is issued at most once every
    15 minutes, with new alarms keyed to an hourly timestamp.
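
One way to read the alarm throttling is sketched below. This is an illustration under stated assumptions, not the commit's implementation: the class name, the `tryAlarm` method, and the key prefix are hypothetical, and the sketch assumes that "keyed to hourly timestamp" means the alarm key is derived from the current time truncated to the hour.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Optional;

// Hypothetical throttle; names and key format are assumptions.
final class InconsistencyAlarmThrottle {
    private static final Duration MIN_INTERVAL = Duration.ofMinutes(15);

    private Instant lastSent; // null until the first alarm has been issued

    /**
     * Returns the alarm key if an alarm may be issued at the given time,
     * or empty if the previous alarm was less than 15 minutes ago.
     */
    synchronized Optional<String> tryAlarm(Instant now) {
        if (lastSent != null
                && Duration.between(lastSent, now).compareTo(MIN_INTERVAL) < 0) {
            return Optional.empty(); // suppressed: too soon after the last alarm
        }
        lastSent = now;
        // Truncating to the hour gives all alarms within one hour the same key.
        return Optional.of("resilience-inconsistency-" + now.truncatedTo(ChronoUnit.HOURS));
    }
}
```

Under these assumptions, repeated detections within the same hour share one key, while detections in a later hour produce a new one.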
    
    Missing files (no replicas and no tape) continue
    to be logged to the activity logger (.resilience file),
    just as with inaccessible files, whenever they
    are detected.
    
    A small correctness adjustment has been made to
    the method which determines the operation type
    (in order to handle "excluded" pools correctly).
    
    JUnit testing has been modified to simulate
    the new messaging (always returning true
    for file existence, in order to maintain
    consistency with existing tests).
    
    Result:
    
    Dark removes, which can result in the
    removal of all replicas for a given file,
    are prevented.
    
    Target: master
    Request: 5.0
    Request: 4.2
    Request: 4.1
    Request: 4.0
    Requires-notes: yes
    Requires-book: no
    Acked-by: Tigran
    alrossi committed Mar 27, 2019
    bc1e5a2