replica manager (old): fix countable logic when rescanning pool repository

Motivation:

Users of the old replica manager have reported inconsistent behavior regarding source files which, depending on the dCache configuration, can cause thrashing or unbounded copy retries.

The problem originates in the fact that the replica manager was designed to handle only precious files. Since then, the RetentionPolicy and AccessLatency tags have come to be mapped to the attributes precious, cached and sticky, and the manager was apparently not adjusted to take this change into account.

Inconsistency can thus arise if the resilient pools are linked to storage units which are in turn linked to directories where the RetentionPolicy is not CUSTODIAL [=> precious], because the original/source copy will land on the pool as "cached+sticky", regardless of whether the value of the pool.lfs property is 'precious' (i.e., no-flush) for that pool.

The initial replication proceeds as normal in this case, because the entry method simply takes all files from a resilient pool and marks them as 'countable', the prerequisite for being included in the hard replica count. However, if for any reason (such as a pool state change, reboot of the replica manager, etc.) a rescan of the pool occurs, and the pool contains such originals, the replicas database entry for such a file is now marked countable=false, because the file is cached, not precious (the scan uses a different method from the initial one).

Even though the required number of copies may still exist on readable pools, the replica manager nevertheless thinks one is missing and attempts a p2p copy. This may lead, as in the reported ticket, to repeated failures. In any case, it leads to more work than is needed, since a cached+sticky (i.e., system sticky) file is in no danger of being removed and should therefore count as a hard replica.

Modification:

The method used to process pool repository entries into the replicas table now considers both precious and cached+sticky copies as 'countable'.

Result:

Inconsistent counting of hard replicas is eliminated via a minimal code intervention.

Note 1: Soft replicas (cached but not sticky) are still countable='f', since they are subject to removal by the sweeper.

Note 2: An alternative solution to this issue is simply to inform our users that RetentionPolicy should always be CUSTODIAL for the directories linked to resilient pools. This, however, would still require the site to
(a) propagate the tag change throughout the directory tree where needed;
(b) change the tags for the files in these directories in the namespace;
(c) modify those files to countable='t' in the replicas table.
This could be a complicated and lengthy process for a big installation, especially one like BNL with 6 separate replication managers.

Target: master
RT: 8871 (replication problem: bitmask=258 and countable=false for a file copy in Replication Manager DB)
Request: 2.14
Request: 2.13
Request: 2.12
Request: 2.11
Request: 2.10
Require-notes: yes
Require-book: yes
Acked-by: Dmitry

RELEASE NOTES:
Fixes a bug where source files which are written to directories with REPLICA ONLINE tags are at first marked countable but then, on rescan of the pool, are marked not countable. The fix makes 'cached' + (system-owned) 'sticky' files the equivalent of precious files on pools with pool.lfs='precious'.
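
The countable rule described under Modification can be illustrated with a small standalone sketch. It is not the dCache code: `EntryState` is a hypothetical stand-in for `CacheRepositoryEntryInfo` that carries only the flags the check reads, and `countableBefore`, `countableAfter` and `usable` are names invented for this comparison.

```java
// Sketch only. EntryState is a hypothetical stand-in for
// diskCacheV111.repository.CacheRepositoryEntryInfo; the method names
// below are invented for illustration and do not exist in dCache.
public class CountableSketch {

    /** Holds the repository-entry flags that the countable check inspects. */
    static final class EntryState {
        final boolean precious, cached, sticky;
        final boolean receivingFromClient, receivingFromStore;
        final boolean bad, removed, destroyed;

        /** Convenience constructor for an entry with no transfer or error flags set. */
        EntryState(boolean precious, boolean cached, boolean sticky) {
            this(precious, cached, sticky, false, false, false, false, false);
        }

        EntryState(boolean precious, boolean cached, boolean sticky,
                   boolean receivingFromClient, boolean receivingFromStore,
                   boolean bad, boolean removed, boolean destroyed) {
            this.precious = precious;
            this.cached = cached;
            this.sticky = sticky;
            this.receivingFromClient = receivingFromClient;
            this.receivingFromStore = receivingFromStore;
            this.bad = bad;
            this.removed = removed;
            this.destroyed = destroyed;
        }
    }

    /** Conditions shared by the old and new rule: no transfer in flight, no error state. */
    static boolean usable(EntryState e) {
        return !e.receivingFromClient && !e.receivingFromStore
                && !e.bad && !e.removed && !e.destroyed;
    }

    /** Rule on rescan before the fix: only precious copies count. */
    static boolean countableBefore(EntryState e) {
        return e.precious && usable(e);
    }

    /** Rule after the fix: precious or cached+sticky copies count. */
    static boolean countableAfter(EntryState e) {
        boolean notRemovable = e.precious || (e.cached && e.sticky);
        return notRemovable && usable(e);
    }

    public static void main(String[] args) {
        EntryState precious = new EntryState(true, false, false);
        EntryState cachedSticky = new EntryState(false, true, true);
        EntryState cachedOnly = new EntryState(false, true, false);

        System.out.println("precious:      " + countableBefore(precious) + " -> " + countableAfter(precious));
        System.out.println("cached+sticky: " + countableBefore(cachedSticky) + " -> " + countableAfter(cachedSticky));
        System.out.println("cached only:   " + countableBefore(cachedOnly) + " -> " + countableAfter(cachedOnly));
    }
}
```

Only the cached+sticky case changes between the two rules. Precious copies are countable under both, and soft replicas (cached but not sticky) remain not countable, as Note 1 states.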
dCache · Jan 20, 2016 · commit 4106cb3 · 1 parent e44543c
Showing 1 changed file with 4 additions and 6 deletions.
diff --git a/modules/dcache/src/main/java/diskCacheV111/replicaManager/ReplicaDbV1.java b/modules/dcache/src/main/java/diskCacheV111/replicaManager/ReplicaDbV1.java
@@ -13,14 +13,13 @@
 import java.sql.SQLException;
 import java.sql.Statement;
 import java.text.MessageFormat;
+import java.util.Collections;
 import java.util.Iterator;
 import java.util.List;
 
 import diskCacheV111.repository.CacheRepositoryEntryInfo;
 import diskCacheV111.util.PnfsId;
-
 import dmg.cells.nucleus.CellAdapter;
-import java.util.Collections;
 
 import static org.dcache.commons.util.SqlHelper.tryToClose;
 
@@ -143,16 +142,15 @@ public synchronized void addPnfsToPool(List<CacheRepositoryEntryInfo> fileList,
                                                                     // table
                 String pnfsId = info.getPnfsId().toString();
                 int bitmask = info.getBitMask();
+                boolean notRemovable = info.isPrecious() ||
+                                (info.isCached() && info.isSticky());
                 boolean countable =
-                        info.isPrecious() &&
-//                        info.isCached() &&
+                        notRemovable &&
                         !info.isReceivingFromClient() &&
                         !info.isReceivingFromStore() &&
-//                        info.isSendingToStore() &&
                         !info.isBad() &&
                         !info.isRemoved() &&
                         !info.isDestroyed();
-//                        info.isSticky();
 
                 try {
                     pstmt.setString(1, pnfsId);