dcache-qos: add policy support to scanner (qos rule engine 6)
Motivation:

Implement the rule engine extension to QoS services.

Modification:

This patch modifies the periodic system-wide
scanning of files in two ways:

1.  The `nearline` scan no longer looks at
    all `NEARLINE CUSTODIAL` files, but only
    those for which a QoS policy is defined.
    It is no longer disabled by default.

2.  The `online` scan by default still
    loops through all `ONLINE` files in the
    namespace in their natural order, but
    this form of scan can be switched off
    in favor of a full periodic pool scan
    (à la Resilience).  For the advantages
    and disadvantages of each, see the
    "Pool scan vs Sys scan" section in TheBook.

Most of the rest of this patch is simply
renaming or non-functional code changes.

Result:

Scanner better adapted to the new rule
engine semantics.

Target: master
Patch: https://rb.dcache.org/r/14074
Depends-on: #14073
Acked-by: Tigran
alrossi committed Sep 6, 2023
1 parent 710ee61 commit 0ebfa05
Showing 13 changed files with 381 additions and 245 deletions.
26 changes: 21 additions & 5 deletions docs/TheBook/src/main/markdown/config-qos-engine.md
@@ -663,7 +663,7 @@ restage it before considering it inaccessible.
### Pool scan vs Sys scan

For the scanner component, there are two kinds of scans. The pool scan runs a query
by location (= pool) and verifies each of the files that the namespace indicates is
by location (= pool) and verifies each of the ``ONLINE`` files that the namespace indicates is
resident on that pool. This is generally useful for disk-resident replicas, but
will not be able to detect missing replicas (say, from faulty migration, where the
old pool is no longer in the pool configuration). Nevertheless, a pool scan
@@ -685,10 +685,26 @@ copy, regardless of the current available pools, and will stage it back in if it
SCANNING, QOS vs Resilience
Formerly (in resilience), individual pool scans were both triggered by pool state changes
and were run periodically; in QoS, however, they are only triggered by state changes
(or by an explicit admin command). The sys scans, on the other hand, run periodically
in the background, touching each file in the natural order of their primary key in the
namespace.
and were run periodically; in QoS, they are still triggered by state changes
(or by an explicit admin command), but there is an option as to how to run ONLINE scans
periodically. By enabling 'online' scans (the default), the sys scans will
touch each file in the natural order of their primary key in the namespace.
The advantage to this is avoiding scanning the same file more than once if
it has more than one location. The disadvantage is that files whose locations
are currently offline or have been removed from the dCache configuration will
trigger an alarm. If 'online' is disabled, the old-style pool scan (more properly,
location-based scan) will be triggered instead. This will look at only ONLINE
files on IDLE pools that are ENABLED, but will end up running redundant checks
for files with multiple replicas.
With the advent of the rule engine (9.2), the NEARLINE scan has been limited to
files with a defined QoS policy.
NEARLINE is no longer turned off by default, since it no longer necessarily
encompasses all files on tape, but only those for which a QoS policy is
defined. Of course, if the majority of files in the dCache
instance have a policy, then this scan will again involve a much longer run-time
and thus the window should be adjusted accordingly.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

----------------------
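
The trade-off described in the documentation hunk above can be illustrated with a small sketch. It is not dCache code: `ScanCostSketch`, its methods, and the sample file and pool names are all hypothetical. It only counts how many checks each strategy would issue for a toy namespace, on the assumption that a namespace scan visits each file once and a location-based scan visits each replica on a scannable pool.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in dCache.
final class ScanCostSketch {

    // Namespace-order scan: each file is visited once, however many replicas it has.
    static int namespaceScanChecks(Map<String, List<String>> locationsByFile) {
        return locationsByFile.size();
    }

    // Location-based scan: each scannable pool reports every file it holds,
    // so a file with n replicas on scannable pools is checked n times.
    static int poolScanChecks(Map<String, List<String>> locationsByFile,
                              Set<String> scannablePools) {
        int checks = 0;
        for (List<String> pools : locationsByFile.values()) {
            for (String pool : pools) {
                if (scannablePools.contains(pool)) {
                    ++checks;
                }
            }
        }
        return checks;
    }

    // Files with no replica on any scannable pool: invisible to the pool scan,
    // but still reached by the namespace scan.
    static long filesUnseenByPoolScan(Map<String, List<String>> locationsByFile,
                                      Set<String> scannablePools) {
        return locationsByFile.values().stream()
                .filter(pools -> pools.stream().noneMatch(scannablePools::contains))
                .count();
    }

    public static void main(String[] args) {
        // Made-up sample data: two files with two replicas each, one file
        // whose only location is a pool that is no longer configured.
        Map<String, List<String>> files = Map.of(
                "fileA", List.of("pool1", "pool2"),
                "fileB", List.of("pool1", "pool2"),
                "fileC", List.of("retiredPool"));
        Set<String> scannable = Set.of("pool1", "pool2");

        System.out.println("namespace checks: " + namespaceScanChecks(files));
        System.out.println("pool checks:      " + poolScanChecks(files, scannable));
        System.out.println("unseen by pools:  " + filesUnseenByPoolScan(files, scannable));
    }
}
```

With this sample data the namespace scan issues 3 checks, the pool scan issues 4 (each two-replica file is seen twice), and one file is never seen by the pool scan. That one file is the kind of entry for which, per the text above, the namespace scan would raise an alarm.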
@@ -771,7 +771,7 @@ protected String doCall() throws Exception {
@Command(name = "sys cancel",
hint = "cancel background scan operations",
description = "Cancels operations matching the options; either single operation, all "
+ "online or nearline; notifies the verifier.")
+ "online or qos; notifies the verifier.")
class SysCancelCommand extends InitializerAwareCommand {

SysCancelCommand() {
@@ -782,9 +782,9 @@ class SysCancelCommand extends InitializerAwareCommand {
usage = "Cancel the sub-operation matching this uuid.")
String id;

@Option(name = "nearline",
usage = "Cancel all nearline sub-operations.")
boolean nearline = false;
@Option(name = "qos",
usage = "Cancel all qosNearline sub-operations.")
boolean qos = false;

@Option(name = "online",
usage = "Cancel all online sub-operations.")
@@ -795,7 +795,7 @@ protected String doCall() {
if (id != null) {
systemOperationMap.cancelSystemScan(id);
} else {
if (nearline) {
if (qos) {
systemOperationMap.cancelAll(true);
}
if (online) {
@@ -836,31 +836,30 @@ protected String doCall() {
hint = "control the periodic check for system scans",
description =
"Resets the properties governing system scanning, like the periodic interval,"
+ " whether nearline is enabled, batch size, and the number of concurrent operations allowed.")
+ " whether online is enabled, batch size, and the number of concurrent operations allowed.")
class SysControlCommand extends InitializerAwareCommand {

@Option(name = "enable-nearline",
usage = "Turn on or off NEARLINE (CUSTODIAL) system scanning.")
Boolean enableNearline;
@Option(name = "enable-online",
usage = "Turn on or off direct namespace ONLINE system scanning"
+ " (if false, scans of IDLE ENABLED pools are used).")
Boolean enableOnline;

@Option(name = "nearline-window",
usage = "(one of nearline-window|online-window). "
+ "Amount of time which must pass since the last full system scan for it to be run again.")
Integer nearlineWindow;
@Option(name = "qos-nearline-window",
usage = "Amount of time which must pass since the last full system scan for it to be run again.")
Integer qosNearlineWindow;

@Option(name = "online-window",
usage = "(one of nearline-window|online-window). "
+ "Amount of time which must pass since the last system scan (online files only) for "
usage = "Amount of time which must pass since the last system scan (online files only) for "
+ "it to be run again.")
Integer onlineWindow;

@Option(name = "online-batch-size",
usage = "Maximum number of pnsfids to send to the verifier at a time.")
Integer onlineBatch;

@Option(name = "nearline-batch-size",
@Option(name = "qos-nearline-batch-size",
usage = "Maximum number of pnsfids to send to the verifier at a time.")
Integer nearlineBatch;
Integer qosNearlineBatch;

@Option(name = "max-operations",
usage = "Maximum number of concurrent operations permitted; note that this number "
@@ -878,12 +877,12 @@ class SysControlCommand extends InitializerAwareCommand {

@Override
protected String doCall() {
if (enableNearline != null) {
systemOperationMap.setNearlineRescanEnabled(enableNearline);
if (enableOnline != null) {
systemOperationMap.setOnlineScanEnabled(enableOnline);
}

if (nearlineBatch != null) {
systemOperationMap.setNearlineBatchSize(nearlineBatch);
if (qosNearlineBatch != null) {
systemOperationMap.setQosNearlineBatchSize(qosNearlineBatch);
}

if (onlineBatch != null) {
@@ -894,10 +893,10 @@ protected String doCall() {
systemOperationMap.setMaxConcurrentRunning(maxOperations);
}

if (nearlineWindow != null) {
systemOperationMap.setNearlineRescanWindow(nearlineWindow);
if (qosNearlineWindow != null) {
systemOperationMap.setQosNearlineRescanWindow(qosNearlineWindow);
if (unit != null) {
systemOperationMap.setNearlineRescanWindowUnit(unit);
systemOperationMap.setQosNearlineRescanWindowUnit(unit);
}
} else if (onlineWindow != null) {
systemOperationMap.setOnlineRescanWindow(onlineWindow);
@@ -913,38 +912,38 @@ protected String doCall() {

@Command(name = "sys scan",
hint = "initiate an ad hoc background scan.",
description =
"If this is nearline, it will bypass the enable flag; however, if a scan of "
+ "the requested type is already running, it will not be automatically canceled.")
description = "If a scan of the requested type is already running, "
+ "it will not be automatically canceled.")
class SysScanCommand extends InitializerAwareCommand {

SysScanCommand() {
super(initializer);
}

@Option(name = "nearline",
usage = "Scan nearline custodial files with cached replicas. "
+ "For most deployments, NEARLINE will be the more costly scan and could run for many "
+ "days, depending on the size of the namespace.")
boolean nearline = false;
@Option(name = "qos",
usage = "Scan NEARLINE files for which a QoS policy has been defined.")
boolean qos = false;

@Option(name = "online",
usage = "Scan online files (both REPLICA and CUSTODIAL). "
+ "Equivalent to scanning all pools for persistent files.")
usage = "Scan online files (both REPLICA and CUSTODIAL). Depending on whether "
+ "online is enabled (true by default), it will either scan the namespace "
+ "entries or will trigger a scan of all IDLE ENABLED pools. Setting this scan "
+ "to be run periodically should take into account the size of the namespace or "
+ "the number of pools, and the proportion of ONLINE files they contain.")
boolean online = false;

@Override
protected String doCall() {
try {
if (nearline) {
if (qos) {
systemOperationMap.startScan(true);
}

if (online) {
systemOperationMap.startScan(false);
}

if (!nearline && !online) {
if (!qos && !online) {
return "No scan started; must be -online, -qos or both.";
}
} catch (PermissionDeniedCacheException e) {
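
One detail of the `sys control` hunk above is worth spelling out: the two window options are handled by an `if` / `else if`, so, as far as the visible lines show, a single invocation applies at most one of them, even though the updated usage strings no longer say "one of". The following is a minimal sketch of that control flow only. `WindowSettingsSketch` and its fields are hypothetical stand-ins, not the dCache `systemOperationMap` API, and the unit handling is left out.

```java
// Hypothetical stand-in for the window handling shown in the hunk; not dCache code.
final class WindowSettingsSketch {
    Integer qosNearlineWindow; // null until a value has been applied
    Integer onlineWindow;      // null until a value has been applied

    // Same if / else-if shape as the hunk: when both options arrive in one call,
    // only the qos-nearline window is applied and the online window is skipped.
    void apply(Integer qosNearlineOption, Integer onlineOption) {
        if (qosNearlineOption != null) {
            qosNearlineWindow = qosNearlineOption;
        } else if (onlineOption != null) {
            onlineWindow = onlineOption;
        }
    }

    public static void main(String[] args) {
        WindowSettingsSketch settings = new WindowSettingsSketch();

        // Both options supplied: only the qos-nearline window changes.
        settings.apply(48, 24);
        System.out.println(settings.qosNearlineWindow + " " + settings.onlineWindow);

        // Online option alone: now the online window is applied.
        settings.apply(null, 24);
        System.out.println(settings.qosNearlineWindow + " " + settings.onlineWindow);
    }
}
```

Under this reading, changing both windows takes two separate `sys control` invocations.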
@@ -83,6 +83,10 @@ public synchronized void incrementCount() {
++count;
}

public synchronized void incrementCount(long count) {
this.count += count;
}

public synchronized boolean isCancelled() {
return canceled;
}
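
The new `incrementCount(long)` overload can be pictured with a minimal sketch of a synchronized counter. `ScanCounterSketch` and `getCount` are hypothetical names, and the field type `long` is an assumption since the hunk does not show the declaration. The motivation suggested here, recording a whole batch of scanned files in one call, is likewise a guess rather than something the commit states.

```java
// Hypothetical counter modelled on the two incrementCount overloads; not the dCache class.
final class ScanCounterSketch {
    private long count; // assumed type; the hunk does not show the field declaration

    // Single-step increment, as in the pre-existing method.
    synchronized void incrementCount() {
        ++count;
    }

    // Bulk increment, as in the method added by the hunk: adds a whole batch
    // under a single lock acquisition instead of one call per file.
    synchronized void incrementCount(long delta) {
        this.count += delta;
    }

    synchronized long getCount() {
        return count;
    }

    public static void main(String[] args) {
        ScanCounterSketch counter = new ScanCounterSketch();
        counter.incrementCount();     // one file
        counter.incrementCount(200L); // a batch of 200 files
        System.out.println(counter.getCount());
    }
}
```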