Skip to content

Commit

Permalink
srm: Drop generic support for retrying requests
Browse files Browse the repository at this point in the history
Motivation:

The SRM scheduler had support for placing requests in a RETRYWAIT state,
allowing them to be retried after some time. This infrastructure was hardly
used as it would only react to errors generated during execution of Job#run.
Most retriable errors would however be detected in asynchronous callbacks
and for those errors we would simply fail the request.

Further, the errors that were generated by Job#run would most of the time
be classified as transient errors when they were actually persistent errors.
So even in the few cases where the retry would be triggered, it shouldn't
have been.

Possibly the only exception to this rule was the server side srm copy
support as it is at least partially blocking. The ls implementation is
blocking too, but it deliberately does not propagate errors to prevent
retry.

At the same time clients tend to retry or have fallback solutions. Retrying
at several layers means we have a multiplicative effect that causes the
number of retries to explode.

Modification:

In the interest of simplifying the SRM, this patch removes the scheduler's
support for retrying requests.

Result:

Since it either wasn't used, or was triggered erronously, as well as prevented
fail fast behaviour that allows clients to control the recovery strategy,
removing the support seemed like a better option than trying to somehow fix
it.

The relevant configuration options have been marked obsolete (srm.*.retries
and srm.*.retry-timeout*). Except for srmCopy, the patch should not have
many user visible consequences.

Bring online request retrying is not affected by this patch - bring online
requests resubmitted the request internally, bypassing the generic retry
mechanism.

Target: trunk
Require-notes: no
Require-book: no
Acked-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Acked-by: Dmitry Litvintsev <litvinse@fnal.gov>
Patch: https://rb.dcache.org/r/8559/
  • Loading branch information
gbehrmann committed Sep 29, 2015
1 parent e9542f7 commit adbf507
Show file tree
Hide file tree
Showing 21 changed files with 57 additions and 287 deletions.
24 changes: 0 additions & 24 deletions modules/dcache-srm/src/main/resources/diskCacheV111/srm/srm.xml
Original file line number Diff line number Diff line change
Expand Up @@ -486,10 +486,6 @@
<property name="maxRequests" value="${srm.request.get.max-requests}"/>
<property name="maxInprogress" value="${srm.request.get.max-inprogress}"/>
<property name="maxReadyJobs" value="${srm.request.get.max-transfers}"/>
<property name="maxNumberOfRetries" value="${srm.request.get.retries}"/>
<property name="retryTimeout" value="#{T(java.util.concurrent.TimeUnit).MILLISECONDS.convert(
${srm.request.get.retry-timeout},
'${srm.request.get.retry-timeout.unit}')}" />
</bean>


Expand All @@ -504,10 +500,6 @@
<property name="transferStrategyProvider" ref="transfer-strategy-provider"/>
<property name="maxRequests" value="${srm.request.ls.max-requests}"/>
<property name="maxInprogress" value="${srm.request.ls.max-inprogress}"/>
<property name="maxNumberOfRetries" value="${srm.request.ls.retries}"/>
<property name="retryTimeout" value="#{T(java.util.concurrent.TimeUnit).MILLISECONDS.convert(
${srm.request.ls.retry-timeout},
'${srm.request.ls.retry-timeout.unit}')}" />
</bean>


Expand All @@ -522,10 +514,6 @@
<property name="transferStrategyProvider" ref="transfer-strategy-provider"/>
<property name="maxRequests" value="${srm.request.bring-online.max-requests}"/>
<property name="maxInprogress" value="${srm.request.bring-online.max-inprogress}"/>
<property name="maxNumberOfRetries" value="${srm.request.bring-online.retries}"/>
<property name="retryTimeout" value="#{T(java.util.concurrent.TimeUnit).MILLISECONDS.convert(
${srm.request.bring-online.retry-timeout},
'${srm.request.bring-online.retry-timeout.unit}')}" />
</bean>


Expand All @@ -541,10 +529,6 @@
<property name="maxRequests" value="${srm.request.put.max-requests}"/>
<property name="maxInprogress" value="${srm.request.put.max-inprogress}"/>
<property name="maxReadyJobs" value="${srm.request.put.max-transfers}"/>
<property name="maxNumberOfRetries" value="${srm.request.put.retries}"/>
<property name="retryTimeout" value="#{T(java.util.concurrent.TimeUnit).MILLISECONDS.convert(
${srm.request.put.retry-timeout},
'${srm.request.put.retry-timeout.unit}')}" />
</bean>

<bean id="scheduler-copy" class="diskCacheV111.srm.dcache.Scheduler"
Expand All @@ -558,10 +542,6 @@
<property name="transferStrategyProvider" ref="transfer-strategy-provider"/>
<property name="maxRequests" value="${srm.request.copy.max-requests}"/>
<property name="maxInprogress" value="${srm.request.copy.max-inprogress}"/>
<property name="maxNumberOfRetries" value="${srm.request.copy.retries}"/>
<property name="retryTimeout" value="#{T(java.util.concurrent.TimeUnit).MILLISECONDS.convert(
${srm.request.copy.retry-timeout},
'${srm.request.copy.retry-timeout.unit}')}" />
</bean>

<bean id="scheduler-reserve-space" class="diskCacheV111.srm.dcache.Scheduler"
Expand All @@ -575,10 +555,6 @@
<property name="transferStrategyProvider" ref="transfer-strategy-provider"/>
<property name="maxRequests" value="${srm.request.reserve-space.max-requests}"/>
<property name="maxInprogress" value="${srm.request.reserve-space.max-inprogress}"/>
<property name="maxNumberOfRetries" value="${srm.request.reserve-space.retries}"/>
<property name="retryTimeout" value="#{T(java.util.concurrent.TimeUnit).MILLISECONDS.convert(
${srm.request.reserve-space.retry-timeout},
'${srm.request.reserve-space.retry-timeout.unit}')}" />
</bean>


Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -494,7 +494,6 @@ public TReturnStatus getReturnStatus()
switch (getState()) {
case UNSCHEDULED:
case QUEUED:
case RETRYWAIT:
return new TReturnStatus(TStatusCode.SRM_REQUEST_QUEUED, description);
case READY:
case TRANSFERRING:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -640,15 +640,15 @@ public void remoteFileRequestDone(String surl, String requestId,
if (isSourceSrm() && !isSourceLocal()) {
RemoteTurlGetterV2.staticReleaseFile(credential,
surl, requestId,
scheduler.getRetryTimeout(),
scheduler.getMaxNumberOfRetries(),
0,
0,
clientTransport);
} else {
RemoteTurlPutterV2.staticPutDone(credential,
surl, requestId,
scheduler.getRetryTimeout(),
scheduler.getMaxNumberOfRetries(),
clientTransport);
surl, requestId,
0,
0,
clientTransport);
}
} catch (Exception e) {
LOG.error("set remote file status to done failed, surl={}, " +
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -481,7 +481,6 @@ public TReturnStatus getReturnStatus()
switch (getState()) {
case UNSCHEDULED:
case QUEUED:
case RETRYWAIT:
return new TReturnStatus(TStatusCode.SRM_REQUEST_QUEUED, description);
case READY:
case TRANSFERRING:
Expand Down
64 changes: 0 additions & 64 deletions modules/srm-server/src/main/java/org/dcache/srm/request/Job.java
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,6 @@ LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
import java.util.Date;
import java.util.List;
import java.util.Map;
import java.util.TimerTask;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.dcache.srm.SRMAbortedException;
Expand Down Expand Up @@ -135,14 +134,11 @@ public abstract class Job {

protected long lifetime;

protected int numberOfRetries;
private long lastStateTransitionTime = System.currentTimeMillis();

private final List<JobHistory> jobHistory = new ArrayList<>();
private transient JobIdGenerator generator;

private transient TimerTask retryTimer;

private transient boolean savedInFinalState;

protected transient JDC jdc;
Expand Down Expand Up @@ -173,7 +169,6 @@ protected Job(long id, Long nextJobId, long creationTime,
this.state = State.getState(stateId);
this.schedulerId = schedulerId;
this.schedulerTimeStamp = schedulerTimestamp;
this.numberOfRetries = numberOfRetries;
this.lastStateTransitionTime = lastStateTransitionTime;
this.jdc = new JDC();
if(jobHistoryArray != null) {
Expand Down Expand Up @@ -295,10 +290,6 @@ public final void setState(State newState, String description)

jobHistory.add( new JobHistory(nextLong(),newState,description,lastStateTransitionTime));

if(newState == State.RETRYWAIT) {
inclreaseNumberOfRetries();
}

notifySchedulerOfStateChange(oldState, newState);

if (!newState.isFinal() && schedulerId == null) {
Expand Down Expand Up @@ -329,14 +320,9 @@ private boolean isValidTransition(State currentState, State newState)
case INPROGRESS:
return newState == State.CANCELED
|| newState == State.FAILED
|| newState == State.RETRYWAIT
|| newState == State.RQUEUED
|| newState == State.READY
|| newState == State.DONE;
case RETRYWAIT:
return newState == State.CANCELED
|| newState == State.FAILED
|| newState == State.QUEUED;
case RQUEUED:
return newState == State.CANCELED
|| newState == State.FAILED
Expand Down Expand Up @@ -482,42 +468,6 @@ public JobHistory getLastJobChange()
// job's storage (instance of Jon.JobStorage (possibly in a database )
protected abstract void stateChanged(State oldState);



/** Getter for property numberOfRetries.
* @return Value of property numberOfRetries.
*
*/
public final int getNumberOfRetries() {
rlock();
try {
return numberOfRetries;
} finally {
runlock();
}
}

/** Setter for property numberOfRetries.
*
*/
private void inclreaseNumberOfRetries() {
wlock();
try {
numberOfRetries++;
} finally {
wunlock();
}
}

/** Getter for property retry_timer.
* @return Value of property retry_timer.
*
*/
public TimerTask getRetryTimer() {
return retryTimer;
}


public TStatusCode getStatusCode() {
rlock();
try {
Expand Down Expand Up @@ -821,15 +771,6 @@ public long getLastStateTransitionTime(){
}
}

public void cancelRetryTimer()
{
TimerTask task = getRetryTimer();
if (task != null) {
task.cancel();
setRetryTimer(null);
}
}

public static class JobHistory implements Comparable<JobHistory> {
private final long id;
private final State state;
Expand Down Expand Up @@ -927,10 +868,6 @@ public synchronized void setSaved() {

}

public void setRetryTimer(TimerTask retryTimer) {
this.retryTimer = retryTimer;
}

public JDC applyJdc()
{
JDC current = jdc.apply();
Expand Down Expand Up @@ -1044,7 +981,6 @@ public void onSrmRestart(Scheduler scheduler, boolean shouldFailJobs)
// simply queue them now.
case UNSCHEDULED:
case QUEUED:
case RETRYWAIT:
addHistoryEvent("Restored from database.");
scheduler.queue(this);
break;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -535,7 +535,6 @@ public TReturnStatus getReturnStatus()
switch (getState()) {
case UNSCHEDULED:
case QUEUED:
case RETRYWAIT:
return new TReturnStatus(TStatusCode.SRM_REQUEST_QUEUED, description);
case READY:
case TRANSFERRING:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,7 @@ private PreparedStatement getStatement(Connection connection,
gfr.getErrorMessage(),//5
gfr.getSchedulerId(),
gfr.getSchedulerTimeStamp(),
gfr.getNumberOfRetries(),
0, // num of retries
gfr.getLastStateTransitionTime(),
gfr.getRequestId(),//10
gfr.getStatusCodeString(),
Expand Down Expand Up @@ -185,7 +185,7 @@ public PreparedStatement getCreateStatement(Connection connection,
gfr.getErrorMessage(),
gfr.getSchedulerId(),
gfr.getSchedulerTimeStamp(),
gfr.getNumberOfRetries(),
0, // num of retries
gfr.getLastStateTransitionTime(),
//DATABSE FILE REQUEST STORAGE
gfr.getRequestId(),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,7 @@ public PreparedStatement getCreateStatement(Connection connection, Job job) thro
bor.getErrorMessage(),
bor.getSchedulerId(),
bor.getSchedulerTimeStamp(),
bor.getNumberOfRetries(),
0, // num of retries
bor.getLastStateTransitionTime(), // 10
//Database Request Storage
bor.getRetryDeltaTime(),
Expand Down Expand Up @@ -121,7 +121,7 @@ public PreparedStatement getUpdateStatement(Connection connection,
bor.getErrorMessage(),//5
bor.getSchedulerId(),
bor.getSchedulerTimeStamp(),
bor.getNumberOfRetries(),
0, // num of retries
bor.getLastStateTransitionTime(),//10
//Database Request Storage
bor.getRetryDeltaTime(),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -116,7 +116,7 @@ private PreparedStatement getStatement(Connection connection,
request.getErrorMessage(),
request.getSchedulerId(),
request.getSchedulerTimeStamp(),
request.getNumberOfRetries(),
0, // num of retries
request.getLastStateTransitionTime(),
request.getRequestId(), // 10
request.getCredentialId(),
Expand Down Expand Up @@ -201,7 +201,7 @@ public PreparedStatement getCreateStatement(Connection connection,
request.getErrorMessage(),
request.getSchedulerId(),
request.getSchedulerTimeStamp(),
request.getNumberOfRetries(),
0, // num of retries
request.getLastStateTransitionTime(),
request.getRequestId(),
request.getCredentialId(),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ public PreparedStatement getCreateStatement(Connection connection, Job job) thro
cr.getErrorMessage(),
cr.getSchedulerId(),
cr.getSchedulerTimeStamp(),
cr.getNumberOfRetries(),
0, // num of retries
cr.getLastStateTransitionTime(), // 10
//Database Request Storage
cr.getCredentialId(),
Expand Down Expand Up @@ -145,7 +145,7 @@ public PreparedStatement getUpdateStatement(Connection connection,
cr.getErrorMessage(),//5
cr.getSchedulerId(),
cr.getSchedulerTimeStamp(),
cr.getNumberOfRetries(),
0, // num of retries
cr.getLastStateTransitionTime(),
//Database Request Storage
cr.getCredentialId(), // 10
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ private PreparedStatement getStatement(Connection connection,
request.getErrorMessage(),
request.getSchedulerId(),
request.getSchedulerTimeStamp(),
request.getNumberOfRetries(),
0, // num of retries
request.getLastStateTransitionTime(),
request.getRequestId(),
request.getStatusCodeString(),
Expand Down Expand Up @@ -119,7 +119,7 @@ public PreparedStatement getCreateStatement(Connection connection,
request.getErrorMessage(),
request.getSchedulerId(),
request.getSchedulerTimeStamp(),
request.getNumberOfRetries(),
0, // num of retries
request.getLastStateTransitionTime(),
request.getRequestId(),
request.getStatusCodeString(),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ public PreparedStatement getCreateStatement(Connection connection, Job job) thro
gr.getErrorMessage(),
gr.getSchedulerId(),
gr.getSchedulerTimeStamp(),
gr.getNumberOfRetries(),
0, // num of retries
gr.getLastStateTransitionTime(), // 10
//Database Request Storage
gr.getRetryDeltaTime(),
Expand Down Expand Up @@ -115,7 +115,7 @@ public PreparedStatement getUpdateStatement(Connection connection,
gr.getErrorMessage(),//5
gr.getSchedulerId(),
gr.getSchedulerTimeStamp(),
gr.getNumberOfRetries(),
0, // num of retries
gr.getLastStateTransitionTime(),
//Database Request Storage
gr.getRetryDeltaTime(), // 10
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ private PreparedStatement getStatement(Connection connection,
gfr.getErrorMessage(),
gfr.getSchedulerId(),
gfr.getSchedulerTimeStamp(),
gfr.getNumberOfRetries(),
0, // num of retries
gfr.getLastStateTransitionTime(),
gfr.getRequestId(),
gfr.getStatusCodeString(),
Expand Down Expand Up @@ -153,7 +153,7 @@ public PreparedStatement getCreateStatement(Connection connection,
gfr.getErrorMessage(),
gfr.getSchedulerId(),
gfr.getSchedulerTimeStamp(),
gfr.getNumberOfRetries(),
0, // num of retries
gfr.getLastStateTransitionTime(),
gfr.getRequestId(),
gfr.getStatusCodeString(),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ public PreparedStatement getCreateStatement(Connection connection, Job job) thro
lr.getErrorMessage(),
lr.getSchedulerId(),
lr.getSchedulerTimeStamp(),
lr.getNumberOfRetries(),
0, // num of retries
lr.getLastStateTransitionTime(), // 10
//Database Request Storage
lr.getRetryDeltaTime(),
Expand Down Expand Up @@ -114,7 +114,7 @@ public PreparedStatement getUpdateStatement(Connection connection,
lr.getErrorMessage(),//5
lr.getSchedulerId(),
lr.getSchedulerTimeStamp(),
lr.getNumberOfRetries(),
0, // num of retries
lr.getLastStateTransitionTime(),
//Database Request Storage
lr.getRetryDeltaTime(), // 10
Expand Down
Loading

0 comments on commit adbf507

Please sign in to comment.