GOBBLIN-1101(DSS-25241): Enhance bulk api retry for ExceedQuota #2942
Conversation
// any new synchronous Apex request results in a runtime exception.
if (e.isCurrentExceptionExceedQuta()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(5 * 60 * 1000); // 5 minutes
Please make the timeout configurable. Default can be set to 5 minutes.
This argument is not critical. Because the resource is insufficient, we need to wait a roughly random duration; 1 minute is OK, 5 minutes is OK too. If it fails, we continue to wait until it succeeds.
Users shouldn't need to learn a new concept for this.
It is not only about Gobblin ETL jobs. There can be other consumers of the resource that is throttled. Having a configuration key with default value does not make the user experience any worse and allows tuning when required.
if (e.isCurrentExceptionExceedQuta()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(5 * 60 * 1000); // 5 minutes
executeCount --; // if the current exception is Quota Exceeded, keep trying forever
Whether to retry forever should be configurable, with the default as false to match the existing behavior.
It is just to avoid a request peak. In practice, it will never actually retry forever. We'd better not make users learn this concept.
It is a resource management strategy.
This affects resource utilization and latency, so I think users may want to control that based on their workload and priorities. For example, they may want the ETL job to fail immediately and not retry or to have a longer polling interval to reduce contention with higher priority jobs.
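A minimal sketch of what the configurable behavior the reviewer asks for could look like: a boolean flag, defaulting to false to match the existing behavior, that controls whether quota failures may retry without bound. The class and field names below are illustrative, not Gobblin's actual API:

```java
// Illustrative sketch, not the PR's code: a retry policy where "retry
// forever on quota errors" is an explicit, configurable choice.
final class QuotaRetryPolicy {
    private final boolean retryQuotaForever; // default false = existing behavior
    private final int retryLimit;

    QuotaRetryPolicy(boolean retryQuotaForever, int retryLimit) {
        this.retryQuotaForever = retryQuotaForever;
        this.retryLimit = retryLimit;
    }

    /** Decide whether another attempt is allowed after a failure. */
    boolean shouldRetry(int executeCount, boolean quotaExceeded) {
        if (quotaExceeded && retryQuotaForever) {
            return true; // quota exhaustion is transient; keep waiting
        }
        return executeCount < retryLimit + 1;
    }
}
```

With the flag off, a quota failure is subject to the same limit as any other failure, which is the "fail fast" option the reviewer describes for high-priority workloads.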
} catch (Exception e) {
// Retry may resolve other exceptions.
rootCause = e;
threadSleep(1 * 60 * 1000); // 1 minute
This timeout should also be configurable.
This is meant to be a roughly random duration as well.
This kind of failure is retried only a limited number of times.
Users wouldn't know what value to set.
The only way to find a better duration is to test and measure an optimal value. However, this is not a critical value; we don't have to find the true optimum.
try {
Thread.sleep(millis);
} catch (Exception ee) {
log.warn("--sleep exception--");
Add the exception to the log message. You should also re-raise the InterruptedException, or shutdown of the process may be blocked.
Changed to:
try {
  Thread.sleep(millis);
} catch (Exception e) {
  log.error("--Failed to sleep--", e);
  throw new RuntimeException(e);
}
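Note that the revised version above still swallows the thread's interrupt status before rethrowing. A sketch of an interrupt-safe variant, following the reviewer's advice; `SleepUtil` is a hypothetical wrapper class, not code from the PR (real code would also log, as above):

```java
// Illustrative sketch: an interrupt-safe sleep helper. Catching
// InterruptedException clears the thread's interrupt flag, so we restore
// it before rethrowing; otherwise shutdown logic polling the flag
// elsewhere in the process may never see the interrupt.
final class SleepUtil {
    private SleepUtil() { }

    static void threadSleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            throw new RuntimeException("--Failed to sleep--", e);
        }
    }
}
```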
_isCurrentExceptionExceedQuota = true;
}
}
public boolean isCurrentExceptionExceedQuta() {
Quta -> Quota
good catch! thanks! fixed.
@@ -25,7 +25,7 @@ private SalesforceConfigurationKeys() {
}
public static final String SOURCE_QUERYBASED_SALESFORCE_IS_SOFT_DELETES_PULL_DISABLED =
"source.querybased.salesforce.is.soft.deletes.pull.disabled";
- public static final int DEFAULT_FETCH_RETRY_LIMIT = 5;
+ public static final int DEFAULT_FETCH_RETRY_LIMIT = 1;
Why change the default here? It may affect existing users.
OK, let's restore it.
if (e.isCurrentExceptionExceedQuta()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(5 * 60 * 1000); // 5 minutes
executeCount --; // if the current exception is Quota Exceeded, keep trying forever
This affects resource utilization and latency, so I think users may want to control that based on their workload and priorities. For example, they may want the ETL job to fail immediately and not retry or to have a longer polling interval to reduce contention with higher priority jobs.
@htran1 thanks for your above comment.
Very good question. We have key
// any new synchronous Apex request results in a runtime exception.
if (e.isCurrentExceptionExceedQuta()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(5 * 60 * 1000); // 5 minutes
It is not only about Gobblin ETL jobs. There can be other consumers of the resource that is throttled. Having a configuration key with default value does not make the user experience any worse and allows tuning when required.
@htran1 OK, I will add a key for the duration.
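A hedged sketch of what such a key could look like, modeled on the existing `SalesforceConfigurationKeys` style shown earlier in the thread. The key names and default values below are illustrative guesses, not the constants that were actually merged:

```java
// Illustrative sketch, not the merged code: configuration keys for the
// two sleep durations discussed in this review, with the current
// hard-coded values (5 minutes and 1 minute) kept as defaults.
final class SalesforceRetryKeys {
    private SalesforceRetryKeys() { }

    // Sleep before retrying after an ExceededQuota failure.
    static final String RETRY_EXCEED_QUOTA_INTERVAL =
        "source.querybased.salesforce.retry.exceedQuotaInterval";
    static final long DEFAULT_RETRY_EXCEED_QUOTA_INTERVAL = 5 * 60 * 1000L;

    // Sleep before retrying after any other retryable failure.
    static final String RETRY_INTERVAL =
        "source.querybased.salesforce.retry.interval";
    static final long DEFAULT_RETRY_INTERVAL = 1 * 60 * 1000L;
}
```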
Codecov Report
@@            Coverage Diff             @@
##             master    #2942    +/-  ##
============================================
- Coverage     45.58%   44.61%   -0.97%
+ Complexity     9157     8989     -168
============================================
  Files          1936     1936
  Lines         73286    73325      +39
  Branches       8088     8095       +7
============================================
- Hits          33409    32716     -693
- Misses        36782    37556     +774
+ Partials       3095     3053      -42

Continue to review full report at Codecov.
Throwable rootCause = null;
int executeCount = 0;
while (executeCount < retryLimit + 1) {
executeCount ++;
Please remove the space before ++.
fixed.
// check if last skipped line is same as the line before error
throw new RuntimeException("Failed to verify last skipped line - retrying for =>", exceptionRetryFor);
String msg = rootCause == null? "null" : rootCause.getMessage();
Add space between null and ?.
fixed.
throw new RuntimeException("Failed to [" + e.getMessage() + "] - retrying for => " , exceptionRetryFor); | ||
|
||
} catch (Exception currentException) { // failed to open reader and skip lineCount lines // ssl failures go here | ||
Throwable cause = rootCause == null? currentException : rootCause; |
Add space between null and ?.
fixed.
if (e.isCurrentExceptionExceedQuota()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(retryExceedQuotaInterval);
executeCount --; // if the current exception is Quota Exceeded, keep trying forever
Remove space before --.
Fixed. These are all good catches! Thanks.
+1
GOBBLIN-1101(DSS-25241): Enhance bulk api retry for ExceedQuota throw out runtime exception for InteruptedException fix typo restore DEFAULT_FETCH_RETRY_LIMIT fix find root cause exception trigger build again add key for sleep duration trigger again fix format Closes apache#2942 from arekusuri/GOBBLIN-1101
JIRA
https://issues.apache.org/jira/browse/GOBBLIN-1101
Description
If an ExceedQuota exception happens, we should let the thread sleep 5 minutes and try again. There should not be a retryLimit for this exception.
For example, if we set retryLimit to 10, retried 10 times, and failed, we would print an exception stack in the log file with 10 nested retry exceptions. We'd better skip the retry exceptions and keep only the root cause exception.
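The retry shape the description outlines, where quota failures do not consume an attempt and only the first non-quota exception is kept as the root cause, can be sketched as follows. `FetchRetrier` and its parameters are illustrative, not the PR's actual code, and the sleep between attempts is omitted here so the sketch stays self-contained:

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative sketch of the retry loop discussed in this PR: quota
// failures decrement executeCount so they never exhaust the limit, and
// only the first other failure is recorded as the root cause so the
// final log does not contain a stack of repeated retry exceptions.
final class FetchRetrier {
    static <T> T fetch(Supplier<T> task, int retryLimit,
                       Predicate<Exception> isQuotaExceeded) {
        Throwable rootCause = null;
        int executeCount = 0;
        while (executeCount < retryLimit + 1) {
            executeCount++;
            try {
                return task.get();
            } catch (RuntimeException e) {
                if (isQuotaExceeded.test(e)) {
                    executeCount--; // quota failures don't use up a retry
                    // (real code sleeps here before the next attempt)
                } else if (rootCause == null) {
                    rootCause = e;  // keep only the first failure as root cause
                }
            }
        }
        throw new RuntimeException("Retry limit exceeded", rootCause);
    }
}
```

On failure, the final exception wraps only the single root cause rather than ten nested retry exceptions.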
Tests
https://ltx1-holdemaz05.grid.linkedin.com:8443/executor?execid=25014730&job=salesforce_task_full&attempt=0