
GOBBLIN-1101(DSS-25241): Enhance bulk api retry for ExceedQuota #2942

Closed
wants to merge 9 commits into apache:master from arekusuri:GOBBLIN-1101

Conversation


@arekusuri arekusuri commented Mar 30, 2020

JIRA

https://issues.apache.org/jira/browse/GOBBLIN-1101

Description

  1. ExceedQuota exception
    If an ExceedQuota exception occurs, we should let the thread sleep for 5 minutes and try again. The retryLimit should not apply to this exception (a sketch of the retry shape follows this list).
  2. Exception stack in the log file
    For example, if we set retryLimit to 10, retry 10 times, and fail, we currently print an exception stack that contains all 10 retry exceptions. We'd better skip the retry exceptions and keep only the root-cause exception.
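
For illustration only, here is a minimal sketch of the retry shape described above. This is not the Gobblin code from this PR; the class, method, and exception names (RetrySketch, doWork, QuotaExceededException) are made up, and only the overall flow (quota failures sleep without consuming a retry attempt, other failures keep the first exception as the root cause) mirrors the description.

// Illustrative sketch only, not the code merged in this PR.
public class RetrySketch {
  private static final org.slf4j.Logger log =
      org.slf4j.LoggerFactory.getLogger(RetrySketch.class);

  // Hypothetical stand-in for the Salesforce quota-exceeded failure.
  static class QuotaExceededException extends RuntimeException {
    QuotaExceededException(String message) { super(message); }
  }

  // Hypothetical unit of work, e.g. fetching one batch of bulk API results.
  void doWork() { }

  void executeWithRetry(int retryLimit, long quotaSleepMillis, long retrySleepMillis) {
    Throwable rootCause = null;
    int executeCount = 0;
    while (executeCount < retryLimit + 1) {
      executeCount++;
      try {
        doWork();
        return; // success, stop retrying
      } catch (QuotaExceededException e) {
        log.warn("Caught ExceededQuota: " + e.getMessage());
        sleep(quotaSleepMillis);
        executeCount--; // quota failures do not consume a retry attempt
      } catch (Exception e) {
        if (rootCause == null) {
          rootCause = e; // keep only the first failure as the root cause
        }
        sleep(retrySleepMillis);
      }
    }
    // Retry budget spent: surface only the root cause, not the whole retry chain.
    throw new RuntimeException("Retry limit reached", rootCause);
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      throw new RuntimeException(ie);
    }
  }
}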

Tests

https://ltx1-holdemaz05.grid.linkedin.com:8443/executor?execid=25014730&job=salesforce_task_full&attempt=0

// any new synchronous Apex request results in a runtime exception.
if (e.isCurrentExceptionExceedQuta()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(5 * 60 * 1000); // 5 minutes
Contributor

Please make the timeout configurable. Default can be set to 5 minutes.

Author

This argument is not critical. Because the resource is insufficient, we just need to wait for some duration; 1 minute is OK and 5 minutes is OK too, and if it fails, it keeps waiting until it succeeds.
Users should not have to learn a new concept for this.

Contributor

It is not only about Gobblin ETL jobs. There can be other consumers of the resource that is throttled. Having a configuration key with default value does not make the user experience any worse and allows tuning when required.

if (e.isCurrentExceptionExceedQuta()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(5 * 60 * 1000); // 5 minutes
executeCount --; // if the current exception is Quota Exceeded, keep trying forever
Contributor

Whether to retry forever should be configurable, with the default as false to match the existing behavior.

Author

It is just to avoid a request peak, and in practice it will never actually retry forever. We'd better not make users learn this concept.
It is a resource management strategy.

Contributor

This affects resource utilization and latency, so I think users may want to control that based on their workload and priorities. For example, they may want the ETL job to fail immediately and not retry or to have a longer polling interval to reduce contention with higher priority jobs.

} catch (Exception e) {
// Retry may resolve other exceptions.
rootCause = e;
threadSleep(1 * 60 * 1000); // 1 minute
Contributor

This timeout should also be configurable.

Author
@arekusuri arekusuri Mar 31, 2020

This is also meant to be a somewhat arbitrary duration.
This kind of failure is retried only a limited number of times.
Users wouldn't know what value to set.
The only way to find a better duration is to experiment and measure; however, this is not a critical value, so we don't really have to find the optimal one.

try {
Thread.sleep(millis);
} catch (Exception ee) {
log.warn("--sleep exception--");
Contributor

Add the exception to the log message. You should re-raise InterruptedException, or shutdown of the process may be blocked.

Author
@arekusuri arekusuri Mar 31, 2020

Changed to:

try {
  Thread.sleep(millis);
} catch (Exception e) {
  log.error("--Failed to sleep--", e);
  throw new RuntimeException(e);
}
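
As a side note on the reviewer's point about InterruptedException: a common Java idiom (shown here only as an illustration, not as what this PR merged) also restores the interrupt flag before rethrowing, so that callers further up the stack can still observe the interruption:

try {
  Thread.sleep(millis);
} catch (InterruptedException e) {
  Thread.currentThread().interrupt(); // keep the interrupt status visible to callers
  throw new RuntimeException("Interrupted while sleeping between retries", e);
}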

_isCurrentExceptionExceedQuota = true;
}
}
public boolean isCurrentExceptionExceedQuta() {
Contributor

Quta -> Quota

Author

good catch! thanks! fixed.

@@ -25,7 +25,7 @@ private SalesforceConfigurationKeys() {
}
public static final String SOURCE_QUERYBASED_SALESFORCE_IS_SOFT_DELETES_PULL_DISABLED =
"source.querybased.salesforce.is.soft.deletes.pull.disabled";
- public static final int DEFAULT_FETCH_RETRY_LIMIT = 5;
+ public static final int DEFAULT_FETCH_RETRY_LIMIT = 1;
Contributor

Why change the default here? It may affect existing users.

Author

OK, let's restore it.

if (e.isCurrentExceptionExceedQuta()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(5 * 60 * 1000); // 5 minutes
executeCount --; // if the current exception is Quota Exceeded, keep trying forever
Contributor

This affects resource utilization and latency, so I think users may want to control that based on their workload and priorities. For example, they may want the ETL job to fail immediately and not retry or to have a longer polling interval to reduce contention with higher priority jobs.

Author

arekusuri commented Apr 1, 2020

This affects resource utilization and latency, so I think users may want to control that based on their workload and priorities. For example, they may want the ETL job to fail immediately and not retry or to have a longer polling interval to reduce contention with higher priority jobs.

@htran1 thanks for your comment above.
The sleep in the retry logic is there to avoid a peak in resource consumption. The sleep duration is essentially arbitrary; nobody knows the right number to set (a lot of experiments might help find an optimal value).
It does not depend on anything like table size.
A flow job developer would not be able to figure out what number to set, so exposing the key would only confuse them more.

they may want the ETL job to fail immediately

Very good question. We have the key salesforce.fetchRetryLimit; if it is set to 0, there is no retry. The default retry limit is 5.
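
For reference, a hedged sketch of how that key might be read in Java; the getPropAsInt accessor and the workUnitState variable are assumptions made for this example, not necessarily the exact code in the connector:

// salesforce.fetchRetryLimit = 0 disables retries; the default is DEFAULT_FETCH_RETRY_LIMIT (5).
int retryLimit = workUnitState.getPropAsInt("salesforce.fetchRetryLimit",
    SalesforceConfigurationKeys.DEFAULT_FETCH_RETRY_LIMIT);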

// any new synchronous Apex request results in a runtime exception.
if (e.isCurrentExceptionExceedQuta()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(5 * 60 * 1000); // 5 minutes
Contributor

It is not only about Gobblin ETL jobs. There can be other consumers of the resource that is throttled. Having a configuration key with default value does not make the user experience any worse and allows tuning when required.

Author (@arekusuri)

@htran1 OK, I will add a key for the duration.
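
For illustration, such a key plus its default might look roughly like the sketch below. The key name, the default constant, and the getPropAsLong accessor are assumptions for the example, not necessarily the exact names that were merged; only threadSleep(retryExceedQuotaInterval) appears in the later diff.

// Hypothetical key and default; the merged code may use different names.
public static final String RETRY_EXCEED_QUOTA_INTERVAL = "salesforce.retryExceedQuotaInterval";
public static final long DEFAULT_RETRY_EXCEED_QUOTA_INTERVAL = 5 * 60 * 1000L; // 5 minutes

// Reading the value with a fallback to the default, then using it in the retry loop:
long retryExceedQuotaInterval =
    workUnitState.getPropAsLong(RETRY_EXCEED_QUOTA_INTERVAL, DEFAULT_RETRY_EXCEED_QUOTA_INTERVAL);
threadSleep(retryExceedQuotaInterval);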

@codecov-io

Codecov Report

Merging #2942 into master will decrease coverage by 0.96%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2942      +/-   ##
============================================
- Coverage     45.58%   44.61%   -0.97%     
+ Complexity     9157     8989     -168     
============================================
  Files          1936     1936              
  Lines         73286    73325      +39     
  Branches       8088     8095       +7     
============================================
- Hits          33409    32716     -693     
- Misses        36782    37556     +774     
+ Partials       3095     3053      -42     
Impacted Files Coverage Δ Complexity Δ
.../apache/gobblin/salesforce/BulkResultIterator.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...che/gobblin/salesforce/ResultChainingIterator.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...apache/gobblin/salesforce/SalesforceExtractor.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...gobblin/runtime/mapreduce/GobblinOutputFormat.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-2.00%)
...askStateCollectorServiceHiveRegHandlerFactory.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-2.00%)
...re/filesystem/FsDatasetStateStoreEntryManager.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-3.00%)
...in/runtime/mapreduce/CustomizedProgresserBase.java 0.00% <0.00%> (-83.34%) 0.00% <0.00%> (-1.00%)
...rg/apache/gobblin/runtime/ZkDatasetStateStore.java 0.00% <0.00%> (-80.77%) 0.00% <0.00%> (-7.00%)
...lin/runtime/locks/LegacyJobLockFactoryManager.java 0.00% <0.00%> (-78.58%) 0.00% <0.00%> (-2.00%)
.../apache/gobblin/metastore/ZkStateStoreFactory.java 0.00% <0.00%> (-71.43%) 0.00% <0.00%> (-2.00%)
... and 42 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d54e66b...bdc36d9.

Throwable rootCause = null;
int executeCount = 0;
while (executeCount < retryLimit + 1) {
executeCount ++;
Contributor

Please remove the space before ++.

Author

fixed.

// check if last skipped line is same as the line before error
throw new RuntimeException("Failed to verify last skipped line - retrying for =>", exceptionRetryFor);
String msg = rootCause == null? "null" : rootCause.getMessage();
Contributor

Add space between null and ?.

Author

fixed.

throw new RuntimeException("Failed to [" + e.getMessage() + "] - retrying for => " , exceptionRetryFor);

} catch (Exception currentException) { // failed to open reader and skip lineCount lines // ssl failures go here
Throwable cause = rootCause == null? currentException : rootCause;
Contributor

Add space between null and ?.

Author

fixed.

if (e.isCurrentExceptionExceedQuota()) {
log.warn("--Caught ExceededQuota: " + e.getMessage());
threadSleep(retryExceedQuotaInterval);
executeCount --; // if the current exception is Quota Exceeded, keep trying forever
Contributor

Remove space before --.

Author

Fixed. They are all good catches! Thanks.

Contributor
@htran1 left a comment

+1

@asfgit asfgit closed this in 5722a85 Apr 7, 2020
haojiliu pushed a commit to haojiliu/incubator-gobblin that referenced this pull request Apr 9, 2020
GOBBLIN-1101(DSS-25241): Enhance bulk api retry for ExceedQuota

throw out runtime exception for InteruptedException

fix typo

restore DEFAULT_FETCH_RETRY_LIMIT

fix find root cause exception

trigger build again

add key for sleep duration

trigger again

fix format

Closes apache#2942 from arekusuri/GOBBLIN-1101
arekusuri pushed a commit to arekusuri/incubator-gobblin that referenced this pull request May 18, 2020
GOBBLIN-1101(DSS-25241): Enhance bulk api retry for ExceedQuota

throw out runtime exception for InteruptedException

fix typo

restore DEFAULT_FETCH_RETRY_LIMIT

fix find root cause exception

trigger build again

add key for sleep duration

trigger again

fix format

Closes apache#2942 from arekusuri/GOBBLIN-1101