Fix #349 retry on jobInternalError #363
Conversation
Thanks for the PR @SebHeuze. Left some minor comments; could you take a look at those, please? Will take a look at the tests in the next pass.
@@ -153,13 +151,13 @@ private void mergeFlush(
        bigQuery.query(QueryJobConfiguration.of(mergeFlushQuery));
        success = true;
      } catch (BigQueryException e) {
-       if (BigQueryErrorResponses.isCouldNotSerializeAccessError(e)) {
+       if (BigQueryErrorResponses.isCouldNotSerializeAccessError(e) || BigQueryErrorResponses.isJobInternalError(e)) {
Let's use the bigQueryRetry config to decide the retry count for isJobInternalError.
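A minimal sketch of the config-driven bound the reviewer is asking for. This is an illustration only: the `bigQueryRetry` name comes from the review, but the `RetriableQuery` interface and the loop structure here are assumptions, not the connector's actual MergeQueries code.

```java
// Hypothetical sketch: retry loop bounded by the bigQueryRetry config value.
// RetriableQuery is an illustrative name, not a real connector type.
public class RetrySketch {
  public interface RetriableQuery {
    void run(); // throws RuntimeException on a retriable error such as jobInternalError
  }

  // Runs the query, retrying up to bigQueryRetry times; returns the attempt count used.
  public static int runWithRetries(RetriableQuery query, int bigQueryRetry) {
    int attempt = 0;
    while (true) {
      attempt++;
      try {
        query.run();
        return attempt; // success
      } catch (RuntimeException e) {
        if (attempt > bigQueryRetry) {
          throw e; // retries exhausted; surface the last error
        }
        // A real implementation would log a warning and sleep with jitter here.
      }
    }
  }
}
```

With `bigQueryRetry = 3`, an always-failing query is attempted 4 times in total (the initial call plus 3 retries) before the exception propagates.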
      }
-     SleepUtils.waitRandomTime(10000, 20000);
+     waitRandomTime(10000, 20000);
Let's add a warn log to keep the user informed that retries are happening.
      }
-     SleepUtils.waitRandomTime(10000, 20000);
+     waitRandomTime(10000, 20000);
Let's use bigQueryRetryWait for calculating the sleep.
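A sketch of the jittered wait being discussed, where the base sleep would come from the bigQueryRetryWait config rather than a hardcoded value. The helper name and bounds here are assumptions for illustration, not the connector's real SleepUtils signature.

```java
import java.util.Random;

// Hypothetical helper: pick a wait between sleepMs and sleepMs + jitterMs so
// that concurrent tasks do not all retry at the same instant.
public class JitterSketch {
  public static long pickWaitMs(long sleepMs, long jitterMs, Random random) {
    // random.nextDouble() is in [0, 1), so the result is in [sleepMs, sleepMs + jitterMs).
    return sleepMs + (long) (random.nextDouble() * jitterMs);
  }
}
```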
@@ -481,6 +479,10 @@ static String batchClearQuery(TableId intermediateTable, int batchNumber) {
        .toString();
  }

+ public void waitRandomTime(int sleepMs, int jitterMs) throws InterruptedException {
SleepUtils looks like the right place for this. Any reasons for moving it to MergeQueries?
That was because the current sleep call in MergeQueries had hardcoded values and was static, which made it annoying for testing. Since BigQueryWriter has a waitRandomTime method, I added the same thing in MergeQueries so that I could modify the sleep call in tests.
But now that we use config for sleepMs and maxRetry, the testing problem is solved, so I recreated SleepUtils.
Force-pushed from f33158a to caa684d
I edited my PR based on the last reviews. Both retriable exceptions in this class now use bigQueryRetry and bigQueryRetryWait. Does that look good to you, or do you want to keep the hardcoded 10s to 20s sleep and 30 maxRetry that were used until now for SerializeAccessError (added in PR #239)? I also added a test that ensures it throws an exception after X retries.
Thank you @SebHeuze for taking the time to look into the comments.
Left 2 minor comments; could you take a look, please? Would be good to merge once those are resolved. Thanks!
@SebHeuze, would you be able to target these changes at 2.5.x and change the base branch to 2.5.x, please?
Force-pushed from caa684d to 36cea26
Force-pushed from 36cea26 to 1eaba0b
@b-goyal The last 4 comments have been resolved.
Looks good, thanks @SebHeuze
This is a fix for #349 to retry when the error is a jobInternalError (400).
After contacting support: this error should be rare and can happen during system overload.
The documentation mentions: "Retry the job with a new jobId. If the error continues to occur, contact support."
So I reused the existing catch for the retriable serialization error; since this error happens during system overload, it can be relevant to wait a few seconds before retrying.
I am open to any suggestions.
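The reuse described above amounts to widening the retriable-error check. A hedged sketch of that condition follows; the reason strings mirror the method names in the diff, but the real matching lives in BigQueryErrorResponses and may differ.

```java
// Illustrative only: treat both the serialization race and the rare,
// overload-related jobInternalError as retriable, per the BigQuery docs'
// advice to "Retry the job with a new jobId".
public class ErrorCheckSketch {
  public static boolean isRetriable(String reason) {
    return "couldNotSerializeAccess".equals(reason)
        || "jobInternalError".equals(reason);
  }
}
```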