
Fix #349 retry on jobInternalError #363

Merged
merged 1 commit into confluentinc:2.5.x on Dec 18, 2023

Conversation

SebHeuze

This is a fix for #349: retry when the error is a jobInternalError (400).

According to support, this error should be rare and can happen during system overload.
The documentation says: "Retry the job with a new jobId. If the error continues to occur, contact support."

So I reused the existing catch for the retriable serialization error; since this error happens during system overload, it can be relevant to wait a few seconds before retrying.

I am open to any suggestions.
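The retry described above can be sketched roughly as follows. This is a hypothetical, simplified sketch: `runWithRetries`, the stand-in `isJobInternalError` check, and the use of plain `RuntimeException` are illustrative, not the connector's actual `mergeFlush` / `BigQueryErrorResponses` code, and the real PR sleeps `waitRandomTime(10000, 20000)` between attempts.

```java
import java.util.Random;

public class RetrySketch {
    // Illustrative stand-in for BigQueryErrorResponses.isJobInternalError(e);
    // the real check inspects the BigQueryException's reason and message.
    static boolean isJobInternalError(RuntimeException e) {
        return "jobInternalError".equals(e.getMessage());
    }

    // Re-run the query while the error stays retriable, waiting a random
    // interval between attempts, and give up after maxAttempts.
    static int runWithRetries(Runnable query, int maxAttempts) throws InterruptedException {
        Random random = new Random();
        for (int attempt = 1; ; attempt++) {
            try {
                query.run();
                return attempt; // success: report how many attempts it took
            } catch (RuntimeException e) {
                if (!isJobInternalError(e) || attempt >= maxAttempts) {
                    throw e; // non-retriable error, or retries exhausted
                }
                // The PR waits a random 10-20 s here; kept tiny for the sketch.
                Thread.sleep(1 + random.nextInt(5));
            }
        }
    }
}
```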

@SebHeuze SebHeuze requested a review from a team as a code owner October 31, 2023 00:41
@cla-assistant

cla-assistant bot commented Oct 31, 2023

CLA assistant check
All committers have signed the CLA.

Member

@b-goyal b-goyal left a comment

Thanks for the PR @SebHeuze. Left some minor comments; could you take a look at those please? Will take a look at the tests in the next pass.

@@ -153,13 +151,13 @@ private void mergeFlush(
        bigQuery.query(QueryJobConfiguration.of(mergeFlushQuery));
        success = true;
      } catch (BigQueryException e) {
-       if (BigQueryErrorResponses.isCouldNotSerializeAccessError(e)) {
+       if (BigQueryErrorResponses.isCouldNotSerializeAccessError(e) || BigQueryErrorResponses.isJobInternalError(e)) {
Member

Let's use the bigQueryRetry config to decide the retry count for isJobInternalError.

      }
-     SleepUtils.waitRandomTime(10000, 20000);
+     waitRandomTime(10000, 20000);
Member

Let's add a warn log to keep the user informed that retries are happening.

      }
-     SleepUtils.waitRandomTime(10000, 20000);
+     waitRandomTime(10000, 20000);
Member

Use bigQueryRetryWait for calculating the sleep.
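The suggestion could look like this. A sketch only: `retryWaitMs` and its parameter names are assumptions, not the connector's actual API — the idea is simply a base wait taken from the bigQueryRetryWait config plus a bounded random jitter.

```java
import java.util.Random;

public class RetryWaitSketch {
    // Compute the sleep before a retry: the configured bigQueryRetryWait
    // base plus a random jitter in [0, maxJitterMs]. Names are illustrative.
    static long retryWaitMs(long bigQueryRetryWait, int maxJitterMs, Random rng) {
        return bigQueryRetryWait + rng.nextInt(maxJitterMs + 1);
    }
}
```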

@@ -481,6 +479,10 @@ static String batchClearQuery(TableId intermediateTable, int batchNumber) {
        .toString();
  }

+ public void waitRandomTime(int sleepMs, int jitterMs) throws InterruptedException {
Member

SleepUtils looks like the right place for this. Any reason for moving it to MergeQueries?

Author

@SebHeuze SebHeuze Oct 31, 2023

That was because the existing sleep call in MergeQueries had hardcoded values and was static, which made it annoying to test. Since BigqueryWriter has a waitRandomTime method, I added the same thing in MergeQueries so I could modify the sleep call in tests.
But now that we use config values for sleepMs and maxRetry, the testing problem is solved, so I recreated SleepUtils.
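The testability concern described here (a static sleep call being hard to stub) is commonly solved by injecting the sleep behind a small interface. A hypothetical sketch, not the connector's actual code — `Sleeper` and `RecordingSleeper` are illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

public class SleeperSketch {
    // Abstraction over Thread.sleep so tests can substitute a fake.
    interface Sleeper {
        void sleep(long ms) throws InterruptedException;
    }

    // Production implementation delegates to the real Thread.sleep.
    static final Sleeper REAL = Thread::sleep;

    // Test double that records requested sleeps instead of blocking.
    static class RecordingSleeper implements Sleeper {
        final List<Long> waits = new ArrayList<>();
        public void sleep(long ms) { waits.add(ms); }
    }
}
```

With this shape, a test can assert how long the retry loop intended to wait without actually sleeping 10-20 seconds per attempt.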

@SebHeuze SebHeuze force-pushed the fix-retry-job-internal-error branch 2 times, most recently from f33158a to caa684d Compare October 31, 2023 12:04
@SebHeuze
Author

SebHeuze commented Oct 31, 2023

I edited my PR based on the latest reviews.
The class now uses a configurable retry max count and retryWait. The max jitter has been set to 1000 (based on what other classes use) instead of 10000 (I don't think it makes sense to keep a 10000 jitter when the default bigQueryRetryWait is 1000).

Both retriable exceptions in this class now use bigQueryRetry and bigQueryRetryWait. Does that look good to you, or do you want to keep the hardcoded 10-20 s sleep and 30 maxRetry that was used until now for SerializeAccessError (added in PR #239)?

I added another test that ensures it throws an exception after X retries.

Member

@b-goyal b-goyal left a comment

Thank you @SebHeuze for taking the time to look into the comments.
Left 2 minor comments; could you take a look please? Would be good to merge once those are resolved. Thanks!

@b-goyal
Member

b-goyal commented Nov 30, 2023

@SebHeuze, would you be able to target these changes to 2.5.x and change the base branch to 2.5.x, please?

@SebHeuze SebHeuze changed the base branch from master to 2.5.x December 8, 2023 18:29
@SebHeuze
Author

SebHeuze commented Dec 8, 2023

@b-goyal The last 4 comments have been resolved.
I changed the base branch and the target branch; it should be OK now.

Member

@b-goyal b-goyal left a comment

Looks good, thanks @SebHeuze

@b-goyal b-goyal merged commit 0c2a448 into confluentinc:2.5.x Dec 18, 2023
3 of 4 checks passed
C0urante referenced this pull request in Aiven-Open/bigquery-connector-for-apache-kafka Feb 22, 2024