-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARC-788 error handling for the push queue (SQS) #922
Conversation
…rror-handling-for-push-queue
@@ -22,7 +22,7 @@ const sqsQueues = { | |||
queueRegion: envVars.SQS_PUSH_QUEUE_REGION, | |||
longPollingIntervalSec: LONG_POLLING_INTERVAL_SEC, | |||
timeoutSec: 60, | |||
maxAttempts: 5}, pushQueueMessageHandler, defaultErrorHandler), | |||
maxAttempts: 5}, pushQueueMessageHandler, pushQueueErrorHandler), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a feature flag for this change because if something goes wrong we can always use existing feature flag to fallback to the Redis queue.
@@ -106,110 +106,103 @@ export function processPushJob(app: Application) { | |||
} | |||
|
|||
export const processPush = async (github: GitHubAPI, payload, rootLogger: LoggerWithTarget) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I did here is removing try/catch block. Many lines got changes because rows got shifted
} catch (err) {
log.error({ err }, "Failed to process push");
emitWebhookFailedMetrics("push");
throw err;
}
We don't need to do it here because:
- If this function fail in the middleware during webhook processing, we already emit metrics in the middleware error handling (currently we emitting failure metrics twice)
- For SQS we'll have its own error handling. (Otherwise we'll emit a failure on each retry)
- We don't use Redis for pushes anymore, so we won't loosing its metrics
src/sqs/push.ts
Outdated
|
||
//Handle non-failure cases | ||
const maybeResult = maybeHandleNonFailureCase(error, context); | ||
if(maybeResult) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supernit: here and in other cases: there should either be a space between if
and (
everywhere or nowhere. I see both in different parts of your PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need to add some liner rules so it will be auto-formatted
src/sqs/push.ts
Outdated
export const pushQueueErrorHandler : ErrorHandler<PushQueueMessagePayload> = async (error: JiraClientError | Octokit.HookError | RateLimitingError | Error, | ||
context: Context<PushQueueMessagePayload>) : Promise<ErrorHandlingResult> => { | ||
|
||
//Handle non-failure cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the comment repeats the code, I suggest to remove it
//Handle non-failure cases | ||
const maybeResult = maybeHandleNonFailureCase(error, context); | ||
if(maybeResult) { | ||
return maybeResult; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Would return !maybeResult
make it clearer?
nit: Would return maybeResult!
make it clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if there is a result we don't need to return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, wait, sorry, my bad! I meant return maybeResult!
🤦 🤣 Explicitly show in the code that we are returning a value, not "maybe" value - this is what I wanted
src/sqs/push.ts
Outdated
return maybeResult; | ||
} | ||
|
||
//Handle failure cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same, IMHO the commend should tell "why?", not "what?". I'd suggest to remove the comment
src/sqs/push.ts
Outdated
const errorHandlingResult = handleFailureCase(error, context); | ||
|
||
if(!errorHandlingResult.retryable || context.lastAttempt ) { | ||
//Emit metrics if the failure is not retryable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same, IMHO the commend should tell "why?", not "what?". I'd suggest to remove the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I guess log message explains it enough
src/sqs/push.ts
Outdated
function maybeHandleNonFailureCase(error: Error, context: Context<PushQueueMessagePayload>): ErrorHandlingResult | undefined { | ||
if(error instanceof JiraClientError && | ||
error.status && | ||
NON_ERRONEOUS_STATUSES.includes(error.status)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd suggest NON_FAILURE_STATUSES
, because you've used word "failure" for this situation before (less notions => easier to follow IMHO)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, should it be something like RETRYABLE_STATUSES
? I feel like "retryable/non-retryable" is more clear and explicit than "error/failure"....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the same goes to the names of functions (e.g. maybeHandleRetryableCase
)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the thing is when it is not a failure, we won't send a faulre metric.
So I wanted to name it to reflect taht it is not a failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Non-failure = we retry", is my understanding correct?
//If error is Octokit.HookError, then we need to check the response status | ||
//Unfortunately we can't check if error is instance of Octokit.HookError because it is not a calss, so we'll just rely on status | ||
//TODO Add error handling for the new GitHub client when it will be done | ||
const maybeErrorWithStatus : any = error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could it be const { status } = error;
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was done to avoid typescript errors. If I do const { status } = error;
It will say that the error doesnt' have "status" field. However its subclasses can. So I had to do this to avoid ignoring typescript&linter errors
src/sqs/push.ts
Outdated
|
||
function handleFailureCase(error: Error, context: Context<PushQueueMessagePayload>): ErrorHandlingResult { | ||
if (error instanceof RateLimitingError) { | ||
// delaying until the rate limit is reset (plus a buffer of a couple seconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same, IMHO the commend should tell "why?", not "what?". I'd suggest to remove the comment
src/sqs/push.ts
Outdated
@@ -39,3 +41,60 @@ export const pushQueueMessageHandler : MessageHandler<PushQueueMessagePayload> = | |||
await processPush(github, payload, wrapLogger(context.log)); | |||
} | |||
|
|||
|
|||
const NON_ERRONEOUS_STATUSES = [401, 404, 403]; | |||
const RATE_LIMITING_DELAY_BUFFER = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe also include measurement units? Is it seconds? Minutes?
src/sqs/push.ts
Outdated
|
||
const NON_ERRONEOUS_STATUSES = [401, 404, 403]; | ||
const RATE_LIMITING_DELAY_BUFFER = 10; | ||
const EXPONENTIAL_BACKOFF_BASE = 60; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe also include measurement units? Is it seconds? Minutes?
src/sqs/push.ts
Outdated
return {retryable: true, retryDelaySec: delaySec} | ||
} | ||
|
||
//Use exponential backoff in any other case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same, IMHO the commend should tell "why?", not "what?". I'd suggest to remove the comment
src/sqs/push.ts
Outdated
} | ||
|
||
//Use exponential backoff in any other case | ||
const delaySec = EXPONENTIAL_BACKOFF_BASE * Math.pow(3, context.receiveCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should 3 be a constant, too?
describe("Push Queue Error Handler", () => { | ||
|
||
|
||
let statsdIncrementSpy = jest.spyOn(statsd, "histogram"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supernit: formatting (empty lines)
}); | ||
|
||
|
||
function getJiraClientError(code: number) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supernit: too many empty lines
Great work, @KAbakumovAtl ! In general it looks good to me, the only sensible comment I have is about the notions that you are using. I see that you've used 3 different words: failure, error and non-retryable. I'd suggest to stick to a single one and rename everything else to use it. My personal perefence is "retryable/non-retryable" cause I think it is pretty explicit and describes the behaviour of the system best.
|
I fixed what I could. Ready for re-review |
In this PR: