Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARC-788 error handling for the push queue (SQS) #922

Merged
merged 10 commits into from
Nov 17, 2021

Conversation

KAbakumovAtl
Copy link
Contributor

In this PR:

  • Metrics emission from processPush function was removed
  • Added push queue error handler & tests

@@ -22,7 +22,7 @@ const sqsQueues = {
queueRegion: envVars.SQS_PUSH_QUEUE_REGION,
longPollingIntervalSec: LONG_POLLING_INTERVAL_SEC,
timeoutSec: 60,
maxAttempts: 5}, pushQueueMessageHandler, defaultErrorHandler),
maxAttempts: 5}, pushQueueMessageHandler, pushQueueErrorHandler),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a feature flag for this change because if something goes wrong we can always use existing feature flag to fallback to the Redis queue.

@@ -106,110 +106,103 @@ export function processPushJob(app: Application) {
}

export const processPush = async (github: GitHubAPI, payload, rootLogger: LoggerWithTarget) => {
Copy link
Contributor Author

@KAbakumovAtl KAbakumovAtl Nov 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I did here is removing try/catch block. Many lines got changes because rows got shifted

} catch (err) {
		log.error({ err }, "Failed to process push");
		emitWebhookFailedMetrics("push");
		throw err;
}

We don't need to do it here because:

  1. If this function fail in the middleware during webhook processing, we already emit metrics in the middleware error handling (currently we emitting failure metrics twice)
  2. For SQS we'll have its own error handling. (Otherwise we'll emit a failure on each retry)
  3. We don't use Redis for pushes anymore, so we won't loosing its metrics

@KAbakumovAtl KAbakumovAtl changed the title ARC-788 error handling for push queue ARC-788 error handling for the push queue (SQS) Nov 17, 2021
src/sqs/push.ts Outdated

//Handle non-failure cases
const maybeResult = maybeHandleNonFailureCase(error, context);
if(maybeResult) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supernit: here and in other cases: there should either be a space between if and ( everywhere or nowhere. I see both in different parts of your PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we need to add some liner rules so it will be auto-formatted

src/sqs/push.ts Outdated
export const pushQueueErrorHandler : ErrorHandler<PushQueueMessagePayload> = async (error: JiraClientError | Octokit.HookError | RateLimitingError | Error,
context: Context<PushQueueMessagePayload>) : Promise<ErrorHandlingResult> => {

//Handle non-failure cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the comment repeats the code, I suggest to remove it

//Handle non-failure cases
const maybeResult = maybeHandleNonFailureCase(error, context);
if(maybeResult) {
return maybeResult;
Copy link
Contributor

@bgvozdev bgvozdev Nov 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Would return !maybeResult make it clearer?
nit: Would return maybeResult! make it clearer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if there is a result we don't need to return

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, wait, sorry, my bad! I meant return maybeResult! 🤦 🤣 Explicitly show in the code that we are returning a value, not "maybe" value - this is what I wanted

src/sqs/push.ts Outdated
return maybeResult;
}

//Handle failure cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, IMHO the commend should tell "why?", not "what?". I'd suggest to remove the comment

src/sqs/push.ts Outdated
const errorHandlingResult = handleFailureCase(error, context);

if(!errorHandlingResult.retryable || context.lastAttempt ) {
//Emit metrics if the failure is not retryable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, IMHO the commend should tell "why?", not "what?". I'd suggest to remove the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess log message explains it enough

src/sqs/push.ts Outdated
function maybeHandleNonFailureCase(error: Error, context: Context<PushQueueMessagePayload>): ErrorHandlingResult | undefined {
if(error instanceof JiraClientError &&
error.status &&
NON_ERRONEOUS_STATUSES.includes(error.status)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd suggest NON_FAILURE_STATUSES, because you've used word "failure" for this situation before (less notions => easier to follow IMHO)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, should it be something like RETRYABLE_STATUSES? I feel like "retryable/non-retryable" is more clear and explicit than "error/failure"....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the same goes to the names of functions (e.g. maybeHandleRetryableCase)...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the thing is when it is not a failure, we won't send a faulre metric.

So I wanted to name it to reflect taht it is not a failure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Non-failure = we retry", is my understanding correct?

//If error is Octokit.HookError, then we need to check the response status
//Unfortunately we can't check if error is instance of Octokit.HookError because it is not a calss, so we'll just rely on status
//TODO Add error handling for the new GitHub client when it will be done
const maybeErrorWithStatus : any = error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could it be const { status } = error; instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was done to avoid typescript errors. If I do const { status } = error; It will say that the error doesnt' have "status" field. However its subclasses can. So I had to do this to avoid ignoring typescript&linter errors

src/sqs/push.ts Outdated

function handleFailureCase(error: Error, context: Context<PushQueueMessagePayload>): ErrorHandlingResult {
if (error instanceof RateLimitingError) {
// delaying until the rate limit is reset (plus a buffer of a couple seconds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, IMHO the commend should tell "why?", not "what?". I'd suggest to remove the comment

src/sqs/push.ts Outdated
@@ -39,3 +41,60 @@ export const pushQueueMessageHandler : MessageHandler<PushQueueMessagePayload> =
await processPush(github, payload, wrapLogger(context.log));
}


const NON_ERRONEOUS_STATUSES = [401, 404, 403];
const RATE_LIMITING_DELAY_BUFFER = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe also include measurement units? Is it seconds? Minutes?

src/sqs/push.ts Outdated

const NON_ERRONEOUS_STATUSES = [401, 404, 403];
const RATE_LIMITING_DELAY_BUFFER = 10;
const EXPONENTIAL_BACKOFF_BASE = 60;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe also include measurement units? Is it seconds? Minutes?

src/sqs/push.ts Outdated
return {retryable: true, retryDelaySec: delaySec}
}

//Use exponential backoff in any other case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same, IMHO the commend should tell "why?", not "what?". I'd suggest to remove the comment

src/sqs/push.ts Outdated
}

//Use exponential backoff in any other case
const delaySec = EXPONENTIAL_BACKOFF_BASE * Math.pow(3, context.receiveCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should 3 be a constant, too?

describe("Push Queue Error Handler", () => {


let statsdIncrementSpy = jest.spyOn(statsd, "histogram");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supernit: formatting (empty lines)

});


function getJiraClientError(code: number) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supernit: too many empty lines

@bgvozdev
Copy link
Contributor

bgvozdev commented Nov 17, 2021

Great work, @KAbakumovAtl ! In general it looks good to me, the only sensible comment I have is about the notions that you are using. I see that you've used 3 different words: failure, error and non-retryable. I'd suggest to stick to a single one and rename everything else to use it. My personal perefence is "retryable/non-retryable" cause I think it is pretty explicit and describes the behaviour of the system best.

  • consider unifying the used notions describing "retryable/non-retryable" errors

@KAbakumovAtl
Copy link
Contributor Author

I fixed what I could. Ready for re-review

bgvozdev
bgvozdev previously approved these changes Nov 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants