Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

outgoing_webhooks: automatically disable an endpoint after multiple failures #24

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

bhumi1102
Copy link
Contributor

@bhumi1102 bhumi1102 commented Dec 23, 2022

The goal is to automatically mark an endpoint as disabled if a success response is not received after multiple attempts (currently 5 attempts at increasing re-attempts delays)

Paired with bullet-train-co/bullet_train#565

Testing Done:
Add a new endpoint, set the url to fail with something like webhook.site
Create the resource to trigger the outgoing webhook
Watch the logs and the DB tables outgoing_webhooks_* for the 5 delivery attempts failures
Confirm that the endpoint's disabled_from_failure field is set to true

Create a resource that would trigger the endpoint. This time while disabled.
Confirm from the logs that no outgoing webhooks are queued for delivery

Note: user can add a new endpoint with the desired config to re-enable

closes bullet-train-co/bullet_train-outgoing_webhooks#11

@andrewculver
Copy link
Contributor

@bhumi1102 Can we make this so it doesn't disable the webhook after just a single webhook failing, but after maybe a day of webhooks not being successfully delivered? We could track it at the endpoint level, and if it's been failing for more than 24 hours, then disable the endpoint.

@bhumi1102
Copy link
Contributor Author

@bhumi1102 Can we make this so it doesn't disable the webhook after just a single webhook failing, but after maybe a day of webhooks not being successfully delivered? We could track it at the endpoint level, and if it's been failing for more than 24 hours, then disable the endpoint.

Sure, will take a look at this next!

@bhumi1102
Copy link
Contributor Author

Hi @andrewculver - to understand the 24 hours thing, you're imagining adding another attempt after 24 hours to double check that it still fails before disabling? OR adding a background job or something that checks periodically for up to 24 hours (more involved).

Also to clarify, with this change we are not disabling after a single webhook failing but after the sequence of 5 attempts, the longest one being 1 hours.

  ATTEMPT_SCHEDULE = {
    1 => 15.seconds,
    2 => 1.minute,
    3 => 5.minutes,
    4 => 15.minutes,
    5 => 1.hour,
  }

So the options I see are 1. add a 6th attempt 24 hours later or 2. add some background job to run in 24 hours or something more involved or 3. leave it as is if we're okay with concluding that if the endpoint is still not responding after 5 attempts separated by an hours it's safe to disable.

@bhumi1102
Copy link
Contributor Author

Update:
Only disable an endpoint after all delivery attempts have failed, if there is no successful delivery in the last 24 hours (by checking delivered_at)

Copy link
Contributor

@jagthedrummer jagthedrummer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good to me. @andrewculver do you have any other concerns about this before we merge it?

Comment on lines +33 to +37
last_successful_delivery = endpoint.deliveries.where.not(delivered_at: nil).maximum(:delivered_at)
if last_successful_delivery.blank? || last_successful_delivery < 24.hours.ago
endpoint.disabled_from_failure = true
endpoint.save
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is maximum a potentially costly query? Will the database have to do a full table scan or do we have an index on the column?

I wonder if it would be clearer too if we rephrased the logic here a bit? Date math is always something I'm terrible with, so this logic might not be exactly right, it would need to be verified.

Suggested change
last_successful_delivery = endpoint.deliveries.where.not(delivered_at: nil).maximum(:delivered_at)
if last_successful_delivery.blank? || last_successful_delivery < 24.hours.ago
endpoint.disabled_from_failure = true
endpoint.save
end
no_successful_delivery_within_past_24_hours = endpoint.deliveries.where(delivered_at: 24.hours.ago..).none?
if no_successful_delivery_within_past_24_hours
endpoint.disabled_from_failure = true
endpoint.save
end

last_successful_delivery.before? 24.hours.ago or 24.hours.ago.after? last_successful_delivery may be clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This. I had quite a few instances in the recent past where sorting in the Ruby domain was faster than against an unoptimized table.

But: wouldn’t this then be cleaner by adding an index?

also, currently evades me but wasn’t there a shorthand for .where.not(…: nil)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding an index seems reasonable to me since we definitely care about being able to find recent records quickly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, currently evades me but wasn’t there a shorthand for .where.not(…: nil)?

@julianrubisch maybe you're thinking of where.missing or where.associated but that's for associations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where.not(attribute_name: nil) is it.

For instance:

User.where.not(created_at: nil).count
D, [2023-08-03T11:18:09.830819 #22549] DEBUG -- :   User Count (0.7ms)  SELECT COUNT(*) FROM "users" WHERE "users"."created_at" IS NOT NULL
=> 22

Though I think I'm missing how that would be applicable to this situation.

@jagthedrummer
Copy link
Contributor

@andrewculver asked me not to merge anything related to outgoing webhooks, so I'm going to leave this for him to review and merge. Going to convert it to a draft PR for now.

@jagthedrummer jagthedrummer marked this pull request as draft August 22, 2023 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automatically disable an endpoint after multiple failures.
5 participants