fix(code-review): Add cache to dedupe github webhook events #107734
suejung-sentry wants to merge 3 commits into master
Conversation
```python
def _get_webhook_seen_cluster() -> RedisCluster[str] | StrictRedis[str]:
    return redis_clusters.get("default")
```
Thank you for adding references to the patterns you use 👍🏻
Force-pushed from 724dc45 to 0230d2e
```python
logger = logging.getLogger(__name__)

WEBHOOK_SEEN_TTL_SECONDS = 20
```
I picked 20 seconds because, per the Redash query, when dupes do occur they happen within 500 milliseconds of each other. I thought 20 seconds would comfortably cover that and any errant GitHub redelivery behavior.
Thank you for the source and for spelling out your reasoning. Makes sense 👍🏻
vaind left a comment
While I'm not familiar with this part of the codebase, this looks reasonable to me and saves quite a few runs 👍. Would be good to get another set of eyes on this, though.
```python
if github_delivery_id is not None:
    github_delivery_id = str(github_delivery_id)
    sentry_sdk.set_extra("github_delivery_id", github_delivery_id)
```
Does it make sense to add to context if it's None?
You could still have the filter in Sentry if you want to find events without it.
For the record, this header will never be missing (unless GitHub introduces a bug in their code).
```python
def test_same_delivery_id_second_seen_skipped(self) -> None:
    """
    Two deliveries with the same id, one after the other: the first marks seen and runs;
    the second is already seen (key exists), so only one handler run.
    """
```
Do we need both this test and test_webhook_already_seen_handler_not_invoked?
```python
def __call__(self, event: Mapping[str, Any], **kwargs: Any) -> None:
    github_event = kwargs["github_event"]
    github_delivery_id = kwargs.get("github_delivery_id")
```
As reference: `X-GitHub-Delivery`: A globally unique identifier (GUID) to identify the event.
```python
delivery_id = f"already-seen-{uuid4()}"
cluster = redis_clusters.get("default")
seen_key = f"{WEBHOOK_SEEN_KEY_PREFIX}{delivery_id}"
cluster.set(seen_key, "1", ex=WEBHOOK_SEEN_TTL_SECONDS, nx=True)
```
A more realistic test would be to simply call handle_webhook_event twice.
That way, if the code managing the cluster changes, the test would still exercise the logic within handle_webhook_event.
Perhaps delete this test and keep test_same_delivery_id_second_seen_skipped, since it does what I describe.
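A minimal sketch of the test shape suggested above: call the entry point twice with the same delivery id and assert the handler runs once. The seen-store is an in-memory stand-in for the Redis cluster, and handle_webhook_event here is a simplified illustration, not the repo's actual function.

```python
import time
import unittest
from unittest.mock import MagicMock

WEBHOOK_SEEN_TTL_SECONDS = 20
_seen: dict[str, float] = {}  # in-memory stand-in for the Redis seen-keys


def _mark_seen(delivery_id: str) -> bool:
    """Emulates SET key "1" EX ttl NX: True if new, False if already seen."""
    now = time.monotonic()
    if _seen.get(delivery_id, 0.0) > now:
        return False
    _seen[delivery_id] = now + WEBHOOK_SEEN_TTL_SECONDS
    return True


def handle_webhook_event(delivery_id: str, handler) -> None:
    # Only the first delivery within the TTL window reaches the handler.
    if _mark_seen(delivery_id):
        handler(delivery_id)


class DedupeTest(unittest.TestCase):
    def test_same_delivery_id_second_call_skipped(self) -> None:
        handler = MagicMock()
        handle_webhook_event("dup-1", handler)
        handle_webhook_event("dup-1", handler)  # deduped: seen-key still live
        self.assertEqual(handler.call_count, 1)
```

Because both calls go through handle_webhook_event, the test keeps covering the dedupe path even if the key-management internals change.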
This PR handles webhook delivery deduplication by introducing Redis idempotency keys keyed on the GitHub webhook delivery id.
GitHub guarantees "at-least-once" delivery, so it may send duplicate webhooks. We have seen anecdotally that Seer can receive multiple requests for a single commit (from a pull_request.synchronize event) within 500 milliseconds of each other (per Redash).
It's unclear whether GitHub is delivering the webhook twice or something in our control-to-regional forwarding queues is causing redelivery. In either case, it seems likely that the same payload is being processed with the same GitHub webhook id, so we use that as the idempotency key.
I considered using a lock instead. The downside is that the lock would release when the function returns, which may happen sooner than the ~500 millisecond window in which we currently see dupes.
So instead, in this PR, any webhook with the same webhook id delivered within the same 20-second window is not replayed and re-forwarded to Seer.
I chose a 20-second TTL to cover the 500-millisecond dupe window plus any errant GitHub retry/backoff behavior.
Redis should be able to handle this load, which is one SET per GitHub webhook that makes it past our "preflight" feature-enablement filters (a peak hour is around 2,000 code reviews, so say at most 1 request per second). The keys are also small, with a short TTL.
Closes CW-673