Kill analysis subprocess if join timeout is triggered implying task should be killed #504
    return LaunchpadKafkaConsumer(processor, strategy_factory, healthcheck_path)


class ShutdownAwareStrategy(ProcessingStrategy[KafkaPayload]):
I'd love to not have to do this... 🤔
but I don't think there is another way to intercept the _close_strategy call that we care about.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #504      +/-   ##
==========================================
- Coverage   81.13%   80.95%   -0.18%
==========================================
  Files         164      164
  Lines       14226    14273      +47
  Branches     1505     1511       +6
==========================================
+ Hits        11542    11555      +13
- Misses       2111     2145      +34
  Partials      573      573
        return decoded  # type: ignore[no-any-return]
    finally:
        with registry_lock:
            process_registry.pop(process.pid, None)
So this also handles removing PIDs of successful message processes, correct?
yes!
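For readers following the thread above, here is a minimal sketch of the registry pattern being discussed, using only the standard library. It is not the code from this PR: `run_tracked`, the `timeout` default, and the use of `subprocess.Popen` are assumptions for illustration; only the names `process_registry` and `registry_lock` come from the diff.

```python
import subprocess
import sys
import threading

# Names taken from the diff; the types are an assumption (PID -> child handle).
process_registry: dict[int, subprocess.Popen] = {}
registry_lock = threading.Lock()


def run_tracked(code: str, timeout: float = 30.0) -> str:
    """Hypothetical helper: run a Python snippet in a child process and
    keep it in the registry for as long as it is alive."""
    process = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        text=True,
    )
    with registry_lock:
        process_registry[process.pid] = process
    try:
        stdout, _ = process.communicate(timeout=timeout)
        if process.returncode != 0:
            raise RuntimeError(f"child exited with code {process.returncode}")
        return stdout.strip()
    except subprocess.TimeoutExpired:
        # Do not leave the child running once we give up on it.
        process.kill()
        process.communicate()
        raise
    finally:
        # Runs on success, failure, and timeout alike, so the registry
        # never keeps the PID of a finished process.
        with registry_lock:
            process_registry.pop(process.pid, None)
```

Because the removal sits in `finally`, a successfully processed message leaves the registry in the same state as a failed one, which is the behavior confirmed in the reply above.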
An unexpected side effect of moving the analysis to a subprocess was that when the Kafka partition is signaled to be revoked, we would normally give the task 10s and then kill it... but in Arroyo we wait until all the threads are completed before moving on...
so it seems like this is what is happening in prod currently:
and the JOIN taking super long causes Kafka to think our consumer is dead and kill it.
Instead, if we can kill the subprocess, it should look like this:
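The intended behavior can be sketched roughly as follows, again with the standard library only. This is an illustration under stated assumptions, not the PR's implementation: the real change hooks into the Arroyo strategy shutdown path, whereas here a plain `threading.Thread` stands in for the worker, and `kill_registered_processes`, `join_with_kill`, and `analyze` are hypothetical names.

```python
import subprocess
import sys
import threading

# Same registry shape as in the diff: PID -> child handle (types assumed).
process_registry: dict[int, subprocess.Popen] = {}
registry_lock = threading.Lock()


def kill_registered_processes() -> list[int]:
    """Kill every tracked child that is still running; return their PIDs."""
    with registry_lock:
        processes = list(process_registry.values())
    killed = []
    for process in processes:
        if process.poll() is None:  # no exit status observed yet
            process.kill()
            killed.append(process.pid)
    return killed


def join_with_kill(worker: threading.Thread, timeout: float) -> bool:
    """Wait up to `timeout` seconds for the worker thread. If it overruns,
    kill the tracked children so the thread can finish. Returns True when
    a kill was needed."""
    worker.join(timeout)
    if not worker.is_alive():
        return False
    kill_registered_processes()
    worker.join()  # unblocks once the child it is waiting on has exited
    return True


def analyze() -> None:
    """Stand-in for a long analysis task: a child that sleeps for a minute."""
    process = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    with registry_lock:
        process_registry[process.pid] = process
    try:
        process.wait()
    finally:
        with registry_lock:
            process_registry.pop(process.pid, None)
```

Starting `analyze` in a thread and calling `join_with_kill(worker, timeout=0.5)` should return well before the child's 60 seconds are up, since the thread is only blocked on a process that has just been killed.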