Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Scheduler and triggerer crashes in daemon mode when statsd metrics are enabled #35181

Merged
merged 4 commits into from Nov 12, 2023

Conversation

pavansharma36
Copy link
Contributor

related: #33897


Fix services crashing in daemon mode when statsd is enabled.

@eladkal eladkal added this to the Airflow 2.7.3 milestone Oct 29, 2023
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Oct 29, 2023
@eladkal eladkal changed the title reset stats instance in daemon context Fix Scheduler and triggerer crashes in daemon mode when statsd metrics are enabled Oct 29, 2023
@eladkal eladkal requested a review from potiuk October 29, 2023 06:47
@potiuk
Copy link
Member

potiuk commented Oct 29, 2023

I see no big problem with it and I see why it could be needed, but I would love someone else to take a look as well @Taragolis @uranusjr @ferruzzi ?

Seems to me that this could possibly be solved in a more generic way - I do recall similar open-telemetry problem year ago or so (that has since been fixed), so I guess this is really only a problem with statsd, and maybe the solution here is "good enough" ?

@ferruzzi
Copy link
Contributor

This doesn't flush the existing metrics when you create a new instance? For example, graphs still have their data from before the flush??

@potiuk
Copy link
Member

potiuk commented Nov 1, 2023

This doesn't flush the existing metrics when you create a new instance? For example, graphs still have their data from before the flush??

I believe there are virtually no metrics before. Daemonisation happens right at the very start of any component that enables it, and I think what this problem mainly solves is that it seems Statsd integration when imported and initialized saves some states and possibly starts some thread that then are "lost" when daemonizing.

The deamonisation does a few things:

  1. forks the process (twice I believe) to make sure that it detaches from the process that started it and moves itself to have "init" process as parent so that any signals (SIGTERM/SIGHUP) from the original terminal process do not kill the background process

  2. closes stdin/stdout and opened files/descriptors/sockets to make sure it does not continue writng / reading from terminal which was used to start it and that the files are not accessed from multiple processes at once.

I think what happens here - is that Statsd manages to open a socket / file descriptor before daemonization happnes and that socket gets closed when forking/daemonising. When deamonization happens, the parent/intermediate processes exit right after the fork happens. So even if Statsd managed to open the socket before daemonization - it gets closed right after (and by reinitialising Statsd in this PR we are reopening it in the forked process)

But also writing this comment makes me think: If we manage to find out the resource (likely socket) that Statsd opens and get hang of file descriptor for that - there is a better way of handling it https://daemonize.readthedocs.io/en/latest/ - you can add such file descriptor to "keep_fds" list of descriptors that should NOT be closed when forking and their ownership passed to the forked process. So @pavansharma36 - maybe you can investigate what Statsd does under the hood to make it stop working - find the descriptor it opens at initialization and pass it in keep_fds parameter?

@pavansharma36
Copy link
Contributor Author

pavansharma36 commented Nov 1, 2023

@potiuk I believe that is unnecessary complicating things as not all process are opening stats socket before daemonize,
plus there can be different impl we need to take care and will make things tightly coupled with stats in cli

@potiuk
Copy link
Member

potiuk commented Nov 1, 2023

Deleted my comment by mistake - but you can likely see it in notification. Generally I am ok with reinitializing as long as we do not have any side effects. Have you looked at those abd can confirm that such reinitialization does not have any ? @pavansharma36 ?

@pavansharma36
Copy link
Contributor Author

Yes @potiuk I've verified all cases and works without any issue

@potiuk
Copy link
Member

potiuk commented Nov 1, 2023

Yes @potiuk I've verified all cases and works without any issue

Which cases? out of curiosity? Do you know exactly what causes the problem in the first place?

@potiuk potiuk merged commit 4b63e36 into apache:main Nov 12, 2023
44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:CLI type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants