
Throttle streaming log reads #28818

Merged (4 commits) on Jan 11, 2023

Conversation

@dstandish (Contributor) commented Jan 9, 2023:

When reading logs from a running task, the current behavior loops as fast as possible. This causes unnecessary CPU utilization and load on external resources (e.g. an Elasticsearch cluster). We can reduce the load by sleeping a bit in each loop.

We only sleep when no logs are returned.
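For context, here is a minimal sketch of the kind of streaming-read loop this change throttles. The `read(ti, metadata)` helper and its return shape are simplifying assumptions for illustration, not the actual Airflow log reader API:

```python
import time

def stream_logs(ti, read, sleep_seconds=0.5):
    """Yield log chunks for a task instance, sleeping whenever nothing new arrives."""
    metadata = {}
    while True:
        logs, metadata = read(ti, metadata)
        if logs:
            yield logs
        if metadata.get("end_of_log"):
            break
        # Without this sleep the loop spins as fast as it can, burning CPU and
        # hammering the remote log store (e.g. an Elasticsearch cluster).
        time.sleep(sleep_seconds)
```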
if "end_of_log" not in metadata or (
not metadata["end_of_log"] and ti.state not in [State.RUNNING, State.DEFERRED]
):
time.sleep(0.5)
Member commented:

Why 0.5?

Member commented:

Looks like a magic number...

@dstandish (Contributor, Author) commented Jan 9, 2023:

Completely arbitrary

Contributor Author commented:

You prefer one second, perhaps?

Member commented:

Well, I am not complaining about the value, but about it being "magic".

I find such magic numbers are often necessary in tests, but in "production" code I think they always need at least some justification written down.

If at any point I have a magic number like that with no good explanation of why this value was chosen, it is always a red flag. Imagine someone else coming here a year from now, trying to solve an issue with log streaming, and asking: why do we have 0.5 here?

The answer should at least be documented somewhere with the reasoning, because the code does not explain it.

Of course, ideally this should be an async wait for the condition, but I understand that is not possible here. So having some "default" value might make sense. The tradeoff, as I understand it, is that we will see logs with an average delay of about 0.25 s (half the sleep interval), but the CPU usage will be much lower.

But there are at least a few questions. How much is OK? What's the acceptable delay? Should the default differ between handlers? Is it needed for the other handlers as well? Or should we add the possibility of specifying it differently for each handler?

Maybe the answer to all those questions (after applying common-sense judgment) is:

The 0.5 s delay here is OK because it causes an average 250 ms delay on streaming the logs, but it lowers the CPU usage. While we could make it handler-dependent, we decided to hard-code it as 0.5 s for all handlers for simplicity.

But your judgment now might be different in the future, so for your future self and for others, having this commented in the code would be the right thing to do, I think. That way the person in the future (either yourself or someone else) will not lose mental capacity figuring out why 0.5 was chosen.

Just this :).
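For the record, the "average 250 ms" figure follows from assuming a new log line becomes available at a uniformly random time t within a sleep window of length T = 0.5 s, so the extra wait until the next read is T - t:

```latex
E[\text{extra delay}] = \frac{1}{T}\int_0^T (T - t)\,dt = \frac{T}{2} = \frac{0.5\,\text{s}}{2} = 0.25\,\text{s}
```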

Contributor Author commented:

Yes, I should add a comment for sure, that's a good point.

Separately, do you think we should treat other handlers differently? Generally they are going to be network handlers, I think. If it's just FTH (FileTaskHandler), it's only about unnecessary CPU, but for remote handlers (probably the most common case) it's also about throttling requests to external services. The log reader has a timeout of 5 minutes, so if a user leaves the logs tab open while the task emits no logs for that long, it will keep hitting ES as fast as it can. Suppose there are many users doing this at the same time? Admittedly it's probably not likely to cause a serious problem IRL, but I do remember a report on Slack about cluster load, and I imagine this could be part of it.

It does not require long-sleeping tasks either: every time it's waiting for new logs, it's looping. So it seems like a reasonable thing to do, but I'm sincerely interested in what you think (apart from the mechanics of making this mergeable).

Contributor Author commented:

Yeah, looking again, I think it would be better to do something more akin to exponential backoff, i.e. if two consecutive loops return no logs, then sleep 0.5; otherwise let it run.

Previously I had this only sleep while waiting before any logs arrive, i.e. while the offset stays at 0.
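A minimal sketch of that selective throttle, using the same simplified `read()` helper as the earlier sketch (illustrative only, not the merged code):

```python
import time

def stream_logs_selective(ti, read, sleep_seconds=0.5):
    """Sleep only after two consecutive reads that returned no new logs."""
    metadata = {}
    empty_reads = 0
    while True:
        logs, metadata = read(ti, metadata)
        if logs:
            empty_reads = 0
            yield logs
        else:
            empty_reads += 1
        if metadata.get("end_of_log"):
            break
        if empty_reads >= 2:
            # The stream has gone quiet: back off instead of spinning.
            time.sleep(sleep_seconds)
```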

Contributor Author commented:

OK, implemented more selective throttling, added comments, and made it overridable through Airflow local settings by using a constant.
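For illustration, overriding such a constant from airflow_local_settings.py might look like the sketch below; the attribute name STREAM_LOOP_SLEEP_SECONDS is an assumed placeholder, not necessarily the name that was merged:

```python
# airflow_local_settings.py (must be importable by the webserver)
from airflow.utils.log.log_reader import TaskLogReader

# Hypothetical constant name; check the merged code for the real attribute.
TaskLogReader.STREAM_LOOP_SLEEP_SECONDS = 1.0  # wait longer between empty reads
```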

if "end_of_log" not in metadata or (
not metadata["end_of_log"] and ti.state not in [State.RUNNING, State.DEFERRED]
):
time.sleep(0.5)
@potiuk (Member) commented Jan 10, 2023:

Another (more concrete) question. (Actually, just asking the questions above in written form sparked more thinking about it.)

What happens if the last few logs arrive while the thread is sleeping and the task completes?

I understand that this log reader is used in a daemon thread in base_task_runner, and I think the task runner will simply wait for the forked (or exec'ed) process and exit immediately when it exits. Which means it will also stop the log reader immediately (I believe daemon threads are stopped immediately when they are the last remaining threads). This means that any logs produced during that 0.5 s will not be processed by the reader.

The question is: do we care? (I think we previously had a subtle race condition here, because a similar thing could have happened from time to time if the logs were written just before the task completed.) But by adding the sleep here we greatly increase the probability of it happening. I think there will be tasks that consistently hit this problem (those that do log.info("Exiting") followed immediately by sys.exit(0)).

So my question is: do we care that we lose the last logs? I am not 100% sure what the effect will be with the various handlers.

I think if we care, then we should stop using a daemon thread and do a join on the reader thread just before we leave (with another magic timeout of course, but likely related to the one used in the sleep).
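A rough sketch of that join-before-exit idea, assuming a generic run_task()/drain_logs() split; the names are illustrative, not base_task_runner's actual API:

```python
import threading

def run_with_log_reader(run_task, drain_logs, join_timeout=1.0):
    """Run the task while a non-daemon reader thread streams logs, then join it."""
    stop = threading.Event()

    def reader():
        while not stop.is_set():
            drain_logs()  # placeholder for one read/ship iteration (may sleep 0.5 s)

    t = threading.Thread(target=reader, daemon=False)
    t.start()
    try:
        run_task()
    finally:
        stop.set()
        # Bounded wait so logs written just before the task exited still get read,
        # instead of the reader being killed mid-sleep.
        t.join(timeout=join_timeout)
```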

@dstandish (Contributor, Author) commented Jan 10, 2023:

Hmm, the reading part has nothing to do with the task runner; this reading code is only invoked in the webserver. For sure it's a confusing aspect of how this code has evolved that the webserver "log reader" logic lives on the "task logging handler" class, for convenience. It's more than a logging handler.

@potiuk (Member) left a comment:

I love how it is now!

@potiuk (Member) commented Jan 10, 2023:

I think there were some weird network failures (on top of inventory files changed by Google), so it needs a rebase :)

@uranusjr merged commit 46704ee into apache:main on Jan 11, 2023
@uranusjr deleted the sleep-each-task-log-read-loop branch on Jan 11, 2023, 06:16
@pierrejeambrun added this to the Airflow 2.5.2 milestone on Feb 27, 2023
@pierrejeambrun added the type:bug-fix (Changelog: Bug Fixes) label on Feb 27, 2023
pierrejeambrun pushed a commit that referenced this pull request Mar 6, 2023
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
(cherry picked from commit 46704ee)
pierrejeambrun pushed a commit that referenced this pull request Mar 8, 2023
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
(cherry picked from commit 46704ee)
Labels: area:logging, type:bug-fix (Changelog: Bug Fixes)