
Throttle streaming log reads #28818

Merged (4 commits) on Jan 11, 2023

Conversation

@dstandish (Contributor) commented Jan 9, 2023:

When reading logs from a running task, the current behavior loops as fast as possible. This causes unnecessary CPU utilization and load on external resources (e.g. an Elasticsearch cluster). We can reduce the load by sleeping a bit in each loop.

We only sleep when no logs are returned.
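For context, here is a minimal sketch of the kind of streaming-read loop this change throttles. The `read(ti, metadata)` helper and its return shape are simplifying assumptions for illustration, not the actual Airflow log reader API:

```python
import time

def stream_logs(ti, read, sleep_seconds=0.5):
    """Yield log chunks for a task instance, sleeping whenever nothing new arrives."""
    metadata = {}
    while True:
        logs, metadata = read(ti, metadata)
        if logs:
            yield logs
        if metadata.get("end_of_log"):
            break
        # Without this sleep the loop spins as fast as it can, burning CPU and
        # hammering the remote log store (e.g. an Elasticsearch cluster).
        time.sleep(sleep_seconds)
```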
if "end_of_log" not in metadata or (
not metadata["end_of_log"] and ti.state not in [State.RUNNING, State.DEFERRED]
):
time.sleep(0.5)
Member commented:

Why 0.5?

Member commented:

Looks like a magic number...

@dstandish (Contributor, Author) commented Jan 9, 2023:

Completely arbitrary

Contributor Author commented:

You prefer one second, perhaps?

Member commented:

Well, I am not complaining about the value, but about it being "magic".

I find such magic numbers are often necessary in tests, but in "production" code I think they always need at least some justification written down.

If at any point I have a magic number like that with no good explanation of why this value was chosen, it is always a red flag. Imagine someone else coming here a year from now, trying to solve an issue with log streaming, and asking: why do we have 0.5 here?

The answer should at least be documented somewhere with the reasoning, because the code does not explain it.

Of course, ideally this should be an async wait for the condition, but I understand that is not possible here. So having some "default" value might make sense. The tradeoff, as I understand it, is that we will see logs with an average delay of about 0.25 s (half the sleep interval), but the CPU usage will be much lower.

But there are at least a few questions. How much is OK? What's the acceptable delay? Should the default differ between handlers? Is it needed for the other handlers as well? Or should we add the possibility of specifying it differently for each handler?

Maybe the answer to all those questions (after applying common-sense judgment) is:

The 0.5 s delay here is OK because it causes an average 250 ms delay on streaming the logs, but it lowers the CPU usage. While we could make it handler-dependent, we decided to hard-code it as 0.5 s for all handlers for simplicity.

But your judgment now might be different in the future, so for your future self and for others, having this commented in the code would be the right thing to do, I think. That way the person in the future (either yourself or someone else) will not lose mental capacity figuring out why 0.5 was chosen.

Just this :).
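For the record, the "average 250 ms" figure follows from assuming a new log line becomes available at a uniformly random time t within a sleep window of length T = 0.5 s, so the extra wait until the next read is T - t:

```latex
E[\text{extra delay}] = \frac{1}{T}\int_0^T (T - t)\,dt = \frac{T}{2} = \frac{0.5\,\text{s}}{2} = 0.25\,\text{s}
```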

Contributor Author commented:

Yes, I should add a comment for sure, that's a good point.

Separately, do you think we should treat other handlers differently? Generally they are going to be network handlers, I think. If it's just FTH (FileTaskHandler), it's only about unnecessary CPU, but for remote handlers (probably the most common case) it's also about throttling requests to external services. The log reader has a timeout of 5 minutes, so if a user leaves the logs tab open while the task emits no logs for that long, it will keep hitting ES as fast as it can. Suppose there are many users doing this at the same time? Admittedly it's probably not likely to cause a serious problem IRL, but I do remember a report on Slack about cluster load, and I imagine this could be part of it.

It does not require long-sleeping tasks either: every time it's waiting for new logs, it's looping. So it seems like a reasonable thing to do, but I'm sincerely interested in what you think (apart from the mechanics of making this mergeable).

Contributor Author commented:

Yeah, looking again, I think it would be better to do something more akin to exponential backoff, i.e. if two consecutive loops return no logs, then sleep 0.5; otherwise let it run.

Previously I had this only sleep while waiting before any logs arrive, i.e. while the offset stays at 0.
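A minimal sketch of that selective throttle, using the same simplified `read()` helper as the earlier sketch (illustrative only, not the merged code):

```python
import time

def stream_logs_selective(ti, read, sleep_seconds=0.5):
    """Sleep only after two consecutive reads that returned no new logs."""
    metadata = {}
    empty_reads = 0
    while True:
        logs, metadata = read(ti, metadata)
        if logs:
            empty_reads = 0
            yield logs
        else:
            empty_reads += 1
        if metadata.get("end_of_log"):
            break
        if empty_reads >= 2:
            # The stream has gone quiet: back off instead of spinning.
            time.sleep(sleep_seconds)
```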

Contributor Author commented:

OK, implemented more selective throttling, added comments, and made it overridable through Airflow local settings by using a constant.
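For illustration, overriding such a constant from airflow_local_settings.py might look like the sketch below; the attribute name STREAM_LOOP_SLEEP_SECONDS is an assumed placeholder, not necessarily the name that was merged:

```python
# airflow_local_settings.py (must be importable by the webserver)
from airflow.utils.log.log_reader import TaskLogReader

# Hypothetical constant name; check the merged code for the real attribute.
TaskLogReader.STREAM_LOOP_SLEEP_SECONDS = 1.0  # wait longer between empty reads
```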

if "end_of_log" not in metadata or (
not metadata["end_of_log"] and ti.state not in [State.RUNNING, State.DEFERRED]
):
time.sleep(0.5)
@potiuk (Member) commented Jan 10, 2023:

Another (more concrete) question. (Actually, just asking the questions above in written form sparked more thinking about it.)

What happens if the last few logs arrive while the thread is sleeping and the task completes?

I understand that this log reader is used in a daemon thread in base_task_runner, and I think the task runner will simply wait for the forked (or exec'ed) process and exit immediately when it exits. Which means it will also stop the log reader immediately (I believe daemon threads are stopped immediately when they are the last remaining threads). This means that any logs produced during that 0.5 s will not be processed by the reader.

The question is: do we care? (I think we previously had a subtle race condition here, because a similar thing could have happened from time to time if the logs were written just before the task completed.) But by adding the sleep here we greatly increase the probability of it happening. I think there will be tasks that consistently hit this problem (those that do log.info("Exiting") followed immediately by sys.exit(0)).

So my question is: do we care that we lose the last logs? I am not 100% sure what the effect will be with the various handlers.

I think if we care, then we should stop using a daemon thread and do a join on the reader thread just before we leave (with another magic timeout of course, but likely related to the one used in the sleep).
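A rough sketch of that join-before-exit idea, assuming a generic run_task()/drain_logs() split; the names are illustrative, not base_task_runner's actual API:

```python
import threading

def run_with_log_reader(run_task, drain_logs, join_timeout=1.0):
    """Run the task while a non-daemon reader thread streams logs, then join it."""
    stop = threading.Event()

    def reader():
        while not stop.is_set():
            drain_logs()  # placeholder for one read/ship iteration (may sleep 0.5 s)

    t = threading.Thread(target=reader, daemon=False)
    t.start()
    try:
        run_task()
    finally:
        stop.set()
        # Bounded wait so logs written just before the task exited still get read,
        # instead of the reader being killed mid-sleep.
        t.join(timeout=join_timeout)
```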

@dstandish (Contributor, Author) commented Jan 10, 2023:

Hmm, the reading part has nothing to do with the task runner; this reading code is only invoked in the webserver. For sure it's a confusing aspect of how this code has evolved that the webserver "log reader" logic lives on the "task logging handler" class, for convenience. It's more than a logging handler.

@potiuk (Member) left a comment:

I love how it is now!

@potiuk (Member) commented Jan 10, 2023:

I think there were some weird network failures (on top of inventory files changed by Google), so it needs a rebase :)

@uranusjr merged commit 46704ee into apache:main on Jan 11, 2023
@uranusjr deleted the sleep-each-task-log-read-loop branch on Jan 11, 2023, 06:16
@pierrejeambrun added this to the Airflow 2.5.2 milestone on Feb 27, 2023
@pierrejeambrun added the type:bug-fix (Changelog: Bug Fixes) label on Feb 27, 2023
pierrejeambrun pushed a commit that referenced this pull request Mar 6, 2023
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
(cherry picked from commit 46704ee)
pierrejeambrun pushed a commit that referenced this pull request Mar 8, 2023
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
(cherry picked from commit 46704ee)
Labels: area:logging, type:bug-fix (Changelog: Bug Fixes)