
airflow-webserver crashes if log files are larger than webserver available RAM #31105

Open
getaaron opened this issue May 5, 2023 · 39 comments
Labels
affected_version:2.6 Issues Reported for 2.6 area:logging area:webserver Webserver related Issues kind:bug This is a clearly a bug

Comments

@getaaron
Contributor

getaaron commented May 5, 2023

Apache Airflow version

2.6.0

What happened

I tried to view a 400MB dag log file in the web server UI.

The available RAM on the Kubernetes pod is 50MB.

The Airflow webserver crashed.

What you think should happen instead

Either

  1. The log file should be truncated and an error shown that there's not enough RAM, or
  2. The log file should not be fully realized into RAM and should be streamed to frontend instead

How to reproduce

  1. create airflow on k8s with 50Mi requested pod sizes
  2. make a dag that generates 400 megabyte log files
  3. attempt to view the log file in the airflow UI

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow==2.6.0
apache-airflow-providers-cncf-kubernetes==6.1.0
apache-airflow-providers-oracle==3.0.0
apache-airflow-providers-slack==7.2.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@getaaron getaaron added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels May 5, 2023
@boring-cyborg

boring-cyborg bot commented May 5, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@getaaron
Contributor Author

getaaron commented May 5, 2023

Haven't worked in airflow before, I could potentially open a PR if someone can describe what the fix should be

@potiuk potiuk added good first issue and removed needs-triage label for new issues that we didn't triage yet labels May 6, 2023
@potiuk
Member

potiuk commented May 6, 2023

Haven't worked in airflow before, I could potentially open a PR if someone can describe what the fix should be

I think you described it pretty well :) .

What you think should happen instead
Either

The log file should be truncated and an error shown that there's not enough RAM, or
The log file should not be fully realized into RAM and should be streamed to frontend instead

One of those seems like a good solution; the place where Airflow loads the file into memory just has to be found and one of the solutions above applied. There are many different ways of loading logs, depending on the view you use, the state of the task, and the log configuration (for example, remote logging or streaming is handled differently than local logs, and Celery logs differently again).

I marked it as good-first-issue for you or whoever would like to fix it. It's likely a bit more than a first issue and requires quite a bit of digging in the code, but for someone who knows a bit about Python/webserver/Flask it should not be that difficult to find the right place.

@getaaron
Contributor Author

getaaron commented May 8, 2023 via email

@killua1zoldyck
Contributor

@potiuk if no one's working on this and if this has not been fixed yet I can work on this.

@potiuk
Member

potiuk commented Jun 20, 2023

Feel free!

@killua1zoldyck
Contributor

@potiuk The current log-streaming logic pulls all the logs into memory, combines them, and sorts them by timestamp before moving to the specific log_pos from the metadata:

logs = "\n".join(
    _interleave_logs(
        *local_logs,
        *remote_logs,
        *(executor_logs or []),
        *served_logs,
    )
)
log_pos = len(logs)
messages = "".join([f"*** {x}\n" for x in messages_list])
end_of_log = ti.try_number != try_number or ti.state not in [State.RUNNING, State.DEFERRED]
if metadata and "log_pos" in metadata:
    previous_chars = metadata["log_pos"]
    logs = logs[previous_chars:]  # Cut off previously passed log test as new tail
out_message = logs if "log_pos" in (metadata or {}) else messages + logs
return out_message, {"end_of_log": end_of_log, "log_pos": log_pos}

So, I don't think we can change the logic to make it streaming. Is my thinking correct?

@potiuk
Member

potiuk commented Jun 23, 2023

I have not looked at it before so I have no idea. I was kind of hoping that whoever takes on the task will be able to propose a solution - possibly involving changing the way it is done.

@potiuk
Member

potiuk commented Jun 23, 2023

The thing with any change here is that it starts with things being done one way, where the change to fix it is to reimplement it to be done differently :D

@killua1zoldyck
Contributor

Yeah, I thought of a way where we could maintain separate log_pos for different streams. This way we do not have to pull the whole file into memory just to get the final few lines. But I have some reservations about this with regard to whether the log order would change. I will look into this.

@getaaron
Contributor Author

Yeah, I thought of a way where we could maintain separate log_pos for different streams. This way we do not have to pull the whole file into memory just to get the final few lines. But I have some reservations about this with regard to whether the log order would change. I will look into this.

If the separate log files are already sorted (I assume they are) then you can use a k-way merge to produce a sorted combined list without loading the whole thing into memory: https://en.m.wikipedia.org/wiki/K-way_merge_algorithm

If they're not already sorted (unlikely) then you could sort them individually, then use the k-way merge sort.
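
For illustration only (this is not Airflow's code), a minimal sketch of the idea using Python's heapq.merge, assuming each source can be exposed as an iterator of (timestamp, text) lines already sorted by timestamp; parse and emit below are placeholders:

import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical line shape: (timestamp, text). Each stream is assumed to be
# an iterator that yields its lines already sorted by timestamp.
LogLine = Tuple[str, str]

def merge_log_streams(*streams: Iterable[LogLine]) -> Iterator[LogLine]:
    # heapq.merge keeps only one pending line per stream on its heap, so memory
    # use is proportional to the number of streams, not the total log size.
    return heapq.merge(*streams, key=lambda line: line[0])

# Usage sketch: open file handles are lazy line iterators, so nothing is read
# ahead of what the caller actually consumes (parse/emit are placeholders):
# for ts, text in merge_log_streams(parse(local_fh), parse(s3_fh), parse(served_fh)):
#     emit(ts, text)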

@killua1zoldyck
Contributor

This seems great. Thank you!

@potiuk
Member

potiuk commented Jun 24, 2023

Yeah. Agree with @getaaron.

This one is not that easy because of interleaving logs from different sources. I looked at it, and really what you would have to do is either of these:

  • change the reading methods to accept position ranges; knowing the log position to start from in each of them would not be impossible (but it would indeed be complex when it comes to matching the position across the different sources)

or (probably easier)

  • instead of storing the logs in in-memory lists, stream them to temporary files and read them from there (and then indeed a k-way merge would be better)

Maybe a better solution would be to introduce some hard limits on the size of logs that can be loaded into memory and "hard-stop" if any of the sources attempts to fetch logs bigger than that size - returning "log too large to show" instead. The thing to add would be an optional max_length or max_size that could be passed to the methods returning arrays, implemented in all the implementations so that they raise a specific exception if the returned array would be too big (a rough sketch of this idea is at the end of this comment).

A 400 MB log is not really usable to sift through in the Airflow UI anyway. And I don't think we have a good mechanism to keep such a log in the browser's memory, so I am not even sure we could show it at all in the Airflow UI even if the backend could handle it.
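
A rough sketch of the hard-limit idea for a local file source - MAX_LOG_BYTES and LogSourceTooLargeError are made-up names for illustration, not existing Airflow symbols:

import os

MAX_LOG_BYTES = 10 * 1024 * 1024  # illustrative limit, e.g. 10 MiB per source

class LogSourceTooLargeError(Exception):
    """Raised when a single log source exceeds the configured limit."""

def read_local_log_capped(path: str, max_bytes: int = MAX_LOG_BYTES) -> str:
    # Check the size up front and refuse to load a file that would blow the limit.
    size = os.path.getsize(path)
    if size > max_bytes:
        raise LogSourceTooLargeError(
            f"Log file {path} is {size} bytes, limit is {max_bytes}: "
            "log too large to show."
        )
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read(max_bytes)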

@killua1zoldyck
Contributor

killua1zoldyck commented Jun 24, 2023

We can assign hard limits and that would solve the current issue. We can just sum the file sizes before loading any of them (or most of them - I do not know whether we can get all of their sizes) into memory. But I believe that even for files smaller than that limit, we should not have to load the whole file into memory and sort it for every auto-tailing call.

  • instead of storing the logs in in-memory lists, stream them to temporary files and read them from there (and then indeed a k-way merge would be better)

We can do this. The temp file would be in sorted order; however, if the task is still running, new logs could have arrived in the time it took us to send our response. Even for this, we need to maintain a log position per log stream and call the reading methods with the appropriate metadata to update this temp file (a rough sketch of per-stream offsets follows below). For some of the methods, like the HDFS one, we would still need to load the whole file into memory; for those we could filter after loading. I believe we can do this, as it would reduce memory usage and network congestion - and since we are sending metadata back and forth anyway, we might as well send the log positions of a few more files. What do you think?
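
As an illustrative sketch (not the actual implementation), per-stream offsets for local files could look roughly like this, with the byte position for each file carried in the metadata between calls:

from typing import List, Tuple

def tail_local_stream(path: str, prev_position: int, max_lines: int = 50) -> Tuple[List[str], int]:
    # Resume reading at the byte offset recorded for this stream in the previous
    # response's metadata; return at most max_lines new lines plus the new offset.
    lines: List[str] = []
    with open(path, "rb") as fh:
        fh.seek(prev_position)
        for raw in fh:
            lines.append(raw.decode("utf-8", errors="replace"))
            if len(lines) >= max_lines:
                break
        new_position = fh.tell()
    return lines, new_position

# Metadata shape carried between calls (illustrative only):
# {"local_logs": {"/path/to/attempt=1.log": {"prev_position": 14269}}}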

@potiuk
Member

potiuk commented Jun 25, 2023

We can assign hard limits and that would solve the current issue. We can just sum the file sizes before loading any of them (or most of them - I do not know whether we can get all of their sizes) into memory.

Or count rows if we can't get the sizes - that should be easier.

But I believe that even for files smaller than that limit, we should not have to load the whole file into memory and sort it for every auto-tailing call.

Correct - not having to join the files in memory will definitely decrease the memory requirements for the webserver, even for smaller files.

We can do this. The temp file would be in sorted order; however, if the task is still running, new logs could have arrived in the time it took us to send our response. Even for this, we need to maintain a log position per log stream and call the reading methods with the appropriate metadata to update this temp file.

Agree. I think the log_pos in the metadata in this case will be a bit tricky and should be "per stream"/"per user". There will be cases where someone just auto-tails the logs (in which case it is fine to keep the returned data in memory), but when someone reads the log from the beginning, the size might still be substantial (so keeping it in a file makes sense).

We should likely also have a separate path for cases like S3 if we want to further optimize it - cases where the remote log is never streamed, because there we can either have the full log file or nothing (this is object storage, not file storage, so for S3 we will not even see a log until it is complete). In that case we should not worry about "tailing" the log. And we could use some caching (at the expense of isolation) so that the log from S3 is downloaded only once per webserver and kept for some time (reused between users/sessions), so that if a few people look at the same task log or hit the refresh button, the "remote" reading of that file does not happen over and over again (a caching sketch follows below).
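
As a sketch of that per-webserver caching idea - TTLLogCache and fetch_s3_log are placeholder names, and the TTL value is arbitrary:

import time
from typing import Callable, Dict, Tuple

class TTLLogCache:
    # Cache completed remote log files in the webserver process for a short time,
    # so refreshes and concurrent viewers do not re-download the same object.
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, fetch: Callable[[], str]) -> str:
        now = time.monotonic()
        cached = self._entries.get(key)
        if cached and now - cached[0] < self.ttl:
            return cached[1]
        value = fetch()  # e.g. a callable wrapping the actual S3 download
        self._entries[key] = (now, value)
        return value

# Usage sketch (fetch_s3_log is a placeholder for the real remote read):
# cache = TTLLogCache(ttl_seconds=60)
# log_text = cache.get(s3_key, lambda: fetch_s3_log(s3_key))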

For some of the methods, like the HDFS one, we would still need to load the whole file into memory; for those we could filter after loading.

Hmm. Not sure how HDFS works in this case - nor how much it is "streaming" vs. "static" (i.e. cannot change once it is published) - caching should help there as well. Even if it is appendable but we cannot stream it, maybe there is a way to store a hash or mtime of the file and only pull it if it changed? But even in that case, I guess we do not need to pull the whole file into memory - maybe we would need to update the HDFS hook for that, but I can't imagine a case where we have to load the whole file into memory just to pull it from remote storage.

I believe we can do this, as it would reduce memory usage and network congestion - and since we are sending metadata back and forth anyway, we might as well send the log positions of a few more files. What do you think?

Yeah. I think we could actually do both - save logs to disk and also have some limits, because it is not only memory but also networking + I/O + time that is saved this way. Plus, if we do some caching, we can also optimize cases where various people look at the same logs.

Overall I think it is all "doable" and even if we don't implement and handle all cases, it can be gradually optimized.

I'd love to hear also what @dstandish has to say about it :) if he has time to read all the discussion that is :)

@killua1zoldyck
Contributor

I will work on these inputs for now. Thank you!

@getaaron
Contributor Author

getaaron commented Jul 7, 2023

@killua1zoldyck how's it going? let us know if you get stuck on anything

@killua1zoldyck
Contributor

Thank you! I've started working on it, but I will reach out if I need any help.

@killua1zoldyck
Contributor

killua1zoldyck commented Jul 9, 2023

Hey @getaaron, I have implemented the k-way merge and it works fine. For now, I have set the limit to 50 lines per call, and the UI keeps calling until end_of_log is reached. The metadata that gets sent back and forth looks something like this:

{
  "metadata": {
    "end_of_log": false,
    "executor_logs": {
      "log_pos": 0
    },
    "immediate_tail": true,
    "local_logs": {
      "/root/airflow/logs/dag_id=log_generator_dag/run_id=manual__2023-07-09T19:52:37.154177+00:00/task_id=generate_logs_5mb/attempt=1.log": {
        "done": false,
        "lines": 200,
        "need": 50,
        "prev_position": 14269
      }
    },
    "remote_logs": {
      "log_pos": 0
    },
    "served_logs": {
      "log_pos": 0
    }
  }
}

Where previously it looked something like this

{
  "metadata": {
    "end_of_log": false,
    "log_pos: 100,
  }
}

I thought this would only affect the Airflow UI piece that calls /get_logs_with_metadata. However, I just realized that it is also used by https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/get_log. Should we create a separate flow for auto-streaming logs from the UI and keep this one for the REST API?

@getaaron
Contributor Author

getaaron commented Jul 9, 2023

Hmm, does it mean you’re doing the stream merge on the client (browser) side?

I haven’t seen this API before but I wonder if there’s a way to supplement the full_content: false param with a reverse: true?

Is your goal with the updated UI to support full pagination through all the logs, or just show the most recent ~1,000 lines?

@killua1zoldyck
Contributor

killua1zoldyck commented Jul 9, 2023

Hmm, does it mean you’re doing the stream merge on the client (browser) side?

Currently the merging happens on the server side, but it reads all the logs into memory and merges them before truncating the already-read parts. I have changed it so the reading and merging are based on what was already read, which we get from the metadata (still on the server side).

I haven’t seen this API before but I wonder if there’s a way to supplement the full_content: false param with a reverse: true?

I don't think it has a reverse: true parameter: https://github.com/apache/airflow/blob/main/airflow/api_connexion/endpoints/log_endpoint.py. If we knew the total character length of our logs beforehand, we could send that as a token. But this too would read the whole file and then truncate the end, so memory would still be an issue.

Is your goal with the updated UI to support full pagination through all the logs, or just show the most recent ~1,000 lines?

I was thinking full pagination through all the logs, to the end. The version I have implemented now does that by reading 50 lines from a file at a time.

@killua1zoldyck
Contributor

If we knew the total character length of our logs beforehand, we could send that as a token. But this too would read the whole file and then truncate the end, so memory would still be an issue.

No, we can't do this.

@killua1zoldyck killua1zoldyck removed their assignment Aug 18, 2023
@killua1zoldyck
Contributor

I won't be able to continue working on this issue for now. If anyone else is available, feel free to take it over. Apologies for any inconvenience!

@potiuk
Member

potiuk commented Aug 19, 2023

No worries. Thanks for attempting to fix it - it's not an easy one to deal with.

@eladkal eladkal added the affected_version:2.6 Issues Reported for 2.6 label Aug 19, 2023
@uranusjr uranusjr added area:webserver Webserver related Issues and removed area:core labels Aug 20, 2023
@yatink

yatink commented Aug 20, 2023

Hi everyone, I'm new to airflow and I arrived here because this issue had the good first issue label on it (which may not be accurate after reading the discussion 😅).

I'm willing to pick this one up since it seems like an interesting problem. The challenge for me is going to be able to reproduce the issue first and see it for myself before I start working on it.

What I can do is start working on reproducing it; once I've done that, I can assign the issue to myself and work on a fix (that way, if someone more familiar with the codebase has the bandwidth to pick it up, they can still do so).

Just for reference, is there an existing PR associated with it ? From the discussion, it seemed like some work was already done on it...

@potiuk
Member

potiuk commented Aug 20, 2023

Yep. It's not a "Good first issue" -> removed it :)

If there were a PR, it would have been linked. I think @killua1zoldyck was mostly experimenting with his own changes and it never materialized in the form of a PR. But maybe @killua1zoldyck has something that can be shared.

@yatink

yatink commented Aug 20, 2023

Ok cool, I'd like to keep plugging away at it nonetheless, since it does sound like an interesting problem. I'll work on reproducing the problem and post back here when I'm ready to claim the issue. Wish me luck!

@potiuk
Member

potiuk commented Aug 20, 2023

🤞

@killua1zoldyck
Contributor

Hi @yatink, I did implement the suggested k-way merge (although there were a few issues with my implementation). I didn't notice that there was a log_endpoint.py resource, and my implementation meant that it would have to change as well.

@yatink

yatink commented Aug 27, 2023

@killua1zoldyck, can you put your k-way merge implementation into a gist (or similar) temporarily so that I can reference it? If not, no worries.

@yatink

yatink commented Aug 27, 2023

The way that I'm thinking about solving this issue is to do so in 2 steps:

  1. First, stop the webserver from crashing and pop a reasonable error message instead
  2. Continue down the path that @killua1zoldyck had started and implement the k-way merge so it can handle large log files

I'm on the fence about whether they should be tracked as separate issues or not. I'll be able to give a more informed opinion after I've made more progress.

@potiuk and/or @getaaron, let me know if step 1 is a reasonable short term approach or whether I should just skip it and do step 2.

@potiuk
Member

potiuk commented Aug 28, 2023

@potiuk and/or @getaaron, let me know if step 1 is a reasonable short term approach or whether I should just skip it and do step 2.

Yes. This is exactly how I would have imagined it too. It's quite likely that we will stop at step 1 if we find step 2 too complex. I have a feeling that trying to implement a k-way merge sort in this case does not give enough "bang for the buck". If you have such a huge log, finding anything meaningful in it via the Airflow UI will be next to impossible; search and everything else there will also struggle if you try to scroll through it, and you would likely have to implement full-fledged search and other APIs to make use of such a huge log.

In this case, just posting a message like "Your log is definitely too large to handle" and pointing the user to "Consider fine-tuning your tasks to make your logs smaller, or configure remote logging with a dedicated logging solution to store your logs" quite likely gives 9X% of the effect for 1% of the effort.

@yatink

yatink commented Sep 12, 2023

Sorry for the delay, I got airflow on kube set up on my laptop, so I'm going to pick up this issue and attempt to implement option 1 👇🏽 .

First, stop the webserver from crashing and pop a reasonable error message instead

Do you all have any large dag files lying around, preferably one that's larger than 400MB ? 😄 That's all that I'm missing (in theory) to reproduce the issue.

@killua1zoldyck
Contributor

Sorry for the delay, I got airflow on kube set up on my laptop, so I'm going to pick up this issue and attempt to implement option 1 👇🏽 .

First, stop the webserver from crashing and pop a reasonable error message instead

Do you all have any large dag files lying around, preferably one that's larger than 400MB ? 😄 That's all that I'm missing (in theory) to reproduce the issue.

We can just have a PythonOperator that takes the size of logs to produce as an argument and logs, say, 400 MB of random values - something like the sketch below.
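
A minimal sketch of such a reproduction DAG, assuming Airflow 2.x; the dag_id, task_id, and sizes are arbitrary choices for illustration:

import logging
import random
import string
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def generate_logs(target_mb: int = 400, line_len: int = 1024) -> None:
    # Emit roughly target_mb megabytes of random log lines via the task logger.
    logger = logging.getLogger(__name__)
    line = "".join(random.choices(string.ascii_letters, k=line_len))
    for _ in range((target_mb * 1024 * 1024) // line_len):
        logger.info(line)

with DAG(
    dag_id="log_generator_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="generate_logs_400mb",
        python_callable=generate_logs,
        op_kwargs={"target_mb": 400},
    )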

@yatink

yatink commented Sep 12, 2023

Hmmm...it appears that I don't have the privs to assign this issue to myself. Can someone help me with that (either by giving me privs or assigning this issue to me directly) ?

@potiuk
Member

potiuk commented Sep 12, 2023

Assigned. Good Luck :)

@mujiannan

Even a small log file can take down the gunicorn workers in the webserver.
Run a task in Kubernetes with logging to S3 and you end up in an 'interleave' condition: viewing the logs of the running task from the web UI drives a gunicorn worker to 100% CPU and it crashes soon after.

@potiuk
Member

potiuk commented Jan 7, 2024

Even a small log file can take down the gunicorn workers in the webserver. Run a task in Kubernetes with logging to S3 and you end up in an 'interleave' condition: viewing the logs of the running task from the web UI drives a gunicorn worker to 100% CPU and it crashes soon after.

If you can provide some actionable details - for example the size of your log and the memory allocated to your processes - that would be much more actionable than "small log crashing server".

@potiuk
Member

potiuk commented Jan 7, 2024

The versions of the packages that cause it would also be useful.
