Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Viewing logs of running task in grid view hangs airflow-webserver [elasticsearch] #28326

Open
1 of 2 tasks
sun2everyone opened this issue Dec 13, 2022 · 10 comments
Open
1 of 2 tasks
Labels
affected_version:2.4 Issues Reported for 2.4 area:webserver Webserver related Issues good first issue kind:bug This is a clearly a bug

Comments

@sun2everyone
Copy link

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Airflow version: Airflow 2.4.2 with Elasticsearch log provider (ELK 7.17.0)
Python version: 3.8.13

Short description:
Airflow webserver stops responding due to infinity log loading loop.

Details:
If several users open grid view page and try to view logs of running task, they see no logs, but gunicorn workers became busy processing log loading requests. After some time all gunicorn workers became busy waiting for the end of log, and webserver stops responding to further client requests.

What you can see in webserver logs:
constantly appearing requests to elastic

{base.py:270} INFO - POST http://127.0.0.1:9202/_count [status:200 request:0.006s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_search [status:200 request:0.015s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_search [status:200 request:0.015s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_count [status:200 request:0.010s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_search [status:200 request:0.014s]
{base.py:270} INFO - POST http://127.0.0.1:9202/_count [status:200 request:0.011s]

What you think should happen instead

When trying to view logs of running task in grid view, you should receive one of the following:

  1. Message that running task logs couldn't be shown
  2. See partially loaded logs, that exist at the moment you load the page
  3. See continuously appearing logs, as it is done on the main task log page

How to reproduce

  1. Start airflow webserver with ELK logging and sync gunicorn workers
  2. Create dag with a task that is running for a long time and constantly writing it's logs
  3. Run that dag
  4. Open that dag in a grid view, try to view logs of that running task in a grid view
  5. Open more pages viewing logs of that running task in a grid view (5-6 or more)
  6. Webserver starts to work slowly, skips healthchecks and does not respond client requests

Operating System

Ubuntu 20.04.2 LTS

Versions of Apache Airflow Providers

apache-airflow-providers-elasticsearch 4.2.1

Deployment

Virtualenv installation

Deployment details

No response

Anything else

I guess the problem is in this parts:

Log reading stream keeps gunicorn worker busy all the time the task is in the Running state.

Using async workers like tornado makes a little better, but doesn't solve the problem. For now, I use this workaround:

--- log_endpoint_py.default    2022-12-12 19:16:30.280526903 +0300
+++ log_endpoint.py    2022-12-12 19:03:29.128970011 +0300
@@ -33,7 +33,7 @@
 from airflow.utils.airflow_flask_app import get_airflow_app
 from airflow.utils.log.log_reader import TaskLogReader
 from airflow.utils.session import NEW_SESSION, provide_session
-
+from airflow.utils.state import State 

@security.requires_access(
     [
@@ -108,6 +108,8 @@
         token = URLSafeSerializer(key).dumps(metadata)  # type: ignore[assignment]
         return logs_schema.dump(LogResponseObject(continuation_token=token, content=logs))
     # text/plain. Stream
-    logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
-
+    if ti.state not in State.running:
+        logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
+    else:
+        logs = {"Task is still running, logs in grid view not available."}
     return Response(logs, headers={"Content-Type": return_type}) 

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@sun2everyone sun2everyone added area:core kind:bug This is a clearly a bug labels Dec 13, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Dec 13, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@uranusjr uranusjr added area:webserver Webserver related Issues and removed area:core labels Dec 13, 2022
@potiuk potiuk changed the title Viewing logs of running task in grid view hangs airflow-webserver Viewing logs of running task in grid view hangs airflow-webserver [elasticsearch] Dec 22, 2022
@eladkal eladkal added the affected_version:2.4 Issues Reported for 2.4 label Jan 6, 2023
Copy link

github-actions bot commented Jan 7, 2024

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There has been several Airflow releases since last activity on this issue. Kindly asking to recheck the report against latest Airflow version and let us know if the issue is reproducible. The issue will be closed in next 30 days if no further activity occurs from the issue author.

Copy link

github-actions bot commented Feb 7, 2024

This issue has been closed because it has not received response from the issue author.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 7, 2024
@rightx2
Copy link

rightx2 commented Mar 26, 2024

I have had exact same problem. This issue should not be as stale.. @sun2everyone Did you solve this?

@potiuk
Copy link
Member

potiuk commented Mar 26, 2024

@rightx2 happy to reopen it, but it would help if you could provide more information that can help someone who might look at the issue (or maybe even look at it yourself and attempt to make PR to fix it).

@potiuk potiuk reopened this Mar 26, 2024
@rightx2
Copy link

rightx2 commented Mar 26, 2024

@potiuk I think what sun2everyone wrote is enough to describe the problem... Do you want think it needs some other info ?

@potiuk
Copy link
Member

potiuk commented Mar 27, 2024

@potiuk I think what sun2everyone wrote is enough to describe the problem... Do you want think it needs some other info ?

No idea. But by providing more info, specific to your case you increase chances that someone will have easier time of diagnosing it and maybe fixing. This is Open-source project. Maybe someone will take a look at the issue, maybe not. Maybe they will decide to spend their time on trying to reproduce it, diagnose and fix, or maybe not. Maybe if you gather some evidences you will find more informaation and show differences vs what the original report that will make it easier to diagnose. Or maybe by gathering evidences you will find it yourself, or maybe you will find that this is a different issue. Finally - maybe no-one will look at it. Seems like no-one was looking at it and diagnosing it since December 2022.

All these are maybes. All at scale of proability. And by providing more information you likely move the probability to be in favour of problem being diagnosed and fixed.

That's all I am saying.

@potiuk
Copy link
Member

potiuk commented Mar 27, 2024

Simply speaking - the more likely and easier you make it for somoene to diagnose and fix it, the more likely it will be diagnoe and fix it. So if you want it to get it fixed, spending time on making it easier (which mostly means gathering evidences, spotting the patterns, and doing more diagnosis on your side) is a good way to increase chances for it to happen.

@rightx2
Copy link

rightx2 commented Mar 27, 2024

Sounds reasonable. I'll do some experiments to reproduce them a lot easier and update it

@sun2everyone
Copy link
Author

sun2everyone commented Mar 27, 2024

Hello. I think ideal solution to this issue is to rewrite code for displaying logs on grid view page as it is done on logs page (continious batch loading). This is more simple solution I implemented (Airflow 2.4.2):

diff --git a/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py b/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
index bda0af3..364a48f 100644
--- a/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
+++ b/lib/python3.8/site-packages/airflow/api_connexion/endpoints/log_endpoint.py
@@ -33,7 +33,7 @@
 from airflow.utils.airflow_flask_app import get_airflow_app
 from airflow.utils.log.log_reader import TaskLogReader
 from airflow.utils.session import NEW_SESSION, provide_session
-
+from airflow.utils.state import State

 @security.requires_access(
     [
@@ -108,6 +108,8 @@
         token = URLSafeSerializer(key).dumps(metadata)  # type: ignore[assignment]
         return logs_schema.dump(LogResponseObject(continuation_token=token, content=logs))
     # text/plain. Stream
-    logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
-
+    if ti.state not in State.running:
+        logs = task_log_reader.read_log_stream(ti, task_try_number, metadata)
+    else:
+        logs = {"Task is still running, logs in grid view not available."}
     return Response(logs, headers={"Content-Type": return_type})

It just disables displaying logs in grid view if task state is 'Running'. Works in production for more than a year, without any problem. I guess it could still cause some trouble if you have really huge task log, but I didn't meet such cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affected_version:2.4 Issues Reported for 2.4 area:webserver Webserver related Issues good first issue kind:bug This is a clearly a bug
Projects
None yet
Development

No branches or pull requests

5 participants