
[DSIP-18][Remote Logging] Add support for writing task logs to remote storage #13017

Closed
3 tasks done
Tracked by #14102
rickchengx opened this issue Nov 28, 2022 · 7 comments
Labels: 3.2.0 (for 3.2.0 version), DSIP, feature (new feature)


rickchengx commented Nov 28, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Why remote logging?

  • Avoid task log loss after worker is torn down
  • Easier to obtain logs and troubleshoot after the task logs are uploaded to remote storage
  • Enhanced cloud-native support for DS

Feature Design

[Screenshot 2023-01-18 14:20:57: feature design diagram]

Connect to different remote targets

DS should support a variety of common remote storage backends and be easily extensible to other types of remote storage:

  • S3
  • OSS
  • ElasticSearch
  • Azure Blob Storage
  • Google Cloud Storage
  • ...
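As a sketch of that extensibility point, a minimal pluggable handler contract might look like the following. All names here (RemoteLogHandler, InMemoryLogHandler) are illustrative assumptions, not the actual DS API; each backend (S3, OSS, GCS, ...) would ship its own implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plugin contract for a remote log target.
interface RemoteLogHandler {
    void sendRemoteLog(String logKey, byte[] content); // upload after the task completes
    byte[] getRemoteLog(String logKey);                // download when the api-server reads
}

// In-memory stand-in, used here only to illustrate the contract;
// a real backend would call the storage SDK (e.g. S3 putObject/getObject).
class InMemoryLogHandler implements RemoteLogHandler {
    private final Map<String, byte[]> store = new HashMap<>();

    @Override
    public void sendRemoteLog(String logKey, byte[] content) {
        store.put(logKey, content);
    }

    @Override
    public byte[] getRemoteLog(String logKey) {
        return store.get(logKey);
    }
}
```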

When to write logs to remote storage

Like Airflow, DS writes the task logs to remote storage after the task completes (success or failure).
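That write path can be sketched as a fire-and-forget upload on task completion, so that shipping the log never blocks the worker thread. The class and method names below are assumptions for illustration; `uploadFn` stands in for the real storage client call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Hypothetical sketch: hand the closed log file to a background thread
// for upload when a task finishes (on success or failure).
class RemoteLogUploader {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final BiConsumer<String, String> uploadFn; // (localPath, remoteKey) -> upload

    RemoteLogUploader(BiConsumer<String, String> uploadFn) {
        this.uploadFn = uploadFn;
    }

    /** Called once per task, after its log file is flushed and closed. */
    void onTaskFinished(String localLogPath, String remoteKey) {
        pool.submit(() -> uploadFn.accept(localLogPath, remoteKey));
    }

    /** Drain pending uploads, e.g. on worker shutdown. */
    void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```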

How to read logs

Since the task log is stored both on the worker's local disk and in remote storage, the api-server needs a reading strategy when it fetches the log of a task instance.

Airflow first tries to read the remotely stored logs and falls back to the local logs on failure. I prefer the opposite: try the local log first, and read the remote log only if the local log file does not exist, because this reduces network bandwidth consumption.

We could discuss this further.
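The local-first fallback described above can be sketched as follows; the names are illustrative, not the actual api-server code, and `remoteFetch` stands in for the remote-storage read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

// Hypothetical sketch of the proposed read strategy: local hit avoids a
// network round-trip; local miss falls back to remote storage.
class TaskLogReader {
    private final Function<String, String> remoteFetch;

    TaskLogReader(Function<String, String> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    String readLog(Path localPath, String remoteKey) throws IOException {
        if (Files.exists(localPath)) {
            return Files.readString(localPath); // local hit: no bandwidth used
        }
        return remoteFetch.apply(remoteKey);    // local miss: read from remote
    }
}
```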

Log retention strategy

For example, a maximum capacity could be set for the remote storage, with old logs deleted on a rolling basis.
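That rolling cleanup could be sketched as a small bookkeeping structure; the class, the byte-based cap, and oldest-first eviction are all illustrative assumptions, not a settled design.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: once the total stored size exceeds a configured cap,
// evict the oldest logs first.
class RollingLogRetention {
    private final long maxBytes;
    private final Deque<long[]> logs = new ArrayDeque<>(); // {logId, sizeBytes}, oldest first
    private long totalBytes = 0;

    RollingLogRetention(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Register a newly uploaded log; returns ids of logs evicted to stay under the cap. */
    List<Long> add(long logId, long sizeBytes) {
        logs.addLast(new long[]{logId, sizeBytes});
        totalBytes += sizeBytes;
        List<Long> evicted = new ArrayList<>();
        while (totalBytes > maxBytes && logs.size() > 1) {
            long[] oldest = logs.removeFirst();
            totalBytes -= oldest[1];
            evicted.add(oldest[0]); // in practice, delete the remote object here
        }
        return evicted;
    }
}
```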

Sub-tasks

Ref

Any comments or suggestions are welcome.

Use case

Discussed above.

Related issues

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@rickchengx rickchengx added feature new feature Waiting for reply Waiting for reply labels Nov 28, 2022
@github-actions

Thank you for your feedback, we have received your issue, Please wait patiently for a reply.

  • In order for us to understand your request as soon as possible, please provide detailed information, version, or pictures.
  • If you haven't received a reply for a long time, you can join our slack and send your question to channel #troubleshooting

@EricGao888 EricGao888 added 3.2.0 for 3.2.0 version and removed Waiting for reply Waiting for reply labels Nov 28, 2022
@EricGao888 EricGao888 added this to the 3.2.0 milestone Nov 28, 2022
@EricGao888 EricGao888 added the discussion discussion label Nov 28, 2022
@Radeity
Member

Radeity commented Nov 28, 2022

Good idea! One humble suggestion: different log levels contribute differently to failure diagnosis; error and warn (high level) may contribute more than info, debug, and trace (low level). Intuitively, it would be better to keep more high-level logs and clean low-level logs first when remote storage reaches its maximum capacity. Maybe we can separate logs by level when flushing and support log aggregation when reading logs. WDYT?

@rickchengx
Contributor Author

> Good idea! One humble suggestion: different log levels contribute differently to failure diagnosis; error and warn (high level) may contribute more than info, debug, and trace (low level). Intuitively, it would be better to keep more high-level logs and clean low-level logs first when remote storage reaches its maximum capacity. Maybe we can separate logs by level when flushing and support log aggregation when reading logs, like YARN. WDYT?

Hi @Radeity, thanks a lot for your comment and suggestion.

  • This PR focuses on writing task logs to remote storage after the task completes (e.g., putting the task log file to OSS). It also reads the task log from remote storage if the log file does not exist on the worker's local file system.
  • So I think we should keep the content of the local and remote logs consistent (e.g., log-2 and log-3 in the image below) instead of aggregating logs on remote storage.
  • If logs are aggregated and cleaned by log level, users may be confused when they can only see part of a task's logs.
  • So I think it is reasonable to clean up the oldest logs when the remotely stored logs reach the upper limit, just like RollingFileAppender in logback.

[Screenshot 2022-11-28 16:33:04: diagram of log-2 and log-3 kept consistent between local and remote storage]

@Radeity
Member

Radeity commented Nov 28, 2022

Hi @rickchengx, I agree with you that we should keep the local and remote logs consistent. My central idea is to keep the more useful logs after cleanup, which could set DS apart from Airflow. However, logs are widely used in practice for various maintenance activities, such as testing, failure diagnosis, and program comprehension, so my suggestion above to keep only higher-level logs may be immature. Anyway, thanks for your reply; I hope my suggestion offers some inspiration!

@fuchanghai
Member

Good idea!

@davidzollo davidzollo removed the discussion discussion label Dec 7, 2022
@rickchengx
Contributor Author

Here are some discussions from the weekly meeting:

Q1: In a k8s environment, users can mount persistent volumes (e.g., backed by OSS) to synchronize task logs to remote storage.
R1: This is indeed a way to synchronize logs to remote storage, and it only requires mounting persistent volumes (PV). But it still has some shortcomings:

  • Efficiency: since the PV is backed by remote storage, log writes become slower, which can in turn affect the worker's task execution. In contrast, uploading task logs to remote storage asynchronously through the remote logging mechanism does not affect task execution.
  • Generality: PV is not suitable for some remote storage backends, such as Elasticsearch, and it does not apply to DS deployed in non-k8s environments.

Q2: Can users configure whether task logs use remote storage?
R2: Yes, users can enable remote log storage through configuration and specify the corresponding remote logging settings.
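As a sketch of what that toggle might look like, a properties fragment could be exposed along these lines. The key names below are illustrative assumptions only, not the actual DS configuration.

```properties
# Illustrative sketch only -- key names are assumptions, not the real DS config.
remote-logging.enable=true
remote-logging.target=OSS
remote-logging.base-dir=logs
```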

Q3: The master-server also has task logs, which need to be uploaded to remote storage in a unified manner.
R3: Yes, users can set the remote storage configuration for the master's task logs in the master's configuration.

Q4: Is it possible to set the task log retention policy through configuration supported by the remote storage itself?
R4: That is a good idea, and it could simplify the design of remote logging; I'll look into it.

Thanks again for all the suggestions at the weekly meeting; please correct me if I'm wrong.
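As one concrete example of the idea in Q4/R4, S3 buckets support lifecycle rules that expire objects by age, so retention could be delegated to the storage itself. The rule below uses the standard S3 lifecycle configuration format; the `dolphinscheduler/logs/` prefix and the 30-day window are assumptions for illustration.

```json
{
  "Rules": [
    {
      "ID": "ExpireTaskLogs",
      "Filter": { "Prefix": "dolphinscheduler/logs/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```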

@ruanwenjun
Member

This seems already finished.

Status: Done
7 participants