Set scheduler log sizes automatically based on available memory #5570

Open
gjoseph92 opened this issue Dec 7, 2021 · 1 comment
Labels
diagnostics documentation Improve or add to documentation stability Issue or feature related to cluster stability (e.g. deadlock)

Comments

@gjoseph92
Collaborator

There are frequent reports of scheduler memory growing over time:

They often involve memory graphs that look like:
image

It's very likely that there is a real bug in the scheduler causing memory to accumulate (#3898 (comment)), but often the steep slope on these graphs is caused by various logs on the scheduler accumulating, such as:

  • transition_log - distributed.scheduler.transition-log-length
  • log - distributed.scheduler.transition-log-length (should maybe be distributed.admin.log-length?)
  • events - distributed.scheduler.events-log-length
  • computations - distributed.diagnostics.computations.max-history
  • Node._deque_handler - distributed.admin.log-length

I propose two things:

  1. Log sizes should be set as a percentage of available memory, not as a number of entries; this is much easier for users to configure.
    Note that for some or most of these logs, that may be difficult to do accurately, since the size of the entries is unknown. A rough estimate is probably okay.
  2. A memory-cleanup callback that runs, say, once a second, and clears out excess log entries if the scheduler is under memory pressure.
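
To make the two proposals concrete, here is a minimal sketch in plain Python. It is not scheduler code: `estimate_maxlen`, `trim_logs`, and all of their parameters (`fraction`, `pressure`, `keep_fraction`, `floor`) are hypothetical names, and the logs are assumed to be `collections.deque` objects. The size estimate uses `sys.getsizeof`, which is shallow, so it is only the kind of rough estimate mentioned above.

```python
import sys
from collections import deque


def estimate_maxlen(sample_entries, memory_limit_bytes, fraction, floor=100):
    """Roughly how many entries fit in `fraction` of `memory_limit_bytes`.

    Hypothetical helper. sys.getsizeof is shallow, so nested objects are
    under-counted and the result is only a rough estimate.
    """
    if not sample_entries:
        return floor
    avg = sum(sys.getsizeof(e) for e in sample_entries) / len(sample_entries)
    budget = memory_limit_bytes * fraction
    return max(floor, int(budget // avg))


def trim_logs(logs, used_bytes, memory_limit_bytes,
              pressure=0.8, keep_fraction=0.5):
    """Drop the oldest entries of each log when memory use is high.

    Hypothetical cleanup step, meant to be called periodically (for
    example once a second). `logs` maps a name to a deque. Returns the
    number of entries dropped.
    """
    if used_bytes < pressure * memory_limit_bytes:
        return 0  # no memory pressure, leave the logs alone
    dropped = 0
    for log in logs.values():
        target = int(len(log) * keep_fraction)
        while len(log) > target:
            log.popleft()  # oldest entries go first
            dropped += 1
    return dropped
```

In a real scheduler, `used_bytes` would have to come from a process memory measurement and `trim_logs` would be registered as a periodic callback; both of those are left out here because how they would be wired in is an open design question.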
@fjetter fjetter added diagnostics documentation Improve or add to documentation stability Issue or feature related to cluster stability (e.g. deadlock) labels Dec 8, 2021
@fjetter
Member

fjetter commented Dec 14, 2021

xref #4762 for the various pieces of logging mentioned here
