high CPU use due to continual state dumping #1744

Closed
hjoliver opened this issue Mar 1, 2016 · 5 comments · Fixed by #1745

@hjoliver
Member

hjoliver commented Mar 1, 2016

(from @MartinDix)

We occasionally see cylc server processes with high CPU usage for long periods. The cause seems to be that they are continually writing files to cylc-run/SUITE/state, in this case roughly once every 1.5 s (username removed to protect the innocent):

% ls -l ..../state
1798144 Mar  1 14:32 cylc-suite.db
     29 Mar  1 14:32 state -> state.20160301T033226.531301Z
  69956 Mar  1 14:32 state.20160301T033212.313480Z
  69956 Mar  1 14:32 state.20160301T033213.906387Z
  69956 Mar  1 14:32 state.20160301T033215.450920Z
  69956 Mar  1 14:32 state.20160301T033217.007761Z
  69956 Mar  1 14:32 state.20160301T033218.547831Z
  69956 Mar  1 14:32 state.20160301T033220.227907Z
  69956 Mar  1 14:32 state.20160301T033221.708751Z
  69956 Mar  1 14:32 state.20160301T033223.677265Z
  69956 Mar  1 14:32 state.20160301T033225.011035Z
  69956 Mar  1 14:32 state.20160301T033226.531301Z

These files are all identical apart from the internal timestamp. In suites behaving normally it seems these files are only written when something changes (e.g. a task finishes). Here however nothing seems to be changing at this high frequency. The log files and cylc-suite.db aren't being modified at this frequency.

I've only seen this behaviour in complex NWP suites with lots of tasks (e.g. state file has ~1000 lines). The high CPU usage can also come and go while the suite runs.

@hjoliver
Member Author

hjoliver commented Mar 1, 2016

My response:

Hi Martin,
Thanks for the bug report, and the great detective work.
If your state files really are identical apart from the internal timestamp, then you may well have identified the problem because (as you note) these files should only be written when something changes. The flag that triggers a state dump is probably not being reset somewhere it should be, which shouldn't be too hard to find.
I've copied your post to a new issue on Github: #1744
Hilary
p.s. longer term, state dump files will cease to exist. They've long been almost redundant because suite run databases record the same information (and more), but we just haven't taken the final step yet.

@hjoliver hjoliver added the bug label Mar 1, 2016
@hjoliver hjoliver added this to the soon milestone Mar 1, 2016
@hjoliver hjoliver changed the title high CPU use due to state dump files being continually rewritten high CPU use due to continual state dumping Mar 1, 2016
@hjoliver
Member Author

hjoliver commented Mar 1, 2016

See #421

@matthewrmshin
Contributor

The logic in the main loop in cylc.scheduler looks like this:

            if cylc.flags.iflag or self.do_update_state_summary:
                cylc.flags.iflag = False
                self.do_update_state_summary = False
                self.update_state_summary()
                self.state_dumper.dump()

It does look like they are reset, but perhaps they get set to True again too easily?

@benfitzpatrick
Contributor

Well, I was just about to post this! Might as well!

My two cents:

The relevant code in scheduler.py looks like this:

            if cylc.flags.iflag or self.do_update_state_summary:
                cylc.flags.iflag = False
                self.do_update_state_summary = False
                self.update_state_summary()
                self.state_dumper.dump()

flags.iflag is set to True in task_proxy.py on an incoming message, on task poll failure, on job kill, and on any change of status (set_status).

self.do_update_state_summary is set to True by self.process_command_queue() in scheduler.py if a command succeeds.

self.do_update_state_summary is also set to True if self.process_tasks() returns True.

process_tasks returns True if particular commands succeed in self.process_command_queue, or if any task is ready_to_run, or in simulation mode when tasks 'succeed', or if cylc.flags.pflag is set to True.

ready_to_run is True if the task isn't expired and: it is queued (!! could be this), or it has finished its *retrying delay, or it is waiting with all prerequisites met.

cylc.flags.pflag is set on job submission failure, on job submission success, on job execution failure, on an incoming message that triggers another task's inputs, on an incoming task started message, and on incoming task succeeded and task vacated messages.
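To make that concrete, here is a simplified, self-contained sketch (not the real cylc code; the Task and Scheduler classes below are assumptions distilled from the description above) of how a queued task that is past its start time can re-arm do_update_state_summary on every pass of the main loop:

# Simplified sketch, NOT actual cylc source: shows how a queued task that is
# past its start time makes process_tasks() return True on every main-loop
# pass, so the summary/dump step runs even though nothing has changed.
import time

class Task(object):
    def __init__(self, name, queued, start_time):
        self.name = name
        self.queued = queued
        self.start_time = start_time
        self.expired = False

    def ready_to_run(self):
        # Queued and past its start time counts as "ready" here, even though
        # the queue limit prevents the task from actually being submitted.
        return (not self.expired and self.queued
                and time.time() >= self.start_time)

class Scheduler(object):
    def __init__(self, tasks):
        self.tasks = tasks
        self.do_update_state_summary = False
        self.dumps = 0

    def process_tasks(self):
        # Returns True if any task is "ready", re-arming the summary flag.
        return any(task.ready_to_run() for task in self.tasks)

    def run_one_pass(self):
        if self.process_tasks():
            self.do_update_state_summary = True
        if self.do_update_state_summary:
            self.do_update_state_summary = False
            self.dumps += 1  # stands in for update_state_summary() + dump()

sched = Scheduler([Task('person_b', queued=True, start_time=time.time())])
for _ in range(5):
    sched.run_one_pass()
print(sched.dumps)  # 5 -- one dump per pass, with no real state change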

@benfitzpatrick
Contributor

Yep, queued reproduces it - for example, this suite:

[scheduling]
    [[queues]]
        [[[people_queue]]]
            limit = 1
            members = person_a, person_b
    [[dependencies]]
        graph = """
            person_a & person_b
        """
[runtime]
    [[person_a,person_b]]
        script = sleep 360

will dump a state file every second while one of the tasks is queued. For a large suite, the high CPU usage comes from the update_state_summary call that happens just before each dump.
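For anyone wanting to confirm the dump rate while that suite runs, a rough check is to poll the state directory and report each new state.* file as it appears (the cylc-run/SUITE/state path below is an assumption based on the listing in the original report; adjust SUITE to the actual suite name):

# Rough frequency check; the state directory path is an assumption based on
# the cylc-run/SUITE/state layout shown above -- adjust SUITE as needed.
import glob
import os
import time

state_dir = os.path.expanduser('~/cylc-run/SUITE/state')

seen = set(glob.glob(os.path.join(state_dir, 'state.*')))
start = time.time()
while time.time() - start < 30:  # watch for 30 seconds
    current = set(glob.glob(os.path.join(state_dir, 'state.*')))
    for path in sorted(current - seen):
        print('%.1fs: new dump %s' % (time.time() - start,
                                      os.path.basename(path)))
    seen = current
    time.sleep(0.5)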

benfitzpatrick added a commit to benfitzpatrick/cylc that referenced this issue Mar 1, 2016
This fixes a problem where the state file is continually dumped
if the task pool contains a queued task.

This is due to the ready_to_run check in process_tasks in scheduler.py
returning True if any task is queued and has reached its start time.

The fix is to only update the state summary if we really think
something has changed. This is a good thing to be doing anyway.
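One possible shape for that "only update when something really changed" idea is sketched below (class and method names are illustrative stand-ins; this is not the actual change in #1745): build the summary, compare it with the previous one, and skip both the update and the dump when they are identical.

# Sketch of "only dump when something really changed"; names here are
# illustrative, not the actual cylc implementation.

class StateDumper(object):
    def dump(self):
        print('writing state file')

class Scheduler(object):
    def __init__(self):
        self.state_dumper = StateDumper()
        self._last_summary = None

    def generate_state_summary(self):
        # Placeholder: in cylc this would be a snapshot of every task's state.
        return (('person_a', 'running'), ('person_b', 'queued'))

    def update_state_summary(self, summary):
        print('summary updated: %s' % (summary,))

    def process_pass(self):
        summary = self.generate_state_summary()
        if summary == self._last_summary:
            # Nothing changed since the last main-loop pass: skip the
            # expensive summary update and the state file write.
            return
        self._last_summary = summary
        self.update_state_summary(summary)
        self.state_dumper.dump()

sched = Scheduler()
sched.process_pass()  # first pass differs from None, so it updates and dumps
sched.process_pass()  # identical summary: no update, no dump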
@matthewrmshin matthewrmshin modified the milestones: next release, soon Mar 1, 2016