high CPU use due to continual state dumping #1744

Closed
hjoliver opened this issue Mar 1, 2016 · 5 comments · Fixed by #1745

@hjoliver
Member

hjoliver commented Mar 1, 2016

(from @MartinDix)

We occasionally see cylc server processes with high CPU usage for long periods. The cause seems to be that they are continually writing files to cylc-run/SUITE/state, in this case roughly once every 1.5 s (username removed to protect the innocent):

% ls -l ..../state
1798144 Mar  1 14:32 cylc-suite.db
     29 Mar  1 14:32 state -> state.20160301T033226.531301Z
  69956 Mar  1 14:32 state.20160301T033212.313480Z
  69956 Mar  1 14:32 state.20160301T033213.906387Z
  69956 Mar  1 14:32 state.20160301T033215.450920Z
  69956 Mar  1 14:32 state.20160301T033217.007761Z
  69956 Mar  1 14:32 state.20160301T033218.547831Z
  69956 Mar  1 14:32 state.20160301T033220.227907Z
  69956 Mar  1 14:32 state.20160301T033221.708751Z
  69956 Mar  1 14:32 state.20160301T033223.677265Z
  69956 Mar  1 14:32 state.20160301T033225.011035Z
  69956 Mar  1 14:32 state.20160301T033226.531301Z

These files are all identical apart from the internal timestamp. In suites behaving normally it seems these files are only written when something changes (e.g. a task finishes). Here however nothing seems to be changing at this high frequency. The log files and cylc-suite.db aren't being modified at this frequency.

I've only seen this behaviour in complex NWP suites with lots of tasks (e.g. state file has ~1000 lines). The high CPU usage can also come and go while the suite runs.

@hjoliver
Member Author

hjoliver commented Mar 1, 2016

My response:

Hi Martin,
Thanks for the bug report, and the great detective work.
If your state files really are identical apart from the internal timestamp, then you may well have identified the problem because (as you note) these files should only be written when something changes. The flag that triggers a state dump is probably not being reset somewhere it should be, which shouldn't be too hard to find.
I've copied your post to a new issue on Github: #1744
Hilary
p.s. longer term, state dump files will cease to exist. They've long been almost redundant because suite run databases record the same information (and more), but we just haven't taken the final step yet.

@hjoliver hjoliver added the bug label Mar 1, 2016
@hjoliver hjoliver added this to the soon milestone Mar 1, 2016
@hjoliver hjoliver changed the title high CPU use due to state dump files being continually rewritten high CPU use due to continual state dumping Mar 1, 2016
@hjoliver
Member Author

hjoliver commented Mar 1, 2016

See #421

@matthewrmshin
Contributor

The logic in the main loop in cylc.scheduler looks like this:

            if cylc.flags.iflag or self.do_update_state_summary:
                cylc.flags.iflag = False
                self.do_update_state_summary = False
                self.update_state_summary()
                self.state_dumper.dump()

It does look like they are reset, but perhaps they get set to True again too easily?

@benfitzpatrick
Contributor

Well, I was just about to post this! Might as well!

My two cents:

The relevant code in scheduler.py looks like this:

            if cylc.flags.iflag or self.do_update_state_summary:
                cylc.flags.iflag = False
                self.do_update_state_summary = False
                self.update_state_summary()
                self.state_dumper.dump()

flags.iflag is set to True in task_proxy.py on an incoming message, on task poll failure, on job kill, and on any change of status (set_status).

self.do_update_state_summary is set to True by self.process_command_queue() in scheduler.py if a command succeeds.

self.do_update_state_summary is also set to True if self.process_tasks() returns True.

process_tasks returns True if particular commands succeed in self.process_command_queue, or if any task is ready_to_run, or in simulation mode when tasks 'succeed', or if cylc.flags.pflag is set to True.

ready_to_run is True if the task isn't expired and: it is queued (!! could be this), or it has finished its *retrying delay, or it is waiting with all prerequisites met.

cylc.flags.pflag is set on job submission failure, on job submission success, on job execution failure, on an incoming message that triggers another task's inputs, on an incoming task started message, and on incoming task succeeded and task vacated messages.
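To make that concrete, here is a simplified, self-contained sketch (not the real cylc code; the Task and Scheduler classes below are assumptions distilled from the description above) of how a queued task that is past its start time can re-arm do_update_state_summary on every pass of the main loop:

# Simplified sketch, NOT actual cylc source: shows how a queued task that is
# past its start time makes process_tasks() return True on every main-loop
# pass, so the summary/dump step runs even though nothing has changed.
import time

class Task(object):
    def __init__(self, name, queued, start_time):
        self.name = name
        self.queued = queued
        self.start_time = start_time
        self.expired = False

    def ready_to_run(self):
        # Queued and past its start time counts as "ready" here, even though
        # the queue limit prevents the task from actually being submitted.
        return (not self.expired and self.queued
                and time.time() >= self.start_time)

class Scheduler(object):
    def __init__(self, tasks):
        self.tasks = tasks
        self.do_update_state_summary = False
        self.dumps = 0

    def process_tasks(self):
        # Returns True if any task is "ready", re-arming the summary flag.
        return any(task.ready_to_run() for task in self.tasks)

    def run_one_pass(self):
        if self.process_tasks():
            self.do_update_state_summary = True
        if self.do_update_state_summary:
            self.do_update_state_summary = False
            self.dumps += 1  # stands in for update_state_summary() + dump()

sched = Scheduler([Task('person_b', queued=True, start_time=time.time())])
for _ in range(5):
    sched.run_one_pass()
print(sched.dumps)  # 5 -- one dump per pass, with no real state change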

@benfitzpatrick
Contributor

Yep, queued reproduces it - for example, this suite:

[scheduling]
    [[queues]]
        [[[people_queue]]]
            limit = 1
            members = person_a, person_b
    [[dependencies]]
        graph = """
            person_a & person_b
        """
[runtime]
    [[person_a,person_b]]
        script = sleep 360

will dump a state file every second while one of the tasks is queued. For a large suite, the high CPU usage comes from the update_state_summary call that happens just before each dump.
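For anyone wanting to confirm the dump rate while that suite runs, a rough check is to poll the state directory and report each new state.* file as it appears (the cylc-run/SUITE/state path below is an assumption based on the listing in the original report; adjust SUITE to the actual suite name):

# Rough frequency check; the state directory path is an assumption based on
# the cylc-run/SUITE/state layout shown above -- adjust SUITE as needed.
import glob
import os
import time

state_dir = os.path.expanduser('~/cylc-run/SUITE/state')

seen = set(glob.glob(os.path.join(state_dir, 'state.*')))
start = time.time()
while time.time() - start < 30:  # watch for 30 seconds
    current = set(glob.glob(os.path.join(state_dir, 'state.*')))
    for path in sorted(current - seen):
        print('%.1fs: new dump %s' % (time.time() - start,
                                      os.path.basename(path)))
    seen = current
    time.sleep(0.5)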

benfitzpatrick added a commit to benfitzpatrick/cylc that referenced this issue Mar 1, 2016
This fixes a problem where the state file is continually dumped
if the task pool contains a queued task.

This is due to the ready_to_run check in process_tasks in scheduler.py
returning True if any task is queued and has reached its start time.

The fix is to only update the state summary if we really think
something has changed. This is a good thing to be doing anyway.
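One possible shape for that "only update when something really changed" idea is sketched below (class and method names are illustrative stand-ins; this is not the actual change in #1745): build the summary, compare it with the previous one, and skip both the update and the dump when they are identical.

# Sketch of "only dump when something really changed"; names here are
# illustrative, not the actual cylc implementation.

class StateDumper(object):
    def dump(self):
        print('writing state file')

class Scheduler(object):
    def __init__(self):
        self.state_dumper = StateDumper()
        self._last_summary = None

    def generate_state_summary(self):
        # Placeholder: in cylc this would be a snapshot of every task's state.
        return (('person_a', 'running'), ('person_b', 'queued'))

    def update_state_summary(self, summary):
        print('summary updated: %s' % (summary,))

    def process_pass(self):
        summary = self.generate_state_summary()
        if summary == self._last_summary:
            # Nothing changed since the last main-loop pass: skip the
            # expensive summary update and the state file write.
            return
        self._last_summary = summary
        self.update_state_summary(summary)
        self.state_dumper.dump()

sched = Scheduler()
sched.process_pass()  # first pass differs from None, so it updates and dumps
sched.process_pass()  # identical summary: no update, no dump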
@matthewrmshin matthewrmshin modified the milestones: next release, soon Mar 1, 2016