
gcylc: large memory footprint in graph mode - possible leak? #1620

Closed
arjclark opened this issue Sep 25, 2015 · 22 comments · Fixed by #2966
Comments

@arjclark (Contributor)

When using the graph view in gcylc the memory footprint grows over time to worryingly large values. My suspicion is that we have some sort of memory leak either in our code or the underlying graphviz library.

Following a report from a user about large memory footprints for their gcylc sessions of over a gig each (1.4G and 1.9G) I've been investigating this and reproducing signs of a problem with a version of the um's rose-stem suite running in simulation mode with each task set to take 60-120s to run. As per the user's report I've been setting the view to graph mode, ungrouping all families and filtering out succeeded tasks.

What leads me to believe we have a memory leak is that, logically, as the number of tasks in the view decreases the overheads should shrink, whereas what I'm seeing is the memory footprint increasing over time until the suite completes, at which point (with a blank graph) it remains at the higher value. Swapping to one of the other views doesn't reduce the footprint, and swapping back to the graph view sets it increasing again. If I close and re-open the GUI the footprint starts small and then keeps increasing, adding to the suggestion of a memory leak (although not as badly as before in this case, presumably due to a smaller initial number of graph nodes and edges).

This doesn't appear to be a recent development (I had thought it might be down to some of the caching we've started doing), as I've been able to reproduce it from master back to earlier 6.x versions (as far back as 6.1.0 so far, before hitting a task naming convention error).

What I'm not sure of is what's going on. Is this a cylc issue or a graphviz one? It doesn't affect the other views. We generally wouldn't recommend running the graph view for suites as large as the one I'm investigating, but the problem presumably exists for other suites too.

I'll look to produce a simple example suite to investigate with next week.

@arjclark arjclark added this to the soon milestone Sep 25, 2015
@arjclark (Contributor, Author)

Example suite to play with:

#!jinja2
[scheduling]
    [[dependencies]]
        graph = """

            INSTALL:succeed-all => FOO_FAM
            MAKERS:succeed-all => FOO_FAM
            {% for i0 in range(40) -%}
            really_long_task_name_foo_{{ i0 }} => BAR_SUBFAM_{{ i0 }}:succeed-all => really_long_task_name_baz_{{ i0 }} => housekeep
            {% endfor %}
        """
[runtime]
    [[root]]
        [[[simulation mode]]]
            run time range = 60,120 # seconds
    [[INSTALL]]
    [[MAKERS]]
    [[FOO_FAM]]
    [[BAR_FAM]]
    [[BAZ_FAM]]
    {% for i0 in range(40) -%}
    [[BAR_SUBFAM_{{i0}}]]
    [[really_long_task_name_install_{{i0}}]]
        inherit = INSTALL
    [[really_long_task_name_maker_{{i0}}]]
        inherit = MAKERS
    [[really_long_task_name_foo_{{i0}}]]
        inherit = FOO_FAM
    {% for i1 in range(5) -%}
    [[really_long_task_name_bar_{{i0}}_{{i1}}]]
        inherit = BAR_FAM, BAR_SUBFAM_{{i0}}
    {% endfor %}
    [[really_long_task_name_baz_{{i0}}]]
        inherit = BAZ_FAM
    {% endfor %}
    [[housekeep]]

Increase the number of tasks to make the problem more prominent.

@hjoliver (Member) commented Oct 5, 2015

With your example suite above, I get a very similar gross memory profile with a constant task run length, regardless of its value, e.g. for script = sleep 10:

[memory profile plot: mprof-sleep10]

and script = sleep 120 is very similar:

[memory profile plot: mprof-sleep120]

But for varied run lengths, e.g. your simulation mode settings above, or live mode with sleep $(( 60 + RANDOM % 60 )), it is considerably worse:

[memory profile plot: mprof-sleep60-120]

In the latter case presumably the suite graph updates more often because fewer things happen near-simultaneously.

All this would seem to suggest that graph view memory goes up with each graph update, even though (certainly in this non-cycling case) the graph is not actually growing any bigger.

@hjoliver (Member) commented Oct 5, 2015

On a simpler suite, graph = foo => FAM:succeed-all => bar (with FAM having 100 members) the form of the memory curve remains the same but its steepness increases with increasing task name length (tested 3 chars and 100) ... which might support the theory that we are leaking graph nodes - which contain task names - in the graph view.

@hjoliver (Member) commented Oct 5, 2015

Same result at cylc-5.4.7.

@hjoliver (Member) commented Oct 5, 2015

I've been doing some internal function profiling too - https://pypi.python.org/pypi/memory_profiler - but have not found the problem as yet. It could be in the xdot code.
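(A minimal sketch of this kind of leak hunt, using only the standard library's tracemalloc module as a complement to memory_profiler. The leaky_redraw function below is a hypothetical stand-in for the real graph-redraw code, not taken from cylc or xdot; the point is the snapshot-diff technique for finding which source line accumulates memory across repeated updates.)

```python
# Hunting a per-redraw memory leak with stdlib tracemalloc.
# leaky_redraw is a hypothetical stand-in for the real draw code.
import tracemalloc

_retained = []  # simulates state that the real code accidentally keeps alive


def leaky_redraw(n_nodes=100):
    # Hypothetical redraw: builds a label string per node and (wrongly)
    # retains every one, so memory grows on each call.
    for i in range(n_nodes):
        _retained.append("node-%d-" % i + "x" * 90)


def top_growth(n_updates=50):
    """Return the allocation site whose memory grew most across updates."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(n_updates):
        leaky_redraw()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Statistics are sorted with the biggest size difference first, so the
    # first entry points at the line responsible for the growth.
    return after.compare_to(before, "lineno")[0]
```

In a real session you would run the GUI for a while between the two snapshots; a line whose size_diff keeps climbing on every update, even though the graph is unchanged, is a leak candidate.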

@arjclark (Contributor, Author) commented Oct 5, 2015

> All this would seem to suggest that graph view memory goes up with each graph update, even though (certainly in this non-cycling case) the graph is not actually growing any bigger.

Agreed. I think it's definitely related to the updates themselves.

> the form of the memory curve remains the same but its steepness increases with increasing task name length

Interesting, that probably explains why the user's suite was so badly affected - lots of very long task names.

> It could be in the xdot code

I've been coming to that conclusion myself, but struggling to narrow down where in it the problem occurs. Commenting out the line

                   self.update_xdot(no_zoom=needed_no_redraw)

in def run() in update_graph.py seems to remove the problem (as the graph view is then not redrawn).

@hjoliver (Member) commented Oct 5, 2015

Filename: /home/oliverh/cylc/cylc.git/lib/xdot.py

Line #    Mem usage    Increment   Line Contents
================================================
   468     70.4 MiB      0.0 MiB       @profile
   469                                 def draw(self, cr, highlight_items=None):
   470     70.4 MiB      0.0 MiB           if highlight_items is None:
   471     70.4 MiB      0.0 MiB               highlight_items = ()
   472     70.4 MiB      0.0 MiB           cr.set_source_rgba(0.0, 0.0, 0.0, 1.0)
   473                             
   474     70.4 MiB      0.0 MiB           cr.set_line_cap(cairo.LINE_CAP_BUTT)
   475     70.4 MiB      0.0 MiB           cr.set_line_join(cairo.LINE_JOIN_MITER)
   476                             
   477     70.4 MiB      0.0 MiB           for shape in self.shapes:
   478     70.4 MiB      0.0 MiB               shape.draw(cr)
   479     70.4 MiB      0.0 MiB           for edge in self.edges:
   480     70.4 MiB      0.0 MiB               edge.draw(cr, highlight=(edge in highlight_items))
   481     70.9 MiB  *** 0.6 MiB ***       for node in self.nodes:
   482     70.9 MiB      0.0 MiB               node.draw(cr, highlight=(node in highlight_items))

For my 100-member suite with 100-char task names (above), line 481 here increments 0.6MiB for every graph update.

From further investigation, it seems that text rendering in xdot is the problem: returning immediately from lib/xdot.py:TextShape.draw() gets rid of the leak (but the resulting graph nodes are not text-labelled of course).
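(To illustrate the class of bug being described, here is a pure-Python analogue, not the actual xdot/pygtk code, where the real leak sat in the Pango C bindings. The class names and the dict-based "layout" are invented for the sketch. The essence: each redraw creates a text-layout object and something keeps every one alive, so memory grows with every update even though the graph is unchanged; keying the cache by the text itself fixes it.)

```python
import itertools


class LeakyTextShape:
    """Hypothetical analogue of a text shape that leaks one layout per draw."""
    _cache = {}                    # unbounded: gains one entry per draw() call
    _counter = itertools.count()

    def __init__(self, text):
        self.text = text

    def draw(self):
        # Stand-in for creating a Pango layout on every redraw.
        layout = {"text": self.text, "glyphs": list(self.text)}
        self._cache[next(self._counter)] = layout   # never evicted: the leak
        return layout


class FixedTextShape(LeakyTextShape):
    _cache = {}

    def draw(self):
        # Fix: key the cache by the text, so redrawing an unchanged graph
        # reuses the existing layout instead of accumulating new ones.
        if self.text not in self._cache:
            self._cache[self.text] = {"text": self.text,
                                      "glyphs": list(self.text)}
        return self._cache[self.text]


def simulate(n_updates=1000):
    """Redraw both shapes n_updates times; return the two cache sizes."""
    leaky = LeakyTextShape("really_long_task_name_foo_0")
    fixed = FixedTextShape("really_long_task_name_foo_0")
    for _ in range(n_updates):
        leaky.draw()
        fixed.draw()
    return len(LeakyTextShape._cache), len(FixedTextShape._cache)
```

This also matches the observation above that the leak rate scales with task name length: each retained layout carries a copy of the text.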

@hjoliver (Member) commented Oct 7, 2015

I suspect the problem is due to this extant pygtk bug: https://bugzilla.gnome.org/show_bug.cgi?id=599730

@matthewrmshin (Contributor)

> I suspect the problem is due to this extant pygtk bug: https://bugzilla.gnome.org/show_bug.cgi?id=599730

I have also come to the same conclusion having done various Google searches.

@hjoliver (Member) commented Oct 7, 2015

Not sure what to do about this. Presumably we could provide a pygtk patch ourselves, but having to build pygtk would be a significant complication for cylc installation!

@matthewrmshin (Contributor)

Given that it is not really a cylc bug: if we can confirm that the bug is fixed in the latest release of GTK and has been rolled out in the latest releases of major Linux distros such as RHEL, Ubuntu, (SuSE?), etc., then we should close this issue.

@hjoliver (Member) commented Oct 8, 2015

Unfortunately that's not going to work. I took a git clone of pygtk; the last release, 2.24.0, was more than four years ago. It looks like the bug was fixed two years ago, but I'm guessing it will never be released because pygtk is being superseded by pygobject:

$ git log --date=short --pretty="%cd %s" | head -n 8
2013-04-05 Add John Stowers as a maintainer to pygtk.doap
2013-04-05 Fix leaks of Pango objects  <<<<-----------------------------------
2011-10-02 reference: pygtk-gtkdialog.xml: fix link to gtk.VBox
2011-08-06 typo in configure variable
2011-08-06 use objective-c for quartz
2011-04-02 Post release version bump
2011-04-01 Update for 2.24.0 release   <<<<-----------------------------------
2011-03-30 Update README

@matthewrmshin (Contributor)

That's unfortunate. Perhaps there is a need to move to pygobject (or other more modern GUI technology) sooner than we think.

@hjoliver (Member) commented Oct 8, 2015

See #112

@arjclark (Contributor, Author) commented Jun 9, 2016

@matthewrmshin / @hjoliver - as this is a non-cylc issue and will be resolved/irrelevant when we later move to a different GUI technology, can we close this down?

@hjoliver hjoliver modified the milestones: some-day, soon Jun 9, 2016
@hjoliver (Member) commented Jun 9, 2016

@arjclark - I think we should keep this open while we're still using the current GUI technology, as a reference for anyone who notices the problem. Also, I remain skeptical that the current GUI can be superseded very quickly, unless perhaps we go to pygobject before developing a web-based GUI.

@benfitzpatrick (Contributor)

Can we test this on an up-to-date distro?

@benfitzpatrick benfitzpatrick modified the milestones: soon, some-day Jun 15, 2016
@hjoliver hjoliver modified the milestones: never, soon Jun 19, 2016
@hjoliver (Member)

@hjoliver to mention this issue on the new FAQ

We can't do anything about it until #112 is feasible.

@matthewrmshin (Contributor)

Solving #1873 will hopefully kill this issue.

@kinow (Member) commented Mar 10, 2019

We are now working on the new Web GUI, which will replace the previous desktop PyGTK GUI in our next major release, Cylc 8.

This issue is being closed as it does not apply to the new Web GUI.

@kinow kinow closed this as completed Mar 10, 2019
Python3 automation moved this from To do to Done Mar 10, 2019
@matthewrmshin matthewrmshin removed this from the never milestone Mar 10, 2019