
Memory leak in daemon runners? #4603

Closed · giovannipizzi opened this issue Dec 1, 2020 · 52 comments

@giovannipizzi
Member

I have been running ~2000 (QE Relax) WorkChains with AiiDA 1.5.0.
Now everything is finished, verdi process list is empty, and I also have zero messages in RabbitMQ (as a double check, see below).

However, my 8 workers are still using a lot of memory:

$ verdi daemon status
Profile: dispero2020
Daemon is running as PID 5782 since 2020-11-29 17:54:56
Active workers [8]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
 5786    2.965      0    2020-11-29 17:54:56
 5787    2.855      0    2020-11-29 17:54:56
 5788    2.907      0.2  2020-11-29 17:54:56
 5789    2.766      0    2020-11-29 17:54:56
 5790    2.908      0.1  2020-11-29 17:54:57
 5791    2.655      0    2020-11-29 17:54:57
 5792    2.749      0    2020-11-29 17:54:57
 5793    2.866      0    2020-11-29 17:54:57

Note that this means ~1.8GB RAM/worker, so a total of ~15GB used!

I initially reported this already in #4598 but I thought it was due to the overload described there.
Instead, this time everything went smoothly with no excepted jobs.

Therefore, I am assuming that this is a memory leak, with some resources not properly released.
Considering the size of the data, this is similar to the size of the corresponding file repository. ArrayData still does some 'caching' of the arrays in memory, so maybe an ArrayData node keeps all of its arrays in memory? Could this be the cause?
We might want to remove that caching, but in any case I think the daemon should explicitly release nodes from memory once they are no longer used. Maybe they remain because they stay in some DB session? (This is Django.)

It would be good if someone could investigate this (maybe the next task for @chrisjsewell, but also feedback from @muhrin @sphuber @unkcpz is appreciated)

For completeness: as expected, if I stop the daemon and then restart it with verdi daemon start 8, I get low memory usage:

$ verdi daemon status
Profile: dispero2020
Daemon is running as PID 56574 since 2020-12-01 13:00:33
Active workers [8]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
56578    0.145        0  2020-12-01 13:00:33
56579    0.144        0  2020-12-01 13:00:33
56580    0.144        0  2020-12-01 13:00:33
56581    0.144        0  2020-12-01 13:00:33
56582    0.145        0  2020-12-01 13:00:33
56583    0.144        0  2020-12-01 13:00:33
56584    0.144        0  2020-12-01 13:00:33
56585    0.145        0  2020-12-01 13:00:33

This is to show that there are no more messages (also before restarting the daemon):
[screenshot of the RabbitMQ management interface showing zero queued messages]

@giovannipizzi
Member Author

Question for the engine-savvy people (@sphuber @muhrin @unkcpz): I guess that when a workflow starts, submits a subprocess, and then waits, the whole state is kept in memory.
Would it be possible to "unload" it from memory and keep only a minimal "ProcessMonitor" class with a minimal memory footprint, which can re-instantiate the full node only when a new message comes back from RMQ?
In any case, reloading should already be easily possible, since this is what happens when the daemon is restarted?

This way, one could always keep the memory usage very low, and it would therefore become very easy to increase the number of slots per worker substantially without risking overloading the machine.
What do you think? How feasible is this?
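
To make the idea a bit more concrete, here is a purely hypothetical sketch of such a ProcessMonitor (the helper methods it calls do not exist in aiida-core; this only illustrates the proposal, not an actual design):

# Hypothetical sketch only: a lightweight monitor that a runner could keep in memory
# instead of a full Process instance. All helper names below are made up for illustration.

class ProcessMonitor:
    """Minimal-footprint placeholder kept in memory for a waiting process."""

    def __init__(self, runner, pk):
        self.runner = runner
        self.pk = pk  # only the process identifier is held in memory

    def on_message(self, message):
        # Re-instantiate the full process from its stored checkpoint only when a
        # message actually arrives, handle it, then drop the heavy instance again.
        process = self.runner.load_process_from_checkpoint(self.pk)  # hypothetical helper
        try:
            return process.handle_message(message)  # hypothetical forwarding call
        finally:
            del process  # allow the full instance to be garbage collected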

@sphuber
Contributor

sphuber commented Dec 2, 2020

It would definitely be possible, but it would require a significant redesign and rewrite. The current design is that the Process instance itself is responsible for listening to RPC and broadcast messages sent over RabbitMQ and responding accordingly when it receives one. The Runner is not doing much here: when a new process is submitted, it serializes its state and sends the task to RabbitMQ. The Runner that runs in a daemon process has, on top of that, the ProcessLauncher listener that actually picks up these process tasks from RabbitMQ and starts running them.

To implement what you suggest, we would have to change this Process-centric concept to a Runner-centric one, where it is the runner that subscribes to RPC and broadcast messages for all the processes it is managing. It could theoretically load the Process instance from the corresponding serialized version stored on the node in the database, forward the call, and then destroy the instance when that is done. The problem is that we are not merely relying on messages from RabbitMQ: the Process itself also schedules callbacks on the event loop of the runner. If the Process instance is destroyed, those callbacks clearly cannot be executed.

@giovannipizzi
Member Author

I see. Still, this leaves us with a lot of slots taken when running many workflows, as we know, and potentially large memory usage (even once this leak is fixed - this issue is indeed about memory left over after the processes should no longer be in memory). Could there be a LightProcess superclass that implements only the methods to listen/schedule callbacks/..., from which Process inherits, with LightProcess having a method to return the corresponding "full" Process when needed? Does that make sense?

@giovannipizzi
Member Author

I confirm that the behaviour of #4598 (AiiDA starts having DuplicateSubscribers) happens when too much memory is used: the system starts putting programs in the disk swap and everything becomes very slow (first the whole machine, and then indirectly AiiDA itself starts choking).
For me, the threshold is somewhere between 400 submitted QERelaxWorkChains (OK, with ~1200 processes) and 800 (too much) - I think we should try to reach higher numbers of managed processes!

This is not even only an issue of a memory leak (or maybe it's partially related to it), but also one of plain memory consumption, with each worker quickly needing >4GB of RAM, easily filling my (already powerful) workstation.

I think that it would be good at least to understand why so much memory is needed by the daemons, and fix it.
Suggestions on where to look? Maybe there is some logging that could be added to check what is using so much memory?
If useful, I found this post, which among other things points to the tracemalloc Python module.
If someone can help with putting the logging in the right places, I can run it in my setup to collect the logs, so we get some understanding of the memory usage.
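
For reference, a minimal, generic sketch (not AiiDA-specific) of the kind of logging that could be added: start tracemalloc in the worker and periodically log the top allocation sites.

import logging
import tracemalloc

logger = logging.getLogger('memory-monitor')

def log_top_allocations(limit=10):
    """Log the top allocation sites seen by tracemalloc (start tracing on the first call)."""
    if not tracemalloc.is_tracing():
        tracemalloc.start()
        return
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:limit]:
        logger.warning('%s', stat)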

@giovannipizzi
Member Author

Since I had to run a bit more, I injected some tracemalloc code at the top of my PwBaseWorkChains.
I dumped (some GB!) of data to disk, and then I checked it with something like

import tracemalloc
before = tracemalloc.Snapshot.load("tracemalloc-dispero2020-20724-20201211-174035.233858.dump")
after = tracemalloc.Snapshot.load("tracemalloc-dispero2020-20724-20201211-174757.761099.dump")

tot_stats = after.compare_to(before, 'traceback')
stat = tot_stats[0]
print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
for line in stat.traceback.format():
    print(line)



tot_stats = after.compare_to(before, 'lineno')
for stat in tot_stats[:30]:
    print(stat)

(this is at the beginning of the run, and after submitting some hundreds of WorkChains). Then I had to stop because both my hard disk got full and I ran out of memory: the computer started swapping and everything got slow...

Anyway the output is this:

/usr/lib/python3.7/functools.py:60: size=195 MiB (+195 MiB), count=1932601 (+1932466), average=106 B
/usr/lib/python3.7/threading.py:238: size=144 MiB (+143 MiB), count=471893 (+471672), average=319 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py:282: size=141 MiB (+141 MiB), count=1168787 (+1168460), average=126 B
/usr/lib/python3.7/threading.py:235: size=85.0 MiB (+84.9 MiB), count=474099 (+473825), average=188 B
/usr/lib/python3.7/types.py:265: size=81.4 MiB (+81.3 MiB), count=808426 (+808154), average=106 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/ioloop.py:939: size=79.5 MiB (+79.5 MiB), count=1110803 (+1110409), average=75 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py:61: size=76.8 MiB (+76.8 MiB), count=927804 (+927552), average=87 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py:292: size=72.4 MiB (+72.3 MiB), count=249462 (+249310), average=304 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/stack_context.py:273: size=66.8 MiB (+66.8 MiB), count=603948 (+603710), average=116 B
/usr/lib/python3.7/contextlib.py:82: size=53.1 MiB (+53.0 MiB), count=171140 (+170999), average=325 B
/usr/lib/python3.7/concurrent/futures/_base.py:314: size=50.2 MiB (+50.2 MiB), count=705055 (+704785), average=75 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py:169: size=45.8 MiB (+45.8 MiB), count=600387 (+600158), average=80 B
/usr/lib/python3.7/types.py:275: size=45.3 MiB (+45.3 MiB), count=282860 (+282778), average=168 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py:33: size=42.9 MiB (+42.9 MiB), count=387526 (+387375), average=116 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py:63: size=25.3 MiB (+25.3 MiB), count=474063 (+473795), average=56 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py:62: size=25.3 MiB (+25.3 MiB), count=474063 (+473795), average=56 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/ioloop.py:595: size=22.7 MiB (+22.7 MiB), count=425326 (+424990), average=56 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/futures.py:95: size=21.5 MiB (+21.5 MiB), count=156367 (+156306), average=144 B
/usr/lib/python3.7/functools.py:79: size=21.2 MiB (+21.2 MiB), count=278272 (+278272), average=80 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py:985: size=21.1 MiB (+21.1 MiB), count=131977 (+131888), average=168 B
/usr/lib/python3.7/weakref.py:409: size=21.1 MiB (+21.1 MiB), count=132124 (+132011), average=167 B
/usr/lib/python3.7/threading.py:231: size=18.1 MiB (+18.1 MiB), count=237047 (+236908), average=80 B
/usr/lib/python3.7/threading.py:86: size=18.1 MiB (+18.1 MiB), count=474058 (+473790), average=40 B
/usr/lib/python3.7/threading.py:227: size=18.1 MiB (+18.1 MiB), count=236909 (+236775), average=80 B
/usr/lib/python3.7/threading.py:222: size=18.0 MiB (+18.0 MiB), count=236372 (+236274), average=80 B
/usr/lib/python3.7/functools.py:54: size=17.7 MiB (+17.7 MiB), count=231388 (+231388), average=80 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/stack_context.py:269: size=16.8 MiB (+16.8 MiB), count=573367 (+573137), average=31 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py:246: size=16.7 MiB (+16.7 MiB), count=313039 (+312915), average=56 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py:1276: size=16.2 MiB (+16.2 MiB), count=236385 (+236367), average=72 B
/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/futures.py:93: size=14.5 MiB (+14.5 MiB), count=237037 (+236903), average=64 B

The first 30 entries already account for almost all memory of this worker (>1.5GB).

If I do

    tot_stats = after.compare_to(before, 'traceback')
    stat = tot_stats[0]
    print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
    for line in stat.traceback.format():
        print(line)

for the first one, I get:

1694391 memory blocks: 180566.0 KiB
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1003
    self.run()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1069
    yielded = self.gen.send(value)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 237
    msg[messages.BroadcastMessage.CORRELATION_ID])
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 292
    result = func(*args, **kwargs)
  File "/usr/lib/python3.7/types.py", line 277
    coro = func(*args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 62
    task_future = futures.create_task(msg_fn, loop)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/futures.py", line 95
    @gen.coroutine
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 246
    return _make_coroutine_wrapper(func, replace_callback=True)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 282
    @functools.wraps(wrapped)
  File "/usr/lib/python3.7/functools.py", line 60
    getattr(wrapper, attr).update(getattr(wrapped, attr, {}))

Can anyone with a bit of knowledge of the engine (@sphuber @unkcpz @muhrin) figure out if there is something obvious where we can save some memory? Most of the places seem to be code coming from the engine, and not from the AiiDA nodes, if I'm not mistaken.

@chrisjsewell
Member

I injected some tracemalloc code at the top of my PwBaseWorkChains

Out of interest, could you clarify what you did here? What exactly do you mean by "top", where exactly in the code does it go?

From an uneducated glimpse, it looks like tornado introduces a significant memory overhead when wrapping a callback into a future.
Given that tornado is no longer used as of v1.6, perhaps it would be good to compare this to running on develop, to see whether this is still the case with asyncio?

@sphuber
Contributor

sphuber commented Dec 11, 2020

I second what @chrisjsewell said: tornado is gone in 1.6.0, which may change things quite a bit. On top of that, it could also be worth taking a look at kiwipy. We can see in the last stack trace that the Communicator is involved. This is a single permanent instance per runner that manages the connections with RabbitMQ. When a new process is launched, broadcast and RPC subscribers are created and registered with this communicator. It might very well be that after the process is done the subscribers are not removed, or, even if they are, that the Communicator keeps a reference in memory, slowly accumulating dead space as processes are loaded and completed.

@giovannipizzi
Member Author

Hi both. I'm aware that tornado is gone now; I fear that (even if the code has changed) some of these "leaks" might still be there and might also be an indirect cause of the issues in #4595. I am running in production, and since develop still has some issues for me (#4595) I had to keep working with 1.5, which is why I started this investigation - I thought the cause was in other parts of AiiDA that didn't change; my last report above confirms that it's indeed better to investigate on develop.

I have some runs to finish in the next few weeks that will fill my RAM, and I also have to free up some disk space before I can do more tests, so I won't be able to test this very soon.
But I describe below what I did, so someone else can maybe try? It should be easy: just run ~400+ workflows (but I guess you see the issue even with fewer - it would actually be good to check with fewer, to see what memory is left once all workflows are completed; I had to compare the memory only while workflows were still running, since afterwards my machine choked). I'm running the PwRelaxWorkChain; one could run it with low convergence cutoffs, it does not matter if the calculations do not converge. Or even use some other simple "test" workchain, it shouldn't matter.
I was running with 8 daemon workers, and at some point I had increased the number of slots to 500.

NOTE: this will create a lot of files, possibly tens of GBs, and the memory usage will increase by 2x or more because of tracemalloc - so it is probably better to start conservatively, with e.g. only 50 work chains, and make sure there is enough disk space.

What I did was to insert this code at the end of the setup step of the PwBaseWorkChain:

# Injecting memory monitoring code
import os
import datetime
import tracemalloc

from aiida.manage.configuration.settings import AIIDA_CONFIG_FOLDER
from aiida import get_profile

if not tracemalloc.is_tracing():
    tracemalloc.start(10)
snapshot = tracemalloc.take_snapshot()
snapshot_fname = f"{AIIDA_CONFIG_FOLDER}/daemon/log/tracemalloc-{get_profile().name}-{os.getpid()}-{datetime.datetime.now().strftime('%Y%m%d-%H%M%S.%f')}.dump"
snapshot.dump(snapshot_fname)                

(I'm not sure if the parameter 10 is needed in tracemalloc.start - maybe the default of 1 is OK).

As you can see, it creates a dump file each time a new workflow runs, in the daemon/log folder, with an appropriate - hopefully unique - name. The name also includes the PID, so one can distinguish the different daemon workers.
At the end, when everything is done, to see the final memory usage, one can just run another workchain to get one more dump and see the leftover memory.

To analyze the dumps, I use something like this.

  1. I go into the folder and do an ls, possibly filtering only the files of one specific worker.
  2. I pick the before and after files I want to compare (e.g. one more or less at the beginning, and one more or less at the end - I think the file size is roughly proportional to the allocated memory).
  3. I run something like this in a python shell:
    import tracemalloc
    
    before = tracemalloc.Snapshot.load('<BEFORE_FNAME>.dump')
    after = tracemalloc.Snapshot.load('<AFTER_FNAME>.dump')
    
    top_stats = after.compare_to(before, 'lineno')
    print("[ Top 30 differences ]")
    for stat in top_stats[:30]:
        print(stat)

This gives the top 30 differences.

If one wants the traceback of e.g. the very first entry, one could do instead:

    tot_stats = after.compare_to(before, 'traceback')
    stat = tot_stats[0]
    print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
    for line in stat.traceback.format():
        print(line)

@ltalirz
Member

ltalirz commented Dec 23, 2020

We confirm the issue of excessive memory use by the engine in AiiDA v1.5.2 (not yet tested on develop), running on Python 3.6.9:

 verdi daemon status
Profile: reference_set_mofs
Daemon is running as PID 13422 since 2020-12-20 17:30:54
Active workers [3]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
32525   31.261     18.3  2020-12-21 15:56:54
32526   20.591      7.7  2020-12-21 15:56:54
32527   21.895      9.4  2020-12-21 15:56:54

That is three daemon workers together using ~74% of 32GB, i.e. 24GB of RAM!
A total crash of the machine that we observed earlier was therefore, in retrospect, likely caused by excessive swapping, as also mentioned by @giovannipizzi.

I attach the output of pip list
pip.txt

In this state AiiDA is essentially unusable for our high-throughput work.
It would be great to know:

  1. If there are any concrete ideas for what may be causing this issue
  2. What we can do to help pinpoint the issue
  3. If there is an older AiiDA version that does not suffer from this issue so that people can continue doing production work

@giovannipizzi
Member Author

Some short answers:

  1. no more than my report above (it seems to be in the engine, anyway - loading 1000 nodes only takes a few MB)
  2. anything is appreciated. I doubt it's something we can solve in a few days, but what would help is A) more detailed reports of the memory usage to understand where the problem is, and B) testing earlier versions of AiiDA with a similar throughput to see if the issue was already there in AiiDA 1.0.x, 1.1.x, 1.2.x, 1.3.x, 1.4.x (maybe with a "bisection algorithm"). I would start with the oldest version where the workflows still work and try to apply similar "pressure" with similar workflows - it might well be that it was always like this; but if this is not the case then we're lucky, and this can help find the commit where the problem was introduced
  3. See my answer above (short answer - I don't know)

@atztogo
Contributor

atztogo commented Dec 24, 2020

Please ignore my comment below if I misunderstand this issue.

Probably this issue hit me. I ran a supercell phonon (phono3py) calculation with 780 supercells using aiida-vasp and aiida-phonopy. Submission of the supercell VASP calculations is done at this point: https://github.com/aiida-phonopy/aiida-phonopy/blob/13efc549647f595b4f76ab455691cd95dc06d05d/aiida_phonopy/workflows/phono3py.py#L165-L198. With the latest develop, a few aiida-vasp calculations could not be retrieved (probably due to lack of memory for PostgreSQL etc.). I reran the same phono3py calculation three times, but all attempts failed similarly. With v1.5.2 (from the git tag), memory usage is still high, but all the calculations went well - though maybe I was just lucky.

@unkcpz
Member

unkcpz commented Dec 25, 2020

I think I might have narrowed down where this issue is happening.

How to easily reproduce the issue

By running a script modified from .ci/test_daemon.py that submits more (16 in my case) NestedWorkChain to the running daemon. The memory usage shown by daemon status quickly increases from 1.13% to 3.4%, and the memory is not released after all the processes finish. (My laptop has 8G of memory, and I used only one daemon worker in the test.)

The crux

When I roll the version back to v1.2.1 the issue is not there anymore, but it appears again in v1.3.0. The most important engine-related changes added in v1.3.0 are: 1) the LoopCommunicator is introduced to handle all message transmission with RabbitMQ; 2) the broadcast subscription of a workchain for its child processes is reinstated.

The two commits are tightly related, since the second depends on the first: the LoopCommunicator is used so that Runner.call_on_process_finish now also registers a broadcast subscriber, which is the main purpose of the second commit.

In my test, this is the memory shown by daemon status after running the test script:

[Screenshot from 2020-12-25 20-29-55: verdi daemon status output after running the test script]

However, I have no idea how to fix this; manually calling the garbage collector in the callback of Runner.call_on_process_finish to free the memory seems to have no effect.
The good thing is that I have a repeatable test case, so I will do a full-scale memory profiling in the next few days. Any heads-up and ideas are welcome before I do that. @giovannipizzi @sphuber

Sorry to bother you with this. Happy Christmas!

@ltalirz
Member

ltalirz commented Dec 27, 2020

Thanks a lot for the investigation @unkcpz! Looks to me like you are on the right track.

However, I have no idea how to fix this, manually calling the garbage collector to free the memory in the callback of Runner.call_on_process_finish seems has no effect.

My suspicion would be that after these two commits the runner starts keeping references to objects it actually no longer needs, which prevents the garbage collector from freeing the memory.
Any ideas what might have such an effect?

I would also mention that (without knowing more details) it is possible that the cause is not in the aiida-core code but in the updated dependencies (plumpy, ...).

@giovannipizzi
Member Author

Thanks @unkcpz, very useful!!! And I agree with Leo's comment.

From my memory reports above, it seems that (at least part of) the memory that does not get freed is allocated here:
https://github.com/aiidateam/plumpy/blob/v0.15.0/plumpy/communications.py#L60-L64
The code was not newly added in plumpy v0.15.0, but I guess it is now getting called due to the two commits you mention in the AiiDA code (probably because a broadcast callback is now registered?).

My guess is that indeed something (loop? Communicator? callback?) gets stored, bound to some variable via functools.partial, but does not get released.
One question is whether this code properly removes the broadcast subscriber (e.g.: is the subscriber_identifier properly known by the inline_callback function?)

The next steps would be

  1. identify what it is that gets referenced and not freed, and where the reference is kept
  2. understand at which point it can be freed (e.g., as you say, in on_process_finish, making sure the memory gets freed in all cases)
  3. understand how to remove the reference

Not sure how helpful this is, but maybe it helps in moving one step closer to the solution?
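
To illustrate the suspected mechanism, here is a small self-contained example (plain Python, not the actual plumpy code): a callback bound via functools.partial and kept in a subscriber registry keeps the bound object alive until the subscriber is removed.

import functools
import gc

class BigPayload:
    """Stand-in for a Process instance that holds a lot of memory."""

def handler(obj, body):
    """Stand-in for a broadcast/RPC callback bound to the process."""

subscribers = {}  # stand-in for the communicator's internal subscriber registry
payload = BigPayload()
subscribers['some-id'] = functools.partial(handler, payload)  # callback bound to the payload

del payload          # dropping our own reference is not enough ...
gc.collect()
print(any(isinstance(obj, BigPayload) for obj in gc.get_objects()))  # True: the partial keeps it alive

subscribers.clear()  # ... only removing the subscriber releases the last reference
gc.collect()
print(any(isinstance(obj, BigPayload) for obj in gc.get_objects()))  # False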

@giovannipizzi
Member Author

By the way, I report here the next 5 tracebacks from tracemalloc: they happen in different places, but they all go back to the same routine. This is good, in the sense that if we fix that problem, we probably fix all the memory leaks.

601945 memory blocks: 187338.2 KiB
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 319
    _futures_to_runners[future] = Runner(result, future, yielded)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1003
    self.run()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1069
    yielded = self.gen.send(value)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 237
    msg[messages.BroadcastMessage.CORRELATION_ID])
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 292
    result = func(*args, **kwargs)
  File "/usr/lib/python3.7/types.py", line 277
    coro = func(*args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 63
    return plum_to_kiwi_future(task_future)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 31
    kiwi_future = kiwipy.Future()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 314
    self._condition = threading.Condition()
  File "/usr/lib/python3.7/threading.py", line 238
    self._waiters = _deque()

1153744 memory blocks: 130382.3 KiB
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 319
    _futures_to_runners[future] = Runner(result, future, yielded)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1003
    self.run()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1069
    yielded = self.gen.send(value)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 237
    msg[messages.BroadcastMessage.CORRELATION_ID])
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 292
    result = func(*args, **kwargs)
  File "/usr/lib/python3.7/types.py", line 277
    coro = func(*args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 62
    task_future = futures.create_task(msg_fn, loop)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/futures.py", line 95
    @gen.coroutine
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 246
    return _make_coroutine_wrapper(func, replace_callback=True)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 282
    @functools.wraps(wrapped)

605721 memory blocks: 111211.0 KiB
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 319
    _futures_to_runners[future] = Runner(result, future, yielded)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1003
    self.run()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1069
    yielded = self.gen.send(value)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 237
    msg[messages.BroadcastMessage.CORRELATION_ID])
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 292
    result = func(*args, **kwargs)
  File "/usr/lib/python3.7/types.py", line 277
    coro = func(*args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 63
    return plum_to_kiwi_future(task_future)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 31
    kiwi_future = kiwipy.Future()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 314
    self._condition = threading.Condition()
  File "/usr/lib/python3.7/threading.py", line 235
    self._is_owned = lock._is_owned


954907 memory blocks: 97994.2 KiB
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1003
    self.run()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1069
    yielded = self.gen.send(value)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 237
    msg[messages.BroadcastMessage.CORRELATION_ID])
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 292
    result = func(*args, **kwargs)
  File "/usr/lib/python3.7/types.py", line 277
    coro = func(*args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 62
    task_future = futures.create_task(msg_fn, loop)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/futures.py", line 95
    @gen.coroutine
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 246
    return _make_coroutine_wrapper(func, replace_callback=True)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 280
    func = types.coroutine(func)
  File "/usr/lib/python3.7/types.py", line 265
    co.co_cellvars)

@unkcpz
Member

unkcpz commented Dec 31, 2020

@giovannipizzi @ltalirz Thanks for the inputs!

Here is another piece of information that might be related to the issue. This RuntimeWarning appeared many times while I was migrating from tornado to asyncio, but I just ignored it at the time. It now appears again, and I guess asyncio flags it as some sort of memory leak:

/data/CONDA_ENV/aiida-testing38/lib/python3.8/asyncio/base_events.py:641: RuntimeWarning: coroutine 'create_task.<locals>.run_task' was never awaited                         
Coroutine created at (most recent call last)                                                                                                                                  
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete                                                                
    self.run_forever()                                                                                                                                                        
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/asyncio/base_events.py", line 570, in run_forever                                                                       
    self._run_once()
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/site-packages/nest_asyncio.py", line 132, in _run_once
    handle._run()                     
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/site-packages/nest_asyncio.py", line 201, in run              
    ctx.run(self._callback, *self._args)                                               
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/site-packages/nest_asyncio.py", line 159, in step     
    result = coro.send(None)
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/site-packages/kiwipy/rmq/communicator.py", line 217, in _on_broadcast
    await receiver(
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/site-packages/kiwipy/rmq/utils.py", line 118, in wrap
    return coro_or_fn(*args, **kwargs)
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/site-packages/plumpy/communications.py", line 61, in converted
    task_future = futures.create_task(msg_fn, loop)
  File "/data/CONDA_ENV/aiida-testing38/lib/python3.8/site-packages/plumpy/futures.py", line 75, in create_task
    asyncio.run_coroutine_threadsafe(run_task(), loop)
  self._ready.clear()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

@unkcpz
Member

unkcpz commented Jan 5, 2021

I did more experiments on this issue and I think the leak comes from both the newly added LoopCommunicator and the subscriber. I happened to have a commit unkcpz@9848899#diff-dda81e6cbcd5a3e056c7c2602e4e02db10bd5f711be8e1b6095ed9c256af42fb that drops the use of the LoopCommunicator and uses the ThreadCommunicator directly to implement the process_finish call.
With this commit, memory usage decreases from 3.4% to 2.1% with the small test case I used above. Although the leak decreases a lot, it is still much larger than when using the polling mechanism.

The point is that memory is leaked not only by the subscribers but also by the implementation of the LoopCommunicator. However, I have no further ideas on how to fix this. Maybe @muhrin has some intuition about where I should look more deeply?

ltalirz changed the title from "Memory leak in deamon runners?" to "Memory leak in daemon runners?" on Jan 13, 2021
@ltalirz
Member

ltalirz commented Jan 13, 2021

I've now tested (so far only using toy examples) where the memory references are accumulating.
Report here

P.S. According to my tests, at least the basic fact of this accumulation does not seem to be affected by the presence or absence of the LoopCommunicator (tested on a branch of @unkcpz where the LoopCommunicator was removed from 1.5.2 and the results were unaffected).

@chrisjsewell
Member

Here is another piece of information that might be related to the issue.

FYI, the same warning is appearing in the benchmark engine tests:

https://github.com/aiidateam/aiida-core/runs/1677369011?check_suite_focus=true

 tests/benchmark/test_engine.py::test_workchain_local[serial-wc-loop]
  /opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/asyncio/base_events.py:641: RuntimeWarning: coroutine 'create_task.<locals>.run_task' was never awaited
  Coroutine created at (most recent call last)
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
      self.run_forever()
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
      self._run_once()
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/nest_asyncio.py", line 132, in _run_once
      handle._run()
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/nest_asyncio.py", line 201, in run
      ctx.run(self._callback, *self._args)
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/nest_asyncio.py", line 159, in step
      step_orig(task, exc)
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/asyncio/tasks.py", line 280, in __step
      result = coro.send(None)
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/kiwipy/rmq/communicator.py", line 217, in _on_broadcast
      await receiver(
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/kiwipy/rmq/utils.py", line 118, in wrap
      return coro_or_fn(*args, **kwargs)
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/plumpy/communications.py", line 61, in converted
      task_future = futures.create_task(msg_fn, loop)
    File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/plumpy/futures.py", line 75, in create_task
      asyncio.run_coroutine_threadsafe(run_task(), loop)
    self._ready.clear()

@muhrin
Contributor

muhrin commented Jan 13, 2021

Hi @ltalirz, nice work.

I'd be tempted to try and pare down the problem further now, given that you're seeing that a complete round-trip launch via the runner is clearly not cleaning up after itself.

For example, does running a workfunction from a simple script leave references? One of the things I'd be hunting for is an object that is hanging around that shouldn't be (for example the Process instance itself or the associated node). Then I'd want to get backreferences from that object (to see what's keeping it alive).

If you find such a beast in the wild (don't be scared to just hack the runner to intercept the process instance after it has finished, for example), then I've found the objgraph library super useful, by calling this: https://mg.pov.lt/objgraph/#backreferences

(the blog entries for this library are also quite interesting and informative, e.g. https://mg.pov.lt/blog/hunting-python-memleaks.html)
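
For example, something along these lines (a minimal sketch; objgraph must be installed separately, and 'SleepCalculation' is just the class name used in the tests above):

import gc
import objgraph  # pip install objgraph (graphviz is needed to render the image)

gc.collect()

# Find any instances of the suspect class that are still alive ...
leaked = objgraph.by_type('SleepCalculation')
if leaked:
    # ... and render the chain of back-references that keeps the first one alive.
    objgraph.show_backrefs(leaked[0], max_depth=5, filename='backrefs.png')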

@chrisjsewell
Member

chrisjsewell commented Jan 26, 2021

Ok, so over the past few days I have created https://github.com/aiidateam/aiida-integration-tests.
The README explains what this is in detail, but basically the idea is to create a reproducible environment (with Docker Compose) to investigate these kinds of "production" issues.

In investigating this issue, here is what I did:

First spin up the docker network:

$ docker-compose up --build -d

Then in a separate terminal you can get a live reporting of the stats:

$ docker stats
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O        BLOCK I/O         PIDS
2940cef2c10d   aiida-int-core       0.00%     22.86MiB / 1.942GiB   1.15%     1.1kB / 412B   17.6MB / 9.26MB   13
e8b56bc0f9c7   aiida-int-slurm      7.69%     20.46MiB / 1.942GiB   1.03%     1.06kB / 0B    22.3MB / 4.1kB    20
90e2543dcd25   aiida-int-rmq        4.25%     103.5MiB / 1.942GiB   5.21%     1.06kB / 0B    43.5MB / 590kB    92
9313daf2fbf9   aiida-int-postgres   0.10%     49.57MiB / 1.942GiB   2.49%     1.15kB / 0B    42.1MB / 49.7MB   7

After configuring an AiiDA profile connected to the other three containers:

$ docker exec -it aiida-int-core /bin/bash
root@2940cef2c10d:~# aiida_config/run_all.sh
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
2940cef2c10d   aiida-int-core       0.00%     49.37MiB / 1.942GiB   2.48%     269kB / 349kB     41.4MB / 9.27MB   13
e8b56bc0f9c7   aiida-int-slurm      7.73%     21.34MiB / 1.942GiB   1.07%     11.1kB / 10.2kB   24.3MB / 4.1kB    20
90e2543dcd25   aiida-int-rmq        3.48%     101.9MiB / 1.942GiB   5.13%     2.56kB / 1.17kB   43.5MB / 594kB    92
9313daf2fbf9   aiida-int-postgres   6.01%     59.62MiB / 1.942GiB   3.00%     338kB / 256kB     42.1MB / 62.9MB   7

If you start a daemon, this is the baseline memory usage:

root@2940cef2c10d:~# verdi daemon start 1
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
2940cef2c10d   aiida-int-core       2.96%     213.4MiB / 1.942GiB   10.73%    278kB / 358kB     43.6MB / 9.27MB   25
e8b56bc0f9c7   aiida-int-slurm      2.12%     23.68MiB / 1.942GiB   1.19%     11.3kB / 10.2kB   24.4MB / 41kB     20
90e2543dcd25   aiida-int-rmq        3.99%     101.3MiB / 1.942GiB   5.09%     5.75kB / 3.52kB   43.6MB / 631kB    92
9313daf2fbf9   aiida-int-postgres   0.07%     61.27MiB / 1.942GiB   3.08%     343kB / 261kB     42.1MB / 66.2MB   8

If you start an ipython shell, this is the baseline memory usage:

root@2940cef2c10d:~# verdi daemon stop
root@2940cef2c10d:~# verdi shell
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
2940cef2c10d   aiida-int-core       0.00%     136.4MiB / 1.942GiB   6.86%     281kB / 360kB     46MB / 9.38MB     18
e8b56bc0f9c7   aiida-int-slurm      7.12%     25.96MiB / 1.942GiB   1.31%     11.4kB / 10.2kB   24.4MB / 41kB     24
90e2543dcd25   aiida-int-rmq        3.31%     101.1MiB / 1.942GiB   5.08%     6.08kB / 3.67kB   43.6MB / 635kB    92
9313daf2fbf9   aiida-int-postgres   0.00%     61.18MiB / 1.942GiB   3.08%     344kB / 263kB     42.1MB / 66.4MB   8

Now here is the important part: if you run a calculation in ipython (which I think should mirror what is happening in a daemon worker?), you can see that the memory usage increases and is not released:

In [1]: import gc 
   ...: from aiida_sleep.cli import run_calc 
   ...: from aiida_sleep.sleep_job import SleepCalculation 
   ...: from pympler import muppy, refbrowser                                                                                                        
In [2]: run_calc(payload=int(1e6), output=int(1e6))                                                                                                  
Out[2]: <CalcJobNode: uuid: 4b1a2c94-e93d-427a-9ec4-380566eae038 (pk: 5) (aiida.calculations:sleep)>
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
2940cef2c10d   aiida-int-core       0.00%     873.3MiB / 1.942GiB   43.91%    223MB / 79.4MB    58.2MB / 9.44MB   20
e8b56bc0f9c7   aiida-int-slurm      8.53%     32.08MiB / 1.942GiB   1.61%     5.3MB / 5.07MB    33MB / 135kB      20
90e2543dcd25   aiida-int-rmq        2.51%     95.41MiB / 1.942GiB   4.80%     14.4kB / 11.3kB   44.5MB / 643kB    92
9313daf2fbf9   aiida-int-postgres   0.39%     79.17MiB / 1.942GiB   3.98%     74.1MB / 218MB    42.4MB / 99.8MB   9

Using pympler to analyze the memory, you can find that the Process (and its connected nodes) are still in memory.
They are being held there, it appears, by the kill_process method.

In [3]: gc.collect() 
   ...: all_objects = muppy.get_objects() 
   ...: calcs = [o for o in all_objects if hasattr(o, "__class__") and isinstance(o, SleepCalculation)] 
   ...: calcs                                                                                                                                        
Out[3]: [<aiida_sleep.sleep_job.SleepCalculation at 0x7fe0ab3169d0>]
In [4]: cb = refbrowser.ConsoleBrowser(calcs[0], maxdepth=3) 
   ...: cb.print_tree()                                                                                                                              
aiida_sleep.sleep_job.SleepCalculation-+-list--dict-+-list
                                       |            +-dict
                                       |            +-dict
                                       |            +-dict
                                       |            +-dict
                                       |            +-dict
                                       |            +-IPython.core.interactiveshell.DummyMod
                                       |            +-dict
                                       |            +-dict
                                       |
                                       +-list-+-dict-+-list
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-IPython.core.interactiveshell.DummyMod
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |
                                       |      +-dict-+-list
                                       |      |      +-dict
                                       |      |
                                       |      +-dict-+-list
                                       |      |      +-IPython.terminal.prompts.RichPromptDisplayHook
                                       |      |
                                       |      +-dict-+-list
                                       |      |      +-IPython.core.interactiveshell.ExecutionResult
                                       |      |
                                       |      +-dict-+-list
                                       |             +-dict
                                       |             +-dict
                                       |             +-dict
                                       |
                                       |
                                       +-dict-+-list--dict
                                       |      +-dict-+-list
                                       |             +-aiida.plugins.utils.PluginVersionProvider
                                       |
                                       |
                                       +-list-+-list--dict
                                       |      +-hamt_bitmap_node-+-list
                                       |                         +-hamt
                                       |
                                       |
                                       +-cell-+-list--dict
                                       |      +-tuple-+-list
                                       |              +-function (kill_process)
                                       |
                                       |
                                       +-dict-+-list--dict
                                       |      +-plumpy.process_states.Finished-+-list
                                       |                                       +-dict
                                       |
                                       |
                                       +-cell-+-list--dict
                                              +-tuple-+-list
                                              |       +-function (<lambda>)
                                              |
                                              +-tuple-+-list
                                              |       +-function (<lambda>)
                                              |
                                              +-tuple-+-list
                                                      +-function (<lambda>)

Related and somewhat troubling is that, if you run an ipython cell, then press CTRL-C before it finishes, you get:

In [5]: import time; time.sleep(60)                                                                                                                  
^C01/26/2021 03:42:51 PM <243> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 5

So it appears that the process is not correctly finalising!?

(Final note: if you exit the verdi shell, the memory is released.)

This may or may not also be related to #3776 (comment)

@ltalirz
Member

ltalirz commented Jan 26, 2021

@chrisjsewell Thanks a lot for checking!

The kill_process handler is indeed an inline definition that may end up holding the process instance if the handler itself is never released:

with utils.loop_scope(self.loop):
    process = self.instantiate_process(process, *args, **inputs)

    def kill_process(_num, _frame):
        """Send the kill signal to the process in the current scope."""
        LOGGER.critical('runner received interrupt, killing process %s', process.pid)
        process.kill(msg='Process was killed because the runner received an interrupt')

    signal.signal(signal.SIGINT, kill_process)
    signal.signal(signal.SIGTERM, kill_process)

    process.execute()
    return process.outputs, process.node

There are at least in principle two ways one could fix this:

  1. Add the process as a function argument and move the function out (not sure what the signal.signal is doing and whether one is allowed to change the signature of a handler?)
  2. Try to explicitly delete the reference to this handler after it is no longer needed (where would that be?)

I think @muhrin may know the answer to this; he pointed to a similar suspicious case (I believe it was a different one, but I'm not sure).

The memory reported to be held by functools in #4603 (comment) makes me suspect that there are other places in the engine code where the same happens with functools.partial, i.e. new functions being defined with fixed arguments of significant memory size, but the references to these functions never being released.
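
A minimal, self-contained sketch of what option 2 could look like (DummyProcess stands in for the real Process; this is not the actual runner code): restore the previous signal handlers once the process has finished, so that the closure no longer keeps the process alive.

import gc
import signal

class DummyProcess:
    """Stand-in for the Process instance captured by the kill_process closure."""
    def kill(self, msg=None):
        pass

def run_with_kill_handler():
    process = DummyProcess()

    def kill_process(_num, _frame):
        process.kill(msg='Process was killed because the runner received an interrupt')

    previous_sigint = signal.signal(signal.SIGINT, kill_process)
    previous_sigterm = signal.signal(signal.SIGTERM, kill_process)
    try:
        pass  # process.execute() would run here
    finally:
        # Restoring the previous handlers drops the interpreter's reference to the
        # closure, so `process` becomes collectable once this function returns.
        signal.signal(signal.SIGINT, previous_sigint)
        signal.signal(signal.SIGTERM, previous_sigterm)

run_with_kill_handler()
gc.collect()
print(any(isinstance(obj, DummyProcess) for obj in gc.get_objects()))  # False: no lingering reference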

@chrisjsewell
Member

All those callbacks should be cleared on termination

Opened an issue on plumpy aiidateam/plumpy#197

@muhrin
Contributor

muhrin commented Jan 28, 2021

Yes, I think you're probably right. Ok, how 'bout just doing what you say and putting a check in the state property getter to raise an error with a message pinpointing what has happened. Hopefully it never gets hit anyway.

@chrisjsewell
Member

Yes, I think you're probably right. Ok, how 'bout just doing what you say and putting a check in the state property getter to raise an error with a message pinpointing what has happened. Hopefully it never gets hit anyway.

👍 opened aiidateam/plumpy#198

@chrisjsewell
Member

chrisjsewell commented Jan 28, 2021

Ok, I've found the next memory leak, and this is a big one. It happens for the slurm server code only (I assume this may extend to all SSH-transport codes), but not for a local code:

aiida_sleep.sleep_job.SleepCalculation-+-list-+-hamt_bitmap_node-+-hamt-+-Context-+-asyncio.events.TimerHandle

The object holding this reference is the open_callback_handle from:

<TimerHandle when=721531.841911332 TransportQueue.request_transport.<locals>.do_open() at /root/aiida-core/aiida/engine/transports.py:84>

So it appears that the transport is not being closed properly 😬

As an example, submitting with a local code:

$ aiida-sleep calc -o 1000000 -p 1000000 -c sleep@local --submit

and after it has ended:

CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
0d026cab8612   aiida-int-core       4.50%     577.6MiB / 1.942GiB   29.04%    2.46GB / 705MB    460MB / 13.3MB    22

Whereas submitting with a SLURM code:

$ aiida-sleep calc -o 1000000 -p 1000000 -c sleep@slurm --submit
CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
0d026cab8612   aiida-int-core       5.35%     1.139GiB / 1.942GiB   58.64%    2.17GB / 631MB    447MB / 13.3MB    25

@ltalirz
Member

ltalirz commented Jan 28, 2021

A quick pympler check using aiida-integration-tests confirms that indeed the reference to the SleepCalculation object is correctly cleared when using the sleep-local computer with local transport on the latest develop (commit a490fe0; even without applying the additional fixes for cyclic references in plumpy).
I.e. the SSH transport seems to be the final holdout.

@chrisjsewell
Member

chrisjsewell commented Jan 28, 2021

indeed 👍 As I have already noted to @ltalirz, this does not yet seem to release all memory; there still seem to be reference(s) to the input/output nodes, but it is getting there

Edit (Leo): We checked together and, for runs on sleep-local (localhost), all references to Processes and to AiiDA nodes are freed. There are, however, remaining memory leaks, which I'm looking into.

@sphuber
Contributor

sphuber commented Jan 28, 2021

So it appears that the transport is not being closed properly

Not exactly sure if the transport is holding on to an instance of the process, or if you are suggesting that the transport should be closed, but I just wanted to point out that we should be careful with the latter. The idea of the TransportQueue is to keep the transport open and share it among processes when they need it. This means that if a process goes out of scope, it should not itself close the transport.

@chrisjsewell
Member

thanks for the clarification @sphuber, yeh I haven't really looked in any detail at that part of the code yet

@chrisjsewell
Member

chrisjsewell commented Jan 28, 2021

I guess then that the TransportQueue may not be releasing terminated processes

@chrisjsewell
Member

chrisjsewell commented Jan 28, 2021

Note, regarding this transport queue issue: if you run multiple calculations, only a reference to the first one you run is kept:

In [1]: import gc 
   ...:  
   ...: from pympler import muppy, refbrowser 
   ...:  
   ...: from aiida_sleep.cli import run_calc 
   ...: from aiida_sleep.sleep_job import SleepCalculation                                                         

In [2]: run_calc(code="sleep@slurm")                                                                               
Out[2]: <CalcJobNode: uuid: ad7338c7-3c7f-49db-9468-d3866a95bdda (pk: 192) (aiida.calculations:sleep)>

In [3]: run_calc(code="sleep@slurm")                                                                               
Out[3]: <CalcJobNode: uuid: 7485d8cb-ab0e-46c9-9312-5852d92291f1 (pk: 200) (aiida.calculations:sleep)>

In [4]: run_calc(code="sleep@slurm")                                                                               
Out[4]: <CalcJobNode: uuid: b2c11303-a607-48a7-8929-3376e05bd902 (pk: 208) (aiida.calculations:sleep)>

In [5]: gc.collect() 
   ...: all_objects = muppy.get_objects() 
   ...: calcs = [ 
   ...:     o 
   ...:     for o in all_objects 
   ...:     if hasattr(o, "__class__") and isinstance(o, SleepCalculation) 
   ...: ]                                                                                                          

In [6]: calcs                                                                                                      
Out[6]: [<aiida_sleep.sleep_job.SleepCalculation at 0x7fadfa6ad340>]

In [7]: calcs[0].node.pk                                                                                           
Out[7]: 192

I guess this is because, for subsequent calculations, it checks whether the transport is already open and does not have to re-open it

@ltalirz
Member

ltalirz commented Jan 28, 2021

Just to keep track of a curiosity we encountered: it seems like certain references go out of scope from one ipython cell execution to the next.

E.g. when using the local code sleep@local, we find that all references to the calculation are gone if we split the code into the following two ipython execution steps:

In [1]: from pympler import muppy, refbrowser, summary
   ...: import gc
   ...:
   ...: from aiida_sleep.cli import run_calc
   ...: from aiida_sleep.sleep_job import SleepCalculation
   ...:
   ...: run_calc(code="sleep@local", payload=int(1e5), output=int(1e5))
Out[1]: <CalcJobNode: uuid: 55ebc26f-0225-489c-a1a0-1de0710dfabb (pk: 194) (aiida.calculations:sleep)>

In [2]: gc.collect()
   ...: objects_1 = muppy.get_objects()
   ...: calcs = [o for o in objects_1 if hasattr(o, "__class__") and isinstance(o, SleepCalculation)]
   ...: calcs
Out[2]: []

However, if one runs the same code inside a single cell (or copies it into a python script and runs the script using verdi run), the result is different:

In [1]: from pympler import muppy, refbrowser, summary
   ...: import gc
   ...:
   ...: from aiida_sleep.cli import run_calc
   ...: from aiida_sleep.sleep_job import SleepCalculation
   ...:
   ...: run_calc(code="sleep@local", payload=int(1e5), output=int(1e5))
   ...:
   ...: gc.collect()
   ...: objects_1 = muppy.get_objects()
   ...:
   ...: calcs = [o for o in objects_1 if hasattr(o, "__class__") and isinstance(o, SleepCalculation)]
   ...: calcs
Out[1]: [<aiida_sleep.sleep_job.SleepCalculation at 0x7ff52a2edd90>]

In the second case, there are two extra references from cell objects
(plus others, at least some of which are cyclic and would be removed by the garbage collector):

In [2]: cb = refbrowser.ConsoleBrowser(calcs[0], maxdepth=3)
   ...: cb.print_tree()
aiida_sleep.sleep_job.SleepCalculation-+-list--dict-+-list
                                       |            +-dict
                                       |            +-dict
                                       |            +-dict
                                       |            +-dict
                                       |            +-dict
                                       |            +-IPython.core.interactiveshell.DummyMod
                                       |            +-dict
                                       |            +-dict
                                       |
                                       +-list-+-dict-+-list
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |
                                       |      +-dict-+-list
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |      +-IPython.core.interactiveshell.DummyMod
                                       |      |      +-dict
                                       |      |      +-dict
                                       |      |
                                       |      +-dict-+-list
                                       |      |      +-dict
                                       |      |
                                       |      +-dict-+-list
                                       |      |      +-IPython.terminal.prompts.RichPromptDisplayHook
                                       |      |
                                       |      +-dict-+-list
                                       |             +-IPython.core.interactiveshell.ExecutionResult
                                       |
                                       |
                                       +-cell-+-list--dict
                                       |      +-tuple-+-list
                                       |              +-function (try_killing)
                                       |
                                       |
                                       +-method-+-list--dict
                                       |        +-dict-+-list
                                       |        |      +-function (broadcast_receive)
                                       |        |
                                       |        +-cell-+-list
                                       |               +-tuple
                                       |
                                       |
                                       +-dict-+-list--dict
                                       |      +-plumpy.process_states.Finished-+-list
                                       |                                       +-dict
                                       |
                                       |
                                       +-cell-+-list--dict
                                              +-tuple-+-list
                                              |       +-function (<lambda>)
                                              |
                                              +-tuple-+-list
                                              |       +-function (<lambda>)
                                              |
                                              +-tuple-+-list
                                                      +-function (<lambda>)

It looks as if certain CPython cell objects (two of which hold a reference to the Calculation) go out of scope between two ipython cell executions.
Until we understand what is going on here, we need to be a bit careful when using ipython to analyze these references, to avoid overlooking references that may still exist inside the daemon runners.
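For a quick cross-check without pympler, the standard library offers gc.get_referrers; the small helper below (show_referrers is my own name, not part of any API) lists the direct referrers of an object. Its output also contains the helper's own call frame and ipython bookkeeping, so it is only a rough sanity check.

import gc

def show_referrers(obj, limit=10):
    # print the type and a truncated repr of the objects that directly
    # reference ``obj``
    for ref in gc.get_referrers(obj)[:limit]:
        print(type(ref).__name__, repr(ref)[:100])

# e.g. show_referrers(calcs[0]) for the calculation found above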

P.S. We originally thought the reason for this result may be that after run_calc returned something was still going on under the hood, and so the muppy.get_objects() call would still catch a reference to the calculation lying around. We found that adding a time.sleep(20) right after run_calc did not change anything in the result.

P.P.S. Aha! The issue was the time delay after all: time.sleep() blocks the asyncio event loop from making progress, so the cleanup callbacks never get a chance to run.
Adding the following snippet after run_calc restores the original behavior (no references remaining):

import asyncio  # if not already imported in the session

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.sleep(2))

I.e. when working in ipython, we just need to make sure to let a bit of time pass after run_calc before proceeding with collecting objects etc.
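Putting the recipe together, a minimal sketch of the full check (assuming the same aiida_sleep test plugin and a configured sleep@local code used above):

import asyncio
import gc

from pympler import muppy
from aiida_sleep.cli import run_calc
from aiida_sleep.sleep_job import SleepCalculation

node = run_calc(code="sleep@local", payload=int(1e5), output=int(1e5))

# let the event loop run its pending callbacks; a plain time.sleep() would
# block the loop and defeat the purpose
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.sleep(2))

gc.collect()
leftovers = [o for o in muppy.get_objects() if isinstance(o, SleepCalculation)]
print(f"node pk={node.pk}, lingering SleepCalculation instances: {len(leftovers)}")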

@chrisjsewell
Member

Note, I have created an alpha plumpy release (https://pypi.org/project/plumpy/0.18.5a0/) from aiidateam/plumpy#205, which should be the final change required for daemons to automatically garbage collect terminated processes.

@giovannipizzi do you want to give this a try (aiida-core develop + plumpy==0.18.5a0) to see if you see any difference in memory for your original workchain runs (#4603 (comment))?
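For anyone repeating this test, a quick sanity check (my own suggestion, not part of the instructions above) that the environment the daemon workers use really has the alpha release installed:

from importlib.metadata import version  # stdlib from Python 3.8 onwards

print("aiida-core:", version("aiida-core"))
print("plumpy:", version("plumpy"))  # expecting 0.18.5a0 here
# remember to restart the daemon after upgrading so the workers pick up the new code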

@giovannipizzi
Member Author

I launched ~1000 work chains last night; they are running. I guess tomorrow we'll have an answer. Currently (400 work chains running, ~1200 active processes) the memory usage is still 2.5-2.8% per worker (with 8 workers), which totals ~13GB RAM. We'll see if this goes back to zero at the end of all jobs! Then we can investigate whether the memory consumption while running is OK (we are at roughly 11 MB per active process on average; I think/hope this can be reduced, but that is for a different issue).

@chrisjsewell
Member

thanks @giovannipizzi, yes indeed it will not reduce the memory during running, that's the next step 😬

@giovannipizzi
Member Author

Unfortunately I have bad news. After all workflows are completed and there are no more running (e.g. in verdi process list), the memory consumption is still high:

$ verdi daemon status
Profile: development
Daemon is running as PID 97545 since 2021-02-10 22:58:17
Active workers [8]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
97549    2.416        0  2021-02-10 22:58:18
97675    2.131        0  2021-02-10 22:58:47
97676    2.497        0  2021-02-10 22:58:47
97677    2.405        0  2021-02-10 22:58:47
97678    2.418        0  2021-02-10 22:58:47
97679    2.584        0  2021-02-10 22:58:47
97680    2.436        0  2021-02-10 22:58:47
97681    2.576        0  2021-02-10 22:58:47

Just to show the "correct" baseline for memory consumption, I ran verdi daemon decr 1 followed by verdi daemon incr 1, and I now get:

  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
97675    2.131        0  2021-02-10 22:58:47
97676    2.497        0  2021-02-10 22:58:47
97677    2.405        0  2021-02-10 22:58:47
97678    2.418        0  2021-02-10 22:58:47
97679    2.584        0  2021-02-10 22:58:47
97680    2.436        0  2021-02-10 22:58:47
97681    2.576        0  2021-02-10 22:58:47
23192    0.143        0  2021-02-12 00:23:21

(and indeed at the beginning, or if I restart the daemon, all were using only ~0.143% memory).

@giovannipizzi
Member Author

I suggest in any case postponing this until after the 1.6.0 release, since it is not blocking and the issue was already present before. Instead, I opened #4745 for a problem that occurred during these runs and that is IMO more urgent to solve before the release, as it makes some jobs fail.

@ltalirz
Member

ltalirz commented Mar 31, 2021

Just to add another data point:

This is

  • with AiiDA 1.6.0 (python 3.8.8, pip list)
  • having run ~10k CalcJobs (mix between local and ssh transport)
  • on 2 daemon workers

followed by verdi daemon incr 1 shortly before the end [1]

(aiida-zeopp) ubuntu@qmobile:~$ verdi daemon status
Profile: csd_import
Daemon is running as PID 23850 since 2021-03-24 14:33:16
Active workers [3]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
28605    0.658    106.6  2021-03-30 16:31:41
28606    0.657     91    2021-03-30 16:31:41
21478    0.318      9.2  2021-03-31 16:22:10

100% = 32GB, i.e. the daemon workers are using ~210 MB each, which is perfectly acceptable to me.
Over the ~5k calculations each, they seem to have accumulated ~100 MB of extra memory (which would translate to 20 kB/calc).

If Giovanni's issues still persist with AiiDA 1.6.0, that suggests to me that his issue is either related to

  • some difference between WorkChains vs CalcJobs (i.e. a leak in workchains only), or
  • memory held somehow by the plugins/ORM (my jobs did not involve calculation inputs or outputs with large amounts of data stored in the ORM)

[1] I.e. the 3rd daemon worker still got the chance to pick up a handful of calculations. This makes for a fairer comparison in my view, since some memory increase is to be expected due to things like python imports.

Edit: Update after having run ~250-300k calculations with 2-3 workers (added one a bit later), i.e. roughly 100k per worker

$ verdi daemon status
Profile: csd_import
Daemon is running as PID 5938 since 2021-04-06 13:20:31
Active workers [3]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
 5942    0.869     20    2021-04-06 13:20:32
25838    0.837     20.2  2021-04-07 00:43:08
29560    0.747     20.3  2021-04-08 16:22:52

Again, 100%=32GB, i.e. 0.869%=278MB, so this worker accumulated ~170MB of extra memory over the course of running these 100k processes
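For reference, the arithmetic behind these numbers made explicit (TOTAL_RAM_MB and worker_mb are my own helper names; the baseline is the freshly started worker from the first table):

TOTAL_RAM_MB = 32_000  # 100 % in the MEM % column corresponds to 32 GB here

def worker_mb(mem_percent):
    return mem_percent / 100 * TOTAL_RAM_MB

baseline = worker_mb(0.318)   # freshly started worker, roughly 100 MB
loaded = worker_mb(0.869)     # roughly 278 MB after ~100k processes
extra = loaded - baseline
print(f"extra: {extra:.0f} MB, i.e. ~{extra / 100_000 * 1000:.1f} kB per process")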

@giovannipizzi
Member Author

Hi all,
I just tested with the most recent develop (post 1.6.3). Everything seems to be working fine! I ran again with 8 workers and the same work chains (Quantum ESPRESSO) as before (now using the most recent aiida-QE, git hash 33cb11e3c9a1f).

At the beginning the memory usage of each of the 8 workers was 0.146%.

After running 4 processes (some of which triggered the exponential-backoff mechanism), the memory usage went up a bit, to 0.22-0.25%. I then ran 400 WorkChains concurrently (resulting in ~1200 processes): at the end the memory usage was only 0.33% (and I think it never went much higher).
Even after running 1200 more, it remains at 0.33%!
(The additional memory could just be due to loading the python modules.)
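To check how much of that baseline is simply module imports, one could run something like the following in a fresh Python interpreter (a rough, illustrative measurement of my own, not from the thread):

import os
import psutil  # already a dependency of aiida-core

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss
import aiida.orm  # imported only now, on purpose
rss_after = proc.memory_info().rss
print(f"importing aiida.orm added ~{(rss_after - rss_before) / 1024**2:.0f} MB RSS")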

So, for at least my original use case, I think that this issue is fixed and can be closed (thanks! :-) ) - @chrisjsewell feel free to close unless some other people have more issues.

@chrisjsewell
Member

wahoo 🎉
yeah, thanks! Just to conclude: @ltalirz and I fixed all the "hard" references to finished processes (i.e. the non-circular ones), leaving only circular references, which the Python garbage collector can in principle clean up, and we added tests for this to guard against any code changes that would re-break it.
There was some "worry" that the garbage collector might not always be triggered on the daemon workers, but if you no longer see the high memory usage, then that seems not to be an issue 😄
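If that worry ever resurfaced, here is a purely illustrative sketch (not something aiida-core does or necessarily needs; periodic_gc is my own name) of how one could force periodic collections on a worker's asyncio loop:

import asyncio
import gc

async def periodic_gc(interval=300.0):
    # trigger a full collection every ``interval`` seconds and report how many
    # unreachable objects the cyclic collector found
    while True:
        await asyncio.sleep(interval)
        print(f"gc pass collected {gc.collect()} unreachable objects")

# e.g. schedule it on an already-running loop with: loop.create_task(periodic_gc())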

So yes I will close this then, but feel free to re-open if anyone else still encounters the issue

@muhrin
Contributor

muhrin commented Apr 30, 2021

That's great news, thanks for all the hard work!

@giovannipizzi
Member Author

To further confirm: memory usage while running is now also quite low. I have 8 daemon workers, with the slots increased from the default to 500 per worker, and ~3600 running processes (from verdi process list); the memory usage of the workers is around 0.41-0.43%. This hints to me that it might be OK to suggest increasing the number of slots per worker without major issues (though of course this requires more testing). It might also mean that it will be much harder to choke the machine and so reach the point where the issue of #4598 is triggered.
