
Expose malloc statistics #1275

Open
pitrou opened this issue May 8, 2018 · 28 comments
Comments

@pitrou
Contributor

pitrou commented May 8, 2018

I don't know if it's exactly in the scope for psutil, but just in case: it could be useful to expose per-platform malloc() statistics, for example using mallinfo() on GNU/Linux:
http://man7.org/linux/man-pages/man3/mallinfo.3.html

@giampaolo
Owner

Hey Antoine! I don't know either but it looks kinda too low-levelish. Do you have a use case?

@pitrou
Contributor Author

pitrou commented May 8, 2018

My use case was debugging memory use here: https://bugs.python.org/issue33444
I'm not sure there's a production use :-) Though I'm sure authors of sophisticated parallel computing frameworks (such as Dask -- @mrocklin, @jakirkham or @ogrisel) would like to be able to diagnose whether a memory consumption problem is a memory leak caused by Python objects.

@jakirkham

Not having used mallinfo much, can't say much about it. That said, I can grok why it would be useful.

Just to understand the use case a bit more. Are you hoping to call mallinfo using ctypes (?) within a process that you are trying to debug? Namely by inserting mallinfo calls wherever it seems to matter to get some intuition about how memory usage is changing over time?

@pitrou
Contributor Author

pitrou commented May 8, 2018

What I mean is that a framework (like Dask) which already exposes memory usage statistics (such as RSS) could expose additional useful information thanks to mallinfo. The main info of interest IMHO is how much memory is kept in the allocator despite being nominally released (because of fragmentation).

@mrocklin

mrocklin commented May 8, 2018

I've been trying to track down memory leaks when using Pandas in parallel in situations that look similar to the bug report pointed to by @pitrou . I agree that increased visibility would be of value.

@jakirkham

To rephrase my question in the Dask context, does this make sense to call from the nanny process or does it only make sense in the worker process?

@pitrou
Contributor Author

pitrou commented May 8, 2018

It only makes sense in the worker process, IMO.

@giampaolo
Owner

If I'm understanding this right, mallinfo() returns info about the current (calling) process. It's not system-wide info, nor can it be fetched on a per-process basis, which makes it incompatible with the psutil.Process class.

@giampaolo
Owner

giampaolo commented May 9, 2018

As for the usefulness of this, I'm skeptical. We are already able to determine memory leaks by using Process.memory_info().rss or Process.memory_full_info().uss before and after a function call. Knowing detailed stats about malloc() specifically looks like something which is more useful to debug a C program while developing it.
Also, it seems mallinfo() is basically deprecated: https://stackoverflow.com/questions/40878169/64-bit-capable-alternative-to-mallinfo

@pitrou
Contributor Author

pitrou commented May 9, 2018

We are already able to determine memory leaks by using Process.memory_info().rss or Process.memory_full_info().uss before and after a function call.

See https://bugs.python.org/issue33444. A higher rss tells you that there may be a memory leak, not that there is one. Python uses malloc() for all allocations > 512 bytes. When Python releases memory to the glibc (by calling free()), the glibc isn't always able to return memory to the system, so rss appears to remain high.

Also, it seems mallinfo() is basically deprecated

I see, perhaps malloc_info is more worth exposing then? Though its return values aren't documented...

@giampaolo
Owner

A higher rss tells you that there may be a memory leak, not that there is one.

You are right, RSS is an approximation. In fact I bumped into false positives for a long time before introducing USS which apparently either solved or mitigated the issue (not sure):

```python
@staticmethod
def _get_mem():
    # By using USS memory it seems it's less likely to bump
    # into false positives.
    if LINUX or WINDOWS or OSX:
        return thisproc.memory_full_info().uss
    else:
        return thisproc.memory_info().rss
```

I say "not sure" because psutil's memory leak test script allows some tolerance:

```python
diff1 = mem2 - mem1
if diff1 > tolerance:
```

Would mallinfo() / malloc_info() metrics be more precise than USS at helping identify missing free()s? According to this blog post https://scaryreasoner.wordpress.com/2007/10/17/finding-memory-leaks-with-mallinfo/ uordblks is the value one would typically want to use.

With that said, I'm not against exposing mallinfo() / malloc_info() or whatever per se. My concern is that they are pretty low level, it would be a Linux-only API, and I'm not sure which values other than uordblks are really useful and worth exposing.

@pitrou
Contributor Author

pitrou commented May 9, 2018

In Python's case fordblks is the better choice (it gives you the number of bytes available in the allocator but not returned to the system). uordblks can be misleading because Python uses its own allocator for small-sized blocks (< 256 or 512 bytes).

USS is not better than RSS for finding out memory fragmentation.
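For reference, reading both counters from Python can be sketched with `ctypes` (an illustrative sketch, not part of psutil; it assumes glibc on Linux, with the struct layout taken from the mallinfo(3) man page):

```python
import ctypes

class Mallinfo(ctypes.Structure):
    # Field layout from mallinfo(3); every field is a plain C int,
    # which is why values wrap on heaps larger than 2 GiB.
    _fields_ = [(name, ctypes.c_int) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]

libc = ctypes.CDLL("libc.so.6")
libc.mallinfo.restype = Mallinfo  # mallinfo() returns the struct by value

info = libc.mallinfo()
print("in use (uordblks):", info.uordblks)
print("free but not returned to the OS (fordblks):", info.fordblks)
```

Per the discussion above, `fordblks` is the fragmentation-related number, while `uordblks` is what the linked blog post uses for leak hunting.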

@giampaolo
Owner

If we figure out what the malloc_info() values stand for, and we are able to provide a concrete code sample in the doc showing how to detect memory leaks for a function call, then I suppose we could have a psutil.malloc_memory() or something. Basically I want there to be a concrete use case other than merely "reporting low-level memory metrics", which very few people would use. If we can extend the same concept to other platforms (basically Windows and/or OSX) that's even better, especially because relying on RSS/USS is apparently not fully reliable.

@giampaolo
Owner

giampaolo commented May 10, 2018

To push this even further: test_memory_leaks.py tries hard to detect a function's memory leaks by:
1 - picking the right memory stat to use (RSS/USS)
2 - warming up first
3 - calling the function a certain number of times
4 - running the garbage collector
5 - allowing some failure tolerance

Since that's not straightforward maybe we can have a utility function like this:

```python
>>> # signature
>>> test_leak(callable, times=1000, warmup_times=10, tolerance=4096)
>>>
>>> # success (returns None)
>>> test_leak(fun)
>>>
>>> # failure
>>> test_leak(fun)
AssertionError("46523 extra process memory after 1000 calls")
```

Depending on how reliable such a function turns out to be it can live either in psutil namespace, psutil.test namespace, psutil doc or a blog post.

@giampaolo
Owner

The Windows counterpart appears to be called _heapwalk:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/heapwalk?redirectedfrom=MSDN&view=vs-2019

@giampaolo
Owner

See comment about mallinfo() in #1757. It still would not grant 100% stability in detecting memory leaks, despite being more precise.

giampaolo added a commit that referenced this issue May 12, 2020
Preamble
=======

We have a [memory leak test suite](https://github.com/giampaolo/psutil/blob/e1ea2bccf8aea404dca0f79398f36f37217c45f6/psutil/tests/__init__.py#L897), which calls a function many times and fails if the process memory increased afterwards. We do this in order to detect missing `free()` or `Py_DECREF` calls in the C modules: if memory grew, we likely have a memory leak.

The problem
==========

A problem we've been having for probably over 10 years is false positives. That's because memory fluctuates: sometimes it may increase (or even decrease!) due to how the OS handles memory, Python's garbage collector, the fact that RSS is an approximation, and who knows what else. Thus far we tried to compensate for that with the following logic:
- warm up (call fun 10 times)
- call the function many times (1000)
- if memory increased after calling the function 1000 times, keep calling it for another 3 secs
- if it still increased at all (> 0), fail

This logic didn't really solve the problem, as we still had occasional false positives, especially lately on FreeBSD. 

The solution
=========

This PR changes the internal algorithm so that, in case of failure (mem > 0 after calling fun() N times), we retry the test up to 5 times, increasing N (repetitions) each time, and we consider it a failure only if the memory **keeps increasing** between runs. For instance, here's a legitimate failure:

```
psutil.tests.test_memory_leaks.TestModuleFunctionsLeaks.test_disk_partitions ... 
Run #1: extra-mem=696.0K, per-call=3.5K, calls=200
Run #2: extra-mem=1.4M, per-call=3.5K, calls=400
Run #3: extra-mem=2.1M, per-call=3.5K, calls=600
Run #4: extra-mem=2.7M, per-call=3.5K, calls=800
Run #5: extra-mem=3.4M, per-call=3.5K, calls=1000
FAIL
```

If, on the other hand, the memory increased on one run (say 200 calls) but decreased on the next run (say 400 calls), then it's clearly a false positive: memory consumption may still be > 0 on the second run, but if it's lower than the previous run with fewer repetitions, it cannot possibly represent a leak (just a fluctuation):

```
psutil.tests.test_memory_leaks.TestModuleFunctionsLeaks.test_net_connections ... 
Run #1: extra-mem=568.0K, per-call=2.8K, calls=200
Run #2: extra-mem=24.0K, per-call=61.4B, calls=400
OK
```
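The pass/fail decision described above can be sketched as follows (an illustrative sketch, not the actual psutil code; `measure_extra_mem` is a hypothetical callable returning the extra memory observed after N calls):

```python
def looks_like_leak(measure_extra_mem, base_calls=200, retries=5):
    # Fail only if extra memory increases monotonically across runs,
    # each run using more repetitions than the previous one.
    prev = None
    for i in range(1, retries + 1):
        extra = measure_extra_mem(base_calls * i)
        if prev is not None and extra <= prev:
            return False  # memory went down with more calls: fluctuation
        prev = extra
    return True  # kept increasing on every run: legitimate leak
```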

Note about mallinfo()
================

Aka #1275. `mallinfo()` on Linux is supposed to provide memory metrics about how many bytes get allocated on the heap by `malloc()`, so it's supposed to be way more precise than RSS and also [USS](http://grodola.blogspot.com/2016/02/psutil-4-real-process-memory-and-environ.html). In another branch where I exposed it, I verified that fluctuations still occur even when using `mallinfo()`, though less often. So that means even `mallinfo()` would not grant 100% stability.
@crusaderky
Contributor

I have a real life use case in dask.distributed. The package really struggles right now to tell apart genuine memory leaks from freed memory that hasn't been returned to the OS yet. This measure is used in heuristics for memory rebalancing and OOM safety-net systems.

Using the demo workbook attached to dask/distributed#4774 I can reliably produce such a "leak", where I allocate a bunch of large-ish numpy arrays (160 KiB each) and then free them after a few seconds. After that operation, my GUI reads:

RSS: 1244 MiB
managed memory (Python objects tracked by dask and measured with sizeof): 749 MiB
unmanaged memory (RSS - managed: memory leaks, modules, global variables, and unreleased memory): 495 MiB

memory_full_info says nothing interesting; note how uss and rss are almost the same:

pfullmem(rss=1304870912, vms=2051538944, shared=31059968, text=2240512, lib=0, data=1326317568, dirty=0, uss=1274175488, pss=1275800576, swap=0)

however, if I run this on the process:

```python
import ctypes
libc = ctypes.CDLL("libc.so.6")
libc.malloc_stats()
```

I read:

```
Total (incl. mmap):
system bytes     = 1238429696
in use bytes     =  797209696
```

which is exactly the information I need.
Thanks to the malloc_stats output I can get these numbers out:

RSS: 1244 MiB
managed: 749 MiB
not in use: (1181 - 760) = 421 MiB
unmanaged: (1244 - 749 - 421) = 74 MiB
managed+unmanaged: (749+74) = 823 MiB

When I run my rebalancing and anti-OOM algorithms, if I had this information I could consider 823 MiB instead of 1244 MiB, knowing that the rest will be reused at the next malloc.

macOS Big Sur has exactly the same problem; I don't know where to get the same information there, though.
I could not reproduce the issue on Windows.

CC @fjetter @gjoseph92

@giampaolo
Owner

if I run this on the process:
import ctypes
libc = ctypes.CDLL("libc.so.6")
libc.malloc_stats()
I read:

Total (incl. mmap):
system bytes = 1238429696
in use bytes = 797209696

On my Linux system I get this:

```
Arena 0:
system bytes     =     790528
in use bytes     =     722832
Total (incl. mmap):
system bytes     =     942080
in use bytes     =     874384
max mmap regions =          1
max mmap bytes   =     151552
0
```

Which one of these values should we expose, in your opinion? Which one is useful to detect a memory leak?

@giampaolo
Owner

For the record, there's an old experimental branch where I exposed mallinfo on Linux and _heapwalk on Windows:
https://github.com/giampaolo/psutil/compare/malloc-info?expand=1

@giampaolo
Owner

giampaolo commented May 27, 2021

Looking at malloc_stats: it can only stream its output to stderr, which is a major pain. It seems the evolution of malloc_stats is malloc_info() (https://man7.org/linux/man-pages/man3/malloc_info.3.html). Quote:

The malloc_info() function exports an XML string that describes the current state of the memory-allocation implementation in the caller. [...] The open_memstream(3) function can be used to send the output of malloc_info() directly into a buffer in memory, rather than to a file. The malloc_info() function is designed to address deficiencies in malloc_stats(3) and mallinfo(3).

The data returned, though, is undocumented and completely different from malloc_stats', and I'm not sure what to make of it.

In summary we have:

  • mallinfo: it apparently returns the most useful info in a nice struct, but it's deprecated and completely unsuitable for 64-bit machines
  • malloc_stats: it returns info (sort of) similar to mallinfo but it's streamed over stderr, making it unsuitable for a library, as we would have to mess with stderr without the user's consent
  • malloc_info: undocumented

These must be the worst designed APIs in Linux.
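For what it's worth, capturing malloc_info()'s XML without touching a real file can be sketched via open_memstream through `ctypes` (an illustrative sketch, glibc-only, not a psutil API):

```python
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.open_memstream.restype = ctypes.c_void_p  # returns a FILE *

buf = ctypes.c_char_p()
size = ctypes.c_size_t()
# Open an in-memory FILE* whose contents land in `buf` after fclose().
stream = libc.open_memstream(ctypes.byref(buf), ctypes.byref(size))
libc.malloc_info(0, ctypes.c_void_p(stream))  # write the XML into the stream
libc.fclose(ctypes.c_void_p(stream))

xml = ctypes.string_at(buf, size.value)  # e.g. b'<malloc version="1">...'
libc.free(buf)  # open_memstream malloc'd the buffer; we must free it
```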

@crusaderky
Contributor

crusaderky commented May 28, 2021

Which one of these values should we expose, in your opinion? Which one is useful to detect a memory leak?

The difference between these two is the amount of memory that will be released if you invoke malloc_trim(0), or that will likely be reused on the next malloc:

```
Total (incl. mmap):
system bytes     =     942080
in use bytes     =     874384
```

I'm unsure what "system bytes" means. Note that it's slightly lower than USS.
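For context, malloc_trim can be invoked from Python the same way as malloc_stats above (glibc-only; per malloc_trim(3) it returns 1 if some memory was actually released back to the system, 0 otherwise):

```python
import ctypes

libc = ctypes.CDLL("libc.so.6")
# Ask glibc to return free heap memory to the OS; 0 = keep no extra padding.
released = libc.malloc_trim(0)
print("memory released:", bool(released))
```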

@crusaderky
Contributor

@giampaolo glibc 2.33 (released 2021-02-01) adds mallinfo2, which replaces mallinfo and supports >2GiB
https://man7.org/linux/man-pages/man3/mallinfo.3.html
Note that ubuntu 20.04 currently ships glibc 2.31.
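A sketch of probing for mallinfo2 with a graceful fallback (illustrative only; the struct mirrors mallinfo but with size_t fields, per mallinfo(3), so it doesn't wrap past 2 GiB):

```python
import ctypes

class Mallinfo2(ctypes.Structure):
    # Same fields as mallinfo, but size_t-sized: no 2 GiB wraparound.
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]

libc = ctypes.CDLL("libc.so.6")
try:
    libc.mallinfo2.restype = Mallinfo2
    info = libc.mallinfo2()
except AttributeError:
    info = None  # glibc < 2.33: would fall back to mallinfo()/malloc_info()
```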

@crusaderky
Contributor

crusaderky commented May 28, 2021

Notes:

  1. This comment in the man page of mallinfo (linked above) seems incorrect to me. Reading the source code, it's clearly summing all arenas, i.e. it reports the same as the "Total (incl. mmap)" paragraph of malloc_stats:

       Information is returned for only the main memory allocation area.
       Allocations in other arenas are excluded.  See malloc_stats(3)
       and malloc_info(3) for alternatives that include information
       about other arenas.

  2. Also by reading the git tip of the glibc source code, I spotted that the "Total" paragraph of malloc_stats will experience an integer overflow when you get beyond 4 GiB.

I'm opening bug reports for both issues.

[EDIT]
actually, all measures from malloc_stats will break when you go beyond 4 GiB, not just the total.
Bug reports:

  1. https://sourceware.org/bugzilla/show_bug.cgi?id=27928
  2. https://sourceware.org/bugzilla/show_bug.cgi?id=21556

@giampaolo
Owner

@giampaolo glibc 2.33 (released 2021-02-01) adds mallinfo2, which replaces mallinfo and supports >2GiB
https://man7.org/linux/man-pages/man3/mallinfo.3.html

Sweet! I missed that. It's so recent that we can't assume it's available, though. When mallinfo2 is not available we should use malloc_info in order to mimic the same results. I'm not sure if that's possible, but malloc_info's source code is here:
https://github.com/bminor/glibc/blob/master/malloc/malloc.c
Also, another interesting link:
https://bitbucket.org/einsteintoolkit/tickets/issues/2352/support-64-bit-numbers-for

@crusaderky
Contributor

crusaderky commented May 28, 2021

Does anybody have a clue where to get the same information on Mac? The problem there seems even more pronounced, e.g. RSS hardly ever deflates.

@giampaolo
Owner

[EDIT]
actually, all measures from malloc_stats will break when you go beyond 4 GiB, not just the total.
Bug reports:
https://sourceware.org/bugzilla/show_bug.cgi?id=27928
https://sourceware.org/bugzilla/show_bug.cgi?id=21556

Ouch! I just saw your edit. It seems all APIs are bugged one way or another. :-\

@crusaderky
Contributor

To my knowledge malloc_info isn't buggy...
