Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Valgrind fixes #1076

Merged
merged 22 commits into from May 26, 2017

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented May 23, 2017

These are some fairly minor (mostly single call per broker invocation) memory leak fixes that I found when running valgrind.sh.

Not significant, but at least it reduces the noise coming out of that test.

The remaining leaks are all in modules, but I haven't quite yet figured out how to make valgrind provide useful data. The reports look like this

==30194==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==30194==    by 0xFA644A8: ???
==30194==    by 0xFA644DD: ???
==30194==    by 0xFA66719: ???
==30194==    by 0xFA670A0: ???
==30194==    by 0x1086DEFF: ???
==30194==    by 0xF62E2FC: ???
==30194==    by 0xF6279C6: ???
==30194==    by 0xF3E078A: ???
==30194==    by 0xF3E12FB: ???
==30194==    by 0x40AFDB: module_thread (module.c:158)
==30194==    by 0x59B16B9: start_thread (pthread_create.c:333)
==30194==    by 0x5ED182C: clone (clone.S:109)

so basically the module main is called (module.c:158) and something was leaked but we don't know which module it was.

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 23, 2017

Coverage Status

Coverage increased (+0.02%) to 78.089% when pulling 1a21796 on garlick:valgrind_fixes into 32268ce on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented May 23, 2017

Codecov Report

Merging #1076 into master will increase coverage by 0.03%.
The diff coverage is 65.85%.

@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
+ Coverage   77.95%   77.98%   +0.03%     
==========================================
  Files         151      151              
  Lines       26109    26123      +14     
==========================================
+ Hits        20352    20371      +19     
+ Misses       5757     5752       -5
Impacted Files Coverage Δ
src/modules/aggregator/aggregator.c 78.24% <ø> (+1.25%) ⬆️
src/common/libflux/dispatch.c 83.55% <100%> (+0.31%) ⬆️
src/broker/heartbeat.c 88.4% <100%> (+0.34%) ⬆️
src/broker/sequence.c 93.75% <100%> (ø) ⬆️
src/common/liboptparse/optparse.c 89.1% <100%> (ø) ⬆️
src/broker/rusage.c 62.85% <100%> (+1.09%) ⬆️
src/modules/kvs/kvs.c 79.79% <25%> (+0.46%) ⬆️
src/modules/resource-hwloc/resource.c 71.33% <66.66%> (-0.3%) ⬇️
src/broker/broker.c 72.76% <82.35%> (+0.17%) ⬆️
src/common/libflux/handle.c 84.41% <0%> (-0.29%) ⬇️
... and 6 more
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 23, 2017

I was able to get valid stack trace from modules by temporarily commenting out the call to dlclose() in module_destroy

Apparently valgrind does stack trace analysis after the program exits, and for any leaks in dlopen'd code valgrind won't be able to find valid address if you've called dlclose on the plugin/module.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 23, 2017

oops, here was one source of noise:

diff --git a/src/modules/aggregator/aggregator.c b/src/modules/aggregator/aggregator.c
index fced1b3..076bb25 100644
--- a/src/modules/aggregator/aggregator.c
+++ b/src/modules/aggregator/aggregator.c
@@ -546,6 +546,7 @@ int mod_main (flux_t *h, int argc, char **argv)
 done_delvec:
     flux_msg_handler_delvec (htab);
 done:
+    aggregator_destroy (ctx);
     return rc;
 }
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented May 23, 2017

Nice! Thanks, that works. I'll try to run down a few more then.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 23, 2017

Cool, thanks!

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented May 23, 2017

I pushed a couple more minor incidental memory related fixes, but think I've reached the point of diminishing returns on what's left in the valgrind report.

==40137== 88 bytes in 1 blocks are definitely lost in loss record 35 of 92
==40137==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==40137==    by 0xFA644A8: lt__malloc (in /usr/lib/x86_64-linux-gnu/libltdl.so.7.3.1)
==40137==    by 0xFA644DD: lt__zalloc (in /usr/lib/x86_64-linux-gnu/libltdl.so.7.3.1)
==40137==    by 0xFA66719: ??? (in /usr/lib/x86_64-linux-gnu/libltdl.so.7.3.1)
==40137==    by 0xFA670A0: lt_dlopenadvise (in /usr/lib/x86_64-linux-gnu/libltdl.so.7.3.1)
==40137==    by 0x1086DEFF: ???
==40137==    by 0xF62E2FC: ??? (in /usr/lib/x86_64-linux-gnu/libhwloc.so.5.6.8)
==40137==    by 0xF6279C6: hwloc_topology_init (in /usr/lib/x86_64-linux-gnu/libhwloc.so.5.6.8)
==40137==    by 0xF3E079A: ctx_hwloc_init (resource.c:68)
==40137==    by 0xF3E130B: resource_hwloc_ctx_create (resource.c:140)
==40137==    by 0xF3E130B: mod_main (resource.c:547)
==40137==    by 0x40AFDB: module_thread (module.c:158)
==40137==    by 0x59B16B9: start_thread (pthread_create.c:333)
==40137==    by 0x5ED182C: clone (clone.S:109)
==40137==
==40137== 512 bytes in 1 blocks are possibly lost in loss record 70 of 92
==40137==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==40137==    by 0x1473D1AE: list_alloc_aux (list.c:786)
==40137==    by 0x1473D3F7: list_node_alloc (list.c:742)
==40137==    by 0x1473D3F7: list_node_create (list.c:669)
==40137==    by 0x1473DC20: list_append (list.c:306)
==40137==    by 0x1473AFB9: optparse_add_option (optparse.c:946)
==40137==    by 0x1473B106: optparse_create (optparse.c:707)
==40137==    by 0x144FC433: getctx (userdb.c:105)
==40137==    by 0x144FC433: mod_main (userdb.c:368)
==40137==    by 0x40AFDB: module_thread (module.c:158)
==40137==    by 0x59B16B9: start_thread (pthread_create.c:333)
==40137==    by 0x5ED182C: clone (clone.S:109)
==40137==
==40137== 7,239 (16 direct, 7,223 indirect) bytes in 1 blocks are definitely lost in loss record 90 of 92
==40137==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==40137==    by 0x4C2FDEF: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==40137==    by 0x4E66189: ev_realloc (ev.c:1741)
==40137==    by 0x4E6AC25: ev_idle_start (ev.c:4611)
==40137==    by 0x4E65FD2: ev_invoke_pending (ev.c:3314)
==40137==    by 0x4E69A8D: ev_run (ev.c:3606)
==40137==    by 0x4E485AC: flux_reactor_run (reactor.c:135)
==40137==    by 0xCEC5BA9: mod_main (kvs.c:1744)
==40137==    by 0x40AFDB: module_thread (module.c:158)
==40137==    by 0x59B16B9: start_thread (pthread_create.c:333)
==40137==    by 0x5ED182C: clone (clone.S:109)
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 23, 2017

Yeah, the list node pool and dlopen "leaks" are known issues. I wonder if we should figure out how to include a valgrind suppressions file for those. Then we'd be close to having a test we could add to automated testsuite?

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 23, 2017

Coverage Status

Coverage decreased (-0.005%) to 78.061% when pulling 700aa0a on garlick:valgrind_fixes into 32268ce on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 23, 2017

I ran with --gen-suppressions=all on valgrind and added the known leaks we want suppress to the existing valgrind.supp file:

diff --git a/src/test/valgrind.sh b/src/test/valgrind.sh
index 51e7e2f..93fd31b 100755
--- a/src/test/valgrind.sh
+++ b/src/test/valgrind.sh
@@ -15,6 +15,7 @@ ${top_srcdir}/src/cmd/flux /usr/bin/valgrind \
     --child-silent-after-fork=yes \
     --num-callers=30 \
     --leak-resolution=med \
+    --gen-suppressions=all \
     --suppressions=valgrind.supp \
     ${top_srcdir}/src/broker/.libs/lt-flux-broker \
     ./valgrind-workload.sh ${NJOBS}
diff --git a/src/test/valgrind.supp b/src/test/valgrind.supp
index 0c4d23d..be98911 100644
--- a/src/test/valgrind.supp
+++ b/src/test/valgrind.supp
@@ -12,4 +12,54 @@
    fun:send
    ...
 }
-
+{
+   <libev_io_start_realloc>
+   Memcheck:Leak
+   match-leak-kinds: indirect
+   fun:malloc
+   fun:realloc
+   fun:ev_realloc
+   fun:fd_change
+   fun:ev_io_start
+   fun:ev_invoke_pending
+   fun:ev_run
+   ...
+}
+{
+   <libltdl_known_leak>
+   Memcheck:Leak
+   match-leak-kinds: definite
+   fun:malloc
+   fun:lt__malloc
+   fun:lt__zalloc
+   obj:*/libltdl.so.7.3.1
+   fun:lt_dlopenadvise
+   ...
+}
+{
+   <list_node_alloc>
+   Memcheck:Leak
+   match-leak-kinds: possible
+   fun:malloc
+   fun:list_alloc_aux
+   fun:list_node_alloc
+   fun:list_node_create
+   fun:list_append
+   ...
+}
+{
+   <ipaddr_getprimary>
+   Memcheck:Leak
+   match-leak-kinds: definite
+   fun:malloc
+   obj:*
+   obj:*
+   obj:*
+   obj:*
+   fun:gaih_inet.constprop.5
+   fun:getaddrinfo
+   fun:ipaddr_getprimary
+   fun:calc_endpoint.isra.8
+   fun:boot_pmi
+   fun:main
+}

However, I still see this one leak(?) that I'm not sure about (on this branch I think)

==12501== 7,239 (16 direct, 7,223 indirect) bytes in 1 blocks are definitely lost in loss record 88 of 89
==12501==    at 0x4C2EB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12501==    by 0x4E473E6: flux_reactor_create (reactor.c:88)
==12501==    by 0x4E474E6: flux_get_reactor (reactor.c:119)
==12501==    by 0x1156AF: modservice_register (modservice.c:226)
==12501==    by 0x113998: module_thread (module.c:144)
==12501==    by 0x578E6C9: start_thread (pthread_create.c:333)
==12501==    by 0x5CB10AE: clone (clone.S:105)
==12501== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:calloc
   fun:flux_reactor_create
   fun:flux_get_reactor
   fun:modservice_register
   fun:module_thread
   fun:start_thread
   fun:clone
}

If we get this one analyzed and either fixed or suppressed, we could run valgrind with --error-exitcode=1 to get a pass/fail test for new leaks. (Oh, though we'd also have to figure out how to avoid dlclose at runtime only under valgrind)

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented May 23, 2017

That last one has the same number of bytes leaked as the last leak I posted above, which also seemed to be reactor-allocated storage. The one above entered through the kvs.

Maybe a watcher is not being destroyed when the module exits and is holding a refcount on the reactor so it's not freed?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 23, 2017

That is what I assumed as well. However, there is not any good way to track down even the module to which it belongs.

@grondo grondo force-pushed the garlick:valgrind_fixes branch from 700aa0a to b6900f0 May 24, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 24, 2017

Coverage Status

Coverage increased (+0.01%) to 78.184% when pulling b6900f0 on garlick:valgrind_fixes into e6560ce on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 25, 2017

Pushed some experimental fixes here:

  1. Add an intercept function for dlclose when running under valgrind, when valgrind/valgrind.h is available. When compiled this way, dlclose is avoided (the wrapped function does nothing) only when running under valgrind, so that valgrind can report stack traces for broker modules.
  2. Add a few known memory leaks to valgrind suppression file
  3. A possible fix for dispatch and reactor leakage when fastpath is activated for a dispatcher. Honestly, i didn't spend enough time to understand the fastpath code in dispatch.c, but I did track down the final leak reported by valgrind.sh to something holding a reference to the dispatcher which in turn held a reference to the reactor. I still don't understand why fastpath branch in flux_msg_handler_destroy calls free_msg_handler while the other branch just sets w->destroyed, but it did seem like a call to dispatch_usecount_decr() was missing here. In any case, @garlick can you review c8db59a?
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 25, 2017

BTW, on the current branch valgrind.sh runs "clean" (at least on my desktop). So we could turn this into an automated test (with possibly limited scope, but perhaps it would be helpful)

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage increased (+0.01%) to 78.181% when pulling 9827a6c on garlick:valgrind_fixes into e6560ce on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented May 25, 2017

That does indeed look right! Good catch.

To explain the other question: The fastpath handlers aren't in the dispatcher's zlist_t that it walks in its dispatch_message() loop, so there is not a need to defer their removal. Although I'm not entirely certain that this is even necessary. I had been operating under the assumption that zlist was not remove-safe during traversal, but as discussed in #1066, they actually are.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 25, 2017

To explain the other question: The fastpath handlers aren't in the dispatcher's zlist_t that it walks in its dispatch_message() loop, so there is not a need to defer their removal.

Ah, ok, maybe I'll add a comment to this effect, perhaps with a FIXME or something to clean up normal handler removal in the future?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented May 25, 2017

Ah, ok, maybe I'll add a comment to this effect, perhaps with a FIXME or something to clean up normal handler removal in the future?

Sounds fine.

Wiring valgrind into make check will be really valuable going forward I think, and incidentally satisfies #161 which I closed as hopeless a while back :-)

@grondo grondo force-pushed the garlick:valgrind_fixes branch from 8a49908 to 36f6958 May 25, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage decreased (-0.008%) to 78.162% when pulling 8a49908 on garlick:valgrind_fixes into e6560ce on flux-framework:master.

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage decreased (-0.05%) to 78.176% when pulling b98c7e5 on garlick:valgrind_fixes into fefa109 on flux-framework:master.

@grondo grondo force-pushed the garlick:valgrind_fixes branch from b98c7e5 to 971d787 May 25, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage decreased (-0.02%) to 78.206% when pulling 971d787 on garlick:valgrind_fixes into fefa109 on flux-framework:master.

@grondo grondo force-pushed the garlick:valgrind_fixes branch from fd7dbba to 933a7ea May 25, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage decreased (-0.02%) to 78.202% when pulling fd7dbba on garlick:valgrind_fixes into fefa109 on flux-framework:master.

@grondo grondo force-pushed the garlick:valgrind_fixes branch from 933a7ea to abba9ad May 25, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage decreased (-0.03%) to 78.195% when pulling 933a7ea on garlick:valgrind_fixes into fefa109 on flux-framework:master.

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage decreased (-0.03%) to 78.191% when pulling abba9ad on garlick:valgrind_fixes into fefa109 on flux-framework:master.

garlick added some commits May 23, 2017

modules/aggregator: fix mem leak in main
Fix the following memory leak identified by valgrind.sh test,
with dlclose() commented out in module_destroy().  Credit goes
to grondo for both the dlclose() hack and this particular fix.

==32072== 2,168 (40 direct, 2,128 indirect) bytes in 1 blocks are
definitely lost in loss record 85 of 92
==32072==    at 0x4C2FB55: calloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==32072==    by 0xD90D625: aggregator_create (aggregator.c:411)
==32072==    by 0xD90D625: mod_main (aggregator.c:533)
==32072==    by 0x40AFDB: module_thread (module.c:158)
==32072==    by 0x59B16B9: start_thread (pthread_create.c:333)
==32072==    by 0x5ED182C: clone (clone.S:109)
modules/kvs: handle ENOMEM in initialization
Problem: getctx() handles ENOMEM in all cases except the
allocation of the context struct, where it calls xzmalloc()
which asserts on ENOMEM.

Replace xzmalloc() with calloc() and check for failure.
modules/kvs: avoid mem leaks on reactor failure
Problem: main() doesn't unregister message handlers if the
reactor exits with an error.

Restructure main()'s error path so this case is handled.
modules/kvs: adjust white space to fit in 80 cols
Just a cosmetic change - adjust white space so message
handler static initialization fits in 80 cols.
modules/resource-hwloc: check hwloc_topology_init
Problem: return value of hwloc_topology_init() in
topo_request_cb() is not checked.

Handle and log the error.

@grondo grondo force-pushed the garlick:valgrind_fixes branch from cdb2b18 to 043168b May 25, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage increased (+0.03%) to 78.223% when pulling 043168b on garlick:valgrind_fixes into 2bafdad on flux-framework:master.

grondo added some commits May 25, 2017

broker: avoid calling dlclose() under valgrind
Use valgrind function wrapping to intercept calls to dlclose()
and do nothing, so that valgrind can properly determine stack trace
from any broker modules that are leaking memory.

Because we use the special valgrind function wrapping symbols, this
change should have no effect under normal broker execution.
libflux/dispatch: fix possible refcount bug in fastpath
In msghandler destruction of fastpath, dispatcher refcount
should be decremented to avoid leaking the dispatcher, and in turn,
the associated reactor.
test/valgrind.supp: Add to valgrind suppressions
Add valgrind suppressions for known uninteresting leaks:

 * libev internal appreantly harmless leak
 * lsd-tools list implementation internal list node pool
 * libltdl internal malloc
testsuite: add simple valgrind test for flux-broker
Remove src/test/valgrind.sh and replace with a sharness script that
runs valgrind and ensures no new errors are reported from a single
broker run under valgrind.

Move the suppressions file and valgrind-workload.sh script under t/.

To see valgrind output, check the t5000-valgrind.t logfile, or run
it with `./t5000-valgrind.t -d -v`.
libflux/dispatch: add comment in flux_msg_handler_destroy
Add a comment explaining the difference in destruction of fastpath
and non-fastpath msg handlers. Make a note that the premise for
this difference should be reevaluated in case the code can be cleaned up.
optparse: silence invalid read from optstring
optparse internal optstring starts with a malloc of 1 byte,
but the initial optimized strlen() may read more than one byte
at a time will trigger an error from valgrind.

Silence the error by allocating 4 bytes up front, which will
probably be more efficient anyway.
travis-ci: install valgrind
Install valgrind so that sharness test t5000-valgrind.t runs in
all travis tests.

@grondo grondo force-pushed the garlick:valgrind_fixes branch 2 times, most recently from fe76d5d to 079ecde May 25, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage increased (+0.04%) to 78.235% when pulling fe76d5d on garlick:valgrind_fixes into 2bafdad on flux-framework:master.

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 25, 2017

Coverage Status

Coverage increased (+0.03%) to 78.223% when pulling 079ecde on garlick:valgrind_fixes into 2bafdad on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented May 25, 2017

We had a failure in the clang builder and I'm having a hard time finding what happened.

The cooked output is truncated at the bottom, so I don't get the nice little "after_failure" rolled up output anymore - is that just me? Having a bit of difficulty figuring out what went wrong even after downloading the raw text log and grepping for "FAIL: "... Will have another look later.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 26, 2017

Yeah, this was the same write error during distclean that I've been seeing a lot on this branch:

make[4]: Leaving directory `/home/travis/build/flux-framework/flux-core/flux-core-0.7.0-239-g086abf3/_build/sub/src/common/liboptparse'
make[4]: write error
make[3]: *** [distclean-recursive] Error 1
make[3]: Leaving directory `/home/travis/build/flux-framework/flux-core/flux-core-0.7.0-239-g086abf3/_build/sub/src/common'
make[2]: *** [distclean-recursive] Error 1

(In the raw log if you search for "exited with" this usually takes you to where the build errored)

I'm not sure why but perhaps make -j2 is causing this? I could revert that change and try again, though a restart of the build would also likely be successful. I don't want to destabilize future builds though.

travis-ci: emit config.log last on failure
Move config.log display to last position on error, since arguably
it is the least useful and most verbose output.

@grondo grondo force-pushed the garlick:valgrind_fixes branch from 079ecde to 0fdf588 May 26, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 26, 2017

Coverage Status

Coverage increased (+0.03%) to 78.227% when pulling 0fdf588 on garlick:valgrind_fixes into 2bafdad on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented May 26, 2017

Removal of -j2seems to have made the travis tests more reliable, so let's leave it at that.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented May 26, 2017

Thanks for all this! This should really keep us honest going forward.

@garlick garlick merged commit 655adad into flux-framework:master May 26, 2017

3 of 4 checks passed

codecov/patch 65.85% of diff hit (target 77.95%)
Details
codecov/project 77.98% (+0.03%) compared to 2bafdad
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.03%) to 78.227%
Details

@grondo grondo referenced this pull request Aug 23, 2017

Closed

0.8.0 Release #1160

@garlick garlick deleted the garlick:valgrind_fixes branch Sep 6, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.