Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experimental: add valgrind test to flux-sched, fix some leaks #328

Merged
merged 11 commits into from Apr 30, 2018

Conversation

Projects
None yet
4 participants
@grondo
Copy link
Contributor

grondo commented Apr 29, 2018

This somewhat experimental PR adds the t5000-valgrind.t to flux-sched, and fixes some of the leaks found. The valgrind test has the following differences from flux-core:

  • runs with test_expect_failure because there are still many leaks. Once all detected leaks are fixed the test can be updated.
  • A hacky attempt to determine if flux-broker was built with the dlclose valgrind wrapper is added, and the test is skipped if the symbol ZZ_Za_dlclose isn't found.
  • The path to flux-broker is determined via $(PATH=${FLUX_EXEC_PATH} /usr/bin/which flux-broker). This should be the flux-broker for the version of flux-core under which flux-sched was configured.
  • An extra test is added to ensure the found path to flux-broker is executable.

The valgrind-workload.sh is also just a copy from flux-core, this could be extended for flux-sched.

The leaks fixed here were all fairly obvious given valgrind output in #327 . One thing that happened was when shortjson.h was switched to shortjansson.h, the fact that Jtostr() now required its argument to be freed was missed. I would actually suggest that the functions in shortjansson.h should be avoided on future development in flux-sched.

This isn't ready for merge yet. I may need to iterate on the valgrind test implementation.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 29, 2018

Coverage Status

Coverage increased (+0.02%) to 75.228% when pulling 00aa18d on grondo:valgrind into cb78d4a on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Apr 29, 2018

Codecov Report

Merging #328 into master will increase coverage by 0.02%.
The diff coverage is 84.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #328      +/-   ##
==========================================
+ Coverage   73.65%   73.68%   +0.02%     
==========================================
  Files          56       56              
  Lines        9785     9811      +26     
==========================================
+ Hits         7207     7229      +22     
- Misses       2578     2582       +4
Impacted Files Coverage Δ
src/common/libutil/shortjansson.h 88.33% <ø> (ø) ⬆️
rdl/rdl.c 72.47% <0%> (-0.34%) ⬇️
resrc/resrc.c 83.39% <100%> (+0.08%) ⬆️
resrc/resrc_tree.c 86.83% <100%> (ø) ⬆️
sched/rs2rank.c 93.75% <100%> (+0.47%) ⬆️
sched/sched.c 74.03% <77.77%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb78d4a...00aa18d. Read the comment docs.

@grondo grondo force-pushed the grondo:valgrind branch from 754acea to b4af0f0 Apr 29, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 29, 2018

@dongahn: There's one last minor flux-sched leak detected by valgrind for which I think I'll need your help:

==2426630== 40 (24 direct, 16 indirect) bytes in 1 blocks are definitely lost in loss record 23 of 45
==2426630==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2426630==    by 0x8C6D72D: xzmalloc (xzmalloc.c:39)
==2426630==    by 0x8C66371: partition_new (rs2rank.c:114)
==2426630==    by 0x8C66371: rs2rank_tab_update (rs2rank.c:225)
==2426630==    by 0x8C66EBB: rsreader_hwloc_load (rsreader.c:276)
==2426630==    by 0x8C62BCA: build_hwloc_rs2rank (sched.c:671)
==2426630==    by 0x8C65AB6: load_resources (sched.c:773)
==2426630==    by 0x8C65AB6: mod_main (sched.c:2296)
==2426630==    by 0x40AB18: module_thread (module.c:158)
==2426630==    by 0x59D1183: start_thread (pthread_create.c:312)
==2426630==    by 0x5EE537C: clone (clone.S:111)
==2426630== 

Otherwise, once a couple libjsc leak fixes are landed in flux-core, this valgrind test at least will run clean for flux-sched.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Apr 30, 2018

@grondo: My patch for partition leak is at 376a694

I wasn't sure how to test this with your t5000-valgrind.t so I will appreciate if you can cherry pick and test this under your environment.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 30, 2018

Thanks @dongahn, your fix appeared to work, and along with jsc fixes in core, the valgrind test runs clean:

PASS: t5000-valgrind.t 1 - found executable flux-broker
XPASS: t5000-valgrind.t 2 - valgrind reports no new errors on single broker run # TODO known breakage vanished

I'll update this PR with your fix, plus I will remove the test_expect_failure from the test so that future leaks which are detected and cause testsuite to fail.

grondo and others added some commits Apr 28, 2018

sched: fix leak of api object in req_tpexec_allocate
Fix leak detected by valgrind in req_tpexec_allocate().
resrc_api_map_t objects are allocated within this function but
never freed. Free them on exit from the function.
resrc: fix leak in resrc_tree_serialize_lite
In resrc_tree_serialize_lite() json_t object m_o is not freed. The
object is appended to the `gather` array with json_array_append --
use json_array_append_new() instead, which allows the array to steal
the reference to the object.
libutil: remove const qualifier on shortjansson Jtostr
Problem: The shortjansson.h version of Jtostr() uses json_dumps()
which returns memory that must be freed by the caller. The const
qualifier on this function therefore encourages memory leaks.
This was probably in place because the equivalent function in json-c
*did* return a const char *.

Remove the const qualifier from this function and update the comment.
Actually, it would probably be better to remove Jtostr() altogether,
but that is left for future work.
rdl: fix leak from Jtostr
Caller must free the result of Jtostr() to avoid leaking the resulting
memory. Store the result and free it after use.
sched: fix memory leaks in use of Jtostr()
When shortjson.h was replaced with shortjansson.h, the result of
Jtostr() went from a const char * which didn't need to be freed
by the caller, to a malloc'd string that *does* need to be freed
by the caller.

Clean up all uses of Jtostr() and be sure to free the result.
sched: fix leak in error path in job_status_cb
In job_status_cb, the jcb json object was leaked in a couple
of the error paths. Use goto instead to ensure jcb is always
freed.
resrc: fix leak in resrc_api_fini
The outer ctx container for the resrc_api_ctx_t object was
not freed in resrc_api_fini() (though all the inner objects were).
Add 'free (ctx)' to close the memleak.
resrc: fix leak in resrc_to_json_lite
Problem: xasprintf() was used as a parameter to Jadd_str() in
resrc_to_json_lite(), so the returned memory was leaked.

Capture the output of xasprintf() instead and free it after use.
t/t5000-valgrind.t: add valgrind test for flux-sched
Add the t5000-valgrind.t test from flux-core to flux-sched, with
some minor changes:

 - Run valgrind with the `flux` binary in the current PATH. This should
   work for `make check` with flux-sched.
 - Run the flux-broker found in the last component of FLUX_EXEC_PATH.
   This should point to the flux-broker for the version of flux-core
   under which flux-sched was configured.
 - Instead of testing for valgrind/valgrind.h, attempt to check that
   the flux-broker in FLUX_EXEC_PATH was built with valgrind wrap
   for dlclose() by searching for the ZZ_Za_dlclose symbol in the
   binary.
 - The test is run with `test_expect_failure` since leaks are currently
   extant in the flux-sched codebase.

The above are admittedly unimaginative hacks.
travis-ci: add valgrind
flux-sched now has a valgrind test, add the valgrind package to
Travis CI.

@grondo grondo force-pushed the grondo:valgrind branch from 8929949 to 00aa18d Apr 30, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 30, 2018

Ok, added @dongahn's fix and updated valgrind test to expect success.

FYI this will fail travis (or should fail), until flux-framework/flux-core#1494 is merged.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Apr 30, 2018

LGTM across the board. One question. shortjansson.h is only used in sched? Not in core?

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Apr 30, 2018

flux-framework/flux-core#1494 has been merged. So, restarted the builds. When Travis reports, I will merge.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 30, 2018

One question. shortjansson.h is only used in sched? Not in core?

Yes, I take the blame for that one. This replacement of shortjson.h with shortjansson.h was a bad shortcut taken in e282c89

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Apr 30, 2018

@grondo: Thanks. I just asked because I just wasn't sure -- when I searched the flux-core code basis, I couldn't find it. IMO, this was a good choice to minimize a churn in sched. We'll ultimately convert the whole thing to jansson in a near future.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 30, 2018

We'll ultimately convert the whole thing to jansson in a near future.

Thanks, good plan. For new code shortjson/shortjansson.h should be avoided.

@dongahn dongahn merged commit a6c99a5 into flux-framework:master Apr 30, 2018

4 checks passed

codecov/patch 84.21% of diff hit (target 73.65%)
Details
codecov/project 73.68% (+0.02%) compared to cb78d4a
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.02%) to 75.228%
Details

@grondo grondo deleted the grondo:valgrind branch Apr 30, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 30, 2018

Thanks!

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Apr 30, 2018

Thank YOU!

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Apr 30, 2018

@grondo: I merged your change into my dev branch and somehow t5000-valgrind.t is failing on quartz. Is there a way to see the valgrind output from this test?

ERROR: t5000-valgrind.t - missing test plan
ERROR: t5000-valgrind.t - exited with status 1
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 30, 2018

Sorry @dongahn, I missed this comment until now. Try:

$ cd t
$ make check TESTS=t5000-valgrind.t FLUX_TESTS_LOGFILE=t debug=t

Then look in t5000-valgrind.output

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 30, 2018

Sorry running the tests in flux-sched and saving the output is a bit convoluted, but this is because there is a lot of environment setup in t/Makefile.am for the tests. We should consider how to make it so that tests can by run by hand as in flux-core, e.g.

./t5000-valgrind.t -d -v
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.