Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: enable optional cpu affinity for jobs based on cores assigned in R_lite #1533

Merged
merged 12 commits into from May 30, 2018

Conversation

Projects
None yet
6 participants
@grondo
Copy link
Contributor

grondo commented May 21, 2018

This PR, along with some related cleanup, enables wrexecd to optionally set its cpu affinity mask to the cores assigned in R_lite. The cpu affinity support is not enabled by default, but will only be active if the cpu-affinity option is enabled for the job (or globally)., e.g.

flux wreckrun -n4 -o cpu-affinity ...

The "related cleanup" here was to allow global wreck "job options" to be set in the KVS at the lwj.options level, and to make the options kvs entry a single JSON object instead of a directory (to avoid several back-to-back kvs queries when there are multiple job options set).

In order to allow per-job options to override globally set options, these options are now disabled if they are set to 0 or false or no, e.g.

flux wreckrun -n4 -o cpu-affinity=no ...

For cpu affinity, this means you can enable cpu affinity for all jobs by

$ flux kvs put --json lwj.options='{"cpu-affinity":1}'

And disable for a single job with -o cpu-affinity=no

Fortunately @trws was very thorough and the resource-hwloc module already supports restricting the topology to the current cpumask, so sub-instances will inherit these settings in their hwloc configuration:

$ src/cmd/flux start -s 4
flux-start: warning: setting --bootstrap=selfpmi due to --size option
$ flux hwloc info
4 Machines, 64 Cores, 128 PUs
$ flux wreckrun -n4 flux start flux hwloc info
4 Machines, 64 Cores, 128 PUs
$ flux wreckrun -n4 -o cpu-affinity flux start flux hwloc info
4 Machines, 4 Cores, 4 PUs

One thing I now notice is that the straight use of sched_setaffinity here is literally restricting jobs to the cores in the cores list from R_lite, but should we also be including thread siblings by default?
Before merging let me check if hwloc has an easy way to create a cpumask that includes thread siblings, unless someone notifies me that we don't want to include siblings by default in the cpu affinity mask.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 21, 2018

Ok, @trws came and set me straight. I'll switch cpu affinity to use the hwloc interface which should automatically pick up the physical cores including any PUs (hyperthread sibling).

@coveralls

This comment has been minimized.

Copy link

coveralls commented May 21, 2018

Coverage Status

Coverage decreased (-0.03%) to 79.138% when pulling f9f4bae on grondo:wrexecd-affinity into b1d1b71 on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented May 21, 2018

Codecov Report

Merging #1533 into master will increase coverage by 0.09%.
The diff coverage is 64.7%.

@@            Coverage Diff            @@
##           master   #1533      +/-   ##
=========================================
+ Coverage    78.8%   78.9%   +0.09%     
=========================================
  Files         164     164              
  Lines       30333   30295      -38     
=========================================
- Hits        23905   23904       -1     
+ Misses       6428    6391      -37
Impacted Files Coverage Δ
src/modules/wreck/wrexecd.c 74.87% <62.96%> (+0.48%) ⬆️
src/modules/wreck/rcalc.c 89.3% <71.42%> (+16.43%) ⬆️
src/common/libflux/keepalive.c 86.66% <0%> (-6.67%) ⬇️
src/common/libflux/response.c 83.73% <0%> (-0.82%) ⬇️
src/common/libutil/base64.c 95.07% <0%> (-0.71%) ⬇️
src/broker/overlay.c 73.81% <0%> (-0.32%) ⬇️
src/common/libflux/message.c 81.05% <0%> (-0.24%) ⬇️
src/bindings/lua/flux-lua.c 82.19% <0%> (+0.08%) ⬆️
@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 21, 2018

This looks good to me and I'm happy to merge if you think it's ready.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 21, 2018

This looks good to me and I'm happy to merge if you think it's ready.

Thanks, but use of sched_setaffinity is too simplistic for the use case. I'll have to rework to use some libhwloc magic that will translate a physical cores list to a cpumask that includes the cores and any threads shared with the same physical core.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented May 21, 2018

Fortunately @trws was very thorough and the resource-hwloc module already supports restricting the topology to the current cpumask, so sub-instances will inherit these settings in their hwloc configuration:

BTW, this is very nice!

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 22, 2018

Ok, @trws, I never found the magic to convert a cores bitmap to a PU bitmap in fewer than a dozen hwloc calls.

This approach works, but I'm wondering if I missed something simpler.

static int do_hwloc_core_affinity (struct prog_ctx *ctx)
{
    int rc = -1;
    int depth, i;
    hwloc_topology_t topology;
    hwloc_cpuset_t coreset = NULL;
    hwloc_cpuset_t resultset = NULL;
    flux_t *h = ctx->flux;

    if (hwloc_topology_init (&topology) < 0) {
        flux_log_error (h, "hwloc_topology_init");
        return (-1);
    }
    if (hwloc_topology_load (topology) < 0) {
        flux_log_error (h, "hwloc_topology_load");
        goto out;
    }
    if (!(coreset = hwloc_bitmap_alloc ())
       || !(resultset = hwloc_bitmap_alloc ())) {
        flux_log_error (h, "hwloc_bitmap_alloc");
        goto out;
    }
    if (hwloc_bitmap_list_sscanf (coreset, ctx->rankinfo.cores) < 0) {
        flux_log_error (h, "hwloc_sscanf(%s)", ctx->rankinfo.cores);
        goto out;
    }
    depth = hwloc_get_type_depth (topology, HWLOC_OBJ_CORE);
    if (depth == HWLOC_TYPE_DEPTH_UNKNOWN
       || depth == HWLOC_TYPE_DEPTH_MULTIPLE) {
        flux_log_error (h, "hwloc_get_type_depth (CORE)");
        goto out;
    }
    i = hwloc_bitmap_first (coreset);
    while (i >= 0) {
        hwloc_obj_t core = hwloc_get_obj_by_depth (topology, depth, i);
        if (core)
            hwloc_bitmap_or (resultset, resultset, core->cpuset);
        else
            flux_log_error (h, "hwloc_get_obj_by_depth");
        i = hwloc_bitmap_next (coreset, i);
    }
    if ((rc = hwloc_set_cpubind (topology, resultset, 0)) < 0)
        flux_log_error (h, "hwloc_set_cpubind: %s", strerror (errno));
out:
    hwloc_bitmap_free (resultset);
    hwloc_bitmap_free (coreset);
    hwloc_topology_destroy (topology);
    return (0);
}
$ flux hwloc info
4 Machines, 64 Cores, 128 PUs
$ flux wreckrun -n4 flux start flux hwloc info
4 Machines, 64 Cores, 128 PUs
$ flux wreckrun -o cpu-affinity -n4 flux start flux hwloc info
4 Machines, 4 Cores, 8 PUs
$ flux wreckrun -o cpu-affinity -n4 flux start flux exec grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:	0,16
Cpus_allowed_list:	0,16
Cpus_allowed_list:	0,16
Cpus_allowed_list:	0,16

@grondo grondo force-pushed the grondo:wrexecd-affinity branch from 45e6598 to 20b1f50 May 22, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 22, 2018

Ok, I pushed a fix for test failures when cpu thread_siblibgs_list used a different format than the cpus-allowed script (e.g. 0-1 vs 0,1), and also squashed incremental commits.

If this passes Travis I think it is read to merge.

@trws

This comment has been minimized.

Copy link
Member

trws commented May 22, 2018

Unfortunately at the C level that's about what it amounts to. Using the distribute call might shorten it slightly, but not a great deal since r lite doesn't actually have a cpuset equivalent but a core list.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 28, 2018

Sorry I missed that this was ready to go! It needs a rebase now.

@grondo grondo force-pushed the grondo:wrexecd-affinity branch 2 times, most recently from 41b908a to 04a89bd May 29, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 29, 2018

This is rebased, but I added flux wreck getopt/setopt commands to more easily manipulate the global lwj.options JSON object. The cpu-affinity tests were updated to use flux wreck setopt.

grondo added some commits May 17, 2018

wreck: remove unused affinity Lua plugin
The 02-affinity.lua plugin was not used, and would not work
anymore anyway since the move to R_lite. Remove it.
test: remove affinity checks from t2000-wreck.t
In preparation for removal of old style CPU affinity support in
wrexecd, remove the rank.cpumask and <taskid>.cpumask checks from
t2000-wreck.t.
wreck: remove unused code dealing with rank.N in kvs
The rank.N dirs in the KVS are no longer used for anything,
so remove the ctx->resources kvsdir and all its users in wrexecd.
wreck: allow cpu affinity to be set for jobs
Add a new job option 'cpu-affinity' which, when set, causes
wrexecd to set its cpu-affinity to the assigned cores for
the current rank from R_lite.
wreck: store options as single json object
Problem: The wreck job "options" are stored in the KVS as separate
keys, and their values are fetched back-to-back in a complicated
function in wrexecd.

Store the options more simply as a single JSON object so that it
can be fetched in a single KVS operation, and the gathering of
options simplified.
wreck: allow job options to be unset
Problem: wreck job options are "set only" meaning they can't
be unset or forced off by users.

Allow options to be unset within wrexecd if the options JSON
object has the key set to 0, false, or NULL.
wreck: init job options from global lwj.options
Allow global lwj.options kvs key to set initial job options
for all jobs, which are then overridden optionally by local
job options set on the submit or wreckrun cmdline.
wreck: update struct rcalc_rankinfo
Update the rcalc_rankinfo struct to include the cores list
and cpu_set_t representation for the targeted rank. The core
list and cpu_set_t are copies so that the rank info structure
can be used even after the rcalc_t object is destroyed.
t1999-wreck-rcalc.t: add cores output to tests
Add cores output to t/wreck/rcalc.c as sanity check for
wreck/rcalc tests. Update all expected output accordingly.
t2000-wreck.t: add test for cpu-affinity
Add checks for cpu-affinity set by wreck exec system, via
either global lwj.options or local job `-o cpu-affinity` option.
cmd/flux-wreck: add getopt/setopt command
Add `flux wreck getopt` and `flux wreck setopt` subcommands to more
easily manipulate the JSON `lwj.options` object for global WRECK
options.

@grondo grondo force-pushed the grondo:wrexecd-affinity branch from 04a89bd to f9f4bae May 29, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented May 30, 2018

We'll, Travis is useless today. Will try restarting builders later tonight (getting apt-get errors)

@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 30, 2018

Thanks!

@garlick garlick merged commit ba1f064 into flux-framework:master May 30, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.03%) to 79.138%
Details

@grondo grondo deleted the grondo:wrexecd-affinity branch Jul 27, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.