wreck: initial per-task core affinity support #1603

Merged: 4 commits from grondo:task-affinity into flux-framework:master on Jul 26, 2018

@grondo (Contributor) commented Jul 25, 2018

This PR adds per-task core affinity support to wreck through the use of hwloc_distrib(3) (which, AFAICT, behaves similarly to the hwloc-distrib(1) command).

If -o cpu-affinity=per-task is used, then hwloc_distrib() is called to distribute the N local tasks across the topology (restricted by the assigned core list).

If needed, we could expose further hwloc_distrib() options, but the basic functionality seems to get us about 90% of the way there.

Anyway, I'm throwing this up as a PR for initial testing to see if this approach will work. (I've only done basic testing thus far.)

$ flux wreckrun -n2 -N1 sh -c 'taskset -cp $$'                                                                                            
pid 2490224's current affinity list: 0-7                                                       
pid 2490225's current affinity list: 0-7                                                       
$ flux wreckrun -o cpu-affinity -n2 -N1 sh -c 'taskset -cp $$'                                                                            
pid 2490322's current affinity list: 0-3                                                       
pid 2490323's current affinity list: 0-3                                                       
$ flux wreckrun -o cpu-affinity=per-task -n2 -N1 sh -c 'taskset -cp $$'                                                                   
pid 2490417's current affinity list: 0,1                                                       
pid 2490418's current affinity list: 2,3                                                       
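
For reference, here is a minimal standalone sketch of the hwloc_distrib(3) call pattern described above. This is not the wrexecd code from this PR; the task count and the commented binding step are illustrative only.

```c
/* Minimal sketch of the hwloc_distrib(3) pattern described above.
 * Not the wrexecd implementation; `ntasks` and the binding step are
 * illustrative only. */
#include <hwloc.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    const unsigned ntasks = 2;       /* local tasks on this node */
    hwloc_topology_t topo;
    hwloc_obj_t root;
    hwloc_bitmap_t sets[2];
    unsigned i;

    hwloc_topology_init (&topo);
    hwloc_topology_load (topo);

    /* Distribute ntasks cpusets evenly across the topology under root */
    root = hwloc_get_root_obj (topo);
    for (i = 0; i < ntasks; i++)
        sets[i] = hwloc_bitmap_alloc ();
    hwloc_distrib (topo, &root, 1, sets, ntasks, INT_MAX, 0);

    for (i = 0; i < ntasks; i++) {
        char *str;
        hwloc_bitmap_asprintf (&str, sets[i]);
        printf ("task %u -> cpuset %s\n", i, str);
        /* A per-task binding would then do something like:
         *   hwloc_set_cpubind (topo, sets[i], HWLOC_CPUBIND_PROCESS);
         * in forked task i before exec. */
        free (str);
        hwloc_bitmap_free (sets[i]);
    }
    hwloc_topology_destroy (topo);
    return 0;
}
```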

grondo force-pushed the grondo:task-affinity branch from 48cdfee to 6f5f275 on Jul 25, 2018

@coveralls commented Jul 25, 2018

Coverage Status

Coverage increased (+0.1%) to 79.531% when pulling 48c990a on grondo:task-affinity into c91ac60 on flux-framework:master.

@codecov-io commented Jul 25, 2018

Codecov Report

Merging #1603 into master will increase coverage by 0.13%.
The diff coverage is 92.1%.

@@            Coverage Diff             @@
##           master    #1603      +/-   ##
==========================================
+ Coverage   79.23%   79.36%   +0.13%     
==========================================
  Files         171      171              
  Lines       31341    31378      +37     
==========================================
+ Hits        24832    24903      +71     
+ Misses       6509     6475      -34
Impacted Files Coverage Δ
src/modules/wreck/wrexecd.c 75.86% <92.1%> (+2.96%) ⬆️
src/common/libflux/handle.c 83.66% <0%> (-0.5%) ⬇️
src/bindings/lua/flux-lua.c 82.23% <0%> (+0.08%) ⬆️
src/common/libflux/message.c 80.9% <0%> (+0.11%) ⬆️
src/broker/module.c 84.07% <0%> (+0.27%) ⬆️
src/common/libflux/future.c 90.7% <0%> (+0.44%) ⬆️
src/common/libutil/base64.c 95.77% <0%> (+0.7%) ⬆️
src/common/libutil/dirwalk.c 94.28% <0%> (+0.71%) ⬆️
src/common/libflux/mrpc.c 87.3% <0%> (+1.19%) ⬆️
wreck: add support for '-o cpu-affinity=per-task'
If `-o cpu-affinity=per-task` is set for a job, then hwloc_distrib(3)
is used to distribute the N local tasks across a hwloc topology
which has been restricted to the cores set in R_lite.

Fixes #1600

grondo force-pushed the grondo:task-affinity branch from 6f5f275 to f23712a on Jul 25, 2018

@grondo (Contributor, Author) commented Jul 25, 2018

I have to run down a Travis testing bug, but @SteVwonder, if you could try this out and see whether it solves the "spread" use case, that would be helpful. (I think that is the default hwloc_distrib() behavior.)

Thanks!

grondo added some commits Jul 25, 2018

t2000-wreck: fix prereq test for MULTICORE
Problem: The test for the MULTICORE prereq in the wreck tests was
accidentally broken in f9f4bae, such that the cpu-affinity tests
were no longer running on most systems.

Fix the sense of the test for MULTICORE support so that these tests
are run, and fix a couple of bugs in the tests detected now that the
tests are actually executed.
t2000-wreck.t: add test for cpu-affinity=per-task
Add a simple test verifying operation of `-o cpu-affinity=per-task`

grondo force-pushed the grondo:task-affinity branch from f23712a to e589d26 on Jul 25, 2018

@grondo (Contributor, Author) commented Jul 25, 2018

Ok, testing cpu-affinity=per-task in Travis was a bit of a challenge. I'm not sure what is going on in their VMs, but I had to settle for ensuring that with -N1 -n2 and cpu-affinity=per-task, each task runs on a different cpumask. It should at least give the code some amount of testing.

Along the way I found that the MULTICORE test was incorrect, so we haven't been running any of the cpu-affinity checks (which themselves had bugs in the test scripts). That at least has been fixed now.

@SteVwonder (Member) commented Jul 25, 2018

@grondo: I'll give it a whirl now. Thanks for putting this together!

@grondo (Contributor, Author) commented Jul 25, 2018

Before merging, let me try on a couple different systems this time 😉

@grondo (Contributor, Author) commented Jul 25, 2018

Hm, I'm getting unexpected results when running more than one flux-broker per node (under Slurm, so that cpu affinity is set per-broker).

E.g. on ipa with 4 brokers across 2 nodes:

$ srun -p pall --pty -N2 -n8 /g/g0/grondo/flux/bin/flux start
(flux-SRLTmn) grondo@ipa4:/tmp$ flux hwloc info
8 Machines, 72 Cores, 72 PUs
(flux-SRLTmn) grondo@ipa4:/tmp$ flux hwloc lstopo
System (1024GB total)
  Machine L#0 + NUMANode L#0 (P#0 128GB) + Package L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
    Core L#4 + PU L#4 (P#4)
    Core L#5 + PU L#5 (P#5)
    Core L#6 + PU L#6 (P#6)
    Core L#7 + PU L#7 (P#7)
    Core L#8 + PU L#8 (P#8)
  Machine L#1 + NUMANode L#1 (P#0 128GB) + Package L#1
    Core L#9 + PU L#9 (P#9)
    Core L#10 + PU L#10 (P#10)
    Core L#11 + PU L#11 (P#11)
    Core L#12 + PU L#12 (P#12)
    Core L#13 + PU L#13 (P#13)
    Core L#14 + PU L#14 (P#14)
    Core L#15 + PU L#15 (P#15)
    Core L#16 + PU L#16 (P#16)
    Core L#17 + PU L#17 (P#17)
  Machine L#2 + NUMANode L#2 (P#1 128GB) + Package L#2
    Core L#18 + PU L#18 (P#18)
    Core L#19 + PU L#19 (P#19)
    Core L#20 + PU L#20 (P#20)
    Core L#21 + PU L#21 (P#21)
    Core L#22 + PU L#22 (P#22)
    Core L#23 + PU L#23 (P#23)
    Core L#24 + PU L#24 (P#24)
    Core L#25 + PU L#25 (P#25)
    Core L#26 + PU L#26 (P#26)
  Machine L#3 + NUMANode L#3 (P#1 128GB) + Package L#3
    Core L#27 + PU L#27 (P#27)
    Core L#28 + PU L#28 (P#28)
    Core L#29 + PU L#29 (P#29)
    Core L#30 + PU L#30 (P#30)
    Core L#31 + PU L#31 (P#31)
    Core L#32 + PU L#32 (P#32)
    Core L#33 + PU L#33 (P#33)
    Core L#34 + PU L#34 (P#34)
    Core L#35 + PU L#35 (P#35)
  Machine L#4 + NUMANode L#4 (P#0 128GB) + Package L#4
    Core L#36 + PU L#36 (P#0)
    Core L#37 + PU L#37 (P#1)
    Core L#38 + PU L#38 (P#2)
    Core L#39 + PU L#39 (P#3)
    Core L#40 + PU L#40 (P#4)
    Core L#41 + PU L#41 (P#5)
    Core L#42 + PU L#42 (P#6)
    Core L#43 + PU L#43 (P#7)
    Core L#44 + PU L#44 (P#8)
  Machine L#5 + NUMANode L#5 (P#0 128GB) + Package L#5
    Core L#45 + PU L#45 (P#9)
    Core L#46 + PU L#46 (P#10)
    Core L#47 + PU L#47 (P#11)
    Core L#48 + PU L#48 (P#12)
    Core L#49 + PU L#49 (P#13)
    Core L#50 + PU L#50 (P#14)
    Core L#51 + PU L#51 (P#15)
    Core L#52 + PU L#52 (P#16)
    Core L#53 + PU L#53 (P#17)
  Machine L#6 + NUMANode L#6 (P#1 128GB) + Package L#6
    Core L#54 + PU L#54 (P#18)
    Core L#55 + PU L#55 (P#19)
    Core L#56 + PU L#56 (P#20)
    Core L#57 + PU L#57 (P#21)
    Core L#58 + PU L#58 (P#22)
    Core L#59 + PU L#59 (P#23)
    Core L#60 + PU L#60 (P#24)
    Core L#61 + PU L#61 (P#25)
    Core L#62 + PU L#62 (P#26)
  Machine L#7 + NUMANode L#7 (P#1 128GB) + Package L#7
    Core L#63 + PU L#63 (P#27)
    Core L#64 + PU L#64 (P#28)
    Core L#65 + PU L#65 (P#29)
    Core L#66 + PU L#66 (P#30)
    Core L#67 + PU L#67 (P#31)
    Core L#68 + PU L#68 (P#32)
    Core L#69 + PU L#69 (P#33)
    Core L#70 + PU L#70 (P#34)
    Core L#71 + PU L#71 (P#35)
(flux-SRLTmn) grondo@ipa4:/tmp$ flux wreckrun -N8 sh -c 'taskset -cp $$'
pid 6789's current affinity list: 0-8
pid 6790's current affinity list: 9-17
pid 6791's current affinity list: 18-26
pid 6792's current affinity list: 27-35
pid 4419's current affinity list: 0-8
pid 4420's current affinity list: 9-17
pid 4422's current affinity list: 18-26
pid 4421's current affinity list: 27-35
(flux-SRLTmn) grondo@ipa4:/tmp$ flux wreckrun -o cpu-affinity -N8 sh -c 'taskset -cp $$'
pid 6816's current affinity list: 0,36
pid 4446's current affinity list: 0,36
pid 4447's current affinity list: 0,36
pid 6817's current affinity list: 0,36
pid 6819's current affinity list: 0,36
pid 6818's current affinity list: 0,36
pid 4448's current affinity list: 0,36
pid 4449's current affinity list: 0,36

In the last run, I would have expected each task to run on the first core of its broker's mask, e.g. 0,9,18,27,... However, it appears that each rank is assigned "Logical Core 0", even though the hwloc topology shows the logical cores numbered consecutively. This is borne out by the R_lite for this job:

(flux-SRLTmn) grondo@ipa4:~/git/flux-core.git$ flux kvs get lwj.0.0.9.R_lite
[{"node": "ipa4", "children": {"core": "0"}, "rank": 0}, {"node": "ipa4", "children": {"core": "0"}, "rank": 1}, {"node": "ipa4", "children": {"core": "0"}, "rank": 2}, {"node": "ipa4", "children": {"core": "0"}, "rank": 3}, {"node": "ipa5", "children": {"core": "0"}, "rank": 4}, {"node": "ipa5", "children": {"core": "0"}, "rank": 5}, {"node": "ipa5", "children": {"core": "0"}, "rank": 6}, {"node": "ipa5", "children": {"core": "0"}, "rank": 7}]

It appears sched is assigning logical core ids by rank, not logical hwloc core ids. Perhaps if I first use hwloc_topology_restrict() in wrexecd before setting the initial cpumask with hwloc_bind(), this will work as expected? However, it may be confusing looking at R_lite that the same core id is assigned on different ranks that map to the same host.

The other problem I think is due to the way Slurm is assigning the initial mask (I presume 0,36 are core siblings, but Slurm appears to be treating them as different cores). I'm ignoring that issue for now.

wreck: restrict topology before applying cpumask
Problem: the wreck cpu-affinity support doesn't restrict the topology
to the current process' cpuset before doing affinity binding, and
so expects the core ids assigned by the scheduler to be in "whole
system" logical order. Since the scheduler assigns cores indexed from
0 per-rank, this breaks the case where multiple brokers are run per
real physical node.

Fix this problem by first restricting the hwloc topology to the
current cpus-allowed mask. Then at least there is a chance libhwloc
will use the same numbering scheme for cores as is being used by
the scheduler.
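
A minimal sketch of the restriction step this commit describes, again illustrative rather than the exact wrexecd change; the helper name is made up:

```c
/* Sketch of the fix described above: restrict the loaded topology to
 * this process' current cpus-allowed mask, so hwloc's logical core
 * numbering is more likely to line up with the per-rank core indices
 * found in R_lite.  Illustrative only. */
#include <hwloc.h>

static int restrict_topology_to_self (hwloc_topology_t topo)
{
    int rc = -1;
    hwloc_bitmap_t cur = hwloc_bitmap_alloc ();

    if (cur == NULL)
        return -1;
    /* Current cpus-allowed mask (e.g. as set by Slurm for this broker) */
    if (hwloc_get_cpubind (topo, cur, HWLOC_CPUBIND_PROCESS) == 0)
        rc = hwloc_topology_restrict (topo, cur, 0);
    hwloc_bitmap_free (cur);
    return rc;
}
```

After such a call, logical core indices are relative to the broker's own cpuset, so core "0" in R_lite refers to the first core actually available to that broker.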
@grondo (Contributor, Author) commented Jul 25, 2018

Now that I look at the hwloc output above, it occurs to me that I probably don't understand hwloc numbering. In the short term, using hwloc_topology_restrict() seems to fix the specific problem above. However, in the long term I worry about the ambiguity of the libhwloc logical numbering. What happens when a flux instance grows? libhwloc will renumber resources and logical core#0 may become logical core#8.

It seems like the scheduler should be using both physical and logical IDs for resources, and both should appear in R_lite, or at least eventually R.

@SteVwonder (Member) commented Jul 25, 2018

> Ok, testing cpu-affinity=per-task in Travis was a bit of a challenge.

Yeah, in general, without reaching for hwloc in some capacity, I expect that extensive testing of this will be difficult.

> It appears sched is assigning logical core ids by rank, not logical hwloc core ids. Perhaps if I first use hwloc_topology_restrict() in wrexecd before setting the initial cpumask with hwloc_bind(), this will work as expected?

Good catch! I agree on this path forward.

> However, it may be confusing looking at R_lite that the same core id is assigned on different ranks that map to the same host.

Can we make a note of this for the new exec system?

@SteVwonder (Member) commented Jul 25, 2018

> The other problem I think is due to the way Slurm is assigning the initial mask (I presume 0,36 are core siblings, but Slurm appears to be treating them as different cores). I'm ignoring that issue for now.

Yeah, I don't know for sure. I think the odd numbering due to core siblings is definitely the case on hype2:

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [15:51:47]
→ flux hwloc lstopo
System (128GB total)
  Machine L#0
    Package L#0
      NUMANode L#0 (P#0 32GB)
        Core L#0
          PU L#0 (P#0)
          PU L#1 (P#28)
        Core L#1
          PU L#2 (P#1)
          PU L#3 (P#29)
        Core L#2
          PU L#4 (P#2)
          PU L#5 (P#30)
        Core L#3
          PU L#6 (P#3)
          PU L#7 (P#31)
        Core L#4
          PU L#8 (P#4)
          PU L#9 (P#32)
        Core L#5
          PU L#10 (P#5)
          PU L#11 (P#33)
        Core L#6
          PU L#12 (P#6)
          PU L#13 (P#34)
      NUMANode L#1 (P#1 32GB)
...
@grondo (Contributor, Author) commented Jul 25, 2018

Ok, I've pushed a temporary fix which restricts the topology to the current cpumask before doing the initial affinity binding.

This makes the current results more reasonable, and probably close to what we want for now?

 grondo@ipa15:~/git/flux-core.git$ srun -p pall --pty -N2 -n8 /g/g0/grondo/flux/bin/flux start
(flux-9t9PMP) grondo@ipa4:~/git/flux-core.git$ flux hwloc info
8 Machines, 72 Cores, 72 PUs
(flux-9t9PMP) grondo@ipa4:~/git/flux-core.git$ flux wreckrun -lN8 sh -c 'taskset -cp $$'
0: pid 14935's current affinity list: 0-8
1: pid 14936's current affinity list: 9-17
2: pid 14937's current affinity list: 18-26
3: pid 14938's current affinity list: 27-35
4: pid 8458's current affinity list: 0-8
5: pid 8459's current affinity list: 9-17
6: pid 8460's current affinity list: 18-26
7: pid 8461's current affinity list: 27-35
(flux-9t9PMP) grondo@ipa4:~/git/flux-core.git$ flux wreckrun -lN8 -o cpu-affinity sh -c 'taskset -cp $$'
0: pid 14961's current affinity list: 0
5: pid 8484's current affinity list: 9
6: pid 8485's current affinity list: 18
1: pid 14962's current affinity list: 9
2: pid 14963's current affinity list: 18
3: pid 14964's current affinity list: 27
4: pid 8483's current affinity list: 0
7: pid 8486's current affinity list: 27
(flux-9t9PMP) grondo@ipa4:~/git/flux-core.git$ flux wreckrun -lN8 -c2 -o cpu-affinity sh -c 'taskset -cp $$'
0: pid 14982's current affinity list: 0,1
1: pid 14983's current affinity list: 9,10
2: pid 14984's current affinity list: 18,19
3: pid 14985's current affinity list: 27,28
4: pid 8504's current affinity list: 0,1
5: pid 8503's current affinity list: 9,10
6: pid 8505's current affinity list: 18,19
7: pid 8506's current affinity list: 27,28

@SteVwonder (Member) commented Jul 25, 2018

@grondo: I'm confused as to why flux thinks there are 36 cores on IPA. I'm only seeing 32 cores:

[herbein1@ipa15:~]
[16:04:03] $ srun -ppall -N1 hwloc-ls
Machine (128GB total)
  NUMANode L#0 (P#0 64GB)
    Package L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    HostBridge L#0
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 8086:1d6a
      PCIBridge
        PCI 8086:1521
          Net L#0 "eno1"
        PCI 8086:1521
          Net L#1 "eno2"
        PCI 8086:1521
          Net L#2 "eno3"
        PCI 8086:1521
          Net L#3 "eno4"
      PCIBridge
        PCI 1000:0087
      PCIBridge
        PCI 10de:1021
          GPU L#4 "card1"
          GPU L#5 "renderD128"
      PCIBridge
        PCI 102b:0522
          GPU L#6 "card0"
          GPU L#7 "controlD64"
      PCI 8086:1d02
  NUMANode L#1 (P#1 64GB)
    Package L#1 + L3 L#1 (20MB)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#24)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#25)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#26)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#27)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#28)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#29)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#30)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#31)
    HostBridge L#8
      PCIBridge
        PCI 10de:1021
          GPU L#8 "card2"
          GPU L#9 "renderD129"
      PCIBridge
        PCI 15b3:1003
          Net L#10 "hsi0"
          OpenFabrics L#11 "mlx4_0"

But this output does confirm your intuition about the core sibling and numbering (i.e., cores 0 & 16 are siblings).

@grondo (Contributor, Author) commented Jul 25, 2018

Yeah, strange. We probably need a separate issue to run that down!

 grondo@ipa1:~$ grep -c processor /proc/cpuinfo 
32
@grondo (Contributor, Author) commented Jul 25, 2018

I think it is something to do with combining the hwloc topology xml:

 grondo@ipa15:~$ srun -p pall --pty -N1 /g/g0/grondo/flux/bin/flux start flux hwloc info
1 Machine, 16 Cores, 16 PUs
 grondo@ipa15:~$ srun -p pall --pty -N2 /g/g0/grondo/flux/bin/flux start flux hwloc info
2 Machines, 72 Cores, 72 PUs

grondo referenced this pull request on Jul 25, 2018

multi-node hwloc confusion #1605 (Closed)

@SteVwonder (Member) commented Jul 25, 2018

Hmmmm. So I attempted an LBANN-esque use-case (4 tasks per node, with each task getting 1/4 of the node), and I got an unexpected result:

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [16:17:07]
→ flux hwloc info                                                                                                                 
1 Machine, 28 Cores, 56 PUs

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [16:16:54]
→ flux wreckrun -n 4 -c 7 -N1 -o cpu-affinity=per-task bash -c 'taskset -cp $$'
pid 16949's current affinity list: 0,28
pid 16951's current affinity list: 2,30
pid 16952's current affinity list: 3,31
pid 16950's current affinity list: 1,29

I was expecting something like:

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [16:16:54]
→ flux wreckrun -n 4 -c 7 -N1 -o cpu-affinity=per-task bash -c 'taskset -cp $$'
pid 16949's current affinity list: 0-6,28-34
pid 16951's current affinity list: 7-13,35-41
pid 16952's current affinity list: 14-20,42-48
pid 16950's current affinity list: 21-27,49-55

Or some other configuration where each task has 7 cores (14 PUs) in its affinity list.

@grondo (Contributor, Author) commented Jul 26, 2018

@SteVwonder, could you run that test with just -o cpu-affinity (i.e. not per-task), and without -o cpu-affinity?

Can you also paste the R_lite from those jobs here?

The per-task code is basically using hwloc_distrib(3) over a topology restricted by the core list in R_lite. We could also experiment with running hwloc-distrib(1) to see if I've done something incorrect in this PR.

@grondo (Contributor, Author) commented Jul 26, 2018

Another thing that was confusing on ipa is that the mpibind plugin for Slurm sets the default cpumask such that only one hardware thread (core sibling) from each core is used:

grondo@ipa15:~$ srun -p pall -N2 sh -c 'taskset -cp $$'
pid 25122's current affinity list: 0-35
pid 32184's current affinity list: 0-35
 grondo@ipa15:~$ srun --mpibind=off -p pall -N2 sh -c 'taskset -cp $$'
pid 26000's current affinity list: 0-71
pid 33061's current affinity list: 0-71

Just something to look out for...

@grondo (Contributor, Author) commented Jul 26, 2018

@SteVwonder, now that I figured out how dumb I was being on IPA, things are making more sense. In general, your use case above seems to work here:

(flux-dg185r) grondo@ipa4:~$ flux hwloc info
2 Machines, 72 Cores, 144 PUs
(flux-dg185r) grondo@ipa4:~$ flux wreckrun -N1 -ln4 -c9 -o cpu-affinity=per-task sh -c 'taskset -cp $$'
0: pid 38412's current affinity list: 0-8,36-44
2: pid 38414's current affinity list: 18-26,54-62
1: pid 38413's current affinity list: 9-17,45-53
3: pid 38415's current affinity list: 27-35,63-71
(flux-dg185r) grondo@ipa4:~$ flux wreckrun -N1 -ln4 -c9 -o cpu-affinity sh -c 'taskset -cp $$'
0: pid 38426's current affinity list: 0-71
3: pid 38429's current affinity list: 0-71
2: pid 38428's current affinity list: 0-71
1: pid 38427's current affinity list: 0-71
(flux-dg185r) grondo@ipa4:~$ flux wreckrun -N1 -ln4 -c2 -o cpu-affinity sh -c 'taskset -cp $$'
0: pid 38440's current affinity list: 0-7,36-43
3: pid 38443's current affinity list: 0-7,36-43
2: pid 38442's current affinity list: 0-7,36-43
1: pid 38441's current affinity list: 0-7,36-43
(flux-dg185r) grondo@ipa4:~$ flux wreckrun -N1 -ln4 -c2 -o cpu-affinity=per-task sh -c 'taskset -cp $$'
1: pid 38515's current affinity list: 2,3,38,39
3: pid 38517's current affinity list: 6,7,42,43
2: pid 38516's current affinity list: 4,5,40,41
0: pid 38514's current affinity list: 0,1,36,37

@SteVwonder (Member) commented Jul 26, 2018

@grondo: Huh, I wonder what I'm doing wrong or differently, because the results in your last message look perfect. Just in case something funky was going on with versions, I did a full git clean -fdx and rebuilt your branch from scratch, installing into a clean directory:

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:17:55]
→ ./flux start

<snip new shell spawning>

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:20:53]
→ which flux
/usr/workspace/wsb/herbein1/packages/toss3/flux-core/task-affinity/bin/flux

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:19:28]
→ flux hwloc info
1 Machine, 28 Cores, 56 PUs

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:19:47]
→ flux wreckrun -n 4 -c 7 -N1 bash -c 'taskset -cp $$; flux kvs get lwj.0.0.${FLUX_JOB_ID}.R_lite'
pid 833's current affinity list: 0-55
pid 836's current affinity list: 0-55
pid 835's current affinity list: 0-55
pid 834's current affinity list: 0-55
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:19:54]
→ flux wreckrun -n 4 -c 7 -N1 -o cpu-affinity bash -c 'taskset -cp $$; flux kvs get lwj.0.0.${FLUX_JOB_ID}.R_lite'
pid 877's current affinity list: 0-3,28-31
pid 880's current affinity list: 0-3,28-31
pid 879's current affinity list: 0-3,28-31
pid 878's current affinity list: 0-3,28-31
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:20:02]
→ flux wreckrun -n 4 -c 7 -N1 -o cpu-affinity=per-task bash -c 'taskset -cp $$; flux kvs get lwj.0.0.${FLUX_JOB_ID}.R_lite'
pid 959's current affinity list: 3,31
pid 957's current affinity list: 1,29
pid 958's current affinity list: 2,30
pid 956's current affinity list: 0,28
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]
[{"children":{"core":"0-3"},"rank":0}]

It appears that your plugin is working just fine, but the R_lite generation isn't working quite right. Maybe because I don't have sched loaded?

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:25:58]
→ flux module list
Module               Size    Digest  Idle  S  Nodeset
content-sqlite       1117816 32A53AE    5  S  0
connector-local      1139152 1F0C6D3    0  R  0
cron                 1194568 5C4E93B    0  S  0
kvs                  1543200 AD00B0D    0  S  0
userdb               1109840 198DF7A    2  S  0
resource-hwloc       1135976 AC50E6E    8  S  0
job                  1178216 0774B6F    5  S  0
aggregator           1126968 A21A238    5  S  0
barrier              1107904 E1CB612    2  S  0
@SteVwonder (Member) commented Jul 26, 2018

Ok, never mind, my bad. Once I make sure sched is loaded, everything works as expected:

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:28:14]
→ flux module list
Module               Size    Digest  Idle  S  Nodeset
content-sqlite       1117816 32A53AE    1  S  0
connector-local      1139152 1F0C6D3    0  R  0
cron                 1194568 5C4E93B    0  S  0
kvs                  1543200 AD00B0D    0  S  0
userdb               1109840 198DF7A    1  S  0
resource-hwloc       1135976 AC50E6E    1  S  0
sched                 467584 C960B1E    1  S  0
job                  1178216 0774B6F    1  S  0
aggregator           1126968 A21A238    1  S  0
barrier              1107904 E1CB612    1  S  0

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:28:15]
→ flux hwloc info
1 Machine, 28 Cores, 56 PUs

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:28:26]
→ flux wreckrun -n 4 -c 7 -N1 bash -c 'taskset -cp $$; flux kvs get lwj.0.0.${FLUX_JOB_ID}.R_lite'
pid 2230's current affinity list: 0-55
pid 2228's current affinity list: 0-55
pid 2229's current affinity list: 0-55
pid 2231's current affinity list: 0-55
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:28:33]
→ flux wreckrun -n 4 -c 7 -N1 -o cpu-affinity bash -c 'taskset -cp $$; flux kvs get lwj.0.0.${FLUX_JOB_ID}.R_lite'
pid 2276's current affinity list: 0-55
pid 2277's current affinity list: 0-55
pid 2274's current affinity list: 0-55
pid 2275's current affinity list: 0-55
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]

# herbein1 at hype2 in ~/opt/packages/flux-core/task-affinity/bin [20:28:44]
→ flux wreckrun -n 4 -c 7 -N1 -o cpu-affinity=per-task bash -c 'taskset -cp $$; flux kvs get lwj.0.0.${FLUX_JOB_ID}.R_lite'
pid 2315's current affinity list: 0-6,28-34
pid 2317's current affinity list: 14-20,42-48
pid 2316's current affinity list: 7-13,35-41
pid 2318's current affinity list: 21-27,49-55
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
[{"node": "hype2", "children": {"core": "0-27"}, "rank": 0}]
@SteVwonder (Member) left a review comment

LGTM! Thanks @grondo! Ready for a merge? I don't know if @garlick wants to take a look first.

@grondo (Contributor, Author) commented Jul 26, 2018

Ah yes, sorry! Wreck without sched doesn't even look at cpus-per-task. (Maybe that should be fixed?)

@grondo (Contributor, Author) commented Jul 26, 2018

I added some flags to hwloc_topology_restrict during development that may be unnecessary.

I guess if they aren't harming anything now they can stay in... So, yes this is probably ready for a merge.

@dongahn (Contributor) commented Jul 26, 2018

> hwloc_topology_restrict

Does this affect the visibility of the I/O hierarchy at all? I need to traverse the I/O hierarchy for GPUs.

@grondo (Contributor, Author) commented Jul 26, 2018

@dongahn, the restrict calls are only used in wrexecd for use with hwloc_distrib(), not in resource-hwloc, so it should be invisible to sched.

@dongahn (Contributor) commented Jul 26, 2018

Ah. Sounds good to me then.

SteVwonder merged commit ec07ecf into flux-framework:master on Jul 26, 2018

4 checks passed:

codecov/patch: 92.1% of diff hit (target 79.23%)
codecov/project: 79.36% (+0.13%) compared to c91ac60
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.1%) to 79.531%
@SteVwonder (Member) commented Jul 26, 2018

Thanks again @grondo for putting this together!

@SteVwonder (Member) commented Jul 26, 2018

Forgot to mention before, but this PR brought up the coverage enough that it now rounds to 80%, meaning our coveralls badge is now yellow 🎉

grondo deleted the grondo:task-affinity branch on Jul 27, 2018
