Integrate multi-node nccl testing into the tester package #447

weicongw · 2024-06-10T23:27:02Z

Issue #, if available:

Description of changes:
This PR integrates multi-node NCCL testing into the tester package. The tester now accepts the following flags to configure the test:

ncclTestImage: Specifies the base image used to run the multi-node NCCL test.
efaEnabled: Determines whether to use EFA in the cluster.
nodeType: Specifies which instance type in the node groups is used to run the multi-node NCCL test.
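
For readers unfamiliar with how such flags are typically wired up in Go, here is a minimal sketch using the standard `flag` package. Only the three flag names come from the description above; the `testConfig` struct, the `parseFlags` helper, the defaults, and the help strings are assumptions made for illustration and are not taken from the PR.

```go
package main

import (
	"flag"
	"fmt"
)

// testConfig holds the three options described above. The struct and its
// field layout are hypothetical; only the flag names come from the PR text.
type testConfig struct {
	ncclTestImage string
	efaEnabled    bool
	nodeType      string
}

// parseFlags registers the flags on a private FlagSet so the parsing can be
// exercised without touching the process-wide flag.CommandLine.
func parseFlags(args []string) (testConfig, error) {
	var cfg testConfig
	fs := flag.NewFlagSet("nccl-test", flag.ContinueOnError)
	// Defaults and help strings below are assumptions, not the PR's values.
	fs.StringVar(&cfg.ncclTestImage, "ncclTestImage", "", "base image for the multi-node NCCL test")
	fs.BoolVar(&cfg.efaEnabled, "efaEnabled", false, "use EFA devices in the cluster")
	fs.StringVar(&cfg.nodeType, "nodeType", "", "instance type to run on; empty means auto-detect")
	if err := fs.Parse(args); err != nil {
		return testConfig{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseFlags([]string{"-efaEnabled=true", "-nodeType", "p3dn.24xlarge"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Leaving `nodeType` empty by default matches the behavior shown in the test log, where the tester falls back to the instance type it finds in the node groups.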

The tester retrieves the hardware specifications (GPU and EFA counts) from the nodes and renders the NCCL test manifest based on them.
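
As a rough illustration of rendering a manifest from discovered node specifications, the sketch below uses Go's `text/template`. The `manifestParams` struct, the `Slots` helper (which assumes one rank per GPU), and the trimmed template fragment are all hypothetical; the actual manifest in this PR is `mpi-job-nccl-test-multi-node.yaml` and is not reproduced here.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// manifestParams is a hypothetical set of values discovered from the nodes.
type manifestParams struct {
	Image      string
	NodeCount  int
	GPUPerNode int
	EFAPerNode int
}

// Slots returns the total number of ranks, assuming one rank per GPU.
func (p manifestParams) Slots() int { return p.NodeCount * p.GPUPerNode }

// manifestTmpl is a heavily trimmed, illustrative fragment, not the manifest
// from the PR. The EFA resource line is emitted only when EFA devices exist.
const manifestTmpl = `replicas: {{.NodeCount}}
image: {{.Image}}
mpirun_np: {{.Slots}}
resources:
  nvidia.com/gpu: {{.GPUPerNode}}
{{- if gt .EFAPerNode 0}}
  vpc.amazonaws.com/efa: {{.EFAPerNode}}
{{- end}}
`

// renderManifest fills the template with the discovered values.
func renderManifest(p manifestParams) (string, error) {
	t, err := template.New("manifest").Parse(manifestTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderManifest(manifestParams{
		Image: "example.invalid/nccl-test:latest", NodeCount: 2, GPUPerNode: 8, EFAPerNode: 1,
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```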

Testing

go test -v . -args -efaImage 665181186642.dkr.ecr.us-west-2.amazonaws.com/aws-k8s-tester/nccl-test:latest -skip-features single-node -efaEnabled=true
W0610 22:55:51.199584   28686 warnings.go:70] spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key: beta.kubernetes.io/instance-type is deprecated since v1.17; use "node.kubernetes.io/instance-type" instead
W0610 22:55:51.199630   28686 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
2024/06/10 22:55:56 No node type specified. Using the node type p3dn.24xlarge in the node groups.
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
    env.go:438: Skipping feature: "single-node": name matched
=== RUN   TestMPIJobPytorchTraining/multi-node
=== RUN   TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds
--- PASS: TestMPIJobPytorchTraining (40.05s)
    --- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
    --- PASS: TestMPIJobPytorchTraining/multi-node (40.05s)
        --- PASS: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (40.01s)
PASS
ok      github.com/aws/aws-k8s-tester/e2e2/test/cases/nvidia    57.324s

Testing pod logs

...
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:           8             2     float     sum      -1    52.23    0.00    0.00      0    51.88    0.00    0.00      0
[1,0]<stdout>:          16             4     float     sum      -1    51.30    0.00    0.00      0    51.15    0.00    0.00      0
[1,0]<stdout>:          32             8     float     sum      -1    51.54    0.00    0.00      0    51.09    0.00    0.00      0
[1,0]<stdout>:          64            16     float     sum      -1    51.29    0.00    0.00      0    50.73    0.00    0.00      0
[1,0]<stdout>:         128            32     float     sum      -1    51.35    0.00    0.00      0    50.61    0.00    0.00      0
[1,0]<stdout>:         256            64     float     sum      -1    51.20    0.01    0.01      0    50.47    0.01    0.01      0
[1,0]<stdout>:         512           128     float     sum      -1    50.65    0.01    0.02      0    49.16    0.01    0.02      0
[1,0]<stdout>:        1024           256     float     sum      -1    52.06    0.02    0.03      0    51.37    0.02    0.03      0
[1,0]<stdout>:        2048           512     float     sum      -1    55.56    0.04    0.06      0    54.80    0.04    0.07      0
[1,0]<stdout>:        4096          1024     float     sum      -1    60.02    0.07    0.12      0    59.12    0.07    0.12      0
[1,0]<stdout>:        8192          2048     float     sum      -1    61.74    0.13    0.23      0    60.72    0.13    0.24      0
[1,0]<stdout>:       16384          4096     float     sum      -1    64.68    0.25    0.44      0    63.69    0.26    0.45      0
[1,0]<stdout>:       32768          8192     float     sum      -1    71.38    0.46    0.80      0    70.84    0.46    0.81      0
[1,0]<stdout>:       65536         16384     float     sum      -1    74.78    0.88    1.53      0    74.54    0.88    1.54      0
[1,0]<stdout>:      131072         32768     float     sum      -1    80.97    1.62    2.83      0    79.26    1.65    2.89      0
[1,0]<stdout>:      262144         65536     float     sum      -1    80.99    3.24    5.66      0    78.00    3.36    5.88      0
[1,0]<stdout>:      524288        131072     float     sum      -1    84.01    6.24   10.92      0    83.20    6.30   11.03      0
[1,0]<stdout>:     1048576        262144     float     sum      -1    92.30   11.36   19.88      0    91.67   11.44   20.02      0
[1,0]<stdout>:     2097152        524288     float     sum      -1    114.6   18.29   32.01      0    112.5   18.64   32.62      0
[1,0]<stdout>:     4194304       1048576     float     sum      -1    147.6   28.42   49.74      0    145.3   28.86   50.51      0
[1,0]<stdout>:     8388608       2097152     float     sum      -1    196.9   42.59   74.54      0    197.1   42.55   74.47      0
[1,0]<stdout>:    16777216       4194304     float     sum      -1    288.9   58.07  101.62      0    288.0   58.25  101.95      0
[1,0]<stdout>:    33554432       8388608     float     sum      -1    508.5   65.98  115.47      0    508.6   65.97  115.45      0
[1,0]<stdout>:    67108864      16777216     float     sum      -1    953.6   70.38  123.16      0    954.6   70.30  123.02      0
[1,0]<stdout>:   134217728      33554432     float     sum      -1   1857.3   72.26  126.46      0   1860.8   72.13  126.22      0
[1,0]<stdout>:   268435456      67108864     float     sum      -1   3666.6   73.21  128.12      0   3673.1   73.08  127.89      0
[1,0]<stdout>:   536870912     134217728     float     sum      -1   7286.3   73.68  128.94      0   7299.7   73.55  128.71      0
[1,0]<stdout>:  1073741824     268435456     float     sum      -1    14494   74.08  129.65      0    14515   73.98  129.46      0
[1,0]<stdout>:  2147483648     536870912     float     sum      -1    28866   74.40  130.19      0    28892   74.33  130.08      0
[1,0]<stdout>:multi-node-nccl-test-worker-0:21:21 [0] NCCL INFO comm 0x55f94529a0b0 rank 0 nranks 8 cudaDev 0 busId 160 - Destroy COMPLETE
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 40.7925 
[1,0]<stdout>:#
[1,1]<stdout>:multi-node-nccl-test-worker-0:22:22 [1] NCCL INFO comm 0x56151cbd4340 rank 1 nranks 8 cudaDev 1 busId 170 - Destroy COMPLETE
[1,2]<stdout>:multi-node-nccl-test-worker-0:23:23 [2] NCCL INFO comm 0x5556c80d7a40 rank 2 nranks 8 cudaDev 2 busId 180 - Destroy COMPLETE
[1,3]<stdout>:multi-node-nccl-test-worker-0:24:24 [3] NCCL INFO comm 0x55ddec7c3630 rank 3 nranks 8 cudaDev 3 busId 190 - Destroy COMPLETE
[1,7]<stdout>:multi-node-nccl-test-worker-0:31:31 [7] NCCL INFO comm 0x55df10ef2500 rank 7 nranks 8 cudaDev 7 busId 1d0 - Destroy COMPLETE
[1,5]<stdout>:multi-node-nccl-test-worker-0:26:26 [5] NCCL INFO comm 0x561b7c28f430 rank 5 nranks 8 cudaDev 5 busId 1b0 - Destroy COMPLETE
[1,6]<stdout>:multi-node-nccl-test-worker-0:29:29 [6] NCCL INFO comm 0x5648feb716d0 rank 6 nranks 8 cudaDev 6 busId 1c0 - Destroy COMPLETE
[1,4]<stdout>:multi-node-nccl-test-worker-0:25:25 [4] NCCL INFO comm 0x5569eef06e40 rank 4 nranks 8 cudaDev 4 busId 1a0 - Destroy COMPLETE
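
To help read the `algbw` and `busbw` columns above: nccl-tests reports algorithm bandwidth as message size divided by elapsed time, and for all-reduce it scales that by 2(n-1)/n to get bus bandwidth. The sketch below assumes the log comes from an all-reduce run (the 1.75 ratio between the two columns is consistent with that for 8 ranks) and recomputes the last row; the function names are illustrative only.

```go
package main

import "fmt"

// algBW returns the algorithm bandwidth in GB/s for a message of sizeBytes
// bytes that took timeUs microseconds.
func algBW(sizeBytes, timeUs float64) float64 {
	return sizeBytes / 1e9 / (timeUs / 1e6)
}

// allReduceBusBW converts algorithm bandwidth to bus bandwidth for an
// all-reduce across nranks ranks, using the 2*(n-1)/n correction factor.
func allReduceBusBW(algbw float64, nranks int) float64 {
	n := float64(nranks)
	return algbw * 2 * (n - 1) / n
}

func main() {
	// Last row of the log: 2147483648 bytes in 28866 us across 8 ranks.
	alg := algBW(2147483648, 28866)
	bus := allReduceBusBW(alg, 8)
	fmt.Printf("algbw=%.2f GB/s busbw=%.2f GB/s\n", alg, bus)
}
```

The reported "Avg bus bandwidth" is an average over all message sizes, which is why it sits well below the large-message peak.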

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml

e2e2/test/cases/nvidia/main_test.go

e2e2/test/cases/nvidia/mpi_test.go

cartermckinnon · 2024-06-12T18:41:39Z

e2e2/test/cases/nvidia/main_test.go

+			if *nodeType != "" {
+				for _, v := range nodes.Items {
+					if v.Labels["node.kubernetes.io/instance-type"] == *nodeType {
+						nodeCount++
+						gpu := v.Status.Capacity["nvidia.com/gpu"]
+						gpuPerNode = int(gpu.Value())
+						efa := v.Status.Capacity["vpc.amazonaws.com/efa"]
+						efaPerNode = int(efa.Value())
+					}
+				}
+			} else {
+				log.Printf("No node type specified. Using the node type %s in the node groups.", nodes.Items[0].Labels["node.kubernetes.io/instance-type"])
+				nodeCount = len(nodes.Items)
+				gpu := nodes.Items[0].Status.Capacity["nvidia.com/gpu"]
+				gpuPerNode = int(gpu.Value())
+				efa := nodes.Items[0].Status.Capacity["vpc.amazonaws.com/efa"]
+				efaPerNode = int(efa.Value())
+			}


So the idea is that our deployer doesn't have to pass any info to the test binary about what instance type it deployed, and the test case will just adapt to whatever it finds in the cluster? I think that's a fine idea, but I think you need to assert here that you have a single instance type in the cluster and throw an error if there's a mix.

Also why move this to main_test.go? It's only relevant to the multi-node tests, right?

> Also why move this to main_test.go? It's only relevant to the multi-node tests, right?

I plan to modify the single-node test as well, so that it makes full use of the node's resources.
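
The single-instance-type assertion suggested in this thread might look something like the sketch below. To stay self-contained it takes plain label maps rather than Kubernetes node objects, so `singleInstanceType` and its input shape are hypothetical stand-ins; only the `node.kubernetes.io/instance-type` label key comes from the diff above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const instanceTypeLabel = "node.kubernetes.io/instance-type"

// singleInstanceType returns the one instance type shared by all nodes, or an
// error if the list is empty or mixes instance types. Each element stands in
// for one node's label map (a simplification of a real node object).
func singleInstanceType(nodeLabels []map[string]string) (string, error) {
	if len(nodeLabels) == 0 {
		return "", fmt.Errorf("no nodes found")
	}
	seen := map[string]struct{}{}
	for _, labels := range nodeLabels {
		seen[labels[instanceTypeLabel]] = struct{}{}
	}
	if len(seen) > 1 {
		types := make([]string, 0, len(seen))
		for t := range seen {
			types = append(types, t)
		}
		sort.Strings(types) // stable error text regardless of map order
		return "", fmt.Errorf("expected a single instance type, found %d: %s",
			len(seen), strings.Join(types, ", "))
	}
	return nodeLabels[0][instanceTypeLabel], nil
}

func main() {
	nodes := []map[string]string{
		{instanceTypeLabel: "p3dn.24xlarge"},
		{instanceTypeLabel: "g5.xlarge"},
	}
	if _, err := singleInstanceType(nodes); err != nil {
		fmt.Println("error:", err)
	}
}
```

Checking the node list up front also guards against indexing into an empty list in the auto-detect branch.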

cartermckinnon · 2024-06-12T18:42:08Z

e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml

-            - -t
-            - "1"
-            - -g
-            - "1"


What did these do?

-t,--nthreads number of threads per process. Default : 1.
-g,--ngpus number of gpus per thread. Default : 1.

Both default to 1, so passing them explicitly was redundant.
See https://github.com/NVIDIA/nccl-tests

cartermckinnon · 2024-06-12T18:43:34Z

e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml

I would consider just using a different manifest for each instance family we intend to target -- we should probably tune things like memory as well, and I think it's going to be clearer to hardcode most of this.

If we use different manifests for the instance families, we would need to create and maintain many manifests in this repo, which might become a burden in the long run. We can start with the dynamic manifest and revert to a static one if needed. In my opinion, it is easier to switch from dynamic to static.

Fair enough, I don’t feel too strongly either way. Realistically we’ll only be testing 1 or 2 of the most popular families, which we would add/tune as they’re launched.

cartermckinnon

LGTM

Integrate multi-node nccl testing into the tester package

a84dea7

weicongw marked this pull request as ready for review June 10, 2024 23:27

ndbaker1 reviewed Jun 10, 2024

e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml (outdated, resolved)

e2e2/test/cases/nvidia/main_test.go (outdated, resolved)

e2e2/test/cases/nvidia/mpi_test.go (outdated, resolved)

Integrate multi-node nccl testing into the tester package

688ef2d

Issacwww reviewed Jun 11, 2024

e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml (resolved)

Issacwww approved these changes Jun 11, 2024

cartermckinnon reviewed Jun 12, 2024

Integrate multi-node nccl testing into the tester package

e1f5f98

weicongw requested a review from cartermckinnon June 13, 2024 20:22

cartermckinnon approved these changes Jun 13, 2024

Issacwww approved these changes Jun 14, 2024

ndbaker1 approved these changes Jun 14, 2024

cartermckinnon merged commit 56cbd21 into aws:main Jun 14, 2024
3 checks passed
