Integrate multi-node nccl testing into the tester package #447

Merged 3 commits on Jun 14, 2024

Conversation

@weicongw (Contributor) commented Jun 10, 2024

Issue #, if available:

Description of changes:
This PR integrates multi-node NCCL testing into the tester package. The tester now accepts the following flags to configure the test:

  • ncclTestImage: Specifies the base image used to run the multi-node NCCL test.
  • efaEnabled: Determines whether the test uses EFA (Elastic Fabric Adapter) in the cluster.
  • nodeType: Specifies which instance type in the node groups runs the multi-node NCCL test.
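
As a rough illustration of the three flags listed above, the following sketch registers them with Go's standard `flag` package. The flag names come from this description; the `testConfig` struct, the `parseFlags` helper, the default values, and the help strings are assumptions made for the example and are not taken from the PR.

```go
package main

import (
	"flag"
	"fmt"
)

// testConfig holds the three flags named in the PR description.
// The struct itself is hypothetical.
type testConfig struct {
	NCCLTestImage string
	EFAEnabled    bool
	NodeType      string
}

// parseFlags registers the flags on a private FlagSet so it can be called
// repeatedly without touching the global flag state.
// Defaults and help texts are assumptions.
func parseFlags(args []string) (testConfig, error) {
	var cfg testConfig
	fs := flag.NewFlagSet("nccl-test", flag.ContinueOnError)
	fs.StringVar(&cfg.NCCLTestImage, "ncclTestImage", "", "base image for the multi-node NCCL test")
	fs.BoolVar(&cfg.EFAEnabled, "efaEnabled", false, "whether to use EFA in the cluster")
	fs.StringVar(&cfg.NodeType, "nodeType", "", "instance type to test (empty: detect from the cluster)")
	if err := fs.Parse(args); err != nil {
		return testConfig{}, err
	}
	return cfg, nil
}

func main() {
	// example.com/nccl-test:latest is a placeholder image name.
	cfg, err := parseFlags([]string{"-ncclTestImage", "example.com/nccl-test:latest", "-efaEnabled=true"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Leaving `nodeType` empty corresponds to the "No node type specified" path visible in the test output.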

The tester reads the hardware specifications (node count, GPUs per node, and EFA devices per node) from the cluster nodes and renders the NCCL test manifest from them.
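
A minimal sketch of that rendering step, using Go's `text/template`. The manifest fragment, the `manifestVars` struct, and the `renderManifest` helper are hypothetical and heavily trimmed; they are not the manifest or code from this PR. Only the resource names `nvidia.com/gpu` and `vpc.amazonaws.com/efa` are taken from the change under review.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// manifestVars carries the values discovered from the cluster nodes.
type manifestVars struct {
	Image      string
	NodeCount  int
	GPUPerNode int
	EFAPerNode int
}

// manifestTmpl is an illustrative fragment, not the PR's manifest.
// The EFA limit is only emitted when the node reports EFA devices.
const manifestTmpl = `replicas: {{.NodeCount}}
image: {{.Image}}
resources:
  limits:
    nvidia.com/gpu: {{.GPUPerNode}}
{{- if gt .EFAPerNode 0}}
    vpc.amazonaws.com/efa: {{.EFAPerNode}}
{{- end}}
`

// renderManifest fills the template with the discovered values.
func renderManifest(v manifestVars) (string, error) {
	t, err := template.New("nccl").Parse(manifestTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, v); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderManifest(manifestVars{Image: "example.com/nccl-test:latest", NodeCount: 2, GPUPerNode: 8, EFAPerNode: 1})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```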

Testing

go test -v . -args -efaImage 665181186642.dkr.ecr.us-west-2.amazonaws.com/aws-k8s-tester/nccl-test:latest -skip-features single-node -efaEnabled=true
W0610 22:55:51.199584   28686 warnings.go:70] spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key: beta.kubernetes.io/instance-type is deprecated since v1.17; use "node.kubernetes.io/instance-type" instead
W0610 22:55:51.199630   28686 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
2024/06/10 22:55:56 No node type specified. Using the node type p3dn.24xlarge in the node groups.
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
    env.go:438: Skipping feature: "single-node": name matched
=== RUN   TestMPIJobPytorchTraining/multi-node
=== RUN   TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds
--- PASS: TestMPIJobPytorchTraining (40.05s)
    --- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
    --- PASS: TestMPIJobPytorchTraining/multi-node (40.05s)
        --- PASS: TestMPIJobPytorchTraining/multi-node/MPIJob_succeeds (40.01s)
PASS
ok      github.com/aws/aws-k8s-tester/e2e2/test/cases/nvidia    57.324s

Testing pod logs

...
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:           8             2     float     sum      -1    52.23    0.00    0.00      0    51.88    0.00    0.00      0
[1,0]<stdout>:          16             4     float     sum      -1    51.30    0.00    0.00      0    51.15    0.00    0.00      0
[1,0]<stdout>:          32             8     float     sum      -1    51.54    0.00    0.00      0    51.09    0.00    0.00      0
[1,0]<stdout>:          64            16     float     sum      -1    51.29    0.00    0.00      0    50.73    0.00    0.00      0
[1,0]<stdout>:         128            32     float     sum      -1    51.35    0.00    0.00      0    50.61    0.00    0.00      0
[1,0]<stdout>:         256            64     float     sum      -1    51.20    0.01    0.01      0    50.47    0.01    0.01      0
[1,0]<stdout>:         512           128     float     sum      -1    50.65    0.01    0.02      0    49.16    0.01    0.02      0
[1,0]<stdout>:        1024           256     float     sum      -1    52.06    0.02    0.03      0    51.37    0.02    0.03      0
[1,0]<stdout>:        2048           512     float     sum      -1    55.56    0.04    0.06      0    54.80    0.04    0.07      0
[1,0]<stdout>:        4096          1024     float     sum      -1    60.02    0.07    0.12      0    59.12    0.07    0.12      0
[1,0]<stdout>:        8192          2048     float     sum      -1    61.74    0.13    0.23      0    60.72    0.13    0.24      0
[1,0]<stdout>:       16384          4096     float     sum      -1    64.68    0.25    0.44      0    63.69    0.26    0.45      0
[1,0]<stdout>:       32768          8192     float     sum      -1    71.38    0.46    0.80      0    70.84    0.46    0.81      0
[1,0]<stdout>:       65536         16384     float     sum      -1    74.78    0.88    1.53      0    74.54    0.88    1.54      0
[1,0]<stdout>:      131072         32768     float     sum      -1    80.97    1.62    2.83      0    79.26    1.65    2.89      0
[1,0]<stdout>:      262144         65536     float     sum      -1    80.99    3.24    5.66      0    78.00    3.36    5.88      0
[1,0]<stdout>:      524288        131072     float     sum      -1    84.01    6.24   10.92      0    83.20    6.30   11.03      0
[1,0]<stdout>:     1048576        262144     float     sum      -1    92.30   11.36   19.88      0    91.67   11.44   20.02      0
[1,0]<stdout>:     2097152        524288     float     sum      -1    114.6   18.29   32.01      0    112.5   18.64   32.62      0
[1,0]<stdout>:     4194304       1048576     float     sum      -1    147.6   28.42   49.74      0    145.3   28.86   50.51      0
[1,0]<stdout>:     8388608       2097152     float     sum      -1    196.9   42.59   74.54      0    197.1   42.55   74.47      0
[1,0]<stdout>:    16777216       4194304     float     sum      -1    288.9   58.07  101.62      0    288.0   58.25  101.95      0
[1,0]<stdout>:    33554432       8388608     float     sum      -1    508.5   65.98  115.47      0    508.6   65.97  115.45      0
[1,0]<stdout>:    67108864      16777216     float     sum      -1    953.6   70.38  123.16      0    954.6   70.30  123.02      0
[1,0]<stdout>:   134217728      33554432     float     sum      -1   1857.3   72.26  126.46      0   1860.8   72.13  126.22      0
[1,0]<stdout>:   268435456      67108864     float     sum      -1   3666.6   73.21  128.12      0   3673.1   73.08  127.89      0
[1,0]<stdout>:   536870912     134217728     float     sum      -1   7286.3   73.68  128.94      0   7299.7   73.55  128.71      0
[1,0]<stdout>:  1073741824     268435456     float     sum      -1    14494   74.08  129.65      0    14515   73.98  129.46      0
[1,0]<stdout>:  2147483648     536870912     float     sum      -1    28866   74.40  130.19      0    28892   74.33  130.08      0
[1,0]<stdout>:multi-node-nccl-test-worker-0:21:21 [0] NCCL INFO comm 0x55f94529a0b0 rank 0 nranks 8 cudaDev 0 busId 160 - Destroy COMPLETE
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 40.7925 
[1,0]<stdout>:#
[1,1]<stdout>:multi-node-nccl-test-worker-0:22:22 [1] NCCL INFO comm 0x56151cbd4340 rank 1 nranks 8 cudaDev 1 busId 170 - Destroy COMPLETE
[1,2]<stdout>:multi-node-nccl-test-worker-0:23:23 [2] NCCL INFO comm 0x5556c80d7a40 rank 2 nranks 8 cudaDev 2 busId 180 - Destroy COMPLETE
[1,3]<stdout>:multi-node-nccl-test-worker-0:24:24 [3] NCCL INFO comm 0x55ddec7c3630 rank 3 nranks 8 cudaDev 3 busId 190 - Destroy COMPLETE
[1,7]<stdout>:multi-node-nccl-test-worker-0:31:31 [7] NCCL INFO comm 0x55df10ef2500 rank 7 nranks 8 cudaDev 7 busId 1d0 - Destroy COMPLETE
[1,5]<stdout>:multi-node-nccl-test-worker-0:26:26 [5] NCCL INFO comm 0x561b7c28f430 rank 5 nranks 8 cudaDev 5 busId 1b0 - Destroy COMPLETE
[1,6]<stdout>:multi-node-nccl-test-worker-0:29:29 [6] NCCL INFO comm 0x5648feb716d0 rank 6 nranks 8 cudaDev 6 busId 1c0 - Destroy COMPLETE
[1,4]<stdout>:multi-node-nccl-test-worker-0:25:25 [4] NCCL INFO comm 0x5569eef06e40 rank 4 nranks 8 cudaDev 4 busId 1a0 - Destroy COMPLETE
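
For readers interpreting the table above: per the nccl-tests performance notes, `algbw` is the message size divided by the elapsed time, and for all-reduce `busbw` is `algbw` scaled by 2(n-1)/n, where n is the number of ranks. The sketch below assumes the run is an all-reduce test (the `sum` reduction, the `-1` root, and the 1.75 ratio between the two columns at `nranks 8` are consistent with that, but the log does not name the binary). The function names are illustrative.

```go
package main

import "fmt"

// algBW returns the algorithm bandwidth in GB/s for a message of
// sizeBytes that took timeUs microseconds.
func algBW(sizeBytes, timeUs float64) float64 {
	return sizeBytes / (timeUs * 1e-6) / 1e9
}

// allReduceBusBW converts algorithm bandwidth to bus bandwidth for an
// all-reduce across nranks ranks, using the factor 2(n-1)/n.
func allReduceBusBW(alg float64, nranks int) float64 {
	n := float64(nranks)
	return alg * 2 * (n - 1) / n
}

func main() {
	// Last row of the table: 2147483648 bytes in 28866 us across 8 ranks.
	alg := algBW(2147483648, 28866)
	fmt.Printf("algbw %.2f GB/s, busbw %.2f GB/s\n", alg, allReduceBusBW(alg, 8))
}
```

Applied to the last row, the results land within rounding of the logged 74.40 and 130.19 GB/s; the small difference comes from the time column being rounded in the log.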

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@weicongw marked this pull request as ready for review June 10, 2024 23:27
e2e2/test/cases/nvidia/main_test.go (outdated)
e2e2/test/cases/nvidia/mpi_test.go (outdated)
Comment on lines +112 to +129
if *nodeType != "" {
	for _, v := range nodes.Items {
		if v.Labels["node.kubernetes.io/instance-type"] == *nodeType {
			nodeCount++
			gpu := v.Status.Capacity["nvidia.com/gpu"]
			gpuPerNode = int(gpu.Value())
			efa := v.Status.Capacity["vpc.amazonaws.com/efa"]
			efaPerNode = int(efa.Value())
		}
	}
} else {
	log.Printf("No node type specified. Using the node type %s in the node groups.", nodes.Items[0].Labels["node.kubernetes.io/instance-type"])
	nodeCount = len(nodes.Items)
	gpu := nodes.Items[0].Status.Capacity["nvidia.com/gpu"]
	gpuPerNode = int(gpu.Value())
	efa := nodes.Items[0].Status.Capacity["vpc.amazonaws.com/efa"]
	efaPerNode = int(efa.Value())
}
Member

So the idea is that our deployer doesn't have to pass any info to the test binary about what instance type it deployed, the test case will just adapt to whatever it finds in the cluster? I think that's a fine idea, but I think you need to assert here that you have a single instance type in the cluster and throw an error if there's a mix.

Also why move this to main_test.go? It's only relevant to the multi-node tests, right?
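
The single-instance-type assertion suggested in this comment could look roughly like the sketch below. To keep the example self-contained, `node` is a simplified stand-in for the Kubernetes `corev1.Node` type and only models labels; `singleInstanceType` is a hypothetical helper, not code from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

const instanceTypeLabel = "node.kubernetes.io/instance-type"

// node is a simplified stand-in for corev1.Node; only labels are modeled.
type node struct {
	Labels map[string]string
}

// singleInstanceType returns the instance type shared by all nodes, or an
// error if the list is empty or contains more than one instance type.
func singleInstanceType(nodes []node) (string, error) {
	if len(nodes) == 0 {
		return "", fmt.Errorf("no nodes found")
	}
	seen := map[string]struct{}{}
	for _, n := range nodes {
		seen[n.Labels[instanceTypeLabel]] = struct{}{}
	}
	if len(seen) > 1 {
		types := make([]string, 0, len(seen))
		for t := range seen {
			types = append(types, t)
		}
		sort.Strings(types) // stable order for the error message
		return "", fmt.Errorf("expected a single instance type, found %d: %v", len(seen), types)
	}
	return nodes[0].Labels[instanceTypeLabel], nil
}

func main() {
	mixed := []node{
		{Labels: map[string]string{instanceTypeLabel: "p3dn.24xlarge"}},
		{Labels: map[string]string{instanceTypeLabel: "m5.large"}},
	}
	if _, err := singleInstanceType(mixed); err != nil {
		fmt.Println("error:", err)
	}
}
```

In the real test the same check would iterate over `nodes.Items` before reading the GPU and EFA capacities.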

@weicongw (Contributor, Author) commented Jun 12, 2024

> Also why move this to main_test.go? It's only relevant to the multi-node tests, right?

I plan to modify the single node test as well, so it can make the most use of the single node.

Comment on lines -72 to -75
- -t
- "1"
- -g
- "1"
Member

What did these do?

@weicongw (Contributor, Author) commented Jun 12, 2024

-t,--nthreads number of threads per process. Default : 1.
-g,--ngpus number of gpus per thread. Default : 1.

Both flags default to 1, so removing them from the manifest does not change the run.
https://github.com/NVIDIA/nccl-tests
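
For context on why these two flags matter: the nccl-tests README describes the total number of ranks as the number of processes times the number of threads per process (`-t`) times the number of GPUs per thread (`-g`). A small sketch of that arithmetic follows; `totalRanks` is an illustrative helper and the 16-process figure is an example value, not something measured in this PR.

```go
package main

import "fmt"

// totalRanks mirrors the nccl-tests rule:
// ranks = processes * threads per process (-t) * GPUs per thread (-g).
func totalRanks(processes, threadsPerProcess, gpusPerThread int) int {
	return processes * threadsPerProcess * gpusPerThread
}

func main() {
	// With -t and -g left at their defaults of 1, every MPI process
	// drives exactly one GPU, so ranks equal the process count.
	fmt.Println(totalRanks(16, 1, 1))
}
```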

Member

I would consider just using a different manifest for the instance families we intend to target -- we probably should tune things like memory as well, and I think it's going to be clearer to hardcode most of this

Contributor Author

If we use different manifests for the instance families, we would need to create and maintain many manifests in this repo, which might become a burden in the long run. We can start with the dynamic manifest and revert to a static one if needed. In my opinion, it is easier to switch from dynamic to static.

Member

Fair enough, I don’t feel too strongly either way. Realistically we’ll only be testing 1 or 2 of the most popular families, which we would add/tune as they’re launched.

@cartermckinnon (Member) left a comment

LGTM

@cartermckinnon merged commit 56cbd21 into aws:main Jun 14, 2024
3 checks passed