-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate multi-node nccl testing into the tester package #447
Conversation
e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml
Outdated
Show resolved
Hide resolved
if *nodeType != "" { | ||
for _, v := range nodes.Items { | ||
if v.Labels["node.kubernetes.io/instance-type"] == *nodeType { | ||
nodeCount++ | ||
gpu := v.Status.Capacity["nvidia.com/gpu"] | ||
gpuPerNode = int(gpu.Value()) | ||
efa := v.Status.Capacity["vpc.amazonaws.com/efa"] | ||
efaPerNode = int(efa.Value()) | ||
} | ||
} | ||
} else { | ||
log.Printf("No node type specified. Using the node type %s in the node groups.", nodes.Items[0].Labels["node.kubernetes.io/instance-type"]) | ||
nodeCount = len(nodes.Items) | ||
gpu := nodes.Items[0].Status.Capacity["nvidia.com/gpu"] | ||
gpuPerNode = int(gpu.Value()) | ||
efa := nodes.Items[0].Status.Capacity["vpc.amazonaws.com/efa"] | ||
efaPerNode = int(efa.Value()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea is that our deployer doesn't have to pass any info to the test binary about what instance type it deployed, the test case will just adapt to whatever it finds in the cluster? I think that's a fine idea, but I think you need to assert here that you have a single instance type in the cluster and throw an error if there's a mix.
Also why move this to main_test.go
? It's only relevant to the multi-node tests, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also why move this to main_test.go? It's only relevant to the multi-node tests, right?
I plan to modify the single node test as well, so it can make the most use of the single node.
- -t | ||
- "1" | ||
- -g | ||
- "1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did these do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-t,--nthreads number of threads per process. Default : 1.
-g,--ngpus number of gpus per thread. Default : 1.
The default values are 1.
https://github.com/NVIDIA/nccl-tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider just using a different manifest for the instance families we intend to target -- we probably should tune things like memory as well, and I think it's going to be clearer to hardcode most of this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use different manifests for the instance families, we would need to create and maintain many manifests in this repo, which might become a burden in the long run. We can start with the dynamic manifest and revert to a static one if needed. In my opinion, it is easier to switch from dynamic to static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I don’t feel too strongly either way. Realistically we’ll only be testing 1 or 2 of the most popular families, which we would add/tune as they’re launched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Issue #, if available:
Description of changes:
This PR integrates multi-node NCCL testing into the tester package. The tester now accepts the following flags to configure the test:
The tester can retrieve the hardware specifications from the nodes and render the nccl test manifest based on these specifications.
Testing
Testing pod logs
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.