Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assistance to broader Tag releases #395

Open
caio-davi opened this issue Apr 19, 2024 · 4 comments
Open

Assistance to broader Tag releases #395

caio-davi opened this issue Apr 19, 2024 · 4 comments

Comments

@caio-davi
Copy link

Hello team,

I noticed you have been updating most of the last tagged releases with the -aws suffix. I think I can assist if you need help testing it on other libfabric providers.
We have some nightly tests running on slingshot fabric, and I could add your test bed to it. This way the releases would have been tested in at least two providers.
Please, let me know if it is something you are interested!

@rajachan
Copy link
Member

Thanks for reaching out, we absolutely want to get back to general releases. As you rightly call out, test coverage across providers is one reason why we have not done that yet and any additional testing is welcome. Are you saying you have nightly tests running with the OFI NCCL plugin and the CXI provider already? If so, are you testing with the master branch?

FWIW, we currently track unit tests, functional tests using just the plugin interfaces, NCCL-level tests using the nccl-tests benchmark suite, and application tests to gain confidence in a release. This wiki page has some more information about what gets tested with each Pull Request: https://github.com/aws/aws-ofi-nccl/wiki/Testing-and-Benchmarking. In addition, we have a more exhaustive nightly test suite that tests changes with real applications.

@caio-davi
Copy link
Author

caio-davi commented Apr 20, 2024

Actually, these nightly tests I mentioned are tests for our internal software stack, and we are using nccl-tests as part of these tests. I apologize for the bad wording.
And no, we are not testing on the master branch. Since they are sanity tests, we use version 1.6.0, as it was the latest non-specific release (although I have been using a few -aws versions for other purposes with no major issues).
Please let me know how I can help with the testing.

@caio-davi
Copy link
Author

Hey @rajachan, just checking in...
Do you think we can move forward with this?

@rajachan
Copy link
Member

@caio-davi Yes! One thing we can do in the near term is extending the PR presubmit checks to include your tests with the cxi provider, to ensure commits that go into the master branch build and work as expected . We can setup any Github hooks needed for your automation to see new PRs, run the tests on your systems, and report back test results. Feel free to reach out to us at al-ofi-nccl-team@amazon.com and we can hash out specifics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants