-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PNNL CI w/ GitLab for GPU testing #64
Conversation
ca999be
to
3a17ca0
Compare
Only run in merge requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @cameronrutherford, for the clear instructions and comments everywhere. Very nice.
+1 to this 😄 thanks for all the docs along with the implementation |
119b349
to
6b2d4b9
Compare
https://code.pnnl.gov/e3sm/eagles/mam4xx/-/jobs/7350 - this is now generating an error during the building of mam4xx in CI. I was able to get HAERO to build with GPUs enabled using this pipeline, but for some reason I am unable to get the C/C++ compiler to correctly be configured. Relevant script is in #!/bin/bash
export BUILD_TYPE=Debug
export HAERO_INSTALL=$(pwd)/test-haero-install
export PRECISION=single
export SYSTEM_NAME=deception
# Assuming that this is the directory where you have mam4xx cloned
pushd mam4xx-git
./.github/pnnl-ci/ci.sh
popd And this is the script that I used to build HAERO in testing this myself: #!/bin/bash
export BUILD_TYPE=Debug
export HAERO_INSTALL=$(pwd)/test-haero-install
export PRECISION=single
export SYSTEM_NAME=deception
# Assuming mam4xx-git is where you have the repo cloned
./mam4xx-git/.github/pnnl-ci/rebuild-haero.sh |
While CI is failing, I am also getting test failures when running mam4xx tests myself on the compute node with P100 GPU:
Test 5 is the only test that fails. |
I'm having trouble logging into PNNL's system (an "access policy" problem?). Can you paste any relevant text from the CI failure here? Also, for your manual testing: are you using single or double precision? |
By the way, it might be good to rebase this branch against |
Manual testing was just with single precision - that is captured in the scripts that I added to re-produce. w.r.t. GitLab, I see you have an account and should have access to eagles gitlab under the username Single Precision Release mode PNNL CI Output
|
I rebase whenever I start new development during a development session so this should be good to go. |
You might have to add the |
.github/pnnl-ci/rebuild-haero.sh
Outdated
# Checkout custom branch where working changes are | ||
# TODO - REMOVE ONCE MERGED INTO HAERO | ||
git checkout origin/deception-testing-ci || exit | ||
|
||
# Need to modify .gitmodules file before cloning | ||
perl -i -p -e 's|git@(.*?):|https://\1/|g' .gitmodules || exit | ||
# Update just haero submodules | ||
git submodule update --init || exit | ||
|
||
# Go through and repeat the process for submodules with submodules | ||
declare -a arr=("ekat" "skywalker") | ||
for subm in "${arr[@]}" | ||
do | ||
pushd ./ext/$subm | ||
perl -i -p -e 's|git@(.*?):|https://\1/|g' .gitmodules || exit | ||
git submodule update --init || exit | ||
popd | ||
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeff-cohere here is the block where I take care of HAERO submodules. There is no issue on this side of things, as this recursively initializes each submodule that matters in HAERO. I would use --recursive
when initializing submodules, but this was not feasible as we cannot use SSH based submodules.
Instead, I manually use Perl to in place modify .gitmodules
where it is necessary, and use that as a workaround.
perl -i -p -e 's|git@(.*?):|https://\1/|g' .gitmodules || exit | ||
git submodule update --init || exit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly for HAERO I have to modify mam4xx's .gitmodules
file.
I set |
It is true that mam4xx overrides the compilers using Haero's settings. This is another Kokkos thing. :-) It was fun trying to get this set up in the first place. |
Thinking about this more, it seems like the nvcc wrapper that mam4xx is pointing to is located in the haero source directory, and not in the installed configuration. This leads me to believe that when the HAERO CI install is completed, the old source files are deleted, and so the pointer to this file location is invalid, as nothing is there. Is there a way to patch this in HAERO? Do I have to now also keep a copy of the HAERO source directory that I used to install around? I'm also noticing that the error is with the kokkos submodule within HAERO - perhaps we need to just have an externally installed Kokkos that we can keep around that is outside of HAERO, and not using the submodule? |
That is a good idea. We already install Can you create an issue in the Haero repo for this?
The problem with this approach is that the version of Kokkos we use is a custom branch linked to EKAT. If we used an external version, we'd just be kicking the can down the road to address this problem when we integrate with EAMxx, which also relies on EKAT's version. Sorry about the complicated environment! Most of these decisions were made for us by the EAMxx project (which itself is no picnic to build and run). We're in a tough spot where we have to work with what we have and not try to be too ambitious, because our focus in the near term has to be on porting the parameterizations. Believe me, if we were doing this from scratch, a lot of things would look different if I had any say in it. |
7eb3276
to
e74054d
Compare
Single precision tests are failing in Debug, but passing in Release mode configuration. If we are happy with the infrastructure, then we can merge. Otherwise, we have to also debug the failing test before merging. |
I am still unable to log into the PNNL GitLab setup (I have an email to the support folks there). @pressel , do you want to take a look at it and verify that there's a test failure in the single-precision Debug build? Is anyone at Sandia able to connect to |
Okay, I got access to the PNNL CI pipeline. Here's the error (with CI kibble removed):
Evidently in this environment, |
@jeff-cohere I can test it out on Deception by hand and see how different psum and p0 are |
Thanks! |
This is what the atmosphere_utils test is putting out in single precision GPU Debug mode: |
Thanks, @jaelynlitz . @pbosler , this tolerance is evidently not sufficient for the compilers that Deception is using. I don't have access to my GPU machine today (Seattle is iced over), but if we can reproduce this issue on a single-precision GPU build, we may need to make this a little more permissive. |
I can't seem to reproduce this failure on my GPU-equipped machine (building Haero and mam4xx in single precision using CUDA). Maybe we should merge this PR and create an issue to track this particular error on Deception. Let's discuss in the new year. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cameronrutherford Thanks for your efforts on this. I'm fine with merging this PR and creating an issue for the single precision issues on deception as @jeff-cohere suggested above.
I've logged this in #93. |
This seems to be functional with empty testing scripts in PNNL CI.
Remaining TODOs are in README in pnnl-ci directory.