Add ppc 10.2 #55

Merged on Jun 4, 2021 (14 commits)

jaimergp
Member

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@conda-forge-linter

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@jaimergp jaimergp mentioned this pull request May 14, 2021
6 tasks
@jaimergp
Member Author

Some fixes needed in the Docker image. Pending conda-forge/docker-images#181

@jaimergp
Member Author

@peastman

PPC64LE fails with OpenCL (both pocl and oclgrind): the simtk.testInstallation routine reports NaN forces.

@peastman
Contributor

No idea what's causing it. We don't support either of those OpenCL implementations, and there's no information in the log beyond the fact that it produced nan. I notice the CUDA platform also failed. What kind of hardware is it running on? Is it using emulation?

@jaimergp
Member Author

CUDA will not run on this CI. The OpenCL tests did pass on the other platforms, but I guess we'll have to disable them on PPC and test with local artifacts.

@raimis

raimis commented May 20, 2021

Soon I'll have access to a machine with a ppc64le CPU and a V100 GPU. Would it be possible to disable the OpenCL tests so the builds finish and the packages are available to test?

@jaimergp
Member Author

Yes, definitely. Let's see if this commit works first!
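
For readers following along, here is a minimal sketch of what "disable the OpenCL tests on PPC" could look like in a test driver. It is not the change made in this PR: the `SKIP_ON_ARCH` table, the `platforms_to_test` function, and the `has_gpu` flag are hypothetical names introduced for illustration. Only the four platform names come from the thread.

```python
import platform

# Platform names as printed by simtk.testInstallation later in this thread.
ALL_PLATFORMS = ("Reference", "CPU", "CUDA", "OpenCL")

# Hypothetical skip table (an assumption, not taken from the recipe):
# architecture -> platforms that the CI job should not exercise.
SKIP_ON_ARCH = {"ppc64le": {"OpenCL"}}


def platforms_to_test(machine=None, has_gpu=False):
    """Return the platform names a test job should exercise."""
    machine = machine or platform.machine()
    skipped = set(SKIP_ON_ARCH.get(machine, ()))
    if not has_gpu:
        # CUDA cannot run on a worker without a GPU.
        skipped.add("CUDA")
    return [name for name in ALL_PLATFORMS if name not in skipped]


if __name__ == "__main__":
    print(platforms_to_test())
```

Keeping the skip list in one table makes it easy to re-enable OpenCL on PPC later if the failure turns out to be unrelated to the architecture.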

@raimis

raimis commented Jun 2, 2021

I tried to build the package on real hardware:

$ uname -a
Linux login02 4.18.0-147.13.2.el8_1.ppc64le #1 SMP Wed May 13 15:23:36 UTC 2020 ppc64le ppc64le ppc64le GNU/Linux
$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-285ad857-3f51-bece-1a26-111b71801e3c)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-4e6c7741-1786-866f-ec22-8036fbc79926)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-724c10eb-1b65-92da-a810-d0ce94e16f6c)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-9026bf81-cece-693a-af05-f54a1e412d36)

Everything except OpenCL works:

$ python -m simtk.testInstallation

OpenMM Version: 7.5.1
Git Revision: a9cfd7fb9343e21c3dbb76e377c721328830a3ee

There are 4 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces
4 OpenCL - Error computing forces with OpenCL platform

OpenCL platform error: Error compiling kernel: 

Median difference in forces between platforms:

Reference vs. CPU: 6.29803e-06
Reference vs. CUDA: 6.72703e-06
CPU vs. CUDA: 6.59765e-07

All differences are within tolerance.

So, it seems the CI fails just due to a missing GPU.
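
A report like the one above can also be checked by a script, which helps when comparing a CI run with a run on real hardware. The following is a rough sketch that assumes the report keeps the line layout shown in this comment. The function names and the tolerance value passed in the usage example are assumptions, since the thread does not state the threshold that testInstallation applies.

```python
import re

# "3 CUDA - Successfully computed forces"
PLATFORM_LINE = re.compile(r"^\s*\d+\s+(\S+)\s+-\s+(.*)$")
# "Reference vs. CUDA: 6.72703e-06"
DIFF_LINE = re.compile(r"^\s*(\S+) vs\. (\S+):\s*(\S+)\s*$")

# Excerpt of the report posted in this comment.
SAMPLE = """\
1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces
4 OpenCL - Error computing forces with OpenCL platform

OpenCL platform error: Error compiling kernel:

Median difference in forces between platforms:

Reference vs. CPU: 6.29803e-06
Reference vs. CUDA: 6.72703e-06
CPU vs. CUDA: 6.59765e-07
"""


def parse_report(text):
    """Return ({platform: succeeded}, {(platform_a, platform_b): difference})."""
    statuses, diffs = {}, {}
    for line in text.splitlines():
        match = PLATFORM_LINE.match(line)
        if match:
            statuses[match.group(1)] = match.group(2).startswith("Successfully")
            continue
        match = DIFF_LINE.match(line)
        if match:
            diffs[(match.group(1), match.group(2))] = float(match.group(3))
    return statuses, diffs


def check_report(text, tolerance):
    """Return (failed platform names, differences that exceed the tolerance)."""
    statuses, diffs = parse_report(text)
    failed = sorted(name for name, ok in statuses.items() if not ok)
    # "not d <= tolerance" also flags NaN, which compares false to everything.
    too_large = {pair: d for pair, d in diffs.items() if not d <= tolerance}
    return failed, too_large


if __name__ == "__main__":
    # The tolerance here is an assumed value, chosen only for illustration.
    print(check_report(SAMPLE, tolerance=1e-5))
```

Flagging NaN explicitly matters here because the earlier CI failure in this thread showed up as NaN forces rather than as a large numeric difference.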

@jaimergp
Member Author

jaimergp commented Jun 2, 2021

Excellent! I am sorry I haven't got back to this yet. Between vacation and onboarding I still have some open issues to track down. This is on my list though. I'll get to it shortly!

@peastman
Contributor

peastman commented Jun 2, 2021

I wonder why OpenCL failed to compile a kernel? Could you try running a few of the test cases? For example,

./TestOpenCLHarmonicBondForce
./TestOpenCLNonbondedForce

Does it give any more information?
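
When collecting output from several test binaries, a small wrapper that records the exit code together with stdout and stderr can make the results easier to paste into a thread like this one. The sketch below is illustrative only: `run_and_capture` is a hypothetical helper, and it assumes the two binaries named above sit in the current directory.

```python
import subprocess


def run_and_capture(command):
    """Run one command; return (exit code, combined stdout and stderr).

    The exit code is None when the command could not be started at all.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        return None, str(exc)
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Binary names come from the comment above; their location is an assumption.
    for binary in ("./TestOpenCLHarmonicBondForce", "./TestOpenCLNonbondedForce"):
        code, output = run_and_capture([binary])
        print(binary, "exit code:", code)
        print(output)
```

Merging stderr into stdout keeps any compiler diagnostics next to the exception message, in case the OpenCL driver prints a build log separately.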

@raimis

raimis commented Jun 2, 2021

Just non-descriptive error messages:

$ ./TestOpenCLHarmonicBondForce
exception: Error compiling kernel:
$ ./TestOpenCLNonbondedForce
exception: Error compiling kernel: 

@peastman
Contributor

peastman commented Jun 2, 2021

Very strange. And yet OpenCL did work on CI. I don't know whether it's worth spending a lot of time to investigate, since the number of people using NVIDIA OpenCL on PPC is likely to be close to zero.

@jaimergp
Member Author

jaimergp commented Jun 2, 2021

Should we disable OpenCL on PPC then?

@peastman
Contributor

peastman commented Jun 2, 2021

I would hesitate to do that without more information. We don't know why it's failing on that one particular computer. It might have nothing to do with the fact that it's PPC. We also have no reason to think the same error would happen with other OpenCL implementations.

@jchodera
Contributor

jchodera commented Jun 2, 2021

@peastman: Didn't we get you access to a ppc64le machine at some point?

@raimis

raimis commented Jun 3, 2021

> Very strange. And yet OpenCL did work on CI. I don't know whether it's worth spending a lot of time to investigate, since the number of people using NVIDIA OpenCL on PPC is likely to be close to zero.

The CI machine doesn't have a GPU, so OpenCL tries to run on a CPU and succeeds. My machine has GPUs, so OpenCL tries to use them and fails, I guess. Anyway, OpenCL is irrelevant in this case.

So far, every PPC machine I have seen has only V100 GPUs. So, if we make a package with CUDA 10.2, it will cover most (if not all) use cases.

@peastman
Contributor

peastman commented Jun 3, 2021

> @peastman: Didn't we get you access to a ppc64le machine at some point?

See #55 (comment).

@jaimergp
Member Author

jaimergp commented Jun 3, 2021

Ok, this is ready to go if everybody agrees.

@jaimergp jaimergp merged commit 502474e into conda-forge:master Jun 4, 2021
@raimis

raimis commented Jun 4, 2021

Thanks @jaimergp!


5 participants