Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fluids: Add PyTorch external DD SGS evaluation #1581

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

jrwrigh
Copy link
Collaborator

@jrwrigh jrwrigh commented May 9, 2024

Add PyTorch as a external DD SGS evaluation. This is a "follow up" to #1361 where we use PyTorch to run the data-driven model instead of a native implementation.

ToDos:

  • Add in PetscLogEvents (particularly for inference and the data transfer steps)
  • Add documentation
  • Cleanup the Makefile additions (remove the automatic USE_LIBTORCH testing)
  • Add weak symbols so that this can be compiled without pytorch
  • Add command line switch to enable pytorch vs internal sequential
  • Move createPyTorchModel into the testing directory (it's only useful for documenting the creation of the testing model (instead of it just being a binary blob).
  • Rename "sequential_internal" to "sequential_ceed"

@jrwrigh jrwrigh self-assigned this May 9, 2024
@jrwrigh jrwrigh force-pushed the jrwrigh/pytorch_external_sgs branch 2 times, most recently from 30f4204 to 50af404 Compare May 9, 2024 23:00
@jrwrigh jrwrigh force-pushed the jrwrigh/pytorch_external_sgs branch 4 times, most recently from ab6b800 to 984a964 Compare May 20, 2024 14:29
@jrwrigh jrwrigh marked this pull request as ready for review May 20, 2024 21:57
@jrwrigh jrwrigh added 1-In Review and removed 0-WIP labels May 20, 2024
Copy link
Collaborator Author

@jrwrigh jrwrigh May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To explain the reasoning for this file:

CMake is the only supported build system for PyTorch/libTorch. I initially tried to write a small CMake script to just get the necessary compiler flags, but this ended up not working because normal pip-installed versions of PyTorch have libraries built with CUDA and CMake requires that CUDA be available for those libraries even though it's completely unnecessary if you're not using the CUDA backends of PyTorch.

To circumvent this, the PyTorch python package gives some useful information to build against the underlying libTorch (include and library directories mostly).

However, it still doesn't give required compiler flags, namely GLIBCXX_USE_CXXX11_ABI. To get around that, I parse the TorchConfig.cmake file which has that information embedded in it. Rather than break out my grep-fu, I decided it'd be easiest to do it all in Python since I'm using Python for the include and library paths (and for where TorchConfig.cmake is located).

The easiest way I could think to transfer all this information to the Makefile while keeping the number of Python startups (which can take about 5 seconds per startup depending on the system in my experience) to a minimum is to write it to a pkgconfig file from Python that the make file then uses. Yeah, this relies on some POSIX compliant file system behavior, but I figured that's the case for most build environments.

Any who, thanks for attending my TED talk/rant.

@jrwrigh jrwrigh force-pushed the jrwrigh/pytorch_external_sgs branch 3 times, most recently from 26b57d3 to 31dc6d3 Compare May 26, 2024 19:42
@jrwrigh
Copy link
Collaborator Author

jrwrigh commented May 27, 2024

This is good for review now. Tests pass on Noether for CPU and CUDA. I'm having some difficulties getting PyTorch+ROCM to build with Spack, but that's not a huge priority since we don't have hardware to run that on right now.

- Rename sequential_internal -> *_ceed
To have the log_events accessbile to torch (in C++), I needed to
separate out the header file containing the extern PetscLogEvent
declarations. While I was at it, I figured it'd be more clear to have a
separate log_events.c file as well to have the actual "storage" of the
PetscLogEvents and the RegisterLogEvents function itself.
On Sunspot, on-device inference is not working reliably. I'm not sure
exactly why at the moment (whether it's a libCEED backend issue or
something else).
@jrwrigh jrwrigh force-pushed the jrwrigh/pytorch_external_sgs branch from 31dc6d3 to 8d82690 Compare May 28, 2024 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant