Panoptes - A Binary Translation Framework for CUDA (c) 2012 - Chris Kennelly (email@example.com)
Table of Contents
Panoptes intercepts library calls to the GPU in order to maintain bookkeeping information about the state of the GPU, including device code. This permits on-the-fly instrumentation of existing programs without recompilation.
This functionality is currently demonstrated by providing a memory checking functionality similar to Valgrind's memcheck tool for CUDA that instruments device code so that it can continue to run on the GPU, maintaining the parallelism that may have necessitated the use of GPUs in the first place.
Panoptes is open source software, licensed under the GPLv3. For more information see COPYING.
The Panoptes interposer depends on Boost, CUDA, Make, and Valgrind (for its hooks). The testsuite shares the same dependencies as well as Google's googletest framework (http://code.google.com/p/googletest/).
Once the appropriate include paths are specified in the Makefile, run 'make' from the source directory to build the interposer and run the test suite. (A working CUDA-compatible GPU is required for the tests to work.)
To run a CUDA program under Panoptes (for demonstration purposes, named "my_cuda_program"):
Panpotes is a research code base that has not achieved a complete implementation of CUDA. Notable limitatations (and the rationale for them) currently include:
- Mapped memory accesses from the GPU. Panoptes currently reports to callers of cudaGetDeviceProperties the flag canMapHostMemory to be zero. Supporting direct access requires we maintain two sets of validity bits, one for the device and one for the host, keeping the state of the two sets reasonably consistent. We could make the host "authoritative," exposing the validity bits for mapped regions stored by Valgrind directly to the device. Doing so would require tight coupling with Valgrind's internals as well as likely patching Valgrind to disable its compression technique for validity bits (as Panoptes uses 1:1 bit level shadowing). We could make the device authoritative. Upon a kernel launch, we would need to speculatively transfer any dirty, host-stored validity bits out of Valgrind and onto the device. Upon a host access, we would have to load the validity bits off of the device and place them into Valgrind.
- Peer-to-Peer Support. Panoptes currently does not implement the peer-to-peer functionality exposed in CUDA 4. Implementing this is currently limited by the fact that each device maintains its own master list of chunks in its address space. Devices that can communicate with peer-to-peer need to share some of those portions of their address space (relatedly, cudaGetDeviceProperties under Panoptes reports that unified addressing is not supported). Since this is a common use case for multi-GPU systems, it is expected to be implemented in the near future.
- Instruction support. Not all parts of the PTX instruction set are supported. Further, parts of the PTX instruction set that are supported have largely been tested by generating kernels written in C/C++ with nvcc. It is possible that there are untested edge cases that would only be exposed by use of inline PTX.
- Atomic operations: These currently are not checked for addressability nor are validity bits considered.
- Textures/Surfaces: Texture/surface support is currently being tested, but is not released.
- Extended Precision: Extended precision operations (addc, subc, madc) are not instrumented.
- Video instructions: The video instructions (vadd, vsub, vabsdiff, vmin,vmax, vshl, vshr, vmad, vset) are currently not instrumented.