[feat] Expand TornadoVM profiler with GPU power metrics from NVIDIA NVML API #377
Conversation
…ctions with the profiler via the PTXDeviceContext
...drivers/opencl/src/main/java/uk/ac/manchester/tornado/drivers/opencl/OCLKernelScheduler.java
Running the OpenCL backend:
I am using NVIDIA CUDA Driver:
Is there anything else I should enable to get the profiling?
Actually, the error is related to the JNI invocation call:
Exception in thread "main" java.lang.UnsatisfiedLinkError: 'long uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit()'
at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit(Native Method)
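This UnsatisfiedLinkError is the standard JNI symptom of a native symbol that was never loaded: the `native` declaration compiles fine and only fails at the first invocation. A minimal, hypothetical reproduction (the class and method below merely mimic `OCLNvml.nvmlInit()`; they are not the TornadoVM code):

```java
// Hypothetical reproduction of the linkage failure: a JNI method resolves
// lazily, so a missing native library only surfaces as UnsatisfiedLinkError
// when the method is first invoked. This class mimics, but is not,
// uk.ac.manchester.tornado.drivers.opencl.OCLNvml.
public class NvmlLinkDemo {
    private static native long nvmlInit(); // no matching native library is loaded

    public static void main(String[] args) {
        try {
            System.out.println("nvmlInit returned " + nvmlInit());
        } catch (UnsatisfiedLinkError e) {
            // Expected: the JNI library was not built or is not on java.library.path
            System.out.println("Caught UnsatisfiedLinkError: " + e.getMessage());
        }
    }
}
```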
If I compile with
This does not seem to be a problem with the CUDA version. It may be OS-specific. Can you let me know which OS you have? I tried the PR with
Kernel:
for me it works:
Result is correct. Total time: 612294 (ns)
{
"s0": {
"POWER_USAGE_mW": "48183",
"TOTAL_DISPATCH_KERNEL_TIME": "4096",
"COPY_IN_TIME": "42816",
"COPY_OUT_TIME": "19936",
"TOTAL_KERNEL_TIME": "7168",
"TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "72160",
"TOTAL_TASK_GRAPH_TIME": "374417",
"TOTAL_COPY_IN_SIZE_BYTES": "800048",
"TOTAL_COPY_OUT_SIZE_BYTES": "400024",
"s0.t0": {
"BACKEND": "OPENCL",
"METHOD": "VectorAddInt.vectorAdd",
"DEVICE_ID": "0:0",
"DEVICE": "NVIDIA GeForce RTX 3070",
"TOTAL_COPY_IN_SIZE_BYTES": "24",
"POWER_USAGE_mW": "48183",
"TASK_KERNEL_TIME": "7168"
}
}
}
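If you want to post-process a profiler dump like the one above, the new POWER_USAGE_mW field can be pulled out with a plain regex and converted to watts. This is a hypothetical helper, not part of TornadoVM:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper (not part of TornadoVM): extract the
// POWER_USAGE_mW field from the profiler JSON and convert milliwatts to watts.
public class PowerParse {
    public static void main(String[] args) {
        String json = "{ \"s0\": { \"POWER_USAGE_mW\": \"48183\" } }";
        Matcher m = Pattern.compile("\"POWER_USAGE_mW\"\\s*:\\s*\"(\\d+)\"").matcher(json);
        if (m.find()) {
            long milliWatts = Long.parseLong(m.group(1));
            // Locale.ROOT keeps the decimal separator a '.' on any machine
            System.out.printf(Locale.ROOT, "%.3f W%n", milliWatts / 1000.0);
        }
    }
}
```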
OS & Kernel:
Linux pop-os 6.8.0-76060800daily20240311-generic #202403110203~1711393930~22.04~331756a SMP PREEMPT_DYNAMIC Mon M x86_64 x86_64 x86_64 GNU/Linux
NVIDIA version:
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
For me it worked without doing any extra steps.
LGTM, thanks
OpenCL & PTX works for:
Kernel:
Linux pop-os 6.8.0-76060800daily20240311-generic #202403110203~1711393930~22.04~331756a SMP PREEMPT_DYNAMIC Mon M x86_64 x86_64 x86_64 GNU/Linux
NVIDIA version:
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
After a sync, it seems that the error occurs because, in the failing system, CUDA is installed manually and therefore the
…n and extend it in the backends
…based on an abstract class (PowerMetric) that can be extended in each backend opencl and ptx
as this is a self-contained functionality. Does it make sense to port it with FFM as an exercise?
My plan is to test it for Windows, and then we can proceed with merging the PR. I think it makes sense because it has just 4 files in the native part to be tested with the FFI API.
What is the purpose of FFI here? The TornadoVM native code that interacts with the driver would then use two different approaches. That's the main reason we unified all native code in C++ using the same style across backends, so it is easier to maintain and debug. IMO, if we transition to FFI, it should be for all backends:
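For context on what an FFM-based binding would look like compared with JNI, here is a minimal, hypothetical sketch using the `java.lang.foreign` API (JDK 22+). It calls libc's `getpid` as a stand-in for an NVML symbol, so it runs without the NVIDIA driver:

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical FFM (Panama) sketch: the call style that would replace a JNI
// binding such as nvmlInit(). libc's getpid stands in for an NVML symbol so
// the example runs without the NVIDIA driver. Requires JDK 22 or later.
public class FfmSketch {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // Look up the symbol in the default (libc) lookup and bind it
        MethodHandle getpid = linker.downcallHandle(
                linker.defaultLookup().find("getpid").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT));
        int pid = (int) getpid.invoke();
        System.out.println(pid > 0); // a valid process id is positive
    }
}
```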
Yes, I think we agree on that. My understanding is that @mikepapadim would like to test it. Anyway, we can move that discussion to a separate proposal.
The PR is ready for review and for running the tests. It should also work on Windows (native installation).
LGTM. There is an issue with the Level Zero backend and the profiler, but that's a separate issue not related to this PR. I will open a new one.
Improvements
~~~~~~~~~~~~~~~~~~

- [beehive-lab#369](beehive-lab#369): Introduction of Tensor types in TornadoVM API and interoperability with ONNX Runtime.
- [beehive-lab#370](beehive-lab#370): Array concatenation operation for TornadoVM native arrays.
- [beehive-lab#371](beehive-lab#371): TornadoVM installer script ported for Windows 10/11.
- [beehive-lab#372](beehive-lab#372): Add support for ``HalfFloat`` (``Float16``) in vector types.
- [beehive-lab#374](beehive-lab#374): Support for TornadoVM array concatenations from the constructor-level.
- [beehive-lab#375](beehive-lab#375): Support for TornadoVM native arrays using slices from the Panama API.
- [beehive-lab#376](beehive-lab#376): Support for lazy copy-outs in the batch processing mode.
- [beehive-lab#377](beehive-lab#377): Expand the TornadoVM profiler with power metrics for NVIDIA GPUs (OpenCL and PTX backends).
- [beehive-lab#384](beehive-lab#384): Auto-closable Execution Plans for automatic memory management.

Compatibility
~~~~~~~~~~~~~~~~~~

- [beehive-lab#386](beehive-lab#386): OpenJDK 17 support removed.
- [beehive-lab#390](beehive-lab#390): SapMachine OpenJDK 21 supported.
- [beehive-lab#395](beehive-lab#395): OpenJDK 22 and GraalVM 22.0.1 supported.
- TornadoVM tested with Apple M3 chips.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- [beehive-lab#367](beehive-lab#367): Fix for Graal/Truffle languages in which some Java modules were not visible.
- [beehive-lab#373](beehive-lab#373): Fix for data copies of the ``HalfFloat`` types for all backends.
- [beehive-lab#378](beehive-lab#378): Fix free memory markers when running multi-thread execution plans.
- [beehive-lab#379](beehive-lab#379): Refactoring package of vector api unit-tests.
- [beehive-lab#380](beehive-lab#380): Fix event list sizes to accommodate profiling of large applications.
- [beehive-lab#385](beehive-lab#385): Fix code check style.
- [beehive-lab#387](beehive-lab#387): Fix TornadoVM internal events in OpenCL, SPIR-V and PTX for running multi-threaded execution plans.
- [beehive-lab#388](beehive-lab#388): Fix of expected and actual values of tests.
- [beehive-lab#392](beehive-lab#392): Fix installer for using existing JDKs.
- [beehive-lab#389](beehive-lab#389): Fix ``DataObjectState`` for multi-thread execution plans.
- [beehive-lab#396](beehive-lab#396): Fix JNI code for the CUDA NVML library access with OpenCL.
Description
This PR implements a new feature that adds power consumption as a metric in the TornadoVM profiler. To that end, this PR invokes the NVIDIA NVML API in the JNI part of both the OpenCL and PTX drivers.
I updated the PR with a hierarchical design:
-> drivers-common: new package (power) that contains a new interface, PowerMetric.java.
-> drivers-opencl: new package (power) that implements two instances of the new interface (OCLNvidiaPowerMetric.java, OCLEmptyPowerMetric.java). The OCLNvidiaPowerMetric.java contains the JNI methods that point to the NVML functions.
-> drivers-ptx: new package (power) that implements one instance of the new interface (PTXNvidiaPowerMetric.java). The PTXNvidiaPowerMetric.java contains the JNI methods that point to the NVML functions.
See the OCLNvidiaPowerMetric.cpp and PTXNvidiaPowerMetric.cpp files, which contain similar code.
I have modified the CMakeLists.txt file in OpenCL to build the OCLNvidiaPowerMetric.cpp file only if nvml.h and the nvidia-ml library are available on the system. For PTX, this is not necessary, as the NVML API is available if CUDA is installed (i.e., NVML is part of the NVIDIA GPU Deployment Kit).
The queried result is added in the TornadoVM profiler as a metric:
Note: We observed that the build process may fail if the NVIDIA CUDA Toolkit is manually installed in a directory that is not the default one. Two points in the CMakeLists.txt of opencl-jni have been updated to include a custom directory:
- For nvml.h, the script will search in the default locations (e.g., /usr/include, /usr/local/include/), and we added /usr/local/cuda/targets/x86_64-linux/include as an option.
- For libnvidia-ml.so, the script will search in the default locations (e.g., /usr/lib/x86_64-linux-gnu, /usr/local/lib/), and we added /usr/local/cuda/targets/x86_64-linux/lib/stubs as an option.
Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?
To test, you can run:
To test in Windows: