[feat] Expand TornadoVM profiler with GPU power metrics from NVIDIA NVML API #377

stratika · 2024-04-15T14:26:02Z

Description

This PR adds GPU power consumption as a metric in the TornadoVM profiler. To that end, it invokes the NVIDIA NVML API from the JNI part of both the OpenCL and PTX drivers.

I updated the PR with a hierarchical design:
-> drivers-common: new package (power) that contains a new interface, PowerMetric.java.
-> drivers-opencl: new package (power) that provides two implementations of the new interface (OCLNvidiaPowerMetric.java, OCLEmptyPowerMetric.java). OCLNvidiaPowerMetric.java contains the JNI methods that point to the NVML functions.
-> drivers-ptx: new package (power) that provides one implementation of the new interface (PTXNvidiaPowerMetric.java). PTXNvidiaPowerMetric.java contains the JNI methods that point to the NVML functions.
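
To make the layering concrete, here is a minimal sketch of how one interface with an NVML-backed implementation and an "empty" implementation could fit together. Only the name PowerMetric comes from this PR; the method `powerUsageMilliwatts`, the classes `StubNvidiaPowerMetric`, `EmptyPowerMetric` and `PowerMetricSketch`, and the fixed reading are hypothetical stand-ins, and the real interface may look different.

```java
import java.util.OptionalLong;

// Hypothetical shape of the interface; the real PowerMetric methods may differ.
interface PowerMetric {
    // Power draw in milliwatts, or empty when the device exposes no reading.
    OptionalLong powerUsageMilliwatts();
}

// Stand-in for an NVML-backed implementation. A real one would call JNI
// methods; here a fixed reading is injected so the sketch runs anywhere.
final class StubNvidiaPowerMetric implements PowerMetric {
    private final long reading;

    StubNvidiaPowerMetric(long reading) {
        this.reading = reading;
    }

    @Override
    public OptionalLong powerUsageMilliwatts() {
        return OptionalLong.of(reading);
    }
}

// Counterpart of an "empty" implementation for devices without NVML support.
final class EmptyPowerMetric implements PowerMetric {
    @Override
    public OptionalLong powerUsageMilliwatts() {
        return OptionalLong.empty();
    }
}

final class PowerMetricSketch {
    // Renders the metric as a profiler value: a number, or "n/a" when unavailable.
    static String profilerValue(PowerMetric metric) {
        OptionalLong mw = metric.powerUsageMilliwatts();
        return mw.isPresent() ? Long.toString(mw.getAsLong()) : "n/a";
    }

    public static void main(String[] args) {
        System.out.println(profilerValue(new StubNvidiaPowerMetric(9229)));
        System.out.println(profilerValue(new EmptyPowerMetric()));
    }
}
```

With this split, the code that fills in the profiler entry depends only on the interface, so a backend can pick the implementation that matches the device.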

See the OCLNvidiaPowerMetric.cpp and PTXNvidiaPowerMetric.cpp files, which contain similar code for:

i) initialising NVML for all GPU devices in the system,
ii) getting the device handle for a particular index,
iii) querying the power usage of a device handle.

I have modified the CMakeLists.txt file in OpenCL to build the OCLNvidiaPowerMetric.cpp file only if nvml.h and the nvidia-ml library are available on the system. For PTX, this check is not necessary because the NVML API is available whenever CUDA is installed (i.e., NVML is part of the NVIDIA GPU Deployment Kit).

The queried result is added in the TornadoVM profiler as a metric:

{
    "s0": {
        "TOTAL_KERNEL_TIME": "15328",
        "COPY_OUT_TIME": "65792",
        "TOTAL_TASK_GRAPH_TIME": "649996",
        "COPY_IN_TIME": "3680",
        "TOTAL_DISPATCH_KERNEL_TIME": "0",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "0",
        "TOTAL_COPY_OUT_SIZE_BYTES": "400024",
        "s0.t0": {
            "BACKEND": "PTX",
            "METHOD": "VectorAddInt.vectorAdd",
            "DEVICE_ID": "0:0",
            "DEVICE": "NVIDIA RTX A2000 8GB Laptop GPU",
            "TOTAL_COPY_IN_SIZE_BYTES": "24",
            "POWER_USAGE_mW": "9229",              # <--- This line is new, if the device is not NVIDIA "n/a" is returned.
            "TASK_KERNEL_TIME": "15328"
        }
    }
}
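
For anyone consuming this output, the sketch below shows one way to read the new field defensively, since the value is either a number in milliwatts or "n/a". The class `PowerUsageReader` and its method `toWatts` are hypothetical helpers, not part of TornadoVM, and the map stands in for an already parsed `s0.t0` entry.

```java
import java.util.Map;
import java.util.OptionalDouble;

final class PowerUsageReader {
    // Converts a POWER_USAGE_mW value to watts.
    // "n/a", null, or any other non-numeric value yields an empty result.
    static OptionalDouble toWatts(String raw) {
        if (raw == null) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Long.parseLong(raw.trim()) / 1000.0);
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a parsed task entry; values copied from the sample output.
        Map<String, String> task = Map.of(
                "BACKEND", "PTX",
                "POWER_USAGE_mW", "9229",
                "TASK_KERNEL_TIME", "15328");

        OptionalDouble watts = toWatts(task.get("POWER_USAGE_mW"));
        System.out.println(watts.isPresent() ? watts.getAsDouble() + " W" : "n/a");
    }
}
```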

Note: We observed that the build process may fail if the NVIDIA CUDA Toolkit is manually installed in a directory that is not the default one. Two points in the CMakeLists.txt of opencl-jni have been updated to include a custom directory:

For nvml.h, the script searches the default locations (e.g., /usr/include, /usr/local/include/), and we added /usr/local/cuda/targets/x86_64-linux/include as an additional option.
For libnvidia-ml.so, the script searches the default locations (e.g., /usr/lib/x86_64-linux-gnu, /usr/local/lib/), and we added /usr/local/cuda/targets/x86_64-linux/lib/stubs as an additional option.

Backend/s tested

Mark the backends affected by this PR.

OpenCL
PTX
SPIRV

OS tested

Mark the OS where this PR is tested.

Linux
OSx
Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

Yes
No

How to test the new patch?

To test, you can run:

make BACKEND=opencl
tornado --enableProfiler console -m tornado.examples/uk.ac.manchester.tornado.examples.VectorAddInt --params="100000"

make BACKEND=ptx
tornado --enableProfiler console -m tornado.examples/uk.ac.manchester.tornado.examples.VectorAddInt --params="100000"

To test in Windows:

.\bin\tornadovm-installer.cmd --backend=opencl
python %TORNADO_SDK%\bin\tornado -ea  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True "  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.profiler.TestProfiler"

.\bin\tornadovm-installer.cmd --backend=ptx
python %TORNADO_SDK%\bin\tornado -ea  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True "  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.profiler.TestProfiler"

jjfumero · 2024-04-16T08:36:09Z

Running the OpenCL backend:

tornado --enableProfiler console -m tornado.examples/uk.ac.manchester.tornado.examples.VectorAddInt 100000
WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "main" java.lang.UnsatisfiedLinkError: 'long uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit()'
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit(Native Method)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.clNvmlInit(OCLNvml.java:52)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.<init>(OCLNvml.java:36)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLDeviceContext.<init>(OCLDeviceContext.java:71)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLContext.createDeviceContext(OCLContext.java:209)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLContext.createDeviceContext(OCLContext.java:42)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.graal.OCLHotSpotBackendFactory.createJITCompiler(OCLHotSpotBackendFactory.java:95)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.createOCLJITCompiler(OCLBackendImpl.java:204)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.installDevices(OCLBackendImpl.java:218)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.lambda$discoverDevices$4(OCLBackendImpl.java:225)
	at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
	at java.base/java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:617)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.discoverDevices(OCLBackendImpl.java:223)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.<init>(OCLBackendImpl.java:76)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLTornadoDriverProvider.createBackend(OCLTornadoDriverProvider.java:48)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.loadBackends(TornadoCoreRuntime.java:167)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.<init>(TornadoCoreRuntime.java:105)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.<clinit>(TornadoCoreRuntime.java:79)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.meta.MetaDataUtils.resolveDevice(MetaDataUtils.java:41)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.meta.AbstractMetaData.getLogicDevice(AbstractMetaData.java:192)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.reuseDeviceBufferObject(TornadoTaskGraph.java:1030)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.lockObjectsInMemory(TornadoTaskGraph.java:1041)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.transferToDevice(TornadoTaskGraph.java:944)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TaskGraph.transferToDevice(TaskGraph.java:704)
	at tornado.examples@1.0.4-dev/uk.ac.manchester.tornado.examples.VectorAddInt.main(VectorAddInt.java:56)

I am using NVIDIA CUDA Driver: 550.67 and CUDA 12.3

jjfumero · 2024-04-16T08:36:27Z

Is there anything else I should enable to get the profiling?

jjfumero · 2024-04-16T08:37:07Z

Actually, the error is related to the JNI invocation call.

Exception in thread "main" java.lang.UnsatisfiedLinkError: 'long uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit()'
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit(Native Method)

jjfumero · 2024-04-16T08:40:58Z

If I compile with PTX only, it works.

{
    "s0": {
        "TOTAL_TASK_GRAPH_TIME": "380085",
        "TOTAL_DISPATCH_KERNEL_TIME": "0",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "0",
        "COPY_OUT_TIME": "53952",
        "COPY_IN_TIME": "3584",
        "TOTAL_KERNEL_TIME": "6943",
        "TOTAL_COPY_OUT_SIZE_BYTES": "400024",
        "s0.t0": {
            "BACKEND": "PTX",
            "METHOD": "VectorAddInt.vectorAdd",
            "DEVICE_ID": "0:0",
            "DEVICE": "NVIDIA GeForce RTX 3070",
            "TOTAL_COPY_IN_SIZE_BYTES": "24",
            "TASK_KERNEL_TIME": "6943",
            "POWER_USAGE_mW": "48444"      << 
        }
    }
}

stratika · 2024-04-16T09:04:35Z

Actually, the error is related to the JNI invocation call.

Exception in thread "main" java.lang.UnsatisfiedLinkError: 'long uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit()'
at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit(Native Method)

This does not seem to be a problem with the CUDA version. It may be OS-specific. Can you let me know which OS you have? I tried the PR with Ubuntu 23.10 and macOS Sonoma 14.4.

jjfumero · 2024-04-16T09:06:01Z

LSB Version:	n/a
Distributor ID:	Fedora
Description:	Fedora Linux 39 (Workstation Edition)
Release:	39
Codename:	n/a

Kernel:

Linux xps8950 6.7.11-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 27 16:50:39 UTC 2024 x86_64 GNU/Linux

mikepapadim

for me it works:

Result is correct. Total time: 612294 (ns)
{
    "s0": {
        "POWER_USAGE_mW": "48183",
        "TOTAL_DISPATCH_KERNEL_TIME": "4096",
        "COPY_IN_TIME": "42816",
        "COPY_OUT_TIME": "19936",
        "TOTAL_KERNEL_TIME": "7168",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "72160",
        "TOTAL_TASK_GRAPH_TIME": "374417",
        "TOTAL_COPY_IN_SIZE_BYTES": "800048",
        "TOTAL_COPY_OUT_SIZE_BYTES": "400024",
        "s0.t0": {
            "BACKEND": "OPENCL",
            "METHOD": "VectorAddInt.vectorAdd",
            "DEVICE_ID": "0:0",
            "DEVICE": "NVIDIA GeForce RTX 3070",
            "TOTAL_COPY_IN_SIZE_BYTES": "24",
            "POWER_USAGE_mW": "48183",
            "TASK_KERNEL_TIME": "7168"
        }
    }
}

OS & Kernel :

Linux pop-os 6.8.0-76060800daily20240311-generic #202403110203~1711393930~22.04~331756a SMP PREEMPT_DYNAMIC Mon M x86_64 x86_64 x86_64 GNU/Linux

NVIDIA version:

| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |

mikepapadim · 2024-04-16T09:25:05Z

Is there anything else I should enable to get the profiling?

For me it worked without doing any extra steps

mikepapadim

LGTM, thanks

OpenCL & PTX work for:

Kernel :

Linux pop-os 6.8.0-76060800daily20240311-generic #202403110203~1711393930~22.04~331756a SMP PREEMPT_DYNAMIC Mon M x86_64 x86_64 x86_64 GNU/Linux

NVIDIA version:

| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |

stratika · 2024-04-16T10:10:50Z

LSB Version:	n/a
Distributor ID:	Fedora
Description:	Fedora Linux 39 (Workstation Edition)
Release:	39
Codename:	n/a

Kernel:

Linux xps8950 6.7.11-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 27 16:50:39 UTC 2024 x86_64 GNU/Linux

After a sync, it seems that the error occurs because, on the failing system, CUDA is installed manually and therefore libnvidia-ml.so is not detected at runtime. We need to think of a patch for CMakeLists.txt. I will post any updates to be tested in this PR. I am switching it to draft mode.
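
Independently of the build-side fix, a missing native symbol can also be absorbed on the Java side. The following is only a sketch of that idea, not what this PR implements: `NativePowerGuard` and `readOrNotAvailable` are hypothetical names, and a `LongSupplier` stands in for the JNI-backed call so the example runs without NVML.

```java
import java.util.function.LongSupplier;

final class NativePowerGuard {
    // Calls a native-backed reader and maps a missing JNI symbol or library
    // (UnsatisfiedLinkError) to "n/a" instead of aborting the run.
    static String readOrNotAvailable(LongSupplier nativeReader) {
        try {
            return Long.toString(nativeReader.getAsLong());
        } catch (UnsatisfiedLinkError e) {
            return "n/a";
        }
    }

    public static void main(String[] args) {
        // Simulated successful reading in milliwatts.
        System.out.println(readOrNotAvailable(() -> 48444L));

        // Simulated failure: the native method could not be linked.
        System.out.println(readOrNotAvailable(() -> {
            throw new UnsatisfiedLinkError("nvmlInit");
        }));
    }
}
```

Catching the error this way hides a broken build rather than repairing it, so it would complement, not replace, a CMake fix.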

mikepapadim · 2024-04-17T09:46:06Z

As this is self-contained functionality, does it make sense to port it to FFM as an exercise?

stratika · 2024-04-17T10:19:45Z

My plan is to test it on Windows, and then we can proceed with merging the PR. I think it makes sense because there are just 4 files in the native part to be tested with the FFI API.

jjfumero · 2024-04-17T10:26:33Z

What is the purpose of FFI here? The TornadoVM native code that interacts with the driver would then use two different approaches. That's the main reason we unified all native code in C++ using the same style across backends, so it is easier to maintain and debug. IMO, if we transition to FFI, it should be for all backends:

OpenCL
CUDA
Level Zero

stratika · 2024-04-17T10:31:18Z

What is the purpose of FFI here? The TornadoVM native code that interacts with the driver would then use two different approaches. That's the main reason we unified all native code in C++ using the same style across backends, so it is easier to maintain and debug. IMO, if we transition to FFI, it should be for all backends:

OpenCL

CUDA

Level Zero

Yes, I think we agree on that. My understanding is that @mikepapadim would like to test it. Anyway, we can move that to a separate proposal discussion.

stratika · 2024-04-19T07:27:12Z

The PR is ready for review and testing. It should also work on Windows (native installation).

jjfumero

LGTM. There is an issue with the Level Zero backend and the profiler, but that is a separate issue unrelated to this PR. I will open a new one.

Improvements
~~~~~~~~~~~~~~~~~~

- [beehive-lab#369](beehive-lab#369): Introduction of Tensor types in TornadoVM API and interoperability with ONNX Runtime.
- [beehive-lab#370](beehive-lab#370): Array concatenation operation for TornadoVM native arrays.
- [beehive-lab#371](beehive-lab#371): TornadoVM installer script ported for Windows 10/11.
- [beehive-lab#372](beehive-lab#372): Add support for ``HalfFloat`` (``Float16``) in vector types.
- [beehive-lab#374](beehive-lab#374): Support for TornadoVM array concatenations from the constructor-level.
- [beehive-lab#375](beehive-lab#375): Support for TornadoVM native arrays using slices from the Panama API.
- [beehive-lab#376](beehive-lab#376): Support for lazy copy-outs in the batch processing mode.
- [beehive-lab#377](beehive-lab#377): Expand the TornadoVM profiler with power metrics for NVIDIA GPUs (OpenCL and PTX backends).
- [beehive-lab#384](beehive-lab#384): Auto-closable Execution Plans for automatic memory management.

Compatibility
~~~~~~~~~~~~~~~~~~

- [beehive-lab#386](beehive-lab#386): OpenJDK 17 support removed.
- [beehive-lab#390](beehive-lab#390): SapMachine OpenJDK 21 supported.
- [beehive-lab#395](beehive-lab#395): OpenJDK 22 and GraalVM 22.0.1 supported.
- TornadoVM tested with Apple M3 chips.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- [beehive-lab#367](beehive-lab#367): Fix for Graal/Truffle languages in which some Java modules were not visible.
- [beehive-lab#373](beehive-lab#373): Fix for data copies of the ``HalfFloat`` types for all backends.
- [beehive-lab#378](beehive-lab#378): Fix free memory markers when running multi-thread execution plans.
- [beehive-lab#379](beehive-lab#379): Refactoring package of vector api unit-tests.
- [beehive-lab#380](beehive-lab#380): Fix event list sizes to accommodate profiling of large applications.
- [beehive-lab#385](beehive-lab#385): Fix code check style.
- [beehive-lab#387](beehive-lab#387): Fix TornadoVM internal events in OpenCL, SPIR-V and PTX for running multi-threaded execution plans.
- [beehive-lab#388](beehive-lab#388): Fix of expected and actual values of tests.
- [beehive-lab#392](beehive-lab#392): Fix installer for using existing JDKs.
- [beehive-lab#389](beehive-lab#389): Fix ``DataObjectState`` for multi-thread execution plans.
- [beehive-lab#396](beehive-lab#396): Fix JNI code for the CUDA NVML library access with OpenCL.

stratika added 8 commits February 8, 2024 12:27

[wip] Add power usage metric in profiler

488ba70

[wip] Add power usage function in OpenCL device context

29147f2

[feat] Invoke power usage jni function in OpenCL

7e0587d

Merge branch 'develop' into feat/nvidia-power

d792ca6

[feat] Add native functions for nvml in OpenCL driver

a2dbc61

[feat] Add native functions for nvml in PTX driver and link those fun…

4012da7

…ctions with the profiler via the PTXDeviceContext

Resolve conflicts

14ffa00

[fix] Added condition to build nvml for OpenCL only if it is availabl…

d9886ea

…e in the system

stratika commented Apr 15, 2024

...drivers/opencl/src/main/java/uk/ac/manchester/tornado/drivers/opencl/OCLKernelScheduler.java (outdated)

stratika requested review from jjfumero and mikepapadim April 15, 2024 14:27

stratika self-assigned this Apr 15, 2024

stratika added the enhancement and feature labels Apr 15, 2024

[fix] Add condition to apply NVML if device supports it

dec3480

mikepapadim reviewed Apr 16, 2024

mikepapadim approved these changes Apr 16, 2024

stratika marked this pull request as draft April 16, 2024 10:11

stratika added 4 commits April 16, 2024 13:27

[revert] Withdraw condition for applying NVML if device supports it

208bc17

[feat] Display power usage if NVML is supported otherwise n/a

d1f04d7

[style] Applied code formatter

09331f8

[wip] Try resolving issue with manual cuda installation

bc5e20d

stratika added 2 commits April 16, 2024 17:58

[feat] Redesign classes to move abstract power class in drivers commo…

1dd6f7c

…n and extend it in the backends

[refactor] Refactored jni implementations of nvml to be hierarchical …

a86aa55

…based on an abstract class (PowerMetric) that can be extended in each backend opencl and ptx

stratika added 3 commits April 18, 2024 21:36

[fix] Update cmake script to enable nvml link for windows

16335be

[fix] Update cmake script to enable nvml link for windows in PTX

d864600

Merge branch 'develop' into feat/nvidia-power

5c1d627

stratika marked this pull request as ready for review April 19, 2024 07:23

stratika added 2 commits April 22, 2024 13:31

[refactor] Changed abstract class of PowerMetric to an interface

7e64175

Merge branch 'develop' into feat/nvidia-power

45f25af

jjfumero approved these changes Apr 22, 2024

jjfumero merged commit 4e183ab into beehive-lab:develop Apr 22, 2024
2 checks passed

stratika deleted the feat/nvidia-power branch April 22, 2024 12:40

jjfumero mentioned this pull request Apr 30, 2024

[release] TornadoVM v1.0.4 #398

Merged
