[feat] Expand TornadoVM profiler with GPU power metrics from NVIDIA NVML API #377

Merged (20 commits) on Apr 22, 2024

stratika
Collaborator

@stratika stratika commented Apr 15, 2024

Description

This PR adds power consumption as a new metric in the TornadoVM profiler. To that end, it invokes the NVIDIA NVML API from the JNI part of both the OpenCL and PTX drivers.

I updated the PR with a hierarchical design:
-> drivers-common: new package (power) that contains a new interface, PowerMetric.java.
-> drivers-opencl: new package (power) that provides two implementations of the new interface (OCLNvidiaPowerMetric.java, OCLEmptyPowerMetric.java). OCLNvidiaPowerMetric.java contains the JNI methods that point to the NVML functions.
-> drivers-ptx: new package (power) that provides one implementation of the new interface (PTXNvidiaPowerMetric.java). PTXNvidiaPowerMetric.java contains the JNI methods that point to the NVML functions.
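
For readers unfamiliar with this kind of layering, the design above can be sketched roughly as follows. This is only an illustration: the method names, the `-1` convention, and the `StubNvidiaPowerMetric` and `PowerMetricFormatter` classes are assumptions made for the sketch, not the signatures used in the PR. Only the idea of one interface with an NVIDIA-backed and an "empty" implementation comes from the description.

```java
// Sketch only: method names and return conventions are assumptions, not the PR's actual API.
interface PowerMetric {
    /** Prepares the underlying power library; returns true if readings are available. */
    boolean initialize();

    /** Returns the current power draw in milliwatts, or -1 if unsupported. */
    long readPowerUsageMilliWatts();
}

/** Fallback for devices without NVML support (plays the role of OCLEmptyPowerMetric). */
final class EmptyPowerMetric implements PowerMetric {
    @Override public boolean initialize() { return false; }
    @Override public long readPowerUsageMilliWatts() { return -1L; }
}

/** Hypothetical stand-in for an NVML-backed metric; a real one would call native code via JNI. */
final class StubNvidiaPowerMetric implements PowerMetric {
    private final long fixedReading;
    StubNvidiaPowerMetric(long fixedReading) { this.fixedReading = fixedReading; }
    @Override public boolean initialize() { return true; }
    @Override public long readPowerUsageMilliWatts() { return fixedReading; }
}

/** Hypothetical helper that turns a reading into the string stored in the profiler. */
final class PowerMetricFormatter {
    private PowerMetricFormatter() {}

    /** Returns the reading as a decimal string, or "n/a" when no reading is available. */
    static String format(PowerMetric metric) {
        if (!metric.initialize()) {
            return "n/a";
        }
        long milliWatts = metric.readPowerUsageMilliWatts();
        return milliWatts < 0 ? "n/a" : Long.toString(milliWatts);
    }
}
```

The point of the empty implementation is that callers never need to check which vendor they are talking to; they ask the metric and get either a number or "n/a".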

See the OCLNvidiaPowerMetric.cpp and PTXNvidiaPowerMetric.cpp files, which contain similar code for:

  • i) initialising NVML for all GPU devices in the system,
  • ii) getting the device handle for a particular index,
  • iii) querying the power usage of a device handle.

I have modified the CMakeLists.txt file of the OpenCL backend to build the OCLNvidiaPowerMetric.cpp file only if nvml.h and the nvidia-ml library are available on the system. For PTX, this check is not necessary, as the NVML API is available whenever CUDA is installed (i.e., NVML is part of the NVIDIA GPU Deployment Kit).

The queried result is added in the TornadoVM profiler as a metric:

{
    "s0": {
        "TOTAL_KERNEL_TIME": "15328",
        "COPY_OUT_TIME": "65792",
        "TOTAL_TASK_GRAPH_TIME": "649996",
        "COPY_IN_TIME": "3680",
        "TOTAL_DISPATCH_KERNEL_TIME": "0",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "0",
        "TOTAL_COPY_OUT_SIZE_BYTES": "400024",
        "s0.t0": {
            "BACKEND": "PTX",
            "METHOD": "VectorAddInt.vectorAdd",
            "DEVICE_ID": "0:0",
            "DEVICE": "NVIDIA RTX A2000 8GB Laptop GPU",
            "TOTAL_COPY_IN_SIZE_BYTES": "24",
            "POWER_USAGE_mW": "9229",              # <--- This line is new, if the device is not NVIDIA "n/a" is returned.
            "TASK_KERNEL_TIME": "15328"
        }
    }
}
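
As a rough illustration of how the new field might be consumed, the sketch below combines `POWER_USAGE_mW` with `TASK_KERNEL_TIME` to estimate the energy of a kernel. It assumes that the kernel time is reported in nanoseconds and that the single power sample is representative of the whole kernel, which may not hold for short kernels; the `EnergyEstimate` class is hypothetical and not part of TornadoVM.

```java
// Sketch only: assumes TASK_KERNEL_TIME is in nanoseconds and that one power
// sample is representative of the whole kernel execution.
final class EnergyEstimate {
    private EnergyEstimate() {}

    /**
     * Estimates energy in microjoules from the profiler's string fields.
     * Returns an empty result when the power field is "n/a" (non-NVIDIA device).
     */
    static java.util.OptionalDouble microJoules(String powerUsageMilliWatts, String taskKernelTimeNs) {
        if ("n/a".equals(powerUsageMilliWatts)) {
            return java.util.OptionalDouble.empty();
        }
        long milliWatts = Long.parseLong(powerUsageMilliWatts);
        long nanoSeconds = Long.parseLong(taskKernelTimeNs);
        // mW * ns = 1e-12 J, so dividing by 1e6 gives microjoules (1e-6 J).
        return java.util.OptionalDouble.of((double) milliWatts * nanoSeconds / 1_000_000.0);
    }
}
```

With the values shown above (9229 mW and 15328 ns), `EnergyEstimate.microJoules("9229", "15328")` gives roughly 141 microjoules. Treat this as an order-of-magnitude figure rather than a measurement.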

Note: We observed that the build process may fail if the NVIDIA CUDA Toolkit is manually installed in a directory that is not the default one. Two points in the CMakeLists.txt of opencl-jni have been updated to include a custom directory:

  • For nvml.h, the script will search in the default location (e.g., /usr/include, /usr/local/include/) and we added as an option /usr/local/cuda/targets/x86_64-linux/include.
  • For libnvidia-ml.so, the script will search in the default location (e.g., /usr/lib/x86_64-linux-gnu, /usr/local/lib/) and we added as an option /usr/local/cuda/targets/x86_64-linux/lib/stubs

Backend/s tested

Mark the backends affected by this PR.

  • OpenCL
  • PTX
  • SPIRV

OS tested

Mark the OS where this PR is tested.

  • Linux
  • OSx
  • Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

  • Yes
  • No

How to test the new patch?

To test, you can run:

make BACKEND=opencl
tornado --enableProfiler console -m tornado.examples/uk.ac.manchester.tornado.examples.VectorAddInt --params="100000"

make BACKEND=ptx
tornado --enableProfiler console -m tornado.examples/uk.ac.manchester.tornado.examples.VectorAddInt --params="100000"

To test in Windows:

.\bin\tornadovm-installer.cmd --backend=opencl
python %TORNADO_SDK%\bin\tornado -ea  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True "  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.profiler.TestProfiler"

.\bin\tornadovm-installer.cmd --backend=ptx
python %TORNADO_SDK%\bin\tornado -ea  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=True "  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.profiler.TestProfiler"

@stratika stratika self-assigned this Apr 15, 2024
@stratika stratika added enhancement New feature or request feature New feature proposal labels Apr 15, 2024
@jjfumero
Member

Running the OpenCL backend:

tornado --enableProfiler console -m tornado.examples/uk.ac.manchester.tornado.examples.VectorAddInt 100000
WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "main" java.lang.UnsatisfiedLinkError: 'long uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit()'
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit(Native Method)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.clNvmlInit(OCLNvml.java:52)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.<init>(OCLNvml.java:36)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLDeviceContext.<init>(OCLDeviceContext.java:71)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLContext.createDeviceContext(OCLContext.java:209)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLContext.createDeviceContext(OCLContext.java:42)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.graal.OCLHotSpotBackendFactory.createJITCompiler(OCLHotSpotBackendFactory.java:95)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.createOCLJITCompiler(OCLBackendImpl.java:204)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.installDevices(OCLBackendImpl.java:218)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.lambda$discoverDevices$4(OCLBackendImpl.java:225)
	at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
	at java.base/java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:617)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.discoverDevices(OCLBackendImpl.java:223)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLBackendImpl.<init>(OCLBackendImpl.java:76)
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLTornadoDriverProvider.createBackend(OCLTornadoDriverProvider.java:48)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.loadBackends(TornadoCoreRuntime.java:167)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.<init>(TornadoCoreRuntime.java:105)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.TornadoCoreRuntime.<clinit>(TornadoCoreRuntime.java:79)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.meta.MetaDataUtils.resolveDevice(MetaDataUtils.java:41)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.meta.AbstractMetaData.getLogicDevice(AbstractMetaData.java:192)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.reuseDeviceBufferObject(TornadoTaskGraph.java:1030)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.lockObjectsInMemory(TornadoTaskGraph.java:1041)
	at tornado.runtime@1.0.4-dev/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph.transferToDevice(TornadoTaskGraph.java:944)
	at tornado.api@1.0.4-dev/uk.ac.manchester.tornado.api.TaskGraph.transferToDevice(TaskGraph.java:704)
	at tornado.examples@1.0.4-dev/uk.ac.manchester.tornado.examples.VectorAddInt.main(VectorAddInt.java:56)

I am using NVIDIA CUDA Driver: 550.67 and CUDA 12.3

@jjfumero
Member

Is there anything else I should enable to get the profiling?

@jjfumero
Member

Actually, the error is related to the JNI invocation call.

Exception in thread "main" java.lang.UnsatisfiedLinkError: 'long uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit()'
	at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit(Native Method)

@jjfumero
Member

If I compile with PTX only, it works.

{
    "s0": {
        "TOTAL_TASK_GRAPH_TIME": "380085",
        "TOTAL_DISPATCH_KERNEL_TIME": "0",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "0",
        "COPY_OUT_TIME": "53952",
        "COPY_IN_TIME": "3584",
        "TOTAL_KERNEL_TIME": "6943",
        "TOTAL_COPY_OUT_SIZE_BYTES": "400024",
        "s0.t0": {
            "BACKEND": "PTX",
            "METHOD": "VectorAddInt.vectorAdd",
            "DEVICE_ID": "0:0",
            "DEVICE": "NVIDIA GeForce RTX 3070",
            "TOTAL_COPY_IN_SIZE_BYTES": "24",
            "TASK_KERNEL_TIME": "6943",
            "POWER_USAGE_mW": "48444"      << 
        }
    }
}

@stratika
Collaborator Author

stratika commented Apr 16, 2024

Actually, the error is related to the JNI invocation call.

Exception in thread "main" java.lang.UnsatisfiedLinkError: 'long uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit()'
at tornado.drivers.opencl@1.0.4-dev/uk.ac.manchester.tornado.drivers.opencl.OCLNvml.nvmlInit(Native Method)

This does not seem to be a problem with the CUDA version. It may be OS-specific. Can you let me know which OS you have? I tried the PR with Ubuntu 23.10 and macOS Sonoma 14.4.

@jjfumero
Member

LSB Version:	n/a
Distributor ID:	Fedora
Description:	Fedora Linux 39 (Workstation Edition)
Release:	39
Codename:	n/a

Kernel:

Linux xps8950 6.7.11-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 27 16:50:39 UTC 2024 x86_64 GNU/Linux

Member

@mikepapadim mikepapadim left a comment

For me it works:

Result is correct. Total time: 612294 (ns)
{
    "s0": {
        "POWER_USAGE_mW": "48183",
        "TOTAL_DISPATCH_KERNEL_TIME": "4096",
        "COPY_IN_TIME": "42816",
        "COPY_OUT_TIME": "19936",
        "TOTAL_KERNEL_TIME": "7168",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "72160",
        "TOTAL_TASK_GRAPH_TIME": "374417",
        "TOTAL_COPY_IN_SIZE_BYTES": "800048",
        "TOTAL_COPY_OUT_SIZE_BYTES": "400024",
        "s0.t0": {
            "BACKEND": "OPENCL",
            "METHOD": "VectorAddInt.vectorAdd",
            "DEVICE_ID": "0:0",
            "DEVICE": "NVIDIA GeForce RTX 3070",
            "TOTAL_COPY_IN_SIZE_BYTES": "24",
            "POWER_USAGE_mW": "48183",
            "TASK_KERNEL_TIME": "7168"
        }
    }
}

OS & Kernel :

Linux pop-os 6.8.0-76060800daily20240311-generic #202403110203~1711393930~22.04~331756a SMP PREEMPT_DYNAMIC Mon M x86_64 x86_64 x86_64 GNU/Linux

NVIDIA version:

| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |

@mikepapadim
Member

Is there anything else I should enable to get the profiling?

For me it worked without any extra steps.

Member

@mikepapadim mikepapadim left a comment

LGTM, thanks

OpenCL & PTX works for:

Kernel :

Linux pop-os 6.8.0-76060800daily20240311-generic #202403110203~1711393930~22.04~331756a SMP PREEMPT_DYNAMIC Mon M x86_64 x86_64 x86_64 GNU/Linux

NVIDIA version:

| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |

@stratika
Collaborator Author

LSB Version:	n/a
Distributor ID:	Fedora
Description:	Fedora Linux 39 (Workstation Edition)
Release:	39
Codename:	n/a

Kernel:

Linux xps8950 6.7.11-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 27 16:50:39 UTC 2024 x86_64 GNU/Linux

After a sync, it seems that the error occurs because, on the failing system, CUDA is installed manually and therefore libnvidia-ml.so is not detected. We need to think of a patch for CMakeLists.txt. I will post any updates to be tested in this PR. I am switching it to draft mode.
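
Independently of the CMake fix, a Java-side guard is one possible way to keep a missing native symbol from aborting start-up. The sketch below is an assumption-laden illustration, not what the PR does: `NvmlProbe` and its native method are hypothetical, and no native library is loaded, so the call deliberately fails to link in the same way as the stack trace above.

```java
// Sketch only: NvmlProbe is hypothetical. No native library is loaded here,
// so the native call cannot be linked and throws UnsatisfiedLinkError.
final class NvmlProbe {
    private NvmlProbe() {}

    // Hypothetical JNI entry point; in a real build it would be provided by the driver's native library.
    private static native long nvmlInit();

    /** Returns true only if the native NVML entry point could be linked and called. */
    static boolean isNativePowerMetricAvailable() {
        try {
            nvmlInit();
            return true;
        } catch (UnsatisfiedLinkError e) {
            // The native code was not built (or not loaded); callers can fall back to an empty metric.
            return false;
        }
    }
}
```

A caller could use the result to choose between an NVML-backed metric and an empty one that reports "n/a", instead of failing during device discovery.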

@stratika stratika marked this pull request as draft April 16, 2024 10:11
…based on an abstract class (PowerMetric) that can be extended in each backend opencl and ptx
@mikepapadim
Member

As this is self-contained functionality, does it make sense to port it to FFM as an exercise?

@stratika
Collaborator Author

My plan is to test it on Windows, and then we can proceed with merging the PR. I think the port makes sense because there are just 4 files in the native part to be tested with the FFI API.

@jjfumero
Member

What is the purpose of FFI here? The TornadoVM native code that interacts with the driver would then use two different approaches. That's the main reason we unified all native code in C++ using the same style across backends, so it is easier to maintain and debug. IMO, if we transition to FFI, it should be for all backends:

  • OpenCL
  • CUDA
  • Level Zero

@stratika
Collaborator Author

What is the purpose of FFI here? The TornadoVM native code that interacts with the driver would then use two different approaches. That's the main reason we unified all native code in C++ using the same style across backends, so it is easier to maintain and debug. IMO, if we transition to FFI, it should be for all backends:

  • OpenCL
  • CUDA
  • Level Zero

Yes, I think we agree on that. My understanding is that @mikepapadim would like to test it. Anyway, we can move that topic to a proposal discussion.

@stratika stratika marked this pull request as ready for review April 19, 2024 07:23
@stratika
Collaborator Author

The PR is ready for review and for running the tests. It should also work on Windows (native installation).

Member

@jjfumero jjfumero left a comment

LGTM. There is an issue with the Level Zero backend and the profiler, but that is a separate issue not related to this PR. I will open a new one.

@jjfumero jjfumero merged commit 4e183ab into beehive-lab:develop Apr 22, 2024
2 checks passed
@stratika stratika deleted the feat/nvidia-power branch April 22, 2024 12:40
jjfumero added a commit to jjfumero/TornadoVM that referenced this pull request Apr 30, 2024
Improvements
~~~~~~~~~~~~~~~~~~

- [beehive-lab#369](beehive-lab#369): Introduction of Tensor types in TornadoVM API and interoperability with ONNX Runtime.
- [beehive-lab#370](beehive-lab#370): Array concatenation operation for TornadoVM native arrays.
- [beehive-lab#371](beehive-lab#371): TornadoVM installer script ported for Windows 10/11.
- [beehive-lab#372](beehive-lab#372): Add support for ``HalfFloat`` (``Float16``) in vector types.
- [beehive-lab#374](beehive-lab#374): Support for TornadoVM array concatenations from the constructor-level.
- [beehive-lab#375](beehive-lab#375): Support for TornadoVM native arrays using slices from the Panama API.
- [beehive-lab#376](beehive-lab#376): Support for lazy copy-outs in the batch processing mode.
- [beehive-lab#377](beehive-lab#377): Expand the TornadoVM profiler with power metrics for NVIDIA GPUs (OpenCL and PTX backends).
- [beehive-lab#384](beehive-lab#384): Auto-closable Execution Plans for automatic memory management.

Compatibility
~~~~~~~~~~~~~~~~~~

- [beehive-lab#386](beehive-lab#386): OpenJDK 17 support removed.
- [beehive-lab#390](beehive-lab#390): SapMachine OpenJDK 21 supported.
- [beehive-lab#395](beehive-lab#395): OpenJDK 22 and GraalVM 22.0.1 supported.
- TornadoVM tested with Apple M3 chips.

Bug Fixes
~~~~~~~~~~~~~~~~~~

- [beehive-lab#367](beehive-lab#367): Fix for Graal/Truffle languages in which some Java modules were not visible.
- [beehive-lab#373](beehive-lab#373): Fix for data copies of the ``HalfFloat`` types for all backends.
- [beehive-lab#378](beehive-lab#378): Fix free memory markers when running multi-thread execution plans.
- [beehive-lab#379](beehive-lab#379): Refactoring package of vector api unit-tests.
- [beehive-lab#380](beehive-lab#380): Fix event list sizes to accommodate profiling of large applications.
- [beehive-lab#385](beehive-lab#385): Fix code check style.
- [beehive-lab#387](beehive-lab#387): Fix TornadoVM internal events in OpenCL, SPIR-V and PTX for running multi-threaded execution plans.
- [beehive-lab#388](beehive-lab#388): Fix of expected and actual values of tests.
- [beehive-lab#392](beehive-lab#392): Fix installer for using existing JDKs.
- [beehive-lab#389](beehive-lab#389): Fix ``DataObjectState`` for multi-thread execution plans.
- [beehive-lab#396](beehive-lab#396): Fix JNI code for the CUDA NVML library access with OpenCL.