compiler: Make nbytes available mapper aware of visible devices environment variables#2746

Merged
FabioLuporini merged 11 commits into main from deviceid on Oct 6, 2025
Conversation

EdCaunt (Contributor) commented Sep 26, 2025

Still needs tests.

FabioLuporini (Contributor) left a comment:

minor comments, but looks fine

```python
def visible_devices(self):
    device_vars = (
        'CUDA_VISIBLE_DEVICES',
        'ROCR_VISIBLE_DEVICES',
```
Contributor:

ROCR or ROCM?


Contributor:

jeez 😂 ...

OK, look, can you add ROCM_VISIBLE_DEVICES too? They might add it in the future...

Contributor Author:

It's not currently an option, so I don't think that's a good idea. If the user accidentally sets it, they may get unexpected, hard-to-debug behaviour: Devito will act as if visible devices have been set, but the ROCm runtime will not.

Contributor:

btw this could be a util inside arch/archinfo rather than a private method since AFAICT it's totally unrelated to self itself
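For illustration, such a free-standing util might look like the following sketch. This is not the PR's code: the module-level tuple and function names are hypothetical, and the HIP variable is assumed from the env vars exercised elsewhere in this PR.

```python
import os

# Env vars that restrict device visibility, in precedence order.
# 'HIP_VISIBLE_DEVICES' is an assumption based on the tests in this PR.
DEVICE_VARS = ('CUDA_VISIBLE_DEVICES', 'ROCR_VISIBLE_DEVICES',
               'HIP_VISIBLE_DEVICES')


def visible_devices(environ=None):
    """Return the visible device ids as a tuple of ints, or None if no
    (parseable) restriction is in effect."""
    environ = os.environ if environ is None else environ
    for v in DEVICE_VARS:
        try:
            return tuple(int(i) for i in environ[v].split(','))
        except (ValueError, KeyError):
            # KeyError: var not set; ValueError: e.g. UUID entries
            continue
    return None
```

Taking `environ` as a parameter (defaulting to `os.environ`) keeps the util easy to test without mutating the process environment.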

codecov bot commented Sep 26, 2025

Codecov Report

❌ Patch coverage is 72.34043% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.06%. Comparing base (c207ded) to head (35304b9).
⚠️ Report is 12 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tests/test_gpu_common.py | 59.18% | 20 Missing ⚠️ |
| devito/operator/operator.py | 78.57% | 2 Missing and 1 partial ⚠️ |
| devito/parameters.py | 90.00% | 1 Missing and 1 partial ⚠️ |
| devito/arch/archinfo.py | 90.90% | 1 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2746      +/-   ##
==========================================
- Coverage   83.06%   83.06%   -0.01%
==========================================
  Files         248      248
  Lines       50356    50438      +82
  Branches     4432     4437       +5
==========================================
+ Hits        41830    41897      +67
- Misses       7768     7782      +14
- Partials      758      759       +1
```
| Flag | Coverage Δ |
|---|---|
| pytest-gpu-aomp-amdgpuX | 68.74% <72.34%> (-0.01%) ⬇️ |
| pytest-gpu-nvc-nvidiaX | 69.29% <72.34%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

```python
op = Operator(eq)

argmap = op.arguments()
# deviceid should see the world from within CUDA_VISIBLE_DEVICES
```
Contributor:

I don't think that is the wanted behavior. DEVITO_DEVICEID is the id of the device, not the index within the visible devices, so this isn't compatible with what the configuration does and might lead to problems for users currently using deviceid.

Contributor Author:

Just checked current main, and this is the existing behaviour of DEVITO_DEVICEID in combination with CUDA_VISIBLE_DEVICES. If you set CUDA_VISIBLE_DEVICES="1,2" and DEVITO_DEVICEID=1, the kernel will run on device 2, due to how devices appear to CUDA programs inside an environment with CUDA_VISIBLE_DEVICES set. Essentially, for anything other than nvidia-smi, the visible devices appear renumbered from zero. See here for why this is the case.

As to whether this is the wanted behaviour, I would argue it is. Consider a scheduler which runs a job with two available GPUs out of a total four. This is presumably achieved under the hood with CUDA_VISIBLE_DEVICES. In that job a Devito script setting deviceid=0 is used. The intuitive behaviour of that script would be to use the first device available to the job. This would be device 0 so far as the job is concerned, albeit not necessarily device 0 on the whole node.
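The renumbering described above can be illustrated with a toy mapping, no CUDA required (the helper name is hypothetical):

```python
def physical_device(logical_id, cuda_visible_devices):
    """Map a logical device id (what a CUDA program sees when
    CUDA_VISIBLE_DEVICES is set) to the physical id that nvidia-smi
    reports. Illustrative sketch of the renumbering only."""
    visible = [int(i) for i in cuda_visible_devices.split(',')]
    return visible[logical_id]

# With CUDA_VISIBLE_DEVICES="1,2", logical ids 0 and 1 correspond to
# physical devices 1 and 2 respectively.
```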

Contributor:

yes, I agree

```python
assert argmap1._physical_deviceid == 1

# Make sure the switchenv doesn't somehow persist
for i in ("CUDA", "ROCR", "HIP"):
```
Contributor:

These are likely to break on most systems. You need to fetch it before switchenv, then check it's reverted to the original one.

Contributor Author:

Good point. I think I should actually split this out into a specific test for the switchenv class. It was mainly in anticipation of a possible silent-failure route.
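A standalone test along the reviewer's lines could record the original values first and assert they are restored afterwards. The context manager below is a minimal stand-in for the PR's env-switching helper, not Devito's actual switchenv:

```python
import os
from contextlib import contextmanager


@contextmanager
def switchenv(**updates):
    """Minimal stand-in for an env-switching helper: apply the given
    environment overrides, then restore the originals on exit."""
    saved = {k: os.environ.get(k) for k in updates}
    os.environ.update(updates)
    try:
        yield
    finally:
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v


# The test pattern suggested: capture before, compare after
original = os.environ.get('CUDA_VISIBLE_DEVICES')
with switchenv(CUDA_VISIBLE_DEVICES='0,1'):
    assert os.environ['CUDA_VISIBLE_DEVICES'] == '0,1'
assert os.environ.get('CUDA_VISIBLE_DEVICES') == original
```

This works whether or not the variable was set beforehand, which is what makes it portable across systems.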

```python
rank = self.comm.Get_rank() if self.comm != MPI.COMM_NULL else 0

logical_deviceid = max(self.get('deviceid', 0), 0) + rank
if self._visible_devices is not None:
```
Contributor:

would be simpler with self._visible_devices.get(logical_deviceid, logical_deviceid) and just have _visible_devices return {}
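The suggested simplification, sketched with the visible-devices mapping as a plain dict argument (the function name is hypothetical):

```python
def physical_deviceid(logical_deviceid, visible_devices):
    """Map a logical id to a physical one. visible_devices maps
    logical -> physical ids and is {} when no restriction applies, so
    .get falls through to the logical id unchanged."""
    return visible_devices.get(logical_deviceid, logical_deviceid)
```

Returning `{}` from the mapping-provider removes the `is not None` branch entirely, which is the point of the reviewer's suggestion.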

```python
for v in device_vars:
    try:
        return tuple(int(i) for i in os.environ[v].split(','))
    except (ValueError, KeyError):
```
Contributor:

Should these be split?

  • ValueError -> there is an id so set it to os.environ[v].split(',').index(i)
  • KeyError -> no env var, no device

Contributor Author:

ValueError would be expected to coincide with device UUIDs set in CUDA_VISIBLE_DEVICES which aren't (and were not previously) parsed into integer IDs. This one should probably raise at least a warning to mention that the UUID is being ignored.

Alternatively, the UUID -> integer ID mapping could potentially be reverse-engineered from nvidia-smi somewhere in the device sniffing, but that may be overkill. Another approach would be to widen support for device UUIDs, but this, again, might be overkill.
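One way to split the two exceptions as discussed, warning when non-integer entries such as UUIDs are encountered (a sketch; the function name and warning text are hypothetical, not the PR's code):

```python
import warnings


def parse_visible_devices(environ, device_vars=('CUDA_VISIBLE_DEVICES',)):
    """Split the two failure modes: KeyError means the variable isn't
    set (try the next one); ValueError means unparseable entries such
    as device UUIDs, which are skipped with a warning."""
    for v in device_vars:
        try:
            raw = environ[v]
        except KeyError:
            continue  # Variable not set; fall through to the next one
        try:
            return tuple(int(i) for i in raw.split(','))
        except ValueError:
            warnings.warn(f"Non-integer entries in {v} (e.g. UUIDs) "
                          "are ignored")
            return None
    return None
```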

```python
# Get the physical device ID (as CUDA_VISIBLE_DEVICES may be set)
rank = self.comm.Get_rank() if self.comm != MPI.COMM_NULL else 0

logical_deviceid = max(self.get('deviceid', 0), 0) + rank
```
Contributor:

Wait, lots of users already pass a deviceid per rank; this is going to turn it into an id that doesn't exist.

Contributor Author:

Ah, hmm, should it be:

```python
logical_deviceid = max(self.get('deviceid', rank), 0)
```

then?

Currently if you just leave it as the default value, it checks available memory on the first device for every rank.

Contributor:

I think it should be self.get('deviceid', max(rank, 0)), keeping user input untouched.

Contributor Author:

I think desired behaviour would be:

  • If user sets deviceid=-1, use the MPI rank
  • If the user sets a nonnegative integer deviceid, use it with no modification
  • If default value obtained, or no deviceid found, use the MPI rank

So:

```python
logical_deviceid = self.get('deviceid', -1)
if logical_deviceid < 0:
    logical_deviceid = rank
```
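Those three rules can be sketched as a small standalone helper (hypothetical name, not the PR's code):

```python
def resolve_logical_deviceid(deviceid, rank):
    """Resolve the logical device id: a negative or missing deviceid
    means 'use the MPI rank'; a nonnegative user-supplied id is kept
    untouched."""
    if deviceid is None or deviceid < 0:
        return rank
    return deviceid
```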

```python
def _physical_deviceid(self):
    if isinstance(self.platform, Device):
        # Get the physical device ID (as CUDA_VISIBLE_DEVICES may be set)
        rank = self.comm.Get_rank() if self.comm != MPI.COMM_NULL else 0
```
Contributor:

Doesn't Get_rank return zero for COMM_NULL?

Contributor Author:

It appears not:

```python
>>> MPI.COMM_NULL.Get_rank()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/mpi4py/MPI.src/Comm.pyx", line 110, in mpi4py.MPI.Comm.Get_rank
mpi4py.MPI.Exception: MPI_ERR_COMM: invalid communicator
```


FabioLuporini merged commit a04b3af into main on Oct 6, 2025.
36 checks passed.
FabioLuporini deleted the deviceid branch on October 6, 2025 at 07:55.
3 participants