Skip to content

[Runtime][Metal] add TVM_METAL_STORAGE_MODE env opt-in for Shared/Managed buffers#19504

Open
apstenku123 wants to merge 1 commit intoapache:mainfrom
apstenku123:cppmega/metal-shared-storage-opt-in
Open

[Runtime][Metal] add TVM_METAL_STORAGE_MODE env opt-in for Shared/Managed buffers#19504
apstenku123 wants to merge 1 commit intoapache:mainfrom
apstenku123:cppmega/metal-shared-storage-opt-in

Conversation

@apstenku123
Copy link
Copy Markdown

Summary

Adds an opt-in environment variable TVM_METAL_STORAGE_MODE that lets users allocate device data buffers as MTLResourceStorageModeShared (or Managed) instead of the default MTLResourceStorageModePrivate. Default behaviour is unchanged.

value mode semantics
unset / private MTLResourceStorageModePrivate default, GPU-only, preserves historical behaviour
shared MTLResourceStorageModeShared CPU+GPU mapped — required for zero-copy DLPack to MLX
managed MTLResourceStorageModeManaged macOS-only intermediate (driver tracks dirty pages)
anything else MTLResourceStorageModePrivate + warn safe fall-back

The env var is read once on first MetalWorkspace::AllocDataSpace and cached for the lifetime of the process; no per-allocation overhead. A new FFI helper metal.GetStorageMode is registered alongside the existing metal.GetProfileCounters / metal.ResetProfileCounters helpers so tests can verify the resolved mode without an ObjC bridge.

The staging-buffer pool (metal_common.h:383) and temp-buffer pool (metal_device_api.mm:374) already use MTLStorageModeShared and are intentionally untouched — they're host-staging by design and don't fall under the data-space allocator.

Why

TVM's Metal device API has always allocated MTLBuffer with MTLResourceStorageModePrivate. This is the right choice for pure-GPU workloads (no CPU page mapping), but it blocks zero-copy DLPack interop with other Metal-using frameworks that allocate Shared/Managed buffers — notably ml-explore/mlx, which uses MTLResourceStorageModeShared everywhere. Two allocators on the same MTLDevice produce buffers with different page-mapping semantics; DLPack capsules from TVM cannot be consumed by mx.array (live-tested: std::bad_cast on mx.array(tvm_metal_capsule)).

This change unblocks the bridge from TVM-NDArray to mlx.array (both wrap MTLBuffer; require matching storage mode for the same foreign capsule to be consumable). It is the producer half of a pair; the consumer half is a parallel ml-explore/mlx PR that adds mx.from_dlpack(obj).

Test plan

  • xcrun --sdk macosx clang++ -std=c++17 -framework Metal syntax_check.mm -o syntax_check && ./syntax_check — exercises env-var parsing for all 6 cases (unset, shared, mixed-case Shared, invalid, managed, private).
  • Build runtime: mkdir build && cd build && cmake -DUSE_METAL=ON -DUSE_LLVM=ON -DCMAKE_BUILD_TYPE=Release .. && make -j tvm_runtime
  • ./runtime_check (TVM-linked probe) — validates that the env var flows to a real MTLBuffer.storageMode. Live captured 2026-05-03 on Apple M4 Max for unset/shared/managed/private.
  • TVM_METAL_STORAGE_MODE=shared python -c "import tvm; arr = tvm.nd.empty((4,), dtype='float32', device=tvm.metal()); print(arr.shape)"
  • CI: macos-arm64 runner in apache/tvm should exercise the existing Metal tests; default behaviour (env unset) is unchanged.

Caveats / non-goals

  • This is a copy-elision interop patch, not a kernel-speed patch. Default Private mode remains the right choice for TVM-only workloads.
  • The patch artifact only changes src/runtime/metal/metal_device_api.mm; it does not yet add an upstream tests/python/runtime/... file. A subprocess-isolated Python test for the env-cache behaviour can be folded in if maintainers want it in tree.
  • Local Metal microbenchmarks on Apple M4 Max show Shared buffers remove the staging-buffer + blit/wait cost at CPU↔Metal transfer boundaries (e.g., 1 MiB CPU→Metal median 138.375 µs Private vs 12.750 µs Shared in a downstream probe). These numbers are local-health checks, not in-tree benchmarks.

Pairing

Paired upstream patch: ml-explore/mlx adds mx.from_dlpack(obj) Metal-aware consumer (filed in parallel). Both patches must land for the zero-copy MLX↔TVM use case to work end-to-end.

Attribution

Co-developed with cppmega.mlx for Apple-Silicon Metal interop with MLX.

…aged buffers

Replaces the historical hard-coded MTLResourceStorageModePrivate at
MetalWorkspace::AllocDataSpace with a function-local static cache
seeded from the env var. Default behavior unchanged (env unset =>
Private). Accepted values (case-insensitive): private | shared | managed.
Unknown values fall back to Private with a warning.

Also registers a metal.GetStorageMode FFI helper for parity tests.

Motivation: zero-copy DLPack interop with MLX (which uses
MTLResourceStorageModeShared everywhere). Two allocators on the same
MTLDevice cannot share an MTLBuffer with different page-mapping
semantics, so DLPack capsules cross-imported between TVM-NDArray and
mx.array currently fail with std::bad_cast. With env=shared, both
allocators agree.

Live-verified on Apple Silicon (Xcode 21, MacOSX26.4 SDK):
runtime_check.mm probes confirm MTLBuffer.storageMode matches the env
var setting in all 4 scenarios (default, shared, managed, private).
Build: cmake -DUSE_METAL=ON .. && make tvm_runtime — 3 min on M-series.

Untouched on purpose:
- StagingBufferPool::GetOrCreate (host-staging, must remain Shared)
- MetalThreadEntry::GetTempBuffer (GPU->CPU readback bounce, must remain Shared)
- metal_module.mm kernel dispatch (Shared buffers are valid kernel args)

Polished version of the patch by parallel test agent (cppmega-mlx
Path C work). Replaces the earlier draft that had a smaller diff but
missed the FFI registration helper.
Copilot AI review requested due to automatic review settings May 4, 2026 08:49
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the ability to configure the Metal storage mode via the TVM_METAL_STORAGE_MODE environment variable, facilitating zero-copy DLPack interop with frameworks like MLX. Key changes include the implementation of GetMetalStorageOptions to parse and cache the storage mode, updating buffer allocations to respect this setting, and adding a new FFI function metal.GetStorageMode to query the current configuration. I have no feedback to provide.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in TVM_METAL_STORAGE_MODE environment variable to control Metal device data buffer storage mode (Private/Shared/Managed) to enable zero-copy interop (e.g., via DLPack) while preserving existing default behavior.

Changes:

  • Introduces GetMetalStorageOptions() to parse/cache TVM_METAL_STORAGE_MODE (case-insensitive) and select the corresponding MTLResourceStorageMode*.
  • Switches MetalWorkspace::AllocDataSpace to allocate buffers using the resolved storage mode.
  • Registers a new FFI helper metal.GetStorageMode for tests/introspection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +510 to +511
if (opts == MTLResourceStorageModeShared) return ffi::String("shared");
if (opts == MTLResourceStorageModeManaged) return ffi::String("managed");
return MTLResourceStorageModePrivate;
}
// Lowercase a small bounded copy for case-insensitive comparison.
std::string v(raw);
apstenku123 added a commit to DatasunriseOU/cppmega_mlx that referenced this pull request May 4, 2026
Files documenting the actual PRs we just opened upstream:

- PR #1: ml-explore/mlx#3476 — from_dlpack Metal-aware consumer (against main, clean)
- PR #2: apache/tvm#19504 — TVM_METAL_STORAGE_MODE env opt-in (against main, clean)
- PR #3: tile-ai/tilelang#2139 — mixed-dtype T.gemm via scalar fallback (stacks on PR #2130)
- PR #4: tile-ai/tilelang#2140 — FP8-input T.gemm scalar fallback routing (stacks on PR #2130)
- PR #5: tile-ai/tilelang#2141 — T.Pipelined num_stages>1 3D buffer fix (stacks on PR #2130)
- PR #6: tile-ai/tilelang#2142 — T.fp8_scaled_matmul DSL intrinsic (stacks on PR #2130)

Deferred (split into companion PRs needed): tilelang_metal_fp8 and
tilelang_metal_fp8_vector each touch both tilelang supermodule and the
TileLang/tvm vendored submodule. These need 2 PRs each — one to
tile-ai/tilelang, one to TileLang/tvm — separate filing round.

PRs #3-#6 are independent of each other; each branches directly from
jorgecurious/tilelang:metal-gemm-upstream-rebase HEAD 971c17b, so they
can be reviewed in any order. They DO depend on the upstream 4-PR Apple
Metal landing chain (#1869, #2118, #2121, #2130) merging first; if any
of those land separately, ours can be retargeted at main.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants