fix: handle non-existent path in is_nfs_path for Triton autotune cache #7921
Merged
tohtana merged 2 commits into deepspeedai:master on Mar 25, 2026
deepspeedai#7642)

When the Triton autotune cache directory does not exist yet, `is_nfs_path()` passes a non-existent path to `df -T`, which fails with "No such file or directory" on stderr.

Fix by walking up to the nearest existing ancestor directory before calling `df`. Also suppress stderr output and catch `FileNotFoundError` for environments where `df` is unavailable.

Fixes deepspeedai#7642

Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
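The ancestor walk described in this commit message can be sketched as follows. This is a minimal illustration, not the code merged in the PR: the helper name `nearest_existing_ancestor` is hypothetical, and the loop simply climbs with `os.path.dirname` until it reaches a path that exists or the filesystem root.

```python
import os
import tempfile


def nearest_existing_ancestor(path: str) -> str:
    """Return `path` itself if it exists, else its closest existing ancestor.

    Hypothetical helper for illustration; the name is not taken from the PR.
    """
    current = os.path.abspath(path)
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root; nothing further to climb.
            break
        current = parent
    return current


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # Neither "triton" nor "autotune" has been created yet.
        missing = os.path.join(base, "triton", "autotune")
        print(nearest_existing_ancestor(missing) == os.path.abspath(base))
```

Because a not-yet-created directory will normally live on the same filesystem as its nearest existing ancestor, querying the ancestor should give the same filesystem type that the cache directory will have once it is created.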
tohtana (Collaborator) approved these changes on Mar 25, 2026 and left a comment:
Looks good to me. Thank you for your contribution, @Krishnachaitanyakc!
nathon-lee pushed a commit to nathon-lee/DeepSpeed_woo that referenced this pull request on Mar 27, 2026
deepspeedai#7921)

### Summary

- `is_nfs_path()` in `matmul_ext.py` passes the cache directory path to `df -T` before the directory is created, causing `df: /root/.triton/autotune: No such file or directory` errors on stderr
- Fix by walking up to the nearest existing ancestor directory before invoking `df`, which correctly resolves the filesystem type without requiring the target path to exist
- Also suppress stderr via `subprocess.DEVNULL` and catch `FileNotFoundError` for environments where `df` is unavailable (e.g., minimal containers)

### Root Cause

In `AutotuneCacheManager.__init__`, `TritonCacheDir.warn_if_nfs(self.cache_dir)` is called before `os.makedirs(self.cache_dir, exist_ok=True)`. The `is_nfs_path()` function then runs `df -T` on a path that does not yet exist, which causes `df` to print an error to stderr. While the `CalledProcessError` exception was caught, the stderr output still leaked to the user's terminal.

### Changes

- `deepspeed/ops/transformer/inference/triton/matmul_ext.py`: Walk up to nearest existing ancestor before calling `df -T`; suppress stderr; catch `FileNotFoundError`

### Testing

- Python syntax validation: PASS
- yapf formatting check: PASS (no diff)
- flake8: PASS (no warnings)

Fixes deepspeedai#7642

Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
nathon-lee pushed a commit to nathon-lee/DeepSpeed_woo that referenced this pull request on Mar 28, 2026 (same commit message as the Mar 27 commit, with an additional `Signed-off-by: nathon-lee <leejianwoo@gmail.com>` trailer)
### Summary

- `is_nfs_path()` in `matmul_ext.py` passes the cache directory path to `df -T` before the directory is created, causing `df: /root/.triton/autotune: No such file or directory` errors on stderr
- Fix by walking up to the nearest existing ancestor directory before invoking `df`, which correctly resolves the filesystem type without requiring the target path to exist
- Also suppress stderr via `subprocess.DEVNULL` and catch `FileNotFoundError` for environments where `df` is unavailable (e.g., minimal containers)

### Root Cause

In `AutotuneCacheManager.__init__`, `TritonCacheDir.warn_if_nfs(self.cache_dir)` is called before `os.makedirs(self.cache_dir, exist_ok=True)`. The `is_nfs_path()` function then runs `df -T` on a path that does not yet exist, which causes `df` to print an error to stderr. While the `CalledProcessError` exception was caught, the stderr output still leaked to the user's terminal.

### Changes

- `deepspeed/ops/transformer/inference/triton/matmul_ext.py`: Walk up to nearest existing ancestor before calling `df -T`; suppress stderr; catch `FileNotFoundError`

### Testing

- Python syntax validation: PASS
- yapf formatting check: PASS (no diff)
- flake8: PASS (no warnings)

Fixes #7642
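The stderr suppression and exception handling described in the summary can be sketched roughly as below. This is an illustration under stated assumptions, not the merged implementation: the function names `filesystem_type` and `looks_like_nfs` and the `df_binary` parameter are hypothetical, the path passed in is assumed to exist already, and the parsing assumes GNU-style `df -T` output with the filesystem type in the second column of the first data row.

```python
import subprocess
from typing import Optional


def filesystem_type(existing_path: str, df_binary: str = "df") -> Optional[str]:
    """Return the lowercased filesystem type reported by `df -T`, or None.

    Hypothetical helper; `df_binary` exists only so the failure path is testable.
    """
    try:
        output = subprocess.check_output(
            [df_binary, "-T", existing_path],
            encoding="utf-8",
            stderr=subprocess.DEVNULL,  # keep df's error text off the terminal
        )
    except subprocess.CalledProcessError:
        return None  # df ran but exited with a non-zero status
    except FileNotFoundError:
        return None  # the df executable itself is missing

    lines = output.strip().split("\n")
    if len(lines) < 2:  # first line is the header row
        return None
    fields = lines[1].split()
    # Assumption: GNU df layout, where the second column is the type.
    return fields[1].lower() if len(fields) > 1 else None


def looks_like_nfs(existing_path: str, df_binary: str = "df") -> bool:
    """Treat any filesystem type containing 'nfs' as NFS; unknown means False."""
    fs_type = filesystem_type(existing_path, df_binary)
    return fs_type is not None and "nfs" in fs_type
```

Catching `CalledProcessError` alone only handles a non-zero exit status; it does not stop `df` from writing to the inherited stderr, which is why the redirect to `subprocess.DEVNULL` is needed as well. `FileNotFoundError` covers the separate case where the `df` executable cannot be found at all.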