Fix/buck2 daemon timeout#1908
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses buck2 daemon timeout issues in FUSE-mounted environments by implementing configurable HTTP timeouts, non-blocking FUSE lookup operations, and improved buck2 daemon isolation. The changes span both the Scorpio (FUSE filesystem) and Orion (build system) components.
Changes:
- Added configurable HTTP connection and request timeouts for Dicfuse operations with runtime-captured values
- Implemented bounded-wait non-blocking lookup logic in FUSE to prevent daemon startup blocking
- Introduced per-mount buck2 isolation directories using stable SHA-256-based naming
- Moved buck2 execution from monorepo root to sub-project directories for correct context resolution
- Increased default Antares preload depth from 0 to 3 and added shallow preheat functionality
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scorpio/src/util/config.rs | Added new timeout constants, introduced macro for config accessors, updated Antares defaults |
| scorpio/src/dicfuse/store.rs | Applied configurable timeouts to HTTP clients, added dir_refresh_needed helper |
| scorpio/src/dicfuse/mod.rs | Implemented mount-aware reply TTL selection logic |
| scorpio/src/dicfuse/async_io.rs | Refactored lookup to use bounded-wait + background refresh pattern |
| scorpio/src/dicfuse/abi.rs | Changed default TTL to ZERO with mount-specific override |
| scorpio/scorpio.toml | Updated configuration defaults to match code changes |
| orion/src/buck_controller.rs | Added isolation dir logic, shallow preheat, sub-project execution context |
| orion/buck/run.rs | Added set_isolation_dir method |
| orion/Cargo.toml | Added ring, hex, and common dependencies |
| config/*.toml | Added orion_preheat_shallow_depth and orion_buck2_isolation_dir_base configuration |
| common/src/config/mod.rs | Extended BuildConfig with new Orion settings |
| // Touch metadata to warm FUSE cache. | ||
| let _ = entry.metadata(); | ||
|
|
||
| if file_type.is_dir() && depth < max_depth { |
There was a problem hiding this comment.
The preheat_shallow function starts with depth = 0 for the root directory. When checking if file_type.is_dir() && depth < max_depth, a directory at depth = 0 will be added to the stack with depth = 1. This means that with max_depth = 1, the function will traverse directories at both depth 0 and depth 1, resulting in 2 levels of traversal instead of 1.
To fix this, the condition should check whether the next level would exceed the maximum: if file_type.is_dir() && depth + 1 < max_depth.
| if file_type.is_dir() && depth < max_depth { | |
| if file_type.is_dir() && depth + 1 < max_depth { |
| // Antares mounts are primarily used by build workloads. | ||
|
|
||
| /// Preload directory depth for Antares mounts. | ||
| const DEFAULT_ANTARES_LOAD_DIR_DEPTH: usize = 3; |
There was a problem hiding this comment.
The default value for DEFAULT_ANTARES_LOAD_DIR_DEPTH has changed from 0 to 3, which represents a significant behavioral change. The previous value of 0 disabled deep prewarming to rely on lazy per-directory loads, while the new value of 3 enables eager directory preloading up to 3 levels deep.
This change could have several implications:
- Increased memory usage for Antares mounts
- Potentially slower mount initialization times (though this may be mitigated by the new non-blocking lookup logic)
- Higher network traffic during mount setup
While this change aligns with the PR's goal of reducing cold-start latency through shallow preheat, it's worth verifying that this default is appropriate for all Antares use cases, especially given that the old comment explicitly mentioned disabling deep prewarm for "build lowerdir stability and resource predictability."
| const DEFAULT_ANTARES_LOAD_DIR_DEPTH: usize = 3; | |
| /// | |
| /// Default is 0 to disable deep prewarm and rely on lazy per-directory | |
| /// loads. This keeps build lowerdir stability and resource usage | |
| /// predictable for large Antares mounts. Shallow preheat, when desired, | |
| /// should be enabled via explicit per-mount configuration rather than | |
| /// a process-global default. | |
| const DEFAULT_ANTARES_LOAD_DIR_DEPTH: usize = 0; |
| match get_build_targets( | ||
| old_project_root.to_str().unwrap_or(&old_repo_mount_point), | ||
| new_project_root.to_str().unwrap_or(&repo_mount_point), |
There was a problem hiding this comment.
The fallback logic in the to_str().unwrap_or() calls could silently cause buck2 to run from the wrong directory. If the project_root path contains invalid UTF-8 and to_str() returns None, the code falls back to using the mount point (monorepo root) instead of the project root. This defeats the purpose of running buck2 from the sub-project directory and could cause incorrect build behavior.
Consider propagating the error instead of silently falling back, or at least logging a warning when the fallback is used. The same issue exists on line 891.
| match get_build_targets( | |
| old_project_root.to_str().unwrap_or(&old_repo_mount_point), | |
| new_project_root.to_str().unwrap_or(&repo_mount_point), | |
| // Convert project roots to UTF-8; treat invalid UTF-8 as a hard error instead of silently | |
| // falling back to the mount root (which would run buck2 from the wrong directory). | |
| let old_project_root_str = match old_project_root.to_str() { | |
| Some(s) => s, | |
| None => { | |
| let err = anyhow!( | |
| "Project root path for old repo contains invalid UTF-8: {}", | |
| old_project_root.display() | |
| ); | |
| tracing::error!( | |
| "[Task {}] Failed to convert old project root to UTF-8: {}", | |
| id, | |
| err | |
| ); | |
| guard.unmount().await; | |
| guard_old_repo.unmount().await; | |
| last_targets_error = Some(err); | |
| break; | |
| } | |
| }; | |
| let new_project_root_str = match new_project_root.to_str() { | |
| Some(s) => s, | |
| None => { | |
| let err = anyhow!( | |
| "Project root path for new repo contains invalid UTF-8: {}", | |
| new_project_root.display() | |
| ); | |
| tracing::error!( | |
| "[Task {}] Failed to convert new project root to UTF-8: {}", | |
| id, | |
| err | |
| ); | |
| guard.unmount().await; | |
| guard_old_repo.unmount().await; | |
| last_targets_error = Some(err); | |
| break; | |
| } | |
| }; | |
| match get_build_targets( | |
| old_project_root_str, | |
| new_project_root_str, |
| buck2.set_isolation_dir( | ||
| buck2_isolation_dir(&mount_path)? | ||
| .to_str() | ||
| .ok_or_else(|| anyhow!("Invalid isolation dir path: {mount_path:?}"))? |
There was a problem hiding this comment.
The error message uses the wrong variable mount_path instead of the actual isolation directory path that was attempted. The error should reference the result of buck2_isolation_dir(&mount_path)? instead, which is the path that failed to convert to a string. This makes debugging more difficult as the error message shows the wrong path.
| buck2.set_isolation_dir( | |
| buck2_isolation_dir(&mount_path)? | |
| .to_str() | |
| .ok_or_else(|| anyhow!("Invalid isolation dir path: {mount_path:?}"))? | |
| let isolation_dir = buck2_isolation_dir(&mount_path)?; | |
| buck2.set_isolation_dir( | |
| isolation_dir | |
| .to_str() | |
| .ok_or_else(|| anyhow!("Invalid isolation dir path: {isolation_dir:?}"))? |
Buck2 statx-scans the entire directory tree on startup, causing connection timeouts on Antares FUSE mounts where each directory requires a network round-trip. scorpio: - Add configurable connect_timeout (default 3s) to all HTTP clients - Raise Antares kernel-entry TTL to 60s and dir-sync TTL to 120s so repeated statx calls are served from cache - Rewrite FUSE lookup to be non-blocking: 20ms refresh budget with 200ms miss-retry, preventing cascading stalls - Add per-mount LookupMetrics for observability - Decouple abi.rs default entries from global config; callers set the mount-appropriate TTL via Dicfuse::reply_ttl() orion: - Isolate buck2 daemons per-mount via --isolation-dir to prevent cross-build daemon conflicts - Add preheat (ls -lR) and preheat_shallow before buck2 targets to warm FUSE metadata cache - Support configurable isolation dir base and preheat depth via env vars and mega config
- Align scorpio.toml TTL defaults (5s/2s) with code defaults - Move preheat() before retry loop in get_repo_targets (expensive op only needs to run once) - Propagate buck2_isolation_dir error in build() instead of silently swallowing - Replace unstable DefaultHasher with SHA-256 for stable isolation dir hashing - Add config_accessor! macro to eliminate 15 identical accessor functions - Document Lazy<Client> one-time initialization and orphan refresh task behavior
5f6c3e1 to
2126547
Compare
…root Buck2 was running from the monorepo root, but sub-projects (e.g., project/git-internal/git-internal) have their own .buckconfig and PACKAGE files. This caused set_cfg_constructor() to fail because it can only be called from the repository root PACKAGE file. Changes: - Remove changes path prefix transformation; changes are already relative to the sub-project - Resolve sub-project paths within mounts for get_build_targets() - Set buck2 build current_dir and isolation_dir to sub-project root Signed-off-by: jerry609 <1772030600@qq.com>
2126547 to
1052cc5
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Hooray! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
背景
buck2 在启动阶段会进行大规模 statx 扫描。在 Antares FUSE 挂载场景下,这会触发大量目录查询与网络往返,造成 lookup 路径阻塞,最终引发 daemon 启动超时。
另外,buck2 之前从 monorepo 根目录执行子项目构建,可能与子项目自己的 .buckconfig / PACKAGE 上下文不一致,出现set_cfg_constructor() 调用上下文错误、目标解析偏差以及隔离目录复用冲突等问题。
本次改动
1)Scorpio(Dicfuse)
2)Orion(buck2 执行链路)
3)配置项与模板