Skip to content

Fix/buck2 daemon timeout#1908

Merged
benjamin-747 merged 3 commits into
gitmono-dev:mainfrom
jerry609:fix/buck2-daemon-timeout
Feb 9, 2026
Merged

Fix/buck2 daemon timeout#1908
benjamin-747 merged 3 commits into
gitmono-dev:mainfrom
jerry609:fix/buck2-daemon-timeout

Conversation

@jerry609

@jerry609 jerry609 commented Feb 6, 2026

Copy link
Copy Markdown
Collaborator

背景

buck2 在启动阶段会进行大规模 statx 扫描。在 Antares FUSE 挂载场景下,这会触发大量目录查询与网络往返,造成 lookup 路径阻塞,最终引发 daemon 启动超时。

另外,buck2 之前从 monorepo 根目录执行子项目构建,可能与子项目自己的 .buckconfig / PACKAGE 上下文不一致,出现set_cfg_constructor() 调用上下文错误、目标解析偏差以及隔离目录复用冲突等问题。

本次改动

1)Scorpio(Dicfuse)

  • 增加可配置的 HTTP connect_timeout,并应用到 Dicfuse 相关客户端。
  • 将 FUSE lookup 改为“有界等待 + 后台刷新”的非阻塞模式:
    • 首次目录刷新等待预算:20ms
    • 超时后转后台继续刷新
    • 仅在 miss 时做一次有界重试等待:200ms
  • 使用挂载感知的 reply TTL 选择逻辑(Dicfuse::reply_ttl()),补齐 Antares reply TTL 的生效链路。
  • 优化配置维护性:
    • 对齐 scorpio.toml 默认值与代码预期
    • 使用宏减少 config accessor 重复样板代码
    • 精简注释并补充后续优化 TODO

2)Orion(buck2 执行链路)

  • 通过稳定的 --isolation-dir 处理实现按挂载隔离 buck2 daemon,降低跨构建相互影响。
  • 将高开销 preheat 从重试循环中前移,只做一次,减少重复成本。
  • 增加 shallow preheat 路径,降低冷启动目录遍历成本。
  • buck2_isolation_dir 创建失败时不再静默吞错,改为显式传播错误。
  • 使用稳定 SHA-256 后缀生成 isolation 目录,替换不稳定哈希行为。
  • buck2 改为从子项目目录执行(而非 monorepo 根目录),并按子项目上下文解析路径。

3)配置项与模板

  • 在运行时配置与模板中补充/贯通 connect_timeout 相关参数。
  • 支持 Orion 侧 shallow preheat depth 与 buck2 isolation base dir 的配置化控制。

Copilot AI review requested due to automatic review settings February 6, 2026 12:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses buck2 daemon timeout issues in FUSE-mounted environments by implementing configurable HTTP timeouts, non-blocking FUSE lookup operations, and improved buck2 daemon isolation. The changes span both the Scorpio (FUSE filesystem) and Orion (build system) components.

Changes:

  • Added configurable HTTP connection and request timeouts for Dicfuse operations with runtime-captured values
  • Implemented bounded-wait non-blocking lookup logic in FUSE to prevent daemon startup blocking
  • Introduced per-mount buck2 isolation directories using stable SHA-256-based naming
  • Moved buck2 execution from monorepo root to sub-project directories for correct context resolution
  • Increased default Antares preload depth from 0 to 3 and added shallow preheat functionality

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
scorpio/src/util/config.rs Added new timeout constants, introduced macro for config accessors, updated Antares defaults
scorpio/src/dicfuse/store.rs Applied configurable timeouts to HTTP clients, added dir_refresh_needed helper
scorpio/src/dicfuse/mod.rs Implemented mount-aware reply TTL selection logic
scorpio/src/dicfuse/async_io.rs Refactored lookup to use bounded-wait + background refresh pattern
scorpio/src/dicfuse/abi.rs Changed default TTL to ZERO with mount-specific override
scorpio/scorpio.toml Updated configuration defaults to match code changes
orion/src/buck_controller.rs Added isolation dir logic, shallow preheat, sub-project execution context
orion/buck/run.rs Added set_isolation_dir method
orion/Cargo.toml Added ring, hex, and common dependencies
config/*.toml Added orion_preheat_shallow_depth and orion_buck2_isolation_dir_base configuration
common/src/config/mod.rs Extended BuildConfig with new Orion settings

// Touch metadata to warm FUSE cache.
let _ = entry.metadata();

if file_type.is_dir() && depth < max_depth {

Copilot AI Feb 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preheat_shallow function starts with depth = 0 for the root directory. When checking if file_type.is_dir() && depth < max_depth, a directory at depth = 0 will be added to the stack with depth = 1. This means that with max_depth = 1, the function will traverse directories at both depth 0 and depth 1, resulting in 2 levels of traversal instead of 1.

To fix this, the condition should check whether the next level would exceed the maximum: if file_type.is_dir() && depth + 1 < max_depth.

Suggested change
if file_type.is_dir() && depth < max_depth {
if file_type.is_dir() && depth + 1 < max_depth {

Copilot uses AI. Check for mistakes.
// Antares mounts are primarily used by build workloads.

/// Preload directory depth for Antares mounts.
const DEFAULT_ANTARES_LOAD_DIR_DEPTH: usize = 3;

Copilot AI Feb 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value for DEFAULT_ANTARES_LOAD_DIR_DEPTH has changed from 0 to 3, which represents a significant behavioral change. The previous value of 0 disabled deep prewarming to rely on lazy per-directory loads, while the new value of 3 enables eager directory preloading up to 3 levels deep.

This change could have several implications:

  1. Increased memory usage for Antares mounts
  2. Potentially slower mount initialization times (though this may be mitigated by the new non-blocking lookup logic)
  3. Higher network traffic during mount setup

While this change aligns with the PR's goal of reducing cold-start latency through shallow preheat, it's worth verifying that this default is appropriate for all Antares use cases, especially given that the old comment explicitly mentioned disabling deep prewarm for "build lowerdir stability and resource predictability."

Suggested change
const DEFAULT_ANTARES_LOAD_DIR_DEPTH: usize = 3;
///
/// Default is 0 to disable deep prewarm and rely on lazy per-directory
/// loads. This keeps build lowerdir stability and resource usage
/// predictable for large Antares mounts. Shallow preheat, when desired,
/// should be enabled via explicit per-mount configuration rather than
/// a process-global default.
const DEFAULT_ANTARES_LOAD_DIR_DEPTH: usize = 0;

Copilot uses AI. Check for mistakes.
Comment on lines +826 to +828
match get_build_targets(
old_project_root.to_str().unwrap_or(&old_repo_mount_point),
new_project_root.to_str().unwrap_or(&repo_mount_point),

Copilot AI Feb 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback logic in the to_str().unwrap_or() calls could silently cause buck2 to run from the wrong directory. If the project_root path contains invalid UTF-8 and to_str() returns None, the code falls back to using the mount point (monorepo root) instead of the project root. This defeats the purpose of running buck2 from the sub-project directory and could cause incorrect build behavior.

Consider propagating the error instead of silently falling back, or at least logging a warning when the fallback is used. The same issue exists on line 891.

Suggested change
match get_build_targets(
old_project_root.to_str().unwrap_or(&old_repo_mount_point),
new_project_root.to_str().unwrap_or(&repo_mount_point),
// Convert project roots to UTF-8; treat invalid UTF-8 as a hard error instead of silently
// falling back to the mount root (which would run buck2 from the wrong directory).
let old_project_root_str = match old_project_root.to_str() {
Some(s) => s,
None => {
let err = anyhow!(
"Project root path for old repo contains invalid UTF-8: {}",
old_project_root.display()
);
tracing::error!(
"[Task {}] Failed to convert old project root to UTF-8: {}",
id,
err
);
guard.unmount().await;
guard_old_repo.unmount().await;
last_targets_error = Some(err);
break;
}
};
let new_project_root_str = match new_project_root.to_str() {
Some(s) => s,
None => {
let err = anyhow!(
"Project root path for new repo contains invalid UTF-8: {}",
new_project_root.display()
);
tracing::error!(
"[Task {}] Failed to convert new project root to UTF-8: {}",
id,
err
);
guard.unmount().await;
guard_old_repo.unmount().await;
last_targets_error = Some(err);
break;
}
};
match get_build_targets(
old_project_root_str,
new_project_root_str,

Copilot uses AI. Check for mistakes.
Comment on lines +664 to +667
buck2.set_isolation_dir(
buck2_isolation_dir(&mount_path)?
.to_str()
.ok_or_else(|| anyhow!("Invalid isolation dir path: {mount_path:?}"))?

Copilot AI Feb 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message uses the wrong variable mount_path instead of the actual isolation directory path that was attempted. The error should reference the result of buck2_isolation_dir(&mount_path)? instead, which is the path that failed to convert to a string. This makes debugging more difficult as the error message shows the wrong path.

Suggested change
buck2.set_isolation_dir(
buck2_isolation_dir(&mount_path)?
.to_str()
.ok_or_else(|| anyhow!("Invalid isolation dir path: {mount_path:?}"))?
let isolation_dir = buck2_isolation_dir(&mount_path)?;
buck2.set_isolation_dir(
isolation_dir
.to_str()
.ok_or_else(|| anyhow!("Invalid isolation dir path: {isolation_dir:?}"))?

Copilot uses AI. Check for mistakes.
Buck2 statx-scans the entire directory tree on startup, causing
connection timeouts on Antares FUSE mounts where each directory
requires a network round-trip.

scorpio:
- Add configurable connect_timeout (default 3s) to all HTTP clients
- Raise Antares kernel-entry TTL to 60s and dir-sync TTL to 120s
  so repeated statx calls are served from cache
- Rewrite FUSE lookup to be non-blocking: 20ms refresh budget with
  200ms miss-retry, preventing cascading stalls
- Add per-mount LookupMetrics for observability
- Decouple abi.rs default entries from global config; callers set
  the mount-appropriate TTL via Dicfuse::reply_ttl()

orion:
- Isolate buck2 daemons per-mount via --isolation-dir to prevent
  cross-build daemon conflicts
- Add preheat (ls -lR) and preheat_shallow before buck2 targets
  to warm FUSE metadata cache
- Support configurable isolation dir base and preheat depth via
  env vars and mega config
- Align scorpio.toml TTL defaults (5s/2s) with code defaults
- Move preheat() before retry loop in get_repo_targets (expensive op only needs to run once)
- Propagate buck2_isolation_dir error in build() instead of silently swallowing
- Replace unstable DefaultHasher with SHA-256 for stable isolation dir hashing
- Add config_accessor! macro to eliminate 15 identical accessor functions
- Document Lazy<Client> one-time initialization and orphan refresh task behavior
@jerry609 jerry609 force-pushed the fix/buck2-daemon-timeout branch 2 times, most recently from 5f6c3e1 to 2126547 Compare February 6, 2026 13:03
…root

Buck2 was running from the monorepo root, but sub-projects (e.g.,
project/git-internal/git-internal) have their own .buckconfig and
PACKAGE files. This caused set_cfg_constructor() to fail because
it can only be called from the repository root PACKAGE file.

Changes:
- Remove changes path prefix transformation; changes are already
  relative to the sub-project
- Resolve sub-project paths within mounts for get_build_targets()
- Set buck2 build current_dir and isolation_dir to sub-project root

Signed-off-by: jerry609 <1772030600@qq.com>
@jerry609 jerry609 force-pushed the fix/buck2-daemon-timeout branch from 2126547 to 1052cc5 Compare February 6, 2026 13:11
@genedna

genedna commented Feb 6, 2026

Copy link
Copy Markdown
Collaborator

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@benjamin-747 benjamin-747 added this pull request to the merge queue Feb 9, 2026
Merged via the queue into gitmono-dev:main with commit d3f4189 Feb 9, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants