
fix: apply cpu affinity to existing agent threads #11556

Merged
sharang merged 4 commits into main from support-cpu-affinity
Apr 9, 2026

Conversation

@kylewanginchina
Contributor

@kylewanginchina kylewanginchina commented Apr 1, 2026

This PR is for:

  • Agent

Fixes CPU affinity updates not being applied to already running agent threads, and replaces manual procfs thread scanning with the procfs crate.

Steps to reproduce the bug

  • Start a managed deepflow-agent and let it create its normal runtime worker threads.
  • Update global.tunning.cpu_affinity from deepflow-server for the agent group after those threads are already running.
  • Expected: the new CPU affinity setting applies to all existing agent threads, not only to threads created later.
  • In the original implementation, the rebinding logic depended on manual reads from /proc/self/task and /proc/self/task/<tid>/comm.
  • This made the thread scanning code harder to maintain and duplicated functionality already provided by the procfs crate.
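
For context, the manual scanning described above can be sketched with the standard library alone. This is an illustration of the general technique, not the agent's original code: the function name `list_threads` is hypothetical, and the sketch assumes a Linux host where `/proc/self/task/<tid>/comm` holds each thread's name.

```rust
use std::fs;
use std::io;

/// Lists (tid, name) for every thread of the current process by reading
/// /proc/self/task directly. Linux only. `list_threads` is a hypothetical name.
fn list_threads() -> io::Result<Vec<(u32, String)>> {
    let mut threads = Vec::new();
    for entry in fs::read_dir("/proc/self/task")? {
        let entry = match entry {
            Ok(e) => e,
            // A thread may exit while we iterate; skip it.
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        };
        // Directory names under /proc/self/task are numeric thread IDs.
        let tid: u32 = match entry.file_name().to_string_lossy().parse() {
            Ok(t) => t,
            Err(_) => continue,
        };
        // comm holds the thread name followed by a newline.
        match fs::read_to_string(entry.path().join("comm")) {
            Ok(name) => threads.push((tid, name.trim_end().to_string())),
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(threads)
}

fn main() -> io::Result<()> {
    for (tid, name) in list_threads()? {
        println!("{tid}\t{name}");
    }
    Ok(())
}
```

Every caller of such a helper has to repeat the directory parsing and the race handling, which is the maintenance cost the PR description refers to.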

Changes to fix the bug

  • Apply CPU affinity updates to already existing agent threads when global.tunning.cpu_affinity changes, instead of limiting the effect to threads
    created after startup.
  • Keep scanning multiple times so newly discovered threads created during the rebinding window can also be covered.
  • Continue skipping self-managed kick-kern.* eBPF threads during CPU affinity rebinding.
  • Replace manual /proc/self/task directory traversal with procfs::process::Process::myself()?.tasks()?.
  • Replace direct reads of /proc/self/task/<tid>/comm with task.status()?.name.
  • Add a small procfs::ProcError to io::Error conversion helper so the existing retry and error-handling flow remains unchanged.
  • Preserve the original tolerance for disappearing threads by continuing on NotFound errors during task enumeration and thread name resolution.
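
The control flow in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the merged implementation: `ScanError` is a local stand-in for a subset of `procfs::ProcError` so the example compiles without the crate, and `to_io_error`, `rebind_existing_threads`, `TaskList`, the pass count, and the closure-based `scan`/`bind` parameters are all hypothetical names chosen for the sketch. The real code would enumerate tasks through the procfs crate and apply the affinity with a system call.

```rust
use std::collections::HashSet;
use std::io;

/// Stand-in for a subset of `procfs::ProcError` (assumption for this sketch).
#[allow(dead_code)]
#[derive(Debug)]
enum ScanError {
    NotFound,
    PermissionDenied,
    Other(String),
}

/// One scan result: thread ID plus the outcome of resolving its name.
type TaskList = Vec<(u32, Result<String, ScanError>)>;

/// Maps the scan error onto io::Error so code that matches on
/// io::ErrorKind keeps working unchanged.
fn to_io_error(err: ScanError) -> io::Error {
    match err {
        ScanError::NotFound => io::Error::new(io::ErrorKind::NotFound, "task not found"),
        ScanError::PermissionDenied => {
            io::Error::new(io::ErrorKind::PermissionDenied, "permission denied")
        }
        ScanError::Other(msg) => io::Error::new(io::ErrorKind::Other, msg),
    }
}

/// Threads with this name prefix manage their own affinity and are skipped.
const SKIP_PREFIX: &str = "kick-kern.";

/// Scans `passes` times and binds each newly seen thread exactly once.
/// Returns the number of threads that were bound.
fn rebind_existing_threads<S, B>(passes: usize, mut scan: S, mut bind: B) -> io::Result<usize>
where
    S: FnMut() -> Result<TaskList, ScanError>,
    B: FnMut(u32) -> io::Result<()>,
{
    let mut bound = HashSet::new();
    for _ in 0..passes {
        for (tid, name) in scan().map_err(to_io_error)? {
            let name = match name.map_err(to_io_error) {
                Ok(n) => n,
                // The thread exited between enumeration and name lookup.
                Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
                Err(e) => return Err(e),
            };
            // Skip self-managed threads and threads handled in an earlier pass.
            if name.starts_with(SKIP_PREFIX) || !bound.insert(tid) {
                continue;
            }
            bind(tid)?;
        }
    }
    Ok(bound.len())
}

fn main() -> io::Result<()> {
    let mut pass = 0;
    // Fake scanner: a new worker thread shows up on the second pass.
    let scan = || -> Result<TaskList, ScanError> {
        pass += 1;
        let mut tasks: TaskList = vec![
            (1, Ok("main-loop".to_string())),
            (2, Ok("kick-kern.0".to_string())),
            (3, Err(ScanError::NotFound)),
        ];
        if pass >= 2 {
            tasks.push((4, Ok("late-worker".to_string())));
        }
        Ok(tasks)
    };
    let mut bound_tids = Vec::new();
    let count = rebind_existing_threads(3, scan, |tid| {
        bound_tids.push(tid);
        Ok(())
    })?;
    // Prints: bound 2 threads: [1, 4]
    println!("bound {count} threads: {bound_tids:?}");
    Ok(())
}
```

Tracking already-bound thread IDs in a set keeps repeated passes cheap, and converting errors at the boundary means the `NotFound` tolerance is expressed once, in terms of `io::ErrorKind`.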

Affected branches

  • main

Checklist

  • Added unit test to verify the fix.
  • Verified eBPF program runs successfully on linux 4.14.x.
  • Verified eBPF program runs successfully on linux 4.19.x.
  • Verified eBPF program runs successfully on linux 5.2.x.
  • Verified eBPF program runs successfully on linux 5.4.x.

@kylewanginchina
Contributor Author

This addresses the requirement in this ticket

@kylewanginchina
Contributor Author

Test evidence

  • Local unit tests
    • Command:

      docker exec deepflow-ext-ide /bin/bash -c "cd /work/deepflow/agent && cargo test --lib config::handler::tests -- --test-threads=1"

    • Result: 6 passed; 0 failed

    • Key output:

      running 6 tests
      test config::handler::tests::test_set_cpu_affinity_updates_existing_worker_threads ... ok
      test result: ok. 6 passed; 0 failed

  • Format check
    • Command:

      docker exec deepflow-ext-ide /bin/bash -c "cd /work/deepflow/agent && cargo fmt --check"

    • Result: exit code 0

  • Verification on a real machine (10.50.120.200) in managed mode
    • On 10.50.120.201, I temporarily pushed the following configuration to the target agent group profiling:

      global:
        tunning:
          cpu_affinity:
            - 0

    • Then I started the agent on 10.50.120.200 in managed mode:

      /home/deepflow-agent -f /home/deepflow-agent.yaml

    • Log evidence:

      • 10.50.120.200:/var/log/deepflow-agent/deepflow-agent.log:4367

      • Key line:

        Update global.tunning.cpu_affinity from [] to [0].

    • In the measured thread-affinity output, multiple worker threads of the managed-mode process (PID=3997103) have converged to CPU 0

      • Example:

        3997103: pid 3997103's current affinity list: 0
        3997148: pid 3997148's current affinity list: 0
        3997150: pid 3997150's current affinity list: 0
        3997152: pid 3997152's current affinity list: 0
        3997157: pid 3997157's current affinity list: 0

      • The covered threads include uniform-sender, stats-collector, main-loop, tokio-runtime-w, guard, and others, not just the main thread
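
The per-thread check above can also be done from inside a process by reading the `Cpus_allowed_list` field that Linux exposes in `/proc/<pid>/task/<tid>/status`. The following is a rough sketch of that idea, not part of this PR: `parse_cpus_allowed_list` and `thread_affinities` are hypothetical helper names, and the sketch inspects the current process only.

```rust
use std::fs;
use std::io;

/// Extracts the `Cpus_allowed_list` value from the text of a
/// /proc/<pid>/task/<tid>/status file, for example "0" or "0-3,8".
fn parse_cpus_allowed_list(status: &str) -> Option<&str> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("Cpus_allowed_list:"))
        .map(str::trim)
}

/// Returns (tid, allowed CPU list) for each thread of the current process.
fn thread_affinities() -> io::Result<Vec<(u32, String)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir("/proc/self/task")? {
        let entry = entry?;
        // Directory names under /proc/self/task are numeric thread IDs.
        let Ok(tid) = entry.file_name().to_string_lossy().parse::<u32>() else {
            continue;
        };
        // Threads can exit mid-scan; treat an unreadable status file as gone.
        let Ok(status) = fs::read_to_string(entry.path().join("status")) else {
            continue;
        };
        if let Some(list) = parse_cpus_allowed_list(&status) {
            out.push((tid, list.to_string()));
        }
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    for (tid, cpus) in thread_affinities()? {
        println!("{tid}: allowed CPUs {cpus}");
    }
    Ok(())
}
```

A unit test could use a helper like this to assert that every worker thread reports the expected CPU list after an affinity update.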

@kylewanginchina
Contributor Author

Test results:
image

@sharang sharang enabled auto-merge (squash) April 9, 2026 12:36
@sharang sharang merged commit 8f516d0 into main Apr 9, 2026
10 checks passed
@sharang sharang deleted the support-cpu-affinity branch April 9, 2026 12:49

3 participants