Skip to content

Conversation

@mmichel11
Copy link
Contributor

Setting GGML_SYCL_DISABLE_GRAPHS=0 to enable SYCL graphs currently crashes with most use cases because of the following unsupported operations in graph recording regions:

  • Host waits
  • SYCL malloc / free calls

The following changes are made to fix SYCL graph functionality:

  • When graphs are enabled, use the SYCL async memory extension for temp buffers which is supported with SYCL graphs.
  • For SYCL compiler versions that do not support this extension, skip graphs with the affected op.
  • Switch from USM shared to device memory as the async extension currently just supports device allocations.

I have verified functionality with commit a3132c1 of intel/llvm. For earlier compilers that do not support this extension, graphs are disabled for most cases to prevent crashes.

GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because:
  - Host waits are currently unsupported in graph recording mode.
  - SYCL malloc / free calls are unsupported in graph recording mode.

The following changes are made to fix SYCL graph functionality:
  - When graphs are enabled, use the SYCL async memory extension for temp
    buffers which is supported with SYCL graphs.
  - For compiler versions that do not support this extension, skip
    graphs with the affected op.
  - Switch from USM shared to device memory as the async extension
    currently just supports device allocations.
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Oct 17, 2025
@NeoZhangJianyu
Copy link
Collaborator

NeoZhangJianyu commented Oct 20, 2025

@mmichel11
By this PR, will the SYCL graph feature work well?
Is there any performance benefit?

I support to enable SYCL graph feature in SYCL backend.
But I appreciate a PR to enable SYCL graph to provide whole solution of SYCL graph, instead of a piece of code to fix a small issue before it's ready.
When a developer want to refer to SYCL graph solution in llama.cpp, there are many PRs. It'd hard to learn.

I understand that it's not easy to make the SYCL graph work well in llama.cpp.
I suggest merging such PRs to a feature branch, instead of main branch:

  • We hope main branch work with stable features and bug fix.

  • For this PR, user need to install a special compiler version to support the PR code.
    It doesn't make sense for main branch.

Thank you!

@lslusarczyk
Copy link
Contributor

If only current Intel llvm public compiler has all needed support, then it should be used for SYCL in CI. Let's (1) add CI executing this code, (2) check if performance on non-graph is not deteriorated by this PR.

If both above tasks are done let's merge and have SYCL graphs working on llama head again. This will be far better situation on head that now, when we have crashing graphs.

@NeoZhangJianyu
Copy link
Collaborator

If only current Intel llvm public compiler has all needed support, then it should be used for SYCL in CI. Let's (1) add CI executing this code, (2) check if performance on non-graph is not deteriorated by this PR.

If both above tasks are done let's merge and have SYCL graphs working on llama head again. This will be far better situation on head that now, when we have crashing graphs.

Have you tested the performance of SYCL graph locally?
How much is the performance improvement?

@mmichel11
Copy link
Contributor Author

mmichel11 commented Oct 20, 2025

If only current Intel llvm public compiler has all needed support, then it should be used for SYCL in CI. Let's (1) add CI executing this code, (2) check if performance on non-graph is not deteriorated by this PR.
If both above tasks are done let's merge and have SYCL graphs working on llama head again. This will be far better situation on head that now, when we have crashing graphs.

Have you tested the performance of SYCL graph locally? How much is the performance improvement?

I've tested performance locally, and SYCL graphs does not yet deliver performance benefit over the standard path. In particular, usage of graphs needs to be improved to reduce the number of finalize calls bottlenecking performance and graph update needs to be fixed by adding alternative paths for unsupported nodes in graph update (memcpy). This patch just reenables functionality for newer compilers with the async memory extension.

If the feature branch is the way we need to go now, the biggest problem I foresee is future SYCL changes breaking compatibility with graphs as @lslusarczyk mentioned creating difficulties for enablement in the long-term. Having some CI job to ensure SYCL graph functionality with new changes on master seems ideal to prevent this if possible.

@NeoZhangJianyu
Copy link
Collaborator

If only current Intel llvm public compiler has all needed support, then it should be used for SYCL in CI. Let's (1) add CI executing this code, (2) check if performance on non-graph is not deteriorated by this PR.
If both above tasks are done let's merge and have SYCL graphs working on llama head again. This will be far better situation on head that now, when we have crashing graphs.

Have you tested the performance of SYCL graph locally? How much is the performance improvement?

I've tested performance locally, and SYCL graphs does not yet deliver performance benefit over the standard path. In particular, usage of graphs needs to be improved to reduce the number of finalize calls bottlenecking performance and graph update needs to be fixed by adding alternative paths for unsupported nodes in graph update (memcpy). This patch just reenables functionality for newer compilers with the async memory extension.

If the feature branch is the way we need to go now, the biggest problem I foresee is future SYCL changes breaking compatibility with graphs as @lslusarczyk mentioned creating difficulties for enablement in the long-term. Having some CI job to ensure SYCL graph functionality with new changes on master seems ideal to prevent this if possible.

The SYCL graph code is separated by macro in code.
It won’t impact legacy code.

Though we can’t run CI to check and protect SYCL graph code not to be broken by other PRs, we could pay attention to during review PRs.
At least I will do it.
So, don’t worry it.

Thank you for your sharing! Hope this feature be implemented as soon.

@NeoZhangJianyu NeoZhangJianyu self-requested a review October 22, 2025 07:38
@NeoZhangJianyu NeoZhangJianyu merged commit 9de9672 into ggml-org:master Oct 23, 2025
70 checks passed
FMayran pushed a commit to FMayran/llama.cpp that referenced this pull request Oct 23, 2025
…ng (ggml-org#16644)

* sycl: use async memory allocation to fix graph recording failures

GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because:
  - Host waits are currently unsupported in graph recording mode.
  - SYCL malloc / free calls are unsupported in graph recording mode.

The following changes are made to fix SYCL graph functionality:
  - When graphs are enabled, use the SYCL async memory extension for temp
    buffers which is supported with SYCL graphs.
  - For compiler versions that do not support this extension, skip
    graphs with the affected op.
  - Switch from USM shared to device memory as the async extension
    currently just supports device allocations.

* Address reviewer feedback

* Use global async variable to decide path in sycl_ext_[malloc_device|free]
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025
…ng (ggml-org#16644)

* sycl: use async memory allocation to fix graph recording failures

GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because:
  - Host waits are currently unsupported in graph recording mode.
  - SYCL malloc / free calls are unsupported in graph recording mode.

The following changes are made to fix SYCL graph functionality:
  - When graphs are enabled, use the SYCL async memory extension for temp
    buffers which is supported with SYCL graphs.
  - For compiler versions that do not support this extension, skip
    graphs with the affected op.
  - Switch from USM shared to device memory as the async extension
    currently just supports device allocations.

* Address reviewer feedback

* Use global async variable to decide path in sycl_ext_[malloc_device|free]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants