
Add GPU support to ggml #914

Closed
ggerganov opened this issue Apr 12, 2023 · 1 comment
Labels
enhancement · hardware · help wanted · research 🔬

Comments


ggerganov commented Apr 12, 2023

Intro

This issue is more suitable for the https://github.com/ggerganov/ggml repo, but I'm adding it here for more visibility.

First, I don't see a GPU framework that is tightly integrated with ggml being added anytime soon, because that usually comes with a lot of maintenance drawbacks, architectural changes and issues. However, there is an alternative approach that might be relatively easy to implement, and I think it would be a very cool way for new developers to join in and help.

Description

ggml produces computation graphs, which are basically directed acyclic graphs (DAGs) that can be easily exported, iterated, etc. A graph contains information about all the tensor operations and buffers needed to evaluate the model. The idea is to first add basic functionality to ggml for exporting the graphs in some trivial text format, which a separate ggml tool can then parse as a second step. With the exported graphs in hand, one can process them and generate hardware-specific code for evaluating them.
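
As a rough illustration only (nothing like this exists in ggml today), such an export pass could just walk the graph and print one line per node. The sketch below assumes the current `ggml_cgraph` / `ggml_tensor` field names (`n_nodes`, `nodes`, `op`, `ne`, `src0`, `src1`), and the text format itself is made up:

```c
// Hypothetical sketch: dump a ggml computation graph in a trivial text format.
// One line per node: node <id> op <op-id> ne <shape> src0 <id> src1 <id>
#include <stdio.h>
#include "ggml.h"

// find the index of a tensor in the graph's node list (-1 for leafs / inputs)
static int node_id(const struct ggml_cgraph * gf, const struct ggml_tensor * t) {
    for (int i = 0; i < gf->n_nodes; ++i) {
        if (gf->nodes[i] == t) return i;
    }
    return -1;
}

void ggml_graph_export_text(const struct ggml_cgraph * gf, FILE * fout) {
    for (int i = 0; i < gf->n_nodes; ++i) {
        const struct ggml_tensor * t = gf->nodes[i];
        fprintf(fout, "node %d op %d ne %lld %lld %lld %lld src0 %d src1 %d\n",
                i, (int) t->op,
                (long long) t->ne[0], (long long) t->ne[1],
                (long long) t->ne[2], (long long) t->ne[3],
                node_id(gf, t->src0), node_id(gf, t->src1));
    }
}
```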

For example, a ggml-cuda tool can parse the exported graph and construct the necessary CUDA kernels and GPU buffers to evaluate it on an NVIDIA GPU. Another tool, for example ggml-mps, can do the same for Metal Performance Shaders, and so on.
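
Purely as an illustration of the "translation" step, such a hypothetical ggml-cuda tool could walk the parsed nodes and emit CUDA source text for each operation. The node layout, op numbering and emitted kernel below are invented for the sketch, not part of any existing tool:

```c
// Hypothetical sketch of the translation step in a ggml-cuda tool:
// for each parsed graph node, emit CUDA source that evaluates it.
#include <stdio.h>

struct graph_node {
    int  id;
    int  op;          // op id as read from the exported text format (e.g. 0 = ADD here)
    long ne[4];       // tensor shape
    int  src0, src1;  // indices of the source nodes (-1 if unused)
};

void emit_cuda_for_node(const struct graph_node * n, FILE * fout) {
    if (n->op == 0 /* ADD */) {
        long nelem = n->ne[0]*n->ne[1]*n->ne[2]*n->ne[3];
        fprintf(fout,
            "__global__ void k_add_%d(const float * a, const float * b, float * c) {\n"
            "    int i = blockIdx.x*blockDim.x + threadIdx.x;\n"
            "    if (i < %ld) c[i] = a[i] + b[i];\n"
            "}\n", n->id, nelem);
    }
    // ... other ops (MUL_MAT, SOFT_MAX, ...) would get their own emitters
}
```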

This approach preserves the cross-platform nature of ggml and allows custom hardware support, via compiler-like translation of the exported computation graphs.

Still, implementing the respective kernels remains the most difficult part and the biggest obstacle.

I think this decoupled approach to the implementation would make the development process much easier and could potentially allow for some interesting optimizations. My biggest fear with adding a tightly integrated GPU backend to ggml is that I don't know the important details of supporting the respective backend, which could lead to bad software design decisions that in turn could negatively affect even the core CPU implementation.
However, with the approach proposed in this issue, we eliminate this risk and allow multiple independent implementations to be provided without any negative side effects on the core ggml implementation.

Another cool thing about this idea is that there could be separate leading developers for each backend.
So if you have good knowledge and understanding of a certain hardware architecture, you are one step away from initiating the kernel "translation" process and making a very significant contribution to the project.

Guiding principles

I don't know all the specifics of GPU code, but I believe one could try to adopt the fundamental principles of ggml.
For example, there could be a single memory buffer allocated up front, with all the tensors placed within that buffer at certain offsets. Each graph operation would correspond to a kernel that takes source tensors as input and writes to a destination tensor, all of which live inside that single memory buffer allocated at the start of the execution.
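
A minimal sketch of that buffer-planning idea, assuming a made-up tensor_slot record and 256-byte alignment (both are assumptions for the illustration, not part of ggml):

```c
// Hypothetical sketch: all tensors live at fixed offsets inside one buffer
// that is allocated once, before the graph is evaluated.
#include <stdlib.h>

struct tensor_slot {
    size_t offs;   // byte offset of this tensor inside the shared buffer
    size_t size;   // size of the tensor data in bytes
};

// assign offsets for n tensors and return the total buffer size needed
size_t plan_buffer(struct tensor_slot * slots, const size_t * sizes, int n) {
    size_t total = 0;
    for (int i = 0; i < n; ++i) {
        slots[i].offs = total;
        slots[i].size = sizes[i];
        total += (sizes[i] + 255) & ~(size_t) 255; // pad to 256 B for alignment
    }
    return total;
}

// each generated kernel would then receive base + slots[src].offs as inputs
// and base + slots[dst].offs as its output pointer
```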

Additionally, I think we don't need to explicitly add third-party dependencies (e.g. CUDA SDK, OpenCL, etc.) to ggml to achieve this. The new ggml tools will simply generate code, and it will be up to the user to compile and run it.

I've heard of the concept of "super-shaders" / "super-kernels" - this is probably something we should try to achieve.

Taking shortcuts and making custom hacks in favor of better performance is very welcome.

Why?

Currently, ggml is one of the few ML frameworks that provides efficient 4-bit quantization and demonstrates its effective application to transformer evaluation. The code is compact and easily comprehensible, with very little bloat. I think ggml has a slight leading edge in this regard compared to other general-purpose frameworks, and if we capitalize on it now, it has the potential to become a very respectable machine learning framework in the future, with a focus on on-device inference.

Links

@ggerganov ggerganov added the enhancement, help wanted, hardware and research 🔬 labels Apr 12, 2023
@ggerganov ggerganov pinned this issue Apr 12, 2023

clxyder commented Apr 12, 2023

Would it be possible to use https://github.com/openai/triton/ to generate the backend-specific GPU code? From what I can tell, it generates CUDA code for you.

Repository owner locked and limited conversation to collaborators Apr 12, 2023
@ggerganov ggerganov converted this issue into discussion #915 Apr 12, 2023
@gjmulder gjmulder unpinned this issue May 2, 2023

