Comparison with torch jit #14

Closed
aichr opened this issue Oct 3, 2022 · 8 comments

Labels
question Further information is requested

Comments

@aichr

aichr commented Oct 3, 2022

Great work! I find the implementation of ggml especially interesting. It looks like you implemented all the basic neural network building blocks with ggml. How does it compare with the torch jit approach of using a PyTorch model in C++?

ggerganov added the question (Further information is requested) label on Oct 3, 2022
@ggerganov
Owner

Hi, thanks!

I'm not sure how it compares - I've never used torch jit. In general, I have very little experience with the Python frameworks out there. This is one of the reasons for implementing ggml - to not have to use Python and the various extra dependencies that usually come with such frameworks, and to avoid their overhead. Also, it is a good learning experience - it helps to understand how these NN approaches work at a lower level.

@skaiware

skaiware commented Oct 3, 2022

Hi @ggerganov,
torch.jit doesn't require any Python after exporting the model to a JIT file. It's then quite easy to run inference using just the Torch C++ API and linking against torch_cpu.so/.dll. But it is quite invasive (up to 2 GB of libraries), especially compared to ONNX Runtime (~20 MB), and it is usually slower than ONNX Runtime, so not that high a priority to try.
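For reference, that workflow looks roughly like the following - a minimal sketch assuming a model has already been exported with torch.jit.trace or torch.jit.script; the file name "model.pt" and the input shape are placeholders, not from this repo.

```cpp
// Minimal sketch: running an exported TorchScript model from C++ via LibTorch.
// Link against libtorch (torch_cpu). "model.pt" and the input shape below are
// illustrative placeholders.
#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
    // Load the serialized TorchScript module produced by torch.jit.trace/script.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.eval();

    // Build a dummy input; replace with the shape your model expects.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 224, 224}));

    // Run inference on the CPU and print the output shape.
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output.sizes() << std::endl;
    return 0;
}
```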

@aichr
Author

aichr commented Oct 4, 2022

@ggerganov thanks for the reply. I am astonished by your C implementation of these neural network building blocks. Like @skaiware said, the downside of jit could be that it comes with a big linking burden. I think your solution has great potential to make big models run faster on any device!

@aichr
Author

aichr commented Oct 4, 2022

How did you make sure that the C implementation produces the same results as the PyTorch functions? Are there any tests to guard this?

@ggerganov
Owner

How did you make sure that the C implementation produces the same results as the PyTorch functions? Are there any tests to guard this?

The results are not the same - it's hard to make them exactly the same due to round-off errors when using floating-point numbers. Instead, I just verified manually that the numbers I get after each layer are similar to the ones from the original Python implementation.

It would be great to add some tests in the future to make sure the results match the reference implementation within some tolerance.
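Such a test could be as simple as an element-wise comparison against a dumped reference tensor. A hypothetical sketch - the helper name and tolerance values here are made up for illustration and are not part of the project:

```cpp
// Hypothetical sketch: compare a ggml output buffer against a reference
// (e.g. PyTorch) output dumped to a flat float array. The tolerances are
// illustrative defaults, not values used by this project.
#include <cmath>
#include <cstdio>

bool all_close(const float * a, const float * b, int n,
               float abs_tol = 1e-4f, float rel_tol = 1e-3f) {
    for (int i = 0; i < n; ++i) {
        const float diff = std::fabs(a[i] - b[i]);
        const float ref  = std::fmax(std::fabs(a[i]), std::fabs(b[i]));
        if (diff > abs_tol + rel_tol * ref) {
            std::printf("mismatch at index %d: %f vs %f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}
```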

@ejkitchen

@skaiware, how would you convert this C++ code to use CUDA? Any ideas?

@ggerganov
Owner

IMHO it does not make sense to port this code to CUDA. If you want to use CUDA or another GPU framework, you are better off using one of the well-established Python frameworks (PyTorch, TensorFlow, etc.).

With whisper.cpp, the main idea is to avoid the overhead that comes from using a high-level programming language (Python) and also to avoid moving data back and forth across the PCI bus. We do lose a lot of GPU performance, but we try to compensate by optimizing the CPU operations and memory bandwidth as much as possible.

And this is where the Apple Silicon hardware comes in: it gives you the AMX matrix coprocessor, which offers a great performance boost (at least according to my experiments). It is so easy to integrate into your project (simply add -framework Accelerate to your compile flags) that there is no reason not to use it.
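For example, a matrix multiply routed through Accelerate's CBLAS interface looks roughly like this - a minimal sketch with illustrative sizes, not code from whisper.cpp:

```cpp
// Minimal sketch: single-precision matrix multiply via Apple's Accelerate
// framework (CBLAS). Compile with: clang++ demo.cpp -framework Accelerate
// The matrix sizes and values are illustrative placeholders.
#include <Accelerate/Accelerate.h>
#include <cstdio>
#include <vector>

int main() {
    const int M = 2, N = 2, K = 2;
    std::vector<float> A = {1, 2, 3, 4};   // M x K, row-major
    std::vector<float> B = {5, 6, 7, 8};   // K x N, row-major
    std::vector<float> C(M * N, 0.0f);     // M x N result

    // C = 1.0 * A * B + 0.0 * C
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A.data(), K, B.data(), N, 0.0f, C.data(), N);

    std::printf("%.1f %.1f\n%.1f %.1f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```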

My understanding is that the Python frameworks at the moment do not fully support the Accelerate framework (I might be wrong), but soon they will, and then you probably won't see much of a performance improvement when using whisper.cpp. But for the moment, I think this project offers better performance - at least this is what I see on my MacBook when comparing to PyTorch. And again, take this with a grain of salt, because I have almost no experience with Python, so I could be doing something wrong.

@ejkitchen

@ggerganov Sorry, I should have been clearer. I meant C++ with the PyTorch CUDA-enabled drivers. I thought a lot of the performance issues were with Python, but it wasn't as simple as that after looking into it with a profiler called Scalene, which looks at Python code, C/C++ library code, system calls, and various GPU metrics.

We're now converting Whisper to use TorchScript with TensorRT and some other CUDA optimizations, such as pinning memory. But again, this is all for CUDA-enabled devices. We think we're going to get about a 3x-5x speedup, but until it's completed this is just a theory based on quick back-of-the-napkin calculations.
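For context, the pinned-memory trick in LibTorch looks roughly like this - a generic sketch with a placeholder shape, unrelated to the actual conversion described above:

```cpp
// Generic sketch: pinned host memory + asynchronous host-to-device copy with
// LibTorch. Requires a CUDA-enabled LibTorch build; the shape is a placeholder.
#include <torch/torch.h>
#include <iostream>

int main() {
    if (!torch::cuda::is_available()) {
        std::cout << "CUDA not available" << std::endl;
        return 0;
    }

    // Allocate the host tensor in page-locked (pinned) memory so the
    // host-to-device copy can be performed asynchronously by the DMA engine.
    at::Tensor host = torch::randn({1, 80, 3000}).pin_memory();

    // non_blocking=true lets the copy overlap with other work on the stream.
    at::Tensor dev = host.to(torch::kCUDA, /*non_blocking=*/true);

    std::cout << dev.device() << std::endl;
    return 0;
}
```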

In any case, thank you so much for this C++ version; it was quite an inspiration and great work. I might come back to it if I feel we can squeeze out more performance gains by remaining entirely native but with the PyTorch libs and the other tricks mentioned above.

mattsta pushed a commit to mattsta/whisper.cpp that referenced this issue Apr 1, 2023
make LICENSE a link instead of code-formatted text