Comparison with torch jit #14

Closed
aichr opened this issue Oct 3, 2022 · 8 comments

Labels
question Further information is requested

Comments

@aichr

aichr commented Oct 3, 2022

Great work! I find the implementation of ggml especially interesting. It looks like you implemented all the basic neural network building blocks with ggml. How does it compare with the torch jit approach of using a PyTorch model in C++?

ggerganov added the question (Further information is requested) label on Oct 3, 2022
@ggerganov
Owner

Hi, thanks!

I'm not sure how it compares - I've never used torch jit. In general, I have very little experience with the Python frameworks out there. This is one of the reasons for implementing ggml - to not have to use Python and the various extra dependencies that usually come with such frameworks, and to avoid their overhead. Also, it is a good learning experience - it helps to understand how these NN approaches work at a lower level.

@skaiware

skaiware commented Oct 3, 2022

Hi @ggerganov,
torch.jit doesn't require any Python after exporting the model to a JIT file. It's then quite easy to run inference using just the Torch C++ API and linking against torch_cpu.so/.dll. But it is quite invasive (up to 2 GB of libraries), especially compared to ONNX Runtime (~20 MB), and it is usually slower than ONNX Runtime, so not that high a priority to try.
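For reference, that workflow looks roughly like the following - a minimal sketch assuming a model has already been exported with torch.jit.trace or torch.jit.script; the file name "model.pt" and the input shape are placeholders, not from this repo.

```cpp
// Minimal sketch: running an exported TorchScript model from C++ via LibTorch.
// Link against libtorch (torch_cpu). "model.pt" and the input shape below are
// illustrative placeholders.
#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
    // Load the serialized TorchScript module produced by torch.jit.trace/script.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.eval();

    // Build a dummy input; replace with the shape your model expects.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 224, 224}));

    // Run inference on the CPU and print the output shape.
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output.sizes() << std::endl;
    return 0;
}
```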

@aichr
Author

aichr commented Oct 4, 2022

@ggerganov thanks for the reply. I am astonished by your C implementation of these neural network building blocks. Like @skaiware said, the downside of jit could be that it comes with a big linking burden. I think your solution has great potential to make big models run faster on any device!

@aichr
Author

aichr commented Oct 4, 2022

How did you make sure that the C implementation produces the same results as the PyTorch functions? Are there any tests to guard this?

@ggerganov
Owner

How did you make sure that the C implementation produces the same results as the PyTorch functions? Are there any tests to guard this?

The results are not the same - it's hard to make them exactly the same due to round-off errors when using floating-point numbers. Instead, I just verified manually that the numbers I get after each layer are similar to the ones from the original Python implementation.

It would be great to add some tests in the future to make sure the results match the reference implementation within some tolerance.
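Such a test could be as simple as an element-wise comparison against a dumped reference tensor. A hypothetical sketch - the helper name and tolerance values here are made up for illustration and are not part of the project:

```cpp
// Hypothetical sketch: compare a ggml output buffer against a reference
// (e.g. PyTorch) output dumped to a flat float array. The tolerances are
// illustrative defaults, not values used by this project.
#include <cmath>
#include <cstdio>

bool all_close(const float * a, const float * b, int n,
               float abs_tol = 1e-4f, float rel_tol = 1e-3f) {
    for (int i = 0; i < n; ++i) {
        const float diff = std::fabs(a[i] - b[i]);
        const float ref  = std::fmax(std::fabs(a[i]), std::fabs(b[i]));
        if (diff > abs_tol + rel_tol * ref) {
            std::printf("mismatch at index %d: %f vs %f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}
```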

@ejkitchen

@skaiware, how would you convert this C++ code to use CUDA? Any ideas?

@ggerganov
Owner

IMHO it does not make sense to port this code to CUDA. If you want to use CUDA or another GPU framework, you are better off using one of the well-established Python frameworks (PyTorch, TensorFlow, etc.).

With whisper.cpp, the main idea is to avoid the overhead that comes from using a high-level programming language (Python) and also to avoid moving data back and forth across the PCI bus. We do lose a lot of GPU performance, but we try to compensate by optimizing the CPU operations and memory bandwidth as much as possible.

And this is where the Apple Silicon hardware comes in: it gives you the AMX matrix coprocessor, which offers a great performance boost (at least according to my experiments). It is so easy to integrate into your project (simply add -framework Accelerate to your compile flags) that there is no reason not to use it.
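For example, a matrix multiply routed through Accelerate's CBLAS interface looks roughly like this - a minimal sketch with illustrative sizes, not code from whisper.cpp:

```cpp
// Minimal sketch: single-precision matrix multiply via Apple's Accelerate
// framework (CBLAS). Compile with: clang++ demo.cpp -framework Accelerate
// The matrix sizes and values are illustrative placeholders.
#include <Accelerate/Accelerate.h>
#include <cstdio>
#include <vector>

int main() {
    const int M = 2, N = 2, K = 2;
    std::vector<float> A = {1, 2, 3, 4};   // M x K, row-major
    std::vector<float> B = {5, 6, 7, 8};   // K x N, row-major
    std::vector<float> C(M * N, 0.0f);     // M x N result

    // C = 1.0 * A * B + 0.0 * C
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A.data(), K, B.data(), N, 0.0f, C.data(), N);

    std::printf("%.1f %.1f\n%.1f %.1f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```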

My understanding is that the Python frameworks at the moment do not fully support the Accelerate framework (I might be wrong), but soon they will, and then you probably won't see much of a performance improvement when using whisper.cpp. But for the moment, I think this project offers better performance - at least this is what I see on my MacBook when comparing to PyTorch. And again, take this with a grain of salt, because I have almost no experience with Python, so I could be doing something wrong.

@ejkitchen

@ggerganov Sorry, I should have been clearer. I meant C++ with the PyTorch CUDA-enabled drivers. I thought a lot of the performance issues were with Python, but it wasn't as simple as that after looking into it with a profiler called Scalene, which looks at Python code, C/C++ library code, system calls, and various GPU metrics.

We're now converting Whisper to use TorchScript with TensorRT and some other CUDA optimizations, such as pinning memory. But again, this is all for CUDA-enabled devices. We think we're going to get about a 3x-5x speedup, but until it's completed this is just a theory based on quick back-of-the-napkin calculations.
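For context, the pinned-memory trick in LibTorch looks roughly like this - a generic sketch with a placeholder shape, unrelated to the actual conversion described above:

```cpp
// Generic sketch: pinned host memory + asynchronous host-to-device copy with
// LibTorch. Requires a CUDA-enabled LibTorch build; the shape is a placeholder.
#include <torch/torch.h>
#include <iostream>

int main() {
    if (!torch::cuda::is_available()) {
        std::cout << "CUDA not available" << std::endl;
        return 0;
    }

    // Allocate the host tensor in page-locked (pinned) memory so the
    // host-to-device copy can be performed asynchronously by the DMA engine.
    at::Tensor host = torch::randn({1, 80, 3000}).pin_memory();

    // non_blocking=true lets the copy overlap with other work on the stream.
    at::Tensor dev = host.to(torch::kCUDA, /*non_blocking=*/true);

    std::cout << dev.device() << std::endl;
    return 0;
}
```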

In any case, thank you so much for this C++ version; it was quite an inspiration and great work. I might come back to it if I feel we can squeeze out more performance gains by remaining entirely native but with the PyTorch libs and the other tricks mentioned above.

mattsta pushed a commit to mattsta/whisper.cpp that referenced this issue Apr 1, 2023
make LICENSE a link instead of code-formatted text