State of multi core / parallel simulation #60
Hello, no, nothing has changed. I made some tests a long time ago, and the results weren't impressive. The conclusion was: before using multiple cores, first optimize for a single core. Thanks.
Not sure if this ticket should be re-opened, but seeing that 8-16 core CPUs are becoming mainstream this year, and 32-64 core CPUs will likely be common in high-end workstations a couple of years from now, there is huge potential for performance improvements if that CPU power could be utilized (or, put another way, it would be a huge waste not to use all that power). Due to the nature of the functionality that GHDL simulates, it should be possible to extract a healthy amount of parallelism from most VHDL designs.
Hi @mbitsnbites! Although it is not a built-in feature, it is currently possible to co-execute multiple binaries generated with GHDL. Assuming that the design can be split into submodules that are connected to each other through some standard interface, VHPIDIRECT can be used to transfer data between them. In the following VUnit script you can find an example of how to load and execute a GHDL binary in a Python thread: https://github.com/dbhi/vunit/blob/feat-external-arrays/examples/vhdl/external_buffer/run.py#L104-L118. Note that it is possible to achieve the same result with C and pthreads.
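For anyone who cannot follow the link, here is a minimal sketch of that pattern, assuming the testbenches have been elaborated as shared libraries that expose GHDL's `ghdl_main` entry point (as in the linked VUnit example); the library names and arguments below are hypothetical:

```python
import ctypes
import threading

def run_ghdl(lib_path, argv):
    """Load a GHDL-elaborated shared library and run its simulation."""
    ghdl = ctypes.CDLL(lib_path)
    # ghdl_main has a C-style signature: int ghdl_main(int argc, char **argv)
    c_argv = (ctypes.c_char_p * len(argv))(*[a.encode() for a in argv])
    return ghdl.ghdl_main(len(argv), c_argv)

# Run two independently generated simulations concurrently; data exchange
# between them would go through VHPIDIRECT as described above. ctypes
# releases the GIL during the foreign call, so they truly run in parallel.
threads = [
    threading.Thread(target=run_ghdl, args=(lib, ["tb"]))
    for lib in ("./tb_producer.so", "./tb_consumer.so")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```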
My use case is running a program on a custom VHDL CPU core. Splitting the CPU core into 10-100 binaries and manually hooking them up to run the pieces in parallel is not something that I'm keen on at this point. One of the programs that I run takes about 6-8 million simulated clock cycles, which translates to about 30-40 minutes of wall-clock time. If (as I suspect) there is enough parallelism in the design to scale over all the available cores of a modern workstation, that simulation time could be brought down to a couple of minutes, which would be a huge deal.
The issue with automatic parallelization is that the communication overhead between the cores can potentially cancel out the performance gain. It is quite difficult to do a proper partitioning of the design without any prior knowledge. When the system is split into 10-100 binaries, the important point is that the designer is doing the partitioning. Related to this, you can have a look at the following conversation on gitter: October 8, 2018 7:59 AM
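To make the overhead argument concrete with purely hypothetical numbers: if each simulation cycle involves 1 µs of work that is perfectly split across N cores, but the cores must also synchronize every cycle at a cost of 0.5 µs, the speedup is 1/(1/N + 0.5), which never reaches 2x no matter how large N gets. Unless the partitions are coarse enough that work dominates synchronization, the communication cost, not the core count, sets the ceiling.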
Nonetheless, there is also room for improvement in single-threaded performance.
Please don't take me wrong: I am very concerned about simulation time in GHDL, and I would love to have multi-core support that provides a significant performance improvement. However, we need to be realistic with regard to both the number of people involved in this project and their expertise. Last, but not least, @tgingold, @Paebbels, what do you think about reopening this issue (or opening a new one) just to keep track of this (non-)feature? It is a legitimate question that I expect to come up frequently.
Thanks for the feedback. I'm already using the LLVM backend, so I'll be sure to check out the reference you pointed to about optimization passes. Since I don't have any knowledge of the GHDL internals, I am not going to suggest any technical solutions. Anything based on frequent locking is likely going to kill performance, as @Paebbels suggests, so you may need some other construct. If the architecture isn't built with massive parallelism as a key design goal from the start, it can be hard to retrofit an efficient model later.
Related to this, multi-thread support was recently added to Verilator. The following video might be inspiring for anyone willing to think about how this could be done: https://www.youtube.com/watch?v=en8JMz7v3LU (from 7:30 onward).
This is a matter of effort and priority. I am convinced that before going parallel, a program must already be optimized for sequential execution.
It depends. If your workload is purely computation-bound, then optimizing for a single core makes sense. But especially for big designs with a lot of memory, multi-threading might help, as loading data from memory gets parallelized. I can totally understand, though, that there is no multi-thread support, and might never be, given the amount of work involved.
I respect that line of reasoning (again, especially since I'm not familiar with the GHDL internals). However, it is also easy to argue that since multi-threaded execution can potentially give a many-fold performance increase (as demonstrated by Verilator), you'd have to have a pretty aggressive plan to match that performance win with single-threaded execution improvements. In other words: which path gives you the biggest win per invested development hour? I may be biased, but for most things that I do (tuning build systems for CI, implementing raytracers, and optimizing software in general) I tend to spend a lot of time in a CPU load monitor (e.g. htop), and whenever I see unused cores I get an itch to fix it.
This is a question/enhancement: I wanted to know what the state of GHDL is with respect to simulations on multi-threading/multi-core machines.
I found a discussion about this topic that is about 8 years old, and just wanted to know whether anything has changed.
Thank you for all your work Tristan!