
State of multi core / parallel simulation #60

Closed
StefanD986 opened this issue Apr 21, 2016 · 10 comments

@StefanD986

This is a question/enhancement: I wanted to know what the state of GHDL is with respect to simulation on multi-threading/multi-core machines.
I found a discussion about the topic that is about 8 years old, and just wanted to know whether anything has changed since.

Thank you for all your work, Tristan!

@tgingold
Member

Hello,

no, nothing has changed. I ran some tests a long time ago, and the results weren't impressive. The conclusion was: before using multiple cores, first optimize for a single core.

Thanks,
Tristan.

@mbitsnbites

Not sure if this ticket should be re-opened, but seeing that 8-16 core CPUs are becoming mainstream this year, and 32-64 core CPUs will likely be common in high-end workstations a couple of years from now, there is huge potential for performance improvement if that CPU power could be utilized (put another way: it would be a huge waste not to use all that power).

Due to the nature of the functionality that GHDL simulates, it should be possible to extract a healthy amount of parallelism for most VHDL designs.

@umarcor
Member

umarcor commented May 7, 2019

Hi @mbitsnbites! Although it is not a built-in feature, it is currently possible to co-execute multiple binaries generated with GHDL. Assuming that the design can be split into some submodules which are connected to each other through some standard interface, VHPIDIRECT can be used to transfer data between them.

In the following VUnit script, you can find an example of how to load and execute a GHDL binary in a Python thread: https://github.com/dbhi/vunit/blob/feat-external-arrays/examples/vhdl/external_buffer/run.py#L104-L118. Note that it is possible to achieve the same result with C and pthreads.
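As a rough sketch of that approach (the commands below are stand-ins; in practice each would be the path to a GHDL-elaborated testbench binary), each simulation can be launched from its own Python thread so the partitions run on separate cores:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_binaries(commands):
    """Run each command in its own thread and collect return codes.

    Each command would typically be a GHDL-built simulation binary
    (e.g. ["./cpu_core_tb"], hypothetical name); subprocesses run
    concurrently, so the partitions execute on different cores.
    """
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run, commands))

# Stand-in commands; replace with paths to real GHDL binaries.
cmds = [[sys.executable, "-c", f"print('sim {i} done')"] for i in range(3)]
codes = run_binaries(cmds)
print(codes)  # [0, 0, 0] when every simulation exits cleanly
```

The inter-binary data transfer itself (via VHPIDIRECT, as described above) still has to be wired up separately; this only shows the co-execution side.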

@mbitsnbites

My use case is to run a program in a custom VHDL CPU core. Splitting the CPU core up into 10-100 binaries and manually hooking them up and running the pieces in parallel is not something that I'm keen on at this point.

One of the programs that I run takes about 6-8 million simulated clock cycles, which translates to about 30-40 minutes of wall clock time. If (as I suspect) there is enough parallelism in the design to scale over all the available cores on a modern workstation, that simulation time could be brought down to a couple of minutes, which would be a huge deal.
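Whether "a couple of minutes" is reachable depends heavily on how much of the simulation is actually serial; Amdahl's law gives a quick sanity check. A back-of-the-envelope sketch (the 95% parallel fraction and the 35-minute baseline are assumptions for illustration, not measurements):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: the serial fraction caps the achievable speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# 35 minutes of single-core wall-clock time, assuming 95% of the
# simulation work parallelizes perfectly (a pure assumption).
minutes = 35.0
for cores in (4, 16, 64):
    s = amdahl_speedup(0.95, cores)
    print(f"{cores:3d} cores: {s:5.2f}x speedup -> {minutes / s:5.1f} min")
```

Even under that optimistic assumption, 64 cores yield roughly a 15x speedup, not 64x; the serial fraction dominates quickly.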

@umarcor
Member

umarcor commented May 8, 2019

The issue with automatic parallelization is that the communication overhead between the cores can potentially cancel the performance gain. It is quite difficult to do a proper partitioning of the design without any prior knowledge. When the system is split to 10-100 binaries, the important point is that the designer is doing the partitioning.
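That overhead is easy to demonstrate with an analogy (this is plain Python, not GHDL internals): two tightly coupled "processes" exchange a value every cycle, first evaluated in a single loop, then split across two threads that synchronize through queues. Both compute the same result, but the threaded version pays two handshakes per cycle, which dwarfs the tiny computation:

```python
import queue
import threading
import time

CYCLES = 20_000

def single_thread():
    """Both 'processes' in one loop: no synchronization per cycle."""
    a = b = 0
    for _ in range(CYCLES):
        a, b = b + 1, a + 1   # each process samples the other's output
    return a

def two_threads():
    """Same computation split across two threads: every cycle
    costs two queue handshakes, dominating the actual work."""
    to_b, to_a = queue.Queue(), queue.Queue()
    out = {}

    def proc_a():
        a = 0
        for _ in range(CYCLES):
            to_b.put(a)           # publish output to the other process
            a = to_a.get() + 1    # wait for its output, then compute
        out["a"] = a

    def proc_b():
        b = 0
        for _ in range(CYCLES):
            to_a.put(b)
            b = to_b.get() + 1

    ta = threading.Thread(target=proc_a)
    tb = threading.Thread(target=proc_b)
    ta.start(); tb.start(); ta.join(); tb.join()
    return out["a"]

for fn in (single_thread, two_threads):
    t0 = time.perf_counter()
    result = fn()
    print(f"{fn.__name__}: result={result}, {time.perf_counter() - t0:.3f}s")
```

On a typical machine the threaded version is orders of magnitude slower for identical results, which is exactly why partitioning must keep tightly coupled processes on the same thread.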

Related to this, you can have a look at the following conversation on Gitter (October 8, 2018):

@1138-4eb:
While analyzing them, it seems that GHDL is CPU bound and single-threaded. So, the faster the clock frequency, the better the performance. Can either @tgingold or anyone else confirm this?

@tgingold:
Yes, simulation is CPU bound, unless you are dumping waveforms. Even if speed is improved, I think it will always be CPU bound.

@1138-4eb:
Thanks for the quick response. Is there any option to somehow improve it by extending GHDL to be multi-threaded? Or would that be too complicated (would it require rewriting GHDL)?

@tgingold:
In the past there was an option, but the speed-up was not significant.
Before going multithreaded, you'd better be fast in a single thread!

@Paebbels:
I think that to do multi-threaded simulation, you have to cluster the simulation into processes with high interaction (closely coupled) and processes with less coupling. If the coupling is too high, you spend more time in thread synchronization than in simulation, so processes with lots of signal interaction should run on one thread and processes with fewer signal connections should run on another thread.
I believe creating a graph of that system and partitioning the processes is not so easy. AND, using signals as a measure of coupling is not accurate: you can have lots of signals which will never update ...
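To make that partitioning idea concrete, here is a toy sketch (the process names and coupling weights are invented, and real activity would have to be measured, not guessed from signal counts): processes are nodes, edge weights approximate shared-signal activity, and a small union-find merges any pair coupled above a threshold so they end up on the same thread:

```python
# Toy coupling graph: processes as nodes, activity on shared signals
# as edge weights. All names and numbers are hypothetical.
coupling = {
    ("fetch", "decode"): 50, ("decode", "execute"): 40,
    ("execute", "mem"): 30, ("uart_tx", "uart_rx"): 5,
    ("mem", "uart_tx"): 1,
}

def greedy_partition(edges, threshold):
    """Merge processes whose coupling meets `threshold` into one
    cluster (union-find), so tightly coupled processes share a thread."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Merge heaviest edges first.
    for (u, v), w in sorted(edges.items(), key=lambda e: -e[1]):
        if w >= threshold:
            parent[find(u)] = find(v)

    clusters = {}
    for u, v in edges:
        for n in (u, v):
            clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())

print(greedy_partition(coupling, threshold=10))
# [['decode', 'execute', 'fetch', 'mem'], ['uart_rx'], ['uart_tx']]
```

The CPU pipeline stages land in one cluster while the weakly coupled UART processes stay separate; the hard part in practice is obtaining accurate weights, exactly as noted above.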


Nonetheless, there is room for improvement in single-thread:

  • Which backend are you using? 'mcode' is known to analyze faster and simulate slower than GCC/LLVM. Hence, you could try GCC or LLVM.
  • Moreover, LLVM allows enabling multiple optimization passes, and only a few of them are used at the moment. If you know anything about compilers, you could try to enable a custom (better) set of passes. See #667.
  • A different strategy that might help you accelerate the design is using variables instead of signals, when possible. See Large array crashes ghdl #812.

Please don't get me wrong. I am very concerned about simulation time in GHDL, and I would love to have multi-core support that provides a significant performance improvement. However, we need to be realistic about both the number of people involved in this project and their expertise.


Last, but not least, @tgingold, @Paebbels, what do you think about reopening this issue (or opening a new one) just to keep track of this (non-)feature? It is a legitimate question that I expect to come up frequently.

@mbitsnbites

Thanks for the feedback. I'm already using the LLVM backend, so I'll be sure to check out the reference you pointed to about optimization passes.

Since I don't have any knowledge about the GHDL internals, I am not going to suggest any technical solutions. Anything based on frequent locking is likely going to kill performance, as @Paebbels suggests, so you may need some other construct. If the architecture isn't built with massive parallelism as a key design goal from the start, it can be hard to retrofit an efficient model later.

@umarcor
Member

umarcor commented May 8, 2019

If the architecture isn't built with massive parallelism as a key design goal from the start, it can be hard to retrofit an efficient model later.

Related to this, multi-thread support was recently added to Verilator. The following video might be inspiring for anyone willing to think about how this could be done: https://www.youtube.com/watch?v=en8JMz7v3LU (min 7:30 and onward).

@tgingold
Member

tgingold commented May 10, 2019 via email

@go2sh

go2sh commented May 10, 2019

It depends. If your problem is just raw computation, then optimizing for a single core makes sense. But especially for big designs with a lot of memory, multi-threading might help, as loading data from memory gets parallelized.

That said, I can totally understand that there is not, and might never be, multi-thread support, given the amount of work involved.

@mbitsnbites

I respect that line of reasoning (again, especially since I'm not familiar with the GHDL internals).

However, it is also easy to argue that since multi-threaded execution can potentially give a many-fold performance increase (as demonstrated by Verilator), you would need a pretty aggressive plan to match that gain with single-threaded improvements alone. In other words: which path gives the biggest win per invested development hour?

I may be biased, but for most things that I do (tuning build systems for CI, implementing raytracers, and optimizing software in general) I tend to spend a lot of time in a CPU load monitor (e.g. htop), and whenever I see unused cores I get an itch to fix it.
