
State of multi core / parallel simulation #60

Closed
StefanD986 opened this issue Apr 21, 2016 · 10 comments

@StefanD986

This is a question/enhancement: I wanted to know what the state of GHDL is with respect to simulation on multi-threading/multi-core machines.
I found a discussion about the topic that is about 8 years old, and just wanted to know whether anything has changed since.

Thank you for all your work, Tristan!

@tgingold
Member

Hello,

no, nothing has changed. I ran some tests a long time ago, and the results weren't impressive. The conclusion was: before using multiple cores, first optimize for a single core.

Thanks,
Tristan.

@mbitsnbites

Not sure if this ticket should be re-opened, but seeing that 8-16 core CPUs are becoming mainstream this year, and 32-64 core CPUs will likely be common in high-end workstations a couple of years from now, there is huge potential for performance improvement if that CPU power could be utilized (put another way: it would be a huge waste not to use all that power).

Due to the nature of the functionality that GHDL simulates, it should be possible to extract a healthy amount of parallelism for most VHDL designs.

@umarcor
Member

umarcor commented May 7, 2019

Hi @mbitsnbites! Although it is not a built-in feature, it is currently possible to co-execute multiple binaries generated with GHDL. Assuming that the design can be split into some submodules which are connected to each other through some standard interface, VHPIDIRECT can be used to transfer data between them.

In the following VUnit script, you can find an example of how to load and execute a GHDL binary in a Python thread: https://github.com/dbhi/vunit/blob/feat-external-arrays/examples/vhdl/external_buffer/run.py#L104-L118. Note that it is possible to achieve the same result with C and pthreads.
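As a rough sketch of that approach (the commands below are stand-ins; in practice each would be the path to a GHDL-elaborated testbench binary), each simulation can be launched from its own Python thread so the partitions run on separate cores:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_binaries(commands):
    """Run each command in its own thread and collect return codes.

    Each command would typically be a GHDL-built simulation binary
    (e.g. ["./cpu_core_tb"], hypothetical name); subprocesses run
    concurrently, so the partitions execute on different cores.
    """
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run, commands))

# Stand-in commands; replace with paths to real GHDL binaries.
cmds = [[sys.executable, "-c", f"print('sim {i} done')"] for i in range(3)]
codes = run_binaries(cmds)
print(codes)  # [0, 0, 0] when every simulation exits cleanly
```

The inter-binary data transfer itself (via VHPIDIRECT, as described above) still has to be wired up separately; this only shows the co-execution side.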

@mbitsnbites

My use case is to run a program in a custom VHDL CPU core. Splitting the CPU core up into 10-100 binaries and manually hooking them up and running the pieces in parallel is not something that I'm keen on at this point.

One of the programs that I run takes about 6-8 million simulated clock cycles, which translates to about 30-40 minutes of wall clock time. If (as I suspect) there is enough parallelism in the design to scale over all the available cores on a modern workstation, that simulation time could be brought down to a couple of minutes, which would be a huge deal.
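Whether "a couple of minutes" is reachable depends heavily on how much of the simulation is actually serial; Amdahl's law gives a quick sanity check. A back-of-the-envelope sketch (the 95% parallel fraction and the 35-minute baseline are assumptions for illustration, not measurements):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: the serial fraction caps the achievable speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# 35 minutes of single-core wall-clock time, assuming 95% of the
# simulation work parallelizes perfectly (a pure assumption).
minutes = 35.0
for cores in (4, 16, 64):
    s = amdahl_speedup(0.95, cores)
    print(f"{cores:3d} cores: {s:5.2f}x speedup -> {minutes / s:5.1f} min")
```

Even under that optimistic assumption, 64 cores yield roughly a 15x speedup, not 64x; the serial fraction dominates quickly.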

@umarcor
Member

umarcor commented May 8, 2019

The issue with automatic parallelization is that the communication overhead between the cores can potentially cancel the performance gain. It is quite difficult to do a proper partitioning of the design without any prior knowledge. When the system is split to 10-100 binaries, the important point is that the designer is doing the partitioning.
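That overhead is easy to demonstrate with an analogy (this is plain Python, not GHDL internals): two tightly coupled "processes" exchange a value every cycle, first evaluated in a single loop, then split across two threads that synchronize through queues. Both compute the same result, but the threaded version pays two handshakes per cycle, which dwarfs the tiny computation:

```python
import queue
import threading
import time

CYCLES = 20_000

def single_thread():
    """Both 'processes' in one loop: no synchronization per cycle."""
    a = b = 0
    for _ in range(CYCLES):
        a, b = b + 1, a + 1   # each process samples the other's output
    return a

def two_threads():
    """Same computation split across two threads: every cycle
    costs two queue handshakes, dominating the actual work."""
    to_b, to_a = queue.Queue(), queue.Queue()
    out = {}

    def proc_a():
        a = 0
        for _ in range(CYCLES):
            to_b.put(a)           # publish output to the other process
            a = to_a.get() + 1    # wait for its output, then compute
        out["a"] = a

    def proc_b():
        b = 0
        for _ in range(CYCLES):
            to_a.put(b)
            b = to_b.get() + 1

    ta = threading.Thread(target=proc_a)
    tb = threading.Thread(target=proc_b)
    ta.start(); tb.start(); ta.join(); tb.join()
    return out["a"]

for fn in (single_thread, two_threads):
    t0 = time.perf_counter()
    result = fn()
    print(f"{fn.__name__}: result={result}, {time.perf_counter() - t0:.3f}s")
```

On a typical machine the threaded version is orders of magnitude slower for identical results, which is exactly why partitioning must keep tightly coupled processes on the same thread.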

Related to this, you can have a look at the following conversation on Gitter (October 8, 2018):

@1138-4eb:
While analyzing them, it seems that GHDL is CPU bound and single-threaded. So, the faster the clock frequency, the better the performance. Can either @tgingold or anyone else confirm this?

@tgingold:
Yes, simulation is CPU bound, unless you are dumping waveforms. Even if speed is improved, I think it will always be CPU bound.

@1138-4eb:
Thanks for the quick response. Is there any option to somehow improve it by extending GHDL to be multi-threaded? Or would that be too complicated (would it require rewriting GHDL)?

@tgingold:
In the past there was an option, but the speed-up was not significant.
Before going multithreaded, you'd better be fast in a single thread!

@Paebbels:
I think that to do multi-threaded simulation, you have to cluster the simulation into processes with high interaction (closely coupled) and processes with less coupling. If the coupling is too high, you spend more time in thread synchronization than in simulation, so processes with lots of signal interaction should run on one thread and processes with fewer signal connections should run on another thread.
I believe creating a graph of that system and partitioning the processes is not so easy. AND, using signals as a measure of coupling is not accurate: you can have lots of signals which will never update ...
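To make that partitioning idea concrete, here is a toy sketch (the process names and coupling weights are invented, and real activity would have to be measured, not guessed from signal counts): processes are nodes, edge weights approximate shared-signal activity, and a small union-find merges any pair coupled above a threshold so they end up on the same thread:

```python
# Toy coupling graph: processes as nodes, activity on shared signals
# as edge weights. All names and numbers are hypothetical.
coupling = {
    ("fetch", "decode"): 50, ("decode", "execute"): 40,
    ("execute", "mem"): 30, ("uart_tx", "uart_rx"): 5,
    ("mem", "uart_tx"): 1,
}

def greedy_partition(edges, threshold):
    """Merge processes whose coupling meets `threshold` into one
    cluster (union-find), so tightly coupled processes share a thread."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Merge heaviest edges first.
    for (u, v), w in sorted(edges.items(), key=lambda e: -e[1]):
        if w >= threshold:
            parent[find(u)] = find(v)

    clusters = {}
    for u, v in edges:
        for n in (u, v):
            clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())

print(greedy_partition(coupling, threshold=10))
# [['decode', 'execute', 'fetch', 'mem'], ['uart_rx'], ['uart_tx']]
```

The CPU pipeline stages land in one cluster while the weakly coupled UART processes stay separate; the hard part in practice is obtaining accurate weights, exactly as noted above.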


Nonetheless, there is room for improvement in single-thread:

  • Which backend are you using? 'mcode' is known to analyze faster and simulate slower than GCC/LLVM. Hence, you could try GCC or LLVM.
  • Moreover, LLVM allows enabling multiple optimization passes, and only a few of them are used at the moment. If you know anything about compilers, you could try to enable a custom (better) set of passes. See #667.
  • A different strategy that might help you accelerate the design is using variables instead of signals, when possible. See Large array crashes ghdl #812.

Please don't get me wrong. I am very concerned about simulation time in GHDL, and I would love to have multi-core support that provides a significant performance improvement. However, we need to be realistic about both the number of people involved in this project and their expertise.


Last, but not least, @tgingold, @Paebbels, what do you think about reopening this issue (or opening a new one) just to keep track of this (non-)feature? It is a legitimate question that I expect to come up frequently.

@mbitsnbites

Thanks for the feedback. I'm already using the LLVM backend, so I'll be sure to check out the reference you pointed to about optimization passes.

Since I don't have any knowledge about the GHDL internals, I am not going to suggest any technical solutions. Anything based on frequent locking is likely going to kill performance, as @Paebbels suggests, so you may need some other construct. If the architecture isn't built with massive parallelism as a key design goal from the start, it can be hard to retrofit an efficient model later.

@umarcor
Member

umarcor commented May 8, 2019

If the architecture isn't built with massive parallelism as a key design goal from the start, it can be hard to retrofit an efficient model later.

Related to this, multi-thread support was recently added to Verilator. The following video might be inspiring for anyone willing to think about how this could be done: https://www.youtube.com/watch?v=en8JMz7v3LU (min 7:30 and onward).

@tgingold
Member

tgingold commented May 10, 2019 via email

@go2sh

go2sh commented May 10, 2019

It depends. If your problem is just raw computation, then optimizing for a single core makes sense. But especially for big designs with a lot of memory, multi-threading might help, as loading data from memory gets parallelized.

That said, I can totally understand that there is not, and might never be, multi-thread support, given the amount of work involved.

@mbitsnbites

I respect that line of reasoning (again, especially since I'm not familiar with the GHDL internals).

However, it is also easy to argue that since multi-threaded execution can potentially give a many-fold performance increase (as demonstrated by Verilator), you would need a pretty aggressive plan to match that gain with single-threaded improvements alone. In other words: which path gives the biggest win per invested development hour?

I may be biased, but for most things that I do (tuning build systems for CI, implementing raytracers, and optimizing software in general) I tend to spend a lot of time in a CPU load monitor (e.g. htop), and whenever I see unused cores I get an itch to fix it.
