back.pysim performance improvements #348

Closed
wants to merge 2 commits

Conversation

sjolsen
Contributor

@sjolsen sjolsen commented Apr 7, 2020

I'm not sure how high a priority the Python simulation backend is or whether it's worth making the structural changes to hdl.ast to optimize hashing, but I figured I would share what I've figured out so far in case it's useful. Please let me know what you think.

The changes here fall roughly into the following categories:

  • Fixing what I believe to be the intent behind curr/next comparisons, and avoiding doing work where (I think) it's known there is no work to be done
  • Optimizing the hashing of Value/Signal/Statement key objects, since this is done in a few very frequently traversed code paths
  • Caching compilation results (from hdl.ast to what I believe is Python bytecode)
  • Micro-optimizations. I think the only one that made it through was the VCD hash table lookup, and it's pretty marginal at this point

I used the following simulation to make performance measurements: sim_perf_test.py. I have perf data for each commit in the chain, both with and without VCD output enabled. I can see about uploading them somewhere if there's interest, but in the meantime here's a summary:

With no VCD output:

| Commit | Description | Time elapsed (s) | Reduction (incr.) | Reduction (cum.) |
| --- | --- | --- | --- | --- |
| bb1bbcc | hdl.mem: fix source location of ReadPort.en. | 116.8 | -- | -- |
| 52e30bc | back.pysim: Fix curr=next optimizations | 90.77 | 22.29% | 22.29% |
| 6902c99 | hdl.ast: Intern value/statement keys off the hot path | 80.74 | 11.05% | 30.87% |
| 642e477 | back.pysim: Cache compilation results | 32.17 | 60.16% | 72.46% |
| d1fe3c6 | back.pysim: Reuse clock simulation commands | 25.29 | 21.39% | 78.35% |
| 6540568 | back.pysim: Eliminate duplicate dict lookup in VCD update | 24.88 | 1.62% | 78.70% |

With VCD output enabled:

| Commit | Description | Time elapsed (s) | Reduction (incr.) | Reduction (cum.) |
| --- | --- | --- | --- | --- |
| bb1bbcc | hdl.mem: fix source location of ReadPort.en. | 152.6 | -- | -- |
| 52e30bc | back.pysim: Fix curr=next optimizations | 127.4 | 16.51% | 16.51% |
| 6902c99 | hdl.ast: Intern value/statement keys off the hot path | 118 | 7.38% | 22.67% |
| 642e477 | back.pysim: Cache compilation results | 66.94 | 43.27% | 56.13% |
| d1fe3c6 | back.pysim: Reuse clock simulation commands | 57.16 | 14.61% | 62.54% |
| 6540568 | back.pysim: Eliminate duplicate dict lookup in VCD update | 53.25 | 6.84% | 65.10% |

I did run the nMigen unit tests, but I haven't figured out yet how to set up sby on Windows so I don't know whether those tests are broken by my changes.

@whitequark
Member

whitequark commented Apr 7, 2020

> Fixing what I believe to be the intent behind curr/next comparisons

Curr/next comparisons implement the two-phase simulation algorithm that is deterministic in the presence of asynchronous code. I looked at your fix and it will lead to incorrect simulation results. I can explain the algorithm if necessary.

> Optimizing the hashing of Value/Signal/Statement key objects, since this is done in a few very frequently traversed code paths

I looked at your changes, and while the intent seems reasonable, the patch seems fairly invasive at first glance. I'll take another look later.

> Caching compilation results (from hdl.ast to what I believe is Python bytecode)

Extremely good change, one I had been meaning to make for some time but never got around to.

> Micro-optimizations

These all look reasonable as well.


To summarize, 642e477, d1fe3c6 and 6540568 are mergeable as-is, 6902c99 needs more discussion, and 52e30bc is unsound.

@codecov

codecov bot commented Apr 7, 2020

Codecov Report

Merging #348 into master will increase coverage by 0.47%.
The diff coverage is 84.47%.


```diff
@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   82.69%   83.17%   +0.47%
==========================================
  Files          35       35
  Lines        5919     5991      +72
  Branches     1201     1179      -22
==========================================
+ Hits         4895     4983      +88
+ Misses        863      859       -4
+ Partials      161      149      -12
```
| Impacted Files | Coverage Δ |
| --- | --- |
| nmigen/hdl/ast.py | 90.97% <82.87%> (+2.44%) ⬆️ |
| nmigen/back/pysim.py | 91.16% <100.00%> (+0.10%) ⬆️ |
| nmigen/hdl/rec.py | 98.73% <100.00%> (+0.03%) ⬆️ |
| nmigen/build/run.py | 31.25% <0.00%> (ø) |


```diff
@@ -124,6 +124,14 @@ def cast(obj):
         return Const(obj.value, Shape.cast(type(obj)))
     raise TypeError("Object {!r} cannot be converted to an nMigen value".format(obj))
+    @staticmethod
+    def castable(obj):
```
Member

@whitequark whitequark Apr 7, 2020

A patch dedicated to optimization should not introduce a new public API.

@sjolsen
Contributor Author

sjolsen commented Apr 7, 2020

Thanks!

> Curr/next comparisons implement the two-phase simulation algorithm that is deterministic in presence of asynchronous code. I looked at your fix and it will lead to incorrect simulation results. I can explain the algorithm if necessary.

I'm not sure I understand, but it's easy enough to revert. Would you prefer I modify the history to keep it linear or layer revert commits etc. on top of what's already here?

> I looked at your changes and while the intent seems reasonable, the patch seems fairly invasive at first glance. I'll take another look later.

Yeah, I wasn't entirely sure how to make this less invasive. A big part of the performance gain was from eliminating glue logic from what was previously ValueKey/SignalKey.

> To summarize, 642e477, d1fe3c6 and 6540568 are mergeable as-is, 6902c99 needs more discussion, and 52e30bc is unsound.

As it is, 642e477 depends on 6902c99 -- that was the motivation for the changes to ValueKey. I actually started with repr(value) as the key, but there was still a fair bit of overhead there.

> A patch dedicated to optimization should not introduce a new public API.

Should that specific method be named _castable? Or should it go somewhere else?

@whitequark
Member

whitequark commented Apr 7, 2020

> I'm not sure I understand, but it's easy enough to revert.

The basic idea is that:

  1. eval only looks at curr and only modifies next;
  2. commit is the only one who can assign from next to curr;
  3. the eval/commit loop continues while there are any changes.

The reason it's arranged like that is that rules (1) and (2) ensure that no matter in which order eval and commit are run (i.e. no matter how any particular netlist is translated, how it's arranged into modules, etc.), each full eval/commit cycle will have the exact same result. Rule (3), by iterating to a fixed point, ensures that all updates in a netlist fully propagate before time advances. For example, consider a case where two domains with asynchronous reset are chained and you reset the first one: this requires two delta cycles (eval/commit calls).

No part of this scheme is an "optimization"; it's solely used for correctness.
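The scheme above can be sketched in a few lines (illustrative names only, not the actual back.pysim code):

```python
# Two-phase simulation sketch. Rules: eval reads only .curr and writes
# only .next; commit moves .next into .curr; loop until a fixed point.
class SignalState:
    def __init__(self, init=0):
        self.curr = init
        self.next = init

def delta_cycles(signals, eval_funcs):
    """Run eval/commit cycles until no signal changes."""
    while True:
        # Phase 1: every eval sees the same .curr snapshot, so the order
        # of eval calls cannot affect the outcome.
        for fn in eval_funcs:
            fn(signals)
        # Phase 2: commit next into curr; keep looping while anything changed.
        changed = False
        for sig in signals.values():
            if sig.next != sig.curr:
                sig.curr = sig.next
                changed = True
        if not changed:
            return

# Chained updates (like chained async resets) need multiple delta cycles
# to fully propagate before time advances:
sigs = {"a": SignalState(1), "b": SignalState(0), "c": SignalState(0)}
delta_cycles(sigs, [
    lambda s: setattr(s["b"], "next", s["a"].curr),
    lambda s: setattr(s["c"], "next", s["b"].curr),
])
```

In this toy run, propagating `a` through `b` to `c` takes two delta cycles, matching the chained-reset example above.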

> Would you prefer I modify the history to keep it linear or layer revert commits etc. on top of what's already here?

I strongly prefer keeping the history linear.

> Yeah, I wasn't entirely sure how to make this less invasive. A big part of the performance gain was from eliminating glue logic from what was previously ValueKey/SignalKey.
>
> Should that specific method be named _castable? Or should it go somewhere else?

I'm not sure yet. I'll need to understand your proposed changes in detail to see if there is a less invasive way to do the same thing, or if there's really no other choice.

@sjolsen
Contributor Author

sjolsen commented Apr 7, 2020

I reverted the unsound curr/next change and rearranged the interning API to conserve the existing ValueKey and SignalKey classes. I think all the new abstractions should now be hidden.

@whitequark
Member

whitequark commented Apr 7, 2020

Thanks. I've cherry-picked the latter two commits. I expect it will take a while to convince myself that the proposed interning changes are the way forward. I appreciate that you rearranged them to have no impact on the public API, but the bulk of my concern has to do with the increased internal complexity, as well as soundness in the face of mutability of AST nodes: I am hesitant to introduce any sort of persistent hashing while the properties of most AST nodes remain writable.

Have you considered adding fast paths to the simulator for the case where bench code operates on simple signals alone? This removes the need for caching bytecode, is an easy-to-understand change with minimal impact, and should yield most of the performance benefit.

@sjolsen
Contributor Author

sjolsen commented Apr 7, 2020

> I expect it will take a while to convince myself that the proposed interning changes are the way forward

Understood. I'm not necessarily expecting to get this specific design upstreamed, it's just where I ended up after probing for performance improvements.

> the bulk of my concern has to do with the increased internal complexity as well as soundness in face of mutability of AST nodes: I am hesitant to introduce any sort of persistent hashing while properties of most AST nodes remain writable.

That makes sense. Do you anticipate the AST becoming immutable?

> Have you considered adding fast paths to the simulator, for the case where bench code operates on simple signals alone? This removes the need for caching bytecode, is an easy to understand change with minimal impact, and should yield most of the performance benefit.

I'm not sure I follow. Are you thinking along the lines of falling back to a simpler scheduler if only the default sync and comb domains are used? Or did you have something else in mind?

@sjolsen
Contributor Author

sjolsen commented Apr 7, 2020

> The basic idea is that [...]

Thank you for explaining this, BTW. The eval/commit/loop approach makes perfect sense to me. I think I see at least one way the unsoundness of 52e30bc could manifest: if one invocation sets next != curr, and a second invocation sets it back to curr, it will be left with the wrong value for next. That's important because a single fragment could set next repeatedly.

I retraced my steps and realized that the call to pending.clear was the only part of that commit that actually mattered for performance. The rest of the changes were a result of trying to micro-optimize _SignalState.commit. If I'm not (still) very much mistaken, calling pending.clear should be correct since at that point in _SimulatorState.commit, every pending update from the current eval/commit loop has been effected, and the next eval step will set pending if necessary. What do you think?

@whitequark
Member

whitequark commented Apr 7, 2020

> Do you anticipate the AST becoming immutable?

This is a bit of a complex issue. At the very beginning, I designed nMigen so that it never mutates ASTs (and it still doesn't). When it needs to change an AST, it emits a new one with the necessary elements changed (the FragmentTransformer stuff). I knew this could be a performance issue when writing it, but I didn't realize just how slow it would be.

This slowness could be fixed in two ways: a simpler but more flawed one, which is to copy the entire AST once, eliminating any sharing, and then mutate it; and a more complex but principled one, which is to not change ASTs in the first place (with or without mutation), and instead treat ClockSignal, ResetSignal, domain names, etc. similarly to variable bindings. That is, instead of having DomainRenamer alter the actual AST, it would alter a binding table carried along with it, with this table specifying which domain becomes which after every transformation is applied.

The principled approach potentially has very significant benefits. It would dramatically improve performance on all designs, but especially large generic ones. It would eliminate multiple complicated corner cases from the internal machinery. It would bring the internals of nMigen closer to well-understood and well-behaved compiler development principles, namely using variable bindings for late-bound names (there's nothing wrong in principle with the term rewriting it's currently doing, but this term rewriting is very ad hoc and hard to analyze). It's also a lot of work.
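For illustration only, a binding-table approach might look something like this (hypothetical API, not a concrete design proposal for nMigen):

```python
# Hypothetical sketch of late-bound domain names resolved through a
# chain of binding tables, instead of rewriting the AST itself.
class DomainBindings:
    def __init__(self, parent=None, renames=None):
        self.parent = parent
        self.renames = renames or {}

    def resolve(self, name):
        # Apply this scope's rename, then defer to the enclosing scope,
        # so stacked renamers compose without touching any AST node.
        name = self.renames.get(name, name)
        return self.parent.resolve(name) if self.parent else name

# A DomainRenamer-like transform would push a new table rather than
# emit a rewritten fragment:
outer = DomainBindings(renames={"pix": "video"})
inner = DomainBindings(parent=outer, renames={"sync": "pix"})
```

Here `inner.resolve("sync")` walks the chain to yield `"video"`, with no AST node ever modified or copied.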

There is a separate issue, which is the user mutability of ASTs, whether interior or exterior. For most expression nodes (everything except Signal, ArrayProxy, and UserValue), this is clearly undesirable. For ArrayProxy and UserValue, it's just interior mutability, and there's already code to cope with it, which should be largely sufficient. For Signal, the story is a bit complex--I believe it's never been "officially" supported, but you can mutate Signals and the results are currently correct, provided you never use them in expressions before you stop mutating them. This is used internally in nMigen in the FSM generator code, so it seems like a valuable capability and we might want to enforce that invariant in a future immutable AST.

To summarize, I think the AST should be immutable, but we're not quite there yet, and there is at least one corner case that currently presents an API with a soundness issue that would likely need to be handled explicitly.

@whitequark
Member

whitequark commented Apr 7, 2020

> That's important because a single fragment could set next repeatedly.

Yup, and there are many cases where this happens--for example each time you assert reset.

> If I'm not (still) very much mistaken, calling pending.clear should be correct since at that point in _SimulatorState.commit, every pending update from the current eval/commit loop has been effected, and the next eval step will set pending if necessary. What do you think?

Could you show a patch that does what you want? I'm not quite sure exactly what change you're referring to, and it seems like preparing a patch isn't a lot of effort here.

@whitequark
Member

whitequark commented Apr 7, 2020

> back.pysim: Clear pending updates after they are effected

Oh... did I just completely forget to remove elements from self.pending anywhere at all? Well that would make things a lot slower than they should be!!

@whitequark
Member

whitequark commented Apr 7, 2020

I'm curious: could you measure the perf impact? I'm really glad you noticed this; it's a really silly mistake with severe consequences.

@sjolsen
Contributor Author

sjolsen commented Apr 8, 2020

I also cherry-picked it onto master:

```
* 555c23c8 back.pysim: Clear pending updates after they are effected
| * d891a794 back.pysim: Clear pending updates after they are effected
| * daa616f0 back.pysim: Eliminate duplicate dict lookup in VCD update
| * 1b60dbda ...
| * d698e6f4 ...
| * 3f466e89 ...
|/
* bb1bbcc5 hdl.mem: fix source location of ReadPort.en.
```

Here's what I got with VCD output disabled:

| Commit | Description | Time elapsed (s) | Reduction (incr.) | Reduction (cum.) |
| --- | --- | --- | --- | --- |
| bb1bbcc | hdl.mem: fix source location of ReadPort.en. | 152.6 | -- | -- |
| ... | ... | ... | ... | ... |
| daa616f | back.pysim: Eliminate duplicate dict lookup in VCD update | 47.24 | ... | 69.04% |
| d891a79 | back.pysim: Clear pending updates after they are effected | 25.05 | 46.97% | 83.58% |

| Commit | Description | Time elapsed (s) | Reduction (incr.) | Reduction (cum.) |
| --- | --- | --- | --- | --- |
| bb1bbcc | hdl.mem: fix source location of ReadPort.en. | 152.6 | -- | -- |
| 555c23c8 | back.pysim: Clear pending updates after they are effected | 87.41 | 42.72% | 42.72% |

@whitequark
Member

whitequark commented Apr 8, 2020

I've cherry-picked commit d891a79, thank you!

> I'm not sure I follow. Are you thinking along the lines of falling back to a simpler scheduler if only the default sync and comb domains are used? Or did you have something else in mind?

Something else. You'll notice that most simulator commands are along the lines of `yield sig` or `yield sig.eq(123)`, which means that by special-casing just these two patterns we can dramatically speed up simulation without any of the interning complexity.
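The suggested special-casing might look roughly like this (stand-in classes throughout; the real pysim command objects and state differ):

```python
# Fast-path sketch: dispatch the two overwhelmingly common command
# shapes directly, falling back to full statement compilation otherwise.
class Signal:
    def __init__(self, name):
        self.name = name
    def eq(self, value):
        return Assign(self, value)

class Assign:
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

class State:
    def __init__(self):
        self.values = {}

def run_command(state, cmd, compile_fallback):
    if isinstance(cmd, Signal):
        # `yield sig`: a bare read is a single dict lookup.
        return state.values.get(cmd, 0)
    if isinstance(cmd, Assign) and isinstance(cmd.rhs, int):
        # `yield sig.eq(123)`: a constant write is a single dict store.
        state.values[cmd.lhs] = cmd.rhs
        return None
    # Everything else takes the general (slow) compilation path.
    return compile_fallback(state, cmd)
```

The payoff of this shape is that neither fast path needs any key interning or bytecode caching at all.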

I think @jordens proposed (in #228) an improvement to the simulator interface that would make all commands except for these two impossible to express. I'm not yet fully convinced that this is the way to go, and we'll have to keep the existing code for Migen compatibility, but you can see that this special case is less narrow than it looks at first.

@sjolsen
Contributor Author

sjolsen commented Apr 9, 2020

I integrated the cherry-picked changes to clean up the PR in case you decide to do something with it. One thing that occurred to me that I haven't bothered to change is that the compilation cache should probably be scoped to the simulator rather than the coroutine process.

> Something else. You can notice that most simulator commands are along the lines of yield sig or yield sig.eq(123). Which means that by special-casing just these two patterns we can dramatically speed up simulation without any of the interning complexity.

That makes a lot of sense. Speaking of the simulator interface, something I've thought of requesting is a way to wait asynchronously for a signal change from a coroutine process. It sounds from #228 like you already have some idea of how you'd like to do that? FWIW I found Value hashing useful for hacking together (emphasis on "hack") an implementation using _FragmentCompiler.

@whitequark
Member

whitequark commented Apr 9, 2020

> It sounds from #228 like you already have some idea of how you'd like to do that?

Internally, this is easy to do. The external API needs to be designed.

@whitequark
Member

whitequark commented Apr 15, 2020

As per the discussion above, I am closing this PR; I would encourage you (or anyone else) to submit another PR with the less invasive approach I suggested above. If it turns out that (contrary to my expectations) the simpler approach does not bring similar performance benefits, then I'll reevaluate the approach in this PR. Thanks for doing this work!

@whitequark whitequark closed this Apr 15, 2020