
evmone speedups #11

Closed
wants to merge 25 commits into from

Conversation

zac-williamson

Hi @chfast!

I've been tinkering around with your excellent evm interpreter, and have made a few tweaks and additions that speed up the benchmarks by 1.25x - 3x. Perhaps they would be of interest? This branch has the following changes/additions:

  • I changed the main loop to a direct threaded model that uses computed gotos. Partially to remove the overheads of function calls, partially to remove some conditional branches (e.g. the 'is this a basic block' check can be removed, more on that below)
  • I added a global memory manager that provides pointers to pre-allocated, zeroed out blocks of memory. The amount of allocated memory is large enough that any out-of-bounds access would trigger an out of gas error (current gas limit caps memory to ~8MB, and when a transaction creates/calls a contract, unused memory is freed to prevent excessive memory consumption). The upside of this is that all of the overheads from updating memory pages during interpretation are removed.
  • Changed the stack to a fixed-size array that is indexed by a pointer. Changes to the stack size will change the position of the pointer, instead of the size of the array. Some marginal speedups from not having to implicitly track the size of the stack (stack over/underflows are checked against the memory addresses that define the start and end of the stack)
  • Jump destinations are found by indexing a sparse array, instead of a search. I did a bit of profiling with valgrind and this doesn't seem to noticeably increase the # of cache misses
  • The main loop doesn't perform a conditional branch to check whether the instruction is at the start of a basic block. Instead, these instructions are indexed with a different jump label to normal opcodes
  • For conditional branches that perform error checking (e.g. gas accounting), I've made liberal use of __builtin_expect to define the happy path to be the branch that does not lead to an error state

The results have been quite interesting. On my machine (8th gen i7), the sha1_divs benchmark is the weakest performer, with only a ~25% speed increase. The blake2b benchmark, on the other hand, is ~3x faster at the top end, chewing through 3.56 billion gas per second for the blake2b_shifts/65536 benchmark.

If you would like to integrate these changes into evmone, I'd be happy to fix up anything that you think needs attention.

@chfast
Member

chfast commented Apr 26, 2019

Hey, this looks great! I didn't expect this to happen so quickly. I will definitely integrate your changes, but it will probably take me some time.

To clarify one thing: there is some code duplication in the implementation (especially around calls). It was done on purpose, because what I want to focus on now is adding more unit tests to have full code coverage.

@gcolvin

gcolvin commented Apr 26, 2019

Nice, @zac-williamson. The first four were on my to-do-if-I-ever-had-time list. I think Geth handles the fourth one with a small (<=4K) bitmap, if I understand you.

@gcolvin

gcolvin commented Apr 26, 2019

I think the one optimization I still saw to do (and maybe it's in there but I missed it) is to pre-swap the PUSH constants.

@zac-williamson
Author

I think the one optimization I still saw to do (and maybe it's in there but I missed it) is to pre-swap the PUSH constants.

Heya, thanks for the feedback! I did squeeze that optimization in too; I should have documented it. It's in analysis.cpp, but it's a wee bit of a kludge, because I wanted to weasel out of declaring the program stack to be of type uint256, to avoid the default constructor being called on every stack element.

@gcolvin left a comment
I see a possible DoS vulnerability, and fixing it might be a performance improvement.

///
/// The deque container is used because pointers to its elements are not
/// invalidated when the container grows.
std::deque<bytes32> args_storage;
@gcolvin May 1, 2019

How big can this container get? Just the amount of push data in the program? Could a flat array be used to avoid the overhead (and possible DoS vulnerability) of intermittently growing the deque? Or is setting aside that much memory, potentially almost the whole program, not worth it?

@chfast
Member

chfast commented Jun 23, 2019

The 0.1 version of evmone has been tagged to be used as the baseline for future optimizations.
I also described a first one in #72.

@chfast
Member

chfast commented Jul 24, 2019

@zac-williamson I'd like to discuss some changes you have here.

@chfast
Member

chfast commented Aug 20, 2022

After all these years, I'm finally getting threaded code support: #495.

@chfast chfast closed this Aug 20, 2022