
Shared memory between calls #673

Merged: 23 commits merged into bluealloy:main on Oct 11, 2023

Conversation

@thedevbirb (Contributor) commented on Aug 31, 2023

With this PR I tried to give #445 a shot, just for fun and as a challenge.
It introduces a new struct called SharedMemory which is, indeed, shared between calls.

About the implementation

  • its API almost completely matches the 'old' interpreter Memory implementation
  • memory space is allocated up front using this estimate: https://2π.com/22/eth-max-mem/
  • it has a pointer current_slice which is used internally to refer to the portion of data reserved for the current context. This requires two unsafe methods, get_current_slice and get_current_slice_mut, which dereference the raw pointer
  • the current_slice pointer is updated when entering a new context (i.e. when a new Interpreter instance is created) via the new_context_memory method, and when exiting it, which happens when the return_value method of Interpreter is called (a rough sketch of this layout follows the list)
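
A minimal sketch of the layout described above; apart from current_slice, get_current_slice and get_current_slice_mut, the names and fields are assumptions for illustration, not the exact PR code:

struct SharedMemory {
    /// One flat buffer shared by every call in the transaction, sized up
    /// front from the tx gas limit (memory expansion costs 3*words +
    /// words^2/512 gas, so the reachable size is bounded by the limit).
    data: Vec<u8>,
    /// Start of the region reserved for the currently executing context.
    current_slice: *mut u8,
    /// Length of the current context's region.
    current_len: usize,
}

impl SharedMemory {
    /// Immutable view of the current context's memory.
    ///
    /// SAFETY: current_slice must point into data and the region
    /// current_slice..current_slice + current_len must stay in bounds.
    unsafe fn get_current_slice(&self) -> &[u8] {
        core::slice::from_raw_parts(self.current_slice, self.current_len)
    }

    /// Mutable view of the current context's memory (same safety contract).
    unsafe fn get_current_slice_mut(&mut self) -> &mut [u8] {
        core::slice::from_raw_parts_mut(self.current_slice, self.current_len)
    }
}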

Performance

I'd like some feedback on this because:

  • I'm probably doing something that isn't well optimized: some results are very good, others are awful
  • maybe the bench code is not ideal for a shared memory: for example, if a tx has a high gas limit (I had to lower it in the benches) but no memory-related operations, performance is worse since I'm allocating for no reason. If I'm not mistaken, we always use the same bytecode for the benches, which may not give entirely accurate performance indications for real-world usage of this shared memory

Anyway, this is the result of running cargo bench --all on main and then on my branch:

analysis/transact/raw   time:   [8.5066 µs 8.6215 µs 8.7831 µs]
                        change: [+14.721% +16.955% +19.127%] (p = 0.00 < 0.05)
                        Performance has regressed.
analysis/transact/checked
                        time:   [8.2738 µs 8.3773 µs 8.5572 µs]
                        change: [+11.677% +14.695% +18.139%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
analysis/transact/analysed
                        time:   [6.2713 µs 6.3171 µs 6.3475 µs]
                        change: [+8.6451% +11.185% +13.377%] (p = 0.00 < 0.05)
                        Performance has regressed.

snailtracer/transact/analysed
                        time:   [5.4007 µs 5.4421 µs 5.5240 µs]
                        change: [-99.992% -99.992% -99.992%] (p = 0.00 < 0.05)
                        Performance has improved.
snailtracer/eval        time:   [2.9665 ms 2.9938 ms 3.0241 ms]
                        change: [-94.980% -94.892% -94.818%] (p = 0.00 < 0.05)
                        Performance has improved.

transfer/transact/analysed
                        time:   [5.2016 µs 5.2170 µs 5.2350 µs]
                        change: [+548.06% +556.75% +566.60%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high severe

I also tried to run cachegrind as explained in #582 and it is indeed slower, by about 10% or less. The required gas limit is between 2^22 and 2^23. Maybe here too I'm allocating memory which is not used very much; it depends on the bytecode of the snailtracer.

Thanks in advance for any feedback!

@thedevbirb changed the title from "Shared memory" to "Shared memory between calls" on Aug 31, 2023
@gakonst (Collaborator) commented on Sep 1, 2023

cc @DaniPopes who may also have thoughts

@rakita (Member) commented on Sep 4, 2023

In general I'm very supportive of this; tbh I would have expected the same or better performance. I'll play with it a little bit to check it out.

I would like to see if we can remove Rc<RefCell<>>; maybe the overhead we see is related to the dynamic borrow checks inside RefCell. Snailtracer should be a good benchmark to check, as the init of the EVM is small in comparison with the work the interpreter is doing.

@thedevbirb (Contributor, Author) commented on Sep 5, 2023

In general I'm very supportive of this; tbh I would have expected the same or better performance. I'll play with it a little bit to check it out.

I'm very happy you like it! After this PR I can try to do a shared_stack as well in the same fashion, by allocating 32MB.

I would like to see if we can remove Rc<RefCell<>>; maybe the overhead we see is related to the dynamic borrow checks inside RefCell. Snailtracer should be a good benchmark to check, as the init of the EVM is small in comparison with the work the interpreter is doing.

I've found a way to remove the Rc<RefCell<>> and keep lifetimes to a minimum. I had to change the signature of the run_interpreter function a bit to achieve it (it no longer returns the interpreter, only the needed properties). The performance gains are very minor though. Let me know what you think!
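
For illustration only, here is a tiny self-contained sketch of the pattern (hypothetical types and function names, not revm's actual run_interpreter signature): the shared Rc<RefCell<SharedMemory>> handle is replaced by a mutable borrow threaded through the call, so the borrow is checked at compile time instead of at runtime.

use std::cell::RefCell;
use std::rc::Rc;

struct SharedMemory {
    data: Vec<u8>,
}

// Before: each call frame holds a shared, dynamically checked handle.
fn run_with_refcell(mem: Rc<RefCell<SharedMemory>>) -> usize {
    // every access pays a runtime borrow check
    mem.borrow_mut().data.push(0);
    mem.borrow().data.len()
}

// After: the caller keeps ownership and threads a mutable borrow through,
// so there is no RefCell bookkeeping on each access.
fn run_with_borrow(mem: &mut SharedMemory) -> usize {
    mem.data.push(0);
    mem.data.len()
}

fn main() {
    let rc_mem = Rc::new(RefCell::new(SharedMemory { data: Vec::new() }));
    println!("{}", run_with_refcell(rc_mem));

    let mut mem = SharedMemory { data: Vec::new() };
    println!("{}", run_with_borrow(&mut mem));
}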

@DaniPopes (Collaborator) commented:

cc @DaniPopes who may also have thoughts

Same as #660 (comment)

@DaniPopes (Collaborator) commented:

#582 was merged, can you rebase this PR? @lorenzofero

@thedevbirb (Contributor, Author) commented on Sep 21, 2023

#582 was merged, can you rebase this PR? @lorenzofero

I can't really perform a rebase because I have conflicts in multiple commits and it's really cumbersome to sort out. I'm working on a merge with conflict resolution 👍. I hope that's not a problem! It will take some time though, as I would like to stay as close as possible to how you handled some of the memory functions.

@thedevbirb (Contributor, Author) commented on Sep 26, 2023

Hey @rakita @DaniPopes @gakonst, tests are passing now if you want to check it out again. I managed to keep all of Dani's updates to the memory-related functions inside shared_memory.rs. Let me know what you think!

Lastly, is there a new way to run benches now? I've seen in the readme that something has changed, but it seems out of date.

Thanks!

@DaniPopes (Collaborator) commented:

You can run cargo criterion or the Cachegrind test as explained in #582's description

@DaniPopes (Collaborator) left a review.

[12 resolved review comments on crates/interpreter/src/instructions/macros.rs, crates/interpreter/src/interpreter/shared_memory.rs, crates/revm/src/evm_impl.rs, and crates/revm/benches/bench.rs]
@DaniPopes (Collaborator) commented on Sep 26, 2023

This is great btw, I think we can get perf regression down to neutral or positive

@thedevbirb (Contributor, Author) commented:

This is great btw, I think we can get perf regression down to neutral or positive

Thanks, and thank you for the very detailed feedback you provided; I'll try to resolve all the comments as soon as possible.

[7 resolved review comments on crates/interpreter/src/interpreter/shared_memory.rs and crates/interpreter/src/interpreter.rs]
@DaniPopes (Collaborator) left a review:

I think this is what's causing tests to fail

[2 resolved review comments on crates/interpreter/src/interpreter/shared_memory.rs]
@thedevbirb (Contributor, Author) commented on Sep 30, 2023

Today, if I have time, I'll take a look at the tests failing with the ethtests profile.
Edit: I need to handle the edge case gas_limit == 0. A cheap fix is to do nothing in set_ptr when self.data.len() == 0:

fn set_ptr(&mut self, checkpoint: usize) {
    // if gas_limit == 0 the buffer was never allocated, so there is
    // nothing to point into and the pointer is left untouched
    if !self.data.is_empty() {
        assume!(checkpoint < self.data.len());
        self.current_ptr = unsafe { self.data.as_mut_ptr().add(checkpoint) };
    }
}

@rakita (Member) commented on Oct 3, 2023

Hi @lorenzofero, how is the performance on this?
What I am thinking right now is that this is becoming a more complex solution than the current one. Additionally, while peak memory usage is lower, average memory usage is a few times higher than before: previously, memory per call was grown dynamically, while this approach allocates its maximum right away.

In essence, I expected this to be more impactful, but if that is not the case there is unfortunately no good reason to include it.

@thedevbirb (Contributor, Author) commented on Oct 3, 2023

Hey @rakita, I tried to run Cachegrind again after Dani's first measurements; that run came out at around -3.5% in performance. After the current changes, what I get is around -2.7%.

@DaniPopes, maybe I can try this now to see if we can squeeze out a little bit more:

I think we can go a step further here and use uninit memory for extra perf. This may not be that big so maybe wait until the end to try this.

Yeah, I get that the more complex the transaction, the better the performance. For simple stuff I expect it to be somewhat worse on average, as the benches suggest.
Maybe we can shift to the model you originally suggested in ethereum/evmone#481 (comment) with manual expansion, or a similar setup where we keep allocating using the shared setup.
However, if you don't plan on including it at all, given that it can add some complexity, that's fine too.

@thedevbirb (Contributor, Author) commented on Oct 4, 2023

Hey, I hope you don't find all of these comments pedantic. I tried to use a different model for the shared memory, similar to ethereum/evmone#481 (comment), which you can see here on my fork: thedevbirb#2. This model has some benefits imo:

  • very simple -- little more than a wrapper around the original memory for the shared setup
  • no big pre-allocation based on gas limit -- it allocates 4KiB like the original memory, and only when more is needed does it allocate further 4KiB slots, which are kept for the next calls (see the resize method; a rough sketch follows this list)
  • overall better performance in most situations -- and if you don't use the memory, there is no performance penalty
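
A rough sketch of this growth model under assumed names (the real code lives in the linked fork): a single buffer shared across calls, grown in 4KiB steps and never shrunk, with a checkpoint per call depth.

// sketch only: names and details are assumptions, not the fork's exact code
const PAGE: usize = 4 * 1024;

struct SharedMemory {
    /// Buffer shared by all calls; its capacity is kept for later calls.
    buffer: Vec<u8>,
    /// Start offset of each call depth's region.
    checkpoints: Vec<usize>,
    /// Bytes used by the current context.
    current_len: usize,
}

impl SharedMemory {
    fn new() -> Self {
        // start with a single 4KiB page, like the original per-call memory
        Self { buffer: vec![0; PAGE], checkpoints: vec![0], current_len: 0 }
    }

    fn last_checkpoint(&self) -> usize {
        self.checkpoints.last().copied().unwrap_or_default()
    }

    /// Resize the current context's memory, growing the shared buffer in
    /// 4KiB steps when needed; already-allocated pages are reused later.
    fn resize(&mut self, new_len: usize) {
        let needed = self.last_checkpoint() + new_len;
        if needed > self.buffer.len() {
            // round up to the next 4KiB multiple
            let rounded = ((needed + PAGE - 1) / PAGE) * PAGE;
            self.buffer.resize(rounded, 0);
        }
        self.current_len = new_len;
    }
}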

Here is the result of running cargo bench --all against main:

analysis/transact/raw   time:   [4.0090 µs 4.1340 µs 4.2474 µs]
                        change: [-57.683% -52.266% -45.567%] (p = 0.00 < 0.05)
                        Performance has improved.
analysis/transact/checked
                        time:   [3.9221 µs 3.9936 µs 4.0423 µs]
                        change: [-45.316% -44.017% -42.611%] (p = 0.00 < 0.05)
                        Performance has improved.
analysis/transact/analysed
                        time:   [2.3999 µs 2.4642 µs 2.5303 µs]
                        change: [-54.329% -53.175% -52.059%] (p = 0.00 < 0.05)
                        Performance has improved.

snailtracer/transact/analysed
                        time:   [60.762 ms 61.650 ms 62.692 ms]
                        change: [-3.0766% +0.7508% +4.7091%] (p = 0.73 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
snailtracer/eval        time:   [58.104 ms 59.000 ms 59.465 ms]
                        change: [-10.689% -7.1282% -3.1639%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

transfer/transact/analysed
                        time:   [960.41 ns 967.11 ns 975.02 ns]
                        change: [+13.760% +16.191% +18.220%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

If you like it, I can bring the changes to this branch or open a new PR.

@rakita (Member) commented on Oct 5, 2023

Hey, I hope you don't find all of these comments pedantic.

It is fine. This is a tough decision, as you invested a lot of time into this and the benefits are unfortunately small. I have a few examples where an idea didn't pan out as expected (gas block, I am looking at you: we had a performance boost of 5-7% and the code was merged, but it was very hard to use, so after a few months it was reverted), so in the end it was more like a research effort to get data.

I tried to use a different model for the shared memory, similar to ethereum/evmone#481 (comment), which you can see here on my fork: lorenzofero#2. This model has some benefits imo:

  • very simple -- little more than a wrapper around the original memory for the shared setup
  • no big pre-allocation based on gas limit -- it allocates 4KiB like the original memory, and only when more is needed does it allocate further 4KiB slots, which are kept for the next calls (see the resize method)
  • overall better performance in most situations -- and if you don't use the memory, there is no performance penalty

This seems better; not all transactions are going to be 1024 calls deep or use 30M gas, so this is more reasonable. But with this:

  • you would always copy all data to the newly allocated vec when a resize happens. That is why in that comment I showed memory in chunks, to mitigate this somewhat (see the toy sketch after this list).
  • but it allows more compact usage of the allocated memory.
  • the memory peak stays allocated.
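
As a toy illustration of that chunked idea (hypothetical code, not revm's): appending fixed-size chunks instead of resizing one contiguous Vec means already-written data never has to be moved when memory grows, at the cost of non-contiguous storage and an extra indirection on access.

const CHUNK: usize = 4 * 1024;

struct ChunkedMemory {
    // each chunk keeps its address for the lifetime of the transaction,
    // so growing never copies existing data
    chunks: Vec<Box<[u8; CHUNK]>>,
}

impl ChunkedMemory {
    fn new() -> Self {
        Self { chunks: Vec::new() }
    }

    /// Make sure at least `len` bytes are available, adding whole chunks.
    fn grow_to(&mut self, len: usize) {
        while self.chunks.len() * CHUNK < len {
            self.chunks.push(Box::new([0u8; CHUNK]));
        }
    }
}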

And spilling context (in this case memory) from one Interpreter to another would be fine if we gained something significant; on the other hand, having just one place for memory opens things up for new ideas. This will probably look better if we switch from recursive calls to loop calls (if the loop approach turns out okay).

I am on the fence here, but let us include it. I will review it in detail in the next few days (@DaniPopes already did an amazing job there).

@thedevbirb (Contributor, Author) commented on Oct 6, 2023

Ok, I brought the changes here for review. However, I was wondering whether it would be worth opening a new PR that supersedes this one with the new model, since both the commit history and the GitHub conversation here are becoming a little messy imo. Let me know what you think!

@rakita (Member) commented on Oct 8, 2023

Ok, I brought the changes here for review. However, I was wondering whether it would be worth opening a new PR that supersedes this one with the new model, since both the commit history and the GitHub conversation here are becoming a little messy imo. Let me know what you think!

It is messy, but it is fine for it to be in one place, so people can follow what is happening.

@rakita (Member) left a review:

We should reintroduce the memory limit.

Other parts look good!

($interp:expr, $offset:expr, $len:expr) => {
    if let Some(new_size) =
        crate::interpreter::next_multiple_of_32($offset.saturating_add($len))
    {
        #[cfg(feature = "memory_limit")]
        if new_size > ($interp.memory_limit as usize) {
@rakita (Member) commented on this code:

Should we reintroduce the memory limit?

/// Memory checkpoints for each depth
checkpoints: Vec<usize>,
/// How much memory has been used in the current context
current_len: usize,
@rakita (Member) commented on this code:

We can probably put the memory limit here; it feels like a better place.

/// Get the last memory checkpoint
#[inline(always)]
fn last_checkpoint(&self) -> usize {
    *self.checkpoints.last().unwrap_or(&0)
@rakita (Member) commented on this code:

Suggested change:
- *self.checkpoints.last().unwrap_or(&0)
+ self.checkpoints.last().cloned().unwrap_or_default()

cloned() would remove the reference from the Option.

@thedevbirb (Contributor, Author) commented:

We should reintroduce the memory limit.

It should be good now!

@rakita (Member) left a review:

lgtm! Amazing work @lorenzofero!

@rakita merged commit b5aa4c9 into bluealloy:main on Oct 11, 2023. 8 checks passed.
@thedevbirb (Contributor, Author) commented:

I'm very happy we've found the right approach for this and got some good performance improvements. Thanks a lot @rakita and @DaniPopes for all the support in the last month; it has been a pleasure working with you.

@thedevbirb deleted the shared_memory branch on October 11, 2023 at 16:51.
The github-actions bot mentioned this pull request on Jan 12, 2024.