# IAVL V2 vs Memiavl Benchmarking

In [17]:
from analysis.read_logs import load_benchmark_dir_dict
import polars as pl
from analysis.analysis import summary, plot_ops_per_sec, plot_mem, plot_disk_usage
import plotly.io as pio
pio.renderers.default = "notebook_connected"

In [2]:
mixed_small = load_benchmark_dir_dict("../bench/run-mixed-small")

The first benchmark we'll look at was run over a relatively small data set of 10,000 versions, generated with the following store parameters:

In [3]:
pl.DataFrame(mixed_small['iavl-v1'].init_data['changeset_info']['store_params'])

store_key,key_mean,key_std_dev,value_mean,value_std_dev,initial_size,final_size,versions,change_per_version,delete_fraction
str,i64,i64,i64,i64,i64,i64,i64,i64,f64
"""bank""",56,3,100,1200,35000,220020,10000,1840,0.25
"""staking""",24,2,12263,22967,35000,160069,10000,305,0.25
"""lockup""",56,3,1936,29261,35000,260020,10000,363,0.29


With the default settings, both memiavl and iavl/v2 show significant improvements over iavl/v1 in raw operations per second. iavl/v2 is roughly 2x as fast as iavl/v1 and memiavl is roughly 4x as fast. We see, however, that memiavl consumes significantly more memory.

In [4]:
summary(mixed_small, ['iavl-v1', 'iavl-v2-alpha6', 'memiavl'])

name,ops_per_sec,max_mem_gb,max_disk_gb
str,f64,f64,f64
"""iavl-v1""",4742.310918,1.559486,153.0
"""iavl-v2-alpha6""",10644.822227,0.996339,50.0
"""memiavl""",20543.653772,14.627673,26.0

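As a quick sanity check, the "roughly 2x" and "roughly 4x" figures can be re-derived from the summary table. This is a minimal sketch in plain Python with the throughput numbers copied by hand from the table; `speedup_over` is a hypothetical helper, not part of the `analysis` package.

```python
# Throughput figures copied from the summary table (operations per second).
ops_per_sec = {
    "iavl-v1": 4742.310918,
    "iavl-v2-alpha6": 10644.822227,
    "memiavl": 20543.653772,
}


def speedup_over(baseline: str, results: dict) -> dict:
    """Return each entry's throughput divided by the baseline's throughput."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


speedups = speedup_over("iavl-v1", ops_per_sec)
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

The ratios come out slightly above 2x and 4x respectively, so the rounded figures in the text are a little conservative.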

## Performance over time

The following plot shows performance over time (averaged every 100 blocks so that the chart isn't super noisy).
Generally, for this small data set, performance was relatively consistent over time, but memiavl and iavl/v2 are significantly more spiky.
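For readers who want to reproduce the smoothing, here is one way the 100-block averaging could be done. This is only a sketch: `bucket_means` is a hypothetical helper, it assumes versions are numbered from 1, and the actual `plot_ops_per_sec` implementation may aggregate differently.

```python
from collections import defaultdict


def bucket_means(samples, bucket_size=100):
    """Average (version, ops_per_sec) samples over fixed-width version buckets.

    Returns a dict keyed by the last version of each bucket.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for version, value in samples:
        bucket = (version - 1) // bucket_size  # versions 1..100 share bucket 0
        sums[bucket] += value
        counts[bucket] += 1
    return {(b + 1) * bucket_size: sums[b] / counts[b] for b in sorted(sums)}


# Synthetic data: 100 versions at 1000 ops/sec followed by 100 at 2000 ops/sec.
samples = [(v, 1000.0 if v <= 100 else 2000.0) for v in range(1, 201)]
averaged = bucket_means(samples)
```

A wider bucket gives a smoother chart but hides short dips such as the ones around snapshot and checkpoint heights.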

In [5]:
plot_ops_per_sec(mixed_small, ['iavl-v1', 'iavl-v2-alpha6', 'memiavl'])

To better understand what we are looking at, let's look into some differences in the behavior of these systems.

### Memiavl Snapshot Behavior

memiavl only serializes the iavl tree to disk at snapshot intervals, which are set to 1000 versions by default.
We see predictable dips in performance slightly after every 1000 blocks.
memiavl uses the following strategy to do snapshotting and then reclaim memory by switching to a snapshot backed tree after every snapshot interval:
1. at snapshot height A, begin snapshotting asynchronously in the background
2. when the snapshot for height A completes at height B, begin replaying the WAL in the background up to height B (this is called best-effort WAL replay)
3. when best-effort WAL replay completes, at height C, stop the world and replay the WAL from height B to C and then swap out the current memory tree for the hybrid snapshot/memory tree from height A

When the tree is snapshotted, memiavl always traverses and serializes the entire iavl tree.
This behavior is significant and we'll look into it in more detail later.

### iavl/v2 Checkpoint Behavior

iavl/v2, on the other hand, does checkpointing, also every 1000 blocks by default, so we see dips at this regular interval as well.
iavl/v2's checkpoints are not full snapshots, but are instead diffs of the internal node changes since the last checkpoint height.
In iavl/v2, however, leaf nodes are stored synchronously every block as both the WAL and the leaf node part of the tree.
So iavl/v2 checkpoints are comparatively much smaller operations, because each one only records the difference between the previous and current
internal node structure. At checkpoint intervals, all leaf nodes have already been flushed to disk synchronously at each version as the WAL.
By default, when leaf nodes are serialized, they are evicted from memory which helps iavl/v2 keep a small memory footprint.
Also, at checkpoint intervals, iavl/v2 evicts nodes from memory as an additional strategy for managing memory.
Both the eviction of leaf nodes at every version and eviction of branch nodes at checkpoint intervals are configurable parameters
which affect performance and memory usage.
This more aggressive flushing to disk and memory reclamation (which currently happens synchronously) explains a significant amount
of the performance difference between iavl/v2 and memiavl.
As we'll see later, by tweaking these parameters, we can improve the performance of iavl/v2 in exchange for higher memory consumption.
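To make the branch-node eviction parameter more concrete, here is a toy illustration of "retain branch nodes only down to a given depth". It is not iavl/v2's code: `Node`, `build`, `resident`, and `evict_below` are hypothetical names, and the real implementation's notion of depth or height may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(levels: int, key: int = 1) -> Node:
    """Build a complete binary tree with `levels` levels below the root."""
    if levels == 0:
        return Node(key)
    return Node(key, build(levels - 1, 2 * key), build(levels - 1, 2 * key + 1))


def resident(node: Optional[Node]) -> int:
    """Count the nodes still reachable in memory."""
    if node is None:
        return 0
    return 1 + resident(node.left) + resident(node.right)


def evict_below(node: Optional[Node], keep_depth: int, depth: int = 0) -> None:
    """Drop in-memory children of every node at depth `keep_depth`.

    In a real store the dropped subtrees would be re-read from disk on demand.
    """
    if node is None:
        return
    if depth >= keep_depth:
        node.left = None
        node.right = None
        return
    evict_below(node.left, keep_depth, depth + 1)
    evict_below(node.right, keep_depth, depth + 1)


root = build(4)
before = resident(root)
evict_below(root, keep_depth=2)
after = resident(root)
```

A larger `keep_depth` keeps more of the tree resident, which trades memory for fewer disk reads; that is the same trade-off the iavl/v2 parameters expose.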

### iavl/v1 Behavior

It is worth noting that iavl/v1 does not do any periodic snapshotting or checkpointing but instead flushes both branch and leaf nodes to disk when saving every version,
and it does this synchronously.
This may be an oversimplification, but we could summarize the key types of optimizations that memiavl and iavl/v2 are trying to make as:
* saving less stuff to disk less often (snapshotting or checkpointing every 1000 blocks instead of every block)
* pushing some operations to the background (mostly implemented in memiavl)
* keeping more stuff in memory (mostly memiavl)

## Memory Usage Over Time

Looking at memory usage over time shows us some pretty significant differences.
iavl/v1 and iavl/v2 both have quite low memory consumption that remains consistent over time,
with iavl/v2 managing to consume less memory while delivering higher performance.
memiavl, on the other hand, consumes a lot of memory, with significant drops after every snapshot interval.
Its memory reclamation does appear to be fairly consistent over time, dropping to roughly 2.5 GB, which is manageable,
albeit higher than iavl/v1 or iavl/v2.
However, we see its peak memory consumption trending higher over time. We'll look into this in more detail with a
more aggressive benchmark later.

In [6]:
plot_mem(mixed_small, ['iavl-v1', 'iavl-v2-alpha6', 'memiavl'])

## iavl/v2 Configuration

As we mentioned before, iavl/v2's node eviction behavior can be configured to improve performance at the expense of memory consumption.
We can configure whether leaf nodes get evicted and up to what depth we will retain branch nodes in memory after checkpointing.
These summary numbers show us how iavl/v2 behaves when we 1) don't evict leaf nodes and 2) retain a tree of up to height 20 after checkpointing:

In [7]:
summary(mixed_small, ['iavl-v2-alpha6-evict20', 'iavl-v2-alpha6', 'memiavl'])

name,ops_per_sec,max_mem_gb,max_disk_gb
str,f64,f64,f64
"""iavl-v2-alpha6-evict20""",15925.232166,15.010651,50.0
"""iavl-v2-alpha6""",10644.822227,0.996339,50.0
"""memiavl""",20543.653772,14.627673,26.0


With these settings iavl/v2's performance increases by 50%, achieving 75% of the speed of memiavl while consuming roughly the same amount of memory.
The graph below shows this configuration's memory consumption in comparison to memiavl.
The periodic drops in both graphs might be due to GC memory reclamation, but this isn't definitive.
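The percentages above can be checked directly against the summary table. The sketch below uses values copied by hand from that table; the variable names are ours and nothing here comes from the `analysis` package.

```python
# (ops_per_sec, max_mem_gb) pairs copied from the summary table.
results = {
    "iavl-v2-alpha6-evict20": (15925.232166, 15.010651),
    "iavl-v2-alpha6": (10644.822227, 0.996339),
    "memiavl": (20543.653772, 14.627673),
}

tuned_ops, tuned_mem = results["iavl-v2-alpha6-evict20"]
default_ops, _ = results["iavl-v2-alpha6"]
memiavl_ops, memiavl_mem = results["memiavl"]

# Relative throughput gain of the tuned configuration over the default one.
gain_over_default = tuned_ops / default_ops - 1.0
# Tuned iavl/v2 throughput as a fraction of memiavl's.
fraction_of_memiavl = tuned_ops / memiavl_ops
# Peak memory of tuned iavl/v2 relative to memiavl.
memory_ratio = tuned_mem / memiavl_mem

print(f"gain over default: {gain_over_default:.1%}")
print(f"fraction of memiavl throughput: {fraction_of_memiavl:.1%}")
print(f"peak memory relative to memiavl: {memory_ratio:.2f}x")
```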

In [8]:
plot_mem(mixed_small, ['iavl-v2-alpha6-evict20', 'memiavl'])

## Larger Dataset Behavior

Let's look at how these systems behave with a significantly larger dataset which was generated with the parameters below.
The growth of this dataset is maybe unrealistically aggressive; however, the final tree sizes are likely within the realm
of realistic real-world scenarios based on what I'm able to ascertain from previous benchmark parameters.
So let's consider this benchmark run a stress test.

In [9]:
mixed_large = load_benchmark_dir_dict("../bench/run-mixed-large")
pl.DataFrame(mixed_large['iavl-v1'].init_data['changeset_info']['store_params'])

store_key,key_mean,key_std_dev,value_mean,value_std_dev,initial_size,final_size,versions,change_per_version,delete_fraction
str,i64,i64,i64,i64,i64,i64,i64,i64,f64
"""bank""",56,3,100,1200,35000,22002000,20000,1840,0.25
"""staking""",24,2,12263,22967,35000,16006960,20000,305,0.25
"""lockup""",56,3,1936,29261,35000,26002000,20000,363,0.29

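To put a number on how aggressive this growth is, we can compare the net growth per version implied by the two parameter tables. This is a rough sketch with values copied by hand; `net_growth_per_version` is a hypothetical helper, and the result is in whatever unit `initial_size` and `final_size` use, which the tables do not state.

```python
# (initial_size, final_size) per store, copied from the two parameter tables.
store_params = {
    "small": {
        "versions": 10_000,
        "stores": {
            "bank": (35_000, 220_020),
            "staking": (35_000, 160_069),
            "lockup": (35_000, 260_020),
        },
    },
    "large": {
        "versions": 20_000,
        "stores": {
            "bank": (35_000, 22_002_000),
            "staking": (35_000, 16_006_960),
            "lockup": (35_000, 26_002_000),
        },
    },
}


def net_growth_per_version(initial_size: int, final_size: int, versions: int) -> float:
    """Average net change in store size per version over the whole run."""
    return (final_size - initial_size) / versions


growth = {
    run: {
        store: net_growth_per_version(initial, final, params["versions"])
        for store, (initial, final) in params["stores"].items()
    }
    for run, params in store_params.items()
}

for store in growth["small"]:
    ratio = growth["large"][store] / growth["small"][store]
    print(f"{store}: large run grows {ratio:.0f}x faster per version")
```

For every store the large run grows more than 50 times faster per version than the small run, which is what makes it a stress test rather than a typical workload.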

No system was actually able to complete the benchmarking run.

iavl/v1 actually got the farthest, crashing with a mutex exception at version 19,737,
just barely before the finish line at 20,000 versions.

iavl/v2 completed 6344 versions before it consumed all of the remaining disk space on the VM. It appears that its pruning behavior started to break
down as the tree grew larger. We'll look at a graph of this later.

memiavl simply ran out of memory and crashed after completing 2970 blocks, consuming all 128 GB of RAM on the VM.

Below is a graph of the operations per second performance for these three systems with default settings.
As you can see, memiavl starts out strong, but its performance plummets quickly as the tree grows until it effectively crashes and burns.
iavl/v1 and iavl/v2 both have pretty significant dips in performance but then tend to level out at a consistent rate.

In [10]:
plot_ops_per_sec(mixed_large, ["memiavl", "iavl-v1", "iavl-v2-alpha6"])

Looking at memory consumption, we can see how memiavl's memory consumption gets out of control,
basically consuming all available RAM before the memory reclamation from the snapshot at height 2000 could complete.

In [11]:
plot_mem(mixed_large, ["memiavl", "iavl-v1", "iavl-v2-alpha6"])

One obvious thing to try here is reducing the snapshot interval so that memory reclamation can happen more often.
Below we can see how memiavl did with a snapshot interval of 100 versions.
It definitely held out for longer, but in the end,
it still ended up consuming all memory before the snapshot memory reclamation switch could happen for height 4700.

In [12]:
plot_mem(mixed_large, ["memiavl", "memiavl-100-2"])

### Memiavl Snapshot Timing

Looking at the snapshot logs, we can see the snapshotting was pretty quick at height 1000, but at height 2000,
the snapshot itself took 6m 35s, then it took an additional 15m 48s to replay the WAL before the third step
of synchronously trying to finish WAL replay could start.
Now, keep in mind that these WAL replay times may be unrealistic because this tree was growing _very_ quickly.
Real chain data would likely grow much more slowly, so these WAL replay times would likely be much smaller.
However, the snapshot durations we're seeing may be realistic because they depend on the absolute size of the tree, not
the rate of growth.

In [13]:
mixed_large["memiavl"].memiavl_snapshots.select("version", "snapshot_duration", "best_effort_wal_duration", "wal_sync_duration")


version,snapshot_duration,best_effort_wal_duration,wal_sync_duration
i64,duration[μs],duration[μs],duration[μs]
1000,16s 687869µs,7s 778346µs,5s 777170µs
2000,6m 35s 684748µs,15m 48s 902105µs,

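When comparing rows like these it helps to have the durations as plain seconds. The sketch below parses the printed strings (for example `6m 35s 684748µs`); `duration_seconds` is a hypothetical helper that accepts both the micro sign and the Greek mu, since both appear in the output. If you have the original DataFrame, converting the duration columns with polars directly is likely the better route.

```python
import re

# Seconds per unit; the micro sign (U+00B5) is the canonical key.
_UNIT_SECONDS = {
    "d": 86400.0,
    "h": 3600.0,
    "m": 60.0,
    "s": 1.0,
    "ms": 1e-3,
    "\u00b5s": 1e-6,
    "ns": 1e-9,
}
# Longer unit names come first so "ms" is not read as "m".
_TOKEN = re.compile(r"(\d+)\s*(ms|[\u00b5\u03bc]s|ns|d|h|m|s)")


def duration_seconds(text: str) -> float:
    """Convert a printed duration such as '6m 35s 684748µs' to seconds."""
    total = 0.0
    for amount, unit in _TOKEN.findall(text):
        unit = unit.replace("\u03bc", "\u00b5")  # normalise Greek mu to micro sign
        total += int(amount) * _UNIT_SECONDS[unit]
    return total


snapshot_1000 = duration_seconds("16s 687869\u00b5s")
snapshot_2000 = duration_seconds("6m 35s 684748\u00b5s")
print(f"snapshot at 2000 took {snapshot_2000 / snapshot_1000:.1f}x as long as at 1000")
```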

Looking at the run with snapshotting happening every 100 blocks,
we do see significantly quicker snapshot times at height 2000 and only see things really getting out of hand at height 2900.
I can't explain the differences in behavior, and this is likely something we'd want to look into if we are going to pursue further work on memiavl.
The only difference that I can think of is the fact that more of the tree is being read from disk with more frequent snapshotting.
However, in either case the cost of snapshotting _should_ be the cost of tree traversal and writing nodes to disk.

In [14]:
mixed_large["memiavl-100-2"].memiavl_snapshots.select("version", "snapshot_duration", "best_effort_wal_duration", "wal_sync_duration")


version,snapshot_duration,best_effort_wal_duration,wal_sync_duration
i64,duration[μs],duration[μs],duration[μs]
100,2s 950354µs,1s 534665µs,1s 531132µs
200,4s 906384µs,2s 692384µs,2s 716207µs
300,7s 141700µs,3s 463955µs,3s 279345µs
400,8s 623121µs,4s 60007µs,3s 629395µs
500,10s 169634µs,5s 419941µs,4s 342241µs
…,…,…,…
2600,58s 376784µs,2s 742597µs,39s 528687µs
2700,57s 212177µs,1s 923872µs,13s 903165µs
2800,1m 2s 43393µs,2s 26948µs,53s 976633µs
2900,9m 44s 401487µs,9m 33s 432278µs,6m 5s 535087µs



In both cases, we do see that as the tree grows, snapshotting does take longer which is what we'd expect and
is consistent with what we've heard about memiavl.
The additional complication is that once snapshotting has completed, the WAL needs to be replayed in 2 steps.
First, the WAL is replayed in the background up to the version at which snapshotting completed,
and then it is played back synchronously up to the current height blocking further block production.
So, if snapshotting itself takes a long time, more blocks will have been committed in the meantime which need to be replayed.
If that background replaying takes too long, then even more blocks will get committed and then block processing will need to halt to catch up.
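The compounding effect described above can be sketched with a deliberately simple model. It assumes constant block commit and WAL replay rates, which the benchmark shows is not really the case, and `snapshot_cycle` and its parameters are hypothetical names rather than anything from memiavl.

```python
def snapshot_cycle(a: float, snapshot_secs: float, commit_rate: float, replay_rate: float) -> dict:
    """Model one snapshot cycle starting at height `a`.

    commit_rate: blocks committed per second while work runs in the background.
    replay_rate: WAL blocks replayed per second.
    """
    # Blocks keep committing while the snapshot is being written.
    b = a + commit_rate * snapshot_secs
    # Best-effort replay covers heights a..b in the background.
    best_effort_secs = (b - a) / replay_rate
    # More blocks are committed while that replay runs.
    c = b + commit_rate * best_effort_secs
    # Stop-the-world replay covers heights b..c and blocks block production.
    pause_secs = (c - b) / replay_rate
    return {
        "b": b,
        "c": c,
        "best_effort_secs": best_effort_secs,
        "pause_secs": pause_secs,
    }


fast = snapshot_cycle(a=1000, snapshot_secs=10, commit_rate=5, replay_rate=50)
slow = snapshot_cycle(a=2000, snapshot_secs=400, commit_rate=5, replay_rate=50)
print(fast)
print(slow)
```

In this model the blocking pause scales linearly with the snapshot duration and with the square of `commit_rate / replay_rate`, so a tree that is both large (slow snapshot) and fast-growing (slow replay relative to commits) is hit twice.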

### Disk Usage

I haven't looked into disk usage thoroughly yet, but it's worth looking at a few charts.
As far as I know, iavl/v1 isn't doing any pruning of old versions with its default settings; however,
it maintains manageable linear growth of disk usage.

iavl/v2 also, as far as I know, isn't doing pruning by default, and since it's storing only checkpoints every 1000
blocks, should be storing less data than iavl/v1.
However, it appears to consume more disk space on average and at some points this does appear to get a bit out of
control which ended in failure once it consumed all available disk space on the VM.

memiavl, however, appears to do pruning at every snapshot interval.
The memiavl-100-2 run should only be saving the 2 most recent snapshots,
but its disk usage and rate of disk usage growth appear to be significantly higher.

Whether we adopt memiavl or iavl/v2, this data suggests we'll want to explore disk usage
more deeply because this behavior does appear to be a regression compared to iavl/v1.

In [15]:
plot_disk_usage(mixed_large, ["iavl-v1", "iavl-v2-alpha6", "memiavl", "memiavl-100-2"])