Open
Labels
monitoring: This primarily focuses on logs, metrics, and/or tracing
Description
Context
As part of the Height-Indexed Database C-Chain Integration work, we now have an EVM database wrapper that sends block headers/bodies/receipts to a height-indexed db while keeping non-block data on the existing leveldb-backed KV store.
This issue is about measuring what changes when we turn that on: block insertion time, disk usage, and node resource usage, compared to when all block data is stored in leveldb.
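To make the setup under test concrete, here is an illustrative sketch of the routing idea described above (not the actual coreth/avalanchego code; all names are hypothetical): block components keyed by height go to a height-indexed store, while every other key stays on the existing KV store.

```python
# Hypothetical sketch of the EVM database wrapper described in this issue.
# Block headers/bodies/receipts are routed to a height-indexed store;
# all non-block data stays in the existing leveldb-backed KV store.
class WrappedDB:
    def __init__(self, kv_store, height_store):
        self.kv = kv_store            # existing leveldb-backed KV store (dict here)
        self.heights = height_store   # height-indexed block store (dict here)

    def put_block_component(self, height, kind, blob):
        # kind is one of "header", "body", "receipts" in this sketch
        self.heights.setdefault(height, {})[kind] = blob

    def get_block_component(self, height, kind):
        return self.heights.get(height, {}).get(kind)

    def put(self, key, value):
        # Non-block data is untouched by the wrapper and goes to the KV store.
        self.kv[key] = value
```

The benchmarks below then compare this two-store layout against the baseline where `put_block_component` would also write into the KV store.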
Questions we want to answer
- Block insertion time and throughput
- Does enabling the height-indexed db change C-Chain block insertion time (and therefore block verify + accept time)?
- Does it change the variance of block insertion time (do block insertions become more or less “spiky”)?
- Does enabling the height-indexed db let us process more blocks overall (for example, does mgas/s change when running the reexecution tests)?
- On-disk storage usage
- What is the difference in total on-disk storage between:
- State sync off, storing all C-Chain blocks (72M+ blocks).
- State sync on, only storing blocks after the state sync height.
- For each of the above, how do leveldb-only and height-indexed-db-enabled setups compare, and how quickly do they grow over time?
- Node resource usage and compactions
- How does moving block data out of leveldb change compaction behavior (frequency, duration, bytes written, and write stalls)?
- Do we see a meaningful change in CPU, memory, disk IO, or db-level latency that we can reasonably attribute to those compaction changes?
- Migration cost
- For a node with state sync off (storing all C-Chain blocks), how long does it take to migrate into the height-indexed db, and what is the resource profile?
- For a node with state sync on (only storing blocks after the state sync height), how long does migration take, and how does its resource profile compare?
- In both cases, how disruptive is migration to the node (for example, does it noticeably affect responsiveness)?
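Once insertion-time and gas samples are collected, the first group of questions reduces to simple summary statistics. A minimal sketch of that reduction (illustrative only; sample collection and metric names are left to the benchmarking harness):

```python
import statistics

def insertion_stats(durations_s, gas_used):
    """Summarize block insertion samples for one configuration.

    durations_s: per-block insertion times in seconds.
    gas_used: per-block gas used (same length as durations_s).
    Returns (mean insertion time, stddev of insertion time, mgas/s),
    where mgas/s is total gas / total wall time, matching the
    throughput figure reported by the reexecution tests.
    """
    mean = statistics.mean(durations_s)
    std = statistics.stdev(durations_s) if len(durations_s) > 1 else 0.0
    mgas_per_s = (sum(gas_used) / 1e6) / sum(durations_s)
    return mean, std, mgas_per_s
```

Comparing `std` between the enabled and disabled configurations answers the "spikiness" question; comparing `mgas_per_s` answers the throughput one.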
Metrics to collect
- Block insertion time and throughput
- For C-Chain nodes with the height-indexed db enabled vs disabled:
- Mean block insertion time and its standard deviation over the last X days (to capture both the average and its variance).
- Reexecution tests:
- `mgas/s` reported by the reexecution tests with and without the height-indexed db.
- On-disk storage usage
- For archive nodes (state sync off, storing all C-Chain blocks):
- Total db size for leveldb-only vs leveldb + height-indexed db.
- For nodes with state sync on (only storing blocks after the state sync height):
- Total db size for leveldb-only vs leveldb + height-indexed db.
- For both node types:
- Approximate growth rate over a defined period (for example, size change after X additional blocks or Y days).
- Node resource usage and compactions
- Leveldb compaction + stall metrics:
- `writes_delayed`, `writes_delayed_duration`, and `write_delayed` to see how often and how long writes are stalled by compaction.
- `io_write` and `io_read` to track total compaction IO.
- Per-level stats: `duration`, `reads`, and `writes`, along with `mem_comps`, `level_0_comps`, `non_level_0_comps`, and `seek_comps`, to understand where compaction work is happening.
- CPU and memory:
- CPU usage over time (average and peaks), memory usage (RSS and Go heap) for both configurations.
- Disk IO and db latency:
- Read/write throughput and IO latency for both configurations.
- If available, average and tail latency for db Get/Put-style operations, to see whether heavy compaction periods line up with slower db operations.
- Migration cost
- Total migration time for:
- A node with state sync off.
- A node with state sync on.
- Migration throughput:
- Blocks migrated per second and/or bytes migrated per second.
- Resource usage during migration:
- CPU, memory, and disk IO while migration runs.
- Any noticeable impact on node responsiveness (for example, RPC latency) during migration if run on a live node.
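The migration-throughput numbers above compose into a simple progress estimate. A sketch, assuming migration throughput stays roughly constant (a simplification; real runs may slow down as leveldb compacts behind the migration):

```python
def migration_estimate(blocks_done, bytes_done, elapsed_s, blocks_total):
    """Compute migration throughput so far and a naive ETA.

    Returns (blocks/s, bytes/s, estimated seconds remaining), assuming
    the remaining blocks migrate at the observed average rate.
    """
    blocks_per_s = blocks_done / elapsed_s
    bytes_per_s = bytes_done / elapsed_s
    eta_s = (blocks_total - blocks_done) / blocks_per_s
    return blocks_per_s, bytes_per_s, eta_s
```

Run periodically against the migration's progress counters, this also gives a time series to correlate with the CPU/memory/disk-IO and RPC-latency measurements listed above.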
Status
In Progress 🏗️