# Measurements Reindexing Mainnet

These measurements were taken on a full node running on GCP (Google Compute Engine) while it reindexed mainnet.

The host is an `n1-standard-1` instance, which means:
 * 1 virtual CPU (pinned to the host CPU)
 * 4096 MB of RAM, of which about 3.6 GB is available to userspace
 * Very slow disks (connected to some kind of SAN)
 
Bitcoin itself is configured with `dbcache=2048`.

## Motivation

The motivation here is to make IBD (initial block download) work better on machines with slow disks. For instance, on most cloud instances IBD is extremely I/O bound, to the point that it's basically impossible to finish syncing a node. This is true even on hosts that have a relatively large amount of memory and therefore ought to be able to keep most of the dataset in memory.

## A Priori Memory Predictions

How much memory would we expect the UTXO set to use if it were fully loaded into memory? We can calculate this from the size of the keys and values and the number of entries in the UTXO set.

Even on hosts that don't have this much memory for their dbcache, hosts that have a reasonable proportion of that memory (e.g. 50% of it) ought to see a big speedup.

In [None]:
coutpoint_size = 36  # sizeof(COutPoint)
cache_entry_size = 56  # sizeof(CCoinsCacheEntry)
current_utxo_size = 57.3e6  # approximate number of entries in the current UTXO set
estimated_map_size = (coutpoint_size + cache_entry_size) * current_utxo_size
print('Estimated memory: {:1.2f} GB'.format(estimated_map_size / (1<<30)))

Note that this estimate assumes no overhead for the space taken by the hash table itself, so we should consider it a lower bound on the amount of memory needed. With a properly tuned hash table, though, real-world usage shouldn't be much higher.
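To get a feel for how far the lower bound might be from reality, here is a rough sketch that adds per-entry hash table overhead to the estimate above. Every overhead constant in it (node pointer, cached hash, allocator bookkeeping, bucket pointer, load factor of 1.0) is an assumption chosen for illustration, not something measured on this host. It also treats the entire `dbcache=2048` setting as available to the in-memory map, which is optimistic.

```python
# Inputs taken from the estimate above.
COUTPOINT_SIZE = 36      # sizeof(COutPoint)
CACHE_ENTRY_SIZE = 56    # sizeof(CCoinsCacheEntry)
UTXO_COUNT = 57.3e6      # approximate number of entries in the UTXO set

# ASSUMPTIONS (illustrative, not measured): a node-based hash map where each
# node stores a next pointer and a cached hash, is padded to 8-byte alignment,
# and costs some allocator bookkeeping; one bucket pointer per entry.
NODE_POINTER = 8
CACHED_HASH = 8
ALLOCATOR_OVERHEAD = 16
BUCKET_POINTER = 8
ALIGNMENT = 8

GIB = 1 << 30
DBCACHE_BYTES = 2048 * (1 << 20)  # dbcache=2048, treated as fully usable


def round_up(n, multiple):
    """Round the integer n up to the next multiple."""
    return -(-n // multiple) * multiple


def estimate_bytes(count):
    """Return (lower_bound, with_assumed_overhead) in bytes for count entries."""
    payload = COUTPOINT_SIZE + CACHE_ENTRY_SIZE
    node = round_up(payload + NODE_POINTER + CACHED_HASH, ALIGNMENT)
    per_entry = node + ALLOCATOR_OVERHEAD + BUCKET_POINTER
    return payload * count, per_entry * count


lower, with_overhead = estimate_bytes(UTXO_COUNT)
print('Lower bound:           {:1.2f} GB'.format(lower / GIB))
print('With assumed overhead: {:1.2f} GB'.format(with_overhead / GIB))
print('dbcache covers {:.0%} to {:.0%} of the set'.format(
    DBCACHE_BYTES / with_overhead, DBCACHE_BYTES / lower))
```

Changing the assumed constants moves the upper figure noticeably, so the output is best read as a range rather than a prediction.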

## Measurements

In [None]:
%matplotlib inline

import matplotlib
import numpy as np
import pandas as pd

matplotlib.rcParams['figure.dpi']= 300
matplotlib.rcParams['figure.figsize'] = [12, 8]

In [None]:
# load the local modules unpack.py and plotutils.py, reloading them if they
# were already imported so that local edits are picked up
import importlib
import sys
try:
    importlib.reload(sys.modules['unpack'])
    importlib.reload(sys.modules['plotutils'])
except KeyError:
    import unpack
    import plotutils

# rebind these aliases on every run so they point at the reloaded module
plot = plotutils.plot
select = plotutils.select

# load the data and print some information about it
data = unpack.unpack_data()

print()
print('Host Info')
print('---------')
for k, v in data['hostinfo'].items():
    print('{:16s}{}'.format(k, v))

frames = data['frames']
print()
print('Data Frames')
print('-----------')
print(' '.join(sorted(frames.keys())))

# naughty: inject every data frame into the global namespace as a variable
globals().update(frames)

In [None]:
# List of all flush events that happened
plotutils.set_flushes(flushes)
flushes.head()

### Indexing Rate (Height)

In this section we'll look at how fast the chain is syncing, measured by the chain height.

In [None]:
#df = updatetip[['height', 'progress']]
#plot(df, title='Progress', secondary_y=['progress'])
plot(updatetip['progress'] * 100, title='IBD Syncing', ylabel='Progress %')
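The progress curve shows how far along the sync is, but not the rate directly. A minimal sketch of turning a height series into blocks per hour follows. It assumes the series is indexed by timestamps (a `DatetimeIndex`), which is a guess about how `updatetip` is laid out, and it uses synthetic numbers rather than the measured data.

```python
import pandas as pd


def blocks_per_hour(height):
    """Rate of height change between consecutive samples, in blocks per hour.

    ASSUMPTION: `height` is a Series with a DatetimeIndex.
    """
    elapsed_hours = height.index.to_series().diff().dt.total_seconds() / 3600
    return height.diff() / elapsed_hours


# Synthetic stand-in for updatetip['height'], sampled every 30 minutes.
idx = pd.date_range('2018-01-01', periods=4, freq='30min')
height = pd.Series([0, 40000, 90000, 120000], index=idx)

rate = blocks_per_hour(height)
print(rate.dropna())
```

The first sample has no predecessor, so its rate is NaN and is dropped before printing.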

### Cache Hits

In [None]:
plot(cache, title='Cache Hits/Misses')

In [None]:
df = updatetip.copy()
df['utxo:size'] /= 1e6
plot(df[['utxo:count', 'utxo:size']], secondary_y=['utxo:size'])

In [None]:
hits = cache['hits']
total = hits + cache['misses']
plot(hits / total, title='Cache Hit Ratio')
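Whether `hits / total` describes the whole run or only the latest interval depends on whether the counters are cumulative, which the notebook doesn't state. The sketch below shows both readings on synthetic data, under the assumption that the counters are cumulative. It also guards against intervals with no lookups at all.

```python
import numpy as np
import pandas as pd


def hit_ratio(hits, misses):
    """hits / (hits + misses), yielding NaN where there were no lookups."""
    total = hits + misses
    return hits / total.replace(0, np.nan)


def interval_hit_ratio(cum_hits, cum_misses):
    """Per-interval hit ratio, ASSUMING both inputs are cumulative counters."""
    return hit_ratio(cum_hits.diff(), cum_misses.diff())


# Synthetic cumulative counters (not measured data).
cum_hits = pd.Series([0, 90, 150, 150])
cum_misses = pd.Series([0, 10, 50, 50])

lifetime = hit_ratio(cum_hits, cum_misses)
per_interval = interval_hit_ratio(cum_hits, cum_misses)
print(pd.DataFrame({'lifetime': lifetime, 'per_interval': per_interval}))
```

The lifetime ratio smooths over recent behaviour, so a drop in cache effectiveness late in the sync shows up much more clearly in the per-interval column.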

## Disk / VFS Stats

In [None]:
plot(select(bio, 'count'), title='I/O Requests')

In [None]:
plot(select(bio, 'pages'), title='Pages read from disk')

In [None]:
vfstime = select(vfs, 'time')
#vfstime[vfstime.columns] /= 1e9
plot(vfstime, title="VFS Time")

In [None]:
plot(leveldb['get:time']/1e8, title='Time Waiting For leveldb::DBImpl::Get()', ylabel='% of Time')

In [None]:
plot((leveldb['get:time']/1e4)/leveldb['get:count'], title='Time Per LevelDB::Get')