
Work proposal: EOS History Indexer

cc32d9@gmail.com

Introduction

About the author: I am a senior engineer with over 20 years of experience in software and network engineering. I am quite active in open-source software development under my real name, and all public activity related to EOS is done under the nickname cc32d9, which is the beginning of the sha256 hash of my real name. Investors of this project will know me by my real name. I own a consultancy company in Europe, and all payments will be done with proper accounting and taxation.

My previous work proposal aimed to deliver an end-to-end solution for a new EOS history service, but because of the decline in the cryptocurrency economy it was decided to terminate the project with the first release of the Chronicle software.

As of today, Chronicle is a fast and reliable reader and decoder of the state history archive generated by state_history_plugin in nodeos.

What needs to be built next is an end-to-end solution for hosting the history archive and serving queries against it.

The purpose of this work proposal is to deliver a solution that scales from a low-cost server to a large cluster serving thousands of requests per second. This will be achieved by means of efficient indexing, caching, and data replication.

Scope of the work proposal

This work proposal aims to deliver free and open-source software tools and scripts, accompanied by detailed installation instructions.

Building a production service infrastructure is not covered in this proposal.

The primary goal of this work proposal is to build a scalable back-end for the history database, aiming for easy scalability and low maintenance costs. It will also deliver a basic API, but more complex APIs would be built on top of it by other teams.

Functional requirements

  • Provide a back-end for queries compatible with history_plugin requests, such as get_actions (a request-handling sketch follows after this list).

  • Provide an API for custom indexers: many projects need data specific to their use case, typically indexing of specific actions within their smart contracts.

  • Provide the means for live streaming of blockchain events, such as actions within a specific account or changes in contract tables.
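
As an illustration of the first requirement, below is a minimal sketch of how a history_plugin-compatible get_actions request might be served from the index database. It assumes Express for HTTP; the lookupActionReceipts helper and the simplified response shape are placeholders, not a final API.

```javascript
// Sketch of a history_plugin-compatible get_actions endpoint.
// Assumptions: Express for HTTP; lookupActionReceipts() is a placeholder
// for the index-database lookup described in this proposal.
const express = require('express');
const app = express();
app.use(express.json());

// Placeholder: would query the per-account action receipt index and
// return receipts together with their block numbers.
async function lookupActionReceipts(account, pos, offset) {
  return [];
}

app.post('/v1/history/get_actions', async (req, res) => {
  // Same request fields as history_plugin: account_name, pos, offset.
  const { account_name, pos = -1, offset = -20 } = req.body;
  const actions = await lookupActionReceipts(account_name, pos, offset);
  res.json({ actions });
});

app.listen(8888, () => console.log('history API listening on :8888'));
```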

Proposed approach

state_history_plugin generates an archive of all traces and table deltas and stores them in compressed form in the filesystem. As of today, this is about 500GB of data. The archive has an elementary index that allows quick retrieval by block number. The traces are stored in binary form, and the ABI relevant to that moment in history is required to decode the traces and table deltas.

Chronicle Receiver reads the state history data sequentially and stores all revisions of contract ABIs in its state memory. It also exports the traces and table deltas as JSON, either as a sequential stream or as individual blocks on demand.

There will be two different types of processes:

  • Indexer: reads the stream of traces and deltas from Chronicle and builds an index database. This database does not store the traces themselves; it refers to block numbers in the history archive. The simplest kind of index is per-account action receipts. The indexer will have a modular structure, so that users can add indexing functions as required (a module interface sketch follows after this list). Normally the indexer would go through the whole history from genesis and then process real-time data as it arrives from the blockchain. It is also possible to restart the indexer from a specific block number if application-specific indexer modules are added. Typically there would be one indexer process retrieving data from Chronicle, and the underlying database would replicate itself to multiple back-end instances. Once the indexer is N days behind the head block, it will start storing the JSON traces in the back-end database; once it reaches the head block, it will also start deleting traces older than N days.

  • Query processing: API requests search for relevant block numbers in the index database. If the required block is within N days of the head and the relevant transaction trace is available in the database, the data will be delivered to the consumer quickly. Otherwise, the corresponding blocks would be retrieved by Chronicle from the state history archive. This process can easily be distributed across multiple servers and processes to use CPU cores efficiently. Once the blocks are retrieved and decoded into JSON, the data will be cached in RAM; a Least Recently Used (LRU) cache will allow efficient reuse of data that is requested more often.
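
To illustrate the modular structure mentioned in the indexer description, here is a rough sketch of what an indexer module and the core loop consuming Chronicle output could look like. All names and signatures (onTrace, db.insertActionReceipt, the block object layout) are assumptions for discussion, not a defined interface.

```javascript
// Illustrative indexer module and core loop. The hook names, the db
// interface, and the block object layout are assumptions, not a defined API.

// Built-in module producing the simplest index: per-account action receipts.
const accountActionsModule = {
  name: 'account_actions',
  // Called for every decoded transaction trace in a block.
  async onTrace(db, blockNum, trace) {
    for (const action of trace.action_traces || []) {
      await db.insertActionReceipt({
        account: action.receipt.receiver,   // receiving account
        contract: action.act.account,       // contract that declared the action
        action: action.act.name,
        block_num: blockNum,                // the index refers to block numbers only
        trx_id: trace.id,
      });
    }
  },
};

// Core loop: feed decoded blocks from Chronicle through all registered modules.
async function runIndexer(chronicleStream, db, modules) {
  for await (const block of chronicleStream) {
    for (const trace of block.traces) {
      for (const mod of modules) {
        await mod.onTrace(db, block.block_num, trace);
      }
    }
  }
}
```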

ScyllaDB is the primary candidate for storing the index data. It is optimized for large data volumes and has automatic replication built in.
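
As a first approximation, the index could be laid out as follows, using the Cassandra-compatible CQL interface and Node.js driver that work with ScyllaDB. The keyspace, table, and column names are illustrative assumptions; the actual schema is part of the POC work described below.

```javascript
// Possible layout of the action-receipt index in ScyllaDB, via its
// Cassandra-compatible CQL interface. Keyspace, table, and column names
// are illustrative assumptions.
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
});

async function createSchema() {
  await client.execute(`
    CREATE KEYSPACE IF NOT EXISTS eos_history
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}`);

  // Partitioned by account so get_actions-style queries hit one partition,
  // clustered by block number for range scans over history.
  await client.execute(`
    CREATE TABLE IF NOT EXISTS eos_history.account_actions (
      account   text,
      block_num bigint,
      seq       bigint,
      contract  text,
      action    text,
      trx_id    text,
      PRIMARY KEY (account, block_num, seq))`);
}

createSchema().then(() => client.shutdown());
```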

In addition, if the budget allows it, there will be an infrastructure for real-time streaming of blockchain data.

Quite likely, the indexer will be implemented in JavaScript, for the following reasons:

  • It has to process JSON quickly.

  • It has to allow adding third-party modules easily, so C++ and Go would be problematic. A combined approach could work, though, allowing JavaScript plugins to be executed within a processing engine written in some other language.

  • I'm fast at Perl programming, but the language is too unpopular.

  • I'm bad at Python programming.

Ongoing development and POC

Development of Chronicle continues in my free time, and this development effort is budgeted in the cost estimation below.

As of today, the Chronicle reader stores all ABI revisions for every contract in its state memory, and interactive mode is implemented: upon request, a block is retrieved from the state history archive and decoded according to the ABI revisions relevant for that block.

The interactive mode is quite fast. At the moment my development server keeps all state history on two HDD drives in a RAID0 array, and about half of the interactive response time is spent retrieving the block data from the disks. Subsequent requests for the same block numbers are much faster because the data is served from the filesystem cache.

For a random set of 420 block numbers (the action history of cc32dninexxx from block 1070845 to 40743062), an interactive Chronicle request takes 33s on the first run and 17s when the same request is repeated. The produced traces amount to 191MB of pretty-printed JSON.

A simple proof-of-concept indexer is built on top of MySQL, with two tables: one for transaction IDs and one for action receipts, storing the contract name, action name, recipient name, and a reference to the transaction. The transaction table keeps references to block numbers.
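
A rough reconstruction of that two-table layout is sketched below, using the mysql2 Node.js driver. The exact column names and types are assumptions; only the overall structure follows the description above.

```javascript
// Rough reconstruction of the two-table MySQL POC layout. Column names
// and types are assumptions; only the structure (transactions referencing
// block numbers, receipts referencing transactions) follows the text.
const mysql = require('mysql2/promise');

async function createPocSchema() {
  const conn = await mysql.createConnection({
    host: 'localhost', user: 'indexer', database: 'eos_index',
  });

  // Transaction IDs, each with a reference to its block number in the archive.
  await conn.query(`
    CREATE TABLE IF NOT EXISTS transactions (
      seq       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      trx_id    CHAR(64) NOT NULL,
      block_num BIGINT UNSIGNED NOT NULL,
      INDEX (trx_id), INDEX (block_num))`);

  // Action receipts: contract, action, recipient, and a reference to the
  // transaction row above.
  await conn.query(`
    CREATE TABLE IF NOT EXISTS action_receipts (
      contract  VARCHAR(13) NOT NULL,
      action    VARCHAR(13) NOT NULL,
      recipient VARCHAR(13) NOT NULL,
      trx_seq   BIGINT UNSIGNED NOT NULL,
      INDEX (recipient, contract, action))`);

  await conn.end();
}

createPocSchema();
```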

The POC indexer database occupies approximately the same space as the state history archive (about 500GB), and the primary bottleneck is MySQL's performance when inserting millions of rows. The Chronicle reader idles about 30% of the time, waiting for MySQL to process the data. A full indexing run takes about a week.

I started with the MySQL data on HDD, which worked fine for the first 20M blocks, but disk I/O then became a heavy bottleneck, so I moved the data to NVMe and continued the indexing.

I have also started designing the ScyllaDB database schema and running tests with it.

Cost estimation

The cost estimation is based on a rate of US$150 per hour. This is slightly below my usual rate for engineering and development work. If parts of the work can be outsourced, the total bill may be reduced.

Chronicle development: 50 hours

This work is mostly done. Chronicle now stores all ABI revisions and allows interactive requests. Additionally, there will be an on-demand streaming mode that allows requesting a stream of blocks starting from any block in the past. ABI caching for interactive mode also needs to be implemented.

POC with ScyllaDB: 20 hours

ScyllaDB needs to be thoroughly tested to check whether it is suitable for the project. At least one full indexing run needs to be done with a minimal set of tables, analogous to the MySQL POC. It is also important to see whether ScyllaDB data can be stored on HDD while indexing proceeds at full speed. This work is partially done.

Indexer implementation: 80 hours

The biggest challenge is to build an API and runtime environment that allows third parties to add their own processing modules while sharing common storage and a common source of data.

Querying infrastructure: 40 hours

The querying infrastructure will access the indexes, so it needs to allow third-party modules to query their specific data. After looking up the index database, specific blocks would be retrieved from Chronicle and cached in an LRU cache.
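
A minimal sketch of such an LRU cache for decoded block data, keyed by block number, is shown below. The cache size and the fetchBlockFromChronicle helper are placeholders; a production version would likely bound memory by data size rather than entry count.

```javascript
// Minimal LRU cache for decoded block data, keyed by block number.
// MAX_ENTRIES and fetchBlockFromChronicle() are placeholders.
const MAX_ENTRIES = 10000;
const cache = new Map(); // a Map keeps insertion order, so it can serve as an LRU

// Placeholder: would issue an interactive request to Chronicle, which
// retrieves the block from the state history archive and decodes it.
async function fetchBlockFromChronicle(blockNum) {
  return { block_num: blockNum, traces: [] };
}

async function getDecodedBlock(blockNum) {
  if (cache.has(blockNum)) {
    const value = cache.get(blockNum);
    cache.delete(blockNum);       // move to the most-recently-used position
    cache.set(blockNum, value);
    return value;
  }
  const value = await fetchBlockFromChronicle(blockNum);
  cache.set(blockNum, value);
  if (cache.size > MAX_ENTRIES) {
    cache.delete(cache.keys().next().value); // evict the least recently used entry
  }
  return value;
}
```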

Reference API: 20 hours

This will be a reference implementation of the history API. Its primary focus is to provide output for testing and to demonstrate the access patterns to other API implementations.

Budget leftovers

If parts of the work can be implemented quicker or cheaper, the remaining budget would be spent on the following development:

  • History API fully compatible with history_plugin;

  • Application-specific indexers for some well-known smart contracts, such as dGoods;

  • Assisting BP teams with setting up the infrastructure.

Hosting costs

Hosting of the physical and virtual servers needed for development and testing is estimated at US$1000.

Total

Total work is estimated at 210 hours, or US$32,500 including the hosting costs (210 hours × US$150 = US$31,500, plus US$1,000 for hosting).

Delivery and payments

At least 15 sponsor organizations need to confirm their participation. The target deadline for finalizing the list is May 1st, 2019.

Once the list of sponsors is finalized, the deposit account cc32dninewp1 will be set to multisignature, requiring at least 3 sponsor signatures for active and owner privileges.

The whole amount would be fixed in EOS at the average daily EOS/USDT rate on Binance and transferred to the deposit account in equal parts by each sponsor.

50% of the total amount would be transferred immediately to cc32dninexxx, and the rest would be released after project acceptance.