Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add address-based index (attempt 4?) #14053

Open
wants to merge 6 commits into
base: master
from

Conversation

@marcinja
Copy link

marcinja commented Aug 24, 2018

Adds index to transactions by scriptPubKey. Based off of #2802. Stores hashes of scripts (hashed using Murmurhash3, with a hash seed that is stored in the index database), along with the COutPoint for the output which was spent/created, and the CDiskTxPos for the transaction in which this happened.

@marcinja marcinja mentioned this pull request Aug 24, 2018
@DrahtBot

This comment has been minimized.

Copy link
Contributor

DrahtBot commented Aug 24, 2018

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #17897 (init: Stop indexes on shutdown after ChainStateFlushed callback. by jimpo)
  • #17383 (Refactor: Move consts to their correct translation units by jnewbery)
  • #16365 (Log RPC parameters (arguments) if -debug=rpcparams by LarryRuane)
  • #16224 (gui: Bilingual GUI error messages by hebasto)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link
Contributor

ryanofsky left a comment

Reviewed most of the code, but just skimmed tests. It looks to me like this PR could be merged basically in its current form, so I'm curious if you're intending to make the improvements cited in the PR description above here or in a separate PRs.

I left a few minor suggestions about the code, which you should feel free to ignore. The only changes I would definitely like to see here are:

  • adding some python code in test/functional/ to call the new rpc method
  • adding a blurb in doc/release notes.md to describe the feature and maybe mention some use-cases
src/index/base.h Outdated Show resolved Hide resolved
std::unique_ptr<AddrIndex> g_addrindex;

/**
* Access to the addrindex database (indexes/addrindex/)

This comment has been minimized.

Copy link
@ryanofsky

ryanofsky Aug 29, 2018

Contributor

In commit "Introduce address index" (3c7cc3c)

Note: new index/addrindex.cpp, index/addrindex.h, and test/addrindex_tests.cpp files in this commit mirror existing index/txindex.cpp and index/txindex.h, test/txindex_tests.cpp files and have some code and comments in common. It can help to diff the addr files against the tx files when reviewing this PR.

src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/rpc/rawtransaction.cpp Show resolved Hide resolved
src/init.cpp Outdated Show resolved Hide resolved
"\nArguments:\n"
"1. \"address\" (string, required) The address to search for\n"
"2. \"verbose\" (bool, optional, default = false) If set to false, only returns data for hex-encoded `txid`s. \n"
"3. \"skip\" (numeric, optional, default = 0) If set, the result skips this number of initial values. \n"

This comment has been minimized.

Copy link
@luke-jr

luke-jr Aug 30, 2018

Member

skip and count probably make sense on an options object instead.

src/rpc/rawtransaction.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

jimpo left a comment

Nice work! I'm glad someone's working on this. Concept ACK.

  1. The AddrIndex should return information about the outpoints and differentiate between outputs and spends, not just return the raw transactions. In fact, the AddrIndex could just return outpoints, then the client code could use the TxIndex could to fetch the tx bodies. It'd involve a separate lookup though.
  2. I don't think it's necessary to delete keys from the database when a block is disconnected. There's no harm in leaving it. The higher level methods to search the index can then filter for results that are on the main chain if that's what the client wants. It'd have to do this anyway to avoid races with reorgs and such.
  3. I'm worried about collisions on address IDs because they are only 64 bits. I can think of three options, 1) use a 32 byte cryptographic hash, 2) use a 20 byte cryptographic hash of the script plus some randomly generated, database-global salt, 3) use a 32- or 64-bit non-cryptographic hash (might as well use Murmur3 or SipHash, not truncated SHA256), then store the full script as the database value to double check against. Option 3 feels best to me.
  4. What's the purpose of having the first byte of the block hash as the value? It doesn't seem robust nor particularly useful.
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.h Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/base.cpp Outdated Show resolved Hide resolved
src/index/base.h Outdated Show resolved Hide resolved
src/index/base.h Outdated Show resolved Hide resolved
src/index/addrindex.h Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
@marcinja

This comment has been minimized.

Copy link
Author

marcinja commented Aug 31, 2018

Thanks for all the reviews.

To answer some of @jimpo's questions:
2 & 4. I included part of the block hash so that in BlockDisconnected we ae sure to remove the entries in the index from this block only (that's where filter_by_value is used). The reason I chose to remove entries from the database is to prevent reading into a block file using an old CDiskTxPos that may no longer be a valid position. Otherwise in FindTxsByScript you could run into errors. You're right that this problem would be better handled by higher level methods.

I think that returning just the outpoint is a better idea than the current choice so I'll switch to that and try to incorporate all the other feedback here.

src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/test/test_bitcoin.h Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
src/index/addrindex.cpp Outdated Show resolved Hide resolved
@marcinja marcinja force-pushed the marcinja:add-address-index branch from cf3da73 to a6685c6 Dec 3, 2019
@DrahtBot DrahtBot removed the Needs rebase label Dec 3, 2019
@marcinja marcinja force-pushed the marcinja:add-address-index branch 2 times, most recently from f36990f to 51103c2 Dec 3, 2019
@marcinja marcinja force-pushed the marcinja:add-address-index branch 3 times, most recently from c70aaf5 to ec5fbd0 Jan 2, 2020
marcinja added 3 commits Jan 17, 2019
Adds index that relates scriptPubKeys to location of transactions in the
filesystem, along with the hash of the block they are found in, and the
outpoint information of the txout with the related script.
@jnewbery jnewbery added the Review club label Jan 3, 2020
marcinja added 3 commits Aug 16, 2019
Adds searchrawtransactions RPC that uses the address index to lookup
transactions and outpoints by script and address. Adds basic functional
tests for searchrawtransactions.
Setup address index in initialization process.
Add initialization warning and wallet feature request warning as
suggested by ryanofsky.
@marcinja marcinja force-pushed the marcinja:add-address-index branch from ec5fbd0 to c217671 Jan 3, 2020
@ryanofsky

This comment has been minimized.

Copy link
Contributor

ryanofsky commented Jan 6, 2020

Nice set of notes on this PR at https://bitcoincore.reviews/14053.html

Copy link
Member

jnewbery left a comment

I've done a preliminary review and left a few comments.

I think this PR could use:

  • some design documentation - what's being stored as keys and values in the addr index database and why, and what values are being returned to the RPC user. This could be a code comment at the top of addrindex.cpp or addrindex.h.
  • more test cases, particularly of the spent outputs values (there aren't currently any tests for that) and persistence across stop-start. The functional test can be moved to its own file, rather than being added to rpc_rawtransaction.py.
@@ -135,6 +135,8 @@ BITCOIN_CORE_H = \
httpserver.h \
index/base.h \
index/blockfilterindex.h \
index/addrindex.h \
index/disktxpos.h \

This comment has been minimized.

Copy link
@jnewbery

jnewbery Jan 7, 2020

Member

nit: this is added in the wrong commit (should be in Move only: Move CDiskTxPos to own file, not Add address index)

const std::unique_ptr<DB> m_db;

// m_hash_seed is used by GetAddrID in its calls to MurmurHash3.
// It is stored in the index, and restored from their on construction

This comment has been minimized.

Copy link
@jnewbery

jnewbery Jan 7, 2020

Member

spelling: there


std::unique_ptr<AddrIndex> g_addr_index;

static constexpr char DB_ADDR_INDEX = 'a';

This comment has been minimized.

Copy link
@jnewbery

jnewbery Jan 7, 2020

Member

I'm not sure if this is actually needed. The other indexes have different prefixes for the different object types that they store, but all objects in this index are given the same prefix, so is it necessary?

if (!iter->GetKey(key) || key.m_addr_id != addr_id || !iter->GetValue(value) ) break;

// Check that the stored script matches the one we're searching for, in case of hash collisions.
if (value.second != script) continue;

This comment has been minimized.

Copy link
@jnewbery

jnewbery Jan 7, 2020

Member

Does this branch need to advance the iterator? If not, I think value will be the same the next time round and we'll never break out of this loop.

I think the following is what we want:

    ...
    if (value.second == script) {
        result.emplace_back(std::make_pair(key, value));
    }

    iter->Next();
}
CDiskTxPos pos(pindex->GetBlockPos(), GetSizeOfCompactSize(block.vtx.size()));
std::vector<std::pair<DBKey, DBValue>> entries;

const bool not_genesis_block = (pindex->nHeight > 0);

This comment has been minimized.

Copy link
@jnewbery

jnewbery Jan 7, 2020

Member

style nit: having a variable defined as not_thing seems unintuitive to me. I think it'd be better to define genesis and then test on !genesis.

const CTxUndo& tx_undo = block_undo.vtxundo[i-1];
for (size_t k = 0; k < tx.vin.size(); ++k) {
CScript spent_outputs_scriptpubkey = tx_undo.vprevout[k].out.scriptPubKey;
DBKey key(DBKeyType::SPENT, GetAddrId(spent_outputs_scriptpubkey), tx.vin[k].prevout);

This comment has been minimized.

Copy link
@jnewbery

jnewbery Jan 7, 2020

Member

I don't think this isn't the information that we want to save in the address index. For a given scriptPubKey, a user wants to know:

  • which txouts (txid, output index) spent to that scriptPubKey (the CREATED DBKeyType above)
  • which txins (txid, input index) consume UTXOs for that scriptPubKey (the SPENT DBKeyType here)

So here, I think you want to save the txid and input index spending the coin. You're actually saving the txid and output index that creates the coin, because you're using the prevout.

(I'm not entirely sure about this. Perhaps you are trying to return the outpoint that created the coin in the spent outputs array, but that's not clear to me from the RPC documentation).

This comment has been minimized.

Copy link
@narula

narula Jan 8, 2020

Contributor

I think this is correct because he is using the prevout in the key. When looking up by this script with this outpoint, one would want to find this transaction because it spends this outpoint.

AFAICT the value just contains the position of the relevant transaction on disk and the scriptPubKey (to detect collisions). The value does not contain any indexes into inputs or outputs.

This comment has been minimized.

Copy link
@narula

narula Jan 8, 2020

Contributor

Can you add a comment with your key/value format somewhere and the justification?

{
{"address", RPCArg::Type::STR_HEX, RPCArg::Optional::NO, /* default_val */ "", "address or scriptPubKey"},
{"verbose", RPCArg::Type::BOOL, RPCArg::Optional::OMITTED, /* default_val */ "false", "If false, return hex-encoded tx, otherwise return a json object"},
{"count", RPCArg::Type::NUM, RPCArg::Optional::OMITTED, /* default_val */ "", "The block in which to look for the transaction"},

This comment has been minimized.

Copy link
@jnewbery

jnewbery Jan 7, 2020

Member

"The block in which to look for the transaction" is wrong

{
RPCResult{"if verbose is not set or set to false",
"\nResult:\n"
" [ (array of json objects)\n"

This comment has been minimized.

Copy link
@jnewbery

jnewbery Jan 7, 2020

Member

I suggest you make this an object with two keys: creates and spends

@@ -213,6 +214,142 @@ static UniValue getrawtransaction(const JSONRPCRequest& request)
return result;
}

static UniValue searchrawtransactions(const JSONRPCRequest& request) {
if (request.fHelp || request.params.size() < 1 || request.params.size() > 4)

This comment has been minimized.

Copy link
@andrewtoth

andrewtoth Jan 8, 2020

Contributor
Suggested change
if (request.fHelp || request.params.size() < 1 || request.params.size() > 4)
if (request.fHelp || request.params.size() < 1 || request.params.size() > 3)

This comment has been minimized.

Copy link
@MarcoFalke

MarcoFalke Jan 10, 2020

Member

Should use the new .Check() syntax instead

{
{"address", RPCArg::Type::STR_HEX, RPCArg::Optional::NO, /* default_val */ "", "address or scriptPubKey"},
{"verbose", RPCArg::Type::BOOL, RPCArg::Optional::OMITTED, /* default_val */ "false", "If false, return hex-encoded tx, otherwise return a json object"},
{"count", RPCArg::Type::NUM, RPCArg::Optional::OMITTED, /* default_val */ "", "The block in which to look for the transaction"},

This comment has been minimized.

Copy link
@andrewtoth

andrewtoth Jan 8, 2020

Contributor

The default value is implicitly 100. It should be explicitly defined here.

// to the output spent with the given script, and the transaction it was spent
// in. creations_result is filled with outpoints for outputs created with this
// script as their script pubkey, and the transactions they were created in.
bool AddrIndex::FindTxsByScript(const CScript& script,

This comment has been minimized.

Copy link
@andrewtoth

andrewtoth Jan 8, 2020

Contributor

Shouldn't this method take the count as a parameter and return early when enough entries are found?

}
spends.push_back(tx_val);
++spends_it;
}

This comment has been minimized.

Copy link
@andrewtoth

andrewtoth Jan 8, 2020

Contributor

If there are more spends than count then no creates will be returned. Ideally the spends and creates will be returned in order, rather than all spends first.

@@ -213,6 +214,142 @@ static UniValue getrawtransaction(const JSONRPCRequest& request)
return result;
}

static UniValue searchrawtransactions(const JSONRPCRequest& request) {

This comment has been minimized.

Copy link
@andrewtoth

andrewtoth Jan 8, 2020

Contributor

There doesn't seem to be a way to skip transactions anymore. Was this removed intentionally?

while (iter->Valid()) {
DBKey key;
DBValue value;
if (!iter->GetKey(key) || key.m_addr_id != addr_id || !iter->GetValue(value) ) break;

This comment has been minimized.

Copy link
@narula

narula Jan 8, 2020

Contributor

Do you want to log or throw something here? If I understand the LevelDB docs correctly, you have what LevelDB says is a valid iterator which should be positioned at the key. I think if any of these return false you'd want to know as the DB could be corrupted.

This comment has been minimized.

Copy link
@narula

narula Jan 8, 2020

Contributor

I think in general it would be nice if this class did some of the error logging that I see in blockfilterindex.cpp.

@@ -487,6 +486,44 @@ def run_test(self):
assert_equal(testres['allowed'], True)
self.nodes[2].sendrawtransaction(hexstring=rawTxSigned['hex'], maxfeerate='0.20000000')

self._test_searchrawtransactions()

def _test_searchrawtransactions(self):

This comment has been minimized.

Copy link
@narula

narula Jan 8, 2020

Contributor

Please also test the spends_results path (returning transactions which spend from this address).

Copy link
Member

jonatack left a comment

Still gathering context here in light of the Concept ACKs others have given. Nevertheless, on first look I'm finding it difficult to ack the concept of investing resources on reviewing, adding, and maintaining a feature that will need to be removed down the road in favor of a dedicated external index by interested parties.

@fjahr

This comment has been minimized.

Copy link
Contributor

fjahr commented Jan 8, 2020

For me what this PR is missing some analysis on the current size of the index in mainnet and the growth rate of the data. Has anyone run the numbers recently and can report?

UniValue tx_val(UniValue::VOBJ);

UniValue outpoint_val(UniValue::VOBJ);
outpoint_val.pushKV("txid", spend.first.hash.GetHex());

This comment has been minimized.

Copy link
@marcinja

marcinja Jan 8, 2020

Author
Suggested change
outpoint_val.pushKV("txid", spend.first.hash.GetHex());
outpoint_val.pushKV("txid", spend.second.first.hash.GetHex());

I'll write a functional test that catches this too :)

// Place entry into correct vector depending on its type.
switch (key.m_key_type) {
case DBKeyType::SPENT:
spends_result.emplace_back(std::make_pair(key.m_outpoint, result));

This comment has been minimized.

Copy link
@narula

narula Jan 8, 2020

Contributor

As pointed out by jnewbery in review club, I think this is wrong.

The DB's key always represents the outpoint that created the scriptPubKey. For spends_result, you want the txid of the transaction that spends the scriptPubKey. So instead of the DB key's outpoint as the transaction id to be returned to the user, you want the transaction id of the transaction in the DB value. Or, in the RPC code lines 314 and 315, use the txid from spend.second.

Edit: Or maybe this field of the returned result is always the created_outpoint, even though it's called the spent_outpoint sometimes? Would be good to clarify this.

@fjahr

This comment has been minimized.

Copy link
Contributor

fjahr commented Jan 8, 2020

I think @jnewbery was hinting at this in the PR Review Club as well: it seems like this index does not handle reorgs. BaseIndex handles this through Rewind which is not overridden.

@andrewtoth

This comment has been minimized.

Copy link
Contributor

andrewtoth commented Jan 10, 2020

Ran on i7-8750H, took about 48 hours on an already synced node. Took 200 GB disk space.

2020-01-08T02:22:01Z addr_index thread start
...
2020-01-10T02:37:07Z addr_index is enabled at height 611001
2020-01-10T02:37:07Z addr_index thread exit
$ du -h addr_index
200G	addr_index
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.