
new bip: compact client side filtering #609

Closed
wants to merge 14 commits

Conversation

Contributor

@Roasbeef Roasbeef commented Nov 9, 2017

This BIP describes a new light client node type for Bitcoin as well as the
modifications to current full-nodes required to support this new type of light
client. The light client mode described in this BIP is meant to supersede BIP
37 as it provides a greater degree of privacy, utility, and also reduces the
resources required for full-nodes to service this new light client mode
compared to BIP 37. The light client mode described in this BIP can be seen as
a "reversal" of BIP 37: rather than the light clients sending filters to
full-nodes, full-nodes send filters to light clients. Unlike BIP 37, we don't
utilize bloom filters. Instead, we utilize a compact filter (more efficient
than bloom filters) which leverages Golomb-Rice coding for compression.
Additionally, blocks are downloaded as a whole (from any source), rather than
directly from peers as fragments with merkle-branches proving their
authenticity.

We've tested the implementation of both a light client implementing this BIP, as well as the serving full-node, on Bitcoin's testnet over the past several months. Additionally, this new operating mode also serves as the basis for our lightweight Lightning node.
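
A minimal sketch of the client-side flow described above, in Python. This is illustrative only; the three callables are placeholders for the P2P messages and the GCS filter matching that the BIP itself defines.

<pre>
# Illustrative sketch of the flow described above; the three callables are
# placeholders, not messages defined here.
def scan_block(block_hash, wallet_items, fetch_filter, filter_matches, fetch_block):
    """Download the compact filter for a block, and fetch the full block
    (from any source) only when the filter reports a possible match."""
    cfilter = fetch_filter(block_hash)
    if filter_matches(cfilter, wallet_items):
        # A match may be a false positive; the caller inspects the actual
        # transactions in the returned block to confirm relevance.
        return fetch_block(block_hash)
    return None
</pre>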

Member

@luke-jr luke-jr left a comment

Needs to address backward compatibility, at least briefly (eg, just say there isn't any).

We probably shouldn't copy the test vectors here. The external repo is good for that.

Comments-URI: ???
Type: Standards Track
Created: 05-24-2017
License: PD
Member

Please choose one of the recommended, or at least acceptable licenses (and also add a Copyright section below).

Contributor Author

Dunzo, we chose CC0-1.0.

37 as it provides a greater degree of privacy, utility, and also reduces the
resources required for full-nodes to service this new light client mode
compared to BIP 37. The light client mode described in this BIP can be seen as
a "reversal"[1] of BIP 37: rather than the light clients sending filters to
Member

[1] should be <ref>https://github.com/bitcoin/bips/blob/master/bip-0037.mediawiki</ref> (and use <references/> below) to make it auto-linkable. (suggestion, not mandatory)

Contributor Author

Done.

* Transmission of runs of 1's is omitted.
* Two 1's in a row are denoted by a zero-length run of zero.

As an example, consider the following sequence of bits:
Contributor

This should mention that k=4 in the example.

Also, I'm confused by the RLE scheme. How are runs of more than 2 1's represented? What would be the encoding of the below example with s/0{20}/0{15} 1 0{5}/?

Contributor

Yea, this is underspecified and also probably not entirely relevant here - the point of the Specification section is to describe what you do, not some alternative ways to encode the data.

Contributor Author

Opted to move this to a "Background" appendix at the end of the document.

Before we specify the details of our proposal, we'll first go over a few
preliminaries which will aid in the understanding of our proposal.

By <code>[]byte</code> we refer to a slice (or array) of bytes. This value is
Contributor

I feel like "vector" is more appropriate than "array" (not just in a C++ sense).

Contributor Author

Modified wording to say "vector" instead of array.

</pre>

RLE allows one to efficiently encode a data stream in a lossless manner. Due
the the encoding of runs, RLE works best when encoding a set with a high degree
Contributor

typo: Due to the encoding

</pre>

[https://en.wikipedia.org/wiki/Golomb_coding Golomb Coding] encodes the two
values (<code>q</code> and <code>m</code> for a given integer <code>n</code> as a two-tuple. The first value
Contributor

typo: (q and r) for a given integer n

Also missing closing parenthesis.

Contributor Author

Fixed.

c*m + r
</pre>

To aide in understanding we provide the following examples of using Golomb-Rice
Contributor

typo: aid. aide is a noun.

Contributor Author

Fixed.
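
For context, a worked Golomb-Rice example in Python, consistent with the quotient/remainder description quoted above. The unary convention (q 1-bits followed by a terminating 0) and the big-endian P-bit remainder are assumptions here; tightening exactly those conventions is discussed elsewhere in this thread.

<pre>
# Golomb-Rice coding of a single integer n with Rice parameter P (modulus 2^P).
# Assumed conventions: unary = q ones then a terminating zero; remainder is
# written big-endian in P bits.
def golomb_rice_encode(n, P):
    q, r = n >> P, n & ((1 << P) - 1)        # quotient and remainder
    return '1' * q + '0' + format(r, '0{}b'.format(P))

# Example: n = 25, P = 4  ->  q = 1, r = 9  ->  '10' + '1001'
assert golomb_rice_encode(25, 4) == '101001'
</pre>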


In addition, to help optimize the algorithm, we use a fast range algorithm[7],
multiplying the hashed value by F and taking only the top 64 bits. This fairly
distributes the values over F without division and can be done with fewer cycles
Contributor

Doesn't mention what strategy this algorithm uses fewer cycles than.

Contributor Author

Elaborated a bit more.
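
The fast-range reduction referenced in the quoted text is small enough to show directly: multiply the 64-bit SipHash output by F and keep the top 64 bits of the 128-bit product, mapping the value onto [0, F) without an integer division. A small Python sketch:

<pre>
# Fast-range reduction: map a 64-bit hash h onto [0, F) by taking the top
# 64 bits of the 128-bit product h * F, avoiding a division/modulo.
def fast_range(h, F):
    assert 0 <= h < 1 << 64
    return (h * F) >> 64

assert fast_range((1 << 64) - 1, 1000) == 999   # maximum hash maps near the top of the range
</pre>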


// Compute the difference between this value and the last one, divided
// by P. This is our quotient.
let quotient = (value - last_value - remainder) >> fp
Contributor

Is it necessary to subtract the remainder? Just the shift is sufficient, no?
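
As a quick sanity check of the question above: if <code>remainder</code> is just the low <code>fp</code> bits of the delta, subtracting it only clears bits that the right shift discards anyway, so the two forms agree for any non-negative delta.

<pre>
# (delta - (delta & mask)) >> fp == delta >> fp for non-negative delta,
# so subtracting the remainder before the shift is a no-op.
def equivalent(delta, fp):
    remainder = delta & ((1 << fp) - 1)
    return (delta - remainder) >> fp == delta >> fp

assert all(equivalent(delta, 4) for delta in range(1 << 16))
</pre>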


The 6th service bit will now be dedicated to signaling support for the
features described within this BIP:
* <code>CFNodeCF = 1 << 6</code>
Contributor

SFNodeCF

Contributor Author

Fixed.
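
For readers following along, checking the proposed service bit is a one-liner. The constant's exact name (CFNodeCF vs. SFNodeCF vs. a NODE_*-style name) is still being discussed later in this thread; the bit position is <code>1 << 6</code>.

<pre>
# The proposed service bit is bit index 6 of the 64-bit services field.
# The name below follows Core's NODE_* convention; the BIP's own naming is
# still under discussion in this thread.
NODE_CF = 1 << 6

def supports_compact_filters(services):
    return (services & NODE_CF) != 0

assert supports_compact_filters(NODE_CF | 1)     # e.g. combined with NODE_NETWORK
assert not supports_compact_filters(1)
</pre>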

message that allows light clients to ascertain which filters a node supports.

The <code>getcftypes</code> message is an ''empty message'' whose command string is:
<code>getcftypes</code>
Contributor

Mentioning the command string seems redundant.

Contributor

@TheBlueMatt TheBlueMatt left a comment

I found this spec very confusing...I could probably stumble my way through implementing it purely because of the test vectors, but I would certainly miss parts of it the first few times around. Also, there is way too much background information in the "Specification" section - it doesn't read as a specification, it reads as a general goal of what you want it to do.


fact to leverage a more efficient encoding scheme based on the distribution of
the length of a run. The [https://en.wikipedia.org/wiki/Geometric_distribution
Geometric Distribution] represents the probabilities of a number of failures
before the first success in a series of Bernoulli trials (yes/no experiments).
Contributor

This paragraph probably doesn't belong in the "Specification" section, motivation section seems much more appropriate, or maybe a "Background Math" section.

Contributor Author

Great suggestion, moved the bulk of the initial section to a "Background Math" section at the end of the document.

needs to sync a basic Bitcoin wallet. In order to facilitate this use-case, for
each transaction, normal filters contain:
* The outpoints of each input within a transaction.
* The data-pushes contained within the public key script of each output within the transaction.
Contributor

I dont think "public key script" is the way we normally refer to it....I'd say stick with either "scriptPubKey" or "the output script".

Contributor Author

Modified to use "output script".
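
For context, a sketch of gathering the per-transaction elements listed in the quoted bullets (input outpoints plus the data pushes of each output script). The transaction representation and the outpoint serialization below are illustrative only; the BIP defines the exact byte encodings.

<pre>
# Illustrative only: collect the "normal" filter elements for a block's
# transactions, per the two bullets quoted above. The named tuples are
# stand-ins for real transaction structures.
from collections import namedtuple

OutPoint = namedtuple('OutPoint', ['txid', 'index'])   # txid: 32 bytes, index: int
TxIn = namedtuple('TxIn', ['prev_outpoint'])
TxOut = namedtuple('TxOut', ['script_pushes'])         # data pushes of the output script
Tx = namedtuple('Tx', ['inputs', 'outputs'])

def normal_filter_elements(transactions):
    elements = set()
    for tx in transactions:
        for txin in tx.inputs:
            op = txin.prev_outpoint
            elements.add(op.txid + op.index.to_bytes(4, 'little'))
        for txout in tx.outputs:
            elements.update(txout.script_pushes)
    return elements
</pre>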

Contributor

No?

<code>new_bit_stream</code> The <code>bit_stream</code> has two functions that
operate on it, <code>unary_encode(stream, n)</code> and
<code>write_bits_big_endian(stream, n, k)</code> where <code>unary_encode(steam,
n)</code> emits n (an integer) to the stream in unary, and
Contributor

This is technically underspecified - "unary" is not a normative encoding, and can be 0-indexed, 1-indexed and can encode as 0s followed by a 1 or 1s followed by a 0.

Contributor Author

Updated to make more explicit.
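
To make the ambiguity concrete, here is one possible reading of the two bit-stream operations as a Python sketch. The unary convention chosen (n 1-bits followed by a terminating 0) is exactly the kind of detail the comment above asks the spec to pin down.

<pre>
# One possible reading of the bit-stream operations; the unary convention
# (n ones then a zero) is an assumption the spec needs to state explicitly.
class BitStream:
    def __init__(self):
        self.bits = []

def unary_encode(stream, n):
    stream.bits.extend([1] * n + [0])

def write_bits_big_endian(stream, n, k):
    for i in reversed(range(k)):          # most significant of the k bits first
        stream.bits.append((n >> i) & 1)

s = BitStream()
unary_encode(s, 1)                  # quotient 1  -> 1 0
write_bits_big_endian(s, 9, 4)      # remainder 9 -> 1 0 0 1
assert s.bits == [1, 0, 1, 0, 0, 1]
</pre>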


The following routine describes the compression process:
<pre>
gcs_compress(sorted_set, fp) -> []byte:
Contributor

Wait, sorted_set? No where else does anything reference that the set must be sorted? Sorted in what order?

Contributor Author

The hashed_set_construct function above this returns a sorted set. Updated to point this out, and made it explicit in the section above gcs_compress.
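
With the sorted-set requirement made explicit, the core of gcs_compress is a delta-then-Golomb-Rice loop. A Python sketch, with the bits shown as a '0'/'1' string for clarity (the BIP packs them into bytes):

<pre>
# Sketch of the compression loop: sort the hashed values, then Golomb-Rice
# encode the difference between consecutive values with Rice parameter fp.
def gcs_compress(hashed_set, fp):
    bits, last_value = [], 0
    for value in sorted(hashed_set):
        delta = value - last_value
        q, r = delta >> fp, delta & ((1 << fp) - 1)
        bits.append('1' * q + '0')                   # quotient in unary (assumed convention)
        bits.append(format(r, '0{}b'.format(fp)))    # remainder in fp bits, big-endian
        last_value = value
    return ''.join(bits)

print(gcs_compress({4, 20, 37}, fp=4))   # deltas 4, 16, 17
</pre>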

This filter header chain can be utilized by light clients to gain a greater
degree of security against bamboozling full-nodes during their initial chain
sync. In addition to fetching all the bitcoin headers, light clients
implementing this BIP should also fetch all the ''filter headers'' from ''each'' of
Contributor

I think this section may be clearer if you use RFC words (which you appear to already be doing) and be clear that you are. Also, what is the advantage of fetching the filter headerS from each peer? It would be less bandwidth and just as much cross-checking to only fetch the top block's filter header after you've synced the headers chain (which is probably a much better approach in any case as you can first make sure you're on the best chain, and then download a filters chain, protecting you from some types of DoS issues). You can then do a backwards sync of the filters. If there is a mismatch you do a binary search or otherwise to figure out which peer is lying.

Contributor Author

Modified to make the usage of RFC words explicit. The section below assumes the client has fully synced and is fetching from the tip. As a result, they already have the entire header chain. The naive way would be just to fetch the entire filter from each peer. This algo avoids that and only fetches the filter in the case of a bamboozlement attempt by a peer.
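
Roughly, the check described here looks like the following sketch: with the block header chain already synced, ask every peer for the filter header at the tip, and only download a filter when the answers disagree. The peer methods are placeholders for the filter-header and filter fetch messages the BIP defines.

<pre>
# Sketch of the tip cross-check described above. Peer methods are placeholders
# for the filter-header and filter fetch messages defined by the BIP.
def verify_tip_filter_header(tip_block_hash, peers, header_from_filter):
    claimed = {peer: peer.fetch_filter_header(tip_block_hash) for peer in peers}
    if len(set(claimed.values())) == 1:
        return claimed                                # all peers agree; no filters downloaded
    for peer, header in claimed.items():
        cfilter = peer.fetch_filter(tip_block_hash)   # only fetched on disagreement
        if header_from_filter(cfilter) != header:
            peer.ban()                                # peer's header does not match its own filter
    return claimed
</pre>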

bandwidth), a light client instead does the following:

<pre>
verify_from_tip(tip_block_hash: [32]byte):
Contributor

I dont know why this needs to be in the normative "Specification" section - clients can use this how they feel, and I think this is probably not the best approach anyway. You may wish to say something like clients SHOULD avoid syncing the full filter chain from all peers, but instead should only download from multiple peers when an inconsistency is detected.

Contributor Author

Moved to the "Implementation Notes" section.

// matches any relevant items.
</pre>

Light clients should persistently commit all filter headers to disk, as when
Contributor

I dont think this should be normative, but if it is, I'd suggest using RFC words and use MAY here instead of SHOULD.

Contributor Author

Modified to use MAY.

positives, and the bandwidth used to download the filters themselves. The code along with a demo used for the parameter tuning can be found [here]


=== Protocol Version Bump ===
Contributor

I do not believe this is necessary? There is a new service bit, which should be sufficient.

Contributor Author

AFAIK, all new extensions to the p2p protocol have been accompanied by a protocol version bump? (or at least ones that add new messages)

We define the concept of an abstract bit stream instantiated by the function:
<code>new_bit_stream</code> The <code>bit_stream</code> has two functions that
operate on it, <code>unary_encode(stream, n)</code> and
<code>write_bits_big_endian(stream, n, k)</code> where <code>unary_encode(steam,
Contributor

I think we need to much more seriously consider whether we want to, for the first time, introduce big endian at the purely-p2p layer - it is nice at a bit level, but it kinda sucks to have yet more endianness confusion in Bitcoin's P2P protocol.

Contributor Author

Big endian just seemed natural initially. It's not as if we're using big-endian to encode a field that was previously encoded using little-endian. This is a new data structure altogether.

achieve our desired false positive rate.

In addition, to help optimize the algorithm, we use a fast range algorithm[7],
multiplying the hashed value by F and taking only the top 64 bits. This fairly
Contributor

64 bits appears as a magic number here. I would note that the 64 bits comes from the SipHash output size.

Contributor Author

Specified.

@Roasbeef
Contributor Author

@luke-jr AFAICT other BIPs do contain test vectors in folders named the same as the BIP. For example, BIP-0039 contains the word list used. Added a section for backwards compatibility.

@TheBlueMatt thanks for the thorough review! Admittedly the initial section reads more like a blog post. Following your suggestion, I've adopted RFC words where applicable throughout, and created a new section ("Mathematical Background") that houses what used to be the opening portion of the specification.

@evoskuil
Contributor

evoskuil commented Nov 30, 2017

The light client mode described in this BIP is meant to supersede BIP 37 as it provides a greater degree of privacy...

As far as I can tell the privacy improvement is never quantified. It would seem that the independent downloading of less than all blocks (which is ultimately the objective) is the only source of privacy loss. However, given that privacy is a primary objective, it would be helpful if this analysis was explicit. I am specifically interested in whether there is any privacy loss in the filtering aspect (as compared to the operation of a full node).

@luke-jr
Member

luke-jr commented Nov 30, 2017

@Roasbeef I see you're doing a lot of revising of stuff. Let me know when you'd like this merged. (others commenting should note that this is just a draft, and merging does not preclude acting on comments in future revisions)

There's no hard rule against test vectors in the repo. My opinion on where they belong is an opinion.

Contributor

@TheBlueMatt TheBlueMatt left a comment

Just reviewed the P2P part again.

==== Peer to Peer Service Bit ====

To start, we reserve a currently unutilized service bit. This is required as
light clients SHOULD ''preferentially'' peer to full-nodes that support the
Contributor

I think SHOULD is maybe wrong here - some light clients will almost certainly have to only connect to nodes that have the bit set. You could replace with MAY preferentially, or only connect to, as required, or just drop it.

light clients SHOULD ''preferentially'' peer to full-nodes that support the
features outlined in this BIP.

The 6th service bit will now be dedicated to signaling support for the
Contributor

I guess terminology...its kinda the 7th bit.


features outlined in this BIP.

The 6th service bit will now be dedicated to signaling support for the
features described within this BIP:
Contributor

This is hugely under-defined - can you set the bit if you're not a full node (ie is NODE_NETWORK implied or is that separate and you can look for eg NODE_NETWORK_LIMITED & NODE_CF)? Do you have to have implemented both currently-defined filter types (No is better for maintainability, but that has huge usability implications)? What does this bit mean without NODE_NETWORK? Maybe you always have to return the filter chain hashes but dont need to return the filters themselves?

Member

I agree. It would be ideal if pruned nodes could set this and still be usable by light wallets (I might even suggest that clients SHOULD support using pruned nodes... many users want pruned nodes, and their own node should be the expected server)


The 6th service bit will now be dedicated to signaling support for the
features described within this BIP:
* <code>SFNodeCF = 1 << 6</code>
Contributor

What is the SF prefix here? I think generally BIPs have used Bitcoin Core's NODE_* constant naming (and generally use all-caps constant naming).


==== Compact Filter Header Chain ====

As the filters described in this BIP ''are not'' consensus critical, meaning
Contributor

I might suggest you strike this portion - you later say that clients should detect and ban nodes that dont have a normative filter for each block, so its a kind of consensus, even if not "consensus critical" per se.

header hashed with zero bytes). Clients can take this fact into account in
order to save a round trip for <code>null</code> headers.

Light clients implementing this proposal SHOULD utilize the
Contributor

This seems very strong. Why SHOULD? Hell, why even MAY? It seems very unrelated.


However, there exists a special case of a <code>null</code> filter. In this
case an empty byte slice is transmitted rather than consuming
<code>4-bytes</code> to encode the size of zero.
Contributor

Why the special case here? You're already sending many bytes anyway, dropping the extra 4 just seems...over optimization?


== Implementation Notes ==

This filter header chain SHOULD be utilized by light clients to gain a greater
Contributor

Maybe s/by light clients/in place of BIP 37/

degree of security against bamboozling full-nodes during their initial chain
sync. In addition to fetching all the bitcoin headers, light clients
implementing this BIP SHOULD also fetch all the ''filter headers'' from
''each'' of their connected peers. With these headers, light clients SHOULD
Contributor

I'd say they SHOULD only fetch the top header after doing a headers sync from all of their peers except one, from which they build the whole header tree.

require mapping set items to prime representatives in the associated group
which can be preemptively expensive.

==== Matrix Based Probabilistic Set Datastructures ====
Member

"Datastructures" should probably be two words.

@Roasbeef
Contributor Author

Closing this in favor of #636.

Thanks to everyone who gave valuable review comments, especially @TheBlueMatt for his comments on the P2P aspect which led to notable improvements as proposed by @jimpo.

@Roasbeef Roasbeef closed this Jan 17, 2018