Bloom filter gossip #266

Closed · wants to merge 81 commits

Conversation

@joshua-kim (Contributor) commented Jun 17, 2023

Why this should be merged

A performance optimization to lower a transaction's end-to-end time from issuance to acceptance.

How this works

Uses a pull-based approach to gossip instead of a push-based approach. See testing for more details.

How this was tested

Ran some simulations, results are as follows:

With the existing push-based gossip, we see that as more and more nodes learn the new transaction, the rate at which nodes learn the transaction slows down exponentially.

At first the spread is slow: only one node knows about the transaction, so very few nodes are gossiping it. Once it hits the 50% mark, the transaction is being gossiped at maximum velocity, since we're at the midpoint where half the network is gossiping the transaction and the other half is still willing to forward it.

As a majority of the network learns of the transaction, the spread slows down again. This is because nodes sample peers to gossip to, but a peer will only forward the gossiped transaction if it hasn't seen it already. Over time, peers effectively "absorb" the gossip and don't forward it, so the last few nodes are very unlikely to ever hear of the transaction until a re-gossip cycle kicks in.
[plot: push-based gossip, percentage of nodes that know the transaction over time; the S-curve flattens as it nears saturation]

With the pull-based approach, the transaction spreads exponentially fast as more and more nodes learn of it. This is because nodes now poll peers, and as more peers learn the transaction, any given poll is more likely to reach a peer that has it.
[plot: pull-based gossip, percentage of nodes that know the transaction over time; the curve accelerates as adoption grows]
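
To make the dynamics above concrete, here is a toy round-based simulation of both regimes. This is not the PR's gossip code: it assumes synchronous rounds, uniform random peer sampling, forward-once push, and no message loss, and all constants are illustrative.

package main

import (
	"fmt"
	"math/rand"
)

const (
	numNodes = 1000
	fanout   = 10 // peers contacted per node per round
)

func countAware(aware []bool) int {
	n := 0
	for _, seen := range aware {
		if seen {
			n++
		}
	}
	return n
}

// Push: only nodes that just learned the tx forward it, so peers that have
// already seen it "absorb" the gossip and the tail of the network stalls.
func simulatePush() []int {
	aware := make([]bool, numNodes)
	aware[0] = true // the issuing node
	frontier := []int{0}
	counts := []int{1}
	for round := 0; round < 30 && len(frontier) > 0; round++ {
		var next []int
		for range frontier {
			for j := 0; j < fanout; j++ {
				peer := rand.Intn(numNodes)
				if !aware[peer] {
					aware[peer] = true
					next = append(next, peer)
				}
			}
		}
		frontier = next
		counts = append(counts, countAware(aware))
	}
	return counts
}

// Pull: every node that hasn't seen the tx polls random peers each round, so
// the more nodes that know the tx, the more likely any poll is to hit one.
func simulatePull() []int {
	aware := make([]bool, numNodes)
	aware[0] = true
	counts := []int{1}
	for round := 0; round < 30; round++ {
		for i := 0; i < numNodes; i++ {
			if aware[i] {
				continue
			}
			for j := 0; j < fanout; j++ {
				if aware[rand.Intn(numNodes)] {
					aware[i] = true
					break
				}
			}
		}
		counts = append(counts, countAware(aware))
	}
	return counts
}

func main() {
	fmt.Println("push:", simulatePush())
	fmt.Println("pull:", simulatePull())
}

Printing the per-round counts reproduces the shapes in the plots: an S-curve that crawls through its tail for push, and an accelerating curve for pull.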

@darioush (Collaborator) left a comment

High level comments:

  • Improve the gossip to not block on each peer's response.
  • Remove the wrapper from the bloom filter
  • Wrap the mempool type instead of modifying it if possible.

Also left some inline comments.

@joshua-kim mentioned this pull request Jul 5, 2023
@joshua-kim linked an issue Jul 5, 2023 that may be closed by this pull request
@@ -56,6 +62,8 @@ func TestMempoolAtmTxsIssueTxAndGossiping(t *testing.T) {
		return nil
	}

	assert.NoError(vm.SetState(ctx, snow.NormalOp))
Collaborator:

nit: prefer using require over assert for new code.

Contributor Author (@joshua-kim):

Just using assert here because the rest of this test uses assert; I think we can pass over these tests to switch to require in a future PR.

Comment on lines 186 to 193
 if _, exists := m.issuedTxs[txID]; exists {
-	return nil
+	return fmt.Errorf("%w: %s", errTxAlreadyIssued, txID)
 }
 if _, exists := m.currentTxs[txID]; exists {
-	return nil
+	return fmt.Errorf("%w: %s", errTxAlreadyIssued, txID)
 }
 if _, exists := m.txHeap.Get(txID); exists {
-	return nil
+	return fmt.Errorf("%w: %s", errTxAlreadyIssued, txID)
Collaborator:

why are we changing this behavior?

Contributor:

I believe we want something to signal that the tx was already issued, so we know we don't need to refresh the bloom filter.
I'd return a boolean plus an error instead of changing the error semantics.
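
A minimal sketch of that suggestion. The signature and method placement are hypothetical; the field names mirror the snippet under review:

// addTx reports whether the tx was already issued (so callers can skip
// refreshing the bloom filter) instead of overloading the error return.
func (m *Mempool) addTx(tx *Tx) (alreadyIssued bool, err error) {
	txID := tx.ID()
	if _, exists := m.issuedTxs[txID]; exists {
		return true, nil
	}
	if _, exists := m.currentTxs[txID]; exists {
		return true, nil
	}
	if _, exists := m.txHeap.Get(txID); exists {
		return true, nil
	}
	// ... proceed with normal issuance and bloom filter updates ...
	return false, nil
}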

@abi87 (Contributor) left a comment

@joshua-kim I am still reviewing this but I have two general questions:

  1. Shouldn't the bloom filter mechanism be activated only following a hard fork? What would happen with the current code if it were deployed in a network of nodes not fully supporting bloom gossiping?
  2. Is this bloom gossip mechanism designed to be portable to the avalanchego mempool? If so, are there PRs on the avalanchego side? Happy to help there if I can.

@hexfusion (Contributor) left a comment

Mostly notes on first pass, really cool work!

@richardpringle left a comment

Take my comment with a grain of salt, it's honestly more for me than it is for you!
@joshua-kim, feel free to resolve the comment if you don't agree, but I would love it if you could explain why.

And if you do agree... you're welcome 😉 .

@joshua-kim (Contributor Author):

> Shouldn't the bloom filter mechanism be activated only following a hard fork? What would happen with the current code if it were deployed in a network of nodes not fully supporting bloom gossiping?

We can deploy this without a hard fork, since this introduces all the new gossip over new message types. Nodes without the upgrade will just drop the new messages.

Eventually we will want to deprecate the existing push-based gossip, which will need to be done in a hard fork.

> Is this bloom gossip mechanism designed to be portable to the avalanchego mempool? If so, are there PRs on the avalanchego side? Happy to help there if I can.

Yeah, this should be portable to the X/P chains (which is why the code is in its own independent gossip package). Once this PR is merged, the follow-up will be to refactor the gossip package into avalanchego, and from there introduce it into the P/X chains.

@ceyonur (Collaborator) left a comment

Changes overall LGTM, just a few comments. I feel like our gossip/network/handler-related files are currently a little too spread out across the codebase. It's a bit hard to follow and to add new handlers. I wonder if we should collect them together in a package and use subpackages to differentiate them from each other.

gossip/bloom.go
// ResetBloomFilterIfNeeded resets a bloom filter if it breaches a ratio of
// filled elements. Returns true if the bloom filter was reset.
func ResetBloomFilterIfNeeded(
	bloomFilter **bloomfilter.Filter,
Collaborator:

Do we need a pointer to a pointer here?

Contributor:

I wonder if we should just have a NeedReset(...) bool function and recreate the bloom filter anew in the caller?

Contributor Author (@joshua-kim):

Yeah, so this is a funny situation: we either pass a double pointer here, or we pass a single pointer, which causes an issue when we overwrite the pointer's value like so:

	fresh, _ := bloomfilter.New((*bloomFilter).M(), (*bloomFilter).K())
	*bloomFilter = *fresh // copying a mutex here

the linter flags it as copying a mutex (I think this is technically safe since it's going to be unlocked and no one else can modify it). I felt like the double pointer was less evil than this.

Contributor Author (@joshua-kim):

> I wonder if we should just have a NeedReset(...) bool function and recreate the bloom filter anew in the caller?

Yeah an alternative would be just to remove this function entirely, it's only used in two places and it's a small snippet of code so the duplication cost is really not bad.

Collaborator:

Ya, I'd prefer to either remove this entirely or update it to a function that doesn't deal with a double pointer. Even though this seems correct, it still seems overly complicated. I think another, simpler alternative would be a function that returns nil if no reset is needed and a non-nil pointer to a new bloom filter if it is (or a boolean, if that seems cleaner).
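
A sketch of that alternative, assuming the holiman/bloomfilter/v2 package used elsewhere in this PR; the PreciseFilledRatio call and the threshold parameter are assumptions, not the final API:

import bloomfilter "github.com/holiman/bloomfilter/v2"

// ResetBloomFilterIfNeeded returns a fresh, empty filter with the same
// parameters when the current one is too full, or nil when no reset is
// needed, so the caller swaps a single pointer and no **Filter is required.
func ResetBloomFilterIfNeeded(
	current *bloomfilter.Filter,
	maxFilledRatio float64,
) (*bloomfilter.Filter, error) {
	if current.PreciseFilledRatio() < maxFilledRatio {
		return nil, nil // still healthy; keep using the existing filter
	}
	// Same number of bits (M) and hash functions (K) as the old filter.
	return bloomfilter.New(current.M(), current.K())
}

The caller replaces its filter only when the returned pointer is non-nil, which also sidesteps the mutex-copy lint that the single-pointer version triggers.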

@StephenButtolph (Contributor) left a comment

did a pass of the gossip package

gossip/gossip.go
Comment on lines 19 to 23
// GossipableAny exists to help create non-nil pointers to a concrete Gossipable
type GossipableAny[T any] interface {
	*T
	Gossipable
}
Contributor:

This is such a wild hack... afaict it does work tho... Is there a ref we can link to for this? Essentially it works because:

  1. The type must be a pointer to something
  2. The type must implement the interface.
  3. A pointer to an interface does not implement the interface.

Contributor Author (@joshua-kim):

Yeah you summed up how it works. I can add a ref here.

It's basically a decision of whether we prefer this generics black magic (adding complexity to the package) vs adding another interface where the caller has to return a pointer to their type (adding complexity to the caller). I prefer the former personally, but I can understand if we feel like this is too evil.

Alternative:

type FooFactory struct{}

func (FooFactory) MakePointer() *Foo {
	return &Foo{}
}
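
For reference, here is a hypothetical sketch (not code from this PR) of why the *T constraint is useful: generic code can instantiate T itself and get a non-nil *T whose method set satisfies Gossipable.

// unmarshalGossipable constructs a fresh *T and decodes into it.
func unmarshalGossipable[T any, U GossipableAny[T]](bytes []byte) (U, error) {
	gossipable := U(new(T)) // non-nil *T; U's type set is exactly *T
	if err := gossipable.Unmarshal(bytes); err != nil {
		var zero U
		return zero, err
	}
	return gossipable, nil
}

Without the *T element in the constraint, the generic code would have no way to construct a non-nil U to call Unmarshal on.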

Comment on lines 25 to 41
type GossipAtomicTx struct {
	Tx *Tx `serialize:"true"`
}

func (tx *GossipAtomicTx) GetHash() gossip.Hash {
	id := tx.Tx.ID()
	return gossip.HashFromBytes(id[:])
}

func (tx *GossipAtomicTx) Marshal() ([]byte, error) {
	return Codec.Marshal(message.Version, tx)
}

func (tx *GossipAtomicTx) Unmarshal(bytes []byte) error {
	_, err := Codec.Unmarshal(bytes, tx)
	return err
}
Contributor:

Would it make sense to implement these functions directly on the *Tx type? I suppose it's pretty nice that we define both messages here.

Contributor Author (@joshua-kim):

My thoughts are that this is all gossip-specific code, and we should never use GetHash, Marshal, or Unmarshal outside the context of our gossip package/implementations, so I don't want people to ever depend on them.

g.lock.RLock()
defer g.lock.RUnlock()

return g.bloom
Contributor:

This seems racy - this bloom can be reset in the goroutine running Subscribe - so the caller of this function must either have access to the lock that is local to this struct - or be otherwise guaranteed that Subscribe isn't running.

Or I'm missing something

Contributor Author (@joshua-kim):

No, this is racy as you correctly mentioned, since we return a reference which can be modified by the Subscribe goroutine once the lock is released. I think this is fixed if we deep-copy here.
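
A hedged sketch of that deep-copy fix. It assumes the filter supports the standard encoding.BinaryMarshaler round-trip (which copies state without touching unexported fields, including the mutex); the receiver and field names are illustrative:

// GetFilter returns a point-in-time snapshot of the bloom filter so callers
// never share the pointer that the Subscribe goroutine may reset.
func (g *gossiper) GetFilter() (*bloomfilter.Filter, error) {
	g.lock.RLock()
	defer g.lock.RUnlock()

	bytes, err := g.bloom.MarshalBinary()
	if err != nil {
		return nil, err
	}
	copied := &bloomfilter.Filter{}
	if err := copied.UnmarshalBinary(bytes); err != nil {
		return nil, err
	}
	return copied, nil
}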

m.lock.RLock()
defer m.lock.RUnlock()

return m.bloom
Contributor:

Similar to the other comment, this seems racy.

Comment on lines +301 to +307
for _, item := range m.txHeap.maxHeap.items {
	gossipTx := &GossipAtomicTx{Tx: item.tx}
	if !filter(gossipTx) {
		continue
	}
	gossipTxs = append(gossipTxs, gossipTx)
}
Contributor:

Also, it feels like this should have some form of maximum size.
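
A sketch of what that cap could look like; maxTxs is a hypothetical limit, and everything else mirrors the loop under review:

for _, item := range m.txHeap.maxHeap.items {
	if len(gossipTxs) >= maxTxs {
		break // bound the response instead of returning the whole mempool
	}
	gossipTx := &GossipAtomicTx{Tx: item.tx}
	if !filter(gossipTx) {
		continue
	}
	gossipTxs = append(gossipTxs, gossipTx)
}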

joshua-kim and others added 4 commits August 30, 2023
@StephenButtolph (Contributor) left a comment

small nit

joshua-kim and others added 6 commits August 31, 2023
@joshua-kim (Contributor Author):

The diff is getting huge, so we broke up this PR to make it more reviewable and get it over the finish line:

  1. Coreth PR to add router handling + UT: #316
  2. Avalanchego PR to add the gossip package: ava-labs/avalanchego#1958
  3. (Draft) Coreth PR to use the gossip package: #318, blocked on (2)

@t-anyu commented Sep 25, 2023

@joshua-kim as there are nodes running older versions, will they get throttled due to droppedRequests counter or any other mechanisms?

@joshua-kim (Contributor Author):

@t-anyu Nodes running older versions will drop incoming new-gossip requests, because they'll fail to unmarshal against the legacy codec. They will still get transactions, because all nodes will support the legacy pushGossiper implementation until the next network upgrade.

@a1k0n commented Sep 26, 2023

> Nodes running older versions will drop incoming new-gossip requests, because they'll fail to unmarshal against the legacy codec. They will still get transactions, because all nodes will support the legacy pushGossiper implementation until the next network upgrade.

@joshua-kim @StephenButtolph I hope the team isn't planning to remove push gossip entirely.

@StephenButtolph (Contributor) commented Sep 26, 2023

@a1k0n

> I hope the team isn't planning to remove push gossip entirely.

We will be moving push gossip to the new p2p SDK. We will not remove push gossip entirely - as that would significantly increase the tx propagation times.

@a1k0n commented Sep 26, 2023

> We will be moving push gossip to the new p2p SDK. We will not remove push gossip entirely - as that would significantly increase the tx propagation times.

Ah, I see. Thanks!

Development

Successfully merging this pull request may close these issues:
  • Slow Tx Gossip