p2p/sentry: don't duplicate SendMessageById across shared-store sentries #21597
Merged
Conversation
Since #21335 all per-eth-version GrpcServers share one p2p.Server and one PeerStore, so every sentry resolves the same PeerInfo and an unfiltered SendMessageById fan-out (e.g. the txpool's new-peer pool sync via PropagatePooledTxnsToPeersList) wrote the same frame once per sentry - three identical NewPooledTransactionHashes messages per new peer on the default eth/69+70+71 build. Apply the negotiated-version guard that SendMessageToAll and SendMessageToRandomPeers already use. This is what flaked hive devp2p BlobViolations ("expected disconnect on blob violation, got msg code: 24"): the test tolerates exactly one hash announcement before the violation disconnect, and the triplicated pool sync could land two announcements inside the txpool fetcher's 250ms batching window before the kick.
Pull request overview
This PR fixes an outbound message duplication bug introduced by shared PeerStore usage across multiple per-eth-version sentry GrpcServers. The key change ensures SendMessageById only writes eth-protocol frames via the sentry instance matching the peer’s negotiated eth protocol version, aligning it with the existing routing behavior of SendMessageToAll / SendMessageToRandomPeers.
Changes:
- Add an eth-version routing guard to `GrpcServer.SendMessageById` to avoid duplicate writes when multiple sentries share a `PeerStore`.
- Add a regression test that constructs multiple eth-version `GrpcServer`s sharing one `PeerStore` and verifies a by-id fan-out results in exactly one write.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| p2p/sentry/sentry_grpc_server.go | Adds negotiated-eth-version filtering to SendMessageById for eth-protocol messages to prevent duplicate writes in shared-store mode. |
| p2p/sentry/sentry_grpc_server_test.go | Adds a regression test ensuring shared-store SendMessageById fan-out does not duplicate outbound writes. |
taratorio
approved these changes
Jun 3, 2026
Sahil-4555
pushed a commit
to Sahil-4555/erigon
that referenced
this pull request
Jun 3, 2026
## Summary
Hive workspace logs (simulator output + erigon client logs) are uploaded
with the matrix values interpolated into the artifact name:
```
name: hive-workspace-log-${{ matrix.sim }}-${{ matrix.sim-limit }}-${{ matrix.exec_mode }}
```
`actions/upload-artifact` forbids `/`, `|` and `*` in artifact names, so
the upload fails for every leg whose name contains them — i.e. all
`ethereum/engine`, `ethereum/rpc-compat` (`.*` limit) and
`devp2p`-`eth|discv5` legs. The failure is masked by the step's
`continue-on-error: true`, so the logs just silently vanish:
```
##[error]The artifact name is not valid: hive-workspace-log-devp2p-eth|discv5-parallel. Contains the following character: Vertical bar |
```
Only `hive-workspace-log-devp2p-eth-serial` ever uploaded. This bit
during the investigation of the `BlobViolations` flake behind erigontech#21597:
the failing parallel-leg run
(https://github.com/erigontech/erigon/actions/runs/26867180780) had no
workspace-log artifact, and the root-cause analysis had to be
reconstructed from the job stdout.
## Fix
Compute the artifact name in a small step that replaces any character
outside `[A-Za-z0-9._-]` with `_`. Resulting names per leg:
| before (invalid ones rejected) | after |
|---|---|
| `hive-workspace-log-ethereum/engine-exchange-capabilities\|auth-serial` | `hive-workspace-log-ethereum_engine-exchange-capabilities_auth-serial` |
| `hive-workspace-log-ethereum/engine-cancun-parallel` | `hive-workspace-log-ethereum_engine-cancun-parallel` |
| `hive-workspace-log-ethereum/rpc-compat-.*-serial` | `hive-workspace-log-ethereum_rpc-compat-._-serial` |
| `hive-workspace-log-devp2p-eth\|discv5-parallel` | `hive-workspace-log-devp2p-eth_discv5-parallel` |
| `hive-workspace-log-devp2p-eth-serial` (worked) | unchanged |

`test-hive-eest.yml` (`matrix.shard` values like `paris+shanghai`; `+`
is allowed) and `release.yml` artifact names contain no forbidden
characters, so they are left alone.
Validated with `actionlint` and by evaluating the substitution against
every matrix combination.
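
The substitution rule can be checked against the table with a small Go sketch. This is only an illustration of the stated rule (replace every character outside `[A-Za-z0-9._-]` with `_`); the actual fix is a workflow step, and the helper name `sanitizeArtifactName` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// unsafeChars matches every character outside the allow-list quoted in the fix.
var unsafeChars = regexp.MustCompile(`[^A-Za-z0-9._-]`)

// sanitizeArtifactName is a hypothetical helper (not the workflow's real step)
// that replaces each disallowed character with an underscore.
func sanitizeArtifactName(name string) string {
	return unsafeChars.ReplaceAllString(name, "_")
}

func main() {
	// Artifact names taken from the "before" column of the table.
	names := []string{
		"hive-workspace-log-ethereum/engine-exchange-capabilities|auth-serial",
		"hive-workspace-log-ethereum/engine-cancun-parallel",
		"hive-workspace-log-ethereum/rpc-compat-.*-serial",
		"hive-workspace-log-devp2p-eth|discv5-parallel",
		"hive-workspace-log-devp2p-eth-serial",
	}
	for _, name := range names {
		fmt.Printf("%s -> %s\n", name, sanitizeArtifactName(name))
	}
}
```

Names that are already valid, such as `hive-workspace-log-devp2p-eth-serial`, pass through unchanged, which matches the last row of the table.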
chris-mercer
pushed a commit
to white-b0x/erigon
that referenced
this pull request
Jun 3, 2026
## Problem
Two merge-queue evictions in the last 3 weeks were caused by the SonarCloud scan failing to download the scanner CLI from `binaries.sonarsource.com`, not by anything in the queued code:
- [Jun 3 run](https://github.com/erigontech/erigon/actions/runs/26882242018): instant `Unexpected HTTP response: 403` (the action does not retry 4xx); this evicted erigontech#21597 from the queue with all sibling jobs green or still running
- [Jun 1 run](https://github.com/erigontech/erigon/actions/runs/26751990487): download kept failing through the action's three internal attempts (~40s), then gave up

In the merge queue the sonar job fast-cancels the whole CI Gate run on failure, so a CDN blip cancels ~40 min of green sibling jobs and `github-merge-queue` removes the PR with `failed_checks`. Per CI-GUIDELINES.md, merge-queue checks must have no false positives; CDN weather is one.
## Fix
Give the scan one spaced retry:
- the first attempt runs with `continue-on-error: true`
- if it failed, wait 90s and run the action again
- if the retry also fails, the job fails as before, so a persistent outage still blocks correctly

`cache-warming-only` runs are unaffected (scan skipped → outcome is `skipped`, so the retry steps skip too). A `continue-on-error` step reports `conclusion: success` to the jobs API, so ci-gate's root-cause detection won't flag a run recovered by the retry; a double failure is attributed to `SonarCloud scan (retry)`.
## Alternatives considered
- **Pre-seeding the runner tool cache**: impossible, because the action's `tc.find()` lookup can never match SonarSource's 4-segment version string (`semver.clean("8.1.0.6389")` is `null`), so the action's internal tool-cache path is dead code on any runner.
- **Mirroring the scanner zip via `scannerBinariesUrl`**: removes the CDN dependency entirely but adds hosting and per-upgrade maintenance, and the GPG keyserver dependency remains. Can revisit if 403s persist despite the retry.

No tests: CI workflow YAML change (TDD not applicable); validated with `actionlint` and `make lint`.
Summary
Fixes the hive devp2p `BlobViolations` flake (`expected disconnect on blob violation, got msg code: 24`) seen on the CI Gate run for #21524 (parallel leg) and on a runner-experiment branch the day before (serial leg), as well as the gossip-duplication regression behind it.
Root cause
Since #21335 all per-eth-version sentry `GrpcServer`s share one `p2p.Server` and one `PeerStore`. `SendMessageById` resolves the target peer in the shared store, so callers that fan a by-id send out across every sentry client (e.g. the txpool's new-peer pool sync, `PropagatePooledTxnsToPeersList`) now deliver the same frame once per sentry: three identical `NewPooledTransactionHashes` messages per new peer on the default eth/69+70+71 build. Before #21335 each sentry had its own peer set, so the off-sentry sends were silent no-ops. `SendMessageToAll` and `SendMessageToRandomPeers` already guard against this with `protocolVersions.Contains(peerInfo.EthProtocol())`; `SendMessageById` was the only outbound path without the negotiated-version check.
The BlobViolations failure sequence, reproduced locally with wire-level logging (hive `--sim devp2p --sim.limit eth` against an instrumented image; failed on the second loop iteration with the exact CI error):
1. The `PooledTransactions` reply sits in the txpool fetcher's 250 ms inbound batch before the announcement check kicks the peer.
2. When a `syncToNewPeersEvery` tick lands in that window, the all-pool announcement (~70 KB, about 2000 hashes accumulated by earlier suite tests) is written to the test peer three times.

In the failed CI run the timing lines up exactly: the txpool started at 06:33:04.64 and the test failed at 06:33:14.644, the tick+10s sync.
With the duplication fixed, at most one announcement can precede the disconnect in that window, which the test tolerates by design. Beyond the test, every new peer was receiving the full pool sync in triplicate.
Fix
Apply the same negotiated-eth-version guard in `SendMessageById` that the other outbound paths use. Eth-protocol messages only; wit is unaffected (it is deduplicated to a single `GrpcServer` at shared-server construction). This also restores the routing contract documented at `FetchBlockAccessLists` ("sentryIndex MUST be the sentry where peerID is actually connected"): sends via a non-matching sentry are silent no-ops again, as they were with per-sentry peer sets.
Not present on `release/3.4` (#21335 is main-only), so no backport is needed.
Testing
- `TestGrpcServer_SendMessageById_SharedStore_NoDuplicateWrites` (TDD): three `GrpcServer`s (eth/69/70/71) sharing a `PeerStore`, peer negotiated eth/70, by-id fan-out across all three must produce exactly one write. Red before the fix (`expected: 1, actual: 3`), green after.
- `go test ./p2p/sentry/... ./txnprovider/txpool/...` clean.
- Before the fix, `not ok 18 BlobViolations` reproduced on iteration 2; after the fix, 12/12 iterations green, with instrumented logs showing exactly one `NEW_POOLED_TRANSACTION_HASHES_68` write per new peer where there were three.
- `make lint` (repeatedly) and `make erigon integration` clean.