p2p/sentry: don't duplicate SendMessageById across shared-store sentries by yperbasis · Pull Request #21597 · erigontech/erigon

yperbasis · 2026-06-03T09:42:13Z

Summary

Fixes the hive devp2p BlobViolations flake (expected disconnect on blob violation, got msg code: 24) seen on the CI Gate run for #21524 (parallel leg) and on a runner-experiment branch the day before (serial leg) — and the gossip-duplication regression behind it.

Root cause

Since #21335 all per-eth-version sentry GrpcServers share one p2p.Server and one PeerStore. SendMessageById resolves the target peer in the shared store, so callers that fan a by-id send out across every sentry client — e.g. the txpool's new-peer pool sync (PropagatePooledTxnsToPeersList) — now deliver the same frame once per sentry: three identical NewPooledTransactionHashes messages per new peer on the default eth/69+70+71 build. Before #21335 each sentry had its own peer set, so the off-sentry sends were silent no-ops.

SendMessageToAll and SendMessageToRandomPeers already guard against this with protocolVersions.Contains(peerInfo.EthProtocol()); SendMessageById was the only outbound path without the negotiated-version check.

The BlobViolations failure sequence, reproduced locally with wire-level logging (hive --sim devp2p --sim.limit eth against an instrumented image; failed on the second loop iteration with the exact CI error):

The test peers, announces two blob txs (one with a lying size/type), delivers them, and waits for a disconnect.
The violating PooledTransactions reply sits in the txpool fetcher's 250 ms inbound batch before the announcement check kicks the peer.
If the 5 s syncToNewPeersEvery tick lands in that window, the all-pool announcement (~70 KB — about 2000 hashes accumulated by earlier suite tests) is written to the test peer three times:

08:16:32.606  inbound NPTH68            peer=de6008eb count=2
08:16:32.607  WRITE GET_POOLED_TXS_66   peer=de6008eb
08:16:32.607  sync-to-new-peers         entries=2002            <- 5s tick fires
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- one write
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- per
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- sentry
08:16:32.855  VIOLATION -> PenalizePeer peer=de6008eb           <- next 250ms batch flush

The devp2p test deliberately tolerates one pre-disconnect hash announcement (go-ethereum commit 88c8459, Sept 2024, "sometimes we'll get a blob transaction hashes announcement before the disconnect"); the duplicated second copy arrives before the kick and fails the test.

In the failed CI run the timing lines up exactly: the txpool started at 06:33:04.64 and the test failed at 06:33:14.644 — the tick+10s sync.

With the duplication fixed, at most one announcement can precede the disconnect in that window, which the test tolerates by design. Beyond the test, every new peer was receiving the full pool sync in triplicate.

Fix

Apply the same negotiated-eth-version guard in SendMessageById that the other outbound paths use. Eth-protocol messages only; wit is unaffected (it is deduplicated to a single GrpcServer at shared-server construction). This also restores the routing contract documented at FetchBlockAccessLists ("sentryIndex MUST be the sentry where peerID is actually connected"): sends via a non-matching sentry are silent no-ops again, as they were with per-sentry peer sets.

Not present on release/3.4 (#21335 is main-only), so no backport is needed.

Testing

New regression test TestGrpcServer_SendMessageById_SharedStore_NoDuplicateWrites (TDD): three GrpcServers (eth/69/70/71) sharing a PeerStore, peer negotiated eth/70, by-id fan-out across all three must produce exactly one write. Red before the fix (expected: 1, actual: 3), green after.
go test ./p2p/sentry/... ./txnprovider/txpool/... clean.
Local hive validation in a loop: before the fix not ok 18 BlobViolations reproduced on iteration 2; after the fix 12/12 iterations green, with instrumented logs showing exactly one NEW_POOLED_TRANSACTION_HASHES_68 write per new peer where there were three.
make lint (repeatedly) and make erigon integration clean.

Since #21335 all per-eth-version GrpcServers share one p2p.Server and one PeerStore, so every sentry resolves the same PeerInfo and an unfiltered SendMessageById fan-out (e.g. the txpool's new-peer pool sync via PropagatePooledTxnsToPeersList) wrote the same frame once per sentry - three identical NewPooledTransactionHashes messages per new peer on the default eth/69+70+71 build. Apply the negotiated-version guard that SendMessageToAll and SendMessageToRandomPeers already use. This is what flaked hive devp2p BlobViolations ("expected disconnect on blob violation, got msg code: 24"): the test tolerates exactly one hash announcement before the violation disconnect, and the triplicated pool sync could land two announcements inside the txpool fetcher's 250ms batching window before the kick.

Copilot

Pull request overview

This PR fixes an outbound message duplication bug introduced by shared PeerStore usage across multiple per-eth-version sentry GrpcServers. The key change ensures SendMessageById only writes eth-protocol frames via the sentry instance matching the peer’s negotiated eth protocol version, aligning it with the existing routing behavior of SendMessageToAll / SendMessageToRandomPeers.

Changes:

Add an eth-version routing guard to GrpcServer.SendMessageById to avoid duplicate writes when multiple sentries share a PeerStore.
Add a regression test that constructs multiple eth-version GrpcServers sharing one PeerStore and verifies a by-id fan-out results in exactly one write.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
p2p/sentry/sentry_grpc_server.go	Adds negotiated-eth-version filtering to `SendMessageById` for eth-protocol messages to prevent duplicate writes in shared-store mode.
p2p/sentry/sentry_grpc_server_test.go	Adds a regression test ensuring shared-store `SendMessageById` fan-out does not duplicate outbound writes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

## Summary Hive workspace logs (simulator output + erigon client logs) are uploaded with the matrix values interpolated into the artifact name: ``` name: hive-workspace-log-${{ matrix.sim }}-${{ matrix.sim-limit }}-${{ matrix.exec_mode }} ``` `actions/upload-artifact` forbids `/`, `|` and `*` in artifact names, so the upload fails for every leg whose name contains them — i.e. all `ethereum/engine`, `ethereum/rpc-compat` (`.*` limit) and `devp2p`-`eth|discv5` legs. The failure is masked by the step's `continue-on-error: true`, so the logs just silently vanish: ``` ##[error]The artifact name is not valid: hive-workspace-log-devp2p-eth|discv5-parallel. Contains the following character: Vertical bar | ``` Only `hive-workspace-log-devp2p-eth-serial` ever uploaded. This bit during the investigation of the `BlobViolations` flake behind erigontech#21597: the failing parallel-leg run (https://github.com/erigontech/erigon/actions/runs/26867180780) had no workspace-log artifact, and the root-cause analysis had to be reconstructed from the job stdout. ## Fix Compute the artifact name in a small step that replaces any character outside `[A-Za-z0-9._-]` with `_`. Resulting names per leg: | before (invalid ones rejected) | after | |---|---| | `hive-workspace-log-ethereum/engine-exchange-capabilities\|auth-serial` | `hive-workspace-log-ethereum_engine-exchange-capabilities_auth-serial` | | `hive-workspace-log-ethereum/engine-cancun-parallel` | `hive-workspace-log-ethereum_engine-cancun-parallel` | | `hive-workspace-log-ethereum/rpc-compat-.*-serial` | `hive-workspace-log-ethereum_rpc-compat-._-serial` | | `hive-workspace-log-devp2p-eth\|discv5-parallel` | `hive-workspace-log-devp2p-eth_discv5-parallel` | | `hive-workspace-log-devp2p-eth-serial` (worked) | unchanged | `test-hive-eest.yml` (`matrix.shard` values like `paris+shanghai`; `+` is allowed) and `release.yml` artifact names contain no forbidden characters, so they are left alone. Validated with `actionlint` and by evaluating the substitution against every matrix combination.

## Problem Two merge-queue evictions in the last 3 weeks were caused by the SonarCloud scan failing to download the scanner CLI from `binaries.sonarsource.com` — not by anything in the queued code: - [Jun 3 run](https://github.com/erigontech/erigon/actions/runs/26882242018): instant `Unexpected HTTP response: 403` (the action does not retry 4xx) — evicted erigontech#21597 from the queue with all sibling jobs green or still running - [Jun 1 run](https://github.com/erigontech/erigon/actions/runs/26751990487): download kept failing through the action's three internal attempts (~40s), then gave up In the merge queue the sonar job fast-cancels the whole CI Gate run on failure, so a CDN blip cancels ~40 min of green sibling jobs and `github-merge-queue` removes the PR with `failed_checks`. Per CI-GUIDELINES.md, merge-queue checks must have no false positives; CDN weather is one. ## Fix Give the scan one spaced retry: - the first attempt runs with `continue-on-error: true` - if it failed, wait 90s and run the action again - if the retry also fails, the job fails as before — a persistent outage still blocks correctly `cache-warming-only` runs are unaffected (scan skipped → outcome is `skipped`, so the retry steps skip too). A `continue-on-error` step reports `conclusion: success` to the jobs API, so ci-gate's root-cause detection won't flag a run recovered by the retry; a double failure is attributed to `SonarCloud scan (retry)`. ## Alternatives considered - **Pre-seeding the runner tool cache**: impossible — the action's `tc.find()` lookup can never match SonarSource's 4-segment version string (`semver.clean("8.1.0.6389")` is `null`), so the action's internal tool-cache path is dead code on any runner. - **Mirroring the scanner zip via `scannerBinariesUrl`**: removes the CDN dependency entirely but adds hosting and per-upgrade maintenance, and the GPG keyserver dependency remains. Can revisit if 403s persist despite the retry. No tests: CI workflow YAML change (TDD not applicable); validated with `actionlint` and `make lint`.

yperbasis requested review from anacrolix and taratorio as code owners June 3, 2026 09:42

yperbasis requested a review from Copilot June 3, 2026 09:44

Copilot started reviewing on behalf of yperbasis June 3, 2026 09:44 View session

yperbasis added the Networking label Jun 3, 2026

yperbasis added this to the 3.5.0 milestone Jun 3, 2026

yperbasis mentioned this pull request Jun 3, 2026

ci: sanitize hive workspace-log artifact names #21598

Merged

Copilot AI reviewed Jun 3, 2026

View reviewed changes

Comment thread p2p/sentry/sentry_grpc_server_test.go

taratorio approved these changes Jun 3, 2026

View reviewed changes

p2p/sentry: close test PeerInfo workers via t.Cleanup

e0037cc

yperbasis enabled auto-merge June 3, 2026 11:10

yperbasis added this pull request to the merge queue Jun 3, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 3, 2026

Merge branch 'main' into yperbasis/sentry-by-id-no-dup

c2866fe

yperbasis enabled auto-merge June 3, 2026 12:12

yperbasis mentioned this pull request Jun 3, 2026

ci: retry SonarCloud scan once on transient failure #21604

Merged

yperbasis added this pull request to the merge queue Jun 3, 2026

Merged via the queue into main with commit 3882a9d Jun 3, 2026
88 checks passed

yperbasis deleted the yperbasis/sentry-by-id-no-dup branch June 3, 2026 14:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

p2p/sentry: don't duplicate SendMessageById across shared-store sentries#21597

p2p/sentry: don't duplicate SendMessageById across shared-store sentries#21597
yperbasis merged 3 commits into
mainfrom
yperbasis/sentry-by-id-no-dup

yperbasis commented Jun 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yperbasis commented Jun 3, 2026

Summary

Root cause

Fix

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants