Skip to content

p2p/sentry: don't duplicate SendMessageById across shared-store sentries#21597

Merged
yperbasis merged 3 commits into
mainfrom
yperbasis/sentry-by-id-no-dup
Jun 3, 2026
Merged

p2p/sentry: don't duplicate SendMessageById across shared-store sentries#21597
yperbasis merged 3 commits into
mainfrom
yperbasis/sentry-by-id-no-dup

Conversation

@yperbasis
Copy link
Copy Markdown
Member

Summary

Fixes the hive devp2p BlobViolations flake (expected disconnect on blob violation, got msg code: 24) seen on the CI Gate run for #21524 (parallel leg) and on a runner-experiment branch the day before (serial leg) — and the gossip-duplication regression behind it.

Root cause

Since #21335 all per-eth-version sentry GrpcServers share one p2p.Server and one PeerStore. SendMessageById resolves the target peer in the shared store, so callers that fan a by-id send out across every sentry client — e.g. the txpool's new-peer pool sync (PropagatePooledTxnsToPeersList) — now deliver the same frame once per sentry: three identical NewPooledTransactionHashes messages per new peer on the default eth/69+70+71 build. Before #21335 each sentry had its own peer set, so the off-sentry sends were silent no-ops.

SendMessageToAll and SendMessageToRandomPeers already guard against this with protocolVersions.Contains(peerInfo.EthProtocol()); SendMessageById was the only outbound path without the negotiated-version check.

The BlobViolations failure sequence, reproduced locally with wire-level logging (hive --sim devp2p --sim.limit eth against an instrumented image; failed on the second loop iteration with the exact CI error):

  1. The test peers, announces two blob txs (one with a lying size/type), delivers them, and waits for a disconnect.
  2. The violating PooledTransactions reply sits in the txpool fetcher's 250 ms inbound batch before the announcement check kicks the peer.
  3. If the 5 s syncToNewPeersEvery tick lands in that window, the all-pool announcement (~70 KB — about 2000 hashes accumulated by earlier suite tests) is written to the test peer three times:
08:16:32.606  inbound NPTH68            peer=de6008eb count=2
08:16:32.607  WRITE GET_POOLED_TXS_66   peer=de6008eb
08:16:32.607  sync-to-new-peers         entries=2002            <- 5s tick fires
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- one write
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- per
08:16:32.608  WRITE NPTH68 70084 bytes  peer=de6008eb           <- sentry
08:16:32.855  VIOLATION -> PenalizePeer peer=de6008eb           <- next 250ms batch flush
  1. The devp2p test deliberately tolerates one pre-disconnect hash announcement (go-ethereum commit 88c8459, Sept 2024, "sometimes we'll get a blob transaction hashes announcement before the disconnect"); the duplicated second copy arrives before the kick and fails the test.

In the failed CI run the timing lines up exactly: the txpool started at 06:33:04.64 and the test failed at 06:33:14.644 — the tick+10s sync.

With the duplication fixed, at most one announcement can precede the disconnect in that window, which the test tolerates by design. Beyond the test, every new peer was receiving the full pool sync in triplicate.

Fix

Apply the same negotiated-eth-version guard in SendMessageById that the other outbound paths use. Eth-protocol messages only; wit is unaffected (it is deduplicated to a single GrpcServer at shared-server construction). This also restores the routing contract documented at FetchBlockAccessLists ("sentryIndex MUST be the sentry where peerID is actually connected"): sends via a non-matching sentry are silent no-ops again, as they were with per-sentry peer sets.

Not present on release/3.4 (#21335 is main-only), so no backport is needed.

Testing

  • New regression test TestGrpcServer_SendMessageById_SharedStore_NoDuplicateWrites (TDD): three GrpcServers (eth/69/70/71) sharing a PeerStore, peer negotiated eth/70, by-id fan-out across all three must produce exactly one write. Red before the fix (expected: 1, actual: 3), green after.
  • go test ./p2p/sentry/... ./txnprovider/txpool/... clean.
  • Local hive validation in a loop: before the fix not ok 18 BlobViolations reproduced on iteration 2; after the fix 12/12 iterations green, with instrumented logs showing exactly one NEW_POOLED_TRANSACTION_HASHES_68 write per new peer where there were three.
  • make lint (repeatedly) and make erigon integration clean.

Since #21335 all per-eth-version GrpcServers share one p2p.Server and one
PeerStore, so every sentry resolves the same PeerInfo and an unfiltered
SendMessageById fan-out (e.g. the txpool's new-peer pool sync via
PropagatePooledTxnsToPeersList) wrote the same frame once per sentry -
three identical NewPooledTransactionHashes messages per new peer on the
default eth/69+70+71 build. Apply the negotiated-version guard that
SendMessageToAll and SendMessageToRandomPeers already use.

This is what flaked hive devp2p BlobViolations ("expected disconnect on
blob violation, got msg code: 24"): the test tolerates exactly one hash
announcement before the violation disconnect, and the triplicated pool
sync could land two announcements inside the txpool fetcher's 250ms
batching window before the kick.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an outbound message duplication bug introduced by shared PeerStore usage across multiple per-eth-version sentry GrpcServers. The key change ensures SendMessageById only writes eth-protocol frames via the sentry instance matching the peer’s negotiated eth protocol version, aligning it with the existing routing behavior of SendMessageToAll / SendMessageToRandomPeers.

Changes:

  • Add an eth-version routing guard to GrpcServer.SendMessageById to avoid duplicate writes when multiple sentries share a PeerStore.
  • Add a regression test that constructs multiple eth-version GrpcServers sharing one PeerStore and verifies a by-id fan-out results in exactly one write.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
p2p/sentry/sentry_grpc_server.go Adds negotiated-eth-version filtering to SendMessageById for eth-protocol messages to prevent duplicate writes in shared-store mode.
p2p/sentry/sentry_grpc_server_test.go Adds a regression test ensuring shared-store SendMessageById fan-out does not duplicate outbound writes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread p2p/sentry/sentry_grpc_server_test.go
@yperbasis yperbasis enabled auto-merge June 3, 2026 11:10
@yperbasis yperbasis added this pull request to the merge queue Jun 3, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 3, 2026
@yperbasis yperbasis enabled auto-merge June 3, 2026 12:12
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Jun 3, 2026
## Summary

Hive workspace logs (simulator output + erigon client logs) are uploaded
with the matrix values interpolated into the artifact name:

```
name: hive-workspace-log-${{ matrix.sim }}-${{ matrix.sim-limit }}-${{ matrix.exec_mode }}
```

`actions/upload-artifact` forbids `/`, `|` and `*` in artifact names, so
the upload fails for every leg whose name contains them — i.e. all
`ethereum/engine`, `ethereum/rpc-compat` (`.*` limit) and
`devp2p`-`eth|discv5` legs. The failure is masked by the step's
`continue-on-error: true`, so the logs just silently vanish:

```
##[error]The artifact name is not valid: hive-workspace-log-devp2p-eth|discv5-parallel. Contains the following character:  Vertical bar |
```

Only `hive-workspace-log-devp2p-eth-serial` ever uploaded. This bit
during the investigation of the `BlobViolations` flake behind erigontech#21597:
the failing parallel-leg run
(https://github.com/erigontech/erigon/actions/runs/26867180780) had no
workspace-log artifact, and the root-cause analysis had to be
reconstructed from the job stdout.

## Fix

Compute the artifact name in a small step that replaces any character
outside `[A-Za-z0-9._-]` with `_`. Resulting names per leg:

| before (invalid ones rejected) | after |
|---|---|
|
`hive-workspace-log-ethereum/engine-exchange-capabilities\|auth-serial`
| `hive-workspace-log-ethereum_engine-exchange-capabilities_auth-serial`
|
| `hive-workspace-log-ethereum/engine-cancun-parallel` |
`hive-workspace-log-ethereum_engine-cancun-parallel` |
| `hive-workspace-log-ethereum/rpc-compat-.*-serial` |
`hive-workspace-log-ethereum_rpc-compat-._-serial` |
| `hive-workspace-log-devp2p-eth\|discv5-parallel` |
`hive-workspace-log-devp2p-eth_discv5-parallel` |
| `hive-workspace-log-devp2p-eth-serial` (worked) | unchanged |

`test-hive-eest.yml` (`matrix.shard` values like `paris+shanghai`; `+`
is allowed) and `release.yml` artifact names contain no forbidden
characters, so they are left alone.

Validated with `actionlint` and by evaluating the substitution against
every matrix combination.
@yperbasis yperbasis added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 3882a9d Jun 3, 2026
88 checks passed
@yperbasis yperbasis deleted the yperbasis/sentry-by-id-no-dup branch June 3, 2026 14:50
chris-mercer pushed a commit to white-b0x/erigon that referenced this pull request Jun 3, 2026
## Problem

Two merge-queue evictions in the last 3 weeks were caused by the
SonarCloud scan failing to download the scanner CLI from
`binaries.sonarsource.com` — not by anything in the queued code:

- [Jun 3
run](https://github.com/erigontech/erigon/actions/runs/26882242018):
instant `Unexpected HTTP response: 403` (the action does not retry 4xx)
— evicted erigontech#21597 from the queue with all sibling jobs green or still
running
- [Jun 1
run](https://github.com/erigontech/erigon/actions/runs/26751990487):
download kept failing through the action's three internal attempts
(~40s), then gave up

In the merge queue the sonar job fast-cancels the whole CI Gate run on
failure, so a CDN blip cancels ~40 min of green sibling jobs and
`github-merge-queue` removes the PR with `failed_checks`. Per
CI-GUIDELINES.md, merge-queue checks must have no false positives; CDN
weather is one.

## Fix

Give the scan one spaced retry:

- the first attempt runs with `continue-on-error: true`
- if it failed, wait 90s and run the action again
- if the retry also fails, the job fails as before — a persistent outage
still blocks correctly

`cache-warming-only` runs are unaffected (scan skipped → outcome is
`skipped`, so the retry steps skip too). A `continue-on-error` step
reports `conclusion: success` to the jobs API, so ci-gate's root-cause
detection won't flag a run recovered by the retry; a double failure is
attributed to `SonarCloud scan (retry)`.

## Alternatives considered

- **Pre-seeding the runner tool cache**: impossible — the action's
`tc.find()` lookup can never match SonarSource's 4-segment version
string (`semver.clean("8.1.0.6389")` is `null`), so the action's
internal tool-cache path is dead code on any runner.
- **Mirroring the scanner zip via `scannerBinariesUrl`**: removes the
CDN dependency entirely but adds hosting and per-upgrade maintenance,
and the GPG keyserver dependency remains. Can revisit if 403s persist
despite the retry.

No tests: CI workflow YAML change (TDD not applicable); validated with
`actionlint` and `make lint`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants