FIP: Requirements for Onboarding New Farcaster Validators #272

manan19 · 2026-06-15T15:43:28Z

Maintainer

Field	Value
Status	Draft
Author	Neynar (manan@neynar.com)
Created	2026-06-09

Summary

Farcaster consensus is BFT (Malachite) with equal voting power per validator and a >2/3
quorum. On today's ~6-validator shards the fault budget is 1: one faulty, slow, or divergent
validator uses up the whole budget, and a second halts the shard. Adding a validator is a
one-line effective_at edit to validators.toml: cheap to do, expensive to get wrong.
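
The quorum arithmetic behind that fault budget can be checked in a few lines of Rust. This is a minimal sketch that assumes equal-weight votes and a strictly-greater-than-2/3 quorum, as described above; the function names are hypothetical and are not taken from snapchain or Malachite.

```rust
/// Smallest vote count that is strictly more than 2/3 of `n` equal-weight validators.
pub fn quorum_votes(n: u32) -> u32 {
    (2 * n) / 3 + 1
}

/// Number of validators that can be faulty or offline while a quorum is still reachable.
pub fn fault_budget(n: u32) -> u32 {
    // saturating_sub keeps n = 0 from underflowing.
    n.saturating_sub(quorum_votes(n))
}

/// True when `healthy` validators out of `n` are enough to form a quorum.
pub fn can_make_progress(n: u32, healthy: u32) -> bool {
    healthy >= quorum_votes(n)
}
```

For 6 validators this gives a quorum of 5 and a budget of 1; a seventh validator raises the budget to 2.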

The hard case, and the core of this proposal, is a validator running a different codebase
than snapchain (a fork or an independent reimplementation). Such a client must be bit-for-bit
deterministic with the rest of the network or it breaks consensus. This document defines the
requirements an alternative client must meet, how they are verified, and the rollout for
admitting one. Deployment requirements that apply to any new validator (including same-binary
expansion into a new geo/datacenter) are covered in one section near the end.

Motivation

Two things are happening:

Near-term: expansion of the validator set into a new geo/datacenter (latency-sensitive —
see Deployment requirements below).
Structurally: interest in client diversity — a validator built from a different codebase.
"Client" here means the consensus node (the validator/hub binary), not an app.

There is no written, testable bar today. This FIP provides one, scaled to risk.

Current phase. Validator membership is governed manually — there are no native protocol
incentives or permissionless staking yet, so the set is maintained by mutual agreement among
operators via validators.toml. Until that changes, admission and removal are partly a trust and
collaboration decision, not only a technical one (see Operator requirements below).

Risk tiers

Tier	Candidate	Added requirement over the tier above
A	Stock snapchain, new operator / geo	Deployment, networking, custody, ops only.
B	Fork of snapchain	Everything in A + the full determinism contract below (forks drift).
C	Independent reimplementation	Everything in B, with no shared code to fall back on — the determinism contract is the entire burden.

Tiers are cumulative. The bulk of this document is the B/C determinism contract; Tier A only
needs the Deployment + Operational sections.

Requirements for alternative client implementations

A non-snapchain validator (Tier B/C) must satisfy all of the following before it is added to
validators.toml on mainnet. Each is a hard gate, not a guideline.

R1 — Consensus determinism (the core gate)

Consensus signs over encoded bytes and hashes headers with BLAKE3 (snapchain_codec.rs, blocks.proto).
Any divergence in encoding, hashing, or state computation produces different signed bytes or block
hashes → votes are rejected → the shard stalls (or, if a quorum diverges together, forks). The client
must produce, for every input in the shared conformance corpus:

byte-exact protobuf encoding of messages, blocks, votes, and proposals;
the same BLAKE3 header hash;
a valid, verifiable Ed25519 signature over the canonical bytes;
the same post-state merkle/state root.

Gate: 100% byte-exact match against the shared conformance vectors (see Verification below).
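
One way to picture the gate is a checker that feeds each corpus input to the candidate and compares every output byte for byte. The sketch below is illustrative only: `ConformanceVector`, `ClientOutput`, and `run_corpus` are hypothetical names, the corpus format is not yet defined by this FIP, and the Ed25519 signature check is left out because it needs a crypto crate. The 32-byte hash field assumes BLAKE3's default digest length.

```rust
/// One entry of a hypothetical conformance corpus: an input plus the exact
/// outputs every client is expected to produce for it.
pub struct ConformanceVector {
    pub name: &'static str,
    pub input: Vec<u8>,
    pub expected_encoding: Vec<u8>,
    pub expected_header_hash: [u8; 32],
    pub expected_state_root: Vec<u8>,
}

/// What the candidate client produced for one input.
pub struct ClientOutput {
    pub encoding: Vec<u8>,
    pub header_hash: [u8; 32],
    pub state_root: Vec<u8>,
}

/// The first field that differed for a vector.
#[derive(Debug, PartialEq, Eq)]
pub enum Mismatch {
    Encoding,
    HeaderHash,
    StateRoot,
}

/// Byte-exact comparison of one output against one vector.
pub fn check_vector(v: &ConformanceVector, out: &ClientOutput) -> Result<(), Mismatch> {
    if out.encoding != v.expected_encoding {
        return Err(Mismatch::Encoding);
    }
    if out.header_hash != v.expected_header_hash {
        return Err(Mismatch::HeaderHash);
    }
    if out.state_root != v.expected_state_root {
        return Err(Mismatch::StateRoot);
    }
    Ok(())
}

/// Runs the whole corpus and returns every failing vector by name.
/// An empty result corresponds to the 100% gate.
pub fn run_corpus<F>(corpus: &[ConformanceVector], mut client: F) -> Vec<(&'static str, Mismatch)>
where
    F: FnMut(&[u8]) -> ClientOutput,
{
    corpus
        .iter()
        .filter_map(|v| {
            let out = client(v.input.as_slice());
            check_vector(v, &out).err().map(|m| (v.name, m))
        })
        .collect()
}
```

Reporting the failing field rather than a bare pass/fail makes it easier to tell an encoding bug from a state-computation bug.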

R2 — Validation parity

The client must make the identical accept/reject decision on every message as the reference
validation rules (src/core/validations/).
One client accepting what another rejects produces non-deterministic block contents and breaks
consensus.
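
Validation parity can be expressed as a differential test: run the same message corpus through the reference validator and the candidate, and list every message on which the two disagree. This is a rough sketch; `parity_mismatches` is a hypothetical helper, and the two closures stand in for whatever entry points the real validators expose.

```rust
/// Returns the indices of corpus messages on which the reference and the
/// candidate reach different accept/reject decisions.
pub fn parity_mismatches<R, C>(corpus: &[Vec<u8>], reference: R, candidate: C) -> Vec<usize>
where
    R: Fn(&[u8]) -> bool,
    C: Fn(&[u8]) -> bool,
{
    corpus
        .iter()
        .enumerate()
        .filter(|(_, msg)| reference(msg.as_slice()) != candidate(msg.as_slice()))
        .map(|(i, _)| i)
        .collect()
}
```

Parity requires the returned list to be empty. A candidate that is stricter than the reference fails just as a more lenient one does.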

R3 — Protocol compatibility & versioning

The client must declare and track a specific protocol-compatibility version: the proto
definitions, the Malachite consensus wire format/version, and the message-validation rule set it
conforms to.
Consensus-critical changes upstream (encoding, hashing, validation, timing) must be adopted within
a defined compatibility window. A client that falls behind is removed until it re-conforms.

R4 — Byzantine safety & crash recovery

No equivocation: the client must never sign two values at the same height/round, even across a
crash/restart. Signer design must make double-signing impossible (e.g. height/round high-water
mark persisted before signing).
Crash recovery: on restart the client must replay its write-ahead state and resume voting
consistently with its pre-crash self.
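
The high-water-mark idea can be sketched as a small guard that the signer consults before every signature. The code below is an assumption-laden illustration, not snapchain's or Malachite's signer: `SignGuard`, the file format, and the inclusion of a step number alongside height and round are all hypothetical. It is deliberately conservative and refuses to sign twice at the same position, even for an identical value.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// (height, round, step). Tuples compare lexicographically, so a later
/// position is always strictly greater than an earlier one.
pub type Hrs = (u64, u32, u8);

pub struct SignGuard {
    path: PathBuf,
    last: Option<Hrs>,
}

impl SignGuard {
    /// Loads the persisted mark, or starts empty if the file does not exist.
    /// A corrupt file is an error, so the signer fails closed.
    pub fn open(path: &Path) -> io::Result<Self> {
        let last = match fs::read_to_string(path) {
            Ok(s) => Some(parse(&s)?),
            Err(e) if e.kind() == io::ErrorKind::NotFound => None,
            Err(e) => return Err(e),
        };
        Ok(Self { path: path.to_path_buf(), last })
    }

    /// Returns Ok(true) only if `hrs` is strictly above the stored mark.
    /// The new mark is written and synced to disk before this returns,
    /// so the caller signs only after the mark is persisted.
    pub fn authorize(&mut self, hrs: Hrs) -> io::Result<bool> {
        if let Some(last) = self.last {
            if hrs <= last {
                return Ok(false);
            }
        }
        // Write to a temp file, sync, then rename over the old mark.
        // (Syncing the parent directory is omitted for brevity.)
        let tmp = self.path.with_extension("tmp");
        let mut f = File::create(&tmp)?;
        write!(f, "{} {} {}", hrs.0, hrs.1, hrs.2)?;
        f.sync_all()?;
        fs::rename(&tmp, &self.path)?;
        self.last = Some(hrs);
        Ok(true)
    }
}

fn parse(s: &str) -> io::Result<Hrs> {
    let bad = || io::Error::new(io::ErrorKind::InvalidData, "corrupt high-water mark");
    let mut it = s.split_whitespace();
    let height = it.next().and_then(|x| x.parse().ok()).ok_or_else(bad)?;
    let round = it.next().and_then(|x| x.parse().ok()).ok_or_else(bad)?;
    let step = it.next().and_then(|x| x.parse().ok()).ok_or_else(bad)?;
    Ok((height, round, step))
}
```

Because the mark is reloaded in `open`, a process that crashes after persisting but before broadcasting its vote will refuse to sign again at that position on restart, which trades a missed vote for safety.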

R5 — Networking conformance

The client must speak the existing p2p protocol: libp2p gossipsub over QUIC, the consensus /
mempool / decided-values / contact-info topics, and the contact-info exchange used for peer
discovery and mesh formation (gossip.rs). It must form and hold the gossip mesh with existing
validators.

R6 — Operational & custody

Key custody: validator Ed25519 key in an HSM / managed secret store; no key reuse across
environments; equivocation-safe signer (R4).
Monitoring: block-height lag, round count, message-rejection counts, and peer/mesh health
exported.
Auditability: because a divergent binary is a consensus-safety dependency for every
operator, the client should be auditable by the existing operators (source available for review).

R7 — Continuous compliance

Conformance is not one-and-done. The client must:

run the shared conformance + validation suites in its own CI, pinned to the declared protocol
version;
re-verify on every release (its own and on each snapchain protocol bump);
remain subject to removal via effective_at if it drifts.

How requirements are verified

The requirements above map onto four test layers. Snapchain is strong at L2 today but thin at
L0 and L3; the concrete gaps are tracked under snapchain#924 and must be closed before a
Tier B/C client is admitted.

Layer	Verifies	Mechanism	Status
L0 Conformance vectors	R1, R2	Shared, versioned corpus: input → expected bytes/hash/signature/state-root. Today `client_parity_tests` is one-directional (input validation only); it must become bidirectional with output assertions.	Gap — #917
L1 Unit / validation	R2, R4	The client passes the reference validation corpus and a Byzantine/equivocation harness.	Validation: exists. Byzantine harness: gap — #918, fuzz: #923
L2 Multi-node	R4, R5	The `consensus_test.rs` `TestNetwork` harness with the candidate as a node: consensus, sync, crash/recovery, validator-add, cross-shard, partition.	Strong, but add path is untested — #919
L3 Full-network testnet	R1–R7 end-to-end	Production-like testnet (real config + QUIC, `setup_local_testnet`, load via `src/perf/`) with the candidate validating.	Gap — #922

Deployment requirements (all new validators)

These apply to every new validator, Tier A included. They matter most for a validator in a new
geo/datacenter, because snapchain's timing budget is tight (block_time 1s; propose_time 1s;
prevote_time/precommit_time 500ms; round timeouts grow by step_delta 500ms; see consensus.rs).

Latency budget: measured RTT from the new location to each existing validator must fit inside
the propose/prevote windows with margin. A validator that can't get votes out in time degrades
throughput for everyone. (No latency-injection test exists yet — #920.)
Soak: a multi-day run from the real location to catch diurnal jitter, packet loss, and mesh
re-formation under churn.
Resilience: partition/failover drills from the new location; NTP/clock-sync verified.
Reachability: bootstrap/direct-peer config and QUIC/firewall reachability confirmed.
Bootstrap: node fully synced (snapshot/replication) and tracking tip before its
effective_at.
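
As a rough illustration of the latency-budget check, the sketch below compares the worst measured RTT, scaled by a safety factor, against the tightest step window. The timeout values are the ones quoted above. Everything else is an assumption: the `Timeouts` type and function names are hypothetical, the linear per-round growth is one reading of "grow by step_delta", and the safety factor is a placeholder that operators would need to agree on.

```rust
use std::time::Duration;

/// Step timeouts as quoted in the text above.
pub struct Timeouts {
    pub propose: Duration,
    pub prevote: Duration,
    pub precommit: Duration,
    pub step_delta: Duration,
}

impl Timeouts {
    pub fn from_text() -> Self {
        Timeouts {
            propose: Duration::from_secs(1),
            prevote: Duration::from_millis(500),
            precommit: Duration::from_millis(500),
            step_delta: Duration::from_millis(500),
        }
    }

    /// Timeout for a step in a given round, assuming it grows linearly
    /// by `step_delta` per round (an assumption, not confirmed here).
    pub fn at_round(&self, base: Duration, round: u32) -> Duration {
        base + self.step_delta * round
    }

    /// The shortest round-0 window, which is the binding constraint.
    pub fn tightest(&self) -> Duration {
        self.propose.min(self.prevote).min(self.precommit)
    }
}

/// True when the worst RTT, multiplied by `safety_factor`, fits in `window`.
/// With no measurements the check fails rather than passing vacuously.
pub fn fits_budget(rtts: &[Duration], window: Duration, safety_factor: u32) -> bool {
    match rtts.iter().max() {
        Some(worst) => *worst * safety_factor <= window,
        None => false,
    }
}
```

With a safety factor of 2, a worst-case RTT of 180 ms fits the 500 ms prevote window, while 280 ms does not. A real gate would use percentile RTTs from the soak run rather than a single maximum.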

Operator requirements (all validators)

While the set is manually governed (see Current phase above), recovering from a high-priority
incident — a stalled shard needing a coordinated validator-set cutover, or a fast removal of a
misbehaving node — depends on operators coordinating in real time. A prospective operator must commit
to:

a reachable on-call / incident contact and an agreed escalation path;
participation in coordinated validator-set changes (cutovers, rollbacks) on short notice;
good-faith, professional collaboration with other operators.

Operators are also added as maintainers of the snapchain repository, sharing responsibility for
review, releases, and incident fixes — reinforcing the auditability expectation in R6 and ensuring
every operator can act during an incident.

The same manual process that admits a validator can remove one — for technical drift (R3/R7) or
for failing to uphold these collaboration expectations. These criteria are interim and expected to
be superseded once native protocol incentives exist.

Rollout

Staged and reversible:

Pass L0–L2. No mainnet scheduling until green.
Testnet read-node. Candidate syncs (no voting); verify zero state-root divergence over an
observation window.
Testnet validator. Add via future effective_at; observe the L3 gates for an observation
window, including a partition drill.
Mainnet. Schedule effective_at to cut into all shards at around the same time —
staggering leaves the validator sets mismatched across shards and risks cross-shard instability.
Because effective_at is a per-shard height and shards advance independently, a near-simultaneous
cutover needs per-shard height estimates coordinated with existing operators.
Rollback. Removal is the same mechanism in reverse — a validator-set entry dropping the
candidate at a future effective_at. Operators should know how to execute and observe a removal
before scheduling the add.
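
The per-shard height estimate for a near-simultaneous mainnet cutover is simple arithmetic: take each shard's current height and add the number of blocks expected before the target time. The helper below is a hypothetical sketch that assumes a steady block_time; real shards can run faster or slower, so operators would want to use an observed block rate and re-check the estimate as the cutover approaches.

```rust
use std::time::Duration;

/// Estimates the height a shard will reach after `lead` has elapsed,
/// assuming one block every `block_time`. Returns None if `block_time`
/// is zero or the result would overflow.
pub fn estimate_effective_at(
    current_height: u64,
    lead: Duration,
    block_time: Duration,
) -> Option<u64> {
    if block_time.is_zero() {
        return None;
    }
    let blocks = lead.as_nanos() / block_time.as_nanos();
    let blocks = u64::try_from(blocks).ok()?;
    current_height.checked_add(blocks)
}
```

For example, with a 1s block_time and a one-hour lead, a shard at height 1,000,000 maps to 1,003,600 and a shard at 980,000 maps to 983,600, so each shard gets its own effective_at for the same wall-clock target.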

Acceptance checklist (go / no-go)

Complete before a mainnet validators.toml edit. Tier tags in parentheses; unmarked = all tiers.

Open questions

Should a Byzantine/equivocation fault-injection harness (#918) be a hard requirement for
Tier C, or is conformance + audit sufficient?
If a desirable geo can't fit the current timing budget, do we revisit propose_time /
prevote_time / block_time — a coordinated, all-node change?
Who owns and versions the shared conformance corpus, and where does it live so multiple
clients can depend on it?

References

Testing-gap tracking issue: snapchain#924 (sub-issues #917–#923)
Consensus-critical codec: snapchain_codec.rs, blocks.proto
Validator config & timing: consensus.rs, validators.toml
Networking: gossip.rs
Test harness: consensus_test.rs, client_parity_tests/, setup_local_testnet.rs, src/perf/

CassOnMars · 2026-06-16T22:04:21Z

If a desirable geo can't fit the current timing budget, do we revisit propose_time /
prevote_time / block_time — a coordinated, all-node change?

I'll reiterate concerns I've raised before: the current block time will not survive larger quorums or more varied geographies. It's better to rip the band-aid off sooner rather than later.

2 replies

topocount Jun 16, 2026
Maintainer

we are running a testnet node in SE Asia without missing proposals. We are also slowly increasing quorums and have a good amount of buffer with the current block time as it stands.

CassOnMars Jun 16, 2026

Is this in AWS? No discredit to the work done by their team; it's hard to get latency down between regions. However, if it's AWS, that would explain why. For an example of dedicated hosts in SEA that have reasonable latency but hit awful ping times to and from AWS, look at OVH Singapore.

CassOnMars · 2026-06-16T23:38:52Z

A couple of suggested additions to the acceptance criteria (they make alternative implementations' lives harder but are reasonable expectations):

RPC interference latency parity: protobuf spec compliance is important, but one thing stands out for independent implementations. A validator tested in isolation, with no request load on its RPC read endpoints, does not show that real-world workloads won't cause consensus failures. snapchain, hypersnap, and other fork-based implementations (most likely) keep read pressure separated so it does not interfere with validator latency, but this isn't guaranteed. As @manan19 noted in the last dev call as a possible concern with hypersnap's search indexing, alternative clients may carry additional loads (e.g. extra indexing or APIs). The acceptance criteria should include a healthy blend of at least snapchain-compatible API request workloads in the burn-in test for validators. As a plus for ecosystem health all around, adding this to the existing test suite would help identify hot paths where they exist, including in snapchain. A more expansive but fair requirement for alternative clients is to also include a reasonable blend of client-specific API request workloads in their burn-in test. Happy to scope those for Hypersnap based on current traffic patterns.
SDK compatibility parity: this one is potentially debatable, but in the days of hubs, the SDK was expected to work when pointed at any hub. I don't think it is unreasonable to extend the same expectation to the current SDK, especially for as long as validators are PoA-based. A client dev should be able to expect that the snapchain-centric SDK can point to any snapchain-approved validator and function.

mahdi858 · 2026-06-17T08:42:48Z

Nice
