Valkey-native execution engine for long-running, interruptible, resource-aware workflows.
βββββββββββββββ
β ff-server β HTTP API + boot sequence
ββββββββ¬βββββββ
β
ββββββββββββββββΌβββββββββββββββ
β β β
ββββββββ΄βββββββ ββββββ΄βββββ βββββββββ΄ββββββββ
β ff-engine β β ff-sdk β β ff-scheduler β
β 17 scanners β β worker β β claim-grant β
ββββββββ¬βββββββ β API β β cycle β
β ββββββ¬βββββ βββββββββ¬ββββββββ
β β β
ββββββββββββββββΌβββββββββββββββ
β
ββββββββ΄βββββββ
β ff-script β typed FCALL wrappers
ββββββββ¬βββββββ
β
ββββββββ΄βββββββ
β ff-core β types, state, keys, errors
ββββββββ¬βββββββ
β
ββββββββ΄βββββββ
β ferriskey β Valkey client (Rust)
βββββββββββββββ
- Lease-based ownership -- workers hold leases on executions; auto-renewed, crash-safe
- Suspend / signal / resume -- human-in-the-loop and async event-driven workflows with HMAC-signed waitpoint tokens
- Flow coordination -- DAG execution with dependency edges, fan-out, skip propagation
- Budget and quota -- per-dimension usage tracking, hard/soft limits, sliding-window rate limiting
- Streaming output -- append-only frame streams scoped to each attempt
- Priority scheduling -- score-based eligible sets with priority clamping
- Capability routing -- workers advertise capabilities; scheduler subset-matches per-execution requirements
- REST API -- 27 endpoints on axum with JSON error handling, CORS, health check
- Backend trait --
EngineBackendis the stable surface for alternate backends (Valkey + Postgres both first-class at v0.8.0, RFC-017 Stage E4);capabilities()returns a flatCapabilities { identity, supports }β dot-accesscaps.supports.<method>bools (v0.10, #277); typedsubscribe_lease_history/subscribe_completion/subscribe_signal_deliverywith a required&ScannerFilterparameter for per-tenant tag isolation (pass&ScannerFilter::default()for the unfiltered stream) (v0.10, #282); cursor-paginatedlist_executions/list_flows/list_lanes/list_suspendedfor operator tooling - Client-local tower-style layer surface -- opt-in
TracingLayer,RateLimitLayer,MetricsLayer,CircuitBreakerLayercompose around anyEngineBackend - Observability -- OTEL via
ff-observability, Prometheus scrape endpoint viaff-server(engine-side) orff-observability-httpcrate (consumer-side), optional Sentry integration, Grafana dashboard JSON bundled - Cluster-safe -- all operations use hash-tag partitioning; cluster-aware enumeration via ferriskey's hash-tag-aware
cluster_scan(single-shard routing when the match pattern embeds a tag)
FlowFabric is Valkey-native and is not tested or supported against Redis.
The version check expects Valkey's INFO shape; pointing ff-server at a
Redis server is unsupported and is rejected at boot by the version gate
(parse_valkey_version requires server_name:valkey in INFO). The
redis_version: fallback exists for pre-8.0 Valkey (which emits only
redis_version:, not valkey_version:), not to legitimize or allow Redis.
FlowFabric requires Valkey >= 7.2 (RFC-011 Β§13, enforced at boot). 7.2 is the
release where the Valkey Functions API + RESP3 stabilized. Valkey 7.0 and
earlier are rejected by ff-server.
docker run -d --name valkey -p 6379:6379 valkey/valkey:8-alpineThe quickstart image pins major 8 for reproducibility, but any valkey/valkey
tag whose version is β₯ 7.2 will boot. CI exercises both 7.2 and 8.
FF_WAITPOINT_HMAC_SECRET is required at boot β the server refuses to start
without it so waitpoint HMAC authentication can never be silently disabled
(see RFC-004 Β§Waitpoint Security). Any even-length hex string works for
local dev; production should use 64 hex chars (32 bytes) from a secret
manager.
FF_WAITPOINT_HMAC_SECRET=$(openssl rand -hex 32) cargo run -p ff-serverSix end-to-end examples live under examples/:
v011-wave9-postgres-- v0.11 headline demo for the RFC-020 Wave 9 Postgres release. Multi-tenant operator dashboard exercising all six Wave-9 method groups on Postgres: budget/quota admin,change_priority,cancel_execution+ack_cancel_member,replay_execution,list_pending_waitpoints,cancel_flow_header, andread_execution_info. RequiresFF_PG_TEST_URL.v010-read-side-ergonomics-- v0.10 headline demo for consumer read-side APIs: flatCapabilities::supports.<flag>discovery (#277), typedLeaseHistoryEventfromsubscribe_lease_history(#282), and tag-restricted subscriptions viaScannerFilter::with_instance_tag(..). Multi-tenant lease-audit console pattern. No external dependencies beyond a runningff-server.llm-race-- race N free OpenRouter LLM providers against the same prompt; automatically cancel losers when one wins; stream the winner viaDurableSummarywith JSON Merge Patch. Exercises v0.6AnyOf{CancelRemaining}+Count{DistinctSources}+ typedSuspendArgs. Verified live 2026-04-24. Best starting point for learning v0.6 primitives. RequiresOPENROUTER_API_KEY(free tier).coding-agent-- LLM-powered code-patch worker with streaming output and human-in-the-loop suspend/signal review. RequiresOPENROUTER_API_KEY.media-pipeline-- three-stage audio pipeline (transcribe β summarize β embed) exercising capability routing, stream tail with terminal markers, and HMAC-signed waitpoint signals. RequiresOPENROUTER_API_KEY+ local whisper.cpp.retry-and-cancel-- minimal control-plane demo of retry-exhaustion terminal failure +cancel_flowcascade. No external dependencies beyond the running server.
Quick start with the coding agent:
# Terminal 1: worker
cd examples/coding-agent
OPENROUTER_API_KEY=sk-or-... OPENROUTER_MODEL=moonshotai/kimi-k2.6 cargo run --bin worker
# Terminal 2: submit a task
cd examples/coding-agent
cargo run --bin submit -- --issue "Write a Rust function that checks if a string is a palindrome"The SDK's claim_next() method bypasses budget/quota admission control and is gated behind a feature flag. Enable it explicitly:
ff-sdk = { path = "crates/ff-sdk", features = ["direct-valkey-claim"] }For production deployments, use the Scheduler (ff-scheduler) which enforces admission control before issuing claim grants. The direct claim path is intended for development, testing, and trusted single-worker setups.
| Crate | Description |
|---|---|
ferriskey |
Valkey client -- in-tree, forked from glide-core (valkey-glide). Hash-tag-aware cluster_scan single-shard routing. |
ff-core |
Core types, state enums, partition math, key builders, EngineBackend trait, error codes |
ff-script |
Typed FCALL wrappers and Lua library loader |
ff-engine |
Cross-partition dispatch and 17 background scanners |
ff-scheduler |
Claim-grant cycle, fairness, capability matching |
ff-backend-valkey |
EngineBackend implementation backed by Valkey FCALL |
ff-backend-postgres |
EngineBackend implementation backed by Postgres (RFC-017 Stage E) |
flowfabric |
Umbrella re-export crate; feature-flagged backend selection (valkey default, postgres opt-in) |
ff-sdk |
Worker SDK β public API for worker authors; includes the client-local EngineBackendLayer surface |
ff-server |
HTTP API server, Valkey connection, boot sequence, engine /metrics endpoint |
ff-observability |
OTEL + Prometheus instrumentation, optional Sentry integration |
ff-observability-http |
Consumer-side /metrics axum router for workers embedding ff-sdk in library mode |
ff-test |
Integration test harness, fixtures, assertion helpers |
ff-readiness-tests |
Release-gate behavioral smokes (lifecycle, flow fanout/join, HMAC roundtrip, cancel cascade) |
-
Env var reference β
ServerConfig::from_envincrates/ff-server/src/config.rsis the source of truth. Full list below (required ones are markedrequired):Variable Default Description FF_WAITPOINT_HMAC_SECRETrequired Hex-encoded HMAC signing secret for waitpoint tokens (RFC-004 Β§Waitpoint Security). Even-length hex; 64 chars (32 bytes) recommended. FF_BACKENDvalkeyBackend family β valkeyorpostgres. Both first-class at v0.8.0 (RFC-017 Stage E4).FF_HOSTlocalhostValkey host (ignored when FF_BACKEND=postgres)FF_PORT6379Valkey port (ignored when FF_BACKEND=postgres)FF_TLSfalseEnable Valkey TLS ( 1ortrue)FF_CLUSTERfalseEnable Valkey cluster mode FF_POSTGRES_URL(empty) Postgres connection URL (required when FF_BACKEND=postgres). Example:postgres://user:pass@host:5432/db.FF_POSTGRES_POOL_SIZE10Postgres pool size (ignored on the Valkey path). FF_LISTEN_ADDR0.0.0.0:9090API listen address FF_LANESdefaultComma-separated lane names FF_FLOW_PARTITIONS256Flow partition count (exec keys co-locate under RFC-011) FF_BUDGET_PARTITIONS32Budget partition count FF_QUOTA_PARTITIONS32Quota partition count FF_CORS_ORIGINS*Comma-separated CORS origins; empty string is rejected FF_API_TOKEN(none) Shared-secret Bearer token; if set, all non- /healthzrequests require itFF_WAITPOINT_HMAC_GRACE_MS86400000Grace window for previous-kid token acceptance after rotation FF_MAX_CONCURRENT_STREAM_OPS64Shared semaphore for stream reads + tails; legacy FF_MAX_CONCURRENT_TAILstill acceptedFF_LEASE_EXPIRY_INTERVAL_MS1500Lease-expiry scanner interval FF_DELAYED_PROMOTER_INTERVAL_MS750Delayed-promoter scanner interval FF_INDEX_RECONCILER_INTERVAL_S45Index reconciler interval FF_ATTEMPT_TIMEOUT_INTERVAL_S2Attempt-timeout scanner interval FF_SUSPENSION_TIMEOUT_INTERVAL_S2Suspension-timeout scanner interval FF_PENDING_WP_EXPIRY_INTERVAL_S5Pending-waitpoint expiry interval FF_RETENTION_TRIMMER_INTERVAL_S60Retention-trimmer scanner interval FF_BUDGET_RESET_INTERVAL_S15Budget-reset scanner interval FF_BUDGET_RECONCILER_INTERVAL_S30Budget reconciler interval FF_QUOTA_RECONCILER_INTERVAL_S30Quota reconciler interval FF_UNBLOCK_INTERVAL_S5Unblock scanner interval FF_DEPENDENCY_RECONCILER_INTERVAL_S15DAG dependency reconciler interval FF_FLOW_PROJECTOR_INTERVAL_S15Flow projector scanner interval FF_EXECUTION_DEADLINE_INTERVAL_S5Execution-deadline scanner interval FF_CANCEL_RECONCILER_INTERVAL_S15Cancel-reconciler scanner interval -
Auth is opt-in β if
FF_API_TOKENis unset, every endpoint exceptGET /healthzis reachable without authentication. Always setFF_API_TOKEN(and consider CORS viaFF_CORS_ORIGINS) before exposing the server beyond localhost. -
No API rate limiting β deploy behind a reverse proxy (nginx, Envoy) with rate limiting configured
-
No HTTP connection limit β axum accepts unbounded connections; configure limits at the load balancer
-
cancel_flow returns
CancellationScheduledimmediately forcancel_allpolicy and dispatches member cancellations asynchronously (seePOST /v1/flows/{id}/cancel); append?wait=truefor synchronous completion. Background dispatch is drained with a 15s timeout on shutdown and may be aborted for very large flows. -
detect_cycle BFS can issue ~100K
redis.calloperations on large DAGs (up to 1000 nodes Γ edges per node), blocking the Valkey event loop; keep flow graph fan-out moderate -
Terminal operation replay β
complete(),fail(), andcancel()are idempotent on retry: if the connection drops after the Lua commit but before the client reads the response, a retry by the same caller (matchinglease_epoch+attempt_id) is reconciled by the SDK and returns the same outcome as the original call would have. A retry after a different terminal op committed (e.g. cancel after complete) surfaces asExecutionNotActivewith populatedterminal_outcomeso the caller can inspect what actually happened. -
cancel_flow dispatch durability β the async member-cancel loop is durable against process crashes and permanent per-member errors via a per-flow-partition
cancel_backlogZSET and per-flowpending_cancelsSET. Live dispatch acks members as they cancel; thecancel_reconcilerscanner drains any remainder on its interval (default 15s). A worker can still claim a cancelled flow's member in the cross-slot window betweencancel_flowand the member's ownff_cancel_executionβff_claim_executionreads exec_core on{p:N}and cannot atomically consultflow_core.public_flow_stateon{fp:N}, so a member may run one partial attempt before the reconciler cancels it.?wait=truestill eliminates that window synchronously; default-async now converges withinreconciler_interval + grace_msrather than relying on retention. -
Lua library after failover β the server auto-reloads the Lua library on first "function not found" error after a Valkey failover, but the first request in the new epoch will fail and be retried
KEYS-arity changes (and the LIBRARY_VERSION bump that paired with Batch A) require blue-green or stop-then-start deployment. Running old and new ff-server binaries against the same Valkey simultaneously can produce Lua errors on the old binary because the new binary loads a library whose KEYS arity differs from what the old binary expects β redis.call(..., nil, ...) surfaces as ResponseError, not silent corruption, but requests will fail until the old instances are drained.
The version string lives in lua/version.lua (single source of truth). scripts/gen-ff-script-lua.sh extracts it and writes crates/ff-script/src/flowfabric_lua_version, which Rust reads via include_str!. Regenerate after any edit to lua/*.lua; CI fails on drift.
Future: per-FCALL version negotiation. For now: drain old instances before starting new ones, or run a single writer at a time.
Two PR-time GitHub Actions workflows gate merges:
.github/workflows/matrix.ymlβ 6-job host Γ valkey Γ mode matrix. RFC-011 Β§13 requires Valkey >= 7.2, enforced atff-serverboot.- linux x86_64 (
ubuntu-latest) Γ valkey 8 (valkey/valkey:8-alpine) Γ {standalone, cluster} - linux x86_64 (
ubuntu-latest) Γ valkey 7.2 (valkey/valkey:7.2) Γ standalone β guards against accidental 8-specific primitive adoption - linux arm64 (
ubuntu-24.04-arm) Γ valkey 8 Γ {standalone, cluster} - macos arm64 (
macos-latest) Γ valkey 8 Γ standalone (Homebrew Valkey, currently tracking the 8.x line unpinned; docker-on-mac on GitHub hosted runners does not support the privileged-mode cluster setup). - Cluster mode on linux uses a 3-master 3-replica cluster via
.github/cluster/docker-compose.cluster.yml+bootstrap.sh.
- linux x86_64 (
.github/workflows/security-and-quality.ymlβ 6 independent gates, any one blocks merge:cargo auditwith--deny warnings; ignore list in.cargo/audit.tomlcargo deny check(licenses, bans, sources, advisories); policy indeny.tomlcargo geigerunsafe-expression ratchet against a committed baseline at.github/geiger-baseline.json. Refresh withbin/update-geiger-baseline.shin a PR that deliberately changes the unsafe count.- CodeQL gate (polls
codeql.ymlfor the same SHA) cargo macheteunused-dep detection oncrates/+ferriskey(benches and examples are out of scope β their authors own hygiene there)cargo semver-checksagainst the latestv*.*.*git tag; skips gracefully before the first release.
cargo install takes a minute per tool; the Swatinem/rust-cache@v2 key partitioning in both workflows amortises this across PR pushes.
Apache-2.0