Migration Notes
- Cassandra schema: transfer task and timer task tables require new columns for domain multi-tenancy task routing. Apply schema migrations before rolling out.
- Histogram metrics: the histograms config block must be explicitly set to opt into histogram emission. See the Timer to Histogram Metric Migration section below for the full rollout path.
Major Features
1. Cadence Schedules
Cadence Schedules is now generally available. It introduces a first-class scheduling primitive that replaces hand-rolled cron workflows. See the blog post and concepts documentation. The full API surface (Create, Describe, Update, Delete, Pause, Unpause, List, Backfill) is implemented across the type system, proto mappers, Thrift transport, frontend handlers, and CLI. The scheduler runs as a per-domain durable workflow with ContinueAsNew. It enforces configurable overlap policies (SKIP, CONCURRENT, BUFFER, CATCH_UP_ALL) and concurrency limits, and it supports catch-up for missed fire times, backfill with per-fire override policies, and search attribute propagation to target workflows. A per-domain scheduler worker manager uses membership-based routing and is enabled by default with a per-domain filter.
Migration Path
- The scheduler worker is disabled by default. Operators can enable it per-domain via dynamic config.
- Schedule search attributes must be added to your visibility index. They are included in the default key set automatically.
2. Active-Active Domains
Features
- Added MySQL support for Active-Active
- Cadence Operations Failover workflow now supports passing the set of cluster-attributes that should be failed over
Bug Fixes
- It is no longer possible to convert a domain to active-active via the FailoverDomain RPC
3. Domain Multi-Tenancy (in progress)
Extends Cadence's multi-tenancy guarantees below the domain level. Cadence already enforces rate limiting, task prioritization, and fair task scheduling at the domain level to prevent noisy-neighbor interference between customers. However, platform teams that consolidate multiple use cases into a single domain have historically had no isolation between those use cases. This release introduces task-list-level isolation primitives, including task routing, rate limiting, and scheduling priority, so that platform teams can get the same guarantees within a domain that domain-level isolation provides between domains.
Key Capabilities
- Hierarchical interleaved weighted round-robin (IWRR) scheduler for fair task scheduling across task lists within a domain
- Task list nice value for priority control between task lists within a domain
- Task list rate limiting applied to both regular and sticky polls, preventing a single task list from starving others in the same domain
- Task list name and kind stored in history and transfer tasks to enable task routing decisions
- History task latency metrics tracked at the task list level for per-task-list observability
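To make the fairness idea behind interleaved weighted round-robin concrete, here is a minimal Go sketch of a single, flat IWRR cycle. It is not Cadence's scheduler: the `queue` type, its fields, and `iwrrOrder` are hypothetical names chosen for illustration, and the hierarchical layer and any mapping from nice values to weights are left out.

```go
package main

import "fmt"

// queue is a hypothetical stand-in for a task list with pending tasks.
type queue struct {
	name   string
	weight int      // higher weight means more dispatch slots per cycle
	tasks  []string // pending tasks, oldest first
}

// iwrrOrder runs one IWRR cycle and returns tasks in dispatch order.
// In round r (0-based), every queue whose weight exceeds r may dispatch
// one task, so a queue with weight w gets up to w slots per cycle,
// interleaved with the other queues instead of served in one burst.
func iwrrOrder(queues []*queue) []string {
	maxWeight := 0
	for _, q := range queues {
		if q.weight > maxWeight {
			maxWeight = q.weight
		}
	}
	var order []string
	for round := 0; round < maxWeight; round++ {
		for _, q := range queues {
			if q.weight > round && len(q.tasks) > 0 {
				order = append(order, q.tasks[0])
				q.tasks = q.tasks[1:]
			}
		}
	}
	return order
}

func main() {
	queues := []*queue{
		{name: "batch", weight: 1, tasks: []string{"b1", "b2", "b3"}},
		{name: "interactive", weight: 3, tasks: []string{"i1", "i2", "i3"}},
	}
	fmt.Println(iwrrOrder(queues)) // prints [b1 i1 i2 i3]
}
```

The low-weight queue is still served in every cycle, which is the property that keeps one busy task list from starving the others.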
Migration Path
- Cassandra schema upgrade required: transfer task and timer task tables gain new columns for task list name and kind. Apply schema migrations before rolling out this version.
Current Limitations
Task-list-level isolation is in active development. Additional controls and enforcement points will be added in subsequent releases.
4. Timer to Histogram Metric Migration
Cadence is migrating from Tally's costly Timer metrics to ExponentialHistogram to reduce infrastructure costs and align with Prometheus and OpenTelemetry standards.
The v1.4.1 release enables dual-emission for all timer metrics, alongside new GaugeMigration and CounterMigration frameworks.
Migration Path
Emission behavior is controlled per-metric or globally via the histograms block in your service config YAML. The valid modes are timer, histogram, and both.
histograms:
  default: timer # global default: "timer", "histogram", or "both"
  names:
    task_latency: true # true = emit histogram, false = suppress
    persistence_latency: true
If the histograms block is not configured, or default is not set, behavior is the same as in previous releases: only timers are emitted. Histogram emission must be explicitly opted into. Metrics not listed in the migration registry are always emitted regardless of this config.
The recommended rollout path is:
- Set default: timer (or leave unset) to preserve existing behavior
- Opt individual metrics into both to validate histogram output alongside existing timers
- Set default: both once you are confident histograms are correct
- Set default: histogram to drop legacy timers when your dashboards and alerts are fully migrated
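The three modes and the unset-default behavior described above can be summarized in a small Go sketch. This is only an illustration of the documented semantics, not Cadence's implementation: `emission` and `resolveEmission` are hypothetical names, per-metric overrides under names are omitted, and metrics outside the migration registry are not modeled.

```go
package main

import "fmt"

// emission reports which metric types are emitted for one migrated metric.
type emission struct {
	Timer     bool
	Histogram bool
}

// resolveEmission maps a mode string to what gets emitted.
// An empty mode models an unset default, which behaves like "timer".
func resolveEmission(mode string) (emission, error) {
	switch mode {
	case "", "timer":
		return emission{Timer: true}, nil
	case "histogram":
		return emission{Histogram: true}, nil
	case "both":
		return emission{Timer: true, Histogram: true}, nil
	default:
		return emission{}, fmt.Errorf("unknown histogram mode %q", mode)
	}
}

func main() {
	// Walk the rollout path: unset, timer, both, histogram.
	for _, mode := range []string{"", "timer", "both", "histogram"} {
		e, err := resolveEmission(mode)
		if err != nil {
			panic(err)
		}
		fmt.Printf("mode=%q timer=%v histogram=%v\n", mode, e.Timer, e.Histogram)
	}
}
```

Only the both mode emits the two series side by side, which is why it is the validation step before dropping timers.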
Current Limitations
Legacy timers will eventually be removed. The histograms config block is transitional and will become validation-only (rejecting explicit timer mode) once the migration is complete.
Performance & Scalability Improvements
- Matching task list manager lookup is now O(1) by name (#7733).
- Rate-limiter token waste and CPU spin under low task rates are fixed, significantly reducing idle CPU usage (#7977).
- History queue processors are notified before shard lock release to reduce task processing latency (#8130).
- Mutable state corruption is now detected and repaired automatically (#7850).
- Worker redundancy is improved (#7840).
- Batcher RPS and concurrency can now be tuned at runtime via signals (#7824).
Notable Bug Fixes
Matching
- Fix two separate getTasksPump deadlocks (#7855, #7930)
- Fix rate-limiter token waste and CPU spin under low task rates (#7977)
- Reclassify query-task-not-found as EntityNotExistsError (#7938)
- Add timeout to notifyPartitionConfig during startup (#7833)
- Limit context deadline when calling RecordTaskStarted (#7792)
Persistence / History
- Use next_event_id column as source of truth when reading workflow execution from Cassandra (#7738)
- Trim workflow timer tasks on workflow close and deletion (#7941)
- Fix deadlock in workflow reset signal reapplication (#7913)
- Convert replicator panics into errors rather than crashing the process (#8063)
Observability Enhancements
- Add shard-id to persistence error logs (#8017)
- Inject Start/Stop latency measurements in canary shards (#7985)
- Emit correct is_retry tag on retried persistence and client calls (#8049)
- Route PutReplicationTaskToDLQ metrics by domain (#8053)
- Normalize cadence_authorization_latency labels for Prometheus (#8215)
- Normalize cache metric labels for Prometheus (#8120)
CLI Enhancements
- Add concurrency_limit support in CLI (#8028)
- Add operational dynamic config CLI commands: get, update, delete (#8101)
- Add --latest_time support for DecisionCompletedTime reset type (#8151)
Code Quality & Maintenance
- Replace x/exp/maps with standard maps package (#8098)
- Generalize LimiterFactory/Collection with type parameters (#7841)
- Convert task schedulers to use Go generics (#7813)
- Reduce log noise (#7915, #7910, #7903, #8071, #7901, #7991)
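As an illustration of what generalizing a factory-backed collection with type parameters can look like, here is a small Go sketch. It assumes a lazily populated, mutex-guarded cache; `Collection`, `NewCollection`, and `For` are hypothetical names and do not reflect the actual signatures in #7841 or #7813.

```go
package main

import (
	"fmt"
	"sync"
)

// Collection lazily creates and caches one value of type V per key K.
// With type parameters, the same code can hold rate limiters, schedulers,
// or any other per-key object without interface{} casts.
type Collection[K comparable, V any] struct {
	mu      sync.Mutex
	factory func(K) V
	items   map[K]V
}

// NewCollection builds an empty collection around the given factory.
func NewCollection[K comparable, V any](factory func(K) V) *Collection[K, V] {
	return &Collection[K, V]{factory: factory, items: make(map[K]V)}
}

// For returns the cached value for key, creating it on first use.
func (c *Collection[K, V]) For(key K) V {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.items[key]; ok {
		return v
	}
	v := c.factory(key)
	c.items[key] = v
	return v
}

func main() {
	// Hypothetical per-domain RPS limits keyed by domain name.
	limits := NewCollection(func(domain string) int { return 100 * len(domain) })
	fmt.Println(limits.For("payments")) // prints 800
	fmt.Println(limits.For("payments")) // prints 800 (cached)
}
```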
Infrastructure
- Add domain audit log support for MySQL and SQLite
- Add support for region-specific S3 access points in archival (#8015)
- Update Docker image to support custom config file (#8155)
- Change the default Kafka protocol version from 0.10.2.0 to 2.1.0 (#7890). This fixes a startup crash with Sarama ≥ v1.45 against modern Kafka clusters
- Reject requests that update both replication config and domain data simultaneously (#8252)
Full Changelog: v1.4.0...v1.4.1