This release includes a fix to support the WAL configuration via YAML config file (the
0.6.0 release supports WAL configuration only via CLI flags).
- [BUGFIX] Fixed parsing of the WAL configuration when specified in the YAML config file. #2071
Thanks to all contributors!
This release is made up of 109 contributions from 30 authors, and includes many features and improvements. Some highlights:
- Experimental Write-Ahead-Log (WAL) in ingesters for more data reliability against ingester crashes by @codesome
- On-the-fly migration path between two key-value (KV) stores using the `multi` store by @pstibrany
- Global ingestion rate limit by @pracucci
- Improvements to the experimental TSDB blocks storage by @thorfour @pstibrany @pracucci
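The migration path in the highlights relies on the "multi" KV store, which these notes describe as using a primary store for all reads and writes and a secondary store that only receives writes. A minimal Go sketch of that idea follows. The `KV` interface, `memKV`, `MultiKV`, and `Swap` are hypothetical names for illustration only and are not Cortex's actual client API; in Cortex the primary/secondary choice is changed through the runtime config file rather than a method call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// KV is a hypothetical, simplified key-value interface (not Cortex's real one).
type KV interface {
	Get(key string) (string, error)
	Put(key, value string) error
}

// memKV is an in-memory KV used only to make the sketch runnable.
type memKV struct {
	mu   sync.Mutex
	data map[string]string
}

func newMemKV() *memKV { return &memKV{data: map[string]string{}} }

func (m *memKV) Get(key string) (string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.data[key]
	if !ok {
		return "", errors.New("key not found: " + key)
	}
	return v, nil
}

func (m *memKV) Put(key, value string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
	return nil
}

// MultiKV serves reads from the primary store and sends every write to both
// the primary and the secondary store.
type MultiKV struct {
	mu        sync.RWMutex
	primary   KV
	secondary KV
}

func (m *MultiKV) Get(key string) (string, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.primary.Get(key)
}

func (m *MultiKV) Put(key, value string) error {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if err := m.primary.Put(key, value); err != nil {
		return err
	}
	// Assumption of this sketch: a failed mirror write is returned to the caller.
	return m.secondary.Put(key, value)
}

// Swap exchanges primary and secondary, for example once the new store holds
// the mirrored data.
func (m *MultiKV) Swap() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.primary, m.secondary = m.secondary, m.primary
}

func main() {
	oldStore, newStore := newMemKV(), newMemKV()
	kv := &MultiKV{primary: oldStore, secondary: newStore}
	_ = kv.Put("ring", "tokens-v1") // written to both stores
	kv.Swap()                       // reads are now served by the new store
	v, err := kv.Get("ring")
	fmt.Println(v, err)
}
```

Because writes were mirrored before the swap, the read after `Swap` still finds the key, which is what makes an on-the-fly migration possible.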
Note that the ruler flags need to be changed in this upgrade. You're moving from a single node ruler to something that might need to be sharded.
Further, if you're using the configs service, we've upgraded the migration library and this requires some manual intervention. See full instructions below to upgrade your PostgreSQL.
- [CHANGE] The frontend component now does not cache results if it finds a `Cache-Control` header and if one of its values is
- [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
`ruler.configs.client-timeout` in order to match
- `-ruler.group-timeout` has been removed.
- `-ruler.num-workers` has been removed.
- `-ruler.rule-path` has been added to specify where the Prometheus rule manager will sync rule files.
- `-ruler.storage.type` has been added to specify the rule store backend type, currently only the configdb.
- `-ruler.poll-interval` has been added to specify the interval in which to poll new rule groups.
- `-ruler.evaluation-interval` default value has changed to `1m` to match the default evaluation interval in Prometheus.
- Ruler sharding requires a ring which can be configured via the ring flags prefixed by
- [CHANGE] Use relative links from /ring page to make it work when used behind a reverse proxy. #1896
- [CHANGE] Deprecated
- [CHANGE] Ingesters now write only normalised tokens to the ring, although they can still read denormalised tokens used by other ingesters. `-ingester.normalise-tokens` is now deprecated, and ignored. If you want to switch back to using denormalised tokens, you need to downgrade to Cortex 0.4.0. Previous versions don't handle claiming tokens from normalised ingesters correctly. #1809
- [CHANGE] Overrides mechanism has been renamed to "runtime config", and is now separate from limits. Runtime config is simply a file that is reloaded by Cortex every couple of seconds. Limits and now also multi KV use this mechanism. New arguments were introduced: `-runtime-config.file` (defaults to empty) and `-runtime-config.reload-period` (defaults to 10 seconds), which replace previously used `-limits.per-user-override-period` options. Old options are still used if `-runtime-config.file` is not specified. This change is also reflected in YAML configuration, where old `limits.per_tenant_override_period` fields are replaced with
- [CHANGE] Cortex now rejects data with duplicate labels. Previously, such data was accepted, with duplicate labels removed with only one value left. #1964
- [CHANGE] Changed the default value for
`ha-tracker/` in order to not clash with other keys (i.e. ring) stored in the same key-value store. #1940
- [FEATURE] Experimental: Write-Ahead-Log added in ingesters for more data reliability against ingester crashes. #1103
- `--ingester.wal-enabled`: Setting this to `true` enables writing to WAL during ingestion.
- `--ingester.wal-dir`: Directory where the WAL data should be stored and/or recovered from.
- `--ingester.checkpoint-enabled`: Set this to `true` to enable checkpointing of in-memory chunks to disk.
- `--ingester.checkpoint-duration`: This is the interval at which checkpoints should be created.
- `--ingester.recover-from-wal`: Set this to `true` to recover data from an existing WAL.
- For more information, please check out the "Ingesters with WAL" guide.
- [FEATURE] The distributor can now drop labels from samples (similar to the removal of the replica label for HA ingestion) per user via the
- [FEATURE] Added flag `debug.mutex-profile-fraction` to enable mutex profiling. #1969
- [FEATURE] Added `global` ingestion rate limiter strategy. Deprecated
- [FEATURE] Added support for Microsoft Azure blob storage to be used for storing chunk data. #1913
- [FEATURE] Added readiness probe endpoint `/ready` to queriers. #1934
- [FEATURE] Added "multi" KV store that can interact with two other KV stores: a primary one for all reads and writes, and a secondary one, which only receives writes. The primary/secondary store can be changed at runtime via the runtime-config mechanism (previously "overrides"). #1749
- [FEATURE] Added support to store ring tokens to a file and read it back on startup, instead of generating/fetching the tokens to/from the ring. This feature can be enabled with the flag
- [FEATURE] Experimental TSDB: Added `/series` API endpoint support with TSDB blocks storage. #1830
- [FEATURE] Experimental TSDB: Added TSDB blocks `compactor` component, which iterates over users' blocks stored in the bucket and compacts them according to the configured block ranges. #1942
- [ENHANCEMENT] Metric `cortex_ingester_flush_reasons` gets a new
`-ingester.spread-flushes` option is enabled. #1978
- [ENHANCEMENT] Added `enable_tls` options to Redis cache configuration. Enables usage of Microsoft Azure Cache for Redis service. #1923
- [ENHANCEMENT] Upgraded Kubernetes API version for deployments from
- [ENHANCEMENT] Experimental TSDB: Open existing TSDB on startup to prevent ingester from becoming ready before it can accept writes. The max concurrency is set via
- [ENHANCEMENT] Experimental TSDB: Querier now exports aggregate metrics from Thanos bucket store and in memory index cache (many metrics to list, but all have
- [ENHANCEMENT] Experimental TSDB: Improved multi-tenant bucket store. #1991
- Allowed to configure the blocks sync interval via `-experimental.tsdb.bucket-store.sync-interval` (0 disables the sync)
- Limited the number of tenants concurrently synched by
`cortex_querier_blocks_sync_seconds` metric for the initial sync too
- [BUGFIX] Fixed unnecessary CAS operations done by the HA tracker when the jitter is enabled. #1861
- [BUGFIX] Fixed ingesters getting stuck in a LEAVING state after coming up from an ungraceful exit. #1921
- [BUGFIX] Reduce memory usage when ingester Push() errors. #1922
- [BUGFIX] Table Manager: Fixed calculation of expected tables and creation of tables from next active schema considering grace period. #1976
- [BUGFIX] Experimental TSDB: Fixed ingesters consistency during hand-over when using experimental TSDB blocks storage. #1854 #1818
- [BUGFIX] Experimental TSDB: Fixed metrics when using experimental TSDB blocks storage. #1981 #1982 #1990 #1983
- [BUGFIX] Experimental memberlist: Use the advertised address when sending packets to other peers of the Gossip memberlist. #1857
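One of the changes listed for this release is that data with duplicate labels is now rejected instead of being silently deduplicated. The following Go sketch shows what such a check could look like. The `Label` type and `validateNoDuplicates` function are hypothetical and are not taken from the Cortex code base.

```go
package main

import (
	"fmt"
	"sort"
)

// Label is a hypothetical name/value pair used only for this sketch.
type Label struct {
	Name, Value string
}

// validateNoDuplicates returns an error if any label name occurs more than once.
func validateNoDuplicates(labels []Label) error {
	// Sort a copy by name so that duplicate names end up next to each other.
	sorted := append([]Label(nil), labels...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	for i := 1; i < len(sorted); i++ {
		if sorted[i].Name == sorted[i-1].Name {
			return fmt.Errorf("duplicate label name: %q", sorted[i].Name)
		}
	}
	return nil
}

func main() {
	series := []Label{{"__name__", "up"}, {"job", "a"}, {"job", "b"}}
	// The second "job" label makes this series invalid.
	fmt.Println(validateNoDuplicates(series))
}
```

Rejecting the series outright, rather than keeping one of the values, avoids guessing which value the sender intended.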
Upgrading PostgreSQL (if you're using configs service)
- Install the migrate package cli tool: https://github.com/golang-migrate/migrate/tree/master/cmd/migrate#installation
- Drop the `schema_migrations` table: `DROP TABLE schema_migrations;`.
- Run the migrate command:
migrate -path <absolute_path_to_cortex>/cmd/cortex/migrations -database postgres://localhost:5432/database force 2
The `cortex_prometheus_rule_group_last_evaluation_timestamp_seconds` metric, tracked by the ruler, is not unregistered for rule groups that are no longer used. This issue will be fixed in the next Cortex release (see #2033).
Write-Ahead-Log (WAL) does not have automatic repair of corrupt checkpoints or WAL segments, which can occur if the ingester crashes abruptly or the underlying disk corrupts data. Currently the only way to resolve this is to manually delete the affected checkpoint and/or WAL segments. Automatic repair will be added in a future release.
- [CHANGE] The frontend component has been refactored to be easier to re-use. When upgrading the frontend, cache entries will be discarded and re-created with the new protobuf schema. #1734
- [CHANGE] Removed direct DB/API access from the ruler. `-ruler.configs.url` has now been deprecated. #1579
- [CHANGE] Removed `Delta` encoding. Any old chunks with `Delta` encoding cannot be read anymore. If `ingester.chunk-encoding` is set to `Delta` the ingester will fail to start. #1706
- [CHANGE] Setting `-ingester.max-transfer-retries` to 0 now disables hand-over when the ingester is shutting down. Previously, zero meant an infinite number of attempts. #1771
`dynamo` has been removed as a valid storage name to make it consistent for all components.
`aws-dynamo` remain as valid storage names.
- [CHANGE/FEATURE] The frontend split and cache intervals can now be configured using the respective flag
- If `--querier.split-queries-by-interval` is not provided, request splitting is disabled by default.
- `--querier.split-queries-by-day` is still accepted for backward compatibility but has been deprecated. You should now use `--querier.split-queries-by-interval`. We recommend using a multiple of 24 hours.
- [FEATURE] Global limit on the max series per user and metric #1760
`-distributor.shard-by-all-labels` set for the ingesters too
- [FEATURE] Flush chunks with stale markers early with
- [FEATURE] EXPERIMENTAL: Added new KV Store backend based on memberlist library. Components can gossip about tokens and ingester states, instead of using Consul or Etcd. #1721
- [FEATURE] EXPERIMENTAL: Use TSDB in the ingesters & flush blocks to S3/GCS ala Thanos. This will let us use an Object Store more efficiently and reduce costs. #1695
- [FEATURE] Allow Query Frontend to log slow queries with
- [FEATURE] Add HTTP handler to trigger ingester flush & shutdown - used when running as a stateful set with the WAL enabled. #1746
- [ENHANCEMENT] Reduce memory allocations in the write path. #1706
- [ENHANCEMENT] Consul client now follows recommended practices for blocking queries wrt returned Index value. #1708
- [ENHANCEMENT] Consul client can optionally rate-limit itself during Watch (used e.g. by ring watchers) and WatchPrefix (used by HA feature) operations. Rate limiting is disabled by default. New flags added:
- [ENHANCEMENT] Added jitter to HA deduping heartbeats, configure using
- [ENHANCEMENT] Add ability to flush chunks with stale markers early. #1759
- [BUGFIX] Stop reporting successful actions as 500 errors in KV store metrics. #1798
- [BUGFIX] Fix bug where duplicate labels can be returned through metadata APIs. #1790
- [BUGFIX] Fix reading of old, v3 chunk data. #1779
- [BUGFIX] Now support IAM roles in service accounts in AWS EKS. #1803
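The split-by-interval change in this release lets the frontend cut a query into pieces of a configurable length. As a rough sketch of the idea, the Go code below splits a time range at multiples of an interval and leaves the range untouched when no interval is given. The `timeRange` type, the `splitByInterval` helper, the millisecond units, and the alignment to interval boundaries are all assumptions made for illustration; the actual frontend logic may differ.

```go
package main

import "fmt"

// timeRange is a hypothetical half-open query range [Start, End) in Unix
// milliseconds. Non-negative timestamps are assumed.
type timeRange struct {
	Start, End int64
}

// splitByInterval cuts r at multiples of stepMs. A non-positive step disables
// splitting and returns the range unchanged.
func splitByInterval(r timeRange, stepMs int64) []timeRange {
	if stepMs <= 0 || r.End <= r.Start {
		return []timeRange{r}
	}
	var out []timeRange
	for start := r.Start; start < r.End; {
		// Next interval boundary strictly after start.
		next := (start/stepMs + 1) * stepMs
		if next > r.End {
			next = r.End
		}
		out = append(out, timeRange{Start: start, End: next})
		start = next
	}
	return out
}

func main() {
	const hour = int64(3600 * 1000)
	// A 36h query that starts 6h into a day is cut at the 24h boundary.
	r := timeRange{Start: 6 * hour, End: 42 * hour}
	for _, p := range splitByInterval(r, 24*hour) {
		fmt.Printf("[%dh, %dh)\n", p.Start/hour, p.End/hour)
	}
}
```

With a 24-hour step, every piece except possibly the first and last covers exactly one whole day in this sketch.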
In this release we updated the following dependencies:
- gRPC v1.25.0 (resulted in a drop of 30% CPU usage when compression is on)
- jaeger-client v2.20.0
- aws-sdk-go to v1.25.22
This release adds support for Redis as an alternative to Memcached, and also includes many optimisations which reduce CPU and memory usage.
- [CHANGE] Gauge metrics were renamed to drop the
- In Alertmanager,
- In Ruler,
- [CHANGE] The "auto Slack root" feature was removed, including the `--alertmanager.configs.auto-slack-root` CLI flag. #1597
- [CHANGE] In table-manager, default DynamoDB capacity was reduced from 3,000 units to 1,000 units. We recommend you do not run with the defaults: find out what figures are needed for your environment and set that via
- [FEATURE] Add Redis support for caching #1612
- [FEATURE] Allow spreading chunk writes across multiple S3 buckets #1625
- [ENHANCEMENT] Upgraded Prometheus to 2.12.0 and Alertmanager to 0.19.0. #1597
- [ENHANCEMENT] Cortex is now built with Go 1.13 #1675, #1676, #1679
- [ENHANCEMENT] Many optimisations, mostly impacting ingester and querier: #1574, #1624, #1638, #1644, #1649, #1654, #1702
Full list of changes: v0.2.0...v0.3.0
- [BUG] The Alertmanager UI is non-functional in this release
This release has several exciting features, the most notable of them being setting `-ingester.spread-flushes` to potentially reduce your storage space by up to 50%.
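The feature entry for `-ingester.spread-flushes` in this release says each timeseries is flushed at a specific point in the max-chunk-age cycle, so replicas of a chunk are very likely to hold the same contents. The Go sketch below shows one way such a deterministic flush point could be derived. The FNV hash, the `flushOffset` and `nextFlush` helpers, and the use of seconds are assumptions for illustration, not the ingester's actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flushOffset derives an offset within the max-chunk-age cycle from the series
// identity alone, so every replica computes the same value.
func flushOffset(seriesID string, maxChunkAgeSec int64) int64 {
	h := fnv.New64a()
	h.Write([]byte(seriesID))
	return int64(h.Sum64() % uint64(maxChunkAgeSec))
}

// nextFlush returns the first flush time strictly after nowSec for the series.
// Flush times are exactly the instants t with t % maxChunkAgeSec == offset.
func nextFlush(seriesID string, nowSec, maxChunkAgeSec int64) int64 {
	off := flushOffset(seriesID, maxChunkAgeSec)
	if nowSec < off {
		return off
	}
	// Most recent flush point at or before now, then one full cycle later.
	last := (nowSec-off)/maxChunkAgeSec*maxChunkAgeSec + off
	return last + maxChunkAgeSec
}

func main() {
	const maxChunkAge = int64(12 * 3600) // hypothetical 12h max chunk age
	id := `up{job="api"}`
	// Two replicas that look at the clock a few seconds apart will usually
	// agree on the next flush time for the same series.
	fmt.Println(nextFlush(id, 1600000000, maxChunkAge))
	fmt.Println(nextFlush(id, 1600000007, maxChunkAge))
}
```

Because the flush point depends only on the series and the wall clock, not on when each ingester started, replicas tend to cut chunks at the same boundaries.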
- [CHANGE] Flags changed due to changes upstream in Prometheus Alertmanager #929:
`alertmanager.mesh.peer.service` can be replaced by
- [CHANGE] --claim-on-rollout flag deprecated; feature is now always on #1566
- [CHANGE] Retention period must now be a multiple of periodic table duration #1564
- [CHANGE] The value for the name label for the chunks memcache in all `cortex_cache_` metrics is now `chunksmemcache` (before it was
- [FEATURE] Makes the ingester flush each timeseries at a specific point in the max-chunk-age cycle with `-ingester.spread-flushes`. This means multiple replicas of a chunk are very likely to contain the same contents, which cuts chunk storage space by up to 66%. #1578
- [FEATURE] Make minimum number of chunk samples configurable per user #1620
- [FEATURE] Honor HTTPS for custom S3 URLs #1603
- [FEATURE] You can now point the query-frontend at a normal Prometheus for parallelisation and caching #1441
- [FEATURE] You can now specify `http_config` on alert receivers #929
- [FEATURE] Add option to use jump hashing to load balance requests to memcached #1554
- [FEATURE] Add status page for HA tracker to distributors #1546
- [FEATURE] The distributor ring page is now easier to read with alternate rows grayed out #1621
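The jump hashing option listed in this release refers to a way of load balancing requests across memcached servers. For readers unfamiliar with it, here is a short Go sketch of the published jump consistent hash algorithm (Lamping and Veach) applied to picking a server. The `pickServer` helper, the FNV key hashing, and the server names are hypothetical; how Cortex derives the key and orders its servers is not shown in these notes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jumpHash maps key to a bucket in [0, numBuckets) using the jump consistent
// hash algorithm. Values of numBuckets below 1 are treated as 1.
func jumpHash(key uint64, numBuckets int) int {
	if numBuckets < 1 {
		numBuckets = 1
	}
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}

// pickServer hashes the cache key and selects one server. The servers slice
// must be non-empty and keep a stable order.
func pickServer(cacheKey string, servers []string) string {
	h := fnv.New64a()
	h.Write([]byte(cacheKey))
	return servers[jumpHash(h.Sum64(), len(servers))]
}

func main() {
	servers := []string{"memcached-0", "memcached-1", "memcached-2"}
	fmt.Println(pickServer("chunk/abc123", servers))
}
```

The useful property is that when a server is appended to the list, a key either stays on its current server or moves to the new one, so most cached entries remain reachable.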