Support out of order samples ingestion #4964

Merged
1 commit merged into cortexproject:master from support-ooo on Mar 23, 2023

Conversation

@yeya24 (Collaborator) commented Nov 13, 2022

Signed-off-by: Ben Ye benye@amazon.com

What this PR does:

  1. Support OOO samples ingestion. This is a new feature from Prometheus v2.39.x, and we just need to enable it via a flag.
  2. Add out_of_order_time_window to limits so that each tenant can configure their own OOO time window.
  3. Add out_of_order_cap_max to the ingester configuration. This is not a per-tenant configuration. (A configuration sketch follows this list.)
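
For illustration only, a minimal YAML sketch combining the two new settings could look like the following (values are arbitrary examples, not recommendations, and the section layout is assumed from the CLI flag -blocks-storage.tsdb.out-of-order-cap-max and the limits_config field):

# hypothetical example configuration, not taken from this PR
limits:
  # per-tenant limit added by this PR; 0 keeps OOO ingestion disabled
  out_of_order_time_window: 10m
blocks_storage:
  tsdb:
    # global (not per-tenant) cap on out-of-order samples per chunk
    out_of_order_cap_max: 32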

Which issue(s) this PR fixes:
Fixes #4895

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@yeya24 (Collaborator, Author) commented Nov 13, 2022

Several open questions:

  1. Should we make out_of_order_cap_max also per tenant?
  2. Should we still keep reject_old_samples and reject_old_samples_max_age? Distributor side can drop samples early without affecting ingesters.
  3. Based on https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb, the OOO time window is reloadable, shall we support that?

We can follow up in subsequent PRs if needed.

Update: we decided to keep the flags from question 2, and the reloadable behaviour for question 3 is already implemented.
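
As a sketch of the per-tenant and reloadable aspects, the OOO window could be set per tenant through the usual runtime overrides file, roughly like this (tenant names are made up, and it is assumed out_of_order_time_window behaves like any other per-tenant limit in the overrides):

overrides:
  tenant-a:
    out_of_order_time_window: 1h   # tenant-a tolerates samples up to 1h out of order
  tenant-b:
    out_of_order_time_window: 0    # OOO ingestion stays disabled for tenant-b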

# [EXPERIMENTAL] Configures the maximum capacity for out-of-order chunks (in
# samples). If set to <=0, default value 32 is assumed.
# CLI flag: -blocks-storage.tsdb.out-of-order-cap-max
[out_of_order_cap_max: <int> | default = 32]
@yeya24 (Collaborator, Author) Nov 14, 2022

It feels weird to me. Shall we extract the TSDB configs from the blocks storage section? Having them appear under the querier and store gateway is strange.

Contributor:
I think the querier component doesn't care about this field. Is this due to the automatically generated config documentation?

@yeya24 (Collaborator, Author):

Yeah, I think the querier and store gateway use some configs from the blocks storage section, but the TSDB configs shouldn't be relevant to them.

Contributor:

If out-of-order ingestion is enabled, users should not have to care about this configuration (the configuration burden is already heavy enough), unless they really need a maximum capacity limit (otherwise the default could just be the maximum).

@yeya24 (Collaborator, Author):

If you are talking about the OOO capacity, I don't think this change introduces an additional burden for users.

This is a global value, not per tenant.

@t00350320 (Contributor) commented Nov 14, 2022

Several open questions:

  1. Should we make out_of_order_cap_max also per tenant?
  2. Should we still keep reject_old_samples and reject_old_samples_max_age? Distributor side can drop samples early without affecting ingesters.
  3. Based on https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tsdb, the OOO time window is reloadable, shall we support that?

We can follow up in subsequent PRs if needed.

1. OOO will affect the ingester's performance; making out_of_order_cap_max per tenant seems to make this more complicated.
2. The distributor-side reject_old_samples and reject_old_samples_max_age are also per-tenant. One customer may definitely confirm that they don't care about data from 6 months ago, so the distributor can drop those samples early. I think OOO just improves the TSDB's ability.
So I think the old configuration still makes sense.
3. It seems nice to have, haha.

@yeya24 (Collaborator, Author) commented Nov 14, 2022

The distributor-side reject_old_samples and reject_old_samples_max_age are also per-tenant. One customer may definitely confirm that they don't care about data from 6 months ago, so the distributor can drop those samples early. I think OOO just improves the TSDB's ability.

The OOO time window is also per tenant, so using this configuration is almost the same. The only difference is that the distributor can drop samples early to avoid affecting the ingesters.

@wgliang (Contributor) commented Nov 16, 2022

3. OOO time window is reloadable

If there is no additional burden, there is absolutely no need to add more restrictions, right?

@@ -878,4 +878,9 @@ blocks_storage:
# will be stored. 0 or less means disabled.
# CLI flag: -blocks-storage.tsdb.max-exemplars
[max_exemplars: <int> | default = 0]

# [EXPERIMENTAL] Configures the maximum capacity for out-of-order chunks (in
Member:

I suggest the following reword:

Configures the maximum number of samples that can be out-of-order.  See [some link] on how out of order works.  

I think we somehow need to link to a document that talks about https://docs.google.com/document/d/1Kppm7qL9C-BJB1j6yb6-9ObG3AbdZnFUBYPNNWwDBYM/edit and prometheus/prometheus#11075 because of the experimental feature.

I am thinking we should have a doc in cortexmetrics.io about OOO support talking about some operational implications. WDYT? I am more than happy to work on the documentation or OOO support.

The documentation is non-blocking for this PR; I'm just bringing this up as a point of discussion.

Member:

Also, regarding "if set to <=0, default value 32 is assumed": is this what Prometheus does? My preference is not to do this. I would prefer my application to fail to start if I configure some nonsense value, because it's more explicit.

@yeya24 (Collaborator, Author):

Configures the maximum number of samples that can be out-of-order.

This looks good. I think I will also mention that this is per chunk.
Yeah I love the idea of having a doc about this feature for sure.
And I added a validation to make sure it is > 0.

@@ -2587,14 +2587,6 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
# CLI flag: -validation.max-metadata-length
[max_metadata_length: <int> | default = 1024]

# Reject old samples.
# CLI flag: -validation.reject-old-samples
[reject_old_samples: <boolean> | default = false]
Member:

Per https://cortexmetrics.io/docs/configuration/v1guarantees/#flags-config-and-minor-version-upgrades we might need to keep the reject_old-* config for 2 minor releases.

@yeya24 (Collaborator, Author):

I see. I will add them back.

Member:

I am wondering about the behaviour of the co-existence of reject_old_samples and out_of_order_time_window.

By default reject_old_samples=false but out_of_order_time_window=0: one config says accept old (out-of-order) samples, while the other says out-of-order samples are disabled. What should Cortex do?

Generally speaking, what happens if the two configs seemingly conflict with each other? It may be worth documenting the behaviour in v1-guarantees.md.

@yeya24 (Collaborator, Author) Nov 23, 2022

For the default values: if a sample is too old, it will be rejected by the TSDB anyway when OOO is not enabled. This is the same behavior as before we had OOO support, so we don't change behavior here.

Yeah, I can document the behavior. Basically, reject_old_samples happens only on the distributor side, while the OOO window is entirely an ingester thing. So if users want OOO to work, they need to adjust their reject_old_samples configs to allow old samples to reach the ingesters.
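
To make the interplay concrete, a hypothetical per-tenant setup where the distributor-side limit does not defeat the ingester-side OOO window might look like this (values are purely illustrative):

limits:
  reject_old_samples: true
  reject_old_samples_max_age: 2h   # distributor drops anything older than 2h
  out_of_order_time_window: 1h     # ingester accepts OOO samples within the last 1h
# reject_old_samples_max_age should be at least as large as out_of_order_time_window,
# otherwise the distributor rejects samples before the OOO window can ever apply.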

@yeya24 (Collaborator, Author):

Actually, instead of adding this to v1-guarantees.md, I feel it is better to document it in the out-of-order samples operational doc you mentioned. WDYT? v1-guarantees.md seems to just list the flags we have, without anything really detailed about the settings and usage.

@alanprot (Member) Dec 9, 2022

I'm not sure if we should deprecate this flag...

Could I configure it so that I accept out-of-order samples for 5 minutes, BUT for in-order samples accept anything up to 2 hours old? I know we want to simplify the config, but these seem like different things, no?

@yeya24 (Collaborator, Author) Dec 9, 2022

Could I configure it so that I accept out-of-order samples for 5 minutes, BUT for in-order samples accept anything up to 2 hours old?

Isn't that doable with only the OOO window? In-order samples are always accepted unless they fall outside the head time range.

But I agree they are still slightly different things, since we can drop samples early on the distributors rather than waiting until the ingesters.

@yeya24 (Collaborator, Author):

After discussing with @alanprot, we decided not to deprecate the two flags, as they are different from the OOO settings.

@alvinlin123 (Member) left a comment

We should update https://cortexmetrics.io/docs/configuration/v1guarantees/#experimental-features with the experimental flags as part of this PR :)

@yeya24 (Collaborator, Author) commented Nov 23, 2022

PTAL. @alvinlin123 @songjiayang @alanprot

CHANGELOG.md (outdated review thread, resolved)
@alvinlin123 (Member) left a comment

A few comments.

I also added #4990 so we remember to remove the deprecated flags in a later release.

pkg/distributor/distributor_test.go (outdated review thread, resolved)
pkg/util/validation/limits.go (outdated review thread, resolved)
@friedrichg (Member) left a comment

Can out-of-order samples create bad results if a query is cached?
After a few tests, we need a page to clarify expectations for users.

pkg/util/validation/limits.go (outdated review thread, resolved)
@yeya24 (Collaborator, Author) commented Nov 26, 2022

Can out-of-order samples create bad results if a query is cached?
After a few tests, we need a page to clarify expectations for users.

This is a follow-up in the query frontend to allow users to specify a non-cacheable time window.

@yeya24 force-pushed the support-ooo branch 2 times, most recently from 5ad76ea to a9cb490 on November 29, 2022 21:54
@wgliang (Contributor) commented Dec 9, 2022

When will this PR be merged? Very much looking forward to this new feature.

@alanprot (Member)

LGTM

@friedrichg (Member) left a comment

Thanks! It's so clean and understandable.

Just one tiny nit.

pkg/storage/tsdb/config.go (review thread, resolved)
@yeya24 yeya24 enabled auto-merge (squash) December 28, 2022 17:19
@wgliang (Contributor) commented Feb 2, 2023

false, // No need to upload compacted blocks. Cortex compactor takes care of that.

It may be necessary to consider how repeated uploads of compacted blocks to S3 are handled. After enabling this, we found that blocks with a level greater than 1 were being uploaded rather than ignored.

https://github.com/thanos-io/thanos/blob/84959bcfd923ea06a9fe931756f8445ac84c4ef8/pkg/shipper/shipper.go#L276

Signed-off-by: Ben Ye <benye@amazon.com>
@alvinlin123 merged commit 909a090 into cortexproject:master on Mar 23, 2023
14 checks passed
@yeya24 deleted the support-ooo branch on October 27, 2023 20:45
Linked issue that may be closed by this PR: Support for "Out-of-order" (#4895)
8 participants