
Implement max_fetched_data_bytes_per_query limit #4854

Merged
merged 3 commits on Sep 29, 2022

Conversation

Contributor
@harry671003 commented Sep 7, 2022

Signed-off-by: 🌲 Harry 🌊 John 🏔 johrry@amazon.com

What this PR does:
A query fetches both series and samples. The size of samples is already limited by the max-fetched-chunk-bytes-per-query limit, but the total size of the series labels fetched is not limited at all. This can OOM-kill queriers.
This PR adds a new limit, max-fetched-data-bytes-per-query, and deprecates max-fetched-chunk-bytes-per-query.
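For illustration, a tenant limits configuration using the new limit could look like the snippet below, modeled on the test configurations later in this PR (the 10 GB value is only an example; the deprecated max_fetched_chunk_bytes_per_query would no longer be needed):

  max_fetched_data_bytes_per_query: 10000000000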

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

# ingester and storage. This limit is enforced in the querier and ruler only
# when running Cortex with blocks storage. 0 to disable.
# CLI flag: -querier.max-fetched-label-bytes-per-query
[max_fetched_label_bytes_per_query: <int> | default = 0]
Member

I am wondering if it would be better to deprecate max_fetched_chunk_bytes_per_query and have a new configuration that is simply called max_fetched_data_bytes_per_query, which limits the combination of labels and chunks.

The reason for this suggestion is that users then don't have to think about what counts as a "label" vs. what counts as a "chunk"; I think this is simpler for users.

What do you think @harry671003 @alanprot @friedrichg?

Contributor Author
@harry671003 commented Sep 9, 2022

I was talking to @alanprot, who wasn't convinced we really need this limit. But in the stress tests I saw queriers get OOM-killed when fetching 2 million series with 10 KB label size, so we still need a way to control the total bytes fetched during a query.
It makes sense to combine them into a single limit.

It might also be possible to not enable other limits like max_fetched_chunks, max_fetched_series, etc. if the max_fetched_data_bytes_per_query limit is enabled.

Contributor Author

gently ping @alanprot @alvinlin123

@harry671003 changed the title from "Implement max_fetched_label_bytes_per_query limit" to "Implement max_fetched_data_bytes_per_query limit" on Sep 12, 2022
@harry671003
Contributor Author

Test 1: without enabling the max_fetched_data_bytes_per_query limit

Limits:

  max_fetched_chunks_per_query: 20000000
  max_fetched_series_per_query: 12000000
  max_fetched_chunk_bytes_per_query: 750000000

Result

➜  ~ awscurl --region us-west-2 --service aps -X POST "https://<redacted>/api/v1/query" -d 'query=count({__name__=~"metric_0.*"})&start=1663086375&end=1663086399' --header 'Content-Type: application/x-www-form-urlencoded'
rpc error: code = Canceled desc = context canceled

Traceback (most recent call last):
  File "/usr/local/bin/awscurl", line 33, in <module>
    sys.exit(load_entry_point('awscurl==0.26', 'console_scripts', 'awscurl')())
  File "/usr/local/Cellar/awscurl/0.26_1/libexec/lib/python3.10/site-packages/awscurl/awscurl.py", line 521, in main
    inner_main(sys.argv[1:])
  File "/usr/local/Cellar/awscurl/0.26_1/libexec/lib/python3.10/site-packages/awscurl/awscurl.py", line 515, in inner_main
    response.raise_for_status()
  File "/usr/local/Cellar/awscurl/0.26_1/libexec/lib/python3.10/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://<redacted>/api/v1/query

Test 2: With max_fetched_data_bytes_per_query set to 10 GB

Limits:

  max_fetched_chunks_per_query: 20000000
  max_fetched_series_per_query: 12000000
  max_fetched_chunk_bytes_per_query: 750000000
  max_fetched_data_bytes_per_query: 10000000000

Result

➜  ~ awscurl --region us-west-2 --service aps -X POST "https://<redacted>/api/v1/query" -d 'query=count({__name__=~"metric_0.*"})&start=1663086375&end=1663086399' --header 'Content-Type: application/x-www-form-urlencoded'
{"status":"error","errorType":"execution","error":"expanding series: the query hit the aggregated label size limit (limit: 10000000000 bytes)"}
Traceback (most recent call last):
  File "/usr/local/bin/awscurl", line 33, in <module>
    sys.exit(load_entry_point('awscurl==0.26', 'console_scripts', 'awscurl')())
  File "/usr/local/Cellar/awscurl/0.26_1/libexec/lib/python3.10/site-packages/awscurl/awscurl.py", line 521, in main
    inner_main(sys.argv[1:])
  File "/usr/local/Cellar/awscurl/0.26_1/libexec/lib/python3.10/site-packages/awscurl/awscurl.py", line 515, in inner_main
    response.raise_for_status()
  File "/usr/local/Cellar/awscurl/0.26_1/libexec/lib/python3.10/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://<redacted>/api/v1/query

The above tests verify that limiting the combined data bytes fetched per query is necessary to prevent queriers from getting OOM-killed.

Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
Member
@alvinlin123 left a comment

I will merge this PR once the code comment is addressed.

@@ -1029,6 +1029,9 @@ func (d *Distributor) MetricsForLabelMatchers(ctx context.Context, from, through
}
ms := ingester_client.FromMetricsForLabelMatchersResponse(resp)
for _, m := range ms {
if err := queryLimiter.AddDataBytes(resp.Size()); err != nil {
Contributor

Why do we add the response size once per matcher? I might be missing something, but why don't we add the response size just once, since it is a single request?

Contributor Author

It should be outside the loop. Thanks for catching it. :)
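
For clarity, a minimal sketch of what moving the call out of the loop could look like, based on the diff above (the error handling shown here is illustrative, not the exact code in the PR):

ms := ingester_client.FromMetricsForLabelMatchersResponse(resp)
// Count the ingester response size once per response, not once per returned metric.
if err := queryLimiter.AddDataBytes(resp.Size()); err != nil {
  return nil, validation.LimitError(err.Error()) // illustrative return; the actual signature may differ
}
for _, m := range ms {
  // existing per-metric handling stays unchanged
}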

@@ -687,14 +691,18 @@ func (q *blocksStoreQuerier) fetchSeriesFromStores(

numSeries := len(mySeries)
chunkBytes := countChunkBytes(mySeries...)
labelBytes := countDataBytes(mySeries...)
dataBytes := labelBytes + chunkBytes
Contributor

So here labelBytes := countDataBytes(mySeries...) already includes both chunk bytes and label bytes, so we end up counting chunkBytes twice.
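
In other words, assuming countDataBytes already returns the full size of each series (labels plus chunks), a sketch of the corrected accounting would be roughly:

numSeries := len(mySeries)
chunkBytes := countChunkBytes(mySeries...)
// countDataBytes already covers both labels and chunks, so use it directly
// instead of adding chunkBytes to it a second time.
dataBytes := countDataBytes(mySeries...)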

if chunkBytesLimitErr := queryLimiter.AddChunkBytes(chunksSize); chunkBytesLimitErr != nil {
return validation.LimitError(chunkBytesLimitErr.Error())
}
if chunkLimitErr := queryLimiter.AddChunks(len(s.Chunks)); chunkLimitErr != nil {
return validation.LimitError(chunkLimitErr.Error())
}
if dataBytesLimitErr := queryLimiter.AddDataBytes(dataSize); dataBytesLimitErr != nil {
Contributor
@yeya24 commented Sep 28, 2022

One question here. The query limiter we are using has the same lifecycle as the query. Say we first query a store gateway (SG) and count some data bytes against the limit; then, maybe due to a network issue or an SG restart, the stream errors and we need to retry another store gateway.
In this case, do you think it makes more sense to release the bytes we consumed before retrying another store gateway? Otherwise it is probably easy to hit the limit.

But it is also okay to hit the limit and let the query frontend retry. WDYT?

Contributor Author

This is a valid point. I also want to note that we are not doing this for other existing limits, like the fetched-chunks and fetched-series limits.

The query-frontend only retries 5XXs. In this case we'll be returning a 422 error, which will not be retried.

Contributor Author

We could do something like:

if isRetryableError(err) {
  level.Warn(spanLog).Log("err", errors.Wrapf(err, "failed to receive series from %s due to retryable error", c.RemoteAddress()))
  queryLimiter.RemoveSeries(seriesCnt)
  queryLimiter.RemoveDataBytes(databytesCnt)
  queryLimiter.RemoveChunkBytes(chunkbytesCnt)
  return nil
}

If others agree with this approach, I can implement this as an enhancement in another PR.
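
For context, the RemoveSeries/RemoveDataBytes/RemoveChunkBytes methods above don't exist on the limiter yet; a hypothetical sketch of what RemoveDataBytes could look like on a counter-based limiter (the type, fields, and error message below are assumptions, not the actual Cortex API):

package limiter

import (
  "fmt"

  "go.uber.org/atomic"
)

// QueryLimiter is a hypothetical sketch; the real Cortex QueryLimiter differs.
type QueryLimiter struct {
  dataBytesCount atomic.Int64
  maxDataBytes   int
}

// AddDataBytes counts fetched bytes against the per-query limit.
func (ql *QueryLimiter) AddDataBytes(size int) error {
  if ql.maxDataBytes > 0 && ql.dataBytesCount.Add(int64(size)) > int64(ql.maxDataBytes) {
    return fmt.Errorf("the query hit the aggregated data size limit (limit: %d bytes)", ql.maxDataBytes)
  }
  return nil
}

// RemoveDataBytes releases bytes counted for a failed, retryable store-gateway
// stream so that the retry does not double-count them.
func (ql *QueryLimiter) RemoveDataBytes(size int) {
  ql.dataBytesCount.Sub(int64(size))
}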

Contributor

SGTM

Member

That's a good point. We should follow up with another PR to fix this, I think!

# ingester and storage. This limit is enforced in the querier and ruler only
# when running Cortex with blocks storage. 0 to disable.
# CLI flag: -querier.max-fetched-data-bytes-per-query
[max_fetched_data_bytes_per_query: <int> | default = 0]
Contributor

Does this limit apply only to the Select call? I don't see us limiting label names and label values. If that's the case, we can probably mention it in the doc or give the flag a better name.

Contributor

That's the only question I have now. Otherwise LGTM.

Contributor Author

The naming of the limit is consistent with other similar limits. I think we can implement this limit for LabelNames and LabelValues as well. For now, I've updated the doc to say that it only applies to the query, query_range and series APIs.

Member

NIT: "when running Cortex with blocks storage"

Currently we only have blocks storage, since chunks storage was removed.

Contributor Author

All the other limits have this line:

This limit is enforced in the querier only when running Cortex with blocks storage

Should we remove it everywhere?

Member

I think we should...

Contributor Author

I'll also address this in the PR to handle retryable failures.

@@ -1008,7 +1007,7 @@ func countChunkBytes(series ...*storepb.Series) (count int) {
return count
}

// countChunkBytes returns the size of the chunks making up the provided series in bytes
// countDataBytes returns the combined size of the all the series
Contributor

"of the all the series" — we can probably remove one "the". Small nit lol
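
As an aside for readers, the body of countDataBytes isn't shown in this hunk; assuming storepb.Series is a gogo/protobuf message with a generated Size() method, a plausible sketch (not necessarily the PR's actual implementation) is:

// countDataBytes returns the combined size, in bytes, of all the provided series,
// including both their labels and their chunks.
func countDataBytes(series ...*storepb.Series) (count int) {
  for _, s := range series {
    count += s.Size() // generated protobuf Size() covers labels and chunks
  }
  return count
}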

@harry671003 force-pushed the collect_query_data branch 2 times, most recently from 4f186b4 to 7e772f1 on September 29, 2022 00:10
Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
Contributor
@yeya24 left a comment

lgtm
