Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add per-user query metrics for series and bytes returned #4343

Merged
merged 7 commits into from
Jul 20, 2021
Merged

Add per-user query metrics for series and bytes returned #4343

merged 7 commits into from
Jul 20, 2021

Conversation

56quarters
Copy link
Contributor

@56quarters 56quarters commented Jul 6, 2021

What this PR does:

Add stats included in query responses from the querier and distributor
for measuring the number of series and bytes included in successful
queries. These stats are emitted per-user as counters from the query
frontends.

These stats are picked to add visibility into the same resources limited
as part of #4179 and #4216.

Fixes #4259

Notes to reviewers

Open issue:

  • The spanlog in pkg/querier/blocks_store_queryable.go [1] computes the number of bytes for the series (countSeriesBytes()) differently than the way chunk bytes are limited. I wasn't sure if I should update the spanlog to use countChunkBytes() the same way I did for the stats emitted or if it's supposed to be measuring something entirely different. As it is, the number bytes emitted by the spanlog is consistently lower than countChunkBytes() since it doesn't include the timestamps for each chunk.

Looking for input:

  • The buckets picked for the histograms were chosen based on queries performed by the cortex-mixin dashboards. As they are now, nearly all queries fall into the largest bucket or lower (number of series and chunk bytes). If there are changes you'd like to see, let me know.
  • Lots of casting to uint64 from int and vice versa. It seemed from the PRs that added the limits that there wasn't consensus on int vs uint64. Let me know if you'd like to see uint64 here changed to something else.

Signed-off-by: Nick Pillitteri nick.pillitteri@grafana.com

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns doing it a per-user basis. @tomwilkie just yesterday commented here #4259 (comment) on the fact that we have to pay attention each time we add a per-user histogram, cause they easily explode cardinality.

@56quarters
Copy link
Contributor Author

I have some concerns doing it a per-user basis. @tomwilkie just yesterday commented here #4259 (comment) on the fact that we have to pay attention each time we add a per-user histogram, cause they easily explode cardinality.

Understood. The approach here (to emit the metrics from the query-frontend) was taken based on a suggestion from @pstibrany to reduce the number of series (since there are usually only a few query frontends). I can definitely reduce the number of buckets of the histograms.

@56quarters
Copy link
Contributor Author

Based on @tomwilkie 's comment it seems like what I should do is:

  • Keep emitting these metrics as part of the query log
  • Change these histograms to summaries? Or perhaps remove them entirely?

@tomwilkie
Copy link
Contributor

tomwilkie commented Jul 9, 2021

Yes please Nick! I'd probably go with logs in this case, but a summary would also be fine with 1 or 2 quantiles.

@56quarters 56quarters changed the title Add per-user query histograms for series and bytes returned Add per-user query metrics for series and bytes returned Jul 13, 2021
Add stats included in query responses from the querier and distributor
for measuring the number of series and bytes included in successful
queries. These stats are emitted per-user as summaries from the query
frontends.

These stats are picked to add visibility into the same resources limited
as part of #4179 and #4216.

Fixes #4259

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters requested a review from pracucci July 14, 2021 13:11
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Copy link
Contributor

@replay replay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one comment

pkg/distributor/query.go Outdated Show resolved Hide resolved
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job @56quarters! I left few comments about naming to try to make it a bit more clear. A part from this LGTM (and no concern about cardinality since you moved away from histograms). Thanks for addressing my initial feedback! 🙏

pkg/querier/stats/stats.proto Outdated Show resolved Hide resolved
pkg/querier/stats/stats.proto Outdated Show resolved Hide resolved
pkg/querier/blocks_store_queryable.go Show resolved Hide resolved
pkg/frontend/transport/handler.go Outdated Show resolved Hide resolved
pkg/frontend/transport/handler.go Outdated Show resolved Hide resolved
pkg/frontend/transport/handler.go Outdated Show resolved Hide resolved
pkg/frontend/transport/handler.go Outdated Show resolved Hide resolved
pkg/frontend/transport/handler.go Outdated Show resolved Hide resolved
pkg/frontend/transport/handler.go Outdated Show resolved Hide resolved
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters requested a review from pracucci July 16, 2021 14:53
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (modulo a couple of final nits). Thanks!

pkg/frontend/transport/handler.go Outdated Show resolved Hide resolved
pkg/querier/stats/stats_test.go Outdated Show resolved Hide resolved
pkg/querier/stats/stats_test.go Outdated Show resolved Hide resolved
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@jtlisi jtlisi self-requested a review July 16, 2021 15:05
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

pkg/frontend/transport/handler.go Outdated Show resolved Hide resolved
@56quarters
Copy link
Contributor Author

I think it would be useful to emit this metric in some non-successful queries too. The specific case I have in mind is when the query do indeed hit the limit, we should still have this metric emitted so we can detect if any user hit the limit.

I think it would be useful as well but I'm not going to make the change here since I think it would involve a fairly large refactoring of the code. Nothing preventing us from doing this as a follow-up though.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@pstibrany pstibrany enabled auto-merge (squash) July 20, 2021 14:16
@pstibrany pstibrany merged commit 1e4e0ca into cortexproject:master Jul 20, 2021
alvinlin123 pushed a commit to ac1214/cortex that referenced this pull request Jan 14, 2022
…ct#4343)

* Add per-user query metrics for series and bytes returned

Add stats included in query responses from the querier and distributor
for measuring the number of series and bytes included in successful
queries. These stats are emitted per-user as summaries from the query
frontends.

These stats are picked to add visibility into the same resources limited
as part of cortexproject#4179 and cortexproject#4216.

Fixes cortexproject#4259

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Formatting fix

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Fix changelog to match actual changes

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Typo

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review changes, rename things for clarity

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review changes, remove superfluous summaries

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

per-tenant metrics on chunks-per-query, chunk-size-bytes-per-query, and fetched-series-per-query
5 participants