New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: instrument the query layer to track rate-limited queries #3894
feat: instrument the query layer to track rate-limited queries #3894
Conversation
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have metric to monitor queue length per user, but that doesn't tell us max queue length, so I can see how this new metric could be useful.
I'd suggest to start without "reason" label – we can always add it later, when needed.
pkg/scheduler/queue/queue.go
Outdated
@@ -91,6 +95,7 @@ func (q *RequestQueue) EnqueueRequest(userID string, req Request, maxQueriers in | |||
} | |||
return nil | |||
default: | |||
q.discardedQueries.WithLabelValues(userID, validation.RateLimited).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "rate limited" is misleading here. We only drop the request if queue is full, which can be for various reasons (slow queries, queriers crashed, no querier connected or all queriers are busy). I'd suggest to start without "reason" label.
Also I don't think we should use "validation" reasons (used for validating incoming samples) here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
pkg/scheduler/scheduler.go
Outdated
@@ -100,7 +101,12 @@ func NewScheduler(cfg Config, limits Limits, log log.Logger, registerer promethe | |||
Name: "cortex_query_scheduler_queue_length", | |||
Help: "Number of queries in the queue.", | |||
}, []string{"user"}) | |||
s.requestQueue = queue.NewRequestQueue(cfg.MaxOutstandingPerTenant, s.queueLength) | |||
|
|||
s.discardedQueries = promauto.With(registerer).NewCounterVec(prometheus.CounterOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please cleanup this metric in cleanupMetricsForInactiveUser
and add it to tests to verify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you.
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
…xproject#3894) * feat: instrument the query layer to track rate-limited queries Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * chore: update changelog Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * fix: fix goimports linting error Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * fix per PR comments Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * rename discarded_queries --> discarded_requests Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * fix lint Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
What this PR does:
Instrument the query-frontend and query-scheduler to track the queries that were discarded because of the configured outstanding queries rate limiter. I pre-emptively added a reason label to allow for flexibility in adding other reasons for why a query may be discarded.
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]