Title: Use rq_total in load_stats_reporter; the current math may double count requests
Description:
In the current load_stats_reporter implementation, there is a chance that requests are double counted:
https://github.com/envoyproxy/envoy/blob/main/source/common/upstream/load_stats_reporter.cc#L79-L111
1. rq_success is inc()ed when "non-5xx" response headers are seen.
2. rq_active_ is dec()ed later, at stream deferred-deletion time.

If the load reporter latch happens between 1. and 2., the same request is counted both as successful and as active, i.e. it is double counted. This is more visible when stream destruction involves time-consuming operations, e.g. logging to external services.
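
The timing window above can be sketched with a small standalone model. This is not Envoy code: `LatchedCounter`, `Gauge`, `HostStats`, `Report`, `latchReport()` and `reportBetweenHeadersAndDestroy()` are hypothetical stand-ins, and the model only assumes that counters are latched (read and reset) while rq_active is read as a plain gauge value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a latched counter: latch() returns the increase
// since the previous latch() and resets the pending amount.
class LatchedCounter {
public:
  void inc() { ++pending_; }
  uint64_t latch() {
    const uint64_t delta = pending_;
    pending_ = 0;
    return delta;
  }

private:
  uint64_t pending_{0};
};

// Hypothetical stand-in for a gauge that tracks in-flight requests.
class Gauge {
public:
  void inc() { ++value_; }
  void dec() {
    assert(value_ > 0); // a request must be active before it can finish
    --value_;
  }
  uint64_t value() const { return value_; }

private:
  uint64_t value_{0};
};

// Simplified per-host stats, named after the counters discussed in this issue.
struct HostStats {
  LatchedCounter rq_total;
  LatchedCounter rq_success;
  LatchedCounter rq_error;
  Gauge rq_active;
};

// What one reporting interval would see.
struct Report {
  uint64_t success;
  uint64_t error;
  uint64_t active;
  uint64_t issued;
  uint64_t sum() const { return success + error + active; }
};

// Assumed shape of the latch: counters are latched, the gauge is only read.
Report latchReport(HostStats& stats) {
  return Report{stats.rq_success.latch(), stats.rq_error.latch(),
                stats.rq_active.value(), stats.rq_total.latch()};
}

// A single request where the latch fires between step 1 and step 2.
Report reportBetweenHeadersAndDestroy() {
  HostStats stats;
  stats.rq_total.inc();   // request issued
  stats.rq_active.inc();  // request becomes active
  stats.rq_success.inc(); // step 1: non-5xx response headers seen
  const Report report = latchReport(stats); // latch fires here
  stats.rq_active.dec();  // step 2: stream deferred deletion
  return report;
}
```

In this model, a single request produces a report where success + error + active is 2 while issued is 1, which is the double counting described above.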
It's also worth noting that the if statement checks rq_success+rq_error+rq_active; in most happy-path cases this sum should match rq_total.
There is an error case that leads to load reporting failure, though: with gRPC, if the gRPC server fails to send trailers, the rq_success_.inc() or rq_error_.inc() in Router::Filter::onUpstreamTrailers() is never called.
rq_active_ will dec() as the stream destructs, but rq_[success/error]_ will not inc().
We should probably fix the rq_[success/error]_ inc issue in another PR, but in this issue I hope we can address the problem on the load_stats_reporter side by changing the if statement to check rq_issued instead of rq_success+rq_error+rq_active.
This way the LRS server gets a chance to infer the load based entirely on rq_total.
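
A minimal sketch of the proposed change, for illustration only. It assumes the existing guard is a non-zero check on the sum, and that rq_issued is the per-interval count the reporter derives from rq_total; `LatchedLoad`, `shouldReportCurrent()`, `shouldReportProposed()` and `noTrailersInterval()` are hypothetical names, not Envoy's.

```cpp
#include <cstdint>

// Hypothetical snapshot of the values gathered for one reporting interval.
struct LatchedLoad {
  uint64_t rq_success;
  uint64_t rq_error;
  uint64_t rq_active;
  uint64_t rq_issued; // assumed to be derived from rq_total
};

// Guard as described in this issue: report only if the sum is non-zero.
bool shouldReportCurrent(const LatchedLoad& load) {
  return load.rq_success + load.rq_error + load.rq_active != 0;
}

// Proposed guard: report whenever any request was issued in the interval.
bool shouldReportProposed(const LatchedLoad& load) {
  return load.rq_issued != 0;
}

// The gRPC error case: one request was issued, the server never sent
// trailers (no success/error inc), and the stream was already destroyed
// (rq_active back to 0) by the time the reporter latched.
LatchedLoad noTrailersInterval() { return LatchedLoad{0, 0, 0, 1}; }
```

With `noTrailersInterval()`, the current guard skips the report entirely, while the proposed guard still reports the one issued request.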
Two issues here:
- If the LRS server uses rq_success+rq_error+rq_active, there is a risk of double counting.
- For Envoy gRPC upstreams, if the server sends no trailers back, no load is recorded or reported to LRS (because only "rq_success+rq_error+rq_active" is checked).
If we change the if condition in load_stats_reporter.cc, we at least give the remote LRS gRPC server a chance to use rq_total (mostly correct in all cases) as the load.