New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-14615: Implement CPU Utilization Based Circuit Breaker #1737
Conversation
This commit introduces CPU based circuit breaker. This circuit breaker tracks the average CPU load per minute and triggers if the value exceeds a configurable value. This commit also adds a specific control flag for Memory Circuit Breaker to allow enabling/disabling the same.
The configuration can be as simple as
This way you can just read all the attributes all at once from the |
As discussed offline, I will refactor circuit breaker infrastructure to use PluginInfo as a part of 8.7 (hence will leave this PR's JIRA open for that effort). Not proceeding with that effort in this PR. |
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
Outdated
Show resolved
Hide resolved
@madrob Any further thoughts on this? |
solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
Outdated
Show resolved
Hide resolved
* that data to take a decision. Ideally, we should be able to cache the value | ||
* locally and only query once the minute has elapsed. However, that will introduce | ||
* more complexity than the current structure and might not get us major performance | ||
* wins. If this ever becomes a performance bottleneck, that can be considered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc for OSMxBean says "This method is designed to provide a hint about the system load and may be queried frequently." Not sure what "frequently" means here, though...
It will be interesting to see the dynamic behavior of this breaker - I'm somewhat worried that the 1-min average may be too unstable and we end up flip-flopping between the states too often (eg. every dozen requests or so). Depending on the volume of momentary load (eg. an ongoing large merge or split shard) the 1-min average may exceed a threshold even though it doesn't properly reflect a sustained longer-term overload that we should worry about. Average load is a convenient lie ;)
OTOH that's the only method we have in the OS MXBean, so we would have to compute that longer-term average ourselves, which is messy... so I guess we'll see .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, caching doesn't have to be complicated .. you simply check the elapsed time since last request and if it's longer than timeout you refresh the value first, it's probably less than 10 lines of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think this is a very good point. If load average only updates every 5 seconds, that seems like a huge interval where we're going to block query execution. Let's think about how we can get better granularity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Javadocs state that the average load is updated per one minute -- so I am not sure how we can get a granularity better than that. Its unfortunate that this unit is not configurable (and I believe it must be internally caching as well since it talks about being prepared for "querying frequently". Any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this, the load is an average of one minute. If there is a transient spike -- should it not be smoothened out over the minute so that the average value is reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh ... I see there's still some misunderstanding about this. The call itself is directly passed to the native method that invokes stdlib getloadavg
, which in turn reads these values from the /proc pseudo-fs. So, the cost is truly minimal and the call doesn't block - if it turns out that it's still too costly to call for every request then we can introduce some timeout-based caching.
These averages are so called exponentially weighted moving averages, so indeed a 1-min average has traces of past load values from outside the 1-min window, which helps in smoothing it. This may turn out to be sufficient to avoid false positives due to short-term spikes (such as large merges). Linux loadavg represents to some degree a combined CPU + disk IO load, so indeed intensive IO operations will affect it.
We always have an option to use Codahale Meter to easily calculate 5- and 15-min EWMAs if it turns out that we're getting too many false positives. Until then users can configure higher thresholds, thus reducing the number of false positives at the cost of higher contention.
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
Outdated
Show resolved
Hide resolved
* that data to take a decision. Ideally, we should be able to cache the value | ||
* locally and only query once the minute has elapsed. However, that will introduce | ||
* more complexity than the current structure and might not get us major performance | ||
* wins. If this ever becomes a performance bottleneck, that can be considered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh ... I see there's still some misunderstanding about this. The call itself is directly passed to the native method that invokes stdlib getloadavg
, which in turn reads these values from the /proc pseudo-fs. So, the cost is truly minimal and the call doesn't block - if it turns out that it's still too costly to call for every request then we can introduce some timeout-based caching.
These averages are so called exponentially weighted moving averages, so indeed a 1-min average has traces of past load values from outside the 1-min window, which helps in smoothing it. This may turn out to be sufficient to avoid false positives due to short-term spikes (such as large merges). Linux loadavg represents to some degree a combined CPU + disk IO load, so indeed intensive IO operations will affect it.
We always have an option to use Codahale Meter to easily calculate 5- and 15-min EWMAs if it turns out that we're getting too many false positives. Until then users can configure higher thresholds, thus reducing the number of false positives at the cost of higher contention.
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java
Show resolved
Hide resolved
solr/core/src/test-files/solr/collection1/conf/solrconfig-memory-circuitbreaker.xml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor comments, this looks pretty good
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
Outdated
Show resolved
Hide resolved
|
||
=== CPU Utilization Based Circuit Breaker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sigram we could call this Load Avg Based Circuit Breaker, WDYT? Maybe that is too much implementation detail?
solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java
Outdated
Show resolved
Hide resolved
This commit introduces CPU based circuit breaker. This circuit breaker tracks the average CPU load per minute and triggers if the value exceeds a configurable value. This commit also adds a specific control flag for Memory Circuit Breaker to allow enabling/disabling the same.
This commit introduces CPU based circuit breaker. This circuit breaker
tracks the average CPU load per minute and triggers if the value exceeds
a configurable value.
This commit also adds a specific control flag for Memory Circuit Breaker
to allow enabling/disabling the same.