SOLR-14615: Implement CPU Utilization Based Circuit Breaker #1737

atris · 2020-08-11T08:48:18Z

This commit introduces a CPU-based circuit breaker. The circuit breaker
tracks the one-minute average CPU load and trips if the value exceeds
a configurable threshold.

This commit also adds a dedicated control flag for the Memory Circuit Breaker
so that it can be enabled or disabled independently.
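
As a rough illustration of the check described above (not the class added by this PR), the sketch below reads the one-minute load average from the JVM's `OperatingSystemMXBean` and compares it with a threshold. The class and method names (`LoadAverageCheck`, `isTripped`) are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Illustrative only; names are hypothetical and not taken from this PR. */
final class LoadAverageCheck {
  private static final OperatingSystemMXBean OS_BEAN =
      ManagementFactory.getOperatingSystemMXBean();

  /** Returns the 1-minute system load average, or a negative value if unavailable. */
  static double currentLoadAverage() {
    return OS_BEAN.getSystemLoadAverage();
  }

  /** Trips only when a load average is available and exceeds the threshold. */
  static boolean isTripped(double loadAverage, double threshold) {
    return loadAverage >= 0 && loadAverage > threshold;
  }
}
```

A caller would evaluate `LoadAverageCheck.isTripped(LoadAverageCheck.currentLoadAverage(), threshold)` before admitting a request. The negative-value guard matters because `getSystemLoadAverage()` returns a negative number on platforms where the load average is unavailable.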

noblepaul · 2020-08-11T09:13:09Z

The configuration can be as simple as

<circuitBreaker isCpuCircuitBreakerEnabled="true" memoryCircuitBreakerThresholdPct="" cpuCircuitBreakerThresholdPct="">

This way you can read all the attributes at once from the PluginInfo.
CircuitBreaker should be a type of plugin. It should be an interface.

atris · 2020-08-11T11:03:14Z

As discussed offline, I will refactor the circuit breaker infrastructure to use PluginInfo as part of 8.7, so I will leave this PR's JIRA open for that effort. I am not proceeding with that refactoring in this PR.

solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java

solr/core/src/java/org/apache/solr/core/SolrConfig.java

solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java

atris · 2020-08-13T05:11:20Z

@madrob Any further thoughts on this?

solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java

solr/solr-ref-guide/src/circuit-breakers.adoc

solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java

sigram · 2020-08-17T11:04:42Z

solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java

+ * that data to take a decision. Ideally, we should be able to cache the value
+ * locally and only query once the minute has elapsed. However, that will introduce
+ * more complexity than the current structure and might not get us major performance
+ * wins. If this ever becomes a performance bottleneck, that can be considered.


The Javadoc for OperatingSystemMXBean says "This method is designed to provide a hint about the system load and may be queried frequently." I'm not sure what "frequently" means here, though...

It will be interesting to see the dynamic behavior of this breaker. I'm somewhat worried that the 1-min average may be too unstable and that we end up flip-flopping between states too often (e.g. every dozen requests or so). Depending on the volume of momentary load (e.g. an ongoing large merge or split shard), the 1-min average may exceed the threshold even though it doesn't reflect the sustained longer-term overload that we should worry about. Average load is a convenient lie ;)

OTOH that's the only method we have in the OS MXBean, so we would have to compute the longer-term average ourselves, which is messy... so I guess we'll see.

Also, caching doesn't have to be complicated: you simply check the elapsed time since the last request and, if it's longer than the timeout, refresh the value first. It's probably less than 10 lines of code.
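
A minimal sketch of the timeout-based caching described in the comment above, under the assumption that the value source and the clock are injected so the behavior can be checked deterministically. `CachedLoadAverage` and `ttlNanos` are hypothetical names, not part of this PR.

```java
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

/** Hypothetical helper: caches a sampled value and refreshes it after a timeout. */
final class CachedLoadAverage {
  private final DoubleSupplier source;   // e.g. the MXBean's load average getter
  private final LongSupplier clockNanos; // e.g. System::nanoTime
  private final long ttlNanos;           // how long a cached value stays valid

  private long lastRefreshNanos;
  private double cachedValue;
  private boolean initialized;

  CachedLoadAverage(DoubleSupplier source, LongSupplier clockNanos, long ttlNanos) {
    this.source = source;
    this.clockNanos = clockNanos;
    this.ttlNanos = ttlNanos;
  }

  /** Returns the cached value, refreshing it first if the timeout has elapsed. */
  synchronized double get() {
    long now = clockNanos.getAsLong();
    if (!initialized || now - lastRefreshNanos >= ttlNanos) {
      cachedValue = source.getAsDouble();
      lastRefreshNanos = now;
      initialized = true;
    }
    return cachedValue;
  }
}
```

In production code the helper could be wired as `new CachedLoadAverage(ManagementFactory.getOperatingSystemMXBean()::getSystemLoadAverage, System::nanoTime, TimeUnit.SECONDS.toNanos(5))`; the 5-second timeout here is only an example value.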

Yea, I think this is a very good point. If load average only updates every 5 seconds, that seems like a huge interval where we're going to block query execution. Let's think about how we can get better granularity.

The Javadocs state that the average load is updated once per minute, so I am not sure how we can get a granularity better than that. It's unfortunate that this interval is not configurable (and I believe it must be caching internally as well, since it talks about being prepared for "querying frequently"). Any thoughts on this?

Thinking more about this, the load is a one-minute average. If there is a transient spike, shouldn't it be smoothed out over the minute so that the average value stays reasonable?

Uh ... I see there's still some misunderstanding about this. The call itself is directly passed to the native method that invokes stdlib getloadavg, which in turn reads these values from the /proc pseudo-fs. So, the cost is truly minimal and the call doesn't block - if it turns out that it's still too costly to call for every request then we can introduce some timeout-based caching.

These averages are so-called exponentially weighted moving averages, so a 1-min average does carry traces of past load values from outside the 1-min window, which helps in smoothing it. This may turn out to be sufficient to avoid false positives due to short-term spikes (such as large merges). Linux loadavg represents to some degree a combined CPU + disk IO load, so intensive IO operations will indeed affect it.

We always have the option of using a Codahale Meter to easily calculate 5- and 15-min EWMAs if it turns out that we're getting too many false positives. Until then, users can configure higher thresholds, thus reducing the number of false positives at the cost of higher contention.
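
To make the smoothing argument concrete, here is a small plain-Java sketch of an exponentially weighted moving average. It does not use the Codahale library; the `Ewma` class, the fixed sampling interval, and the choice of seeding with the first sample are assumptions made for illustration.

```java
/** Hypothetical EWMA over samples taken at a fixed interval. */
final class Ewma {
  private final double alpha; // weight given to each new sample
  private double value;
  private boolean initialized;

  /**
   * @param intervalSeconds time between samples
   * @param windowSeconds   averaging window, e.g. 300 for a 5-min average
   */
  Ewma(double intervalSeconds, double windowSeconds) {
    this.alpha = 1.0 - Math.exp(-intervalSeconds / windowSeconds);
  }

  /** Folds a new sample into the average and returns the updated value. */
  double update(double sample) {
    if (!initialized) {
      value = sample; // seed with the first observation
      initialized = true;
    } else {
      value += alpha * (sample - value);
    }
    return value;
  }

  double value() {
    return value;
  }
}
```

With a 5-second interval and a 300-second window, a single sample of 10 arriving after a steady load of 1 moves the average only slightly above 1, which is the property that would dampen short spikes such as a large merge.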

solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java

solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java

solr/core/src/java/org/apache/solr/core/SolrConfig.java

solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java

solr/solr-ref-guide/src/circuit-breakers.adoc

atris · 2020-08-17T18:20:57Z

@madrob @sigram Please see the next iteration

solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java

solr/core/src/java/org/apache/solr/core/SolrConfig.java

sigram · 2020-08-18T10:15:58Z

solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java

solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java

solr/core/src/test-files/solr/collection1/conf/solrconfig-memory-circuitbreaker.xml

solr/solr-ref-guide/src/circuit-breakers.adoc

sigram

LGTM - thanks!

madrob

Very minor comments; this looks pretty good.

solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java

solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java

madrob · 2020-08-19T17:18:23Z

solr/solr-ref-guide/src/circuit-breakers.adoc


+=== CPU Utilization Based Circuit Breaker


@sigram we could call this Load Avg Based Circuit Breaker, WDYT? Maybe that is too much implementation detail?

solr/server/solr/configsets/_default/conf/solrconfig.xml

solr/core/src/test/org/apache/solr/util/TestCircuitBreaker.java

atris · 2020-08-20T07:51:47Z

Thank you @madrob and @ab!

atris added 4 commits August 11, 2020 14:13

Remove redundant stuff

cf897f8

Add CHANGES.txt entry

96513f0

Remove redundant print statements

1f44a29

Update to use Java MXBean

342f527

atris requested review from anshumg and madrob August 11, 2020 16:08

madrob reviewed Aug 11, 2020

atris added 2 commits August 11, 2020 23:55

Update per comments

0972a36

Rename parameter and remove ceiling on the threshold range

c5d95cd

atris requested a review from madrob August 13, 2020 05:11

madrob requested changes Aug 14, 2020

sigram reviewed Aug 17, 2020

Update per comments

bdd4f21

atris requested a review from madrob August 17, 2020 18:21

madrob reviewed Aug 17, 2020

Use withInitial correctly

095a6f2

atris requested a review from madrob August 18, 2020 04:13

atris added 2 commits August 18, 2020 09:59

Remove redundant import

fe997a4

More precommit stuff

98612bc

sigram reviewed Aug 18, 2020

Update per comments

cada13b

sigram reviewed Aug 18, 2020

sigram approved these changes Aug 18, 2020

Fix Compilation Errors

bae9ebe

madrob reviewed Aug 19, 2020

madrob approved these changes Aug 19, 2020

Update per comments

d964ddb

atris merged commit 2f37f40 into apache:master Aug 20, 2020
