-
Notifications
You must be signed in to change notification settings - Fork 768
Broken tests from closing meter providers on core unload and reload #3707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Broken tests from closing meter providers on core unload and reload #3707
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After rebasing the branch, I found this test here that wasn't there before that was failing. Wasn't closing the meter providers on core unload and reload so added that in. Doing so revealed a number of issues with the feature branch I tried to fix here with some being more difficult than others. There are still more tests failing and have been difficult to track them all down. Could you some help or guidance on trying to fix this branch.
grant { | ||
permission jdk.jfr.FlightRecorderPermission "accessFlightRecorder"; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this because of JFR for JVM metrics from the OTEL library
private final List<MetricProducerInfo> registeredProducers = new ArrayList<>(); | ||
|
||
private record MetricProducerInfo(SolrMetricProducer producer, String scope) {} | ||
private record MetricProducerInfo(SolrMetricProducer producer, Attributes attributes) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a little bit of cleanup on this because of this weird scope string.
String key = enforcePrefix(registry); | ||
MeterProviderAndReaders mpr = meterProviderAndReaders.remove(key); | ||
if (mpr != null) { | ||
IOUtils.closeQuietly(mpr.sdkMeterProvider()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This here was difficult to track down. Tests were running into a dead lock because of this. What I had found is that when cores were being closed the Meter providers were going with it. The periodic metric reader was being registered even when the exporter was a Noop. When the periodic metric reader is closed, it triggers one last metric collection for the exporter calling the observable instruments callbacks holding the lock for that concurrent hashmaps bucket. One of the callbacks was getting index size in SolrCore which had a searcher lock. At the same time, the Solr core closing looked to hold that same searcher lock creating a deadlock. Basically from what I can tell:
Thread A: holds ConcurrentHashMaps bucket lock → waiting for searcherLock (inside the callback).
Thread B: holds searcherLock from the Solr close call → waiting for CHM bin lock.
Deadlock!
So changed this to just close the provider without the computeIfPresent.
assertTrue(prometheusMetrics.contains("name=\"documentCache\"")); | ||
// We need a reference to the core because a reload from /config call closes the whole core thus | ||
// closing metric registries | ||
try (SolrCore core = solrClientTestRule.getCoreContainer().getCore("collection1")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this was common knowledge but this was difficult to figure out. From what I can tell, when a reload happened here I think it closes the cores when there were no more references. Since we now close them on core close, runConfigCommand
closed the registries just making all the metrics go null. So I have to keep a reference to the core so that the reload doesn't just tear down the meter provider in this test.
isCompleted = checkReloadCompletion(asyncId); | ||
} while (!isCompleted); | ||
requestMetrics(false); | ||
try (SolrCore core = h.getCoreInc()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing happened here as above, so need to keep references to avoid closing the meter providers
// NOCOMMIT: Core reload causing a flaky test for cache metrics | ||
@Test | ||
@AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-17458") | ||
public void testRecursiveFilter() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out this problem now that we close meter providers on unload and reload. All tests that pretty much do a reload and check cache metrics and SolrIndexSearcher metrics began to fail. When I test reload with a running solr cloud locally, metrics all reinitialize except for Caffeine Cache
and Solr index searcher
metrics as its hit or miss. Almost seems like a timing or race condition for a reload and when searchers register initializing metrics. I could not get to the bottom of why. Was hoping to get some help if someone understands why.
metricManager.removeRegistry(solrMetricsContext.getRegistryName()); | ||
registeredProducers.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think I found the culprit for reload and metrics not working correctly. I had added to delete the registry here for OTEL metrics but Dropwizard didn't delete the registry but instead just unregistered gauges which isn't all metrics. So I removed this and tests all pass like normal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that a core's metrics are mostly still accessible, even after a core is deleted? Technically you didn't, this code is in the context of "close"... but I'd hope that a high level command to remove/delete a core would do similarly for the underlying metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics are not accessible on core delete or unload. Those metrics are completely gone and the registry goes with it. This change was for the reload scenario. Turns out, it didn't reset normal counters and meters in Dropwizard and never actually deletes the registry. Only the gauges it seems.
permission java.io.FilePermission "${user.home}${/}.m2${/}repository${/}-", "read"; | ||
}; | ||
|
||
grant { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accidental? But it's okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not accident. Needed it for JvmMetricsTest
.
After rebasing the feature branch, closing the registries on core reload and unload broke a number of things. Fix these in this PR.