Skip to content

Conversation

mlbiscoc
Copy link
Contributor

After rebasing the feature branch, closing the registries on core reload and unload broke a number of things. Fix these in this PR.

Copy link
Contributor Author

@mlbiscoc mlbiscoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After rebasing the branch, I found this test here that wasn't there before that was failing. Wasn't closing the meter providers on core unload and reload so added that in. Doing so revealed a number of issues with the feature branch I tried to fix here with some being more difficult than others. There are still more tests failing and have been difficult to track them all down. Could you some help or guidance on trying to fix this branch.

Comment on lines +281 to +283
grant {
permission jdk.jfr.FlightRecorderPermission "accessFlightRecorder";
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this because of JFR for JVM metrics from the OTEL library

private final List<MetricProducerInfo> registeredProducers = new ArrayList<>();

private record MetricProducerInfo(SolrMetricProducer producer, String scope) {}
private record MetricProducerInfo(SolrMetricProducer producer, Attributes attributes) {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a little bit of cleanup on this because of this weird scope string.

Comment on lines +891 to +895
String key = enforcePrefix(registry);
MeterProviderAndReaders mpr = meterProviderAndReaders.remove(key);
if (mpr != null) {
IOUtils.closeQuietly(mpr.sdkMeterProvider());
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This here was difficult to track down. Tests were running into a dead lock because of this. What I had found is that when cores were being closed the Meter providers were going with it. The periodic metric reader was being registered even when the exporter was a Noop. When the periodic metric reader is closed, it triggers one last metric collection for the exporter calling the observable instruments callbacks holding the lock for that concurrent hashmaps bucket. One of the callbacks was getting index size in SolrCore which had a searcher lock. At the same time, the Solr core closing looked to hold that same searcher lock creating a deadlock. Basically from what I can tell:

Thread A: holds ConcurrentHashMaps bucket lock → waiting for searcherLock (inside the callback).
Thread B: holds searcherLock from the Solr close call → waiting for CHM bin lock.

Deadlock!

So changed this to just close the provider without the computeIfPresent.

assertTrue(prometheusMetrics.contains("name=\"documentCache\""));
// We need a reference to the core because a reload from /config call closes the whole core thus
// closing metric registries
try (SolrCore core = solrClientTestRule.getCoreContainer().getCore("collection1")) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this was common knowledge but this was difficult to figure out. From what I can tell, when a reload happened here I think it closes the cores when there were no more references. Since we now close them on core close, runConfigCommand closed the registries just making all the metrics go null. So I have to keep a reference to the core so that the reload doesn't just tear down the meter provider in this test.

isCompleted = checkReloadCompletion(asyncId);
} while (!isCompleted);
requestMetrics(false);
try (SolrCore core = h.getCoreInc()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing happened here as above, so need to keep references to avoid closing the meter providers

// NOCOMMIT: Core reload causing a flaky test for cache metrics
@Test
@AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-17458")
public void testRecursiveFilter() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out this problem now that we close meter providers on unload and reload. All tests that pretty much do a reload and check cache metrics and SolrIndexSearcher metrics began to fail. When I test reload with a running solr cloud locally, metrics all reinitialize except for Caffeine Cache and Solr index searcher metrics as its hit or miss. Almost seems like a timing or race condition for a reload and when searchers register initializing metrics. I could not get to the bottom of why. Was hoping to get some help if someone understands why.

@mlbiscoc mlbiscoc requested a review from dsmiley September 28, 2025 03:38
Comment on lines -191 to -192
metricManager.removeRegistry(solrMetricsContext.getRegistryName());
registeredProducers.clear();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think I found the culprit for reload and metrics not working correctly. I had added to delete the registry here for OTEL metrics but Dropwizard didn't delete the registry but instead just unregistered gauges which isn't all metrics. So I removed this and tests all pass like normal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying that a core's metrics are mostly still accessible, even after a core is deleted? Technically you didn't, this code is in the context of "close"... but I'd hope that a high level command to remove/delete a core would do similarly for the underlying metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics are not accessible on core delete or unload. Those metrics are completely gone and the registry goes with it. This change was for the reload scenario. Turns out, it didn't reset normal counters and meters in Dropwizard and never actually deletes the registry. Only the gauges it seems.

permission java.io.FilePermission "${user.home}${/}.m2${/}repository${/}-", "read";
};

grant {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental? But it's okay

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not accident. Needed it for JvmMetricsTest.

@mlbiscoc mlbiscoc merged commit e4af8c9 into apache:feature/SOLR-17458-rebased Oct 1, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants