-
Notifications
You must be signed in to change notification settings - Fork 628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17118: Simplify CoreContainerProvider initialization #2474
SOLR-17118: Simplify CoreContainerProvider initialization #2474
Conversation
Thereby fixing some rare bugs in tests, maybe in the real world too.
CoreContainer getCoreContainer() throws UnavailableException { | ||
waitForCoreContainer(() -> cores, init); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't wait at this point anymore; it ought to be ready (so we check to report a nice exception if it isn't)
return services.computeIfAbsent(key, ServiceHolder::new); | ||
|
||
/** Acquires an instance from the context, waiting if necessary. */ | ||
public static CoreContainerProvider serviceForContext(ServletContext ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wait for it; previously did not. This allows us to return a bona-fide CoreContainerProvider instead of a ServiceHolder wrapper. Alas, even CCP is itself a wrapper to what we really want.
solr/core/src/java/org/apache/solr/servlet/CoreContainerProvider.java
Outdated
Show resolved
Hide resolved
@Override | ||
public synchronized void lifeCycleStarted(LifeCycle arg0) { | ||
// awkwardly, parts of Solr want to know the port but we don't know that until now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably something we could fix. The Jetty server inserts itself into the context, thus CoreContainerProvider should be able to pull it out on initialization in order to find the port.
solr/core/src/java/org/apache/solr/servlet/SolrDispatchFilter.java
Outdated
Show resolved
Hide resolved
// TODO: see if we can get rid of the holder here (Servlet spec actually guarantees | ||
// ContextListeners run before filter init, but JettySolrRunner that we use for tests is | ||
// complicated) | ||
private ServiceHolder coreService; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good riddance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the sticky bit. Before pushing I suggest you beast the tests for a good while. My memory is that without this I got a low level of failures and or deadlocks. The current state while apparently not perfect had a low enough frequency that it didn't crop up. (though I don't remember exactly how long I was beasting things either).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please recommend the beasting options I should use today/tonight?
I'd like to merge this to main only and then only backport after awhile.
public static CoreContainerProvider serviceForContext(ServletContext ctx) | ||
throws InterruptedException { | ||
long startWait = System.nanoTime(); | ||
synchronized (ctx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm interesting idea, though slightly scary to wait on something jetty owns. Never know when they might also (try to) lock it during server initialization. Also I'm unsure if this would be safe if we ever had another ServletContextListner do the same thing... Worse yet the use of JettySolrRunner for our tests means that we don't have any tests for what start.jar is doing. That dichotomy is a lurking problem, for which there are 2 possible solutions: 1) make jetty startup quicker for tests with https://eclipse.dev/jetty/documentation/jetty-12/operations-guide/index.html#og-quickstart and ditch JettySolrRunner or 2) ditching start.jar and transitioning to custom programatic jetty startup (possibly a variant of JettySolrRunner). As it stands, the tests don't do a good job of covering the true startup process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@epugh was the one who mentioned the jetty quickstart mechanism to me back when I was doing this. I don't know if he's had thoughts since then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worse yet the use of JettySolrRunner for our tests means that we don't have any tests for what start.jar is doing.
BATS integration tests is our answer to that. JettySolrRunner just needs to handle the sequencing realistically (what starts and stops when). You might have also noticed some bolted-on requirements for tests like debug filter, proxy port, and more, so I don't see JettySolrRunner going away. It could be simplified further!
I'm not optimistic about that QuickStart thing helping us -- we've already tuned Jetty. Most notably, our web.xml has metadata-complete="true"
thus there is no annotation scanning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never know when they might also (try to) lock it during server initialization.
I'm not concerned; it's only Jetty and us to worry about. The code here is resilient to wait
being notified inexplicably. There's a loop and it keeps checking for the CoreContainer to show up. Perhaps it'll log a bit more that it's waiting; it's okay. In practice I don't think it'll wait at all since you had pointed out in comments that the ServletContextListener is initialized first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the initialization sequence that I'm coming to learn, I suspected that this code doesn't even have to wait and I was right! I added an assert before ctx.wait and it never tripped. Ha!
Maybe this is controversial but I'm tempted to remove this code (and the "notify" code elsewhere) and replace it with a lookup that on non-existence throws an IllegalStateException claiming that the ServletContextListener should have done its job by then. If it didn't, something is seriously wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. If it doesn't work we can always change back right?
The logic for "we want to shutdown outside of jetty cutting us off" is clearly bogus in its present form.
assertEquals(CommonParams.OK, req.process(solrClient).getResponse().get(CommonParams.STATUS)); | ||
|
||
// shutdown the core container of new node | ||
newJetty.getCoreContainer().shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this portion of the test invalid since it quite intentionally (see the opening comment) shuts down the container without touching Jetty. That's simply not how Jetty shuts down; only Jetty's lifecycle initiates an orderly shutdown. I could test trying an attempt to do a request against a non-existent server but that seemed silly.
solr/core/src/test/org/apache/solr/handler/admin/HealthCheckHandlerTest.java
Show resolved
Hide resolved
solr/test-framework/src/java/org/apache/solr/embedded/JettySolrRunner.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in JIRA, after some more testing I'd like to merge to main then monitor builds carefully. It's better than the current state.
Also I need to run the Nightly tests and Integration tests -- neither run in PR validation.
@markrmiller you might have insights
public static CoreContainerProvider serviceForContext(ServletContext ctx) | ||
throws InterruptedException { | ||
long startWait = System.nanoTime(); | ||
synchronized (ctx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the initialization sequence that I'm coming to learn, I suspected that this code doesn't even have to wait and I was right! I added an assert before ctx.wait and it never tripped. Ha!
Maybe this is controversial but I'm tempted to remove this code (and the "notify" code elsewhere) and replace it with a lookup that on non-existence throws an IllegalStateException claiming that the ServletContextListener should have done its job by then. If it didn't, something is seriously wrong.
solr/test-framework/src/java/org/apache/solr/embedded/JettySolrRunner.java
Show resolved
Hide resolved
} | ||
}); | ||
|
||
debugFilter = root.addFilter(DebugFilter.class, "/*", EnumSet.of(DispatcherType.REQUEST)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer is this section in a LifeCycle.Listener. No longer does it do JettySolrRunner.this.notify(); waitOnSolr is gone.
QueuedThreadPool qtp = (QueuedThreadPool) server.getThreadPool(); | ||
ReservedThreadExecutor rte = qtp.getBean(ReservedThreadExecutor.class); | ||
|
||
server.stop(); | ||
|
||
if (server.getState().equals(Server.FAILED)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we install these components before Jetty starts in a normal way now
@@ -556,15 +528,7 @@ public synchronized void start(boolean reusePort) throws Exception { | |||
server.start(); | |||
} | |||
} | |||
synchronized (JettySolrRunner.this) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after server.start returns, dispatchFilter should be running. It's not started asynchronously; no race.
@@ -699,12 +653,6 @@ public synchronized void stop() throws Exception { | |||
timeout.waitFor("Timeout waiting for reserved executor to stop.", rte::isStopped); | |||
} | |||
|
|||
// we want to shutdown outside of jetty cutting us off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as I said in the commit message for this change, this block here has been pointless ever since the previous big change by Gus. The comment dates back to @markrmiller but it's not clear to me way way back then what the concern was. Any way, I've found Jetty to do stuff fairly predictably and in one thread that I can reason about the sequencing.
@@ -726,10 +674,6 @@ public synchronized void stop() throws Exception { | |||
} | |||
} | |||
|
|||
private ExecutorService getJettyShutDownThreadPool() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an aside, I hate supposed getters that in fact yield new things always
If I get no more feedback, I'll merge Friday. Suggested CHANGES.txt in "Bugs":
|
And another hang in another of my PRs: #2487 (comment) Looking forward to merging this today/Friday :-) and not waiting that long to bring to 9x. |
# Conflicts: # solr/core/src/test/org/apache/solr/handler/admin/HealthCheckHandlerTest.java
Update dependency commons-cli:commons-cli to v1.8.0 (apache#2481) * SOLR-17137 - Add configuration for SSL between Solr and Prometheus exporter. (apache#2232) When running Solr with mTLS the exporter also needs to use mTLS to be able to talk to Solr api. This commit also adds a bats test that verifies that solr exporter actually scrape the running solr instance by verifying information about a newly made core is present. * Move setShardAttributesToParams (apache#2486) Too special-purpose to warrant any back-compat concern. Was here since 2023-06 note: the test is dubious and admittedly isn't in a great location. * SOLR-16093: Tests: don't require IPv6 (apache#2484) The Java VM/host, doesn't always support IPv6, our tests shouldn't require it. security.policy: Removed 3 lines that were only for tests. * SOLR-17271: PerReplicaState: Shard leader elections still impact state.json (apache#2443) * SOLR-17118: Simplify/fix CoreContainerProvider initialization (apache#2474) Thereby fixing a rare occurrence of Solr hanging on startup. CoreContainerProvider: Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey. No looping to wait for initialization. JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way -- add all this before starting, not after. Thus Jetty will shut them down properly so we don't have to. Removed some needless synchronized wait/notify and other needless stuff. HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed. * SOLR-17272: PerReplicaState: Replica "state" and "leader" are still in state.json (apache#2458) * Build: report test history URLs. Fix classMethod suffix (apache#2487) * Test failures reported at the end of a build now include links to view test history at ge.apache.org and fucit.org * Use "./gradlew" (dot-slash) so we can paste the repro line to a shell * SOLR-15447: classMethod suffix is removed Refactored the gradle script some too. * Refactor: dedupe Utils.RANDOM from other spots (apache#2495) * SOLR-17256: Deprecate SolrRequest get/set BasePath and RequestWriter.getPath (apache#2494) Just deprecation. Later, will follow-up with replacements and removal, including CHANGES.txt. * SOLR-17099: snitch does not return spurious tags (apache#2278) * Update dependency org.semver4j:semver4j to v5.3.0 (apache#2440) * Update dependency com.fasterxml.jackson:jackson-bom to v2.17.1 (apache#2346) * Update dependency io.swagger.core.v3:swagger-annotations-jakarta to v2.2.22 (apache#2373) * Update org.mockito:mockito* to v5.12.0 (apache#2490) * Add 'score' field caveat to cursormark docs (apache#2492) Cursormark can give some funky results if used across multiple replicas in a SolrCloud collection, if `score` is used as a sort field. This commit documents this limitation in the cursormark docs. * SOLR-17304: PKG_VERSIONS not honored when loading the schema plugins (apache#2471) Add a reproducer test and a fix. * SOLR-17313: Remove deprecated class SolrLogPostTool (apache#2485) * Remove deprecated class * SOLR-17317: Correct 'wt' interpretation for v2 APIs (apache#2496) Prior to this commit a bug was causing 'wt=json' to return XML content. This is now fixed - the problem was a missing clause in a switch-case statement. * sorl #5: Migration of Thread page into a Thread Component. * sorl #5: Migration of Thread page into a Thread Component. --------- Co-authored-by: Solr Bot <125606113+solrbot@users.noreply.github.com> Co-authored-by: Eivind Bergstøl <eivind@bergstol.no> Co-authored-by: David Smiley <dsmiley@apache.org> Co-authored-by: David Smiley <dsmiley@salesforce.com> Co-authored-by: Noble Paul <noblepaul@users.noreply.github.com> Co-authored-by: Pierre Salagnac <82967811+psalagnac@users.noreply.github.com> Co-authored-by: Jason Gerlowski <gerlowskija@apache.org> Co-authored-by: Andrey Bozhko <andybozhko@gmail.com> Co-authored-by: Eric Pugh <epugh@opensourceconnections.com> Co-authored-by: Nicholas J <nick87@byui.edu>
Thereby fixing a rare occurrence of Solr hanging on startup. CoreContainerProvider: Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey. No looping to wait for initialization. JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way -- add all this before starting, not after. Thus Jetty will shut them down properly so we don't have to. Removed some needless synchronized wait/notify and other needless stuff. HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed. (cherry picked from commit 1177796)
Thereby fixing some rare bugs in tests, maybe in the real world too.
https://issues.apache.org/jira/browse/SOLR-17118