Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17118: Simplify CoreContainerProvider initialization #2474

Merged

Conversation

dsmiley
Copy link
Contributor

@dsmiley dsmiley commented May 22, 2024

Thereby fixing some rare bugs in tests, maybe in the real world too.

https://issues.apache.org/jira/browse/SOLR-17118

Thereby fixing some rare bugs in tests, maybe in the real world too.
CoreContainer getCoreContainer() throws UnavailableException {
waitForCoreContainer(() -> cores, init);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't wait at this point anymore; it ought to be ready (so we check to report a nice exception if it isn't)

return services.computeIfAbsent(key, ServiceHolder::new);

/** Acquires an instance from the context, waiting if necessary. */
public static CoreContainerProvider serviceForContext(ServletContext ctx)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wait for it; previously did not. This allows us to return a bona-fide CoreContainerProvider instead of a ServiceHolder wrapper. Alas, even CCP is itself a wrapper to what we really want.

@Override
public synchronized void lifeCycleStarted(LifeCycle arg0) {
// awkwardly, parts of Solr want to know the port but we don't know that until now
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably something we could fix. The Jetty server inserts itself into the context, thus CoreContainerProvider should be able to pull it out on initialization in order to find the port.

Comment on lines -76 to -79
// TODO: see if we can get rid of the holder here (Servlet spec actually guarantees
// ContextListeners run before filter init, but JettySolrRunner that we use for tests is
// complicated)
private ServiceHolder coreService;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good riddance

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the sticky bit. Before pushing I suggest you beast the tests for a good while. My memory is that without this I got a low level of failures and or deadlocks. The current state while apparently not perfect had a low enough frequency that it didn't crop up. (though I don't remember exactly how long I was beasting things either).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please recommend the beasting options I should use today/tonight?

I'd like to merge this to main only and then only backport after awhile.

Fix checkReady.
public static CoreContainerProvider serviceForContext(ServletContext ctx)
throws InterruptedException {
long startWait = System.nanoTime();
synchronized (ctx) {
Copy link
Contributor

@gus-asf gus-asf May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm interesting idea, though slightly scary to wait on something jetty owns. Never know when they might also (try to) lock it during server initialization. Also I'm unsure if this would be safe if we ever had another ServletContextListner do the same thing... Worse yet the use of JettySolrRunner for our tests means that we don't have any tests for what start.jar is doing. That dichotomy is a lurking problem, for which there are 2 possible solutions: 1) make jetty startup quicker for tests with https://eclipse.dev/jetty/documentation/jetty-12/operations-guide/index.html#og-quickstart and ditch JettySolrRunner or 2) ditching start.jar and transitioning to custom programatic jetty startup (possibly a variant of JettySolrRunner). As it stands, the tests don't do a good job of covering the true startup process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@epugh was the one who mentioned the jetty quickstart mechanism to me back when I was doing this. I don't know if he's had thoughts since then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worse yet the use of JettySolrRunner for our tests means that we don't have any tests for what start.jar is doing.

BATS integration tests is our answer to that. JettySolrRunner just needs to handle the sequencing realistically (what starts and stops when). You might have also noticed some bolted-on requirements for tests like debug filter, proxy port, and more, so I don't see JettySolrRunner going away. It could be simplified further!

I'm not optimistic about that QuickStart thing helping us -- we've already tuned Jetty. Most notably, our web.xml has metadata-complete="true" thus there is no annotation scanning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never know when they might also (try to) lock it during server initialization.

I'm not concerned; it's only Jetty and us to worry about. The code here is resilient to wait being notified inexplicably. There's a loop and it keeps checking for the CoreContainer to show up. Perhaps it'll log a bit more that it's waiting; it's okay. In practice I don't think it'll wait at all since you had pointed out in comments that the ServletContextListener is initialized first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the initialization sequence that I'm coming to learn, I suspected that this code doesn't even have to wait and I was right! I added an assert before ctx.wait and it never tripped. Ha!

Maybe this is controversial but I'm tempted to remove this code (and the "notify" code elsewhere) and replace it with a lookup that on non-existence throws an IllegalStateException claiming that the ServletContextListener should have done its job by then. If it didn't, something is seriously wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. If it doesn't work we can always change back right?

The logic for "we want to shutdown outside of jetty cutting us off" is clearly bogus in its present form.
assertEquals(CommonParams.OK, req.process(solrClient).getResponse().get(CommonParams.STATUS));

// shutdown the core container of new node
newJetty.getCoreContainer().shutdown();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this portion of the test invalid since it quite intentionally (see the opening comment) shuts down the container without touching Jetty. That's simply not how Jetty shuts down; only Jetty's lifecycle initiates an orderly shutdown. I could test trying an attempt to do a request against a non-existent server but that seemed silly.

@dsmiley dsmiley requested a review from markrmiller May 23, 2024 01:42
Copy link
Contributor Author

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said in JIRA, after some more testing I'd like to merge to main then monitor builds carefully. It's better than the current state.

Also I need to run the Nightly tests and Integration tests -- neither run in PR validation.

@markrmiller you might have insights

public static CoreContainerProvider serviceForContext(ServletContext ctx)
throws InterruptedException {
long startWait = System.nanoTime();
synchronized (ctx) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the initialization sequence that I'm coming to learn, I suspected that this code doesn't even have to wait and I was right! I added an assert before ctx.wait and it never tripped. Ha!

Maybe this is controversial but I'm tempted to remove this code (and the "notify" code elsewhere) and replace it with a lookup that on non-existence throws an IllegalStateException claiming that the ServletContextListener should have done its job by then. If it didn't, something is seriously wrong.

}
});

debugFilter = root.addFilter(DebugFilter.class, "/*", EnumSet.of(DispatcherType.REQUEST));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer is this section in a LifeCycle.Listener. No longer does it do JettySolrRunner.this.notify(); waitOnSolr is gone.

QueuedThreadPool qtp = (QueuedThreadPool) server.getThreadPool();
ReservedThreadExecutor rte = qtp.getBean(ReservedThreadExecutor.class);

server.stop();

if (server.getState().equals(Server.FAILED)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we install these components before Jetty starts in a normal way now

@@ -556,15 +528,7 @@ public synchronized void start(boolean reusePort) throws Exception {
server.start();
}
}
synchronized (JettySolrRunner.this) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after server.start returns, dispatchFilter should be running. It's not started asynchronously; no race.

@@ -699,12 +653,6 @@ public synchronized void stop() throws Exception {
timeout.waitFor("Timeout waiting for reserved executor to stop.", rte::isStopped);
}

// we want to shutdown outside of jetty cutting us off
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said in the commit message for this change, this block here has been pointless ever since the previous big change by Gus. The comment dates back to @markrmiller but it's not clear to me way way back then what the concern was. Any way, I've found Jetty to do stuff fairly predictably and in one thread that I can reason about the sequencing.

@@ -726,10 +674,6 @@ public synchronized void stop() throws Exception {
}
}

private ExecutorService getJettyShutDownThreadPool() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an aside, I hate supposed getters that in fact yield new things always

@dsmiley
Copy link
Contributor Author

dsmiley commented May 28, 2024

If I get no more feedback, I'll merge Friday.
I'm not sure what single test (or small number of such) to beast. Running all tests isn't viable.

Suggested CHANGES.txt in "Bugs":

  • Simplify CoreContainerProvider initialization and JettySolrRunner. Avoid deadlock/hang.

@dsmiley
Copy link
Contributor Author

dsmiley commented May 31, 2024

And another hang in another of my PRs: #2487 (comment)

Looking forward to merging this today/Friday :-) and not waiting that long to bring to 9x.

# Conflicts:
#	solr/core/src/test/org/apache/solr/handler/admin/HealthCheckHandlerTest.java
@dsmiley dsmiley merged commit 1177796 into apache:main Jun 1, 2024
3 checks passed
@dsmiley dsmiley deleted the SOLR-17118-CoreContainerProvider-simplify branch June 1, 2024 02:04
NicholasJDev added a commit to Helen-JS/solr that referenced this pull request Jun 10, 2024
Update dependency commons-cli:commons-cli to v1.8.0 (apache#2481)

* SOLR-17137 - Add configuration for SSL between Solr and Prometheus exporter. (apache#2232)

When running Solr with mTLS the exporter also needs to use mTLS to be
able to talk to Solr api.

This commit also adds a bats test that verifies that solr exporter actually scrape the running
solr instance by verifying information about a newly made core is present.

* Move setShardAttributesToParams (apache#2486)

Too special-purpose to warrant any back-compat concern.  Was here since 2023-06
note: the test is dubious and admittedly isn't in a great location.

* SOLR-16093: Tests: don't require IPv6 (apache#2484)

The Java VM/host, doesn't always support IPv6, our tests shouldn't require it.
security.policy: Removed 3 lines that were only for tests.

* SOLR-17271: PerReplicaState: Shard leader elections still impact state.json (apache#2443)

* SOLR-17118: Simplify/fix CoreContainerProvider initialization (apache#2474)

Thereby fixing a rare occurrence of Solr hanging on startup.

CoreContainerProvider:  Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey.  No looping to wait for initialization.

JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way -- add all this before starting, not after.  Thus Jetty will shut them down properly so we don't have to.  Removed some needless synchronized wait/notify and other needless stuff.

HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed.

* SOLR-17272: PerReplicaState: Replica "state" and "leader" are still in state.json (apache#2458)

* Build: report test history URLs.  Fix classMethod suffix (apache#2487)

* Test failures reported at the end of a build now include links to view test history at ge.apache.org and fucit.org
* Use "./gradlew" (dot-slash) so we can paste the repro line to a shell
* SOLR-15447: classMethod suffix is removed

Refactored the gradle script some too.

* Refactor: dedupe Utils.RANDOM from other spots (apache#2495)

* SOLR-17256: Deprecate SolrRequest get/set BasePath and RequestWriter.getPath (apache#2494)

Just deprecation.  
Later, will follow-up with replacements and removal, including CHANGES.txt.

* SOLR-17099: snitch does not return spurious tags (apache#2278)

* Update dependency org.semver4j:semver4j to v5.3.0 (apache#2440)

* Update dependency com.fasterxml.jackson:jackson-bom to v2.17.1 (apache#2346)

* Update dependency io.swagger.core.v3:swagger-annotations-jakarta to v2.2.22 (apache#2373)

* Update org.mockito:mockito* to v5.12.0 (apache#2490)

* Add 'score' field caveat to cursormark docs (apache#2492)

Cursormark can give some funky results if used across multiple replicas
in a SolrCloud collection, if `score` is used as a sort field.  This
commit documents this limitation in the cursormark docs.

* SOLR-17304: PKG_VERSIONS not honored when loading the schema plugins (apache#2471)

Add a reproducer test and a fix.

* SOLR-17313: Remove deprecated class SolrLogPostTool (apache#2485)

* Remove deprecated class

* SOLR-17317: Correct 'wt' interpretation for v2 APIs (apache#2496)

Prior to this commit a bug was causing 'wt=json' to return XML content.
This is now fixed - the problem was a missing clause in a switch-case
statement.

* sorl #5: Migration of Thread page into a Thread Component.

* sorl #5: Migration of Thread page into a Thread Component.

---------

Co-authored-by: Solr Bot <125606113+solrbot@users.noreply.github.com>
Co-authored-by: Eivind Bergstøl <eivind@bergstol.no>
Co-authored-by: David Smiley <dsmiley@apache.org>
Co-authored-by: David Smiley <dsmiley@salesforce.com>
Co-authored-by: Noble Paul <noblepaul@users.noreply.github.com>
Co-authored-by: Pierre Salagnac <82967811+psalagnac@users.noreply.github.com>
Co-authored-by: Jason Gerlowski <gerlowskija@apache.org>
Co-authored-by: Andrey Bozhko <andybozhko@gmail.com>
Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>
Co-authored-by: Nicholas J <nick87@byui.edu>
dsmiley added a commit that referenced this pull request Jul 3, 2024
Thereby fixing a rare occurrence of Solr hanging on startup.

CoreContainerProvider:  Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey.  No looping to wait for initialization.

JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way -- add all this before starting, not after.  Thus Jetty will shut them down properly so we don't have to.  Removed some needless synchronized wait/notify and other needless stuff.

HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed.

(cherry picked from commit 1177796)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants