-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-16615: Reuse Jersey apps for cores with the same configset #1286
SOLR-16615: Reuse Jersey apps for cores with the same configset #1286
Conversation
This commit introduces the first pieces needed to share Jersey applications across cores which share a configset. This is done by moving ownership of the Jersey 'ApplicationHandler' from the SolrCore object itself to a new 'JerseyAppHandlerCache' class. Reference counts are used to track how many SolrCores are currently relying on a particular ApplicationHandler; ApplicationHandlers are removed from the cache when this reference count hits 0. In order to support multiple sharing strategies, this commit also introduces the ConfigsetIdGenerator abstraction. Each Generator implementation takes in a 'ConfigSet' and a 'SolrCore' and produces a String ID that is used to fetch the Jersey application from the cache. Two implementations are provided: NoSharingConfigsetIdGenerator uses the SolrCore's UUID as its ID (ensuring that no app-sharing will occur), and ConfigsetNameIdGenerator uses the name of the configset as its ID (ensuring that all cores using that configset will share the same Jersey application). This initial pass has several limitations that need further investigation: - Relying on the configset name alone is insufficient, as configsets can be changed "in place" via ZK or filesystem changes. A more nuanced 'ConfigsetIdGenerator' impl will be needed here, perhaps that hashes the contents of the configset or something similarly content aware? - SolrCoreJerseyApp needs new injectors to make sure it can find the right SolrCore and metrics objects at request time.
The prior mechanism for injecting SolrCore instances into Jersey resources used a singleton factory, and took advantage of the fact that each ApplicationHandler was specific to a single core. This commit replaces the singleton-factory used for injection with a more standard factory that lifts the SolrCore out of the request context.
This still lacks some needed polish - hoping to push some improvements for configset resolution and metrics-lookup later this evening. And of course tests and cleanup remain after that. But the skeleton of this functionality is all there and ready for review. |
We want to share a Jersey appHandler among all cores that have the same config, but this requires a deeper check than just looking at the name of the configset. The "effective configset" used by a core really combines a number of different pieces of state: the configset files themselves (primarily solrconfig.xml), any core or collection properties that are used as variables in solrconfig.xml, and any config modifications made using the "overlay" API. This commit attempts to update ConfigsetIdGenerator to take each of these chunks of state into account when generating an ID for the core+configset combination.
Alright - I've done much of the gap-filling and cleanup that I had in mind. Taking this out of "Draft" as it's truly ready for review at this point. The last remaining bit here is testing. I already have some some unit tests written for JerseyAppHandlerCache, but they require a slight change to our mockito dep that I'd rather do in a separate PR (though we could def bundle it here too if desired). Hoping to add other tests as well. But otherwise this should be all ready to go. |
Seeing this failure in github actions and locally:
overall like the way this looks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me!
I think a test showing that the cores are using the same application (like looking at the cache size), would really complete this PR.
solr/core/src/java/org/apache/solr/jersey/RequestMetricHandling.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Kevin Risden <risdenk@users.noreply.github.com>
Alright, added the test you asked for @HoustonPutman and merged your precommit fix @risdenk . Thanks for the review everyone! I'm going to post some perf data and instructions for reproducing over on SOLR-16531, and give folks a few days to reproduce if they'd like (a fun exercise for anyone interested in brushing up on |
private void assertJerseyAppCacheHasSize(int expectedSize) { | ||
assertEquals( | ||
expectedSize, | ||
cluster.getJettySolrRunners().get(0).getCoreContainer().getAppHandlerCache().size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get an assert message here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the test. This looks good to me!
Hey @risdenk : I was hoping to gear up to merge, but see you still technically have an outstanding "change requested" on this PR. 99% sure this was the precommit fix that you were nice enough to 'Suggest' and that I accepted as-is, but I didn't want to assume in case you had something else you'd intended to mention? Anything else you want to see here? If not I'll merge in a few days! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good to me - didn't mean to hold anything up.
Prior to this commit our Jersey integration involved creating an 'ApplicationHandler' class for each SolrCore. These classes are expensive and their creation can have a noticeable cumulative effect at startup time on nodes hosting many many cores. This PR addresses this problem by allowing ApplicationHandler's to be shared across multiple SolrCores, as long as each of those cores use the same configset. To do this we introduce JerseyAppHandlerCache, which takes over ownership of all core-level ApplicationHandler's. AH's are ref-counted so that they can be removed from the cache and closed when no longer needed by any SolrCores.
private final int znodeVersion; | ||
private final int version; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"znodeVersion" is quite popular. Are you trying to make this more generic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, because the version might not be a znode version, it might be an ending to the lastModified date from the filesystem (using a filesystem configset). There are other znodeVersion names I could have changed, but that would have been a lot more far-ranging than just changing this one for now.
In general for things that don't have to be a ZKConfigSet we should move from znodeVersion to version where feasible.
|
||
public ApplicationHandler getJerseyApplicationHandler() { | ||
return jerseyAppHandler; | ||
} | ||
|
||
public JerseyAppHandlerCache getAppHandlerCache() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No Jersey in the name? I like the clarity of putting Jersey in the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept 'Jersey' from the getter name because we might swap Jersey out for some other JAX-RS implementation down the road, and the less we reference the name of the specific implementation the better.
I've been somewhat inconsistent on this though, so if you think it helps make things clearer then I'm happy to change the getter name.
log.info( | ||
"Removing AppHandler from cache for 'effective configset' [{}]", | ||
effectiveConfigSetId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure this is interesting to you in development of this feature, but it'll be noise for the rest of us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is at least potentially useful in case of some weird bug or interaction caused by the AppHandler reuse, but I'll definitely drop it down to be debug-level so no one else has to see the noise until they want it.
log.info( | ||
"Removing AppHandler from cache for 'effective configset' [{}]", | ||
effectiveConfigSetId); | ||
applicationByConfigSetId.remove(effectiveConfigSetId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a race condition here if a core closes at the same time one is created with the same need of this handler. The thread creating a core could fetch from the ConcurrentHashMap.computeIfAbsent but hasn't quite called incref() yet. Then the other thread calling decref
will close this here and remove. I'm not sure what the consequence is yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Let me restate the race to make sure I understand:
- coreA is created and puts its AppHandler in the cache (with refCount=1)
- The Solr node receives a 'create' for coreB and a 'delete' for coreA come in in close proximity
- coreB fetches the existing appHandler from the cache (but hasn't yet incref'd and returned)
- coreA is deleted and calls its
decref()
- CoreA's call to decref() causes the AppHandler to be removed from the cache, even though coreB is about to get a reference to it.
- coreB makes its incref() call and returns from JAHC.computeIfAbsent
It's definitely a race, good catch.
The most obvious impact here is that some other coreC created later would need to instantiate an AppHandler when it should've been able to reuse the one that coreB is already using.
That's the only impact that jumps out at me. I was concerned at first that if coreB was deleted later on, its decref call would end up causing coreC's AppHandler to be removed from the cache. But since ApplicationHandler doesn't implement equals()
the cache-removal triggered when refCount==0 wouldn't actually do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll need to think about this a bit more.
I think we could trivially fix this by making JerseyAppHandlerCache.computeIfAbsent a synchronized method, but that seems like a bit of a blunt instrument that would do a good bit of over-locking. Conversely we could create a different lock for each key (i.e. effective-configset ID) and do more fine-grained locking, but that feels complex for the risk and impact here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To close the loop here - this race condition ended up going away in the new PR by removing RefCounting altogether (and using a Caffeine cache implementation, as suggested by David there).
I merged this change on Monday but quickly reverted it when it caused some build failures (a bad bug crept in when resolving some merge conflicts). I've since fixed almost all of those, and intend to put this back on 'main'. Collaboration-wise, I'm not sure the best way to do that. AFAICT I can't reopen this PR, since Github considers it "merged", but if I create a new PR I'll lose the review comments that @dsmiley put on this yesterday. Might have to bite the bullet on that if there's no trick to doing this, but I'll poke around a bit before pushing code up for review later this morning. |
Turns out there isn't any way in GH to reopen merged PRs, so I've opened a new PR (here) to continue iteration on this functionality. Planning to address David's comments (with some help from Houston) over there. |
https://issues.apache.org/jira/browse/SOLR-16615
Description
Currently, our Jersey integration involves creating an 'ApplicationHandler' class for each SolrCore. These classes are expensive and their creation can have a noticeable cumulative effect at startup time on nodes hosting many many cores. (See SOLR-16531 for more details.)
Solution
This PR addresses this problem by allowing ApplicationHandler's to be shared across multiple SolrCores, as long as each of those cores use the same configset. To do this we introduce JerseyAppHandlerCache, which takes over ownership of all core-level ApplicationHandler's. AH's are ref-counted so that they can be removed from the cache and closed when no longer needed by any SolrCore's.
Tests
TBD
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.