
SOLR-16615: Reuse Jersey apps for cores with the same configset #1286

Merged 14 commits on Jan 23, 2023

Conversation

gerlowskija
Contributor

@gerlowskija gerlowskija commented Jan 10, 2023

https://issues.apache.org/jira/browse/SOLR-16615

Description

Currently, our Jersey integration involves creating an 'ApplicationHandler' instance for each SolrCore. These instances are expensive, and their creation can have a noticeable cumulative effect on startup time for nodes hosting many cores. (See SOLR-16531 for more details.)

Solution

This PR addresses this problem by allowing ApplicationHandlers to be shared across multiple SolrCores, as long as each of those cores uses the same configset. To do this we introduce JerseyAppHandlerCache, which takes over ownership of all core-level ApplicationHandlers. These handlers are ref-counted so that they can be removed from the cache and closed when no longer needed by any SolrCore.
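The ref-counted cache described above could be sketched roughly as follows. This is an illustrative outline under assumptions, not the actual Solr implementation: the generic handler type stands in for Jersey's ApplicationHandler, and the method names (acquire/release) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative ref-counted handler cache; names are hypothetical.
class RefCountedAppCache<H> {
  private static class Entry<T> {
    final T handler;
    final AtomicInteger refCount = new AtomicInteger(0);
    Entry(T handler) { this.handler = handler; }
  }

  private final Map<String, Entry<H>> byConfigSetId = new ConcurrentHashMap<>();

  // Fetch (or build) the handler for this configset ID and take a reference on it.
  H acquire(String configSetId, Supplier<H> builder) {
    Entry<H> e = byConfigSetId.computeIfAbsent(configSetId, id -> new Entry<>(builder.get()));
    e.refCount.incrementAndGet();
    return e.handler;
  }

  // Drop a reference; evict the handler once no core is using it.
  void release(String configSetId) {
    Entry<H> e = byConfigSetId.get(configSetId);
    if (e != null && e.refCount.decrementAndGet() == 0) {
      byConfigSetId.remove(configSetId);
      // a real implementation would also close the evicted handler here
    }
  }

  int size() { return byConfigSetId.size(); }
}
```

Two cores that resolve to the same configset ID then share one handler; the builder runs only once per ID.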

Tests

TBD

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.

This commit introduces the first pieces needed to share Jersey
applications across cores which share a configset.  This is done by
moving ownership of the Jersey 'ApplicationHandler' from the SolrCore
object itself to a new 'JerseyAppHandlerCache' class.  Reference counts
are used to track how many SolrCores are currently relying on a
particular ApplicationHandler; ApplicationHandlers are removed from the
cache when this reference count hits 0.

In order to support multiple sharing strategies, this commit also
introduces the ConfigsetIdGenerator abstraction.  Each Generator
implementation takes in a 'ConfigSet' and a 'SolrCore' and produces a
String ID that is used to fetch the Jersey application from the cache.

Two implementations are provided: NoSharingConfigsetIdGenerator uses the
SolrCore's UUID as its ID (ensuring that no app-sharing will occur), and
ConfigsetNameIdGenerator uses the name of the configset as its ID (ensuring that
all cores using that configset will share the same Jersey application).
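The abstraction described above might look something like the following sketch. The ConfigSet/SolrCore stand-ins and the method names here are assumptions for illustration, not Solr's actual API.

```java
// Minimal stand-ins for the real Solr types, for illustration only.
interface ConfigSet { String getName(); }
interface SolrCore { String getUuid(); }

// Produces the String ID used to fetch a Jersey app from the cache.
interface ConfigsetIdGenerator {
  String generateId(ConfigSet configSet, SolrCore core);
}

// One ID per core: guarantees no app-sharing ever occurs.
class NoSharingConfigsetIdGenerator implements ConfigsetIdGenerator {
  public String generateId(ConfigSet configSet, SolrCore core) {
    return core.getUuid();
  }
}

// One ID per configset name: all cores with that configset share an app.
class ConfigsetNameIdGenerator implements ConfigsetIdGenerator {
  public String generateId(ConfigSet configSet, SolrCore core) {
    return configSet.getName();
  }
}
```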

This initial pass has several limitations that need further
investigation:
  - Relying on the configset name alone is insufficient, as configsets
    can be changed "in place" via ZK or filesystem changes.  A more
    nuanced 'ConfigsetIdGenerator' impl will be needed here, perhaps one
    that hashes the contents of the configset or does something
    similarly content-aware.
  - SolrCoreJerseyApp needs new injectors to make sure it can find the
    right SolrCore and metrics objects at request time.
The prior mechanism for injecting SolrCore instances into Jersey
resources used a singleton factory, and took advantage of the fact that
each ApplicationHandler was specific to a single core.

This commit replaces the singleton-factory used for injection with a
more standard factory that lifts the SolrCore out of the request
context.
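The shape of that change could be sketched as below. All types and the property key here are simplified stand-ins for Jersey's ContainerRequestContext machinery, not the real Solr classes.

```java
import java.util.Map;

// Stand-in for the real SolrCore type.
class SolrCore {
  final String name;
  SolrCore(String name) { this.name = name; }
}

// Stand-in for a per-request context holding request-scoped properties.
class RequestContext {
  private final Map<String, Object> properties;
  RequestContext(Map<String, Object> properties) { this.properties = properties; }
  Object getProperty(String key) { return properties.get(key); }
}

// Instead of a singleton factory bound to one core, the factory lifts
// the SolrCore out of the request context on each request, so a shared
// ApplicationHandler still resolves the right core per request.
class SolrCoreFactory {
  static final String CORE_PROPERTY = "solrCore"; // hypothetical key

  private final RequestContext requestContext;
  SolrCoreFactory(RequestContext requestContext) { this.requestContext = requestContext; }

  SolrCore provide() {
    return (SolrCore) requestContext.getProperty(CORE_PROPERTY);
  }
}
```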
@gerlowskija gerlowskija marked this pull request as draft January 10, 2023 21:58
@gerlowskija
Contributor Author

This still lacks some needed polish - hoping to push some improvements for configset resolution and metrics-lookup later this evening. And of course tests and cleanup remain after that. But the skeleton of this functionality is all there and ready for review.

HoustonPutman and others added 9 commits January 10, 2023 21:53
We want to share a Jersey appHandler among all cores that have the same
config, but this requires a deeper check than just looking at the name
of the configset.

The "effective configset" used by a core really combines a number of
different pieces of state: the configset files themselves (primarily
solrconfig.xml), any core or collection properties that are used as
variables in solrconfig.xml, and any config modifications made using the
"overlay" API.

This commit attempts to update ConfigsetIdGenerator to take each of
these chunks of state into account when generating an ID for the
core+configset combination.
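One way to fold those chunks of state into a single ID is to hash them together, along the lines of the sketch below. The inputs and their names are assumptions for illustration; Solr's actual generator differs in detail.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Illustrative "effective configset" ID: hashes the configset identity,
// its version, core properties used as config variables, and the
// overlay version into one cache key.
class EffectiveConfigsetId {
  static String generate(String configsetName,
                         int configVersion,
                         Map<String, String> coreProperties,
                         int overlayVersion) {
    StringBuilder sb = new StringBuilder();
    sb.append(configsetName).append('|').append(configVersion).append('|');
    // Sort properties so the ID is independent of map iteration order.
    new TreeMap<>(coreProperties)
        .forEach((k, v) -> sb.append(k).append('=').append(v).append(';'));
    sb.append('|').append(overlayVersion);

    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) hex.append(String.format("%02x", b));
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // SHA-256 is always present
    }
  }
}
```

Any change to the configset files, the relevant properties, or the overlay then yields a different ID and hence a fresh Jersey application.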
@gerlowskija
Contributor Author

Alright - I've done much of the gap-filling and cleanup that I had in mind. Taking this out of "Draft" as it's truly ready for review at this point.

The last remaining bit here is testing. I already have some unit tests written for JerseyAppHandlerCache, but they require a slight change to our mockito dep that I'd rather do in a separate PR (though we could definitely bundle it here too if desired). Hoping to add other tests as well.

But otherwise this should be all ready to go.

@gerlowskija gerlowskija marked this pull request as ready for review January 11, 2023 20:28
@risdenk
Contributor

risdenk commented Jan 12, 2023

Seeing this failure in github actions and locally:

1. ERROR in /home/runner/work/solr/solr/solr/core/src/java/org/apache/solr/jersey/RequestMetricHandling.java (at line 45)
	* PluginBag.JaxrsResourceToHandlerMappings}), and using that to look up the associated request
	  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Javadoc: Invalid member type qualification

Overall, I like the way this looks!

Contributor

@HoustonPutman HoustonPutman left a comment


This looks good to me!

I think a test showing that the cores are using the same application (like looking at the cache size) would really complete this PR.

@gerlowskija
Contributor Author

Alright, added the test you asked for @HoustonPutman and merged your precommit fix @risdenk . Thanks for the review everyone!

I'm going to post some perf data and instructions for reproducing over on SOLR-16531, and give folks a few days to reproduce if they'd like (a fun exercise for anyone interested in brushing up on solr-bench). But then I'll look to merge+backport this.

private void assertJerseyAppCacheHasSize(int expectedSize) {
  assertEquals(
      expectedSize,
      cluster.getJettySolrRunners().get(0).getCoreContainer().getAppHandlerCache().size());
}
Contributor


Can we get an assert message here?

Contributor

@HoustonPutman HoustonPutman left a comment


Love the test. This looks good to me!

@gerlowskija
Contributor Author

gerlowskija commented Jan 18, 2023

Hey @risdenk : I was hoping to gear up to merge, but see you still technically have an outstanding "change requested" on this PR. 99% sure this was the precommit fix that you were nice enough to 'Suggest' and that I accepted as-is, but I didn't want to assume in case you had something else you'd intended to mention?

Anything else you want to see here? If not I'll merge in a few days!

Contributor

@risdenk risdenk left a comment


All looks good to me - didn't mean to hold anything up.

@gerlowskija gerlowskija merged commit c21719e into apache:main Jan 23, 2023
@gerlowskija gerlowskija deleted the SOLR-16615-reuse-jersey-apps branch January 23, 2023 16:59
gerlowskija added a commit that referenced this pull request Jan 23, 2023
Prior to this commit our Jersey integration involved creating an 'ApplicationHandler'
instance for each SolrCore. These instances are expensive and their creation can have a
noticeable cumulative effect on startup time for nodes hosting many cores.

This PR addresses the problem by allowing ApplicationHandlers to be shared across
multiple SolrCores, as long as each of those cores uses the same configset. To do this
we introduce JerseyAppHandlerCache, which takes over ownership of all core-level
ApplicationHandlers. These are ref-counted so that they can be removed from the cache
and closed when no longer needed by any SolrCore.
private final int znodeVersion;
private final int version;
Contributor


"znodeVersion" is quite popular. Are you trying to make this more generic?

Contributor


Yep, because the version might not be a znode version; it might be derived from the lastModified date on the filesystem (when using a filesystem configset). There are other znodeVersion names I could have changed, but that would have been far more wide-ranging than just changing this one for now.

In general for things that don't have to be a ZKConfigSet we should move from znodeVersion to version where feasible.


public ApplicationHandler getJerseyApplicationHandler() {
return jerseyAppHandler;
}

public JerseyAppHandlerCache getAppHandlerCache() {
Contributor


No Jersey in the name? I like the clarity of putting Jersey in the name.

Contributor Author


I kept 'Jersey' out of the getter name because we might swap Jersey out for some other JAX-RS implementation down the road, and the less we reference the name of the specific implementation the better.

I've been somewhat inconsistent on this though, so if you think it helps make things clearer then I'm happy to change the getter name.

Comment on lines +68 to +70
log.info(
    "Removing AppHandler from cache for 'effective configset' [{}]",
    effectiveConfigSetId);
Contributor


I'm sure this is interesting to you in development of this feature, but it'll be noise for the rest of us.

Contributor Author


I think this is at least potentially useful in case of some weird bug or interaction caused by the AppHandler reuse, but I'll definitely drop it down to be debug-level so no one else has to see the noise until they want it.

log.info(
    "Removing AppHandler from cache for 'effective configset' [{}]",
    effectiveConfigSetId);
applicationByConfigSetId.remove(effectiveConfigSetId);
Contributor


There is a race condition here if a core closes at the same time another core is created that needs the same handler. The thread creating the core could fetch the handler via ConcurrentHashMap.computeIfAbsent but not yet have called incref(). The other thread, calling decref(), would then close the handler and remove it here. I'm not sure what the consequence is yet.

Contributor Author


OK. Let me restate the race to make sure I understand:

  1. coreA is created and puts its AppHandler in the cache (with refCount=1)
  2. The Solr node receives a 'create' for coreB and a 'delete' for coreA in close proximity
  3. coreB fetches the existing appHandler from the cache (but hasn't yet incref'd and returned)
  4. coreA is deleted and calls its decref()
  5. CoreA's call to decref() causes the AppHandler to be removed from the cache, even though coreB is about to get a reference to it.
  6. coreB makes its incref() call and returns from JAHC.computeIfAbsent

It's definitely a race, good catch.

The most obvious impact here is that some other coreC created later would need to instantiate an AppHandler when it should've been able to reuse the one that coreB is already using.

That's the only impact that jumps out at me. I was concerned at first that if coreB was deleted later on, its decref call would end up causing coreC's AppHandler to be removed from the cache. But since ApplicationHandler doesn't implement equals() the cache-removal triggered when refCount==0 wouldn't actually do anything.

Contributor Author


I'll need to think about this a bit more.

I think we could trivially fix this by making JerseyAppHandlerCache.computeIfAbsent a synchronized method, but that seems like a bit of a blunt instrument that would do a good bit of over-locking. Conversely we could create a different lock for each key (i.e. effective-configset ID) and do more fine-grained locking, but that feels complex for the risk and impact here.
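One shape the fine-grained option could take (a sketch, not Solr's actual fix) is to perform both the incref and the decref inside ConcurrentHashMap.compute(), which the map executes atomically per key, so a concurrent decref cannot slip between the lookup and the incref:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of per-key-atomic ref counting; names are illustrative.
class AtomicRefCache<H> {
  private static class Entry<T> {
    final T handler;
    final int refCount;
    Entry(T handler, int refCount) { this.handler = handler; this.refCount = refCount; }
  }

  private final ConcurrentHashMap<String, Entry<H>> cache = new ConcurrentHashMap<>();

  H acquire(String id, Supplier<H> builder) {
    // Create-or-increment is a single atomic step for this key.
    return cache.compute(id, (k, e) ->
        e == null ? new Entry<>(builder.get(), 1) : new Entry<>(e.handler, e.refCount + 1)
    ).handler;
  }

  void release(String id) {
    // Decrement-and-maybe-remove is likewise atomic: returning null
    // from compute() removes the mapping.
    cache.compute(id, (k, e) ->
        (e == null || e.refCount == 1) ? null : new Entry<>(e.handler, e.refCount - 1));
  }

  int size() { return cache.size(); }
}
```

This locks only the affected key's bin rather than the whole cache, avoiding the over-locking of a synchronized method.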

Contributor Author


To close the loop here - this race condition ended up going away in the new PR by removing RefCounting altogether (and using a Caffeine cache implementation, as suggested by David there).

@gerlowskija
Contributor Author

gerlowskija commented Jan 25, 2023

I merged this change on Monday but quickly reverted it when it caused some build failures (a bad bug crept in when resolving some merge conflicts).

I've since fixed almost all of those, and intend to put this back on 'main'. Collaboration-wise, I'm not sure the best way to do that. AFAICT I can't reopen this PR, since Github considers it "merged", but if I create a new PR I'll lose the review comments that @dsmiley put on this yesterday. Might have to bite the bullet on that if there's no trick to doing this, but I'll poke around a bit before pushing code up for review later this morning.

@gerlowskija
Contributor Author

Turns out there isn't any way in GH to reopen merged PRs, so I've opened a new PR (here) to continue iteration on this functionality. Planning to address David's comments (with some help from Houston) over there.
