SOLR-4587: integrate lucene-monitor into solr #2382

kotman12 · 2024-04-01T17:52:51Z

https://issues.apache.org/jira/browse/SOLR-4587

Description

This module aims to simplify distributing and scaling query indexes for monitoring and alerting workflows (also known as reverse search) by providing a bridge between a solr-managed search index and lucene-monitor's efficient reverse search algorithms.

Here is some evidence that the community might find this useful.

Blog-post that partly inspired the current approach
Users asking about a percolator-like feature on stackoverflow.
Someone contributed this extension but it doesn't really provide percolator-like functionality and because it wasn't upstreamed it fell out of maintenance.
Plug for my own question on the issue!

Solution

This is still a WiP but I am opening it up as a PR to get community feedback. The current approach is to ingest queries as solr documents, decompose them for performance, and then use the child-document feature to index the decomposed subqueries under one atomic parent document block. On the search side the latest approach is to use a dedicated component that creates hooks into lucene-monitor's Presearcher, QueryTermFilter and CandidateMatcher.
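To make the decomposition step concrete, here is a minimal sketch in plain Java. It is not the PR's code and uses no Solr or Lucene API: the class name, the `BlockEntry` type, and the string-based `OR` splitting are all hypothetical stand-ins. It assumes a top-level disjunction is split into its disjuncts and that each disjunct is stored as a child entry sharing the parent's id, with the parent entry last in the block.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of decomposing a saved query into a parent/child block (hypothetical, not Solr API). */
final class QueryDecompositionSketch {

  /** One entry in the block: the owning query id, the query text, and whether it is the parent. */
  static final class BlockEntry {
    final String parentId;
    final String queryText;
    final boolean isParent;

    BlockEntry(String parentId, String queryText, boolean isParent) {
      this.parentId = parentId;
      this.queryText = queryText;
      this.isParent = isParent;
    }
  }

  /** Splits a top-level disjunction written as "a OR b OR c"; any other query stays whole. */
  static List<String> decompose(String query) {
    List<String> disjuncts = new ArrayList<>();
    for (String part : query.split("\\s+OR\\s+")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        disjuncts.add(trimmed);
      }
    }
    return disjuncts;
  }

  /** Builds the block with one child per disjunct and the parent entry last (an assumption of this sketch). */
  static List<BlockEntry> toBlock(String id, String query) {
    List<BlockEntry> block = new ArrayList<>();
    for (String disjunct : decompose(query)) {
      block.add(new BlockEntry(id, disjunct, false));
    }
    block.add(new BlockEntry(id, query, true));
    return block;
  }
}
```

Because every entry carries the same parent id, deleting or replacing the parent can remove all of its disjuncts together, which is the property the atomic block is meant to provide.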

The current optional cache implementation uses caffeine instead of lucene-monitor's simpler ConcurrentHashMap. It's worth noting that this cache should likely be quite a bit larger than your average query or document cache since query parsing involves a non-trivial amount of compute and disk I/O (especially for large results and/or queries). It's also worth noting that lucene-monitor will keep all the indexed queries cached in memory in its default configuration. A unique solr-monitor feature is a bespoke cache warmer that tries to populate the cache with approximately all the queries updated since the last commit. This approach was added to have a baseline when comparing with lucene-monitor performance. The goal was to make it possible to effectively cache all queries in memory (since that is what lucene-monitor enables by default) but not necessarily require it.
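As a rough illustration of the caching idea, the sketch below keeps parsed queries in a bounded LRU map and offers a warm-up method for recently updated queries. It deliberately swaps in a stdlib `LinkedHashMap` for Caffeine so that it is self-contained; the class and method names are hypothetical and none of this is the PR's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Bounded LRU cache of parsed queries; a stdlib stand-in for the Caffeine cache (hypothetical). */
final class ParsedQueryCacheSketch<Q> {
  private final Map<String, Q> cache;
  private int parseCount = 0;

  ParsedQueryCacheSketch(final int maxSize) {
    // accessOrder=true keeps entries ordered from least to most recently used.
    this.cache =
        new LinkedHashMap<String, Q>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, Q> eldest) {
            return size() > maxSize;
          }
        };
  }

  /** Returns the cached parse result and runs the (expensive) parser only on a miss. */
  synchronized Q get(String queryId, String queryText, Function<String, Q> parser) {
    Q cached = cache.get(queryId);
    if (cached == null) {
      cached = parser.apply(queryText);
      parseCount++;
      cache.put(queryId, cached);
    }
    return cached;
  }

  /** Pre-populates the cache, e.g. with queries updated since the last commit. */
  synchronized void warm(Map<String, String> recentlyUpdated, Function<String, Q> parser) {
    for (Map.Entry<String, String> e : recentlyUpdated.entrySet()) {
      cache.put(e.getKey(), parser.apply(e.getValue()));
    }
  }

  /** Number of cache misses that required a parse. */
  synchronized int parseCount() {
    return parseCount;
  }

  synchronized int size() {
    return cache.size();
  }
}
```

A single coarse lock is used here only for brevity; a production cache such as Caffeine handles concurrency and eviction far more efficiently.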

Currently the PR has some visitor classes in the org.apache.lucene.monitor package that expose certain lucene-monitor internals. If this approach gets accepted then the lucene project will likely need to be updated to expose what is necessary.

Tests

testMonitorQuery: basic functionality before and after an update
testNoDocListInResponse: The current API allows for two types of responses, a special monitorDocuments response that can relay lucene-monitor's response structure and unique features such as "reverse" highlights. The other response structure is a regular solr document list with each "response" document really referring to a query that matches the "real" document that is being matched. This test ensures you can disable the solr document list from coming in the response.
testDefaultParser: validate that solr-monitor routes to default parser when none is selected.
testDisjunctionQuery: validate that subqueries of a disjunction get indexed separately.
testNoDanglingDecomposition: validate that deleting a top-level query also removes all the child disjuncts.
testNotQuery
testWildCardQuery
testDefaultQueryMatchTypeIsNone: If no match type is selected with the monitorMatchType field then only a solr document list is returned (same behavior as "forward" search).
testMultiDocHighlightMatchType: Test highlight matcher on a multi-document batch and ensure it returns the character offsets and positions of all individual matches. It is worth noting that percolator returns the actual matching text snippet. This is something we could consider supporting within solr or adding to lucene-monitor.
testHighlightMatchType: Single doc highlight test. Slightly different from the one above in that the highlighted field does not need to be storeOffsetsWithPositions="true", which is pretty convenient. I am not sure if I am relying on a MemoryIndex implementation detail, but it is a bit tedious for users to update their schemas to have storeOffsetsWithPositions="true" just to get character offsets back from the highlight matcher. I also don't know if there is a better way to handle the multi-doc case .. maybe break each doc into its own MemoryIndex reader so that we get offsets by default without specifying storeOffsetsWithPositions="true"?
manySegmentsQuery: The cache warmer has reader-leaf-dependent logic so this was included to verify everything works on a multi-segment index.

All of the above are also tested with the custom configurations below:

Parallel matcher - lucene monitor allows for running the final, most-expensive matching step in a multi-threaded environment. The current solr-monitor implementation allows for this with some restrictions. For instance, it is difficult to populate a document response list from a fully asynchronous matching component because it would require awkwardly opening and closing leaf collectors on-demand. The more idiomatic solr approach would be to just run this on many shards and gain parallelism as recommended here. Still, during testing I found that a fully async postfilter in a single shard had better performance than an equally-parallel multi-sharded, synchronous postfilter so I've decided to keep it in the initial proposal. On top of that, it helps achieve greater feature parity with lucene-monitor (which obviously has no concept of sharding so can only parallelize with a special matcher).
Stored monitor query - allow storing queries with stored="true" instead of using the recommended docValues. docValues have stricter single-value size limits, so this is mainly to accommodate humongous queries.
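The parallel matcher configuration above boils down to fanning the expensive final matching step out over a thread pool. The following is a minimal sketch of that idea in plain Java, assuming candidate queries can be modelled as predicates over a document; `ParallelMatchSketch` and its `match` method are hypothetical names, not part of lucene-monitor or this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Toy parallel matcher: evaluates candidate queries against one document on a thread pool. */
final class ParallelMatchSketch {

  /** Returns the indexes of the matching candidates, in candidate order. */
  static <D> List<Integer> match(final D document, List<Predicate<D>> candidates, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> futures = new ArrayList<>();
      for (final Predicate<D> candidate : candidates) {
        Callable<Boolean> task = () -> candidate.test(document);
        futures.add(pool.submit(task));
      }
      // Collect results in submission order so the output is deterministic.
      List<Integer> matches = new ArrayList<>();
      for (int i = 0; i < futures.size(); i++) {
        if (futures.get(i).get()) {
          matches.add(i);
        }
      }
      return matches;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while matching", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("candidate evaluation failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

Results are only gathered after all tasks are submitted, which sidesteps the difficulty mentioned above of feeding a document list from a fully asynchronous component, at the cost of blocking the calling thread.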

I'll report here that I also have some local performance tests which are difficult to port but that helped guide some of the decisions so far. I've also "manually" tested the custom tlog deserialization of the derived query field but this verification should probably go somewhere in a TlogReplay test. I haven't gone down that rabbit hole yet as I wanted to poll for some feedback first. The reason I skip TLog for the derived query fields is because these fields wrap a tokenstream which in itself is difficult to serialize without a custom analyzer. The goal was to let users leverage their existing document schema as often as possible instead of having to create something custom for the query-monitoring use-case.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check. TODO some apparently unrelated test failures
I have added tests for my changes.
I have added documentation for the Reference Guide

cpoerschke · 2024-04-15T11:48:18Z

solr/modules/monitor/src/test-files/solr/collection1/solrconfig-no-cache.xml

+  -->
+
+<config>
+  <luceneMatchVersion>9.4</luceneMatchVersion>


minor/maybe

Suggested change

-  <luceneMatchVersion>9.4</luceneMatchVersion>
+  <luceneMatchVersion>${tests.luceneMatchVersion:LATEST}</luceneMatchVersion>

kotman12 · 2024-04-15

thanks for catching this! .. I had to change it in a few other places so I made a separate commit

dsmiley · 2024-04-18T14:00:33Z

solr/modules/monitor/build.gradle

+
+apply plugin: 'java-library'
+
+description = 'Apache Solr Monitor'


That is so puzzling to anyone who isn't intimately familiar with Lucene Monitor. I don't even think we should be calling this "Solr Monitor"; it looks like an infrastructure monitoring thing. Possibly "Solr-Lucene-Monitor" but still... a puzzling name.

kotman12 · 2024-04-18

This is a great point .. The library used to be called luwak which I find to be a much better name... I'll try to think of a better name (maybe solr-reverse-search or solr-query-alerting). I'll reply in more detail to your mailing list message also touching on solr.cool and the sandbox.

janhoy · 2024-04-18

Saved Searches is a common name, I assume it is possible to list a user's saved searches too. Or Alerting, but then most people will expect there to be some functionality to ship alerts somewhere...

kotman12 · 2024-04-18

You're right, if anything this might be a part of some larger alerting system, but "saved search" is more accurate.

almogtavor · 2024-04-18

Saved searches is a pretty indicative name. Percolator is also a known name for this kind of functionality.

kotman12 · 2024-04-18

Interesting, I thought ES invented "percolator" as more of a metaphor... I wasn't aware that this is a more generic name. I was worried that "percolator" might clash too much with ES.

cpoerschke

Hi @kotman12 - thanks for working on this!

I started "just browsing" on this PR this morning and so the inline comments may seem a bit random or general but sharing them anyhow in case they're useful. Not considered any naming or solr-versus-solr-sandbox-versus-elsewhere aspects at this point i.e. was just browsing.

cpoerschke · 2024-04-19T08:46:37Z

solr/modules/monitor/src/java/org/apache/solr/monitor/update/MonitorUpdateProcessorFactory.java

+  private String queryFieldNameOverride;
+  private String payloadFieldNameOverride;


subjective: could maybe initialise to the defaults here, overriding in init if applicable, and then avoid the null-or-not checks in getInstance

Suggested change

-  private String queryFieldNameOverride;
-  private String payloadFieldNameOverride;
+  private String queryFieldName = MonitorFields.MONITOR_QUERY;
+  private String payloadFieldName = MonitorFields.PAYLOAD;

kotman12 · 2024-04-24

yea .. this makes sense to me. Is there really any value to these overrides though? I don't have a good reason why I chose to make these two fields overridable but not the other reserved fields. Is it safe to assume that a field prefixed by _ won't be in the user space anyway? If that is the case then this override business is overkill. Otherwise, we probably should make everything overridable.

cpoerschke · 2024-04-19T08:48:13Z

solr/modules/monitor/src/java/org/apache/solr/monitor/update/MonitorUpdateProcessorFactory.java

+
+public class MonitorUpdateProcessorFactory extends UpdateRequestProcessorFactory {
+
+  private Presearcher presearcher = PresearcherFactory.build();


noting that init has PresearcherFactory.build(presearcherType) also.

Suggested change

-  private Presearcher presearcher = PresearcherFactory.build();
+  private Presearcher presearcher;

kotman12 · 2024-04-24

With the latest change the Presearcher only gets initialized in the ReverseQueryParserPlugin and I share that core-level-singleton by making MonitorUpdateProcessorFactory a SolrCoreAware type. Not sure if there is a better pattern for this? This would be admittedly nicer with some kind of DI mechanism.

cpoerschke · 2024-04-19T08:54:01Z

solr/modules/monitor/src/java/org/apache/solr/monitor/package-info.java

+ */
+
+/** This package contains Solr's lucene monitor integration. */
+package org.apache.solr.monitor;


minor: surprised that package-info.java seems to be not needed for the lucene/monitor sub-directory, or maybe the checking logic just isn't checking for it.

kotman12 · 2024-04-20

I was hoping to actually make some changes to lucene in order to avoid the need for that package. I wanted to gauge the viability of "solr-monitor" before suggesting changes to the lucene upstream. The way I see it, lucene-monitor has very nice optimizations for making saved search fast but the interface is tightly sealed and makes very opinionated choices about stuff like caching which makes it hard to integrate into something like solr. Not to mention, lucene-monitor's index isn't "pluggable" or exposed in any way. It just seemed easier to expose the relevant algorithms within lucene-monitor rather than trying to hack in the whole kitchen sink into solr. Sorry about the tangent 😃

cpoerschke · 2024-04-19T09:00:03Z

solr/modules/monitor/src/java/org/apache/solr/monitor/update/MonitorUpdateRequestProcessor.java

+    this.queryFieldName = queryFieldName;
+    this.payloadFieldName = payloadFieldName;
+    this.core = core;
+    this.indexSchema = core.getLatestSchema();


Wondering if there's an assumption somehow w.r.t. queryFieldName and payloadFieldName being or not being within indexSchema -- and if there's an assumption to check it somewhere, maybe at initialisation time rather than when the first document(s) arrive to make use of the fields etc.

Likewise w.r.t. the MonitorFields.RESERVED_MONITOR_FIELDS referenced later on.

kotman12 · 2024-04-27

I've added a narrower MonitorFields.REQUIRED_MONITOR_FIELDS set which gets cross-validated against the schema in MonitorUpdateProcessorFactory::inform. In the same place I've also added some more specific schema-validations which get invoked for more specific configurations, i.e. which type of presearcher you are using.
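For readers following along, the kind of fail-fast check described here can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name and the field names used in the usage note are hypothetical, and the real validation in the PR works against Solr's schema objects rather than sets of strings.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy init-time validation: fail fast if the schema lacks a required field (hypothetical). */
final class RequiredFieldCheckSketch {

  /** Returns the required field names that the schema does not define, sorted for stable messages. */
  static Set<String> missing(Set<String> required, Set<String> schemaFields) {
    Set<String> missing = new TreeSet<>(required);
    missing.removeAll(schemaFields);
    return missing;
  }

  /** Throws at initialisation time instead of waiting for the first document to arrive. */
  static void validate(Set<String> required, Set<String> schemaFields) {
    Set<String> missing = missing(required, schemaFields);
    if (!missing.isEmpty()) {
      throw new IllegalStateException("Schema is missing required monitor fields: " + missing);
    }
  }
}
```

For example, `validate(Set.of("_query_id_", "_payload_"), Set.of("_query_id_"))` would throw and name `_payload_` in the message; both field names are made up for this example.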

cpoerschke · 2024-04-19T09:16:03Z

solr/modules/monitor/src/java/org/apache/solr/monitor/search/ReverseQueryParserPlugin.java

+  @Override
+  public void close() throws IOException {
+    super.close();
+  }
+
+  @Override
+  public void init(NamedList<?> args) {
+    super.init(args);
+  }


Suggested change

-  @Override
-  public void close() throws IOException {
-    super.close();
-  }
-
-  @Override
-  public void init(NamedList<?> args) {
-    super.init(args);
-  }

solr/modules/monitor/src/java/org/apache/solr/monitor/search/ReverseQueryParser.java

solr/modules/monitor/src/java/org/apache/solr/monitor/search/ReverseSearchComponent.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

kotman12 · 2024-04-20T01:26:31Z

solr/modules/monitor/src/java/org/apache/solr/monitor/search/ParallelSolrMatcherSink.java

+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+
+class ParallelSolrMatcherSink<T extends QueryMatch> implements SolrMatcherSink {


@cpoerschke I wonder if this whole class would be obviated by #2248 .. I found that because there can be significant overhead in pre-processing documents for reverse search (mainly analysis), parallelizing by throwing more solr cores at the problem wasn't quite as fast as simply parallelizing the expensive post filter. But it seems that if that PR (or something similar) were merged we might already run post filters in parallel for each segment?

cpoerschke · 2024-04-25T17:19:40Z

I was trying to learn how the main configuration bits fit together here and, at a high level, how the reverse search idea works. My solr-monitor-naive-dinner-demo branch (or #2421 diff) off this pull request's branch is a side effect of that, and my understanding so far, based on it, is that:

the in-memory state is in the Presearcher object in the ReverseQueryParserPlugin class object (and in the solr-monitor-naive-dinner-demo I just used a simple Monitor object instead of the Presearcher object)
the state is updated via the MonitorUpdateRequestProcessor i.e. saved searches are added as MonitorQuery objects to the Monitor object (and updating of the Presearcher object is a bit different)
the state is accessed via the ReverseSearchComponent component (currently non-distributed but conceptually distributed would work too?)

Is that basic understanding correct? As a next step I might go learn more about the Presearcher itself.

kotman12 · 2024-04-25T20:44:53Z

I'll give the PR a look but on an architectural level it is similar to what you describe. The custom update processor adds the saved search to some stateful component. The reverse search component takes a solr doc and converts it to a lucene query. It then runs that document-in-query-form against the stateful component to find the matching saved searches. When I first looked at this I wanted to use the monitor as the stateful component, but some problems quickly emerged with that idea. My main concerns with wiring a Monitor straight into solr were:

Handling commit/rollback and what to update the tlog with if you are also writing to a "sidecar" monitor object?
Handling persistence. Currently the Monitor has its own tightly sealed index. It can be configured for persistence but if you want to peek at the segments a monitor is writing to disk it might not be easy, especially to handle configurations like tlog+pull. The alternative is to use only the in-memory Monitor configurations but that has limitations and takes away precious resources from the {cacheId -> deserialized query} cache.
Bringing me to my final point: the cache a Monitor object wraps is a simple ConcurrentHashMap and the Monitor itself is updated with a very coarse-grained lock that can block reads for a long time (because it synchronizes the map with the index). It just doesn't feel like it jibes with the solr approach to concurrency, which is much more sophisticated (it is a fully fledged db after all). We could make the Monitor cache more configurable in the upstream lucene monitor repo, but in my opinion lucene monitor tries to do too much state management that it is not that good at. The most valuable thing to take advantage of is its sophisticated reverse search methods (query decomposition for faster matching, query tokenization for pre-search, term weighting, optimized document-to-query conversion with term-acceptor, and probably something else I am forgetting).

kotman12 · 2024-04-29T13:20:54Z

@cpoerschke I was reading your last comment more carefully and I want to stress that the presearcher, once constructed, should be completely stateless in the current proposal. The whole point of extracting the internal bits of lucene-monitor was to avoid its cumbersome internal state management that doesn't really fit nicely into solr (at least with anything I've been able to come up with). The presearcher merely exposes the methods to efficiently convert queries into documents (and vice versa) which make reverse search faster. That is why it is used by both MonitorUpdateRequestProcessor and ReverseQueryParser.
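To illustrate what "stateless" means here, a toy presearcher can be written as nothing but pure functions: one that turns a saved query into terms to index, and one that turns an incoming document into the terms used to select candidate queries. The sketch below is a deliberately naive stand-in in plain Java; the class and method names are hypothetical, and lucene-monitor's real Presearcher implementations are considerably more selective about which terms they pick.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Toy stateless presearcher: pure functions only, no index and no cache of its own (hypothetical). */
final class StatelessPresearcherSketch {

  /** Query side: terms to index alongside a saved query (here simply every token). */
  static Set<String> termsToIndex(String savedQuery) {
    return tokenize(savedQuery);
  }

  /** Document side: terms forming the disjunction that selects candidate saved queries. */
  static Set<String> candidateTerms(String documentText) {
    return tokenize(documentText);
  }

  /** A saved query is a candidate if it shares at least one term with the document. */
  static boolean isCandidate(Set<String> indexedQueryTerms, Set<String> documentTerms) {
    for (String term : indexedQueryTerms) {
      if (documentTerms.contains(term)) {
        return true;
      }
    }
    return false;
  }

  /** Lowercases and splits on non-word characters; a stand-in for real analysis. */
  private static Set<String> tokenize(String text) {
    Set<String> terms = new LinkedHashSet<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
      if (!token.isEmpty()) {
        terms.add(token);
      }
    }
    return terms;
  }
}
```

Since no method touches shared mutable state, the same instance-free logic can serve both the update path and the query path, which mirrors why the presearcher is used by both MonitorUpdateRequestProcessor and ReverseQueryParser.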

kotman12 added 4 commits March 29, 2024 23:18

integrate lucene-monitor into solr

bba4191

move MonitorDataValues and check in license

fe3b413

update versions.lock

0989f1c

add package-info to monitor packages

62ddcbc

github-actions bot added the tool:build, tests, and dependencies labels Apr 1, 2024

kotman12 added 15 commits April 1, 2024 19:29

extract helper method

33eccf7

apply errorprone suggestions

fa7fb40

implement highlight matches

11a1138

AggregatingMatcher -> MatchesAggregator

7526b2f

make monitor query cache optional

6ac7a5e

move manySegmentsTest to ParallelMonitorSolrQueryTest

2362ce7

call CandidateMatcher directly

e5a1382

remove doc forwarding callback

b7b58b3

instantiate decoder in outer loop

ce40d60

ignore score for relevant match types

a201676

read MAX_SIZE_PARAM for maxSize

60b0eb5

don't drop cause

fd04c4e

remove superstitious delete calls

32cb6f6

add testDeleteByQueryId

b6d369b

enable setting maxRamMB for monitor cache

0ab5a6b

cpoerschke reviewed Apr 15, 2024

hardcoding luceneMatchVersion is bad

0b0120e

dsmiley reviewed Apr 18, 2024

cpoerschke reviewed Apr 19, 2024

kotman12 and others added 3 commits April 19, 2024 18:32

add multi-pass presearcher and optional field aliasing

24c1bf5

more accurate error

ee2992d

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

redundant override

6815ea1

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

kotman12 commented Apr 20, 2024

kotman12 added 2 commits April 24, 2024 16:38

wrap reserved field with _ and remove override behavior

995cfa2

validate MonitorFields.RESERVED_MONITOR_FIELDS in schema

a2419ff

cpoerschke mentioned this pull request Apr 25, 2024

SOLR-4587: solr-monitor-naive-dinner-demo #2421

Closed

kotman12 added 2 commits April 26, 2024 20:23

stricter validations of required fields

0fde7ef

narrow scope of __anytokenfield validation

5968754
