SOLR-4587: integrate lucene-monitor into solr #2382
base: main
Conversation
<config>
  <luceneMatchVersion>9.4</luceneMatchVersion>
minor/maybe
- <luceneMatchVersion>9.4</luceneMatchVersion>
+ <luceneMatchVersion>${tests.luceneMatchVersion:LATEST}</luceneMatchVersion>
thanks for catching this! .. I had to change it in a few other places so I made a separate commit
apply plugin: 'java-library'

description = 'Apache Solr Monitor'
That is so puzzling to anyone who isn't intimately familiar with Lucene Monitor. I don't even think we should be calling this "Solr Monitor"; it looks like an infrastructure monitoring thing. Possibly "Solr-Lucene-Monitor" but still... a puzzling name.
This is a great point .. The library used to be called luwak which I find to be a much better name... I'll try to think of a better name (maybe solr-reverse-search or solr-query-alerting). I'll reply in more detail to your mailing list message also touching on solr.cool and the sandbox.
Saved Searches is a common name; I assume it is possible to list a user's saved searches too. Or Alerting, but then most people will expect there to be some functionality to ship alerts somewhere...
You're right, if anything this might be a part of some larger alerting system, but "saved search" is more accurate.
Saved searches is a pretty indicative name. Percolator is also a known name for this kind of functionality.
Interesting, I thought ES invented "percolator" as more of a metaphor... I wasn't aware that this is a more generic name. I was worried that "percolator" might clash too much with ES.
Hi @kotman12 - thanks for working on this!
I started "just browsing" this PR this morning, so the inline comments may seem a bit random or general, but I'm sharing them anyhow in case they're useful. I haven't considered any naming or solr-versus-solr-sandbox-versus-elsewhere aspects at this point, i.e. I was just browsing.
private String queryFieldNameOverride;
private String payloadFieldNameOverride;
subjective: could maybe initialise to the defaults here, overriding in `init` if applicable, and then avoid the null-or-not checks in `getInstance`
- private String queryFieldNameOverride;
- private String payloadFieldNameOverride;
+ private String queryFieldName = MonitorFields.MONITOR_QUERY;
+ private String payloadFieldName = MonitorFields.PAYLOAD;
yea .. this makes sense to me. Is there really any value to these overrides though? I don't have a good reason why I chose to make these two fields overridable but not the other reserved fields. Is it safe to assume that a field prefixed by `_` won't be in the user space anyway? If that is the case then this override business is overkill. Otherwise, we probably should make everything overridable.
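The default-initialisation pattern suggested above can be sketched in plain Java. This is only an illustration of the idea, not the PR's actual code: the class, field names, and default values here are hypothetical stand-ins for `MonitorFields.MONITOR_QUERY` / `MonitorFields.PAYLOAD`.

```java
import java.util.Map;

// Illustrative sketch: initialise fields to defaults up front and let init()
// overwrite them, so later accessors never need a null-or-not check.
class MonitorFieldsConfig {
    // Hypothetical defaults standing in for MonitorFields constants.
    static final String DEFAULT_QUERY_FIELD = "_query_";
    static final String DEFAULT_PAYLOAD_FIELD = "_payload_";

    private String queryFieldName = DEFAULT_QUERY_FIELD;
    private String payloadFieldName = DEFAULT_PAYLOAD_FIELD;

    // Override the defaults only when the config args actually supply a value.
    void init(Map<String, String> args) {
        if (args.containsKey("queryFieldName")) {
            queryFieldName = args.get("queryFieldName");
        }
        if (args.containsKey("payloadFieldName")) {
            payloadFieldName = args.get("payloadFieldName");
        }
    }

    String queryFieldName() { return queryFieldName; }     // never null
    String payloadFieldName() { return payloadFieldName; } // never null
}
```

With this shape, `getInstance`-style callers can use the accessors directly instead of branching on whether an override was configured.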
public class MonitorUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  private Presearcher presearcher = PresearcherFactory.build();
noting that `init` has `PresearcherFactory.build(presearcherType)` also.
- private Presearcher presearcher = PresearcherFactory.build();
+ private Presearcher presearcher;
With the latest change the Presearcher only gets initialized in the ReverseQueryParserPlugin, and I share that core-level singleton by making MonitorUpdateProcessorFactory a SolrCoreAware type. Not sure if there is a better pattern for this? This would admittedly be nicer with some kind of DI mechanism.
*/

/** This package contains Solr's lucene monitor integration. */
package org.apache.solr.monitor;
minor: surprised that `package-info.java` seems to not be needed for the `lucene/monitor` sub-directory, or maybe the checking logic just isn't checking for it.
I was hoping to actually make some changes to lucene in order to avoid the need for that package. I wanted to gauge the viability of "solr-monitor" before suggesting changes to the lucene upstream. The way I see it, lucene-monitor has very nice optimizations for making saved search fast, but the interface is tightly sealed and makes very opinionated choices about stuff like caching, which makes it hard to integrate into something like solr. Not to mention, lucene-monitor's index isn't "pluggable" or exposed in any way. It just seemed easier to expose the relevant algorithms within lucene-monitor rather than trying to hack the whole kitchen sink into solr. Sorry about the tangent 😃
this.queryFieldName = queryFieldName;
this.payloadFieldName = payloadFieldName;
this.core = core;
this.indexSchema = core.getLatestSchema();
Wondering if there's an assumption somehow w.r.t. `queryFieldName` and `payloadFieldName` being or not being within `indexSchema` -- and if there's an assumption, to check it somewhere, maybe at initialisation time rather than when the first document(s) arrive to make use of the fields etc. Likewise w.r.t. the `MonitorFields.RESERVED_MONITOR_FIELDS` referenced later on.
I've added a narrower `MonitorFields.REQUIRED_MONITOR_FIELDS` set which gets cross-validated against the schema in `MonitorUpdateProcessorFactory::inform`. In the same place I've also added some more specific schema validations which get invoked for more specific configurations, i.e. depending on which type of presearcher you are using.
@Override
public void close() throws IOException {
  super.close();
}

@Override
public void init(NamedList<?> args) {
  super.init(args);
}
- @Override
- public void close() throws IOException {
-   super.close();
- }
- @Override
- public void init(NamedList<?> args) {
-   super.init(args);
- }
solr/modules/monitor/src/java/org/apache/solr/monitor/search/ReverseQueryParser.java (outdated; resolved)
solr/modules/monitor/src/java/org/apache/solr/monitor/search/ReverseSearchComponent.java (outdated; resolved)
Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

class ParallelSolrMatcherSink<T extends QueryMatch> implements SolrMatcherSink {
@cpoerschke I wonder if this whole class would be obviated by #2248 .. I found that because there can be significant overhead in pre-processing documents for reverse search (mainly analysis), parallelizing by throwing more solr cores at the problem wasn't quite as fast as simply parallelizing the expensive post filter. But it seems that if that PR (or something similar) was merged we might already run post filters in parallel for each segment?
So I was trying to learn how the main configuration bits fit together here, and the high-level reverse search idea. My solr-monitor-naive-dinner-demo branch (or the #2421 diff) off this pull request's branch is a side effect of that, and my understanding so far based on it is that:

Is that basic understanding correct? As a next step I might go learn more about the
I'll give the PR a look, but on an architectural level it is similar to what you describe. The custom update processor adds the saved search to some stateful component. The reverse search component takes a solr doc and converts it to a lucene query. It then runs that document-in-query-form against the stateful component to find the matching saved searches. When I first looked at this I wanted to use the monitor as the stateful component, but some problems quickly emerged with that idea. My main concerns with wiring a Monitor straight into solr were:
@cpoerschke I was reading your last comment more carefully and I want to stress that the presearcher, once constructed, should be completely stateless in the current proposal. The whole point of extracting the internal bits of lucene-monitor was to avoid its cumbersome internal state management, which doesn't really fit nicely into solr (at least with anything I've been able to come up with). The presearcher merely exposes the methods to efficiently convert queries into documents (and vice versa) which make reverse search faster. That is why it is used by both
https://issues.apache.org/jira/browse/SOLR-4587
Description
The module hopes to simplify distributing and scaling query indexes for monitoring and alerting workflows (also known as reverse search) by providing a bridge between a solr-managed search index and lucene-monitor's efficient reverse search algorithms.
Here is some evidence that the community might find this useful.
Solution
This is still a WiP but I am opening it up as a PR to get community feedback. The current approach is to ingest queries as solr documents, decompose them for performance, and then use the child-document feature to index the decomposed subqueries under one atomic parent document block. On the search side the latest approach is to use a dedicated component that creates hooks into lucene-monitor's Presearcher, QueryTermFilter and CandidateMatcher.
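To illustrate the ingestion shape, a decomposed query indexed as one atomic parent block might look roughly like the following JSON update document. All field names here are hypothetical placeholders, not the PR's actual schema; only `_childDocuments_` is standard Solr nested-document syntax.

```json
{
  "id": "alert-42",
  "monitor_query": "(title:solr OR title:lucene) AND body:monitor",
  "_childDocuments_": [
    { "id": "alert-42#0", "subquery": "title:solr AND body:monitor" },
    { "id": "alert-42#1", "subquery": "title:lucene AND body:monitor" }
  ]
}
```

Indexing the disjuncts as children of one parent is what lets the whole decomposed query be added or replaced atomically as a single block.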
The current optional cache implementation uses caffeine instead of lucene-monitor's simpler ConcurrentHashMap. It's worth noting that this cache should likely be quite a bit larger than your average query or document cache, since query parsing involves a non-trivial amount of compute and disk I/O (especially for large results and/or queries). It's also worth noting that lucene-monitor will keep all the indexed queries cached in memory in its default configuration. A unique solr-monitor feature is a bespoke cache warmer that tries to populate the cache with approximately all the queries updated since the last commit. This was added to have a baseline when comparing with lucene-monitor performance: the goal was to make it possible to effectively cache all queries in memory (since that is what lucene-monitor enables by default) but not necessarily require it.
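The parse-once idea behind that cache can be shown with a self-contained stand-in. This sketch deliberately uses a plain `ConcurrentHashMap` (lucene-monitor's simpler approach) rather than caffeine, so it omits the bounded, size-based eviction that makes caffeine the better fit in the PR; the "parsing" here is also simulated with a string instead of building a real lucene `Query`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the parsed-query cache: parsing is the expensive step
// (compute plus disk I/O), so parse each stored query string at most
// once and reuse the result on subsequent matches.
class ParsedQueryCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger parseCount = new AtomicInteger(); // instrumentation for the sketch

    // In the real module this would run the query parser; here we just
    // tag the string so the caching behaviour is observable.
    String getOrParse(String queryString) {
        return cache.computeIfAbsent(queryString, q -> {
            parseCount.incrementAndGet();
            return "PARSED(" + q + ")";
        });
    }
}
```

A real implementation would swap the map for a `Caffeine`-built cache so that memory use stays bounded when the query set no longer fits in heap.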
Currently the PR has some visitor classes in the org.apache.lucene.monitor package that expose certain lucene-monitor internals. If this approach gets accepted then the lucene project will likely need to be updated to expose what is necessary.

Tests

- A dedicated monitorDocuments response that can relay lucene-monitor's response structure and unique features such as "reverse" highlights. The other response structure is a regular solr document list, with each "response" document really referring to a query that matches the "real" document being matched. A test ensures you can disable the solr document list from coming back in the response.
- If the request omits the monitorMatchType field then only a solr document list is returned (same behavior as "forward" search).
- Character offsets appear to come back from MemoryIndex without storeOffsetsWithPositions="true", which is pretty convenient. I am not sure if I am relying on a MemoryIndex implementation detail, but it is a bit tedious for users to update their schemas to have storeOffsetsWithPositions="true" just to get character offsets back from the highlight matcher. I also don't know if there is a better way to handle the multi-doc case .. maybe break each doc into its own MemoryIndex reader so that we get offsets by default without specifying storeOffsetsWithPositions="true"?

All of the above are also tested with the below custom configurations:

- stored="true" instead of using the recommended docValues. docValues have stricter single-value size limits, so this is mainly to accommodate humongous queries.

I'll report here that I also have some local performance tests which are difficult to port but that helped guide some of the decisions so far. I've also "manually" tested the custom tlog deserialization of the derived query field, but this verification should probably go somewhere in a TlogReplay test. I haven't gone down that rabbit hole yet as I wanted to poll for some feedback first. The reason I skip TLog for the derived query fields is because these fields wrap a tokenstream, which in itself is difficult to serialize without a custom analyzer. The goal was to let users leverage their existing document schema as often as possible instead of having to create something custom for the query-monitoring use-case.

Checklist
Please review the following and check all that apply:

- I have developed this patch against the main branch.
- I have run ./gradlew check. TODO some apparently unrelated test failures