Add support for phrase search with wildcard and prefix matching for Lucene indexed tables #12680
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #12680 +/- ##
============================================
- Coverage 61.75% 61.58% -0.18%
+ Complexity 207 198 -9
============================================
Files 2436 2463 +27
Lines 133233 134634 +1401
Branches 20636 20848 +212
============================================
+ Hits 82274 82909 +635
- Misses 44911 45517 +606
- Partials 6048 6208 +160
...java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java
@@ -150,10 +155,14 @@ public MutableRoaringBitmap getDocIds(String searchQuery) {
    // be instantiated per query. Analyzer on the other hand is stateless
    // and can be created upfront.
    QueryParser parser = new QueryParser(_column, _analyzer);
    parser.setAllowLeadingWildcard(true);
same as above
In case you missed it: should we set this flag to true only when _enablePrefixSuffixMatchingInPhraseQueries is set to true?
There are tradeoffs: if this is not set to true, queries with a leading wildcard will fail outright. On the other hand, allowing it risks expensive queries (per the Lucene documentation). I am okay with either choice.
Done. Better to protect it behind the flag for now too.
...t-segment-local/src/main/java/org/apache/pinot/segment/local/utils/LuceneTextIndexUtils.java
@@ -72,6 +74,8 @@ public TextIndexConfig(@JsonProperty("disabled") Boolean disabled, @JsonProperty
        luceneMaxBufferSizeMB == null ? LUCENE_INDEX_DEFAULT_MAX_BUFFER_SIZE_MB : luceneMaxBufferSizeMB;
    _luceneAnalyzerClass = (luceneAnalyzerClass == null || luceneAnalyzerClass.isEmpty())
        ? FieldConfig.TEXT_INDEX_DEFAULT_LUCENE_ANALYZER_CLASS : luceneAnalyzerClass;
    _enablePrefixSuffixMatchingInPhraseQueries =
        enablePrefixSuffixMatchingInPhraseQueries == null ? false : enablePrefixSuffixMatchingInPhraseQueries;
nit: hoist the default false value to a static constant (similar to LUCENE_INDEX_DEFAULT_USE_COMPOUND_FILE).
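A minimal sketch of what this nit is asking for, using a hypothetical, simplified stand-in class (`TextIndexConfigSketch`) rather than the real `TextIndexConfig`. The constant name is an assumption chosen for illustration and is not taken from the PR.

```java
// Hypothetical, simplified stand-in for the config class discussed above.
final class TextIndexConfigSketch {
  // Assumed constant name: the default is hoisted out of the constructor
  // so it is declared once instead of as an inline `false`.
  static final boolean DEFAULT_ENABLE_PREFIX_SUFFIX_MATCHING_IN_PHRASE_QUERIES = false;

  private final boolean _enablePrefixSuffixMatchingInPhraseQueries;

  // A null argument models the property being absent from the table config.
  TextIndexConfigSketch(Boolean enablePrefixSuffixMatchingInPhraseQueries) {
    _enablePrefixSuffixMatchingInPhraseQueries = enablePrefixSuffixMatchingInPhraseQueries == null
        ? DEFAULT_ENABLE_PREFIX_SUFFIX_MATCHING_IN_PHRASE_QUERIES
        : enablePrefixSuffixMatchingInPhraseQueries;
  }

  boolean isEnablePrefixSuffixMatchingInPhraseQueries() {
    return _enablePrefixSuffixMatchingInPhraseQueries;
  }
}
```

With this shape, tests and other callers can refer to the constant instead of repeating the literal.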
public class LuceneTextIndexUtilsTest {
  @Test
  public void testBooleanQueryRewrittenToSpanQuery() {
fyi: I used this code for testing and understanding the PR. The approach in this PR is quite interesting!
https://gist.github.com/ankitsultana/88181cc3bda5113fec83175cfa867445
String query =
    "SELECT INT_COL, SKILLS_TEXT_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '*ealtime streaming system*') "
        + "LIMIT 50000";
Can we handle/add a test case for boolean span queries? i.e.
TEXT_MATCH(SKILLS_TEXT_COL, '*ealtime streaming system* AND *chine learn*')
Test cases added and they work! Though it is not the main intent of this PR.
What do you think about enabling this by default rather than hiding the feature behind a config? It seems that we should be able to infer whether a query is valid single-term Lucene syntax/phrase search or requires a span query, e.g. for StandardAnalyzer-parsed text
For now I think it is better to hide this feature behind a config. A few reasons:
Thanks for adding the feature! Can you please add a
@chenboat following up here.. did you get a chance to add docs?
Release note section added.
Issue 10863
Pinot's TEXT_MATCH filter does not currently support phrase search with wildcard and prefix matching (e.g., "*pache pino*" to match "Apache Pinot") directly. Such queries are very common in use cases like log search, where users need to find matches in long text. Today one has to work around this by external means (e.g., concatenating all the words in a paragraph into one longer string and running a regex query on it), which usually incurs much higher query latency.
This PR adds support for phrase search with wildcard and prefix matching on Lucene-indexed tables. The feature is enabled through a config on the Lucene text-indexed column; it defaults to false (disabled). Once enabled, users can filter with a text match function such as text_match(col, '*pache pino*').
We have tested this feature in our internal environment, where it can process 150+ GB of text data in 5 seconds on one server. In our phrase-matching tests it can be 3x faster.
release-notes
To enable the phrase wildcard search feature, add the following config to the column's Lucene text index config property list:
"enablePrefixSuffixMatchingInPhraseQueries": "true"
With this config enabled, one can perform a phrase wildcard search using syntax such as
text_match(col, '*pache pino*'), which matches the string "Apache pinot".
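To make the matching semantics concrete, here is a conceptual sketch in plain Java. It is not the PR's implementation: per the test name above, the PR rewrites the parsed query into Lucene span queries, whereas this sketch uses no Lucene at all. The class `PhraseWildcardSketch`, its whitespace tokenization, and its lowercasing are simplifying assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Pinot or Lucene.
final class PhraseWildcardSketch {
  // Turn one query term into an anchored regex: '*' becomes ".*",
  // everything else is matched literally.
  static Pattern toPattern(String term) {
    StringBuilder regex = new StringBuilder();
    StringBuilder literal = new StringBuilder();
    for (char c : term.toCharArray()) {
      if (c == '*') {
        if (literal.length() > 0) {
          regex.append(Pattern.quote(literal.toString()));
          literal.setLength(0);
        }
        regex.append(".*");
      } else {
        literal.append(c);
      }
    }
    if (literal.length() > 0) {
      regex.append(Pattern.quote(literal.toString()));
    }
    return Pattern.compile(regex.toString());
  }

  // Assumption: lowercase and split on whitespace, a rough stand-in for an analyzer.
  static String[] tokenize(String text) {
    return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
  }

  // True if some run of consecutive tokens in the text matches the query terms in order.
  static boolean matches(String phraseQuery, String text) {
    List<Pattern> patterns = new ArrayList<>();
    for (String term : tokenize(phraseQuery)) {
      patterns.add(toPattern(term));
    }
    String[] tokens = tokenize(text);
    for (int start = 0; start + patterns.size() <= tokens.length; start++) {
      boolean allMatch = true;
      for (int i = 0; i < patterns.size(); i++) {
        if (!patterns.get(i).matcher(tokens[start + i]).matches()) {
          allMatch = false;
          break;
        }
      }
      if (allMatch) {
        return true;
      }
    }
    return false;
  }
}
```

For example, `PhraseWildcardSketch.matches("*pache pino*", "Apache Pinot")` is true because the two terms match adjacent tokens in order, while the same query against "Pinot Apache" is false because the order differs.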