Add support for phrase search with wildcard and prefix matching for Lucene indexed tables #12680

chenboat · 2024-03-20T16:42:51Z

Issue 10863: Pinot's TEXT_MATCH filter does not currently support phrase search with wildcard and prefix matching (e.g., "*pache pino*" to match "Apache Pinot"). This kind of query is very common in use cases like log search, where users need to find matches in long text. Today one has to work around the limitation by external means (e.g., concatenating all words in a paragraph into one longer string and running a regex query on it), which usually incurs much higher query latency.

This PR adds support for phrase search with wildcard and prefix matching on Lucene-indexed tables. The feature is enabled through a config on the Lucene text-indexed column; the default value is false (not enabled). With the config enabled, users can filter with a text match function such as text_match(col, '*pache pino*').

We have tested this feature in our internal environment: it can process 150+ GB of text data in 5 seconds on one server. It can be 3x faster in phrase matching tests.

release-notes

To enable the phrase wildcard search feature, add a new config to the column's Lucene text index config property list:
"enablePrefixSuffixMatchingInPhraseQueries": "true"

With this config enabled, one can perform phrase wildcard search using syntax like
text_match(col, '*pache pino*'), which can match the string "Apache pinot".
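
To make the intended match semantics concrete, here is a minimal plain-Java sketch. It is not how the PR is implemented (the test name and commit messages in this thread suggest the PR rewrites the parsed query into Lucene span queries); it only mimics the observable behavior on whitespace-separated, lowercased tokens. The class and method names are hypothetical, and the lowercasing stands in for whatever the configured analyzer does.

```java
import java.util.Locale;

/** Illustrative only: mimics phrase-with-wildcard matching on plain strings, without Lucene. */
final class PhraseWildcardSketch {
  private PhraseWildcardSketch() {
  }

  /** Returns true if `text` contains consecutive tokens that match every term of `phrase`. */
  static boolean matches(String text, String phrase) {
    String[] tokens = tokenize(text);
    String[] terms = tokenize(phrase);
    if (terms.length == 0) {
      return false;
    }
    // Slide a window of terms.length tokens over the text.
    for (int start = 0; start + terms.length <= tokens.length; start++) {
      boolean allMatch = true;
      for (int i = 0; i < terms.length && allMatch; i++) {
        allMatch = termMatches(tokens[start + i], terms[i]);
      }
      if (allMatch) {
        return true;
      }
    }
    return false;
  }

  /** A leading '*' means "ends with", a trailing '*' means "starts with", both mean "contains". */
  static boolean termMatches(String token, String term) {
    boolean leading = term.startsWith("*");
    boolean trailing = term.length() > 1 && term.endsWith("*");
    String core = term.substring(leading ? 1 : 0, term.length() - (trailing ? 1 : 0));
    if (leading && trailing) {
      return token.contains(core);
    }
    if (leading) {
      return token.endsWith(core);
    }
    if (trailing) {
      return token.startsWith(core);
    }
    return token.equals(core);
  }

  /** Assumption: lowercase + whitespace split approximates the analyzer's tokenization. */
  private static String[] tokenize(String s) {
    String trimmed = s.trim().toLowerCase(Locale.ROOT);
    return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }
}
```

Under these assumptions, PhraseWildcardSketch.matches("Apache Pinot", "*pache pino*") holds because the two matching tokens are adjacent, while the same phrase does not match text where other words sit between them.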

codecov-commenter · 2024-03-20T17:34:56Z

Codecov Report

Attention: Patch coverage is 69.38776%, with 15 lines in your changes missing coverage. Please review.

Project coverage is 61.58%. Comparing base (59551e4) to head (e34187a).
Report is 185 commits behind head on master.

Files	Patch %	Lines
...ment/index/readers/text/LuceneTextIndexReader.java	14.28%	3 Missing and 3 partials ⚠️
...me/impl/invertedindex/RealtimeLuceneTextIndex.java	50.00%	2 Missing and 2 partials ⚠️
...inot/segment/local/utils/LuceneTextIndexUtils.java	85.71%	2 Missing and 1 partial ⚠️
...pache/pinot/segment/spi/index/TextIndexConfig.java	81.81%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #12680      +/-   ##
============================================
- Coverage     61.75%   61.58%   -0.18%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2463      +27     
  Lines        133233   134634    +1401     
  Branches      20636    20848     +212     
============================================
+ Hits          82274    82909     +635     
- Misses        44911    45517     +606     
- Partials       6048     6208     +160

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`61.54% <69.38%> (-0.17%)`	⬇️
java-21	`61.46% <69.38%> (-0.17%)`	⬇️
skip-bytebuffers-false	`61.56% <69.38%> (-0.18%)`	⬇️
skip-bytebuffers-true	`61.44% <69.38%> (+33.71%)`	⬆️
temurin	`61.58% <69.38%> (-0.18%)`	⬇️
unittests	`61.57% <69.38%> (-0.18%)`	⬇️
unittests1	`46.12% <18.36%> (-0.77%)`	⬇️
unittests2	`27.98% <51.02%> (+0.25%)`	⬆️

Flags with carried forward coverage won't be shown.

...java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java

ankitsultana · 2024-03-21T16:34:12Z

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/LuceneTextIndexReader.java

@@ -150,10 +155,14 @@ public MutableRoaringBitmap getDocIds(String searchQuery) {
      // be instantiated per query. Analyzer on the other hand is stateless
      // and can be created upfront.
      QueryParser parser = new QueryParser(_column, _analyzer);
+      parser.setAllowLeadingWildcard(true);


same as above

in case you missed it: should we set this flag to true only when _enablePrefixSuffixMatchingInPhraseQueries is set to true?

There are tradeoffs: without setting this to true the queries will fail directly. On the other hand, allowing it can risk expensive queries (per Lucene documentation). I am okay with either choice.

done. better to protect behind the flag for now too.
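
The resolution above (only allow leading wildcards when the column flag is on) can be sketched with a toy stand-in for the parser. ToyParser below is hypothetical and is not Lucene's QueryParser; it only imitates the one behavior discussed here, namely that a term starting with '*' or '?' is rejected unless leading wildcards are allowed. The factory method name is also an assumption.

```java
/** Illustrative only: gates leading-wildcard support behind a per-column flag. */
final class GatedParserSketch {
  private GatedParserSketch() {
  }

  /** Toy stand-in for a query parser; NOT the Lucene class. */
  static final class ToyParser {
    private boolean _allowLeadingWildcard;

    void setAllowLeadingWildcard(boolean allow) {
      _allowLeadingWildcard = allow;
    }

    /** Returns the query unchanged, or throws if a term starts with a wildcard that is not allowed. */
    String parse(String query) {
      for (String term : query.trim().split("\\s+")) {
        boolean leadingWildcard = term.startsWith("*") || term.startsWith("?");
        if (leadingWildcard && !_allowLeadingWildcard) {
          throw new IllegalArgumentException("Leading wildcard not allowed in term: " + term);
        }
      }
      return query;
    }
  }

  /** Only relax the parser when the column opted in to the feature. */
  static ToyParser newParser(boolean enablePrefixSuffixMatchingInPhraseQueries) {
    ToyParser parser = new ToyParser();
    if (enablePrefixSuffixMatchingInPhraseQueries) {
      parser.setAllowLeadingWildcard(true);
    }
    return parser;
  }
}
```

With the flag off, queries with a leading wildcard fail fast instead of running a potentially expensive scan, which is the tradeoff raised in the review comment.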

...t-segment-local/src/main/java/org/apache/pinot/segment/local/utils/LuceneTextIndexUtils.java

ankitsultana · 2024-03-27T06:06:29Z

...in/java/org/apache/pinot/segment/local/segment/index/readers/text/LuceneTextIndexReader.java

@@ -150,10 +155,14 @@ public MutableRoaringBitmap getDocIds(String searchQuery) {
      // be instantiated per query. Analyzer on the other hand is stateless
      // and can be created upfront.
      QueryParser parser = new QueryParser(_column, _analyzer);
+      parser.setAllowLeadingWildcard(true);


in case you missed it: should we set this flag to true only when _enablePrefixSuffixMatchingInPhraseQueries is set to true?

There are tradeoffs: without setting this to true the queries will fail directly. On the other hand, allowing it can risk expensive queries (per Lucene documentation). I am okay with either choice.

ankitsultana · 2024-03-27T22:47:14Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java

@@ -72,6 +74,8 @@ public TextIndexConfig(@JsonProperty("disabled") Boolean disabled, @JsonProperty
        luceneMaxBufferSizeMB == null ? LUCENE_INDEX_DEFAULT_MAX_BUFFER_SIZE_MB : luceneMaxBufferSizeMB;
    _luceneAnalyzerClass = (luceneAnalyzerClass == null || luceneAnalyzerClass.isEmpty())
        ? FieldConfig.TEXT_INDEX_DEFAULT_LUCENE_ANALYZER_CLASS : luceneAnalyzerClass;
+    _enablePrefixSuffixMatchingInPhraseQueries =
+        enablePrefixSuffixMatchingInPhraseQueries == null ? false : enablePrefixSuffixMatchingInPhraseQueries;


nit: hoist default false value to a static constant (similar to LUCENE_INDEX_DEFAULT_USE_COMPOUND_FILE).
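
A minimal sketch of what this nit asks for, using a simplified stand-in class rather than the real TextIndexConfig. The constant name is a guess modeled on LUCENE_INDEX_DEFAULT_USE_COMPOUND_FILE and may differ from what was actually committed.

```java
/** Simplified stand-in for a text index config; not the actual TextIndexConfig class. */
final class TextIndexConfigSketch {
  // Hypothetical constant name: the default is declared once instead of an inline `false`.
  static final boolean LUCENE_INDEX_DEFAULT_ENABLE_PREFIX_SUFFIX_MATCHING_IN_PHRASE_QUERIES = false;

  private final boolean _enablePrefixSuffixMatchingInPhraseQueries;

  /** A null argument (property absent from the config) falls back to the shared default. */
  TextIndexConfigSketch(Boolean enablePrefixSuffixMatchingInPhraseQueries) {
    _enablePrefixSuffixMatchingInPhraseQueries = enablePrefixSuffixMatchingInPhraseQueries == null
        ? LUCENE_INDEX_DEFAULT_ENABLE_PREFIX_SUFFIX_MATCHING_IN_PHRASE_QUERIES
        : enablePrefixSuffixMatchingInPhraseQueries;
  }

  boolean isEnablePrefixSuffixMatchingInPhraseQueries() {
    return _enablePrefixSuffixMatchingInPhraseQueries;
  }
}
```

Keeping the default in a named constant lets builders, tests, and documentation refer to one value instead of repeating a literal.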

ankitsultana · 2024-03-28T00:27:05Z

...gment-local/src/test/java/org/apache/pinot/segment/local/utils/LuceneTextIndexUtilsTest.java

+
+public class LuceneTextIndexUtilsTest {
+  @Test
+  public void testBooleanQueryRewrittenToSpanQuery() {


fyi: I used this code for testing and understanding the PR. The approach in this PR is quite interesting!

https://gist.github.com/ankitsultana/88181cc3bda5113fec83175cfa867445

itschrispeck · 2024-03-28T01:12:13Z

pinot-core/src/test/java/org/apache/pinot/queries/TextSearchQueriesTest.java

+    String query =
+        "SELECT INT_COL, SKILLS_TEXT_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '*ealtime streaming system*') "
+            + "LIMIT 50000";


Can we handle/add a test case for boolean span queries? i.e.

TEXT_MATCH(SKILLS_TEXT_COL, '*ealtime streaming system* AND *chine learn*')

Test cases added and they work! Though it is not the main intent of this PR.

itschrispeck · 2024-03-28T01:13:18Z

What do you think about enabling this by default/not hiding this feature behind a config? It seems that we should be able to infer whether a query is valid single term lucene syntax/phrase search or requires a span query

e.g. for StandardAnalyzer parsed text
'*istributed' - valid lucene, unchanged behavior
'distribute*'- valid lucene, unchanged behavior
'"Distributed systems"' - valid lucene, unchanged behavior
'*istributed systems*' - not valid lucene, modify it to be a span query
'/.*istributed systems.*/' - valid lucene, incompatible with StandardAnalyzer, unchanged behavior
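
One way the inference suggested here could look, as a rough sketch only: the heuristic below is hypothetical, is not part of the PR, and covers just the five example shapes listed above. It ignores boolean operators, field prefixes, escaping, and analyzer differences.

```java
/** Illustrative only: guesses whether a TEXT_MATCH query string would need a span-query rewrite. */
final class QueryShapeSketch {
  enum Shape { UNCHANGED, SPAN_QUERY }

  private QueryShapeSketch() {
  }

  static Shape classify(String query) {
    String q = query.trim();
    // Quoted phrase, e.g. "Distributed systems": already valid phrase syntax.
    if (q.length() >= 2 && q.startsWith("\"") && q.endsWith("\"")) {
      return Shape.UNCHANGED;
    }
    // Regex query, e.g. /.*istributed systems.*/: left as-is.
    if (q.length() >= 2 && q.startsWith("/") && q.endsWith("/")) {
      return Shape.UNCHANGED;
    }
    String[] terms = q.split("\\s+");
    // Single term with a wildcard, e.g. *istributed or distribute*: left as-is.
    if (terms.length < 2) {
      return Shape.UNCHANGED;
    }
    // Multi-term text whose first term starts with '*' or last term ends with '*'.
    boolean leading = terms[0].startsWith("*");
    boolean trailing = terms[terms.length - 1].endsWith("*");
    return (leading || trailing) ? Shape.SPAN_QUERY : Shape.UNCHANGED;
  }
}
```

Only the '*istributed systems*' shape is classified as needing a rewrite; the other four examples are left unchanged.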

chenboat · 2024-03-30T00:05:21Z

What do you think about enabling this by default/not hiding this feature behind a config? It seems that we should be able to infer whether a query is valid single term lucene syntax/phrase search or requires a span query

e.g. for StandardAnalyzer parsed text
'*istributed' - valid lucene, unchanged behavior
'distribute*'- valid lucene, unchanged behavior
'"Distributed systems"' - valid lucene, unchanged behavior
'*istributed systems*' - not valid lucene, modify it to be a span query
'/.*istributed systems.*/' - valid lucene, incompatible with StandardAnalyzer, unchanged behavior

For now I think it is better to hide this feature behind a config. A few reasons:

1. '*istributed systems*' is still a valid, parsable boolean query today without this PR; it is just not clear what the user's real intent is. We will probably leave it that way to preserve the current behavior. Only when a table owner explicitly sets the config flag will this query pattern be treated as a phrase query with wildcard matching.
2. The feature is still costly to enable. It goes beyond most patterns suggested by Lucene (e.g., no leading *), so it is better to keep it as an opt-in feature.

Jackie-Jiang · 2024-04-05T00:20:02Z

Thanks for adding the feature! Can you please add a release-notes section to the PR description, and also update the Pinot documentation for this new feature?

npawar · 2024-05-22T20:22:09Z

@chenboat following up here.. did you get a chance to add docs?

chenboat · 2024-05-24T23:26:37Z

Release note section added.
@npawar can you review this doc PR? https://github.com/pinot-contrib/pinot-docs/pull/341/files

chenboat added 2 commits March 13, 2024 23:20

Intial commit to support phrase search with regex matching for the te…

600d5a5

…rms in the phrase

Increase max clause limit for SpanOr queries.

2dc66a1

chenboat added 2 commits March 20, 2024 11:00

Fix the lint errors.

0fecc09

Fix lint

0ac1024

chenboat requested review from Jackie-Jiang, ankitsultana and siddharthteotia March 20, 2024 23:19

ankitsultana reviewed Mar 21, 2024

chenboat added 4 commits March 22, 2024 17:36

Fix based on comments.

b5be593

Fix lint.

4b87b0b

Fix lint

f1053be

Remove unused imports.

694c166

chenboat requested a review from ankitsultana March 25, 2024 20:53

ankitsultana reviewed Mar 28, 2024

itschrispeck reviewed Mar 28, 2024

Revise based on comments.

e34187a

ankitsultana approved these changes Mar 29, 2024

chenboat requested a review from itschrispeck March 30, 2024 00:05

itschrispeck approved these changes Apr 1, 2024

chenboat merged commit 62b97ef into apache:master Apr 1, 2024
19 checks passed

Jackie-Jiang added the feature, documentation, and release-notes labels Apr 5, 2024

jackluo923 mentioned this pull request Jun 16, 2024

Text-index does not support multi-token substring search where first and last tokens are partial #10863

Open
