Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for phrase search with wildcard and prefix matching for Lucene indexed tables #12680

Merged
merged 9 commits into from
Apr 1, 2024

Conversation

chenboat
Copy link
Contributor

@chenboat chenboat commented Mar 20, 2024

Issue 10863 Pinot's TEXT_MATCH filter today does not support phrase search with wildcard and prefix matching (e.g., "*pache pino*" to match "Apache Pinot") directly. The kind of queries is very common in use case like log search where user needs to search matching results in long text. Today one has to use external means to walk around this issue (e.g., concatenating all words in a paragraph into a longer string and performing regex query on it) and usually they will incur much higher query latency.

This PR adds support to allow phrase search with wildcard and prefix matching for Lucene indexed tables. This feature is enabled through a config in the Lucene text indexed column. The default value is false (or not enabled). User can write a text match function to perform the filter (text_match(col, '*pache pino*')).

We have tested this feature in our internal env and it can process 150+G of text data in 5 sec on 1 server. It can be 3x faster in phrase matching tests.

release-notes

To enable the phrase wildcard search feature, one needs to add a new config to their column Lucene text index config property list
"enablePrefixSuffixMatchingInPhraseQueries": "true"

With this config enabled, one can now perform the pharse wildcard search using the following syntax like
text_match(col, '*pache pino*') which can match the string "Apache pinot".

@codecov-commenter
Copy link

codecov-commenter commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 69.38776% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 61.58%. Comparing base (59551e4) to head (e34187a).
Report is 185 commits behind head on master.

Files Patch % Lines
...ment/index/readers/text/LuceneTextIndexReader.java 14.28% 3 Missing and 3 partials ⚠️
...me/impl/invertedindex/RealtimeLuceneTextIndex.java 50.00% 2 Missing and 2 partials ⚠️
...inot/segment/local/utils/LuceneTextIndexUtils.java 85.71% 2 Missing and 1 partial ⚠️
...pache/pinot/segment/spi/index/TextIndexConfig.java 81.81% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12680      +/-   ##
============================================
- Coverage     61.75%   61.58%   -0.18%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2463      +27     
  Lines        133233   134634    +1401     
  Branches      20636    20848     +212     
============================================
+ Hits          82274    82909     +635     
- Misses        44911    45517     +606     
- Partials       6048     6208     +160     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.54% <69.38%> (-0.17%) ⬇️
java-21 61.46% <69.38%> (-0.17%) ⬇️
skip-bytebuffers-false 61.56% <69.38%> (-0.18%) ⬇️
skip-bytebuffers-true 61.44% <69.38%> (+33.71%) ⬆️
temurin 61.58% <69.38%> (-0.18%) ⬇️
unittests 61.57% <69.38%> (-0.18%) ⬇️
unittests1 46.12% <18.36%> (-0.77%) ⬇️
unittests2 27.98% <51.02%> (+0.25%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -150,10 +155,14 @@ public MutableRoaringBitmap getDocIds(String searchQuery) {
// be instantiated per query. Analyzer on the other hand is stateless
// and can be created upfront.
QueryParser parser = new QueryParser(_column, _analyzer);
parser.setAllowLeadingWildcard(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case you missed it: should we set this flag to true only when _enablePrefixSuffixMatchingInPhraseQueries is set to true?

There are tradeoffs: without setting this to true the queries will fail directly. On the other hand, allowing it can risk expensive queries (per Lucene documentation). I am okay with either choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. better to protect behind the flag for now too.

@@ -150,10 +155,14 @@ public MutableRoaringBitmap getDocIds(String searchQuery) {
// be instantiated per query. Analyzer on the other hand is stateless
// and can be created upfront.
QueryParser parser = new QueryParser(_column, _analyzer);
parser.setAllowLeadingWildcard(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case you missed it: should we set this flag to true only when _enablePrefixSuffixMatchingInPhraseQueries is set to true?

There are tradeoffs: without setting this to true the queries will fail directly. On the other hand, allowing it can risk expensive queries (per Lucene documentation). I am okay with either choice.

@@ -72,6 +74,8 @@ public TextIndexConfig(@JsonProperty("disabled") Boolean disabled, @JsonProperty
luceneMaxBufferSizeMB == null ? LUCENE_INDEX_DEFAULT_MAX_BUFFER_SIZE_MB : luceneMaxBufferSizeMB;
_luceneAnalyzerClass = (luceneAnalyzerClass == null || luceneAnalyzerClass.isEmpty())
? FieldConfig.TEXT_INDEX_DEFAULT_LUCENE_ANALYZER_CLASS : luceneAnalyzerClass;
_enablePrefixSuffixMatchingInPhraseQueries =
enablePrefixSuffixMatchingInPhraseQueries == null ? false : enablePrefixSuffixMatchingInPhraseQueries;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: hoist default false value to a static constant (similar to LUCENE_INDEX_DEFAULT_USE_COMPOUND_FILE).


public class LuceneTextIndexUtilsTest {
@Test
public void testBooleanQueryRewrittenToSpanQuery() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi: I used this code for testing and understanding the PR. The approach in this PR is quite interesting!

https://gist.github.com/ankitsultana/88181cc3bda5113fec83175cfa867445

Comment on lines +297 to +299
String query =
"SELECT INT_COL, SKILLS_TEXT_COL FROM MyTable WHERE TEXT_MATCH(SKILLS_TEXT_COL, '*ealtime streaming system*') "
+ "LIMIT 50000";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we handle/add a test case for boolean span queries? i.e.

TEXT_MATCH(SKILLS_TEXT_COL, '*ealtime streaming system* AND *chine learn*') 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test cases added and they work! Though it is not the main intent of this PR.

@itschrispeck
Copy link
Collaborator

What do you think about enabling this by default/not hiding this feature behind a config? It seems that we should be able to infer whether a query is valid single term lucene syntax/phrase search or requires a span query

e.g. for StandardAnalyzer parsed text
'*istributed' - valid lucene, unchanged behavior
'distribute*'- valid lucene, unchanged behavior
'"Distributed systems"' - valid lucene, unchanged behavior
'*istributed systems*' - not valid lucene, modify it to be a span query
'/.*istributed systems.*/' - valid lucene, incompatible with StandardAnalyzer, unchanged behavior

@chenboat
Copy link
Contributor Author

What do you think about enabling this by default/not hiding this feature behind a config? It seems that we should be able to infer whether a query is valid single term lucene syntax/phrase search or requires a span query

e.g. for StandardAnalyzer parsed text '*istributed' - valid lucene, unchanged behavior 'distribute*'- valid lucene, unchanged behavior '"Distributed systems"' - valid lucene, unchanged behavior '*istributed systems*' - not valid lucene, modify it to be a span query '/.*istributed systems.*/' - valid lucene, incompatible with StandardAnalyzer, unchanged behavior

For now I think it is better to hide this feature behind a config. A few reasons:

  1. 'istributed systems' is today still a valid parsable boolean query without this PR. It is just not clear what the users' real intent is. We probably will leave it this way to maintain the current status. Only when a table owner explicitly set the config flag, this query pattern will be treated as a phrase query with wild card matching.
  2. The query feature enable is still costly. It goes beyond most patterns suggested by Lucene (e.g., no leading *). So it is better to be an option in only feature.

@chenboat chenboat merged commit 62b97ef into apache:master Apr 1, 2024
19 checks passed
@Jackie-Jiang Jackie-Jiang added feature documentation release-notes Referenced by PRs that need attention when compiling the next release notes labels Apr 5, 2024
@Jackie-Jiang
Copy link
Contributor

Thanks for adding the feature! Can you please add a release-notes section to the PR description, and also update the Pinot documentation for this new feature?

@npawar
Copy link
Contributor

npawar commented May 22, 2024

@chenboat following up here.. did you get a chance to add docs?

@chenboat
Copy link
Contributor Author

chenboat commented May 24, 2024

Release note section added.
@npawar can you review this doc PR? https://github.com/pinot-contrib/pinot-docs/pull/341/files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation feature release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants