Support for Native Text Indexing in Pinot #7395
Comments
As part of the design, it would be great to see details on how different analysis chains can be specified (e.g. based on target language, for a column). It would also be great to plan for supporting dynamic analysis chains, typically based on language classification (either manual or dynamic), per row, as that solves a major problem for many text search use cases. |
Thanks @atris. I would like to review. Please give me a couple of days to go through the doc. |
High-level question: After we added the Lucene-based text index, an FST text index was added, which uses Lucene FST libraries purely for regex searches, because the former took more storage if the user only wanted regex searches and not term, phrase, etc. For the current proposal of native text indexes, are we planning to re-implement all of the Lucene search libraries within Pinot? |
Looking at the doc and PR, it looks like only the FST is being implemented. Pinot already has the posting list and the ability to evaluate boolean expressions over posting lists. Is there anything else? |
No, the FST and the regexp automaton are the only two things; the rest of the operations will be performed by Pinot indices.
|
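The division of labor described in the exchange above (the FST and regexp automaton find the matching terms, while posting lists and boolean evaluation stay with Pinot's indices) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the toy `dictionary` and `posting_lists`, the function names, and the linear scan with Python's `re` module standing in for a real FST and automaton. It is not Pinot's implementation.

```python
import re

# Hypothetical toy segment: a dictionary of distinct terms (term -> dict id)
# and an inverted index mapping each dict id to a posting list of doc ids.
dictionary = {"apache": 0, "pinot": 1, "pinto": 2, "lucene": 3}
posting_lists = {0: {0, 2}, 1: {0, 1}, 2: {3}, 3: {1, 2}}


def matching_dict_ids(pattern):
    """Stand-in for the FST + regexp automaton step: return the dict ids of
    all terms that fully match the pattern (here via a plain linear scan)."""
    regex = re.compile(pattern)
    return [dict_id for term, dict_id in dictionary.items() if regex.fullmatch(term)]


def docs_for(pattern):
    """Union the posting lists of every term that matches the pattern."""
    docs = set()
    for dict_id in matching_dict_ids(pattern):
        docs |= posting_lists[dict_id]
    return docs


def docs_for_and(*patterns):
    """Boolean AND across patterns is an intersection of posting lists."""
    result = None
    for pattern in patterns:
        docs = docs_for(pattern)
        result = docs if result is None else result & docs
    return result or set()
```

In this sketch only `matching_dict_ids` would need to change if the scan were replaced by an FST walk; the posting-list logic is untouched, which is the point made in the reply.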
Wanted to understand a few things better. IIUC, this is our current state
In both the above cases, we get the FST and regexp automaton as part of using Lucene. We also advise users not to use the Lucene text index if they want to do exact matches, since Pinot's native inverted index is much faster for exact matches.
When we say we are implementing a native FST index, what exact functionality are we adding and/or improving? This is not clear in the design doc. The doc talks about control/flexibility and potential future improvements, but they are a bit vague IMHO, and a few more details can be added in those sections.
My guess is that this is about improving phrase, regex and fuzzy search by building a native FST index that can work on top of Pinot's existing native structures: the inverted index and the dictionary. So it seems a bridge is missing between Pinot's native inverted index and dictionary structures and the Lucene FST. Is this correct? If so, can this not be achieved by continuing to use the Lucene FST library, as opposed to putting it into Pinot? That is something we already do as part of the Lucene FST index.
Also, how will this new work differ from what the Lucene FST index currently offers in terms of functionality and performance? There are some performance charts, but if I am reading them right, the improvement seems marginal.
Also, thanks for clarifying in the doc that this work won't regress the TEXT_MATCH functionality (query syntax etc.) and performance. In case we go ahead with this new work, I think the end state should not include a mandatory step of removing the current Lucene text index and TEXT_MATCH. If someone wants to migrate, there should be a migration path. The rest of the users can continue to use what we have today. |
Thanks for reviewing the document, @siddharthteotia! Here are my thoughts. Current text search infrastructure: in the status quo, we simply build sidecar Lucene indices and expose a UDF that allows users to specify Lucene queries. IMO, this is a component that should ideally live outside of Pinot, since it has no correlation with Pinot itself. So an eventual goal is to move text search to native Pinot indices and the dictionary, and to follow the SQL standard (LIKE operator) syntax as much as possible. Now, coming to the FST itself, there are three reasons why a native FST makes sense:
Regarding TEXT_MATCH, while it is my dearest wish to deprecate the module, I understand that some users may wish to use it. As highlighted, both indices can coexist, with no mandate to migrate from one to the other. |
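To make the stated goal of following the SQL LIKE syntax concrete, here is a minimal sketch of how a LIKE pattern could be translated into a regular expression that a regexp automaton might then consume. The function names `like_to_regex` and `like` are hypothetical, the translation covers only the `%` and `_` wildcards (no `ESCAPE` clause), and it is not taken from the design doc or from Pinot's code.

```python
import re


def like_to_regex(like_pattern):
    """Translate a SQL LIKE pattern into a regex string.

    '%' matches any run of characters, '_' matches exactly one character,
    and every other character is matched literally.
    """
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like(value, like_pattern):
    """Return True if the whole value matches the LIKE pattern."""
    regex = like_to_regex(like_pattern)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None
```

Under this translation, `'pin%'` is a prefix match, `'%not'` a suffix match and `'%in%'` a substring match, so the match types listed in the issue description all reduce to one regular-expression path.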
I followed up with @atris in the Slack channel to clarify a few additional things. Copying here for reference and visibility. Can we all confirm the following? I am sorry to have asked this a couple of times as part of different threads in the doc, but the doc still indicates some sort of migration: "Note that till completion of phase 4, we will be maintaining the existing text indices within Pinot." I just want to make sure
@atris's response
Based on the above clarifications, I am OK with proceeding. @amrishlal, @jackjlli, please feel free to add any additional discussion notes. |
Did a brief sync-up with @atris yesterday. We would like to try this out for functional and perf testing on our prod use case as soon as the phrase and term search part is complete. We will have to discuss and handle index conversion and query migration (if possible). We will collaborate during testing and rollout on any feature gaps, perf gaps, etc. |
Build a fully functional text search engine on top of native Pinot indices, allowing exact matches, prefix and suffix matches, substring matches and regular expressions.
Performance of the text search components (automaton, matcher and FST) should be comparable to or better than Lucene's FST, matcher and automaton.
Build the engine using core Pinot capabilities and have it deeply integrated with Pinot's core components.
Allow the library to be reusable across Pinot.
Allow the library to be extensible without requiring application changes.
Please see:
https://docs.google.com/document/d/1PMhoRy6WF46C4d4mw0LVe9b8Vjqes6vsXZkmxXzMYzw/edit#heading=h.krgi6ulfrbxj