Support for building materialized views using Lucene formats #13188
Comments
Figuring out the right API for this idea sounds challenging, but I like the idea.
I wonder if we could think of this more broadly as a caching problem. Basically, you could evaluate some "question" (aggregations, statistics, etc.) for all segments and save the index-level question and the per-segment answers.
There are several advantages to keeping the new index as part of the same Lucene segment. It reduces maintenance overhead and enables Near Real-Time (NRT) use cases. Specifically, for the star tree index, incrementally building the star tree as segments get flushed and merged takes significantly less time, as the sort and aggregation operations can be optimized. Considering these advantages, I'm further exploring the idea of a new format to support multi-field indices, which can also be extended to create other types of composite indices.
DataCubesFormat vs CompositeValuesFormat
Since Lucene is also used for OLAP use cases, we can create a 'DataCubesFormat' specifically designed to create multi-field indices on a set of dimensions and metrics. [Preferred] Alternatively, if we want a more generic format for creating indices based on any set of fields, we could go with 'CompositeValuesFormat'. While the underlying implementation for both formats would be similar (creating indices on a set of Lucene fields), 'DataCubesFormat' is more descriptive and tailored to the OLAP use case.
Implementation
For clarity, we will focus on 'DataCubesFormat' in the rest of this section. Broadly, we have two ways to implement the format.
DataCubesConfig via IndexWriterConfig / SegmentInfo [Preferred]
Users can pass the set of dimensions and metrics as a 'DataCubesConfig' via 'IndexWriterConfig', which gets carried into 'SegmentInfo' so the format can read it during flush and merge (see the sketch after the pros/cons below).
Pros
Cons
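To make the preferred option more concrete, here is a minimal sketch of how a data cube configuration could be attached to the writer. All of the DataCubes* types, the Metric record, and the setDataCubesConfig setter are hypothetical names for this proposal, not existing Lucene APIs; only the intent (config flows from IndexWriterConfig into SegmentInfo so the format can see it at flush/merge) follows the text above.
```java
// Sketch only: these types and the IndexWriterConfig hook are hypothetical.
import java.util.List;

enum MetricStat { SUM, COUNT, MIN, MAX }

record Metric(String field, MetricStat stat) {}

record DataCubesConfig(List<String> dimensions, List<Metric> metrics) {}

class DataCubesConfigSketch {
  static DataCubesConfig exampleCube() {
    return new DataCubesConfig(
        List.of("status", "port"),                    // dimensions to build the cube on
        List.of(new Metric("size", MetricStat.SUM))); // metrics to pre-aggregate
  }

  // Intended wiring (proposed, not an existing API):
  //   IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
  //   iwc.setDataCubesConfig(exampleCube());  // carried into SegmentInfo attributes
  //   ...the DataCubesFormat then reads the config during flush and merge.
}
```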
Add/update doc flow with a new DataCubeField
Users can pass the set of dimensions and metrics as part of a new 'DataCubeField' during the 'ProcessDocument' flow.
Pros
Cons
Overall, the preferred approach of using 'IndexWriterConfig' and 'SegmentInfo' seems more suitable for implementing the 'DataCubesFormat'.
DataCubesFormat
DataCubesConfig
DataCubesDocValuesConsumer
The DataCubesDocValuesConsumer consumes the DocValues writer to read the DocValues data and create new indices based on the DataCubesConfig.
DataCubesProducer
The DataCubesProducer/Reader is used to read the 'DataCubes' index from the segment.
Example
In this example, I've implemented the 'StarTree' index by extending 'DataCubesFormat'.
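A rough sketch of what the extension points described above could look like, modeled on existing per-segment formats such as DocValuesFormat. The class names follow the proposal, but the exact method shapes are my assumptions.
```java
// Sketch: the DataCubes* classes are proposed, not existing Lucene APIs.
import java.io.Closeable;
import java.io.IOException;
import org.apache.lucene.codecs.DocValuesProducer;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Builds the cube index for a segment by reading the already-written doc values.
abstract class DataCubesDocValuesConsumer implements Closeable {
  public abstract void build(DataCubesConfig config, DocValuesProducer docValues)
      throws IOException;
}

// Reads a segment's cube index at search time (e.g. the star tree root).
abstract class DataCubesProducer implements Closeable {
  public abstract Object getDataCube(String cubeName) throws IOException;
}

// The per-segment format itself, analogous to DocValuesFormat / PointsFormat.
abstract class DataCubesFormat {
  public abstract DataCubesDocValuesConsumer fieldsConsumer(SegmentWriteState state)
      throws IOException;

  public abstract DataCubesProducer fieldsProducer(SegmentReadState state)
      throws IOException;
}

// A star tree implementation would then simply extend the format:
// class StarTreeDataCubesFormat extends DataCubesFormat { ... }
```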
Wow! Adding data cube (OLAP) capabilities to Lucene could be really powerful. Adding it as a new format does sound like the right idea to me. I would like to better understand how the dimensions and metrics would be configured, though. Could something like that be achieved by adding a "datacube dimensionality" attribute to a field, similar to how point dimensionality is configured on a field today?
Thanks for the comments @msfroh. Good idea, if we want to supply the dimensions and metrics as part of the field itself. But there are some challenges:
The same dimension and metric values would end up duplicated: once in their regular DocValues fields and once in the new data cube field.
So, in order to avoid the duplication of values, how about we derive the values of the dimensions and metrics from their existing DocValues fields during flush and merge?
Flush
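A minimal sketch of the flush-time derivation, assuming the hook receives the segment's DocValuesProducer and FieldInfos; the FlushTimeCubeBuilder class and the way it is invoked are assumptions, while the NumericDocValues iteration is the standard Lucene API.
```java
// Sketch: FlushTimeCubeBuilder is hypothetical; it re-reads dimension/metric
// values from already-flushed doc values so nothing is duplicated in documents.
import java.io.IOException;
import org.apache.lucene.codecs.DocValuesProducer;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

class FlushTimeCubeBuilder {
  void build(DataCubesConfig config, FieldInfos fieldInfos, DocValuesProducer producer)
      throws IOException {
    for (String dim : config.dimensions()) {
      FieldInfo fi = fieldInfos.fieldInfo(dim);
      NumericDocValues values = producer.getNumeric(fi);
      for (int doc = values.nextDoc();
           doc != DocIdSetIterator.NO_MORE_DOCS;
           doc = values.nextDoc()) {
        long dimValue = values.longValue();
        // accumulate (doc, dimValue) into the in-memory star tree builder here
      }
    }
    // ...same for the metric fields, then sort by dimensions and aggregate
  }
}
```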
Merge
During merge, we will most likely not need to re-read the original DocValues fields, since the already built star tree structures of the segments being merged can be merged directly.
POC code
It's not clear to me how we'd take advantage of this information at search time. What changes would we make, e.g. to queries or collectors?
Hi @jpountz,
We will traverse the star tree index at search time to find the star tree documents that match the query's dimension filters.
StarTreeQuery
And coming to collectors, the aggregation collectors can read the pre-aggregated metric values of the matching star tree documents instead of iterating over every original document. For example, SumCollector:
Example:
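A minimal sketch of the idea, assuming the matching star tree documents are exposed as a DocIdSetIterator (produced by the star tree traversal) and the pre-aggregated metric as a NumericDocValues field; those hooks are assumptions. The point is that the sum reads one value per star tree bucket instead of one per original document.
```java
// Sketch: how the star tree docs and the pre-aggregated metric are exposed
// here is an assumption; only the doc values iteration is standard Lucene API.
import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

class StarTreeSumSketch {
  static long sum(DocIdSetIterator matchingStarTreeDocs, NumericDocValues preAggregatedSum)
      throws IOException {
    long total = 0;
    for (int doc = matchingStarTreeDocs.nextDoc();
         doc != DocIdSetIterator.NO_MORE_DOCS;
         doc = matchingStarTreeDocs.nextDoc()) {
      if (preAggregatedSum.advanceExact(doc)) {
        // each star tree document already carries the aggregated value
        // for its dimension bucket
        total += preAggregatedSum.longValue();
      }
    }
    return total;
  }
}
```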
Code reference - this contains the star tree implementation, but it is old code that I have not yet integrated with the DataCubes format.
This reminded me of an older issue, #11463, which seems to have foundered. Maybe there is something to be learned from that, not sure.
Thanks for the inputs @msokolov. I do see the similarities, but the linked issue seems to be tied to rollups done as part of merge, aided by index sorting on the dimensions. Index sorting is quite expensive. The difference here is that all the computation is deferred to the format and its custom logic, and query-time gains could be higher as we are using efficient cubing structures. For the star tree implementation, the algorithm sorts the dimensions and then aggregates during flush; successive merges just need to sort and aggregate the already compacted, sorted data cube structures. There are some cons here as well:
Let me know your thoughts.
My main concern was to ensure this exciting effort didn't get blocked by the need to make major changes to existing indexing workloads. It sounds like the plan here is less intrusive and confined to the new format, for which +1.
At first sight I don't like the fact that this seems to plug in a whole new way of doing things. Either you don't use a star tree index and you do things the usual way with filters and collectors, or you want to use a star tree index and then you need to craft queries in a very specific way if you want to be able to take advantage of the optimization for aggregations. Since this optimization is about aggregating data, I'd like this to mostly require changes on the collector side from the end user's perspective. It would be somewhat less efficient, but an alternative I'm contemplating would consist of the following:
Thanks for the inputs @jpountz. Let me spend some more time on this. This is a topic we had thought about as well, and one idea was to do query abstraction / planning. Let me know your thoughts: can the concern with the query be solved by abstracting it behind a query-planning step? The input can remain the same as the original query; we can check whether it can be answered via the star tree index and fall back to the regular flow otherwise. Or we could also think of query rewriting, where a particular query is rewritten if it can be solved using the star tree index. But there is an issue with this as well.
Description
We are exploring the use case of building materialized views for certain fields and dimensions using a Star Tree index while indexing the data. This will be based on the fields (dimensions and metrics) configured during index creation. This is inspired by http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot’s Star Tree index. Star Tree helps enforce an upper bound on aggregation queries, ensuring predictable latency and resource usage; it is also storage-space efficient and configurable.
OpenSearch RFC : opensearch-project/OpenSearch#12498
Creating this issue to discuss approaches to support Star Tree in Lucene and also to get feedback on any other approaches/recommendations from the community.
Quick overview of the Star Tree index creation flow
The Star Tree DocValues fields and the Star Tree index are created during the flush / merge flows of indexing.
Flush / merge flow
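Roughly, the flush-time build (following the star-cube paper and Apache Pinot's implementation) collects one row per document with its dimension and metric values, sorts by the dimension tuple, aggregates rows with identical tuples, and then builds the tree with star nodes on top. The sketch below shows only the sort-and-aggregate step; the Row type and the SUM aggregation are illustrative, not the exact structures proposed here.
```java
// Sketch of the sort-and-aggregate step at flush time; Row and the SUM
// aggregation are illustrative placeholders.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StarTreeBuildSketch {
  // One row per document: its dimension ordinals/values plus its metric values.
  record Row(long[] dims, double[] metrics) {}

  static List<Row> sortAndAggregate(List<Row> rows) {
    // 1. Sort by the dimension tuple (the configured dimension order matters).
    List<Row> sorted = new ArrayList<>(rows);
    sorted.sort((a, b) -> Arrays.compare(a.dims(), b.dims()));

    // 2. Collapse rows that share a dimension tuple, aggregating their metrics.
    List<Row> aggregated = new ArrayList<>();
    for (Row r : sorted) {
      Row last = aggregated.isEmpty() ? null : aggregated.get(aggregated.size() - 1);
      if (last != null && Arrays.equals(last.dims(), r.dims())) {
        for (int m = 0; m < r.metrics().length; m++) {
          last.metrics()[m] += r.metrics()[m];  // SUM as the example aggregation
        }
      } else {
        aggregated.add(new Row(r.dims().clone(), r.metrics().clone()));
      }
    }
    // 3. The star tree is then built over `aggregated`, inserting "star" (*)
    //    nodes that pre-aggregate across a dimension once a node grows past
    //    the configured leaf-document threshold.
    return aggregated;
  }
}
```
Merges can then work on the already sorted, aggregated rows of the input segments rather than on the raw documents.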
Challenges
The main challenge is that the ‘StarTree’ index is a multi-field index, unlike the other formats in Lucene / OpenSearch. This makes it infeasible to use the PerField extensions defined in Lucene today. We explored ‘BinaryDocValues’ to encode dimensions and metrics, but the types of the dimensions and metrics differ, so we couldn’t find a way to extend it. [Dimensions could be numeric, text, or a combination.]
Create Star Tree index
Approach 1 - Create a new format to build materialized views
We can create a new dedicated file format (similar to the points format or postings format) for materialized views which accepts a list of dimensions and metrics; the default implementation for it could be the Star Tree index.
Pros
Cons
Approach 2 - Extend DocValues format
Indexing - Extend DocValues to support materialized views
We can extend the DocValues format to support a new field type, ‘AGGREGATED’, which will hold the list of dimensions and metrics configured by the user during index creation.
During flush / merge, the values of the dimensions and metrics will be read from the associated ‘DocValues’ fields using the DocValuesProducer, and we will create the Star Tree indices as per the steps mentioned above.
Search flow
We can extend ‘LeafReader’ and ‘DocValuesProducer’ with a new method, ‘getAggregatedDocValues’, to get the Star Tree index at query time. This retrieves the root of the Star Tree along with the dimension and metric DocValues fields.
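A sketch of how the proposed hook might be used from the search side; ‘getAggregatedDocValues’ and the returned star tree values type do not exist in Lucene today, so everything inside the leaf loop is illustrative.
```java
// Sketch: getAggregatedDocValues and StarTreeValues are proposed, not existing APIs.
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

class AggregatedDocValuesSketch {
  static void collectPerLeaf(IndexReader reader) throws IOException {
    for (LeafReaderContext leaf : reader.leaves()) {
      // Proposed API (illustrative):
      // StarTreeValues values = leaf.reader().getAggregatedDocValues("sales_cube");
      // values.root()                      -> traverse for the query's dimension filters
      // values.metricDocValues("size_sum") -> feed into the aggregation collector
    }
  }
}
```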
Pros
Cons
Open questions
Any suggestions on a way to pack the values of ‘dimensions’ and ‘metrics’ into the ‘AggregatedField’ during indexing as part of the ‘addDocument’ flow? Also, should we explore this, or can we simply create the derived ‘AggregatedField’ during flush/merge?
Create Star Tree DocValues fields
The Star Tree index is backed by Star Tree DocValues fields.
So, to read/write them, we can reuse the existing ‘DocValuesFormat’. Each field is stored as a ‘Numeric’ DocValues field, or as a ‘SortedSet’ DocValues field in the case of text fields.
To accommodate this, we propose making the DocValuesFormat configurable with a custom codec name and file extension, so that we can create the Star Tree DocValues fields with custom extensions.
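A sketch of what such a reusable star tree doc values format could look like; the class name and custom extensions are illustrative, and the delegation to the existing Lucene doc values writer (parameterized with a custom codec name and extension) is exactly the change this section proposes rather than something available today.
```java
// Sketch of reusing the doc values machinery for the star tree fields under
// custom file extensions; the delegation shown in comments is the proposal.
import java.io.IOException;
import org.apache.lucene.codecs.DocValuesConsumer;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.DocValuesProducer;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Hypothetical: writes star tree dimension/metric fields into files with their
// own extensions (e.g. .sttd / .sttm) so they don't collide with the segment's
// regular doc values files.
class StarTreeDocValuesFormat extends DocValuesFormat {
  StarTreeDocValuesFormat() {
    super("StarTreeDocValues");
  }

  @Override
  public DocValuesConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    // would delegate to the existing Lucene doc values writer, parameterized
    // with a custom codec name + extension (the change proposed above)
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public DocValuesProducer fieldsProducer(SegmentReadState state) throws IOException {
    // would delegate to the existing doc values reader with the same parameters
    throw new UnsupportedOperationException("sketch only");
  }
}
```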