-
Notifications
You must be signed in to change notification settings - Fork 25.6k
ESQL: Fuse MV_MIN and MV_MAX and document process #138029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adds REST tests for the `percentiles_bucket` pipeline bucket aggregation. This gives us forwards and backwards compatibility tests for these aggs as well as mixed version cluster tests for these aggs. Relates to elastic#26220
|
Hi @nik9000, I've created a changelog YAML for you. |
|
I've flipped this to draft because I haven't done a full survey of the csv-spec functions for coverage around this issue. |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
I've removed this from draft because I've added a bunch more MV_MIN and MV_MAX tests. We could really use more around |
martijnvg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like how this works and it gives a very good speedup. I left just one comment, and someone else should review too. But other than that LGTM
| - class: org.elasticsearch.xpack.esql.heap_attack.HeapAttackLookupJoinIT | ||
| method: testLookupExplosionBigString | ||
| issue: https://github.com/elastic/elasticsearch/issues/138510 | ||
| - class: org.elasticsearch.xpack.esql.qa.single_node.GenerativeForkIT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended to be muted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Sorry. Had a pending comment explaining it that I hadn't posted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those queries are now "optimized incorrectly". I've got one fix for the rule in #138544 so I'm going to get that and this in and work on that query. I'm calling this mute another thing that block release of the field fusion work.
| issue: https://github.com/elastic/elasticsearch/issues/138510 | ||
| - class: org.elasticsearch.xpack.esql.qa.single_node.GenerativeForkIT | ||
| method: test {csv-spec:inlinestats.MvMinMvExpand} | ||
| issue: https://github.com/elastic/elasticsearch/issues/137679 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tracked in the meta-issue blocking deployment of pushing functions to field loads: #137679
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very very nice docs. Thank you.
| * Interface for loading data in a block shape. Instances of this class | ||
| * must be immutable and thread safe. | ||
| * Loads values from a chunk of lucene documents into a "Block" for the compute engine. | ||
| * <p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My past self would have loved this. Thanks for the comments, will help my future self when my present self forgets about this.
| * <li>{@code STATS} or another function - these <strong>won't</strong> use your new code</li> | ||
| * </ul> | ||
| * <p> | ||
| * It's <strong>fairly</strong> likely we already have tests for all these cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about FORK, LOOKUP JOIN and subqueries? Are we expecting them to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that, once we fix FORK, LOOKUP JOIN, and subqueries, the same solution will apply to pushing all functions. So we won't need a test for every single function. We'll need some with a couple of representative functions. But I wrote this from the perspective of someone writing a new fused expression a few weeks from now. That's..... aspirational!
| * "fused" into field loading using {@link BlockLoaderExpression}. Functions like {@code V_COSINE} | ||
| * can use the vector search index to compute the result. Functions like {@code MV_MIN} can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not exactly true - we avoid loading the vector into blocks by calculating just the single float coming out of a similarity function. The function is not calculated as part of the data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I thought that was the plan and hadn't looked super close. I remember Ben saying something about being able to push down to lucene somehow if we knew the doc ids were in ascending order.
didn't mean to have this in this PR
Adds REST tests for the `percentiles_bucket` pipeline bucket aggregation. This gives us forwards and backwards compatibility tests for these aggs as well as mixed version cluster tests for these aggs. Relates to elastic#26220
Fuses the
MV_MINandMV_MAXfunctions to block loading for a few types. And documents the fusion process so folks doing similar work have a recipe to follow. LikeLENGTH, this is not often going to be a super hot function but it is an "obvious" one to fuse because it reduces the data we load and it's fairly easy to implement on top of the loaders we already have. Queries like:over ten million docs go from ~84ms to ~56ms. So, like, 33% improvement.