[nocommit] Introduce ExpressionFacets along with a demo. #12184

gsmiller · 2023-03-05T19:29:13Z

This is meant only as a demo/example. It is incomplete and untested. It's just to start discussion around a possible feature.

gsmiller · 2023-03-05T19:29:39Z

Immediately closing. Only want to use this draft PR as an example.

gsmiller · 2023-03-05T19:30:45Z

lucene/facet/src/java/org/apache/lucene/facet/taxonomy/ExpressionFacets.java

+      FacetsCollector facetsCollector)
+      throws IOException, ParseException {
+    // Compute component aggregations:
+    for (FieldAggregation fa : aggregations) {


@stefanvodita here's where I think you're proposing we do multiple aggregations in one pass, right? I'm not doing that in this example, but it could be done and would probably be a bit more efficient.
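To make the one-pass idea concrete, here is a minimal plain-Java sketch that does not use the Lucene API. The class name, the parallel-array layout, and the choice of a sum and a max are all assumptions made for illustration; the demo's `FieldAggregation` loop is not reproduced here.

```java
import java.util.Arrays;

/** Hypothetical illustration; none of these names come from the PR. */
class SinglePassAggregations {
  /**
   * Computes two per-ordinal aggregations while iterating the hits once.
   * Index i of each input array describes the i-th matching document.
   * Returns {populationSum, distanceMax}, each indexed by ordinal.
   */
  static double[][] aggregate(
      int[] ordinals, double[] population, double[] distance, int ordinalCount) {
    double[] populationSum = new double[ordinalCount];
    double[] distanceMax = new double[ordinalCount];
    Arrays.fill(distanceMax, Double.NEGATIVE_INFINITY);
    // One loop updates every aggregation, instead of one loop per aggregation.
    for (int i = 0; i < ordinals.length; i++) {
      int ord = ordinals[i];
      populationSum[ord] += population[i];
      distanceMax[ord] = Math.max(distanceMax[ord], distance[i]);
    }
    return new double[][] {populationSum, distanceMax};
  }
}
```

The potential saving is that the hits and their ordinals are traversed once rather than once per aggregation; whether that matters in practice would need measuring.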

That’s right. Seeing your demo makes me think of a few other features that would be nice to have:

1. Have a global cache of previously computed aggregations. This supports cases where multiple expressions use the same aggregations. Your bindings object is already a global cache. New aggregation bindings could be added to it instead of aggregationBindings. Maybe the user can pass a flag to request that aggregations be saved to the aggregation cache.

2. Compute aggregations on different match-sets. For example, let’s say in the ranking expression I only want to aggregate populations for cities whose name starts with “S” and only aggregate distances for cities that are overseas with respect to their country’s capital. Some ways I can imagine this being supported are: a) by using a filter query, like other faceting implementations do; b) by introducing a ternary operator in the expressions or in the values bound to the expressions; c) by having each aggregation object reference a match-set to be computed against.

3. Make ExpressionFacets recursive. I’m not sure if this is a good idea, but it makes sense to me conceptually. Right now, variables that show up in the expression have to be pre-computed and bound. What if the variables were treated as sub-expressions, computed recursively as needed, and added to a cache (as per point 1)?

I’ve worked a bit with the example code to see how these ideas could play out. I’ll go through them in the same order.

1. As anticipated, it’s straightforward to use the Bindings object as a global cache of aggregations. A strong objection to this is that the Bindings object would now contain both values keyed by docid and values keyed by ordinal. I don’t think this is a problem: when a variable is referenced in an expression, it’s not ambiguous whether it is keyed by docid or ordinal. Another reason to keep field bindings separate from aggregation bindings is that aggregations make sense within the context of a particular query.

2. The user can define DoubleValuesSources (DVSs) that have a built-in condition. With this approach, it’s the user’s responsibility to provide conditional DVSs that make sense. Another approach is to use conditions in the expressions; the ternary operator is supported.

3. I’m dropping this idea. Having considered it some more, I don’t see any benefit.
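The "built-in condition" approach from the second point can be sketched in plain Java as a wrapper around a per-document value function. This is only an analogy to a conditional DVS: the class, the method names, and the use of `IntToDoubleFunction` in place of Lucene's value sources are assumptions for illustration.

```java
import java.util.function.IntPredicate;
import java.util.function.IntToDoubleFunction;

/** Hypothetical illustration; a stand-in for a value source with a built-in condition. */
class ConditionalValues {
  /** Wraps per-document values so a document contributes 0 when the condition fails. */
  static IntToDoubleFunction conditional(IntPredicate condition, IntToDoubleFunction values) {
    return doc -> condition.test(doc) ? values.applyAsDouble(doc) : 0.0;
  }

  /** Sums the (possibly conditional) values over the given matching documents. */
  static double sum(int[] docs, IntToDoubleFunction values) {
    double total = 0.0;
    for (int doc : docs) {
      total += values.applyAsDouble(doc);
    }
    return total;
  }
}
```

Substituting 0 for excluded documents is only neutral for a sum. A min, max, or average would need a different way to skip documents, which is one reason the responsibility for providing sensible conditional values falls on the user.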

Thanks @stefanvodita. On your points above:

1. It sounds like you're talking about re-using aggregations across fields, not just dims? I think it's fairly common for users to "pack" all their faceting dimensions in a single index field. In this case, the aggregations will be re-used across all the dims the user wants to facet on. If the user is spreading their facet dims across multiple index fields though, you're right that any common aggregations would need to be re-computed. This feels like a little bit of an unusual case though, and I'd rather not design for it up-front to be honest. I'm nervous about putting these ordinal-level bindings into a "general" bindings instance since we have no control over the names that have already been bound, so there could be collisions (although maybe unlikely?). By keeping them isolated to an internal binding instance, we have full control over the namespace. Maybe we could simplify a bit initially and look at this optimization in a follow-up issue if we think it's important for users?

2. I'm a little confused by this use-case. If the user wants to restrict the match set used for faceting, wouldn't they want to restrict it in the same way across all aggregations and the final expression? That's easy enough and is modeled by FacetCountsWithFilterQuery, which we could use here. I'm not really clear on why a user would want to restrict the match set differently for different aggregations?

3. Not sure I understand this either, but it sounds like you're unconvinced as well? :)

1. I didn't fully get this part. What I had in mind was preserving aggregations across different faceting calls. For example, val_distance_sum could be used in 2 different expressions we want to compute. We don’t want to recompute val_distance_sum each time. In any case, you’re right that we shouldn’t complicate things too much at this stage.

2. I’ve mixed up 2 different things, which made things more confusing. The first one, which you address, is restricting the match-set for the expression computation. This can be done with a filter query. The second one is about having conditionals in the expression. This is already supported.
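The reuse of val_distance_sum across two expressions can be sketched as a small name-keyed cache in plain Java. The class and method names are hypothetical and this is not how the demo's bindings work; it only shows the compute-once behaviour being asked for.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical illustration; caches per-ordinal aggregation results by name. */
class AggregationCache {
  private final Map<String, double[]> valuesByName = new HashMap<>();
  private int computeCount = 0;

  /** Returns the cached values for the name, running the computation only on a miss. */
  double[] getOrCompute(String name, Supplier<double[]> computation) {
    double[] cached = valuesByName.get(name);
    if (cached == null) {
      cached = computation.get();
      computeCount++;
      valuesByName.put(name, cached);
    }
    return cached;
  }

  /** Number of times a computation actually ran; useful for checking reuse. */
  int computeCount() {
    return computeCount;
  }
}
```

A cache like this is only valid for a single match set, since aggregations depend on the query. That ties in with the earlier point that aggregations make sense within the context of a particular query.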

Thanks @stefanvodita! On your first point:

If the user has "packed" all their faceting dimensions into a single index field (which is sort of the default behavior here), then the expression (along with all aggregation values) would be computed a single time when the Facets instance is instantiated. If the user is then making separate calls to get faceting information for different dimensions on this same Facets instance (e.g., multiple calls to getTopChildren for different dimensions or dims + paths), then we're not re-computing anything. That's all I was getting at. Where we would be recomputing is if there are different index fields storing the faceting information. This is because we need a separate Facets instance for each index field, which is where some duplicate computation could come into play, I think. So my point is just that I suspect most common use-cases would leverage a single index field, with a single Facets instance, so there wouldn't be duplicate computation.

gsmiller · 2023-03-06T17:44:17Z

@stefanvodita I created #12190 to track this further. I propose we carry the discussion forward over there to see if we can come up with a design we like, and solicit community feedback, etc.

[nocommit] Introduce ExpressionFacets along with a demo.

1aa4350


gsmiller closed this Mar 5, 2023


gsmiller mentioned this pull request Mar 6, 2023

Add "Expression" Facets Implementation #12190

Open

Shradha26 mentioned this pull request Sep 13, 2023

[DISCUSS] Identifying Gaps in Lucene’s Faceting #12553

Open


gsmiller Mar 5, 2023

stefanvodita Mar 6, 2023

stefanvodita Mar 9, 2023

gsmiller Mar 10, 2023

stefanvodita Mar 11, 2023

gsmiller Mar 13, 2023
