
LUCENE-10274: Add hyperrectangle faceting capabilities #841

Merged
merged 25 commits into from Jun 25, 2022

Conversation

Contributor @mdmarshmallow commented Apr 26, 2022

Description

Added basic hyperrectangle faceting capabilities. This is mostly just a draft PR to sketch out what the API will look like. Added new fields to store points as a BinaryDocValues field and then just linearly scan through those points to see if they "fit" inside the hyperrectangle. There are several important things that are still missing in this commit:

  • The current implementation only supports single-valued point fields.
  • The implementation can be optimized (the current idea is to let the user choose whether to put the input hyperrectangles in an R-tree for faster checks than linear scanning, though this wouldn't really be useful below a certain number of input hyperrectangles, depending on R-tree node size).
  • More comprehensive tests need to be added.

Solution

Currently, the stored points are linearly scanned against the provided hyperrectangles to check whether each doc is accepted or not.
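The per-doc check described here can be sketched as follows. The names (`LinearScanSketch`, `LongRangePair`, `matches`) are illustrative stand-ins, not the PR's exact classes:

```java
// Hypothetical sketch of the linear-scan acceptance check: a doc's point
// matches a hyperrectangle iff every dimension falls in that dimension's range.
public class LinearScanSketch {
  /** Inclusive [min, max] range for one dimension. */
  record LongRangePair(long min, long max) {
    boolean accept(long v) {
      return min <= v && v <= max;
    }
  }

  /** Linear scan over dimensions; reject on the first dimension out of range. */
  static boolean matches(long[] point, LongRangePair[] rectangle) {
    for (int dim = 0; dim < rectangle.length; dim++) {
      if (rectangle[dim].accept(point[dim]) == false) {
        return false;
      }
    }
    return true;
  }
}
```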

Tests

Created two basic tests, will need to add more once the API is more set in stone.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.


/** Created DoubleHyperRectangle */
public DoubleHyperRectangle(String label, DoubleRangePair... pairs) {
super(label, pairs.length);
Contributor:

Q: can pairs be null? If so perhaps we should add a null check? Also, is it a valid case to receive an empty pairs?

Contributor Author:

pairs should not be null or empty, added a check for both of these cases.

minIn = Math.nextUp(minIn);
}

if (Double.isNaN(maxIn)) {
Contributor:

Move this check before is (!minInclusive) and I suggest unifying both checks like this:

if (Double.isNaN(minIn) || Double.isNaN(maxIn)) {
  throw new IllegalArgumentException("min and max cannot be NaN: min=" + minIn + ", max=" + maxIn);
}

Contributor Author:

Thank you, will change this.

/** Sole constructor. */
protected HyperRectangle(String label, int dims) {
if (label == null) {
throw new NullPointerException("label must not be null");
Contributor:

I find it inconsistent that you throw NPE here but IAE in DoubleRangePair for illegal arguments.

Contributor Author:

I think this makes sense to be IAE. Changed it.

if (minIn != Long.MAX_VALUE) {
minIn++;
} else {
throw new IllegalArgumentException("Invalid min input");
Contributor:

Add the actual input value to the exception for clarity

if (maxIn != Long.MIN_VALUE) {
maxIn--;
} else {
throw new IllegalArgumentException("Invalid max input");
Contributor:

Add the actual input value to the exception for clarity

FacetResult result = facets.getTopChildren(10, "field");
assertEquals(
"""
dim=field path=[] value=22 childCount=5
Contributor:

nit: personally I'm not a fan of this type of assertion, as it is very fragile. If we change the toString() tomorrow we'll need to fix all the tests. Can we change the test to make explicit assertions on the label + count? Ultimately we want to test the returned facets, not their toString() representation.

Contributor Author:

Changed to test the actual FacetResult values rather than the toString()

/** Stores pair as LongRangePair */
private final LongHyperRectangle.LongRangePair[] pairs;

/** Created DoubleHyperRectangle */
Contributor:

Creates?

Contributor Author:

Typo, thanks for catching!

for (int dim = 0; dim < pairs.length; dim++) {
long longMin = NumericUtils.doubleToSortableLong(pairs[dim].min);
long longMax = NumericUtils.doubleToSortableLong(pairs[dim].max);
this.pairs[dim] = new LongHyperRectangle.LongRangePair(longMin, true, longMax, true);
Contributor:

Is it correct to always pass true (inclusive)? I'm thinking perhaps we should introduce a toLongRangePair on DoubleRangePair which will (1) simplify this code and (2) use the actual values of inclusive for min/max? WDYT?

Contributor Author:

I think passing true here always is fine since min and max are always inclusive themselves, the boolean values just determine whether the params provided are inclusive or not, but if exclusive they are changed to become inclusive. That being said, I think introducing a toLongRangePair function makes sense and will make the code easier to read.
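A minimal sketch of that idea, with hypothetical names (`RangePairSketch`, `toLongRangePair`) and the sortable-long conversion inlined (the same sign-bit trick NumericUtils.doubleToSortableLong uses) so it compiles without Lucene. Exclusive double bounds are first tightened to the nearest representable inclusive values, after which always treating the pair as inclusive is safe:

```java
// Hypothetical converter: normalize exclusive double bounds to inclusive ones,
// then map to sortable longs that preserve double ordering.
public class RangePairSketch {
  record LongRangePair(long min, long max) {}

  /** Monotone double -> long mapping: sortable-long order matches double order. */
  static long doubleToSortableLong(double value) {
    long bits = Double.doubleToLongBits(value);
    if (bits < 0) {
      bits ^= 0x7fffffffffffffffL; // flip all but the sign bit for negatives
    }
    return bits;
  }

  static LongRangePair toLongRangePair(
      double min, boolean minInclusive, double max, boolean maxInclusive) {
    if (minInclusive == false) {
      min = Math.nextUp(min); // smallest double strictly greater than min
    }
    if (maxInclusive == false) {
      max = Math.nextDown(max); // largest double strictly less than max
    }
    // Both bounds are now inclusive, so downstream code can always pass
    // "inclusive", as the author notes above.
    return new LongRangePair(doubleToSortableLong(min), doubleToSortableLong(max));
  }
}
```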

}

private static long[] convertToSortableLongPoint(double[] point) {
long[] ret = new long[point.length];
Contributor:

nit: I think this can be written w/ Stream; since it's called in the ctor of the Field I don't think we should worry about perf. Something like: Arrays.stream(point).mapToLong(NumericUtils::doubleToSortableLong).toArray();. Up to you though :)

Contributor Author:

Agree it makes the code look a lot cleaner :)
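For reference, a self-contained version of the Stream rewrite (the sortable-long conversion is inlined with the same sign-bit trick NumericUtils.doubleToSortableLong uses, so this compiles without Lucene on the classpath; the class name is illustrative):

```java
import java.util.Arrays;

// Stream-based conversion of a double[] point to order-preserving longs.
public class StreamConvertSketch {
  static long doubleToSortableLong(double value) {
    long bits = Double.doubleToLongBits(value);
    if (bits < 0) {
      bits ^= 0x7fffffffffffffffL; // flip all but the sign bit for negatives
    }
    return bits;
  }

  /** DoubleStream.mapToLong produces the long[] directly, no boxing needed. */
  static long[] convertToSortableLongPoint(double[] point) {
    return Arrays.stream(point)
        .mapToLong(StreamConvertSketch::doubleToSortableLong)
        .toArray();
  }
}
```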

* @param dim dimension of the request range
* @return The comparable long version of the requested range
*/
public abstract LongHyperRectangle.LongRangePair getComparableDimRange(int dim);
Contributor:

After your refactoring, it seems that the two variants of this class don't "convert" anything here, but rather just do return pairs[dim], so I wonder if we should rename this method / update the javadocs to remove "conversion" from it, and/or store a LongRangePair[] in this abstract class and have it take them in the constructor. Then we can implement this API in one place only.

Contributor:

If you accept that, you can also move LongRangePair here, which will make the code less odd -- DoubleRangePair and the Collector referencing a type from LongHyperRectangle even though they interact w/ HyperRectangle

Contributor Author:

Moved the LongRangePair to HyperRectangle and made concrete implementation of getComparableDimRange. I ended up making a few code changes as a result of this but I think it is cleaner now.

}

private HyperRectangleFacetCounts(
boolean discarded, String field, FacetsCollector hits, HyperRectangle... hyperRectangles)
Contributor:

What prevents you from having one public ctor which just takes HyperRectangle... parameter?

this.field = field;
this.hyperRectangles = hyperRectangles;
this.dims = hyperRectangles[0].dims;
assert isHyperRectangleDimsConsistent()
Contributor:

I prefer that we do the assertions first, and only then assign to class fields. You can just pass the rectangles + dims to this method

Contributor Author:

Changed in next revision.

}

private boolean isHyperRectangleDimsConsistent() {
for (HyperRectangle hyperRectangle : hyperRectangles) {
Contributor:

I am not sure how you feel about using Stream, but if you're OK with it this one can just be written as return Arrays.stream(hyperRectangles).allMatch(hyperRectangle -> hyperRectangle.dims == dims)

Contributor Author:

Didn't realize allMatch was a thing, thanks for the suggestion!

@Override
public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException {
validateTopN(topN);
if (dim.equals(field) == false) {
Contributor:

nit: if you revert the equals check to field.equals(dim) then you don't risk NPE in case someone passes a null dim (and field is asserted in the ctor).

Contributor Author:

Changed in next revision

super(label, convertToLongRangePairArray(pairs));
}

private static LongRangePair[] convertToLongRangePairArray(DoubleRangePair... pairs) {
Contributor:

nit: I find Array redundant, maybe convertToLongRangePairs? Or toLongRangePairs?

Contributor Author:

Changed to convertToLongRangePairs

}

if (!maxInclusive) {
// Why no Math.nextDown?
Contributor:

Do we want to try to answer this question? :) According to https://appdividend.com/2022/01/05/java-math-nextdown-function-example/:

The nextDown() method is equivalent to nextAfter(d, Double.NEGATIVE_INFINITY) method.

So I think you can just call nextDown()?

Contributor Author:

Ah thanks for checking that out :). Changed to Math.nextDown()
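The cited equivalence is easy to sanity-check directly (the class name here is hypothetical):

```java
// Math.nextDown(d) is documented as equivalent to
// Math.nextAfter(d, Double.NEGATIVE_INFINITY); this helper checks one value.
// Double.compare is used so that -0.0 vs 0.0 and NaN cases compare exactly.
public class NextDownDemo {
  static boolean equivalent(double d) {
    return Double.compare(
            Math.nextDown(d), Math.nextAfter(d, Double.NEGATIVE_INFINITY))
        == 0;
  }
}
```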

}

if (minIn > maxIn) {
throw new IllegalArgumentException("Minimum cannot be greater than maximum");
Contributor:

I think here including the value of min/max in the message will be useful to debug.

Contributor Author:

Added max and min values, also did the same thing in DoubleHyperRectangle

boolean discarded, String field, FacetsCollector hits, HyperRectangle... hyperRectangles)
throws IOException {
assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be empty";
assert isHyperRectangleDimsConsistent(hyperRectangles)
Contributor:

nit: is or are? I think we're referring to the plural rectangles and dims? So areHyperRectangleDimsConsistent?

Contributor Author:

Changed to are, thanks for pointing that out it did sound a bit awkward :)

totCount++;
}
}
doc = it.nextDoc();
Contributor:

Q: can this be moved to the for() line itself? I think so?

Contributor Author:

I think you are correct here. Changed it.

throw new IllegalArgumentException(
"invalid dim \"" + dim + "\"; should be \"" + field + "\"");
}
if (path.length != 0) {
Contributor:

nit: add null check? Do we do this elsewhere?

Contributor Author:

Changed to if (path != null && path.length != 0). Not sure what exactly you mean by doing this elsewhere, but this function is a copy of RangeFacetCounts#getTopChildren.

*
* @param name field name
* @param point double[] value
* @throws IllegalArgumentException if the field name or value is null.
Contributor:

We actually don't check if point is null? Not sure if you intended to

Contributor Author:

Ah true I forgot, added a null and empty check here and in LongPointFacetField as well. Thanks for catching this!

*/
package org.apache.lucene.facet.hyperrectangle;

/** Holds the name and the number of dims for a HyperRectangle */
Contributor:

nit: s/name/label/

Contributor Author:

Made this comment more accurate

import org.apache.lucene.index.DocValues;
import org.apache.lucene.search.DocIdSetIterator;

/** Get counts given a list of HyperRectangles (which must be of the same type) */
Contributor:

nit: we don't actually enforce the "same type" part. Do we really want/care to enforce that?

Contributor Author:

Yeah that's correct, I forgot to remove this when I removed the enforcement.

/** Hypper rectangles passed to constructor. */
protected final HyperRectangle[] hyperRectangles;

/** Counts, initialized in subclass. */
Contributor:

counts is actually initialized in this class. I also think that as javadocs, it's not very helpful. Maybe something like "Holds the number of matching documents (contain at least one intersecting point) for each HyperRectangle"?

Contributor Author:

Changed comment to something similar to what you suggested.

*/
public HyperRectangleFacetCounts(
String field, FacetsCollector hits, HyperRectangle... hyperRectangles) throws IOException {
assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be empty";
Contributor:

Note that you might trip NPE here I think, if someone doesn't pass any rectangle

Contributor Author:

I think that these should just throw IllegalArgumentExceptions, I changed this to a conditional and included a null check.

Contributor @gsmiller left a comment:

Thanks @mdmarshmallow for taking this on! I did a quick pass over this and left you some feedback. Nothing major, but wanted to get this to you. I'll look at it a bit more thoroughly soon and might leave you some additional comments. Thanks again!


for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
if (binaryDocValues.advanceExact(doc)) {
long[] point = LongPoint.unpack(binaryDocValues.binaryValue());
Contributor:

I wonder if we can avoid unpacking every stored point and instead check directly against the packed format? PointRangeQuery actually does this nicely. Instead of unpacking each point, we could pack our hyperrectangle ranges and then compare arrays with a ByteArrayComparator. Maybe have a look at the Weight created in PointRangeQuery#createWeight and see if something similar would make sense here.

Contributor Author:

Did not realize you could compare packed values. I think comparing packed values makes more sense here, as it should be more performant than unpacking every time. Not only that, but when I made the change it allowed me to simplify the code quite a bit.
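A sketch of the packed-comparison idea under discussion, with hypothetical helper names: query bounds are packed once into big-endian, sign-flipped bytes, and each dimension of a stored point is compared slice-by-slice with java.util.Arrays.compareUnsigned (standing in here for Lucene's ByteArrayComparator):

```java
import java.util.Arrays;

// Compare packed points against packed range bounds without decoding longs.
public class PackedCompareSketch {
  static final int BYTES = Long.BYTES;

  /** Big-endian encoding with the sign bit flipped so byte order == numeric order. */
  static byte[] pack(long... values) {
    byte[] packed = new byte[values.length * BYTES];
    for (int i = 0; i < values.length; i++) {
      long sortable = values[i] ^ Long.MIN_VALUE; // flip sign bit
      for (int b = 0; b < BYTES; b++) {
        packed[i * BYTES + b] = (byte) (sortable >>> (8 * (BYTES - 1 - b)));
      }
    }
    return packed;
  }

  /** True if the packed point lies inside [packedMin, packedMax] in every dimension. */
  static boolean matches(byte[] point, byte[] packedMin, byte[] packedMax, int dims) {
    for (int dim = 0; dim < dims; dim++) {
      int from = dim * BYTES, to = from + BYTES;
      if (Arrays.compareUnsigned(point, from, to, packedMin, from, to) < 0
          || Arrays.compareUnsigned(point, from, to, packedMax, from, to) > 0) {
        return false;
      }
    }
    return true;
  }
}
```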

Contributor:

Nice!

private void count(String field, List<FacetsCollector.MatchingDocs> matchingDocs)
throws IOException {

for (int i = 0; i < matchingDocs.size(); i++) {
Contributor:

minor: I'd suggest for (FacetsCollector.MatchingDocs hits : matchingDocs) as a slightly more idiomatic loop style since you don't actually care about the index.

Contributor Author:

Changed to this for loop.

Comment on lines 80 to 87
FacetsCollector.MatchingDocs hits = matchingDocs.get(i);

BinaryDocValues binaryDocValues = DocValues.getBinary(hits.context.reader(), field);

final DocIdSetIterator it = hits.bits.iterator();
if (it == null) {
continue;
}
Contributor:

I think it's a little simpler to read if you create your iterator like this:

BinaryDocValues binaryDocValues = DocValues.getBinary(hits.context.reader(), field);
      final DocIdSetIterator it = 
        ConjunctionUtils.intersectIterators(Arrays.asList(hits.bits.iterator(), binaryDocValues));

... then you don't have to separately advance your doc values iterator (and check that it advanced to the doc) as the loop will take care of all that for you.

Contributor Author:

It didn't even occur to me to intersect the iterators, thanks for the suggestion!

Contributor:

Yeah, this convenience is nice. It also might optimize a little internally by figuring out what to lead with, etc. for doing the conjunction. So definitely nice to use.
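The benefit of intersecting iterators can be illustrated with a generic two-pointer sketch over sorted doc-id arrays (plain arrays stand in for Lucene's DocIdSetIterators here; ConjunctionUtils.intersectIterators does the real thing, including choosing the cheapest iterator to lead with):

```java
import java.util.Arrays;

// Two-pointer intersection of sorted doc-id streams: the caller's loop only
// ever sees docs present in BOTH streams, so no separate advance/check needed.
public class IntersectSketch {
  static int[] intersect(int[] a, int[] b) {
    int[] out = new int[Math.min(a.length, b.length)];
    int i = 0, j = 0, n = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) {
        out[n++] = a[i];
        i++;
        j++;
      } else if (a[i] < b[j]) {
        i++; // advance whichever stream is behind
      } else {
        j++;
      }
    }
    return Arrays.copyOf(out, n);
  }
}
```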

boolean validPoint = true;
for (int dim = 0; dim < dims; dim++) {
HyperRectangle.LongRangePair range = hyperRectangles[j].getComparableDimRange(dim);
if (!range.accept(point[dim])) {
Contributor:

minor: we tend to favor == false instead of ! in the codebase, for readability and less likelihood of introducing a future bug

Contributor Author:

Yeah I think I put that there by mistake. This part of the code got deleted anyways.


for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
if (binaryDocValues.advanceExact(doc)) {
long[] point = LongPoint.unpack(binaryDocValues.binaryValue());
Contributor:

Also, does this imply that each document can only index a single point in this field? Can we support docs with multiple points?

Contributor Author:

For right now yes. I was planning on adding multi value support after the basic API got fleshed out (maybe in a separate issue?)

+ dims
+ ")";
// linear scan, change this to use R trees
boolean docIsValid = false;
Contributor:

docIsValid was a bit of a confusing name to me. This really captures the idea that the doc contributed to at least one HR right? Maybe something like shouldCountDoc or something? I dunno... naming is hard! :)

Contributor Author:

Changed to shouldCountDoc.

Comment on lines 100 to 108
for (int j = 0; j < hyperRectangles.length; j++) {
boolean validPoint = true;
for (int dim = 0; dim < dims; dim++) {
HyperRectangle.LongRangePair range = hyperRectangles[j].getComparableDimRange(dim);
if (!range.accept(point[dim])) {
validPoint = false;
break;
}
}
Contributor:

OK, I think you can make this just a little more readable and avoid a boolean flag here if you use a labeled loop like this:

          ranges:
          for (int j = 0; j < hyperRectangles.length; j++) {
            for (int dim = 0; dim < dims; dim++) {
              HyperRectangle.LongRangePair range = hyperRectangles[j].getComparableDimRange(dim);
              if (!range.accept(point[dim])) {
                continue ranges;
              }
            }
            counts[j]++;
            docIsValid = true;
          }

Contributor Author:

This part of the code got removed in the next revision.

Contributor @gsmiller left a comment:

I added some more feedback as I thought about this a bit more. Just to step back a bit, I think you were mostly interested in API-level feedback. In my opinion, the API is solid. The way users would interact with HyperRectangleFacetCounts makes sense to me. The one "itchy" bit for me, that I'm a little unsettled on, is how to define these multidim point fields. I think you've exposed the issue that we don't have a good way for users to index point data as a doc value field today. So I think we ought to discuss whether or not these new field definitions you've added should be specific to faceting, or if they're more general. I left a detailed comment about this. Thanks again!

import org.apache.lucene.document.LongPoint;

/** Packs an array of longs into a {@link BinaryDocValuesField} */
public class LongPointFacetField extends BinaryDocValuesField {
Contributor:

I was thinking about these field classes you created a little more, and I'm wondering if we should create something more generic than just for faceting? I think what you may have run into here is the fact that we don't actually have a field type for indexing point values as doc values (that I know of anyway). We have all the xxxPoint fields for adding inverted fields in the points index (e.g., LongPoint), but I don't think we have an actual representation for adding them as DVs.

What do you think of moving these into the document package and actually defining them as general DV field types? We might not have to go so far as to actually formalize this new concept in the DocValuesType enum with their own format and such. Under the hood, they could just be a binary format like you have here (at least to start). You might look at LongRangeDocValuesField as a good example of what I mean.

Contributor Author:

I actually like your suggestion a lot; I think it makes more sense because there is nothing really faceting-specific about these fields. I will move them into the document package instead and rename them.

Contributor:

Full transparency: Marc and I had a discussion about this offline so I wanted to circle back here with a suggestion I made to him so it's fully out in the open and we can carry a conversation forward with the community.

While I initially suggested adding this as a sub-class of BinaryRangeDocValuesField (similar to what LongRangeDocValuesField does), I wonder if the right thing would be to actually formalize a new doc values format type. If we're building faceting, and potentially "slow range query" support on top of these, it seems like formalizing the format encoding might be the right thing to do. I'd be really curious what the community thinks of this though, and recommended that Marc start that discussion. I'm personally leaning towards formalizing the format, and maybe even having single-valued and multi-valued versions (analogous to (Sorted)NumericDocValues).

Contributor Author:

I was wondering what your thoughts were on just using separate numeric fields rather than packing them. I think this would make the API "nicer" to be honest, but the big drawback would be that we would need some hacky multivalued implementation. I can think of some ways to build some sort of UnsortedNumericDV on top of SortedNumericDV, but they would all be super hacky, have limitations, and probably not be worth implementing.

Edit: Upon thinking about this further, my suggestion doesn't make sense when we have multi-valued fields

package org.apache.lucene.facet.hyperrectangle;

/** Holds the name and the number of dims for a HyperRectangle */
public abstract class HyperRectangle {
Contributor:

Can this be made pkg-private?

Contributor Author:

I think we want this public right? Since it's a public part of the API.

Contributor:

Does HyperRectangle itself actually need to be part of the public API though? Users certainly need the definitions for Long/DoubleHyperRectangle but do they need the HyperRectangle definition itself? Like would they need a generic reference to HyperRectangle? I'm not sure?

*
* @return A LongRangePair equivalent of this object
*/
public LongRangePair toLongRangePair() {
Contributor:

Does this need to be public? I think it's only used internally in DoubleHyperRectangle right? Should we reduce visibility (unless we expect users need this functionality directly?).

Contributor Author:

Changed to private


/** Get counts given a list of HyperRectangles (which must be of the same type) */
public class HyperRectangleFacetCounts extends Facets {
/** Hypper rectangles passed to constructor. */
Contributor:

typo?

Contributor Author:

I guess that would be pronounced "Hipper rectangles" 😂. Fixed it :).

Comment on lines 35 to 47
protected final HyperRectangle[] hyperRectangles;

/** Counts, initialized in subclass. */
protected final int[] counts;

/** Our field name. */
protected final String field;

/** Number of dimensions for field */
protected final int dims;

/** Total number of hits. */
protected int totCount;
Contributor:

I noticed you made all these fields protected. Were you thinking this might be a useful extension point for users? I might recommend against that, at least for now. If users start extending this, it might limit the changes we can make going forward (needing to stay back-compat with some of our internal implementation details).

Contributor Author:

Yeah I was thinking this would be extended later on, for example we might have a subclass that does linear scanning, another subclass that uses R trees, etc. I think I changed my mind about making things protected halfway through writing this class though, since all the functions are private. For now, since we aren't doing any subclassing yet, I will make it private.

Contributor:

That makes sense. I think leaving it private until there's a need is good.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.document;
Contributor:

I am not opposed to this change, but I find it a bit strange that we add "general" Point DV support without any tests that exercise it, and the only usage of it is in the Facet module. Do we see a use case in the future for other DV usage? Like Sorting?

Anyway I'm fine either way, just wanted to comment here that since it's @lucene.experimental we could have also left it in the facet package and then move here if a more general use case came up.


/**
* Takes an array of doubles and converts them to sortable longs, then stores as a {@link
* BinaryDocValuesField}
Contributor:

nit: Maybe A {@link BinaryDocValuesField} which indexes double point values as sortable-longs?

public class DoublePointDocValuesField extends BinaryDocValuesField {

/**
* Creates a new DoublePointFacetField, indexing the provided N-dimensional long point.
Contributor:

s/DoublePointFacetField/DoublePointDocValuesField/
s/long point/double point/

package org.apache.lucene.document;

/**
* Packs an array of longs into a {@link BinaryDocValuesField}
Contributor:

Can we make this jdoc consistent with the Double variant, mentioning that we're indexing Point values?

public class LongPointDocValuesField extends BinaryDocValuesField {

/**
* Creates a new LongPointFacetField, indexing the provided N-dimensional long point.
Contributor:

Same comment about FacetField

/**
* Checked a long packed value against this HyperRectangle. If you indexed a field with {@link
* org.apache.lucene.document.LongPointDocValuesField} or {@link
* org.apache.lucene.document.DoublePointDocValuesField}, those field values will be able to be
Contributor:

s/will be able to/can/?

+ ") is incompatible with hyper rectangle dimension (dim="
+ dims
+ ")";
for (int dim = 0; dim < dims; dim++) {
Contributor:

Instead of iterating on dim you can iterate on offset starting from 0 to packedValue.length and increment by Long.BYTES?
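Either way, the per-dimension check the loop performs amounts to a linear scan of the point against each range pair. A minimal stand-alone sketch of that logic (the names `LongRangePair`/`accepts` are hypothetical placeholders, not the PR's exact API):

```java
// Illustrative linear scan of one decoded point against a hyperrectangle,
// one dimension at a time, honoring min/max inclusivity per dimension.
public class RangeCheck {
  record LongRangePair(long min, boolean minInclusive, long max, boolean maxInclusive) {}

  static boolean accepts(long[] point, LongRangePair[] pairs) {
    if (point.length != pairs.length) {
      throw new IllegalArgumentException("point dims != rectangle dims");
    }
    for (int dim = 0; dim < pairs.length; dim++) {
      LongRangePair r = pairs[dim];
      long v = point[dim];
      boolean aboveMin = r.minInclusive() ? v >= r.min() : v > r.min();
      boolean belowMax = r.maxInclusive() ? v <= r.max() : v < r.max();
      if (aboveMin == false || belowMax == false) {
        return false; // reject as soon as any dimension falls outside
      }
    }
    return true;
  }
}
```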

new HyperRectangle.LongRangePair(0L, true, 11L, true),
new HyperRectangle.LongRangePair(0L, true, 12L, true)),
new LongHyperRectangle(
"over (90, 91, 92)",
Contributor:

super nit: "Between (90,91,92) and (100,101,102)"? Cause two tests below we have "Over (1000...) which is really just Over, without a real upper limit. But feel free to ignore my pickiness :)

new HyperRectangle.LongRangePair(91L, false, 101L, false),
new HyperRectangle.LongRangePair(92L, false, 102L, false)),
new LongHyperRectangle(
"(90, 91, 92) or above",
Contributor:

If you accept what I wrote above, then please change this too (and the double tests).

import org.apache.lucene.store.Directory;
import org.apache.lucene.tests.index.RandomIndexWriter;

public class TestHyperRectangleFacetCounts extends FacetTestCase {
Contributor:

Two questions about the tests:

  1. Would you like to add a test which verifies we assert that all given rectangles have the same dim?
  2. Would you like to add a test which showcases mixed Long/Double rectangles?

@shaie

shaie commented May 30, 2022

Hey @mdmarshmallow I think this is a great and very useful feature. I also believe that in general it will be good to accompany these changes with a demo main() in the demo package, but it can wait a bit until we have a solid API. I've added to this PR an .adoc with a few example use cases. IMO it will be useful to keep it around, but modify it of course per the feedback we receive, as documentation of this feature. If for some reason we think this document is redundant / will be hard to maintain and we want to stick with javadocs, I don't mind if in the end we delete it. For now I think it's a convenient place to document our thoughts, examples and APIs.

I used the term FacetSets to denote "a set of values that go together". Other names may include Tuple, Group etc. I know naming is the hardest part :). In my mind I'm also thinking about an API like:

// ord("X") refers to the ordinal (`long`) one can retrieve by using the taxonomy index, or some other source
// but underneath it's all just `long...`
doc.add(new FacetSetsField(
    "actorAwards",

    // A Thriller for which this actor received a Best Actor Oscar award in 2022
    new FacetSet(ord("Oscar"), ord("Best Actor"), ord("Thriller"), 2022),

    // A Drama for which this actor received a Best Supporting Actor Emmy award in 2005
    new FacetSet(ord("Emmy"), ord("Best Supporting Actor"), ord("Drama"), 2005),
    ));

Yes, it could be just sugar API on top of HyperRectangle, but perhaps from a faceting perspective it makes more sense and is consistent with the other faceting APIs (RangeFacets, SSDVFacetField etc.). I'd love to receive feedback on the use cases. I can also add to the document a more-than-pseudocode-like example which will include the indexing and aggregation API, so we have something more concrete to discuss?

@gsmiller

@shaie thanks for providing the use case doc. Very helpful!

As far as an API proposal, I really like the "facet set" concept for the actual Facets implementation. Longer-term, I'd be more in favor of keeping the new field more generic (e.g., a generic "points" doc values field that can be used for faceting but also "slow" queries). But making it more general also makes it harder to understand for users that are just trying to use this faceting functionality. If we end up proposing this as a "sandbox" module feature for now, then I'd +1 this idea of a "facet set" API. If we propose adding this to the core functionality though, I'd like to further discuss the pros/cons of how specific we make this new doc values field.

Thanks again!

@shaie

shaie commented May 31, 2022

Longer-term, I'd be more in favor of keeping the new field more generic (e.g., a generic "points" doc values field that can be used for faceting but also "slow" queries)

There are two sides to FacetSets: indexing and aggregation. Well actually three: drill-down too. For indexing I think a generic long... values field makes sense for FacetSets and perhaps other use cases unrelated to faceting. As long as one can use its encoding as LongPoint.pack() or any other future scheme that will be generic, then I don't see why not. However if that's all this field will currently do, then I prefer that we start without it in the facet package. We can always extend this field in the future, rather than BDV, right? (as long as it keeps the same order of the values)
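The `LongPoint.pack()`-style encoding mentioned here boils down to writing each value as 8 fixed-width bytes. A minimal stand-alone sketch (note: Lucene's actual point encoding additionally remaps the sign bit so raw byte comparison matches signed order; this illustrative helper only shows the fixed-width packing idea):

```java
import java.nio.ByteBuffer;

// Minimal sketch of packing a series of longs into the byte[] that a
// BinaryDocValues field would store, plus the inverse for aggregation time.
public class PackedLongs {
  static byte[] pack(long... values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
    for (long v : values) {
      buf.putLong(v); // big-endian by default, 8 bytes per dimension
    }
    return buf.array();
  }

  static long[] unpack(byte[] packed) {
    long[] values = new long[packed.length / Long.BYTES];
    ByteBuffer buf = ByteBuffer.wrap(packed);
    for (int i = 0; i < values.length; i++) {
      values[i] = buf.getLong();
    }
    return values;
  }
}
```

Because the layout is fixed-width, the dimension order is preserved and any future field type can keep reading the same bytes.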

For aggregation though, as I thought about the API more, I realized that a generic "aggregator" makes no sense. On the file-system it's just a long[] but how you aggregate this series is totally use-case dependent:

  • For the "Automotive Parts Store" example, you'd want a "Matching Facet Counts" which are given a list of FacetSet objects and count documents which have an exact matching set.
  • For the HyperRectangle faceting (what's currently proposed in this PR) you'd want a different Facet Counts object which ensures that a given point falls within a pair of long values. So here the underlying long[] is actually pairs of longs, both really belonging to the same "dimension" maybe, so how a "FacetSet" is matched is a different function than for the one above.
  • For N-dimensional FacetSets, a'la the "Movie Awards" example, one might index 5 dimensions for a Movie award, but choose to aggregate only on 3 of them. In that case you'll want a Facet matcher which knows in advance which dimensions it picks from the set to determine whether there's a match.
  • And lastly for the "Movie Actor DB" example, if you'd want to actually compute a matrix, then our Facets API is completely not suitable, since it's a flat single-dimension result type. I don't propose to extend it, only saying that if we wanted to compute N-dimensional results, we'd need another API.
    • Actually, this feature by itself is very cool, since it's an interesting challenge to compute a matrix of "interesting" values. You can't just pick the top "cells" since that will produce fragmented rows, so you have to actually pick "interesting rows/columns". But that's for a different PR.

Therefore, while it looks like a generic long[] DV (unsorted!) field might be enough for indexing facet sets, for aggregations I think we'll have few APIs.

If we end up proposing this as a "sandbox" module feature for now, then I'd +1 this idea of a "facet set" API.

I am personally for developing it directly in the facet module. If we add @lucene.experimental to the classes, along with a "NOTE: this feature is still under heavy development. If you use it you should expect changes to the API as well as the on-disk structures until it is finalized and optimized". WDYT?

If we propose adding this to the core functionality though, I'd like to further discuss the pros/cons of how specific we make this new doc values field.

At this point I really think we can continue with BDV. Only if someone has prior knowledge about the data and can encode each column separately, would it (perhaps) make sense to consider a different encoding?

We can also consider including a VERSION identifier on-disk which we'll use to tell how to read the data during aggregation. It can just be another "long" we add to the beginning of the list. Yes, it's less optimal than a dedicated DV type, since it's repeating the same number for every field in every document, but it's something we can consider if we worry about it. Frankly though, I feel like we can relax the requirement of this feature at the start and not worry about it until after we're ready to remove that NOTE: from the javadocs.

@mdmarshmallow

There was an email thread where some other committers suggested also developing this in sandbox. It does seem like this API could go through some heavy changes (I think we all agree on that here), so it seems like the sandbox module would make more sense for this? Is there a benefit to having it in facets vs sandbox? I think putting this in a field that extends BDV makes the most sense right now as well. I think the VERSION identifier might be a bit overkill though, especially if we decide to put this in sandbox. I don't think we should worry about making this backwards compatible.

As for creating a new multidimensional version of the Facets API, I think what you're saying makes sense, but could we extend the existing Facets API as well, I think those methods would still be relevant right?

@shaie

shaie commented May 31, 2022

Is there a benefit to having it in facets vs sandbox

I personally am not sure that we should worry about the big changes that the module will go under. API-wise we tag the classes as @lucene.experimental which should free us from worrying about the API for a while (but obviously at some point we'll want to declare the API stable!). As for the on-disk structure I also feel like we have some room with it. It's the facet module, not core. There are much less users of it and it's a completely new feature, so I think it's reasonable to declare the whole feature experimental.

I worry that if we put it in sandbox, users might not even attempt to try it, even if technically they don't mind re-indexing on version upgrades. Because "sandbox" feels (to me) like half-baked stuff, while it's not true here - we do deliver value, it's just that the representation of things may change.

@shaie

shaie commented May 31, 2022

I pushed a new package with the FacetSet API I had in mind. As I wrote before, while thinking about it I realized that there are few issues to handle, so I've decided to implement one of the examples in the document: the Automotive Parts Store, just to get a feel how it would work. Few points:

  1. The MatchingFacetSetsCounts takes a FacetSetMatcher and already supports multi-valued fields. For now I've implemented an ExactFacetSetMatcher which requires all dim values to match a given set.
  • We can implement additional ones, such as RangeFacetSetMatcher which evaluates each value against a range.
  • We can implement the Movie Awards matcher which looks at some of the dimension values to determine a match.
  2. The MatchingFacetSetsCounts is not a "one-Counts-to-rule-them-all" class. It's just one use case which counts how many documents each matcher matched as well as how many documents matched overall.
  • We can, and should!, also implement a TopMatchingFacetSetCounts which fixes some dimensions of a set and computes the counts for the "free" dimensions. E.g. "Top 3 years for actors performing in Thriller movies": the matcher will evaluate as a match the first dimensions (the "Genre") and compute a counts[] for the "Year" dimension, then its getTopChildren will return the 3 years with the most "Thriller" actors.

Eventually I see more "XYZFacetCounts" implementation, and not necessarily many more FacetSetMatcher impls.

NOTE: all names in the commit are just proposals, feel free to propose better ones.
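At its core, the exact-match flavor described above reduces to comparing a document's decoded set against each configured set and bumping a counter per matcher. A hypothetical sketch under the assumption that sets are already decoded to long[] (the names here are placeholders, not the committed API):

```java
import java.util.Arrays;

// Illustrative exact facet-set matching and counting: a stored set matches
// a matcher only when every dimension value is equal.
public class ExactSetMatch {
  static boolean matches(long[] configured, long[] stored) {
    return Arrays.equals(configured, stored);
  }

  // Counting then reduces to incrementing counts[i] for each matcher i
  // that accepts one of the document's sets (multi-valued fields allowed).
  static int[] count(long[][] matchers, long[][] docSets) {
    int[] counts = new int[matchers.length];
    for (long[] set : docSets) {
      for (int i = 0; i < matchers.length; i++) {
        if (matches(matchers[i], set)) {
          counts[i]++;
        }
      }
    }
    return counts;
  }
}
```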

@shaie

shaie commented May 31, 2022

As for creating a new multidimensional version of the Facets API, I think what you're saying makes sense, but could we extend the existing Facets API as well, I think those methods would still be relevant right?

I don't know, we'll need to review it when we get there. But returning a matrix is different than returning top-children IMO, so I'm not sure it's worth trying too hard to make the Facets API return N-dimensional results?

@gsmiller

gsmiller commented Jun 3, 2022

Trying to catch up on this now. I've been traveling and it's been difficult to find time. Thanks for all your thoughts @shaie!

I think I'm only half-following your thoughts on the different APIs necessary, and will probably need to look at what you've documented in more detail. But... as a half-baked response, I'm not convinced (yet?) that we need this level of complexity in the API. In my mind, what we're trying to build is a generalization of what is already supported in long/double-range faceting (e.g., LongRangeFacetCounts), where the user specifies all the ranges they want counts for, we count hits against those ranges, and support returning those counts through a couple APIs. Those faceting implementations allow ranges to be specified in a single dimension, and determine which ranges the document points (in one-dimensional space) fall in.

So "hyperrectangle faceting"—in my original thinking at least—is just a generalization of this to multiple dimensions. The points associated with the documents are in n-dimensional space, and the user specifies the different "hyperrectangles" they want counts for by providing a [min, max] range in each dimension. For cases like the "automotive parts finder" example, it's perfectly valid for the "hyperrectangles" provided by the user to also be single points (where the min/max are equivalent values in each dimension). But it's also valid to mix-and-match, where some dimensions are single points and some are ranges (e.g., "all auto parts that fit 'Chevy' (single point) for the years 2000 - 2010 (range)).
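The mix-and-match idea in the paragraph above falls out naturally if an "exact" dimension is expressed as a degenerate [v, v] range. A hypothetical sketch (names are placeholders, not any committed API):

```java
// Hypothetical illustration: an "exact" dimension is just a range whose
// min and max coincide, so one range-based matcher covers both cases,
// e.g. make = 'Chevy' (single point) with years 2000-2010 (range).
public class MixedDims {
  record Range(long min, long max) {
    static Range exact(long v) { return new Range(v, v); } // single point
  }

  static boolean matches(long[] point, Range[] dims) {
    for (int i = 0; i < dims.length; i++) {
      if (point[i] < dims[i].min() || point[i] > dims[i].max()) {
        return false;
      }
    }
    return true;
  }
}
```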

In the situation where a user wants to "fix some dimension" and count over others, it can still be described as a set of "hyperrectangles," but where the specified ranges on some of the dimensions happen to be the same across all of them.

So I'm not quite sure if what you're suggesting in the API is just syntactic sugar on top of this idea, or if we're possibly talking about different things here? I'll try to dive into your suggestion more though and understand. I feel like I'm just missing something important and need to catch up on your thinking. Thanks again for sharing! I'll circle back in a few days when I've (hopefully) had some more time to spend on this :)

@shaie

shaie commented Jun 3, 2022 via email

@mdmarshmallow

Ok so I took a look at the additions you made to get more of an understanding of what is going on here. I'll try to explain my understanding, and let me know if there is anything wrong with what I'm saying. I think the biggest difference between our implementations is the naming scheme. So it seems like the FacetSetMatcher class is equivalent to a HyperRectangle. So we have an ExactFacetSetMatcher, which would be equivalent to making a HyperRectangle where all the min's and the max's are the same for every range (in other words, a point), and I think you proposed above a RangeFacetSetMatcher, which would just be a regular hyper rectangle (albeit under a different name).

I wanted to address this though:

I hope that with this API we'll also pave the way for users to realize
they can implement their own FacetSetMatcher, for instance treating the
first 2 dimensions as range, and 3rd and 4th as exact (again, to create
specialized matchers).

I think this is something that we should provide out of the box right? It seems like it could be common enough if someone is using this functionality. Maybe something like RangedAndExactFacetSetMatcher that lets the user specify which dimensions they want as exact matches; the details don't matter too much right now.

Also for this point:

I also think that the proposed API under facetset is easier to
extend, even though I'm sure we can re-structure the hyperrectangle
package to allow for such extension. Essentially you want a Reader which
knows how to read the long[] and a Filter/Matcher/Whatever which
interprets them, returning a boolean or something else. That part is the
extension point we'd hope users to implement, which will make the
underlying storage of the points an abstraction that users don't have to
deal with.

I get where you're coming from, but I still feel like overriding the matches function is a bit of an expert use case and not super clear, maybe?

I have a proposal here that may be even more convoluted and a bit crazy, but I'll just put it out there in case. So starting with FacetSet. What if we made that abstract FacetSet<T> and provided an abstract long[] writeToLong(T... values)? Then users would be able to store any value by extending this class and overriding writeToLong. For example LongFacetSet extends FacetSet<Long>.

Now with the matchers, I think if we provide an ExactFacetSetMatcher, a RangeFacetSetMatcher, and a RangedAndExactFacetSetMatcher, that would cover all users' facet matching needs. So instead of letting the user override match(), we can create an abstract readToLong() that would read the field the users created and stored. So for example LongExactFacetSetMatcher extends ExactFacetSetMatcher. Or maybe we can figure out a way to combine all the types of matchers into one to make this simpler, but we would try to have a friendlier API than what HyperRectangle has currently. Let me know what you think of this.

@shaie

shaie commented Jun 7, 2022

I think the biggest difference between our implementations is the naming scheme.

That's right. We could get both APIs to the same state of extensibility, but the main difference is the naming. Under the hood it's about encoding a series of long[] and implementing some aggregation function over them.

Maybe something like RangedAndExactFacetSetMatcher that lets user specify which which dimensions they want as exact matches, the details don't matter too much right now.

We could totally offer that out-of-the-box, but I prefer that we do so in a different PR. Not sure about the API details yet, and whether it can truly be a generic impl or not, but it's totally something we should think about.

but I still feel like overriding the matches function is a bit of an expert use case and not super clear, maybe?

Indeed, implementing your own FacetSetMatcher is expert? I mean, OOTB we offer a few generic implementations which you can use to pretty much implement your facet set aggregation. If however you require a different matching impl, or a more specialized one, you can implement your own FacetSetMatcher. Also, given the OOTB examples, one should easily be able to understand how to implement their own.

So instead of letting the user override match(),

Just so I understand and clarify that we're talking about the same thing: there are two classes here - MatchingFacetSetCounts (MFSC) and FacetSetMatcher (FSM). I think that MFSC can be final since I don't expect users to extend it. All they need is to provide a list of FSMs. Extending FSM is what you refer to here? This is indeed the more expert API, which I hope users won't have to extend, given the OOTB classes.

we can create an abstract readToLong() that would read the field that the users created and stored

So I wanted to allow FSM impls to be as optimal as they need, hence they are given a byte[]. We could consider passing long[] in the API, but that would mean that MFSC needs to deserialize the bytes to longs, which is what we wanted to avoid in the HyperRectangle impl. For this reason I don't think FSM should make you read the bytes into longs, let the impl do that if it's more convenient for it. Maybe we'll even find out that deserializing them to long[] (or reading them one long at a time) performs better, especially if the same dimension is evaluated over and over in a single match() call?
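Reading one dimension at a time straight out of the raw byte[] (rather than deserializing the whole array up front) is cheap either way, assuming the fixed-width 8-bytes-per-dimension layout discussed above. A sketch of the per-dimension read (helper names are illustrative):

```java
import java.nio.ByteBuffer;

// Sketch of reading a single dimension's value directly from the packed
// byte[], as a match() impl might do instead of decoding every dimension.
public class DimReader {
  static long readDim(byte[] packed, int dim) {
    // Absolute read at the dimension's fixed 8-byte offset.
    return ByteBuffer.wrap(packed).getLong(dim * Long.BYTES);
  }
}
```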

I would also like to add that this sort of API is always something we can add in the future, in the form of LongFSM which overrides match() and adds its own abstract long readLong(...) version. That is, I'm not sure we need to finalize the details of this API just yet?

What if we made that abstract FacetSet<T> and provide an abstract long[] writeToLong(T... values).

Do you see more than two impls here? I.e. I put FacetSet but I wrote that we should also have DoubleFacetSet which encodes doubles. Any other impls here? Also, I'm not sure we should offer this extension to users, they can always implement that themselves? If you think about it, we eventually utilize BinaryDV to encode some data and aggregate it. The indexing + reading part of the code is not that complex so users can just write their own, and reuse existing classes where they fit?

If users can decide how to write their long[], they will also need to implement the reader, which in this PR is MFSC? If we'll change the aggregation API to be long-based then maybe? I think though that we may not need the generic here anyway. FacetSet can store long[] with a default impl for toBytes[] and you can override it if you wish. But this is super-expert API IMO, that I think we should expose only later if the need arises.

@mdmarshmallow

We could totally offer that out-of-the-box, but I prefer that we do so in a different PR. Not sure about the API details yet, and whether it can truly be a generic impl or not, but it's totally something we should think about.

This sounds like a good idea, I agree

Indeed, implementing your own FacetSetMatcher is expert?

Ok, if this is the case, then should we provide more out of the box FacetSetMatcher classes (like a double and a long implementation for starters).

Extending FSM is what you refer to here? This is indeed the more expert API, which I hope users won't have to extend, given the OOTB classes.

Yes I am referring to extending FSM, but as I said we should probably supply more OOTB classes in this case?

For this reason I don't think FSM should make you read the bytes into longs, let the impl do that if it's more convenient for it.

Ah I see your complaints here. Yeah my goal was trying to make this less of an "expert" API, but if we are going to treat overriding matches() as an expert API and provide more OOTB classes, that makes sense.

But one thing I want to point out is that if we are able to read all byte[]s into long[]s, we can do some stuff with R-trees in MatchingFacetSetCounts, like putting all the FSMs into an R-tree for faster matching. If we make it so that the data in an FSM isn't guaranteed to be convertible to longs, we can't do this optimization in the future.

Do you see more than two impls here?

I could, for example if they wanted to encode String ordinals here where the ordinals are long, they can write that conversion logic in writeToLong(String... values). With that being said, as you mentioned earlier this enforces encoding in longs which we do want to avoid right?

@shaie

shaie commented Jun 7, 2022

Ok, if this is the case, then should we provide more out of the box FacetSetMatcher classes (like a double and a long implementation for starters)

Yes absolutely! I wrote somewhere in the comments that I only provided Long variants to talk about the API itself, but totally we should include a Double variant too.

I was also thinking is that if we are able to read all bytes[] into longs[], we can do some stuff with R-trees

That's a good point! So let me clarify the impl and my thoughts: originally you implemented it by reading the byte[] into a long[] and @gsmiller commented that you can compare the bytes directly, so I just went along with this proposal (I assume it's for efficiency). But if we don't think that comparing bytes is necessarily more efficient than converting to longs, and it does limit us in terms of the API in the future and where we'd want to take it to, then by all means let's make the matching API long-based.

I could, for example if they wanted to encode String ordinals here where the ordinals are long, they can write that conversion logic in writeToLong(String... values)

So they can already do that, by passing the ordinals to the FacetSet constructor. I even demonstrated that in the test. Do you see an issue w/ that?

@gsmiller

gsmiller commented Jun 7, 2022

OK, I've (somewhat) caught up on the conversation here and will follow up on my original questions/comments (but am not going to jump in right now on the latest API discussion).

  1. I like this "facet set" naming approach along with providing specific implementations for "exact match" cases and "range" cases. I think we should stick to these two for now. If a user wants to "mix and match" (some dims are exact matches and some are ranges), they can use the more general "range" implementation (with some dim ranges containing common values for min/max). Or they could of course implement their own. I don't think we need the complexity of an OOTB "mix and match" solution (for now at least).
  2. As far as solving for use-cases where users want to "fix" the n-1 dims and then get top values for the nth dim, I don't think we need to solve for that (yet). The existing "range" facet counting doesn't solve for this, and requires users to fully describe the ranges they care about. So for the sake of "progress not perfection", I see no issue with following a similar pattern here.
  3. If users do need to implement the above use-case (no. 2 above), there's actually a different way to go about it. Because LongValueFacetCounts allows users to provide a LongValuesSource, users can implement their own LongValuesSource that provides values for the dimension they want to count, but pre-filters to only the points that match the n-1 filtering dims. So in the above example, if users wanted the top year values for movies that received the "Oscar+Drama" award, they can implement a LongValuesSource on top of the binary doc value field (the packed points) that "emits" the year value for each point, but only if the other dims meet the "Oscar+Drama" criteria. I've actually done this in practice. We could certainly make this easier for users to do, but they have all the primitives to do this on their own (especially with the addition of the proposed LongPointDocValuesField).
  4. I think there's actually a nice future optimization that's a bit easier with modeling the "exact match" and "range" cases separately. If the user has many points or "hyperrectangles" specified, we might want to use some sort of space-partitioning data structure to make determining the matching points/hyperrectangles more efficient as we iterate the doc points (instead of doing an exhaustive search every time). These data structures will be different for these two cases (one is probably some sort of KD-tree for the "exact match" case and the other might be some sort of R-tree for the "hyperrectangle" case). So having these separate implementations might actually set us up for a nice performance improvement too, where if we modeled everything as "hyperrectangles", we could end up just stuffing a bunch of points into an R-tree which is a little weird.
  5. I look forward to seeing the "range" implementation sketched out :)

@mdmarshmallow

Hey Greg, I had a question about point 4. Are you saying we should have a separate hyper rectangle implementation in addition to facet sets in order to implement the R-tree and KD-tree optimizations? I actually addressed this above, but I think we can just implement those in facet sets (specifically MatchingFacetSetCounts), right? Couldn't we have the class put RangeFSM into R-trees and ExactFSM into KD-trees, and wouldn't that fix the issue you are talking about here? I could be misunderstanding something so let me know.

@shaie

shaie commented Jun 7, 2022

look forward to seeing the "range" implementation sketched out

I pushed a commit w/ RangeFacetSetMatcher which is basically very similar to HyperRectangle, but w/ some different names (i.e. LongRange instead of LongPair). I've also refactored the tests into 3 classes: a general test case for MatchingFacetSetCounts and two classes for Exact/RangeFacetSetMatcher. The tests are not exhaustive yet, and the impl still lacks a Double variant.

Do we want to consider moving to a long[] based matching API now? To allow for future optimizations like R/KD-Trees? Or proceed w/ the byte[] version for now (to finalize this PR, and until the need arises)?

@mdmarshmallow
Copy link
Contributor Author

I want to summarize the open questions we have right now to help figure out what we should do next:

  1. Should we split the ExactFSM and range multidim implementations into separate packages (facetset for the exact implementation and hyperrectangle for the range implementation), or keep them in the same package?
  2. As @shaie mentioned, should we have a long[] based API or not?

For the first one, I talked with Greg a bit more about his suggestion to have them in separate packages and I think I agree with it. We can then make more specialized subclasses (like DoubleExactFSM and DoubleHyperRectangle) without cluttering up the package, and optimizations won't have to account for the fact that we could be doing either exact matching or range matching. Maybe we combine them in the future? But I think for now we should keep them separate.

For the second question, I think we should keep this as a long[] based API, since we know we want to make the KD-tree and R-tree optimizations in the future, and adding extra work that we'd just have to revert later doesn't make sense to me. Though if you have contrary opinions please let me know; I could see other viewpoints on this.

I think once we come to an agreement on these questions it will be a lot easier to move forward, at least for me, because it will give me a clearer understanding of exactly what our final product (for this PR at least) should look like.

@shaie
Copy link
Contributor

shaie commented Jun 8, 2022

facetset for the exact implementation and hyperrectangle for the range implementation

If I understand correctly, @gsmiller's proposal was more about efficiency, right? So if you have a large number of points that you'd like to index, an R-Tree might be a better data structure for these range matches than the current linear scan we do in RangeFSM. But that doesn't mean we shouldn't have RangeFSM for the small-dimensionality / small-number-of-FacetSets-per-doc cases, right? It might even perform better there?

IMO, and to echo Greg's "progress not perfection", I feel it's better if we introduce the HyperRectangle impls (1) when we have a good use case that demonstrates the need for them (it just helps to reason about it) and (2) when we're actually ready to implement them "more efficiently". That is, in this PR I feel like we can make do w/ RangeFSM, and later (if and when we have a better alternative for large dims) document and reference the alternative?

For the second question, I think we should keep this as a long[] based API as we know we want to make the KD tree and R tree optimizations in the future

If we think that KD-Trees can be more efficient for FacetSetMatcher impls and we're sure that we'll get there soon, then yeah, we can move to a long[] based API. Another alternative I thought of is keeping the byte[] API but introducing a helper method on FSM which converts the byte[] to long[] for whoever needs it.
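For reference, such a helper could be quite small. Here's a plain-Java sketch, assuming each dimension is packed as 8 big-endian bytes (the actual encoding used in the PR may differ):

```java
import java.nio.ByteBuffer;

// Sketch of a byte[] -> long[] helper; assumes each dimension is packed as
// 8 big-endian bytes, which may not match the PR's actual encoding.
class FacetSetDecoder {
  static long[] toLongs(byte[] packed) {
    if (packed.length % Long.BYTES != 0) {
      throw new IllegalArgumentException(
          "packed length must be a multiple of " + Long.BYTES);
    }
    long[] dims = new long[packed.length / Long.BYTES];
    ByteBuffer buf = ByteBuffer.wrap(packed); // big-endian by default
    for (int i = 0; i < dims.length; i++) {
      dims[i] = buf.getLong();
    }
    return dims;
  }
}
```

That way MFSC (or a future tree-based impl) could decode on demand without changing the stored format.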

To reiterate what I previously wrote about API back-compat, I feel that putting the @lucene.experimental tag on the classes is enough to let us not worry about changing it. Especially if the API hasn't been released yet, and definitely if we intend to follow up with a KD-Tree impl very soon. A year from now (just an example) it's a different story, but for now I don't think we need to finalize the API for good.

@gsmiller
Copy link
Contributor

gsmiller commented Jun 8, 2022

Let me try to summarize my understanding of the "future optimization" debate as it pertains to this proposed API/implementation and see if we're on the same page or if I'm overlooking something.

The current proposal encapsulates/delegates matching logic behind FacetSetMatcher, which is responsible for determining whether or not that FSM instance matches a provided point. There's RangeFSM, which knows how to match based on whether or not the point is contained in the n-dim range, and there's ExactFSM, which just does exact point equivalence. The "protocol" for doing facet aggregation/counting is implemented inside MatchingFacetSetsCounts, which delegates to the FSM#matches method. The inefficiency is that, because MatchingFacetSetsCounts can make no assumptions about the FSM instances provided to it, it must iterate all provided FSM instances for every indexed point of every provided hit.
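The protocol I'm describing can be sketched roughly like this (simplified; class and method names only loosely follow the actual PR code):

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch of the counting "protocol": matchers are opaque, so the
// counter must ask every matcher about every point of every hit.
interface FacetSetMatcher {
  boolean matches(long[] point);
}

class ExactMatcher implements FacetSetMatcher {
  private final long[] values;
  ExactMatcher(long... values) { this.values = values; }
  @Override
  public boolean matches(long[] point) { return Arrays.equals(values, point); }
}

class RangeMatcher implements FacetSetMatcher {
  private final long[] min, max;
  RangeMatcher(long[] min, long[] max) { this.min = min; this.max = max; }
  @Override
  public boolean matches(long[] point) {
    for (int dim = 0; dim < point.length; dim++) {
      if (point[dim] < min[dim] || point[dim] > max[dim]) return false;
    }
    return true;
  }
}

class Counter {
  // O(hits * matchers * points): an exhaustive scan, since no assumptions
  // can be made about the matchers.
  static int[] count(List<FacetSetMatcher> matchers, List<long[][]> hitPoints) {
    int[] counts = new int[matchers.size()];
    for (long[][] points : hitPoints) {
      for (int m = 0; m < matchers.size(); m++) {
        for (long[] point : points) {
          if (matchers.get(m).matches(point)) {
            counts[m]++;
            break; // in this sketch, count each hit at most once per matcher
          }
        }
      }
    }
    return counts;
  }
}
```

The triple-nested loop is exactly the cost that a space-partitioning structure over the matchers would attack.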

For the single point case (ExactFSM), this is crazy right? Even if there are a small number of provided ExactFSM instances we're matching against, doing a linear scan of all of them for every point is pretty dang inefficient. Especially so for a case where there are many provided hits with many indexed points for each. I think the same logic generally holds true for the range case as well, but maybe that's more debatable.

But, the problem is, MatchingFacetSetsCounts doesn't know anything about these FSM instances it gets and can't do anything other than just ask them all, "hey, do you match this point?" And so the debate seems to be how to setup the API to allow for future optimizations, or if we should even worry about it at all.

I personally think we should design with this optimization in mind, but I think we're close and I don't actually think the current proposal needs to really change to allow for future optimizations.

This is where I get a little fuzzy on the conversation that's taken place as I haven't totally caught up on the various proposals, and conversations taking place. But, if we kept the implementation as it currently exists, in the future, if we want to put these optimizations in place, could we not just add a method to FSM that exposes the min/max values for each dimension? Then, MatchingFacetSetsCounts could inspect the underlying "hyperrectangles" represented by each FSM by looking at these min/max values before it counts and decide how it wants to proceed. The matches method is potentially still useful/needed depending on how flexible we want this thing to be; if a user creates an FSM implementation that's more complicated than just a "hyperrectangle" (e.g., some complicated geometry), the contract for providing min/max could be that the provided min/max is a bounding box, but matches still needs to be called to actually confirm the match.
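Concretely, a hypothetical shape for that extension might be (names are illustrative, not a concrete proposal):

```java
import java.util.Arrays;

// Hypothetical extension of the matcher API that exposes a per-dimension
// bounding box; names are illustrative only.
interface BoundedFacetSetMatcher {
  boolean matches(long[] point);
  long[] min(); // per-dimension lower bound of the bounding box
  long[] max(); // per-dimension upper bound; for exact matchers min == max
}

// For an exact matcher the bounding box degenerates to a single point
// (min == max), so the counter could e.g. key a KD-tree on that point, while
// true ranges could go into an R-tree. matches() stays as the final check for
// matchers whose shape is more complicated than their bounding box.
class BoundedExactMatcher implements BoundedFacetSetMatcher {
  private final long[] values;
  BoundedExactMatcher(long... values) { this.values = values.clone(); }
  @Override
  public boolean matches(long[] point) { return Arrays.equals(values, point); }
  @Override
  public long[] min() { return values.clone(); }
  @Override
  public long[] max() { return values.clone(); }
}
```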

I point this out not to propose implementing it now, but to say that I think we have options to extend what is currently proposed here if/when we want to optimize. Does this make sense or am I talking about a completely different problem or missing key parts of the conversation that's happened? Apologies if I have.

@mdmarshmallow
Copy link
Contributor Author

So based on everyone's comments:

  1. It seems like we should ditch the hyperrectangle implementation, since facetset does everything we need for now.
  2. When we decide to optimize this (ideally right after this PR is merged), we would let MatchingFacetSetsCounts take a look at the FSMs passed to it and then determine whether it should put each FSM into an R-tree or a KD-tree, or just linearly scan, based on the min and max of each FSM. I think this makes sense, but we also shouldn't discuss it too much here as I think it's for another PR. The point is that we can optimize the facetset package in its current state. With that being said, I do plan on writing the KD-tree and R-tree optimizations as soon as this is merged, so I am still for this remaining a long[] API.
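Just to illustrate the kind of dispatch I mean, without designing it here (the threshold is made up and would need the benchmarking Shai mentioned):

```java
// Illustrative only: how a counter might pick a strategy once it can inspect
// the FSMs passed to it. The crossover threshold is invented and would need
// real benchmarks before anything like this is implemented.
class StrategyPicker {
  static String choose(int numExactMatchers, int numRangeMatchers) {
    final int TREE_THRESHOLD = 16; // hypothetical crossover point
    int total = numExactMatchers + numRangeMatchers;
    if (total <= TREE_THRESHOLD) {
      return "linear-scan"; // few matchers: scanning is likely cheapest
    }
    // Ranges need an R-tree; pure points could use a KD-tree instead.
    return numRangeMatchers > 0 ? "r-tree" : "kd-tree";
  }
}
```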

@shaie
Copy link
Contributor

shaie commented Jun 9, 2022

For the single point case (ExactFSM), this is crazy right? Even if there are a small number of provided ExactFSM instances we're matching against, doing a linear scan of all of them for every point is pretty dang inefficient. Especially so for a case where there are many provided hits with many indexed points for each.

That's true, and that's why we discussed having a fastMatchQuery optimization too (which will skip over hits that don't, e.g., have any "Ford" or "2010" dimension values). That's an optimization we should (IMO) add in the (near) future, after this PR.

I personally think we should design with this optimization in mind, but I think we're close and I don't actually think the current proposal needs to really change to allow for future optimizations.

I agree! I think it's fine if we'll leave these optimizations for later, and even if that will change the API between MFSC and FSM, it's not a big deal yet.

if we want to put these optimizations in place, could we not just add a method to FSM that exposes the min/max values for each dimension?

We certainly can add such an API. For "exact" matches it will return min=max, right? Only for range ones will they be different. Are you proposing to return min[] and max[] arrays, one entry per dimension? Just to make sure I understood your proposal (it doesn't have to be two arrays, but you understand the question).

we would let MatchingFacetSetsCounts take a look at the FSMs passed to it and then determine if it should put the FSM into an R tree, KD tree, or just linearly scan based on the min and max of each FSM

Not intending to start a discussion on how to implement that, but just wanted to point out that fastMatchQuery is something we'll need anyway (I guess) for drilldown, so it might be worth starting with it first? Also, I'd rather we have some baseline benchmarks before we implement any optimization, preferably for several use cases, so that in the end we can propose which impl to use. E.g., maybe if you pass 1-2 FSMs we won't need to create R/KD trees at all (they might even perform worse)? Anyway, let's leave that for later.

With that being said, I do plan on writing the KD and R tree optimizations as soon as this is merged so I am still for this remaining a long[] API.

I don't mind if we do that, but since it seems like a trivial change to make later (it doesn't affect the end-user API, only the internal protocol between MFSC and FSM), and since fastMatchQuery may be good enough to speed up FacetSet aggregations, we may not need a long[] API at all? Just saying I feel like it's not a decision we have to make now. We could also add a helper toLong[] method to begin with, so that MFSC can convert the bytes to longs, do whatever R/KD-tree-iness it needs, and then call the FSM? I don't know if what I wrote even makes sense, but I feel like changing the API to fit the optimizations when we actually come to implement them will be much clearer, since by then we'll know what we need.

@mdmarshmallow
Copy link
Contributor Author

I think the rebase was somehow messed up, I cleaned up the history and force pushed. Everything should be included in this push.

@shaie
Copy link
Contributor

shaie commented Jun 23, 2022

Thanks @mdmarshmallow. You added the CHANGES entry under Lucene 9.3.0, so I'm just verifying -- we're going to merge this to both the main and 9.x branches?

@mdmarshmallow
Copy link
Contributor Author

Yeah, I think this change should be completely compatible with 9.3.0. Most of our changes are isolated to the new facetset package, and all other changes just add some functions in different places, which should not affect any existing functionality.

@gsmiller
Copy link
Contributor

+1 to backporting to 9.x. I think we're ready to merge as far as I'm concerned. @shaie I'll leave it to you to merge and backport, assuming you also feel we're good-to-go here? If you'd prefer I merge/backport, I'm happy to help out as well. Thanks!

@shaie shaie merged commit f6bb9d2 into apache:main Jun 25, 2022
shaie added a commit to shaie/lucene that referenced this pull request Jun 25, 2022
Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
Co-authored-by: Shai Erera <serera@gmail.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
shaie added a commit that referenced this pull request Jun 25, 2022
* LUCENE-10274: Add FacetSets faceting capabilities (#841)

Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
Co-authored-by: Shai Erera <serera@gmail.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>