New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10274: Add hyperrectangle faceting capabilities #841
Conversation
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java
Outdated
Show resolved
Hide resolved
|
||
/** Created DoubleHyperRectangle */ | ||
public DoubleHyperRectangle(String label, DoubleRangePair... pairs) { | ||
super(label, pairs.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: can pairs
be null? If so perhaps we should add a null check? Also, is it a valid case to receive an empty pairs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pairs
should not be null or empty, added a check for both of these cases.
minIn = Math.nextUp(minIn); | ||
} | ||
|
||
if (Double.isNaN(maxIn)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this check before is (!minInclusive)
and I suggest unifying both checks like this:
if (Double.isNaN(minIn) || Double.isNaN(maxIn)) {
throw new IllegalArgumentException("min and max cannot be NaN: min=" + minIn + ", max=" + maxIn);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, will change this.
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangle.java
Outdated
Show resolved
Hide resolved
/** Sole constructor. */ | ||
protected HyperRectangle(String label, int dims) { | ||
if (label == null) { | ||
throw new NullPointerException("label must not be null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it consistent that you throw NPE here and IAE in DoubleRangePair
for illegal arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense to be IAE. Changed it.
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java
Outdated
Show resolved
Hide resolved
if (minIn != Long.MAX_VALUE) { | ||
minIn++; | ||
} else { | ||
throw new IllegalArgumentException("Invalid min input"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the actual input value to the exception for clarity
if (maxIn != Long.MIN_VALUE) { | ||
maxIn--; | ||
} else { | ||
throw new IllegalArgumentException("Invalid max input"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the actual input value to the exception for clarity
lucene/facet/src/test/org/apache/lucene/facet/hyperrectangle/TestHyperRectangleFacetCounts.java
Outdated
Show resolved
Hide resolved
FacetResult result = facets.getTopChildren(10, "field"); | ||
assertEquals( | ||
""" | ||
dim=field path=[] value=22 childCount=5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: personally I prefer less this type of assertions as they are very fragile. If we change the toString()
tomorrow we'll need to fix all the tests. Can we change to test to make explicit assertions on the label + count? Eventually we want to test the returned facets, not their toString()
representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to test the actual FacetResult
values rather than the toString()
4aa39fb
to
6545c93
Compare
/** Stores pair as LongRangePair */ | ||
private final LongHyperRectangle.LongRangePair[] pairs; | ||
|
||
/** Created DoubleHyperRectangle */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, thanks for catching!
for (int dim = 0; dim < pairs.length; dim++) { | ||
long longMin = NumericUtils.doubleToSortableLong(pairs[dim].min); | ||
long longMax = NumericUtils.doubleToSortableLong(pairs[dim].max); | ||
this.pairs[dim] = new LongHyperRectangle.LongRangePair(longMin, true, longMax, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct to always pass true
(inclusive)? I'm thinking perhaps we should introduce a toLongRangePair
on DoubleRangePair
which will (1) simplify this code and (2) use the actual values of inclusive for min/max? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think passing true
here always is fine since min
and max
are always inclusive themselves, the boolean values just determine whether the params provided are inclusive or not, but if exclusive they are changed to become inclusive. That being said, I think introducing a toLongRangePair
function makes sense and will make the code easier to read.
} | ||
|
||
private static long[] convertToSortableLongPoint(double[] point) { | ||
long[] ret = new long[point.length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this can be written w/ Stream
, since it's called in the ctor of the Field I don't think we should worry about perf. Something like: Arrays.stream(point).mapToObject(NumericUtils::doubleToSortableLong).toArray();
. Up to you though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree it makes the code look a lot cleaner :)
* @param dim dimension of the request range | ||
* @return The comparable long version of the requested range | ||
*/ | ||
public abstract LongHyperRectangle.LongRangePair getComparableDimRange(int dim); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After your refactoring, it seems that the two variants of this class don't "convert" anything here, but rather just do return pairs[dim]
, so I wonder if we should rename this method / update the javadocs to remove "conversion" from it, and/or store a LongRangePair[]
in this abstract class and have it take them in the constructor. Then we can implement this API in one place only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you accept that, you can also move LongRangePair
here, which will make the code less odd -- DoubleRangePair and the Collector referencing a type from LongHyperRectangle
even though they interact w/ HyperRectangle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the LongRangePair
to HyperRectangle
and made concrete implementation of getComparableDimRange
. I ended up making a few code changes as a result of this but I think it is cleaner now.
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
private HyperRectangleFacetCounts( | ||
boolean discarded, String field, FacetsCollector hits, HyperRectangle... hyperRectangles) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What prevents you from having one public ctor which just takes HyperRectangle...
parameter?
this.field = field; | ||
this.hyperRectangles = hyperRectangles; | ||
this.dims = hyperRectangles[0].dims; | ||
assert isHyperRectangleDimsConsistent() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer that we do the assertions first, and only then assign to class fields. You can just pass the rectangles + dims to this method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in next revision.
} | ||
|
||
private boolean isHyperRectangleDimsConsistent() { | ||
for (HyperRectangle hyperRectangle : hyperRectangles) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how do you feel about using Stream
but if you're OK with it this one can just be written as return Arrays.stream(hyperRectangles).allMatch(hyperRectangle -> hyperRectangle.dims == dims)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't realize allMatch
was a thing, thanks for the suggestion!
@Override | ||
public FacetResult getTopChildren(int topN, String dim, String... path) throws IOException { | ||
validateTopN(topN); | ||
if (dim.equals(field) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if you revert the equals
check to field.equals(dim)
then you don't risk NPE in case someone passes a null dim
(and field
is asserted in the ctor).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in next revision
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java
Outdated
Show resolved
Hide resolved
super(label, convertToLongRangePairArray(pairs)); | ||
} | ||
|
||
private static LongRangePair[] convertToLongRangePairArray(DoubleRangePair... pairs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I find Array
redundant, maybe convertToLongRangePairs
? Or toLongRangePairs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to convertToLongRangePairs
} | ||
|
||
if (!maxInclusive) { | ||
// Why no Math.nextDown? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to try answer this question? :) According to https://appdividend.com/2022/01/05/java-math-nextdown-function-example/:
The nextDown() method is equivalent to nextAfter(d, Double.NEGATIVE_INFINITY) method.
So I think you can just call nextDown()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks for checking that out :). Changed to Math.nextDown()
} | ||
|
||
if (minIn > maxIn) { | ||
throw new IllegalArgumentException("Minimum cannot be greater than maximum"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here including the value of min/max in the message will be useful to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added max and min values, also did the same thing in DoubleHyperRectangle
boolean discarded, String field, FacetsCollector hits, HyperRectangle... hyperRectangles) | ||
throws IOException { | ||
assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be empty"; | ||
assert isHyperRectangleDimsConsistent(hyperRectangles) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is
or are
? I think we're referring to the plural rectangles and dims? So areHyperRectangleDimsConsistent
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to are
, thanks for pointing that out it did sound a bit awkward :)
totCount++; | ||
} | ||
} | ||
doc = it.nextDoc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: can this be moved to the for()
line itself? I think so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are correct here. Changed it.
throw new IllegalArgumentException( | ||
"invalid dim \"" + dim + "\"; should be \"" + field + "\""); | ||
} | ||
if (path.length != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add null
check? Do we do this elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to if (path != null && path.length != 0)
. Not sure what exactly you mean by doing this elsewhere, but this function is a copy of RangeFacetCounts#getTopChildren
.
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java
Outdated
Show resolved
Hide resolved
* | ||
* @param name field name | ||
* @param point double[] value | ||
* @throws IllegalArgumentException if the field name or value is null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually don't check if point
is null? Not sure if you intended to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah true I forgot, added a null
and empty check here and in LongPointFacetField
as well. Thanks for catching this!
*/ | ||
package org.apache.lucene.facet.hyperrectangle; | ||
|
||
/** Holds the name and the number of dims for a HyperRectangle */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/name/label/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this comment more accurate
import org.apache.lucene.index.DocValues; | ||
import org.apache.lucene.search.DocIdSetIterator; | ||
|
||
/** Get counts given a list of HyperRectangles (which must be of the same type) */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we don't actually enforce the "same type" part. Do we really want/care to enforce that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's correct, I forgot to remove this when I removed the enforcement.
/** Hypper rectangles passed to constructor. */ | ||
protected final HyperRectangle[] hyperRectangles; | ||
|
||
/** Counts, initialized in subclass. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
counts
is actually initialized in this class. I also think that as javadocs, it's not very helpful. Maybe something like "Holds the number of matching documents (contain at least one intersecting point) for each HyperRectangle"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed comment to something similar to what you suggested.
*/ | ||
public HyperRectangleFacetCounts( | ||
String field, FacetsCollector hits, HyperRectangle... hyperRectangles) throws IOException { | ||
assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be empty"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that you might trip NPE here I think, if someone doesn't pass any rectangle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that these should just throw IllegalArgumentExceptions
, I changed this to a conditional and included a null
check.
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mdmarshmallow for taking this on! I did a quick pass over this and left you some feedback. Nothing major, but wanted to get this to you. I'll look at it a bit more thoroughly soon and might leave you some additional comments. Thanks again!
|
||
for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) { | ||
if (binaryDocValues.advanceExact(doc)) { | ||
long[] point = LongPoint.unpack(binaryDocValues.binaryValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can avoid unpacking every stored point and instead check directly against the packed format? PointRangeQuery
actually does this nicely. Instead of unpacking each point, we could pack our hypterrectangle ranges and then compare arrays with a ByteArrayComparator
. Maybe have a look at the Weight
created in PointRangeQuery#createWeight
and see if something similar would make sense here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not realize you could compare packed values. I think comparing packed values makes more sense here as it should be more performance than unpacking every time. Not only that but when I made the change it allowed me to simplify the code quite a bit I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
private void count(String field, List<FacetsCollector.MatchingDocs> matchingDocs) | ||
throws IOException { | ||
|
||
for (int i = 0; i < matchingDocs.size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I'd suggest for (FacetsCollector.MatchingDocs hits : matchingDocs)
as a slightly more idiomatic loop style since you don't actually care about the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to this for
loop.
FacetsCollector.MatchingDocs hits = matchingDocs.get(i); | ||
|
||
BinaryDocValues binaryDocValues = DocValues.getBinary(hits.context.reader(), field); | ||
|
||
final DocIdSetIterator it = hits.bits.iterator(); | ||
if (it == null) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a little simpler to read if you create your iterator like this:
BinaryDocValues binaryDocValues = DocValues.getBinary(hits.context.reader(), field);
final DocIdSetIterator it =
ConjunctionUtils.intersectIterators(Arrays.asList(hits.bits.iterator(), binaryDocValues));
... then you don't have to separately advance your doc values iterator (and check that it advanced to the doc) as the loop will take care of all that for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn't even occur to me to intersect the iterators, thanks for the suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this convenience is nice. It also might optimize a little internally by figuring out what to lead with, etc. for doing the conjunction. So definitely nice to use.
boolean validPoint = true; | ||
for (int dim = 0; dim < dims; dim++) { | ||
HyperRectangle.LongRangePair range = hyperRectangles[j].getComparableDimRange(dim); | ||
if (!range.accept(point[dim])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: we tend to favor == false
instead of !
in the codebase for readability and less likely hood of introducing a future bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think I put that there by mistake. This part of the code got deleted anyways.
|
||
for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) { | ||
if (binaryDocValues.advanceExact(doc)) { | ||
long[] point = LongPoint.unpack(binaryDocValues.binaryValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, does this imply that each document can only index a single point in this field? Can we support docs with multiple points?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For right now yes. I was planning on adding multi value support after the basic API got fleshed out (maybe in a separate issue?)
+ dims | ||
+ ")"; | ||
// linear scan, change this to use R trees | ||
boolean docIsValid = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docIsValid
was a bit of a confusing name to me. This really captures the idea that the doc contributed to at least one HR right? Maybe something like shouldCountDoc
or something? I dunno... naming is hard! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to shouldCountDoc
.
for (int j = 0; j < hyperRectangles.length; j++) { | ||
boolean validPoint = true; | ||
for (int dim = 0; dim < dims; dim++) { | ||
HyperRectangle.LongRangePair range = hyperRectangles[j].getComparableDimRange(dim); | ||
if (!range.accept(point[dim])) { | ||
validPoint = false; | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think you can make this just a little more readable and avoid a boolean flag here if you use a labeled loop like this:
ranges:
for (int j = 0; j < hyperRectangles.length; j++) {
for (int dim = 0; dim < dims; dim++) {
HyperRectangle.LongRangePair range = hyperRectangles[j].getComparableDimRange(dim);
if (!range.accept(point[dim])) {
continue ranges;
}
}
counts[j]++;
docIsValid = true;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the code got removed in the next revision.
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some more feedback as I thought about this a bit more. Just to step back a bit, I think you were mostly interested in API-level feedback. In my opinion, I think the API is solid. The way users would interact with HyperRectangleFacetCounts
makes sense to me. I think the one "itchy" bit for me—that I'm a little unsettled on—is how to define these multidim point fields. I think you've exposed the issue that we don't have a good way for users to index point data as a doc value field today. So I think we ought to discuss whether-or-not these new field definitions you've added should be specific to faceting, or if they're more general. I left a detailed comment about this. Thanks again!
import org.apache.lucene.document.LongPoint; | ||
|
||
/** Packs an array of longs into a {@link BinaryDocValuesField} */ | ||
public class LongPointFacetField extends BinaryDocValuesField { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about these field classes you created a little more, and I'm wondering if we should create something more generic than just for faceting? I think what you've may run into here is the fact that we don't actually have a field type for indexing point values as doc values (that I know of anyway). We have all the xxxPoint
fields for adding inverted fields in the points index (e.g., LongPoint
), but I don't think we have an actual representation for adding them as DVs.
What do you think of moving these into the document
package and actually defining them as general DV field types? We might not have to go so far as to actually formalize this new concept in the DocValuesType
enum with their own format and such. Under the hood, they could just be a binary format like you have here (at least to start). You might look at LongRangeDocValuesField
as a good example of what I mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually like your suggestion a lot, I think it makes more sense cause there is nothing really faceting specific about these fields. I will include them in the document package instead as and rename them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full transparency: Marc and I had a discussion about this offline so I wanted to circle back here with a suggestion I made to him so it's fully out in the open and we can carry a conversation forward with the community.
While I initially suggested adding this as a sub-class of BinaryRangeDocValuesField
(similar to what LongRangeDocValuesField
does), I wonder if the right thing would be to actually formalize a new doc values format type. If we're building faceting, and potentially "slow range query" support on top of these, it seems like formalizing the format encoding might be the right thing to do. I'd be really curious what the community thinks of this though, and recommended that Marc start that discussion. I'm personally leaning towards formalizing the format, and maybe even having single-valued and multi-valued versions (analogous to (Sorted)NumericDocValues
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering what your thoughts were on just using separate numeric fields rather than packing them. I think this would make the API "nicer" to be honest, but the big drawback would that we would need some hacky multivalued implementation. I can think of some ways to build some sort of UnsortedNumericDV on top of SortedNumericDV, but they would all be super hacky and have limitations and probably not worth implementing.
Edit: Upon thinking about this further, my suggestion doesn't make sense when we have multi-valued fields
package org.apache.lucene.facet.hyperrectangle; | ||
|
||
/** Holds the name and the number of dims for a HyperRectangle */ | ||
public abstract class HyperRectangle { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be made pkg-private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want this public
right? Since it's a public part of the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does HyperRectangle
itself actually need to be part of the public API though? Users certainly need the definitions for Long/DoubleHyperRectangle
but do they need the HyperRectangle
definition itself? Like would they need a generic reference to HyperRectangle
? I'm not sure?
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java
Outdated
Show resolved
Hide resolved
* | ||
* @return A LongRangePair equivalent of this object | ||
*/ | ||
public LongRangePair toLongRangePair() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be public? I think it's only used internally in DoubleHyperRectangle
right? Should we reduce visibility (unless we expect users need this functionality directly?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to private
|
||
/** Get counts given a list of HyperRectangles (which must be of the same type) */ | ||
public class HyperRectangleFacetCounts extends Facets { | ||
/** Hypper rectangles passed to constructor. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that would be pronounces "Hipper rectangles" 😂. Fixed it :).
protected final HyperRectangle[] hyperRectangles; | ||
|
||
/** Counts, initialized in subclass. */ | ||
protected final int[] counts; | ||
|
||
/** Our field name. */ | ||
protected final String field; | ||
|
||
/** Number of dimensions for field */ | ||
protected final int dims; | ||
|
||
/** Total number of hits. */ | ||
protected int totCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed you made all these fields protected
. Were you thinking this might be a useful extension point for users? I might recommend against that, at least for now. If users start extending this, it might limit the changes we can make going forward (needing to stay back-compat with some of our internal implementation details).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking this would be extended later on, for example we might have a subclass that does linear scanning, another subclass that uses R trees, etc. I think I changed my mind in making things protected
half way through writing this class though since all the functions are private
. For now, since we aren't doing any subclassing yet, I will make it private
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I think leaving it private
until there's a need is good.
dc4b1aa
to
66753ab
Compare
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
package org.apache.lucene.document; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not opposed to this change, but I find it a bit strange that we add a "general" Point DV support without any tests that exercise it, and the only usage of it is in the Facet module. Do we see a use case in the future for other DV usage? Like Sorting?
Anyway I'm fine either way, just wanted to comment here that since it's @lucene.experimental
we could have also left it in the facet package and then move here if a more general use case came up.
|
||
/** | ||
* Takes an array of doubles and converts them to sortable longs, then stores as a {@link | ||
* BinaryDocValuesField} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe A {@link BinaryDocValuesField} which indexes double point values as sortable-longs
?
public class DoublePointDocValuesField extends BinaryDocValuesField { | ||
|
||
/** | ||
* Creates a new DoublePointFacetField, indexing the provided N-dimensional long point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/DoublePointFacetField/DoublePointDocValuesField/
s/long point/double point/
package org.apache.lucene.document; | ||
|
||
/** | ||
* Packs an array of longs into a {@link BinaryDocValuesField} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this jdoc consistent with the Double variant, mentioning that we're indexing Point values?
public class LongPointDocValuesField extends BinaryDocValuesField { | ||
|
||
/** | ||
* Creates a new LongPointFacetField, indexing the provided N-dimensional long point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about FacetField
/** | ||
* Checked a long packed value against this HyperRectangle. If you indexed a field with {@link | ||
* org.apache.lucene.document.LongPointDocValuesField} or {@link | ||
* org.apache.lucene.document.DoublePointDocValuesField}, those field values will be able to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/will be able to/can/?
+ ") is incompatible with hyper rectangle dimension (dim=" | ||
+ dims | ||
+ ")"; | ||
for (int dim = 0; dim < dims; dim++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of iterating on dim
you can iterate on offset
starting from 0 to packedValue.length
and increment by Long.BYTES
?
new HyperRectangle.LongRangePair(0L, true, 11L, true), | ||
new HyperRectangle.LongRangePair(0L, true, 12L, true)), | ||
new LongHyperRectangle( | ||
"over (90, 91, 92)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: "Between (90,91,92) and (100,101,102)"? Cause two tests below we have "Over (1000...) which is really just Over, without a real upper limit. But feel free to ignore my pickiness :)
new HyperRectangle.LongRangePair(91L, false, 101L, false), | ||
new HyperRectangle.LongRangePair(92L, false, 102L, false)), | ||
new LongHyperRectangle( | ||
"(90, 91, 92) or above", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you accept what I wrote above, then please change this too (and the double tests).
import org.apache.lucene.store.Directory; | ||
import org.apache.lucene.tests.index.RandomIndexWriter; | ||
|
||
public class TestHyperRectangleFacetCounts extends FacetTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions about the tests:
- Would you like to add a test which verifies we assert that all given rectangles have the same dim?
- Would you like to add a test which showcases mixed Long/Double rectangles?
Hey @mdmarshmallow I think this is a great and very useful feature. I also believe that in general it will be good to accompany these changes with a demo I used the term
Yes, it could be just sugar API on top of |
@shaie thanks for providing the use case doc. Very helpful! As far as an API proposal, I really like the "facet set" concept for the actual Thanks again! |
There are two sides to FacetSets: indexing and aggregation. Well actually three: drill-down too. For indexing I think a generic For aggregation though, as I thought about the API more, I realized that a generic "aggregator" makes no sense. On the file-system it's just a
Therefore, while it looks like a generic
I am personally for developing it directly in the facet module. If we add
At this point I really think we can continue with BDV. Only if someone has prior knowledge about the data and can encode each column separately, would it (perhaps) make sense to consider a different encoding? We can also consider including a |
There was an email thread where some other commiters suggested also developing this in sandbox. It does seem like this API could go through some heavy changes (I think we all agree on that here), so it seems like the As for creating a new multidimensional version of the |
I personally am not sure that we should worry about the big changes that the module will go under. API-wise we tag the classes as I worry that if we put it in sandbox, users might not even attempt to try it, even if technically they don't mind re-indexing on version upgrades. Because "sandbox" feels (to me) like half-baked stuff, while it's not true here - we do deliver value, it's just that the representation of things may change. |
I pushed a new package with the
Eventually I see more "XYZFacetCounts" implementation, and not necessarily many more NOTE: all names in the commit are just proposals, feel free to propose better ones. |
I don't know, we'll need to review it when we get there. But returning a matrix is different than returning top-children IMO, so not sure it's worth to try too hard to make the |
Trying to catch up on this now. I've been traveling and it's been difficult to find time. Thanks for all your thoughts @shaie! I think I'm only half-following your thoughts on the different APIs necessary, and will probably need to look at what you've documented in more detail. But... as a half-baked response, I'm not convinced (yet?) that we need this level of complexity in the API. In my mind, what we're trying to build is a generalization of what is already supported in long/double-range faceting (e.g., So "hyperrectangle faceting"—in my original thinking at least—is just a generalization of this to multiple dimensions. The points associated with the documents are in n-dimensional space, and the user specifies the different "hyperrectangles" they want counts for by providing a [min, max] range in each dimension. For cases like the "automotive parts finder" example, it's perfectly valid for the "hyperrectangles" provided by the user to also be single points (where the min/max are equivalent values in each dimension). But it's also valid to mix-and-match, where some dimensions are single points and some are ranges (e.g., "all auto parts that fit 'Chevy' (single point) for the years 2000 - 2010 (range)). In the situation where a user wants to "fix some dimension" and count over others, it can still be described as a set of "hyperrectangles," but where the specified ranges on some of the dimensions happen to be the same across all of them. So I'm not quite sure if what you're suggesting in the API is just syntactic sugar on top of this idea, or if we're possibly talking about different things here? I'll try to dive into your suggestion more though and understand. I feel like I'm just missing something important and need to catch up on your thinking. Thanks again for sharing! I'll circle back in a few days when I've (hopefully) had some more time to spend on this :) |
Hi Greg, thanks for your comments. Earlier today I tried to play with the
new API to implement some other use cases just to get a feel for how they
will work, and I realized why `HyperRectangle` was proposed and implemented
the way it was (sorry for being too slow!). Let me try to clarify my
thoughts a bit, from multiple perspectives:
* HyperRectangles are indeed a generic way of matching an N-dimensional
point. If one wants ranges, one passes a pair where min/max are different.
If one wants an exact match, one would pass a range where min/max are equal.
* What I proposed with the `ExactFacetSetMatcher` implementation is
merely a specialization of the above. So instead of passing ranges where
min/max are the same, and having the aggregation algo do redundant range
checks, it just specializes on how the aggregation is done. Additionally,
from an API perspective, it might be clearer to the user that they only
need to pass the expected values, and not construct ranges "because that's
what the API allows".
* We could have let `LongPair` implement a `match()` API itself, and a
sugar API for `LongPair.create(min, max)` which will return either a
`RangeLongPair` or `ExactLongPair` (don't mind the names too much) to
specialize the impl, but I'm not sure what will perform better -- calling
the `Pair.match()` or just doing range checks always.
* From an API perspective and the user, I wonder if HyperRectangle is a
clear enough name to denote what we're building here. I.e., is it perhaps
too expert? For instance I initially thought the proposal is for
geo-something faceting before I realized it has nothing specifically to do
with geo (again, sorry for being slow :)). Naming is hard, but I _think_
that `FacetSet` with a bunch of helper classes might make the API clearer.
* I totally think we should have a `HyperRectangle` impl, maybe call it
`HyperRectangleFacetSetMatcher` or `RangeFacetSetMatcher`. This is the
generic catch-all / fallback impl if one cannot find a specialized impl, or
doesn't know how to write one.
* I hope that with this API we'll also pave the way for users to realize
they can implement their own `FacetSetMatcher`, for instance treating the
first 2 dimensions as range, and 3rd and 4th as exact (again, to create
specialized matchers).
* I also think that the proposed API under `facetset` is easier to
extend, even though I'm sure we can re-structure the `hyperrectangle`
package to allow for such extension. Essentially you want a _Reader_ which
knows how to read the `long[]` and a `Filter/Matcher/Whatever` which
interprets them, returning a boolean or something else. That part is the
extension point we'd hope users to implement, which will make the
underlying storage of the points an abstraction that users don't have to
deal with.
* Regarding the other use cases I've mentioned, both `HyperRectangle` and
`MatchingFacetSetCounts` do the same job -- they match an entire set of
points against a given set of points. The `Matcher` even implements an API
which returns a `boolean`. True, you can pass for some of the dimensions
`(NEG_INF, POS_INF)` to denote that you "don't care" about some of the
dimensions, but still at its core this implementation tells you how many
docs matched each set.
* What this impl doesn't let you do is use, say dims 1-3 for matching,
and 4 for counting so that you can ask "What are the top-3 Years for
Oscar+Drama awards" (I hope what I wrote makes sense!!). In this example
you'll want to "match" docs if they have "Oscar" and "Drama" dimensions,
but then count the "Year" dimension and compute the top-K. This use case
cannot be implemented with neither of the current proposed impls, since
they only match docs.
* What I tried to say is that for this kind of use case we'll need a diff
counting impl (but still use the same on-disk structure!), that's all. One
that keeps track of the "Year" counts and its `getTopChildren` returns the
top 3 Years. I hope that makes sense?
I'll add the HyperRectangle impl to the `facetset` package (I'll reuse the
existing classes from `hyperrectangle` for now and we can see how it works?
|
Ok so I took a look at the additions you made to get more of an understanding of what is going on here. I'll try to explain my understanding, and let me know if there is anything wrong with what I'm saying. I think the biggest difference between our implementations is the naming scheme. So it seems like the I wanted to address this though:
I think this is something that we should provide out of the box right? It seems like it could be common enough if someone is using this functionality. Maybe something like Also for this point:
I get where your coming from, but I still feel like overriding the I have a proposal here that may be even more convoluted and a bit crazy, but I'll just put it out there in case. So starting with Now with the matchers, I think if we provide an |
That's right. We could get both APIs to the same state of extensibility, but the main difference is the naming. Under the hood it's about encoding a series of
We could totally offer that out-of-the-box, but I prefer that we do so in a different PR. Not sure about the API details yet, and whether it can truly be a generic impl or not, but it's totally something we should think about.
Indeed, implementing your own
Just so I understand and clarify that we're talking about the same thing: there are two classes here -
So I wanted to allow FSM impls to be as optimal as they need, hence they are given a I would also like to add that this sort of API is always something we can add in the future, in the form of
Do you see more than two impls here? I.e. I put If users can decide how to write their |
This sounds like a good idea, I agree
Ok, if this is the case, then should we provide more out of the box
Yes I am referring to extending FSM, but as I said we should probably supply more OOTB classes in this case?
Ah I see your complaints here. Yeah my goal was trying to make this less of an "expert" API, but if we are going to treat overriding But one thing I want to point out that I was also thinking is that if we are able to read all
I could, for example if they wanted to encode String ordinals here where the ordinals are long, they can write that conversion logic in |
Yes absolutely! I wrote somewhere in the comments that I only provided
That's a good point! So let me clarify the impl and my thoughts: originally you implemented it by reading the
So they can already do that, by passing the ordinals to the |
OK, I've (somewhat) caught up on the conversation here and will follow up on my original questions/comments (but am not going to jump in right now on the latest API discussion).
|
Hey Greg, I had a question about point 4. Are you saying we should have a separate hyper rectangle implementation in addition to facet sets in order to implement the R-tree and KD-tree optimizations? I actually addressed this above but I think we can just implement those in facet sets (specifically |
I pushed a commit w/ Do we want to consider moving to a |
I want to summarize the open questions we have right now to help figure out what we should do next:
For the first one, I talked with Greg a bit more about his suggestion to have them in separate packages and I think I agree with this. We can then make more specialized subclasses (like For the second question, I think we should keep this as a I think once we come to agreement on these questions it will be a lot easier to move forward, at least for me cause I think it will help me have a greater understanding of exactly what our final product (for this PR at least) should look like. |
If I understand correctly, @gsmiller's proposal was more about efficiency, right? So if you have a large number of points that you'd like to index, an R-Tree might be a better data structure for these range matches, than the current linear scan we do in IMO, and to echo Greg's "progress not perfection", I feel it's better if we introduce the
If we think that KD-Trees can be more efficient for To re-iterate what I previously wrote about API back-compat, I feel that putting |
Let me try to summarize my understanding of the "future optimization" debate as it pertains to this proposed API/implementation and see if we're on the same page or if I'm overlooking something. The current proposal encapsulates/delegates matching logic behind For the single point case ( But, the problem is, I personally think we should design with this optimization in mind, but I think we're close and I don't actually think the current proposal needs to really change to allow for future optimizations. This is where I get a little fuzzy on the conversation that's taken place as I haven't totally caught up on the various proposals, and conversations taking place. But, if we kept the implementation as it currently exists, in the future, if we want to put these optimizations in place, could we not just add a method to I point this out not to propose implementing it now, but to say that I think we have options to extend what is currently proposed here if/when we want to optimize. Does this make sense or am I talking about a completely different problem or missing key parts of the conversation that's happened? Apologies if I have. |
So based on everyone's comments:
|
That's true, and hence why we discussed having a
I agree! I think it's fine if we'll leave these optimizations for later, and even if that will change the API between
We certainly can add such API. For "exact" matches it will return
Not intending to start a discussion on how to implement that, but just wanted to point out that
I don't mind if we do that, but since it seems like a trivial change to make after (it doesn't affect the end-user API, only the internal protocol between |
98aa16c
to
036201f
Compare
I think the rebase was somehow messed up, I cleaned up the history and force pushed. Everything should be included in this push. |
Thanks @mdmarshmallow. You added the CHANGES entry under |
Yeah, I think this change should be completely compatible with 9.30. Most of our changes are isolated to the new |
+1 to backporting to 9.x. I think we're ready to merge as far as I'm concerned. @shaie I'll leave it to you to merge and backport, assuming you also feel we're good-to-go here? If you'd prefer I merge/backport, I'm happy to help out as well. Thanks! |
Co-authored-by: Marc D'Mello <dmellomd@amazon.com> Co-authored-by: Shai Erera <serera@gmail.com> Co-authored-by: Greg Miller <gsmiller@gmail.com>
* LUCENE-10274: Add FacetSets faceting capabilities (#841) Co-authored-by: Marc D'Mello <dmellomd@amazon.com> Co-authored-by: Shai Erera <serera@gmail.com> Co-authored-by: Greg Miller <gsmiller@gmail.com>
Description
Added basic hyperrectangle faceting capabilities. This is mostly just a draft PR to sketch out what the API will look like. Added new fields to store points as a BinaryDocValues field and then just linearly scan through those points to see if they "fit" inside the hyperrectangle. There are several important things that are still missing in this commit:
Solution
Currently just linear scans stored points through provided hyper rectangles and checks if a doc is accepted or not.
Tests
Created two basic tests, will need to add more once the API is more set in stone.
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.