-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New KTypePgmIndex that learns a compact index on sorted keys and supports range search. #39
Conversation
…orts range search.
hppc-benchmarks/build.gradle
Outdated
@@ -18,6 +18,10 @@ jmh { | |||
duplicateClassesStrategy = DuplicatesStrategy.WARN | |||
} | |||
|
|||
jmhJar { | |||
duplicatesStrategy = DuplicatesStrategy.INCLUDE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this strategy to be able to run benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is actually duplicated? Duplicates in a ZIP file are highly suspicious - this means the unpacker typically picks one of the alternatives (with the same path). This can easily turn into a debugging nightmare. I haven't had the time to check but I can look into this - perhaps as part of this/ another issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just reran without this strategy to remember. I get this error:
Task :hppc-benchmarks:jmhJar FAILED
Entry LICENSE is a duplicate but no duplicate handling strategy has been set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed another issue to figure this out and upgrade the plugin. Something is odd there.
* Returns a numerical value for the argument for primitive template types. This intrinsic is used | ||
* to apply arithmetic operations on keys. It is invalid for generic types. | ||
*/ | ||
public static <T> int numeric(T e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add this intrinsic "numeric" to be able to compare numeric values (greater/less than) and also to compute the numeric slopes and intercepts at the core of the index learning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had the time to read through the paper yet so I may be saying something stupid here but why does it return the int value for everything, including long/ floating point values? Also, looking at the code, I see it's always used in value-to-value comparisons - perhaps a method like compareKeys(A, B) would be more appropriate/ clear here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value returned by Intrinsics.numeric() can be anything numeric. It is just there to make the template code compile. It could be float or double, a constant. Then at code generation time, this intrinsics is simply removed, letting only the actual numerical parameter. E.g Intrinsics.<KType>numeric(key) > Intrinsics.<KType>numeric(k)
becomes key > k
in the generated code.
I should add more javadoc to be clearer.
This intrinsics is used for comparisons but also for other purposes:
Computation - e.g. int index = (int) (slope * ((double) Intrinsics.<KType>numeric(key) - Intrinsics.<KType>numeric(sKey)) + intercept);
becomes int index = (int) (slope * ((double) key - sKey) + intercept);
Method parameter - e.g. plam.addKey(Intrinsics.<KType>numeric(key), 0, this);
becomes plam.addKey(key, 0, this);
I thought that a single Intrinsics.numeric() would be simpler and less invasive than multiple new Intrinsincs. But let me know if we could do differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I wonder if the intrinsic version should cast to an int - perhaps it should use the broadest possible value so that no truncations can occur? I replaced it with a double and all tests seem to pass - made a commit to your branch but feel free to revert if this is incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Dawid, the javadoc is much clearer.
/** | ||
* Basic growable int array helper for HPPC templates (so before {@code IntArrayList} is generated). | ||
*/ | ||
public class IntGrowableArray implements Accountable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used before IntArrayList is template-generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may take a look if this can be improved somehow. Could definitely be a separate module for pgm only... but perhaps we can make it work with more than one source folder and cascaded generation as well.
/* | ||
* HPPC | ||
* | ||
* Copyright (C) 2010-2022 Carrot Search s.c. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the copyright needs an update? (2023)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a lawyer - perhaps. I'll file an issue. :)
primitiveSizesMap.put(float.class, Integer.valueOf(Float.BYTES)); | ||
primitiveSizesMap.put(double.class, Integer.valueOf(Double.BYTES)); | ||
primitiveSizesMap.put(long.class, Integer.valueOf(Long.BYTES)); | ||
primitiveSizesMap.put(char.class, Character.BYTES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Various cleanups in this class, no functional modifications.
Thanks Bruno! I'll review in the evening, had a busy weekend. |
* <p>The returned offset will be the maximum of whatever was measured so far and <code>f</code> | ||
* field's offset and representation size (unaligned). | ||
*/ | ||
static long adjustForField(long sizeSoFar, final Field f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused
hppc-benchmarks/build.gradle
Outdated
@@ -18,6 +18,10 @@ jmh { | |||
duplicateClassesStrategy = DuplicatesStrategy.WARN | |||
} | |||
|
|||
jmhJar { | |||
duplicatesStrategy = DuplicatesStrategy.INCLUDE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is actually duplicated? Duplicates in a ZIP file are highly suspicious - this means the unpacker typically picks one of the alternatives (with the same path). This can easily turn into a debugging nightmare. I haven't had the time to check but I can look into this - perhaps as part of this/ another issue?
/* | ||
* HPPC | ||
* | ||
* Copyright (C) 2010-2022 Carrot Search s.c. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a lawyer - perhaps. I'll file an issue. :)
* It should be set according to the desired space-time trade-off. A smaller value makes the | ||
* estimation more precise and the range smaller but at the cost of increased space usage. | ||
*/ | ||
// With EPSILON=64: the benchmark with 200MB of keys shows that this PGM index requires |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be part of the javadoc? Seems like helpful information, even if it's a final setting.
* Returns a numerical value for the argument for primitive template types. This intrinsic is used | ||
* to apply arithmetic operations on keys. It is invalid for generic types. | ||
*/ | ||
public static <T> int numeric(T e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had the time to read through the paper yet so I may be saying something stupid here but why does it return the int value for everything, including long/ floating point values? Also, looking at the code, I see it's always used in value-to-value comparisons - perhaps a method like compareKeys(A, B) would be more appropriate/ clear here?
I think this looks good! And interesting as well. There are some cleanups that could be done later (related to infrastructure, not the code) but I think it's fine as it is already. The only thing I wonder about is whether the code shouldn't be moved to a separate module and distributed as a separate artifact. Again, this can come later. |
Good question. Indeed this PGM-Index is not a collection in itself (though there is a "dynamic" version of it in the paper that becomes a collection, but it's closer to a Lucene index than a collection for additions and removal). |
Don't worry about it - I can take a look at it later. |
It's in, ready to be cleaned/moved. Thanks Dawid for the review! |
Space-efficient index that enables fast rank/range search operations on a sorted sequence of numerical keys (generic not supported).
It is based on the PGM-Index paper at https://pgm.di.unipi.it .
It provides rank() and range() search operations.
indexOf() is faster than B+Tree, and the index is much more compact.
contains() is between 4x to 7x slower than IntHashSet#contains(), but between 2.5x to 3x faster than Arrays#binarySearch.
Its compactness (40KB for 200MB of keys) makes it efficient for very large collections, the index fitting easily in the L2 cache.