New KTypePgmIndex that learns a compact index on sorted keys and supports range search. #39

bruno-roustant · 2023-07-30T14:52:47Z

Space-efficient index that enables fast rank/range search operations on a sorted sequence of numerical keys (generic keys are not supported).
It is based on the PGM-Index paper at https://pgm.di.unipi.it .
It provides rank() and range() search operations.
indexOf() is faster than B+Tree, and the index is much more compact.
contains() is 4x to 7x slower than IntHashSet#contains(), but 2.5x to 3x faster than Arrays#binarySearch.
Its compactness (40KB for 200MB of keys) makes it efficient for very large collections, the index fitting easily in the L2 cache.
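
To make the description concrete, here is a minimal single-level sketch of the idea, not the code in this PR. The class name PgmSketch, its methods, and the greedy "shrinking cone" segmentation are assumptions chosen for brevity; the paper builds optimal segments and stacks them recursively. Each segment stores a first key, a first index and a slope; a lookup predicts a position and then binary-searches only a window of roughly 2 * epsilon positions around it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical single-level sketch; NOT the KTypePgmIndex implementation. */
final class PgmSketch {
  private final int[] keys; // sorted, strictly increasing (assumption)
  private final int epsilon; // maximum prediction error, in positions
  private final int[] firstKeys; // first key of each segment
  private final int[] firstIdx; // position of that key in keys
  private final double[] slopes; // positions per key unit, per segment

  PgmSketch(int[] sortedDistinctKeys, int epsilon) {
    this.keys = sortedDistinctKeys;
    this.epsilon = epsilon;
    List<double[]> segments = new ArrayList<>(); // {firstIndex, slope}
    int start = 0;
    while (start < keys.length) {
      // Shrinking cone: keep the range [lo, hi] of slopes that still predicts
      // every point of the segment within epsilon positions.
      double lo = 0, hi = Double.POSITIVE_INFINITY;
      int next = start + 1;
      for (; next < keys.length; next++) {
        double dx = (double) keys[next] - keys[start];
        double dy = next - start;
        double newLo = Math.max(lo, (dy - epsilon) / dx);
        double newHi = Math.min(hi, (dy + epsilon) / dx);
        if (newLo > newHi) break; // no slope fits: close the segment here
        lo = newLo;
        hi = newHi;
      }
      // A one-point segment never narrows hi; any slope works, so use 0.
      double slope = Double.isInfinite(hi) ? 0 : (lo + hi) / 2;
      segments.add(new double[] {start, slope});
      start = next;
    }
    firstKeys = new int[segments.size()];
    firstIdx = new int[segments.size()];
    slopes = new double[segments.size()];
    for (int s = 0; s < segments.size(); s++) {
      firstIdx[s] = (int) segments.get(s)[0];
      firstKeys[s] = keys[firstIdx[s]];
      slopes[s] = segments.get(s)[1];
    }
  }

  int segmentCount() {
    return slopes.length;
  }

  /** Returns the position of key in the sorted array, or -1 if absent. */
  int indexOf(int key) {
    int s = Arrays.binarySearch(firstKeys, key);
    if (s < 0) s = -s - 2; // last segment whose first key is <= key
    if (s < 0) return -1; // key is smaller than every indexed key
    double estimate =
        Math.min(keys.length, firstIdx[s] + slopes[s] * ((double) key - firstKeys[s]));
    long predicted = Math.round(estimate);
    // One extra slot on each side absorbs the rounding of the prediction.
    int from = (int) Math.min(keys.length, Math.max(0, predicted - epsilon - 1));
    int to = (int) Math.min(keys.length, Math.max(0, predicted + epsilon + 2));
    if (from >= to) return -1;
    int pos = Arrays.binarySearch(keys, from, to, key);
    return pos >= 0 ? pos : -1;
  }
}
```

In this sketch a smaller epsilon narrows the final binary search but tends to produce more segments, which is the space-time trade-off the index exposes.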

bruno-roustant · 2023-07-30T14:54:08Z

hppc-benchmarks/build.gradle

@@ -18,6 +18,10 @@ jmh {
  duplicateClassesStrategy = DuplicatesStrategy.WARN
 }

+jmhJar {
+  duplicatesStrategy = DuplicatesStrategy.INCLUDE


I added this strategy to be able to run benchmarks.

What is actually duplicated? Duplicates in a ZIP file are highly suspicious - this means the unpacker typically picks one of the alternatives (with the same path). This can easily turn into a debugging nightmare. I haven't had the time to check but I can look into this - perhaps as part of this/ another issue?

I just reran without this strategy to remember. I get this error:

Task :hppc-benchmarks:jmhJar FAILED
Entry LICENSE is a duplicate but no duplicate handling strategy has been set.

I filed another issue to figure this out and upgrade the plugin. Something is odd there.
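
As a side illustration of why duplicate paths in an archive deserve suspicion, the sketch below shows that the JDK's own ZipOutputStream refuses to write two entries with the same name. This is not what the Gradle task does internally; the class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipOutputStream;

/** Hypothetical demo class; unrelated to the Gradle jmhJar task implementation. */
final class DuplicateEntryDemo {
  /** Returns true if the JDK rejects a second entry that reuses the same path. */
  static boolean duplicateRejected(String path) {
    try (ZipOutputStream zip = new ZipOutputStream(new ByteArrayOutputStream())) {
      zip.putNextEntry(new ZipEntry(path));
      zip.write("first".getBytes(StandardCharsets.UTF_8));
      zip.closeEntry();
      try {
        zip.putNextEntry(new ZipEntry(path)); // same path a second time
        return false;
      } catch (ZipException e) {
        return true; // reported as a duplicate entry
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```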

bruno-roustant · 2023-07-30T14:56:43Z

hppc-template-intrinsics/src/main/java/com/carrotsearch/hppc/Intrinsics.java

+   * Returns a numerical value for the argument for primitive template types. This intrinsic is used
+   * to apply arithmetic operations on keys. It is invalid for generic types.
+   */
+  public static <T> int numeric(T e) {


I added this "numeric" intrinsic to be able to compare numeric values (greater/less than) and also to compute the numeric slopes and intercepts at the core of the index learning.

I haven't had the time to read through the paper yet so I may be saying something stupid here but why does it return the int value for everything, including long/ floating point values? Also, looking at the code, I see it's always used in value-to-value comparisons - perhaps a method like compareKeys(A, B) would be more appropriate/ clear here?

The value returned by Intrinsics.numeric() can be anything numeric. It is only there to make the template code compile. It could be a float, a double, or a constant. At code generation time, this intrinsic is simply removed, leaving only the actual numerical parameter. E.g., Intrinsics.<KType>numeric(key) > Intrinsics.<KType>numeric(k) becomes key > k in the generated code.
I should add more javadoc to be clearer.

This intrinsic is used for comparisons but also for other purposes:

Computation - e.g. int index = (int) (slope * ((double) Intrinsics.<KType>numeric(key) - Intrinsics.<KType>numeric(sKey)) + intercept); becomes int index = (int) (slope * ((double) key - sKey) + intercept);

Method parameter - e.g. plam.addKey(Intrinsics.<KType>numeric(key), 0, this); becomes plam.addKey(key, 0, this);

I thought that a single Intrinsics.numeric() would be simpler and less invasive than multiple new intrinsics. But let me know if we could do it differently.
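
The rewrite described in this comment can be mimicked with a small text substitution. The sketch below is only an approximation for illustration: the class name and the regular expression are assumptions, it handles only arguments without nested parentheses, and it is not how the HPPC template processor is implemented.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; the real template processor is not shown in this thread. */
final class NumericIntrinsicSketch {
  // Matches Intrinsics.<KType>numeric(arg) where arg contains no parentheses.
  private static final Pattern NUMERIC =
      Pattern.compile("Intrinsics\\.<KType>numeric\\(([^()]*)\\)");

  /** Replaces every matched intrinsic call with its bare argument. */
  static String expand(String templateLine) {
    return NUMERIC.matcher(templateLine).replaceAll("$1");
  }
}
```

Applied to the three template lines quoted in this comment, expand() yields the generated forms given alongside them.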

Thanks! I wonder if the intrinsic version should cast to an int - perhaps it should use the broadest possible value so that no truncations can occur? I replaced it with a double and all tests seem to pass - made a commit to your branch but feel free to revert if this is incorrect.
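
The truncation concern can be shown with plain Java casts. This is a sketch with made-up names, and it only matters where the template itself is compiled or run, since the generated code no longer contains the call. Note also that a double cannot represent every long above 2^53 exactly, so "broadest" is relative.

```java
/** Hypothetical illustration of narrowing to int versus widening to double. */
final class NumericWidthSketch {
  /** Compares two long keys after narrowing them to int. */
  static boolean greaterViaInt(long a, long b) {
    return (int) a > (int) b;
  }

  /** Compares the same keys after converting them to double. */
  static boolean greaterViaDouble(long a, long b) {
    return (double) a > (double) b;
  }
}
```

With a = 1L << 32 and b = 1, the int version compares 0 with 1 and reports that a is not greater, while the double version gives the expected answer.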

Thank you Dawid, the javadoc is much clearer.

bruno-roustant · 2023-07-30T14:57:40Z

hppc/src/main/java/com/carrotsearch/hppc/IntGrowableArray.java

+/**
+ * Basic growable int array helper for HPPC templates (so before {@code IntArrayList} is generated).
+ */
+public class IntGrowableArray implements Accountable {


Used before IntArrayList is template-generated.

I may take a look if this can be improved somehow. Could definitely be a separate module for pgm only... but perhaps we can make it work with more than one source folder and cascaded generation as well.
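
For readers unfamiliar with the pattern, a growable primitive array can be as small as the sketch below. The names, the initial capacity and the growth policy are assumptions; the actual IntGrowableArray API and its Accountable implementation are not shown in this thread.

```java
import java.util.Arrays;

/** Hypothetical minimal growable int array; not the IntGrowableArray from this PR. */
final class GrowableIntsSketch {
  private int[] buffer = new int[4];
  private int size;

  /** Appends a value, growing the backing array by about 1.5x when it is full. */
  void add(int value) {
    if (size == buffer.length) {
      buffer = Arrays.copyOf(buffer, buffer.length + (buffer.length >> 1) + 1);
    }
    buffer[size++] = value;
  }

  int get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    return buffer[index];
  }

  int size() {
    return size;
  }

  /** Returns a copy trimmed to the number of stored values. */
  int[] toArray() {
    return Arrays.copyOf(buffer, size);
  }
}
```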

bruno-roustant · 2023-07-30T14:58:24Z

hppc-benchmarks/src/jmh/java/com/carrotsearch/hppc/benchmarks/implementations/PgmIntSetOps.java

+/*
+ * HPPC
+ *
+ * Copyright (C) 2010-2022 Carrot Search s.c.


Maybe the copyright needs an update? (2023)

I'm not a lawyer - perhaps. I'll file an issue. :)

bruno-roustant · 2023-07-30T15:00:30Z

hppc/src/main/java/com/carrotsearch/hppc/RamUsageEstimator.java

-    primitiveSizesMap.put(float.class, Integer.valueOf(Float.BYTES));
-    primitiveSizesMap.put(double.class, Integer.valueOf(Double.BYTES));
-    primitiveSizesMap.put(long.class, Integer.valueOf(Long.BYTES));
+    primitiveSizesMap.put(char.class, Character.BYTES);


Various cleanups in this class, no functional modifications.

dweiss · 2023-07-31T08:37:45Z

Thanks Bruno! I'll review in the evening, had a busy weekend.

bruno-roustant · 2023-07-31T10:53:03Z

hppc/src/main/java/com/carrotsearch/hppc/RamUsageEstimator.java

-   * <p>The returned offset will be the maximum of whatever was measured so far and <code>f</code>
-   * field's offset and representation size (unaligned).
-   */
-  static long adjustForField(long sizeSoFar, final Field f) {


dweiss · 2023-07-31T20:45:01Z

hppc-benchmarks/build.gradle

@@ -18,6 +18,10 @@ jmh {
  duplicateClassesStrategy = DuplicatesStrategy.WARN
 }

+jmhJar {
+  duplicatesStrategy = DuplicatesStrategy.INCLUDE


What is actually duplicated? Duplicates in a ZIP file are highly suspicious - this means the unpacker typically picks one of the alternatives (with the same path). This can easily turn into a debugging nightmare. I haven't had the time to check but I can look into this - perhaps as part of this/ another issue?

dweiss · 2023-07-31T20:45:49Z

hppc-benchmarks/src/jmh/java/com/carrotsearch/hppc/benchmarks/implementations/PgmIntSetOps.java

+/*
+ * HPPC
+ *
+ * Copyright (C) 2010-2022 Carrot Search s.c.


I'm not a lawyer - perhaps. I'll file an issue. :)

dweiss · 2023-07-31T20:51:59Z

hppc/src/main/templates/com/carrotsearch/hppc/KTypePgmIndex.java

+   * It should be set according to the desired space-time trade-off. A smaller value makes the
+   * estimation more precise and the range smaller but at the cost of increased space usage.
+   */
+  // With EPSILON=64: the benchmark with 200MB of keys shows that this PGM index requires


Shouldn't this be part of the javadoc? Seems like helpful information, even if it's a final setting.

dweiss · 2023-07-31T20:58:13Z

hppc-template-intrinsics/src/main/java/com/carrotsearch/hppc/Intrinsics.java

+   * Returns a numerical value for the argument for primitive template types. This intrinsic is used
+   * to apply arithmetic operations on keys. It is invalid for generic types.
+   */
+  public static <T> int numeric(T e) {


I haven't had the time to read through the paper yet so I may be saying something stupid here but why does it return the int value for everything, including long/ floating point values? Also, looking at the code, I see it's always used in value-to-value comparisons - perhaps a method like compareKeys(A, B) would be more appropriate/ clear here?

dweiss · 2023-08-01T09:38:24Z

I think this looks good! And interesting as well. There are some cleanups that could be done later (related to infrastructure, not the code) but I think it's fine as it is already. The only thing I wonder about is whether the code shouldn't be moved to a separate module and distributed as a separate artifact. Again, this can come later.

bruno-roustant · 2023-08-01T09:52:32Z

Good question. Indeed this PGM-Index is not a collection in itself (though there is a "dynamic" version of it in the paper that becomes a collection, but it's closer to a Lucene index than a collection for additions and removal).
It clearly benefits from the cool template generation platform, but I agree that it could be moved to a separate module.

dweiss · 2023-08-01T09:57:31Z

Don't worry about it - I can take a look at it later.

bruno-roustant · 2023-08-01T10:32:57Z

It's in, ready to be cleaned/moved. Thanks Dawid for the review!

New KTypePgmIndex that learns a compact index on sorted keys and supports range search.

9d9bd86

bruno-roustant requested a review from dweiss July 30, 2023 14:52

bruno-roustant commented Jul 30, 2023

Epsilon tuning for perf. Simplify code.

aac722f

bruno-roustant commented Jul 31, 2023

Complete KTypeEmptyPgmIndex.

07fc48b

dweiss reviewed Jul 31, 2023

dweiss added 2 commits August 1, 2023 10:34

Update jmh to 1.25, change duplicate strategy for jmhJar to WARN and add some trivial exclusions.

f8ec77e

Return the broadest possible type for the numeric() intrinsic.

ae5aa5a

dweiss approved these changes Aug 1, 2023

Epsilon javadoc

e9eef01

bruno-roustant merged commit c9497df into carrotsearch:master Aug 1, 2023
2 checks passed

bruno-roustant deleted the pgm3 branch August 1, 2023 13:58
