Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New KTypePgmIndex that learns a compact index on sorted keys and supports range search. #39

Merged
merged 6 commits into from
Aug 1, 2023

Conversation

bruno-roustant
Copy link
Collaborator

@bruno-roustant bruno-roustant commented Jul 30, 2023

Space-efficient index that enables fast rank/range search operations on a sorted sequence of numerical keys (generic not supported).
It is based on the PGM-Index paper at https://pgm.di.unipi.it .
It provides rank() and range() search operations.
indexOf() is faster than B+Tree, and the index is much more compact.
contains() is between 4x to 7x slower than IntHashSet#contains(), but between 2.5x to 3x faster than Arrays#binarySearch.
Its compactness (40KB for 200MB of keys) makes it efficient for very large collections, the index fitting easily in the L2 cache.

@@ -18,6 +18,10 @@ jmh {
duplicateClassesStrategy = DuplicatesStrategy.WARN
}

jmhJar {
duplicatesStrategy = DuplicatesStrategy.INCLUDE
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this strategy to be able to run benchmarks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is actually duplicated? Duplicates in a ZIP file are highly suspicious - this means the unpacker typically picks one of the alternatives (with the same path). This can easily turn into a debugging nightmare. I haven't had the time to check but I can look into this - perhaps as part of this/ another issue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just reran without this strategy to remember. I get this error:

Task :hppc-benchmarks:jmhJar FAILED
Entry LICENSE is a duplicate but no duplicate handling strategy has been set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed another issue to figure this out and upgrade the plugin. Something is odd there.

* Returns a numerical value for the argument for primitive template types. This intrinsic is used
* to apply arithmetic operations on keys. It is invalid for generic types.
*/
public static <T> int numeric(T e) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add this intrinsic "numeric" to be able to compare numeric values (greater/less than) and also to compute the numeric slopes and intercepts at the core of the index learning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had the time to read through the paper yet so I may be saying something stupid here but why does it return the int value for everything, including long/ floating point values? Also, looking at the code, I see it's always used in value-to-value comparisons - perhaps a method like compareKeys(A, B) would be more appropriate/ clear here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value returned by Intrinsics.numeric() can be anything numeric. It is just there to make the template code compile. It could be float or double, a constant. Then at code generation time, this intrinsics is simply removed, letting only the actual numerical parameter. E.g Intrinsics.<KType>numeric(key) > Intrinsics.<KType>numeric(k) becomes key > k in the generated code.
I should add more javadoc to be clearer.

This intrinsics is used for comparisons but also for other purposes:

Computation - e.g. int index = (int) (slope * ((double) Intrinsics.<KType>numeric(key) - Intrinsics.<KType>numeric(sKey)) + intercept); becomes int index = (int) (slope * ((double) key - sKey) + intercept);

Method parameter - e.g. plam.addKey(Intrinsics.<KType>numeric(key), 0, this); becomes plam.addKey(key, 0, this);

I thought that a single Intrinsics.numeric() would be simpler and less invasive than multiple new Intrinsincs. But let me know if we could do differently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I wonder if the intrinsic version should cast to an int - perhaps it should use the broadest possible value so that no truncations can occur? I replaced it with a double and all tests seem to pass - made a commit to your branch but feel free to revert if this is incorrect.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Dawid, the javadoc is much clearer.

/**
* Basic growable int array helper for HPPC templates (so before {@code IntArrayList} is generated).
*/
public class IntGrowableArray implements Accountable {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used before IntArrayList is template-generated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may take a look if this can be improved somehow. Could definitely be a separate module for pgm only... but perhaps we can make it work with more than one source folder and cascaded generation as well.

/*
* HPPC
*
* Copyright (C) 2010-2022 Carrot Search s.c.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the copyright needs an update? (2023)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a lawyer - perhaps. I'll file an issue. :)

primitiveSizesMap.put(float.class, Integer.valueOf(Float.BYTES));
primitiveSizesMap.put(double.class, Integer.valueOf(Double.BYTES));
primitiveSizesMap.put(long.class, Integer.valueOf(Long.BYTES));
primitiveSizesMap.put(char.class, Character.BYTES);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Various cleanups in this class, no functional modifications.

@dweiss
Copy link
Member

dweiss commented Jul 31, 2023

Thanks Bruno! I'll review in the evening, had a busy weekend.

* <p>The returned offset will be the maximum of whatever was measured so far and <code>f</code>
* field's offset and representation size (unaligned).
*/
static long adjustForField(long sizeSoFar, final Field f) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused

@@ -18,6 +18,10 @@ jmh {
duplicateClassesStrategy = DuplicatesStrategy.WARN
}

jmhJar {
duplicatesStrategy = DuplicatesStrategy.INCLUDE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is actually duplicated? Duplicates in a ZIP file are highly suspicious - this means the unpacker typically picks one of the alternatives (with the same path). This can easily turn into a debugging nightmare. I haven't had the time to check but I can look into this - perhaps as part of this/ another issue?

/*
* HPPC
*
* Copyright (C) 2010-2022 Carrot Search s.c.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a lawyer - perhaps. I'll file an issue. :)

* It should be set according to the desired space-time trade-off. A smaller value makes the
* estimation more precise and the range smaller but at the cost of increased space usage.
*/
// With EPSILON=64: the benchmark with 200MB of keys shows that this PGM index requires
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be part of the javadoc? Seems like helpful information, even if it's a final setting.

* Returns a numerical value for the argument for primitive template types. This intrinsic is used
* to apply arithmetic operations on keys. It is invalid for generic types.
*/
public static <T> int numeric(T e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had the time to read through the paper yet so I may be saying something stupid here but why does it return the int value for everything, including long/ floating point values? Also, looking at the code, I see it's always used in value-to-value comparisons - perhaps a method like compareKeys(A, B) would be more appropriate/ clear here?

@dweiss
Copy link
Member

dweiss commented Aug 1, 2023

I think this looks good! And interesting as well. There are some cleanups that could be done later (related to infrastructure, not the code) but I think it's fine as it is already. The only thing I wonder about is whether the code shouldn't be moved to a separate module and distributed as a separate artifact. Again, this can come later.

@bruno-roustant
Copy link
Collaborator Author

Good question. Indeed this PGM-Index is not a collection in itself (though there is a "dynamic" version of it in the paper that becomes a collection, but it's closer to a Lucene index than a collection for additions and removal).
It clearly benefits from the cool template generation platform, but I agree that it could be moved to a separate module.

@dweiss
Copy link
Member

dweiss commented Aug 1, 2023

Don't worry about it - I can take a look at it later.

@bruno-roustant bruno-roustant merged commit c9497df into carrotsearch:master Aug 1, 2023
2 checks passed
@bruno-roustant
Copy link
Collaborator Author

It's in, ready to be cleaned/moved. Thanks Dawid for the review!

@bruno-roustant bruno-roustant deleted the pgm3 branch August 1, 2023 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants