-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor vectorization support in Lucene #12410
Refactor vectorization support in Lucene #12410
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I really like how this turned out. The separation of provider and implementation is much cleaner now.
lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorUtilImpl.java
Outdated
Show resolved
Hide resolved
* | ||
* @lucene.internal | ||
*/ | ||
public abstract class VectorizationProvider { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of replacing "vectorization" with "vectorized" ? so VectorizedProvider
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the default impl is not vectorized. I am not sure what the best name would be.
The package is vectorization, too. I am open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the default impl is not vectorized.
Ah yes, I forgot that. So it's "vectorizable" - a candidate for vectorization. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds horrible. 😴 Need to sleep.
I think the package name should also be adapted. Naming is hard. 🤨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea how the package and the provider should be better named.
To me "vectorization" sounds good, as it provides (possible) vectorization support for Lucene. In German it would be "Vektorisierung" which is exactly what the code should do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with "vectorization".
Yet another alternative ;-) Leave the default scalar implementation where it was. Then the provider would be a provider of vectorized implementations, Something like, VectorizedProvider::getInstanceOrNull
. That way the new package consists of just Panama code and the light-weight interfaces (no scalar code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about the same. I was not too happy about it because it moves a lot if/else code back to the original package. Because you need to implement the scalar code anyways as interface. It also makes testing hard because I am at moment working for a BaseVectorizationTest case that allows to write compare impls like TestVectorUtilSupport
. This only works in a generic way if the code is easy to locate and uses same interfaces.
I think keeping the code that should do the same together is preferable (at least to me). From what I figured out: Most of the code affected is in oal.util
anyways. The special case with ForUtil is different. At moment I am not fully sure if the code really differs in Lucene 5, Lucene 8 and Lucene 9 codecs and ForUtil can move to utils package?
About the name: I leave the name for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One addition: By returning null or similar for the provider we miss some flexibility: Based on Panama's options we can opt in / opt out to implement some methods. With the current approach we can fall back to the default implementation only for some interface or only for one method, if Panama does not have a specific feature enabled (or not in that version).
The current VectorUtilSupport for integer vectors already opts out if AVX2 is not available. Currently it is an if statment, but it could also call DefaultVectorUtilSupport#dotProduct()
if we have no AVX2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed reply. I agree with the current approach and naming. Let's go for it! :-)
I added a small refactoring for the tests: I added a base class which does the Panama Check in BeforeClass. All subclasses have access to Panama and default impl (if available). If Panama is not available it will prevent test from executing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
public static void beforeClass() throws Exception { | ||
assumeTrue( | ||
"Test only works when JDK's vector incubator module is enabled.", | ||
PANAMA_PROVIDER.getClass() != LUCENE_PROVIDER.getClass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, yes much better than before as it is not negated.
P.S.: I checked: ForUtil in 9.x is different than in 8.x because the byte order and some other slight changes. So I would only vectorize the core one and leave out the backwards compatibility. |
I will merge this once CI is happy. I only changed some messages and added a test for the caller sensitive check (like |
Thanks @ChrisHegarty ! |
I figured out that the caller sensitive checking how it wa simplemented does not work with package private classes, so I plan to fix this in a followup commit: private static final Set<String> VALID_CALLERS =
Set.of("org.apache.lucene.util.VectorUtil", "org.apache.lucene.codecs.lucene90.PForUtil");
private static void ensureCaller() {
final boolean validCaller =
StackWalker.getInstance()
.walk(
s ->
s.skip(2)
.limit(1)
.map(StackFrame::getClassName)
.allMatch(VALID_CALLERS::contains));
if (!validCaller) {
throw new UnsupportedOperationException(
"VectorizationProvider is internal and can only be used by known Lucene classes.");
}
} |
This is fine as Maven Shade plugin and similar will rewrite constants with class names automatically, so theres no need to fallback to |
…allow private/pkg-private classes
…allow private/pkg-private classes
Fixed. |
This refactors the vectorization support introduced in Lucene 9.7 to allow to plugin vectorization support for more classes except 'VectorUtil':
org.apache.lucene.internal.vectorization
VectorUtilImpl
for the currentVectorUtil
backing. This allows to add moreVectorizationProvider
an abstract class (this was needed as private static methods don't work in interfaces).VectorizationProvider
is only accessble by a fixed list of caller classes (caller sensitive). The package is hidden by module system, but for classpath applications it is still visible, so prevent misuse. This was built like theTestSecrets
class in the sister package which is only accessible from test framework.The new setup allows:
ForUtil
(see Make ForUtil Vectorized #12396), we can add a new interface and add a getter to get a singleton (likeVectorUtilImpl
) or new instance ofForUtil
for use by codec (which has state). We can have separate implementations for various versions of the codec (e.g., let's start with the 9.x codec). The original scalar implementation is copied to the internal package, the vectorized implementation is added to the Java 20 source folder.We can change packages and naming a bit. I don't like
*Impl
for interface names. MaybeVectorUtilSupport
instead?