Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor vectorization support in Lucene #12410

Merged
merged 7 commits into from
Jul 3, 2023

Conversation

uschindler
Copy link
Contributor

@uschindler uschindler commented Jul 2, 2023

This refactors the vectorization support introduced in Lucene 9.7 to allow to plugin vectorization support for more classes except 'VectorUtil':

  • Move all classes to internal package org.apache.lucene.internal.vectorization
  • Add interface VectorUtilImpl for the current VectorUtil backing. This allows to add more
  • Make the main entry point VectorizationProvider an abstract class (this was needed as private static methods don't work in interfaces).
  • The instance of VectorizationProvider is only accessble by a fixed list of caller classes (caller sensitive). The package is hidden by module system, but for classpath applications it is still visible, so prevent misuse. This was built like the TestSecrets class in the sister package which is only accessible from test framework.

The new setup allows:

  • If we get new implementations like ForUtil (see Make ForUtil Vectorized #12396), we can add a new interface and add a getter to get a singleton (like VectorUtilImpl) or new instance of ForUtil for use by codec (which has state). We can have separate implementations for various versions of the codec (e.g., let's start with the 9.x codec). The original scalar implementation is copied to the internal package, the vectorized implementation is added to the Java 20 source folder.
  • We can also add a FMA-based VectorUtil implementation in the future if the JDK allows us to check which features are available at runtime. So based on some lookup "does incubator.vector support FMA for current CPU)", we could return a different impl.

We can change packages and naming a bit. I don't like *Impl for interface names. Maybe VectorUtilSupport instead?

@uschindler
Copy link
Contributor Author

Here is exposed Javadocs API:
image

image

image

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I really like how this turned out. The separation of provider and implementation is much cleaner now.

*
* @lucene.internal
*/
public abstract class VectorizationProvider {
Copy link
Contributor

@ChrisHegarty ChrisHegarty Jul 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of replacing "vectorization" with "vectorized" ? so VectorizedProvider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the default impl is not vectorized. I am not sure what the best name would be.

The package is vectorization, too. I am open to suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the default impl is not vectorized.

Ah yes, I forgot that. So it's "vectorizable" - a candidate for vectorization. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds horrible. 😴 Need to sleep.

I think the package name should also be adapted. Naming is hard. 🤨

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how the package and the provider should be better named.

To me "vectorization" sounds good, as it provides (possible) vectorization support for Lucene. In German it would be "Vektorisierung" which is exactly what the code should do.

Copy link
Contributor

@ChrisHegarty ChrisHegarty Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with "vectorization".

Yet another alternative ;-) Leave the default scalar implementation where it was. Then the provider would be a provider of vectorized implementations, Something like, VectorizedProvider::getInstanceOrNull. That way the new package consists of just Panama code and the light-weight interfaces (no scalar code).

Copy link
Contributor Author

@uschindler uschindler Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about the same. I was not too happy about it because it moves a lot if/else code back to the original package. Because you need to implement the scalar code anyways as interface. It also makes testing hard because I am at moment working for a BaseVectorizationTest case that allows to write compare impls like TestVectorUtilSupport. This only works in a generic way if the code is easy to locate and uses same interfaces.

I think keeping the code that should do the same together is preferable (at least to me). From what I figured out: Most of the code affected is in oal.util anyways. The special case with ForUtil is different. At moment I am not fully sure if the code really differs in Lucene 5, Lucene 8 and Lucene 9 codecs and ForUtil can move to utils package?

About the name: I leave the name for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One addition: By returning null or similar for the provider we miss some flexibility: Based on Panama's options we can opt in / opt out to implement some methods. With the current approach we can fall back to the default implementation only for some interface or only for one method, if Panama does not have a specific feature enabled (or not in that version).

The current VectorUtilSupport for integer vectors already opts out if AVX2 is not available. Currently it is an if statment, but it could also call DefaultVectorUtilSupport#dotProduct() if we have no AVX2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed reply. I agree with the current approach and naming. Let's go for it! :-)

@uschindler
Copy link
Contributor Author

I added a small refactoring for the tests: I added a base class which does the Panama Check in BeforeClass. All subclasses have access to Panama and default impl (if available). If Panama is not available it will prevent test from executing.

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

public static void beforeClass() throws Exception {
assumeTrue(
"Test only works when JDK's vector incubator module is enabled.",
PANAMA_PROVIDER.getClass() != LUCENE_PROVIDER.getClass());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noice!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, yes much better than before as it is not negated.

@uschindler
Copy link
Contributor Author

P.S.: I checked: ForUtil in 9.x is different than in 8.x because the byte order and some other slight changes. So I would only vectorize the core one and leave out the backwards compatibility.

@uschindler
Copy link
Contributor Author

I will merge this once CI is happy. I only changed some messages and added a test for the caller sensitive check (like TestTestSecrets).

@uschindler uschindler merged commit fde2e50 into apache:main Jul 3, 2023
4 checks passed
@uschindler uschindler deleted the dev/refactorVectorizationSetup branch July 3, 2023 09:39
asfgit pushed a commit that referenced this pull request Jul 3, 2023
@uschindler
Copy link
Contributor Author

Thanks @ChrisHegarty !

@uschindler
Copy link
Contributor Author

I figured out that the caller sensitive checking how it wa simplemented does not work with package private classes, so I plan to fix this in a followup commit:

  private static final Set<String> VALID_CALLERS =
      Set.of("org.apache.lucene.util.VectorUtil", "org.apache.lucene.codecs.lucene90.PForUtil");

  private static void ensureCaller() {
    final boolean validCaller =
        StackWalker.getInstance()
            .walk(
                s ->
                    s.skip(2)
                        .limit(1)
                        .map(StackFrame::getClassName)
                        .allMatch(VALID_CALLERS::contains));
    if (!validCaller) {
      throw new UnsupportedOperationException(
          "VectorizationProvider is internal and can only be used by known Lucene classes.");
    }
  }

@uschindler
Copy link
Contributor Author

This is fine as Maven Shade plugin and similar will rewrite constants with class names automatically, so theres no need to fallback to LDC bytecode.

asfgit pushed a commit that referenced this pull request Jul 3, 2023
asfgit pushed a commit that referenced this pull request Jul 3, 2023
@uschindler
Copy link
Contributor Author

Fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants