Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CALCITE-4539] Customization of Metadata Handlers #2378

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jamesstarr
Copy link
Contributor

@jamesstarr jamesstarr commented Mar 23, 2021

Changing RelMetadataQueryBase.map generic from Table<Object, List, Object>
to Table<Object, Object, Object> to support more efficient cache keys.

@jacques-n
Copy link
Contributor

@jamesstarr, can you rebase this on master?

@jamesstarr jamesstarr changed the title [CALCITE-4551] Reusing Immutable metadata cache keys [CALCITE-4539] Customization of Metadata Handlers Oct 28, 2021
Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesstarr, this is a great patch and I'm really excited about the direction you're taking these classes. That being said, I'd like to be more rigorous about introducing new public apis that strengthen or maintain the current connection between handler concepts and relmetadataquery. This patch also seems to do two very different things: one is the refactoring of the base interfaces/classes and the other is a bunch of refactoring of the janino code generation logic. I think we should separate this out so we can get the first figured out and then the second (probably 80% of this code) can be reviewed separately. Does that make sense?

public final Table<RelNode, Object, Object> map;

public final MetadataCache cache;
protected final MetadataHandlerProvider metadataHandlerProvider;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against this change as it doesn't really reduce the required pattern/encapsulation between RelMetadataQuery and another provider concept.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make metadataHandlerProvider private. I think this means that for now you need to move it to relmetadataquery and pass a cache supplier into the RelMetadataQueryBase

Please make the cache private. I think using the Api pattern to "hide" things should be used as a last resort. This may mean it should also move to RelMetadataQuery (for now).

this.memoryHandler = initialHandler(BuiltInMetadata.Memory.Handler.class);
this.nonCumulativeCostHandler = initialHandler(BuiltInMetadata.NonCumulativeCost.Handler.class);
this.parallelismHandler = initialHandler(BuiltInMetadata.Parallelism.Handler.class);
protected RelMetadataQuery(MetadataHandlerProvider metadataHandlerProvider) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not excited about depending on another provider interface in RelMetadataQuery (and introducing another public api). Can we come up with a way to extract this out? (Separate the concept of handlers from the relmetadataquery surface area.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This MetadataProvider interface is really "a handler way to deal with metadata that is no longer coupled to janino". I agree that this abstraction makes sense but it feels too specific to be added as a RelMetadataQuery dependency. I'd like to see it instead added to some separate impl specific version of relmetadataquery e.g. . My best thought currently is modifying RelMetadataQuery to be a decorator on top of a new "AbstractRelMetdataQuery". Then the handler concepts can be moved to an implementation of AbstractRelMetdataQuery and RelMetadataQuery doesn't depend on a new implementation specific interface. Concepts like null -> sentinel conversion and convenience delegations (getColumnOrigin -> getColumnOrigins) can be done at the RelMetadataQuery level but other concepts should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point the MetadataProvider a list a of tuples (Class<? extends MetadataHandler>, MetadataHandler instance) which represent binding a set methods to a particular metadata. Without this patch the Metadata class is not used in any meaningful way in code path that is not already deprecated. You have to have a symbol(global constant) to bind a the metadata implementation too.

Part of the whole point of the interface is to let people customize/change the behavior of the handler generation. Some people want to eagerly generate the code at runtime, other want to tweaks the caching of generated classes, you and others do not want to use janino altogether. This would support all of those use cases.

RelMetadataQuery needs to call the implementation of the metadata calls which is currently done through handlers. RelMetadataQueryBase already has a supported api for doing conceptually similar things with similar amount of exposed api. Their are just a set of things that you can not do with that existing and supported API that I am attempting to rectify.

Several consumer of apache calcite have custom metadata calls. Currently the suggested way to create a custom call is to subclass RelMetadataQuery. What you are purposing would require 2 different subclasses of RelMetadataQuery and also require them to correctly maintain the caching/cyclic dependency boiler plate. I understand the desire to be able to debug code and have a separation of concerns but the API you are proposing would require several subclasses to do something conceptually simple and introduce breaking changes.

Copy link
Contributor

@jacques-n jacques-n Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to separate the concern and the solution into two separate discussions.

First the concern:
The problem is the coupling between RelMetdataQuery and RelMetdataProvider. In this patch you add additional new public apis that reinforce this coupling. There is no need for the coupling. It is completely reasonable that someone should be able to implement RelMetadataQuery without having to work at all with RelMetdataProvider related items. That is the beauty of the RelMetadataQuery surface area from the planning/rule pov. For example, someone should be able hand write a RelMetadatQuery implementation that has all the handling of supported rel types.

Second, the solution:
It is important to remove the coupling of RelMetdataQuery and RelMetdataProvider. Ideally, we would convert RelMetadataQuery to an abstract implementation (or interface) that only does a minimal number of data cleansing/canonicalization options (null to sentinels, conveniences interfaces, etc). However, because of the exposure of the protected constructor for RelMetadatQuery, this has to be done in two steps to avoid breaking changes. I proposed one way to get there but I don't think it is the only way. I don't understand any of your comments around complexity, more subclasses, caching cyclic dependency and boiler plate. As part of the first step towards a clean api there has to be some shims to support old users until we can remove old apis but that's also true with all of these metadata patches. We shouldn't dismiss an ultimately cleaner solution because we have to maintain a shim for one release (and really, a shim must exist in either case).

Lastly, it would be good to clarify what you mean by breaking changes. In both the proposal I've put forward and the current patch, I see deprecation/upgrade paths required (new apis that replace old apis exist simultaneously and then old apis are removed post version deprecation). In both cases, a user should have to do no code changes when upgrading but should see deprecation warnings and then the following upgrade would see breakage if they didn't solve deprecation warnings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put this another way: RelMetadataQuery is the what, RelMetadataProvider is one possible how. Let's not couple the two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several requirements metadata in calcite:
Metadata retrieval - the api for getting a metdata value for a relnode and give set of arguments. This is currently the RelMetadataQuery.
Metadata implementation - The code that is called to for a relnode to get a metadata value. This is currently sub-classes of RelMetdataProvider.
Defaulting and Validation - There is defaulting, canonization, validation, etc after a metadata value is retrieved in RelMetadataQuery.
Dispatch - How the implementation is called for given arguments. This is currently done in Janino compiled code.
Cycle Detection - This is currently a hard requirement of volcano planner. This is currently done in Janino compiled code.
Caching Invalidation - This is required for volcano planner. This is currently done through RelMetadataQuery.clearCache.
Caching Implementation - This is coupled with cycle detection, but is currently accomplished through RelMetadataQuery.map.
Caching Values - The caches need to be checked and updated. This is currently done in the generated code.

Metadata retrieval and Metadata implementation are hard requirements for external facing APIs. They are currently exist as external facing APIs. Many downstream projects depend on the ability to either customize existing metadata types or add their own metadata types. When a down stream project adds their own metadata type, they need to be able to expose it to the larger system aka through metadata retrieval. This is currently done by subclassing RelMetadataQuery and calling methods RelMetadataQueryBase. An example of this can be seen in RelMetadataTest.MyRelMetadataQuery.

So the metadata retrieval api is the only api need to for the rest of calcite core, however, downstream users of calcite need to be able to configure and extend the metadata implementation. So some of RelMetadataQuery implementation must be exposed with the current setup.

I support making RelMetadataQuery an interface so the metadata retrieval is clearly defined. However, this is not the only public API necessary to expose. Also, it is breaking change which does not separate the defaulting and validation from the RelMetadataQueryImpl. If RelMetadataQuery was split into RelMetadataQuery(an interface), DefaultingAndValidatingRelMetadataQuery and HandlerBackedRelMetadataQuery, then down stream projects would have to implement 2 different RelMetadataQueries to add a custom metadata. This api seems a bit convoluted and cumbersome. Alternatively, breaking it up into class hierarchy of RelMetadataQuery -> DefaultingAndValidatingRelMetadataQuery -> HandlerBackedRelMetadataQuery also feels a bit convoluted and is a breaking change.

I also support removing the leaky abstraction of the metadata implementation to the rest of calcite core. But changing this does not actually help things much other than remove a thread local that can cause odd behaviors in uncommon nesting scenarios.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have a simple solution to avoiding the introduction of a new constructor on this class:

  • Make this constructor private
  • Add a new inner static subclass that uses that constructor and exposes a protected class.

Then we tell people that want to use the handler behavior to extend that subclass. That way, we have deprecated subclassing RelMetadataQuery directly while moving towards this new pattern with minimal code changes.

This avoids the introduction of a new coupling api between RelMetadataQuery the handler concept, is minimal changes in your patch and sets us on a path to moving RelMetadataQuery to an interface or dumb abstract class.

* {@link RelMetadataQuery}. The handlers provided are responsible for
* updating the cache stored in {@link RelMetadataQuery}.
*/
public interface MetadataHandlerProvider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this general approach. The concepts of code generation and ReviseableHandlerProvider are two separate concepts. I would actually name this more to that point e.g. "ReviseableHandlerProvider". As part of this introduction, can we make we remove JaninoRelMetadataProvider since now we really only need a ReviseableMetadataProvider that uses the JaninoReviseableHandlerProvider? (by remove I mean remove all code, uses and deprecate. It would still exist for compatibility reasons)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do not want a Reviseable Handler, you can simply return the final handler from the initialHandler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have deprecated JaninoRelMetadataProvider and moved all the logic to RelMetadataHandlerGeneratorUtil. I was actually think that the name should be broader so that anything could be injected would be passed through this interface oppose to be more explicit. Maybe this should be a config object?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config feels like it would be weird here. As I have thought more about the name I'm pretty much okay with it as is.

Adding RelMetadataHandlerProvider to allow for customization of
caching, eager loading of metadata providers or alternative
metadata providers to be used.
Marking RelMetadataQueryBase.cache as api Internal.
RelMetadataTest

Adding RelMetadataQuery.Builder for streamline query provider creation.
Reworking RelMetadataTest to use RelMetadataQuery.Builder.
Removing Metadata generic from MetadataHandlerProvider.
Creating JaninoHandlerCompilerAndCacheUtil for split out the remaining
functionality from JaninoRelMetadataProvider.

Deprecating
* RelOptCluster.setMetadataProvider
* LegacyJaninoMetadataHandlerProvider
* RelMetadataQuery.new()
* JaninoRelMetadataProvider
@jcamachor
Copy link
Contributor

@jamesstarr , what is the status of this PR? @jacques-n left some feedback but I think the PR was not synced to the main dev branch, it's not clear which issues are legit and need addressing, and which ones had already been solved.

@jamesstarr
Copy link
Contributor Author

@jamesstarr , what is the status of this PR? @jacques-n left some feedback but I think the PR was not synced to the main dev branch, it's not clear which issues are legit and need addressing, and which ones had already been solved.

@jcamachor, @jacques-n said he would look at it on tomorrow. Overall, I think I need to add some more javadocs and possibly clean up some naming, but I am waiting for feed back from @jacques-n before I do the last bit of clean up.

*
* @return A new cache for {@link RelMetadataQuery}
*/
MetadataCache buildCache();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand the meaning of this method. Why does someone call this? What happens if I call it multiple times? What is this caching? Does it have to be called? Why is it on this interface? Also, if you wanted to customize the cache for things like a double property, wouldn't you potentially want to implement a cache that avoided the constant creation of objects and boxing (which the interface of MetadataCache forces)?

public final Table<RelNode, Object, Object> map;

public final MetadataCache cache;
protected final MetadataHandlerProvider metadataHandlerProvider;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make metadataHandlerProvider private. I think this means that for now you need to move it to relmetadataquery and pass a cache supplier into the RelMetadataQueryBase

Please make the cache private. I think using the Api pattern to "hide" things should be used as a last resort. This may mean it should also move to RelMetadataQuery (for now).

public static final ThreadLocal<@Nullable JaninoRelMetadataProvider> THREAD_PROVIDERS =
new ThreadLocal<>();

//~ Constructors -----------------------------------------------------------

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't removing this constructor a breaking change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am doubtful any one inherited directly from RelMetadataQueryBase.

metadataHandlerProvider.initialHandler(BuiltInMetadata.LowerBoundCost.Handler.class);
}

protected RelMetadataQuery(RelMetadataQuery prototype) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make this private.

this.memoryHandler = initialHandler(BuiltInMetadata.Memory.Handler.class);
this.nonCumulativeCostHandler = initialHandler(BuiltInMetadata.NonCumulativeCost.Handler.class);
this.parallelismHandler = initialHandler(BuiltInMetadata.Parallelism.Handler.class);
protected RelMetadataQuery(MetadataHandlerProvider metadataHandlerProvider) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have a simple solution to avoiding the introduction of a new constructor on this class:

  • Make this constructor private
  • Add a new inner static subclass that uses that constructor and exposes a protected class.

Then we tell people that want to use the handler behavior to extend that subclass. That way, we have deprecated subclassing RelMetadataQuery directly while moving towards this new pattern with minimal code changes.

This avoids the introduction of a new coupling api between RelMetadataQuery the handler concept, is minimal changes in your patch and sets us on a path to moving RelMetadataQuery to an interface or dumb abstract class.

* {@link RelMetadataQuery}. The handlers provided are responsible for
* updating the cache stored in {@link RelMetadataQuery}.
*/
public interface MetadataHandlerProvider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config feels like it would be weird here. As I have thought more about the name I'm pretty much okay with it as is.

}
}
}

public static Builder<RelMetadataQuery> builder() {
Copy link
Contributor

@jacques-n jacques-n Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to go with a builder, I think we should make a couple changes. (Not entirely sure a builder is yet necessary.)

  • This should probably be a method called supplierBuilder() or similar since it doesn't actually build the type of class it is contained within. I'd expect RelMetdataQuery.builder() to build a RelMetadataQuery, not Supplier<RelMetadataQuery>.
  • I think that we may want to add other ways of building in the future. For example I'm exploring the introduction of a lambda based metadata handling systems. In that situation, we may want to construct one of those using the same initial builder. As such I suggest a specialization strategy where calling withProviders returns a subtype of builder that has configuration options related to a provider based builder. Maybe in the future we have a separate one that works based on lambdas and we call withLambdas() to use it (and a call to that would specialize the builder to that type of metadataquery supplier.

Copy link
Member

@linghengqian linghengqian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that it seems like the PR was forced to update externally without resolving the branch's merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants