Change FunctionSignature returnType to std::optional to support return types that vary with config #10467
What do you think? @mbasmanova @rui-mo @liujiayi771
@jinchengchenghh It is a bit difficult to understand this issue. Would you illustrate the points you are making with detailed examples?
@jinchengchenghh Your point is that in some functions, the output is determined by the input, and it is not possible to accurately represent the result type in the signature. I also encountered situations in decimal agg where the result type couldn't be accurately expressed, and using UNKNOWN would lead to not finding the corresponding agg function.
What are your commits and what error did you meet? It looks like the aggregate function lookup only considers the function name. @liujiayi771 We could also consider changing it to get the entry by name and input types, like vector functions do.
A function declaration is the prototype for a function (or it can come from the function definition if no prototype has been seen by the compiler at that point). It includes the return type, the name of the function and the types of the parameters (optionally in C). A function signature is the part of the function declaration that the compiler uses to perform overload resolution. Since multiple functions might have the same name (i.e., they're overloaded), the compiler needs a way to determine which of several possible functions with a particular name a function call should resolve to. The signature is what the compiler considers in that overload resolution.

Specifically, the standard defines 'signature' as: the information about a function that participates in overload resolution: the types of its parameters and, if the function is a class member, the cv-qualifiers (if any) on the function itself and the class in which the member function is declared. Note that the return type is not part of the function signature. As the standard says in a footnote, "Function signatures do not include return type, because that does not participate in overload resolution".
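A minimal C++ sketch of that rule (all names are illustrative):

```cpp
#include <string>

// Overloads are distinguished by parameter types.
int parse(const std::string&) { return 1; }
int parse(const char*) { return 2; }

// Differing only in the return type is NOT a valid overload:
// double parse(const std::string&);  // error: differs only in return type

int main() {
  std::string s = "42";
  return parse(s) + parse("42");  // resolves to the two overloads above
}
```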
I drafted a PR to solve this issue: #10487. This PR #10383 can also benefit from this solution. Can you help take a look? And let me know if we should do this. @mbasmanova
@jinchengchenghh Before proceeding with the changes, would you please explain the issue with detailed examples? See #10467 (comment)
If I remember it correctly, there were some cases among aggregate functions where the same intermediate argument type produces different return types. See avg as an example. And given that intermediate types are treated as input types while resolving, it's actually a case in Velox where the return type has to be known to look up the correct function. I'm not sure, but if aggregate functions are the only reason the return type is needed during resolution, I would suggest refactoring the aggregate functions' resolution procedure to avoid using the intermediate type as the input type. Speaking of SQL, when a function call is being planned, the query planner should always derive its return type (and intermediate types if it's an aggregate function) from the data types of the arguments. So in theory Velox could follow the same principle to find the right function implementation without requiring extra information.
@zhztheplayer I also believe that the agg func should be resolved based on the input type rather than the intermediate type. There is still one issue with Spark decimal avg that remains unresolved, which is that we cannot infer the result type through the intermediate type.
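To make the ambiguity concrete, here is a small sketch; the precision/scale formulas follow Spark's decimal avg as commonly described and should be treated as assumptions:

```cpp
#include <algorithm>
#include <cstdio>

// Assumed Spark-style decimal avg typing rules:
//   intermediate sum type: DECIMAL(min(38, p + 10), s)
//   final result type:     DECIMAL(min(38, p + 4), min(38, s + 4))
struct Decimal {
  int precision;
  int scale;
};

Decimal avgIntermediate(Decimal in) {
  return {std::min(38, in.precision + 10), in.scale};
}

Decimal avgResult(Decimal in) {
  return {std::min(38, in.precision + 4), std::min(38, in.scale + 4)};
}

int main() {
  // Two different raw inputs...
  Decimal a{30, 2}, b{32, 2};
  // ...produce the SAME intermediate type once precision caps at 38
  // (both DECIMAL(38, 2)), but DIFFERENT result types:
  // DECIMAL(34, 6) vs DECIMAL(36, 6). So the result type cannot be
  // inferred from the intermediate type alone.
  std::printf("intermediate: (%d,%d) vs (%d,%d)\n",
              avgIntermediate(a).precision, avgIntermediate(a).scale,
              avgIntermediate(b).precision, avgIntermediate(b).scale);
  std::printf("result:       (%d,%d) vs (%d,%d)\n",
              avgResult(a).precision, avgResult(a).scale,
              avgResult(b).precision, avgResult(b).scale);
}
```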
For some decimal functions, the return type is not used to look up the function when resolving vector functions; it is only used to evaluate the type signature string. And it is confusing that there are two ways to compute the resultType: one is parsing the signature, the other is code.
In SQL, there are functions and special forms. A function has a name, a list of argument types and a return type. There can be multiple functions with the same name and a different number or types of arguments. The return type of a function can be fully resolved from the list of argument types. It is helpful to support "static" return type resolution and not require instantiating executable functions. For example, one can register a function in Velox, extract its signature from the Velox registry, pass it to a SQL planner / analyzer and enable queries that use this function without having that function defined within the SQL planner / analyzer itself (e.g. one can have C++-only functions that are not defined in Spark Scala or Presto Java code but can be used with these engines).

Type resolution also allows writing standalone tests in Velox that use complex expressions and query plans. See the PlanBuilder APIs that take SQL strings. Constructing typed expression trees by hand doesn't scale. Notice that the Fuzzer also uses the function registry to automatically construct valid expression trees. This would not be possible if the function registry didn't include the return types. To evaluate a function, the engine first evaluates all inputs, then invokes the function.

Special forms are similar to functions, but they allow for more flexibility in type resolution or execution. The planner / analyzer needs to know about all special forms as these often have different syntax and do not support "static" type resolution. The return type for a special form may not be resolvable from just the list of argument types. For example, the return type for CAST needs to be specified explicitly. Both CAST(integer as varchar) and CAST(integer as double) take a single argument of type integer, but have different return types: varchar vs. double. Another example is that the return type for the round(expr, targetScale) Spark function depends on the value of the second argument, not just its type.

In addition, special forms may not evaluate all the inputs to produce a result. For example, COALESCE is a special form because it evaluates arguments one at a time and once it finds an argument that evaluates to non-NULL it returns without evaluating the rest of the arguments. IF is similar: if 'condition' evaluates to true, only the 'then' argument is evaluated. Velox supports both functions and special forms. Functions are required to provide a way to statically resolve the return type from input types. Special forms do not have that requirement and resolve return types dynamically at runtime.
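For instance, decimal functions can express "static" return-type resolution through constraints in the signature. A sketch modeled on Velox's FunctionSignatureBuilder (the exact constraint expressions here are assumptions):

```cpp
#include "velox/expression/FunctionSignature.h"

using namespace facebook::velox;

// Sketch: the return type is fully determined by the input types via
// integer-variable constraints, so a planner can resolve it statically
// without instantiating the function.
auto signature = exec::FunctionSignatureBuilder()
                     .integerVariable("a_precision")
                     .integerVariable("a_scale")
                     .integerVariable("r_precision", "min(38, a_precision + 1)")
                     .integerVariable("r_scale", "a_scale")
                     .returnType("DECIMAL(r_precision, r_scale)")
                     .argumentType("DECIMAL(a_precision, a_scale)")
                     .argumentType("DECIMAL(a_precision, a_scale)")
                     .build();
```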
This case means that there are two different functions and config decides which one is used in a given setup. The API to register functions (velox::functions::sparksql::registerFunctions) can take a "config" and use it to decide which function to register. Alternatively, the register API may register two functions with different names and the plan converter can map the function to one or the other based on config.
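A sketch of the second alternative; all names here are hypothetical:

```cpp
#include <iostream>
#include <string>

// Hypothetical helper: registers the same kernel under a given name, with
// the precision-loss behavior fixed at registration time.
void registerDecimalRound(const std::string& name, bool allowPrecisionLoss) {
  std::cout << "registered " << name
            << " (allowPrecisionLoss=" << allowPrecisionLoss << ")\n";
}

int main() {
  // One name per behavior; the plan converter maps Spark's round() to one
  // or the other based on the session config.
  registerDecimalRound("decimal_round_allow_precision_loss", true);
  registerDecimalRound("decimal_round_deny_precision_loss", false);
  return 0;
}
```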
FYI, we have extended the Simple Function API to support functions with decimal inputs and results and re-implemented Presto functions to use it. This makes it a lot simpler. It might be helpful to do the same for Spark functions. CC: @rui-mo
In Gluten, we do not generate a QueryConfig when registering functions.
A SimpleFunction cannot get the resultPrecision and resultScale, and I need them in the computation process.
@jinchengchenghh I suggest changing the existing registerFunctions to add a new non-optional parameter that specifies the config for registration. This parameter should be a new struct, not core::QueryConfig, and it should contain only values / flags that influence registration. Something like this:
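A minimal sketch of the proposed shape, assuming a single allowPrecisionLoss flag (other fields could be added later):

```cpp
#include <string>

// Registration-time configuration: holds only values that influence which
// function variants get registered. Deliberately not core::QueryConfig.
struct SparkFunctionRegistrationConfig {
  // Counterpart of Spark's spark.sql.decimalOperations.allowPrecisionLoss.
  bool allowPrecisionLoss{true};
};

namespace facebook::velox::functions::sparksql {
// Proposed entry point: the config is a required parameter.
void registerFunctions(
    const std::string& prefix,
    const SparkFunctionRegistrationConfig& config);
} // namespace facebook::velox::functions::sparksql
```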
SparkFunctionRegistrationConfig name isn't great. We can iterate on it during code review.
I'm not sure I understand this question. Please take a look at the Presto functions implementation and provide more details about the problem you are facing if you still have questions. CC: @rui-mo
Thanks for your explanation. Presto functions are different from Spark functions. I have read the Presto functions; they are implemented with the simple function API, but for Spark functions we need the result scale inside the computation, for example:
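(A hypothetical reconstruction of the kind of kernel being described: Spark-style decimal division rescales the dividend using the result scale, so the result scale must be available at compute time. Names and rules here are assumptions.)

```cpp
#include <cstdint>
#include <cstdio>

// Integer power of ten for rescaling unscaled decimal values.
int64_t ipow10(int n) {
  int64_t r = 1;
  while (n-- > 0) {
    r *= 10;
  }
  return r;
}

// The *result* scale is needed inside the computation, not just in the
// signature: the dividend is rescaled so the integer quotient lands at
// resultScale.
int64_t decimalDivide(
    int64_t aUnscaled, int aScale,
    int64_t bUnscaled, int bScale,
    int resultScale) {
  int64_t scaled = aUnscaled * ipow10(resultScale + bScale - aScale);
  return scaled / bUnscaled;
}

int main() {
  // 1.00 / 0.50 at result scale 2: expect 200 unscaled, i.e. 2.00.
  std::printf("%lld\n",
              static_cast<long long>(decimalDivide(100, 2, 50, 2, 2)));
  return 0;
}
```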
I can also change it to a simple function and compute the result precision and result scale from the input types, but this problem still exists.
If you think it is ok, I will change it to simple functions. Then we can only use the config passed at registration.
I tried changing it to a simple function and using a config to register, but the overwrite logic has a problem.
I register all the functions with allowPrecisionLoss true in SparkFunctionBaseTest, and in a single test I register with allowPrecisionLoss false to try to overwrite the previously registered function. When I register with allowPrecisionLoss false, the printed FunctionSignature confirms that the signature differs between the two configs.
So we will get a random FunctionEntry: either the one with precision loss or the one without.
The problem occurs because we register the function with the result type included in the registry key. I would prefer the second approach. It makes it easy to control which function is really used, and we don't need to compute the result precision and result scale both in the signature and in code. The allowPrecisionLoss mode is a session config in Spark; its semantics are similar to QueryConfig. What do you think? @mbasmanova CC @rui-mo
@jinchengchenghh It seems to be a bug in the function registry. It should only consider the name and input types, and it should not allow registering functions that differ only in result type. It would be nice to create a separate GitHub issue about this and fix it. CC: @kagamiori @kgpai @bikramSingh91
We should not configure the same behavior in two different places. From what I understand so far, allowPrecisionLoss should be defined only in SparkFunctionRegistrationConfig, not in core::QueryConfig.
To work around the bug in the registry, you may register the functions in the test with different prefixes.
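Reusing the hypothetical registration API sketched earlier, the test workaround could look like this:

```cpp
#include <string>

// As sketched earlier; both declarations are hypothetical.
struct SparkFunctionRegistrationConfig {
  bool allowPrecisionLoss{true};
};
void registerFunctions(
    const std::string& prefix,
    const SparkFunctionRegistrationConfig& config);

void registerForTests() {
  // Distinct prefixes keep the two variants from colliding in the registry.
  registerFunctions("allow_loss_", {true});
  registerFunctions("deny_loss_", {false});
}
```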
Because the decimal allow-precision-loss mode is a session config, and we register functions before the session is created.
And we will not need to overwrite functions at the session level.
We may have to take the second option and select the function to use at runtime, since we need to adapt to the user-specified session configuration, which determines whether to allow precision loss. Thanks!
Description
In Spark functions, there are many functions whose return data type is computed from the input types and varies with config.
Currently, we always implement such functions as SpecialForm, such as decimal_round and make_decimal.
But this has several shortcomings:

- The function has to be registered as `add`, but it should be `decimal_add`.

Why should we make returnType optional?

- We can construct `CallTypedExpr` directly in test code and call `evaluate` to test.
- We can construct `CallTypedExpr` and set its result type directly; see #10383.