This repository was archived by the owner on May 12, 2021. It is now read-only.

TAJO-1092: Improve the function system to allow other function implementation types.#178

Closed
hyunsik wants to merge 2 commits into apache:master from hyunsik:TAJO-1092

hyunsik commented Oct 5, 2014

See https://issues.apache.org/jira/browse/TAJO-1092.

In the current function system, each function implementation is a single Java class that extends org.apache.tajo.catalog.function.Function.

This approach leaves a lot of room for improvement. It always uses Datum for the input and output values of functions, which creates unnecessary objects. It is also unlikely to exploit information available in the query statement, for example whether a parameter is a constant or a variable.
In this issue, I propose improving the function system so that it supports other function implementation types. In addition, I propose three function implementation types:

  • legacy Java class function provided by the current Tajo
  • static method in Java class
  • code generation by ASM

Later, we could expand this feature to allow Pig or Hive functions in Tajo.


hyunsik commented Oct 5, 2014

An example of a static method in a Java class is at https://github.com/apache/tajo/pull/178/files#diff-f6daa76b2459470a9f3412131c0f726bR34.

I designed the function annotation system to point to a Function Collection, which is a class containing multiple static functions. To add a user-defined or built-in function, just add a static method as in the example. It is very easy, and it enables Tajo to reuse existing functions.

Besides, as you may know, SQL is based on three-valued logic (http://en.wikipedia.org/wiki/Three-valued_logic). So, every value can be nullable. Even a boolean value can take one of three values: TRUE, FALSE, and UNKNOWN (NULL in SQL). In the current function system, each function must deal with NULL values explicitly. Most functions return NULL if at least one parameter is NULL. The Substr function is an example (https://github.com/apache/tajo/blob/master/tajo-core/src/main/java/org/apache/tajo/engine/function/string/Substr.java#L63). This puts a burden on users, and it is easy to forget NULL handling when implementing user-defined functions.
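
To make the three-valued logic point concrete, here is a minimal sketch in plain Java. It is not code from this patch: the class name ThreeValuedLogic and its methods are hypothetical, and a Java null Boolean stands in for SQL UNKNOWN (NULL).

```java
// Hypothetical illustration only; not part of Tajo.
// A null Boolean represents SQL UNKNOWN (NULL).
class ThreeValuedLogic {

    // FALSE dominates AND: FALSE AND UNKNOWN is FALSE.
    static Boolean and(Boolean a, Boolean b) {
        if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) {
            return Boolean.FALSE;
        }
        if (a == null || b == null) {
            return null; // UNKNOWN
        }
        return Boolean.TRUE;
    }

    // TRUE dominates OR: TRUE OR UNKNOWN is TRUE.
    static Boolean or(Boolean a, Boolean b) {
        if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
            return Boolean.TRUE;
        }
        if (a == null || b == null) {
            return null; // UNKNOWN
        }
        return Boolean.FALSE;
    }

    // NOT UNKNOWN stays UNKNOWN.
    static Boolean not(Boolean a) {
        return a == null ? null : Boolean.valueOf(!a);
    }

    public static void main(String[] args) {
        System.out.println(and(null, false));
        System.out.println(and(null, true));
        System.out.println(or(null, true));
        System.out.println(not(null));
    }
}
```

Every function written against nullable inputs has to make decisions like these by hand, which is the burden described above.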

To mitigate this problem and to make function invocation more efficient, I designed a new function binder and a new function definition approach that keeps hints about how a function handles NULL values.

The hints are expressed through the parameter types in a function definition. You specify them by using either a Java primitive type or its wrapper class (class primitive type) for each parameter, depending on how that parameter should handle NULL.

For example:

This pow function does not allow NULL values as input parameters. In this case, if at least one parameter is NULL, the function binder will automatically return NULL without invoking the function. So, the function itself does not need to handle NULL values explicitly.

@ScalarFunction(name = "pow", returnType = FLOAT8, paramTypes = {FLOAT8, FLOAT8})
public static double pow(double x, double y) {
  return Math.pow(x, y);
}

The following function definition allows NULL values for both input parameters. In this case, the function must handle NULL values explicitly.

@ScalarFunction(name = "pow", returnType = FLOAT8, paramTypes = {FLOAT8, FLOAT8})
public static Double pow(Double x, Double y) {
  if (x == null || y == null) {
    return null;
  }
  return Math.pow(x, y);
}

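In addition, the function binder allows a mix of primitive types and class primitive types. With a mixed definition, the function binder lets only the parameters declared with class primitive types handle NULL values explicitly; a NULL passed to a primitive-typed parameter is handled by the binder as in the first pow example.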
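
The binder rule for mixed definitions can be sketched with plain reflection. This is only an illustration under stated assumptions, not the code in this patch: the real binder is generated as bytecode, and the names ExampleFunctions, NullAwareInvoker, lookup, and invoke are hypothetical. The sketch returns NULL without calling the function when a primitive-typed parameter receives NULL, and otherwise passes NULL through to the function.

```java
import java.lang.reflect.Method;

// Hypothetical function collection: a class holding static functions.
class ExampleFunctions {
    // Primitive parameters: the function never sees NULL.
    public static double pow(double x, double y) {
        return Math.pow(x, y);
    }

    // Mixed definition: s may be NULL (reference type), n may not (primitive).
    public static String repeat(String s, int n) {
        if (s == null) {
            return null; // the function handles NULL for s itself
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }
}

// Reflection-based stand-in for the generated function binder.
class NullAwareInvoker {
    static Method lookup(String name, Class<?>... paramTypes) {
        try {
            return ExampleFunctions.class.getMethod(name, paramTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static Object invoke(Method fn, Object... args) {
        Class<?>[] paramTypes = fn.getParameterTypes();
        for (int i = 0; i < paramTypes.length; i++) {
            // A primitive parameter cannot hold NULL, so short-circuit here.
            if (paramTypes[i].isPrimitive() && args[i] == null) {
                return null;
            }
        }
        try {
            return fn.invoke(null, args); // null receiver: static method
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Method pow = lookup("pow", double.class, double.class);
        Method repeat = lookup("repeat", String.class, int.class);
        System.out.println(invoke(pow, 2.0, 3.0));
        System.out.println(invoke(pow, 2.0, null));    // binder returns NULL
        System.out.println(invoke(repeat, null, 3));   // function returns NULL
        System.out.println(invoke(repeat, "ab", null)); // binder returns NULL
    }
}
```

Reflection is used here only because it keeps the sketch short. It pays a lookup and boxing cost on every call, which is the kind of overhead that generating the binder as bytecode is meant to avoid.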

Finally, the function binder is generated on the fly by Java bytecode generation, so it does not add any overhead even though the logic is very complex. I also expect that this idea will significantly reduce the overhead of Datum use in the existing function system.


hyunsik commented Oct 5, 2014

After this patch is committed, I'll add documentation on how to write Tajo user-defined functions using the proposed design.

…into TAJO-1092

Conflicts:
	tajo-core/src/main/java/org/apache/tajo/master/TajoMaster.java

hyunsik commented Oct 11, 2014

rebased.

Is this a NULL example?


jinossy commented Oct 13, 2014

Looks great to me!
In my opinion, you should check the following example:
public static boolean myfunc(String x, int y); // nullable + primitive


hyunsik commented Oct 14, 2014

Support for that kind of function will be added in my next patch. Thank you for your review.


jinossy commented Oct 15, 2014

This patch provides backward compatibility, so there is no issue.
Here is my +1. Thank you for your contribution!


hyunsik commented Oct 15, 2014

Thank you for your review. I'll commit it shortly.

asfgit closed this in d56737b Oct 15, 2014
asfgit pushed a commit that referenced this pull request Oct 15, 2014