This repository has been archived by the owner on May 12, 2021. It is now read-only.

TAJO-1344: Python UDF support #526

Closed
wants to merge 77 commits into from


jihoonson
Contributor

I ran into version conflicts when using Jython.
So I took a different approach based on pipes.
Most of the code is borrowed from Pig.

hyunsik and others added 30 commits March 12, 2015 02:53
…into TAJO-1344

Conflicts:
	tajo-core/src/test/java/org/apache/tajo/TajoTestingCluster.java
@jihoonson
Contributor Author

Ok. I think that this patch is ready for review.

To address @hyunsik's comment, I added a class called FunctionInvoke. This class describes how the functions are executed.

To execute Python scripts, I used an external UDF controller that is responsible for running them, as commented above. When a submitted query involves one or more Python UDFs, several UDF controllers are launched to compute them. Input and output tuples are transmitted via stdio. This approach may cost some performance, but I think that is inevitable without Jython.
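
As a rough illustration of the stdio approach, here is a minimal sketch. It is not the code in this patch: the `Controller` class, the tab-separated line protocol, and the stand-in `udf` are all assumptions made up for the example.

```python
import subprocess
import sys

# Hypothetical controller script: reads one tab-separated tuple per line
# from stdin, applies a UDF, and writes the result to stdout.
CONTROLLER_SOURCE = r'''
import sys

def udf(a, b):  # stand-in for a user-defined function
    return int(a) + int(b)

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    sys.stdout.write(str(udf(*fields)) + "\n")
    sys.stdout.flush()  # the parent blocks until each result arrives
'''


class Controller:
    """Parent-side handle for one external controller process (illustrative)."""

    def __init__(self):
        # Fork the controller as a child process connected through pipes.
        self.proc = subprocess.Popen(
            [sys.executable, "-u", "-c", CONTROLLER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def eval(self, *args):
        # Send one input tuple, then wait for one output line.
        self.proc.stdin.write("\t".join(str(a) for a in args) + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin signals end of input; the child then exits.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

Each call to `eval` is a full round trip through the pipe, which is where the performance cost mentioned above would come from.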

Currently, a separate controller is launched for each Python function in the query. That is, if a query involves 5 Python functions, even if some of them are the same, at least 5 different controllers are launched during query processing. I chose this architecture for its simplicity.
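
To make the counting concrete, here is a small sketch. The function names and the string "controllers" are purely hypothetical; the second function shows an alternative that shares controllers, which this patch does not implement.

```python
def controllers_per_call(udf_calls):
    """One controller per UDF occurrence, as described above."""
    return [f"controller-{i}:{name}" for i, name in enumerate(udf_calls)]


def controllers_shared(udf_calls):
    """Hypothetical alternative: one controller per distinct function."""
    return {name: f"controller:{name}" for name in udf_calls}


# A query that calls f three times, plus g and h once each.
calls = ["f", "g", "f", "h", "f"]
per_call = controllers_per_call(calls)
shared = controllers_shared(calls)
```

With these inputs, `per_call` holds five entries while `shared` holds three. The per-call layout needs no bookkeeping about which occurrences refer to the same function, which is the simplicity being traded for extra processes.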

Here are some highlights of the changes.

  • AnyDatum is used to support Python's dynamic typing.
  • PythonScriptEngine is responsible for maintaining the external controller process. To reduce overhead, the controller should be forked only when UDFs are actually evaluated. In this patch, there are three points where the controller is forked.
    • Constant folding optimization in Tajo master: During constant folding, some UDFs can be evaluated. If necessary, controllers are forked and immediately destroyed after evaluation.
    • Non-from query execution in Tajo master: If the query involves Python UDFs, controllers are forked during query processing.
    • Task execution in worker: If the plan of a stage involves Python UDFs, controllers are forked when a task starts up and destroyed when it shuts down. For simplicity, I chose this architecture rather than sharing controllers among multiple tasks via ExecutionBlockSharedResource.
  • Refactoring the EvalNode::bind() function. This function now receives EvalContext in addition to Schema. EvalContext can carry information that is available only at runtime, such as the ScriptEngine started by each task.
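
The task-level lifecycle and the new bind() parameter can be sketched roughly as follows. This is a Python illustration only, not the Java code in the patch: the class bodies, the `engines` dictionary, and `run_task` are assumptions, and the real engine would fork and destroy a controller process where the flags are toggled.

```python
class ScriptEngine:
    """Stand-in for the engine that owns an external controller process."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True  # a real engine would fork the controller here

    def shutdown(self):
        self.running = False  # and destroy it here


class EvalContext:
    """Runtime information handed to bind(); here it holds only script engines."""

    def __init__(self):
        self.engines = {}


class PythonFunctionEval:
    """Stand-in for an eval node that calls a Python UDF."""

    def __init__(self, name):
        self.name = name
        self.engine = None

    def bind(self, context, schema):
        # The engine is looked up at bind time instead of at construction time.
        self.engine = context.engines[self.name]
        return self


def run_task(evals, schema):
    context = EvalContext()
    # Task start-up: fork one engine per Python UDF in the plan.
    for node in evals:
        engine = ScriptEngine(node.name)
        engine.start()
        context.engines[node.name] = engine
    try:
        for node in evals:
            node.bind(context, schema)
        # Tuples would be evaluated here; report the engine state instead.
        return [node.engine.running for node in evals]
    finally:
        # Task shutdown: destroy every engine, even if evaluation failed.
        for engine in context.engines.values():
            engine.shutdown()
```

Passing the context into bind() keeps the eval nodes free of process management: they only look up an engine that the task has already started.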

To the reviewers: I apologize for the large patch, but many of the changes are just the refactoring of the bind() function and some function renames.
Thanks.

@jinossy
Member

jinossy commented Apr 13, 2015

Great!! @jihoonson
I will start the review. It would be better if you fixed the FindBugs warnings in your code.

@hyunsik
Member

hyunsik commented Apr 13, 2015

Thank you all for your efforts. I will also take a look, but only at some design points.

@jihoonson
Contributor Author

Thanks @jinossy and @hyunsik. I've fixed some findbugs warnings related to my patches.

…into TAJO-1344_3

Conflicts:
	tajo-core/src/test/java/org/apache/tajo/engine/query/TestSelectQuery.java
…into TAJO-1344_3

Conflicts:
	tajo-core/src/main/java/org/apache/tajo/worker/Task.java
* @return
* @throws IOException
*/
public static Map<FunctionSignature, FunctionDesc> loadOptionalFunctions(TajoConf conf,
Member

In my opinion, 'user-defined function' would be a better word than 'optional function'

@hyunsik
Member

hyunsik commented Apr 17, 2015

I left some trivial comments. The patch looks good to me. Great job.

Here are my additional comments. You are changing the signature of EvalNode::bind to take an EvalContext instance in order to get the instance of a launched script engine.

This refactoring is already a breaking change, yet EvalNode still needs more information about the task. While we are at it, it would be great to refactor EvalNode to take a richer meta context including TajoConf, task attempt information, and the shared resources of workers. Then we can probably remove the OverridableConf parameter from all constructors of EvalNode.

If you are concerned about the size of the patch, we can do this work in another jira. But that will cause a second breaking change, which is no better than a large patch. You can choose either two breaking changes or one large patch. It's up to you.

@jihoonson
Contributor Author

Thanks @hyunsik. I've changed the function name and removed QueryContextUtil according to your comment.

Regarding the refactoring of the bind() function, I think it would be better to do it in another jira. For that issue, we should first decide what information needs to be passed to EvalNode. I filed another jira (https://issues.apache.org/jira/browse/TAJO-1566).

@hyunsik
Member

hyunsik commented Apr 17, 2015

I got your point and your plan. Please keep going. Here is my +1.

@jihoonson
Contributor Author

I've tested on my laptop.
@jinossy, do you have any other opinions?

@jinossy
Member

jinossy commented Apr 17, 2015

No, it looks good to me. Here is my +1.
Thank you for your work.

@asfgit asfgit closed this in a745385 Apr 18, 2015