[FLINK-16337][python][table-planner][table-planner-blink] Add support of vectorized Python UDF in blink planner and old planner #11252

dianfu · 2020-02-28T13:06:08Z

What is the purpose of the change

This pull request adds the relNodes and rules to support vectorized Python UDF in blink planner and old planner.

Brief change log

Introduce relNodes and rules to support vectorized Python UDF in blink planner, such as StreamExecArrowPythonCalc, BatchExecArrowPythonCalc, etc.
Introduce relNodes and rules to support vectorized Python UDF in old planner, such as DataStreamArrowPythonCalc
Introduce PythonCalcSplitPandasInProjectionRule, which makes it possible to use non-vectorized and vectorized Python UDFs in the same job
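
As a rough illustration of the splitting idea (not the actual rule, which operates on Calcite RelNodes and RexNodes), the sketch below assumes that a single Calc can run only one kind of Python function and therefore moves every vectorized call into a lower Calc. `Expr`, `InputRef`, `PyCall`, and `split` are hypothetical names introduced here, and a pandas call that itself takes a general Python UDF as an argument is not handled.

```scala
object SplitPandasSketch {
  sealed trait Kind
  case object General extends Kind
  case object Pandas extends Kind

  // Toy expression tree standing in for Calcite's RexNode (assumption).
  sealed trait Expr
  final case class InputRef(index: Int) extends Expr
  final case class PyCall(name: String, kind: Kind, args: List[Expr]) extends Expr

  /**
    * Splits a projection that mixes both kinds of Python calls.
    * Returns (bottom, top): the bottom projection forwards all input fields
    * and appends the extracted pandas calls; the top projection refers to
    * those results through input references.
    */
  def split(projection: List[Expr], inputArity: Int): (List[Expr], List[Expr]) = {
    val extracted = scala.collection.mutable.ArrayBuffer.empty[Expr]

    def rewrite(e: Expr): Expr = e match {
      case call @ PyCall(_, Pandas, _) =>
        // Move the pandas call to the bottom Calc and reference its output field.
        extracted += call
        InputRef(inputArity + extracted.size - 1)
      case PyCall(name, General, args) =>
        PyCall(name, General, args.map(rewrite))
      case ref: InputRef => ref
    }

    val top = projection.map(rewrite)
    val bottom = List.tabulate[Expr](inputArity)(InputRef(_)) ++ extracted
    (bottom, top)
  }
}
```

For a projection such as `general_add(pandas_mul(a, b), b)`, the bottom Calc would compute `a, b, pandas_mul(a, b)` and the top Calc would compute `general_add($2, $1)`, so each Calc contains only one kind of Python function.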

Verifying this change

This change added tests and can be verified as follows:

Added tests in PythonCalcSplitRuleTest

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not applicable)

flinkbot · 2020-02-28T13:09:26Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit ec77755 (Fri Feb 28 13:09:25 UTC 2020)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The bot tracks the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-02-28T13:24:57Z

CI report:

4304158 Travis: SUCCESS Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

hequn8128

@dianfu Thanks a lot for the PR. Overall it looks good to me, except that I'm wondering if we can reuse the current PythonCalcRule and PythonCalRelNode. The reasons are:

Most of the code in PythonCalcRule and ArrowPythonCalcRule, and in PythonCalRelNode and ArrowPythonCalRelNode, is the same.
The Rule doesn't even need to be changed if we reuse the Rule and RelNode.
The change in the RelNode would also be small, e.g., adding an if/else branch to load the corresponding runtime operator.
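
The if/else idea in the last point could look roughly like the following. This is a sketch under stated assumptions: the operator names are placeholders rather than Flink's real class names, and `operatorFor` is a hypothetical helper standing in for the operator-loading code of the RelNode.

```scala
object ReusePythonCalcSketch {
  sealed trait PythonFunctionKind
  case object General extends PythonFunctionKind
  case object Pandas extends PythonFunctionKind

  // Placeholder operator names (assumptions, not the real Flink classes).
  val GeneralOperator = "GeneralPythonScalarOperator"
  val ArrowOperator = "ArrowPythonScalarOperator"

  /**
    * Picks the runtime operator for a Calc based on the kind of the Python
    * functions it contains. The Calc is assumed to hold a single kind only.
    */
  def operatorFor(kinds: Seq[PythonFunctionKind]): String = {
    require(kinds.nonEmpty, "the Calc must contain at least one Python function")
    require(kinds.distinct.size == 1, "a single Calc must not mix function kinds")
    if (kinds.head == Pandas) ArrowOperator else GeneralOperator
  }
}
```

With a branch like this inside the shared RelNode, the rule that creates the node would not need to know which kind of function it is dealing with.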

Besides, is it possible to support pandas udf for batch mode in old planner?

What do you think?

hequn8128 · 2020-03-05T01:24:00Z

...able-planner-blink/src/main/scala/org/apache/flink/table/planner/plan/utils/PythonUtil.scala

@@ -32,51 +33,85 @@ object PythonUtil {
    * @param node the RexNode to check
    * @return true if it contains the Python function call in the specified node.
    */
-  def containsPythonCall(node: RexNode): Boolean = node.accept(new FunctionFinder(true, true))
+  def containsPythonCall(node: RexNode): Boolean =


How about merging the two containsPythonCall methods? I see two options:

Add a default parameter for pythonFunctionKind.

Add a GENERAL_PANDAS enum type in PythonFunctionKind.
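
A minimal sketch of the first option, using a toy expression tree rather than Calcite's RexNode. `Node`, `Field`, `Call`, and the `Option`-typed `kind` parameter are assumptions made for illustration; the signature actually chosen in the PR may differ.

```scala
object ContainsPythonCallSketch {
  sealed trait PythonFunctionKind
  case object General extends PythonFunctionKind
  case object Pandas extends PythonFunctionKind

  // Toy stand-in for RexNode: a field reference or a call with operands.
  sealed trait Node
  final case class Field(name: String) extends Node
  final case class Call(
      name: String,
      pythonKind: Option[PythonFunctionKind], // None for non-Python calls
      operands: List[Node]) extends Node

  /**
    * Returns true if the tree contains a Python call. When `kind` is given,
    * only Python calls of that kind count; the default matches any kind.
    */
  def containsPythonCall(node: Node, kind: Option[PythonFunctionKind] = None): Boolean =
    node match {
      case Field(_) => false
      case Call(_, callKind, operands) =>
        val matchesHere = callKind.exists(k => kind.forall(_ == k))
        matchesHere || operands.exists(containsPythonCall(_, kind))
    }
}
```

Callers that do not care about the kind keep calling `containsPythonCall(node)`, while the Arrow-specific code paths pass `Some(Pandas)`, so a single method covers both existing use cases.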

hequn8128 · 2020-03-05T07:17:22Z

.../org/apache/flink/table/planner/plan/rules/physical/batch/BatchExecArrowPythonCalcRule.scala

+/**
+  * Rule that converts [[FlinkLogicalCalc]] to [[BatchExecArrowPythonCalc]].
+  */
+class BatchExecArrowPythonCalcRule


How about reusing the current PythonCalcRule and PythonCalRelNode? Most of the code in PythonCalcRule and ArrowPythonCalcRule, and in PythonCalRelNode and ArrowPythonCalRelNode, is the same. The Rules don't even need to be changed.

Besides, should we convert this class to Java? In the long term, it's better to avoid new Scala classes. Also see the discussion here: #11051 (comment)

What do you think?

hequn8128 · 2020-03-05T07:24:23Z

...cala/org/apache/flink/table/planner/plan/nodes/physical/batch/BatchExecArrowPythonCalc.scala

+      ret, getPythonWorkerMemory(planner.getTableConfig.getConfiguration))
+  }
+
+  private def getPythonWorkerMemory(config: Configuration): Long = {


This method is copied from BatchExecPythonCalc. How about putting this method into the CommonPythonCalc?
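
One way to share the helper is a trait that both nodes mix in. The sketch below is illustrative only: the config is a plain `Map`, the key and default value are made up, the class names are simplified, and the real method body from BatchExecPythonCalc is not reproduced here.

```scala
object CommonPythonCalcSketch {
  // Hypothetical config key and default; not Flink's real option names.
  val WorkerMemoryKey = "python.worker.memory.bytes"
  val DefaultWorkerMemoryBytes: Long = 64L * 1024 * 1024

  // Shared trait: the helper lives in one place instead of being copied.
  trait CommonPythonCalc {
    protected def getPythonWorkerMemory(config: Map[String, String]): Long =
      config.get(WorkerMemoryKey).map(_.toLong).getOrElse(DefaultWorkerMemoryBytes)
  }

  // Both nodes inherit the helper rather than defining their own copy.
  class BatchPythonCalc extends CommonPythonCalc {
    def workerMemory(config: Map[String, String]): Long = getPythonWorkerMemory(config)
  }

  class BatchArrowPythonCalc extends CommonPythonCalc {
    def workerMemory(config: Map[String, String]): Long = getPythonWorkerMemory(config)
  }
}
```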

hequn8128 · 2020-03-05T07:45:46Z

...er-blink/src/test/scala/org/apache/flink/table/planner/utils/UserDefinedTableFunctions.scala

 import org.apache.flink.table.functions.{FunctionContext, ScalarFunction, TableFunction}
 import org.apache.flink.types.Row
-


Unnecessary changes.

dianfu · 2020-03-05T12:28:36Z

@hequn8128 Thanks a lot for your great review and suggestions. They make a lot of sense to me, and I have updated the PR accordingly. Regarding support for pandas UDFs in batch mode in the old planner, I'd like to add it in a separate PR, as the operator for this case has not been added yet. What are your thoughts?

hequn8128

@dianfu Thanks a lot for the update. LGTM. Will merge this once the tests pass.

rmetzger added the review=description? label Feb 28, 2020

rmetzger added component=API/Python component=TableSQL/Planner component=TableSQL/LegacyPlanner labels Feb 28, 2020

dianfu force-pushed the FLINK-16337 branch 4 times, most recently from 10b2004 to d90b0eb Compare March 3, 2020 17:23

hequn8128 self-assigned this Mar 4, 2020

dianfu force-pushed the FLINK-16337 branch from d90b0eb to 542c6f8 Compare March 4, 2020 02:03

hequn8128 changed the title ~~[FLINK-16337][python][table-planner-blink] Add support of vectorized Python UDF in blink planner~~ [FLINK-16337][python][table-planner-blink] Add support of vectorized Python UDF in blink planner and old planner Mar 4, 2020

hequn8128 changed the title ~~[FLINK-16337][python][table-planner-blink] Add support of vectorized Python UDF in blink planner and old planner~~ [FLINK-16337][python][table-planner][table-planner-blink] Add support of vectorized Python UDF in blink planner and old planner Mar 4, 2020

hequn8128 reviewed Mar 5, 2020

dianfu added 2 commits March 5, 2020 20:03

[FLINK-16337][python][table-planner-blink] Add support of vectorized …

986feba

…Python UDF in blink planner

[FLINK-16337][python][table-planner] Add support of vectorized Python…

4304158

… UDF in old planner

dianfu force-pushed the FLINK-16337 branch from 542c6f8 to 4304158 Compare March 5, 2020 12:03

hequn8128 approved these changes Mar 6, 2020

hequn8128 pushed a commit to hequn8128/flink that referenced this pull request Mar 6, 2020

[FLINK-16337][python][table-planner] Add support of vectorized Python…

eed03ee

… UDF in old planner This closes apache#11252.

hequn8128 closed this in 3ebc162 Mar 6, 2020

dianfu deleted the FLINK-16337 branch June 10, 2020 02:59
