
[SPARK-42115][SQL] Push down limit through Python UDFs#39842

Closed
kelvinjian-db wants to merge 2 commits into apache:master from kelvinjian-db:SPARK-42115-limit-through-python-udfs

Conversation

@kelvinjian-db
Contributor

What changes were proposed in this pull request?

This PR adds cases in LimitPushDown to push limits through Python UDF evaluation nodes. To allow this, LimitPushDown is invoked in SparkOptimizer after the "Extract Python UDFs" batch, and PushProjectionThroughLimit is added afterwards so that CollectLimit can still be planned.
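The rewrite described above can be sketched on a toy plan tree. This is an illustrative sketch only, not Spark's actual Catalyst implementation: the node names (GlobalLimit, LocalLimit, BatchEvalPython, Scan) mirror Catalyst operators, but the `Plan` class and `push_limit_through_udf` helper are hypothetical stand-ins.

```python
# Illustrative sketch of pushing a limit through an extracted Python UDF
# evaluation node, while keeping the GlobalLimit - LocalLimit pair on top
# so CollectLimit can still be planned. Not Spark's real implementation.
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    child: "Plan | None" = None

    def tree(self) -> str:
        # Render the plan as a top-down arrow chain.
        return self.name if self.child is None else f"{self.name} -> {self.child.tree()}"


def push_limit_through_udf(plan: Plan) -> Plan:
    """Rewrite LocalLimit(BatchEvalPython(child)) so the limit also appears
    below the UDF node, while the outer limit pattern stays at the top."""
    if (plan.name == "LocalLimit"
            and plan.child is not None
            and plan.child.name == "BatchEvalPython"):
        udf = plan.child
        return Plan("LocalLimit", Plan("BatchEvalPython", Plan("LocalLimit", udf.child)))
    if plan.child is not None:
        return Plan(plan.name, push_limit_through_udf(plan.child))
    return plan


before = Plan("GlobalLimit", Plan("LocalLimit", Plan("BatchEvalPython", Plan("Scan"))))
after = push_limit_through_udf(before)
print(after.tree())  # GlobalLimit -> LocalLimit -> BatchEvalPython -> LocalLimit -> Scan
```

After the rewrite, the limit sits below the UDF node, so the scan only produces a bounded number of rows for the UDF to consume, while the outer GlobalLimit - LocalLimit pair remains in place.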

Why are the changes needed?

Currently, LimitPushDown does not push limits through Python UDFs, so expensive Python UDFs can run over potentially large inputs. This PR adds that capability, while ensuring that a GlobalLimit - LocalLimit pattern stays at the top of the plan in order to trigger the CollectLimit code path.
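The cost difference motivating this change can be illustrated with plain Python. This is a made-up sketch: `expensive_udf`, the call counter, and the row count are all hypothetical stand-ins for a real Python UDF running over a scanned table.

```python
# Illustrative sketch: without pushdown, the UDF runs on every row before the
# limit is applied; with pushdown, the limit reaches the scan first and the
# UDF only sees the limited rows. Counters and row counts are made up.
calls = 0


def expensive_udf(x):
    # Stand-in for an expensive Python UDF; counts how often it is invoked.
    global calls
    calls += 1
    return x * 2


rows = range(1_000)

# Without pushdown: evaluate the UDF on all 1000 rows, then take 10.
calls = 0
no_pushdown = [expensive_udf(r) for r in rows][:10]
calls_without = calls

# With pushdown: limit the input to 10 rows first, then evaluate the UDF.
calls = 0
with_pushdown = [expensive_udf(r) for r in list(rows)[:10]]
calls_with = calls

print(calls_without, calls_with)  # 1000 10
```

Both paths produce the same 10 output rows, but the pushed-down version invokes the UDF 100x less often in this sketch.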

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a UT.

@cloud-fan
Contributor

This is actually a regression caused by #37941

@cloud-fan
Contributor

The failed test, test_scalar_iter_udf_close_early (pyspark.sql.tests.pandas.test_pandas_udf_scalar.ScalarPandasUDFTests), is no longer valid. We can remove or ignore it.

@github-actions github-actions bot added the CORE label Feb 1, 2023
@cloud-fan
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in 0fe361e Feb 2, 2023
cloud-fan pushed a commit that referenced this pull request Feb 2, 2023

Closes #39842 from kelvinjian-db/SPARK-42115-limit-through-python-udfs.

Authored-by: Kelvin Jiang <kelvin.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0fe361e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023

3 participants