Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-36748][PYTHON] Introduce the 'compute.isin_limit' option #33982

Closed
wants to merge 1 commit into from

Conversation

xinrong-meng
Copy link
Member

What changes were proposed in this pull request?

Introduce the 'compute.isin_limit' option, with the default value of 80.

Why are the changes needed?

Column.isin(list) doesn't perform well when the given list is large, as https://issues.apache.org/jira/browse/SPARK-33383.
Thus, 'compute.isin_limit' is introduced to constrain the usage of Column.isin(list) in the code base.
If the length of the ‘list’ is above the 'compute.isin_limit', broadcast join is used instead for better performance.

Why is the default value 80?

After reproducing the benchmark mentioned in https://issues.apache.org/jira/browse/SPARK-33383,

length of filtering list isin time /ms broadcast DF time / ms
200 69411 39296
100 43074 40087
80 35592 40350
50 28134 37847

We may notice when the length of the filtering list <= 80, the isin approach performs better than broadcast DF.

Does this PR introduce any user-facing change?

Users may read/write the value of 'compute.isin_limit' as follows

>>> ps.get_option('compute.isin_limit')
80

>>> ps.set_option('compute.isin_limit', 10)
>>> ps.get_option('compute.isin_limit')
10

>>> ps.set_option('compute.isin_limit', -1)
...
ValueError: 'compute.isin_limit' should be greater than or equal to 0.

>>> ps.reset_option('compute.isin_limit')
>>> ps.get_option('compute.isin_limit')
80

How was this patch tested?

Manual test.

@SparkQA
Copy link

SparkQA commented Sep 13, 2021

Test build #143226 has finished for PR 33982 at commit 8d06b21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 13, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47729/

@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants