Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31976][ML][PYSPARK] LinearSVC use MemoryUsage to control the size of block #28974

Closed
wants to merge 1 commit into from

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

in LinearSVC, use maxBlockMemoryInMB instead of blockSize

Why are the changes needed?

According to the performance test in https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is mainly related to the nnz of block.

Does this PR introduce any user-facing change?

yes, blockSize is changed to maxBlockMemoryInMB

How was this patch tested?

added testsuites

init

init

init

init
@SparkQA
Copy link

SparkQA commented Jul 2, 2020

Test build #124865 has finished for PR 28974 at commit 5d7432d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasMaxBlockMemoryInMB extends Params
  • class HasMaxBlockMemoryInMB(Params):

@zhengruifeng
Copy link
Contributor Author

ping @WeichenXu123 @mengxr

@HyukjinKwon
Copy link
Member

cc @huaxingao @srowen too

@HyukjinKwon HyukjinKwon closed this Jul 3, 2020
@HyukjinKwon HyukjinKwon reopened this Jul 3, 2020
@HyukjinKwon
Copy link
Member

Oops, mistake.

@Since("3.1.0")
def setBlockSize(value: Int): this.type = set(blockSize, value)
setDefault(blockSize -> 1)
def setMaxBlockMemoryInMB(value: Int): this.type = set(maxBlockMemoryInMB, value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a scala doc?

@huaxingao
Copy link
Contributor

@zhengruifeng Did you have a chance to do a benchmark test to verify the performance gain?

@zhengruifeng
Copy link
Contributor Author

@huaxingao since this changed is suggested by @mengxr and @WeichenXu123 , I perfer to append the performace tests after they think current design is OK.

In current commit, if the maxBlockMemoryInMB==0 (by default), the old cold path trainOnRows is chosen. If we want to discard this code path in the future, there will be a small behavior change, that the default maxBlockMemoryInMB need to be set with another default value (i.e. 4MB).

Existing performace tests were done against numRows, I think the results can be converted to maxBlockMemoryInMB. (But I agree with anohter test directly on maxBlockMemoryInMB).

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 15, 2020
@github-actions github-actions bot closed this Oct 16, 2020
@zhengruifeng zhengruifeng deleted the use_memoryUsage branch May 28, 2021 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants