[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization #27682

Closed
wants to merge 1 commit into from

@zhengruifeng zhengruifeng commented Feb 24, 2020

What changes were proposed in this pull request?

1. Avoid Iterator.grouped(size: Int), which has to maintain an ArrayBuffer of size entries.
2. Keep the number of partitions in curve computation.

Why are the changes needed?

1. BinaryClassificationMetrics tends to fail with an OOM when grouping = count / numBins is too large, because Iterator.grouped(size: Int) has to maintain an ArrayBuffer with size entries. BinaryClassificationMetrics does not need to hold such a large array.
2. Make the sizes of partitions more even.

With this PR, metrics computation is more stable and a little faster.
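To illustrate the first point, here is a minimal sketch in plain Scala (no Spark) of aggregating consecutive elements without buffering a whole group. The helper `aggregateChunks` and the `(score, count)` pair type are hypothetical names chosen for this example, not the code in this PR; the sketch only assumes that each group needs its first score and a running sum.

```scala
// Hypothetical helper, not the code from this PR: folds each run of `size`
// consecutive (score, count) pairs into one (firstScore, summedCount) pair
// while holding only a single running aggregate in memory.
def aggregateChunks(iter: Iterator[(Double, Long)], size: Long): Iterator[(Double, Long)] = {
  require(size > 0, "size must be positive")
  new Iterator[(Double, Long)] {
    override def hasNext: Boolean = iter.hasNext
    override def next(): (Double, Long) = {
      val (firstScore, firstCount) = iter.next()
      var sum = firstCount
      var taken = 1L
      // Consume up to `size` elements, updating the aggregate in place
      while (taken < size && iter.hasNext) {
        sum += iter.next()._2
        taken += 1
      }
      (firstScore, sum)
    }
  }
}

val pairs = Seq((0.9, 1L), (0.8, 2L), (0.7, 3L), (0.6, 4L), (0.5, 5L))

// Streaming version: constant memory per group
val streamed = aggregateChunks(pairs.iterator, 2L).toList

// Buffered version: Iterator.grouped materializes each group before it is reduced
val buffered = pairs.iterator.grouped(2).map(g => (g.head._1, g.map(_._2).sum)).toList
```

Both versions give the same groups here. The streaming one takes `size` as a `Long` and never holds more than one running aggregate, so its memory use does not grow with the group size.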

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

init
@zhengruifeng

Test code:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import scala.util.Random

val scoreAndLabels = sc.range(0, 40000000L, 1, 4).mapPartitionsWithIndex { case (pid, iter) =>
  // Seed per partition so the generated data is reproducible
  val rng = new Random(pid)
  iter.map { _ => (rng.nextDouble, rng.nextInt(2).toDouble) }
}

scoreAndLabels.count

val metrics = new BinaryClassificationMetrics(scoreAndLabels, 1)
val start = System.currentTimeMillis; val auc = metrics.areaUnderROC; val end = System.currentTimeMillis; end - start

Result:

| Test | This PR (--driver-memory=1G) | This PR (--driver-memory=32G) | Master (--driver-memory=1G) | Master (--driver-memory=32G) |
|---|---|---|---|---|
| Duration (ms) | 343091 | 173030 | OOM | 183258 |

if (grouping < 2) {
  // numBins was more than half of the size; no real point in down-sampling to bins
  logInfo(s"Curve is too small ($countsSize) for $numBins bins to be useful")
  counts
} else {
  if (grouping >= Int.MaxValue) {

Iterator.grouped(size: Int) does not support a grouping larger than Int.MaxValue.
After this change, BinaryClassificationMetrics can handle a grouping larger than Int.MaxValue.
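As a small worked example of the limit described above, the sketch below computes a grouping that exceeds Int.MaxValue. The values of `countsSize` and `numBins` are hypothetical, picked only to push the quotient past the Int range.

```scala
// Hypothetical values chosen for illustration only
val countsSize = 6000000000L          // assumed number of distinct scores
val numBins = 2                       // assumed requested number of bins
val grouping = countsSize / numBins   // Long division: 3000000000L

// The quotient no longer fits the Int parameter of Iterator.grouped(size: Int)
val fitsInInt = grouping <= Int.MaxValue

// A narrowing conversion would silently wrap around to a negative value
val wrapped = grouping.toInt
```

Tracking the group size as a Long, rather than converting it to Int, avoids this wrap-around.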

@SparkQA

SparkQA commented Feb 24, 2020

Test build #118863 has finished for PR 27682 at commit 06bce05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng deleted the grouped_opt branch February 28, 2020 08:56
@zhengruifeng

Merged to master

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27682 from zhengruifeng/grouped_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>