[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization #27682

zhengruifeng · 2020-02-24T10:25:55Z

What changes were proposed in this pull request?

1. Avoid Iterator.grouped(size: Int), which has to maintain an ArrayBuffer of size entries.
2. Keep the number of partitions in the curve computation.

Why are the changes needed?

1. BinaryClassificationMetrics tends to fail with OOM when grouping = count / numBins is too large, because Iterator.grouped(size: Int) has to maintain an ArrayBuffer with size entries. BinaryClassificationMetrics does not need such a large array.
2. Make the sizes of the partitions more even.

With this PR, the metrics computation is more stable and a little faster.
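
To illustrate the first point, here is a minimal sketch of merging every grouping consecutive entries of a sorted iterator while holding only a running aggregate, instead of buffering a whole group as Iterator.grouped does. It is not the code from this PR: the object name ChunkedDownsample, the method downsample, and the use of a plain Long count in place of the per-score label counters are assumptions made for illustration.

```scala
object ChunkedDownsample {
  // Hypothetical helper: merges every `grouping` consecutive (score, count) pairs
  // into one pair. Only a running sum and a counter are kept in memory,
  // so no buffer of `grouping` elements is allocated.
  def downsample(
      iter: Iterator[(Double, Long)],
      grouping: Long): Iterator[(Double, Long)] = {
    require(grouping > 0, "grouping must be positive")
    var score = Double.NaN // score of the most recently consumed pair
    var sum = 0L           // running aggregate for the current chunk
    var cnt = 0L           // pairs consumed in the current chunk (a Long, not an Int)

    val fullChunks = iter.flatMap { case (s, c) =>
      score = s
      sum += c
      cnt += 1
      if (cnt == grouping) {
        // Chunk complete: emit the last score seen together with the chunk's total.
        val out = (score, sum)
        sum = 0L
        cnt = 0L
        Iterator.single(out)
      } else {
        Iterator.empty
      }
    }

    // The right-hand side of ++ is by-name, so it is evaluated only after
    // `fullChunks` is exhausted; it emits the trailing partial chunk, if any.
    fullChunks ++ (if (cnt > 0) Iterator.single((score, sum)) else Iterator.empty)
  }
}
```

For example, ChunkedDownsample.downsample(Iterator((0.9, 1L), (0.8, 2L), (0.7, 3L), (0.6, 4L), (0.5, 5L)), 2L).toList yields List((0.8, 3), (0.6, 7), (0.5, 5)). Because the chunk counter is a Long, the same loop also accepts a grouping larger than Int.MaxValue.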

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

init

zhengruifeng · 2020-02-24T10:28:29Z

testCode:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import scala.util.Random

// 40 million random (score, label) pairs in 4 partitions, one seeded RNG per partition
val scoreAndLabels = sc.range(0, 40000000L, 1, 4).mapPartitionsWithIndex { case (pid, iter) =>
  val rng = new Random(pid)
  iter.map { _ => (rng.nextDouble, rng.nextInt(2).toDouble) }
}

scoreAndLabels.count

// numBins = 1
val metrics = new BinaryClassificationMetrics(scoreAndLabels, 1)
// Elapsed time of areaUnderROC in milliseconds
val start = System.currentTimeMillis; val auc = metrics.areaUnderROC; val end = System.currentTimeMillis; end - start

result:

Test	This PR (--driver-memory=1G)	This PR (--driver-memory=32G)	Master (--driver-memory=1G)	Master (--driver-memory=32G)
Duration (ms)	343091	173030	OOM	183258

zhengruifeng · 2020-02-24T10:40:58Z

mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

        if (grouping < 2) {
          // numBins was more than half of the size; no real point in down-sampling to bins
          logInfo(s"Curve is too small ($countsSize) for $numBins bins to be useful")
          counts
        } else {
-          if (grouping >= Int.MaxValue) {


Iterator.grouped(size: Int) does not support a group size larger than Int.MaxValue.
After this change, BinaryClassificationMetrics can handle a grouping larger than Int.MaxValue.
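
As a worked example of why this limit matters, the sketch below computes the chunk size from the grouping = count / numBins formula in the description using Long arithmetic and checks whether the result could still be passed to an API that takes an Int size. The object name GroupingSize and both helper methods are hypothetical and are not part of Spark.

```scala
object GroupingSize {
  // Hypothetical helper: chunk size from the grouping = count / numBins formula,
  // computed with Long arithmetic so that it cannot overflow an Int.
  def grouping(countsSize: Long, numBins: Int): Long = {
    require(numBins > 0, "numBins must be positive")
    countsSize / numBins
  }

  // True when the chunk size is still representable as an Int,
  // which an Int-sized API such as Iterator.grouped(size: Int) requires.
  def fitsInInt(g: Long): Boolean = g <= Int.MaxValue
}
```

With 40,000,000 distinct scores and numBins = 1, as in the test code above (assuming all random scores are distinct), GroupingSize.grouping(40000000L, 1) is 40000000, which fits in an Int but would mean a 40-million-entry buffer per group. With 5,000,000,000 entries and numBins = 2, the result is 2500000000, which no longer fits in an Int.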

SparkQA · 2020-02-24T11:44:23Z

Test build #118863 has finished for PR 27682 at commit 06bce05.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-02-28T08:56:56Z

Merged to master

### What changes were proposed in this pull request?
1. Avoid `Iterator.grouped(size: Int)`, which has to maintain an ArrayBuffer of `size` entries.
2. Keep the number of partitions in the curve computation.

### Why are the changes needed?
1. `BinaryClassificationMetrics` tends to fail with OOM when `grouping = count / numBins` is too large, because `Iterator.grouped(size: Int)` has to maintain an ArrayBuffer with `size` entries. `BinaryClassificationMetrics` does not need such a large array.
2. Make the sizes of the partitions more even.

With this PR, the metrics computation is more stable and a little faster.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes apache#27682 from zhengruifeng/grouped_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>

zhengruifeng added ML MLLIB labels Feb 24, 2020

dongjoon-hyun removed the MLLIB label Feb 28, 2020

zhengruifeng closed this in 14bb639 Feb 28, 2020

zhengruifeng deleted the grouped_opt branch February 28, 2020 08:56
