
[SPARK-38346][MLLIB] Add cache in MLlib BinaryClassificationMetrics #35678

Closed
wants to merge 1 commit

Conversation

waruto210

What changes were proposed in this pull request?

Cache two RDDs in BinaryClassificationMetrics.scala.

Why are the changes needed?

Two RDDs in BinaryClassificationMetrics.scala are used by different jobs but are not cached when we run example code such as the following (see the full description on JIRA):

  val pipeline: Pipeline = new Pipeline().setStages(Array(assembler, classifier))

  val Array(tr, te) = originalData.randomSplit(Array(0.7, 0.3), 666)
  val model = pipeline.fit(tr)
  val modelDF = model.transform(te)
  val evaluator = new BinaryClassificationEvaluator().setLabelCol(labelCol).setRawPredictionCol("prediction")
  println(evaluator.evaluate(modelDF))

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the MLLIB label Feb 28, 2022
@waruto210
Author

cc @zhengruifeng @imatiach-msft

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Mar 1, 2022

I'm not sure this is worth it. For one, this cached RDD leaks: it is never unpersisted, and the code doesn't check whether the RDD is already cached. And I don't think it helps much, since you pay the caching overhead just to avoid one recomputation for count(), which may not be a big deal perf-wise.
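A leak-free version of this pattern would typically guard the cache and release it once the derived results are materialized. A minimal sketch of that idea (the helper data, the `spark` session, and the thresholds here are assumptions for illustration, not the PR's code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheGuardSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (score, label) pairs standing in for the RDD the PR caches.
    val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.2, 0.0), (0.7, 1.0)))

    // Only persist if the caller has not already done so -- addresses the
    // "doesn't check if it's already cached" concern.
    val wasUncached = scoreAndLabels.getStorageLevel == StorageLevel.NONE
    if (wasUncached) scoreAndLabels.persist(StorageLevel.MEMORY_ONLY)

    val total = scoreAndLabels.count()                       // first job
    val positives = scoreAndLabels.filter(_._2 > 0.5).count() // second job reuses the cache

    // Unpersist when done -- addresses the "leaks, and is not unpersisted" concern.
    if (wasUncached) scoreAndLabels.unpersist()

    println(s"total=$total positives=$positives")
    spark.stop()
  }
}
```

The guard means the library neither double-caches a caller's RDD nor leaves blocks behind after the computation finishes.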

@@ -113,7 +113,7 @@ class BinaryClassificationMetrics @Since("3.0.0") (
        } else {
          iter
        }
-      }
+      }.cache()
Contributor


You can cache the returned RDD yourself for your own purposes.
We should not cache it here.
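The caller-side approach the reviewer suggests could look like the following sketch (the `sc` context and the sample data are assumptions; the API calls are the standard MLlib ones):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.storage.StorageLevel

object CallerSideCacheSketch {
  def run(sc: SparkContext): Unit = {
    // The caller, who knows the access pattern, persists the input RDD
    // before handing it to the library, rather than the library caching
    // internally on the caller's behalf.
    val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.2, 0.0), (0.7, 1.0)))
    scoreAndLabels.persist(StorageLevel.MEMORY_ONLY)

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(metrics.areaUnderROC()) // each metric may launch jobs that reuse the cache
    println(metrics.areaUnderPR())

    scoreAndLabels.unpersist() // the caller also controls the cache's lifetime
  }
}
```

This keeps cache-lifetime decisions in user code, where the trade-off between memory and recomputation is known.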

@@ -186,7 +186,7 @@ class BinaryClassificationMetrics @Since("3.0.0") (
      mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
        c += (labelAndWeight._1, labelAndWeight._2),
      mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    ).sortByKey(ascending = false).cache()
Contributor


This RDD (counts) seems to be used at most twice in this method. I don't think it is worthwhile to cache it.

@waruto210
Author

User-level caching requires developers to understand the program's access patterns themselves; the focus here is caching at the library level (MLlib), as was done in SPARK-16697, SPARK-16880, and SPARK-18356.
@zhengruifeng

@zhengruifeng
Contributor

@waruto210 normally, an ML algorithm will read the cached dataset maxIter times (by default 20~100), so we must cache it there. In this case, I think it is not worthwhile.
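The distinction the reviewer draws can be sketched as follows (illustrative only; `sc`, the data, and the iteration count are assumptions): an iterative algorithm scans the same RDD once per iteration, so one cache() is amortized over many jobs, whereas the metrics path saves at most one recomputation.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object ReuseCountSketch {
  def run(sc: SparkContext): Unit = {
    // Training-style access: the same RDD is scanned maxIter times,
    // so caching pays for itself many times over.
    val data = sc.parallelize(1 to 100000).persist(StorageLevel.MEMORY_ONLY)
    val maxIter = 20
    var acc = 0.0
    for (_ <- 1 to maxIter) {
      acc += data.map(_.toDouble).sum() // each iteration launches a job over `data`
    }
    data.unpersist()
    println(acc)

    // By contrast, the RDDs cached in this PR are consumed by only one or
    // two jobs, so caching buys at most one saved recomputation while still
    // paying the full serialization/memory cost.
  }
}
```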
