
[SPARK-38346][MLLIB] Add cache in MLlib BinaryClassificationMetrics #35678

Closed
wants to merge 1 commit

Conversation

waruto210

What changes were proposed in this pull request?

Cache two RDDs in BinaryClassificationMetrics.scala.

Why are the changes needed?

Two RDDs in BinaryClassificationMetrics.scala are used by different jobs but are not cached when we run example code such as the following (see the full description on JIRA):

  val pipeline: Pipeline = new Pipeline().setStages(Array(assembler, classifier))

  val Array(tr, te) = originalData.randomSplit(Array(0.7, 0.3), 666)
  val model = pipeline.fit(tr)
  val modelDF = model.transform(te)
  val evaluator = new BinaryClassificationEvaluator().setLabelCol(labelCol).setRawPredictionCol("prediction")
  println(evaluator.evaluate(modelDF))

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the MLLIB label Feb 28, 2022
@waruto210
Author

cc @zhengruifeng @imatiach-msft

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Mar 1, 2022

I'm not sure this is worth it. For one, this cached RDD leaks: it is never unpersisted, and the code doesn't check whether the RDD is already cached. And I don't think it helps much, since you pay the caching overhead just to avoid one recomputation for count(), which may not be a big deal perf-wise.
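A leak-free version of this pattern would typically guard the cache and release it once the derived results are materialized. A minimal sketch of that idea (the helper data, the `spark` session, and the thresholds here are assumptions for illustration, not the PR's code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheGuardSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (score, label) pairs standing in for the RDD the PR caches.
    val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.2, 0.0), (0.7, 1.0)))

    // Only persist if the caller has not already done so -- addresses the
    // "doesn't check if it's already cached" concern.
    val wasUncached = scoreAndLabels.getStorageLevel == StorageLevel.NONE
    if (wasUncached) scoreAndLabels.persist(StorageLevel.MEMORY_ONLY)

    val total = scoreAndLabels.count()                       // first job
    val positives = scoreAndLabels.filter(_._2 > 0.5).count() // second job reuses the cache

    // Unpersist when done -- addresses the "leaks, and is not unpersisted" concern.
    if (wasUncached) scoreAndLabels.unpersist()

    println(s"total=$total positives=$positives")
    spark.stop()
  }
}
```

The guard means the library neither double-caches a caller's RDD nor leaves blocks behind after the computation finishes.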

@@ -113,7 +113,7 @@ class BinaryClassificationMetrics @Since("3.0.0") (
        } else {
          iter
        }
-      }
+      }.cache()
Contributor


You can cache the returned RDD yourself for your own purposes.
We should not cache it here.
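The caller-side approach the reviewer suggests could look like the following sketch (the `sc` context and the sample data are assumptions; the API calls are the standard MLlib ones):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.storage.StorageLevel

object CallerSideCacheSketch {
  def run(sc: SparkContext): Unit = {
    // The caller, who knows the access pattern, persists the input RDD
    // before handing it to the library, rather than the library caching
    // internally on the caller's behalf.
    val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.2, 0.0), (0.7, 1.0)))
    scoreAndLabels.persist(StorageLevel.MEMORY_ONLY)

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(metrics.areaUnderROC()) // each metric may launch jobs that reuse the cache
    println(metrics.areaUnderPR())

    scoreAndLabels.unpersist() // the caller also controls the cache's lifetime
  }
}
```

This keeps cache-lifetime decisions in user code, where the trade-off between memory and recomputation is known.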

@@ -186,7 +186,7 @@ class BinaryClassificationMetrics @Since("3.0.0") (
      mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) =>
        c += (labelAndWeight._1, labelAndWeight._2),
      mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
-    ).sortByKey(ascending = false)
+    ).sortByKey(ascending = false).cache()
Contributor


This RDD (counts) seems to be used at most twice in this method. I don't think it is worthwhile to cache it.

@waruto210
Author

User-level caching requires developers to understand the program's access patterns themselves; the focus here is caching at the library level (MLlib), as was done in SPARK-16697, SPARK-16880, and SPARK-18356.
@zhengruifeng

@zhengruifeng
Contributor

@waruto210 normally, an ML algorithm will read the cached dataset maxIter times (by default 20~100), so we must cache it there. In this case, I think it is not worthwhile.
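The distinction the reviewer draws can be sketched as follows (illustrative only; `sc`, the data, and the iteration count are assumptions): an iterative algorithm scans the same RDD once per iteration, so one cache() is amortized over many jobs, whereas the metrics path saves at most one recomputation.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object ReuseCountSketch {
  def run(sc: SparkContext): Unit = {
    // Training-style access: the same RDD is scanned maxIter times,
    // so caching pays for itself many times over.
    val data = sc.parallelize(1 to 100000).persist(StorageLevel.MEMORY_ONLY)
    val maxIter = 20
    var acc = 0.0
    for (_ <- 1 to maxIter) {
      acc += data.map(_.toDouble).sum() // each iteration launches a job over `data`
    }
    data.unpersist()
    println(acc)

    // By contrast, the RDDs cached in this PR are consumed by only one or
    // two jobs, so caching buys at most one saved recomputation while still
    // paying the full serialization/memory cost.
  }
}
```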
