[SPARK-30503][ML] OnlineLDAOptimizer does not handle persistence correctly #27261
zhengruifeng wants to merge 1 commit into apache:master
Conversation
testCode:

```scala
import org.apache.spark.ml.clustering.LDA

val dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
val lda = new LDA().setK(10).setMaxIter(100).setOptimizer("em")
sc.getPersistentRDDs
val start = System.currentTimeMillis; val model = lda.fit(dataset); val end = System.currentTimeMillis; end - start
sc.getPersistentRDDs
sc.getPersistentRDDs.size
sc.getPersistentRDDs.foreach(println)
```

this PR:

```
start: Long = 1579250257523
model: org.apache.spark.ml.clustering.LDAModel = DistributedLDAModel: uid=lda_2a48ae87b788, k=10, numFeatures=11
end: Long = 1579250268529
res1: Long = 11006

scala> sc.getPersistentRDDs.foreach(println)
(2441,EdgeRDD MapPartitionsRDD[2441] at mapPartitions at EdgeRDDImpl.scala:119)
(2438,VertexRDD, VertexRDD ZippedPartitionsRDD2[2438] at zipPartitions at VertexRDD.scala:322)
(29,VertexRDD, VertexRDD ZippedPartitionsRDD2[29] at zipPartitions at VertexRDD.scala:322)
(32,EdgeRDD MapPartitionsRDD[32] at mapPartitions at EdgeRDDImpl.scala:119)
```

master:

```
scala> val start = System.currentTimeMillis; val model = lda.fit(dataset); val end = System.currentTimeMillis; end - start
start: Long = 1579255989886
model: org.apache.spark.ml.clustering.LDAModel = DistributedLDAModel: uid=lda_f600c29d8e0a, k=10, numFeatures=11
end: Long = 1579256001181
res1: Long = 11295

scala> sc.getPersistentRDDs.size
res2: Int = 106
```

There seems to be no performance regression.
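To check for leaked cached RDDs more directly, one can diff `sc.getPersistentRDDs` before and after `fit`. A minimal sketch for the Spark shell, assuming the `lda` and `dataset` values defined in the test code above:

```scala
// Snapshot the cached RDD ids before fitting, then list anything
// that is still cached afterwards (run inside a Spark shell session).
val before = sc.getPersistentRDDs.keySet
val model = lda.fit(dataset)
val leaked = sc.getPersistentRDDs.filter { case (id, _) => !before.contains(id) }
leaked.values.foreach(rdd => println(s"still cached: $rdd"))
```

With the fix applied, only the final model's graph RDDs should remain in the leaked set; on master, the set grows with intermediate `EdgeRDD`s from every iteration.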
In Pregel, the previous graph is handled like:

```scala
var prevG: Graph[VD, ED] = null
var i = 0
while (activeMessages > 0 && i < maxIterations) {
  // Receive the messages and update the vertices.
  prevG = g
  g = g.joinVertices(messages)(vprog)
  graphCheckpointer.update(g)
  ...
  // Unpersist the RDDs hidden by newly-materialized RDDs
  oldMessages.unpersist()
  prevG.unpersistVertices()
  prevG.edges.unpersist()
  // count the iteration
  i += 1
}
```

It is somewhat strange that `persist` and `unpersist` are asymmetric in the checkpointer: `persist` caches the vertices and edges separately, while `unpersist` drops the whole graph:

```scala
override protected def persist(data: Graph[VD, ED]): Unit = {
  if (data.vertices.getStorageLevel == StorageLevel.NONE) {
    /* We need to use cache because persist does not honor the default storage level requested
     * when constructing the graph. Only cache does that.
     */
    data.vertices.cache()
  }
  if (data.edges.getStorageLevel == StorageLevel.NONE) {
    data.edges.cache()
  }
}

override protected def unpersist(data: Graph[VD, ED]): Unit = data.unpersist()
```
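The fix in this PR follows the Pregel pattern above: hold a reference to the previous graph, let the checkpointer persist the new one, materialize it, and only then unpersist the old vertices and edges explicitly, outside the checkpointer. A rough sketch of the pattern (the `nextIterationGraph` step is illustrative, not the exact patch):

```scala
// Pregel-style cleanup in an iterative GraphX loop (sketch):
val prevGraph = graph                // keep a handle on the old graph
graph = nextIterationGraph(graph)    // hypothetical step building the new graph
graphCheckpointer.update(graph)      // persist (and possibly checkpoint) the new graph
graph.vertices.count()               // materialize before dropping the old one
prevGraph.unpersistVertices()        // then drop the old vertices...
prevGraph.edges.unpersist()          // ...and the old edges, outside the checkpointer
```

Materializing before unpersisting matters: if the old graph were dropped first, the new graph's lineage could force recomputation of the just-evicted RDDs.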
Test build #116933 has finished for PR 27261 at commit
ping @srowen
```scala
val docTopicDistributions: VertexRDD[TopicCounts] =
  graph.aggregateMessages[(Boolean, TopicCounts)](sendMsg, mergeMsg)
    .mapValues(_._2)
val prevGraph = graph
```
I buy it, but where is graph persisted again?
@zhengruifeng I'm still kind of curious, but trust your judgment so I will merge it.
Thanks! I am on holiday, so I cannot reply quickly.
The current change follows what Pregel does.
I will keep looking into it, since persistence in GraphX is a little complex.
What changes were proposed in this pull request?
Unpersist the graph outside the checkpointer, as Pregel does.
Why are the changes needed?
As shown in SPARK-30503, intermediate edge RDDs are not unpersisted.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing test suites and a manual test.