[SPARK-16240][ML] Model loading backward compatibility for LDA #14112

GayathriMurali · 2016-07-09T06:19:50Z

What changes were proposed in this pull request?

LDA model loading backward compatibility

How was this patch tested?

Existing unit tests.

GayathriMurali · 2016-07-09T06:22:33Z

@hhbyyh Can you please help review? I am not sure if this is the right way to do it, as topicDistributionCol is not included in the MLWriter or load.

SparkQA · 2016-07-09T06:52:14Z

Test build #62012 has finished for PR 14112 at commit 880c3a1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2016-07-09T21:24:38Z

Thanks @GayathriMurali for the PR. I think we'll need to override the default behavior of getAndSetParams. We also need to invoke both convertVectorColumnsToML and convertMatrixColumnsToML.

I'll send a PR to your repository for reference.

hhbyyh · 2016-07-09T22:18:12Z

PR created. https://github.com/GayathriMurali/spark/pull/1/files

Something else has come up that I need to turn to. Ideally, the getAndSetParams override should live in LDAParams, so that it can be reused by LDA and by the local and distributed LDA models. Please help move it there (perhaps as a new function in LDAParams).

Let me know if you have any questions. I'll revisit ASAP.

hhbyyh · 2016-07-09T22:24:13Z

@jkbradley I'm finding it hard to add a unit test that covers this logic. I'd appreciate your thoughts.

yanboliang · 2016-07-11T05:34:57Z

@hhbyyh I think offline testing should be OK for now, since we don't yet have a unified save/load compatibility test framework. It would be best to get this feature into the next RC (coming soon). Thanks!

GayathriMurali · 2016-07-11T18:05:56Z

@hhbyyh Thanks for helping out. The updated commit includes logic to handle topicDistributionCol. @yanboliang

SparkQA · 2016-07-11T18:56:05Z

Test build #62108 has finished for PR 14112 at commit 2b13262.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-07-11T20:44:59Z

I agree it will be hard to do a proper unit test, but we should definitely test offline. I've made a JIRA for designing backwards compatibility for the long term: https://issues.apache.org/jira/browse/SPARK-15573

I'll take a look at this now

jkbradley · 2016-07-11T20:55:10Z

mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala

-      val gammaShape = data.getAs[Double](4)
+      val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration")
+      val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector,
+      topicConcentration: Double, gammaShape: Double) = MLUtils.convertMatrixColumnsToML(


style: indent 4 spaces

You could also simplify this with Datasets by using (DataFrame).as[Data] if you make LocalLDAModelWriter.Data accessible here (by moving it to sit under object LocalLDAModel).

jkbradley · 2016-07-11T20:55:19Z

What will happen in the future if the model changes, such as storing another val? We will need to update the model loading logic and should follow different code paths for models saved in 1.6 vs. 2.x. I'd recommend we isolate the 1.6 code path by checking metadata.sparkVersion. Even though a single code path works for both 1.6 and 2.0, isolating the special logic for 1.6 might make it easier to update in the future. What do you think?
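The version check suggested above could look roughly like the sketch below. It is plain Scala with no Spark dependency, and every name in it (VersionedLoad, majorMinor, loadPath, and the returned path labels) is hypothetical and chosen for illustration only. It shows just the branching on a saved version string, not the actual LDA loading code.

```scala
// Hypothetical sketch: pick a loading path from the Spark version string
// stored in a model's metadata. None of these names come from Spark itself.
object VersionedLoad {
  // Matches "major.minor", optionally followed by ".patch" or a "-suffix".
  private val VersionPattern = """^(\d+)\.(\d+)(?:[.-].*)?$""".r

  /** Parse a version such as "1.6.2" or "2.0.0-SNAPSHOT" into (major, minor). */
  def majorMinor(sparkVersion: String): (Int, Int) = sparkVersion match {
    case VersionPattern(major, minor) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(s"Unrecognized Spark version: $sparkVersion")
  }

  /** Label of the loading path to follow; stands in for the real loading logic. */
  def loadPath(sparkVersion: String): String = majorMinor(sparkVersion) match {
    case (1, _) => "legacy-1.x" // isolated path, e.g. convert old vector/matrix columns first
    case _      => "current"    // read the data as saved by 2.x
  }
}
```

Keeping the 1.x branch separate means a later change to the saved format only touches the "current" branch, which is the maintenance benefit the comment above argues for.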

jkbradley · 2016-07-11T20:55:42Z

Thanks for the PR!

hhbyyh · 2016-07-11T22:39:43Z

I agree that separating the loading logic here in LDA will make things easier, given its extra complexity. Yet maybe we should not extend the pattern to the earlier loading-compatibility changes (LR, NaiveBayes, feature). I'd appreciate your suggestions.

@GayathriMurali I think we should also cover DistributedLDAModel and, ideally, LDA for loading compatibility, while avoiding code duplication.

GayathriMurali · 2016-07-11T23:01:58Z

+1 for separate loading logic. The recent commit includes separate code paths depending on sparkVersion

SparkQA · 2016-07-11T23:39:04Z

Test build #62125 has finished for PR 14112 at commit 08e5b55.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-07-12T19:20:21Z

mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala

@@ -80,7 +84,8 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   *     - Values should be >= 0
   *     - default = uniformly (1.0 / k), following the implementation from
   *       [[https://github.com/Blei-Lab/onlineldavb]].
-   * @group param
+    *


fix indentation

jkbradley · 2016-07-12T19:26:39Z

@hhbyyh I agree about not changing the other models' loading code unless it becomes necessary. I hope we can design a better long-term solution during 2.1.

We should definitely cover the other LDA classes (and try to avoid duplicate code).

jkbradley · 2016-07-13T20:19:38Z

@GayathriMurali I'd like to accelerate getting this merged. Please let me know if you'd like help with it, especially with adding the fix for other LDA classes. (We can collaborate on the same PR.)

GayathriMurali · 2016-07-13T20:47:25Z

@jkbradley I am sorry, I have been held up with something else. I am looking into ways to add this to DistributedLDAModel. I will have something by EOD today.

SparkQA · 2016-07-14T03:26:49Z

Test build #62293 has finished for PR 14112 at commit 0c2e51c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-14T04:52:00Z

Test build #62295 has finished for PR 14112 at commit 216777f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-14T06:25:53Z

Test build #62298 has finished for PR 14112 at commit 58384d4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-20T00:39:25Z

Test build #62567 has finished for PR 14112 at commit a8bdd7a.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-20T02:04:46Z

Test build #62568 has finished for PR 14112 at commit b16b368.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

GayathriMurali · 2016-07-20T05:22:05Z

@jkbradley Can you please help review this?

GayathriMurali · 2016-07-26T15:58:11Z

@jkbradley Please let me know if I can do anything to help get this merged

GayathriMurali · 2016-08-10T21:52:10Z

@jkbradley Can you please help review this?

jkbradley · 2016-08-29T06:52:48Z

@GayathriMurali Apologies for the long delay! It slipped past the release, and I'm trying to catch up on PRs now. I'll make a final review pass ASAP

SparkQA · 2016-08-29T07:44:45Z

Test build #3234 has finished for PR 14112 at commit b16b368.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-08-29T16:42:50Z

mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala

 import org.apache.spark.mllib.impl.PeriodicCheckpointer
-import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Vector => OldVector,
-  Vectors => OldVectors}
+import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Vector => OldVector, Vectors => OldVectors}


OldMatrices is not used

jkbradley · 2016-08-29T16:43:34Z

Done with review for now; I'll check back for updates. Thanks!

jkbradley · 2016-09-06T20:51:35Z

@GayathriMurali Btw, I know it's been a long time. If you no longer have time to work on this, please say so, and I'd be happy to take it over. Thanks!

GayathriMurali · 2016-09-06T20:58:15Z

@jkbradley I am so sorry I couldn't respond to this on time! I am in a transition process and might not be able to drive this JIRA to completion at this point in time. I am really sorry about that. Thanks!

jkbradley · 2016-09-08T00:42:35Z

@GayathriMurali No problem; you've had to wait quite a while. I'll be happy to take it over.

Could you please close this PR for now?

I'll send a new PR based on your commits (so you'll still be the primary author). Thanks!

GayathriMurali · 2016-09-08T03:17:10Z

@jkbradley Sure! Thanks.

GayathriMurali force-pushed the SPARK-16240 branch from 880c3a1 to 2b13262 Compare July 11, 2016 18:04

jkbradley reviewed Jul 11, 2016

jkbradley reviewed Jul 12, 2016

GayathriMurali added 6 commits July 19, 2016 17:22

Review comments and Seperate coding logic

45fa3b0

Added model loading logic to Distributed LDA and fixed review comments

ac874c5

Bug fix

3b69b88

Bug fix

34f8f3b

Selecting columns directly instead of as[Data]- Unit test issue

22fe753

Added loading logic to LDA and wrapped topicdistribution loading in a…

a8bdd7a

… separate function

GayathriMurali force-pushed the SPARK-16240 branch from 8848f0e to a8bdd7a Compare July 20, 2016 00:33

Style issues

b16b368

jkbradley reviewed Aug 29, 2016

GayathriMurali closed this Sep 8, 2016

jkbradley mentioned this pull request Sep 9, 2016

[SPARK-16240][ML] ML persistence backward compatibility for LDA #15034

Closed

jkbradley mentioned this pull request Sep 22, 2016

[SPARK-16240][ML] ML persistence backward compatibility for LDA - 2.0 backport #15205

Closed
