Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16240][ML] Model loading backward compatibility for LDA #14112

Closed
wants to merge 9 commits into from

Conversation

GayathriMurali
Copy link
Contributor

What changes were proposed in this pull request?

LDA model loading backward compatibility

How was this patch tested?

Existing UT

@GayathriMurali
Copy link
Contributor Author

@hhbyyh Can you please help review? I am not sure if this is the right way to do it, as topicDistributionCol is not included in the MLWriter or load.

@SparkQA
Copy link

SparkQA commented Jul 9, 2016

Test build #62012 has finished for PR 14112 at commit 880c3a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor

hhbyyh commented Jul 9, 2016

Thanks @GayathriMurali for the PR. I think we'll need to override the default behavior of getAndSetParams. Meanwhile, we need to invoke both convertVectorColumnsToML and convertMatrixColumnsToML.

I'll send a PR to your repository for reference.

@hhbyyh
Copy link
Contributor

hhbyyh commented Jul 9, 2016

PR created. https://github.com/GayathriMurali/spark/pull/1/files

I got something else that I need to turn to. Ideally, the overriding getAndSetParams should be in LDAParams, thus it can be reused by LDA and LDA Local/Distributed Model. Please help move it there (perhaps a new Function in LDAParams)

Let me know if you have any question. I'll revisit ASAP.

@hhbyyh
Copy link
Contributor

hhbyyh commented Jul 9, 2016

@jkbradley I find it not easy to add a unit test to cover the logic. Appreciate your thoughts.

@yanboliang
Copy link
Contributor

yanboliang commented Jul 11, 2016

@hhbyyh I think offline test should be OK for now, since we don't have unified save/load compatibility test framework until now. It's better we can get this feature in the next RC (coming soon). Thanks!

@GayathriMurali
Copy link
Contributor Author

@hhbyyh Thanks for helping out. Updated commit includes logic to include topicDistributionCol @yanboliang

@SparkQA
Copy link

SparkQA commented Jul 11, 2016

Test build #62108 has finished for PR 14112 at commit 2b13262.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

I agree it will be hard to do a proper unit test, but we should definitely test offline. I've made a JIRA for designing backwards compatibility for the long term: https://issues.apache.org/jira/browse/SPARK-15573

I'll take a look at this now

val gammaShape = data.getAs[Double](4)
val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration")
val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector,
topicConcentration: Double, gammaShape: Double) = MLUtils.convertMatrixColumnsToML(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: indent 4 spaces

You could also simplify this with Datasets by using (DataFrame).as[Data] if you make LocalLDAModelWriter.Data accessible here (by moving it to sit under object LocalLDAModel).

@jkbradley
Copy link
Member

What will happen in the future if the model changes, such as storing another val? We will need to update the model loading logic and should follow different code paths for models saved in 1.6 vs. 2.x. I'd recommend we isolate the 1.6 code path by checking metadata.sparkVersion. Even though a single code paths works for 1.6 and 2.0, isolating the special logic for 1.6 might make it easier to update in the future. What do you think?

@jkbradley
Copy link
Member

Thank for the PR!

@hhbyyh
Copy link
Contributor

hhbyyh commented Jul 11, 2016

I agree that it will make thing easier to separate the loading logic here in LDA due to its extra complexity. Yet maybe we should not extend the pattern to previous changes about loading compatibility (LR, NaiveBayes, feature). I'd appreciate your suggestions.

@GayathriMurali I think we should cover the DistributedLDAModel and ideally, LDA for loading compatibility and avoid code duplication.

@GayathriMurali
Copy link
Contributor Author

+1 for separate loading logic. The recent commit includes separate code paths depending on sparkVersion

@SparkQA
Copy link

SparkQA commented Jul 11, 2016

Test build #62125 has finished for PR 14112 at commit 08e5b55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -80,7 +84,8 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
* - Values should be >= 0
* - default = uniformly (1.0 / k), following the implementation from
* [[https://github.com/Blei-Lab/onlineldavb]].
* @group param
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix indentation

@jkbradley
Copy link
Member

jkbradley commented Jul 12, 2016

@hhbyyh I agree about not changing the other models' loading code unless it becomes necessary. I hope we can design a better long-term solution during 2.1.

We should definitely cover the other LDA classes (and try to avoid duplicate code).

@jkbradley
Copy link
Member

@GayathriMurali I'd like to accelerate getting this merged. Please let me know if you'd like help with it, especially with adding the fix for other LDA classes. (We can collaborate on the same PR.)

@GayathriMurali
Copy link
Contributor Author

@jkbradley I am sorry, I have been held up with something else. I am looking on ways to add this to DistribtedLDA model. I will have something by EOD today.

@SparkQA
Copy link

SparkQA commented Jul 14, 2016

Test build #62293 has finished for PR 14112 at commit 0c2e51c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 14, 2016

Test build #62295 has finished for PR 14112 at commit 216777f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 14, 2016

Test build #62298 has finished for PR 14112 at commit 58384d4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62567 has finished for PR 14112 at commit a8bdd7a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62568 has finished for PR 14112 at commit b16b368.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GayathriMurali
Copy link
Contributor Author

@jkbradley Can you please help review this?

@GayathriMurali
Copy link
Contributor Author

@jkbradley Please let me know if I can do anything to help get this merged

@GayathriMurali
Copy link
Contributor Author

@jkbradley Can you please help review this?

@jkbradley
Copy link
Member

@GayathriMurali Apologies for the long delay! It slipped past the release, and I'm trying to catch up on PRs now. I'll make a final review pass ASAP

@SparkQA
Copy link

SparkQA commented Aug 29, 2016

Test build #3234 has finished for PR 14112 at commit b16b368.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.mllib.impl.PeriodicCheckpointer
import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Vector => OldVector,
Vectors => OldVectors}
import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Vector => OldVector, Vectors => OldVectors}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OldMatrices is not used

@jkbradley
Copy link
Member

Done with review for now; I'll check back for updates. Thanks!

@jkbradley
Copy link
Member

@GayathriMurali Btw, I know it's been a long time. If you no longer have time to work on this, please say so, and I'd be happy to take it over. Thanks!

@GayathriMurali
Copy link
Contributor Author

GayathriMurali commented Sep 6, 2016

@jkbradley I am so sorry I couldn't respond to this on time! I am in a transition process and might not be able to drive this JIRA to completion at this point in time. I am really sorry about that. Thanks!

@jkbradley
Copy link
Member

@GayathriMurali No problem; you've had to wait quite a while. I'll be happy to take it over.

Could you please close this PR for now?

I'll send a new PR based on your commits (so you'll still be the primary author). Thanks!

@GayathriMurali
Copy link
Contributor Author

@jkbradley Sure! Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants