[SPARK-16240][ML] Model loading backward compatibility for LDA #14112
Conversation
@hhbyyh Can you please help review? I am not sure if this is the right way to do it, as topicDistributionCol is not included in the MLWriter or load.
Test build #62012 has finished for PR 14112 at commit
Thanks @GayathriMurali for the PR. I think we'll need to override the default behavior of getAndSetParams. Meanwhile, we need to invoke both convertVectorColumnsToML and convertMatrixColumnsToML. I'll send a PR to your repository for reference.
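The idea behind overriding getAndSetParams can be sketched without Spark: when a model saved by an older Spark version lacks a newly added param (here topicDistributionCol), the loader should keep the default instead of failing. All names below are illustrative stand-ins, not Spark's actual APIs.

```scala
// Minimal stand-in for the saved model metadata.
case class Metadata(params: Map[String, String])

class LDAParamsStub {
  // Defaults for params that older save formats may not contain.
  private val defaults = Map("topicDistributionCol" -> "topicDistribution")
  private var settings = Map.empty[String, String]

  // Returns the explicitly set value, falling back to the default.
  def getOrDefault(name: String): String =
    settings.getOrElse(name, defaults(name))

  // Set only the params present in the saved metadata; params missing
  // from an older save format silently keep their defaults.
  def setFromMetadata(metadata: Metadata): Unit = {
    settings ++= metadata.params
  }
}
```

A 1.6-era save that never wrote topicDistributionCol would then load with the default column name rather than throwing on the missing key.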
PR created: https://github.com/GayathriMurali/spark/pull/1/files. I've got something else that I need to turn to. Let me know if you have any questions; I'll revisit ASAP.
@jkbradley I find it not easy to add a unit test to cover the logic. I'd appreciate your thoughts.
@hhbyyh I think offline testing should be OK for now, since we don't have a unified save/load compatibility test framework yet. It would be better if we can get this feature into the next RC (coming soon). Thanks!
Force-pushed from 880c3a1 to 2b13262
@hhbyyh Thanks for helping out. Updated commit includes logic to include topicDistributionCol @yanboliang
Test build #62108 has finished for PR 14112 at commit
I agree it will be hard to do a proper unit test, but we should definitely test offline. I've made a JIRA for designing backwards compatibility for the long term: https://issues.apache.org/jira/browse/SPARK-15573. I'll take a look at this now.
val gammaShape = data.getAs[Double](4)
val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration")
val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector,
  topicConcentration: Double, gammaShape: Double) = MLUtils.convertMatrixColumnsToML(
style: indent 4 spaces
You could also simplify this with Datasets by using (DataFrame).as[Data], if you make LocalLDAModelWriter.Data accessible here (by moving it to sit under object LocalLDAModel).
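The reviewer's suggestion can be sketched in plain Scala: read the stored fields back as a typed case class rather than destructuring an untyped Row. With Spark this would be dataFrame.as[Data]; below, a Seq[Any] stands in for a Row, and the field names mirror a subset of the LDA writer's schema (the helper fromRow is hypothetical).

```scala
// Typed view of the stored fields, in storage order.
case class Data(
    vocabSize: Int,
    topicConcentration: Double,
    gammaShape: Double)

// Convert an untyped "row" (positional Any values) into the case class.
def fromRow(row: Seq[Any]): Data = Data(
  vocabSize = row(0).asInstanceOf[Int],
  topicConcentration = row(1).asInstanceOf[Double],
  gammaShape = row(2).asInstanceOf[Double])
```

The win is that field access is by name and checked at compile time, instead of positional getAs calls scattered through the loader.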
What will happen in the future if the model changes, such as storing another val? We will need to update the model loading logic and should follow different code paths for models saved in 1.6 vs. 2.x. I'd recommend we isolate the 1.6 code path by checking the saved sparkVersion.
Thanks for the PR!
I agree that it will make things easier to separate the loading logic here in LDA, due to its extra complexity. Yet maybe we should not extend the pattern to previous changes about loading compatibility (LR, NaiveBayes, feature). I'd appreciate your suggestions. @GayathriMurali I think we should cover DistributedLDAModel and, ideally, LDA for loading compatibility, and avoid code duplication.
+1 for separate loading logic. The recent commit includes separate code paths depending on sparkVersion.
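Branching on the Spark version recorded in the saved metadata can be sketched like this; the regex-based parser and helper names are assumptions for illustration, not Spark's actual implementation.

```scala
// Parse "1.6.2" or "2.0.0-SNAPSHOT" into (major, minor); reject
// anything that doesn't start with "major.minor".
def majorMinorVersion(sparkVersion: String): (Int, Int) = {
  val versionRegex = """^(\d+)\.(\d+)(\..*)?$""".r
  sparkVersion match {
    case versionRegex(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(
        s"Cannot parse Spark version: $sparkVersion")
  }
}

// Models saved by Spark 1.x stored mllib vector/matrix types, so
// they need the legacy conversion path; 2.x saves load directly.
def needsLegacyLoadPath(savedVersion: String): Boolean =
  majorMinorVersion(savedVersion)._1 < 2
```

Isolating the 1.6 branch this way means the legacy path can be deleted wholesale once old save formats are no longer supported.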
Test build #62125 has finished for PR 14112 at commit
@@ -80,7 +84,8 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
 * - Values should be >= 0
 * - default = uniformly (1.0 / k), following the implementation from
 *   [[https://github.com/Blei-Lab/onlineldavb]].
 * @group param
 *
fix indentation
@hhbyyh I agree about not changing the other models' loading code unless it becomes necessary. I hope we can design a better long-term solution during 2.1. We should definitely cover the other LDA classes (and try to avoid duplicate code).
@GayathriMurali I'd like to accelerate getting this merged. Please let me know if you'd like help with it, especially with adding the fix for the other LDA classes. (We can collaborate on the same PR.)
@jkbradley I am sorry, I have been held up with something else. I am looking into ways to add this to DistributedLDAModel. I will have something by EOD today.
Test build #62293 has finished for PR 14112 at commit
Test build #62295 has finished for PR 14112 at commit
Test build #62298 has finished for PR 14112 at commit
… separate function
Force-pushed from 8848f0e to a8bdd7a
Test build #62567 has finished for PR 14112 at commit
Test build #62568 has finished for PR 14112 at commit
@jkbradley Can you please help review this?
@jkbradley Please let me know if I can do anything to help get this merged.
@jkbradley Can you please help review this?
@GayathriMurali Apologies for the long delay! It slipped past the release, and I'm trying to catch up on PRs now. I'll make a final review pass ASAP.
Test build #3234 has finished for PR 14112 at commit
import org.apache.spark.mllib.impl.PeriodicCheckpointer
import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Vector => OldVector,
  Vectors => OldVectors}
import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Vector => OldVector, Vectors => OldVectors}
OldMatrices is not used
Done with review for now; I'll check back for updates. Thanks!
@GayathriMurali Btw, I know it's been a long time. If you no longer have time to work on this, please say so, and I'd be happy to take it over. Thanks!
@jkbradley I am so sorry I couldn't respond to this in time! I am in a transition process and might not be able to drive this JIRA to completion at this point. I am really sorry about that. Thanks!
@GayathriMurali No problem; you've had to wait quite a while. I'll be happy to take it over. Could you please close this PR for now? I'll send a new PR based on your commits (so you'll still be the primary author). Thanks!
@jkbradley Sure! Thanks.
What changes were proposed in this pull request?
LDA model loading backward compatibility
How was this patch tested?
Existing unit tests