[SPARK-5986][MLLib] Add save/load for k-means #4951
Test build #28393 has started for PR 4951 at commit
Test build #28394 has started for PR 4951 at commit
Test build #28393 has finished for PR 4951 at commit
Test PASSed.
Test build #28394 has finished for PR 4951 at commit
Test PASSed.
val metadata = compact(render(
  ("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~ ("k" -> model.k)))
sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
val dataRDD = sc.parallelize(model.clusterCenters).map(wrapper.serialize)
- We don't need wrapper.serialize for vectors.
- We need to store cluster indices. If the centers are saved to more than one partition, we cannot easily load them back in the original order.
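The ordering concern in the comment above can be illustrated with a minimal, Spark-free sketch. It is not the code from this PR: the Cluster record, the ClusterOrderSketch object, and its helper names are hypothetical, Array[Double] stands in for MLlib's Vector, and a shuffle stands in for rows coming back from several partitions in arbitrary order.

```scala
import scala.util.Random

// Hypothetical stand-in for a stored row: a cluster index plus its center.
// Array[Double] replaces MLlib's Vector so the sketch runs without Spark.
final case class Cluster(id: Int, center: Array[Double])

object ClusterOrderSketch {
  // Pair each center with its position before writing the rows out.
  def toRecords(centers: Array[Array[Double]]): Seq[Cluster] =
    centers.zipWithIndex.map { case (center, id) => Cluster(id, center) }.toSeq

  // Rows read back from several partitions may arrive in any order;
  // sorting by the stored id recovers the original sequence.
  def restoreOrder(records: Seq[Cluster]): Array[Array[Double]] =
    records.sortBy(_.id).map(_.center).toArray

  def main(args: Array[String]): Unit = {
    val centers = Array(Array(0.0, 0.0), Array(1.0, 1.0), Array(9.0, 8.0))
    // Shuffling simulates an arbitrary read order (an assumption of this sketch).
    val shuffled = new Random(42).shuffle(toRecords(centers))
    val restored = restoreOrder(shuffled)
    // The restored centers match the originals, element by element.
    assert(restored.map(_.toSeq).toSeq == centers.map(_.toSeq).toSeq)
    println(restored.map(_.mkString("[", ", ", "]")).mkString(", "))
  }
}
```

Without the stored id, nothing in a row says which cluster it belonged to, so a model loaded from a multi-partition save could assign different cluster indices than the model that was saved.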
Test build #28408 has started for PR 4951 at commit
KMeansModel.SaveLoadV1_0.load(sc, path)
}

case class IndexedPoint(id: Int, point: Vector)
How about using Int here to represent indexes? I think there is no need to use Long.
Yes, Int is sufficient. This class should be private. Please also check the scope of other classes.
Make this private. The class name could be changed to case class Cluster(id: Int, center: Vector). IndexedPoint is too general.
Test build #28408 has finished for PR 4951 at commit
Test PASSed.
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg._
Be specific about the imports. You only need Vector.
Test build #28453 has started for PR 4951 at commit
Test build #28453 has finished for PR 4951 at commit
Test PASSed.
LGTM. Merged into master. Thanks!
This PR adds save/load for K-means as described in SPARK-5986. The Python version will be added in another PR.