
[SPARK-6053][MLLIB] support save/load in PySpark's ALS #4811

Closed · wants to merge 4 commits

Conversation

@mengxr (Contributor) commented Feb 27, 2015

A simple wrapper to save/load MatrixFactorizationModel in Python. @jkbradley

@SparkQA commented Feb 27, 2015

Test build #28056 has started for PR 4811 at commit 282ec8d.

  • This patch merges cleanly.

@SparkQA commented Feb 27, 2015

Test build #28056 has finished for PR 4811 at commit 282ec8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MatrixFactorizationModel(JavaModelWrapper, Saveable, JavaLoader):
    • class Saveable(object):
    • class Loader(object):
    • class JavaLoader(Loader):
    • java_class = ".".join([java_package, cls.__name__])
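The classes listed above follow a small wrapper pattern: `Saveable`/`Loader` define the Python-side interface, and `JavaLoader` derives the backing Java class name from the Python class name using the `".".join(...)` expression shown. A minimal pure-Python sketch of the idea (the hard-coded `java_package` string and the stub class bodies are illustrative; the real implementations delegate to the JVM via Py4J):

```python
class Saveable(object):
    """Mixin for models that can be saved to a path."""
    def save(self, sc, path):
        raise NotImplementedError


class Loader(object):
    """Mixin for models that can be loaded from a path."""
    @classmethod
    def load(cls, sc, path):
        raise NotImplementedError


class JavaLoader(Loader):
    """Derives the fully qualified Java class name from the Python class name."""
    @classmethod
    def _java_class(cls):
        # Illustrative package; the real code computes this from the module path.
        java_package = "org.apache.spark.mllib.recommendation"
        return ".".join([java_package, cls.__name__])


class MatrixFactorizationModel(JavaLoader, Saveable):
    pass


print(MatrixFactorizationModel._java_class())
# org.apache.spark.mllib.recommendation.MatrixFactorizationModel
```

Because `cls.__name__` is resolved on the subclass, each Python model class maps to its same-named Scala counterpart without any per-class boilerplate.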

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28056/

@jkbradley (Member)

I messed up by not passing sc to save/load. Is this patch going into 1.3? If not, I'll submit a separate patch fixing the documentation (which will conflict a little).

@mengxr (Contributor, Author) commented Feb 27, 2015

If we have a couple of days before RC2, this would be nice to have. We use the same API as in Scala/Java, and there is no real implementation in this PR. Having save/load would benefit many users.

@SparkQA commented Feb 27, 2015

Test build #28088 has started for PR 4811 at commit 06140a4.

  • This patch merges cleanly.

@@ -220,6 +218,10 @@ predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).reduce(lambda x, y: x + y) / ratesAndPreds.count()
print("Mean Squared Error = " + str(MSE))

# Save and load model
model.save("myModelPath")
@jkbradley (Member) reviewed:

Add sc to the save call. Also import MatrixFactorizationModel.
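The MSE lambdas in the diff hunk above can be sanity-checked in plain Python without a SparkContext; this sketch mirrors the join and squared-error logic with small in-memory collections (the data values are illustrative):

```python
# Pure-Python sanity check of the MSE computation from the doc example.
# Plain lists/dicts stand in for the ratings and predictions RDDs.

ratings = [(1, 1, 4.0), (1, 2, 2.0)]        # (user, product, actual rating)
predictions = {(1, 1): 3.0, (1, 2): 2.0}    # (user, product) -> predicted rating

# Join actual and predicted ratings on the (user, product) key,
# as ratings.map(...).join(predictions) does on RDDs.
rates_and_preds = [((r[0], r[1]), (r[2], predictions[(r[0], r[1])]))
                   for r in ratings]

# Same squared-error reduction as the doc snippet's map/reduce.
mse = sum((actual - pred) ** 2
          for _, (actual, pred) in rates_and_preds) / len(rates_and_preds)
print(mse)  # 0.5
```

The `sum(...) / len(...)` here corresponds to the RDD version's `reduce(lambda x, y: x + y) / ratesAndPreds.count()`.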

@SparkQA commented Feb 27, 2015

Test build #28088 has finished for PR 4811 at commit 06140a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MatrixFactorizationModel(JavaModelWrapper, Saveable, JavaLoader):
    • class Saveable(object):
    • class Loader(object):
    • class JavaLoader(Loader):
    • java_class = ".".join([java_package, cls.__name__])

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28088/

@jkbradley (Member)

LGTM. I ran into a bug running the example, but it seems to be coming from elsewhere. It happens when calling train, and not every time, only sometimes:

java.lang.ClassCastException: scala.None$ cannot be cast to java.util.List
    at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:745)
    at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:82)
    at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:340)
    at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:335)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.Accumulators$.add(Accumulators.scala:335)
    at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:892)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:974)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1398)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1362)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/02/27 14:41:29 ERROR DAGScheduler: Failed to update accumulators for ResultTask(279, 4)

I'll make a separate JIRA for it.

@jkbradley (Member)

Made JIRA: https://issues.apache.org/jira/browse/SPARK-6071

@SparkQA commented Mar 1, 2015

Test build #28151 has started for PR 4811 at commit f135dac.

  • This patch merges cleanly.

@mengxr mengxr changed the title [SPARK-5991][MLLIB] support save/load in PySpark's ALS [SPARK-6053][MLLIB] support save/load in PySpark's ALS Mar 1, 2015
@SparkQA commented Mar 1, 2015

Test build #28151 has finished for PR 4811 at commit f135dac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MatrixFactorizationModel(JavaModelWrapper, Saveable, JavaLoader):
    • class Saveable(object):
    • class Loader(object):
    • class JavaLoader(Loader):
    • java_class = ".".join([java_package, cls.__name__])

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28151/

@jkbradley (Member)

LGTM

asfgit pushed a commit that referenced this pull request Mar 2, 2015
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:

f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS

(cherry picked from commit aedbbaa)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr (Contributor, Author) commented Mar 2, 2015

Merged into master and branch-1.3. Thanks!

@asfgit asfgit closed this in aedbbaa Mar 2, 2015