[SPARK-5596] [mllib] ML model import/export for GLMs, NaiveBayes #4233

jkbradley · 2015-01-27T22:25:19Z

This is a PR for Parquet-based model import/export. Please see the design doc on the JIRA.

Note: This includes only a subset of regression and classification models:

* NaiveBayes, SVM, LogisticRegression
* LinearRegression, RidgeRegression, Lasso

Follow-up PRs will cover other models.

Sketch of current contents:

* New traits: Saveable, Loader
* Implementations for some algorithms
* Also: Added a LogisticRegressionModel.getThreshold method (so that the unit test can check the threshold)
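
To make the two traits listed above concrete, here is a minimal sketch of the pattern: the model instance implements Saveable, and its companion object implements Loader so that callers can write Model.load(path). This is an illustration under stated assumptions, not the code in this PR. The PR's traits take a SparkContext and persist to Parquet; the sketch drops the SparkContext and writes a local text file instead, and ToyLinearModel is a hypothetical class that does not exist in MLlib.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Simplified stand-ins for the PR's traits (assumption: no SparkContext,
// local text file instead of Parquet).
trait Saveable {
  def save(path: String): Unit
}

trait Loader[M <: Saveable] {
  def load(path: String): M
}

// Hypothetical toy model, not an MLlib class.
final case class ToyLinearModel(weights: Vector[Double], intercept: Double)
    extends Saveable {

  // Writes "intercept,w0,w1,..." as a single line of text.
  override def save(path: String): Unit = {
    val line = (intercept +: weights).mkString(",")
    Files.write(Paths.get(path), line.getBytes(StandardCharsets.UTF_8))
  }
}

// The companion object plays the Loader role.
object ToyLinearModel extends Loader[ToyLinearModel] {

  // Reads the line back: first value is the intercept, the rest are weights.
  override def load(path: String): ToyLinearModel = {
    val text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
    val values = text.trim.split(",").map(_.toDouble).toVector
    ToyLinearModel(values.tail, values.head)
  }
}
```

Putting load on the companion object keeps the call site symmetric: model.save(path) on an instance, ToyLinearModel.load(path) on the type.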

CC: @mengxr @selvinsource

SparkQA · 2015-01-27T22:27:46Z

Test build #26190 has started for PR 4233 at commit 14711b7.

This patch merges cleanly.

SparkQA · 2015-01-27T23:37:01Z

Test build #26190 has finished for PR 4233 at commit 14711b7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait Exportable
- trait Importable[Model <: Exportable]

AmplabJenkins · 2015-01-27T23:37:05Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26190/
Test PASSed.

SparkQA · 2015-01-30T20:57:41Z

Test build #26427 has started for PR 4233 at commit 365314f.

This patch merges cleanly.

SparkQA · 2015-01-30T22:06:09Z

Test build #26427 has finished for PR 4233 at commit 365314f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait Exportable
- trait Importable[Model <: Exportable]

AmplabJenkins · 2015-01-30T22:06:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26427/
Test PASSed.

mengxr · 2015-01-31T09:16:58Z

mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala

+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._


import sqlContext._ is no longer needed due to a recent API change; implicit val sqlContext = new SQLContext(sc) should work.

I tried that, and it fails to compile. (I tried removing sqlContext from save(), as well as making it an implicit val without the import. Neither worked.) Is there another import I need?

[error] /Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala:89: type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.classification.LogisticRegressionModel.Metadata]
[error] required: org.apache.spark.sql.DataFrame
[error] Error occurred in an application involving default arguments.
[error] val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
[error] ^
[error] /Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala:94: type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.classification.LogisticRegressionModel.Data]
[error] required: org.apache.spark.sql.DataFrame
[error] Error occurred in an application involving default arguments.
[error] val dataRDD: DataFrame = sc.parallelize(Seq(data))
[error] ^
[error] two errors found
[error] (mllib/compile:compile) Compilation failed

It may not be an issue though, with the other change you suggested below. I'll see.

SparkQA · 2015-01-31T10:02:51Z

Test build #26459 has started for PR 4233 at commit 638fa81.

This patch merges cleanly.

SparkQA · 2015-01-31T10:54:54Z

Test build #26459 has finished for PR 4233 at commit 638fa81.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait Exportable
- trait Importable[Model <: Exportable]

AmplabJenkins · 2015-01-31T10:54:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26459/
Test FAILed.

SparkQA · 2015-02-03T02:17:37Z

Test build #26579 has started for PR 4233 at commit a71e555.

This patch does not merge cleanly.

SparkQA · 2015-02-03T02:19:25Z

Test build #26579 has finished for PR 4233 at commit a71e555.

This patch fails Scala style tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
- case class Data(weights: Vector, intercept: Double)
- trait Exportable
- trait Importable[M <: Exportable]
- protected abstract class Importer
- * @return (class name, version, metadata)

AmplabJenkins · 2015-02-03T02:19:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26579/
Test FAILed.

SparkQA · 2015-02-03T02:22:31Z

Test build #26580 has started for PR 4233 at commit 444341a.

This patch does not merge cleanly.

SparkQA · 2015-02-03T02:23:27Z

Test build #26580 has finished for PR 4233 at commit 444341a.

This patch fails Scala style tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
- case class Data(weights: Vector, intercept: Double)
- trait Exportable
- trait Importable[M <: Exportable]
- protected abstract class Importer
- * @return (class name, version, metadata)

AmplabJenkins · 2015-02-03T02:23:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26580/
Test FAILed.


mengxr · 2015-02-04T18:23:34Z

mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala

+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val (loadedClassName, version, metadata) = Loader.loadMetadata(sc, path)
+    val classNameV1_0 = "org.apache.spark.mllib.classification.LogisticRegressionModel"


Maybe we should add a comment here explaining why we use a literal string for the class name.
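
One plausible reason for the literal, sketched below: the string is part of the on-disk format, so it should stay fixed even if the Scala class is later renamed or moved. The sketch is a simplified, hypothetical dispatch on (class name, version); ToyModel, ToyModelLoader, and the package name are illustrative assumptions and are not taken from the PR.

```scala
// Hypothetical, simplified loader dispatch; not MLlib code.
final case class ToyModel(weights: Seq[Double], intercept: Double)

object ToyModelLoader {
  // Written out as a literal on purpose: this is the exact string that
  // format version 1.0 stored in the metadata. If it were derived from
  // classOf[ToyModel].getName, renaming or moving the class would change
  // the expected value and models saved earlier would no longer match.
  private val classNameV1_0 = "org.example.ml.ToyModel"

  def load(
      loadedClassName: String,
      version: String,
      weights: Seq[Double],
      intercept: Double): ToyModel =
    (loadedClassName, version) match {
      // Only the (class name, version) pair written by format 1.0 is accepted.
      case (className, "1.0") if className == classNameV1_0 =>
        ToyModel(weights, intercept)
      case _ =>
        throw new IllegalArgumentException(
          s"Cannot load: found class ${loadedClassName}, version ${version}; " +
            s"expected (${classNameV1_0}, 1.0)")
    }
}
```

A later format version would add its own literal and its own case, leaving the 1.0 branch untouched.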

jkbradley · 2015-02-04T21:44:31Z

Btw, I'm also changing paths to use org.apache.hadoop.fs.Path to create URIs (instead of my hard-coded path separators).
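
As a rough sketch of that change, org.apache.hadoop.fs.Path can join a parent location and a child name without string concatenation. The helper names and the "metadata" / "data" sub-directory names below are assumptions made for illustration, and the snippet needs hadoop-common on the classpath.

```scala
import org.apache.hadoop.fs.Path

// Hypothetical helpers; the sub-directory names are assumptions.
object ModelPaths {
  // Path(parent, child) resolves the child against the parent, so no
  // separator is hard-coded and URI schemes such as hdfs:// are kept.
  def metadataPath(path: String): String = new Path(path, "metadata").toUri.toString

  def dataPath(path: String): String = new Path(path, "data").toUri.toString
}
```

The same helpers then work for local paths and for fully qualified URIs, which is harder to get right with manual separators.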


SparkQA · 2015-02-04T23:27:32Z

Test build #26784 has started for PR 4233 at commit 12d9059.

This patch merges cleanly.

jkbradley · 2015-02-04T23:32:54Z

Hopefully that took care of everything!

SparkQA · 2015-02-04T23:37:57Z

Test build #26786 has started for PR 4233 at commit 87c4eb8.

This patch merges cleanly.

SparkQA · 2015-02-05T00:30:48Z

Test build #26786 has finished for PR 4233 at commit 87c4eb8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- * @param modelClass String name for model class (used for error messages)
- case class Data(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]])
- s" but class priors vector pi had $
- s" but class conditionals array theta had $
- case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
- * @param modelClass String name for model class (used for error messages)
- * @param modelClass String name for model class (used for error messages)
- case class Data(weights: Vector, intercept: Double)
- * @param modelClass String name for model class (used for error messages)
- trait Saveable
- trait Loader[M <: Saveable]
- * @return (class name, version, metadata)

AmplabJenkins · 2015-02-05T00:30:52Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26786/
Test FAILed.

SparkQA · 2015-02-05T00:39:48Z

Test build #26784 has finished for PR 4233 at commit 12d9059.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- * @param modelClass String name for model class (used for error messages)
- case class Data(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]])
- s" but class priors vector pi had $
- s" but class conditionals array theta had $
- case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
- * @param modelClass String name for model class (used for error messages)
- * @param modelClass String name for model class (used for error messages)
- case class Data(weights: Vector, intercept: Double)
- * @param modelClass String name for model class (used for error messages)
- trait Saveable
- trait Loader[M <: Saveable]
- * @return (class name, version, metadata)

AmplabJenkins · 2015-02-05T00:39:52Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26784/
Test PASSed.

jkbradley · 2015-02-05T00:50:42Z

The first failure was from Kafka tests

This is a PR for Parquet-based model import/export. Please see the design doc on [the JIRA](https://issues.apache.org/jira/browse/SPARK-4587).

Note: This includes only a subset of regression and classification models:
* NaiveBayes, SVM, LogisticRegression
* LinearRegression, RidgeRegression, Lasso

Follow-up PRs will cover other models.

Sketch of current contents:
* New traits: Saveable, Loader
* Implementations for some algorithms
* Also: Added LogisticRegressionModel.getThreshold method (so that unit test could check the threshold)

CC: mengxr selvinsource

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4233 from jkbradley/ml-import-export and squashes the following commits:

87c4eb8 [Joseph K. Bradley] small cleanups
12d9059 [Joseph K. Bradley] Many cleanups after code review. Major changes: Storing numFeatures, numClasses in model metadata. Improvements to unit tests
b4ee064 [Joseph K. Bradley] Reorganized save/load for regression and classification. Renamed concepts to Saveable, Loader
a34aef5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into ml-import-export
ee99228 [Joseph K. Bradley] scala style fix
79675d5 [Joseph K. Bradley] cleanups in LogisticRegression after rebasing after multinomial PR
d1e5882 [Joseph K. Bradley] organized imports
2935963 [Joseph K. Bradley] Added save/load and tests for most classification and regression models
c495dba [Joseph K. Bradley] made version for model import/export local to each model
1496852 [Joseph K. Bradley] Added save/load for NaiveBayes
8d46386 [Joseph K. Bradley] Added save/load to NaiveBayes
1577d70 [Joseph K. Bradley] fixed issues after rebasing on master (DataFrame patch)
64914a3 [Joseph K. Bradley] added getThreshold to SVMModel
b1fc5ec [Joseph K. Bradley] small cleanups
418ba1b [Joseph K. Bradley] Added save, load to mllib.classification.LogisticRegressionModel, plus test suite

(cherry picked from commit 975bcef)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

mengxr · 2015-02-05T06:48:20Z

LGTM. Merged into master and branch-1.3. Thanks!

following #4233. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4422 from mengxr/SPARK-5598 and squashes the following commits:

a059394 [Xiangrui Meng] SaveLoad not extending Loader
14b7ea6 [Xiangrui Meng] address comments
f487cb2 [Xiangrui Meng] add unit tests
62fc43c [Xiangrui Meng] implement save/load for MFM

(cherry picked from commit 5c299c5)
Signed-off-by: Xiangrui Meng <meng@databricks.com>


jkbradley force-pushed the ml-import-export branch from 14711b7 to 365314f on January 30, 2015 20:53

mengxr reviewed Jan 31, 2015

jkbradley added 10 commits February 2, 2015 18:54

418ba1b Added save, load to mllib.classification.LogisticRegressionModel, plus test suite
b1fc5ec small cleanups
64914a3 added getThreshold to SVMModel
1577d70 fixed issues after rebasing on master (DataFrame patch)
8d46386 Added save/load to NaiveBayes
1496852 Added save/load for NaiveBayes
c495dba made version for model import/export local to each model
2935963 Added save/load and tests for most classification and regression models
d1e5882 organized imports
79675d5 cleanups in LogisticRegression after rebasing after multinomial PR

mengxr reviewed Feb 4, 2015

12d9059 Many cleanups after code review. Major changes: Storing numFeatures, numClasses in model metadata. Improvements to unit tests
87c4eb8 small cleanups

asfgit closed this in 975bcef Feb 5, 2015

mengxr mentioned this pull request Feb 6, 2015

[SPARK-5598][MLLIB] model save/load for ALS #4422

Closed
