[SPARK-5596] [mllib] ML model import/export for GLMs, NaiveBayes #4233

Closed
jkbradley wants to merge 15 commits from jkbradley/ml-import-export

jkbradley
Member

This is a PR for Parquet-based model import/export. Please see the design doc on the JIRA.

Note: This includes only a subset of regression and classification models:

  • NaiveBayes, SVM, LogisticRegression
  • LinearRegression, RidgeRegression, Lasso

Follow-up PRs will cover other models.

Sketch of current contents:

  • New traits: Saveable, Loader
  • Implementations for some algorithms
  • Also: Added LogisticRegressionModel.getThreshold method (so that unit test could check the threshold)

CC: @mengxr @selvinsource
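
To make the two traits concrete, here is a minimal, dependency-free sketch of the save/load pairing. It is an illustration only, not the code in this PR: the real methods take a `SparkContext` and write Parquet, whereas this sketch drops the context and writes one line of text. `ToyLinearModel` and its file layout are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Assumed shapes of the traits, minus the SparkContext parameter used in the PR.
trait Saveable {
  def save(path: String): Unit
}

trait Loader[M <: Saveable] {
  def load(path: String): M
}

// Hypothetical toy model. `Vector` here is scala.collection.immutable.Vector,
// not the MLlib Vector type.
final case class ToyLinearModel(weights: Vector[Double], intercept: Double)
    extends Saveable {
  // Writes "intercept,w0,w1,..." as a single line of text.
  override def save(path: String): Unit = {
    val line = (intercept +: weights).mkString(",")
    Files.write(Paths.get(path), line.getBytes(StandardCharsets.UTF_8))
  }
}

// The companion object plays the Loader role, so callers write ToyLinearModel.load(path).
object ToyLinearModel extends Loader[ToyLinearModel] {
  override def load(path: String): ToyLinearModel = {
    val text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
    val values = text.trim.split(",").map(_.toDouble)
    ToyLinearModel(values.tail.toVector, values.head)
  }
}
```

Putting `load` on the companion object keeps it callable without an existing model instance, while `save` stays an instance method on the model itself.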

@SparkQA

SparkQA commented Jan 27, 2015

Test build #26190 has started for PR 4233 at commit 14711b7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 27, 2015

Test build #26190 has finished for PR 4233 at commit 14711b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Exportable
    • trait Importable[Model <: Exportable]

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26190/
Test PASSed.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26427 has started for PR 4233 at commit 365314f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26427 has finished for PR 4233 at commit 365314f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Exportable
    • trait Importable[Model <: Exportable]

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26427/
Test PASSed.


override def save(sc: SparkContext, path: String): Unit = {
  val sqlContext = new SQLContext(sc)
  import sqlContext._
Contributor

import sqlContext._ is no longer needed because of a recent API change; implicit val sqlContext = new SQLContext(sc) should work instead.

Member Author

I tried that, and it fails to compile. (I tried removing sqlContext from save(), as well as making it an implicit val without the import. Neither worked.) Is there another import I need?

[error] /Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala:89: type mismatch;
[error]  found   : org.apache.spark.rdd.RDD[org.apache.spark.mllib.classification.LogisticRegressionModel.Metadata]
[error]  required: org.apache.spark.sql.DataFrame
[error] Error occurred in an application involving default arguments.
[error]     val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
[error]                                                ^
[error] /Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala:94: type mismatch;
[error]  found   : org.apache.spark.rdd.RDD[org.apache.spark.mllib.classification.LogisticRegressionModel.Data]
[error]  required: org.apache.spark.sql.DataFrame
[error] Error occurred in an application involving default arguments.
[error]     val dataRDD: DataFrame = sc.parallelize(Seq(data))
[error]                                            ^
[error] two errors found
[error] (mllib/compile:compile) Compilation failed

It may not be an issue though, with the other change you suggested below. I'll see.
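
One possible explanation for the errors above, offered as an assumption rather than a diagnosis of the Spark code: in Scala, an implicit conversion defined as a member of an object is only applied when that member is in scope, for example through a wildcard import. Declaring the object itself as an `implicit val` does not bring its member conversions into scope. The sketch below shows this with hypothetical stand-in types (`FakeRDD`, `FakeFrame`, `FakeContext`), none of which are Spark classes.

```scala
import scala.language.implicitConversions

// Hypothetical stand-ins for RDD, DataFrame and SQLContext.
final case class FakeRDD[A](rows: Seq[A])
final case class FakeFrame(rows: Seq[String])

class FakeContext {
  // Conversion defined as an instance member, so it is only visible to the
  // compiler after something like `import ctx._`.
  implicit def rddToFrame[A](rdd: FakeRDD[A]): FakeFrame =
    FakeFrame(rdd.rows.map(_.toString))
}

object ImplicitScopeDemo {
  def convertWithImport(rdd: FakeRDD[Int]): FakeFrame = {
    val ctx = new FakeContext
    import ctx._                 // brings rddToFrame into lexical scope
    val frame: FakeFrame = rdd   // compiles: the conversion is applied here
    frame
  }

  def convertExplicitly(rdd: FakeRDD[Int]): FakeFrame = {
    implicit val ctx: FakeContext = new FakeContext
    // `val frame: FakeFrame = rdd` would be a type mismatch here: the implicit
    // value is a FakeContext, not a conversion, and its members are not in scope.
    ctx.rddToFrame(rdd)
  }
}
```

If this is what is happening, the options are to keep a wildcard import or to call the conversion explicitly, as `convertExplicitly` does.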

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26459 has started for PR 4233 at commit 638fa81.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26459 has finished for PR 4233 at commit 638fa81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Exportable
    • trait Importable[Model <: Exportable]

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26459/
Test FAILed.

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26579 has started for PR 4233 at commit a71e555.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26579 has finished for PR 4233 at commit a71e555.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
    • case class Data(weights: Vector, intercept: Double)
    • trait Exportable
    • trait Importable[M <: Exportable]
    • protected abstract class Importer
    • * @return (class name, version, metadata)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26579/
Test FAILed.

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26580 has started for PR 4233 at commit 444341a.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Feb 3, 2015

Test build #26580 has finished for PR 4233 at commit 444341a.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
    • case class Data(weights: Vector, intercept: Double)
    • trait Exportable
    • trait Importable[M <: Exportable]
    • protected abstract class Importer
    • * @return (class name, version, metadata)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26580/
Test FAILed.


override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
  val (loadedClassName, version, metadata) = Loader.loadMetadata(sc, path)
  val classNameV1_0 = "org.apache.spark.mllib.classification.LogisticRegressionModel"
Contributor

Maybe we should add a comment here explaining why we use a literal string for the class name.
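
A plausible reason for the literal, stated here as an assumption: the string written into saved metadata has to keep matching even if the class is later renamed or moved, so it should not be derived from the current class at load time. The sketch below shows a version-dispatching check built on that idea. `Metadata`, `VersionedLoadSketch`, and `chooseFormat` are hypothetical names, and the check is simplified compared with the loading code in this PR.

```scala
// Hypothetical metadata record holding the two fields the loader dispatches on.
final case class Metadata(className: String, version: String)

object VersionedLoadSketch {
  // Literal on purpose: using classOf[...].getName instead would silently change
  // the expected value if the class were renamed, and old saved models would
  // then fail to load.
  val classNameV1_0 = "org.apache.spark.mllib.classification.LogisticRegressionModel"

  // Returns the format to use, or an error message for unknown combinations.
  def chooseFormat(meta: Metadata): Either[String, String] =
    (meta.className, meta.version) match {
      case (`classNameV1_0`, "1.0") => Right("v1.0")
      case (name, version) =>
        Left(s"Cannot load: unsupported class $name, version $version")
    }
}
```

New format versions would add further `case` branches, each with its own literal class name and version string.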

@jkbradley
Member Author

Btw, I'm also changing paths to use org.apache.hadoop.fs.Path to create URIs (instead of my hard-coded path separators).

…numClasses in model metadata. Improvements to unit tests
@SparkQA

SparkQA commented Feb 4, 2015

Test build #26784 has started for PR 4233 at commit 12d9059.

  • This patch merges cleanly.

@jkbradley
Member Author

Hopefully that took care of everything!

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26786 has started for PR 4233 at commit 87c4eb8.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26786 has finished for PR 4233 at commit 87c4eb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * @param modelClass String name for model class (used for error messages)
    • case class Data(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]])
    • s" but class priors vector pi had $
    • s" but class conditionals array theta had $
    • case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
    • * @param modelClass String name for model class (used for error messages)
    • * @param modelClass String name for model class (used for error messages)
    • case class Data(weights: Vector, intercept: Double)
    • * @param modelClass String name for model class (used for error messages)
    • trait Saveable
    • trait Loader[M <: Saveable]
    • * @return (class name, version, metadata)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26786/
Test FAILed.

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26784 has finished for PR 4233 at commit 12d9059.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * @param modelClass String name for model class (used for error messages)
    • case class Data(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]])
    • s" but class priors vector pi had $
    • s" but class conditionals array theta had $
    • case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
    • * @param modelClass String name for model class (used for error messages)
    • * @param modelClass String name for model class (used for error messages)
    • case class Data(weights: Vector, intercept: Double)
    • * @param modelClass String name for model class (used for error messages)
    • trait Saveable
    • trait Loader[M <: Saveable]
    • * @return (class name, version, metadata)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26784/
Test PASSed.

@jkbradley
Member Author

The first failure was from the Kafka tests.

@asfgit asfgit closed this in 975bcef Feb 5, 2015
asfgit pushed a commit that referenced this pull request Feb 5, 2015
This is a PR for Parquet-based model import/export.  Please see the design doc on [the JIRA](https://issues.apache.org/jira/browse/SPARK-4587).

Note: This includes only a subset of regression and classification models:
* NaiveBayes, SVM, LogisticRegression
* LinearRegression, RidgeRegression, Lasso

Follow-up PRs will cover other models.

Sketch of current contents:
* New traits: Saveable, Loader
* Implementations for some algorithms
* Also: Added LogisticRegressionModel.getThreshold method (so that unit test could check the threshold)

CC: mengxr  selvinsource

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4233 from jkbradley/ml-import-export and squashes the following commits:

87c4eb8 [Joseph K. Bradley] small cleanups
12d9059 [Joseph K. Bradley] Many cleanups after code review.  Major changes: Storing numFeatures, numClasses in model metadata. Improvements to unit tests
b4ee064 [Joseph K. Bradley] Reorganized save/load for regression and classification.  Renamed concepts to Saveable, Loader
a34aef5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into ml-import-export
ee99228 [Joseph K. Bradley] scala style fix
79675d5 [Joseph K. Bradley] cleanups in LogisticRegression after rebasing after multinomial PR
d1e5882 [Joseph K. Bradley] organized imports
2935963 [Joseph K. Bradley] Added save/load and tests for most classification and regression models
c495dba [Joseph K. Bradley] made version for model import/export local to each model
1496852 [Joseph K. Bradley] Added save/load for NaiveBayes
8d46386 [Joseph K. Bradley] Added save/load to NaiveBayes
1577d70 [Joseph K. Bradley] fixed issues after rebasing on master (DataFrame patch)
64914a3 [Joseph K. Bradley] added getThreshold to SVMModel
b1fc5ec [Joseph K. Bradley] small cleanups
418ba1b [Joseph K. Bradley] Added save, load to mllib.classification.LogisticRegressionModel, plus test suite

(cherry picked from commit 975bcef)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Contributor

mengxr commented Feb 5, 2015

LGTM. Merged into master and branch-1.3. Thanks!

asfgit pushed a commit that referenced this pull request Feb 9, 2015
following #4233. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4422 from mengxr/SPARK-5598 and squashes the following commits:

a059394 [Xiangrui Meng] SaveLoad not extending Loader
14b7ea6 [Xiangrui Meng] address comments
f487cb2 [Xiangrui Meng] add unit tests
62fc43c [Xiangrui Meng] implement save/load for MFM

(cherry picked from commit 5c299c5)
Signed-off-by: Xiangrui Meng <meng@databricks.com>