Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7380][MLLIB] pipeline stages should be copyable in Python #6088

Closed
wants to merge 20 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented May 12, 2015

This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

  1. Rename paramMap and defaultParamMap to _paramMap and _defaultParamMap, respectively.
  2. Accept a list of param maps in fit.
  3. Use parent uid and name to identify param.

@jkbradley

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #32521 has started for PR 6088 at commit 9630eae.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #32521 has finished for PR 6088 at commit 9630eae.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32521/
Test FAILed.

@mengxr mengxr changed the title [SPARK-7380][MLLIB] pipeline stages should be copyable in Python [WIP][SPARK-7380][MLLIB] pipeline stages should be copyable in Python May 12, 2015
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #32527 has started for PR 6088 at commit a163413.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #32527 has finished for PR 6088 at commit a163413.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • decisionTreeCode = '''class DecisionTreeParams(Params):
    • class DecisionTreeParams(Params):
    • class LinearRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class LinearRegressionModel(JavaModel):
    • class TreeRegressorParams(object):
    • class RandomForestParams(object):
    • class GBTParams(object):
    • class DecisionTreeRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
    • class DecisionTreeRegressionModel(JavaModel):
    • class RandomForestRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasSeed,
    • class RandomForestRegressionModel(JavaModel):
    • class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class GBTRegressionModel(JavaModel):
    • s"FileOutputCommitter or its subclass is expected, but got a $
    • trait FSBasedRelationProvider
    • abstract class OutputWriter

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32527/
Test PASSed.

:param extra: Extra parameters to copy to the new instance
:return: Copy of this instance
"""
newCV = Params.copy(self, extra)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that, while I believe this works, it will make a deep copy of the Estimator and Evaluator, which is a little inefficient. Not that important

@jkbradley
Copy link
Member

It looks good, although I agree it may need more testing to make sure UIDs behave as expected.

One other thought: Do you think it would be worthwhile to try to use the same uid for a Java*Wrapper instance in Python and the Scala instance it is wrapping?

@mengxr
Copy link
Contributor Author

mengxr commented May 13, 2015

Yes, that certainly helps. Let's get #6019 in first, which should be ready for review.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 14, 2015

Test build #32698 has started for PR 6088 at commit 46cb6ed.

@SparkQA
Copy link

SparkQA commented May 14, 2015

Test build #32698 has finished for PR 6088 at commit 46cb6ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Pipeline(override val uid: String) extends Estimator[PipelineModel]
    • final class DecisionTreeClassifier(override val uid: String)
    • final class GBTClassifier(override val uid: String)
    • class LogisticRegression(override val uid: String)
    • final class OneVsRest(override val uid: String)
    • final class RandomForestClassifier(override val uid: String)
    • class BinaryClassificationEvaluator(override val uid: String)
    • final class Binarizer(override val uid: String)
    • final class Bucketizer(override val uid: String)
    • class ElementwiseProduct(override val uid: String)
    • class HashingTF(override val uid: String) extends UnaryTransformer[Iterable[_], Vector, HashingTF]
    • final class IDF(override val uid: String) extends Estimator[IDFModel] with IDFBase
    • class Normalizer(override val uid: String) extends UnaryTransformer[Vector, Vector, Normalizer]
    • class OneHotEncoder(override val uid: String)
    • class PolynomialExpansion(override val uid: String)
    • class StandardScaler(override val uid: String) extends Estimator[StandardScalerModel]
    • class StringIndexer(override val uid: String) extends Estimator[StringIndexerModel]
    • class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[String], Tokenizer]
    • class RegexTokenizer(override val uid: String)
    • class VectorAssembler(override val uid: String)
    • class VectorIndexer(override val uid: String) extends Estimator[VectorIndexerModel]
    • final class Word2Vec(override val uid: String) extends Estimator[Word2VecModel] with Word2VecBase
    • class Param[T](val parent: String, val name: String, val doc: String, val isValid: T => Boolean)
    • class DoubleParam(parent: String, name: String, doc: String, isValid: Double => Boolean)
    • class IntParam(parent: String, name: String, doc: String, isValid: Int => Boolean)
    • class FloatParam(parent: String, name: String, doc: String, isValid: Float => Boolean)
    • class LongParam(parent: String, name: String, doc: String, isValid: Long => Boolean)
    • class BooleanParam(parent: String, name: String, doc: String) // No need for isValid
    • class ALS(override val uid: String) extends Estimator[ALSModel] with ALSParams
    • final class DecisionTreeRegressor(override val uid: String)
    • final class GBTRegressor(override val uid: String)
    • class LinearRegression(override val uid: String)
    • final class RandomForestRegressor(override val uid: String)
    • class CrossValidator(override val uid: String) extends Estimator[CrossValidatorModel]
    • trait Identifiable

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@mengxr mengxr changed the title [WIP][SPARK-7380][MLLIB] pipeline stages should be copyable in Python [SPARK-7380][MLLIB] pipeline stages should be copyable in Python May 17, 2015
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32942/
Test FAILed.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 17, 2015

Test build #32945 has started for PR 6088 at commit 611c719.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@mengxr
Copy link
Contributor Author

mengxr commented May 18, 2015

test this please

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 18, 2015

Test build #32966 has started for PR 6088 at commit 413c463.

@SparkQA
Copy link

SparkQA commented May 18, 2015

Test build #32966 has finished for PR 6088 at commit 413c463.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexerModel(JavaModel):

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32966/
Test PASSed.

@jkbradley
Copy link
Member

LGTM

asfgit pushed a commit that referenced this pull request May 18, 2015
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master

(cherry picked from commit 9c7e802)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Copy link
Contributor Author

mengxr commented May 18, 2015

Merged into master and branch-1.4.

@asfgit asfgit closed this in 9c7e802 May 18, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants