[SPARK-4586][MLLIB] Python API for ML pipeline and parameters #4151

Closed
mengxr wants to merge 34 commits

@mengxr
Contributor

mengxr commented Jan 22, 2015

This PR adds a Python API for ML pipelines and parameters. The design doc can be found on the JIRA page. The PR includes transformers and an estimator to demonstrate the simple text classification example.

TODO:

  • handle parameters in LRModel
  • unit tests
  • missing some docs

CC: @davies @jkbradley
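
For readers unfamiliar with the pattern, here is a minimal pure-Python sketch of the estimator/transformer/pipeline idea this PR exposes. It does not use Spark: the `Tokenize`, `MajorityLabel`, and `MajorityLabelModel` classes and the list-of-dicts dataset are hypothetical stand-ins chosen for illustration, and the real classes in the PR may differ in structure and behavior.

```python
class Transformer(object):
    """Maps a dataset to a new dataset."""

    def transform(self, dataset):
        raise NotImplementedError


class Estimator(object):
    """Learns from a dataset and returns a fitted Transformer (a model)."""

    def fit(self, dataset):
        raise NotImplementedError


class Tokenize(Transformer):
    """Hypothetical tokenizer: adds a 'words' field split from 'text'."""

    def transform(self, dataset):
        return [dict(row, words=row["text"].lower().split()) for row in dataset]


class MajorityLabelModel(Transformer):
    """Fitted model that always predicts one stored label."""

    def __init__(self, label):
        self.label = label

    def transform(self, dataset):
        return [dict(row, prediction=self.label) for row in dataset]


class MajorityLabel(Estimator):
    """Toy estimator: learns the most frequent training label."""

    def fit(self, dataset):
        counts = {}
        for row in dataset:
            counts[row["label"]] = counts.get(row["label"], 0) + 1
        return MajorityLabelModel(max(counts, key=counts.get))


class PipelineModel(Transformer):
    """Applies a list of already-fitted transformers in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, dataset):
        for stage in self.stages:
            dataset = stage.transform(dataset)
        return dataset


class Pipeline(Estimator):
    """Fits estimator stages in order, feeding each the transformed data."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, dataset):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(dataset)
            dataset = stage.transform(dataset)
            fitted.append(stage)
        return PipelineModel(fitted)


# Made-up training data, for illustration only.
train = [
    {"text": "a b c d e spark", "label": 1.0},
    {"text": "b d", "label": 0.0},
    {"text": "spark f g h", "label": 1.0},
]
model = Pipeline([Tokenize(), MajorityLabel()]).fit(train)
predictions = model.transform([{"text": "spark i j k"}])
```

In this sketch, `Pipeline.fit` returns a `PipelineModel` that replays the fitted stages on new data, so the same tokenization is applied at training and prediction time.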

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25935 has started for PR 4151 at commit 56de571.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25935 has finished for PR 4151 at commit 56de571.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PipelineStage(Params):
    • class Estimator(PipelineStage):
    • class Transformer(PipelineStage):
    • class Pipeline(Estimator):
    • class PipelineModel(Transformer):
    • class JavaWrapper(object):
    • class JavaEstimator(Estimator, JavaWrapper):
    • class JavaTransformer(Transformer, JavaWrapper):
    • class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class LogisticRegressionModel(Transformer):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class HashingTF(JavaTransformer, HasInputCol, HasOutputCol):
    • class Param(object):
    • class Params(Identifiable):
    • return """class Has%s(Params):
    • class HasMaxIter(Params):
    • class HasRegParam(Params):
    • class HasFeaturesCol(Params):
    • class HasLabelCol(Params):
    • class HasPredictionCol(Params):
    • class HasInputCol(Params):
    • class HasOutputCol(Params):
    • class Identifiable(object):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25935/

optimize pipeline.fit impl
@SparkQA

SparkQA commented Jan 26, 2015

Test build #26113 has started for PR 4151 at commit d3e8dbe.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26114 has started for PR 4151 at commit 05e3e40.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26113 has finished for PR 4151 at commit d3e8dbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PipelineStage(Params):
    • class Estimator(PipelineStage):
    • class Transformer(PipelineStage):
    • class Pipeline(Estimator):
    • class PipelineModel(Transformer):
    • class JavaWrapper(Params):
    • class JavaEstimator(Estimator, JavaWrapper):
    • class JavaTransformer(Transformer, JavaWrapper):
    • class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class LogisticRegressionModel(Transformer):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class HashingTF(JavaTransformer, HasInputCol, HasOutputCol):
    • class Param(object):
    • class Params(Identifiable):
    • return """class Has%s(Params):
    • class HasMaxIter(Params):
    • class HasRegParam(Params):
    • class HasFeaturesCol(Params):
    • class HasLabelCol(Params):
    • class HasPredictionCol(Params):
    • class HasInputCol(Params):
    • class HasOutputCol(Params):
    • class Identifiable(object):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26113/

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26114 has finished for PR 4151 at commit 05e3e40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PipelineStage(Params):
    • class Estimator(PipelineStage):
    • class Transformer(PipelineStage):
    • class Pipeline(Estimator):
    • class PipelineModel(Transformer):
    • class JavaWrapper(Params):
    • class JavaEstimator(Estimator, JavaWrapper):
    • class JavaTransformer(Transformer, JavaWrapper):
    • class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class LogisticRegressionModel(Transformer):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class HashingTF(JavaTransformer, HasInputCol, HasOutputCol):
    • class Param(object):
    • class Params(Identifiable):
    • return """class Has%s(Params):
    • class HasMaxIter(Params):
    • class HasRegParam(Params):
    • class HasFeaturesCol(Params):
    • class HasLabelCol(Params):
    • class HasPredictionCol(Params):
    • class HasInputCol(Params):
    • class HasOutputCol(Params):
    • class Identifiable(object):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26114/

def __init__(self):
    #: A unique id for the object. The default implementation
    #: concatenates the class name, "-", and 8 random hex chars.
    self.uid = type(self).__name__ + "-" + uuid.uuid4().hex[:8]
Contributor

id(obj) is the memory address of obj; it should be used as part of the uid.

Contributor Author

The memory address could be reused, so it may not be unique.

Contributor

For all live objects, the id (memory address) will be unique, but the random one (uuid) may not.

Contributor Author

What if one object gets dereferenced and a new object gets created? I tried to keep the random part of the id short while maintaining a tiny collision rate. With 8 hex chars, each id is drawn with equal probability from more than 4 billion (16^8) values.

Contributor

That makes sense, thanks!
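
The trade-off in this thread can be sketched in a few lines. This is an illustration rather than the merged code: `make_uid` is a hypothetical helper that mirrors the expression in the diff, `Tokenizer` here is a placeholder class, and `collision_probability` uses the standard birthday-bound approximation.

```python
import math
import uuid


def make_uid(obj):
    """Class name, "-", and 8 random hex chars (hypothetical helper)."""
    return type(obj).__name__ + "-" + uuid.uuid4().hex[:8]


class Tokenizer(object):
    """Placeholder class used only to show the uid format."""

    def __init__(self):
        self.uid = make_uid(self)


def collision_probability(n, space=16 ** 8):
    """Birthday-bound approximation of the chance that n random ids collide."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))


uid = Tokenizer().uid                    # e.g. "Tokenizer-" plus 8 hex chars
p_1000 = collision_probability(1000)     # on the order of 1e-4
```

By contrast, Python only guarantees that `id(obj)` is unique among objects that are alive at the same time, so an id can recur once an object has been garbage collected.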

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26202 has finished for PR 4151 at commit fc59a02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26202/

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26238 has started for PR 4151 at commit dd1256b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26238 has finished for PR 4151 at commit dd1256b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BlockMatrix(
    • class PipelineStage(Params):
    • class Estimator(PipelineStage):
    • class Transformer(PipelineStage):
    • class Model(Transformer):
    • class Pipeline(Estimator):
    • class PipelineModel(Model):
    • class JavaWrapper(Params):
    • class JavaEstimator(Estimator, JavaWrapper):
    • class JavaTransformer(Transformer, JavaWrapper):
    • class JavaModel(JavaTransformer):
    • class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class LogisticRegressionModel(JavaModel):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures):
    • class Param(object):
    • class Params(Identifiable):
    • template = '''class Has$Name(Params):
    • class HasMaxIter(Params):
    • class HasRegParam(Params):
    • class HasFeaturesCol(Params):
    • class HasLabelCol(Params):
    • class HasPredictionCol(Params):
    • class HasInputCol(Params):
    • class HasOutputCol(Params):
    • class HasNumFeatures(Params):
    • class Identifiable(object):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26238/

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26244 has started for PR 4151 at commit 44c2405.

  • This patch merges cleanly.

@mengxr
Contributor Author

mengxr commented Jan 28, 2015

@davies I merged your changes and moved Identifiable to util.py. Could you make a final pass?

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26246 has started for PR 4151 at commit edbd6fe.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26244 has finished for PR 4151 at commit 44c2405.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BlockMatrix(
    • class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class LogisticRegressionModel(JavaModel):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures):
    • class Identifiable(object):
    • class Param(object):
    • class Params(Identifiable):
    • template = '''class Has$Name(Params):
    • class HasMaxIter(Params):
    • class HasRegParam(Params):
    • class HasFeaturesCol(Params):
    • class HasLabelCol(Params):
    • class HasPredictionCol(Params):
    • class HasInputCol(Params):
    • class HasOutputCol(Params):
    • class HasNumFeatures(Params):
    • class Estimator(Params):
    • class Transformer(Params):
    • class Pipeline(Estimator):
    • class PipelineModel(Transformer):
    • class JavaWrapper(Params):
    • class JavaEstimator(Estimator, JavaWrapper):
    • class JavaTransformer(Transformer, JavaWrapper):
    • class JavaModel(JavaTransformer):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26244/

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26246 has finished for PR 4151 at commit edbd6fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class LogisticRegressionModel(JavaModel):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures):
    • class Param(object):
    • class Params(Identifiable):
    • template = '''class Has$Name(Params):
    • class HasMaxIter(Params):
    • class HasRegParam(Params):
    • class HasFeaturesCol(Params):
    • class HasLabelCol(Params):
    • class HasPredictionCol(Params):
    • class HasInputCol(Params):
    • class HasOutputCol(Params):
    • class HasNumFeatures(Params):
    • class Estimator(Params):
    • class Transformer(Params):
    • class Pipeline(Estimator):
    • class PipelineModel(Transformer):
    • class Identifiable(object):
    • class JavaWrapper(Params):
    • class JavaEstimator(Estimator, JavaWrapper):
    • class JavaTransformer(Transformer, JavaWrapper):
    • class JavaModel(JavaTransformer):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26246/


def __init__(self):
    super(HasMaxIter, self).__init__()
    #: param for max number of iterations
Contributor

Minor: Does this appear in the generated doc? I did see that

Contributor Author

No, but this is the official Sphinx way to document instance attributes.

Contributor

But in pyspark.ml.rst, we disable the doc for members:

:undoc-members:
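
For context, Sphinx autodoc treats a `#:` comment placed directly before an attribute assignment as that attribute's documentation. A minimal sketch of the convention under discussion follows; the class body is simplified (no `Params` base class) and the default value of 100 is an assumption for illustration, not taken from the PR.

```python
class HasMaxIter(object):
    """Mixin for a max-iterations parameter (simplified sketch)."""

    def __init__(self):
        #: param for max number of iterations
        self.maxIter = 100  # hypothetical default, for illustration only


params = HasMaxIter()
```

To the Python interpreter the `#:` line is an ordinary comment, so it has no runtime effect and is not exposed through `__doc__`. Whether it shows up in the generated pages depends on the autodoc options set in the `.rst` file.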

@davies
Contributor

davies commented Jan 28, 2015

@mengxr After removing inherit_doc from pipeline.py, I think it's OK to merge.

@mengxr
Contributor Author

mengxr commented Jan 28, 2015

Great! Waiting for Jenkins ...

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26266 has started for PR 4151 at commit 415268e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26266 has finished for PR 4151 at commit 415268e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26266/

@mengxr
Contributor Author

mengxr commented Jan 29, 2015

Merged into master.

@asfgit asfgit closed this in e80dc1c Jan 29, 2015