Skip to content

Conversation

@jkbradley
Copy link
Member

What changes were proposed in this pull request?

Adds support for saving and loading nested ML Pipelines from Python. Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.

Also:

  • Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
  • Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java

How was this patch tested?

Added new unit test for nested Pipelines. Abstracted validity check into a helper method for the 2 unit tests.

@SparkQA
Copy link

SparkQA commented Mar 21, 2016

Test build #53711 has finished for PR 11866 at commit 1c070e7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ElementwiseProduct(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable,
    • class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures, JavaMLReadable,
    • class PolynomialExpansion(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable,
    • class JavaMLWritable(MLWritable):
    • class JavaMLReader(MLReader):

@SparkQA
Copy link

SparkQA commented Mar 21, 2016

Test build #53712 has finished for PR 11866 at commit 6d74b97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def load(cls, path):
"""Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
return cls.read().load(path)
def _transfer_stage_from_java(cls, java_stage):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a little confusing to call it java_stage while the expected input is a pipeline. Shall we rename it to _from_java_pipeline and _transfer_stage_to_java to _to_java_pipeline? If we have to use the same method name, I would recommend _from_java and _to_java instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to use the same method name, so I'll switch to _from/to_java.

@mengxr
Copy link
Contributor

mengxr commented Mar 21, 2016

Not part of this PR, in PySpark DataFrame.write is a property instead of a method. Shall we follow the same convention?

@jkbradley
Copy link
Member Author

Making MLWritable.write a property sounds good. I'll create a JIRA for it.

@SparkQA
Copy link

SparkQA commented Mar 22, 2016

Test build #53785 has finished for PR 11866 at commit be879e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Mar 22, 2016

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 7e3423b Mar 22, 2016
@jkbradley jkbradley deleted the nested-pipeline-io branch March 22, 2016 22:20
@inherit_doc
class JavaMLWriter(MLWriter):
"""
(Private) Specialization of :py:class:`MLWriter` for :py:class:`JavaWrapper` types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkbradley @mengxr Excuse me, why we need to add a Private annotation here and in other places?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants