[SPARK-9654][ML][PYSPARK] Add IndexToString to PySpark #7976

Closed
Commits: 25 (the diff below shows changes from 9 commits)
1dc4579  SPARK-9654 Add string indexer inverse in PySpark (holdenk, Aug 5, 2015)
0445fcc  doc fix (holdenk, Aug 5, 2015)
af2f869  Don't changge the base class init, fill out the doctest for the invert. (holdenk, Aug 6, 2015)
510bce5  remove extra blank line (holdenk, Aug 6, 2015)
c6da160  get rid of unicude specificers in doctest (holdenk, Aug 6, 2015)
9f5af3a  Deal with the difference between 2.X and 3.X with the output by just … (holdenk, Aug 6, 2015)
7b3b5ca  Use the standard constructor method for the StringIndexInverse (holdenk, Aug 12, 2015)
244e083  Update for index to string changeover (holdenk, Aug 14, 2015)
e95b61b  Move the property on to the model, remove references to old class name (holdenk, Aug 14, 2015)
b1795aa  CR feedback (holdenk, Aug 18, 2015)
ab90dcd  switch link to pydoc style (holdenk, Aug 18, 2015)
43ae197  Merge in master (holdenk, Aug 18, 2015)
c400e16  remove getLabels function (CR feedback) now that labels is public. (holdenk, Aug 18, 2015)
64de5c9  Some CR feedback (holdenk, Aug 28, 2015)
2316a90  Use None instead of empty array (holdenk, Aug 28, 2015)
15390bb  merge in master (holdenk, Sep 1, 2015)
28afcfd  Some CR feedback (note: still sorting our one of the params) (holdenk, Sep 1, 2015)
f19445d  Change description text (holdenk, Sep 1, 2015)
51ae7ee  merge in master (holdenk, Sep 1, 2015)
ed0ca91  moar merge (holdenk, Sep 1, 2015)
8fca8b3  punctuation (holdenk, Sep 1, 2015)
3ef852f  remove unrelated change (holdenk, Sep 1, 2015)
41d0d27  long line fix (holdenk, Sep 1, 2015)
cd5d418  Add missing period (holdenk, Sep 9, 2015)
4f56b17  Fix link to transformer class, copy scala doc for labels (holdenk, Sep 9, 2015)
@@ -122,6 +122,11 @@ class StringIndexerModel (
map
}

/**
* The labels used for applying this transformation
*/
private[spark] def getLabels() = labels
Review comment (Member):
No longer needed, since `labels` is a public val.


/** @group setParam */
def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
setDefault(handleInvalid, "error")
python/pyspark/ml/feature.py (63 changes: 61 additions & 2 deletions)
@@ -26,8 +26,8 @@
from pyspark.mllib.common import inherit_doc
from pyspark.mllib.linalg import _convert_to_vector

-__all__ = ['Binarizer', 'Bucketizer', 'HashingTF', 'IDF', 'IDFModel', 'NGram', 'Normalizer',
-'OneHotEncoder', 'PolynomialExpansion', 'RegexTokenizer', 'StandardScaler',
+__all__ = ['Binarizer', 'Bucketizer', 'HashingTF', 'IDF', 'IDFModel', 'IndexToString', 'NGram',
+'Normalizer', 'OneHotEncoder', 'PolynomialExpansion', 'RegexTokenizer', 'StandardScaler',
'StandardScalerModel', 'StringIndexer', 'StringIndexerModel', 'Tokenizer',
'VectorAssembler', 'VectorIndexer', 'Word2Vec', 'Word2VecModel', 'PCA',
'PCAModel', 'RFormula', 'RFormulaModel']
@@ -731,6 +731,11 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol):
>>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
... key=lambda x: x[0])
[(0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)]
>>> inverter = IndexToString(inputCol="indexed", outputCol="label2", labels=model.labels())
>>> itd = inverter.transform(td)
>>> sorted(set([(i[0], str(i[1])) for i in itd.select(itd.id, itd.label2).collect()]),
... key=lambda x: x[0])
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')]
"""

@keyword_only
@@ -760,6 +765,60 @@ class StringIndexerModel(JavaModel):
"""
Model fitted by StringIndexer.
"""
@property
def labels(self):
Review comment (Member):
copy Scala doc: "Ordered list of labels, corresponding to indices to be assigned"

return self._java_obj.labels
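
For readers who want to see what an "ordered list of labels, corresponding to indices to be assigned" looks like, here is a minimal pure-Python sketch. It is not the Spark implementation: it assumes labels are ordered by descending frequency, which is consistent with the StringIndexer doctest in this diff, and `fit_labels` and `to_indices` are hypothetical helper names.

```python
from collections import Counter


def fit_labels(values):
    """Return the distinct values ordered by descending frequency (hypothetical helper)."""
    counts = Counter(values)
    # How ties between equally frequent values are broken is left unspecified here.
    return [label for label, _ in counts.most_common()]


def to_indices(values, labels):
    """Map each value to its position in labels, as a float like the indexed column."""
    position = {label: float(i) for i, label in enumerate(labels)}
    return [position[v] for v in values]


# Same label column as the doctest rows with ids 0..5.
data = ["a", "b", "c", "a", "a", "c"]
labels = fit_labels(data)
indexed = to_indices(data, labels)
```

With this input, "a" (three occurrences) gets index 0.0, "c" (two) gets 1.0, and "b" (one) gets 2.0, matching the indexed values shown in the doctest.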


class IndexToString(JavaTransformer, HasInputCol, HasOutputCol):
Review comment (Member):
use inherit_doc tag

"""
Convert provided indexes back to strings using either the metadata on the input column
Review comment (Member):
Please copy updated Scala doc here.

Also, please mark as Experimental (as in, e.g., RFormula)

or user provided labels.
Note: By default we keep the original columns during StringIndexerModel's transformation,
so the inverse should only be used on new columns such as predicted labels.
"""
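
The behaviour this docstring describes can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark code: `index_to_string` and its `metadata_labels` argument are hypothetical stand-ins for the transformer and for the labels stored in the input column's metadata, and an empty or missing `labels` value is treated as "fall back to the metadata", as the param description in this diff says.

```python
def index_to_string(indices, labels=None, metadata_labels=None):
    """Map numeric indices back to labels (hypothetical stand-in for the transformer)."""
    # User-supplied labels win; an empty or missing list falls back to column metadata.
    chosen = labels if labels else metadata_labels
    if not chosen:
        raise ValueError("no labels supplied and no label metadata available")
    return [chosen[int(i)] for i in indices]


# Indexed values and label order taken from the StringIndexer doctest in this diff.
indexed = [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
restored = index_to_string(indexed, labels=["a", "c", "b"])
from_metadata = index_to_string(indexed, labels=[], metadata_labels=["a", "c", "b"])
```

Both calls return the original strings in row order, which is what the `label2` column in the doctest contains.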
# a placeholder to make the labels show up in generated doc
Review comment (Member):
insert newline above

labels = Param(Params._dummy(), "lables",
Review comment (Member):
typo: "lables"

"Optional labels to be provided by the user, if not supplied column " +
Review comment (Contributor):
nit: "if not supplied" -> "if equal to the empty array then"

Review comment (Contributor Author):
That makes less sense: if it isn't supplied, then it uses the column metadata.

"metadata is read for labels. The default value is an empty array, " +
"but the empty array is ignored and column metadata used instead.")
Review comment (Contributor):
After the above nit, this becomes redundant IMO. Since this is a matter of taste, feel free to keep or cut.


@keyword_only
def __init__(self, inputCol=None, outputCol=None, labels=[]):
Review comment (Contributor):
We should avoid mutable values such as [] as defaults in Python; let's use None.

Review comment (Contributor Author):
My concern is that the underlying Scala code uses an empty array as the default.
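
For context on the mutable-default point raised in this thread, here is a small standalone Python sketch. The function names are hypothetical and unrelated to the PySpark API; the sketch only shows why a `[]` default is shared across calls and how a `None` sentinel avoids that.

```python
def append_shared(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits bucket mutates the same object.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # None acts as a sentinel; a new list is built on each call that omits bucket.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first_shared = list(append_shared("a"))
second_shared = list(append_shared("b"))  # still contains "a" from the first call
first_fresh = append_fresh("a")
second_fresh = append_fresh("b")  # starts from an empty list
```

A `None` default does not have to conflict with an empty-array default on the Scala side: the Python wrapper can treat `None` as "not supplied" and leave the underlying default alone. The commit list for this PR includes "Use None instead of empty array" (2316a90).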

"""
Initialize this instace of the IndexToString using the provided java_obj.
Review comment (Member):
first line should be:

__init__(self, inputCol=None, outputCol=None, labels=[])

as in other transformers (See VectorAssembler)

typo: instace

"""
super(IndexToString, self).__init__()
self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.IndexToString",
self.uid)
self.labels = Param(self, "labels",
"Optional labels to be provided by the user, if not supplied column " +
Review comment (Contributor):
Same comment as L957

"metadata is read for labels. The default value is an empty array, " +
"but the empty array is ignored and column metadata used instead.")
kwargs = self.__init__._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, outputCol=None, labels=[]):
Review comment (Contributor):
Same here: use None rather than [].

Review comment (Contributor Author):
My concern is that the underlying Scala code uses an empty array as the default.

"""
setParams(self, inputCol="input", outputCol="output", labels=[])
Review comment (Member):
correct col defaults: None

Sets params for this IndexToString
Review comment (Contributor):
nit "." at end of line

"""
kwargs = self.setParams._input_kwargs
return self._set(**kwargs)

def setLabels(self, value):
"""
Specify the labels to be used.
Review comment (Contributor):
Suggested wording: "Sets the value of :py:attr:`labels`."
Sphinx will produce a link for this param.

"""
self._paramMap[self.labels] = value
return self

def getLabels(self):
"""
Get the labels.
Review comment (Contributor):
Suggested wording: "Gets the value of labels or its default value."

"""
return self.getOrDefault(self.labels)


@inherit_doc
python/pyspark/ml/wrapper.py (3 changes: 2 additions & 1 deletion)
@@ -136,7 +136,8 @@ def _fit(self, dataset):
class JavaTransformer(Transformer, JavaWrapper):
"""
Base class for :py:class:`Transformer`s that wrap Java/Scala
-implementations.
+implementations. Subclasses should ensure they have the transformer Java object
+available as _java_obj.
"""

__metaclass__ = ABCMeta