[SPARK-21633][ML][Python] UnaryTransformer in Python #18746

Closed

Conversation

ajaysaini725
Contributor

@ajaysaini725 ajaysaini725 commented Jul 27, 2017

What changes were proposed in this pull request?

Implemented UnaryTransformer in Python.

How was this patch tested?

This patch was tested by creating a MockUnaryTransformer class in the unit tests that extends UnaryTransformer and testing that the transform function produced correct output.

@ajaysaini725
Contributor Author

@jkbradley @thunterdb @MrBago Could you please review this?

@ajaysaini725 ajaysaini725 changed the title Implemented UnaryTransformer in Python [ML][Python]Implemented UnaryTransformer in Python Jul 27, 2017
@ajaysaini725 ajaysaini725 changed the title [ML][Python]Implemented UnaryTransformer in Python [ML][Python] Implemented UnaryTransformer in Python Jul 27, 2017
@SparkQA

SparkQA commented Jul 27, 2017

Test build #79988 has finished for PR 18746 at commit 960de95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2017

Test build #79991 has finished for PR 18746 at commit 11f8f29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@WeichenXu123 WeichenXu123 left a comment

I suggest adding Python example code for UnaryTransformer, like the Scala example MyTransformer.

    return StructType(outputFields)

def transform(self, dataset, paramMap=None):
    transformSchema(dataset.schema())
Contributor

There seems to be a problem here.
transform accepts a paramMap, but createTransformFunc has no way to access it, so I think something is lost.
Because a custom UnaryTransformer only needs to override createTransformFunc, the base class must handle the passed-in paramMap properly.

Contributor Author

Right, I accidentally overrode transform instead of _transform. Fixed!
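
The fix described here relies on a template-method split: the public transform deals with any extra params and then delegates to _transform, so a subclass only overrides the latter. A minimal pure-Python sketch of that pattern follows. SketchTransformer, SketchUnaryTransformer, Shift, and create_transform_func are hypothetical stand-ins, not the PySpark classes, and a plain list stands in for a DataFrame column.

```python
import copy


class SketchTransformer:
    """Hypothetical stand-in for a Transformer base class (not Spark's)."""

    def __init__(self, **params):
        self.params = dict(params)

    def transform(self, dataset, params=None):
        # Public entry point: apply extra params to a copy, then delegate.
        if params:
            clone = copy.copy(self)
            clone.params = {**self.params, **params}
            return clone._transform(dataset)
        return self._transform(dataset)

    def _transform(self, dataset):
        raise NotImplementedError


class SketchUnaryTransformer(SketchTransformer):
    """Subclasses supply only the per-value function."""

    def create_transform_func(self):
        raise NotImplementedError

    def _transform(self, dataset):
        func = self.create_transform_func()
        return [func(value) for value in dataset]


class Shift(SketchUnaryTransformer):
    def create_transform_func(self):
        # Reads the param at call time, so overrides passed to transform apply.
        shift = self.params["shift"]
        return lambda value: value + shift


shifter = Shift(shift=1)
default_result = shifter.transform([1.0, 2.0])
overridden_result = shifter.transform([1.0, 2.0], {"shift": 10})
```

Because the override is applied to a copy, the original instance keeps its own param values after the call.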

def transform(self, dataset, paramMap=None):
    transformSchema(dataset.schema())
    transformUDF = udf(self.createTransformFunc(), self.outputDataType())
    dataset.withColumn(self.getOutputCol(), transformUDF(self.getInputCol()))
Contributor

udf expects its first argument to be a function, so why do you pass in the return value of self.createTransformFunc?

Contributor Author

self.createTransformFunc returns a function, which is then passed to udf, so I think it is okay in this case.
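
The point in this exchange is that a factory method returns a callable, so passing its return value still hands a function to the wrapper. A small pure-Python sketch, assuming a hypothetical apply_to_column helper as a stand-in for a udf-style wrapper (it is not Spark's udf) and a hypothetical shift parameter:

```python
def create_transform_func(shift):
    """Hypothetical factory: builds and returns the per-value function."""
    def shift_value(value):
        return value + shift
    return shift_value


def apply_to_column(func, values):
    """Stand-in for a udf-style wrapper: it expects a callable, not a value."""
    if not callable(func):
        raise TypeError("expected a function")
    return [func(v) for v in values]


# Calling the factory once yields the function that is applied per value.
transform_func = create_transform_func(2.5)
shifted = apply_to_column(transform_func, [0.0, 1.0])
```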

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80183 has finished for PR 18746 at commit 692aa5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

@ajaysaini725 Is there a JIRA for this PR? Please tag this PR in the title.

@jkbradley
Member

jkbradley commented Aug 3, 2017

Also, you can remove "implemented" from the title, and please update the description now that you have tests.

@ajaysaini725 ajaysaini725 changed the title [ML][Python] Implemented UnaryTransformer in Python [ML][Python] UnaryTransformer in Python Aug 3, 2017
Member

@jkbradley jkbradley left a comment

OK done with review pass. Thanks for the PR!

@inherit_doc
class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    """
    Abstract class for transformers that tae one input column, apply a transoformation to it,
Member

typo: tae

Member

Actually multiple typos. Why not just copy the text from Scala?

@abstractmethod
def createTransformFunc(self):
    """
    Creates the transoform function using the given param map.
Member

Please use the IntelliJ spellcheck feature


def _transform(self, dataset):
    self.transformSchema(dataset.schema)
    transformFunc = self.createTransformFunc()
Member

remove

df = df.withColumn("input", df.input.cast(dataType="double"))

transformed_df = transformer.transform(df)
output = transformed_df.select("output").collect()
Member

It's better practice to select both input & output and collect both for comparison, rather than relying on DataFrame rows maintaining their order.
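
To illustrate why collecting both columns together is more robust: each row carries its own input next to its output, so the check holds regardless of row order. A pure-Python sketch, where the namedtuple Row, shift_val, and the shuffled list are hypothetical stand-ins for collected DataFrame rows:

```python
import random
from collections import namedtuple

# Stand-in for rows collected with both columns selected together.
Row = namedtuple("Row", ["input", "output"])

shift_val = 3.0
rows = [Row(float(i), float(i) + shift_val) for i in range(5)]
random.shuffle(rows)  # simulate an arbitrary row order

# Each row is checked against its own input, so ordering does not matter.
mismatches = [r for r in rows if r.input + shift_val != r.output]
```

Collecting the two columns in separate calls and pairing them by position would only be valid if both collections happened to return rows in the same order.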

@@ -1957,6 +1987,24 @@ def test_chisquaretest(self):
        self.assertTrue(all(field in fieldNames for field in expectedFields))


class UnaryTransformerTests(SparkSessionTestCase):

    def test_unary_transformer_transform(self):
Member

Could you also please test validateInputType?
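
One possible shape for such a test, sketched in pure Python with unittest. SketchShiftTransformer, validate_input_type, and the string type names are hypothetical stand-ins; the real test in this PR would use the mock transformer and Spark data types instead.

```python
import unittest


class SketchShiftTransformer:
    """Hypothetical stand-in; type names are plain strings, not Spark types."""

    def validate_input_type(self, input_type):
        # Reject anything other than the one supported input type.
        if input_type != "double":
            raise TypeError("Bad input type: %s. Requires double." % input_type)


class ValidateInputTypeTest(unittest.TestCase):

    def test_accepts_double(self):
        # Should return without raising.
        SketchShiftTransformer().validate_input_type("double")

    def test_rejects_string(self):
        with self.assertRaises(TypeError):
            SketchShiftTransformer().validate_input_type("string")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateInputTypeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```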

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80225 has finished for PR 18746 at commit 527bc88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

@ajaysaini725 Is there a JIRA for this PR? Please tag this PR in the title.

Member

@jkbradley jkbradley left a comment

Just 1 comment left!

df = df.withColumn("input", df.input.cast(dataType="double"))

transformed_df = transformer.transform(df)
inputCol = transformed_df.select("input").collect()
Member

Do this instead:

results = transformed_df.select("input", "output").collect()
for res in results:
    self.assertEqual(res.input + shiftVal, res.output)

@ajaysaini725 ajaysaini725 changed the title [ML][Python] UnaryTransformer in Python [SPARK-21633][ML][Python] UnaryTransformer in Python Aug 4, 2017
@SparkQA

SparkQA commented Aug 4, 2017

Test build #80228 has finished for PR 18746 at commit a30ae39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM
Merging with master
Thanks @ajaysaini725 !

@asfgit asfgit closed this in 1347b2a Aug 4, 2017