
[SPARK-26449][PYTHON] add a transform method to the Dataframe class #23414

Closed
wants to merge 10 commits

Conversation

chanansh

What changes were proposed in this pull request?

Added a transform method to the DataFrame class; see https://issues.apache.org/jira/browse/SPARK-26449

How was this patch tested?

Tested manually by injecting the proposed method into the DataFrame class of the current Spark version.
I also tried to compile Spark from scratch and run the tests with ./build/mvn test; however, unrelated tests fail before my change.
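
For context, the proposed method is essentially a pass-through that applies a user-supplied function to the DataFrame. A minimal sketch of the proposal and of the manual injection described above (an illustration, not the exact patch from this PR):

from pyspark.sql import DataFrame

def transform(self, func):
    """Apply func to this DataFrame and return the resulting DataFrame,
    so custom transformations can be method-chained."""
    return func(self)

# Manual injection for ad-hoc testing, as the author describes:
# monkey-patch the method onto the existing DataFrame class.
DataFrame.transform = transform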


adding transform method
change version of transform to 3
@chanansh changed the title from "Spark 26449" to "[Spark 26449][PYSPARK] added a transform method to the Dataframe class" on Dec 30, 2018
@chanansh changed the title from "[Spark 26449][PYSPARK] added a transform method to the Dataframe class" to "[Spark 26449][PYSPARK] add a transform method to the Dataframe class" on Dec 30, 2018
@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Dec 30, 2018

Test build #100560 has finished for PR 23414 at commit def5b2c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 30, 2018

Test build #100562 has finished for PR 23414 at commit def5b2c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

removed space from empty lines
@chanansh
Author

@HyukjinKwon I removed spaces from empty lines. please re-test.

@SparkQA

SparkQA commented Dec 31, 2018

Test build #100592 has finished for PR 23414 at commit b370363.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chanansh
Author

@HyukjinKwon I get the following errors:

[error] running /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/lint-python ; received return code 1
Attempting to post to Github...
 > Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/100592/
Test FAILed.
Finished: FAILURE

Can you please help?

@HyukjinKwon
Member

Looks like it failed for the reasons below.

pycodestyle checks failed:
./python/pyspark/sql/dataframe.py:2048:1: W293 blank line contains whitespace
./python/pyspark/sql/dataframe.py:2064:1: W293 blank line contains whitespace

added doctest
removed space from blank line
@chanansh
Author

Added a doctest and removed more empty lines with spaces. Please re-test.

removed *args **kwargs (albeit I think they are useful)
@SparkQA

SparkQA commented Dec 31, 2018

Test build #100594 has finished for PR 23414 at commit f5aaa1a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chanansh
Author

Removed *args and **kwargs (albeit I think they're useful). Please re-test.

removed *args, **kwargs
@SparkQA

SparkQA commented Dec 31, 2018

Test build #100595 has finished for PR 23414 at commit 0b1f562.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chanansh
Author

@HyukjinKwon I'm sorry for being a newbie, but I don't understand the reason for the failure:

Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/23414/*:refs/remotes/origin/pr/23414/*" returned status code 128:
stdout: 
stderr: error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: The remote end hung up unexpectedly

@srowen
Member

srowen commented Dec 31, 2018

@HyukjinKwon what do you mean when you say the Scala impl has this? I'm missing it.

I don't see the value in this. From the blog post at https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55, why is ...

actual_df = (source_df
    .transform(lambda df: with_greeting(df))
    .transform(lambda df: with_something(df, "crazy")))

better than just

actual_df = with_greeting(source_df)
actual_df = with_something(actual_df, "crazy")

@chanansh
Author

The idea is to be able to chain functions easily when you have 10 stages, with no need to keep temporary variables.
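
For example, assuming the proposed transform method is in place, and with hypothetical helpers in the spirit of the blog post above (with_greeting and with_something are illustrative, not part of this PR), a multi-stage pipeline stays linear:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame([("jose", 1), ("li", 2)], ["name", "age"])

# Hypothetical helper transformations, each returning a new DataFrame.
def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(df, something):
    return df.withColumn("something", lit(something))

# Chained style: no intermediate variables between stages.
actual_df = (source_df
    .transform(with_greeting)
    .transform(lambda df: with_something(df, "crazy")))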

@srowen
Member

srowen commented Dec 31, 2018

You can also...

actual_df = source_df
for f in [...]:
    actual_df = f(actual_df)

Unless I'm really missing something this doesn't exist for Scala (?) and I can't see adding an API method for this. The small additional maintenance and user cognitive load just doesn't seem to buy much at all.


@HyukjinKwon
Member

I was referring to:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2497

If it were a new API, I wouldn't encourage adding it, but it already exists. I think we should rather deprecate the Scala-side one if we don't see value in it. Otherwise, I thought matching it was fine.

@srowen
Member

srowen commented Dec 31, 2018

Oh hm I had never seen that! Yah seems fine for consistency then.

@HyukjinKwon
Member

Yea .. I'm not super happy with adding it either, to be honest, but I guess it's fine for now.

@since(3.0)
def transform(self, func):
    """Returns a new class:`DataFrame` according to a user-defined custom transform method.
    This allows chaining transformations rather than using nested or temporary variables.
Member

I would just match the doc to the Scala API side as well.

Author

What do you mean?
I don't see any Scala documentation.


@HyukjinKwon
Member

The build failed due to the reason below:

pycodestyle checks failed:
./python/pyspark/sql/dataframe.py:2070:101: E501 line too long (106 > 100 characters)

@HyukjinKwon
Member

@chanansh, also please fix the PR title to [SPARK-26449][PYTHON] ... so that it automatically links your PR to the JIRA.

@chanansh changed the title from "[Spark 26449][PYSPARK] add a transform method to the Dataframe class" to "[SPARK-26449][PYTHON] add a transform method to the Dataframe class" on Jan 1, 2019
@SparkQA

SparkQA commented Jan 1, 2019

Test build #100610 has finished for PR 23414 at commit 9919e28.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 1, 2019

Test build #100611 has finished for PR 23414 at commit e54d2f7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chanansh
Author

chanansh commented Jan 1, 2019

@HyukjinKwon, please review the latest changes.

@SparkQA

SparkQA commented Jan 1, 2019

Test build #100612 has finished for PR 23414 at commit 3d9a751.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

    +----+---+--------+---------+
    """
    res = func(self)
    assert isinstance(res, DataFrame)
Member

I would also add a message for it. For instance,

ret = func(self)
assert isinstance(ret, DataFrame), "Returned instance from the " \
    "given function should be a DataFrame; however, got [%s]." % type(ret)

:param func: a custom transform function which returns a DataFrame

>>> from pyspark.sql.functions import lit
>>> def with_greeting(df):
Member

Can we make the example more concise and meaningful? I think we should focus only on a simple example about the API itself rather than using lambda. For instance,

>>> from pyspark.sql.functions import col
>>> df = spark.range(10)
>>> def cast_to_str(input_df):
...     return input_df.select([col(c).cast("string") for c in input_df.columns])
>>> df.transform(cast_to_str).show()

"""Returns a new class:`DataFrame` according to a custom transform function.
This allows chaining transformations rather than using nested or temporary variables.

:param func: a custom transform function which returns a DataFrame
Member

nit: DataFrame -> class:`DataFrame`
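
Putting the review comments together, the method under discussion would look roughly like the sketch below. This is assembled from the fragments above for readability; it is not the exact patch from this PR, which was ultimately closed unmerged:

from pyspark import since
from pyspark.sql import DataFrame

# In the patch this is a method on DataFrame in python/pyspark/sql/dataframe.py;
# it is shown here as a standalone function for brevity.
@since(3.0)
def transform(self, func):
    """Returns a new :class:`DataFrame` according to a custom transform function.
    This allows chaining transformations rather than using nested calls or
    temporary variables.

    :param func: a custom transform function which returns a :class:`DataFrame`

    >>> from pyspark.sql.functions import col
    >>> df = spark.range(10)
    >>> def cast_to_str(input_df):
    ...     return input_df.select([col(c).cast("string") for c in input_df.columns])
    >>> df.transform(cast_to_str)
    DataFrame[id: string]
    """
    res = func(self)
    assert isinstance(res, DataFrame), "Returned instance from the " \
        "given function should be a DataFrame; however, got [%s]." % type(res)
    return res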

@srowen
Member

srowen commented Jan 9, 2019

@chanansh I think this can proceed if you'll have a look at the comments above.

@chanansh
Author

chanansh commented Jan 9, 2019 via email

@HyukjinKwon
Member

Closing this due to the author's inactivity.

@chanansh
Author

chanansh commented Feb 11, 2019 via email

@srowen
Member

srowen commented Feb 11, 2019

Just push more commits; I think that reopens it.

@Hellsen83
Contributor

Is this one still open? I'd like to PR basically the same thing. Should I commit here or create a new PR?

@HyukjinKwon
Member

You can pick up the commits and create a new PR. It looks like the author is inactive.
