
[SPARK-26449][PYTHON] add a transform method to the Dataframe class #23414

Closed
wants to merge 10 commits

Conversation

chanansh

What changes were proposed in this pull request?

Added a transform method to the DataFrame class; see https://issues.apache.org/jira/browse/SPARK-26449

How was this patch tested?

Tested manually by injecting the proposed method into the DataFrame class of the current Spark version.
I also tried to compile Spark from scratch and run the tests with ./build/mvn test; however, unrelated tests fail before my change.
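
For context, the proposed method is essentially a pass-through that applies a user-supplied function to the DataFrame. A minimal sketch of the proposal and of the manual injection described above (an illustration, not the exact patch from this PR):

from pyspark.sql import DataFrame

def transform(self, func):
    """Apply func to this DataFrame and return the resulting DataFrame,
    so custom transformations can be method-chained."""
    return func(self)

# Manual injection for ad-hoc testing, as the author describes:
# monkey-patch the method onto the existing DataFrame class.
DataFrame.transform = transform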


adding transform method
change version of transform to 3
@chanansh changed the title from "Spark 26449" to "[Spark 26449][PYSPARK] added a transform method to the Dataframe class" on Dec 30, 2018
@chanansh changed the title from "[Spark 26449][PYSPARK] added a transform method to the Dataframe class" to "[Spark 26449][PYSPARK] add a transform method to the Dataframe class" on Dec 30, 2018
@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Dec 30, 2018

Test build #100560 has finished for PR 23414 at commit def5b2c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 30, 2018

Test build #100562 has finished for PR 23414 at commit def5b2c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

removed space from empty lines
@chanansh
Author

@HyukjinKwon I removed spaces from empty lines. please re-test.

@SparkQA

SparkQA commented Dec 31, 2018

Test build #100592 has finished for PR 23414 at commit b370363.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chanansh
Author

@HyukjinKwon I get the following errors:

[error] running /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/lint-python ; received return code 1
Attempting to post to Github...
 > Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/100592/
Test FAILed.
Finished: FAILURE

Can you please help?

@HyukjinKwon
Member

Looks like it failed for the reasons below.

pycodestyle checks failed:
./python/pyspark/sql/dataframe.py:2048:1: W293 blank line contains whitespace
./python/pyspark/sql/dataframe.py:2064:1: W293 blank line contains whitespace

added doctest
removed space from blank line
@chanansh
Author

Added a doctest and removed more empty lines with spaces. Please re-test.

removed *args **kwargs (albeit I think they are useful)
@SparkQA

SparkQA commented Dec 31, 2018

Test build #100594 has finished for PR 23414 at commit f5aaa1a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chanansh
Author

Removed *args and **kwargs (albeit I think they're useful). Please re-test.

removed *args, **kwargs
@SparkQA

SparkQA commented Dec 31, 2018

Test build #100595 has finished for PR 23414 at commit 0b1f562.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chanansh
Author

@HyukjinKwon I'm sorry for being a newbie, but I don't understand the reason for the failure:

Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/23414/*:refs/remotes/origin/pr/23414/*" returned status code 128:
stdout: 
stderr: error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: The remote end hung up unexpectedly

@srowen
Member

srowen commented Dec 31, 2018

@HyukjinKwon what do you mean when you say the Scala impl has this? I'm missing it.

I don't see the value in this. From the blog post at https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55, why is ...

actual_df = (source_df
    .transform(lambda df: with_greeting(df))
    .transform(lambda df: with_something(df, "crazy")))

better than just

actual_df = with_greeting(source_df)
actual_df = with_something(actual_df, "crazy")

@chanansh
Author

The idea is to be able to chain functions easily when you have 10 stages, with no need to keep temporary variables.
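
For example, assuming the proposed transform method is in place, and with hypothetical helpers in the spirit of the blog post above (with_greeting and with_something are illustrative, not part of this PR), a multi-stage pipeline stays linear:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame([("jose", 1), ("li", 2)], ["name", "age"])

# Hypothetical helper transformations, each returning a new DataFrame.
def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(df, something):
    return df.withColumn("something", lit(something))

# Chained style: no intermediate variables between stages.
actual_df = (source_df
    .transform(with_greeting)
    .transform(lambda df: with_something(df, "crazy")))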

@srowen
Member

srowen commented Dec 31, 2018

You can also...

actual_df = source_df
for f in [...]:
    actual_df = f(actual_df)

Unless I'm really missing something this doesn't exist for Scala (?) and I can't see adding an API method for this. The small additional maintenance and user cognitive load just doesn't seem to buy much at all.


@HyukjinKwon
Member

I was referring to:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2497

If it were a new API, I wouldn't encourage adding it, but it already exists. I think we should rather deprecate the Scala-side one if we don't see value in it. Otherwise, I thought matching it was fine.

@srowen
Member

srowen commented Dec 31, 2018

Oh hm I had never seen that! Yah seems fine for consistency then.

@HyukjinKwon
Member

Yea .. I'm not super happy with adding it either, to be honest, but I guess it's fine for now.

@since(3.0)
def transform(self, func):
    """Returns a new class:`DataFrame` according to a user-defined custom transform method.
    This allows chaining transformations rather than using nested or temporary variables.
Member

I would just match the doc to the Scala API side as well.

Author

What do you mean?
I don't see any Scala documentation.


@HyukjinKwon
Member

The build failed due to the reason below:

pycodestyle checks failed:
./python/pyspark/sql/dataframe.py:2070:101: E501 line too long (106 > 100 characters)

@HyukjinKwon
Member

@chanansh, also please fix the PR title to [SPARK-26449][PYTHON] ... so that it automatically links your PR to the JIRA.

@chanansh changed the title from "[Spark 26449][PYSPARK] add a transform method to the Dataframe class" to "[SPARK-26449][PYTHON] add a transform method to the Dataframe class" on Jan 1, 2019
@SparkQA

SparkQA commented Jan 1, 2019

Test build #100610 has finished for PR 23414 at commit 9919e28.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 1, 2019

Test build #100611 has finished for PR 23414 at commit e54d2f7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chanansh
Author

chanansh commented Jan 1, 2019

@HyukjinKwon, please review the latest changes.

@SparkQA

SparkQA commented Jan 1, 2019

Test build #100612 has finished for PR 23414 at commit 3d9a751.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

    +----+---+--------+---------+
    """
    res = func(self)
    assert isinstance(res, DataFrame)
Member

I would also add a message for it. For instance,

ret = func(self)
assert isinstance(ret, DataFrame), "Returned instance from the " \
    "given function should be a DataFrame; however, got [%s]." % type(ret)

:param func: a custom transform function which returns a DataFrame

>>> from pyspark.sql.functions import lit
>>> def with_greeting(df):
Member

Can we make the example more concise and meaningful? I think we should focus only on a simple example about the API itself rather than using lambda. For instance,

>>> from pyspark.sql.functions import col
>>> df = spark.range(10)
>>> def cast_to_str(input_df):
...     return input_df.select([col(c).cast("string") for c in input_df.columns])
>>> df.transform(cast_to_str).show()

"""Returns a new class:`DataFrame` according to a custom transform function.
This allows chaining transformations rather than using nested or temporary variables.

:param func: a custom transform function which returns a DataFrame
Member

nit: DataFrame -> class:`DataFrame`
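
Putting the review comments together, the method under discussion would look roughly like the sketch below. This is assembled from the fragments above for readability; it is not the exact patch from this PR, which was ultimately closed unmerged:

from pyspark import since
from pyspark.sql import DataFrame

# In the patch this is a method on DataFrame in python/pyspark/sql/dataframe.py;
# it is shown here as a standalone function for brevity.
@since(3.0)
def transform(self, func):
    """Returns a new :class:`DataFrame` according to a custom transform function.
    This allows chaining transformations rather than using nested calls or
    temporary variables.

    :param func: a custom transform function which returns a :class:`DataFrame`

    >>> from pyspark.sql.functions import col
    >>> df = spark.range(10)
    >>> def cast_to_str(input_df):
    ...     return input_df.select([col(c).cast("string") for c in input_df.columns])
    >>> df.transform(cast_to_str)
    DataFrame[id: string]
    """
    res = func(self)
    assert isinstance(res, DataFrame), "Returned instance from the " \
        "given function should be a DataFrame; however, got [%s]." % type(res)
    return res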

@srowen
Member

srowen commented Jan 9, 2019

@chanansh I think this can proceed if you'll have a look at the comments above.

@chanansh
Author

chanansh commented Jan 9, 2019 via email

@HyukjinKwon
Member

Closing this due to the author's inactivity.

@chanansh
Author

chanansh commented Feb 11, 2019 via email

@srowen
Member

srowen commented Feb 11, 2019

Just push more commits; I think that reopens it.

@Hellsen83
Contributor

Is this one still open? I'd like to PR basically the same thing. Should I commit here or create a new PR?

@HyukjinKwon
Member

You can pick up the commits and create a new PR. It looks like the author is inactive.
