[SPARK-26449][PYTHON] add a transform method to the Dataframe class #23414
Conversation
adding transform method
change version of transform to 3
ok to test
Test build #100560 has finished for PR 23414 at commit
Test build #100562 has finished for PR 23414 at commit
removed space from empty lines
@HyukjinKwon I removed spaces from empty lines. Please re-test.
Test build #100592 has finished for PR 23414 at commit
@HyukjinKwon I get the following errors:
Can you please help?
Looks like it failed for the reasons below.
added doctest
removed space from blank line
added doctest and removed more empty lines with spaces. Please re-test
removed *args **kwargs (albeit I think they are useful)
Test build #100594 has finished for PR 23414 at commit
removed *args **kwargs (albeit I think they're useful). Please re-test
removed *args, **kwargs
Test build #100595 has finished for PR 23414 at commit
@HyukjinKwon I'm sorry for being a newbie, but I don't understand the failure reason:
@HyukjinKwon what do you mean when you say the Scala impl has this? I'm missing it. I don't see the value in this. From the blog post at https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55 why is ...
better than just
The idea is to be able to chain functions easily when you have 10 stages, with no need for keeping temporary variables.
You can also...
Unless I'm really missing something, this doesn't exist for Scala (?) and I can't see adding an API method for this. The small additional maintenance and user cognitive load just doesn't seem to buy much at all.
@srowen the motivation is from this blog post: https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55
I was referring to this: if it were a new API, I wouldn't encourage adding it, but it already exists on the Scala side. I think we should rather deprecate the Scala one if we don't see value in it. Otherwise, I thought matching it is fine.
Oh hm I had never seen that! Yah seems fine for consistency then. |
Yea .. I'm not super happy with adding it either, to be honest, but I guess it's fine for now.
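The chaining-vs-nesting trade-off under discussion can be sketched without Spark at all. Below is a minimal pure-Python sketch: `FakeFrame` is a hypothetical stand-in for PySpark's `DataFrame` (so the example runs with no cluster), and only the shape of `transform` matters, not the API details.

```python
# Hypothetical stand-in for pyspark.sql.DataFrame, used only to
# illustrate the transform-chaining pattern from the discussion.
class FakeFrame:
    def __init__(self, rows):
        self.rows = rows  # list of dicts

    def with_column(self, name, value):
        # Return a new frame with an extra column (immutable style).
        return FakeFrame([{**r, name: value} for r in self.rows])

    def transform(self, func):
        # Apply a user-defined function that takes and returns a frame,
        # so custom steps chain like built-in methods.
        return func(self)

def with_greeting(df):
    return df.with_column("greeting", "hi")

def with_something(df, something):
    return df.with_column("something", something)

source_df = FakeFrame([{"name": "jose"}, {"name": "li"}])

# Chained style (what DataFrame.transform enables):
chained = source_df.transform(with_greeting) \
                   .transform(lambda df: with_something(df, "crazy"))

# Equivalent nested style (what chaining avoids; with 10 stages the
# nesting or the pile of temporary variables gets unwieldy):
nested = with_something(with_greeting(source_df), "crazy")

assert chained.rows == nested.rows
```

With many stages, the chained form reads top-to-bottom in application order, which is the ergonomic argument being made above.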
@since(3.0)
def transform(self, func):
    """Returns a new :class:`DataFrame` according to a user-defined custom transform method.
    This allows chaining transformations rather than using nested calls or temporary variables.
I would just match the doc to the Scala API side as well.
what do you mean?
I don't see the Scala documentation
I meant copying and pasting the Scala-side documentation into here
the build failed due to the reason below:
@chanansh, also please fix the PR title to …
Test build #100610 has finished for PR 23414 at commit
Test build #100611 has finished for PR 23414 at commit
@HyukjinKwon, please review the latest.
Test build #100612 has finished for PR 23414 at commit
Looks fine except for https://github.com/apache/spark/pull/23414/files#r244654162
+----+---+--------+---------+
"""
res = func(self)
assert isinstance(res, DataFrame)
I would also add a message for it. For instance,
ret = func(self)
assert isinstance(ret, DataFrame), "Returned instance from the " \
    "given function should be a DataFrame; however, got [%s]." % type(ret)
:param func: a custom transform function which returns a DataFrame

>>> from pyspark.sql.functions import lit
>>> def with_greeting(df):
Can we make the example more concise and meaningful? I think we should focus only on a simple example about the API itself rather than using lambda. For instance,
>>> df = spark.range(10)
>>> from pyspark.sql.functions import col
>>> def cast_to_str(input_df):
...     return input_df.select([col(c).cast("string") for c in input_df.columns])
>>> df.transform(cast_to_str).show()
"""Returns a new class:`DataFrame` according to a custom transform function. | ||
This allows chaining transformations rather than using nested or temporary variables. | ||
|
||
:param func: a custom transform function which returns a DataFrame |
nit: DataFrame -> class:`DataFrame`
@chanansh I think this can proceed if you'll have a look at the comments above.
Aye aye
…On Wed, Jan 2, 2019, 04:27 Hyukjin Kwon ***@***.*** wrote:
In python/pyspark/sql/dataframe.py <#23414 (comment)>:
> + This is equivalent to a nested call:
+ actual_df = with_something(with_greeting(source_df), "crazy")
+
+ credit to: ***@***.***/chaining-custom-pyspark-transformations-4f38a8c7ae55
+
+ A more concrete example::
+ >>> sc = pyspark.SparkContext(master='local')
+ >>> spark = pyspark.sql.SparkSession(sparkContext=sc)
+ >>> from pyspark.sql.functions import lit
+ >>> def with_greeting(df):
+ ... return df.withColumn("greeting", lit("hi"))
+ >>> def with_something(df, something):
+ ... return df.withColumn("something", lit(something))
+ >>> data = [("jose", 1), ("li", 2), ("liz", 3)]
+ >>> source_df = spark.createDataFrame(data, ["name", "age"])
+ >>> actual_df = source_df.transform(with_greeting).transform(lambda x: with_something(x, "crazy"))
I think we don't necessarily have to demonstrate the chaining of multiple transforms. We can chain other APIs as well, for instance, df.transform(...).select(...).transform(...) in that sense. show() is already a DataFrame API. I think df.transform(...).show() is simple and good enough.
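The interleaving point made here can also be sketched without Spark: custom steps applied via transform mix freely with built-in methods. Below, `MiniFrame` is a hypothetical stand-in for `DataFrame`, and `select` mimics the built-in method being interleaved.

```python
# Hypothetical stand-in frame showing transform interleaved with a
# built-in method (select), as in df.transform(...).select(...).transform(...).
class MiniFrame:
    def __init__(self, rows):
        self.rows = rows  # list of dicts

    def select(self, *cols):
        # Built-in-style method: keep only the named columns.
        return MiniFrame([{c: r[c] for c in cols} for r in self.rows])

    def transform(self, func):
        return func(self)

def add_double(df):
    # Custom step: derive a "double" column from "id".
    return MiniFrame([{**r, "double": r["id"] * 2} for r in df.rows])

def drop_odd(df):
    # Custom step: filter out rows with odd ids.
    return MiniFrame([r for r in df.rows if r["id"] % 2 == 0])

df = MiniFrame([{"id": i} for i in range(4)])

# Custom transforms interleaved with a built-in method:
out = df.transform(add_double).select("id", "double").transform(drop_odd)
```

Since transform returns the same frame type, custom and built-in steps compose in any order, which is why a single simple doctest suffices.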
Closing this due to author's inactivity.
sorry, please reopen; I will do it.
*HS*
…On Mon, Feb 11, 2019 at 12:10 PM Hyukjin Kwon ***@***.***> wrote:
Closed #23414 <#23414>.
Just push more commits; I think that reopens it.
is this one still open? I'd want to PR basically the same thing. Should I commit here or create a new PR?
You can pick up the commits and create a new PR. Looks like the author is inactive.
What changes were proposed in this pull request?
added a transform method to the DataFrame class; see https://issues.apache.org/jira/browse/SPARK-26449
How was this patch tested?
Tested manually by injecting the proposed method into the current Spark version's DataFrame class.
I've tried to compile Spark from scratch and test using ./build/mvn test. However, unrelated tests fail before my change.
Please review http://spark.apache.org/contributing.html before opening a pull request.