[SPARK-25333][SQL] Ability add new columns in Dataset in a user-defined position #22332
Can one of the admins verify this patch? |
I think that if we want to introduce a new method for this, it'd be better to have a |
Why not |
@mgaido91 Thank you for your suggestion. I updated the PR name, description and sources with a new version using a parameter @jaceklaskowski Thank you for your review. But here I'm discussing the |
I also can't find a strong reason to append a new API in |
Can't we simply |
*/
def withColumn(colName: String, col: Column, atTheEnd: Boolean): DataFrame =
  withColumns(Seq(colName), Seq(col), atTheEnd)
def withColumn(colName: String, col: Column, atPosition: Int): DataFrame =
@since 2.4.0?
Added
@@ -2226,16 +2226,18 @@ class Dataset[T] private[sql](
* `column`'s expression must only refer to attributes supplied by this Dataset. It is an
* error to add a column that refers to some other Dataset.
*
* You can choose to add new columns either at the end (default behavior) or at the beginning.
* The position of the new column start from 0, and a negative position means at the end (default behavior).
"starts at 0. Any negative position means to add the column at the end"?
I modified as you suggested
/**
* Returns a new Dataset by adding columns or replacing the existing columns that has
* the same names.
*
* The position of new columns start from 0, and a negative position means at the end (default behavior).
Same as above
def withColumn(colName: String, col: Column, atTheEnd: Boolean): DataFrame =
  withColumns(Seq(colName), Seq(col), atTheEnd)
def withColumn(colName: String, col: Column, atPosition: Int): DataFrame =
  withColumns(Seq(colName), Seq(col), atPosition)

/**
* Returns a new Dataset by adding columns or replacing the existing columns that has
s/has/have
@@ -831,13 +831,21 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
}.toSeq)
assert(df.schema.map(_.name) === Seq("key", "value", "newCol"))

val df2 = testData.toDF().withColumn("newCol", col("key") + 1, false)
val df2 = testData.toDF().withColumn("newCol", col("key") + 1, 0)
What about tests with negative positions?
Tests with a negative position are covered for both the public method withColumn and the private method withColumns:
- https://github.com/apache/spark/pull/22332/files#diff-5d2ebf4e9ca5a990136b276859769289R852
- https://github.com/apache/spark/pull/22332/files#diff-5d2ebf4e9ca5a990136b276859769289R907
I'm testing three cases (at the same time) that all add the new column at the end:
- negative position
- last position
- position greater than the number of columns
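The three cases above can be read as one normalization rule. The following is only a sketch of that rule, not code from the PR: the object `PositionSketch` and the helper `resolvePosition` are hypothetical names, and it assumes that "last position" means a position equal to the number of existing columns.

```scala
object PositionSketch {
  // Hypothetical helper (not part of the PR): maps the requested position to the
  // 0-based index at which the new column is inserted among `numColumns` existing
  // columns. Negative or too-large positions fall back to "append at the end".
  def resolvePosition(position: Int, numColumns: Int): Int =
    if (position < 0 || position > numColumns) numColumns else position
}
```

With two existing columns, positions -1, 2 and 5 all resolve to index 2 (the end), while 0 and 1 are kept unchanged.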
@jaceklaskowski I refactored with what you suggested in your review. Let me know what you think. |
@HyukjinKwon Thank you for your review. To answer your question about using Can someone run the tests, please? |
I think we can still just add a column and select. It will probably need a few lines (or one line) change to reorder the columns ... no? |
@HyukjinKwon even instead of using the actual method |
If that's easily worked around, let's not add this one. There are too many APIs open now and we should rather try to reduce them. |
PR closed: we can use select to add new columns in a user-defined position. |
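A rough sketch of that workaround, under stated assumptions: the object `SelectWorkaround` and the helper `orderWith` are hypothetical names that do not appear in the PR or in Spark. The helper only computes the column order in plain Scala; the Spark call it would feed is shown as a comment so that the block stays runnable without Spark on the classpath.

```scala
object SelectWorkaround {
  // Hypothetical helper: returns the column names in the order to pass to select,
  // with `newName` placed at the 0-based index `position`.
  def orderWith(existing: Seq[String], newName: String, position: Int): Seq[String] = {
    require(position >= 0 && position <= existing.length,
      s"position $position is outside 0..${existing.length}")
    val (before, after) = existing.splitAt(position)
    (before :+ newName) ++ after
  }

  // With Spark available, the order could then be applied roughly as follows
  // (assuming `df` is a DataFrame with columns "key" and "value", and
  // org.apache.spark.sql.functions.col is imported):
  //   val order = orderWith(df.columns.toSeq, "newCol", 0)
  //   df.withColumn("newCol", col("key") + 1).select(order.map(col): _*)
}
```

This keeps the existing `withColumn` behavior (append at the end) and leaves the reordering to a single `select`, which is the reason given in the thread for not adding a new API.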
Thanks, @wmellouli. |
Thanks guys.
What changes were proposed in this pull request?
When new columns are added to a Dataset, they are always appended at the end.
Depending on the use case, users may want to add new columns at the end, at the beginning, or at a specific position.
In my case, for example, we add technical columns at the beginning of a Dataset and business columns at the end.
This pull request adds the ability to insert new columns at a user-defined position in a Dataset, using an optional parameter atPosition whose values start from 0:
This change is backward compatible with previous versions.
Consider this data frame with two columns:
So we can:
1- add a new column without using the parameter atPosition (default behavior):
2- add a new column using the parameter atPosition with different values:
=> with negative position
=> with a position greater than the number of columns
How was this patch tested?
This patch is tested with unit tests.