[SPARK-25333][SQL] Ability add new columns in Dataset in a user-defined position #22332

wmellouli · 2018-09-04T17:04:19Z

What changes were proposed in this pull request?

When new columns are added to a Dataset, they are always appended at the end.
Users generally want to add new columns at the end, at the beginning, or at a defined position, depending on the use case.
In my case, for example, we add technical columns at the beginning of a Dataset and business columns at the end.

This pull request adds the ability to insert new columns at a user-defined position in a Dataset, using an optional parameter atPosition whose positions start from 0:

a negative value (default behavior) adds the column at the end
a position greater than the number of columns adds the column at the end
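The positional rule above can be sketched on a plain list of column names. This is only an illustration in plain Scala, not the code from this PR: the names ColumnPosition and insertAt are hypothetical, and the sketch does not cover replacing an existing column.

```scala
object ColumnPosition {
  /** Inserts `name` into `columns` at the 0-based index `atPosition`.
    * A negative position, or one greater than the number of columns,
    * appends the name at the end (the rule described in this PR).
    * Hypothetical helper for illustration only.
    */
  def insertAt(columns: Seq[String], name: String, atPosition: Int): Seq[String] = {
    val index =
      if (atPosition < 0 || atPosition > columns.length) columns.length
      else atPosition
    // Split the existing names around the target index and put the new one between.
    val (before, after) = columns.splitAt(index)
    (before :+ name) ++ after
  }
}
```

With the two-column example used in this description, ColumnPosition.insertAt(Seq("value", "newCol1"), "newCol2", 1) gives Seq("value", "newCol2", "newCol1"), while positions 2, -2 and 15 all leave newCol2 at the end.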

This change is backward compatible with old versions.

Consider this data frame with two columns:

val df = sc.parallelize(Seq(1, 2, 3)).toDF.withColumn("newCol1", col("value") + 1)
df.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)

So we can:

1- add a new column without using the parameter atPosition (default behavior):

val newDf = df.withColumn("newCol2", col("value") + 2)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
 |-- newCol2: integer (nullable = true)

2- add a new column using the parameter atPosition with different values:

val newDf = df.withColumn("newCol2", col("value") + 2, 2)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
 |-- newCol2: integer (nullable = true)

val newDf = df.withColumn("newCol2", col("value") + 2, 0)
newDf.printSchema

root
 |-- newCol2: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)

val newDf = df.withColumn("newCol2", col("value") + 2, 1)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol2: integer (nullable = true)
 |-- newCol1: integer (nullable = true)

=> with negative position

val newDf = df.withColumn("newCol2", col("value") + 2, -2)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
 |-- newCol2: integer (nullable = true)

=> with a position greater than the columns number

val newDf = df.withColumn("newCol2", col("value") + 2, 15)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
 |-- newCol2: integer (nullable = true)

How was this patch tested?

This patch is tested with unit tests.


AmplabJenkins · 2018-09-04T17:07:59Z

Can one of the admins verify this patch?

mgaido91 · 2018-09-04T17:08:36Z

I think that if we want to introduce a new method for this, it'd be better to have an atPosition parameter rather than a boolean to choose the location. It'd be more general.

jaceklaskowski · 2018-09-04T19:15:14Z

Why not select($"*", newColumnHere) or select(newColumnHere, $"*")? Somehow I don't think the use case merits overloading withColumn.
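The two select forms suggested here could look like the sketch below. The object SelectPlacement and its method names are hypothetical wrappers added for illustration; only Dataset.select, Column.as and functions.col are existing Spark APIs.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object SelectPlacement {
  // Keep all existing columns and put the new one last: select($"*", newColumnHere).
  def appendColumn(df: DataFrame, name: String, value: Column): DataFrame =
    df.select(col("*"), value.as(name))

  // Put the new column first, followed by all existing ones: select(newColumnHere, $"*").
  def prependColumn(df: DataFrame, name: String, value: Column): DataFrame =
    df.select(value.as(name), col("*"))
}
```

Unlike withColumn, neither form replaces an existing column with the same name; the result would simply contain two columns called name.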

wmellouli · 2018-09-04T21:37:58Z

@mgaido91 Thank you for your suggestion, I updated the PR name, description and sources with a new version using a parameter atPosition instead of a flag atTheEnd. Let me know what you think about this new implementation.

@jaceklaskowski Thank you for your review. Here I'm discussing the withColumn method, which allows adding new columns and/or replacing existing columns with new content. What you suggested does not handle replacing the content of an existing column. The idea is to make the withColumn method more flexible while keeping backward compatibility. Currently the withColumn method is useful for only one use case: adding (or replacing) a column at the end. I changed the implementation following @mgaido91's suggestion to make withColumn useful for more use cases. Let me know what you think about this new implementation.

maropu · 2018-09-04T23:42:48Z

I also can't find a strong reason to add a new API to Dataset... Btw, to add a new API there, it would be better to discuss it in JIRA before opening a PR, I think. cc: @rxin @cloud-fan @HyukjinKwon

HyukjinKwon · 2018-09-05T03:17:37Z

Can't we simply select after the column is added? I wouldn't add this either; it can look confusing, to be honest, IMO.

jaceklaskowski · 2018-09-05T06:05:47Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

    */
-  def withColumn(colName: String, col: Column, atTheEnd: Boolean): DataFrame =
-    withColumns(Seq(colName), Seq(col), atTheEnd)
+  def withColumn(colName: String, col: Column, atPosition: Int): DataFrame =


@since 2.4.0?

jaceklaskowski · 2018-09-05T06:07:21Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

@@ -2226,16 +2226,18 @@ class Dataset[T] private[sql](
    * `column`'s expression must only refer to attributes supplied by this Dataset. It is an
    * error to add a column that refers to some other Dataset.
    *
-    * You can choose to add new columns either at the end (default behavior) or at the beginning.
+    * The position of the new column start from 0, and a negative position means at the end (default behavior).


"starts at 0. Any negative position means to add the column at the end"?

I modified it as you suggested.

jaceklaskowski · 2018-09-05T06:07:42Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala


  /**
   * Returns a new Dataset by adding columns or replacing the existing columns that has
   * the same names.
+   *
+   * The position of new columns start from 0, and a negative position means at the end (default behavior).


Same as above

jaceklaskowski · 2018-09-05T06:08:13Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

-  def withColumn(colName: String, col: Column, atTheEnd: Boolean): DataFrame =
-    withColumns(Seq(colName), Seq(col), atTheEnd)
+  def withColumn(colName: String, col: Column, atPosition: Int): DataFrame =
+    withColumns(Seq(colName), Seq(col), atPosition)

  /**
   * Returns a new Dataset by adding columns or replacing the existing columns that has


jaceklaskowski · 2018-09-05T06:10:14Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

@@ -831,13 +831,21 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
      }.toSeq)
    assert(df.schema.map(_.name) === Seq("key", "value", "newCol"))

-    val df2 = testData.toDF().withColumn("newCol", col("key") + 1, false)
+    val df2 = testData.toDF().withColumn("newCol", col("key") + 1, 0)


What about tests with negative positions?

The negative-position case is covered for both the public method withColumn and the private method withColumns:

https://github.com/apache/spark/pull/22332/files#diff-5d2ebf4e9ca5a990136b276859769289R852

https://github.com/apache/spark/pull/22332/files#diff-5d2ebf4e9ca5a990136b276859769289R907

I'm testing 3 cases (at the same time) that add the new column at the end:

negative position

last position

position greater than columns size

wmellouli · 2018-09-05T08:51:06Z

@jaceklaskowski I refactored the code following the suggestions in your review. Let me know what you think.

wmellouli · 2018-09-05T12:17:48Z

@HyukjinKwon Thank you for your review. To answer your question about using select, take a look at my explanation here to @jaceklaskowski (he asked the same question here).
In addition, I took @mgaido91's suggestion here into consideration. So what do you think about this new version?

Can someone run the tests, please?

HyukjinKwon · 2018-09-06T07:50:00Z

"What you suggested does not manage replacing existing column content."

I think we can still just add a column and select. It will probably need a few lines (or one line) change to reorder the columns ... no?
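A minimal sketch of that "add a column, then select to reorder" workaround follows. The helper name withColumnAt is hypothetical, and moving a replaced column to the requested position is an assumption of this sketch rather than the documented behavior of the PR. It also assumes plain, unique column names without dots or backticks, since functions.col parses dotted names as nested fields.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object WithColumnAt {
  /** Adds (or replaces) `name`, then reorders so it sits at `atPosition` (0-based).
    * Negative or out-of-range positions put the column at the end.
    * Hypothetical helper: not part of the Dataset API.
    */
  def withColumnAt(df: DataFrame, name: String, value: Column, atPosition: Int): DataFrame = {
    // withColumn appends the column, or replaces an existing one with the same name.
    val added = df.withColumn(name, value)
    // All other columns keep their relative order.
    val others = added.columns.filterNot(_ == name).toSeq
    val index =
      if (atPosition < 0 || atPosition > others.length) others.length
      else atPosition
    val (before, after) = others.splitAt(index)
    val ordered = (before :+ name) ++ after
    // A single select performs the reordering.
    added.select(ordered.map(col): _*)
  }
}
```

For the data frame in the PR description, WithColumnAt.withColumnAt(df, "newCol2", col("value") + 2, 1) should yield the column order value, newCol2, newCol1.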

wmellouli · 2018-09-06T08:54:03Z

@HyukjinKwon Even the existing method withColumn(colName: String, col: Column) could be replaced by adding a column with select. The idea of this PR is to add more power/flexibility to the withColumn method to cover more use cases, without affecting performance or backward compatibility.
IMO using withColumn is more natural and hides the adding/replacing + select logic in one operation.

HyukjinKwon · 2018-09-06T09:02:26Z

If that's easily worked around, let's not add this one. There are too many APIs open now and we should rather try to reduce them.

wmellouli · 2018-09-06T09:07:21Z

PR closed: we can use select to add new columns in a user-defined position.

HyukjinKwon · 2018-09-06T09:10:30Z

Thanks, @wmellouli.

rxin · 2018-09-06T15:28:40Z

Thanks guys.


[SPARK-25333][SQL] Ability to add new columns in the beginning of Dat…

f83afe5

…aset

Use atPosition integer instead of the boolean flag atTheEnd

e3ad093

wmellouli changed the title ~~[SPARK-25333][SQL] Ability add new columns in the beginning of Dataset~~ [SPARK-25333][SQL] Ability add new columns in Dataset in a user-defined position Sep 4, 2018

Refactor & fix tests

d6d73fd

jaceklaskowski suggested changes Sep 5, 2018


Refactor comments

935ecbf

jaceklaskowski approved these changes Sep 5, 2018


wmellouli closed this Sep 6, 2018
