Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25333][SQL] Ability add new columns in Dataset in a user-defined position #22332

Closed
wants to merge 4 commits into from

Conversation

wmellouli
Copy link

@wmellouli wmellouli commented Sep 4, 2018

What changes were proposed in this pull request?

When we add new columns in a Dataset, they are added automatically at the end of the Dataset.
Generally users want to add new columns either at the end, in the beginning or in a defined position, depends on use cases.
In my case for example, we add technical columns in the beginning of a Dataset and we add business columns at the end.

This pull request, add the ability to add new columns in a user-defined position of a Dataset, using an optional parameter atPosition that should start from 0:

  • negative value (default behavior) means add the column at the end
  • a position greater than the columns number means add the column at the end

This change is backward compatible with old versions.

Consider this data frame with two columns:

val df = sc.parallelize(Seq(1, 2, 3)).toDF.withColumn("newCol1", col("value") + 1)
df.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)

So we can:

1- add a new column without using the parameter atPosition (default behavior):

val newDf = df.withColumn("newCol2", col("value") + 2)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
 |-- newCol2: integer (nullable = true)

2- add a new column using the parameter atPosition with different values:

val newDf = df.withColumn("newColumn", col("value") + 2, 2)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
 |-- newCol2: integer (nullable = true)
val newDf = df.withColumn("newColumn", col("value") + 2, 0)
newDf.printSchema

root
 |-- newCol2: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
val newDf = df.withColumn("newColumn", col("value") + 2, 1)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol2: integer (nullable = true)
 |-- newCol1: integer (nullable = true)

=> with negative position

val newDf = df.withColumn("newColumn", col("value") + 2, -2)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
 |-- newCol2: integer (nullable = true)

=> with a position greater than the columns number

val newDf = df.withColumn("newColumn", col("value") + 2, 15)
newDf.printSchema

root
 |-- value: integer (nullable = true)
 |-- newCol1: integer (nullable = true)
 |-- newCol2: integer (nullable = true)

How was this patch tested?

This patch is tested with unit tests.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@mgaido91
Copy link
Contributor

mgaido91 commented Sep 4, 2018

I think that if we want to introduce a new method for this, it'd be better to have a atPosition parameter, rather than a boolean to chose the location. It'd be more general.

@jaceklaskowski
Copy link
Contributor

Why not select($"*", newColumnHere) or select(newColumnHere, $"*")? Somehow I don't think the use case merits overloading withColumn.

@wmellouli wmellouli changed the title [SPARK-25333][SQL] Ability add new columns in the beginning of Dataset [SPARK-25333][SQL] Ability add new columns in Dataset in a user-defined position Sep 4, 2018
@wmellouli
Copy link
Author

@mgaido91 Thank you for your suggestion, I updated the PR name, description and sources with a new version using a parameter atPosition instead of a flag atTheEnd. Let me know what you think about this new implementation.

@jaceklaskowski Thank your for your review. But here I'm discussing about the withColumn method that allows adding and/or replacing existing columns with new column content. What you suggested does not manage replacing existing column content. The idea is to make the withColumn method more flexible with keeping the backward compatibility. Actually the withColumn method is useful only for one use case: add (maybe replace) column at the end. I changed the implementation with @mgaido91 suggestion to make withColumn useful for more use cases. Let me know what you think about this new implementation.

@maropu
Copy link
Member

maropu commented Sep 4, 2018

I also can't find a strong reason to append a new API in Dataset... btw, to add a new API there, you'd be better to discuss in jira before making a pr, I think. cc: @rxin @cloud-fan @HyukjinKwon

@HyukjinKwon
Copy link
Member

Can't we simply select after the the column is added? I wouldn't add this as well - it can look confusing to be honest IMO.

*/
def withColumn(colName: String, col: Column, atTheEnd: Boolean): DataFrame =
withColumns(Seq(colName), Seq(col), atTheEnd)
def withColumn(colName: String, col: Column, atPosition: Int): DataFrame =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@since 2.4.0?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@@ -2226,16 +2226,18 @@ class Dataset[T] private[sql](
* `column`'s expression must only refer to attributes supplied by this Dataset. It is an
* error to add a column that refers to some other Dataset.
*
* You can choose to add new columns either at the end (default behavior) or at the beginning.
* The position of the new column start from 0, and a negative position means at the end (default behavior).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"starts at 0. Any negative position means to add the column at the end"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified as you suggested


/**
* Returns a new Dataset by adding columns or replacing the existing columns that has
* the same names.
*
* The position of new columns start from 0, and a negative position means at the end (default behavior).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

def withColumn(colName: String, col: Column, atTheEnd: Boolean): DataFrame =
withColumns(Seq(colName), Seq(col), atTheEnd)
def withColumn(colName: String, col: Column, atPosition: Int): DataFrame =
withColumns(Seq(colName), Seq(col), atPosition)

/**
* Returns a new Dataset by adding columns or replacing the existing columns that has
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/has/have

@@ -831,13 +831,21 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
}.toSeq)
assert(df.schema.map(_.name) === Seq("key", "value", "newCol"))

val df2 = testData.toDF().withColumn("newCol", col("key") + 1, false)
val df2 = testData.toDF().withColumn("newCol", col("key") + 1, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about tests with negative positions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test with negative position was covered for the public method withColumn and the private method withColumns:

I'm testing 3 cases (in the same time) that add new column at the end:

  • negative position
  • last position
  • position greater than columns size

@wmellouli
Copy link
Author

@jaceklaskowski I refactored with what you suggested in your review. Let me know what you think.

@wmellouli
Copy link
Author

@HyukjinKwon Thank you for your review. To answer to your question about using select, take a look at my explaination here to @jaceklaskowski (he asked about the same question here).
In addition I took into consideration @mgaido91 suggestion here. So what do you think about this new version ?

Someone can run tests please ?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 6, 2018

What you suggested does not manage replacing existing column content.

I think we can still just add a column and select. It will probably need a few lines (or one line) change to reorder the columns ... no?

@wmellouli
Copy link
Author

@HyukjinKwon even instead of using the actual method withColumn(colName: String, col: Column) we can just add a column and select. The idea from this PR is to add more power/flexibility to withColumn method to cover more use cases, without affecting performance or backward compatibility.
IMO using withColumn is more natural and hides adding/replacing + select logic, in one operation.

@HyukjinKwon
Copy link
Member

If that's easily worked around, let's not add this one. There are too many APIs open now and we should rather try to reduce them.

@wmellouli
Copy link
Author

PR closed: we can use select to add new columns in a user-defined position.

@wmellouli wmellouli closed this Sep 6, 2018
@HyukjinKwon
Copy link
Member

Thanks, @wmellouli.

@rxin
Copy link
Contributor

rxin commented Sep 6, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants