[SPARK-39895][SQL][PYTHON] Support multiple column drop #37335
* SPARK-39895 pyspark support multiple column drop
This PR seems to over-claim. Apache Spark already supports multi-column drop like the following. Please be more specific about your contribution.
>>> spark.version
'3.2.2'
>>> df = spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"])
>>> df.drop("name", "age").show()
+------+
|active|
+------+
|     Y|
|     Y|
+------+
@dongjoon-hyun The JIRA ticket contains a reproducible example. I will update the description on this PR for convenience! The patch is related to
Can one of the admins verify this patch?
Looks pretty good. Let me take a closer look tmr.
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
@zhengruifeng Thanks for your review, I have addressed your comments.
@HyukjinKwon would you like to take another look?
R/pkg/R/DataFrame.R
@@ -3593,17 +3593,27 @@ setMethod("str",
#' drop(df, "col1")
#' drop(df, c("col1", "col2"))
#' drop(df, df$col1)
#' drop(df, list(df$col1, df$col2))
#' }
#' @note drop since 2.0.0
setMethod("drop",
          signature(x = "SparkDataFrame"),
          function(x, col) {
This should probably use a different signature with `...`. Feel free to remove this in this PR, and do it in another PR if you're not used to this.
LGTM otherwise.
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Merged to master. Thanks.
Thank you, @santosh-d3vpl3x, @HyukjinKwon, @zhengruifeng.
What changes were proposed in this pull request?
The PySpark DataFrame drop method has the following signature:
def drop(self, *cols: "ColumnOrName") -> "DataFrame":
However, when we try to pass multiple Column objects to drop, it raises a TypeError:
each col in the param list should be a string
Minimal reproducible example:
It outputs the following:
Why are the changes needed?
Given its type annotation, we expect a drop call on a DataFrame to handle multiple columns of either kind (names or Column objects), but that is not the case.
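The gap between the annotation and the runtime check can be illustrated without a Spark session. The following is a minimal sketch, not the PySpark source: `Column` is a hypothetical stand-in class, `check_drop_args` is an invented helper name, and the single-argument branch (including its error message) is an assumption about how a lone Column is treated. Only the multi-argument error message is taken from the description above.

```python
from typing import Union


class Column:
    """Hypothetical stand-in for pyspark.sql.Column; it only carries a name."""

    def __init__(self, name: str) -> None:
        self.name = name


# Mirrors the annotation in the signature above: each argument may be either kind.
ColumnOrName = Union[Column, str]


def check_drop_args(*cols: ColumnOrName) -> None:
    """Simplified imitation of the pre-patch argument check (an assumption)."""
    if len(cols) == 1:
        # Assumed: a single argument may be a name or a Column.
        if not isinstance(cols[0], (str, Column)):
            raise TypeError("col should be a string or a Column")
    else:
        # With several arguments, only names pass the check.
        for col in cols:
            if not isinstance(col, str):
                raise TypeError("each col in the param list should be a string")


check_drop_args("name", "age")   # several names: accepted
check_drop_args(Column("name"))  # one Column: accepted under the assumption above

try:
    check_drop_args(Column("name"), Column("age"))  # several Columns: rejected
except TypeError as exc:
    message = str(exc)
    print(message)
```

The annotation promises `ColumnOrName` for every argument, yet in this sketch the multi-argument branch narrows that to `str`, which is the inconsistency the description reports.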
Does this PR introduce any user-facing change?
Yes, it fixes issues related to type conformance in the PySpark API.
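From the user's side, a drop that conforms to its annotation would accept any mix of names and Column objects. One way to get there, sketched here as an assumption and not as the code this PR merged, is to sort the arguments by type before handing them on. `Column` is again a hypothetical stand-in and `split_drop_args` is an invented helper name.

```python
from typing import List, Tuple, Union


class Column:
    """Hypothetical stand-in for pyspark.sql.Column; it only carries a name."""

    def __init__(self, name: str) -> None:
        self.name = name


def split_drop_args(*cols: Union[Column, str]) -> Tuple[List[str], List[Column]]:
    """Partition drop arguments into column names and Column objects."""
    names: List[str] = []
    columns: List[Column] = []
    for col in cols:
        if isinstance(col, str):
            names.append(col)
        elif isinstance(col, Column):
            columns.append(col)
        else:
            # Anything else violates the ColumnOrName annotation.
            raise TypeError(
                f"each col should be a string or a Column, got {type(col).__name__}"
            )
    return names, columns


names, columns = split_drop_args("name", Column("age"), "active")
print(names)                      # the string arguments, in order
print([c.name for c in columns])  # the Column arguments, in order
```

Keeping the two groups separate lets each be forwarded to whichever underlying overload handles it, while any other argument type still fails early with a TypeError.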
How was this patch tested?
Added the missing tests for regression testing. The CI pipeline on the fork and the CI here will run them.