Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39895][SQL][PYTHON] Support multiple column drop #37335

Closed
wants to merge 17 commits into from

Conversation

santosh-d3vpl3x
Copy link
Contributor

@santosh-d3vpl3x santosh-d3vpl3x commented Jul 28, 2022

What changes were proposed in this pull request?

Pyspark dataframe drop has following signature:
def drop(self, *cols: "ColumnOrName") -> "DataFrame":

However when we try to pass multiple Column types to drop function it raises TypeError

each col in the param list should be a string

Minimal reproducible example:

values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
df = spark.createDataFrame(values, "id string, point int, count int")
df.drop(df.point, df.count)

It spits out following:

/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py in drop(self, *cols)
2537 for col in cols:
2538 if not isinstance(col, str):
-> 2539 raise TypeError("each col in the param list should be a string")
2540 jdf = self._jdf.drop(self._jseq(cols))
2541

TypeError: each col in the param list should be a string

Why are the changes needed?

We expect that multiple columns can be handled by drop call on df because of its typing but that is not the case.

Does this PR introduce any user-facing change?

Yes, fixes issues related type confirmation in pyspark api

How was this patch tested?

Added missing tests for regression testing. CI Pipeline on fork and CI here will test them.

* SPARK-39895 pyspark support multiple column drop

* SPARK-39895 pyspark support multiple column drop
@santosh-d3vpl3x santosh-d3vpl3x changed the title SPARK-39895 pyspark support multiple column drop [SPARK-39895][PySpark] support multiple column drop Jul 28, 2022
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-39895][PySpark] support multiple column drop [SPARK-39895][PYTHON] Support multiple column drop Jul 28, 2022
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR seems to over-claim. Apache Spark already supports multi-column drop like the following. Please be more specific about your contribution.

>>> spark.version
'3.2.2'
>>> df = spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"])
>>> df.drop("name", "age").show()
+------+
|active|
+------+
|     Y|
|     Y|
+------+

@santosh-d3vpl3x
Copy link
Contributor Author

santosh-d3vpl3x commented Jul 29, 2022

This PR seems to over-claim. Apache Spark already supports multi-column drop like the following. Please be more specific about your contribution.

>>> spark.version
'3.2.2'
>>> df = spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"])
>>> df.drop("name", "age").show()
+------+
|active|
+------+
|     Y|
|     Y|
+------+

@dongjoon-hyun JIRA ticket contains reproducible example. I will update the description on this PR for convenience! The patch is related to df.drop(Column*) and not df.drop(str*).

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@github-actions github-actions bot added the R label Jul 29, 2022
@HyukjinKwon
Copy link
Member

Looks pretty good. Let me take a closer look tmr.

@zhengruifeng zhengruifeng changed the title [SPARK-39895][PYTHON] Support multiple column drop [SPARK-39895][SQL][PYTHON][R] Support multiple column drop Aug 5, 2022
@santosh-d3vpl3x
Copy link
Contributor Author

@zhengruifeng Thanks for your review, I have addressed your comments.

@zhengruifeng
Copy link
Contributor

Looks pretty good. Let me take a closer look tmr.

@HyukjinKwon would you like to take another look?

@@ -3593,17 +3593,27 @@ setMethod("str",
#' drop(df, "col1")
#' drop(df, c("col1", "col2"))
#' drop(df, df$col1)
#' drop(df, list(df$col1, df$col2))
#' }
#' @note drop since 2.0.0
setMethod("drop",
signature(x = "SparkDataFrame"),
function(x, col) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably use a different siganture with .... Feel free to remove this in this PR, and do it in another PR if you're not used to this.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise.

@santosh-d3vpl3x santosh-d3vpl3x changed the title [SPARK-39895][SQL][PYTHON][R] Support multiple column drop [SPARK-39895][SQL][PYTHON] Support multiple column drop Aug 10, 2022
@HyukjinKwon
Copy link
Member

Merged to master.

Thanks.

@dongjoon-hyun
Copy link
Member

Thank you, @santosh-d3vpl3x , @HyukjinKwon , @zhengruifeng .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants