[SPARK-39895][SQL][PYTHON] Support multiple column drop #37335

santosh-d3vpl3x · 2022-07-28T19:41:35Z

What changes were proposed in this pull request?

PySpark's DataFrame.drop has the following signature:
def drop(self, *cols: "ColumnOrName") -> "DataFrame":

However, passing multiple Column arguments to drop raises a TypeError:

each col in the param list should be a string

Minimal reproducible example:

values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
df = spark.createDataFrame(values, "id string, point int, count int")
# df.count resolves to the DataFrame.count method, so the column is accessed as df["count"]
df.drop(df.point, df["count"])

It produces the following traceback:

/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py in drop(self, *cols)
2537 for col in cols:
2538 if not isinstance(col, str):
-> 2539 raise TypeError("each col in the param list should be a string")
2540 jdf = self._jdf.drop(self._jseq(cols))
2541

TypeError: each col in the param list should be a string
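For illustration only, here is a minimal sketch of the kind of argument handling that would accept a mix of names and Column objects. It is not the patch itself: `StandInColumn` and `split_drop_args` are hypothetical names, and the stand-in class replaces `pyspark.sql.Column` so that the sketch runs without a Spark session.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class StandInColumn:
    """Hypothetical placeholder for pyspark.sql.Column (assumption, not the real class)."""

    name: str


def split_drop_args(
    *cols: Union[str, StandInColumn]
) -> Tuple[List[str], List[StandInColumn]]:
    """Partition drop() arguments into column names and column objects."""
    names: List[str] = []
    columns: List[StandInColumn] = []
    for col in cols:
        if isinstance(col, str):
            names.append(col)
        elif isinstance(col, StandInColumn):
            columns.append(col)
        else:
            # Reject anything that is neither a name nor a column object.
            raise TypeError(
                f"each col should be a string or a Column, got {type(col).__name__}"
            )
    return names, columns


names, columns = split_drop_args("id", StandInColumn("point"), StandInColumn("count"))
print(names)
print(columns)
```

Splitting the arguments first lets each group be forwarded separately, instead of rejecting every non-string as the loop in the traceback does.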

Why are the changes needed?

The type annotation (*cols: "ColumnOrName") suggests that drop accepts multiple Column arguments, but at runtime passing more than one Column raises a TypeError.

Does this PR introduce any user-facing change?

Yes. It fixes a mismatch between the type annotation and the runtime behavior of drop in the PySpark API.

How was this patch tested?

Added the missing regression tests. The CI pipeline on the fork and the CI here will run them.

* SPARK-39895 pyspark support multiple column drop

python/pyspark/sql/dataframe.py

python/pyspark/sql/tests/test_dataframe.py

dongjoon-hyun

This PR seems to over-claim. Apache Spark already supports multi-column drop like the following. Please be more specific about your contribution.

>>> spark.version
'3.2.2'
>>> df = spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"])
>>> df.drop("name", "age").show()
+------+
|active|
+------+
|     Y|
|     Y|
+------+

python/pyspark/sql/dataframe.py

santosh-d3vpl3x · 2022-07-29T08:48:54Z

This PR seems to over-claim. Apache Spark already supports multi-column drop like the following. Please be more specific about your contribution.
>>> spark.version
'3.2.2'
>>> df = spark.createDataFrame([("A", 50, "Y"), ("B", 60, "Y")], ["name", "age", "active"])
>>> df.drop("name", "age").show()
+------+
|active|
+------+
|     Y|
|     Y|
+------+

@dongjoon-hyun The JIRA ticket contains a reproducible example. I will update the description of this PR for convenience! The patch concerns df.drop(Column*), not df.drop(str*).
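To make the distinction concrete, here is a hedged sketch of a workaround for versions where drop rejects several Column arguments: drop one argument per call. `drop_each` and `StandInFrame` are hypothetical names introduced for this example. The stand-in only mimics a frame whose drop takes a single argument, so the sketch runs without Spark.

```python
from functools import reduce


def drop_each(df, *cols):
    """Drop columns one call at a time, so each drop() call receives a single argument."""
    return reduce(lambda acc, col: acc.drop(col), cols, df)


class StandInFrame:
    """Hypothetical stand-in for a DataFrame whose drop() accepts one argument only."""

    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, col):
        # Return a new frame without the given column, leaving this one untouched.
        return StandInFrame(c for c in self.columns if c != col)


result = drop_each(StandInFrame(["id", "point", "count"]), "point", "count")
print(result.columns)  # ['id']
```

With a real PySpark DataFrame the equivalent call would presumably be drop_each(df, df.point, df["count"]), relying on drop accepting a single Column per call.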

AmplabJenkins · 2022-07-29T09:39:17Z

Can one of the admins verify this patch?

HyukjinKwon · 2022-07-31T05:00:20Z

Looks pretty good. Let me take a closer look tmr.

stale.

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

python/pyspark/sql/tests/test_dataframe.py

santosh-d3vpl3x · 2022-08-08T14:26:50Z

@zhengruifeng Thanks for your review, I have addressed your comments.

zhengruifeng · 2022-08-09T03:17:04Z

Looks pretty good. Let me take a closer look tmr.

@HyukjinKwon would you like to take another look?

HyukjinKwon · 2022-08-09T03:27:48Z

R/pkg/R/DataFrame.R

@@ -3593,17 +3593,27 @@ setMethod("str",
 #' drop(df, "col1")
 #' drop(df, c("col1", "col2"))
 #' drop(df, df$col1)
+#' drop(df, list(df$col1, df$col2))
 #' }
 #' @note drop since 2.0.0
 setMethod("drop",
          signature(x = "SparkDataFrame"),
          function(x, col) {


This should probably use a different signature with "...". Feel free to remove this from this PR and do it in another PR if you're not used to this.

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

HyukjinKwon

LGTM otherwise.

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>

HyukjinKwon · 2022-08-11T03:13:47Z

Merged to master.

Thanks.

dongjoon-hyun · 2022-08-11T04:17:04Z

Thank you, @santosh-d3vpl3x , @HyukjinKwon , @zhengruifeng .

SPARK-39895 pyspark support multiple column drop

f05b4b8


github-actions bot added CORE PYTHON SQL labels Jul 28, 2022

santosh-d3vpl3x closed this Jul 28, 2022

santosh-d3vpl3x reopened this Jul 28, 2022

santosh-d3vpl3x changed the title ~~SPARK-39895 pyspark support multiple column drop~~ [SPARK-39895][PySpark] support multiple column drop Jul 28, 2022

santosh-d3vpl3x added 4 commits July 28, 2022 21:49

SPARK-39895 pyspark support multiple column drop

38d1830

SPARK-39895 Reduce length of line

70f04c3

SPARK-39895 Remove spaces from empty line

1189f23

Add mypy fix for union-attr

bf7b020

dongjoon-hyun changed the title ~~[SPARK-39895][PySpark] support multiple column drop~~ [SPARK-39895][PYTHON] Support multiple column drop Jul 28, 2022

dongjoon-hyun reviewed Jul 28, 2022

python/pyspark/sql/dataframe.py (outdated)

santosh-d3vpl3x added 2 commits July 29, 2022 00:51

SPARK-39895 Restore cleaned up code

26d908d

SPARK-39895 Cleanup

b571574

santosh-d3vpl3x requested a review from dongjoon-hyun July 28, 2022 22:54

dongjoon-hyun reviewed Jul 29, 2022

python/pyspark/sql/tests/test_dataframe.py

dongjoon-hyun previously requested changes Jul 29, 2022

HyukjinKwon reviewed Jul 29, 2022

python/pyspark/sql/dataframe.py (outdated)

santosh-d3vpl3x requested review from HyukjinKwon and dongjoon-hyun July 29, 2022 09:09

SPARK-39895 Apply suggestions from PR

950010b

github-actions bot added the R label Jul 29, 2022

santosh-d3vpl3x added 3 commits July 29, 2022 21:54

SPARK-39895 Fix test name duplications

285d166

SPARK-39895 Introduce drop(Column) back to ensure binary compat

933d4a9

SPARK-39895 Avoid unnecessary Seq(Column)

ccad1ce

zhengruifeng approved these changes Aug 5, 2022

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (outdated)

python/pyspark/sql/tests/test_dataframe.py (outdated)

zhengruifeng changed the title ~~[SPARK-39895][PYTHON] Support multiple column drop~~ [SPARK-39895][SQL][PYTHON][R] Support multiple column drop Aug 5, 2022

santosh-d3vpl3x added 2 commits August 7, 2022 22:40

SPARK-39895 address PR comments

4527356

Merge branch 'apache:master' into master

c292198

santosh-d3vpl3x requested a review from zhengruifeng August 8, 2022 14:22

zhengruifeng approved these changes Aug 9, 2022

HyukjinKwon reviewed Aug 9, 2022

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (outdated)

HyukjinKwon approved these changes Aug 9, 2022

santosh-d3vpl3x and others added 3 commits August 9, 2022 01:41

Update sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

89821e6

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>

Merge branch 'apache:master' into master

b0883a9

SPARK-39895 Seperate out R changes for another PR

83dc275

santosh-d3vpl3x changed the title ~~[SPARK-39895][SQL][PYTHON][R] Support multiple column drop~~ [SPARK-39895][SQL][PYTHON] Support multiple column drop Aug 10, 2022

Apply suggestions from code review

3993318

HyukjinKwon closed this in 69f402a Aug 11, 2022
