Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41169][CONNECT][PYTHON] Implement DataFrame.drop #38686

Closed
wants to merge 2 commits into from

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Implement DataFrame.drop with a proto message

Why are the changes needed?

for api coverage

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

added UT

// (Required) columns to drop.
//
// Should contain at least 1 item.
repeated Expression cols = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does drop actually support arbitrary expressions? Shouldn't this be a repeated unresolved attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataset.drop takes arbitrary expressions into account

val expressions = (for (col <- allColumns) yield col match {
case Column(u: UnresolvedAttribute) =>
queryExecution.analyzed.resolveQuoted(
u.name, sparkSession.sessionState.analyzer.resolver).getOrElse(u)
case Column(expr: Expression) => expr
})

val sparkPlan2 = sparkTestRelation.drop("id", "name")
comparePlans(connectPlan2, sparkPlan2)

// non-existing column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you treat the dropped columns as expressions we need to add a negative test for unsupported expressions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked the implementation of Dataset.drop, it supports all kinds of expressions, a expression will be just ignored if it doesn't semanticEquals the columns in current dataframe.

@@ -255,10 +255,21 @@ def distinct(self) -> "DataFrame":
)

def drop(self, *cols: "ColumnOrString") -> "DataFrame":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting case where one could argue for implementing the behavior on the client side instead of the server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, there will be two RPC if we implement it on the client side
1, all_cols = self.columns to fetch the schema;
2, build the plan

with a dedicated proto mesage, we only need one RPC.

Copy link
Contributor

@amaliujia amaliujia Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have built the consensus that we prefer re-using the proto than ask clients do duplicate work.

// (Required) columns to drop.
//
// Should contain at least 1 item.
repeated Expression cols = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if the name should be more explicit like "dropped"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here follows the naming in def drop(col: Column, cols: Column*): DataFrame

assert(rel.getColsCount > 0, s"cols must contains at least 1 item!")

val cols = rel.getColsList.asScala.toArray.map { expr =>
Column(transformExpression(expr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should verify supported types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean verify for the arrow-based collect?
since we will remove the json code path, it always fails if there are unsupported types.

@zhengruifeng
Copy link
Contributor Author

I will update this PR after the arrow-based collect is fixed #38706 , otherwise, some e2e tests will fail

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the connect_df_drop branch November 22, 2022 06:53
@amaliujia
Copy link
Contributor

LGTM

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?
Implement `DataFrame.drop` with a proto message

### Why are the changes needed?
for api coverage

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
added UT

Closes apache#38686 from zhengruifeng/connect_df_drop.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
### What changes were proposed in this pull request?
Implement `DataFrame.drop` with a proto message

### Why are the changes needed?
for api coverage

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
added UT

Closes apache#38686 from zhengruifeng/connect_df_drop.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
### What changes were proposed in this pull request?
Implement `DataFrame.drop` with a proto message

### Why are the changes needed?
for api coverage

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
added UT

Closes apache#38686 from zhengruifeng/connect_df_drop.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants