Add replaceWhere functionality #1957

Closed
MrPowers opened this issue Dec 11, 2023 · 10 comments

@MrPowers
Collaborator

Description

PySpark has a cool replaceWhere option that lets you overwrite only the existing data in a Delta table that matches a predicate, replacing it with new data. Here's an example of the replaceWhere functionality:

df2 = spark.createDataFrame(
    [
        ("x", 7),
        ("y", 8),
        ("z", 9),
    ]
).toDF("letter", "number")

(
    df2.write.format("delta")
    .option("replaceWhere", "number >= 2")
    .mode("overwrite")
    .save("tmp/my_data")
)
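
To make the semantics concrete, here is a minimal pure-Python sketch of what a replaceWhere-style overwrite does: rows matching the predicate are dropped and the new rows are written in their place. The `replace_where` helper and the `existing` rows are hypothetical, invented only for illustration. The check that every incoming row satisfies the predicate mirrors what Delta Lake on Spark does by default as far as I know; whether another implementation enforces it is an assumption.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def replace_where(
    existing: List[Row],
    new_rows: List[Row],
    predicate: Callable[[Row], bool],
) -> List[Row]:
    """Hypothetical helper: emulate a replaceWhere overwrite on in-memory rows."""
    # Assumption: incoming rows must themselves match the predicate,
    # otherwise the overwrite is rejected.
    offending = [row for row in new_rows if not predicate(row)]
    if offending:
        raise ValueError(f"new rows do not match the predicate: {offending}")

    # Keep only the rows that the predicate does NOT select...
    kept = [row for row in existing if not predicate(row)]
    # ...and write the new rows in place of the ones that were dropped.
    return kept + list(new_rows)


# Hypothetical prior contents of tmp/my_data.
existing = [
    {"letter": "a", "number": 1},
    {"letter": "b", "number": 2},
    {"letter": "c", "number": 3},
]
new_rows = [
    {"letter": "x", "number": 7},
    {"letter": "y", "number": 8},
    {"letter": "z", "number": 9},
]

result = replace_where(existing, new_rows, lambda row: row["number"] >= 2)
print(result)
```

With these made-up inputs, only the row with `number` 1 survives from the old data; the rows with `number` 2 and 3 are replaced by the three new rows.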

What do folks think about adding replaceWhere functionality to the Python deltalake package?

It's possible that the Rust predicate argument in write_deltalake already exposes this functionality.

@MrPowers MrPowers added the enhancement New feature or request label Dec 11, 2023
@ion-elgreco
Collaborator

I exposed the predicate parameter for the Rust engine writer, but it currently does nothing because the functionality is not built in Rust yet.

@r3stl355
Contributor

take

@r3stl355
Contributor

I'll give this a try

@r3stl355
Contributor

WriteBuilder uses predicate: Option<String> but has no implementation for it yet, whereas DeleteBuilder uses predicate: Option<Expression>. I suggest harmonising by changing WriteBuilder to use predicate: Option<Expression>. Although this is a breaking change, predicate handling is not implemented in WriteBuilder, so changing the type should not cause issues.

@roeap
Collaborator

roeap commented Dec 23, 2023

It would be great to do this using logical expressions rather than the physical ones - much like @Blajda recently updated for merge. The good thing there is that we get some type coercion for free, which has been a hassle with expressions.

In Python we will likely have to accept strings and do the parsing.

@ion-elgreco
Collaborator

@roeap I think we can start allowing Arrow expressions as input, which we can serialize as Substrait and then deserialize with datafusion-substrait.

@roeap
Collaborator

roeap commented Dec 23, 2023

This would be a great goal, but I would say let's be consistent about it and make a deliberate API choice.

I.e., not have Substrait supported in one method but not the other...

The good news is that Substrait plans are of course logical plans :)

@r3stl355
Contributor

I'll try that @roeap. As for

It would be great to do this using logical expressions rather than the physical ones - much like @Blajda recently updated for merge.

is David's PR the one you are referring to? #1969

@ion-elgreco
Collaborator

@roeap we should be able to add this to merge, update, delete, and write, and then just add the conversion inside the pyo3 binding, so it's a Python-only feature.

@roeap
Collaborator

roeap commented Dec 23, 2023

@r3stl355 it's #1720, which had been up for a while before it got merged.

@ion-elgreco - sure, to get started, and as you said, right now this could just be internal. Substrait is a nice feature for Rust as well, of course as an alternative path, since we are looking to integrate into DataFusion's internal planning.

ion-elgreco added a commit that referenced this issue Jan 31, 2024
# Description
First/naive implementation of `replaceWhere` for `write`. Code compiles
and there is a test to verify the outcome. I would appreciate any
feedback on improving the structure/implementation. For example, I
copied the part of code from `delete` operation because there is no way
to call that code in `delete` directly from `write` - should I look into
extracting that code from `delete` to somewhere central?

Seems to also work with partition columns.

# Related Issue(s)
#1957

# Documentation
Added a section in docs

---------

Signed-off-by: Nikolay Ulmasov <ulmasov@hotmail.com>
Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
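
For readers on the Python side, a rough sketch of how the PySpark example from the issue description might translate to `write_deltalake` once the predicate is honoured. This is an assumption-laden illustration, not the documented usage: the `overwrite_matching_rows` wrapper is hypothetical, the table at `table_uri` is assumed to already exist, and the `engine="rust"` argument follows the discussion above (releases that no longer accept an `engine` argument would need it removed).

```python
def overwrite_matching_rows(table_uri: str) -> None:
    """Hypothetical wrapper: overwrite only the rows where number >= 2."""
    # Imported lazily so the sketch can be loaded without the optional
    # dependencies installed.
    import pyarrow as pa
    from deltalake import write_deltalake

    # Same data as the PySpark example in the issue description.
    new_data = pa.table(
        {
            "letter": ["x", "y", "z"],
            "number": [7, 8, 9],
        }
    )

    write_deltalake(
        table_uri,
        new_data,
        mode="overwrite",
        # SQL-style string predicate, as in the replaceWhere option.
        predicate="number >= 2",
        # Assumption: the predicate is only handled by the Rust engine writer,
        # per the thread; drop this argument if your release rejects it.
        engine="rust",
    )
```

Calling `overwrite_matching_rows("tmp/my_data")` would then be the counterpart of the `.option("replaceWhere", "number >= 2")` write shown at the top of the thread.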
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this issue Feb 2, 2024