Skip to content

Conversation

@eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Jul 1, 2024

What changes were proposed in this pull request?

This patch makes ii possible to mark scala UDFs as foldable. UDFs marked as foldable will be constant folded when when it is deterministic and all the children are foldable.

Why are the changes needed?

By allowing the expression to be folded we can allow for more optimizations. Zero argument UDFs are commonly used for literals of custom types.

Does this PR introduce any user-facing change?

Yes, it allows scala UDFs to be constant folded during analysis.

How was this patch tested?

New tests in ConstantFoldingSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 1, 2024
@eejbyfeldt eejbyfeldt force-pushed the SPARK-48769-scala-udf-constant-folding branch from b237c81 to 4b55bf2 Compare July 1, 2024 12:54
@github-actions github-actions bot added the AVRO label Jul 1, 2024
@eejbyfeldt eejbyfeldt marked this pull request as ready for review July 1, 2024 19:46
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem we made ScalaUDF as deterministic by default so far, and we will have a lot of problems in existing workloads by executing the UDFs in driver side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we ever made them non-determinstic by default, it would have been much easier to support foldability ..

Copy link
Contributor Author

@eejbyfeldt eejbyfeldt Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon

I think the problem we made ScalaUDF as deterministic by default so far, and we will have a lot of problems in existing workloads by executing the UDFs in driver side.

To me deterministic seems like the correct default, I would expect most udfs to be deterministic. So chainging the default seems like a more invasive change.

The concerns you raise are around the belief that people are using the existing API incorrectly. Do we have some more concrete evidence that this is the case? Because this seems like an argument that could be used against a lot of potential improvements and to me it does not seems like a rule we would like to follow in the long run.

My understanding it also that having a UDF with incorrect value for deterministic would already open you up to correctness issues due to other optimizer rules? So maybe that leads people to already specifying the correct nullability for their udfs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user defines a UDF that requires something in executors, that'd be a problem now as the UDF runs in driver.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, if the UDF runs something very heavy, it will stop the Catalyst optimizer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, if the UDF runs something very heavy, it will stop the Catalyst optimizer.

To me this argument sounds a bit backwards. The heavier the UDF is the more gain it is from executing it once during planning instead of executing it multiple times during execution (worst case we are going to execute it once per row).

If user defines a UDF that requires something in executors, that'd be a problem now as the UDF runs in driver.

Ok, this does sound like somthing that could cause breakage. Not sure how common that would be in practice b though.

I changed the PR to make it opt in per UDF to make it foldable. I guess that should address most/all of the concerns you brought up with regards of it being a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon any chance we could modify this change to make it acceptable or should it just be abandoned?

I also notice that we will attempt to execute expressions and therefore UDFs as part of EliminateOuterJoins. There errors are ignored. We could extend the ConstantFolding rule to the same, ignore errors from expressions containing UDFs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found that we will also execute udfs of part of ConvertToLocalRelation

@eejbyfeldt eejbyfeldt force-pushed the SPARK-48769-scala-udf-constant-folding branch from bb0725b to 086e48b Compare August 6, 2024 06:44
@github-actions github-actions bot added CONNECT and removed AVRO labels Aug 6, 2024
@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Nov 15, 2024
@github-actions github-actions bot closed this Nov 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants