SPJ joins in the outer join component of MERGE queries #8387

Closed
dev-goyal opened this issue Aug 24, 2023 · 4 comments
@dev-goyal

Query engine

Spark on EMR

Question

In a MERGE query, is there a reason the inner join (WHEN MATCHED) is planned as a storage-partitioned join (SPJ), but the left outer join (WHEN NOT MATCHED) is planned as a ShuffledHashJoin? I can post an example, but I'm wondering whether this is expected behaviour for some reason.

@RussellSpitzer
Member

If I remember correctly, the root of the issue is this. To do an SPJ, we first align the two sides based on their partitions:

[ Source Partition ] === [Destination Partition]

But sometimes this pairing is badly balanced, so we want to be able to break one of the two sides into multiple pieces. To do that, we split one side into multiple parts and pair each part with the full partition from the other side:

[Source partition] === [Destination Partition Part 1]
[Source partition] === [Destination Partition Part 2]
[Source partition] === [Destination Partition Part 3]
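
As a rough illustration of the diagram above, here is a minimal pure-Python sketch of how tasks might be planned. This is not Spark or Iceberg code: `plan_tasks`, `group_by_partition`, and `max_rows_per_task` are hypothetical names, and rows are modelled as `(partition_value, key)` tuples purely for the example.

```python
from collections import defaultdict


def group_by_partition(rows, partition_of):
    """Group rows by their partition value (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_of(row)].append(row)
    return dict(groups)


def plan_tasks(source_rows, dest_rows, partition_of, max_rows_per_task):
    """Pair source and destination rows by partition value.

    A destination partition larger than max_rows_per_task is split into
    several parts, and each part is paired with the *whole* source partition.
    """
    src = group_by_partition(source_rows, partition_of)
    dst = group_by_partition(dest_rows, partition_of)
    tasks = []
    # Only partition values present on both sides are paired in this sketch.
    for part in sorted(src.keys() & dst.keys()):
        dest_part = dst[part]
        for start in range(0, len(dest_part), max_rows_per_task):
            chunk = dest_part[start:start + max_rows_per_task]
            tasks.append((part, src[part], chunk))
    return tasks


source = [("a", 1), ("a", 2)]
dest = [("a", 10), ("a", 11), ("a", 12), ("a", 13), ("a", 14)]
tasks = plan_tasks(source, dest, lambda r: r[0], max_rows_per_task=2)
```

Each resulting task sees the full source partition but only a slice of the destination partition.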

For "WHEN MATCHED" this is fine: if any task finds a match, we can perform the action on that result, so each task can still be treated completely independently.

For "WHEN NOT MATCHED" we have a problem, because we can only apply the change if none of the tasks finds a match for a given source row. We don't have a mechanism for doing a reduction over the key (or something along those lines) after the per-task checks. This means we can't use the SPJ optimization and have to do a full shuffle.
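
To make the difference concrete, here is a small pure-Python sketch, not actual Spark or Iceberg code. The function names and the sample keys are made up for illustration. It treats one destination partition as split into three parts and compares per-task "matched" and "not matched" decisions.

```python
def matched_per_task(source_keys, dest_part):
    """WHEN MATCHED: a hit in any single task is valid on its own."""
    return {k for k in source_keys if k in dest_part}


def not_matched_per_task(source_keys, dest_part):
    """Naive WHEN NOT MATCHED: each task only sees one slice of the destination."""
    return {k for k in source_keys if k not in dest_part}


source_keys = {1, 2, 3}
# One destination partition, split into three parts (hypothetical data).
dest_parts = [{1}, {2}, {5}]

# Matched results can simply be unioned across tasks.
matched = set().union(*(matched_per_task(source_keys, p) for p in dest_parts))

# Unioning per-task "not matched" results is wrong: keys 1 and 2 are reported
# as unmatched by the tasks that did not happen to hold them.
naive_not_matched = set().union(
    *(not_matched_per_task(source_keys, p) for p in dest_parts)
)

# A key is truly unmatched only if *every* task reports it unmatched,
# which is the cross-task reduction the comment says is missing.
correct_not_matched = set.intersection(
    *(not_matched_per_task(source_keys, p) for p in dest_parts)
)
```

Here `matched` is `{1, 2}`, the naive result wrongly includes every source key, and only the intersection across tasks yields `{3}`. That intersection is the per-key reduction step that would be needed after the per-task checks.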

I think I have that all right, but I'm mostly remembering our discussion when this was originally being implemented.

@dev-goyal
Author

Very helpful to know, thanks!


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Sep 14, 2024

github-actions bot commented Oct 5, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Oct 5, 2024