[Python] Add extra safeguards around join method #8170

Merged: 4 commits, Jul 7, 2023

Tishj
Contributor

@Tishj Tishj commented Jul 6, 2023

This PR fixes #8134

join used the aliases of the relations, which defaulted to query_relation when no alias was explicitly specified.
Joining two such relations therefore produced an alias collision, which caused the BinderException mentioned in the issue.

With this PR we change the default alias to be a uniquely generated name, to avoid these alias collisions in the future.

If two relations are explicitly given the same alias, this will still cause an issue.
To prevent users from running into the confusing BinderException, we now check whether the aliases are identical and throw a helpful error message if they are.
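
As a rough illustration of the two safeguards described above, here is a minimal pure-Python sketch. It is not DuckDB's implementation: the Relation class, the unnamed_relation_ prefix, and the error text are hypothetical stand-ins chosen for this example.

```python
import itertools

# Hypothetical counter used to build default aliases (assumption, not DuckDB code).
_alias_counter = itertools.count(1)


class Relation:
    """Hypothetical stand-in for a relation object."""

    def __init__(self, query, alias=None):
        self.query = query
        # Safeguard 1: with no explicit alias, generate a unique default
        # instead of sharing one fixed name across all relations.
        if alias is None:
            alias = f"unnamed_relation_{next(_alias_counter)}"
        self.alias = alias

    def set_alias(self, alias):
        # Return a copy of the relation under a caller-chosen alias.
        return Relation(self.query, alias)

    def join(self, other, column):
        # Safeguard 2: identical aliases would make column references
        # ambiguous, so fail early with an actionable message.
        if self.alias == other.alias:
            raise ValueError(
                f"Both relations use the alias '{self.alias}'; "
                "call set_alias() on one of them before joining."
            )
        return Relation(
            f"SELECT * FROM ({self.query}) AS {self.alias} "
            f"JOIN ({other.query}) AS {other.alias} USING ({column})"
        )
```

In this sketch, two relations created without an alias can be joined directly, while joining a relation with itself (or with another relation that was given the same alias) raises the explicit error instead of failing later during binding.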

@Mytherin
Collaborator

Mytherin commented Jul 6, 2023

Thanks for the PR! LGTM.

Not sure if this is possible - but it might be nicer if we default to the variable name in which a relation is stored as the alias (if any), at least if set_alias has not been called on the relation.

Another potential solution - which is what we do with unnamed subqueries - is to make the default alias unique by adding a (globally unique?) identifier. For example:

D select unnamed_subquery1.a, unnamed_subquery8.a from (select 42 AS a), (select 84 as a);
┌───────┬───────┐
│   a   │   a   │
│ int32 │ int32 │
├───────┼───────┤
│    42 │    84 │
└───────┴───────┘

@Tishj
Contributor Author

Tishj commented Jul 6, 2023

The globally unique default alias is a good idea; that should help alleviate this issue.
I wonder if it's possible to check the local/global dict and compare the pointers to find out the variable name, but otherwise I'm not sure if that context is still available here.

EDIT:
got something working 👀

import duckdb

obj = 'str'
res = duckdb.test(obj)
print(res)
# obj

@Mytherin
Collaborator

Mytherin commented Jul 6, 2023

I wonder if it's possible to check the local/global dict and compare the pointers to find out the variable name, but otherwise I'm not sure if that context is still available here

Yes, I think that would be the only way to do it. After giving it some more thought, though, I think that is not a good idea. Firstly, we would need to scan these dictionaries constantly, which will perform poorly if they get large. Secondly, it could cause issues when an object is stored in multiple variables, e.g. something like:

rel1 = duckdb.sql("SELECT 1 AS col1, 2 AS col2")
rel2 = rel1

rel = rel1.join(rel2, "col1")

Thirdly, it would not always solve the issue anyway, as relations don't need to be stored in variables, so this would still fail:

rel = duckdb.sql("SELECT 1 AS col1, 2 AS col2").join(duckdb.sql("SELECT 1 AS col1, 2 AS col2"), "col1")
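
The second and third objections can be seen in a small pure-Python sketch of the identity-based lookup discussed above. It assumes the lookup is a linear scan of a namespace dict (such as the one returned by globals()); names_bound_to and FakeRelation are hypothetical names for this example, not part of DuckDB.

```python
def names_bound_to(obj, namespace):
    # Linear scan of a namespace dict, comparing by identity ("pointer").
    # The cost grows with the number of names in the namespace.
    return sorted(name for name, value in namespace.items() if value is obj)


class FakeRelation:
    """Hypothetical placeholder for a relation object."""


# A plain dict stands in for a module's globals() in this sketch.
namespace = {}
namespace["rel1"] = FakeRelation()
namespace["rel2"] = namespace["rel1"]  # same object stored under two names

# One object, two candidate names: the alias choice is ambiguous.
shared = names_bound_to(namespace["rel1"], namespace)
print(shared)  # ['rel1', 'rel2']

# A temporary that was never assigned to a variable has no name at all.
temporary = names_bound_to(FakeRelation(), namespace)
print(temporary)  # []
```

With two names for one object, both sides of a self-join would resolve to the same candidates, and a temporary yields no candidate, so a fallback alias would be needed in either case.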

I think the better solution would be to generate a unique name using either (1) a global atomic<uint64_t> counter that is incremented whenever a new alias is created, or (2) a UUID.
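
For comparison, both options might look like this in Python. This is only a sketch: the function names and the unnamed_relation_ prefix are assumptions, and a lock-protected counter stands in for the atomic<uint64_t> that the C++ side would use.

```python
import itertools
import threading
import uuid

# Process-wide counter; the lock makes the increment safe across threads,
# playing the role an atomic integer would play in C++.
_counter = itertools.count(1)
_counter_lock = threading.Lock()


def alias_from_counter(prefix="unnamed_relation_"):
    # Option (1): short, readable names that are unique within the process.
    with _counter_lock:
        return f"{prefix}{next(_counter)}"


def alias_from_uuid(prefix="unnamed_relation_"):
    # Option (2): a random UUID; no shared state, but longer names.
    return f"{prefix}{uuid.uuid4().hex}"
```

The counter gives shorter aliases that are easier to read in error messages, while the UUID variant avoids shared mutable state entirely.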

@github-actions github-actions bot marked this pull request as draft July 7, 2023 08:43
@Tishj Tishj marked this pull request as ready for review July 7, 2023 08:44
Collaborator

@Mytherin Mytherin left a comment

Thanks! LGTM

@Mytherin Mytherin merged commit fef9667 into duckdb:master Jul 7, 2023
15 checks passed