Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43841][SQL] Handle candidate attributes with no prefix in StringUtils#orderSuggestedIdentifiersBySimilarity #41353

Closed

Conversation

bersprockets
Copy link
Contributor

What changes were proposed in this pull request?

In StringUtils#orderSuggestedIdentifiersBySimilarity, handle the case where the candidate attributes have a mix of empty and non-empty prefixes.

Why are the changes needed?

The following query throws a StringIndexOutOfBoundsException:

with v1 as (
 select * from values (1, 2) as (c1, c2)
),
v2 as (
  select * from values (2, 3) as (c1, c2)
)
select v1.c1, v1.c2, v2.c1, v2.c2, b
from v1
full outer join v2
using (c1);

The query should fail anyway, since b refers to a non-existent column. But it should fail with a helpful error message, not with a StringIndexOutOfBoundsException.

StringUtils#orderSuggestedIdentifiersBySimilarity assumes that a list of suggested attributes with a mix of prefixes will never have an attribute name with an empty prefix. But in this case it does (c1 from the coalesce has no prefix, since it is not associated with any relation or subquery):

+- 'Project [c1#5, c2#6, c1#7, c2#8, 'b]
   +- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2)
      +- Join FullOuter, (c1#5 = c1#7)
         :- SubqueryAlias v1
         :  +- CTERelationRef 0, true, [c1#5, c2#6]
         +- SubqueryAlias v2
            +- CTERelationRef 1, true, [c1#7, c2#8]

Because of this, orderSuggestedIdentifiersBySimilarity returns a sorted list of suggestions like this:

ArrayBuffer(.c1, v1.c2, v2.c2)

UnresolvedAttribute.parseAttributeName chokes on an attribute name that starts with a '.'.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

@github-actions github-actions bot added the SQL label May 28, 2023
@HyukjinKwon
Copy link
Member

cc @rednaxelafx @cloud-fan FYI

@MaxGekk
Copy link
Member

MaxGekk commented May 29, 2023

+1, LGTM. Merging to master.
Thank you, @bersprockets and @dongjoon-hyun @HyukjinKwon for review.

@MaxGekk MaxGekk closed this in 27bb384 May 29, 2023
@bersprockets bersprockets deleted the unresolved_column_issue branch June 8, 2023 20:40
czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
…ingUtils#orderSuggestedIdentifiersBySimilarity`

### What changes were proposed in this pull request?

In `StringUtils#orderSuggestedIdentifiersBySimilarity`, handle the case where the candidate attributes have a mix of empty and non-empty prefixes.

### Why are the changes needed?

The following query throws a `StringIndexOutOfBoundsException`:
```
with v1 as (
 select * from values (1, 2) as (c1, c2)
),
v2 as (
  select * from values (2, 3) as (c1, c2)
)
select v1.c1, v1.c2, v2.c1, v2.c2, b
from v1
full outer join v2
using (c1);
```
The query should fail anyway, since `b` refers to a non-existent column. But it should fail with a helpful error message, not with a `StringIndexOutOfBoundsException`.

`StringUtils#orderSuggestedIdentifiersBySimilarity` assumes that a list of suggested attributes with a mix of prefixes will never have an attribute name with an empty prefix. But in this case it does (`c1` from the `coalesce` has no prefix, since it is not associated with any relation or subquery):
```
+- 'Project [c1#5, c2#6, c1#7, c2#8, 'b]
   +- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2)
      +- Join FullOuter, (c1#5 = c1#7)
         :- SubqueryAlias v1
         :  +- CTERelationRef 0, true, [c1#5, c2#6]
         +- SubqueryAlias v2
            +- CTERelationRef 1, true, [c1#7, c2#8]
```
Because of this, `orderSuggestedIdentifiersBySimilarity` returns a sorted list of suggestions like this:
```
ArrayBuffer(.c1, v1.c2, v2.c2)
```
`UnresolvedAttribute.parseAttributeName` chokes on an attribute name that starts with a '.'.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes apache#41353 from bersprockets/unresolved_column_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants