Skip to content

Conversation

@kgyrtkirk
Copy link
Member

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Comment on lines 1839 to 1840
HIVE_JOIN_SHORTCUT_EXTENSIONROWS("hive.join.shortcut.extensionrows", true,
"Enables to shortcut processing of known filtered rows in merge joins."),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my understanding is correct if this is set to false the outer join produces incorrect results. I don't think it is a good idea to have a property which determines the correctness of the results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea was to add this toggle to make it possible to check if this feature is causing any perf regression - and make it possible to validate that quickly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point but I still think that it would be better to avoid toggles when it comes to correctness issues. I guess it is not the first time we are doing this so I am not gonna push back on this but it shouldn't be the norm. I am leaving the final judgement up to you.

If it goes in I would add some extra information that it is for "internal use only" and changing the value can lead to incorrect results.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that this config is needed for performance testing. At the same time, agree that it should be marked for internal use. I see there's precedent ..there are a few config options with 'Internal use only' in the description.

Comment on lines -376 to -378
NULL NULL NULL 50 10050 66
NULL NULL NULL 50 10050 66
NULL NULL NULL 50 10050 88
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed these rows shouldn't be there. It's a pity that people update q.out files without paying attention if the resulting changes are correct. We could have found this bug much earlier if people verified the results against another DBMS as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes; overwrite without checking tends to happen - that's why I see no value in the tests which use the src table - which content is not defined at all - one way around these problems is to use assert_true ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan of the the src and similar tables but if people validated the results against another DBMS I can leave with it.

* Decides if the actual row must be an extension row.
*
* Extension rows are those which are not part of the inner-join.
* May not correctly identify all extension rows - but will remove trivially filtered ones.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May not correctly... It is not clear what this means. Possibly an example could help clarify this better.
What happens in the case where it does not identify an extension row? Wrong results?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May not identify all extension rows

it could not identify non-trivial ones like that no rows could satisfy the l.id=r.id and l.id+2*r.id=1 condition...
guess what...running:

select * from t_xy l full outer join t_y r on (l.id=r.id and l.s='y' and l.id+2*r.id=1);

resulted in incorrect results as well...

this will leave me no choice but to remove that skip crap and replace it with something decent....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you have examples that do not work, maybe it would be better to write "The current implementation, does not correctly identify all..." and put a reference to a follow up JIRA highlighting that there is an implementation limitation.

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left a few comments but most of them are minor. It would good though if someone more familiar with this code can do another pass.

Comment on lines 1839 to 1840
HIVE_JOIN_SHORTCUT_EXTENSIONROWS("hive.join.shortcut.extensionrows", true,
"Enables to shortcut processing of known filtered rows in merge joins."),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point but I still think that it would be better to avoid toggles when it comes to correctness issues. I guess it is not the first time we are doing this so I am not gonna push back on this but it shouldn't be the norm. I am leaving the final judgement up to you.

If it goes in I would add some extra information that it is for "internal use only" and changing the value can lead to incorrect results.


// A field because we cannot multi-inherit.
transient InterruptibleProcessing interruptChecker;
transient boolean shortcutExtensionRows;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: should we use unmatched everywhere we use extension?

"Unmatched rows of an outer join" sounds more natural than "Extension rows of an outer join".

Comment on lines -376 to -378
NULL NULL NULL 50 10050 66
NULL NULL NULL 50 10050 66
NULL NULL NULL 50 10050 88
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan of the the src and similar tables but if people validated the results against another DBMS I can leave with it.

* Decides if the actual row must be an extension row.
*
* Extension rows are those which are not part of the inner-join.
* May not correctly identify all extension rows - but will remove trivially filtered ones.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you have examples that do not work, maybe it would be better to write "The current implementation, does not correctly identify all..." and put a reference to a follow up JIRA highlighting that there is an implementation limitation.

if (shortcutExtensionRows && isOuterJoinExtensionRow(tag, value)) {
int type = condn[0].getType();
if (tag == 0 && (type == JoinDesc.LEFT_OUTER_JOIN || type == JoinDesc.FULL_OUTER_JOIN)) {
emitExtensionRow(tag, value);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied this patch and added a log message to emitExtensionRow() and ran queries with full , left and right outer join (with the extra single table predicate in ON clause) but only see the log msg for full outer join, not for left/right. It is unclear which situations would lead to extension rows for left/right outer join.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it have stopped for me for other outer joins ; however I had to disable auto.join.converion to have that - without that the join was converted into a mapjoin
I will make some further tests for different types of joins to see if we don't have any issues

@kgyrtkirk kgyrtkirk merged commit 621bfd1 into apache:master Dec 21, 2021
dengzhhu653 pushed a commit to dengzhhu653/hive that referenced this pull request Dec 15, 2022
…ditions only affecting one side (apache#2891) (Zoltan Haindrich reviewed by Stamatis Zampetakis, Aman Sinha)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants