HIVE-24883 : Add support for complex types columns in Hive Joins #2071
zabetak
left a comment
Thanks for pushing this forward, @maheshk114.
First, I have some high-level questions regarding the scope of this work:
- Which join operators are we targeting? Checking the `CommonJoinOperator` hierarchy I see a few classes that were not affected by your changes (e.g., `MapJoinOperator`, `JoinOperator`, and `VectorXXX`) and I am wondering if that is normal. Do they already support complex types? Should they support complex types in the future?
- Which kind of joins are we tackling? Apart from equality joins (`=`) there are more operators that can appear, such as `<>`, `<`, `>`, `<=`, `>=`, etc. What happens with them?
- What are the semantics of the comparisons? Are we following the SQL standard?
For the above it may be worth enriching/modifying the JIRA case to tighten the scope.
Next in terms of testing, I think we should have a few cases covering:
- comparisons with null values;
- comparisons of collections with different sizes;
- comparisons with different types (negative?);
- more operators & predicates (depends on the answers to the questions above)
import java.util.ArrayList;
import java.util.LinkedHashMap;

class HiveListComparator extends HiveWritableComparator {
Having multiple top-level classes in a single source file does not provide any big advantage; on the contrary, it may cause problems (see Item 25, "Limit source files to a single top-level class", in Effective Java).
done
  }
}

public class HiveWritableComparator extends WritableComparator {
Why is it necessary to introduce a new API? As far as I can see HiveWritableComparator does not add any new behavior to WritableComparator. It only contains some factory methods and these would fit much better in a ComplexWritableComparatorFactory class that is final and immutable.
The non-public top-level classes above could become private static member classes of the factory class.
done
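The factory-class suggestion in this thread could look roughly like the sketch below. It is an illustration only, not the code that was merged: `java.util.Comparator` stands in for Hadoop's `WritableComparator` so the example is self-contained, and the method name `listComparator` is hypothetical. Null handling is deliberately left out here.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: a final, non-instantiable factory whose comparator
// implementations are private static member classes.
// Assumption: java.util.Comparator replaces WritableComparator for illustration.
final class ComplexWritableComparatorFactory {

    // Private constructor: the class only exposes static factory methods.
    private ComplexWritableComparatorFactory() {
    }

    // Hypothetical factory method; returns a shared stateless comparator.
    static Comparator<List<Integer>> listComparator() {
        return ListComparator.INSTANCE;
    }

    // Hidden implementation detail, not visible outside the factory.
    private static final class ListComparator implements Comparator<List<Integer>> {
        static final ListComparator INSTANCE = new ListComparator();

        @Override
        public int compare(List<Integer> a, List<Integer> b) {
            // Lists of different sizes are ordered by size.
            if (a.size() != b.size()) {
                return Integer.compare(a.size(), b.size());
            }
            // Same size: compare element by element, left to right.
            for (int i = 0; i < a.size(); i++) {
                int c = Integer.compare(a.get(i), b.get(i));
                if (c != 0) {
                    return c;
                }
            }
            return 0;
        }
    }
}
```

Callers would then depend only on the factory method, so no new public comparator type has to be introduced.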
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
public int compare(Object key1, Object key2) {
  ArrayList a1 = (ArrayList) key1;
  ArrayList a2 = (ArrayList) key2;
  if (a1.size() != a2.size()) {
Is it possible to get an NPE?
Yes, added null check for all.
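For readers following the NPE question, here is a minimal sketch of what a null-safe list comparison can look like. It is not the patch's actual code. The class name `NullSafeListCompare` is hypothetical, and two choices are assumptions made for illustration: a null sorts before any non-null value, and lists of different sizes are ordered by size before any element is inspected.

```java
import java.util.List;

// Hypothetical helper illustrating null checks at both the list level
// and the element level.
final class NullSafeListCompare {

    private NullSafeListCompare() {
    }

    // Assumption: a null list sorts before any non-null list.
    static <T extends Comparable<T>> int compare(List<T> a, List<T> b) {
        if (a == null || b == null) {
            return a == b ? 0 : (a == null ? -1 : 1);
        }
        // Assumption: different sizes are ordered by size.
        if (a.size() != b.size()) {
            return Integer.compare(a.size(), b.size());
        }
        // Same size: compare elements from left to right.
        for (int i = 0; i < a.size(); i++) {
            int c = compareElements(a.get(i), b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    // Assumption: a null element sorts before any non-null element.
    private static <T extends Comparable<T>> int compareElements(T x, T y) {
        if (x == null || y == null) {
            return x == y ? 0 : (x == null ? -1 : 1);
        }
        return x.compareTo(y);
    }
}
```

Note that this only defines a sort order. Whether two null keys should actually match in a join is a separate semantic question that the sketch does not answer.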
I am not aware of any SQL standard for complex type comparison. The join ordering follows the normal comparison: equality is checked field by field, from left to right.
Thanks for pointing this out. I will create a separate Jira, as currently only the equality operator is supported.
As of now, hash-based joins are working fine. This patch fixes the issue with SMB and common merge joins.
…xed review commnets
…xed review commnets1
…xed review commnets3
…mmon merge joins 1. Support added only for equal operator. 2. Not supported for map type.
zabetak
left a comment
Left a minor comment in the PR, and one in the JIRA about supporting UNION types in comparisons. Apart from that the PR is in very good shape and can be merged as soon as we agree on the support of UNION type.
Suggestion for squash commit msg:
HIVE-24883: Support ARRAY/STRUCT types in equality sort-merge joins
include also UNION if we decide to go this way.
create table table_map_types (id int, c1 map<int,int>, c2 map<int,int>);
insert into table_map_types VALUES (1, map(1,1), map(2,1));
insert into table_map_types VALUES (2, map(1,2), map(2,2));
insert into table_map_types VALUES (3, map(1,3), map(2,3));
insert into table_map_types VALUES (4, map(1,4), map(1,4));
insert into table_map_types VALUES (1, map(1,1,2,2,3,3,4,4), map(2,1,1,4));
select * from table_map_types;

create table table_map_types1 (id int, c1 map<int,int>, c2 map<int,int>);
insert into table_map_types1 VALUES (1, map(1,1), map(2,1));
insert into table_map_types1 VALUES (2, map(1,2), map(2,2));
insert into table_map_types1 VALUES (3, map(1,4), map(1,3));
insert into table_map_types1 VALUES (1, map(1,1,2,2,3,3,4,4), map(2,1,1,5));
insert into table_map_types1 VALUES (1, map(1,1,2,2,3,3,4,5), map(2,1,1,4));
select * from table_map_types1;

set hive.cbo.enable=false;
set hive.auto.convert.join=false;
set hive.optimize.ppd=false;

explain select * from table_map_types t1 inner join table_map_types1 t2 on t1.c1 = t2.c1;
select * from table_map_types t1 inner join table_map_types1 t2 on t1.c1 = t2.c1;

explain select * from table_map_types t1 inner join table_map_types1 t2 on t1.c2 = t2.c2;
select * from table_map_types t1 inner join table_map_types1 t2 on t1.c2 = t2.c2;
Since this is a negative test I guess you only need the following lines to make sure that the exception is raised:
create table table_map_types (id int, c1 map<int,int>, c2 map<int,int>);
set hive.cbo.enable=false;
set hive.auto.convert.join=false;
set hive.optimize.ppd=false;
select * from table_map_types t1 inner join table_map_types t2 on t1.c1 = t2.c1;
It is better to keep test cases minimal.
zabetak
left a comment
Thanks for pushing this forward @maheshk114
I have no more comments, LGTM!