Skip to content

Conversation

@maheshk114
Copy link
Contributor

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this forward @maheshk114 .

First, I have some high-level questions regarding the scope of this work:

  1. Which join operators are we targeting ? Checking the CommonJoinOperator hierarchy I see a few classes that were not affected by your changes (e.g., MapJoinOperator, JoinOperator, and VectorXXX) and I am wondering if that is normal. Do they already support complex types? Should they support complex types in the future?
  2. Which kind of joins are we tackling? Apart from equality joins (=) there are more operators that can appear such as (<>,<,>,<=,>=, etc), what happens with them?
  3. What are the semantics of the comparisons? Are we following the SQL standard?

For the above it may be worth enriching/modifying the JIRA case to tighten the scope.

Next in terms of testing, I think we should have a few cases covering:

  • comparisons with null values;
  • comparisons of collections with different sizes;
  • comparisons with different types (negative?);
  • more operators & predicates (depends on the answers to the questions above)

import java.util.ArrayList;
import java.util.LinkedHashMap;

class HiveListComparator extends HiveWritableComparator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having multiple top-level classes in a single source file does not provide any big advantage and on the contrary may cause problems (check Item 25: Limit source files to a single top-level class Effective Java).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
}

public class HiveWritableComparator extends WritableComparator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to introduce a new API? As far as I can see HiveWritableComparator does not add any new behavior to WritableComparator. It only contains some factory methods and these would fit much better in a ComplexWritableComparatorFactory class that is final and immutable.

The non-public top-level classes above could become private static members classes of the factory class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public int compare(Object key1, Object key2) {
ArrayList a1 = (ArrayList) key1;
ArrayList a2 = (ArrayList) key2;
if (a1.size() != a2.size()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get an NPE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added null check for all.

@maheshk114
Copy link
Contributor Author

maheshk114 commented Apr 9, 2021

3. What are the semantics of the comparisons? Are we following the SQL standard?

I am not aware of any SQL standards for complex type comparison. The join ordering used follows the normal comparison, equality is check from left to right fields.

@maheshk114
Copy link
Contributor Author

2. Which kind of joins are we tackling? Apart from equality joins (=) there are more operators that can appear such as (<>,<,>,<=,>=, etc), what happens with them?

Thanks for pointing this out. I will create a separate Jira as currently only equal operator is supported.

@maheshk114
Copy link
Contributor Author

  1. Which join operators are we targeting ? Checking the CommonJoinOperator hierarchy I see a few classes that were not affected by your changes (e.g., MapJoinOperator, JoinOperator, and VectorXXX) and I am wondering if that is normal. Do they already support complex types? Should they support complex types in the future?

As of now hash based joins are working fine. This patch fixes the issue with SMB and Common merge join.

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a minor comment in the PR, and one in the JIRA about supporting UNION types in comparisons. Apart from that the PR is in very good shape and can be merged as soon as we agree on the support of UNION type.

Suggestion for squash commit msg:
HIVE-24883: Support ARRAY/STRUCT types in equality sort-merge joins
include also UNION if we decide to go this way.

Comment on lines 1 to 25
create table table_map_types (id int, c1 map<int,int>, c2 map<int,int>);
insert into table_map_types VALUES (1, map(1,1), map(2,1));
insert into table_map_types VALUES (2, map(1,2), map(2,2));
insert into table_map_types VALUES (3, map(1,3), map(2,3));
insert into table_map_types VALUES (4, map(1,4), map(1,4));
insert into table_map_types VALUES (1, map(1,1,2,2,3,3,4,4), map(2,1,1,4));
select * from table_map_types;

create table table_map_types1 (id int, c1 map<int,int>, c2 map<int,int>);
insert into table_map_types1 VALUES (1, map(1,1), map(2,1));
insert into table_map_types1 VALUES (2, map(1,2), map(2,2));
insert into table_map_types1 VALUES (3, map(1,4), map(1,3));
insert into table_map_types1 VALUES (1, map(1,1,2,2,3,3,4,4), map(2,1,1,5));
insert into table_map_types1 VALUES (1, map(1,1,2,2,3,3,4,5), map(2,1,1,4));
select * from table_map_types1;

set hive.cbo.enable=false;
set hive.auto.convert.join=false;
set hive.optimize.ppd=false;

explain select * from table_map_types t1 inner join table_map_types1 t2 on t1.c1 = t2.c1;
select * from table_map_types t1 inner join table_map_types1 t2 on t1.c1 = t2.c1;

explain select * from table_map_types t1 inner join table_map_types1 t2 on t1.c2 = t2.c2;
select * from table_map_types t1 inner join table_map_types1 t2 on t1.c2 = t2.c2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a negative test I guess you only need the following lines to make sure that the exception is raised:

create table table_map_types (id int, c1 map<int,int>, c2 map<int,int>);
set hive.cbo.enable=false;
set hive.auto.convert.join=false;
set hive.optimize.ppd=false;
select * from table_map_types t1 inner join table_map_types t2 on t1.c1 = t2.c1;

It is better to keep test cases minimal.

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this forward @maheshk114
I have no more comments, LGTM!

@maheshk114 maheshk114 merged commit 37f13b0 into apache:master May 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants