HIVE-27112 - implement array_except UDF in Hive #4090

Merged
8 commits merged into apache:master on Jun 23, 2023

Conversation

@tarak271 (Contributor) commented Mar 1, 2023

What changes were proposed in this pull request?

Implement the array_except function in Hive.

Why are the changes needed?

This function is already implemented in Spark.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Created JUnit tests as well as qtests as part of this change.

@tarak271 changed the title from "Tarak hive 27112" to "HIVE-27112 - implement array_except UDF in Hive" on Mar 1, 2023
@Description(name = "array_except",
    value = "_FUNC_(array, value) - Returns an array of the elements in array1 but not in array2.",
    extended = "Example:\n"
        + " > SELECT _FUNC_(array(1, 2, 3,4), array(2,3)) FROM src LIMIT 1;\n"
        + " [1,4]")
@NDV(maxNdv = 2)
public class GenericUDFArrayExcept extends AbstractGenericUDFArrayBase {
Contributor:

  • How about changing the definition from _FUNC_(array, value) to _FUNC_(array1, array2)? (See the sketch after this list.)
  • The line break might be wrong.
  • We have to remove @NDV(maxNdv = 2), because it means the number of distinct values is at most 2. This UDF can return arbitrary arrays, so the maximum number of distinct values is unbounded.
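
For reference, a minimal sketch of what the annotation could look like after these suggestions (the exact wording and formatting here are an assumption, not necessarily the merged code):

@Description(name = "array_except",
    value = "_FUNC_(array1, array2) - Returns an array of the elements in array1 but not in array2.",
    extended = "Example:\n"
        + "  > SELECT _FUNC_(array(1, 2, 3, 4), array(2, 3)) FROM src LIMIT 1;\n"
        + "  [1,4]")
public class GenericUDFArrayExcept extends AbstractGenericUDFArrayBase {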

Contributor (Author):

Made the changes as recommended; formatting follows https://github.com/apache/hive/blob/master/dev-support/eclipse-styles.xml

@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
  ObjectInspector defaultOI = super.initialize(arguments);
  checkArgCategory(arguments, ARRAY2_IDX, ObjectInspector.Category.LIST, FUNC_NAME,
      org.apache.hadoop.hive.serde.serdeConstants.LIST_TYPE_NAME); //Array1 is already getting validated in Parent class
Contributor:

Should we check that the element type of array1 is equal to that of array2? It might be OK if we allow type conversions here.

Contributor (Author):

Added test cases where one array is of int type and the other is of String type; see ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFArrayExcept.java#testPrimitive

@okumin (Contributor) commented May 20, 2023:

Thanks. I think we should carefully think of the expected specification first. What should the following SQL return?

SELECT array_except(array(1, 2, 3), array(2.0, 3.3));

If it should return array(1, 3), meaning type conversion is applied, we should add a test case where elements are removed with type conversions.
If it should return array(1, 2, 3), meaning the second argument is meaningless if types are unmatched, I personally think we should raise a syntax error. That's because it happens only when a user misunderstands the types of the 1st and 2nd arguments.

For example, Spark 3.4 fails in that case. PrestoSQL returns [1.0, 3.0] with the same SQL, meaning PrestoSQL applies the type conversion from int to float. I personally think either is fine, but it should be tested.

spark-sql (default)> select array_except(array(1, 2, 3), array(2.0, 3.3));
[DATATYPE_MISMATCH.BINARY_ARRAY_DIFF_TYPES] Cannot resolve "array_except(array(1, 2, 3), array(2.0, 3.3))" due to data type mismatch: Input to function `array_except` should have been two "ARRAY" with same element type, but it's ["ARRAY<INT>", "ARRAY<DECIMAL(2,1)>"].; line 1 pos 7;
'Project [unresolvedalias(array_except(array(1, 2, 3), array(2.0, 3.3)), None)]
+- OneRowRelation

Contributor (Author):

As recommended, the UDF now throws an error for unmatched data types.
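
A minimal sketch of how such a check could look inside initialize, assuming the element ObjectInspectors are compared and a UDFArgumentTypeException is thrown on mismatch (ObjectInspectorUtils.compareTypes and UDFArgumentTypeException are standard Hive APIs, but this exact wiring is illustrative, not the merged code):

// Sketch: reject calls where the two arrays have different element types
ObjectInspector elem1 = ((ListObjectInspector) arguments[ARRAY_IDX]).getListElementObjectInspector();
ObjectInspector elem2 = ((ListObjectInspector) arguments[ARRAY2_IDX]).getListElementObjectInspector();
if (!ObjectInspectorUtils.compareTypes(elem1, elem2)) {
  throw new UDFArgumentTypeException(ARRAY2_IDX,
      "Both arrays must have elements of the same type, but found "
          + elem1.getTypeName() + " and " + elem2.getTypeName());
}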

}

List<?> retArray3 = ((ListObjectInspector) argumentOIs[ARRAY_IDX]).getList(array);
retArray3.removeAll(((ListObjectInspector) argumentOIs[ARRAY2_IDX]).getList(arguments[ARRAY2_IDX].get()));
Contributor:

  • I guess we shouldn't directly mutate the object here, because ListObjectInspector can return the original value.
    • I think it is possible to access the original object, e.g. SELECT array1, ARRAY_EXCEPT(array1, array2).
    • We want test cases to cover such usage.
  • We also have to handle the case where array2 is null, because List#removeAll throws a NullPointerException. (A sketch of both fixes follows this list.)
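
A minimal sketch of both fixes inside evaluate (the copy avoids mutating the row value returned by the ObjectInspector; returning NULL when either argument is NULL is an assumption about the intended semantics, not the merged behavior):

Object array2 = arguments[ARRAY2_IDX].get();
if (array == null || array2 == null) {
  return null; // assumption: propagate NULL instead of throwing
}
// Copy the first array so the original row object is never modified
List<Object> result = new ArrayList<>(((ListObjectInspector) argumentOIs[ARRAY_IDX]).getList(array));
result.removeAll(((ListObjectInspector) argumentOIs[ARRAY2_IDX]).getList(array2));
return result;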

Contributor (Author):

Added the above test cases in ql/src/test/queries/clientpositive/udf_array_except.q

Contributor:

Thanks. As I commented on a test file, I feel SELECT array1, ARRAY_EXCEPT(array1, array2) returns an incorrect result.

Contributor (Author):

Sorry, I missed your update regarding the incorrect result. I have now changed the code to avoid mutating the inputs.

sonarcloud bot commented Mar 2, 2023

SonarCloud Quality Gate passed: 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 0 Code Smells. No coverage or duplication information.

github-actions bot commented May 2, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

github-actions bot added the stale label on May 2, 2023
@saihemanth-cloudera (Contributor):

@tarak271 - Do you still plan to work on this PR?

github-actions bot removed the stale label on May 3, 2023
@tarak271 (Contributor, Author) commented May 3, 2023

Yes @saihemanth-cloudera, I need some help with the review. The coding part is complete.

@saihemanth-cloudera (Contributor):

Can you rerun the tests on this PR? I can't find a way to rerun the Build CI test suite. Can you push an empty commit? Thanks.
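
If it helps, the usual way to retrigger CI without changing any code is an empty commit, for example:

git commit --allow-empty -m "Retrigger CI"
git push

(Generic git commands; whether the Hive CI picks this up automatically is an assumption about the build setup.)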

@tarak271 tarak271 marked this pull request as draft May 10, 2023 07:04
@tarak271 tarak271 marked this pull request as ready for review May 10, 2023 07:09
sonarcloud bot commented May 10, 2023

SonarCloud Quality Gate passed: 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 0 Code Smells. No coverage or duplication information.

@tarak271 (Contributor, Author):

@saihemanth-cloudera the tests have completed now; please help with the review.

@saihemanth-cloudera (Contributor) left a comment:

LGTM +1

@okumin (Contributor) left a comment:

@tarak271 Looks cool! I only left a few minor comments.


public GenericUDFArrayExcept() {
  super(FUNC_NAME, 2, 2, ObjectInspector.Category.LIST);
}

@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
  ObjectInspector defaultOI = super.initialize(arguments);
  array2OI = (ListObjectInspector) arguments[ARRAY2_IDX];
Contributor:

NITs:

  • This line should come after the following checkArgCategory call, because checkArgCategory validates the type and throws a friendlier message.

Contributor (Author):

Done

checkArgCategory(arguments, ARRAY2_IDX, ObjectInspector.Category.LIST, FUNC_NAME,
    org.apache.hadoop.hive.serde.serdeConstants.LIST_TYPE_NAME); //Array1 is already getting validated in Parent class
arrayElementOI = arrayOI.getListElementObjectInspector();
array2ElementOI = array2OI.getListElementObjectInspector();
Contributor:

NIT: array2OI can be a local variable, or we could simply call ObjectInspectorUtils.compareTypes(arrayOI, arguments[ARRAY2_IDX]), since it seems to recursively check the types of LIST, MAP, etc.

Contributor (Author):

Removed all those variables

Contributor (Author):

Please note that elements of the input array can themselves be of array or list type; that's why the element itself is passed to the compatibility check.

List inputArrayCopy = new ArrayList<>();
inputArrayCopy.addAll(retArray3);
inputArrayCopy.removeAll(((ListObjectInspector) argumentOIs[ARRAY2_IDX]).getList(arguments[ARRAY2_IDX].get()));
return inputArrayCopy.stream().distinct().map(o -> converter.convert(o)).collect(Collectors.toList());
Contributor:

We got robustness. Thanks!

runAndVerify(inputStringList, inputStringList, asList());
runAndVerify(inputStringList, asList(), inputStringList);
// Empty input arrays
runAndVerify(asList(), asList(), asList());
Contributor:

Great test cases!

sonarcloud bot commented Jun 12, 2023

SonarCloud Quality Gate passed: 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 1 Code Smell. No coverage or duplication information.

@okumin (Contributor) left a comment:

Looks good to me. I want to ask @saihemanth-cloudera about the final check and merge since I am not a committer.
@tarak271 Nice work. I definitely have use cases for this UDF!

@saihemanth-cloudera (Contributor):

LGTM +1

saihemanth-cloudera merged commit 14a1f70 into apache:master on Jun 23, 2023
5 checks passed
yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023
… Rama Rao Lethavadla, reviewed by Okumin, Sai Hemanth Gantasala)
tarak271 added a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
… Rama Rao Lethavadla, reviewed by Okumin, Sai Hemanth Gantasala)