New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-27112 - implement array_except UDF in Hive #4090
Conversation
*/ | ||
@Description(name = "array_except", value = "_FUNC_(array, value) - Returns an array of the elements in array1 but not in array2.", extended = | ||
"Example:\n" + " > SELECT _FUNC_(array(1, 2, 3,4), array(2,3)) FROM src LIMIT 1;\n" | ||
+ " [1,4]") @NDV(maxNdv = 2) public class GenericUDFArrayExcept extends AbstractGenericUDFArrayBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- How about changing the definition from
_FUNC_(array, value)
to_FUNC_(array1, array2)
? - Line break might be wrong
- We have to remove
@NDV(maxNdv = 2)
because it means the number of distinct values is up to 2. This UDF can return any arrays, meaning the maximum number of distinct values is infinity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made changes as recommended, formatting is done as per https://github.com/apache/hive/blob/master/dev-support/eclipse-styles.xml
@Override public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException { | ||
ObjectInspector defaultOI = super.initialize(arguments); | ||
checkArgCategory(arguments, ARRAY2_IDX, ObjectInspector.Category.LIST, FUNC_NAME, | ||
org.apache.hadoop.hive.serde.serdeConstants.LIST_TYPE_NAME); //Array1 is already getting validated in Parent class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check if the type of elements of array1 is equal to array2? It might be OK if we allow type conversions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test cases when one array is of int type and one array is of String type at ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFArrayExcept.java#testPrimitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think we should carefully think of the expected specification first. What should the following SQL return?
SELECT array_except(array(1, 2, 3), array(2.0, 3.3));
If it should return array(1, 3)
, meaning type conversion is applied, we should add a test case where elements are removed with type conversions.
If it should return array(1, 2, 3)
, meaning the second argument is meaningless if types are unmatched, I personally think we should raise a syntax error. That's because it happens only when a user misunderstands the types of the 1st and 2nd arguments.
For example, Spark 3.4 fails in that case. PrestoSQL returns [1.0, 3.0]
with the same SQL, meaning PrestoSQL applies the type conversion from int to float. I personally think either is fine, but it should be tested.
spark-sql (default)> select array_except(array(1, 2, 3), array(2.0, 3.3));
[DATATYPE_MISMATCH.BINARY_ARRAY_DIFF_TYPES] Cannot resolve "array_except(array(1, 2, 3), array(2.0, 3.3))" due to data type mismatch: Input to function `array_except` should have been two "ARRAY" with same element type, but it's ["ARRAY<INT>", "ARRAY<DECIMAL(2,1)>"].; line 1 pos 7;
'Project [unresolvedalias(array_except(array(1, 2, 3), array(2.0, 3.3)), None)]
+- OneRowRelation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As recommended, throwing error for unmatched data types
} | ||
|
||
List<?> retArray3 = ((ListObjectInspector) argumentOIs[ARRAY_IDX]).getList(array); | ||
retArray3.removeAll(((ListObjectInspector) argumentOIs[ARRAY2_IDX]).getList(arguments[ARRAY2_IDX].get())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I guess we shouldn't directly mutate the object here because ListObjectInspector can return the original value
- I think it is possible to access the original object like
SELECT array1, ARRAY_EXCEPT(array1, array2)
- We want test cases to cover such usage
- I think it is possible to access the original object like
- We should have to care the case where array2 is null because
List#removeAll
raises NPE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added above test cases in ql/src/test/queries/clientpositive/udf_array_except.q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. As I commented on a test file, I feel SELECT array1, ARRAY_EXCEPT(array1, array2)
returns an incorrect result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that i missed your update regarding the incorrect result.. Now I have changed the code to avoid manipulating inputs
Kudos, SonarCloud Quality Gate passed! |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
@tarak271 - Do you still plan to work on this PR? |
Yes @saihemanth-cloudera, need some help with review. Coding part is complete |
Can you rerun the tests on this PR? I can't find a way to rerun the Build CI test suit. Can you push an empty commit? Thanks. |
Kudos, SonarCloud Quality Gate passed! |
@saihemanth-cloudera tests got completed now, please help with review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tarak271 Looks cool! I only put a few minor comments.
|
||
public GenericUDFArrayExcept() { | ||
super(FUNC_NAME, 2, 2, ObjectInspector.Category.LIST); | ||
} | ||
|
||
@Override public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException { | ||
ObjectInspector defaultOI = super.initialize(arguments); | ||
array2OI = (ListObjectInspector) arguments[ARRAY2_IDX]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NITs:
- This line should be after the following
checkArgCategory
because it validates the type andcheckArgCategory
throws a kind message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
checkArgCategory(arguments, ARRAY2_IDX, ObjectInspector.Category.LIST, FUNC_NAME, | ||
org.apache.hadoop.hive.serde.serdeConstants.LIST_TYPE_NAME); //Array1 is already getting validated in Parent class | ||
arrayElementOI = arrayOI.getListElementObjectInspector(); | ||
array2ElementOI = array2OI.getListElementObjectInspector(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: array2OI
can be a local variable, or we can simply do ObjectInspectorUtils.compareTypes(arrayOI, arguments[ARRAY2_IDX])
because it seems to recursively check types of LIST, MAP, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed all those variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please note that elements in the input array can be of type array or list, that's why sending element itself to check compatibility
List inputArrayCopy = new ArrayList<>(); | ||
inputArrayCopy.addAll(retArray3); | ||
inputArrayCopy.removeAll(((ListObjectInspector) argumentOIs[ARRAY2_IDX]).getList(arguments[ARRAY2_IDX].get())); | ||
return inputArrayCopy.stream().distinct().map(o -> converter.convert(o)).collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We got robustness. Thanks!
runAndVerify(inputStringList,inputStringList,asList()); | ||
runAndVerify(inputStringList,asList(),inputStringList); | ||
// Empty input arrays | ||
runAndVerify(asList(),asList(),asList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great test cases!
Kudos, SonarCloud Quality Gate passed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I want to ask @saihemanth-cloudera about the final check and merge since I am not a committer.
@tarak271 Nice work. I definitely have use cases to use this UDF!
LGTM +1 |
… Rama Rao Lethavadla, reviewed by Okumin, Sai Hemanth Gantasala)
… Rama Rao Lethavadla, reviewed by Okumin, Sai Hemanth Gantasala)
What changes were proposed in this pull request?
Implement array_except function in Hive
Why are the changes needed?
This enhancement is already implemented in Spark
Does this PR introduce any user-facing change?
No
How was this patch tested?
Created Junit tests as well as qtests as part of this change