-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-26774 - Implement array_slice UDF to get the subset of elements from an array (subarray) #3893
Conversation
…from an array (subarray)
…from an array (subarray) - 1
…from an array (subarray) - 2
…from an array (subarray) - 3
…from an array (subarray) - 4
Kudos, SonarCloud Quality Gate passed! |
/** | ||
* GenericUDFArraySlice. | ||
*/ | ||
@Description(name = "array_slice", value = "_FUNC_(array, start, length) - Returns the subset or range of elements from" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any particular reason on why we use these inputs - i.e. array_slice(array, start, length)
?
Because as far as I know in most programming languages, we usually provide start index and end index.
Please let me know if this is inspired from some place in SQL languages, so that we have valid reason to justify these inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to make the implementation similar to that of Spark's Slice function wherever possible, https://spark.apache.org/docs/latest/api/sql/#slice
Please note that negative indexing is not implemented as we don't have it either in Hive or in Java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
int length = ((IntObjectInspector) argumentOIs[LENGTH_IDX]).get(arguments[LENGTH_IDX].get()); | ||
// return empty list if start/length are out of range of the array | ||
if (start + length > retArray.size()) { | ||
return Collections.emptyList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we returning empty list when the size of array is lesser than start + length? Shouldn't an exception be thrown specifying that the window of subarray requested is not within the array because of invalid input?
Also I believe this check must be at the beginning of evaluate()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation is made close to that of Spark's slice function which is returning empty array
scala> val arrayStructureData = Seq(
| Row(List("aa","bb","cc","dd")),
| Row(List("aa"))
| )
arrayStructureData: Seq[org.apache.spark.sql.Row] = List([List(aa, bb, cc, dd)], [List(aa)])
scala> val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructureData),new StructType().add("str", ArrayType(StringType)))
df: org.apache.spark.sql.DataFrame = [str: array<string>]
scala> val sliceDF = df.withColumn("Sliced_str",slice(col("str"),2,3))
sliceDF: org.apache.spark.sql.DataFrame = [str: array<string>, Sliced_str: array<string>]
scala> sliceDF.show(false)
+----------------+------------+
|str |Sliced_str |
+----------------+------------+
|[aa, bb, cc, dd]|[bb, cc, dd]|
|[aa] |[] |
+----------------+------------+
So that users who are familiar with Spark's slice will not see a difference in Hive. Also I see another benefit of returning values for the rest of the rows which is not the case when an exception is thrown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me. +1
…from an array (subarray) (apache#3893)(Taraka Rama Rao Lethavadla, reviewed by Sai Hemanth, Sourabh Badhya)
…from an array (subarray) (apache#3893)(Taraka Rama Rao Lethavadla, reviewed by Sai Hemanth, Sourabh Badhya)
What changes were proposed in this pull request?
Implement array_slice function in Hive
Why are the changes needed?
This enhancement is already implemented in Spark
Does this PR introduce any user-facing change?
No
How was this patch tested?
Created Junit tests as well as qtests as part of this change