
HIVE-26774 - Implement array_slice UDF to get the subset of elements from an array (subarray) #3893

Merged (5 commits) on Jan 30, 2023

Conversation

tarak271 (Contributor)

What changes were proposed in this pull request?

Implement array_slice function in Hive

Why are the changes needed?

This enhancement is already implemented in Spark

Does this PR introduce any user-facing change?

No

How was this patch tested?

Created JUnit tests as well as qtests as part of this change

sonarcloud bot commented Dec 24, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rated A), 0 Vulnerabilities (rated A), 0 Security Hotspots (rated A), 0 Code Smells (rated A). No coverage or duplication information.

/**
* GenericUDFArraySlice.
*/
@Description(name = "array_slice", value = "_FUNC_(array, start, length) - Returns the subset or range of elements from"
Contributor


Is there any particular reason why we use these inputs, i.e. array_slice(array, start, length)?

As far as I know, most programming languages instead take a start index and an end index.

Please let me know if this is inspired by some SQL dialect, so that we have a valid reason to justify these inputs.
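For readers skimming the thread, the two conventions the reviewer is contrasting can be sketched in plain Java. This is a hypothetical illustration, not code from this PR; the class and method names are made up, and 0-based indices are assumed for simplicity.

```java
import java.util.Arrays;
import java.util.List;

public class SliceConventions {
    // (start, length) convention, as proposed for array_slice:
    // take `length` elements beginning at index `start`.
    static <T> List<T> sliceByLength(List<T> list, int start, int length) {
        return list.subList(start, start + length);
    }

    // (start, end) convention, as in Java's own List.subList:
    // take elements from index `start` up to (but excluding) index `end`.
    static <T> List<T> sliceByEnd(List<T> list, int start, int end) {
        return list.subList(start, end);
    }

    public static void main(String[] args) {
        List<String> xs = Arrays.asList("a", "b", "c", "d");
        // Both calls select the same window ["b", "c"]:
        System.out.println(sliceByLength(xs, 1, 2)); // [b, c]
        System.out.println(sliceByEnd(xs, 1, 3));    // [b, c]
    }
}
```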

Contributor Author


I tried to keep the implementation similar to Spark's slice function wherever possible: https://spark.apache.org/docs/latest/api/sql/#slice
Please note that negative indexing is not implemented, since neither Hive nor Java supports it.

Contributor


Ok

int length = ((IntObjectInspector) argumentOIs[LENGTH_IDX]).get(arguments[LENGTH_IDX].get());
// return an empty list if start/length are out of range of the array
if (start + length > retArray.size()) {
  return Collections.emptyList();
}
Contributor

SourabhBadhya commented Jan 4, 2023


Why are we returning an empty list when the size of the array is less than start + length? Shouldn't an exception be thrown specifying that the requested subarray window is not within the array because of invalid input?

Also, I believe this check should be at the beginning of evaluate().

Contributor Author


The implementation is kept close to Spark's slice function, which returns an empty array:

scala> val arrayStructureData = Seq(
     |     Row(List("aa","bb","cc","dd")),
     |     Row(List("aa"))
     |   )
arrayStructureData: Seq[org.apache.spark.sql.Row] = List([List(aa, bb, cc, dd)], [List(aa)])

scala> val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructureData),new StructType().add("str", ArrayType(StringType)))
df: org.apache.spark.sql.DataFrame = [str: array<string>]

scala> val sliceDF = df.withColumn("Sliced_str",slice(col("str"),2,3))
sliceDF: org.apache.spark.sql.DataFrame = [str: array<string>, Sliced_str: array<string>]

scala> sliceDF.show(false)
+----------------+------------+
|str             |Sliced_str  |
+----------------+------------+
|[aa, bb, cc, dd]|[bb, cc, dd]|
|[aa]            |[]          |
+----------------+------------+

This way, users who are familiar with Spark's slice will not see a difference in Hive. Returning an empty list also has the benefit that values are still returned for the rest of the rows, which would not be the case if an exception were thrown.
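As an aside for readers, the empty-result behavior described above can be sketched as a small standalone Java helper. This is a hypothetical illustration, not the PR's actual GenericUDFArraySlice code; it assumes a 0-based start for simplicity, whereas Spark's own slice uses a 1-based start as the transcript above shows.

```java
import java.util.Collections;
import java.util.List;

public class ArraySliceSketch {
    // Mirrors the check discussed in this thread: if the requested window
    // [start, start + length) extends past the end of the array, return an
    // empty list instead of throwing an exception.
    static <T> List<T> arraySlice(List<T> array, int start, int length) {
        if (start + length > array.size()) {
            return Collections.emptyList();
        }
        return array.subList(start, start + length);
    }
}
```

With this behavior, a row whose array is too short simply yields `[]`, so a query over many rows still produces results for all the rows that are in range.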

Contributor


Ok

Contributor

SourabhBadhya left a comment


LGTM +1

Contributor

saihemanth-cloudera left a comment


The changes look good to me. +1

saihemanth-cloudera merged commit fcf8044 into apache:master on Jan 30, 2023
yeahyung pushed a commit to yeahyung/hive that referenced this pull request on Jul 20, 2023:
…from an array (subarray) (apache#3893) (Taraka Rama Rao Lethavadla, reviewed by Sai Hemanth, Sourabh Badhya)
tarak271 added a commit to tarak271/hive-1 that referenced this pull request on Dec 19, 2023:
…from an array (subarray) (apache#3893) (Taraka Rama Rao Lethavadla, reviewed by Sai Hemanth, Sourabh Badhya)