HIVE-26774 - Implement array_slice UDF to get the subset of elements from an array (subarray) #3893

tarak271 · 2022-12-24T10:45:12Z

What changes were proposed in this pull request?

Implement array_slice function in Hive

Why are the changes needed?

This enhancement is already implemented in Spark

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added JUnit tests as well as qtests as part of this change.

sonarcloud · 2022-12-24T16:23:09Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

SourabhBadhya · 2023-01-04T03:46:01Z

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArraySlice.java

+/**
+ * GenericUDFArraySlice.
+ */
+@Description(name = "array_slice", value = "_FUNC_(array, start, length) - Returns the subset or range of elements from"


Is there a particular reason why we use these inputs, i.e. array_slice(array, start, length)?

As far as I know, most programming languages take a start index and an end index.

Please let me know if this is modeled on an existing SQL dialect, so that we have a valid reason to justify these inputs.

I tried to keep the implementation as close as possible to Spark's slice function: https://spark.apache.org/docs/latest/api/sql/#slice
Please note that negative indexing is not implemented, as neither Hive nor Java supports it.
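
For readers comparing the two conventions, here is a minimal sketch of how a (start, length) pair relates to the (start, end) pair that Java's List.subList expects. The class and helper names are hypothetical, and the 0-based start is an assumption for illustration only; the diff quoted in this thread does not show which index base array_slice uses.

```java
import java.util.Arrays;
import java.util.List;

class SliceArgs {
    // Hypothetical helper, not part of the PR: maps (start, length) onto
    // List.subList(fromIndex, toIndex), where toIndex is exclusive.
    // Assumes a 0-based start.
    static <T> List<T> sliceByLength(List<T> array, int start, int length) {
        return array.subList(start, start + length);
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("aa", "bb", "cc", "dd");
        // Start at index 1 and take 2 elements...
        System.out.println(sliceByLength(data, 1, 2)); // [bb, cc]
        // ...which is the same window as start index 1, end index 3 (exclusive).
        System.out.println(data.subList(1, 3));        // [bb, cc]
    }
}
```

Either form describes the same window, so the choice is mostly about which existing API users are expected to know.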

SourabhBadhya · 2023-01-04T03:54:47Z

ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArraySlice.java

+    int length = ((IntObjectInspector) argumentOIs[LENGTH_IDX]).get(arguments[LENGTH_IDX].get());
+    // return empty list if start/length are out of range of the array
+    if (start + length > retArray.size()) {
+      return Collections.emptyList();


Why are we returning an empty list when the size of the array is less than start + length? Shouldn't an exception be thrown, stating that the requested subarray window falls outside the array because of invalid input?

Also, I believe this check should be at the beginning of evaluate().

The implementation is kept close to Spark's slice function, which returns an empty array in this case:

scala> val arrayStructureData = Seq(
     |   Row(List("aa","bb","cc","dd")),
     |   Row(List("aa"))
     | )
arrayStructureData: Seq[org.apache.spark.sql.Row] = List([List(aa, bb, cc, dd)], [List(aa)])

scala> val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructureData),new StructType().add("str", ArrayType(StringType)))
df: org.apache.spark.sql.DataFrame = [str: array<string>]

scala> val sliceDF = df.withColumn("Sliced_str",slice(col("str"),2,3))
sliceDF: org.apache.spark.sql.DataFrame = [str: array<string>, Sliced_str: array<string>]

scala> sliceDF.show(false)
+----------------+------------+
|str             |Sliced_str  |
+----------------+------------+
|[aa, bb, cc, dd]|[bb, cc, dd]|
|[aa]            |[]          |
+----------------+------------+

This way, users who are familiar with Spark's slice will not see a difference in Hive. Returning an empty array has another benefit: the query still returns values for the rest of the rows, which is not the case when an exception is thrown.
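
To make the trade-off discussed above concrete, the following is a rough standalone sketch of the two options, written against plain java.util.List rather than Hive's ObjectInspector API. The class and method names are hypothetical, a 0-based start is assumed, and the negative-argument guard and the long arithmetic are additions for this sketch; the diff only shows the `start + length > retArray.size()` check.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SliceBounds {
    // True when the requested window does not fit in the array.
    // The sum is computed as a long so it cannot overflow (sketch-only addition).
    static boolean outOfRange(int size, int start, int length) {
        return start < 0 || length < 0 || (long) start + length > size;
    }

    // Lenient option: an out-of-range request yields an empty list,
    // so the remaining rows of a query can still be evaluated.
    static <T> List<T> sliceOrEmpty(List<T> array, int start, int length) {
        if (outOfRange(array.size(), start, length)) {
            return Collections.emptyList();
        }
        return array.subList(start, start + length);
    }

    // Strict option raised in review: reject the input with an exception.
    static <T> List<T> sliceOrThrow(List<T> array, int start, int length) {
        if (outOfRange(array.size(), start, length)) {
            throw new IllegalArgumentException("requested window [" + start + ", "
                + ((long) start + length) + ") is outside an array of size " + array.size());
        }
        return array.subList(start, start + length);
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("aa", "bb", "cc", "dd"),
            Arrays.asList("aa"));
        for (List<String> row : rows) {
            // Prints "[aa, bb, cc, dd] -> [bb, cc, dd]" and then "[aa] -> []".
            System.out.println(row + " -> " + sliceOrEmpty(row, 1, 3));
        }
        try {
            sliceOrThrow(rows.get(1), 1, 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under the 0-based assumption, a start of 1 here corresponds to the start of 2 in the Spark session above, since Spark's slice counts array positions from 1. With the lenient option the second row simply produces an empty result, while the strict option would stop at that row.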

SourabhBadhya

LGTM +1

saihemanth-cloudera

The changes look good to me. +1

…from an array (subarray) (apache#3893)(Taraka Rama Rao Lethavadla, reviewed by Sai Hemanth, Sourabh Badhya)

tarak271 added 4 commits December 24, 2022 14:22

HIVE-26774 - Implement array_slice UDF to get the subset of elements …

9dd1743

…from an array (subarray)

HIVE-26774 - Implement array_slice UDF to get the subset of elements …

608211a

…from an array (subarray) - 1

HIVE-26774 - Implement array_slice UDF to get the subset of elements …

4f078e5

…from an array (subarray) - 2

HIVE-26774 - Implement array_slice UDF to get the subset of elements …

edbefa9

…from an array (subarray) - 3

kgyrtkirk added the tests pending label Dec 24, 2022

HIVE-26774 - Implement array_slice UDF to get the subset of elements …

7ef59bf

…from an array (subarray) - 4

kgyrtkirk added tests unstable tests pending and removed tests pending tests unstable labels Dec 24, 2022

kgyrtkirk added tests passed and removed tests pending labels Dec 24, 2022

SourabhBadhya reviewed Jan 4, 2023

tarak271 requested a review from SourabhBadhya January 4, 2023 06:39

SourabhBadhya approved these changes Jan 10, 2023

saihemanth-cloudera approved these changes Jan 30, 2023

saihemanth-cloudera merged commit fcf8044 into apache:master Jan 30, 2023

yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023

HIVE-26774 - Implement array_slice UDF to get the subset of elements …

faa64b6

…from an array (subarray) (apache#3893)(Taraka Rama Rao Lethavadla, reviewed by Sai Hemanth, Sourabh Badhya)

tarak271 added a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023

HIVE-26774 - Implement array_slice UDF to get the subset of elements …

e33c966

…from an array (subarray) (apache#3893)(Taraka Rama Rao Lethavadla, reviewed by Sai Hemanth, Sourabh Badhya)
