[SPARK-48236][BUILD] Add commons-lang:commons-lang:2.6 back to support legacy Hive UDF jars

### What changes were proposed in this pull request?

This PR aims to add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

### Why are the changes needed?

Recently, we dropped `commons-lang:commons-lang` during the Hive upgrade.
- #46468

However, only Apache Hive 2.3.10 and 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 still require it. As a result, all existing UDF jars built against those versions still require `commons-lang:commons-lang`.

- apache/hive#4892

For example, the Apache Hive 3.1.3 code:
- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21
```java
import org.apache.commons.lang.StringUtils;
```

- https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42
```java
return StringUtils.strip(val, " ");
```
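
For illustration, here is a minimal sketch of a legacy UDF that hits this dependency. The package, class name, and jar are hypothetical; the point is the `org.apache.commons.lang` import, which only resolves when `commons-lang:commons-lang:2.6` is on the classpath:

```java
// Hypothetical legacy Hive UDF built against Hive 2.x/3.x. All names are
// illustrative; the key detail is the commons-lang 2.x import below.
package com.example.udf;

import org.apache.commons.lang.StringUtils; // commons-lang 2.x, not commons-lang3
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LegacyTrimUDF extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    // StringUtils.strip comes from commons-lang 2.6; without that jar on the
    // classpath, this call site throws NoClassDefFoundError at evaluation time.
    return new Text(StringUtils.strip(input.toString(), " "));
  }
}
```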

As a result, the Maven CI jobs are broken.
- https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 (Maven / Java 17)
- https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 (Maven / Java 21)

The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built with old Hive (before 2.3.10) libraries, which require `commons-lang:commons-lang:2.6`.
```
HiveUDFDynamicLoadSuite:
- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
20:21:25.129 WARN org.apache.spark.SparkContext: The JAR file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

*** RUN ABORTED ***
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/commons/lang/StringUtils
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
  at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
  at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
  at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
  at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
  at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
  ...
```
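
For context, here is a sketch of how such a jar is exercised, mirroring what `HiveUDFDynamicLoadSuite` does. The jar path, function name, and UDF class are illustrative, not the actual test fixtures:

```java
// Registers and invokes a legacy Hive UDF jar in Spark; all names are
// illustrative. With commons-lang 2.6 missing, the SELECT below fails with
// java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils.
import org.apache.spark.sql.SparkSession;

public class LegacyUdfSmokeTest {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .enableHiveSupport()
        .getOrCreate();

    // Ship the UDF jar to the session and register the function via Hive DDL.
    spark.sql("ADD JAR /tmp/legacy-udf.jar");
    spark.sql("CREATE TEMPORARY FUNCTION legacy_trim AS 'com.example.udf.LegacyTrimUDF'");

    // The UDF is initialized (and constant-folded) during analysis, which is
    // where the missing class surfaces, as in the stack trace above.
    spark.sql("SELECT legacy_trim('  spark  ')").show();

    spark.stop();
  }
}
```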

### Does this PR introduce _any_ user-facing change?

Yes. This restores support for existing customer UDF jars that still depend on `commons-lang:commons-lang`.

### How was this patch tested?

Manually, by running the affected test suite:

```
$ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveUDFDynamicLoadSuite test
...
HiveUDFDynamicLoadSuite:
14:21:56.034 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0

14:21:56.035 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon127.0.0.1

14:21:56.041 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
14:21:57.576 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDF
14:21:58.314 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDAF
14:21:58.943 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDAF
14:21:59.333 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

14:21:59.364 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

14:21:59.370 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src specified for non-external table:src

14:21:59.718 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

14:21:59.770 WARN org.apache.spark.SparkContext: The JAR file:///Users/dongjoon/APACHE/spark-merge/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:55526/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.

- Spark should be able to run Hive UDF using jar regardless of current thread context classloader (GENERIC_UDTF
14:22:00.403 WARN org.apache.hadoop.hive.common.FileUtils: File file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src does not exist; Force to delete it.

14:22:00.404 ERROR org.apache.hadoop.hive.common.FileUtils: Failed to delete file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/warehouse-49291492-9d48-4360-a354-ace73a2c76ce/src

14:22:00.441 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

14:22:00.453 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

14:22:00.537 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist

Run completed in 8 seconds, 612 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46528 from dongjoon-hyun/SPARK-48236.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun committed May 10, 2024
1 parent 726ef8a commit 5b3b8a9
Showing 5 changed files with 28 additions and 0 deletions.
5 changes: 5 additions & 0 deletions connector/kafka-0-10-assembly/pom.xml
```
@@ -54,6 +54,11 @@
       <artifactId>commons-codec</artifactId>
       <scope>provided</scope>
     </dependency>
+    <dependency>
+      <groupId>commons-lang</groupId>
+      <artifactId>commons-lang</artifactId>
+      <scope>provided</scope>
+    </dependency>
     <dependency>
       <groupId>com.google.protobuf</groupId>
       <artifactId>protobuf-java</artifactId>
```
5 changes: 5 additions & 0 deletions connector/kinesis-asl-assembly/pom.xml
```
@@ -54,6 +54,11 @@
       <artifactId>jackson-databind</artifactId>
      <scope>provided</scope>
     </dependency>
+    <dependency>
+      <groupId>commons-lang</groupId>
+      <artifactId>commons-lang</artifactId>
+      <scope>provided</scope>
+    </dependency>
     <dependency>
       <groupId>org.glassfish.jersey.core</groupId>
       <artifactId>jersey-client</artifactId>
```
1 change: 1 addition & 0 deletions dev/deps/spark-deps-hadoop-3-hive-2.3
```
@@ -46,6 +46,7 @@ commons-compress/1.26.1//commons-compress-1.26.1.jar
 commons-crypto/1.1.0//commons-crypto-1.1.0.jar
 commons-dbcp/1.4//commons-dbcp-1.4.jar
 commons-io/2.16.1//commons-io-2.16.1.jar
+commons-lang/2.6//commons-lang-2.6.jar
 commons-lang3/3.14.0//commons-lang3-3.14.0.jar
 commons-math3/3.6.1//commons-math3-3.6.1.jar
 commons-pool/1.5.4//commons-pool-1.5.4.jar
```
13 changes: 13 additions & 0 deletions pom.xml
```
@@ -192,6 +192,8 @@
     <commons-codec.version>1.17.0</commons-codec.version>
     <commons-compress.version>1.26.1</commons-compress.version>
     <commons-io.version>2.16.1</commons-io.version>
+    <!-- To support Hive UDF jars built by Hive 2.0.0 ~ 2.3.9 and 3.0.0 ~ 3.1.3. -->
+    <commons-lang2.version>2.6</commons-lang2.version>
     <!-- org.apache.commons/commons-lang3/-->
     <commons-lang3.version>3.14.0</commons-lang3.version>
     <!-- org.apache.commons/commons-pool2/-->
@@ -613,6 +615,11 @@
         <artifactId>commons-text</artifactId>
         <version>1.12.0</version>
       </dependency>
+      <dependency>
+        <groupId>commons-lang</groupId>
+        <artifactId>commons-lang</artifactId>
+        <version>${commons-lang2.version}</version>
+      </dependency>
       <dependency>
         <groupId>commons-io</groupId>
         <artifactId>commons-io</artifactId>
@@ -2899,6 +2906,12 @@
         <artifactId>hive-storage-api</artifactId>
         <version>${hive.storage.version}</version>
         <scope>${hive.storage.scope}</scope>
+        <exclusions>
+          <exclusion>
+            <groupId>commons-lang</groupId>
+            <artifactId>commons-lang</artifactId>
+          </exclusion>
+        </exclusions>
       </dependency>
       <dependency>
         <groupId>commons-cli</groupId>
```
4 changes: 4 additions & 0 deletions sql/hive/pom.xml
```
@@ -40,6 +40,10 @@
       <artifactId>spark-core_${scala.binary.version}</artifactId>
       <version>${project.version}</version>
     </dependency>
+    <dependency>
+      <groupId>commons-lang</groupId>
+      <artifactId>commons-lang</artifactId>
+    </dependency>
     <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_${scala.binary.version}</artifactId>
```
