[SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive to ensure ObjectHashAggregateExecBenchmark can run successfully on Github Action #41369

Closed
wants to merge 2 commits into from

@LuciferYang LuciferYang commented May 29, 2023

What changes were proposed in this pull request?

This PR removes originalUDFs from TestHive so that ObjectHashAggregateExecBenchmark can run successfully on GitHub Actions.

Why are the changes needed?

After SPARK-43225, org.codehaus.jackson:jackson-mapper-asl became a test-scope dependency, so it is not on the classpath when a benchmark runs on GitHub Actions (GA), because GA uses

bin/spark-submit \
--driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \
--jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
"`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
"${{ github.event.inputs.class }}"

instead of the sbt Test/runMain.
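To check whether a class is visible on a given launch classpath, a small probe can help. The following is a minimal sketch and not part of this PR: ClasspathProbe and isLoadable are hypothetical names, and the only API it relies on is the standard Class.forName(name, initialize, loader). Passing initialize = false keeps the check from running static initializers.

```java
// Hypothetical helper, for illustration only: reports whether a class can be
// loaded by the class loader that loaded this probe.
final class ClasspathProbe {

    static boolean isLoadable(String className) {
        try {
            // initialize = false: load the class without running its static initializer
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies cannot be linked
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the NoClassDefFoundError reported in this PR
        String name = args.length > 0
            ? args[0]
            : "org.codehaus.jackson.map.type.TypeFactory";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Run with the same jars as the spark-submit invocation above, such a probe would be expected to report the Jackson class as not loadable, assuming no other jar on that classpath provides it.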

ObjectHashAggregateExecBenchmark uses TestHive, and before this PR TestHive always called org.apache.hadoop.hive.ql.exec.FunctionRegistry#getFunctionNames to initialize originalUDFs. As a result, running ObjectHashAggregateExecBenchmark on GitHub Actions failed with the following exception:

Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
	at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132)
	at org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151)
	at org.apache.hadoop.hive.ql.exec.Registry.addFunction(Registry.java:519)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:163)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:154)
	at org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:147)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:322)
	at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:530)
	at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:185)
	at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:133)
	at org.apache.spark.sql.hive.test.TestHive$.<init>(TestHive.scala:54)
	at org.apache.spark.sql.hive.test.TestHive$.<clinit>(TestHive.scala:53)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.getSparkSession(ObjectHashAggregateExecBenchmark.scala:47)
	at org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark.$init$(SqlBasedBenchmark.scala:35)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark$.<clinit>(ObjectHashAggregateExecBenchmark.scala:45)
	at org.apache.spark.sql.execution.benchmark.ObjectHashAggregateExecBenchmark.main(ObjectHashAggregateExecBenchmark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.benchmark.Benchmarks$.$anonfun$main$7(Benchmarks.scala:128)
	at scala.collection.ArrayOps$.foreach$extension(ArrayOps.scala:1328)
	at org.apache.spark.benchmark.Benchmarks$.main(Benchmarks.scala:91)
	at org.apache.spark.benchmark.Benchmarks.main(Benchmarks.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1025)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1116)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1125)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.type.TypeFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 40 more

I then found that originalUDFs is now an unused val in TestHive (it was introduced by SPARK-1251 | #6920 and became unused after SPARK-20667 | #17908), so this PR removes it from TestHive to avoid calling FunctionRegistry#getFunctionNames.
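The mechanism can be reproduced outside Spark. The sketch below is a plain-Java illustration, not the actual TestHive code: HeavyRegistry stands in for Hive's FunctionRegistry, and every class and field name is hypothetical. It shows that a field nobody reads still runs its initializer when the object is constructed, which in turn forces the static initializer of the class it touches.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Records initialization events without touching HeavyRegistry itself.
final class InitLog {
    static final List<String> EVENTS = new ArrayList<>();
}

// Hypothetical stand-in for a registry whose static initializer does eager work
// (in the real failure, that work loads a class that is missing from the classpath).
final class HeavyRegistry {
    static {
        InitLog.EVENTS.add("HeavyRegistry initialized");
    }

    static List<String> getFunctionNames() {
        return Collections.singletonList("get_json_object");
    }
}

// The field is never read, yet its initializer runs on every construction
// and triggers HeavyRegistry's static initializer.
final class SessionWithUnusedField {
    private final List<String> originalNames = HeavyRegistry.getFunctionNames();
}

// Same session with the unused field removed: HeavyRegistry is never touched.
final class SessionWithoutField {
}

final class UnusedFieldDemo {
    public static void main(String[] args) {
        new SessionWithoutField();
        System.out.println("after SessionWithoutField: " + InitLog.EVENTS);
        new SessionWithUnusedField();
        System.out.println("after SessionWithUnusedField: " + InitLog.EVENTS);
    }
}
```

Under these assumptions, dropping the unused field is enough to keep the registry class from being initialized at all, which is the effect this PR relies on.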

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Pass GitHub Actions
  • Run ObjectHashAggregateExecBenchmark on GitHub Actions:

Before

https://github.com/LuciferYang/spark/actions/runs/5128228630/jobs/9224706982

After

https://github.com/LuciferYang/spark/actions/runs/5128227211/jobs/9224704507

ObjectHashAggregateExecBenchmark ran successfully.

@LuciferYang LuciferYang marked this pull request as draft May 29, 2023 11:57
@github-actions github-actions bot added the SQL label May 29, 2023
@LuciferYang LuciferYang changed the title [SQL][TESTS] Remove originalUDFs from TestHive [SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive May 30, 2023
@LuciferYang LuciferYang changed the title [SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive [SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive to ensure ObjectHashAggregateExecBenchmark can run successfully on GA May 31, 2023
@LuciferYang LuciferYang changed the title [SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive to ensure ObjectHashAggregateExecBenchmark can run successfully on GA [SPARK-43868][SQL][TESTS] Remove originalUDFs from TestHive to ensure ObjectHashAggregateExecBenchmark can run successfully on Github Action May 31, 2023
LuciferYang commented

cc @wangyum @dongjoon-hyun @pan3793 FYI

@LuciferYang LuciferYang marked this pull request as ready for review May 31, 2023 04:05
@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@wangyum wangyum closed this in 3472619 May 31, 2023
wangyum commented May 31, 2023

Merged to master.

LuciferYang commented

Thanks @wangyum @dongjoon-hyun @pan3793 ~

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
…sure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action

Closes apache#41369 from LuciferYang/hive-udf.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
prabhjyotsingh pushed a commit to acceldata-io/spark3 that referenced this pull request Apr 9, 2024
…sure `ObjectHashAggregateExecBenchmark` can run successfully on Github Action

Closes apache#41369 from LuciferYang/hive-udf.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 3472619)