[SPARK-44719][SQL] Fix NoClassDefFoundError when using Hive UDF #42446

Closed
wants to merge 1 commit into apache:master from wangyum:SPARK-44719

Conversation


@wangyum (Member) commented Aug 11, 2023

### What changes were proposed in this pull request?

This PR changes jackson-mapper-asl's scope from `test` to `${hive.deps.scope}`.

### Why are the changes needed?

Fix `NoClassDefFoundError` when using Hive UDF:
```
spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
Time taken: 0.413 seconds
spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP';
Time taken: 0.038 seconds
spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)]
java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
	at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
...
```
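On an affected build, a possible workaround (a sketch, not part of this PR; the jar locations are placeholders) is to put the old CodeHaus Jackson jars back on the driver and executor class paths:

```
# Workaround sketch for an affected build; jar paths are placeholders.
$ bin/spark-sql \
    --conf spark.driver.extraClassPath=/path/to/jackson-core-asl-1.9.13.jar:/path/to/jackson-mapper-asl-1.9.13.jar \
    --conf spark.executor.extraClassPath=/path/to/jackson-core-asl-1.9.13.jar:/path/to/jackson-mapper-asl-1.9.13.jar
```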

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

```
@@ -1838,7 +1838,7 @@
       <groupId>org.codehaus.jackson</groupId>
       <artifactId>jackson-mapper-asl</artifactId>
       <version>${codehaus.jackson.version}</version>
-      <scope>test</scope>
+      <scope>${hive.deps.scope}</scope>
```
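One way to verify the effective scope after this change is Maven's dependency tree (a sketch, assuming a Maven checkout of Spark; the module and filter are illustrative):

```
# Sketch: confirm jackson-mapper-asl is resolved with the new scope.
$ build/mvn -pl sql/hive dependency:tree -Dincludes=org.codehaus.jackson:jackson-mapper-asl
```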

@wangyum (Member, Author) commented on the diff, Aug 11, 2023


The master branch uses Hadoop 3.3.6, so Hadoop itself no longer needs this dependency.
branch-3.5 uses Hadoop 3.3.4, so Hadoop still needs it there.

https://issues.apache.org/jira/browse/HADOOP-13332

@yaooqinn (Member) commented:

thanks, merged to master

@yaooqinn yaooqinn closed this in 7baf9da Aug 12, 2023
@wangyum wangyum deleted the SPARK-44719 branch August 12, 2023 10:05
hvanhovell pushed a commit to hvanhovell/spark that referenced this pull request Aug 13, 2023
Closes apache#42446 from wangyum/SPARK-44719.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Kent Yao <yao@apache.org>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
@dongjoon-hyun (Member) commented:

> This PR changes jackson-mapper-asl's scope from `test` to `${hive.deps.scope}`.

In addition to jackson-mapper-asl, jackson-core-asl is also recovered. So, this is a revert of SPARK-43225 in terms of the dependency.

dongjoon-hyun added a commit that referenced this pull request Feb 21, 2024
### What changes were proposed in this pull request?

This PR aims to provide a new profile, `hive-jackson-provided`, for Apache Spark 4.0.0.

### Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive still requires the old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but that change was reverted due to Hive UDF support.
- #40893
- #42446

This allows Apache Spark 4.0 users:
- to provide their own CodeHaus Jackson libraries, or
- to exclude them completely if they don't use Hive UDFs (a usage sketch follows).
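A usage sketch under these assumptions (the build command is the one from the testing section below; the jar paths are placeholders):

```
# Build a distribution without the bundled CodeHaus Jackson jars.
$ dev/make-distribution.sh -Phive,hive-thriftserver,hive-jackson-provided

# Supply your own copies only when Hive UDFs are actually needed.
$ bin/spark-sql --jars /path/to/jackson-core-asl-1.9.13.jar,/path/to/jackson-mapper-asl-1.9.13.jar
```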

### Does this PR introduce _any_ user-facing change?

No, this is a new profile.

### How was this patch tested?

Pass the CIs and manual build.

**Without `hive-jackson-provided`**
```
$ dev/make-distribution.sh -Phive,hive-thriftserver
$ ls -al dist/jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Feb 21 10:53 dist.org/jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Feb 21 10:53 dist.org/jars/jackson-mapper-asl-1.9.13.jar
```

**With `hive-jackson-provided`**
```
$ dev/make-distribution.sh -Phive,hive-thriftserver,hive-jackson-provided
$ ls -al dist/jars/*asl*
zsh: no matches found: dist/jars/*asl*

$ ls -al dist/jars/*hive*
-rw-r--r--  1 dongjoon  staff    183633 Feb 21 11:00 dist/jars/hive-beeline-2.3.9.jar
-rw-r--r--  1 dongjoon  staff     44704 Feb 21 11:00 dist/jars/hive-cli-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    436169 Feb 21 11:00 dist/jars/hive-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff  10840949 Feb 21 11:00 dist/jars/hive-exec-2.3.9-core.jar
-rw-r--r--  1 dongjoon  staff    116364 Feb 21 11:00 dist/jars/hive-jdbc-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    326585 Feb 21 11:00 dist/jars/hive-llap-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff   8195966 Feb 21 11:00 dist/jars/hive-metastore-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    916630 Feb 21 11:00 dist/jars/hive-serde-2.3.9.jar
-rw-r--r--  1 dongjoon  staff   1679366 Feb 21 11:00 dist/jars/hive-service-rpc-3.1.3.jar
-rw-r--r--  1 dongjoon  staff     53902 Feb 21 11:00 dist/jars/hive-shims-0.23-2.3.9.jar
-rw-r--r--  1 dongjoon  staff      8786 Feb 21 11:00 dist/jars/hive-shims-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    120293 Feb 21 11:00 dist/jars/hive-shims-common-2.3.9.jar
-rw-r--r--  1 dongjoon  staff     12923 Feb 21 11:00 dist/jars/hive-shims-scheduler-2.3.9.jar
-rw-r--r--  1 dongjoon  staff    258346 Feb 21 11:00 dist/jars/hive-storage-api-2.8.1.jar
-rw-r--r--  1 dongjoon  staff    581739 Feb 21 11:00 dist/jars/spark-hive-thriftserver_2.13-4.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff    687446 Feb 21 11:00 dist/jars/spark-hive_2.13-4.0.0-SNAPSHOT.jar
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45201 from dongjoon-hyun/SPARK-47119.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun added a commit that referenced this pull request Feb 24, 2024
… a new optional directory

### What changes were proposed in this pull request?

This PR aims to provide `Apache Hive`'s `CodeHaus Jackson` dependencies via a new optional directory, `hive-jackson`, instead of the standard `jars` directory of the Apache Spark binary distribution. Additionally, two internal configurations are added whose default values are `hive-jackson/*` (see the sketch after this list):

  - `spark.driver.defaultExtraClassPath`
  - `spark.executor.defaultExtraClassPath`
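Written out explicitly, the new defaults amount to the following (a sketch based on this description; setting them by hand should normally be unnecessary):

```
# Sketch: the new defaults expressed as explicit --conf settings.
$ bin/spark-shell \
    --conf spark.driver.defaultExtraClassPath='hive-jackson/*' \
    --conf spark.executor.defaultExtraClassPath='hive-jackson/*'
```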

For example, Apache Spark distributions have been providing the `spark-*-yarn-shuffle.jar` file under the `yarn` directory instead of `jars`.

**YARN SHUFFLE EXAMPLE**
```
$ ls -al yarn/*jar
-rw-r--r--  1 dongjoon  staff  77352048 Sep  8 19:08 yarn/spark-3.5.0-yarn-shuffle.jar
```

This PR changes `Apache Hive`'s `CodeHaus Jackson` dependencies in a similar way.

**BEFORE**
```
$ ls -al jars/*asl*
-rw-r--r--  1 dongjoon  staff  232248 Sep  8 19:08 jars/jackson-core-asl-1.9.13.jar
-rw-r--r--  1 dongjoon  staff  780664 Sep  8 19:08 jars/jackson-mapper-asl-1.9.13.jar
```

**AFTER**
```
$ ls -al jars/*asl*
zsh: no matches found: jars/*asl*

$ ls -al hive-jackson
total 1984
drwxr-xr-x   4 dongjoon  staff     128 Feb 23 15:37 .
drwxr-xr-x  16 dongjoon  staff     512 Feb 23 16:34 ..
-rw-r--r--   1 dongjoon  staff  232248 Feb 23 15:37 jackson-core-asl-1.9.13.jar
-rw-r--r--   1 dongjoon  staff  780664 Feb 23 15:37 jackson-mapper-asl-1.9.13.jar
```

### Why are the changes needed?

Since Apache Hadoop 3.3.5, only Apache Hive still requires the old CodeHaus Jackson dependencies.

Apache Spark 3.5.0 tried to eliminate them completely, but that change was reverted due to Hive UDF support.

  - #40893
  - #42446

SPARK-47119 added a way to exclude Apache Hive Jackson dependencies at the distribution building stage for Apache Spark 4.0.0.

  - #45201

This PR provides a way to exclude Apache Hive Jackson dependencies at runtime for Apache Spark 4.0.0.

- Spark Shell without Apache Hive Jackson dependencies.
```
$ bin/spark-shell --driver-default-class-path ""
```

- Spark SQL Shell without Apache Hive Jackson dependencies.
```
$ bin/spark-sql --driver-default-class-path ""
```

- Spark Thrift Server without Apache Hive Jackson dependencies.
```
$ sbin/start-thriftserver.sh --driver-default-class-path ""
```

In addition, last but not least, this PR eliminates `CodeHaus Jackson` dependencies from the following Apache Spark daemons (started via `spark-daemon.sh start`) because they don't require Hive's `CodeHaus Jackson` dependencies:

- Spark Master
- Spark Worker
- Spark History Server

```
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
```

### Does this PR introduce _any_ user-facing change?

No. There is no user-facing change by default.

- For distributions built with the `hive-jackson-provided` profile, the `scope` of the Apache Hive Jackson dependencies is `provided` and the `hive-jackson` directory is not created at all.
- For distributions with the default setting, the `scope` of the Apache Hive Jackson dependencies is still `compile`. In addition, they are on Apache Spark's built-in class path, as shown below.

![Screenshot 2024-02-23 at 16 48 08](https://github.com/apache/spark/assets/9700541/99ed0f02-2792-4666-ae19-ce4f4b7b8ff9)

- The following Spark daemons don't use `CodeHaus Jackson` dependencies.
  - Spark Master
  - Spark Worker
  - Spark History Server

### How was this patch tested?

Pass the CIs, manually build a distribution, and check the class paths in the `Environment` tab.

```
$ dev/make-distribution.sh -Phive,hive-thriftserver
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45237 from dongjoon-hyun/SPARK-47152.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
jpcorreia99 pushed a commit to jpcorreia99/spark that referenced this pull request Feb 26, 2024
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024