[SPARK-7853] [SQL] Fix HiveContext in Spark Shell #6459

yhuai · 2015-05-28T16:00:19Z

https://issues.apache.org/jira/browse/SPARK-7853

This fixes the problem introduced by my change in #6435, which caused HiveContext creation to fail in the Spark shell because of a class loader issue.

SparkQA · 2015-05-28T17:58:15Z

Test build #33662 has finished for PR 6459 at commit 3737766.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KryoSerializationStream(
- class KryoDeserializationStream(

SparkQA · 2015-05-28T20:36:32Z

Test build #33670 has finished for PR 6459 at commit 35d86f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-05-28T20:44:23Z

Test build #33671 has finished for PR 6459 at commit 005649b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2015-05-28T21:20:25Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

  /** Non-partitionKey attributes */
-  val attributes = hiveQlTable.getCols.map(_.toAttribute)
+  val attributes = hiveQlTable.getTTable.getSd.getCols.map(_.toAttribute)


Can we just get both of these from the spark sql HiveTable instead?

marmbrus · 2015-05-28T21:20:35Z

One comment otherwise LGTM.

SparkQA · 2015-05-28T23:49:31Z

Test build #33679 timed out for PR 6459 at commit 47cdb6d after a configured wait of 150m.

SparkQA · 2015-05-29T00:10:07Z

Test build #33686 has finished for PR 6459 at commit 37ad33e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- implicit class SchemaAttribute(f: HiveColumn)

yhuai · 2015-05-29T00:12:12Z

I am merging it to branch 1.4 and master.

https://issues.apache.org/jira/browse/SPARK-7853

This fixes the problem introduced by my change in #6435, which causes that Hive Context fails to create in spark shell because of the class loader issue.

Author: Yin Huai <yhuai@databricks.com>

Closes #6459 from yhuai/SPARK-7853 and squashes the following commits:

37ad33e [Yin Huai] Do not use hiveQlTable at all.
47cdb6d [Yin Huai] Move hiveconf.set to the end of setConf.
005649b [Yin Huai] Update comment.
35d86f3 [Yin Huai] Access TTable directly to make sure Hive will not internally use any metastore utility functions.
3737766 [Yin Huai] Recursively find all jars.

(cherry picked from commit 572b62c)
Signed-off-by: Yin Huai <yhuai@databricks.com>

…uiltin' Hive version for metadata client

### What changes were proposed in this pull request?

When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen.

### Why are the changes needed?

Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs.

It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](#24057 (comment))). The isolated clientloader was originally added in #5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin":

> attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.

I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in #6435 / #6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session.

### Does this PR introduce _any_ user-facing change?

No, except to protect Spark itself from potentially being broken by bad user JARs.

### How was this patch tested?

This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected.

Closes #40144 from xkrogen/xkrogen/SPARK-42539/hive-isolatedclientloader-builtin-user-jar-conflict-fix/java9strategy.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Chao Sun <sunchao@apple.com>
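The message above contrasts collecting JARs from the user classloader alone with also collecting them from its parents, which is what the "Recursively find all jars" commit in #6459 refers to. A minimal sketch of that idea follows, assuming the loaders of interest are `URLClassLoader` instances; the object name `JarDiscovery` is hypothetical and this is not presented as the exact code merged in the PR.

```scala
import java.net.{URL, URLClassLoader}

// Hypothetical helper: walks a classloader chain and gathers JAR URLs.
object JarDiscovery {
  // Returns the URLs of the given loader followed by those of its ancestors.
  // Loaders that are not URLClassLoaders contribute no URLs of their own,
  // but their parents are still visited.
  def allJars(classLoader: ClassLoader): Array[URL] = classLoader match {
    case null => Array.empty[URL]
    case urlLoader: URLClassLoader =>
      urlLoader.getURLs ++ allJars(urlLoader.getParent)
    case other => allJars(other.getParent)
  }
}
```

Calling `JarDiscovery.allJars(Thread.currentThread().getContextClassLoader)` would be one way to try it. On Java 9+ the application classloader is no longer a `URLClassLoader`, so this walk may find few or no URLs there, which is consistent with the Java-version-dependent behavior the message describes.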

…uiltin' Hive version for metadata client

### What changes were proposed in this pull request?

When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen.

Please note that this is a re-submit of #40144. That one introduced test failures, and potentially a real issue, because the PR works by setting `isolationOn = false` for `builtin` mode. In addition to adjusting the classloader, `HiveClientImpl` relies on `isolationOn` to determine if it should use an isolated copy of `SessionState`, so the PR inadvertently switched to using a shared `SessionState` object. I think we do want to continue to have the isolated session state even in `builtin` mode, so this adds a new flag `sessionStateIsolationOn` which controls whether the session state should be isolated, _separately_ from the `isolationOn` flag which controls whether the classloader should be isolated. Default behavior is for `sessionStateIsolationOn` to be set equal to `isolationOn`, but for `builtin` mode, we override it to enable session state isolated even though classloader isolation is turned off.

### Why are the changes needed?

Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs.

It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](#24057 (comment))). The isolated clientloader was originally added in #5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin":

> attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.

I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in #6435 / #6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session.

### Does this PR introduce _any_ user-facing change?

No, except to protect Spark itself from potentially being broken by bad user JARs.

### How was this patch tested?

This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected.

Unit tests failing in #40144 have been locally tested (`HiveUtilsSuite`, `HiveSharedStateSuite`, `HiveCliSessionStateSuite`, `JsonHadoopFsRelationSuite`).

Closes #40224 from xkrogen/xkrogen/SPARK-42539/hive-isolatedclientloader-builtin-user-jar-conflict-fix/take2.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Chao Sun <sunchao@apple.com>
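The relationship between the two flags described in the re-submit can be modelled in a few lines. This is a simplified sketch only: `isolationOn` and `sessionStateIsolationOn` are the names used in the message, while the `ClientIsolation` class, the `sessionStateIsolationOverride` parameter, and the example object are hypothetical and are not claimed to match the actual `IsolatedClientLoader` code.

```scala
// Hypothetical model of the two flags; not Spark's actual IsolatedClientLoader.
final case class ClientIsolation(
    // Controls whether the classloader is isolated.
    isolationOn: Boolean,
    // Optional override (assumed name); None means "follow isolationOn".
    sessionStateIsolationOverride: Option[Boolean] = None) {

  // Controls whether the session state is isolated.
  // Defaults to the classloader setting unless explicitly overridden.
  val sessionStateIsolationOn: Boolean =
    sessionStateIsolationOverride.getOrElse(isolationOn)
}

object ClientIsolationExamples {
  // 'builtin' mode as described above: shared classloader, isolated session state.
  val builtin: ClientIsolation =
    ClientIsolation(isolationOn = false, sessionStateIsolationOverride = Some(true))

  // A fully isolated client: both flags end up true by default.
  val isolated: ClientIsolation = ClientIsolation(isolationOn = true)
}
```

Keeping the override optional means existing callers that only set `isolationOn` keep their previous behavior, which matches the stated default of `sessionStateIsolationOn` being equal to `isolationOn`.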

Recursively find all jars.

3737766

yhuai changed the title ~~[SPARK-7853] [SQL] Fix Spark Shell~~ [SPARK-7853] [SQL] Fix HiveContext in Spark Shell May 28, 2015

yhuai added 2 commits May 28, 2015 11:36

Access TTable directly to make sure Hive will not internally use any metastore utility functions.

35d86f3

Update comment.

005649b

Move hiveconf.set to the end of setConf.

47cdb6d

marmbrus reviewed May 28, 2015

Do not use hiveQlTable at all.

37ad33e

asfgit closed this in 572b62c May 29, 2015

xkrogen mentioned this pull request Feb 23, 2023

[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client #40144

Closed

xkrogen mentioned this pull request Feb 28, 2023

[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client #40224

Closed
