
[SPARK-6908] [SQL] Use isolated Hive client #5876

Closed

Conversation

@marmbrus (Contributor) commented May 4, 2015

This PR switches Spark SQL's Hive support to use the isolated Hive client interface introduced by #5851, instead of directly interacting with the client. By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting spark.sql.hive.metastore.version, without the need to recompile. This also greatly reduces the surface area of our interaction with the Hive libraries, hopefully making it easier to support other versions in the future.
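To illustrate the isolation technique at a high level, here is a hedged sketch (not the real IsolatedClientLoader; the class name, constructor, and isShared predicate are simplifications): Hive classes resolve from a dedicated set of jars, while a small set of shared classes (e.g. java.*, the ClientInterface itself) still come from the application classloader.

```scala
import java.net.{URL, URLClassLoader}

// Hedged sketch of the isolation idea, not Spark's actual implementation.
// Classes matching `isShared` delegate to the application classloader;
// everything else (notably org.apache.hadoop.hive.*) resolves from `jars`.
class SketchIsolatedLoader(
    jars: Array[URL],
    baseLoader: ClassLoader,
    isShared: String => Boolean)
  extends URLClassLoader(jars, null) {

  override def loadClass(name: String, resolve: Boolean): Class[_] =
    if (isShared(name)) baseLoader.loadClass(name)
    else super.loadClass(name, resolve)
}
```

Swapping metastore versions then amounts to handing such a loader a different set of Hive jars, with no recompilation of Spark.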

Jars for the desired hive version can be configured using spark.sql.hive.metastore.jars, which accepts the following options:

  • a colon-separated list of jar files or directories for Hive and Hadoop.
  • builtin - attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.
  • maven - download the correct version of Hive on demand from Maven.

By default, builtin is used for Hive 13.
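For example, a hypothetical deployment targeting a Hive 0.12.0 metastore might look like the following (the app name and jar paths are placeholders, not from this PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Placeholder paths; any colon-separated list of jars/directories works.
val conf = new SparkConf()
  .setAppName("metastore-version-example")
  .set("spark.sql.hive.metastore.version", "0.12.0")
  .set("spark.sql.hive.metastore.jars",
    "/opt/hive-0.12.0/lib:/opt/hadoop/share/hadoop/common")
  // Alternatively, fetch the matching Hive artifacts on demand:
  // .set("spark.sql.hive.metastore.jars", "maven")

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
```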

This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores. However, the full removal of the Shim is deferred until a later PR.

Remaining TODOs:

  • Remove the Hive Shims and inline code for Hive 13.
  • Several HiveCompatibility tests are not yet passing.
    • nullformatCTAS - As detailed below, we are now handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer. However, we currently only handle the common cases and not things like CTAS where the null format is specified.
    • combine1 now leaks state about compression somehow, breaking all subsequent tests. As such, we currently add it to the blacklist.
    • part_inherit_tbl_props and part_inherit_tbl_props_with_star do not work anymore. We are correctly propagating the information.
    • "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests. It seems our RESET mechanism may not be as robust as it used to be?

Other required changes:

  • CreateTableAsSelect no longer carries parts of the HiveQL AST with it through the query execution pipeline. Instead, we parse CTAS during the HiveQL conversion and construct a HiveTable. The full parsing here is not yet complete, as detailed above in the remaining TODOs. Since the operator is Hive-specific, it is moved to the hive package.
  • Command is simplified to a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated, as sketched below.
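
A minimal sketch of that marker pattern (the run signature and the execute helper are simplifications, not the real API):

```scala
// Only the Command/RunnableCommand/LogicalPlan names come from this PR;
// the bodies are simplified for illustration.
trait LogicalPlan
trait Command                 // marker: evaluate this plan eagerly

trait RunnableCommand extends LogicalPlan with Command {
  def run(): Unit             // simplified; the real trait produces Rows
}

// The execution layer can then just pattern-match on the marker:
def execute(plan: LogicalPlan): Unit = plan match {
  case cmd: RunnableCommand => cmd.run() // eagerly evaluated
  case _                    =>           // planned lazily as usual
}
```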

@SparkQA commented May 4, 2015

Test build #31713 has finished for PR 5876 at commit 8843a25.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait RunnableCommand extends LogicalPlan with logical.Command
    • case class CreateTableAsSelect(

@SparkQA commented May 4, 2015

Test build #31714 has finished for PR 5876 at commit 4d8bf02.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ParamGridBuilder(object):
    • trait RunnableCommand extends LogicalPlan with logical.Command
    • case class CreateTableAsSelect(
    • case class HiveDatabase(
    • abstract class TableType
    • case class HiveStorageDescriptor(
    • case class HivePartition(
    • case class HiveColumn(name: String, hiveType: String, comment: String)
    • case class HiveTable(
    • trait ClientInterface
    • class ClientWrapper(
    • class IsolatedClientLoader(
    • protected trait ReflectionMagic
    • protected implicit class InstanceMagic(a: Any)
    • protected implicit class StaticMagic(c: Class[_])

outputFormat = None,
serde = None)

// TODO: Handle all the cases here...
@marmbrus (Contributor, Author) commented:

@chenghao-intel, do you have time to help finish the implementation here? I had to remove your version that relied on Hive's semantic analyzer.

@SparkQA commented May 6, 2015

Test build #31958 has finished for PR 5876 at commit ab07f7e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@SparkQA commented May 6, 2015

Test build #31961 has finished for PR 5876 at commit a6f5df1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@SparkQA commented May 6, 2015

Test build #32003 has finished for PR 5876 at commit 1c50813.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 6, 2015

Test build #32013 has finished for PR 5876 at commit 1d8ae44.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@SparkQA commented May 6, 2015

Test build #32028 has finished for PR 5876 at commit 81711c4.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@SparkQA commented May 6, 2015

Test build #32030 has finished for PR 5876 at commit e7b3941.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@SparkQA commented May 6, 2015

Test build #32049 has finished for PR 5876 at commit 11e9c72.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@SparkQA commented May 7, 2015

Test build #32041 has finished for PR 5876 at commit 7e8f010.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@chenghao-intel (Contributor) commented:

@marmbrus unit tests passed. I will update the code to support SerDe in CTAS once this PR is merged.

@SparkQA commented May 7, 2015

Test build #32053 has finished for PR 5876 at commit f5de7de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@transient
protected[hive] lazy val executionConf = new HiveConf()
executionConf.set(
  "javax.jdo.option.ConnectionURL", s"jdbc:derby:;databaseName=$localMetastore;create=true")
Contributor:

We will not contact this metastore; its only purpose is to make the Hive used for execution happy, right?

@marmbrus (Contributor, Author):

Correct, but some operations do instantiate the client, so we need to point it at a dummy database; otherwise, in many cases Derby will complain that more than one process is trying to open the database.
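
In other words, each client instance gets its own throwaway Derby database; a rough sketch of the idea (the temp-directory handling here is hypothetical):

```scala
import java.nio.file.Files
import org.apache.hadoop.hive.conf.HiveConf

// Hypothetical illustration: a fresh directory per instance means no two
// processes ever open the same embedded Derby database.
val localMetastore = Files.createTempDirectory("metastore").toString
val executionConf = new HiveConf()
executionConf.set(
  "javax.jdo.option.ConnectionURL",
  s"jdbc:derby:;databaseName=$localMetastore;create=true")
```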

asfgit pushed a commit that referenced this pull request May 8, 2015
This PR switches Spark SQL's Hive support to use the isolated Hive client interface introduced by #5851, instead of directly interacting with the client.  By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting `spark.sql.hive.metastore.version` without the need to recompile.  This also greatly reduces the surface area for our interaction with the Hive libraries, hopefully making it easier to support other versions in the future.

Jars for the desired hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
 - a colon-separated list of jar files or directories for hive and hadoop.
 - `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This
            option is only valid when using the execution version of Hive.
 - `maven` - download the correct version of hive on demand from maven.

By default, `builtin` is used for Hive 13.

This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores.  However, the full removal of the Shim is deferred until a later PR.

Remaining TODOs:
 - Remove the Hive Shims and inline code for Hive 13.
 - Several HiveCompatibility tests are not yet passing.
  - `nullformatCTAS` - As detailed below, we now are handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer.  However, we currently only handle the common cases and not things like CTAS where the null format is specified.
  - `combine1` now leaks state about compression somehow, breaking all subsequent tests.  As such we currently add it to the blacklist
  - `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore.  We are correctly propagating the information
  - "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests.  It seems our `RESET` mechanism may not be as robust as it used to be?

Other required changes:
 -  `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline.  Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`.  The full parsing here is not yet complete as detailed above in the remaining TODOs.  Since the operator is Hive specific, it is moved to the hive package.
 - `Command` is simplified to be a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated.

Author: Michael Armbrust <michael@databricks.com>

Closes #5876 from marmbrus/useIsolatedClient and squashes the following commits:

258d000 [Michael Armbrust] really really correct path handling
e56fd4a [Michael Armbrust] getAbsolutePath
5a259f5 [Michael Armbrust] fix typos
81bb366 [Michael Armbrust] comments from vanzin
5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
4b5cd41 [Michael Armbrust] yin's comments
f5de7de [Michael Armbrust] cleanup
11e9c72 [Michael Armbrust] better coverage in versions suite
7e8f010 [Michael Armbrust] better error messages and jar handling
e7b3941 [Michael Armbrust] more permisive checking for function registration
da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
5fe5894 [Michael Armbrust] fix serialization suite
81711c4 [Michael Armbrust] Initial support for running without maven
1d8ae44 [Michael Armbrust] fix final tests?
1c50813 [Michael Armbrust] more comments
a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
a6f5df1 [Michael Armbrust] style
ab07f7e [Michael Armbrust] WIP
4d8bf02 [Michael Armbrust] Remove hive 12 compilation
8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client

(cherry picked from commit cd1d411)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in cd1d411 May 8, 2015
asfgit pushed a commit that referenced this pull request May 12, 2015
This is a follow up of #5876 and should be merged after #5876.

Let's wait for unit testing result from Jenkins.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5963 from chenghao-intel/useIsolatedClient and squashes the following commits:

f87ace6 [Cheng Hao] remove the TODO and add `resolved condition` for HiveTable
a8260e8 [Cheng Hao] Update code as feedback
f4e243f [Cheng Hao] remove the serde setting for SequenceFile
d166afa [Cheng Hao] style issue
d25a4aa [Cheng Hao] Add SerDe support for CTAS

(cherry picked from commit e35d878)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@coderfi (Contributor) commented May 22, 2015

Has this been tested with mapr4?

When running under Python, I am running into an exception:

Caused by: java.lang.UnsatisfiedLinkError: Native Library /tmp/mapr-root-libMapRClient.4.0.2-mapr.so already loaded in another classloader
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1931)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1851)
at java.lang.Runtime.load0(Runtime.java:795)
at java.lang.System.load(System.java:1062)
at com.mapr.fs.shim.LibraryLoader.load(LibraryLoader.java:29)

This occurs under the following scenario:

  1. Create a Spark context + Hive context.
  2. Run a Hive query.
  3. Stop the Spark context.
  4. Create another Spark context + Hive context.
  5. Run a Hive query.
  6. I get the Native Library exception.

From what I can tell, the ClassLoader is hitting a MapR class which attempts to load the native library (I presume, as part of a static initializer).
Unfortunately, the JVM prohibits this the second time around. :(

I could file a ticket with a full stack trace if you'd like.

BTW: I'm running into this problem building against the head of the 1.4 branch.

thanks,
Fi

@marmbrus (Contributor, Author):

Yes, please file a JIRA.

@coderfi (Contributor) commented May 22, 2015

K. Filed as https://issues.apache.org/jira/browse/SPARK-7819

thank you for looking into this!

Fi

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
@jeanlyn (Contributor) commented Jun 1, 2015

I set the metastore to 0.12.0 with the following steps, but get a NoClassDefFoundError:

  • I changed spark.sql.hive.metastore.version in spark-defaults.conf to 0.12.0.
  • Whether I set spark.sql.hive.metastore.jars to maven, or to the classpath of Hadoop and Hive, I got:
    java.lang.NoClassDefFoundError: com/google/common/base/Preconditions when creating Hive client using classpath

I am not sure I understand correctly.
Thanks

@jeanlyn (Contributor) commented Jun 1, 2015

I can use the built-in client (0.13.1) to connect to a 0.12.0 remote metastore correctly, except for some warnings and errors that do not affect running:

15/06/01 21:20:09 WARN metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: Invalid method name: 'get_functions'
       at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
       at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_functions(ThriftHiveMetastore.java:2886)
       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_functions(ThriftHiveMetastore.java:2872)
       at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunctions(HiveMetaStoreClient.java:1727)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
       at $Proxy10.getFunctions(Unknown Source)
       at org.apache.hadoop.hive.ql.metadata.Hive.getFunctions(Hive.java:2670)
       at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:674)
       at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:662)
       at org.apache.hadoop.hive.cli.CliDriver.getCommandCompletor(CliDriver.java:540)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:174)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
       at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/06/01 21:20:10 INFO hive.metastore: Trying to connect to metastore with URI thrift://172.19.154.28:9084
15/06/01 21:20:10 INFO hive.metastore: Connected to metastore.
15/06/01 21:20:10 ERROR exec.FunctionRegistry: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_functions'
spark-sql>

Thanks

@yhuai (Contributor) commented Jun 1, 2015

@jeanlyn The warning message related to get_functions is expected if you are using a Hive 0.13.1 metastore client to connect to a Hive 0.12.0 metastore server, because get_functions was not in Hive 0.12.0. Regarding the Hive 0.12 metastore client, can you post the full stack trace? Also, we can move our discussion to the user mailing list, so people who have the same question can see our thread.

@yhuai (Contributor) commented Jun 1, 2015

@jeanlyn I think you are hitting https://issues.apache.org/jira/browse/SPARK-8020.

@dougb commented Jun 4, 2015

FYI, this change broke getting a DataFrame via sqlContext.table("database.table"), which worked from 1.3.0 up until this PR.
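
A minimal repro of the broken call, for reference (database and table names are placeholders):

```scala
// Worked on Spark 1.3.0; reportedly fails after this PR.
val df = sqlContext.table("mydatabase.mytable")
df.show()
```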

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
marmbrus deleted the useIsolatedClient branch on March 8, 2016.
sunchao pushed a commit that referenced this pull request Feb 27, 2023
…uiltin' Hive version for metadata client

### What changes were proposed in this pull request?
When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen.

### Why are the changes needed?
Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs. It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](#24057 (comment))). The isolated clientloader was originally added in #5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin":
> attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.

I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in #6435 / #6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session.

### Does this PR introduce _any_ user-facing change?
No, except to protect Spark itself from potentially being broken by bad user JARs.

### How was this patch tested?
This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected.

Closes #40144 from xkrogen/xkrogen/SPARK-42539/hive-isolatedclientloader-builtin-user-jar-conflict-fix/java9strategy.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Chao Sun <sunchao@apple.com>
sunchao pushed a commit that referenced this pull request Feb 27, 2023
xkrogen added a commit to xkrogen/spark that referenced this pull request Feb 28, 2023
sunchao pushed a commit that referenced this pull request Mar 1, 2023
…uiltin' Hive version for metadata client

### What changes were proposed in this pull request?
When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen.

Please note that this is a re-submit of #40144. That one introduced test failures, and potentially a real issue, because the PR works by setting `isolationOn = false` for `builtin` mode. In addition to adjusting the classloader, `HiveClientImpl` relies on `isolationOn` to determine if it should use an isolated copy of `SessionState`, so the PR inadvertently switched to using a shared `SessionState` object. I think we do want to continue to have the isolated session state even in `builtin` mode, so this adds a new flag `sessionStateIsolationOn` which controls whether the session state should be isolated, _separately_ from the `isolationOn` flag which controls whether the classloader should be isolated. Default behavior is for `sessionStateIsolationOn` to be set equal to `isolationOn`, but for `builtin` mode, we override it to enable session state isolated even though classloader isolation is turned off.

### Why are the changes needed?
Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs. It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](#24057 (comment))). The isolated clientloader was originally added in #5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin":
> attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.

I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in #6435 / #6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session.

### Does this PR introduce _any_ user-facing change?
No, except to protect Spark itself from potentially being broken by bad user JARs.

### How was this patch tested?
This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected.

Unit tests failing in #40144 have been locally tested (`HiveUtilsSuite`, `HiveSharedStateSuite`, `HiveCliSessionStateSuite`, `JsonHadoopFsRelationSuite`).

Closes #40224 from xkrogen/xkrogen/SPARK-42539/hive-isolatedclientloader-builtin-user-jar-conflict-fix/take2.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Chao Sun <sunchao@apple.com>