
Conversation

@yhuai
Contributor

@yhuai yhuai commented May 20, 2016

What changes were proposed in this pull request?

Right now, we always share Hadoop classes between the Spark side and the metastore client side. However, these two sides may use different versions of Hadoop. So, we need to have a way to disable sharing Hadoop classes.
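
As a rough sketch of the idea (the flag name and the simplified predicate below are illustrative, not necessarily the exact code in this PR), IsolatedClientLoader's shared-class check could consult such a setting:

```scala
// Sketch only: decide whether a class should be loaded by Spark's classloader
// (shared) or by the isolated metastore classloader. `shareHadoopClasses`
// stands in for the new configuration; the name is a placeholder.
def isSharedClass(name: String, shareHadoopClasses: Boolean): Boolean = {
  val isHadoopClass =
    name.startsWith("org.apache.hadoop.") && !name.startsWith("org.apache.hadoop.hive.")
  name.startsWith("java.") ||
    name.startsWith("scala.") ||
    (shareHadoopClasses && isHadoopClass)
}
```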

How was this patch tested?

Existing tests and manual tests.

@SparkQA

SparkQA commented May 21, 2016

Test build #59043 has finished for PR 13236 at commit c0e3ae0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

@marmbrus I am setting this conf to false by default, which changes the behavior (our master always shares Hadoop classes).

@SparkQA

SparkQA commented May 21, 2016

Test build #59046 has finished for PR 13236 at commit c696264.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2016

Test build #59049 has finished for PR 13236 at commit 8adc940.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor Author

yhuai commented May 21, 2016

test this please

@SparkQA

SparkQA commented May 21, 2016

Test build #59050 has finished for PR 13236 at commit 8adc940.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  override val version: HiveVersion,
  sparkConf: SparkConf,
- hadoopConf: Configuration,
+ hadoopConf: Map[String, String],
Contributor Author

@rdblue FYI. I am changing the type from Configuration to Map[String, String] because Hadoop classes may not be shared between the Spark side and the IsolatedClientLoader side. When Hadoop classes are not shared, if we pass a Configuration object to HiveClientImpl, the class of that Configuration object (loaded by Spark's classloader) is different from the Configuration class loaded by the IsolatedClientLoader.
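
A minimal sketch of the hand-off on the Spark side, assuming the flattening happens before the call crosses the classloader boundary (the helper name is illustrative):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration

// Flatten the Configuration into plain strings so that no Hadoop object has
// to cross into the isolated classloader; HiveClientImpl then receives only
// a Map[String, String].
def toStringMap(hadoopConf: Configuration): Map[String, String] =
  hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
```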

Contributor Author

Also, the reason that I am adding a flag to disable Hadoop class sharing is that the Hadoop versions used by Spark and by HiveClientImpl can be different.

@SparkQA

SparkQA commented May 22, 2016

Test build #59115 has finished for PR 13236 at commit 084ae00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented May 23, 2016

I don't think a map preserves behavior. Hadoop Configuration instances have a set of final properties that can't be changed. This loses that information and then applies properties from the SparkConf and extraConfig, which could potentially overwrite final properties.

There are also two more issues that come to mind:

  1. The Configuration has a ClassLoader that will be used by [getClass](https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html#getClass%28java.lang.String, java.lang.Class%29) methods.
  2. Configuration.addDefaultResource can be called at any time and adds properties to all configurations.

The main reason I added hadoopConf was to ensure properties that I set in my Spark defaults were applied to this HiveConf, and this PR preserves that behavior. But best practice in Hadoop is to use the copy constructor to preserve final props and class loading. I think those two issues should be fixed.
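
For comparison, a sketch of the two construction paths (the SessionState class argument mirrors how a HiveConf is typically created; the exact wiring in this PR may differ):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Copy-constructor path: final properties and the ClassLoader recorded in the
// source Configuration carry over into the resulting HiveConf.
def viaCopy(hadoopConf: Configuration): HiveConf =
  new HiveConf(hadoopConf, classOf[SessionState])

// Map path: only key/value pairs survive. The "final" markers are lost, so a
// later set() from SparkConf or extraConfig can overwrite a formerly-final value.
def viaMap(props: Map[String, String]): HiveConf = {
  val hiveConf = new HiveConf(classOf[SessionState])
  props.foreach { case (k, v) => hiveConf.set(k, v) }
  hiveConf
}
```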

I don't see a good way around the addDefaultResource problem other than to have Hive use the same version of Hadoop that Spark uses (that seems reasonable to me, but we can discuss that separately).

@yhuai
Contributor Author

yhuai commented May 23, 2016

@rdblue Thank you for looking at this!

The reason that I added the flag to disable sharing Hadoop classes is that the Hadoop versions used by Spark and the metastore client may not be binary compatible (e.g. Hadoop 2 used by Spark and Hadoop 1 used by the metastore client).

HiveClientImpl is a wrapper around Hive's metastore client. As long as we can propagate user-set configurations to its internal HiveConf (so that it talks to the metastore correctly), it is fine.

Regarding the classloader, we actually do not use the classloader originally associated with the Hadoop Configuration; we always explicitly set the classloader on the HiveConf created inside HiveClientImpl.
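
A small sketch of that step, assuming the context classloader has already been switched to the isolated loader when the HiveConf is built:

```scala
import org.apache.hadoop.hive.conf.HiveConf

// The classloader carried by the original Configuration is never consulted;
// the HiveConf created inside HiveClientImpl gets its loader set explicitly.
def attachLoader(hiveConf: HiveConf): HiveConf = {
  hiveConf.setClassLoader(Thread.currentThread().getContextClassLoader)
  hiveConf
}
```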

Regarding Configuration.addDefaultResource: before 2.0, we did not pass a Hadoop Configuration to HiveClientImpl (it was called ClientWrapper in Spark 1.6). Since we were not relying on Configuration.addDefaultResource, using a Map should not change anything.

btw, I am fine with changing the default value of the flag to true (sharing Hadoop classes) if you think that represents the common use case.

@rdblue
Contributor

rdblue commented May 23, 2016

@yhuai, Hive uses shims to be compatible with Hadoop 1 and Hadoop 2. I think it would be better to use the existing mechanism in Hive to deal with this.
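
For reference, a minimal sketch of how that mechanism is reached (ShimLoader picks the shim implementation that matches the Hadoop version on the classpath):

```scala
import org.apache.hadoop.hive.shims.{HadoopShims, ShimLoader}

// Hive papers over Hadoop-version differences through its shim layer; the
// concrete HadoopShims implementation is resolved at runtime.
def currentShims: HadoopShims = ShimLoader.getHadoopShims()
```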

I know that this didn't use the copy constructor before, but that was a bug. That approach had a few flaws, including the fact that properties set in the hadoopConf didn't show up in the HiveConf. While this version accounts for missing properties, those properties will not behave the way Hadoop/Hive normally work: final properties can be accidentally overwritten by this implementation. The addDefaultResource problem is just another way this causes Spark's behavior to diverge from what users expect, and there may be others that I didn't think of.

I think the ideal situation is to continue using the copy constructor to avoid unknown behavior differences, but at a minimum I think this needs to correctly copy final properties to the new HiveConf. The class loader situation sounds fine as you describe it.

@yhuai
Contributor Author

yhuai commented May 23, 2016

I can create a Hadoop Configuration inside HiveClientImpl and use it to create the HiveConf. The main issue is that we cannot pass a Hadoop Configuration into a HiveClientImpl, because there may be two versions of Configuration loaded by two classloaders. What do you think?
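
A sketch of that alternative, with the rebuild happening entirely behind the isolated classloader (names are illustrative, not the PR's actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Rebuild a Configuration from the plain-string map inside HiveClientImpl,
// using the Hadoop classes visible to the isolated classloader, and derive
// the HiveConf from it.
def rebuildHiveConf(props: Map[String, String]): HiveConf = {
  val conf = new Configuration()
  props.foreach { case (k, v) => conf.set(k, v) }
  new HiveConf(conf, classOf[SessionState])
}
```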

@rdblue
Contributor

rdblue commented May 23, 2016

Why is Hive's ClassLoader loading Hadoop classes itself rather than delegating to the ClassLoader that is responsible for Hadoop? Hive should be using shims to interact with whatever Hadoop classes are available. Is there a reason not to share Hadoop classes?

I think rebuilding a Configuration and passing it is fine, but then you have the problem of making an exact replica of a Configuration object. I'm not sure what needs to happen (other than at least the final properties). And that still doesn't keep the behavior of addDefaultResource. Like I said, I think the ideal solution is to use the incoming Configuration. Otherwise, you can probably get away with copying properties and preserving which properties are final.
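
One hedged way to do the minimum suggested here, assuming Configuration.getFinalParameters is available (it is in recent Hadoop 2.x releases), is to refuse to overwrite keys the source marked final:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration

// Apply SparkConf/extraConfig overrides without clobbering keys the source
// Configuration marked final. This only guards writes made here; it does not
// reproduce Hadoop's full final-property semantics (or addDefaultResource).
def applyOverrides(base: Configuration, overrides: Map[String, String]): Configuration = {
  val finalKeys = base.getFinalParameters.asScala
  overrides.foreach { case (k, v) =>
    if (!finalKeys.contains(k)) base.set(k, v)
  }
  base
}
```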

@yhuai
Contributor Author

yhuai commented May 23, 2016

We support different versions of the Hive metastore (for example, Spark depends on Hive 1.2.1 but can talk to a metastore using Hive 0.12). Also, the Hadoop used by this client can be different from the Hadoop used by Spark. We have an IsolatedClientLoader for class isolation purposes. So, it is possible that we have to reload Hadoop classes (classes from a different version of Hadoop than the one Spark uses). This is the reason that we need an option that allows us to load Hadoop classes instead of always sharing them. Because of this, we cannot pass a Configuration to HiveClientImpl.

Here, the Hadoop Configuration is only a container for the user's Hadoop settings.

@rdblue
Contributor

rdblue commented May 23, 2016

@yhuai, I know that the IsolatedClientLoader is used to load multiple versions of Hive, but Hive should be able to use the version of Hadoop that Spark has already loaded without problems. I think there should only be one version of Hadoop, the one that Spark provides.

I don't see the value in loading a version of Hadoop only for Hive, when Hive is built to be able to work with any version of Hadoop.

@yhuai
Contributor Author

yhuai commented May 24, 2016

@rdblue Thanks. That is a good point :) I am closing this PR.

@yhuai yhuai closed this May 24, 2016