
Conversation

@yhuai
Contributor

@yhuai yhuai commented May 20, 2016

What changes were proposed in this pull request?

Right now, we always share Hadoop classes between the Spark side and the metastore client side. However, these two sides may use different versions of Hadoop. So, we need to have a way to disable sharing Hadoop classes.
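
As a rough sketch of the idea (the flag name and the simplified predicate below are illustrative, not necessarily the exact code in this PR), IsolatedClientLoader's shared-class check could consult such a setting:

```scala
// Sketch only: decide whether a class should be loaded by Spark's classloader
// (shared) or by the isolated metastore classloader. `shareHadoopClasses`
// stands in for the new configuration; the name is a placeholder.
def isSharedClass(name: String, shareHadoopClasses: Boolean): Boolean = {
  val isHadoopClass =
    name.startsWith("org.apache.hadoop.") && !name.startsWith("org.apache.hadoop.hive.")
  name.startsWith("java.") ||
    name.startsWith("scala.") ||
    (shareHadoopClasses && isHadoopClass)
}
```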

How was this patch tested?

Existing tests and manual tests.

@SparkQA

SparkQA commented May 21, 2016

Test build #59043 has finished for PR 13236 at commit c0e3ae0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

@marmbrus I am setting this conf to false by default, which changes the behavior (our master always shares Hadoop classes).

@SparkQA

SparkQA commented May 21, 2016

Test build #59046 has finished for PR 13236 at commit c696264.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2016

Test build #59049 has finished for PR 13236 at commit 8adc940.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor Author

yhuai commented May 21, 2016

test this please

@SparkQA

SparkQA commented May 21, 2016

Test build #59050 has finished for PR 13236 at commit 8adc940.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  override val version: HiveVersion,
  sparkConf: SparkConf,
- hadoopConf: Configuration,
+ hadoopConf: Map[String, String],
Contributor Author

@rdblue FYI. I am changing the type from Configuration to Map[String, String] because Hadoop classes may not be shared between the Spark side and the IsolatedClientLoader side. When Hadoop classes are not shared, if we pass a Configuration object to HiveClientImpl, the class of that Configuration object (loaded by Spark's classloader) is different from the Configuration class loaded by the IsolatedClientLoader.
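
A minimal sketch of the hand-off on the Spark side, assuming the flattening happens before the call crosses the classloader boundary (the helper name is illustrative):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration

// Flatten the Configuration into plain strings so that no Hadoop object has
// to cross into the isolated classloader; HiveClientImpl then receives only
// a Map[String, String].
def toStringMap(hadoopConf: Configuration): Map[String, String] =
  hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
```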

Contributor Author

Also, the reason that I am adding a flag to disable Hadoop class sharing is that the Hadoop versions used by Spark and by HiveClientImpl can be different.

@SparkQA

SparkQA commented May 22, 2016

Test build #59115 has finished for PR 13236 at commit 084ae00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented May 23, 2016

I don't think a map preserves behavior. Hadoop Configuration instances have a set of final properties that can't be changed. This loses that information and then applies properties from the SparkConf and extraConfig, which could potentially overwrite final properties.

There are also two more issues that come to mind:

  1. The Configuration has a ClassLoader that will be used by [getClass](https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html#getClass%28java.lang.String, java.lang.Class%29) methods.
  2. Configuration.addDefaultResource can be called at any time and adds properties to all configurations.

The main reason I added hadoopConf was to ensure properties that I set in my Spark defaults were applied to this HiveConf, and this PR preserves that behavior. But best practice in Hadoop is to use the copy constructor to preserve final props and class loading. I think those two issues should be fixed.
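
For comparison, a sketch of the two construction paths (the SessionState class argument mirrors how a HiveConf is typically created; the exact wiring in this PR may differ):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Copy-constructor path: final properties and the ClassLoader recorded in the
// source Configuration carry over into the resulting HiveConf.
def viaCopy(hadoopConf: Configuration): HiveConf =
  new HiveConf(hadoopConf, classOf[SessionState])

// Map path: only key/value pairs survive. The "final" markers are lost, so a
// later set() from SparkConf or extraConfig can overwrite a formerly-final value.
def viaMap(props: Map[String, String]): HiveConf = {
  val hiveConf = new HiveConf(classOf[SessionState])
  props.foreach { case (k, v) => hiveConf.set(k, v) }
  hiveConf
}
```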

I don't see a good way around the addDefaultResource problem other than to have Hive use the same version of Hadoop that Spark uses (that seems reasonable to me, but we can discuss that separately).

@yhuai
Contributor Author

yhuai commented May 23, 2016

@rdblue Thank you for looking at this!

The reason that I added the flag to disable sharing Hadoop classes is that the Hadoop versions used by Spark and the metastore client may not be binary compatible (e.g. Hadoop 2 used by Spark and Hadoop 1 used by the metastore client).

HiveClientImpl is a wrapper around Hive's metastore client. As long as we can propagate user-set configurations to its internal HiveConf (so that it talks to the metastore correctly), it is fine.

Regarding the classloader, we actually do not use the classloader originally associated with the Hadoop Configuration; we always explicitly set the classloader on the HiveConf created inside HiveClientImpl.
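
A small sketch of that step, assuming the context classloader has already been switched to the isolated loader when the HiveConf is built:

```scala
import org.apache.hadoop.hive.conf.HiveConf

// The classloader carried by the original Configuration is never consulted;
// the HiveConf created inside HiveClientImpl gets its loader set explicitly.
def attachLoader(hiveConf: HiveConf): HiveConf = {
  hiveConf.setClassLoader(Thread.currentThread().getContextClassLoader)
  hiveConf
}
```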

Regarding Configuration.addDefaultResource: before 2.0, we did not pass a Hadoop Configuration to HiveClientImpl (it was called ClientWrapper in Spark 1.6). Since we were not relying on Configuration.addDefaultResource, using a Map should not change anything.

btw, I am fine with changing the default value of the flag to true (sharing Hadoop classes) if you think that represents the common use case.

@rdblue
Contributor

rdblue commented May 23, 2016

@yhuai, Hive uses shims to be compatible with Hadoop 1 and Hadoop 2. I think it would be better to use the existing mechanism in Hive to deal with this.
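
For reference, a minimal sketch of how that mechanism is reached (ShimLoader picks the shim implementation that matches the Hadoop version on the classpath):

```scala
import org.apache.hadoop.hive.shims.{HadoopShims, ShimLoader}

// Hive papers over Hadoop-version differences through its shim layer; the
// concrete HadoopShims implementation is resolved at runtime.
def currentShims: HadoopShims = ShimLoader.getHadoopShims()
```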

I know that this didn't use the copy constructor before, but that was a bug. That approach had a few flaws, including the fact that properties set in the hadoopConf didn't show up in the HiveConf. While this version accounts for missing properties, those properties will not behave the way Hadoop/Hive normally work: final properties can be accidentally overwritten by this implementation. The addDefaultResource problem is just another way this causes Spark's behavior to diverge from what users expect, and there may be others that I didn't think of.

I think the ideal situation is to continue using the copy constructor to avoid unknown behavior differences, but at a minimum I think this needs to correctly copy final properties to the new HiveConf. The class loader situation sounds fine as you describe it.

@yhuai
Contributor Author

yhuai commented May 23, 2016

I can create a Hadoop Configuration inside HiveClientImpl and use it to create the HiveConf. The main issue is that we cannot pass a Hadoop Configuration into a HiveClientImpl, because there may be two versions of Configuration loaded by two classloaders. What do you think?
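
A sketch of that alternative, with the rebuild happening entirely behind the isolated classloader (names are illustrative, not the PR's actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Rebuild a Configuration from the plain-string map inside HiveClientImpl,
// using the Hadoop classes visible to the isolated classloader, and derive
// the HiveConf from it.
def rebuildHiveConf(props: Map[String, String]): HiveConf = {
  val conf = new Configuration()
  props.foreach { case (k, v) => conf.set(k, v) }
  new HiveConf(conf, classOf[SessionState])
}
```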

@rdblue
Contributor

rdblue commented May 23, 2016

Why is Hive's ClassLoader loading Hadoop classes itself rather than delegating to the ClassLoader that is responsible for Hadoop? Hive should be using shims to interact with whatever Hadoop classes are available. Is there a reason not to share Hadoop classes?

I think rebuilding a Configuration and passing it is fine, but then you have the problem of making an exact replica of a Configuration object. I'm not sure what needs to happen (other than at least the final properties). And that still doesn't keep the behavior of addDefaultResource. Like I said, I think the ideal solution is to use the incoming Configuration. Otherwise, you can probably get away with copying properties and preserving which properties are final.
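
One hedged way to do the minimum suggested here, assuming Configuration.getFinalParameters is available (it is in recent Hadoop 2.x releases), is to refuse to overwrite keys the source marked final:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration

// Apply SparkConf/extraConfig overrides without clobbering keys the source
// Configuration marked final. This only guards writes made here; it does not
// reproduce Hadoop's full final-property semantics (or addDefaultResource).
def applyOverrides(base: Configuration, overrides: Map[String, String]): Configuration = {
  val finalKeys = base.getFinalParameters.asScala
  overrides.foreach { case (k, v) =>
    if (!finalKeys.contains(k)) base.set(k, v)
  }
  base
}
```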

@yhuai
Contributor Author

yhuai commented May 23, 2016

We support different versions of the Hive metastore (for example, Spark depends on Hive 1.2.1 but can talk to a metastore using Hive 0.12). Also, the Hadoop used by this client can be different from the Hadoop used by Spark. We have an IsolatedClientLoader for class isolation purposes. So, it is possible that we have to reload Hadoop classes (classes from a different version of Hadoop than the one Spark uses). This is the reason that we need an option that allows us to load Hadoop classes instead of always sharing them. Because of this, we cannot pass a Configuration to HiveClientImpl.

Here, the Hadoop Configuration is only a container for the user's Hadoop settings.

@rdblue
Contributor

rdblue commented May 23, 2016

@yhuai, I know that the IsolatedClientLoader is used to load multiple versions of Hive, but Hive should be able to use the version of Hadoop that Spark has already loaded without problems. I think there should only be one version of Hadoop, the one that Spark provides.

I don't see the value in loading a version of Hadoop only for Hive, when Hive is built to be able to work with any version of Hadoop.

@yhuai
Contributor Author

yhuai commented May 24, 2016

@rdblue Thanks. That is a good point :) I am closing this PR.

@yhuai yhuai closed this May 24, 2016