[SPARK-15455] For IsolatedClientLoader, we need to provide a conf to disable sharing Hadoop classes #13236
Conversation
Test build #59043 has finished for PR 13236 at commit
@marmbrus I am setting this conf to false by default, which changes the behavior (master currently always shares Hadoop classes).
Test build #59046 has finished for PR 13236 at commit
Test build #59049 has finished for PR 13236 at commit
test this please
Test build #59050 has finished for PR 13236 at commit
    override val version: HiveVersion,
    sparkConf: SparkConf,
-   hadoopConf: Configuration,
+   hadoopConf: Map[String, String],
@rdblue FYI. I am changing the type from Configuration to Map[String, String] because it is possible that Hadoop classes are not shared between the Spark side and the IsolatedClientLoader side. When Hadoop classes are not shared, if we pass a Configuration object to HiveClientImpl, the class of that Configuration object (loaded by Spark's classloader) is different from the Configuration class loaded by the IsolatedClientLoader.
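For context, here is a minimal sketch (not the PR's actual code) of how the Spark side could flatten its Configuration into plain strings before crossing the classloader boundary; toStringMap is a hypothetical helper:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration

// Hypothetical helper: only java.lang.String instances cross the boundary,
// so it does not matter whether the two sides loaded
// org.apache.hadoop.conf.Configuration from the same jar.
def toStringMap(hadoopConf: Configuration): Map[String, String] = {
  // Configuration is an Iterable[java.util.Map.Entry[String, String]],
  // so iterating it yields every key/value pair that is currently set.
  hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
}
```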
Also, the reason that I am adding a flag to disable Hadoop class sharing is that the Hadoop versions used by Spark and by the HiveClientImpl can be different.
Test build #59115 has finished for PR 13236 at commit
I don't think a map preserves behavior; a Hadoop Configuration carries more than plain key/value pairs. There are also two more issues that come to mind: the classloader associated with the Configuration, and resources registered through Configuration.addDefaultResource.

The main reason I added the copy constructor was so that properties set in the hadoopConf actually show up in the HiveConf. I don't see a good way around the addDefaultResource issue.
@rdblue Thank you for looking at this! The reason that I added the flag to disable sharing Hadoop classes is that the Hadoop versions used by Spark and by the metastore client may not be binary compatible (e.g. Hadoop 2 used by Spark and Hadoop 1 used by the metastore client).

HiveClientImpl is a wrapper around Hive's metastore client. As long as we can propagate user-set configurations to its internal HiveConf (so that it can talk to the metastore correctly), it is fine. Regarding the classloader, we do not actually use the classloader originally associated with the Hadoop Configuration; we always explicitly set the classloader on the HiveConf created inside HiveClientImpl. Regarding Configuration.addDefaultResource: before 2.0, we did not pass a Hadoop Configuration to HiveClientImpl (it was called ClientWrapper in Spark 1.6), so we were not relying on Configuration.addDefaultResource, and using a Map should not change anything.

btw, I am fine with changing the default value of the flag to true (sharing Hadoop classes) if you think that represents the common use cases.
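A rough sketch of the approach described above (not the exact code in this PR): the isolated side instantiates its own HiveConf, replays the saved string settings onto it, and sets the classloader explicitly; newHiveConf is a hypothetical helper.

```scala
import org.apache.hadoop.hive.conf.HiveConf

// Hypothetical helper: build the HiveConf on the isolated side from plain
// strings, instead of copying a Spark-side Configuration object.
def newHiveConf(
    hadoopConfAsMap: Map[String, String],
    isolatedLoader: ClassLoader): HiveConf = {
  val hiveConf = new HiveConf(classOf[HiveConf])
  // Propagate the user's settings so the client can reach the metastore.
  hadoopConfAsMap.foreach { case (k, v) => hiveConf.set(k, v) }
  // HiveConf extends Configuration, which exposes setClassLoader; set it
  // explicitly rather than inheriting it from a Spark-side object.
  hiveConf.setClassLoader(isolatedLoader)
  hiveConf
}
```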
@yhuai, Hive uses shims to be compatible with Hadoop 1 and Hadoop 2, so I think it would be better to use Hive's existing mechanism to deal with this.

I know that this didn't use the copy constructor before, but that was a bug. The old approach had a few flaws, including the fact that properties set in the hadoopConf didn't show up in the HiveConf. While this version accounts for missing properties, those properties will not behave the way Hadoop/Hive normally work: final properties can be accidentally overwritten by this implementation. The addDefaultResource problem is just another way this causes Spark's behavior to diverge from what users expect, and there may be others I didn't think of.

I think the ideal situation is to continue using the copy constructor to avoid unknown behavior differences, but at a minimum this needs to correctly copy final properties to the new HiveConf. The classloader situation sounds fine as you describe it.
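For comparison, the copy-constructor route being advocated here looks roughly like the following (assuming Hadoop classes are shared so both sides see the same Configuration class); the HiveConf(Configuration, Class) constructor copies the incoming configuration, final properties included, rather than replaying string key/value pairs:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf

// Sketch only: works when Spark and the metastore client share Hadoop classes,
// since a single Configuration class is visible on both sides.
def hiveConfFrom(hadoopConf: Configuration): HiveConf =
  new HiveConf(hadoopConf, classOf[HiveConf])
```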
I can create a Hadoop Configuration inside HiveClientImpl and use it to create the HiveConf. The main issue is that we cannot pass a Hadoop Configuration into HiveClientImpl, because there may be two versions of the Configuration class (one loaded by Spark's classloader and one loaded by the IsolatedClientLoader).
Why is Hive's classloader loading Hadoop classes itself rather than delegating to the classloader that is responsible for Hadoop? Hive should be using shims to interact with whatever Hadoop classes are available. Is there a reason not to share Hadoop classes?

I think rebuilding a Configuration and passing it is fine, but then you have the problem of making an exact replica of a Configuration object. I'm not sure everything that needs to happen (other than at least the final properties), and it still doesn't keep the behavior of addDefaultResource. Like I said, I think the ideal solution is to use the incoming Configuration. Otherwise you can probably get away with copying the properties and preserving which ones are final.
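If the map-based rebuild stays, one way to at least respect final properties is sketched below. This assumes a Hadoop version that exposes Configuration.getFinalParameters(); note there is no public API to re-mark a copied property as final on the destination, which is part of why the copy constructor is preferable.

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf

// Hypothetical helper: copy non-final properties first and final ones last,
// so a final value is at least the last writer even though its "final" flag
// is lost on the destination HiveConf.
def copyWithFinalsLast(src: Configuration, dst: HiveConf): Unit = {
  val finalKeys = src.getFinalParameters.asScala.toSet
  val (finalEntries, normalEntries) =
    src.iterator().asScala.toSeq.partition(e => finalKeys.contains(e.getKey))
  normalEntries.foreach(e => dst.set(e.getKey, e.getValue))
  finalEntries.foreach(e => dst.set(e.getKey, e.getValue))
}
```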
We support different versions of the Hive metastore (for example, Spark depends on Hive 1.2.1 but can talk to a metastore running Hive 0.12). Also, the Hadoop used by this client can be different from the Hadoop used by Spark. We have an IsolatedClientLoader for class isolation, so it is possible that we have to reload Hadoop classes (from a different Hadoop version than the one Spark uses). This is why we need an option that lets us load Hadoop classes in the isolated loader instead of always sharing them. Because of this, we cannot pass a Configuration to HiveClientImpl; here, the Hadoop Configuration is only a container for the user's Hadoop settings.
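As a simplified illustration of the isolation logic being discussed (the real IsolatedClientLoader is more involved), a loader can delegate "shared" classes to Spark's loader and resolve everything else from its own jars, with a flag deciding whether org.apache.hadoop classes count as shared; IsolatingLoader and its parameters below are made up for the sketch.

```scala
import java.net.{URL, URLClassLoader}

// Sketch only: decide per class whether to share Spark's copy or load an
// isolated copy from the client's own jars.
class IsolatingLoader(
    jars: Array[URL],
    sparkLoader: ClassLoader,
    sharesHadoopClasses: Boolean)
  extends URLClassLoader(jars, null) {

  private def isShared(name: String): Boolean =
    name.startsWith("java.") ||
      name.startsWith("scala.") ||
      (sharesHadoopClasses && name.startsWith("org.apache.hadoop."))

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    if (isShared(name)) {
      // Delegate to the Spark-side loader so both sides see the same class.
      sparkLoader.loadClass(name)
    } else {
      // Resolve from the isolated jars, so a different Hadoop/Hive version
      // can back the metastore client.
      super.loadClass(name, resolve)
    }
  }
}
```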
@yhuai, I know that the IsolatedClientLoader is used to load multiple versions of Hive, but Hive should be able to use the version of Hadoop that Spark has already loaded without problems. I think there should only be one version of Hadoop, the one that Spark provides. I don't see the value in loading a version of Hadoop only for Hive, when Hive is built to work with any version of Hadoop.
@rdblue Thanks. That is a good point :) I am closing this PR.
What changes were proposed in this pull request?
Right now, we always share Hadoop classes between the Spark side and the metastore client side. However, these two sides may use different versions of Hadoop. So, we need to have a way to disable sharing Hadoop classes.
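A sketch of what such a conf could look like when read on the Spark side; the key name below is invented for illustration and is not necessarily what this PR uses:

```scala
import org.apache.spark.SparkConf

// Hypothetical key name; the flag decides whether IsolatedClientLoader treats
// org.apache.hadoop.* as a shared prefix.
val sparkConf = new SparkConf()
val sharesHadoopClasses =
  sparkConf.getBoolean("spark.sql.hive.metastore.shareHadoopClasses", true)
```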
How was this patch tested?
Existing tests and manual tests.