[SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' option when creating client #20377
Conversation
Test build #86562 has finished for PR 20377 at commit
@@ -132,7 +131,8 @@ private[hive] class HiveClientImpl(
     if (ret != null) {
       // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState
       // instance constructed, we need to follow that change here.
-      warehouseDir.foreach { dir =>
+      val conf = hadoopConf.asInstanceOf[Configuration]
You're reintroducing the original bug here. You cannot cast this to `Configuration`, because in the `sharesHadoopClasses = false` case there are two different instances of that class involved, and this will fail with a `ClassCastException`.
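As a rough sketch of a cast-free alternative (the `lookup` helper is hypothetical, not the PR's actual code): both classloader copies of `Configuration` still satisfy the generic `java.lang.Iterable[java.util.Map.Entry[String, String]]` interface, so a lookup can go through that view and never cast:

```scala
import java.lang.{Iterable => JIterable}
import java.util.{Map => JMap}
import scala.collection.JavaConverters._

// Hypothetical helper: look up a key through the generic Iterable view,
// so no cast to a concrete (possibly differently-loaded) Configuration
// class is ever needed. O(n), but avoids the ClassCastException entirely.
def lookup(conf: JIterable[JMap.Entry[String, String]], key: String): Option[String] =
  conf.iterator().asScala.collectFirst { case e if e.getKey == key => e.getValue }
```

In a test, a plain `java.util.LinkedHashMap`'s `entrySet()` can stand in for the Hadoop conf, since it exposes the same `Iterable` of entries.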
How to reproduce it?
See other comment.
Thanks! Will try to reproduce it.
Actually, I'm not sure what I suggested will work. You need to reproduce the condition that triggers the `if` above:
val ret = SessionState.get
if (ret != null) {
And I'm not sure how to do that. I'm just sure that if you do, your code will not work.
(You can try to remove the condition just to test.)
Over heeerre ... you could say this is a good example of why getting reviews is important!
@gatorsmile this fix is wrong. If you want to test for yourself, you need to set
I'm a little confused: I thought this test would help us detect the wrong fix, but this PR passed all tests. Does that indicate the test actually can't expose the original bug?
The original test covers the original scenario. I think the issue pointed out by @vanzin is a different one. However, I don't have time to try it.
 if (!isolationOn) {
-  return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config,
-    baseClassLoader, this)
+  return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this)
So the major concern is hiding the `Configuration` class throughout the code path. How about creating a wrapper?
trait HadoopConfWrapper {
  def get(key: String): String
  def toIterator: Iterator[(String, String)]
}
and here
val wrapper = new HadoopConfWrapper {
  def get(key: String) = hadoopConf.get(key)
  def toIterator = hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue)
}
return new HiveClientImpl(..., wrapper, ...)
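A minimal runnable sketch of this wrapper idea, with a plain Java map standing in for the real Hadoop `Configuration` (the stand-in and its contents are assumptions for illustration):

```scala
import scala.collection.JavaConverters._

// The proposed wrapper: exposes only what HiveClientImpl needs, so no
// Configuration type ever crosses the classloader boundary.
trait HadoopConfWrapper {
  def get(key: String): String
  def toIterator: Iterator[(String, String)]
}

// Stand-in for hadoopConf; in Spark this would be the real Configuration,
// which also supports get(key) and iteration over entries.
val hadoopConf = new java.util.LinkedHashMap[String, String]()
hadoopConf.put("hive.metastore.warehouse.dir", "/tmp/wh")

val wrapper = new HadoopConfWrapper {
  def get(key: String): String = hadoopConf.get(key)
  // A fresh iterator per call, mapped from entries to tuples.
  def toIterator: Iterator[(String, String)] =
    hadoopConf.entrySet().iterator().asScala.map(e => e.getKey -> e.getValue)
}
```

The anonymous-class instance closes over `hadoopConf`, so the callee sees only the trait type.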
cc @vanzin
Let me ask the question: what exactly is the problem with the argument I added? It solves the issue without having to write all this code.
If you really dislike the argument for some odd reason, you can get the config by iterating over the `Iterable` in `HiveClientImpl`, turning an operation that is currently O(1) into O(n).
But I really don't understand why you guys care about this argument so much. There are 2 call sites to that constructor in the whole code base, both in the same method in `IsolatedClientLoader.scala`.
This is not the hot path. Passing extra parameters for this looks unnecessary. We need to keep the interface clean for better code maintenance.
That's such a subjective argument. The extra argument does not make the code more complicated, especially compared to anything else going on here.
I'm sure that when people think about non-trivial code in Spark, this is exactly the line in the whole code base they think of... :-/
How about something like this then if it really matters?
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
@@ -83,9 +83,8 @@ import org.apache.spark.util.{CircularBuffer, Utils}
*/
private[hive] class HiveClientImpl(
override val version: HiveVersion,
- warehouseDir: Option[String],
sparkConf: SparkConf,
- hadoopConf: JIterable[JMap.Entry[String, String]],
+ hadoopConf: HadoopConfiguration,
extraConfig: Map[String, String],
initClassLoader: ClassLoader,
val clientLoader: IsolatedClientLoader)
@@ -106,6 +105,8 @@ private[hive] class HiveClientImpl(
case hive.v2_1 => new Shim_v2_1()
}
+ private val hadoopConfMap = hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
+
// Create an internal session state for this HiveClientImpl.
val state: SessionState = {
val original = Thread.currentThread().getContextClassLoader
@@ -132,7 +133,7 @@ private[hive] class HiveClientImpl(
if (ret != null) {
// hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState
// instance constructed, we need to follow that change here.
- warehouseDir.foreach { dir =>
+ hadoopConfMap.get(ConfVars.METASTOREWAREHOUSE.varname).foreach { dir =>
ret.getConf.setVar(ConfVars.METASTOREWAREHOUSE, dir)
}
ret
@@ -166,8 +167,7 @@ private[hive] class HiveClientImpl(
// has hive-site.xml. So, HiveConf will use that to override its default values.
// 2: we set all spark confs to this hiveConf.
// 3: we set all entries in config to this hiveConf.
- (hadoopConf.iterator().asScala.map(kv => kv.getKey -> kv.getValue)
- ++ sparkConf.getAll.toMap ++ extraConfig).foreach { case (k, v) =>
+ (hadoopConfMap ++ sparkConf.getAll.toMap ++ extraConfig).foreach { case (k, v) =>
logDebug(
s"""
|Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
@@ -847,6 +847,11 @@ private[hive] class HiveClientImpl(
}
private[hive] object HiveClientImpl {
+
+  // Wider signature for the Hadoop conf so that copies of Configuration
+  // loaded by different classloaders are accepted. See SPARK-17088.
+  private type HadoopConfiguration = JIterable[JMap.Entry[String, String]]
+
/** Converts the native StructField to Hive's FieldSchema. */
def toHiveColumn(c: StructField): FieldSchema = {
val typeString = if (c.metadata.contains(HIVE_TYPE_STRING)) {
I think this could roughly address all concerns listed here.
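A minimal sketch of the one-time map materialization from that diff, using a plain Java map in place of the real Hadoop conf (the stand-in value is an assumption for illustration):

```scala
import java.lang.{Iterable => JIterable}
import java.util.{Map => JMap}
import scala.collection.JavaConverters._

// The widest common view of the conf, matching the proposed type alias:
// any classloader's copy of Configuration implements this interface.
type HadoopConfiguration = JIterable[JMap.Entry[String, String]]

// Stand-in conf; in Spark this would be the Hadoop Configuration instance.
val conf: HadoopConfiguration = {
  val m = new java.util.LinkedHashMap[String, String]()
  m.put("hive.metastore.warehouse.dir", "/tmp/wh")
  m.entrySet()
}

// Materialize once: an O(n) build, after which every lookup is O(1).
val hadoopConfMap: Map[String, String] =
  conf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
```

This is the trade-off discussed above: one up-front pass over the entries instead of either an extra constructor argument or a per-lookup O(n) scan.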
Or simply iterate it with O(n), as suggested in https://github.com/apache/spark/pull/20377/files#r163920410.
@HyukjinKwon's proposal looks good with the type alias and comments, so people can tell what happened here. But as a matter of personal taste, I prefer the wrapper solution: it looks cleaner to me and doesn't need to build a map or do an O(n) lookup :-/
Adding a comment about exactly why any of the proposed changes are needed is really the only thing that can make this code more understandable. I had that in the bug and in my commit message, but in hindsight, it really should be in the code.
Without the comment, all versions are equally complex, because the complexity has nothing to do with how many arguments you have or their type.
What changes were proposed in this pull request?
This PR removes the useless `warehouseDir`, which is already contained in `hadoopConf`.
How was this patch tested?
N/A