
[SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' option when creating client #20377

Closed

Conversation

gatorsmile (Member) opened this pull request:

What changes were proposed in this pull request?

This PR removes the unnecessary warehouseDir parameter, since the value is already contained in hadoopConf.

How was this patch tested?

N/A



SparkQA commented Jan 24, 2018

Test build #86562 has finished for PR 20377 at commit 0574ec7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -132,7 +131,8 @@ private[hive] class HiveClientImpl(
       if (ret != null) {
         // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState
         // instance constructed, we need to follow that change here.
-        warehouseDir.foreach { dir =>
+        val conf = hadoopConf.asInstanceOf[Configuration]
Contributor:

You're reintroducing the original bug here. You cannot cast this to Configuration, because in the sharesHadoopClasses = false case, there are two different instances of that class involved, and this will fail with a ClassCastException.
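
A minimal sketch of the pitfall described above, using a hypothetical helper asLocalConfiguration that is not part of the PR:

import org.apache.hadoop.conf.Configuration

// Hypothetical guard (not what the PR does): a runtime type test only succeeds when the
// object's Configuration class was loaded by the same classloader as the one visible here.
// With sharesHadoopClasses = false the isolated client loader has its own copy of
// org.apache.hadoop.conf.Configuration, so the match falls through to None, and an
// unguarded asInstanceOf[Configuration] would throw ClassCastException instead.
def asLocalConfiguration(hadoopConf: AnyRef): Option[Configuration] = hadoopConf match {
  case c: Configuration => Some(c) // same classloader: safe to use directly
  case _ => None                   // different classloader: cast would fail
}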

Member Author:

How to reproduce it?

Contributor:

See other comment.

Member Author:

Thanks! Will try to reproduce it.

vanzin (Contributor) commented Jan 24, 2018:

Actually I'm not sure just what I suggested will work. You need to reproduce the condition to trigger the if above:

       val ret = SessionState.get
       if (ret != null) {

And I'm not sure how to do that. I'm just sure that if you do, your code will not work.

(You can try to remove the condition just to test.)
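
As a hedged sketch (standard Hive APIs, not code from this PR), one way to make SessionState.get return non-null before the client is constructed is to start a thread-local session state, roughly what the CLI path does via CliSessionState:

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Register a thread-local SessionState so that SessionState.get returns non-null,
// which is the condition guarding the branch under discussion.
SessionState.start(new SessionState(new HiveConf()))
assert(SessionState.get() != null)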

Member:

Over heeerre ... you could say this is a good example of why getting reviews is important!


vanzin commented Jan 24, 2018

@gatorsmile this fix is wrong. If you want to test for yourself, you need to set METASTOREWAREHOUSE before instantiating the client in VersionsSuite.
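
A rough, hypothetical sketch of that test tweak (the exact VersionsSuite plumbing may differ): set the warehouse dir on the Hadoop configuration before the client is built.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf.ConfVars

val hadoopConf = new Configuration()
// Set the warehouse dir up front so the SessionState branch above is exercised.
hadoopConf.set(ConfVars.METASTOREWAREHOUSE.varname, "/tmp/spark-test-warehouse")
// ... then build the Hive client with this hadoopConf, as the suite normally does.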

cloud-fan (Contributor) commented:

I'm a little confused: I think this test should help us detect the wrong fix, but this PR passed all tests. Does that indicate the test actually can't expose the original bug?

gatorsmile (Member Author) commented:

The original test covers the original scenario. I think the case pointed out by @vanzin is a separate issue. However, I do not have time to try it.

     if (!isolationOn) {
-      return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config,
-        baseClassLoader, this)
+      return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this)
cloud-fan (Contributor) commented Jan 25, 2018:

So the major concern is hiding the Configuration class throughout this code path. How about creating a wrapper?

trait HadoopConfWrapper {
  def get(key: String): String
  def toIterator: Iterator[(String, String)]
}

and here

val wrapper = new HadoopConfWrapper {
  def get(key: String) = hadoopConf.get(key)
  def toIterator = hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue)
}
return new HiveClientImpl(..., wrapper, ...)

Contributor:

cc @vanzin

Contributor:

Let me ask the question: what exactly is the problem with the argument I added? It solves the issue without having to write all this code.

If you really dislike the argument for some odd reason, you can get the config by iterating over the Iterable in HiveClientImpl, turning an operation that is currently O(1) into O(n).

But I really don't understand why you guys care about this argument so much. There are 2 call sites to that constructor in the whole code base, both in the same method in IsolatedClientLoader.scala.

Member Author:

This is not the hot path. Passing extra parameters for this looks unnecessary. We need to keep the interface clean for better code maintenance.

Contributor:

That's such a subjective argument. The extra argument does not make the code more complicated, especially compared to everything else going on here.

Contributor:

I'm sure that when people think about non-trivial code in Spark, this is exactly the line in the whole code base they think of... :-/

Member:

How about something like this then if it really matters?

--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
@@ -83,9 +83,8 @@ import org.apache.spark.util.{CircularBuffer, Utils}
  */
 private[hive] class HiveClientImpl(
     override val version: HiveVersion,
-    warehouseDir: Option[String],
     sparkConf: SparkConf,
-    hadoopConf: JIterable[JMap.Entry[String, String]],
+    hadoopConf: HadoopConfiguration,
     extraConfig: Map[String, String],
     initClassLoader: ClassLoader,
     val clientLoader: IsolatedClientLoader)
@@ -106,6 +105,8 @@ private[hive] class HiveClientImpl(
     case hive.v2_1 => new Shim_v2_1()
   }

+  private val hadoopConfMap = hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
+
   // Create an internal session state for this HiveClientImpl.
   val state: SessionState = {
     val original = Thread.currentThread().getContextClassLoader
@@ -132,7 +133,7 @@ private[hive] class HiveClientImpl(
       if (ret != null) {
         // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState
         // instance constructed, we need to follow that change here.
-        warehouseDir.foreach { dir =>
+        hadoopConfMap.get(ConfVars.METASTOREWAREHOUSE.varname).foreach { dir =>
           ret.getConf.setVar(ConfVars.METASTOREWAREHOUSE, dir)
         }
         ret
@@ -166,8 +167,7 @@ private[hive] class HiveClientImpl(
     // has hive-site.xml. So, HiveConf will use that to override its default values.
     // 2: we set all spark confs to this hiveConf.
     // 3: we set all entries in config to this hiveConf.
-    (hadoopConf.iterator().asScala.map(kv => kv.getKey -> kv.getValue)
-      ++ sparkConf.getAll.toMap ++ extraConfig).foreach { case (k, v) =>
+    (hadoopConfMap ++ sparkConf.getAll.toMap ++ extraConfig).foreach { case (k, v) =>
       logDebug(
         s"""
            |Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
@@ -847,6 +847,11 @@ private[hive] class HiveClientImpl(
 }

 private[hive] object HiveClientImpl {
+
+  // Wider signature for the Hadoop conf so that differently-loaded Configuration classes can be passed in.
+  // See SPARK-17088.
+  private type HadoopConfiguration = JIterable[JMap.Entry[String, String]]
+
   /** Converts the native StructField to Hive's FieldSchema. */
   def toHiveColumn(c: StructField): FieldSchema = {
     val typeString = if (c.metadata.contains(HIVE_TYPE_STRING)) {

I think this could roughly address all concerns listed here.

Member:

Or simply just iterate it with O(n) as said in https://github.com/apache/spark/pull/20377/files#r163920410.
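
For reference, a minimal sketch of that O(n) lookup over the JIterable[JMap.Entry[String, String]] view (the helper name lookup is illustrative, not from the PR):

import java.lang.{Iterable => JIterable}
import java.util.{Map => JMap}
import scala.collection.JavaConverters._

// Linear scan over the Hadoop conf entries; avoids both the extra warehouseDir
// constructor argument and the cast to Configuration.
def lookup(hadoopConf: JIterable[JMap.Entry[String, String]], key: String): Option[String] =
  hadoopConf.iterator().asScala.collectFirst {
    case e if e.getKey == key => e.getValue
  }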

Contributor:

@HyukjinKwon's proposal looks good with the type alias and comments, so people can know what happened here. But as a matter of personal taste, I prefer the wrapper solution, as it looks cleaner to me and doesn't need to build a map or do an O(n) lookup :-/

Contributor:

Adding a comment about exactly why any of the proposed changes are needed is really the only thing that can make this code more understandable. I had that in the bug and in my commit message, but in hindsight, it really should be in the code.

Without the comment, all versions are equally complex, because the complexity has nothing to do with how many arguments you have or their types.

gatorsmile closed this Dec 22, 2018