
[SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' option when creating client #20377

Closed

Conversation

gatorsmile (Member) opened this pull request:

What changes were proposed in this pull request?

This PR removes the unnecessary warehouseDir parameter, since the value is already contained in hadoopConf.

How was this patch tested?

N/A



SparkQA commented Jan 24, 2018

Test build #86562 has finished for PR 20377 at commit 0574ec7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -132,7 +131,8 @@ private[hive] class HiveClientImpl(
       if (ret != null) {
         // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState
         // instance constructed, we need to follow that change here.
-        warehouseDir.foreach { dir =>
+        val conf = hadoopConf.asInstanceOf[Configuration]
Contributor:

You're reintroducing the original bug here. You cannot cast this to Configuration, because in the sharesHadoopClasses = false case, there are two different instances of that class involved, and this will fail with a ClassCastException.
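
A minimal sketch of the pitfall described above, using a hypothetical helper asLocalConfiguration that is not part of the PR:

import org.apache.hadoop.conf.Configuration

// Hypothetical guard (not what the PR does): a runtime type test only succeeds when the
// object's Configuration class was loaded by the same classloader as the one visible here.
// With sharesHadoopClasses = false the isolated client loader has its own copy of
// org.apache.hadoop.conf.Configuration, so the match falls through to None, and an
// unguarded asInstanceOf[Configuration] would throw ClassCastException instead.
def asLocalConfiguration(hadoopConf: AnyRef): Option[Configuration] = hadoopConf match {
  case c: Configuration => Some(c) // same classloader: safe to use directly
  case _ => None                   // different classloader: cast would fail
}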

Member Author:

How to reproduce it?

Contributor:

See other comment.

Member Author:

Thanks! Will try to reproduce it.

vanzin (Contributor) commented Jan 24, 2018:

Actually I'm not sure just what I suggested will work. You need to reproduce the condition to trigger the if above:

       val ret = SessionState.get
       if (ret != null) {

And I'm not sure how to do that. I'm just sure that if you do, your code will not work.

(You can try to remove the condition just to test.)
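
As a hedged sketch (standard Hive APIs, not code from this PR), one way to make SessionState.get return non-null before the client is constructed is to start a thread-local session state, roughly what the CLI path does via CliSessionState:

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Register a thread-local SessionState so that SessionState.get returns non-null,
// which is the condition guarding the branch under discussion.
SessionState.start(new SessionState(new HiveConf()))
assert(SessionState.get() != null)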

Member:

Over heeerre ... you could say this is a good example of why getting reviews is important!


vanzin commented Jan 24, 2018

@gatorsmile this fix is wrong. If you want to test for yourself, you need to set METASTOREWAREHOUSE before instantiating the client in VersionsSuite.
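
A rough, hypothetical sketch of that test tweak (the exact VersionsSuite plumbing may differ): set the warehouse dir on the Hadoop configuration before the client is built.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf.ConfVars

val hadoopConf = new Configuration()
// Set the warehouse dir up front so the SessionState branch above is exercised.
hadoopConf.set(ConfVars.METASTOREWAREHOUSE.varname, "/tmp/spark-test-warehouse")
// ... then build the Hive client with this hadoopConf, as the suite normally does.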

cloud-fan (Contributor) commented:

I'm a little confused: I think this test should help us detect the wrong fix, but this PR passed all tests. Does that indicate the test actually can't expose the original bug?

gatorsmile (Member Author) commented:

The original test covers the original scenario. I think the case pointed out by @vanzin is a separate issue. However, I do not have time to try it.

     if (!isolationOn) {
-      return new HiveClientImpl(version, warehouseDir, sparkConf, hadoopConf, config,
-        baseClassLoader, this)
+      return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this)
cloud-fan (Contributor) commented Jan 25, 2018:

So the major concern is hiding the Configuration class throughout this code path. How about creating a wrapper?

trait HadoopConfWrapper {
  def get(key: String): String
  def toIterator: Iterator[(String, String)]
}

and here

val wrapper = new HadoopConfWrapper {
  def get(key: String) = hadoopConf.get(key)
  def toIterator = hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue)
}
return new HiveClientImpl(..., wrapper, ...)

Contributor:

cc @vanzin

Contributor:

Let me ask the question: what exactly is the problem with the argument I added? It solves the issue without having to write all this code.

If you really dislike the argument for some odd reason, you can get the config by iterating over the Iterable in HiveClientImpl, turning an operation that is currently O(1) into O(n).

But I really don't understand why you guys care about this argument so much. There are 2 call sites to that constructor in the whole code base, both in the same method in IsolatedClientLoader.scala.

Member Author:

This is not the hot path. Passing extra parameters for this looks unnecessary. We need to keep the interface clean for better code maintenance.

Contributor:

That's such a subjective argument. The extra argument does not make the code more complicated, especially compared to everything else going on here.

Contributor:

I'm sure that when people think about non-trivial code in Spark, this is exactly the line in the whole code base they think of... :-/

Member:

How about something like this then if it really matters?

--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
@@ -83,9 +83,8 @@ import org.apache.spark.util.{CircularBuffer, Utils}
  */
 private[hive] class HiveClientImpl(
     override val version: HiveVersion,
-    warehouseDir: Option[String],
     sparkConf: SparkConf,
-    hadoopConf: JIterable[JMap.Entry[String, String]],
+    hadoopConf: HadoopConfiguration,
     extraConfig: Map[String, String],
     initClassLoader: ClassLoader,
     val clientLoader: IsolatedClientLoader)
@@ -106,6 +105,8 @@ private[hive] class HiveClientImpl(
     case hive.v2_1 => new Shim_v2_1()
   }

+  private val hadoopConfMap = hadoopConf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
+
   // Create an internal session state for this HiveClientImpl.
   val state: SessionState = {
     val original = Thread.currentThread().getContextClassLoader
@@ -132,7 +133,7 @@ private[hive] class HiveClientImpl(
       if (ret != null) {
         // hive.metastore.warehouse.dir is determined in SharedState after the CliSessionState
         // instance constructed, we need to follow that change here.
-        warehouseDir.foreach { dir =>
+        hadoopConfMap.get(ConfVars.METASTOREWAREHOUSE.varname).foreach { dir =>
           ret.getConf.setVar(ConfVars.METASTOREWAREHOUSE, dir)
         }
         ret
@@ -166,8 +167,7 @@ private[hive] class HiveClientImpl(
     // has hive-site.xml. So, HiveConf will use that to override its default values.
     // 2: we set all spark confs to this hiveConf.
     // 3: we set all entries in config to this hiveConf.
-    (hadoopConf.iterator().asScala.map(kv => kv.getKey -> kv.getValue)
-      ++ sparkConf.getAll.toMap ++ extraConfig).foreach { case (k, v) =>
+    (hadoopConfMap ++ sparkConf.getAll.toMap ++ extraConfig).foreach { case (k, v) =>
       logDebug(
         s"""
            |Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
@@ -847,6 +847,11 @@ private[hive] class HiveClientImpl(
 }

 private[hive] object HiveClientImpl {
+
+  // Wider signature for the Hadoop conf so that differently-loaded Configuration classes can be passed in.
+  // See SPARK-17088.
+  private type HadoopConfiguration = JIterable[JMap.Entry[String, String]]
+
   /** Converts the native StructField to Hive's FieldSchema. */
   def toHiveColumn(c: StructField): FieldSchema = {
     val typeString = if (c.metadata.contains(HIVE_TYPE_STRING)) {

I think this could roughly address all concerns listed here.

Member:

Or simply just iterate it with O(n) as said in https://github.com/apache/spark/pull/20377/files#r163920410.
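
For reference, a minimal sketch of that O(n) lookup over the JIterable[JMap.Entry[String, String]] view (the helper name lookup is illustrative, not from the PR):

import java.lang.{Iterable => JIterable}
import java.util.{Map => JMap}
import scala.collection.JavaConverters._

// Linear scan over the Hadoop conf entries; avoids both the extra warehouseDir
// constructor argument and the cast to Configuration.
def lookup(hadoopConf: JIterable[JMap.Entry[String, String]], key: String): Option[String] =
  hadoopConf.iterator().asScala.collectFirst {
    case e if e.getKey == key => e.getValue
  }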

Contributor:

@HyukjinKwon's proposal looks good with the type alias and comments, so people can know what happened here. But as a matter of personal taste, I prefer the wrapper solution, as it looks cleaner to me and doesn't need to build a map or do an O(n) lookup :-/

Contributor:

Adding a comment about exactly why any of the proposed changes are needed is really the only thing that can make this code more understandable. I had that in the bug and in my commit message, but in hindsight, it really should be in the code.

Without the comment, all versions are equally complex, because the complexity has nothing to do with how many arguments you have or their types.

gatorsmile closed this Dec 22, 2018