
[SPARK-31926][SQL][test-hive1.2] Fix concurrency issue for ThriftCLIService to getPortNumber #28751

Closed
yaooqinn wants to merge 8 commits

Conversation

@yaooqinn (Member) commented Jun 8, 2020

What changes were proposed in this pull request?

When `org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext` is called, it starts `ThriftCLIService` in the background on a new thread. If we call `ThriftCLIService.getPortNumber` at the same time, we might not get the bound port when the port is configured as 0.

This PR moves the TServer/HttpServer initialization code out of that new Thread.
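For illustration, here is a minimal sketch of where the race shows up (the `PortLookupSketch` helper below is hypothetical and not part of the patch; it only mirrors the port lookup done by the `SharedThriftServer` test trait quoted later in this thread):

```scala
import scala.collection.JavaConverters._

import org.apache.hive.service.cli.thrift.ThriftCLIService
import org.apache.hive.service.server.HiveServer2

object PortLookupSketch {
  // Reads the port that ThriftCLIService reports once startWithContext has returned.
  // Before this fix, the socket was bound inside the background serving thread, so with
  // hive.server2.thrift.port=0 this could still return 0; after the fix, the socket is
  // bound before that thread starts, so the real port is visible here.
  def boundPort(server: HiveServer2): Int =
    server.getServices.asScala
      .collectFirst { case t: ThriftCLIService => t.getPortNumber }
      .getOrElse(0)
}
```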

Why are the changes needed?

Fix concurrency issue, improve test robustness.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

add new tests

@yaooqinn (Member Author) commented Jun 8, 2020

cc @cloud-fan @juliuszsompolski @maropu please take a look at this PR.

@SparkQA commented Jun 8, 2020

Test build #123623 has finished for PR 28751 at commit 0379d6e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author) commented Jun 8, 2020

retest this please

@yaooqinn yaooqinn changed the title [SPARK-31926][SQL][TESTS] Fix concurrency issue for ThriftCLIService to getPortNumber [SPARK-31926][SQL] Fix concurrency issue for ThriftCLIService to getPortNumber Jun 8, 2020
@SparkQA commented Jun 8, 2020

Test build #123627 has finished for PR 28751 at commit 0379d6e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -33,6 +33,8 @@ trait SharedThriftServer extends SharedSparkSession {
private var hiveServer2: HiveThriftServer2 = _
private var serverPort: Int = 0

def mode: ServerMode.Value = ServerMode.binary
Contributor

it seems weird to have a default value in the base test trait.

hiveServer2.getServices.asScala.foreach {
  case t: ThriftCLIService if t.getPortNumber != 0 =>
    serverPort = t.getPortNumber
    logInfo(s"Started HiveThriftServer2: port=$serverPort, attempt=$attempt")
Contributor

so we may not output this log?

Member Author

Before this fix, yes. The port binding is in another background thread.

Contributor

how does this patch fix it? It seems you just added a try-catch?

Member Author

With https://github.com/apache/spark/pull/28751/files#diff-7610697b4f8f1bc4842c77e50807914cR178 and its implementations, the port binding is done in the same thread where we call getPortNumber later.

@yaooqinn (Member Author) commented Jun 8, 2020

#28651 (comment). There was a discussion with @juliuszsompolski about this before.

Contributor

ah I see!

Member Author

Take ThriftBinaryCLIService as an example.
Before:
we do the TThreadPoolServer initialization and the serving in the same run function of the background thread. So if we call getPortNumber right after startWithContext, a concurrency issue occurs: portNum may not have been set yet at the time we read it.

After:
we do the TThreadPoolServer initialization in the current thread and only the serving in the run function of the background thread.
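To make the ordering concrete, here is a minimal, generic sketch of the two variants (plain Scala with a made-up `BoundService`; the real code is the Java `ThriftBinaryCLIService`, this only shows the shape of the change):

```scala
import java.net.ServerSocket

// Made-up class that only illustrates the ordering; not the actual ThriftBinaryCLIService code.
class BoundService {
  @volatile private var portNum: Int = 0

  def getPortNumber: Int = portNum

  // Before: the socket is created (and the port resolved) inside the background thread,
  // so a caller reading getPortNumber right after start() may still see 0.
  def startBindInBackground(): Unit = {
    new Thread(() => {
      val socket = new ServerSocket(0) // bind happens asynchronously
      portNum = socket.getLocalPort
      // ... blocking serve loop ...
    }).start()
  }

  // After: the socket is created in the caller's thread; only the blocking serve loop
  // runs in the background, so getPortNumber is reliable as soon as this method returns.
  def startBindUpFront(): Unit = {
    val socket = new ServerSocket(0) // bind happens before this method returns
    portNum = socket.getLocalPort
    new Thread(() => {
      // ... blocking serve loop using `socket` ...
    }).start()
  }
}
```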

Member

Ah, I see. Nice catch.

@yaooqinn yaooqinn changed the title [SPARK-31926][SQL] Fix concurrency issue for ThriftCLIService to getPortNumber [SPARK-31926][SQL][test-hive1.2] Fix concurrency issue for ThriftCLIService to getPortNumber Jun 8, 2020
@SparkQA commented Jun 8, 2020

Test build #123628 has finished for PR 28751 at commit 72ac908.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 8, 2020

Test build #123633 has finished for PR 28751 at commit 0beaa69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 8, 2020

Test build #123634 has finished for PR 28751 at commit 0a46508.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ThriftServerWithSparkContextInBinarySuite extends ThriftServerWithSparkContextSuite

@juliuszsompolski (Contributor) left a comment

LGTM, but please revert the styling changes in java files.

workerKeepAliveTime, TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
new ThreadFactoryWithGarbageCleanup(threadPoolName));
workerKeepAliveTime, TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
new ThreadFactoryWithGarbageCleanup(threadPoolName));
Contributor

Spark does not have an official style guide for Java, so I think the previous 4-space indents should stay.
Could you revert the indent/styling changes in those files, to make tracking changes and merging between branches easier?
I find it easier to track which code is directly imported from Hive, and which was modified for Spark, if it's not modified with styling changes, so I can diff it directly with Hive files.

Member Author

OK

Comment on lines 89 to 90
// Set the HIVE_SERVER2_THRIFT_HTTP_PORT to 0, so it could randomly pick any free port to use.
// It's much more robust than set a random port generated by ourselves ahead
Contributor

nit: this duplicates the comment above. The two comments could be merged.

@SparkQA commented Jun 9, 2020

Test build #123688 has finished for PR 28751 at commit 2d0403a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn (Member Author) commented Jun 9, 2020

retest this please

String msg = "Starting " + ThriftBinaryCLIService.class.getSimpleName() + " on port "
+ serverSocket.getServerSocket().getLocalPort() + " with " + minWorkerThreads + "..." + maxWorkerThreads + " worker threads";
String msg = "Starting " + getName() + " on port " + portNum + " with " + minWorkerThreads +
"..." + maxWorkerThreads + " worker threads";
Contributor

Is this log message change needed? Does it introduce any actual changes?
I think there may be consumers waiting for and parsing this line to make sure the Thrift server is running and to extract the port, etc., so if the format changes, it may break such matchers.
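For example, a downstream check might look something like this (purely hypothetical consumer code with made-up values, shown only to illustrate why the wording of the line matters):

```scala
object StartupLogConsumerSketch {
  // Hypothetical script logic that scrapes the startup line for the bound port.
  val line = "Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads"
  val pattern = """Starting ThriftBinaryCLIService on port (\d+) with""".r

  // Some(10000) against the current wording; a reworded message would silently yield None.
  val port: Option[Int] = pattern.findFirstMatchIn(line).map(_.group(1).toInt)
}
```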

Member Author

unnecessary change, you are right. reverted

@maropu (Member) left a comment

Nice catch. LGTM except for the minor comments.

""".stripMargin.split("\n").mkString.trim
} else {
s"""jdbc:hive2://localhost:$serverPort"""
}
Member

nit format:

private lazy val jdbcUri = if (mode == ServerMode.http) {
    s"jdbc:hive2://localhost:$serverPort/default;transportMode=http;httpPath=cliservice"
  } else {
    s"jdbc:hive2://localhost:$serverPort"
  }

?

Contributor

I think the existing format is correct.

hiveServer2.getServices.asScala.foreach {
  case t: ThriftCLIService if t.getPortNumber != 0 =>
    serverPort = t.getPortNumber
    logInfo(s"Started HiveThriftServer2: port=$serverPort, attempt=$attempt")
Member

Ah, I see. Nice catch.

// Wait for thrift server to be ready to serve the query, via executing simple query
// till the query succeeds. See SPARK-30345 for more details.
eventually(timeout(30.seconds), interval(1.seconds)) {
withJdbcStatement {_.execute("SELECT 1")}
Member

nit: withJdbcStatement { _.execute("SELECT 1") }

@yaooqinn (Member Author) commented Jun 9, 2020

Test build #123688 has finished for PR 28751 at commit 2d0403a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Seems not related to this PR.
@cloud-fan any idea about the one missing hour in the test failure for rebasing datetime?

@SparkQA commented Jun 9, 2020

Test build #123696 has finished for PR 28751 at commit fb985e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 9, 2020

Test build #123695 has finished for PR 28751 at commit 2d0403a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 9, 2020

Test build #123697 has finished for PR 28751 at commit 04f0a1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)
thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 02f32cf Jun 9, 2020
cloud-fan pushed a commit that referenced this pull request Jun 9, 2020
…ervice to getPortNumber

### What changes were proposed in this pull request?

When `org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext` is called, it starts `ThriftCLIService` in the background on a new thread. If we call `ThriftCLIService.getPortNumber` at the same time, we might not get the bound port when the port is configured as 0.

This PR moves the TServer/HttpServer initialization code out of that new thread.

### Why are the changes needed?

Fix concurrency issue, improve test robustness.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

add new tests

Closes #28751 from yaooqinn/SPARK-31926.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 02f32cf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun (Member)
Hi, All.
This PR seems to break all Maven jobs. Could you take a look?

java.lang.IllegalArgumentException: requirement failed: Failed to bind an actual port for HiveThriftServer2
      at scala.Predef$.require(Predef.scala:281)
      at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.withJdbcStatement(SharedThriftServer.scala:66)

@@ -480,7 +480,8 @@ object SparkParallelTestGrouping {
"org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite",
"org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite",
"org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite",
"org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite",
"org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite",
"org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInBinarySuite",
Member

This looks like an SBT-only workaround, doesn't it?

@@ -42,3 +42,12 @@ class ThriftServerWithSparkContextSuite extends SharedThriftServer {
}
}
}


class ThriftServerWithSparkContextInBinarySuite extends ThriftServerWithSparkContextSuite {

@dongjoon-hyun (Member)
Sorry guys. This introduces a non-trivial, consistent failure on both the master and 3.0 Maven jobs. I'll have to revert this. Please make another PR that passes with Maven.

@HyukjinKwon (Member)
Yes, +1 to revert.

@yaooqinn (Member Author)
Thanks @dongjoon-hyun @HyukjinKwon, and pardon the inconvenience.

@@ -480,7 +480,8 @@ object SparkParallelTestGrouping {
"org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite",
"org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite",
"org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite",
"org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite",
"org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInHttpSuite",
"org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextInBinarySuite",
"org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite"
Contributor

Does that mean this approach to speeding up the test runner never works for Maven? cc @gengliangwang @wangyum

Member

Yes, I think it's not related to Maven.

Member

Agree with @gengliangwang.

Member Author

Is there any way to run these tests in individual JVMs with Maven? It seems we are not able to start 2 Thrift servers with different transport modes on the shared Spark session in one JVM.

Contributor

Can we just run these 2 test suites one by one?

@yaooqinn (Member Author) commented Jun 11, 2020

The root cause I found so far is: in afterAll(), the Spark session was stopped and detached from the thread-local variable, but Hive's SessionState was not, so it gets reused next time, which causes the newly defined configs in the new test file to not take effect.

If what I've found is the only issue that stops these tests from running together in a single JVM (verified locally and it went well), I guess we can remove these 2 lines eventually.

Sent a new PR: #28797
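A minimal sketch of that kind of cleanup, assuming the root cause described above (illustrative only; the `DetachHiveSessionState` trait below is hypothetical, and the actual cleanup landed in the follow-up PRs):

```scala
import org.apache.hadoop.hive.ql.session.SessionState
import org.scalatest.{BeforeAndAfterAll, Suite}

// Sketch only: drop the thread-local Hive SessionState once a suite finishes, so the next
// suite's HiveConf (e.g. a different transport mode) is not shadowed by the stale state
// left behind by the previous server.
trait DetachHiveSessionState extends BeforeAndAfterAll { self: Suite =>
  override def afterAll(): Unit = {
    try {
      super.afterAll()
    } finally {
      SessionState.detachSession()
    }
  }
}
```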

cloud-fan pushed a commit that referenced this pull request Jun 19, 2020
…currency issue for ThriftCLIService to getPortNumber

### What changes were proposed in this pull request?

This PR brings #28751 back

- It was once reverted by 4a25200 because of an inevitable Maven test failure
    - See related updates in this followup a0187cd

- And reverted again because of the flakiness of the added unit tests
    - In this PR, the flakiness turns out to be caused by the Hive metastore connection that SparkSQLCLIService tries to create, which is unnecessary at all. This metastore client points to a dummy metastore server only.
    - Also, add some cleanups for the SharedThriftServer trait in before and after to prevent its configurations from being polluted or polluting others

### Why are the changes needed?

fix flaky test

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing sbt and maven tests

Closes #28835 from yaooqinn/SPARK-31926-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>