[SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness #4720

liancheng · 2015-02-22T16:32:01Z

NOTICE Do NOT merge this, as we're waiting for #3881 to be merged.

HiveThriftServer2Suite has been notorious for its flakiness for a while. This was mostly due to spawning and communicate with external server processes. This PR revamps this test suite for better robustness:

Fixes a racing condition occurred while using tail -f to check log file

It's possible that the line we are looking for has already been printed into the log file before we start the tail -f process. This PR uses tail -n +0 -f to ensure all lines are checked.
Retries up to 3 times if the server fails to start

In most of the cases, the server fails to start because of port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first, and retries up to 3 times if the server fails to start.
A server instance is reused among all test cases within a single suite

The original HiveThriftServer2Suite is splitted into two test suites, HiveThriftBinaryServerSuite and HiveThriftHttpServerSuite. Each suite starts a HiveThriftServer2 instance and reuses it for all of its test cases.

TODO

Starts the Thrift server in foreground once [SPARK-5964] Allow spark-daemon.sh to support foreground operation #3881 is merged (adding --foreground flag to spark-daemon.sh)

SparkQA · 2015-02-22T16:32:39Z

Test build #27835 has started for PR 4720 at commit 6f14eb1.

This patch merges cleanly.

SparkQA · 2015-02-22T17:48:50Z

Test build #27835 has finished for PR 4720 at commit 6f14eb1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-22T17:48:53Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27835/
Test PASSed.

liancheng · 2015-02-23T03:15:07Z

@JoshRosen �Using another Future to check whether the server process ends prematurely is not doable, because the Thrift server is started by spark-daemon.sh, which is asynchronous. Basically there are several cases that may fail the test suites:

Server fails to start because of port conflict
Server fails to start because of misconfigured classpath (e.g. datanucleus jars not found, etc.)
Server fails to start because another server instance was started within the same Spark home directory.
Server instance killed by kill -9
Log file missing or mark lines not found, thus causes time out

According to prior experiences, 1, 2 and 5 are most commonly seen, while 3 and 4 were not observed. In the case of 1 and 2, server instance stops gracefully and print related log lines. 5 is now fixed by the tail -n +0 -f trick. Log4j.properties issues similar to the previous jets3t case should be fixed separately if they emerge again. Thus I chose to check for specific log line to see whether the server instance ends prematurely.

Reusing the same server instance within a test suite also reduces undesired flakiness.

liancheng · 2015-02-23T03:20:31Z

...ftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala

+      case line if line.contains(LOG_FILE_MARK) => new File(line.drop(LOG_FILE_MARK.length))
+    }.getOrElse {
+      throw new RuntimeException("Failed to find HiveThriftServer2 log file.")
+    }


The start-thriftserver.sh script delegates to spark-daemon.sh and terminates pretty soon (about 1s). I found using ProcessBuilder.lines is much simpler and more straightforward than using a ProcessLogger.

start-thriftserver.sh itself may fail if there has already been a started server instance within the same Spark home directory. However, right now we only start a single instance a time.

JoshRosen · 2015-02-23T07:07:20Z

Using another Future to check whether the server process ends prematurely is not doable, because the Thrift server is started by spark-daemon.sh, which is asynchronous.

@liancheng Do you think that the foreground option proposed in #3881 could make this easier to test by removing a layer of indirection and asynchrony? I guess we probably want to target this fix for multiple branches to decrease the overall flakiness, so we'd have to port that other patch, too.

liancheng · 2015-02-23T12:43:08Z

@JoshRosen That's a good idea! Actually the first version of start-thriftserver.sh runs in foreground, and HiveThriftServer2Suite was also much simpler (without the log tailing part). Do you think we should wait for #3881 or merge this PR first and then port to #3881?

JoshRosen · 2015-02-23T23:44:19Z

If it won't take too much time, I suppose it would be cleaner to add the foreground flag and merge that other PR into all of the maintenance branches. We can always re-open our own PR to add the foreground flag if you want to move that along faster.

liancheng · 2015-02-24T02:21:33Z

@JoshRosen OK agree, then let's get #3881 merged first.

liancheng · 2015-02-24T02:23:50Z

Added WIP tag in the PR description so that this PR won't be merged prematurely by accident.

liancheng · 2015-02-24T14:22:27Z

@JoshRosen Just saw 3 more build failures because of HiveThriftServer2Suite. According to @andrewor14's statistics result, this is currently the No. 1 flaky test that causes many unnecessary build failures. I think we should merge this first in the sake of saving developer time.

liancheng · 2015-02-24T14:26:33Z

Observed several HiveThriftServer2Suite failures on Jenkins due to too strict timeout. According to the test failure output, everything went fine before the future hit its timeout. Relaxed from 1 minute to 2 minutes.

SparkQA · 2015-02-24T14:27:39Z

Test build #27890 has started for PR 4720 at commit d6c80eb.

This patch merges cleanly.

SparkQA · 2015-02-24T15:43:28Z

Test build #27890 has finished for PR 4720 at commit d6c80eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest
- class HiveThriftHttpServerSuite extends HiveThriftJdbcTest
- abstract class HiveThriftJdbcTest extends HiveThriftServer2Test
- abstract class HiveThriftServer2Test extends FunSuite with BeforeAndAfterAll with Logging

AmplabJenkins · 2015-02-24T15:43:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27890/
Test PASSed.

andrewor14 · 2015-02-24T18:30:23Z

Thanks a lot @liancheng. Let's test this one a few more times first to make sure it's not flaky.

SparkQA · 2015-02-24T18:31:31Z

Test build #615 has started for PR 4720 at commit d6c80eb.

This patch merges cleanly.

SparkQA · 2015-02-24T19:13:28Z

Test build #616 has started for PR 4720 at commit d6c80eb.

This patch merges cleanly.

SparkQA · 2015-02-24T19:13:40Z

Test build #617 has started for PR 4720 at commit d6c80eb.

This patch merges cleanly.

SparkQA · 2015-02-24T19:13:51Z

Test build #618 has started for PR 4720 at commit d6c80eb.

This patch merges cleanly.

SparkQA · 2015-02-24T19:14:17Z

Test build #619 has started for PR 4720 at commit d6c80eb.

This patch merges cleanly.

SparkQA · 2015-02-24T19:26:07Z

Test build #617 has finished for PR 4720 at commit d6c80eb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-02-24T19:39:59Z

Test build #615 has finished for PR 4720 at commit d6c80eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-02-24T20:28:30Z

Test build #616 has finished for PR 4720 at commit d6c80eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-02-24T20:31:37Z

Test build #619 has finished for PR 4720 at commit d6c80eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-02-24T20:32:31Z

Test build #618 has finished for PR 4720 at commit d6c80eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2015-02-25T00:22:26Z

@andrewor14 The only test failure was caused by another test suite, seems that it's pretty stable now :)

**NOTICE** Do NOT merge this, as we're waiting for #3881 to be merged. `HiveThriftServer2Suite` has been notorious for its flakiness for a while. This was mostly due to spawning and communicate with external server processes. This PR revamps this test suite for better robustness: 1. Fixes a racing condition occurred while using `tail -f` to check log file It's possible that the line we are looking for has already been printed into the log file before we start the `tail -f` process. This PR uses `tail -n +0 -f` to ensure all lines are checked. 2. Retries up to 3 times if the server fails to start In most of the cases, the server fails to start because of port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first, and retries up to 3 times if the server fails to start. 3. A server instance is reused among all test cases within a single suite The original `HiveThriftServer2Suite` is splitted into two test suites, `HiveThriftBinaryServerSuite` and `HiveThriftHttpServerSuite`. Each suite starts a `HiveThriftServer2` instance and reuses it for all of its test cases. **TODO** - [ ] Starts the Thrift server in foreground once #3881 is merged (adding `--foreground` flag to `spark-daemon.sh`)  [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4720)  Author: Cheng Lian <lian@databricks.com> Closes #4720 from liancheng/revamp-thrift-server-tests and squashes the following commits: d6c80eb [Cheng Lian] Relaxes server startup timeout 6f14eb1 [Cheng Lian] Revamped HiveThriftServer2Suite for robustness (cherry picked from commit f816e73) Signed-off-by: Cheng Lian <lian@databricks.com>

chenghao-intel · 2015-02-25T00:42:00Z

@liancheng The thriftservice test suites failure partially due to #4611, but I believe there are still some issues we still need to solve. e.g. when the Hive ThriftService starts, a pid file will be generated named /tmp/spark-{username}-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid, there would be problem if multiple test processes run into the HiveThriftServer2Suites simultaneously, as the pid file will be overwritten by the last one.

liancheng · 2015-02-25T04:36:10Z

@chenghao-intel Yeah, just opened #4758 to fix the PID issue, thanks!

…ft server test suites This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to Spark home directory to workaround this problem. Many thanks to chenghao-intel for pointing out this issue!  [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4758)  Author: Cheng Lian <lian@databricks.com> Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits: 252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory 1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites (cherry picked from commit 8c468a6) Signed-off-by: Cheng Lian <lian@databricks.com>

…ft server test suites This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to Spark home directory to workaround this problem. Many thanks to chenghao-intel for pointing out this issue!  [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4758)  Author: Cheng Lian <lian@databricks.com> Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits: 252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory 1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites

Iceberg 0.13.0.3 - ADT 1.1.7 2022-05-20 PRs Merged * Internal: Parquet bloom filter support (apache#594 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/594)) * Internal: AWS Kms Client (apache#630 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/630)) * Internal: Core: Add client-side check of encryption properties (apache#626 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/626)) * Core: Align snapshot summary property names for delete files (apache#4766 (apache/iceberg#4766)) * Core: Add eq and pos delete file counts to snapshot summary (apache#4677 (apache/iceberg#4677)) * Spark 3.2: Clean static vars in SparkTableUtil (apache#4765 (apache/iceberg#4765)) * Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil (apache#4758 (apache/iceberg#4758)) * Core: Fix query failure when using projection on top of partitions metadata table (apache#4720) (apache#619 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/619)) Key Notes Bloom filter support and Client Side Encryption Features can be used in this release. Both features are only enabled with explicit flags and will not effect existing tables or jobs.

Revamped HiveThriftServer2Suite for robustness

6f14eb1

liancheng reviewed Feb 23, 2015
View reviewed changes

JoshRosen mentioned this pull request Feb 23, 2015

[SPARK-5964] Allow spark-daemon.sh to support foreground operation #3881

Closed

liancheng changed the title ~~[SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness~~ [SPARK-5751] [SQL] [WIP] Revamped HiveThriftServer2Suite for robustness Feb 24, 2015

Relaxes server startup timeout

d6c80eb

asfgit closed this in f816e73 Feb 25, 2015

liancheng changed the title ~~[SPARK-5751] [SQL] [WIP] Revamped HiveThriftServer2Suite for robustness~~ [SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness Feb 25, 2015

liancheng mentioned this pull request Feb 25, 2015

[SPARK-5751] [SQL] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites #4758

Closed

liancheng deleted the revamp-thrift-server-tests branch February 25, 2015 04:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness #4720

[SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness #4720

liancheng commented Feb 22, 2015

SparkQA commented Feb 22, 2015

SparkQA commented Feb 22, 2015

AmplabJenkins commented Feb 22, 2015

liancheng commented Feb 23, 2015

liancheng Feb 23, 2015

JoshRosen commented Feb 23, 2015

liancheng commented Feb 23, 2015

JoshRosen commented Feb 23, 2015

liancheng commented Feb 24, 2015

liancheng commented Feb 24, 2015

liancheng commented Feb 24, 2015

liancheng commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

AmplabJenkins commented Feb 24, 2015

andrewor14 commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

liancheng commented Feb 25, 2015

chenghao-intel commented Feb 25, 2015

liancheng commented Feb 25, 2015

[SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness #4720

[SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness #4720

Conversation

liancheng commented Feb 22, 2015

SparkQA commented Feb 22, 2015

SparkQA commented Feb 22, 2015

AmplabJenkins commented Feb 22, 2015

liancheng commented Feb 23, 2015

liancheng Feb 23, 2015

Choose a reason for hiding this comment

JoshRosen commented Feb 23, 2015

liancheng commented Feb 23, 2015

JoshRosen commented Feb 23, 2015

liancheng commented Feb 24, 2015

liancheng commented Feb 24, 2015

liancheng commented Feb 24, 2015

liancheng commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

AmplabJenkins commented Feb 24, 2015

andrewor14 commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

SparkQA commented Feb 24, 2015

liancheng commented Feb 25, 2015

chenghao-intel commented Feb 25, 2015

liancheng commented Feb 25, 2015