Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness #4720

Closed
wants to merge 2 commits into from

Conversation

liancheng
Copy link
Contributor

NOTICE Do NOT merge this, as we're waiting for #3881 to be merged.

HiveThriftServer2Suite has been notorious for its flakiness for a while. This was mostly due to spawning and communicate with external server processes. This PR revamps this test suite for better robustness:

  1. Fixes a racing condition occurred while using tail -f to check log file

    It's possible that the line we are looking for has already been printed into the log file before we start the tail -f process. This PR uses tail -n +0 -f to ensure all lines are checked.

  2. Retries up to 3 times if the server fails to start

    In most of the cases, the server fails to start because of port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first, and retries up to 3 times if the server fails to start.

  3. A server instance is reused among all test cases within a single suite

    The original HiveThriftServer2Suite is splitted into two test suites, HiveThriftBinaryServerSuite and HiveThriftHttpServerSuite. Each suite starts a HiveThriftServer2 instance and reuses it for all of its test cases.

TODO

Review on Reviewable

@SparkQA
Copy link

SparkQA commented Feb 22, 2015

Test build #27835 has started for PR 4720 at commit 6f14eb1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 22, 2015

Test build #27835 has finished for PR 4720 at commit 6f14eb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27835/
Test PASSed.

@liancheng
Copy link
Contributor Author

@JoshRosen �Using another Future to check whether the server process ends prematurely is not doable, because the Thrift server is started by spark-daemon.sh, which is asynchronous. Basically there are several cases that may fail the test suites:

  1. Server fails to start because of port conflict
  2. Server fails to start because of misconfigured classpath (e.g. datanucleus jars not found, etc.)
  3. Server fails to start because another server instance was started within the same Spark home directory.
  4. Server instance killed by kill -9
  5. Log file missing or mark lines not found, thus causes time out

According to prior experiences, 1, 2 and 5 are most commonly seen, while 3 and 4 were not observed. In the case of 1 and 2, server instance stops gracefully and print related log lines. 5 is now fixed by the tail -n +0 -f trick. Log4j.properties issues similar to the previous jets3t case should be fixed separately if they emerge again. Thus I chose to check for specific log line to see whether the server instance ends prematurely.

Reusing the same server instance within a test suite also reduces undesired flakiness.

case line if line.contains(LOG_FILE_MARK) => new File(line.drop(LOG_FILE_MARK.length))
}.getOrElse {
throw new RuntimeException("Failed to find HiveThriftServer2 log file.")
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The start-thriftserver.sh script delegates to spark-daemon.sh and terminates pretty soon (about 1s). I found using ProcessBuilder.lines is much simpler and more straightforward than using a ProcessLogger.

start-thriftserver.sh itself may fail if there has already been a started server instance within the same Spark home directory. However, right now we only start a single instance a time.

@JoshRosen
Copy link
Contributor

Using another Future to check whether the server process ends prematurely is not doable, because the Thrift server is started by spark-daemon.sh, which is asynchronous.

@liancheng Do you think that the foreground option proposed in #3881 could make this easier to test by removing a layer of indirection and asynchrony? I guess we probably want to target this fix for multiple branches to decrease the overall flakiness, so we'd have to port that other patch, too.

@liancheng
Copy link
Contributor Author

@JoshRosen That's a good idea! Actually the first version of start-thriftserver.sh runs in foreground, and HiveThriftServer2Suite was also much simpler (without the log tailing part). Do you think we should wait for #3881 or merge this PR first and then port to #3881?

@JoshRosen
Copy link
Contributor

If it won't take too much time, I suppose it would be cleaner to add the foreground flag and merge that other PR into all of the maintenance branches. We can always re-open our own PR to add the foreground flag if you want to move that along faster.

@liancheng
Copy link
Contributor Author

@JoshRosen OK agree, then let's get #3881 merged first.

@liancheng liancheng changed the title [SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness [SPARK-5751] [SQL] [WIP] Revamped HiveThriftServer2Suite for robustness Feb 24, 2015
@liancheng
Copy link
Contributor Author

Added WIP tag in the PR description so that this PR won't be merged prematurely by accident.

@liancheng
Copy link
Contributor Author

@JoshRosen Just saw 3 more build failures because of HiveThriftServer2Suite. According to @andrewor14's statistics result, this is currently the No. 1 flaky test that causes many unnecessary build failures. I think we should merge this first in the sake of saving developer time.

@liancheng
Copy link
Contributor Author

Observed several HiveThriftServer2Suite failures on Jenkins due to too strict timeout. According to the test failure output, everything went fine before the future hit its timeout. Relaxed from 1 minute to 2 minutes.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #27890 has started for PR 4720 at commit d6c80eb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #27890 has finished for PR 4720 at commit d6c80eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest
    • class HiveThriftHttpServerSuite extends HiveThriftJdbcTest
    • abstract class HiveThriftJdbcTest extends HiveThriftServer2Test
    • abstract class HiveThriftServer2Test extends FunSuite with BeforeAndAfterAll with Logging

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27890/
Test PASSed.

@andrewor14
Copy link
Contributor

Thanks a lot @liancheng. Let's test this one a few more times first to make sure it's not flaky.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #615 has started for PR 4720 at commit d6c80eb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #616 has started for PR 4720 at commit d6c80eb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #617 has started for PR 4720 at commit d6c80eb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #618 has started for PR 4720 at commit d6c80eb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #619 has started for PR 4720 at commit d6c80eb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #617 has finished for PR 4720 at commit d6c80eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #615 has finished for PR 4720 at commit d6c80eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #616 has finished for PR 4720 at commit d6c80eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #619 has finished for PR 4720 at commit d6c80eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #618 has finished for PR 4720 at commit d6c80eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor Author

@andrewor14 The only test failure was caused by another test suite, seems that it's pretty stable now :)

asfgit pushed a commit that referenced this pull request Feb 25, 2015
**NOTICE** Do NOT merge this, as we're waiting for #3881 to be merged.

`HiveThriftServer2Suite` has been notorious for its flakiness for a while. This was mostly due to spawning and communicate with external server processes. This PR revamps this test suite for better robustness:

1. Fixes a racing condition occurred while using `tail -f` to check log file

   It's possible that the line we are looking for has already been printed into the log file before we start the `tail -f` process. This PR uses `tail -n +0 -f` to ensure all lines are checked.

2. Retries up to 3 times if the server fails to start

   In most of the cases, the server fails to start because of port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first, and retries up to 3 times if the server fails to start.

3. A server instance is reused among all test cases within a single suite

   The original `HiveThriftServer2Suite` is splitted into two test suites, `HiveThriftBinaryServerSuite` and `HiveThriftHttpServerSuite`. Each suite starts a `HiveThriftServer2` instance and reuses it for all of its test cases.

**TODO**

- [ ] Starts the Thrift server in foreground once #3881 is merged (adding `--foreground` flag to `spark-daemon.sh`)

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4720)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4720 from liancheng/revamp-thrift-server-tests and squashes the following commits:

d6c80eb [Cheng Lian] Relaxes server startup timeout
6f14eb1 [Cheng Lian] Revamped HiveThriftServer2Suite for robustness

(cherry picked from commit f816e73)
Signed-off-by: Cheng Lian <lian@databricks.com>
@asfgit asfgit closed this in f816e73 Feb 25, 2015
@chenghao-intel
Copy link
Contributor

@liancheng The thriftservice test suites failure partially due to #4611, but I believe there are still some issues we still need to solve. e.g. when the Hive ThriftService starts, a pid file will be generated named /tmp/spark-{username}-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid, there would be problem if multiple test processes run into the HiveThriftServer2Suites simultaneously, as the pid file will be overwritten by the last one.

@liancheng liancheng changed the title [SPARK-5751] [SQL] [WIP] Revamped HiveThriftServer2Suite for robustness [SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness Feb 25, 2015
@liancheng
Copy link
Contributor Author

@chenghao-intel Yeah, just opened #4758 to fix the PID issue, thanks!

@liancheng liancheng deleted the revamp-thrift-server-tests branch February 25, 2015 04:40
asfgit pushed a commit that referenced this pull request Feb 28, 2015
…ft server test suites

This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to Spark home directory to workaround this problem.

Many thanks to chenghao-intel for pointing out this issue!

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4758)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits:

252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory
1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites

(cherry picked from commit 8c468a6)
Signed-off-by: Cheng Lian <lian@databricks.com>
asfgit pushed a commit that referenced this pull request Feb 28, 2015
…ft server test suites

This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to Spark home directory to workaround this problem.

Many thanks to chenghao-intel for pointing out this issue!

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4758)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits:

252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory
1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
Iceberg 0.13.0.3 - ADT 1.1.7

2022-05-20

PRs Merged

* Internal: Parquet bloom filter support (apache#594 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/594))
* Internal: AWS Kms Client (apache#630 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/630))
* Internal: Core: Add client-side check of encryption properties (apache#626 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/626))
* Core: Align snapshot summary property names for delete files (apache#4766 (apache/iceberg#4766))
* Core: Add eq and pos delete file counts to snapshot summary (apache#4677 (apache/iceberg#4677))
* Spark 3.2: Clean static vars in SparkTableUtil (apache#4765 (apache/iceberg#4765))
* Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil (apache#4758 (apache/iceberg#4758))
* Core: Fix query failure when using projection on top of partitions metadata table (apache#4720) (apache#619 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/619))

Key Notes

Bloom filter support and Client Side Encryption Features can be used in this release. Both features are only enabled with explicit flags and will not effect existing tables or jobs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants