-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5751] [SQL] Revamped HiveThriftServer2Suite for robustness #4720
Conversation
Test build #27835 has started for PR 4720 at commit
|
Test build #27835 has finished for PR 4720 at commit
|
Test PASSed. |
@JoshRosen �Using another
According to prior experiences, 1, 2 and 5 are most commonly seen, while 3 and 4 were not observed. In the case of 1 and 2, server instance stops gracefully and print related log lines. 5 is now fixed by the Reusing the same server instance within a test suite also reduces undesired flakiness. |
case line if line.contains(LOG_FILE_MARK) => new File(line.drop(LOG_FILE_MARK.length)) | ||
}.getOrElse { | ||
throw new RuntimeException("Failed to find HiveThriftServer2 log file.") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The start-thriftserver.sh
script delegates to spark-daemon.sh
and terminates pretty soon (about 1s). I found using ProcessBuilder.lines
is much simpler and more straightforward than using a ProcessLogger
.
start-thriftserver.sh
itself may fail if there has already been a started server instance within the same Spark home directory. However, right now we only start a single instance a time.
@liancheng Do you think that the foreground option proposed in #3881 could make this easier to test by removing a layer of indirection and asynchrony? I guess we probably want to target this fix for multiple branches to decrease the overall flakiness, so we'd have to port that other patch, too. |
@JoshRosen That's a good idea! Actually the first version of |
If it won't take too much time, I suppose it would be cleaner to add the foreground flag and merge that other PR into all of the maintenance branches. We can always re-open our own PR to add the foreground flag if you want to move that along faster. |
@JoshRosen OK agree, then let's get #3881 merged first. |
Added WIP tag in the PR description so that this PR won't be merged prematurely by accident. |
@JoshRosen Just saw 3 more build failures because of |
Observed several |
Test build #27890 has started for PR 4720 at commit
|
Test build #27890 has finished for PR 4720 at commit
|
Test PASSed. |
Thanks a lot @liancheng. Let's test this one a few more times first to make sure it's not flaky. |
Test build #615 has started for PR 4720 at commit
|
Test build #616 has started for PR 4720 at commit
|
Test build #617 has started for PR 4720 at commit
|
Test build #618 has started for PR 4720 at commit
|
Test build #619 has started for PR 4720 at commit
|
Test build #617 has finished for PR 4720 at commit
|
Test build #615 has finished for PR 4720 at commit
|
Test build #616 has finished for PR 4720 at commit
|
Test build #619 has finished for PR 4720 at commit
|
Test build #618 has finished for PR 4720 at commit
|
@andrewor14 The only test failure was caused by another test suite, seems that it's pretty stable now :) |
**NOTICE** Do NOT merge this, as we're waiting for #3881 to be merged. `HiveThriftServer2Suite` has been notorious for its flakiness for a while. This was mostly due to spawning and communicate with external server processes. This PR revamps this test suite for better robustness: 1. Fixes a racing condition occurred while using `tail -f` to check log file It's possible that the line we are looking for has already been printed into the log file before we start the `tail -f` process. This PR uses `tail -n +0 -f` to ensure all lines are checked. 2. Retries up to 3 times if the server fails to start In most of the cases, the server fails to start because of port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first, and retries up to 3 times if the server fails to start. 3. A server instance is reused among all test cases within a single suite The original `HiveThriftServer2Suite` is splitted into two test suites, `HiveThriftBinaryServerSuite` and `HiveThriftHttpServerSuite`. Each suite starts a `HiveThriftServer2` instance and reuses it for all of its test cases. **TODO** - [ ] Starts the Thrift server in foreground once #3881 is merged (adding `--foreground` flag to `spark-daemon.sh`) <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4720) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4720 from liancheng/revamp-thrift-server-tests and squashes the following commits: d6c80eb [Cheng Lian] Relaxes server startup timeout 6f14eb1 [Cheng Lian] Revamped HiveThriftServer2Suite for robustness (cherry picked from commit f816e73) Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng The thriftservice test suites failure partially due to #4611, but I believe there are still some issues we still need to solve. e.g. when the Hive ThriftService starts, a pid file will be generated named |
@chenghao-intel Yeah, just opened #4758 to fix the PID issue, thanks! |
…ft server test suites This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to Spark home directory to workaround this problem. Many thanks to chenghao-intel for pointing out this issue! <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4758) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits: 252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory 1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites (cherry picked from commit 8c468a6) Signed-off-by: Cheng Lian <lian@databricks.com>
…ft server test suites This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to Spark home directory to workaround this problem. Many thanks to chenghao-intel for pointing out this issue! <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4758) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits: 252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory 1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites
Iceberg 0.13.0.3 - ADT 1.1.7 2022-05-20 PRs Merged * Internal: Parquet bloom filter support (apache#594 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/594)) * Internal: AWS Kms Client (apache#630 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/630)) * Internal: Core: Add client-side check of encryption properties (apache#626 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/626)) * Core: Align snapshot summary property names for delete files (apache#4766 (apache/iceberg#4766)) * Core: Add eq and pos delete file counts to snapshot summary (apache#4677 (apache/iceberg#4677)) * Spark 3.2: Clean static vars in SparkTableUtil (apache#4765 (apache/iceberg#4765)) * Spark 3.2: Avoid reflection to load metadata tables in SparkTableUtil (apache#4758 (apache/iceberg#4758)) * Core: Fix query failure when using projection on top of partitions metadata table (apache#4720) (apache#619 (https://github.pie.apple.com/IPR/apache-incubator-iceberg/pull/619)) Key Notes Bloom filter support and Client Side Encryption Features can be used in this release. Both features are only enabled with explicit flags and will not effect existing tables or jobs.
NOTICE Do NOT merge this, as we're waiting for #3881 to be merged.
HiveThriftServer2Suite
has been notorious for its flakiness for a while. This was mostly due to spawning and communicate with external server processes. This PR revamps this test suite for better robustness:Fixes a racing condition occurred while using
tail -f
to check log fileIt's possible that the line we are looking for has already been printed into the log file before we start the
tail -f
process. This PR usestail -n +0 -f
to ensure all lines are checked.Retries up to 3 times if the server fails to start
In most of the cases, the server fails to start because of port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first, and retries up to 3 times if the server fails to start.
A server instance is reused among all test cases within a single suite
The original
HiveThriftServer2Suite
is splitted into two test suites,HiveThriftBinaryServerSuite
andHiveThriftHttpServerSuite
. Each suite starts aHiveThriftServer2
instance and reuses it for all of its test cases.TODO
--foreground
flag tospark-daemon.sh
)