MINIFICPP-1424 Stabilize several flaky test suites#959
MINIFICPP-1424 Stabilize several flaky test suites#959lordgamez wants to merge 17 commits intoapache:mainfrom
Conversation
last_write_time using both boost and stat could be off by a millisecond which could cause the check to fail. Additional 10 milliseconds before writing the file makes sure the saved time and the file's timestamp are in order. On Windows we can experience that the last_write_time of the same file can be off by a milliseconds between 2 call. Checking this should be disabled on Windows.
Running the two tests above paralelly caused problems as both used the same flow file checkpoint directory. Separate directories are used from now on. Sleep values are also adjusted for the slower CI VMs.
The flakiness of the test was caused by resetting the server socket before the client had the chance of connecting. We need to wait for the client connection to be established before terminating the server, also lowering the timeout between socket accepts keeps the test's runtime short.
Sometimes it could occur that the generated long lines contained the same character string that we added in the next line and expected not to be in the final payload because of the maximum payload limit. By distinguishing the character set from which the generated lines and the added lines are consisted of this can be eliminated. Timeout from checks where we do not expect a log to be present was also removed to speed up the tests. One sleep was also missing between one instance of log controller reset and temp file appending.
Reconnect Interval was not working which caused the GetTCPTests to have a default 5000 ms reconnect interval which was too long and caused flakiness. Reconnection limit and interval of GetTCPTests also needed to be adjusted afterwards.
The default rocksdb state manager had a problem with parellel test running as all the tests had their state in the defualt local corecomponenstate directory where the LOCK file could be used by multiple processes causing the tests to fail.
Depending on the scheduler, the keys could already have been serialized when the check occurs, causing the test to fail. It is unnecessary to check the key count before the second addition, as we can verify the final count is below 400 at the end.
GetTcp tests were flaky on Windows because the error check for retry value was not in place. Only the Linux specific errno value was checked.
Fedora docker build would sometimes fail in Github Actions due to Curl timeout when retrieving packages from mirrors.fedoraproject.org. To fix the issues: - Added check to find the fastest fedora mirror - Added retry mechanism to fedora build's install phase
59d892d to
22d8fee
Compare
e35f799 to
bfc8394
Compare
bfc8394 to
3fc6c7a
Compare
100ms was too low for the propert in some CI runs, as the procedure between processor runs would take longer in this test. Raising it to 500ms avoids this issue.
| // the resurrected FlowFiles | ||
| std::this_thread::sleep_for(std::chrono::milliseconds{100}); | ||
|
|
||
| std::this_thread::sleep_for(std::chrono::milliseconds(100)); |
There was a problem hiding this comment.
Is there a better way to wait for the repository to start, eg. by checking some atomic variable? This sleep looks like it can be a source of flakiness, too.
(Same comment on the two similar sleeps in RepoTests.)
There was a problem hiding this comment.
Unfortunately I could not find a way to check it, as the ff repository only sets the running variable to true and goes on to start a separate thread which watches this running variable.
There was a problem hiding this comment.
How about waiting for a log line? Is there a candidate?
There was a problem hiding this comment.
As I checked the RepoTests there isn't anything logged there from the FlowFileRepository perspective:
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::FlowFileRecord logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::utils::IdGenerator logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::FlowFile logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::provenance::ProvenanceEventRecord logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::internal::RocksDatabase logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::Properties logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] Set following pattern on loggers: [%Y-%m-%d %H:%M:%S.%e] [%n] [%l] %v
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] LogTestController logger got sinks from namespace root and level info from namespace LogTestController
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::Connectable logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::Repository logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::VolatileRepository<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::VolatileContentRepository logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::Connection logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::FlowFileRepository logger got sinks from namespace root and level error from namespace root
There was a problem hiding this comment.
Added log checks instead of sleeps in the PersistenceTests in eaf4925. Checking for "Found connection for" log entry which indicates a the restore procedure to be taken place seems to be working for our case. Sleeping in RepoTests was also unnecessary, so it is removed.
There was a problem hiding this comment.
It seems unfortunately wait is still needed in RepoTests and one of them cannot be checked by any log entries either, so one sleep is necessary (updated in 986ff8d).
| RUN echo "fastestmirror=True" | tee -a /etc/dnf/dnf.conf | ||
| RUN for iter in {1..10}; do yum update -y && \ | ||
| yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel flex bison make patch sudo git which maven libtool autoconf automake && \ | ||
| yum clean all && exit_code=0 && break || exit_code=$? && echo "yum error: retry $iter in 10s" && sleep 10; done; \ |
There was a problem hiding this comment.
What was the issue here? Why do we have to sleep?
There was a problem hiding this comment.
The original issue:
Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/metalink?repo=fedora-29&arch=x86_64 [Connection timed out after 30000 milliseconds]
I'm not sure what the root cause of this, maybe something in the environment is not set up yet, so we just wait 10 seconds between tries, to wait for the environment setup to complete. The issue was quite common and disappeared after this fix. The way it is done comes from the ElesticSearch repository.
| RUN for iter in {1..10}; do yum update -y && \ | ||
| yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel flex bison make patch sudo git which maven libtool autoconf automake && \ | ||
| yum clean all && exit_code=0 && break || exit_code=$? && echo "yum error: retry $iter in 10s" && sleep 10; done; \ | ||
| (exit $exit_code) |
There was a problem hiding this comment.
What does this do? It opens a subshell and exits it with exit code?
There was a problem hiding this comment.
Yes, exactly, setting the last exit code, which is then checked by the docker build.
There was a problem hiding this comment.
Why not just return $exit_code?
There was a problem hiding this comment.
The command above is run as a shell command in the docker build container, and the return keyword could only be used in the scope of a shell function.
| } | ||
|
|
||
| FlowFileRepository(const std::string repo_name = "", std::string directory = FLOWFILE_REPOSITORY_DIRECTORY, int64_t maxPartitionMillis = MAX_FLOWFILE_REPOSITORY_ENTRY_LIFE_TIME, | ||
| FlowFileRepository(const std::string repo_name = "", std::string checkpoint_dir = FLOWFILE_CHECKPOINT_DIRECTORY, std::string directory = FLOWFILE_REPOSITORY_DIRECTORY, int64_t maxPartitionMillis = MAX_FLOWFILE_REPOSITORY_ENTRY_LIFE_TIME, |
There was a problem hiding this comment.
Minor, but why not const references for strings? It would remove the need for std::move.
| std::string reconnect_interval_str; | ||
| if (context->getProperty(ReconnectInterval.getName(), reconnect_interval_str) && | ||
| core::Property::getTimeMSFromString(reconnect_interval_str, reconnect_interval_)) { | ||
| logger_->log_debug("Reconnect interval is %d ms", reconnect_interval_); |
There was a problem hiding this comment.
Minor but this might overflow for uint64_t:
https://godbolt.org/z/WsTK8W
|
|
||
| REQUIRE(provdb.getKeyCount() == 250); | ||
|
|
||
| /** |
There was a problem hiding this comment.
What happened here? Can you modify the comment as well why we dont have 250 entries?
There was a problem hiding this comment.
I think the comment is okay as we do put 250 entries to the DB. The problem here was that sometimes when we get to this point of the execution the key count is already 96 instead of 250, as the serialized records have probably already expired.
hunyadi-dev
left a comment
There was a problem hiding this comment.
Great effort finding all these failures and correcting them.
593f5cd to
eaf4925
Compare
eaf4925 to
6a82717
Compare
6a82717 to
986ff8d
Compare
This PR addresses several issues in tests, addressed by each commit, so it is advised to review on a per-commit basis. Fixes included for the following test suites (more information in the commit messages):
Thank you for submitting a contribution to Apache NiFi - MiNiFi C++.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with MINIFICPP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions CI results for build issues and submit an update to your PR as soon as possible.