MINIFICPP-1424 Stabilize several flaky test suites by lordgamez · Pull Request #959 · apache/nifi-minifi-cpp

lordgamez · 2020-12-14T15:30:04Z

This PR addresses several issues in tests, addressed by each commit, so it is advised to review on a per-commit basis. Fixes included for the following test suites (more information in the commit messages):

FileUtilsTests
- Fix for last_write_time anomalies
RepoTests and PersistanceTests
- Separate checkpoint usage when running in parallel
SecureSocketGetTCP tests
- Fix timing issues
TailFileTests
- Fix random generation and timing problems
- Adjust LookupFrequency property for the test environment
GetTCPTests
- Added missing ReconnectInterval property
- Fixed recv error checking on Windows to properly try to reconnect and adjusted reconnection property values
Flow-Tests
- Replace state manager provider with UnorderedMapPersistableKeyValueStoreService to avoid parallel running issues
DBProvenanceRepositoryTests
- Remove an unreliable check
Fedora build
- Added check for fastest mirror and retry mechanism for curl timeout problems
FlowControllerTests
- Adjusted timeouts

Thank you for submitting a contribution to Apache NiFi - MiNiFi C++.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with MINIFICPP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?

For code changes:

If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file?
If applicable, have you updated the NOTICE file?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI results for build issues and submit an update to your PR as soon as possible.

last_write_time using both boost and stat could be off by a millisecond which could cause the check to fail. Additional 10 milliseconds before writing the file makes sure the saved time and the file's timestamp are in order. On Windows we can experience that the last_write_time of the same file can be off by a milliseconds between 2 call. Checking this should be disabled on Windows.

Running the two tests above paralelly caused problems as both used the same flow file checkpoint directory. Separate directories are used from now on. Sleep values are also adjusted for the slower CI VMs.

The flakiness of the test was caused by resetting the server socket before the client had the chance of connecting. We need to wait for the client connection to be established before terminating the server, also lowering the timeout between socket accepts keeps the test's runtime short.

Sometimes it could occur that the generated long lines contained the same character string that we added in the next line and expected not to be in the final payload because of the maximum payload limit. By distinguishing the character set from which the generated lines and the added lines are consisted of this can be eliminated. Timeout from checks where we do not expect a log to be present was also removed to speed up the tests. One sleep was also missing between one instance of log controller reset and temp file appending.

Reconnect Interval was not working which caused the GetTCPTests to have a default 5000 ms reconnect interval which was too long and caused flakiness. Reconnection limit and interval of GetTCPTests also needed to be adjusted afterwards.

The default rocksdb state manager had a problem with parellel test running as all the tests had their state in the defualt local corecomponenstate directory where the LOCK file could be used by multiple processes causing the tests to fail.

Depending on the scheduler, the keys could already have been serialized when the check occurs, causing the test to fail. It is unnecessary to check the key count before the second addition, as we can verify the final count is below 400 at the end.

GetTcp tests were flaky on Windows because the error check for retry value was not in place. Only the Linux specific errno value was checked.

Fedora docker build would sometimes fail in Github Actions due to Curl timeout when retrieving packages from mirrors.fedoraproject.org. To fix the issues: - Added check to find the fastest fedora mirror - Added retry mechanism to fedora build's install phase

libminifi/test/persistence-tests/PersistenceTests.cpp

libminifi/include/io/tls/TLSServerSocket.h

libminifi/src/io/tls/TLSServerSocket.cpp

100ms was too low for the propert in some CI runs, as the procedure between processor runs would take longer in this test. Raising it to 500ms avoids this issue.

fgerlits · 2020-12-15T10:35:17Z

libminifi/test/persistence-tests/PersistenceTests.cpp

    // the resurrected FlowFiles
-    std::this_thread::sleep_for(std::chrono::milliseconds{100});
-
+    std::this_thread::sleep_for(std::chrono::milliseconds(100));


Is there a better way to wait for the repository to start, eg. by checking some atomic variable? This sleep looks like it can be a source of flakiness, too.

(Same comment on the two similar sleeps in RepoTests.)

Unfortunately I could not find a way to check it, as the ff repository only sets the running variable to true and goes on to start a separate thread which watches this running variable.

How about waiting for a log line? Is there a candidate?

As I checked the RepoTests there isn't anything logged there from the FlowFileRepository perspective:

[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::FlowFileRecord logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::utils::IdGenerator logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::FlowFile logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::provenance::ProvenanceEventRecord logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::internal::RocksDatabase logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::Properties logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] Set following pattern on loggers: [%Y-%m-%d %H:%M:%S.%e] [%n] [%l] %v [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] LogTestController logger got sinks from namespace root and level info from namespace LogTestController [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::Connectable logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::Repository logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::VolatileRepository<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::VolatileContentRepository logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::Connection logger got sinks from namespace root and level error from namespace root [2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::FlowFileRepository logger got sinks from namespace root and level error from namespace root

Added log checks instead of sleeps in the PersistenceTests in eaf4925. Checking for "Found connection for" log entry which indicates a the restore procedure to be taken place seems to be working for our case. Sleeping in RepoTests was also unnecessary, so it is removed.

It seems unfortunately wait is still needed in RepoTests and one of them cannot be checked by any log entries either, so one sleep is necessary (updated in 986ff8d).

hunyadi-dev · 2020-12-15T13:53:08Z

docker/fedora/Dockerfile

+RUN echo "fastestmirror=True" | tee -a /etc/dnf/dnf.conf
+RUN for iter in {1..10}; do yum update -y && \
+    yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel flex bison make patch sudo git which maven libtool autoconf automake && \
+    yum clean all && exit_code=0 && break || exit_code=$? && echo "yum error: retry $iter in 10s" && sleep 10; done; \


What was the issue here? Why do we have to sleep?

The original issue:
Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/metalink?repo=fedora-29&arch=x86_64 [Connection timed out after 30000 milliseconds]
I'm not sure what the root cause of this, maybe something in the environment is not set up yet, so we just wait 10 seconds between tries, to wait for the environment setup to complete. The issue was quite common and disappeared after this fix. The way it is done comes from the ElesticSearch repository.

hunyadi-dev · 2020-12-15T13:54:01Z

docker/fedora/Dockerfile

+RUN for iter in {1..10}; do yum update -y && \
+    yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel flex bison make patch sudo git which maven libtool autoconf automake && \
+    yum clean all && exit_code=0 && break || exit_code=$? && echo "yum error: retry $iter in 10s" && sleep 10; done; \
+    (exit $exit_code)


What does this do? It opens a subshell and exits it with exit code?

Yes, exactly, setting the last exit code, which is then checked by the docker build.

Why not just return $exit_code?

The command above is run as a shell command in the docker build container, and the return keyword could only be used in the scope of a shell function.

hunyadi-dev · 2020-12-15T13:58:55Z

extensions/rocksdb-repos/FlowFileRepository.h

  }

-  FlowFileRepository(const std::string repo_name = "", std::string directory = FLOWFILE_REPOSITORY_DIRECTORY, int64_t maxPartitionMillis = MAX_FLOWFILE_REPOSITORY_ENTRY_LIFE_TIME,
+  FlowFileRepository(const std::string repo_name = "", std::string checkpoint_dir = FLOWFILE_CHECKPOINT_DIRECTORY, std::string directory = FLOWFILE_REPOSITORY_DIRECTORY, int64_t maxPartitionMillis = MAX_FLOWFILE_REPOSITORY_ENTRY_LIFE_TIME,


Minor, but why not const references for strings? It would remove the need for std::move.

Done in c0110dc

hunyadi-dev · 2020-12-15T14:22:55Z

extensions/standard-processors/processors/GetTCP.cpp

+  std::string reconnect_interval_str;
+  if (context->getProperty(ReconnectInterval.getName(), reconnect_interval_str) &&
+      core::Property::getTimeMSFromString(reconnect_interval_str, reconnect_interval_)) {
+    logger_->log_debug("Reconnect interval is %d ms", reconnect_interval_);


Minor but this might overflow for uint64_t:
https://godbolt.org/z/WsTK8W

Done in c0110dc

hunyadi-dev · 2020-12-15T14:48:24Z

libminifi/test/rocksdb-tests/DBProvenanceRepositoryTests.cpp


-  REQUIRE(provdb.getKeyCount() == 250);
-
  /**


What happened here? Can you modify the comment as well why we dont have 250 entries?

I think the comment is okay as we do put 250 entries to the DB. The problem here was that sometimes when we get to this point of the execution the key count is already 96 instead of 250, as the serialized records have probably already expired.

hunyadi-dev

Great effort finding all these failures and correcting them.

lordgamez added 10 commits December 14, 2020 15:23

Stabilize RepoTests and PersistenceTests

63c45e9

Running the two tests above paralelly caused problems as both used the same flow file checkpoint directory. Separate directories are used from now on. Sleep values are also adjusted for the slower CI VMs.

Add missing Reconnect Interval to GetTCP

4ad413a

Reconnect Interval was not working which caused the GetTCPTests to have a default 5000 ms reconnect interval which was too long and caused flakiness. Reconnection limit and interval of GetTCPTests also needed to be adjusted afterwards.

Fix recv error check on Windows

2e6a83b

GetTcp tests were flaky on Windows because the error check for retry value was not in place. Only the Linux specific errno value was checked.

Adjust timeouts for flow file enqueue in FlowControllerTests

22d8fee

lordgamez force-pushed the MINIFICPP-1424 branch from 59d892d to 22d8fee Compare December 14, 2020 15:51

fgerlits reviewed Dec 14, 2020

View reviewed changes

libminifi/test/persistence-tests/PersistenceTests.cpp Outdated Show resolved Hide resolved

libminifi/include/io/tls/TLSServerSocket.h Outdated Show resolved Hide resolved

fgerlits reviewed Dec 14, 2020

View reviewed changes

libminifi/src/io/tls/TLSServerSocket.cpp Outdated Show resolved Hide resolved

lordgamez force-pushed the MINIFICPP-1424 branch from e35f799 to bfc8394 Compare December 14, 2020 18:16

lordgamez added 3 commits December 15, 2020 10:11

Use verifyEventHappenedInPollTime instead of manual looping

7118255

Replace uint16_t with chrono::milliseconds in callback

f358a71

Fix narrowing

3fc6c7a

lordgamez force-pushed the MINIFICPP-1424 branch from bfc8394 to 3fc6c7a Compare December 15, 2020 09:11

Raise LookupFrequency property in TailFileTests

b0ae092

100ms was too low for the propert in some CI runs, as the procedure between processor runs would take longer in this test. Raising it to 500ms avoids this issue.

fgerlits reviewed Dec 15, 2020

View reviewed changes

fgerlits approved these changes Dec 15, 2020

View reviewed changes

hunyadi-dev reviewed Dec 15, 2020

View reviewed changes

hunyadi-dev approved these changes Dec 15, 2020

View reviewed changes

lordgamez added 2 commits December 15, 2020 16:38

Incorporate review comments

c0110dc

Add wait for PersistenceTests

1e41c1b

lordgamez force-pushed the MINIFICPP-1424 branch from 593f5cd to eaf4925 Compare December 16, 2020 10:24

lordgamez force-pushed the MINIFICPP-1424 branch from eaf4925 to 6a82717 Compare December 16, 2020 11:05

Remove unnecessary sleeps and replace with log checks when needed

986ff8d

lordgamez force-pushed the MINIFICPP-1424 branch from 6a82717 to 986ff8d Compare December 16, 2020 11:28

szaszm approved these changes Dec 16, 2020

View reviewed changes

szaszm closed this in 1d43069 Dec 16, 2020

Comments

Conversation

lordgamez commented Dec 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

For all changes:

For code changes:

For documentation related changes:

Note:

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hunyadi-dev Dec 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lordgamez Dec 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hunyadi-dev Dec 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hunyadi-dev left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lordgamez commented Dec 14, 2020 •

edited

Loading

hunyadi-dev Dec 15, 2020 •

edited

Loading

lordgamez Dec 16, 2020 •

edited

Loading

hunyadi-dev Dec 15, 2020 •

edited

Loading