Skip to content

Comments

MINIFICPP-1424 Stabilize several flaky test suites#959

Closed
lordgamez wants to merge 17 commits intoapache:mainfrom
lordgamez:MINIFICPP-1424
Closed

MINIFICPP-1424 Stabilize several flaky test suites#959
lordgamez wants to merge 17 commits intoapache:mainfrom
lordgamez:MINIFICPP-1424

Conversation

@lordgamez
Copy link
Contributor

@lordgamez lordgamez commented Dec 14, 2020

This PR addresses several issues in tests, addressed by each commit, so it is advised to review on a per-commit basis. Fixes included for the following test suites (more information in the commit messages):

  • FileUtilsTests
    • Fix for last_write_time anomalies
  • RepoTests and PersistanceTests
    • Separate checkpoint usage when running in parallel
  • SecureSocketGetTCP tests
    • Fix timing issues
  • TailFileTests
    • Fix random generation and timing problems
    • Adjust LookupFrequency property for the test environment
  • GetTCPTests
    • Added missing ReconnectInterval property
    • Fixed recv error checking on Windows to properly try to reconnect and adjusted reconnection property values
  • Flow-Tests
    • Replace state manager provider with UnorderedMapPersistableKeyValueStoreService to avoid parallel running issues
  • DBProvenanceRepositoryTests
    • Remove an unreliable check
  • Fedora build
    • Added check for fastest mirror and retry mechanism for curl timeout problems
  • FlowControllerTests
    • Adjusted timeouts

Thank you for submitting a contribution to Apache NiFi - MiNiFi C++.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with MINIFICPP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file?
  • If applicable, have you updated the NOTICE file?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI results for build issues and submit an update to your PR as soon as possible.

last_write_time using both boost and stat  could be off by a
millisecond which could cause the check to fail. Additional 10
milliseconds before writing the file makes sure the saved time and the
file's timestamp are in order.

On Windows we can experience that the last_write_time of the same file
can be off by a milliseconds between 2 call. Checking this should be
disabled on Windows.
Running the two tests above paralelly caused problems as both used the
same flow file checkpoint directory. Separate directories are used from
now on.

Sleep values are also adjusted for the slower CI VMs.
The flakiness of the test was caused by resetting the server socket
before the client had the chance of connecting. We need to wait for the
client connection to be established before terminating the server, also
lowering the timeout between socket accepts keeps the test's runtime
short.
Sometimes it could occur that the generated long lines contained the
same character string that we added in the next line and expected not to
be in the final payload because of the maximum payload limit. By
distinguishing the character set from which the generated lines and the
added lines are consisted of this can be eliminated.

Timeout from checks where we do not expect a log to be present was also
removed to speed up the tests.

One sleep was also missing between one instance of log controller reset
and temp file appending.
Reconnect Interval was not working which caused the GetTCPTests to
have a default 5000 ms reconnect interval which was too long and caused
flakiness.

Reconnection limit and interval of GetTCPTests also needed to be
adjusted afterwards.
The default rocksdb state manager had a problem with parellel test running
as all the tests had their state in the defualt local corecomponenstate
directory where the LOCK file could be used by multiple processes causing
the tests to fail.
Depending on the scheduler, the keys could already have been serialized
when the check occurs, causing the test to fail. It is unnecessary to check
the key count before the second addition, as we can verify the final count is
below 400 at the end.
GetTcp tests were flaky on Windows because the error check for retry
value was not in place. Only the Linux specific errno value was checked.
Fedora docker build would sometimes fail in Github Actions due to Curl
timeout when retrieving packages from mirrors.fedoraproject.org. To fix
the issues:
- Added check to find the fastest fedora mirror
- Added retry mechanism to fedora build's install phase
100ms was too low for the propert in some CI runs, as the procedure
between processor runs would take longer in this test. Raising it to
500ms avoids this issue.
// the resurrected FlowFiles
std::this_thread::sleep_for(std::chrono::milliseconds{100});

std::this_thread::sleep_for(std::chrono::milliseconds(100));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to wait for the repository to start, eg. by checking some atomic variable? This sleep looks like it can be a source of flakiness, too.

(Same comment on the two similar sleeps in RepoTests.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I could not find a way to check it, as the ff repository only sets the running variable to true and goes on to start a separate thread which watches this running variable.

Copy link
Contributor

@hunyadi-dev hunyadi-dev Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about waiting for a log line? Is there a candidate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I checked the RepoTests there isn't anything logged there from the FlowFileRepository perspective:

[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::FlowFileRecord logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::utils::IdGenerator logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::FlowFile logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::provenance::ProvenanceEventRecord logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::internal::RocksDatabase logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::Properties logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] Set following pattern on loggers: [%Y-%m-%d %H:%M:%S.%e] [%n] [%l] %v
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] LogTestController logger got sinks from namespace root and level info from namespace LogTestController
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::Connectable logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::Repository logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::VolatileRepository<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::VolatileContentRepository logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::Connection logger got sinks from namespace root and level error from namespace root
[2020-12-15 15:59:02.639] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::FlowFileRepository logger got sinks from namespace root and level error from namespace root

Copy link
Contributor Author

@lordgamez lordgamez Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added log checks instead of sleeps in the PersistenceTests in eaf4925. Checking for "Found connection for" log entry which indicates a the restore procedure to be taken place seems to be working for our case. Sleeping in RepoTests was also unnecessary, so it is removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unfortunately wait is still needed in RepoTests and one of them cannot be checked by any log entries either, so one sleep is necessary (updated in 986ff8d).

RUN echo "fastestmirror=True" | tee -a /etc/dnf/dnf.conf
RUN for iter in {1..10}; do yum update -y && \
yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel flex bison make patch sudo git which maven libtool autoconf automake && \
yum clean all && exit_code=0 && break || exit_code=$? && echo "yum error: retry $iter in 10s" && sleep 10; done; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the issue here? Why do we have to sleep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original issue:
Curl error (28): Timeout was reached for https://mirrors.fedoraproject.org/metalink?repo=fedora-29&arch=x86_64 [Connection timed out after 30000 milliseconds]
I'm not sure what the root cause of this, maybe something in the environment is not set up yet, so we just wait 10 seconds between tries, to wait for the environment setup to complete. The issue was quite common and disappeared after this fix. The way it is done comes from the ElesticSearch repository.

RUN for iter in {1..10}; do yum update -y && \
yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel flex bison make patch sudo git which maven libtool autoconf automake && \
yum clean all && exit_code=0 && break || exit_code=$? && echo "yum error: retry $iter in 10s" && sleep 10; done; \
(exit $exit_code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do? It opens a subshell and exits it with exit code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly, setting the last exit code, which is then checked by the docker build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just return $exit_code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command above is run as a shell command in the docker build container, and the return keyword could only be used in the scope of a shell function.

}

FlowFileRepository(const std::string repo_name = "", std::string directory = FLOWFILE_REPOSITORY_DIRECTORY, int64_t maxPartitionMillis = MAX_FLOWFILE_REPOSITORY_ENTRY_LIFE_TIME,
FlowFileRepository(const std::string repo_name = "", std::string checkpoint_dir = FLOWFILE_CHECKPOINT_DIRECTORY, std::string directory = FLOWFILE_REPOSITORY_DIRECTORY, int64_t maxPartitionMillis = MAX_FLOWFILE_REPOSITORY_ENTRY_LIFE_TIME,
Copy link
Contributor

@hunyadi-dev hunyadi-dev Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but why not const references for strings? It would remove the need for std::move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c0110dc

std::string reconnect_interval_str;
if (context->getProperty(ReconnectInterval.getName(), reconnect_interval_str) &&
core::Property::getTimeMSFromString(reconnect_interval_str, reconnect_interval_)) {
logger_->log_debug("Reconnect interval is %d ms", reconnect_interval_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor but this might overflow for uint64_t:
https://godbolt.org/z/WsTK8W

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c0110dc


REQUIRE(provdb.getKeyCount() == 250);

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened here? Can you modify the comment as well why we dont have 250 entries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is okay as we do put 250 entries to the DB. The problem here was that sometimes when we get to this point of the execution the key count is already 96 instead of 250, as the serialized records have probably already expired.

Copy link
Contributor

@hunyadi-dev hunyadi-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great effort finding all these failures and correcting them.

@szaszm szaszm closed this in 1d43069 Dec 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants