-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-12236 Improving fault tolerancy of the QuestDB backed metrics repository #8152
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating the new pull request and making the adjustments @simonbence. I have yet not review the entire implementation, but I noted a few recommendations.
Having this in a separate module and avoiding changes to the nifi-api is helpful. I am not sure about the general reuse of the retry interfaces, so I'm not sure whether it warrants adding them to nifi-utils
. Perhaps they would be better in a separate nifi-retry-utils
under nifi-commons
, or just kept as part of the QuestDB module in a separate package until it gets to the point where they could be reused.
nifi-commons/nifi-utils/src/main/java/org/apache/nifi/retry/NoReturnCallable.java
Outdated
Show resolved
Hide resolved
...ar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-status-history-shared/pom.xml
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/nifi/controller/status/history/questdb/BufferedStatusHistoryStorage.java
Outdated
Show resolved
Hide resolved
...-history/src/main/java/org/apache/nifi/controller/status/history/questdb/CapturedStatus.java
Outdated
Show resolved
Hide resolved
...main/java/org/apache/nifi/controller/status/history/questdb/QuestDbStatusHistoryStorage.java
Outdated
Show resolved
Hide resolved
...ry/src/main/java/org/apache/nifi/controller/status/history/questdb/StatusHistoryStorage.java
Show resolved
Hide resolved
...fi-questdb/src/main/java/org/apache/nifi/questdb/embedded/ClientIsDisconnectedException.java
Outdated
Show resolved
Hide resolved
...estdb-bundle/nifi-questdb/src/main/java/org/apache/nifi/questdb/embedded/EmbeddedClient.java
Outdated
Show resolved
Hide resolved
I hope this PR will be more fit to go on with :) As of retry: even if it is currently used only by the QuestDB module, I would consider it as a general util or tool, just like other parts of the given module, like for example the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the initial adjustments @simonbence, I went through the code again and noted several recommendations and questions.
...ons/nifi-utils/src/test/java/org/apache/nifi/retry/ExceptionExcludingRetryConditionTest.java
Outdated
Show resolved
Hide resolved
...ons/nifi-utils/src/test/java/org/apache/nifi/retry/ExceptionExcludingRetryConditionTest.java
Outdated
Show resolved
Hide resolved
...commons/nifi-utils/src/test/java/org/apache/nifi/retry/LimitedAttemptRetryConditionTest.java
Outdated
Show resolved
Hide resolved
nifi-commons/nifi-utils/src/test/java/org/apache/nifi/retry/SynchronousRetryTemplateTest.java
Outdated
Show resolved
Hide resolved
...main/java/org/apache/nifi/controller/status/history/questdb/QuestDbStatusHistoryStorage.java
Outdated
Show resolved
Hide resolved
...i-questdb/src/main/java/org/apache/nifi/questdb/embedded/EmbeddedDatabaseManagerContext.java
Outdated
Show resolved
Hide resolved
...questdb-bundle/nifi-questdb/src/main/java/org/apache/nifi/questdb/embedded/LockedClient.java
Outdated
Show resolved
Hide resolved
...questdb-bundle/nifi-questdb/src/main/java/org/apache/nifi/questdb/embedded/LockedClient.java
Show resolved
Hide resolved
...tdb/src/main/java/org/apache/nifi/questdb/embedded/SimpleEmbeddedDatabaseManagerContext.java
Outdated
Show resolved
Hide resolved
...-bundle/nifi-questdb/src/main/java/org/apache/nifi/questdb/mapping/SimpleRequestMapping.java
Outdated
Show resolved
Hide resolved
…tion; smaller review changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the adjustments, this looks cleaner with the move to spring-retry
.
I noted a few minor things, but also there appears to be a file-locking problem resulting in test failures on Windows. Linux and macOS are generally more tolerant in this area, but it highlights either a test problem, or a problem with open files in the runtime class. See the Windows build for more details.
Error: org.apache.nifi.questdb.embedded.EmbeddedClientTest.testInsertAndQuery -- Time elapsed: 0.121 s <<< ERROR!
java.io.IOException: Failed to delete temp directory C:\Users\RUNNER~1\AppData\Local\Temp\junit16271263950658316199. The following paths could not be deleted (see suppressed exceptions for details): <root>, event, event\2024-01-14, event\2024-01-14\capturedAt.d, event\2024-01-14\subject.d, event\2024-01-14\value.d, event\subject.c, event\subject.k, event\subject.o, event\subject.v, event\_cv, event\_meta, event\_txn, event\_txn_scoreboard, tables.d.0, _tab_index.d
...estdb-bundle/nifi-questdb/src/main/java/org/apache/nifi/questdb/embedded/EmbeddedClient.java
Outdated
Show resolved
Hide resolved
...dle/nifi-questdb/src/main/java/org/apache/nifi/questdb/embedded/EmbeddedDatabaseManager.java
Outdated
Show resolved
Hide resolved
...dle/nifi-questdb/src/main/java/org/apache/nifi/questdb/embedded/EmbeddedDatabaseManager.java
Outdated
Show resolved
Hide resolved
...e/nifi-questdb/src/main/java/org/apache/nifi/questdb/rollover/DeleteOldRolloverStrategy.java
Show resolved
Hide resolved
...ndles/nifi-questdb-bundle/nifi-questdb/src/test/java/org/apache/nifi/questdb/util/Event.java
Outdated
Show resolved
Hide resolved
It looks like it has something to do with the |
Thanks for looking into it. The |
This was my idea as well. I will add some explicit tearing down to see if it is sufficient for the TempDir |
…er review comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @simonbence, unfortunately it looks like the cleanup approach is still failing on Windows builds.
The failed checks look to be in connection with Python processors and minify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the updates @simonbence.
On closer review of the test changes, there is still a problem with manager tests not cleaning up temporary files on completion. Disabling the methods on Windows masks this problem, but running the tests on macOS shows junit
directories left behind after completion. I recommend removing the DisabledOnOS
annotations and tracking down the source of the problem. Whether in the test class, or the implementation itself, temporary files should not remain after successful test completion.
I also noticed several QuestDB table definitions that still use capturedAt
as the column name, so this should be checked across the set of changes and updated.
import static org.junit.jupiter.api.Assertions.assertTrue; | ||
import static org.junit.jupiter.api.Assertions.fail; | ||
|
||
public class EmbeddedDatabaseManagerTest extends EmbeddedQuestDbTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class leaves junit
temporary directories around after completion. Disabling some methods on Windows masks the fundamental problem of the test not cleaning up after itself, so this needs more evaluation to improve the behavior.
I take a look on the leftover temporary files. As for disable: It was a deliberate choice after looking into the reasons. I picked adding them due to the test setup, not because of production reasons. In order to emulate corrupted database files I directly touch the files and write some gibberish into that. This is something normally would happened outside of our code, most reasonably would happen within the database. I did not found other way to trigger this however Windows does not let one to touch a file is under lock by an other process. If there is an other way to deliberatly corrupt the database (not directly working with the underlying files) than it could work out, but I am not aware of such a method. |
Thanks for looking into the details. Manual file changes for the test methods seems like a reasonable strategy, so if you can track down and resolve the leftover files issue, I would expect that should also resolve the test failures on Windows. |
…ning up files after tests
e2439e0
to
41ff175
Compare
Notable changes:
|
…er to be visible for the QuestDB bundle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continued work on these changes @simonbence.
On further review, I returned to a situation that came up in an earlier comment regarding the NoOpClient
.
In the current revision, the implementation logs an error and return the NoOpClient
, which effectively disables all Status History tracking. Although this avoids causing NiFi to crash when the QuestDB database is corrupted, is also disables an important function. This might be acceptable as a non-default configuration option, but it would be a blocker for considering this as a default setting. It also seems problematic in general, even as an opt-in solution.
One option is to throw an exception that causes NiFi to fail on startup, similar to current behavior with the QuestDB repository. Another option would be to fall back to in-memory tracking, which seems better. The latter option could be more involved, but I mention it for consideration.
...src/main/java/org/apache/nifi/controller/status/history/questdb/StorateStatusDataSource.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/nifi/controller/status/history/questdb/BufferedStatusHistoryStorage.java
Outdated
Show resolved
Hide resolved
...nifi-questdb/src/test/java/org/apache/nifi/questdb/embedded/EmbeddedDatabaseManagerTest.java
Show resolved
Hide resolved
...nifi-questdb/src/test/java/org/apache/nifi/questdb/embedded/EmbeddedDatabaseManagerTest.java
Outdated
Show resolved
Hide resolved
...dle/nifi-questdb/src/main/java/org/apache/nifi/questdb/embedded/EmbeddedDatabaseManager.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One additional note on test behavior, running tests on Ubuntu Linux 22.04 still leaves behind backup directories in a directory named questDbBackup
. There is a new backup directory for each subsequent build of the nifi-questdb
module.
@Test | ||
@DisabledOnOs(OS.WINDOWS) | ||
// This testcase cannot be reproduced under Windows using Junit | ||
public void testDatabaseRestorationAfterCorruptedFiles() throws DatabaseException, IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For additional reference, this is the test method that leaves the questDbBackup
directory with backup
sub-directories after test completion. This is probably the reason for the failure on Windows, but on both macOS and Linux, the directory remains after all tests have completed. It may be necessary to implement some manual deletion logic to ensure the directory is cleared. If the directory is cleared on macOS and Linux, it might be worth enabling this test on Windows again, as it more clearly surfaces the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest updates in e33dbe9 now clear out the questDbBackup
directory, but the directory itself still remains. It looks like adding one more step to delete the directory itself should resolve this particular issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, however the folder was left behind a different test case. I double checked if any leaves anything behind in multiple platform and I found that working well. Thanks for the finding!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the latest updates and responses @simonbence.
This looks close to completion, the last remaining items are related to test behavior and stability. The directory cleanup is looking better, but still leaves an empty directory.
The other concern is the timing and intermittent failures for testRollover()
. As mentioned in the detailed comment, the Thread.sleep()
approach takes a lot of time for a unit test, and appears to be unreliable, so another more robust strategy for determining rollover completion seems necessary.
client.insert(EVENT_TABLE_NAME, QuestDbTestUtil.getEventTableDataSource(testData)); | ||
|
||
// The rollover runs in every 5 seconds | ||
Thread.sleep(TimeUnit.SECONDS.toMillis(6)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a long time to sleep for a unit test, by adjusting the rollover time, can this value be reduced?
final List<Event> result = client.query(SELECT_QUERY, RequestMapping.getResultProcessor(QuestDbTestUtil.EVENT_TABLE_REQUEST_MAPPING)); | ||
testSubject.close(); | ||
|
||
assertEquals(1, result.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When building, the module, this test method fails intermittently. It seems like the Thread.sleep()
approach is not a strong enough guarantee of rollover completion. It may be necessary to do some additional checking on the directory content itself to determine if a rollover has occurred. If all else fails, it would be better to remove the test method, as opposed to having a test that fails intermittently.
[ERROR] Tests run: 13, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 18.24 s <<< FAILURE! -- in org.apache.nifi.questdb.embedded.EmbeddedDatabaseManagerTest
[ERROR] org.apache.nifi.questdb.embedded.EmbeddedDatabaseManagerTest.testRollover -- Time elapsed: 7.888 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <1> but was: <2>
at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:527)
at org.apache.nifi.questdb.embedded.EmbeddedDatabaseManagerTest.testRollover(EmbeddedDatabaseManagerTest.java:109)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an answer for both this and the previous one.
In general I would lean toward to use something like a latch, which guarantees the end of the background task. Due to the nature of the test case I could not use a spy for this, and I rely on waiting. As I left some gap between the roundtrip and wait time, I did not have issue with it so far, but based on your concerns I proposed a more refined way to do this. The fundamentals are the same: the test thread waits, but it polls multiple times to shorten the usual test duration and now it has 2 sec of gaps between the scheduled round trips and the waiting times. In case 3 seconds is not enough for the positive result I assume we have an issue indeed. Either with the test or the environment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adjusting the approach, the polling strategy looks good, and avoids the timing issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the adjustments to the test method @simonbence.
In the course of local build execution, I still noticed intermittent remnants of a JUnit temporary directory containing QuestDB files. I could reproduce it fairly often by just building the nifi-questdb
module with associated tests. It is not clear which test method is leaving files behind, but the directory name indicates it is a random temporary directory from the @TempDir
annotation. The intermittent nature of the behavior seems to indicate one remaining timing issue.
I will take a closer look at narrowing down the source of the behavior, but if we can get that isolated and resolved, then this should be ready to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After additional local build evaluation, the remnant temporary directory only seems to occur on Linux, not on macOS. Given that appears test-related, and does not cause failures, I don't think that it needs to be a blocker.
I pushed one minor adjustment to the test logger configuration, setting the level to CRITICAL
, which avoids writing unnecessary QuestDB output that can result in corrupting the forked Surefire runner stream. Planning to merge pending one more successful build cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonbence Following one more review, I tracked down the source of the orphaned files to a particular test method that was missing a close call on the DatabaseManager
. After adding the close()
, I could run the tests multiple times without seeing orphaned temporary directories.
I pushed one more commit to address that behavior in the test class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working through the feedback and addressing the various issues @simonbence. The latest version looks good. +1 merging
This is a refined PR for NIFI-12236 after feedback and discussion on PR 8123. The general approach and purpose remained the same: the PR aims to refine error handling of the original impelmentation and in order to reach this, the QuestDb handling was refactored:
Most of the guidelines from the original PR are still relevant, but here are some key things have changed:
Thank you for taking your time with the PR!
Summary
NIFI-12236
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000
NIFI-00000
Pull Request Formatting
main
branchVerification
Please indicate the verification steps performed prior to pull request creation.
Build
mvn clean install -P contrib-check
Licensing
LICENSE
andNOTICE
filesDocumentation