Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-12236 Improving fault tolerancy of the QuestDB backed metrics repository #8152

Closed
wants to merge 12 commits into from

Conversation

simonbence
Copy link
Contributor

@simonbence simonbence commented Dec 12, 2023

This is a refined PR for NIFI-12236 after feedback and discussion on PR 8123. The general approach and purpose remained the same: the PR aims to refine error handling of the original impelmentation and in order to reach this, the QuestDb handling was refactored:

  • All QuestDB related code has been moved to a separate NAR
  • QuestDB usage is moved beding a Client interface, with some changes of the original usage
    Most of the guidelines from the original PR are still relevant, but here are some key things have changed:
  • NiFi API changes have been reverted
  • Now the Volatile implementation is the default again
  • Following the changes from NIFI-12492 all QuestDB related code is not within the QuestDB bundle. Note: it has been moved out from under the framework and is part of the nifi-nar-bundles module, as other "includable" NARs.

Thank you for taking your time with the PR!

Summary

NIFI-12236

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating the new pull request and making the adjustments @simonbence. I have yet not review the entire implementation, but I noted a few recommendations.

Having this in a separate module and avoiding changes to the nifi-api is helpful. I am not sure about the general reuse of the retry interfaces, so I'm not sure whether it warrants adding them to nifi-utils. Perhaps they would be better in a separate nifi-retry-utils under nifi-commons, or just kept as part of the QuestDB module in a separate package until it gets to the point where they could be reused.

@simonbence
Copy link
Contributor Author

Thanks for creating the new pull request and making the adjustments @simonbence. I have yet not review the entire implementation, but I noted a few recommendations.

Having this in a separate module and avoiding changes to the nifi-api is helpful. I am not sure about the general reuse of the retry interfaces, so I'm not sure whether it warrants adding them to nifi-utils. Perhaps they would be better in a separate nifi-retry-utils under nifi-commons, or just kept as part of the QuestDB module in a separate package until it gets to the point where they could be reused.

I hope this PR will be more fit to go on with :)

As of retry: even if it is currently used only by the QuestDB module, I would consider it as a general util or tool, just like other parts of the given module, like for example the InputStream implementations. The concept is simple and might be applied in many place and I think we would benefit from having one extracted and tested code piece do this work if necessary. So I would not think moving it to the QuestDB module would be the right way but if you insist, I can separate it to a distinct module under nifi-commons (however I think that would be a bit of an exaggeration). Please share your thoughts, thanks!

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the initial adjustments @simonbence, I went through the code again and noted several recommendations and questions.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the adjustments, this looks cleaner with the move to spring-retry.

I noted a few minor things, but also there appears to be a file-locking problem resulting in test failures on Windows. Linux and macOS are generally more tolerant in this area, but it highlights either a test problem, or a problem with open files in the runtime class. See the Windows build for more details.

Error:  org.apache.nifi.questdb.embedded.EmbeddedClientTest.testInsertAndQuery -- Time elapsed: 0.121 s <<< ERROR!
java.io.IOException: Failed to delete temp directory C:\Users\RUNNER~1\AppData\Local\Temp\junit16271263950658316199. The following paths could not be deleted (see suppressed exceptions for details): <root>, event, event\2024-01-14, event\2024-01-14\capturedAt.d, event\2024-01-14\subject.d, event\2024-01-14\value.d, event\subject.c, event\subject.k, event\subject.o, event\subject.v, event\_cv, event\_meta, event\_txn, event\_txn_scoreboard, tables.d.0, _tab_index.d

@simonbence
Copy link
Contributor Author

Thanks for making the adjustments, this looks cleaner with the move to spring-retry.

I noted a few minor things, but also there appears to be a file-locking problem resulting in test failures on Windows. Linux and macOS are generally more tolerant in this area, but it highlights either a test problem, or a problem with open files in the runtime class. See the Windows build for more details.

Error:  org.apache.nifi.questdb.embedded.EmbeddedClientTest.testInsertAndQuery -- Time elapsed: 0.121 s <<< ERROR!
java.io.IOException: Failed to delete temp directory C:\Users\RUNNER~1\AppData\Local\Temp\junit16271263950658316199. The following paths could not be deleted (see suppressed exceptions for details): <root>, event, event\2024-01-14, event\2024-01-14\capturedAt.d, event\2024-01-14\subject.d, event\2024-01-14\value.d, event\subject.c, event\subject.k, event\subject.o, event\subject.v, event\_cv, event\_meta, event\_txn, event\_txn_scoreboard, tables.d.0, _tab_index.d

It looks like it has something to do with the @TempDir annotation. I am looking after it.

@exceptionfactory
Copy link
Contributor

Thanks for making the adjustments, this looks cleaner with the move to spring-retry.
I noted a few minor things, but also there appears to be a file-locking problem resulting in test failures on Windows. Linux and macOS are generally more tolerant in this area, but it highlights either a test problem, or a problem with open files in the runtime class. See the Windows build for more details.

Error:  org.apache.nifi.questdb.embedded.EmbeddedClientTest.testInsertAndQuery -- Time elapsed: 0.121 s <<< ERROR!
java.io.IOException: Failed to delete temp directory C:\Users\RUNNER~1\AppData\Local\Temp\junit16271263950658316199. The following paths could not be deleted (see suppressed exceptions for details): <root>, event, event\2024-01-14, event\2024-01-14\capturedAt.d, event\2024-01-14\subject.d, event\2024-01-14\value.d, event\subject.c, event\subject.k, event\subject.o, event\subject.v, event\_cv, event\_meta, event\_txn, event\_txn_scoreboard, tables.d.0, _tab_index.d

It looks like it has something to do with the @TempDir annotation. I am looking after it.

Thanks for looking into it. The @TempDir annotation is useful because JUnit will automatically clear the directory after running a test method, so those types of locking errors usually indicate that there is some other thread which has not closed a file stream.

@simonbence
Copy link
Contributor Author

Thanks for making the adjustments, this looks cleaner with the move to spring-retry.
I noted a few minor things, but also there appears to be a file-locking problem resulting in test failures on Windows. Linux and macOS are generally more tolerant in this area, but it highlights either a test problem, or a problem with open files in the runtime class. See the Windows build for more details.

Error:  org.apache.nifi.questdb.embedded.EmbeddedClientTest.testInsertAndQuery -- Time elapsed: 0.121 s <<< ERROR!
java.io.IOException: Failed to delete temp directory C:\Users\RUNNER~1\AppData\Local\Temp\junit16271263950658316199. The following paths could not be deleted (see suppressed exceptions for details): <root>, event, event\2024-01-14, event\2024-01-14\capturedAt.d, event\2024-01-14\subject.d, event\2024-01-14\value.d, event\subject.c, event\subject.k, event\subject.o, event\subject.v, event\_cv, event\_meta, event\_txn, event\_txn_scoreboard, tables.d.0, _tab_index.d

It looks like it has something to do with the @TempDir annotation. I am looking after it.

Thanks for looking into it. The @TempDir annotation is useful because JUnit will automatically clear the directory after running a test method, so those types of locking errors usually indicate that there is some other thread which has not closed a file stream.

This was my idea as well. I will add some explicit tearing down to see if it is sufficient for the TempDir

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @simonbence, unfortunately it looks like the cleanup approach is still failing on Windows builds.

@simonbence
Copy link
Contributor Author

The failed checks look to be in connection with Python processors and minify

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the updates @simonbence.

On closer review of the test changes, there is still a problem with manager tests not cleaning up temporary files on completion. Disabling the methods on Windows masks this problem, but running the tests on macOS shows junit directories left behind after completion. I recommend removing the DisabledOnOS annotations and tracking down the source of the problem. Whether in the test class, or the implementation itself, temporary files should not remain after successful test completion.

I also noticed several QuestDB table definitions that still use capturedAt as the column name, so this should be checked across the set of changes and updated.

nifi-nar-bundles/nifi-questdb-bundle/nifi-questdb/pom.xml Outdated Show resolved Hide resolved
import static org.junit.jupiter.api.Assertions.assertTrue;
import static org.junit.jupiter.api.Assertions.fail;

public class EmbeddedDatabaseManagerTest extends EmbeddedQuestDbTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class leaves junit temporary directories around after completion. Disabling some methods on Windows masks the fundamental problem of the test not cleaning up after itself, so this needs more evaluation to improve the behavior.

@simonbence
Copy link
Contributor Author

Thanks for making the updates @simonbence.

On closer review of the test changes, there is still a problem with manager tests not cleaning up temporary files on completion. Disabling the methods on Windows masks this problem, but running the tests on macOS shows junit directories left behind after completion. I recommend removing the DisabledOnOS annotations and tracking down the source of the problem. Whether in the test class, or the implementation itself, temporary files should not remain after successful test completion.

I also noticed several QuestDB table definitions that still use capturedAt as the column name, so this should be checked across the set of changes and updated.

I take a look on the leftover temporary files.

As for disable: It was a deliberate choice after looking into the reasons. I picked adding them due to the test setup, not because of production reasons. In order to emulate corrupted database files I directly touch the files and write some gibberish into that. This is something normally would happened outside of our code, most reasonably would happen within the database. I did not found other way to trigger this however Windows does not let one to touch a file is under lock by an other process. If there is an other way to deliberatly corrupt the database (not directly working with the underlying files) than it could work out, but I am not aware of such a method.

@exceptionfactory
Copy link
Contributor

Thanks for making the updates @simonbence.
On closer review of the test changes, there is still a problem with manager tests not cleaning up temporary files on completion. Disabling the methods on Windows masks this problem, but running the tests on macOS shows junit directories left behind after completion. I recommend removing the DisabledOnOS annotations and tracking down the source of the problem. Whether in the test class, or the implementation itself, temporary files should not remain after successful test completion.
I also noticed several QuestDB table definitions that still use capturedAt as the column name, so this should be checked across the set of changes and updated.

I take a look on the leftover temporary files.

As for disable: It was a deliberate choice after looking into the reasons. I picked adding them due to the test setup, not because of production reasons. In order to emulate corrupted database files I directly touch the files and write some gibberish into that. This is something normally would happened outside of our code, most reasonably would happen within the database. I did not found other way to trigger this however Windows does not let one to touch a file is under lock by an other process. If there is an other way to deliberatly corrupt the database (not directly working with the underlying files) than it could work out, but I am not aware of such a method.

Thanks for looking into the details. Manual file changes for the test methods seems like a reasonable strategy, so if you can track down and resolve the leftover files issue, I would expect that should also resolve the test failures on Windows.

@simonbence
Copy link
Contributor Author

Notable changes:

  • CapturedAt database fields were renamed to captured
  • I found the reason of the remaining test files and fixed it
  • I adjusted the NAR structure, now it is inherited from framework again, fixing an instance of issue
  • With this I renamed the NAR. I would prefer the more generic nifi-questdb-nar as name but with the framework as parent we would have challenges with processors in the NAR. So probably if we want to add them, it will be a separate NAR thus the more specific name is beneficial.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continued work on these changes @simonbence.

On further review, I returned to a situation that came up in an earlier comment regarding the NoOpClient.

In the current revision, the implementation logs an error and return the NoOpClient, which effectively disables all Status History tracking. Although this avoids causing NiFi to crash when the QuestDB database is corrupted, is also disables an important function. This might be acceptable as a non-default configuration option, but it would be a blocker for considering this as a default setting. It also seems problematic in general, even as an opt-in solution.

One option is to throw an exception that causes NiFi to fail on startup, similar to current behavior with the QuestDB repository. Another option would be to fall back to in-memory tracking, which seems better. The latter option could be more involved, but I mention it for consideration.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One additional note on test behavior, running tests on Ubuntu Linux 22.04 still leaves behind backup directories in a directory named questDbBackup. There is a new backup directory for each subsequent build of the nifi-questdb module.

@Test
@DisabledOnOs(OS.WINDOWS)
// This testcase cannot be reproduced under Windows using Junit
public void testDatabaseRestorationAfterCorruptedFiles() throws DatabaseException, IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For additional reference, this is the test method that leaves the questDbBackup directory with backup sub-directories after test completion. This is probably the reason for the failure on Windows, but on both macOS and Linux, the directory remains after all tests have completed. It may be necessary to implement some manual deletion logic to ensure the directory is cleared. If the directory is cleared on macOS and Linux, it might be worth enabling this test on Windows again, as it more clearly surfaces the problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest updates in e33dbe9 now clear out the questDbBackup directory, but the directory itself still remains. It looks like adding one more step to delete the directory itself should resolve this particular issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, however the folder was left behind a different test case. I double checked if any leaves anything behind in multiple platform and I found that working well. Thanks for the finding!

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the latest updates and responses @simonbence.

This looks close to completion, the last remaining items are related to test behavior and stability. The directory cleanup is looking better, but still leaves an empty directory.

The other concern is the timing and intermittent failures for testRollover(). As mentioned in the detailed comment, the Thread.sleep() approach takes a lot of time for a unit test, and appears to be unreliable, so another more robust strategy for determining rollover completion seems necessary.

client.insert(EVENT_TABLE_NAME, QuestDbTestUtil.getEventTableDataSource(testData));

// The rollover runs in every 5 seconds
Thread.sleep(TimeUnit.SECONDS.toMillis(6));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a long time to sleep for a unit test, by adjusting the rollover time, can this value be reduced?

final List<Event> result = client.query(SELECT_QUERY, RequestMapping.getResultProcessor(QuestDbTestUtil.EVENT_TABLE_REQUEST_MAPPING));
testSubject.close();

assertEquals(1, result.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When building, the module, this test method fails intermittently. It seems like the Thread.sleep() approach is not a strong enough guarantee of rollover completion. It may be necessary to do some additional checking on the directory content itself to determine if a rollover has occurred. If all else fails, it would be better to remove the test method, as opposed to having a test that fails intermittently.

[ERROR] Tests run: 13, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 18.24 s <<< FAILURE! -- in org.apache.nifi.questdb.embedded.EmbeddedDatabaseManagerTest
[ERROR] org.apache.nifi.questdb.embedded.EmbeddedDatabaseManagerTest.testRollover -- Time elapsed: 7.888 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <1> but was: <2>
	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
	at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:527)
	at org.apache.nifi.questdb.embedded.EmbeddedDatabaseManagerTest.testRollover(EmbeddedDatabaseManagerTest.java:109)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an answer for both this and the previous one.

In general I would lean toward to use something like a latch, which guarantees the end of the background task. Due to the nature of the test case I could not use a spy for this, and I rely on waiting. As I left some gap between the roundtrip and wait time, I did not have issue with it so far, but based on your concerns I proposed a more refined way to do this. The fundamentals are the same: the test thread waits, but it polls multiple times to shorten the usual test duration and now it has 2 sec of gaps between the scheduled round trips and the waiting times. In case 3 seconds is not enough for the positive result I assume we have an issue indeed. Either with the test or the environment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adjusting the approach, the polling strategy looks good, and avoids the timing issues.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the adjustments to the test method @simonbence.

In the course of local build execution, I still noticed intermittent remnants of a JUnit temporary directory containing QuestDB files. I could reproduce it fairly often by just building the nifi-questdb module with associated tests. It is not clear which test method is leaving files behind, but the directory name indicates it is a random temporary directory from the @TempDir annotation. The intermittent nature of the behavior seems to indicate one remaining timing issue.

I will take a closer look at narrowing down the source of the behavior, but if we can get that isolated and resolved, then this should be ready to go.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After additional local build evaluation, the remnant temporary directory only seems to occur on Linux, not on macOS. Given that appears test-related, and does not cause failures, I don't think that it needs to be a blocker.

I pushed one minor adjustment to the test logger configuration, setting the level to CRITICAL, which avoids writing unnecessary QuestDB output that can result in corrupting the forked Surefire runner stream. Planning to merge pending one more successful build cycle.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonbence Following one more review, I tracked down the source of the orphaned files to a particular test method that was missing a close call on the DatabaseManager. After adding the close(), I could run the tests multiple times without seeing orphaned temporary directories.

I pushed one more commit to address that behavior in the test class.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through the feedback and addressing the various issues @simonbence. The latest version looks good. +1 merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants