-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MINIFICPP-621 Nanofi Tailfile example #590
Conversation
98e2d96
to
335ae96
Compare
Please also fix the build error in Travis:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also create some tests if possible.
ea46206
to
e0837ca
Compare
69e2e9a
to
b681714
Compare
nanofi/ecu/CMakeLists.txt
Outdated
|
||
endif() | ||
|
||
if (WIN32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't belong to the scope of this ticket, but copy-pasting this flag setting to all cmake files is something we should stop doing.
transmit_flow_files(instance); | ||
free_flow_file_list(&ff_list); | ||
free_flowfile(new_ff); | ||
sleep(intrvl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpicking, but sleeping after receiving a signal is not so nice. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests for newly created utils!
be869dc
to
80ddd46
Compare
|
||
// Although there is no delimiter within the string that is equal to 4096 bytes or longer | ||
// tail_file creates a flow file for each subsequent 4096 byte chunk. It leaves the last chunk | ||
// if it is smaller than 4096 bytes and not delimited |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both NiFi (https://github.com/apache/nifi/blob/ea9b0db2f620526c8dd0db595cf8b44c3ef835be/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L832) and MiNiFi (https://github.com/apache/nifi-minifi-cpp/blob/master/libminifi/src/core/ProcessSession.cpp#L560) implement the TailFile processor to only ingest data (create flow files) when it encounters a delimiter. It is also in the specification of the processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.TailFile/).
I think that in nanofi we can deviate from the behaviour of NiFi or MiNiFi when it is necessary or beneficial.
In this case, however, I do not see how this behaviour (creating flow files from the data if it is larger than or equal to 4096 bytes even if there was no delimiter) would be beneficial. On the contrary, I think it is an unexpected behaviour, and makes the processor really hard to use, as it can potentially chunk your data into arbitrary pieces which are hard to parse and manage later on.
I am willing to be convinced that this behaviour is better (I do not think it is necessary), but so far I have not seen a reason to believe that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry, I have misread how MiNiFi's TailFile works. It has two modes: with delimiter and without delimiter. If a delimiter is not set, it will read until EOF and create a flow file from all of that data.
If a delimiter is set, it will indeed only try to read at max 4095+1 bytes to find a delimiter.
Unfortunately that implementation is wrong: std::ifstream's getline will set the failbit, if it doesn't find the delimiter in the given size. We will check for this failbit, and stop reading the file without creating any FlowFiles.
And because we only set TailFile's current offset in the file based on the generated FlowFiles, this means that if TailFile encounters a line longer than 4095 bytes without a delimiter, it will stop working indefinitely (this was a bug in a previous version of this PR too).
I have just tried this out, and it indeed fails. :(
This means that MiNiFi's TailFile will have to be fixed: at a minimum, to fix the implementation, but I propose to change it to match NiFi's behaviour, and only ingest data when a delimiter is found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the line that most probably generates warning.
Otherwise okay with some topics left to follow-up.
|
||
char * end = NULL; | ||
|
||
while ((end = strchr(begin, delim))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't it produce warning?
const char * is assigned to char * (end should be const, too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strchr returns const char*, so I agree, end should be const too.
} | ||
} | ||
|
||
std::string join_strings(const std::vector<std::string>& strings, const std::string& token) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more reason to move that PR forward (which provides a much more generic and efficient way for doing the same).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocking tickets for the discussed refactor has been created, this is a review with that in mind.
Please see the inline comments.
As the unit tests test the tokenizer, and not the entire ECU, there are some untested functionality (creating and sending flow files, restoring offset). Have you run manual tests to test these?
nanofi/ecu/tail_file.c
Outdated
char file_path[4096]; | ||
char delimiter[2]; | ||
|
||
if (get_property(ctx, "file_path", file_path, 50) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The buffer has been extended to 4096 bytes but here the size has been left 50.
Perhaps using sizeof(file_path) (and sizeof(delimiter) below) as the sizes would be less error-prone.
nanofi/ecu/tail_file.c
Outdated
printf("Delimiter not specified or it is empty\n"); | ||
return; | ||
} | ||
char delim = '\0'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire section until checking whether the delimiter is an escape sequence is unnecessary.
You have already returned if strlen(delimiter) == 0
, so strlen(delimiter) > 0
is true.
As strlen(delimiter) > 0
is true, the first character of the string cannot be a terminating null character, so the if (delim == '\0')
check is unnecessary as well.
The entire section can be substituted with char delim = delimiter[0];
case '\\': | ||
delim = '\\'; | ||
break; | ||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning an error here instead of silently using \ as the delimiter would be better, but I can live with it.
nanofi/ecu/tail_file.c
Outdated
struct stat stats; | ||
int ret = stat(file, &stats); | ||
|
||
errno = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you set errno here to 0, strerror(errno) will not return with stat's error.
while (head) { | ||
free_flowfile(head->ff_record); | ||
free(head); | ||
head = head->next; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a use-after-free.
You have to save the next pointer before freeing the head structure.
nanofi/src/core/string_utils.c
Outdated
free(node->data); | ||
} | ||
free(node); | ||
memset(node, 0, sizeof(struct token_node)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a use-after-free, you can't null a struct after you have freed it.
nanofi/src/core/string_utils.c
Outdated
token_node * head = tokens->head; | ||
int i = 0; | ||
while (head) { | ||
printf("Token %d : %s Length = %d\n", i, head->data, strlen(head->data)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strlen returns a size_t, the printf format specifier for it should be %zu, not %d.
|
||
char * end = NULL; | ||
|
||
while ((end = strchr(begin, delim))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strchr returns const char*, so I agree, end should be const too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on comments and recent change to resolve a potential warning we can merge this with follow-ons. Thanks @msharee9
Anyone else feel free to create blockers if need be for any follow ons.
@phrocker please see my last review. There are - among other things - multiple use-after-free issues in the code. It should not be merged until those are fixed. |
token_node * head = tokens->head; | ||
int i = 0; | ||
while (head) { | ||
printf("Token %d : %s Length = %lu\n", i, head->data, strlen(head->data)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l is the length sub-specifier for long, strlen returns a size_t, and the sub-specifier for size_t is z, so this should be %zu not %lu
@@ -103,4 +105,5 @@ endif(ENABLE_PYTHON AND NOT STATIC_BUILD) | |||
|
|||
if (NOT DISABLE_CURL) | |||
add_subdirectory(examples) | |||
add_subdirectory(ecu) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing this gets me super excited!
test_lists_equal(&tokens, std::vector<std::string>{std::move(s1), std::move(s2)}); | ||
} | ||
|
||
TEST_CASE("Test tail file having more than 4096 bytes with delimiter and second chunk less than 4096", "[testTailFileWithDelimitedString]") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking another casual look to make it look like I'm useful. I like the breadth of tests! Nice!
7709fda
to
e9f123d
Compare
nanofi/ecu/tail_file.c
Outdated
void on_trigger_callback(processor_session * ps, processor_context * ctx) { | ||
|
||
char file_path[4096]; | ||
char delimiter[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This includes the terminating null character, meaning that only 1 character of the parameter will be read, so escape sequences won't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple things to follow-up here, but let's merge for now as this is getting quite big already, hence approved.
Let me apologize for not mentioning earlier, but we have UT containers added as a third party ( https://github.com/troydhanson/uthash/tree/master/src ), so we don't need to implement containers like linked list.
It's fine to be cleaned up in a follow-up, just a notice to avoid implementing collections in the future.
@bakaid Are you good with merging? |
This closes apache#590. Signed-off-by: Marc Parisi <phrocker@apache.org>
Thank you for submitting a contribution to Apache NiFi - MiNiFi C++.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with MINIFICPP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.