Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-4789 Extract grok multi pattern support #2411

Closed
wants to merge 4 commits into from
Closed

NIFI-4789 Extract grok multi pattern support #2411

wants to merge 4 commits into from

Conversation

charlesporter
Copy link

ExtractGrok Processor supports multiple expressions
-separated by comma or by specified delimiter
-option to return on first match or to run all expressions in list
ralated enhancements:
-multiple pattern files support
-selectable result attribute prefix (default unchanged from '.grok')

implemented by creating multiple kraken-Grok objects. It is possible that performance could be improved by using elasticsearch implementation:
https://github.com/elastic/elasticsearch/tree/d6d0c13bd63f28347ebdb0f9364e44921c248b8b/modules/ingest-common/src/main/java/org/elasticsearch/ingest/commoncommon/src/main/java/org/elasticsearch/ingest/common/Grok.java

-separated by comma or specified delimiter
-option to return on first match or to run all expressions in list
other enhancements
-multiple pattern files
-selectable result attribute prefix
-separated by comma or specified delimiter
-option to return on first match or to run all expressions in list
other enhancements
-multiple pattern files
-selectable result attribute prefix
Copy link
Author

@charlesporter charlesporter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added apache_patterns to rat excludes

Copy link
Contributor

@markap14 markap14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charlesporter thanks for the update! I think the idea makes a lot of sense. The PR, as submitted, needs a bit of cleanup, though, I think, and there are some thread safety concerns. I commented more specifically inline.

I think it's worth considering also to use user-defined properties to specify additional grok expressions instead of using a delimiter, because delimiters can get really tricky to read and to write without realizing that you're using the delimiter, especially when mixed in with a syntax like Grok where you can use common delimiter such as commas in the syntax itself.

}
}

public static Validator validateMultipleFilesExist() {
return (subject, input, context) -> {
for (String s : input.split(PATTERN_FILE_LIST_SEPARATOR)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here doesn't appear to be right. This is going to create a validator for the first file that it encounters but not validate the others.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops.

public static final PropertyDescriptor GROK_EXPRESSION = new PropertyDescriptor.Builder()
.name("Grok Expression")
.description("Grok expression")
.name(GROK_EXPRESSION_KEY)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand the idea behind pulling the string literal out into a member variable? Since the PropertyDescriptor's name is generally not referenced elsewhere, it's usually a lot cleaner to have the name inlined.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just because sometimes (maybe not in this processor) the literals for the properties do get re-used. I like consistency, even if in a particular situation it might not add much. If the NIFI style is to not to do it this way, I will happily use literals.

.description("Name of attribute to receive the name(s) of the matched expression(s).")
.required(true)
.defaultValue("matched_expression")
.addValidator(Validator.VALID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not support empty string.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops.

.build();


public static final PropertyDescriptor EXPRESSION_SEPARATOR = new PropertyDescriptor.Builder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would break backward compatibility of this processor. It will be very common to have a comma in the Grok Expression, and this change would result in different behavior for the processor. If we use this approach, then we would have to make it an optional property (i.e., required(false)) and we would have to not set a default value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it makes more sense to avoid using a delimiter at all here and instead support user-defined properties to indicate additional expressions. So we would keep the existing property, which is required. Then if the user adds a new property, say "my additional grok expression" the value of that would be a grok expression to match as well. Thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point about the backward compatibility. I had imagined that users could select a separator if needed in that case. I now realize that is not reasonable. I did not originally go with user-defined properties cause I was not sure how they would be ordered. Are they delivered in the same order declared, or do I need to tell the user to name them in sortable order?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now recall my reasoning on this. The common way to use grok is to have most of the regexes in one or more grok pattern files, where they are easier to test then in an attribute property box. The expressions are then referenced by name in the expression passed in for evaluation, as in %{EX_1}. or even {EX_1} %{EX_2}. If someone really does want to do an interesting regex in line, the delimiter can be multiple characters, so it is easy to make a unique one like "<>". If I wanted to locally define a set of regexes, I would could also also use the ExtractText processor.
On the other hand, if we go with putting each expression in a separate attributes, the attributes need to be named in a way that can be sorted since we commit to order of evaluation. If the user wants to insert a new expression into the middle of the list, they have to (on average) recreate half of the rest of the attributes with new names. In my own use case of building a classifier, i am often tweaking the list, so having to recreate my attributes would be a nuisance.

PS: I will make the delimiter optional, with no default.

.required(true)
.allowableValues("true", "false")
.addValidator(StandardValidators.BOOLEAN_VALIDATOR)
.defaultValue("false")
.build();

public static final PropertyDescriptor BREAK_ON_FIRST_MATCH = new PropertyDescriptor.Builder()
.name(SINGLE_MATCH_KEY)
.description("Stop on first matched expression.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain better what this property is meant to do? How does the behavior of the processor change if i select 'true' vs. if I select 'false'?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glad to get early warning that it was not intelligible. I will add a more complete just FYI: in my use case. i am using this processor as a classifier. I arrange my expressions, so that the first match it makes will be the best match..

descriptors = Collections.unmodifiableList(_descriptors);
}

private String resultPrefix = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not thread-safe. Since each invocation of the Processor may use a different thread, we must ensure that any member variable is stored in a thread-safe manner. Given that these variables appear to be set only in the @OnScheduled method and are not mutable objects, it should be sufficient to mark them as volatile. It is probably best, though, to move them up in the codebase with the other member variables, as that's the de-facto standard with nifi processors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I should conform to project style. will fix.

Grok grok = new Grok();
final String patternFileListString = context.getProperty(GROK_PATTERN_FILE).getValue();
for (String patternFile : patternFileListString.split(PATTERN_FILE_LIST_SEPARATOR)) {
grok.addPatternFromFile(patternFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably makes sense to call trim() on this value before passing it into the Grok object so that if the user enters something like "abc, xyz, 123" we don't pass in the white space.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm... ok. My feeling would be if they want to put spaces in their list they should include them in separator, but your way is more tolerant. Will fix.

private final static List<PropertyDescriptor> descriptors;
private final static Set<Relationship> relationships;

private volatile Grok grok = new Grok();
private volatile List<Grok> grokList = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ArrayList is modified once created, so marking it as volatile is not sufficient from a thread-safety point of view. We would need to either create a new list each time instead of calling clear() in the @OnScheduled method or make it a synchronized list.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i will create a new list. In practice, it is never modified while running multi-threaded. but it is an easy and computationally cheap fix for safer code.

@@ -70,33 +65,51 @@
@Tags({"grok", "log", "text", "parse", "delimit", "extract"})
@CapabilityDescription("Evaluates one or more Grok Expressions against the content of a FlowFile, " +
"adding the results as attributes or replacing the content of the FlowFile with a JSON " +
"notation of the matched content")
"notation of the matched content\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid using the \n here as this is rendered in a couple of places in HTML and we don't guarantee that it will render as expected. If we are going to explicitly indicate which library is being used, we should also be very clear that this is the implementation that is currently used and that it could change at any time without notice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, i will use
instead of \n. and add the notice. (BTW: It is same implementation as was being used before, without credit)

"if the grok identifier \"timestamp\" is matched, then the value will be added to an attribute named \"{result prefix}timestamp\""),

@WritesAttribute(attribute = "ExtractGrok.exception", description = "if an error occurs, an exception will be written to this attribute, " +
"and the flow routed to 'unmatched' ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should read "and the FlowFile routed" rather than "and the flow routed". In such a case, though, we should not be routing to 'unmatched' but rather we should route to a failure relationship.

@github-actions
Copy link

We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed this PR will be closed. This stale marker and eventual auto close does not indicate a judgement of the PR just lack of reviewer bandwidth and helps us keep the PR queue more manageable. If you would like this PR re-opened you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth which in turn helps yours.

@github-actions github-actions bot added the Stale label May 12, 2021
@github-actions github-actions bot closed this May 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants