NIFI-4789 Extract grok multi pattern support #2411

charlesporter · 2018-01-17T22:19:58Z

ExtractGrok Processor supports multiple expressions
-separated by comma or by specified delimiter
-option to return on first match or to run all expressions in list
ralated enhancements:
-multiple pattern files support
-selectable result attribute prefix (default unchanged from '.grok')

implemented by creating multiple kraken-Grok objects. It is possible that performance could be improved by using elasticsearch implementation:
https://github.com/elastic/elasticsearch/tree/d6d0c13bd63f28347ebdb0f9364e44921c248b8b/modules/ingest-common/src/main/java/org/elasticsearch/ingest/commoncommon/src/main/java/org/elasticsearch/ingest/common/Grok.java

-separated by comma or specified delimiter -option to return on first match or to run all expressions in list other enhancements -multiple pattern files -selectable result attribute prefix

charlesporter

added apache_patterns to rat excludes

markap14

@charlesporter thanks for the update! I think the idea makes a lot of sense. The PR, as submitted, needs a bit of cleanup, though, I think, and there are some thread safety concerns. I commented more specifically inline.

I think it's worth considering also to use user-defined properties to specify additional grok expressions instead of using a delimiter, because delimiters can get really tricky to read and to write without realizing that you're using the delimiter, especially when mixed in with a syntax like Grok where you can use common delimiter such as commas in the syntax itself.

markap14 · 2018-01-29T17:15:06Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

        }
    }

+    public static Validator validateMultipleFilesExist() {
+        return (subject, input, context) -> {
+            for (String s : input.split(PATTERN_FILE_LIST_SEPARATOR)) {


The logic here doesn't appear to be right. This is going to create a validator for the first file that it encounters but not validate the others.

markap14 · 2018-01-29T17:17:13Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

    public static final PropertyDescriptor GROK_EXPRESSION = new PropertyDescriptor.Builder()
-        .name("Grok Expression")
-        .description("Grok expression")
+        .name(GROK_EXPRESSION_KEY)


Can you help me understand the idea behind pulling the string literal out into a member variable? Since the PropertyDescriptor's name is generally not referenced elsewhere, it's usually a lot cleaner to have the name inlined.

just because sometimes (maybe not in this processor) the literals for the properties do get re-used. I like consistency, even if in a particular situation it might not add much. If the NIFI style is to not to do it this way, I will happily use literals.

markap14 · 2018-01-29T17:20:08Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

+        .description("Name of attribute to receive the name(s) of the matched expression(s).")
+        .required(true)
+        .defaultValue("matched_expression")
+        .addValidator(Validator.VALID)


This should not support empty string.

markap14 · 2018-01-29T17:22:07Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

+        .build();
+
+
+    public static final PropertyDescriptor EXPRESSION_SEPARATOR = new PropertyDescriptor.Builder()


This would break backward compatibility of this processor. It will be very common to have a comma in the Grok Expression, and this change would result in different behavior for the processor. If we use this approach, then we would have to make it an optional property (i.e., required(false)) and we would have to not set a default value.

I am wondering if it makes more sense to avoid using a delimiter at all here and instead support user-defined properties to indicate additional expressions. So we would keep the existing property, which is required. Then if the user adds a new property, say "my additional grok expression" the value of that would be a grok expression to match as well. Thoughts?

good point about the backward compatibility. I had imagined that users could select a separator if needed in that case. I now realize that is not reasonable. I did not originally go with user-defined properties cause I was not sure how they would be ordered. Are they delivered in the same order declared, or do I need to tell the user to name them in sortable order?

I now recall my reasoning on this. The common way to use grok is to have most of the regexes in one or more grok pattern files, where they are easier to test then in an attribute property box. The expressions are then referenced by name in the expression passed in for evaluation, as in %{EX_1}. or even {EX_1} %{EX_2}. If someone really does want to do an interesting regex in line, the delimiter can be multiple characters, so it is easy to make a unique one like "<>". If I wanted to locally define a set of regexes, I would could also also use the ExtractText processor.
On the other hand, if we go with putting each expression in a separate attributes, the attributes need to be named in a way that can be sorted since we commit to order of evaluation. If the user wants to insert a new expression into the middle of the list, they have to (on average) recreate half of the rest of the attributes with new names. In my own use case of building a classifier, i am often tweaking the list, so having to recreate my attributes would be a nuisance.

PS: I will make the delimiter optional, with no default.

markap14 · 2018-01-29T17:22:47Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

        .required(true)
        .allowableValues("true", "false")
        .addValidator(StandardValidators.BOOLEAN_VALIDATOR)
        .defaultValue("false")
        .build();

+    public static final PropertyDescriptor BREAK_ON_FIRST_MATCH = new PropertyDescriptor.Builder()
+        .name(SINGLE_MATCH_KEY)
+        .description("Stop on first matched expression.")


Can you explain better what this property is meant to do? How does the behavior of the processor change if i select 'true' vs. if I select 'false'?

glad to get early warning that it was not intelligible. I will add a more complete just FYI: in my use case. i am using this processor as a classifier. I arrange my expressions, so that the first match it makes will be the best match..

markap14 · 2018-01-29T17:28:30Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

        descriptors = Collections.unmodifiableList(_descriptors);
    }

+    private String resultPrefix = "";


This is not thread-safe. Since each invocation of the Processor may use a different thread, we must ensure that any member variable is stored in a thread-safe manner. Given that these variables appear to be set only in the @OnScheduled method and are not mutable objects, it should be sufficient to mark them as volatile. It is probably best, though, to move them up in the codebase with the other member variables, as that's the de-facto standard with nifi processors.

I agree I should conform to project style. will fix.

markap14 · 2018-01-29T17:29:27Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

+            Grok grok = new Grok();
+            final String patternFileListString = context.getProperty(GROK_PATTERN_FILE).getValue();
+            for (String patternFile : patternFileListString.split(PATTERN_FILE_LIST_SEPARATOR)) {
+                grok.addPatternFromFile(patternFile);


It probably makes sense to call trim() on this value before passing it into the Grok object so that if the user enters something like "abc, xyz, 123" we don't pass in the white space.

hmmm... ok. My feeling would be if they want to put spaces in their list they should include them in separator, but your way is more tolerant. Will fix.

markap14 · 2018-01-29T17:30:39Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

    private final static List<PropertyDescriptor> descriptors;
    private final static Set<Relationship> relationships;

-    private volatile Grok grok = new Grok();
+    private volatile List<Grok> grokList = new ArrayList<>();


This ArrayList is modified once created, so marking it as volatile is not sufficient from a thread-safety point of view. We would need to either create a new list each time instead of calling clear() in the @OnScheduled method or make it a synchronized list.

ok, i will create a new list. In practice, it is never modified while running multi-threaded. but it is an easy and computationally cheap fix for safer code.

markap14 · 2018-01-29T17:51:37Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

@@ -70,33 +65,51 @@
 @Tags({"grok", "log", "text", "parse", "delimit", "extract"})
 @CapabilityDescription("Evaluates one or more Grok Expressions against the content of a FlowFile, " +
    "adding the results as attributes or replacing the content of the FlowFile with a JSON " +
-    "notation of the matched content")
+    "notation of the matched content\n" +


I would avoid using the \n here as this is rendered in a couple of places in HTML and we don't guarantee that it will render as expected. If we are going to explicitly indicate which library is being used, we should also be very clear that this is the implementation that is currently used and that it could change at any time without notice.

thanks, i will use
instead of \n. and add the notice. (BTW: It is same implementation as was being used before, without credit)

markap14 · 2018-01-29T17:53:20Z

.../nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExtractGrok.java

+        "if the grok identifier \"timestamp\" is matched, then the value will be added to an attribute named \"{result prefix}timestamp\""),
+
+        @WritesAttribute(attribute = "ExtractGrok.exception", description = "if an error occurs, an exception will be written to this attribute, " +
+                "and the flow routed to 'unmatched' ")


This should read "and the FlowFile routed" rather than "and the flow routed". In such a case, though, we should not be routing to 'unmatched' but rather we should route to a failure relationship.

github-actions · 2021-05-12T00:04:31Z

We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed this PR will be closed. This stale marker and eventual auto close does not indicate a judgement of the PR just lack of reviewer bandwidth and helps us keep the PR queue more manageable. If you would like this PR re-opened you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth which in turn helps yours.

charlesporter added 3 commits January 16, 2018 20:14

ExtractGrok Supports multiple expressions

66b072a

-separated by comma or specified delimiter -option to return on first match or to run all expressions in list other enhancements -multiple pattern files -selectable result attribute prefix

ExtractGrok Supports multiple expressions

dd8bc22

-separated by comma or specified delimiter -option to return on first match or to run all expressions in list other enhancements -multiple pattern files -selectable result attribute prefix

remerge

f289df5

charlesporter commented Jan 19, 2018

View reviewed changes

add apache_patterns to rat excludes

c6cd348

markap14 requested changes Jan 29, 2018

View reviewed changes

github-actions bot added the Stale label May 12, 2021

github-actions bot closed this May 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NIFI-4789 Extract grok multi pattern support #2411

NIFI-4789 Extract grok multi pattern support #2411

charlesporter commented Jan 17, 2018

charlesporter left a comment •

edited

markap14 left a comment

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

markap14 Jan 29, 2018

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

charlesporter Jan 30, 2018

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

markap14 Jan 29, 2018

charlesporter Jan 29, 2018

markap14 Jan 29, 2018

github-actions bot commented May 12, 2021

		.build();


		public static final PropertyDescriptor EXPRESSION_SEPARATOR = new PropertyDescriptor.Builder()

NIFI-4789 Extract grok multi pattern support #2411

NIFI-4789 Extract grok multi pattern support #2411

Conversation

charlesporter commented Jan 17, 2018

charlesporter left a comment • edited

Choose a reason for hiding this comment

markap14 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented May 12, 2021

charlesporter left a comment •

edited