Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-2341 - Introduce ParseCEF processor #785

Closed
wants to merge 1 commit into from

Conversation

trixpan
Copy link
Contributor

@trixpan trixpan commented Aug 4, 2016

  • Implements a processor to parse HPE's highly popular Common Event Format
  • Please note this code relies on a recently written Parser published outside the NiFi source tree. This was done with the intent of allowing code reuse, given that many Open Source projects tend to tackle the CEF format in a myriad of ways and some are still trying to implement a solid parser (e.g. METRON-157). The Parser is functional but feedback over its implementation is welcome as well.

@trixpan trixpan force-pushed the NIFI-2341 branch 2 times, most recently from c729f2a to 678a9ab Compare August 5, 2016 03:56
@trixpan trixpan force-pushed the NIFI-2341 branch 15 times, most recently from b643185 to 3382b7e Compare August 27, 2016 03:45
@trixpan trixpan force-pushed the NIFI-2341 branch 2 times, most recently from 7677ce4 to 1eb1c92 Compare August 27, 2016 13:35
@trixpan
Copy link
Contributor Author

trixpan commented Aug 27, 2016

@mattyb149 - For some reason I don't truly grasp TestGetJMSQueue start to fail once this processor is introduced to standard-processors

Other than that code looks to be working as expected

@trixpan
Copy link
Contributor Author

trixpan commented Aug 27, 2016

figured out what it is. I needed to exclude slf4j-log4j12 from parCEFone dependency.

In the process found a few bugs. Fixing. :-)

@trixpan
Copy link
Contributor Author

trixpan commented Aug 27, 2016

Ready for review

"headers and extensions of the parts of the CEF message.\n" +
"Note: This Processor expects CEF messages WITHOUT the syslog headers (i.e. starting at \"CEF:0\"")
@WritesAttributes({@WritesAttribute(attribute = "cef.header.version", description = "The version of the CEF message."),
@WritesAttribute(attribute = "cef.header.deviceVendor", description = "The Device Vendor of the CEF message."),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetic note, the description of the deviceVendor attribute uses proper nouns but the rest of the device* properties do not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Addressed. :-)

@trixpan trixpan force-pushed the NIFI-2341 branch 2 times, most recently from 82621db to 2638204 Compare September 2, 2016 23:46
@trixpan
Copy link
Contributor Author

trixpan commented Sep 2, 2016

@mattyb149

feedback addressed.

My comment is to play particular attention on threat concurrency around Jackson's ObjectMapper (jackson say it is threat safe but thread safety is not my forté) and the fiddling with timezones (that has played tricks in the past).

I will continue testing but overall I suspect it should be ok to review.

@mattyb149
Copy link
Contributor

Will do. Although I'm sure you meant "thread concurrency", I always keep an eye out for threat concurrency ;) I've just started a long holiday weekend so I may not be able to take a look until mid-next-week. Thanks for the updates!

@trixpan
Copy link
Contributor Author

trixpan commented Sep 3, 2016

Rofl. Well at least I got 1 out 3 right. As you can see we security folks are a bit obsessed with threats and risks :-)

@trixpan
Copy link
Contributor Author

trixpan commented Sep 23, 2016

@mattyb149 PR rebased and in theory should be ready for review

Copy link
Contributor

@mattyb149 mattyb149 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lookin' good, once these edits are in I will give it a run :)

public static final String DESTINATION_ATTRIBUTES = "flowfile-attribute";
public static final PropertyDescriptor FIELDS_DESTINATION = new PropertyDescriptor.Builder()
.name("FIELDS_DESTINATION")
.displayName("Parsed fields destination")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nitpick, I think the most common convention for property naming is capital first letters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to change that. I have been using all caps matching the property descriptor name within code but happy to adjust in case a naming convention already exists.

@@ -819,7 +819,7 @@ private void setParameter(final PreparedStatement stmt, final String attrName, f
bValue = DatatypeConverter.parseBase64Binary(parameterValue);
break;
default:
throw new ParseException("Unable to parse binary data using the formatter `" + valueFormat + "`.",0);
throw new ParseException("Unable to parse binary data using the simpleDateFormat `" + valueFormat + "`.",0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks odd (no other reference to such things), perhaps a rebase is needed? Or maybe a search-and-replace ran rampant and this file should be removed from the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I had the file opened in intellij and made changes to it instead of to ParseCEF.

addressed

"dpt=1234 agt=123.123.0.124 dlat=40.366633";

@Test
public void testInvalidMessage() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other inputs or properties that could cause a different error? If so additional tests would be nice (for illustration and regression purposes).

"dpt=1234 agt=123.123.0.124 dlat=40.366633";

@Test
public void testInvalidMessage() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can certainly think about invalid messages but generally speaking this is ParCEFone's land. In the library did I either do full validation or no validation at all (we use full validation in ParseCEF).

Happy to port the jUnits across to NIFI.

One thing worth of notice is that the positive sample above contains all the data types that ParCEFone can process (i.e. String, float, long, MacAddress, IPAddr (v4 and v6) and timestamps).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If full or no validation is the behavior of the library then no need for extra unit tests here, I will run it on a real system and merge if all looks well :)

@@ -254,6 +254,17 @@ language governing permissions and limitations under the License. -->
<artifactId>org.everit.json.schema</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>com.fluenda</groupId>
<artifactId>ParCEFone</artifactId>
Copy link
Contributor

@mattyb149 mattyb149 Sep 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be overly picky. This library uses javax.el-api which is licensed under CDDL 1.1, and its NOTICE has been copied to the NiFi overall NOTICE and the assembly's NOTICE. However it is not in the standard-nar's notice (nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-nar/src/main/resources/META-INF/NOTICE). This should have already been included here since the Jolt UI uses it, but since it has not been included, do you mind adding it? Just need to add the following line to the CDDL 1.1 section of the aforementioned file:

(CDDL 1.1) (GPL2 w/ CPE) Expression Language 2.2.4 API (javax.el:javax.el-api:jar:2.2.4 - http://uel-spec.java.net)

Note that the Jolt UI uses el-api version 3.0.0 but your library is using 2.2.4. This will cause two different versions (in two different areas once unpacked, so no eviction or other issues), but as a NAR we would want either two entries in the NOTICE (one for each version), or perhaps consider upgrading your library to use 3.0.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L&N are areas we must be picky so I wil be happy to address the NOTICE. I promise one day I will get this thing right. 😃

I will check if I can upgrade to 3.0.0 so to reduce the amount of "paperwork".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both done.

@trixpan trixpan force-pushed the NIFI-2341 branch 3 times, most recently from 8a720ab to 29218eb Compare September 30, 2016 23:06
@trixpan
Copy link
Contributor Author

trixpan commented Sep 30, 2016

@mattyb149 all feedback addressed.

this should be ready for review

@mattyb149
Copy link
Contributor

Sounds good thanks, will take a look soon

@trixpan trixpan mentioned this pull request Oct 5, 2016
11 tasks
Copy link
Contributor

@mattyb149 mattyb149 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, couple more points and need some clarification on my test cases to see if its a bug or I have bad files.

});

// Update the provenance for good measure
session.getProvenanceReporter().modifyContent(flowFile, "Replaced content with parsed CEF fields and values");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the output in this case is JSON, it should update the "mime.type" attribute with "application/json". I don't see a MIME type for CEF, so text/plain is probably fine for the DESTINATION_ATTRIBUTES case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to add the mime.type attribute update?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oopsie. Done.

public static final PropertyDescriptor APPEND_RAW_MESSAGE_TO_JSON = new PropertyDescriptor.Builder()
.name("APPEND_RAW_MESSAGE_TO_JSON")
.displayName("Append raw message to JSON")
.description("When using flowfile-content, add the original flow content as JSON node _raw, " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use case for this field? The incoming flow file content is CEF not JSON, so it looks like this property just lets you choose whether the output is flat JSON or flat JSON one level down (in a single _raw) object. Is that the intent? Not saying it's bad, just curious :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to parse the whole message, creating the appropriate parsed CEF in JSON format but keep append the original message to the parsed JSON so that user can still have the original message (as it was received from the upstream connect somewhere in the JSON payload).

This is analogous to logstash behaviour and quite frequently requested.

Have you observed a different behaviour?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No just mis-read the field, with your explanation I now realize you're putting the original message in that field, not the parsed message, sorry about that! Will the characters in that field be escaped correctly for JSON?

For failed flowfiles, they'll have the raw CEF as content right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded the description a bit to clarify the intent of the setting.

The characters should be escaped by jackson when added as an element to the object.

Added a unit test to cover that scenario.

Failed flowfiles as passed through with no modification to content.

}

@Test
public void testSuccessfulParseToAttributes() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could get your example CEF file to parse correctly, but I had trouble with the following two inputs I got from the internet:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
(https://httpd.apache.org/docs/trunk/logs.html#common)

CEF:0|ArcSight|ArcSight|6.0.3.6664.0|agent:030|Agent [test] type [testalertng] started|Low| eventId=1 mrt=1396328238973 categorySignificance=/Normal categoryBehavior=/Execute/Start categoryDeviceGroup=/Application catdt=Security Mangement categoryOutcome=/Success categoryObject=/Host/Application/Service art=1396328241038 cat=/Agent/Started deviceSeverity=Warning rt=1396328238937 fileType=Agent cs2=<Resource ID\="3DxKlG0UBABCAA0cXXAZIwA\=\="/> c6a4=fe80:0:0:0:495d:cc3c:db1a:de71 cs2Label=Configuration Resource c6a4Label=Agent IPv6 Address ahost=SKEELES10 agt=888.99.100.1 agentZoneURI=/All Zones/ArcSight System/Private Address Space Zones/RFC1918: 888.99.0.0-888.200.255.255 av=6.0.3.6664.0 atz=Australia/Sydney aid=3DxKlG0UBABCAA0cXXAZIwA\=\= at=testalertng dvchost=SKEELES10 dvc=888.99.100.1 deviceZoneURI=/All Zones/ArcSight System/Private Address Space Zones/RFC1918: 888.99.0.0-888.200.255.255 dtz=Australia/Sydney _cefVer=0.1
(https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/FlexTables/LoadCEFData.htm)

In both cases it said the file could not be parsed as it was not in CEF format. I presumed the first one is missing the CEF:0 header, and maybe the second one has weird characters? I removed all newlines so that second example is all on one line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one certainly wouldn't parse as that seems to be a Apache httpd log format, so not in scope of CEF parsing.

I will have a look at second one as it seems like a CEF example from hell. Will adjust ParCEFone if I find something missing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattyb149 - Thanks for sending this sample. I reviewed the underlying library and code is working as designed.

The trick is that fieldsdvc, agt should contain valid IP addresses, however, their values are not valid IP addresses and as such ParCEFone rejects the message.

I sincerely don't think we should deviate from data input validation, would you agree?

Best regards

@trixpan
Copy link
Contributor Author

trixpan commented Oct 15, 2016

@mattyb149 hopefully all addressed

@mattyb149
Copy link
Contributor

+1 LGTM, thanks much! Merging to master.

@asfgit asfgit closed this in b864d49 Oct 20, 2016
@trixpan trixpan deleted the NIFI-2341 branch October 23, 2016 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants