NIFI-3726 - Introduces CompareFuzzyHash processor #1692

trixpan · 2017-04-25T12:15:37Z

      - Abstract FuzzyhashContent to reduce a bit of code
        duplication

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
Have you written or updated unit tests to verify your changes?
~~~If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?~~~
~~~If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?~~~
~~~If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?~~~
If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

trixpan · 2017-04-25T12:17:49Z

@apiri and now the comparison processor!!

For the record: I tried to illustrate the potential of the processor by using two of the NiFi own 'pom.xml' and displaying how despite lack of 1 to 1 match, the processor is still capable of detecting similarities on both cases.

alopresto · 2017-04-25T16:26:51Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

+        "appending an attribute to the FlowFile in case of a successful match.")
+
+@WritesAttributes({
+        @WritesAttribute(attribute = "XXXX.N.match", description = "The match that ressambles the attribute specified " +


Typo: ressambles -> resembles.

alopresto · 2017-04-25T16:27:14Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

+        @WritesAttribute(attribute = "XXXX.N.match", description = "The match that ressambles the attribute specified " +
+                "by the <Hash Attribute Name> property. Note that: 'XXX' gets replaced with the <Hash Attribute Name>"),
+        @WritesAttribute(attribute = "XXXX.N.similarity", description = "The similarity score between this flowfile" +
+                "and the its match of the same number N. Note that: 'XXX' gets replaced with the <Hash Attribute Name>")})


Typo: the its -> its.

alopresto · 2017-04-25T16:28:21Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

+    public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder()
+            .name("MATCHING_MODE")
+            .displayName("Matching mode")
+            .description("The ")


Probably meant to build out the description field.

alopresto · 2017-04-25T16:28:50Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

+            .build();
+
+    public static final Relationship REL_NON_MATCH = new Relationship.Builder()
+            .name("non-match")


We should be consistent with the naming style of the relationship fields.

Good call. renamed to "found", "not found" as per GeoEnrichIP and also added "failure"

alopresto · 2017-04-25T16:35:06Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

+        }
+
+        Digest inputDigest = null;
+        SpamSum spamSum = null;


It looks like there are definite similarities between the two algorithms so I understand why the method is structured this way, but it seems to interleave the logic a bit much for me. If we added a third implementation, I could see this getting very convoluted. My recommendation is to encapsulate each of the algorithms into their own private processing method (better yet, a shared interface and library implementations) that have minimal shared code (file loading, etc.) and can be invoked the same way. I am happy to help with refactoring if necessary.

In fact, thinking about exposing the common interface would allow for a method boolean matchExceedsThreshold(double match, double matchThreshold) which avoids the "some hashes use higher numbers to indicate similarity, some use lower" issue as well.

Makes total sense. Will refactor

trixpan · 2017-04-26T16:33:59Z

@alopresto - I put together the interface based approach and indeed it seems a much cleaner code so thank you for pointing me into this direction.

The code has not been optimised and still has a few sharp edges around String split that must be solved before merging but I would appreciate if you could confirm I am moving in the right direction.

Once again, thank you for the suggestion. It is certainly a great chance to try to do things "properly".

Cheers

alopresto · 2017-04-26T18:01:41Z

Thanks @trixpan . I will continue to review but I appreciate your efforts in response to my comments.

alopresto · 2017-04-26T23:18:45Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

@@ -110,24 +107,25 @@
    public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder()
            .name("MATCHING_MODE")


Sorry, I missed this one earlier. I think the style elsewhere is skinny snake case - matching-mode.

alopresto · 2017-04-26T23:19:15Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

@@ -110,24 +107,25 @@
    public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder()
            .name("MATCHING_MODE")
            .displayName("Matching mode")
-            .description("The ")
+            .description("Defines if the Processor should try to match as many entries as possible (" + multiMatch.getDisplayName() +
+                    ") or if it should stio after the first match (" + singleMatch.getDisplayName() + ")")


Typo: should stio -> should stop

alopresto · 2017-04-26T23:19:49Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

-    public static final Relationship REL_NON_MATCH = new Relationship.Builder()
-            .name("non-match")
+    public static final Relationship REL_NOT_FOUND = new Relationship.Builder()
+            .name("not found")


I think we try to avoid spaces in names: not-found.

sounds like good practice to me. Fixed

alopresto · 2017-04-26T23:20:54Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

-                }
+                fuzzyHashMatcher = new SSDeepHashMatcher(getLogger());
+                break;
+        }


I think there should be a default case to handle no algorithm provided.

alopresto · 2017-04-26T23:21:29Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

+                break;
+        }
+
+        if (fuzzyHashMatcher.isValidHash(inputHash) == false) {


If the algorithm doesn't match one of the provided options, fuzzyHashMatcher will be null here. In addition, the condition can just be if (!fuzzyHashMatcher.isValidHash(inputHash)) {.

introduced a default to switch(algorithm) so that we don't get here with null anymore.

alopresto · 2017-04-26T23:30:35Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

+            FileInputStream fileInputStream = new FileInputStream(file);
+            BufferedReader reader = new BufferedReader(new InputStreamReader(fileInputStream));
+
+            // If SSdeep skip the first line (as the usual format used by other tools add a header line


I think the details of the file-parsing should be delegated & encapsulated by the implementations. A method in the interface like Map<String, Double> getMatches(File definitionFile); would then hide the various line split delimiters, etc. If you want to keep the loop logic here in the main processor, at least implement void prepareReader(BufferedReader reader); (no-op in most, skip a line in SSDeep), and String getHashToCompare(String line); -- the general idea is to remove the need for switch statements to handle custom per-implementation logic from this level.

make sense. This should help in case we happen to have a non-file based source later on in the future.

Will adjust

alopresto · 2017-04-26T23:34:21Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/CompareFuzzyHash.java

+                for (Map.Entry<String, Double> entry : matched.entrySet()) {
+                    // defining attributes accordingly
+                    attributes.put(
+                            context.getProperty(ATTRIBUTE_NAME).getValue() + "." + x + ".match",


Use Extract Variable refactor on context.getProperty(ATTRIBUTE_NAME).getValue() + "." + x to avoid multiple calls to two external objects and repeated String concatenation.

wow... wasn't aware of the refactoring hotkey for this. Super thanks!

alopresto · 2017-04-26T23:35:43Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/FuzzyHashContent.java


        // Check if content matches minimum length requirement
-        if (context.getProperty(HASH_ALGORITHM).equals(allowableValueTLSH) && flowFile.getSize() < 512 ) {
+
+        if (checkMinimumAlgorithmRequirements(algorithm, flowFile) == false) {
            logger.info("The content of {} is smaller than the minimum required by TLSH, routing to failure", new Object[]{flowFile});


Use the value of algorithm instead of hardcoded TLSH in the message, and we probably don't want to write the entire flowfile content to the log -- use the unique ID for traceability instead.

alopresto · 2017-04-27T00:08:41Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/FuzzyHashContent.java

@@ -178,11 +161,11 @@ public void process(final InputStream in) throws IOException {
                    try (ByteArrayOutputStream holder = new ByteArrayOutputStream()) {
                        StreamUtils.copy(in,holder);

-                        if (context.getProperty(HASH_ALGORITHM).getValue().equals(allowableValueSSDEEP.getValue())) {
+                        if (algorithm.equals(allowableValueSSDEEP.getValue())) {


These can probably be extracted to the interface as well.

added to the same class as the checkMinimumRequirements. wonder if we should create a separate class just to hold those help functions?

alopresto · 2017-05-02T15:08:37Z

Will finish reviewing and merge shortly.

mattyb149

@trixpan I didn't get far with testing this one, there are a couple of issues I ran into beforehand, related to #1619. The NAR isn't included in the assembly POM or the top-level POM, so it doesn't get into a full build. Also I think FuzzyHashContent needs some work, the relationships are capitalized (minor) but I sent a flow file with "Hello world!" in it to FHC with TLSH and it routes the flow file to failure with no bulletin or error message. Are you ok with leaving this one out of the 1.2.0 release so we can spend some more time on the NAR?

mattyb149 · 2017-05-02T19:25:42Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/FuzzyHashContent.java

@@ -73,31 +70,14 @@
        "evaluations in memory. Accordingly, it is important to consider the anticipated profile of content being " +
        "evaluated by this processor and the hardware supporting it especially when working against large files.")

-@SeeAlso({HashContent.class})
+@SeeAlso({HashContent.class, CompareFuzzyHash.class})


This refers to a processor in another NAR, which won't be able to be found by the framework. I'll remove the reference to HashContent (and the standard processors dependency in the POM) on merge

mattyb149 · 2017-05-02T19:26:00Z

...rity-processors/src/main/java/org/apache/nifi/processors/cybersecurity/FuzzyHashContent.java

@@ -73,31 +70,14 @@
        "evaluations in memory. Accordingly, it is important to consider the anticipated profile of content being " +
        "evaluated by this processor and the hardware supporting it especially when working against large files.")

-@SeeAlso({HashContent.class})
+@SeeAlso({HashContent.class, CompareFuzzyHash.class})
 @ReadsAttributes({@ReadsAttribute(attribute="", description="")})


This one's empty, will remove before merge

alopresto · 2017-05-02T19:41:55Z

@trixpan I have been reviewing this along with @mattyb149 and @bbende and we believe there may be a dependency issue here. I had only reviewed the code design via GitHub but upon trying to build the application, Jetty startup fails with:

2017-05-02 15:01:28,815 WARN [main] org.apache.nifi.web.server.JettyServer Failed to start web server... shutting down.
java.lang.ArrayStoreException: sun.reflect.annotation.TypeNotPresentExceptionProxy
        at sun.reflect.annotation.AnnotationParser.parseClassArray(AnnotationParser.java:724)
        at sun.reflect.annotation.AnnotationParser.parseArray(AnnotationParser.java:531)
        at sun.reflect.annotation.AnnotationParser.parseMemberValue(AnnotationParser.java:355)
        at sun.reflect.annotation.AnnotationParser.parseAnnotation2(AnnotationParser.java:286)
        at sun.reflect.annotation.AnnotationParser.parseAnnotations2(AnnotationParser.java:120)
        at sun.reflect.annotation.AnnotationParser.parseAnnotations(AnnotationParser.java:72)
        at java.lang.Class.createAnnotationData(Class.java:3521)
        at java.lang.Class.annotationData(Class.java:3510)
        at java.lang.Class.getAnnotation(Class.java:3415)
        at java.lang.reflect.AnnotatedElement.isAnnotationPresent(AnnotatedElement.java:258)
        at java.lang.Class.isAnnotationPresent(Class.java:3425)
        at org.apache.nifi.nar.ExtensionManager.checkControllerServiceReferenceEligibility(ExtensionManager.java:189)
        at org.apache.nifi.nar.ExtensionManager.loadExtensions(ExtensionManager.java:160)
        at org.apache.nifi.nar.ExtensionManager.discoverExtensions(ExtensionManager.java:113)
        at org.apache.nifi.web.server.JettyServer.start(JettyServer.java:689)
        at org.apache.nifi.NiFi.<init>(NiFi.java:160)
        at org.apache.nifi.NiFi.main(NiFi.java:267)

After further discussion, we believe this is related to the added dependency on nifi-standard-processors in the processor NAR pom.xml, which should not be there.

I think this needs another iteration to ensure all the wrinkles are ironed out. I don't believe this was requested by the community for this release, so as long as you can continue to get value from it locally, I'll work with you to get it into 1.3.0 but I suggest moving it from this release. Is that ok with you?

trixpan · 2017-05-02T20:39:24Z

Andy, easy to solve

- Abstract FuzzyhashContent to reduce a bit of code duplication

…o message too short

trixpan · 2017-05-02T21:27:46Z

@alopresto @mattyb149 last two commits should address the two residual issues you identified.

One was a simple reference to HashContent.class instead of the fully qualified class path as string.

The other one was the use of .info in message too short when it should be error.

trixpan · 2017-05-02T21:27:54Z

pom has also been fixed

trixpan · 2017-05-02T21:49:18Z

@mattyb149 just to register: TLSH requires more than 512 characters to work so your hello message triggers a safety mechanism that routes the flow to error.

I should have used error but I think I changed this to info during the peer review

Should be fixed now

…ash source that matches but lacks a separator AND filename or matches but lacks a filename(i.e. ends with trailing separator)

mattyb149 · 2017-05-03T15:02:41Z

+1 LGTM, ran unit tests, verified the processors are in the assembly, and ran some tests on a live NiFi. Thanks for the contribution, this is good stuff! Merging to master

trixpan force-pushed the NIFI-3726 branch from 447c112 to 635a820 Compare April 25, 2017 15:16

alopresto reviewed Apr 25, 2017

View reviewed changes

trixpan force-pushed the NIFI-3726 branch from b60b7fb to 854c482 Compare April 26, 2017 16:37

alopresto reviewed Apr 26, 2017

View reviewed changes

alopresto reviewed Apr 27, 2017

View reviewed changes

trixpan force-pushed the NIFI-3726 branch 2 times, most recently from 5168d62 to 8cd23e8 Compare April 28, 2017 13:38

mattyb149 requested changes May 2, 2017

View reviewed changes

trixpan added 5 commits May 3, 2017 06:56

NIFI-3726 - Introduces CompareFuzzyHash processor

35dc5b8

- Abstract FuzzyhashContent to reduce a bit of code duplication

NIFI-3726 - Attempt to address peer review comments

9c57a4d

NIFI-3726 - Addresses additional PR comments

e075843

NIFI-3726 - Fix checkstyle violations

8ba4d15

NIFI-3726 - Fix issues with POMs

7f649bb

trixpan force-pushed the NIFI-3726 branch from e0076ca to 7f649bb Compare May 2, 2017 21:06

NIFI-3726 - Fixes the logging level to error in case of failure due t…

00d45a8

…o message too short

trixpan force-pushed the NIFI-3726 branch 4 times, most recently from 86d97be to 5db5cf5 Compare May 3, 2017 12:28

NIFI-3726 - Capture ArrayIndexOutOfBoundsException when consuming a h…

9ba4898

…ash source that matches but lacks a separator AND filename or matches but lacks a filename(i.e. ends with trailing separator)

trixpan force-pushed the NIFI-3726 branch from 5db5cf5 to 9ba4898 Compare May 3, 2017 12:34

asfgit closed this in 54d47c7 May 3, 2017

trixpan deleted the NIFI-3726 branch May 15, 2017 13:25

		@@ -110,24 +107,25 @@
		public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder()
		.name("MATCHING_MODE")

NIFI-3726 - Introduces CompareFuzzyHash processor #1692

NIFI-3726 - Introduces CompareFuzzyHash processor #1692

Conversation

trixpan commented Apr 25, 2017

For all changes:

For code changes:

For documentation related changes:

Note:

trixpan commented Apr 25, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alopresto Apr 25, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

trixpan commented Apr 26, 2017

alopresto commented Apr 26, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alopresto Apr 26, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alopresto Apr 26, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alopresto commented May 2, 2017

mattyb149 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alopresto commented May 2, 2017

trixpan commented May 2, 2017

trixpan commented May 2, 2017

trixpan commented May 2, 2017

trixpan commented May 2, 2017

mattyb149 commented May 3, 2017

alopresto Apr 25, 2017 •

edited

alopresto Apr 26, 2017 •

edited

alopresto Apr 26, 2017 •

edited