Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-3726 - Introduces CompareFuzzyHash processor #1692

Closed
wants to merge 7 commits into from

Conversation

trixpan
Copy link
Contributor

@trixpan trixpan commented Apr 25, 2017

      - Abstract FuzzyhashContent to reduce a bit of code
        duplication

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • ~~~If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?~~~
  • ~~~If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?~~~
  • ~~~If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?~~~
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@trixpan
Copy link
Contributor Author

trixpan commented Apr 25, 2017

@apiri and now the comparison processor!!

For the record: I tried to illustrate the potential of the processor by using two of the NiFi own 'pom.xml' and displaying how despite lack of 1 to 1 match, the processor is still capable of detecting similarities on both cases.

"appending an attribute to the FlowFile in case of a successful match.")

@WritesAttributes({
@WritesAttribute(attribute = "XXXX.N.match", description = "The match that ressambles the attribute specified " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: ressambles -> resembles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@WritesAttribute(attribute = "XXXX.N.match", description = "The match that ressambles the attribute specified " +
"by the <Hash Attribute Name> property. Note that: 'XXX' gets replaced with the <Hash Attribute Name>"),
@WritesAttribute(attribute = "XXXX.N.similarity", description = "The similarity score between this flowfile" +
"and the its match of the same number N. Note that: 'XXX' gets replaced with the <Hash Attribute Name>")})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: the its -> its.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder()
.name("MATCHING_MODE")
.displayName("Matching mode")
.description("The ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably meant to build out the description field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

.build();

public static final Relationship REL_NON_MATCH = new Relationship.Builder()
.name("non-match")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be consistent with the naming style of the relationship fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. renamed to "found", "not found" as per GeoEnrichIP and also added "failure"

}

Digest inputDigest = null;
SpamSum spamSum = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there are definite similarities between the two algorithms so I understand why the method is structured this way, but it seems to interleave the logic a bit much for me. If we added a third implementation, I could see this getting very convoluted. My recommendation is to encapsulate each of the algorithms into their own private processing method (better yet, a shared interface and library implementations) that have minimal shared code (file loading, etc.) and can be invoked the same way. I am happy to help with refactoring if necessary.

Copy link
Contributor

@alopresto alopresto Apr 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, thinking about exposing the common interface would allow for a method boolean matchExceedsThreshold(double match, double matchThreshold) which avoids the "some hashes use higher numbers to indicate similarity, some use lower" issue as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes total sense. Will refactor

@trixpan
Copy link
Contributor Author

trixpan commented Apr 26, 2017

@alopresto - I put together the interface based approach and indeed it seems a much cleaner code so thank you for pointing me into this direction.

The code has not been optimised and still has a few sharp edges around String split that must be solved before merging but I would appreciate if you could confirm I am moving in the right direction.

Once again, thank you for the suggestion. It is certainly a great chance to try to do things "properly".

Cheers

@alopresto
Copy link
Contributor

Thanks @trixpan . I will continue to review but I appreciate your efforts in response to my comments.

@@ -110,24 +107,25 @@
public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder()
.name("MATCHING_MODE")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed this one earlier. I think the style elsewhere is skinny snake case - matching-mode.

@@ -110,24 +107,25 @@
public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder()
.name("MATCHING_MODE")
.displayName("Matching mode")
.description("The ")
.description("Defines if the Processor should try to match as many entries as possible (" + multiMatch.getDisplayName() +
") or if it should stio after the first match (" + singleMatch.getDisplayName() + ")")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: should stio -> should stop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

public static final Relationship REL_NON_MATCH = new Relationship.Builder()
.name("non-match")
public static final Relationship REL_NOT_FOUND = new Relationship.Builder()
.name("not found")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we try to avoid spaces in names: not-found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like good practice to me. Fixed

}
fuzzyHashMatcher = new SSDeepHashMatcher(getLogger());
break;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a default case to handle no algorithm provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

break;
}

if (fuzzyHashMatcher.isValidHash(inputHash) == false) {
Copy link
Contributor

@alopresto alopresto Apr 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the algorithm doesn't match one of the provided options, fuzzyHashMatcher will be null here. In addition, the condition can just be if (!fuzzyHashMatcher.isValidHash(inputHash)) {.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

introduced a default to switch(algorithm) so that we don't get here with null anymore.

FileInputStream fileInputStream = new FileInputStream(file);
BufferedReader reader = new BufferedReader(new InputStreamReader(fileInputStream));

// If SSdeep skip the first line (as the usual format used by other tools add a header line
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the details of the file-parsing should be delegated & encapsulated by the implementations. A method in the interface like Map<String, Double> getMatches(File definitionFile); would then hide the various line split delimiters, etc. If you want to keep the loop logic here in the main processor, at least implement void prepareReader(BufferedReader reader); (no-op in most, skip a line in SSDeep), and String getHashToCompare(String line); -- the general idea is to remove the need for switch statements to handle custom per-implementation logic from this level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense. This should help in case we happen to have a non-file based source later on in the future.

Will adjust

for (Map.Entry<String, Double> entry : matched.entrySet()) {
// defining attributes accordingly
attributes.put(
context.getProperty(ATTRIBUTE_NAME).getValue() + "." + x + ".match",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Extract Variable refactor on context.getProperty(ATTRIBUTE_NAME).getValue() + "." + x to avoid multiple calls to two external objects and repeated String concatenation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow... wasn't aware of the refactoring hotkey for this. Super thanks!


// Check if content matches minimum length requirement
if (context.getProperty(HASH_ALGORITHM).equals(allowableValueTLSH) && flowFile.getSize() < 512 ) {

if (checkMinimumAlgorithmRequirements(algorithm, flowFile) == false) {
logger.info("The content of {} is smaller than the minimum required by TLSH, routing to failure", new Object[]{flowFile});
Copy link
Contributor

@alopresto alopresto Apr 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the value of algorithm instead of hardcoded TLSH in the message, and we probably don't want to write the entire flowfile content to the log -- use the unique ID for traceability instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -178,11 +161,11 @@ public void process(final InputStream in) throws IOException {
try (ByteArrayOutputStream holder = new ByteArrayOutputStream()) {
StreamUtils.copy(in,holder);

if (context.getProperty(HASH_ALGORITHM).getValue().equals(allowableValueSSDEEP.getValue())) {
if (algorithm.equals(allowableValueSSDEEP.getValue())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can probably be extracted to the interface as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to the same class as the checkMinimumRequirements. wonder if we should create a separate class just to hold those help functions?

@trixpan trixpan force-pushed the NIFI-3726 branch 2 times, most recently from 5168d62 to 8cd23e8 Compare April 28, 2017 13:38
@alopresto
Copy link
Contributor

Will finish reviewing and merge shortly.

Copy link
Contributor

@mattyb149 mattyb149 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trixpan I didn't get far with testing this one, there are a couple of issues I ran into beforehand, related to #1619. The NAR isn't included in the assembly POM or the top-level POM, so it doesn't get into a full build. Also I think FuzzyHashContent needs some work, the relationships are capitalized (minor) but I sent a flow file with "Hello world!" in it to FHC with TLSH and it routes the flow file to failure with no bulletin or error message. Are you ok with leaving this one out of the 1.2.0 release so we can spend some more time on the NAR?

@@ -73,31 +70,14 @@
"evaluations in memory. Accordingly, it is important to consider the anticipated profile of content being " +
"evaluated by this processor and the hardware supporting it especially when working against large files.")

@SeeAlso({HashContent.class})
@SeeAlso({HashContent.class, CompareFuzzyHash.class})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refers to a processor in another NAR, which won't be able to be found by the framework. I'll remove the reference to HashContent (and the standard processors dependency in the POM) on merge

@@ -73,31 +70,14 @@
"evaluations in memory. Accordingly, it is important to consider the anticipated profile of content being " +
"evaluated by this processor and the hardware supporting it especially when working against large files.")

@SeeAlso({HashContent.class})
@SeeAlso({HashContent.class, CompareFuzzyHash.class})
@ReadsAttributes({@ReadsAttribute(attribute="", description="")})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one's empty, will remove before merge

@alopresto
Copy link
Contributor

@trixpan I have been reviewing this along with @mattyb149 and @bbende and we believe there may be a dependency issue here. I had only reviewed the code design via GitHub but upon trying to build the application, Jetty startup fails with:

2017-05-02 15:01:28,815 WARN [main] org.apache.nifi.web.server.JettyServer Failed to start web server... shutting down.
java.lang.ArrayStoreException: sun.reflect.annotation.TypeNotPresentExceptionProxy
        at sun.reflect.annotation.AnnotationParser.parseClassArray(AnnotationParser.java:724)
        at sun.reflect.annotation.AnnotationParser.parseArray(AnnotationParser.java:531)
        at sun.reflect.annotation.AnnotationParser.parseMemberValue(AnnotationParser.java:355)
        at sun.reflect.annotation.AnnotationParser.parseAnnotation2(AnnotationParser.java:286)
        at sun.reflect.annotation.AnnotationParser.parseAnnotations2(AnnotationParser.java:120)
        at sun.reflect.annotation.AnnotationParser.parseAnnotations(AnnotationParser.java:72)
        at java.lang.Class.createAnnotationData(Class.java:3521)
        at java.lang.Class.annotationData(Class.java:3510)
        at java.lang.Class.getAnnotation(Class.java:3415)
        at java.lang.reflect.AnnotatedElement.isAnnotationPresent(AnnotatedElement.java:258)
        at java.lang.Class.isAnnotationPresent(Class.java:3425)
        at org.apache.nifi.nar.ExtensionManager.checkControllerServiceReferenceEligibility(ExtensionManager.java:189)
        at org.apache.nifi.nar.ExtensionManager.loadExtensions(ExtensionManager.java:160)
        at org.apache.nifi.nar.ExtensionManager.discoverExtensions(ExtensionManager.java:113)
        at org.apache.nifi.web.server.JettyServer.start(JettyServer.java:689)
        at org.apache.nifi.NiFi.<init>(NiFi.java:160)
        at org.apache.nifi.NiFi.main(NiFi.java:267)

After further discussion, we believe this is related to the added dependency on nifi-standard-processors in the processor NAR pom.xml, which should not be there.

I think this needs another iteration to ensure all the wrinkles are ironed out. I don't believe this was requested by the community for this release, so as long as you can continue to get value from it locally, I'll work with you to get it into 1.3.0 but I suggest moving it from this release. Is that ok with you?

@trixpan
Copy link
Contributor Author

trixpan commented May 2, 2017

Andy, easy to solve

@trixpan
Copy link
Contributor Author

trixpan commented May 2, 2017

@alopresto @mattyb149 last two commits should address the two residual issues you identified.

One was a simple reference to HashContent.class instead of the fully qualified class path as string.

The other one was the use of .info in message too short when it should be error.

@trixpan
Copy link
Contributor Author

trixpan commented May 2, 2017

pom has also been fixed

@trixpan
Copy link
Contributor Author

trixpan commented May 2, 2017

@mattyb149 just to register: TLSH requires more than 512 characters to work so your hello message triggers a safety mechanism that routes the flow to error.

I should have used error but I think I changed this to info during the peer review

Should be fixed now

image

@trixpan trixpan force-pushed the NIFI-3726 branch 4 times, most recently from 86d97be to 5db5cf5 Compare May 3, 2017 12:28
…ash source

            that matches but lacks a separator AND filename or matches but lacks
            a filename(i.e. ends with trailing separator)
@mattyb149
Copy link
Contributor

+1 LGTM, ran unit tests, verified the processors are in the assembly, and ran some tests on a live NiFi. Thanks for the contribution, this is good stuff! Merging to master

@asfgit asfgit closed this in 54d47c7 May 3, 2017
@trixpan trixpan deleted the NIFI-3726 branch May 15, 2017 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants