-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-3726 - Introduces CompareFuzzyHash processor #1692
Conversation
@apiri and now the comparison processor!! For the record: I tried to illustrate the potential of the processor by using two of the NiFi own 'pom.xml' and displaying how despite lack of 1 to 1 match, the processor is still capable of detecting similarities on both cases. |
"appending an attribute to the FlowFile in case of a successful match.") | ||
|
||
@WritesAttributes({ | ||
@WritesAttribute(attribute = "XXXX.N.match", description = "The match that ressambles the attribute specified " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: ressambles -> resembles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@WritesAttribute(attribute = "XXXX.N.match", description = "The match that ressambles the attribute specified " + | ||
"by the <Hash Attribute Name> property. Note that: 'XXX' gets replaced with the <Hash Attribute Name>"), | ||
@WritesAttribute(attribute = "XXXX.N.similarity", description = "The similarity score between this flowfile" + | ||
"and the its match of the same number N. Note that: 'XXX' gets replaced with the <Hash Attribute Name>")}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: the its -> its.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder() | ||
.name("MATCHING_MODE") | ||
.displayName("Matching mode") | ||
.description("The ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably meant to build out the description field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
.build(); | ||
|
||
public static final Relationship REL_NON_MATCH = new Relationship.Builder() | ||
.name("non-match") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be consistent with the naming style of the relationship fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. renamed to "found", "not found" as per GeoEnrichIP and also added "failure"
} | ||
|
||
Digest inputDigest = null; | ||
SpamSum spamSum = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there are definite similarities between the two algorithms so I understand why the method is structured this way, but it seems to interleave the logic a bit much for me. If we added a third implementation, I could see this getting very convoluted. My recommendation is to encapsulate each of the algorithms into their own private
processing method (better yet, a shared interface and library implementations) that have minimal shared code (file loading, etc.) and can be invoked the same way. I am happy to help with refactoring if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, thinking about exposing the common interface would allow for a method boolean matchExceedsThreshold(double match, double matchThreshold)
which avoids the "some hashes use higher numbers to indicate similarity, some use lower" issue as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes total sense. Will refactor
@alopresto - I put together the interface based approach and indeed it seems a much cleaner code so thank you for pointing me into this direction. The code has not been optimised and still has a few sharp edges around String split that must be solved before merging but I would appreciate if you could confirm I am moving in the right direction. Once again, thank you for the suggestion. It is certainly a great chance to try to do things "properly". Cheers |
Thanks @trixpan . I will continue to review but I appreciate your efforts in response to my comments. |
@@ -110,24 +107,25 @@ | |||
public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder() | |||
.name("MATCHING_MODE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed this one earlier. I think the style elsewhere is skinny snake case - matching-mode
.
@@ -110,24 +107,25 @@ | |||
public static final PropertyDescriptor MATCHING_MODE = new PropertyDescriptor.Builder() | |||
.name("MATCHING_MODE") | |||
.displayName("Matching mode") | |||
.description("The ") | |||
.description("Defines if the Processor should try to match as many entries as possible (" + multiMatch.getDisplayName() + | |||
") or if it should stio after the first match (" + singleMatch.getDisplayName() + ")") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: should stio -> should stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
public static final Relationship REL_NON_MATCH = new Relationship.Builder() | ||
.name("non-match") | ||
public static final Relationship REL_NOT_FOUND = new Relationship.Builder() | ||
.name("not found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we try to avoid spaces in name
s: not-found
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds like good practice to me. Fixed
} | ||
fuzzyHashMatcher = new SSDeepHashMatcher(getLogger()); | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be a default
case to handle no algorithm
provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
break; | ||
} | ||
|
||
if (fuzzyHashMatcher.isValidHash(inputHash) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the algorithm
doesn't match one of the provided options, fuzzyHashMatcher
will be null
here. In addition, the condition can just be if (!fuzzyHashMatcher.isValidHash(inputHash)) {
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
introduced a default
to switch(algorithm)
so that we don't get here with null anymore.
FileInputStream fileInputStream = new FileInputStream(file); | ||
BufferedReader reader = new BufferedReader(new InputStreamReader(fileInputStream)); | ||
|
||
// If SSdeep skip the first line (as the usual format used by other tools add a header line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the details of the file-parsing should be delegated & encapsulated by the implementations. A method in the interface like Map<String, Double> getMatches(File definitionFile);
would then hide the various line split delimiters, etc. If you want to keep the loop logic here in the main processor, at least implement void prepareReader(BufferedReader reader);
(no-op in most, skip a line in SSDeep), and String getHashToCompare(String line);
-- the general idea is to remove the need for switch
statements to handle custom per-implementation logic from this level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense. This should help in case we happen to have a non-file based source later on in the future.
Will adjust
for (Map.Entry<String, Double> entry : matched.entrySet()) { | ||
// defining attributes accordingly | ||
attributes.put( | ||
context.getProperty(ATTRIBUTE_NAME).getValue() + "." + x + ".match", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Extract Variable refactor on context.getProperty(ATTRIBUTE_NAME).getValue() + "." + x
to avoid multiple calls to two external objects and repeated String concatenation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow... wasn't aware of the refactoring hotkey for this. Super thanks!
|
||
// Check if content matches minimum length requirement | ||
if (context.getProperty(HASH_ALGORITHM).equals(allowableValueTLSH) && flowFile.getSize() < 512 ) { | ||
|
||
if (checkMinimumAlgorithmRequirements(algorithm, flowFile) == false) { | ||
logger.info("The content of {} is smaller than the minimum required by TLSH, routing to failure", new Object[]{flowFile}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the value of algorithm
instead of hardcoded TLSH in the message, and we probably don't want to write the entire flowfile content to the log -- use the unique ID for traceability instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -178,11 +161,11 @@ public void process(final InputStream in) throws IOException { | |||
try (ByteArrayOutputStream holder = new ByteArrayOutputStream()) { | |||
StreamUtils.copy(in,holder); | |||
|
|||
if (context.getProperty(HASH_ALGORITHM).getValue().equals(allowableValueSSDEEP.getValue())) { | |||
if (algorithm.equals(allowableValueSSDEEP.getValue())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can probably be extracted to the interface as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added to the same class as the checkMinimumRequirements. wonder if we should create a separate class just to hold those help functions?
5168d62
to
8cd23e8
Compare
Will finish reviewing and merge shortly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trixpan I didn't get far with testing this one, there are a couple of issues I ran into beforehand, related to #1619. The NAR isn't included in the assembly POM or the top-level POM, so it doesn't get into a full build. Also I think FuzzyHashContent needs some work, the relationships are capitalized (minor) but I sent a flow file with "Hello world!" in it to FHC with TLSH and it routes the flow file to failure with no bulletin or error message. Are you ok with leaving this one out of the 1.2.0 release so we can spend some more time on the NAR?
@@ -73,31 +70,14 @@ | |||
"evaluations in memory. Accordingly, it is important to consider the anticipated profile of content being " + | |||
"evaluated by this processor and the hardware supporting it especially when working against large files.") | |||
|
|||
@SeeAlso({HashContent.class}) | |||
@SeeAlso({HashContent.class, CompareFuzzyHash.class}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refers to a processor in another NAR, which won't be able to be found by the framework. I'll remove the reference to HashContent (and the standard processors dependency in the POM) on merge
@@ -73,31 +70,14 @@ | |||
"evaluations in memory. Accordingly, it is important to consider the anticipated profile of content being " + | |||
"evaluated by this processor and the hardware supporting it especially when working against large files.") | |||
|
|||
@SeeAlso({HashContent.class}) | |||
@SeeAlso({HashContent.class, CompareFuzzyHash.class}) | |||
@ReadsAttributes({@ReadsAttribute(attribute="", description="")}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one's empty, will remove before merge
@trixpan I have been reviewing this along with @mattyb149 and @bbende and we believe there may be a dependency issue here. I had only reviewed the code design via GitHub but upon trying to build the application, Jetty startup fails with:
After further discussion, we believe this is related to the added dependency on I think this needs another iteration to ensure all the wrinkles are ironed out. I don't believe this was requested by the community for this release, so as long as you can continue to get value from it locally, I'll work with you to get it into 1.3.0 but I suggest moving it from this release. Is that ok with you? |
Andy, easy to solve |
- Abstract FuzzyhashContent to reduce a bit of code duplication
…o message too short
@alopresto @mattyb149 last two commits should address the two residual issues you identified. One was a simple reference to HashContent.class instead of the fully qualified class path as string. The other one was the use of .info in message too short when it should be error. |
pom has also been fixed |
@mattyb149 just to register: TLSH requires more than 512 characters to work so your hello message triggers a safety mechanism that routes the flow to error. I should have used error but I think I changed this to info during the peer review Should be fixed now |
86d97be
to
5db5cf5
Compare
…ash source that matches but lacks a separator AND filename or matches but lacks a filename(i.e. ends with trailing separator)
+1 LGTM, ran unit tests, verified the processors are in the assembly, and ran some tests on a live NiFi. Thanks for the contribution, this is good stuff! Merging to master |
Thank you for submitting a contribution to Apache NiFi.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.