NIFI-5147 Implement CalculateAttributeHash processor #2980

Closed
wants to merge 20 commits into from

Conversation

alopresto
Contributor

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

ottobackwards and others added 20 commits August 30, 2018 17:37
- added properties to control behavior when attributes that are configured are partially or completely missing
- set charset with a property
- added tests
- Cleaned up typos and descriptions.
- Added unit test demonstrating missing Blake2 algorithm.
- Unit test for all default test vectors passes.
@alopresto
Contributor Author

This encapsulates the changes @ottobackwards made in PR 2836, but also:

  • Adds the SHA-224, SHA-512/224, SHA-512/256, SHA-3 (SHA3-224, SHA3-256, SHA3-384, SHA3-512), and BLAKE2 (BLAKE2-160, BLAKE2-256, BLAKE2-384, BLAKE2-512) functions
  • Moves the hashing functionality into an enum and service which can be reused by HashContent
  • Clearly marks cryptographically broken algorithms as such
  • Adds unit tests

I will open follow-on issues to:

  1. Add documentation to HashAttribute to explain the different scenarios where these processors are used
  2. Refactor HashContent to use the HashService

* SHA-512 (SHA2)
* SHA-512/224 (SHA2)
* SHA-512/256 (SHA2)
* SHA3-256
Contributor Author

Add SHA3-224 and Blake2b-160.

@ottobackwards (Contributor) left a comment

This looks great, just a couple of comments

public boolean isStrongAlgorithm() {
    return (!BROKEN_ALGORITHMS.contains(name));
}
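The quoted check can be derived from a single set of known-broken algorithm names held alongside the rest of the enum's metadata. A minimal sketch of that shape, with hypothetical constant names and values (this is not the actual NiFi source):

```java
import java.util.Set;

// Hypothetical sketch, not the NiFi source: an enum carrying algorithm
// metadata, with the cryptographically broken algorithms listed once so
// isStrongAlgorithm() is derived rather than hard-coded per constant.
enum HashAlgorithm {
    MD5("MD5", 16),
    SHA1("SHA-1", 20),
    SHA256("SHA-256", 32),
    SHA512("SHA-512", 64);

    // MD5 and SHA-1 have practical collision attacks, so they are marked broken
    private static final Set<String> BROKEN_ALGORITHMS = Set.of("MD5", "SHA-1");

    private final String name;
    private final int digestBytesLength;

    HashAlgorithm(String name, int digestBytesLength) {
        this.name = name;
        this.digestBytesLength = digestBytesLength;
    }

    public String getName() {
        return name;
    }

    public int getDigestBytesLength() {
        return digestBytesLength;
    }

    public boolean isStrongAlgorithm() {
        return !BROKEN_ALGORITHMS.contains(name);
    }
}
```

Keeping the broken-algorithm list in one place means adding a new constant never silently inherits "strong" status by accident of a missing override.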

Contributor

What is the isBlake2 check about? Is there a way to make it more general? It seems strange to call out by the name as opposed to the "why"

Contributor Author

The Blake2 implementations need BouncyCastle and use different API calls than the other MessageDigest instances.
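To illustrate why the dispatch is keyed on the algorithm name: the JDK's bundled security providers do not include BLAKE2, so those digests cannot be obtained through `MessageDigest.getInstance()` and need a separate code path. A hedged sketch of that branching (hypothetical class and method names; the BouncyCastle call is shown only as a comment since the dependency is not included here):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the dispatch under discussion, not the actual
// HashService code. BLAKE2 is absent from the standard JCA providers, so
// it cannot share the MessageDigest path used by every other algorithm.
class HashDispatchSketch {

    static String hash(String algorithmName, String value) {
        if (algorithmName.startsWith("BLAKE2")) {
            // BouncyCastle path (assumed; dependency and wiring not shown).
            // Its API differs from MessageDigest:
            //   Blake2bDigest digest = new Blake2bDigest(256);
            //   digest.update(bytes, 0, bytes.length);
            //   digest.doFinal(out, 0);
            throw new UnsupportedOperationException("Requires BouncyCastle");
        }
        try {
            // All other algorithms resolve through the standard JCA API
            byte[] digest = MessageDigest.getInstance(algorithmName)
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown algorithm: " + algorithmName, e);
        }
    }
}
```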

        return traditionalHash(algorithm, value);
    }
}

Contributor

Could we put this functionality in the enum and simplify this class? Having the specialization there?

Contributor Author

I don't think it makes sense to move the execution logic into the enum. The enum is there to capture metadata about the acceptable values, while the logic is independent from that selection.

@thenatog
Contributor

thenatog commented Sep 7, 2018

Reviewing...

@thenatog
Contributor

thenatog commented Sep 7, 2018

Tested out the HashAttribute processor. This all worked fine:

  • MD5 and creating a new attribute
  • MD5 and overwriting the attribute with hashed value
  • SHA256 and creating a new attribute
  • MD5 of Chinese characters using UTF-8 (matched a web-based hasher and the command-line md5 utility)

UTF-16 is where I came unstuck:

  • MD5 of a simple string using "UTF-16" encoding gives a different hash from what I expect.
  • MD5 of the same string using "UTF-16BE" and "UTF-16LE" encodings DOES match what I expect.

Test input string in all cases: “hehe”

NiFi CalculateAttributeHash:
UTF-8:MD5 = 529ca8050a00180790cf88b63468826a
UTF-16BE:MD5 = b0ed26b524e0b0606551d78e42b5b7bc
UTF-16LE:MD5 = 2db0ecc27f7abd29ba95412feb3b5e07
UTF-16:MD5 = 9b6dcd3887ebdb43d66fb4b3ef9c259b

CyberChef (https://gchq.github.io/CyberChef/#recipe=Encode_text('UTF16BE%20(1201)')MD5()&input=aGVoZQ):
UTF-8:MD5 = 529ca8050a00180790cf88b63468826a
UTF-16BE:MD5 = b0ed26b524e0b0606551d78e42b5b7bc
UTF-16LE:MD5 = 2db0ecc27f7abd29ba95412feb3b5e07

I found that “UTF-16” differs because, when encoding, Java adds a big-endian BOM: “When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.” As expected, adding the BOM changes the output bytes that are then hashed, resulting in a different hash from the “UTF-16BE” encoding. Is this a problem, or is it simply expected behavior? That is, should the user realize that the “UTF-16” and “UTF-16BE” encodings will produce different hashes?
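The BOM behavior described above can be reproduced in a few lines of plain Java. The class name below is hypothetical; the UTF-16BE digest matches the value reported for "hehe" earlier in this thread:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Demonstrates the discrepancy: Java's "UTF-16" charset prepends a
// big-endian byte-order mark (0xFE 0xFF) when encoding, so the hashed
// bytes differ from plain "UTF-16BE" even for the same input string.
class BomDemo {

    static String md5Hex(byte[] input) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("MD5").digest(input)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is mandated in every JRE
        }
    }

    public static void main(String[] args) {
        byte[] utf16 = "hehe".getBytes(StandardCharsets.UTF_16);      // BOM + payload
        byte[] utf16be = "hehe".getBytes(StandardCharsets.UTF_16BE);  // payload only

        // The two extra leading bytes in the UTF-16 form are the BOM
        System.out.printf("UTF-16 length=%d, UTF-16BE length=%d%n",
                utf16.length, utf16be.length);
        System.out.printf("Leading bytes: %02x %02x%n", utf16[0], utf16[1]);

        // Different input bytes, therefore different MD5 digests
        System.out.println("UTF-16   MD5: " + md5Hex(utf16));
        System.out.println("UTF-16BE MD5: " + md5Hex(utf16be));
    }
}
```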

@alopresto
Contributor Author

Thanks for discovering this, @thenatog. This is an excellent catch.

I've added behavior to catch this, better documentation, and unit tests. However, I added them on the branch that includes PR 2983. Let's mark this PR as closed and just review the other one, as it is more complete and addresses this issue.

2018-09-07 21:21:19,784 WARN [Timer-Driven Process Thread-6] o.a.n.security.util.crypto.HashService The charset provided was UTF-16, but Java will insert a Big Endian BOM in the decoded message before hashing, so switching to UTF-16BE
2018-09-07 21:21:19,797 INFO [Timer-Driven Process Thread-9] o.a.n.processors.standard.LogAttribute LogAttribute[id=b15f3209-344d-10a6-4a7b-454530bb72fc] logging for flow file StandardFlowFileRecord[uuid=a4a223fb-aa11-43b9-93a3-d7675c44593c,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1536378604366-1, container=default, section=1], offset=56, length=4],offset=0,name=33467912436349,size=4]
--------------------[SUCCESS] --------------------
Standard FlowFile Attributes
Key: 'entryDate'
	Value: 'Fri Sep 07 21:21:19 PDT 2018'
Key: 'lineageStartDate'
	Value: 'Fri Sep 07 21:21:19 PDT 2018'
Key: 'fileSize'
	Value: '4'
FlowFile Attribute Map Content
Key: 'filename'
	Value: '33467912436349'
Key: 'path'
	Value: './'
Key: 'test_attribute'
	Value: 'hehe'
Key: 'test_attribute_md5_utf16le'
	Value: '2db0ecc27f7abd29ba95412feb3b5e07'
Key: 'uuid'
	Value: 'a4a223fb-aa11-43b9-93a3-d7675c44593c'
--------------------[SUCCESS] --------------------
hehe
2018-09-07 21:21:19,799 INFO [Timer-Driven Process Thread-9] o.a.n.processors.standard.LogAttribute LogAttribute[id=b15f3209-344d-10a6-4a7b-454530bb72fc] logging for flow file StandardFlowFileRecord[uuid=b7459e40-500b-488d-a0dc-3e09ebc6b86e,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1536378604366-1, container=default, section=1], offset=56, length=4],offset=0,name=33467912436349,size=4]
--------------------[SUCCESS] --------------------
Standard FlowFile Attributes
Key: 'entryDate'
	Value: 'Fri Sep 07 21:21:19 PDT 2018'
Key: 'lineageStartDate'
	Value: 'Fri Sep 07 21:21:19 PDT 2018'
Key: 'fileSize'
	Value: '4'
FlowFile Attribute Map Content
Key: 'filename'
	Value: '33467912436349'
Key: 'path'
	Value: './'
Key: 'test_attribute'
	Value: 'hehe'
Key: 'test_attribute_md5_utf16'
	Value: 'b0ed26b524e0b0606551d78e42b5b7bc'
Key: 'uuid'
	Value: 'b7459e40-500b-488d-a0dc-3e09ebc6b86e'
--------------------[SUCCESS] --------------------
hehe
2018-09-07 21:21:19,801 INFO [Timer-Driven Process Thread-9] o.a.n.processors.standard.LogAttribute LogAttribute[id=b15f3209-344d-10a6-4a7b-454530bb72fc] logging for flow file StandardFlowFileRecord[uuid=25c5d1b1-faa4-418d-911c-5c0cea399b83,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1536378604366-1, container=default, section=1], offset=56, length=4],offset=0,name=33467912436349,size=4]
--------------------[SUCCESS] --------------------
Standard FlowFile Attributes
Key: 'entryDate'
	Value: 'Fri Sep 07 21:21:19 PDT 2018'
Key: 'lineageStartDate'
	Value: 'Fri Sep 07 21:21:19 PDT 2018'
Key: 'fileSize'
	Value: '4'
FlowFile Attribute Map Content
Key: 'filename'
	Value: '33467912436349'
Key: 'path'
	Value: './'
Key: 'test_attribute'
	Value: 'hehe'
Key: 'test_attribute_md5_utf16be'
	Value: 'b0ed26b524e0b0606551d78e42b5b7bc'
Key: 'uuid'
	Value: '25c5d1b1-faa4-418d-911c-5c0cea399b83'
--------------------[SUCCESS] --------------------
hehe

@alopresto alopresto closed this Sep 8, 2018