NIFI-6047 Add DeduplicateRecords (combines 6047 and 6014) #4646
@adamfisher FYSA
@mattyb149 thanks for taking a look. I'll try to carve out some time in the evening to address.
We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed this PR will be closed. This stale marker and eventual auto close does not indicate a judgement of the PR just lack of reviewer bandwidth and helps us keep the PR queue more manageable. If you would like this PR re-opened you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth which in turn helps yours.
I would prefer not to see this go away. I put a lot of time into making it generic and robust. Really I just ran into problems near the end when I had to get it merged in properly and Mike is the git ninja. I thought this was almost across the finish line?
@adamfisher that was an automated message
I think it is close to being done, but it fell off everyone's radars (including @adamfisher). @ottobackwards you can pick up the review now if @mattyb149 is low on time :-D
Left a couple of small comments. Also, for maintainability, it would be nice to have some comments and javadoc in the processor describing the overall logic/process and what the methods do and return.
Thanks @ottobackwards. I'll try to make some time to knock these out tomorrow.
If there is something I can do to get this across the finish line I would love to help. I'm just not familiar with the roadblocks I faced in the git process. I put a lot of work into it and I think it would be a very useful processor block for deduping records.
Revisiting this. @ottobackwards @mattyb149 @exceptionfactory going to work through any remaining comments and see if we can close this out this week.
Thank you so much @MikeThomsen! Excited to have this processor block see the light of day. 🌞 🌞 🌞
DeduplicateRecords_TestWithCassandraDMC.xml.txt Here is a test flow.
@mattyb149 @exceptionfactory @ottobackwards I think we're good now. See the attached template and steps to set up Cassandra in Docker to quickly test.
Using a DistributedMapCache client & server (rather than Cassandra, just to be different), I couldn't get any records/flowfiles on the duplicate relationship after sending 2 of the same flowfile into the DeduplicateRecords processor (with the config from your template above, Multiple Files e.g.). I had to enable the rest of the flow so the cache values would be populated by the PutDistributedMapCache processor. Is that intended? If so the docs should reflect that and if not, the processor itself should handle writing the cache identifier so no additional processor is needed, or perhaps it would at least be configurable.
Thanks for revisiting this pull request @MikeThomsen. Coming back to this after some time, I noted a few additional issues related to message digest handling and options, as well as a few other minor things related to logging and testing. Others may be able to provide additional feedback on general functionality, and I can take another look soon.
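For readers unfamiliar with the message digest concern raised here, the following is a minimal sketch of the general technique of hashing record field values into a deduplication key. It is not the processor's code: the class name `RecordDigestSketch`, the `digest` helper, and the choice of a separator byte are assumptions made for illustration. It uses only `java.security.MessageDigest` and avoids Java 11 APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of NiFi: hashes selected field values into a hex key.
final class RecordDigestSketch {

    static String digest(List<String> fieldValues, String algorithm) {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            // Surface a bad algorithm name as a configuration error.
            throw new IllegalArgumentException("Unsupported digest algorithm: " + algorithm, e);
        }
        for (String value : fieldValues) {
            md.update(String.valueOf(value).getBytes(StandardCharsets.UTF_8));
            // Separator byte (an assumption) so ["ab", "c"] and ["a", "bc"] hash differently.
            md.update((byte) 0x1F);
        }
        // Hex-encode without relying on newer JDK helpers.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String first = digest(Arrays.asList("John", "Smith"), "SHA-256");
        String second = digest(Arrays.asList("John", "Smith"), "SHA-256");
        System.out.println(first.equals(second)); // true: same field values give the same key
    }
}
```

Passing the algorithm name as a parameter mirrors the idea of making the digest configurable; which algorithms the processor actually exposes is not stated in this thread.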
Good feedback. Will incorporate.
I'll update the docs. We can't have this processor update the DMC itself, because it will often sit upstream of the operations that need to complete before a DMC entry is written.
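A minimal sketch of the behavior discussed above, under stated assumptions: an in-memory `Set` stands in for the distributed map cache, and the names `ReadOnlyCacheCheckSketch`, `isDuplicate`, and `markProcessed` are hypothetical, not NiFi APIs. The deduplication step only reads the cache; a separate later step writes the key once downstream work has finished.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a flow where the dedup check never writes to the cache.
final class ReadOnlyCacheCheckSketch {

    // Stand-in for the distributed map cache (assumption for illustration).
    private final Set<String> cache = new HashSet<>();

    // Dedup step: lookup only, no write.
    boolean isDuplicate(String key) {
        return cache.contains(key);
    }

    // Later step (e.g. a separate cache-writing processor) after downstream work succeeds.
    void markProcessed(String key) {
        cache.add(key);
    }

    public static void main(String[] args) {
        ReadOnlyCacheCheckSketch flow = new ReadOnlyCacheCheckSketch();
        System.out.println(flow.isDuplicate("order-42")); // false
        System.out.println(flow.isDuplicate("order-42")); // false: nothing has written the key yet
        flow.markProcessed("order-42");
        System.out.println(flow.isDuplicate("order-42")); // true
    }
}
```

This is why two identical flowfiles sent before the cache-writing step runs both come out as non-duplicates in this model.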
Most of the things I found were with non-ideal-path situations such as operator error :) , CS failures, etc. Otherwise this is looking and working well, it's getting close!
@exceptionfactory @ottobackwards @mattyb149 should be good to go now.
@exceptionfactory @mattyb149 any chance we could close this out?
Please don't use the plural form in processor names; this should be DeduplicateRecord.
Added NiFi DetectDuplicateRecord standard processor. Adding some documentation and PR review tweaks. Exposing processor Documentation updates, exception handling consolidation, added support for record path field variables. Added tests. Build bump. Migrated cache service to groovy folder. Moved declarations for properties to @BeforeClass lifecycle method. Adding some documentation and PR review tweaks. Documentation updates, exception handling consolidation, added support for record path field variables. Added tests. Build bump. Migrated cache service to groovy folder. Fixed variable type bug. Fixed mapping of test params to usage. Fixed potential illegal state exception bug.
Removed DMC. NIFI-6047 Started integrating changes from NIFI-6014. NIFI-6047 Added DMC tests. NIFI-6047 Added cache identifier recordpath test. NIFI-6047 Added additional details. NIFI-6047 Removed old additional details. NIFI-6047 made some changes requested in a follow up review. NIFI-6047 latest.
@joewitt addressed your request.
+1 LGTM, tried with various happy and non-happy scenarios, verified the expected results. Thanks for sticking with this new feature! Merging to main.
Removed DMC. NIFI-6047 Started integrating changes from NIFI-6014. NIFI-6047 Added DMC tests. NIFI-6047 Added cache identifier recordpath test. NIFI-6047 Added additional details. NIFI-6047 Removed old additional details. NIFI-6047 made some changes requested in a follow up review. NIFI-6047 latest. NIFI-6047 Finished updates First round of code review cleanup Latest Removed EL from the dynamic properties. Finished code review requested refactoring. Checkstyle fix. Removed a Java 11 API NIFI-6047 Renamed processor to DeduplicateRecord Signed-off-by: Matthew Burgess <mattyb149@apache.org> This closes apache#4646
Thank you for submitting a contribution to Apache NiFi.
Please provide a short description of the PR here:
Description of PR
Enables X functionality; fixes bug NIFI-YYYY.
In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.
For code changes:
mvn -Pcontrib-check clean install at the root nifi folder?
LICENSE file, including the main LICENSE file under nifi-assembly?
NOTICE file, including the main NOTICE file found under nifi-assembly?
.displayName in addition to .name (programmatic access) for each of the new properties?
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.