Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-6047 Add DeduplicateRecords (combines 6047 and 6014) #4646

Closed
wants to merge 10 commits into from

Conversation

MikeThomsen
Copy link
Contributor

Thank you for submitting a contribution to Apache NiFi.

Please provide a short description of the PR here:

Description of PR

Enables X functionality; fixes bug NIFI-YYYY.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • Have you verified that the full build is successful on JDK 8?
  • Have you verified that the full build is successful on JDK 11?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.

@MikeThomsen
Copy link
Contributor Author

@adamfisher FYSA

@MikeThomsen
Copy link
Contributor Author

@mattyb149 thanks for taking a look. I'll try to carve out some time in the evening to address.

@github-actions
Copy link

We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed this PR will be closed. This stale marker and eventual auto close does not indicate a judgement of the PR just lack of reviewer bandwidth and helps us keep the PR queue more manageable. If you would like this PR re-opened you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth which in turn helps yours.

@github-actions github-actions bot added the Stale label Apr 15, 2021
@adamfisher
Copy link
Contributor

I would prefer not to see this go away. I put a lot of time into making it generic and robust. Really I just ran into problems near the end when I had to get it merged in properly and Mike is the git ninja. I thought this was almost across the finish line?

@ottobackwards
Copy link
Contributor

@adamfisher that was an automated message

@ottobackwards
Copy link
Contributor

@MikeThomsen?

@MikeThomsen
Copy link
Contributor Author

I think it is close to being done, but it fell off everyone's radars (including @adamfisher). @ottobackwards you can pick up the review now if @mattyb149 is low on time :-D

@ottobackwards
Copy link
Contributor

Left a couple of small comments. Also, I think for maintainability, it would be nice if there were some comments and javadoc in the processor, as to the overall logic/process, and what the methods are doing / returning.

@github-actions github-actions bot removed the Stale label Apr 16, 2021
@MikeThomsen
Copy link
Contributor Author

Thanks @ottobackwards. I'll try to make some time to knock these out tomorrow.

@github-actions
Copy link

We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed this PR will be closed. This stale marker and eventual auto close does not indicate a judgement of the PR just lack of reviewer bandwidth and helps us keep the PR queue more manageable. If you would like this PR re-opened you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth which in turn helps yours.

@github-actions github-actions bot added the Stale label Dec 10, 2021
@adamfisher
Copy link
Contributor

If there is something I can do to get this across the finish line I would love to help. I'm just not familiar with the roadblocks I faced in the git process. I put a lot of work into it and I think it would be a very useful processor block for deduping records.

@MikeThomsen
Copy link
Contributor Author

Revisiting this. @ottobackwards @mattyb149 @exceptionfactory going to work through any remaining comments and see if we can close this out this week.

@adamfisher
Copy link
Contributor

Revisiting this. @ottobackwards @mattyb149 @exceptionfactory going to work through any remaining comments and see if we can close this out this week.

Thank you so much @MikeThomsen! Excited to have this processor block see the light of day. 🌞 🌞 🌞

@MikeThomsen
Copy link
Contributor Author

DeduplicateRecords_TestWithCassandraDMC.xml.txt

Here is a test flow.

CREATE KEYSPACE nifi
      WITH REPLICATION = { 
       'class' : 'SimpleStrategy', 
       'replication_factor' : 1 
      };
use nifi;
create table map_cache (key blob, value blob, primary key(key));
docker run -p 9042:9042 --name nifi-cassandra cassandra:4

@MikeThomsen
Copy link
Contributor Author

@mattyb149 @exceptionfactory @ottobackwards I think we're good new. See the attacked template and steps to setup Cassandra in docker to quickly test.

@mattyb149
Copy link
Contributor

Using a DistributedMapCache client & server (rather than Cassandra, just to be different), I couldn't get any records/flowfiles on the duplicate relationship after sending 2 of the same flowfile into the DeduplicateRecords processor (with the config from your template above, Multiple Files e.g.). I had to enable the rest of the flow so the cache values would be populated by the PutDistributedMapCache processor. Is that intended? If so the docs should reflect that and if not, the processor itself should handle writing the cache identifier so no additional processor is needed, or perhaps it would at least be configurable.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for revisiting this pull request @MikeThomsen. Coming back to this after some time, I noted a few additional issues related to message digest handling and options, as well as a few other minor things related to logging and testing. Others may be able to provide additional feedback on general functionality, and I can take another look soon.

@MikeThomsen
Copy link
Contributor Author

Good feedback. Will incorporate.

@MikeThomsen
Copy link
Contributor Author

@mattyb149

I had to enable the rest of the flow so the cache values would be populated by the PutDistributedMapCache processor. Is that intended? If so the docs should reflect that and if not, the processor itself should handle writing the cache identifier so no additional processor is needed, or perhaps it would at least be configurable.

I'll update the docs. We can't have this processor updating the DMC because otherwise it's going to be such a thing upstream of the operations that need to complete before a DMC entry is written.

Copy link
Contributor

@mattyb149 mattyb149 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the things I found were with non-ideal-path situations such as operator error :) , CS failures, etc. Otherwise this is looking and working well, it's getting close!

@MikeThomsen
Copy link
Contributor Author

@exceptionfactory @ottobackwards @mattyb149 should be good to go now.

@MikeThomsen
Copy link
Contributor Author

@exceptionfactory @mattyb149 any chance we could close this out?

@joewitt
Copy link
Contributor

joewitt commented Mar 9, 2022

Please don't use plural form on processor naming.

DeduplicateRecord

adamfisher and others added 10 commits March 9, 2022 14:30
Added NiFi DetectDuplicateRecord standard processor.
Adding some documentation and PR review tweaks.
Exposing processor
Documentation updates, exception handling consolidation, added support for record path field variables.
Added tests.
Build bump.
Migrated cache service to groovy folder.
Moved declarations for properties to @BeforeClass lifecycle method.
Adding some documentation and PR review tweaks.
Documentation updates, exception handling consolidation, added support for record path field variables.
Added tests.
Build bump.
Migrated cache service to groovy folder.
Fixed variable type bug.
Fixed mapping of test params to usage.
Fixed potential illegal state exception bug.
Removed DMC.
NIFI-6047 Started integrating changes from NIFI-6014.
NIFI-6047 Added DMC tests.
NIFI-6047 Added cache identifier recordpath test.
NIFI-6047 Added additional details.
NIFI-6047 Removed old additional details.
NIFI-6047 made some changes requested in a follow up review.
NIFI-6047 latest.
@MikeThomsen
Copy link
Contributor Author

@joewitt addressed your request.

@mattyb149
Copy link
Contributor

+1 LGTM, tried with various happy and non-happy scenarios, verified the expected results. Thanks for sticking with this new feature! Merging to main

@mattyb149 mattyb149 closed this in df00cc6 Mar 10, 2022
@MikeThomsen MikeThomsen deleted the NIFI-6047-fixed branch March 10, 2022 00:14
krisztina-zsihovszki pushed a commit to krisztina-zsihovszki/nifi that referenced this pull request Jun 28, 2022
Removed DMC.
NIFI-6047 Started integrating changes from NIFI-6014.
NIFI-6047 Added DMC tests.
NIFI-6047 Added cache identifier recordpath test.
NIFI-6047 Added additional details.
NIFI-6047 Removed old additional details.
NIFI-6047 made some changes requested in a follow up review.
NIFI-6047 latest.
NIFI-6047 Finished updates
First round of code review cleanup
Latest
Removed EL from the dynamic properties.
Finished code review requested refactoring.
Checkstyle fix.
Removed a Java 11 API
NIFI-6047 Renamed processor to DeduplicateRecord

Signed-off-by: Matthew Burgess <mattyb149@apache.org>

This closes apache#4646
Lehel44 pushed a commit to Lehel44/nifi that referenced this pull request Jul 1, 2022
Removed DMC.
NIFI-6047 Started integrating changes from NIFI-6014.
NIFI-6047 Added DMC tests.
NIFI-6047 Added cache identifier recordpath test.
NIFI-6047 Added additional details.
NIFI-6047 Removed old additional details.
NIFI-6047 made some changes requested in a follow up review.
NIFI-6047 latest.
NIFI-6047 Finished updates
First round of code review cleanup
Latest
Removed EL from the dynamic properties.
Finished code review requested refactoring.
Checkstyle fix.
Removed a Java 11 API
NIFI-6047 Renamed processor to DeduplicateRecord

Signed-off-by: Matthew Burgess <mattyb149@apache.org>

This closes apache#4646
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants