NIFI-6636: Fixed ListGCSBucket file duplication error #3702

Status: Closed · 2 commits

Conversation

@turcsanyip (Contributor) commented Sep 6, 2019

ListGCSBucket duplicated files if they did not arrive in alphabetical order.
The set storing the names of the blobs listed with the highest timestamp
during the previous run of the processor was cleared too early.

Also changed the state persisting logic: the state is now saved only once, at
the end of onTrigger() (similar to ListS3). Previously, an inconsistent state
(blob names only, without the timestamp) was also saved.
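
To make the corrected flow concrete, here is a minimal, self-contained sketch (not the actual processor code) of the listing/deduplication logic described above; Blob, listNewBlobs(), and persistState() are simplified stand-ins for the GCS client types and the NiFi state manager. The key points are that the key set is cleared only when a strictly newer timestamp is seen, and that the state is persisted once, after the whole paging loop.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the corrected ListGCSBucket logic (illustrative only,
// not the actual processor code).
public class ListingSketch {

    // Stand-in for a GCS blob: object name plus last-modified timestamp (millis).
    record Blob(String name, long updateTime) {}

    // State carried over from the previous onTrigger() run.
    private long currentTimestamp = 0L;                 // highest timestamp seen so far
    private Set<String> currentKeys = new HashSet<>();  // names of blobs seen at that timestamp

    public List<Blob> listNewBlobs(List<List<Blob>> pages) {
        final List<Blob> listed = new ArrayList<>();
        long maxTimestamp = currentTimestamp;
        // Keep the previous run's keys until a strictly newer timestamp shows up;
        // clearing this set too early is what caused the duplication bug.
        final Set<String> keysAtMaxTimestamp = new HashSet<>(currentKeys);

        // Pages arrive in alphabetical order, not time order, so the maximum
        // timestamp is only known once every page has been processed.
        for (List<Blob> page : pages) {
            for (Blob blob : page) {
                if (blob.updateTime() < currentTimestamp) {
                    continue; // already listed in an earlier run
                }
                if (blob.updateTime() == currentTimestamp && currentKeys.contains(blob.name())) {
                    continue; // same timestamp as last run and already emitted
                }
                listed.add(blob); // in the processor, a FlowFile would be created here

                if (blob.updateTime() > maxTimestamp) {
                    maxTimestamp = blob.updateTime();
                    keysAtMaxTimestamp.clear();
                }
                if (blob.updateTime() == maxTimestamp) {
                    keysAtMaxTimestamp.add(blob.name());
                }
            }
        }

        // Persist the state once, after the whole listing has been processed.
        currentTimestamp = maxTimestamp;
        currentKeys = keysAtMaxTimestamp;
        persistState(currentTimestamp, currentKeys);
        return listed;
    }

    private void persistState(long timestamp, Set<String> keys) {
        // The real processor writes this to the NiFi state manager.
        System.out.println("state: timestamp=" + timestamp + ", keys=" + keys);
    }
}
```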


@markap14 (Contributor) left a comment

Thanks for the PR @turcsanyip ! I had not noticed this bug before, but I was easily able to replicate the problem after reading the Jira. I was also able to verify that the problem was addressed by this PR. However, I do think we need to add back in the saving of the state. Removing that can result in huge amounts of duplication in certain situations.

// Diff hunk under review: the persistState() call (inside the listing loop)
// that this PR removes.
session.commit();
persistState(context);
@markap14 (Contributor) commented on this hunk:

I don't think removing this is a good idea. Consider performing a listing of a bucket that has 1 million entries. The processor runs and performs a listing of 800,000 elements, committing the session several times. Now NiFi is restarted. The state was not persisted. So now, upon restart, NiFi will duplicate each of those 800,000 elements. The call to persistState there makes sense to avoid huge amounts of duplication when performing a listing on a huge bucket.
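
For illustration, the kind of mid-listing checkpoint described here looks roughly like the sketch below; the names (emitFlowFile, commitSession, persistState) are stand-ins, not the processor's actual API, and, as the next comment explains, such a checkpoint only helps if the persisted state is actually consistent at that point.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of mid-listing checkpointing (illustrative names only,
// not the processor's real methods): commit each batch and persist progress
// so a restart does not re-list hundreds of thousands of entries.
public class CheckpointedListingSketch {

    record Blob(String name, long updateTime) {}

    static void listWithCheckpoints(List<Blob> blobs,
                                    Consumer<Blob> emitFlowFile,  // stand-in for creating the listing FlowFile
                                    Runnable commitSession,       // stand-in for session.commit()
                                    Runnable persistState) {      // stand-in for persisting processor state
        int sinceCheckpoint = 0;
        for (Blob blob : blobs) {
            emitFlowFile.accept(blob);
            if (++sinceCheckpoint >= 1000) {  // checkpoint every 1000 listed entries
                commitSession.run();
                persistState.run();
                sinceCheckpoint = 0;
            }
        }
        commitSession.run();                  // final commit and state save
        persistState.run();
    }
}
```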

@turcsanyip (Contributor, Author) replied:

@markap14 Thanks for the quick feedback!
I agree with you that there should be some checkpointing / recovery mechanism, but the previous solution did not work either.
persistState() saves currentTimestamp and currentKeys. currentTimestamp gets updated only at the end, after the while loop over the blob pages, so saving it within the loop is useless. On the other hand, we cannot update currentTimestamp inside the loop because the blobs are not sorted by time (that is, the next blob page can contain items that have not been listed yet but are older than the max timestamp of the previous page).
Previously, only currentKeys was updated inside the loop, but that produced an inconsistent state (keys without the timestamp) and also caused the current bug. So I removed that update along with the now-useless persistState() call.

I think implementing a proper solution to this problem could be the scope of a separate issue.
I could not find a way to parameterize the GCS list call to return the blobs sorted by last modification time (only alphabetical order is available).
We could fetch all the blobs, sort them on our side, and then create the FlowFiles in a loop while saving the current state (see the sketch below). However, the performance impact (memory considerations) would need to be investigated beforehand.
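
A rough, hypothetical sketch of that fetch-all-and-sort alternative (my own illustration, not code from this PR) is below; because the blobs are processed in timestamp order, the tracked state is consistent at any point and can safely be checkpointed after each batch.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "fetch all, sort by time, checkpoint as you go"
// idea; filtering out blobs already listed in a previous run is omitted for
// brevity, and holding the whole listing in memory is the concern noted above.
public class SortedListingSketch {

    record Blob(String name, long updateTime) {}

    static void listSorted(List<Blob> allBlobs) {
        // Sorting by last-modified time makes the tracking state consistent at
        // every step: everything emitted so far is no newer than the tracked timestamp.
        allBlobs.sort(Comparator.comparingLong(Blob::updateTime));

        long currentTimestamp = 0L;
        Set<String> currentKeys = new HashSet<>();
        int sinceCheckpoint = 0;

        for (Blob blob : allBlobs) {
            // the listing FlowFile for this blob would be created here
            if (blob.updateTime() > currentTimestamp) {
                currentTimestamp = blob.updateTime();
                currentKeys.clear();
            }
            currentKeys.add(blob.name());

            if (++sinceCheckpoint >= 1000) {              // checkpoint every 1000 entries
                persistState(currentTimestamp, currentKeys);
                sinceCheckpoint = 0;
            }
        }
        persistState(currentTimestamp, currentKeys);      // final state
    }

    static void persistState(long timestamp, Set<String> keys) {
        // The real processor would write this to the NiFi state manager.
        System.out.println("checkpoint: timestamp=" + timestamp + ", keys=" + keys.size());
    }
}
```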

@markap14 (Contributor) replied:

Ah, OK, I didn't realize that was the case. I was able to verify this by performing a listing on a bucket with nearly 100,000 items. After about 40-50 thousand had been listed, I restarted NiFi and, sure enough, it re-listed everything on restart. I do think that we need to find a way to handle this better. But for now, this PR does fix a bug and doesn't make anything worse, so it makes sense to go ahead and merge as-is.

// Diff hunk under review: the removed line logged "listed", while the new
// commit() method logs "loaded".
// Removed:
getLogger().info("Successfully listed {} new files from GCS; routing to success", new Object[] {listCount});
// Added:
private void commit(final ProcessContext context, final ProcessSession session, int loadCount) {
    if (loadCount > 0) {
        getLogger().info("Successfully loaded {} new files from GCS; routing to success", new Object[] {loadCount});
@markap14 (Contributor) commented on this hunk:

I think 'listing' here is actually more accurate - saying that they were 'loaded' might imply that the contents were transferred from GCS to NiFi.

@turcsanyip (Contributor, Author) replied:

Absolutely right, I will rename it back.

@markap14 (Contributor) replied:

Thanks, looks good to me.

@asfgit closed this in 21a27c8 on Sep 9, 2019

@markap14 (Contributor) commented Sep 9, 2019

Thanks for the update @turcsanyip. All looks good at this point, so +1, merged to master!

szaboferee pushed a commit to szaboferee/nifi that referenced this pull request on Oct 7, 2019
patricker pushed a commit to patricker/nifi that referenced this pull request on Jan 22, 2020