NIFI-5642: QueryCassandra processor : output FlowFiles as soon fetch_size is reached #3051

Closed
wants to merge 6 commits

Conversation

@aglotero (Contributor) commented Oct 8, 2018:


Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@mattyb149 (Contributor) left a comment:

If I understand correctly, you'd like to have the capability to start sending flow files downstream as soon as some threshold is reached, rather than waiting until the result set is exhausted before sending all the flow files at once. This pattern has shown up a few times already; I refer to it as "incremental commit" because, in order to actually send flow files downstream, you have to commit the session. However, at that point you lose the ability to keep session-level metrics such as the result set count (some flow files will already be "gone" by the time you know how many rows were in the result set).

I agree with adding "incremental commit" behavior to QueryCassandra, but I think we need a different approach. The Fetch Size property is a setting at the "driver level" rather than the user level: it refers to the number of results to fetch from the remote result set (aka "cursor") at once. For example, you may want to fetch 100 rows at a time even though you process them one at a time; this reduces the network overhead of fetching each row individually. At the user level, however, a user may want to send out a flow file with 10 or 1000 results. So IMO we don't want to reuse the Fetch Size property for this.
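To illustrate the distinction, here is a minimal sketch of driver-level paging, assuming the DataStax Java driver 3.x API; the keyspace, table, and fetch size below are placeholders, not anything from QueryCassandra itself:

```java
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

class FetchSizeSketch {
    // Fetch size only controls driver-level paging: how many rows are pulled
    // from the cluster per network round trip. The caller still iterates over
    // every row; the driver fetches the next page transparently.
    static void readAll(final Session session) {
        final Statement statement = new SimpleStatement("SELECT * FROM my_keyspace.my_table");
        statement.setFetchSize(100); // fetch 100 rows per page from the coordinator
        final ResultSet rs = session.execute(statement);
        for (final Row row : rs) {
            // process one row at a time; paging happens under the covers
        }
    }
}
```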

Perhaps take a look at QueryDatabaseTable; it has a separate "Output Batch Size" property that I believe is more in line with what you'd like to do. Its documentation describes the tradeoff around which attributes are (or are not) populated on outgoing flow files depending on whether this property is set (i.e. when using "incremental commits").

One thing QueryDatabaseTable doesn't have is incoming connections, so it won't have an example of preserving lineage. But I believe if you keep the existing code (with "fileToProcess") intact, it will continue to work. If you need to generate more flow files, just use session.create(fileToProcess) as is done currently, and the session will handle the FORK provenance events, etc. for you.
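As a rough sketch of that direction (the property name, relationships, and helper methods below are illustrative assumptions patterned after QueryDatabaseTable, not the actual QueryCassandra code):

```java
import java.io.OutputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.util.StandardValidators;

public class IncrementalCommitSketch extends AbstractProcessor {

    // Hypothetical property: how many result FlowFiles to emit before committing
    // the session. 0 keeps today's behavior of a single commit at the end.
    static final PropertyDescriptor OUTPUT_BATCH_SIZE = new PropertyDescriptor.Builder()
            .name("output-batch-size")
            .displayName("Output Batch Size")
            .description("Number of output FlowFiles to queue before committing the session. "
                    + "When set, attributes that depend on the full result set (e.g. total row count) "
                    + "cannot be added to FlowFiles that have already been committed.")
            .required(true)
            .defaultValue("0")
            .addValidator(StandardValidators.NON_NEGATIVE_INTEGER_VALIDATOR)
            .build();

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Result FlowFiles").build();
    static final Relationship REL_ORIGINAL = new Relationship.Builder()
            .name("original").description("The incoming FlowFile, after all results are emitted").build();

    @Override
    public Set<Relationship> getRelationships() {
        final Set<Relationship> rels = new HashSet<>();
        rels.add(REL_SUCCESS);
        rels.add(REL_ORIGINAL);
        return rels;
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        final int outputBatchSize = context.getProperty(OUTPUT_BATCH_SIZE).asInteger();
        final FlowFile fileToProcess = session.get(); // may be null if there is no incoming connection
        int uncommitted = 0;

        while (hasMoreRows()) { // hypothetical helper wrapping the Cassandra ResultSet
            // session.create(parent) preserves lineage; the framework emits FORK provenance events.
            FlowFile out = (fileToProcess == null) ? session.create() : session.create(fileToProcess);
            out = session.write(out, this::writeNextBatchOfRows); // hypothetical helper
            session.transfer(out, REL_SUCCESS);
            uncommitted++;

            // Incremental commit: committed FlowFiles move downstream immediately, but they are
            // "gone" as far as adding result-set-wide attributes later is concerned.
            if (outputBatchSize > 0 && uncommitted >= outputBatchSize) {
                session.commit();
                uncommitted = 0;
            }
        }

        if (fileToProcess != null) {
            session.transfer(fileToProcess, REL_ORIGINAL); // route the parent once everything is emitted
        }
    }

    private boolean hasMoreRows() { return false; }               // placeholder
    private void writeNextBatchOfRows(final OutputStream out) { } // placeholder
}
```

In this sketch, FlowFiles committed early cannot receive a total-row-count attribute afterwards, which is exactly the tradeoff QueryDatabaseTable documents for its Output Batch Size property.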

@aglotero (Contributor, Author) commented Oct 9, 2018:

Thanks for the comments @mattyb149!
Your understanding is correct regarding the intent of this change; I'll take a look at QueryDatabaseTable to understand how to add an "Output Batch Size" parameter to the QueryCassandra processor.
Regards,
André

Commits pushed to this PR:

  • NIFI-5642: QueryCassandra processor : output FlowFiles as soon fetch_size is reached
  • Fixed checkstyle error
  • Delete build.sh (local build file)
  • NIFI-5642: letting fetch_size to control the Cassandra data flow creating a new MAX_ROWS_PER_FLOW_FILE parameter
  • Fixed checkstyle error: no more import java.util.*
  • Fixed missing imports
  • NIFI-5642: added REL_ORIGINAL relationship in order to allow incremental commit

@aglotero (Contributor, Author) commented:

Hi @mattyb149!
This last commit contains all of the changes from the previous commits, addressing your comments.
Regards,

@aglotero (Contributor, Author) commented:

Hi @mattyb149!
I've merged my changes with master; I don't know why git wasn't able to merge the changes.
Can you check it?
Thanks!

@MikeThomsen (Contributor) commented:

@aglotero what happened is that you did a merge of master instead of a rebase. It's a common mistake, but one that can cause a lot of problems in cases like this. The proper workflow is:

  1. git checkout main
  2. git pull apache main
  3. git checkout <your-feature-branch>
  4. git rebase main
  5. git push origin --force (this is necessary since you updated the base of the feature branch)

@github-actions (bot) commented May 2, 2021:

We're marking this PR as stale due to lack of updates in the past few months. If, after another couple of weeks, the stale label has not been removed, this PR will be closed. This stale marker and eventual auto-close do not indicate a judgement of the PR, just a lack of reviewer bandwidth, and they help us keep the PR queue more manageable. If you would like this PR re-opened you can do so, and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth, which in turn helps yours.
