Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-12855: Add more information to provenance events to facilitate full graph traversal #8498

Closed
wants to merge 1 commit into from

Conversation

mattyb149
Copy link
Contributor

Summary

NIFI-12855 This PR augments the provenance capabilities to include the following features:

  • A reference in a provenance event to any parent events ("previousEventIds")
  • Add methods to GraphClientService to generate queries/statements in popular graph languages such as Tinkerpop/Gremlin, Cypher, and SQL
  • Add ArcadeDB service as reference implementation for SQL generation

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adjusting the approach in this pull request @mattyb149.

Two quick notes that will make this more straightforward to review:

  1. There are large number of white-space formatting changes that make this more difficult to evaluate. It would be very helpful to revert formatting changes so the scope of substantive changes is clearer
  2. The Graph Client component changes should be extracted to a separate pull request. Although it is helpful to see the relationship, it would be much better to isolate framework changes from component changes that introduce features not directly related.

…ull graph traversal

Co-authored-by: Timea Barna <timeabarna@apache.org>
@mattyb149
Copy link
Contributor Author

The graph client stuff has been removed, I'll do a separate PR but not yet in case the reviews here affect the clients. Once this is in it will facilitate more clients such as perhaps an RDF/SPARQL client. Thanks for the inputs!

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for scoping down the changes in this pull request @mattyb149, this makes the core changes much easier to follow.

Reviewing the addition of the previousEventIds raises several questions. The ProvenanceEventRecord already contains Parent and Child FlowFile identifiers, used for linking, and it seems like this should be sufficient to support graph traversal use cases. The Provenancen Event ID is more of an internal database identifier, so it seems less than optimal to promote additional usage of this field.

From a performance perspective, querying the Provenance Repository for previous identifiers also seems like a potential performance problem.

If you can provide more background on why this is necessary, that would be helpful, but otherwise, the proposed approach does not look like a good candidate to go forward.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattyb149 Given the current merge conflicts and the unresolved discussion on the need for these changes, I recommend closing this pull request and revisiting it when ready. We could continue the discussion on the Jira issue if that would be helpful to scope out the types of changes needed.

@mattyb149 mattyb149 closed this May 3, 2024

@Override
public void initialize(EventReporter eventReporter, Authorizer authorizer,
ProvenanceAuthorizableFactory factory, IdentifierLookup identifierLookup)
throws IOException {
ProvenanceAuthorizableFactory factory, IdentifierLookup identifierLookup)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a number of entries here that look like whitespace additions. Please double check and if so remove

* @param previousEventIds The previous event IDs (usually one except for JOIN events and such)
* @return the builder
*/
ProvenanceEventBuilder setPreviousEventIds(List<Long> previousEventIds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this imply we're storing event ids for the before and after event on both the before and after event?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah maybe it is that we had parent FlowFile identifiers and now we'll have previous Event Identfiers.

@joewitt
Copy link
Contributor

joewitt commented May 3, 2024

@mattyb149 It is important to undo all the whitespace changes to improve the reviewer efficiency and avoid the whitespace in general.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants