Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-12855: Add more information to provenance events to facilitate full graph traversal #8476

Closed
wants to merge 4 commits into from

Conversation

mattyb149
Copy link
Contributor

Summary

NIFI-12855 This PR adds additional information to provenance events (such as previous event IDs) to help facilitate its representation as a property graph (in a graph database for example).

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

mattyb149 and others added 4 commits March 5, 2024 19:47
Change buildQueryFromNodes to return list of queries, added ArcadeDBClientService

Refactor query builder and arcadedb service (#28)

* Refactor query builder and arcadedb service

* Removing unnecessary languages from arcadedbservice
Change vertex type to provenance event type

Added more SQL processing
Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting the effort into this new feature @mattyb149.

Improving traceability for provenance reporting has a number of potential benefits, so it is great to see momentum in this direction.

On a cursory a review, I am concerned about the impact to the ProvenanceReporter methods, as evidenced by the number of changes. This constitutes a breaking change to all existing components, which requires careful consideration.

More importantly, directly associating every provenance event with a relationship seems questionable as a general practice. As Processors can emit multiple events, and send to multiple relationships, the concept of directly associating events with a relationship does not seem to be the right approach. I would like to take a closer look at the goals described, and consider other alternatives.

@mattyb149
Copy link
Contributor Author

If a processor issues an event for each FlowFile or for a whole set, they should all go to the relationship. Without having the relationship name you can still link the events but you can't traverse them properly. For example if there's no label on the relationship then when you query for the Flowfiles going through the processors you'll get the events for every path in the tree, instead of being able to query just for "success" for example.

These are definitely breaking (API) changes which is why I'm aiming for NiFi 2.0. However if we can't uniquely describe the edge between two provenance events, it will limit future provenance capabilities.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reply.

I recognize that the Provenance Repository itself does not contain the relationship information, but did you consider other approaches to achieve similar outcomes? For example, is it possible to extract FlowFile Repository information and combine with Provenance Repository information to support traversal based on relationship? Understanding that such an approach would require work, the current approach seems to bend the Provenance Reporter API in a direction that does not seem to fit general use cases from a developer point of view.

@mattyb149
Copy link
Contributor Author

True, I considered that approach but it relies on the FlowFile repository to be as complete as the provenance repository but doesn't it usually "expire" earlier? That's also why I didn't tackle any content claim stuff (although that would be sweet for replay from an external source). Also does it contain relationship information? I didn't see it in the API.

I'm totally fine with investigating other approaches, it just seemed nothing quite fit the use cases I was hoping for, mainly to be able to store everything we know in a graph DB for querying / analysis. We can also add the flow graph to the same graph for more powerful queries, so you don't have to do your first search by provenance events, you can find the FlowFile or processor quickly then walk the graph.

@exceptionfactory
Copy link
Contributor

I agree that the FlowFile Repository only has part of the information as well, so an end-to-end solution would require some intermediate layer for sending some metadata elsewhere while maintaining the existing persistence requirements.

Taking another look at the options, the StandardProcessSession already has access to both transfer relationships and provenance records. At minimum, any changes should not require public-facing ProvenanceReporter changes because the StandardProcessSession could populate relationship information on provenance records. More evaluation would be required in that direction, but it highlights the importance of scoping any changes as narrowly as possible when it comes to the nifi-api interfaces. It might still be necessary to adjust the ProvenanceEventRecord definition with this type of approach, but it should not require changing the ProvenanceEventReporter and thus all referencing Processors.

With this background, it may be best to continue to the discussion on the Jira issue to iron out a general path forward, or providing another different with an alternative solution.

@exceptionfactory
Copy link
Contributor

I also recommend separating out any GraphClientService changes to their own pull request, as they should be more straightforward to review.

@markap14
Copy link
Contributor

markap14 commented Mar 6, 2024

Yeah, I would agree with you here @exceptionfactory . The API should not change for this. The Process Session knows which relationship each FlowFile is routed to.

That said, I don't know that it really makes sense to populate the relationship on Provenance Events for this purpose. The Relationship is really intended for ROUTE types of events. You could have a single FlowFile that has multiple events within the same session. For example, you could have a FORK, SEND, and DROP. The Relationship makes sense for ROUTE because it is explicitly stating "This FlowFile was routed to this Relationship." But for a FORK event to have a "relationship of 'original'" - conceptually it doesn't really make sense.

@mattyb149 mattyb149 closed this Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants