NIFI-12855: Add more information to provenance events to facilitate full graph traversal #8476
Change buildQueryFromNodes to return list of queries, added ArcadeDBClientService
Refactor query builder and arcadedb service (#28)
* Refactor query builder and arcadedb service
* Removing unnecessary languages from arcadedbservice
Change vertex type to provenance event type
Added more SQL processing
…ull graph traversal
Thanks for putting the effort into this new feature @mattyb149.
Improving traceability for provenance reporting has a number of potential benefits, so it is great to see momentum in this direction.
On a cursory review, I am concerned about the impact to the ProvenanceReporter methods, as evidenced by the number of changes. This constitutes a breaking change to all existing components, which requires careful consideration.
More importantly, directly associating every provenance event with a relationship seems questionable as a general practice. As Processors can emit multiple events, and send to multiple relationships, the concept of directly associating events with a relationship does not seem to be the right approach. I would like to take a closer look at the goals described, and consider other alternatives.
Whether a processor issues an event for each FlowFile or for a whole set, they should all go to the relationship. Without the relationship name you can still link the events, but you can't traverse them properly. For example, if there's no label on the relationship, then when you query for the FlowFiles going through the processors you'll get the events for every path in the tree, instead of being able to query just for "success". These are definitely breaking (API) changes, which is why I'm aiming for NiFi 2.0. However, if we can't uniquely describe the edge between two provenance events, it will limit future provenance capabilities.
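To illustrate the traversal point above, here is a minimal sketch in plain Java. It is not NiFi code: the `Edge` record, the `reachable` helper, and the event names are all hypothetical. It only shows that without a relationship label on each edge, a walk from one event returns every downstream path, whereas a labeled edge lets the walk be restricted to "success".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration only; none of these types exist in NiFi.
class LabeledTraversalSketch {

    // A directed edge between two provenance events, labeled with a relationship name.
    record Edge(String from, String to, String relationship) {}

    // Breadth-first walk from 'start'. A null 'relationship' follows every edge,
    // which is all that is possible when edges carry no label.
    static Set<String> reachable(List<Edge> edges, String start, String relationship) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (Edge e : edges) {
                boolean matches = relationship == null || relationship.equals(e.relationship());
                if (e.from().equals(current) && matches && seen.add(e.to())) {
                    queue.add(e.to());
                }
            }
        }
        return seen;
    }

    // Made-up events: one FlowFile routed to "success", one to "failure".
    static List<Edge> sampleEdges() {
        return List.of(
            new Edge("receive-1", "route-ok", "success"),
            new Edge("receive-1", "route-err", "failure"),
            new Edge("route-ok", "send-1", "success"));
    }

    public static void main(String[] args) {
        List<Edge> edges = sampleEdges();
        // Unlabeled walk: every path in the tree.
        System.out.println(reachable(edges, "receive-1", null));
        // Labeled walk: only the "success" path.
        System.out.println(reachable(edges, "receive-1", "success"));
    }
}
```

In a graph database the same restriction would normally be expressed as an edge-label filter in the query rather than in application code; the sketch only makes the difference between the two result sets concrete.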
Thanks for the reply.
I recognize that the Provenance Repository itself does not contain the relationship information, but did you consider other approaches to achieve similar outcomes? For example, is it possible to extract FlowFile Repository information and combine with Provenance Repository information to support traversal based on relationship? Understanding that such an approach would require work, the current approach seems to bend the Provenance Reporter API in a direction that does not seem to fit general use cases from a developer point of view.
True, I considered that approach, but it relies on the FlowFile Repository being as complete as the Provenance Repository, and doesn't it usually "expire" earlier? That's also why I didn't tackle any content claim stuff (although that would be sweet for replay from an external source). Also, does it contain relationship information? I didn't see it in the API. I'm totally fine with investigating other approaches; it just seemed nothing quite fit the use cases I was hoping for, mainly to be able to store everything we know in a graph DB for querying / analysis. We can also add the flow graph to the same graph for more powerful queries, so you don't have to do your first search by provenance events: you can find the FlowFile or processor quickly, then walk the graph.
I agree that the FlowFile Repository only has part of the information as well, so an end-to-end solution would require some intermediate layer for sending some metadata elsewhere while maintaining the existing persistence requirements. Taking another look at the options, the With this background, it may be best to continue the discussion on the Jira issue to iron out a general path forward, or to propose an alternative solution.
I also recommend separating out any GraphClientService changes to their own pull request, as they should be more straightforward to review.
Yeah, I would agree with you here @exceptionfactory. The API should not change for this. The Process Session knows which relationship each FlowFile is routed to. That said, I don't know that it really makes sense to populate the relationship on Provenance Events for this purpose. The Relationship is really intended for ROUTE types of events. You could have a single FlowFile that has multiple events within the same session. For example, you could have a FORK, SEND, and DROP. The Relationship makes sense for ROUTE because it is explicitly stating "This FlowFile was routed to this Relationship." But for a FORK event to have a "relationship of 'original'" conceptually doesn't really make sense.
Summary
NIFI-12855: This PR adds information to provenance events (such as previous event IDs) to facilitate their representation as a property graph (in a graph database, for example).
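As a rough illustration of how previous event IDs could map onto a property graph, the sketch below turns each (previous ID, event ID) pair into a directed edge. It is plain Java under stated assumptions: the `Event` and `Edge` records and the `previousEventIds` field name are hypothetical stand-ins, not the types or field names used in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; not the provenance event API.
class PreviousEventEdgesSketch {

    // Simplified event: an ID, an event type, and the IDs of the events that preceded it.
    record Event(long id, String type, List<Long> previousEventIds) {}

    // Directed edge from an earlier event to the event that followed it.
    record Edge(long fromEventId, long toEventId) {}

    // Emit one edge per (previous event, event) pair, in input order.
    static List<Edge> toEdges(List<Event> events) {
        List<Edge> edges = new ArrayList<>();
        for (Event event : events) {
            for (long previousId : event.previousEventIds()) {
                edges.add(new Edge(previousId, event.id()));
            }
        }
        return edges;
    }

    // Made-up sample: a RECEIVE followed by a FORK, whose outputs are sent and dropped.
    static List<Event> sampleEvents() {
        return List.of(
            new Event(1L, "RECEIVE", List.of()),
            new Event(2L, "FORK", List.of(1L)),
            new Event(3L, "SEND", List.of(2L)),
            new Event(4L, "DROP", List.of(2L)));
    }

    public static void main(String[] args) {
        toEdges(sampleEvents()).forEach(System.out::println);
    }
}
```

Each event becomes a vertex and each edge can then be written to whatever graph store is in use; how edges are labeled is the open question discussed in the review comments.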
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000
NIFI-00000
Pull Request Formatting
main branch
Verification
Please indicate the verification steps performed prior to pull request creation.
Build
mvn clean install -P contrib-check
Licensing
LICENSE and NOTICE files
Documentation