NIFI-12855: Add more information to provenance events to facilitate full graph traversal #8476

mattyb149 · 2024-03-06T00:55:29Z

Summary

NIFI-12855 This PR adds additional information to provenance events (such as previous event IDs) to help facilitate its representation as a property graph (in a graph database for example).

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Apache NiFi Jira issue created

Pull Request Tracking

Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

Pull Request based on current revision of the main branch
Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

Build completed using mvn clean install -P contrib-check
- JDK 21

Licensing

New dependencies are compatible with the Apache License 2.0 according to the License Policy
New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

Documentation formatting appears as expected in rendered files

Change buildQueryFromNodes to return list of queries, added ArcadeDBClientService Refactor query builder and arcadedb service (#28) * Refactor query builder and arcadedb service * Removing unnecessary languages from arcadedbservice

Change vertex type to provenance event type Added more SQL processing

…ull graph traversal

exceptionfactory

Thanks for putting the effort into this new feature @mattyb149.

Improving traceability for provenance reporting has a number of potential benefits, so it is great to see momentum in this direction.

On a cursory a review, I am concerned about the impact to the ProvenanceReporter methods, as evidenced by the number of changes. This constitutes a breaking change to all existing components, which requires careful consideration.

More importantly, directly associating every provenance event with a relationship seems questionable as a general practice. As Processors can emit multiple events, and send to multiple relationships, the concept of directly associating events with a relationship does not seem to be the right approach. I would like to take a closer look at the goals described, and consider other alternatives.

mattyb149 · 2024-03-06T01:37:47Z

If a processor issues an event for each FlowFile or for a whole set, they should all go to the relationship. Without having the relationship name you can still link the events but you can't traverse them properly. For example if there's no label on the relationship then when you query for the Flowfiles going through the processors you'll get the events for every path in the tree, instead of being able to query just for "success" for example.

These are definitely breaking (API) changes which is why I'm aiming for NiFi 2.0. However if we can't uniquely describe the edge between two provenance events, it will limit future provenance capabilities.

exceptionfactory

Thanks for the reply.

I recognize that the Provenance Repository itself does not contain the relationship information, but did you consider other approaches to achieve similar outcomes? For example, is it possible to extract FlowFile Repository information and combine with Provenance Repository information to support traversal based on relationship? Understanding that such an approach would require work, the current approach seems to bend the Provenance Reporter API in a direction that does not seem to fit general use cases from a developer point of view.

mattyb149 · 2024-03-06T02:25:52Z

True, I considered that approach but it relies on the FlowFile repository to be as complete as the provenance repository but doesn't it usually "expire" earlier? That's also why I didn't tackle any content claim stuff (although that would be sweet for replay from an external source). Also does it contain relationship information? I didn't see it in the API.

I'm totally fine with investigating other approaches, it just seemed nothing quite fit the use cases I was hoping for, mainly to be able to store everything we know in a graph DB for querying / analysis. We can also add the flow graph to the same graph for more powerful queries, so you don't have to do your first search by provenance events, you can find the FlowFile or processor quickly then walk the graph.

exceptionfactory · 2024-03-06T03:40:18Z

I agree that the FlowFile Repository only has part of the information as well, so an end-to-end solution would require some intermediate layer for sending some metadata elsewhere while maintaining the existing persistence requirements.

Taking another look at the options, the StandardProcessSession already has access to both transfer relationships and provenance records. At minimum, any changes should not require public-facing ProvenanceReporter changes because the StandardProcessSession could populate relationship information on provenance records. More evaluation would be required in that direction, but it highlights the importance of scoping any changes as narrowly as possible when it comes to the nifi-api interfaces. It might still be necessary to adjust the ProvenanceEventRecord definition with this type of approach, but it should not require changing the ProvenanceEventReporter and thus all referencing Processors.

With this background, it may be best to continue to the discussion on the Jira issue to iron out a general path forward, or providing another different with an alternative solution.

exceptionfactory · 2024-03-06T03:41:22Z

I also recommend separating out any GraphClientService changes to their own pull request, as they should be more straightforward to review.

markap14 · 2024-03-06T19:16:56Z

Yeah, I would agree with you here @exceptionfactory . The API should not change for this. The Process Session knows which relationship each FlowFile is routed to.

That said, I don't know that it really makes sense to populate the relationship on Provenance Events for this purpose. The Relationship is really intended for ROUTE types of events. You could have a single FlowFile that has multiple events within the same session. For example, you could have a FORK, SEND, and DROP. The Relationship makes sense for ROUTE because it is explicitly stating "This FlowFile was routed to this Relationship." But for a FORK event to have a "relationship of 'original'" - conceptually it doesn't really make sense.

mattyb149 and others added 4 commits March 5, 2024 19:47

Add buildQueryFromNodes to GraphClientService

90bdae4

Change buildQueryFromNodes to return list of queries, added ArcadeDBClientService Refactor query builder and arcadedb service (#28) * Refactor query builder and arcadedb service * Removing unnecessary languages from arcadedbservice

Add previousEventIDs to provenance components for drawing edges

7399dc3

Change vertex type to provenance event type Added more SQL processing

Adding edge linking in Gremlin and ArcadeDBService error handling (#29)

af63871

NIFI-12855: Add more information to provenance events to facilitate f…

f7552c3

…ull graph traversal

mattyb149 force-pushed the NIFI-12855 branch from 81d7061 to f7552c3 Compare March 6, 2024 01:08

exceptionfactory requested changes Mar 6, 2024

View reviewed changes

exceptionfactory reviewed Mar 6, 2024

View reviewed changes

mattyb149 closed this Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NIFI-12855: Add more information to provenance events to facilitate full graph traversal #8476

NIFI-12855: Add more information to provenance events to facilitate full graph traversal #8476

mattyb149 commented Mar 6, 2024

exceptionfactory left a comment

mattyb149 commented Mar 6, 2024

exceptionfactory left a comment

mattyb149 commented Mar 6, 2024

exceptionfactory commented Mar 6, 2024

exceptionfactory commented Mar 6, 2024

markap14 commented Mar 6, 2024

NIFI-12855: Add more information to provenance events to facilitate full graph traversal #8476

NIFI-12855: Add more information to provenance events to facilitate full graph traversal #8476

Conversation

mattyb149 commented Mar 6, 2024

Summary

Tracking

Issue Tracking

Pull Request Tracking

Pull Request Formatting

Verification

Build

Licensing

Documentation

exceptionfactory left a comment

Choose a reason for hiding this comment

mattyb149 commented Mar 6, 2024

exceptionfactory left a comment

Choose a reason for hiding this comment

mattyb149 commented Mar 6, 2024

exceptionfactory commented Mar 6, 2024

exceptionfactory commented Mar 6, 2024

markap14 commented Mar 6, 2024