NIFI-5900 Add SelectJson processor #3455
@ottobackwards This approach ( In other words, the required heap space is independent of incoming FlowFile size, for both |
Note that this PR is closely related to #3414. I don't intend for both PRs to be accepted. They are two different approaches to solve the same problem. Here are the high level questions/options that I can see:
The later two options involve possible incompatibilities between the underlying libraries ( |
These PR descriptions are amazingly helpful, thanks! I haven't followed the other PR closely, so not sure if this was already discussed, but is there significant benefit to using SelectJson/SplitLargeJson vs. the record processors like QueryRecord/PartitionRecord/SplitRecord with a JsonTreeReader and JsonRecordWriter? The original json processors like SplitJson and EvaluateJsonPath existed early on, way before the record concept, and seems like most of their purpose has been replaced by the record concept. |
@bbende I am not familiar with the split/partition records. The issues I think in the past would have been not having a schema, and the jsonpath vs record query language functionality. I know needing a schema is gone now, although I don't know if there is parity. |
So, I don't think split record supports a query, whereas split json can reach into a json structure, find an internal array, and split that, etc. |
This is nice. Thanks for putting it together. I am wondering what happens if an unsupported JSON Path is entered? Would it fail validation?
@bbende Thank you! As for your comment: It could be that record processing is the "new normal", in which case #3222 is an important PR. In either case, however, as long as the @ottobackwards I haven't written much because it seems we're still waiting for review by a different group of people? I would be happy to make identified changes if folks think that one of these two PRs (#3455 or #3414) have good potential to be approved. However, if neither PR looks very promising or likely to be accepted (that's OK of course, no problem!) then I don't want to spend any more time on this. |
@ottobackwards If an invalid JSON Path is submitted, the |
@bbende what would you suggest to move these two PRs forward (or one of them, rather)? I think that one of them should land, but the decision between using the lib or hand-rolled JSON Path capability is something that I think needs more community involvement. |
I think Matty B and Mark are the most involved with the record related processors, so they might be able to offer the best guidance. From a high-level, it seems like maybe we should have both approaches (this PR and Mike's streaming reader). |
@ottobackwards and/or @bbende Do you know what my next step would be, to push this PR forward? bbende mentioned "Matty B" or "Mark". See also my comment from May 21st. Thank you for any help or suggestions. |
Besides de-conflicting, from my POV, I think that there needs to be some consensus on which of the approaches makes more sense for the project. I think that should come from the committers. I personally would vote for this PR, using the surfer, as it has less maintenance burden. |
@ottobackwards Yeah I agree that this PR would probably have less maintenance burden. That's why I closed PR #3414 a few months ago. I would be happy to merge or to close this PR. I'd prefer the former but just want to go one way or another. In my opinion, both this PR and PR #3222 should be merged. This would allow for streaming json processing from either a record-based processing model or the "traditional"/original processing model. I think #3222 and this PR complement one another, but they're not dependent on each other. I did not submit PR #3222 and it is apparently closed. I submitted PR #3414 but I closed that one in favor of this one. Therefore, I think this PR should be merged and @MikeThomsen can decide separately whether he would like to resubmit his work. |
+1 from me |
e24c86f to 664ebaf
@arenger hope you are holding out patience wise on this. Hopefully some traction soon ( although it looks like it will be post release ) |
664ebaf to 943872b
@ottobackwards Thanks Otto, yeah I look forward to hearing whether or not folks want to include this. If they do, it should be all ready to go. |
closing due to inactivity but the effort/description/discussion are solid. i suspect we'll have to implement a stale PR bot soon and this would get swept. But great descriptions here. I really wonder if the record processors take care of this well enough now. Either way that is where the effort should go on a go forward basis for something like this. |
Overview
The goal of this PR is to further fortify NiFi when working with large JSON files. As noted in the NiFi overview, systems will invariably receive "data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format." In the case of "too big", NiFi (or any JVM) can continue just fine and handle large files with ease if it does so in a streaming fashion, but the current JSON processors use a DOM approach that is limited by available heap space. This PR recommends the addition of a SelectJson processor that can be employed when large JSON files are expected or possible.
The current EvaluateJsonPath and SplitJson processors both leverage the Jayway JsonPath library. The Jayway implementation has excellent support for JSON Path expressions, but requires that the entire JSON file be loaded into memory. It builds a document object model (DOM) before evaluating the targeted JSON Path. This is already noted as a "System Resource Consideration" in the documentation for the SplitJson processor, and the same is true for EvaluateJsonPath.
The proposed SelectJson processor uses an alternate library called JsonSurfer to evaluate a JSON Path without loading the whole document into memory all at once, similar to SAX implementations for XML processing. This allows for near-constant memory usage, independent of file size, as shown in the following test results:
The trade-off is between heap space usage and JSON Path functionality. The SelectJson processor supports almost all of JSON Path, with a few limitations mentioned in the @CapabilityDescription. For full JSON Path support and/or multiple JSON Path expressions, EvaluateJsonPath and/or SplitJson processor should be used. When memory conservation is important, the SelectJson processor should be used.
Licensing
The JsonSurfer library is covered by the MIT License which is compatible with Apache 2.0.
Testing
This PR is a follow-on from #3414 in which I proposed a similar solution that required extensive unit testing. Tests from that PR were adapted and preserved for this PR, even though many of them are testing the JsonSurfer library. This is a much simpler PR since the path processing is handled in a third-party library.
As for the memory statistics noted above, they were gathered using the same methodology described in #3414. For posterity, here's a python script to generate JSON files of arbitrary size:
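The script itself is not reproduced in this text, so the following is only a minimal sketch of how such a generator could look, not the script from #3414. The function name write_large_json, the record fields, and the seed parameter are all hypothetical. It writes a top-level JSON array one element at a time, so the generator's own memory use stays small regardless of the target size.

```python
import json
import random


def write_large_json(path, target_bytes, seed=0):
    """Write a JSON array of small objects to `path`, stopping once at least
    `target_bytes` bytes have been written. Returns the number of elements.

    Hypothetical helper: the record layout below is an assumption made for
    illustration only.
    """
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        out.write("[")
        written = 1
        while written < target_bytes:
            record = {
                "id": count,
                "name": f"station-{count}",
                "weather": [
                    {
                        "main": rng.choice(["Snow", "Mist", "Fog"]),
                        "temp": round(rng.uniform(-10, 35), 1),
                    }
                ],
            }
            # json.dumps escapes non-ASCII by default, so len() equals bytes.
            chunk = ("," if count else "") + json.dumps(record)
            out.write(chunk)
            written += len(chunk)
            count += 1
        out.write("]")
    return count
```

For example, write_large_json("big.json", 500 * 1024 * 1024) would produce a file of roughly 500 MB that can then be fed to the processors under test.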
How to use SelectJson Processor
Given an incoming FlowFile and a valid JSON Path setting, SelectJson will send one or more FlowFiles to the selected relation, and the original FlowFile will be sent to the original relation. If JSON Path did not match any object or array in the document, then the document will be passed to the failure relation.
JSON Path Examples
Here is a sample JSON file, followed by JSON Path expressions and the content of the FlowFiles that would be output from the SelectJson processor.
Sample JSON:
$[1].weather.*
{"main":"Mist","description":"mist"}
{"main":"Fog","description":"fog"}
$[1].name
"Washington, DC"
$[*]['weather'][*]['main']
"Snow"
"Mist"
"Fog"
Checklist
Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
(Note: mvn clean install completes without error after disabling FileBasedClusterNodeFirewallTest and DBCPServiceTest. Adding -Pcontrib-check fails, but it appears to fail on master branch too)
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?
Have you ensured that format looks appropriate for the output in which it is rendered?
See Also
SplitLargeJson: #3414
StreamingJsonReader: #3222