New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-5900 Add a SplitLargeJson processor #3414
Conversation
When I run I've tested the code as a separated/custom nar file (and with the unit tests, etc), but I would like to test it via the UI, as a bundled standard nar. |
How does your approach compare with https://github.com/jsurfer/JsonSurfer? In other Jira issues we have talked about possibly using that, although I'm not sure we have thought about a new processor, as opposed to improving splitjson. |
@ottobackwards I originally sought to improve
Again, I could submit a different pull request that targets an optimization of As for JsonSurfer, I had honestly never heard of it. My code here was from a work project I did a couple years ago that was finally approved for release to the public. I could probably make a change to Let me know what you think is best. |
@arenger, I did not mean to trivialize what you did, and I certainly not saying you are wrong in what you did, so don't worry.
|
Haha thanks Otto! I didn't feel trivialized. : ) My point was just that I was bummed to learn that a project like JsonSurfer already exists, but at the same time it looks pretty solid. Maybe we should work that in instead, now or in the future. I'm fine with whatever's decided about NiFi's JSON-splitting behaviors -- it'd be fun to contribute to NiFi but we should obviously do whatever's best for the code base / end users. |
Hey there, I don't have time for a review right now (traveling quite a bit) but wanted to take the time to thank you @arenger for this AMAZING pull request. It is incredibly well documented, with a lot of test cases and explanations. Really glad to see such contribution in the NiFi community! A big thank you for taking the time to do it. Just a quick remark so that my comment is not totally useless :) You'll have to update the NOTICE file of the standard NAR bundle to mention the javax.json dependency. |
@pvillard31 Thanks Pierre! I updated the NOTICE file as you suggested; good catch. @ottobackwards and @pvillard31 (and anyone else): From here, I think there are two main questions:
On the first question, I would still vote for a separate processor for the reasons I mentioned earlier: consistent output format (always emit a JSON doc instead of sometimes) and the ability to split objects instead of arrays only. I could help with any of the four permutations (one processor or two; with javax.json or without). I plan to look into the feasibility of employing JsonSurfer in splitting json, when I can make more time. |
My only question with JSONSurfer would be if the jsonpath support is equivalent |
...fi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitLargeJson.java
Show resolved
Hide resolved
...rd-processors/src/main/java/org/apache/nifi/processors/standard/util/JsonFragmentWriter.java
Show resolved
Hide resolved
...fi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/JsonStack.java
Show resolved
Hide resolved
...andard-processors/src/main/java/org/apache/nifi/processors/standard/util/SimpleJsonPath.java
Show resolved
Hide resolved
Just a quick review, I would like some javadoc so I can understand the intended purpose and implementation of the classes, so I can review for more than style |
Hello Otto. Yeah no problem, I'll add some more code comments and push them up sometime soon. Probably by monday morning. |
Hello @ottobackwards I added more javadoc comments. Maybe gitlab already notifies you when I push a commit; not sure. |
Sorry for the delay, I have been on vacation. I'll take a look at the new javadoc. |
No problem on delays, I hope you had a good vacation. As for additional detail to the html documentation you mentioned: Yes I could do that, but I don't know where to add it. Did you take a look at the Also, as I mentioned in an above comment, I think there are four roads we could take from here:
I looked briefly into the 2nd and 4th option but have yet to confirm whether the memory usage is comparable. In order to use
|
@arenger how did you suppress the antlr runtime? |
Well,
... I did not yet run unit/regression tests in |
Interesting, thanks |
I maintain simple syslog *, and I tried the new antlr runtime this morning, and I'm having issues compiling. I wonder if it just works if you use a different runtime... at runtime |
@arenger I have upgraded simple-syslog5424 to antlr 4.7.2. I have not cut the release yet. |
@ottobackwards Yeah that sounds good. On which branch did you make the updates to simple-syslog5424? I could pull those into a new branch that adds a new I ran the same memory tests on that processor and it behaves as advertised (i.e. doesn't create an in-memory DOM while processing): ... so that's good. How about I create a simpler PR for adding a new |
I actually was able to cut the release this morning. So, if in your branch you update the version to 0.0.12 you will get the new version. I think it should have been distributed out by now No problem with FilterJson -> PR's are free ;) I would like to get some committer level feedback at some point as well on the high level questions here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should have test for the fragment writer and basically all the classes used in here.
* match a range of array elements OR a range of object fields. For example, see the disjointEmbeddedStar | ||
* unit tests. This ambiguity ends up muddying the code. Accounting for a possible "alternative path" (altPath) | ||
* is probably the cleanest (though maybe not the most efficient) way to handle the oddity. */ | ||
public static SimpleJsonPath of(String str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these inline regex's be replaced by static compiled patterns?
What happens if not supported jsonpath is entered? |
Hi @ottobackwards the processor will report an invalid state when the JSON path is not supported... and yeah I bet the regex handling could be optimized. I'd be happy to do that if folks think that this or the other PR have some promise/likelihood of being accepted. No problem either way, but I shouldn't spend any more time here until we have a more authoritative "trajectory"/guidance regarding the questions I raise above. See also: |
You need to add an entry to FWIW, I think going to JsonSurfer is the right call, in part because it wraps known quantities like Gson and Jackson. When I did that streaming json parser service, it was pretty easy once I got the hang of it. The only thing in your implementation I am worried about is the 1:1 flowfile/record thing. I think you should have batching built in as an option. The use case that caused my initial use of JsonSurfer was a 16GB JSON file that would have put about 80M flowfiles into the queue. Not typical, but even putting a few hundred thousand tiny flowfiles at once is not necessarily a good thing either. |
Recommend you consider closing this out and putting up a PR for the JsonSurfer branch. |
Thanks @MikeThomsen . I already have a different PR for the alternate JsonSurfer implementation: #3455 We'll see what comes of it! |
Overview
The current
SplitJson
processor works great when the targeted documents are small. However, if a given flow needs to handle large JSON documents, the SplitJson processor can consume a lot of memory. This is already noted as a "System Resource Consideration" in the documentation for theSplitJson
processor.I created a
SplitLargeJson
processor to address this concern. It uses an event-based streaming API to process the JSON without loading the whole document into memory, similar to SAX implementations for XML processing. This allows for near-constant memory usage, independent of file size, as shown in the following test results:The trade-off is between heap space usage and JSON Path functionality. The
SplitLargeJson
processor supports a basic subset of JSON Path, excluding constructs that would require "backtracking" in the file. See below for examples. For full JSON Path support, theSplitJson
processor should be used. When memory conservation is important, theSplitLargeJson
processor should be used.SplitLargeJson
runs about 20 to 70 percent faster thanSplitJson
because it doesn't have to build a DOM before processing the JSON Path. See the "Test Methodology" section, below.Licensing
The
SplitLargeJson
processor depends onjavax.json
, which is licensed under either GNU v2 or CDDL. According to Apache legal docs, there is a "weak copyleft" provision allowing for compatibility between CDDL and ASF licensing. Furthermore, Apache NiFi already includesjavax.json
in other modules:The LICENSE and NOTICE files were not changed with this PR since
javax.json
is already included.Testing
This PR includes 30 new unit tests to verify the changes, primarily exercising different permutations of JSON Path expressions. In addition, memory and timing metrics were gathered in order to confirm the trade-offs between
SplitJson
andSplitLargeJson
. See the memory usage chart above and the "Test Methodology" section, below.How to use SplitLargeJson Processor
Given an incoming FlowFile and a valid JSON Path setting,
SplitLargeJson
will send one or more FlowFiles to thesplit
relation, and the original FlowFile will be sent to theoriginal
relation. If JSON Path did not match any object or array in the document, then the document will be passed to thefailure
relation.JSON Path Examples
Here is a sample JSON file, followed by JSON Path expressions and the content of the FlowFiles that would be output from the
SplitLargeJson
processor.Sample JSON:
$[1].weather
{"main":"Mist","description":"mist"}
{"main":"Fog","description":"fog"}
$[1].weather[0]
{"main":"Mist"}
{"description":"mist"}
$[*].name
["Seattle"]
["Washington, DC"]
$[*]['weather'][*]['main']
{"main":"Snow"}
{"main":"Mist"}
{"main":"Fog"}
Array slices (
:
), recursive descent (..
), and script expressions (()
) are not supported. However, afragment.index
attribute is set on every outgoing FlowFile to thesplit
relation. This, in combination with aRouteOnAttribute
processor, can be used in place of JSON Path array slices.Test Methodology
Measuring the memory usage of a NiFi processor can be tricky because these processors aren't meant to run in isolation. They're designed to run in the NiFi JVM, of course, but in that context a processor will share heap space with all the other processors, controller services, and "housekeeping" infrastructure required to support the data flow. This section describes how memory and timing metrics were obtained for evaluation of the
SplitLargeJson
processor.An extensive testing framework exists for Apache NiFi that allows for processors to be exercised outside of a NiFi data flow. The framework provides the ability to send FlowFiles to a processor and then assert the presence and content of FlowFiles generated by the processor. This framework is built using JUnit and Maven, and can be invoked within an IDE or via command line interface (CLI). Most Java IDEs are written in Java, and tests run from within the IDE share heap space with the IDE itself. Therefore, memory measurements were taken outside NiFi and outside an IDE, via CLI invocation of the NiFi testing framework.
The goal was to compare
SplitLargeJson
processor with the existingSplitJson
processor. To do this, NiFi was cloned, built locally, and instrumented to log its memory usage. The following four lines were added at the end of theonTrigger
method oforg.apache.nifi.processors.standard.SplitJson
:At this point in the code, resources used to split the original FlowFile are still in scope. A similar change was made to
SplitLargeJson
, to log memory before itsJsonParser
falls out of scope (after the thewhile
loop yet inside thetry
block of theonTrigger
method). A simple test was then created that runs an arbitrary JSON file through the processor:Five JSON files were generated using a python script, starting with 8 MiB and doubling until 128 MiB. Each file was tested with each processor via CLI invocation of the NiFi testing framework, as mentioned above. The resulting log output would then contain memory and timing information. For example, this command runs a 32 MiB file through
SplitLargeJson
and shows the memory and timing results:Here is the corresponding test and output for
SplitJson
:The attentive reader may notice that the above memory usage statistics don't match those presented in the graph from the "Overview" section. This is because the raw output needs to be adjusted to account for the fact that FlowFiles are kept in memory while Apache's NiFi testing framework is running. When these processors are running in an actual NiFi instance (under default configuration), the FlowFile content would be stored on disk.
It's worth noting that FlowFiles provided to and generated by processors in the testing framework are kept in memory (see the importFrom method and usage of the transferMap in MockProcessSession). This is why a deep and specific JSON Path was selected for the tests: to reduce the size and count of the resulting FlowFile splits. Otherwise, if a more inclusive JSON Path were to be used, the memory reading would need further correction to account for data generated by the processor in question. The content of the three FlowFiles generated in these tests were only 41 bytes each.
As for the timing results,
SplitLargeJson
completed between 20 and 70 percent faster thanSplitJson
, for the same five generated JSON files:The timing metrics will vary based on the structure of the JSON document being split. As an additional time test, a JSON file representing the City and County of San Francisco's Subdivision parcels was split into fifteen thousand FlowFiles using both processors. A perl script was used to create 5 different subsets from this file, at the same size increments as above (8, 16, 32, 64, and 128). These tests had run times between 2 seconds and 5 minutes, and
SplitLargeJson
ranged between 20 and 50 percent faster.Checklist
Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
(Note:
mvn clean install
completes without error after disablingFileBasedClusterNodeFirewallTest
andDBCPServiceTest
.Adding
-Pcontrib-check
fails , but it appears to fail onmaster
branch too)Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?
Have you ensured that format looks appropriate for the output in which it is rendered?