NIFI-5900 Add SelectJson processor #3455

arenger · 2019-04-25T22:37:15Z

Overview

The goal of this PR is to further fortify NiFi when working with large JSON files. As noted in the NiFi overview, systems will invariably receive "data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format." In the case of "too big", NiFi (or any JVM) can continue just fine and handle large files with ease if it does so in a streaming fashion, but the current JSON processors use a DOM approach that is limited by available heap space. This PR recommends the addition of a SelectJson processor that can be employed when large JSON files are expected or possible.

The current EvaluateJsonPath and SplitJson processors both leverage the Jayway JsonPath library. The Jayway implementation has excellent support for JSON Path expressions, but requires that the entire JSON file be loaded into memory. It builds a document object model (DOM) before evaluating the targeted JSON Path. This is already noted as a "System Resource Consideration" in the documentation for the SplitJson processor, and the same is true for EvaluateJsonPath.

The proposed SelectJson processor uses an alternate library called JsonSurfer to evaluate a JSON Path without loading the whole document into memory all at once, similar to SAX implementations for XML processing. This allows for near-constant memory usage, independent of file size, as shown in the following test results:

The trade-off is between heap space usage and JSON Path functionality. The SelectJson processor supports almost all of JSON Path, with a few limitations mentioned in the @CapabilityDescription. For full JSON Path support and/or multiple JSON Path expressions, EvaluateJsonPath and/or SplitJson processor should be used. When memory conservation is important, the SelectJson processor should be used.

Licensing

The JsonSurfer library is covered by the MIT License which is compatible with Apache 2.0.

Testing

This PR is a follow-on from #3414 in which I proposed a similar solution that required extenseive unit testing. Tests from that PR were adapted and preserved for this PR, even though many of them are testing the JsonSurf library. This is a much simpler PR since the path processing is handled in a third-party library.

As for the memory statistics noted above, they were gathered using the same methodology described in #3414. For posterity, here's a python script to generate JSON files of arbitrary size:

import uuid

(I, J, K) = (1, 8737, 3)
with open('out.json', 'w') as f:
    f.write("[")
    for i in range(0,I):
        f.write("[")
        for j in range(0,J):
            f.write("[")
            for k in range(0,K):
                f.write('"' + str(uuid.uuid4()) + '"');
                if (k < K - 1):
                    f.write(",")
            f.write("],\n" if j < J - 1 else "]\n")
        f.write("],\n" if i < I - 1 else "]\n")
    f.write("]\n")

How to use SelectJson Processor

Given an incoming FlowFile and a valid JSON Path setting, SelectJson will send one or more FlowFiles to the selected relation, and the original FlowFile will be sent to the original relation. If JSON Path did not match any object or array in the document, then the document will be passed to the failure relation.

JSON Path Examples

Here is a sample JSON file, followed by JSON Path expressions and the content of the FlowFiles that would be output from the SplitLargeJson processor.

Sample JSON:

[
  {
    "name": "Seattle",
    "weather": [
      {
        "main": "Snow",
        "description": "light snow"
      }
    ]
  },
  {
    "name": "Washington, DC",
    "weather": [
      {
        "main": "Mist",
        "description": "mist"
      },
      {
        "main": "Fog",
        "description": "fog"
      }
    ]
  }
]

JSON Path Expression: $[1].weather.*
- FlowFile 0: {"main":"Mist","description":"mist"}
- FlowFile 1: {"main":"Fog","description":"fog"}
JSON Path Expression: $[1].name
- FlowFile 0: "Washington, DC"
JSON Path Expression: $[*]['weather'][*]['main']
- FlowFile 0: "Snow"
- FlowFile 1: "Mist"
- FlowFile 2: "Fog"

Checklist

@ottobackwards This approach (SelectJson) is simpler than my original approach (SplitLargeJson -- #3414 ) and provides the same system resource advantages, with one exception: If the targeted JSON excerpts themselves are large, then I'm fairly certain the SplitLargeJson approach would be better. This is because the JsonSurfer library does not provide a means by which to supply an OutputStream to its bind signature... as far as I can see.

In other words, the required heap space is independent of incoming FlowFile size, for both SelectJson and SplitLargeJson, but if the outgoing FlowFile size is also large, then the SplitLargeJson approach is currently a better approach, from a system resource perspective.

arenger · 2019-04-25T22:57:48Z

Note that this PR is closely related to #3414. I don't intend for both PRs to be accepted. They are two different approaches to solve the same problem. Here are the high level questions/options that I can see:

Create a new SplitJsonProcessor that uses javax.json (PR 3414)
Create a new SelectJson that uses JsonSurfer (this PR)
Keep only SplitJson and optionally employ a streaming approach, backed by javax.json, when a new property is set
Keep only SplitJson and optionally employ a streaming approach, backed by JsonSurfer, when a new property is set

The later two options involve possible incompatibilities between the underlying libraries (javax.json and JsonSurfer) and the exiting processor behaviors. My vote after considering this for a while is to keep the EvaluateJsonPath processor, keep the SplitJson processor, and add the SelectJson processor.

bbende · 2019-04-26T15:54:49Z

These PR descriptions are amazingly helpful, thanks!

I haven't followed the other PR closely, so not sure if this was already discussed, but is there significant benefit to using SelectJson/SplitLargeJson vs. the record processors like QueryRecord/PartitionRecord/SplitRecord with a JsonTreeReader and JsonRecordWriter?

The original json processors like SplitJson and EvaluateJsonPath existed early on, way before the record concept, and seems like most of their purpose has been replaced by the record concept.

ottobackwards · 2019-04-29T12:54:25Z

@bbende I am not familiar with the split/partition records. The issues I think in the past would have been not having a schema, and the jsonpath vs record query language functionality.

I know needing a schema is gone now, although I don't know if there is parity.

ottobackwards · 2019-04-30T15:57:30Z

So, I don't think split record supports a query. Where as split json can reach into a json structure and find an internal array and split that etc etc

ottobackwards

This is nice. Thanks for putting it together. I am wondering what happens if a not supported json path is entered? Would it fail validation?

arenger · 2019-05-21T11:28:23Z

@bbende Thank you! As for your comment: It could be that record processing is the "new normal", in which case #3222 is an important PR. In either case, however, as long as the EvaluateJsonPath and SplitJson exist in NiFi, I think there is argument to provide memory-efficient counterparts.

@ottobackwards I haven't written much because it seems we're still waiting for review by a different group of people? I would be happy to make identified changes if folks think that one of these two PRs (#3455 or #3414) have good potential to be approved. However, if neither PR looks very promising or likely to be accepted (that's OK of course, no problem!) then I don't want to spend any more time on this.

arenger · 2019-05-21T11:42:59Z

@ottobackwards If an invalid JSON Path is submitted, the SelectJson processor would be in an invalid state. This PR relies on the underlying JsonSurfer for validation of the JSON path and I don't know whether that library will filter out / report the few edge cases where a path is valid but not supported. I could look into that further depending on overall trajectory, as mentioned above and on the other PR.

ottobackwards · 2019-05-21T12:27:11Z

@bbende what would you suggest to move these two PR's forward ( or one of them rather )? I think that on of them should land, but the decision between using the lib or hand rolled json path capability is something that I think needs more community involvement.

bbende · 2019-05-21T18:01:46Z

I think Matty B and Mark are the most involved with the record related processors, so they might be able to offer the best guidance. From a high-level, it seems like maybe we should have both approaches (this PR and Mike's streaming reader).

arenger · 2019-09-21T19:23:52Z

@ottobackwards and/or @bbende Do you know what my next step would be, to push this PR forward? bbende mentioned "Matty B" or "Mark". See also my comment from May 21st.

Thank you for any help or suggestions.

ottobackwards · 2019-09-22T12:56:19Z

Besides de-conflicting, from my POV, I think that there needs to be some consensus on which of the approaches makes for sense for the project. I think that should come from the committers.

I personally would vote for this PR, using the surfer, as it has less maintenance burden.

arenger · 2019-09-23T10:21:07Z

@ottobackwards Yeah I agree that this PR would probably have less maintenance burden. That's why I closed PR #3414 a few months ago. I would be happy to merge or to close this PR. I'd prefer the former but just want to go one way or another.

In my opinion, both this PR and PR #3222 should be merged. This would allow for streaming json processing from either a record-based processing model or the "traditional"/original processing model. I think #3222 and this PR complement one another, but they're not dependent on each other. I did not submit PR #3222 and it is apparently closed. I submitted PR #3414 but I closed that one in favor of this one. Therefore, I think this PR should be merged and @MikeThomsen can decide separately whether he would like to resubmit his work.

ottobackwards · 2019-09-23T10:57:29Z

+1 from me

ottobackwards · 2019-10-18T21:18:32Z

@arenger hope you are holding out patience wise on this. Hopefully some traction soon ( although it looks like it will be post release )

arenger · 2019-10-18T23:45:09Z

@ottobackwards Thanks Otto, yeah I look forward to hearing whether or not folks want to include this. If they do, it should be all ready to go.

joewitt · 2021-03-25T18:07:04Z

closing due to inactivity but the effort/description/discussion are solid. i suspect we'll have to implement a stale PR bot soon and this would get swept. But great descriptions here. I really wonder if the record processors take care of this well enough now. Either way that is where the effort should go on a go forward basis for something like this.

ottobackwards reviewed May 7, 2019

View reviewed changes

arenger mentioned this pull request May 21, 2019

NIFI-5900 Add a SplitLargeJson processor #3414

Closed

11 tasks

arenger force-pushed the NIFI-5900-RC3 branch 2 times, most recently from e24c86f to 664ebaf Compare October 18, 2019 20:23

NIFI-5900 Add SelectJson processor

943872b

arenger force-pushed the NIFI-5900-RC3 branch from 664ebaf to 943872b Compare October 18, 2019 23:24

joewitt closed this Mar 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NIFI-5900 Add SelectJson processor #3455

NIFI-5900 Add SelectJson processor #3455

arenger commented Apr 25, 2019

arenger commented Apr 25, 2019

arenger commented Apr 25, 2019

bbende commented Apr 26, 2019

ottobackwards commented Apr 29, 2019

ottobackwards commented Apr 30, 2019

ottobackwards left a comment

arenger commented May 21, 2019 •

edited

arenger commented May 21, 2019

ottobackwards commented May 21, 2019

bbende commented May 21, 2019

arenger commented Sep 21, 2019

ottobackwards commented Sep 22, 2019

arenger commented Sep 23, 2019 •

edited

ottobackwards commented Sep 23, 2019

ottobackwards commented Oct 18, 2019

arenger commented Oct 18, 2019

joewitt commented Mar 25, 2021

NIFI-5900 Add SelectJson processor #3455

NIFI-5900 Add SelectJson processor #3455

Conversation

arenger commented Apr 25, 2019

Overview

Licensing

Testing

How to use SelectJson Processor

JSON Path Examples

Checklist

See Also

arenger commented Apr 25, 2019

arenger commented Apr 25, 2019

bbende commented Apr 26, 2019

ottobackwards commented Apr 29, 2019

ottobackwards commented Apr 30, 2019

ottobackwards left a comment

Choose a reason for hiding this comment

arenger commented May 21, 2019 • edited

arenger commented May 21, 2019

ottobackwards commented May 21, 2019

bbende commented May 21, 2019

arenger commented Sep 21, 2019

ottobackwards commented Sep 22, 2019

arenger commented Sep 23, 2019 • edited

ottobackwards commented Sep 23, 2019

ottobackwards commented Oct 18, 2019

arenger commented Oct 18, 2019

joewitt commented Mar 25, 2021

arenger commented May 21, 2019 •

edited

arenger commented Sep 23, 2019 •

edited