Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-5900 Add SelectJson processor #3455

Closed
wants to merge 1 commit into from

Conversation

arenger
Copy link

@arenger arenger commented Apr 25, 2019

Overview

The goal of this PR is to further fortify NiFi when working with large JSON files. As noted in the NiFi overview, systems will invariably receive "data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format." In the case of "too big", NiFi (or any JVM) can continue just fine and handle large files with ease if it does so in a streaming fashion, but the current JSON processors use a DOM approach that is limited by available heap space. This PR recommends the addition of a SelectJson processor that can be employed when large JSON files are expected or possible.

The current EvaluateJsonPath and SplitJson processors both leverage the Jayway JsonPath library. The Jayway implementation has excellent support for JSON Path expressions, but requires that the entire JSON file be loaded into memory. It builds a document object model (DOM) before evaluating the targeted JSON Path. This is already noted as a "System Resource Consideration" in the documentation for the SplitJson processor, and the same is true for EvaluateJsonPath.

The proposed SelectJson processor uses an alternate library called JsonSurfer to evaluate a JSON Path without loading the whole document into memory all at once, similar to SAX implementations for XML processing. This allows for near-constant memory usage, independent of file size, as shown in the following test results:

SelectJsonMemory

The trade-off is between heap space usage and JSON Path functionality. The SelectJson processor supports almost all of JSON Path, with a few limitations mentioned in the @CapabilityDescription. For full JSON Path support and/or multiple JSON Path expressions, EvaluateJsonPath and/or SplitJson processor should be used. When memory conservation is important, the SelectJson processor should be used.

Licensing

The JsonSurfer library is covered by the MIT License which is compatible with Apache 2.0.

Testing

This PR is a follow-on from #3414 in which I proposed a similar solution that required extenseive unit testing. Tests from that PR were adapted and preserved for this PR, even though many of them are testing the JsonSurf library. This is a much simpler PR since the path processing is handled in a third-party library.

As for the memory statistics noted above, they were gathered using the same methodology described in #3414. For posterity, here's a python script to generate JSON files of arbitrary size:

import uuid

(I, J, K) = (1, 8737, 3)
with open('out.json', 'w') as f:
    f.write("[")
    for i in range(0,I):
        f.write("[")
        for j in range(0,J):
            f.write("[")
            for k in range(0,K):
                f.write('"' + str(uuid.uuid4()) + '"');
                if (k < K - 1):
                    f.write(",")
            f.write("],\n" if j < J - 1 else "]\n")
        f.write("],\n" if i < I - 1 else "]\n")
    f.write("]\n")

How to use SelectJson Processor

Given an incoming FlowFile and a valid JSON Path setting, SelectJson will send one or more FlowFiles to the selected relation, and the original FlowFile will be sent to the original relation. If JSON Path did not match any object or array in the document, then the document will be passed to the failure relation.

JSON Path Examples

Here is a sample JSON file, followed by JSON Path expressions and the content of the FlowFiles that would be output from the SplitLargeJson processor.

Sample JSON:

[
  {
    "name": "Seattle",
    "weather": [
      {
        "main": "Snow",
        "description": "light snow"
      }
    ]
  },
  {
    "name": "Washington, DC",
    "weather": [
      {
        "main": "Mist",
        "description": "mist"
      },
      {
        "main": "Fog",
        "description": "fog"
      }
    ]
  }
]
  • JSON Path Expression: $[1].weather.*
    • FlowFile 0: {"main":"Mist","description":"mist"}
    • FlowFile 1: {"main":"Fog","description":"fog"}
  • JSON Path Expression: $[1].name
    • FlowFile 0: "Washington, DC"
  • JSON Path Expression: $[*]['weather'][*]['main']
    • FlowFile 0: "Snow"
    • FlowFile 1: "Mist"
    • FlowFile 2: "Fog"

Checklist

  • Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
    (Note: mvn clean install completes without error after disabling FileBasedClusterNodeFirewallTest and DBCPServiceTest.
    Adding -Pcontrib-check fails , but it appears to fail on master branch too)

  • Have you written or updated unit tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?

  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?

  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

  • Have you ensured that format looks appropriate for the output in which it is rendered?

See Also

SplitLargeJson: #3414
StreamingJsonReader: #3222

@arenger
Copy link
Author

arenger commented Apr 25, 2019

@ottobackwards This approach (SelectJson) is simpler than my original approach (SplitLargeJson -- #3414 ) and provides the same system resource advantages, with one exception: If the targeted JSON excerpts themselves are large, then I'm fairly certain the SplitLargeJson approach would be better. This is because the JsonSurfer library does not provide a means by which to supply an OutputStream to its bind signature... as far as I can see.

In other words, the required heap space is independent of incoming FlowFile size, for both SelectJson and SplitLargeJson, but if the outgoing FlowFile size is also large, then the SplitLargeJson approach is currently a better approach, from a system resource perspective.

@arenger
Copy link
Author

arenger commented Apr 25, 2019

Note that this PR is closely related to #3414. I don't intend for both PRs to be accepted. They are two different approaches to solve the same problem. Here are the high level questions/options that I can see:

  1. Create a new SplitJsonProcessor that uses javax.json (PR 3414)
  2. Create a new SelectJson that uses JsonSurfer (this PR)
  3. Keep only SplitJson and optionally employ a streaming approach, backed by javax.json, when a new property is set
  4. Keep only SplitJson and optionally employ a streaming approach, backed by JsonSurfer, when a new property is set

The later two options involve possible incompatibilities between the underlying libraries (javax.json and JsonSurfer) and the exiting processor behaviors. My vote after considering this for a while is to keep the EvaluateJsonPath processor, keep the SplitJson processor, and add the SelectJson processor.

@bbende
Copy link
Contributor

bbende commented Apr 26, 2019

These PR descriptions are amazingly helpful, thanks!

I haven't followed the other PR closely, so not sure if this was already discussed, but is there significant benefit to using SelectJson/SplitLargeJson vs. the record processors like QueryRecord/PartitionRecord/SplitRecord with a JsonTreeReader and JsonRecordWriter?

The original json processors like SplitJson and EvaluateJsonPath existed early on, way before the record concept, and seems like most of their purpose has been replaced by the record concept.

@ottobackwards
Copy link
Contributor

@bbende I am not familiar with the split/partition records. The issues I think in the past would have been not having a schema, and the jsonpath vs record query language functionality.

I know needing a schema is gone now, although I don't know if there is parity.

@ottobackwards
Copy link
Contributor

So, I don't think split record supports a query. Where as split json can reach into a json structure and find an internal array and split that etc etc

Copy link
Contributor

@ottobackwards ottobackwards left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice. Thanks for putting it together. I am wondering what happens if a not supported json path is entered? Would it fail validation?

@arenger
Copy link
Author

arenger commented May 21, 2019

@bbende Thank you! As for your comment: It could be that record processing is the "new normal", in which case #3222 is an important PR. In either case, however, as long as the EvaluateJsonPath and SplitJson exist in NiFi, I think there is argument to provide memory-efficient counterparts.

@ottobackwards I haven't written much because it seems we're still waiting for review by a different group of people? I would be happy to make identified changes if folks think that one of these two PRs (#3455 or #3414) have good potential to be approved. However, if neither PR looks very promising or likely to be accepted (that's OK of course, no problem!) then I don't want to spend any more time on this.

@arenger
Copy link
Author

arenger commented May 21, 2019

@ottobackwards If an invalid JSON Path is submitted, the SelectJson processor would be in an invalid state. This PR relies on the underlying JsonSurfer for validation of the JSON path and I don't know whether that library will filter out / report the few edge cases where a path is valid but not supported. I could look into that further depending on overall trajectory, as mentioned above and on the other PR.

@ottobackwards
Copy link
Contributor

@bbende what would you suggest to move these two PR's forward ( or one of them rather )? I think that on of them should land, but the decision between using the lib or hand rolled json path capability is something that I think needs more community involvement.

@bbende
Copy link
Contributor

bbende commented May 21, 2019

I think Matty B and Mark are the most involved with the record related processors, so they might be able to offer the best guidance. From a high-level, it seems like maybe we should have both approaches (this PR and Mike's streaming reader).

@arenger
Copy link
Author

arenger commented Sep 21, 2019

@ottobackwards and/or @bbende Do you know what my next step would be, to push this PR forward? bbende mentioned "Matty B" or "Mark". See also my comment from May 21st.

Thank you for any help or suggestions.

@ottobackwards
Copy link
Contributor

Besides de-conflicting, from my POV, I think that there needs to be some consensus on which of the approaches makes for sense for the project. I think that should come from the committers.

I personally would vote for this PR, using the surfer, as it has less maintenance burden.

@arenger
Copy link
Author

arenger commented Sep 23, 2019

@ottobackwards Yeah I agree that this PR would probably have less maintenance burden. That's why I closed PR #3414 a few months ago. I would be happy to merge or to close this PR. I'd prefer the former but just want to go one way or another.

In my opinion, both this PR and PR #3222 should be merged. This would allow for streaming json processing from either a record-based processing model or the "traditional"/original processing model. I think #3222 and this PR complement one another, but they're not dependent on each other. I did not submit PR #3222 and it is apparently closed. I submitted PR #3414 but I closed that one in favor of this one. Therefore, I think this PR should be merged and @MikeThomsen can decide separately whether he would like to resubmit his work.

@ottobackwards
Copy link
Contributor

+1 from me

@arenger arenger force-pushed the NIFI-5900-RC3 branch 2 times, most recently from e24c86f to 664ebaf Compare October 18, 2019 20:23
@ottobackwards
Copy link
Contributor

@arenger hope you are holding out patience wise on this. Hopefully some traction soon ( although it looks like it will be post release )

@arenger
Copy link
Author

arenger commented Oct 18, 2019

@ottobackwards Thanks Otto, yeah I look forward to hearing whether or not folks want to include this. If they do, it should be all ready to go.

@joewitt
Copy link
Contributor

joewitt commented Mar 25, 2021

closing due to inactivity but the effort/description/discussion are solid. i suspect we'll have to implement a stale PR bot soon and this would get swept. But great descriptions here. I really wonder if the record processors take care of this well enough now. Either way that is where the effort should go on a go forward basis for something like this.

@joewitt joewitt closed this Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants