Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support reading from/writing to JSON-P parser/generator #224

Closed
emattheis opened this issue Feb 19, 2020 · 30 comments
Closed

support reading from/writing to JSON-P parser/generator #224

emattheis opened this issue Feb 19, 2020 · 30 comments
Labels
duplicate This issue or pull request already exists

Comments

@emattheis
Copy link

For efficient handling of large data sets, it is desirable to mix the JSON-P and JSON-B APIs in the same way that one can use the StAX and JAXB APIs for XML processing. If the Jsonb interface was extended with fromJson and toJson overloads for JsonParser and JsonGenerator, it would be possible to selectively deserialize/serialize JSON from/to arbitrary points in a larger stream.

It looks like Yasson already implements this functionality, and it would be great to see it become part of the formal spec.

@rmannibucau
Copy link

Hi @emattheis,

Both Yasson and Johnzon - https://github.com/apache/johnzon/blob/master/johnzon-jsonb/src/main/java/org/apache/johnzon/jsonb/api/experimental/JsonbExtension.java - can map a JsonValue to a POJO (and opposite) through JSON-B implementations.
Once made standard then JsonParser will enable to get a JsonValue and then JSON-B will enable to map it so guess we can rather standardize JsonValue <-> POJO mapping than the opposize which requires to define that the parser is already partially consumed which is not that good in terms of API since a jsonb.fromJson(Json.createParser(...), ...) will not work cause it is not at the same consumption state than a nested one.

wdyt?

@emattheis
Copy link
Author

@rmannibucau For parsing use cases, getting JsonValue first means you will have to eagerly deserialize the entire node, so that's not ideal. Using the POJO mapping directly against the JsonParser would allow for skipping unmapped parts of the node, or using a custom deserializer to limit the amount of data consumed.

@rmannibucau
Copy link

rmannibucau commented Feb 19, 2020

Yes and no. Yes it is an option but no cause nothing enforces it - jsonb is by design in mem + node would fit in mem anyway otherwise even the mapping of the pojo wouldnt fit.
So at the end the case where pojo mem is less than jsonobject one is not that important. What this api enables is only streaming objects and releasing their mem early (like a file of 10000000 records of some kilobytes).
Any other optimization can require a custom usage of jsonp only and I think it stays sane this way otherwise you would break jsonb deserializers foe instance which exactly enable to handle unbounded fields for examples.

@emattheis
Copy link
Author

@rmannibucau of course, the result of the JSON-B parsing is in memory, but the source certainly is not required to be - nor should it be! One of the great benefits of JSON-B is that one can selectively map the JSON structure into an object graph as opposed to the JSON-P approach where your only option for structured data is to deserialize the entire target node into a JsonStructure.

The use case where POJO mem is less than JsonObject is precisely what I'm interested in solving. It is common, in my experience, to have results from an JSON API that contain more data than I need and often unbounded structures that I cannot safely consume. A practical example is a search result. The ability to traverse a structure with JSON-P and selectively consume nodes using JSON-B while avoiding unmapped structures is useful.

Several popular JSON frameworks (including the JSON-B reference implementation) already include this functionality. Excluding it from the formal specification is a barrier to adoption for those of us who already use this pattern.

@rmannibucau
Copy link

Well, my point is that you can already do it through multiple fashion:

  1. Using a jsonb instance as a nested jsonparser handler
  2. Using a deserializer
  3. Using a custom parser impl passed through the jsonbbuilder jsonprovider wither
    (Probaly more)

Personally i use 3 since a few years and it is quite trivial.

So the new proposed api looks like breaking the loose coupling between jsonb parser - or generator - handling rather than a feature. IMHO the link should stay the loaded objects for the mapping+jsonp provider for the parser/reader/generator/writer look ups but nothing more otherwise you start to enforce the impl with no real feature gain.

Finally also note that as mentionned, the gain is just in one case if not done through 3, and nlt in cases you use deserializers or any way to forward unmapped values to the mapped bean, which is another common case.

So to summarize, this feature is already there and would break a bit more things that it would bring IMHO.

@emattheis
Copy link
Author

emattheis commented Feb 20, 2020

@rmannibucau can you elaborate or point me to examples of the existing solutions you propose? I'm totally onboard with using existing solutions within JSON-P and JSON-B, I just don't see how you can actually do what I need.

Here's an example of what I'm trying to accomplish https://gist.github.com/emattheis/20910c7034753e443852d924289a8954. Obviously in a real-world scenario the JSON source would be coming from an actual streaming source and have many more records. The key to the pattern is being able to process the overall document in a streaming fashion and still be able to pull out elements of the array as POJOs without pulling the JSON representation into memory.

As I was throwing together the gist, I can see more clearly what you are saying about positioning the JsonParser correctly. Ideally, the parser would provide a way to peek at the upcoming token so you don't have to rely on exception handling to break the loop.

@leadpony
Copy link

Hello everyone
Jackson's famous ObjectMapper has the methods in its public API.
Please see the Javadoc https://fasterxml.github.io/jackson-databind/javadoc/2.10/com/fasterxml/jackson/databind/ObjectMapper.html#readValue-com.fasterxml.jackson.core.JsonParser-java.lang.Class-

@rmannibucau
Copy link

Hello,
Jackson also exposes almost all its internals and doesnt decouple at its stack much in terms of API so not sure we should use it as a reference.

@emattheis Im a bit short this week (changing of job) but give a try to create a JsonProvider using a delegate pattern on JsonProvider.provider() which is able to wrap JsonParser creations (using delegate pattern again). It is enough to filter nodes you dont want either in blacklist or whitelist mode enabling both use cases we talked about (yours which sounds like a whitelist and mine which is more a blacklist to keep setUnknownProperty working as for openapi).

@emattheis
Copy link
Author

While I agree that Jackson is not a great reference for the standard in general due to its sprawling feature set, it does serve as a good example for this use case. In fact, Jackson is the de facto standard JSON library in many code bases and it would be beneficial to depend on a lighter API. In fact, interoperability between streaming and POJO mapping is a common theme among popular JSON libraries, so I still feel it deserves attention in the JSON-B and JSON-P specs.

@rmannibucau If I understand correctly, you're suggesting wrapping a JsonParser in order to filter out or explicitly allow certain parts of a JSON document during parsing a JsonStructure which could then be used in a future version of Jsonb to obtain a mapped POJO. This is of course possible, but still relies on non-standard behavior today. It also necessitates the creation of unnecessary intermediate JsonStructure objects. Ultimately, the resultant code is nowhere near as flexible as a solution that can simply rely on the the filtering inherent to the JSON-B mapping model. Considering that other JSON libraries can already do exactly what I want, the trade off in complexity for the sake of standardization is not reasonable for non-trivial use cases. Maintaining a comprehensive white list of acceptable nodes alongside the POJO mapping is brittle at best.

@rmannibucau
Copy link

rmannibucau commented Feb 21, 2020

@emattheis it is at parser level more than JsonStructure so you dont have unecessary allocations. AIso note in real life it is quite trivial to impl cause mapping is known and very worse case requires a linkedlist handling around the parser. Finally, just a deserializer already does exactly that even if not in jsonb.

Edit: also note that the generator side does not bring anything so API wouldn't be as symmetric sa others

@emattheis
Copy link
Author

it is at parser level more than JsonStructure so you dont have unecessary allocations.

My goal is to produce a POJO. You seem to be suggesting that I rely on the forthcoming support for producing a POJO from a JsonStructure, in which case the JsonStructure is an intermediate representation that is thrown away.

AIso note in real life it is quite trivial to impl cause mapping is known and very worse case requires a linkedlist handling around the parser.

It is certainly easy enough when you are simply trying to ignore/include top-level nodes from a structure, but I would argue that is is non-trivial when you wish to extract nodes at arbitrary depths within a nested structure.

Finally, just a deserializer already does exactly that even if not in jsonb.

I'm not sure what you mean by not in jsonb, but the fact that DeserializationContext does what I need is precisely the point of raising this issue. It is obviously possible to produce a stream of mapped POJOs from an arbitrary JsonParser, and JSON-B already has exactly what I need to select arbitrary values from a JSON representation without holding the entire representation in memory at once, but there is no specified way to do those two things together in JSON-B.

@rmannibucau if your position is that this is not a feature that should be in JSON-B, that's fine, but if you are suggesting it can already be done with the existing spec, please provide a concrete example when you have a chance.

This is a solved problem in the broader Java/JSON space. This issue is about making it part of the JSON-B standard.

@rmannibucau
Copy link

Did you try using JsonbConfig#withDeserializers?

@emattheis
Copy link
Author

How do you suppose that will help solve my use-case when there is no way to pass a JsonParser to Json::fromJson? If I had a deserializer implementation that did everything I need, I could use it directly. The point here is to leverage the JSON-B POJO mapping facilities.

I suppose another option to my proposal would be to expose serializers/deserializers for a given type and let callers use them directly, but that seems silly to me.

@rmannibucau
Copy link

If will be used when pass an inputstream or writer to jsonb which will then create a parser from it. Exactly what you ask (you dont care of the parser actually in your use case).

@emattheis
Copy link
Author

Please provide a concrete example. The gist I provided is a trivialized example, but it is not clear how you are suggesting to solve it. What type would I associate with the deserializer and use in the call to Jsonb::fromJson?

Bear in mind I do not want to consume the stream in a single pass, I want to extract POJOs in a streaming fashion in general. My trivial example is NOT the only use case this issue aims to address. The assumption is that the caller does have a need to directly control the JsonParser and is already doing so. Suppose, for example, you want to extract all the error fields from any location in an arbitrary JSON document.

@rmannibucau
Copy link

This is exactly what deserializers are about. Registered in the config they are automatically used and gives you all parser to pojo mapping features.

Some ex: https://github.com/apache/johnzon/blob/10ab78ea9f5e42bcc99b69f7962355fda6c42ee9/johnzon-jsonb/src/test/java/org/apache/johnzon/jsonb/SerializerTest.java#L194

@emattheis
Copy link
Author

emattheis commented Feb 21, 2020

@rmannibucau I understand how serializers and deserializers work. Obviously it is trivial to obtain an entire collection at once. That is not the use case I am trying to solve. If you look at the pattern in the gist I linked to, you will see what I'm trying to accomplish. The example you link to is not applicable as it simply consumes everything into a List.

The key issues with this approach are:

  1. the entire List is in memory
  2. fromJson will not return until the entire list has been consumed

The use cases I need to solve involved collections that cannot be assumed to fit in memory and the need to process elements as they arrive - e.g. a continuous stream of JSON messages.

@rmannibucau
Copy link

No:

  1. If you register a deserializer for subobjects + one for the list not materializing it (just delegating for each not) it is not the case. You can zvzn lzt the wrapper materialized while the big list is not which is simple in terms of api cause it just requires annotations to register deserializers. Note that for this point the api you proposed does not help much I think.
  2. It is the case anyway, even with your api proposal. While there is a stream the minimum for the framework is to consume it except if you fake the parser so until you do 1 which is a jsonb visitor pattern or the jsonprovider advanced technique then it will not work.

FYI it had been tested with data until 10G (files) forwarded through a websocket (nested of nested object as message) in a jvm having 4g of heap space so pretty sure it works already ;).

@emattheis
Copy link
Author

You can zvzn lzt the wrapper materialized while the big list is not which is simple in terms of api cause it just requires annotations to register deserializers.

I'm sorry, but I do not understand what you are saying here. A code sample would be extremely helpful.

  1. It is the case anyway, even with your api proposal.

This is only true if you are asking Jsonb to deserialize the entire list, which I am specifically aiming to avoid (again, please refer to the gist I provided).

FYI it had been tested with data until 10G (files) forwarded through a websocket (nested of nested object as message) in a jvm having 4g of heap space so pretty sure it works already ;).

Please show the code that achieves this.

@rmannibucau
Copy link

@emattheis not sure I got 2 right but there are two options to deserialize, either align on the incoming stream or align on the model. Both are good and wrong depending the input/output (if the payload has 10 attributes and the mapping 2 the mapping is better but if it is the opposite it would be better to reverse the pattern. JSON-B can't know that so this will stay an implementation detail so it can end up as needing to provide a custom filtering JsonProvider to achieve portably and reliably what you ask). That said here is how to already use a parser+jsonb mapping today:

@Test
public void noListMateralisation() throws Exception {
    final int max = 100;
    final InputStream longStream = new InputStream() {
        private final Function<String, String> animalFactory = name -> "{\"name\":\"" + name + "\"}";
        private InputStream delegate = new ByteArrayInputStream("{\"animals\":[".getBytes(StandardCharsets.UTF_8));
        private int animalsCount = max;

        @Override
        public int read() throws IOException {
            final int read = delegate.read();
            if (read < 0) {
                if (animalsCount > 0) {
                    animalsCount--;
                    delegate = new ByteArrayInputStream((
                            animalFactory.apply("animal_" + animalsCount) +
                                    (animalsCount == 0 ? "" : ","))
                            .getBytes(StandardCharsets.UTF_8));
                    return read();
                } else if (animalsCount == 0) { // close the root object
                    animalsCount--;
                    delegate = new ByteArrayInputStream("]}".getBytes(StandardCharsets.UTF_8));
                    return read();
                }
            }
            return read;
        }
    };

    final Collection<Animal> seenAnimals = new ArrayList<>();
    try (final InputStream stream = longStream;
         final Jsonb jsonb = JsonbBuilder.create()) {
        // using CDI avoids this threadlocal
        AnimalListDeserializerProcessor.fakeAccumulatorForTest.set(seenAnimals);
        try {
            final Root root = jsonb.fromJson(stream, Root.class);
            assertNull(root.animals);
        } finally {
            AnimalListDeserializerProcessor.fakeAccumulatorForTest.remove();
        }
        assertEquals(max, seenAnimals.size());
    }
}

public static class AnimalListDeserializerProcessor implements JsonbDeserializer<List<Animal>> {
    private static final ThreadLocal<Collection<Animal>> fakeAccumulatorForTest = new ThreadLocal<>();

    @Override
    public List<Animal> deserialize(final JsonParser parser, final DeserializationContext ctx,
                                    final Type rtType) {
        if (parser.next() != JsonParser.Event.START_ARRAY) {
            throw new IllegalArgumentException("expected [");
        }
        while (parser.next() != JsonParser.Event.END_ARRAY) {
            // we accumulate in a list outside deserialization process
            // but it but in practise we would process it there or async
            final Animal animal = ctx.deserialize(Animal.class, parser);
            fakeAccumulatorForTest.get().add(animal);
        }
        return null;
    }
}

public static class Root {
    @JsonbTypeDeserializer(AnimalListDeserializerProcessor.class)
    public List<Animal> animals;
}

@emattheis
Copy link
Author

Okay, I think I understand what you're getting at, and I created a revised gist exploiting a deserializer as you suggest: https://gist.github.com/emattheis/6f92ab0d04584273411a7330e32c0ea8. So I must concede that it is possible to achieve streaming POJO mapping with JSON-P and JSON-B with the current APIs.

That being said, I think this illustrates the point of this issue quite clearly. Hacking around the intended use of JsonbDeserializer simply to gain access to the underlying DeserializationContext is silly. Using a ThreadLocal to share state with with a deserializer that doesn't actually return a value is certainly inventive, but results in code with poor maintainability that is clearly not using JSON-B as intended.

I think we'll need to agree to disagree on this and see if this issue attracts some other opinions. Maybe @m0mus or @njr-11 can weigh in?

@rmannibucau
Copy link

You dont need a threadlocal if the instance is passed through the jsonbconfig or just use cdi. I used it in the standalone sample.

Also note it is not a workaround but the intended usage of the deserializer - if you dont want that control level you use an adapter. So this is the built in usage.

Now you are right jsonb is about in mem usage, not streming by design - even with your api - so maybe you should use javax streaming api directly and map it as intendee with subevents? Sounds cleaner to me whatever api is in jsonb.

@emattheis
Copy link
Author

Also note it is not a workaround but the intended usage of the deserializer - if you dont want that control level you use an adapter. So this is the built in usage.

Producing a stream of POJOs as a side effect, and then returning null is surely not the intended usage of the deserializer.

so maybe you should use javax streaming api directly and map it as intendee with subevents? Sounds cleaner to me whatever api is in jsonb.

This is exactly why I raised this issue! There is no good reason why future versions of JSON-P and JSON-B couldn't work together with less friction - as demonstrated by the many existing JSON libraries that already do this, including the JSON-B reference implementation.

I am not looking for a novel approach to meet my requirements. I already have a working solution in place using Jackson, but I would prefer to be able to rely on standard APIs.

@rmannibucau
Copy link

Fact is you have tons of options already.
My opinion is using the JsonObject to POJO is likely the sanest and will not corrupt any of the spec.
Real alternative solution is to create a jakarta.json.bind.streaming kind of API but what it would bring sounds very small regarding the investment.
Concretely, the overhead of going through a JsonObject will likely be negligible due to JsonParser nature/(required)impl and just requires yasson and johnzon to agree on method signatures (JsonStructure in yasson if Im not mistaken vs JsonValue in johnzon) to map a json value to a pojo.
Assuming this method is spec-ed, can you give a try in your project to validates it matches?
I can surely help in a few days if needed (still off computer for some days).

@emattheis
Copy link
Author

My opinion is using the JsonObject to POJO is likely the sanest and will not corrupt any of the spec.

Per my many comments above, using JsonObject as an intermediate step is not ideal and precludes the use of JSON-B POJO mapping to filter the JSON input. Perhaps we can focus on your concern of corrupting the spec. My view is that the JSON-P and JSON-B specs can be made to work together fairly seamlessly through this proposal and a corresponding proposal I made for JSON-P

Concretely, the overhead of going through a JsonObject will likely be negligible due to JsonParser nature/(required)impl and just requires yasson and johnzon to agree on method signatures (JsonStructure in yasson if Im not mistaken vs JsonValue in johnzon) to map a json value to a pojo.

I think it's a stretch to assume that such overhead will be negligible. Certainly the intermediate object overhead is not a major concern assuming the JsonStructure is adequately pruned, but the need to reinvent such pruning is the real issue here.

Assuming this method is spec-ed, can you give a try in your project to validates it matches?

No, because it means I would need to add significant code to my project to filter arbitrary JSON structures to match my intended POJO model. This functionality is provided by Jackson today, and COULD be easily provided by JSON-B.

I can surely help in a few days if needed (still off computer for some days).

With due respect, I am not looking for help solving my application challenges: they are ALREADY solved by Jackson. I would like to be able to use JSON-B instead, but not at the expense of the maintainability of my codebase.

Let's just let this issue rest until some more voices can be heard.

@rmannibucau
Copy link

My view is that the JSON-P and JSON-B specs can be made to work together fairly seamlessly through this proposal and a corresponding proposal I made for JSON-P

Your JSON-B proposal does not solve your need and assumes you already handle the parsing yourself, in a real code it would not be as smooth as you expect I fear since you still need to drop the undesired (big) elements.

Certainly the intermediate object overhead is not a major concern assuming the JsonStructure is adequately pruned, but the need to reinvent such pruning is the real issue here.

Exactly what does not work with your current proposal IMHO. I assume it works if you prehandle the parser state which also means - since JSON ordering is rarely guaranteed - that you would need to wrap the parser as I originally proposed to ensure you skip the key/values you don't want. JSON-B can't help much there.

No, because it means I would need to add significant code to my project to filter arbitrary JSON structures to match my intended POJO model.

Hmm, we are speaking of 1 line, not sure what you have in mind but reading it I fear you just want to migrate some jackson code without trying alternatives.

I would like to be able to use JSON-B instead, but not at the expense of the maintainability of my codebase.

Honestly I used all the proposed solutions in real-time and batch apps without mem issues. Code maintenance is a no-issue - ok I use lombok so any delegate pattern is smooth but even without it is code without any cleverness so maintenance cost is 0. So not sure how to interpret that except that you should probably not migrate the app? 🤔

@emattheis
Copy link
Author

There is clearly a communication issue here. I am not proposing anything other than a feature that already exists in all popular JSON libraries (including the JSON-B reference implementation). Namely, the ability to use a streaming parser and delegate to a POJO mapper. In real code it is precisely as smooth as I think. The streaming part obviously must handle walking the parsing tree to the correct location, but thereafter it can delegate to the POJO mapper and get all the benefits of that library to further extract exactly as much of the tree as desired. I'm not asking for JSON-B to do anything more than it already does with respect to deserializing POJOs, I simply want it to accept an arbitrary JsonParser as its entry point.

@rmannibucau
Copy link

rmannibucau commented Feb 24, 2020

It does not enable to not materialize the skipped entries, jsonp does not enable that today so you can still consume way too much memory. This is partly why this proposal is closer to a workaround to me.

Edit: created two related issues in jsonp, let see how it is received but think it is core to this issue and that jsonb must rely on jsonp before all to solve it properly.
Hope it makes sense.

@emattheis
Copy link
Author

Any efficiency shortcomings in JsonParser are ancillary to this issue. It is still desirable to have a way to pass a JsonParser to Jsonb for mapping POJOs from an already managed stream.

If JsonParser had a way to indicate its current event. Then Jsonb#fromJson(JsonParser, Class) could behave exactly like DeserializationContext#deserialize(Class, JsonParser). As you have argued in this thread, this functionality already exists under the covers. Raising it up to the Jsonb interface makes it far more convenient and doesn't force users to subvert the deserializer mechanism to achieve their goals.

@aguibert
Copy link
Contributor

aguibert commented Mar 9, 2020

Thanks for raising this ehnacement request @emattheis, but I'm going to close it as a duplicate of #122 which covers the same functionality you are requesting. (#122 is also on the roadmap for the next version of JSON-B)

@aguibert aguibert closed this as completed Mar 9, 2020
@aguibert aguibert added the duplicate This issue or pull request already exists label Mar 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicate This issue or pull request already exists
Projects
None yet
Development

No branches or pull requests

4 participants