Json trees #3773

Open
wants to merge 13 commits into
base: dev
Conversation

parrt
Member

@parrt parrt commented Jul 4, 2022

See #3772

Signed-off-by: Terence Parr <parrt@antlr.org>
@parrt parrt changed the base branch from master to dev July 4, 2022 19:50
@parrt parrt added this to the 4.10.2 milestone Jul 4, 2022
parrt added 2 commits July 4, 2022 12:59
Signed-off-by: Terence Parr <parrt@antlr.org>
Signed-off-by: Terence Parr <parrt@antlr.org>
parrt added 4 commits July 4, 2022 13:51
Signed-off-by: Terence Parr <parrt@antlr.org>
Signed-off-by: Terence Parr <parrt@antlr.org>
Signed-off-by: Terence Parr <parrt@antlr.org>
Signed-off-by: Terence Parr <parrt@antlr.org>
@HSorensen
Contributor

Are there any tree walkers or visitors that can utilize the JSON parse trees?

@KvanTTT
Member

KvanTTT commented Jul 5, 2022

I think it's up to the runtime.

@parrt
Member Author

parrt commented Jul 5, 2022

> Are there any tree walkers or visitors that can utilize the JSON parse trees?

Any target language that knows how to read JSON should be able to pull these in and walk the trees recursively. I will have to build one in JavaScript, as I'm trying to build a server/client web page that communicates using this format.
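
A minimal sketch of such a recursive walk, in Python. The JSON shape used here (`rule`/`children` for rule nodes, `token`/`text` for leaves) is an assumption made for illustration; the field names the serializer in this PR actually emits may differ.

```python
import json

# Hypothetical payload: the real field names produced by the serializer may differ.
payload = """
{"rule": "expr", "children": [
    {"token": "INT", "text": "1"},
    {"token": "PLUS", "text": "+"},
    {"rule": "expr", "children": [{"token": "INT", "text": "2"}]}
]}
"""

def walk(node, depth=0):
    """Yield (depth, label) pairs in pre-order."""
    if "rule" in node:
        yield depth, node["rule"]
        for child in node.get("children", []):
            yield from walk(child, depth + 1)
    else:
        yield depth, f'{node["token"]}:{node["text"]}'

tree = json.loads(payload)
for depth, label in walk(tree):
    print("  " * depth + label)
```

Nothing here depends on the ANTLR runtime; any language with a JSON reader can do the same with plain dictionaries and lists.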

parrt added 2 commits July 5, 2022 10:51
Signed-off-by: Terence Parr <parrt@antlr.org>
Signed-off-by: Terence Parr <parrt@antlr.org>
@HSorensen
Contributor

> Any Target language that knows how to read json,

As you of course already know, using either the tree walker or visitor patterns is just so much more efficient.

@parrt
Member Author

parrt commented Jul 5, 2022

@HSorensen yep, what I meant was that somebody will have to deserialize the JSON into a proper parse tree, and then the usual visitor and listener patterns will work great. This is only for sending stuff across a wire. If it's in memory, this is all unnecessary.
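
As a rough illustration of that two-step idea, the sketch below first rebuilds node objects from JSON and then runs a visitor over them. The JSON field names and the `RuleNode`, `TerminalNode`, `Visitor`, and `SumInts` classes are hypothetical stand-ins, not the ANTLR runtime classes or the format produced by this PR.

```python
import json
from dataclasses import dataclass, field

@dataclass
class TerminalNode:
    token_type: str
    text: str

@dataclass
class RuleNode:
    rule: str
    children: list = field(default_factory=list)

def from_json(obj):
    """Rebuild node objects from the (assumed) JSON shape."""
    if "rule" in obj:
        return RuleNode(obj["rule"], [from_json(c) for c in obj.get("children", [])])
    return TerminalNode(obj["token"], obj["text"])

class Visitor:
    """Dispatch to visit_<rule> if defined, otherwise visit the children."""
    def visit(self, node):
        if isinstance(node, TerminalNode):
            return self.visit_terminal(node)
        method = getattr(self, "visit_" + node.rule, self.visit_children)
        return method(node)

    def visit_children(self, node):
        result = None
        for child in node.children:
            result = self.visit(child)
        return result

    def visit_terminal(self, node):
        return None

class SumInts(Visitor):
    """Adds up every INT token found under 'expr' nodes."""
    def visit_expr(self, node):
        return sum(self.visit(child) or 0 for child in node.children)

    def visit_terminal(self, node):
        return int(node.text) if node.token_type == "INT" else None

wire = ('{"rule": "expr", "children": [{"token": "INT", "text": "1"}, '
        '{"token": "PLUS", "text": "+"}, '
        '{"rule": "expr", "children": [{"token": "INT", "text": "2"}]}]}')
root = from_json(json.loads(wire))
total = SumInts().visit(root)
print(total)
```

Once the objects exist on the receiving side, the visitor code no longer cares that the tree arrived as JSON.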

parrt added 3 commits July 5, 2022 11:42
Signed-off-by: Terence Parr <parrt@antlr.org>
Signed-off-by: Terence Parr <parrt@antlr.org>
Signed-off-by: Terence Parr <parrt@antlr.org>
@parrt
Member Author

parrt commented Jul 5, 2022

@KvanTTT looking better, right?

Signed-off-by: Terence Parr <parrt@antlr.org>
@KvanTTT
Member

KvanTTT commented Jul 5, 2022

Yes, a separate class looks better.

@parrt
Member Author

parrt commented Jul 5, 2022

Added sample output and python parsing of json here: #3772

@JamesRTaylor
Contributor

This is really good stuff. Any work on the deserialization side? Do you think that's a bigger task?

@parrt
Member Author

parrt commented Dec 10, 2022

> Any work on the deserialization side? Do you think that's a bigger task?

Hi. Haven't done any work on deserialization. Sorry.

@parrt
Member Author

parrt commented Dec 10, 2022

I have a much better serialization implementation in antlr4-lab: https://github.com/antlr/antlr4-lab/blob/master/src/org/antlr/v4/server/JsonSerializer.java . I hope to eventually fold this back into Antlr.

@kaby76
Contributor

kaby76 commented Dec 10, 2022

BTW, I've spent probably two or three years going through different implementations for the parse tree representation and serialization. After working on tree rewriting problems, I've come to the conclusion that the Antlr tree/tokenstream/charstream/interval implementation is definitely not the best representation for tree rewriting, especially if there are hundreds of edits to do: keeping it all consistent is very time-consuming and very tedious. I've settled on a tree decorated with text and attribute nodes for tokens and for skip and off-channel text. Plus, it is more easily adapted to XPath and XSLT engines.
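
To make the idea concrete, here is a small sketch using Python's `xml.etree.ElementTree`. The layout is an assumption for illustration only (token text stored as element text, skipped whitespace stored in a `ws` attribute) and is not the actual representation described in the comment above.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: rule nodes are elements, token text is element text,
# and skipped whitespace before a token is kept in a "ws" attribute.
root = ET.Element("stat")
for name, text, ws in [("ID", "x", ""), ("EQ", "=", " "), ("INT", "1", " "), ("SEMI", ";", "")]:
    tok = ET.SubElement(root, "token", {"type": name, "ws": ws})
    tok.text = text

def to_source(elem):
    """Reassemble the source text, including the off-channel whitespace."""
    out = elem.get("ws", "") + (elem.text or "")
    return out + "".join(to_source(child) for child in elem)

# XPath-style query (ElementTree supports a limited XPath subset).
int_tok = root.find("./token[@type='INT']")
int_tok.text = "42"  # edit in place; there are no token indexes to patch up
root.remove(root.find("./token[@type='SEMI']"))
print(to_source(root))
```

Because each node carries its own text, an edit or deletion touches only that node, and the source can be regenerated by a plain traversal.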

@parrt
Member Author

parrt commented Dec 11, 2022 via email

@JamesRTaylor
Contributor

JamesRTaylor commented Dec 13, 2022

My use case was fast deserialization of a parse tree with encoded data that is as compact as possible. This is using Go. The end goal was to make deserialization significantly faster than re-parsing the original string. I was able to reduce the encoded data to about 80% of the size of the string and the decode time to about 35% of the parse time. In the end, the speedup over re-parsing (kudos to the parser!) wasn't large enough to justify the extra code and the limitations imposed on grammar writing (see below). It did show some promise, though.

The approach I took was to:

  1. Serialize the parse tree (inspired by the serialization code) as a combination of token and rule indexes (not exactly a rule index as I needed to handle grammar # tags too).
  2. Generate code to enable deserialization by introspecting the visitor. The input was the original string (since in our use case this was always going to be persisted and available) and the output was the result of walking the visitor.
  • Re-tokenize the original string
  • Deserialize the parse tree (calling generated code using rule index)
  • Re-walk the visitor to produce the domain objects
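
The encoding and decoding steps above might look roughly like the following sketch. It is written in Python rather than Go, and everything in it is an assumption for illustration: the `RULES` table, the toy regex lexer, the tuple-based tree, and the integer layout (0 for a token leaf, otherwise rule index + 1 followed by a child count) are not the format actually used.

```python
import re

RULES = ["expr", "atom"]              # hypothetical rule names, indexed by position
TOKEN_RE = re.compile(r"\d+|[+*()]")  # toy lexer standing in for the real one

def encode(node, out):
    """Pre-order encoding: 0 for a token leaf, else rule index + 1 and child count."""
    if isinstance(node, str):
        out.append(0)
    else:
        rule, children = node
        out.extend((RULES.index(rule) + 1, len(children)))
        for child in children:
            encode(child, out)
    return out

def decode(codes, tokens):
    """Rebuild the tree from the codes plus the re-tokenized source."""
    codes, tokens = iter(codes), iter(tokens)

    def build():
        code = next(codes)
        if code == 0:
            return next(tokens)  # leaves come back from the token stream, in order
        count = next(codes)
        return (RULES[code - 1], [build() for _ in range(count)])

    return build()

source = "1+2"
tree = ("expr", [("atom", ["1"]), "+", ("atom", ["2"])])
codes = encode(tree, [])
restored = decode(codes, TOKEN_RE.findall(source))
print(codes, restored == tree)
```

Token text never appears in the encoded data, which is why the original string has to be available and re-tokenized at decode time.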

The one limitation I had was that grammar variables were problematic in that I had no good way to re-establish their state. I could have serialized and deserialized them, but that would have bloated the encoded data pretty significantly (though of course that depends on how your grammar was written). I chose to just not use grammar variables in my tests.

@kaby76
Contributor

kaby76 commented Dec 13, 2022

Thanks for the info.

The problem I'm working on at the moment is the scrape and conversion of the grammar for Python3, in Pegen syntax, to Antlr4. The parse of the Python3 grammar in Pegen syntax takes ~2.6s on a speedy machine. It takes so long because the rules in a Pegen grammar do not have a rule terminator (e.g., the ';' at the end of a rule in Antlr4 grammars). Serialization of the parse tree, parser, and lexer tables takes ~0.03s, deserialization ~0.05s. The parse tree itself was changed to not use tokenstream/charstream/indices, but instead to decorate the parse tree with text and attribute nodes for default channel and off-channel tokens and character strings. This representation allows for much faster tree node edits. In fact, the parse takes more time than deserialization, serialization, and the deletion and insertion of the hundreds of nodes involved in converting the grammar to Antlr4 syntax.

5 participants