gojsonlex is a drop-in replacement for the encoding/json lexer, optimised for efficiency. gojsonlex is 2-3 times faster than encoding/json and requires only enough memory to buffer the longest token in the input. Currently gojsonlex skips all delimiters (this behaviour will be changed).

https://pkg.go.dev/github.com/gibsn/gojsonlex
Let's consider a case where you want to parse the output of some tool that encodes binary data into one huge JSON dict:
{
    "bands": [
        {
            "name": "Metallica",
            "origin": "USA",
            "albums": [
                ...
            ]
        },
        ...
        {
            "name": "Enter Shikari",
            "origin": "England",
            "albums": [
                ...
            ]
        }
    ]
}
Let's say "albums" can be arbitrary long, the whole JSON is 10GB, but you actually want to print out all "origin" values and don't care about the rest. You do not want to decode the whole JSON into one struct (like most JSON parsers do) since it can be huge. Luckily in this case you do not actually need to parse any arbitrary JSON, you are ok with a more narrow grammar. A parser for such a grammar could look like this:
for {
	currToken, err := lexer.Token()
	if err != nil {
		// ...
	}

	switch state {
	case searchingForOriginKey:
		if currToken == "origin" {
			state = pendingOriginValue
		}
	case pendingOriginValue:
		fmt.Println(currToken)
		state = searchingForOriginKey
	}
}
Ok, so now you need a JSON lexer. Some lexers that I checked buffered a large portion of the input in order to parse a composite type (which is bad, since "albums" can be huge). The only lexer that did not require that much memory was the standard encoding/json one; however, it could be optimized to consume less CPU. That's how gojsonlex was born.

The example from the previous section could be implemented with gojsonlex like this:
l, err := gojsonlex.NewJSONLexer(r)
if err != nil {
	// ...
}

state := stateSearchingForOriginKey

for {
	currToken, err := l.Token()
	if err != nil {
		// ...
	}

	s, ok := currToken.(string)
	if !ok {
		continue
	}

	switch state {
	case stateSearchingForOriginKey:
		if s == "origin" {
			state = statePendingOriginValue
		}
	case statePendingOriginValue:
		fmt.Println(s)
		state = stateSearchingForOriginKey
	}
}
In order to maintain zero allocations, Token() always returns an unsafe string that is valid only until the next Token() call. You must make a deep copy (using StringDeepCopy()) of that string if you need it after the next Token() call.
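For instance, to collect all "origin" values into a slice instead of printing them, the loop from the example above could store copies (a sketch; it assumes StringDeepCopy is a package-level function that takes a string and returns a copy):

var origins []string

for {
	currToken, err := l.Token()
	if err != nil {
		// ...
	}

	s, ok := currToken.(string)
	if !ok {
		continue
	}

	switch state {
	case stateSearchingForOriginKey:
		if s == "origin" {
			state = statePendingOriginValue
		}
	case statePendingOriginValue:
		// s points into the lexer's internal buffer and will be invalidated
		// by the next Token() call, so store a deep copy.
		origins = append(origins, gojsonlex.StringDeepCopy(s))
		state = stateSearchingForOriginKey
	}
}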
Though gojsonlex.Token() is faster than that of encoding/json, it sacrifices some performance in order to match the standard interface. You may want to consider using TokenFast() to achieve the best performance (in exchange for more coding):
for {
	currToken, err := l.TokenFast()
	if err != nil {
		// ...
	}

	if currToken.Type() != gojsonlex.LexerTokenTypeString {
		continue
	}

	s := currToken.String()

	switch state {
	case stateSearchingForOriginKey:
		if s == "origin" {
			state = statePendingOriginValue
		}
	case statePendingOriginValue:
		fmt.Println(s)
		state = stateSearchingForOriginKey
	}
}
Please refer to the 'examples' directory for examples of gojsonlex usage. Run make examples to build all examples.

stdinparser is a simple utility that reads JSON from stdin and dumps JSON tokens to stdout.
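A minimal sketch of what such a utility could look like (assuming Token() reports io.EOF at end of input, as the encoding/json tokenizer does; the actual code in 'examples' may differ):

package main

import (
	"fmt"
	"io"
	"os"

	"github.com/gibsn/gojsonlex"
)

func main() {
	l, err := gojsonlex.NewJSONLexer(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	for {
		currToken, err := l.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}

		fmt.Println(currToken)
	}
}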
BenchmarkEncodingJSON-8 576 1973465 ns/op 432581 B/op 26706 allocs/op
BenchmarkJSONLexer-8 1212 959528 ns/op 99200 B/op 6300 allocs/op
BenchmarkJSONLexerFast-8 1532 771233 ns/op 0 B/op 0 allocs/op
In development