feat: add ast.Visitor for fully decoding JSON into custom generic data containers without using ast.Node #471

Merged
merged 11 commits into from
Jul 6, 2023

Conversation

zhongxinghong
Contributor

Background

In our business, we want to decode whole JSON documents into custom containers that are neither Go structs nor map[string]interface{}. They look like the following and cannot be replaced by other types:

type UserNode interface {}

// the following types implement the UserNode interface.
type (
    UserNull    struct{}
    UserBool    struct{ Value bool }
    UserInt64   struct{ Value int64 }
    UserFloat64 struct{ Value float64 }
    UserString  struct{ Value string }
    UserObject  struct{ Value map[string]UserNode }
    UserArray   struct{ Value []UserNode }
)

There are two ways to meet our needs if we use sonic:

  1. Use sonic.Unmarshal() to decode JSON into map[string]interface{} and then use a custom method to recursively convert interface{} to our UserNode types.
  2. Decode JSON into sonic/ast.Node types and then use another custom method to recursively convert ast.Node to our UserNode types.
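
As an illustration of the first approach, here is a minimal sketch of the recursive conversion step. It uses the standard library's encoding/json with UseNumber as a stand-in for sonic.Unmarshal(), and the helper names convert and decodeViaInterface are hypothetical; they are not part of sonic or of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type UserNode interface{}

type (
	UserNull    struct{}
	UserBool    struct{ Value bool }
	UserInt64   struct{ Value int64 }
	UserFloat64 struct{ Value float64 }
	UserString  struct{ Value string }
	UserObject  struct{ Value map[string]UserNode }
	UserArray   struct{ Value []UserNode }
)

// convert (hypothetical helper) recursively maps the generic
// interface{} tree produced by the decoder onto the UserNode types.
func convert(v interface{}) (UserNode, error) {
	switch x := v.(type) {
	case nil:
		return UserNull{}, nil
	case bool:
		return UserBool{Value: x}, nil
	case string:
		return UserString{Value: x}, nil
	case json.Number:
		// Prefer int64; fall back to float64 when the literal is not an integer.
		if i, err := x.Int64(); err == nil {
			return UserInt64{Value: i}, nil
		}
		f, err := x.Float64()
		if err != nil {
			return nil, err
		}
		return UserFloat64{Value: f}, nil
	case map[string]interface{}:
		obj := UserObject{Value: make(map[string]UserNode, len(x))}
		for k, e := range x {
			n, err := convert(e)
			if err != nil {
				return nil, err
			}
			obj.Value[k] = n
		}
		return obj, nil
	case []interface{}:
		arr := UserArray{Value: make([]UserNode, 0, len(x))}
		for _, e := range x {
			n, err := convert(e)
			if err != nil {
				return nil, err
			}
			arr.Value = append(arr.Value, n)
		}
		return arr, nil
	}
	return nil, fmt.Errorf("unexpected type %T", v)
}

// decodeViaInterface (hypothetical helper) decodes into interface{}
// first and then converts, so every value is materialized twice.
func decodeViaInterface(s string) (UserNode, error) {
	dec := json.NewDecoder(strings.NewReader(s))
	dec.UseNumber()
	var v interface{}
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}
	return convert(v)
}

func main() {
	n, err := decodeViaInterface(`{"a":[1,2.5,null,true,"x"]}`)
	fmt.Printf("%+v %v\n", n, err)
}
```

The intermediate interface{} tree is allocated only to be walked once and discarded, which is the overhead the rest of this description is concerned with.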

We have also written a custom JSON library that decodes JSON directly into our UserNode types without any IR (intermediate representation: interface{} / reflect.Value / ...). It is based on encoding/json with some optimizations borrowed from jsoniter-go, and it is even faster than the second approach above (see the benchmark in the appendix).

sonic/ast.Node is designed for processing partial JSON, and its APIs are similar to gjson/sjson. It is faster than using map[string]interface{} as the IR for decoding generic data. However, it is still an IR and it has a lazy-load design, which is useless in our case and costs performance. So we want to introduce a Node-free decoder to speed up this case.

Changes

  1. Introduce the ast.Visitor interface, which represents a handler for the preorder traversal of a JSON AST.
  2. Introduce the ast.Preorder() method, based on ast.Parser, to perform the preorder traversal.
  3. Add a demo Visitor implementation in ast/visitor_test.go for the unit tests and benchmarks.

// Visitor handles the callbacks during preorder traversal of a JSON AST.
//
// According to the JSON RFC8259, a JSON AST can be defined by
// the following rules without separator / whitespace tokens.
//
//  JSON-AST  = value
//  value     = false / null / true / object / array / number / string
//  object    = begin-object [ member *( member ) ] end-object
//  member    = string value
//  array     = begin-array [ value *( value ) ] end-array
//
type Visitor interface {

    // OnNull handles a JSON null value.
    OnNull() error

    // OnTrue handles a JSON true value.
    OnTrue() error

    // OnFalse handles a JSON false value.
    OnFalse() error

    // OnString handles a JSON string value.
    OnString(v string) error

    // OnNumber handles a JSON number value with its type after conversion.
    //
    // For a valid JSON, the v.Int64() method should parse the JSON number
    // correctly and return no error if isInt64 is true, otherwise
    // the v.Float64() method should work.
    OnNumber(v json.Number, isInt64 bool) error

    // OnObjectBegin handles the beginning of a JSON object value with a
    // suggested capacity that can be used to make your custom object container.
    //
    // After this point the visitor will receive a sequence of callbacks like
    // [string, value, string, value, ......, ObjectEnd].
    //
    // Notice that this is a recursive definition which means the value can
    // also be a JSON object / array described by a sequence of callbacks.
    OnObjectBegin(capacity int) error

    // OnObjectKey handles a JSON object key string in member.
    OnObjectKey(key string) error

    // OnObjectEnd handles the ending of a JSON object value.
    OnObjectEnd() error

    // OnArrayBegin handles the beginning of a JSON array value with a
    // suggested capacity that can be used to make your custom array container.
    //
    // After this point the visitor will receive a sequence of callbacks like
    // [value, value, value, ......, ArrayEnd].
    //
    // Notice that this is a recursive definition which means the value can
    // also be a JSON object / array described by a sequence of callbacks.
    OnArrayBegin(capacity int) error

    // OnArrayEnd handles the ending of a JSON array value.
    OnArrayEnd() error
}
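
To make the callback sequence concrete, here is a minimal sketch of a stack-based builder whose method set matches the Visitor interface above and which assembles the UserNode types from the Background section. The builder and frame types are hypothetical and are not the demo implementation in ast/visitor_test.go. The callbacks are invoked by hand in main so that the example does not depend on sonic; in the PR they would be driven by ast.Preorder(). The sketch assumes the suggested capacity is non-negative.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type UserNode interface{}

type (
	UserNull    struct{}
	UserBool    struct{ Value bool }
	UserInt64   struct{ Value int64 }
	UserFloat64 struct{ Value float64 }
	UserString  struct{ Value string }
	UserObject  struct{ Value map[string]UserNode }
	UserArray   struct{ Value []UserNode }
)

// frame (hypothetical) is one container that is still being filled.
type frame struct {
	obj map[string]UserNode // non-nil while building an object
	arr []UserNode          // used while building an array
	key string              // pending key for the next object member
}

// builder (hypothetical) keeps a stack of open containers.
type builder struct {
	stack []frame
	root  UserNode
}

// add attaches a finished value to the innermost open container,
// or makes it the root when no container is open.
func (b *builder) add(n UserNode) error {
	if len(b.stack) == 0 {
		b.root = n
		return nil
	}
	top := &b.stack[len(b.stack)-1]
	if top.obj != nil {
		top.obj[top.key] = n
	} else {
		top.arr = append(top.arr, n)
	}
	return nil
}

// pop closes the innermost container and attaches it to its parent.
func (b *builder) pop() error {
	if len(b.stack) == 0 {
		return errors.New("end callback without matching begin")
	}
	top := b.stack[len(b.stack)-1]
	b.stack = b.stack[:len(b.stack)-1]
	if top.obj != nil {
		return b.add(UserObject{Value: top.obj})
	}
	return b.add(UserArray{Value: top.arr})
}

func (b *builder) OnNull() error           { return b.add(UserNull{}) }
func (b *builder) OnTrue() error           { return b.add(UserBool{Value: true}) }
func (b *builder) OnFalse() error          { return b.add(UserBool{Value: false}) }
func (b *builder) OnString(v string) error { return b.add(UserString{Value: v}) }

func (b *builder) OnNumber(v json.Number, isInt64 bool) error {
	if isInt64 {
		i, err := v.Int64()
		if err != nil {
			return err
		}
		return b.add(UserInt64{Value: i})
	}
	f, err := v.Float64()
	if err != nil {
		return err
	}
	return b.add(UserFloat64{Value: f})
}

func (b *builder) OnObjectBegin(capacity int) error {
	b.stack = append(b.stack, frame{obj: make(map[string]UserNode, capacity)})
	return nil
}

func (b *builder) OnObjectKey(key string) error {
	if len(b.stack) == 0 {
		return errors.New("object key outside of an object")
	}
	b.stack[len(b.stack)-1].key = key
	return nil
}

func (b *builder) OnObjectEnd() error { return b.pop() }

func (b *builder) OnArrayBegin(capacity int) error {
	b.stack = append(b.stack, frame{arr: make([]UserNode, 0, capacity)})
	return nil
}

func (b *builder) OnArrayEnd() error { return b.pop() }

func main() {
	b := &builder{}
	// Callback sequence expected for {"a":[1,null]}, issued by hand.
	errs := []error{
		b.OnObjectBegin(1), b.OnObjectKey("a"), b.OnArrayBegin(2),
		b.OnNumber("1", true), b.OnNull(), b.OnArrayEnd(), b.OnObjectEnd(),
	}
	fmt.Printf("%+v %v\n", b.root, errs)
}
```

Each value is attached to its parent as soon as it is complete, so no intermediate tree is built besides the user's own containers.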

Discussions

  1. There is another way to expose the preorder traversal to the user program: a C++-style iterator, as simdjson does. We chose the visitor callback style instead, for these reasons:
    • An iterator has to save the decoder's context and switch to the user program after every Next() call, then restore that context to continue in the subsequent Next() call. This adds performance overhead.
    • Iterator-style decoder code does not look like the existing ast.Parser code, which adds maintenance cost.
  2. The Visitor interface is nearly impossible to change once released, which might cause compatibility problems in the future.
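
For contrast with the callback style chosen here, the pull (iterator) style from the first point can be sketched with the standard library's json.Decoder.Token(), which is itself a pull API. This is only an illustration of the style, not simdjson's or sonic's implementation, and the events function is a hypothetical name. The caller owns the loop, and the decoder has to keep its position and nesting state between calls.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// events (hypothetical helper) pulls tokens one at a time and
// returns a flat trace describing each of them.
func events(s string) ([]string, error) {
	dec := json.NewDecoder(strings.NewReader(s))
	dec.UseNumber()
	var out []string
	for {
		// Each Token() call resumes from the state saved by the previous one.
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case json.Delim:
			out = append(out, "delim "+t.String())
		case json.Number:
			out = append(out, "number "+t.String())
		case string:
			out = append(out, "string "+t)
		case bool:
			out = append(out, fmt.Sprintf("bool %t", t))
		case nil:
			out = append(out, "null")
		}
	}
}

func main() {
	ev, err := events(`{"a":[1,true,null]}`)
	fmt.Println(ev, err)
}
```

Note that Token() reports object keys and string values with the same type, so a real consumer also has to track whether it is inside an object and expecting a key, which the Visitor interface resolves with the separate OnObjectKey callback.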

Appendix

Go Version

$ go version
go version go1.18.10 darwin/amd64

Benchmark ast.Node vs ast.Visitor

$ cd ast/
$ go test . -run=none -benchmem -bench BenchmarkVisitor -benchtime=100000x
Begin GC looping...
goos: darwin
goarch: amd64
pkg: github.com/bytedance/sonic/ast
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkVisitor_UserNode/AST-8                   100000            314353 ns/op          41.43 MB/s       97272 B/op        540 allocs/op
BenchmarkVisitor_UserNode/Visitor-8               100000            268001 ns/op          48.59 MB/s       44273 B/op        405 allocs/op
PASS
ok      github.com/bytedance/sonic/ast  58.519s

Benchmark in our business (use our dataset)

goos: darwin
goarch: amd64
pkg: ******
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkDecode_Case01_Sonic
BenchmarkDecode_Case01_Sonic/sonic
BenchmarkDecode_Case01_Sonic/sonic-8                 200          46148747 ns/op           0.37 MB/s    34555524 B/op     448516 allocs/op
BenchmarkDecode_Case01_Sonic/sonic_ast_node
BenchmarkDecode_Case01_Sonic/sonic_ast_node-8                200          40603735 ns/op           0.42 MB/s    43147743 B/op     287744 allocs/op
BenchmarkDecode_Case01_Sonic/sonic_ast_visitor
BenchmarkDecode_Case01_Sonic/sonic_ast_visitor-8             200          26099894 ns/op           0.65 MB/s    16027529 B/op     229515 allocs/op
BenchmarkDecode_Case01
BenchmarkDecode_Case01/ejson
BenchmarkDecode_Case01/ejson-8                               200          69715883 ns/op           0.24 MB/s    43855800 B/op     679455 allocs/op
BenchmarkDecode_Case01/ijson
BenchmarkDecode_Case01/ijson-8                               200          63586936 ns/op           0.27 MB/s    38632656 B/op     884192 allocs/op
BenchmarkDecode_Case01/tjson
BenchmarkDecode_Case01/tjson-8                               200          31347045 ns/op           0.54 MB/s    20710638 B/op     471284 allocs/op
PASS
  • sonic: string --(sonic.UnmarshalFromString)--> map[string]interface{} --(convert)--> UserNode
  • sonic_ast_node: string --(sonic.GetFromString / node.LoadAll)--> ast.Node --(convert)--> UserNode
  • sonic_ast_visitor: string --(sonic/ast.Preorder)--> UserNode
  • ejson: []byte --(encoding/json.NewDecoder, UseNumber)--> map[string]interface{} --(convert)--> UserNode
  • ijson: string --(jsoniter-go.NewDecoder, UseNumber)--> map[string]interface{} --(convert)--> UserNode
  • tjson: []byte --(our custom JSON library)--> UserNode
NOTE:
  • In sonic_ast_visitor, we ignore the capacity passed to Visitor.OnObjectBegin / Visitor.OnArrayBegin and use capacity = 0. This differs from BenchmarkVisitor_UserNode/Visitor.

@zhongxinghong
Contributor Author

I will polish the README after code review :)

@liuq19
Collaborator

liuq19 commented Jun 28, 2023

Thanks, we will review it later

@codecov-commenter

codecov-commenter commented Jun 30, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@6108485). Click here to learn what that means.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #471   +/-   ##
=======================================
  Coverage        ?   77.92%           
=======================================
  Files           ?       63           
  Lines           ?    10565           
  Branches        ?        0           
=======================================
  Hits            ?     8233           
  Misses          ?     1971           
  Partials        ?      361           

Collaborator

@AsterDY AsterDY left a comment

looks great!

AsterDY previously approved these changes Jul 2, 2023
ast/visitor.go (resolved)
liuq19 previously approved these changes Jul 3, 2023
ast/visitor.go (outdated, resolved)
README.md (outdated, resolved)
AsterDY previously approved these changes Jul 3, 2023
AsterDY previously approved these changes Jul 5, 2023
@AsterDY AsterDY enabled auto-merge (squash) July 6, 2023 07:37
@AsterDY AsterDY merged commit 78fd91e into bytedance:main Jul 6, 2023
18 checks passed
@zhongxinghong zhongxinghong deleted the feat/ast_visitor branch July 16, 2023 06:39
kodiakhq bot pushed a commit to cloudquery/plugin-pb-go that referenced this pull request Oct 1, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [github.com/bytedance/sonic](https://togithub.com/bytedance/sonic) | indirect | patch | `v1.10.0-rc` -> `v1.10.1` |

---

### Release Notes

<details>
<summary>bytedance/sonic (github.com/bytedance/sonic)</summary>

### [`v1.10.1`](https://togithub.com/bytedance/sonic/releases/tag/v1.10.1)

[Compare Source](https://togithub.com/bytedance/sonic/compare/v1.10.0...v1.10.1)

#### Feature

-   \[[#511](https://togithub.com/bytedance/sonic/issues/511)] (ast) support sort keys on non-object node
-   \[[#527](https://togithub.com/bytedance/sonic/issues/527)] (encoder) Add `NoValidateJSONMarshaler` option

#### Bugfix

-   \[[#504](https://togithub.com/bytedance/sonic/issues/504)] (ast) check error before `Set/Unset/Add()`
-   \[[#520](https://togithub.com/bytedance/sonic/issues/520)] (native) over boundary bugs of skip number and tolower in native c

#### New Contributors

-   [@jimyag](https://togithub.com/jimyag) made their first contribution in bytedance/sonic#501
-   [@hitzhangjie](https://togithub.com/hitzhangjie) made their first contribution in bytedance/sonic#505
-   [@xiezheng-XD](https://togithub.com/xiezheng-XD) made their first contribution in bytedance/sonic#516
-   [@andeya](https://togithub.com/andeya) made their first contribution in bytedance/sonic#527

**Full Changelog**: bytedance/sonic@v1.10.0...v1.10.1

### [`v1.10.0`](https://togithub.com/bytedance/sonic/releases/tag/v1.10.0)

[Compare Source](https://togithub.com/bytedance/sonic/compare/v1.10.0-rc3...v1.10.0)

#### Feature

-   \[[#493](https://togithub.com/bytedance/sonic/issues/493)] **support Go1.21.0**
-   \[[#471](https://togithub.com/bytedance/sonic/issues/471)] (ast) add `ast.Visitor` for traversing JSON in-place
-   \[[#470](https://togithub.com/bytedance/sonic/issues/470)] add `Valid()` API

#### Bugfix

-   \[[#486](https://togithub.com/bytedance/sonic/issues/486)] possible overflowed instruction while handling `byte` type
-   \[[#484](https://togithub.com/bytedance/sonic/issues/484)] (decoder) avoid scratched memory of returned error
-   \[[#496](https://togithub.com/bytedance/sonic/issues/496)] (ast) Exist() didn't check Valid() first
-   \[[#498](https://togithub.com/bytedance/sonic/issues/498)] (ast) drop ast.Node API `UnsafeArray()` and `UnsafeMap()` (**Break Change**)

#### Optimization

-   \[[#393](https://togithub.com/bytedance/sonic/issues/393)] **refactor `asm2asm` to avoid `SIGPROF` crashing, and enable traceback when C function panics**
-   \[[#464](https://togithub.com/bytedance/sonic/issues/464)] (ast) use linked chunk as fundamental storage for nodes to keep node pointer valid (**Break Change**)
-   \[[#464](https://togithub.com/bytedance/sonic/issues/464)] (ast) avoid malloc when meeting empty values, and inline header chunk into lazy-parsing stack to reduce malloc. **The performance of `Parse()\Load()\Interface()` promoted 10~60%**
-   \[[#475](https://togithub.com/bytedance/sonic/issues/475)] (ast) pass `skipnumber` flag to avoid decoding numbers

#### New Contributors

-   [@xumingyukou](https://togithub.com/xumingyukou) made their first contribution in bytedance/sonic#447
-   [@zhongxinghong](https://togithub.com/zhongxinghong) made their first contribution in bytedance/sonic#471

**Full Changelog**: bytedance/sonic@v1.9.2...v1.10.0

### [`v1.10.0-rc3`](https://togithub.com/bytedance/sonic/compare/v1.10.0-rc2...v1.10.0-rc3)

[Compare Source](https://togithub.com/bytedance/sonic/compare/v1.10.0-rc2...v1.10.0-rc3)

### [`v1.10.0-rc2`](https://togithub.com/bytedance/sonic/releases/tag/v1.10.0-rc2)

[Compare Source](https://togithub.com/bytedance/sonic/compare/v1.10.0-rc...v1.10.0-rc2)

#### Optimization

-   \[[#475](https://togithub.com/bytedance/sonic/issues/475)] (ast) pass `skipnumber` flag to avoid decoding numbers
-   \[[#483](https://togithub.com/bytedance/sonic/issues/483)] update base64x to finish asm2asm refactor

#### Feature

-   \[[#471](https://togithub.com/bytedance/sonic/issues/471)] (ast) add `ast.Visitor` for iterating JSON into custom generic data containers in-place

#### New Contributors

-   [@zhongxinghong](https://togithub.com/zhongxinghong) made their first contribution in bytedance/sonic#471

**Full Changelog**: bytedance/sonic@v1.10.0-rc...v1.10.0-rc2

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "before 4am on the first day of the month" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

This PR has been generated by [Renovate Bot](https://togithub.com/renovatebot/renovate).